"Science, Faculty of"@en . "Statistics, Department of"@en . "DSpace"@en . "UBCV"@en . "Kondo, Yumi"@en . "2011-09-01T20:59:55Z"@en . "2011"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "Searching a dataset for the \u00E2\u0080\u0098\u00E2\u0080\u0098natural grouping / clustering\u00E2\u0080\u0099\u00E2\u0080\u0099 is an important explanatory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might afraid that the dataset might contain potential outliers. \nThrough simulation studies, we \u00EF\u00AC\u0081nd that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robusti\u00EF\u00AC\u0081ed sparse K-means (Witten and Tibshirani [28]), based on the idea of trimmed K-means introduced by Gordaliza [7] and Gordaliza [8]. Since high dimensional datasets often contain quite a few missing observations, we made our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that robust sparse K-means performs better than other competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that robust sparse K-means best re\u00EF\u00AC\u0082ects the oestrogen receptor status of the patients among all other competing algorithms. We also \nadapt Clest (Duboit and Fridlyand [5]) to our robust sparse K-means to provide an automatic robust procedure of selecting the number of clusters. Our proposed methods are implemented in the R package RSKC."@en . "https://circle.library.ubc.ca/rest/handle/2429/37093?expand=metadata"@en . "Robustification of the sparse K-means clustering algorithm by Yumi Kondo B.S. Economics, American University, 2009 B.A. Business Administration, Ritsumeikan University, 2009 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Statistics) The University Of British Columbia (Vancouver) August 2011 c\u00C2\u00A9 Yumi Kondo, 2011 Abstract Searching a dataset for the \u00E2\u0080\u0098\u00E2\u0080\u0098natural grouping / clustering\u00E2\u0080\u0099\u00E2\u0080\u0099 is an important explana- tory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might afraid that the dataset might contain potential outliers. Through simulation studies, we find that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustified sparse K-means (Witten and Tibshirani [28]), based on the idea of trimmed K- means introduced by Gordaliza [7] and Gordaliza [8]. Since high dimensional datasets often contain quite a few missing observations, we made our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. 
The simulation studies show that robust sparse K-means performs better than competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that robust sparse K-means best reflects the oestrogen receptor status of the patients among all competing algorithms. We also adapt Clest (Dudoit and Fridlyand [5]) to our robust sparse K-means to provide an automatic robust procedure for selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Clustering
2 Literature review of clustering methods
  2.1 Introduction
  2.2 K-means
    2.2.1 The method
    2.2.2 The standard computing algorithm (Lloyd's algorithm) and Hartigan and Wong's algorithm
    2.2.3 Measurements of the agreement between two partitions
    2.2.4 Small simulation studies of K-means
  2.3 Sparse K-means
    2.3.1 K-means as a maximization problem
    2.3.2 The additivity of dissimilarities in each feature
    2.3.3 The sparse K-means clustering algorithm
    2.3.4 The L1 tuning parameter values and the sparsity of features
    2.3.5 A small simulation study of sparse K-means with datasets generated from the noisy model
    2.3.6 Simulation studies of sparse K-means with contaminated datasets
    2.3.7 Simulation studies of sparse 3-means with contaminated datasets and fixed L1 tuning parameter values
  2.4 Trimmed K-means
    2.4.1 Trimmed K-means computing algorithm
3 Robust sparse K-means clustering
  3.1 The method
    3.1.1 Robust sparse K-means algorithm
    3.1.2 The choice of l1
  3.2 Simulation studies of robust sparse 3-means with contaminated datasets
  3.3 Simulation studies to compare the performance of K-means, sparse K-means, trimmed K-means and robust sparse K-means
  3.4 Robust sparse K-means with missing values
4 Selection of K
  4.1 Brief literature review
  4.2 The gap statistic
    4.2.1 Two methods for generating the reference distributions of the gap statistic
  4.3 Clest
    4.3.1 The idea
    4.3.2 The idea of the Clest algorithm
    4.3.3 The algorithm of Clest with robust sparse K-means
  4.4 Simulation studies with Clest
5 Data analysis
  5.1 Before moving to the data analysis
    5.1.1 The dissimilarity measure based on Pearson's correlation
    5.1.2 The revised silhouette
  5.2 The dataset of van't Veer et al. [27]
    5.2.1 Selection of K
    5.2.2 The comparison of performances
6 Conclusion
Bibliography
A Technical Appendix
  A.1 The algorithm of Hartigan and Wong
  A.2 The difference between Silhouette and the revised silhouette

List of Tables

Table 2.1  The pair types and the corresponding values of the indicator functions $I_{\mathcal{C}}(i,i')$ and $I_{\tilde{\mathcal{C}}}(i,i')$.
Table 2.2  The simulation results of 3-means on 100 datasets generated from the clean model.
Table 2.3  The simulation results of 3-means on 100 datasets generated from the noisy model.
Table 2.4  The simulation results of sparse 3-means on 100 datasets generated from the noisy model.
Table 2.5  The simulation results of sparse 3-means with the L1 tuning parameter values chosen by the permutation approach.
Table 2.6  The simulation results of sparse 3-means with fixed L1 tuning parameter values.
Table 3.1  The results of models 1 to 3 with robust sparse 3-means and a fixed l1 constraint.
Table 3.2  The number of times the single outlier is captured in $O_W$ and $O_E$ over 100 datasets generated from Model 1.
Table 3.3  The results from robust sparse 3-means with the fixed L1 value on datasets generated from models 4 to 8.
Table 4.1  The trimming proportion $\alpha$ for the simulation studies of Clest.
Table 4.2  The conservative trimming proportion values for the simulation studies of Clest.
Table 5.1  The result from sparse 2-means. The $*$ indicates partitions that agree with each other.
Table 5.2  The result from robust sparse 2-means. The $*$ indicates partitions that agree with each other.

List of Figures

Figure 1.1  The simulated dataset consists of 2 clustering features.
Figure 1.2  The simulated dataset consists of 2 clustering features and 1000 noise features.
Figure 1.3  The simulated dataset consists of 2 clustering features, 1000 noise features and a single outlier.
Figure 1.4  The simulated dataset consists of two clustering features, 1000 noise features and more than 10 outliers.
Figure 2.1  Scatter plots with and without normalization of datasets.
Figure 2.2  Regions of the L1 and L2 constraints in $\mathbb{R}^2$.
Figure 3.1  The distributions of weights from robust sparse K-means on a dataset simulated from Model 2. The x-axis shows the feature indices and the y-axis shows the weights.
Figure 3.2  The weights from robust sparse K-means on a dataset generated from Model 2 versus the between cluster sum of squares of the jth feature.
Figure 3.3  The simulation results with $\mu = 1.0$. The y-axis shows the CERs. K = K-means, SK = sparse K-means, TK = trimmed K-means, and RSK = robust sparse K-means.
Figure 3.4  The simulation results with $\mu = 1.0$. The y-axis shows the median proportions of weights on the clustering features.
Figure 3.5  The simulation results with $\mu = 0.8$. The y-axis shows the CERs.
Figure 3.6  The simulation results with $\mu = 0.8$. The y-axis shows the median proportions of weights on the clustering features.
Figure 4.1  The results of Clest on datasets generated from 8 models with $\mu = 2$. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model.
Figure 4.2  The results of Clest on datasets generated from 8 models with $\mu = 2$. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model. We use the conservative trimming proportion values.
Figure 4.3  The results of Clest on datasets generated from 8 models with $\mu = 1$. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model.
Figure 4.4  The results of Clest on datasets generated from 8 models with $\mu = 1$. The significance level $\beta$ is set to 1. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model.
Figure 5.1  Solid circles represent significant p-values. Clest chooses K = 2 as a solution for all pairs of $\alpha$ and $l_1$ values.
Figure 5.2  The call rate plot of the 4 algorithms. The results of sparse 2-means and robust sparse 2-means with $\alpha = 1/78$, $5/78$ and $10/78$. The results of the sparse methods with $l_1 = 6$ are shown.
Figure 5.3  The silhouettes of the partition returned by robust sparse 2-means clustering. The L1 tuning parameter value is 6.
Figure 5.4  The weighted squared Euclidean distances between all cases and their cluster centers from the result of robust sparse 2-means with $l_1 = 6$.
Figure 5.5  The weighted squared Euclidean distances between the cases and their cluster centers for the result of sparse 2-means with $l_1 = 6$.
Figure 5.6  The nonzero weights from robust sparse 2-means for selected $l_1$ values. The three rows show the results of robust sparse 2-means with $\alpha = 1/78$, $5/78$ and $10/78$, respectively.
Figure 5.7  The nonzero weights from sparse 2-means for selected $l_1$ values.
Figure A.1  An example of a negative Silhouette.

Acknowledgments

I would like to express my gratitude to my supervisors, Professor Matias Salibian-Barrera and Professor Ruben H. Zamar. Your patient teaching, encouragement and guidance were absolutely necessary to complete this thesis. I am very thankful that you let me think about problems thoroughly and taught me how to approach them. You made Statistics very interesting to me. I also express my gratitude to Professor John Petkau for being the second reader of this thesis.

I would like to thank every single friend of mine in the UBC Statistics department. The discussions with you in the LSK office late at night inspired me and enriched my understanding of Statistics. Without all of you, I could not have enjoyed the process of learning this much, nor could I have pushed myself to keep learning. I will never forget these intense two years at UBC, and I am proud of completing my MSc in Statistics with you.

Chapter 1

Introduction

1.1 Clustering

Searching for "natural" groups of objects is an important exploratory technique for understanding complex data. The origins of clustering can be traced back to taxonomy, where it is necessary that different people assign similar objects to the same group. Clusterings or groupings were traditionally done by taxonomists who picked important grouping variables based on their rich knowledge of species. Today, the principal function of clustering is to name, display, summarize and elicit an explanation for the resulting partitions of the dataset (Hartigan [10]).

In Statistics and Computer Science, clustering means automatic (computer-aided) grouping of similar cases based on some dissimilarity measure. Clustering is sometimes referred to as "numerical taxonomy".

For example, a DNA microarray is a type of dataset to which clustering algorithms are applied. A microarray is a rectangular array of N rows, one for each case (e.g. a patient or a tumor), and p columns, one for each feature (e.g. genes, SNPs). A reliable, precise and robust classification of tumors is essential for successful diagnosis of cancer. In a clinical application of microarray-based cancer diagnostics, the definition of new tumor classes can be based on the partitions produced by clustering. The clusters would then be used to build predictors for new tumor samples (Dudoit and Fridlyand [5]).

Current applications of clustering algorithms often include a large number of features, and visualizing such datasets is difficult. Typically, only a relatively small number of features are important to determine the class memberships of the cases. If thousands of potential clustering features must be considered, the traditional taxonomists' approach of hand-picking important features becomes difficult and impractical. Instead, we need a method that automatically chooses the important clustering variables. Furthermore, large datasets may contain outliers, which are defined as cases that do not belong to any of the given clusters.
Then we may wish to use an algorithm that identifies the important features and the outliers together with the clusters.

As a motivating example, we generate 100 independent cases from a bivariate normal distribution. Fifty cases are from a bivariate normal distribution with mean $(0,0)^T$ and the remaining 50 cases have mean $(3,3)^T$. In this thesis, the features whose means vary over clusters are called "clustering features"; non-clustering features are called "noise features". The first panel of Figure 1.1 shows the distribution of the generated sample marked by the true cluster labels. From the second panel, we see that 2-means successfully identifies the two clusters.

Figure 1.1: The simulated dataset consists of 2 clustering features. (a) True cluster labels. (b) The partition from 2-means.

In the first and second panels of Figure 1.2, we show the clustering result when 1000 noise features are added to the original bivariate data. The noise features are all independent standard normal variables. In this new setting, 2-means fails to identify the two clusters (first panel in Figure 1.2). Sparse 2-means, however, automatically identifies the two clustering features and manages to distinguish the given clusters (second panel in Figure 1.2).

Figure 1.2: The simulated dataset consists of 2 clustering features and 1000 noise features. (a) The partition from 2-means. (b) The partition from sparse 2-means.

Now, we examine the effect of adding some outliers to this dataset. A single extreme value is added to clustering feature 1. Figure 1.3 shows the partition results: sparse 2-means fails to identify the two clusters. Sparse 2-means returns a cluster with the single outlier and a cluster with all the rest of the cases. However, our proposed robust sparse 2-means automatically finds the outlier and correctly identifies the two clusters.

Figure 1.3: The simulated dataset consists of 2 clustering features, 1000 noise features and a single outlier. (a) The true cluster labels. (b) The partition from sparse 2-means. (c) The partition from robust sparse 2-means.

As another motivating example, 10 outliers are added to clustering features 1 and 2. The first panel of Figure 1.4 shows the true cluster labels with respect to the marginal bivariate distribution of the two clustering features. Outliers are marked with the symbol "+". The second panel shows the partition result of sparse 2-means. Again, sparse 2-means returns a cluster which contains only the outliers. The third panel shows the partition result of robust sparse 2-means. It successfully captures all the outliers and identifies the two true clusters.

Figure 1.4: The simulated dataset consists of two clustering features, 1000 noise features and more than 10 outliers. (a) The true cluster labels. (b) The partition from sparse 2-means. (c) The partition from robust sparse 2-means.
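To make the motivating setup concrete, the following is a minimal R sketch of the noise-feature example (two clustering features plus 1000 standard normal noise features). It assumes the sparcl package is installed for sparse 2-means; the tuning value wbounds = 6 is an arbitrary illustrative choice, not the value used in the thesis.

```r
library(sparcl)   # provides KMeansSparseCluster (Witten and Tibshirani)
set.seed(1)

# Two clustering features: 50 cases around (0,0), 50 cases around (3,3)
clust <- cbind(rnorm(100, mean = rep(c(0, 3), each = 50)),
               rnorm(100, mean = rep(c(0, 3), each = 50)))
noise <- matrix(rnorm(100 * 1000), 100, 1000)   # 1000 noise features
X     <- cbind(clust, noise)
truth <- rep(1:2, each = 50)

km  <- kmeans(X, centers = 2, nstart = 20)
table(truth, km$cluster)                        # 2-means struggles here (cf. Figure 1.2)

skm <- KMeansSparseCluster(X, K = 2, wbounds = 6)[[1]]
table(truth, skm$Cs)                            # sparse 2-means recovers the partition
which(skm$ws > 0)                               # features receiving nonzero weight
```

The outlier variants of this example (Figures 1.3 and 1.4), and the behaviour of our robust sparse 2-means on them, are revisited in Chapters 3 and 5.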
The rest of this thesis is organized as follows. In Chapter 2, we review some existing clustering methods: K-means, sparse K-means and trimmed K-means are discussed there. In Chapter 3, robust sparse K-means is proposed. The performance of the proposed algorithm is assessed using several simulated datasets. Chapter 4 discusses methods to select the appropriate number of clusters. Chapter 5 contains the application of the robust sparse K-means method to a microarray dataset. The conclusion is in Chapter 6.

Chapter 2

Literature review of clustering methods

2.1 Introduction

In this chapter we review three clustering methods which provide the foundation for our proposed robust sparse procedure: K-means, sparse K-means and trimmed K-means.

Let $X \in \mathbb{R}^{N \times p}$ denote the data matrix with N cases and p features, let $x_i \in \mathbb{R}^p$ denote the ith row of $X$ ($i = 1, \ldots, N$), and let $X_j \in \mathbb{R}^N$ denote the jth column of $X$ ($j = 1, \ldots, p$). That is:

$$X = [X_1 \cdots X_p] = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{Np} \end{bmatrix}.$$

A cluster can be described as a group of "similar" cases. The cases within a cluster should be more similar to each other than to cases from other clusters. It is immediately clear that partitions depend on the definition of "dissimilarity". Seber [21] defines a function $d: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}_+$ to be a dissimilarity if $d(a,a) = 0$, $d(a,b) \ge 0$ and $d(a,b) = d(b,a)$ for all $a, b \in \mathbb{R}^p$; it measures how different $a$ and $b$ are. Note that all distance measures are dissimilarity measures. The most intuitive choice for a dissimilarity measure is the Euclidean distance or the squared Euclidean distance. The Euclidean distance between two cases is the length of the line segment connecting them. The Euclidean distance between vectors $x_i$ and $x_{i'}$ is defined as:

$$||x_i - x_{i'}|| = \sqrt{(x_i - x_{i'})^T (x_i - x_{i'})} = \sqrt{(x_{i1} - x_{i'1})^2 + \cdots + (x_{ip} - x_{i'p})^2} = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2}. \quad (2.1)$$

This distance takes into account the differences between $x_i$ and $x_{i'}$ directly, based on the magnitude of the coordinates of the difference vector. The squared Euclidean distance is simply $||x_i - x_{i'}||^2$. The squared Euclidean distance is additive in the features; this additivity will play a central role for sparse K-means clustering (see Section 2.3). For this reason, we mostly employ the squared Euclidean distance as the dissimilarity measure in this thesis.

The squared Euclidean distance between two cases assigns equal weights to all the squared differences along the features:

$$||x_i - x_{i'}||^2 = 1 \cdot (x_{i1} - x_{i'1})^2 + \cdots + 1 \cdot (x_{ip} - x_{i'p})^2.$$

It implicitly assumes that the units of all the features are the same. That is, without appropriate normalization of a dataset, the variable with the largest scale may dominate the distance.
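To make the scale issue concrete before defining the normalization, here is a small R sketch (our own illustration, using the same marginal scales as the Figure 2.1 example below: a clustering feature with variance 1 and a noise feature with variance 40).

```r
set.seed(1)
# 60 cases: the clustering feature separates two groups, the noise feature does not
X <- cbind(clustering = rnorm(60, mean = rep(c(0, 3), each = 30), sd = 1),
           noise      = rnorm(60, sd = sqrt(40)))

# Average per-feature contribution to the squared Euclidean distance between
# cases from different clusters (case i paired with case i + 30)
contrib <- function(Z) colMeans((Z[1:30, ] - Z[31:60, ])^2)
contrib(X)          # the large-scale noise feature dominates the raw distances
contrib(scale(X))   # after column-wise standardization the clustering feature dominates
```

On the raw data the noise feature contributes most of the squared distance even though it carries no cluster information; standardizing the columns, as defined next, removes this artifact.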
Let $X^* = [X^*_1 \cdots X^*_p] = [x^*_{ij}] \in \mathbb{R}^{N \times p}$ denote a non-normalized dataset. From here on, let $X \in \mathbb{R}^{N \times p}$ denote the normalized dataset:

$$x_{ij} = \frac{x^*_{ij} - \bar{x}^*_j}{sd_j}, \quad (2.2)$$

where

$$\bar{x}^*_j = \frac{\sum_{i=1}^{N} x^*_{ij}}{N}, \qquad sd_j = \sqrt{\frac{\sum_{i=1}^{N} (x^*_{ij} - \bar{x}^*_j)^2}{N-1}}.$$

Then clearly, all the features of $X$ have unit sample standard deviation.

Figure 2.1: Scatter plots with and without normalization of datasets (x-axis: clustering feature; y-axis: noise feature).

One can see the necessity of normalizing the dataset from a simple simulated example in Figure 2.1. It shows 60 cases from two clusters, generated from two bivariate normal distributions. The two clusters are separated only by a clustering feature. The marginal variance of the clustering feature is 1 while the marginal variance of the noise feature is set to 40. The left panel shows the non-normalized raw dataset and the right panel shows its normalized version. Since the scale of the noise feature is $\sqrt{40}$ times larger, the two clusters are almost hidden in the left panel. After the normalization, however, one can see the two clusters more clearly.

For these reasons, normalization is critical for the squared Euclidean dissimilarity measure. Throughout this thesis, we assume that the dataset $X$ has already been normalized.

2.2 K-means

The term "K-means" was first used by MacQueen [16] in 1967, but the idea goes back to Steinhaus [23] in 1956.

2.2.1 The method

K-means partitions the N cases into a predetermined number, K, of clusters ($1 \le K \le N$). The goal of K-means is to find the partition of the cases which minimizes the sum of the dissimilarities between cases within each cluster. Let $C_k$ denote the set of the case indices in the kth cluster and set $|C_k| = n_k$, where $|A|$ is the number of elements in a set $A$. Then $\mathcal{C} = \{C_1, \ldots, C_K\}$ is a partition of the dataset into K clusters. For any partition $\mathcal{C}$ and dissimilarity $d$, we can define the overall within cluster dissimilarity:

$$\mathrm{OWCD}(\mathcal{C}) = \sum_{k=1}^{K} \sum_{i,i' \in C_k} d(x_i, x_{i'}).$$

An issue may arise if one looks for the partition $\mathcal{C}$ that minimizes $\mathrm{OWCD}(\mathcal{C})$: partitions $\mathcal{C}$ with equal cluster sizes, $n_1 = \cdots = n_K$, may be favored even if there is a clear cluster structure with unequal cluster sizes. To see this, note that if the $\tilde{i}$th case is clustered in $C_{\tilde{k}}$, then there are $2 n_{\tilde{k}}$ distance components in the sum $\mathrm{OWCD}(\mathcal{C})$ associated with the $\tilde{i}$th case. This is because each distance is counted twice and the distance of the case to itself, $d(x_{\tilde{i}}, x_{\tilde{i}})$, is also counted.
Suppose there is another cluster $C_{k'}$ whose size is considerably smaller than that of $C_{\tilde{k}}$: $n_{\tilde{k}} \gg n_{k'}$. Then, even if the cases in $C_{k'}$ are more dissimilar to the $\tilde{i}$th case than the cases in $C_{\tilde{k}}$, the partition that minimizes $\mathrm{OWCD}(\mathcal{C})$ may prefer to have the $\tilde{i}$th case clustered in $C_{k'}$, because doing so decreases the number of distance components in $\mathrm{OWCD}(\mathcal{C})$. In order to address this issue, we can define the within cluster dissimilarity as:

$$\mathrm{WCD}(\mathcal{C}) = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} d(x_i, x_{i'}).$$

The goal of K-means is to find a partition $\mathcal{C}^*$ that minimizes $\mathrm{WCD}(\mathcal{C})$. That is:

$$\mathcal{C}^* = \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{WCD}(\mathcal{C}). \quad (2.3)$$

When the dissimilarity measure is the squared Euclidean distance, $\mathrm{WCD}(\mathcal{C})$ is called the within cluster sum of squares:

$$\mathrm{WSS}(\mathcal{C}) = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} ||x_i - x_{i'}||^2.$$

Interestingly, $\mathrm{WSS}(\mathcal{C})$ can be rewritten in terms of the distances to the kth cluster center if one defines the center of the kth cluster as the average vector over the cases within the kth cluster:

$$\bar{x}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i.$$

$\mathrm{WSS}(\mathcal{C})$ can be expressed in terms of the cluster centers as follows:

$$\begin{aligned}
\mathrm{WSS}(\mathcal{C}) &= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} ||x_i - x_{i'}||^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} ||x_i - \bar{x}_k + \bar{x}_k - x_{i'}||^2 \\
&= \sum_{k=1}^{K} \frac{1}{2 n_k} \left( n_k \sum_{i \in C_k} ||x_i - \bar{x}_k||^2 + 2 \sum_{i \in C_k} (x_i - \bar{x}_k)^T \sum_{i' \in C_k} (\bar{x}_k - x_{i'}) + n_k \sum_{i' \in C_k} ||\bar{x}_k - x_{i'}||^2 \right) \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} ||x_i - \bar{x}_k||^2, \quad (2.4)
\end{aligned}$$

where the cross term vanishes because $\sum_{i' \in C_k} (\bar{x}_k - x_{i'}) = 0$. Then K-means with the squared Euclidean distance solves the minimization problem:

$$\min_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{WSS}(\mathcal{C}) = \min_{\mathcal{C};\, |\mathcal{C}| = K} \sum_{k=1}^{K} \sum_{i \in C_k} ||x_i - \bar{x}_k||^2. \quad (2.5)$$

Use of the expression in (2.5) in terms of the cluster means facilitates the computation. This is one of the reasons why the squared Euclidean distance is a popular choice for the dissimilarity measure.
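The identity (2.4) between the pairwise form and the cluster-center form of WSS(C) is easy to check numerically; the sketch below (our own helper functions, written in base R) computes WSS(C) both ways for an arbitrary partition.

```r
# WSS(C) from pairwise squared distances, as in the definition above
wss_pairwise <- function(X, cl) {
  D <- as.matrix(dist(X))^2
  sum(sapply(unique(cl), function(k) sum(D[cl == k, cl == k]) / (2 * sum(cl == k))))
}

# WSS(C) from the cluster centers, as in (2.4)
wss_centers <- function(X, cl) {
  sum(sapply(unique(cl), function(k) {
    Xk <- X[cl == k, , drop = FALSE]
    sum(sweep(Xk, 2, colMeans(Xk))^2)
  }))
}

set.seed(1)
X  <- matrix(rnorm(30 * 4), 30, 4)
cl <- sample(1:3, 30, replace = TRUE)
all.equal(wss_pairwise(X, cl), wss_centers(X, cl))   # TRUE: the two forms agree
```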
2.2.2 The standard computing algorithm (Lloyd's algorithm) and Hartigan and Wong's algorithm

No analytic solution is available for (2.3). One might calculate the value of the objective function for every possible partition; however, even with K = 2 one has to consider $\sum_{i=1}^{\lfloor N/2 \rfloor} \binom{N}{i}$ possible partitions, and the number of partitions grows rapidly with the sample size N. Various algorithms have therefore been proposed to achieve a local optimum of (2.3). The most popular ones are due to Lloyd [14], Hartigan and Wong [11], MacQueen [16] and Forgy [6]. All of these algorithms adopt the squared Euclidean distance as the dissimilarity measure.

MacKay [15] stated that the most commonly used standard iterative algorithm is the one proposed by Lloyd [14]. The trimmed K-means algorithm, introduced in Section 2.4, robustifies this method. The kmeans command in the R package stats uses Hartigan and Wong's algorithm by default. This algorithm made Lloyd's algorithm (Lloyd [14]) more efficient and generally achieves better local optima than other methods. Sparse K-means, introduced later, uses Hartigan and Wong's algorithm. First we introduce Lloyd's algorithm; then we introduce Hartigan and Wong's algorithm, which is substantially more complicated.

2.2.2.1 The standard computing algorithm (Lloyd's algorithm)

Here we use another notation, IC1(i), to describe a partition $\mathcal{C}$: IC1(i) represents the index of the cluster to which the ith case belongs, so IC1(i) = k is equivalent to $i \in C_k$. Randomly choose K cases to serve as initial cluster centers and assign each case to its closest initial cluster center. Repeat Steps (a) and (b) below until convergence.

- Step (a): Given a cluster assignment, update the cluster centers to the cluster means $\bar{x}_k$, for $k = 1, \ldots, K$.
- Step (b): Given the cluster centers, update the cluster assignment by assigning each case to its closest cluster center. The index of the cluster to which the ith case belongs is

$$\mathrm{IC1}(i) = \mathop{\mathrm{argmin}}_{1 \le k \le K} ||x_i - \bar{x}_k||^2 \quad \forall i \in \{1, \ldots, N\}.$$

Since each iteration cannot increase the objective function, the algorithm converges to a local minimum. The solutions can differ from each other depending on the random selection of the initial cluster centers. The user of the K-means algorithm should try several sets of initial cluster centers and choose the partition that corresponds to the local minimum with the smallest objective function, $\mathrm{WSS}(\mathcal{C})$, among all the local minima. Also, this algorithm can return a partition with fewer than K clusters. This may happen when some cluster centers do not have any case that is closest to them.

2.2.2.2 Hartigan and Wong's algorithm

Hartigan and Wong's algorithm differs from Lloyd's algorithm mainly in three ways.

First, they differ in the reassignment criterion. Lloyd's algorithm reassigns a case from the $k_1$th cluster to the $k_2$th cluster if it is closer to the $k_2$th cluster center. Therefore, Step (b) of Lloyd's algorithm compares the distances:

$$||x_i - \bar{x}_{k_2}||^2 < ||x_i - \bar{x}_{k_1}||^2. \quad (2.6)$$

Hartigan and Wong's algorithm, however, uses a different criterion. Sparks [22] proposed that it is more effective to reassign the case if the squared Euclidean distance from the center of the $k_2$th cluster is less than that from the center of the $k_1$th cluster even when the cluster centers are simultaneously repositioned. That is, the case is reassigned when:

$$\frac{n_{k_2}}{n_{k_2} + 1} ||x_i - \bar{x}_{k_2}||^2 < \frac{n_{k_1}}{n_{k_1} - 1} ||x_i - \bar{x}_{k_1}||^2. \quad (2.7)$$

Criterion (2.6) is a sufficient condition for criterion (2.7) because $\frac{n_{k_2}}{n_{k_2}+1} < 1 < \frac{n_{k_1}}{n_{k_1}-1}$. Therefore, reassignments of cases occur less frequently with (2.6) than with (2.7).
Second, Hartigan and Wong's algorithm is more efficient. Lloyd's algorithm is inefficient in the sense that it computes the squared Euclidean distances between all cases and all cluster centers until convergence. Consider the situation where K = 4 and clusters $C_1$ and $C_2$ are clearly defined while the remaining clusters, $C_3$ and $C_4$, have a large overlap. Then the cases in $C_1$ and $C_2$ would stop switching between clusters at an early stage, while the cases in $C_3$ and $C_4$ would keep switching clusters. Lloyd's algorithm, however, calculates the squared Euclidean distances of all cases, including the cases in $C_1$ and $C_2$, to all cluster centers until all cases stop switching between clusters (convergence). Hartigan and Wong's algorithm only recomputes the squared Euclidean distances between a case and the cluster centers that have the potential to become its new cluster center. Hartigan and Wong define a set, the "live set", that contains all the clusters whose cluster centers have the potential to change. By treating the cases in clusters that belong to the live set differently from the others, Hartigan and Wong's algorithm becomes computationally more efficient.

Third, Hartigan and Wong's algorithm achieves a better local optimum. In Lloyd's algorithm, when each case is examined at Step (b) to see if it should be reassigned to a different cluster, only the closest cluster center is used to check for a possible reallocation. Hartigan and Wong's algorithm, however, examines not only the closest cluster center but also the second closest center. The details of how the two cluster centers are examined, and of the algorithm itself, are described in Appendix A.1. Hartigan and Wong [11] showed that their algorithm generally achieves a better local optimum than Lloyd's algorithm. Therefore, we will use Hartigan and Wong's algorithm to compute K-means and sparse K-means throughout this thesis.

2.2.3 Measurements of the agreement between two partitions

In order to assess the performance of different clustering algorithms we need to measure how well two given partitions $\mathcal{C}$ and $\tilde{\mathcal{C}}$ agree. For instance, in a simulation study we would like to compare the partitions produced by the different clustering algorithms with the "true partition" generated by the model. Given two cases and two partitions $\mathcal{C}$ and $\tilde{\mathcal{C}}$, there are four possible scenarios (Hubert [12]):

Type 1: the two cases belong to the same cluster in both $\mathcal{C}$ and $\tilde{\mathcal{C}}$;
Type 2: the two cases belong to different clusters in both $\mathcal{C}$ and $\tilde{\mathcal{C}}$;
Type 3: the two cases belong to different clusters in $\mathcal{C}$ but to the same cluster in $\tilde{\mathcal{C}}$;
Type 4: the two cases belong to the same cluster in $\mathcal{C}$ but to different clusters in $\tilde{\mathcal{C}}$.

Pairs of types 1 and 2 suggest agreement of the two partitions, while pairs of types 3 and 4 suggest disagreement. This classification of pairs is used in the famous Rand index (Rand [19]) as well as in the classification error rate (CER) (Chipman and Tibshirani [2]).
This definition is used in the famous Rand index (Rand [19]), as well as in the classification error rate (CER) (Chipman and 14 IC (i, i\u00E2\u0080\u00B2) IC\u00CC\u0083 (i, i \u00E2\u0080\u00B2) |IC (i, i\u00E2\u0080\u00B2)\u00E2\u0088\u0092 IC\u00CC\u0083 (i, i\u00E2\u0080\u00B2)| Type 1 1 1 0 Type 2 0 0 0 Type 3 0 1 1 Type 4 1 0 1 Table 2.1: The pair types and corresponding values of the indicator functions, IC (i, i\u00E2\u0080\u00B2) and IC\u00CC\u0083 (i, i \u00E2\u0080\u00B2). Tibshirani [2]). The CER is simply the proportion of pairs of cases that are together in one partition and apart in the other partition, i.e., Type 3 or Type 4 pairs of cases. On the other hand, the Rand Index is the proportion of Type 1 or Type 2 pairs of cases. Therefore, CER = 1 - Rand index. Throughout this thesis, we will use CER as a measurement of agreement of two partitions. The calculation of CER is described in the following section. 2.2.3.1 The calculation of CER CER was first proposed by Chipman et al. [3] in the context of decision trees. Later, Chipman and Tibshirani [2] applied this measurement for clustering. Let IC be an indicator function, IC : {1, \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 ,N}\u00C3\u0097{1, \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 ,N} \u00E2\u0086\u0092 {0,1}. Given the pair of case indices, (i, i\u00E2\u0080\u00B2), the indicator function returns 1 if the ith case and the i\u00E2\u0080\u00B2th case are paired in partition C and returns 0 if they are apart: IC (i, i\u00E2\u0080\u00B2) = { 1 if i and i\u00E2\u0080\u00B2 belong to the same cluster in partition C , 0 otherwise. Given two partitions C and C\u00CC\u0083 , the four types of the pairs can be represented by the output of two indicator functions IC (i, i\u00E2\u0080\u00B2) and IC\u00CC\u0083 (i, i \u00E2\u0080\u00B2) as in Table 2.1. Then the absolute difference of the two indicator functions, IC and IC\u00CC\u0083 is zero if the pairs of cases are apart or together in both partitions while it is one if the pair is apart in one partition and together in another. 15 CER is the percentage of the pairs of cases that are Type 3 or Type 4: CER(C ,C \u00E2\u0080\u00B2) = \u00E2\u0088\u0091i>i \u00E2\u0080\u00B2 |IC (i, i\u00E2\u0080\u00B2)\u00E2\u0088\u0092 IC\u00CC\u0083 (i, i\u00E2\u0080\u00B2)|(N 2 ) . Note that there are (N 2 ) pairs of cases and this factor scales the range of CER to be between zero and one. CER(C , C\u00CC\u0083 ) equals zero if the two partitions perfectly agree and becomes one if the two partitions completely disagree (Chipman and Tibshirani [2]). For example, if C contains a single cluster, |C | = 1, while C\u00CC\u0083 contains as many clusters as sample size, |C\u00CC\u0083 |= N, then all the pairs of clusters are Type 4 and CER becomes 1. 2.2.4 Small simulation studies of K-means Since K-means is easy to understand (minimizes a simple objective function) and readily available in many statistical packages, this algorithm has been a popular choice in many fields. In fact, K-means performs well to identify clusters when the datasets do not contain noise features. To see this, we report on a small sim- ulation study with the following simulation model. We will refer to this model as \u00E2\u0080\u0098\u00E2\u0080\u0098clean model\u00E2\u0080\u0099\u00E2\u0080\u0099. A dataset generated from the clean model contains 60 cases xi = [xi1, \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 ,xi50]T (i = 1, \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 ,60). 
2.2.4 Small simulation studies of K-means

Since K-means is easy to understand (it minimizes a simple objective function) and is readily available in many statistical packages, the algorithm has been a popular choice in many fields. In fact, K-means performs well at identifying clusters when the dataset does not contain noise features. To see this, we report on a small simulation study with the following simulation model, which we will refer to as the "clean model". A dataset generated from the clean model contains 60 cases $x_i = [x_{i1}, \ldots, x_{i50}]^T$ ($i = 1, \ldots, 60$). The cases are generated from 3 clusters as $x_{ij} = \mu_i + \varepsilon_{ij}$, with $\varepsilon_{ij}$ i.i.d. $\sim N(0,1)$, where:

$$\mu_i = \begin{cases} \mu & \text{if } 1 \le i \le 20, \\ 0 & \text{if } 21 \le i \le 40, \\ -\mu & \text{if } 41 \le i \le 60. \end{cases}$$

Note that all 50 features are clustering features. We consider different values for $\mu$ in our simulations ($\mu = 1$ and $0.8$); the magnitude of $\mu$ controls how close the clusters are. In this simulation model, the true partition of the cases is $\mathcal{C}' = \{C_1, C_2, C_3\}$, where $C_1 = \{1, \ldots, 20\}$, $C_2 = \{21, \ldots, 40\}$ and $C_3 = \{41, \ldots, 60\}$.

We generate 100 datasets from the clean model for each $\mu$. Then K-means with K = 3 is performed on each dataset; K-means with K predetermined to be 3 is called 3-means. Hartigan and Wong's algorithm is used, as implemented by default in the R package stats. The performance of 3-means is assessed by the CER between the true partition $\mathcal{C}'$ and the partition returned by the algorithm. The average and the standard deviation (in parentheses) of the CERs over the 100 simulations are reported in Table 2.2. The CERs are nearly 0 for both values of $\mu$, indicating that 3-means can recover the true cluster structure nearly perfectly.

Table 2.2: The simulation results of 3-means on 100 datasets generated from the clean model.

    mu     CER (sd)
   1.0    0.00 (0.003)
   0.8    0.01 (0.016)

A drawback of K-means is that it uses all p features to obtain the partition, regardless of their importance for clustering. Since K-means does not automatically perform feature selection, the user must decide which features are the clustering features. To illustrate this issue, we examine the performance of 3-means on datasets generated from a "noisy model". A dataset generated from the noisy model consists of 60 cases $x_i = [x_{i1}, \ldots, x_{i500}]^T$ with 500 features ($i = 1, \ldots, 60$). The first 50 features are clustering features, generated as in the clean model. The last 450 features are noise features; that is, for $j = 51, \ldots, 500$, $x_{ij}$ is defined as $x_{ij} = \varepsilon_{ij}$.

We generate 100 datasets from the noisy model for each $\mu$ ($\mu = 1$ or $0.8$) and perform 3-means on each dataset, using Hartigan and Wong's algorithm. The average CER values are reported in Table 2.3. The performance of 3-means is poorer on datasets from the noisy model than on datasets from the clean model.

Table 2.3: The simulation results of 3-means on 100 datasets generated from the noisy model.

    mu     CER (sd)
   1.0    0.02 (0.020)
   0.8    0.10 (0.061)
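For concreteness, one clean and one noisy dataset for µ = 1 can be generated and clustered as follows (our own sketch; the results in Tables 2.2 and 2.3 average the CER over 100 replicated datasets rather than inspecting a single one).

```r
set.seed(1)
mu    <- rep(c(1, 0, -1), each = 20)                      # cluster means for the 60 cases
clean <- matrix(rnorm(60 * 50,  mean = mu), 60, 50)       # clean model: 50 clustering features
noisy <- cbind(clean, matrix(rnorm(60 * 450), 60, 450))   # noisy model: plus 450 noise features
truth <- rep(1:3, each = 20)

# 3-means via Hartigan and Wong's algorithm (the default of kmeans in stats)
fit_clean <- kmeans(clean, centers = 3, nstart = 10)
fit_noisy <- kmeans(noisy, centers = 3, nstart = 10)

table(truth, fit_clean$cluster)   # near-perfect recovery on the clean model
table(truth, fit_noisy$cluster)   # recovery degrades once noise features are added
```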
2.3 Sparse K-means

Our small simulation study above shows that the performance of K-means may deteriorate when datasets contain many noise features. To address this problem, Witten and Tibshirani [28] proposed the sparse K-means algorithm, which clusters the cases using an adaptively chosen subset of the features. Sparse K-means is based on the fact that the problem of K-means can be viewed as a maximization problem. Moreover, it hinges critically on the assumption that the dissimilarity is additive in the features. Before moving on to the discussion of sparse K-means, we introduce this alternative representation of the K-means problem in Section 2.3.1, and we explain in Section 2.3.2 that some dissimilarities can be decomposed feature by feature.

2.3.1 K-means as a maximization problem

We will show below that minimizing the overall within cluster dissimilarity is the same as maximizing the overall between cluster dissimilarity. The overall between cluster dissimilarity is defined as:

$$\mathrm{OBCD}(\mathcal{C}) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{i' \notin C_k} d(x_i, x_{i'}).$$

Moreover, we define the overall total cluster dissimilarity to be the sum of the dissimilarities over all pairs of cases:

$$\mathrm{OTCD} = \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}). \quad (2.8)$$

Note that OTCD does not depend on the partition $\mathcal{C}$. We now show that $\mathrm{OBCD}(\mathcal{C})$ is simply the difference between the overall total cluster dissimilarity and the overall within cluster dissimilarity, $\mathrm{OBCD}(\mathcal{C}) = \mathrm{OTCD} - \mathrm{OWCD}(\mathcal{C})$. To see this:

$$\begin{aligned}
\mathrm{OTCD} &= \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{i'=1}^{N} d(x_i, x_{i'}) \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} \left( \sum_{i' \in C_k} d(x_i, x_{i'}) + \sum_{i' \notin C_k} d(x_i, x_{i'}) \right) \\
&= \mathrm{OWCD}(\mathcal{C}) + \mathrm{OBCD}(\mathcal{C}). \quad (2.9)
\end{aligned}$$

As in (2.8), the total cluster dissimilarity can be defined as:

$$\mathrm{TCD} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}).$$

It is then natural, from (2.9), to define the between cluster dissimilarity:

$$\mathrm{BCD}(\mathcal{C}) = \mathrm{TCD} - \mathrm{WCD}(\mathcal{C}) = \frac{1}{2N} \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}) - \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} d(x_i, x_{i'}).$$

With the definitions of $\mathrm{WCD}(\mathcal{C})$, $\mathrm{BCD}(\mathcal{C})$ and TCD, we can now show that the minimization problem of K-means (2.3) can be rewritten as a maximization problem:

$$\begin{aligned}
\mathcal{C}^* &= \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{WCD}(\mathcal{C}) = \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \{\mathrm{TCD} - \mathrm{BCD}(\mathcal{C})\} \\
&= \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \{-\mathrm{BCD}(\mathcal{C})\} \quad (\text{since TCD does not depend on } \mathcal{C}) \\
&= \mathop{\mathrm{argmax}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{BCD}(\mathcal{C}). \quad (2.10)
\end{aligned}$$

For the squared Euclidean dissimilarity measure, the total cluster dissimilarity is called the total cluster sum of squares (TSS) and the between cluster dissimilarity is called the between cluster sum of squares (BSS). Then we have:

$$\mathrm{TSS} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{i'=1}^{N} ||x_i - x_{i'}||^2, \qquad \mathrm{BSS}(\mathcal{C}) = \mathrm{TSS} - \mathrm{WSS}(\mathcal{C}). \quad (2.11)$$
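As a quick numerical check of the decomposition (2.9) and of the scaled quantities in (2.11), the following R sketch (our own code) computes the overall and scaled sums for an arbitrary partition of simulated data.

```r
set.seed(1)
X  <- matrix(rnorm(20 * 3), nrow = 20)     # 20 cases, 3 features
cl <- rep(1:2, each = 10)                  # an arbitrary partition with K = 2
D  <- as.matrix(dist(X))^2                 # squared Euclidean dissimilarities

# Overall (unscaled) quantities, as in (2.8) and (2.9)
OTCD <- sum(D)
OWCD <- sum(sapply(1:2, function(k) sum(D[cl == k, cl == k])))
OBCD <- sum(sapply(1:2, function(k) sum(D[cl == k, cl != k])))
all.equal(OTCD, OWCD + OBCD)               # TRUE

# Scaled quantities TSS, WSS(C) and BSS(C), as defined in (2.11)
TSS <- sum(D) / (2 * nrow(X))
WSS <- sum(sapply(1:2, function(k) sum(D[cl == k, cl == k]) / (2 * sum(cl == k))))
BSS <- TSS - WSS
c(TSS = TSS, WSS = WSS, BSS = BSS)
```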
As with $\mathrm{WSS}(\mathcal{C})$, TSS can also be expressed in terms of distances to a center, here the overall mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$:

$$\begin{aligned}
\mathrm{TSS} &= \frac{1}{2N} \sum_{i=1}^{N} \sum_{i'=1}^{N} ||x_i - \bar{x} + \bar{x} - x_{i'}||^2 \\
&= \frac{1}{2N} \left( N \sum_{i=1}^{N} ||x_i - \bar{x}||^2 + 2 \sum_{i=1}^{N} (x_i - \bar{x})^T \sum_{i'=1}^{N} (\bar{x} - x_{i'}) + N \sum_{i'=1}^{N} ||\bar{x} - x_{i'}||^2 \right) \\
&= \sum_{i=1}^{N} ||x_i - \bar{x}||^2, \quad (2.12)
\end{aligned}$$

since the cross term vanishes. Therefore, $\mathrm{BSS}(\mathcal{C})$ can also be expressed in terms of the distances to the cluster centers:

$$\mathrm{BSS}(\mathcal{C}) = \sum_{i=1}^{N} ||x_i - \bar{x}||^2 - \sum_{k=1}^{K} \sum_{i \in C_k} ||x_i - \bar{x}_k||^2.$$

Then the minimization problem of K-means with the squared Euclidean dissimilarity measure (2.5) is equivalent to the maximization problem:

$$\mathcal{C}^* = \mathop{\mathrm{argmax}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{BSS}(\mathcal{C}). \quad (2.13)$$

This maximization form of the K-means problem is used in the sparse K-means algorithm.

2.3.2 The additivity of dissimilarities in each feature

Sparse K-means hinges critically on the assumption that the dissimilarity measure can be represented as a sum of terms, each of which depends on a single feature. In other words, sparse K-means assumes that the cluster dissimilarities can be decomposed as:

$$\mathrm{WCD}(\mathcal{C}) = \sum_{j=1}^{p} \mathrm{WCD}_j(\mathcal{C}), \qquad \mathrm{TCD} = \sum_{j=1}^{p} \mathrm{TCD}_j, \qquad \mathrm{BCD}(\mathcal{C}) = \sum_{j=1}^{p} \mathrm{BCD}_j(\mathcal{C}),$$

where $\mathrm{WCD}_j(\mathcal{C})$, $\mathrm{TCD}_j$ and $\mathrm{BCD}_j(\mathcal{C})$ depend only on the jth column of $X \in \mathbb{R}^{N \times p}$. This assumption is satisfied if the dissimilarity measure is additive in the features:

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_i, x_{i'}),$$

where $d_j(x_i, x_{i'})$ is the dissimilarity between $x_i$ and $x_{i'}$ along the jth feature. The derivations below show that this assumption holds for the squared Euclidean distance:

$$\begin{aligned}
\mathrm{WSS}(\mathcal{C}) &= \sum_{k=1}^{K} \sum_{i \in C_k} ||x_i - \bar{x}_k||^2 \quad (\text{from } (2.4)) \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{k,j})^2, \quad \text{where } \bar{x}_{k,j} = \frac{1}{n_k} \sum_{i \in C_k} x_{ij}, \\
&= \sum_{j=1}^{p} \left\{ \sum_{k=1}^{K} \sum_{i \in C_k} (x_{ij} - \bar{x}_{k,j})^2 \right\} = \sum_{j=1}^{p} \mathrm{WSS}_j(\mathcal{C}),
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{TSS} &= \sum_{i=1}^{N} ||x_i - \bar{x}||^2 \quad (\text{from } (2.12)) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2, \quad \text{where } \bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \\
&= \sum_{j=1}^{p} \left\{ \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2 \right\} = \sum_{j=1}^{p} \mathrm{TSS}_j.
\end{aligned}$$

Finally,

$$\mathrm{BSS}(\mathcal{C}) = \mathrm{TSS} - \mathrm{WSS}(\mathcal{C}) \quad (\text{from } (2.11)) = \sum_{j=1}^{p} \left\{ \mathrm{TSS}_j - \mathrm{WSS}_j(\mathcal{C}) \right\} = \sum_{j=1}^{p} \mathrm{BSS}_j(\mathcal{C}).$$

Notice that $\mathrm{BSS}_j(\mathcal{C})$ can be interpreted as the contribution of the jth feature to the separation between the given K clusters. We can expect clustering features to have large values of $\mathrm{BSS}_j(\mathcal{C})$ while noise features should have small values.
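The per-feature decomposition is easy to compute directly; the following R sketch (our own code) evaluates $\mathrm{BSS}_j(\mathcal{C})$ for a small dataset in which only the first two features are clustering features.

```r
# Per-feature between-cluster sum of squares: BSS_j(C) = TSS_j - WSS_j(C)
bss_per_feature <- function(X, cl) {
  tss_j <- colSums(sweep(X, 2, colMeans(X))^2)
  wss_j <- Reduce(`+`, lapply(unique(cl), function(k) {
    Xk <- X[cl == k, , drop = FALSE]
    colSums(sweep(Xk, 2, colMeans(Xk))^2)
  }))
  tss_j - wss_j
}

set.seed(1)
# 60 cases: features 1-2 are clustering features, features 3-10 are noise
X  <- cbind(matrix(rnorm(60 * 2, mean = rep(c(0, 3), each = 30)), 60, 2),
            matrix(rnorm(60 * 8), 60, 8))
cl <- rep(1:2, each = 30)
round(bss_per_feature(scale(X), cl), 1)   # the first two entries are much larger
```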
In Section 2.3.3, we introduce sparse K-means with the squared Euclidean dissimilarity measure.

2.3.3 The sparse K-means clustering algorithm

Now we are ready to introduce sparse K-means. More precisely, sparse K-means is defined as the solution to the problem:

$$\max_{\mathcal{C}, w;\, |\mathcal{C}| = K} \; \sum_{j=1}^{p} w_j \mathrm{BSS}_j(\mathcal{C}) \quad \text{s.t.} \quad ||w||_1 \le l_1, \;\; ||w||_2 \le 1, \;\; w_j \ge 0 \;\, \forall j, \quad (2.14)$$

where $w = [w_1, \ldots, w_p]^T$ and $||a||_1$ denotes the L1 norm of the vector $a$. The idea is that the algorithm returns sparse weights such that only clustering features receive nonzero weights. The L1 penalty on $w$ results in sparsity for small values of the tuning parameter $l_1$; the L1 tuning parameter value should be selected between 1 and $\sqrt{p}$ to obtain sparse weights. The L2 penalty also serves an important role: without it, at most one element of $w$ would be nonzero in general. This point is discussed in Section 2.3.4.

Witten and Tibshirani [28] proposed an algorithm to solve (2.14); it simply maximizes (2.14) over $\mathcal{C}$ and $w$ iteratively. Initially choose equal weights for all features, i.e. let $w = \frac{1}{\sqrt{p}}\, 1$, where $1 = [1, \ldots, 1]^T \in \mathbb{R}^p$. Repeat Steps (a) and (b) below until convergence.

- Step (a): For a given $w$, maximize (2.14) with respect to $\mathcal{C}$. That is, perform K-means on the transformed dataset $Y$:

$$\mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{WSS}_Y(\mathcal{C}), \quad (2.15)$$

  where $\mathrm{WSS}_Y(\mathcal{C})$ represents the within cluster sum of squares calculated on the dataset $Y \in \mathbb{R}^{N \times p}$, whose columns are the columns of $X$ multiplied by the corresponding weights. The transformation of the dataset into $Y$ is explained in Section 2.3.3.1.

- Step (b): For a given $\mathcal{C}$, maximize (2.14) with respect to $w$:

$$\max_{w} \; w^T D \quad \text{s.t.} \quad ||w||_2 \le 1, \;\; ||w||_1 \le l_1, \;\; w_j \ge 0 \;\, \forall j, \quad (2.16)$$

  where $D = [\mathrm{BSS}_1(\mathcal{C}), \ldots, \mathrm{BSS}_p(\mathcal{C})]^T$. The analytic solution to this optimization problem, given by Witten and Tibshirani [28], is $w(\Delta^*)$, where

$$w(\Delta) = \frac{(D - \Delta 1)_+}{||(D - \Delta 1)_+||}, \quad (2.17)$$

  and

$$\Delta^* = \begin{cases} 0 & \text{if } ||w(0)||_1 \le l_1, \\ \text{the root of } ||w(\Delta)||_1 - l_1 = 0 & \text{otherwise.} \end{cases}$$

  Note that $a_+ = a$ if $a \ge 0$ and $a_+ = 0$ if $a < 0$.
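For illustration, a minimal R sketch of this weight update (our own code, not the sparcl implementation) soft-thresholds D and renormalizes, finding Δ* by bisection when the unthresholded weights violate the L1 bound:

```r
soft <- function(d, delta) pmax(d - delta, 0)   # the (.)_+ operation in (2.17)

update_w <- function(D, l1) {
  w <- soft(D, 0); w <- w / sqrt(sum(w^2))
  if (sum(w) <= l1) return(w)                   # Delta* = 0 case
  lo <- 0; hi <- max(D)
  for (it in 1:60) {                            # bisection for the root of ||w(Delta)||_1 = l1
    mid <- (lo + hi) / 2
    w   <- soft(D, mid); w <- w / sqrt(sum(w^2))
    if (sum(w) > l1) lo <- mid else hi <- mid
  }
  w
}

D <- c(10, 8, 0.5, 0.4, 0.3)                    # clustering features have large BSS_j
round(update_w(D, l1 = 1.2), 3)                 # sparse weights concentrated on the first two
```

Increasing l1 toward √p spreads the weight over more features, while values close to 1 force most weights to zero, which is the role of the L1 tuning parameter discussed in Section 2.3.4.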
2.3.3.1 The transformation of the dataset X into Y

Sparse K-means aims at finding clusters and weights that simultaneously maximize the sum of the dissimilarity measures across clusters. If the weights were fixed, then (2.14) would become the objective function of K-means with a weighted within cluster sum of squares:

$$\mathop{\mathrm{argmax}}_{\mathcal{C};\, |\mathcal{C}| = K} \sum_{j=1}^{p} w_j \mathrm{BSS}_j(\mathcal{C}) = \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \sum_{j=1}^{p} w_j \mathrm{WSS}_j(\mathcal{C}). \quad (2.18)$$

One can see this problem as a K-means problem on a transformed dataset $Y \in \mathbb{R}^{N \times p}$. We transform the dataset $X$ into $Y$ by scaling the jth feature $X_j$ by $\sqrt{w_j}$. Specifically, let:

$$Y = \begin{bmatrix} y_{11} & \cdots & y_{1p} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{Np} \end{bmatrix} = \begin{bmatrix} \sqrt{w_1}\, x_{11} & \cdots & \sqrt{w_p}\, x_{1p} \\ \vdots & & \vdots \\ \sqrt{w_1}\, x_{N1} & \cdots & \sqrt{w_p}\, x_{Np} \end{bmatrix} = [\sqrt{w_1} X_1 \cdots \sqrt{w_p} X_p].$$

The within cluster sum of squares calculated on the transformed dataset $Y$ can be expressed as:

$$\mathrm{WSS}_Y(\mathcal{C}) = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} ||y_i - y_{i'}||^2, \quad (2.19)$$

and is equivalent to the objective function on the left hand side of (2.18). To see this:

$$\begin{aligned}
\mathrm{WSS}_Y(\mathcal{C}) &= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (y_{ij} - y_{i'j})^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (\sqrt{w_j}\, x_{ij} - \sqrt{w_j}\, x_{i'j})^2 \\
&= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \sum_{j=1}^{p} w_j (x_{ij} - x_{i'j})^2 = \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} (x_{ij} - x_{i'j})^2 \\
&= \sum_{j=1}^{p} w_j \mathrm{WSS}_j(\mathcal{C}).
\end{aligned}$$

Therefore, for fixed weights, the problem (2.14) becomes the problem of K-means on the transformed dataset $Y$:

$$\mathop{\mathrm{argmax}}_{\mathcal{C};\, |\mathcal{C}| = K} \sum_{j=1}^{p} w_j \mathrm{BSS}_j(\mathcal{C}) = \mathop{\mathrm{argmin}}_{\mathcal{C};\, |\mathcal{C}| = K} \mathrm{WSS}_Y(\mathcal{C}).$$

The R package sparcl contains the function KMeansSparseCluster, which performs sparse K-means. Although any standard implementation of K-means could be applied to the transformed dataset $Y$ at Step (a), KMeansSparseCluster uses Hartigan and Wong's algorithm to run K-means on the transformed dataset $Y$ at Step (a).

2.3.4 The L1 tuning parameter values and the sparsity of features

Sparse K-means requires users to prespecify the value of the tuning parameter $l_1$. The lasso-type penalty on $w$ results in sparsity when the L1 tuning parameter value is sufficiently small. In Step (b), the objective function is maximized over the domain of the vector $w$. Figure 2.2 shows the admissible region for $w$ for different values of the upper bound $l_1$ when p = 2. The L2 constraint is the area inside the unit circle and the L1 constraint is the area inside the diamond with y-intercepts $l_1$ and $-l_1$. Since $w_j \ge 0$ for all $j$, the shaded region in Figure 2.2 is the admissible region.

Figure 2.2: Regions of the L1 and L2 constraints in $\mathbb{R}^2$ (axes: $w_1$ and $w_2$).

4.2 The gap statistic

WSS($\mathcal{C}$) decreases as the number of clusters K increases, but the decrease flattens for $K > K^*$, because we are essentially adding unnecessary cluster centers in the middle of a cluster. The location of such an elbow decline in the plot of WSS($\mathcal{C}$) against K indicates the appropriate number of clusters, and experimenters have chosen such a K as an estimate of the number of clusters. A problem arises in this heuristic procedure: it assumes that there is a cluster structure (K > 1). The idea of the gap statistic is to standardize WSS($\mathcal{C}$) by comparing it to its expectation under an appropriate reference distribution (K = 1) of the dataset (see Section 4.2.1 for more detail about the reference distribution). The gap statistic screens the evidence for K > 1 against K = 1 and selects the K which has the strongest evidence against the null hypothesis K = 1 after taking the sampling distribution into account. In order to estimate the expectation of WSS($\mathcal{C}$) under an appropriate reference distribution (K = 1) of the dataset, Tibshirani et al. [25] proposed generating $B_0$ Monte Carlo samples from the reference distribution. Performing clustering with each candidate number of clusters K on each Monte Carlo sample leads to $B_0$ within cluster sums of squares of reference datasets for each K. Tibshirani et al. [25] use their mean as an estimate of the expectation of WSS($\mathcal{C}$) under the reference distribution and their standard deviation to take the sampling distribution into account.

4.2.1 Two methods for generating the reference distributions of the gap statistic

If all observed cases came from a single cluster, then they would come from a reference distribution. An appropriate reference distribution should therefore form a single cluster and reflect the sampling distribution of the observed cases. Tibshirani et al. [25] proposed two methods to find such a reference distribution. We explain the two methods in detail because they are also used in Clest.

1. Simple reference distribution: generate each reference feature uniformly and independently over the range of the observed values for that feature.

2. PCA reference distribution: generate the reference features from a uniform distribution over a box aligned with the principal components of the data.
The idea of the gap statistic is to standardize WSS(C) by comparing it to its expectation under an appropriate reference distribution (K = 1) for the dataset (see Section 4.2.1 for more detail about the reference distribution). The gap statistic screens the evidence for K > 1 against K = 1 and selects the K with the strongest evidence against the null hypothesis K = 1 after taking the sampling distribution into account. In order to estimate the expectation of WSS(C) under an appropriate reference distribution (K = 1), Tibshirani et al. [25] proposed generating B0 Monte Carlo samples from the reference distribution. Performing clustering for every candidate number of clusters K on each Monte Carlo sample yields B0 within cluster sums of squares of reference datasets for each K. Tibshirani et al. [25] use their mean as an estimate of the expectation of WSS(C) under the reference distribution, and their standard deviation to account for the sampling distribution.

4.2.1 Two methods for generating the reference distributions of the gap statistic

If all observed cases came from a single cluster, then they would be draws from a reference distribution. An appropriate reference distribution should therefore form a single cluster and reflect the sampling distribution of the observed cases. Tibshirani et al. [25] proposed two methods to construct such a reference distribution. We explain both in detail because they are also used in Clest.

1. Simple reference distribution: generate each reference feature uniformly and independently over the range of the observed values for that feature.

2. PCA reference distribution: generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

Given the normalized data matrix X ∈ R^{N×p}, the variance covariance matrix is:

    S_X = (1/(N − 1)) X^T X.

Then S_X is orthogonally diagonalized as:

    S_X = P D P^T,

where:

• D is a diagonal matrix with the eigenvalues λ_1, ..., λ_p of S_X on the diagonal, arranged such that λ_1 ≥ ... ≥ λ_p ≥ 0;

• P is an orthogonal matrix whose columns are the corresponding unit eigenvectors u_1, ..., u_p ∈ R^p of S_X.

We transform X via X* = XP. The transformed data are then no longer correlated; to see this:

    S_{X*} = (1/(N − 1)) X*^T X* = P^T S_X P = P^T P D P^T P = D.

We then draw uniform features Z* over the ranges of the columns of X*, as in method 1, and finally back-transform via Z = Z* P^T to give the reference dataset Z. By applying PCA, the reference distribution takes into account the shape of the data distribution, that is, the orientation of the sampled features.

Before the gap statistic was proposed, Milligan and Cooper [18] conducted an extensive Monte Carlo evaluation of 30 criteria for selecting the number of clusters. Tibshirani et al. [25] carried out a comparative Monte Carlo study of the gap statistic and some of the 30 criteria examined by Milligan and Cooper [18]. Their simulation studies showed that the gap statistic performs better than those criteria.
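The two reference distributions can be generated with a short R sketch (ours; X is assumed to be the normalized data matrix, so that cov(X) equals S_X):

    # Method 1: uniform over the range of each observed feature.
    ref.simple <- function(X) apply(X, 2, function(x) runif(nrow(X), min(x), max(x)))

    # Method 2: uniform over a box aligned with the principal components.
    ref.pca <- function(X) {
      P  <- eigen(cov(X))$vectors   # columns: unit eigenvectors of S_X
      Xs <- X %*% P                 # rotate: X* = XP
      Zs <- ref.simple(Xs)          # draw uniformly over the ranges of X*
      Zs %*% t(P)                   # back-transform: Z = Z* P^T
    }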
4.3 Clest

4.3.1 The idea

Dudoit and Fridlyand [5] proposed a prediction-based resampling method for estimating the number of clusters in a dataset. Clest borrows an idea from supervised learning and applies it to clustering, which is an unsupervised learning method. In unsupervised learning, partitions are unknown and need to be discovered from a dataset. In supervised learning, partitions are predefined and the task is to build a classification rule; the user of a supervised learning method then applies the classification rule to predict the cluster labels of future datasets. In unsupervised learning, once a clustering procedure has identified the clusters and cluster labels have been assigned to the cases, the next step is often to build a classifier for predicting the cluster labels of future cases. Dudoit and Fridlyand [5] argue that the classifier built from the clustering algorithm with the true number of clusters should have the most stable predictability. If we had a future dataset, the stability of the prediction via the classifier could be assessed by comparing the predicted partition with the partition obtained by running the clustering algorithm on the future dataset. Clest uses a resampling method to generate "future" datasets from the original dataset.

Their simulation study compares the performance of Clest, the gap statistic and other criteria for selecting the number of clusters. Dudoit and Fridlyand [5] concluded that, for their simulation models, Clest was the most robust and accurate. For this reason, and because Clest is applicable to any clustering algorithm, we apply Clest to robust sparse K-means.

4.3.2 The idea of the Clest algorithm

Given a clustering algorithm, Clest returns the estimated number of clusters K̂ that produces the most "stable" predictability in the clustering procedure. Since it is impossible to obtain future datasets, Dudoit and Fridlyand used a resampling technique to assess predictability. The idea of the Clest algorithm with an arbitrary clustering algorithm is as follows.

First, randomly partition the dataset X into a learning set L ∈ R^{N_L×p}, containing N_L cases, and a testing set T ∈ R^{N_T×p}, containing N_T cases. Given the learning set L and a number of clusters K', the clustering algorithm partitions the cases of L into K' clusters. It returns a classifier as well as a partition of the cases in L. In K-means, for example, the classifier can be the set of cluster centers: one can classify future cases by assigning them to their closest cluster centers.

We wish to assess the prediction power of the classifiers. Given the testing set T, the classifier is used to partition the cases of T into K' clusters. The clustering algorithm can also partition the cases of T into K' clusters. We then have two partitions of the testing set T: one from the classifier and another from the clustering algorithm. If K' is the true number of clusters, the prediction accuracy of the "classifier" should be good, i.e., the two partitions of T should agree well. Given a measure of the agreement of two partitions, we can assess how well they agree. This measure can be, for example, CER.
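To make these two ingredients concrete, here is a minimal R sketch; it assumes that CER is the pairwise disagreement rate between two partitions (one minus the Rand index, as defined in Section 2.2.3), and that the K-means "classifier" assigns a case to its closest cluster center.

    # Classification error rate: proportion of case pairs on which two
    # partitions disagree about being "together" or "apart".
    cer <- function(lab1, lab2) {
      same1 <- outer(lab1, lab1, "==")
      same2 <- outer(lab2, lab2, "==")
      mean(same1[upper.tri(same1)] != same2[upper.tri(same2)])
    }

    # K-means classifier: assign each row of newdata to its closest center.
    classify <- function(newdata, centers) {
      d2 <- apply(centers, 1, function(m) rowSums(sweep(newdata, 2, m, "-")^2))
      max.col(-d2)   # column index of the smallest squared distance per case
    }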
If we repeat the above two procedures for every candidate number of clusters, say k = 2, ..., M, we obtain M − 1 observed measures of agreement between the resulting pairs of partitions. One might then think of choosing the number of clusters corresponding to the best observed measure of agreement. However, this strategy ignores the fact that there could be no cluster structure at all: when a clustering algorithm is applied to a dataset, a partition is returned whether or not the dataset has a true cluster structure. Therefore, we should standardize the observed measure of agreement by comparing it to its expected value under the null hypothesis K = 1. In order to estimate this expected value, Clest generates B0 Monte Carlo reference datasets from the reference distribution, as in the gap statistic, and repeats the procedure above to obtain B0 reference measures of agreement under K = 1. The median of the B0 reference measures is used as the estimate. The proportion of the reference measures of agreement that are better than the observed measure of agreement then serves as a p-value: an estimate of the probability of observing the observed measure, or a more extreme one, under the null hypothesis K = 1. We now have two measurements of agreement: one from the observed dataset and another from the reference datasets. The standardized measurement of agreement is calculated by comparing the observed measurement of agreement to the reference measurement of agreement for each k = 2, ..., M. Finally, Clest chooses the number of clusters that returns the strongest "significant" evidence against the hypothesis H0: K = 1.

4.3.3 The algorithm of Clest with robust sparse K-means

We introduce the algorithm that adapts Clest to robust sparse K-means. We use CER to assess the agreement of two partitions. Since there should be more cases to determine the classifiers in the learning set than to assess their predictability in the testing set, 2/3 of the cases are used in the learning set. The Clest algorithm needs the following parameters to be predetermined:

M: the maximum number of clusters that one suspects;

B: the number of times that the observed dataset X is randomly partitioned into a learning set and a testing set;

B0: the number of times that the reference distribution is generated;

β: the "significance level" (explained along with the algorithm below).

In addition, for the robust sparse K-means algorithm, the following parameters must be predetermined:

α: the proportion of cases to be trimmed;

l1: the L1 tuning parameter value, determining the sparsity of the weights.

Given the above parameter values, the Clest algorithm proceeds as follows. Repeat Step (a) and Step (b) for each k' ∈ {2, ..., M}.

• Step (a):
  – Repeat Step (a-1) and Step (a-2) B times (b = 1, ..., B).
    ∗ Step (a-1):
      · Randomly partition the dataset X into two sets, L_b and T_b, where L_b has 2/3 of the cases.
      · Apply robust sparse k'-means to the learning set L_b; return k' cluster centers and p weights.
    ∗ Step (a-2):
      · Apply the classifier to the testing set T_b; return cluster labels C_{c,T_b}. That is, given the weights and cluster centers from Step (a-1), the weighted distances between each case and every cluster center are calculated, and the cases in T_b are assigned to the cluster whose center is closest in weighted squared Euclidean distance.
      · Apply robust sparse k'-means to the testing set T_b; return cluster labels C_{r,T_b}.
      · Compute CER_{k',b} = CER(C_{c,T_b}, C_{r,T_b}).
  – Obtain CER_{k',1}, ..., CER_{k',B}. Let CER_{k'} = median{CER_{k',1}, ..., CER_{k',B}}.

• Step (b):
  – Generate B0 samples under the reference distribution.
  – Repeat Step (a) for each sample with B = 1 and compute:
    ∗ CER0_{k'} = median{CER*_{k'}^{1}, ..., CER*_{k'}^{B0}}. CER0_{k'} is the estimate of the expected value of CER under the null hypothesis that all the cases are from the same cluster (K = 1).
    ∗ p_{k'} = |{CER*_{k'}^{i} : CER*_{k'}^{i} < CER_{k'}}| / B0. This p_{k'} is a p-value: the estimated probability of observing a CER as extreme (small) as the observed CER, or more so, under the null hypothesis.
    ∗ Return the standardized measurement of agreement d_{k'} = CER_{k'} − CER0_{k'}.

After performing Step (a) and Step (b) for all k' ∈ {2, ..., M}, we finally estimate the number of clusters as:

    k̂ = argmin_{k ∈ {k : p_k ≤ β}} d_k  if {k : p_k ≤ β} ≠ ∅,  and  k̂ = 1 otherwise.

    Model   1 (out=500)   2 (out=500)   3      4     5     6     7     8
    α       1/20          1/20          1/20   0.1   0.1   0.2   0.1   1/20

    Table 4.1: The trimming proportion α for the simulation studies of Clest.

In the algorithm described above, there is a minor modification of the original Clest algorithm. In the original Clest algorithm, each reference dataset is partitioned into a learning set and a testing set B times. However, we set B to 1 for the reference datasets to ease the computational cost: we found that even if a reference dataset is partitioned more than once and the reference CERs are computed on the resulting testing sets, these CERs are very similar, because the reference dataset forms a uniform cloud. The user still has to predetermine the number of times B that the observed dataset X is partitioned into a learning set and a testing set.

4.4 Simulation studies with Clest

Now we examine the performance of Clest with robust sparse K-means in the 8 simulation models (see Sections 2.3.6 and 3.3). In this simulation study, we find that it is very challenging to find the true number of clusters when the clusters have considerable overlap (µ = 1). Therefore, we examine the performance of the algorithm on datasets generated from models with µ = 1 and µ = 2. We implemented the code for Clest in the R package RSKC and use it for the simulation study.
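Given the observed medians CER_{k'} and the reference values CER*_{k'}^{1}, ..., CER*_{k'}^{B0}, the quantities p_{k'}, d_{k'} and the final estimate k̂ follow directly; below is a sketch with hypothetical inputs obs (a vector of length M − 1, for k = 2, ..., M) and ref (a B0 × (M − 1) matrix).

    select.K <- function(obs, ref, beta = 0.05) {
      cer0 <- apply(ref, 2, median)               # CER0_k: reference estimate
      pk   <- colMeans(sweep(ref, 2, obs, "<"))   # p_k: share of reference CERs below obs
      dk   <- obs - cer0                          # d_k: standardized agreement
      sig  <- which(pk <= beta)
      if (length(sig) == 0) return(1)             # no significant k: K-hat = 1
      ks <- 2:(ncol(ref) + 1)
      ks[sig][which.min(dk[sig])]                 # significant k minimizing d_k
    }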
Table 4.1 shows the values of the trimming proportion α used in each simulation model. For models 4, 5, 6 and 7, α is set to the true percentage of contaminated cases. For models 1, 2, 3 and 8, α is set so that at least one case is trimmed in the testing set. The L1 tuning parameter value is set to 7.862 for the models with µ = 2 and for the models with µ = 1; these values are obtained as in Section 2.3.7. The Clest parameters are set to (B, B0, β, M) = (10, 20, 0.05, 5). We use the PCA reference distribution to generate the reference datasets.

Figure 4.1 shows the histograms of the estimated number of clusters over 50 datasets generated from each model with µ = 2.

[Figure 4.1: The results of Clest on datasets generated from the 8 models with µ = 2. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model; panels (a)–(h) correspond to Models 1–8, with out=500 for Models 1 and 2.]

The modes of the histograms are at K̂ = 3 for all the models except Model 7. In particular, Clest does an excellent job for Models 1 and 2: it returns K̂ = 3 for all 50 generated datasets of each model. This indicates that the estimate of Clest is reliable for these types of datasets. In Model 7, Clest returns K̂ = 4 most frequently. This could be because the α value is too small: Model 7 contains 6 contaminated cases, so if all 6 cases are randomly selected into the testing set, that set has over 30% (= 6/20) contamination, and α = 0.1 in robust sparse K-means is too small to capture all the contaminated cases.

We also ran Clest with larger, more "conservative", α levels on datasets generated from the 8 models with µ = 2. These α values are set so that robust sparse 3-means can capture all the outliers even when they all fall in the testing set. Table 4.2 shows the conservative α level for each model, and the results are in Figure 4.2.

    Model   1 (out=500)   2 (out=500)   3      4      5      6       7      8
    α       1/20          1/20          1/20   6/20   6/20   12/20   6/20   2/20

    Table 4.2: The conservative trimming proportion values for the simulation studies of Clest.

[Figure 4.2: The results of Clest on datasets generated from the 8 models with µ = 2, using the conservative trimming proportion values. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model; panels (a)–(h) correspond to Models 1–8, with out=500 for Models 1 and 2.]

The results for Models 1 to 3 are almost the same as those in Figure 4.1, which makes sense because the α levels are unchanged for these models. The performance of Clest in Models 4, 5, 7 and 8 improves in that Clest returns K̂ = 3 for more generated datasets. However, the performance of Clest in Model 6 becomes worse. Since Model 6 contains 12 contaminated cases, the α level must be set to at least 60% (= 12/(N/3)) to capture all the contaminated cases in case all of them fall in the testing set. Since only 40% of the cases are then used to define the cluster centers in the testing set, Clest performs poorly. From these results, it is not recommended to use the Clest algorithm if one suspects that more than 50% of the cases in a testing set could be contaminated.
That is, one should not use Clest to select the number of clusters if one suspects that more than 17% (= 1/6, i.e. 50% of the 1/3 of the total number of cases) of the original dataset is contaminated.

Figure 4.3 shows the simulation results for the models with µ = 1 when the trimming proportions α are set as in Table 4.1. In this case, the clusters overlap more and are harder to distinguish.

[Figure 4.3: The results of Clest on datasets generated from the 8 models with µ = 1. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model; panels (a)–(h) correspond to Models 1–8, with out=500 for Models 1 and 2.]

The histograms of the estimated number of clusters for all models except Model 8 have modes at K̂ = 1. This means that none of the observed CERs for K = 2, 3, 4 or 5 is significantly smaller than the reference CER for these models. When clusters overlap more, Clest tends to underestimate the true number of clusters. If we know that there is a cluster structure, i.e., K > 1, then we should instead choose the K̂ which minimizes the test statistic d_k; that is, the significance level β should be 1.00. Figure 4.4 shows the simulation results of Clest with significance level β = 1.00 and α as in Table 4.1 on the datasets generated from the models with µ = 1.

[Figure 4.4: The results of Clest on datasets generated from the 8 models with µ = 1, with the significance level β set to 1. The histogram of the estimated number of clusters from Clest over 50 generated datasets for each model; panels (a)–(h) correspond to Models 1–8, with out=500 for Models 1 and 2.]

The performance of Clest with β = 1.00 improves substantially over its performance with β = 0.05 on these datasets: the modes for all models are now at K̂ = 3. This indicates that if we know that there is a cluster structure and the clusters overlap substantially, then we should use the significance level β = 1.

The simulations showed that Clest with robust sparse K-means could identify the number of clusters when the clusters are reasonably well separated. Since in the Clest algorithm robust sparse K-means is performed on subsets of the dataset (i.e., the learning set or the testing set), the trimming proportion α should be selected close to the potential proportion of outliers in those subsets. Due to the random partitioning of the observed or reference datasets into learning and testing sets, the proportion of outliers in a subset could be larger than in the observed dataset as a whole. Therefore, we recommend that the user of Clest choose conservative values of α. It was also necessary to know in advance that a cluster structure was present when the clusters were not well separated.

Lastly, Clest with robust sparse K-means is not suitable in the following situations:

• The user does not know whether the dataset has a cluster structure and, if it does, the clusters might overlap considerably.

• The dataset contains quite a few outliers. A rule of thumb is that if one suspects that the proportion of outliers in the dataset is greater than 17%, then one should not use Clest with robust sparse K-means.
Chapter 5

Data analysis

5.1 Before moving to the data analysis

5.1.1 The dissimilarity measure based on Pearson's correlation

We will examine a microarray dataset in this chapter. Until now, our discussion has mostly been based on the squared Euclidean dissimilarity measure. In microarray datasets, however, the most commonly used dissimilarity measure is based on Pearson's correlation across features (Hardin et al. [9]). Let x̄_i denote the average over the features of the ith case:

    x̄_i = (1/p) ∑_{j=1}^p x_ij.

Pearson's correlation coefficient across features is defined as:

    r(x_i, x_i') = (x_i − x̄_i 1)^T (x_i' − x̄_i' 1) / ( ||x_i − x̄_i 1|| ||x_i' − x̄_i' 1|| ).

Pearson's correlation can be converted to a dissimilarity d_r(x_i, x_i'), for instance, by setting:

    d_r(x_i, x_i') = 1 − r(x_i, x_i') ∈ [0, 2].     (5.1)

With this formula, cases with a high positive correlation receive a dissimilarity coefficient close to zero, whereas cases with a strongly negative correlation are considered very dissimilar. This, however, produces the odd result that a completely uncorrelated pair has d_r(x_i, x_i') = 1. For this reason, some prefer to use d_r* = 1 − |r(x_i, x_i')|, in which case variables with a strongly negative correlation are also assigned a small dissimilarity. Lance and Williams [13] compared these formulas by means of real data and concluded that (5.1) was unequivocally the best. Therefore, we will use (5.1) as Pearson's correlation dissimilarity measure. One of the reasons for its popularity is that the correlation is scale invariant: it captures the similarity in "shape" of two cases.

We now illustrate the difference between the squared Euclidean dissimilarity measure and Pearson's correlation dissimilarity measure with a simple example. Consider the two cases x*_1 = [1, 2, 3]^T and x*_2 = [4, 5, 6]^T. In order to calculate their squared Euclidean dissimilarity, we first normalize x*_1 and x*_2: x_1 = [−0.7, −0.7, −0.7]^T and x_2 = [0.7, 0.7, 0.7]^T. Their dissimilarity is then ||x_1 − x_2||² ≈ 5.9 ≠ 0. On the other hand, since x*_1 and x*_2 have the same shape, their Pearson's correlation dissimilarity is d_r(x*_1, x*_2) = 0.

Maitra and Ramler [17] described the relationship between Pearson's correlation dissimilarity measure and the squared Euclidean dissimilarity measure between standardized profiles.
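In R, Pearson's correlation between cases (rows) and the dissimilarity (5.1) can be computed directly; here X stands for the raw N × p expression matrix, and for the small example above the dissimilarity is exactly zero:

    r  <- cor(t(X))                      # N x N correlations across features
    dr <- 1 - r                          # dissimilarity (5.1), values in [0, 2]
    1 - cor(c(1, 2, 3), c(4, 5, 6))      # = 0: the two cases have the same "shape"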
Pearson's correlation dissimilarity measure is incorporated in a clustering algorithm as follows. The raw dataset X* is transformed into X̃ = [x̃_ij] ∈ R^{N×p} as:

    x̃_ij = (x*_ij − x̄*_i) / sqrt( (1/p) ∑_{j=1}^p (x*_ij − x̄*_i)² ) = √p (x*_ij − x̄*_i) / ||x*_i − x̄*_i 1||.     (5.2)

Notice that in the normalization (2.2) the columns of the data matrix are centered and then scaled, while in the transformation (5.2) the rows of the data matrix are centered and then scaled. The squared Euclidean distance between two transformed vectors x̃_i and x̃_i' is then proportional to Pearson's correlation dissimilarity between the two original vectors x*_i and x*_i':

    ||x̃_i − x̃_i'||² = p || (x*_i − x̄*_i 1)/||x*_i − x̄*_i 1|| − (x*_i' − x̄*_i' 1)/||x*_i' − x̄*_i' 1|| ||²
                    = p ( ||x*_i − x̄*_i 1||² / ||x*_i − x̄*_i 1||²
                          − 2 (x*_i − x̄*_i 1)^T (x*_i' − x̄*_i' 1) / ( ||x*_i − x̄*_i 1|| ||x*_i' − x̄*_i' 1|| )
                          + ||x*_i' − x̄*_i' 1||² / ||x*_i' − x̄*_i' 1||² )
                    = 2p (1 − r(x*_i, x*_i'))
                    = 2p d_r(x*_i, x*_i').

After the raw dataset is transformed as in (5.2), we can run clustering algorithms on the transformed dataset. This transformation can be coded simply in R as t(scale(t(data))).
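The identity above is easy to verify numerically. The sketch below implements (5.2) directly; note that t(scale(t(data))) divides by a standard deviation with denominator p − 1 rather than p, which changes all pairwise distances by the same constant factor and is therefore harmless for clustering.

    row.transform <- function(X) {
      ctr <- X - rowMeans(X)                        # centre each case (row)
      sqrt(ncol(X)) * ctr / sqrt(rowSums(ctr^2))    # scale each case to norm sqrt(p)
    }
    Xt <- row.transform(X)
    # For any two cases i and ii,
    #   sum((Xt[i, ] - Xt[ii, ])^2)  equals  2 * p * (1 - cor(X[i, ], X[ii, ])).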
5.1.2 The revised silhouette

In the data analysis of this chapter, we will assess how well each case is clustered by the various clustering methods. Intuitively, if a case is placed "deep" inside its cluster then it should be considered well clustered, and if a case lies at the edge between its cluster and its second closest cluster then it should not be considered so well clustered. Therefore, we use the following simple measure to assess how well cases are clustered. We call this measure the revised silhouette, because it is very similar to the silhouette proposed by Rousseeuw [20]. We explain the difference between the two measures, and the reason why we chose to use the revised silhouette, in Appendix A.2.

The revised silhouette compares the distances from a case to the cluster centers. Let X̃ denote the dataset transformed as in (5.2). We define a'(i) to be the squared Euclidean distance between the ith case and its own cluster center. Denote the cluster to which the ith case belongs by IC1(i). The center of that cluster has components x̄̃_{IC1(i),j}, and:

    a'(i) = ∑_{j=1}^p (x̃_ij − x̄̃_{IC1(i),j})².

We also define b'(i) to be the squared Euclidean distance between the ith case and its second closest cluster center. Denote by IC2(i) the cluster whose center is the second closest to the ith case; that center has components x̄̃_{IC2(i),j}, and:

    b'(i) = ∑_{j=1}^p (x̃_ij − x̄̃_{IC2(i),j})².

We will use the following "revised silhouette width" to assess how well the ith case is clustered:

    rev.silhouette(i) = (b'(i) − a'(i)) / b'(i).

The revised silhouette width takes values between 0 and 1. One can see that rev.silhouette(i) = 1 when a'(i) = 0, meaning that the case is placed "deep" in its cluster and is perfectly well clustered; this occurs only if the ith case itself is the cluster center. If the ith case is placed in the middle between the two clusters, then b'(i) = a'(i) and rev.silhouette(i) is 0. Note that to assess partitions from the sparse methods, weighted squared Euclidean distances are used to calculate a'(i) and b'(i), by simply weighting the jth feature component of the squared Euclidean distance by the corresponding weight w_j.
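A sketch of the revised silhouette width in R (the function and argument names are ours): X is the transformed data, cluster the vector of labels, centers the K × p matrix of cluster centers, and w the feature weights (all equal to one for the nonsparse methods).

    rev.silhouette <- function(X, cluster, centers, w = rep(1, ncol(X))) {
      # Weighted squared Euclidean distance from every case to every center.
      d2 <- apply(centers, 1, function(m) colSums(w * (t(X) - m)^2))
      sapply(seq_len(nrow(X)), function(i) {
        a <- d2[i, cluster[i]]          # distance to the case's own center
        b <- min(d2[i, -cluster[i]])    # distance to the second closest center
        (b - a) / b
      })
    }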
5.2 The dataset of van't Veer et al. [27]

In this section we analyze a microarray dataset. The data consist of 4751 gene expressions for 78 primary breast tumors and were previously analyzed by van't Veer et al. [27]. The dataset is available at: http://www.nature.com/nature/journal/v415/n6871/suppinfo/415530a.html.

The authors applied a supervised classification method and found a subset of 70 genes that identified two groups according to whether or not the patient developed distant metastasis within 5 years. However, the threshold "within 5 years" appears somewhat arbitrary. Experimentalists are also interested in the oestrogen receptor (ER) status of patients, because ER status is essential clinical information for the treatment of metastatic disease: there is a clinical consensus that if the ER status of a patient is positive then treatment tends to work better (Barnes et al. [1]). In this section we will try to identify features that can cluster patients according to the partition based on ER status. We denote the ER status partition, treated as the true biological partition, by C_ER.

Throughout our analysis, we use the transformed dataset so that the clustering algorithms based on the squared Euclidean distance effectively use Pearson's correlation dissimilarity measure. The outline of our analysis is as follows. First we select the number of clusters, K, via Clest with robust sparse K-means and with sparse K-means. Since we have no information about potential outliers or the number of clustering features, we run Clest with various sets of (α, l1) values. Then, given the estimate K' of the number of clusters, we run the four clustering algorithms: K'-means, trimmed K'-means, sparse K'-means and robust sparse K'-means. Again, various α and l1 values are examined. We compare the CERs between the partitions from the algorithms and the partition according to the ER status, C_ER. In order to facilitate the comparison between robust sparse K'-means and sparse K'-means, we pick one partition result for each method: from each sparse method we select a partition that is formed by the smallest number of features and reflects the ER status well. We use a call rate plot and the revised silhouette as the main tools for the comparison. Finally, we provide a conclusion.

5.2.1 Selection of K

We run Clest with robust sparse K-means and with sparse K-means on the dataset to choose the number of clusters. We have no information about potential outliers or the number of clustering features; therefore, we run Clest with all possible combinations of α = (0, 1/78, 5/78, 10/78) and l1 = (4, 6, 8). Clest has four parameters: B, the number of times the observed data are partitioned into a learning set and a testing set; B0, the number of times reference datasets are generated; β, the significance level; and M, the maximum number of clusters that may be present. They are set to (B, B0, β, M) = (10, 20, 0.05, 5). Since there are no missing values in the dataset, we employ the PCA reference distribution. For this analysis we use the function Clest in the package RSKC.

[Figure 5.1: The standardized measures of agreement d_k for α = 0/78, 1/78, 5/78 and 10/78 (panels (a)–(d)). Solid circles represent significant p-values. Clest chooses K = 2 as a solution for all pairs of α and l1 values.]

The circles in Figure 5.1 show the standardized measure of agreement d_k for each α, l1 and k. We indicate the d_k values corresponding to significant p-values with solid circles and those with insignificant p-values with open circles. One run of Clest returns one line in the plot. It is clear that K = 2 is chosen for all combinations of l1 and α. From this result, we fix K at 2 and run all the clustering algorithms.

5.2.2 The comparison of performances

We now run 2-means, sparse 2-means, trimmed 2-means and our proposed robust sparse 2-means, and compare the resulting partitions with the partition according to the ER status of the cases via CER. We used the function kmeans (in the package stats) and KMeansSparseCluster in the package sparcl to perform 2-means and sparse 2-means, respectively. We use our function RSKC in the package RSKC to perform trimmed 2-means and robust sparse 2-means. Since there is no information about the clustering features or the outliers, we examine the L1 tuning parameter values l1 = 1.05, 1.5, 2, 4, 6, 8, 12, 20, 30, 50 and 70 for the sparse methods, and the trimming proportions α = 1/78, 5/78 and 10/78 for the robust methods. That is, we performed 2-means, trimmed 2-means, sparse 2-means and robust sparse 2-means 1, 3, 11 and 33 (= 11 × 3) times, respectively.
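The runs above can be sketched as follows for one (l1, α) pair on the row-transformed data Xt. The argument names of KMeansSparseCluster and RSKC are given as we recall them and should be checked against the package manuals; in particular, we believe that setting L1 to NULL in RSKC yields nonsparse (trimmed) clustering.

    library(sparcl)   # KMeansSparseCluster
    library(RSKC)     # RSKC

    l1 <- 6; alpha <- 5/78
    km   <- kmeans(Xt, centers = 2, nstart = 100)            # 2-means
    skm  <- KMeansSparseCluster(Xt, K = 2, wbounds = l1)     # sparse 2-means
    rskm <- RSKC(Xt, ncl = 2, alpha = alpha, L1 = l1)        # robust sparse 2-means
    tkm  <- RSKC(Xt, ncl = 2, alpha = alpha, L1 = NULL)      # trimmed 2-means (assumed)

The CERs reported in Tables 5.1 and 5.2 below are then obtained by comparing the returned cluster labels with C_ER, for example with a CER function such as the one sketched earlier.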
    l1      CER     # nonzero weights
    1.05    0.21    3
    1.50    0.14    5
    2       0.12    8
    4       0.12    36
    6       0.10*   74
    8       0.10*   127
    12      0.10*   378
    20      0.10*   1397
    30      0.10*   4751
    50      0.10*   4751
    70      0.10*   4751

    Table 5.1: The results from sparse 2-means. The * indicates partitions that agree with each other.

              CER                      # nonzero weights
    l1 \ α    1/78    5/78    10/78    1/78    5/78    10/78
    1.05      0.23    0.19    0.19     2       2       2
    1.5       0.14    0.14    0.14     4       3       4
    2         0.12    0.12    0.14     7       6       6
    4         0.12    0.12    0.12     31      29      28
    6         0.10*   0.10*   0.10*    72      72      66
    8         0.10*   0.10*   0.10*    130     131     126
    12        0.10*   0.10*   0.10*    379     366     365
    20        0.10*   0.10*   0.10*    1397    1386    1333
    30        0.10*   0.10*   0.10*    4301    4081    3906
    50        0.10*   0.10*   0.10*    4751    4751    4751
    70        0.10*   0.10*   0.10*    4751    4751    4751

    Table 5.2: The results from robust sparse 2-means. The * indicates partitions that agree with each other.

The 2-means algorithm returns a CER of 0.12. Trimmed 2-means with α = 1/78, 5/78 and 10/78 returns CERs of 0.12, 0.10 and 0.12, respectively. As for the sparse methods, sparse 2-means and robust sparse 2-means both return CERs of 0.10 for reasonably large L1 tuning parameter values, l1 ≥ 6 (Table 5.1 and Table 5.2). On the other hand, the sparse methods return poor CERs when the L1 tuning parameter values are too small. This is probably because the feature weights are then too sparse (for example, when the L1 tuning parameter value is 4, there are only 36 features with nonzero weights for sparse 2-means). However, for all reasonably large L1 tuning parameter values, the sparse methods consistently return partitions as good as or better than those of the nonsparse methods. Interestingly, all the partitions from robust sparse 2-means (regardless of the α level) and from sparse 2-means are exactly the same for l1 ≥ 6. We denote these partitions by "*" in Table 5.1 and Table 5.2.

When the L1 tuning parameter is set to 6, sparse 2-means and robust sparse 2-means return partitions using about 70 features. The partitions resulting from larger values of the L1 tuning parameter use substantially more features. If the performance of the partition is the same, then experimenters would want to use fewer features, because 70 features are much easier to interpret than 4751. For this reason, we compare the performance of sparse 2-means and robust sparse 2-means with l1 = 6.

We also wonder whether there are any outliers in the dataset. We cannot simply treat the trimmed cases as outliers, because robust sparse K-means trims a predetermined number of cases regardless of whether outliers exist; moreover, in this way it would be impossible to find outliers via sparse K-means, simply because it does not trim cases. For these reasons, we use the following rule, based on "cut off" values, to find the potential outliers. The cut off value is calculated as follows.
Given a partition result, the median m̃ and the MAD τ̃ of all the weighted squared Euclidean distances from the cases to their cluster centers are calculated. The cut off value is then defined as:

    cut off = m̃ + threshold × τ̃,

where we choose threshold values between 1 and 5. The cases whose distances are greater than this cut off value are considered potential outliers. CERs are then recalculated excluding the potential outliers. Note that the first ⌊αN⌋ cases excluded from the CER calculation between the robust sparse 2-means partition and C_ER are the trimmed cases, because the ⌊αN⌋ trimmed cases have the ⌊αN⌋ largest weighted squared Euclidean distances to their cluster centers.

Figure 5.2 is the call rate plot for the four algorithms: sparse 2-means with l1 = 6 and robust sparse 2-means with (l1, α) = (6, 1/78), (6, 5/78) and (6, 10/78).

[Figure 5.2: The call rate plot of the 4 algorithms: sparse 2-means and robust sparse 2-means with α = 1/78, 5/78 and 10/78. The results of the sparse methods with l1 = 6 are shown.]

One can see that for large values of the threshold the CERs from the four algorithms are almost the same, which makes sense because the partitions of all four algorithms agree. As the threshold value decreases, robust sparse 2-means with moderately large α (α = 5/78 and 10/78) returns smaller CERs than sparse 2-means. This means that some of the cases excluded from the CER calculation of robust sparse 2-means are, indeed, misclassified cases. In fact, robust sparse 2-means with α = 5/78 and 10/78 trims 1 and 2 misclassified cases, respectively; the misclassified cases end up placed far from the cluster centers and are found as potential outliers. Sparse 2-means, however, fails to identify the misclassified cases as potential outliers.

Robust sparse 2-means trimmed some of the cases that are misclassified by the other algorithms. We now wonder how well the misclassified and trimmed cases are clustered in the partitions from the sparse methods. Figure 5.3 shows the revised silhouette widths of all cases. The black lines are the revised silhouette widths of the trimmed cases, and the red digits show the case indices of the misclassified cases. Robust sparse 2-means with α = 5/78 and α = 10/78 captures 1 (the 28th case) and 2 (the 28th and 73rd cases) misclassified cases, respectively. Interestingly, the revised silhouette widths of most trimmed cases are almost zero, which means that they lie at the border between the two clusters. In this example, robust sparse 2-means trims the cases whose cluster memberships are "ambiguous" and defines the cluster centers without such cases. The misclassified cases are excluded from the definition of the cluster centers, end up placed "far" from their cluster centers, and are identified as potential outliers. As a result, the partition from robust sparse 2-means without the potential outliers is more desirable than the partition from sparse 2-means without the potential outliers. Figures 5.4 and 5.5 show the weighted distances between the cases, including the trimmed cases, and their cluster centers for the results with l1 = 6.
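The cut-off rule described above translates directly into R; here d2own denotes the vector of weighted squared Euclidean distances from the cases to their own cluster centers (the quantity plotted in Figures 5.4 and 5.5), and threshold is varied between 1 and 5 to produce the call rate plot. Note that R's mad() includes the consistency constant 1.4826 by default; use constant = 1 if the raw MAD is intended.

    flag.outliers <- function(d2own, threshold) {
      cutoff <- median(d2own) + threshold * mad(d2own)
      which(d2own > cutoff)    # indices of the potential outliers
    }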
Lastly, Figures 5.6 and 5.7 show the spread of the feature weights over the 4751 features for robust sparse 2-means and sparse 2-means at selected l1 values. The x-axis shows the feature indices. The distributions are very similar among the different algorithms for each l1 level.

In summary, we draw two conclusions:

• The sparse methods perform better than the nonsparse methods at identifying the ER status of each case. The sparse methods find small subsets of about 70 features which create the two clusters of ER status, and their performance with these roughly 70 features is as good as their performance with the full set of features.

• Robust sparse 2-means trims some of the cases that are misclassified by sparse 2-means. The misclassified cases are excluded from the calculation of the cluster centers; as a result, they are placed far from their cluster centers and the call rate plot is able to identify them as potential outliers. By excluding potential outliers, robust sparse 2-means found a partition more similar to the partition based on ER status than sparse 2-means did.

[Figure 5.3: The revised silhouettes of the partitions returned by robust sparse 2-means clustering with α = 1/78, 5/78 and 10/78 (panels (a)–(c)). The L1 tuning parameter value is 6.]

[Figure 5.4: The weighted squared Euclidean distances between all cases and their cluster centers from the results of robust sparse 2-means with l1 = 6, for α = 1/78, 5/78 and 10/78 (panels (a)–(c)).]

[Figure 5.5: The weighted squared Euclidean distances between the cases and their cluster centers for the result of sparse 2-means with l1 = 6.]

[Figure 5.6: The nonzero weights from robust sparse 2-means for selected l1 values (6, 12 and 70). The three rows show the results of robust sparse 2-means with α = 1/78, 5/78 and 10/78, respectively.]

[Figure 5.7: The nonzero weights from sparse 2-means for selected l1 values (6, 12 and 70).]

Chapter 6

Conclusion

We developed a robust clustering method that is also able to perform variable selection. Our proposed method can handle datasets with missing values. We adapted Clest to our robust sparse K-means method to provide an automatic way of selecting the number of clusters. We implemented our proposed methods in the R package RSKC.

Our simulation studies show that our proposed robust sparse K-means clustering was robust against various kinds of contaminated datasets. They also show that Clest with robust sparse K-means could identify the number of clusters when the clusters were reasonably well separated; when the clusters were not well separated and the dataset was contaminated, it was necessary to know in advance that a cluster structure was present. The application of robust sparse K-means to the microarray dataset of van't Veer et al. [27] showed that our proposed method could identify the ER status of the 78 breast tumors better than the other methods. The cases which were misclassified by the other algorithms were trimmed by robust sparse 2-means; therefore, the misclassified cases ended up far from the cluster centers defined by robust sparse K-means and we could identify them as potential outliers.

As future research, an automatic robust method of selecting the L1 tuning parameter value is of interest.
Since both of the current robust clustering R packages, trimcluster and RSKC, apply Lloyd's algorithm, one might implement them with Hartigan and Wong's algorithm, which is computationally more intensive but tends to achieve a better local optimum.

Bibliography

[1] D. Barnes, R. Millis, L. Beex, S. Thorpe, and R. Leake. Increased use of immunohistochemistry for oestrogen receptor measurement in mammary carcinoma: the need for quality assurance. European Journal of Cancer, 34, 1998.

[2] H. Chipman and R. Tibshirani. Hybrid hierarchical clustering with applications to microarray data. Biostatistics, 7(2):286–301, 2005.

[3] H. A. Chipman, E. I. George, and R. E. McCulloch. Making sense of a forest of trees. In Proceedings of the 30th Symposium on the Interface, pages 84–92, 1998.

[4] J. A. Cuesta-Albertos, A. Gordaliza, and C. Matrán. Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics, 25(2):553–576, 1997.

[5] S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7), 2002.

[6] E. W. Forgy. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21:768–769, 1965.

[7] A. Gordaliza. Best approximations to random variables based on trimming procedures. Journal of Approximation Theory, 64, 1991a.

[8] A. Gordaliza. On the breakdown point of multivariate location estimators based on trimming procedures. Statistics & Probability Letters, 11, 1991b.

[9] J. Hardin, A. Mitani, L. Hicks, and B. VanKoten. A robust measure of correlation between two genes on a microarray. BMC Bioinformatics, 8:220, 2007.

[10] J. A. Hartigan. Clustering Algorithms. John Wiley and Sons, Inc., 1975.

[11] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.

[12] L. Hubert. Comparing partitions. Journal of Classification, 2:193–218, 1985.

[13] G. N. Lance and W. T. Williams. INVER: a program for the computation of distance-measures between attributes of mixed types. Australian Computer Journal, 11(1):27–28, 1979.

[14] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.

[15] D. MacKay. Information Theory, Inference, and Learning Algorithms, Chapter 20: An Example Inference Task: Clustering. Cambridge University Press, 2003.

[16] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds. L. M. LeCam and J. Neyman, 1967.

[17] R. Maitra and I. P. Ramler. A k-mean-directions algorithm for fast clustering of data on the sphere. Journal of Computational and Graphical Statistics, 19:377–396, 2010.

[18] G. Milligan and M. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179, 1985.

[19] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.

[20] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 1987.

[21] G. A. F. Seber. Multivariate Observations. Wiley, 2004.

[22] D. Sparks. Algorithm AS 58: Euclidean cluster analysis.
Applied Statistics, 22:126–130, 1973.

[23] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 4(12):801–804, 1956.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 1996.

[25] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 2000.

[26] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2000.

[27] L. J. van't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 2002.

[28] D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

Appendix A

Technical Appendix

A.1 The algorithm of Hartigan and Wong

This appendix explains the technical details of Hartigan and Wong's algorithm. To simplify the notation, define the function:

    R2(i, k) = ( n_k / (n_k + 1) ) ||x_i − x̄_k||².

Then R2(i, k2) is the left hand side of the inequality in (2.7). Let LIVE denote the live set. Randomly choose K cases to serve as initial cluster centers and perform the first and second steps of Lloyd's algorithm. Let all clusters belong to the live set: C_1, ..., C_K ∈ LIVE. Repeat steps 1, 2 and 3 until the live set becomes empty.

1. This is the optimal-transfer (OPTRA) stage. In this stage, calculations and reallocations are made for all cases relative to the clusters in the live set. For all cases i = 1, ..., N, perform:

(a) If C_{IC1(i)} ∈ LIVE:

 i. Denote k1 = IC1(i). Check the criterion (2.7), where k2 is the cluster that returns the smallest R2(i, k) among all clusters except the k1th cluster. If (2.7) holds, then allocate the ith case to C_{k2} and set IC1(i) = k2 and IC2(i) = k1. Otherwise, keep the ith case in C_{k1} and set IC1(i) = k1 and IC2(i) = k2.

 ii. Given the cluster assignments IC1(i) for all i from Step i, the cluster centers x̄_k are updated. The two clusters involved in the transfer of cases at this step are now in the live set.

(b) If C_{IC1(i)} ∉ LIVE: this step is the same as Step 1(a), except that the minimum R2(i, k) is computed only over clusters in the live set.

2. Stop if the live set is empty.

3. This is the quick-transfer (QTRAN) stage. This stage only checks the memberships of recently reallocated clusters and updates the clusters.

 i. Let k1 = IC1(i) and k2 = IC2(i) and check the criterion (2.7). If the criterion is satisfied, then switch IC1(i) and IC2(i). The clusters involved in the transfer of cases at this step are now in the live set.

 ii. Given the cluster assignments IC1(i) for all i from Step i, the cluster centers x̄_k are updated. The two clusters involved in the transfer of cases at this step are now in the live set.
4. If no reallocation took place in the quick-transfer stage, then go to the optimal-transfer stage; otherwise go to the quick-transfer stage.

A.2 The difference between Silhouette and the revised silhouette

In our data analyses, we used the revised silhouette to assess whether a case is well clustered. This measure is very similar to the well-known Silhouette proposed by Rousseeuw [20]. In this section, we explain the difference between Silhouette and the revised silhouette, and why we decided to use the revised silhouette.

Silhouette is based on a measure, called the silhouette width, which assesses how well a case is assigned. The author defined the "neighbor" of a case to be the cluster to which the case could have been assigned if the first choice cluster had not been available. The author argued that if the ith case is well clustered, the average dissimilarity between the ith case and the other cases in the same cluster, denoted a(i), should be considerably smaller than the average dissimilarity between the ith case and the cases in its neighbor, denoted b(i). The silhouette width of the ith case, silhouette(i), was defined to be the standardized difference of b(i) and a(i).

Although similar, the silhouette width introduced by Rousseeuw [20] is different from our revised silhouette width above. Intuitively, Silhouette tends to "favor" clusters with low variation when the squared Euclidean dissimilarity measure is used. We can measure the variation within the kth cluster by its contribution WSS_k to WSS:

    WSS_k = ∑_{l ∈ C_k} ||x̃_l − x̄̃_k||².

One can see that b(i) and a(i), defined above, are increasing functions of the variation within the neighbor of the ith case and of the variation within the cluster to which the ith case belongs, respectively. To see this, let dis(i, C_k) be the average dissimilarity of the ith case to all cases in C_k. Then b(i) is the minimum of dis(i, C_k) over the clusters C_k with k ≠ IC1(i). This dis(i, C_k) is not only a function of the distance from the ith case to the kth cluster center but also a function of WSS_k:

    dis(i, C_k) = (1/|C_k|) ∑_{l ∈ C_k} ||x̃_i − x̃_l||²
                = (1/|C_k|) ( ∑_{l ∈ C_k} ||x̃_i − x̄̃_k||² + 2 (x̃_i − x̄̃_k)^T ∑_{l ∈ C_k} (x̄̃_k − x̃_l) + ∑_{l ∈ C_k} ||x̄̃_k − x̃_l||² )
                = ||x̃_i − x̄̃_k||² + (1/|C_k|) WSS_k,

since ∑_{l ∈ C_k} (x̄̃_k − x̃_l) = 0. One can also show that a(i) is a function of both WSS_k^{(−i)} / (|C_k| − 1) and ||x̃_i − x̄̃_k^{(−i)}||², where WSS_k^{(−i)} and x̄̃_k^{(−i)} denote WSS_k and x̄̃_k calculated without the ith case, respectively.
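The decomposition of dis(i, C_k) is easy to confirm numerically; a small sketch with a hypothetical member matrix Xk (the rows of C_k) and a case xi outside the cluster:

    dis.identity.check <- function(xi, Xk) {
      ck   <- colMeans(Xk)                         # cluster center
      wssk <- sum(sweep(Xk, 2, ck, "-")^2)         # WSS_k
      lhs  <- mean(colSums((t(Xk) - xi)^2))        # dis(i, C_k)
      rhs  <- sum((xi - ck)^2) + wssk / nrow(Xk)   # ||x_i - center||^2 + WSS_k / |C_k|
      all.equal(lhs, rhs)
    }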
Then Silhouette suggests to change the cluster belonging of the ith case to its neighbor. This situation can be seen in the simple example in -1 0 1 2 3 -1 0 1 2 True clusters d[,1] d[ ,2 ] -1 0 1 2 3 -1 0 1 2 Clusters from K-means d[,1] d[ ,2 ] -1 0 1 2 3 -1 0 1 2 Clusters from K-means with Silhouette width d[,1] d[ ,2 ] X X X X X X Figure A.1: The example of negative Silhouette. Figure A.1. The first panel shows the generated cases from two clusters. The cases are colored by true cluster labels. Clearly one cluster has much larger variation than another. The second panel and the third panel show the same cases colored by the partition from 2-means. In the third panel, the cases with negative silhouette widths are indicated by \u00E2\u0080\u0098\u00E2\u0080\u0098X\u00E2\u0080\u0099\u00E2\u0080\u0099. The circles indicate the cluster centers from 2-means. Although the cases are clustered in the clusters with the closest cluster centers, Silhouette suggests that they should rather be placed in the neighbor. We think that it is odd to favor the cluster with less variation to measure how well cases are clustered. A direct comparison of the distance between a case to its cluster center and the distance between the case to its second closest cluster might be better to measure how well cases are clustered. For this reason, we use rev.silhouette for the data analyses in Chapter 5. 89"@en . "Thesis/Dissertation"@en . "2011-11"@en . "10.14288/1.0072190"@en . "eng"@en . "Statistics"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "Attribution-NonCommercial-NoDerivatives 4.0 International"@en . "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en . "Graduate"@en . "Robustification of the sparse K-means clustering algorithm"@en . "Text"@en . "http://hdl.handle.net/2429/37093"@en .