UBC Theses and Dissertations


Robustification of the sparse K-means clustering algorithm

Kondo, Yumi


Searching a dataset for the "natural grouping / clustering" is an important explanatory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might be concerned that the dataset contains outliers. Through simulation studies, we find that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustify sparse K-means (Witten and Tibshirani [28]), based on the idea of trimmed K-means introduced by Gordaliza [7] and Gordaliza [8]. Since high-dimensional datasets often contain quite a few missing observations, we make our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that, when datasets are contaminated, robust sparse K-means performs better than the competing algorithms in terms of both the selection of features and the selection of a partition. The analysis of a microarray dataset shows that robust sparse K-means reflects the oestrogen receptor status of the patients better than all competing algorithms. We also adapt Clest (Dudoit and Fridlyand [5]) to our robust sparse K-means to provide an automatic robust procedure for selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.
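
To illustrate the trimming idea the abstract refers to, the following is a minimal sketch of trimmed K-means in plain Python. It is not the thesis's algorithm and not the RSKC implementation: it omits the feature weights of sparse K-means and the handling of missing values, and the function name `trimmed_kmeans`, its parameters, and the use of caller-supplied starting centers are assumptions made for this example. In practice one would typically run many random starts and keep the best solution.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trimmed_kmeans(points, init_centers, alpha, max_iter=100):
    """Illustrative trimmed K-means (hypothetical helper, not RSKC).

    points       : list of equal-length tuples
    init_centers : list of k starting centers (tuples)
    alpha        : fraction of observations to trim in each iteration
    Returns (centers, labels, trimmed), where trimmed is a set of indices.
    """
    n, k = len(points), len(init_centers)
    n_trim = int(alpha * n)          # number of observations left out
    centers = list(init_centers)
    labels = [0] * n
    trimmed = set()

    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center.
        nearest = []
        for i, p in enumerate(points):
            d = [sq_dist(p, c) for c in centers]
            j = min(range(k), key=d.__getitem__)
            labels[i] = j
            nearest.append(d[j])

        # Step 2: trim the n_trim points farthest from their own center.
        order = sorted(range(n), key=nearest.__getitem__, reverse=True)
        new_trimmed = set(order[:n_trim])

        # Step 3: recompute each center from its untrimmed members only.
        new_centers = []
        for j in range(k):
            members = [points[i] for i in range(n)
                       if labels[i] == j and i not in new_trimmed]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center

        # Stop once neither the centers nor the trimmed set changes.
        if new_centers == centers and new_trimmed == trimmed:
            break
        centers, trimmed = new_centers, new_trimmed

    return centers, labels, trimmed


# Two tight groups plus one gross outlier (index 8).
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
       (100.0, 100.0)]
centers, labels, trimmed = trimmed_kmeans(
    pts, init_centers=[(0.0, 0.0), (10.0, 10.0)], alpha=0.12)
print(centers, sorted(trimmed))
```

With `alpha=0` nothing is trimmed and the same sketch reduces to ordinary K-means, in which the outlier is free to pull a center toward itself; trimming keeps it out of the center update.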



Attribution-NonCommercial-NoDerivatives 4.0 International