UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Hierarchical clustering of observations and features in high-dimensional data Zhang, Hongyang (Fred)

Abstract

In this thesis, we present new developments of hierarchical clustering in high-dimensional data. We consider different use cases of hierarchical clustering, namely, clustering observations for exploratory analysis and clustering high-dimensional features for adaptive feature grouping and ensembling. We first focus on the clustering of observations. In high-dimensional data, the existence of potential noise features and outliers poses unique challenges to the existing hierarchical clustering techniques. We propose the Robust Sparse Hierarchical Clustering (RSHC) and the Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that via robust feature selection techniques, both RSHC and MrSHC can handle the potential existence of noise features and outliers in high-dimensional data and result in better clustering accuracy and interpretation comparing to the existing hierarchical clustering methods. We then consider clustering of features in high-dimensional data. We propose a new hierarchical clustering technique to adaptively divide the large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Then models built on different Regression Phalanxes are considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method like Lasso or Random Forests.

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International