Hierarchical clustering of observations and features in high-dimensional data

UBC Theses and Dissertations

Zhang, Hongyang (Fred)

Abstract

In this thesis, we present new developments in hierarchical clustering for high-dimensional data. We consider two use cases of hierarchical clustering: clustering observations for exploratory analysis, and clustering high-dimensional features for adaptive feature grouping and ensembling. We first focus on the clustering of observations. In high-dimensional data, the potential presence of noise features and outliers poses unique challenges to existing hierarchical clustering techniques. We propose Robust Sparse Hierarchical Clustering (RSHC) and Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that, via robust feature selection techniques, both RSHC and MrSHC can handle noise features and outliers in high-dimensional data, and that they yield better clustering accuracy and interpretation than existing hierarchical clustering methods. We then consider the clustering of features in high-dimensional data. We propose a new hierarchical clustering technique that adaptively divides a large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Models built on different Regression Phalanxes are then considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method such as Lasso or Random Forests.
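
The general idea of grouping features by hierarchical clustering can be illustrated with a small generic sketch. This is not the Regression Phalanx algorithm from the thesis: the abstract indicates that features are grouped by how well they work together as predictors in a regression model, whereas the sketch below assumes a much simpler stand-in, average-linkage agglomerative clustering with 1 - |Pearson correlation| as the dissimilarity between features. The function names (`pearson`, `cluster_features`) and the simulated data are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_features(columns, n_groups):
    """Average-linkage agglomerative clustering of feature columns.

    ASSUMPTION: dissimilarity is 1 - |correlation|, an illustrative
    stand-in, not the grouping criterion used in the thesis.
    Returns a list of groups, each a sorted list of feature indices.
    """
    p = len(columns)
    dist = [
        [0.0 if i == j else 1.0 - abs(pearson(columns[i], columns[j]))
         for j in range(p)]
        for i in range(p)
    ]
    groups = [[j] for j in range(p)]  # start with one group per feature
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Average linkage: mean pairwise dissimilarity between groups.
                total = sum(dist[i][j] for i in groups[a] for j in groups[b])
                d = total / (len(groups[a]) * len(groups[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]  # merge the closest pair
        del groups[b]
    return [sorted(g) for g in groups]


# Simulated data (hypothetical): two independent latent signals,
# each observed through three noisy feature columns.
random.seed(0)
n = 200
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
columns = [[z + 0.1 * random.gauss(0, 1) for z in z1] for _ in range(3)]
columns += [[z + 0.1 * random.gauss(0, 1) for z in z2] for _ in range(3)]

groups = cluster_features(columns, n_groups=2)
print(groups)
```

In this toy setting the features derived from the same latent signal should end up in the same group. In an ensembling workflow, a separate model would then be fitted on each group and the predictions combined; that step is omitted here because the abstract does not specify how the thesis performs it.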

Item Metadata

Title	Hierarchical clustering of observations and features in high-dimensional data
Creator	Zhang, Hongyang (Fred)
Publisher	University of British Columbia
Date Issued	2017
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2017-08-16
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0354395
URI	http://hdl.handle.net/2429/62648
Degree	Doctor of Philosophy - PhD
Program	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2017-09
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
