Hierarchical Clustering of Observations and Features inHigh-dimensional DatabyHongyang (Fred) ZhangM.Sc., The University of British Columbia, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)August 2017c© Hongyang (Fred) Zhang, 2017AbstractIn this thesis, we present new developments of hierarchical clustering in high-dimensional data. We consider different use cases of hierarchical clustering, namely,clustering observations for exploratory analysis and clustering high-dimensionalfeatures for adaptive feature grouping and ensembling.We first focus on the clustering of observations. In high-dimensional data,the existence of potential noise features and outliers poses unique challenges tothe existing hierarchical clustering techniques. We propose the Robust Sparse Hi-erarchical Clustering (RSHC) and the Multi-rank Sparse Hierarchical Clustering(MrSHC) to address these challenges. We show that via robust feature selectiontechniques, both RSHC and MrSHC can handle the potential existence of noisefeatures and outliers in high-dimensional data and result in better clustering accu-racy and interpretation comparing to the existing hierarchical clustering methods.We then consider clustering of features in high-dimensional data. We proposea new hierarchical clustering technique to adaptively divide the large number offeatures into subgroups called Regression Phalanxes. Features in the same Regres-sion Phalanx work well together as predictors in a pre-defined regression model.Then models built on different Regression Phalanxes are considered for further en-sembling. We show that the ensemble of Regression Phalanxes resulting from thehierarchical clustering produces further gains in prediction accuracy when appliedto an effective method like Lasso or Random Forests.iiLay SummaryThere has been a surge in the number of high-dimensional data sets – data setscontaining a large number of features and a relatively small number of observations– due to the rapidly growing ability to collect and store information. For example,in microarray studies, expression levels of thousands of genes are measured on onlya few samples; X-ray experiments are typically conducted on a limited numberof samples but the spectra over thousands of wavelengths are measured on eachsample. Mining such high-dimensional or large-flat data is an urgent problem ofgreat practical importance. In this thesis, we concentrate on the applications ofhierarchical clustering, a widely used data mining technique, on high-dimensionaldata. We introduce several novel techniques that successfully extract and combinekey information from the usually noisy high-dimensional data using hierarchicalclustering. The proposed methods have well-established performance and broadapplications in exploratory analysis and prediction.iiiPrefaceThis thesis is written up under the supervision of my supervisor Prof. Ruben H.Zamar. During our weekly research meetings, we discussed thoroughly regardingthe research topics, proposed methods and the related analyses. For all the Chaptersof this thesis, I conducted all the the computations, data analyses, first-drafts andfollow-up revisions. Prof. Zamar and Prof. William Welch spent a huge amount oftime checking the results and proof-reading.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . 21.1.1 Dissimilarity Measures . . . . . . . . . . . . . . . . . . . 31.1.2 Linkage Criteria . . . . . . . . . . . . . . . . . . . . . . 41.1.3 Advantages and Disadvantages . . . . . . . . . . . . . . . 51.2 Hierarchical Clustering of Observations in High-dimensional Data 61.3 Hierarchical Clustering of Features in High-dimensional Data . . 111.3.1 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.3.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . 131.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 16v2 Robust Sparse Hierarchical Clustering . . . . . . . . . . . . . . . . 182.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2 Effect of Outliers on Sparse Hierarchical Clustering . . . . . . . . 192.3 Robust Sparse Hierarchical Clustering . . . . . . . . . . . . . . . 212.3.1 Robust Sparse Principal Component Based on τ-estimators 212.3.2 Robust Sparse Hierarchical Clustering . . . . . . . . . . . 222.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 262.5 Application to Microarray Data Sets . . . . . . . . . . . . . . . . 312.5.1 Breast Cancer Data Set in Perou et al. [24] . . . . . . . . 312.5.2 Breast Cancer Data Set in van’t Veer et al. [35] . . . . . . 342.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 Multi-rank Sparse Hierarchical Clustering . . . . . . . . . . . . . . 373.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.2 Limitations of Witten and Tibshirani [38]’s Sparse HierarchicalClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.3 Multi-rank Sparse Hierarchical Clustering . . . . . . . . . . . . . 393.3.1 Case 1: Known q and r . . . . . . . . . . . . . . . . . . . 403.3.2 Case 2: Known q, Unknown r . . . . . . . . . . . . . . . 413.3.3 Case 3: Unknown q and r . . . . . . . . . . . . . . . . . 443.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.1 Simulation I . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.2 Simulation II . . . . . . . . . . . . . . . . . . . . . . . . 473.4.3 Computational Times and Complexity . . . . . . . . . . . 483.5 Application to Microarray Data Sets . . . . . . . . . . . . . . . . 493.5.1 Lymphoma Data Set in Dettling [12] . . . . . . . . . . . . 493.5.2 Breast Cancer Data Set in Perou et al. [24] . . . . . . . . 503.5.3 Breast Cancer Data Set in van’t Veer et al. [35] . . . . . . 513.6 Robustification of MrSHC . . . . . . . . . . . . . . . . . . . . . 533.6.1 Lymphoma Data Set in Dettling [12] . . . . . . . . . . . . 553.6.2 Breast Cancer Data Set in Perou et al. [24] . . . . . . . . 563.6.3 Breast Cancer Data Set in van’t Veer et al. [35] . . . . . . 573.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58vi4 Regression Phalanxes . . . . . . . . . . . . . . . . . . . . . . . . . . 604.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.2 Phalanx Formation Algorithm . . . . . . . . . . . . . . . . . . . 614.2.1 Initial Grouping . . . . . . . . . . . . . . . . . . . . . . . 624.2.2 Screening of Initial Groups . . . . . . . . . . . . . . . . . 634.2.3 Hierarchical Formation of Candidate Phalanxes . . . . . . 664.2.4 Screening of Candidate Phalanxes . . . . . . . . . . . . . 664.2.5 Ensembling of Regression Phalanxes . . . . . . . . . . . 674.3 A Simple Illustrative Example: Octane . . . . . . . . . . . . . . . 674.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 684.4.1 Lasso as Base Regression Model . . . . . . . . . . . . . . 694.4.2 Random Forests as Base Regression Model . . . . . . . . 704.5 Additional Real Data Examples . . . . . . . . . . . . . . . . . . . 714.5.1 AID364 Data Set . . . . . . . . . . . . . . . . . . . . . . 724.5.2 CLM Data Set . . . . . . . . . . . . . . . . . . . . . . . 734.5.3 Gene Expression Data Set . . . . . . . . . . . . . . . . . 754.5.4 Glass Data Set . . . . . . . . . . . . . . . . . . . . . . . 764.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 785.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.2.1 Clustering Observations . . . . . . . . . . . . . . . . . . 805.2.2 Regression Phalanxes . . . . . . . . . . . . . . . . . . . . 81Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 86viiList of TablesTable 2.1 Average CER for different clustering methods. The maximumSE of CER in this table is 0.01, while the maximum SE of CERfor FRSHC methods is 0.003 . . . . . . . . . . . . . . . . . . 29Table 2.2 Average RR for different clustering methods. RR for AW andIW methods are the same due to the use of indicator functionfor non-zero weights. The maximum SE of RR in this table is0.017, while the maximum SE of RR for FRSHC methods is0.008. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Table 2.3 Average CPU times in seconds of a 2.8 GHz Intel Xeon fordifferent clustering methods . . . . . . . . . . . . . . . . . . . 30Table 3.1 Different cases for MrSHC; q is the target number of selectedfeatures and r is the rank of the SPC approximation . . . . . . 40Table 3.2 Average CER, RR, and q for different clustering methods. . . . 47Table 3.3 Average CER, RR, and q for different clustering methods. . . . 48Table 3.4 Average computing times (in seconds) for different clusteringmethods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Table 3.5 Average CER, RR, and q for different clustering methods. . . . 54Table 3.6 Average CER, RR, and q for different clustering methods. . . . 55Table 4.1 Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for theoctane data set. . . . . . . . . . . . . . . . . . . . . . . . . . 68viiiTable 4.2 Number of variables, initial groups, screened groups, candi-date phalanxes, screened phalanxes and prediction accuraciesof base regression models and ERPX for different noise levelswhen calculating sample covariance matrix Σ. . . . . . . . . . 70Table 4.3 Number of variables, initial groups, screened groups, candi-date phalanxes, screened phalanxes and prediction accuraciesof base regression models and ERPX for different noise levelswhen calculating sample covariance matrix Σ. . . . . . . . . . 71Table 4.4 Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for theAID 364 assay and five descriptor sets. Three runs of ERPX arepresented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73Table 4.5 Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for theAID 364 assay and four descriptor sets, with initial groups ob-tained from the hierarchical clustering. Three runs of ERPX arepresented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74Table 4.6 Number of variables, initial groups, screened groups, candi-date phalanxes, screened phalanxes and prediction accuraciesfor TOTVEGC and LAI from CLM-CN simulations. Three ex-periments based on random splitting of training and testing setsare presents. . . . . . . . . . . . . . . . . . . . . . . . . . . . 75Table 4.7 Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for thegene expression data set. . . . . . . . . . . . . . . . . . . . . 76Table 4.8 Number of variables, base regression model, initial groups, screenedgroups, candidate phalanxes, screened phalanxes and predictionaccuracies (OOB-MSE for RF as base regression model, 5-foldCV-MSE for Lasso as base regression model) for four responsesof the glass data set. . . . . . . . . . . . . . . . . . . . . . . . 77Table A.1 MSEs of Lasso (5-fold CV-MSEs) and RF (OOB-MSEs) forfive descriptor sets of the AID 364 assay . . . . . . . . . . . . 90ixTable A.2 MSEs of Lasso (CV-MSEs) and RF (OOB-MSEs) for two re-sponses of the CLM data . . . . . . . . . . . . . . . . . . . . 90xList of FiguresFigure 1.1 Dendrograms with squared Euclidean distance and trimmedabsolute distance (10% largest distances are trimmed) on a dataset with a outlier. Colors are used to indicate the true underly-ing cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Figure 2.1 Dendrograms (with Ward’s linkage) from HC and SHC appliedto data sets without and with outliers. Colors are used to indi-cate the true underlying cluster. . . . . . . . . . . . . . . . . 20Figure 2.2 An illustration of choosing tuning parameter λ based on sec-ond order differences of Gap(λi), i = 1, . . . ,k . . . . . . . . . 25Figure 2.3 A graphical illustration of the hierarchical structure of the sim-ulated clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 2.4 Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 496 clustering features for the clean Perou et al. [24]data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 2.5 Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 496 clustering features for the contaminated Perouet al. [24] data . . . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 2.6 Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 70 clustering features for van’t Veer et al. [35]’s data.The mild outlier is pointed by the arrow. . . . . . . . . . . . . 36xiFigure 3.1 Examples of simple and complex structures within two impor-tant features. In simple structure, the variance in important fea-tures can be mostly projected to a single dimension. The vari-ance of the complex structure can be discovered if projected tomore than one dimension. . . . . . . . . . . . . . . . . . . . 38Figure 3.2 Dendrogram generated with the first four chosen features {V13,V11,V6,V1} from Witten and Tibshirani’s sparse hierarchical clusteringframework . . . . . . . . . . . . . . . . . . . . . . . . . . . 39Figure 3.3 Dendrogram generated from MrSHC with known q . . . . . . 44Figure 3.4 Dendrogram generated from MrSHC with unknown q . . . . . 45Figure 3.5 Dendrograms generated by HC, SHC, MrSHC and HC-HMVfor Dettling [12] data (n = 62, p = 4026) . . . . . . . . . . . 50Figure 3.6 Dendrograms generated by MrSHC (rank 1) and SHC with q=140 for Dettling [12] data (n = 62, p = 4026) . . . . . . . . . 51Figure 3.7 Dendrograms generated by HC, SHC, MrSHC and HC-HMVfor Perou et al. [24] data (n = 62, p = 1753) . . . . . . . . . . 52Figure 3.8 Dendrograms generated by HC, SHC, MrSHC and HC-HMVfor van’t Veer et al. [35]’s data. (n = 77, p = 4751) . . . . . . 53Figure 3.9 Dendrograms generated by HC, MrSHC, Robust-MrSHC andSHC for contaminated Dettling [12] data (n = 62, p = 4026) . 56Figure 3.10 Dendrograms generated by HC, MrSHC, Robust-MrSHC andSHC for contaminated Perou et al. [24] data (n = 62, p = 1753) 58Figure 3.11 Dendrograms generated by Robust-MrSHC for van’t Veer et al.[35]’s data. (n = 77, p = 4751) . . . . . . . . . . . . . . . . . 59Figure 4.1 A sketch of the phalanx-formation procedure. D variables arepartitioned into d initial groups, screened down to s groups,combined into e candidate phalanxes, and then screened downto h phalanxes in the final ensemble (D≥ d ≥ s≥ e≥ h). . . . 62Figure 4.2 Boxplots of OOB MSE (from 5000 randomly selected train-ing samples) and test errors (from 4983 corresponding testingsamples) obtained from 20 random splits of the data into train-ing and testing sets . . . . . . . . . . . . . . . . . . . . . . . 75xiiAcknowledgmentsI would like to express my special appreciation and thanks to my advisor ProfessorRuben H. Zamar, you have been a tremendous mentor and friend for me. I wouldlike to thank you for encouraging my research and for allowing me to grow as adata scientist. Your advice on both research as well as on my career have beenpriceless.I would also like to thank my committee members, professor William Welchand professor Matı´as Salibia´n-Barrera for serving as my committee members andfor your brilliant comments and suggestions.A special thanks to my mother Lirong Zhao, my father Ming Zhang and mygrandma Xiuzhi Tian for your unconditional love and support. At the end I wouldlike to express appreciation to my beloved wife Xin Zhao who has always beenthere for me through my ups and downs. Words cannot express how grateful I amto all of you.xiiiChapter 1IntroductionThere has been a surge in the number of high-dimensional data sets – data sets con-taining a large number of features and a relatively small number of observations –due to the rapidly growing ability to collect and store information in medical re-search and other fields. For example, in microarray studies, expression levels ofthousands of genes (features) are measured on only a few samples; X-ray exper-iments are typically conducted on a limited number of samples (e.g. less than ahundred) but the spectra over thousands of wavelengths (features) are measured oneach sample. Mining such high-dimensional or large-flat data is an urgent problemof great practical importance.Clustering is an important data mining technique widely used in various fields.We mainly focus on the hierarchical clustering, one of the most widely used clus-tering algorithms with a lot of unique advantages for exploratory data analysis. Inthis thesis, we use hierarchical clustering for the following two tasks.• Exploratory analysis – Hierarchical clustering of observations in high-dimensionaldata.Hierarchical clustering is a well established method for exploratory dataanalysis. However, in high-dimensional settings, the performance of tra-ditional hierarchical clustering algorithms can be distorted due to the possi-ble existence of noise features and data contaminations. Surprisingly, thereexist only limited attempts to improve the performance of hierarchical clus-1tering in high-dimensional settings. In Chapter 2 and Chapter 3, we pro-pose two novel hierarchical clustering frameworks which yield better perfor-mance than the existing hierarchical clustering methods in high-dimensionaldata.• Regression – Hierarchical clustering of features in high-dimensional data.Ensemble methods prove to be useful in various regression tasks by produc-ing superior prediction accuracies. In Chapter 4, we propose a novel ensem-ble method via hierarchical clustering of features. The goal is to discoversubsets of features so that the features in each subset work well together aspredictors in a pre-specified regression procedure. By defining a proper dis-similarity measure of the features, the hierarchical clustering is able to groupthe features together according to their combined predictive power. Then wepropose to ensemble all the regression models built on different subsets offeatures to build a final ensemble model.The rest of the chapter is organized as follows. In Section 1.1, we brieflydescribe the hierarchical clustering. Section 1.2 presents an introduction of hi-erarchical clustering of observations and existing proposals for high-dimensionaldata. Section 1.3 presents an introduction of hierarchical clustering of features andbriefly describes some of the regression methods we use in the following chapters.Finally, Section 1.4 presents the organization of the remainder of the thesis.1.1 Hierarchical ClusteringIn data mining and statistics, hierarchical clustering is a method of clustering anal-ysis which aims to categorize observations into a hierarchical set of groups. Thereare two general strategies for hierarchical clustering:• Agglomerative or “bottom up” approach.In the agglomerative approach, each observation forms its own cluster at thebeginning of the process. Then the pairs of clusters are iteratively merged toform clusters of higher hierarchy.• Divisive or “top down” approach.2In the divisive approach, a single cluster containing all the observations getsinitialized, which further splits into sub-clusters. Splits are performed recur-sively to generate the hierarchy of clusters.Unless stated otherwise, we take agglomerative approach as the default ap-proach for hierarchical clustering throughout this thesis. This is not only becausethe agglomerative approach is usually considered as the default approach for hier-archical clustering but also because it is less computational burdensome comparingto the divisive approach. The merges in hierarchical clustering are performed in agreedy manner. In each iteration, only one pair of clusters are merged. In orderto choose the pair of clusters being merged, a measure of dissimilarity betweenclusters of observations needs to be specified. This is usually achieved by makingappropriate choices of the dissimilarity measure (a measure of distance betweenpairs of observations), and the linkage criterion which defines the dissimilarity ofclusters as a function of the pairwise distances of observations (defined by the cho-sen dissimilarity measure) in the clusters. The results of hierarchical clustering areusually presented in a tree structure called dendrogram. In the following sections,we present some common choices of both the dissimilarity measures and the link-age criteria, some sample dendrograms and a brief discussion on the advantagesand disadvantages of the hierarchical clustering methods.1.1.1 Dissimilarity MeasuresSuppose we have a pair of observations represented as p-dimensional vectors x =(x1,x2, . . . ,xp)T and y = (y1,y2, . . . ,yp), a selection of commonly used dissimilaritymeasures as follows.• Euclidean distance: ||x− y||2 =√∑i(xi− yi)2.• Squared Euclidean distance: ||x− y||22 = ∑i(xi− yi)2.• Manhattan distance: ||x− y||1 = ∑i |xi− yi|.• Maximum distance: ||x− y||∞ = maxi |xi− yi|.• Mahalanobis distance:√(x− y)T S−1(x− y), where S is the covariance ma-trix of the data set.3It is apparent that different choices of the dissimilarity measure will affect,sometimes significantly, on the resulting distance of the observations. A pair ofobservations may be considered close by using certain dissimilarity measures butfarther away by others. For example, suppose there are two cluster centres (2,2,2)and (3,0,0). Then (0,0,0) is considered to be closer to (3,0,0) with squared Eu-clidean distance. However, with maximum distance, it is considered to be closer to(2,2,2). Therefore, the choice of the dissimilarity measure can consequently affectthe clustering results, especially the shape of the resulting clusters.1.1.2 Linkage CriteriaSuppose we have two clusters X and Y each containing several observations andthe chosen dissimilarity measure is d(x,y). Some commonly choices of linkagecriteria and their corresponding calculations of distance between X and Y are pre-sented as follows.• Complete linkage: max{d(x,y) : x ∈ X ,y ∈ Y}.• Single linkage: min{d(x,y) : x ∈ X ,y ∈ Y}.• Average linkage: 1|X ||Y | ∑x∈X ∑y∈Y d(x,y).• Ward’s linkage: the distance is measured by the decrease in variance forthe clusters being merged. At each merging step, a pair of clusters A ={a1, . . . ,anA} and B = {b1, . . . ,bnB} are merged if the following criterion isminimized.nAnBnA+nB(2nAnBnA∑i=1nB∑j=1d(ai,b j)− 1n2AnA∑i=1nA∑j=1d(ai,a j)− 1n2BnB∑i=1nB∑j=1d(bi,b j))If squared Euclidean distance is used for d(·, ·), the above criterion is equiv-alent to Ward’s minimum variance criterion [30]. Other distance measuressuch as Manhattan or Mahalanobis distance are also used as d(·, ·) in practice[2].The linkage criterion defines the dissimilarity of clusters as a function of the pair-wise distances of observations (defined by the chosen dissimilarity measure) in the4clusters. Given a chosen dissimilarity measure, difference choices of the linkagecriterion will yield sometimes significantly different clusters.1.1.3 Advantages and DisadvantagesHierarchical clustering has a broad range of applications such as microarray dataanalysis, digital imaging, stock prediction, text mining, etc. This is because it hasa unique set of advantages over some other clustering techniques such as K-means,model-based clustering, etc. Some of the advantages are listed as follows.• No need to specify the number of clusters.Clustering techniques such as K-means require to specify the target numberof clusters in advance. However, the target number of clusters is usuallyunknown. On the contrary, hierarchical clustering results in a hierarchy ofclusters visualized as a dendrogram, so there is no need to specify the num-ber of clusters in advance. After visualizing the dendrogram, one can thenget the desired number of clusters by cutting the dendrogram accordingly.Moreover, dendrograms will reveal any intrinsic nested cluster structures inthe data that are otherwise hard to discover.• Flexibility– Flexibility on the choice of linkage criteria.There are numerous choices of linkage criteria available, which adaptsto different types of data and clustering goals.– Versatility due to numerous combinations of dissimilarity measuresand linkage criteria.Due to the flexibility on both the choice of dissimilarity measures andthe choice of linkage criteria, hierarchical clustering is very versatile indiscovering clusters of different shapes and it is usually possible to finda certain combination of dissimilarity measure and linkage criteria thatsatisfy your need. In practice, the combination of dissimilarity measureand linkage criteria can be chosen based on the perceived usefulness ofthe resulting dendrograms.5A known disadvantage of agglomerative hierarchical clustering is the high compu-tational cost. The complexity of the agglomerative approach is O(n2 log(n)) with ndenoting the number of observations, which does not scale well for a large numberof observations. However, in our applications, either the data set contains a lim-ited number of observations (high-dimensional flat data sets) or we conduct smartfiltering to control the computational time of hierarchical clustering.1.2 Hierarchical Clustering of Observations inHigh-dimensional DataClassical hierarchical clustering, likewise most of the traditional clustering tech-niques, is challenged by high-dimensional data sets. Although a large number offeatures may allow for a better coverage of clustering features (i.e. features thathelp explain the true underlying clusters), a large proportion of the features maycontain no information about the cluster structures. That is, clusters can usuallybe well separated from a relatively small number of features, but not if all thefeatures are used to compute the distance between objects due to the perturbationintroduced by noise or non-informative features. Furthermore, interpretability canbe impeded when the clustering procedure uses an unnecessarily large number ofvariables.Therefore, feature selection becomes a key feature of clustering methods whenapplying them to high-dimensional data sets. The goal of feature selection is tokeep only the important features for clustering while discarding the noise features.Clustering methods with feature selection capabilities are often referred to as sparseclustering methods. In comparison with the classical clustering methods where allthe features are used, sparse clustering has the following advantages:• If the clusters differ only with respect to a small number of features, thensparse clustering is likely to identify the clusters more accurately than non-sparse clustering methods.• The results of sparse clustering are more interpretable because the clustersare determined by a smaller number of features.Multiple feature selection approaches have been proposed for clustering meth-6ods such as K-means (e.g. Witten and Tibshirani [38], Sun et al. [31]) and model-based clustering (e.g. Raftery and Dean [25], Pan and Shen [23], Wang and Zhu[37], Xie et al. [40]). However, there has been less research for hierarchical clus-tering. A brief survey of such proposals is given below.Let X be an n× p data matrix, with n observations and p features. Let di,i′ =d(xi,xi′) be a measure of dissimilarity between observations xi and xi′ (1≤ i, i′≤ n),which are the rows i and i′ of the data matrix X . We will assume that d is addi-tive in the features: d(xi,xi′) = di,i′ = ∑pj=1 di,i′, j, where di,i′, j indicates the dis-similarity between observations i and i′ along feature j. Unless specified other-wise, our examples and simulations take d equal to the squared Euclidean distance,di,i′, j = (Xi j−Xi′ j)2. However, other dissimilarity measures are possible, such asthe absolute difference di,i′, j = |Xi j −Xi′ j|. A nice property of hierarchical clus-tering is that the potential outliers are clearly exposed in the dendrogram. Theabsolute difference is more robust than the squared Euclidean distance in the pres-ence of outliers, however, a robust dissimilarity measure such as trimmed absolutedistance may not be desirable in the context of hierarchical clustering. For exam-ple, we generate a sample data set containing 60 observations and 20 features. Theobservations are well separated into 3 clusters of equal size. A value in the data setis replaced with a large outlier. As we can see from Figure 1.1, the contaminatedobservation is exposed in its standalone cluster with squared Euclidean distance.On the contrary, it is missed from the dendrogram under the trimmed absolute dis-tance since the robust measure down-weights or ignores its distances from the restof the samples.Friedman and Meulman [17] proposed clustering objects on subsets of at-tributes (COSA). COSA employs a criterion, related to a weighted version of K-means clustering, to automatically detect subgroups of objects that preferentiallycluster on subsets of the attribute variables rather than on all of them simultane-ously. An extension of COSA for hierarchical clustering was also proposed. Thealgorithm is quite complex and requires multiple tuning parameters. Moreover, asnoted by Witten and Tibshirani [38], this proposal does not truly result in a sparseclustering because all the variables have nonzero weights.Witten and Tibshirani [38] proposed a new framework for sparse clustering that7Figure 1.1: Dendrograms with squared Euclidean distance and trimmed ab-solute distance (10% largest distances are trimmed) on a data set with aoutlier. Colors are used to indicate the true underlying cluster.can be applied to procedures that optimize a criterion of the formmaxΘ∈G{p∑j=1f j(X j,Θ)}, (1.1)where X j = (X1 j,X2 j, ...,Xn j)T ∈ Rn denotes the observed j-th feature.Each f j(X j,Θ) is a function that solely depends on the j-th feature and Θ is aset of unknown parameters taking values on G. For example, in K-means, it can beshown that for a fixed number K of clusters,f j(X j,Θ) =1nn∑i=1n∑i′=1di,i′, j−K∑k=11nk∑i,i′∈Ckdi,i′, j,where nk is the number of observations in cluster k, Ck is the set of indices ofobservations in cluster k and di,i′, j is the squared Euclidean distance between ob-servation i and i′ on the j-th feature. In this chapter, we only consider squaredEuclidean measure for di,i′, j.The optimizing criterion in (1.1) is not sparse because the criterion involves allthe features. To introduce sparsity Witten and Tibshirani [38] modified criterion8(1.1) as follows:maxw,Θ∈G{p∑j=1w j f j(X j,Θ)}subject to ||w||22 ≤ 1, ||w||1 ≤ sand w j ≥ 0. (1.2)Here w = (w1,w2, . . . ,wp) is a vector of weights for each feature, ||w||22 is squaredL2-norm on w, and ||w||1 is L1-norm on w. A feature with zero-weight is clearlynot used in the criterion.Hierarchical clustering does not optimize a criterion like (1.1) and, therefore,does not directly fit into Witten and Tibshirani [38] sparse clustering framework(1.2). To overcome this difficulty they casted the dissimilarity matrix{di,i′}n×n asthe solution of an optimization problem as follows:maxU∈Rn×n{p∑j=1n∑i,i′=1di,i′, jUi,i′}subject ton∑i,i′=1U2i,i′ ≤ 1. (1.3)It can be shown that the solution Ûi,i′ to (1.3) is proportional to the dissimilaritymatrix, that is, Ûi,i′ ∝ di,i′ . The criterion in (1.3) is a special case of (1.1) when welet f j(X j,Θ) =∑ni,i′=1 di,i′, jUi,i′ . Now sparse hierarchical clustering can be achievedby obtaining a sparse dissimilarity matrix. Now the sparse hierarchical clusteringcriterion (note that this is a preprocessing step of the hierarchical clustering) canbe defined as follows:maxw,U∈Rn×n{p∑j=1w jn∑i,i′=1di,i′, jUi,i′}subject ton∑i,i′=1U2i,i′ ≤ 1, ||w||22 ≤ 1, ||w||1 ≤ s.(1.4)The constraint w j ≥ 0 has been removed because di,i′, j ≥ 0 for all 1≤ i, i′ ≤ n and1≤ j ≤ p. The solution to (1.4) can be obtained using sparse principal component(SPC) proposed in Witten et al. [39] as follows: Let u be a vector of length n2that contains all elements in (Ui,i′)n×n and D be a n2× p matrix whose j-th columncontains the n2 elements of{di,i′, j}n×n – the dissimilarity matrix calculated fromthe j-th feature alone. Now the criterion in (4) is equivalent to the following:maxw,u{uT Dw}subject to ||u||22 ≤ 1, ||w||22 ≤ 1, ||w||1 ≤ s. (1.5)9This reduces to applying SPC on the transformed dissimilarity matrix, D. It canalso be shown that the solution to (1.5) satisfies: ûT ∝ Dŵ. As ŵ is sparse, sois û. Thus, by re-arranging the elements in û into a n× n dissimilarity matrixÛ , we obtain a sparse dissimilarity matrix which only contains the informationfrom a subset of selected features. Finally, sparse hierarchical clustering can beobtained by applying classical hierarchical clustering on the sparse dissimilaritymatrix Û. Witten and Tibshirani [38] showed, using a simulated dataset and agenomic dataset, that their proposed sparse hierarchical clustering results in moreaccurate identification of the underlying clusters and more interpretable results thanstandard hierarchical clustering and COSA when applied on datasets with noisefeatures.Witten and Tibshirani [38] demonstrated on a simulated and a real data set thatSHC results in accurate identification of the underlying clusters. Moreover, SHCgives far more interpretable results than classical hierarchical clustering when thedata set contains noise features. However, there are some limitations of the SHCframework that jeopardize its performance in real applications. We present theselimitations below.1. Unable to discover features with complex structures.The structures of features in high-dimensional data are usually complex. Weshow that when features contain complex structures, SHC has its limitationin discovering the important features from the noise.2. Vulnerable to data contaminations.Given the large number of features in high-dimensional data sets, the pres-ence of data contaminations becomes more likely. The SHC framework issensitive to the outliers and its clustering performance deteriorates severelyunder data contamination.3. Non-scalable to even medium-sized data. SHC operates on the transformeddissimilarity matrix D (a n2× p matrix). Therefore, it won’t scale to medium-sized data sets with relatively big n.In Chapter 2 and Chapter 3, we propose two novel sparse hierarchical cluster-10ing methods to remedy the three limitations of the existing SHC method simulta-neouly.1.3 Hierarchical Clustering of Features inHigh-dimensional DataEnsemble methods prove to be useful in various regression tasks by producing su-perior prediction accuracies. In Chapter 4, we propose a novel ensemble methodvia hierarchical clustering of features. The goal is to discover subsets of features sothat the features in each subset work well together as predictors in a pre-specifiedregression procedure called “base regression model”. By defining a proper dis-similarity measure of the clusters of features, the hierarchical clustering is able togroup the features together according to their combined predictive power. Then wepropose to ensemble all the regression models built on different subsets of featuresto build a final ensemble model.This ensemble approach is very different from the existing ensemble approaches.It separates the feature space into distinct clusters on which different base regres-sion models are trained for further ensemble. This approach is very flexible sincein principle, any regression methods, even the existing ensemble methods, can beused as base regression models. We briefly review the candidate base regressionmodels we considered in this thesis, namely Lasso and Random Forests.1.3.1 LassoLasso (least absolute shrinkage and selection operator), proposed by Tibshirani[32], is a regularized linear regression method method that performs feature se-lection in linear models via regularization. It constrains the size of the regressioncoefficients by forcing the sum of the absolute values of their values to be less thana chosen value, which shrinks the coefficients and forces certain coefficients to ex-actly zero. As a result, the model fitting process automatically select only a subsetof the features (the ones with non-zero coefficients) in the final Lasso model. It isshown that via regularization and shrinkage, Lasso is able to enhance the predictionaccuracy and interpretability of the model at the same time.Consider a data set X containing n observations and p features, each obser-11vation corresponding to a single response in y. Denote the i-th observation asxi = (x1,x2, . . . ,xp)T and its corresponding response as yi. Lasso solves the follow-ing L1-regularized least square problem.minβ0,β{1NN∑i=1(yi−β0− xTi β )2}subject top∑j=1|β j| ≤ t, (1.6)where t is a prespecified tuning parameter that controls the amount of regularizationand shrinkage.Equation 1.6 can re-written in a more compact fashion.minβ0,β{1N‖y−β0−Xβ ‖22}subject to ‖β ‖1 ≤ t, (1.7)where ‖Z‖p =(∑Ni=1 |Zi|p)1/p represents the Lp norm.Sinceβˆ0 = y¯− x¯Tβ , (1.8)therefore,yi− βˆ0− xTi β = yi− (y¯− x¯Tβ )− xTi β = (yi− y¯)− (xi− x¯)Tβ , (1.9)it is a standard procedure to normalize the features – subtract feature means andthen standardize features(∑Ni=1 x2i j = 1). As a result, the Lasso solution does notdepend on the measurement scale of the features.Suppose the features in X are normalized, we can further rewrite the optimiza-tion problem as follows.minβ∈Rp{1N‖y−Xβ ‖22}subject to ‖β ‖1 ≤ t. (1.10)This is equivalent to the following problem according to the Lagrangian form.minβ∈Rp{1N‖y−Xβ ‖22+λ‖β ‖1}. (1.11)There is a one-to-one relationship between t and λ but the exact relationship varies12and it’s data dependent.By solving Equation (1.11), which is a convex optimization problem, the Lassocoefficients can be obtained.1.3.2 Random ForestsTo better understand Random Forests, we first presents two closely related prelim-inary methods: Regression Tree and Bagging.Regression TreeA regression tree is a binary decision tree. The root node of the tree contains allthe training sample, which gets split into two child nodes. The the two child nodessplit into their own child nodes recursively. Each split is determined by choosinga feature and its corresponding best split value. The goal of the split is to maxi-mizing the gain in homogeneity in a greed matter. In the context of regression tree,the gain in homogeneity is usually represented by the decrease in sample variancesof responses, i.e., sample response variance of the parent node subtracting the sumof sample response variances of the two child nodes. Therefore, among all thefeatures, the feature and its split value that allow the maximum amount of gain forthe current node are chosen. More specifically, at a particular node, the tree grow-ing/splitting process chooses a split from all possible splits so that the resultingleft and right child nodes are as homogeneous as possible. This is done by firstfinding each feature’s best split value and then searching for the split that achievesthe maximum gain. Since different features and split values may be chosen for dif-ferent splits, a regression tree essentially partitions the feature space successivelyinto smaller hyper-rectangles (represented by nodes) and each hyper-rectangle isas homogeneous as possible.The tree splitting process stops according to the “stopping rules”. A node willnot be split and becomes a terminal node if:• the node contains observations with identical responses (no more gain inhomogeneity can be achieved),• all the observations in the node have identical values for each feature,• the tree size reaches the user-specified maximum tree size,13• the size of the node is smaller than the user-specified minimum node size,• the child node size is smaller than the user-specified minimum child nodesize, and• the amount of gain from the best split is smaller than the user-specified min-imum amount.For further reference, please see Breiman et al. [7] and Chapter 9 of Venables andRipley [36].Once the tree splitting process finishes, each branch of the tree ends in a termi-nal node. Each observation falls into one and exactly one terminal node, and eachterminal node corresponds to a unique set of splits.Bootstrap Aggregating (Bagging)Bootstrap Aggregating, or bagging, is a general ensemble framework proposedby Breiman [5]. Bagging fits different regression or classification models on sev-eral bootstrap samples for further aggregation. A bootstrap sample is obtained byselecting a random sample with replacement of equal size of the training observa-tions [14]. More specifically, given a training data set X = (xT1 , . . . ,xTn )T with thecorresponding response y = (y1, . . . ,yn)T , bagging has the followings steps.• For b = 1, . . . ,B:– Sample, with replacement, n training observations from X and y ac-cordingly. Denote the sample as X b and yb.– Fit a regression or classification model Mb on X b and yb.• The prediction of a new observation x′ can be calculated by averaging thepredictions from all the B models M1, . . . ,MB as yˆ′ = 1B ∑Bb=1 Mb(x′), or forclassification applications, a majority vote can be applied.Bagging often results in better prediction performance since it decreases thevariance of the model by averaging, without increasing the bias. This benefits themethods that are not stable, namely, methods with high variance (perturbation ofa training set leads to significant changes in predictions) and low bias (the fittingaccuracy is high despite changes in training sets). In the case of a regression trees,14its predictions are highly sensitive to noise in the training sets, and the average ofmany regression trees becomes less sensitive, as long as the trees are not correlated.In order to reduce the correlation among trees, bootstrap sampling is an effectivemethod as it de-correlate the trees by showing them different training sets.The tuning parameter B in bagging represents the number of models (e.g. re-gression trees). Building a few hundred models is a common practice, and thenumber varies according to the size of the training set. In practice, B can be chosenvia cross-validation, or according to the out-of-bag (OOB) error. The OOB error iscalculated as follows. For each training observation xi, the mean prediction error iscalculated using only the models that did not include xi in their bootstrap sample.Random ForestsRandom Forests, a tree-based ensemble method proposed by Breiman [6], canbe considered as an extension of bagging. Random Forests differ from baggingof trees mainly in the following way. In bagging, each tree is built on a bootstrapsample and each split is chosen from all candidate splits from all the features. How-ever, in Random Forests, each split is chosen from only a random sample of mtryfeatures instead of all the features. This extra source of randomness in tree splitsfurther de-correlate the trees in Random Forests on top of the bootstrap samplesand leads to better ensemble performance by further reducing the average correla-tion/variance. Given a training data set X = (xT1 , . . . ,xTn )T with the correspondingresponse y = (y1, . . . ,yn)T , the steps of Random Forests are presented as follows.• For b = 1, . . . ,B:– Sample, with replacement, n training observations from X and y ac-cordingly. Denote the sample as X b and yb.– Build a fully grown, unpruned tree on X b and yb.– At each node of the tree, randomly select mtry features from which thebest split-point of that node is chosen.• The prediction of a new observation x′ can be calculated by averaging thepredictions from all the B models M1, . . . ,MB as yˆ′ = 1B ∑Bb=1 Mb(x′), or forclassification applications, a majority vote can be applied.15Fully grown trees in Random Forests lead to high prediction variability, how-ever, the ensemble over many trees reduces the average variance and avoid overfit-ting.The prediction accuracy of Random Forests can be assessed using the OOBerrors. Random Forests also offers evaluation of importance of feature: Each fea-ture is randomly permuted in the OOB samples and its impact on prediction erroris measured; features corresponding to larger impact on the prediction error areconsidered more important. Other than the number of trees B, Random Forests hasanother important tuning parameter mtry which may impact its performance signif-icantly. Breiman [6] suggests to use mtry =⌊√p⌋for classification problems withp features, and mtry = bp/3c for regression problems with a minimum node size of5. In practice, it is common to build a few hundred trees in Random Forests.1.4 Thesis OrganizationThe rest of the thesis is organized as follows:• Chapter 2. Robust Sparse Hierarchical Clustering.We first show how the existing SHC framework can be severely affectedby outliers. Then we propose a novel approach called Robust Sparse Hi-erarchical Clustering which is robust to outliers in high-dimensional data.Based on this Chapter, a paper entitled “Robust sparse hierarchical cluster-ing”, co-authored by Leung, Zhang and Zamar, and myself will be submittedfor publication.• Chapter 3. Multi-rank Sparse Hierarchical Clustering.We first show the limitation of the SHC framework when clustering featureshave a complex structure. Then we propose a new framework called MrSHCwhich is more powerful when dealing with more complex structures. Basedon this Chapter, a paper entitled “Multi-rank sparse hierarchical clustering”,co-authored by Zhang and Zamar, is available in ArXiv, and will be soonsubmitted for publication.• Chapter 4. Regression Phalanxes.16We focus on building an ensemble method for regression tasks via hierar-chical clustering of features in high-dimensional data. We call the generalensemble framework Regression Phalanxes. We show that via simulationstudies and several real data examples, Regression Phalanxes can furtherboost the performance of a good performing regression procedure such asLasso or Random Forests. Based on this Chapter, a paper entitled ”Regres-sion Phalanxes”, co-authored by Zhang, Welch and Zamar, will be submittedfor publication.17Chapter 2Robust Sparse HierarchicalClustering2.1 IntroductionIn Section 1.2, we provide a brief review of the existing research on sparse hier-archical clustering. We mainly focus on the SHC approach proposed by Wittenand Tibshirani [38]. It is demonstrated on a simulated and a real data set that SHCresults in more accurate identification of the underlying clusters as well as moreinterpretable results than classical hierarchical clustering when the data set is high-dimensional and contains potentially a large portion of noise features. However,as we noted in Section 1.2, there are some limitations of the SHC framework. Inthis chapter, we focus on its vulnerability to data contaminations. Sparse cluster-ing methods tend to suffer from data contaminations. Kondo et al. [19] pointedout that sparse K-means can be severely affected by a single outlying observationand proposed robust sparse K-means clustering (RSKC) to remedy this problem.Although classical hierarchical clustering methods themselves are fairly robust tooutliers, we will show that a very small fraction of outliers can severely upset theresults of SHC. In order to address this issue, we propose robust sparse hierarchicalclustering (RSHC) with respect to outlying observations.The rest of this chapter is organized as follows. We first provide some discus-sion and illustration of the effect of outliers on SHC. Then we propose two RSHC18methods based on L1-penalty and τ-estimators. Finally, we compare the RSHCapproaches with other alternatives using simulation studies and two real data sets.2.2 Effect of Outliers on Sparse Hierarchical ClusteringThe performance of SHC may be negatively affected by a small fraction of outliers,which often happens in large and flat data sets. We demonstrate this using thefollowing simulated example.Consider a data set with 120 observations and 100 features. The observationsare generated from 3 underlying clusters defined by the first two features:Xi j =µi+ εi j j = 1,2εi j j = 3, . . . ,100where εi j ∼i.i.d N(0,1) andµi =−µ 1≤ i≤ 400 41≤ i≤ 80µ 81≤ i≤ 120We choose µ = 3.We then consider the same data set but with outliers. The outliers are generatedby replacing a total of 10 randomly selected entries in the noise features by valuesfrom N(0,20).Figure 2.1 shows the dendrograms from classical hierarchical clustering (HC)and SHC applied to the simulated data sets without and with outliers. An idealclustering method would separate the three clusters labelled with red, green andblue respectively. In the case of no outliers, SHC outperforms HC and separatesthe underlying 3 clusters well. In the presence of outliers (labelled by black), bothmethods fail to identify the underlying clusters and generate mixed clusters.To understand this behavior, we examine the expression for the feature weightswhich are found by solving the subgradient equation of (1.5) with respect to w j19Figure 2.1: Dendrograms (with Ward’s linkage) from HC and SHC appliedto data sets without and with outliers. Colors are used to indicate thetrue underlying cluster.(fixing u):w j =S(DTj u,∆)||S(DT u,∆)||2, ∆> 0, j = 1, ..., p,where D j is the j-th column of D, S denotes the soft thresholding operator (S(a,∆)=sgn(a)(|a|−∆)+, where ∆> 0 is a constant, and x+ =max(x,0)). See Witten et al.[39] for details. Notice that if some components in feature j, D j, are unusually20large, most of the weights will be assigned to that feature by the soft thresholdingoperator, that is, the corresponding weight w j will be large. As an extreme case,all weights may be assigned to the contaminated features, breaking down SHC.2.3 Robust Sparse Hierarchical ClusteringAn key building block of SHC is sparse principal component (SPC) analysis, whichneeds to be robustified to achieve robust SHC.First we introduce a new robust sparse principal component method based onτ-estimators (RSPC-τ). Then we propose two approaches for robust sparse hier-archical clustering (RSHC) based on RSPC-τ . Finally we provide a permutationmethod to choose the sparsity tuning parameters.2.3.1 Robust Sparse Principal Component Based on τ-estimatorsConsider a data set Z with n observations and p features. Let a = (a1, . . . ,an)T ,b = (b1, . . . ,bp)T , and µ = (µ1, . . . ,µp)T . In the context of principal componentanalysis, ai (i = 1, . . . ,n) are the scores of the first principal component, b j ( j =1, . . . , p) are the loadings or weights for each feature, and µ is the center of thedata. Then the rank-1 approximation to Z isZ (1)(a,b,µ ) = 1nµ T +abT ,where 1n is a n×1 vector of 1’s. The parameters, a, b, and µ , can be estimated bysolving the following minimization problem:mina,b,µp∑j=1s2j , subject to ||b||22 = 1, (2.1)where s2j is the sample variance of the residuals {r1 j, ...,rn j}, where ri j = ri j(a,b,µ )=Zi j−Z(1)i j (a,b,µ ) and Z(1)i j = µ j +aib j.Boente and Salibian-Barrera [3] introduced robust principal component by us-ing robust scale estimator instead of sample variance in (2.1) as follows:mina,b,µp∑j=1σ̂2j , subject to ||b||22 = 1, (2.2)21where σ̂ j is a robust scale estimator for the residuals (see also Maronna [22]).We propose robust sparse principal component by introducing an L1 penaltyon the feature weights, b, in (2.2):mina,b,µ{p∑j=1σ̂2j +λ ||b||1}, subject to ||b||22 = 1, (2.3)where λ is a tuning parameter that controls the sparseness of the feature weights,b.We use τ-estimators of scale for σ̂ j, j = 1, ..., p, because the estimator achievesrobustness when there is contamination in the data and otherwise, it will producesimilar results compared to the standard deviation [41].To solve (2.3), we propose an iterative approximation approach, which is de-scribed in detail in the Appendix.2.3.2 Robust Sparse Hierarchical ClusteringSHC uses SPC to handle noise variables. Likewise, our robust sparse hierarchicalclustering uses RSPC-τ to handle both noise variables and outliers. We proposetwo approaches to perform RSHC. The first approach is a direct robustification ofSHC, which we call Direct Robust Sparse Hierarchical Clustering (DRSHC). Thesecond approach introduces a faster procedure which we call Fast Robust SparseHierarchical Clustering (FRSHC). We include DRSHC only for completion andcomparison purposes, but we shall show that FRSHC outperforms DRSHC in bothcomputational efficiency and clustering accuracy.Direct Robust Sparse Hierarchical ClusteringDRSHC is a direct robustification of SHC, where we apply RSPC-τ on the n2× ptranformed dissimilarity matrix D instead of SPC. Details of DRSHC are presentedin Algorithm 1.Fast Robust Sparse Hierarchical ClusteringWe propose another approach, FRSHC, which differs from DRSHC mainly intwo aspects.22Algorithm 1 DRSHC1: Input: A data matrix X n×p, tuning parameter λ .2: Compute the n2× p matrix D.3: Apply RSPC-τ using λ to D to obtain âD, b̂D and µ̂D. The number of non-zeroweights in b̂D is denoted qD.4: Set û = Db̂D, which is a vector of length n2.5: Re-arrange the elements in û into a n×n sparse dissimilarity matrix Û .6: Apply classical hierarchical clustering (with any linkage of choice) to Û ,which only involves qD features.7: Output: A dendrogramH C , sparse weights for the p features ŵ = b̂D.In FRSHC, RSPC-τ is directly applied to the original data X (a n× p matrix)instead of the transformed dissimilarity matrix D (a n2× p matrix). The intuitivefor using D is rather unclear. The reason why X is used is because of the followingreasons. A single feature is considered to be important if it contains enough vari-ability so that it can potentially separate the clusters (in the extreme case, a featurewith constant values will have no effect in separating clusters). Our goal is to finda subset of variables that separate the underlying clusters, therefore, we look forthe sparse principal components that represent most of the variance in the originaldata. Since D is just a transformation of X , if we can discover enough variance inD, we would be able to discover the same variations in X . Hence, FRSHC copesbetter with larger values of n due to the considerable difference in size between Xand D.The sparse weights obtained from FRSHC can be negative because unlike D,X may contain negative entries. The non-zero negative weights indicate that thecorresponding variables have some influence for constructing the sparse PC. Forour purposes, this could be recognized by assigning these variables weights pro-portional to the absolute value of the original negative weights. An alternativesimpler approach would be to assign weight 1 to all the features with non-zeroweights. This is essentially distinguishing important and unimportant variables.This will give a simpler interpretation for the clustering results. To distinguishbetween the procedures resulting from these two weighting schemes, we refer tothem as FRSHC-AW (FRSHC with Absolute Weights) and FRSHC-IW (FRSHCwith Indicator Weights) respectively.23Details of FRSHC are presented in Algorithm 2.Algorithm 2 FRSHC1: Input: A data matrix X n×p, tuning parameter λ .2: Apply RSPC-τ using λ to X to obtain âX , b̂X and µ̂X . The number of non-zeroweights in b̂X is denoted qX .3: Calculate the sparse dissimilarity matrix Û as follows:4: For FRSHC-AW:• Calculate D.• Let |b̂X | be the vector of absolute values of the elements in b̂X .• Set û = D|b̂X |, a vector of length n2.• Re-arrange the elements in û into a n×n sparse dissimilarity matrix Û .For FRSHC-IW:• Calculate Û directly using the features with non-zero weights in b̂X .5: Apply classical hierarchical clustering (with any linkage of choice) to Û ,which only involves qX features.6: Output: A dendrogramH C , sparse weights for the p features ŵ = b̂X .Remark 2.3.1. In principle, other robust sparse principal component methodscould be used in place of RSPC-τ . For comparison, we consider the method pro-posed in Croux et al. [11], which we call RSPC-Croux. The corresponding di-rect and fast robust sparse hierarchical clustering procedures are referred to asDRSHC-Croux and FRSHC-Croux.Choice of Tuning ParametersIn this section, we propose a permutation approach for choosing the tuning param-eter λ of DRSHC and FRSHC. The procedure is described as follows:1. Obtain B permuted data sets X 1, . . . ,X B by independently permuting the ob-servations within each feature of X .2. For each candidate λ :24(a) Compute J(λ ) = ∑pj=1 τ̂2j + λ ||b||1, the objective obtained from opti-mizing RSPC-τ criterion (2.3) with λ on data X (for FRSHC) or trans-formed dissimilarity matrix D (for DRSHC).(b) For b = 1, . . . ,B, compute Jb(λ ), the objective obtained by performingRSPC-τ with λ on the data X b (for FRSHC) or the corresponding Db(for DRSHC).(c) Calculate Gap(λ ) = 1B ∑Bb=1 log(Jb(λ ))− log(J(λ )).3. Among a set of candidates 0 < λ1 < .. . < λk (e.g. k = 20), calculate thesecond order differences of Gap(λi), i = 1, . . . ,k, and choose λ ∗ that min-imizes the differences λ ∗ = argmin{Gap(λi+1)−2Gap(λi)+Gap(λi−1)}.This corresponds to either the most prominent local maximum or a sud-den increase in the Gap values. The behavior of this step is illustrated inFigure 2.2. The Gap(λi), i = 1, . . . ,k, obtained from a simulated example isplotted against the corresponding λi, and the red point corresponds to thechosen λ i. We can see that this point is both the most prominent local maxi-mum and also the point after a sudden increase. Extensive simulation resultsand real data applications confirm that this criterion works well in practice.Figure 2.2: An illustration of choosing tuning parameter λ based on secondorder differences of Gap(λi), i = 1, . . . ,kThe resulting DRSHC and FRSHC with automatically selected λ are denotedas DRSHC-Gap and FRSHC-Gap respectively.25For cases where target numbers of important features are suggested by subject-area knowledge (as in the real data examples in Section 2.5), we propose using abackward approach to determine λ that will result in the target number of featuresqX (for FRSHC) or qD (for DRSHC).Given qX (or qD), we choose λ by the following bisection procedure:1. Choose λ = λ (Mid), where λ (Mid)=(λ (UB)+λ (LB))/2, λ (LB)= 0 and λ (UB)=c. We use c = 20, but other choices can be considered.2. Apply FRSHC (or DRSHC) using λ from Step 1.3. If the resulting number of features for clustering from Step 2 is less than qX(or qD) then we update λ (UB) = λ , else update λ (LB) = λ .4. Repeat Step 1–3 until we obtain qX ±2 (or qD±2) features for clustering.We have seen in our simulation studies that the above procedure converges quicklyin a few iterations.2.4 Simulation StudyWe conduct simulation studies to compare the quality of dendrograms and the ac-curacy of feature selection of the following methods: classical hierarchical cluster-ing (HC), SHC, DRSHC, FRSHC-AW, FRSHC-IW, FRSHC-Croux-AW, FRSHC-Croux-IW, FRSHC-Gap-AW and FRSHC-Gap-IW. We do not include DRSHC-Croux and DRSHC-Gap in the simulation studies due to their high computationalburden. We show the results for all the methods with Ward’s linkage. Similarresults (not shown here) are obtained with other linkages.We generate data sets X with n = 60 observations and p = 1000 features asfollows. The observations are generated from 3 main underlying clusters C1, C2and C3. Moreover, C1 and C2 have two nested sub-clusters labeled as C1a andC1b, and C2a and C2b. See Figure 2.3 for a graphical illustration.More precisely, the clusters are determined by q = 100 features as follows:Xi j =µi+ εi j j = 1, ...,100εi j j = 101, ...,100026Figure 2.3: A graphical illustration of the hierarchical structure of the simu-lated clusterswhere εi j ∼i.i.d N(0,1) andµi =0 i = 1, . . . ,12 (i ∈C1a)µ i = 13, . . . ,24 (i ∈C1b)µ+1 i = 25, . . . ,36 (i ∈C2a)2µ+1 i = 37, . . . ,48 (i ∈C2b)2µ+2 i = 49, . . . ,60 (i ∈C3)We show the results for µ = 0.5. Similar conclusions are obtained for other choicesof µ .We contaminate the data sets using outliers generated from the following mod-els:M0. No outliers.M1. One large-mean outlier in one noise feature. A randomly selected obser-vation of a noise feature is replaced by a value from N(25,1).M2. One large-mean outlier in one clustering feature. A randomly selectedobservation of a clustering feature is replaced by a value from N(25,1).M3. Large-variance outliers in noise features. For each of the five clusters,randomly select two entries in the noise features to be replaced by values27from N(0,152).M4. Large-variance outliers in clustering features. For each of the five clusters,randomly select two entries in the clustering features to be replaced by valuesfrom N(0,152).M5. Large-variance outliers in noise and clustering features. For each of thefive clusters, randomly select two entries in the noise features and two entriesin the clustering features to be replaced by values from N(0,152).M6. Mild-mean outliers in noise features. Randomly select 50 noise features,and replace five random entries from each by values from N(5,1).M7. Overall variance increase in some noise features. Fifty randomly selectednoise features are generated from N(0,22).We generate 100 data sets for each contamination model.The quality of the resulting dendrograms is evaluated as follows. The dendro-grams are cut at a high and a low level which are chosen to obtain exactly threeand five clusters, respectively. Clusters with two or less observations are discarded.The classification error rate (CER) is then used to assess clustering accuracy [see10]. We do this by first comparing the resulting labels from the three main clustersagainst the underlying true labels (C1, C2, C3). Next, we compare the resulting la-bels from the five clusters against the underlying true labels (C1a, C1b, C2a, C2b,C3).The accuracy of feature selection is evaluated by the recall rate (RR). Let Jbe the set of indices corresponding to all the clustering features, and |J |= q. Therecall rate (RR) is calculated as follows:RR(J ) =∑ j∈J I(ŵ j 6= 0)q,where I(·) is an indicator function.For SHC, DRSHC, FRSHC-AW, FRSHC-IW, FRSHC-Croux-AW, and FRSHC-Croux-IW, the target number of clustering features is given and equal to 100. ForFRSHC-Gap-AW and FRSHC-Gap-AW, the number of clustering features is cho-sen automatically. The FRSHC-Gap methods generally identify the correct number28HC SHC DRSHCFRSHC FRSHC-Gap FRSHC-CrouxAW IW AW IW AW IWM0K=3 0.024 0.000 0.006 0.000 0.000 0.001 0.000 0.170 0.046K=5 0.147 0.061 0.095 0.056 0.055 0.059 0.054 0.180 0.142M1K=3 0.024 0.393 0.006 0.000 0.000 0.001 0.000 0.172 0.049K=5 0.147 0.263 0.089 0.055 0.055 0.059 0.054 0.184 0.145M2K=3 0.024 0.313 0.006 0.000 0.000 0.000 0.001 0.178 0.049K=5 0.147 0.249 0.095 0.055 0.057 0.057 0.054 0.177 0.141M3K=3 0.025 0.408 0.005 0.000 0.000 0.001 0.000 0.166 0.033K=5 0.146 0.276 0.092 0.054 0.056 0.060 0.052 0.179 0.134M4K=3 0.025 0.299 0.009 0.003 0.002 0.006 0.002 0.144 0.054K=5 0.146 0.242 0.089 0.056 0.055 0.061 0.053 0.173 0.139M5K=3 0.029 0.375 0.012 0.002 0.001 0.006 0.002 0.143 0.059K=5 0.151 0.270 0.099 0.058 0.058 0.062 0.057 0.174 0.135M6K=3 0.035 0.234 0.006 0.001 0.001 0.001 0.000 0.184 0.064K=5 0.153 0.228 0.096 0.059 0.079 0.069 0.067 0.191 0.155M7K=3 0.035 0.338 0.018 0.000 0.000 0.000 0.000 0.460 0.309K=5 0.156 0.282 0.111 0.062 0.071 0.058 0.061 0.321 0.261Table 2.1: Average CER for different clustering methods. The maximum SEof CER in this table is 0.01, while the maximum SE of CER for FRSHCmethods is 0.003of features used for clustering with the average numbers in M0–M7 are 100.05,100.22, 101.21, 100.87, 100.37, 99.94, 98.380, and 100.152, respectively.Table 2.1 presents the average CER. When data sets are not contaminated,sparse methods in general produce more accurate clustering results than the clas-sical HC; SHC and the FRSHC methods yield the smallest average CER, and DR-SHC gives slightly higher CER. Under M1–M7, SHC gives poor clustering results.On the other hand, HC is robust against outliers but clearly affected by the largenumber of noise features. Finally, DRSHC and the FRSHC methods are highlyrobust against outliers and noise features. There is not much difference in perfor-mance between the FRSHC methods with absolute and indicator weights.All the methods except for SHC show better clustering accuracy when dendro-grams are cut into three rather than five clusters, as expected.Table 2.2 presents the average RR. When data sets are not contaminated, SHCidentifies most of the clustering features with the highest RR. DRSHC, the FRSHCmethods, and the FRSHC-Gap methods give slightly smaller RR. Under M1–M7,SHC gives the lowest RR because it gives most of the weights to the contaminated29SHC DRSHC FRSHC FRSHC-Gap FRSHC-CrouxM0 0.993 0.807 0.972 0.990 0.393M1 0.487 0.802 0.972 0.989 0.401M2 0.545 0.800 0.972 0.980 0.397M3 0.429 0.815 0.971 0.980 0.404M4 0.543 0.830 0.964 0.984 0.409M5 0.441 0.810 0.962 0.984 0.398M6 0.544 0.834 0.933 0.969 0.388M7 0.529 0.790 0.949 0.972 0.099Table 2.2: Average RR for different clustering methods. RR for AW and IWmethods are the same due to the use of indicator function for non-zeroweights. The maximum SE of RR in this table is 0.017, while the maxi-mum SE of RR for FRSHC methods is 0.008.HC SHC DRSHC FRSHC FRSHC-Gap FRSHC-CrouxM0 0.014 0.161 45.940 0.315 163.115 14.038M1 0.013 0.190 45.811 0.316 164.811 14.118M2 0.013 0.197 45.382 0.315 164.989 13.885M3 0.013 0.271 45.980 0.314 165.623 14.182M4 0.014 0.264 46.886 0.332 167.977 13.927M5 0.014 0.272 46.730 0.338 170.452 14.036M6 0.015 0.301 46.017 0.384 134.041 13.950M7 0.015 0.158 42.166 0.311 119.865 12.880Table 2.3: Average CPU times in seconds of a 2.8 GHz Intel Xeon for differ-ent clustering methodsfeatures. On the other hand, DRSHC, the FRSHC methods, and the FRSHC-Gapmethods are only mildly affected by the contamination and give most of the weightsto the clustering features.Table 2.3 presents the average computing times. The computing times for theFRSHC methods are comparable to HC and SHC. For larger n, say 120, the FRSHCmethods become faster than SHC. DRSHC is more computationally intensive. TheFRSHC-Gap methods give the longest computing time because of the computa-tionally intensive permutation approach used to automatically select the number ofclustering features.302.5 Application to Microarray Data SetsWe consider two breast cancer microarray data sets [24, 35]. For each data set,we apply the following hierarchical clustering methods: HC, SHC, SHC-Gap,FRSHC-AW, FRSHC-IW, FRSHC-Gap-AW, and FRSHC-Gap-IW, where SHC-Gap represents the SHC with the number of clustering features chosen automati-cally by the permutation approach [see 38]. Dendrograms are created using Ward’slinkage so that the results are comparable with those in the original papers.2.5.1 Breast Cancer Data Set in Perou et al. [24]The data set was first published in Perou et al. [24] and later analyzed in Witten andTibshirani [38]. It contains 1753 gene expression levels (features) for 62 samples(observations) to profile surgical specimens of human breast tumors. Perou et al.[24] categorized the 62 samples into four groups (clusters): basal-like, Erb-B2,normal breast-like, and ER+. Perou et al. [24] suggested that the four underlyingclusters could be explained by only 496 of the 1753 features, which was confirmedby Witten and Tibshirani [38]. Two misclassified samples were identified by Wittenand Tibshirani [38]. The data set was pre-processed before being published. Assuch, there are no outliers in the data set.Figure 2.4 shows the dendrograms resulting from HC, SHC, FRSHC-AW, andFRSHC-IW, with the number of clustering features manually fixed at 496 for SHCand the FRSHC methods as suggested. Colors are used to indicate the suggestedfour tumor groups.The dendrogram generated from SHC found four main clusters, as expected.Two red samples were misclassified as orange. This is surprising because the redand orange clusters are otherwise very well separated in the SHC dendrogram. TheFRSHC methods identify the blue and green clusters, however, the dendrogramsfurther suggest that the red cluster may be further split into two sub-clusters, oneof which is close to the orange cluster. This is more consistent with the misclassi-fication of the two red samples.To assess the performance of the different methods in the presence of outliers,we randomly select four observations, and for each of them, the value of a sin-gle random feature is replaced by a value from N(0,15). Figure 2.5 shows the31Figure 2.4: Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 496 clustering features for the clean Perou et al. [24] dataresulting dendrograms from HC, SHC, FRSHC-AW, and FRSHC-IW applied onthe contaminated data set, with the number of clustering features manually fixed at496 for SHC and the FRSHC methods. HC is mildly affected by the outliers, as ex-pected. On the other hand, SHC produces mixed clusters. The FRSHC methods arerobust against the outliers and give very similar clustering patterns and hierarchicalstructures as in the original data set.Assuming that the number of clustering features is unknown, we repeat theanalysis by applying SHC-Gap and the FRSHC-Gap methods on the original and32Figure 2.5: Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 496 clustering features for the contaminated Perou et al. [24]datacontaminated data sets. Notice that there is no difference in feature selectionbetween FRSHC-Gap-AW and FRSHC-Gap-IW. For the original data set, SHC-Gap selects 112 clustering features, while the FRSHC-Gap methods select 396.For the contaminated data set, SHC-Gap selects 49 clustering features while theFRSHC-Gap methods select 418. The resulting dendrograms from SHC-Gap andthe FRSHC-Gap methods give similar clustering patterns and hierarchical struc-tures, compared to those from their manual counterparts, respectively (results not33shown).2.5.2 Breast Cancer Data Set in van’t Veer et al. [35]The data set was presented and analyzed in van’t Veer et al. [35]. It consists of 4751gene expression levels for 77 primary breast tumor samples. A supervised classi-fication technique was used in van’t Veer et al. [35], revealing that only a subsetof 70 out of the 4751 genes may help discriminating patients that have developeddistant metastasis within five years.Figure 2.6 shows the resulting dendrograms generated from HC, SHC, FRSHC-AW, and FRSHC-IW, with the number of clustering features fixed at 70, as sug-gested. The FRSHC methods give slightly more accurate clustering results thanHC and SHC, which may be explained by a mild outlier identified from all thedendrograms. When the number of clustering features is assumed to be unknown,SHC-Gap selects 1046 features, while the FRSHC-Gap methods select 111. Thedendrograms from SHC-Gap and the FRSHC-Gap methods are similar to theirmanual counterparts, as in the previous example (result not shown).2.6 DiscussionClassical hierarchical clustering method is moderately robust against outlying ob-servations, but it has been shown to be non-robust against noise features. On theother hand, sparse hierarchical clustering in Witten and Tibshirani [38] is designedto be robust against noise features. However, we show by simulation studies andreal data examples that it is non-robust against outlying observations. Based onthese considerations, we propose FRSHC, to achieve robust sparse hierarchicalclustering that are robust against outliers and noise features. Moreover, FRSHCscales better than SHC for larger n. We investigate two possible ways of apply-ing weights to the clustering features found in FRSHC, namely, absolute weight(FRSHC-AW) and indicator weights (FRSHC-IW). Simulation studies show thatboth variants of FRSHC deliver better clustering and feature selection accuracywith less computational time comparing to DRSHC. A permutation approach isproposed to automatically choose the tuning parameter λ , which determines thenumber of clustering features. Another aspect of our implementation is to allow34the user to choose the number of features to be selected.35Figure 2.6: Dendrograms generated by HC, SHC, FRSHC-AW and FRSHC-IW with 70 clustering features for van’t Veer et al. [35]’s data. The mildoutlier is pointed by the arrow.36Chapter 3Multi-rank Sparse HierarchicalClustering3.1 IntroductionIn this chapter, we continue the study of sparse hierarchical clustering. In Sec-tion 1.2, two main disadvantages of the existing SHC framework are presented. InChapter 2, we propose a novel robust framework to address the robustness issue. Inthis chapter, we mainly focus on the other issue mentioned in Section 1.2, namely,incapability of discovering features with complex structures. In high-dimensionaldata sets, the structures of the features, especially the important features, are usu-ally complex. If the total variance within the important features cannot be mostlyprojected onto a single dimension, then we call the structure of the important fea-tures complex. For example, Figure 3.1 shows examples of simple and complexstructures within two important features. We show that when features contain com-plex structures, SHC has its limitation in discovering the important features fromthe noise. Then we propose an improved sparse hierarchical clustering frameworkcalled the multi-rank sparse hierarchical clustering (MrSHC) framework. Throughextensive simulation studies and real data examples, we show that MrSHC is amore flexible framework, and is able to discover the important features more effi-ciently while producing better clustering results comparing to the SHC framework.37Figure 3.1: Examples of simple and complex structures within two importantfeatures. In simple structure, the variance in important features can bemostly projected to a single dimension. The variance of the complexstructure can be discovered if projected to more than one dimension.3.2 Limitations of Witten and Tibshirani [38]’s SparseHierarchical ClusteringSHC essentially applies SPC criterion to D∗ and obtains the best rank-1 sparse ap-proximation of D∗ given a sparsity constraint, i.e. criterion (1.5). The clusteringfeatures are chosen according to the non-zero loadings in the first sparse princi-pal component resulted from the rank-1 sparse approximation. This approach andconsequently our RSHC approaches share the same limitations when the cluster-ing features may not be fully identified by a single sparse principal component (seeFigure 3.1). In other words, the clustering features may not be properly recoveredby only rank-1 approximation. The following simulated example illustrates thissituation.We generate a data set X as follows: X contains n = 20 observations withp = 15 features, i.e. X n×p = (x1,x2, . . . ,xn)T , where xi = (xi,1,xi,2, . . . ,xi,p)T , 1 ≤i ≤ n. The observations are organized in four clusters of size 5. Let Yi, (i =1, . . . ,n) denote the cluster memberships. Then xi j (i = 1, . . . ,n) is generated fromN(µ j(Yi),0.1) for j = 1, . . . ,4, and N(µ j(Yi),1) for j = 5, . . . , p.A sketch of µ j(Yi) is presented in the table below:38Yi µ1(Yi) µ2(Yi) µ3(Yi) µ4(Yi) µ5(Yi) . . .µP(Yi)1 1 1 1 1 0 . . . 02 -1 -1 1 1 0 . . . 03 -1 -1 -1 -1 0 . . . 04 1 1 -1 -1 0 . . . 0We apply SHC to X . By gradually increasing the sparsity constraint, we obtainthe sequence of the first 9 chosen variables {V13,V11,V6, V1, V2, V14, V8, V4, V3}.The first three chosen features are noise features. As a result, the dendrogramgenerated from the first four chosen features (which is suggested by Witten andTibshirani [38]’s auto-selection method) gives mixed clusters (See Figure 3.2). Theclustering result is still unsatisfactory even if seven variables are chosen (results notshow here). Moreover, five noise features are selected before all the four clusteringfeatures are chosen.Figure 3.2: Dendrogram generated with the first four chosen features{V13,V11,V6, V1} from Witten and Tibshirani’s sparse hierarchical clus-tering framework3.3 Multi-rank Sparse Hierarchical ClusteringTo remedy this limitation of SHC, we propose the multi-rank sparse hierarchicalclustering (MrSHC). Similar to SHC, MrSHC uses SPC as an important buildingblock for feature selection, but MrSHC is different in the following aspects.• As motivated by the good performance of FRSHC, MrSHC applies SPC di-39rectly to the original data X .• Indicator weights for the non-zero loadings in sparse PCs are used for asimpler interpretation for the clustering results. Indicator weights also allowMrSHC to match classical hierarchical clustering if no sparsity constraint isapplied.• MrSHC identifies and recovers the clustering features using multi-rank sparseapproximation through SPC. In other words, the clustering features are cho-sen according to the non-zero loadings in multiple sparse PCs.MrSHC is very different from the traditional approach where high-dimensionaldata are clustered based on the first few principal components. First, MrSHC ap-plies SPC instead of traditional PCA to the data, also, MrSHC chooses the originalfeatures for clustering. The chosen features can be closely approximated by sparselow-rank approximations, in other words, they should have similar patterns of vari-ations (e.g. if the features can be closely approximated by rank-1 approximation,then each of the features should have similar variation as the first PC). In clus-tering, similar patterns of variations usually represent information of clusters, andthus, the features chosen from MrSHC should contain key informations of clusters.This is confirmed by the simulated and real data examples in later sections.We outline the MrSHC framework under the cases in Table 3.1.Case q r1 Known Known2 Known Unknown3 Unknown UnknownTable 3.1: Different cases for MrSHC; q is the target number of selected fea-tures and r is the rank of the SPC approximation.3.3.1 Case 1: Known q and rSuppose the target number of clustering features q and the appropriate rank r ofthe SPC approximation are known. MrSHC applies SPC to X and obtain the first r40sparse PCs. In MrSHC, a feature is considered selected if it has a non-zero loadingin any of the r sparse PCs. With an appropriately chosen sparsity constraint λ , wecan get q or approximately q±d (say d = 1) chosen features from the r sparse PCs.MrSHC chooses λ using a bi-section approach presented in Algorithm 3.Algorithm 3 Feature set selection1: Input: X n×p, q, r, d (default d = 0).2: Assign λ− = 1 and λ+ =√n (λ+ can be set smaller in practice).3: Repeat Step 4-8.4: Apply SPC to X with λ = (λ−k +λ+k )/2; obtain the first r sparse PCs.5: q∗ := the number of variables with non-zero loadings in any of the r sparsePCs.6: Cr := the set of q∗ chosen variables.7: Break if q−d ≤ q∗ ≤ q+d.8: If q∗ > q+d, λ+ = λ ∗; if q∗ < q−d, λ− = λ ∗.9: Output: Cr, q∗.We have seen in the simulations and real data examples that Algorithm 3 willfinish in a few iterations.Given q and r, a set of chosen features Cr can be obtained from Algorithm 3.Then MrSHC simply generates a dendrogram (with any linkage of choice) basedon the features in Cr.3.3.2 Case 2: Known q, Unknown rSuppose q is known, but not r. To choose r, MrSHC first applies Algorithm 3with increasing ranks ri, i = 1, . . . ,R (different R can be chosen; we choose R = 8here). Given rank ri, let Cri denote the candidate feature set obtained from Al-gorithm 3. Different ranks ri will be compared through their corresponding Cri .MrSHC assesses the quality of a feature set Cri through the dendrogram generatedfrom its features. To be more specific, a feature set Cri is evaluated according tothe following aspects.• The number of well-separated clusters discovered from the dendrogram gen-erated from Cri . MrSHC uses a multi-layer pruning approach to obtain thewell-separated clusters from a dendrogram. We introduce a “reference num-41ber of clusters” in the multi-layer pruning to facilitate later comparisons (de-scribed below).• The degree of separation of the discovered clusters, which is evaluated bysilhouette values [27].Given the number of discovered clusters and the silhouette values, an iterative se-lection approach is proposed to choose the final rank r.Multi-layer pruning (MLP) prunes the dendrogram from the top to the bottom,with each split evaluated by the Gap statistics [33]. We introduce a “reference num-ber of clusters” K in MLP, which is both an “upper bound” and a “lower bound”. Itis an upper bound of the number of clusters discovered in MLP. When K is chosenproperly, MLP will produce labels for K clusters for most of the input dendrogramsgenerated from Cri , i= 1, . . . ,R. This facilitates later comparisons since labels withdifferent number of clusters are in general difficult to compare. It is also a lowerbound of the number of clusters that are expected to be discovered. If less than Kclusters are discovered from a dendrogram according to MLP, such a dendrogramand its corresponding feature set are considered to be of low quality since key clus-ters may be missing. Therefore, such feature sets and their corresponding ranksare screened out and excluded from the later comparisons. Details of MLP are pre-sented in Algorithm 4. The reference number of clusters K can be chosen basedon subject area knowledge. If not specified, we set the default reference numberof clusters to be max{2,K0}, where K0 is set as follows: apply MLP with K =+∞to the dendrograms generated from Ci, i = 1, . . . ,R, then K0 is the largest outputnumber of leaf nodes from MLP. We have seen that in practice, a reliable K0 canusually be found by applying MLP to Ci, i = 1,2,3.Suppose there are M (M ≤ R) left over dendrograms after screening out theones with less than K clusters. Let r j, Cr j andLr j ( j = 1, . . . ,M) denote their cor-responding ranks, feature sets and labels (for K clusters from MLP), respectively.Given Cr j andLr j , the degree of separation of the corresponding K clusters can beevaluated by the average silhouette value Sr j . High average silhouette values indi-cate well-separated clusters, and thus, are preferred. If two ranks lead to similaraverage silhouette values, the lower one is preferred since the feature set associatedwith the higher rank is more likely to contain noise variables. Therefore, among42Algorithm 4 Multi-layer pruning (MLP)1: Input: A dendrogramD , number of bootstrap samples B for the Gap statistics,and reference number of clusters K.2: Assign the root node of D as the current node; mark current node as active.3: Repeat Step 4-7.4: Split the current node sequentially according to D ; obtain increasing numberof clusters.5: Evaluate different numbers of clusters from Step 4 using the Gap statistics.• If the chosen number of clusters is 1, set the current node as inactive;• Otherwise, split the current node into two active leaf nodes.6: Break if either of the following applies:• Number of leaf nodes (both active and inactive) is equal to K.• All the leaf nodes are inactive.7: Assign the active leaf node with the highest height in D as the current node.8: Output: Number of leaf nodes (less than or equal to K), and the correspond-ing cluster labelsL .local minimums in Sr j ( j = 1, . . . ,M), the one with the highest rank is the leastfavourite, and thus, we remove such local minimums iteratively until the left-overSr j are monotonically increasing or decreasing. Given monotonically increasingaverage silhouette values, the smallest rank after the largest increase in average sil-houette value will be selected, since smaller ranks are preferred unless the increasein average silhouette value is large. On the other hand, if the average silhouettevalues are decreasing as rank increases, the smallest left-over rank will be selected.Details of this iterative selection approach are presented in Algorithm 5.Once the chosen rank r is obtained from Algorithm 5, MrSHC generates adendrogram (with any linkage of choice) based on the features in Cr.We revisit the example in Section 3.2. Suppose q = 4 is known, and we applyMrSHC with default reference number of clusters K = 2. The resulting Cr withits corresponding rank r = 2 contains the four true clustering features: V1, V2, V3and V4. The resulting dendrogram is presented in Figure 3.3. The four clusters areseparated correctly.43Algorithm 5 Iterative selection of rank1: Input: X n×p, Cr j andLr j , j = 1, . . . ,M.2: For j = 1, . . . ,M; Sr j := the average silhouette value calculated from Cr j ,Lr jand X .3: Repeat Step 4-5.4: Among the local minimums in Sr j ( j = 1, . . . ,M), remove the one with thehighest rank.5: Break if the left-over Sr j , as r j increases, are monotonically:• increasing: r := the smallest rank r j after the biggest increase in the left-over Sr j .• decreasing: r := the smallest left-over rank r j.6: Output: The chosen rank r.Figure 3.3: Dendrogram generated from MrSHC with known q3.3.3 Case 3: Unknown q and rSuppose q and r are both unknown. MrSHC considers a list of target numbers ofchosen features qt , t = 1, . . . ,T . For each of the candidate qt , MrSHC chooses itscorresponding feature set Cqt (|Cqt | = qt) and average silhouette value Sqt as de-scribed in Section 3.3.2. Higher silhouette values are preferred, and at the meantime, smaller target numbers of features are preferred for the sake of interpreta-tion and exclusion of noise features. Therefore, MrSHC uses a similar iterativeapproach as in Algorithm 5 to choose q among qt (t = 1, . . . ,T ). Details of thisiterative approach are presented in Algorithm 6.44Algorithm 6 Iterative selection of number of features1: Input: qt and Sqt , t = 1, . . . ,T .2: Repeat Step 3-4.3: Among the local minimums in Sqt (t = 1, . . . ,T ), remove the one with thehighest qt .4: Break if the left-over Sqt , as qt increases, are monotonically:• increasing: q := the smallest qt after the biggest increase in the left-overSqt .• decreasing: q := the smallest left-over rank qt .5: Output: The chosen target number of chosen features q.Once the chosen q is obtained from Algorithm 6, MrSHC generates a dendro-gram (with any linkage of choice) based on the features in its corresponding Cq.Again, we revisit the example in Section 3.2. Suppose q and r are unknown,and we apply MrSHC with default reference number of clusters K = 2 and the listof qt {2,3, . . . ,8}. MrSHC suggests q = 3 and its corresponding feature set {V 1,V 3, V 4}. Although q = 3 is smaller than the true value 4, the four clusters can stillbe separated correctly (see Figure 3.4).Figure 3.4: Dendrogram generated from MrSHC with unknown q453.4 Simulation StudyWe conduct simulation studies to compare the quality of dendrograms and the ac-curacy of feature selection of the following methods: HC, SHC (with known/un-known q) and MrSHC (with known/unknown q). We show the results for all themethods with complete linkage. Similar results (not shown here) are obtained withother linkages.3.4.1 Simulation IWe generate data sets X with n = 60 observations and p = 500 features as follows.The observations are generated from three main underlying clusters C1, C2 andC3.To be more specific, the clusters are determined by q = 50 features as follows:Xi j =µi+ εi j j = 1, ...,50εi j j = 51, ...,500where εi j ∼i.i.d N(0,1) andµi =0 i = 1, . . . ,20 (i ∈C1)µ i = 21, . . . ,40 (i ∈C2)−µ i = 41, . . . ,60 (i ∈C3)We show the results for µ = 1. Similar conclusions are obtained for µ = 0.8.We generate 100 data sets and apply HC, SHC (with known/unknown q) andMrSHC (with known/unknown q) to each.The quality of the resulting dendrograms is evaluated as follows. The dendro-grams are cut at a level to obtain three clusters. CER is then used to assess clus-tering accuracy by comparing the resulting labels from three clusters against theunderlying true labels (C1, C2, C3). The accuracy of feature selection is evaluatedby RR.Table 3.2 presents the average CER, average RR, and the corresponding aver-age q. For SHC and MrSHC, the unknown q is chosen automatically. HC gives the46highest CER, while SHC achieves better accuracy due to the sparseness. MrSHCachieves the best accuracy among the three methods. When q= 50 is known, SHCand MrSHC give very similar average RR. When q is unknown, both methods givealmost perfect RR, while MrSHC selects less features on average.HC SHC MrSHCCER q CER RR q CER RR qKnown (q = 50) 0.202 500 0.047 0.926 50 0.009 0.995 50Unknown q 0.202 500 0.127 0.997 30.6 0.064 1.000 19.7Table 3.2: Average CER, RR, and q for different clustering methods.3.4.2 Simulation IIWe generate data sets X with n= 80 observations and p= 500 features, i.e. X n×p =(x1,x2, . . . ,xn)T , where xi = (xi,1,xi,2, . . . ,xi,p)T , 1 ≤ i ≤ n. The observations aregenerated from four main underlying clusters C1, C2, C3 and C4. Let Yi, (i =1, . . . ,n) denote the cluster memberships. Then xi j (i = 1, . . . ,n) is generated fromN(µ j(Yi),0.1) for j = 1, . . . ,50, and N(µ j(Yi),1) for j = 51, . . . , p.A sketch of µ j(Yi) is presented in the table below:Yi µ1(Yi) . . . µ25(Yi) µ26(Yi) . . . µ50(Yi) µ51(Yi) . . .µp(Yi)1 µ . . . µ µ . . . µ 0 . . . 02 −1.5µ . . . −1.5µ 0 . . . 0 0 . . . 03 0 . . . 0 −µ . . . −µ 0 . . . 04 0 . . . 0 0 . . . 0 0 . . . 0We show the results for µ = 1. Similar conclusions are obtained for otherchoices of µ .Again, we generate 100 data sets and apply HC, SHC (with known/unknown q)and MrSHC (with known/unknown q) to each. Table 3.3 presents the average CER,average RR, and the corresponding average q. When q= 50 is known, HC and Mr-SHC produce the highest and lowest average CER, respectively. MrSHC producesmore accurate feature selection and clustering than SHC due to the smaller aver-age CER and larger average RR. When q is unknown, SHC produces the highestaverage CER with on average 19.5 chosen features, while MrSHC achieves the47smallest average CER with on average 31.5 chosen features.HC SHC MrSHCCER q CER RR q CER RR qKnown (q = 50) 0.172 500 0.129 0.875 50 0.041 0.977 50Unknown q 0.172 500 0.202 1.000 19.5 0.057 0.992 31.5Table 3.3: Average CER, RR, and q for different clustering methods.3.4.3 Computational Times and ComplexityWe investigate the computational times of MrSHC and SHC.The HierarchicalSparseCluster function from the R-package sparclis used to conduct SHC. MrSHC is implemented in R, where the SPC function fromthe R-package PMA is used to conduct the sparse PCA algorithm. We use defaultinput parameters in MrSHC: the number of bootstrap samples B = 50, maximumrank R = 5, and default selection of the reference number of clusters K. The aver-age computing times for Simulation I & II are presented in Table 3.4.HC SHC MrSHCSimulation I II I II I IIKnown (q = 50) 0.006 0.012 0.746 1.075 7.754 11.276Unknown q 0.006 0.012 10.496 15.518 73.254 103.752Table 3.4: Average computing times (in seconds) for different clusteringmethods.When n and p are relatively small, SHC takes less time to compute. How-ever, since the framework of MrSHC is embarrassingly parallel, parallel comput-ing functions such as mcapply from the R-package multicore can be easilyused to speed up the computation. Moreover, when n and p get larger, MrSHCwill become less time consuming (observed with n = 320 and p = 2000, resultsnot shown). This is because the computational complexities of MrSHC and SHCare O(n3qB+ np) and O(n3q+ n2 p), respectively, and as a result, SHC will be-come more computationally demanding due to the n2 p term as n and p increaseand p >> n.483.5 Application to Microarray Data SetsThree microarray data sets [1, 24, 35] are considered. We apply HC, SHC, MrSHC,and HC using features with the highest marginal variance (HC-HMV) to each dataset. Dendrograms are created using the complete linkage. The default referencenumber of clusters is used in MrSHC, and the default list of candidate q (for bothSHC and MrSHC) is {20,40,60, . . . ,400}.3.5.1 Lymphoma Data Set in Dettling [12]The data set is provided by Dettling [12]. It contains 4026 gene expression lev-els (features) for 62 samples (observations). Three types of most prevalent adultlymphoid malignancies were studied: 42 cases of diffuse large B-cell lymphoma(DLBCL), 9 samples of follicular lymphoma (FL), and 11 observations of B-cellchronic lymphocytic leukemia (CLL). A specialized cDNA microarray was usedto measure the gene expression levels. Following the pre-processing steps in Du-doit et al. [13], the data set is pre-processed by first setting a thresholding window[100,16000] and then excluding genes with max/min≤ 5 or (max−min)≤ 500.A logarithmic transformation and standardization are applied. Finally, a simple 5nearest neighbor algorithm is employed to impute the missing values.The dendrograms generated from HC, SHC, MrSHC and HC-HMV are shownin Figure 3.5. Colors are used to indicate the three tumor types. HC only mis-classifies two red samples, while SHC gives mixed clusters with the automaticallychosen q = 20. MrSHC chooses q = 140 and uses rank r = 2, with only two bluesamples misclassified into the green cluster (notice that the blue and green clustersare closer to each other). HC-HMV with q = 140 mixes the blue and green clus-ters. Therefore, MrSHC achieves better clustering accuracy with a better chosenq = 140 features using rank r = 2.We further investigate the effect of the rank selection by the dendrograms inFigure 3.6. Figure 3.6(a) shows the dendrogram generated from MrSHC with q =140, but r = 1. Figure 3.6(b) shows the dendrogram generated from SHC withq = 140. Both dendrograms suggest mixed clusters. This confirms that the rankselection can be crucial for better feature selection and clustering accuracy.49Figure 3.5: Dendrograms generated by HC, SHC, MrSHC and HC-HMV forDettling [12] data (n = 62, p = 4026)3.5.2 Breast Cancer Data Set in Perou et al. [24]The data set was first published in Perou et al. [24] and later analyzed in Witten andTibshirani [38]. It contains 1753 gene expression levels (features) for 62 samples(observations) to profile surgical specimens of human breast tumors. Perou et al.[24] categorized the 62 samples into four groups (clusters): basal-like, Erb-B2,normal breast-like, and ER+. Perou et al. [24] suggested that the four underlyingclusters could be explained by only 496 of the 1753 features, which was confirmedby Witten and Tibshirani [38]. Two misclassified samples were identified by Witten50Figure 3.6: Dendrograms generated by MrSHC (rank 1) and SHC with q =140 for Dettling [12] data (n = 62, p = 4026)and Tibshirani [38]. The data set was pre-processed before being published. Assuch, there are no outliers in the data set.Figure 3.7 shows the dendrograms generated from HC, SHC, MrSHC and HC-HMV. Colors are used to indicate the suggested four tumor groups. HC givesmixed clusters. SHC achieves better clustering – 5 misclassified samples, with theautomatically chosen q= 100 (similar results – q= 93 features were automaticallychosen and 5 samples are misclassfied – were obtained in Witten and Tibshirani[38]). MrSHC misclassifies only 2 samples by using the automatically selectedq = 60 and rank r = 1. HC-HMV with q = 60 gives mixed clusters. AlthoughMrSHC chooses r = 1 over higher ranks, it still provides better feature selectionand more accurate clustering comparing to the other three methods.3.5.3 Breast Cancer Data Set in van’t Veer et al. [35]The data set was presented and analyzed in van’t Veer et al. [35]. It consists of 4751gene expression levels for 77 primary breast tumor samples. A supervised classi-fication technique was used in van’t Veer et al. [35], revealing that only a subsetof 70 out of the 4751 genes may help discriminating patients that have developeddistant metastasis within five years.51Figure 3.7: Dendrograms generated by HC, SHC, MrSHC and HC-HMV forPerou et al. [24] data (n = 62, p = 1753)Figure 3.8 shows the resulting dendrograms generated from HC, SHC, MrSHCand HC-HMV. HC misclassifies 6 samples. SHC achieves slightly better accuracy– 5 misclassified samples, with the automatically chosen q = 340. MrSHC withr = 2 achieves the same accuracy with less (q = 100) features. HC-HMV withq = 100 features gives similar accuracy as MrSHC and SHC.52Figure 3.8: Dendrograms generated by HC, SHC, MrSHC and HC-HMV forvan’t Veer et al. [35]’s data. (n = 77, p = 4751)3.6 Robustification of MrSHCIn this section, we focus on improving the robustness of MrSHC. We considerMrSHC to be relatively robust to outliers due to the use of indicator weights for theselected features – the effect of distortions on the feature weights generated fromSPC is minimized. However, since SPC is not robust to outliers (see Chapter 2 formore details), we propose to further improve the robustness of MrSHC by usingRSPC-τ instead of SPC in Algorithm 3. The robust version of Algorithm 3 ispresented in Algorithm 7.53Algorithm 7 Robust feature set selection1: Input: X n×p, q, r, d (default d = 0).2: Assign λ− = 1 and λ+ =√n (λ+ can be set smaller in practice).3: Repeat Step 4-8.4: Apply RSPC-τ to X with λ = (λ−k + λ+k )/2; obtain the first r robust sparsePCs.5: q∗ := the number of variables with non-zero loadings in any of the r robustsparse PCs.6: Cr := the set of q∗ chosen variables.7: Break if q−d ≤ q∗ ≤ q+d.8: If q∗ > q+d, λ+ = λ ∗; if q∗ < q−d, λ− = λ ∗.9: Output: Cr, q∗.Replacing Algorithm 3 in place with Algorithm 7 results in a robust version ofMrSHC – Robust-MrSHC. To compare the performance of Robust-MrSHC, Mr-SHC and SHC, we first revisit the simulation studies in Section 3.4.We first generate data sets as described in Section 3.4.1. To illustrate the ro-bustness of Robust-MrSHC, we add outliers in the simulated data set as follows.For each clusters, randomly select two entries in the noise features and two en-tries in the clustering features to be replaced by values from N(0,152). The resultsobtained from 100 simulation runs are presented in Table 3.5. We can see thatRobust-MrSHC achieves the smallest CER and better RR than SHC.HC SHC MrSHCCER q CER RR q CER RR qKnown (q = 50) 0.176 500 0.425 0.299 50 0.021 0.839 50Unknown q 0.176 500 0.437 0.364 28.1 0.060 0.812 39Table 3.5: Average CER, RR, and q for different clustering methods.We then generate data sets as described in Section 3.4.2. Again, for each clus-ters, we randomly select two entries in the noise features and two entries in theclustering features and replace them with values drawn from N(0,152). The re-sults obtained from 100 simulation runs are presented in Table 3.6. We can see thatRobust-MrSHC achieves the smallest CER and better RR than SHC.Finally, we revisit the microarray examples in Section 3.5 and compare theperformance of HC, SHC, MrSHC and Robust-MrSHC.54HC SHC MrSHCCER q CER RR q CER RR qKnown (q = 50) 0.175 500 0.382 0.285 50 0.101 0.737 50Unknown q 0.175 500 0.385 0.327 41.7 0.123 0.756 44.5Table 3.6: Average CER, RR, and q for different clustering methods.The default list of candidate q for SHC, MrSHC and Robust-MrSHC is reducedto {25,50,75, . . . ,400} from {20,40,60, . . . ,400} to accommodate the increasedcomputational cost of Robust-MrSHC.3.6.1 Lymphoma Data Set in Dettling [12]We revisit the lymphoma data set to show the performance of Robust-MrSHC.To assess the robustness and performance of different methods in the presence ofoutliers, we randomly select five observations and ten features that were not cho-sen by any of the sparse clustering methods when applied to the clean data. Foreach chosen observation, the values of the ten noise features are replaced by val-ues generated from N(15,1). Since the contaminations are in the noise features,a robust feature selection procedure will remain unchanged and avoid selectingthose noise features, while a non-robust procedure tends to select those noise fea-tures due to contamination. We apply HC, SHC, MrSHC and Robust-MrSHC tothe contaminated data set. The dendrograms generated from HC, SHC, MrSHCand Robust-MrSHC are shown in Figure 3.9. Colors are used to indicate the threetumor types.HC only misclassifies two red samples, however, it is affected by outliers inthe noise features and groups the outliers into a standalone cluster. SHC is alsoseverely affected by the outliers and gives mixed clusters with the automaticallychosen q = 25. The chosen 25 features include almost all the contaminated noisefeatures which explains the cluster consisting of the outliers. MrSHC choosesq = 175 and uses rank r = 4, and it generates a cluster of outliers since some con-taminated noise features are chosen by mistake and as a consequence, the green andblue clusters are not clearly separated. On the other hand, Robust-MrSHC choosesq = 250 features with rank r = 4 and it is not affected by the contaminations (thecontaminated noise features are not chosen) while successfully separating the three55underlying clusters. It is clear that Robust-MrSHC is more robust and outperformsHC, SHC and MrSHC in this case.Figure 3.9: Dendrograms generated by HC, MrSHC, Robust-MrSHC andSHC for contaminated Dettling [12] data (n = 62, p = 4026)3.6.2 Breast Cancer Data Set in Perou et al. [24]We revisit the breast cancer data set in Perou et al. [24]. As mentioned in previoussections, this data set was pre-processed before being published. As such, there areno outliers in the data set. Again, we contaminate this data set with a few outliers.56We randomly select five observations and ten features that were not chosen by anyof the sparse clustering methods on the clean data. For each chosen observation,the values of the ten noise features are replaced by values generated from N(5,1).Figure 3.10 shows the dendrograms generated from HC, SHC, MrSHC andRobust-MrSHC. Colors are used to indicate the suggested four tumor groups. HCgives mixed clusters. SHC automatically chooses q = 50 features, including somecontaminated noise features. Therefore, the outliers form a single cluster and SHCalso has trouble separating the green and blue clusters. MrSHC also chooses q= 50with rank r = 1. Although it is affected less by the outliers, it still gives mixed clus-ters in general. Robust-MrSHC automatically chooses q = 300 features with rankr = 1. It separates the green and blue clusters successfully while group the orangeand red clusters closely together. This is coherent with the findings in Section 2.5.The dendrogram generated from SHC on the clean data found four main clusterswith two red samples were misclassified as orange. This is counter-intuitive be-cause the red and orange clusters are otherwise very well separated. The resultsfrom Robust-MrSHC is consistent with those from the FRSHC methods (see Sec-tion 2.5), which suggests that the red cluster may be further split into two sub-clusters, one of which is close to the orange cluster. In general, Robust-MrSHC isaffected the least by the outliers comparing to the other three methods.3.6.3 Breast Cancer Data Set in van’t Veer et al. [35]As pointed out in Section 2.5, this data set is suspected to contain a mild outlier.Therefore, we directly apply Robust-MrSHC and present its dendrogram in Fig-ure 3.11.Robust-MrSHC only misclassifies 4 samples with q = 25 features and rankr = 4. As presented in Section 3.5, the numbers of misclassified samples from HC,SHC and MrSHC are 6, 5 and 5 respectively. Therefore, Robust-MrSHC achievesbetter clustering accuracy with a smaller number of features on this data set with amild outlier.57Figure 3.10: Dendrograms generated by HC, MrSHC, Robust-MrSHC andSHC for contaminated Perou et al. [24] data (n = 62, p = 1753)3.7 DiscussionIn this section, we propose the multi-rank sparse hierarchical clustering (MrSHC),which automatically selects clustering features with higher rank considerations andproduces the corresponding sparse hierarchical clustering. As demonstrated in sim-ulation studies and real data examples, MrSHC gives superior feature selection andclustering performance comparing with the classical hierarchical clustering and the58Figure 3.11: Dendrograms generated by Robust-MrSHC for van’t Veer et al.[35]’s data. (n = 77, p = 4751)sparse hierarchical clustering proposed by Witten and Tibshirani [38]. We also en-dow MrSHC with the capability of dealing with data contamination and show therobustness and performance of the resulting Robust-MrSHC with several real dataexamples.59Chapter 4Regression Phalanxes4.1 IntroductionTomal et al. [34] introduced a novel approach for building diverse classificationmodels, for the ensembling of classification models in the context of rare-class de-tection in two-class classification problems. They proposed an algorithm to dividethe often large number of features (or explanatory variables) into subsets adaptivelyand build a base classifier (e.g. Random Forests) on each subset. The various clas-sification models are then ensembled to produce one model, which ranks the newsamples by their probabilities of belonging to the rare class of interest. The essenceof the algorithm is to automatically choose the subset groups such that variables inthe same group work well together for classification tasks; such groups are calledphalanxes.In this chapter, we propose a different class of phalanxes for application ingeneral regression tasks. We define a “Regression Phalanx” – a subset of featuresthat work well together for regression (or prediction). We then propose a novelalgorithm, with hierarchical clustering of features at its core, that automaticallybuilds Regression Phalanxes from high-dimensional data sets and builds a regres-sion model for each phalanx for further ensembling.In principle, any given regression method can be used as the base regressionmodel. The goal is to improve the prediction accuracy of the base method. Inthis chapter, we mainly focus on two well-established regression models: Lasso60[32] and Random Forests [6]. These two methods are known to have superior per-formance in various regression and prediction applications. For each applicationin this chapter, we first compare the performances of Lasso and Random Forests(RF). The better performing method between the two is then chosen as the baseregression model for building Regression Phalanxes.The idea of ensembling Regression Phalanxes is promising because each Re-gression Phalanx is relatively low-dimensional. Thus, each variable makes a moresignificant contribution in the fitted model. Compared to training a full modelwhere variables compete with each other in contributing to the final fit, more use-ful variables are likely to contribute to the ensembled regression model.Our proposed phalanx-forming procedure resembles a hierarchical clusteringof features (instead of samples), where “similarity” between a pair of features (orsubsets of features) is defined by how well they work together in the same regres-sion or prediction model. With properly defined similarity measures, features canbe then hierarchically merged into different phalanxes.The rest of chapter is organized as follows. Section 2 presents the details ofour proposed algorithm for building Regression Phalanxes. Section 3 presents asimple illustrative example, which forms the basis for simulation studies in Section4. In Section 5, we demonstrate the performance of Regression Phalanxes on fouradditional real data sets. Finally, we conclude with some remarks and discussionof future work.4.2 Phalanx Formation AlgorithmIn this section, the details of the Regression Phalanx formation algorithm are pre-sented. The procedure is an agglomerative approach to build Regression Phalanxes,which is, in essence, a hierarchical clustering of variables. There are four key steps:1. Initial grouping. Form d initial groups from the original D variables (d ≤D).2. Screening of initial groups. Screen out the underperforming initial groups toobtain s≤ d groups.3. Hierarchical merging into phalanxes. Hierarchically merge the s screenedgroups into e≤ s candidate phalanxes.614. Screening of candidate phalanxes. Screen out the underperforming candidatephalanxes to obtain h≤ e final phalanxes.A sketch of the procedure is presented in Figure 4.1. Each step of the phalanx-forming procedure is explained in more details in the following sections.Figure 4.1: A sketch of the phalanx-formation procedure. D variables arepartitioned into d initial groups, screened down to s groups, combinedinto e candidate phalanxes, and then screened down to h phalanxes inthe final ensemble (D≥ d ≥ s≥ e≥ h).4.2.1 Initial GroupingThis is an optional step. If this step is omitted, each initial group contains a singleindividual feature and the number of initial groups equals the number of features.As a result, the following steps in the phalanx-formation steps become more com-putational intensive since the time complexity is quadratic in the number of groups.Thus, it is recommended that the features be grouped into fewer initial groups ifthey lend themselves to natural grouping (e.g. initial groups can be identified by62features with similar names). Also, if an initial group only contains a binary fea-ture, its corresponding model is likely to be weak since it can only predict twopossible values. On the other hand, an initial group with k binary features can pro-duce up to 2k different predictions. Thus, we recommend the grouping of the binaryfeatures into initial groups. If the data set contains a large number of features butno obvious hints for natural grouping, we can still use hierarchical clustering toobtain the initial groups. For binary features, Tomal et al. [34] proposed to use theJaccard dissimilarity index, defined between binary features xi and x j asdJ(xi,x j) = 1− xi∩ x jxi∪ x j .Here xi∩ x j is the number of observations where the corresponding pair of entriesof xi and x j both take the value 1, and xi∪ x j is the number of observations whereat least one of the corresponding entries of xi or x j take the value 1. It is easy to seethat 0≤ dJ(xi,x j)≤ 1. For continuous features (or a mix of binary and continuousfeatures), we propose to use the “1-Abs(Correlation)” dissimilarity measure. Thatis, the dissimilarity between variable xi and x j is calculated asdC(xi,x j) = 1−|corr(xi,x j)|.Notice that the data set needs to be transposed before clustering so that thefeatures are clustered instead of the observations.This step partitions the original D features into d≤D initial groups g1,g2, . . . ,gd .4.2.2 Screening of Initial GroupsHigh-dimensional data are likely to contain noise features which contribute littleor even negatively to the prediction task. In such cases, initial groups need to bescreened so that noisy initial groups do not participate in the following steps. Wefirst introduce some notation and then we present two tests for the screening of theinitial groups.63NotationWe first define some notations to be used in the screening procedure. Denote byc the assessment criterion of a given regression task. Typically c is defined as themean squared error (MSE) of predictionc =1NN∑i=1(yi− yˆi)2, (4.1)where y = (y1, . . . ,yN)T and yˆ = (yˆ1, . . . , yˆN)T are the observed values and theirpredictions, respectively, of the response at N test points. The data available forthe application in Section 4.5.2 allow separate test data, but usually the test pointswill be generated by cross validation or related methods using the training data.To assess accuracy based on training data only, different strategies are used forthe two candidate regression methods, Lasso and RF.• Lasso.In Lasso, the predictions are produced by K-fold Cross-Validation (we chooseK = 5 through out the chapter). More specifically, the data set X and the cor-responding y are randomly grouped into K folds (X (1),y(1)), . . ., (X (K),y(K))with n(1), . . . ,n(K) observations respectively (∑Ki=1 n(i) = n). Then the predic-tions for y(i), namely yˆ(i) = (yˆ(i)1 , . . . , yˆ(i)n(i))T is obtained byyˆ(i)j = fˆ(−i)(x(i)j ), j = 1, . . . ,n(i)where fˆ(−i)(·) is the Lasso model fit from the (K− 1) folds other than thei-th fold, and x(i)j is the corresponding feature vector of y(i)j . We denote theassessment criterion c for Lasso as the Cross-Validation MSE (CV-MSE).• Random Forests (RF).In RF, yˆ can be obtained from the out-of-bag (OOB) predictions. The pre-diction yˆOOBi is obtained from only the trained trees that do not have yi intheir bootstrapped sample, hence, the prediction is called the out-of-bag pre-diction. We denote the assessment criterion c for RF as the out-of-bag MSE(OOB-MSE).64We further denote yˆ(gi) as the vector of predictions from the base regressionmodel (e.g. Lasso or RF) using only the variables in gi. Let ci = c(yˆ(gi)) denotethe assessment measure. Denote by yˆ(gi∪g j) the predictions when the model is fitusing the variables in gi and g j (i 6= j), and denoteci j = c(yˆ(gi∪g j)) (4.2)as the resulting performance. Similarly, letc¯i j = c((yˆ(gi)+ yˆ(g j))/2) (4.3)be the performance of the ensemble of the predictions from gi and g j.Tests for Screening Initial GroupsA group survives the screening if it passes the two tests described as follows. Agroup gi passes the first test if its own performance is “strong”, i.e. high predictionaccuracy. A group survives the second test if it produces “significant combin-ing improvement”, i.e., after combining with another group g j, the model trainedon the combined variables (from gi and g j) produces significantly better accuracycomparing to that from g j.We use a null permutation distribution to establish the baseline for strong indi-vidual performance and significant combining improvement. Denote y˜ as the vec-tor of permuted response values in which the original vector of response variablevalues y is randomly permuted relative to the features. Then the counterparts of ciand ci j are calculated with y˜ as the response and denoted as c˜i and c˜i j respectively.Denote the α quantile of c˜i (i = 1, . . . ,d) as p˜α and the 1−α/(d− 1) quantileof c˜i− c˜i j (i = 1, . . . ,d; j = 1, . . . ,d) by q˜1−α/(d−1). Then a group gi survives thescreening if it passes both the following two tests:1. gi is strong itself:ci ≤ p˜α . (4.4)The rationale is that the strength of an individual initial group should be com-petitive with the α quantile of the strengths of initial groups with a randomly65permuted response.2. gi improves the strength of any other group g j when gi and g j are combinedto build a regressor:q˜1−α/(d−1) ≤ c j− ci j (4.5)The rationale is that the improvement from combining gi and g j should becompetitive with the 1− α/(d − 1) quantile of combining improvementsof initial groups with a randomly permuted response. The quantile is 1−α/(d−1) to adjust for the (d−1) tests for each initial group.After the screening of initial groups, a list of surviving groups is relabeled as{G1,G2, . . . ,Gs} for the next step.4.2.3 Hierarchical Formation of Candidate PhalanxesWe use simple greedy hierarchical clustering techniques to merge Groups {G1,G2,. . ., Gs} into phalanxes, which proves to be effective in all of our applications. Eachiteration merges the pair of groups Gi and G j that minimizesmi j = ci j/c¯i j. (4.6)Here mi j < 1 indicates that combining Gi and G j to build a single model providesmore strength than ensembling two models built separately on Gi and G j. Thenumber of groups, s, decreases by 1 after each merge. The merging process stopswhen mi j ≥ 1 for all i, j, indicating that further merging damages the performanceand the resulting groups, i.e. e candidate phalanxes PX1,PX2, . . . ,PXe, should benow considered for ensembling.4.2.4 Screening of Candidate PhalanxesSearching for the best subset of candidate phalanxes for further ensembling is acombinatorial problem. In order to reduce the computational cost, we establish thesearch path P in a forward selection fashion. We initialize P with the candidatephalanx with the smallest value of the criterion c. At each stage all the remainingcandidate phalanxes will be considered for ensembling with phalanxes in P one at66a time, and the one with the best ensembling performance with P will be addedto P. The ensembling performance c¯(1,2,...,h) is calculated as follows. The predic-tions of yi from PX1, . . . ,PXh are denoted by yˆi;(PX1), . . . , yˆi;(PXh), and the ensembledprediction for yi is calculated asyˆi;(1,...,h) =h∑j=1yˆi;(PX j)/h (4.7)The ensemble’s performance is then calculated from the ensembled predictions.For ease of description, we assume the order of entry to be PX1,PX2, . . . ,PXk,and the corresponding ensembling performance as each candidate is added to becPX{1} ,cPX{1,2} , . . . , cPX{1,...,k} respectively. Then the set of candidate phalanxes cor-responding to the best ensembling performance, say cPX1,...,h , will be selected, andthe remaining candidates are screened out.After the screening of weak phalanxes, the surviving h phalanxes are the finalphalanxes, and they will be ensembled in the last step.4.2.5 Ensembling of Regression PhalanxesWe fit a model for each of the h phalanxes of variables, and obtain predictionsfrom them. (For RF, we can increase the number of trees and get better OOB pre-dictions as final predictions.) For a test point, the h predictions from the ensembleof regression phalanxes (ERPX) are averaged to give the final prediction.4.3 A Simple Illustrative Example: OctaneThe octane data set [15] consists of NIR absorbance spectra over 226 wavelengthsranging from 1102 nm to 1552 nm with measurements every two nanometers. Foreach of the 39 production gasoline samples the octane number was measured.It is known that the octane data set contains six outliers (cases 25, 26, 36–39)to which alcohol was added. We omit those outliers to obtain 33 clean samples.We apply Lasso and RF separately on the data set, then we choose Lasso as thebase regression model since it produces better results: the MSE of Lasso is about0.085 compared with about 0.27 for RF.The R package “glmnet” [16] is used for fitting Lasso models and the penalty67parameter λ is chosen by using “cv.glmnet” with method “lambda.1se”. Sincecross validation is used for choosing the tuning parameter λ for Lasso, the Phalanxformation procedure has inherited randomness. Therefore, we apply both ERPXand the original Lasso to the Octane data set three times each. For ERPX, weskip the initial grouping step since the number of features is relatively small andthe features are all continuous. Different numbers of surviving groups and finalphalanxes are obtained. The accuracies of both methods are assessed by CV-MSE.For each run, results for mean CV-MSE over 20 CV repetitions are presented inTable 4.1.Number ofGroups Phalanxes CV-MSERun Variables Initial Screened Candidate Screened ERPX Lasso1 226 226 190 7 2 0.051 0.0842 226 226 195 8 5 0.049 0.0863 226 226 192 9 2 0.044 0.083Table 4.1: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for the octanedata set.We can see that the CV-MSE values from ERPX are much smaller than thoseobtained from the original Lasso models, which confirms that ERPX can boost theperformance of the base regression model.In the following section, we generate synthetic data sets based on the octanedata. We simulate data favoring Lasso and RF respectively as the base regressionmodel and show that we are able to improve the performance over the base regres-sion model using the proposed ERPX.4.4 Simulation StudiesIn this section, we present several numerical experiments to demonstrate the per-formance of the proposed ERPX. We simulate data by first emulating the featurestructure of the octane data. Then we generate response data as a function of thefeatures in two different ways, to represent linear and nonlinear relationships withthe features, respectively.68The data sets X (n×p) are generated based on the octane data as follows.• X = (x1, . . . ,xp), x j = (x1 j,x2 j, . . . ,xn j)T ( j = 1, . . . , p), where n = 33, p =226 as in the octane data set.• X is sampled from multivariate normal distribution N(µ ,Σ), where µ con-sists of p column means of the octane data.• The covariance matrix Σ is equal to the sample covariance matrix as observedor from a perturbation of the data. Specifically, we add different levels ofnoise into the octane data, and for each noise level we take the resultingsample covariance matrix for Σ. We consider three noise levels:– No noise: Σ is the sample covariance matrix of the octane data.– Medium noise: Random samples from N(0,3σ) are added to each el-ement of the octane data, where σ is the minimum feature standarddeviation among all the 226 features of the octane data. Σ is then thesample covariance matrix of the modified octane data.– High noise: Random samples from N(0,5σ) are added to each ele-ment of the octane data. Σ is then the sample covariance matrix of themodified octane data.For each noise level we simulate the feature set X 100 times. The strategy forgenerating y depends on the choice of the base regression model.4.4.1 Lasso as Base Regression ModelTo favor Lasso, for each of the 100 simulated X , we generate the response variabley in a linear pattern as follows.• Randomly select 10 features xk1 , . . . ,xk10 from the columns of X .• For each feature xk j ( j = 1, . . . ,10), generate the corresponding coefficientβ j from a Uniform distribution U(0,1).• Generate yinit as yinit = ∑10j=1β jxk j .69• Generate y as y =min(yoct)+a ·yinit+ε , where a is a scale constant (max(yoct)−min(yoct))/(max(yinit)−min(yinit)), yoct is the original response variablefrom the octane data set, and ε contains noises generated from N(0,1).For each set of simulated data (X ,y), we apply ERPX with base regression modelsLasso and RF respectively. Since the response variable is generated as a linearcombination of the predictors, Lasso is expected to perform at least as good as RF.The performance is measured by 5-fold CV-MSE and OOB-MSE for Lasso andRF respectively.Average number ofGroups MSEBase Noise Variables Initial Screened Phalanxes ERPX BaseLassoNo 226 226 172.32 6.01 0.96 1.43Medium 226 226 138.32 3.56 0.95 1.58High 226 226 88.75 2.78 1.07 1.71RFNo 226 226 99.58 2.59 0.98 1.32Medium 226 226 62.93 2.67 0.96 1.40High 226 226 33.35 2.65 1.02 1.56Table 4.2: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies of base regres-sion models and ERPX for different noise levels when calculating samplecovariance matrix Σ.We can see from Table 4.2, for all the simulation settings, regardless of thechoice of the base regression model, ERPX produces more accurate predictionsthan the corresponding base regression model with large relative margins. Some-what surprisingly, RF slightly outperforms Lasso in around 70% of the 300 simu-lated data sets (and also on average). This could be caused by the use of differentmetrics, OOB-MSE versus CV-MSE, when the sample size is small.4.4.2 Random Forests as Base Regression ModelTo generate the response variable y from highly nonlinear relationships favoringRF, we deploy the following strategy.• Randomly select 10 features xk1 , . . . ,xk10 from the columns of X .70• Generate two sets of coefficients β 1 =(β11, . . . ,β1,10)T and β 2 =(β21, . . . ,β2,10)Tfrom a Uniform distribution U(0,1).• Generateyiniti =∑10j=1β1 jxk j xik1 < median(xk1)∑10j=1β2 jxk j xik1 ≥median(xk1)Since the initial values in yinit are generated from a mixture of two linearpatterns, they cannot be easily modelled by linear methods such as Lasso.• Generate y = min(yoct)+ a · yinit + ε , where a = (max(yoct)−min(yoct))/(max(yinit)−min(yinit)), a scale constant, and ε contains noises generatedfrom N(0,1).For each set of simulated data (X ,y), we again apply ERPX with either Lasso orRF as the base regression model. In this case, we expect RF to outperform Lasso.We can see from Table 4.2, for all the simulation settings, RF outperforms Lassoas anticipated given the simulation set up, and ERPX improves upon both baseregression models by large relative margins.Average number ofGroups MSEBase Noise Variables Initial Screened Phalanxes ERPX BaseLassoNo 226 226 150.10 5.29 2.75 3.83Medium 226 226 92.79 2.65 2.67 3.80High 226 226 48.83 2.48 2.52 3.71RFNo 226 226 72.83 1.90 1.59 2.15Medium 226 226 37.52 2.04 1.66 2.49High 226 226 16.43 1.86 1.57 2.51Table 4.3: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies of base regres-sion models and ERPX for different noise levels when calculating samplecovariance matrix Σ.4.5 Additional Real Data ExamplesThe prediction performance of ERPX is illustrated on the following data sets.714.5.1 AID364 Data SetAssay AID364 is a cytotoxicity assay conducted by the Scripps Research Insti-tute Molecular Screening Center. There are 3,286 compounds used in our study,with their inhibition percentages recorded. Visit http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=364 for details. Because toxic reactions can occur in manydifferent ways, this assay is expected to present modelling challenges. We con-sider five sets of descriptors for the assay, to make 5 data sets. The descriptorsets are the following: atom pairs (AP), with 380 variables; Burden numbers [BN,8], with 24 variables; Carhart atom pairs [CAP, 9], with 1585 variables; fragmentpairs (FP), with 580 variables; and pharmacophores fingerprints (PH), with 120variables. The Burden numbers are continuous descriptors, and the other four arebit strings where each bit is set to “1” when a certain feature is present and “0”when it is not. See Liu et al. [21] and Hughes-Oliver et al. [18] for further expla-nation of the molecular properties captured by the descriptor sets.The initial groups for the descriptor sets are determined by their features names.For example, related features for FP present similar names such as AR 01 AR,AR 02 AR, . . ., AR 07 AR, and such features will form the initial groups. Weperform our proposed ERPX on each of the five descriptor sets. The base regressionmodel is chosen to be RF due to its superior performance over Lasso in this case(see Table A.1 in Appendix).ERPX and the original RF are run on each of the five descriptor sets as well asthe five descriptor sets combined as a whole set, three times each. The results arepresented in Table 4.4. As we can see, ERPX provides superior prediction accuracyover the original RF, and the margin gets bigger with all five descriptor sets mergedtogether. This is because ERPX can exploit the “richness” of features to improveprediction accuracy.Due to the naming schema of the features, the descriptor sets lend themselveswell to obtaining initial groups. However, we show that ERPX still produces com-parable prediction accuracies with initial groups obtained from the hierarchicalclustering approach described in Section 2.1. We choose the same number of ini-tial groups as shown in Table 4.4 to facilitate comparison. Table 4.5 presents thenew results for descriptor sets other than BN (since no initial grouping is needed).72SetNumber ofGroups Phalanxes OOB MSERun Variables Initial Screened Candidate Screened ERPX RFBN1 24 24 15 4 4 122.41 126.702 24 24 14 5 5 123.81 126.973 24 24 15 5 5 121.64 127.33PH1 120 21 15 3 2 131.59 135.702 120 21 16 2 2 131.50 134.623 120 21 17 2 2 131.67 135.57FP1 580 105 64 9 2 120.03 125.532 580 105 52 9 3 120.26 126.153 580 105 44 6 2 121.80 127.09AP1 380 78 35 4 3 124.40 132.072 380 78 33 4 2 124.31 131.733 380 78 31 5 3 124.92 131.69CAP1 1585 666 93 11 2 116.04 131.122 1585 666 73 9 2 118.86 130.743 1585 666 95 10 3 116.40 131.25ALL51 2689 895 159 21 6 112.51 125.872 2689 895 208 20 5 113.69 124.783 2689 895 204 18 4 112.80 125.13Table 4.4: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for the AID 364assay and five descriptor sets. Three runs of ERPX are presented.As we can see, the prediction accuracies are comparable to those in Table 4.4 wherethe initial grouping is done according to feature names.4.5.2 CLM Data SetThe community land model with carbon/nitrogen biogeochemistry (CLM-CN) isa state-of-the-art land surface model to make future climate projections. Sargsyanet al. [28] performed simulations using CLM-CN for a single plant functional type:temperate evergreen needleleaf forest. The outputs were 100-year-later projectedvalues of several quantities of interest (QoIs). The simulator was run 10,000 timesfor different settings of 79 input parameters, out of which 9983 runs succeeded.We make predictions for two QoIs, leaf area index (LAI) and total vegetationcarbon (TOTVEGC), from the 79 input variables. The two QoIs are log scaledsince their sample distributions are highly right-skewed. The 9983 runs contain38% and 0.07% of zeros, respectively. Therefore, we add constants 10−10 and10−13 to the two QoIs respectively before we apply the log scaling (these valuesare roughly equal to the minima of the respective non-zero values). Since the cli-73SetNumber ofGroups Phalanxes OOB MSERun Variables Initial Screened Candidate Screened ERPX RFPH1 120 21 14 3 2 132.06 135.012 120 21 15 3 2 130.68 134.213 120 21 15 3 2 131.47 134.78FP1 580 105 47 6 3 120.85 125.732 580 105 44 9 2 122.85 125.533 580 105 44 10 3 123.87 126.48AP1 380 78 20 4 2 125.23 132.352 380 78 27 5 2 124.02 131.273 380 78 20 5 2 123.95 132.77CAP1 1585 666 57 7 3 120.76 129.822 1585 666 74 10 3 118.26 131.843 1585 666 58 7 5 121.35 131.16ALL51 2689 895 143 12 4 114.15 125.002 2689 895 89 11 6 115.46 123.673 2689 895 74 12 5 116.44 125.35Table 4.5: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for the AID 364assay and four descriptor sets, with initial groups obtained from the hier-archical clustering. Three runs of ERPX are presented.mate projections are affected by a number of uncertainties in CLM-CN, predictingthe LAI and TOTVEGC is a challenging task. We choose RF as the base regressionmodel due to its superior performance versus Lasso in this example (see Table A.2in Appendix). We apply both ERPX and the original RF to demonstrate their per-formance. For ERPX, we skip the initial grouping step since there are only 79variables.The 9983 observations are randomly split into training and testing sets with5000 and 4983 observations respectively. We repeat the random splitting 20 timesto obtain 20 different pairs of training and testing sets. For each random split,both ERPX and RF are trained on the randomly sampled training set and appliedto the corresponding test set. Therefore, we can generate boxplots of both the 20OOB-MSEs from the training sets and the 20 test MSEs from the testing sets. Theboxplots are shown in Figure 4.2. It is clear that ERPX provides more accuratepredictions than RF, for both TOTVECG and LAI as response variables.We also present the detailed results from the first three splits in Table 4.6. Sincethe number of phalanxes is always one in all the runs, the mainly difference be-tween ERPX and RF is the screening of initial groups. By screening out most of74Figure 4.2: Boxplots of OOB MSE (from 5000 randomly selected trainingsamples) and test errors (from 4983 corresponding testing samples) ob-tained from 20 random splits of the data into training and testing setsthe less important initial groups, ERPX is able to produce more accurate predictionthan RF with only a small number of “strong” initial groups.QoINumber ofGroups Phalanxes OOB MSERun Variables Initial Screened Candidate Screened ERPX RFTOTVEGC1 79 79 16 1 1 3.47 3.662 79 79 14 1 1 3.57 3.673 79 79 14 1 1 3.10 3.62LAI1 79 79 18 1 1 10.34 12.632 79 79 20 1 1 11.02 12.943 79 79 14 1 1 11.40 12.89Table 4.6: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for TOTVEGCand LAI from CLM-CN simulations. Three experiments based on ran-dom splitting of training and testing sets are presents.4.5.3 Gene Expression Data SetScheetz et al. [29] conducted a study of mammalian eye diseases where the gene ex-75pressions of the eye tissues from 120 twelve-week-old male F2 rats were recorded.A gene coded as TRIM32 is of particular interest here since it is responsible forcausing Bardet-Biedl syndrome.According to Scheetz et al. [29], only 18976 probe sets exhibited sufficient sig-nal for reliable analysis and at least 2-fold variation in expressions. The intensityvalues of these genes are evaluated on the logarithm scale and normalized usingthe method in Bolstad et al. [4]. It is believed from previous studies that TRIM32is only linked to a small number of genes, so following Scheetz et al. [29] we con-centrate mainly on the top 5000 genes with the highest marginal sample variance.Again, we choose RF (MSE around 0.0128) as the base regression model overlasso (MSE around 0.0131). We apply ERPX and the original RF to this data setthree times each. The results are presented in Table 4.7. For ERPX, we skip theinitial grouping step.Number ofGroups Phalanxes OOB MSERun Variables Initial Screened Candidate Screened ERPX RF1 5000 5000 659 4 2 0.0112 0.01302 5000 5000 578 4 3 0.0114 0.01253 5000 5000 513 5 4 0.0110 0.0128Table 4.7: Number of variables, initial groups, screened groups, candidatephalanxes, screened phalanxes and prediction accuracies for the gene ex-pression data set.It is clear that ERPX is providing more accurate predictions than RF for thisdata set.4.5.4 Glass Data SetThe glass data set [20] was obtained from an electron probe X-ray microanalysis(EPXMA) of archaeological glass samples. A total of 180 glass samples wereanalyzed and each glass sample has a spectrum with 640 wavelengths. The goal isto predict the concentrations of several major constituents of glass, namely, Na2O,SiO2, K2O and CaO, from the spectrum. For different responses, the choice ofthe base regression model varies, since neither RF nor Lasso performs uniformlybetter accross the response variables. We apply ERPX and the corresponding base76regression model to each of the six responses three times each. The results arepresented in Table 4.8. As we can see, ERPX improves the prediction accuracy ofthe corresponding base regression method.Number ofGroups Phalanxes MSEBase Run Variables Initial Screened Candidate Screened ERPX BaseNa2O RF1 640 640 224 3 3 0.78 0.942 640 640 247 2 2 0.86 0.923 640 640 245 2 2 0.85 0.96SiO2 RF1 640 640 86 1 1 0.98 1.232 640 640 86 2 2 0.98 1.263 640 640 89 1 1 0.97 1.22K2O Lasso1 640 640 309 1 1 0.094 0.1002 640 640 425 1 1 0.093 0.0933 640 640 257 1 1 0.096 0.096CaO Lasso1 640 640 302 1 1 0.109 0.1112 640 640 201 1 1 0.109 0.1113 640 640 265 1 1 0.111 0.112Table 4.8: Number of variables, base regression model, initial groups,screened groups, candidate phalanxes, screened phalanxes and predictionaccuracies (OOB-MSE for RF as base regression model, 5-fold CV-MSEfor Lasso as base regression model) for four responses of the glass dataset.4.6 ConclusionIn this chapter, we propose a novel framework called ensemble of Regression Pha-lanxes (ERPX). We propose to divide a often large number of features into subsetscalled Regression Phalanxes. Separate predictive models are built using features ineach Regression Phalanx and they are further ensembled.The proposed approach is widely applicable. We have demonstrated it on a va-riety of applications spanning chemistry, drug discovery, climate-change ecology,and gene expression. The simulated examples and real applications demonstratethat ERPX can take advantage of the richness of the features and produce gains inprediction accuracy over effective base regression model such as Lasso and RF.77Chapter 5Conclusion and Future Work5.1 SummaryIn this thesis, we focus on two use cases of hierarchical clustering – clusteringobservations for exploratory analysis and clustering features for adaptive featuregrouping and ensembling.Hierarchical clustering is widely used to cluster observations for exploratoryanalysis. Although hierarchical clustering is a well-established technique for thistask, it may yield unsatisfactory clustering accuracy and interpretation in high-dimensional settings. We point out two main issues that may jeopardize the perfor-mance of the traditional hierarchical clustering – noise features and data contami-nations. Noise features in high-dimensional data may introduce undesired pertur-bations that cover the true underlying clusters, while data contaminations such asoutliers may deteriorate the feature selection quality.Witten and Tibshirani [38] proposed a hierarchical clustering framework calledSparse Hierarchical Clustering (SHC) which adaptively chooses the important fea-tures for clustering from the noise features. However, its performance deterioratessignificantly under data contaminations. Therefore, we propose a novel frame-work called Robust Sparse Hierarchical Clustering (RSHC) which handles noisefeatures and data contaminations simultaneously. In RSHC, the important featuresare chosen using a novel sparse principal component analysis called Robust SparsePrincipal Component with τ estimator (RSPC-τ). Using τ-estimator in place of the78non-robust cost function in the PCA optimization problem leads to a robust featureselection procedure. The features are selected if they correspond to non-zero load-ings in the Robust Sparse PC obtained from applying RSPC-τ to the original data.We show that using the unweighted features, we can build desirable dendrogramseven under data contaminations in both noise and important features. We also pro-pose a way of automatically choosing the number of important features to select.Through extensive simulations and real data examples, we show that the proposedRSHC outperforms both the traditional hierarchical clustering as well as SHC inboth clustering accuracy and interpretability.However, as we further study the performance of RSHC, we notice that thereare still room for improvements. In high-dimensional data, the important featuresthat separate the underlying clusters may exhibit a complex structure. We show thatRSHC and SHC are sometimes inefficient when selecting features with complexstructures. Therefore, we propose another framework called Multi-rank Sparse Hi-erarchical Clustering (MrSHC) that conducts robust feature selection for hierarchi-cal clustering, even if the features have complex structures. MrSHC incorporatesdifferent rank considerations in the feature selection process. Instead of choos-ing features based on one SPC, MrSHC considers features with non-zero loadingsin multiple SPC generated from Sparse Principal Component (or RSPC-τ for bet-ter robustness). We also propose algorithms to automatically choose the numberof rank and the number of important features. Using simulated and real data ex-amples, we show that MrSHC and its robust version Robust-MrSHC yield betterclustering accuracy and interpretability in high-dimensional settings even when thedata is contaminated and of complex structure.Using hierarchical clustering to cluster features for adaptive feature groupingand ensembling is a relatively new concept. It is first introduced in the context ofrare-class classification problems by Tomal et al. [34]. We focus on regression tasksand propose a general ensemble framework called Regression Phalanxes. We formRegression Phalanxes by adaptively selecting subsets of features. Features in oneRegression Phalanx work well together as predictors in a pre-defined regressionprocedure. Models built on different Regression Phalanxes are then ensembled toform a final regression model. We show that via simulation studies and real dataexamples, Regression Phalanxes can further boost the performance of a reasonable79regression model such as Lasso or Random Forests.5.2 Future Work5.2.1 Clustering ObservationsFor the purpose of clustering observations, the proposed RSHC and MrSHC canhandle noise features and outliers simultaneously. Future work is still needed toaddress the following issues.Missing ValuesIt is likely that there are missing values in the high-dimensional data. NeitherRSHC nor MrSHC is able to handle missing values. We would like to endowour proposed frameworks with automatic imputation mechanisms. A potential ap-proach would be replacing the τ estimator with a robust estimator that can handlemissing values in RSPC-τ . The new RSPC method will be able to handle bothoutliers and missing values in the feature selection procedure. After we obtain therobust sparse PCs, we can then impute the missing values according to the low-rankapproximations. The imputed data will be further used for hierarchical clustering.Parallel ComputingThe computational cost of MrSHC is relatively high. Fortunately, the feature andrank selection procedure is embarrassingly parallel. Therefore, we can make Mr-SHC scalable by paralleling computing. MrSHC is currently implemented in R,therefore, we plan to investigate packages such as snow or foreach. With a suf-ficient parallel computing package, the computational time needed for MrSHCwould be reduced significantly.Normalization of High-dimensional DataAll the real data examples used in this thesis are pre-normalized using subject-area knowledge before published publicly. However, we would like to extend ourframeworks to data sets without pre-normalization. A natural way of normalizingthe data is to standardize the features with their corresponding mean and standarddeviation. Although this approach can eliminate the effect of measurement units(e.g. meters, milli-meters, etc), it may result in significantly deteriorated cluster-ing accuracy since it may potentially downscale the effect (i.e. variance) of the80important features that contribute significantly to the separation of clusters. There-fore, we would like to explore other ways of normalization such that the resultingclustering accuracy and interpretability are satisfactory.5.2.2 Regression PhalanxesThe proposed Regression Phalanxes are ensembled through unweighted average oftheir predictions. We can potentially increase the prediction accuracy of Regres-sion Phalanxes by using weighted ensembling methods, where the weights for thephalanxes are chosen in an adaptive fashion according to their prediction accura-cies. Instead of just considering the overall prediction accuracy, we may changethe goal of Regression Phalanxes to ranking the predictions. For example, chemistsmay be interested in ranking the activity of the compounds. We will need to useobjective functions such as hit rate instead of mean squared errors. We may alsoconsider other types of base regression models other than Lasso and Random For-est to explore whether the boost in performance is model-dependent, e.g. tree-basedmethods tend to enjoy a larger boost in performance than linear methods. We alsoplan to investigate a parallel computing paradigm for the hierarchical clusteringphase of Regression Phalanxes. There are very few developments around parallelcomputing strategy for hierarchical clustering, which I think could be an importanttopic to work on.81Bibliography[1] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos,A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, et al. Distinct typesof diffuse large b-cell lymphoma identified by gene expression profiling.Nature, 403(6769):503–511, 2000. → pages 49[2] V. Batagelj. Generalized ward and related clustering problems.Classification and related methods of data analysis, pages 67–74, 1988. →pages 4[3] G. Boente and M. Salibian-Barrera. S-estimators for functional principalcomponent analysis. Journal of the American Statistical Association, 110(511):1100–1111, 2015. → pages 21[4] B. M. Bolstad, R. A. Irizarry, M. A˚strand, and T. P. Speed. A comparison ofnormalization methods for high density oligonucleotide array data based onvariance and bias. Bioinformatics, 19(2):185–193, 2003. → pages 76[5] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.→ pages 14[6] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001. → pages15, 16, 61[7] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification andregression trees, the wadsworth statistics/probability series. 1984.Wadsworth International Group, Belmont, CA. → pages 14[8] F. R. Burden. Molecular identification number for substructure searches.Journal of Chemical Information and Computer Sciences, 29(3):225–227,1989. → pages 72[9] R. E. Carhart, D. H. Smith, and R. Venkataraghavan. Atom pairs asmolecular features in structure-activity studies: definition and applications.82Journal of Chemical Information and Computer Sciences, 25(2):64–73,1985. → pages 72[10] H. Chipman and R. Tibshirani. Hybrid hierarchical clustering withapplications to microarray data. Biostatistics, 7(2):286–301, 2006. → pages28[11] C. Croux, P. Filzmoser, and H. Fritz. Robust sparse principal componentanalysis. Technometrics, 55(2):202–214, 2013. → pages 24[12] M. Dettling. Bagboosting for tumor classification with gene expression data.Bioinformatics, 20(18):3583–3593, 2004. → pages vi, xii, 49, 50, 51, 55, 56[13] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discriminationmethods for the classification of tumors using gene expression data. Journalof the American statistical association, 97(457):77–87, 2002. → pages 49[14] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press,1994. → pages 14[15] K. Esbensen, T. Midtgaard, and S. Scho¨nkopf. Multivariate Analysis inPractice: A Training Package. Camo As, 1996. → pages 67[16] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths forgeneralized linear models via coordinate descent. Journal of statisticalsoftware, 33(1):1, 2010. → pages 67[17] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes(with discussion). Journal of the Royal Statistical Society: Series B(Statistical Methodology), 66(4):815–849, 2004. → pages 7[18] J. M. Hughes-Oliver, A. D. Brooks, W. J. Welch, M. G. Khaledi,D. Hawkins, S. S. Young, K. Patil, G. W. Howell, R. T. Ng, and M. T. Chu.Chemmodlab: a web-based cheminformatics modeling laboratory. In silicobiology, 11(1-2):61–81, 2010. → pages 72[19] Y. Kondo, M. Salibian-Barrera, and R. Zamar. A robust and sparse k-meansclustering algorithm. arXiv preprint arXiv:1201.6082, 2012. → pages 18[20] P. Lemberge, I. De Raedt, and K. H. Janssens. Quantitative analysis of16-17th century archaeological glass vessels using pls regression of epxmaand mu-xrf data. Journal of chemometrics, 14(5):751–764, 2000. → pages7683[21] K. Liu, J. Feng, and S. S. Young. Powermv: a software environment formolecular viewing, descriptor generation, data analysis and hit evaluation.Journal of chemical information and modeling, 45(2):515–522, 2005. →pages 72[22] R. Maronna. Principal components and orthogonal regression based onrobust scales. Technometrics, 47(3):264–273, 2005. → pages 22[23] W. Pan and X. Shen. Penalized model-based clustering with application tovariable selection. The Journal of Machine Learning Research, 8:1145–1164, 2007. → pages 7[24] C. M. Perou, T. Sørlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A.Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, et al. Molecularportraits of human breast tumours. Nature, 406(6797):747–752, 2000. →pages vi, xi, xii, 31, 32, 33, 49, 50, 52, 56, 58[25] A. E. Raftery and N. Dean. Variable selection for model-based clustering.Journal of the American Statistical Association, 101(473):168–178, 2006.→ pages 7[26] P. Rousseeuw and V. Yohai. Robust regression means of S-estimators. InRobust and Nonlinear Time Series Analysis. Lecture Notes in Statist., 26,pages 256–272, New York, 1984. Springer. → pages 86[27] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation andvalidation of cluster analysis. Journal of computational and appliedmathematics, 20:53–65, 1987. → pages 42[28] K. Sargsyan, C. Safta, H. N. Najm, B. J. Debusschere, D. Ricciuto, andP. Thornton. Dimensionality reduction for complex models via bayesiancompressive sensing. International Journal for Uncertainty Quantification,4(1), 2014. → pages 73[29] T. E. Scheetz, K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L.Knudtson, A. M. Dorrance, G. F. DiBona, J. Huang, T. L. Casavant, et al.Regulation of gene expression in the mammalian eye and its relevance to eyedisease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006. → pages 75, 76[30] T. Strauss and M. J. von Maltitz. Generalising ward?s method for use withmanhattan distances. PloS one, 12(1):e0168288, 2017. → pages 484[31] W. Sun, J. Wang, Y. Fang, et al. Regularized k-means clustering ofhigh-dimensional data and its asymptotic consistency. Electronic Journal ofStatistics, 6:148–167, 2012. → pages 7[32] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal ofthe Royal Statistical Society. Series B (Methodological), pages 267–288,1996. → pages 11, 61[33] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clustersin a data set via the gap statistic. Journal of the Royal Statistical Society:Series B (Statistical Methodology), 63(2):411–423, 2001. → pages 42[34] J. H. Tomal, W. J. Welch, R. H. Zamar, et al. Ensembling classificationmodels based on phalanxes of variables with applications in drug discovery.The Annals of Applied Statistics, 9(1):69–93, 2015. → pages 60, 63, 79[35] L. J. van’t Veer, H. Dai, M. J. Van De Vijver, Y. D. He, A. A. Hart, M. Mao,H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, et al. Geneexpression profiling predicts clinical outcome of breast cancer. nature, 415(6871):530–536, 2002. → pages vi, xi, xii, 31, 34, 36, 49, 51, 53, 57, 59[36] W. N. Venables and B. D. Ripley. Modern applied statistics with S-PLUS.Springer Science & Business Media, 2013. → pages 14[37] S. Wang and J. Zhu. Variable selection for model-based high-dimensionalclustering and its application to microarray data. Biometrics, 64(2):440–448,2008. → pages 7[38] D. M. Witten and R. Tibshirani. A framework for feature selection inclustering. Journal of the American Statistical Association, 105(490):713–726, 2010. → pages vi, 7, 8, 9, 10, 18, 31, 34, 38, 39, 50, 51, 59, 78[39] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrixdecomposition, with applications to sparse principal components andcanonical correlation analysis. Biostatistics, 10(3):515–534, 2009. → pages9, 20[40] B. Xie, W. Pan, and X. Shen. Penalized model-based clustering withcluster-specific diagonal covariance matrices and grouped variables.Electronic journal of statistics, 2:168, 2008. → pages 7[41] V. Yohai and R. Zamar. High breakdown-point estimates of regression bymeans of the minimization of an efficient scale. Journal of the AmericanStatistical Association, 83(402):406–413, 1988. → pages 22, 8785Appendix ASupporting MaterialsAlgorithm to Compute RSPC-τFollowing the notations introduced in Section 2.3.1, we present an iterativeapproximation approach for computing RSPC-τ . Recall that the objective functionfor RSPC-τ is (2.3):mina,b,µ{p∑j=1σ̂2j +λ ||b||1}, subject to ||b||22 = 1,where λ is a tuning parameter that controls the sparseness of the feature weights,b.Since τ-estimator of scale is used for σ̂ j for j = 1, ..., p, we replace the notationσ̂ j by τ̂ j and re-write (2.3) asmina,b,µ{p∑j=1τ̂2j +λ ||b||1}, subject to ||b||22 = 1, (A.1)Given a,b and µ , let σ̂2j be the M-estimate of scale [26] based on loss functionρc1 , which satisfies1nn∑i=1ρc1(ri j(a,b,µ )σ̂ j)= b1. (A.2)86Then the τ-scale τ̂ j = τ̂ j(a,b,µ ) is defined byτ̂2j =σ̂2jb2· 1nn∑i=1ρc2(ri j(a,b,µ )σ̂ j), (A.3)with where ρc1(t) = ρ(t/c1), ρc2(t) = ρ(t/c2) are assumed to be symmetric, con-tinuously differentiable, bounded, strictly increasing on [0,c j] and constant on[c j,∞), with 0 < c j < ∞, j = 1,2. Here we consider Yohai-Zamar loss functionρ(t) =1.38( tc)2 ∣∣ tc∣∣≤ 230.55−2.69( tc)2+10.76( tc)4−11.66( tc)6+4.04( tc)8 23 < ∣∣ tc ∣∣≤ 11∣∣ tc∣∣> 1.(A.4)To obtain consistency and 50% breakdown point, we choose the tuning parame-ters c1 = 1.214 and b1 = 0.5 for ρc1 . The choices c2 = 3.270 and b2 = 0.128 for ρc2yield a τ-estimator with 95% efficiency when the errors are normally distributed(see Yohai and Zamar [41]).Given b, the estimating equations for ai and µ j are given by the followingformulas which are obtained by fixing b and solving the subgradient equations in(A.1) with respect to ai and µ j:ai =∑pj=1 wi j(Zi j−µ j)b j∑pj=1 wi jb2j, 1≤ i≤ n, (A.5)µ j =∑ni=1 wi j(Zi j−aib j)∑ni=1 wi j, 1≤ j ≤ p, (A.6)withwi j =[d jh−1j ρ′c1(ri jτ̂ j)+ρ ′c2(ri jτ̂ j)τ̂ j]r−1i j , (A.7)h j =n∑i=1ρ ′c1(ri jτ̂ j)ri jτ̂ j,andd j = 2τ̂ j[n∑i=1ρc2(ri jτ̂k)]−n∑i=1ρ ′c2(ri jτ̂k)ri j.87The minimization over b for fixed a and µ is more involved due to the L1-penalty. As such we use the following fast and approximated iterative approach toobtain an estimating equation for b. First we re-write (A.1) asmina,b,µ{p∑j=1τ̂2j +λp∑j=1g(b j)}, subject to ||b||22 = 1. (A.8)whereg(t) ={t2|t| if t 6= 00 otherwiseAt this point, we make the following approximation to simplify the optimiza-tion procedure. If |b j|( j = 1, . . . , p) in the denominator of g(b j) are non-zero, thenwe consider them constants b(const)j ( j = 1, . . . , p), and Equation A.8 is approxi-mated as follows.minb{p∑j=1τ̂2j +λp∑j=1b2j|b(const)j |}, subject to ||b||22 = 1. (A.9)By taking the derivative of the above objective function with respect to b j (ignoringthe side constraints ||b||22 = 1 since it is just for ensuring the uniqueness of thesolution) and setting it to 0, we have the closed form solution for b j:b j =∑ni=1 wi j(Zi j−µ j)ai∑ni=1 wi j(ai)2+2λ/|b(const)j |. (A.10)Since b(const)j is unknown, we initialize it with an non-zero value and obtain b jusing the following iteration:b(k+1)j =∑ni=1 wi j(Zi j−µ j)ai∑ni=1 wi j(ai)2+2λ/|b(k)j |. (A.11)The iteration steps in Equation A.11 will continue until convergence or the valueof b(k+1)j is smaller than a pre-defined threshold, say 10−20, in which case the valueof b j is set to 0.Note that the constraint on b (i.e. ||b||22 = 1) is just to ensure the uniqueness of88the solution, we normalize b after the iteration terminates for all the b j( j= 1, . . . , p)in b.We use the L1-median of the input matrix as the initial values of µ . For thevector b, we generate R random starts by assigning b(0) either by one of the R−1randomly selected rows of the original matrix or the SVD solution obtained fromthe original matrix, i.e. bsvd . The columns of the matrix A are the sparse scoresof each observation given b(0). For each set of the initial values we run max iteriterations, or until a tolerance level is achieved.Algorithm 8 gives a detailed description of the algorithm used for computingRSPCA-τ . We observe that, in Algorithm 8, for set of initial values, the value ofthe objective function does not always decrease during the iterations. Therefore,we cache the smallest value of the objective function and its corresponding a, band µ among all the runs with different initial values to be returned as the finalsolution. Although we do not provide theoretical guarantee on the convergenceof the iteration process, we observe convergence within a pre-specified number ofmax iter (say 500) in all of our simulations and real examples.89Set MSELasso RFBN 152.53 127.39PH 145.29 135.01FP 143.56 125.98AP 148.45 131.76CAP 146.57 131.67Table A.1: MSEs of Lasso (5-fold CV-MSEs) and RF (OOB-MSEs) for fivedescriptor sets of the AID 364 assayResponse MSELasso RFLAI 12.37 11.58TOTVEGC 3.47 3.29Table A.2: MSEs of Lasso (CV-MSEs) and RF (OOB-MSEs) for two re-sponses of the CLM data90Algorithm 8 RSPCA-τ1: Let LS(a,b,µ ) = ∑pj=1 τ̂2j +λ ||b||1 subject to ||b||2 = 12: Input: Z = (Z1, · · · ,Zp) ∈ Rn×p, M ∈ N, tuning parameter λ > 0, convergencethreshold ω , default 10−10, max number of iteration max iter, default 500, number ofinitial value set R3: L∗S =+∞4: Generate B including R random starts for b by randomly selection R−1 rows from Z,and bSV D.5: for b(0) in B6: a(0)← Zb(0), µ (0)← (median(Z1), ...,median(Zp))T .7: N iter = 18: while N iter < max iter9: compute wi j, i = 1, ...,n, j = 1, ..., p in (A.7) using a(0),b(0),µ (0).10: a(1)i ← (∑pj=1 wi j(Zi j−µ(0)j )b(0)j )/(∑pj=1 wi j(b(0)j )2), i = 1, ...,n11: for j = 1, · · · , p:12: if b(0)j = 013: b(1)j ← 014: else15: b(0∗)j ← b(0)j16: n iter = 117: while n iter < max iter18: b(1∗)j ←(∑ni=1 wi j(Zi j−µ(0)j )a(0)i)/(∑ni=1 wi j(a(0)i )2+2λ/|b(0∗)j |)19: if |b(1∗)j −b(0∗)j |< ω or |b(1∗)j |< ω220: break21: b(0∗)j ← b(1∗)j22: n iter++23: end while24: b(1)j ← b(1∗)j if |b(1∗)|> ω2, otherwise b(1)j ← 025: end for26: Normalize b(1)j ( j = 1, . . . , p) so that ∑ j(b(1)j )2 = 127: µ(1)j ← (∑ni=1 wi j(Zi j−a(1)i b(1)j ))/(∑ni=1 wi j)28: if LS(a(1),b(1),µ (1))< L∗S29: L∗S = LS(a(1),b(1),µ (1)), a∗ = a(1), b∗ = b(1), µ ∗ = µ (1)30: if |LS(a(1),b(1),µ (1))−LS(a(0),b(0),µ (0))|< ε , say 10−531: break32: a(0)← a(1),b(0)← b(1),µ (0)← µ (1)33: N iter++34: end while35: end for36: Output: a∗, b∗, µ ∗ with L∗S.91
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Hierarchical clustering of observations and features...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Hierarchical clustering of observations and features in high-dimensional data Zhang, Hongyang (Fred) 2017
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Hierarchical clustering of observations and features in high-dimensional data |
Creator |
Zhang, Hongyang (Fred) |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | In this thesis, we present new developments of hierarchical clustering in high-dimensional data. We consider different use cases of hierarchical clustering, namely, clustering observations for exploratory analysis and clustering high-dimensional features for adaptive feature grouping and ensembling. We first focus on the clustering of observations. In high-dimensional data, the existence of potential noise features and outliers poses unique challenges to the existing hierarchical clustering techniques. We propose the Robust Sparse Hierarchical Clustering (RSHC) and the Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that via robust feature selection techniques, both RSHC and MrSHC can handle the potential existence of noise features and outliers in high-dimensional data and result in better clustering accuracy and interpretation comparing to the existing hierarchical clustering methods. We then consider clustering of features in high-dimensional data. We propose a new hierarchical clustering technique to adaptively divide the large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Then models built on different Regression Phalanxes are considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method like Lasso or Random Forests. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-08-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0354395 |
URI | http://hdl.handle.net/2429/62648 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_september_zhang_hongyang.pdf [ 3.99MB ]
- Metadata
- JSON: 24-1.0354395.json
- JSON-LD: 24-1.0354395-ld.json
- RDF/XML (Pretty): 24-1.0354395-rdf.xml
- RDF/JSON: 24-1.0354395-rdf.json
- Turtle: 24-1.0354395-turtle.txt
- N-Triples: 24-1.0354395-rdf-ntriples.txt
- Original Record: 24-1.0354395-source.json
- Full Text
- 24-1.0354395-fulltext.txt
- Citation
- 24-1.0354395.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0354395/manifest