UBC Faculty Research and Publications

diceR: an R package for class discovery using an ensemble driven approach Chiu, Derek S; Talhouk, Aline Jan 15, 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12859_2017_Article_1996.pdf [ 797.12kB ]
JSON: 52383-1.0366289.json
JSON-LD: 52383-1.0366289-ld.json
RDF/XML (Pretty): 52383-1.0366289-rdf.xml
RDF/JSON: 52383-1.0366289-rdf.json
Turtle: 52383-1.0366289-turtle.txt
N-Triples: 52383-1.0366289-rdf-ntriples.txt
Original Record: 52383-1.0366289-source.json
Full Text

Full Text

SOFTWARE ARTICLE Open AccessdiceR: an R package for class discoveryusing an ensemble driven approachDerek S. Chiu1 and Aline Talhouk1,2*AbstractBackground: Given a set of features, researchers are often interested in partitioning objects into homogeneousclusters. In health research, cancer research in particular, high-throughput data is collected with the aim ofsegmenting patients into sub-populations to aid in disease diagnosis, prognosis or response to therapy. Clusteranalysis, a class of unsupervised learning techniques, is often used for class discovery. Cluster analysis suffers fromsome limitations, including the need to select up-front the algorithm to be used as well as the number of clustersto generate, in addition, there may exist several groupings consistent with the data, making it very difficult tovalidate a final solution. Ensemble clustering is a technique used to mitigate these limitations and facilitate thegeneralization and reproducibility of findings in new cohorts of patients.Results: We introduce diceR (diverse cluster ensemble in R), a software package available on CRAN:https://CRAN.R-project.org/package=diceRConclusions: diceR is designed to provide a set of tools to guide researchers through a general cluster analysisprocess that relies on minimizing subjective decision-making. Although developed in a biological context,the tools in diceR are data-agnostic and thus can be applied in different contexts.Keywords: Data mining, Cluster analysis, Ensemble, Consensus, CancerBackgroundCluster analysis has been used in cancer research to dis-cover new classifications of disease and improve the un-derstanding of underlying biological mechanisms. Thistechnique belongs to a set of unsupervised statisticallearning methods used to partition objects and/or featuresinto homogeneous groups or clusters [1]. It providesinsight, for example, to how co-regulated genes associatewith groupings of similar patients based on features oftheir disease, such as prognostic risk or propensity to re-spond to therapy. Many clustering algorithms are avail-able, though none stand out as universally better than theothers. Different algorithms may be better suited for spe-cific types of data, and in high dimensions it is difficult toevaluate whether algorithm assumptions are met. Further-more, researchers must set the number of clusters a priorifor most algorithms. Additionally, several clusteringsolutions consistent with the data are possible, making theascertainment of a final result without considerable reli-ance on additional extrinsic information difficult [2]. Manyinternal clustering criteria have been proposed to evaluatethe output of cluster analysis. These generally consist ofmeasures of compactness (how similar are objects withinthe same cluster), separation (how distinct are objectsfrom different clusters), and robustness (how reproducibleare the clusters in other datasets) [2–4]. External evalu-ation can also be used to assess how resulting clusters andgroupings corroborate known biological features. Re-searchers may choose to use internal clustering criteriaonly for performance evaluation [5, 6] to keep the analysiscongruent with an unsupervised approach.Ensemble methods are a popular class of algorithmsthat have been used in both the supervised [7, 8] andunsupervised learning setting. In the unsupervised set-ting, cluster ensembles have been proposed as a class ofalgorithms that can help mitigate many of the limitationsof traditional cluster analysis by combining clustering re-sults from multiple “experts” [2, 9]. Ensembles areachieved by generating different clusterings, using* Correspondence: atalhouk@bccrc.ca1Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC,Canada2Department of Pathology and Laboratory Medicine, University of BritishColumbia, Vancouver, BC, Canada© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.Chiu and Talhouk BMC Bioinformatics  (2018) 19:11 DOI 10.1186/s12859-017-1996-ydifferent subsets of the data, different algorithms, or dif-ferent number of clusters, and combining the resultsinto a single consensus solution. Ensemble methods havebeen shown to result in a more robust clustering thatconverges to a true solution (if a unique one exists) asthe number of experts is increased [9–11]. The agnosticapproach of ensemble learning makes the technique use-ful in many health applications, and non-health applica-tions such as clustering communities in social networkanalysis (Maglaras et al., 2016) and classifying creditscores (Koutanaei et al., 2015).ImplementationIn this paper, we introduce diverse cluster ensemble in R(diceR), a software package built in the R statistical lan-guage (version 3.2.0+) that provides a suite of functionsand tools to implement a systematic framework for clus-ter discovery using ensemble clustering. This frameworkguides the user through the steps of generating diverseclusterings, ensemble formation, and algorithm selectionto the arrival at a final consensus solution, most consist-ent with the data. We developed a visual and analyticalvalidation framework, thereby integrating the assessmentof the final result into the process. Problems with scal-ability to large datasets were solved by rewriting some ofthe functions to run parallel on a computing cluster.diceR is available on CRAN.Results and discussionThe steps performed in the diceR framework are sum-marized below and in Fig. 1; a more detailed examplecan be found in the Additional file 1 and at https://alinetalhouk.github.io/diceRDiverse cluster generationThe full process is incorporated into a single functiondice that wraps the different components describedherein. The input data consists of a data frame withrows as samples and columns as features. Clustergeneration is obtained by applying a variety of clus-tering algorithms (e.g. k-means, spectral clustering,etc.), distance metrics (e.g. Euclidean, Manhattan,etc.), and cluster sizes to the input data (pleaseconsult the supplementary methods for the list ofalgorithms and clustering distances currently imple-mented). In addition to algorithms and distances im-plemented within diceR, a simple framework isavailable for the user to input the algorithm or dis-tance of their choosing. Every algorithm is applied toseveral subsets of the data, each consisting of 80% ofthe original observations. As a result of subsampling,not every sample is included in each clustering; thedata is “completed” using k-nearest neighbor and ma-jority voting.The output of the cluster generation step is an array ofclustering assignments computed across cluster sizes, al-gorithms, and subsamples of the data (See “ClusteringArray” and “Completed Clustering Array” in Fig. 1). Thistechnique extends the consensus clustering method pro-posed by Monti et al. [12] to include a consensus acrossalgorithms.Consensus ensembleA cluster ensemble is generated by combining resultsfrom the cluster generation step. diceR implementsfour methods for consensus formation: MajorityVoting [13], K-modes [14], Link-Based ClusterEnsembles (LCE) [10], and Cluster-based SimilarityPartitioning Algorithm (CSPA) [9, 15] (See Fig. 1).Fig. 1 Ensemble clustering pipeline implemented in diceR. The analyticalprocess is carried out by the main function of the package: diceChiu and Talhouk BMC Bioinformatics  (2018) 19:11 Page 2 of 4Thus, the final ensemble is a consensus across sam-ples and algorithms.There is also an option to choose a consensus clustersize using the proportion of ambiguous clustering (PAC)metric [4]. The cluster size corresponding to the smal-lest PAC value is selected, since low values of PAC indi-cate greater clustering stability. Additionally, the usercan allocate different weights to the algorithms in theensemble, proportional to their internal evaluation indexscores.Visualization and evaluationFor each clustering algorithm used, we calculate in-ternal and external validity indices [5, 6]. diceR hasCalinski HarabaszDunn PBMTauGammaC indexDavies BouldinMclain RaoSD distanceRay TuriG plusCompactnessConnectivityTCGA Ovarian Cancer Gene Expression Data UCI Breast Tissue Data Calinski HarabaszDunn PBMTauGammaC indexDavies BouldinMclain RaoSD distanceRay TuriG plusCompactnessConnectivityIndexMaximizedMinimized210123UCI Parkinsons Speech Data Calinski HarabaszDunn PBMTauGammaC indexDavies BouldinMclain RaoSD distanceRay TuriG plusCompactnessConnectivity A  B CPAM (Spearman)KM (Spearman)kmodes_trimBLOCKNMF (Brunet)APKM (Euclidean)CSPACSPA_trimNMF (Lee)CMEANSLCEkmodesmajority_trimSCmajorityPAM (Euclidean)LCE_trimSOMPAM (Spearman)KM (Spearman)kmodesLCE_trimkmodes_trimBLOCKLCEKM (Euclidean)NMF (Brunet)CMEANSCSPA_trimCSPAPAM (Euclidean)APSCNMF (Lee)majoritySOMmajority_trimKM (Spearman)PAM (Spearman)BLOCKNMF (Brunet)NMF (Lee)kmodesAPmajorityPAM (Euclidean)kmodes_trimKM (Euclidean)CSPAmajority_trimLCESOMCMEANSSCCSPA_trimLCE_trimFig. 2 A comparative evaluation using diceR applied to three datasets. Using 10 clustering algorithms, we repeated the clustering of each data set, eachtime using only 80% of the data. Four ensemble approaches were considered. The ensembles were constructed using all the individual clusterings andwere repeated by omitting the least performing algorithms (the trim version in the figure). Thirteen internal validity indices were used to rank order thesealgorithms based on performance from top to bottom. Indices were standardized so their performance is relative to each other. The green/red annotationtracks at the top indicate which indices should be maximized or minimized respectively. Ensemble methods were highlighted using a bold fontChiu and Talhouk BMC Bioinformatics  (2018) 19:11 Page 3 of 4visualization plots to compare clustering results be-tween different cluster sizes. The user can monitor theconsensus cumulative distribution functions (CDFs),relative change in area under the curve for CDFs, heat-maps, and track how cluster assignments change inrelation to the requested cluster size.A hypothesis testing mechanism based on the SigClustmethod is also implemented in diceR to assess whetherclustering results are statistically significant [16]. This al-lows quantification of the confidence in the partitions.For example, we can test whether the number of statisti-cally distinct clusters is equal to two or three, as op-posed to just one (i.e. unimodal distribution no clusters).In Fig. 2 we present a visualization of the results of acomparative analysis.Algorithm selectionPoor-performing algorithms can affect a cluster ensem-ble’s performance, so one way to limit that is to includeonly the top N performing algorithms in the ensemble[17]. To this end, the internal validity indices for all al-gorithms are computed (see Additional file 1 for full listof indices). Then, rank aggregation is used to select asubset of algorithms that perform well across all indices[18]. The resulting subset of algorithms is selected forinclusion in the cluster ensemble. Our “diverse” strategyis not to impose diversity onto the ensemble, but to con-sider a diverse set of algorithms and ultimately allow thedata to select which best performing algorithms to re-tain. This step of the analysis continues to be an activearea of research and is subject to revision andimprovements.ConclusionsThe software we have developed provides an easy-to-useinterface for researchers of all fields to use for their clus-ter analysis needs. More clustering algorithms will beadded to diceR as they become available.Additional fileAdditional file 1: A detailed tutorial and example of Cluster Analysisusing diceR. (PDF 326 kb)AbbreviationsCDF: Cumulative distribution function; CSPA: Cluster-Based PartitioningAlgorithm; diceR: Diverse cluster ensemble in R; LCE: Linkage-Based ClusterEnsembles; PAC: Proportion of ambiguous clusteringAcknowledgementsWe would like to acknowledge the contributions of Johnson Liu in packagedevelopment and Dr. Michael Anglesio and Jennifer Ji for providing helpfulfeedback.FundingThis research was supported by donor funds to OVCARE (www.ovcare.ca)from the Vancouver General Hospital and University of British ColumbiaHospital Foundation and the BC Cancer Foundation.Availability of data and materialsdiceR is available on CRAN: https://CRAN.R-project.org/package=diceR.Authors’ contributionsDSC and AT wrote and analysed the functions in the software package. Bothauthors wrote, read, and approved the final manuscript.Ethics approval and consent to participateNot applicable.Consent for publicationNot applicable.Competing interestsThe authors declare that they have no competing interests.Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.Received: 8 August 2017 Accepted: 12 December 2017References1. Hennig C, Meila M, Murtagh F, Rocci R. Handbook of cluster analysis: CRCPress Book; 2015.2. Song Q, et al. Cancer classification in the genomic era: five contemporaryproblems. Hum Genomics. 2015;9:27.3. Liu Y, et al. Understanding and enhancement of internal clusteringvalidation measures. IEEE Trans Cybern. 2013;43:982–94.4. Șenbabaoğlu Y, et al. Critical limitations of consensus clustering in classdiscovery. Sci Rep. 2014;4:6207.5. Arbelaitz O, et al. An extensive comparative study of cluster validity indices.Pattern Recogn. 2013;46:243–56.6. Handl J, et al. Computational cluster validation in post-genomic dataanalysis. Bioinformatics. 2005;21:3201–12.7. Breiman L. Random forests. Mach Learn. 2001;45:5–32.8. Neumann U, et al. EFS: an ensemble feature selection tool implemented asR-package and web-application. BioData Min. 2017;10:21.9. Strehl A, Ghosh J. Cluster ensembles – a knowledge reuse framework forcombining multiple partitions. J Mach Learn Res. 2002;3:583–617.10. Iam-On N, et al. LCE: a link-based cluster ensemble method for improvedgene expression data analysis. Bioinformatics. 2010;26:1513–9.11. Topchy, A.P. et al. A mixture model for clustering ensembles. In, SDM., 2004.pp. 379–390.12. Monti S, et al. Consensus clustering: a resampling based method for classdiscovery and visualization of gene expression microarray data. Mach Learn.2003;52:91–118.13. Ayad HG, Kamel MS. On voting-based consensus of cluster ensembles.Pattern Recogn. 2010;43:1943–53.14. Huang Z. A fast clustering algorithm to cluster very large categorical datasets in data mining. Res Issues Data Min Knowl Discov. 1997:1–8.15. Ghosh J, Acharya A. Cluster ensembles. Wiley Interdiscip Rev Data MinKnowl Discov. 2011;1:305–15.16. Huang H, et al. Statistical significance of clustering using soft Thresholding.J Comput Graph Stat. 2015;24:975–93.17. Naldi MC, et al. Cluster ensemble selection based on relative validityindexes. Data Min Knowl Discov. 2013;27:259–89.18. Pihur V, et al. RankAggreg, an R package for weighted rank aggregation.BMC Bioinformatics. 2009;10:62.Chiu and Talhouk BMC Bioinformatics  (2018) 19:11 Page 4 of 4


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items