Open Collections

UBC Theses and Dissertations

Robust methods for inferring cluster structure in single cell RNA sequencing data Willie, Elijah 2020


Full Text

Robust methods for inferring cluster structure in Single Cell RNA Sequencing data

by

Elijah Willie

B.Sc., Simon Fraser University, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)

The University of British Columbia (Vancouver)

October 2020

© Elijah Willie, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Robust methods for inferring cluster structure in Single Cell RNA Sequencing data

submitted by Elijah Willie in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Sara Mostafavi, Statistics and Medical Genetics (Supervisor)
Ryan Brinkman, Medicine and Medical Genetics (Supervisory Committee Member)
Francis Lynn, Medicine and Biomedical Engineering (Supervisory Committee Member)

Abstract

Single Cell RNA sequencing (SCRNA-SEQ) enables researchers to gain insights into complex biological systems not possible with previous technologies. Unsupervised machine learning, and in particular clustering algorithms, are critical to the analysis of scRNA-seq datasets, enabling investigators to systematically define cell types based on similarities in global gene expression profiles. However, in robustly applying a clustering algorithm to identify and define cell types, two critical open questions remain: i) to what extent do natural clusters exist in a given dataset? ii) what is the number of clusters best supported by the data? More specifically, most clustering algorithms will attempt to identify a fixed number of clusters without considering whether a given dataset is clusterable (i.e., whether natural clusters exist). However, understanding when the application of clustering algorithms is appropriate is crucial in making inferences from scRNA-seq datasets.
Further, all clustering algorithms require the user to explicitly or implicitly specify the number of clusters to search for. In this thesis, we first assess the robustness of multimodality testing methods for determining whether a given set of points (or a dataset) is clusterable. Next, we utilize this framework to develop an algorithm, which we refer to as CCMT, for inferring the number of robust clusters in a given dataset. Results on simulation studies show that multimodality testing as a means for inferring cluster structure is robust and scales favorably for large datasets. This method can detect cluster structure with high statistical power in situations where there is high overlap between the clusters. We also apply our approach to real scRNA-seq datasets and show that it can accurately determine the cluster structure in both positive and negative control experiments. In the second part of this work, we apply CCMT in simulation studies and show that coupling multimodality testing with the nested structure of hierarchical clustering and discriminant analysis provides a robust approach for determining the number of clusters in a given dataset. Results on real data also show that CCMT can recover ground truth partitions with reasonable accuracy, and it is much faster than the competing methods that have a similar accuracy range.

Lay Summary

The human body contains millions of cells, each able to perform a wide variety of tasks. It is now possible to view each of these cells individually at the molecular level. Doing so has enabled scientists to systematically define cell types based on their molecular (gene expression) properties. To identify existing cell types and potentially define new ones, scientists often use clustering algorithms. Clustering is a technique used to group similar objects based on their properties; in this context, our objects are cells, and their properties are the expression levels of genes.
In this thesis, we address two questions that are critical to the success of clustering algorithms: i) to what extent "natural" clusters exist in a given dataset, and ii) how many clusters can be found in a given dataset.

Preface

This dissertation is an original intellectual product of the author, Elijah Willie. The breakdown of contributions reported in the present manuscript is as follows: the implementation of all statistical methodologies described in Chapter 3 was performed by Elijah Willie. The discussion and exploration of results presented in Chapters 4 and 5 were performed by Elijah Willie. This manuscript was written by Elijah Willie under the supervision of Dr. Sara Mostafavi.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Single cell RNA sequencing
    1.1.1 Analysis methods
    1.1.2 Thesis motivation and contribution
    1.1.3 Detailed outline of thesis

2 Related Works
  2.1 The clusterability problem
    2.1.1 Visual assessment of clusterability
    2.1.2 Spatial randomness
    2.1.3 Multimodality testing
  2.2 Estimating the number of clusters C
    2.2.1 Conventional methods
    2.2.2 Statistical significance methods
    2.2.3 The Gaussian model for statistical significance
    2.2.4 The Unimodal model for statistical significance
    2.2.5 ScRNA-seq specific methods for estimating C

3 Methods
  3.1 Testing expression patterns for multimodality
  3.2 Counts modelling and normalization
    3.2.1 Log transformation of normalized counts
    3.2.2 Multinomial normalization
    3.2.3 Regularized negative binomial normalization
  3.3 Feature selection
    3.3.1 Feature selection for log transformation of normalized counts
    3.3.2 Feature selection for regularized negative binomial normalization
  3.4 Dimensionality reduction
  3.5 Multimodality testing
  3.6 Computing clusters through multimodality testing
    3.6.1 Generating a hierarchical partition
    3.6.2 Discriminant coordinates
    3.6.3 The CCMT procedure
    3.6.4 Combining significant partitions
  3.7 Simulation studies
    3.7.1 Simulating data scalability, differing cluster number and sizes
    3.7.2 Simulating cluster separability
    3.7.3 Simulating data sparsity
  3.8 Benchmarking data
    3.8.1 Small scale datasets
    3.8.2 Medium to large scale datasets
  3.9 Obtaining ground truth clusters
  3.10 Evaluation metrics
  3.11 Clustering methods
  3.12 Performance assessment
    3.12.1 Simulation studies
    3.12.2 Positive control
    3.12.3 Negative control
  3.13 Run time assessment
  3.14 Summary

4 Results
  4.1 Evaluation of multimodality testing
    4.1.1 Simulation studies
    4.1.2 Positive control
    4.1.3 Negative control
  4.2 Factors affecting multimodality testing
    4.2.1 The effect of data normalisation
    4.2.2 The limitations of multimodality testing
  4.3 Evaluation of the CCMT procedure
    4.3.1 Simulation studies
    4.3.2 Positive control
    4.3.3 Negative control
  4.4 Factors affecting the CCMT procedure
    4.4.1 The effect of the number of informative genes
    4.4.2 The effect of the number of PCS
    4.4.3 The limitations of the CCMT procedure
  4.5 Comparing CCMT to other clustering methods
    4.5.1 Small scale datasets
    4.5.2 Medium to large scale datasets
  4.6 Summary

5 Conclusions
  5.1 Summary
  5.2 Discussion
  5.3 Future Work

Bibliography

List of Tables

Table 3.1 Small scale datasets
Table 3.2 Medium to large scale datasets
Table 4.1 P-value table for simulating data scalability on balanced datasets
Table 4.2 P-value table for simulating data scalability on unbalanced datasets
Table 4.3 P-value table for simulating data sparsity
Table 4.4 P-value table for positive control datasets
Table 4.5 P-value table for simulating data scalability with high overlap on balanced datasets
Table 4.6 P-value table for simulating data scalability with high overlap on unbalanced datasets
Table 4.7 ARI table for simulating data scalability on balanced datasets
Table 4.8 ARI table for simulating data scalability on unbalanced datasets
Table 4.9 ARI table for simulating data sparsity
Table 4.10 ARI table for positive control datasets
Table 4.11 CCMT results on negative control datasets
Table 4.12 ARI table for simulating data scalability with high overlap on balanced datasets
Table 4.13 ARI table for simulating data scalability with high overlap on unbalanced datasets

List of Figures

Figure 1.1 K-means example
Figure 2.1 Unimodal example
Figure 3.1 Multimodality pipeline
Figure 3.2 Multimodality example
Figure 3.3 Fisher discriminant coordinates illustration
Figure 3.4 Fisher discriminant coordinates example
Figure 3.5 CCMT tree example
Figure 3.6 Methods pipeline
Figure 3.7 Ensemble method pipeline
Figure 3.8 Multiview method pipeline
Figure 3.9 Simulating data scalability pipeline
Figure 3.10 Simulating cluster separability pipeline
Figure 3.11 Simulating cluster separability example
Figure 3.12 Simulating data sparsity pipeline
Figure 3.13 Simulating data sparsity example
Figure 3.14 Negative control dataset
Figure 4.1 P-value heatmap for simulating cluster separability
Figure 4.2 Density plots for negative control dataset
Figure 4.3 Simulating data scalability with high overlap pipeline
Figure 4.4 Simulating data scalability with high overlap example
Figure 4.5 ARI heatmap for simulating cluster separability
Figure 4.6 ARI boxplots for positive control datasets
Figure 4.7 CCMT predicted number of clusters
Figure 4.8 ARI boxplots of varying number of genes
Figure 4.9 ARI boxplots of varying number of PCS
Figure 4.10 Boxplots of methods performance on small scale datasets
Figure 4.11 Boxplots of methods performance on medium to large scale datasets

Glossary

ARI: Adjusted Rand Index
CCMT: Computing clusters through multimodality testing
CDF: Cumulative distribution function
CSR: Complete spatial randomness
DWH: Soft least squares Euclidean consensus partitions described in Dimitriadou, Weingessel and Hornik
FWER: Family-wise error rate
PBMCS: Peripheral Blood Mononuclear Cells
PCA: Principal component analysis
PCS: Principal components
PDF: Probability density function
SCRNA-SEQ: Single cell RNA sequencing
T-SNE: t-distributed Stochastic Neighbor Embedding
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Acknowledgments

Although the process of obtaining a Masters degree may seem like a solitary endeavor, mine certainly would not have been possible without the support of a plethora of individuals, including mentors, colleagues, friends, and family members. With all my heart, I would like to sincerely thank the following individuals:

• My supervisor Dr. Sara Mostafavi, for her continual support, guidance, and wisdom throughout my studies.

• Dr. Bernard Ng, for his willingness to always help and provide constructive feedback whenever necessary.

• All of my lab-mates, who made graduate school a delightful experience. In particular, I would like to thank Gherman Novakovsky, for his willingness to always talk to me about doubts I had about my project.

• My beloved friends Divya Bafna and Figal Taho, for being part of this journey with me and always being available to share a laugh even when it was hard to do so. You guys made graduate school an unforgettable experience, both in the lab and outside the lab.

• My family, for always believing in me, showing me unconditional love, nurturing, and encouragement, and encouraging me to push myself beyond my limits. They are the rock upon which all my strength is built. All this would not be possible without them.

• Dr. Leonid Chindelevitch, who cultivated my interest in research during my undergraduate studies, committed to my research development, and made so many opportunities possible for me.

• My Lord and Saviour, Jesus Christ, for his provision, unconditional love, and saving grace, and through whom all things are possible.

Chapter 1

Introduction

The advent of single-cell RNA sequencing (SCRNA-SEQ) has enabled the investigation of the cellular composition of complex tissues with unprecedented resolution. Deciphering cell type composition is one of the primary applications of scRNA-seq, typically done using unsupervised clustering, a technique borrowed from the machine learning literature. The nature of scRNA-seq datasets has provided new challenges for unsupervised clustering methods. These challenges include high variability and noise levels due to technical limitations and biological factors [5, 7, 36, 45, 110].

A fundamental challenge when clustering scRNA-seq datasets is whether the data contains inherent clusters to warrant clustering in the first place [2]. This issue is essential because a dataset should be clustered only when there exists an inherent cluster structure. The idea of clusterability, which seeks evidence for structure inherent to a dataset, should be a pivotal step in helping the user decide if clustering is appropriate for their dataset. Clustering should only be applied if a dataset contains inherent structure; otherwise, the results would be misleading. As an example, consider a set of n = 1000 points generated from a unimodal distribution.
There is no inherent cluster structure, and thus all clustering algorithms should return a single cluster. Any number (k > 1) of clusters returned would be purely nonsensical. See Figure 1.1.

Determining a suitable number of clusters is another concern when clustering scRNA-seq datasets. For most clustering algorithms, the number of clusters needs to be known a priori and is often unknown. Domain knowledge is often used in the biological setting to determine a suitable number of clusters. For example, when dealing with scRNA-seq gene expression data, the data is clustered with multiple cluster numbers. The user then uses domain knowledge, such as the expression of known marker genes from the literature, to determine a suitable number of clusters [60].

In this thesis, we first assess the robustness of multimodality testing as a proxy for assessing clusterability in scRNA-seq datasets. Next, we utilize the multimodality framework to design an algorithm for computing the number of clusters in scRNA-seq datasets. We show that this approach is robust to challenges inherent to scRNA-seq datasets through an extensive simulation study. Finally, extensive comparisons using benchmarking datasets show that this approach compares favorably to methods currently used for analyzing scRNA-seq datasets.

1.1 Single cell RNA sequencing

In the mid-to-late 2000s, RNA sequencing (RNA-seq) emerged as a novel approach that would eventually supersede the already successful gene expression microarrays. New protocols developed for this technology typically required bulk sampling to profile thousands to millions of cells. In 2009, [104] provided a novel protocol referred to as single-cell RNA-seq (scRNA-seq), which profiles individual cells. ScRNA-seq treats a cell as an individual entity and allows for comparing cells within a specific population. For a general review, refer to [40, 56].
A sufficient amount of research is currently devoted to developing novel protocols and technologies to increase profiling accuracy. It is now possible to profile thousands of cells in a single experiment [103]. This increase in the number of cells profiled is partially due to reduced sequencing costs, improved cell dissociation protocols, and improved library preparation.

1.1.1 Analysis methods

Most scRNA-seq methods are designed for unsupervised clustering, pseudo-time ordering, and network inference to gain biological insights. After preprocessing [84, 101] and quality control [51, 76] of the output from a sequencing machine, typical analysis steps include:

1. Data normalisation.

Data normalization techniques have become vital in dealing with fluctuations in the reads obtained per cell for sequencing technologies with high throughput. Methods developed for bulk RNA-seq can be used on scRNA-seq data as well. These methods include counts per million (CPM) normalization [82], where each value in the count matrix is divided by the total count in its cell and multiplied by a million. When applied to scRNA-seq data, this is referred to as transcripts per million (TPM). Other methods include the use of size factors [4, 7] and quantile normalization [71, 76].

2. Unsupervised clustering of cells.

Finding clusters of similar cells, or characterization of cell-type composition, is one of the essential applications of scRNA-seq. Characterization of cell-type composition is done using unsupervised clustering. Below are some of the most prominent clustering paradigms currently used in the scRNA-seq literature:

(a) Partitioning models

Partitioning-based clustering algorithms are some of the most widely used clustering methodologies. These methods try to partition a given dataset into K partitions such that objects in the same partition are more similar to each other than to objects in other partitions. These methods include k-means and k-medoids, which many scRNA-seq clustering algorithms are based on.
Such methods include SC3 [59], SCUBA [74], SAIC [121], pcaReduce [128], and RaceID2 [37].

(b) Mixture models

If there exists significant prior knowledge about the data generation process, specifically the distributions generating the data, then mixture models provide an intuitive way to compute clusters, where each distribution represents a cluster. Clustering involves assigning objects to the most likely distribution that generated the data using expectation maximization. ScRNA-seq methods based on this paradigm include BISCUIT [87], Seurat [17], and TSCAN [53].

(c) Graph models

Graph-based models rely on building a graph representing objects as nodes and then finding densely connected regions as clusters. Generating these graphs typically involves computing a similarity score between the objects and using these scores as edge weights between objects. Finding densely connected regions is then reduced to finding regions of high similarity in the graph. Both spectral and clique detection methods are used for finding these dense regions. These models are attractive because they make no distributional assumptions about the data. ScRNA-seq methods based on this paradigm include Seurat [17], SIMLR [113], SNN-Cliq [127], and SCANPY [116].

(d) Density models

Like the intuition from graph-based clustering, density-based algorithms seek to find highly dense regions of objects without representing the objects in a graph. The most notable algorithms using this paradigm include DBSCAN [27] and Density Peak [99] clustering. These models are attractive in the scRNA-seq domain due to their potential for finding rare cell populations. Methods include Monocle [108], GiniClust [54], and sscClust [88].

(e) Ensemble models

Borrowing from the machine learning literature, where weaker classifiers are combined to form a more robust classifier, ensemble models have become particularly useful for clustering.
The idea here is to cluster the set of objects using different methods, including different features, similarity metrics, and clustering algorithms, and then combine the results to form an ensemble. Ensemble models can help to combine the diversity obtained from different clustering solutions. Ensembles have also been shown to outperform single models in terms of accuracy and robustness [34, 50].

(f) Hierarchical models

Over the years, hierarchical clustering has become one of the most widely used clustering methods in the scRNA-seq domain [117]. Their increased use is due to the lack of distributional assumptions about the data generating mechanism [90]. Hierarchical clustering algorithms can uncover possible hierarchies amongst cell types and represent them using a tree structure, which is appealing since cell types can exist in hierarchies. The hierarchical representation also makes interpreting clusters easier. ScRNA-seq methods that employ these algorithms include SC3 [59], cellTree [25], CIDR [68], and DendroSplit [125].

(g) Multiview models

Advances in data curation and clustering techniques have enabled researchers to gather information on an experiment from multiple views to increase the resolution of the phenomena under examination. Multiview clustering [13, 20, 122] provides a methodology for combining the information contained in these views into a common representation. There are a few advantages of multiview clustering, including generating a more complete knowledge of the data, reducing noise content, and generating a more robust clustering of the data. Multiview methods differ from ensemble methods in that they combine information across multiple views before clustering is done, whereas in ensemble models, clustering is done for each view and the results are then combined. There have not been many adaptations of these methods to the scRNA-seq domain.
For example, [16] combines gene expression data and paired epigenetic data to infer cell types and gene regulatory networks. Also, in [98], a similar idea based on pattern fusion analysis is used to integrate multiple heterogeneous omics data. Even with these methods, the multiview clustering literature for the scRNA-seq domain remains sparse and provides a possible avenue for further research.

This is by no means an exhaustive list of all the clustering paradigms and methods available to the scRNA-seq domain. However, most methods developed fall somewhere within these paradigms, and other methods combine aspects of multiple paradigms. Most methods under these paradigms require that the number of clusters K is known a priori. This value is typically unknown, and there exist some heuristics to compute it prospectively or retrospectively. For a more thorough review of these paradigms, their advantages, and their disadvantages, refer to [85].

3. Ordering of cells.

An alternate but equally significant aspect of scRNA-seq analysis is pseudo-time analysis. Pseudo-time methods are used for the analysis of gene expression levels over a continuous axis. Since scRNA-seq datasets only provide a snapshot of cells at a point in time, pseudo-time analysis requires setting up a continuous axis and observing the gene expression of cells. The task then becomes ordering these cells along this continuous axis. Pseudo-time analysis could lead to understanding differentiation trajectories for different cell types and understanding how different cell states vary. A few methods have been developed for this, including Monocle [108], TSCAN [53], SCUBA [74], and scVelo [11]. This remains an active area of research, and new methods are released routinely; see [19] for a review.

4. Differential expression analysis.

After clustering the cells, the next step is the interpretation of clusters.
Interpreting clusters is typically done by finding genes that are differentially expressed between clusters, or marker genes that are expressed in specific clusters. Many methods have been developed to do this, including D3E [22], DEsingle [79], MAST [30], and SCDE [57]. Some of these methods have been designed specifically for scRNA-seq data and do not have the limitations faced by bulk RNA-seq methods. Other methods developed for bulk RNA-seq have been applied to scRNA-seq data; however, they may not be appropriate due to the assumptions they make, see [111]. Due to the large numbers of cells obtained from scRNA-seq experiments, simple statistical methods such as the Mann-Whitney U test, Student's t-test, or logistic regression may not be limited by their statistical assumptions. These simple tests are implemented in many scRNA-seq data analysis pipelines, including Seurat [17] and scater [76]. For cells ordered using pseudo-time analysis, differential expression analysis is done by finding genes with significant relationships between expression and the continuous axis. Typically, splines are fit and coefficients are tested for significance. Most pseudo-time analysis methods described previously include methods for finding differentially expressed genes along a trajectory.

Figure 1.1: Plots of uniformly distributed points clustered by K-means with k = 2 and k = 3. A) K-means with k = 2. B) K-means with k = 3. Note that the cluster borders returned by K-means appear to be arbitrary since there is no inherent cluster structure.

1.1.2 Thesis motivation and contribution

Since clustering is a crucial aspect of most scRNA-seq analysis pipelines, great care must be taken when applying clustering algorithms. Most clustering algorithms will find clusters in a dataset, even if none exist [2]. See Figure 1.1. This is because clustering algorithms are designed to optimize an objective function that seeks to partition the dataset optimally [73].
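The failure mode illustrated in Figure 1.1 is easy to reproduce. The sketch below is illustrative only (it is not code from the thesis): a minimal plain-NumPy implementation of Lloyd's k-means run with k = 3 on uniformly distributed points, which have no inherent cluster structure, still returns three sizeable "clusters".

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 2))   # 1000 uniform points: no inherent clusters

labels = kmeans(X, k=3)
sizes = np.bincount(labels, minlength=3)
# the three "clusters" are artifacts of the objective function,
# not structure in the data
print(sizes)
```

The partition boundaries here are determined entirely by the random initialization and the geometry of the objective function, which is exactly why a clusterability check should precede clustering.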
These algorithms do not consider the possibility that there may not exist inherent clusters in the dataset. Thus, there is a risk associated with blindly applying them without proper prior analysis. In this thesis, we focused our efforts on developing a methodology to systematically assess the level of cluster structure (clusterability) in scRNA-seq datasets, while also estimating the possible number of clusters. To our knowledge, this question has not been addressed specifically for scRNA-seq datasets, and thus this work provides an initial framework for addressing this problem.

We perform extensive simulation studies in order to understand which problems affect clusterability analysis in scRNA-seq datasets. We consider problems such as increasing data size, complexity, sparsity, cluster size, and cluster separability. Exploring these issues in a controlled manner illuminates the factors limiting the effectiveness of scRNA-seq clustering methods. Clusterability analysis is also done for a range of real datasets. We found that the methods developed in this thesis are robust to data sparsity, cluster size, and cluster separability, and scale well with increasing data size.

We also compared the methods developed in this work to other methods most often used to cluster scRNA-seq data. We used benchmarking data to concretely evaluate these methods' running time and accuracy, and compared them to the methods developed in this work. Our method provides a competitive running time while, on average, having a higher accuracy when clustering.

1.1.3 Detailed outline of thesis

The rest of this thesis is outlined as follows:

• Chapter 2 is dedicated to surveying the current literature on clusterability analysis and estimating the number of clusters. We discuss some of the methods most often used to assess cluster structure in a dataset. We also discuss a few ways currently used to estimate the number of clusters.
These discussions are done in a broader scope without reference to the scRNA-seq domain. Finally, we provide a short discussion on how some of these methods are used for scRNA-seq analysis.

• In Chapter 3, we outline the methods developed to assess clusterability and estimate the number of clusters. First, we provide a detailed outline of the multimodality testing method for assessing clusterability. Next, we present the CCMT procedure for estimating the number of clusters. Thirdly, we outline both the simulation studies and the real datasets used to evaluate both methods. Lastly, we provide details about the methods used for benchmarking and the evaluation metrics.

• In Chapter 4, we discuss the results of our methods based on simulation studies and real data. First, we provide simulation studies and real data results for assessing clusterability. Next, we provide the results on simulation studies and real data for the CCMT procedure. Finally, we provide the results of the comparative analysis of the CCMT procedure against other methods.

• In Chapter 5, general conclusions are provided about the findings of this work. Also discussed are the possible shortcomings of the proposed methods and future directions that may improve these methods.

Chapter 2

Related Works

The clustering task consists of partitioning a set of objects into k (possibly overlapping) groups such that members of the same group are sufficiently similar to each other and sufficiently dissimilar to non-members. Defining similarity between members is highly dependent on the question asked, the phenomena studied, and the clustering algorithm used. This creates an inherent level of subjectivity when clustering. In any clustering task, the user needs to make some assumptions about the data being clustered, with the most implicit and necessary assumption being that the data indeed contains meaningful clusters.
Based on the clustering algorithm chosen, further assumptions need to be made about the data, and the efficacy of the results depends on how strongly these assumptions hold. The user also needs to decide how many clusters to compute. This is a problem because the user rarely knows beforehand how many clusters to expect, and clustering results may depend heavily on the number of clusters chosen.

In this chapter, we survey the current literature on clusterability and on determining a suitable number of clusters. First, we discuss the clusterability problem and the various methods for addressing it. Second, we discuss the problem of estimating the number of clusters and the current methods for addressing it.

2.1 The clusterability problem

"Clusterability" seeks to provide a measure of the level of cluster structure in a dataset. Clusterability methods assess the potential for a dataset to form clusters without making any assumptions about the data's nature. If clustering seeks to find meaningful partitions in a dataset, then clusterability seeks to find the extent to which these partitions exist. Ideally, a clustering solution is meaningful if it captures the natural cluster structure in the data.

2.1.1 Visual assessment of clusterability

The first and most intuitive way of assessing clusterability is by visual inspection [70]. Cluster structure is assessed visually by projecting the points in a dataset to a lower-dimensional space (typically 2 or 3 dimensions) using linear or non-linear dimensionality reduction methods. The projected points are then visualized using a 2D or 3D plot, and the cluster structure is assessed by identifying the grouping structure in the plot. Linear dimensionality reduction methods such as Principal Component Analysis (PCA) [55] assume that the data lies on a linear plane.
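As a toy illustration of the linear case (our own sketch, not the implementation behind [55]), the leading principal component of 2-D data can be found by power iteration on the sample covariance matrix:

```python
import math

def leading_pc(points, iters=200):
    # Power iteration on the 2x2 sample covariance matrix: repeatedly
    # multiplying a vector by the covariance and renormalising converges
    # to the direction of maximum variance, i.e. the first PC.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.3)  # arbitrary non-zero start vector
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Points scattered tightly around the line y = x.
pts = [(t, t + 0.01 * (-1) ** i) for i, t in enumerate(range(10))]
pc = leading_pc(pts)
```

For points scattered along the line y = x, the iteration converges to the direction (1, 1)/√2, the linear "plane" that PCA assumes the data lies on.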
Non-linear dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE) [112] and Uniform Manifold Approximation and Projection (UMAP) [77] do not make this assumption about the data and use heuristics for projecting the points onto a non-linear manifold. After projection, grouping structure is identified by eye, by seeking regions of high density separated by regions of low density. This creates a level of subjectivity that may differ between users.

There are more standard methods for assessing cluster structure visually [12, 49, 115]. These methods first compute pairwise Euclidean distances between points and order them such that any potential cluster structure in the data becomes obvious. A heat map of the ordered dissimilarities is plotted, and a diagonal block structure provides evidence for the existence of clusters. Another method [115], similar in flavor, uses image processing techniques to automatically count the number of diagonal blocks. However, this method has high computational complexity and becomes impractical for datasets containing as many points as is common in scRNA-seq. Most, if not all, scRNA-seq analysis pipelines contain methods to visualize data by reducing the data onto a linear or non-linear subspace. Some established pipelines use already developed methods such as PCA [55], t-SNE [112], and UMAP [77]. Other pipelines have developed visualization methods specific to the analysis of scRNA-seq, including ZIFA [86], ZINB-WaVE [89], scvis [24], and many more. Methods such as t-SNE and UMAP are stochastic and typically require parameter tuning by the user.
Depending on the parameterization of these methods, it is possible to obtain completely different results for the same dataset, which can become problematic when determining cluster structure. This signifies the need for new methods that are deterministic and not highly impacted by parameters.

2.1.2 Spatial randomness

Another approach for assessing clusterability is testing for complete spatial randomness (CSR). This is one of the first, and perhaps the oldest, methods for testing clusterability in a dataset [23, 65]. First, define a point process as a set of mathematically defined points located in an underlying space, such as the real line or the Cartesian plane. If we consider a data matrix D of n points in p dimensions as a point process, the notion of CSR can be applied. More formally, CSR methods perform a statistical test on the matrix D and draw one of three conclusions [52]:

1. The points in D are arranged randomly, meaning there is no evidence of cluster structure.

2. The points are aggregated or clustered, meaning that there is evidence for cluster structure.

3. The points are regularly spaced.

If applied to scRNA-seq, we can let D be the gene expression counts matrix and then apply these types of tests, drawing conclusions based on the results. One of the most prominent tests for spatial randomness is the Hopkins test [47, 64]. This method tests spatial randomness by comparing nearest-neighbor distances among a set of points sampled from D to the distances of points sampled from a null model. If D exhibits complete spatial randomness, these distances should be similar on average. Simulation studies using this test have shown low power when the putative clusters are not well separated [2].

Other tests for CSR include:

1. Scan tests, which are based on the number of points in the densest sub-region of a predefined sampling window. A large count provides evidence for the presence of cluster structure.

2.
Quadrat analysis, which partitions a predefined sampling window into equally sized rectangles and counts the number of points in each quadrat. These counts follow a Poisson model under the assumption of CSR, and a chi-square test is used for hypothesis testing.

3. Inter-point distances, which reflect structural relationships among points. A test for randomness compares these distances with those computed from a null model.

4. Structural graph methods, which define a graph over the pairwise distances between points. A test for spatial randomness compares the distribution of this graph's edge lengths with that of a null model.

The methods listed above are described in [52]. Most of them suffer from expensive computations, impracticality in high dimensions, and a lack of suitable null models, preventing their extensive use. To our knowledge, there are currently no tests of spatial randomness applied to scRNA-seq datasets.

2.1.3 Multimodality testing

Multimodality testing is another way to assess clusterability [31]. It tests for the existence of multiple modes in a probability distribution over a set of points. Some of the well-known methods in this field are described below. However, before explaining further, a few definitions are provided.

Definition. A probability density function (PDF) f is unimodal if, for some value k, f is monotonically increasing for x ≤ k and monotonically decreasing for x ≥ k. f is multimodal if no such k exists.

Definition. A cumulative distribution function (CDF) F is unimodal if a k exists such that F is convex on the interval (−∞, k] and concave on the interval [k, ∞). F is multimodal if no such k exists.

Definition. The modes of a multimodal probability density function f are the values of x where f attains a local maximum. A maximum may be located at a single point, signifying a mode, or over a closed interval, signifying a modal interval.

See Figure 2.1 for an example of a unimodal PDF and CDF.

Figure 2.1: Plots of a unimodal PDF and CDF.
Note that for both the PDF and CDF, the mode is located at a. Figure adapted from [29].

Modality tests formally test a distribution generated from a set of points for multiple modes, or multimodality. The relation to clustering can be stated as follows:

1. If a dataset D contains multiple clusters, then points in the same cluster will be closer to each other than to points in other clusters.

2. The distribution function of (dis)similarities between the points can be statistically tested for multimodality.

3. A significant test provides evidence for multimodality, meaning clusters are present, and hence evidence for clusterability.

Below we describe two of the most frequently used multimodality testing methods, namely the Dip test [42] and the Silverman test [100]. Other tests [3, 43, 93] are not described here due to their low statistical power, computational inefficiency, and lack of suitable implementations.

The Dip test

Consider a dataset X = (x_1, \ldots, x_n), where for simplicity we assume all x_i are independent and identically distributed (i.i.d.). Let f be a distribution function of (dis)similarities between a set of points. The Dip test compares these two hypotheses:

H_0: f has 1 mode vs. H_1: f has > 1 mode(s).

Now, define the Dip statistic D of a cumulative distribution function F to be:

D(F) = \min_{G \in U} \sup_x |F(x) - G(x)|    (2.1)

where U is the class of all unimodal distributions. This value is computed using the empirical cumulative distribution function (ECDF) [42]. Put another way, the Dip statistic is the maximum difference between the empirical distribution function and the unimodal distribution function that minimizes this difference. The Dip essentially measures a function's departure from unimodality. To obtain a p-value, the Dip statistic is computed for the ECDF, and this value is compared with Dip values for B samples from a uniform null distribution. Formally,

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{Dip_b > Dip_X\}}{B}    (2.2)

where Dip_b is the Dip statistic for the b-th sample from U(0,1), and Dip_X is the Dip statistic for the data.
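The Monte Carlo scheme of Equation (2.2) can be sketched as follows. Computing Hartigan's Dip itself requires the unimodal-fit machinery of [42], so the statistic below is a deliberately simple stand-in (the maximum gap between a rescaled ECDF and the uniform CDF); only the bootstrap p-value logic is the point:

```python
import random

def ecdf_departure(xs):
    # Stand-in statistic: max gap between the ECDF of xs (rescaled to
    # [0, 1]) and the uniform CDF. NOT Hartigan's Dip statistic -- just
    # a simple "departure from a reference shape" measure.
    lo, hi = min(xs), max(xs)
    scaled = sorted((x - lo) / (hi - lo) for x in xs)
    n = len(scaled)
    return max(abs((i + 1) / n - x) for i, x in enumerate(scaled))

def monte_carlo_p_value(xs, statistic, B=200, seed=1):
    # Equation (2.2): fraction of B uniform null samples whose statistic
    # exceeds the statistic of the observed data.
    rng = random.Random(seed)
    d_data = statistic(xs)
    exceed = sum(
        statistic([rng.random() for _ in xs]) > d_data for _ in range(B)
    )
    return exceed / B

# Two well-separated groups of values: the departure statistic is large,
# uniform samples rarely beat it, and the p-value is small.
bimodal = [0.05 + 0.01 * i for i in range(20)] + [0.85 + 0.01 * i for i in range(20)]
p = monte_carlo_p_value(bimodal, ecdf_departure)
```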
This p-value can also be computed by interpolation from a table containing empirical percentage points of the Dip statistic, based on N = 1000001 samples of size n from U[0,1].

The Silverman test

Again, consider a dataset X = (x_1, \ldots, x_n), where for simplicity we assume all x_i are independent and identically distributed (i.i.d.). The Silverman test [100] compares these two hypotheses:

H_0: f has 1 mode vs. H_1: f has > 1 mode(s).

This test employs a kernel density estimate of the form:

f_X(h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (2.3)

where K(·) is the Gaussian kernel and h is the bandwidth. The Gaussian kernel is used because the number of modes k of f_X(h) is a non-increasing function of h. First, let f_X(h) be the kernel density estimate for X as a function of h. Next, let h_{crit} be the minimum value of h such that f_X(h) has at most k modes. The value of h_{crit} is used as a test statistic, with large values indicating evidence against unimodality and small values indicating evidence for unimodality.

The test works as follows:

1. Generate B samples Z^b = \{Z^b_1, \ldots, Z^b_n\} with b = 1, \ldots, B, where Z^b_i = (1 + h_{crit}^2/\sigma^2)^{-1/2} X^b_i, \sigma^2 is the sample variance, and X^b_i is generated from \hat{f}(h_{crit}).

2. For each sample Z^b, compute h^b_{crit}.

3. Finally, to get a p-value, we compute:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}(h^b_{crit} \leq h_{crit})}{B}    (2.4)

See [3, 100] for a careful treatment of the methodology and algorithm.

2.2 Estimating the number of clusters C

There are several clustering algorithms available for use [119]. However, most of these algorithms require that the number of clusters c be specified. While it is relatively easy to point out grouping structure using 2D or 3D plots, it is more challenging to determine exactly how many groups are present. Therefore, it is necessary to have robust ways of estimating the number of clusters in order to obtain good results.
Below we discuss a few ways typically used to estimate the number of clusters when clustering is applied.

2.2.1 Conventional methods

The conventional approach to determining the number of clusters is to run a clustering algorithm with multiple values of c and use cluster validity indices or stability indices to select an optimal c [120]. Validity indices are classified into two groups: external and internal. External indices validate a clustering solution by comparing it to an external source such as known cluster labels. Internal indices, in contrast, do not rely on any such external information; their validation is based purely on the clustering solution itself. Since users typically do not have prior information such as original labels, we focus our attention on internal validation and stability indices.

Many internal indices have been proposed to estimate the number of clusters. These include the CH index [18], the Silhouette index [92], and the Gap index [106]. Most of these indices are distance-based and measure cluster compactness using pairwise distances. They also compute cluster separation using the distances between cluster centers. A value of c is chosen such that cluster compactness or cluster separation is maximized. Distance-based internal indices are sensitive to noise, outliers, and the scaling of the variables of interest. This sensitivity is often circumvented by processing the data to remove outliers and scaling it to zero mean and unit variance before clustering. This approach to estimating c can be computationally expensive, because a clustering algorithm must be run many times and the validation index computed each time, which can be costly for large datasets.

Cluster stability methods seek to find a clustering that is robust to data perturbation and noise.
According to [10], if the data is over-clustered, the clustering algorithm will need to randomly split true clusters, leading to a lack of stability in the resulting clusters. Similarly, if the data is under-clustered, the clustering algorithm will need to randomly merge true clusters, again leading to a lack of stability in the resulting clusters. One method for assessing cluster stability is based on resampling. This approach clusters overlapping subsets of the data and then computes a similarity score between the clusterings of the subsets. The similarity score is computed from the pairwise distances between these clusterings, and the stability score is the average of these pairwise distances. A value of c is chosen such that it maximizes the stability score over a range of candidate values. See [72] for a detailed treatment of the resampling-based method for cluster stability. Another method, not discussed here, is based on building an ensemble that combines many clusterings of the dataset over a range of c values. See [102] for a treatment of this approach.

2.2.2 Statistical significance methods

Another way of estimating the number of clusters is to use more formal statistical tests. This is done by finding the value of c that provides the most significant evidence against a null hypothesis of a single cluster. Such hypotheses include the uniformity hypothesis [95] and the unimodal hypothesis [14, 41, 52]. Under the unimodality hypothesis, the data is viewed as a random sample generated from a multivariate Gaussian distribution. Under the uniformity hypothesis, the data is viewed as a random sample generated from a d-dimensional uniform distribution. For both hypotheses, evidence for or against the null can be computed using the internal validation indices discussed in Section 2.2.1. Since internal indices are used, multiple clusterings of the dataset for different values of c are again required.
See [52] for a comprehensive treatment of these methods.

Another method for estimating the number of clusters makes use of the nested nature of hierarchical clustering. The results of hierarchical clustering methods are presented as a binary tree, or dendrogram, which provides an intuitive way of viewing the hierarchical structure in the data. The number of clusters is estimated using a statistical test applied in a top-down manner at each node of the resulting dendrogram. These tests are designed such that the null hypothesis of data homogeneity is tested by computing a test statistic on the data and comparing it to a suitable null distribution. The null distribution is computed by making specific assumptions about the data, which can be either parametric or non-parametric. Methods that have implemented this approach include [44, 58]; they use either a Gaussian null model [58] or a unimodal null model [44], described below.

2.2.3 The Gaussian model for statistical significance

This model assumes clusters are derived from a single Gaussian distribution parameterized by a mean vector µ and a covariance matrix Σ. One of the most notable methods implementing this model is [69]. The hypothesis tested is formally stated as:

H_0: A pair of clusters follows a single Gaussian distribution.
H_a: A pair of clusters does not follow a single Gaussian distribution.

A p-value is obtained that provides evidence for or against the null.

The test statistic is the 2-means cluster index, a tightness or compactness measure for clusters, defined as follows:

CI = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \| x_j - \bar{x}_k \|}{\sum_{j=1}^{n} \| x_j - \bar{x} \|}    (2.5)

where \bar{x}_k is the mean of cluster k ∈ {1, 2}, C_k is the set of sample indices for cluster k, and \bar{x} is the overall mean.
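In one dimension, Equation (2.5) reduces to the following toy computation (our sketch, not the implementation of [69]):

```python
def cluster_index(points, labels):
    # Equation (2.5): within-cluster deviations from the cluster means,
    # divided by total deviation from the overall mean (1-D toy version).
    mean = lambda vs: sum(vs) / len(vs)
    overall = mean(points)
    within = 0.0
    for k in set(labels):
        members = [p for p, l in zip(points, labels) if l == k]
        centre = mean(members)
        within += sum(abs(p - centre) for p in members)
    total = sum(abs(p - overall) for p in points)
    return within / total

# A 2-means split matching the true structure gives a small CI;
# an arbitrary split of the same points gives a large one.
pts = [0, 1, 2, 10, 11, 12]
ci_good = cluster_index(pts, [0, 0, 0, 1, 1, 1])
ci_bad = cluster_index(pts, [0, 1, 0, 1, 0, 1])
```

Here the split matching the true structure yields CI = 4/30 ≈ 0.13, while the arbitrary split yields ≈ 0.89.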
We note that a smaller value of this function is associated with a larger proportion of variation explained by a given clustering, implying a better clustering.

Lastly, significance for a pair of clusters is obtained by comparing the test statistic computed from the data with the same statistic computed under the null distribution. The null distribution is empirically estimated by computing the test statistic for many datasets generated from a single Gaussian distribution. The Gaussian distribution is estimated from the original dataset using methods described in [48, 69]. A p-value is then obtained by computing:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{CI_b > CI_{data}\}}{B}    (2.6)

where B is the number of datasets generated and CI_b is the cluster index for the null dataset generated during iteration b.

In [58], this method is extended to a hierarchical setting through a Monte Carlo based sequential hypothesis testing framework. First, a hierarchical tree is generated using agglomerative clustering. Next, at select nodes, starting from the root, a test for significance of clustering is performed between the two clusters below the current node. To control the family-wise error rate (FWER) due to multiple testing, the methods described in [78] are used. This procedure returns the k nodes that are significant, which implies k + 1 clusters.

2.2.4 The Unimodal model for statistical significance

The unimodal model is non-parametric and assumes that clusters follow a unimodal distribution. The Gaussian model makes specific assumptions about the distribution of clusters, which may decrease statistical power if clusters are non-Gaussian. The method implemented in [44] instead assumes a unimodal distribution for clusters and tests:

H_0: A pair of clusters follows a single unimodal distribution.
H_a: A pair of clusters does not follow a single unimodal distribution.

A p-value is obtained that provides evidence for or against the null.
To perform the test, a null dataset X^0 needs to be computed, with the requirement that it is as close as possible to X while satisfying unimodality. This is achieved by using a Gaussian kernel density estimator (KDE) to model each feature in the dataset, expressed as:

\hat{f}(t; h_j) = (nh_j)^{-1} \sum_{i=1}^{n} K(h_j^{-1}(t - X_{ij}))    (2.7)

where h_j is the bandwidth, K(·) is the Gaussian kernel function, and X_{1j}, \ldots, X_{nj} are the entries for feature j. According to [100], there exists a critical bandwidth h_{kj} such that:

h_{kj} = \inf\{h_j : \hat{f}_j(\cdot; h_j) \text{ has at most } k \text{ modes}\}    (2.8)

h_{1j} is then computed, and \hat{f}(t; h_{1j}) is re-scaled to have variance equal to the sample variance S. Finally, using the re-scaled KDE, bootstrap samples are generated by:

X^0_{ij} = \left(1 + \frac{h_{1j}^2}{\sigma_j^2}\right)^{-1/2} (X^I_{ij} + h_{1j}\varepsilon_i)    (2.9)

where \varepsilon_i \sim N(0,1), \sigma_j^2 is the sample variance for feature j, and the X^I_{ij} are sampled uniformly with replacement from the observed data for feature j [44]. To ensure the covariance structure is maintained, the data is scaled to have mean 0 and variance 1 before computing X^0; X^0 is then multiplied by the Cholesky root of the sample covariance matrix of X.

The 2-means cluster index defined in Section 2.2.3 for the Gaussian model is used to measure the strength of the clustering obtained from both X and X^0. Again, significance for a pair of clusters is obtained by comparing the test statistic computed from the data with the same statistic computed under the null distribution. The null distribution is empirically estimated by computing the test statistic for many datasets generated from the null unimodal distribution. The unimodal distribution is estimated from the original dataset using the methods described above and formally in [44].
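For a single, already centered feature, the null draws of Equation (2.9) can be sketched as below (our illustration, not the implementation of [44]); the leading scale factor is what keeps the variance of the null draws equal to the sample variance:

```python
import random

def smoothed_bootstrap(xs, h, B, seed=2):
    # Equation (2.9), one feature: X0 = (1 + h^2/s^2)^(-1/2) (X_I + h*eps),
    # with eps ~ N(0, 1) and X_I resampled with replacement. The leading
    # factor compensates for the kernel noise, keeping the variance of
    # the null draws equal to the sample variance. xs is assumed already
    # centered (mean 0), as in the text.
    rng = random.Random(seed)
    n = len(xs)
    s2 = sum(x * x for x in xs) / (n - 1)
    scale = (1 + h * h / s2) ** -0.5
    return [scale * (rng.choice(xs) + h * rng.gauss(0, 1)) for _ in range(B)]

# Variance of the smoothed draws stays close to the sample variance even
# though kernel noise with bandwidth h = 2 was added.
data = [float(i) for i in range(-10, 11)]  # centered, sample variance 38.5
draws = smoothed_bootstrap(data, h=2.0, B=4000)
var_draws = sum(d * d for d in draws) / (len(draws) - 1)
```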
A p-value is then obtained by computing:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{CI_b > CI_{data}\}}{B}    (2.10)

where B is the number of datasets generated and CI_b is the cluster index for the null dataset generated during iteration b.

It is worth noting that the authors mention the applicability of this method to a nested setting, as seen in [58]. However, no methods for controlling the FWER due to multiple testing are presented. This is a potential drawback that could be addressed by adapting the FWER control method used in [58] to this model. Both methods require bootstrapping to generate a suitable null model, which can be very time consuming and impractical for large datasets.

2.2.5 ScRNA-seq specific methods for estimating C

A few clustering algorithms developed for scRNA-seq datasets estimate the number of clusters directly or indirectly [38, 59, 113, 127]. For example, the consensus clustering method SC3 [59] uses random matrix theory, comparing the eigenvalues of a transformed gene expression matrix, to estimate the number of clusters. However, this is highly susceptible to overfitting when data sparsity is high. Overfitting due to sparsity is particularly worrisome since scRNA-seq datasets have an abundance of zero entries. Another method that attempts this is SIMLR [113], which estimates the number of clusters by optimizing heuristic functions based on network diffusion. SIMLR's optimization algorithm requires a range of possible numbers of clusters, which in turn requires prior knowledge about the number of clusters. Imposing such prior knowledge renders clusterability analysis useless, since it implicitly assumes that the dataset contains clusters, which may not be the case.

Chapter 3

Methods

This chapter details our approach for testing for clusterability and computing the number of clusters. In Section 3.1, we describe how multimodality testing using the Dip test can be used as a measure of clusterability.
Section 3.6 describes the Computing Clusters Through Multimodality Testing (CCMT) procedure for computing the number of clusters. In Section 3.7, we formally describe the simulation setup used for method evaluation. In Section 3.8, we discuss the real datasets used for benchmarking and comparisons with other methods. In Section 3.10, we detail the metric used for evaluation on both simulated data and real benchmarking data. In Section 3.11, we provide details about the clustering methods used for comparison against the CCMT procedure. In Section 3.12, we discuss how the performance of both multimodality testing and the CCMT procedure is assessed on the simulation studies and real data. Finally, in Section 3.13, we present details of how computational running times are measured.

3.1 Testing expression patterns for multimodality

To assess the cluster structure inherent to a scRNA-seq dataset, we look for multimodality in the gene expression patterns of cells, using multimodality testing as a proxy for clusterability. Formally, this method takes as input a scRNA-seq gene expression matrix M with n rows and p columns. We denote by y_ij the raw expression count for gene j in cell i. We assume that the count values are integer-valued and have not been normalized. Before any analysis is done, we filter genes by selecting genes expressed in at least 5 cells. We also select cells with a minimum of 200 and a maximum of 2500 expressed genes. We cap the number of genes expressed in a cell because cells with extremely high gene counts tend to be multiplets, and cells with very low gene counts tend to be the result of empty droplets. Cells with more than 5% mitochondrial content are also filtered out, as high mitochondrial content is a possible sign of cells that were broken in a droplet during sequencing.

Following the steps formally described below, the matrix M of counts is normalized, highly informative genes are selected, and dimensionality reduction using PCA is performed.
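The quality-control filters just described can be sketched as a standalone function. The dict-of-dicts layout is a toy stand-in for the matrix M, the "MT-" gene-name prefix used to flag mitochondrial genes is our assumption (the text does not state the convention), and the demo shrinks the thresholds to suit toy data:

```python
def qc_filter(counts, min_cells=5, min_genes=200, max_genes=2500, max_mito=0.05):
    # counts: {cell_name: {gene_name: integer count}} -- a toy stand-in
    # for the n x p matrix M. Defaults mirror the thresholds in the text;
    # the "MT-" prefix convention for mitochondrial genes is an assumption.
    genes = {g for cell in counts.values() for g in cell}
    keep_genes = {
        g for g in genes
        if sum(1 for cell in counts.values() if cell.get(g, 0) > 0) >= min_cells
    }
    kept = {}
    for name, cell in counts.items():
        expr = {g: c for g, c in cell.items() if g in keep_genes and c > 0}
        total = sum(expr.values())
        mito = sum(c for g, c in expr.items() if g.startswith("MT-"))
        if min_genes <= len(expr) <= max_genes and total and mito / total <= max_mito:
            kept[name] = expr
    return kept

# Toy demo with shrunken thresholds: five healthy cells pass, while a
# high-mitochondrial cell and a near-empty cell are removed.
counts = {f"c{i}": {**{f"g{j}": 2 for j in range(10)}, "MT-CO1": 1} for i in range(5)}
counts["high_mito"] = {**{f"g{j}": 2 for j in range(10)}, "MT-CO1": 50}
counts["sparse"] = {"g0": 1, "g1": 1}
kept = qc_filter(counts, min_cells=5, min_genes=3, max_genes=50, max_mito=0.05)
```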
Finally, multimodality is tested using the Dip test [42] on the cosine distances between the cells in PCA space. See Figure 3.1 for a visualization of the pipeline.

Figure 3.1: An overview of the integrated pipeline for multimodality testing. Starting with a molecular counts matrix, cell normalisation is done using three methods. Next, feature selection and PCA are done for each normalisation method, followed by computing cell-to-cell cosine distances for each normalisation method. Finally, multimodality testing is done using the Dip test on the distribution of these cell-to-cell distances. The output is a table containing three p-values, one for each normalisation method.

3.2 Counts modelling and normalization

Normalization of the counts matrix is a pivotal step in analyzing scRNA-seq data. Counts normalization is typically done to address differences in sequencing depth from cell to cell. It also helps to adjust for various forms of noise or bias in the sequencing process, so that they do not heavily confound the real biological differences present. To this end, we use three different methods (described below) for normalizing the counts matrix.

3.2.1 Log transformation of normalized counts

Log-transformed molecular counts have become widely used in the analysis of scRNA-seq data because of their statistical simplicity and interpretability. The transformed values can represent log-fold changes in gene expression between cells, which is sometimes used for informative gene selection and differential expression analysis. Log transformation also helps to reduce the severity of stochasticity in the counts of highly expressed genes. Formally, the log-transformed model is defined as follows: let y_ij be the observed molecular count for cell i and gene j, and let n_i = \sum_j y_{ij} be the total molecular count in cell i. Now let \hat{\pi}_{ij} = \frac{y_{ij}}{n_i} be the observed proportion of gene j in cell i.
We can define the log-transformed values as z_{ij} = \log_2(c + \hat{\pi}_{ij} \cdot m), where c = 1 is a pseudo-count to deal with situations where \hat{\pi}_{ij} = 0, and m is a scaling factor (typically set to 10^6). The resulting z_{ij} values are used as normalised expression counts for downstream analyses. The Seurat package [17] was used to compute the log-transformed values.

3.2.2 Multinomial normalization

Recent studies have shown that log transformation of molecular counts causes statistical biases, which lead to a loss of power during downstream inference [39, 46, 71, 107]. These biases include inadequate variance stabilization caused by artificial variance inflation. The second normalization method assumes the counts are derived from a multinomial model and thus models the molecular counts directly. Formally, the multinomial model is defined as follows: let y_ij be the observed molecular count for cell i and gene j, and let n_i = \sum_j y_{ij}. Now let \pi_{ij} be the true, unknown relative abundance of gene j in cell i. The vector y_i = (y_{i1}, \ldots, y_{iJ})^T, with the constraint n_i = \sum_j y_{ij}, follows a multinomial distribution with density function

f(y_i) = \binom{n_i}{y_{i1}, \ldots, y_{iJ}} \prod_j \pi_{ij}^{y_{ij}}    (3.1)

This model is fitted using maximum likelihood estimation on the raw molecular counts via a binomial or Poisson approximation. The Pearson residuals or deviance residuals are used as normalized expression counts for downstream analyses. For a more thorough treatment of the multinomial model for molecular counts, see [107]. The scry package [107] was used for fitting the model and computing the normalised values.

3.2.3 Regularized negative binomial normalization

The third and final model used to normalize the counts is a regularized negative binomial model. It is similar to the multinomial model in that both model the counts directly; however, this model assumes the counts follow a negative binomial distribution. This is very similar to existing normalization models such as ZIFA [86].
However, [39] showed that these models are prone to overfitting, which negatively affects downstream analyses such as clustering and differential expression analysis. To combat this, [39] proposes a regularized version of the negative binomial model. Here, a generalized linear model is fitted with the molecular counts for a gene j (i.e., y_ij) as the response and sequencing depth as a covariate. Regularization is done by applying kernel regression to the parameters estimated from this model. Formally, the regularized negative binomial model is defined as follows:

\log(E(x_i)) = \beta_0 + \beta_1 \log_{10} m    (3.2)

where x_i is the vector of molecular counts for gene i, and m is the vector of total counts assigned to the cells, i.e., m_j = \sum_i x_{ij}. Here \beta_0 and \beta_1 are regularised across genes using kernel regression. For a thorough treatment of the regularized negative binomial model, see [39]. Normalised counts are computed as Pearson residuals defined using the regularized regression parameters:

z_{ij} = \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}}, \quad \mu_{ij} = \exp(\beta_{0i} + \beta_{1i} \log_{10} m_j), \quad \sigma_{ij} = \sqrt{\mu_{ij} + \frac{\mu_{ij}^2}{\theta_i}}

[39]. Here z_ij is the Pearson residual of gene i in cell j, \mu_{ij} is the expected molecular count for gene i in cell j, and x_ij is the observed molecular count for gene i in cell j under the regression model defined earlier. Finally, \beta_0, \beta_1, and \theta_i are parameters obtained from the model. The sctransform package [39] was used for fitting the model and obtaining the normalised scores.

3.3 Feature selection

A typical scRNA-seq dataset can contain expression values for thousands of genes, creating potential problems for downstream analyses. Firstly, most analysis methods do not scale well with increasing dimensionality, and computational costs increase dramatically. Secondly, a large proportion of non-informative genes significantly increases the amount of noise in the data, which dilutes the true biological signal and thus reduces the power of statistical methods.
This problem is circumvented by selecting a subset of 500–1500 informative genes, and there are a few existing methods for computing how informative genes are. In methods based on gene variability, genes are ranked by their variability across cells, and only highly variable genes are retained [15]. In methods based on gene expression, genes are ranked by their average expression across cells, and only highly expressed genes are retained [26]. Depending on the normalization method used, we select the top 500 most informative genes. The selection methods are described below.

3.3.1 Feature selection for log-transformed normalized counts

To select highly informative genes for log-transformed normalized counts, a variance stabilizing transformation as described in [75] is first performed. This accounts for the mean-variance relationship inherent to scRNA-seq data that is not removed by the log transformation. Next, a line is fitted to the relationship between log(variance) and log(mean) of the genes using local polynomial regression (loess). Highly informative genes are then selected by ranking the genes using the standardized residuals from the fitted line. The Seurat package [17] was used to compute these genes.

Feature selection for multinomial normalization

For the multinomial model, the gene deviance based on a binomial approximation to the multinomial model is defined as:

D_j = 2 \sum_i \left[ y_{ij} \log\frac{y_{ij}}{n_i \hat{\pi}_j} + (n_i - y_{ij}) \log\frac{n_i - y_{ij}}{n_i(1 - \hat{\pi}_j)} \right]    (3.3)

where D_j is the deviance for gene j and the other parameters are defined as before in Section 3.2.2 [107]. This model assumes constant gene expression across cells, so the genes that deviate significantly from it are the most informative. Gene deviance values are sorted, and the most deviant genes are used for clusterability analyses.
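As a toy, single-gene illustration of Equation (3.3) (our sketch, not the scry implementation):

```python
import math

def binomial_deviance(y, n):
    # Equation (3.3) for a single gene j: y[i] is the gene's count in
    # cell i, n[i] is the cell's total count, pi_hat the pooled proportion.
    pi_hat = sum(y) / sum(n)
    def term(a, b):
        # a * log(a / b), with the 0 * log(0) = 0 convention
        return a * math.log(a / b) if a > 0 else 0.0
    return 2 * sum(
        term(yi, ni * pi_hat) + term(ni - yi, ni * (1 - pi_hat))
        for yi, ni in zip(y, n)
    )

# A gene expressed at the same proportion of every cell's depth is
# perfectly fit by the constant model; a gene whose proportion varies
# strongly across cells is highly deviant, hence informative.
flat = binomial_deviance([10, 20, 30], [100, 200, 300])
variable = binomial_deviance([55, 5, 0], [100, 200, 300])
```

The constant-proportion gene fits the null model exactly (zero deviance), while the variable gene is highly deviant and would rank as informative.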
The scry package [107] was used for computing the gene deviance.

3.3.2 Feature selection for regularized negative binomial normalisation

For the negative binomial normalization, the gene residuals $z_{ij}$ defined in Section 3.2.3 are sorted. The top genes with the highest residuals are selected and used for downstream analyses. The sctransform package [39] is used to compute the gene Pearson residuals.

3.4 Dimensionality reduction

Dimension reduction methods have become an integral part of most scRNA-seq analysis pipelines since scRNA-seq datasets often have high dimensionality. Dimensionality reduction makes analysis methods faster and more scalable by producing a lower dimensional embedding of a high dimensional dataset. The lower dimensional projections are computed such that they preserve the majority of the relevant signal in the dataset. These methods can also be used for denoising, visualization, and data compression. Most dimension reduction methods come in two flavors, linear and non-linear. Linear methods assume that the data lie on a linear manifold and use a linear function to project the data onto a lower dimension. If this manifold is non-linear, however, a linear method will produce an inadequate lower dimensional embedding. Non-linear methods do not make this assumption and are thus better able to project non-linear data onto a lower dimension. Throughout this thesis, PCA is used for dimensionality reduction.

3.5 Multimodality testing

Testing for multimodality in gene expression patterns is an integral part of this work. As an example, consider a scRNA-seq dataset containing four distinct cell types. Cells of the same type will typically have similar gene expression patterns, whereas cells of different cell types will have different gene expression patterns. We can compute and plot the cosine values of gene expression between cells. The distribution of cosine values will then contain two modes.
One mode is centered around a large cosine value, indicating cells in the same partition with similar gene expression patterns. The second mode is centered around a small cosine value, indicating cells in different partitions with different gene expression patterns. In contrast, a dataset containing a single cell type or homogeneous cell types should show roughly the same cosine values between cells, indicating a single partition; see Figure 3.2.

The multimodality tests [3, 42, 100] discussed earlier in Section 2.1.3 can be used to assess this idea more formally. We use the Dip test to test the distribution of the cosine distances between the cells for multimodality. The idea of using modality tests on pairwise distances was proposed initially by [1, 2]. However, we have modified this approach in a few ways for more effective use on scRNA-seq data. Firstly, we make the number of PCS used for computing distances dependent on the characteristics of the data, by selecting the top most significant PCS. This limits the influence of noise by retaining only PCS that contribute a significant amount of signal. Secondly, we compute distances between cells using the cosine distance defined in Equation 3.4 instead of the Euclidean distance. The Dip test is used because of its scalability and statistical power.

$$D(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3.4)$$

where $\mathbf{x}$ and $\mathbf{y}$ are both vectors of gene expression values.

For each normalization method, we perform this test as follows:

1. Compute the cosine distances between the cells using the top K PCS.
2. Run the Dip test on these pairwise distances with significance threshold α.
3. If the p-value is < α, the dataset shows significant evidence for clustering structure, or clusterability.
4.
If the p-value is > α, the dataset does not show significant evidence for clustering structure or clusterability.

The Dip test provides a p-value representing the probability of observing this distribution, or a more extreme multimodal distribution, if the data were unimodal. This p-value should be large for unimodal distributions and small for multimodal distributions. As is common practice, the significance threshold α is set to 0.05. See Figure 3.1 for an illustration of this pipeline.

3.6 Computing clusters through multimodality testing

Below we describe the Computing Clusters Through Multimodality Testing (CCMT) procedure developed for estimating the number of clusters. This method is similar to the statistical significance methods discussed in Section 2.2.2. We couple hierarchical clustering with discriminant analysis and multimodality testing to estimate the number of clusters. Below we discuss generating the hierarchical partition and the discriminant coordinates. Next, the CCMT procedure is discussed in detail. Finally, we discuss how the results of the CCMT procedure are combined across normalization methods.

3.6.1 Generating a hierarchical partition

A hierarchical tree is generated by first computing the top k PCS of the cosine distances between the cells, as defined in Equation 3.4. Next, hierarchical clustering is performed on the top k PCS using squared Euclidean distances coupled with Ward's minimum variance criterion.

Figure 3.2: A) PCA plot of four simulated heterogeneous cell types. B) Density plot of the correlation distances between the cells in A. C) PCA plot of four simulated homogeneous cell types. D) Density plot of the cosine distances between the cells in C.
Note that high values in B and D indicate low correlation and low values indicate high correlation.

3.6.2 Discriminant coordinates

Discriminant coordinates are often used to study clustering effects in a dataset, with applications including face recognition [28], action recognition [81], and gesture recognition [94]. One example is Fisher discriminant coordinates [32], which computes the direction that best separates two classes. Discriminant coordinates aim to find a projection of two classes onto a line that best separates both classes; see Figure 3.3. By projecting the classes onto this discriminant line, the overlap between both classes decreases, which enables the testing of clustering strength through multimodality testing. The method of discriminant coordinates is described below.

Consider a set of $n$ observations (in this case, normalised expression values for genes), with $n_i$ observations in group $i \in \{1, 2\}$ and $\sum_i n_i = n$. Let $x_{ij}$ be the $j$th observation in group $i$ and denote

$$\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} \quad \text{and} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{2}\sum_{j=1}^{n_i} x_{ij} \qquad (3.5)$$

We can then define two matrices, the within-group scatter matrix $W$:

$$W = \sum_{i=1}^{2}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' = \sum_{i=1}^{2} (n_i - 1) S_i = (n - g)S \qquad (3.6)$$

and the between-group scatter matrix $B$:

$$B = \sum_{i=1}^{2} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' \qquad (3.7)$$

Here $S_i$ is the sample covariance matrix for group $i$, $S$ is the pooled within-group covariance matrix, and $g = 2$ is the number of groups. A discriminant coordinate vector is defined as the vector $c$ that maximizes the function:

$$J(c) = \frac{c'Bc}{c'Wc} \qquad (3.8)$$

$J(c)$ is maximized by computing the eigenvector associated with the largest eigenvalue of $W^{-1}B$. This eigenvector provides the direction that maximizes the between-class variance captured by $B$ while minimizing the within-class variance captured by $W$.
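For two classes, the leading eigenvector of $W^{-1}B$ is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$, so the direction can be found without an eigendecomposition. A minimal two-dimensional sketch in Python (the thesis implementation is in R):

```python
def fisher_direction_2d(class1, class2):
    """Fisher discriminant direction c = W^{-1}(mean1 - mean2) for two
    classes of 2-D points, using the explicit 2x2 matrix inverse."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    def scatter(pts, m):
        # within-class scatter: sum of outer products of deviations from the mean
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        return s

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    W = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [( W[1][1] * diff[0] - W[0][1] * diff[1]) / det,
            (-W[1][0] * diff[0] + W[0][0] * diff[1]) / det]

# Two toy clusters separated along the first axis.
class1 = [(0, 0), (1, 1), (0, 1), (1, 0)]
class2 = [(10, 0), (11, 1), (10, 1), (11, 0)]
c = fisher_direction_2d(class1, class2)
# Projecting each point onto c gives the 1-D values later tested for multimodality.
```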
By projecting the classes onto the vector $c$, we decrease the overlap between both classes, which enables us to test for separability between the classes.

3.6.3 The CCMT procedure

The CCMT procedure is based on sequentially testing for multimodality using the Dip test coupled with discriminant coordinates. Contrary to established cluster significance testing methods, this method requires no formal hypothesis testing other than that performed by the Dip test. The details of the procedure are formally described below.

1. Let M denote the normalized counts matrix generated using one of the described methods, and let Mc be the matrix of cosine distances between points in M.
2. Let D denote the top k PCS of Mc.
3. Generate a tree T from D using Ward's minimum variance method on squared Euclidean distances.
4. For each node in T, let C1 and C2 be the two clusters containing the children nodes in the two respective sub-trees.
   (a) Let Dc12 be D subsetted to contain only the points in the two subtrees at the node currently being considered.
   (b) Compute a discriminant coordinate vector w that best separates C1 and C2.
   (c) Generate a one dimensional projection of Dc12 onto w and denote it x.
   (d) Finally, test x for multimodality using the Dip test and return the p-value. If the classes are well separated, their projection onto the discriminant vector will be multimodal, and a significant p-value will be obtained.
5. Starting with the root node, steps 4a-4d are performed on all nodes whose parents were themselves significant and whose subtrees each contain more than n points. For this implementation, the significance level α = 0.05 and the minimum number of points n = 10.
6. Finally, a dendrogram is returned in which each node's significance is labeled; see Figure 3.5. The points below the significant nodes are the significant partitions, and the number of significant partitions is an estimate of the number of clusters.
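The traversal in steps 4-5 can be sketched abstractly in Python. Here `test_split` is a placeholder for the discriminant projection plus Dip test of steps 4a-4d, trees are nested tuples with lists of point indices at the leaves, and the stopping rule is a simplified reading of step 5:

```python
ALPHA = 0.05   # significance level from step 5
MIN_N = 10     # minimum number of points per subtree from step 5

def ccmt(node, test_split, partitions):
    """Recursively test the nodes of a hierarchical tree for significant splits.

    node: either a (left, right) tuple or a flat list of point indices (a leaf);
    test_split(left_pts, right_pts): returns the Dip p-value for the projection
    of the two child clusters onto their discriminant coordinate (placeholder);
    partitions: output list collecting the significant partitions.
    """
    if not isinstance(node, tuple):              # a leaf cannot be split further
        partitions.append(node)
        return
    def points(t):
        return points(t[0]) + points(t[1]) if isinstance(t, tuple) else t
    left, right = node
    lp, rp = points(left), points(right)
    if min(len(lp), len(rp)) < MIN_N or test_split(lp, rp) >= ALPHA:
        partitions.append(lp + rp)               # split not supported: keep one cluster
    else:                                        # split supported: recurse into children
        ccmt(left, test_split, partitions)
        ccmt(right, test_split, partitions)

# With a test that always reports separation, each subtree becomes a partition.
parts = []
ccmt((list(range(12)), list(range(12, 24))), lambda a, b: 0.001, parts)
```

The number of collected partitions is then the estimate of the number of clusters.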
These significant partitions are also extracted and used as a clustering of the data. See Figure 3.4 for an example of this method applied to a simulated dataset.

Both multimodality testing and the CCMT procedure can be integrated into a pipeline: the user first applies multimodality testing and then proceeds to run the CCMT procedure if there is significant evidence for cluster structure. See Figure 3.6 for an illustration of the pipeline.

Figure 3.3: An illustration of discriminant coordinates. The Fisher discriminant coordinate is the line that best separates both classes. In both panels, red and green represent two different classes. A) Two classes that are not well separated by the discriminant line. B) Two classes that are well separated by the discriminant line.

3.6.4 Combining significant partitions

For a given dataset, the CCMT procedure is performed on each of the three normalized versions. We provide two methods to combine the results across the normalized datasets. The first uses an ensemble method, which combines the clustering solutions across the three normalized datasets; see Figure 3.7. The second uses a multiview approach [13, 20, 122], which combines all three normalized versions of the dataset before the CCMT procedure is performed. This approach computes a set of PCS common to all three normalized datasets [109]; the CCMT procedure is then applied to the common PCS. See Figure 3.8.

3.7 Simulation studies

To test the robustness and scalability of the methods developed in this thesis, a set of synthetic datasets was generated using the Splatter package [123]. This package was designed for simulating scRNA-seq count data and allows one to vary the number of cells, genes, and clusters, with varying levels of separability and varying degrees of sparsity. Splatter makes it possible to generate synthetic datasets that capture the different properties a typical scRNA-seq dataset may have.
This tool also makes it possible to simulate problems typically associated with the analysis of scRNA-seq datasets, including sparsity due to dropout events, low separability between cell types, and an increasing number of cells. We adopted a simulation paradigm similar to that used in [62] to generate three simulation setups, although we changed some of the parameters used in the simulation setup for more rigorous testing. The simulation setups are described below.

Figure 3.4: An example of the discriminant coordinates significance clustering test. A) and C) Two classes separated by a discriminant line in black. B) and D) The Dip test applied to the projection of the two classes onto this discriminant line. Note that for B, a significant p-value is obtained, reflecting the separation between the classes. For D, the opposite is observed. P-values < 2e−16 were rounded to 0.

3.7.1 Simulating data scalability, differing cluster number and sizes

The first simulation setup assessed scalability and differing cluster sizes. Scalability here is defined as the capacity of the methods developed to adequately handle an increasing number of cells. We also require them to handle an increasing number of clusters with different relative sizes. For this purpose, counts are simulated for a range of cells in {5000, 10000, 15000}, each with 1000 genes. For each possible number of cells, clusters in the set {4, 8, 16} are generated. For each number of clusters, a dataset was generated containing equal proportions of cells in each cluster and a dataset containing an uneven proportion of cells in each cluster. To generate uneven cluster proportions, numbers p1, p2, ..., pk were simulated from a uniform distribution such that $\sum_{i=1}^{k} p_i = 1$, where k is the number of clusters for the current simulation. For example, a set of unbalanced proportions for a dataset containing four clusters would be {0.20, 0.10, 0.40, 0.30}, whereas the proportions for a dataset containing four balanced clusters would be {0.25, 0.25, 0.25, 0.25}. This setup generated a total of 18 datasets. See Figure 3.9 for an illustration of this setup.

Figure 3.5: An illustration of the CCMT procedure. A) A PCA plot of four simulated heterogeneous populations. B) The significance tree after applying the CCMT procedure. Note that the tree is colored by significance and by whether a test was performed at a node.

Figure 3.6: Overview of how multimodality testing and the CCMT procedure can be integrated into a pipeline. If the test for clusterability returns evidence for cluster structure, the CCMT procedure is applied and the clusters are returned. Otherwise, there is the possibility of a single cell population, or the cells may lie on a continuum.

Figure 3.7: An illustration of the ensemble method for combining significant partitions across normalisation methods. An ensemble is generated using the soft least squares consensus partition method (DWH) implemented in the Clue package.

Figure 3.8: An illustration of the multiview method for combining significant partitions. A set of common principal components (CPCS) is first computed for all three normalisation methods. These CPCS are then used for significant clustering.

Figure 3.9: An illustration of simulation setup 1. A total of 18 simulated datasets were generated.

3.7.2 Simulating cluster separability

The second simulation setup assessed the ability to detect varying degrees of separation between clusters. Here, the number of cells and genes was fixed at 1000 and 5000, respectively. The number of clusters was fixed at 8, with relatively balanced sizes. The separability between clusters was then varied from no separation at all to well separated. To control cluster separability, the probability for a gene to be differentially expressed between clusters was varied by generating 50 values in the range [0, 0.5]. Figure 3.11 shows illustrations of both extremes.
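The uneven cluster proportions described for simulation setup 1 can be generated by normalizing uniform draws; a small Python sketch (Splatter itself receives such proportions through its group probability argument):

```python
import random

def uneven_proportions(k, seed=1):
    """Draw k cluster proportions p_1, ..., p_k that sum to 1 by normalizing
    independent uniform draws (one simple way to realize the description in
    simulation setup 1)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(k)]
    total = sum(u)
    return [x / total for x in u]

# Proportions for a 4-cluster dataset, then cells per cluster for 5000 cells.
props = uneven_proportions(4)
cells_per_cluster = [round(p * 5000) for p in props]
```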
Separability values close to 0 produce clusters with low separability, and larger values produce clusters with higher separability. This setup generated a total of 50 datasets. See Figures 3.10 and 3.11 for an illustration of this simulation setup.

Figure 3.10: An illustration of simulation setup 2. A total of 50 simulated datasets were generated.

Figure 3.11: An illustration of the possible ranges of cluster separability in 4 simulated clusters. A) A simulated dataset with no cluster separability. B) A simulated dataset with low cluster separability. C) A simulated dataset with intermediate cluster separability. D) A simulated dataset with high cluster separability.

3.7.3 Simulating data sparsity

In the third simulation setup, we assessed our method's ability to handle data sparsity, or increasing proportions of missing information. Again, the number of cells and genes is held constant at 5000 and 1000, respectively. The number of clusters remained at 8, with balanced sizes. To control the proportion of sparsity in the generated datasets, the parameter (dropout.mid) controlling the rate of zero counts in the logistic function that generates the counts in the Splatter package was varied over the range {0, 1, 2, 3, 4, 5} to generate datasets ranging from 20% to 90% sparsity. This setup generated a total of 6 datasets. See Figure 3.12 for an illustration. Figure 3.13 shows the possible ranges of data sparsity being considered and how sparsity affects the datasets. Notice that as sparsity increases, the separability between the clusters decreases.

Figure 3.12: An illustration of simulation setup 3. A total of 6 simulated datasets were generated.

3.8 Benchmarking data

A set of published scRNA-seq datasets frequently used for evaluating scRNA-seq pipelines was used for evaluation as well. A majority of these datasets were obtained from the Hemberg lab GitHub repository.
The Peripheral Blood Mononuclear Cells (PBMCS) datasets were obtained from the 10X Genomics website. These datasets vary in the number of cells, number of clusters, cluster separability, sparsity, and overall complexity. They are also derived from multiple organisms, including humans and mice, as well as multiple tissues, including blood, pancreas, brain, spleen, and retina.

The datasets used are formally described below. They fall into two groups: the small scale datasets and the medium to large scale datasets. The small scale datasets are those with fewer than 2500 cells; they were used to benchmark CCMT against other scRNA-seq methods that do not scale well to larger datasets. The medium to large scale datasets are those with more than 2500 cells; they were used to benchmark CCMT against other scRNA-seq methods that are well equipped to handle reasonably large datasets.

Figure 3.13: An illustration of the possible ranges of data sparsity in 4 simulated clusters. Sparsity is defined as the fraction of genes with 0 expression values in each cell. A) A simulated dataset with 20-30% sparsity. B) A simulated dataset with 25-35% sparsity. C) A simulated dataset with 50-70% sparsity. D) A simulated dataset with 80-90% sparsity.

3.8.1 Small scale datasets

Seven small scale datasets, described in Table 3.1, were used for benchmarking. For the CCMT procedure, these datasets were processed as described in Chapter 3.1. For the other methods, preprocessing was done using the default settings provided by the methods.
We note that with improvements in sequencing technology and reduced sequencing costs, it is now possible to profile hundreds of thousands more cells than considered in the small scale datasets. We decided to include these datasets for completeness and to provide a means for testing against other methods with computational bottlenecks such as an increasing number of cells.

Dataset           NumGenes   NumCells   NumPopulation   Sparsity
Gold [33]         58,302     925        3               0.85
BaronMouse [8]    14,878     1,886      13              0.89
Tasic [105]       24,150     1,679      18              0.69
Muraro [83]       19,127     2,126      10              0.73
Wang [114]        19,950     635        8               0.70
Xin [118]         39,851     1,600      8               0.88
Li [66]           55,186     561        9               0.79

Table 3.1: Small scale datasets used.

3.8.2 Medium to large scale datasets

Ten medium to large scale datasets, described in Table 3.2, were used for benchmarking. For the CCMT procedure, these datasets were processed as described in Chapter 3.1. For the other methods, preprocessing was done using the default settings.

Dataset           NumGenes   NumCells   NumPopulation   Sparsity
Silver5 [33]      17,043     8,352      11              0.96
Segerstolpe [96]  25,525     3,514      15              0.82
Klein [61]        24,175     2,717      4               0.66
Zheng [126]       15,568     3,994      4               0.97
Chen [21]         23,284     14,437     47              0.93
HMS [63]          28,962     11,127     9               0.97
Zeisel [124]      19,972     3,005      9               0.81
Romanov [91]      24,341     2,881      7               0.88
BaronHuman [8]    20,125     8,569      14              0.91
Shekar [97]       13,166     27,499     19              0.93

Table 3.2: Medium to large scale datasets.

3.9 Obtaining ground truth clusters

For both the small and medium to large scale datasets used in this work, the cell labels provided by their respective authors are used as ground truth. We are aware that intrinsic difficulties exist in defining ground truth cell labels when evaluating clustering or classification methods, owing to the existence of multiple biologically plausible and interpretable ways of clustering a scRNA-seq dataset, each representing relevant signals. The datasets used in this work were clustered with existing algorithms, and cell type labels were assigned using domain expertise. Therefore, we acknowledge that there is a risk of inherent bias in favor of the clustering method used to compute these cell type labels when making comparisons.

3.10 Evaluation metrics

To benchmark clustering solutions across methods, we use the Adjusted Rand Index (ARI). This metric has been routinely used in the clustering domain to evaluate how similar two clustering solutions are; it can also be used to compare how similar a clustering solution is to some known ground truth. The ARI score equals 1 for two identical clustering solutions, is close to 0 for unrelated solutions, and can be slightly negative for solutions that agree less than expected by chance.

Let D be a set containing n points. Denote a clustering of D as C, a set of non-overlapping and non-empty subsets $C_1, \ldots, C_k$. Now denote another clustering of D as the set C′ of non-overlapping and non-empty subsets $C'_1, \ldots, C'_m$. Using C and C′, we can create a k-by-m contingency table T such that $T_{ij}$ is the size of the intersection of $C_i$ and $C'_j$.

We can then formally define the ARI score as:

$$\mathrm{ARI}(C, C') = \frac{\sum_{i=1}^{k}\sum_{j=1}^{m}\binom{T_{ij}}{2} - u_3}{\frac{1}{2}(u_1 + u_2) - u_3} \qquad (3.9)$$

where $u_1 = \sum_{i=1}^{k}\binom{|C_i|}{2}$, $u_2 = \sum_{j=1}^{m}\binom{|C'_j|}{2}$, and $u_3 = \frac{2\,u_1 u_2}{n(n-1)}$.

3.11 Clustering methods

There are many algorithms available to cluster scRNA-seq datasets. We selected three to benchmark on the small scale datasets and one to benchmark on the medium to large scale datasets. For the small scale datasets, we evaluated Seurat [17], SC3 [59], and SIMLR [113]. For the medium to large scale datasets, only Seurat was used. This provides a comparison of the CCMT procedure against methods that are currently used. Both SC3 and SIMLR were used only on the small scale datasets because they do not scale well to larger datasets. Seurat was chosen because it is currently one of the most widely used methods for clustering scRNA-seq data, and it also scales quite well to larger datasets.
For benchmarking, we evaluated clustering accuracy and consistency using the Adjusted Rand Index described in Section 3.10. We also compared the running time of all the clustering algorithms. The default parameters provided in the respective software packages were used for all clustering methods, including for data preprocessing. We note that more careful consideration of these parameters may provide different results.

3.12 Performance assessment

3.12.1 Simulation Studies

To assess the performance of multimodality testing under the various simulated conditions, we look at the p-values returned by the Dip test in each of the simulation studies. In the first simulation setup, all the datasets contain cluster structure; therefore multimodality testing should return a significant p-value for all simulations. For the second setup, which simulates cluster separability, we expect the Dip test to be sensitive even to very low cluster separability. For the third setup, which simulates data sparsity, all the datasets have a cluster structure; therefore, multimodality testing should return a significant p-value at all levels of sparsity.

For the CCMT procedure, these simulation studies provide a way to measure how well CCMT can recover the simulated partitions in each setup. The ARI score is used to measure how well the simulated partitions are recovered.

3.12.2 Positive control

The small scale and the medium to large scale datasets discussed earlier are used as positive control datasets. These datasets have been used routinely to evaluate new clustering algorithms for scRNA-seq data; therefore, there is an implicit assumption that these datasets exhibit significant cluster structure. These datasets also provide a more realistic example of the type of datasets often encountered.
It is expected that multimodality testing will return a significant p-value for all of these datasets.

3.12.3 Negative control

To test how multimodality testing behaves when there is a single cluster, we selected the Gold Standard dataset used in [33]. This dataset is composed of three different cell lines from human lung adenocarcinoma, cultured separately. To generate a set of three negative control datasets, we isolated each of the cell lines separately. See Figure 3.14 for an example of the isolated HCC827 cell line.

The same three isolated cell lines from the gold standard dataset were used as negative controls for the clustering methods, including the CCMT procedure. The ARI score was again used to evaluate how well the clustering methods recovered the single cluster present.

3.13 Run time assessment

All clustering methods, including the CCMT procedure, were run in the R programming language. To compare running times, the Microbenchmark R package was used. All methods were run with 12 threads and 32 GB of RAM on an Intel® Core™ i7-8750H CPU at 2.20 GHz. All timing measurements include preprocessing steps.

3.14 Summary

In summary, we proposed a method for assessing the level of cluster structure by testing the gene expression patterns between cells for multimodality. We also coupled multimodality testing with hierarchical clustering and discriminant analysis to estimate the number of clusters. We developed various simulation studies to assess the reliability of the methods, and presented the real datasets and methods used for benchmarking as well as the metrics used for performance evaluation.

Figure 3.14: PCA plots of the dataset used for negative controls. This example shows the HCC827 cells that were isolated and used as one of the negative controls for both multimodality testing and benchmarking the CCMT procedure. A) shows a PCA plot of the three cell lines.
B) shows a PCA plot of the isolated HCC827 cell line.

Chapter 4

Results

In this chapter, we summarize the results of multimodality testing and the CCMT procedure. In Section 4.1, the results for multimodality testing are presented. First we show the results of multimodality testing on the simulation studies discussed in Section 3.7, on the positive control datasets (Section 3.12.2), and on the negative control dataset (Section 3.12.3). Next, we show the results of the CCMT procedure applied to the simulation studies, as well as to the positive control datasets (Section 3.12.2) and the negative control dataset (Section 3.12.3). We then present comparative results of the CCMT procedure against the clustering methods discussed in Section 3.11. The comparative results include the ARI scores and running time assessments for both the small scale datasets (Section 3.8.1) and the medium to large scale datasets (Section 3.8.2).

To test the robustness and scalability of inferring clusterability through modality testing and of using the CCMT procedure to estimate the number of robust clusters, we generated the set of synthetic datasets discussed in Section 3.7. Each simulation setup was designed to simulate problems inherent to scRNA-seq data, including increasing data and cluster sizes, cluster separability, and data sparsity. For multimodality testing, we expect the Dip test to capture the simulated datasets' cluster structure with high accuracy. This implies that an insignificant p-value should be returned for datasets with no significant cluster structure, whereas a significant p-value should be returned for datasets with significant cluster structure.
For the CCMT procedure, we expect a high ARI score for all the simulation setups, implying that the procedure can recover the simulated partitions with high accuracy.

4.1 Evaluation of multimodality testing

4.1.1 Simulation studies

Tables 4.1, 4.2, and 4.3 and Figure 4.1 show the results of the clusterability analysis on all three simulation setups. The values in the tables and the heatmap are p-values obtained after running the modality testing pipeline.

The results for the first simulation setup, which assesses data scalability, show perfect performance across cluster sizes and numbers of cells. This implies that clusterability testing using the Dip test scales well, and that multimodality testing is not adversely affected by the relative sizes of the clusters present in the dataset. This is an essential feature since clusters are rarely perfectly balanced in scRNA-seq datasets. For the second simulation setup, which assesses cluster separability, all normalization methods perform reasonably well. This also implies that the Dip test is quite sensitive to cluster separability, since it can find significant cluster structure even when clusters show low separability. The results for the third simulation setup, which assesses the effect of data sparsity, are quite good as well.
Multimodality testing can detect cluster structure in the presence of high data sparsity.

Dataset                  Log          NegBinom     Multinom
5K Cells, 4 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 8 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 16 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16

Table 4.1: Table of p-values (p) obtained for multimodality testing done for simulating data scalability on balanced datasets.

Dataset                  Log          NegBinom     Multinom
5K Cells, 4 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 8 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 16 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16

Table 4.2: Table of p-values (p) obtained for multimodality testing done for simulating data scalability on unbalanced datasets.

Figure 4.1: Heatmap of p-values obtained for multimodality testing done for simulating cluster separability. The x-axis is the cluster separability (higher values indicate higher separability between clusters) and the y-axis is the normalisation method used. Note that p-values < 2e−16 were rounded to 0 for visualization purposes.

Dropout Rate   Log          NegBinom     Multinom
0              p < 2e−16    p < 2e−16    p < 2e−16
1              p < 2e−16    p < 2e−16    p < 2e−16
2              p < 2e−16    p < 2e−16    p < 2e−16
3              p < 2e−16    p < 2e−16    p < 2e−16
4              p < 2e−16    p < 2e−16    p < 2e−16
5              1            p < 2e−16    p < 2e−16

Table 4.3: Table of p-values (p) obtained for multimodality testing done when simulating data sparsity.
Higher dropout rate values indicate higher sparsity.

4.1.2 Positive control

Table 4.4 shows the results of multimodality testing applied to the benchmarking datasets used as positive controls. Multimodality testing finds significant evidence of cluster structure in all datasets.

4.1.3 Negative control

Figure 4.2 shows the result of multimodality testing applied to the isolated HCC827 cell line from the negative control dataset. No evidence of cluster structure (Dip p-value > 0.05) is returned by multimodality testing for any of the normalization methods. This shows that multimodality testing can correctly identify when there is no apparent cluster structure present. Similar results were obtained for the remaining two cell lines.

4.2 Factors affecting multimodality testing

4.2.1 The effect of data normalisation

Overall, multimodality testing performed well across all the simulation setups, the positive control datasets, and the negative control dataset. However, taking a more in-depth look at the simulation study results, we can make a few observations. Consider the second simulation setup, which addresses cluster separability; compared to the other two normalization methods, the log normalization method is not as

Dataset       Log          NegBinom     Multinom
BaronHuman    p < 2e−16    p < 2e−16    p < 2e−16
BaronMouse    p < 2e−16    p < 2e−16    p < 2e−16
Chen          p < 2e−16    p < 2e−16    p < 2e−16
Gold          p < 2e−16    p < 2e−16    p < 2e−16
HMS           p < 2e−16    p < 2e−16    p < 2e−16
Klein         p < 2e−16    p < 2e−16    p < 2e−16
Li            p < 2e−16    p < 2e−16    p < 2e−16
Maccosko      p < 2e−16    p < 2e−16    p < 2e−16
Muraro        p < 2e−16    p < 2e−16    p < 2e−16
Romanov       p < 2e−16    p < 2e−16    p < 2e−16
Segerstolpe   p < 2e−16    p < 2e−16    p < 2e−16
Shekar        p < 2e−16    p < 2e−16    p < 2e−16
Silver5       p < 2e−16    p < 2e−16    p < 2e−16
Tasic         p < 2e−16    p < 2e−16    p < 2e−16
Wang          p < 2e−16    p < 2e−16    p < 2e−16
Xin           p < 2e−16    p < 2e−16    p < 2e−16
Zeisel        p < 2e−16    p < 2e−16    p < 2e−16
Zheng         p < 2e−16    p < 2e−16    p < 2e−16

Table 4.4: Table of p-values (p) obtained for multimodality testing done for the benchmarking data.
The table contains both the small and medium tolarge scale sensitive to cluster separability. This can be seen in Figure 4.1, where a sig-nificant p-value is returned after separability value of 0.05. For reference, Figure3.11C is an example of how a dataset looks for separability value 0.03. In Figure3.11B, the clustering structure is quite obvious. As such, both the multinomial andnegative binomial normalization methods are able provide enough evidence for theDip test to find significant cluster structure evidence. However, the Dip test appliedto the log normalized data fails to find evidence for cluster structure. This impliesthat the log normalization typically done may not always be the best way of datanormalization. It may obscure possible cluster structure by failing to find highlyoverlapping clusters.Next, consider the third simulation setup, which assesses the effect data spar-sity; all normalization methods show good performance for low to moderate spar-51Figure 4.2: A) shows a PCA plot of the isolated HCC287 cells. B) shows thedensity of the of the cosine distances between the top k PCS for the lognormalisation method. C) shows the density of the of the cosine dis-tances between the top k PCS for the Negative Binomial (NegBinom)normalisation method. D) shows the density of the cosine distances be-tween the top k PCS for the Multinomial normalisation method. Similarresults not shown here were obtained for the remaining two cell lines.sity levels. See Table 4.3. However, for sparsity, > 90%, the Dip test applied tothe log normalized data fails to find evidence for cluster structure. The multinomialnormalized and negative binomial normalized data have no trouble dealing with in-creasing sparsity levels. As noted in Figure 3.13, as sparsity increases, the separa-bility between the clusters decreases. 
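The distance computation underlying these density plots can be made concrete. As a minimal sketch (in Python rather than the R used for the analyses, with `cells` standing in for each cell's coordinates on the top k PCs), the pairwise cosine distances whose density is tested for multimodality are:

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two vectors: 1 - cos(angle between them)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_cosine_distances(cells):
    """All upper-triangle pairwise distances between cells
    (each cell is represented by its top-k PC coordinates)."""
    out = []
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            out.append(cosine_distance(cells[i], cells[j]))
    return out
```

The Dip test is then applied to the density of these distances; that step relies on an existing implementation (e.g., the R diptest package) and is not re-implemented here. For two well-separated clusters, within-cluster distances concentrate near 0 and between-cluster distances near 1, producing the bimodal densities seen above.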
Since the log normalization method showed poor performance in the low separability cases of the second simulation setup, it is not surprising to observe similar performance at higher data sparsity. Again, this is a cause for concern, since log normalization is most often used for its statistical simplicity. Other normalization methods, such as the negative binomial or the multinomial, may be more appropriate.

4.2.2 The limitations of multimodality testing

The multimodality testing framework performed well across the simulation studies and the real benchmarking data, including the negative control datasets. However, to understand the situations where multimodality testing does not perform well, we constructed a fourth simulation setup, which combines the first two simulation setups from Section 3.7. To do this, the cluster separability was set to 0.3 (see Figure 3.11B), while we varied the size and proportions of cells in each cluster as done in simulation setup 2 (see Figure 3.9). See Figures 4.3 and 4.4 for an illustration and an example of this simulation setup.

Tables 4.5 and 4.6 show the results of multimodality testing applied to this simulation setup. Multimodality testing cannot capture the presence of cluster structure in all of the simulated datasets, whether balanced or unbalanced. With high cluster overlap and an increasing number of clusters, multimodality testing struggles to find evidence of cluster structure. This pattern is evident for both the balanced and the unbalanced datasets. For the balanced datasets (Table 4.5), the log normalization performs the worst while the multinomial normalization performs the best. For the unbalanced datasets (Table 4.6), all three normalization methods perform equally well. These results show that multimodality testing is limited when the cluster sizes are relatively balanced and highly overlapping.
However, for the unbalanced datasets, multimodality testing is not as limited, since it can capture the presence of cluster structure in over half of the datasets.

Dataset                 Log          Multinom     NegBinom
5K Cells, 4 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 8 Clusters    0.10         p < 2e-16    0.06
5K Cells, 16 Clusters   0.10         p < 2e-16    0.99
10K Cells, 4 Clusters   p < 2e-16    0.10         p < 2e-16
10K Cells, 8 Clusters   0.10         0.10         p < 2e-16
10K Cells, 16 Clusters  0.10         0.10         1
15K Cells, 4 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
15K Cells, 8 Clusters   0.10         0.10         1
15K Cells, 16 Clusters  0.10         0.10         1

Table 4.5: p-values (p) obtained from multimodality testing when simulating data scalability with high overlap on balanced datasets.

Figure 4.3: An illustration of simulation setup 4. A total of 18 simulated datasets were generated.

Figure 4.4: An illustration of the balanced and unbalanced datasets simulated. A) A simulated balanced dataset with 4 clusters. B) A simulated unbalanced dataset with 4 clusters. Each dataset was generated to have overlapping clusters.

Dataset                 Log          Multinom     NegBinom
5K Cells, 4 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 8 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 16 Clusters   0.10         p < 2e-16    1
10K Cells, 4 Clusters   p < 2e-16    0.10         p < 2e-16
10K Cells, 8 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
10K Cells, 16 Clusters  0.10         0.10         1
15K Cells, 4 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
15K Cells, 8 Clusters   0.10         0.10         p < 2e-16
15K Cells, 16 Clusters  0.10         0.10         1

Table 4.6: p-values (p) obtained from multimodality testing when simulating data scalability with high overlap on unbalanced datasets.

4.3 Evaluation of the CCMT procedure

4.3.1 Simulation studies

Tables 4.7, 4.8, and 4.9 and Figure 4.5 show the results of the CCMT procedure applied to the three simulation setups. For the first simulation setup, the CCMT procedure does a good job of recovering the simulated ground truth partitions.
Both the ensemble (Figure 3.7) and multiview (Figure 3.8) models recover the simulated partitions well. The CCMT procedure is robust to the relative proportions of cluster sizes and to different numbers of clusters. Results on the second simulation setup show that the CCMT procedure is generally sensitive to cluster overlap. However, the ensemble method appears to be more sensitive and stable compared to the multiview model. The multiview model is not always able to fully recapture the simulated partitions when clusters overlap heavily. For example, in Figure 4.5 the clusters are relatively well separated at a separability value of 0.1 (see Figure 3.11C), yet the multiview model fails to perfectly recover the simulated partition (ARI score = 0.5; see Section 3.10 for a discussion of the ARI score). Results on the final simulation setup (Table 4.9) show that the CCMT procedure is also robust to increasing data sparsity: both the multiview and the ensemble models fully recapture the simulated clusters at increasing levels of data sparsity. Overall, the CCMT procedure performs well across all simulation studies.

4.3.2 Positive control

The CCMT procedure applied to the positive control benchmarking datasets shows good accuracy for most datasets. Table 4.10 and Figure 4.6 show a table and a boxplot of the ARI scores for all the benchmarking datasets. Figure 4.6 shows an average ARI score of 0.70, indicating substantial average agreement between the partitions recovered by the CCMT procedure and the ground truth across all datasets. The CCMT procedure also obtains a maximum ARI score of 0.99 and a minimum of 0.19. Both the ensemble and the multiview models show similar performance across all datasets; however, the ensemble model has more variability, as seen by the long tail in Figure 4.6. Figure 4.7 shows scatter plots of the ARI score as a function of the number of clusters predicted by the CCMT procedure for the benchmarking datasets.
In both plots, the red values indicate the correct number of clusters. In these plots, we judge performance by both the ARI scores and the predicted number of clusters. For a high ARI score, we would expect close agreement between the predicted number of clusters (x-axis) and the true number of clusters (red values). Both the ensemble and the multiview models tend to underestimate the number of clusters, and compared to the ensemble model, the multiview model generally returns a smaller number of clusters. However, both models, on average, return partitions with high overlap with the ground truth partitions, as seen by the high average ARI scores in Figures 4.7A and 4.7B.

4.3.3 Negative control

On the negative control datasets, the ensemble model of the CCMT procedure fails to return a single cluster for all of the isolated cell lines (Table 4.11). The multiview model, in contrast, returns a single cluster for two of the isolated cell lines (HCC827 and H2228). This implies that the ensemble model is more prone to over-partitioning the data than the multiview model; the multiview model is conservative when partitioning a dataset.
This is similar to what is observed for the predicted number of clusters on the positive control datasets.

Dataset                 Multiview ARI Score   Ensemble ARI Score
5K Cells, 4 Clusters    1                     1
5K Cells, 8 Clusters    1                     1
5K Cells, 16 Clusters   0.83                  1
10K Cells, 4 Clusters   1                     1
10K Cells, 8 Clusters   1                     1
10K Cells, 16 Clusters  0.8                   1
15K Cells, 4 Clusters   1                     1
15K Cells, 8 Clusters   0.81                  1
15K Cells, 16 Clusters  0.83                  1

Table 4.7: ARI scores obtained from the CCMT procedure when simulating data scalability on balanced datasets.

Dataset                 Multiview ARI Score   Ensemble ARI Score
5K Cells, 4 Clusters    1                     1
5K Cells, 8 Clusters    0.97                  1
5K Cells, 16 Clusters   0.87                  1
10K Cells, 4 Clusters   1                     0.98
10K Cells, 8 Clusters   0.99                  1
10K Cells, 16 Clusters  0.91                  1
15K Cells, 4 Clusters   1                     1
15K Cells, 8 Clusters   0.97                  1
15K Cells, 16 Clusters  0.9                   1

Table 4.8: ARI scores obtained from the CCMT procedure when simulating data scalability on unbalanced datasets.

4.4 Factors affecting the CCMT procedure

Overall, the CCMT procedure provides a robust and accurate way of finding significant partitions in a dataset. However, a few parameters may affect its performance, including the selection of highly informative genes and the number of principal components used when running the CCMT procedure.

Figure 4.5: Heatmap of the ARI scores obtained from the CCMT procedure applied when simulating cluster separability. The x-axis is the cluster separability generated (higher values indicate higher separability between clusters) and the y-axis is the model used to combine the partitions across the three normalisation methods.

Dropout Rate   Multiview ARI Score   Ensemble ARI Score
0              1                     1
1              1                     1
2              1                     1
3              1                     1
4              1                     1
5              0.98                  0.99

Table 4.9: ARI scores obtained from the CCMT procedure when simulating data sparsity.
Higher dropout rates indicate higher sparsity.

Dataset       Multiview ARI Score   Ensemble ARI Score
BaronHuman    0.89                  0.9
BaronMouse    0.92                  0.92
Chen          0.65                  0.64
Gold          0.99                  0.85
HMS           0.83                  0.84
Klein         0.83                  0.80
Li            0.59                  0.74
Maccosko      0.87                  0.9
Muraro        0.93                  0.92
Romanov       0.67                  0.19
Segerstolpe   0.58                  0.53
Shekar        0.51                  0.89
Silver5       0.55                  0.50
Tasic         0.29                  0.30
Wang          0.41                  0.49
Xin           0.89                  0.61
Zeisel        0.54                  0.75
Zheng         0.97                  0.7

Table 4.10: ARI scores obtained from the CCMT procedure applied to the benchmarking data. The table contains both the small and the medium to large scale datasets.

Cell line   Ensemble (number of clusters)   Multiview (number of clusters)
HCC827      4                               1
H2228       7                               1
H1975       5                               3

Table 4.11: The CCMT procedure applied to the negative control datasets.

Figure 4.6: Boxplots of the average ARI score across all benchmarking datasets (n = 18) for both the ensemble and multiview models.

4.4.1 The effect of the number of informative genes

In this thesis, we set the number of highly informative genes to 500. To test how this selection affects the performance of the CCMT procedure, we varied the number of highly informative genes from 500 to 2000 in increments of 500 for all the benchmarking datasets. Figure 4.8 shows the results of applying the CCMT procedure with a varying number of genes. There appears to be no significant increase in performance for the multiview model when increasing the number of genes; for the ensemble model, however, increasing the number of informative genes does increase overall performance. From Figure 4.8, we see that setting the number of informative genes to 500 achieves the highest average ARI score for the multiview model, while the opposite is observed for the ensemble model. In contrast, using 2000 informative genes results in the lowest overall ARI score for the multiview model.
Again, the opposite is observed for the ensemble model.

4.4.2 The effect of the number of PCs

In this thesis, we set the number of PCs by automatically finding the knee point of the PCA scree plot. To test the effect of the number of PCs selected for clustering, we varied the number of PCs from 5 to 25 in increments of 5. Figure 4.9 shows the results of applying the CCMT procedure with a varying number of PCs. For both the ensemble and multiview models, increasing the number of PCs negatively impacts performance, with the ensemble model more strongly affected than the multiview model. The performance of the CCMT procedure when using a heuristic to select the number of PCs (Figure 4.6) is much better than when the number of PCs is fixed. This is most likely because increasing the number of PCs does not necessarily increase the signal in the data: the extra PCs may introduce significantly more noise, which adversely affects the models' performance. This is most likely why the heuristic method works much better than fixing the number of PCs; the heuristic can identify the number of PCs to use based on each dataset's properties.

Figure 4.7: A) Scatter plot of the predicted number of clusters vs the ARI score for the ensemble model. The x-axis is the predicted number of clusters and the y-axis the ARI score. B) Scatter plot of the predicted number of clusters vs the ARI score for the multiview model. The x-axis is the predicted number of clusters and the y-axis the ARI score. For both scatter plots, the red values are the true number of clusters.

4.4.3 The limitations of the CCMT procedure

To assess the cases where the CCMT procedure significantly fails to recover ground truth partitions, we used the fourth simulation setup outlined in Section 4.2.2.
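Performance throughout these comparisons is reported as the adjusted Rand index (ARI) defined in Section 3.10. As a point of reference, a minimal Python implementation of the standard ARI formula (illustrative only, not the code used for the experiments) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells.
    1 = identical partitions (up to relabeling); ~0 = chance agreement."""
    n = len(truth)
    contingency = Counter(zip(truth, pred))
    sum_pairs = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

Because the index is corrected for chance, merging or splitting clusters relative to the ground truth lowers the score even when many cell pairs remain correctly grouped, which is why it is a stringent measure for the experiments below.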
Both the balanced and unbalanced datasets were passed to the CCMT procedure, and performance was assessed using the ARI score (Section 3.10) as before. Tables 4.12 and 4.13 show the ARI scores of the CCMT procedure applied to the balanced and unbalanced datasets. The results show that the CCMT procedure is limited in recovering the ground truth clusters when there is high cluster overlap and an increasing number of clusters. This trend is more pronounced in the balanced datasets than in the unbalanced datasets. For both the balanced and unbalanced datasets, the ensemble model (Section 3.6.4), on average, does a better job of recovering the ground truth partitions than the multiview model (Section 3.6.4).

Figure 4.8: Box plot of ARI scores for a varying number of genes for both the multiview and the ensemble models. The x-axis is the number of genes and the y-axis is the ARI score.

4.5 Comparing CCMT to other clustering methods

The small scale and the medium to large scale datasets were used to compare the CCMT procedure against other clustering methods often used when analysing scRNA-seq data. We compared the ARI score and the running time across all methods.

Figure 4.9: Box plot of ARI scores for a varying number of PCs for both the multiview and the ensemble models.
The x-axis is the number of PCs and the y-axis is the ARI score.

Dataset                 Ensemble ARI Score   Multiview ARI Score
5K Cells, 4 Clusters    0.97                 0.60
5K Cells, 8 Clusters    0.29                 0.17
5K Cells, 16 Clusters   0                    0
10K Cells, 4 Clusters   0                    0.55
10K Cells, 8 Clusters   0.08                 0.17
10K Cells, 16 Clusters  0                    0
15K Cells, 4 Clusters   0.98                 0.71
15K Cells, 8 Clusters   0.46                 0.18
15K Cells, 16 Clusters  0                    0

Table 4.12: ARI scores obtained from the CCMT procedure when simulating data scalability with high overlap on balanced datasets.

Dataset                 Ensemble ARI Score   Multiview ARI Score
5K Cells, 4 Clusters    0.94                 0.94
5K Cells, 8 Clusters    0.88                 0.40
5K Cells, 16 Clusters   0                    0.09
10K Cells, 4 Clusters   0.86                 0.82
10K Cells, 8 Clusters   0.12                 0.23
10K Cells, 16 Clusters  0.03                 0.08
15K Cells, 4 Clusters   0.76                 0.75
15K Cells, 8 Clusters   0.29                 0.29
15K Cells, 16 Clusters  0.02                 0.07

Table 4.13: ARI scores obtained from the CCMT procedure when simulating data scalability with high overlap on unbalanced datasets.

4.5.1 Small scale datasets

For the small scale datasets, we compared the CCMT procedure against Seurat, SIMLR, and SC3. SIMLR and SC3 were used only on the small scale datasets because these methods have high computational complexity and thus do not scale well to larger datasets. Figure 4.10 shows boxplots of the ARI score and the running time of the methods applied to the small scale datasets. The running time (Figure 4.10A) was measured in nanoseconds, and values are presented on a log scale. Running time varied substantially between the methods: Seurat was the fastest and SIMLR the slowest, with the CCMT models (ensemble and multiview) second and third fastest, respectively, and SC3 fourth. For the ARI score versus running time (Figure 4.10B), both CCMT models have the highest overall average ARI score. Seurat has the second-highest overall average, with SC3 coming third and SIMLR last.
Even though Seurat has the fastest running time, it is not as accurate as the CCMT procedure. The CCMT models may be slower than Seurat, but they are better able to recover ground truth partitions.

4.5.2 Medium to large scale datasets

For the medium to large scale datasets, we compared the CCMT models against Seurat. Seurat was used for these datasets because of its speed and accuracy: it scales quite well to larger datasets, which made comparisons to CCMT easier and more efficient. Figure 4.11 shows boxplots of the ARI score and the running time of the methods applied to the medium to large scale datasets. On these datasets, Seurat is again the fastest method, and both CCMT models again have similar running times. For the ARI score (Figure 4.11B), the CCMT models again have the highest overall average ARI score. Once more, the CCMT models do a better job of recovering the ground truth partitions than currently used methods.

Figure 4.10: A) Box plot of the running times of the methods on the small scale datasets (n = 7). The x-axis is the method used and the y-axis the natural log of the computational time in nanoseconds. B) Box plot of running time vs ARI score on the small scale datasets (n = 7). The x-axis is the method used and the y-axis the ARI score.

4.6 Summary

To summarize, through simulation studies and real data, we showed that multimodality testing is able to infer cluster structure accurately. This method is also robust and sensitive to cluster size, separability, and data sparsity.

Figure 4.11: A) Box plot of the running times of the methods on the medium to large scale datasets (n = 11). The x-axis is the method used and the y-axis the natural log of the computational time in nanoseconds. B) Box plot of running time vs ARI score on the medium to large scale datasets (n = 11). The x-axis is the method used and the y-axis the ARI score.

We also showed that multimodality testing performs well on both the negative and positive control datasets.
Further, we showed that for the CCMT procedure, both the multiview and the ensemble models work well across all the simulation studies, as measured by the ARI score. On real datasets, we compared both the computational time and the ARI score of the CCMT procedure to well-known and often-used methods, and showed that it is faster than some of the current methods and, on average, more accurate. Finally, to see the effects of the number of genes and PCs on the CCMT procedure, we performed an experiment in which the number of genes and PCs were varied. The results showed that increasing the number of genes generally increased the overall performance of the CCMT procedure, whereas increasing the number of PCs decreased it.

Chapter 5: Conclusions

5.1 Summary

In this thesis, we developed a method that assesses the cluster structure inherent to a dataset. We used multimodality testing coupled with three different data normalization and gene selection methods. The cosine distance was used to compute the distribution of gene expression distances between cells, and the Dip test was used to test this distribution for multimodality. The cosine distance was chosen because it is computationally efficient, and because it showed higher average accuracy and sensitivity than other distance metrics, such as Euclidean and Manhattan, on simulation studies and real data. Next, we used extensive simulation studies to show that this method is robust to the challenges inherent to scRNA-seq datasets: high cluster overlap, high sparsity, an increasing number of clusters, and increasing data size. Using real datasets as positive and negative controls, we showed that this method performs as expected both in the presence and in the absence of cluster structure.

The second method developed in this thesis addressed finding the number of clusters and returning the clusters themselves.
To do this, we developed the CCMT procedure, which couples multimodality testing with hierarchical clustering and discriminant analysis. The CCMT procedure assumes that clusters are derived from a unimodal distribution. This assumption makes the procedure flexible enough to accommodate datasets with different distributions, since many known distributions have a unimodal variant. Using extensive simulations, we showed that the CCMT procedure is able to recover simulated clusters accurately. It is also sensitive to cluster overlap, meaning that it can detect clusters even when they are highly overlapping. Using real datasets as positive and negative controls, we demonstrated the CCMT procedure's ability to accurately recover ground truth partitions. We separated the real datasets into two groups and used them to benchmark the CCMT procedure against other methods currently used to cluster scRNA-seq datasets. We showed that for both the small and the medium to large scale datasets, the CCMT procedure is more accurate than the other methods. It is, however, slower than Seurat.

In the last part of the thesis, we attempted to understand the factors affecting multimodality testing and the CCMT procedure. We showed that the log normalization method is less sensitive than the other two when assessing cluster structure. For the CCMT procedure, we showed that increasing the number of highly informative genes increases the ensemble model's overall performance, while the multiview model's performance remains relatively constant. We also showed that holding the number of PCs constant for all datasets negatively affects both the ensemble and multiview models' performance. Lastly, we showed that multimodality testing and the CCMT procedure are limited in situations with an increasing number of clusters and high cluster overlap.
However, this effect is more pronounced when the cluster sizes are relatively balanced compared to when they are unbalanced.

5.2 Discussion

Notably, for multimodality testing, the log normalization method proved to be less sensitive in the simulation studies. There could be a few reasons for this. Firstly, the other two normalization methods may be more similar to the simulation mechanism used in the Splatter package, which uses a gamma-Poisson model to simulate molecular counts. Since both the negative binomial and the multinomial distributions can be approximated by a Poisson distribution, both of these normalization methods would be expected to perform better on simulated data. However, this is not seen on real data, for either the positive or the negative controls. The data generating mechanism for the benchmarking datasets may not be completely Poisson, and there may be enough differences in gene expression patterns between cells in each of the benchmarking datasets for all the normalization methods to detect, resulting in significant evidence of cluster structure.

Multimodality testing is no stranger to the scRNA-seq domain; it has been used multiple times when analyzing scRNA-seq datasets. In [6], the authors used the Dip test to show the continuous nature of the distribution of T-cell activation states. The Dip test was also used in [35] to show that separation between cells could consistently be found when the cells are represented as a time series. However, to the best of our knowledge, this is the first time that multimodality testing has been used to the extent shown in this work.

We investigated the effect of an increased number of genes on the performance of the CCMT procedure. The results showed that increasing the number of genes positively impacted the ensemble model and had no significant impact on the multiview model (see Figure 4.8).
This difference is partly due to how the clustering results are combined in the ensemble and the multiview models. For the ensemble model, a clustering solution is generated independently for each normalization method, and these clustering results are then combined to form an ensemble. The multiview model instead combines the results from all three normalization methods before generating a single clustering solution. By default, we select the top 500 most informative genes, so the multiview model has an upper bound of 1500 on the total number of genes used when clustering, reached when there is no overlap between the most informative genes across the normalization methods. Because the multiview model already draws on a larger gene pool before clustering, we see no significant performance increase from adding genes. In contrast, the ensemble model is limited to 500 genes in each normalization method, so increasing the number of genes improves each of the independent clustering solutions before the ensemble is generated, which in turn improves the ensemble clusters.

We also investigated the effect of an increased number of PCs on the performance of the CCMT procedure. The results showed that increasing the number of PCs had a negative impact on both the ensemble and multiview models (see Figure 4.9). This decrease in performance is largely due to additional PCs decreasing the signal to noise ratio in the dataset, which negatively impacts clustering. We currently select the number of PCs for each dataset by automatically finding the knee point of the PCA scree plot: the PCs are first sorted in decreasing order, and the knee is the point with the largest distance to the line joining the first and last points of the scree plot.
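This knee point heuristic can be sketched in a few lines. The Python sketch below assumes that "largest distance" means the largest perpendicular distance to the line joining the first and last scree points; the actual implementation may differ in detail:

```python
import math

def knee_point(scree):
    """Index of the knee of a decreasing scree plot: the point with the
    largest perpendicular distance to the line joining the first and
    last points of the plot."""
    n = len(scree)
    x1, y1, x2, y2 = 0.0, scree[0], float(n - 1), scree[-1]
    denom = math.hypot(x2 - x1, y2 - y1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(scree):
        # standard point-to-line distance formula
        d = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1) / denom
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# e.g. variance explained by successive PCs, already sorted decreasing
scree = [10.0, 6.0, 3.0, 1.5, 1.4, 1.3, 1.2, 1.1]
k = knee_point(scree)  # PCs 0..k are retained for clustering
```

The appeal of this heuristic is that the retained dimensionality adapts to the shape of each dataset's scree plot rather than being fixed in advance.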
Since each dataset has different characteristics and thus behaves differently, the knee point method is able to take this into account, and the CCMT procedure's results when using the knee point method are better than when the number of PCs is held constant for every dataset.

The CCMT procedure relies heavily on the projection of two classes onto a line that best separates their centers. The Fisher discriminant function used to compute the projections assumes that the classes are linearly separable, meaning that a single straight line can separate both classes. A straight line may not always be capable of separating the classes; if the classes are not linearly separable, the projection may fail, causing the CCMT procedure to fail in turn. We did not explore this topic in this work because there is currently no established way to fully simulate scRNA-seq datasets with classes that are not linearly separable. The results of the CCMT procedure on the benchmarking data justify our assumption that the clusters or cell types present can be separated using the Fisher linear discriminant function.

A critical decision when clustering scRNA-seq data is how many cell types to identify. There is generally no accepted way of choosing an adequate number of clusters or cell types when clustering scRNA-seq datasets; it depends on the resolution at which the user wants to view the dataset. A smaller number results in fewer, more distinct cell types, while a larger number results in less distinct cell types. This work serves to provide some automated guidance on this issue, and we hope it will relieve some of the difficulty associated with deciding how many possible cell types are present in a dataset. The CCMT procedure developed in this work tends to underestimate the correct number of clusters.
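As a concrete illustration of the projection step discussed above, the two-class Fisher direction w proportional to Sw^-1 (m1 - m2) can be computed in closed form for 2-D inputs. The Python sketch below adds a small `ridge` term for numerical stability; it illustrates the idea rather than reproducing the thesis implementation:

```python
def mean(points):
    """Componentwise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def fisher_direction_2d(class_a, class_b, ridge=1e-6):
    """Two-class Fisher discriminant direction w = Sw^-1 (m_a - m_b)
    for 2-D points; `ridge` keeps the scatter matrix invertible."""
    ma, mb = mean(class_a), mean(class_b)
    sxx = syy = sxy = 0.0
    for pts, m in ((class_a, ma), (class_b, mb)):
        for p in pts:  # accumulate the pooled within-class scatter Sw
            dx, dy = p[0] - m[0], p[1] - m[1]
            sxx += dx * dx
            syy += dy * dy
            sxy += dx * dy
    sxx += ridge
    syy += ridge
    det = sxx * syy - sxy * sxy
    dmx, dmy = ma[0] - mb[0], ma[1] - mb[1]
    # w = Sw^-1 (m_a - m_b), using the closed-form 2x2 inverse
    return ((syy * dmx - sxy * dmy) / det,
            (-sxy * dmx + sxx * dmy) / det)

def project(points, w):
    """Project each point onto the discriminant direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]
```

Well-separated classes project onto disjoint intervals of the line, whereas classes that are not linearly separable can end up interleaved after projection, which is exactly the failure mode discussed above.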
However, based on the ARI scores, even though the CCMT procedure underestimates the correct number of clusters, the clusters it returns show high agreement with the ground truth.

5.3 Future Work

Much of this work depends on computing distances between cells, which can become computationally expensive for relatively large datasets. A possible future direction for handling very large datasets would be to use machine learning methods to reduce computational costs. For the CCMT procedure, it would be possible to use a subset of the dataset to train a model and then use it to predict the cluster assignments of the remaining cells. One approach would be to randomly sample the dataset and run the multimodality testing algorithm on the random sample to assess clusterability; if a large enough sample size is chosen, it should capture the general cluster structure present in the dataset.

Another avenue for future work is integrating a non-linear projection method. Currently, the Fisher discriminant coordinates projection method is used for projecting the classes. However, this is a linear projection method and may fail when the classes are not linearly separable. A kernelized version of the Fisher projection method, as discussed in [9, 67, 80], can handle cases where there is no clear linear separation between the classes. Combining both the linear and the kernel versions of this projection method would make the CCMT procedure more well rounded and more flexible in handling different datasets.

Bibliography

[1] M. Ackerman, A. Adolfsson, and N. Brownstein. An effective and efficient approach for clusterability evaluation, 2016. → page 29

[2] A. Adolfsson, M. Ackerman, and N. C. Brownstein. To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognition, 88:13–26, Apr 2019. ISSN 0031-3203. doi:10.1016/j.patcog.2018.10.026. → pages 1, 7, 13, 29

[3] J. Ameijeiras-Alonso, R. Crujeiras, and A. Casal.
Mode testing, criticalbandwidth and excess mass. TEST, 09 2016.doi:10.1007/s11749-018-0611-5. → pages 15, 16, 29[4] S. Anders and W. Huber. Differential expression analysis for sequencecount data. Nature Precedings, 5, 04 2010. doi:10.1038/npre.2010.4282.2.→ page 3[5] T. S. Andrews, V. Yu. Kiselev, and M. Hemberg. Statistical Methods forSingle-Cell RNA-Sequencing, chapter 26, pages 735–20. John Wiley Sons,Ltd, 2019. ISBN 9781119487845. doi:10.1002/9781119487845.ch26.URL →page 1[6] E. Azizi, A. Carr, G. Plitas, A. Cornish, C. Konopacki, S. Prabhakaran,J. Nainys, K. Wu, V. Kiseliovas, M. Setty, K. Choi, R. Fromme, P. Dao,P. McKenney, R. Wasti, K. Kadaveru, L. Mazutis, A. Rudensky, andD. Pe’er. Single-cell map of diverse immune phenotypes in the breasttumor microenvironment. Cell, 174, 06 2018.doi:10.1016/j.cell.2018.05.060. → page 69[7] R. Bacher, L.-F. Chu, N. Leng, A. Gasch, J. Thomson, R. Stewart,M. Newton, and C. Kendziorski. Scnorm: robust normalization of72single-cell rna-seq data. Nature Methods, 14, 04 2017.doi:10.1038/nmeth.4263. → pages 1, 3[8] M. Baron, A. Veres, S. Wolock, A. Faust, R. Gaujoux, A. Vetere, J. Ryu,B. Wagner, S. Shen-Orr, A. Klein, D. Melton, and I. Yanai. A single-celltranscriptomic map of the human and mouse pancreas reveals inter- andintra-cell population structure. Cell systems, 3:346–360, 10 2016.doi:10.1016/j.cels.2016.08.011. → page 42[9] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernelapproach. Neural computation, 12:2385–404, 11 2000.doi:10.1162/089976600300014980. → page 71[10] S. Ben-David, U. Luxburg, and D. Pa´l. A sober look at clustering stability.pages 5–19, 06 2006. doi:10.1007/11776420 4. → page 17[11] V. Bergen, M. Lange, S. Peidli, F. Wolf, and F. Theis. Generalizing rnavelocity to transient cell states through dynamical modeling. 10 2019.doi:10.1101/820936. → page 6[12] J. Bezdek and R. Hathaway. Vat: A tool for visual assessment of (cluster)tendency. volume 3, pages 2225 – 2230, 02 2002. 
ISBN 0-7803-7278-6. doi:10.1109/IJCNN.2002.1007487. → page 11

[13] S. Bickel and T. Scheffer. Multi-view clustering. pages 19–26, 12 2004. ISBN 0-7695-2142-8. doi:10.1109/ICDM.2004.10095. → pages 5, 34

[14] H.-H. Bock. On some significance tests in cluster analysis. Journal of Classification, 2:77–108, 12 1985. doi:10.1007/BF01908065. → page 18

[15] P. Brennecke, S. Anders, J. Kim, A. A. Kolodziejczyk, X. Zhang, V. Proserpio, B. Baying, V. Benes, S. Teichmann, J. Marioni, and M. Heisler. Accounting for technical noise in single-cell rna-seq experiments (vol 10, pg 1093, 2013). Nature Methods, 11:210–210, 02 2014. doi:10.1038/nmeth0214-210b. → page 27

[16] C. Burdziak, E. Azizi, S. Prabhakaran, and D. Pe’er. A nonparametric multi-view model for estimating cell type-specific gene regulatory networks. 02 2019. → page 5

[17] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 2018. ISSN 1087-0156. doi:10.1038/nbt.4096. → pages 4, 6, 25, 27, 43

[18] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3:1–27, 01 1974. doi:10.1080/03610927408827101. → page 17

[19] R. Cannoodt, W. Saelens, and Y. Saeys. Computational methods for trajectory inference from single-cell transcriptomics. European Journal of Immunology, 46, 09 2016. doi:10.1002/eji.201646347. → page 6

[20] G. Chao, S. Sun, and J. Bi. A survey on multi-view clustering. 12 2017. → pages 5, 34

[21] R. Chen, X. Wu, L. Jiang, and Y. Zhang. Single-cell rna-seq reveals hypothalamic cell diversity. Cell Reports, 18:3227–3241, 03 2017. doi:10.1016/j.celrep.2017.03.004. → page 42

[22] M. Delmans and M. Hemberg. Discrete distributional differential expression (d3e) - a tool for gene expression analysis of single-cell rna-seq data. BMC Bioinformatics, 17, 12 2016. doi:10.1186/s12859-016-0944-6. → page 6

[23] P. Diggle.
The Statistical Analysis of Spatial Point Patterns. 01 2003. → page 12

[24] J. Ding, A. Condon, and S. Shah. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications, 9, 12 2018. doi:10.1038/s41467-018-04368-5. → page 12

[25] D. duVerle, S. Yotsukura, S. Nomura, H. Aburatani, and K. Tsuda. Celltree: An r/bioconductor package to infer the hierarchical structure of cell populations from single-cell rna-seq data. BMC Bioinformatics, 17:363, 09 2016. doi:10.1186/s12859-016-1175-6. → page 5

[26] A. Duò, M. Robinson, and C. Soneson. A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research, 7:1141, 09 2018. doi:10.12688/f1000research.15666.2. → page 27

[27] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. volume 96, pages 226–231, 01 1996. → page 4

[28] K. Etemad. Discriminant analysis for recognition of human. 11 2003. → page 31

[29] F. Faridafshin, B. Grechuk, and A. Naess. Calculating exceedance probabilities using a distributionally robust method. Structural Safety, 67:132–141, 07 2017. doi:10.1016/j.strusafe.2017.02.003. → page 14

[30] G. Finak, A. McDavid, M. Yajima, J. Deng, V. Gersuk, A. Shalek, C. Slichter, H. Miller, M. McElrath, M. Prlic, P. Linsley, and R. Gottardo. Mast: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data. Genome Biology, 16, 12 2015. doi:10.1186/s13059-015-0844-5. → page 6

[31] N. Fischer, E. Mammen, and J. Marron. Testing for multimodality. Computational Statistics & Data Analysis, 18:499–512, 12 1994. doi:10.1016/0167-9473(94)90080-9. → page 13

[32] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 01 1936. → page 31

[33] S. Freytag, L. Tian, I. Lönnstedt, M. Ng, and M. Bahlo.
Comparison of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data. F1000Research, 7:1297, 08 2018. doi:10.12688/f1000research.15809.1. → pages 42, 45

[34] T. Geddes, T. Kim, L. Nan, J. Burchfield, J. Yang, D. Tao, and P. Yang. Autoencoder-based cluster ensembles for single-cell rna-seq data analysis. 09 2019. doi:10.1101/773903. → page 4

[35] W. Gong, I.-Y. Kwak, N. Koyano-Nakagawa, W. Pan, and D. Garry. Tcm visualizes trajectories and cell populations from single cell data. Nature Communications, 9, 12 2018. doi:10.1038/s41467-018-05112-9. → page 69

[36] D. Grün, L. Kester, and A. Oudenaarden. Validation of noise models for single-cell transcriptomics. Nature Methods, 11, 04 2014. doi:10.1038/nmeth.2930. → page 1

[37] D. Grün, M. Muraro, J.-C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, J. Es, E. Jansen, H. Clevers, E. de Koning, and A. Oudenaarden. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell, 19, 06 2016. doi:10.1016/j.stem.2016.05.010. → page 3

[38] M. Guo, H. Wang, S. Potter, J. Whitsett, and Y. Xu. Sincera: a pipeline for single-cell rna-seq profiling analysis. PLOS Computational Biology, 11:e1004575, 11 2015. doi:10.1371/journal.pcbi.1004575. → page 21

[39] C. Hafemeister and R. Satija. Normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. Genome Biology, 20, 12 2019. doi:10.1186/s13059-019-1874-1. → pages 25, 26, 27, 28

[40] A. Haque, J. Engel, S. Teichmann, and T. Lönnberg. A practical guide to single-cell rna-sequencing for biomedical research and clinical applications. Genome Medicine, 9, 12 2017. doi:10.1186/s13073-017-0467-4. → page 2

[41] J. Hartigan. Asymptotic distributions for clustering criteria. Annals of Statistics, 6, 01 1978. doi:10.1214/aos/1176344071. → page 18

[42] J. Hartigan and P. Hartigan. The dip test of unimodality. The Annals of Statistics, 13, 03 1985. doi:10.1214/aos/1176346577.
→ pages 15, 24, 29

[43] J. Hartigan and S. Mohanty. The runt test for multimodality. Journal of Classification, 9:63–70, 02 1992. doi:10.1007/BF02618468. → page 15

[44] E. Helgeson and E. Bair. Non-parametric cluster significance testing with reference to a unimodal null distribution. 10 2016. → pages 18, 20, 21

[45] S. Hicks, F. W. Townes, M. Teng, and R. Irizarry. Missing data and technical variability in single-cell rna-sequencing experiments. Biostatistics (Oxford, England), 19, 11 2017. doi:10.1093/biostatistics/kxx053. → page 1

[46] S. Hicks, F. W. Townes, M. Teng, and R. Irizarry. Missing data and technical variability in single-cell rna-sequencing experiments. Biostatistics (Oxford, England), 19, 11 2017. doi:10.1093/biostatistics/kxx053. → page 25

[47] B. Hopkins and J. Skellam. A new method for detecting the type of distribution of plant individuals. Annals of Botany, 18, 04 1954. doi:10.1093/oxfordjournals.aob.a083391. → page 12

[48] H. Huang, Y. Liu, M. Yuan, and J. Marron. Statistical significance of clustering using soft thresholding. Journal of Computational and Graphical Statistics, 24, 05 2013. doi:10.1080/10618600.2014.948179. → page 19

[49] J. Huband, J. Bezdek, and R. Hathaway. Bigvat: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38:1875–1886, 11 2005. doi:10.1016/j.patcog.2005.03.018. → page 11

[50] R. Huh, Y. Yang, Y. Jiang, Y. Shen, and Y. Li. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Research, 48(1):86–95, 11 2019. ISSN 0305-1048. doi:10.1093/nar/gkz959. → page 4

[51] T. Ilicic, J. Kim, A. A. Kolodziejczyk, F. Bagger, D. McCarthy, J. Marioni, and S. Teichmann. Classification of low quality cells from single-cell rna-seq data. Genome Biology, 17, 12 2016. doi:10.1186/s13059-016-0888-1. → page 2

[52] A. Jain and R. Dubes. Algorithms for Clustering Data, volume 32. 01 1988. doi:10.2307/1268876. → pages 12, 13, 18

[53] Z. Ji and H. Ji.
Tscan: Pseudo-time reconstruction and evaluation in single-cell rna-seq analysis. Nucleic Acids Research, 44:gkw430, 05 2016. doi:10.1093/nar/gkw430. → pages 4, 6

[54] L. Jiang, H. Chen, L. Pinello, and G.-C. Yuan. Giniclust: Detecting rare cell types from single-cell gene expression data with gini index. Genome Biology, 17:144, 07 2016. doi:10.1186/s13059-016-1010-4. → page 4

[55] I. Jolliffe and J. Cadima. Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374:20150202, 04 2016. doi:10.1098/rsta.2015.0202. → pages 11, 12

[56] T. Kalisky and S. Quake. Single-cell genomics. Nature Methods, 8:311–4, 04 2011. doi:10.1038/nmeth0411-311. → page 2

[57] P. Kharchenko, L. Silberstein, and D. Scadden. Bayesian approach to single-cell differential expression analysis. Nature Methods, 11, 05 2014. doi:10.1038/nmeth.2967. → page 6

[58] P. Kimes, Y. Liu, D. Hayes, and J. Marron. Statistical significance for hierarchical clustering. Biometrics, 73, 11 2014. doi:10.1111/biom.12647. → pages 18, 19, 21

[59] V. Kiselev, K. Kirschner, M. Schaub, T. Andrews, A. Yiu, T. Chandra, K. Natarajan, W. Reik, M. Barahona, A. Green, and M. Hemberg. Sc3: consensus clustering of single-cell rna-seq data. 05 2017. doi:10.17863/CAM.9872. → pages 3, 5, 21, 44

[60] V. Kiselev, T. Andrews, and M. Hemberg. Publisher correction: Challenges in unsupervised clustering of single-cell rna-seq data. Nature Reviews Genetics, 20:1, 01 2019. doi:10.1038/s41576-019-0095-5. → page 2

[61] A. Klein, L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. Weitz, and M. Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161:1187–1201, 05 2015. doi:10.1016/j.cell.2015.04.044. → page 42

[62] M. Krzak, Y. Raykov, A. Boukouvalas, L. Cutillo, and C. Angelini. Benchmark and parameter sensitivity analysis of single-cell rna sequencing clustering methods.
Frontiers in Genetics, 10:1253, 12 2019. doi:10.3389/fgene.2019.01253. → page 35

[63] I. labs. Whole CD45+ splenocytes from B6 mice, 2019 (accessed 2020). URL cell/study/SCP306/whole-cd45-splenocytes-from-b6-mice-10x-hms#study-summarys. → page 42

[64] R. Lawson and P. Jurs. New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30:36–41, 02 1990. doi:10.1021/ci00065a010. → page 12

[65] I. Lengyel and P. Derish. Ripley, B. D. 1981. Spatial Statistics. John Wiley & Sons, New York. 09 2002. → page 12

[66] H. Li, E. Courtois, D. Sengupta, Y. Tan, K. Chen, J. Goh, S. Kong, C. Chua, L. Hon, W. S. Tan, M. Wong, P. Choi, L. Wee, A. Hillmer, I. Tan, P. Robson, and S. Prabhakar. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nature Genetics, 49, 03 2017. doi:10.1038/ng.3818. → page 42

[67] Y. Li, S. Gong, and H. Liddell. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21:1077–1086, 01 2004. doi:10.1016/j.imavis.2003.08.010. → page 71

[68] P. Lin, M. Troup, and J. Ho. Cidr: Ultrafast and accurate clustering through imputation for single-cell rna-seq data. Genome Biology, 18, 12 2017. doi:10.1186/s13059-017-1188-0. → page 5

[69] Y. Liu, D. Hayes, A. Nobel, and J. Marron. Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association, 103:1281–1293, 09 2008. doi:10.1198/016214508000000454. → page 19

[70] S. Lovie. Exploratory Data Analysis, volume 27. 03 2008. ISBN 9780470061572. doi:10.1002/9780470061572.eqr222. → page 11

[71] A. Lun, K. Bach, and J. Marioni. Pooling across cells to normalize single-cell rna sequencing data with many zero counts. Genome Biology, 17, 12 2016. doi:10.1186/s13059-016-0947-7. → pages 3, 25

[72] U. Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2, 07 2010. doi:10.1561/2200000008.
→ page 18

[73] U. Luxburg and S. Ben-David. Towards a statistical theory of clustering. PASCAL Workshop on Statistics and Optimization of Clustering, 01 2005. → page 7

[74] E. Marco, R. Karp, G. Guo, P. Robson, A. Hart, L. Trippa, and G.-C. Yuan. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proceedings of the National Academy of Sciences of the United States of America, 111, 12 2014. doi:10.1073/pnas.1408993111. → pages 3, 6

[75] C. Mayer, C. Hafemeister, R. Bandler, R. Machold, R. Batista-Brito, X. Jaglin, K. Allaway, A. Butler, G. Fishell, and R. Satija. Developmental diversification of cortical inhibitory interneurons. Nature, 555, 03 2018. doi:10.1038/nature25999. → page 27

[76] D. McCarthy, K. Campbell, A. Lun, and Q. Wills. Scater: Pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics (Oxford, England), 33, 01 2017. doi:10.1093/bioinformatics/btw777. → pages 2, 3, 6

[77] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018. → pages 11, 12

[78] N. Meinshausen, L. Meier, and P. Bühlmann. P-values for high-dimensional regression, 2008. → page 20

[79] Z. Miao and X. Zhang. Desingle: A new method for single-cell differentially expressed genes detection and classification. bioRxiv, 2017. doi:10.1101/173997. → page 6

[80] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. volume 9, pages 41–48, 09 1999. ISBN 0-7803-5673-x. doi:10.1109/NNSP.1999.788121. → page 71

[81] M. Mokari, H. Mohammadzade, and B. Ghojogh. Recognizing involuntary actions from 3d skeleton data using body states. Scientia Iranica, 27:1424–1436, 06 2020. doi:10.24200/sci.2018.20446. → page 31

[82] A. Mortazavi, B. Williams, K. Mccue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods, 5:621–8, 08 2008. doi:10.1038/nmeth.1226. → page 3

[83] M.
Muraro, G. Dharmadhikari, D. Grün, N. Groen, T. Dielen, E. Jansen, L. Gurp, M. Engelse, F. Carlotti, E. de Koning, and A. Oudenaarden. A single-cell transcriptome atlas of the human pancreas. Cell Systems, 3, 09 2016. doi:10.1016/j.cels.2016.09.002. → page 42

[84] F. Perraudeau, D. Risso, K. Street, E. Purdom, and S. Dudoit. Bioconductor workflow for single-cell rna sequencing: Normalization, dimensionality reduction, clustering, and lineage inference. F1000Research, 6:1158, 07 2017. doi:10.12688/f1000research.12122.1. → page 2

[85] R. Petegrosso, Z. Li, and R. Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in Bioinformatics, 06 2019. doi:10.1093/bib/bbz063. → page 6

[86] E. Pierson and C. Yau. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16, 12 2015. doi:10.1186/s13059-015-0805-z. → pages 12, 26

[87] S. Prabhakaran, E. Azizi, A. Carr, and D. Pe’er. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. Proc. 33rd Int. Conf. Mach. Learn., ICML 2016, 48:1070–1079, 01 2016. → page 4

[88] X. Ren, L. Zheng, and Z. Zhang. Sscc: A novel computational framework for rapid and accurate clustering large-scale single cell rna-seq data. Genomics, Proteomics & Bioinformatics, 17, 06 2019. doi:10.1016/j.gpb.2018.10.003. → page 4

[89] D. Risso, F. Perraudeau, S. Gribkova, S. Dudoit, and J.-P. Vert. A general and flexible method for signal extraction from single-cell rna-seq data. Nature Communications, 9, 12 2018. doi:10.1038/s41467-017-02554-5. → page 12

[90] M. Rodriguez, C. Comin, D. Casanova, O. Bruno, D. Amancio, F. Rodrigues, and L. da F. Costa. Clustering algorithms: A comparative approach. PLOS ONE, 14, 12 2016. doi:10.1371/journal.pone.0210236. → page 5

[91] R. Romanov, A. Zeisel, J. Bakker, F. Girach, A. Hellysaz, R. Tomer, A. Alpar, J. Mulder, F. Clotman, E. Keimpema, B. Hsueh, A. Crow, H. Martens, C. Schwindling, D. Calvigioni, J.
Bains, Z. Máté, G. Szabo, Y. Yanagawa, and T. Harkany. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nat Neurosci, 20:176–188, 02 2017. doi:10.1038/nn.4462. → page 42

[92] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 11 1987. doi:10.1016/0377-0427(87)90125-7. → page 17

[93] G. Rozal and J. Hartigan. The map test for multimodality. Journal of Classification, 11:5–36, 02 1994. → page 15

[94] A.-A. Samadani, E. Kubica, R. Gorbet, and D. Kulic. Perception and generation of affective hand movements. International Journal of Social Robotics, 5, 01 2012. doi:10.1007/s12369-012-0169-4. → page 31

[95] W. Sarle. Cubic Clustering Criterion. SAS technical report. SAS Institute, 1983. → page 18

[96] Å. Segerstolpe, A. Palasantza, P. Eliasson, E.-M. Andersson, A.-C. Andréasson, X. Sun, S. Picelli, A. Sabirsh, M. Clausen, M. Bjursell, D. Smith, M. Kasper, C. Ammala, and R. Sandberg. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabolism, 24:1–15, 10 2016. doi:10.1016/j.cmet.2016.08.020. → page 42

[97] K. Shekhar, S. Lapan, I. Whitney, N. Tran, E. Macosko, M. Kowalczyk, X. Adiconis, J. Levin, J. Nemesh, M. Goldman, S. Mccarroll, C. Cepko, A. Regev, and J. Sanes. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell, 166:1308–1323.e30, 08 2016. doi:10.1016/j.cell.2016.07.054. → page 42

[98] Q. Shi, C. Zhang, M. Peng, X. Yu, T. Zeng, J. Liu, and L. Chen. Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics (Oxford, England), 33, 05 2017. doi:10.1093/bioinformatics/btx176. → page 5

[99] S. Sieranoja. Fast and general density peaks clustering. Pattern Recognition Letters, 128, 10 2019. doi:10.1016/j.patrec.2019.10.019. → page 4

[100] B. Silverman.
Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society, Series B, 43:97–99, 09 1981. doi:10.1111/j.2517-6161.1981.tb01155.x. → pages 15, 16, 20, 29

[101] T. Smith, A. Heger, and I. Sudbery. Umi-tools: Modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Research, 27:gr.209601.116, 01 2017. doi:10.1101/gr.209601.116. → page 2

[102] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 01 2002. doi:10.1162/153244303321897735. → page 18

[103] V. Svensson, R. Vento-Tormo, and S. Teichmann. Exponential scaling of single-cell rna-seq in the past decade. Nature Protocols, 13:599–604, 03 2018. doi:10.1038/nprot.2017.149. → page 2

[104] F. Tang, C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B. Tuch, A. Siddiqui, K. Lao, and M. Surani. mrna-seq whole-transcriptome analysis of a single cell. Nature Methods, 6:377–82, 05 2009. doi:10.1038/nmeth.1315. → page 2

[105] B. Tasic, V. Menon, T. N. Nguyen, S. Kim, T. Jarsky, Z. Yao, B. Levi, L. Gray, S. Sorensen, T. Dolbeare, D. Bertagnolli, J. Goldy, N. Shapovalova, S. Parry, C. Lee, K. Smith, A. Bernard, L. Madisen, S. Sunkin, and H. Zeng. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience, 19, 01 2016. doi:10.1038/nn.4216. → page 42

[106] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63:411–423, 02 2001. doi:10.1111/1467-9868.00293. → page 17

[107] F. W. Townes, S. Hicks, M. Aryee, and R. Irizarry. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome Biology, 20, 12 2019. doi:10.1186/s13059-019-1861-6. → pages 25, 26, 28

[108] C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. Lennon, K. Livak, T. Mikkelsen, and J.
Rinn. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nature Biotechnology, 32, 03 2014. doi:10.1038/nbt.2859. → pages 4, 6

[109] N. Trendafilov. Stepwise estimation of common principal components. Comput. Stat. Data Anal., 54, 12 2010. doi:10.1016/j.csda.2010.03.010. → page 34

[110] P.-Y. Tung, J. Blischak, J. Hsiao, D. Knowles, J. Burnett, J. Pritchard, and Y. Gilad. Batch effects and the effective design of single-cell gene expression studies. Scientific Reports, 7, 01 2017. doi:10.1038/srep39921. → page 1

[111] K. Van den Berge, F. Perraudeau, C. Soneson, M. Love, D. Risso, J.-P. Vert, M. Robinson, S. Dudoit, and L. Clement. Observation weights unlock bulk rna-seq tools for zero inflation and single-cell applications. Genome Biology, 19, 12 2018. doi:10.1186/s13059-018-1406-4. → page 6

[112] L. van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 11 2008. → pages 11, 12

[113] B. Wang, D. Ramazzotti, L. De Sano, J. Zhu, E. Pierson, and S. Batzoglou. Simlr: A tool for large-scale genomic analyses by multi-kernel learning. PROTEOMICS, 18, 03 2017. doi:10.1002/pmic.201700232. → pages 4, 21, 22, 44

[114] J. Wang, J. Schug, K. J. Won, C. Liu, A. Naji, D. Avrahami, M. Golson, and K. Kaestner. Single-cell transcriptomics of the human endocrine pancreas. Diabetes, 65:db160405, 06 2016. doi:10.2337/db16-0405. → page 42

[115] L. Wang, U. Nguyen, J. Bezdek, C. Leckie, and K. Ramamohanarao. ivat and avat: Enhanced visual analysis for cluster tendency assessment. volume 6118, pages 16–27, 06 2010. doi:10.1007/978-3-642-13657-3_5. → page 11

[116] F. Wolf, P. Angerer, and F. Theis. Scanpy: Large-scale single-cell gene expression data analysis. Genome Biology, 19, 12 2018. doi:10.1186/s13059-017-1382-0. → page 4

[117] Z. Wu and H. Wu. Accounting for cell type hierarchy in evaluating single cell rna-seq clustering. Genome Biology, 21, 12 2020. doi:10.1186/s13059-020-02027-x. → page 5

[118] Y. Xin, J.
Kim, H. Okamoto, M. Ni, Y. Wei, C. Adler, A. Murphy, G. Yancopoulos, C. Lin, and J. Gromada. Rna sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metabolism, 24, 09 2016. doi:10.1016/j.cmet.2016.08.018. → page 42

[119] D. Xu and Y. Tian. A comprehensive survey of clustering algorithms. Annals of Data Science, 2, 08 2015. doi:10.1007/s40745-015-0040-1. → page 16

[120] S. Xu, X. Qiao, L. Zhu, Y. Zhang, C. Xue, and L. Li. Reviews on determining the number of clusters. Applied Mathematics & Information Sciences, 10:1493–1512, 07 2016. doi:10.18576/amis/100428. → page 17

[121] L. Yang, J. Liu, Q. Lu, A. Riggs, and X. Wu. Saic: An iterative clustering approach for analysis of single cell rna-seq data. BMC Genomics, 18, 10 2017. doi:10.1186/s12864-017-4019-5. → page 3

[122] F. Ye, Z. Chen, H. Qian, R. Li, C. Chen, and Z. Zheng. New Approaches in Multi-View Clustering. 08 2018. ISBN 978-1-78923-526-5. doi:10.5772/intechopen.75598. → pages 5, 34

[123] L. Zappia, B. Phipson, and A. Oshlack. Splatter: Simulation of single-cell rna sequencing data. Genome Biology, 18, 12 2017. doi:10.1186/s13059-017-1305-0. → page 34

[124] A. Zeisel, A. Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling Leffler, and S. Linnarsson. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science, 347, 03 2015. doi:10.1126/science.aaa1934. → page 42

[125] J. Zhang, J. Fan, H. Fan, D. Rosenfeld, and D. Tse. An interpretable framework for clustering single-cell rna-seq datasets. BMC Bioinformatics, 19, 12 2018. doi:10.1186/s12859-018-2092-7. → page 5

[126] G. Zheng, J. Terry, P. Belgrader, P. Ryvkin, Z. Bent, R. Wilson, S. Ziraldo, T. Wheeler, G. McDermott, J. Zhu, M. Gregory, J. Shuga, L. Montesclaros, J. Underwood, D. Masquelier, S. Nishimura, M. Schnall-Levin, P. Wyatt, C. Hindson, and J. Bielas.
Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 01 2017. doi:10.1038/ncomms14049. → page 42

[127] X. Zhu, J. Zhang, Y. Xu, J. Wang, X. Peng, and H.-D. Li. Single-cell clustering based on shared nearest neighbor and graph partitioning. Interdisciplinary Sciences: Computational Life Sciences, 12, 02 2020. doi:10.1007/s12539-019-00357-4. → pages 4, 21

[128] J. Žurauskienė and C. Yau. pcareduce: Hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17, 03 2016. doi:10.1186/s12859-016-0984-y. → page 3

