Open Collections

UBC Theses and Dissertations

Robust methods for inferring cluster structure in single cell RNA sequencing data Willie, Elijah 2020


Full Text

Robust methods for inferring cluster structure in Single Cell RNA Sequencing data

by

Elijah Willie

B.Sc., Simon Fraser University, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)

The University of British Columbia (Vancouver)

October 2020

© Elijah Willie, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Robust methods for inferring cluster structure in Single Cell RNA Sequencing data

submitted by Elijah Willie in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Sara Mostafavi, Statistics and Medical Genetics (Supervisor)
Ryan Brinkman, Medicine and Medical Genetics (Supervisory Committee Member)
Francis Lynn, Medicine and Biomedical Engineering (Supervisory Committee Member)

Abstract

Single Cell RNA sequencing (SCRNA-SEQ) enables researchers to gain insights into complex biological systems not possible with previous technologies. Unsupervised machine learning, and in particular clustering algorithms, are critical to the analysis of scRNA-seq datasets, enabling investigators to systematically define cell types based on similarities in global gene expression profiles. However, in robustly applying a clustering algorithm to identify and define cell types, two critical open questions remain: i) to what extent do natural clusters exist in a given dataset? ii) what is the number of clusters best supported by the data? More specifically, most clustering algorithms will attempt to identify a fixed number of clusters without considering whether a given dataset is clusterable (i.e., whether natural clusters exist). However, understanding when the application of clustering algorithms is appropriate is crucial in making inferences from scRNA-seq datasets.
Further, all clustering algorithms require the user to explicitly or implicitly specify the number of clusters to search for. In this thesis, we first assess the robustness of multimodality testing methods for determining whether a given set of points (or a dataset) is clusterable. Next, we utilize this framework to develop an algorithm, which we refer to as CCMT, for inferring the number of robust clusters in a given dataset. Results on simulation studies show that multimodality testing as a means for inferring cluster structure is robust and scales favorably for large datasets. This method can detect cluster structure with high statistical power in situations where there is high overlap between the clusters. We also apply our approach to real scRNA-seq datasets and show that it can accurately determine the cluster structure in both positive and negative control experiments. In the second part of this work, we apply CCMT in simulation studies and show that coupling multimodality testing with the nested structure of hierarchical clustering and discriminant analysis provides a robust approach for determining the number of clusters in a given dataset. Results on real data also show that CCMT can recover ground truth partitions with reasonable accuracy, and it is much faster than the competing methods that have a similar accuracy range.

Lay Summary

The human body contains millions of cells, each able to perform a wide variety of tasks. It is now possible to view each of these cells individually at the molecular level. Doing so has enabled scientists to systematically define cell types based on their molecular (gene expression) properties. To identify existing cell types and potentially define new ones, scientists often use clustering algorithms. Clustering is a technique used to group similar objects based on their properties; in this context, our objects are cells, and their properties are the expression levels of genes.
In this thesis, we address two questions that are critical to the success of clustering algorithms: i) to what extent "natural" clusters exist in a given dataset, and ii) how many clusters can be found in a given dataset.

Preface

This dissertation is an original intellectual product of the author, Elijah Willie. The breakdown of contributions reported in the present manuscript is as follows: the implementation of all statistical methodologies described in Chapter 3 was performed by Elijah Willie. The discussion and exploration of results presented in Chapters 4 and 5 were performed by Elijah Willie. This manuscript was written by Elijah Willie under the supervision of Dr. Sara Mostafavi.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Single cell RNA sequencing
    1.1.1 Analysis methods
    1.1.2 Thesis motivation and contribution
    1.1.3 Detailed outline of thesis

2 Related Works
  2.1 The clusterability problem
    2.1.1 Visual assessment of clusterability
    2.1.2 Spatial randomness
    2.1.3 Multimodality testing
  2.2 Estimating the number of clusters C
    2.2.1 Conventional methods
    2.2.2 Statistical significance methods
    2.2.3 The Gaussian model for statistical significance
    2.2.4 The Unimodal model for statistical significance
    2.2.5 ScRNA-seq specific methods for estimating C

3 Methods
  3.1 Testing expression patterns for multimodality
  3.2 Counts modelling and normalization
    3.2.1 Log transformation of normalized counts
    3.2.2 Multinomial normalization
    3.2.3 Regularized negative binomial normalization
  3.3 Feature selection
    3.3.1 Feature selection for log transformation of normalized counts
    3.3.2 Feature selection for regularized negative binomial normalization
  3.4 Dimensionality reduction
  3.5 Multimodality testing
  3.6 Computing clusters through multimodality testing
    3.6.1 Generating a hierarchical partition
    3.6.2 Discriminant coordinates
    3.6.3 The CCMT procedure
    3.6.4 Combining significant partitions
  3.7 Simulation studies
    3.7.1 Simulating data scalability, differing cluster number and sizes
    3.7.2 Simulating cluster separability
    3.7.3 Simulating data sparsity
  3.8 Benchmarking data
    3.8.1 Small scale datasets
    3.8.2 Medium to large scale datasets
  3.9 Obtaining ground truth clusters
  3.10 Evaluation metrics
  3.11 Clustering methods
  3.12 Performance assessment
    3.12.1 Simulation studies
    3.12.2 Positive control
    3.12.3 Negative control
  3.13 Run time assessment
  3.14 Summary

4 Results
  4.1 Evaluation of multimodality testing
    4.1.1 Simulation studies
    4.1.2 Positive control
    4.1.3 Negative control
  4.2 Factors affecting multimodality testing
    4.2.1 The effect of data normalisation
    4.2.2 The limitations of multimodality testing
  4.3 Evaluation of the CCMT procedure
    4.3.1 Simulation studies
    4.3.2 Positive control
    4.3.3 Negative control
  4.4 Factors affecting the CCMT procedure
    4.4.1 The effect of the number of informative genes
    4.4.2 The effect of the number of PCS
    4.4.3 The limitations of the CCMT procedure
  4.5 Comparing CCMT to other clustering methods
    4.5.1 Small scale datasets
    4.5.2 Medium to large scale datasets
  4.6 Summary

5 Conclusions
  5.1 Summary
  5.2 Discussion
  5.3 Future Work

Bibliography

List of Tables

Table 3.1 Small scale datasets
Table 3.2 Medium to large scale datasets
Table 4.1 P-value table for simulating data scalability on balanced datasets
Table 4.2 P-value table for simulating data scalability on unbalanced datasets
Table 4.3 P-value table for simulating data sparsity
Table 4.4 P-value table for positive control datasets
Table 4.5 P-value table for simulating data scalability with high overlap on balanced datasets
Table 4.6 P-value table for simulating data scalability with high overlap on unbalanced datasets
Table 4.7 ARI table for simulating data scalability on balanced datasets
Table 4.8 ARI table for simulating data scalability on unbalanced datasets
Table 4.9 ARI table for simulating data sparsity
Table 4.10 ARI table for positive control datasets
Table 4.11 CCMT results on negative control datasets
Table 4.12 ARI table for simulating data scalability with high overlap on balanced datasets
Table 4.13 ARI table for simulating data scalability with high overlap on unbalanced datasets

List of Figures

Figure 1.1 K-means example
Figure 2.1 Unimodal example
Figure 3.1 Multimodality pipeline
Figure 3.2 Multimodality example
Figure 3.3 Fisher discriminant coordinates illustration
Figure 3.4 Fisher discriminant coordinates example
Figure 3.5 CCMT tree example
Figure 3.6 Methods pipeline
Figure 3.7 Ensemble method pipeline
Figure 3.8 Multiview method pipeline
Figure 3.9 Simulating data scalability pipeline
Figure 3.10 Simulating cluster separability pipeline
Figure 3.11 Simulating cluster separability example
Figure 3.12 Simulating data sparsity pipeline
Figure 3.13 Simulating data sparsity example
Figure 3.14 Negative control dataset
Figure 4.1 P-value heatmap for simulating cluster separability
Figure 4.2 Density plots for negative control dataset
Figure 4.3 Simulating data scalability with high overlap pipeline
Figure 4.4 Simulating data scalability with high overlap example
Figure 4.5 ARI heatmap for simulating cluster separability
Figure 4.6 ARI boxplots for positive control datasets
Figure 4.7 CCMT predicted number of clusters
Figure 4.8 ARI boxplots of varying number of genes
Figure 4.9 ARI boxplots of varying number of PCS
Figure 4.10 Boxplots of methods performance on small scale datasets
Figure 4.11 Boxplots of methods performance on medium to large scale datasets

Glossary

ARI: Adjusted Rand Index
CCMT: Computing clusters through multimodality testing
CDF: Cumulative distribution function
CSR: Complete spatial randomness
DWH: Soft least squares Euclidean consensus partitions described in Dimitriadou, Weingessel and Hornik
FWER: Family-wise error rate
PBMCS: Peripheral Blood Mononuclear Cells
PCA: Principal component analysis
PCS: Principal components
PDF: Probability density function
SCRNA-SEQ: Single cell RNA sequencing
T-SNE: t-distributed Stochastic Neighbor Embedding
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Acknowledgments

Although the process of obtaining a Masters degree may seem like a solitary endeavor, mine certainly would not have been possible without the support of a plethora of individuals, including mentors, colleagues, friends, and family members. With all my heart, I would like to sincerely thank the following individuals:

• My supervisor Dr. Sara Mostafavi, for her continual support, guidance, and wisdom throughout my studies.

• Dr. Bernard Ng, for his willingness to always help and provide constructive feedback whenever necessary.

• All of my lab-mates, who made graduate school a delightful experience. In particular, I would like to thank Gherman Novakovsky, for his willingness to always talk to me about doubts I had about my project.

• My beloved friends Divya Bafna and Figal Taho, for being part of this journey with me and always being available to share a laugh even when it was hard to do so. You guys made graduate school an unforgettable experience, both in the lab and outside the lab.

• My family, for always believing in me, showing me unconditional love, nurturing, and encouragement, and encouraging me to push myself beyond my limits. They are the rock upon which all my strength is built. All this would not be possible without them.

• Dr. Leonid Chindelevitch, who cultivated my interest in research during my undergraduate studies, committed to my research development, and made so many opportunities possible for me.

• My Lord and Saviour, Jesus Christ, for his provision, unconditional love, and saving grace, and through whom all things are possible.

Chapter 1

Introduction

The advent of single-cell RNA sequencing (SCRNA-SEQ) has enabled the investigation of the cellular composition of complex tissues with unprecedented resolution. Deciphering cell type composition is one of the primary applications of scRNA-seq, typically done using unsupervised clustering, a technique borrowed from the machine learning literature. The nature of scRNA-seq datasets has provided new challenges for unsupervised clustering methods. These challenges include high variability and noise levels due to technical limitations and biological factors [5, 7, 36, 45, 110].

A fundamental challenge when clustering scRNA-seq datasets is whether the data contains inherent clusters to warrant clustering in the first place [2]. This issue is essential because a dataset should be clustered only when there exists an inherent cluster structure. The idea of clusterability, which seeks evidence for structure inherent to a dataset, should be a pivotal step in helping the user decide if clustering is appropriate for their dataset. Clustering should only be applied if a dataset contains inherent structure; otherwise, the results would be misleading. As an example, consider a set of n = 1000 points generated from a unimodal distribution.
There is no inherent cluster structure, and thus all clustering algorithms should return a single cluster. Any number (k > 1) of clusters returned would be purely nonsensical. See Figure 1.1.

Determining a suitable number of clusters is another concern when clustering scRNA-seq datasets. For most clustering algorithms, the number of clusters needs to be known a priori and is often unknown. Domain knowledge is often used in the biological setting to determine a suitable number of clusters. For example, when dealing with scRNA-seq gene expression data, the data is clustered with multiple cluster numbers. The user then uses domain knowledge, such as the expression of known marker genes from the literature, to determine a suitable number of clusters [60].

In this thesis, we first assess the robustness of multimodality testing as a proxy for assessing clusterability in scRNA-seq datasets. Next, we utilize the multimodality framework to design an algorithm for computing the number of clusters in scRNA-seq datasets. We show that this approach is robust to challenges inherent to scRNA-seq datasets through an extensive simulation study. Finally, extensive comparisons using benchmarking datasets show that this approach compares favorably to methods currently used for analyzing scRNA-seq datasets.

1.1 Single cell RNA sequencing

In the mid-to-late 2000s, RNA sequencing (RNA-seq) emerged as a novel approach that would eventually supersede the already successful gene expression microarrays. New protocols developed for this technology typically required bulk sampling to profile thousands to millions of cells. In 2009, [104] provided a novel protocol referred to as single-cell RNA-seq (scRNA-seq), which profiles individual cells. ScRNA-seq treats a cell as an individual entity and allows for comparing cells within a specific population. For a general review, refer to [40, 56].
A sufficient amount of research is currently devoted to developing novel protocols and technologies to increase profiling accuracy. It is now possible to profile thousands of cells in a single experiment [103]. This increase in the number of cells profiled is partially due to reduced sequencing costs, improved cell dissociation protocols, and improved library preparation.

1.1.1 Analysis methods

Most scRNA-seq methods are designed for unsupervised clustering, pseudo-time ordering, and network inference to gain biological insights. After preprocessing [84, 101] and quality control [51, 76] of the output from a sequencing machine, typical analysis steps include:

1. Data normalisation.

Data normalization techniques have become vital in dealing with fluctuations in the reads obtained per cell for sequencing technologies with high throughput. Methods developed for bulk RNA-seq can be used on scRNA-seq data as well. These methods include counts per million (CPM) normalization [82], where each value in the count matrix is divided by the total count in its cell and multiplied by a million. When applied to scRNA-seq data, this is referred to as transcripts per million (TPM). Other methods include the use of size factors [4, 7] and quantile normalization [71, 76].

2. Unsupervised clustering of cells.

Finding clusters of similar cells, or characterization of cell-type composition, is one of the essential applications of scRNA-seq. Characterization of cell-type composition is done using unsupervised clustering. Below are some of the most prominent clustering paradigms currently used in the scRNA-seq literature:

(a) Partitioning models

Partitioning-based clustering algorithms are some of the most widely used clustering methodologies. These methods try to partition a given dataset into K partitions such that objects in the same partition are more similar to each other than to objects in other partitions. These methods include k-means and k-medoids, which many scRNA-seq clustering algorithms are based on.
Such methods include SC3 [59], SCUBA [74], SAIC [121], pcaReduce [128], and RaceID2 [37].

(b) Mixture models

If there exists significant prior knowledge about the data generation process, specifically the distributions generating the data, then mixture models provide an intuitive way to compute clusters, where each distribution represents a cluster. Clustering involves assigning objects to the most likely distribution that generated the data using expectation maximization. ScRNA-seq methods based on this paradigm include BISCUIT [87], Seurat [17], and TSCAN [53].

(c) Graph models

Graph-based models rely on building a graph representing objects as nodes and then finding densely connected regions as clusters. Generating these graphs typically involves computing a similarity score between the objects and using these scores as edge weights between objects. Finding densely connected regions is then reduced to finding regions of high similarity in the graph. Both spectral and clique detection methods are used for finding these dense regions. These models are attractive because they make no distributional assumptions about the data. ScRNA-seq methods based on this paradigm include Seurat [17], SIMLR [113], SNN-Cliq [127], and SCANPY [116].

(d) Density models

Like the intuition from graph-based clustering, density-based algorithms seek to find highly dense regions of objects without representing the objects in a graph. The most notable algorithms using this paradigm include DBSCAN [27] and Density Peak [99] clustering. These models are attractive in the scRNA-seq domain due to their potential for finding rare cell populations. Methods include Monocle [108], GiniClust [54], and sscClust [88].

(e) Ensemble models

Borrowing from the machine learning literature, where weaker classifiers are combined to form a more robust classifier, ensemble models have become particularly useful for clustering.
The idea here is to cluster the set of objects using different methods, including different features, similarity metrics, and clustering algorithms, and then combine the results to form an ensemble. Ensemble models can help to combine the diversity obtained from different clustering solutions. Ensembles have also been shown to outperform single models in terms of accuracy and robustness [34, 50].

(f) Hierarchical models

Over the years, hierarchical clustering has become one of the most widely used clustering methods in the scRNA-seq domain [117]. Their increased use is due to the lack of distributional assumptions about the data generating mechanism [90]. Hierarchical clustering algorithms can uncover possible hierarchies amongst cell types and represent them using a tree structure, which is appealing since cell types can exist in hierarchies. The hierarchical representation also makes interpreting clusters easier. ScRNA-seq methods that employ these algorithms include SC3 [59], cellTree [25], CIDR [68], and DendroSplit [125].

(g) Multiview models

Advances in data curation and clustering techniques have enabled researchers to gather information on an experiment from multiple views to increase the resolution of the phenomena under examination. Multiview clustering [13, 20, 122] provides a methodology for combining the information contained in these views into a common representation. There are a few advantages of multiview clustering, including generating a more complete knowledge of the data, reducing noise content, and generating a more robust clustering of the data. Multiview methods differ from ensemble methods in that they combine information across multiple views before clustering is done, whereas in ensemble models, clustering is done for each view and the results are then combined. There have not been many adaptations of these methods to the scRNA-seq domain.
For example, [16] combines gene expression data and paired epigenetic data to infer cell types and gene regulatory networks. Also, in [98], a similar idea based on pattern fusion analysis is used to integrate multiple heterogeneous omics data. Even with these methods, the multiview clustering literature for the scRNA-seq domain remains sparse and provides a possible avenue for further research.

This is by no means an exhaustive list of all the clustering paradigms and methods available to the scRNA-seq domain. However, most methods developed fall somewhere within these paradigms, and other methods combine aspects of multiple paradigms. Most methods under these paradigms require that the number of clusters K is known a priori. This value is typically unknown, and there exist some heuristics to compute it prospectively or retrospectively. For a more thorough review of these paradigms, their advantages, and their disadvantages, refer to [85].

3. Ordering of cells.

An alternate but equally significant aspect of scRNA-seq analysis is pseudo-time analysis. Pseudo-time methods are used for the analysis of gene expression levels over a continuous axis. Since scRNA-seq datasets only provide a snapshot of cells at a point in time, pseudo-time analysis requires setting up a continuous axis and observing the gene expression of cells. The task then becomes ordering these cells along this continuous axis. Pseudo-time analysis could lead to understanding differentiation trajectories for different cell types and understanding how different cell states vary. A few methods have been developed for this, including Monocle [108], TSCAN [53], SCUBA [74], and scVelo [11]. This remains an active area of research, and new methods are released routinely; see [19] for a review.

4. Differential expression analysis.

After clustering the cells, the next step is the interpretation of clusters.
Interpreting clusters is typically done by finding genes that are differentially expressed between clusters, or marker genes that are expressed in specific clusters. Many methods have been developed to do this, including D3E [22], DEsingle [79], MAST [30], and SCDE [57]. Some of these methods have been designed specifically for scRNA-seq data and do not have the limitations faced by bulk RNA-seq methods. Other methods developed for bulk RNA-seq have been applied to scRNA-seq data; however, they may not be appropriate due to the assumptions they make, see [111]. Due to the large numbers of cells obtained from scRNA-seq experiments, simple statistical methods such as the Mann-Whitney U test, Student's t-test, or logistic regression may not be limited by their statistical assumptions. These simple tests are implemented in many scRNA-seq data analysis pipelines, including Seurat [17] and scater [76]. For cells ordered using pseudo-time analysis, differential expression analysis is done by finding genes with significant relationships between expression and the continuous axis. Typically, splines are fit and coefficients are tested for significance. Most pseudo-time analysis methods described previously include methods for finding differentially expressed genes along a trajectory.

Figure 1.1: Plots of uniformly distributed points clustered by K-means with k = 2 and k = 3. A) K-means with k = 2. B) K-means with k = 3. Note that the cluster borders returned by K-means appear to be arbitrary since there is no inherent cluster structure.

1.1.2 Thesis motivation and contribution

Since clustering is a crucial aspect of most scRNA-seq analysis pipelines, great care must be taken when applying clustering algorithms. Most clustering algorithms will find clusters in a dataset, even if none exist [2]. See Figure 1.1. This is because clustering algorithms are designed to optimize an objective function that seeks to partition the dataset optimally [73].
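The failure mode illustrated in Figure 1.1 is easy to reproduce. The sketch below is illustrative only (it is not code from the thesis): a minimal plain-NumPy implementation of Lloyd's k-means run with k = 3 on uniformly distributed points, which have no inherent cluster structure, still returns three sizeable "clusters".

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 2))   # 1000 uniform points: no inherent clusters

labels = kmeans(X, k=3)
sizes = np.bincount(labels, minlength=3)
# the three "clusters" are artifacts of the objective function,
# not structure in the data
print(sizes)
```

The partition boundaries here are determined entirely by the random initialization and the geometry of the objective function, which is exactly why a clusterability check should precede clustering.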
These algorithms do not consider the possibility that there may not exist inherent clusters in the dataset. Thus, there is a risk associated with blindly applying them without proper prior analysis. In this thesis, we focused our efforts on developing a methodology to systematically assess the level of cluster structure (clusterability) in scRNA-seq datasets, while also estimating the possible number of clusters. To our knowledge, this question has not been addressed specifically for scRNA-seq datasets, and thus this work provides an initial framework for addressing this problem.

We perform extensive simulation studies in order to understand which problems affect clusterability analysis in scRNA-seq datasets. We consider problems such as increasing data size, complexity, sparsity, cluster size, and cluster separability. Exploring these issues in a controlled manner illuminates the factors limiting the effectiveness of scRNA-seq clustering methods. Clusterability analysis is also done for a range of real datasets. We found that the methods developed in this thesis are robust to data sparsity, cluster size, and cluster separability, and scale well with increasing data size.

We also compared the methods developed in this work to other methods most often used to cluster scRNA-seq data. We used benchmarking data to concretely evaluate these methods' running time and accuracy, and compared them to the methods developed in this work. Our method provides a competitive running time while, on average, having a higher accuracy when clustering.

1.1.3 Detailed outline of thesis

The rest of this thesis is outlined as follows:

• Chapter 2 is dedicated to surveying the current literature on clusterability analysis and estimating the number of clusters. We discuss some of the methods most often used to assess cluster structure in a dataset. We also discuss a few ways currently used to estimate the number of clusters.
These discussions are done in a broader scope without reference to the scRNA-seq domain. Finally, we provide a short discussion on how some of these methods are used for scRNA-seq analysis.

• In Chapter 3, we outline the methods developed to assess clusterability and estimate the number of clusters. First, we provide a detailed outline of the multimodality testing method for assessing clusterability. Next, we present the CCMT procedure for estimating the number of clusters. Thirdly, we outline both the simulation studies and the real datasets used to evaluate both methods. Lastly, we provide details about the methods used for benchmarking and the evaluation metrics.

• In Chapter 4, we discuss the results of our methods based on simulation studies and real data. First, we provide simulation studies and real data results for assessing clusterability. Next, we provide the results on simulation studies and real data for the CCMT procedure. Finally, we provide the results of the comparative analysis of the CCMT procedure against other methods.

• In Chapter 5, general conclusions are provided about the findings of this work. Also discussed are the possible shortcomings of the proposed methods and future directions that may improve these methods.

Chapter 2

Related Works

The clustering task consists of partitioning a set of objects into k (possibly overlapping) groups such that members of the same group are sufficiently similar to each other and sufficiently dissimilar to non-members. Defining similarity between members is highly dependent on the question asked, the phenomena studied, and the clustering algorithm used. This creates an inherent level of subjectivity when clustering. In any clustering task, the user needs to make some assumptions about the data being clustered, with the most implicit and necessary assumption being that the data indeed contains meaningful clusters.
Based on the clustering algorithm chosen, further assumptions need to be made about the data, and the efficacy of the results depends on how strongly these assumptions hold. The user also needs to decide how many clusters to compute. This is a problem because the user rarely knows beforehand how many clusters to expect, and clustering results may depend heavily on the number of clusters chosen.

In this chapter, we survey the current literature on clusterability and on determining a suitable number of clusters. First, we discuss the clusterability problem and the various methods for addressing it. Second, we discuss the problem of estimating the number of clusters and the current methods for addressing it.

2.1 The clusterability problem

"Clusterability" seeks to provide a measure of the level of cluster structure in a dataset. Clusterability methods assess the potential for a dataset to form clusters without making any assumptions about the data's nature. If clustering seeks to find meaningful partitions in a dataset, then clusterability seeks to find the extent to which these partitions exist. Ideally, a clustering solution is meaningful if it captures the natural cluster structure in the data.

2.1.1 Visual assessment of clusterability

The first and most intuitive way of assessing clusterability is by visual inspection [70]. Cluster structure is assessed visually by projecting the points in a dataset to a lower-dimensional space (typically 2 or 3 dimensions) using linear or non-linear dimensionality reduction methods. The projected points are then visualized using a 2D or 3D plot, and the cluster structure is assessed by identifying the grouping structure in the plot. Linear dimensionality reduction methods such as Principal Component Analysis (PCA) [55] assume that the data lies on a linear plane.
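As a toy illustration of the linear case (our own sketch, not the implementation behind [55]), the leading principal component of 2-D data can be found by power iteration on the sample covariance matrix:

```python
import math

def leading_pc(points, iters=200):
    # Power iteration on the 2x2 sample covariance matrix: repeatedly
    # multiplying a vector by the covariance and renormalising converges
    # to the direction of maximum variance, i.e. the first PC.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.3)  # arbitrary non-zero start vector
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Points scattered tightly around the line y = x.
pts = [(t, t + 0.01 * (-1) ** i) for i, t in enumerate(range(10))]
pc = leading_pc(pts)
```

For points scattered along the line y = x, the iteration converges to the direction (1, 1)/√2, the linear "plane" that PCA assumes the data lies on.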
Non-linear dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE) [112] and Uniform Manifold Approximation and Projection (UMAP) [77] do not make this assumption about the data and use heuristics for projecting the points onto a non-linear manifold. After projection, grouping structure is identified by eye, by seeking regions of high density separated by regions of low density. This creates a level of subjectivity that may differ between users.

There are more standard methods for assessing cluster structure visually [12, 49, 115]. These methods first compute pairwise Euclidean distances between points and order them such that any potential cluster structure in the data becomes obvious. A heat map of the ordered dissimilarities is plotted, and a diagonal block structure provides evidence for the existence of clusters. Another method [115], similar in flavor, uses image processing techniques to automatically count the number of diagonal blocks. However, this method has high computational complexity and becomes impractical for datasets containing as many points as is common in scRNA-seq. Most, if not all, scRNA-seq analysis pipelines contain methods to visualize data by reducing the data onto a linear or non-linear subspace. Some established pipelines use already developed methods such as PCA [55], t-SNE [112], and UMAP [77]. Other pipelines have developed visualization methods specific to the analysis of scRNA-seq, including ZIFA [86], ZINB-WaVE [89], scvis [24], and many more. Methods such as t-SNE and UMAP are stochastic and typically require parameter tuning by the user.
Depending on the parameterization of these methods, it is possible to obtain completely different results for the same dataset, which can become problematic when determining cluster structure. This signifies the need for new methods that are deterministic and not highly impacted by parameters.

2.1.2 Spatial randomness

Another approach for assessing clusterability is testing for complete spatial randomness (CSR). This is one of the first, and perhaps the oldest, methods for testing clusterability in a dataset [23, 65]. First, define a point process as a set of mathematically defined points located in an underlying space, such as the real line or the Cartesian plane. If we consider a data matrix D of n points in p dimensions as a point process, the notion of CSR can be applied. More formally, CSR methods perform a statistical test on the matrix D and draw one of three conclusions [52]:

1. The points in D are arranged randomly, meaning there is no evidence of cluster structure.

2. The points are aggregated or clustered, meaning that there is evidence for cluster structure.

3. The points are regularly spaced.

If applied to scRNA-seq, we can let D be the gene expression counts matrix and then apply these types of tests, drawing conclusions based on the results. One of the most prominent tests for spatial randomness is the Hopkins test [47, 64]. This method tests spatial randomness by comparing nearest-neighbor distances among a set of points sampled from D to the distances of points sampled from a null model. If D exhibits complete spatial randomness, these distances should be similar on average. Simulation studies using this test have shown low power when the putative clusters are not well separated [2].

Other tests for CSR include:

1. Scan tests, which are based on the number of points in the densest sub-region of a predefined sampling window. A large count provides evidence for the presence of cluster structure.

2.
Quadrat analysis, which partitions a predefined sampling window into equally sized rectangles and counts the number of points in each quadrat. These counts follow a Poisson model under the assumption of CSR, and a chi-square test is used for hypothesis testing.

3. Inter-point distances, which reflect structural relationships among points. A test for randomness compares these distances with those computed from a null model.

4. Structural graph methods, which define a graph over the pairwise distances between points. A test for spatial randomness compares the distribution of this graph's edge lengths with that of a null model.

The methods listed above are described in [52]. Most of them suffer from expensive computations, impracticality in high dimensions, and a lack of suitable null models, preventing their extensive use. To our knowledge, there are currently no tests of spatial randomness applied to scRNA-seq datasets.

2.1.3 Multimodality testing

Multimodality testing is another way to assess clusterability [31]. It tests for the existence of multiple modes in a probability distribution over a set of points. Some of the well-known methods in this field are described below. However, before explaining further, a few definitions are provided.

Definition. A probability density function (PDF) f is unimodal if, for some value k, f is monotonically increasing for x ≤ k and monotonically decreasing for x ≥ k. f is multimodal if no such k exists.

Definition. A cumulative distribution function (CDF) F is unimodal if a k exists such that F is convex on the interval (−∞, k] and concave on the interval [k, ∞). F is multimodal if no such k exists.

Definition. The modes of a multimodal probability density function f are the values of x where f attains a local maximum. A maximum may be located at a single point, signifying a mode, or over a closed interval, signifying a modal interval.

See Figure 2.1 for an example of a unimodal PDF and CDF.

Figure 2.1: Plots of a unimodal PDF and CDF.
Note that for both the PDF and CDF, the mode is located at a. Figure adapted from [29].

Modality tests formally test a distribution generated from a set of points for multiple modes, or multimodality. The relation to clustering can be stated as follows:

1. If a dataset D contains multiple clusters, then points in the same cluster will be closer to each other than to points in other clusters.

2. The distribution function of (dis)similarities between the points can be statistically tested for multimodality.

3. A significant test provides evidence for multimodality, meaning clusters are present, and hence evidence for clusterability.

Below we describe two of the most frequently used multimodality testing methods, namely the Dip test [42] and the Silverman test [100]. Other tests [3, 43, 93] are not described here due to their low statistical power, computational inefficiency, and lack of suitable implementations.

The Dip test

Consider a dataset X = (x_1, \ldots, x_n), where for simplicity we assume all x_i are independent and identically distributed (i.i.d.). Let f be a distribution function of (dis)similarities between a set of points. The Dip test compares these two hypotheses:

H_0: f has 1 mode vs. H_1: f has > 1 mode(s).

Now, define the Dip statistic D of a cumulative distribution function F to be:

D(F) = \min_{G \in U} \sup_x |F(x) - G(x)|    (2.1)

where U is the class of all unimodal distributions. This value is computed using the empirical cumulative distribution function (ECDF) [42]. Put another way, the Dip statistic is the maximum difference between the empirical distribution function and the unimodal distribution function that minimizes this difference. The Dip essentially measures a function's departure from unimodality. To obtain a p-value, the Dip statistic is computed for the ECDF, and this value is compared with Dip values for B samples from a uniform null distribution. Formally,

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{Dip_b > Dip_X\}}{B}    (2.2)

where Dip_b is the Dip statistic for the b-th sample from U(0,1), and Dip_X is the Dip statistic for the data.
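The Monte Carlo scheme of Equation (2.2) can be sketched as follows. Computing Hartigan's Dip itself requires the unimodal-fit machinery of [42], so the statistic below is a deliberately simple stand-in (the maximum gap between a rescaled ECDF and the uniform CDF); only the bootstrap p-value logic is the point:

```python
import random

def ecdf_departure(xs):
    # Stand-in statistic: max gap between the ECDF of xs (rescaled to
    # [0, 1]) and the uniform CDF. NOT Hartigan's Dip statistic -- just
    # a simple "departure from a reference shape" measure.
    lo, hi = min(xs), max(xs)
    scaled = sorted((x - lo) / (hi - lo) for x in xs)
    n = len(scaled)
    return max(abs((i + 1) / n - x) for i, x in enumerate(scaled))

def monte_carlo_p_value(xs, statistic, B=200, seed=1):
    # Equation (2.2): fraction of B uniform null samples whose statistic
    # exceeds the statistic of the observed data.
    rng = random.Random(seed)
    d_data = statistic(xs)
    exceed = sum(
        statistic([rng.random() for _ in xs]) > d_data for _ in range(B)
    )
    return exceed / B

# Two well-separated groups of values: the departure statistic is large,
# uniform samples rarely beat it, and the p-value is small.
bimodal = [0.05 + 0.01 * i for i in range(20)] + [0.85 + 0.01 * i for i in range(20)]
p = monte_carlo_p_value(bimodal, ecdf_departure)
```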
This p-value can also be computed by interpolation from a table containing empirical percentage points of the Dip statistic, based on N = 1000001 samples of size n from U[0,1].

The Silverman test

Again, consider a dataset X = (x_1, \ldots, x_n), where for simplicity we assume all x_i are independent and identically distributed (i.i.d.). The Silverman test [100] compares these two hypotheses:

H_0: f has 1 mode vs. H_1: f has > 1 mode(s).

This test employs a kernel density estimate of the form:

f_X(h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (2.3)

where K(·) is the Gaussian kernel and h is the bandwidth. The Gaussian kernel is used because the number of modes k of f_X(h) is a non-increasing function of h. First, let f_X(h) be the kernel density estimate for X as a function of h. Next, let h_{crit} be the minimum value of h such that f_X(h) has at most k modes. The value of h_{crit} is used as a test statistic, with large values indicating evidence against unimodality and small values indicating evidence for unimodality.

The test works as follows:

1. Generate B samples Z^b = \{Z^b_1, \ldots, Z^b_n\} with b = 1, \ldots, B, where Z^b_i = (1 + h_{crit}^2/\sigma^2)^{-1/2} X^b_i, \sigma^2 is the sample variance, and X^b_i is generated from \hat{f}(h_{crit}).

2. For each sample Z^b, compute h^b_{crit}.

3. Finally, to get a p-value, we compute:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}(h^b_{crit} \leq h_{crit})}{B}    (2.4)

See [3, 100] for a careful treatment of the methodology and algorithm.

2.2 Estimating the number of clusters C

There are several clustering algorithms available for use [119]. However, most of these algorithms require that the number of clusters c be specified. While it is relatively easy to point out grouping structure using 2D or 3D plots, it is more challenging to determine exactly how many groups are present. Therefore, it is necessary to have robust ways of estimating the number of clusters in order to obtain good results.
Below we discuss a few ways typically used to estimate the number of clusters when clustering is applied.

2.2.1 Conventional methods

The conventional approach to determining the number of clusters is to run a clustering algorithm with multiple values of c and use cluster validity indices or stability indices to select an optimal c [120]. Validity indices are classified into two groups: external and internal. External indices validate a clustering solution by comparing it to an external source such as known cluster labels. Internal indices, in contrast, do not rely on any such external information; their validation is based purely on the clustering solution itself. Since users typically do not have prior information such as original labels, we focus our attention on internal validation and stability indices.

Many internal indices have been proposed to estimate the number of clusters. These include the CH index [18], the Silhouette index [92], and the Gap index [106]. Most of these indices are distance-based and measure cluster compactness using pairwise distances. They also compute cluster separation using the distances between cluster centers. A value of c is chosen such that cluster compactness or cluster separation is maximized. Distance-based internal indices are sensitive to noise, outliers, and the scaling of the variables of interest. This sensitivity is often circumvented by processing the data to remove outliers and scaling it to zero mean and unit variance before clustering. This approach to estimating c can be computationally expensive, because a clustering algorithm must be run many times and the validation index computed each time, which can be costly for large datasets.

Cluster stability methods seek to find a clustering that is robust to data perturbation and noise.
According to [10], if the data is over-clustered, the clustering algorithm will need to randomly split true clusters, leading to a lack of stability in the resulting clusters. Similarly, if the data is under-clustered, the clustering algorithm will need to randomly merge true clusters, again leading to a lack of stability in the resulting clusters. One method for assessing cluster stability is based on resampling. This approach clusters overlapping subsets of the data and then computes a similarity score between the clusterings of the subsets. The similarity score is computed from the pairwise distances between these clusterings, and the stability score is the average of these pairwise distances. A value of c is chosen such that it maximizes the stability score over a range of candidate values. See [72] for a detailed treatment of the resampling-based method for cluster stability. Another method, not discussed here, is based on building an ensemble that combines many clusterings of the dataset over a range of c values. See [102] for a treatment of this approach.

2.2.2 Statistical significance methods

Another way of estimating the number of clusters is to use more formal statistical tests. This is done by finding the value of c that provides the most significant evidence against a null hypothesis of a single cluster. Such hypotheses include the uniformity hypothesis [95] and the unimodal hypothesis [14, 41, 52]. Under the unimodality hypothesis, the data is viewed as a random sample generated from a multivariate Gaussian distribution. Under the uniformity hypothesis, the data is viewed as a random sample generated from a d-dimensional uniform distribution. For both hypotheses, evidence for or against the null can be computed using the internal validation indices discussed in Section 2.2.1. Since internal indices are used, multiple clusterings of the dataset for different values of c are again required.
See [52] for a comprehensive treatment of these methods.

Another method for estimating the number of clusters makes use of the nested nature of hierarchical clustering. The results of hierarchical clustering methods are presented as a binary tree, or dendrogram, which provides an intuitive way of viewing the hierarchical structure in the data. The number of clusters is estimated using a statistical test applied in a top-down manner at each node of the resulting dendrogram. These tests are designed such that the null hypothesis of data homogeneity is tested by computing a test statistic on the data and comparing it to a suitable null distribution. The null distribution is computed by making specific assumptions about the data, which can be either parametric or non-parametric. Methods that have implemented this approach include [44, 58]; they use either a Gaussian null model [58] or a unimodal null model [44], described below.

2.2.3 The Gaussian model for statistical significance

This model assumes clusters are derived from a single Gaussian distribution parameterized by a mean vector µ and a covariance matrix Σ. One of the most notable methods implementing this model is [69]. The hypothesis tested is formally stated as:

H_0: A pair of clusters follows a single Gaussian distribution.
H_a: A pair of clusters does not follow a single Gaussian distribution.

A p-value is obtained that provides evidence for or against the null.

The test statistic is the 2-means cluster index, a tightness or compactness measure for clusters, defined as follows:

CI = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \| x_j - \bar{x}_k \|}{\sum_{j=1}^{n} \| x_j - \bar{x} \|}    (2.5)

where \bar{x}_k is the mean of cluster k ∈ {1, 2}, C_k is the set of sample indices for cluster k, and \bar{x} is the overall mean.
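In one dimension, Equation (2.5) reduces to the following toy computation (our sketch, not the implementation of [69]):

```python
def cluster_index(points, labels):
    # Equation (2.5): within-cluster deviations from the cluster means,
    # divided by total deviation from the overall mean (1-D toy version).
    mean = lambda vs: sum(vs) / len(vs)
    overall = mean(points)
    within = 0.0
    for k in set(labels):
        members = [p for p, l in zip(points, labels) if l == k]
        centre = mean(members)
        within += sum(abs(p - centre) for p in members)
    total = sum(abs(p - overall) for p in points)
    return within / total

# A 2-means split matching the true structure gives a small CI;
# an arbitrary split of the same points gives a large one.
pts = [0, 1, 2, 10, 11, 12]
ci_good = cluster_index(pts, [0, 0, 0, 1, 1, 1])
ci_bad = cluster_index(pts, [0, 1, 0, 1, 0, 1])
```

Here the split matching the true structure yields CI = 4/30 ≈ 0.13, while the arbitrary split yields ≈ 0.89.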
We note that a smaller value of this function is associated with a larger proportion of variation explained by a given clustering, implying a better clustering.

Lastly, significance for a pair of clusters is obtained by comparing the test statistic computed from the data with the same statistic computed under the null distribution. The null distribution is empirically estimated by computing the test statistic for many datasets generated from a single Gaussian distribution. The Gaussian distribution is estimated from the original dataset using methods described in [48, 69]. A p-value is then obtained by computing:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{CI_b > CI_{data}\}}{B}    (2.6)

where B is the number of datasets generated and CI_b is the cluster index for the null dataset generated during iteration b.

In [58], this method is extended to a hierarchical setting through a Monte Carlo based sequential hypothesis testing framework. First, a hierarchical tree is generated using agglomerative clustering. Next, at select nodes, starting from the root, a test for significance of clustering is performed between the two clusters below the current node. To control the family-wise error rate (FWER) due to multiple testing, the methods described in [78] are used. This procedure returns the k nodes that are significant, which implies k + 1 clusters.

2.2.4 The Unimodal model for statistical significance

The unimodal model is non-parametric and assumes that clusters follow a unimodal distribution. The Gaussian model makes specific assumptions about the distribution of clusters, which may decrease statistical power if clusters are non-Gaussian. The method implemented in [44] instead assumes a unimodal distribution for clusters and tests:

H_0: A pair of clusters follows a single unimodal distribution.
H_a: A pair of clusters does not follow a single unimodal distribution.

A p-value is obtained that provides evidence for or against the null.
To perform the test, a null dataset X^0 needs to be computed, with the requirement that it is as close as possible to X while satisfying unimodality. This is achieved by using a Gaussian kernel density estimator (KDE) to model each feature in the dataset, expressed as:

\hat{f}(t; h_j) = (nh_j)^{-1} \sum_{i=1}^{n} K(h_j^{-1}(t - X_{ij}))    (2.7)

where h_j is the bandwidth, K(·) is the Gaussian kernel function, and X_{1j}, \ldots, X_{nj} are the entries for feature j. According to [100], there exists a critical bandwidth h_{kj} such that:

h_{kj} = \inf\{h_j : \hat{f}_j(\cdot; h_j) \text{ has at most } k \text{ modes}\}    (2.8)

h_{1j} is then computed, and \hat{f}(t; h_{1j}) is re-scaled to have variance equal to the sample variance S. Finally, using the re-scaled KDE, bootstrap samples are generated by:

X^0_{ij} = \left(1 + \frac{h_{1j}^2}{\sigma_j^2}\right)^{-1/2} (X^I_{ij} + h_{1j}\varepsilon_i)    (2.9)

where \varepsilon_i \sim N(0,1), \sigma_j^2 is the sample variance for feature j, and the X^I_{ij} are sampled uniformly with replacement from the observed data for feature j [44]. To ensure the covariance structure is maintained, the data is scaled to have mean 0 and variance 1 before computing X^0; X^0 is then multiplied by the Cholesky root of the sample covariance matrix of X.

The 2-means cluster index defined in Section 2.2.3 for the Gaussian model is used to measure the strength of the clustering obtained from both X and X^0. Again, significance for a pair of clusters is obtained by comparing the test statistic computed from the data with the same statistic computed under the null distribution. The null distribution is empirically estimated by computing the test statistic for many datasets generated from the null unimodal distribution. The unimodal distribution is estimated from the original dataset using the methods described above and formally in [44].
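For a single, already centered feature, the null draws of Equation (2.9) can be sketched as below (our illustration, not the implementation of [44]); the leading scale factor is what keeps the variance of the null draws equal to the sample variance:

```python
import random

def smoothed_bootstrap(xs, h, B, seed=2):
    # Equation (2.9), one feature: X0 = (1 + h^2/s^2)^(-1/2) (X_I + h*eps),
    # with eps ~ N(0, 1) and X_I resampled with replacement. The leading
    # factor compensates for the kernel noise, keeping the variance of
    # the null draws equal to the sample variance. xs is assumed already
    # centered (mean 0), as in the text.
    rng = random.Random(seed)
    n = len(xs)
    s2 = sum(x * x for x in xs) / (n - 1)
    scale = (1 + h * h / s2) ** -0.5
    return [scale * (rng.choice(xs) + h * rng.gauss(0, 1)) for _ in range(B)]

# Variance of the smoothed draws stays close to the sample variance even
# though kernel noise with bandwidth h = 2 was added.
data = [float(i) for i in range(-10, 11)]  # centered, sample variance 38.5
draws = smoothed_bootstrap(data, h=2.0, B=4000)
var_draws = sum(d * d for d in draws) / (len(draws) - 1)
```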
A p-value is then obtained by computing:

p\text{-value} = \frac{\sum_{b=1}^{B} \mathbb{1}\{CI_b > CI_{data}\}}{B}    (2.10)

where B is the number of datasets generated and CI_b is the cluster index for the null dataset generated during iteration b.

It is worth noting that the authors mention the applicability of this method to a nested setting, as seen in [58]. However, no methods for controlling the FWER due to multiple testing are presented. This is a potential drawback that could be addressed by adapting the FWER control method used in [58] to this model. Both methods require bootstrapping to generate a suitable null model, which can be very time consuming and impractical for large datasets.

2.2.5 ScRNA-seq specific methods for estimating C

A few clustering algorithms developed for scRNA-seq datasets estimate the number of clusters directly or indirectly [38, 59, 113, 127]. For example, the consensus clustering method SC3 [59] uses random matrix theory, comparing the eigenvalues of a transformed gene expression matrix, to estimate the number of clusters. However, this is highly susceptible to overfitting when data sparsity is high. Overfitting due to sparsity is particularly worrisome since scRNA-seq datasets have an abundance of zero entries. Another method that attempts this is SIMLR [113], which estimates the number of clusters by optimizing heuristic functions based on network diffusion. SIMLR's optimization algorithm requires a range of possible numbers of clusters, which in turn requires prior knowledge about the number of clusters. Imposing such prior knowledge renders clusterability analysis useless, since it implicitly assumes that the dataset contains clusters, which may not be the case.

Chapter 3

Methods

This chapter details our approach for testing for clusterability and computing the number of clusters. In Section 3.1, we describe how multimodality testing using the Dip test can be used as a measure of clusterability.
Section 3.6 describes the Computing Clusters Through Multimodality Testing (CCMT) procedure for computing the number of clusters. In Section 3.7, we formally describe the simulation setup used for method evaluation. In Section 3.8, we discuss the real datasets used for benchmarking and comparisons with other methods. In Section 3.10, we detail the metric used for evaluation on both simulated data and real benchmarking data. In Section 3.11, we provide details about the clustering methods used for comparison against the CCMT procedure. In Section 3.12, we discuss how the performance of both multimodality testing and the CCMT procedure is assessed on the simulation studies and real data. Finally, in Section 3.13, we present details of how computational running times are measured.

3.1 Testing expression patterns for multimodality

To assess the cluster structure inherent to a scRNA-seq dataset, we look for multimodality in the gene expression patterns of cells, using multimodality testing as a proxy for clusterability. Formally, this method takes as input a scRNA-seq gene expression matrix M with n rows and p columns. We denote by y_ij the raw expression count for gene j in cell i. We assume that the count values are integer-valued and have not been normalized. Before any analysis is done, we filter genes by selecting genes expressed in at least 5 cells. We also select cells with a minimum of 200 and a maximum of 2500 expressed genes. We cap the number of genes expressed in a cell because cells with extremely high gene counts tend to be multiplets, and cells with very low gene counts tend to be the result of empty droplets. Cells with more than 5% mitochondrial content are also filtered out, as high mitochondrial content is a possible sign of cells that were broken in a droplet during sequencing.

Following the steps formally described below, the matrix M of counts is normalized, highly informative genes are selected, and dimensionality reduction using PCA is performed.
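The quality-control filters just described can be sketched as a standalone function. The dict-of-dicts layout is a toy stand-in for the matrix M, the "MT-" gene-name prefix used to flag mitochondrial genes is our assumption (the text does not state the convention), and the demo shrinks the thresholds to suit toy data:

```python
def qc_filter(counts, min_cells=5, min_genes=200, max_genes=2500, max_mito=0.05):
    # counts: {cell_name: {gene_name: integer count}} -- a toy stand-in
    # for the n x p matrix M. Defaults mirror the thresholds in the text;
    # the "MT-" prefix convention for mitochondrial genes is an assumption.
    genes = {g for cell in counts.values() for g in cell}
    keep_genes = {
        g for g in genes
        if sum(1 for cell in counts.values() if cell.get(g, 0) > 0) >= min_cells
    }
    kept = {}
    for name, cell in counts.items():
        expr = {g: c for g, c in cell.items() if g in keep_genes and c > 0}
        total = sum(expr.values())
        mito = sum(c for g, c in expr.items() if g.startswith("MT-"))
        if min_genes <= len(expr) <= max_genes and total and mito / total <= max_mito:
            kept[name] = expr
    return kept

# Toy demo with shrunken thresholds: five healthy cells pass, while a
# high-mitochondrial cell and a near-empty cell are removed.
counts = {f"c{i}": {**{f"g{j}": 2 for j in range(10)}, "MT-CO1": 1} for i in range(5)}
counts["high_mito"] = {**{f"g{j}": 2 for j in range(10)}, "MT-CO1": 50}
counts["sparse"] = {"g0": 1, "g1": 1}
kept = qc_filter(counts, min_cells=5, min_genes=3, max_genes=50, max_mito=0.05)
```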
Finally, multimodality is tested using the Dip test [42] on the cosine distances between the cells in PCA space. See Figure 3.1 for a visualization of the pipeline.

Figure 3.1: An overview of the integrated pipeline for multimodality testing. Starting with a molecular counts matrix, cell normalisation is done using three methods. Next, feature selection and PCA are done for each normalisation method, followed by computing cell-to-cell cosine distances for each normalisation method. Finally, multimodality testing is done using the Dip test on the distribution of these cell-to-cell distances. The output is a table containing three p-values, one for each normalisation method.

3.2 Counts modelling and normalization

Normalization of the counts matrix is a pivotal step in analyzing scRNA-seq data. Counts normalization is typically done to address differences in sequencing depth from cell to cell. It also helps to adjust for various forms of noise or bias in the sequencing process, so that they do not heavily confound the real biological differences present. To this end, we use three different methods (described below) for normalizing the counts matrix.

3.2.1 Log transformation of normalized counts

Log-transformed molecular counts have become widely used in the analysis of scRNA-seq data because of their statistical simplicity and interpretability. The transformed values can represent log-fold changes in gene expression between cells, which is sometimes used for informative gene selection and differential expression analysis. Log transformation also helps to reduce the severity of stochasticity in the counts of highly expressed genes. Formally, the log-transformed model is defined as follows: let y_ij be the observed molecular count for cell i and gene j, and let n_i = \sum_j y_{ij} be the total molecular count in cell i. Now let \hat{\pi}_{ij} = \frac{y_{ij}}{n_i} be the observed proportion of gene j in cell i.
We can define the log-transformed values as z_{ij} = \log_2(c + \hat{\pi}_{ij} \cdot m), where c = 1 is a pseudo-count to deal with situations where \hat{\pi}_{ij} = 0, and m is a scaling factor (typically set to 10^6). The resulting z_{ij} values are used as normalised expression counts for downstream analyses. The Seurat package [17] was used to compute the log-transformed values.

3.2.2 Multinomial normalization

Recent studies have shown that log transformation of molecular counts causes statistical biases, which lead to a loss of power during downstream inference [39, 46, 71, 107]. These biases include inadequate variance stabilization caused by artificial variance inflation. The second normalization method assumes the counts are derived from a multinomial model and thus models the molecular counts directly. Formally, the multinomial model is defined as follows: let y_ij be the observed molecular count for cell i and gene j, and let n_i = \sum_j y_{ij}. Now let \pi_{ij} be the true, unknown relative abundance of gene j in cell i. The vector y_i = (y_{i1}, \ldots, y_{iJ})^T, with the constraint n_i = \sum_j y_{ij}, follows a multinomial distribution with density function

f(y_i) = \binom{n_i}{y_{i1}, \ldots, y_{iJ}} \prod_j \pi_{ij}^{y_{ij}}    (3.1)

This model is fitted using maximum likelihood estimation on the raw molecular counts via a binomial or Poisson approximation. The Pearson residuals or deviance residuals are used as normalized expression counts for downstream analyses. For a more thorough treatment of the multinomial model for molecular counts, see [107]. The scry package [107] was used for fitting the model and computing the normalised values.

3.2.3 Regularized negative binomial normalization

The third and final model used to normalize the counts is a regularized negative binomial model. It is similar to the multinomial model in that both model the counts directly; however, this model assumes the counts follow a negative binomial distribution. This is very similar to existing normalization models such as ZIFA [86].
However, [39] showed that these models are prone to overfitting, which negatively affects downstream analyses such as clustering and differential expression analysis. To combat this, [39] proposes a regularized version of the negative binomial model. Here, a generalized linear model is fitted with the molecular counts for a gene j (i.e., y_ij) as the response and sequencing depth as a covariate. Regularization is done by applying kernel regression to the parameters estimated from this model. Formally, the regularized negative binomial model is defined as follows:

\log(E(x_i)) = \beta_0 + \beta_1 \log_{10} m    (3.2)

where x_i is the vector of molecular counts for gene i, and m is the vector of total counts assigned to the cells, i.e., m_j = \sum_i x_{ij}. Here \beta_0 and \beta_1 are regularised across genes using kernel regression. For a thorough treatment of the regularized negative binomial model, see [39]. Normalised counts are computed as Pearson residuals defined using the regularized regression parameters:

z_{ij} = \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}}, \quad \mu_{ij} = \exp(\beta_{0i} + \beta_{1i} \log_{10} m_j), \quad \sigma_{ij} = \sqrt{\mu_{ij} + \frac{\mu_{ij}^2}{\theta_i}}

[39]. Here z_ij is the Pearson residual of gene i in cell j, \mu_{ij} is the expected molecular count for gene i in cell j, and x_ij is the observed molecular count for gene i in cell j under the regression model defined earlier. Finally, \beta_0, \beta_1, and \theta_i are parameters obtained from the model. The sctransform package [39] was used for fitting the model and obtaining the normalised scores.

3.3 Feature selection

A typical scRNA-seq dataset can contain expression values for thousands of genes, creating potential problems for downstream analyses. Firstly, most analysis methods do not scale well with increasing dimensionality, and computational costs increase dramatically. Secondly, a large proportion of non-informative genes significantly increases the amount of noise in the data, which dilutes the true biological signal and thus reduces the power of statistical methods.
This problem is circumvented by selecting a subset of 500–1500 informative genes, and there are a few existing methods for computing how informative genes are. In methods based on gene variability, genes are ranked by their variability across cells, and only highly variable genes are retained [15]. In methods based on gene expression, genes are ranked by their average expression across cells, and only highly expressed genes are retained [26]. Depending on the normalization method used, we select the top 500 most informative genes. The selection methods are described below.

3.3.1 Feature selection for log-transformed normalized counts

To select highly informative genes for log-transformed normalized counts, a variance stabilizing transformation as described in [75] is first performed. This accounts for the mean-variance relationship inherent to scRNA-seq data that is not removed by the log transformation. Next, a line is fitted to the relationship between log(variance) and log(mean) of the genes using local polynomial regression (loess). Highly informative genes are then selected by ranking the genes using the standardized residuals from the fitted line. The Seurat package [17] was used to compute these genes.

Feature selection for multinomial normalization

For the multinomial model, the gene deviance based on a binomial approximation to the multinomial model is defined as:

D_j = 2 \sum_i \left[ y_{ij} \log\frac{y_{ij}}{n_i \hat{\pi}_j} + (n_i - y_{ij}) \log\frac{n_i - y_{ij}}{n_i(1 - \hat{\pi}_j)} \right]    (3.3)

where D_j is the deviance for gene j and the other parameters are defined as before in Section 3.2.2 [107]. This model assumes constant gene expression across cells, so the genes that deviate significantly from it are the most informative. Gene deviance values are sorted, and the most deviant genes are used for clusterability analyses.
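As a toy, single-gene illustration of Equation (3.3) (our sketch, not the scry implementation):

```python
import math

def binomial_deviance(y, n):
    # Equation (3.3) for a single gene j: y[i] is the gene's count in
    # cell i, n[i] is the cell's total count, pi_hat the pooled proportion.
    pi_hat = sum(y) / sum(n)
    def term(a, b):
        # a * log(a / b), with the 0 * log(0) = 0 convention
        return a * math.log(a / b) if a > 0 else 0.0
    return 2 * sum(
        term(yi, ni * pi_hat) + term(ni - yi, ni * (1 - pi_hat))
        for yi, ni in zip(y, n)
    )

# A gene expressed at the same proportion of every cell's depth is
# perfectly fit by the constant model; a gene whose proportion varies
# strongly across cells is highly deviant, hence informative.
flat = binomial_deviance([10, 20, 30], [100, 200, 300])
variable = binomial_deviance([55, 5, 0], [100, 200, 300])
```

The constant-proportion gene fits the null model exactly (zero deviance), while the variable gene is highly deviant and would rank as informative.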
The scry package [107] was used for computing the gene deviance.

3.3.2 Feature selection for regularized negative binomial normalisation

For the negative binomial normalization, the gene residuals $z_{ij}$ defined in Section 3.2.3 are sorted. The top genes with the highest residuals are selected and used for downstream analyses. The sctransform package [39] is used to compute the gene Pearson residuals.

3.4 Dimensionality reduction

Dimension reduction methods have become an integral part of most scRNA-seq analysis pipelines since scRNA-seq datasets often have high dimensionality. Dimensionality reduction makes analysis methods faster and more scalable by producing a lower dimensional embedding of a high dimensional dataset. The lower dimensional projections are computed such that they preserve the majority of the relevant signal in the dataset. These methods can also be used for denoising, visualization, and data compression. Most dimension reduction methods come in two flavors, linear and non-linear. Linear methods assume that the data lie on a linear manifold and use a linear function to project the data onto a lower dimension. If this manifold is non-linear, however, a linear method will produce an inadequate lower dimensional embedding. Non-linear methods do not make this assumption and are thus better able to project non-linear data onto a lower dimension. Throughout this thesis, PCA is used for dimensionality reduction.

3.5 Multimodality testing

Testing for multimodality in gene expression patterns is an integral part of this work. As an example, consider a scRNA-seq dataset containing four distinct cell types. Cells of the same type will typically have similar gene expression patterns, whereas cells of different cell types will have different gene expression patterns. We can compute and plot the cosine values of gene expression between cells. The distribution of cosine values will then contain two modes.
One mode is centered around a large cosine value, indicating cells in the same partition with similar gene expression patterns. The second mode is centered around a small cosine value, indicating cells in different partitions with different gene expression patterns. In contrast, a dataset containing a single cell type or homogeneous cell types should show roughly the same cosine values between cells, indicating a single partition; see Figure 3.2.

The multimodality tests [3, 42, 100] discussed earlier in Section 2.1.3 can be used to assess this idea more formally. We use the Dip test to test the distribution of the cosine distances between the cells for multimodality. The idea of using modality tests on pairwise distances was proposed initially by [1, 2]. However, we have modified this approach in a few ways for more effective use on scRNA-seq data. Firstly, we make the number of PCS used for computing distances dependent on the characteristics of the data, by selecting the top most significant PCS. This limits the influence of noise by retaining only PCS that contribute a significant amount of signal. Secondly, we compute distances between cells using the cosine distance defined in Equation 3.4 instead of the Euclidean distance. The Dip test is used because of its scalability and statistical power.

$$D(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3.4)$$

where $\mathbf{x}$ and $\mathbf{y}$ are both vectors of gene expression values.

For each normalization method, we perform this test as follows:

1. Compute the cosine distances between the cells using the top K PCS.
2. Run the Dip test on these pairwise distances with significance threshold α.
3. If the p-value is < α, the dataset shows significant evidence for clustering structure, or clusterability.
4.
If the p-value is > α, the dataset does not show significant evidence for clustering structure or clusterability.

The Dip test provides a p-value representing the probability of observing this distribution, or a more extreme multimodal distribution, if the data were unimodal. This p-value should be large for unimodal distributions and small for multimodal distributions. As is common practice, the significance threshold α is set to 0.05. See Figure 3.1 for an illustration of this pipeline.

3.6 Computing clusters through multimodality testing

Below we describe the Computing Clusters Through Multimodality Testing (CCMT) procedure developed for estimating the number of clusters. This method is similar to the statistical significance methods discussed in Section 2.2.2. We couple hierarchical clustering with discriminant analysis and multimodality testing to estimate the number of clusters. Below we discuss generating the hierarchical partition and the discriminant coordinates. Next, the CCMT procedure is discussed in detail. Finally, we discuss how the results of the CCMT procedure are combined across normalization methods.

3.6.1 Generating a hierarchical partition

A hierarchical tree is generated by first computing the top k PCS of the cosine distances between the cells, as defined in Equation 3.4. Next, hierarchical clustering is performed on the top k PCS using squared Euclidean distances coupled with Ward's minimum variance criterion.

Figure 3.2: A) PCA plot of four simulated heterogeneous cell types. B) Density plot of the correlation distances between the cells in A. C) PCA plot of four simulated homogeneous cell types. D) Density plot of the cosine distances between the cells in C.
Note that high values in B and D indicate low correlation and low values indicate high correlation.

3.6.2 Discriminant coordinates

Discriminant coordinates are often used to study clustering effects in a dataset, with applications including face recognition [28], action recognition [81], and gesture recognition [94]. One example is Fisher discriminant coordinates [32], which computes the direction that best separates two classes. Discriminant coordinates aim to find a projection of two classes onto a line that best separates both classes; see Figure 3.3. By projecting the classes onto this discriminant line, the overlap between both classes decreases, which enables the testing of clustering strength through multimodality testing. The method of discriminant coordinates is described below.

Consider a set of $n$ observations (in this case, normalised expression values for genes), with $n_i$ observations in group $i \in \{1, 2\}$ and $\sum_i n_i = n$. Let $x_{ij}$ be the $j$th observation in group $i$ and denote

$$\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} \quad \text{and} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{2}\sum_{j=1}^{n_i} x_{ij} \qquad (3.5)$$

We can then define two matrices, the within-group scatter matrix $W$:

$$W = \sum_{i=1}^{2}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' = \sum_{i=1}^{2} (n_i - 1) S_i = (n - g)S \qquad (3.6)$$

and the between-group scatter matrix $B$:

$$B = \sum_{i=1}^{2} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' \qquad (3.7)$$

Here $S_i$ is the sample covariance matrix for group $i$, $S$ is the pooled within-group covariance matrix, and $g = 2$ is the number of groups. A discriminant coordinate vector is defined as the vector $c$ that maximizes the function:

$$J(c) = \frac{c'Bc}{c'Wc} \qquad (3.8)$$

$J(c)$ is maximized by computing the eigenvector associated with the largest eigenvalue of $W^{-1}B$. This eigenvector provides the direction that maximizes the between-class variance captured by $B$ while minimizing the within-class variance captured by $W$.
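For two classes, the leading eigenvector of $W^{-1}B$ is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$, so the direction can be found without an eigendecomposition. A minimal two-dimensional sketch in Python (the thesis implementation is in R):

```python
def fisher_direction_2d(class1, class2):
    """Fisher discriminant direction c = W^{-1}(mean1 - mean2) for two
    classes of 2-D points, using the explicit 2x2 matrix inverse."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    def scatter(pts, m):
        # within-class scatter: sum of outer products of deviations from the mean
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        return s

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    W = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [( W[1][1] * diff[0] - W[0][1] * diff[1]) / det,
            (-W[1][0] * diff[0] + W[0][0] * diff[1]) / det]

# Two toy clusters separated along the first axis.
class1 = [(0, 0), (1, 1), (0, 1), (1, 0)]
class2 = [(10, 0), (11, 1), (10, 1), (11, 0)]
c = fisher_direction_2d(class1, class2)
# Projecting each point onto c gives the 1-D values later tested for multimodality.
```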
By projecting the classes onto the vector $c$, we decrease the overlap between both classes, which enables us to test for separability between the classes.

3.6.3 The CCMT procedure

The CCMT procedure is based on sequentially testing for multimodality using the Dip test coupled with discriminant coordinates. Contrary to established cluster significance testing methods, this method requires no formal hypothesis testing other than that performed by the Dip test. The details of the procedure are formally described below.

1. Let M denote the normalized counts matrix generated using one of the described methods, and let Mc be the matrix of cosine distances between points in M.
2. Let D denote the top k PCS of Mc.
3. Generate a tree T from D using Ward's minimum variance method on squared Euclidean distances.
4. For each node in T, let C1 and C2 be the two clusters containing the children nodes in the two respective sub-trees.
   (a) Let Dc12 be D subsetted to contain only the points in the two subtrees at the node currently being considered.
   (b) Compute a discriminant coordinate vector w that best separates C1 and C2.
   (c) Generate a one dimensional projection of Dc12 onto w and denote it x.
   (d) Finally, test x for multimodality using the Dip test and return the p-value. If the classes are well separated, their projection onto the discriminant vector will be multimodal, and a significant p-value will be obtained.
5. Starting with the root node, steps 4a-4d are performed on all nodes whose parents were themselves significant and whose subtrees each contain more than n points. For this implementation, the significance level α = 0.05 and the minimum number of points n = 10.
6. Finally, a dendrogram is returned in which each node's significance is labeled; see Figure 3.5. The points below the significant nodes are the significant partitions, and the number of significant partitions is an estimate of the number of clusters.
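The traversal in steps 4-5 can be sketched abstractly in Python. Here `test_split` is a placeholder for the discriminant projection plus Dip test of steps 4a-4d, trees are nested tuples with lists of point indices at the leaves, and the stopping rule is a simplified reading of step 5:

```python
ALPHA = 0.05   # significance level from step 5
MIN_N = 10     # minimum number of points per subtree from step 5

def ccmt(node, test_split, partitions):
    """Recursively test the nodes of a hierarchical tree for significant splits.

    node: either a (left, right) tuple or a flat list of point indices (a leaf);
    test_split(left_pts, right_pts): returns the Dip p-value for the projection
    of the two child clusters onto their discriminant coordinate (placeholder);
    partitions: output list collecting the significant partitions.
    """
    if not isinstance(node, tuple):              # a leaf cannot be split further
        partitions.append(node)
        return
    def points(t):
        return points(t[0]) + points(t[1]) if isinstance(t, tuple) else t
    left, right = node
    lp, rp = points(left), points(right)
    if min(len(lp), len(rp)) < MIN_N or test_split(lp, rp) >= ALPHA:
        partitions.append(lp + rp)               # split not supported: keep one cluster
    else:                                        # split supported: recurse into children
        ccmt(left, test_split, partitions)
        ccmt(right, test_split, partitions)

# With a test that always reports separation, each subtree becomes a partition.
parts = []
ccmt((list(range(12)), list(range(12, 24))), lambda a, b: 0.001, parts)
```

The number of collected partitions is then the estimate of the number of clusters.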
These significant partitions are also extracted and used as a clustering of the data. See Figure 3.4 for an example of this method applied to a simulated dataset.

Both multimodality testing and the CCMT procedure can be integrated into a pipeline: the user first applies multimodality testing and then proceeds to run the CCMT procedure if there is significant evidence for cluster structure. See Figure 3.6 for an illustration of the pipeline.

Figure 3.3: An illustration of discriminant coordinates. The Fisher discriminant coordinate is the line that best separates both classes. In both panels, red and green represent two different classes. A) Two classes that are not well separated by the discriminant line. B) Two classes that are well separated by the discriminant line.

3.6.4 Combining significant partitions

For a given dataset, the CCMT procedure is performed on each of the three normalized versions. We provide two methods to combine the results across the normalized datasets. The first uses an ensemble method, which combines the clustering solutions across the three normalized datasets; see Figure 3.7. The second uses a multiview approach [13, 20, 122], which combines all three normalized versions of the dataset before the CCMT procedure is performed. This approach computes a set of PCS common to all three normalized datasets [109]; the CCMT procedure is then applied to the common PCS. See Figure 3.8.

3.7 Simulation studies

To test the robustness and scalability of the methods developed in this thesis, a set of synthetic datasets was generated using the Splatter package [123]. This package was designed for simulating scRNA-seq count data and allows one to vary the number of cells, genes, and clusters, with varying levels of separability and varying degrees of sparsity. Splatter makes it possible to generate synthetic datasets that capture the different properties a typical scRNA-seq dataset may have.
This tool also makes it possible to simulate problems typically associated with the analysis of scRNA-seq datasets, including sparsity due to dropout events, low separability between cell types, and an increasing number of cells. We adopted a simulation paradigm similar to that used in [62] to generate three simulation setups, although we changed some of the parameters used in the simulation setup for more rigorous testing. The simulation setups are described below.

Figure 3.4: An example of the discriminant coordinates significance clustering test. A) and C) Two classes separated by a discriminant line in black. B) and D) The Dip test applied to the projection of the two classes onto this discriminant line. Note that for B, a significant p-value is obtained, reflecting the separation between the classes. For D, the opposite is observed. P-values < 2e−16 were rounded to 0.

3.7.1 Simulating data scalability, differing cluster number and sizes

The first simulation setup assessed scalability and differing cluster sizes. Scalability here is defined as the capacity of the methods developed to adequately handle an increasing number of cells. We also require them to handle an increasing number of clusters with different relative sizes. For this purpose, counts are simulated for a range of cells in {5000, 10000, 15000}, each with 1000 genes. For each possible number of cells, clusters in the set {4, 8, 16} are generated. For each number of clusters, a dataset was generated containing equal proportions of cells in each cluster and a dataset containing an uneven proportion of cells in each cluster. To generate uneven cluster proportions, numbers p1, p2, ..., pk were simulated from a uniform distribution such that $\sum_{i=1}^{k} p_i = 1$, where k is the number of clusters for the current simulation. For example, a set of unbalanced proportions for a dataset containing four clusters would be {0.20, 0.10, 0.40, 0.30}, whereas the proportions for a dataset containing four balanced clusters would be {0.25, 0.25, 0.25, 0.25}. This setup generated a total of 18 datasets. See Figure 3.9 for an illustration of this setup.

Figure 3.5: An illustration of the CCMT procedure. A) A PCA plot of four simulated heterogeneous populations. B) The significance tree after applying the CCMT procedure. Note that the tree is colored by significance and by whether a test was performed at a node.

Figure 3.6: Overview of how multimodality testing and the CCMT procedure can be integrated into a pipeline. If the test for clusterability returns evidence for cluster structure, the CCMT procedure is applied and the clusters are returned. Otherwise, there is the possibility of a single cell population, or the cells may lie on a continuum.

Figure 3.7: An illustration of the ensemble method for combining significant partitions across normalisation methods. An ensemble is generated using the soft least squares consensus partition method (DWH) implemented in the Clue package.

Figure 3.8: An illustration of the multiview method for combining significant partitions. A set of common principal components (CPCS) is first computed for all three normalisation methods. These CPCS are then used for significant clustering.

Figure 3.9: An illustration of simulation setup 1. A total of 18 simulated datasets were generated.

3.7.2 Simulating cluster separability

The second simulation setup assessed the ability to detect varying degrees of separation between clusters. Here, the number of cells and genes was fixed at 1000 and 5000, respectively. The number of clusters was fixed at 8, with relatively balanced sizes. The separability between clusters was then varied from no separation at all to well separated. To control cluster separability, the probability for a gene to be differentially expressed between clusters was varied by generating 50 values in the range [0, 0.5]. Figure 3.11 shows illustrations of both extremes.
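The uneven cluster proportions described for simulation setup 1 can be generated by normalizing uniform draws; a small Python sketch (Splatter itself receives such proportions through its group probability argument):

```python
import random

def uneven_proportions(k, seed=1):
    """Draw k cluster proportions p_1, ..., p_k that sum to 1 by normalizing
    independent uniform draws (one simple way to realize the description in
    simulation setup 1)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(k)]
    total = sum(u)
    return [x / total for x in u]

# Proportions for a 4-cluster dataset, then cells per cluster for 5000 cells.
props = uneven_proportions(4)
cells_per_cluster = [round(p * 5000) for p in props]
```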
Separability values close to 0 produce clusters with low separability, and larger values produce clusters with higher separability. This setup generated a total of 50 datasets. See Figures 3.10 and 3.11 for an illustration of this simulation setup.

Figure 3.10: An illustration of simulation setup 2. A total of 50 simulated datasets were generated.

Figure 3.11: An illustration of the possible ranges of cluster separability in 4 simulated clusters. A) A simulated dataset with no cluster separability. B) A simulated dataset with low cluster separability. C) A simulated dataset with intermediate cluster separability. D) A simulated dataset with high cluster separability.

3.7.3 Simulating data sparsity

In the third simulation setup, we assessed our method's ability to handle data sparsity, or increasing proportions of missing information. Again, the number of cells and genes is held constant at 5000 and 1000, respectively. The number of clusters remained at 8, with balanced sizes. To control the proportion of sparsity in the generated datasets, the parameter (dropout.mid) controlling the rate of zero counts in the logistic function that generates the counts in the Splatter package was varied over the range {0, 1, 2, 3, 4, 5} to generate datasets ranging from 20% to 90% sparsity. This setup generated a total of 6 datasets. See Figure 3.12 for an illustration. Figure 3.13 shows the possible ranges of data sparsity being considered and how sparsity affects the datasets. Notice that as sparsity increases, the separability between the clusters decreases.

Figure 3.12: An illustration of simulation setup 3. A total of 6 simulated datasets were generated.

3.8 Benchmarking data

A set of published scRNA-seq datasets frequently used for evaluating scRNA-seq pipelines was used for evaluation as well. A majority of these datasets were obtained from the Hemberg lab GitHub repository.
The Peripheral Blood Mononuclear Cells (PBMCS) datasets were obtained from the 10X Genomics website. These datasets vary in the number of cells, number of clusters, cluster separability, sparsity, and overall complexity. They are also derived from multiple organisms, including humans and mice, as well as multiple tissues, including blood, pancreas, brain, spleen, and retina.

The datasets used are formally described below. They fall into two groups: the small scale datasets and the medium to large scale datasets. The small scale datasets are those with fewer than 2500 cells; they were used to benchmark CCMT against other scRNA-seq methods that do not scale well to larger datasets. The medium to large scale datasets are those with more than 2500 cells; they were used to benchmark CCMT against other scRNA-seq methods that are well equipped to handle reasonably large datasets.

Figure 3.13: An illustration of the possible ranges of data sparsity in 4 simulated clusters. Sparsity is defined as the fraction of genes with 0 expression values in each cell. A) A simulated dataset with 20-30% sparsity. B) A simulated dataset with 25-35% sparsity. C) A simulated dataset with 50-70% sparsity. D) A simulated dataset with 80-90% sparsity.

3.8.1 Small scale datasets

Seven small scale datasets, described in Table 3.1, were used for benchmarking. For the CCMT procedure, these datasets were processed as described in Chapter 3.1. For the other methods, preprocessing was done using the default settings provided by the methods.
We note that with improvements in sequencing technology and reduced sequencing costs, it is now possible to profile hundreds of thousands more cells than considered in the small scale datasets. We decided to include these datasets for completeness and to provide a means for testing against other methods with computational bottlenecks such as an increasing number of cells.

Dataset           NumGenes   NumCells   NumPopulation   Sparsity
Gold [33]         58,302     925        3               0.85
BaronMouse [8]    14,878     1,886      13              0.89
Tasic [105]       24,150     1,679      18              0.69
Muraro [83]       19,127     2,126      10              0.73
Wang [114]        19,950     635        8               0.70
Xin [118]         39,851     1,600      8               0.88
Li [66]           55,186     561        9               0.79

Table 3.1: Small scale datasets used.

3.8.2 Medium to large scale datasets

Ten medium to large scale datasets, described in Table 3.2, were used for benchmarking. For the CCMT procedure, these datasets were processed as described in Chapter 3.1. For the other methods, preprocessing was done using the default settings.

Dataset           NumGenes   NumCells   NumPopulation   Sparsity
Silver5 [33]      17,043     8,352      11              0.96
Segerstolpe [96]  25,525     3,514      15              0.82
Klein [61]        24,175     2,717      4               0.66
Zheng [126]       15,568     3,994      4               0.97
Chen [21]         23,284     14,437     47              0.93
HMS [63]          28,962     11,127     9               0.97
Zeisel [124]      19,972     3,005      9               0.81
Romanov [91]      24,341     2,881      7               0.88
BaronHuman [8]    20,125     8,569      14              0.91
Shekar [97]       13,166     27,499     19              0.93

Table 3.2: Medium to large scale datasets.

3.9 Obtaining ground truth clusters

For both the small and medium to large scale datasets used in this work, the cell labels provided by their respective authors are used as ground truth. We are aware that intrinsic difficulties exist in defining ground truth cell labels when evaluating clustering or classification methods, owing to the existence of multiple biologically plausible and interpretable ways of clustering a scRNA-seq dataset, each representing relevant signals. The datasets used in this work were clustered with existing algorithms, and cell type labels were assigned using domain expertise. Therefore, we acknowledge that there is a risk of inherent bias in favor of the clustering method used to compute these cell type labels when making comparisons.

3.10 Evaluation metrics

To benchmark clustering solutions across methods, we use the Adjusted Rand Index (ARI). This metric has been routinely used in the clustering domain to evaluate how similar two clustering solutions are; it can also be used to compare how similar a clustering solution is to some known ground truth. The ARI score equals 1 for two identical clustering solutions, is close to 0 for unrelated solutions, and can be slightly negative for solutions that agree less than expected by chance.

Let D be a set containing n points. Denote a clustering of D as C, a set of non-overlapping and non-empty subsets $C_1, \ldots, C_k$. Now denote another clustering of D as the set C′ of non-overlapping and non-empty subsets $C'_1, \ldots, C'_m$. Using C and C′, we can create a k-by-m contingency table T such that $T_{ij}$ is the size of the intersection of $C_i$ and $C'_j$.

We can then formally define the ARI score as:

$$\mathrm{ARI}(C, C') = \frac{\sum_{i=1}^{k}\sum_{j=1}^{m}\binom{T_{ij}}{2} - u_3}{\frac{1}{2}(u_1 + u_2) - u_3} \qquad (3.9)$$

where $u_1 = \sum_{i=1}^{k}\binom{|C_i|}{2}$, $u_2 = \sum_{j=1}^{m}\binom{|C'_j|}{2}$, and $u_3 = \frac{2\,u_1 u_2}{n(n-1)}$.

3.11 Clustering methods

There are many algorithms available to cluster scRNA-seq datasets. We selected three to benchmark on the small scale datasets and one to benchmark on the medium to large scale datasets. For the small scale datasets, we evaluated Seurat [17], SC3 [59], and SIMLR [113]. For the medium to large scale datasets, only Seurat was used. This provides a comparison of the CCMT procedure against methods that are currently used. Both SC3 and SIMLR were used only on the small scale datasets because they do not scale well to larger datasets. Seurat was chosen because it is currently one of the most widely used methods for clustering scRNA-seq data, and it also scales quite well to larger datasets.
For benchmarking, we evaluated clustering accuracy and consistency using the Adjusted Rand Index described in Section 3.10. We also compared the running time of all the clustering algorithms. The default parameters provided in the respective software packages were used for all clustering methods, including for data preprocessing. We note that more careful consideration of these parameters may provide different results.

3.12 Performance assessment

3.12.1 Simulation Studies

To assess the performance of multimodality testing under the various simulated conditions, we look at the p-values returned by the Dip test in each of the simulation studies. In the first simulation setup, all the datasets contain cluster structure; therefore multimodality testing should return a significant p-value for all simulations. For the second setup, which simulates cluster separability, we expect the Dip test to be sensitive even to very low cluster separability. For the third setup, which simulates data sparsity, all the datasets have a cluster structure; therefore, multimodality testing should return a significant p-value at all levels of sparsity.

For the CCMT procedure, these simulation studies provide a way to measure how well CCMT can recover the simulated partitions in each setup. The ARI score is used to measure how well the simulated partitions are recovered.

3.12.2 Positive control

The small scale and the medium to large scale datasets discussed earlier are used as positive control datasets. These datasets have been used routinely to evaluate new clustering algorithms for scRNA-seq data; therefore, there is an implicit assumption that these datasets exhibit significant cluster structure. These datasets also provide a more realistic example of the type of datasets often encountered.
It is expected that multimodality testing will return a significant p-value for all of these datasets.

3.12.3 Negative control

To test how multimodality testing behaves when there is a single cluster, we selected the Gold Standard dataset used in [33]. This dataset is composed of three different cell lines from human lung adenocarcinoma, cultured separately. To generate a set of three negative control datasets, we isolated each of the cell lines separately. See Figure 3.14 for an example of the isolated HCC827 cell line.

The same three isolated cell lines from the gold standard dataset were used as negative controls for the clustering methods, including the CCMT procedure. The ARI score was again used to evaluate how well the clustering methods recovered the single cluster present.

3.13 Run time assessment

All clustering methods, including the CCMT procedure, were run in the R programming language. To compare running times, the Microbenchmark R package was used. All methods were run with 12 threads and 32 GB of RAM on an Intel® Core™ i7-8750H CPU at 2.20 GHz. All timing measurements include preprocessing steps.

3.14 Summary

In summary, we proposed a method for assessing the level of cluster structure by testing the gene expression patterns between cells for multimodality. We also coupled multimodality testing with hierarchical clustering and discriminant analysis to estimate the number of clusters. We developed various simulation studies to assess the reliability of the methods, and presented the real datasets and methods used for benchmarking as well as the metrics used for performance evaluation.

Figure 3.14: PCA plots of the dataset used for negative controls. This example shows the HCC827 cells that were isolated and used as one of the negative controls for both multimodality testing and benchmarking the CCMT procedure. A) shows a PCA plot of the three cell lines.
B) shows a PCA plot of the isolated HCC827 cell line.

Chapter 4

Results

In this chapter, we summarize the results of multimodality testing and the CCMT procedure. In Section 4.1, the results for multimodality testing are presented. First we show the results of multimodality testing on the simulation studies discussed in Section 3.7, on the positive control datasets (Section 3.12.2), and on the negative control dataset (Section 3.12.3). Next, we show the results of the CCMT procedure applied to the simulation studies, as well as to the positive control datasets (Section 3.12.2) and the negative control dataset (Section 3.12.3). We then present comparative results of the CCMT procedure against the clustering methods discussed in Section 3.11. The comparative results include the ARI scores and running time assessments for both the small scale datasets (Section 3.8.1) and the medium to large scale datasets (Section 3.8.2).

To test the robustness and scalability of inferring clusterability through modality testing and of using the CCMT procedure to estimate the number of robust clusters, we generated the set of synthetic datasets discussed in Section 3.7. Each simulation setup was designed to simulate problems inherent to scRNA-seq data, including increasing data and cluster sizes, cluster separability, and data sparsity. For multimodality testing, we expect the Dip test to capture the simulated datasets' cluster structure with high accuracy. This implies that an insignificant p-value should be returned for datasets with no significant cluster structure, whereas a significant p-value should be returned for datasets with significant cluster structure.
For the CCMT procedure, we expect a high ARI score for all the simulation setups, implying that the procedure can recover the simulated partitions with high accuracy.

4.1 Evaluation of multimodality testing

4.1.1 Simulation studies

Tables 4.1, 4.2, and 4.3 and Figure 4.1 show the results of the clusterability analysis on all three simulation setups. The values in the tables and the heatmap are p-values obtained after running the modality testing pipeline.

The results for the first simulation setup, which assesses data scalability, show perfect performance across cluster sizes and numbers of cells. This implies that clusterability testing using the Dip test scales well, and that multimodality testing is not adversely affected by the relative sizes of the clusters present in the dataset. This is an essential feature since clusters are rarely perfectly balanced in scRNA-seq datasets. For the second simulation setup, which assesses cluster separability, all normalization methods perform reasonably well. This also implies that the Dip test is quite sensitive to cluster separability, since it can find significant cluster structure even when clusters show low separability. The results for the third simulation setup, which assesses the effect of data sparsity, are quite good as well.
Multimodality testing can detect cluster structure in the presence of high data sparsity.

Dataset                  Log          NegBinom     Multinom
5K Cells, 4 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 8 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 16 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16

Table 4.1: Table of p-values (p) obtained for multimodality testing done for simulating data scalability on balanced datasets.

Dataset                  Log          NegBinom     Multinom
5K Cells, 4 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 8 Clusters     p < 2e−16    p < 2e−16    p < 2e−16
5K Cells, 16 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
10K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 4 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 8 Clusters    p < 2e−16    p < 2e−16    p < 2e−16
15K Cells, 16 Clusters   p < 2e−16    p < 2e−16    p < 2e−16

Table 4.2: Table of p-values (p) obtained for multimodality testing done for simulating data scalability on unbalanced datasets.

Figure 4.1: Heatmap of p-values obtained for multimodality testing done for simulating cluster separability. The x-axis is the cluster separability (higher values indicate higher separability between clusters) and the y-axis is the normalisation method used. Note that p-values < 2e−16 were rounded to 0 for visualization purposes.

Dropout Rate   Log          NegBinom     Multinom
0              p < 2e−16    p < 2e−16    p < 2e−16
1              p < 2e−16    p < 2e−16    p < 2e−16
2              p < 2e−16    p < 2e−16    p < 2e−16
3              p < 2e−16    p < 2e−16    p < 2e−16
4              p < 2e−16    p < 2e−16    p < 2e−16
5              1            p < 2e−16    p < 2e−16

Table 4.3: Table of p-values (p) obtained for multimodality testing done when simulating data sparsity.
Higher dropout rate values indicate higher sparsity.

4.1.2 Positive control

Table 4.4 shows the results of multimodality testing applied to the benchmarking datasets used as positive controls. Multimodality testing finds significant evidence of cluster structure in all datasets.

4.1.3 Negative control

Figure 4.2 shows the result of multimodality testing applied to the isolated HCC827 cell line from the negative control dataset. No evidence of cluster structure (Dip p-value > 0.05) is returned by multimodality testing for any of the normalization methods. This shows that multimodality testing can correctly identify when there is no apparent cluster structure present. Similar results were obtained for the remaining two cell lines.

4.2 Factors affecting multimodality testing

4.2.1 The effect of data normalisation

Overall, multimodality testing performed well across all the simulation setups, the positive control datasets, and the negative control dataset. However, taking a more in-depth look at the simulation study results, we can make a few observations. Consider the second simulation setup, which addresses cluster separability; compared to the other two normalization methods, the log normalization method is not as

Dataset       Log          NegBinom     Multinom
BaronHuman    p < 2e−16    p < 2e−16    p < 2e−16
BaronMouse    p < 2e−16    p < 2e−16    p < 2e−16
Chen          p < 2e−16    p < 2e−16    p < 2e−16
Gold          p < 2e−16    p < 2e−16    p < 2e−16
HMS           p < 2e−16    p < 2e−16    p < 2e−16
Klein         p < 2e−16    p < 2e−16    p < 2e−16
Li            p < 2e−16    p < 2e−16    p < 2e−16
Maccosko      p < 2e−16    p < 2e−16    p < 2e−16
Muraro        p < 2e−16    p < 2e−16    p < 2e−16
Romanov       p < 2e−16    p < 2e−16    p < 2e−16
Segerstolpe   p < 2e−16    p < 2e−16    p < 2e−16
Shekar        p < 2e−16    p < 2e−16    p < 2e−16
Silver5       p < 2e−16    p < 2e−16    p < 2e−16
Tasic         p < 2e−16    p < 2e−16    p < 2e−16
Wang          p < 2e−16    p < 2e−16    p < 2e−16
Xin           p < 2e−16    p < 2e−16    p < 2e−16
Zeisel        p < 2e−16    p < 2e−16    p < 2e−16
Zheng         p < 2e−16    p < 2e−16    p < 2e−16

Table 4.4: Table of p-values (p) obtained for multimodality testing done for the benchmarking data.
The table contains both the small and medium tolarge scale sensitive to cluster separability. This can be seen in Figure 4.1, where a sig-nificant p-value is returned after separability value of 0.05. For reference, Figure3.11C is an example of how a dataset looks for separability value 0.03. In Figure3.11B, the clustering structure is quite obvious. As such, both the multinomial andnegative binomial normalization methods are able provide enough evidence for theDip test to find significant cluster structure evidence. However, the Dip test appliedto the log normalized data fails to find evidence for cluster structure. This impliesthat the log normalization typically done may not always be the best way of datanormalization. It may obscure possible cluster structure by failing to find highlyoverlapping clusters.Next, consider the third simulation setup, which assesses the effect data spar-sity; all normalization methods show good performance for low to moderate spar-51Figure 4.2: A) shows a PCA plot of the isolated HCC287 cells. B) shows thedensity of the of the cosine distances between the top k PCS for the lognormalisation method. C) shows the density of the of the cosine dis-tances between the top k PCS for the Negative Binomial (NegBinom)normalisation method. D) shows the density of the cosine distances be-tween the top k PCS for the Multinomial normalisation method. Similarresults not shown here were obtained for the remaining two cell lines.sity levels. See Table 4.3. However, for sparsity, > 90%, the Dip test applied tothe log normalized data fails to find evidence for cluster structure. The multinomialnormalized and negative binomial normalized data have no trouble dealing with in-creasing sparsity levels. As noted in Figure 3.13, as sparsity increases, the separa-bility between the clusters decreases. 
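The distance computation underlying these density plots can be made concrete. As a minimal sketch (in Python rather than the R used for the analyses, with `cells` standing in for each cell's coordinates on the top k PCs), the pairwise cosine distances whose density is tested for multimodality are:

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two vectors: 1 - cos(angle between them)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_cosine_distances(cells):
    """All upper-triangle pairwise distances between cells
    (each cell is represented by its top-k PC coordinates)."""
    out = []
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            out.append(cosine_distance(cells[i], cells[j]))
    return out
```

The Dip test is then applied to the density of these distances; that step relies on an existing implementation (e.g., the R diptest package) and is not re-implemented here. For two well-separated clusters, within-cluster distances concentrate near 0 and between-cluster distances near 1, producing the bimodal densities seen above.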
Since the log normalization method showed poor performance in the low separability cases of the second simulation setup, it is not surprising to observe similar performance at higher data sparsity. Again, this is a cause for concern, since log normalization is most often used for its statistical simplicity. Other normalization methods, such as the negative binomial or the multinomial, may be more appropriate.

4.2.2 The limitations of multimodality testing

The multimodality testing framework performed well across the simulation studies and the real benchmarking data, including the negative control datasets. However, to understand the situations where multimodality testing does not perform well, we constructed a fourth simulation setup, which combines the first two simulation setups from Section 3.7. To do this, the cluster separability was set to 0.3 (see Figure 3.11B), while we varied the size and proportions of cells in each cluster as done in simulation setup 2 (see Figure 3.9). See Figures 4.3 and 4.4 for an illustration and an example of this simulation setup.

Tables 4.5 and 4.6 show the results of multimodality testing applied to this simulation setup. Multimodality testing cannot capture the presence of cluster structure in all of the simulated datasets, whether balanced or unbalanced. With high cluster overlap and an increasing number of clusters, multimodality testing struggles to find evidence of cluster structure. This pattern is evident for both the balanced and the unbalanced datasets. For the balanced datasets (Table 4.5), the log normalization performs the worst while the multinomial normalization performs the best. For the unbalanced datasets (Table 4.6), all three normalization methods perform equally well. These results show that multimodality testing is limited when the cluster sizes are relatively balanced and highly overlapping.
However, for the unbalanced datasets, multimodality testing is not as limited, since it can capture the presence of cluster structure in over half of the datasets.

Dataset                 Log          Multinom     NegBinom
5K Cells, 4 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 8 Clusters    0.10         p < 2e-16    0.06
5K Cells, 16 Clusters   0.10         p < 2e-16    0.99
10K Cells, 4 Clusters   p < 2e-16    0.10         p < 2e-16
10K Cells, 8 Clusters   0.10         0.10         p < 2e-16
10K Cells, 16 Clusters  0.10         0.10         1
15K Cells, 4 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
15K Cells, 8 Clusters   0.10         0.10         1
15K Cells, 16 Clusters  0.10         0.10         1

Table 4.5: p-values (p) obtained from multimodality testing when simulating data scalability with high overlap on balanced datasets.

Figure 4.3: An illustration of simulation setup 4. A total of 18 simulated datasets were generated.

Figure 4.4: An illustration of the balanced and unbalanced datasets simulated. A) A simulated balanced dataset with 4 clusters. B) A simulated unbalanced dataset with 4 clusters. Each dataset was generated to have overlapping clusters.

Dataset                 Log          Multinom     NegBinom
5K Cells, 4 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 8 Clusters    p < 2e-16    p < 2e-16    p < 2e-16
5K Cells, 16 Clusters   0.10         p < 2e-16    1
10K Cells, 4 Clusters   p < 2e-16    0.10         p < 2e-16
10K Cells, 8 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
10K Cells, 16 Clusters  0.10         0.10         1
15K Cells, 4 Clusters   p < 2e-16    p < 2e-16    p < 2e-16
15K Cells, 8 Clusters   0.10         0.10         p < 2e-16
15K Cells, 16 Clusters  0.10         0.10         1

Table 4.6: p-values (p) obtained from multimodality testing when simulating data scalability with high overlap on unbalanced datasets.

4.3 Evaluation of the CCMT procedure

4.3.1 Simulation studies

Tables 4.7, 4.8, and 4.9 and Figure 4.5 show the results of the CCMT procedure applied to the three simulation setups. For the first simulation setup, the CCMT procedure does a good job of recovering the simulated ground truth partitions.
Both the ensemble (Figure 3.7) and multiview (Figure 3.8) models recover the simulated partitions well. The CCMT procedure is robust to the relative proportions of cluster sizes and to different numbers of clusters. Results on the second simulation setup show that the CCMT procedure is generally sensitive to cluster overlap. However, the ensemble method appears to be more sensitive and stable compared to the multiview model. The multiview model is not always able to fully recapture the simulated partitions when clusters overlap heavily. For example, in Figure 4.5 the clusters are relatively well separated at a separability value of 0.1 (see Figure 3.11C), yet the multiview model fails to perfectly recover the simulated partition (ARI score = 0.5; see Section 3.10 for a discussion of the ARI score). Results on the final simulation setup (Table 4.9) show that the CCMT procedure is also robust to increasing data sparsity: both the multiview and the ensemble models fully recapture the simulated clusters at increasing levels of data sparsity. Overall, the CCMT procedure performs well across all simulation studies.

4.3.2 Positive control

The CCMT procedure applied to the positive control benchmarking datasets shows good accuracy for most datasets. Table 4.10 and Figure 4.6 show a table and a boxplot of the ARI scores for all the benchmarking datasets. Figure 4.6 shows an average ARI score of 0.70, indicating substantial average agreement between the partitions recovered by the CCMT procedure and the ground truth across all datasets. The CCMT procedure also obtains a maximum ARI score of 0.99 and a minimum of 0.19. Both the ensemble and the multiview models show similar performance across all datasets; however, the ensemble model has more variability, as seen by the long tail in Figure 4.6. Figure 4.7 shows scatter plots of the ARI score as a function of the number of clusters predicted by the CCMT procedure for the benchmarking datasets.
In both plots, the red values indicate the correct number of clusters. In these plots, we judge performance by both the ARI scores and the predicted number of clusters. For a high ARI score, we would expect close agreement between the predicted number of clusters (x-axis) and the true number of clusters (red values). Both the ensemble and the multiview models tend to underestimate the number of clusters, and compared to the ensemble model, the multiview model generally returns a smaller number of clusters. However, both models, on average, return partitions with high overlap with the ground truth partitions, as seen by the high average ARI scores in Figures 4.7A and 4.7B.

4.3.3 Negative control

On the negative control datasets, the ensemble model of the CCMT procedure fails to return a single cluster for all of the isolated cell lines (Table 4.11). The multiview model, in contrast, returns a single cluster for two of the isolated cell lines (HCC827 and H2228). This implies that the ensemble model is more prone to over-partitioning the data than the multiview model; the multiview model is conservative when partitioning a dataset.
This is similar to what is observed for the predicted number of clusters on the positive control datasets.

Dataset                 Multiview ARI Score   Ensemble ARI Score
5K Cells, 4 Clusters    1                     1
5K Cells, 8 Clusters    1                     1
5K Cells, 16 Clusters   0.83                  1
10K Cells, 4 Clusters   1                     1
10K Cells, 8 Clusters   1                     1
10K Cells, 16 Clusters  0.8                   1
15K Cells, 4 Clusters   1                     1
15K Cells, 8 Clusters   0.81                  1
15K Cells, 16 Clusters  0.83                  1

Table 4.7: ARI scores obtained from the CCMT procedure when simulating data scalability on balanced datasets.

Dataset                 Multiview ARI Score   Ensemble ARI Score
5K Cells, 4 Clusters    1                     1
5K Cells, 8 Clusters    0.97                  1
5K Cells, 16 Clusters   0.87                  1
10K Cells, 4 Clusters   1                     0.98
10K Cells, 8 Clusters   0.99                  1
10K Cells, 16 Clusters  0.91                  1
15K Cells, 4 Clusters   1                     1
15K Cells, 8 Clusters   0.97                  1
15K Cells, 16 Clusters  0.9                   1

Table 4.8: ARI scores obtained from the CCMT procedure when simulating data scalability on unbalanced datasets.

4.4 Factors affecting the CCMT procedure

Overall, the CCMT procedure provides a robust and accurate way of finding significant partitions in a dataset. However, a few parameters may affect its performance, including the selection of highly informative genes and the number of principal components used when running the CCMT procedure.

Figure 4.5: Heatmap of the ARI scores obtained from the CCMT procedure applied when simulating cluster separability. The x-axis is the cluster separability generated (higher values indicate higher separability between clusters) and the y-axis is the model used to combine the partitions across the three normalisation methods.

Dropout Rate   Multiview ARI Score   Ensemble ARI Score
0              1                     1
1              1                     1
2              1                     1
3              1                     1
4              1                     1
5              0.98                  0.99

Table 4.9: ARI scores obtained from the CCMT procedure when simulating data sparsity.
Higher dropout rates indicate higher sparsity.

Dataset       Multiview ARI Score   Ensemble ARI Score
BaronHuman    0.89                  0.9
BaronMouse    0.92                  0.92
Chen          0.65                  0.64
Gold          0.99                  0.85
HMS           0.83                  0.84
Klein         0.83                  0.80
Li            0.59                  0.74
Maccosko      0.87                  0.9
Muraro        0.93                  0.92
Romanov       0.67                  0.19
Segerstolpe   0.58                  0.53
Shekar        0.51                  0.89
Silver5       0.55                  0.50
Tasic         0.29                  0.30
Wang          0.41                  0.49
Xin           0.89                  0.61
Zeisel        0.54                  0.75
Zheng         0.97                  0.7

Table 4.10: ARI scores obtained from the CCMT procedure applied to the benchmarking data. The table contains both the small and the medium to large scale datasets.

Cell line   Ensemble (number of clusters)   Multiview (number of clusters)
HCC827      4                               1
H2228       7                               1
H1975       5                               3

Table 4.11: The CCMT procedure applied to the negative control datasets.

Figure 4.6: Boxplots of the average ARI score across all benchmarking datasets (n = 18) for both the ensemble and multiview models.

4.4.1 The effect of the number of informative genes

In this thesis, we set the number of highly informative genes to 500. To test how this selection affects the performance of the CCMT procedure, we varied the number of highly informative genes from 500 to 2000 in increments of 500 for all the benchmarking datasets. Figure 4.8 shows the results of applying the CCMT procedure with a varying number of genes. There appears to be no significant increase in performance for the multiview model when increasing the number of genes; for the ensemble model, however, increasing the number of informative genes does increase overall performance. From Figure 4.8, we see that setting the number of informative genes to 500 achieves the highest average ARI score for the multiview model, while the opposite is observed for the ensemble model. In contrast, using 2000 informative genes results in the lowest overall ARI score for the multiview model.
Again, the opposite is observed for the ensemble model.

4.4.2 The effect of the number of PCs

In this thesis, we set the number of PCs by automatically finding the knee point of the PCA scree plot. To test the effect of the number of PCs selected for clustering, we varied the number of PCs from 5 to 25 in increments of 5. Figure 4.9 shows the results of applying the CCMT procedure with a varying number of PCs. For both the ensemble and multiview models, increasing the number of PCs negatively impacts performance, with the ensemble model more strongly affected than the multiview model. The performance of the CCMT procedure when using a heuristic to select the number of PCs (Figure 4.6) is much better than when the number of PCs is fixed. This is most likely because increasing the number of PCs does not necessarily increase the signal in the data: the extra PCs may introduce significantly more noise, which adversely affects the models' performance. This is most likely why the heuristic method works much better than fixing the number of PCs; the heuristic can identify the number of PCs to use based on each dataset's properties.

Figure 4.7: A) Scatter plot of the predicted number of clusters vs the ARI score for the ensemble model. The x-axis is the predicted number of clusters and the y-axis the ARI score. B) Scatter plot of the predicted number of clusters vs the ARI score for the multiview model. The x-axis is the predicted number of clusters and the y-axis the ARI score. For both scatter plots, the red values are the true number of clusters.

4.4.3 The limitations of the CCMT procedure

To assess the cases where the CCMT procedure significantly fails to recover ground truth partitions, we used the fourth simulation setup outlined in Section 4.2.2.
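Performance throughout these comparisons is reported as the adjusted Rand index (ARI) defined in Section 3.10. As a point of reference, a minimal Python implementation of the standard ARI formula (illustrative only, not the code used for the experiments) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells.
    1 = identical partitions (up to relabeling); ~0 = chance agreement."""
    n = len(truth)
    contingency = Counter(zip(truth, pred))
    sum_pairs = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

Because the index is corrected for chance, merging or splitting clusters relative to the ground truth lowers the score even when many cell pairs remain correctly grouped, which is why it is a stringent measure for the experiments below.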
Both the balanced and unbalanced datasets were passed to the CCMT procedure, and performance was assessed using the ARI score (Section 3.10) as before. Tables 4.12 and 4.13 show the ARI scores of the CCMT procedure applied to the balanced and unbalanced datasets. The results show that the CCMT procedure is limited in recovering the ground truth clusters when there is high cluster overlap and an increasing number of clusters. This trend is more pronounced in the balanced datasets than in the unbalanced datasets. For both the balanced and unbalanced datasets, the ensemble model (Section 3.6.4), on average, does a better job of recovering the ground truth partitions than the multiview model (Section 3.6.4).

Figure 4.8: Box plot of ARI scores for a varying number of genes for both the multiview and the ensemble models. The x-axis is the number of genes and the y-axis is the ARI score.

4.5 Comparing CCMT to other clustering methods

The small scale and the medium to large scale datasets were used to compare the CCMT procedure against other clustering methods often used when analysing scRNA-seq data. We compared the ARI score and the running time across all methods.

Figure 4.9: Box plot of ARI scores for a varying number of PCs for both the multiview and the ensemble models.
The x-axis is the number of PCs and the y-axis is the ARI score.

Dataset                 Ensemble ARI Score   Multiview ARI Score
5K Cells, 4 Clusters    0.97                 0.60
5K Cells, 8 Clusters    0.29                 0.17
5K Cells, 16 Clusters   0                    0
10K Cells, 4 Clusters   0                    0.55
10K Cells, 8 Clusters   0.08                 0.17
10K Cells, 16 Clusters  0                    0
15K Cells, 4 Clusters   0.98                 0.71
15K Cells, 8 Clusters   0.46                 0.18
15K Cells, 16 Clusters  0                    0

Table 4.12: ARI scores obtained from the CCMT procedure when simulating data scalability with high overlap on balanced datasets.

Dataset                 Ensemble ARI Score   Multiview ARI Score
5K Cells, 4 Clusters    0.94                 0.94
5K Cells, 8 Clusters    0.88                 0.40
5K Cells, 16 Clusters   0                    0.09
10K Cells, 4 Clusters   0.86                 0.82
10K Cells, 8 Clusters   0.12                 0.23
10K Cells, 16 Clusters  0.03                 0.08
15K Cells, 4 Clusters   0.76                 0.75
15K Cells, 8 Clusters   0.29                 0.29
15K Cells, 16 Clusters  0.02                 0.07

Table 4.13: ARI scores obtained from the CCMT procedure when simulating data scalability with high overlap on unbalanced datasets.

4.5.1 Small scale datasets

For the small scale datasets, we compared the CCMT procedure against Seurat, SIMLR, and SC3. SIMLR and SC3 were used only on the small scale datasets because these methods have high computational complexity and thus do not scale well to larger datasets. Figure 4.10 shows boxplots of the ARI score and the running time of the methods applied to the small scale datasets. The running time (Figure 4.10A) was measured in nanoseconds, and values are presented on a log scale. Running time varied substantially between the methods: Seurat was the fastest and SIMLR the slowest, with the CCMT models (ensemble and multiview) second and third fastest, respectively, and SC3 fourth. For the ARI score versus running time (Figure 4.10B), both CCMT models have the highest overall average ARI score. Seurat has the second-highest overall average, with SC3 coming third and SIMLR last.
Even though Seurat has the fastest running time, it is not as accurate as the CCMT procedure. The CCMT models may be slower than Seurat, but they are better able to recover ground truth partitions.

4.5.2 Medium to large scale datasets

For the medium to large scale datasets, we compared the CCMT models against Seurat. Seurat was used for these datasets because of its speed and accuracy: it scales quite well to larger datasets, which made comparisons to CCMT easier and more efficient. Figure 4.11 shows boxplots of the ARI score and the running time of the methods applied to the medium to large scale datasets. On these datasets, Seurat is again the fastest method, and both CCMT models again have similar running times. For the ARI score (Figure 4.11B), the CCMT models again have the highest overall average ARI score. Once more, the CCMT models do a better job of recovering the ground truth partitions than currently used methods.

Figure 4.10: A) Box plot of the running times of the methods on the small scale datasets (n = 7). The x-axis is the method used and the y-axis the natural log of the computational time in nanoseconds. B) Box plot of running time vs ARI score on the small scale datasets (n = 7). The x-axis is the method used and the y-axis the ARI score.

4.6 Summary

To summarize, through simulation studies and real data, we showed that multimodality testing is able to infer cluster structure accurately. This method is also robust and sensitive to cluster size, separability, and data sparsity.

Figure 4.11: A) Box plot of the running times of the methods on the medium to large scale datasets (n = 11). The x-axis is the method used and the y-axis the natural log of the computational time in nanoseconds. B) Box plot of running time vs ARI score on the medium to large scale datasets (n = 11). The x-axis is the method used and the y-axis the ARI score.

We also showed that multimodality testing performs well on both the negative and positive control datasets.
Further, we showed that for the CCMT procedure, both the multiview and the ensemble models work well across all the simulation studies, as measured by the ARI score. On real datasets, we compared both the computational time and the ARI score of the CCMT procedure to well-known and often-used methods, and showed that it is faster than some of the current methods and, on average, more accurate. Finally, to see the effects of the number of genes and PCs on the CCMT procedure, we performed an experiment in which the number of genes and PCs were varied. The results showed that increasing the number of genes generally increased the overall performance of the CCMT procedure, whereas increasing the number of PCs decreased it.

Chapter 5: Conclusions

5.1 Summary

In this thesis, we developed a method that assesses the cluster structure inherent to a dataset. We used multimodality testing coupled with three different data normalization and gene selection methods. The cosine distance was used to compute the distribution of gene expression distances between cells, and the Dip test was used to test this distribution for multimodality. The cosine distance was chosen because it is computationally efficient, and because it showed higher average accuracy and sensitivity than other distance metrics, such as Euclidean and Manhattan, on simulation studies and real data. Next, we used extensive simulation studies to show that this method is robust to the challenges inherent to scRNA-seq datasets: high cluster overlap, high sparsity, an increasing number of clusters, and increasing data size. Using real datasets as positive and negative controls, we showed that this method performs as expected both in the presence and in the absence of cluster structure.

The second method developed in this thesis addressed finding the number of clusters and returning the clusters themselves.
To do this, we developed the CCMT procedure, which couples multimodality testing with hierarchical clustering and discriminant analysis. The CCMT procedure assumes that clusters are derived from a unimodal distribution. This assumption makes the procedure flexible enough to accommodate datasets with different distributions, since many known distributions have a unimodal variant. Using extensive simulations, we showed that the CCMT procedure is able to recover simulated clusters accurately. It is also sensitive to cluster overlap, meaning that it can detect clusters even when they are highly overlapping. Using real datasets as positive and negative controls, we demonstrated the CCMT procedure's ability to accurately recover ground truth partitions. We separated the real datasets into two groups and used them to benchmark the CCMT procedure against other methods currently used to cluster scRNA-seq datasets. We showed that for both the small and the medium to large scale datasets, the CCMT procedure is more accurate than the other methods. It is, however, slower than Seurat.

In the last part of the thesis, we attempted to understand the factors affecting multimodality testing and the CCMT procedure. We showed that the log normalization method is less sensitive than the other two when assessing cluster structure. For the CCMT procedure, we showed that increasing the number of highly informative genes increases the ensemble model's overall performance, while the multiview model's performance remains relatively constant. We also showed that holding the number of PCs constant for all datasets negatively affects both the ensemble and multiview models' performance. Lastly, we showed that multimodality testing and the CCMT procedure are limited in situations with an increasing number of clusters and high cluster overlap.
However, this effect is more pronounced when the cluster sizes are relatively balanced compared to when they are unbalanced.

5.2 Discussion

Notably, for multimodality testing, the log normalization method proved to be less sensitive in the simulation studies. There could be a few reasons for this. Firstly, the other two normalization methods may be more similar to the simulation mechanism used in the Splatter package, which uses a gamma-Poisson model to simulate molecular counts. Since both the negative binomial and the multinomial distributions can be approximated by a Poisson distribution, both of these normalization methods would be expected to perform better on simulated data. However, this is not seen on real data, for either the positive or the negative controls. The data generating mechanism for the benchmarking datasets may not be completely Poisson, and there may be enough differences in gene expression patterns between cells in each of the benchmarking datasets for all the normalization methods to detect, resulting in significant evidence of cluster structure.

Multimodality testing is no stranger to the scRNA-seq domain; it has been used multiple times when analyzing scRNA-seq datasets. In [6], the authors used the Dip test to show the continuous nature of the distribution of T-cell activation states. The Dip test was also used in [35] to show that separation between cells could consistently be found when the cells are represented as a time series. However, to the best of our knowledge, this is the first time that multimodality testing has been used to the extent shown in this work.

We investigated the effect of an increased number of genes on the performance of the CCMT procedure. The results showed that increasing the number of genes positively impacted the ensemble model and had no significant impact on the multiview model (see Figure 4.8).
This difference is partly due to how the clustering results are combined in the ensemble and the multiview models. For the ensemble model, a clustering solution is generated independently for each normalization method, and these clustering results are then combined to form an ensemble. The multiview model instead combines the results from all three normalization methods before generating a single clustering solution. By default, we select the top 500 most informative genes, so the multiview model has an upper bound of 1500 on the total number of genes used when clustering, reached when there is no overlap between the most informative genes across the normalization methods. Because the multiview model already draws on a larger gene pool before clustering, we see no significant performance increase from adding genes. In contrast, the ensemble model is limited to 500 genes in each normalization method, so increasing the number of genes improves each of the independent clustering solutions before the ensemble is generated, which in turn improves the ensemble clusters.

We also investigated the effect of an increased number of PCs on the performance of the CCMT procedure. The results showed that increasing the number of PCs had a negative impact on both the ensemble and multiview models (see Figure 4.9). This decrease in performance is largely due to additional PCs decreasing the signal to noise ratio in the dataset, which negatively impacts clustering. We currently select the number of PCs for each dataset by automatically finding the knee point of the PCA scree plot: the PCs are first sorted in decreasing order, and the knee is the point with the largest distance to the line joining the first and last points of the scree plot.
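This knee point heuristic can be sketched in a few lines. The Python sketch below assumes that "largest distance" means the largest perpendicular distance to the line joining the first and last scree points; the actual implementation may differ in detail:

```python
import math

def knee_point(scree):
    """Index of the knee of a decreasing scree plot: the point with the
    largest perpendicular distance to the line joining the first and
    last points of the plot."""
    n = len(scree)
    x1, y1, x2, y2 = 0.0, scree[0], float(n - 1), scree[-1]
    denom = math.hypot(x2 - x1, y2 - y1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(scree):
        # standard point-to-line distance formula
        d = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1) / denom
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# e.g. variance explained by successive PCs, already sorted decreasing
scree = [10.0, 6.0, 3.0, 1.5, 1.4, 1.3, 1.2, 1.1]
k = knee_point(scree)  # PCs 0..k are retained for clustering
```

The appeal of this heuristic is that the retained dimensionality adapts to the shape of each dataset's scree plot rather than being fixed in advance.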
Since each dataset has different characteristics and thus behaves differently, the knee point method is able to take this into account, and the CCMT procedure's results when using the knee point method are better than when the number of PCs is held constant for every dataset.

The CCMT procedure relies heavily on the projection of two classes onto a line that best separates their centers. The Fisher discriminant function used to compute the projections assumes that the classes are linearly separable, meaning that a single straight line can separate both classes. A straight line may not always be capable of separating the classes; if the classes are not linearly separable, the projection may fail, causing the CCMT procedure to fail in turn. We did not explore this topic in this work because there is currently no established way to fully simulate scRNA-seq datasets with classes that are not linearly separable. The results of the CCMT procedure on the benchmarking data justify our assumption that the clusters or cell types present can be separated using the Fisher linear discriminant function.

A critical decision when clustering scRNA-seq data is how many cell types to identify. There is generally no accepted way of choosing an adequate number of clusters or cell types when clustering scRNA-seq datasets; it depends on the resolution at which the user wants to view the dataset. A smaller number results in fewer, more distinct cell types, while a larger number results in less distinct cell types. This work serves to provide some automated guidance on this issue, and we hope it will relieve some of the difficulty associated with deciding how many possible cell types are present in a dataset. The CCMT procedure developed in this work tends to underestimate the correct number of clusters.
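As a concrete illustration of the projection step discussed above, the two-class Fisher direction w proportional to Sw^-1 (m1 - m2) can be computed in closed form for 2-D inputs. The Python sketch below adds a small `ridge` term for numerical stability; it illustrates the idea rather than reproducing the thesis implementation:

```python
def mean(points):
    """Componentwise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def fisher_direction_2d(class_a, class_b, ridge=1e-6):
    """Two-class Fisher discriminant direction w = Sw^-1 (m_a - m_b)
    for 2-D points; `ridge` keeps the scatter matrix invertible."""
    ma, mb = mean(class_a), mean(class_b)
    sxx = syy = sxy = 0.0
    for pts, m in ((class_a, ma), (class_b, mb)):
        for p in pts:  # accumulate the pooled within-class scatter Sw
            dx, dy = p[0] - m[0], p[1] - m[1]
            sxx += dx * dx
            syy += dy * dy
            sxy += dx * dy
    sxx += ridge
    syy += ridge
    det = sxx * syy - sxy * sxy
    dmx, dmy = ma[0] - mb[0], ma[1] - mb[1]
    # w = Sw^-1 (m_a - m_b), using the closed-form 2x2 inverse
    return ((syy * dmx - sxy * dmy) / det,
            (-sxy * dmx + sxx * dmy) / det)

def project(points, w):
    """Project each point onto the discriminant direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]
```

Well-separated classes project onto disjoint intervals of the line, whereas classes that are not linearly separable can end up interleaved after projection, which is exactly the failure mode discussed above.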
However, based on the ARI scores, even though the CCMT procedure underestimates the correct number of clusters, the clusters it returns show high agreement with the ground truth.

5.3 Future Work

Much of this work depends on computing distances between cells, which can become computationally expensive for relatively large datasets. A possible future direction for handling very large datasets would be to use machine learning methods to reduce computational costs. For the CCMT procedure, it would be possible to use a subset of the dataset to train a model and then use it to predict the cluster assignments of the remaining cells. One approach would be to randomly sample the dataset and run the multimodality testing algorithm on the random sample to assess clusterability; if a large enough sample size is chosen, it should capture the general cluster structure present in the dataset.

Another avenue for future work is integrating a non-linear projection method. Currently, the Fisher discriminant coordinates projection method is used for projecting the classes. However, this is a linear projection method and may fail when the classes are not linearly separable. A kernelized version of the Fisher projection method, as discussed in [9, 67, 80], can handle cases where there is no clear linear separation between the classes. Combining both the linear and the kernel versions of this projection method would make the CCMT procedure more well rounded and more flexible in handling different datasets.

Bibliography

[1] M. Ackerman, A. Adolfsson, and N. Brownstein. An effective and efficient approach for clusterability evaluation, 2016. → page 29

[2] A. Adolfsson, M. Ackerman, and N. C. Brownstein. To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognition, 88:13–26, Apr 2019. ISSN 0031-3203. doi:10.1016/j.patcog.2018.10.026. → pages 1, 7, 13, 29

[3] J. Ameijeiras-Alonso, R. Crujeiras, and A. Casal.
Mode testing, criticalbandwidth and excess mass. TEST, 09 2016.doi:10.1007/s11749-018-0611-5. → pages 15, 16, 29[4] S. Anders and W. Huber. Differential expression analysis for sequencecount data. Nature Precedings, 5, 04 2010. doi:10.1038/npre.2010.4282.2.→ page 3[5] T. S. Andrews, V. Yu. Kiselev, and M. Hemberg. Statistical Methods forSingle-Cell RNA-Sequencing, chapter 26, pages 735–20. John Wiley Sons,Ltd, 2019. ISBN 9781119487845. doi:10.1002/9781119487845.ch26.URL →page 1[6] E. Azizi, A. Carr, G. Plitas, A. Cornish, C. Konopacki, S. Prabhakaran,J. Nainys, K. Wu, V. Kiseliovas, M. Setty, K. Choi, R. Fromme, P. Dao,P. McKenney, R. Wasti, K. Kadaveru, L. Mazutis, A. Rudensky, andD. Pe’er. Single-cell map of diverse immune phenotypes in the breasttumor microenvironment. Cell, 174, 06 2018.doi:10.1016/j.cell.2018.05.060. → page 69[7] R. Bacher, L.-F. Chu, N. Leng, A. Gasch, J. Thomson, R. Stewart,M. Newton, and C. Kendziorski. Scnorm: robust normalization of72single-cell rna-seq data. Nature Methods, 14, 04 2017.doi:10.1038/nmeth.4263. → pages 1, 3[8] M. Baron, A. Veres, S. Wolock, A. Faust, R. Gaujoux, A. Vetere, J. Ryu,B. Wagner, S. Shen-Orr, A. Klein, D. Melton, and I. Yanai. A single-celltranscriptomic map of the human and mouse pancreas reveals inter- andintra-cell population structure. Cell systems, 3:346–360, 10 2016.doi:10.1016/j.cels.2016.08.011. → page 42[9] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernelapproach. Neural computation, 12:2385–404, 11 2000.doi:10.1162/089976600300014980. → page 71[10] S. Ben-David, U. Luxburg, and D. Pa´l. A sober look at clustering stability.pages 5–19, 06 2006. doi:10.1007/11776420 4. → page 17[11] V. Bergen, M. Lange, S. Peidli, F. Wolf, and F. Theis. Generalizing rnavelocity to transient cell states through dynamical modeling. 10 2019.doi:10.1101/820936. → page 6[12] J. Bezdek and R. Hathaway. Vat: A tool for visual assessment of (cluster)tendency. volume 3, pages 2225 – 2230, 02 2002. 
ISBN 0-7803-7278-6. doi:10.1109/IJCNN.2002.1007487. → page 11

[13] S. Bickel and T. Scheffer. Multi-view clustering. pages 19–26, 12 2004. ISBN 0-7695-2142-8. doi:10.1109/ICDM.2004.10095. → pages 5, 34

[14] H.-H. Bock. On some significance tests in cluster analysis. Journal of Classification, 2:77–108, 12 1985. doi:10.1007/BF01908065. → page 18

[15] P. Brennecke, S. Anders, J. Kim, A. A. Kolodziejczyk, X. Zhang, V. Proserpio, B. Baying, V. Benes, S. Teichmann, J. Marioni, and M. Heisler. Accounting for technical noise in single-cell rna-seq experiments (vol 10, pg 1093, 2013). Nature Methods, 11:210–210, 02 2014. doi:10.1038/nmeth0214-210b. → page 27

[16] C. Burdziak, E. Azizi, S. Prabhakaran, and D. Pe’er. A nonparametric multi-view model for estimating cell type-specific gene regulatory networks. 02 2019. → page 5

[17] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 2018. ISSN 1087-0156. doi:10.1038/nbt.4096. → pages 4, 6, 25, 27, 43

[18] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3:1–27, 01 1974. doi:10.1080/03610927408827101. → page 17

[19] R. Cannoodt, W. Saelens, and Y. Saeys. Computational methods for trajectory inference from single-cell transcriptomics. European Journal of Immunology, 46, 09 2016. doi:10.1002/eji.201646347. → page 6

[20] G. Chao, S. Sun, and J. Bi. A survey on multi-view clustering. 12 2017. → pages 5, 34

[21] R. Chen, X. Wu, L. Jiang, and Y. Zhang. Single-cell rna-seq reveals hypothalamic cell diversity. Cell Reports, 18:3227–3241, 03 2017. doi:10.1016/j.celrep.2017.03.004. → page 42

[22] M. Delmans and M. Hemberg. Discrete distributional differential expression (d3e) - a tool for gene expression analysis of single-cell rna-seq data. BMC Bioinformatics, 17, 12 2016. doi:10.1186/s12859-016-0944-6. → page 6

[23] P. Diggle.
The Statistical Analysis of Spatial Point Patterns. 01 2003. → page 12

[24] J. Ding, A. Condon, and S. Shah. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications, 9, 12 2018. doi:10.1038/s41467-018-04368-5. → page 12

[25] D. duVerle, S. Yotsukura, S. Nomura, H. Aburatani, and K. Tsuda. Celltree: An r/bioconductor package to infer the hierarchical structure of cell populations from single-cell rna-seq data. BMC Bioinformatics, 17:363, 09 2016. doi:10.1186/s12859-016-1175-6. → page 5

[26] A. Duò, M. Robinson, and C. Soneson. A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research, 7:1141, 09 2018. doi:10.12688/f1000research.15666.2. → page 27

[27] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. volume 96, pages 226–231, 01 1996. → page 4

[28] K. Etemad. Discriminant analysis for recognition of human. 11 2003. → page 31

[29] F. Faridafshin, B. Grechuk, and A. Naess. Calculating exceedance probabilities using a distributionally robust method. Structural Safety, 67:132–141, 07 2017. doi:10.1016/j.strusafe.2017.02.003. → page 14

[30] G. Finak, A. McDavid, M. Yajima, J. Deng, V. Gersuk, A. Shalek, C. Slichter, H. Miller, M. McElrath, M. Prlic, P. Linsley, and R. Gottardo. Mast: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data. Genome Biology, 16, 12 2015. doi:10.1186/s13059-015-0844-5. → page 6

[31] N. Fischer, E. Mammen, and J. Marron. Testing for multimodality. Computational Statistics & Data Analysis, 18:499–512, 12 1994. doi:10.1016/0167-9473(94)90080-9. → page 13

[32] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 01 1936. → page 31

[33] S. Freytag, L. Tian, I. Lönnstedt, M. Ng, and M. Bahlo.
Comparison of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data. F1000Research, 7:1297, 08 2018. doi:10.12688/f1000research.15809.1. → pages 42, 45

[34] T. Geddes, T. Kim, L. Nan, J. Burchfield, J. Yang, D. Tao, and P. Yang. Autoencoder-based cluster ensembles for single-cell rna-seq data analysis. 09 2019. doi:10.1101/773903. → page 4

[35] W. Gong, I.-Y. Kwak, N. Koyano-Nakagawa, W. Pan, and D. Garry. Tcm visualizes trajectories and cell populations from single cell data. Nature Communications, 9, 12 2018. doi:10.1038/s41467-018-05112-9. → page 69

[36] D. Grün, L. Kester, and A. Oudenaarden. Validation of noise models for single-cell transcriptomics. Nature Methods, 11, 04 2014. doi:10.1038/nmeth.2930. → page 1

[37] D. Grün, M. Muraro, J.-C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, J. Es, E. Jansen, H. Clevers, E. de Koning, and A. Oudenaarden. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell, 19, 06 2016. doi:10.1016/j.stem.2016.05.010. → page 3

[38] M. Guo, H. Wang, S. Potter, J. Whitsett, and Y. Xu. Sincera: a pipeline for single-cell rna-seq profiling analysis. PLOS Computational Biology, 11:e1004575, 11 2015. doi:10.1371/journal.pcbi.1004575. → page 21

[39] C. Hafemeister and R. Satija. Normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. Genome Biology, 20, 12 2019. doi:10.1186/s13059-019-1874-1. → pages 25, 26, 27, 28

[40] A. Haque, J. Engel, S. Teichmann, and T. Lönnberg. A practical guide to single-cell rna-sequencing for biomedical research and clinical applications. Genome Medicine, 9, 12 2017. doi:10.1186/s13073-017-0467-4. → page 2

[41] J. Hartigan. Asymptotic distributions for clustering criteria. Annals of Statistics, 6, 01 1978. doi:10.1214/aos/1176344071. → page 18

[42] J. Hartigan and P. Hartigan. The dip test of unimodality. The Annals of Statistics, 13, 03 1985. doi:10.1214/aos/1176346577.
→ pages 15, 24, 29

[43] J. Hartigan and S. Mohanty. The runt test for multimodality. Journal of Classification, 9:63–70, 02 1992. doi:10.1007/BF02618468. → page 15

[44] E. Helgeson and E. Bair. Non-parametric cluster significance testing with reference to a unimodal null distribution. 10 2016. → pages 18, 20, 21

[45] S. Hicks, F. W. Townes, M. Teng, and R. Irizarry. Missing data and technical variability in single-cell rna-sequencing experiments. Biostatistics (Oxford, England), 19, 11 2017. doi:10.1093/biostatistics/kxx053. → page 1

[46] S. Hicks, F. W. Townes, M. Teng, and R. Irizarry. Missing data and technical variability in single-cell rna-sequencing experiments. Biostatistics (Oxford, England), 19, 11 2017. doi:10.1093/biostatistics/kxx053. → page 25

[47] B. Hopkins and J. Skellam. A new method for detecting the type of distribution of plant individuals. Annals of Botany, 18, 04 1954. doi:10.1093/oxfordjournals.aob.a083391. → page 12

[48] H. Huang, Y. Liu, M. Yuan, and J. Marron. Statistical significance of clustering using soft thresholding. Journal of Computational and Graphical Statistics, 24, 05 2013. doi:10.1080/10618600.2014.948179. → page 19

[49] J. Huband, J. Bezdek, and R. Hathaway. Bigvat: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38:1875–1886, 11 2005. doi:10.1016/j.patcog.2005.03.018. → page 11

[50] R. Huh, Y. Yang, Y. Jiang, Y. Shen, and Y. Li. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Research, 48(1):86–95, 11 2019. ISSN 0305-1048. doi:10.1093/nar/gkz959. → page 4

[51] T. Ilicic, J. Kim, A. A. Kolodziejczyk, F. Bagger, D. McCarthy, J. Marioni, and S. Teichmann. Classification of low quality cells from single-cell rna-seq data. Genome Biology, 17, 12 2016. doi:10.1186/s13059-016-0888-1. → page 2

[52] A. Jain and R. Dubes. Algorithms for Clustering Data, volume 32. 01 1988. doi:10.2307/1268876. → pages 12, 13, 18

[53] Z. Ji and H. Ji.
Tscan: Pseudo-time reconstruction and evaluation in single-cell rna-seq analysis. Nucleic Acids Research, 44:gkw430, 05 2016. doi:10.1093/nar/gkw430. → pages 4, 6

[54] L. Jiang, H. Chen, L. Pinello, and G.-C. Yuan. Giniclust: Detecting rare cell types from single-cell gene expression data with gini index. Genome Biology, 17:144, 07 2016. doi:10.1186/s13059-016-1010-4. → page 4

[55] I. Jolliffe and J. Cadima. Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374:20150202, 04 2016. doi:10.1098/rsta.2015.0202. → pages 11, 12

[56] T. Kalisky and S. Quake. Single-cell genomics. Nature Methods, 8:311–4, 04 2011. doi:10.1038/nmeth0411-311. → page 2

[57] P. Kharchenko, L. Silberstein, and D. Scadden. Bayesian approach to single-cell differential expression analysis. Nature Methods, 11, 05 2014. doi:10.1038/nmeth.2967. → page 6

[58] P. Kimes, Y. Liu, D. Hayes, and J. Marron. Statistical significance for hierarchical clustering. Biometrics, 73, 11 2014. doi:10.1111/biom.12647. → pages 18, 19, 21

[59] V. Kiselev, K. Kirschner, M. Schaub, T. Andrews, A. Yiu, T. Chandra, K. Natarajan, W. Reik, M. Barahona, A. Green, and M. Hemberg. Sc3: consensus clustering of single-cell rna-seq data. 05 2017. doi:10.17863/CAM.9872. → pages 3, 5, 21, 44

[60] V. Kiselev, T. Andrews, and M. Hemberg. Publisher correction: Challenges in unsupervised clustering of single-cell rna-seq data. Nature Reviews Genetics, 20:1, 01 2019. doi:10.1038/s41576-019-0095-5. → page 2

[61] A. Klein, L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. Weitz, and M. Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161:1187–1201, 05 2015. doi:10.1016/j.cell.2015.04.044. → page 42

[62] M. Krzak, Y. Raykov, A. Boukouvalas, L. Cutillo, and C. Angelini. Benchmark and parameter sensitivity analysis of single-cell rna sequencing clustering methods.
Frontiers in Genetics, 10:1253, 12 2019. doi:10.3389/fgene.2019.01253. → page 35

[63] I. labs. Whole CD45+ splenocytes from B6 mice, 2019 (accessed 2020). URL cell/study/SCP306/whole-cd45-splenocytes-from-b6-mice-10x-hms#study-summarys. → page 42

[64] R. Lawson and P. Jurs. New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30:36–41, 02 1990. doi:10.1021/ci00065a010. → page 12

[65] I. Lengyel and P. Derish. Ripley, B. D. 1981. Spatial Statistics. John Wiley & Sons, New York. 09 2002. → page 12

[66] H. Li, E. Courtois, D. Sengupta, Y. Tan, K. Chen, J. Goh, S. Kong, C. Chua, L. Hon, W. S. Tan, M. Wong, P. Choi, L. Wee, A. Hillmer, I. Tan, P. Robson, and S. Prabhakar. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nature Genetics, 49, 03 2017. doi:10.1038/ng.3818. → page 42

[67] Y. Li, S. Gong, and H. Liddell. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21:1077–1086, 01 2004. doi:10.1016/j.imavis.2003.08.010. → page 71

[68] P. Lin, M. Troup, and J. Ho. Cidr: Ultrafast and accurate clustering through imputation for single-cell rna-seq data. Genome Biology, 18, 12 2017. doi:10.1186/s13059-017-1188-0. → page 5

[69] Y. Liu, D. Hayes, A. Nobel, and J. Marron. Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association, 103:1281–1293, 09 2008. doi:10.1198/016214508000000454. → page 19

[70] S. Lovie. Exploratory Data Analysis, volume 27. 03 2008. ISBN 9780470061572. doi:10.1002/9780470061572.eqr222. → page 11

[71] A. Lun, K. Bach, and J. Marioni. Pooling across cells to normalize single-cell rna sequencing data with many zero counts. Genome Biology, 17, 12 2016. doi:10.1186/s13059-016-0947-7. → pages 3, 25

[72] U. Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2, 07 2010. doi:10.1561/2200000008.
→ page 18

[73] U. Luxburg and S. Ben-David. Towards a statistical theory of clustering. PASCAL Workshop on Statistics and Optimization of Clustering, 01 2005. → page 7

[74] E. Marco, R. Karp, G. Guo, P. Robson, A. Hart, L. Trippa, and G.-C. Yuan. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proceedings of the National Academy of Sciences of the United States of America, 111, 12 2014. doi:10.1073/pnas.1408993111. → pages 3, 6

[75] C. Mayer, C. Hafemeister, R. Bandler, R. Machold, R. Batista-Brito, X. Jaglin, K. Allaway, A. Butler, G. Fishell, and R. Satija. Developmental diversification of cortical inhibitory interneurons. Nature, 555, 03 2018. doi:10.1038/nature25999. → page 27

[76] D. McCarthy, K. Campbell, A. Lun, and Q. Wills. Scater: Pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics (Oxford, England), 33, 01 2017. doi:10.1093/bioinformatics/btw777. → pages 2, 3, 6

[77] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018. → pages 11, 12

[78] N. Meinshausen, L. Meier, and P. Bühlmann. P-values for high-dimensional regression, 2008. → page 20

[79] Z. Miao and X. Zhang. Desingle: A new method for single-cell differentially expressed genes detection and classification. bioRxiv, 2017. doi:10.1101/173997. → page 6

[80] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. volume 9, pages 41–48, 09 1999. ISBN 0-7803-5673-x. doi:10.1109/NNSP.1999.788121. → page 71

[81] M. Mokari, H. Mohammadzade, and B. Ghojogh. Recognizing involuntary actions from 3d skeleton data using body states. Scientia Iranica, 27:1424–1436, 06 2020. doi:10.24200/sci.2018.20446. → page 31

[82] A. Mortazavi, B. Williams, K. Mccue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods, 5:621–8, 08 2008. doi:10.1038/nmeth.1226. → page 3

[83] M.
Muraro, G. Dharmadhikari, D. Grün, N. Groen, T. Dielen, E. Jansen, L. Gurp, M. Engelse, F. Carlotti, E. de Koning, and A. Oudenaarden. A single-cell transcriptome atlas of the human pancreas. Cell Systems, 3, 09 2016. doi:10.1016/j.cels.2016.09.002. → page 42

[84] F. Perraudeau, D. Risso, K. Street, E. Purdom, and S. Dudoit. Bioconductor workflow for single-cell rna sequencing: Normalization, dimensionality reduction, clustering, and lineage inference. F1000Research, 6:1158, 07 2017. doi:10.12688/f1000research.12122.1. → page 2

[85] R. Petegrosso, Z. Li, and R. Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in Bioinformatics, 06 2019. doi:10.1093/bib/bbz063. → page 6

[86] E. Pierson and C. Yau. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16, 12 2015. doi:10.1186/s13059-015-0805-z. → pages 12, 26

[87] S. Prabhakaran, E. Azizi, A. Carr, and D. Pe’er. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. Proc. 33rd Int. Conf. Mach. Learn., ICML 2016, 48:1070–1079, 01 2016. → page 4

[88] X. Ren, L. Zheng, and Z. Zhang. Sscc: A novel computational framework for rapid and accurate clustering large-scale single cell rna-seq data. Genomics, Proteomics & Bioinformatics, 17, 06 2019. doi:10.1016/j.gpb.2018.10.003. → page 4

[89] D. Risso, F. Perraudeau, S. Gribkova, S. Dudoit, and J.-P. Vert. A general and flexible method for signal extraction from single-cell rna-seq data. Nature Communications, 9, 12 2018. doi:10.1038/s41467-017-02554-5. → page 12

[90] M. Rodriguez, C. Comin, D. Casanova, O. Bruno, D. Amancio, F. Rodrigues, and L. da F. Costa. Clustering algorithms: A comparative approach. PLOS ONE, 14, 12 2016. doi:10.1371/journal.pone.0210236. → page 5

[91] R. Romanov, A. Zeisel, J. Bakker, F. Girach, A. Hellysaz, R. Tomer, A. Alpar, J. Mulder, F. Clotman, E. Keimpema, B. Hsueh, A. Crow, H. Martens, C. Schwindling, D. Calvigioni, J.
Bains, Z. Máté, G. Szabo, Y. Yanagawa, and T. Harkany. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nat Neurosci, 20:176–188, 02 2017. doi:10.1038/nn.4462. → page 42

[92] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 11 1987. doi:10.1016/0377-0427(87)90125-7. → page 17

[93] G. Rozal and J. Hartigan. The map test for multimodality. Journal of Classification, 11:5–36, 02 1994. → page 15

[94] A.-A. Samadani, E. Kubica, R. Gorbet, and D. Kulic. Perception and generation of affective hand movements. International Journal of Social Robotics, 5, 01 2012. doi:10.1007/s12369-012-0169-4. → page 31

[95] W. Sarle. Cubic Clustering Criterion. SAS technical report. SAS Institute, 1983. → page 18

[96] Å. Segerstolpe, A. Palasantza, P. Eliasson, E.-M. Andersson, A.-C. Andréasson, X. Sun, S. Picelli, A. Sabirsh, M. Clausen, M. Bjursell, D. Smith, M. Kasper, C. Ammala, and R. Sandberg. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabolism, 24:1–15, 10 2016. doi:10.1016/j.cmet.2016.08.020. → page 42

[97] K. Shekhar, S. Lapan, I. Whitney, N. Tran, E. Macosko, M. Kowalczyk, X. Adiconis, J. Levin, J. Nemesh, M. Goldman, S. Mccarroll, C. Cepko, A. Regev, and J. Sanes. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell, 166:1308–1323.e30, 08 2016. doi:10.1016/j.cell.2016.07.054. → page 42

[98] Q. Shi, C. Zhang, M. Peng, X. Yu, T. Zeng, J. Liu, and L. Chen. Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics (Oxford, England), 33, 05 2017. doi:10.1093/bioinformatics/btx176. → page 5

[99] S. Sieranoja. Fast and general density peaks clustering. Pattern Recognition Letters, 128, 10 2019. doi:10.1016/j.patrec.2019.10.019. → page 4

[100] B. Silverman.
Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society, Series B, 43:97–99, 09 1981. doi:10.1111/j.2517-6161.1981.tb01155.x. → pages 15, 16, 20, 29

[101] T. Smith, A. Heger, and I. Sudbery. Umi-tools: Modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Research, 27:gr.209601.116, 01 2017. doi:10.1101/gr.209601.116. → page 2

[102] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 01 2002. doi:10.1162/153244303321897735. → page 18

[103] V. Svensson, R. Vento-Tormo, and S. Teichmann. Exponential scaling of single-cell rna-seq in the past decade. Nature Protocols, 13:599–604, 03 2018. doi:10.1038/nprot.2017.149. → page 2

[104] F. Tang, C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B. Tuch, A. Siddiqui, K. Lao, and M. Surani. mrna-seq whole-transcriptome analysis of a single cell. Nature Methods, 6:377–82, 05 2009. doi:10.1038/nmeth.1315. → page 2

[105] B. Tasic, V. Menon, T. N. Nguyen, S. Kim, T. Jarsky, Z. Yao, B. Levi, L. Gray, S. Sorensen, T. Dolbeare, D. Bertagnolli, J. Goldy, N. Shapovalova, S. Parry, C. Lee, K. Smith, A. Bernard, L. Madisen, S. Sunkin, and H. Zeng. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience, 19, 01 2016. doi:10.1038/nn.4216. → page 42

[106] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63:411–423, 02 2001. doi:10.1111/1467-9868.00293. → page 17

[107] F. W. Townes, S. Hicks, M. Aryee, and R. Irizarry. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome Biology, 20, 12 2019. doi:10.1186/s13059-019-1861-6. → pages 25, 26, 28

[108] C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. Lennon, K. Livak, T. Mikkelsen, and J.
Rinn. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nature Biotechnology, 32, 03 2014. doi:10.1038/nbt.2859. → pages 4, 6

[109] N. Trendafilov. Stepwise estimation of common principal components. Comput. Stat. Data Anal., 54, 12 2010. doi:10.1016/j.csda.2010.03.010. → page 34

[110] P.-Y. Tung, J. Blischak, J. Hsiao, D. Knowles, J. Burnett, J. Pritchard, and Y. Gilad. Batch effects and the effective design of single-cell gene expression studies. Scientific Reports, 7, 01 2017. doi:10.1038/srep39921. → page 1

[111] K. Van den Berge, F. Perraudeau, C. Soneson, M. Love, D. Risso, J.-P. Vert, M. Robinson, S. Dudoit, and L. Clement. Observation weights unlock bulk rna-seq tools for zero inflation and single-cell applications. Genome Biology, 19, 12 2018. doi:10.1186/s13059-018-1406-4. → page 6

[112] L. van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 11 2008. → pages 11, 12

[113] B. Wang, D. Ramazzotti, L. De Sano, J. Zhu, E. Pierson, and S. Batzoglou. Simlr: A tool for large-scale genomic analyses by multi-kernel learning. PROTEOMICS, 18, 03 2017. doi:10.1002/pmic.201700232. → pages 4, 21, 22, 44

[114] J. Wang, J. Schug, K. J. Won, C. Liu, A. Naji, D. Avrahami, M. Golson, and K. Kaestner. Single-cell transcriptomics of the human endocrine pancreas. Diabetes, 65:db160405, 06 2016. doi:10.2337/db16-0405. → page 42

[115] L. Wang, U. Nguyen, J. Bezdek, C. Leckie, and K. Ramamohanarao. ivat and avat: Enhanced visual analysis for cluster tendency assessment. volume 6118, pages 16–27, 06 2010. doi:10.1007/978-3-642-13657-3_5. → page 11

[116] F. Wolf, P. Angerer, and F. Theis. Scanpy: Large-scale single-cell gene expression data analysis. Genome Biology, 19, 12 2018. doi:10.1186/s13059-017-1382-0. → page 4

[117] Z. Wu and H. Wu. Accounting for cell type hierarchy in evaluating single cell rna-seq clustering. Genome Biology, 21, 12 2020. doi:10.1186/s13059-020-02027-x. → page 5

[118] Y. Xin, J.
Kim, H. Okamoto, M. Ni, Y. Wei, C. Adler, A. Murphy, G. Yancopoulos, C. Lin, and J. Gromada. Rna sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metabolism, 24, 09 2016. doi:10.1016/j.cmet.2016.08.018. → page 42

[119] D. Xu and Y. Tian. A comprehensive survey of clustering algorithms. Annals of Data Science, 2, 08 2015. doi:10.1007/s40745-015-0040-1. → page 16

[120] S. Xu, X. Qiao, L. Zhu, Y. Zhang, C. Xue, and L. Li. Reviews on determining the number of clusters. Applied Mathematics & Information Sciences, 10:1493–1512, 07 2016. doi:10.18576/amis/100428. → page 17

[121] L. Yang, J. Liu, Q. Lu, A. Riggs, and X. Wu. Saic: An iterative clustering approach for analysis of single cell rna-seq data. BMC Genomics, 18, 10 2017. doi:10.1186/s12864-017-4019-5. → page 3

[122] F. Ye, Z. Chen, H. Qian, R. Li, C. Chen, and Z. Zheng. New Approaches in Multi-View Clustering. 08 2018. ISBN 978-1-78923-526-5. doi:10.5772/intechopen.75598. → pages 5, 34

[123] L. Zappia, B. Phipson, and A. Oshlack. Splatter: Simulation of single-cell rna sequencing data. Genome Biology, 18, 12 2017. doi:10.1186/s13059-017-1305-0. → page 34

[124] A. Zeisel, A. Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling Leffler, and S. Linnarsson. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science, 347, 03 2015. doi:10.1126/science.aaa1934. → page 42

[125] J. Zhang, J. Fan, H. Fan, D. Rosenfeld, and D. Tse. An interpretable framework for clustering single-cell rna-seq datasets. BMC Bioinformatics, 19, 12 2018. doi:10.1186/s12859-018-2092-7. → page 5

[126] G. Zheng, J. Terry, P. Belgrader, P. Ryvkin, Z. Bent, R. Wilson, S. Ziraldo, T. Wheeler, G. McDermott, J. Zhu, M. Gregory, J. Shuga, L. Montesclaros, J. Underwood, D. Masquelier, S. Nishimura, M. Schnall-Levin, P. Wyatt, C. Hindson, and J. Bielas.
Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 01 2017. doi:10.1038/ncomms14049. → page 42

[127] X. Zhu, J. Zhang, Y. Xu, J. Wang, X. Peng, and H.-D. Li. Single-cell clustering based on shared nearest neighbor and graph partitioning. Interdisciplinary Sciences: Computational Life Sciences, 12, 02 2020. doi:10.1007/s12539-019-00357-4. → pages 4, 21

[128] J. Žurauskienė and C. Yau. pcareduce: Hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17, 03 2016. doi:10.1186/s12859-016-0984-y. → page 3

