Open Collections
UBC Theses and Dissertations
Robust methods for inferring cluster structure in single cell RNA sequencing data
Willie, Elijah
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to gain insights into complex biological systems that were not possible with previous technologies. Unsupervised machine learning methods, and clustering algorithms in particular, are critical to the analysis of scRNA-seq datasets, enabling investigators to systematically define cell types based on similarities in global gene expression profiles. However, in robustly applying a clustering algorithm to identify and define cell types, two critical open questions remain: i) to what extent do natural clusters exist in a given dataset? ii) what is the number of clusters best supported by the data? More specifically, most clustering algorithms will attempt to identify a fixed number of clusters without considering whether a given dataset is clusterable (i.e., whether natural clusters exist). Understanding when the application of clustering algorithms is appropriate is nevertheless crucial in making inferences from scRNA-seq datasets. Further, all clustering algorithms require the user to explicitly or implicitly specify the number of clusters to search for. In this thesis, we first assess the robustness of multimodality testing methods for determining whether a given set of points (or a dataset) is clusterable. Next, we utilize this framework to develop an algorithm, which we refer to as CCMT, for inferring the number of robust clusters in a given dataset. Results from simulation studies show that multimodality testing as a means for inferring cluster structure is robust and scales favorably for large datasets. This method can detect cluster structure with high statistical power even when there is high overlap between the clusters. We also apply our approach to real scRNA-seq datasets and show that it can accurately determine the cluster structure in both positive and negative control experiments.
In the second part of this work, we apply CCMT in simulation studies and show that coupling multimodality testing with the nested structure of hierarchical clustering and discriminant analysis provides a robust approach for determining the number of clusters in a given dataset. Results on real data also show that CCMT can recover ground truth partitions with reasonable accuracy, and that it is much faster than competing methods of similar accuracy.
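The idea of checking for multimodality before clustering can be illustrated with a small sketch. This is not CCMT and not the testing procedure evaluated in the thesis: it swaps in Sarle's bimodality coefficient, a simple moment-based heuristic, in place of a formal multimodality test such as Hartigan's dip test. The function names, the synthetic one-dimensional data, and the choice of 5/9 as the cutoff (the coefficient's value for a uniform distribution) are all assumptions made for illustration. In practice the input would be some one-dimensional summary of the dataset, which is likewise an assumption here.

```python
import random


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient: (skewness^2 + 1) / kurtosis.

    Uses population (biased) moments. Larger values suggest more than one mode.
    This is a heuristic stand-in, not a formal multimodality test.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a Gaussian)
    return (skewness ** 2 + 1) / kurtosis


# Value of the coefficient for a uniform distribution, a common cutoff.
UNIFORM_BC = 5 / 9


def looks_clusterable(values, threshold=UNIFORM_BC):
    """Hypothetical helper: flag data whose coefficient exceeds the cutoff."""
    return bimodality_coefficient(values) > threshold


# Synthetic data (assumed for illustration): one group versus two separated groups.
rng = random.Random(0)
one_group = [rng.gauss(0, 1) for _ in range(500)]
two_groups = ([rng.gauss(-3, 1) for _ in range(250)]
              + [rng.gauss(3, 1) for _ in range(250)])

print("one group:", looks_clusterable(one_group))
print("two groups:", looks_clusterable(two_groups))
```

A formal test would return a p-value rather than a thresholded score, which is what allows statements about statistical power like those in the abstract. The coefficient is used here only because it needs no dependencies.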
Item Metadata
Title | Robust methods for inferring cluster structure in single cell RNA sequencing data
Creator | Willie, Elijah
Publisher | University of British Columbia
Date Issued | 2020
Genre |
Type |
Language | eng
Date Available | 2020-10-28
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0394826
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2020-11
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace