Open Collections
UBC Theses and Dissertations
Interpretable latent variable models for high-dimensional biological data analysis
Zhang, Yichen
Abstract
A latent variable model is a statistical model used to uncover hidden patterns in data. Matrix factorization and autoencoders are two widely used frameworks for building latent variable models. Matrix factorization-based methods represent data in a low-dimensional latent space in which each latent dimension is explained by a weighted combination of the original features. While this offers a straightforward interpretation of the latent space, these methods do not capture the non-linear structure of the data. Artificial neural networks such as autoencoders, on the other hand, have emerged as a powerful tool for summarizing complex data structures through a series of non-linear transformations. However, the resulting latent representation often lacks a biological interpretation.
In this dissertation, I develop three latent variable models that extract meaningful biological patterns from various datasets by combining the autoencoder and matrix factorization frameworks in a hybrid structure. The first model incorporates domain knowledge of biological pathways into an autoencoder framework. I demonstrate that the proposed method retains the biological signals in the latent space better, and recovers the underlying latent structure more accurately, than a previous matrix-factorization-based approach. The second model is a topic analysis tool for single-cell genomics that builds on the variational autoencoder framework and the latent topic model. The proposed topic model recovers cell clusters and cell-specific gene programs more accurately than conventional methods such as principal component analysis and non-negative matrix factorization. The results suggest that latent topics are well suited to capturing cell-type-specific marker genes and that they recapitulate known immune cells in pancreatic cancer. Lastly, I extend this topic model to interpret dynamic changes in gene expression. The dynamic topic model uncovers short-term transcriptional dynamics from a plethora of spliced and unspliced single-cell RNA-sequencing counts. I demonstrate that modeling both types of RNA counts can improve the robustness of statistical estimation and reveal aspects of transcriptional dynamics that may have been missed in previous analyses. In the latent space, I discovered that seven gene programs (topics) are highly correlated with cancer prognosis and are generally enriched for immune cell types and pathways.
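The matrix factorization idea mentioned in the abstract, where each latent dimension is a weighted combination of the original features, can be illustrated with a minimal sketch. This is generic non-negative matrix factorization with multiplicative updates, not any of the models developed in the dissertation. The function name `nmf`, the toy matrix sizes, and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np

def nmf(X, n_factors, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix X (cells x genes) as X ~ Z @ W.

    Hypothetical helper for illustration; not taken from the dissertation.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    Z = rng.random((n_cells, n_factors)) + eps  # per-cell latent scores
    W = rng.random((n_factors, n_genes)) + eps  # per-factor gene weights
    for _ in range(n_iter):
        # Multiplicative updates for the squared-error objective;
        # they keep both factors non-negative.
        W *= (Z.T @ X) / (Z.T @ Z @ W + eps)
        Z *= (X @ W.T) / (Z @ W @ W.T + eps)
    return Z, W

# Toy data (assumed): 50 "cells", 20 "genes", generated from 3 latent factors.
rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 20))

Z, W = nmf(X, n_factors=3)
rel_error = np.linalg.norm(X - Z @ W) / np.linalg.norm(X)

# Each row of W is one latent dimension; its largest weights point to the
# features that define it, which is what makes the factors interpretable.
top_genes = np.argsort(W, axis=1)[:, ::-1][:, :5]
```

Reading off the highest-weight features per row of `W` is the kind of direct interpretation that a linear factorization allows and that a generic autoencoder latent space does not provide by default.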
Item Metadata
Title | Interpretable latent variable models for high-dimensional biological data analysis
Creator | Zhang, Yichen
Publisher | University of British Columbia
Date Issued | 2024
Language | eng
Date Available | 2024-03-27
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0440942
Degree Grantor | University of British Columbia
Graduation Date | 2024-05
Scholarly Level | Graduate
Aggregated Source Repository | DSpace