UBC Theses and Dissertations

A Bayesian nonparametric model for RNA-sequencing data

Lepur, Matthieu


Cancers from different tissue types can share a latent structure reflecting commonly altered gene pathways. It is difficult to cluster cancer patients based on this latent structure because the tissue of origin often dominates the latent structure effect. We propose a Bayesian nonparametric model that accounts for the tissue effect and clusters based on a latent structure using a Dirichlet Process prior. More specifically, we use an infinite Gaussian mixture model where the mean parameter is modelled as a linear combination of tissue, gene, and latent cluster effects. The choice of the Dirichlet Process prior allows us to side-step a model selection problem, as the number of latent clusters is unknown a priori. Our approach learns the tissue effect by using tissue parameters in a supervised learning setting, while simultaneously learning the latent structure based on the residuals in an unsupervised setting. These so-called residuals result from subtracting out the inferred tissue and gene parameters from the observations and can be interpreted as the cluster effect. A key component of the model is its ability to leverage conjugacy between the likelihood model and cluster parameters. The Gaussian form of the model is not affected by our choice of mean parameter; therefore, conjugacy is preserved. Indeed, the model has the intuitive interpretation of clustering on the cluster effect signal that remains after subtracting out the tissue and gene effects. Conjugacy allows for the use of sophisticated Markov chain Monte Carlo techniques used in Bayesian mixture models, such as Split-Merge samplers. We demonstrate our model by showing results on synthetic data, semi-synthetic data generated using a publicly available dataset from the Genotype-Tissue Expression (GTEx) portal, and another publicly available dataset from the International Cancer Genome Consortium (ICGC).
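
The generative structure described in the abstract can be illustrated with a small forward simulation. The sketch below is not the thesis implementation: the function names, prior scales, noise level, and the Chinese restaurant process representation of the Dirichlet Process are all assumptions chosen for illustration. It draws latent cluster labels, builds each patient's mean as the sum of tissue, gene, and cluster effects, adds Gaussian noise, and then forms residuals by subtracting the tissue and gene effects. The true simulated effects are used here in place of inferred ones, and no posterior inference (such as a Split-Merge sampler) is performed.

```python
import numpy as np


def sample_crp(n, alpha, rng):
    """Draw cluster labels for n items from a Chinese restaurant process."""
    labels = np.zeros(n, dtype=int)
    counts = [1]  # the first item opens the first cluster
    for i in range(1, n):
        # Existing clusters are chosen in proportion to their size,
        # a new cluster in proportion to the concentration alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    return labels


def simulate(n_patients=60, n_genes=20, n_tissues=3,
             alpha=1.0, noise_sd=0.1, seed=0):
    """Simulate expression values; all scales here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    tissue = rng.integers(n_tissues, size=n_patients)  # observed tissue label
    cluster = sample_crp(n_patients, alpha, rng)       # latent cluster label
    n_clusters = cluster.max() + 1
    tissue_eff = rng.normal(0.0, 2.0, size=(n_tissues, n_genes))
    gene_eff = rng.normal(0.0, 1.0, size=n_genes)
    cluster_eff = rng.normal(0.0, 1.0, size=(n_clusters, n_genes))
    # Mean is a linear combination of tissue, gene, and cluster effects.
    mean = tissue_eff[tissue] + gene_eff + cluster_eff[cluster]
    y = mean + rng.normal(0.0, noise_sd, size=mean.shape)
    return y, tissue, cluster, tissue_eff, gene_eff, cluster_eff


def residuals(y, tissue, tissue_eff, gene_eff):
    """Subtract tissue and gene effects; what remains is cluster effect plus noise."""
    return y - tissue_eff[tissue] - gene_eff
```

Calling `simulate()` and passing its outputs to `residuals` yields a patients-by-genes matrix whose rows sit close to the cluster effect of each patient's latent cluster, which is the signal the clustering step is described as operating on.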

Attribution-NonCommercial-NoDerivatives 4.0 International