UBC Theses and Dissertations

Advancing our understanding of genome regulation via optimization of stem cell differentiation and interpretable deep learning

Novakovskiy, German

Abstract

The regulation of gene expression is a core challenge in understanding how diverse types of cells can be produced from the same DNA instructions. Insights into this complex machinery advance not only basic science but also applications in therapy and pharmacology. One example is the differentiation of stem cells for regenerative medicine to treat patients with diabetes. In my second chapter, I address the problem of optimizing the differentiation protocol towards definitive endoderm, the precursor of insulin-producing pancreatic beta cells, by replacing the expensive growth factor with inexpensive small-molecule alternatives. I introduce a multi-step pipeline based on small-molecule transcriptome response profiles. The discovered chemicals highlight the importance of key transcription factors (TFs), such as HIF and MYC, in the process. The study of TFs is highly important and will further our knowledge of differentiation. Motivated by this, I explore current trends in the study of TFs in the context of gene regulation. With large-scale data generation efforts by public consortia such as ENCODE, deep learning methods have become pervasive. A large training dataset is fundamental to the success of these methods; however, the amount of TF-related data is often small. To tackle this issue, in my third chapter, I perform an in-depth assessment of transfer learning for TF binding prediction and provide biologically motivated guidelines for efficient training of deep models when data is limited. An additional challenge for deep models, beyond data sufficiency, is interpretability. In the fourth chapter, I systematically categorize and summarize interpretation approaches, exploring their underlying assumptions, strengths, and weaknesses. Inspired by transparent deep learning architectures, I present ExplaiNN, a new transparent model for genomics tasks. I explore its efficiency and usability on a variety of problems in the fifth chapter of this thesis. Finally, in the last chapter, I apply ExplaiNN to ATAC-seq datasets of mouse and human immune systems to study differences in cis-regulatory logic. The transparency of the new method allowed me to discover a reproducible set of sequence motifs that, either individually or in combination, are responsible for the bulk of the predictions and tend to have species-specific occurrence patterns.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International