Open Collections
UBC Theses and Dissertations
Guiding future genomics research through machine learning : functional DNA analysis and benchmarking
Luthra, Ishika
Abstract
Differences in gene expression can drive variation in phenotypes and disease. Gene expression is modulated, in part, by proteins called transcription factors that bind specific sequence sites in our genome. The resulting transcription factor binding landscape orchestrates gene expression patterns. Thus, by mapping transcription factor binding patterns, we can predict gene expression directly from the DNA sequence. Machine learning models can help parse the vast and complex regulatory landscape and its evolution. In the past, we have observed biochemical activity across the genome but, without a proper null hypothesis, have not been able to conclude what proportion of it is functional. To better understand the baseline regulatory activity of the genome, I defined a genomic null hypothesis by predicting the activity of random DNA in humans. I used a state-of-the-art machine learning model and found that, while mononucleotide shuffled sequences were predicted to have minimal activity, local dinucleotide content-matched randomized DNA was predicted to retain much of the regulatory activity of evolved sequences. These results were surprising given that the human genome had not seen this foreign DNA, and yet transcriptional activity was predicted to be present. We also observed abundant and diverse transcription when measuring the activity of human genomic DNA in yeast. These results suggest that basal regulatory activity is the default DNA state in eukaryotes.
While sequence-based genomics models are widely used in the field, their progress has proven difficult to benchmark due to substantial heterogeneity in model architectures, training datasets, and ad hoc model evaluations. I defined a system for large-scale, community-led standardized model benchmarking. I designed an Application Programming Interface that allows seamless communication across pre-trained models and functional genomics datasets. The ability to easily compare results across different datasets and models will accelerate the improvement of genomics models, motivate novel functional genomic benchmarks, and provide a more nuanced understanding of model abilities. Overall, the work in this thesis shows the potential for using sequence-based models to answer intractable genomics problems and provides a pathway for systematic and seamless model benchmarking that can help the field design better models of gene regulation.
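The contrast the abstract draws between mononucleotide shuffling and dinucleotide content-matched randomization can be illustrated with a minimal Python sketch. It is not the thesis's procedure: the function names are hypothetical, and the first-order Markov chain used here is only one assumed way to approximate dinucleotide content (the "local" matching described in the abstract may work differently).

```python
import random
from collections import Counter, defaultdict


def mononucleotide_shuffle(seq, rng):
    """Permute the bases: single-base counts are kept exactly, order is lost."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def dinucleotide_matched_random(seq, rng):
    """Sample a same-length sequence from a first-order Markov chain fitted to seq.

    Assumption: this matches dinucleotide content only approximately.
    """
    # Count how often each base follows each other base in the input.
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1

    out = [seq[0]]
    while len(out) < len(seq):
        followers = transitions[out[-1]]
        if not followers:
            # The last base was never followed by anything in seq;
            # fall back to drawing a base from the overall composition.
            out.append(rng.choice(seq))
            continue
        bases, weights = zip(*followers.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = "ACGTTGCACGCGTATACGCGGCTA" * 4  # toy input, not real genomic data
shuffled = mononucleotide_shuffle(original, rng)
matched = dinucleotide_matched_random(original, rng)
```

In this sketch, `shuffled` has exactly the base composition of `original`, while `matched` additionally reuses only base-to-base transitions observed in `original`. Either kind of sequence could then be passed to a sequence-based model to compare predicted activity.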
Item Metadata
Title |
Guiding future genomics research through machine learning : functional DNA analysis and benchmarking
|
Creator |
Luthra, Ishika
|
Publisher |
University of British Columbia
|
Date Issued |
2024
|
Language |
eng
|
Date Available |
2025-01-13
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0447747
|
Degree Grantor |
University of British Columbia
|
Graduation Date |
2025-05
|
Scholarly Level |
Graduate
|
Aggregated Source Repository |
DSpace
|