Open Collections
UBC Theses and Dissertations
Application of machine learning and language models in health and biological sciences
Keshavarz Rahaghi, Faeze
Abstract
The advent of sequencing technologies and the completion of the Human Genome Project revolutionized our understanding of DNA and RNA molecules. Variation in DNA sequences is both a source of genetic diversity and the cause of various diseases, while transcriptomic changes influence protein levels and cellular homeostasis. Large-scale initiatives have applied DNA and RNA sequencing to thousands of genomes, deepening our knowledge of cellular processes, with implications for health and biological sciences. Given the complexity of cellular networks, genetic alterations trigger widespread downstream effects, necessitating the development of advanced analytical approaches. In this thesis, I employed machine learning to extract insights from large genomic datasets. The aim of the first part of the work was to uncover transcriptional patterns in cancer and assess the impact of genetic alterations on cellular networks. We hypothesized that transcriptome analysis reveals tumour vulnerabilities and identifies therapeutic targets. Primary and metastatic tumours were classified based on the altered status of 50 cancer-related genes, which uncovered distinct transcriptional patterns associated with different genes. Notable examples include TP53 (F1 score: 0.87), which exhibits a pan-cancer transcriptional pattern, and BRAF (0.93) and ATRX (0.84), which display tumour-type-specific patterns. Additionally, some genes exhibited weak or no patterns with the loss of wild-type function. The identification of the genes contributing most to classification led to the finding that existing therapies, such as AURKA inhibitors, could also benefit patients with alterations in genes beyond those for which the treatment was originally approved, provided that similar transcriptional modifications are present. In the final part of this work, I applied foundation models to nematode genomes and predicted their DNA sequences to lay the groundwork for language model-driven genomic analysis.
This investigation underscored the importance of selecting appropriate sample sets aligned with specific research objectives. Expanding datasets with vastly different species does not always yield the most informative results. The optimal choice of genomes depends on whether the goal is to address species- or genus-specific questions or to gain broader insights across multiple species or genera. These findings highlight the potential of machine learning and language models in refining personalized medicine and enhancing our understanding of evolutionary biology.
Item Metadata
Title | Application of machine learning and language models in health and biological sciences |
Creator | Keshavarz Rahaghi, Faeze |
Supervisor | |
Publisher | University of British Columbia |
Date Issued | 2025 |
Genre | |
Type | |
Language | eng |
Date Available | 2025-09-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0450082 |
URI | |
Degree (Theses) | |
Program (Theses) | |
Affiliation | |
Degree Grantor | University of British Columbia |
Graduation Date | 2025-11 |
Campus | |
Scholarly Level | Graduate |
Rights URI | |
Aggregated Source Repository | DSpace |