Application of machine learning and language models in health and biological sciences

UBC Theses and Dissertations

Keshavarz Rahaghi, Faeze

Abstract

The advent of sequencing technologies and the completion of the Human Genome Project revolutionized our understanding of DNA and RNA molecules. Variation in DNA sequences is both a source of genetic diversity and a cause of various diseases, while transcriptomic changes influence protein levels and cellular homeostasis. Large-scale initiatives have applied DNA and RNA sequencing to thousands of genomes, deepening our knowledge of cellular processes, with implications for health and biological sciences. Because cellular networks are complex, genetic alterations trigger widespread downstream effects, necessitating the development of advanced analytical approaches. In this thesis, I employed machine learning to extract insights from large genomic datasets. The aim of the first part of the work was to uncover transcriptional patterns in cancer and assess the impact of genetic alterations on cellular networks. We hypothesized that transcriptome analysis reveals tumour vulnerabilities and identifies therapeutic targets. Primary and metastatic tumours were classified based on the alteration status of 50 cancer-related genes, which uncovered distinct transcriptional patterns associated with different genes. Notable examples include TP53 (F1 score: 0.87), which exhibits a pan-cancer transcriptional pattern, and BRAF (0.93) and ATRX (0.84), which display tumour-type-specific patterns. Some genes, by contrast, exhibited weak or no transcriptional patterns upon loss of wild-type function. Identifying the genes that contributed most to classification led to the finding that existing therapies, such as AURKA inhibitors, could also benefit patients with alterations in genes beyond those for which the treatment was originally approved, provided that their tumours show similar transcriptional modifications.
In the final part of this work, I applied foundation models to nematode genomes and predicted their DNA sequences to lay the groundwork for language model-driven genomic analysis. This investigation underscored the importance of selecting sample sets aligned with specific research objectives. Expanding datasets with vastly different species does not always yield the most informative results. The optimal choice of genomes depends on whether the goal is to address species- or genus-specific questions or to gain broader insights across multiple species or genera. These findings highlight the potential of machine learning and language models in refining personalized medicine and enhancing our understanding of evolutionary biology.
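
The F1 scores quoted in the abstract (for example 0.87 for TP53) summarize how well a classifier separates altered from wild-type tumours. As a minimal sketch of how such a score is computed, assuming a binary labelling in which 1 means "gene altered" and 0 means "wild type", the snippet below derives F1 from precision and recall. The function name, the labels, and the eight-sample example are hypothetical illustrations, not data or code from the thesis, and the abstract does not state which averaging scheme was used.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = gene altered, 0 = wild type)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correctly predicted altered samples: define F1 as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight tumour samples (illustrative only).
altered_true = [1, 1, 1, 0, 0, 1, 0, 0]
altered_pred = [1, 1, 0, 0, 1, 1, 0, 0]
score = f1_score(altered_true, altered_pred)
print(f"F1 = {score:.2f}")
```

Because F1 ignores true negatives, it stays informative when altered samples are a small minority of the cohort, which is a common situation for individual cancer genes.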

Item Metadata

Title	Application of machine learning and language models in health and biological sciences
Creator	Keshavarz Rahaghi, Faeze
Supervisor	Jones, Steven J. M.
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-09-09
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0450082
URI	http://hdl.handle.net/2429/92275
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2025-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
