Open Collections
UBC Theses and Dissertations
Development of computational solutions to process and interpret mass spectrometry-based metabolomics data
Wang, Yukai
Abstract
Metabolomics, the study of small molecules within biological systems, offers valuable insights into biochemical processes. This thesis addresses two challenges in mass spectrometry-based metabolomics: processing complex breathomics data and enhancing the structural annotation of metabolites using deep learning.
In the first study, I introduce BreathXplorer, an open-source Python package designed to process real-time exhaled breath data from secondary electrospray ionization high-resolution mass spectrometry (SESI-HRMS). BreathXplorer tackles the challenge of non-Gaussian metabolic signal shapes by employing topological algorithms or Gaussian mixture models (GMM) to identify exhalation intervals and density-based spatial clustering (DBSCAN) to cluster m/z values. It accurately determines the start and end points of exhalation, ensuring precise quantitative measurements. In a proof-of-concept study on exercise breathomics, BreathXplorer identified exercise-responsive metabolites, showing its potential in real-time metabolomics research.
In the second study, I explore deep learning as a solution for compound annotation in tandem mass spectrometry (MS/MS). This offers a predictive strategy where traditional library search tools are limited due to small spectral libraries. Generative models have become a fundamental approach for various tasks; however, their application to MS/MS structural annotation is less developed and often underperforms compared to traditional fingerprint-based counterparts. In this work, I investigate the potential of generative models by building and comparing transformer-based fingerprint models and generative models. This comparison helps to understand their strengths and limitations in annotating chemical structures. Training and testing on 616,594 unique structures, I identified three key limitations of direct structure generation with generative models: (1) error accumulation, (2) generation of invalid compounds, and (3) generation of unrecorded compounds. I then propose a solution using the generative model as a ranker. My results demonstrate that generative-based ranking outperforms fingerprint-based systems. Furthermore, I observed a positive correlation between high generation scores and the correctness of predicted structures. My analysis shows that generative models can leverage richer structural information during training, leading to improved accuracy in end-to-end chemical structure identification.
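To make the m/z clustering step mentioned in the abstract more concrete, the following is a minimal sketch of density-based clustering (DBSCAN) applied to one-dimensional m/z readings. It is not BreathXplorer's code or API: the function name `dbscan_1d`, the `eps` and `min_samples` values, and the m/z readings are all illustrative assumptions chosen for this example.

```python
from typing import List, Optional

NOISE = -1  # label for readings that belong to no cluster


def dbscan_1d(values: List[float], eps: float, min_samples: int) -> List[int]:
    """Label each value with a cluster id (0, 1, ...) or NOISE.

    Hypothetical helper for illustration only; a plain DBSCAN on 1-D data.
    """
    n = len(values)
    # eps-neighbourhood of every point (includes the point itself).
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing the cluster
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


# Made-up m/z readings: two tight groups and one isolated value.
mz_values = [59.0491, 59.0493, 59.0492, 89.0597, 89.0599, 89.0598, 120.5]
labels = dbscan_1d(mz_values, eps=0.001, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the two tight groups of readings form separate clusters and the isolated value is labelled as noise. In practice, a library implementation such as scikit-learn's `DBSCAN` would be the more usual choice, and the tolerance would need to match the instrument's mass accuracy.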
Item Metadata
Title | Development of computational solutions to process and interpret mass spectrometry-based metabolomics data
Creator | Wang, Yukai
Supervisor |
Publisher | University of British Columbia
Date Issued | 2024
Genre |
Type |
Language | eng
Date Available | 2024-10-18
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0445606
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace