Investigating ML potentials and deep generative models for efficient conformational sampling
Shenoy, Nikhil
Abstract
Efficiently sampling from the landscape of molecular conformations is an important task in computational drug discovery. Simulation approaches like Molecular Dynamics (MD) require an energy function that is fast, accurate, transferable, and scalable. Traditional energy functions like force fields are fast but inaccurate, while quantum mechanical (QM) methods are accurate but slow. Recently, machine learning (ML) potentials trained on datasets labeled with QM methods have become popular because they address this accuracy-speed trade-off. However, generating QM datasets is cost-intensive, and design choices made during generation, such as conformational diversity (coverage of the conformational landscape) and structural diversity (coverage of chemical space), can introduce biases into the dataset.
In the first part of the thesis, we explore the relationship between these dataset biases and ML potential generalization. We investigate these dynamics through two distinct experiments: a fixed-budget one, where the dataset size remains constant, and a fixed-molecular-set one, which holds structural diversity fixed while varying conformational diversity. Our results reveal a critical need for balanced structural and conformational diversity in QM datasets for optimal generalization, which current datasets lack. We believe these findings can inform future data generation and the development of ML potentials that generalize beyond their training data.
An alternative approach is to sample conformations directly from the molecular graph using deep generative models such as diffusion models. Existing competitive approaches either use expensive local-structure methods or rely on large general-purpose architectures without task-specific inductive biases. In the second part of the thesis, we challenge the status quo and develop a simple, scalable deep generative method, Equivariant Transformer Flow (ET-Flow), incorporating flow matching, an equivariant transformer, a harmonic prior, and an approximate optimal transport alignment. We achieve state-of-the-art performance on several molecular conformer generation benchmarks with significantly fewer parameters and inference steps than existing methods, highlighting the importance of inductive biases and well-informed modelling choices.
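The flow-matching recipe the abstract names can be illustrated with a minimal sketch: sample starting points from a harmonic prior (bonded atoms are correlated, so samples start roughly chain-like), permute them against the data batch with an approximate optimal transport assignment so the straight-line paths are short, then build the linear interpolant whose constant velocity is the regression target. This is an illustration only, not the ET-Flow implementation; all function names are hypothetical, the 1-D coordinates and chain graph are simplifications, and only standard numpy/scipy routines are used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def chain_laplacian(n):
    # Graph Laplacian of a simple chain "molecule" (atom i bonded to atom i+1).
    A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.diag(A.sum(1)) - A

def harmonic_prior(laplacian, n_samples, eps=0.5):
    # Draw samples from N(0, (L + eps*I)^-1): bonded atoms are correlated,
    # so sampled "conformations" respect the graph's connectivity
    # (1-D coordinates per atom here, purely for illustration).
    cov = np.linalg.inv(laplacian + eps * np.eye(laplacian.shape[0]))
    chol = np.linalg.cholesky(cov)
    return rng.standard_normal((n_samples, laplacian.shape[0])) @ chol.T

def ot_align(x0, x1):
    # Approximate optimal transport within a batch: permute the prior samples
    # to minimize total squared distance to the data samples, shortening the
    # straight-line paths the flow must learn.
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols]

def flow_matching_batch(x0, x1):
    # Linear interpolant x_t = (1 - t) x0 + t x1; the regression target for
    # the learned velocity field is the constant displacement x1 - x0.
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1 - t) * x0 + t * x1
    return t, xt, x1 - x0

L = chain_laplacian(5)
x0 = harmonic_prior(L, n_samples=8)   # prior "conformations"
x1 = rng.standard_normal((8, 5))      # stand-in for data conformations
x0m, x1m = ot_align(x0, x1)
t, xt, v_target = flow_matching_batch(x0m, x1m)
```

A model would then be trained to predict `v_target` from `(xt, t)` plus graph features; at inference, integrating the learned velocity from a fresh prior sample produces a conformation.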
Item Metadata
Title | Investigating ML potentials and deep generative models for efficient conformational sampling
Creator | Shenoy, Nikhil
Publisher | University of British Columbia
Date Issued | 2024
Language | eng
Date Available | 2024-07-06
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0444096
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Scholarly Level | Graduate
Aggregated Source Repository | DSpace