Detection of enriched patterns in protein sequence data

UBC Theses and Dissertations

Smith, Theodore Gray

Abstract

Proteolysis is a form of post-translational modification consisting of the cleavage of a protein at the site of a peptide bond. This process is primarily mediated by a class of enzymes known as proteases, which exhibit varying specificity for the protein sequences they cleave. Although advances in proteomics have enabled sequencing of complex mixtures of proteins from biological samples, direct detection of protease activity remains challenging because proteases are present at low abundance and because observing a protease does not always indicate its activity level. Detection of proteolysis is therefore typically accomplished indirectly by observation of protease substrates in protein sequencing data. However, many proteases’ cleavage-site specificities are not well understood, restricting the utility of supervised classification methods. We present a tool to overcome this limitation through unsupervised detection of overrepresented patterns in protein sequence data, providing insight into the specificities of the proteases contributing to a sample’s composition, even if the proteases themselves are poorly characterized. These patterns can be compared to those detected in sets of established protease substrate sequences, and patterns identified in both sets can be interpreted as indicators of mutual protease activity. Here we apply this methodology to the proteolytic cleavage event data in the MEROPS database, identifying specificity patterns corresponding to over 100 distinct proteases. The statistical validity of the algorithm is assessed through a series of tests on in silico data sets, and the performance of the algorithm is compared to existing motif detection and clustering tools. Multiple clinical data sets are then analyzed using the algorithm, yielding patterns consistent with markers of both cancer and cellular response to chemotherapy treatment. The utility of the algorithm is then discussed in light of these findings, several potential use cases are presented, and possible future enhancements are proposed.
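
To make the idea of an "overrepresented pattern" concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes equal-length peptide windows aligned on a cleavage site, a uniform amino-acid background, a one-sided binomial test per position and residue, and a Bonferroni-corrected threshold. The function names (binomial_tail, enriched_residues, shared_patterns) and the toy peptides are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies; real proteomes are not uniform.
BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enriched_residues(windows, background=BACKGROUND, alpha=0.05):
    """List (position, residue, count, p_value) for overrepresented residues.

    `windows` are equal-length peptide strings aligned on the cleavage site.
    """
    n = len(windows)
    width = len(windows[0])
    # Bonferroni correction over every (position, residue) pair tested.
    threshold = alpha / (width * len(background))
    hits = []
    for pos in range(width):
        counts = Counter(w[pos] for w in windows)
        for residue, count in sorted(counts.items()):
            if residue not in background:
                continue  # skip non-standard symbols
            p_value = binomial_tail(count, n, background[residue])
            if p_value < threshold:
                hits.append((pos, residue, count, p_value))
    return hits


def shared_patterns(hits_a, hits_b):
    """Return the (position, residue) pairs enriched in both hit lists."""
    pairs_a = {(pos, res) for pos, res, *_ in hits_a}
    pairs_b = {(pos, res) for pos, res, *_ in hits_b}
    return pairs_a & pairs_b


# Toy data (invented): a sample and a reference substrate set.
SAMPLE = ["ARGS", "LRKT", "GRDE", "SRAL", "TRVP", "KRSG", "DRLA", "EQTV"]
REFERENCE = ["GRAV", "TRLS", "ARKD", "SRGE", "LRTP", "VRSA", "KRDG", "DREL"]

sample_hits = enriched_residues(SAMPLE)
reference_hits = enriched_residues(REFERENCE)
print(shared_patterns(sample_hits, reference_hits))
```

In this toy setting, a pattern enriched in both the sample and the reference set plays the role of the shared indicator described above. A realistic analysis would need an empirical background distribution and would have to handle multi-position patterns, which this sketch ignores.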

Item Metadata

Title	Detection of enriched patterns in protein sequence data
Creator	Smith, Theodore Gray
Publisher	University of British Columbia
Date Issued	2019
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2020-09-30
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0380835
URI	http://hdl.handle.net/2429/71639
Degree (Theses)	Master of Science - MSc
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2019-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
