UBC Theses and Dissertations


Detection of enriched patterns in protein sequence data

Smith, Theodore Gray


Proteolysis is a form of post-translational modification consisting of the cleavage of a protein at the site of a peptide bond. This process is primarily mediated by a class of enzymes known as proteases, which exhibit varying specificity for the protein sequences they cleave. Although advances in proteomics have enabled sequencing of complex mixtures of proteins from biological samples, direct detection of protease activity remains challenging because proteases are often present at low abundance and because observing a protease does not always indicate its activity level. Detection of proteolysis is therefore typically accomplished indirectly by observation of protease substrates in protein sequencing data. However, many proteases’ cleavage-site specificities are not well understood, restricting the utility of supervised classification methods. We present a tool to overcome this limitation through unsupervised detection of overrepresented patterns in protein sequence data, providing insight into the specificities of the proteases contributing to a sample’s composition, even if the proteases themselves are poorly characterized. These patterns can be compared to those detected in sets of established protease substrate sequences, and patterns identified in both sets can be interpreted as indicators of shared protease activity. Here we apply this methodology to the proteolytic cleavage event data in the MEROPS database, identifying specificity patterns corresponding to over 100 distinct proteases. The statistical validity of the algorithm is assessed through a series of tests on in silico data sets, and the performance of the algorithm is compared to that of existing motif detection and clustering tools. Multiple clinical data sets are then analyzed using the algorithm, yielding patterns consistent with markers of both cancer and cellular response to chemotherapy treatment. The utility of the algorithm is then discussed in light of these findings, several potential use cases are presented, and possible future enhancements are proposed.
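
To make the idea of an "overrepresented pattern" concrete, the following is a minimal sketch of one common way to test for enrichment. It is not the algorithm developed in the thesis, whose details are not given in this abstract. It assumes fixed-width sequence windows aligned around a cleavage site, a known background frequency for each amino acid, a one-sided binomial test at each position, and a Bonferroni correction. The function names (`binomial_tail`, `enriched_positions`) and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enriched_positions(foreground, background_freq, alpha=0.05):
    """Find (position, residue) pairs that occur more often than expected.

    foreground: equal-length sequence windows (assumed aligned on the cleavage site).
    background_freq: dict mapping residue -> expected frequency (an assumption
    supplied by the caller, not derived here).
    Returns (position, residue, count, p_value) tuples, most significant first.
    """
    n = len(foreground)
    width = len(foreground[0])
    # Bonferroni correction: one test per (position, residue) combination.
    n_tests = width * len(background_freq)
    hits = []
    for pos in range(width):
        counts = Counter(seq[pos] for seq in foreground)
        for residue, k in counts.items():
            p = background_freq.get(residue, 0.0)
            if p <= 0.0:
                continue  # residue absent from the background model; skip it
            p_value = binomial_tail(k, n, p)
            if p_value * n_tests < alpha:
                hits.append((pos, residue, k, p_value))
    return sorted(hits, key=lambda h: h[3])


# Toy data (hypothetical): every window has K at position 1 and R at position 2.
windows = ["AKRL", "GKRS", "TKRA", "LKRV", "SKRD", "DKRE", "EKRG", "VKRT"]
uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
for pos, residue, count, p_value in enriched_positions(windows, uniform):
    print(pos, residue, count, p_value)
```

A per-position test like this treats positions independently, so it would miss correlated patterns spanning several positions; a uniform background is also a simplification, since real amino acid frequencies vary by organism and sample.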


Attribution-NonCommercial-NoDerivatives 4.0 International