UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Detection of enriched patterns in protein sequence data Smith, Theodore Gray 2019

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata

Download

Media
24-ubc_2019_november_smith_theodore.pdf [ 3.08MB ]
Metadata
JSON: 24-1.0380835.json
JSON-LD: 24-1.0380835-ld.json
RDF/XML (Pretty): 24-1.0380835-rdf.xml
RDF/JSON: 24-1.0380835-rdf.json
Turtle: 24-1.0380835-turtle.txt
N-Triples: 24-1.0380835-rdf-ntriples.txt
Original Record: 24-1.0380835-source.json
Full Text
24-1.0380835-fulltext.txt
Citation
24-1.0380835.ris

Full Text

DETECTION OF ENRICHED PATTERNS IN PROTEIN SEQUENCE DATA by  Theodore Gray Smith  B.S., The University of Arizona, 2016  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF  MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)  THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)  September 2019  © Theodore Gray Smith, 2019    ii   The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled:  Detection of Enriched Patterns in Protein Sequence Data  submitted by Theodore Gray Smith in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics  Examining Committee: Dr. Philipp Lange Supervisor  Dr. Tamara Munzner Supervisory Committee Member  Dr. Nobuhiko Tokuriki Supervisory Committee Member Dr. Adam Frankel Additional Examiner     iii   Abstract  Proteolysis is a form of post-translational modification consisting of the cleavage of a protein at the site of a peptide bond. This process is primarily mediated by a class of enzymes known as proteases, which exhibit varying specificity for the protein sequences they cleave. Although advances in proteomics have enabled sequencing of complex mixtures of proteins from biological samples, direct detection of protease activity remains challenging due to low protease abundance and the fact that observation of a protease is not always indicative of its activity level. Detection of proteolysis is therefore typically accomplished indirectly by observation of protease substrates in protein sequencing data. However, many proteases’ cleavage-site specificities are not well-understood, restricting the utility of supervised classification methods. We present a tool to overcome this limitation through unsupervised detection of overrepresented patterns in protein sequence data, providing insight into the specificities of the proteases contributing to a sample’s composition, even if the proteases themselves are poorly characterized. These patterns can be compared to those detected in sets of established protease substrate sequences, and patterns identified in both sets can be interpreted as indicators of mutual protease activity. Here we apply this methodology to the proteolytic cleavage event data in the MEROPS database, identifying specificity patterns corresponding to over 100 distinct proteases. The statistical validity of the algorithm is assessed through a series of tests on in silico data sets, and the performance of the algorithm is compared to alternative existing motif detection and clustering tools. Multiple clinical data sets are then analyzed using the algorithm, yielding patterns consistent with markers of both cancer and cellular response to chemotherapy treatment. The utility of the algorithm is then discussed in light of these findings, several potential use cases are presented, and possible future enhancements are proposed.   iv   Lay summary  Proteins are synthesized from the genetic code of all living things and are responsible for a vast assortment of essential biological functions. However, the genome alone does not fully explain protein diversity. Proteins are modified in several ways after synthesis, leading to a greater number of protein molecular states than protein encoding genes. One of these ways is through cleavage of proteins by enzymes called proteases. Despite continuous improvement in our ability to measure and sequence proteins, direct detection of this process remains challenging. To address this challenge, we present a new tool for the indirect detection of protease activity by statistically identifying patterns corresponding to the targets of specific proteases. This tool can be used to support the analysis of data sets from both clinical and laboratory environments, enhancing our understanding of proteins in biology and improving our ability to measure physiological changes in disease.                v   Preface  The concept for a project aimed at the indirect detection of protease activity in protein sequencing data through the analysis of substrate sequence features was initially proposed by Dr. Philipp Lange. Suggestions for modifications to the figures and visualizations produced by this tool were provided by Dr. Tamara Munzner. All development and analysis for this project was conducted by me, Theodore Smith, under the supervision and guidance of Dr. Philipp Lange.                   vi   Table of contents  Abstract ........................................................................................................................................................ iii Lay summary .............................................................................................................................................. iv Preface ......................................................................................................................................................... v Table of contents ....................................................................................................................................... vi List of tables ................................................................................................................................................ x List of figures ............................................................................................................................................. xi List of abbreviations .................................................................................................................................. xii List of equations ....................................................................................................................................... xiii Glossary ..................................................................................................................................................... xiv Acknowledgements ................................................................................................................................. xvii Dedication ................................................................................................................................................ xviii Chapter 1: Introduction .............................................................................................................................. 1 1.1 Proteins, proteomics, and post-translational modifications ....................................................... 1 1.2 Proteases and proteolysis ......................................................................................................... 2 1.3 Limitations of detection and analysis of proteolysis .................................................................. 3 1.4 Solution ..................................................................................................................................... 5 Chapter 2: Algorithm development and characterization .......................................................................... 7 2.1 Methods ..................................................................................................................................... 7 2.1.1   Software dependencies ................................................................................................... 7 2.1.2   Swiss-Prot data ............................................................................................................... 7 2.1.3   MEROPS data ................................................................................................................. 8 2.1.4   BLOSUM-62 data ............................................................................................................ 8 2.1.5   Default algorithm settings ................................................................................................ 9 2.2 Results ...................................................................................................................................... 9 2.2.1   Algorithm ......................................................................................................................... 9 2.2.1.1      Algorithm description and data flow ....................................................................... 9 2.2.1.2      Background frequencies ....................................................................................... 10 2.2.1.3      Compound residues ............................................................................................. 11 2.2.1.4      Input and preprocessing ....................................................................................... 13 2.2.1.5      Enrichment calculation ......................................................................................... 14 2.2.1.5.1       P-values ............................................................................................................. 14 2.2.1.5.2       Minimum frequency requirements ...................................................................... 17 2.2.1.5.3       Positional weighting ........................................................................................... 17 vii   2.2.1.5.4       Enrichment scores ............................................................................................. 18 2.2.1.6      Output ................................................................................................................... 18 2.2.1.6.1       Sequence clustermap ........................................................................................ 18 2.2.1.6.2       Protease pattern heatmap ................................................................................. 20 2.2.1.6.3       Pattern-matching sequence subset logo maps .................................................. 22 2.2.1.6.4       Matching sequence subsets for each pattern .................................................... 23 2.2.1.6.5       Tabular output .................................................................................................... 23 2.2.1.7      Code availability ................................................................................................... 23 2.2.2   Algorithm performance evaluation ................................................................................ 24 2.2.2.1      Methods ................................................................................................................ 24 2.2.2.1.1       Algorithm settings ............................................................................................... 24 2.2.2.1.2       In silico sequence sets ....................................................................................... 24 2.2.2.1.4       Calculation of F1 scores .................................................................................... 24 2.2.2.2      Investigation of minimum absolute frequency ...................................................... 25 2.2.2.2.1       Objective ............................................................................................................ 25 2.2.2.2.2       Methods ............................................................................................................. 27 2.2.2.2.3       Results ............................................................................................................... 28 2.2.2.3      Investigation of minimum percentage frequency .................................................. 29 2.2.2.3.1       Objective ............................................................................................................ 29 2.2.2.3.2       Methods ............................................................................................................. 30 2.2.2.3.3       Results ............................................................................................................... 31 2.2.2.4      Investigation of pattern frequency fold difference ................................................ 32 2.2.2.4.1       Objective ............................................................................................................ 32 2.2.2.4.2       Methods ............................................................................................................. 33 2.2.2.4.3       Results ............................................................................................................... 34 2.2.2.5      Investigation of number of detectable top-level patterns ..................................... 35 2.2.2.5.1       Objective ............................................................................................................ 35 2.2.2.5.2       Methods ............................................................................................................. 35 2.2.2.5.3       Results ............................................................................................................... 37 2.2.2.6      Investigation of false positive rate in randomly selected Swiss-Prot sequences . 38 2.2.2.6.1       Objective ............................................................................................................ 38 2.2.2.6.2       Methods ............................................................................................................. 39 2.2.2.6.3       Results ............................................................................................................... 40 2.2.3   Comparison to existing tools ......................................................................................... 41 2.2.3.1      Comparison to MoMo reimplementation of Motif-X .............................................. 41 viii   2.2.3.1.1       Objective ............................................................................................................ 41 2.2.3.1.2       Algorithm configurations..................................................................................... 42 2.2.3.1.3       Random Plasmodium falciparum sequences .................................................... 42 2.2.3.1.3.1       Methods ...................................................................................................... 43 2.2.3.1.3.2 Results ........................................................................................................... 43 2.2.3.1.4       Analysis of supplemental data 3 from Pease, et al., 2018 ................................. 43 2.2.3.1.4.1 Methods ......................................................................................................... 43 2.2.3.1.4.2 Results ........................................................................................................... 44 2.2.3.2      Comparison to Gibbs Clustering .......................................................................... 48 2.2.3.2.1       Objective ............................................................................................................ 48 2.2.3.2.2       Random Swiss-Prot sequences ............................................................................. 49 2.2.3.2.2.1 Methods ......................................................................................................... 49 2.2.3.2.2.2 Results ........................................................................................................... 49 2.2.3.2.3       Mixed MEROPS protease substrate sequences ............................................... 53 2.2.3.2.3.1 Methods ......................................................................................................... 53 2.2.3.2.3.1 Results ........................................................................................................... 56 2.2.3.2.4       Analysis of Supplemental Data 3 from Pease, et al., 2018................................ 60 2.2.3.2.4.1 Methods ......................................................................................................... 61 2.2.3.2.4.2 Results ........................................................................................................... 61 2.2.4   Characterization of MEROPS protease substrate data ................................................ 63 2.2.4.1      Methods used in the analysis of MEROPS protease substrate data ................... 63 2.2.4.2      Detected patterns ................................................................................................. 64 2.2.4.3      Protease substrate clustermaps ........................................................................... 66 2.2.4.4      Protease substrate heatmaps .............................................................................. 67 2.2.5   Applied analyses ........................................................................................................... 67 2.2.5.1      Frantzi et al., 2016: urothelial bladder cancer biomarkers ................................... 67 2.2.5.1.1       Description of data ............................................................................................. 67 2.2.5.1.2       Methods ............................................................................................................. 68 2.2.5.1.3       Results ............................................................................................................... 68 2.2.5.2      Wiita et al., 2014: detecting signatures of chemotherapy-induced apoptosis in blood ..................................................................................................................... 72 2.2.5.1.1       Description of data ............................................................................................. 72 2.2.5.1.2       Methods ............................................................................................................. 73 2.2.5.1.3       Results ............................................................................................................... 74 Chapter 3: Conclusion ............................................................................................................................. 79 3.1 Discussion ............................................................................................................................... 79 ix   3.1.2   Discussion of comparison to other tools ....................................................................... 79 3.1.2.1      Motif-X and MoMo ................................................................................................ 79 3.1.2.1.1       Motif-X and MoMo are unsuitable for analysis of proteolytic data sets ............. 79 3.1.2.1.2       Background frequencies .................................................................................... 80 3.1.2.1.3       Positional weighting ........................................................................................... 83 3.1.2.1.4       Compound residues ........................................................................................... 84 3.1.2.2      Gibbs clustering discussion .................................................................................. 85 3.1.2.2.1       Limitations due to unpredictability of number of clusters ................................... 85 3.1.2.2.2       Analysis of mixed MEROPS protease substrates .............................................. 86 3.1.2.2.3       Analysis of mixed MEROPS protease substrates .............................................. 87 3.1.3   Discussion of applied analyses ..................................................................................... 88 3.1.3.1      Frantzi et al., 2016 UBC biomarker panel ............................................................ 88 3.1.3.1      Wiita et al., 2014 chemotherapy-induced apoptosis peptide indicators ............... 89 3.2 Future directions...................................................................................................................... 89 3.3 Concluding remarks ................................................................................................................ 90 References ................................................................................................................................................. 91      x   List of tables  Table 1. Default compound residues. ......................................................................................................... 13                   xi   List of figures  Figure 1. Recursive Pattern Extraction Data Flow Diagram ....................................................................... 10 Figure 2. Example Clustermap.................................................................................................................... 19 Figure 3. Example Heatmap ....................................................................................................................... 21 Figure 4. Mean Pattern Detection Accuracy by Absolute Pattern Frequency ............................................ 28 Figure 5. Mean Pattern Detection Accuracy by Pattern Percentage Frequency. ....................................... 31 Figure 6. Mean Detection Accuracy for Two Patterns with Increasing Frequency Fold Difference. .......... 34 Figure 7. Mean Pattern Detection Accuracy by Number of Patterns .......................................................... 37 Figure 8. Analysis of False Positive Rate in Random Sequence Sets ....................................................... 40 Figure 9. Logo Maps for Patterns Detected in Pease Supplemental Data 3 .............................................. 45 Figure 10. Gibbs Clustering Results: KLD by Number of Clusters for Random Swiss-Prot Sequences. ... 50 Figure 11. Gibbs Clustering Results: Logo Maps for 15 Cluster Solution on Random Swiss-Prot Sequences .................................................................................................................................................. 52 Figure 12. Logo map of Caspase-6 Substrates from the MEROPS database ........................................... 54 Figure 13. Logo map of Cathepsin B Substrates from the MEROPS database ......................................... 54 Figure 14. Logo map of Matrix Metallopeptidase-9 Substrates from the MEROPS database ................... 55 Figure 15. Logo map of Mixed MEROPS Protease Substrate Sample ...................................................... 56 Figure 16. Gibbs Clustering Results: KLD by Number of Clusters for Mixed MEROPS Protease Substrate Sequences .................................................................................................................................................. 57 Figure 17. Gibbs Clustering Results: Logo Maps for 2 Cluster Solution on Mixed MEROPS Protease Substrate Sequences .................................................................................................................................. 58 Figure 18 Clustermap for Mixed MEROPS Protease Substrate Sample. ................................................... 59 Figure 19. Heatmap for Mixed MEROPS Protease Substrate Sample ...................................................... 60 Figure 20. Gibbs Clustering Results: KLD by Number of Clusters for Supplemental Data 3 from Pease et al., 2018....................................................................................................................................................... 62 Figure 21. Gibbs Clustering Results: Logo Maps for 1 Cluster Solution on Pease Supplemental Data 3 . 63 Figure 22. Number of proteases with Detected Pattern by Catalytic Type ................................................. 65 Figure 23. Number of Unique Patterns by Level of Protease Classification ............................................... 66 Figure 24. Clustermap for Frantzi et al., 2016 Table 5 ............................................................................... 69 Figure 25. Heatmap for Frantzi et al., 2016 Table 5 ................................................................................... 70 Figure 26. Clustermap for Frantzi et al., 2016 Table 3 ............................................................................... 71 Figure 27. Heatmap for Frantzi et al., 2016 Table 3 ................................................................................... 72 Figure 28. Wiita et al., 2014 Dataset S1 Patient Sample Clustermap. ....................................................... 74 Figure 29. Wiita et al., 2014 Dataset S1 Patient Sample Heatmap ............................................................ 76 Figure 30. Wiita et al., 2014 Dataset S3 Clustermap of Sequences Detected Using Inclusion List ........... 77 Figure 31. Wiita et al., 2014 Dataset S3 Heatmap of Sequences Detected Using Inclusion List ............... 78 Figure 32. Wiita et al., 2014 Dataset S1 Control Plasma Heatmap ............................................................ 78 Figure 33. Gibbs Clustering Results: Logo Maps for 3 Cluster Solution on Mixed MEROPS Protease Substrate Sequences .................................................................................................................................. 87    xii   List of abbreviations  ANN: Artificial Neural Network CDF: Cumulative Distribution Function CSV: Comma-Separated Values file format DNA: Deoxyribose Nucleic Acid FASTA: FAST-All file format JSON: JavaScript Object Notation file format KLD: Kullback-Leibler Divergence MMP: Matrix Metallopeptidase PTM: Post-Translational Modification RNA: Ribose Nucleic Acid SVM: Support-Vector Machine        xiii   List of equations  Equation 1. Effective expected residue frequency calculation .................................................................... 11 Equation 2. Effective expected compound residue frequency calculation .................................................. 12 Equation 3. Effective expected fixed position residue frequency calculation ............................................. 12 Equation 4. Regularized Incomplete Beta Function .................................................................................... 15 Equation 5. Incomplete Beta Function ........................................................................................................ 15 Equation 6. Beta Function ........................................................................................................................... 16 Equation 7. Binomial probability of X or More Successes .......................................................................... 16 Equation 8. Bonferroni correction ............................................................................................................... 16 Equation 9. Positional weighting calculation ............................................................................................... 17 Equation 10. Enrichment calculation ........................................................................................................... 18 Equation 11. Sequence similarity score calculation .................................................................................... 19 Equation 12. Cosine similarity ..................................................................................................................... 20 Equation 13. F1 Score Calculation ............................................................................................................. 25 Equation 14. Precision calculation .............................................................................................................. 25 Equation 15. Recall calculation ................................................................................................................... 25      xiv   Glossary  Artificial Neural Network (ANN) - A form of machine learning algorithm comprised of a series of interconnected nodes, inspired by the structure and organization of biological neural networks. Beta function - A special function, also known as the Euler integral of the first kind, with broad utility in probability theory and statistics. Bernoulli trial - An experiment with two possible outcomes and a constant probability associated with each outcome which is exactly the same for all trials. Binomial distribution - A discrete distribution describing the probability of observing exactly n successes out of N independent Bernoulli trials, given a probability p of success for each Bernoulli trial. Biomarker - An indicator of a biological state, condition, or activity.  Bonferroni multiple hypothesis testing correction - A statistical technique used to correct for overestimation of significance arising from multiple comparisons. Cleavage site specificity - The sequence motif(s) preferentially recognized by a protease as a target for peptide bond cleavage.  Clustermap - A specialized heatmap containing hierarchically-clustered rows and/or columns using a specified similarity function, distance metric, and cluster linkage function. C-terminal - One end of a protein sequence, characterized by an unbound carboxyl functional group. Cumulative distribution function (CDF) - The probability that, when evaluated at x, a particular random variable X will take a value equal to or less than x. Heatmap - A matrix visualization wherein values are represented with colored cells. Hierarchical clustering - A method of clustering defined by the construction of a hierarchy of clusters. The two primary types of hierarchical clustering are agglomerative, in which each element of a set is initially in a separate cluster, and divisive, in which each element of a set is initially in the same cluster. Agglomerative hierarchical clustering iteratively combines these clusters on a pairwise basis into larger clusters, while divisive hierarchical clustering seeks to segregate the initial cluster into small, more homogeneous sub-clusters. The hierarchies of both methods are informed by a similarity metric and a cluster linkage function. xv   Incomplete beta function - A generalization of the beta function, replacing the function’s upper bound of integration, ordinarily equal to 1, with a variable. Kinase - An enzyme that catalyzes the transfer of a phosphate from a high-energy donor molecule to a specific substrate.  Kullback-Leibler Divergence (KLD) - A comparative measure of the difference between two probability distributions, where one distribution serves as a reference against which the other is evaluated. Kullback-leibler Divergence is also known a relative entropy in the context of information theory. Logo map - A visual representation of the position-specific abundance of the characters in an aligned data set. The height of each character displayed on the figure encodes its relative abundance. Abundance may be represented in absolute or percentage terms, or in relative terms as compared to an expected baseline. Markov chain - A stochastic sequence in which the probability of an event is exclusively dependent on the state of the system produced by the previous event. Monte Carlo method - A computational algorithm characterized by repeated random sampling. Motif - A repeated pattern observed in sequence data. N-terminal - One end of a protein sequence, characterized by an unbound amino functional group. Peptide bond - A covalent bond formed between the carboxyl group of one molecule and the amino group of another molecule. These bonds are responsible for the polymerization of amino acids into peptides and proteins. Phosphopeptide - A modified peptide featuring one or more amino acid residues with a phosphate group covalently bound to its side chain. Plasmodium falciparum - A unicellular, protozoan parasitic organism responsible for malarial infection in humans. Plasmodium falciparum is transmitted through the bite of the Anopheles mosquito and is widely considered the deadliest parasite in humans. Post-translational modification - The chemical modification of proteins after their biological synthesis. These modifications serve a variety of biochemical functions including protein maturation, targeting, signaling, activation, and degradation. Post-translational modifications are often performed by specialized enzymes. xvi   Protease - An enzyme that catalyzes the hydrolysis of a peptide bond between two amino acids. These enzymes are produced by all organisms and viruses and exhibit various substrate and cleavage site specificities. Proteolysis - The cleavage of proteins into smaller units at the site of peptide bonds between amino acid residues. Proteome - The complete set of proteins associated with a particular organism. P-value - The probability that, given the null hypothesis of a statistical model is true, an equally or more extreme outcome will be observed. Regularized incomplete beta function - The cumulative distribution function of the beta distribution defined in terms of the beta function and the incomplete beta function. R group - The functional group attached to the alpha carbon of an amino acid. These groups, also known as side chains, are responsible for the differentiation of distinct amino acids and exhibit various biochemical properties. Support-Vector Machine (SVM) - A type of supervised machine learning characterized by the use of a non-probabilistic linear binary classification strategy.    xvii   Acknowledgements  I would like to thank my supervisor, Dr. Philipp Lange, for his extensive support and guidance throughout the completion of this project and my thesis.  I would also like to thank my committee members, Dr. Tamara Munzner and Dr. Nobuhiko Tokuriki, for helping to focus my research efforts as well as for providing invaluable insight into challenging topics during the development of this tool and composition of my thesis.  I am also grateful for the support of Dr. Adam Frankel, who generously offered his time and attention in order to serve as Chair of my Examination Committee.  I offer my sincere appreciation to the other members of the Lange Lab for their assistance in relating my work to the broader landscape of protein research and biology.  Finally, I acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding the first year of my studies through an NSERC-CREATE scholarship.          xviii   Dedication  I dedicate this thesis to Shireen and Gojira Anbardan, as well as to Anthony Frank Iommi, Terence Michael Joseph Butler, William Thomas Ward, and John Michael Osbourne. 1   Chapter 1: Introduction  1.1 Proteins, proteomics, and post-translational modifications Proteins are class of biological macromolecules consisting of one or more variable-length chains of amino acids. These chains are the indirect product of sequences of genetic code stored within the genome in the form of DNA. Firstly, DNA is transcribed into RNA through the activity of RNA polymerase and accompanying enzymes. RNA is similar to DNA in that both are comprised of sequences of nucleotides, however they differ in that RNA is less stable and much shorter-lived among other things. Secondly, RNA undergoes a series of processing operations conducted by a variety of enzymes (themselves proteins) ultimately producing mature mRNA sequences. Thirdly, either immediately or at some later point in time, mature mRNA is translated into protein by ribosomes through the recognition of three-nucleotide combinations, each of which corresponds to a particular amino acid via a many-to-one relationship. Due to the degeneracy of this code, protein translation is an irreversible process with respect to the loss of information involved in the process.1 Proteins were long thought to be the final functional products of gene expression, however advances in large scale protein research have revealed that this is far from a comprehensive description of protein space. It is now known that following translation proteins undergo a complex process of spatial folding and chemical modification. Consequently, while the number of human protein coding genes would imply a protein space of roughly 20,000 entities (known as canonical or genomic proteins), projections based on the number and frequency of these common processing operations places the total expected number of chemically-differentiable proteins in the low millions.2 In order to simplify discussion of this space, each distinct protein molecule has come to be referred to as a proteoform.3 The field of proteomics endeavors to characterize proteoform space primarily through sequencing of protein molecules. To date, the most effective means of protein sequencing is mass spectrometry coupled with a variety of biochemical methods intended to isolate and enrich particular types of proteins or their components in vitro.4 Throughout the history of proteomics research, a number of specific chemical processes have been identified which lead to the diversification of proteoforms. These processes are collectively known as post-translational modification.5  2   1.2 Proteases and proteolysis Proteolysis is a form of post-translational modification involving the cleavage of a peptide bond between two amino acids within a protein. While it is possible for this reaction to occur spontaneously, proteolysis is overwhelmingly catalyzed by a class of enzymes known as proteases (also referred to as peptidases). Proteases are found in all organisms and viruses, and the diversity of this space is the result of both convergent and divergent evolution.6  Proteases are broadly classified into seven distinct groups by catalytic type. These types include serine proteases, cysteine proteases, aspartic proteases, glutamic proteases, threonine proteases, metalloproteases, and asparagine peptide lyases.7 These catalytic types are defined by the biochemical feature, generally speaking an amino acid or metallic cofactor, responsible for catalyzing the proteolytic reactions they mediate. Proteases have been further categorized based on a variety of factors including biochemical activity and evolutionary factors. Perhaps the most noteworthy and widely recognized of these additional categorization systems are the families described in the The Welcome Trust Sanger Institute’s MEROPS database.8 MEROPS was first released in 2010 and contains a substantial wealth of data and knowledge related to proteases, the contexts in which they are found, and the reactions they mediate. MEROPS subdivides proteases in families based on the statistical similarity of their amino acid sequences, thus employing an evolutionary perspective for protease classification.8 Protease specificity is mediated by several levels of sequence structure. At the primary level, substrates must feature a site that can be hydrolyzed by the catalytic motif of a protease. At the secondary, tertiary, and quaternary levels, more complex interactions arise including spatial exclusion of incompatible substrates, stabilization or modulation of substrate spatial arrangement to improve the rate and favorability of catalysis, and chemical interactions involving protease residues far from the catalytic site in terms of primary sequence. By grouping proteases on the basis of sequence similarity rather than on catalytic type alone, MEROPS captures meaningful commonalities between proteases including these longer-range factors influencing protease specificity. Proteolysis has broad biological implications, serving crucial functions in normal cellular development, maintenance, and death. In particular, proteolysis is largely responsible for functions 3   related to protein processing and maturation, as well as protein degradation. These processes are tightly regulated in intricate networks of protease activity and, as is true for many sophisticated biological networks, these interactions are frequently perturbed in disease.9 Diseases exhibiting known dysregulation of protease activity range from Alzheimer’s disease to cancer to gastrointestinal dysfunction.10,11,12 Due to the widespread implication of proteolytic activity in both healthy and pathological physiology, there is a great deal of interest in detecting and analyzing the activity of proteases.  1.3 Limitations of detection and analysis of proteolysis Although advances in proteomics have enabled sequencing of complex mixtures of proteins from biological samples, direct detection of protease activity remains challenging. This is due in part to the relatively low abundance of protease material in physiological samples as compared to a variety of other protein classes. At least as important as this limitation, however, is the poor correspondence between protease abundance and activity. Even when proteases can be detected, it is not always clear that they are in an active state. Additionally, the protease responsible for the cleavage of a given substrate may not be co-localized with the substrate at the time of sequencing. This limitation may arise as a result of the physiological transport of the protease, the substrate, or both. Alternatively, it may arise due to biochemical procedures involved in sample preparation. Due to these limitations, detection of proteolytic activity is typically accomplished indirectly by observation of protease substrates. One strategy for accomplishing this indirect detection involves statistically predicting putative protease cleavage sites based on observation of known cleavage events corresponding to a type of protease. By training an artificial neural network (ANN), support vector machine (SVM), or similar model to recognize the sequence features characteristic of cleavage sites in the known substrates of a protease, those models may in turn be used to predict candidate cleavage sites in a sample of protein sequences.13 In the event that the sample contains the products of proteolysis, either a subset or the complete set of those predicted cleavage sites would be expected to correspond to one or both termini of the sequences. This approach can therefore hypothetically be used to infer proteolytic activity which contributed to the composition of the sample. Two popular examples of this strategy are PROSPERous and Proteasix,14,15 4   both of which support the prediction of cleavage sites corresponding to a large number of proteases with varying degrees of accuracy and comprehensiveness. While this approach has proven effective in certain contexts for well characterized proteases, many proteases lack reliable cleavage event data. This principle restricts the utility of supervised classification methods to proteases which have already been thoroughly investigated, and fails to account for the activity of less-understood proteases. Moreover, cleavage event data have been compiled over a substantial period of time via a shifting landscape of methodologies which vary drastically in both comprehensiveness and reliability. Of great concern is that some portion of these data are no longer current, relevant, or accurate. By relying on these data to train supervised learning algorithms, bias in protease specificity knowledge is compounded and propagated. In addition to supervised clustering, there is precedent for the use of sequence pattern detection for the analysis of post-translational modifications. Most notably, Motif-X has been utilized for several years to analyze samples containing a single, fixed central residue which is the target of a covalent modification such as phosphorylation.16,17 This algorithm was recently re-implemented in a tool called MoMo, which employs similar methods but supports the analysis of data sets containing multiple fixed central residues, executes a more accurate probability calculation, and takes an alternative opinionated view with respect to background frequency calculation.18 Although these motif detection tools are promising resources for the unbiased analysis of PTM data centered around a modified residue, neither is suitable for the analysis of proteolytic data sets. Unlike other post-translational modifications, proteolytic events are centered around a bond between two residues rather than around a central residue which undergoes a chemical modification. Moreover, the amino acids found adjacent to proteolytic cleavage sites are highly variable and depend on the specific protease responsible for a cleavage event. Proteolytic data sets therefore lack the fixed central residue required by these tools, even if the sequence alignment is shifted in order to center the sequences around a residue position rather than the cleavage site.  5   1.4 Solution To overcome these challenges, we have developed a new tool which detects statistically overrepresented patterns of positional amino acids in pre-aligned protein sequence data. This is accomplished by recursively computing the binomial probability of observing at least as many occurrences of each amino acid in every position of an input data set. Probabilities are calculated relative to expected amino acid frequencies, as computed from a background proteome (the Swiss-Prot human proteome by default). Sequence sets exactly matching the detected patterns are then made available, along with logo maps which display the positional composition of the subsets, thus directly de-multiplexing data sets. Sequences are also clustered based on positional similarity to the detected patterns, thus providing a strategy for unbiased and unsupervised sequence clustering in order to aid in interpretation of dominant sequence features contributing to sample composition. To produce an additional annotated output from the algorithm, detected patterns are compared to a database of pre-computed patterns produced by performing the same analysis on the sets of well-established protease substrates available in the MEROPS database. These analyses utilized background frequencies computed from the Swiss-Prot human proteome,19,20 selected to serve as a baseline of comparison for human-derived samples. MEROPS substrate patterns can be optionally recomputed using a user-specified background dataset. This provides a biased interpretation layer based on the current state of protease specificity knowledge without detracting from the advantages of an unbiased, sample-driven clustering and decomposition process. While superficially similar alternative sequence pattern detection algorithms such as Motif-X and MoMo, this tool implements several novel features including, but not limited to: (1) A generalized workflow circumventing the need for a fixed central residue: Without this feature neither Motif-X nor MoMo is natively capable of analyzing proteolytic data sets due to the fact that proteolysis is bond-centered rather than residue-centered. (2) A positional weighting term in the enrichment calculation: This term emphasizes low diversity positions, thus incorporating global sample features into its clustering procedure. (3) Dynamic background frequency adjustment: Individual amino acid frequencies are averaged across a background proteome and corrected for the removal of amino acids from the foreground. (4) Dynamic multiple testing correction: P-values are Bonferroni corrected based on the total 6   possible number of residues that could be detected at a particular point in the data flow of the algorithm. This ensures scaling of the p-value threshold stringency in proportion to the true number of hypothesis tests and supports stable performance on samples of a large range of sizes and complexities. (5) Implementation of compound residues: In addition to detecting enrichment of individual amino acids, this algorithm also detects cumulative enrichment of groups of biochemically similar amino acids in the same position. The development and specifications of this algorithm are described in detail. Additionally, a number of analyses designed to test the accuracy and limitations of the algorithm are presented and discussed. The outputs of this algorithm are then compared to several existing tools including Motif-X, MoMo, and Gibbs clustering 21 (an alternative unsupervised clustering algorithm used for the analysis of biological sequence data). Next, the results of a large-scale analysis of the protease substrate data in the MEROPS database are considered. Finally, the tool is applied to an experimental data set in order to evaluate its utility in a real-world analysis of biological data.             7   Chapter 2: Algorithm development and characterization  This chapter addresses the design and development of the algorithm as well as the analyses conducted in order to evaluate the performance of the algorithm on a variety of in silico and biological data sets. These analyses included a set of experiments designed to characterize the limitations of the algorithm using several challenging input data sets, comparison of the tool to two alternative motif detection algorithms, Motif-X and MoMo, and comparison of the tool to an alternative Markov chain Monte Carlo method, Gibbs clustering, for clustering of sequence data. Additionally, the results of an extensive analysis of the proteolytic cleavage events contained in the MEROPS database are discussed in detail, and the performance of the algorithm is tested on multiple data sets containing a mixture of substrates for several proteases in MEROPS. Finally, the algorithm is applied to a real-world data set in order to address the practical utility of the algorithm.  2.1 Methods  2.1.1   Software dependencies All custom software was developed using the Python (release 3.7.2) programming language. A number of Python packages are utilized by this tool, and a full list of Python package dependencies is available in the GitHub repository for this tool. During development, these dependencies were maintained through a virtual environment to ensure consistency and replicability of results. Additionally, this tool relies on a MySQL (MariaDB) database for persistent storage of a copy of the MEROPS database as well as several supporting tables containing data related to pre-computed protease substrate patterns.  2.1.2   Swiss-Prot data Context protein sequences were imported from an offline copy of the Swiss-Prot human proteome stored in FASTA format.19,20 These sequences are used for the extension of input peptides by mapping the peptides to the proteome using a combination of protein accession numbers and regular expressions. 8   The Swiss-Prot context proteome is also used by the algorithm to calculate expected amino acid frequencies by default. An alternative background data set may be supplied by the user to calculate expected amino acid frequencies if they so choose.  Due to the fact that the Swiss-Prot database is regularly updated with new and revised protein sequences, the stored offline copy used by this algorithm will be updated automatically at scheduled intervals when the algorithm is released to a production environment.  2.1.3   MEROPS data An offline copy of the MEROPS database version 12.0 was used throughout the development of this algorithm.8 The proteolytic cleavage event data stored in MEROPS were used to generate data sets of protease substrates. The algorithm was used to detect enriched patterns in each of these protease substrate data sets and, when patterns were detected, store those patterns in a supplementary table. Proteases analyzed for patterns were restricted to those flagged as “human active,” in the database, and synthetic sequences were excluded from the analyses. MEROPS protease substrate patterns were then used in the generation of the heatmaps produced as a default output of the tool, providing a layer of protease activity knowledge annotation for the results. As MEROPS is an active and expanding database, the offline copy used by this tool will periodically be rebuilt (including computation and storage of protease substrate patterns) when a new version of MEROPS becomes available.  2.1.4   BLOSUM-62 data This tool relies on a copy of the BLOSUM-62 substitution probability matrix.22 This matrix is used to calculate the similarity of each sequence in an input data set to each enriched pattern detected within that data set. The matrix is stored in text file format with substitution probabilities encoded in proportion form.  9   2.1.5   Default algorithm settings By default, this algorithm enforces an uncorrected p-value threshold of 0.001, while the corrected p-value threshold is dynamically calculated and a minimum of 2 occurrences of each detected pattern in the input data set. Additionally, the tool supports the detection of compound residues by default. These compound residues aggregate multiple amino acids into a set of residue groups within the same position of the aligned input data set, where the amino acids in each group are biochemically similar.  2.2 Results  2.2.1   Algorithm  2.2.1.1      Algorithm description and data flow The fundamental function of this tool is the detection of overrepresented (enriched) patterns of positional residues in aligned sequence data. This is accomplished by identifying statistically over-represented (enriched) residues in each position of the input data set. The algorithm then sequentially selects the most enriched positional residue, removes all sequences featuring that positional residue from the data set, recompiles the statistical background, and recomputes the foreground statistics not including the removed positional residues. This set reduction operation facilitates the detection of many over-represented residues, even in cases where one or more positional residues are so abundant that they suppress the statistical significance of other positional residues when p-values are calculated simultaneously. In addition to recomputing enrichment scores for the main data set, the algorithm also computes scores for each new subset of sequences defined by an enriched positional residue and performs the same selection and set reduction operations based on those scores. In this manner, the algorithm recursively constructs a set of patterns consisting of one or more fixed positional residues, where each fixed positional residue represents a statistical over-representation in a subset of the aligned input sequences. This workflow is depicted in Figure 1. 10    Figure 1. Recursive Pattern Extraction Data Flow Diagram. The data flow of this algorithm consists of several stages. First, enrichment scores are calculated for each positional residue. If one or more statistically enriched positional residues are detected, the one with the highest score is selected and isolated from the remainder of the data set. This process is repeated on each new subset of the data, iteratively constructing a set of patterns described by fixed positional residues which represent highly over-represented positional residues in the data.  2.2.1.2      Background frequencies Patterns are detected by this algorithm on the basis of overrepresented positional residues. In order to determine whether a positional residue is overrepresented, the expected frequency of that residue in a data set of a given size must first be estimated. This is accomplished through the compilation of a statistical background to which the residue composition of the foreground data set is compared. By default, this tool utilizes the Swiss-Prot human proteome for calculation of expected amino acid background frequencies. However, user-specified background data sets are also supported. Background data sets may be pre-computed by the user and supplied as either a CSV- or JSON-formatted list of expected amino acid frequencies. Alternatively, the user may opt to provide a context proteome in FASTA format.  11   To estimate expected amino acid background frequencies, the percentage frequency of each amino acid in the context proteome is calculated. These percentage frequencies are then stored in a vector of length r (where r is the number of distinct amino acids in the context proteome). A fundamental operation of this algorithm involves identifying a position/residue pair that is significantly overrepresented relative to what is expected based on amino acid background frequencies from a context proteome. Sequences featuring this overrepresented position/residue pair are isolated and removed from the remainder of the foreground dataset, thus eliminating said residue from the foreground. It is therefore necessary to compensate for this removal by eliminating that residue from the background as well. Therefore, following each removal of a positional residue, the effective background frequencies are recalculated for each position. The calculation of the effective positional background frequency for each residue is:  Equation 1. Effective expected residue frequency calculation:   𝑓𝑒𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒 =𝑓𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑100 − ∑ 𝑓𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑𝑖𝑟𝑖=1  ● f: percentage frequency of residue in background frequency data set ● r: number of residues eliminated from positional foreground and background  The results of the positional background frequency adjustment are stored in a matrix of shape m x r (where m is the length of the input sequences and r is the number of distinct residues in the background data set). These position-specific, effective expected residue frequencies are used for all subsequent calculations.  2.2.1.3      Compound residues A unique feature of this tool is the incorporation of compound residues. Compound residues consist of groups of two or more amino acids which feature similar biochemical features. These features may be structural, electronic, or functional in nature. Inclusion of these groups allows for the detection of enriched biochemical features in a foreground data set, even when the amino acids comprising a compound residue group are not individually overrepresented. 12   The expected frequency of a compound residue is calculated as the sum of the expected frequencies of its constituent amino acids, as exhibited in Equation 2.  Equation 2. Effective expected compound residue frequency calculation: 𝑓𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑 =∑𝑓𝑒𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒𝑖𝑐𝑖=1 ● c: Number of compound residue constituents  As with individual amino acids, it is necessary to compensate for the set reduction operation of the algorithm on a positional basis. To address this, the sum of expected frequencies is taken after the positional background frequency correction previous described. Furthermore, it is possible that following the set reduction operation a subset characterized by a compound residue in a particular position may contain an enrichment of a more specific compound residue or individual amino acid(s) in the same position. To allow for further decomposition of these subsets, the effective expected residues of fixed positions are also recalculated with each iteration of the algorithm. The calculation of the expected frequency of each residue in a fixed position consists of that residue’s background frequency divided by the cumulative background frequencies of all residues possible in that position. For a single amino acid, therefore, its effective expected frequency is calculated as 100% in that position. Therefore, no further decomposition of that position will occur. This formula is expressed in Equation 3.  Equation 3. Effective expected fixed position residue frequency calculation: 𝑓𝑓𝑖𝑥𝑒𝑑 = 1.0 − ∑𝑓𝑒𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒𝑖𝑐𝑖=1  Alternatively, for a compound residue, the effective expected residue frequency of each constituent amino acid (as well as of other, more specific, compound residues) will be less than 100% 13   and proportional to that amino acid’s background frequency. Therefore, if a subset characterized by a compound residue contains a disproportionately high frequency of a particular amino acid in that same position, the subset featuring that overrepresented amino acid will be isolated. By default, the algorithm uses a list of five compound residue groups. These groups represent the historical biochemical categories of the 20 common amino acids appearing in most proteomes. Information regarding these compound residues including the arbitrary codes used to represent the groups by this tool, their biochemical descriptions, and the constituent amino acids belonging to each group can be seen in Table 1.   Compound residue code Biochemical description Constituent amino acids 1 Nonpolar, aliphatic R groups G, A, V, L, M, I 2 Aromatic R groups F, Y, W 3 Polar, uncharged R groups S, T, C, P, N, Q 4 Positively charged R groups K, R, H 5 Negatively charged R groups D, E Table 1. Default compound residues. Table of default compound residue groups used by our algorithm, complete with the symbol used to encode the groups in patter string representations and visualizations, biochemical descriptions of the groups, and individual amino acid constituents of the groups.  2.2.1.4      Input and preprocessing This tool accepts a number of foreground data set formats as input. These formats include pre-aligned sequences of equal width which can be supplied either as a text file, a FASTA file, or a list of 14   sequences in a text field. Alternatively, unaligned peptides can be supplied as a text file, FASTA file, or as a list in a text field. If unaligned peptides are supplied, a row-match list of Swiss-Prot protein IDs for each peptide must also be supplied. Finally, peptides and accompanying protein IDs may be supplied in the form of a MaxQuant evidence.txt file.23 Pre-aligned sequences may be of even or odd length, however all sequences in the input data set must be of equal length. Support for both even- and odd-length input sequences is intended to allow for the analysis of multiple types of post translational modifications as these modifications may be centered around a particular residue (such as phosphorylation) producing odd-length sequences, or around a particular bond (such as proteolysis) producing even-length sequences. If unaligned peptides are supplied, the tool will attempt to extend each peptide using the sequence of the corresponding protein and will truncate the resulting sequences to a specified width centered on the cleavage site. Peptides can be extended in the N-terminal direction, the C-terminal direction, or in both directions. Extension in both directions enables analysis of intact peptides, which feature up to two proteolytic cleavage sites. Following the extension and alignment stages of preprocessing, the foreground sequence data set is converted to a binary tensor of shape n x m x r (where n is the number of input sequences, m is the length of the sequences, and r is the number of distinct residue codes). Importantly, the order of values on the r-axis of the tensor is matched to the order of values in the background frequency vector. This tensor is used for all subsequent operations on the foreground data set.   2.2.1.5      Enrichment calculation  2.2.1.5.1       P-values The core operation of this tool is the detection of statistically overrepresented positional residues in a set of input sequences relative to the expected frequencies of those residues. In order to estimate significance, a p-value is calculated for each positional residue. Each individual observation of a positional residue in the aligned data set is treated as a discrete, independent event, wherein the observation of a positional residue in one input sequence does not alter the expected probability of 15   observing the same positional residue in another sequence. P-values are therefore calculated according to the binomial distribution and reflect the probability of observing as many or more occurrences of a positional residue in a sample of size n. To most accurately approximate these values, this tool utilizes SciPy’s implementation of the regularized incomplete beta function (Equation 4) which is defined in terms of the incomplete beta function (Equation 5) and the beta function (Equation 6). The regularized incomplete beta function is the cumulative distribution function (CDF) of the beta distribution. The CDF of the beta function is related to the cumulative distribution function of the binomial distribution by the following formula where p is equal to the probability of success (the expected residue frequency in this application) and n is equal to the sample size of the subset of sequences corresponding to a particular iteration of the recursive pattern detection algorithm. Therefore, the regularized incomplete beta function directly provides the probability of observing an equal or greater number of occurrences of a positional residue based on the binomial distribution. This approach avoids floating point rounding errors which arise when attempting to calculate the complement of the binomial CDF for very small p-values, allowing accurate calculation of p-values down to approximately 5e-324.  Equation 4. Regularized Incomplete Beta Function: 𝐼𝑥(𝑎, 𝑏)  =  𝐵(𝑥;  𝑎, 𝑏)𝐵(𝑎, 𝑏)  Equation 5. Incomplete Beta Function: 𝐵(𝑥;  𝑎, 𝑏)  =  ∫ 𝑡𝑎−1(1 − 𝑡)𝑏−1𝑥0𝑑𝑡    16   Equation 6. Beta Function: 𝐵(𝑎, 𝑏)  =  ∫ 𝑡𝑎−110(1 − 𝑡)𝑏−1𝑑𝑡 ● x: Upper bound of integration. Set equal to expected frequency when used to calculate binomial probability. ● a: Number of successes plus one when used to calculate binomial probability. ● b: Sample size minus number of successes when used to calculate binomial probability.  Equation 7. Binomial probability of X or More Successes: 𝑃𝑟(𝑋 ≥  𝑘)  =  𝐼𝑝(𝑘 + 1, 𝑛 − 𝑘) ● n: Sample size ● k: Positional residue frequency  Since multiple positional residue p-values are calculated and compared during each iteration, it is necessary to perform multiple testing correction when enforcing the p-value threshold. This is accomplished using Bonferroni correction, as expressed in Equation 8, where m is adjusted dynamically based on the number of residues that could be observed in a position.24  Equation 8. Bonferroni correction:  𝛼𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑 =𝛼𝑢𝑛𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑𝑚  Following multiple testing correction, a maximum p-value threshold is enforced. Positional residues with p-values equal to or below this threshold (𝛼𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑  from Equation 8) are deemed significantly overrepresented, while positional residues with p-values above this threshold are deemed insignificant. The p-value threshold is set to 0.001 uncorrected by default, however this setting is parameterized and may be specified by the user. In most samples, this default settings results in a correct p-value threshold on the order of 10^-6, which is consistent with the p-values selected for use in alternative motif detection algorithms.  17   2.2.1.5.2       Minimum frequency requirements In addition to a p-value threshold, this tool also supports the optional enforcement of minimum frequency and minimum fold difference thresholds. Minimum frequency is enforced simply by counting the number of occurrences of a positional residue, while minimum fold difference is enforced by comparing the percentage frequency of a positional residue in the foreground dataset to the expected percentage frequency of that positional residue. The expected percentage frequency is equivalent to the positionally-corrected background frequency of the residue. By default, the minimum frequency threshold is set to two occurrences and the minimum fold difference threshold is set to 1.0. Both of these thresholds are parameterized and may be specified by the user according to their needs.   2.2.1.5.3       Positional weighting Enrichment scores calculated by this algorithm depend not only on the significance of the observed frequency of a positional residue in the context of its expected frequency, but also on a positional weighting factor. This term of the enrichment calculation favors low diversity positions by generating an increasing large term as the number of under-represented residues from the background data set, as well as the fold difference of their under-representation relative to their expected frequencies increases. The formula for this calculation is expressed in Equation 9.  Equation 9. Positional weighting calculation: 𝑊𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 = −𝑙𝑜𝑔(   1.0 −(  ∑ (1.0 −𝑓𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑𝑖𝑓𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑖)𝐴𝑖=1𝐵)  )    ● A: Number of under-represented residues in a position ● B: Number of possible residues in a position  Positional weighting serves two main purposes. Firstly, it serves to contribute additional order to the clustering of enriched positional residues in the foreground, based not only on the statistical 18   enrichment of those residues individually but also on the structure of the data set across residues. Secondly, it serves as potential tie breaker when two positional residues exhibit the same p-value.  2.2.1.5.4       Enrichment scores Enrichment scores as calculated by this algorithm consist of the product of two terms. The first term of the enrichment score formula is the negative natural logarithm of a positional residue’s p-value. The second term of the enrichment score formula is the positional weight corresponding to the residue’s position in the aligned data set. This formula is expressed in Equation 10.  Equation 10. Enrichment calculation: 𝐸𝑟𝑒𝑠𝑖𝑑𝑢𝑒 = 𝑊𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛  ×  −𝑙𝑜𝑔(𝑝𝑟𝑒𝑠𝑖𝑑𝑢𝑒)  2.2.1.6      Output The default output of this algorithm consists of five main components. These include (1) a clustermap of the aligned input sequences against any patterns detected by the algorithm, (2) a heatmap indicating the frequency of each detected pattern in both absolute and percentage terms as well as a fitness score comparing detected patterns against pre-computed patterns from the MEROPS database, (3) a set of logo maps depicting the positional amino acid composition of the aligned sequence subsets exactly matching each detected pattern, (4) a text file containing the subset of aligned sequences exactly matching each detected pattern, and (5) a tabular output file containing the sequences provided in the input data set in their original order along with additional data produced by the algorithm. Each of these output components is subsequently described in detail.  2.2.1.6.1       Sequence clustermap  Clustering of the input data set based on the patterns detected by the algorithm is visualized in the form of a hierarchically clustered heatmap, shown in Figure 2. This heatmap consists of columns corresponding to each sequence in the aligned input data set and rows corresponding to each pattern detected by the algorithm. 19    Figure 2. Example Clustermap. An example of the default clustermap produced by this tool including labels indicating the cleavage site-aligned pattern labels, sequence and pattern dendrograms, and color mapping scale.  Cell values represent similarity scores between each sequence and each pattern. These similarity scores represent the sum of the positional substitution probabilities for pattern and sequence residues, where positional substitution probabilities are drawn from the BLOSUM-62 substitution probability matrix. BLOSUM-62 is a broadly used substitution matrix which is computed from the pairwise substitution frequencies of each amino acid in the context of locally aligned, highly evolutionarily conserved protein sequences.22 The similarity scores generated in this step are therefore not contingent on prior knowledge of protease activity, and the resulting clustermap is minimally biased with respect to the context of proteolysis. This scoring function is represented mathematically in Equation 11 below.  Equation 11. Sequence similarity score calculation:  𝑆𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒 =∑ 𝑓𝐵𝐿𝑂𝑆𝑈𝑀𝑠𝑚𝑝𝑚𝑚𝑖=1𝑚  ● m: Number of fixed positions in pattern ● fBLOSUM62: Substitution frequency of the foreground sequence positional residue and enriched pattern positional residue ● s: Foreground sequence ● p: Enriched pattern 20   Clustering is performed using SciPy’s hierarchical (agglomerative) clustering algorithm. Pairwise distances are computed using SciPy’s calculation of the cosine similarity distance metric (Equation 12) and agglomeration order is determined using SciPy’s “complete” method which employs the Farthest Point / Voor Hees Algorithm.25 This algorithm determines clustering order by selecting the two clusters with the smallest maximum pairwise distance between the vectors in each cluster. The clustermap is then rendered using Python’s Matplotlib package.26  Equation 12. Cosine similarity: 𝑐𝑜𝑠(𝜃) =𝐴 ∙ 𝐵|𝐴||𝐵|  2.2.1.6.2       Protease pattern heatmap In addition to the sequence clustermap, this tool also generates a heatmap consisting of similarity scores comparing the patterns detected in an input data set (hereafter referred to as foreground patterns) to patterns detected in protease substrate sequence sets drawn from the MEROPS database (hereafter referred to as protease patterns). These patterns are pre-computed using the same the same data flow described in Section 2.2.1.1 in and are stored in the manner described in Section 2.1.3. An example heatmap is displayed in Figure 3, below. 21    Figure 3. Example Heatmap. An example of the default protease pattern heatmap produced by this tool including labels indicating the cleavage site-aligned pattern labels, pattern frequency data, protease annotations, and color mapping scale.  The heatmap is oriented such that foreground patterns are plotted as rows of the matrix and MEROPS proteases are plotted as columns of the matrix. The label for each pattern contains the pattern itself, formatted as fixed width positions centered on the cleavage site, as well as the absolute and percentage frequencies of that pattern in the aligned input sequence set. The label for each protease consists of the full name of the protease in the MEROPS database, the MEROPS family to which the protease belongs, and the catalytic type of the protease. All MEROPS proteases with associated patterns are included on the x-axis, and the proteases are sorted alphabetically, first by catalytic type, then by family, and finally by individual protease. This configuration as well as the annotations provided for both patterns and proteases are intended to facilitate easy comparison of foreground-to-protease pattern mappings between samples. Similarity scores are calculated as the fraction of positions in a foreground pattern that are shared by a protease pattern. The equivalent positions of a foreground pattern and a protease pattern are considered to match if one of four conditions is met. Firstly, a position matches if both patterns contain the 22   same amino acid in that position. Secondly, a position matches if both patterns contain the same compound residue in that position. Thirdly, a position matches if that position in the foreground pattern contains an amino acid which is a constituent of a compound residue in the equivalent position of the protease pattern. Fourthly, a position matches if the constituents of a compound residue in that position of the foreground pattern are a strict subset of the constituents of a compound residue in the equivalent position of the protease pattern. By matching in this fashion, a foreground pattern is evaluated as being either wholly or partially explainable by a protease pattern provided the foreground pattern is at least as specific as the protease pattern. In other words, a foreground pattern may have a greater number of fixed positions, as well as a narrower set of residues comprising a fixed position than a protease pattern. A protease pattern, on the other hand, may only be at most as specific as a foreground pattern for the two patterns to be evaluated as a whole or partial match. Foreground pattern and protease axis labels are highlighted in red when whole pattern matches are detected. This logic is reflective of the purpose of the heatmap, which is to provide a layer of biological annotation to the output of the tool in order to aid in interpretation of results in the context of proteolysis. A protease is expected to preferentially cleave sequences matching its cleavage site specificity. However, activity of a protease would not be expected to diminish as a result of sequence features which are not relevant to its cleavage site specificity. Therefore, sequences belonging to the subset of an input data set matching a detected foreground pattern which matches the cleavage site specificity of a protease are putatively attributable to the activity of that protease, even if the subset also exhibits an overrepresentation of specific positional residues which are not relevant to that protease’s cleavage site specificity.  2.2.1.6.3       Pattern-matching sequence subset logo maps Logo maps are used to visualize the amino acid distribution of each subset of aligned input sequences exactly matching one of the detected patterns. These figures are generated using the Python implementation of WebLogo.27 23   2.2.1.6.4       Matching sequence subsets for each pattern The complete subset of aligned input sequences exactly matching each detected pattern is provided as a text file. This decomposition of the input data set based on patterns facilitates follow up analyses on subsets of the data corresponding to features of interest based on the demands of a user and their analytical objectives.  2.2.1.6.5       Tabular output A summary, tabular output file is generated for each input data set supplied to this tool. This file consists of a column containing the original input sequences in their native order,  a column containing the protein accession number associated with the sequence (empty if a pre-aligned data set was supplied by the user), an additional column for each position of the aligned sequence set analyzed by the algorithm, and an additional column for each pattern detected by the algorithm (populated with either a 1 if the sequence on that row exactly matches the pattern on that column, or a 0 if the sequence on that row does not match the pattern on that column). The inclusion of this file in the output data set is intended to support the integration of this tool into a more comprehensive workflow. In essence, this file serves as an augmented version of the input data set, representing the data in its original arrangement along with the additional information generated by the algorithm.  2.2.1.7      Code availability Pending submission for publication, this tool will be made available as a web application and source code will be made available as a Python package. The interface for the web application will be hosted on a BCCHR Oracle APEX server, while the application logic will be hosted locally on a Linux server.  24   2.2.2   Algorithm performance evaluation The accuracy of this algorithm was assessed through a series of analyses, each intended to evaluate a characteristic of the algorithm’s performance. The specific objective, methods, and results of each of these experiments are discussed in detail.  2.2.2.1      Methods  2.2.2.1.1       Algorithm settings Default values were used for all tool parameters in all experiments described in this Section, unless otherwise specified.  2.2.2.1.2       In silico sequence sets Where in silico sequence sets were used, these sets were constructed using custom Python scripts which have been made available in the testing Section of the GitHub repository associated with this tool. The non-fixed positional amino acids comprising these sequence sets were drawn randomly using NumPy,28 with the probability of drawing a particular amino acid corresponding to the expected background frequency of that amino acid in the Swiss-Prot human proteome.  2.2.2.1.4       Calculation of F1 scores F1 scores were used as a metric of algorithm accuracy in several of the experiments described below. F1 scores estimate accuracy as the harmonic mean of precision and recall, and were averaged across 100 replicates for each experimental condition in all analyses. These scores were computed using Python using the formula described in Equation 15 and the associated code has been made available in the testing Section of the GitHub repository for this tool.    25   Equation 13. F1 Score Calculation: 𝐹1 = 2 × (𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙)  Equation 14. Precision calculation: 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =𝑡𝑝𝑡𝑝 +  𝑓𝑝  Equation 15. Recall calculation: 𝑅𝑒𝑐𝑎𝑙𝑙 =𝑡𝑝𝑡𝑝 +  𝑓𝑛  ● tp: Number of true positives ● fp: Number of false positives ● fn: Number of false negatives  2.2.2.2      Investigation of minimum absolute frequency  2.2.2.2.1       Objective Due to the fact that this tool uses a p-value threshold to determine whether or not a particular positional residue is significantly enriched, the true positive rate in samples containing only sequences featuring a single pattern is anticipated to be predictable and ordered based on the expected background frequencies of each residue. Relative to all residues in the background frequency set, the least common residue will be counted significant in the smallest sample where one hundred percent of the sequences in that sample contain said residue in the same position. Conversely, the most common residue will require the largest sample containing only sequences featuring that residue in the same position in order to be counted as significantly enriched. All other residues fall between the least common residue and most common residue in terms of the minimum absolute frequency at which they will be detected as significantly enriched. 26   However, in samples containing sequences featuring more than one position, residues in other positions may contribute to the overall error rate in two ways. Firstly, other positions may contribute false positives to the outcomes of an analysis. This occurs when a position contains a spurious over-representation of a particular residue arising either as a result of random selection during construction of an artificial data set, by coincidental co-occurrence of a particular residue in a specific position due to chance rather than biological activity, or as a result of an off-target biological activity which generated a true enrichment of a positional residue but which is not of interest with respect to a particular analysis. Secondly, these same sources of error may contribute false negatives to the outcomes of an analysis if the spuriously enriched positional residues occur in sequences also featuring the pattern of interest. Due to the set reduction operation of this algorithm, if an erroneous pattern is detected prior to the pattern of interest under these circumstances, the number of sequences corresponding to the pattern of interest is depleted thus diminishing the likelihood of its detection. If represented at equal or greater frequency than the erroneous pattern, this error will not be encountered if the pattern of interest is the least abundant residue in the background data set. This is due to the fact that its lower expected frequency will produce a smaller p-value resulting in that positional residue being selected first. Therefore, for a sample in which every sequence contains the lowest expected frequency residue in a particular position, this source of error should be impossible as the frequency of any other enriched residue cannot possibly exceed the frequency of the lowest expected frequency residue. On the other hand, this scenario is most likely to be observed when the pattern of interest corresponds to the highest expected frequency residue. Based on these factors, the highest overall error rate in a simple sample containing a single enriched positional residue is expected to occur when that residue has the highest expected frequency, the lowest overall error rate is expected to occur when that residue has the lowest expected frequency, while samples containing any other enriched residue are expected to exhibit overall error rates between the two in proportion to the expected frequency of the enriched positional residue. Therefore, to assess the lower bound of detection of this algorithm in terms of absolute pattern frequency, samples containing the lowest and highest background frequency residues were analyzed. 27   2.2.2.2.2       Methods Samples with containing between 1 and 20 sequences were incrementally generated as described in Section 2.2.2.1.2, with each sequential iteration increasing the number of sequences by 1. From these random samples, two sample variants were generated by substituting all residues in a single position with either the least abundant amino acid, tryptophan, or the most abundant amino acid, leucine, based on Swiss-Prot background frequencies. Substituting the amino acids into otherwise identical sets of sequences ensured perfect comparability of error rate between the two data sets. 100 replicates were generated and analyzed for each combination of sample size and fixed residue and the number of true positives, false positives, and true negatives were recorded using the output of each analysis. Precision and recall were computed from these outcomes, which were in turn used to calculate an F1 score for the results associated with each replicate. The mean F1 score for each combination of sample size and amino acid was calculated across replicates and plotted on a single axis.  28   2.2.2.2.3       Results  Figure 4. Mean Pattern Detection Accuracy by Absolute Pattern Frequency. Mean F1 scores for samples of size increasing from 1 sequence to 20 sequences, within which every sequence contained a particular fixed positional residue. These residues were selected to represent the lowest and highest expected frequency amino acids based on background frequencies calculated from the Swiss-Prot human proteome. These are predicted to be detectable and the lowest and highest absolute frequencies, respectively. The blue line represents results for the lowest expected frequency residue (tryptophan), and the orange line represents results for the highest expected frequency residue (leucine).  Extremely similar results were observed for both the most and least abundant amino acids with respect to the lower bound of detection in terms of absolute pattern frequency. In both cases, the pattern of interest was detected with 50% accuracy at a sample size of 4 sequences, and accuracy stably exceeding 95% when the sample size was greater than or equal to 7 sequences. 29   These results indicate that the default settings of the algorithm are sufficiently sensitive to detect over-representation of even the most common residue, while still specific enough to reject the types of errors that are likely to be observed under the same conditions. 2.2.2.3      Investigation of minimum percentage frequency  2.2.2.3.1       Objective While the previous analysis served to evaluate the performance of the algorithm on the simplest possible pattern-containing data sets, it is rare that every sequence in a biological sample corresponds to an individual activity represented by a single enriched positional residue. With this in mind, this Section addresses the detection of a single, enriched pattern against a background of noise. To accomplish this, a number of fixed sample size data sets containing a variable percentage frequency of a single pattern of interest. In addition to the sources of error discussed in 2.2.2.2.1, this sample configuration gives rise to three other potential sources. Firstly, false negatives may arise if the percentage frequency of sequences exhibiting the pattern of interest is not sufficiently high for the positional residue to pass the p-value threshold enforcement of the algorithm. Secondly, the set reduction operation of the algorithm may give rise to a false negative by depleting the frequency of sequences exemplifying the pattern of interest in the event that a false positive is encountered at an earlier stage of the recursive dataflow. While this was discussed at length in Section 2.2.2.2.1, the error must be extended to patterns defined by a position containing tryptophan (the least abundant residue based on Swiss-Prot background frequencies) when the percentage frequency of the pattern is less than 100%. This is due to the fact that it is possible for a pattern defined by a higher-expected-frequency positional residue to produce a p-value lower than the tryptophan-defined pattern in the event that its percentage frequency sufficiently exceeds that of the tryptophan-defined pattern. Thirdly, one of the novel features of this pattern detection algorithm is the implementation of compound residues, which aggregate multiple individual amino acids contained in the same position into larger groups based on common biochemical features shared by the amino acids. With this feature enabled, analysis of a sample containing a single pattern at lower than 100% frequency may produce a 30   single false negative accompanied by one or more false positives. This form of error arises when the pattern of interest is aggregated into a compound residue group due to the coincidental over-representation of one or more other constituent(s) of the same compound residue in that position. While the algorithm does allow decomposition of compound residues into simpler compound residues or individual amino acids, it is possible for this process to fail to resolve the pattern of interest if the coincidental enrichment of the other constituent(s) happens to be proportional to the residue of interest.  2.2.2.3.2       Methods For these experiments, four data sets with a fixed overall size (n=[10, 100, 1000, 10000]) were randomly populated with residues as described in Section 2.2.2.1.2. These sample sizes were selected based on the anticipated sample sizes of the datasets anticipated as inputs supplied by users of this tool. Variants of these data sets were then generated by substituting a variable number of either leucine or tryptophan (the most and least abundant residues based on Swiss-Prot background frequencies, respectively) into a single position of the sequences. The number of substituted sequences was increased from 10% to 100% of the overall sample size in steps of 10% (either 1 sequence, 10 sequences, 100 sequences, or 1000 sequence, respectively). 100 replicates were generated for each combination of sample size and pattern percentage frequency. Each sample was analyzed and F1 scores were calculated as described in Section 2.2.2.1.4. Mean F1 scores were then plotted, with results partitioned into separate figures for each overall sample size and scores corresponding to each pattern plotted on the same axis. 31   2.2.2.3.3       Results  Figure 5. Mean Pattern Detection Accuracy by Pattern Percentage Frequency. F1 scores averaged across replicates showing the accuracy of the algorithm on samples of various sizes, incrementing the percentage frequency of the pattern of interest from 10% to 100% of the total sample size in steps of 10%. Total sample size was held constant for in the datasets corresponding to each subfigure. Therefore, as pattern percentage frequency was increased, noise sequence frequency was decreased in equal proportion. (A) Results for samples containing a total of 10 sequences. (B) Results for samples containing a total of 100 sequences. (C) Results for samples containing a total of 1,000 sequences. (D) Results for samples containing a total of 10,000 sequences.  As expected, these experiments reveal that the minimum percentage frequency of a pattern depends on the overall sample size and, by extension, on the absolute frequency of sequences corresponding to both the pattern and noise. In the analyses featuring an overall sample size of 10 sequences, detection of the pattern containing leucine reached 50% accuracy when the percentage frequency of sequences containing the pattern was between 60% and 70%. This is equivalent to 6 or 7 32   sequences given the sample size and represents a rightward shift of absolute frequency required for detection shown in Figure 7. Moreover, non-zero F1 scores were observed for this pattern at lower percentage frequency levels, indicating that this shift is not solely due to the minimum percentage frequency necessary for the pattern to exceed the p-value threshold. It is thus clear that, particularly in very small samples, this algorithm’s limits of detection are governed not only by the absolute frequency of sequences containing a particular pattern, but also by the errors discussed in Section 2.2.2.3.1 which arise when the percentage frequency of sequences matching a particular pattern is less than 100%. This conclusion is supported by the observation that this trend persists in analyses of samples of greater overall size. While tryptophan-defined patterns are reliably detectable at levels down to at least 10% frequency, patterns defined by leucine exhibit lower accuracy of detection at such low percentage frequencies. However, in samples of 100 sequences or more, accuracy is extremely stable and close to 100% when pattern percentage frequency exceeds the limit of detection. Additionally, the difference of this lower bound for the least and most abundant residues diminishes as sample size is increased.  2.2.2.4      Investigation of pattern frequency fold difference  2.2.2.4.1       Objective The prior analyses have dealt exclusively with samples containing a single enriched pattern. However, it is likely that the majority of use cases of this algorithm will involve analysis of datasets containing multiple enriched patterns arising from more than one biological activity. It is also likely that the levels of activity of these multiple processes will differ and, therefore, that the enriched patterns arising from their specificites will vary in frequency. When the activity of one biological activity exceeds that of another, it is possible for the enrichment arising from the first to mask that of the second when evaluated simultaneously. This algorithm addresses that issue through the implementation of a set reduction operation which isolates sequences containing the most enriched positional residue producing two subsets, one containing those sequences and the other containing the remainder of sequences, during each iteration. 33   The purpose of this experiment is to demonstrate that, provided a first positional residue is enriched in the context of a set of sequences, and provided that a second positional residue is enriched in the context of the subset of those sequences not containing the first positional residue, the overall frequency of the first positional residue may be increased ad nauseum without impacting the likelihood of detecting the second positional residue.  2.2.2.4.2       Methods For each replicate, a fixed number of sequences (n=20 sequences) containing a single fixed positional residue were generated, as described in Section 2.2.2.1.2. These sequences, also generated using the methods described in Section 2.2.2.1.2, were then concatenated with a second set of sequences containing a different fixed positional residue. The sample size of the second set of sequences was set to either 1x (20 sequences), 10x (200 sequences), 100x (2000 sequences), or 1000x (20,000 sequences) the number of sequences in the first set. 100 replicates were generated for each pattern frequency fold difference condition. Each replicate was analyzed twice: once with the algorithm’s set reduction operation enabled and once with set reduction disabled. Mean F1 scores were calculated as described in Section 2.2.2.1.4 and plotted on a single figure for direct comparison of results for each sample condition across the two algorithm configurations. 34   2.2.2.4.3       Results  Figure 6. Mean Detection Accuracy for Two Patterns with Increasing Frequency Fold Difference. Mean F1 scores for the sequential detection of two enriched patterns with frequency fold difference increasing from 1x to 1000x in steps of 1 order of magnitude per iteration. The blue line shows scores corresponding to analysis of samples with the algorithm’s set reduction operation enabled, and thus shows results when the first of the two patterns in removed from the main sequence set following its detection. The yellow line shows results with the set reduction operation disabled.  Predictably, the algorithm’s set reduction operation is crucial for both the accurate detection of multiple patterns in the same dataset, as well as the accurate rejection of spuriously over-represented positional residues. As is evident in Figure 9, with this set reduction enabled, the performance limitations discussed in these analyses apply only to the most enriched positional residue at point in an analysis. Removal of that positional residue from both the foreground and background data sets for subsequent iterations ensures that it does not interfere with detection of other enriched positional residues. This 35   behavior ensures mean accuracy of greater than 95%, even cases of extreme fold difference between the foremost and next most enriched patterns in a dataset. The logic of this conclusion can be extended to datasets containing more than two patterns, provided each pattern exceeds the minimum absolute and percentage frequency thresholds with respect to either the full input set for the most over-represented pattern, or the subset of sequences excluding all patterns which are more enriched.  2.2.2.5      Investigation of number of detectable top-level patterns  2.2.2.5.1       Objective In addition to assessing the algorithms performance on samples containing patterns with variable frequency fold difference, it is also critical to assess the performance of this algorithm on samples containing a variable number of patterns. A key advantage of this algorithm over many alternative clustering algorithms is that the number of anticipated patterns does not need to be specified in advance. This number can be difficult to predict and often involves a process of clustering a data set many times in order to probe for the correct number. While there are quantitative methods available to support this process, it remains error prone and biased. It is therefore desirable to demonstrate the reliability of this algorithm on datasets featuring a wide range of patterns and, moreover, to characterize the algorithm’s capacity to resolve an unknown number of patterns accurately and exhaustively.  2.2.2.5.2       Methods As demonstrated in the previous analysis, the proportional representation of two patterns does not impact the accuracy with which the algorithm detects either pattern, provided each pattern’s frequency exceeds the minimum absolute and percentage frequency thresholds for detection. Therefore, to simplify the design of data sets for these experiments as well as to minimize sources of variation, frequency was held constant at 50 sequences for all patterns in this analysis. This is equivalent to a percentage frequency of 5% for each pattern in the input data sets, which is at the lower end of the detectable range as discussed in Section 2.2.2.3.3. 36   Random data sets of 1000 sequences each were generated using the methods described in Section 2.2.2.1.2. These data sets were then populated with a variable number of patterns with the total number of patterns in each data set increasing from 1 (5% of the total sample size cumulatively) to 20 (100% of the total sample size cumulatively). This was intended to cover the performance of the algorithm on data sets ranging from a single pattern at the lower end of the bounds of detection in terms of percentage frequency up to the maximum number of possible patterns at that same percentage frequency. These data sets consisted of two variants: one containing enriched positional residues in random positions and one containing enriched positional residues in the same position across patterns. While the first variant is likely to more closely approximate what would be encountered in a user-supplied data set, the second variant is intended to approximate the most challenging case for the algorithm. With all enriched residues occurring in the same position, the algorithm is likely to identify several of the enrichments as a single, aggregated compound residue enrichment. Due to the fact that biochemically similar amino acids tend to occur at similar frequencies in the proteome, and that all patterns are represented in equal frequency in these data sets, these compound residues are very unlikely to be reliably decomposed into individual amino acid enrichments by the algorithm. As each of these events produces one or more false negatives as well as one or more false positives, they dominate the errors anticipated in this analysis. Therefore, the results corresponding to these data sets represent the lower extreme of accuracy that could be expected in real world applications. 100 replicates were produced for each combination of positional enforcement and number of patterns. Results of the analyses were partitioned based on the positional enforcement in the set design and plotted as separate lines on the same figure. This figure displays mean F1 scores calculated as described in Section 2.2.2.1.4 as a function of the number of patterns in the datasets with scores averaged across replicates. 37   2.2.2.5.3       Results  Figure 7. Mean Pattern Detection Accuracy by Number of Patterns. Mean F1 scores across replicates for samples (n=1000) containing between 1 and 20 patterns comprised of 50 sequences each. The blue line indicates mean F1 scores where the fixed position of each pattern was selected at random, while the orange line indicates mean F1 scores where all patterns exhibited the same fixed position.  Performance was found to degrade slightly as the number of patterns increased when patterns were allowed to occur in any position, selected at random. This is close to, but slightly worse than, the expected outcome of stable performance up to the maximum number of patterns based on fixed pattern frequency. This is the result of a combination of factors including erroneous aggregation and subsequent failure of decomposition of compound residues as discussed in Section 2.2.2.5.2 as well as cross-talk between patterns in different positions which coincidentally feature one or more positional residues corresponding to another pattern. Depletion of pattern frequency due to this factor is discussed at length in Section 2.2.2.3.1. Given that pattern percentage frequency was selected to be close to the lower bound 38   of detection, any depletion of pattern exemplar sequences due to set reduction has relatively high potential to lead to a false negative. Due to the factors discussed in the previous Section, performance is markedly worse when all patterns are forced to occur in the same position with mean accuracy dropping below 50% when 14 or more patterns are present. However, this is an extremely unlikely occurrence and only serves to show the extreme low end of performance. At the other end of this spectrum, a sample which only contains patterns with fixed residues in different positions would likely exhibit the lowest error rate. However, this would limit the total number of single residue patterns that could be detected to 8 based on the width of the sequences included in these data sets (equivalent to the number of positions supplied by MEROPS in their protease substrate data sets). On those grounds, as well as on the grounds that these analyses are intended to reveal the limitations rather than the ideal performance of this algorithm, those experiments were not conducted.  2.2.2.6      Investigation of false positive rate in randomly selected Swiss-Prot sequences  2.2.2.6.1       Objective All prior analyses have dealt with samples containing one or more patterns, known to be enriched based on the construction of each input data set. However, an alternative anticipated scenario is that input samples may contain no over-represented positional residues relative to expected residue frequencies. In other words, one would expect that a biological sample collected at random from normal tissue or fluid would consist of a mixture of proteins containing amino acids frequencies that cumulatively approximate the expected background frequencies of that organism and tissue or fluid type. It is therefore critical to evaluate the algorithm’s ability to correctly reject all spuriously overrepresented positional residues in samples which should not contain any enriched patterns. Moreover, it is important to demonstrate this ability when using default algorithm settings which have already been demonstrated to be capable of detecting truly enriched patterns with high accuracy. 39   This analysis evaluates the rate of false positives discovered by this algorithm in samples of various sizes corresponding to the sizes selected for prior analyses as well as the sizes anticipated to be supplied as input to the algorithm by users.  2.2.2.6.2       Methods Two types of samples were constructed for this analysis. Firstly, samples of 10, 100, 1,000, and 10,000 random sequences were generated using the methods described in Section 2.2.2.1.2. Secondly, random sub-sequences were selected from the protein sequences including in the Swiss-Prot human proteome. 100 replicates were produced for each sample size. Each of these samples was then analyzed twice: once with compound residues enabled and once with compound residues disabled. Results were then partitioned into combinations of sample size, sequence type, and compound residue status and mean F1 scores were computed across all 100 replicates of each combination as described in Section 2.2.2.1.4. Mean F1 scores for each sample size were plotted on a separate figure, with each combination of sequence type and compound residue status plotted in a different color on the same set of axes. Histograms were generated to convey number of false positives as a percentage of test cases. 40   2.2.2.6.3       Results  Figure 8. Analysis of False Positive Rate in Random Sequence Sets. Histograms displaying number of false positives by percentage of replicates for each combination of sample type and compound residue status. Sample types were characterized by either in silico sequences populated with residues selected at frequencies proportional to their expected frequency given the Swiss-Prot human proteome as a background, or sequences selected at randomly directly from the proteins comprising the Swiss-Prot human proteome.  The results of this analysis demonstrate the robustness of this algorithm to false positives, even in very large data sets (in terms of the anticipated use cases). Importantly, false positives are not significantly more prevalent in samples of sequences drawn directly from the human proteome as compared to samples of in silico sequences. This is relevant as real-world input to the algorithm will consist of sequences directly from the proteome, rather than sequences deliberately constructed to reflect the expected background frequencies of each residue.  41   Enabling the detection of compound residues did not increase the frequency of false positives. In the samples containing 10,000 sequences, the outcomes for analyses with compound residues enabled showed slightly better performance. While this is almost certain to be the result of chance, these analyses taken together suggest that compound residues do not hinder the ability of the algorithm to correctly reject false negatives.  2.2.3   Comparison to existing tools  2.2.3.1      Comparison to MoMo reimplementation of Motif-X  2.2.3.1.1       Objective The analyses described in this Section compare the output of this tool to the outputs of Motif-X and MoMo,16,17,18 two alternative motif detection algorithms, for input data sets consisting of 13 amino acid, residue-centered sequences from the Plasmodium falciparum proteome.29 This proteome, as well as the length of the input sequences, were selected on the basis of the results reported by Pease, et al. in a 2018 paper reporting on PfPK7-regulated phosphorylation in Plasmodium falciparum, a protozoan human parasite responsible for a highly infectious and frequently lethal strain of malaria.30 In addition to the obvious clinical relevance of the data in the original Pease study, the sequences contained in Supplemental Data 3 were initially analyzed using the original implementation of Motif-X and subsequently reanalyzed by the authors of MoMo to evaluate the performance of their algorithm.18,30 Neither Motif-X nor MoMo were tested on a proteolytic data sets due to the requirement of a fixed central residue enforced by both tools. Proteolysis involves cleavage of protein sequences between residues and, therefore, sequences aligned to the sites of proteolytic cleavage events lack a fixed central residue. Motif-X and MoMo are therefore inherently unsuitable for the analysis of proteolytic data. This is discussed in greater detail in Chapter 3.  42   2.2.3.1.2       Algorithm configurations Due to the fact that the web server for the original implementation of Motif-X has been unavailable throughout the development of this tool, all comparisons to Motif-X were completed using the MoMo reimplementation of the original algorithm. As demonstrated by the authors of MoMo in their development paper, their reimplementation is functionally identical to the original.18  Firstly, the sequences were analyzed using MoMo’s re-implementation of the original Motif-X algorithm using the settings described in Pease, et al., 2018. These settings included the original Motif-X significance calculation (accurate to 1e-16), a minimum of 10 occurrences of each reported pattern in the sequences, a p-value cutoff of 1e-05, and a background derived from the Plasmodium falciparum proteome.30 Secondly, the sequences were analyzed using MoMo’s re-implementation of Motif-X using their enhancements and default settings. These settings included MoMo’s improved accuracy significance calculation (accurate to approximately 1e-300), a minimum of 5 occurrences of each reported pattern in the sequences, a p-value cutoff of 1e-05, and background frequencies derived from a shuffled version of the input data set.18 Finally, the sequences were analyzed using the default settings (unless otherwise noted) of this algorithm. These settings are described in Section 2.1.5.  2.2.3.1.3       Random Plasmodium falciparum sequences The first comparison of this algorithm to Motif-X and MoMo involved evaluation of the detection of false positives in a randomly selected sample of sequences from the Plasmodium falciparum proteome. The purpose of this analysis was to assess the capability of each algorithm configuration to correctly reject false positives in a sample of sequences which, in principle, should only be at most as patterned as the background population of sequences from which they were drawn.  43   2.2.3.1.3.1       Methods A set of non-redundant, 13 amino acid sequences were selected at random from the Plasmodium falciparum (Ensembl version 38) proteome. These sequences were then analyzed using the methods described in Section 2.2.3.1.2 and the results were compared across algorithms.  2.2.3.1.3.2 Results No enrichments were detected by any of the three algorithms, indicating that all three are sufficiently specific for use on complex data sets of phosphopeptides. This finding suggests that patterns detected by any of the three algorithms in data sets containing sequences derived from the Plasmodium falciparum proteome are likely to be meaningful in the context of the samples in which they are detected, rather than statistical flukes arising as a result of redundancy or patterning to the broader context of the full proteome of the organism.  2.2.3.1.4       Analysis of supplemental data 3 from Pease, et al., 2018 Following the comparative assessment of false positive detection, each algorithm was used to analyze the 196 sequences contained Supplemental Data 3 of Pease, et al 2018. These sequences corresponded to phosphorylation sites that decreased in abundance in the absence of PfPK7.30 These sequences were analyzed using Motif-X in the Pease paper, identifying three enriched motifs which were related back to prior knowledge of kinase specificity. They were subsequently re-analyzed by the authors of MoMo to demonstrate the functional equivalence of their version of Motif-X to the original.18  2.2.3.1.4.1 Methods The sequences from Pease, et al. 2018 Supplemental Data 3 were retrieved and stored in raw text format. These sequences were then analyzed by using the same methods described in Section 2.2.3.1.3.1. In addition to default algorithm settings, this data set was also analyzed using our algorithm with compound residues disabled in order to provide an additional point for comparison. Detected patterns were then considered both in terms of the differences between the algorithms as well as in terms of the putative biological relevance of each pattern. 44   In order to ensure direct comparability of the outputs from each tool, logo maps were generated by supplying the subsets of sequences matching each detected pattern as input to IceLogo.31  2.2.3.1.4.2 Results At least one enriched pattern was detected by each algorithm. Logo maps corresponding to these enriched patterns are presented in Figure 12. 45    Figure 9. Logo Maps for Patterns Detected in Pease Supplemental Data 3. Logo maps corresponding to patterns detected by (A) MoMo’s functionally identical re-implementation of the original Motif-X algorithm, (B) MoMo run with default settings, (C) our tool with compound residues disabled, and (D) our tool with default settings. 46   As expected, the Motif-X configuration of MoMo produces the same three patterns, as seen in Figure 12A, reported in the original paper.30 Two of these three patterns (RxxxxxS* and RxxS*) were also detected by our algorithm with compound residues disabled, while the third pattern (S*xD) was found embedded within a slightly more specific pattern detected by our algorithm (S*DD). Due to the fact that our algorithm does not utilize a prespecified central residue, a fourth pattern containing the central serine residue was also detected. Patterns detected by our algorithm with compound residues disabled are visible in Figure 12C. Caseine kinase 1 (CK1) and caseine kinase 2 (CK2) are known to target the S*xD motif detected by Motif-X and embedded within the motifs detected by our algorithm in both configurations.32,33 Importantly, these enzymes were also predicted to recognize the majority of phosphopeptides exhibiting the largest decrease in abundance in the absence of PfPK7 using Group-based Prediction System 3.0 (GPS 3.0).30,34 This provides additional supporting evidence for both the capability of this algorithm to detect statistically and biologically relevant patterns in complex experimental data sets. The RxxS* motif detected by Motif-X and our algorithm both with and without compound residues enabled is known to be utilized by a variety of AGC protein kinases.35,36 The RxxxxxS* motif detected by Motif-X and our algorithm with compound residues disabled can putatively be attributed to protein kinase C (PKC) or CDC2-like kinase (CLK) activity.37,38 While AKT and PKA (both belonging to the group of AGC protein kinases) activity are supported by the substrate recognition predictions made using GPS 3.0, activities corresponding to the RxxxxxS* motif are not supported in the Pease paper by additional evidence.30 Of note is that this motif was not detected by our algorithm with compound residues enabled, but was reflected in the logo map of a more specific RxxxS*xD as seen in Figure 12D. This pattern was detected in the subset of sequences containing acidic residues two positions to the right of the central serine residue, initially described by an S*x[DE] motif and subsequently decomposed into three more specific subsets featuring the RxxxS*xD pattern, an S*x[DE]xxx[RK] and an S*x[DE][DE] pattern, respectively. A particularly interesting feature of this difference is that this alternative positional arginine (four positions to the left of the central serine residue) is reflected in multiple logo maps corresponding to the patterns detected by the other algorithms, including the logo map for the RxxxxxS* motif. 47   These findings indicate two things. Firstly, arginine does appear to be enriched six positions to the left of the central serine residue, however this enrichment is in fact specific to the subset of sequences featuring an arginine residue three positions to the left of the central serine residue. Secondly, the enrichment of arginine three positions to the left of the central serine residue is only revealed through the aggregation and subsetting of sequences containing an acidic residue two positions to the right of the central serine residue. Conversely, the subsets matching the other two patterns containing an acidic residue two positions to the right of the central serine residue do not display an overrepresentation of arginine six positions to the left of the central serine. While the overrepresentation of arginine four positions to the right of the central serine residue seems to corroborate the detection of this residue as a fixed pattern position by our algorithm with compound residues enabled, no further evidence from kinase specificity literature was identified. This could either be due to the fact that this pattern does not serve as a kinase specificity target or, alternatively, due to the fact that prior attempts to describe kinase specificity have typically been based on techniques which do not account for amino acid exchangeability and therefore miss enriched features which are only observed when sequences are aggregated based on biochemically similar positional residues. MoMo detected a single, entirely different, pattern consisting of a single fixed valine, six positions to the left of the central serine residue. This positional valine residue is not represented in the patterns nor the logo maps of any other algorithm configuration tested in this analysis. On the other hand, the logo map corresponding to the pattern detected by MoMo (Figure 12B) does show a strong over-representation of acidic residues located one, two, three, and four positions to the right of the central serine residue. Additionally, this logo maps exhibits a maximal overrepresentation of arginine six positions to the right of the central serine residue, which is consistent with the output of our algorithm with compound residues enabled. The VxxxxxS* was not attributed to any known kinase activity by the authors of MoMo, nor was it connected to a known specificity in the course of this analysis.   48   2.2.3.2      Comparison to Gibbs Clustering  2.2.3.2.1       Objective Gibbs clustering is a popular Markov chain Monte Carlo algorithm used for clustering of sequence data sets.21 The algorithm takes a prespecified number of clusters and initializes the clustering workflow by randomly assigning each input sequence to one of those clusters. The algorithm the performs a series of operations including a leftward or rightward shift of one or more sequences relative to the others, or the reassignment of a sequence from one cluster to another, once every fixed number of iterations.21 The favorability of each move is assessed using the metric of Kullback-Leibler Divergence (KLD), and the move is either accepted or rejected according to a schedule of decreasing probability for accepting unfavorable moves, accepting all favorable moves.21 If the number of clusters present in a data set is not known in advance, the algorithm can be run several times using a range of numbers of clusters. The set of clusters which maximizes system-wide (average) KLD is automatically selected as the correct solution.21 Clusters are represented by a stacked bar chart representing the KLD of each cluster and, cumulatively, the KLD of the system, as well as by a logo map corresponding to each pattern.21 Using the logo maps one can ascertain the relevant over-represented positional residues which define a cluster if any are present. Gibbs clustering differs from this tool in several critical ways. For instance, the input data set for Gibbs clustering does not strictly need to be pre-aligned as the algorithm will attempt to realign sequences at a particular number of iterations by default. This is useful if there is no clear feature on which to align a set of sequences such as a cleavage site or modified residue. Additionally, as noted earlier, Gibbs clustering requires the user to specify the number of clusters in a data set in advance or, alternatively, to run the algorithm over a range of possible numbers of clusters. This presents a challenge when applying Gibbs clustering to the analysis of proteolytic data, where the mixture of activities contributing to the composition of a sample is both complicated and variable. In addition to these differences, Gibbs clustering is a non-deterministic algorithm, therefore limiting the reproducibility of results between runs. 49   The objective of this comparison is to assess the relative suitability of our deterministic, pattern detection-based sample deconvolution approach against the non-deterministic, distance-based Gibbs clustering for analysis of complex samples drawn from the Swiss-Prot human proteome.  2.2.3.2.2       Random Swiss-Prot sequences This analysis compares the performance of Gibbs clustering to our algorithm on a sample of random sequences from the Swiss-Prot human proteome. The purpose of this comparison is to assess the ability of each algorithm to correctly reject spurious patterns in a sample that should not contain any statistically overrepresented patterns.  2.2.3.2.2.1 Methods An analysis of the frequency of false positive detection by our algorithm was already conducted and is described in Section 2.2.2.6. A single replicate (replicate 32) from the set of all samples containing 1000 sequences was randomly selected and supplied as input to the Gibbs clustering algorithm. Gibbs clustering was run over a range of 1 to 15 possible clusters, the minimum and maximum settings for the web interface, respectively. A motif length of 8 (the full length of the input sequences) was selected, the simple shift move of the algorithm which attempts realignment of the dataset was disabled, and the trash cluster for outlier rejection was enabled. All other settings were kept at their default values.  2.2.3.2.2.2 Results Gibbs clustering selected an optimal result of 15 clusters, equal to the maximum number of clusters tested in this experiment. This underscores a limitation of the algorithm, which is that it inherently favors small, homogeneous clusters rather than large, generalized clusters. There is an attempt to offset this trait with the incorporation of a penalty for small clusters, however these results suggest that the default value for the small cluster penalty is insufficient to suppress the behavior when analyzing a strictly random sample.  50    Figure 10. Gibbs Clustering Results: KLD by Number of Clusters for Random Swiss-Prot Sequences. Stacked bar chart showing the mean KLD for each cluster in a set of solutions ranging from 1 cluster to 15 clusters. The algorithm selects the solution with the highest average system KLD as the correct solution. In this case, the algorithm selected a solution of 15 clusters with an average system KLD of 4.332545.  The outcome also illuminates a potentially treacherous property of Gibbs clustering with respect to the analysis of samples of unknown composition. The algorithm will always assign sequences to the pre-specified number of clusters, producing the best solution possible for that number of clusters. Even when the algorithm is run across a range of possible cluster numbers, there is no guarantee that the 51   correct number of clusters falls within that range. Consequently, in the course of finding the best sample clustering given a particular number of clusters, will inherently enrich certain sequence features in each cluster. This is particularly problematic for samples containing no true clusters, as every sequence feature enrichment is artificial. This is visible in Figure 14, where the logo maps reflect highly overrepresented residues in certain positions despite the deliberate randomness of the input data set. These results demonstrate that our algorithm is better suited for the analysis of data sets which do not contain any true enrichment, producing an empty result set rather than a complicated set of clusters which misleadingly indicate a number of enriched features in the input data set.  52    Figure 11. Gibbs Clustering Results: Logo Maps for 15 Cluster Solution on Random Swiss-Prot Sequences. Gibbs clustering selected a solution of 15 clusters from a range of 1 to 15 clusters when supplied with an input data set containing sequences randomly selected from the Swiss-Prot human proteome. By selecting the largest number of clusters, Gibbs clustering partitioned the data set into the smallest and most homogenous subsets allowed, despite the absence of any true discriminating features between sequences.   53   2.2.3.2.3       Mixed MEROPS protease substrate sequences  2.2.3.2.3.1 Methods Next, the relative performance of this algorithm and Gibbs clustering were evaluated on the analysis of a sample containing a complex mixture of enriched features. This data set was generated by taking a combination of protease substrate sequences from the MEROPS database. 194 substrates of caspase-6, 469 substrates of cathepsin B, and 301 substrates of matrix metallopeptidase 9 were combined to generate a sample of 964 sequences. As seen in Figure 13, the caspase-6 subset of this sample is characterized by enriched acidic residues in several positions. These residues are particularly overrepresented in the 0, 1, and 3 positions, using the index of the logo map, along with valine in the 0 position. In contrast to the caspase-6 subset, Figure 14 shows that the cathepsin B substrate set is characterized by enrichment of small and hydrophobic residues in many positions, with a heavy overrepresentation of glycine in the 3 and 6 positions. Based on this sequence composition, one would anticipate that caspase-6 and cathepsin B substrates should be relatively easy to separate.  54    Figure 12. Logo map of Caspase-6 Substrates from the MEROPS database. This logo map depicts the amino acid composition of four positions flanking the cleavage site of 194 experimentally validated substrates of caspase-6, retrieved from the MEROPS database. .  Figure 13. Logo map of Cathepsin B Substrates from the MEROPS database. This logo map depicts the amino acid composition of four positions flanking the cleavage site of 469 experimentally validated substrates of cathepsin B, retrieved from the MEROPS database.  On the other hand, Figure 15 indicates that the substrates of MMP-9 also display enrichment of small and hydrophobic residues in several positions. While the relative positional overrepresentation of 55   these residues is sufficiently different for our algorithm to detect a different set of patterns corresponding to the specificities of MMP-9 and cathepsin B when analyzed separately, it is likely that the similarity of the amino acid compositions of their substrates will prove challenging to a clustering-based approach to when the subsets are combined.  Figure 14. Logo map of Matrix Metallopeptidase-9 Substrates from the MEROPS database. This logo map depicts the amino acid composition of four positions flanking the cleavage site of 301 experimentally validated substrates of matrix metallopeptidase-9, retrieved from the MEROPS database.  This is observation is reinforced by examining the logo map corresponding to the mixed substrate data set, presented in Figure 16. While the distinct positional features corresponding to caspase-6 are preserved in the mixed sample, it becomes extremely challenging to differentiate the contributions of the cathepsin B and MMP-9 subsets to the composition of each position.  Gibbs clustering settings used for this analysis were identical to those described in Section 2.2.3.2.2.1. Default settings were used for our algorithm, as described in Section 2.1.5.  56    Figure 15. Logo map of Mixed MEROPS Protease Substrate Sample. This logo map depicts the amino acid composition of four positions flanking the cleavage sites of a mixed sample of 964 experimentally validated substrates retrieved from the MEROPS database. The sample includes 194 caspase-6 substrates, 496 cathepsin B substrates, and 301 matrix metallopeptidase-9 substrates.  2.2.3.2.3.1 Results Performance of Gibbs clustering on an input data set which does contain enriched features is considerably better than the performance of Gibbs clustering on a completely random data set. In this case, Gibbs clustering selected a solution consisting of two clusters, with an average system KLD of 8.260799, as shown in Figure 15. 57    Figure 16. Gibbs Clustering Results: KLD by Number of Clusters for Mixed MEROPS Protease Substrate Sequences. Stacked bar chart showing the mean KLD for each cluster in a set of solutions ranging from 1 cluster to 15 clusters. The algorithm selected a best solution of 2 clusters with an average system KLD of 8.260799.  Despite this improvement in performance as compared to the analysis involving a blank sample, Gibbs clustering failed to resolve MMP-9 and cathepsin B substrates into separate clusters. While these proteases have distinct specificity profiles, the substrates of both primarily contain small and hydrophobic 58   residues. The sets of their substrates are therefore more similar to one another than either is to the set of caspase-6 substrates, which primarily targets acid residues in a small number of positions.  Figure 17. Gibbs Clustering Results: Logo Maps for 2 Cluster Solution on Mixed MEROPS Protease Substrate Sequences. Logo maps for the two cluster solution show clearly different positional residue composition for each of the two clusters. Cluster 1 features enrichment of small and hydrophobic amino acids in several positions, while Cluster 2 features enrichment of acid residues in a small number of positions. This is consistent with co-clustering of MMP9 and cathepsin B substrates in Cluster 1, and clustering of caspase-6 sequences in Cluster 2.  As can be seen in Figure 15, our algorithm also struggled to resolve the substrates of MMP-9 and cathepsin B while producing a distinct cluster corresponding to the substrates of caspase-6 despite the fact that this clustering was based on substrate similarity to the set of detected patterns rather than direct, pairwise comparison of sequences. This underscores the profound similarity of the substrates of MMP-9 and cathepsin B and as well as the limitations of clustering to deconvolute samples containing sequences generated by a mixture of protease activities. In contrast to the output observed for both Gibbs clustering and the hierarchical clustering component of our algorithm, the patterns detected in the sample were revealed to map quite effectively to our pre-computed MEROPS patterns, as can be seen on the heatmap presented in Figure 16.   59    Figure 18 Clustermap for Mixed MEROPS Protease Substrate Sample. Clustermap displaying the patterns detected in a mixed sample containing substrates of caspase-6, cathepsin B, and matrix metallopeptidase 9. Each column represents a sequence in the data set each row represents a pattern detected by our algorithm. Both rows and columns are clustered and the matching dendrograms are presented to the left and above the clustermap, respectively. Pattern distances are encoded as length in the dendrograms.   60    Figure 19. Heatmap for Mixed MEROPS Protease Substrate Sample. Heatmap displaying the patterns detected in a mixed sample containing substrates of caspase-6, cathepsin B, and matrix metallopeptidase 9 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool.  Distinct pattern matches were observed for cathepsin B as well as multiple caspases and matrix metallopeptidases. Among the matched caspases and matrix metallopeptidases were caspase-6 and MMP-9, however the same patterns matching these proteases also matched other proteases from both respective MEROPS families. This is a consequence of the fact that proteases from the same family often exhibit extremely similar, if not identical, cleavage site specificity. This result demonstrates the efficacy of the pattern detection performed by our algorithm, even when facing a deeply convoluted sample which cannot be easily resolved into subsets meaningfully corresponding to biological activity through sequence clustering.  2.2.3.2.4       Analysis of Supplemental Data 3 from Pease, et al., 2018 Gibbs clustering was next used to analyze the Plasmodium falciparum phosphoprotein data set described in Section 2.2.3.1.3. While Motif-X and MoMo can be used to analyze sequences aligned to a side chain modification such as a phosphorylation site, but cannot be used to analyze sequences aligned to the bond cleaved in a proteolytic event, Gibbs clustering can be used to analyze sequences of both 61   types. This analysis therefore serves to show the performance of Gibbs clustering on a sample type accommodated by all pattern detection algorithms evaluated in this comparison.  2.2.3.2.4.1 Methods Gibbs clustering was run over a range of 1 to 15 clusters with a motif length of 13 and the simple shift realignment move disabled. Due to the fact that Gibbs clustering uses Swiss-Prot human background frequencies by default, the alternative option of estimating background frequencies from the input data set was selected. All other settings were left at their default values.  2.2.3.2.4.2 Results Gibbs clustering selected an optimal solution containing a single group with an average KLD of 5.558908, as seen in Figure 17.  The logo map representing this single cluster clearly shows the fixed central serine residue as well as highly overrepresented basic and acid residues in several positions, including the positions in which these residues were detected as enriched by our algorithm. Given the low probability of substituting an acidic residue for a basic residue due to their opposite charge, it is surprising to see that Gibbs clustering failed to separate sequences on the basis of these residues. A plausible explanation for this is that the central serine residue common to all sequences in the input data set sufficiently increased inter-cluster similarity to a level that precluded the selection of a multi-cluster solution. These findings suggest that Gibbs clustering may struggle to decompose data sets containing a fixed central residue into meaningful clusters, even in cases where distinct motifs are identifiable within the data through the use of pattern detection algorithms.  62    Figure 20. Gibbs Clustering Results: KLD by Number of Clusters for Supplemental Data 3 from Pease et al., 2018. Stacked bar chart showing the mean KLD for each cluster in a set of solutions ranging from 1 cluster to 15 clusters. The algorithm selected a best solution of 1 cluster with an average system KLD of 5.558908.   63    Figure 21. Gibbs Clustering Results: Logo Maps for 1 Cluster Solution on Pease Supplemental Data 3. This logo map depicts the positional composition of the sequence co-clustered in Gibbs clustering’s 1 cluster solution for a data set containing 13 amino acid Plasmodium falciparum sequences centered on phosphorylated serine.  2.2.4   Characterization of MEROPS protease substrate data  2.2.4.1      Methods used in the analysis of MEROPS protease substrate data MEROPS contains a large quantity of experimentally-validated cleavage event data for a large number of proteases in the form of 8-amino-acid-long substrate sequences. These sequences were retrieved from the database, partitioned by protease, and analyzed as input data sets using default algorithm settings. When one pattern or more was detected by the algorithm, default outputs were generated and the patterns were appended to a supplementary table in the database. Prior to analysis, proteases were filtered to include only those marked as “human active,” (indicating confirmed activity in humans), and substrates were filtered to exclude synthetic and theoretical peptides. 64   2.2.4.2      Detected patterns As discussed in Section 2.2.1.6.2, the heatmaps generated by this tool rely on patterns detected in protease substrate data sets drawn from the Swiss-Prot database in order to produce a layer of annotation corresponding to protease activity. These patterns are detected in the substrate sets in identical fashion to the analyses previously described and used default algorithm settings in all cases. In total, 261 patterns across 108 proteases were detected using default settings of the algorithm. These patterns are broken down by catalytic type in Figure 19, while the number of patterns unique to a protease classification at each level of granularity is depicted in Figure 20. 65    Figure 22. Number of proteases with Detected Pattern by Catalytic Type. Number of proteases in the MEROPS database for which patterns were detected in their substrate sequence sets, broken down by catalytic type.  66    Figure 23. Number of Unique Patterns by Level of Protease Classification. Number of unique patterns associated with a particular catalytic type, family, or individual protease.  2.2.4.3      Protease substrate clustermaps As the analysis of each protease substrate data set in the MEROPS database is essentially equivalent to the analysis of a user-supplied experimental dataset, similar outputs are generated for each protease substrate set in which enriched patterns are detected. Among these outputs is a clustermap, as described in Section 2.2.1.6.1. This clustermap provides a visual impression of the clustering of the protease substrates based on the detected patterns and will support the more nuanced interpretation of the wealth of cleavage event data stored within the database. 67   2.2.4.4      Protease substrate heatmaps In addition to the clustermaps generated for protease substrate analyses, heatmaps were also generated as described in Section 2.2.1.6.2. While the mapping of the patterns detected for a given protease to themselves is obviously redundant, the mapping of those patterns to those detected for all other protease substrate sets in MEROPS yields valuable information. The descriptive statistics and figures characterize the number of proteases, families, and catalytic types that can be uniquely identified based on a specific pattern. However, the heatmaps produced by the algorithm convey partial matches between patterns in an input data set and the patterns associated with MEROPS proteases as well. By plotting a heatmap for each protease with detected patterns, one gains a sense of how effectively each individual protease can be visually disambiguated from all others on the basis of the combination of patterns associated with it.  2.2.5   Applied analyses  2.2.5.1      Frantzi et al., 2016: urothelial bladder cancer biomarkers  2.2.5.1.1       Description of data In order to assess the performance of this algorithm on a real world, complex data set with clinical implication, sequences were collected from a paper discussing the development of a biomarker panel for urothelial bladder cancer.39 These sequences consisted of intact peptides detected in urine samples using capillary electrophoresis coupled to mass spectrometry (CE-MS).40 Firstly the list of all peptides included in the biomarker panels (Supplementary Table 5) were analyzed and the patterns detected in these peptides are reported and discussed. Next, the smaller set of statistically significant biomarkers, as compared to negative controls, from Supplementary Table 3 were analyzed. Again, patterns detected in these peptides are reported and discussed.  68   2.2.5.1.2       Methods Due to the fact that the peptides provided in Supplementary Table 5 and Supplementary Table 3 and intact, sequence extension was performed in both the N-terminal and C-terminal directions, thus producing two input sequences from each input peptide. The N-terminal and C-terminal extended sequences for each data set were analyzed together with the rationale being that both termini were drawn from the same physiological and biochemical environment.  Each data set was analyzed using the default algorithm settings described in Section 2.1.5.  2.2.5.1.3       Results Patterns detected in this data set revealed a high concentration of small and hydrophobic residues including glycine, proline, and alanine in several positions. This amino acid composition is consistent with the author’s statement that most of the peptide sequences included in their biomarker panel were fragments of collagen.39 These patterns, as well as their similarity to the sequences comprising the data set, are visualized on the clustermap presented in Figure 25.  69    Figure 24. Clustermap for Frantzi et al., 2016 Table 5. Clustermap displaying the patterns detected in Table 5 from Frantzi et al., 2016. Each column represents a sequence in the data set each row represents a pattern detected by our algorithm. Both rows and columns are clustered and the matching dendrograms are presented to the left and above the clustermap, respectively. Pattern distances are encoded as length in the dendrograms.  As seen in Figure 26, one of the detected patterns exactly matched a pre-computed pattern associated with matrix metallopeptidases 8, 9, and 12. As noted by the authors, the high abundance of collagen fragments most likely arose as a result of cancer-associated changes in cellular processes such as increased proteolysis and remodeling of the extracellular matrix.39 Matrix metallopeptidases are largely responsible for the degradation of the components of the extracellular matrix including collagen.41 The detection of MMP activity in this sample is therefore a testament to the reliability and biological relevance of the patterns identified by this algorithm.  70    Figure 25. Heatmap for Frantzi et al., 2016 Table 5. Heatmap displaying the patterns detected in Table 5 of Frantzi et al., 2016 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool.  Subsequent analysis of the list of significant biomarker peptides yielded an even more interesting result. As was true for the full list of detected peptides, the significant biomarkers exhibited enrichment of small and hydrophobic residues including glycine, proline, and alanine in several positions. As compared to the full peptide list, the patterns detected amongst the significant biomarker peptides were present in higher numbers and with greater specificity.  71    Figure 26. Clustermap for Frantzi et al., 2016 Table 3. Clustermap displaying the patterns detected in Table 3 from Frantzi et al., 2016. Each column represents a sequence in the data set each row represents a pattern detected by our algorithm. Both rows and columns are clustered and the matching dendrograms are presented to the left and above the clustermap, respectively. Pattern distances are encoded as length in the dendrograms.  Notably, one of these more specific patterns was identified as a perfect match for a pre-computed pattern corresponding to the specificity of matrix metallopeptidases 9 and 13. While MMPs 8 and 12 were detected in the full list of peptides, they were not detected in the list of significant biomarkers. While the roles of those two proteases have not been widely investigated in the specific context of bladder cancer, increased activity of MMP-9 is consistently reported in urine samples from bladder cancer patients as compared to normal controls.41 Furthermore, elevated MMP-13 expression has been associated with higher tumor stage and grade.42  72    Figure 27. Heatmap for Frantzi et al., 2016 Table 3. Heatmap displaying the patterns detected in Table 3 of Frantzi et al., 2016 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool.  The results of this analysis demonstrate that our algorithm was not only sensitive and accurate enough to detect broad MMP activity in a sample of urinary peptides collected from bladder cancer patients, but also to detect the activity of two specific MMPs with well-established correlation to bladder cancer in a list of peptides determined to be statistically significant biomarkers by orthogonal means.  2.2.5.2      Wiita et al., 2014: detecting signatures of chemotherapy-induced apoptosis in blood  2.2.5.1.1       Description of data This analysis applies our algorithm to the data presented in a Wiita et al., 2014 paper submitted to PNAS concerning the detection of proteolytic protein fragments in samples of blood collected from patients suffering from hematological malignancies following the induction of chemotherapy.43 Specifically, the authors were interested in detecting the release of normally intracellular peptides generated by protease activity into the bloodstream. The authors assert that these peptides serve as indicators of apoptosis, or programmed cell death, which is commonly induced by chemotherapy.43 73   Apoptosis involves degradation of cellular components including the cell membrane, leading to the release of intracellular contents into the extracellular medium.43 To maximize the probability of detecting these apoptosis-associated protein fragments, the authors enriched their discovery samples for free protein N-termini using subtiligase labeling.44 Enriched samples were then sequenced during the discovery stage of their study using mass spectrometry operated in a data-dependent acquisition mode in order to generate an unbiased list of detected peptides. A list was then compiled by identifying peptides which appeared in the post-chemotherapy patient sample, but not in a normal plasma sample. Peptides from this list were compared to a database of more than 700 known proteolytic peptides to interpret the results.43,45 In total, 98 peptides not normally found in blood were identified across 5 patient samples.43 These peptides included a number of putative signatures of apoptosis, including 23 candidate caspase substrates.43 The peptide lists from both the patient discovery sample and the control plasma sample were subsequently combined with peptides from a database of apoptotic samples to generate a peptide inclusion list for biased detection of apoptosis-associated peptides with greater sensitivity than the unbiased strategy.46  2.2.5.1.2       Methods Peptides and protein accession numbers for the control plasma and post-chemotherapy patient samples were retrieved from supplemental dataset S1 of the Wiita et al., 2014 paper. These peptides were then extended in the N-terminal direction using the Swiss-Prot human proteome and truncated to 8 residues symmetrically flanking the N-termini of the original peptide sequences. Following extension and de-duplication, the post-chemotherapy patient data set consisted of 71 aligned sequences and the control plasma sample consisted of 79 aligned sequences. Sequences were then analyzed using the default settings of our algorithm described in Section 2.1.5. Default output figures including the sequence clustermap, protease pattern heatmap, and pattern logo maps were generated, reported, and discussed. 74   2.2.5.1.3       Results Analysis of the post-chemotherapy patient sample revealed a total of four enriched patterns. One of these patterns, xxxx|M[1]xxx (where 1 represents the compound residue for non-polar amino acids), corresponds to genomic N-termini in the dataset. Genomic N-termini are the first positions of canonical proteins, defined by the sequences of their protein coding genes. Unlike the diversity found at neo-N-termini generated by various proteases, all genomic N-termini begin with methionine, coded for by the genetic start codon, which is most often followed by a non-polar residue such as alanine or valine. The subtiligase labeling method used by the authors biochemically enriches protein N-termini, but does not discriminate between genomic N-termini and neo-N-termini produced by proteolytic cleavage events.44 It is therefore unsurprising to find a concentration of genomic N-termini in this data set.   Figure 28. Wiita et al., 2014 Dataset S1 Patient Sample Clustermap. Clustermap displaying the patterns detected in discovery sample from Wiita et al., 2014. Each column represents a sequence in the data set each row represents a pattern detected by our algorithm. Both rows and columns are clustered and the matching dendrograms are presented to the left and above the clustermap, respectively. Pattern distances are encoded as length in the dendrograms. 75   As seen in Figure 30, two of the four detected patterns exactly matched pre-computed MEROPS protease patterns. Most notably, the first of the two patterns matched multiple members of the C14 MEROPS caspase family. A cascade of caspase activity is known to be activated during apoptosis, while these proteases are typically inactive healthy, functioning cells.47 This discovery is corroborated by the authors’ identification of peptides correlated with caspase-cleavage as well as their identification of other indicators of apoptosis. The second of these patterns exactly matched corin, a serine protease, while also partially matching patterns associated with a number of other serine proteases. Corin is not known to be significantly upregulated in hematological malignancies, although it is known to be found in human blood under normal condition.48 It is therefore speculatively plausible to assert that this pattern does, in fact, correspond to corin activity however it should also be noted that a number of serine proteases exhibit similar preferential cleavage of substrates with a basic residue in the P1 position. Serine proteases are among the most physiologically common and active proteases.49 It is therefore at least as likely that this pattern represents an aggregate of serine protease activity that resembles the specific activity of corin by chance.  76    Figure 29. Wiita et al., 2014 Dataset S1 Patient Sample Heatmap. Heatmap displaying the patterns detected in the discovery sample from Wiita et al., 2014 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool.  A larger, but similar, set of patterns were detected in the sample of sequences identified using the biased inclusion list of apoptosis-associated peptides. This is unsurprising, as the use of a targeted inclusion list increases the likelihood of detecting peptides of interest.46 In this case, those peptides correspond to increases in apoptotic cellular processes including proteolysis. Once again, a pattern specifically matching the known specificity of C14 caspases was detected, reinforcing our interpretation of the results for the discovery-stage patient data set. Also detected was a pattern exactly matching a pre-computed pattern for gastricsin, a digestive enzyme. This pattern partially matched select cysteine- and metallo-type proteases and, importantly, shared a fixed position in common with a pattern matching the specificity of C14 caspases. No biological conclusions were drawn from the observation of this pattern due to its questionable relevance. Further establishing the efficacy and specificity of this tool for detecting meaningful enrichments of biological activity is the fact that no patterns were detected in the normal plasma control sample, as visualized in Figure 27. This reinforces the patterns detected in the unbiased discovery sample as well as the biased inclusion list sample are legitimate and significant. 77     Figure 30. Wiita et al., 2014 Dataset S3 Clustermap of Sequences Detected Using Inclusion List. Clustermap displaying the patterns identified in a sample of peptides detected using inclusion list from Wiita et al., 2014. Each column represents a sequence in the data set each row represents a pattern detected by our algorithm. Both rows and columns are clustered and the matching dendrograms are presented to the left and above the clustermap, respectively. Pattern distances are encoded as length in the dendrograms.  78    Figure 31. Wiita et al., 2014 Dataset S3 Heatmap of Sequences Detected Using Inclusion List. Heatmap displaying the patterns detected in sample of peptides detected using inclusion list from Wiita et al., 2014 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool.    Figure 32. Wiita et al., 2014 Dataset S1 Control Plasma Heatmap. Heatmap displaying the patterns detected in control plasma sample from Wiita et al., 2014 along with protease specificity annotation from the pre-computed MEROPS protease substrate patterns included with this tool. 79   Chapter 3: Conclusion  3.1 Discussion  3.1.2   Discussion of comparison to other tools  3.1.2.1      Motif-X and MoMo  3.1.2.1.1       Motif-X and MoMo are unsuitable for analysis of proteolytic data sets A substantial limitation of Motif-X as well as of the more recent, but closely related, MoMo is the unsuitability of both algorithms for the analysis of proteolytic data sets. Both tools rely on an input data set centered around a fixed residue.16-18 MoMo generalizes this to some extent by allowing multiple central residues to be present in the same input data set as well as by detecting central residues automatically rather than requiring user specification, however it still accepts only odd-length sequences and subsets those sequences based on their composition with respect to the middle position of the aligned data.18 This design choice was likely made to support the strategies employed by each algorithm for the computation of expected residue frequencies as well as on the basis that many post-translational modifications are, indeed, residue-centric. That is to say, modifications such as phosphorylation and acetylation, among others, involve the covalent modification of the side chain of a particular amino acid. It therefore follows that the most sensible alignment of sequences exhibiting these modifications is symmetric about the modification site. However, unlike these side chain-modifying processes, proteolysis involves cleavage of proteins between residues at the site of peptide bonds joining amino acid. Therefore, the most sensible alignment of proteolytic data is symmetric about the cleavage site for each proteolytic event resulting in a sample of even-length sequences lacking a central residue. A proteolytic data set could be manipulated for analysis using MoMo by adding an artificial central residue common to all sequences in the data set, however this would restrict the user to using expected 80   residue frequencies derived from a shuffled foreground rather than expected residue frequencies from a context data set such as the proteome of a particular organism. This strategy could not be employed for analysis with Motif-X (or MoMo’s reimplementation of the original Motif-X algorithm) due to the fact the algorithm calculates expected residue frequencies by tracking fixed residues in the foreground and identifying all sequences of a specified length containing those fixed residues in a context data set, beginning with the central residue. Therefore, if an artificial central residue corresponding to an amino acid from the context data set were selected, Motif-X would select an erroneously specific set of background sequences for expected residue frequency calculation. Alternatively, if the artificial residue was not contained within the context data set, this background frequency calculation procedure would fail entirely as no matching sequences would be identified.  3.1.2.1.2       Background frequencies Integral to the discussion of the differences between this tool, Motif-X, and MoMo are the philosophical differences in how background frequency calculation is approached by each of the tools. Given that all three tools detect enrichment of positional residues in reference to expected residue frequencies derived from a background data set, the manner in which this background data set is compiled, and these expected frequencies are computed is of the utmost importance. Motif-X computes background frequencies through the alignment of subsequences from a user-specified context proteome. These subsequences are equal in length to the sequences in the foreground dataset, and correspond to the residues flanking occurrences of the fixed central residue specified by the user in the context proteome.16,17 As new patterns are detected and the foreground dataset is partitioned based on these patterns, a new background dataset is computed for each subset by selecting only those sequences from the original background dataset that feature the pattern defining the subset.16,17 Expected residue frequencies are calculated from these background data sets as the percentage frequency of each residue in each position. Therefore, enrichment can be described as the overrepresentation of positional residues in reference to the subset of sequences in the context proteome which also contain the fixed positional residues corresponding to the foreground data set. 81   While this strategy provides an increasingly specific set of expected residue frequencies, it also enforces the assumption that sequences targeted by enzyme specificities are rare within the subset of the proteome corresponding to the central residue as well as within the progressively more specific subsets of the proteome corresponding to subsequently detected positional residues. This, however, is not necessarily a fair assumption as many enzymes exhibit specificity for a combination of positional residues. Moreover, the occurrence of each constituent of these combinations in the context proteome may be highly correlated, meaning that the subset of sequences containing one constituent may also contain the other constituent(s) at high frequency. Therefore, though the strategy employed by Motif-X is statistically valid, it may underestimate biologically relevant enrichments which arise as a consequence of correlations between positions due to its strict assumption of positional independence. MoMo employs a substantially different approach for expected residue frequency calculation, instead generating a background data set by shuffling the positional residues of the foreground data set while preserving the fixed central residue.18 Subsets of this shuffled data set are then tracked in the same manner utilized by Motif-X, however the background dataset is intended to reflect the amino acid distribution of the foreground data set rather than a biological context such as the proteome of an organism. MoMo therefore claims to de-emphasize general differences in amino acid abundance found in a foreground data set relative to what would be expected from a biological context data set, instead exclusively favoring residues which occur at higher than expected frequency in a subset of positions in the context of the residue distribution observed in the foreground data set.18 This interpretation regards sample-wide differences in amino acid frequency between a foreground data set and a context data set such as a proteome as meaningless statistical annoyances leading to overestimation of enrichment and undermining the use of exact hypothesis testing due to the differences between the foreground and background distributions. While this view does accommodate the notion that biological sequence data is inherently patterned and that specific amino acids are not evenly distributed throughout the proteome, it fails to recognize that differences in amino acid composition between a sample and its broader biological context may be of great interest with respect to the interpretation of an analysis. These differences should not be normalized out unless a conscious decision 82   has been made to do so on the basis that one is exclusively interested in positional variation within a sample rather than in the compositional variation of a sample relative to its context. The issues with MoMo’s strategy for calculation of expected residue frequencies go deeper than this, however, in that they apply the position-specific methodology developed by Motif-X to sets of shuffled, artificial sequences. In a manner similar to Motif-X, MoMo selects the subset of sequences from its background data set containing positional residues matching the fixed positional residues in the foreground data set (only the fixed central residue for root-level iterations of the algorithm).18 The logic behind this design decision in Motif-X is that the positional residue composition of these subsets may differ from the positional residue composition of the background data set at large. However, in MoMo’s shuffled foreground-derived background data set, positional residue composition is randomized at the sequence level.18 Therefore, any such differences are the product of chance, cross-position fluctuations in residue frequency within individual sequences, error in either the generation or application of the pseudo-random numbers used by MoMo, or a combination of those factors. At best, this strategy is dramatically less computationally efficient than averaging expected residue frequencies across positions, which would provide the uniformity sought by the authors of MoMo, while conferring no advantage. At worst, MoMo may ultimately incorporate systematic errors into its background frequencies, particularly for small foreground data sets or foreground data sets which exhibit extreme differences in amino acid percentage frequency. Our algorithm computes expected residue frequencies by calculating the percentage frequency of each residue in a context data set. By default, this data set is the Swiss-Prot human proteome, however an alternative data set may be supplied either as a set of context sequences (such as the proteome of a different organism) or as a list of pre-computed expected frequencies. Of note is that a user may select the set of foreground sequences for use as the context data set, thus achieving a similar effect to what MoMo aims to accomplish. By tracking the removal and selection of positional residues, as described in Section 2.2.1.2, our algorithm compensates for the effective enrichment of residual residues following each set reduction step thus ensuring consistent and accurate estimation of expected residue frequencies on a positional basis. Furthermore, by eliminating the need for a sequence feature common to every 83   member of the background data set, such as the central residue which must be specified in Motif-X, our algorithm supports the analysis of data sets symmetric about a bond rather than a residue. This approach affords users a great deal of flexibility, allowing them to determine what they consider relevant to the calculation of positional residue enrichment. If a user is interested in sensitively comparing the composition of a sample to a known baseline, it may be in their best interest to supply the proteome of the organism as the context data set. Alternatively, users may be interested in detecting only uncommon patterns in the context of a particular modification site or sequence feature. In this case, the user may be best served by supplying a subset of sequences from the proteome corresponding to the feature(s) of interest. If a user is strictly interested in patterns which reflect positions that differ significantly from the internal average amino acid composition, they are provided with the option to calculate expected residue frequencies from the foreground data set.  3.1.2.1.3       Positional weighting A key difference between our algorithm and alternative motif detection algorithms such as Motif-X and MoMo is the incorporation of a positional weighting term into the enrichment calculation utilized by this tool. Motif-X and MoMo rank positional residue enrichments strictly on the probability of observing as many or more occurrences of a positional residue based on a set of expected residue frequencies. This calculation alone, however, only partially describes the relevance of the enrichment of a particular positional residue in the context of a sample. In addition to the individual overrepresentation of each residue, it is important to consider deviation of amino acid diversity in each position of the aligned data relative to what one would expect given the percentage frequencies of amino acids in the background data set. Our positional weighting term therefore assigns a higher score to positional residues which occur in positions with lower-than-expected diversity, thus accomplishing two things. Firstly, overrepresentation of rare residues is less heavily emphasized than by Motif-X and MoMo, except for cases in which that overrepresentation is also accompanied by diminished amino acid diversity in the same position. Secondly, global positional features are integrated into each individual enrichment calculation, adding emphasis to overrepresented residues in positions with lower-than-expected diversity. This proves 84   particularly advantageous in the analysis of residue-centered data sets due to the fact that our tool does differentiate between central residues and all other residues. 3.1.2.1.4       Compound residues Compound residues are a unique feature of our algorithm which serve to aggregate sequences into more comprehensive groups of the basis of biochemically similar amino acids which may be better interpreted as a single, function positional residue in the data. Enzyme specificity is generally dependent on biochemical properties such as polarity, charge, shape, and size. Several distinct amino acids may in fact be highly similar in terms of one or more of these properties. Therefore, while some enzymes exhibit exquisite specificity for a particular substrate, many enzymes will accept a variety of residues with a common profile of biochemical properties. It therefore stands to reason that allowing the algorithm to detect groups of similar amino acids may result in more accurate depiction of the enriched patterns in a data set. For instance, with compound residues enabled, our algorithm detected an S*x[DE][DE] pattern in the analysis described in Section 2.2.3.1.3, while Motif-X detected a simpler S*xD pattern. Examination of the logo map corresponding to this pattern detected by Motif-X reveals an overrepresentation of both aspartic acid (D) and glutamic acid (E) three positions to the right of the central serine residue which corresponds to the additional fixed position of the pattern detected by our algorithm. Neither of these overrepresentations were sufficient to be detected as enriched by Motif-X alone, however, by aggregating the sequences containing both amino acids into a single compound residue on the basis of their acidic side chains, our algorithm was able to detect a cumulative enrichment or the pair. Furthermore, examination of the logo map corresponding to the RxxxxxS* pattern shows a residual overrepresentation of glutamic acid two positions to the right of the central serine residue. Analysis by our tool with compound residues enabled reveals that these sequences are better described generalizing the S*xD pattern produced by Motif-X from a fixed aspartic acid to a fixed acidic residue two positions to the right of the central serine residue. This generalization is embedded within the corresponding pattern detected by our algorithm.   85    3.1.2.2      Gibbs clustering discussion  3.1.2.2.1       Limitations due to unpredictability of number of clusters Gibbs clustering relies on the user to specify the number of clusters expected in a data set in advance, though it supports the specification of a range from 1 cluster to 15 clusters. If a range is specified, the Gibbs clustering algorithm will attempt to select the best solution from the range by the solution set with the maximal system-wide KLD.21 While this is an effective solution for the analysis of data sets which can reasonably be expected to contain at least one cluster with a distinctive sequence composition, our analyses revealed that Gibbs clustering struggles to consolidate a sample containing no distinctive feature into a single cluster. Instead, Gibbs clustering segregates sequences into the smallest, most homogeneous clusters possible due to the fact that, in a random sample, these small clusters inherently exhibit the greatest intra-cluster similarity as well as the greatest inter-cluster distance. This limitation is known to the authors of Gibbs clustering and they have responded by including a penalty on small clusters.21 However, the default penalty was not sufficient to mitigate this behavior in the analysis described in Section 2.2.3.2.2. Selection of an appropriately penalty is challenging if the number of anticipated clusters is unknown, and this is of great concern for the analysis of PTM datasets, particularly when those data sets are derived from a real biological context. A sample collected from a complex biochemical environment such as physiological tissue or fluid may contain sequences reflecting anywhere from zero to a large number of enzyme activities. Moreover, these activities may be extremely specific, enriching a single sequence feature, intermediately or mildly specific, enriching multiple sequence features, or generalistic, leading to no enrichment of sequence features. While sufficient metadata may help to guide one to a reasonable expectation regarding the range in which the number of clusters may fall for a particular sample, accurate prediction of the actual number is rarely feasible.   86   3.1.2.2.2       Analysis of mixed MEROPS protease substrates The analyses conducted in Section 2.2.3.2.3 revealed that Gibbs clustering and hierarchical clustering struggled to separate substrates corresponding to three distinct proteases. Both methods were able to isolate a cluster corresponding to the substrates of caspase-6 due to the fact that these sequences were clearly differentiated by their sequence composition. However, segregation of the substrates of matrix metallopeptidase-9 and cathepsin B was less effective due to the commonality of their sequence motifs. Though each of these two proteases has a distinctive set of patterns describing their specificity, both corresponding substrate sets contain a general overrepresentation of small and hydrophobic residues in every position. What ultimately distinguishes the specificity of the two proteases are the positions in which these residues are most overrepresented, rather than an obvious difference in positional composition as is the case for caspase-6. Therefore, when mixed together, these substrates become challenging to distinguish by comparison of their sequences either to each other or to patterns detected in sequence subsets corresponding to each protease. Despite these challenges, our algorithm succeeds in identifying a set of foreground patterns which match the sets of pre-computed patterns associated with each of the proteases corresponding to the supplied substrate sequences. These patterns, as well as their relationship to prior knowledge of protease specificity, are clearly represented in the heatmap generated by our tool. This supports the assertion that our algorithm generates valuable and informative output even for the analysis of data sets which present great challenges to alternative motif detection and clustering algorithms. A caveat to this observation is that a selection of the detected patterns also match alternative caspases and matrix metallopeptidases in the database. This is not the product of inaccuracy, but is instead a consequence of mutual specificity shared by these enzymes. As discussed in the introduction of the MEROPS database, proteases are grouped into families based on sequence similarity. Due to the fact that cleavage-site specificity partially determined by the primary structure of a protease, it is not uncommon to find that multiple members of the same family cleave very similar substrates. It is therefore encouraging to find that our algorithm correctly identified the three families corresponding to the input proteases, as well as that specific input proteases were flagged as matches within those families.  87   3.1.2.2.3       Analysis of mixed MEROPS protease substrates Given that in this case the desired number of clusters was known based on the number of distinct protease substrate sets supplied as input, the Gibbs clustering solution for 3 groups was investigated in further detail. The outputs of this Gibbs clustering solution were interpreted in light of the results generated by our algorithm.   Figure 33. Gibbs Clustering Results: Logo Maps for 3 Cluster Solution on Mixed MEROPS Protease Substrate Sequences. Clustering solution selected based on the known number of distinct protease substrate sets supplied as input to the Gibbs clustering algorithm.  Group one in Figure 28 exhibits a clear overrepresentation of acidic residues in positions 2 and 4, as well as slight overrepresentation acidic residues as well as valine in position 1. This pattern of overrepresentation is consistent with the specificity of caspase-6. Group two in Figure 28 shows a maximally enriched pattern of GPxG|xxGx. This pattern matches multiple matrix metallopeptidases belonging to the MEROPS M10 family, consistent with the known input set of MMP-9 substrates and reflected by the patterns detected by our algorithm as reflected in Figure 16. Finally, Group 3 displays a strong overrepresentation of glycine in position 7 and a mixture of other overrepresented small and hydrophobic residues in other positions. While this logo map would be challenging to interpret without the support of the heatmap generated by our tool, it becomes trivial to recognize that this cluster most closely corresponds to the xxxG|xxGx pattern associated matching cathepsin B in Figure 16. Although this clustering solution does not perfectly resolve the substrates into clusters precisely corresponding to the input proteases, it does a better job than both the hierarchical clustering solution produced by our tool and the solution selected by Gibbs clustering based on maximum average KLD. The 88   superiority of this solution is evident when comparing the logo maps from the 3 cluster solution in Figure 32 to the logo maps of the protease substrate subsets shown in Figures 13, 14, and 15. Notably similar features distinguishing the predominantly-MMP-9 group from the predominantly-cathepsin B group include the enrichment of glycine and proline in the first and second positions, respectively, of Group Two, as well as the enrichment of leucine in the second position of Group 3. This finding supports an approach involving the use of the patterns detected by our tool as well as the accompanying annotations relating those patterns to prior knowledge of protease specificity as a point of reference for interpreting the results of other clustering algorithms.  3.1.3   Discussion of applied analyses  3.1.3.1      Frantzi et al., 2016 UBC biomarker panel The Frantzi et al., 2016 paper used for the first set of applications of our algorithm to experimental data focuses on the development of a urinary peptide biomarker panel. To accomplish this, the authors used a machine learning-based classifier to identify high-confidence biomarkers from a list of peptides detected in urine samples from urothelial bladder cancer patients.39 The significant biomarkers reported in the paper prominently featured proteolytic fragments generated by matrix metallopeptidase activity.39 Using an unbiased, probabilistic pattern detection approach, our algorithm detected a set of patterns in the full list of urinary peptides which corresponded to patterns which we previously detected in data sets containing curated substrates of multiple matrix metallopeptidases. This is a powerful demonstration of our tool’s capability to produce clinically-relevant findings using techniques which are not limited by the biases and training requirements of supervised classifiers. The result sets an encouraging precedent for the application of our tool to the analysis of clinical patient data. Moreover, application of our tool to the list of significant biomarkers revealed a more specific set of matrix metallopeptidase activities with historically-validated correlation to bladder cancer stage and aggressiveness.41,42 This suggests that the incorporation of our tool into workflows already employing alternative classification techniques may aid in the interpretation of those results as well as provide support for the findings of other statistical tools by orthogonal means. 89   3.1.3.1      Wiita et al., 2014 chemotherapy-induced apoptosis peptide indicators The second set of applications of our algorithm to laboratory data drew from a Wiita et al., 2014 paper discussing the enrichment and detection of peptide indicators of apoptosis as a means of monitoring the progress of chemotherapy.43 Our algorithm detected caspase activity in both the unbiased discovery stage data set produced by the authors while looking for candidate biomarkers, as well as in the biased sequencing results generated using an inclusion list of peptides likely to be associated with apoptosis. Additionally, our algorithm detected no overrepresented patterns in a sample of control peptides drawn from normal blood plasma. These results, coupled with our findings in the analysis of the urinary peptide biomarker study, establish the capability of our algorithm in detecting relevant indicators of both disease and response to treatment. Furthermore, we demonstrated in both cases that our algorithm is not dependent on preliminary classification or screening of input data sets using an alternative algorithm. Rather, the signatures detected by this tool were observed in raw data sets collected using unbiased techniques and conserved or augmented in the preprocessed data sets.  3.2 Future directions Given the demonstrated usefulness of this algorithm, the possibility of future enhancements is worth exploring. For instance, this tool could eventually be enhanced through the incorporation of quantitative data collected using mass spectrometry. The reliability and accuracy of both protein- and peptide-level quantitation have improved dramatically in recent years, however much work remains in ensuring the validity of these measurements, as well as in normalizing measurements between laboratories and instruments. A design decision to avoid incorporation of quantitative data was therefore made at this time, as the core purpose of this algorithm is to provide an alternative to biased protein sequencing analysis techniques. That being said, these data may ultimately aid in the disambiguation of results, particularly with respect to proteases exhibiting very similar specificities but different enzyme kinetics. An additional obvious enhancement would be to enable the submission of multiple samples, providing output reflecting the differential composition of each with respect to the other(s). This would 90   streamline the comparison of experimental samples to baselines, as well as comparison of experimental samples across conditions. The results of our analyses, and particularly those of the applied analyses of clinical data, established this algorithm’s ability to detect activity profiles indicative of biological states such as malignancy and apoptosis. Further analysis of data sets such as these, coupled with large scale compilation of the results, may support the identification of reproducible signatures corresponding to specific types of disease or physiological processes. These could hypothetically in turn be applied as rapid diagnostic indicators both in the clinic as well as in the laboratory. While the utility of this tool for the detection of enriched patterns in data sets containing other types of post-translational modifications such as phosphorylation has already been demonstrated, the tool does not currently integrate existing knowledge regarding the specificity of these types of modifications as it does for proteolysis. By linking the pattern detection algorithm to additional databases in the same fashion that it is linked to the proteolytic cleavage event data in the MEROPS database, the outputs of the tool could be annotated with existing knowledge of the specificity of a broad range of PTM-mediating enzymes.  3.3 Concluding remarks Contained within this thesis is a comprehensive description of the development, validation, and initial applications of a novel tool for the detection of patterns in biological sequence data sets. This tool was developed with an emphasis on the analysis of proteolytic fragments sequenced using mass spectrometry due to the fact that this domain is underserved by current alternative algorithms. The accuracy and validity of results produced by this algorithm are exhaustively explored through a series of internal performance characterization tests and comparison to alternative algorithms. The biological relevance and utility of detected patterns are then established through the analysis of multiple physiological data sets. Finally, these findings are discussed in light of prior work, the implications of this tool in the context of protein research are considered, and potential future enhancements to the methodology are proposed.   91   References  1.  Berg JM, Tymoczko JL, Stryer L. Amino Acids Are Encoded by Groups of Three Bases Starting from a Fixed Point - Biochemistry - NCBI Bookshelf. 2002. 2.  Aebersold R, Agar JN, Amster IJ, et al. How many human proteoforms are there? Nat Chem Biol. 2018;14(3):206-214. doi:10.1038/nchembio.2576 3.  Smith LM, Kelleher NL, Consortium for Top Down Proteomics. Proteoform: a single term describing protein complexity. Nat Methods. 2013;10(3):186-187. doi:10.1038/nmeth.2369 4.  Bantscheff M, Lemeer S, Savitski MM, Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem. 2012;404(4):939-965. doi:10.1007/s00216-012-6203-4 5.  Krishna RG, Wold F. Post-Translational Modifications of Proteins. In: Imahori K, Sakiyama F, eds. Methods in Protein Sequence Analysis. Boston, MA: Springer US; 1993:167-172. doi:10.1007/978-1-4899-1603-7_21 6.  Neurath H. Evolution of proteolytic enzymes. Science. 1984;224(4647):350-357. doi:10.1126/science.6369538 7.  Oda K. New families of carboxyl peptidases: serine-carboxyl peptidases and glutamic peptidases. J Biochem. 2012;151(1):13-25. doi:10.1093/jb/mvr129 8.  Rawlings ND, Waller M, Barrett AJ, Bateman A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 2014;42(Database issue):D503-9. doi:10.1093/nar/gkt953 9.  Fortelny N, Cox JH, Kappelhoff R, et al. Network analyses reveal pervasive functional regulation between proteases in the human protease web. PLoS Biol. 2014;12(5):e1001869. doi:10.1371/journal.pbio.1001869  92   10.  Bonfili L, Cecarini V, Berardi S, et al. Microbiota modulation counteracts Alzheimer’s disease progression influencing neuronal proteolysis and gut hormones plasma levels. Sci Rep. 2017;7(1):2426. doi:10.1038/s41598-017-02587-2 11.  Hillebrand LE, Reinheckel T. Impact of proteolysis on cancer stem cell functions. Biochimie. March 2019. doi:10.1016/j.biochi.2019.03.002 12.  Vergnolle N. Protease inhibition as new therapeutic strategy for GI diseases. Gut. 2016;65(7):1215-1224. doi:10.1136/gutjnl-2015-309147 13.  duVerle DA, Mamitsuka H. A review of statistical methods for prediction of proteolytic cleavage. Brief Bioinformatics. 2012;13(3):337-349. doi:10.1093/bib/bbr059 14.  Song J, Li F, Leier A, et al. PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy. Bioinformatics. 2018;34(4):684-687. doi:10.1093/bioinformatics/btx670 15.  Klein J, Eales J, Zürbig P, Vlahou A, Mischak H, Stevens R. Proteasix: a tool for automated and large-scale prediction of proteases involved in naturally occurring peptide generation. Proteomics. 2013;13(7):1077-1082. doi:10.1002/pmic.201200493 16.  Schwartz D, Gygi SP. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol. 2005;23(11):1391-1398. doi:10.1038/nbt1146 17.  Chou MF, Schwartz D. Biological sequence motif discovery using motif-x. Curr Protoc Bioinformatics. 2011;Chapter 13:Unit 13.15-24. doi:10.1002/0471250953.bi1315s35 18.  Cheng A, Grant CE, Noble WS, Bailey TL. MoMo: Discovery of statistically significant post-translational modification motifs. Bioinformatics. December 2018. doi:10.1093/bioinformatics/bty1058 19.  Bairoch A, Apweiler R. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J Mol Med. 1997;75(5):312-316.  93   20.  UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506-D515. doi:10.1093/nar/gky1049 21.  Andreatta M, Alvarez B, Nielsen M. GibbsCluster: unsupervised clustering and alignment of peptide sequences. Nucleic Acids Res. 2017;45(W1):W458-W463. doi:10.1093/nar/gkx248 22.  Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992;89(22):10915-10919. doi:10.1073/pnas.89.22.10915 23.  Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008;26(12):1367-1372. doi:10.1038/nbt.1511 24.  Dunn OJ. Multiple Comparisons among Means. J Am Stat Assoc. 1961;56(293):52-64. doi:10.1080/01621459.1961.10482090 25.  Voorhees EM. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing & Management. 1986;22(6):465-476. doi:10.1016/0306-4573(86)90097-X 26.  Hunter JD. Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007;9(3):90-95. doi:10.1109/MCSE.2007.55 27.  Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14(6):1188-1190. doi:10.1101/gr.849004 28.  van der Walt S, Colbert SC, Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation. Comput Sci Eng. 2011;13(2):22-30. doi:10.1109/MCSE.2011.37 29.  Rich SM, Leendertz FH, Xu G, et al. The origin of malignant malaria. Proc Natl Acad Sci USA. 2009;106(35):14902-14907. doi:10.1073/pnas.0907740106 30.  Pease BN, Huttlin EL, Jedrychowski MP, et al. Characterization of Plasmodium falciparum Atypical Kinase PfPK7- Dependent Phosphoproteome. J Proteome Res. 2018;17(6):2112-2123. doi:10.1021/acs.jproteome.8b00062 94   31.  Colaert N, Helsens K, Martens L, Vandekerckhove J, Gevaert K. Improved visualization of protein consensus sequences by iceLogo. Nat Methods. 2009;6(11):786-787. doi:10.1038/nmeth1109-786 32.  Pearson RB, Kemp BE. [3] Protein kinase phosphorylation site sequences and consensus specificity motifs: Tabulations. In: Protein Phosphorylation Part A: Protein Kinases: Assays, Purification, Antibodies, Functional Analysis, Cloning, and Expression. Vol 200. Methods in Enzymology. Elsevier; 1991:62-81. doi:10.1016/0076-6879(91)00127-I 33.  Marte BM, Downward J. PKB/Akt: connecting phosphoinositide 3-kinase to cell survival and beyond. Trends Biochem Sci. 1997;22(9):355-358. doi:10.1016/S0968-0004(97)01097-9 34.  Xue Y, Zhou F, Zhu M, Ahmed K, Chen G, Yao X. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res. 2005;33(Web Server issue):W184-7. doi:10.1093/nar/gki393 35.  Mewes HW, Heumann K, Kaps A, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 1999;27(1):44-48. doi:10.1093/nar/27.1.44 36.  Pease BN, Huttlin EL, Jedrychowski MP, et al. Global analysis of protein expression and phosphorylation of three stages of Plasmodium falciparum intraerythrocytic development. J Proteome Res. 2013;12(9):4028-4045. doi:10.1021/pr400394g 37.  Knippschild U, Gocht A, Wolff S, Huber N, Löhler J, Stöter M. The casein kinase 1 family: participation in multiple cellular processes in eukaryotes. Cell Signal. 2005;17(6):675-689. doi:10.1016/j.cellsig.2004.12.011 38.  Gao Y, Wang H. Casein kinase 2 Is activated and essential for Wnt/beta-catenin signaling. J Biol Chem. 2006;281(27):18394-18400. doi:10.1074/jbc.M601112200 39.  Frantzi M, van Kessel KE, Zwarthoff EC, et al. Development and Validation of Urine-based Peptide Biomarker Panels for Detecting Bladder Cancer in a Multi-center Study. Clin Cancer Res. 2016;22(16):4077-4086. doi:10.1158/1078-0432.CCR-15-2715   95   40.  Dakna M, He Z, Yu WC, Mischak H, Kolch W. Technical, bioinformatical and statistical aspects of liquid chromatography-mass spectrometry (LC-MS) and capillary electrophoresis-mass spectrometry (CE-MS) based clinical proteomics: a critical assessment. J Chromatogr B Analyt Technol Biomed Life Sci. 2009;877(13):1250-1258. doi:10.1016/j.jchromb.2008.10.048 41.  Szarvas T, vom Dorp F, Ergün S, Rübben H. Matrix metalloproteinases and their clinical relevance in urinary bladder cancer. Nat Rev Urol. 2011;8(5):241-254. doi:10.1038/nrurol.2011.44 42.  Rodriguez Faba O, Palou-Redorta J, Fernández-Gómez JM, et al. Matrix metalloproteinases and bladder cancer: what is new? ISRN Urol. 2012;2012:581539. doi:10.5402/2012/581539 43.  Wiita AP, Hsu GW, Lu CM, Esensten JH, Wells JA. Circulating proteolytic signatures of chemotherapy-induced cell death in humans discovered by N-terminal labeling. Proc Natl Acad Sci USA. 2014;111(21):7594-7599. doi:10.1073/pnas.1405987111 44.  Mahrus S, Trinidad JC, Barkan DT, Sali A, Burlingame AL, Wells JA. Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini. Cell. 2008;134(5):866-876. doi:10.1016/j.cell.2008.08.012 45.  Crawford ED, Seaman JE, Agard N, et al. The DegraBase: a database of proteolysis in healthy and apoptotic human cells. Mol Cell Proteomics. 2013;12(3):813-824. doi:10.1074/mcp.O112.024372 46.  Whiteaker JR, Lin C, Kennedy J, et al. A targeted proteomics-based pipeline for verification of biomarkers in plasma. Nat Biotechnol. 2011;29(7):625-634. doi:10.1038/nbt.1900 47.  Crawford ED, Wells JA. Caspase substrates and cellular remodeling. Annu Rev Biochem. 2011;80:1055-1087. doi:10.1146/annurev-biochem-061809-121639 48.  Ichiki T, Huntley BK, Heublein DM, et al. Corin is present in the normal human heart, kidney, and blood, with pro-B-type natriuretic peptide processing in the circulation. Clin Chem. 2011;57(1):40-47. doi:10.1373/clinchem.2010.153908 49.  Kraut J. Serine proteases: structure and mechanism of catalysis. Annu Rev Biochem. 1977;46:331-358. doi:10.1146/annurev.bi.46.070177.001555  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            data-media="{[{embed.selectedMedia}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0380835/manifest

Comment

Related Items