UBC Theses and Dissertations

Generation of truncated proteoforms in proteolytic networks : modeling and prediction in the protease… Fortelny, Nikolaus 2016

Full Text

GENERATION OF TRUNCATED PROTEOFORMS IN PROTEOLYTIC NETWORKS: MODELING AND PREDICTION IN THE PROTEASE WEB

by

Nikolaus Fortelny

B.Sc., University of Vienna, 2009
M.Sc., University of Geneva, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Biochemistry and Molecular Biology)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

July 2016

© Nikolaus Fortelny, 2016

Abstract

Primarily controlled by gene expression and fine-tuned by translation and degradation rates, protein activity is further governed by a plethora of post-translational modifications such as phosphorylation and glycosylation, which generate a diversity of protein species and thereby control complex biological phenotypes. Processing by proteases is a particular modification that leads to the irreversible generation of stable protein truncations. Although processing is well understood in examples such as signal peptide or propeptide removal, recent analyses consistently find that >50% of identified N-terminal peptides map inside the protein sequence predicted by genomics, indicating an important regulatory role of proteases. All proteins undergo protease cleavage as part of processing or degradation, a second biological process controlled by proteases. Proteases are involved in numerous pathologies and are commonly considered as drug targets. However, protease research and drug development are complicated, in part by widespread crosstalk between proteases. Proteases regulate other proteases through direct cleavage or through cleavage of protease inhibitors in a complex network of protease interactions, the protease web. Yet, a comprehensive analysis of the protease web has not been performed, hampering the assignment of proteases to clear biological roles and direct substrates, as well as drug targeting with protease inhibitors.
A second problem in the identification of protein processing is the potential confounding of protein termini generated by protease processing, alternative splicing, and alternative translation. In this thesis, I computationally analyzed large and diverse datasets of protease interactions and protein truncations to gain insight into complex proteolytic processes and to guide biochemical follow-up experiments. Analyzing protease cleavage, alternative splicing, and alternative translation data incorporated into our database TopFIND, I found that protease cleavage and alternative translation likely generate most protein truncations. Combining protease cleavage and inhibition data in a graph model of the protease web, I demonstrated extensive protease crosstalk and then predicted and validated a proteolytic pathway. Finally, investigating strategies for the prediction of protease inhibition, I predicted hundreds of protease-inhibitor interactions and validated inhibition of kallikrein-5 by serpin B12. This work thus generated predictions for biochemical follow-up as well as important insights into the regulation of biological systems through proteases.

Preface

Together with my supervisors, Dr. Paul Pavlidis and Dr. Christopher Overall, I was responsible for the identification and design of the research program described in this thesis. I was the primary author of every chapter and of the corresponding publications. My supervisors contributed study design, supervision, concepts, text, and editorial suggestions for all chapters.

A version of chapter 2 has been published (Fortelny N, Yang S, Pavlidis P, Lange PF and Overall CM (2015). Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events, Nucleic Acids Research, 43 (D1): D290-D297). Sharon Yang contributed code to the development of the software TopFINDer under my supervision.
Philipp Lange implemented cross mapping of TopFIND data between protein isoforms, helped design the study, and provided editorial support. I developed TopFINDer and PathFINDer and added inferred protein termini from alternative splicing and alternative translation to TopFIND. I also wrote most of the manuscript.

A version of chapter 3 has been published (Fortelny N, Pavlidis P and Overall CM (2015). The path of no return—Truncated protein N-termini and current ignorance of their genesis, PROTEOMICS, 0, 1–6). I performed all analyses and wrote most of the manuscript.

A version of chapter 4 has been published (Fortelny N, Cox JH, Kappelhoff R, Starr AE, Lange PF, Pavlidis P, Overall CM (2014). Network Analyses Reveal Pervasive Functional Regulation Between Proteases in the Human Protease Web, PLOS Biology, 12(5): e1001869). Jennifer Cox and Amanda Starr performed all biochemical experiments. Reinhild Kappelhoff contributed tissue gene expression data of proteases and inhibitors. Philipp Lange contributed to experimental design and edited the manuscript. I performed all computational analyses and wrote most of the manuscript.

At the time of writing, a version of chapter 5 has been submitted for publication (Fortelny N, Butler G, Overall CM and Pavlidis P (2016). Prediction of inhibitory protein-protein interactions between protease inhibitors and their target proteases in the protease web—pitfalls and successes). Georgina Butler contributed to study design, supervised the biochemical validation experiments, and provided editorial assistance. I performed all biochemical and computational work and wrote most of the manuscript.

Publications related to this thesis:

Fortelny N, Pavlidis P and Overall CM (2015) The path of no return—Truncated protein N-termini and current ignorance of their genesis, PROTEOMICS, 0, 1–6

Fortelny N, Yang S, Pavlidis P, Lange PF* and Overall CM* (2015) Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events, Nucl Acids Res, 43 (D1): D290-D297

Fortelny N, Cox JH, Kappelhoff R, Starr AE, Lange PF, Pavlidis P, Overall CM (2014) Network Analyses Reveal Pervasive Functional Regulation Between Proteases in the Human Protease Web, PLOS Biol, 12(5): e1001869

Fortelny N, Butler G, Overall CM* and Pavlidis P* (2016) Prediction of inhibitory protein-protein interactions between protease inhibitors and their target proteases in the protease web—pitfalls and successes, under review

Conference presentations related to this thesis:

Talk/poster: Characterization and prediction in a computational representation of the protease web. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, Christopher M. Overall, 9th General Meeting of the International Proteolysis Society, Penang, Malaysia, October 4-8, 2015

Poster: Truncated protein isoforms and their genesis in the human proteome. Nikolaus Fortelny, Sharon Yang, Paul Pavlidis, Philipp F. Lange, Christopher M. Overall, Annual Congress of the Human Proteome Organization (HUPO), Vancouver, Canada, September 28, 2015

Poster: The protease web: a pervasive and complex proteolytic network generating a multitude of protein isoforms. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, Christopher M. Overall, Annual Congress of the Human Proteome Organization (HUPO), Vancouver, Canada, September 28, 2015

Talk: Making sense of lists of protease substrates using TopFINDer and PathFINDer. Nikolaus Fortelny, Jennifer Cox, Sharon Yang, Reinhild Kappelhoff, Amanda E. Starr, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, Pacific Coast Protease Meeting, Desert Hot Springs, CA, USA, May 3-6, 2015

Poster: The protease web: a pervasive and complex proteolytic network. Nikolaus Fortelny, Jennifer H. Cox, Amanda E. Starr, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, 14th Annual International Symposium, Institute for Systems Biology, Seattle, WA, USA, April 6-7, 2015

Poster: Pervasive interactions of proteases and their inhibitors form protein networks as part of a global protease web. Nikolaus Fortelny, Jennifer H. Cox, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, US Human Proteome Organization (US HUPO), Seattle, WA, USA, April 6-9, 2014

Talk: Pervasive interactions of proteases and their inhibitors form protein networks as part of a global protease web. Nikolaus Fortelny, Jennifer H. Cox, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, 7th International Symposium on Serpin Biology, Structure and Function, Leogang, Austria, March 29 - April 2nd, 2014

Talk: Pervasive interactions of proteases and their inhibitors form protein networks as part of a global protease web. Nikolaus Fortelny, Jennifer H. Cox, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, 31st Winterschool on Proteases and Inhibitors, Tiers, Italy, February 26th - March 2nd, 2014

Talk/poster: Gene coexpression and protein interaction networks to fill in gaps in the protease web. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, Cascadia Proteomics Symposium, Seattle, WA, USA, July 15-16, 2013

Talk: Bioinformatic analysis of the human protease web reveals a highly robust pervasive interaction network. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, Canadian Proteomics Network Symposium, Vancouver, Canada, April 20-24, 2013

Poster: Bioinformatic analysis of the human protease web reveals a highly robust pervasive interaction network. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, Asia Pacific Bioinformatics Conference, Vancouver, Canada, Jan 21-23, 2013

Talk/poster: Bioinformatic analysis of the human protease web reveals a highly robust pervasive interaction network. Nikolaus Fortelny, Reinhild Kappelhoff, Philipp F. Lange, Paul Pavlidis, and Christopher M. Overall, Cascadia Proteomics Symposium, Seattle, WA, USA, July 19-21, 2012

Table of contents

Abstract
Preface
Table of contents
List of tables
List of figures
List of abbreviations
Acknowledgements
Chapter 1: Introduction
  1.1 Proteases and their biochemical properties
  1.2 Biological functions of proteases
    1.2.1 Proteases in protein degradation
    1.2.2 Protease processing in protein maturation
    1.2.3 Protease processing in the regulation of mature proteins
  1.3 Protease cascades
    1.3.1 Caspases in apoptosis
    1.3.2 Coagulation cascade
    1.3.3 Complement cascade
  1.4 Regulation of protease activity by protease inhibitors
  1.5 Protease diseases, drug targeting and failures
  1.6 Biochemical methods for the identification of protease substrates
    1.6.1 Terminomics: targeted methods to identify N- and C-termini
  1.7 Terminomics as atypical tools in the identification of proteins
    1.7.1 Extent and genesis of N-terminal truncations
  1.8 Protease-centered substrate identification experiments
  1.9 Protease networks and cross talks in the protease web
  1.10 Network modeling of biological networks
    1.10.1 Topological network analysis
    1.10.2 Biochemically testable predictions based on network topology
  1.11 Prediction of protein interactions
    1.11.1 Coexpression
    1.11.2 Phylogenetic profiles
    1.11.3 Colocalization
    1.11.4 Co-annotation and co-mentioning in literature
    1.11.5 Applications of protein-protein interaction prediction
    1.11.6 Computational predictions of protease cleavage
  1.12 Databases of protease cleavage and inhibition
  1.13 Themes and outlook
Chapter 2: Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events
  2.1 Introduction
  2.2 Methods
  2.3 Results
    2.3.1 Changes to the database content
      Additional protein termini originating from alternative splicing and alternative translation
    2.3.2 TopFINDer—the TopFIND ExploreR
    2.3.3 PathFINDer: Protease web path-finding in TopFIND
    2.3.4 Validation of TopFINDer
    2.3.5 Validation of PathFINDer
  2.4 Discussion
  2.5 Summary
Chapter 3: Truncated protein N-termini and their genesis
  3.1 Introduction
  3.2 Methods
  3.3 Results
    3.3.1 The gap between observed and inferred termini
  3.4 Discussion
  3.5 Summary
Chapter 4: Network analyses reveal pervasive functional regulation between proteases in the human protease web
  4.1 Introduction
  4.2 Methods
    4.2.1 Protease web data
    4.2.2 Classifying proteases and inhibitors
    4.2.3 Network construction and analysis
    4.2.4 Mapping mouse to human proteins
    4.2.5 Network figures
    4.2.6 Involvement of proteases and inhibitors in biological processes
    4.2.7 In vivo N-terminomics data of murine skin
    4.2.8 Analysis of protease and inhibitor expression in 23 human tissues
    4.2.9 Chemokines, proteinases, and inhibitors
    4.2.10 Animals
    4.2.11 Neutrophil isolation
    4.2.12 LIX cleavage assays
  4.3 Results
    4.3.1 Protease web data
    4.3.2 Protease web structure
    4.3.3 Theoretical network analysis of the protease web
    4.3.4 High connectivity in the protease web is robust to possible annotation errors
    4.3.5 Human tissue-specific protease webs
    4.3.6 Evidence for the protease web in other data
    4.3.7 Using the protease web to decipher in vivo network effects
  4.4 Discussion
    4.4.1 Principles of regulation in the protease web
    4.4.2 Applicability of the protease web
  4.5 Summary
Chapter 5: Prediction of inhibitory protein-protein interactions between protease inhibitors and their target proteases in the protease web
  5.1 Introduction
  5.2 Materials and methods
    5.2.1 Data analysis
    5.2.2 Proteases and protease inhibitor data
    5.2.3 Protein-protein interaction networks
    5.2.4 Protein localization
    5.2.5 Coexpression networks
    5.2.6 Phylogenetic profiling
    5.2.7 Machine learning
    5.2.8 Biochemical validation experiments
  5.3 Results
    5.3.1 Protein-protein interactions
    5.3.2 Coexpression patterns
    5.3.3 Phylogenetic profiles
    5.3.4 Colocalization
    5.3.5 Co-annotation and co-mentioning in the literature
    5.3.6 Predicting inhibitions
    5.3.7 Limited coexpression of proteases and their inhibitors
  5.4 Discussion
  5.5 Summary
Chapter 6: Conclusions
References

List of tables

Table 2.1 Counts of non-canonical termini evidenced by alternative splicing evidence (from Ensembl) or by alternative translation (from TISdb).
Table 2.2 Protease enrichment results from TopFINDer in the skin dataset.
Table 3.1 Internal N-termini (position > 3) observed and explained in the individual datasets used in this study.
Table 4.1 Human and mouse proteolytic networks created from all annotated proteases, inhibitors, and substrates.
Table 4.2 Proteins comprising the human and mouse protease webs.
Table 4.3 List of nodes with highest reachability and betweenness in the network.
Table 4.4 Reachability values of nodes in the human protease web.
Table 5.1 Microarray datasets used to generate coexpression matrices across datasets.
Table 5.2 Matrices of coexpression and phylogenetic similarity.
Table 5.3 Protein-protein interaction pairs of protease with inhibitor predicted as “inhibition” from HIPPIE.
Table 5.4 True positive (TP) and true negative (TN) inhibitor-protease pairs after applying colocalization filters.
Table 5.5 True positive (TP) inhibitor-protease pairs are enriched compared to true negatives (TN) using coexpression and colocalization filters.

List of figures

Figure 2.1 Biological processes leading to differences in termini in proteins and databases containing corresponding information.
Figure 2.2 Input and output of TopFINDer.
Figure 2.3 Fragment of the graphviz figure of protease web connections identified by PathFINDer.
Figure 3.1 Inferred and observed N-termini in the human N-terminome
Figure 3.2 Fraction of explained internal N-termini in the individual datasets analyzed and the processes identified in each dataset.
Figure 4.1 Biochemical protease interactions represented by graph theory.
Figure 4.2 Protease networks in mouse and human.
Figure 4.3 Annotation biases in protease substrate identification.
Figure 4.4 Human proteases are overrepresented as substrates.
Figure 4.5 Reachability in network examples and the human and murine protease webs.
Figure 4.6 The largest connected component of the human protease web.
Figure 4.7 New connections in known proteolytic pathways.
Figure 4.8 Interactions between protease groups in the human protease web.
Figure 4.9 The protease web compared to random networks.
Figure 4.10 Reachability in the human protease web after various perturbations.
Figure 4.11 Reachability in the network does not depend on one single node.
Figure 4.12 Reachability in tissue-specific protease webs.
Figure 4.13 Reachability in the protease web strongly depends on the presence of six important nodes.
Figure 4.14 Proteases and their inhibitors involved in multiple, discrete biological processes.
Figure 4.15 Protease web affects validation in vivo.
Figure 4.16 MALDI-TOF analysis of LIX cleavage by MMP8 and neutrophil elastase.
Figure 4.17 MMP8, neutrophil elastase, and cathepsin G cleavage of LIX.
Figure 5.1 Protein-protein interaction (PPI) networks constructed for proteases and inhibitors.
Figure 5.2 Expression patterns of proteases and their inhibitors.
Figure 5.3 Scatterplots of expression for selected pairs of proteins.
Figure 5.4 Tissue/sample distribution of datasets.
Figure 5.5 Description and comparison of coexpression and phylogenetic similarity matrices obtained.
Figure 5.6 Performance of coexpression and phylogenetic similarity matrices in predicting protease inhibition.
Figure 5.7 Performance of machine learning algorithms in predicting protease-inhibitor pairs.
Figure 5.8 Coexpression of predicted inhibitor-protease pairs shown as a network.
Figure 5.9 Inhibition of KLK5 by SERPINB12.
Figure 5.10 Similarity of coexpression and phylogenetic similarity matrices for recovering protease web inhibitions.
Figure 5.11 Recovery of inhibitor-protease pairs in matrices.

List of abbreviations

AC         accession code
ATP        adenosine triphosphate
AUC        area under the curve
BA         Barabasi-Albert model
CAFA       critical assessment of protein function annotation
COFRADIC   combined fractional diagonal chromatography
DNA        deoxyribonucleic acid
ER         Erdős-Rényi model
GBA        guilt by association
GO         gene ontology
GTI-Seq    global translation initiation sequencing
HIV        human immunodeficiency virus
HT         high through-put
IRES       internal ribosome entry sites
LDA        linear discriminant analysis
LPS        lipopolysaccharide
LT         low through-put
MALDI      matrix-assisted laser desorption ionization
mRNA       messenger RNA
MS         mass spectroscopy
PAGE       polyacrylamide gel electrophoresis
PICS       proteomic identification of cleavage sites
PMA        phorbol myristate acetate
PMN        polymorphonuclear leukocyte
PPI        protein-protein interaction
PTM        post-translational modification
RF         random forest
RNA        ribonucleic acid
ROC        receiver operator characteristic
RPKM       reads per kilobase of transcript per million mapped reads
SCX        strong-cation-exchange
SDS        sodium dodecyl sulfate
SRM        single reaction monitoring
SVM        support vector machines
TAILS      terminal amine isotopic labeling of substrates
TIS        translation initiation site
TN         true negative
TOF        time of flight
TP         true positive

Acknowledgements

First and foremost, I thank my supervisors Chris Overall and Paul Pavlidis. Chris and his inspiring excitement for novel biological discovery have been major driving forces behind this work, pushing me to realize the full impact of my results and teaching me to clearly communicate intricate analyses to international audiences. Equally crucial to this work were the many thoughtful questions and comments from Paul, who always made sure that my enthusiasm was paralleled by scientific rigor and a healthy suspicion of the many pitfalls of scientific and computational analyses that await enthusiastic bioinformaticians.
I am deeply thankful to both my supervisors for their enthusiastic engagement in our long, insightful, and enlightening discussions and their readiness and patience in embarking on this highly interdisciplinary project.

I further thank my committee members, Gabriela Cohen-Freue for her patience with my studies of statistical concepts, and Jörg Gsponer for his helpful advice and profound insights into the workings of science and the occasional Alpine German chat. I am grateful to past and current members of the Overall and Pavlidis labs for providing a supportive environment for me, the many friends I have made in those labs and at UBC, and in particular my collaborators, students, and bug-finders Tony Dufour, Anna Prudova, George Butler, Ulrich Eckhard, Sharon Yang, and Shamsuddin Bhuiyan for their collaborative effort and time for discussions. I thank Philipp Lange for his guidance at the beginning of my project and for developing TopFIND for me to extend upon. I am especially grateful to our most helpful lab managers Reinhild Kappelhoff and Sanja Rogic. I also thank Doris Metcalf for her help in organizing my studies and the Department of Biochemistry and Molecular Biology at UBC for their financial support.

In addition, I owe greatly to my many friends, roommates, and fellow Angoleiros, who have together provided the right amount of distraction from my work and are the cause of countless wonderful experiences in Vancouver and abroad. I especially thank my friends in Austria and around the world for their continued friendship and open arms, despite my perpetual absence. Finally, I owe great gratitude to my brother Stephan and my parents, Angelika and Wilhelm, who always accept and support me in all my doings and never fail to provide a place of trust for me.

Chapter 1: Introduction

This thesis revolves around the processing of proteins by proteases in the protease web, a proteolytic network that, in concert with other regulatory mechanisms, controls protein activity and thereby regulates a large number of biological functions. Proteins carry out most functions of living organisms by linking genotype to phenotype and are therefore tightly controlled in quantity, localization, and activity. The quantity of proteins is primarily regulated by gene expression but is also heavily influenced at the level of translation and degradation (Abreu et al., 2009; Payne, 2015; Vogel and Marcotte, 2012). Proteins are further regulated by more than 300 different types of post-translational modifications (PTMs) such as phosphorylation, acetylation, and processing by proteases (Mann and Jensen, 2003; Nørregaard Jensen, 2004; Witze et al., 2007; Doucet et al., 2008), which can also be interdependent (Kötzler and Withers, 2016). Proteoforms, the many protein species generated by PTMs from a single gene (Smith et al., 2013), differ in their interactions, localization, activity, or stability. Studies of PTMs are thus crucial for the understanding of biology and pathologies, which has remained elusive in many cases despite decades of research (Edwards et al., 2009).

This introduction outlines key concepts of one important PTM, the proteolytic processing of proteins by proteases. Previously considered mostly as protein degradation machinery, proteases fulfill a much larger set of roles in regulating biological systems through limited and specific proteolysis of substrates, called protein processing. In this introduction, I review long-established biological roles of proteases and how novel roles are continuously discovered with the use of contemporary high-throughput substrate screens.
I particularly focus on interactions between proteases in proteolytic cascades and how recent research has led to the hypothesis of the existence of a complex “protease web”, a network resulting from crosslinking of previously disconnected groups and cascades of proteases. This network, among other factors, is thought to be responsible for the failure of protease inhibitor drug developments and is the object of investigation for a large part of this thesis. I therefore end this introduction by outlining network modeling and prediction strategies available for analysis of biological networks, some of which are applied to the protease web in the results chapters of this thesis.

1.1 Proteases and their biochemical properties

Proteases are also called peptidases or proteinases (Rawlings et al., 2015) and catalyze the cleavage of peptide bonds in a process called proteolysis. Endopeptidases cleave inside a protein chain, generating two protein fragments. Exopeptidases cleave terminal residues of a protein and are consequently further separated into aminopeptidases and carboxypeptidases for cleavage of the N- and C-termini, respectively (Rawlings et al., 2015). Proteases are major regulatory enzymes because all proteins undergo proteolysis by proteases in protein degradation and many proteins are further processed by proteases into stable but altered protein fragments at some point in their life cycle. Proteases are encoded by a large set of genes with 460 members in human and 525 in mouse (Puente et al., 2003). In an attempt to capture this complexity and to facilitate biochemical studies of related enzymes, proteases have therefore traditionally been classified and studied in defined groups, often based on their biochemical properties.

A common classification separates proteases into seven classes (five of which are found in human and mouse) based on the structure of the active site responsible for peptide cleavage, which in all cases relies on a nucleophile to attack the carbonyl carbon of the peptide bond (Rawlings et al., 2015). In serine and threonine proteases the side-chain hydroxyl group, and in cysteine proteases the thiol group, is the nucleophile. In aspartic and in the recently discovered glutamic (Fujinaga et al., 2004) proteases a water molecule is coordinated between the respective side chains and acts as a nucleophile. Similarly, in metalloproteinases a metal ion is coordinated in the active site of the protease and a water molecule acts as the nucleophile. Finally, in the case of the self-cleaving Tsh protein precursor, asparagine acts as the nucleophile (Rawlings et al., 2011). Classes of proteases are recorded in the MEROPS database, the main database for the storage of protease-specific information, where protease species are identified and annotated across organisms based on their sequence similarity and biochemical properties (Rawlings et al., 2015). Annotated protease species are further classified into families based on the homology of their amino acid sequence around the active site. Protease families are further grouped into clans mostly based on their tertiary structure.

In addition to biochemical properties of proteases themselves, an important feature and basis of classification is their sequence specificity, the specificity for amino acids in the substrate that surround the cleavage site (Fuchs et al., 2013). Proteases with similar specificity are hypothesized to cleave similar substrates and specificity is therefore often used as a proxy for functional similarity.
Protease cleavage specificity is conferred by binding pockets in subsites of the protease close to the active site, which chemically favor binding and thus cleavage of peptides or proteins with specific amino acid side chains (Schechter and Berger, 1967). For example, many serine proteases such as trypsin cleave selectively after lysine or arginine whereas caspases generally cleave after aspartate. Specificity profiles of proteases are commonly visualized by sequence logos, plots similar to stacked histograms that show probabilities of finding amino acids (indicated by their one-letter code) at given positions in a sequence. One example of this representation is the iceLogo (Colaert et al., 2009), which compares the probability of each amino acid to an organism-specific background probability. Whereas the classification of proteases based on their specificity is useful, cleavage specificity can currently only be determined for a minority of proteases with sufficient numbers of identified substrates (Fuchs et al., 2013). In addition, protease specificity is in some cases influenced by exosites, substrate binding domains located outside the catalytic domain of the protease (Overall, 2002). Because they bind substrates independently of sequence specificity, these domains limit the use of protease specificity in identifying substrates and in specificity-based classification of proteases.

1.2 Biological functions of proteases

A more direct classification of proteases into distinct groups is based on their biological roles or functions, which are often studied in isolation. However, this classification is also less stable, since new biological roles of proteins are discovered periodically and functional annotations therefore change over time (Gillis and Pavlidis, 2013a).
Another difficulty with functional classifications is the high multifunctionality of some genes because it leads to overlaps and blurring between groups (Jeffery, 2003; Gillis and Pavlidis, 2011a; Chapple et al., 2015). Yet, functional classification is a common tool in biochemical literature to simplify biological complexity. In doing so, scientists frequently distinguish between intra- and extracellular roles of proteins, with about half of all human proteases found in either of these locations and a few proteases bound to the cellular membrane (Doucet et al., 2008). This distinction is generally useful but not without exceptions. Surprising subcellular re-localization of proteases continues to be discovered (Goulet et al., 2004; Kwan et al., 2004; Golubkov et al., 2005; Marchant et al., 2014), demonstrating uncertainty in current localization annotation. Likewise, (annotated) intracellular proteins are sometimes identified as extracellular protease substrates, indicating a role of transportation mechanisms other than conventional protein secretion (Butler and Overall, 2009a, 2009b). Additional roles of proteins, often found upon changes in localization, are also known as moonlighting, a term that more generally refers to multifunctionality of proteins that perform two or more functions under different circumstances (Jeffery, 2003). The above observations cast doubt on subcellular localization annotation of proteins. Nevertheless, subcellular localization plays a major role in separating proteins and the biological processes mediated by proteins in cells. Moonlighting roles have remained the exception and so I distinguish between intra- and extracellular roles in the following, where I outline commonly known roles of proteases in protein degradation and maturation as well as more recent examples of proteases cleaving and regulating mature proteins.

1.2.1 Proteases in protein degradation

Protein degradation by proteases is well-understood as a point of regulation in biology and failure of protein degradation processes is thought to contribute to a range of diseases (Goldberg, 2003; Ciechanover, 2005; Rubinsztein, 2006). Intracellular protein degradation is mediated by ubiquitination of proteins, which targets proteins to the proteasome. The proteasome is a large molecular complex containing multiple protease subunits that degrade proteins into small peptides. A variant of the proteasome responsible for antigen processing in immune responses is the immunoproteasome, a protein complex similar to the proteasome where certain proteasomal subunits are replaced by homologues (Morel et al., 2000). Degradation of extracellular and mitochondrial proteins is mainly mediated by cathepsins, proteases located in the lysosome that encounter their targets after fusion of the lysosome with endocytotic, phagocytotic, and autophagocytotic vesicles (Ciechanover, 2005; Luzio et al., 2007). Interestingly, however, cathepsins also carry out multiple additional intra- and extracellular roles such as antigen processing and bone remodeling and thereby exemplify the multifunctionality of proteases (Turk et al., 2001; Yasuda et al., 2005; Mohamed and Sloane, 2006). Indeed, it is important to note that the widely accepted and studied roles of proteases in protein degradation are only a part of the many biological tasks fulfilled by proteases. For example, and equally important, proteases play a major part in the regulation of protein maturation, summarized below.

1.2.2 Protease processing in protein maturation

In contrast to protein degradation, protein maturation relies on proteolytic processing to generate stable protein species.
Processing is a highly specific, efficient, limited and controlled process (López-Otín and Overall, 2002; Doucet et al., 2008), demonstrated in the removal of the N-terminal initiator methionine of proteins by type I and II methionine aminopeptidases (Addlagatta et al., 2005). Protease processing also regulates the activation of proteins such as proteases themselves, which are often synthesized as zymogens including an inhibitory pro-peptide and thus require proteolytic removal of this peptide to be activated (Khan and James, 1998). Removal of peptides by proteases further plays a role in subcellular localization of proteins. Secreted proteins often contain signal peptides that are proteolytically removed in the endoplasmic reticulum as part of the secretion process and similar targeting sequences are known for mitochondria and chloroplast localization (Schatz and Dobberstein, 1996; Paetzel et al., 2002; Rapoport, 2007; Rawlings et al., 2010). Proteases further regulate localization of proteins by shedding of membrane proteins such as transmembrane growth factors, membrane receptors, or adhesion molecules through proteolytic processing just after a transmembrane domain, thus releasing ectodomains of shed proteins into the media or the extracellular space (Arribas and Borroto, 2002; Hayashida et al., 2010). A biologically important but less pervasive example of protease processing in protein maturation is the generation of signaling peptides such as the angiotensin peptides, which control vasoconstriction and blood pressure and are strongly associated with cardiovascular conditions such as hypertension and heart failure (Welches et al., 1993; Turk, 2006). Multiple cleavages of the precursor angiotensinogen by different proteases give rise to different signaling peptides that strongly differ in their biological effect.
This example thereby demonstrates how protease processing can regulate proteins even after maturation to generate additional protein species with altered activity or localization, similar to other regulators such as kinases. The role of proteases as post-translational modifiers of protein activity beyond maturation is often underestimated, especially in comparison to kinases, but evidence for such processes is abundant as demonstrated in the following.

1.2.3 Protease processing in the regulation of mature proteins

Examples of proteases that cleave mature proteins to regulate localization, signaling and other functions are especially well studied for matrix metalloproteinases (MMPs): Originally believed to only degrade extracellular matrix (Nagase and Woessner, 1999), these enzymes carry out a range of additional functions by limited and controlled proteolysis of numerous substrates (Overall and Kleifeld, 2006; Butler and Overall, 2009b; Lange and Overall, 2013). One biologically important example of processing by MMPs is the cleavage of C-C motif chemokine 7 (CCL7, also referred to as monocyte chemoattractant protein-3, MCP-3) by gelatinase A (MMP2) (McQuibban et al., 2000). Proteolytic removal of only 4 N-terminal residues from CCL7 switches the activity of this chemokine from receptor agonist to antagonist. Calcium response and leukocyte cell migration, the consequences of full length CCL7 binding to CC-receptor-1 (CCR-1), -2, and -3, are inhibited by cleaved CCL7, which further antagonizes this response even for stimulation by other chemokines such as CCL2 and CCL3. This complex effect resulting from the cleavage of just a few amino acids demonstrates the biological impact of protease processing. Another example of complex regulatory effects mediated by a protease is the cleavage of Ankyrin repeat and SAM domain-containing protein 4B (ANKS4B or HARP) by MMP2.
In this intriguing example, the N-terminal cleavage fragment N-HARP increases mitogenesis when added to the culture medium of murine fibroblasts in combination with full length HARP, whereas the C-terminal fragment C-HARP has an antagonistic effect (Dean et al., 2007).

MMPs are indeed highly involved in the regulation of signaling molecules, including through cleavage of extracellular matrix proteins at defined positions (Mott and Werb, 2004). For example, MMP2 processing of subunit γ2 of the extracellular matrix component laminin (LAMC2) generates a shortened fragment of γ2, which induces cell migration during tissue remodeling and tumor invasion (Giannelli et al., 1997). Another biological role of precise proteolytic processing by MMPs is the shedding of cell surface proteins, for example galectin, a carbohydrate-binding protein found on the cell surface and involved in cell growth and differentiation (Ochieng et al., 1994). MMPs can also expose protein domains buried inside proteins. This was observed in the protease processing of fibronectin (FN1) by multiple MMPs (Fukai et al., 1995). Cleavage of FN1 exposes the central cell-binding domain, which strongly enhances chemotactic migration of fibroblasts in cell culture, a function that is absent in intact FN1 because the responsible domain is not accessible. Surprisingly, the processing roles of MMPs are not only extracellular. In a recent study of virus infection, it was found that MMP12 released by macrophages traffics inside cells to the nucleus, where it acts as a transcription factor regulating gene expression (Marchant et al., 2014). By regulating IκB expression inside cells MMP12 modifies IFN-α secretion, which is required for antiviral immunity. Interestingly, the same study found that MMP12 also proteolytically inactivates systemic IFN-α to prevent excessive immune response, indicating the importance of spatial and temporal elements in protease regulation.

The above examples of MMPs demonstrate the diverse roles carried out by proteases beyond protein degradation and maturation. Proteases regulate protein activity on a post-translational level similar to other post-translational regulators such as kinases and are indeed embedded in cell signaling networks by their interactions with kinases (López-Otín and Bond, 2008; López-Otín and Hunter, 2010) and glycosylation (Kötzler and Withers, 2016). Similar to kinase signals, proteolytic signals are also transferred in chains of signaling events, the protease cascades, which are described in the next section.

1.3 Protease cascades

Proteolytic cascades are a fundamental concept of protease biology. As a series of consecutive cleavage events, cascades are mostly thought of as one-directional pathways but are also modulated by additional regulatory steps and feedback loops. Cascades are generally started by the activation of initiator proteases that cleave further intermediate targets, which then, in turn, cleave executioner proteases responsible for cleavage and activation of biologically relevant proteins. Since one active protease can rapidly turn over a large number of substrates at each step, such chains of events lead to drastic signal amplification. Furthermore, protease cascades can be initiated by multiple signals resulting in the same downstream response and provide multiple points of regulation in the initiation and intermediate phase of the cascade. Whereas these principles are thought to apply to many groups of proteases, they are mostly studied in the three protease cascades described in the following: the caspases involved in apoptosis, complement factors in immunity, and coagulation factors involved in blood clotting.

1.3.1 Caspases in apoptosis

One well-understood example of an intracellular protease cascade is the caspase cascade involved in apoptosis (Stennicke and Salvesen, 1998; Thornberry and Lazebnik, 1998; Slee et al., 1999; Ashkenazi and Salvesen, 2014).
Caspases are structurally related cysteine proteases with a strict specificity for aspartyl bonds (MEROPS protease family C14; Rawlings et al., 2015). Apoptosis is a complex process triggered either by extracellular signals recognized by specialized death receptors (Fas receptor) or by intracellular damage sensors, which control the release of pro-apoptotic signals (in particular cytochrome c released from mitochondria). Both pathways converge in the dimerization and activation of initiator caspases-8, -9 or -10. Initiator caspases consequently cleave and thereby activate further downstream caspases, the executioner caspases-3 and -7, which in turn cleave a range of other protein substrates, thus leading to the controlled dismantling of the cell. Interestingly, apoptosis through caspases can also be initiated by granzymes, proteases secreted by cytotoxic T lymphocytes upon encountering infected or tumorigenic cells that can enter target cells to trigger apoptosis (Darmon et al., 1995), providing another mode of activation of this cascade.

1.3.2 Coagulation cascade

Another cascade of proteases is responsible for the formation of blood clots in wound healing (Davie and Ratnoff, 1964; Macfarlane, 1964; Adams and Bird, 2009). This cascade of coagulation factors becomes active upon tissue damage and ultimately leads to cleavage and activation of the protease thrombin (factor II). Thrombin cleaves fibrinogen and factor XIII. In addition to platelet activation, this leads to fibrin polymerization in a gel covalently cross-linked by factor XIIIa. Biochemically, the coagulation cascade was originally divided into two pathways. The intrinsic pathway requires a cascade of factors XII, XI, IX, VIII, and V activated upon contact with negatively charged surfaces, whereas the extrinsic pathway involves exposure of coagulation factor VII to tissue factor upon tissue injury.
These pathways were recently discovered to be more interlinked, with tissue factor playing a major role in the activation of all types of coagulation (Adams and Bird, 2009). In this newer model, coagulation is initiated by tissue factor and factor VIIa, which form the extrinsic factor tenase complex and cleave factor IX (originally part of the intrinsic pathway). Factor IX then forms the intrinsic factor tenase complex with factor VIIIa. In the amplification phase, both complexes cleave factor X, which forms a third complex with factor Va to cleave and activate thrombin, thus leading to clot formation. The important counterpart to blood clotting is clot resolution, also called fibrinolysis (Cesarman-Maus and Hajjar, 2005; Adams and Bird, 2009). In fibrinolysis, inactive circulating plasminogen (a zymogen) is cleaved and activated by other proteases such as tissue or urokinase-type plasminogen activators that are slowly released by surrounding cells in response to coagulation. Plasmin then itself cleaves tissue and urokinase-type plasminogen activators in a positive feedback loop and ultimately cleaves fibrin in clots.

1.3.3 Complement cascade

The complement system in immunity is a second extracellular cascade involving multiple proteases and more than 30 serum or membrane-bound proteins (Sim et al., 1979; Muller-Eberhard, 1988; Matsushita et al., 2000; Sarma and Ward, 2011; Ehrnthaller et al., 2011). The complement cascade is triggered by three main pathways, all of which lead to cleavage of the protein C5 into two fragments C5a and C5b. C5b then recruits additional factors to form the membrane attack complex, which disrupts the cell wall and thus results in osmotic lysis of parasitic cells. The classical pathway of the complement cascade is initiated by complement proteins binding to antibodies targeting infected cells, upon which a conformational change leads to autocatalytic activation of serine proteases C1r and C1s, which in turn cleave C4 and C2.
Cleavage products C4b and C2a form the C3 convertase C4b2a. Subsequent cleavage of C3 yields the C4b2a3b complex that is the C5 convertase. In the alternative pathway C3 hydrolyzes spontaneously to a C3b-like form named C3(H2O), which is rapidly inactivated by healthy cells but not by parasite cells. When attached to cell wall components, C3(H2O) can be stabilized to assemble the alternative C3 convertase C3bBbP. The resulting increased cleavage of C3 to C3a and C3b leads to the assembly of the C5 convertase complex C3bBbC3bP. Finally, in the lectin pathway complement activation occurs through binding of mannose by mannose-binding lectin (MBL), which circulates in blood as a complex with MBL-associated serine proteases (MASPs). Upon binding, MASP-1 and MASP-2 are thought to cleave C4 and C2 to create the C3 convertase C4bC2a, which then cleaves C3 to create the C5 convertase. In addition to forming the membrane attack complex, all three pathways result in the formation of cleavage products and cytokines C3a, C4a, and C5a, important signaling molecules in the immune response. In addition, other complement cleavage fragments such as C3b, C4b and C5b coat and opsonize target cells.

1.4 Regulation of protease activity by protease inhibitors

As demonstrated by the above examples, proteases and their cascades control various vital processes of living organisms. Aberrant protease activity can have devastating effects because of the rapid turnover of large numbers of substrates once a protease is activated. This is particularly relevant for cascades of proteases, where multiple cleavage steps lead to further amplification of the signal. Spurious initiation of these cascades is thus tightly controlled, in particular by specific inhibitors of activated proteases, which are major regulators of protease activity (Adams and Bird, 2009; Ehrnthaller et al., 2011; Ashkenazi and Salvesen, 2014). Protease inhibitors temporally and spatially confine protease activity.
For example, stefins are inhibitors that trap cysteine cathepsins that accidentally localize to the cytosol (Turk et al., 2001). Similarly, antithrombin and other protease inhibitors circulating in blood restrict coagulation activity to the site of injury by inhibiting diffused proteases (Adams and Bird, 2009).

Analogous to the classification of proteases, protease inhibitors are classified in clans and families based on their sequence and structural features (Rawlings et al., 2015). Inhibitors are fewer than proteases (159 annotated human inhibitors compared to 460 proteases in the Mammalian Degradome Database (Quesada et al., 2009)) and thus often inhibit multiple proteases. The class of proteases inhibited largely determines the inhibitory profile of an inhibitor: Kallistatins inhibit kallikreins, cystatins inhibit cathepsins, inhibitors of apoptosis inhibit caspases, and tissue inhibitors of metalloproteinases inhibit MMPs (Doucet et al., 2008; Rawlings et al., 2004). Whereas these classifications are useful in limiting the possible targets of an inhibitor, the physiologically relevant targets of many inhibitors are unknown, in particular for the so-called “orphan inhibitors”, which lack any known target. On the other hand, the mechanism of action of exemplary protease inhibitors is well studied: In many cases, the mechanism of inhibition is reversible in that the inhibitor and the protease are in equilibrium between bound and unbound form, where only the bound form inhibits protease activity. In a second type of inhibition mechanism called suicide inhibition, protease inhibitors are cleaved by a protease, leading to a conformational change in the inhibitor and inhibition of the protease. One large family of suicide inhibitors is called serpins (SERine Protease INhibitors). Serpins expose a reactive site loop (bait), which contains an amino acid sequence that matches the cleavage specificity of target proteases.
Upon cleavage, the serpin stays covalently bound to the protease and undergoes a rapid conformational change, which forces a conformational change in the protease, thus leading to loss of protease activity (Huntington et al., 2000; Toh et al., 2010). This mechanism requires the inhibitor to covalently bind the active site side chain of the protease and so serpins successfully inhibit serine and cysteine proteases but do not inhibit metalloproteinases. Metalloproteinases use a water molecule as a nucleophile, which does not tether the protease to the inhibitor as would be required for inhibition (Keller et al., 2013). Other cases of protease suicide inhibition are alpha-2-macroglobulin (A2M) and pregnancy zone protein (PZP). These inhibitors also display a reactive site loop, which is recognized and cleaved by a protease. Contrary to serpins, A2M and PZP engulf their target proteases upon cleavage and therefore can target proteases of any class, inhibiting protein cleavage but not the cleavage of smaller peptide substrates (Borth, 1992; Jensen and Stigbrand, 1992; Marrero et al., 2012).

In conclusion, protease inhibitors are important regulators of proteases and further add to the complexity of protease biology. Whereas the target proteases of many inhibitors are unknown, it is clear that protease action depends on a fine balance of activating and inhibiting factors, and protease cascades only go forward upon strong stimuli outweighing inhibitor activity. Consequently, aberrant regulation of proteases can have devastating effects for cells, which is demonstrated by protease-mediated pathologies as outlined in the following.

1.5 Protease diseases, drug targeting and failures

Proteases have been implicated in numerous pathologies and have consequently been considered as promising drug targets (Turk, 2006; Overall and Blobel, 2007; López-Otín and Bond, 2008; Drag and Salvesen, 2010).
Successful inhibitor drugs are being used in the clinic against coagulation factors in coagulopathies and bleeding disorders, against angiotensin converting enzyme (ACE) in hypertension, and against HIV protease. However, many more diseases are associated with proteases without any drugs to treat these pathologies. Beta-secretase (BACE) and other proteases are involved in the accumulation of amyloid beta protein and thus are promising drug targets in Alzheimer’s disease (Vassar, 2002; Nalivaeva et al., 2008). Kallikreins have central roles in a wide range of diseases including skin and neurological diseases as well as cancer (Prassas et al., 2015). Cathepsins have been implicated in osteoporosis, arthritis, asthma and cancer (Mohamed and Sloane, 2006; Vasiljeva et al., 2007). Caspases are crucial regulators of apoptosis and thus strongly implicated in cancer (Li and Yuan, 2008; Olsson and Zhivotovsky, 2011; Thornberry and Lazebnik, 1998). Finally, MMPs were targeted in cancer based on their role as matrix degrading enzymes and more recently also implicated in auto-inflammatory disorders (Overall and Kleifeld, 2006; Dufour and Overall, 2013).

As outlined, proteases offer a long list of potential therapeutic targets to treat diverse diseases. With the well-understood chemical mechanism of protease cleavage that can be exploited for drug development, these enzymes are promising targets for inhibitor drug development. However, despite these promising prospects, many factors and in particular the complexity of biological systems have impeded protease inhibitor drug development trials so that few drugs were successfully developed despite many attempts (Turk, 2006). One of the clearest examples of inhibitor drug failures was the development of chemical MMP inhibitors. Mainly considered as extracellular matrix degrading enzymes, MMPs were thought to facilitate cancer invasion and metastasis, the leading causes of cancer mortality in patients (Zucker et al., 2000).
Over 70 companies started MMP inhibitor developments but all clinical trials failed due to lack of efficacy or unexpected side effects (Overall and Kleifeld, 2006; Butler and Overall, 2009b; Dufour and Overall, 2013). Multiple factors were considered to explain drug failures including the design of clinical trials, selection of patients, drug dosing, and drug timing. Since the first inhibitors developed were not specific for individual MMPs, it was also unclear whether individual MMPs were responsible for side effects. Drug development failures sparked further research in the area and led to accumulating evidence of additional roles of MMPs, in particular in modulating the immune response through cleavage of signaling molecules, which are described in section 1.2.3. Biological roles of MMPs and other proteases are mediated by the cleavage of additional substrates. It is now clear that many proteases carry out a range of biological roles, which make them targets in some but anti-targets in other diseases (Dufour and Overall, 2013). Determination of drug suitability of a protease in a specific disease is therefore only possible after identification of its substrates, the substrate repertoire (López-Otín and Overall, 2002). Experiments to identify protease substrates thus are a major step in studying proteases and in defining interactions among them. I therefore outline major methodologies of substrate identification in the following.

1.6 Biochemical methods for the identification of protease substrates

Biochemical identification of cleaved proteins is difficult, since full-length and cleaved proteoforms often have very similar chemical properties. This problem is aggravated for smaller truncations of a few amino acids. Yet, even small truncations can have a strong biological effect (McQuibban et al., 2000).
This problem applies to gel electrophoresis of purified proteins, which is used to identify proteins by their differences in size but is limited in resolution. In addition, this method does not yield precise cleavage site information. Both limitations also apply to Western blots, which can visualize individual proteins in complex mixtures using a first antibody directed to the protein of interest and a second labeling antibody that stains the respective band, for example through fluorescence. Using Western blots, an additional challenge is the selection of the antibody, which can fail to detect a protein if it is directed against an epitope missing in the cleaved protein. A biochemical method to precisely identify the N-terminal amino acids of a peptide is Edman degradation (Edman and Begg, 1967). In this procedure, N-terminal amino acids are cleaved off the peptide one by one with the use of phenyl isothiocyanate and then individually analyzed by thin layer or reversed phase chromatography. However, the length of protein sequences that can be analyzed by Edman degradation is limited to about 30 amino acids by the yield of individual steps of this procedure.

A more suitable method to determine protease substrates is mass spectrometry (MS), a precise and parallelizable method that can identify protease cleavage sites in multiple peptides in the same experiment. MS is extensively used in proteomics and encompasses a range of instruments, often combined to suit a task at hand. Generally, analytes are first charged and transferred into the gas phase in what is called the ion source. Ionization of biological molecules is commonly performed from solid state using matrix-assisted laser desorption ionization (MALDI) or from liquid state with electrospray ionization (ESI) (Aebersold and Mann, 2003). Ionized analytes are then transferred into the mass spectrometer by an electric field, where the mass to charge ratio is determined by a mass analyzer.
Common examples are Orbitrap mass analyzers, where the mass to charge ratio of ionized molecules is determined based on their frequency of oscillation in an electric field, or time of flight mass analyzers, where analytes are separated based on their velocity upon acceleration by an electric pulse and quantified by a detector plate (Marshall and Hendrickson, 2008). Basic MS identifies analyte mass to charge ratio without structural information. However, structural information can be determined by fragmentation of the analyte ion (for example by collision-induced dissociation in a collision cell). Resulting fragment ions depend on the structure of the molecule and so allow inference of amino acid sequence of peptides based on their fragmentation patterns.

MS methods described above can be used to characterize individual proteins or simple mixtures thereof. In a complex biological sample of proteins (proteome), MS generally lacks resolution and sensitivity to directly identify proteins (Aebersold and Mann, 2003; Catherman et al., 2014). Instead, a bottom up strategy is employed, where protein samples are first digested to peptides. Trypsin is widely used for protein digestion because of its high specificity for lysine and arginine residues. Trypsin generates so-called tryptic peptides of suitable length that are easily predicted based on the protein sequence, readily solubilized and separated, and generate favorable fragmentation patterns (Aebersold and Mann, 2003; Chait, 2006; Kelleher et al., 2014). To simplify the highly complex peptide mixture resulting from a tryptic digest, peptides are separated by gel electrophoresis or liquid chromatography prior to analysis in the mass spectrometer.
Fragmentation spectra of peptides are then compared to experimentally derived spectral libraries or theoretical spectral libraries to identify peptide sequences, which are ultimately matched to known protein sequences as inferred from genomics to identify proteins. The described bottom up approach is extensively employed in proteomics research, yet sensitivity and reproducibility are limited by the fraction of tryptic peptides reproducibly detected, a challenge tackled by newer approaches (Gillet et al., 2012), and by the inference of proteins from peptides. Due to these inherent limitations, interest persists in the application of top down proteomics, where proteins are analyzed and fragmented in the mass spectrometer without prior enzymatic degradation (Catherman et al., 2014; Kelleher et al., 2014). However, top down proteomics have thus far failed to achieve the same coverage as bottom up approaches.

Despite their wide application and comprehensive coverage of proteins, conventional proteomics fail to cover the full diversity of proteomes including proteoforms derived from PTMs. This has led to the development of specialized workflows to specifically target modifications of interest (Nørregaard Jensen, 2004). In the following, I describe methodologies developed to identify protease cleavage sites. These methods generally focus on the chemical enrichment of N- or C-terminal peptides of proteins (Kleifeld et al., 2010) and are thus referred to as terminomics.

1.6.1 Terminomics: targeted methods to identify N- and C-termini

In addition to identifying protease substrates, terminomics also improve MS detection of terminal peptides and increase dynamic range by reducing complexity in the sample (Kleifeld et al., 2011) by focusing on significant peptides. Consequently, there has been steady interest in the development of these promising biochemical tools.
An early terminomics approach was the positive enrichment of N-terminal peptides by ligation with biotin using the engineered enzyme subtiligase, which is specific for the modification of N-terminal amines and does not modify primary amines on side chains (lysines) (Mahrus et al., 2008). Biotinylated proteins are then digested with trypsin to generate unmodified internal peptides as well as biotinylated N-terminal peptides, which are positively enriched on an avidin affinity medium. The captured N-terminal peptides are then readily cleaved off the column, identified by MS, and ultimately aligned with protein sequences (often translated from genome sequences) to identify protein termini.

Application of subtiligase-based enrichment is limited by the potential bias of subtiligase specificity and because this method does not identify modified N-termini (Kleifeld et al., 2011). A second approach in terminomics, combined fractional diagonal chromatography (COFRADIC) (Gevaert et al., 2003), avoids these limitations: In this method, N-termini are first blocked with acetylation on the full protein followed by trypsin cleavage. This results in two populations of peptides: One population with blocked N-termini, the original N-termini of proteins, and another with unblocked N-termini, internal peptides resulting from trypsin cleavage with free N-terminal primary amines that are chemically reactive. These two populations are then separated in two steps of chromatography separation. After a first step of reverse phase chromatography separation, internal peptides with free N-terminus are modified with 2,4,6-trinitrobenzenesulphonic acid to cause a hydrophobic shift (delay) in their elution profile in a second reverse phase chromatography separation.
This protocol has been modified to contain a first step of strong cation exchange chromatography for additional enrichment of N-terminal peptides (Staes et al., 2008) and to identify N- and C-termini simultaneously (Van Damme et al., 2010). Although COFRADIC is an important technique to comprehensively assay the terminome, the multiple steps of chromatography separation and the resulting number of MS runs of fractions make it an expensive and time-consuming method.

A faster and cheaper but equally sensitive MS based approach is the terminal amine isotopic labeling of substrates (TAILS) (Kleifeld et al., 2010). In N-TAILS, blocked terminal peptides and free internal peptides are not separated by chromatography. Instead, reactive N-termini of unblocked peptides are bound to a high-molecular-weight polymer, which is readily extracted by ultrafiltration, negatively enriching N-terminal peptides for MS analysis. TAILS protocols have been developed for the identification of N-terminal (N-TAILS (Kleifeld et al., 2010)) and C-terminal peptides (C-TAILS (Schilling et al., 2010)). N-TAILS analyses are currently simpler and more efficient than C-TAILS because C-TAILS involves blocking of primary amines before and after trypsin digest to avoid cross-reactivity and because of low yields of C-TAILS due to low reactivity of the carboxylic acid at the protein C-terminus. N-TAILS was further developed to use charge reversal and strong cation exchange chromatography to separate terminal and internal peptides (Lai et al., 2015) and was adapted for the analysis of time series, for example in wound healing (Schlage et al., 2014; Sabino et al., 2015).
1.7 Terminomics as atypical tools in the identification of proteins

Whereas terminomics were originally developed for the identification of protease substrates, an additional interest in applying targeted proteomics is to aid in the cataloguing of the full repertoire of human proteins, which is mostly attempted using conventional proteomics techniques (Uhlen et al., 2010; Kim et al., 2014; Wilhelm et al., 2014). One particular effort of this kind is the Human Proteome Project, which aims to identify and characterize at least one protein product of each human gene, in particular the so-called “missing proteins” (Lane et al., 2014). Missing proteins are proteins that have evaded characterization by biochemical methods and could possibly fill in gaps in our understanding of biological systems and disease processes. Many reasons are hypothesized to contribute to the difficulty in identifying such proteins, for example expression in unusual organs or cell types or in early developmental stages, under specific conditions, or in low quantity close to the limit of detection and thus also hidden by the limited dynamic range of contemporary methods. Terminomics can identify proteins at a high dynamic range (Lange et al., 2014) by focusing on terminal peptides and thus reducing sample complexity, and is therefore well-suited to identify missing proteins. Some missing proteins might also lack tryptic peptides with favorable peptide ionization and fragmentation properties, thus rendering them opaque to conventional shotgun proteomics. Semi-tryptic (protease cleaved) peptides of these proteins differ from their tryptic counterparts in mass, charge, and amino acid distribution and thus might be identifiable by terminomics. Applied to human dental pulp, an underexplored tissue, terminomics was thus able to identify 17 missing protein candidates (Eckhard et al., 2015).
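The distinction between tryptic and semi-tryptic peptides can be illustrated with a minimal sketch. It assumes that trypsin cleaves after every lysine (K) and arginine (R), ignores missed cleavages and other refinements of trypsin specificity, and considers only the first occurrence of a peptide in the protein. The protein sequence and all function names are hypothetical and serve illustration only.

```python
def tryptic_sites(protein):
    """Return positions directly after each K or R (assumed trypsin cut sites)."""
    # A cut "at position i" means the peptide bond before residue index i.
    return {i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"}


def classify_peptide(protein, peptide):
    """Classify a peptide as tryptic, semi-tryptic, or non-tryptic."""
    start = protein.find(peptide)  # first occurrence only (simplification)
    if start == -1:
        return "not found"
    end = start + len(peptide)
    cuts = tryptic_sites(protein)
    # A terminus is explained by trypsin if it is a protein end or a cut site.
    n_explained = start == 0 or start in cuts
    c_explained = end == len(protein) or end in cuts
    if n_explained and c_explained:
        return "tryptic"
    if n_explained or c_explained:
        return "semi-tryptic"
    return "non-tryptic"


# Hypothetical protein sequence, for illustration only.
protein = "MKTAYIAKQRQISFVK"

print(classify_peptide(protein, "TAYIAK"))  # both ends follow K or R
print(classify_peptide(protein, "YIAK"))    # N-terminus not explained by trypsin
```

In this toy case, the peptide YIAK has a tryptic C-terminus but an N-terminus that trypsin cannot explain, which is the signature of a terminus generated by another process such as protease cleavage.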
1.7.1 Extent and genesis of N-terminal truncations

In the search for missing proteins and to comprehensively characterize proteomes, terminomics methodologies have been applied to a range of tissues and conditions, often without focus on a specific protease (Van Damme et al., 2005; Mahrus et al., 2008; Vögtle et al., 2009; Prudova et al., 2014; Lange et al., 2014; Eckhard et al., 2015). Interestingly, these experiments have demonstrated a surprisingly large extent of N-terminal truncations in various tissues. It was found that >50% of N-terminal peptides consistently map inside the mature protein sequence as inferred from genomics, with 61% in murine skin (Keller et al., 2013), 64% in red blood cells (Lange et al., 2014), 77% in platelets (Prudova et al., 2014), and 78% in dental pulp (Eckhard et al., 2015). Whereas these pervasive N-terminal truncations are potentially of high biological relevance because they can shape activity and localization of proteins by removing entire protein domains, their biological impact is uncertain because biological context is lacking. It could thus be hypothesized that these observations are artifacts resulting from low abundance protein species without physiological relevance, a point that is mitigated by the fact that these observed protein species were found in sufficient quantity to be identified by MS. Nevertheless, additional biological insights are required to interpret the observed truncations and one important question concerns their genesis.

In addition to protease cleavage, protein truncations can also result from two additional processes, alternative translation and alternative splicing. Alternative translation occurs when secondary structures of an mRNA mediate internal ribosome entry and thus initiate translation at sites other than the first AUG codon (Wan and Qian, 2014).
These internal ribosome entry sites (IRES) can be found up- or downstream of the canonical translation initiation site (TIS), and result in in-frame or frame-shifted translation at AUG or non-AUG sites. One example of alternative translation, where different translation start sites result in proteoforms with altered N-termini, is fibroblast growth factor 2 (FGF2) (Vagner et al., 1995). Whereas the longer protein species of this gene are localized in the nucleus and can induce cell immortalization, the shorter ones are localized in the cytosol and can induce cell transformation. Alternative translation start sites are biochemically identified using global translation initiation sequencing (Lee et al., 2012), where ribosomes are fixed on the translation start site using a translation inhibitor such as lactimidomycin followed by degradation of mRNA and sequencing of the ribosome protected fragments. Alternative start sites can be downloaded from databases such as TisDB (Wan and Qian, 2014), which contains 6991 translation initiation sites from 4961 human genes. Alternative translation start sites have been mapped to protein N-termini in individual experiments but no global resource was available (Damme et al., 2014).

As with alternative translation, alternative splicing is another process that can generate truncated protein species. Splicing generates non-canonical N-termini if the first exon of a gene is removed or shortened. It can also result in shortened C-termini if the last exon is not used or shortened. An example of C-terminal truncation by alternative splicing is HLA class II histocompatibility antigen DM beta chain (HLA-DMB) (Modrek et al., 2001). The canonical form of HLA-DMB contains a C-terminal transmembrane domain and lysosomal targeting signal, resulting in the anchorage of the protein in early lysosomes. These two elements can be removed by alternative splicing, thus altering localization of the protein.
To annotate such protein termini, transcripts resulting from alternative splicing events can be obtained from public sequence databases such as ENSEMBL, which are based on expressed sequence tags or other sequencing methods (Flicek et al., 2014). After in silico translation of the transcript to the amino acid sequence, it can be aligned to the protein sequence to infer protein termini resulting from alternative splicing. A few protein isoforms generated by alternative splicing are annotated in protein databases such as UniProt (The UniProt Consortium, 2014). However, these annotations do not comprehensively capture all generated protein species.

In conclusion, multiple biological processes can generate the protein truncations identified in comprehensive MS-based experiments. With the large number of N-terminal truncations observed in recent surveys of various tissues and their regulatory power by removing entire protein domains, truncated protein species can present possibilities for novel discovery of biological pathways, biomarkers, and disease treatments. It is therefore important to further study the biological processes generating protein truncations and to identify truncations in a process-specific manner. In the following, I outline protease substrate identification experiments, which contribute both to research on protein termini and to the study of biological roles of a protease and the anticipation of protease drug targeting effects.

1.8 Protease-centered substrate identification experiments

In protease substrate identification, terminomics techniques are generally applied in a multiplexed assay comparing samples with and without protease using isotopic labels such as dimethylation. For example, a prominent and successful strategy of in vitro substrate discovery by terminomics is to compare a given proteome (secretome of a cell line) with another sample of the same proteome incubated with a purified and activated protease of interest.
Cleavage sites by the protease are then indicated by semi-tryptic peptides (truncated tryptic peptides) that are found exclusively or in elevated amount in the protease-treated sample (Prudova et al., 2010). An improvement of this method is the time-resolved observation of cleavage fragments by assaying multiple time-points of protease incubation (Schlage et al., 2014). Whereas this requires higher multiplexing, it makes it possible to follow kinetic profiles of protease cleavage to increase confidence in the identification of substrates. A major point of critique of such in vitro experiments is the uncertainty of the physiological relevance of identified substrates. Subcellular localization, various binding factors, and PTMs can cause cleavage sites and substrates in vivo to differ from those observed in vitro. This motivated the application of terminomics to the identification of protease substrates by comparing substrate cleavage profiles of physiological samples with normal and abrogated protease activity, for example by comparing wild-type and knock-out murine samples (Keller et al., 2013) or human individuals with and without mutated alleles of a protease gene (Klein et al., 2015). Murine studies can be combined with the analysis of a particular phenotype, for example in the analysis of MMP2 cleavage sites in inflammatory murine skin, where inflammation was induced in samples of MMP2 knock-out and wild-type mice and compared to controls in a four-plex experiment (Keller et al., 2013). Whereas in vivo substrate identification assays are useful in discovering physiologically relevant substrates, they are far more difficult to interpret. Biological systems can react to modifications (for example protease knock-out) in unpredictable ways, for example by modifying gene expression of proteases and their inhibitors (Krüger, 2009).
In addition, proteases can influence the cleavage of indirect substrates by cleaving downstream proteases or their inhibitors (Overall and Dean, 2006; Doucet et al., 2008). These complex signaling effects in the protease web impede direct assignment of protease substrates based on in vivo experiments, requiring laborious follow-up experiments. Yet, protease interactions are poorly characterized and evidence for the protease web only existed anecdotally, as outlined in the following.

1.9 Protease networks and cross talks in the protease web

Individual interactions between proteases have been characterized for a long time but have not been analyzed comprehensively. In principle, many proteases require interaction with other proteases for activation if they are synthesized as zymogens that require proteolytic removal of a propeptide for activation (Kassell and Kay, 1973; Khan and James, 1998). Cleavage of zymogens indeed controls major steps in all protease cascades described in section 1.3 (Davie and Ratnoff, 1964; Macfarlane, 1964; Muller-Eberhard, 1988; Thornberry and Lazebnik, 1998). These protease cascades were traditionally considered to function in isolation but more recent examples of cross talk between groups of proteases, as well as the failure of protease inhibitor trials, eventually led to the hypothesis of more pervasive interactions between proteases in the protease web (Overall and Dean, 2006). Additional network-like interactions between proteases were also identified, for example between kallikreins (Beaufort et al., 2010). Consequently, attempts were made to capture this network in manually assembled cancer subnetworks of the protease web found in extensive literature reviews (Doucet et al., 2008; Butler and Overall, 2009b; Mason and Joyce, 2011), yet these attempts did not capture the global extent of protease interactions.

Examples of cross talk between protease groups and cascades are continuously identified.
An example linking two important protease cascades is the cleavage of C5 (complement system) by thrombin (coagulation) (Krisinger et al., 2012). An important type of cross talk between proteases is the cleavage of protease inhibitors by proteases to release and amplify protease signal of otherwise inhibited proteases. For example, this has been observed in the cleavage of tissue factor pathway inhibitor by MMPs (Belaaouaj et al., 2000). Another example is the MMP-mediated cleavage of serpins in the exposed bait region, which inactivates these inhibitors (Desrochers et al., 1991; Mast et al., 1991a; Keller et al., 2013). Despite individual insights, the structure and the extent of this regulatory network remain unclear because a comprehensive characterization is lacking. Protease interactions must be taken into account when identifying protease substrates in vivo in order to distinguish direct from indirect substrates. They thereby complicate the elucidation of biological functions of proteases and, in consequence, protease inhibitor drug discovery. Without knowledge of direct and indirect (downstream) targets of a protease, it is impossible to predict effects of protease drug targeting in vivo.

In this thesis, I aim to characterize interactions of proteases in the protease web in order to gain insights into the extent and topology of this network and to facilitate interpretation of results of protease-centered biochemical experiments in the context of complicated protease biology. I therefore outline relevant computational approaches applied to the analysis and prediction of other complex biological networks in the following.

1.10 Network modeling of biological networks

One avenue to gain insight into the working of complex biological networks is graph modeling, an approach applied to various biological systems since the emergence of systems biology and large datasets of interactions between biological entities.
A graph is a mathematical representation of interactions that consists of nodes, the interacting elements, and edges, the connections between the elements. For example, in protein-protein interaction networks (Stelzl et al., 2005) proteins are nodes and two nodes are connected by an edge if they were found to interact in an interaction screen such as yeast-2-hybrid or co-immunoprecipitation. In metabolic networks, nodes frequently represent chemical compounds that are connected if they participate in the same metabolic reaction (Jeong et al., 2000). In coexpression networks (Stuart et al., 2003), nodes are genes that are connected by an edge if they are coexpressed (coexpression is discussed in detail in section 1.11.1).

A range of modeling approaches is available when analyzing networks (as graphs). Edges can either be binary (representing absence or presence of a connection) or weighted (to represent strength of a connection, for example strength of coexpression). Edges can further be directed to distinguish between source and target of an interaction (metabolic reactions), or undirected without defining source and target (protein-protein interactions). Once an appropriate graph representation of a given network is found, topological properties of the graph can be investigated.

1.10.1 Topological network analysis

Graph topology analysis is a promising avenue to gain insights into the organizational principles of complex networks and has been applied to a range of biological networks. One attribute of biological networks suggested by topological analysis is their “scale-freeness” (Barabási and Albert, 1999), which refers to the observation that few nodes with many connections (“hubs”) connect many nodes with few connections. Scale-freeness is often assessed by plotting the node degree distribution, the number of connections per node, which should follow a power law function in scale-free networks.
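As a minimal sketch of these concepts, an undirected, binary graph can be stored as an edge list from which node degrees and the degree distribution are counted. The node names and edges below are hypothetical, and the toy network is far too small to assess a power law; the sketch only shows what is being counted.

```python
from collections import Counter

# Hypothetical undirected, binary edge list (illustrative names only).
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]


def node_degrees(edge_list):
    """Count the number of connections (degree) of every node."""
    degree = Counter()
    for source, target in edge_list:
        # In an undirected graph, an edge adds one to both end points.
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def degree_distribution(edge_list):
    """Map each degree k to the number of nodes that have degree k."""
    return dict(Counter(node_degrees(edge_list).values()))


print(node_degrees(edges))
print(degree_distribution(edges))
```

For a directed graph, in-degree and out-degree would be counted separately, and for a weighted graph the edge weights could be summed instead of counted.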
Scale-freeness was thought to contribute to the error tolerance of networks (random failure of nodes would mostly affect nodes with low degree and not affect network stability) and to their attack sensitivity (targeted attacks affect hubs and thus lead to fragmentation of the network) (Albert et al., 2000). Another proposed attribute of biological networks is their small world character, which was especially studied in metabolic networks (Jeong et al., 2000; Wagner and Fell, 2001). Small world architecture is found by measuring the shortest paths between all pairs of nodes and indicates high connectivity and cross talk between elements of a network. Network analysis is also used to identify smaller structural elements, so-called network motifs (Milo et al., 2002; Shen-Orr et al., 2002). Network motifs are small patterns of a few nodes, one example being a feed-forward loop in a transcription factor network. Enrichment of network motifs in a given network can be quantified by comparing motif occurrence in a real network to random networks with similar properties. Whereas topological findings have led to intriguing hypotheses on the large-scale functioning of biological networks, they have been put in question by follow-up research (Arita, 2004; Lima-Mendez and Helden, 2009). For example, the small world effect of metabolic networks was driven by the incorporation of so-called currency metabolites (for example water and ATP), which considerably shortened the overall path length without representing meaningful mass transfer in the network. In addition, it is unclear to which extent observed topological features result from data biases and incompleteness, which particularly affect protein interaction networks (Gillis et al., 2014). Such controversies have demonstrated that results from topological analysis are highly sensitive to data biases and modeling choices.
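The sensitivity of path-based measures to modeling choices can be illustrated with a toy example. The sketch below computes the mean shortest path length between a set of hypothetical metabolites by breadth-first search, once for a linear chain and once after adding a single hypothetical "currency" node connected to all of them. All names are illustrative and the example makes no claim about any real metabolic network.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edge_list):
    """Build an undirected adjacency map from an edge list."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns None if the target is unreachable."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


def mean_path_length(edge_list, nodes):
    """Mean shortest path length over all reachable pairs of the given nodes."""
    adjacency = build_adjacency(edge_list)
    lengths = [shortest_path_length(adjacency, u, v)
               for u, v in combinations(nodes, 2)]
    reachable = [d for d in lengths if d is not None]
    return sum(reachable) / len(reachable)


# Hypothetical metabolites connected in a linear chain of reactions.
metabolites = ["m1", "m2", "m3", "m4", "m5"]
chain = [("m1", "m2"), ("m2", "m3"), ("m3", "m4"), ("m4", "m5")]
# The same chain with one "currency" node linked to every metabolite.
with_currency = chain + [("currency", m) for m in metabolites]

print(mean_path_length(chain, metabolites))
print(mean_path_length(with_currency, metabolites))
```

In this toy case the single highly connected node reduces the mean path length between metabolites from 2.0 to 1.6, although no new route of mass transfer between them was added.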
However, such controversies have been difficult to resolve because hypotheses on organizational principles of biological networks based on topological analyses often do not propose definite biochemical validation experiments.

1.10.2 Biochemically testable predictions based on network topology

Biochemically testable applications of biological networks are predictions of attributes of individual entities of the network. For example, it was shown that yeast protein hubs in a protein interaction network are more likely to be essential proteins than other proteins (Jeong et al., 2001). Another application is the prediction of pathways in networks (Vinayagam et al., 2011). Pathways are chains of interactions, such as protease cascades, leading from an input signal through multiple steps of graph traversal to an output signal. Predicted pathways can be validated biochemically by perturbing elements of the pathway and measuring outcome.

Finally, a common application of networks to generate detailed and testable biochemical hypotheses is the prediction of gene function, where functions of genes are predicted based on the functions of their neighbors in what is called a “guilt-by-association” (GBA) approach (Uetz et al., 2000; Mostafavi et al., 2008). GBA-based predictions are considered promising in combining complex biological data to predict unknown functions of genes, thus accelerating biological discovery. Numerous algorithms and datasets were developed to predict gene function, and were evaluated in a critical assessment of protein function annotation (CAFA) experiment (Radivojac et al., 2013). Whereas such approaches have reported success in predicting protein functions, biochemical validation success has been limited (Pavlidis and Gillis, 2013). It was discovered that computationally evaluated prediction performance is often driven by gene multifunctionality, the number of functions carried out by a gene (Gillis and Pavlidis, 2011a; Pavlidis and Gillis, 2012).
Multifunctionality biases predictions, where additional functions tend to be predicted for highly multifunctional genes because they also tend to have many interactions in networks. Such predictions are accurate in computational prediction evaluations such as cross-validation, where known functions of genes are first hidden and the algorithm is then evaluated on how well it can recover hidden functions based on GBA. Predictions can also be accurate in the prediction of novel functions because multifunctional genes indeed tend to carry out additional predicted biological functions. However, predicted functions of multifunctional genes are often not useful for novel biological discovery because they do not capture specific functions of these genes. To illustrate this effect, it was demonstrated that, without predicting genes for a specific function or relying on any biological network, a single list of genes ranked by multifunctionality scores highly in the prediction of diverse functions (Gillis and Pavlidis, 2011a).

A second danger in computational evaluations of function prediction is information retrieval of already known (published) functions instead of de novo predictions. Information retrieval can occur if a predicted (hidden) function and an interaction (the input of prediction) are the result of the same biological experiment and thus not independent. In cross-validation, the function will be correctly recovered based on the interaction without demonstrating the ability of de novo prediction of functions, which is the aim of prediction algorithms (Pavlidis and Gillis, 2012). These sources of overestimation of prediction performance have demonstrated that computational evaluation of prediction performance can often fail and that biochemical validation is crucial to accurately measure usefulness of predictions.
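The difference between a network-based GBA score and a plain multifunctionality ranking can be sketched on toy data. The neighbor-voting score used here (the fraction of a gene's neighbors annotated with a function) is one simple form of GBA chosen for illustration, and multifunctionality is taken as the plain count of annotated functions. Gene names, edges, and annotations are hypothetical.

```python
# Hypothetical toy network: each gene maps to the set of its neighbors.
network = {
    "g1": {"g2", "g3", "g4"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2"},
    "g4": {"g1"},
}

# Hypothetical annotations: g1 is multifunctional, g4 is uncharacterized.
annotations = {
    "g1": {"apoptosis", "coagulation", "inflammation"},
    "g2": {"apoptosis"},
    "g3": {"apoptosis"},
    "g4": set(),
}


def neighbor_vote(gene, function):
    """GBA score: fraction of the gene's neighbors annotated with the function."""
    neighbors = network[gene]
    if not neighbors:
        return 0.0
    hits = sum(function in annotations[n] for n in neighbors)
    return hits / len(neighbors)


def multifunctionality_ranking():
    """Rank genes by number of annotated functions, ignoring the network."""
    return sorted(annotations, key=lambda g: len(annotations[g]), reverse=True)


# Hiding the annotation of g2, its neighbors would still recover it.
print(neighbor_vote("g2", "apoptosis"))
# g4 inherits every function of its single multifunctional neighbor g1.
print(neighbor_vote("g4", "coagulation"))
print(multifunctionality_ranking())
```

In this toy case the uncharacterized gene g4 receives the maximum score for every function of g1, and the function-independent ranking places g1 first for any function, which mirrors the two effects described above.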
Discovered in the function prediction task, these points also apply to other prediction efforts, for example the prediction of protein interactions, which is applied in this thesis to predict interactions in the protease web and is described in the following.

1.11 Prediction of protein interactions

An approach to gain insight into biological networks is the prediction of interactions themselves. Predictions of physical interactions of proteins are a useful complement to biochemical experiments, which are often limited by false positives and context-dependent coverage (von Mering et al., 2002; Braun et al., 2009) so that the full space of protein interactions remains unexplored (Lees et al., 2011). Physical protein interaction data stem from low-throughput biochemical experiments, for example identification of proteins co-purified using gel electrophoresis, and high-throughput approaches such as yeast two-hybrid or co-immunoprecipitation coupled to mass spectrometry. They are aggregated in a variety of databases, such as HIPPIE (Schaefer et al., 2012), BioGRID (Chatr-Aryamontri et al., 2015) and STRING (Szklarczyk et al., 2015). In predictions of protein-protein interactions, proteins are linked together on the basis of shared features such as patterns of conservation, expression, or annotations (Jansen et al., 2003; Bhardwaj and Lu, 2005; Rhodes et al., 2005; Franceschini et al., 2012). These features can be viewed as a network themselves, so that interactions in one network are predicted from interactions in a second network with the same nodes (proteins) but different edges. In contrast to such features that are based on complementary data, another type of feature is based on the topology of the predicted network itself. For example, interactions can be predicted between proteins of the same cluster (Goldberg and Roth, 2003).
In general, multiple features of interactions (described below) are combined in a classifier, which predicts the absence or presence of an interaction in unknown examples. Constructing such a classifier has several phases. First, the predictive power of each feature is estimated in a process called training on a mixture of known interactions (positive examples) and non-interactions (negative examples) as training data (Lu et al., 2005; Liu and Chen, 2012). Whereas true positives are readily retrieved from biological databases, the definition of true negatives is more intricate. Common approaches use randomly chosen protein pairs (based on the fact that true interactions are a small subset of all possible interactions), proteins localized to different compartments (Braun et al., 2009), or proteins that are at a large distance in a network of current data (Liu et al., 2012). Algorithms then attempt to optimize the separation of positives and negatives. Taking coexpression as an example feature, it is not assumed that coexpression implies physical interaction. Instead, the strength of coexpression of known positive pairs is compared to the coexpression of negative pairs. The better the feature separates the two groups, the stronger the weight for the feature in the classifier. Features that do not separate positive from negative pairs are not added to the classifier. The trained classifier is then evaluated on unseen test data with further known positive and negative examples, a process that can be repeated in cross-validation. Such evaluations are often limited because independence of testing and training sets cannot be guaranteed and because the imbalance of positive and negative examples in a “real world” situation often further decreases performance (Lees et al., 2011). Therefore, to finally judge the success of a prediction method, novel interactions need to be tested in biochemical experiments.
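The weighting idea described above can be made concrete with a small sketch. It is not the method of any of the cited studies: it simply assumes that separation between positive and negative pairs is measured by the area under the ROC curve (AUROC), rescaled to a weight between -1 and 1, and that features below a hypothetical cutoff (`min_separation`) are dropped. All feature names and values are invented for illustration.

```python
def auroc(pos_values, neg_values):
    """Probability that a random positive pair scores higher than a random negative pair."""
    wins = 0.0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_values) * len(neg_values))


def feature_weights(features, min_separation=0.1):
    """Map feature name -> weight in [-1, 1]; weakly separating features are left out."""
    weights = {}
    for name, (pos_values, neg_values) in features.items():
        separation = 2.0 * auroc(pos_values, neg_values) - 1.0
        if abs(separation) >= min_separation:
            weights[name] = separation
    return weights


def score_pair(pair_features, weights):
    """Weighted sum of the retained features for one candidate protein pair."""
    return sum(weights[name] * pair_features[name] for name in weights)


# Toy training data: (values for known interacting pairs, values for negative pairs).
training = {
    "coexpression": ([0.8, 0.6, 0.7], [0.1, 0.3, 0.2]),
    "uninformative": ([0.2, 0.8], [0.8, 0.2]),
}
weights = feature_weights(training)
candidate_score = score_pair({"coexpression": 0.5, "uninformative": 0.9}, weights)
```

In this toy case only the coexpression feature is retained, so the score of a candidate pair depends on coexpression alone. A real classifier would additionally be evaluated on held-out pairs, as described above.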
Structured evaluations of prediction efficacy, which exist for function prediction (Radivojac et al., 2013), structure prediction (Moult et al., 2014), and structural docking (Janin, 2002), are missing in protein interaction prediction. However, similar to function prediction, it can be appreciated that computational prediction of physiological interactions is difficult. One piece of evidence for this is the continued development of novel biochemical methods to identify interactions (Kristensen et al., 2012; Weisbrod et al., 2013). Despite these difficulties, protein interaction prediction can be useful even if just to prioritize interactions for biochemical follow-up. A major element in protein interaction prediction is the choice of input features for prediction. In the following, I outline features commonly used in protein interaction prediction that are potentially useful for the prediction of protease web interactions.

1.11.1 Coexpression

Gene expression is a key site of regulation in biological systems. Its use as a predictor of protein interaction stems from the rationale that interacting proteins must be expressed in the same place and in the same time frame. This rationale was supported by experimental observations. Correlation of expression of interacting proteins was consistently observed in yeast (Ge et al., 2001; Grigoriev, 2001; Bhardwaj and Lu, 2005), especially in protein complexes (Dezső et al., 2003), and similarly, albeit more weakly, in humans (Bhardwaj and Lu, 2005; Rual et al., 2005). Gene coexpression is promising as a prediction tool because it is unbiased compared to protein interaction data in that RNA expression is generally measured for all genes simultaneously using microarrays or RNA-Seq (Gillis and Pavlidis, 2011b). The RNA samples used can encompass a variety of conditions including different tissues, cell lines or cell types, disease or other biological activation states, or time points.
Coexpression is sometimes understood as the simple co-occurrence of two proteins in one cell or tissue. In prediction, it is more commonly treated as the correlation of the expression patterns of two genes across many samples: not only do the genes co-occur in one given sample, but when one of them is up- or down-regulated the other is as well, leading to similar profiles of expression across samples. Correlation of expression profiles is usually measured using symmetric similarity measures, for example Pearson correlation or mutual information (Claverie, 1999; Cunningham et al., 2000; Niehrs and Pollet, 1999). A binary coexpression network can be constructed by thresholding, sometimes after applying additional methods for determining the optimal threshold (Langfelder and Horvath, 2008). From these basic ideas, many variations and more sophisticated methods can be applied. These include partial correlations, where the specific correlation of a pair of genes is assessed after removing the effects of correlation with other genes (Wang and Huang, 2014), or weighted gene correlation network analysis (Langfelder and Horvath, 2008), where a power adjacency function is applied to the original correlation values. All these methods can be used in a meta-analysis, where correlation results for many datasets are combined with the aim of obtaining a more robust coexpression result (Lee et al., 2004; Ballouz et al., 2015).

1.11.2 Phylogenetic profiles

Loosely speaking, genes that co-occur in the genomes of some organisms and are both absent in others are more likely to work together. To calculate the similarity of co-occurrence of genes across organisms, first a phylogenetic profile for each gene is defined over a range of taxa with complete genome annotation, indicating whether the gene has an orthologue in each taxon.
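Correlation-based similarity, as used for the expression profiles of section 1.11.1, can be sketched in a few lines; the same function can equally be applied to binary presence/absence profiles of the kind just defined. The following is a minimal illustration only: gene names, profile values, and the threshold of 0.8 are hypothetical, and treating a constant profile as uncorrelated is a simplifying assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treated as no evidence
    return cov / sqrt(var_x * var_y)


def binary_network(profiles, threshold):
    """Return the set of gene pairs whose profile correlation reaches the threshold."""
    edges = set()
    for gene_1, gene_2 in combinations(sorted(profiles), 2):
        if pearson(profiles[gene_1], profiles[gene_2]) >= threshold:
            edges.add((gene_1, gene_2))
    return edges


# Hypothetical expression values across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
# Hypothetical phylogenetic profiles: 1 = orthologue present in the taxon, 0 = absent.
phylogenetic = {
    "geneA": [1, 1, 0, 0, 1],
    "geneB": [1, 1, 0, 0, 1],
    "geneC": [0, 1, 1, 0, 0],
}

coexpression_edges = binary_network(expression, threshold=0.8)
phylogenetic_edges = binary_network(phylogenetic, threshold=0.8)
```

In both toy datasets only geneA and geneB are linked. Keeping the correlation values instead of thresholding them would give a weighted rather than a binary network.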
The similarity of phylogenetic profiles of two genes is assessed by correlating the presence and absence of both genes in different organisms (Pellegrini et al., 1999). In this framework, an alternative to looking for homologues of the whole gene is the alignment of solely the protein domain (Ranea et al., 2007), similar to the approach taken in MEROPS for mapping proteases and their inhibitors across organisms (Rawlings et al., 2015). Phylogenetic similarity was applied in bacteria (Lüttgen et al., 2000) and archaea (Carlson et al., 2004) to identify members of metabolic pathways. In Caenorhabditis elegans (Tabach et al., 2013), phylogenetic similarity identified proteins involved in RNA interference silencing. This method is attractive because it is orthogonal to other features and not biased towards any group of proteins; it can be computed for any pair of proteins with known sequence. However, the resolution of this method might be limited by the resolution of homology mapping, especially for closely related gene family members.

1.11.3 Colocalization

Colocalization exploits knowledge of the cellular localization of proteins. Two proteins found in the same cellular compartment (e.g., nucleus or endoplasmic reticulum) are more likely to be interacting than non-colocalized proteins (Shin et al., 2009). Localization annotation for proteins is available from databases such as LocDB (Rastogi and Rost, 2011), the Human Protein Atlas (Uhlen et al., 2010), and Gene Ontology (GO) annotation (Ashburner et al., 2000). However, these data are often incomplete and also biased towards highly studied proteins. Colocalization is therefore commonly combined with other features, as described below in section 1.11.5.

1.11.4 Co-annotation and co-mentioning in literature

Interactions between proteins contribute to biological processes and so it was hypothesized that genes of similar functions have a high probability of interacting (King et al., 2003).
For example, proteins can thus be predicted to interact if they co-occur in publications. Another approach exploits annotations in Gene Ontology (Ashburner et al., 2000), a structured annotation of genes where annotation terms are arranged hierarchically in a directed, acyclic graph. Whereas this approach was consistently found to successfully predict protein interactions (Maetschke et al., 2012), it also fails to distinguish de novo predictions from information retrieval (Pavlidis and Gillis, 2013) because it is possible that the annotation of protein function and the interaction are the result of the same publication, as explained in section 1.10.2. The performance of co-annotation based predictions in detecting novel interactions is therefore unclear.

1.11.5 Applications of protein-protein interaction prediction

A difficulty in assessing the prediction performance of protein interaction prediction methodologies is the limited biochemical validation of results. One example including biochemical validation was the prediction of protein interactions by combining coexpression, coessentiality, and colocalization information of gene products in yeast (Jansen et al., 2003). Whereas it was shown that biochemically identified interactions overlapped with predictions, this required lowering the threshold for prediction so that performance evaluation is difficult. A second interaction prediction approach combined model organism interaction data, coexpression, co-annotation in Gene Ontology, and protein domains to predict a large number of human interactions. Whereas a few successfully validated examples were reported and led to novel biological insights, no clear report of prediction performance was made available (Rhodes et al., 2005).
Finally, a recent approach combined structural similarity of homologous interactors with other features such as coexpression, functional similarity, and evolutionary similarity to predict interactions, which were stored in the PrePPI database (Zhang et al., 2013). 15 out of 19 predicted interactions were biochemically validated, indicating a precision of 0.79 (Zhang et al., 2012). Because of the use of functional annotation in these predictions, it is unclear to what extent information retrieval was involved in generating the results. Furthermore, an independent control experiment found only 63 of 381 biochemically identified interactions to overlap with the 1235 interactions predicted by PrePPI (Moon et al., 2014), suggesting that performance was previously overestimated.

Taken together, the performance of predictions of protein interactions remains elusive. Despite this, prediction approaches remain promising avenues to complement biochemical discovery because of the high throughput of computational prediction. In addition, bioinformatic predictions of interactions can be used to guide biochemical assays by reducing the search space of possible interactions for biochemical testing. Interaction predictions as reported above are directly applicable to the protease web. When interactions between proteases and substrates are treated as cleavages and interactions between inhibitors and proteases as inhibitions, the approaches are directly transferable, with the additional difficulty of distinguishing cleavage or inhibition interactions from other interactions. However, no systematic prediction efforts based on protein interaction screens have been applied to the protease web. Predictions of protease substrates are rather based on protease-specific features as described in the following.
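The validation figures quoted above for PrePPI can be checked with a few lines of arithmetic. The counts are taken from the text; reading the 381 biochemically identified interactions as a reference set and the 1235 predictions as the corresponding prediction set is an assumption made here for illustration, and the variable names are hypothetical.

```python
# Counts quoted in the text above.
validated, tested = 15, 19
overlap, reference_set, predicted = 63, 381, 1235

# Precision in the original validation: validated predictions / tested predictions.
precision_validation = validated / tested

# Independent comparison: share of reference interactions recovered by the predictions,
# and share of predictions that are found in the reference set.
fraction_reference_recovered = overlap / reference_set
fraction_predictions_confirmed = overlap / predicted

print(round(precision_validation, 2))            # 0.79
print(round(fraction_reference_recovered, 3))    # 0.165
print(round(fraction_predictions_confirmed, 3))  # 0.051
```

Under these assumptions, both fractions from the independent comparison lie far below the precision of 0.79 obtained in the original validation.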
1.11.6 Computational predictions of protease cleavage

In silico prediction methods applied for the identification of protease cleavage sites are commonly based on the substrate specificity of the protease, often in combination with secondary structure, native disorder, and solvent accessibility of the target sequence in the substrate (Song et al., 2011a). Simpler methods apply regular expressions based on defined protease specificity for predictions (Gasteiger et al., 2005), but more complex methods combine a set of features in statistical machine-learning methods similar to the protein interaction predictors described above (duVerle and Mamitsuka, 2012). Cleavage predictors vary in focus. Some examples predict cleavage for specific proteases such as caspases and granzyme B (Backes et al., 2005; Barkan et al., 2010; Song et al., 2010) or calpains (duVerle et al., 2011). Other predictors are based on protease specificity data uploaded by the user (Boyd et al., 2005; Verspurten et al., 2009) and are thus applicable to any protease with available data. Cleavage prediction tools focused on substrates are also available, either to predict cleavage without knowledge of the responsible protease (Li et al., 2012) or to predict cleavage by a set of predefined proteases (Song et al., 2012). Despite the good prediction performance reported by the above tools in computational validation experiments, biochemical validation of predictions is rare. Furthermore, protease cleavage prediction is generally limited to a few proteases by the low numbers of substrates known for most proteases (Fuchs et al., 2013). This motivated the development of MS-based methods aimed at elucidating protease cleavage specificity profiles used in substrate prediction.

In these approaches, an active protease is first incubated with a mixture of peptides. Then, after identification of cleavages in peptides with MS, cleavage sites in peptides are aligned to determine protease specificity.
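The alignment step can be illustrated with a position-specific count table over the residues flanking the scissile bond (P4 to P4' in the Schechter and Berger nomenclature). This is a minimal sketch rather than the procedure of any specific tool: the peptides and cut positions are invented, loosely modeled on a caspase-like preference for aspartate at P1, and the window of four residues on each side is an assumption.

```python
from collections import Counter


def cleavage_window(sequence, cut_index, flank=4, pad="-"):
    """Residues P{flank}..P1 and P1'..P{flank}' for a cut between cut_index-1 and cut_index."""
    left = sequence[max(0, cut_index - flank):cut_index].rjust(flank, pad)
    right = sequence[cut_index:cut_index + flank].ljust(flank, pad)
    return left + right


def specificity_profile(cleavages, flank=4):
    """Count amino acids per position over all aligned cleavage windows."""
    labels = [f"P{i}" for i in range(flank, 0, -1)] + [f"P{i}'" for i in range(1, flank + 1)]
    counts = {label: Counter() for label in labels}
    for sequence, cut_index in cleavages:
        window = cleavage_window(sequence, cut_index, flank)
        for label, residue in zip(labels, window):
            if residue != "-":  # padding marks positions beyond the peptide ends
                counts[label][residue] += 1
    return counts


# Hypothetical observations: (peptide sequence, 0-based index of the P1' residue).
observed = [
    ("AGDEVDSLK", 6),
    ("TTDEVDGAAR", 6),
    ("MDQVDAK", 5),
]
profile = specificity_profile(observed)
```

In this toy profile, aspartate is the only residue observed at P1 and P4. Dividing the counts by background amino acid frequencies would be a natural next step towards a specificity logo.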
Proteomic Identification of protease Cleavage Sites (PICS) (Schilling et al., 2011) is one such approach, where peptide libraries are generated by tryptic digestion of a proteome of choice, followed by blockage of reactive amines (N-termini and side chains) and digestion by the query protease. This results in new reactive primary amines, which are enriched by biotin-streptavidin binding and identified by MS analysis. Aligning peptides with known protein sequences ultimately identifies the cleavage site. In a second strategy (O’Donoghue et al., 2012), a synthetic peptide library designed to capture all possible combinations of two neighboring or near-neighboring amino acids is incubated with the target protease. This simplified mixture is limited to application on proteases with specificity conferred by one or two amino acids but does not require additional labeling and sample fractionation, thus improving reproducibility. Furthermore, by multiplexing over time points, this method can be used to approximate kinetic parameters of protease cleavage. Whereas cleavage specificity information is thus becoming available for an increasing number of proteases, this approach is limited to proteases with high sequence specificity and is not applicable to proteases with external substrate binding domains such as exosites (Overall, 2002). A second, inherent limitation in the utility of protease specificity is the similarity of specificity between proteases (Fuchs et al., 2013), which makes the assignment of substrates to specific proteases of a protease family difficult. Despite its large potential, prediction of protease cleavage has thus remained an aspiration, and reliable available cleavage data are based on the biochemical experiments described in section 1.6.

1.12 Databases of protease cleavage and inhibition

A few databases exist to capture information on protease cleavage, as well as related terminomics, protease family, and protease inhibition data.
MEROPS, the earliest and most comprehensive database of this kind, is focused on annotating proteases and inhibitors in genomes of sequenced organisms and contains a comprehensive collection of over 60,000 annotated protease-substrate cleavages (Rawlings et al., 2015). In MEROPS, a type example of each protein species (protease or inhibitor) is selected as a holotype. Protein species are then mapped across organisms to identify orthologues and are further grouped into classes, families, and clans. Other databases collecting protease cleavage data are available but generally capture fewer cleavage events. CutDB, which contains manually annotated and predicted cleavages as well as examples transferred from MEROPS, currently captures 11,086 cleavages of 3391 substrates by 601 proteases (status last updated on June 24th 2009) (Igarashi et al., 2007). DegraBase annotates N-termini identified by subtiligase-based terminomics in healthy and apoptotic human cell lines and contains 8090 N-termini in 3206 proteins but does not identify the proteases responsible for these termini (Crawford et al., 2013). The CASBAH database contains cleavages by caspases curated from the literature and at the time of writing contained 777 substrates, with limited annotation of the cleavage site position in the substrate and the protease responsible (Lüthi and Martin, 2007). Finally, TOPPR collects protein termini observed by COFRADIC terminomics and currently contains 27,147 termini observed on 2,234 substrates in 18 studied treatments (including protease treatments) (Colaert et al., 2013).

In an effort to combine dispersed data on N- and C-termini and protease cleavage in one database, TopFIND (termini-oriented protein function inferred database) (Lange and Overall, 2011; Lange et al., 2011) was created, originally covering the five organisms Homo sapiens, Mus musculus, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli.
TopFIND contains protein annotations such as protein and gene names, chromosome location, amino acid sequence, sequence variations, and secondary structure from UniProt (The UniProt Consortium, 2014) and combines this information with protease cleavage and protease inhibition information from MEROPS (Rawlings et al., 2015). TopFIND further annotates cleavage-inferred protein termini for each protein entry by assuming that protease cleavage generates one N- and one C-terminus. These termini annotations are further supplemented with biochemically observed termini from terminomics experiments. TopFIND thus combines comprehensive data on protease cleavage and protein termini in one location and provides an adequate starting point for the studies of protease processing in the protease web performed in this thesis.

1.13 Themes and outlook

Proteolytic interactions in the protease web are a major source of uncertainty. Studies of protease function in vivo, in particular the identification of substrates, are intricate due to this complex and uncharacterized network. In addition, this network is thought to cause complications in the development of inhibitor drugs to target proteases, which are major regulators in crucial biological processes and thus involved in numerous diseases. A second but related difficulty is to distinguish the processes generating proteomics-derived protein termini, which can result from protein processing, alternative translation, or alternative splicing events. In this introduction, I have outlined how computational approaches are used to gain insight into the complex behavior of networks and to predict biochemically testable interactions, and which computational tools are available to analyze protease biology. This thesis extends these approaches, addressing challenges in protease biology through the development of databases, software, and computational models to gain insight into the processes governing protease cleavage and protein truncations.
Chapter 2 outlines the extension of the TopFIND database to globally cover evidence of protein termini resulting from all three processes generating protein termini, in addition to the software tools PathFINDer and TopFINDer, which allow systematic queries of the protease web and the terminome in TopFIND. In chapter 3, I report an analysis of experimentally observed and inferred human N-termini in TopFIND to investigate the prevalence of termini-generating processes. Network modeling and characterization of the protease web in a global analysis of all protease cleavage and inhibition data are summarized in chapter 4. In chapter 5, I focus on a specific and simplified subclass of the interaction prediction task: the prediction of inhibitory interactions between proteases and their protease inhibitors, which are major regulators of protease activity.

Chapter 2: Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events

2.1 Introduction

Genetic information typically results in many protein species differing in amino acid sequence or by modification of individual amino acids. Pre- and post-translational processes can result in protein species that differ from other species encoded by the same gene in the extent of their sequence. Such protein species can have fundamentally different properties due to differences in primary structure, including different domain structure, linear motifs, post-translational modification sites, as well as protein and ligand binding sites. This can radically change the function of the protein, its location within a cell, and whether or not it is exported to function extracellularly. The start and end of a protein chain, the N- and C-termini respectively, are particularly defining features marking the extent of the primary structure and thus the functional competence of the protein.
Since the chemistries of the protein termini differ from those of the amino acid side chains, protein termini can also undergo specific PTMs that in turn can direct function. Thus knowledge of protein termini is an essential facet of understanding the biology of a protein and the networks of proteins that it interacts with.

Three independent processes form protein species that differ in the extent of their sequence and hence termini. First, RNA splicing can lead to the selection of alternative exons encoding the N- or C-terminus (Figure 2.1). Second, use of alternative translation initiation sites leads to protein species with either shorter or longer sequence and hence alternate protein N-termini (Figure 2.1). Third, post-translational modification by proteolytic processing truncates the protein and leads to the formation of shorter stable protein chains with a unique neo N- or neo C-terminus (Figure 2.1) (Lange and Overall, 2013; Rogers and Overall, 2013). Recent analyses show that proteolytic processing of proteins in vivo in tissues (Keller et al., 2013) or specialized enucleate cells, the erythrocyte (Lange et al., 2014) and platelet (Prudova et al., 2014), is remarkably high at 49%, 68% and 77%, respectively. Such analyses reveal the formation of stable protein species by proteolytic processing as a pervasive phenomenon in proteomes and hence one that needs to be considered in interpreting biological processes.

Figure 2.1 Biological processes leading to differences in termini in proteins and databases containing corresponding information. [Figure panel labels: Transcription, Splicing, Translation, Cleavage; databases: Ensembl, TISdb, TopFIND, MEROPS.]

Although proteolytic processing by proteases, also known as peptidases, affects every protein in the proteome, it is a generally underappreciated post-translational modification (Rogers and Overall, 2013).
Endopeptidases cleave a protein at a precise position in the sequence, whereas exopeptidases remove one or two amino acids from the end of the protein chain. If trimming continues in a processive manner, this is termed “ragging”. Through these three main specific protein modifications, the start and end positions of proteins have been found at any point in the genome-encoded protein sequence, even to within 20% of the protein length from the C-terminus (Lange and Overall, 2011). The known functional implications of proteolysis include conventional protein maturation, e.g. removal of N-terminal methionine, signal and transit peptides, as well as specific chain cleavages during protein maturation, or cleavages after disulphide bridges form protein homodimers or heteromers. Precise proteolytic processing also generates bioactive peptides from parent chains, e.g. bradykinin and many hormones, and can act at different stages in the life of a protein to modulate protein function, localization, and membrane protein shedding.

The chemical nature and length of different protein N- and C-termini convey different functional properties, such as altered receptor binding to switch from agonist to antagonist, as exemplified by the truncation of only 4 amino acids from the N-terminus of CCL7 (Lange and Overall, 2013; McQuibban et al., 2000). The most prominent chemical changes of termini include N-terminal acetylation, which regulates protein stability and half-life, either destabilizing the protein (Hwang et al., 2010) or increasing stability in a sequence-dependent manner (Lange et al., 2014). Other notable examples include N-myristoylation and N-palmitoylation, which regulate protein trafficking and membrane localization (Pierre et al., 2007; Resh, 2006).
Modifications of the C-terminus are less well understood due to the difficulty in identifying C-termini by sequencing or in their enrichment, but are equally relevant, as exemplified by the carboxyl methylation of phosphoprotein phosphatase 2A, which is a prerequisite for the association with its Bα subunit (Tolstykh et al., 2000).

Whereas the impact of proteases on protein termini and function is of great biological relevance, in vivo studies of proteases and their substrates are complicated by the deeply connected and dynamic interactions of proteases in cells and tissues described in chapter 4 (Fortelny et al., 2014). In particular, proteases interact to form an interconnected network termed the protease web by cleaving other proteases or protease inhibitors. Thereby a protease indirectly influences the cleavage of substrates of downstream proteases in addition to its direct substrate repertoire. Such interactions complicate the differentiation between direct and indirect cleavage events and thus hamper the assignment of proteases to substrates in in vivo studies. We showed that computational modeling of existing cleavage and inhibition information can greatly assist in distinguishing direct and indirect effects and can be used to assist in assigning proteases to cleavage events observed in vivo (Fortelny et al., 2014). However, tools for performing this routinely are desperately needed in order to interpret biological phenomena and knockout mouse models, and for drug target validation.

In view of the added complexity arising from altered termini position and nature, we developed TopFIND (Lange and Overall, 2011; Lange et al., 2011) to comprehensively integrate data on protein termini and their formation by proteolytic processing as well as to associate shortened protein chains with relevant information on protein function.
TopFIND is based on experimental data directly uploaded to TopFIND and complemented with data obtained from the MEROPS (Rawlings et al., 2012) and UniProt (The UniProt Consortium, 2012) databases. A web interface and application programming interfaces (APIs) enable manual and automated data retrieval and analysis from TopFIND. The web interface is protein-centric with a dedicated summary page for each protein. Protein pages contain general protein-specific information and a variety of information related to the termini of the protein, position-specific information on domains and features, sites of proteolytic processing and positions of protein termini, and mutation or single nucleotide polymorphism (SNP) sites altering the protein sequence at or in the neighbourhood of cleavage sites (Lange and Overall, 2011). Proteases and protease inhibitors are further annotated by dynamically calculated cleavage site specificities represented as iceLogos (Colaert et al., 2009). The connectivity of each protein in the protease web is summarized in network figures showing other proteins connected directly and indirectly by paths of maximally two cleavage or inhibition interactions. For each observed terminus or cleavage, the underlying evidence including information on confidence, biological relevance, experimental conditions and publications is displayed. A powerful filter enables efficient selection of a subset of termini and cleavage data based on the underlying evidence. For example, data can be limited to specific experimental methodologies, confidence cut-offs, source laboratories or databases.

Here we present the next major release of TopFIND, version 3.0, in which we addressed three major limitations and user needs. First, we now account for all biological processes leading to the formation of alternate termini in all isoforms including alternative translation and splicing.
Second, by creating the analysis software TopFIND ExploRer (TopFINDer) we have enabled researchers to annotate and statistically evaluate large-scale proteomics experiments in view of the TopFIND resource. Third, with PathFINDer we developed the first publicly available tool to identify putative indirect proteolytic effects from in vivo proteomics data by placing proteins in the context of the proteolytic network (the extension of the protease web by protease substrates) (Fortelny et al., 2014) and identifying indirect connections from a query protease to the protein using graph path finding. With these new tools, TopFIND 3.0 now addresses and greatly facilitates the hardest problem in current protease research, the identification of the cognate protease responsible for a given cleavage event from a complex in vivo sample.

2.2 Methods

TopFIND is developed in Ruby with a MySQL database backend and a web application frontend developed on the Rails framework as described previously (Lange and Overall, 2011). The MySQL database is centered around a table containing protein entries and additional tables (for example cleavages, inhibitions, N- and C-termini) linked to the protein table. To annotate termini inferred from alternative transcripts, human and mouse Ensembl (Flicek et al., 2014) protein (ENSP) sequences were downloaded in FASTA format from http://uswest.Ensembl.org/info/data/ftp/index.html. The first and last 20 amino acids of each protein sequence were mapped to the corresponding UniProt (The UniProt Consortium, 2014) sequence to annotate the position of the new N- and C-termini, respectively. To annotate N-termini derived from alternative translation start sites, human and mouse TISdb (Wan and Qian, 2014) files were downloaded from http://tisdb.human.cornell.edu/download/.
The RefSeq sequence ID was used to retrieve the sequence using BioMart, and 20 amino acids from the indicated start position of the translated sequence were mapped to the protein sequence from UniProt to annotate the N-terminus. We developed TopFINDer to show protease enrichment p-values that were obtained using a Fisher Exact Test. The background for this test is made up of cleavages on the proteins from the list by the proteases identified by TopFINDer. TopFINDer then calculates a q-value using Benjamini-Hochberg multiple testing correction. IceLogos are created using http://iomics.ugent.be/icelogoserver. PathFINDer identifies paths (sequences of directed edges representing cleavage and inhibition events) that connect the query protease to each submitted substrate and finally plots all identified paths as a network. Graphical representations of the connections identified by PathFINDer are plotted using graphviz (http://www.graphviz.org/). Mapping of orthologous proteins between Homo sapiens and Mus musculus for PathFINDer was derived from the InParanoid database (Östlund et al., 2010) version 8.0. The example dataset was taken from data sheet 8 of the supplementary tables published by Keller et al. (2013). Using TopFINDer, annotated N-termini in a window of plus or minus three amino acids surrounding the identified terminus were collected (default settings of the parameter “positional precision”). Peptides with a log2 fold change higher or lower than 1.19 (the cutoff applied in the original paper) were run in PathFINDer with MMP2 (P33434) as a query protease and using the human network.

2.3 Results

2.3.1 Changes to the database content

In previous versions of TopFIND we limited the annotation of termini to the canonical isoform as defined by UniProt (The UniProt Consortium, 2014). We have now extended the functionality in two ways.
First, in addition to the main entry for the canonical isoform we now provide a full entry with an accompanying web page for each individual isoform. Second, in addition to the original isoform-specific annotation we now map every entry to the corresponding isoforms using exact alignment of the local sequence context of the 20 amino acids following (N-terminus), preceding (C-terminus) or surrounding (cleavage) the reported position. Entries derived from isoform mapping are clearly marked and associated with respective evidence stating and linking to the original observation of the inferred terminus.

Additional protein termini originating from alternative splicing and alternative translation

We incorporated termini entries and corresponding evidence into TopFIND for termini resulting from experimentally observed alternative mRNA transcripts as annotated by Ensembl (Flicek et al., 2014). We collected protein-coding transcripts and mapped their N- and C-termini to protein sequences. Thereby evidence for 11,809 and 6,676 termini was added for human and mouse, respectively (Table 2.1).

We also added N-termini resulting from alternative translation initiation. Alternative translation of proteins is the result of internal ribosome entry sites (IRES) or leaky ribosome scanning (Wan and Qian, 2014) and is commonly probed by Global Translation Initiation Sequencing (GTI-Seq), where a translation elongation inhibitor is added to a sample and the RNA found with the bound ribosome is sequenced. We integrated this information from TISdb (Wan and Qian, 2014), which represents the gold standard database for alternative translation initiation. To date TopFIND incorporates 439 human and 1,437 mouse N-termini based on evidence from alternative translation reported in TISdb. However, we expect these numbers to dramatically increase in the next few years.
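The exact-match mapping of termini onto other sequences described above can be illustrated with a minimal Python sketch. This is not the TopFIND implementation (which is written in Ruby); the function name `map_terminus`, its parameters, and the rule of rejecting absent or ambiguous matches are assumptions made for illustration. Only the 20-residue local sequence context is taken from the text.

```python
from typing import Optional


def map_terminus(canonical: str, isoform: str, position: int,
                 kind: str = "N", window: int = 20) -> Optional[int]:
    """Map a terminus at 1-based `position` in `canonical` onto `isoform`.

    Returns the 1-based position in `isoform`, or None when the local
    sequence context is missing or occurs more than once (assumed rule).
    """
    if kind == "N":
        # Context: the residues following the N-terminus, starting at it.
        context = canonical[position - 1:position - 1 + window]
        offset = 0
    elif kind == "C":
        # Context: the residues preceding the C-terminus, ending at it.
        context = canonical[max(0, position - window):position]
        offset = len(context) - 1
    else:
        raise ValueError("kind must be 'N' or 'C'")
    if not context:
        return None
    first = isoform.find(context)
    if first == -1 or isoform.find(context, first + 1) != -1:
        return None  # absent or ambiguous: no mapping is reported
    return first + offset + 1


if __name__ == "__main__":
    # Hypothetical sequences: the isoform lacks the first 10 residues.
    canonical = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
    isoform = canonical[10:]
    print(map_terminus(canonical, isoform, 25, "N"))
    print(map_terminus(canonical, isoform, 40, "C"))
```

Requiring a unique exact match is a conservative choice: a terminus is only transferred to an isoform when its local context is unambiguous there.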
Table 2.1 Counts of non-canonical termini evidenced by alternative splicing (from Ensembl) or by alternative translation (from TISdb).

Terminus type   Biological process         Human     Mouse     Sum
N-termini       alternative splicing       3,141     1,390     4,531
N-termini       alternative translation      439     1,437     1,876
C-termini       alternative splicing       8,229     3,849    12,078
C-termini       alternative translation        0         0         0
Total                                     11,809     6,676    18,485

2.3.2 TopFINDer: the TopFIND ExploRer

A frequently requested functionality is to facilitate information retrieval for lists of thousands of proteins commonly generated by proteomics experiments. Here, we extended TopFIND with such an analysis tool, which we call TopFINDer. TopFINDer enables analysis and functional annotation of protein N- or C-termini sets based on TopFIND data. From a list of protein identifiers and their terminal amino acid sequences as input (Figure 2.2A), TopFINDer returns comprehensive protein and terminus related (position-specific) information and analyses based on TopFIND data. The results are sent by email and contain a table (in an Excel-compatible tab-delimited text file) with general annotation of each protein and position-specific information: for each identified terminus, TopFINDer reports the position of the terminus relative to the genome-encoded sequence as well as the sequence context surrounding the terminus and evidence for the terminus from TopFIND. Important new information is provided, including evidence for the classification of termini by their origin as (i) termini inferred from alternative splicing derived protein isoforms in Ensembl or UniProt, (ii) N-termini inferred from alternative translation, (iii) termini inferred from cleavage together with the associated proteases, (iv) status as UniProt annotated canonical protein termini, and (v) termini observed experimentally, but without a known protease responsible for the cleavage.

Figure 2.2 Input and output of TopFINDer.
(A) Input mask within the TopFIND web interface. (B) Venn diagram showing the overlap of termini evidences retrieved from TopFIND for a list of proteins. Evidence is either UniProt annotated terminus (Curated Start), terminus of an isoform derived from alternative splicing (Alternatively Spliced), or from alternative translation (Alternatively Translated), from cleavage (Cleaved), or a terminus observed in a non-protease related terminomics experiment (Experimentally Observed). (C) Matrix of substrates and proteases indicating cleavage of substrates. Fields are red where there is cleavage and black where there is none known between the protease and the substrate at this position. The y-axis shows the protein identifier and the position of each terminus. (D) Barplot showing the number of cleavages of each protease in the list. Bars of proteases whose cleavages are enriched in the list are in red, others in blue. (E) IceLogo of the sequences in the list.

Thoughtful analyses can then be performed by comparing each observed terminus in the list to evidences for this terminus in TopFIND. Thus, the biological relevance of the terminus can be assessed immediately and the biological process generating the protein terminus can be inferred. TopFINDer also displays domains and features N-terminal and C-terminal to the identified terminus as well as features located at the position of the terminus, allowing for inference of the impact on protein function. TopFINDer also allows for the retrieval of information in the proximity of the query terminus using a user-definable N- and C-terminal extension of up to 10 amino acids from the terminus position. Thereby, processes such as ragging, which leads to nearby but different termini, can be attributed to the original process and included in the analyses.
In addition to annotation, TopFINDer calculates summary and enrichment statistics for each of the submitted termini that TopFINDer classifies as described above. The overlap and relative distribution of these groups is visualized in a Venn diagram (Figure 2.2B). Of great use to the protease community, termini originating from proteolytic processing are assigned to proteases already annotated by TopFIND to mediate the specific cleavage. These data are visualized in a clustered matrix with substrates on the y-axis and proteases on the x-axis (Figure 2.2C) as well as a histogram of counts of cleavages per protease (Figure 2.2D). The amino acid sequence of the submitted termini, except for those originating from translational events, is summarized in an IceLogo (Colaert et al., 2009), which is a frequency-based sequence logo with probability cut-off (Figure 2.2E). Taken together and by comparison with protease specificity logos provided by TopFIND, this can enable the identification of one or several dominant candidate protease activities in the sample. To account for the bias created by different numbers of protease-substrate associations available in TopFIND, TopFINDer also calculates enrichment statistics for each protease, enabling the researcher to assess the likelihood of the protease being responsible for the cleavage. In addition to graphical representations, the statistical summary is also reported in tabular format (Table 2.2).

Table 2.2 Protease enrichment results from TopFINDer in the skin dataset.
Protease   List count      DB count         Fold         Fisher Exact Test   Adjusted Fisher Exact Test
name       (total = 129)   (total = 1265)   enrichment   (p-value)           (q-value)
MMP2       68              537              1.24         0.016               0.053
CATE       51              544              0.92         0.803               0.803
THRB       1               1                9.81         0.177               0.287
GRAB       16              58               2.71         0.001               0.009 ***
CASP3      7               28               2.45         0.037               0.080
CASP7      7               28               2.45         0.037               0.080
CATD       29              309              0.92         0.722               0.782
MPPB       4               4                9.81         0.004               0.024 ***
CAN2       2               8                2.45         0.235               0.329
MMP9       5               11               4.46         0.012               0.052
CASP1      2               16               1.23         0.507               0.600
MMP8       1               2                4.90         0.253               0.329
MMP13      1               1                9.81         0.177               0.287

*** significantly enriched proteases

2.3.3 PathFINDer: Protease web path-finding in TopFIND

We employ path-finding in the protease web to identify known direct and indirect explanations for the observed cleavages. This powerful analysis can reveal previously hidden dependencies, facilitate differentiation of direct and indirect effects and explain counterintuitive experimental results as shown in chapter 4 (Fortelny et al., 2014). With PathFINDer it is now possible to submit a list of identified human or mouse in vivo substrate candidates and cleavage sites and a candidate protease to find known direct and indirect connections (paths) from the protease to the identified substrates. By creating a representation of the protease web based on cleavage and inhibition data in TopFIND and then dynamically extending this network with connections from the candidate protease to the proteins in the list, PathFINDer identifies paths from the protease to each substrate. We expanded this analysis by cross-species mapping between human and mouse orthologous proteins to compensate for the currently sparse data, in particular in mouse (Fortelny et al., 2014). In this way, if data are absent in one species but present for homologues in the other species, a reasonable prediction can still be formulated and tested. This analysis can be run separately or in combination with a TopFINDer analysis.
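The path search described above can be sketched as a bounded enumeration of simple paths in a directed graph whose edges are labelled as cleavage or inhibition. The following is a minimal illustration only: the text does not state which search algorithm or path-length limit PathFINDer uses, so the function name `find_paths`, the `max_steps` parameter and the toy protein names are all hypothetical.

```python
from collections import defaultdict


def find_paths(edges, query, targets, max_steps=3):
    """Enumerate simple directed paths from `query` to any protein in `targets`.

    `edges` is an iterable of (source, kind, target) tuples, where `kind` is
    "cleavage" or "inhibition". Each path is a list of such tuples.
    `max_steps` is an assumed limit on the path length.
    """
    graph = defaultdict(list)
    for source, kind, target in edges:
        graph[source].append((kind, target))
    targets = set(targets)

    paths = []
    stack = [(query, [])]  # (current node, steps taken so far)
    while stack:
        node, steps = stack.pop()
        if steps and node in targets:
            paths.append(steps)
        if len(steps) == max_steps:
            continue
        visited = {query} | {target for _, _, target in steps}
        for kind, nxt in graph.get(node, []):
            if nxt not in visited:  # keep paths simple (no revisits)
                stack.append((nxt, steps + [(node, kind, nxt)]))
    return paths


if __name__ == "__main__":
    # Toy network with hypothetical names, not TopFIND data.
    toy_edges = [
        ("ProteaseA", "cleavage", "InhibitorX"),
        ("InhibitorX", "inhibition", "ProteaseB"),
        ("ProteaseB", "cleavage", "Substrate1"),
        ("ProteaseA", "cleavage", "Substrate2"),
    ]
    for path in find_paths(toy_edges, "ProteaseA", ["Substrate1", "Substrate2"]):
        print(" -> ".join([path[0][0]] + [f"[{k}] {t}" for _, k, t in path]))
```

In this toy network the query protease reaches one substrate directly and the other indirectly, by cleaving the inhibitor of a second protease. This is the kind of indirect connection the section describes.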
The identified paths are visualized in a network view (Figure 2.3) and listed in a tabular format including links to all relevant proteins, cleavages and evidences in the TopFIND web interface.

Figure 2.3 Fragment of the graphviz figure of protease web connections identified by PathFINDer.
Nodes are proteins, the query protease is marked in color and the proteins from the submitted list are grey. Edges are cleavages (arrows, with numbers for the position of the cleavage) or inhibitions (T shaped arrows, labeled as “inh”). Edges from TopFIND are solid and edges inferred from the list are dotted. Nodes from the complement system are marked with red.

2.3.4 Validation of TopFINDer

To demonstrate the utility of TopFINDer, we re-analyzed a proteomics dataset of all 1255 N-termini found in inflamed and normal skin samples of wild type and matrix metalloproteinase-2 knock-out (Mmp2-/-) mice (Keller et al., 2013). The Venn diagram generated by TopFINDer (Figure 2.2B) showed that 558 (44%) N-termini from the list correspond to UniProt annotated, canonical start sites, many of which are past the initiator methionine and map to the start of stable protein chains as defined by UniProt; 348 N-termini have experimental evidence (~60% by TAILS), demonstrating the reproducibility of the N-terminal enrichment method TAILS; 129 N-termini are the result of cleavages, 16 of which coincide with a UniProt annotated protein start; 24 N-termini have evidence from alternative translation and 16 from alternative splicing. Thus, translation accounts for about 50% of N-termini with the other 50% likely due to proteolytic processing because cleavage data are the most incomplete. TopFINDer could identify an annotated protease for about 20% of these remaining N-termini. The histogram (Figure 2.2D) and table (Table 2.2) show that as many as 68 cleavages could be attributed to MMP2.
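The enrichment statistics in Table 2.2 can be approximately recomputed from the reported counts. The sketch below uses only the Python standard library and rests on stated assumptions: the fold enrichment is taken as the ratio of the list fraction to the database fraction, and the Fisher Exact Test is assumed to be one-sided on the 2x2 table [[list count, list total - list count], [DB count, DB total - DB count]]. The source does not spell out this layout; it is chosen here because it gives values close to the reported p-values for THRB and MPPB. The function names are illustrative and are not part of TopFINDer.

```python
from math import comb

LIST_TOTAL, DB_TOTAL = 129, 1265

# (protease, list count, DB count, reported p-value) copied from Table 2.2
TABLE_2_2 = [
    ("MMP2", 68, 537, 0.016), ("CATE", 51, 544, 0.803), ("THRB", 1, 1, 0.177),
    ("GRAB", 16, 58, 0.001), ("CASP3", 7, 28, 0.037), ("CASP7", 7, 28, 0.037),
    ("CATD", 29, 309, 0.722), ("MPPB", 4, 4, 0.004), ("CAN2", 2, 8, 0.235),
    ("MMP9", 5, 11, 0.012), ("CASP1", 2, 16, 0.507), ("MMP8", 1, 2, 0.253),
    ("MMP13", 1, 1, 0.177),
]


def fold_enrichment(list_count, list_total, db_count, db_total):
    """Fraction of list cleavages divided by fraction of database cleavages."""
    return (list_count / list_total) / (db_count / db_total)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of `a` or more in the top-left
    cell given fixed margins (assumed test layout, see lead-in).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    q_values = benjamini_hochberg([row[3] for row in TABLE_2_2])
    for (name, in_list, in_db, _), q in zip(TABLE_2_2, q_values):
        fold = fold_enrichment(in_list, LIST_TOTAL, in_db, DB_TOTAL)
        p = fisher_exact_greater(in_list, LIST_TOTAL - in_list,
                                 in_db, DB_TOTAL - in_db)
        print(f"{name:6s} fold={fold:5.2f} p={p:.3f} q={q:.3f}")
```

Because the q-values here are derived from p-values rounded to three decimals, small deviations from the tabulated q-values are expected for the smallest p-values.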
However, as expected when analyzing the entire list of peptides from this experiment, enrichment for MMP2 cleavages was not significant; the high count instead reflects the high number of known MMP2 cleavages. In contrast, Granzyme B and Mitochondrial-processing peptidase were found to be active in the sample, with 16 of 58 and 4 of 4 substrate cleavages identified, respectively. Finally, the clustered substrate-protease matrix (Figure 2.2C) showed that there was a small overlap between MMP2 and non-metalloproteinases. Similar analyses carried out manually would take days or weeks to complete, whereas TopFINDer returns its results in a few minutes.

2.3.5 Validation of PathFINDer

PathFINDer analysis of the proteins in the murine network so far did not yield relevant insights (data not shown), reflecting the sparse network based on murine protease-substrate association data (Fortelny et al., 2014). However, when cross-mapping to the human network, numerous connections between the candidate protease and identified substrates can be identified, generating concrete and meaningful mechanistic hypotheses (Figure 2.3). For example, we observed the connection between MMP2, serpin G1 (complement inhibitor protein 1) and complement 4A, which indicated MMP2 acting on the complement system. Indeed, this was observed and validated in vivo by proteomic TAILS analyses that showed increased levels of cleaved serpin G1 in wild type but not Mmp2-/- mice in inflammation, and it was also biochemically validated in vitro in the original publication (Keller et al., 2013). PathFINDer thus succeeds in assembling biochemical knowledge of proteases and inhibitors and provides relevant hypotheses, which can in turn be validated experimentally.

2.4 Discussion

With version 3.0, TopFIND goes beyond proteolysis-generated termini and now accounts for all known biological processes that lead to variation in the start and end of a protein by assembling available evidence for termini.
For any experimentally observed terminus TopFINDer thereby greatly facilitates assessment of the origin of the terminus. This is particularly powerful in combination with large lists of termini generated by current methods. The statistics of termini generating biological processes reported by TopFINDer make it possible to quickly assess biological processes in a sample. Because many termini appear to be caused by proteolytic cleavage events, TopFINDer further expands on that aspect and reports protease statistics enabling identification of active proteases in the system, allowing for the inference of biological pathways that are also active in the system. Furthermore, TopFINDer enables inference of the functional consequences of cleavage by reporting the domains lost and retained in the remaining cleavage fragment.

Finally, with the second termini list analysis tool PathFINDer, TopFIND 3.0 also enables the formulation of network biology inspired mechanistic hypotheses for experimental validation, which is critical in any interpretation of in vivo research, where proteins can no longer be viewed as independent entities but in context as parts of bigger interconnected complexes. Thus, TopFINDer and PathFINDer together enable powerful analysis of large proteomic datasets in addition to the existing proven web and API access methods. This database update thereby answers the need of modern, systems-biology research, which too often is slowed when new leads need to be drawn from the analysis of large protein lists. By dynamically displaying many layers of relevant data from multiple databases, TopFIND accelerates the data analysis and hypothesis generation process.

2.5 Summary

The knowledgebase TopFIND is an analysis platform focussed on protein termini, their origin, modification and hence their role in protein structure and function.
Here we present a major update to TopFIND, version 3, which includes a 70% increase in the underlying data to now cover 90,696 proteins, 165,044 N-termini, 130,182 C-termini, 14,382 cleavage sites and 33,209 substrate cleavages in H. sapiens, M. musculus, A. thaliana, S. cerevisiae and E. coli. New features include the mapping of protein termini and cleavage entries across protein isoforms and, significantly, the mapping of protein termini originating from alternative transcription and alternative translation start sites. Furthermore, two analysis tools for complex data analysis based on the TopFIND resource are now available online: TopFINDer, the TopFIND ExploRer, characterizes and annotates proteomics-derived N- or C-termini sets for their origin, sequence context and implications for protein structure and function. Neo-termini are also linked to associated proteases. PathFINDer identifies indirect connections between a protease and a list of substrates or termini, thus supporting the evaluation of complex proteolytic processes in vivo. To demonstrate the utility of the tools, a recent N-terminomics dataset of inflamed murine skin has been re-analyzed. Recapitulating the major findings of the original analysis, which was performed manually, validates the utility of these new resources. The point of entry for the resource is http://clipserve.clip.ubc.ca/topfind from where the graphical interface, all application programming interfaces (API) and the analysis tools are freely accessible.

Chapter 3: Truncated protein N-termini and their genesis

3.1 Introduction

Protein products of a gene can be highly variable. From 20,061 human proteins (neXtProt database (Gaudet et al., 2015), release 2015-01-01) many more proteoforms are created, which result in millions of different proteins through modifications at the mRNA and the protein level.
Whereas post-transcriptional modification of genomic sequences by RNA splicing is irreversible, commonly considered post-translational chemical modifications are often reversible modifications to specific amino acid residues, for example, by phosphorylation or acetylation. However, irreversible modifications to protein chains are now also increasingly recognized to play important roles in generating diversity in protein structure and hence function and cellular or tissue phenotypes (Overall, 2014). One irreversible modification to proteins involves truncation of proteins to create new, shorter proteoforms with new internal N- or C-termini. Protein truncation has been postulated to have a great impact on generating diversity in the human proteome (Overall, 2014) and to increase the functional repertoire of proteins by precise alteration in the biological properties of proteins (Lange and Overall, 2013). Successful terminomics techniques (Gevaert et al., 2003; Kleifeld et al., 2010), dedicated to the identification of the precise position of protein termini in tissues in vivo or in cells, have discovered a large percentage of internal N-termini, often exceeding 50% of N-terminal peptides, with 44% in murine skin (Keller et al., 2013), 68% in human erythrocytes (Lange et al., 2014) and 77% in human platelets (Prudova et al., 2014). This unexpectedly high percentage means that many populations of a protein occur that do not start at their canonical, genetically encoded N-termini, yet this has largely been overlooked in proteomics data analyses and in their biological interpretation. In addition to the impact on the proteome composition and on the emergent change in functional properties of the altered proteins, a major question is the nature of the mechanism generating N-termini, one that is especially relevant for designing therapeutics.
Protein truncations are generally thought to be the result of protease activity and so proteases may be new drug targets if their substrates are a disease driver. Notwithstanding the pervasiveness of proteolysis in vivo, neo N- and C-termini can also result from alternative translation and alternative splicing events.

The assignment of the genesis and impact of terminal peptides on protein function, as well as of their importance for the biological system, is a hurdle in current terminomics data analysis that often takes significant time. Thus TopFIND was updated in January 2015 (chapter 2) (Fortelny et al., 2015) with new data and analysis tools to aid terminomics analyses and assignment of cleavages to the relevant proteases. TopFIND now has 165,044 N-termini and 130,182 C-termini in 90,696 proteins from Homo sapiens, Mus musculus, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli, thus representing the most comprehensive collection of termini and their evidences.

As the main systematic annotation effort of protein N- and C-termini data, the knowledgebase TopFIND details a huge amount of evidence for termini derived from four main sources: direct experimental observation of N- and C-termini in terminomics screens (termed here “observed” termini); termini predicted from the biochemical and structural characterization of the protease and its substrates (designated here “cleavage” termini); termini predicted from alternative translation events found by global translation initiation sequencing stored in TISdb (Wan and Qian, 2014); and finally termini predicted from alternatively spliced transcripts curated from sequencing data in Ensembl (Flicek et al., 2014). Without direct evidence of their genesis from proteomics analyses we designate termini in the last three categories as “inferred” termini.
In this present analysis we used TopFIND to compare observed N-termini with inferred N-termini in order to answer crucial questions about the N-terminome, in particular, to identify the position of internal N-termini in proteins and to assess the processes generating the non-canonical N-termini.

3.2 Methods

TopFIND data were extracted from the MySQL dump on November 5th, 2014. Analyses and plotting were performed using R (R Core Team, 2013).

3.3 Results

We analyzed human N-termini observed by terminomics screens and compared these to N-termini inferred from cleavage events, alternative translation, or alternative transcription as annotated in TopFIND, by counting the overlap between the instances of N-termini in each group. Counting N-termini in a position-specific and not an experiment-specific manner, we avoided recounting the same N-terminus multiple times. This does, however, underestimate cleavage events where two or more proteases cut at the same site, as is the case when, for example, cleavage is facilitated by protein structural features such as flexible N- or C-terminal protein sequences, linker regions between domains, or exposed loops on or between domains.

We identified 48,095 observed or inferred N-termini in the human proteome. Focusing on N-termini that do not correspond to the canonical start site of the protein (i.e. carboxyl to position > 3 to account for initiator methionine processing plus one exopeptidase or diaminopeptidase event), we identified 23,915 observed and inferred N-termini as shown in Figure 3.1A.

We then collapsed together N-termini if they are close (distance < 3 amino acids) to avoid recounting the same terminus multiple times, which might be due to proteolytic ragging by exopeptidase activity in vivo. Again, this may underestimate the number of distinct termini due to independent cleavage events by the same or different proteases within the 3-residue cluster.
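The collapsing of nearby N-termini into clusters can be illustrated with a short Python sketch (the original analysis was performed in R). The text specifies only the distance criterion (< 3 amino acids) and the definition of internal termini (position > 3). The chaining of consecutive positions, the classification of a cluster by its first position, and the names `cluster_termini` and `is_internal` are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_termini(termini, max_gap=2):
    """Collapse (protein, position) termini into clusters of nearby positions.

    Consecutive sorted positions within one protein are joined when they are
    less than 3 residues apart (difference <= `max_gap`). Duplicate
    observations of the same position are counted once.
    Returns a list of (protein, [positions]) tuples.
    """
    by_protein = defaultdict(set)
    for protein, position in termini:
        by_protein[protein].add(position)

    clusters = []
    for protein, positions in by_protein.items():
        current = []
        for position in sorted(positions):
            if current and position - current[-1] > max_gap:
                clusters.append((protein, current))
                current = []
            current.append(position)
        clusters.append((protein, current))
    return clusters


def is_internal(positions, last_canonical=3):
    """Assumed rule: a cluster is internal if it starts after position 3."""
    return min(positions) > last_canonical


if __name__ == "__main__":
    # Hypothetical observations, not TopFIND data.
    observed = [("P1", 1), ("P1", 2), ("P1", 40), ("P1", 42),
                ("P1", 45), ("P2", 100), ("P1", 40)]
    clusters = cluster_termini(observed)
    internal = [c for c in clusters if is_internal(c[1])]
    print(len(clusters), "clusters,", len(internal), "internal")
```

With chaining, a run of termini such as 40, 42 and 44 would form a single cluster even though its ends are 4 residues apart. A stricter definition would split such runs and yield more clusters.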
Focusing on observed N-termini, we identified 9,843 N-termini clusters that contain at least one observed N-terminus across the full length of the protein. Notably, 7,409 of these were internal in the protein (again carboxyl to position > 3 with the remaining 2,434 clusters at positions 1-3) and mostly were equally distributed along proteins as shown in Figure 3.1B, with a peak observed at or near the original N-terminus. Thus the clusters mostly did not represent processing of N-terminal methionine with or without further aminopeptidase or diaminopeptidase activity (only 3.5% of 7,409 clusters are between position 3 and 10), under 10% were the consequence of protein maturation to remove signal or transit peptides (9.2% of clusters were between position 9 and 30), and the vast majority (87.2% of clusters beyond position 30) represented the generation of novel shortened proteoforms. A similar distribution but with lower numbers was observed in the original TopFIND publication (Lange and Overall, 2011) and has not changed by the addition of newer datasets. We expect that in general these represent true protein N-termini and not randomly generated peptide fragments, because these peptides were sufficiently stable to accumulate to levels to be reliably identified by mass spectrometry. Moreover, typically, terminomics experimenters report great care to avoid proteolysis at and after sample collection by immediate incorporation of protease inhibitors and maintaining samples on ice or frozen whenever feasible. In summary, our analysis highlights the large percentage of internal termini in the human proteome and shows that examples of new protein starts will be found virtually anywhere in a protein sequence.

3.3.1 The gap between observed and inferred termini

We next investigated how many observed N-termini were explained by inferred termini in TopFIND.
As described above, inferred N-termini are predicted from a reported cleavage, an alternative spliced transcript start site, or an alternative translation start site reported at the position in question. An experimentally observed N-terminus identified at the same position as an inferred terminus was considered to be explained by at least one N-termini generating process. Of the 7,409 internal clusters, only 933 were explained in that they had associated evidence for an inferred terminus (Figure 3.1C). Consequently, for the large majority (87%) of N-termini clusters there was no explanatory N-terminus generating biological process reported to be associated with these termini. Thus huge gaps were found in our current knowledge of N-termini generating processes despite their great impact on the proteome. Further aggravating the situation, we are likely underestimating this gap, because inferred N-termini in the vicinity (+/- 3 amino acids) of the observed N-terminus were taken into account in this analysis. When we only counted inferred N-termini at the precise position of the observed N-termini, we were only able to explain 535 of 8,878 N-termini (6%), with 94% of observed N-termini (8,343) remaining unexplained (Figure 3.1A).

Figure 3.1 Inferred and observed N-termini in the human N-terminome
(A) Overlap between observed (blue) and inferred human N-termini in TopFIND. Inferred N-termini include those that are predicted from knowledge of sites of cleavage, alternative splicing, or alternative translation. N-termini are counted by position, i.e. an N-terminus identified multiple times in different experiments at the same position is only counted once. (B) Position of the 7,409 observed and internal N-termini clusters in proteins relative to the full protein length. (C) Overlap between inferred and observed (blue) N-termini clusters in TopFIND.
The arrow points to the systematic breakdown of the 933 observed and inferred N-termini clusters to the biological processes generating those N-termini.

In the clustered data, most of the 933 explained N-termini clusters (849) fell close to a protease cleavage site (Figure 3.1C). This effect was real and not simply due to annotation biases favoring cleavages. Of 7,141 clusters containing cleavage-inferred N-termini, 849 (11.9%) contained experimentally observed N-termini, as did 62 (1.7%) of 3,590 clusters containing splice site-inferred N-termini and 42 (9.9%) of 425 clusters containing N-termini inferred from alternative translation. Therefore, protein cleavage and alternative translation are the main two mechanisms generating internal protein N-termini annotated to date. Protease activity and alternative translation are also likely candidates to largely explain the remaining 6,476 observed clusters, because the search space for cleavage and alternative translation (Damme et al., 2014; Gawron et al., 2014) remains largely unexplored. Indeed, substrate annotation is only available for about half of all human proteases and even for those proteases, their substrate repertoires remain mostly unexplored as described in chapter 4 (Fortelny et al., 2014). This is partly due to study biases (Gillis et al., 2014), where few proteases are well studied and many ignored, but it also reflects the lack of database annotation of known cleavages. Hence there is an urgent unmet need for the community to upload experimental data to the appropriate databases, including MEROPS (Rawlings et al., 2012) and TopFIND, so that accurate analyses and predictions can be made more reliably.

The above observations hold true when analyzing individual datasets, as shown in Table 3.1 and Figure 3.2. Less than 10% of internal N-termini can be explained in any dataset.
The explained N-termini mostly map to known cleavage sites, except for one dataset where N-termini are analyzed using strong-cation-exchange (SCX) chromatography. In this dataset of 38 internal N-termini, only one N-terminus could be explained by alternative translation and two N-termini were explained by alternative splicing (Table 3.1). While a representative comparison would require larger numbers of identifications, we suspect a bias against cleavage-induced N-termini in SCX data, since SCX focuses on modified N-termini, which is not the case for other terminomics techniques that identify termini by negative selection in an unbiased manner.

Figure 3.2 Fraction of explained internal N-termini in the individual datasets analyzed and the processes identified in each dataset.

Table 3.1 Internal N-termini (position > 3) observed and explained in the individual datasets used in this study.

Source                                         Sample         Method        Alternative   Alternative   Cleavage   Total
                                                                            translation   splicing
Crawford2013 (Crawford et al., 2013)           Cell lines     Subtiligase   9             35            426        7402
Lange2014 (Lange et al., 2014)                 Erythrocytes   N-TAILS       1             5             16         763
Mahrus2008 (Mahrus et al., 2008)               Cell lines     Subtiligase   4             6             103        1228
VanDamme2010 (Van Damme et al., 2010)          Cell lines     COFRADIC      0             0             0          0
Wildes2010 (Wildes and Wells, 2010)            Blood plasma   Subtiligase   0             0             41         532
Bienvenut2012 (Bienvenut et al., 2012)         Cell lines     SCX           1             2             0          38

3.4 Discussion

Our present analysis using TopFIND v3.0 has assessed the state of knowledge of the N-terminome. The pervasiveness of N-terminal truncations throughout the proteome highlights the importance of both seeking and understanding the effect of protein truncations and the processes generating neo-termini. This will benefit both an in-depth understanding of protein and cell biology and the development of targeted therapeutics that avoid clinically relevant drug side effects from unrelated off targets and family member counter targets or anti-targets.
With the complexity of phenotypes of cells and organisms, complex patterns of deep regulation are expected to be at work. However, it remains to be investigated which of the mechanisms of regulation at play have the greatest impact on the phenotype. In addition, truncation of proteins could also be the reason why certain proteins and their peptides are not observed by mass spectrometry. Knowledge of these alternative N-termini could increase the detection of proteins with proteomics screens using archetypical peptides, SRM or antibodies. Indeed, if antibodies or SRM are deployed to target parts of a protein from only the analysis of the sequence, e.g. N- or C-terminal sequence stretches, without considering known cleavage or translation sites, this will lead to systematic false negative results if these regions are present in some conditions but proteolytically removed in others. Therefore, in such cases, the presence and absence of proteins cannot be reliably assessed without experimental knowledge of the population of N- and C-termini of proteins.

In summary, we assessed the amount and genesis of protein truncation observed and found that it has a great impact on the majority of proteins in the proteome. Protein truncation is a special protein modification in that it is irreversible and so sets proteins on a path of no return. As compared to alternative translation and splicing, which are also irreversible, protein processing by proteases has very immediate consequences for the phenotype. Thus, we expect protein cleavage to be a quick response to stimuli and, for secreted proteins, one of the last opportunities for a cell to modify a protein as it leaves the cell's realm of influence.

3.5 Summary

Almost all regulatory processes in biology ultimately lead to or originate from modifications of protein function. However, it is unclear to which extent each mechanism of regulation actually affects proteins and thus phenotypes.
We assessed the extent of N-terminal protein truncation in a global analysis of N-terminomics data and found that most proteins have N-terminally truncated proteoforms. Because N-terminomics analyses do not identify the process generating the identified N-termini, we compared identified termini to the three events that generate N-termini: protein cleavage, alternative translation, and alternative splicing. Of these, we sought to identify the most likely cause of N-terminal protein truncations in the human proteome. We found that protease cleavage and alternative protein translation are the likely cause of most shortened proteoforms. However, the vast majority (about 90%) of N-termini remain unexplained by any of the processes identified to date, revealing large gaps in our knowledge of protein termini and their genesis. Further analysis and annotation of terminomics data is required, to which end we have created the TopFIND database, a major systematic annotation effort for protein termini. We outline the new features of version 3.0 of the database and the new bioinformatics tools available, and we encourage submission of generated data.

Chapter 4: Network analyses reveal pervasive functional regulation between proteases in the human protease web

4.1 Introduction

Proteolysis, the hydrolysis of peptide and isopeptide bonds in protein substrates by proteases (also termed peptidases or proteinases (Barrett et al., 2013)), affects every protein at some point during its lifetime. The outcomes of proteolysis are of two kinds: protein degradation ablates protein function by breakdown to amino acids, whereas proteolytic processing is an irreversible posttranslational modification that precisely produces modified, stable protein chains. The length of this cleavage product is defined by the substrate site specificity of the protease catalyzing the reaction, which can be exquisite.
Processed proteins often have radically altered activity, protein interactions, structure, or cellular location and hence are implicated in many human diseases (Turk, 2006; López-Otín and Bond, 2008; Dufour and Overall, 2013). Recent research has focused on identifying the cleavage products of protease activity in cell culture and in vivo as a means of understanding their biological roles and hence guiding drug target identification and validation (Overall and Blobel, 2007). This need has led to the development of genomics and proteomics approaches that have come to be termed degradomics (McQuibban et al., 2000; López-Otín and Overall, 2002), within which the specialized subfield of terminomics, which identifies N termini (Gevaert et al., 2003; Mahrus et al., 2008; Kleifeld et al., 2010) and C termini (Schilling et al., 2010; Van Damme et al., 2010), has seen rapid recent development. In one such terminomics analysis of murine skin in vivo, ~44% of identified N termini mapped to internal positions in proteins, revealing proteolytic cleavage after translation as part of protein maturation and function (Keller et al., 2013). With ~68% of identified N termini being internal, human erythrocytes have been found to possess an even higher proportion of processed proteins (Lange and Overall, 2013). These recent findings demonstrate that proteolytic processing is a widespread and functionally important posttranslational modification. Proteolytic processing thereby modifies the activity of many more proteins than currently appreciated from conventional shotgun proteomics analyses and biological studies.

As exemplified by N-terminal cleavage of chemokines (McQuibban et al., 2000), the activity of a protein often depends on the exact position and nature of its N and C termini (Lange and Overall, 2013).
Therefore, identifying the termini of proteins is essential for functional insight into protein bioactivity, annotation of proteins in the Human Proteome Project, and drug development (Lange et al., 2014). However, deeper biological insight requires identifying the protease responsible for generating the neo-termini that distinguish cleavage products from the original protein termini. Whereas low- and high-throughput methods to identify the in vitro substrate repertoire of proteases, also known as the substrate degradome (López-Otín and Overall, 2002), are well established, in vivo identification is problematic (Doucet et al., 2008). In vitro experiments can only indicate potential cleavage in vivo because of difficulties in assigning the precise parameters governing cleavage in the actual biological system, such as spatial and temporal colocalization of protease and substrate, presence of inhibitors, zymogen activation, pH, ion concentrations, interaction with nonprotein compounds (Li et al., 2004), as well as O-glycosylation or phosphorylation of the protease or substrate (López-Otín and Hunter, 2010). Thus, posttranslational modifications of proteases, inhibitors, and their substrates add complexity to the dynamic nature of the proteome and cell responses. Consequently, a cleavage observed in vitro might not occur in vivo; that is, “just because it can (in vitro) does not mean it does (in vivo)” (Overall and Blobel, 2007).

In vivo studies, which rely on comparing samples of protease knockout or inhibition to controls, are hampered in particular because the underlying biological system reacts to the removal of a protease or inhibitor in complex and unpredictable ways. For example, a protease knockout can lead to alterations in gene expression profiles of proteases, inhibitors, and substrates (Krüger, 2009; Keller et al., 2013), due to the biological consequences of altered substrate cleavages in vivo, including cleavage of transcription factors (Goulet et al., 2004).
Another factor is the activation of other proteases in the system through increasingly recognized activation cascades of protease zymogens by other proteases and the proteolytic regulation of protease inhibitor activity by nontarget proteases that cleave and inactivate the inhibitor. For example, serpins and cystatins inhibit serine and cysteine proteases, respectively, but when cleaved by an MMP, the inhibitor is inactivated and the protease remains active (Knäuper et al., 1990; Mast et al., 1991b; Dean and Overall, 2007; Keller et al., 2013). Through activating and inactivating cleavages of other proteases and inhibitors, a protease thereby indirectly influences the activity of additional proteases. Such interactions can lead to knock-on effects that alter the cleavage of a range of additional protein substrates that are not direct substrates of the protease. Furthermore, titration of inhibitors upon covalent or tight interaction with one protease can reduce the availability of free inhibitors to regulate other proteases. Consequently, phenotyping protease and inhibitor genetic knockout mice is complicated, which also hampers biological understanding and drug target validation of proteases.

Protease biology is also complex due to the large protease numbers in humans (460) and mice (525), which form the second largest enzyme family after ubiquitin ligases in these organisms (Puente et al., 2003). Moreover, an additional 93 and 103 are predicted to be inactive proteases in human and mouse, respectively, which often can function as dominant negative counterparts (Puente et al., 2003). Protease numbers are almost equally distributed in the intracellular and extracellular environments, and other than some proteases that segue between these two compartments, this distribution partitions and limits their potential interactions with each other.
In an effort to systematically comprehend this complex biology, proteases are grouped by the MEROPS database, which is assembled from biochemical experimental data curated from the literature, into seven classes according to the active site residue catalyzing substrate cleavage, five of which are found in human and mouse, and into clans based on the structure of the active site (Rawlings et al., 2012). Similarly, inhibitors are commonly grouped according to the class of proteases they inhibit, with several inhibitors exhibiting broad inhibitory activity against proteases from more than one class. Interactions between proteases of the same class are well established as part of classically described cascades of proteases such as the complement (Sim et al., 1979; Muller-Eberhard, 1988; Matsushita et al., 2000) and coagulation (Davie and Ratnoff, 1964; Macfarlane, 1964) systems, and more recently recognized cascades such as kallikreins (Pampalakis and Sotiropoulou, 2007) and caspases in apoptosis (Thornberry and Lazebnik, 1998; Lincz, 1998; Turk, 2006; Drag and Salvesen, 2010). However, wide-ranging additional protease interactions have also been proposed to extend more globally and to link networks, forming what was termed the protease web (Overall and Kleifeld, 2006). The protease web was defined as the universe of cleavage and inhibition interactions between proteases and their inhibitors. Evidence from simple systems, such as in vitro biochemical analyses and early in vitro and cell culture degradomics analyses of protease substrates (Tam et al., 2004; Butler and Overall, 2009b; Prudova et al., 2010), and from mRNA transcript analyses in cancer upon administration of protease inhibitors or in tissue inhibitor of metalloproteinase (TIMP) overexpression and knockout studies (Krüger, 2009), has supported the protease web concept well.
Extending terminomics analyses to in vivo situations, for example skin inflammation in wild-type versus Mmp2 knockout mice, has revealed hitherto unsuspected, biologically relevant, and critical connections of MMPs in regulating the complement and coagulation cascades and the plasma kallikrein system, which regulates vessel permeability through bradykinin excision and release from kininogen (Keller et al., 2013).

Such interactions between protease families were shown to create small networks in specific cases (Krüger, 2009; Beaufort et al., 2010; Mason and Joyce, 2011; Keller et al., 2013), but the full extent of the protease web, the fraction of proteases and inhibitors involved, and hence the regulatory potential of this network remain underexplored and underappreciated despite the potentially wide impact on the functional state of proteomes. Furthermore, the protease web is a black box with an unknown mechanism of regulation: it is unclear whether it follows a superstructure of known cascades, where signals are amplified downstream, or forms more of a network, where signals can flow in multiple directions with multiple positive and negative feedback loops (Mason and Joyce, 2011; Overall and Kleifeld, 2006). Similarly, it is unclear which are the main regulatory protein switches controlling subparts of the network. Descriptions of the protease web are difficult to assemble, as many proteases remain poorly studied and characterized. Likewise, many proteases have no described inhibitors, many predicted inhibitors have unknown protease targets, and deorphanization examples are uncommon (Dietzel et al., 2013).

Here, we assessed the global extent and structure of protease interactions computationally. Graph models are used to describe multiple interactions between many elements and have been applied extensively in research on various biological networks.
We represented existing biochemically validated data on protease cleavages and inhibition as annotated in the manually curated database TopFIND (Lange and Overall, 2011) as organism-specific networks. TopFIND stores established biochemical information on substrate cleavage and protease inhibition from MEROPS (Rawlings et al., 2012), the most complete collection of such data, most of it published, and combines it with published high-throughput terminomics and degradomics datasets as well as protein annotations from UniProt (The UniProt Consortium, 2012) for five different organisms. Our analyses revealed a large and pervasive network spanning all known cascades and four of the five protease classes present in human and mouse tissues. The network is highly connected in that, via a few connections, a protease can potentially influence many other proteases, with inhibitors often taking a special role as key connectors in the protease web. We demonstrate the utility of our analysis by applying the network to gain mechanistic in vivo insights into protease web effects, which we then validated in vitro, in cell culture, and in vivo.

4.2 Methods

4.2.1 Protease web data

Tables containing proteases and their substrates (cleavages) and protease inhibitors and their target proteases (inhibitions) as well as tables mapping UniProt IDs to MEROPS IDs and gene names were collected from the TopFIND MySQL database (http://clipserve.clip.ubc.ca/topfind/; downloaded January 15, 2012).

4.2.2 Classifying proteases and inhibitors

Proteases were classified based on their MEROPS IDs in TopFIND. The class specificity of human protease inhibitors was determined by downloading lists of UniProt ACs for the Gene Ontology (Ashburner et al., 2000) annotations cysteine-type (GO:0004869, n = 49 proteins), metallo- (GO:0008191, n = 11 proteins), or serine-type (GO:0004867, n = 95 proteins) endopeptidase inhibitor from neXtProt (Lane et al., 2011) on May 24, 2012.
A term “aspartic-type endopeptidase inhibitor” (GO:0019828) exists, but no proteins are annotated with this term. Inhibitors were labeled “broad” if they are annotated to inhibit more than one class of protease based on (i) their GO terms from neXtProt or (ii) their annotated inhibitions from TopFIND.

4.2.3 Network construction and analysis

The network representation of cleavages and inhibitions was obtained via R (R Core Team, 2013) scripts, relying heavily on the igraph library (Csardi and Nepusz, 2006). Proteins were represented as nodes. Cleavages were represented as directed edges from the protease node to the substrate node. Accordingly, inhibitions were represented as directed edges from the inhibitor to the inhibited protease. Reachability of a node was calculated by counting all proteins to which a shortest path could be found using the shortest.paths function of igraph. Betweenness of nodes was calculated using the betweenness function of the igraph package. Nodes with the highest betweenness were identified iteratively, recalculating betweenness after each node removal. Paths from MMP8 to neutrophil elastase were identified in the network using the get.all.shortest.paths function of the igraph package. Erdős-Rényi networks with the same number of nodes and edges as the original graph were generated using the erdos.renyi.game function of the igraph package, and Barabasi-Albert networks were generated with the barabasi.game function, forcing the same out-degree distribution as the protease web. Edge-shuffled random graphs were generated using the degree.sequence.game function, once keeping out- and in-degree distributions the same so that each node has the same in- and out-degree as in the original network (Shuffled) and once shuffling those distributions before passing them to the method (Shuffled2).
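The graph construction and reachability calculation described above can be sketched as follows. This is a minimal illustration in plain Python rather than the R/igraph scripts used in this work; the protein names and edge lists are hypothetical toy data, and the start node itself is not counted as reachable, which is an assumption about how the count was defined.

```python
from collections import deque

# Hypothetical toy annotations (real data would come from the TopFIND tables).
cleavages = [("P1", "P2"), ("P1", "S1"), ("P2", "S2"), ("P3", "I1")]  # protease -> substrate
inhibitions = [("I1", "P2")]                                          # inhibitor -> protease


def build_graph(cleavages, inhibitions):
    """Directed adjacency sets: an edge X -> Y means X cleaves or inhibits Y."""
    graph = {}
    for source, target in list(cleavages) + list(inhibitions):
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # make sure pure substrates are nodes too
    return graph


def reachable(graph, start):
    """All nodes that can be reached from start along directed edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph[node]:
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    seen.discard(start)  # assumption: a node does not count as reaching itself
    return seen


graph = build_graph(cleavages, inhibitions)
reachability = {node: len(reachable(graph, node)) for node in graph}

for node in sorted(reachability):
    print(node, reachability[node])
```

In this toy network, P3 has only one direct target (the inhibitor I1) but reaches three proteins, because I1 inhibits P2 and P2 cleaves S2.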
Inverse empirical cumulative distribution functions were calculated and plotted using an inverted version of the empirical cumulative function “ecdf” in R. The area under the curve (AUC) was calculated by calling the integrate function in R on the cumulative function.

4.2.4 Mapping mouse to human proteins

Mouse and human networks were compared by identifying connections that occur between homologous proteins. The homology mapping between UniProt ACs of the two species was performed by mapping UniProt ACs to Ensembl protein IDs via the Ensembl database of the biomaRt package (Durinck et al., 2009) in R obtained from Bioconductor (Gentleman et al., 2004). The homology mapping between Ensembl protein IDs was performed using the InParanoid (Östlund et al., 2010) database via the hom.Hs.inp.db (Carlson and Pages) package in R/Bioconductor.

4.2.5 Network figures

Network figures were plotted using Cytoscape 2.8.3 (Smoot et al., 2011).

4.2.6 Involvement of proteases and inhibitors in biological processes

Proteins involved in selected, protease-specific biological processes were identified by obtaining Gene Ontology (Ashburner et al., 2000) annotation of proteins using the org.Hs.eg.db package (Pages et al.) in R/Bioconductor on August 8, 2013.

4.2.7 In vivo N-terminomics data of murine skin

N-terminal cleavage sites in normal and inflamed murine skin were obtained from Supplementary Table S8 of Keller et al. (2013).

4.2.8 Analysis of protease and inhibitor expression in 23 human tissues

Protease and inhibitor expression profiles were obtained by analyzing commercially available RNAs from 23 different healthy human tissues on the protease- and inhibitor-specific oligonucleotide-based CLIP-CHIP microarray (Kappelhoff et al., 2010).
Data from 84 CLIP-CHIP microarrays representing biological and technical replicates of antisense RNA of these tissues were used, and average signal intensity values (A-Value) of each gene were combined. An expression cutoff was determined at an A-Value of 7.5, where 95% of the intensities of the negative oligonucleotide probes on the microarray were below this cutoff (data are available at http://clipserve.clip.ubc.ca/supplements/protease-web).

4.2.9 Chemokines, proteinases, and inhibitors

All chemokines were synthesized using tBoc (tertiary butyloxycarbonyl) solid phase chemistry as described previously (Clark-Lewis et al., 1997). Recombinant human and murine MMP8 were expressed and purified as described previously (Pelman et al., 2005). Human neutrophil elastase and cathepsin G were purchased from Elastin Products Company and Calbiochem, respectively. Murine neutrophil elastase was kindly provided by Dr. Dieter Jenne (Max Planck Institute of Neurobiology, Martinsried). The 2-aminoethyl benzenesulfonyl fluoride hydrochloride and α1-proteinase inhibitor were from Sigma, and SLPI was from ICN Biomedicals. The synthetic neutrophil elastase inhibitor GW311616 was from Tocris Bioscience.

4.2.10 Animals

Mice deficient in MMP8 on a C57BL6/J × 129 S background were provided by Dr. S. Shapiro (Boston, MA). Animal breeding and experimental procedures were approved by the Animal Care Committee of the University of British Columbia. Mice 6 to 8 weeks old, segregated according to sex, were used for all experiments.

4.2.11 Neutrophil isolation

Murine neutrophils were isolated from bone marrow by flushing of fibulas and tibias. Neutrophils were separated on a density gradient comprised of Histopaque 1077 layered on top of Histopaque 1119 according to the manufacturer’s instructions (Sigma) followed by washing with Hanks Balanced Salt Solution. Neutrophil purity and viability were consistently determined to be >90%.
Neutrophils were activated with 50 nM phorbol 12-myristate 13-acetate (Sigma), unless indicated otherwise. Neutrophils (1 × 10^6 cells) were incubated with 10 mg LIX for up to 4 h in Dulbecco’s Modified Eagle Medium at 37 °C. Inhibitors were preincubated with cells for 30 min at 37 °C prior to the addition of chemokine. Cells were removed by centrifugation (500 × g, 5 min) at the desired time points, and supernatants were analyzed as described below by MALDI-TOF mass spectrometry and Tris-Tricine SDS-PAGE.

4.2.12 LIX cleavage assays

Analysis of substrate cleavage by isolated proteases was performed at enzyme/chemokine (E:S) ratios from 1:10,000 up to 1:50 (mol:mol) for 16 h at 37 °C in assay buffer (50 mM Tris, 200 mM NaCl, 5 mM CaCl2, pH 7.4). MMP8 was activated by 1 mM 4-aminophenylmercuric acetate (Sigma). Digests were spotted on MALDI target plates with sinapinic acid for MALDI-TOF analysis or terminated by adding SDS-PAGE sample buffer. Reaction products were analyzed by 15% Tris-Tricine SDS-PAGE and silver stained. Specificity constants (kcat/KM) of cleavage were determined by densitometry as described previously (Cox et al., 2008). The mass-to-charge ratios (m/z) with +1 ionization ([M+H]+) were determined on a Voyager-DE STR Biospectrometry Workstation (ABI). Mass spectrometry data were deconvoluted to identify the substrate cleavage sites. Molecular weight prediction was obtained using the “Compute pI/Mw tool” on expasy.org.

4.3 Results

4.3.1 Protease web data

Functional protease interactions comprising cleavage and inhibition events influence the in vivo cleavage of substrates in many ways. Cleavage of a substrate by a protease is a direct event, and as shown in Figure 4.1, by cleaving other proteases and protease inhibitors, one protease can activate, inactivate, or alter the activity of a second protease, thereby indirectly influencing the cleavage of substrates of another protease.
To assess the global extent of such effects, we represented protease interactions as a graph, connecting proteases and protease inhibitors to their established substrates and protease targets, respectively. The resulting graph contains nodes, which are proteins, and edges, which represent cleavages or inhibitions. Edges link proteases to their substrates and protease inhibitors to their target proteases. Therefore, edges are directed: an edge from protein X to protein Y signifies cleavage or inhibition of Y by X but does not contain information about cleavage or inhibition of X by Y. In graph theory, the latter would require another edge with the opposite directionality. Figure 4.1 outlines functional protease interactions and how they are represented in small graph models, which were then aggregated to represent the full complexity of the protease web based on curated biochemical data as described below. As input to our analysis of the protease network, we used the TopFIND v2.0 knowledgebase (Lange et al., 2011) to retrieve validated cleavage and inhibition data mostly annotated from published experiments. TopFIND contained 4,774 cleavages for Homo sapiens, 3,679 for Mus musculus, 426 for Escherichia coli, 190 for yeast, and 43 for Arabidopsis thaliana. Due to the low number of cleavages annotated for other organisms, we focused our analysis on human and mouse. Only proteins performing an annotated cleavage or inhibition were added, and then these were connected via edges representing the biochemical reactions as explained in Figure 4.1.

Figure 4.1 Biochemical protease interactions represented by graph theory.
Proteases influence cleavage of substrates both directly and also indirectly through cleavage of other proteases and inhibitors. Protease interactions as represented biochemically (left) and by graph theory (right). Proteases are green or blue, inhibitors are red, and other substrates are grey.
Examples of protease interactions (cleavage and inhibition events) are outlined on the left: (i) In the simplest case, a protease directly cleaves a substrate, as indicated by the presence of proteolytic activity, with no further interactions. A protease can also indirectly influence cleavages by cleaving another protease for (ii) zymogen activation (Kassell and Kay, 1973), (iii) catalytic domain removal, or (iv) exosite domain removal (Rice and Banda, 1995). This will increase (ii), decrease (iii), or alter (iv) (Li et al., 1995) the activity of the affected protease and thereby influence the cleavage of its substrates. (v) If a protease inhibitor is present, the protease does not cleave substrates. (vi) An inhibitor can be cleaved and inactivated by another protease (Yang et al., 2003), which leads to increased cleavage of substrates by its cognate protease. Proteases also compensate for loss of function of other proteases or complement their activity by (vii) cleaving the same substrate at the same site or (viii) substrate cleavage by one protease can depend on prior cleavage by another protease at a different site. By graph theory of protease interactions (right), all proteins are nodes. Proteases (P) are represented as green or blue circles, inhibitors (I) as red diamond shapes, and substrates (S) are grey squares or rectangles. An edge from protein A to protein B signifies a direct regulatory influence from A on B. Such a regulatory effect could either be a cleavage or inhibition, resulting in higher, lower, altered, or unchanged activity of the target.

These networks extend the protease web, which contains only proteases and inhibitors, by also including all other substrates of proteases, and hence represent the annotated functional proteolytic interactions between the substrates in the proteome and the protease web.
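As an illustration of how an indirect effect such as case (vi) of Figure 4.1 appears in such a network, the following sketch stores each edge together with its type and searches for the shortest directed path from a protease to a substrate it does not cleave itself. It is written in plain Python with hypothetical node names (P1, I1, P2, S1, S2) and is not the igraph-based implementation used for the analyses in this chapter.

```python
from collections import deque

# Hypothetical edges mirroring Figure 4.1: (source, target, type of interaction).
edges = [
    ("P1", "I1", "cleavage"),    # case (vi): protease P1 cleaves inhibitor I1
    ("I1", "P2", "inhibition"),  # I1 inhibits its cognate protease P2
    ("P2", "S1", "cleavage"),    # P2 cleaves substrate S1
    ("P1", "S2", "cleavage"),    # case (i): S2 is a direct substrate of P1
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns a list of (source, target, type) or None."""
    outgoing = {}
    for source, target, kind in edges:
        outgoing.setdefault(source, []).append((target, kind))

    previous = {start: None}  # node -> (predecessor, edge type) on a shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for target, kind in outgoing.get(node, []):
            if target not in previous:
                previous[target] = (node, kind)
                queue.append(target)

    if goal not in previous:
        return None

    # Walk back from the goal to the start to recover the path.
    path = []
    node = goal
    while previous[node] is not None:
        source, kind = previous[node]
        path.append((source, node, kind))
        node = source
    path.reverse()
    return path


path = shortest_path(edges, "P1", "S1")
for source, target, kind in path:
    print(f"{source} -[{kind}]-> {target}")
```

Because edges are directed, no path exists in the opposite direction: `shortest_path(edges, "S1", "P1")` returns `None`.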
The human and murine networks (with 1,230 and 1,393 nodes, respectively) are shown in Figure 4.2 and available for download as a Cytoscape file, gml file, and R objects at www.chibi.ubc.ca/ProteaseWeb and http://clipserve.clip.ubc.ca/supplements/protease-web.

The human and murine proteolytic networks show that the majority of proteins are connected and only very few are in unconnected components. Thus, in both networks, the Largest Connected Component (i.e., the biggest group of nodes directly or indirectly connected) encompasses the vast majority of these proteins: 1,183 of 1,230 (96%) in human and 1,377 of 1,393 (99%) in mouse (Table 4.1). This remarkable connectivity is particularly surprising given the incompleteness of annotation currently available in the databases. Indeed, Table 4.1 shows that of 460 human proteases, only 244 (53%) have one or more known and annotated substrates. In mouse this number is even lower, with only 88 of 525 (17%) proteases having a substrate annotated. Furthermore, even the data on these proteases are incomplete and biased, with most substrates assigned to few, well-studied proteases. Figure 4.3 shows the out-degree (i.e., the sum of cleavages catalyzed by a protease or the sum of inhibitions caused by a protease inhibitor) for proteases and inhibitors having any annotated cleavage or inhibition, respectively. Although few proteases have a large known substrate repertoire (higher out-degree), most proteases have very few known substrates.

Figure 4.2 Protease networks in mouse and human.
Networks of all proteases (green circles), protease inhibitors (red diamonds), and protease substrates (grey squares), which take part in any cleavage or inhibition reaction annotated in MEROPS/TopFIND. Networks are shown for human (A) and mouse (B). To resolve individual nodes and edges, click to zoom. Proteins are designated by their UniProt gene names.
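The Largest Connected Component and the percentages quoted above can be illustrated with a short sketch. It is written in plain Python with a hypothetical toy edge list, and it assumes that "directly or indirectly connected" ignores edge direction (weak connectivity), which is an interpretation rather than a detail stated in the text.

```python
def largest_component(edges):
    """Largest group of nodes that are connected when edge direction is ignored."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collecting everything connected to start.
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    stack.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy network: P3 -> S2 forms a small separate component.
toy_edges = [("P1", "S1"), ("P1", "P2"), ("I1", "P2"), ("P3", "S2")]
lcc = largest_component(toy_edges)


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts quoted in the text (Table 4.1).
human_lcc = percent(1183, 1230)   # nodes in the largest component, human
mouse_lcc = percent(1377, 1393)   # nodes in the largest component, mouse
human_annotated = percent(244, 460)  # proteases with an annotated substrate, human
mouse_annotated = percent(88, 525)   # proteases with an annotated substrate, mouse

print(sorted(lcc), human_lcc, mouse_lcc, human_annotated, mouse_annotated)
```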
[Figure 4.2, panels A (human) and B (mouse): node labels (gene names) extracted from the network images are omitted here.]
Srp19Cbx3Rpl12Anxa6GapdhStip1Hsp90aa1AticErp29Ranbp1Fhod1Hpcal1Hebp1Cmpk1Tgm2Aco2Commd9DstnCcsMybbp1aActr2NagaClasp1Zc3h4Cwc15Gyg1Eif3jRpl10Grb2CltaPrdx5ActbSt13Tpi1PpiaGzmbRpl11Lrrfip1Pa2g4Rps7H2afzEno1Pabpc1Rps3aEif5aRpl35Uimc1Hdgfrp2 Fkbp5RmpSrrm1CltbRps15Sh3glb1Pfn1Idh1S100a11Rprd1bUbxn6Eif4ePsmd8EsdSerpinb1cTbcbArhgdibD10Jhu81eAbcf1Arpc5lLtv1Sbf1Acp1ParvgMap4KhsrpRpl3Rps4xRpl6Psmc2Hk2Gstm1Thoc4LplH1f0Gpx1Pcbp1Snx1Actn4YwhaqFkbp4AldoaS100a8Eif2s1Bin1SparcPrl2c2Ddi2S100a10Kifc2Api5Tmsb4xDhx15Prl2c4Rpl35aEno2Pgam2Dok2Map1bPsmb10Psmd3Adh1H2-D1Bcat2Stag2 Arhgap6Lamtor3Bcat1Tuba1aEno3Dync1i2RdxAgtpbp1Becn1Snx12Cnn3Ehd4Dkc1Clstn1Copb1Exosc8Erap1Gtf2iMapk8ip3GhrlIns2Ins1GastPamPomcPcsk1CartptGhrhDkk3GcgDazap2Copg2Nrd1Snx5CoasyEif3iRnpepCst11Txlng Gtf3c1Iws1VprbpRrp9Larp1Hexim1Luc7l2PnocTpp1Tac1Pcsk2Tac3PmchPsmb6TxnSnap91Cpa4Casp8Parp2Eef1bCpa1LxnCpb2Cpa2PenkChgaCpa3Pdyn NtsRps21Dpysl2Lasp1Bag6GarsArhgdiaPpp2r4PtmsSarnpCacybpArpc3 Macf1 Dnaja2Eny2Tubb4ChgbPcsk1nPpanOgfrCckScg2Abcf2Ddx50Psma6Pgk2Cct7CapgEtfbEif2s3xHnrnph1CycsPpp2r1aHspa8Pgam1SmapSkp1Necap2Nubp1Coro7Rpap3Hdac2Eif4hCtscTln1GzmaAtp5a1Dclk1Ifih1Serbp1Hnrnph2Sub1Prpf38aMre11aCalrNipblUtp18Rpl4YwhahIrf2bplB2mPsma5Rbm10Tagln2SnrpbP4hbSlkActg1Hspd1NfiaRpl18Msh6Ptpn18PfkmAtxn2lCpsf7Vps72Rps15aPolr2iGsnBub3HnrnpmDnpepPpil1Yif1bLsm12ProscDdx39bCdc5lEif4g2Zcchc8NAUbe2oPdap1Cpsf6Eif4g1Anp32aCfl2Hk1Hist1h1eHmbsAss1LmnaIgfbp6Rpl13aApex1Map2k1 PfklGlulDcnTcea1GrnLdhcPnnRnf4Casp1Hmgn1Itpr1Mdm2Supt5hFgfr1Ptpn12 Apbb1ipPla2g4aAp2a2Sdc4VimTom1Mcm3Smg7Fbp3Metap1dCyctSlc25a5HdgfTimp4Mmp13Adamts5Cxcl11Mmp12Cxcl5Adam28Adam12Mmp19Defa4Defa2Adam8Dll1Klk1NptxrChl1EgfAdam10Klk1b1Mmp17Mmp15Timp3Adamts4Mmp16Mmp14Mmp8Mmp25Sparcl1Kng1CtsfMmp11Spock1Timp2Timp1 Adamts2MbpCol18a1Sept7PglsTimm8a1 Cct2Dctn2Ostf1NacaMsnSnx3Nmt1Cfl1Rgs10BlvrbBlvraChmp1aSugt1Coro1a Myh11TarsDlstPrdx1NANaspHnrnpdDync1li2Prdx2NADapSlc9a3r1Mthfd1Hist1h1aPark7Fam129bTxndc12CstbCst3Hist1h3iItsn1Cst7Uchl5Fxr2Actl6aNoc2lCol3a1Ctbp2Crybb1 
Igfbp4Chi3l3Twf1VcpMmp9CttnCol1a2CtsbCtshStk3Cpa6Nop56Wdr55CtssArl6ip4BidMphosph6Serpinb13Serpina3fCtla2bCdca5CtskCd74Ctla2aLgals1FlnaItgb2Col1a1Ctsl1 Klk7 Spink6Sept5HnrpdlDnm1Casp14AldocSart3LgmnFam111aAkap8Mif PldnBanf1Pgk1Pole3Eef1a1Hspa4Sept2Pdia4OtcSrp14MipepGot2PmpcbRpsaUqcrhAtp5dCamk2aSlc25a3Hsp90ab1Atp6v1b2NudcHdlbpSf1Fkbp1aHnrnpa1Oxsr1NarsFliiDld Rrbp1Ccdc80Ddx1Eif3d HypkNAPtrfEpb41l2 IkbkbUgdhNcf2 Ppt1 Lin7bIpo5Hspa5Rpl22Eif3fSyncripFermt2Prrc2aMat2aCtnna1Adssl1Col5a2PrkcdIgbp1PxdnDbiPnpLsp1Kiaa1704Phldb1Selenbp1FasnMmp20Klk1b9St6gal1Lamc2Adam15Scn4bBace1Scn3bEhd1Ak3Bmp1GalmEpb41l1Eif2b5Rfc2C1rbGspt1Cdc73Rps6ka4Adam17NgfrHbegfAregBtcInsl6Scn1bFurinVtnCol23a1NgfScnn1aAmbpGdf11Sema3aScn2bImpa2Suclg2Serpinb8Rpa1 ParnAtp6ap1H2afjTll1Tll2Dmp1MstnChrdGldnOgnCtsePmm2Mki67ipGbp2Rpl9Dido1Dctn6Whsc1Edc3Cd2apAldh9a1Pfdn5Adrm1FdpsMmp2Clic4Snap29Dok3Anxa4MvpCtsdPsmb3CapzbEprsGsrEvlEcdDnajc9Rrs1Ybx1Ptbp1Otub1Gtpbp4LpxnRsl24d1Tuba1bPno1Rpl7aRps6ka3Srsf5Rps14Hmgb3Atox1Osbpl9Eif4a2Anxa3Psmd4Capns1Bola2Diaph1 Sae1Cops5Mta2TardbpFen1Capza1RanAldh2Mdh2Rad23aPsma1Dnajc8Rps5ZyxS100a4WarsWasf2Mobkl1bSp1Ssb100a6Il16PurbTcp1Cct6aHspa1lKif5bCoro1bAgtPsen1Ren2KitlgEphb2Notch1Scg5Coro1cAk2Rap1bAdam9PfkpAdam19Psat1 Basp1Tnfsf11Celf2Znf622Mtmr9PtprkAdam33Dr1Egfbp2Klk1b22Ripk3 GdaHsbp1Klk1b26Cotl1Il1bNotch4Sept6L1camVps24Pcsk5Cct5Rps17Glrx3Hint1Wdr1Rtn2YwhazRanbp3Rbm3Fam21Rps19Chmp4b Trmt112Polr2lHnrnpul2 Rab7aPsme1Plin3Hist4h4Timm13PyglCalm3Anxa1Nup50Ptpn11Akr1a1Atp6v1dHnrnpa2b1Psma2 Anxa2AclyCct4Pin1Pfdn2Hist1h1cCtsaHnrnpabNASarsZswim4Qki Gdi2HnrnpkKiaa1598Rufy1 Rpl23Psmd1Psma7Nhlrc2HgsHadhGsto1H2afvMtpnPrdx6Snrpd1Atp6v1e1Rpl30Dctn1Fth1BA  92 Although this could be due to high substrate specificity, it is more likely that these proteases simply received less attention in studies dedicated to discover substrate repertoires. 
This effect is especially pronounced for the mouse data, where 80% of total cleavages (2,938 of 3,679) are assigned to three proteases, namely cathepsin D (UniProt: P18242), cathepsin E (UniProt: P70269), and MMP 2 (UniProt: P33434), and are mostly derived from high-throughput proteomics screens. Accordingly, the annotations differ strongly between human and mouse. Although the human and mouse networks are of similar size (1,230 and 1,393 nodes, respectively), they overlap minimally: only 126 of 3,852 connections in mouse (3.3%) are reflected in 122 of 4,905 human connections (2.5%). However, we suggest that the small overlap is mostly due to differences in the state of data annotation between the networks rather than to actual differences in the evolution of these networks.

The human data are further biased in that proteases and inhibitors are strongly overrepresented as substrates themselves (Figure 4.4). Strong representation of protease–protease cleavages is expected because many proteases are synthesized as zymogens that require proteolytic cleavage by other proteases for activation. Indeed, this strong enrichment is found in the human TopFIND/MEROPS data, but less so in mouse. We compared these values to a terminomics data set of cleavages in mouse skin (Keller et al., 2013), which more accurately reflects reality because terminomics analyzes N termini in an unbiased fashion. However, in this in vivo data set, inhibitors, and not proteases, were overrepresented as processed proteins, indicating that the overrepresentation of proteases as cleavage substrates in the human in vitro database is likely exaggerated.

Figure 4.3 Annotation biases in protease substrate identification.
Out-degree of protease and inhibitor proteins with an out-degree of 1 or greater in the human and mouse data. Out-degree is the sum of cleavages catalyzed by a protease or inhibitions caused by a protease inhibitor. Proteins (nodes) are sorted by their out-degree.
Human values are in red; mouse values are in blue.

Table 4.1 Human and mouse proteolytic networks created from all annotated proteases, inhibitors, and substrates.

Organism   Total nodes   Largest connected component   Proteases with substrates (a)   Inhibitors with target proteases (a)   Edges
Human      1,230         1,183                         244                             41                                     4,905
Mouse      1,393         1,377                         88                              47                                     3,852

(a) Protease and inhibitor numbers refer to those with at least one annotated cleavage or inhibition, respectively, in MEROPS and TopFIND.

The observed data biases likely result from the nature of biochemical studies, in which many substrates were identified for a few “interesting” proteases (target bias) and “interesting” proteins are more likely to be tested as substrates (substrate bias). Substrate bias is especially pronounced for proteases themselves, which are preferentially tested as substrates in zymogen activation studies. With the advent of degradomics, which uses proteomics methods dedicated to substrate discovery, we anticipate both an increase in target bias, with many substrates identified for a few proteases, and a decrease in substrate bias, because any protein can be identified as a substrate without prior selection of interesting candidates. Therefore, the annotated cleavages represent a biased fraction of the biochemically possible cleavages in the organism, alongside an unknown number of as yet uncharacterized cleavages. On these grounds, the high connectivity in both the mouse and human networks is even more noteworthy because future information can only further increase connectivity.
The extensive interactions observed between proteases and inhibitors are characterized further in the following sections.

Figure 4.4 Human proteases are overrepresented as substrates.
Percentage of proteases and inhibitors that are known substrates. The percentages of all UniProt/Swiss-Prot proteins with an annotated MEROPS ID indicating they are proteases or inhibitors are shown as “theoretical.” “TopFIND” refers to the percentage of all substrates in the TopFIND database that are proteases or inhibitors. The percentage of proteases or inhibitors (proteins with a MEROPS ID) among all internal neo-N termini in a recent TAILS analysis of murine skin (Keller et al., 2013) is referred to as “murine TAILS data.”

4.3.2 Protease web structure

Proteolytic signaling pathways contain major upstream regulators or initiation factors whose proteolytic activity leads to the cleavage of downstream proteases, which in turn activate factors further downstream that finally cleave and activate the effector molecules at the end of the pathway. Activation cascades are a special case of proteolytic pathways in which signal amplification generates large quantities of the end protein products in seconds, as classically described for coagulation (Davie and Ratnoff, 1964; Macfarlane, 1964). To investigate whether the connections in the overarching protease web follow such a pathway or cascade (hierarchical) structure, we used a graph measure termed reachability.
Reachability of node X denotes the number of nodes Y for which there is a path from X to Y in the network; X itself is counted, so a node without outgoing edges has a reachability of 1. A path is a sequence of directed edges connecting X and Y that follows the directionality of edges in the network. The path from X to Y can therefore differ from the path from Y to X, and the existence of one does not guarantee the existence of the other. In the protease web, reachability corresponds to the number of proteins that can be influenced by one protease or inhibitor. Figure 4.5A outlines reachability values of nodes in three theoretical examples: (i) an unconnected network (single), (ii) a strongly connected network (circle), and (iii) a cascade-like network (cascade). Figure 4.5B shows the respective distributions of reachability values for these three examples.

We next compared the theoretical reachability distributions with the distributions observed in our human and mouse protease networks. To specifically describe the selective connectivity between proteases and inhibitors, which form the protease web, we excluded from further analysis all simple substrates (nonprotease and noninhibitor proteins), whose reachability in the network is 1 by definition. Table 4.2 summarizes the resulting protease webs of human (340) and mouse (220) proteins that have annotated cleavages or inhibitions. In each protease web, we further identified one dominant “largest connected component” comprising 255 proteins for human and 187 proteins for mouse. Figure 4.5C compares the distribution of reachability scores in the largest connected component in mouse (blue curve) and human (red curve). In mouse, reachability indicates a cascade-like, hierarchical network, where most nodes have a very low reachability and progressively fewer nodes have higher reachability. In contrast, the reachability distribution of the human network is strongly bimodal: 158 nodes (62%) reach 153 or more nodes (60%).
This very high reachability is most similar to that of the circle graph in Figure 4.5B, where any node can reach any other node. For a biological system, this implies that 158 proteases or inhibitors have the potential to regulate the activity of 153 or more proteases and inhibitors in the network. In other words, one or more directed paths exist between 24,166 ordered pairs of proteases and inhibitors in the human protease web, which is 37% of all 64,770 (255 × 254) possible directed connections between pairs of the 255 proteins. The number of connected pairs rises to 141,523 when substrates are added (network with 1,230 nodes). This highlights the high degree of connectivity between proteases and inhibitors.

Figure 4.5 Reachability in network examples and the human and murine protease webs.
Connectivity in the protease networks as measured by the reachability distribution of nodes in the network. (A) Reachability in three theoretical model networks: In an unconnected network without edges, each node has a reachability of 1. In a strongly connected network, where each node can reach every other node, the reachability of each node equals the total number of nodes.
In a hierarchical, cascade-like network (apoptosis cascade taken from KEGG (Kanehisa et al., 2009)), reachability values are high for upstream regulators and decrease as one descends the cascade toward the downstream effector proteins. For each protein, the corresponding reachability value is shown on the right. Proteases are represented as green circles and inhibitors as red diamonds. Edges are cleavages (green, with arrow head) and inhibitions (red, with “T” head). Although these two types of edges have biologically distinct interpretations, their implication for the graph model and for reachability is identical. (B) Reachability values of nodes in the theoretical hierarchical cascade (cascade), unconnected (single), and strongly connected (circle) networks shown in (A). Reachability is plotted as an inverse cumulative function: the percentage of nodes that can reach a given minimum number of nodes in the corresponding network. (C) Inverse cumulative function of reachability values of the largest connected components of the human protease web (255 nodes, red line) and the mouse protease web (187 nodes, blue line), plotted as the percentage of nodes that can reach a given minimum percentage of nodes in the corresponding network. (D) Histogram of the path lengths of all shortest paths in the human network, comprising a total of 24,166 paths.

Table 4.2 Proteins comprising the human and mouse protease webs.

                                      Protease web (a)    Largest connected component
Organism   Proteins with MEROPS ID    Nodes    Edges      Nodes   Proteases (b)   Inhibitors (b)   Edges
Human      755                        340      1,264      255     215             40               1,238
Mouse      696                        220      415        187     141             46               404

(a) Only nodes that have a MEROPS ID and are part of a cleavage or inhibition are in the protease web.
(b) Proteases and inhibitors are assigned based on the MEROPS IDs of the proteins.
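The reachability measure used in Figure 4.5 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the thesis: it assumes that each node counts itself (so that a node without outgoing edges has a reachability of 1, as in panel A), treats cleavages and inhibitions alike as directed edges, and runs on small hypothetical toy networks. The function names `reachability` and `inverse_cumulative` are chosen here for illustration only.

```python
from collections import deque


def reachability(edges, nodes=()):
    """Map each node to the number of nodes it can reach, itself included."""
    successors = {node: set() for node in nodes}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())
    result = {}
    for start in successors:
        # Breadth-first search that follows the direction of the edges.
        seen = {start}
        queue = deque([start])
        while queue:
            for neighbor in successors[queue.popleft()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        result[start] = len(seen)
    return result


def inverse_cumulative(values):
    """Percentage of nodes with reachability >= k, for k = 1..max(values)."""
    values = list(values)
    return {k: 100.0 * sum(v >= k for v in values) / len(values)
            for k in range(1, max(values) + 1)}


# Hypothetical toy networks (not the thesis data).
single = reachability([], nodes="ABCD")  # four nodes, no edges
circle = reachability([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
cascade = reachability([("A", "B"), ("A", "C"), ("B", "D"),
                        ("C", "D"), ("D", "E")])

print(single)   # every node reaches only itself
print(circle)   # every node reaches all four nodes
print(cascade)  # values decrease from the top of the cascade downward
print(inverse_cumulative(cascade.values()))
```

In the toy cascade, the upstream node A reaches all five nodes, whereas the terminal node E reaches only itself; in the circle, every node has the maximum reachability, which is the pattern that the human network most resembles.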
Reachability between nodes does not take path length into account and so might result from very long, and hence biologically irrelevant, paths in the network. However, this possibility can be excluded because most paths have a length of just four (Figure 4.5D). The lack of connectivity in the mouse network is not surprising given the small overlap between the two networks. We assume that this difference is due to data biases rather than a real biological difference, and accordingly we focused on characterizing the extensive and more complete human network.

High connectivity in the human protease web is due to a strongly connected component (87 nodes), a subgroup of nodes within the largest connected component that can all reach each other directly or indirectly and hence have the same reachability value of 153 (Table 4.4). We visualized this effect in Figure 4.6, where nodes of the human protease web are shown separated by their reachability. Upstream of the strongly connected component are 71 nodes with reachability higher than 153; these nodes can reach the strongly connected component but cannot be reached from it. Downstream are 97 nodes with reachability smaller than 7, which cannot reach the strongly connected component. The nodes in Figure 4.6 are also colored according to their centrality in the network, as measured by node betweenness (Freeman, 1977). Betweenness is calculated by first finding the shortest paths (as explained above) between all 64,770 pairs of nodes in the network and then counting the number of times a node appears in these paths. Notably, all nodes with high betweenness are found in the strongly connected component; these nodes tether the network together. Nodes with high betweenness or reachability are listed in Table 4.3.

Table 4.3 List of nodes with highest reachability and betweenness in the network.
Gene name   MEROPS ID   Reachability   Betweenness   Out-degree   In-degree
FURIN       S08.071     162            0             22           3
CST6        I25.006     157            0             4            0
PIGK        C13.005     156            0             2            0
TMPRSS15    S01.156     155            0             2            0
HTRA2       S01.278     155            0             9            4
MMP11       M10.007     155            43            7            3
PCSK5       S08.076     155            0             4            0
PCSK7       S08.077     155            0             1            0
CTSL3       I29.001     155            0             3            0
CSTA        I25.001     155            0             5            0
PLG         S01.233     153            7939          40           27
A2M         I39.001     153            6980          27           16
CTSL1       C01.032     153            5215          29           13
APP         I02.015     153            4920          5            102
SERPINA1    I04.001     153            4400          10           25
KNG1        I25.016     153            4183          1            55
KLK4        S01.251     153            3770          32           6
CASP3       C14.003     153            2529          24           13
ELANE       S01.131     153            2290          40           6
F2          S01.217     153            2004          21           12

Table 4.4 Reachability values of nodes in the human protease web.

Number of nodes   Reachability
97                <7
87                153
71                >153

Figure 4.6 The largest connected component of the human protease web.
The core of the human protease web comprises 255 connected proteases and inhibitors that form the largest connected component. Proteins are designated by their UniProt gene names. Proteases are circles and inhibitors are diamonds. Nodes are color-shaded according to their betweenness. All nodes are positioned from top to bottom by decreasing reachability, which is indicated by the depth of shade of the green background. Edges are cleavages (with arrow head) or inhibitions (“T” head). Nodes of known protease cascades are labeled and marked by dashed circles.

Figure 4.6 shows that our network data from MEROPS/TopFIND contain all the known proteolytic pathways (e.g., coagulation, complement system, apoptosis, and kallikreins) as they were discovered, published, and annotated previously in MEROPS (detailed in Figure 4.7). In addition, these proteolytic pathways are extended by connections that link known pathways with other pathways and with additional proteases. Details of these connections can be found in Figure 4.8A, which shows separated protease groups in the strongly connected component after removing inhibitors.
Figures 4.6, 4.7, and 4.8A show that the observed connectivity in the protease web is caused by the concerted action of defined protease cascades and key protease inhibitors: alpha-2-macroglobulin (A2M, UniProt: P01023), amyloid precursor protein (APP, UniProt: P05067), kininogen 1 (KNG1, UniProt: P01042), and alpha-1-antitrypsin, also known as serpin A1 (SERPINA1, UniProt: P01009). Whereas intragroup connections are pervasive as expected, intergroup connections are also considerable, in particular between coagulation factors and kallikreins or MMPs, but also including cathepsins and caspases. These findings are confirmed in Figure 4.8B, which shows that connections among four of the five classes of proteases and protease inhibitors in human are extensive. Importantly, Figure 4.8B also shows proteases frequently cleaving inhibitors of other protease classes, an important regulatory aspect of protease activity. Only threonine proteases, which are found exclusively in large specialized cell organelles termed the proteasome and immunoproteasome, remain isolated from connections with other proteases and inhibitors according to current data.

Figure 4.7 New connections in known proteolytic pathways.
(A) Coagulation, (B) complement system, (C) apoptosis, and (D) kallikreins are shown with connections as they are in the network. Proteases are represented as green circles and inhibitors as red diamonds. Edges are cleavages (green, with arrow head) and inhibitions (red, with “T” head). Edges of originally defined pathways are solid, and additional edges are dotted. (A) Coagulation factors XII, XI, X, IX, VII, and II that form the clot (UniProt gene names: F12, F11, F10, F9, F7, and F2) are connected as originally described (Davie and Ratnoff, 1964; Macfarlane, 1964).
The panel also shows plasminogen (PLG) and the urokinase-type and tissue-type plasminogen activators (PLAU and PLAT) involved in fibrinolysis (Drag and Salvesen, 2010), as well as many connections between those proteins that were not classically described. (B) The main complement cascade comprises proteins C1R, C1S, C2, C3, and C5 of the classical pathway as well as cofactors from the alternative pathway, complement factors D, B, and I (UniProt gene names: CFD, CFB, and CFI) (Muller-Eberhard, 1988). Additional connections not originally described are with the lectin pathway activators mannose-binding lectin serine protease 1 and 2 (MASP1 and MASP2) (Matsushita et al., 2000) and the plasma protease C1 inhibitor (SERPING1) (Sim et al., 1979). (C) The network contains connections between initiator caspases 8, 9, and 10 (UniProt gene names: CASP8, CASP9, and CASP10) and their cleavage of effector caspases 3 and 7 (CASP3 and CASP7) and caspase 6 (CASP6) as described (Lincz, 1998). The network also contains caspase 4 (CASP4) and interactions with apoptosis protease inhibitors (BIRC7, BIRC8, and XIAP). (D) Kallikreins of the semen liquefaction cascade are connected as described previously (Pampalakis and Sotiropoulou, 2007), with the protease network showing many additional connections.

Figure 4.8 Interactions between protease groups in the human protease web.
(A) Detailed connections between pathways and protease groups in the strongly connected component of the network. The network presented is limited to the proteases (no inhibitors) with a reachability of 153 from Figure 4.6. Nodes are proteases and edges are cleavages. Proteases are designated by their UniProt gene names. (B) Interactions between classes of proteases and their inhibitors.
Nodes are classes of proteins: classes of proteases are green circles; classes of protease inhibitors are red diamonds. The size of a node represents the number of proteins in the class, as exemplified by groups of 10, 50, and 100 nodes in the legend. Protein classification: “M” are metallo, “S” are serine, “C” are cysteine, “A” are aspartate, and “T” are threonine proteases (as classified in MEROPS) or the corresponding inhibitors (as annotated in neXtProt). “B” are broad-spectrum inhibitors that are annotated to inhibit more than one class of protease and include A2M, serpin B4, serpin B9, PZP, histidine-rich glycoprotein, ovostatin homolog 1, and reversion-inducing cysteine-rich protein with Kazal motifs. Edges are cleavages (green, with arrow head) or inhibitions (red, with “T” head). The thickness of an edge corresponds to the number of cleavages or inhibitions between the classes, as exemplified by edges corresponding to 10, 50, or 100 interactions in the legend.

4.3.3 Theoretical network analysis of the protease web

From a biological standpoint, the highly interconnected (reachable) nature of the protease web was surprising and is underappreciated in the literature. To explore the degree to which this result is statistically surprising given the properties of the proteins making up the network, we investigated theoretical network models as well as randomized versions of the network. We first compared the protease web to two commonly used generative network models, the Erdős-Rényi model (ER) and the Barabási-Albert model (BA), with parameters chosen to mimic the properties of the real network’s member proteins (see Materials and Methods). We found that neither model (500 networks each) adequately explains the data: the generated networks have, on average, either much higher (ER) or lower (BA) reachability than the protease web (Figure 4.9A–C).
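A comparison of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the thesis: it uses the G(n, m) variant of the Erdős-Rényi model (a fixed number of distinct directed edges, no self-loops), counts each node in its own reachability, and runs on a hypothetical six-node toy network. The actual model parameters are those given in Materials and Methods, and the function names are illustrative.

```python
import random


def mean_reachability(nodes, edges):
    """Mean number of nodes reachable per node (each node counts itself)."""
    successors = {node: set() for node in nodes}
    for source, target in edges:
        successors[source].add(target)
    total = 0
    for start in successors:
        seen, stack = {start}, [start]
        while stack:  # depth-first search along the edge direction
            for neighbor in successors[stack.pop()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        total += len(seen)
    return total / len(successors)


def erdos_renyi_directed(nodes, n_edges, rng):
    """Draw n_edges distinct directed edges without self-loops, uniformly."""
    nodes = list(nodes)
    if n_edges > len(nodes) * (len(nodes) - 1):
        raise ValueError("more edges requested than ordered node pairs exist")
    edges = set()
    while len(edges) < n_edges:
        source, target = rng.sample(nodes, 2)  # two distinct nodes
        edges.add((source, target))
    return edges


# Hypothetical toy network standing in for the real protease web.
nodes = ["A", "B", "C", "D", "E", "F"]
real_edges = {("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")}

rng = random.Random(0)  # fixed seed so that the run is repeatable
random_means = [
    mean_reachability(nodes, erdos_renyi_directed(nodes, len(real_edges), rng))
    for _ in range(500)
]

real_mean = mean_reachability(nodes, real_edges)
model_mean = sum(random_means) / len(random_means)
print(f"real network: {real_mean:.2f}, ER average: {model_mean:.2f}")
```

Repeating the draw many times gives a distribution of mean reachability against which the value of the real network can be placed, which is the logic of the model comparison in Figure 4.9C.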
These experiments therefore leave open the statistical nature of the process that generates the network, which, we stress, currently involves both biological components and experimenter biases, the latter being due to the incomplete nature of the underlying biochemical analyses (many potential edges have not been tested). We next generated two types of edge-shuffled networks, one maintaining the in- and out-degree of each node (“Shuffled”) and a second preserving the overall in- and out-degree distributions of the network, but not the degrees of each node (“Shuffled2”). The mean reachability of the real network (72.09) was lower than that of 353 Shuffled networks (70.6% of all 500; average reachability was 73.96 across all 500 networks; see Figure 4.9D) but higher than that of all 500 Shuffled2 networks (average 34.8; Figure 4.9C). Taken together, these results indicate that high reachability emerges quite readily in a network composed of proteins with the measured in- and out-degrees found in a real biological network, such as the protease web described here. In fact, a network without such high reachability, as is often assumed in biochemistry and cell biology, would be surprising given these results. Importantly, this further suggests that the current biochemical description of cascades and individual proteases working in isolation is unlikely to be accurate.

Figure 4.9 The protease web compared to random networks.
(A) Out-degree and (B) in-degree of nodes in the protease web (“Real network”) compared to the Barabási-Albert (BA) and Erdős-Rényi (ER) model networks (averaged over 500 networks). A small constant (0.001) was added to enable log/log plots. (C) Mean reachability of nodes in 500 networks generated from each of the BA and ER models and from two different edge-shuffling methods (boxplots) compared to the protease web (red line). (D) Mean reachability in the protease web (red line) compared to the mean reachability of 500 edge-shuffled networks (black density curve).

4.3.4 High connectivity in the protease web is robust to possible annotation errors

To assess the reliability of the high connectivity in the protease web, which we observed assuming that all cleavage and inhibition data are trustworthy, we addressed the possibility that erroneous data passed through database annotations into our network. One way to validate our findings would be to compare the network to a second network derived from an orthogonal data source. However, because MEROPS is the only database of similar coverage, we instead tested whether the same connectivity is observed after removing nodes in anticipation that some interactions are wrongly annotated. Protease specificity is mostly influenced by three factors: substrate sequence, substrate folding, and the encounter of protease and substrate (Song et al., 2011a). In MEROPS/TopFIND, annotations are mostly derived from in vitro experiments in which a protease is incubated with a substrate. Although some proteases are specific for given substrate sequences and others cleave a wider range of sequences, in both cases possible cleavage sites can be masked within the protein structure of the substrate. Hence, the experimental parameters of protease cleavage assays are designed to preserve the folding and activity of both the protease and the substrate in order to prevent unspecific cleavage of denatured substrates. Colocalization of proteases and substrates in vivo is an important factor but cannot be determined unambiguously, and unexpected localizations have recently been revealed (Butler and Overall, 2009a, 2009b; Golubkov et al., 2005; Goulet et al., 2004; Kwan et al., 2004). In addition, most experiments are only performed if it can be assumed that the protease and substrate will colocalize in vivo.
Assuming that most annotations are correct but individual assignments can be wrong, we randomly and selectively removed edges from the protease web (focusing on the regulatory core, the largest connected component with 255 nodes) to test how reachability is maintained or influenced by such modifications. We utilized the term “physiological relevance,” as annotated in MEROPS and TopFIND, to first create a high-confidence network (abbreviated as “hc” in Figure 4.10A) by removing all edges that were annotated with a physiological relevance other than “yes.” As a consequence, the reachability of the resulting network was markedly decreased (Figure 4.10A), with the area under the curve (AUC) reduced to 22% of that of the original network. This was mostly due to the removal of all inhibitors (abbreviated as “i” below), as all 131 human inhibitions in TopFIND have a physiological relevance annotation of “unknown”; that is, their physiological relevance is not annotated in MEROPS, from which TopFIND data are largely derived. Upon adding back the inhibitors to the high-confidence network (“hc + i”), but still removing all “low confidence” nonphysiological cleavages, high reachability was largely recovered, as indicated by an AUC of 88% of that of the original network. The observation that restricting the network to high-confidence cleavages barely reduces network connectivity strengthens the conclusion that the high connectivity of the protease web is not due to incorrect annotations. Moreover, removing inhibitions from the network severely impacted reachability and thus connectivity, highlighting the essential role of inhibitors in connecting the protease web.

Given the observed importance of inhibitors, we assessed the possibility of incorrect annotation of cleavages of inhibitors. The molecular mechanism of cysteine or serine protease inhibition by serpins involves cleavage of the serpin at its flexible reactive loop, which displays “bait” amino acids.
Following cleavage, an induced conformational change leads to entrapment and inactivation of the protease (Huntington et al., 2000; Toh et al., 2010). Because the trap occurs after formation of the acyl intermediate during catalysis, the inhibited serine proteases, but also some cysteine proteases, remain covalently bound to the inhibitor. In contrast, metalloproteinase and aspartic protease cleavage of serpins in the reactive loop does not result in inhibition of the protease, as the nucleophile of these protease classes is a water molecule. Thus, these proteases are not trapped and therefore escape inhibition, but the serpin is now inactivated. Mechanisms of trapping upon cleavage have also been observed for some metalloproteinase inhibitors (Arolas et al., 2011) and for A2M or pregnancy zone protein (PZP, UniProt: P20742), which use a physical trapping mechanism to inhibit all classes of proteases, except exopeptidases (Jensen and Stigbrand, 1992; Marrero et al., 2012). Therefore, annotated cleavages of a protease inhibitor comprise cleavages that reflect either a regulatory inhibition of the protease or a regulatory inactivation cleavage of the inhibitor. To date, this distinction is not annotated in the databases, but it is one that we suggest implementing. As a conservative estimate, we removed all cleavages of serpins by serine or cysteine proteases and all cleavages of A2M or PZP by any protease (“inh rm” in Figure 4.10B). As a result, 144 edges were deleted from the original 1,238 edges of the largest connected component of the protease web (“orig” in Figure 4.10B). Notably, this removal only moderately reduced reachability (AUC 74% of original) and preserved a bimodal distribution. Thus, the high connectivity is not a result of unspecific inhibitors, nor is it an artifact attributable to ambiguous annotation of inhibitor cleavage, which further supports the importance of inhibitors in connecting the protease web.
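The three selective filters described above (“hc”, “hc + i”, and “inh rm”) can be sketched as follows. This is a hedged illustration only: the tuple layout of the edges, the `NODE_TYPE` labels, and the toy interactions are hypothetical and do not reproduce the MEROPS or TopFIND schema. The AUC is computed here as the sum, over every minimum number of reachable nodes, of the percentage of nodes reaching at least that many nodes, which is one assumed reading of the inverse cumulative curve plotted in Figure 4.10.

```python
from collections import defaultdict

# Hypothetical annotation: protease class or inhibitor type of each node.
NODE_TYPE = {
    "MMP": "metallo",
    "SP": "serine",
    "CP": "cysteine",
    "SERPIN": "serpin",
    "A2M": "a2m_like",  # stands in for A2M or PZP
}

# (source, target, kind, physiological relevance); all entries are invented.
EDGES = [
    ("MMP", "SP", "cleavage", "yes"),
    ("MMP", "SERPIN", "cleavage", "yes"),
    ("SP", "SERPIN", "cleavage", "unknown"),
    ("SERPIN", "SP", "inhibition", "unknown"),
    ("SP", "CP", "cleavage", "no"),
    ("CP", "A2M", "cleavage", "yes"),
]


def high_confidence(edges, keep_inhibitions=False):
    """Keep edges annotated as physiologically relevant ("hc").

    With keep_inhibitions=True, all inhibition edges are added back ("hc + i").
    """
    return [e for e in edges
            if e[3] == "yes" or (keep_inhibitions and e[2] == "inhibition")]


def is_ambiguous_inhibitor_cleavage(edge):
    """True for cleavages that may actually reflect inhibition of the protease."""
    source, target, kind, _ = edge
    if kind != "cleavage":
        return False
    if NODE_TYPE[target] == "a2m_like":
        return True  # any protease cleaving A2M or PZP
    return (NODE_TYPE[target] == "serpin"
            and NODE_TYPE[source] in ("serine", "cysteine"))


def reachability_auc(edges, nodes):
    """Area under the inverse cumulative reachability curve (assumed definition)."""
    successors = defaultdict(set)
    for source, target, *_ in edges:
        successors[source].add(target)
    reach = []
    for start in nodes:
        seen, stack = {start}, [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach.append(len(seen) - 1)
    n = len(nodes)
    # For each minimum number k of reachable nodes, add the percentage of
    # nodes that reach at least k nodes.
    return sum(100.0 * sum(r >= k for r in reach) / n for k in range(1, n))


nodes = list(NODE_TYPE)
hc = high_confidence(EDGES)
hc_i = high_confidence(EDGES, keep_inhibitions=True)
inh_rm = [e for e in EDGES if not is_ambiguous_inhibitor_cleavage(e)]
orig_auc = reachability_auc(EDGES, nodes)
for name, net in [("hc", hc), ("hc + i", hc_i), ("inh rm", inh_rm)]:
    share = 100 * reachability_auc(net, nodes) / orig_auc
    print(f"{name}: {len(net)} edges, AUC {share:.0f}% of original")
```

In this toy network the metalloproteinase cleavage of the serpin is kept by the “inh rm” filter, in line with the reasoning that such cleavages inactivate the inhibitor rather than trap the protease.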
We next assessed the dependence of reachability on individual nodes of the network. By removing each node individually, we found that reachability in the protease web is not dependent on any single node (Figure 4.11). Indeed, by iteratively removing the nodes with the highest betweenness from the network, we identified the six most important nodes: plasminogen (PLG; UniProt: P00747), alpha-1-antitrypsin, A2M, cathepsin L1 (CTSL1; UniProt: P07711), alpha-1-antichymotrypsin (also known as serpin A3) (SERPINA3; UniProt: P01011), and kallikrein-4 (KLK4; UniProt: Q9Y5K2) (Figure 4.10C). Removing all six nodes simultaneously removes 227 edges and markedly breaks down the bimodal distribution of reachability values, an effect not observed when removing any combination of five out of the six connectors. Thus, high connectivity in the protease web is robust in that it depends not on a single protein, but rather on six important connectors. Furthermore, even after removal of those six nodes, the reachability for many proteins remains high, with many long paths in the network. Notably, none of these six important nodes are digestive tract proteases, such as trypsin or chymotrypsin, which are broad-acting proteases that might have been expected to form many connections. However, we predict that the identity and number of these key connector proteins will change as more information on the protease web is uploaded to the databases with further experimentation.

Finally, we addressed the possibility of incorrect annotations by removing a fixed percentage of edges, thereby simulating a situation where these edges are incorrect cleavage or inhibition annotations and therefore would have to be removed from the network (Figure 4.10D). We randomly removed 10%, 20%, 30%, and 40% of all edges (cleavages and inhibitions) 200 times and then plotted the worst case for each experiment.
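The random edge-removal experiment can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the original analysis code: the toy ring network, the function names, and the AUC definition (sum over minimum reachable-node counts of the percentage of nodes reaching at least that many nodes) are all hypothetical.

```python
import random
from collections import defaultdict


def reachability_auc(edges, nodes):
    """Area under the inverse cumulative reachability curve (assumed definition)."""
    successors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
    reach = []
    for start in nodes:
        seen, stack = {start}, [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach.append(len(seen) - 1)
    n = len(nodes)
    return sum(100.0 * sum(r >= k for r in reach) / n for k in range(1, n))


def worst_case_auc(edges, nodes, fraction, repeats, rng):
    """Remove a fraction of edges at random `repeats` times; return the lowest AUC."""
    n_remove = round(fraction * len(edges))
    worst = None
    for _ in range(repeats):
        removed = set(rng.sample(range(len(edges)), n_remove))
        kept = [e for i, e in enumerate(edges) if i not in removed]
        auc = reachability_auc(kept, nodes)
        if worst is None or auc < worst:
            worst = auc
    return worst


# Invented toy network: a directed ring of 12 nodes with four chords.
NODES = list(range(12))
EDGES = ([(i, (i + 1) % 12) for i in NODES]
         + [(i, (i + 5) % 12) for i in range(0, 12, 3)])

rng = random.Random(1)
original = reachability_auc(EDGES, NODES)
worst = {f: worst_case_auc(EDGES, NODES, f, 200, rng)
         for f in (0.1, 0.2, 0.3, 0.4)}
for fraction, auc in worst.items():
    print(f"{int(fraction * 100)}% of edges removed: "
          f"worst AUC is {100 * auc / original:.0f}% of original")
```

Because each removal level is sampled independently, the worst case found at a higher removal level need not be lower than the one found at a lower level.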
The AUC was reduced to 78%, 65%, 47%, and 52%, respectively, but even removal of 40% of edges still preserved the bimodality of the reachability values. Therefore, the protease web again shows a strong resistance to removal of elements, which further increases confidence in the description of a highly connected protease web with inherent robustness to change. Such robustness also implies biological resilience, in that important protease functions can be maintained in the face of genetic deficiencies or pathological perturbations of the system.

Figure 4.10 Reachability in the human protease web after various perturbations.
Reachability of the largest connected component of the protease web (shown in Figures 4.5 and 4.6) after various perturbations. Reachability is plotted as the inverse cumulative function of the percentage of nodes that can reach a given minimum number of nodes in the corresponding network. (A) Reachability in the high-confidence network composed of nodes annotated as having physiological relevance. The reachability distribution of the original network (“orig,” red solid line as also shown in Figure 4.5C) is compared to networks where edges were removed to create a high-confidence network (“hc,” black dashed line) and the high-confidence network plus inhibitors (“hc + i,” black solid line). (B) Reachability before (“orig,” red line) and after (“inh rm,” black line) removing edges reflecting cleavages of inhibitors. Cleavage edges were removed if (i) the inhibitor is annotated to be a serine protease inhibitor and the protease is a serine or cysteine protease or (ii) the inhibitor is A2M or PZP. (C) Reachability after removal of six nodes from the original network (PLG, alpha-1-antitrypsin, A2M, CTSL1, alpha-1-antichymotrypsin, and KLK4).
The reachability after removing these six nodes (“6 rm,” black solid line) is compared to the reachability distribution of the original network (“orig,” red line) and to six networks representing each possible combination of keeping one of the six nodes and removing the other five (“5 rm,” black dotted lines), each showing a much smaller reduction in reachability. (D) Reachability after removal of random edges. The reachability in the original network (“orig,” red line) is compared to networks where 10%, 20%, 30%, or 40% of edges were removed at random. In each case, random edge deletion was carried out 200 times and the worst AUC value was selected for plotting.

Figure 4.11 Reachability in the network does not depend on one single node.
(A) High reachability is maintained after removal of single nodes from the network. The original protease web (“orig,” red line) is compared to 255 modified networks, each of which is missing one of the 255 nodes of the original network (“1 rm,” black lines). (B) The AUC for the 255 modified networks (histogram) compared to the AUC of the original network (red vertical line).

4.3.5 Human tissue-specific protease webs

Our analyses suggested that the protease web represents a robust regulatory system of high complexity and flexibility, enabling complex patterns of regulation of proteins at the posttranslational level. We next assessed how this system is implemented in vivo, where only a fraction of proteases and inhibitors is expressed or active at the same time in the same cell, compartment, or tissue. We constructed tissue-specific networks based on protease and inhibitor gene expression levels in 23 different human tissues quantified by CLIP-CHIP microarray (Kappelhoff et al., unpublished data available at http://clipserve.clip.ubc.ca/supplements/protease-web).
We used negative control spots on this microarray to define a threshold of expression at detectable levels and then limited networks to those proteins expressed above this threshold. We next plotted the reachability of the nodes in the largest connected component of the resulting networks for all 23 tissue-specific protease webs (Figure 4.12A). Figure 4.12B shows liver, spleen, and skin results in more detail. Although most tissue-specific networks (e.g., skin) show low reachability values, some preserve the strong connectivity of the original network fully (e.g., kidney and liver) or partially (e.g., spleen, small intestine, pancreas, lung, colon). Notably, the tissue-specific networks also show that reachability is highly dependent on expression of the same six network connectors shown in Figure 4.10C (Figure 4.13).

Figure 4.12 Reachability in tissue-specific protease webs.
(A) Beanplot of reachability distributions in the largest connected components of 23 human tissue-specific networks based on gene expression in the corresponding tissues and the original protease web reachability distributions. Overlaid is a scatterplot of the precise values of each node. Numbers in parentheses refer to the size of the network. (B) Inverse cumulative distribution plot of reachability values for skin (dashed black line), spleen (black solid line), liver (dotted black line), and original network (red solid line). Reachability is plotted as an inverse cumulative function of the percentage of nodes that can reach a given minimum percentage of nodes in the corresponding network.

Figure 4.13 Reachability in the protease web strongly depends on the presence of six important nodes.
Reachability plotted against the presence of the six important proteins identified in Figure 4.10C (PLG, SERPINA1, A2M, CTSL1, SERPINA3, and KLK4) for the 23 tissue-specific networks. The AUC of the inverse cumulative function of reachability values in each tissue-specific network (x-axis) was plotted against the count of important proteins (out of six) present in each network (y-axis).

4.3.6 Evidence for the protease web in other data

In agreement with our findings based on biochemical interactions, the general biological literature also shows that proteases and their inhibitors can be involved in multiple biological processes (Figure 4.14A). It is easy to imagine that this multifunctionality is partly due to the interplay in the protease web. Indeed, most of the proteins in Figure 4.14A are found in the strongly connected component of our protease web, indicating that they serve in connecting different biological processes. One example is TIMP1 (UniProt: P01033). Protein expression levels of TIMP1, an MMP inhibitor mainly involved in extracellular matrix remodeling and organization, were found to be associated with hemostasis (Aznaouridis et al., 2007). This finding, which is derived from data orthogonal to the protease web, prompted us to search for connections linking TIMP1 to coagulation factors, which we could indeed identify (Figure 4.14B). Together, these connections provide a plausible mechanism of action of TIMP1, and hence MMPs, on coagulation and could explain the association observed. Hence, the protease web can be used to explain multifunctionality of proteases, which in turn strengthens our conclusion of a large interplay between proteases.

Figure 4.14 Proteases and their inhibitors involved in multiple, discrete biological processes.
(A) A matrix showing the annotation of proteases and inhibitors with selected, protease-specific biological processes based on Gene Ontology (Ashburner et al., 2000). Proteins annotated with more than one term are displayed. (B) A subnetwork of the protease web connecting TIMP1 to coagulation: TIMP1 (UniProt: P01033) inhibits MMP10 (UniProt: P09238) and MMP1 (UniProt: P03956), which both cleave and activate MMP9 (UniProt: P14780), which cleaves PLG (UniProt: P00747). Similarly, MMP1 and MMP9 cleave and inactivate serpin A1 (UniProt: P01009), which is an inhibitor of PLG.

4.3.7 Using the protease web to decipher in vivo network effects

We were able to test the utility of our graph representation of the protease web by deciphering a previously inexplicable result in vivo. We analyzed the MMP8-dependent cleavage of the murine chemokine C-X-C motif chemokine 5 (CXCL5, UniProt: P50228), also known as lipopolysaccharide (LPS)-induced chemokine or cytokine LIX (LIX). LIX is a potent chemoattractant chemokine for polymorphonuclear (PMN) leukocytes, and MMP8 (UniProt: O70138) is PMN specific. It was previously demonstrated in an in vivo airpouch model that MMP8 knockout mice showed reduced PMN migration in response to LPS (Tester et al., 2007). This was attributed to MMP8 processing and activation of LIX at position Ser4↓Val5, with a second cleavage at Lys79↓Arg80 of the 92-residue protein. Indeed, the MMP8-truncated activated form of LIX (5–79) showed equal cell migration in wild-type and knockout mice, validating LIX as a physiological MMP8-dependent mechanism for promoting neutrophil infiltration in vivo. However, a neoepitope antibody specific to the MMP8-generated neo-N terminus failed to detect truncations at Ser4↓Val5 in the airpouch model.
Thus, cleavage of LIX is an MMP8-dependent but MMP8-indirect event in vivo that could not be explained, prompting a further analysis of alternate MMP8-dependent proteolytic pathways predicted using our representation of the protease web.

To examine the importance of neutrophil-derived MMP8 in LIX processing and activation, we isolated bone marrow neutrophils from wild-type and MMP8 knockout mice. Neutrophils were stimulated with phorbol myristate acetate (PMA) followed by incubation of the activated neutrophils with chemokine for up to 3 h. Truncations of LIX generating the bioactive products LIX (9–92) and LIX (9–78), as determined by MALDI-TOF mass spectrometry from the still inactive form LIX (1–78), were readily apparent, even after only 1 h of incubation (Figure 4.15A). However, both the MMP8 knockout and wild-type neutrophils showed identical cleavage sites (Ala8↓Thr9 and Ala78↓Lys79) and cleavage kinetics. Because these sites differ from the MMP8 cleavage sites (Figures 4.16, 4.17, and 4.15B), MMP8 is not the dominant neutrophil protease cleaving LIX in the cellular context. Investigating protease web effects that may account for this, we found that LIX cleavage by neutrophils was inhibited by the serine protease inhibitor 2-aminoethyl benzenesulfonyl fluoride hydrochloride (Figure 4.15C). This showed that one or more of the four serine proteases in neutrophils, namely neutrophil elastase (UniProt: Q3UP87), cathepsin G (UniProt: P28293) (Kessenbrock et al., 2011), proteinase-3 (UniProt: Q61096), or the recently described neutrophil serine proteinase 4 (UniProt: Q14B24) (Perera et al., 2012), was responsible for LIX cleavage.
Using low concentrations of the endogenous serine proteinase inhibitors α1-proteinase inhibitor (α1-PI, UniProt: P07758) (Knäuper et al., 1990) and secreted leukocyte proteinase inhibitor (SLPI, UniProt: P97430) (Figure 4.15C), we excluded proteinase-3 and neutrophil serine proteinase 4 as candidates, as SLPI does not inhibit these proteinases (Perera et al., 2012; Rao et al., 1993). Moreover, neutrophil serine proteinase 4 has a stringent substrate specificity that does not fit our observed cleavage sites. Cathepsin G did not cut after Ala8 and required high enzyme concentrations (>100 nM) to generate the C-terminal cleavage (Figure 4.17), as it was inefficient, with a kcat/KM of 60 M⁻¹ s⁻¹. Thus, neutrophil elastase was the strongest candidate, and indeed 1 nM elastase efficiently cleaved LIX with a kcat/KM of 1,200 M⁻¹ s⁻¹ at Ala8↓Thr9 and Ala78↓Lys79 (Figures 4.15D, 4.16, and 4.17). Because MMP8 cleaves N-terminal to the Ala8↓Thr9 elastase site and C-terminal to the Ala78↓Lys79 elastase site, truncations by elastase will remove evidence of any MMP8 cleavage. Furthermore, MMP8 is less efficient (kcat/KM of 600 M⁻¹ s⁻¹) than elastase in cleaving LIX. Thus, elastase is the dominant protease for LIX cleavage by neutrophils in vivo.

To explain the paradoxical result that in the Mmp8−/− mouse LIX is not cleaved in vivo despite the presence of neutrophil elastase, we employed path finding in the protease web to identify potential regulatory effects of MMP8 on neutrophil elastase. Although no path was found in the murine network, the more extensive human network contains a path that had the potential to explain this perplexing result (Figure 4.15E). Human MMP8 is known to cleave and inactivate human α1-PI (Knäuper et al., 1990), the potent inhibitor of neutrophil elastase, but SLPI is resistant to MMP8 cleavage (Henry et al., 2002).
We verified α1-PI cleavage by MMP8 using mouse proteins for the first time at various enzyme-to-substrate ratios and in time course experiments (Figure 4.15F), from which we found that murine MMP8 efficiently cleaves and inactivates murine α1-PI in vitro with a kcat/KM of 7.7 × 10³ M⁻¹ s⁻¹.

We next validated the in vitro results in vivo. In murine bronchioalveolar lavage collected following 24 h of treatment with LPS, both the full-length and high molecular weight forms of α1-PI, which were present as inhibitor-serine protease complexes, were greatly enhanced in Mmp8−/− mice compared to wild type (Figure 4.15G). Together, these in vitro and in vivo data show that efficient cleavage of α1-PI by MMP8 occurs in vivo and indicate the importance of MMP8 in modulating the balance of functional α1-PI protein and activity in vivo and hence elastase activity. Finally, we confirmed neutrophil elastase-dependent LIX cleavage in vivo using a specific neutrophil elastase chemical inhibitor (GW311616). Specific elastase inhibition reduced the relative numbers of neutrophils in wild-type mouse bronchioalveolar lavage, similar to the decrease in cell migration in the MMP8 knockout versus the wild-type mouse bronchioalveolar lavage (Figure 4.15H). We conclude that MMP8 cleaves and inactivates α1-PI in vivo, acting as the “metallo-serpin” switch leading to increased neutrophil elastase activity and LIX activation, which thereby promotes neutrophil infiltration in vivo. Evidence of LIX cleavage by MMP8 is lost following cleavage in vivo by elastase, which is also catalytically more efficient than MMP8. Thus, the protease web enabled deconvolution of a complex biologically relevant proteolytic event and in turn formulation of a testable hypothesis that was confirmed in vitro and in vivo.
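The path-finding step that linked MMP8 to neutrophil elastase can be illustrated with a short sketch. It is a simplified, assumed representation rather than the tool used in this work: node names are shortened (`A1PI` for α1-PI, `NE` for neutrophil elastase), and each edge carries a hypothetical sign, +1 for an activating cleavage and −1 for an inhibition or an inactivating cleavage, so that the product of signs along a path gives the expected net effect.

```python
from collections import deque

# Signed toy network; signs are an illustrative annotation, not database fields.
EDGES = {
    ("MMP8", "A1PI"): -1,  # MMP8 cleaves and inactivates the inhibitor
    ("A1PI", "NE"): -1,    # alpha1-PI inhibits neutrophil elastase
    ("NE", "LIX"): +1,     # elastase cleaves and activates LIX
    ("MMP8", "LIX"): +1,   # direct cleavage observed in vitro
}


def shortest_path(edges, source, target):
    """Breadth-first search for a shortest directed path; None if there is none."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def net_effect(edges, path):
    """Multiply edge signs along a path: +1 net activation, -1 net repression."""
    sign = 1
    for u, v in zip(path, path[1:]):
        sign *= edges[(u, v)]
    return sign


path = shortest_path(EDGES, "MMP8", "NE")
effect = net_effect(EDGES, path)
print(" -> ".join(path), "| net effect on NE:", "+" if effect > 0 else "-")
```

The two negative edges cancel, which corresponds to the conclusion drawn above: removal of the inhibitor by MMP8 increases elastase activity. A path found in this way is only a hypothesis and, as in the experiments described here, requires biochemical validation.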
Figure 4.15 Protease web affects validation in vivo.
(A) Tris-Tricine 15% SDS-PAGE and MALDI-TOF mass spectrometry analyses of LIX cleavage following incubation with wild-type (WT) or MMP8-deficient (KO) murine polymorphonuclear leukocytes (PMNs) for up to 3 h after PMA stimulation to release PMN proteases to the culture medium. (B) Sequence of the N- and C-terminal regions of LIX with cleavage sites by PMN MMP8 and the unknown protease (?) annotated. (C) Tris-Tricine SDS-PAGE analysis of LIX cleavage by PMNs after addition of protease inhibitors: AEBSF, 2-aminoethyl benzenesulfonyl fluoride hydrochloride; α1-PI, α1-proteinase inhibitor; SLPI, secreted leukocyte proteinase inhibitor. (D) LIX cleavage by murine (m) MMP8 and murine neutrophil elastase (mNE) analyzed by 15% Tris-Tricine SDS-PAGE analysis and MALDI-ToF mass spectrometry. E:S, enzyme-to-substrate ratio; “Marker,” molecular weight markers as indicated. (E) Network effects on LIX cleavage. Proteases are green, inhibitors red, and other substrate proteins are grey. Edges are cleavages (green, with arrow head) or inhibitions (red, with “T” head). (F) MMP8 cleavage of α1-proteinase inhibitor (α1-PI). The serine protease inhibitor α1-PI was incubated with MMP8 for 16 h at 37 °C in 50 mM Tris, 200 mM NaCl, 5 mM CaCl2, pH 7.4 containing 1 mM APMA. The enzyme-to-substrate (E:S) ratio ranged from 1:5 to 1:5,000 (w:w). Reactions were visualized on a 10% SDS-PAGE (silver stained). Below, a time course of MMP8 cleavage of α1-PI at 1:50 (w:w) E:S ratio. (G) Bronchioalveolar lavage of mice stimulated with LPS. WT, wild-type mouse; KO, MMP8 knockout mouse. LPS (2 µg) was instilled in the lungs of female mice, and after 48 h, the mice were sacrificed and the lungs lavaged with PBS. Cell-free bronchioalveolar lavage from three mice was pooled and concentrated by acetone precipitation. α1-PI detection was with Alexa-conjugated antibodies (Molecular Probes) on the LiCOR Odyssey. 
(H) Numbers of PMNs in the bronchioalveolar lavage after LPS stimulation with (n = 3) and without (n = 3) instillation of GW311616 (GW), a specific neutrophil elastase inhibitor.

Figure 4.16 MALDI-TOF analysis of LIX cleavage by MMP8 and neutrophil elastase.
LIX cleavage products from Figure 4.15D analyzed by MALDI-TOF mass spectrometry. Analysis of LIX alone (LIX 1–92) was compared to the LIX cleavage products at E:S ratios of 1:5,000, 1:500, and 1:50 for murine MMP8 on the left and murine neutrophil elastase (NE) on the right. MMP8 and NE are not observed in this m/z range of the spectra.

Figure 4.17 MMP8, neutrophil elastase, and cathepsin G cleavage of LIX.
(A) LIX cleavage by murine (m) and human (h) proteases MMP8, neutrophil elastase (NE), and cathepsin G (CATG) analyzed by 15% Tris-Tricine SDS-PAGE analysis and MALDI-TOF mass spectrometry. Resolution of mMMP8 cleavage products was technically difficult to show by gel electrophoresis, and so we relied upon the data generated by MALDI-TOF mass spectrometry (Figure 4.16). E:S, enzyme-to-substrate ratio; “Marker,” molecular weight markers as indicated. (B) Sequence of the N- and C-terminal regions of LIX with major protease cleavage sites annotated as determined by MALDI-TOF mass spectrometry. Sites for MMP8 and NE were found for both human and murine enzymes; sites marked mNE are unique to the murine neutrophil elastase.

4.4 Discussion

To our knowledge, this is the first systematic bioinformatics analysis of the extent and structure of the protease web. We assembled in silico networks comprising all biochemically annotated interactions between proteases and their inhibitors, which therefore represent the potential of regulation among proteases based on current biochemical data.
By representing the human protease web as a graph, we show the extent to which proteases and inhibitors regulate each other across families and even catalytic classes. Thus, known cascades and proteases do not act in isolation, as often assumed, but crosstalk extensively. The structure of the human protease web is not cascade-like and hierarchical but multidirectional, with connections between top and bottom proteins of known cascades. Six proteases and inhibitors were identified as key connectors in this network. Although other connectors might be identified in future versions of the network, this shows how regulatory switches, especially inhibitors, tether subnetworks of the overall network. Notably, the observed potential for regulatory crosstalk between proteases and inhibitors is not an artifact of data annotation, as it persists robustly despite the various perturbations we tested (Figure 4.10). On the contrary, the extent of such crosstalk is likely underestimated because current data on protease cleavage and inhibition are largely incomplete.

As high-throughput terminomics analyses continue to add large amounts of new information, more connections will undoubtedly be found, thereby further increasing the observed connectivity. In fact, a decrease in connectivity can only occur if current annotations are proven wrong and are corrected by removing edges from the network. However, we demonstrated that connectivity in the protease web is highly robust against such modifications, further validating the existence of a pervasive network of proteases and inhibitors embedded in different proteomes. Investigating tissue-specific implementations of the protease web, we found that gene expression shapes the protease web specifically in various tissues. Thus, subnetworks of the entire network are active at any place and time in different tissues.
Some human tissues exhibit a protease web with connectivity close to that of the global network, further validating the existence of such a network in vivo. Mouse annotations are currently focused on a few proteases and therefore cannot yet display large-scale network features. Despite this and the currently lower connectivity in the murine network (Figure 4.5C), we expect that with further annotations the murine network will morph to form more of a multidirectional, highly connected structure similar to the described human network.

The utility of the protease network as a concept and as a tool was demonstrated in successfully deciphering a paradoxical in vivo result involving cleavage of the murine chemokine LIX by neutrophils, an important inflammatory cell in innate immunity. LIX had previously been shown to be a substrate of the neutrophil-specific MMP8. Our analyses showed that even though MMP8 cleaves LIX in vitro and LIX cleavage is reduced in the Mmp8−/− mouse, LIX was not cut by MMP8 in vivo. Rather, we identified neutrophil elastase as the relevant protease in vivo. Path finding in the protease web then enabled us to show that MMP8 potently but indirectly facilitated LIX cleavage through direct MMP8 cleavage and inactivation of the elastase inhibitor α1-PI in cellular contexts and in vivo. Thus, combining individual interactions stored in TopFIND/MEROPS through interrogation of the protease web by random and directed walks generated a testable hypothesis that was experimentally validated. This revealed the mechanistic importance of MMP8 in mediating the cleavage of LIX, not directly as observed in vitro, but indirectly by enabling elastase activity through removal of the biologically relevant blocking inhibitor, thus forming a metallo-serpin switch to regulate the concentrations of active versus inactive α1-PI in vivo.
The biological outcome of path walking in the network will depend on the relative concentrations of the individual nodes in different tissues or tissue conditions and pathologies. Thus, what is biologically meaningful in one situation may not be in another and so requires experimental validation, as we performed here. Hence, the overall workflow of path prediction and validation can now be transferred to other investigations of complex in vivo protease biology.

4.4.1 Principles of regulation in the protease web

Critical control of protease activity is exerted at the protein level. Proteases from one class (e.g., metalloproteinases) frequently cleave proteases from other classes (e.g., serine proteases) or their cognate inhibitors (serpins), and subnetworks can thereby be activated or inactivated. In this process, we found that protease inhibitors take an important connecting role in the web: they are highly enriched as substrates of all classes of proteases, and removal of inhibition strongly decreases reachability of all nodes in the network. Protease inhibitors often lack specificity and inhibit families of proteases rather than just individual enzymes. Thus, inhibitors function as key on/off switches of entire subnetworks within the protease web, enabling rapid and efficient activation of proteolytic processes upon their cleavage. We provided a new example of a metallo-serpin switch controlling chemokine activation. As an important biological consequence, removal of inhibition is therefore recognized to be as important as zymogen activation in cascades in controlling proteolysis. Indeed, this was recently demonstrated in skin inflammation in vivo, where MMP2 was found to cleave and inactivate serpin G1, also known as complement C1 inhibitor (Keller et al., 2013).
Dynamically regulating the level of serpin G1 inhibition allowed complement activation to cascade; complement activation was otherwise greatly reduced in the Mmp2−/− mouse, where excess amounts of intact functional serpin G1 were proteomically quantified by TAILS terminomics. The central role of this metallo-serpin inhibitor switch in the protease web was further shown in the regulation of another subnetwork involving plasma kallikrein cleavage of kininogen to release the vasoactive peptide bradykinin. The network representation of the protease web emphasizes that proteases of one family and class can markedly regulate the activity of proteases from different families and classes.

4.4.2 Applicability of the protease web

Understanding a complex biological network such as the protease web can only be achieved through systematic storage and sharing of biochemical information, which enables network-based predictions that generate testable hypotheses. Applying this strategy, we gained in silico insights into in vivo processes and validated these biochemically, in culture, and in vivo. We forecast that through further identification and biochemical characterization of cleavage and inhibition events, the representation of protease interactions can be improved to strengthen its predictive power. The resulting network could then be used to simulate the effects of protease and inhibitor knockouts and of protease drug targeting in disease, which will enhance confidence in targeting the correct protease and thereby increase the success rate of clinical trials by reducing unexpected side effects.

In conclusion, our analysis of the protease web reveals a multidirectional rather than a hierarchical structure, as has been proposed (Mason and Joyce, 2011), with deep connections in regulation of the proteome by specific proteolytic processing in addition to degradation.
As the structure of the human protease web is multidirectional rather than cascade-like and hierarchical, it has high connectivity that is robust to change. Biologically, this implies that regulation by proteolysis is a consistent and pervasive force in all tissues. In comparison to phosphorylation, which predominantly controls intracellular proteins and pathways (Tagliabracci et al., 2012), proteolysis affects all proteins and pathways inside and outside the cell; it is irreversible and pervasive, and it needs to be considered in functional analyses of the proteome.

4.5 Summary

Proteolytic processing is an irreversible posttranslational modification affecting a large portion of the proteome. Protease-cleaved mediators frequently exhibit altered activity, and biological pathways are often regulated by proteolytic processing. Many of these mechanisms have not been appreciated as being protease-dependent, and the potential in unraveling a complex new dimension of biological control is increasingly recognized. Proteases are currently believed to act individually or in isolated cascades. However, conclusive but scattered biochemical evidence indicates broader regulation of proteases by protease and inhibitor interactions. Therefore, to systematically study such interactions, we assembled curated protease cleavage and inhibition data into a global, computational representation, termed the protease web. This revealed that proteases pervasively influence the activity of other proteases directly or by cleaving intermediate proteases or protease inhibitors. The protease web spans four classes of proteases and inhibitors and so links both recently and classically described protease groups and cascades, which can no longer be viewed as operating in isolation in vivo. We demonstrated that this observation, termed reachability, is robust to alterations in the data and will only increase in the future as additional data are added.
We further show how subnetworks of the web are operational in 23 different tissues reflecting different phenotypes. We applied our network to develop novel insights into biologically relevant protease interactions using cell-specific proteases of the polymorphonuclear leukocyte as a system. Predictions from the protease web that the activities of matrix metalloproteinase-8 (MMP8) and neutrophil elastase are linked by an inactivating cleavage of serpinA1 by MMP8 were validated and explain perplexing Mmp8−/− versus wild-type polymorphonuclear chemokine cleavages in vivo. Our findings supply systematically derived and validated evidence for the existence of the protease web, a network that affects the activity of most proteases and thereby influences the functional state of the proteome and cell activity.

Chapter 5: Prediction of inhibitory protein-protein interactions between protease inhibitors and their target proteases in the protease web

5.1 Introduction

Identifying physical protein interactions is a fundamental yet difficult task in molecular biology. Biochemical approaches such as yeast-2-hybrid and co-immunoprecipitation and newer experimental methods (Kristensen et al., 2012; Weisbrod et al., 2013) are highly productive and scalable. However, limited accuracy from false positives and coverage that is context dependent remain problematic (Braun et al., 2009; von Mering et al., 2002). Computational methods have been developed to predict protein-protein interactions, commonly linking together proteins on the basis of shared features such as patterns of conservation, expression, or annotations (Bhardwaj and Lu, 2005; Franceschini et al., 2012; Jansen et al., 2003; Rhodes et al., 2005), a version of “guilt by association”. A second class of approaches uses protein structural features to identify potential physical interaction interfaces (Zhang et al., 2012). These approaches can be combined, but there is room for improvement.
In the methods cited above, predictions were often hand-picked and few were tested (Jansen et al., 2003; Rhodes et al., 2005). Estimates of the true efficacy of prediction methods in structured evaluations, such as those that exist for function prediction (CAFA (Radivojac et al., 2013)), structure prediction (CASP (Moult et al., 2014)), or structural docking (CAPRI (Janin, 2002)), are lacking for protein interaction prediction methods. If computational predictions of interactions were sufficiently accurate, biochemical assays could be targeted more efficiently, but to date, computational predictions have not played a major role in interaction discovery or prioritization, and reliable prediction remains an aspiration (Pavlidis and Gillis, 2013). We hypothesize that a prediction approach crafted for a specific type of interaction might provide insight that facilitates successful predictions in the general protein interaction task.

Proteases are a critical component of the post-translational regulatory machinery in cells. By irreversibly cleaving proteins, proteases shape the functional state of the proteome and hence mediate cell and tissue responses. Befitting this role, proteases have been identified as promising drug targets in various diseases (Dufour and Overall, 2013; Turk, 2006), but drug development has been hampered by complex protease biology that is often poorly understood. One aspect of this complexity is the protease web, a complex, dense interaction network currently known to encompass 255 of 565 human proteases and their inhibitors, as described in chapter 4 (Fortelny et al., 2014). Proteases regulate the activity of other proteases by direct cleavage or by cleaving their endogenous inhibitors, which in turn influences additional cleavages downstream. Proteases can thus indirectly influence the cleavage of proteins other than their direct substrates.
We recently established a graph model of protease web interactions based on existing biochemical data (chapter 4) (Fortelny et al., 2014). This model demonstrated the importance and pervasiveness of these network effects and can be used to predict proteolytic pathways, as shown in chapter 2 (Fortelny et al., 2015). However, the network is far from its full potential because the cleavage and inhibition interaction data underlying the model are incomplete. This is mainly due to the lack of studies of proteases and inhibitors (and also to the lack of data uploading to community databases). Computational prediction could provide a means to accelerate the addition of interactions to this network.

Here we focused on the prediction of novel inhibitory interactions between protease inhibitors and proteases to gain insight into factors that are efficacious in predicting protein interactions and to ultimately improve our understanding of the protease web. As the first global prediction effort of this kind, we concentrated on identifying novel inhibitory interactions between poorly characterized inhibitors and proteases. Identification of protease inhibitors is an important step in understanding a biological system (Scott et al., 1999; Sun et al., 1996), but large-scale computational prediction efforts have been limited to protease-specific features (Song et al., 2011b). According to the MEROPS database (Rawlings et al., 2012), 354 (~80%) of 444 human proteases have no identified inhibitor and 13 (~14%) of 94 inhibitors have no annotated target protease (orphan inhibitors). Furthermore, many proteases are inhibited by multiple inhibitors and, conversely, many inhibitors inhibit multiple proteases, so that the whole realm of protease inhibition is underexplored (Fortelny et al., 2014).
Considering the key role that proteolytic processing of signaling molecules plays in regulating signaling pathways, identifying additional valid and physiologically relevant protease-inhibitor pairs would greatly benefit our understanding of protease biology.

Important questions in interaction prediction methods are which input data to use for predictions and how to evaluate performance (in contrast, the prediction algorithm used plays a relatively minor role (Gillis and Pavlidis, 2011b)). To evaluate the performance of a predictor, its efficacy in separating predefined true positive and true negative examples is measured. For example, if most truly interacting proteins are coexpressed and non-interactors are not coexpressed, then coexpression is a good predictor of interaction. The better the separation of the two groups, the better the predictive performance. True positives are generally readily found in biological databases, but it is challenging to define true negatives. Common approaches use either random interactions (based on the fact that true interactions are a small subset of all possible interactions) or information such as lack of colocalization, whereby proteins localized to different compartments are taken to be non-interactors (Braun et al., 2009). One advantage of the protease-inhibitor prediction task is the ability to define a set of true negative inhibitions as pairs of inhibitors and proteases that are enzymatically implausible. Proteases and their inhibitors are organized into families based on their primary sequence and into clans based on the structure of their active site (Rawlings et al., 2012). Families and clans of inhibitors can usually be assigned to one or two target protease classes. It is thus possible to define pairs where the inhibitor cannot inhibit the protease based on known chemical and structural constraints.
For example, for metalloproteinases we did not consider serpins as inhibitors, since serpins only inhibit serine or cysteine proteases (despite the fact that MMPs often cleave and inactivate serpins (Fortelny et al., 2014; Keller et al., 2013)). The second advantage of this prediction task is the ease and accuracy of biochemical testing of the predictions by measuring inhibition of the catalytic activity of the protease.

In this study, we defined true positive inhibitions (TP, n = 294) as those inhibitions annotated in MEROPS and true negative inhibitions (TN, n = 6,990) as enzymatically implausible pairs. The latter were obtained by pairing either (i) one serine and/or cysteine protease inhibitor (SERPINs, SPINTs, SPINKs, or BIRCs) with one aspartate, threonine, or metalloprotease, or (ii) one tissue inhibitor of metalloproteinases (TIMP) with one aspartate, serine, cysteine, or threonine protease. Using this gold standard, we evaluated the predictive power of protein-protein interaction data, co-annotation, coexpression, phylogenetic similarity, and colocalization as input data to predict protease-inhibitor pairs in the protease web. In particular, we examined 40 coexpression features derived from different input data and correlation metrics and found that all coexpression features have weak predictive power, indicating that, contrary to common assumption, many functional protease-inhibitor pairs are not coexpressed. Finally, we predicted 270 protease-inhibitor pairs, examined 9 of these predicted inhibitions biochemically, and validated the novel inhibition of kallikrein 5 (KLK5) by serpin B12 (SERPINB12), previously an orphan inhibitor.

5.2 Materials and methods

5.2.1 Data analysis

Network figures were created using Cytoscape (Smoot et al., 2011). Data analysis was carried out in R (R Core Team, 2013) using the ggplot2, ROCR, and gplots packages.
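The gold standard and its use in scoring a feature can be sketched as follows. This is a Python illustration under stated assumptions rather than the R/ROCR code used for the analyses: the small annotation dictionaries are toy stand-ins for MEROPS, the helper names are hypothetical, and the area under the ROC curve is computed directly as the probability that a true positive pair outscores a true negative pair.

```python
from itertools import product

# Toy annotations (assumption); the real class and family assignments come from MEROPS.
INHIBITOR_FAMILY = {"SERPINB12": "SERPIN", "SPINK5": "SPINK", "TIMP2": "TIMP"}
PROTEASE_CLASS = {"KLK5": "serine", "CASP8": "cysteine", "MMP2": "metallo", "CTSD": "aspartic"}

# Protease classes each inhibitor family is taken to be unable to inhibit,
# following the rule stated in the text.
NOT_SER_CYS = frozenset({"aspartic", "threonine", "metallo"})
IMPLAUSIBLE = {
    "SERPIN": NOT_SER_CYS,
    "SPINT": NOT_SER_CYS,
    "SPINK": NOT_SER_CYS,
    "BIRC": NOT_SER_CYS,
    "TIMP": frozenset({"aspartic", "serine", "cysteine", "threonine"}),
}

def true_negative_pairs(inhibitors, proteases):
    """All (inhibitor, protease) pairs that are enzymatically implausible."""
    return sorted(
        (inh, prot)
        for (inh, fam), (prot, cls) in product(inhibitors.items(), proteases.items())
        if cls in IMPLAUSIBLE.get(fam, frozenset())
    )

def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

tn = true_negative_pairs(INHIBITOR_FAMILY, PROTEASE_CLASS)
print(len(tn), "implausible pairs, e.g.", tn[0])

# Hypothetical feature scores for two TP and two TN pairs.
print(roc_auc([0.9, 0.4], [0.5, 0.1]))
```

A value near 0.5 from `roc_auc` would indicate that a feature does not separate the two groups, and a value near 1 that it separates them well.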
5.2.2 Proteases and protease inhibitor data

Protease and inhibitor class, family, cleavage, and inhibitor information was extracted from the MEROPS database (http://merops.sanger.ac.uk/) (Rawlings et al., 2012) version 9.9 on September 30, 2013. MEROPS IDs were used to classify proteases and inhibitors into classes and families as described on the MEROPS website.

5.2.3 Protein-protein interaction networks

Protein-protein interaction (PPI) data from HIPPIE (Schaefer et al., 2012) version 1.5 was downloaded on June 12, 2013. PPI data from high-throughput experiments was downloaded from BioGRID (Chatr-Aryamontri et al., 2015) on October 11, 2013. PPI data from (Bossi and Lehner, 2009) was downloaded on October 11, 2013. Experiments with up to 100 identified PPIs were considered low-throughput, those with 100-1,000 PPIs were labeled medium-throughput, and those with more than 1,000 PPIs were deemed high-throughput.

5.2.4 Protein localization

Protein localization information was downloaded from three sources: LocDB (Rastogi and Rost, 2011) (downloaded November 19, 2013), the Human Protein Atlas (Uhlen et al., 2010) (downloaded November 12, 2013), and Gene Ontology annotation using the hgu95av2.db package in R (R Core Team, 2013) (downloaded August 8, 2013). For each dataset, annotations were mapped to Gene Ontology terms and annotation trees for each protein were generated using the GOstats package in R (R Core Team, 2013). For LocDB, primary and secondary localization information was combined for each protein. Main and other localization data from the Human Protein Atlas were used if the reliability was annotated as High, Medium, or Supportive. GO annotations were retained if the evidence code was one of EXP, IDA, IPI, IGI, IMP, IEP, or TAS.

5.2.5 Coexpression networks

GTEX data (Lonsdale et al., 2013) was downloaded on January 31, 2013. GSE7307 expression data was downloaded from the database GEMMA (Zoubarev et al., 2012) on June 26, 2013.
Other microarray-based expression datasets used in meta-coexpression analysis were downloaded from GEMMA on January 18, 2013, and are listed in Table 5.1. Gene correlation was calculated using the cor function in R (R Core Team, 2013). Partial correlation was calculated using the ppcor package in R. Matrices containing maxima of Spearman and Pearson derived correlations were obtained using the pmax function in R. Full datasets or subsets were used as inputs as explained in the Results and listed in Table 5.2.

Table 5.1 Microarray datasets used to generate coexpression matrices across datasets.

Dataset          Tissue    Dataset      Tissue
GSE11512-Human   Brain     GSE7832      Lung
GSE11882         Brain     GSE13933     Lung
GSE12679         Brain     GSE10718     Lung
GSE13564         Brain     GSE14334     Lung
GSE17612         Brain     GSE27681     Lung
GSE35864         Brain     GSE5883      Lung
GSE4036          Brain     GSE16028     Blood
GSE4757          Brain     GSE27562     Blood
GSE5281          Brain     GSE27263     Blood
GSE7621          Brain     GSE26050     Blood
GSE28160         Brain     GSE7753      Blood
GSE3526          Brain     GSE19314     Blood
GSE9770          Brain     GSE25507     Blood
GSE20146         Brain     GSE11375     Blood
GSE14668         Liver     GSE13849     Blood
GSE15235         Liver     GSE14844     Blood
GSE11190         Liver     GSE33580     Blood
GSE10410         Liver     GSE18123.1   Blood
GSE17254.1       Liver     GSE7967      Blood
GSE29868         Liver     GSE11499     Blood
GSE17183         Liver     GSE31348     Blood
GSE1643          Lung      GSE10715     Blood
GSE37768         Lung      GSE10041     Blood
GSE18995         Lung      GSE29111     Blood
GSE24206         Lung
GSE19667         Lung

Table 5.2 Matrices of coexpression and phylogenetic similarity.
NAME                           Data source     Method              Multiple Datasets  Tissues
GTEX_All_Pcc                   GTEx (RNA-Seq)  Pearson             No                 All Tissues
GTEX_All_Scc                   GTEx (RNA-Seq)  Spearman            No                 All Tissues
GTEX_All_PccP                  GTEx (RNA-Seq)  Pearson Partial     No                 All Tissues
GTEX_All_SccP                  GTEx (RNA-Seq)  Spearman Partial    No                 All Tissues
GTEX_All_Max                   GTEx (RNA-Seq)  Max(Pcc, Scc)       No                 All Tissues
GTEX_All_MaxP                  GTEx (RNA-Seq)  Max(PccP, SccP)     No                 All Tissues
GTEX_AdiposeTissue_Pcc         GTEx (RNA-Seq)  Pearson             No                 Adipose Tissue
GTEX_AdiposeTissue_Scc         GTEx (RNA-Seq)  Spearman            No                 Adipose Tissue
Blood                          GTEx (RNA-Seq)  Pearson             No                 Blood
Blood                          GTEx (RNA-Seq)  Spearman            No                 Blood
Blood Vessel                   GTEx (RNA-Seq)  Pearson             No                 Blood Vessel
Blood Vessel                   GTEx (RNA-Seq)  Spearman            No                 Blood Vessel
Brain                          GTEx (RNA-Seq)  Pearson             No                 Brain
Brain                          GTEx (RNA-Seq)  Spearman            No                 Brain
Heart                          GTEx (RNA-Seq)  Pearson             No                 Heart
Heart                          GTEx (RNA-Seq)  Spearman            No                 Heart
Lung                           GTEx (RNA-Seq)  Pearson             No                 Lung
Lung                           GTEx (RNA-Seq)  Spearman            No                 Lung
Muscle                         GTEx (RNA-Seq)  Pearson             No                 Muscle
Muscle                         GTEx (RNA-Seq)  Spearman            No                 Muscle
Nerve                          GTEx (RNA-Seq)  Pearson             No                 Nerve
Nerve                          GTEx (RNA-Seq)  Spearman            No                 Nerve
Skin                           GTEx (RNA-Seq)  Pearson             No                 Skin
Skin                           GTEx (RNA-Seq)  Spearman            No                 Skin
Thyroid                        GTEx (RNA-Seq)  Pearson             No                 Thyroid
Thyroid                        GTEx (RNA-Seq)  Spearman            No                 Thyroid
GTEX_Averaged_Pcc              GTEx (RNA-Seq)  Pearson             Averaged           Averaged tissues
GTEX_Averaged_Scc              GTEx (RNA-Seq)  Spearman            Averaged           Averaged tissues
GSE7307_All_Pcc                GSE7307 array   Pearson             No                 All Tissues
GSE7307_All_Scc                GSE7307 array   Spearman            No                 All Tissues
Array_Merged_Brain             Microarray      Pearson             Merged             Brain
Array_Merged_Liver             Microarray      Pearson             Merged             Liver
Array_Merged_Lung              Microarray      Pearson             Merged             Lung
Array_Merged_SkeletalMuscle    Microarray      Pearson             Merged             Skeletal Muscle
Array_Merged_AllTissues        Microarray      Pearson             Merged             All Tissues
Array_Averaged_Brain           Microarray      Pearson             Averaged           Brain
Array_Averaged_Liver           Microarray      Pearson             Averaged           Liver
Array_Averaged_Lung            Microarray      Pearson             Averaged           Lung
Array_Averaged_SkeletalMuscle  Microarray      Pearson             Averaged           Skeletal Muscle
Array_Averaged_AllTissues      Microarray      Pearson             Averaged           All Tissues
Inparanoid_Binary              InParanoid      Agreement           No                 NA
Inparanoid_Binary_Cor          InParanoid      Pearson             No                 NA
Inparanoid_Binary_MI           InParanoid      Mutual information  No                 NA
Inparanoid_Bits_Cor            InParanoid      Pearson             No                 NA

5.2.6 Phylogenetic profiling

Phylogenetic profile data were constructed by downloading mappings from human proteins to other species from InParanoid (Östlund et al., 2010). Mappings were binarized into 0 (absent) and 1 (present) for the binary networks before calculating the fraction of agreement (where the genes are absent or present in both organisms), Pearson correlation (cor function in R (R Core Team, 2013)), or mutual information (entropy package in R) for all pairs of genes. The “Bits” network was constructed by multiplying InParanoid scores with the bit score for each cluster and calculating Pearson correlation (cor function in R).

5.2.7 Machine learning

Machine learning algorithms were run using the caret package. Sixty percent of pairs were used for training and 40% for testing. Parameters picked by cross-validation were ‘mtry’ of 2 for random forest and ‘C’ of 0.1 and ‘sigma’ of 0.2 for the radial support vector machine.

5.2.8 Biochemical validation experiments

Coagulation factor 11 (F11), coagulation factor 12 (F12), plasma kallikrein (KLKB1) and the chromogenic substrates for F12 (Catalog# S820340), F11 (2366 Catalog# S821090) and for KLKB1 (S2302) were from DiaPharma. Cleavage of chromogenic substrates was measured by absorbance at 405 nm, as recommended by the suppliers. KLK5 (Catalog# 1108-SE-010) and its quenched fluorescent substrate (Catalog# ES011) and KLK7 (Catalog# 2624-SE-010) and its quenched fluorescent substrate (Catalog# ES002) were from R&D Systems. Cleavage of quenched fluorescent substrates was measured using excitation/emission wavelengths of 380/460 nm for KLK5 and 320/405 nm for KLK7, as recommended by the suppliers. Serpin B12 was kindly provided by Dr. G.A.
Silverman (Children’s Hospital of Pittsburgh); serpin A4 was kindly provided by Dr. J. Chao (Medical University of South Carolina); murine serpin B8 was from Sino Biological Inc. (Catalog# 50215-M08B); and serpin B7 was from Creative BioMart (Catalog# SERPINB7-2596H). Protease activity was measured after incubation for 1 h at 37 °C with and without serpins in a POLARstar OPTIMA plate reader (BMG Labtech). Substrate cleavage and protease inhibition assays were also analyzed by 10% SDS-PAGE and silver staining of proteins after incubation at a 1:1 (w/w) protease:inhibitor ratio for 1 h at 37 °C.

5.3 Results

5.3.1 Protein-protein interactions

Protein-protein interaction (PPI) data are derived from biochemical methods, including high-throughput approaches such as yeast-2-hybrid or co-immunoprecipitation coupled to mass spectrometry, and low-throughput approaches such as identification of proteins comigrating upon gel electrophoresis. Our goal in considering other protein interaction data was to extract annotated physical interactions between a protease and an inhibitor where the inhibitor functionally inhibits the protease but an annotation in MEROPS is missing. We analyzed PPI networks of proteases and their inhibitors from the databases HIPPIE (Schaefer et al., 2012) and BioGRID (Chatr-Aryamontri et al., 2015), and a literature-curated PPI network (Bossi and Lehner, 2009). We compared 559 annotated cleavage and 325 inhibition interactions between proteases (including inactive proteases) and inhibitors from MEROPS to PPI data from HIPPIE (Schaefer et al., 2012). We found a protein-protein interaction between protease and substrate for 187 cleavages (33%) and between inhibitor and protease for 88 inhibitions (27%). Thus, these databases only captured a subset of protease web data.
Most of the overlapping interactions were simply (and correctly) annotated in both data types and could therefore not be used to assess predictive performance when predicting protease web interactions from protein-protein interactions. Figure 5.1A shows interactions annotated in HIPPIE between well-defined groups of proteases and inhibitors such as the proteasome, cathepsins, serum serine proteases, MMPs, and ubiquitin hydrolases (DUBs). As we found before from graph modeling of the protease web in chapter 4 (Fortelny et al., 2014), the connectivity in this network was forged by protease inhibitors. Upon removal of inhibitors from the network (Figure 5.1B), only known complexes such as the proteasome, DUBs, and some blood coagulation proteases remained connected. The remaining HIPPIE interactions can represent novel inhibitory interactions or other interactions of proteases such as cleavage or complex formation.

We thus investigated the possibility that novel inhibitors of proteases might be hidden in the high-throughput screens of PPI, which identify thousands of interactions that are often not followed up functionally. Protease inhibitors form complexes with the proteases they inhibit, generally exhibiting fast on-rates and slow off-rates that result in low KD values; some inhibitors form covalent bonds with their target proteases. We collected 96 protease-inhibitor PPIs not already annotated as inhibitory interactions in MEROPS and examined all the original publications that served as references for these interactions in HIPPIE (Table 5.3). In 28 cases (29%), an inhibition of protease activity was observed in the original paper, in 20 (21%) an inhibition was inferred from complex formation in the source publication, and 14 (15%) were based on a cleavage event.
Taken together, 62 (65%) of the 96 interactions were known protease web interactions, which were simply not annotated in protease databases, confirming the incomplete status of protease and protease inhibitor annotations reported in chapter 4 (Fortelny et al., 2014). Focusing on the remaining 34 interactions, we found that 18 (19%) reflected a PPI not related to cleavage or inhibition, 3 (3%) were unclear interactions, and 13 (14%) PPIs were physical interactions between proteases and inhibitors with no mention of inhibitory activity and therefore potentially interesting novel inhibitions.

Figure 5.1 Protein-protein interaction (PPI) networks constructed for proteases and inhibitors. (A) and (B) PPI network based on the HIPPIE database with a HIPPIE score cutoff of 0.7. Isolated nodes were removed. Nodes are colored according to their MEROPS class (Proteases: green – threonine; blue – metallo; yellow – serine; orange – cysteine; purple – aspartic; Inhibitors: red). Red edges are enzymatic interactions; thickness of edges corresponds to the HIPPIE score of the interaction. (A) Interactions between proteases and inhibitors. (B) The same network as (A) without inhibitors. (C) Network of interactions between proteases and inhibitors generated by high-throughput experiments from BioGRID. Edges are colored according to the study the edges were derived from (black – PMID19615732, green – PMID21145461, red – PMID21832049, blue – PMID22626734, purple – PMID22863883 and PMID22939629, grey – others). (D) Published network which was curated from current literature (Bossi and Lehner, 2009). Nodes are connected by multiple edges if the interaction was curated from multiple publications. Edges are colored to reflect the type of experiment (grey – low throughput, green – medium, blue – high throughput).

Table 5.3 Protein-protein interaction pairs of protease with inhibitor predicted as “inhibition” from HIPPIE.
InteractorA MeropsA   InteractorB MeropsB   HIPPIE Score  Annotation
XIAP        I32.004   HTRA2       S01.278   0.9           other interaction
APP         I02.015   BLMH        C01.084   0.89          other interaction
CASP9       C14.010   BIRC5       I32.005   0.88          inhibition
HTRA2       S01.278   BIRC3       I32.003   0.88          other interaction
KLK2        S01.161   SERPINF2    I04.023   0.86          inhibition (inferred)
KLK2        S01.161   SERPINB6    I04.011   0.86          inhibition (inferred)
KLK3        S01.162   SERPINA5    I04.004   0.86          inhibition (inferred)
F2          S01.217   SERPINE1    I04.020   0.86          inhibition
BIRC2       I32.002   CASP8       C14.009   0.86          interesting
SERPINA4    I04.003   KLK1        S01.UPA   0.86          inhibition
PLG         S01.233   SERPINE1    I04.020   0.85          indirect
SERPINA5    I04.004   KLKB1       S01.212   0.85          inhibition
HTRA2       S01.278   BIRC6       I32.006   0.83          cleavage
CASP9       C14.010   BIRC6       I32.006   0.82          inhibition
CASP3       C14.003   BIRC6       I32.006   0.81          cleavage
PRSS3       S01.174   TFPI        I02.011   0.78          other interaction
APP         I02.015   PSEN1       A22.001   0.76          cleavage
XIAP        I32.004   USP11       C19.014   0.76          interesting
CSTA        I25.001   USP53       C19.081   0.76          interesting
CTSB        C01.060   AMBP        I02.005   0.75          interesting
PLAU        S01.231   SERPINE2    I04.021   0.75          inhibition (inferred)
F10         S01.216   SERPINB8    I04.013   0.75          inhibition
SERPINF2    I04.023   CFD         S01.191   0.75          no interaction
CPB2        M14.009   A2M         I39.001   0.75          other interaction
SERPINA5    I04.004   F11         S01.213   0.75          inhibition
SERPINB8    I04.013   PRSS1       S01.127   0.75          inhibition
APP         I02.015   PSEN2       A22.002   0.75          cleavage
KLK13       S01.306   SERPINA1    I04.001   0.75          inhibition (inferred)
KLK13       S01.306   A2M         I39.001   0.75          inhibition (inferred)
CASP7       C14.004   BIRC6       I32.006   0.75          inhibition (inferred)
ADAMTS1     M12.222   A2M         I39.001   0.73          inhibition (inferred)
BIRC6       I32.006   USP8        C19.011   0.72          no interaction
XIAP        I32.004   HTRA1       S01.277   0.68          cleavage
CTSB        C01.060   SPINT2      I02.009   0.67          interesting
SERPINA1    I04.001   ADAMTS4     M12.221   0.65          other interaction
CASP9       C14.010   NAIP        I32.001   0.65          inhibition (inferred)
BIRC3       I32.003   USP19       C19.024   0.65          interesting
SERPINA3    I04.002   ADAMTS4     M12.221   0.63          other interaction
APP         I02.015   UCHL1       C12.001   0.63          interesting
COPS5       M67.002   SERPINB4    I04.008   0.63          other interaction
XIAP        I32.004   USP19       C19.024   0.63          interesting
BIRC2       I32.002   USP19       C19.024   0.63          interesting
BIRC2       I32.002   FAM108A1    S09.052   0.63          interesting
BIRC3       I32.003   CASP1       C14.001   0.63          other interaction
BIRC5       I32.005   USP9X       C19.017   0.63          other interaction
TIMP2       I35.002   NPEPPS      M01.010   0.63          interesting
HTRA1       S01.277   CAST        I27.003   0.63          inhibition
SERPINA3    I04.002   CELA1       S01.153   0.62          inhibition (inferred)
SERPINA3    I04.002   CTRC        S01.157   0.62          inhibition
KLK2        S01.161   SERPINA4    I04.003   0.62          inhibition
ELANE       S01.131   SERPINE1    I04.020   0.62          other interaction
PROC        S01.218   SERPINB6    I04.011   0.62          inhibition
PLAU        S01.231   SERPINB6    I04.011   0.62          inhibition
SERPINA5    I04.004   KLK1        S01.UPA   0.62          inhibition
APP         I02.015   F12         S01.211   0.62          other interaction
SERPINB12   I04.016   UCHL5       C12.005   0.56          interesting
BIRC2       I32.002   CASP2       C14.006   0.56          inhibition
CSTA        I25.001   UCHL5       C12.005   0.56          interesting
CD74        I31.002   CTSF        C01.018   0.55          cleavage
SERPINA3    I04.002   CTRL        S01.256   0.52          inhibition
CTSG        S01.133   SERPINB13   I04.017   0.52          inhibition
KLK2        S01.161   SERPINC1    I04.018   0.52          inhibition (inferred)
KLK2        S01.161   A2M         I39.001   0.52          inhibition (inferred)
KLK3        S01.162   PZP         I39.003   0.52          inhibition (inferred)
CTRC        S01.157   SPINK1      I01.011   0.52          inhibition (inferred)
CTSB        C01.060   SERPINB13   I04.017   0.52          cleavage
ELANE       S01.131   SERPINB13   I04.017   0.52          cleavage
PRSS2       S01.258   APP         I02.015   0.52          inhibition (inferred)
TMPRSS15    S01.156   SPINK1      I01.011   0.52          inhibition (inferred)
SERPINC1    I04.018   PLG         S01.233   0.52          other interaction
SERPINC1    I04.018   KLK6        S01.236   0.52          inhibition
PLG         S01.233   SERPINB13   I04.017   0.52          cleavage
PLG         S01.233   SERPINB6    I04.011   0.52          inhibition
PLG         S01.233   HRG         I25.022   0.52          other interaction
F2          S01.217   SERPINB8    I04.013   0.52          inhibition (inferred)
F2          S01.217   AMBP        I02.005   0.52          inhibition (inferred)
PLAU        S01.231   SERPINF2    I04.023   0.52          inhibition (inferred)
F10         S01.216   SERPINA5    I04.004   0.52          inhibition
F10         S01.216   SERPINB6    I04.011   0.52          inhibition
SERPINF2    I04.023   F11         S01.213   0.52          inhibition
SERPINF2    I04.023   KLK6        S01.236   0.52          inhibition
CPB2        M14.009   PZP         I39.003   0.52          other interaction
SERPINB13   I04.017   PRTN3       S01.134   0.52          cleavage
SERPINB13   I04.017   PRSS1       S01.127   0.52          cleavage
TFPI2       I02.013   F11         S01.213   0.52          inhibition
TFPI2       I02.013   KLKB1       S01.212   0.52          inhibition
MMP3        M10.005   SPOCK1      I31.006   0.52          other interaction
APP         I02.015   CASP4       C14.007   0.52          other interaction
APP         I02.015   CASP8       C14.009   0.52          cleavage
APP         I02.015   IDE         M16.002   0.52          cleavage
SPOCK1      I31.006   MMP2        M10.003   0.52          other interaction
SERPINA1    I04.001   PRSS1       S01.127   0.52          inhibition
SPINK5      I01.028   PRSS1       S01.127   0.52          inhibition
CASP4       C14.007   SERPINB9    I04.014   0.52          inhibition
CSTB        I25.003   CTSD        A01.009   0.52          cleavage
ADAM19      M12.214   A2M         I39.001   0.52          inhibition (inferred)

Of the 13 interactions in HIPPIE that suggested possible new protease/inhibitor interactions, 8 occurred between the ubiquitin hydrolase (DUB) USP19 and apoptosis inhibitors XIAP, BIRC2, and BIRC3 (Mei et al., 2011), and also between the DUB USP11 and XIAP (Sowa et al., 2009). These interactions between apoptosis inhibitors (which are often ubiquitin E3 ligases) and DUBs are well documented in the literature, but to our knowledge it has not been shown whether DUBs are inhibited by the apoptosis inhibitors. Potentially inhibitory PPIs in HIPPIE were also found between cystatin A and USP53 (Sowa et al., 2009) or UCHL5 (Fang et al., 2012). Cystatin A is an inhibitor of cysteine cathepsins, and thus DUBs, which are cysteine proteases, are potential targets of intracellular cystatin A or one of the other intracellular cystatins F, B, or 11. Similarly, serpin B12, which is only reported to inhibit thrombin and plasmin, interacted with UCHL5 (Fang et al., 2012). Cathepsin B was reported to bind the protein bikunin (AMBP) (Liu et al., 2006). Both genes are highly expressed in tumor tissue (Winnepenninckx et al., 2006), further indicating a functional interaction.
The MMP inhibitor TIMP2 interacted with the metalloprotease puromycin-sensitive aminopeptidase NPEPPS (Ewing et al., 2007), but TIMPs have only been shown to inhibit endopeptidases to date. Another PPI in HIPPIE was found between the cysteine protease inhibitor BIRC2 and the serine protease ABHD17A (also known as FAM108A1) (Yu et al., 2011). A complex of BIRC2 and caspase 8 (Micheau and Tschopp, 2003) likely represents a true but untested inhibition, since BIRCs are endogenous inhibitors of caspases. These testable hypotheses show that PPI data not only reflect known protease interactions but are also useful for predicting novel inhibitory interactions.

In addition to the data from HIPPIE (Schaefer et al., 2012), we compiled a high-throughput PPI network using the larger BioGRID database (Chatr-Aryamontri et al., 2015). Most of this network (Figure 5.1C) was derived from just a few publications. One source identified binding partners of amyloid precursor protein (APP), shown by the cluster of red edges around APP. Another source identified binding partners (blue edges) of DUBs. Finally, two different publications (purple) were centered on the proteasome. We concluded that this network was strongly biased toward the baits used in high-throughput screens, which were not relevant to our goal of identifying protease inhibitors. As a third source of PPI information, we obtained a previously published literature-curated PPI network (Bossi and Lehner, 2009), in which we separated source papers by the number of interactions identified in each. The resulting network (Figure 5.1D) showed that most edges come from low-throughput screens (grey). Medium-throughput screens identified exclusively proteasome interactions. Finally, we observed that high-throughput interactions between proteases and inhibitors are mostly also identified by low-throughput methods; they have thus likely been studied in detail and do not represent novel inhibition predictions.
These additional networks were thus subject to study biases as described previously (Gillis et al., 2014) and did not yield interesting predictions.

5.3.2 Coexpression patterns

We explored tissue expression profiles of proteases and inhibitors to seek useful patterns of coexpression, primarily in the Genotype-Tissue Expression project (GTEX) (Lonsdale et al., 2013) because of its high coverage (RNA-Seq data for 1,660 samples distributed over 26 different tissues). Expression patterns (Figure 5.2) distinguish between tissue-specific genes, which have a particular function in just a few tissues, and “housekeeping” genes with functions in most or all tissues. Different serpins either have a broad expression pattern across tissues (e.g. serpins E1, F1, and G1) or are specific to one or two tissues (e.g. serpins A1, C1, and D1), matching known targets such as coagulation proteins and kallikreins (Figure 5.2). Further examples of tissue-specific expression (granzymes in blood) and broad expression (e.g. a disintegrin and metalloproteinase domain proteins (ADAMs), MMPs, caspases, and cathepsins) are shown in Figure 5.2. In this latter group of broadly expressed genes, genes of one family (e.g. MMPs or caspases) tend to have highly dissimilar expression patterns, indicating specialized individual roles of these proteases in particular tissues. This distinction between tissue-specific and broadly expressed genes is especially helpful in the context of the protease web. The entire protease web is a broadly connected network in which protease activity can influence the activity of many other proteases, as described in chapter 4 (Fortelny et al., 2014). By tuning the expression of proteases and inhibitors, cells can thereby extend the core protease web of housekeeping genes with proteolytic pathways that are only activated conditionally in some tissues.
[Figure 5.2: heatmap panels for ADAMs, MMPs, coagulation proteases, kallikreins, granzymes, calpains, caspases, cathepsins, and serpins across tissues; color scales: log10(RPKM) and SD. Fortelny et al., 2015, Figure S1.]

Figure 5.2 Expression patterns of proteases and their inhibitors. Tissue RNA expression levels of groups of proteases and their inhibitors showing tissue-specific and broad expression patterns. Log10 transformed Reads Per Kilobase of transcript per Million mapped reads (log10(RPKM)) as obtained from GTEX (Lonsdale et al., 2013) are shown on the left. Zero values were set to 0.01 before log10 transformation. Normalized RPKMs for each gene are shown on the right, plotted as standard deviations from the mean (SD). Values were averaged across samples of each tissue.

It is a common belief, and an observation in culture, that expression of a protease inhibitor positively correlates with that of its target protease to counterbalance the destructive potential of the protease (Overall et al., 1989), or negatively correlates to facilitate proteolysis (Overall and Sodek, 1990). Gene coexpression is promising as a prediction tool because it is unbiased compared to protein interaction data, in that RNA expression is generally measured for all genes simultaneously using microarrays or RNA-Seq (Gillis and Pavlidis, 2011b). To utilize this for prediction, we generated 40 coexpression matrices of protease web proteins using different expression data and correlation methods (all matrices are summarized in Table 5.2).

First, we generated coexpression matrices using the Pearson and Spearman methods on the entire GTEX dataset. Pearson and Spearman correlation coefficients are both high when correlation is evident throughout the range of expression levels (Figure 5.3A). Pearson correlation coefficients are also high when both genes are expressed at high levels in some tissues, even if they are not correlated in other tissues (that is, the Pearson coefficient is sensitive to outliers; Figure 5.3B).
On the other hand, Spearman correlation coefficients are highest when the relation is maintained across most tissues and samples (Figure 5.3C). Both measures are low in cases where neither type of correlation exists (Figure 5.3D). To capture both patterns of coexpression that can represent true protease web interactions, we thus generated a matrix using the maximum of the Pearson and Spearman correlation coefficients for each pair of proteins (GTEX_All_Max).

Figure 5.3 Scatterplots of expression for selected pairs of proteins. Linear scatter plot and log/log plot of ST14/SPINT1 (A), F2/SERPINC1 (B), MMP2/A2M (C), and MMP11/TIMP1 (D).

[Figure 5.3 panels: linear and log/log scatterplots of SPINT1 vs. ST14, SERPINC1 vs. F2, A2M vs. MMP2, and TIMP1 vs. MMP11 expression; points colored by tissue (Adipose Tissue, Blood, Blood Vessel, Brain, Breast, Esophagus, Heart, Lung, Muscle, Nerve, Pancreas, pooled, Skin, Testis, Thyroid).]

We then generated coexpression matrices (Pearson, Spearman, and maximum of both) using partial correlation, which might help resolve complex correlation patterns between multiple variables (genes). To capture tissue-specific correlation, we generated coexpression matrices in subsets of GTEX limited to only one tissue (Figure 5.4A), and a network representing the average of the tissue-specific networks (each for both Pearson and Spearman correlation). We also measured Pearson and Spearman correlation coefficients in the GSE7307 microarray dataset (677 samples from over 100 tissues, Figure 5.4B). Finally, to obtain more robust coexpression results, we performed meta-analysis (Lee et al., 2004; Ballouz et al., 2015) of gene expression over multiple microarray datasets (listed in Table 5.1 and shown in Figure 5.4C), across all tissues and in a tissue-specific manner, in two ways: i) datasets were merged into one large dataset (Merged) and ii) Pearson correlation coefficients were obtained from each dataset and then averaged for each gene pair (Averaged; Gillis and Pavlidis, 2011b).

Figure 5.4 Tissue/sample distribution of datasets. (A) Tissue distribution of samples in the GTEX dataset. (B) Tissue distribution of samples in the GSE7307 microarray dataset. (C) Sizes of the microarray datasets used for coexpression meta-analysis.
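The coexpression measures described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the thesis: the function names are hypothetical, constant expression profiles are arbitrarily scored as 0, and the Merged variant is assumed to use Pearson correlation on the concatenated samples (the text only states this for the Averaged variant).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is scored as uncorrelated
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def max_coexpression(x, y):
    """Maximum of Pearson and Spearman, in the spirit of GTEX_All_Max."""
    return max(pearson(x, y), spearman(x, y))

def merged_pearson(datasets):
    """'Merged': pool the samples of all datasets, then correlate once."""
    xs = [v for x, _ in datasets for v in x]
    ys = [v for _, y in datasets for v in y]
    return pearson(xs, ys)

def averaged_pearson(datasets):
    """'Averaged': correlate within each dataset, then average the coefficients."""
    return sum(pearson(x, y) for x, y in datasets) / len(datasets)

# Toy profiles (invented numbers): one shared high-expression sample
# drives Pearson up, while the rank-based Spearman stays low.
protease = [1, 2, 1, 2, 1, 100]
inhibitor = [2, 1, 2, 1, 2, 90]
outlier_pair = (pearson(protease, inhibitor), spearman(protease, inhibitor))

# Two toy datasets with opposite within-dataset trends but different baselines.
toy_datasets = [([1, 2, 3], [1, 2, 3]), ([10, 11, 12], [30, 29, 28])]
merged, averaged = merged_pearson(toy_datasets), averaged_pearson(toy_datasets)
```

In the toy datasets, pooling the samples lets the baseline difference between datasets dominate the coefficient, whereas averaging per-dataset coefficients does not; this is one plausible reason why the two meta-analysis variants can disagree.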
[Figure 5.5: (A) heatmap with the coexpression and phylogenetic similarity matrices of Table 5.2 (Inparanoid_*, Arrays_Merged_*, Arrays_Averaged_*, GSE7307_All_*, and GTEX_* matrices) as rows and columns; color key and histogram of values from 0 to 1. (B) binned scatterplot.]

Figure 5.5 Description and comparison of coexpression and phylogenetic similarity matrices obtained. (A) Heatmap of correlation values (agreement) showing similarity and dissimilarity of the matrices described in Supplementary results and Table 5.2. For each pair of matrices, the Pearson correlation coefficient across all coexpression values in both matrices is shown.
(B) Binned scatterplot of GTEX_All_Pcc (Pearson) and GTEX_All_Scc (Spearman) coexpression values for all pairs of proteins. Counts are the number of pairs in each bin. Blue lines indicate a threshold of the mean plus two standard deviations. The overlap in predicted pairs above threshold (top right corner) is small, despite the high correlation of these matrices.

The resulting coexpression matrices differed strongly in content depending on the methods and data used (Figure 5.5A). Pearson and Spearman correlation resulted in highly similar coexpression matrices, as seen in all matrices based on GTEX (Lonsdale et al., 2013) data, across (r = 0.67) or within tissues (r > 0.8). Tissue-based coexpression yielded similar results, as seen in the GTEX tissue-based networks. In a few cases similarities were explained by tissue similarity (r(GTEX_Heart_Pcc, GTEX_Muscle_Pcc) = 0.4), but in many cases similarities were likely due to constitutive coexpression of genes (similarities of adipose tissue, nerve, blood vessel, thyroid, and lung based coexpression matrices). Averaging tissue-specific matrices seemed to extract generally coexpressed pairs, so that the GTEX_Averaged networks correlated with all tissue-based GTEX matrices (average r = 0.55).

Interestingly, this in-tissue averaged coexpression (GTEX_Averaged) also correlated somewhat (r > 0.4) with across-tissue coexpression (GTEX_All), despite the methodological differences. Finally, coexpression matrices derived using partial correlation showed similarities between partial Pearson and Spearman correlation, but differed from all other matrices. In conclusion, differences were large between tissue-specific and across-tissue use of the data, and averaging tissue-based networks extracted common features of the tissue-specific networks.
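The thresholding shown in Figure 5.5B can be illustrated with a short sketch: each matrix is cut at its own mean plus two standard deviations, and the resulting sets of predicted pairs are compared. This is only an illustration; the function names, the use of the population standard deviation, the Jaccard index as overlap measure, and all toy scores are assumptions that do not come from the thesis.

```python
from statistics import mean, pstdev

def predicted_pairs(scores, n_sd=2.0):
    """Pairs whose coexpression exceeds mean + n_sd * SD of all values.

    `scores` maps a pair identifier to its coexpression value.
    Assumption: the population standard deviation is used.
    """
    values = list(scores.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return {pair for pair, value in scores.items() if value > cutoff}

def jaccard(a, b):
    """Overlap of two prediction sets as |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy matrices (invented values): 20 pairs, mostly weakly coexpressed.
pcc = {f"pair{i}": 0.1 for i in range(20)}
scc = dict(pcc)
pcc.update({"pair0": 0.9, "pair1": 0.8})
scc.update({"pair1": 0.9, "pair2": 0.8})

top_pcc = predicted_pairs(pcc)
top_scc = predicted_pairs(scc)
shared = jaccard(top_pcc, top_scc)
```

Because only the extreme tail of each matrix passes the cutoff, two matrices can agree on most values and still share few predicted pairs, which is why predictive power has to be assessed for each matrix separately.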
The effect of data sources was limited, as shown by the similarity of the RNA-Seq based GTEX and microarray based GSE7307 across-tissue matrices, which was high (average r = 0.43) considering that these matrices stemmed from different samples, tissues, and instrumentation. Finally, meta-analysis of microarray datasets resulted in matrices different from the above GTEX based matrices, probably due to differences in samples and methodology. However, similar methods again seemed to result in similar matrices, so that Arrays_Averaged_All correlated with GTEX_Averaged (r = 0.46) and with the individual GTEX tissue-based matrices, and Arrays_Merged_All correlated somewhat with GTEX_Full_Pcc (r = 0.37), which was also the most similar methodologically. Coexpression values between some matrices were highly correlated overall (matrices GTEX_All_Pcc and GTEX_All_Scc with r = 0.67), as shown in Figure 5.5B, indicating similarity between methods. Yet, when interacting pairs were predicted by applying a coexpression cutoff (blue lines in Figure 5.5B), the predictions showed only a small overlap (top right corner). Predictive power thus needs to be assessed separately for each matrix.

We compared the power of the coexpression matrices to predict protease inhibition (inhibitory protease-inhibitor pairs). We measured the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for separating predefined true positive pairs (annotated) from true negative protease-inhibitor pairs (enzymatically implausible) using a given coexpression matrix. Figure 5.6A shows that all matrices correlated with protease web interactions (better than random picking, with AUC > 0.5). True positive pairs thus had higher coexpression than true negative pairs. However, considering the common perception that protease inhibitors are coexpressed with their target protease, this signal was surprisingly low (AUCs < 0.7).
One explanation might be that RNA levels do not correspond to protein levels and thus proteins can be coexpressed whereas their mRNAs are not. We tested this by creating a coexpression network based on proteomics quantification data in the Human Proteome Map (HPM) (Kim et al., 2014), which performed even worse than the other networks (AUC of 0.6, data not shown). While this poor performance might be due to noise in the HPM data, it does not support the idea that mRNA data are inherently worse than protein level data. Similar results were found when measuring the accuracy of interaction prediction in the top 10% of coexpressed pairs of each matrix (Figure 5.6B). There, the Arrays_Averaged_Liver and GTEX_All_Max matrices had the highest signal. The strong correlation of Arrays_Averaged_Liver with protease web data was expected, because many interactions are known between proteases and inhibitors that are expressed in liver as part of the well understood complement and coagulation systems. However, this network was biased toward a subset of the protease web and is likely not very helpful for the remainder of the network. The high AUC and accuracy of GTEX_All_Max demonstrated the usefulness of this network, which combined tissue-specific and across-tissue coexpression and was not restricted to any specific tissue, making it the best candidate for prediction.
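The AUC evaluation used above can be sketched as follows, with true negatives repeatedly subsampled to the number of true positives as stated in the legend of Figure 5.6. This is a hedged illustration rather than the thesis implementation: the function names, the seed, and the toy scores are assumptions, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half).

```python
import random

def roc_auc(pos_scores, neg_scores):
    """AUC of the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(pos_scores) * len(neg_scores))

def subsampled_aucs(pos_scores, neg_scores, n_rounds=200, seed=0):
    """AUCs over repeated balanced subsamples of the true negatives.

    Each round draws as many negatives as there are positives; the
    resulting distribution can be summarized as a boxplot.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    k = min(len(pos_scores), len(neg_scores))
    return [roc_auc(pos_scores, rng.sample(neg_scores, k))
            for _ in range(n_rounds)]

# Toy coexpression values (invented): annotated inhibitions vs. implausible pairs.
true_positive_scores = [0.9, 0.7, 0.6]
true_negative_scores = [0.8, 0.5, 0.4, 0.3, 0.2, 0.1]
aucs = subsampled_aucs(true_positive_scores, true_negative_scores)
```

An AUC of 0.5 in this formulation means that a positive pair outranks a negative pair only half of the time, which matches the "random guessing" baseline used in the figure.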
[Figure 5.6: boxplots per matrix for (A) receiver operator AUC (axis 0.4 to 0.7) and (B) accuracy above 10% of nodes (axis 0.45 to 0.60); matrices are grouped by type: GTEX across tissues, GTEX by tissue, GSE7307 across tissues, Arrays averaged, Arrays merged, Inparanoid.]

Figure 5.6 Performance of coexpression and phylogenetic similarity matrices in predicting protease inhibition. True positives are inhibitions (n = 218); true negatives are specific inhibitor/protease pairs for which inhibition is enzymatically implausible. True negatives were subsampled to match the number of true positives (n = 218, 200 times). AUC values obtained from each sample are represented as boxplots. (A) AUC of the ROC curve. An AUC of 0.5 represents lack of predictive power (random guessing). An AUC of 1 represents perfect prediction. (B) Accuracy of prediction (percent of correct classifications) when predicting interactions as the top 10% of pairs of each matrix.

5.3.3 Phylogenetic profiles

Phylogenetic similarity of two genes is a measure of their co-occurrence across a range of taxa: the presence and absence of each gene across taxa (its phylogenetic profile) is recorded, and the correlation of the two resulting profiles is then measured (Pellegrini et al., 1999). We created similarity matrices of phylogenetic profiles of proteases and inhibitors from the InParanoid (Östlund et al., 2010) database using correlation of sequence similarity scores as well as correlation and mutual information measures of binary presence/absence values of proteases across the 162 species represented in the database (Table 5.2). The resulting matrices were very similar to each other (correlation 0.7-0.98) but dissimilar from the coexpression matrices (Figure 5.5A), indicating that this data source could be used orthogonally to coexpression. Prediction performance of phylogenetic similarity matrices was generally comparable to that of coexpression matrices but weaker than GTEX_All_Max in the top pairs (Figure 5.6B).

5.3.4 Colocalization

Interacting proteins are often thought to be localized to the same subcellular compartment.
To measure the predictive power of this information, we obtained subcellular localization annotation of proteins from LocDB (Rastogi and Rost, 2011), the Human Protein Atlas (Uhlen et al., 2010), and GO (Ashburner et al., 2000). We defined four groups of localization: (i) “CO”, colocalized, for proteins sharing localization annotation; (ii) “AT”, anti-localized, where one protein was extracellular and the other was in the cytosol, or where one protein was in an organelle and the other in the cytosol (so that they can meet upon cellular stimuli); (iii) “NC”, not colocalized, where neither of the above was the case; and (iv) “NA”, not annotated, where subcellular localization was not annotated for one or both of the proteins. The use of colocalization enriched true positive interactions considerably (Table 5.4). However, colocalization also reduced the number of remaining pairs (the search space) significantly, mostly because of lack of annotation (NA). Using colocalization for novel predictions might increase specificity but substantially reduce sensitivity.

Table 5.4 True positive (TP) and true negative (TN) inhibitor-protease pairs after applying colocalization filters.

Filter                      TN      Unknown   TP    TP:TN
All                         6,990   32,368    294   4.21%
Colocalized (CO)            430     1,747     65    15.12%
Anti-localized (AT)         466     1,757     30    6.44%
Information missing (NA)    5,338   25,135    159   2.98%
Remaining                   756     3,729     40    5.29%

5.3.5 Co-annotation and co-mentioning in the literature

Interacting genes often participate in similar cellular functions, so it is possible to predict gene interactions based on their annotation patterns (King et al., 2003). We considered co-annotation and co-mention as predictive features but dismissed both on theoretical and practical grounds.
With this approach, it is hard to distinguish between results that are de novo predictions, where novel interactions or functions are predicted, and mere retrieval of information already present in the literature but not yet annotated in databases such as MEROPS (Pavlidis and Gillis, 2013), as we observed above for PPI data. A related serious difficulty with this approach is that annotation is strongly biased by patterns of publication and gene annotation (Gillis and Pavlidis, 2011a). Furthermore, if GO annotations or the literature linked a protease-inhibitor pair, the interaction would likely already have been characterized biochemically. Estimates of prediction performance based on co-annotation would thus appear over-optimistic. Predicted pairs would represent examples of information retrieval and not de novo predicted pairs. Because of the bias in annotation, poorly characterized proteins would likely never be predicted to be associated. Furthermore, many proteases and inhibitors are “functionally related” in cascades or biological processes without having physical interactions, whereas here we were only interested in direct interactions, especially between previously unconnected cascades. Overall, co-mention and co-annotation lack the coverage and detail required to make predictions about particular novel candidate pairs, and so we disregarded this feature.

5.3.6 Predicting inhibitions

Having observed moderate correlation between the protease web and individual gene expression and phylogenetic similarity matrices, we considered combining these matrices as features to predict protease inhibition. We trained three common machine-learning algorithms to derive a function that combines the features (the coexpression and phylogenetic similarity matrices from Table 5.2). We used random forest, linear discriminant analysis, and support vector machine to classify true positive and true negative inhibitor-protease pairs.
Each pair had a feature vector containing one value for each matrix. We trained the classifiers on a training set of true positive and true negative interactions and compared their performance on a test set (Figure 5.7), where we also evaluated the performance of the GTEX_All_Max matrix alone. While the AUCs of these classifiers were slightly above the AUC of GTEX_All_Max, the difference was small and, more importantly, the performance in the critical region of the top-scoring pairs, where specificity is high (the lower left-hand part of the ROC curve), was indistinguishable. The low performance was probably due to the small number of positive training examples, the absence of learnable patterns in the data, or both. Being unable to improve predictions by aggregation, we only used the GTEX_All_Max matrix for predicting true protease-inhibitor pairs. This matrix was the best performing among those tested and, furthermore, represents a measure of similarity that is transparent (being easily interpreted and visualized) and reasonable (fitting our observation of expression patterns).

Figure 5.7 Performance of machine learning algorithms in predicting protease-inhibitor pairs. Sensitivity plotted versus specificity for the three machine-learned classifiers random forest (red), linear discriminant analysis (green, LDA), and support vector machines (blue, SVM) compared to the coexpression matrix GTEX_All_Max (black). Comparing the machine-learned classifiers to the original coexpression matrix GTEX_All_Max, performance is slightly higher overall (lines closer to the top left corner), but not in the top pairs (bottom left corner).

Table 5.5 True positive (TP) inhibitor-protease pairs are enriched compared to true negatives (TN) using coexpression and colocalization filters.
Filter                        N                TP            TN             TP:TN
Total                         39,652 (100%)    294 (100%)    6,990 (100%)   1:24
Correlated (R>0.6)            1,488 (3.8%)     46 (15.6%)    205 (3%)       1:4.5
Correlated & colocalized      140 (0.4%)       10 (3.4%)     18 (0.3%)      1:1.8
Correlated & antilocalized    93 (0.2%)        6 (2%)        14 (0.2%)      1:2.3

We thus focused on our best performing individual matrix for prediction, GTEX_All_Max, selecting a coexpression threshold of 0.6. This improved the ratio of annotated true positives to true negatives (TP:TN) from 1:24 to 1:4.5, thus enriching true positives 5.5-fold (Table 5.5). We then attempted to combine this measure with colocalization. Colocalized protease-inhibitor pairs resulted in a further enrichment to a TP:TN ratio of 1:1.8 (3-fold), and anti-localization enriched results to a TP:TN ratio of 1:2.3 (2.4-fold). However, colocalization information was missing for many proteins, so we also picked pairs where no localization information was available for one of the proteins. Finally, we removed all pairs that were considered enzymatically implausible: we only selected inhibitor-protease pairs where (i) the inhibitor had been shown to inhibit a protease from the same family as the predicted protease or (ii) the protease had been shown to be inhibited by an inhibitor from the same family as the predicted inhibitor. These two filters reduced the search space from 1,239 coexpressed pairs to 270 pairs. We anticipated that the incorporation of enzymatic information would greatly increase the precision of predictions. Sensitivity is lost if not all target protease families of an inhibitor, or not all inhibitor families of a target protease, are annotated as such, which we considered unlikely since relevant inhibitor families are known for most proteases. Inhibitor-protease pairs meeting the above criteria are shown in Figure 5.8.

Figure 5.8 Coexpression of predicted inhibitor-protease pairs shown as a network.
Proteases and inhibitors were connected if their coexpression (Spearman or Pearson correlation of expression values) was higher than 0.6 (GTEX_All_Max), and the inhibitor was annotated to inhibit a protease from the same family as the target or the protease was annotated as inhibited by an inhibitor from the same family. Solid black lines show colocalization (proteins in the same compartment), black dashed lines anti-localization (one protein extracellular and the other in the cytosol, or one in an organelle and the other in the cytosol), and solid grey lines missing annotation of one of the proteins. The resulting network is clustered (x- and y-axis are arbitrarily arranged to reflect clustering). Clusters are highlighted and labeled A-H. Heatmaps show normalized expression of genes in each cluster. Predictions (proteases, inhibitors, and their connections) selected for biochemical testing are highlighted in red. (Fortelny et al., 2015, Figure 6.)

To test our predictions, we selected 9 pairs for biochemical validation experiments: inhibition of factor XI (F11), factor XII (F12), and plasma kallikrein (KLKB1) by kallistatin (SERPINA4) (3 pairs), as well as inhibition of kallikrein 5 and kallikrein 7 by serpins B7, B8, and B12 (6 pairs). All pairs conformed to the criteria established in the previous section; furthermore, all serpins are in clan ID and all selected proteases are in clan PA. Finally, all serpins have a lysine in their reactive loop and all proteases have lysine/arginine specificity in P1, as observed in MEROPS (Rawlings et al., 2012) and TopFIND (Lange and Overall, 2011; Fortelny et al., 2015). Kallistatin is known to inhibit kallikreins, but is highly expressed in the liver together with the coagulation proteins factor XI, factor XII, and plasma kallikrein, and was therefore predicted by our algorithm to inhibit these proteases.
While this indicated an interesting new role for kallistatin, the prediction was also uncertain: all four proteins are well studied, so a previously undiscovered inhibition is unlikely. By including these pairs, we avoided cherry-picking only the most promising pairs for testing. On the other hand, serpins B7, B8, and B12 are little-studied inhibitors, and thus interesting targets, that were coexpressed with the proteases kallikrein 5 and kallikrein 7. We tested all 9 pairs by fluorescent substrate cleavage assays in vitro and found strong inhibition of kallikrein 5 by serpin B12 (Figure 5.9A), which we confirmed by analysis of covalent complex formation on SDS polyacrylamide gels (Figure 5.9B). Having predicted a biochemically true inhibition from the coexpression data, we concluded that this interaction is also likely to be physiologically relevant, since these proteins are expressed together. The interaction could also represent an interesting drug target, since kallikrein 5 is a major regulator in a number of diseases (Prassas et al., 2015).

Figure 5.9 Inhibition of KLK5 by SERPINB12.
(A) Cleavage of the quenched fluorescent substrate ES011 by kallikrein-5 (KLK5) was followed over time after preincubation with different molar ratios of serpin B12 (SERPINB12) (as indicated). A decrease in kallikrein-5 activity (A.U. – arbitrary units) with increasing serpin B12 confirms that serpin B12 inhibits kallikrein-5 as predicted. (B) Silver-stained 10% SDS-PAGE gel of kallikrein-5, serpin B12 and the inhibitory kallikrein-5 : serpin B12 complex, indicating that the serpin is covalently bound to the protease.

5.3.7 Limited coexpression of proteases and their inhibitors

Despite the success in validating a predicted inhibitory interaction, we were intrigued by the low coexpression of protease-inhibitor pairs (Figure 5.6) and the limited success in validating predicted coexpression pairs.
To analyze differences in the predictions of the various methods, we treated the top 10% of pairs in each matrix as predictions and compared the true inhibitions recovered among them. Investigating the overlap between these top predictions from each matrix (Figure 5.10A), we observed drastically different results between phylogenetic similarity and coexpression-derived matrices (Figure 5.10B), as previously observed in Figure 5.5A for the entire matrices. This analysis also confirmed GTEX_All_Max as the most reasonable choice for prediction, since it captures both tissue-specific and tissue-spanning coexpression patterns (Figure 5.10C) as well as most pairs identified in other coexpression matrices (Figure 5.10D).

To further understand the low correlation between our matrices and the protease web, we investigated well-studied examples of annotated inhibitor-protease pairs, in particular serpins that regulate coagulation. In many cases, these serpins (e.g. protein Z-dependent protein inhibitor (serpin A10), antithrombin III (serpin C1), heparin cofactor 2 (serpin D1), and alpha-2-antiplasmin (serpin F2)) were coexpressed in liver with coagulation proteins that are destined for export to blood. However, many more targets are known for these inhibitors; for example, alpha-2-antiplasmin inhibits kallikreins 4, 5, 7, 13, and 14. These proteins had uncorrelated expression patterns and none were highly expressed in liver (Figure 5.2), and they therefore could not be simultaneously associated with their known inhibitors. Again, this is likely because these are special cases of protease-inhibitor pairs in which one partner is expressed in the liver and exported to peripheral tissues or fluids where the other interactor is expressed. This was the case for all coexpression and phylogenetic similarity measures, as shown for serpin F2 (Figure 5.11A).
On the other hand, glia-derived nexin (serpin E2), plasma serine protease inhibitor (serpin A5), and serpin B6, which are known to inhibit thrombin (F2), are not strongly expressed in liver and thus also were not coexpressed with the protease. Another interesting inhibitor is neuroserpin (serpin I1), which was predominantly expressed in brain, but inhibits urokinase plasminogen activator (PLAU) and tissue plasminogen activator (PLAT), as well as plasmin (PLG), which have very different expression patterns.

Figure 5.10 Similarity of coexpression and phylogenetic similarity matrices for recovering protease web inhibitions.
Annotated true positive inhibitor-protease pairs found among the top 10% of pairs (predictions) from each matrix. (A) Heatmap of the overlap of recovered inhibitions. Values above 30 were capped. (B) Overlap of recovered inhibitions between GTEX_All_Max and Inparanoid matrices. (C) Overlap between various coexpression matrices. (D) Overlap of recovered pairs in GTEX-derived matrices using Spearman and Pearson correlations.
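The overlap analysis summarized in Figure 5.10 can be illustrated with a minimal sketch. It assumes each matrix is available as a mapping from an inhibitor-protease pair to a score; the function names, the toy scores, and the placeholder pair names (INH0, PROT0, ...) are hypothetical and are not taken from the thesis code or data. The sketch ranks the pairs of each matrix, keeps the top fraction as predictions, and counts the annotated true pairs that two matrices recover in common.

```python
from itertools import combinations


def top_fraction(scores, fraction=0.10):
    """Return the set of pairs whose score ranks in the top `fraction`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return set(ranked[:n_top])


def recovered(scores, true_pairs, fraction=0.10):
    """Annotated true pairs found among the top-ranked pairs of one matrix."""
    return top_fraction(scores, fraction) & true_pairs


def recovery_overlap(matrices, true_pairs, fraction=0.10):
    """For every two matrices, count the true pairs recovered by both."""
    hits = {name: recovered(s, true_pairs, fraction) for name, s in matrices.items()}
    return {(a, b): len(hits[a] & hits[b]) for a, b in combinations(sorted(hits), 2)}


# Hypothetical toy data: ten placeholder pairs scored by two matrices.
pairs = [(f"INH{i}", f"PROT{i}") for i in range(10)]
coexpression = dict(zip(pairs, [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0]))
phylogeny = dict(zip(pairs, [0.1, 0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.5, 0.6, 0.0]))
true_pairs = {pairs[0], pairs[1], pairs[3]}  # placeholder "annotated" inhibitions

overlap = recovery_overlap(
    {"coexpression": coexpression, "phylogeny": phylogeny},
    true_pairs,
    fraction=0.3,  # a larger fraction than 10% so the toy example is non-trivial
)
print(overlap)
```

In this toy example each matrix recovers two of the three placeholder true pairs, but only one of them is shared, which mirrors the qualitative observation that different matrices recover different inhibitions.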
Figure 5.11 Recovery of inhibitor-protease pairs in matrices.
Recovered pairs amongst the top 10% of ranked pairs in each matrix marked in black. Columns are coexpression and phylogenetic similarity matrices from Table 5.2. Rows are protease genes that are known to be inhibited by (A) SERPINF2, (B) SPINK6, (C) TIMP2, and (D) A2M.

These observations were not limited to serpins but applied to all groups of inhibitors: While Serine Protease Inhibitor Kazal-type 2 (SPINK2) was coexpressed with its only target protease acrosin (ACR), Serine Protease Inhibitor Kazal-type 6 (SPINK6, Figure 5.11B) was not coexpressed with acrosin or with a number of kallikreins, all of which it inhibits.
Apoptosis inhibitors BIRC2, 3, 5, and XIAP were weakly coexpressed with some of the caspases that they inhibit (BIRC2 and caspase 7, BIRC3 and caspase 3), but not with other known target caspases, possibly also because they are multifunctional proteins. Cystatins and TIMPs (Figure 5.11C) were in most cases not coexpressed with any of their annotated target proteases. The few exceptions (cystatin-M (CST6) with cathepsin L2 (CTSL2), TIMP2 with MMP2 or MMP14) were drowned out by the number of biochemically plausible, but not coexpressed, target proteases (3 for CST6 and 15 for TIMP2). Finally, alpha-2-macroglobulin (A2M, Figure 5.11D), a highly multifunctional protease inhibitor, inhibits many different proteases, but is only coexpressed with a few of them. However, alpha-2-macroglobulin is present in plasma and so can reach most tissues (especially in inflammation, where blood vessel permeability is increased) and thus most extracellular proteases or intracellular proteases that are released from damaged cells. Thus, many inhibitors were coexpressed with only a few of their known target proteases and differed in expression patterns from many other targets. We conclude that some of these proteases with different expression patterns might not be the main physiological target, or might be encountered by the inhibitor only under very specific conditions.

5.4 Discussion

We evaluated multiple data types for their utility in predicting inhibitory interactions between protease inhibitors and their target proteases. Protein-protein interaction data, coexpression and phylogenetic profiles seemed useful, alongside structural protein information to limit the pairs to biochemically “rational” pairs. Protein-protein interaction data highlighted potential inhibitions and especially served to identify gaps in annotation of known inhibitions.
Coexpression was a promising predictor, because it is free of many biases of other data types and because of the intuition that proteases and their inhibitors should be coexpressed. We exploited the bipartite expression patterns of proteases and inhibitors, where some were expressed in most tissues and others in specific tissues. Limiting the search space using coexpression and structural information was useful to target biochemical assays to likely physiologically relevant inhibitions. Finally, we outlined a computational approach to guide biochemical validation to pairs that are also physiologically plausible. This approach identified a target protease for the poorly studied inhibitor serpin B12, but also led to 8 false positives, a level of validation comparable to other prediction evaluations (Pavlidis and Gillis, 2013).

We have previously shown that cross-validation performance is a very poor predictor of how a “guilt-by-association” method will do in reality (Gillis and Pavlidis, 2012). Biochemical evaluations such as ours are essential to give more realistic estimates of expected performance. Considering other cases where new predictions were tested, the best-performing biochemically validated protein interaction prediction method (PrePPI) (Zhang et al., 2012, 2013) was successful in 15 of 19 biochemical validation experiments. However, in that case, predictions were guided by GO annotation, possibly reflecting information retrieval rather than the harder de novo problem, and were hand-selected for validation based on plausibility. This performance estimate is thus likely over-optimistic. Indeed, in a recent mass spectrometry screen of interaction partners of AMPK-α1 and -β1 (Moon et al., 2014), only 63 of the 381 interaction partners identified overlapped with the 1235 partners predicted by PrePPI. Furthermore, to our knowledge, no such prediction method has been adopted by biochemists for routine prioritization (Pavlidis and Gillis, 2013).
We posit that this may be due to low validation rates. We confirm that validation rates are low in a realistic setting, when we refrain from using GO annotation and from cherry-picking validation targets. Expression patterns and enzymatic knowledge have been used by biochemists in isolated experiments to identify protease inhibitors for many years (Scott et al., 1999; Sun et al., 1996), and novel discovery is therefore difficult. However, our approach has proven successful in filling gaps in our knowledge.

We hypothesized that prior knowledge of structural constraints in the form of MEROPS family and clan information would greatly simplify the inhibitor prediction task. While these constraints were important, they did not supply sufficient resolution to distinguish true from false pairs. Similar limitations exist for protein-protein interaction predictions, such as PrePPI (Zhang et al., 2013), where for example the protease inhibitors TIMP 1-4 are predicted to interact with almost all MMPs. However, this is clearly not the case according to MEROPS, where inhibition is found between specific pairs (Rawlings et al., 2012). Together with phylogenetic information, these PPI approaches suffer from a common weakness: despite the similarities between members of protease and inhibitor families in MEROPS, individual inhibitors still specifically target individual members of protease families. Phylogenetic information and structural classification cannot easily distinguish between family members. Although this information can be used to decrease the search space, specificity is still not sufficiently high. Our work suggests that even with considerable opportunities to constrain the problem based on prior knowledge, prediction of protease inhibition is nearly as difficult a problem as the general protein-protein interaction prediction task.

Contrary to our expectations and common assumptions, coexpression data were of surprisingly limited utility.
We did not observe coexpression of many protease inhibitors with their target proteases (Figure 5.11). Coexpression for these pairs was not detectable, despite our extensive efforts in calculating coexpression across human tissues and within tissues, based on proteomics and using machine learning to combine networks. The reasons for this could be technical, in that coexpression is generally the case for protease-inhibitor pairs but not discernible from current data using current techniques. Alternatively, the explanation may be biological, in that true pairs are indeed not coexpressed. We found that technical limitations include the restricted number of annotated true positive pairs, which impedes learning of larger patterns in the data, in particular when aggregating data. Similarly, additional tissue samples might significantly improve predictions of pairs with tissue-specific coexpression, in particular if they encompass tissues other than the ones currently available. Another uncertainty lies in the relevance of the true pairs. Many biochemically true inhibitions generated in vitro might not be physiologically relevant, which impedes learning of physiologically pertinent patterns. However, well-validated, physiologically relevant inhibitors were equally inconsistently coexpressed with their proteases. From our comprehensive experimentation with coexpression we therefore conclude that coexpression is not the norm for protease-inhibitor pairs.

A biological reason for the decoupling of protease and inhibitor expression may lie in the mobility of proteins in an open biological system with interactions between tissues and cells. Inhibitors in a tissue could inhibit proteases entering or released in that tissue, or might themselves be expressed in a particular tissue and then exported to other sites.
For example, complement and coagulation proteases are transported from the liver to their place of action via blood plasma; granzymes or neutrophil proteases such as elastase are stored in cytosolic vesicles and secreted by immune cells upon stimulation in target tissues. Furthermore, in complex networks, where a protease interacts with many inhibitors and an inhibitor inhibits many proteases, the expression pattern of one gene represents a combination of the patterns of its interactors and is thus not clearly correlated with any one of them individually. Partial correlation measures the correlation between two variables (genes) while controlling for a set of control variables (other genes) and so could alleviate this problem. We did not observe increased performance using partial correlation, possibly due to small sample size and noise. In conclusion, we have shown that false negatives (undetectable true interactions) are, to date, an inherent problem of predicting protease inhibition.

Knowing that many true inhibitor-protease pairs are not coexpressed, we also found that coexpressed pairs are often not inhibitory. This can partially be explained by considering that genes work in pathways, so that a gene is coexpressed with genes in its pathway that are not direct interactors. One example of this is the coexpression of serpin B12 with kallikrein 5 and kallikrein 7. Serpin B12 inhibits kallikrein 5, but not kallikrein 7. Kallikrein 5 is an activator of kallikrein 7 (Prassas et al., 2015), so that serpin B12 and kallikrein 7 are involved in the same pathway and coexpressed, but do not directly interact. Network effects might thus explain the low validation success rate here and in other prediction studies.

We anticipate developments in the future that will improve the validation rate, such as more comprehensive expression studies and increasingly reliable proteomics quantification.
Prediction will also profit from a better definition and larger numbers of true positive and true negative inhibition examples used for (machine) learning to identify pertinent patterns and to meaningfully combine data. Another improvement is expected in the use of enzymatic plausibility. Relying on structural information of the active site derived from MEROPS clans could be improved upon by highly detailed molecular docking analyses of the whole protein or of specific binding domains such as exosites, which are known to influence protease cleavage (Overall, 2002). In a similar fashion, proteases that can efficiently cleave the reactive site loop of serpins could be prioritized, which will become possible once the cleavage specificity of proteases is defined more clearly. With the groundwork laid for this novel prediction task, we anticipate significant improvements over the coming years.

In summary, our study provides a novel in-depth example of the challenges in predicting protein interactions. Our work revealed much about the properties of protease-inhibitor interactions as reflected in protein-protein interaction, coexpression, colocalization, and phylogenetic similarity. We find that the lack of data to discriminate interacting from non-interacting pairs remains a limiting factor. It is likely that many other protein interaction prediction methods will suffer from comparable difficulties. We consciously avoided aggregating data as much as possible in a “black box” algorithm and over-trusting cross-validations based on flawed gold standards, which in our opinion can yield over-optimistic estimates of performance (Gillis and Pavlidis, 2012, 2013b). In our framework, the features analyzed are carefully selected and combined in a transparent method, more analogous to a biologist’s reasoning, building confidence among the users of the predictions.
On the other hand, our predictions are not biased towards any group of proteins and are thus especially well suited to fill in gaps in our understanding of the protease web and the many understudied proteases and inhibitors.

5.5 Summary

Pervasive substrate cleavage and regulatory protein interactions in the protease web generate thousands of proteoforms that dynamically shape the topology and functional state of proteomes. To fill in knowledge gaps of the protease web, we assess commonly used features of protein interaction prediction tools and identify protein coexpression, phylogenetic similarity, enzymatic compatibility, and colocalization as informative features. We also examine general protein interaction networks for data that lack functional annotation. We address an important protein interaction prediction task: prediction of inhibitory interactions between proteases and protease inhibitors. We characterize prediction performance and the pitfalls of prediction features in relation to protease biology, ultimately supplying lists of predicted inhibition pairs. Unexpectedly, and contrary to common assumption, we did not observe coexpression of proteases and their inhibitors for many functional protease-inhibitor pairs. Applying the principles presented, we predict and validate the inhibition of kallikrein 5 by serpin B12, a novel and physiologically interesting inhibition.

Chapter 6: Conclusions

Biochemical experiments and computational analysis have demonstrated that post-translational regulation of proteins by protease cleavage is a complex regulatory biological process. High-throughput terminomics technologies to identify protease substrates and cleavage specificity as well as general protein truncations have led to increasing insight into individual biological processes controlled by proteases, but are inadequate to tackle the challenges imposed by the complexity of protease biology.
In this thesis, I addressed these challenges by developing (i) computational models to provide insight into the complexity of protease biology and (ii) software and databases to enable rapid analysis of terminomics data in the context of existing data and generation of novel hypotheses based on these analyses. Biochemical validation of some of our predictions further demonstrated the applicability of our methods.

To increase their utility, our models would strongly profit from increased coverage and greater detail of the underlying data. This is especially true for the protease web model. There, improved data coverage will likely not affect the observed connectivity between proteases in the protease web, because additional data will only further increase connectivity and non-physiological edges are tolerated in large numbers, as shown by our data removal experiments in section 4.3.4 and Figure 4.10D. On the other hand, the prediction of proteolytic pathways (for example in PathFINDer) is hampered by incompleteness and limited detail of the underlying data (observed in the analysis of protease web data in chapter 4). Non-physiological edges lead to false positive predictions of paths. Such errors are hard to identify computationally but can be identified by experienced biochemists before the validation of predicted pathways. Data incompleteness results in missed predictions (false negatives) and is likely the main current limitation of accuracy in pathway prediction. Biochemical analyses of high and low throughput, in particular substrate discovery with terminomics methods such as TAILS (Kleifeld et al., 2010), gradually alleviate this problem by filling gaps in our knowledge. Identified substrate cleavage data for proteases can be incorporated directly as edges in the model, thus increasing coverage.
Another contribution towards the goal of complete coverage can be anticipated from the development of bioinformatics tools for the prediction of protease cleavage and inhibition. As shown for protease inhibitor predictions and discussed for protease cleavage prediction, the performance of current prediction tools is such that their predictions require biochemical validation, which limits their usage as high-throughput predictors. However, improved substrate specificity profiles, for example those generated from PICS (Schilling et al., 2011), could significantly improve biochemical predictions and thus increase coverage of the protease web.

With higher completeness of protease web data, an additional improvement of predictive power can result from refinements of the protease web model. With increased knowledge of cleavage and inhibition edges, the density of the protease web will increase, so that multiple paths will be predicted between any protease and substrate. In consequence, additional information will be required to prioritize paths. A promising avenue for the prioritization of pathways is the incorporation of kinetic information to infer the strength of connections. Proteolytic pathways that traverse edges representing energetically favorable interactions are more likely to occur and thus should be prioritized for validation. Kinetic data are currently not consistently annotated in protease databases and generally cannot easily be obtained from most protease cleavage experiments (O’Donoghue et al., 2012). An alternative to experimental determination could be the inference of kinetic information by comparing the cleavage site with the cleavage specificity of the protease. It can be hypothesized that energetically favorable cleavage sites should closely fit the substrate specificity of the proteases. However, like the prediction of protease substrates based on protease specificity, this hypothetical approach is currently limited by the number of substrates available and requires validation.
A second useful refinement of the protease web model is the incorporation of functional consequences of edges. Whereas inhibitory edges unambiguously represent inhibition, cleavages can have various consequences on a target protease or inhibitor, including activation, inactivation, altered function, or change of localization (Kassell and Kay, 1973; Desrochers et al., 1991; Rice and Banda, 1995). The effect of cleavage could be manually annotated or possibly predicted from the substrate sequence by comparing the position of the cleavage to the location of protein domains that are removed upon cleavage. Mapping functional consequences of connections to the predicted pathways would then predict either an increase or a decrease of the final protein cleavage. This predicted effect, in turn, can be compared to the direction of effect observed in the terminomics data (increased or decreased cleavage in the knock-out). Finally, consistency between predicted and observed effect should increase confidence in the predicted path. Many more refinements of the model can be envisaged, for example the incorporation of expression data to remove proteins not expressed in a sample or restriction of the network to specific subcellular compartments. However, such refinements only become relevant once data coverage is improved and prioritization of pathways is required. Ultimately, the protease web model can be combined with other networks of PTMs (e.g. phosphorylation, glycosylation) to broaden our understanding of the impact of protease regulation. Integrated cellular pathways are available in Pathway Commons (Cerami et al., 2011) and KEGG (Kanehisa et al., 2015); however, these databases currently do not incorporate protease web data from MEROPS or TopFIND.
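The two refinements discussed above, prioritizing paths by the strength of their edges and predicting the direction of effect from the functional consequence of each edge, can be sketched on a toy network. This is a minimal illustration under stated assumptions and not the PathFINDer implementation: the function names, the node SUBSTRATE_X, and all edge costs are hypothetical. Only the signs of the serpin B12 to kallikrein 5 edge (inhibition) and the kallikrein 5 to kallikrein 7 edge (activation) follow the text of chapter 5. Multiplying edge signs along a path is itself a simplifying assumption.

```python
# Edge annotation: (source, target) -> (effect, cost)
#   effect: +1 for an activating cleavage, -1 for inhibition or inactivation
#   cost:   hypothetical kinetic cost, lower means more favorable
EDGES = {
    ("SERPINB12", "KLK5"): (-1, 1.0),    # inhibition (chapter 5)
    ("KLK5", "KLK7"): (+1, 1.0),         # activation (Prassas et al., 2015)
    ("KLK7", "SUBSTRATE_X"): (+1, 2.0),  # hypothetical cleavage
    ("KLK5", "SUBSTRATE_X"): (+1, 5.0),  # hypothetical, less favorable cleavage
}


def simple_paths(edges, source, target, path=None):
    """Yield every cycle-free path from source to target (depth-first)."""
    path = [source] if path is None else path
    if source == target:
        yield path
        return
    for (u, v) in edges:
        if u == source and v not in path:
            yield from simple_paths(edges, v, target, path + [v])


def score_path(edges, path):
    """Return (net effect, total cost) of a path."""
    effect, cost = 1, 0.0
    for u, v in zip(path, path[1:]):
        sign, edge_cost = edges[(u, v)]
        effect *= sign     # an odd number of negative edges gives a negative net effect
        cost += edge_cost  # cheaper paths are prioritized
    return effect, cost


def ranked_paths(edges, source, target):
    """List (path, net effect, total cost) tuples, most favorable path first."""
    scored = [(p, *score_path(edges, p)) for p in simple_paths(edges, source, target)]
    return sorted(scored, key=lambda item: item[2])


def knockout_prediction(net_effect):
    """Removing the upstream protein reverses its net effect on the final cleavage."""
    return -net_effect


for path, effect, cost in ranked_paths(EDGES, "SERPINB12", "SUBSTRATE_X"):
    print(" -> ".join(path), "| net effect:", effect, "| cost:", cost,
          "| predicted change in knock-out:", knockout_prediction(effect))
```

In this sketch both paths predict increased cleavage of the hypothetical substrate when the inhibitor is removed; a path whose predicted direction disagreed with the direction observed in terminomics data would receive lower confidence.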
Taken together, the example of the protease web shows that modeling of complexity in biology, which is generated by the large number of proteins and their interactions, is intricate but rewarding because it allows synthesis of large-scale data into biochemically testable predictions. I anticipate that a refined and more complete protease web model will yield significant insight, not only into the distinction between direct and indirect protease substrates and the prediction of proteolytic pathways, but also into network effects of mutations in proteases or of drug targeting, and into the interaction of proteases with other regulatory enzymes.

References

Abreu, R. de S., Penalva, L.O., Marcotte, E.M., and Vogel, C. (2009). Global signatures of protein and mRNA expression levels. Mol. Biosyst. 5, 1512–1526.
Adams, R.L.C., and Bird, R.J. (2009). Review article: Coagulation cascade and therapeutics update: Relevance to nephrology. Part 1: Overview of coagulation, thrombophilias and history of anticoagulants. Nephrology 14, 462–470.
Addlagatta, A., Hu, X., Liu, J.O., and Matthews, B.W. (2005). Structural Basis for the Functional Differences between Type I and Type II Human Methionine Aminopeptidases. Biochemistry (Mosc.) 44, 14741–14749.
Aebersold, R., and Mann, M. (2003). Mass spectrometry-based proteomics. Nature 422, 198–207.
Albert, R., Jeong, H., and Barabasi, A.-L. (2000). Error and attack tolerance of complex networks. Nature 406, 378–382.
Arita, M. (2004). The metabolic world of Escherichia coli is not small. Proc. Natl. Acad. Sci. U. S. A. 101, 1543–1547.
Arolas, J.L., Botelho, T.O., Vilcinskas, A., and Gomis-Rüth, F.X. (2011). Structural Evidence for Standard-Mechanism Inhibition in Metallopeptidases from a Complex Poised to Resynthesize a Peptide Bond. Angew. Chem. 123, 10541–10544.
Arribas, J., and Borroto, A. (2002). Protein Ectodomain Shedding. Chem. Rev. 102, 4627–4638.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.
Ashkenazi, A., and Salvesen, G. (2014). Regulated Cell Death: Signaling and Mechanisms. Annu. Rev. Cell Dev. Biol. 30, 337–356.
Aznaouridis, K., Vlachopoulos, C., Dima, I., Vasiliadou, C., Ioakeimidis, N., Baou, K., Stefanadi, E., and Stefanadis, C. (2007). Divergent associations of tissue inhibitors of metalloproteinases-1 and -2 with the prothrombotic/fibrinolytic state. Atherosclerosis 195, 212–215.
Backes, C., Kuentzer, J., Lenhof, H.-P., Comtesse, N., and Meese, E. (2005). GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res. 33, W208–W213.
Ballouz, S., Verleyen, W., and Gillis, J. (2015). Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics 31, 2123–2130.
Barabási, A.-L., and Albert, R. (1999). Emergence of Scaling in Random Networks. Science 286, 509–512.
Barkan, D.T., Hostetter, D.R., Mahrus, S., Pieper, U., Wells, J.A., Craik, C.S., and Sali, A. (2010). Prediction of protease substrates using sequence and structure features. Bioinformatics 26, 1714–1722.
Barrett, A.J., Rawlings, N.D., Salvesen, G., and Fred Woessner, J. (2013). Introduction. In Handbook of Proteolytic Enzymes, N.D. Rawlings, and G. Salvesen, eds. (Academic Press), pp. li–liv.
Beaufort, N., Plaza, K., Utzschneider, D., Schwarz, A., Burkhart, J.M., Creutzburg, S., Debela, M., Schmitt, M., Ries, C., and Magdolen, V. (2010). Interdependence of kallikrein-related peptidases in proteolytic networks. Biol. Chem. 391, 581–587.
Belaaouaj, A., Li, A., Wun, T.-C., Welgus, H.G., and Shapiro, S.D. (2000). Matrix Metalloproteinases Cleave Tissue Factor Pathway Inhibitor: Effects on Coagulation. J. Biol. Chem. 275, 27123–27128.
Bhardwaj, N., and Lu, H. (2005). Correlation between gene expression profiles and protein–protein interactions within and across genomes. Bioinformatics 21, 2730–2738. Bienvenut, W.V., Sumpton, D., Martinez, A., Lilla, S., Espagne, C., Meinnel, T., and Giglione, C. (2012). Comparative Large Scale Characterization of Plant versus Mammal Proteins Reveals Similar and Idiosyncratic N-α-Acetylation Features. Mol. Cell. Proteomics 11, M111.015131. Borth, W. (1992). Alpha 2-macroglobulin, a multifunctional binding protein with targeting characteristics. FASEB J. 6, 3345–3353. Bossi, A., and Lehner, B. (2009). Tissue specificity and the human protein interaction network. Mol. Syst. Biol. 5, 260. Boyd, S.E., Pike, R.N., Rudy, G.B., Whisstock, J.C., and Garcia de la Banda, M. (2005). PoPS: a computational tool for modeling and predicting protease specificity. J. Bioinform. Comput. Biol. 3, 551–585. Braun, P., Tasan, M., Dreze, M., Barrios-Rodiles, M., Lemmens, I., Yu, H., Sahalie, J.M., Murray, R.R., Roncari, L., de Smet, A.-S., et al. (2009). An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods 6, 91–97. Butler, G.S., and Overall, C.M. (2009a). Proteomic identification of multitasking proteins in unexpected locations complicates drug targeting. Nat. Rev. Drug Discov. 8, 935–948. Butler, G.S., and Overall, C.M. (2009b). Updated Biological Roles for Matrix Metalloproteinases and New “Intracellular” Substrates Revealed by Degradomics. Biochemistry (Mosc.) 48, 10830–10845.   197 Carlson, M., and Pages, H. hom.Hs.inp.db: Homology information for Homo Sapiens from Inparanoid. Carlson, B.A., Xu, X.-M., Kryukov, G.V., Rao, M., Berry, M.J., Gladyshev, V.N., and Hatfield, D.L. (2004). Identification and characterization of phosphoseryl-tRNA[Ser]Sec kinase. Proc. Natl. Acad. Sci. U. S. A. 101, 12848–12853. Catherman, A.D., Skinner, O.S., and Kelleher, N.L. (2014). Top Down proteomics: Facts and perspectives. Biochem. Biophys. Res. Commun. 
445, 683–693. Cerami, E.G., Gross, B.E., Demir, E., Rodchenkov, I., Babur, O., Anwar, N., Schultz, N., Bader, G.D., and Sander, C. (2011). Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685-690. Cesarman-Maus, G., and Hajjar, K.A. (2005). Molecular mechanisms of fibrinolysis. Br. J. Haematol. 129, 307–321. Chait, B.T. (2006). Mass Spectrometry: Bottom-Up or Top-Down? Science 314, 65–66. Chapple, C.E., Robisson, B., Spinelli, L., Guien, C., Becker, E., and Brun, C. (2015). Extreme multifunctional proteins identified from a human protein interaction network. Nat. Commun. 6. Chatr-Aryamontri, A., Breitkreutz, B.-J., Oughtred, R., Boucher, L., Heinicke, S., Chen, D., Stark, C., Breitkreutz, A., Kolas, N., O’Donnell, L., et al. (2015). The BioGRID interaction database: 2015 update. Nucleic Acids Res. 43, D470-478. Ciechanover, A. (2005). Intracellular protein degradation: from a vague idea through the lysosome and the ubiquitin–proteasome system and onto human diseases and drug targeting*. Cell Death Differ. 12, 1178–1190. Clark-Lewis, I., Vo, L., Owen, P., and Anderson, J. (1997). Chemical synthesis, purification, and folding of C-X-C and C-C chemokines. In Methods in Enzymology, Richard Horuk, ed. (Academic Press), pp. 233–250. Claverie, J.M. (1999). Computational methods for the identification of differential and coordinated gene expression. Hum. Mol. Genet. 8, 1821–1832. Colaert, N., Helsens, K., Martens, L., Vandekerckhove, J., and Gevaert, K. (2009). Improved visualization of protein consensus sequences by iceLogo. Nat. Methods 6, 786–787. Colaert, N., Maddelein, D., Impens, F., Damme, P.V., Plasman, K., Helsens, K., Hulstaert, N., Vandekerckhove, J., Gevaert, K., and Martens, L. (2013). The Online Protein Processing Resource (TOPPR): a database and analysis platform for protein processing events. Nucleic Acids Res. 41, D333–D337. Cox, J.H., Dean, R.A., Roberts, C.R., and Overall, C.M. (2008). 
Matrix Metalloproteinase Processing of CXCL11/I-TAC Results in Loss of Chemoattractant Activity and Altered Glycosaminoglycan Binding. J. Biol. Chem. 283, 19389–19399. Crawford, E.D., Seaman, J.E., Agard, N., Hsu, G.W., Julien, O., Mahrus, S., Nguyen, H., Shimbo, K., Yoshihara, H.A.I., Zhuang, M., et al. (2013). The DegraBase: a database of proteolysis in healthy and apoptotic human cells. Mol. Cell. Proteomics MCP 12, 813–824. Csardi, G., and Nepusz, T. (2006). The igraph Software Package for Complex Network Research. InterJournal Complex Systems, 1695. Cunningham, M.J., Liang, S., Fuhrman, S., Seilhamer, J.J., and Somogyi, R. (2000). Gene expression microarray data analysis for toxicology profiling. Ann. N. Y. Acad. Sci. 919, 52–67. Damme, P.V., Gawron, D., Criekinge, W.V., and Menschaert, G. (2014). N-terminal Proteomics and Ribosome Profiling Provide a Comprehensive View of the Alternative Translation Initiation Landscape in Mice and Men. Mol. Cell. Proteomics 13, 1245–1261. Darmon, A.J., Nicholson, D.W., and Bleackley, R.C. (1995). Activation of the apoptotic protease CPP32 by cytotoxic T-cell-derived granzyme B. Nature 377, 446–448. Davie, E.W., and Ratnoff, O.D. (1964). Waterfall Sequence for Intrinsic Blood Clotting. Science 145, 1310–1312. Dean, R.A., and Overall, C.M. (2007). Proteomics Discovery of Metalloproteinase Substrates in the Cellular Context by iTRAQ™ Labeling Reveals a Diverse MMP-2 Substrate Degradome. Mol. Cell. Proteomics 6, 611–623. Dean, R.A., Butler, G.S., Hamma-Kourbali, Y., Delbe, J., Brigstock, D.R., Courty, J., and Overall, C.M. (2007). Identification of Candidate Angiogenic Inhibitors Processed by Matrix Metalloproteinase 2 (MMP-2) in Cell-Based Proteomic Screens: Disruption of Vascular Endothelial Growth Factor (VEGF)/Heparin Affin Regulatory Peptide (Pleiotrophin) and VEGF/Connective Tissue Growth Factor Angiogenic Inhibitory Complexes by MMP-2 Proteolysis. Mol. Cell. Biol. 27, 8454–8465. 
Desrochers, P.E., Jeffrey, J.J., and Weiss, S.J. (1991). Interstitial collagenase (matrix metalloproteinase-1) expresses serpinase activity. J. Clin. Invest. 87, 2258–2265. Dezső, Z., Oltvai, Z.N., and Barabási, A.-L. (2003). Bioinformatics Analysis of Experimentally Determined Protein Complexes in the Yeast Saccharomyces cerevisiae. Genome Res. 13, 2450–2454. Dietzel, E., Wessling, J., Floehr, J., Schäfer, C., Ensslen, S., Denecke, B., Rösing, B., Neulen, J., Veitinger, T., Spehr, M., et al. (2013). Fetuin-B, a Liver-Derived Plasma Protein Is Essential for Fertilization. Dev. Cell 25, 106–112. Doucet, A., Butler, G.S., Rodríguez, D., Prudova, A., and Overall, C.M. (2008). Metadegradomics Toward in Vivo Quantitative Degradomics of Proteolytic Post-translational Modifications of the Cancer Proteome. Mol. Cell. Proteomics 7, 1925–1951. Drag, M., and Salvesen, G.S. (2010). Emerging principles in protease-based drug discovery. Nat. Rev. Drug Discov. 9, 690–701. Dufour, A., and Overall, C.M. (2013). Missing the target: matrix metalloproteinase antitargets in inflammation and cancer. Trends Pharmacol. Sci. 34, 233–242. Durinck, S., Spellman, P.T., Birney, E., and Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191. duVerle, D.A., and Mamitsuka, H. (2012). A review of statistical methods for prediction of proteolytic cleavage. Brief. Bioinform. 13, 337–349. duVerle, D.A., Ono, Y., Sorimachi, H., and Mamitsuka, H. (2011). Calpain Cleavage Prediction Using Multiple Kernel Learning. PLoS ONE 6. Eckhard, U., Marino, G., Abbey, S.R., Tharmarajah, G., Matthew, I., and Overall, C.M. (2015). The Human Dental Pulp Proteome and N-Terminome: Levering the Unexplored Potential of Semitryptic Peptides Enriched by TAILS to Identify Missing Proteins in the Human Proteome Project in Underexplored Tissues. J. Proteome Res. 14, 3568–3582. Edman, P., and Begg, G. (1967). A Protein Sequenator. 
Eur. J. Biochem. 1, 80–91. Edwards, A.M., Bountra, C., Kerr, D.J., and Willson, T.M. (2009). Open access chemical and clinical probes to support drug discovery. Nat. Chem. Biol. 5, 436–440. Ehrnthaller, C., Ignatius, A., Gebhard, F., and Huber-Lang, M. (2011). New Insights of an Old Defense System: Structure, Function, and Clinical Relevance of the Complement System. Mol. Med. 17, 317–329. Ewing, R.M., Chu, P., Elisma, F., Li, H., Taylor, P., Climie, S., McBroom-Cerajewski, L., Robinson, M.D., O’Connor, L., Li, M., et al. (2007). Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 3, 89. Fang, Y., Mu, J., Ma, Y., Ma, D., Fu, D., and Shen, X. (2012). The interaction between ubiquitin C-terminal hydrolase 37 and glucose-regulated protein 78 in hepatocellular carcinoma. Mol. Cell. Biochem. 359, 59–66. Flicek, P., Amode, M.R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fitzgerald, S., et al. (2014). Ensembl 2014. Nucleic Acids Res. 42, D749–D755. Fortelny, N., Cox, J.H., Kappelhoff, R., Starr, A.E., Lange, P.F., Pavlidis, P., and Overall, C.M. (2014). Network Analyses Reveal Pervasive Functional Regulation Between Proteases in the Human Protease Web. PLoS Biol 12, e1001869. Fortelny, N., Yang, S., Pavlidis, P., Lange, P.F., and Overall, C.M. (2015). Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events. Nucleic Acids Res. 43, D290–D297. Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., Lin, J., Minguez, P., Bork, P., von Mering, C., et al. (2012). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815. Freeman, L.C. (1977). A Set of Measures of Centrality Based on Betweenness. Sociometry 40, 35–41. 
Fuchs, J.E., von Grafenstein, S., Huber, R.G., Kramer, C., and Liedl, K.R. (2013). Substrate-Driven Mapping of the Degradome by Comparison of Sequence Logos. PLoS Comput Biol 9, e1003353. Fujinaga, M., Cherney, M.M., Oyama, H., Oda, K., and James, M.N.G. (2004). The molecular structure and catalytic mechanism of a novel carboxyl peptidase from Scytalidium lignicolum. Proc. Natl. Acad. Sci. U. S. A. 101, 3364–3369. Fukai, F., Ohtaki, M., Fujii, N., Yajima, H., Ishii, T., Miyazaki, K., and Katayama, T. (1995). Release of Biological Activities from Quiescent Fibronectin by a Conformational Change and Limited Proteolysis by Matrix Metalloproteinases. Biochemistry (Mosc.) 34, 11453–11459. Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S., Wilkins, M., Appel, R., and Bairoch, A. (2005). Protein Identification and Analysis Tools on the ExPASy Server. In The Proteomics Protocols Handbook, J. Walker, ed. (Humana Press), pp. 571–607. Gaudet, P., Michel, P.-A., Zahn-Zabal, M., Cusin, I., Duek, P.D., Evalet, O., Gateau, A., Gleizes, A., Pereira, M., Teixeira, D., et al. (2015). The neXtProt knowledgebase on human proteins: current status. Nucleic Acids Res. 43, D764–D770. Gawron, D., Gevaert, K., and Van Damme, P. (2014). The proteome under translational control. PROTEOMICS 14, 2647–2662. Ge, H., Liu, Z., Church, G.M., and Vidal, M. (2001). Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29, 482–486. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80. Gevaert, K., Goethals, M., Martens, L., Van Damme, J., Staes, A., Thomas, G.R., and Vandekerckhove, J. (2003). Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides. Nat. Biotechnol. 21, 566–569. 
Giannelli, G., Falk-Marzillier, J., Schiraldi, O., Stetler-Stevenson, W.G., and Quaranta, V. (1997). Induction of Cell Migration by Matrix Metalloprotease-2 Cleavage of Laminin-5. Science 277, 225–228. Gillet, L.C., Navarro, P., Tate, S., Röst, H., Selevsek, N., Reiter, L., Bonner, R., and Aebersold, R. (2012). Targeted Data Extraction of the MS/MS Spectra Generated by Data-independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis. Mol. Cell. Proteomics 11, O111.016717. Gillis, J., and Pavlidis, P. (2011a). The Impact of Multifunctional Genes on “Guilt by Association” Analysis. PLoS ONE 6, e17258. Gillis, J., and Pavlidis, P. (2011b). The role of indirect connections in gene networks in predicting function. Bioinformatics 27, 1860–1866. Gillis, J., and Pavlidis, P. (2012). “Guilt by Association” Is the Exception Rather Than the Rule in Gene Networks. PLoS Comput Biol 8, e1002444. Gillis, J., and Pavlidis, P. (2013a). Assessing identity, redundancy and confounds in Gene Ontology annotations over time. Bioinformatics 29, 476–482. Gillis, J., and Pavlidis, P. (2013b). Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). BMC Bioinformatics 14, S15. Gillis, J., Ballouz, S., and Pavlidis, P. (2014). Bias tradeoffs in the creation and analysis of protein–protein interaction networks. J. Proteomics 100, 44–54. Goldberg, A.L. (2003). Protein degradation and protection against misfolded or damaged proteins. Nature 426, 895–899. Goldberg, D.S., and Roth, F.P. (2003). Assessing experimentally derived interactions in a small world. Proc. Natl. Acad. Sci. 100, 4372–4376. Golubkov, V.S., Boyd, S., Savinov, A.Y., Chekanov, A.V., Osterman, A.L., Remacle, A., Rozanov, D.V., Doxsey, S.J., and Strongin, A.Y. (2005). 
Membrane Type-1 Matrix Metalloproteinase (MT1-MMP) Exhibits an Important Intracellular Cleavage Function and Causes Chromosome Instability. J. Biol. Chem. 280, 25079–25086. Goulet, B., Baruch, A., Moon, N.-S., Poirier, M., Sansregret, L.L., Erickson, A., Bogyo, M., and Nepveu, A. (2004). A Cathepsin L Isoform that Is Devoid of a Signal Peptide Localizes to the Nucleus in S Phase and Processes the CDP/Cux Transcription Factor. Mol. Cell 14, 207–219. Grigoriev, A. (2001). A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29, 3513–3519. Hayashida, K., Bartlett, A.H., Chen, Y., and Park, P.W. (2010). Molecular and Cellular Mechanisms of Ectodomain Shedding. Anat. Rec. Adv. Integr. Anat. Evol. Biol. 293, 925–937. Henry, M.T., McMahon, K., Costello, C., Fitzgerald, M.X., and O’Connor, C.M. (2002). Secretory leukocyte proteinase inhibitor and elafin are resistant to degradation by MMP-8. Exp. Lung Res. 28, 85–97. Huntington, J.A., Read, R.J., and Carrell, R.W. (2000). Structure of a serpin–protease complex shows inhibition by deformation. Nature 407, 923–926. Hwang, C.-S., Shemorry, A., and Varshavsky, A. (2010). N-Terminal Acetylation of Cellular Proteins Creates Specific Degradation Signals. Science 327, 973–977. Igarashi, Y., Eroshkin, A., Gramatikova, S., Gramatikoff, K., Zhang, Y., Smith, J.W., Osterman, A.L., and Godzik, A. (2007). CutDB: a proteolytic event database. Nucleic Acids Res. 35, D546–D549. Janin, J. (2002). Welcome to CAPRI: A Critical Assessment of PRedicted Interactions. Proteins Struct. Funct. Bioinforma. 47, 257–257. Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. (2003). A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science 302, 449–453. Jeffery, C.J. (2003). 
Multifunctional proteins: examples of gene sharing. Ann. Med. 35, 28–35. Jensen, P.E.H., and Stigbrand, T. (1992). Differences in the proteinase inhibition mechanism of human α2-macroglobulin and pregnancy zone protein. Eur. J. Biochem. 210, 1071–1077. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., and Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature 407, 651–654. Jeong, H., Mason, S.P., Barabási, A.-L., and Oltvai, Z.N. (2001). Lethality and centrality in protein networks. Nature 411, 41–42. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., and Hirakawa, M. (2009). KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2015). KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. gkv1070. Kappelhoff, R., Auf dem Keller, U., and Overall, C.M. (2010). Analysis of the degradome with the CLIP-CHIP microarray. Methods Mol. Biol. Clifton NJ 622, 175–193. Kassell, B., and Kay, J. (1973). Zymogens of proteolytic enzymes. Science 180, 1022–1027. Kelleher, N.L., Thomas, P.M., Ntai, I., Compton, P.D., and LeDuc, R.D. (2014). Deep and quantitative top-down proteomics in clinical and translational research. Expert Rev. Proteomics 11, 649–651. Keller, U. auf dem, Prudova, A., Eckhard, U., Fingleton, B., and Overall, C.M. (2013). Systems-Level Analysis of Proteolytic Events in Increased Vascular Permeability and Complement Activation in Skin Inflammation. Sci. Signal. 6, rs2-rs2. Kessenbrock, K., Dau, T., and Jenne, D.E. (2011). Tailor-made inflammation: how neutrophil serine proteases modulate the inflammatory response. J. Mol. Med. 89, 23–28. Khan, A.R., and James, M.N.G. (1998). Molecular mechanisms for the conversion of zymogens to active proteolytic enzymes. Protein Sci. 7, 815–836. 
Kim, M.-S., Pinto, S.M., Getnet, D., Nirujogi, R.S., Manda, S.S., Chaerkady, R., Madugundu, A.K., Kelkar, D.S., Isserlin, R., Jain, S., et al. (2014). A draft map of the human proteome. Nature 509, 575–581. King, O.D., Foulger, R.E., Dwight, S.S., White, J.V., and Roth, F.P. (2003). Predicting Gene Function From Patterns of Annotation. Genome Res. 13, 896–904. Kleifeld, O., Doucet, A., auf dem Keller, U., Prudova, A., Schilling, O., Kainthan, R.K., Starr, A.E., Foster, L.J., Kizhakkedathu, J.N., and Overall, C.M. (2010). Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products. Nat. Biotechnol. 28, 281–288. Kleifeld, O., Doucet, A., Prudova, A., auf dem Keller, U., Gioia, M., Kizhakkedathu, J.N., and Overall, C.M. (2011). Identifying and quantifying proteolytic events and the natural N terminome by terminal amine isotopic labeling of substrates. Nat. Protoc. 6, 1578–1611. Klein, T., Fung, S.-Y., Renner, F., Blank, M.A., Dufour, A., Kang, S., Bolger-Munro, M., Scurll, J.M., Priatel, J.J., Schweigler, P., et al. (2015). The paracaspase MALT1 cleaves HOIL1 reducing linear ubiquitination by LUBAC to dampen lymphocyte NF-κB signalling. Nat. Commun. 6, 8777. Knäuper, V., Reinke, H., and Tschesche, H. (1990). Inactivation of human plasma α1-proteinase inhibitor by human PMN leucocyte collagenase. FEBS Lett. 263, 355–357. Kötzler, M.P., and Withers, S.G. (2016). Proteolytic Cleavage Driven by Glycosylation. J. Biol. Chem. 291, 429–434. Krisinger, M.J., Goebeler, V., Lu, Z., Meixner, S.C., Myles, T., Pryzdial, E.L.G., and Conway, E.M. (2012). Thrombin generates previously unidentified C5 products that support the terminal complement activation pathway. Blood 120, 1717–1725. Kristensen, A.R., Gsponer, J., and Foster, L.J. (2012). A high-throughput approach for measuring temporal changes in the interactome. Nat. Methods 9, 907–909. Krüger, A. (2009). 
Functional genetic mouse models: promising tools for investigation of the proteolytic internet. Biol. Chem. 390, 91–97. Kwan, J.A., Schulze, C.J., Wang, W., Leon, H., Sariahmetoglu, M., Sung, M., Sawicka, J., Sims, D.E., Sawicki, G., and Schulz, R. (2004). Matrix metalloproteinase-2 (MMP-2) is present in the nucleus of cardiac myocytes and is capable of cleaving poly (ADP-ribose) polymerase (PARP) in vitro. FASEB J. 18, 690–692. Lai, Z.W., Gomez-Auli, A., Keller, E.J., Mayer, B., Biniossek, M.L., and Schilling, O. (2015). Enrichment of protein N-termini by charge reversal of internal peptides. PROTEOMICS 15, 2470–2478. Lane, L., Argoud-Puy, G., Britan, A., Cusin, I., Duek, P.D., Evalet, O., Gateau, A., Gaudet, P., Gleizes, A., Masselot, A., et al. (2011). neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 40, D76–D83. Lane, L., Bairoch, A., Beavis, R.C., Deutsch, E.W., Gaudet, P., Lundberg, E., and Omenn, G.S. (2014). Metrics for the Human Proteome Project 2013–2014 and Strategies for Finding Missing Proteins. J. Proteome Res. 13, 15–20. Lange, P.F., and Overall, C.M. (2011). TopFIND, a knowledgebase linking protein termini with function. Nat. Methods 8, 703–704. Lange, P.F., and Overall, C.M. (2013). Protein TAILS: when termini tell tales of proteolysis and function. Curr. Opin. Chem. Biol. 17, 73–82. Lange, P.F., Huesgen, P.F., and Overall, C.M. (2011). TopFIND 2.0--linking protein termini with proteolytic processing and modifications altering protein function. Nucleic Acids Res. 40, D351–D361. Lange, P.F., Huesgen, P.F., Nguyen, K., and Overall, C.M. (2014). Annotating N Termini for the Human Proteome Project: N Termini and Nα-Acetylation Status Differentiate Stable Cleaved Protein Species from Degradation Remnants in the Human Erythrocyte Proteome. J. Proteome Res. 13, 2028–2044. Langfelder, P., and Horvath, S. (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559. 
Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., and Pavlidis, P. (2004). Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Res. 14, 1085–1094. Lee, S., Liu, B., Lee, S., Huang, S.-X., Shen, B., and Qian, S.-B. (2012). Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl. Acad. Sci. 109, E2424–E2432. Lees, J.G., Heriche, J.K., Morilla, I., Ranea, J.A., and Orengo, C.A. (2011). Systematic computational prediction of protein interaction networks. Phys. Biol. 8, 35008. Li, J., and Yuan, J. (2008). Caspases in apoptosis and beyond. Oncogene 27, 6194–6206. Li, B.-Q., Cai, Y.-D., Feng, K.-Y., and Zhao, G.-J. (2012). Prediction of Protein Cleavage Site with Feature Selection by Random Forest. PLoS ONE 7, e45854. Li, J., Brick, P., O’Hare, M., Skarzynski, T., Lloyd, L., Curry, V., Clark, I., Bigg, H., Hazleman, B., Cawston, T., et al. (1995). Structure of full-length porcine synovial collagenase reveals a C-terminal domain containing a calcium-linked, four-bladed β-propeller. Structure 3, 541–549. Li, Z., Yasuda, Y., Li, W., Bogyo, M., Katz, N., Gordon, R.E., Fields, G.B., and Brömme, D. (2004). Regulation of Collagenase Activities of Human Cathepsins by Glycosaminoglycans. J. Biol. Chem. 279, 5470–5479. Lima-Mendez, G., and Helden, J. van (2009). The powerful law of the power law and other myths in network biology. Mol. Biosyst. 5, 1482–1493. Lincz, L.F. (1998). Deciphering the apoptotic pathway: All roads lead to death. Immunol. Cell Biol. 76, 1–19. Liu, Z.-P., and Chen, L. (2012). Proteome-wide prediction of protein-protein interactions from high-throughput data. Protein Cell 3, 508–520. Liu, J., Guo, Q., Chen, B., Yu, Y., Lu, H., and Li, Y.-Y. (2006). Cathepsin B and its interacting proteins, bikunin and TSRC1, correlate with TNF-induced apoptosis of ovarian cancer cells OV-90. FEBS Lett. 580, 245–250. 
Liu, Z.-P., Wang, J., Qiu, Y.-Q., Leung, R.K., Zhang, X.-S., Tsui, S.K., and Chen, L. (2012). Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs. BMC Bioinformatics 13, S6. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N., et al. (2013). The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585. López-Otín, C., and Bond, J.S. (2008). Proteases: Multifunctional Enzymes in Life and Disease. J. Biol. Chem. 283, 30433–30437. López-Otín, C., and Hunter, T. (2010). The regulatory crosstalk between kinases and proteases in cancer. Nat. Rev. Cancer 10, 278–292. López-Otín, C., and Overall, C.M. (2002). Protease degradomics: A new challenge for proteomics. Nat. Rev. Mol. Cell Biol. 3, 509–519. Lu, L.J., Xia, Y., Paccanaro, A., Yu, H., and Gerstein, M. (2005). Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 15, 945–953. Lüthi, A.U., and Martin, S.J. (2007). The CASBAH: a searchable database of caspase substrates. Cell Death Differ. 14, 641–650. Lüttgen, H., Rohdich, F., Herz, S., Wungsintaweekul, J., Hecht, S., Schuhr, C.A., Fellermeier, M., Sagner, S., Zenk, M.H., Bacher, A., et al. (2000). Biosynthesis of terpenoids: YchB protein of Escherichia coli phosphorylates the 2-hydroxy group of 4-diphosphocytidyl-2C-methyl-d-erythritol. Proc. Natl. Acad. Sci. U. S. A. 97, 1062–1067. Luzio, J.P., Pryor, P.R., and Bright, N.A. (2007). Lysosomes: fusion and function. Nat. Rev. Mol. Cell Biol. 8, 622–632. Macfarlane, R.G. (1964). An Enzyme Cascade in the Blood Clotting Mechanism, and its Function as a Biochemical Amplifier. Nature 202, 498–499. Maetschke, S.R., Simonsen, M., Davis, M.J., and Ragan, M.A. (2012). Gene Ontology-driven inference of protein–protein interactions using inducers. Bioinformatics 28, 69–75. Mahrus, S., Trinidad, J.C., Barkan, D.T., Sali, A., Burlingame, A.L., and Wells, J.A. 
(2008). Global Sequencing of Proteolytic Cleavage Sites in Apoptosis by Specific Labeling of Protein N Termini. Cell 134, 866–876. Mann, M., and Jensen, O.N. (2003). Proteomic analysis of post-translational modifications. Nat. Biotechnol. 21, 255–261. Marchant, D.J., Bellac, C.L., Moraes, T.J., Wadsworth, S.J., Dufour, A., Butler, G.S., Bilawchuk, L.M., Hendry, R.G., Robertson, A.G., Cheung, C.T., et al. (2014). A new transcriptional role for matrix metalloproteinase-12 in antiviral immunity. Nat. Med. 20, 493–502. Marrero, A., Duquerroy, S., Trapani, S., Goulas, T., Guevara, T., Andersen, G.R., Navaza, J., Sottrup-Jensen, L., and Gomis-Rüth, F.X. (2012). The Crystal Structure of Human α2-Macroglobulin Reveals a Unique Molecular Cage. Angew. Chem. Int. Ed. 51, 3340–3344. Marshall, A.G., and Hendrickson, C.L. (2008). High-Resolution Mass Spectrometers. Annu. Rev. Anal. Chem. 1, 579–599. Mason, S.D., and Joyce, J.A. (2011). Proteolytic networks in cancer. Trends Cell Biol. 21, 228–237. Mast, A.E., Enghild, J.J., Nagase, H., Suzuki, K., Pizzo, S.V., and Salvesen, G. (1991a). Kinetics and physiologic relevance of the inactivation of alpha 1-proteinase inhibitor, alpha 1-antichymotrypsin, and antithrombin III by matrix metalloproteinases-1 (tissue collagenase), -2 (72-kDa gelatinase/type IV collagenase), and -3 (stromelysin). J. Biol. Chem. 266, 15810–15816. Mast, A.E., Enghild, J.J., Nagase, H., Suzuki, K., Pizzo, S.V., and Salvesen, G. (1991b). Kinetics and physiologic relevance of the inactivation of alpha 1-proteinase inhibitor, alpha 1-antichymotrypsin, and antithrombin III by matrix metalloproteinases-1 (tissue collagenase), -2 (72-kDa gelatinase/type IV collagenase), and -3 (stromelysin). J. Biol. Chem. 266, 15810–15816. Matsushita, M., Thiel, S., Jensenius, J.C., Terai, I., and Fujita, T. (2000). Proteolytic Activities of Two Types of Mannose-Binding Lectin-Associated Serine Protease. J. Immunol. 165, 2637–2642. 
McQuibban, G.A., Gong, J.-H., Tam, E.M., McCulloch, C.A.G., Clark-Lewis, I., and Overall, C.M. (2000). Inflammation Dampened by Gelatinase A Cleavage of Monocyte Chemoattractant Protein-3. Science 289, 1202–1206. Mei, Y., Hahn, A.A., Hu, S., and Yang, X. (2011). The USP19 Deubiquitinase Regulates the Stability of c-IAP1 and c-IAP2. J. Biol. Chem. 286, 35380–35387. von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. (2002). Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403. Micheau, O., and Tschopp, J. (2003). Induction of TNF Receptor I-Mediated Apoptosis via Two Sequential Signaling Complexes. Cell 114, 181–190. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., and Alon, U. (2002). Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 824–827. Modrek, B., Resch, A., Grasso, C., and Lee, C. (2001). Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 29, 2850–2859. Mohamed, M.M., and Sloane, B.F. (2006). Cysteine cathepsins: multifunctional enzymes in cancer. Nat. Rev. Cancer 6, 764–775. Moon, S., Han, D., Kim, Y., Jin, J., Ho, W.-K., and Kim, Y. (2014). Interactome analysis of AMP-activated protein kinase (AMPK)-α1 and -β1 in INS-1 pancreatic beta-cells by affinity purification-mass spectrometry. Sci. Rep. 4, 4376. Morel, S., Lévy, F., Burlet-Schiltz, O., Brasseur, F., Probst-Kepper, M., Peitrequin, A.-L., Monsarrat, B., Van Velthoven, R., Cerottini, J.-C., Boon, T., et al. (2000). Processing of Some Antigens by the Standard Proteasome but Not by the Immunoproteasome Results in Poor Presentation by Dendritic Cells. Immunity 12, 107–117. Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C., and Morris, Q. (2008). GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4. Mott, J.D., and Werb, Z. (2004). 
Regulation of matrix biology by matrix metalloproteinases. Curr. Opin. Cell Biol. 16, 558–564. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., and Tramontano, A. (2014). Critical assessment of methods of protein structure prediction (CASP) — round x. Proteins Struct. Funct. Bioinforma. 82, 1–6. Muller-Eberhard, H.J. (1988). Molecular Organization and Function of the Complement System. Annu. Rev. Biochem. 57, 321–347. Nagase, H., and Woessner, J.F. (1999). Matrix Metalloproteinases. J. Biol. Chem. 274, 21491–21494. Nalivaeva, N.N., Fisk, L.R., Belyaev, N.D., and Turner, A.J. (2008). Amyloid-degrading enzymes as therapeutic targets in Alzheimer’s disease. Curr. Alzheimer Res. 5, 212–224. Niehrs, C., and Pollet, N. (1999). Synexpression groups in eukaryotes. Nature 402, 483–487. Nørregaard Jensen, O. (2004). Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr. Opin. Chem. Biol. 8, 33–41. Ochieng, J., Fridman, R., Nangia-Makker, P., Kleiner, D.E., Liotta, L.A., Stetler-Stevenson, W.G., and Raz, A. (1994). Galectin-3 Is a Novel Substrate for Human Matrix Metalloproteinases-2 and -9. Biochemistry (Mosc.) 33, 14109–14114. O’Donoghue, A.J., Eroy-Reveles, A.A., Knudsen, G.M., Ingram, J., Zhou, M., Statnekov, J.B., Greninger, A.L., Hostetter, D.R., Qu, G., Maltby, D.A., et al. (2012). Global identification of peptidase specificity by multiplex substrate profiling. Nat. Methods 9, 1095–1100. Olsson, M., and Zhivotovsky, B. (2011). Caspases and cancer. Cell Death Differ. 18, 1441–1449. Östlund, G., Schmitt, T., Forslund, K., Köstler, T., Messina, D.N., Roopra, S., Frings, O., and Sonnhammer, E.L.L. (2010). InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203. Overall, C.M. (2002). Molecular determinants of metalloproteinase substrate specificity. Mol. Biotechnol. 22, 51–86. Overall, C.M. (2014). Can proteomics fill the gap between genomics and phenotypes? 
J. Proteomics 100, 1–2. Overall, C., and Dean, R. (2006). Degradomics: Systems biology of the protease web. Pleiotropic roles of MMPs in cancer. Cancer Metastasis Rev. 25, 69–75. Overall, C.M., and Blobel, C.P. (2007). In search of partners: linking extracellular proteases to substrates. Nat. Rev. Mol. Cell Biol. 8, 245–257. Overall, C.M., and Kleifeld, O. (2006). Validating matrix metalloproteinases as drug targets and anti-targets for cancer therapy. Nat. Rev. Cancer 6, 227–239. Overall, C.M., and Sodek, J. (1990). Concanavalin A produces a matrix-degradative phenotype in human fibroblasts. Induction and endogenous activation of collagenase, 72-kDa gelatinase, and Pump-1 is accompanied by the suppression of the tissue inhibitor of matrix metalloproteinases. J. Biol. Chem. 265, 21141–21151. Overall, C.M., Wrana, J.L., and Sodek, J. (1989). Independent regulation of collagenase, 72-kDa progelatinase, and metalloendoproteinase inhibitor expression in human fibroblasts by transforming growth factor-beta. J. Biol. Chem. 264, 1860–1869. Paetzel, M., Karla, A., Strynadka, N.C.J., and Dalbey, R.E. (2002). Signal Peptidases. Chem. Rev. 102, 4549–4580. Pages, H., Carlson, M., Falcon, S., and Li, N. AnnotationDbi: Annotation Database Interface. Pampalakis, G., and Sotiropoulou, G. (2007). Tissue kallikrein proteolytic cascade pathways in normal physiology and cancer. Biochim. Biophys. Acta BBA - Rev. Cancer 1776, 22–31. Pavlidis, P., and Gillis, J. (2012). Progress and challenges in the computational prediction of gene function using networks. F1000Research 1, 14. Pavlidis, P., and Gillis, J. (2013). Progress and challenges in the computational prediction of gene function using networks: 2012-2013 update. F1000Research 2, 230. Payne, S.H. (2015). The utility of protein and mRNA correlation. Trends Biochem. Sci. 40, 1–3. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. (1999). 
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U. S. A. 96, 4285–4288. Pelman, G.R., Morrison, C.J., and Overall, C.M. (2005). Pivotal Molecular Determinants of Peptidic and Collagen Triple Helicase Activities Reside in the S3′ Subsite of Matrix Metalloproteinase 8 (MMP-8): the role of hydrogen bonding potential of ASN188 and TYR189 and the connecting cis bond. J. Biol. Chem. 280, 2370–2377. Perera, N.C., Schilling, O., Kittel, H., Back, W., Kremmer, E., and Jenne, D.E. (2012). NSP4, an elastase-related protease in human neutrophils with arginine specificity. Proc. Natl. Acad. Sci. 109, 6229–6234. Pierre, M., Traverso, J.A., Boisson, B., Domenichini, S., Bouchez, D., Giglione, C., and Meinnel, T. (2007). N-Myristoylation Regulates the SnRK1 Pathway in Arabidopsis. Plant Cell Online 19, 2804–2821. Prassas, I., Eissa, A., Poda, G., and Diamandis, E.P. (2015). Unleashing the therapeutic potential of human kallikrein-related serine proteases. Nat. Rev. Drug Discov. 14, 183–202. Prudova, A., Keller, U. auf dem, Butler, G.S., and Overall, C.M. (2010). Multiplex N-terminome Analysis of MMP-2 and MMP-9 Substrate Degradomes by iTRAQ-TAILS Quantitative Proteomics. Mol. Cell. Proteomics 9, 894–911. Prudova, A., Serrano, K., Eckhard, U., Fortelny, N., Devine, D.V., and Overall, C.M. (2014). TAILS N-terminomics of human platelets reveals pervasive metalloproteinase dependent proteolytic processing in storage. Blood 124, e49-60. Puente, X.S., Sánchez, L.M., Overall, C.M., and López-Otín, C. (2003). Human and mouse proteases: a comparative genomic approach. Nat. Rev. Genet. 4, 544–558. Quesada, V., Ordóñez, G.R., Sánchez, L.M., Puente, X.S., and López-Otín, C. (2009). The Degradome database: mammalian proteases and diseases of proteolysis. Nucleic Acids Res. 37, D239–D243. R Core Team (2013). R: A Language and Environment for Statistical Computing (Vienna, Austria: R Foundation for Statistical Computing). 
Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., Wittkop, T., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., et al. (2013). A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227. Ranea, J.A.G., Yeats, C., Grant, A., and Orengo, C.A. (2007). Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes. PLoS Comput Biol 3, e237. Rao, N.V., Marshall, B.C., Gray, B.H., and Hoidal, J.R. (1993). Interaction of Secretory Leukocyte Protease Inhibitor with Proteinase-3. Am. J. Respir. Cell Mol. Biol. 8, 612–616. Rapoport, T.A. (2007). Protein translocation across the eukaryotic endoplasmic reticulum and bacterial plasma membranes. Nature 450, 663–669. Rastogi, S., and Rost, B. (2011). LocDB: experimental annotations of localization for Homo sapiens and Arabidopsis thaliana. Nucleic Acids Res. 39, D230–D234. Rawlings, N.D., Tolle, D.P., and Barrett, A.J. (2004). Evolutionary families of peptidase inhibitors. Biochem. J. 378, 705. Rawlings, N.D., Barrett, A.J., and Bateman, A. (2010). MEROPS: the peptidase database. Nucleic Acids Res. 38, D227–D233. Rawlings, N.D., Barrett, A.J., and Bateman, A. (2011). Asparagine Peptide Lyases A SEVENTH CATALYTIC TYPE OF PROTEOLYTIC ENZYMES. J. Biol. Chem. 286, 38321–38328. Rawlings, N.D., Barrett, A.J., and Bateman, A. (2012). MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 40, D343–D350. Rawlings, N.D., Barrett, A.J., and Finn, R. (2015). Twenty years of the MEROPS database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 44, D343–D350. Resh, M.D. (2006). Trafficking and signaling by fatty-acylated and prenylated proteins. Nat. Chem. Biol. 2, 584–590. Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A., and Chinnaiyan, A.M. (2005).
Probabilistic model of the human protein-protein interaction network. Nat. Biotechnol. 23, 951–959. Rice, A., and Banda, M.J. (1995). Neutrophil Elastase Processing of Gelatinase A Is Mediated by Extracellular Matrix. Biochemistry (Mosc.) 34, 9249–9256. Rogers, L.D., and Overall, C.M. (2013). Proteolytic post-translational modification of proteins: proteomic tools and methodology. Mol. Cell. Proteomics MCP 12, 3532–3542. Rual, J.-F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., Ayivi-Guedehoussou, N., et al. (2005). Towards a proteome-scale map of the human protein–protein interaction network. Nature 437, 1173–1178. Rubinsztein, D.C. (2006). The roles of intracellular protein-degradation pathways in neurodegeneration. Nature 443, 780–786. Sabino, F., Hermes, O., Egli, F.E., Kockmann, T., Schlage, P., Croizat, P., Kizhakkedathu, J.N., Smola, H., and Keller, U. auf dem (2015). In Vivo Assessment of Protease Dynamics in Cutaneous Wound Healing by Degradomics Analysis of Porcine Wound Exudates. Mol. Cell. Proteomics 14, 354–370. Sarma, J.V., and Ward, P.A. (2011). The Complement System. Cell Tissue Res. 343, 227–235. Schaefer, M.H., Fontaine, J.-F., Vinayagam, A., Porras, P., Wanker, E.E., and Andrade-Navarro, M.A. (2012). HIPPIE: Integrating Protein Interaction Networks with Experiment Based Quality Scores. PLoS ONE 7, e31826. Schatz, G., and Dobberstein, B. (1996). Common Principles of Protein Translocation Across Membranes. Science 271, 1519–1526. Schechter, I., and Berger, A. (1967). On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun. 27, 157–162. Schilling, O., Barré, O., Huesgen, P.F., and Overall, C.M. (2010). Proteome-wide analysis of protein carboxy termini: C terminomics. Nat. Methods 7, 508–511. Schilling, O., Huesgen, P.F., Barré, O., auf dem Keller, U., and Overall, C.M. (2011). 
Characterization of the prime and non-prime active site specificities of proteases by proteome-derived peptide libraries and tandem mass spectrometry. Nat. Protoc. 6, 111–120. Schlage, P., Egli, F.E., Nanni, P., Wang, L.W., Kizhakkedathu, J.N., Apte, S.S., and Keller, U. auf dem (2014). Time-resolved Analysis of the Matrix Metalloproteinase 10 Substrate Degradome. Mol. Cell. Proteomics 13, 580–593. Scott, F.L., Hirst, C.E., Sun, J., Bird, C.H., Bottomley, S.P., and Bird, P.I. (1999). The Intracellular Serpin Proteinase Inhibitor 6 Is Expressed in Monocytes and Granulocytes and Is a Potent Inhibitor of the Azurophilic Granule Protease, Cathepsin G. Blood 93, 2089–2097. Shen-Orr, S.S., Milo, R., Mangan, S., and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68. Shin, C.J., Wong, S., Davis, M.J., and Ragan, M.A. (2009). Protein-protein interaction as a predictor of subcellular location. BMC Syst. Biol. 3, 28. Sim, R.B., Reboul, A., Arlaud, G.J., Villiers, C.L., and Colomb, M.G. (1979). Interaction of 125I-labelled complement subcomponents C1r and C1s with protease inhibitors in plasma. FEBS Lett. 97, 111–115. Slee, E.A., Adrain, C., and Martin, S.J. (1999). Serial killers: ordering caspase activation events in apoptosis. Cell Death Differ. 6, 1067–1074. Smith, L.M., Kelleher, N.L., and The Consortium for Top Down Proteomics (2013). Proteoform: a single term describing protein complexity. Nat. Methods 10, 186–187. Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.-L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–432. Song, J., Tan, H., Shen, H., Mahmood, K., Boyd, S.E., Webb, G.I., Akutsu, T., and Whisstock, J.C. (2010). Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinforma. Oxf. Engl. 26, 752–760.
Song, J., Tan, H., Boyd, S.E., Shen, H., Mahmood, K., Webb, G.I., Akutsu, T., Whisstock, J.C., and Pike, R.N. (2011a). BIOINFORMATIC APPROACHES FOR PREDICTING SUBSTRATES OF PROTEASES. J. Bioinform. Comput. Biol. 9, 149–178. Song, J., Matthews, A.Y., Reboul, C.F., Kaiserman, D., Pike, R.N., Bird, P.I., and Whisstock, J.C. (2011b). Predicting serpin/protease interactions. Methods Enzymol. 501, 237–273. Song, J., Tan, H., Perry, A.J., Akutsu, T., Webb, G.I., Whisstock, J.C., and Pike, R.N. (2012). PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites. PLoS ONE 7, e50300. Sowa, M.E., Bennett, E.J., Gygi, S.P., and Harper, J.W. (2009). Defining the Human Deubiquitinating Enzyme Interaction Landscape. Cell 138, 389–403. Staes, A., Van Damme, P., Helsens, K., Demol, H., Vandekerckhove, J., and Gevaert, K. (2008). Improved recovery of proteome-informative, protein N-terminal peptides by combined fractional diagonal chromatography (COFRADIC). PROTEOMICS 8, 1362–1370. Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H., Goehler, H., Stroedicke, M., Zenkner, M., Schoenherr, A., Koeppen, S., et al. (2005). A Human Protein-Protein Interaction Network: A Resource for Annotating the Proteome. Cell 122, 957–968. Stennicke, H.R., and Salvesen, G.S. (1998). Properties of the caspases. Biochim. Biophys. Acta BBA - Protein Struct. Mol. Enzymol. 1387, 17–31. Stuart, J.M., Segal, E., Koller, D., and Kim, S.K. (2003). A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 302, 249–255. Sun, J., Bird, C.H., Sutton, V., McDonald, L., Coughlin, P.B., Jong, T.A.D., Trapani, J.A., and Bird, P.I. (1996). A Cytosolic Granzyme B Inhibitor Related to the Viral Apoptotic Regulator Cytokine Response Modifier A Is Present in Cytotoxic Lymphocytes. J. Biol. Chem. 271, 27802–27809.
Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., et al. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447-452. Tabach, Y., Billi, A.C., Hayes, G.D., Newman, M.A., Zuk, O., Gabel, H., Kamath, R., Yacoby, K., Chapman, B., Garcia, S.M., et al. (2013). Identification of small RNA pathway genes using patterns of phylogenetic conservation and divergence. Nature 493, 694–698. Tagliabracci, V.S., Engel, J.L., Wen, J., Wiley, S.E., Worby, C.A., Kinch, L.N., Xiao, J., Grishin, N.V., and Dixon, J.E. (2012). Secreted Kinase Phosphorylates Extracellular Proteins That Regulate Biomineralization. Science 336, 1150–1153. Tam, E.M., Morrison, C.J., Wu, Y.I., Stack, M.S., and Overall, C.M. (2004). Membrane protease proteomics: Isotope-coded affinity tag MS identification of undescribed MT1–matrix metalloproteinase substrates. Proc. Natl. Acad. Sci. U. S. A. 101, 6917–6922. Tester, A.M., Cox, J.H., Connor, A.R., Starr, A.E., Dean, R.A., Puente, X.S., López-Otín, C., and Overall, C.M. (2007). LPS Responsiveness and Neutrophil Chemotaxis In Vivo Require PMN MMP-8 Activity. PLoS ONE 2, e312. The UniProt Consortium (2012). Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 41, D43–D47. The UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42, D191–D198. Thornberry, N.A., and Lazebnik, Y. (1998). Caspases: Enemies Within. Science 281, 1312–1316. Toh, E.C.Y., Huq, N.L., Dashper, S.G., and Reynolds, E.C. (2010). Cysteine protease inhibitors: from evolutionary relationships to modern chemotherapeutic design for the treatment of infectious diseases. Curr. Protein Pept. Sci. 11, 725–743. Tolstykh, T., Lee, J., Vafai, S., and Stock, J.B. (2000).
Carboxyl methylation regulates phosphoprotein phosphatase 2A by controlling the association of regulatory B subunits. EMBO J. 19, 5682–5691. Turk, B. (2006). Targeting proteases: successes, failures and future prospects. Nat. Rev. Drug Discov. 5, 785–799. Turk, V., Turk, B., and Turk, D. (2001). Lysosomal cysteine proteases: facts and opportunities. EMBO J. 20, 4629–4633. Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000). A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627. Uhlen, M., Oksvold, P., Fagerberg, L., Lundberg, E., Jonasson, K., Forsberg, M., Zwahlen, M., Kampf, C., Wester, K., Hober, S., et al. (2010). Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 28, 1248–1250. Vagner, S., Gensac, M.C., Maret, A., Bayard, F., Amalric, F., Prats, H., and Prats, A.C. (1995). Alternative translation of human fibroblast growth factor 2 mRNA occurs by internal entry of ribosomes. Mol. Cell. Biol. 15, 35–44. Van Damme, P., Martens, L., Van Damme, J., Hugelier, K., Staes, A., Vandekerckhove, J., and Gevaert, K. (2005). Caspase-specific and nonspecific in vivo protein processing during Fas-induced apoptosis. Nat. Methods 2, 771–777. Van Damme, P., Staes, A., Bronsoms, S., Helsens, K., Colaert, N., Timmerman, E., Aviles, F.X., Vandekerckhove, J., and Gevaert, K. (2010). Complementary positional proteomics for screening substrates of endo- and exoproteases. Nat. Methods 7, 512–515. Vasiljeva, O., Reinheckel, T., Peters, C., Turk, D., Turk, V., and Turk, B. (2007). Emerging Roles of Cysteine Cathepsins in Disease and their Potential as Drug Targets. Curr. Pharm. Des. 13, 387–403. Vassar, R. (2002). β-Secretase (BACE) as a drug target for alzheimer’s disease. Adv. Drug Deliv. Rev. 54, 1589–1602. Verspurten, J., Gevaert, K., Declercq, W., and Vandenabeele, P. (2009). 
SitePredicting the cleavage of proteinase substrates. Trends Biochem. Sci. 34, 319–323. Vinayagam, A., Stelzl, U., Foulle, R., Plassmann, S., Zenkner, M., Timm, J., Assmus, H.E., Andrade-Navarro, M.A., and Wanker, E.E. (2011). A Directed Protein Interaction Network for Investigating Intracellular Signal Transduction. Sci Signal 4, rs8-rs8. Vogel, C., and Marcotte, E.M. (2012). Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–232. Vögtle, F.-N., Wortelkamp, S., Zahedi, R.P., Becker, D., Leidhold, C., Gevaert, K., Kellermann, J., Voos, W., Sickmann, A., Pfanner, N., et al. (2009). Global Analysis of the Mitochondrial N-Proteome Identifies a Processing Peptidase Critical for Protein Stability. Cell 139, 428–439. Wagner, A., and Fell, D.A. (2001). The small world inside large metabolic networks. Proc. R. Soc. Lond. B Biol. Sci. 268, 1803–1810. Wan, J., and Qian, S.-B. (2014). TISdb: a database for alternative translation initiation in mammalian cells. Nucleic Acids Res. 42, D845-850. Wang, Y.X.R., and Huang, H. (2014). Review on statistical methods for gene network reconstruction using expression data. J. Theor. Biol. 362, 53–61. Weisbrod, C.R., Chavez, J.D., Eng, J.K., Yang, L., Zheng, C., and Bruce, J.E. (2013). In Vivo Protein Interaction Network Identified with a Novel Real-Time Cross-Linked Peptide Identification Strategy. J. Proteome Res. 12, 1569–1579. Welches, W.R., Bridget Brosnihan, K., and Ferrario, C.M. (1993). A comparison of the properties and enzymatic activities of three angiotensin processing enzymes: Angiotensin converting enzyme, prolyl endopeptidase and neutral endopeptidase 24.11. Life Sci. 52, 1461–1480. Wildes, D., and Wells, J.A. (2010). Sampling the N-terminal proteome of human blood. Proc. Natl. Acad. Sci. 107, 4561–4566. Wilhelm, M., Schlegl, J., Hahne, H., Gholami, A.M., Lieberenz, M., Savitski, M.M., Ziegler, E., Butzmann, L., Gessulat, S., Marx, H., et al. (2014).
Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587. Winnepenninckx, V., Lazar, V., Michiels, S., Dessen, P., Stas, M., Alonso, S.R., Avril, M.-F., Romero, P.L.O., Robert, T., Balacescu, O., et al. (2006). Gene Expression Profiling of Primary Cutaneous Melanoma and Clinical Outcome. J. Natl. Cancer Inst. 98, 472–482. Witze, E.S., Old, W.M., Resing, K.A., and Ahn, N.G. (2007). Mapping protein post-translational modifications with mass spectrometry. Nat. Methods 4, 798–806. Yang, Q.-H., Church-Hajduk, R., Ren, J., Newton, M.L., and Du, C. (2003). Omi/HtrA2 catalytic cleavage of inhibitor of apoptosis (IAP) irreversibly inactivates IAPs and facilitates caspase activity in apoptosis. Genes Dev. 17, 1487–1496. Yasuda, Y., Kaleta, J., and Brömme, D. (2005). The role of cathepsins in osteoporosis and arthritis: Rationale for the design of new therapeutics. Adv. Drug Deliv. Rev. 57, 973–993. Yu, H., Tardivo, L., Tam, S., Weiner, E., Gebreab, F., Fan, C., Svrzikapa, N., Hirozane-Kishikawa, T., Rietman, E., Yang, X., et al. (2011). Next-generation sequencing to generate interactome datasets. Nat. Methods 8, 478–480. Zhang, Q.C., Petrey, D., Deng, L., Qiang, L., Shi, Y., Thu, C.A., Bisikirska, B., Lefebvre, C., Accili, D., Hunter, T., et al. (2012). Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556–560. Zhang, Q.C., Petrey, D., Garzón, J.I., Deng, L., and Honig, B. (2013). PrePPI: a structure-informed database of protein-protein interactions. Nucleic Acids Res. 41, D828-833. Zoubarev, A., Hamer, K.M., Keshav, K.D., McCarthy, E.L., Santos, J.R.C., Rossum, T.V., McDonald, C., Hall, A., Wan, X., Lim, R., et al. (2012). Gemma: a resource for the reuse, sharing and meta-analysis of expression profiling data. Bioinformatics 28, 2272–2273. Zucker, S., Cao, J., and Chen, W.T. (2000). Critical appraisal of the use of matrix metalloproteinase inhibitors in cancer treatment. Oncogene 19, 6642–6650.

