@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Medicine, Faculty of"@en, "Biochemistry and Molecular Biology, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Wong, Eric Tsz Chung"@en ; dcterms:issued "2020-12-31T08:00:00Z"@en, "2019"@en ; vivo:relatedDegree "Doctor of Philosophy - PhD"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """Intrinsically disordered protein regions (IDRs) constitute a significant portion of our proteome but have traditionally received less attention than folded domains, making IDRs a focus of ongoing research. These protein regions that are not folded prior to binding have functional importance, contradicting the protein structure–function paradigm. One mechanism through which IDRs function is by forming interactions with protein partners through interaction-mediating elements, including molecular recognition features (MoRFs). Computational biologists have developed many protein-sequence-based methods for predicting IDRs and MoRFs and have applied them in proteome-wide studies, leading to the recognition of their significant roles in regulatory and signaling pathways, housekeeping proteins, and interaction network hubs. IDRs’ involvement in these processes made them attractive targets for research and therapy. However, the folded (globular) proteins interacting with IDRs have received less attention. We developed a structure-based protein interface predictor for binding sites of IDRs named IDRBind, which incorporated features specific to MoRF binding sites with ideas from existing globular protein interface predictors. IDRBind was developed using machine learning and was trained on MoRF–globular complex structures. It consists of two gradient boosted trees models that are combined using a conditional random fields (CRF) model. 
The structural data used for the development of IDRBind was also useful for characterizing and comparing IDR and globular interactions. In this thesis, I will cover the development and benchmarking of IDRBind and examine the properties of MoRF interactions with comparisons to those of globular proteins and peptides. IDRBind exhibits high performance on predicting both MoRF and peptide binding sites. Our analysis also revealed that MoRF binding sites are positioned between those of peptide and globular proteins on multiple measured properties, in agreement with the performance trends of IDRBind. The differentiating characteristics of IDR-mediated interactions were further investigated by comparing the localization patterns of mutations. Despite the flexibility of IDRs, the interaction surfaces of the IDR complex structures are just as enriched in disease-associated mutations as globular interactions. Their prominent roles in disease, especially in cancer, as well as attributes that favor drug targeting, make IDR interactions a fascinating topic for research."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/72797?expand=metadata"@en ; skos:note "PREDICTION AND CHARACTERIZATION OF PROTEIN–PROTEIN INTERFACES THAT BIND INTRINSICALLY DISORDERED PROTEIN REGIONS by Eric Tsz Chung Wong M.Sc., The University of British Columbia, 2012 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Biochemistry and Molecular Biology) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) December 2019 © Eric Tsz Chung Wong, 2019 The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: Prediction and Characterization of Protein–Protein Interfaces That Bind Intrinsically Disordered Protein Regions submitted by Eric Tsz Chung Wong in partial fulfillment of the
requirements for the degree of Doctor of Philosophy in Biochemistry and Molecular Biology

Examining Committee:
Jörg Gsponer, Supervisor
Lawrence McIntosh, Supervisory Committee Member
Artem Cherkasov, University Examiner
Calvin Yip, University Examiner

Additional Supervisory Committee Members:
Filip Van Petegem, Supervisory Committee Member

Abstract

Intrinsically disordered protein regions (IDRs) constitute a significant portion of our proteome but have traditionally received less attention than folded domains, making IDRs a focus of ongoing research. These protein regions that are not folded prior to binding have functional importance, contradicting the protein structure–function paradigm. One mechanism through which IDRs function is by forming interactions with protein partners through interaction-mediating elements, including molecular recognition features (MoRFs). Computational biologists have developed many protein-sequence-based methods for predicting IDRs and MoRFs and have applied them in proteome-wide studies, leading to the recognition of their significant roles in regulatory and signaling pathways, housekeeping proteins, and interaction network hubs. IDRs’ involvement in these processes made them attractive targets for research and therapy. However, the folded (globular) proteins interacting with IDRs have received less attention. We developed a structure-based protein interface predictor for binding sites of IDRs named IDRBind, which incorporated features specific to MoRF binding sites with ideas from existing globular protein interface predictors. IDRBind was developed using machine learning and was trained on MoRF–globular complex structures. It consists of two gradient boosted trees models that are combined using a conditional random fields (CRF) model. The structural data used for the development of IDRBind was also useful for characterizing and comparing IDR and globular interactions.
In this thesis, I will cover the development and benchmarking of IDRBind and examine the properties of MoRF interactions with comparisons to those of globular proteins and peptides. IDRBind exhibits high performance on predicting both MoRF and peptide binding sites. Our analysis also revealed that MoRF binding sites are positioned between those of peptide and globular proteins on multiple measured properties, in agreement with the performance trends of IDRBind. The differentiating characteristics of IDR-mediated interactions were further investigated by comparing the localization patterns of mutations. Despite the flexibility of IDRs, the interaction surfaces of the IDR complex structures are just as enriched in disease-associated mutations as globular interactions. Their prominent roles in disease, especially in cancer, as well as attributes that favor drug targeting, make IDR interactions a fascinating topic for research.

Lay Summary

Interactions between proteins are essential to life, so there is significant interest in investigating all aspects of protein–protein interactions. The immense number of protein structures and interactions has spurred the development of numerous computational predictors for the interaction surfaces of proteins. Many such predictors put all protein interactions under one umbrella, but is this the best approach going forward? Distinct from classical interactions between two folded proteins is a class of interactions mediated by intrinsically disordered regions (IDRs), which lack a folded conformation prior to binding. IDRs that mediate interactions are key players in regulatory pathways, making them promising targets for research and therapy. Thus, we developed IDRBind, a method specialized in predicting binding sites of IDRs. Furthermore, this undertaking revealed characteristics that differentiate IDR interactions from classically studied globular interactions.
IDR interactions also showed susceptibility to disease mutations, demonstrating their critical contributions to cellular functions.

Preface

Chapter 2, consisting of work related to the predictor IDRBind, is an adjusted version of a published article by Eric Tsz Chung Wong and Jörg Gsponer, “Predicting protein-protein interfaces that bind intrinsically disordered protein regions”, in Journal of Molecular Biology (2019). Dr. Jörg Gsponer identified the research objective. Dr. Jörg Gsponer and I designed the experiments and wrote the manuscript. I completed all the computational scripting for the experiments and the development of the predictor. Software developer Matthew Jacobson developed the website for IDRBind. The article was also edited by Dr. Monika Fuxreiter and by editors of the journal.

Chapter 3 includes the characterization of peptide, MoRF, and globular interfaces. Analyses of the three classes of interfaces include work published in the article mentioned above as well as related unpublished work that I performed. Chapter 3 also presents the investigation of mutation localization in interacting protein regions. Analyses of mutation localization include work performed by undergraduate students Mike Guron and Victor So, using data provided by Dr. Nawar Malhis. Mike Guron performed exploratory work on mapping disease mutations to IDR interaction structures and calculating odds ratios of mutation enrichment. Victor So completed the mutation mapping to the protein sequences that were used for the final analyses. I was responsible for the categorization of protein residues into structural regions using solvent accessible surface area calculations. The computational scripts for combining the protein structure regions with mutation data are the work of Victor So. Dr. Nawar Malhis provided the gnomAD SNV frequency data, which he mapped to UniProt sequences, and he also provided the ESpritz protein disorder prediction scores.
I calculated the odds ratios and created the figures. Dr. Jörg Gsponer and I designed the project and interpreted the results.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
1.1 Protein–Protein Interactions (PPIs)
1.1.1 Classification and Function
1.1.2 Perturbation of PPIs Is Associated with Disease
1.1.3 PPI Identification and Networks
1.1.4 Structure Determination
1.2 Protein Interface Prediction
1.2.1 Key Features Used in Interface Prediction
1.2.2 First Predictors
1.2.3 Classification and Application of Existing Prediction Methods
1.2.4 Machine Learning
1.2.5 State of the Art
1.3 Properties of IDR-Mediated Interactions
1.3.1 Defining MoRFs and Peptide Motifs
1.3.2 Peptide Complexes
1.3.3 MoRF Complexes
1.3.3.1 Residue Composition and Contributions in MoRF Complexes
1.3.3.2 Interface Geometry
1.3.3.3 Binding Mechanism
1.3.3.4 Interaction Dynamics
1.4 Hypothesis and Experimental Rationale
Chapter 2: Predicting Protein–Protein Interfaces that Bind Intrinsically Disordered Protein Regions
2.1 Overview
2.2 Introduction
2.3 Results
2.3.1 MoRF Complex Datasets for Predictor Training and Evaluation
2.3.2 Several Features Are Associated with Core and Rim Interface Residues
2.3.3 Modules That Predict Core and Rim Residues
2.3.4 CorePred and RimPred Scores Are Combined Using CRFs
2.3.5 IDRBind Outcompetes Existing Predictors in the Identification of MoRF-binding Sites
2.3.6 IDRBind Accurately Predicts the Interface of Well-Known MoRF Interaction Partners
2.3.7 IDRBind Identifies Peptide but Not Globular Interfaces
2.4 Discussion
2.5 Methods
2.5.1 Construction of Protein Complex Datasets
2.5.2 Surface, Core, and Rim Residue Definition
2.5.3 Prediction Evaluation Comparisons with Existing Predictors
2.5.4 Calculating Feature Scores
2.5.4.1 Conservation
2.5.4.2 Electrostatics and Groove
2.5.4.3 Residue Composition
2.5.4.4 B-Factor
2.5.4.5 Curvature and Roughness
2.5.4.6 Feature Patterns 1 to 5
2.5.4.7 Aggregated Feature Scores
2.5.5 Constructing CorePred, RimPred, and IDRBind CRF
2.5.6 Figure Generation
2.5.7 Quantification and Statistical Analysis
2.5.8 Structure RMSD Calculations
2.5.9 Prediction Server
Chapter 3: Interactions of IDRs — Attributes and Mutations
3.1 Overview
3.2 Introduction
3.3 Results
3.3.1 Properties That Differentiate IDR-Partner and Globular Interfaces
3.3.2 Classification of IDR-Partner and Globular Interfaces.
3.3.3 Analysis of Mutation Localization on Protein Structures
3.3.4 SNVs Are Depleted in IDR Interactions.
3.3.5 Disease Mutations Are Enriched in IDR Interfaces.
3.3.6 Predicted Structural Regions Show Enrichment of Disease Mutations
3.4 Discussion
3.5 Methods
3.5.1 Calculating Interface Sequence and Structural Properties
3.5.2 Training and Testing MoRFint — IDR-Partner Interface and Globular Interface Classifier
3.5.3 Structural Data for Mutation Localization Analysis
3.5.4 Defining IDR Interaction Datasets
3.5.5 Mapping Mutations to Globular and IDR Interaction Structural Data
3.5.6 Predicted Structural Regions for Mutation Mapping
3.5.7 Odds Ratio Calculations
Chapter 4: Conclusion
4.1 IDRBind – Predictor of IDR-Partner Interfaces
4.1.1 Significance, Limitations, and Applications
4.2 Interface Attributes and Mutation Localization in IDR Interactions
4.2.1 Significance and Limitations
Bibliography
Appendices
Appendix A MoRF-train and MoRF-test Datasets.
A.1 MoRF-test Reference and Structure Annotations.
A.2 MoRF-train Reference and Structure Annotations.
Appendix B List of Features Used and Analyzed for IDRBind.
Appendix C The IDRBind Server.
C.1 The Submission Page of IDRBind Server.
C.2 The Public Jobs Queue Page of IDRBind Server.

List of Tables

Table 2.1: Performance evaluation of interface predictors on MoRF-test dataset.
Table 2.2: A model that combines CorePred and RimPred scores by taking the larger of the two for each residue evaluated on MoRF-test dataset at varying thresholds.
Table 2.3: IDRBind prediction performance on individual structures of MoRF-test.
Table 2.4: The performance of the scoring component of the CRF on MoRF-test dataset.
Table 2.5: Performance measures on PEP.
Table 2.6: Performance measures on DB5.
Table 2.7: Evaluation on a subset of MoRF-test for which ACCLUSTER was given MoRF sequences as input, limiting MoRFs to those under 31 residues long.
Table 3.1: Datasets of interaction structures with mutation mapping.
Table 3.2: Odds ratios of low-frequency SNV in IDR interaction proteins over globular proteins.
Table 3.3: Odds ratios of high-frequency SNV in IDR interaction proteins over globular proteins.
Table 3.4: Odds ratios of SwissVar mutations in IDR interaction proteins over globular proteins.
Table 3.5: Odds ratios of COSMIC mutations in IDR interaction proteins over globular proteins.

List of Figures

Figure 1.1: Illustration of key interface features of interactions between folded proteins.
Figure 1.2: Illustration of the curvature of a binding groove.
Figure 1.3: General workflow for supervised learning of a classification model.
Figure 1.4: Three classes of protein–protein interactions.
Figure 1.5: Relative size and shape of MoRF interface regions.
Figure 2.1: Schematic of IDRBind’s architecture.
Figure 2.2: Conformational diversity of individual MoRFs in our dataset.
Figure 2.3: ROC curves for interface versus non-interface classification using different individual feature scores.
Figure 2.4: ROC curves for core, rim and interface residue predictions.
Figure 2.5: Schematic of IDRBind CRF (A) and the quadrilateral mesh protein surface representation (B) used in it.
Figure 2.6: Prediction results of select proteins mapped on their structures.
Figure 2.7: Curvature scores of rim and core in MoRF-train.
Figure 2.8: IDRBind prediction result for the MoRF interface of MED15’s activator-binding domain 1 (ABD1).
Figure 2.9: IDRBind prediction for calmodulin.
Figure 2.10: IDRBind prediction for interleukin-2.
Figure 2.11: A) CorePred and B) RimPred importance plot for selected features calculated by XGBoost.
Figure 2.12: IDRBind prediction on the unbound structure of Bcl-2.
Figure 3.1: Feature comparison for globular (blue), MoRF (yellow), and peptide (brown) interfaces.
Figure 3.2: Non-standardized scores of features for globular (blue), MoRF (yellow), and peptide (brown) interfaces.
Figure 3.3: Classification of binding sites of IDRs versus globular domains.
Figure 3.4: Protein regions in mutation localization analysis.
Figure 3.5: Odds ratios of low-frequency SNVs in protein regions over all residues.
Figure 3.6: Odds ratios of high-frequency SNVs in protein regions over all residues.
Figure 3.7: Odds ratios of SwissVar mutations in protein regions over all residues.
Figure 3.8: Odds ratios of COSMIC mutations in protein regions over all residues.
Figure 3.9: Odds ratios of mutations in predicted protein regions over all residues.

List of Abbreviations

3DC: globular dataset for MoRFint training
ANN: artificial neural networks
AUC: area under curve
CD: circular dichroism
CRF: conditional random fields
cryo-EM: cryogenic electron microscopy
DB5: docking benchmark 5 globular dataset for IDRBind testing
DNA: deoxyribonucleic acid
FN: false negative
FP: false positive
FPR: false positive rate
Gly: glycine
IDR: intrinsically disordered region
IDRBind: IDR-partner interface predictor
MAP: maximum a posteriori
MCC: Matthews correlation coefficient
MoRF: molecular recognition feature
MoRFint: IDR-partner and globular interface classifier
MoRF-test: MoRF dataset for IDRBind testing
MoRF-train: MoRF dataset for IDRBind training
MS: mass spectrometry
MSA: multiple sequence alignment
NMR: nuclear magnetic resonance spectroscopy
OR: odds ratio
ORR: odds ratio of regions across datasets
PCA: principal component analysis
PDB: protein data bank
PEP: peptide dataset for IDRBind testing
PPI: protein–protein interaction
rASA: relative accessible surface area
RF: random forest
RMSD: root-mean-square deviation
RNA: ribonucleic acid
ROC: receiver operating characteristic
SASA: solvent accessible surface area
SE: standard error
SLiM: short linear motif
SNV: single nucleotide variant
SVM: support vector machine
TAP: tandem affinity purification
TN: true negative
TP: true positive
TPR: true positive rate
Y2H: yeast 2-hybrid

Acknowledgements

I give my sincerest gratitude to my supervisor Dr. Jörg Gsponer, for his mentorship and guidance throughout the years that I spent completing two graduate degrees in UBC. The challenging phases of my project would have been hard to overcome without his continued support and insight to drive progress forward. For all their work and for creating an excellent workplace, I give thanks to the past and present members of the Gsponer lab: Alex, Andrew, Ashwani, Dokyun, Erich, Florian, Geoffrey, Guillaume, Jennifer, Massih, Murat, Nawar, Nivretta, Roy, Travis.
A special thanks goes to Alex for providing feedback on my writing. I sincerely thank my supervisory committee, Dr. Lawrence McIntosh and Dr. Filip Van Petegem, for all their help and commitment to working through this long journey with me. I am also grateful to my family and friends for their support. In particular, I thank Leo for his advice on teaching and research. Lastly, I thank UBC and NSERC for the resources and scholarships that supported my research.

Dedication

To my parents, who raised me and supported everything I do.

Chapter 1: Introduction

1.1 Protein–Protein Interactions (PPIs)

Nearly all cellular processes involve protein functions, which are typically mediated or modulated by protein–protein interactions (PPIs), so investigations of the interacting proteins and their interaction mechanisms are long-established objectives in biology (1, 2). For instance, protein chains associate to form multimeric cellular machinery or to transduce signals in response to changes in the cellular environment. Therefore, many high-throughput efforts have been aimed at identifying all PPIs in proteomes, which are the full sets of proteins expressed by organisms, creating PPI networks called interactomes (3). However, the three-dimensional structures of protein complexes are critical for interpreting the interaction data. Combining protein interaction and structural data has revealed molecular details underlying biological functions and opened many possibilities in research and therapeutic development (4, 5). The introductory sections present an overview of PPIs and their functions, followed by the methods for generating protein interaction and structural data.

1.1.1 Classification and Function

Protein complexes are formed through the interactions of multiple protein chains, and they are often categorized into obligate and transient interactions. Protein structure is described at four hierarchical levels. The primary structure of a protein is its sequence of amino acid residues.
The secondary structure consists of local conformations of the protein chain and is stabilized by the interactions between the amino acid residues’ side chain and backbone atoms, with the most commonly recognized secondary structure elements being the alpha-helix and the beta-sheet. The tertiary structure of the protein is the three-dimensional conformation of the protein chain. Finally, the quaternary structure is the three-dimensional structure of the protein complex produced by the arrangement of multiple interacting protein chains.

There are multiple ways of categorizing protein complexes. The interaction partners in complexes can be identical or different protein chains, forming homomeric and heteromeric complexes, respectively (6). Nooren and Thornton classified PPIs using two overlapping sets of categories: obligate versus non-obligate and permanent versus transient (7). PPIs where the individual partner chains only exist in complexes in vivo, generally due to their structural and functional dependence on their interactions, are defined as obligate interactions. Conversely, non-obligate interaction partners are observable in unbound states, i.e., as dissociated protein chains. Permanent and transient interactions are defined based on the complex’s lifetime in vivo, with permanent interactions being very stable and transient interactions exhibiting low affinity. Transient interaction partners therefore exist in a balance of bound and unbound states, while permanent interaction partners are predominantly in the bound state (7). There are subtle differences between the two sets of terminologies, but these terms are often used interchangeably in the literature. For simplicity, I will use the terms obligate and transient PPIs.

Obligate PPIs form between proteins that only function within a larger complex or are unstable on their own. Some proteins fold and associate concurrently upon synthesis, so the individual proteins in obligate complexes may not be stable in isolation (7).
Proteins often need to associate to form complexes because of the morphological requirements of their functions. Examples include catalytic complexes that have active sites formed at protein interaction interfaces or that catalyze reaction chains requiring the proximity of multiple enzyme units (8). Besides structural and catalytic roles, PPIs may also serve roles in signal transduction. The ability to respond to changes in the environment is vital for biological systems, and this often involves the propagation of signals through PPIs. The ability of interaction partners to quickly associate and dissociate is an advantage for signal transduction and implies the involvement of transient interactions (7, 9). Transient interactions between proteins can lead to signal propagation via different mechanisms. A common mechanism through which signals are transduced is allostery. Allostery is a mechanism of regulation where a functional element (e.g., enzyme active site or partner binding site) is altered through the association of a protein interaction partner or other types of molecules at a site distant from the functional element (8). Thus, interaction on one protein site can affect the structure and dynamics of another protein site through the allosteric mechanism, propagating a signal through the structure of the protein. Another way one protein induces changes on another is to modify the protein chemically through post-translational modifications (PTMs). PTMs are covalent modifications on proteins that are added after protein synthesis. A common PTM is protein phosphorylation, which is the addition of a phosphoryl group to proteins by enzymes called kinases. PTMs like phosphorylation change the chemical characteristics of the protein, which can modulate its structure or activity, or rewire its interactions, making measurements of protein PTM levels a growing topic in PPI research (10).

1.1.2 Perturbation of PPIs Is Associated with Disease

The relationship between diseases and mutations, as well as the localization patterns of mutations in different protein regions, provides further evidence of the importance of PPIs (11, 12). Genome sequencing has made available vast amounts of single nucleotide variant (SNV) data (13, 14). An SNV is defined as a single nucleotide difference at a genomic position (i.e., one DNA nucleotide substitution from the reference sequence) but is used here to refer specifically to an in-frame non-synonymous mutation, which means a mutation that leads to a single amino acid residue substitution. Some SNVs have been directly linked to diseases, while other SNVs appear benign. Wang et al. mapped disease-associated mutations onto protein complex structures and observed enrichment of mutations on protein interfaces (15). Interfaces are the protein surfaces through which protein partners interact. More precisely, an interface consists of the residues that are in contact with the interaction partner. Furthermore, mutations on different interfaces of the same protein are often linked to different diseases, presumably through disrupting different interactions (15). Experiments performed by Sahni et al. separated disease-associated SNVs into those that disrupt all interactions and those that disrupt only some of the interactions of proteins (11). They observed that disease-associated SNVs that disrupt all interactions predominantly involve residue substitutions that change hydrophobicity. The burial of hydrophobic amino acids in the protein core is a strong driving force of folding (16). Thus, residue substitutions that change the hydrophobicity of a protein often destabilize the protein, disrupting all of the protein’s interactions. Importantly, Sahni et al. observed that SNVs more frequently disrupt only a subset of the protein’s interactions, which they called edgetic mutations. These SNVs are enriched in the protein’s interfaces.
Therefore, disease-associated mutations frequently disrupt a subset of a protein's interactions, and disruption of different interactions leads to distinct phenotypes (11, 15). This link between altered PPIs and human disease motivates efforts to map all molecular details in interactome networks.

1.1.3 PPI Identification and Networks

Most proteins have evolved to make interactions with other proteins. High-throughput screening of yeast protein complexes revealed that an estimated 88% of the 1993 sampled proteins interact with one or more protein partners (17). The interactome is the full complement of molecular interactions in an organism. In this writing, the term interactome refers specifically to the protein–protein interactome, which forms the complete PPI network. While the determination of full interactomes remains a distant goal, large PPI networks for multiple species are available thanks to high-throughput screening methods for identifying PPIs. The Biological General Repository for Interaction Datasets (BioGRID) contains over 1,670,000 interactions across many species, more than 360,000 of which are non-redundant physical protein interactions in humans (18). Estimates of the size of the human interactome have ranged from 130,000 to 650,000 interactions (19, 20). High-throughput experimental approaches that identify physical interactions include yeast 2-hybrid (Y2H) and tandem affinity purification (TAP) in conjunction with mass spectrometry (MS) (21). Importantly, a physical interaction could indicate a binary interaction, where two proteins are in direct contact, or it could indicate a co-complex interaction, where the two proteins may interact indirectly through intermediary proteins. A prevalent method for identifying physical binary interactions is Y2H, in which yeast is transformed with the constructs of two target proteins, consisting of one bait and one prey, fused with an activation domain and transcription factor pair.
Thus, transcription activation would generate products if the two proteins associate, bringing the activation domain and transcription factor into proximity. TAP is a two-step purification method that extracts the TAP-tagged bait protein, which drags out the rest of the protein complex from the cell lysate. The interacting proteins in the purified complex can then be identified through MS, meaning the PPIs identified include co-complex interactions. These are just two of the most widely used techniques for identifying PPIs, and many more have been developed with various strengths and weaknesses (21). Moreover, the PPI datasets identified by many such studies have been uploaded to databases such as BioGRID, allowing aggregation of numerous PPI datasets to create more extensive and more accurate PPI networks. In PPI networks, nodes represent proteins and edges represent PPIs, and the topological properties of the protein nodes are correlated with their functional properties. An important finding from studying PPI networks is their scale-free topology (22). Scale-free networks have nodes whose degrees, i.e., the number of edges touching a node, follow a power-law distribution. Therefore, a relatively small number of proteins called hub proteins participate in a large number of PPIs. These hub proteins are functionally critical, with many identified to be essential for the organism’s survival (22, 23). Therefore, essential proteins could be predicted from PPI networks using the centrality-lethality rule, where the degree is a measure of centrality. A later study revealed that the tendency of hub proteins to be essential might be due to a higher probability of encompassing an essential PPI by random chance and not due to other biological mechanisms inherent to their network topology (24). PPI networks are useful for understanding protein complexes and the pathways that connect them, but they do not provide the full picture.
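The notions of node degree and hub proteins described above can be illustrated with a minimal sketch. The edge list, the protein labels, and the degree cutoff used to call a hub are all hypothetical choices made for illustration; they are not taken from BioGRID or from any of the cited studies.

```python
from collections import Counter

def node_degrees(edges):
    # Degree = number of distinct interaction partners of each protein.
    partners = {}
    for a, b in edges:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {protein: len(found) for protein, found in partners.items()}

def hubs(degrees, min_degree):
    # min_degree is an arbitrary, illustrative cutoff for calling a hub.
    return sorted(p for p, d in degrees.items() if d >= min_degree)

def degree_distribution(degrees):
    # Maps each degree value to the number of proteins with that degree.
    return dict(Counter(degrees.values()))

# Hypothetical toy network: protein A interacts with four partners.
toy_edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'E'),
             ('B', 'C'), ('E', 'F')]
degrees = node_degrees(toy_edges)
hub_proteins = hubs(degrees, min_degree=4)
distribution = degree_distribution(degrees)
```

In a real interactome, the degree distribution would be inspected on a log-log scale to assess whether it is consistent with a power law; a toy network of this size is far too small for that.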
Interactome data provides a high-level overview of the organization of cellular machinery and the flow of signaling information, but structural data is required to reveal the detailed molecular mechanisms explaining how proteins interact and function. Efforts in integrating structural data into PPI networks resulted in structure-mapped PPI networks like Interactome3D, which provides atomic resolution details in the context of complex cellular pathways (5). Analogous to how SNVs disrupting PPIs can lead to disease, drugs that target PPIs can lead to treatment. For example, because hub proteins can propagate signals to their numerous interaction partners, hub proteins and their partners are promising drug targets (25). Not only can structure-mapped PPI networks be used to study the interaction between the drug and the target protein in the context of biomolecular pathways, but they can also be used to analyze off-target interactions between the drug and the rest of the proteome (26). Therefore, protein structure determination is a critical part of studying PPIs.

1.1.4 Structure Determination

The primary methods for protein structure determination at atomic resolution are X-ray crystallography, nuclear magnetic resonance spectroscopy (NMR), and cryogenic electron microscopy (cryo-EM) (27). X-ray crystallography was the first method for resolving protein structures and remains the dominant, gold-standard method. X-rays are directed into a crystallized protein sample, and the diffraction pattern of the X-rays exiting the crystal is used to determine the protein structure. Structures determined through this method at the time of this writing account for approximately 125,000 protein entries in the Protein Data Bank (PDB), which is ten times more than NMR and electron microscopy combined. With systems that form suitable crystals, X-ray crystallography can produce high-resolution structures.
The rate of protein structure determination has massively increased in recent years due to high-throughput infrastructures and initiatives (27, 28). Steps ranging from selecting good protein targets to structure determination and deposition into the PDB have been automated for X-ray crystallography. Although some methods have been developed to observe conformational changes, proteins in X-ray crystallography experiments are constrained by the crystal lattice, and only a single protein conformation is typically reported (27). The resulting all-atom single-conformation structures are very convenient for detailed structural and functional studies. X-ray crystallography is also the dominant source of the protein complex structures used in this work.

In recent years, solution NMR has been used to determine more and more protein structures. NMR exploits atomic isotopes possessing nuclear spins that interact with external magnetic fields, as well as fields generated by neighboring electrons and nuclei, i.e., the interactions are dependent on the local chemical environment. The dependence of spin behavior on the chemical environment is exploited to determine protein structures. The advantage that solution NMR provides compared to X-ray crystallography is that proteins can be studied, as the name indicates, in solution. Solution NMR collects data from a large population of protein conformations that are sampled in equilibrium, which provides a window into the dynamics of proteins ranging from nanoseconds up to milliseconds (29). Studies of protein dynamics are essential for a better understanding of protein-mediated catalysis, signal transduction, and cellular communication, which are all processes that are modulated by or even depend on protein dynamics on different timescales (30).
Protein dynamics are particularly relevant for intrinsically disordered protein regions (IDRs) (29), which are protein parts that lack a unique tertiary structure in solution (see below for more details). NMR experiments on IDRs can measure residual secondary structures and long-range transient contacts that describe the IDR conformation ensemble (31, 32). Protein complexes that involve an intrinsically disordered partner, which I call IDR complexes, are the focus of this work. Among the IDR complexes that I collected and studied, the number of NMR structures is disproportionately high, indicating that NMR is the method of choice for studying IDR-mediated interactions. Furthermore, NMR experiments are more suited than X-ray crystallography for studying weak protein–protein interactions (33), which is a category to which many IDR interactions belong.

Lastly, cryo-EM is a method that has grown rapidly in recent years. Cryo-EM involves sending electron beams through vitrified protein samples and into detectors. While existing cryo-EM structures are mostly of low resolution, advancements in technologies such as direct electron detectors and computational correction of the drifting of the protein molecules now allow for resolutions approaching those of X-ray crystallography (34). Cryo-EM opens the possibility of studying protein systems that are not amenable to crystallization or are difficult to express in sufficiently high concentrations. Unlike in X-ray crystallography, the proteins are vitrified in near-native states. Therefore, cryo-EM is very useful for studying protein systems that sample multiple conformational states. Moreover, cryo-EM is suited for large systems with molecular weights greater than 100 kDa, which are generally intractable by NMR (35).
The large multi-protein assemblies studied through cryo-EM also enrich our knowledge base with heteromeric interactions, which have lagged behind the number of homomers due to study biases and recent high-throughput X-ray crystallography structural determination efforts (8). However, cryo-EM also presents challenges from the perspective of high-throughput bioinformatics analysis. Most existing computational tools, including the one that we developed, were programmed to use the legacy PDB file format, which does not support large complex structures. Furthermore, limited computational resources are also a factor in the processing of large protein structures. However, future computational tools may support larger structures through increasing computational resources and adoption of the new PDBx/mmCIF file format for structural data. Despite the advancements in methodology and the large-scale structural determination efforts, achieving full structural coverage of the immense PPI networks is unlikely in the foreseeable future. Structure-mapped interactomes would more likely consist of a combination of experimental and predicted data. One computational method for extrapolating more data from experimentally determined structures is protein interface prediction.

1.2 Protein Interface Prediction

Our protein structural knowledge base is still small compared to the size of our proteome, let alone our interactome, leaving many gaps in the molecular details of PPI networks. Consequently, computational methods have been developed to extrapolate from existing structural data sourced from experiments. For example, the PPI network database Interactome INSIDER uses computational prediction to complement interaction data when experimentally derived structures of the protein complexes are unavailable (4). One class of prediction method is the structure-based protein interface predictor, which identifies interaction sites without knowledge of the binding partner's structure.
Structure-based protein interface predictors are bioinformatic tools that predict putative protein interface surfaces on a given protein structure. Put differently, these methods classify protein residues into interface and non-interface classes, where the interface class consists of residues that make physical contact with protein partners. Since buried residues are typically not accessible, the scope of predictions for structure-based methods can generally be limited to the surface residues (i.e., solvent-accessible residues). With few exceptions, machine learning is used to predict the interface residues by processing feature scores extracted from the protein sequence and structure. A feature is a term used in machine learning for an individual property that is measured. For consistency, this writing will refer to any measured property used for prediction or analysis as a feature. In the following sections, an overview of existing interface prediction methods and their development is presented.

1.2.1 Key Features Used in Interface Prediction

In 1975, Chothia and Janin published their analysis of the early structures of the insulin dimer, the trypsin and pancreatic trypsin inhibitor complex, and the αβ oxy-hemoglobin dimer (2). It was apparent that hydrophobicity is a crucial contributor to binding affinity, while complementarity in polar interactions as well as in structure is essential for specificity. However, judging from hydrophobicity alone, one would not be able to distinguish the interfaces between trypsin and pancreatic trypsin inhibitor from the rest of the protein surfaces. The residue composition at interaction interfaces typically differs from that of non-interface surfaces. More concrete evidence of enrichment in hydrophobic and aromatic residues in interfaces was demonstrated after more protein structures were made available (36). Notably, residue composition can also be used to assess other physicochemical properties.
For example, Jones and Thornton calculated the propensity of each of the 20 canonical amino acid residues for being interface residues based on their frequency in the interface and non-interface regions of protein complex structures, i.e., the interface propensity of each residue type (37, 38). Furthermore, amino acids can be mapped to numerical values representing various properties through more than 400 amino acid indices on the AAindex database (39). Nevertheless, while residue composition at interfaces exhibits clear preferences on average, it does not consistently provide sufficient differentiation between interface and non-interface residues. Figure 1.1 illustrates differences in hydrophobicity, as well as other attributes critical for identifying interface residues discussed below, in relation to different regions of a protein complex structure.

Figure 1.1: Illustration of key interface features of interactions between folded proteins. The folded protein shown in molecular surface representation is colored by rASA values (discussed in following paragraphs), with the color scale ranging from blue to white to red representing rASA values from high to low (PDB ID 4M76) (40). Arrows point to examples of residue properties/feature scores on the protein, with blue and red arrows indicating high and low scores, respectively. In particular, residues in the buried region of folded proteins and at the center of protein–protein interfaces typically exhibit high conservation and hydrophobicity, while solvent-exposed non-interface residues are typically polar and less conserved. Protein interfaces also tend to be planar, as highlighted by the grey transparent disk. Although interfaces are planar overall, some interface residues protrude from the protein surface, creating more interaction surface.

A source of predictive information that is orthogonal to residue composition was found in evolutionary patterns.
Protein interfaces have functional importance, so interface residues are under stronger evolutionary constraints than non-interface residues (41). Conservation of residues is evaluated through the alignment of homologous protein sequences, i.e., proteins sharing ancestry, which is called multiple sequence alignment (MSA). MSA aligns the protein’s residues with the equivalent residues of homologous proteins such that residue substitution patterns at each sequence position can be assessed. For example, the invariant residue positions are predicted to be functionally important because the residues are under strong evolutionary constraints. Lichtarge et al. pioneered the integration of protein residue conservation patterns with structural information in their Evolutionary Trace method (42). They used the protein structures to separate the conserved residues into buried residues that are critical for structural stability and surface residues that are important for PPIs. The Evolutionary Trace method involves building a phylogenetic tree to represent the evolutionary relationships and distances between protein sequences, and this tree is used to cluster closely related sequences. The Evolutionary Trace method then determines the functionally important residues as those that are invariant either within each cluster or throughout all aligned sequences. The Evolutionary Trace method is an example of a phylogeny-based method for measuring residue conservation. Johansson and Toh classified conservation evaluation methods into substitution matrix-based and frequency-based, and frequency-based methods are further divided into phylogeny- and non-phylogeny-based (43). Phylogeny-based methods like Evolutionary Trace require the construction of a phylogenetic tree. Non-phylogeny-based methods often rely on the calculation of the entropy at each aligned residue position, evaluating variability at the sequence position (44).
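As a minimal sketch of the non-phylogeny, frequency-based idea, the Shannon entropy of each alignment column can be computed from residue frequencies, with low entropy indicating a conserved position. The toy alignment, the decision to skip gap characters, and the function names are assumptions made for illustration; published scoring schemes typically add refinements such as sequence weighting that are omitted here.

```python
import math
from collections import Counter

def column_entropy(column, gap='-'):
    # Shannon entropy (in bits) of the residues in one alignment column.
    # Gap characters are skipped, which is a simplifying assumption.
    residues = [r for r in column if r != gap]
    if not residues:
        return None  # column contains only gaps
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation_profile(msa):
    # One entropy value per aligned position; lower means more conserved.
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError('all aligned sequences must have the same length')
    return [column_entropy([seq[i] for seq in msa]) for i in range(length)]

# Hypothetical four-sequence alignment.
toy_msa = ['MKLV', 'MRLV', 'MKIV', 'MQLA']
profile = conservation_profile(toy_msa)
```

In this toy alignment the first column is invariant and therefore has zero entropy, while the second column, with three different residue types, has the highest entropy.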
Unlike the frequency-based approaches that only consider changes in individual residues, substitution-based approaches use residue substitution matrices to account for the similarity between the residues in the aligned sequence position. Substitutions between chemically similar residues indicate stronger conservation than substitutions between dissimilar residues because substitution with a more similar residue would presumably be less disruptive to protein structure and function. Because evolutionary patterns have a strong connection to function, residue conservation scores are indispensable features for protein interface prediction.

Solvent accessible surface area (SASA) is a pivotal innovation for protein interface prediction because not only is it useful for predicting interface residues, but it is also often used to divide residues in a protein complex structure into the surface, buried, and interface regions. SASA is the area measured on the solvent-accessible surface. The solvent-accessible surface of a protein is the surface traced by the center of a solvent probe rolling across the van der Waals surface of the protein, as described by Lee and Richards (45). The relative accessible surface area (rASA) of a residue “X” is a normalized accessibility measure determined by dividing the residue’s SASA measured in the protein structure by its SASA measured in a GLY-X-GLY tripeptide. One of the main performance advantages of a structure-based predictor over a sequence-based predictor of protein interfaces comes from knowing the surface residues (42), which can be identified exactly using rASA, except for proteins that undergo substantial conformational changes. Numerous studies, including our own, also define interface residues in the protein complex using rASA (46, 47). A residue whose rASA is lower in the bound structure than in the unbound structure, meaning the residue became more buried upon binding, is defined as an interface residue (46).
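The rASA normalization and the rASA-based definitions of surface and interface residues can be sketched as follows. All numbers are hypothetical: the reference tripeptide areas are illustrative placeholders rather than published values, the residue identifiers are invented, and the surface cutoff and minimum rASA drop are assumed parameters that the text does not specify.

```python
# Placeholder GLY-X-GLY reference areas (in square angstroms), for illustration only.
REFERENCE_SASA = {'LEU': 180.0, 'SER': 120.0, 'ALA': 110.0}

def relative_asa(sasa, residue_type):
    # rASA = SASA in the structure / SASA of the residue in GLY-X-GLY.
    return sasa / REFERENCE_SASA[residue_type]

def surface_residues(residues, cutoff=0.05):
    # Residues whose unbound rASA exceeds an assumed cutoff.
    return sorted(rid for rid, (rtype, unbound, _bound) in residues.items()
                  if relative_asa(unbound, rtype) > cutoff)

def interface_residues(residues, min_drop=0.0):
    # Residues whose rASA decreases on binding by more than min_drop.
    selected = []
    for rid, (rtype, unbound, bound) in residues.items():
        drop = relative_asa(unbound, rtype) - relative_asa(bound, rtype)
        if drop > min_drop:
            selected.append(rid)
    return sorted(selected)

# residue id -> (residue type, SASA unbound, SASA bound); hypothetical values.
toy_residues = {
    'A10': ('LEU', 90.0, 18.0),  # exposed, becomes buried on binding
    'A11': ('SER', 60.0, 60.0),  # exposed, unaffected by binding
    'A12': ('ALA', 0.0, 0.0),    # buried in both states
}
surface = surface_residues(toy_residues)
interface = interface_residues(toy_residues)
```

In practice, the per-residue SASA values would come from a dedicated surface-area program applied to the complex and to each chain separated from its partner; that step is outside the scope of this sketch.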
Furthermore, rASA is a useful feature for predicting protein interface residues because interface residues often protrude further from the protein surface than non-interface residues, which is reflected in higher rASA values (48, 49). Surprisingly, multiple studies corroborated rASA’s top rank in performance for classifying interface and non-interface residues, beating hydrophobicity and evolutionary conservation (50).

Another source of information that has been adopted early on is surface geometry. Early studies found that protein interfaces are typically more planar relative to the average protein surface (36). The planarity of a cluster of surface residues can be measured as the overall curvature of the surface or as the residues’ root mean squared deviation in three-dimensional coordinates from their least-squares plane (38). On the other hand, protein surface roughness is an assessment of the number of protrusions and depressions. Protein interactions tend to have residues protruding and interlocking across the interface, often involving larger residue types such as aromatic residues (38, 51). Thus, interfaces are relatively rough and bumpy. Pettit and Bowie proposed to measure protein surface roughness based on smoothed atomic fractal dimension, which measures the change in the molecular surface area as the probe used to trace the surface changes in size (52). A smooth surface will have no change in surface area, while a rough surface will have a decrease in surface area as the probe size increases. Notably, high roughness in interfaces is analogous to high residue rASA in certain aspects. Another surface geometry-based feature is pocket or groove detection. Some drugs and other small molecules favor binding to pockets on the protein surface, leading to the use of pocket-detection methods for identifying potential drug-binding sites (53). Pockets can be identified as spaces inside deep clefts or large concave regions on the protein surface (Figure 1.2).
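The planarity measure mentioned above, the root mean squared deviation of a set of points from their least-squares plane, can be sketched in pure Python. The sketch relies on the standard result that the mean squared distance to the best-fit plane equals the smallest eigenvalue of the coordinate covariance matrix, obtained here with the closed-form trigonometric solution for symmetric 3x3 matrices. The function names and the toy coordinates are hypothetical, and which atoms represent each residue is left open.

```python
import math

def smallest_eigenvalue_sym3(a11, a22, a33, a12, a13, a23):
    # Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric method).
    p1 = a12 ** 2 + a13 ** 2 + a23 ** 2
    if p1 == 0.0:
        return min(a11, a22, a33)  # matrix is already diagonal
    q = (a11 + a22 + a33) / 3.0
    p2 = (a11 - q) ** 2 + (a22 - q) ** 2 + (a33 - q) ** 2 + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b11, b22, b33 = (a11 - q) / p, (a22 - q) / p, (a33 - q) / p
    b12, b13, b23 = a12 / p, a13 / p, a23 / p
    det_b = (b11 * (b22 * b33 - b23 * b23)
             - b12 * (b12 * b33 - b23 * b13)
             + b13 * (b12 * b23 - b22 * b13))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)

def plane_rmsd(points):
    # RMSD of 3D points from their least-squares plane.
    n = len(points)
    if n < 3:
        raise ValueError('at least three points are needed to define a plane')
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    d = [(x - cx, y - cy, z - cz) for x, y, z in points]
    a11 = sum(v[0] * v[0] for v in d) / n
    a22 = sum(v[1] * v[1] for v in d) / n
    a33 = sum(v[2] * v[2] for v in d) / n
    a12 = sum(v[0] * v[1] for v in d) / n
    a13 = sum(v[0] * v[2] for v in d) / n
    a23 = sum(v[1] * v[2] for v in d) / n
    lam = smallest_eigenvalue_sym3(a11, a22, a33, a12, a13, a23)
    return math.sqrt(max(lam, 0.0))

# Hypothetical coordinates: a flat tilted patch and a slightly bumpy patch.
flat_patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
bumpy_patch = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1), (1.0, 1.0, 0.1), (0.0, 1.0, -0.1)]
flat_rmsd = plane_rmsd(flat_patch)
bumpy_rmsd = plane_rmsd(bumpy_patch)
```

A lower value indicates a more planar patch; the flat patch lies exactly on the plane z = x, so its deviation is zero up to rounding.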
Surface geometry features were critical in the development of our protein interface predictor described in Chapter 2.

Figure 1.2: Illustration of the curvature of a binding groove. On the left is the molecular surface of the transcription factor TFIIB colored by curvature values, with the color scale ranging from blue to white to red representing curvature values from high to low (PDB ID 1TFB) (54). Concave surfaces are represented by high curvature scores, while low scores represent convex surfaces. On the right is the same structure with the binding groove highlighted in blue, where the interface residues were mapped from a complex between TFIIB and an IDR of VP16 from herpes simplex virus (PDB ID 2PHG) (54). In this example, the binding groove is clearly identifiable by its large concave surface.

1.2.2 First Predictors

While the field continued to innovate on extracting characteristic features of protein interfaces, the ones mentioned above remained essential to prediction methods. It has even been suggested that residue solvent accessibility, hydrophobicity, conservation, and residue interface propensity constitute a sufficient set of features for interface prediction (48). Variations of these basic features have been in use since the very first prediction methods. Two of the first structure-based protein interface prediction methods were developed by Young et al. in 1994, and Jones and Thornton in 1997 (49, 55). Young et al. created a method for predicting protein interfaces that only uses residue hydrophobicity. To overcome the weakness of residue hydrophobicity as the lone feature, Young et al. implemented a method of clustering adjacent surface residues followed by a ranking of the clusters by hydrophobicity. They revealed that the top-ranking hydrophobic clusters of each protein strongly correlated with partner protein binding sites.
Jones and Thornton independently developed an approach using surface residue patches and combining multiple features: hydrophobicity, interface residue propensity, SASA, solvation potential, planarity, and protrusion (49). In their approach, hydrophobicity, interface residue propensity, and solvation potential were all scores indexed by residue composition. For example, the interface residue propensity of a specific residue type is a score derived from the surface area that it contributed to a dataset of PPI interfaces. The solvation potential relates to the residue’s preference for solvent exposure, so it was further scaled based on the residue’s SASA in the unbound structure. The feature scores for each surface patch were individually ranked relative to the patches from the same protein, and the ranked feature scores were subsequently combined with models that weighed each feature equally. The patches with top-ranking combined scores showed good agreement with the true PPI interfaces. Importantly, Jones and Thornton recognized differences between different categories of protein interfaces. Therefore, they used different combinations of features for the following categories: 1) homo-dimers and small proteins in hetero-complexes, 2) large proteins in hetero-complexes, and 3) antigens (49). These studies demonstrated some fundamental concepts and the feasibility of protein interface prediction. In the following years, the growing popularity and accessibility of machine learning methods, along with the growth of the protein structural knowledge base, led to many predictors being developed using machine learning. Machine learning is a field within artificial intelligence that consists of methods and algorithms for computers to learn from data. In general, the machine learning procedure involves a training step where a computer algorithm builds a model based on the features extracted from the training dataset.
Once the model is trained, its performance can be evaluated by processing the features extracted from a testing dataset (more on machine learning methods below). The training and testing procedure contrasts with Jones and Thornton’s initial method, where the models were manually designed and tested on all available data (49). One of the first predictors developed using machine learning methods is PPISP, which employed artificial neural networks (56). Features used by PPISP accounted for the rASA, residue composition, and conservation of interface residues. An artificial neural network was used to capture the features of each residue together with those of its 19 closest neighbors, analogous to the surface patch approach. Emphasizing the importance of incorporating information of adjacent residues, they used a second artificial neural network that uses the prediction scores of neighboring residues from the first network. Thus, instead of implicitly incorporating feature scores of spatial residue neighbors through the surface patch approach, PPISP’s second neural network improves performance by accounting for the neighboring residues’ predicted classification. Multiple studies, including our own, have subsequently adopted variations of this two-step architecture of predicting interface propensities followed by accounting for neighboring residues (57–59).

1.2.3 Classification and Application of Existing Prediction Methods

A broad range of methods has been developed for computational protein interface prediction in the last 15 years, which Xue et al. classified into docking and data-driven approaches (60). Docking generally involves sampling many proposed complex structures and selecting an optimal complex structure based on its stability, which is measured through energy functions. The energy functions are defined using physicochemical principles, statistics, and measures of geometric complementarity.
Importantly, classical docking predicts not only the interaction interface but also provides the structure of the interaction complex. Docking generally requires structural data of both binding partners, and docking of large folded protein domains can be challenging due to the large number of possible orientations and protein conformations (61). However, it has proven very suitable for the prediction of the interaction interfaces of peptides and small molecule drugs. Because of the flexibility of peptides, small probes can serve as proxies for peptides to efficiently search for binding sites that typically coincide with concave surface pockets (62). Examples of predictors of peptide binding sites on globular proteins (peptide interface predictors) that use the docking approach are PeptiMap and ACCLUSTER (62, 63). Importantly, both PeptiMap and ACCLUSTER are protein interface predictors and thus only locate the interface (i.e., binding pocket) of the peptide, unlike classical docking methods that generate the structure of the complex. Predicting the structure of the peptide complex requires an additional peptide folding step to propose the folded conformation of the peptide upon binding to the predicted pocket, which can be accomplished through programs such as FlexPepDock (64).

Data-driven methods rely on experimental data from protein sequence and structure. They can be further categorized into approaches based on homology and machine learning (60). The key difference between the machine learning and homology approaches is that machine learning approaches focus on features derived from protein sequence and structure, such as residue composition and protein surface geometry features, while homology approaches exploit the conservation of interfaces among homologous proteins. Homology approaches, which are also called template approaches, generally have a performance advantage over machine learning approaches when the structures of close homologs are available. Zhang et al.
demonstrated that protein interfaces are conserved among both closely homologous proteins of the same family and more distant structural neighbors (65). Thus, Zhang et al. developed PredUs, a method that predicts protein interfaces by using the protein complex structures of structural neighbors as templates for interface mapping. In the PredUs procedure, template interfaces from multiple complexes are mapped to the query protein, and residues that are mapped more frequently are more likely to be part of a conserved interface. This homology approach relies on a database of known protein complex structures from which it extracts interfaces of homologs. A notable advantage homology approaches have over machine learning approaches is that interfaces with any physicochemical or geometric characteristics could potentially be predicted accurately. Other examples of homology approaches are HomPPI and PrISE (66, 67). The term machine learning approach implies the use of machine learning algorithms to build classification models for protein interface and non-interface residues based on features extracted from protein structures (60). Examples of machine learning predictors include SPPIDER, ISMBLab, VORFFIP, CPORT, and ISPRED4 (47, 57, 68–70). Both feature extraction and the choice of machine learning algorithms influence the performance of these predictors. While the features considered in most studies have remained relatively unchanged over the last two decades, improvements have been made through increasingly sophisticated or innovative ways to engineer the feature scores. For example, instead of directly using rASA as a feature score, SPPIDER utilized the difference between the measured rASA and the rASA predicted based on protein sequence (47). The authors of SPPIDER reasoned that residues with higher measured rASA (i.e., more solvent-exposed) than predicted rASA might be part of an interface surface that would become buried in its partner-bound state.
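The difference-based feature described above can be sketched in a few lines. This is only a minimal illustration, assuming per-residue lists of measured and sequence-predicted rASA values; the function name, the example numbers, and the 0.2 cut-off are hypothetical and are not taken from SPPIDER.

```python
# Sketch of a difference-based feature score in the spirit of the SPPIDER idea.
# All names, values, and the cut-off below are hypothetical illustrations.

def rasa_difference(measured, predicted):
    # Positive values mean the residue is more solvent-exposed in the
    # unbound structure than the sequence-based prediction expects.
    if len(measured) != len(predicted):
        raise ValueError('rASA lists must have the same length')
    return [m - p for m, p in zip(measured, predicted)]

measured_rasa = [0.10, 0.65, 0.80, 0.30]   # from the unbound structure
predicted_rasa = [0.15, 0.40, 0.35, 0.30]  # from a sequence-based predictor

feature = rasa_difference(measured_rasa, predicted_rasa)

# Residues whose exposure exceeds the prediction by an arbitrary margin
# are flagged as candidate interface residues.
candidates = [i for i, delta in enumerate(feature) if delta > 0.2]
print(candidates)
```

In a real predictor, the difference would be passed to the classifier as one feature score among many rather than thresholded directly.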
In another approach, the development of ISMBLab exploited the observation that core regions of protein interfaces are strongly hydrophobic, sharing characteristics with the buried protein interior (46, 68). The ISMBLab method uses a probability density map derived from non-covalently interacting atoms between buried residues of protein structures. Using this probability density map of interacting atoms, ISMBLab identifies the putative interacting atoms on the protein surface and thus predicts the protein interface. The development of interface predictors also benefited from advancements in machine learning algorithms. Popular machine learning algorithms include artificial neural networks (ANNs), support vector machines (SVMs), random forests (RFs), gradient boosted trees, and conditional random fields (CRFs). PPISP and SPPIDER, which were early adopters of machine learning methods, are both based on artificial neural networks. As SVM and random forest classifiers grew in popularity, so did the number of interface predictors utilizing them, including VORFFIP (RF) and ISPRED4 (SVM and CRF). Deep learning, a field of machine learning based on large artificial neural networks that has recently gained broad popularity, has also been applied to protein interface prediction (71). Compared to homology approaches, machine learning approaches are less reliant on homologous complex structures, so the predictions are potentially more generalizable to novel structures. Protein interface prediction has become more useful as advances in structural biology and machine learning have resulted in increasingly accurate prediction models. Protein interface predictors can directly assist in analyzing protein structure and function, or they can be used as part of a larger process for extrapolating data.
For example, the propensity of surface residues for interaction is information that could be used to assist in generating atomic-resolution structural models of protein complexes. The combination of protein interface prediction with docking is sometimes called data-driven docking; examples include pairing the CPORT interface predictor with the HADDOCK docking approach (70), the metaPPI interface predictor with the BDOCK docking approach (72), and the VORFFIP interface predictor with the PatchDock docking approach (i.e., V-D2OCK) (73). Docking produces high-resolution structures of protein complexes but is computationally intensive if all poses are exhaustively sampled and scored at high resolution. Experimental data, such as restraints derived from NMR chemical shift perturbation, have been used to guide simulations (74, 75). Analogously, protein interface prediction results can be used as restraints for docking to narrow the sampling space, which could reduce the resources required and has been shown to increase performance (70). More generally, computational prediction methods can provide atomic-resolution structures in situations where experimental data are missing or of low resolution. An example of the successful use of computational prediction is the structure determination of HIV glycoproteins in complex with antibodies and the T-cell glycoprotein CD4 by complementing a low-resolution cryo-EM density map with homology modeling and docking (76). Interface predictors have also been directly integrated into structure-mapped PPI network databases. The PrePPI database aggregates PPI and structural data, both experimentally determined and predicted (77). Regarding protein interface predictions, PrePPI provides results from cons-PPISP, PINUP, and PredUs (56, 65, 78). Interactome INSIDER is another resource integrating PPI data with experimental and predicted structural data (4).
The predictor developed for this resource, ECLAIR, provides partner-specific interface predictions by integrating conventional features of protein interface prediction with protein-partner-specific docking and co-evolution feature scores. ECLAIR can also provide predictions in the absence of structural information through sequence-based protein interface prediction, which significantly increases its coverage of the proteome space, albeit at lower accuracy. State-of-the-art PPI network resources such as PrePPI and Interactome INSIDER not only aggregate vast amounts of experimental data but also extrapolate additional information through computational prediction, providing users with as much coverage of the interactome as currently possible. In all the categories and applications of protein interface predictors introduced above, machine learning is applied in some way. Our prediction method presented in Chapter 2 uses gradient boosted trees and CRF, which I will introduce in the following section, together with a general overview of machine learning procedures. 1.2.4 Machine Learning Machine learning classifiers are models that use observed features to predict unobserved class labels. Machine learning algorithms can be broadly divided into supervised and unsupervised learning: the data instances used in training are labeled in supervised learning but not in unsupervised learning, which comprises methods for analyzing patterns in the data (e.g., clustering algorithms) (79). The focus of this writing is on supervised learning of classification models, where each data instance, also called an example, consists of a class label and a set of features. In our case, each example is a residue that is labeled as interface or non-interface and has feature scores for properties such as conservation and hydrophobicity. The process of developing a machine learning model involves training and testing steps.
The goal is not only to learn from the training dataset but also to generalize the knowledge to new examples (79). Consequently, a way to estimate a model’s performance on new examples is required. One solution is to divide the full dataset into a train set and a test set. The machine learning algorithm uses the train set to train a model to correctly predict the labels of the examples based on their features. Subsequently, a test set consisting of examples that the model has not learned from is used to measure the model’s performance (79). The performance of the model is generally some function of the numbers of correctly and incorrectly predicted examples. There are many measures of performance, and a commonly used measure is accuracy, which is the number of correctly predicted examples divided by the total number of examples (see calculation in Chapter 2 Methods) (50). The performance measured on the test set is an estimate of the model’s performance on the full distribution of data from which the datasets were sampled, indicating how useful the model is expected to be on all future input examples. The process of training models and estimating their performance is generally repeated many times over the development of the final model. Different models can be trained from the same dataset, and the process of selecting an optimal one is called model selection. Machine learning algorithms generally provide ways to adjust how the models are built, and these options are often called hyperparameters. They are called hyperparameters to distinguish them from the parameters of models, which are the values and properties of a model that are optimized during training. Models produced using different hyperparameters can be compared through testing on validation sets (79). Validation sets can be sampled from the train set, analogous to dividing the full dataset into train and test sets.
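To make the train and test procedure concrete, the following is a minimal sketch of a hold-out evaluation. It assumes a toy dataset in which each example has a single hydrophobicity-like score and a binary label; the data, the one-feature model, and all function names are hypothetical and do not correspond to the method developed in Chapter 2.

```python
import random

# Toy illustration of a hold-out evaluation. Each example is a tuple of
# (score, label); label 1 = interface, 0 = non-interface.
# The data and the one-feature model are hypothetical.

def split_train_test(examples, test_fraction=0.25, seed=0):
    # Shuffle a copy, then hold out a fraction that is never used in training.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    # Correctly predicted examples divided by the total number of examples.
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

def predict(cutoff, scores):
    return [1 if s >= cutoff else 0 for s in scores]

def train_cutoff(train):
    # Training step: pick the score cut-off with the best train-set accuracy.
    scores = [s for s, _ in train]
    labels = [y for _, y in train]
    return max(sorted(set(scores)),
               key=lambda c: accuracy(labels, predict(c, scores)))

examples = [(i / 20, 1 if i >= 10 else 0) for i in range(20)]
train_set, test_set = split_train_test(examples)
cutoff = train_cutoff(train_set)
test_accuracy = accuracy([y for _, y in test_set],
                         predict(cutoff, [s for s, _ in test_set]))
print('cut-off:', cutoff, 'test accuracy:', test_accuracy)
```

The cut-off here plays the role of a model parameter learned from the train set, whereas `test_fraction` and `seed` are choices made before training.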
N-fold cross-validation is one of the common ways of generating validation sets. In N-fold cross-validation, the train set is divided into N partitions, and each partition is used to test a model trained on the remaining partitions, resulting in N models for each set of hyperparameters. After iterating through different sets of hyperparameters, the set resulting in the highest average performance can be used to train the final optimized model. Another way to enhance model performance is to improve the feature set that is used by the model. Features are extracted from raw data, and feature engineering is the process of developing and improving the procedures for converting the raw data into variables that are useful for machine learning. Our development of a groove score to locate indentations on the protein surface is an example of engineering a feature using an independently developed procedure (see Chapter 2). In the development of protein interface predictors, existing procedures for processing protein data can be utilized. For example, there are many programs for calculating residue conservation scores from the protein sequence. In this case, part of the process of building a good feature set is to select the conservation scoring method that performs best on predicting protein interfaces. The above feature engineering examples involve the manual application of structural biology knowledge. In contrast, features can also be generated through a more automated approach. For example, the conservation score of each residue could be ranked relative to the residues of the same protein to reduce differences stemming from the varying alignment quality between different query proteins. Another example is the averaging of conservation scores of surface residue clusters to smooth the conservation scores over each residue’s spatial neighbors. Similar scaling and averaging procedures can be easily applied across all features.
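The two automated transformations just described, within-protein ranking and averaging over spatial neighbors, can be sketched as follows. This is an illustrative sketch only: the function names, the tie-handling rule, and the neighbor map (which in practice would come from a distance cut-off on the structure) are assumptions made for the example.

```python
# Sketch of two generic feature transformations. All names and the toy
# inputs are hypothetical illustrations.

def rank_scale(scores):
    # Rank each residue's score relative to the other residues of the same
    # protein, scaled to [0, 1]. Tied scores share the average of their ranks.
    n = len(scores)
    if n == 1:
        return [0.5]
    scaled = []
    for s in scores:
        lower = sum(1 for other in scores if other < s)
        equal = sum(1 for other in scores if other == s)
        scaled.append((lower + 0.5 * (equal - 1)) / (n - 1))
    return scaled

def smooth_over_neighbors(scores, neighbors):
    # Average each residue's score with the scores of its spatial neighbors.
    smoothed = []
    for i, s in enumerate(scores):
        group = [s] + [scores[j] for j in neighbors.get(i, [])]
        smoothed.append(sum(group) / len(group))
    return smoothed

conservation = [0.2, 0.9, 0.5, 0.5]                   # raw scores, one protein
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # toy neighbor map

ranked = rank_scale(conservation)
smoothed = smooth_over_neighbors(ranked, neighbors)
print(ranked)
print(smoothed)
```

Because neither function depends on what the score represents, the same two steps can be applied to any per-residue feature.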
Feature engineering can create large sets of features with significant redundancy, requiring a subsequent feature selection step during which the feature set is reduced. Feature selection can be achieved by training models that use different subsets of the features and selecting the best model, which is analogous to the hyperparameter optimization procedure described above. The results of feature selection are models that use fewer features and are thus lower in complexity. Applying expert knowledge to design features and, subsequently, modifying and filtering the features are important procedures for optimizing the data for machine learning. Figure 1.3 provides an overview of the procedures presented above. Figure 1.3: General workflow for supervised learning of a classification model. (1) The first step is to obtain a dataset with known class labels and divide it into train and test sets. (2) Feature engineering is typically required to process the raw data into variables that are more useful for modeling the classification problem. (3) The model is trained and optimized using cross-validation sets derived from the training data. Optimization procedures include feature selection and hyperparameter optimization, and the best combination of features and hyperparameters is chosen through model selection. (4) The optimized model that is trained on the train set is evaluated on an independent test set. The actual workflow varies, and the entire process may be repeated to test multiple algorithms for model training (e.g., SVM or random forest). One of the machine learning algorithms we chose for our prediction method is gradient boosted trees, as implemented in the XGBoost library developed by Chen et al. (80). A gradient boosted trees model consists of an ensemble of regression trees. A regression tree can be illustrated as a tree-like diagram with nodes that each split into two branches until terminal nodes (i.e., leaves) are reached.
The split at each node represents a threshold in one particular feature score (e.g., dividing residues into high versus low conservation scores). Each leaf in the regression tree is associated with a prediction score. Thus, in a protein interface predictor, each residue starts at the top of the regression tree and ends at a leaf, resulting in a prediction score. Each gradient boosted trees model contains many trees, and the final prediction score for each residue is the sum of the scores from all the trees. In broad terms, the algorithm trains a model by minimizing an objective function that consists of a loss function that reflects the error of the model and a regularization term that reflects the complexity of the model structure (80). The complexity of the model is ideally kept low to reduce overfitting, which is an adverse outcome where a model fits the train set too well and thus also fits the noise in the training examples, making the model ungeneralizable to new examples (79). Regression trees are built iteratively by optimizing the objective function. Each new tree is built to minimize the residual (i.e., error) of the growing tree ensemble (80), and thus each new tree improves the ensemble’s accuracy. Training ends when a model grows to the number of trees requested, which is one of the hyperparameters that could be optimized. Another method we used was conditional random fields (CRFs). We used a software library called FACTORIE to construct our CRFs, and FACTORIE uses a representation of graphical models called factor graphs (81, 82). Factor graphs consist of variables, which include labeled variables (i.e., residues) and feature variables (e.g., feature scores), and factors, which connect the variables and contain the weights (i.e., parameters) that describe the compatibility between the variables.
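Returning briefly to the gradient boosted trees described above, the additive structure of the model can be illustrated with a deliberately simplified sketch. It assumes one-split trees (stumps), a squared-error loss, and a fixed learning rate, and it omits the regularization term and the second-order optimization used by XGBoost; it is not the XGBoost algorithm or API, and all names and data are hypothetical.

```python
# Simplified boosting sketch: each new one-split tree is fit to the residual
# of the current ensemble, and the final score is the sum over all trees.
# Not the XGBoost implementation; names and data are hypothetical.

def fit_stump(rows, residuals):
    # Search every feature and threshold for the split whose two leaf
    # scores (mean residuals) give the lowest squared error.
    # Assumes at least one feature takes two or more distinct values.
    best = None
    for f in range(len(rows[0])):
        for cut in sorted(set(r[f] for r in rows)):
            left = [e for r, e in zip(rows, residuals) if r[f] < cut]
            right = [e for r, e in zip(rows, residuals) if r[f] >= cut]
            if not left or not right:
                continue
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            err = (sum((e - lm) ** 2 for e in left)
                   + sum((e - rm) ** 2 for e in right))
            if best is None or err < best[0]:
                best = (err, f, cut, lm, rm)
    return best[1:]

def stump_score(stump, row):
    f, cut, low_leaf, high_leaf = stump
    return low_leaf if row[f] < cut else high_leaf

def predict(trees, row, learning_rate):
    # Final prediction score: sum of the (scaled) scores from all trees.
    return sum(learning_rate * stump_score(t, row) for t in trees)

def fit_boosted(rows, labels, n_trees, learning_rate):
    trees = []
    for _ in range(n_trees):  # the number of trees is a hyperparameter
        residuals = [y - predict(trees, r, learning_rate)
                     for r, y in zip(rows, labels)]
        trees.append(fit_stump(rows, residuals))
    return trees

# Each row: (conservation, hydrophobicity); label 1 = interface residue.
rows = [(0.9, 0.8), (0.8, 0.3), (0.7, 0.9), (0.2, 0.7), (0.3, 0.2), (0.1, 0.4)]
labels = [1, 1, 1, 0, 0, 0]
trees = fit_boosted(rows, labels, n_trees=10, learning_rate=0.5)
scores = [predict(trees, r, 0.5) for r in rows]
print(scores)
```

Each added stump moves the scores of the interface residues closer to 1 while leaving the non-interface scores near 0, which mirrors the residual-fitting behavior described above.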
For example, since interface residues form clusters, a residue with an interface label is more compatible with a neighboring residue that also has an interface label than with one that has a non-interface label. A CRF models the conditional probability distribution of the labeled variables given the feature variables (81). For training, the algorithm maximizes the likelihood of the model parameters given the training examples. Predictions of CRFs are made through probabilistic inference, predicting a state of the graph (i.e., the predicted labels of all labeled variables) that maximizes the conditional probability distribution given the feature variables and the model parameters. The advantage of using CRF for protein interface prediction is that the predicted class labels of neighboring residues are accounted for during inference, in contrast to algorithms such as gradient boosted trees that generally model each residue’s local environment through the residue’s feature scores (57). The power of supervised learning algorithms stems from the ability to learn from large and complex datasets. Machine learning classifiers are optimized for prediction performance, which contrasts with statistical models that focus on explaining relationships in the data. Furthermore, machine learning algorithms can handle large datasets and capture higher-order relationships between features that manual analysis by humans may miss, making machine learning suitable for exploiting the wealth of data from protein structures. 1.2.5 State of the Art As innovative methods are developed and computational resources grow, one path of advancement is to combine multiple approaches or predictors. The PredUs web server uses the homology approach, but it also uses machine learning to integrate SASA and information from homologous structures (65, 83). In more detail, the frequency with which the aligned regions are found in interfaces in homologous structures and their SASAs are treated as features for an SVM model.
Subsequently, the PredUs authors also developed an improved method, PredUs2, which integrates another model that utilizes residue interface propensity (i.e., a score based on residue composition), creating a hybrid method utilizing homology and physicochemical properties (84). While PredUs2 selected two orthogonal methods to complement one another, a separate class of predictors called meta-predictors takes the more general approach of combining multiple independently developed predictors. Meta-predictors combine multiple predictors, each with its own strengths and weaknesses, with the aim of producing a single prediction that outperforms each individual component. CPORT and metaPPI are examples of meta-predictors (70, 72). Another path of advancement for protein interface prediction is to target a more specific protein interface category, which is the path we have taken. The distinct characteristics of different categories of protein interfaces and their effect on prediction have been recognized since the beginning, as illustrated by the three models for three interface categories created by Jones and Thornton (49). In recent years, there has been a resurgence of predictors developed to specifically identify subclasses of protein interfaces, including transient (e.g., RAD-T) (85), transient and obligate (e.g., BindML+) (86), antibody, and peptide interfaces (e.g., ISMBLab, PeptiMap, ACCLUSTER, Multi-VORFFIP) (62, 63, 68, 69). Multi-VORFFIP demonstrates some of the advantages of using specialized interface predictors. For every protein structure submission, its server produces a set of four predictions for interfaces with protein, peptide, DNA, and RNA, respectively. It is not surprising that a model optimized for binding sites of DNA, a separate class of molecules with distinct electrostatic properties, has relatively high specificity for DNA interfaces, but it is notable that the peptide-VORFFIP and protein-VORFFIP predictors are also selective for their respective interface classes.
A specialized protein interface classifier will most likely outperform a general one if the protein interface subclasses are substantially different, such that examples of the interface categories are distributed in separate areas of the feature space (87, 88). In such cases, it would be easier to optimize a model for the subclass given its simpler distribution and lower variance in the feature scores. Furthermore, users would be better informed of the expected performance of the specialized predictor thanks to benchmarking results on the more precisely defined datasets used for training and testing. For instance, a peptide interface predictor would be evaluated on peptide interface datasets. In this manner, the strategy of developing methods for specific interface classes provides more meaningful and accurate predictions than their generalized counterparts. One specific class of interactions that deserves greater attention is that mediated by intrinsically disordered regions. 1.3 Properties of IDR-Mediated Interactions Intrinsically disordered regions (IDRs) are highly prevalent in our proteome. While they do serve important structural roles, including as linkers and spacers between folded protein regions (89), I will be focusing on their role in mediating PPIs. The human proteome has been estimated to contain 132,000 binding motifs in IDRs based on sequence-based computational predictions (90). Their flexibility and accessibility have been theorized to confer multiple advantages, including compact recognition elements, fast kinetics suitable for signaling, promiscuous interactions with multiple partners, exposure of post-translational modification sites, and the changing or removal of functional elements through alternative splicing (9, 91, 92). IDRs are often involved in the interactions with hub proteins, which strongly influence the connectivity of PPI networks (93, 94).
PPIs are postulated to evolve more readily from IDRs because their residues are generally under less evolutionary constraint than the residues that are confined in the structures of folded proteins (95). IDRs containing post-translational modifications and interaction-mediating elements are also more often alternatively spliced, providing a mechanism to change the binding partners of proteins and rewire the interaction network, which contributes to diversification across tissues and species (92). Other IDRs perform the role of structural scaffolding for multimeric complexes (96). More recently, IDRs have also gained attention for their role in liquid-liquid phase separation (97, 98). Because of the magnitude and breadth of processes that IDR-mediated interactions are involved in, there is considerable interest in their underlying structures and mechanisms. 1.3.1 Defining MoRFs and Peptide Motifs IDRs are protein regions that do not fold into well-defined three-dimensional conformations on their own and thus are natively unfolded (99). Instead, an IDR samples an ensemble of conformations in its native state when not in a complex with other molecules. Conversely, regions that do fold are often called folded or globular domains. The term globular refers to the folded protein’s spherical shape, but I generalize the term globular proteins to refer to any independently folding domains in this writing. Molecular recognition features (MoRFs) are a type of interaction-mediating element in IDRs and are a focal subject of this writing. MoRF sequences range in length from 10 to 70 residues but can be longer, and they are embedded inside IDRs (100). Notably, MoRFs characteristically transition from a disordered to an ordered (i.e., folded) state upon binding to a protein. MoRFs often fold into secondary structures, such as alpha-helices, in their partner-bound state, and MoRF interactions are diverse in structure and interface size.
A MoRF at the long end of the spectrum is the 69-residue region of p27 that binds to the heterodimer of two globular proteins, i.e., Cdk2 and cyclin A (101). Importantly, IDRs also have shorter elements that mediate interactions: peptide motifs, which are segments generally shorter than ten residues and are also known as short linear motifs (SLiMs) (90, 102). In contrast to MoRFs, peptide motifs often bind to specific binding domains through conserved sequence patterns. One example is the proline-rich segments that bind to SH3 domains (103). While MoRFs and peptides (motifs) are both found in IDRs and overlap in characteristics and definitions, a study by London et al. uncovered notable distinctions between peptide- and MoRF-mediated interactions (104). Furthermore, there exist protein interface predictors for peptide binding sites, but their prediction accuracies on MoRFs have yet to be evaluated. Therefore, there is value in characterizing MoRFs and peptides separately since they are differentiated by convention and by physical characteristics, although they likely fall within a continuous spectrum like many other biological concepts. Examples of peptide–globular, MoRF–globular, and globular–globular complexes are depicted in Figure 1.4. Figure 1.4: Three classes of protein–protein interactions. Examples of PPI structures representing, from left to right, peptide–globular (PDB ID 1AWR) (105), MoRF–globular (PDB ID 5VAY), and globular–globular (PDB ID 4M76) (40) interactions. Interaction-mediating elements within IDRs, including both peptides and MoRFs, are in yellow. Globular domains are in blue. In the following sections, I will discuss known properties of peptide- and MoRF-mediated PPI structures, comparing them against complexes formed by globular protein domains (globular complexes).
While some MoRFs can interact with other MoRFs, the vast majority of the protein complexes we study are interactions of MoRFs with globular proteins and of peptides with globular proteins, which will be called MoRF complexes and peptide complexes, respectively. 1.3.2 Peptide Complexes The distinguishing structural properties of peptide complexes revealed by London et al. provided insights essential to our research (104). One defining characteristic is the pockets to which peptides bind, a trait shared by binding sites of small molecules and drugs (62). The majority of the peptides either bind into large pockets or latch onto smaller pockets by reaching their side chains into their globular partners. Notably, the largest pocket of the protein surface often coincides with the peptide binding site (104), making surface cavities a telltale sign of peptide binding sites. The binding pockets are one of several characteristics of peptide complexes that contribute favorable binding energy to overcome the conformational entropy cost stemming from the intrinsic flexibility of peptides (104). The pockets favor the packing of the peptide complex interfaces, increasing the favorable enthalpic contribution and offsetting the entropy loss that comes with constraining the peptide conformations in the bound state. Supporting this reasoning, many of the peptides that do not bind to pockets have proline residues that reduce the peptides’ conformational freedom in their unbound states (104). London et al. also determined the key residues that contribute substantially to the peptide affinity, namely hotspot residues. The hotspots on the peptides are strongly enriched in aromatic and hydrophobic residues, often corresponding to conserved motifs, such as those in SLiMs (104). Distinguishing features of peptide complexes also include higher hydrogen bond density and tighter packing relative to globular complexes.
Furthermore, peptide binding induces only a small amount of conformational change in the globular partner interface. The rigidity of the binding partner interface was proposed to contribute to reducing the entropic cost of peptide binding, and the lack of conformational change bodes well for the prediction of peptide binding sites. 1.3.3 MoRF Complexes MoRF is a term coined by Mohan et al. to identify a class of unstructured protein segments that undergo disorder-to-order transitions upon binding to their partner proteins (100). MoRFs are intermediate in length between peptide motifs and globular proteins and may sample partially folded structures prior to binding. The partially folded conformations are termed residual structures, and these protein regions are also referred to as preformed structural elements (106, 107). Thus, MoRFs fall somewhere between peptides and globular proteins from both sequence and structural perspectives. The following sections present a detailed characterization of MoRF complexes, focusing on their differentiating features that form the basis of the work presented in this dissertation. 1.3.3.1 Residue Composition and Contributions in MoRF Complexes Strong evolutionary conservation and distinct residue compositions are striking characteristics of MoRF sequences. The interface residues of the MoRF are the most conserved, but even those residues not directly involved in the interaction are more conserved than the protein average, suggesting the functional importance of the whole sequence for folding upon binding (108). Notably, conservation measured in MoRFs appeared to be even more pronounced than in structured protein interfaces (108). The MoRF complex interface is heavily enriched in hydrophobic residues, more so than globular interfaces (87). In particular, we observed a greater fraction of alanine, leucine, isoleucine, and phenylalanine than in globular interfaces (109).
Nearly 56% of the MoRF complex interface atoms are non-polar, in contrast with the 45% in both peptide and globular complexes (104). Because MoRFs do not have hydrophobic cores, their hydrophobic residues are exposed to solvent until binding, and the desolvation of these residues is highly favorable for the interaction. This pattern of enrichment in conserved and hydrophobic residues, which are typically depleted within IDRs, is very distinctive, as demonstrated by the high accuracy of sequence-based predictors of MoRF segments (110, 111). Although the residue composition of the MoRF-partner interface is not as distinctive as that of the MoRF itself, some notable characteristics have been previously identified. Our previous study examined the residue composition of MoRF binding sites on their globular partners (MoRF interfaces) divided into core and rim regions since the two regions have shown distinctive properties in globular interfaces (46). Interface residues are categorized by measuring the solvent accessible surface area (SASA) in the partner-bound state, defining buried residues as core and partially solvent-exposed residues as rim (46). Generally, the conserved and hydrophobic core residues are situated in the center of the interface and are surrounded by polar rim residues. The division highlighted the enrichment of leucine, valine, and methionine in the core of the MoRF interface (109). Notably, methionine has been suggested to make nonspecific hydrophobic interactions (87), which is in agreement with the observation that some MoRF-partners can interact with multiple MoRFs through the same binding cleft, forming promiscuous interactions (112). The interface residues in MoRF complexes also differ in terms of contributions to binding affinity. A method for estimating the contribution each residue makes to a protein interaction is alanine scanning.
Computational alanine scanning is performed by mutating each residue to alanine and estimating the change in binding free energy between the wild-type and mutant interactions. Similar to those of peptides, interface residues of MoRFs typically each contribute a larger interaction surface and higher interaction energy than those of globular complexes (109). MoRF interactions also exhibit a higher average number of salt bridges than globular interactions. Although the density of the salt bridges is not significantly higher than in globular interactions, alanine scanning analysis suggested that each individual charged residue, on both the MoRF and the MoRF-partner, makes a larger contribution to binding free energy (109). We previously also estimated that the overall electrostatic component of the binding free energy is higher for MoRF complexes than for globular complexes, lending further support to the significance of the charge pairings in MoRF interactions. In summary, while lower in hydrogen bond density and interface packing compared to peptide interactions, MoRF interactions appear to rely on stronger hydrophobic effects and electrostatic pairings to overcome the entropic cost of association. 1.3.3.2 Interface Geometry One clear difference between MoRF and peptide interactions is the size of the interaction surface. Measuring the total surface area (SASA) buried upon binding and dividing it by two resulted in mean areas of 512 Å² and 1361 Å² for peptide and MoRF interactions, respectively (104). Many MoRF interactions bury even larger surface areas than globular–globular interactions, which have a mean estimated at 1151 Å². Although some MoRFs can fold onto themselves upon binding, many tend to form extended structures that wrap around their partners. Their distinctive bound conformations formed the basis of our rationale for using the radius of gyration, a measure of extendedness, to screen for MoRF complexes in the PDB in our previous study (109).
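As an illustration of why the radius of gyration reflects extendedness, the sketch below computes it for two toy chains of equal length. It assumes an unweighted radius of gyration over one representative atom per residue (e.g., C-alpha) with a 3.8 Å spacing between consecutive residues; the coordinates are synthetic, and the screening thresholds used in our previous study are not reproduced here.

```python
import math

# Unweighted radius of gyration over one point per residue.
# The coordinates below are synthetic toy chains, not real structures.

def radius_of_gyration(coords):
    # Root mean square distance of the points from their centroid.
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    total = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                for x, y, z in coords)
    return math.sqrt(total / n)

# Ten residues laid out in a straight line (fully extended).
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]

# Ten residues packed onto a small lattice (compact).
compact = [(3.8 * (i % 2), 3.8 * ((i // 2) % 2), 3.8 * (i // 4))
           for i in range(10)]

rg_extended = radius_of_gyration(extended)
rg_compact = radius_of_gyration(compact)
print(rg_extended, rg_compact)
```

For the same number of residues, the extended chain yields a markedly larger value than the compact one, which is the property exploited when screening for extended bound conformations.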
Thus, MoRF interfaces are typically large and elongated. A potent structural feature for finding peptide binding sites on globular proteins (peptide interfaces) is a prominent surface pocket, and classic examples of MoRF complex structures provide anecdotal evidence suggesting that this structural cue is transferable to MoRF interfaces. For example, p53 folds into an alpha helix and sits deeply in the elongated pocket of MDM2 (113). However, as a counterexample, the indentation on the KIX domain of CBP, which interacts with a MoRF of the transactivation domain of CREB, is much less distinguishable from the non-interacting surfaces (114). More formal evidence was provided by Vacic et al., who measured the planarity of MoRF interfaces, evaluating the flatness of the interface as the root mean square deviation (RMSD) of the interface atoms to their least-squares plane (87). The RMSDs of MoRF interfaces are somewhat higher, especially in beta-sheet-forming MoRF complexes. Taken together with the lower mean rASA of the MoRF interface residues, Vacic et al. interpreted that the residues of MoRFs tend to protrude into their binding partners. In other words, MoRFs tend to sit in concave surfaces of their globular partners. Compared to peptide interfaces, this property is likely weaker in MoRF interfaces. Therefore, I refer to these concave surfaces as grooves, emphasizing their larger elongated shape and smaller curvature, which I have illustrated in Figure 1.5. Consequently, surface grooves will be an interesting feature to evaluate for MoRF interface prediction.
Figure 1.5: Relative size and shape of MoRF interface regions. (Top) MoRF interfaces are typically elongated with a larger rim region (blue) surrounding a narrow core region (red). The relative sizes of the core and rim regions depend on the rASA thresholds used to define them (see Chapter 2.5). (Bottom) The cross section of the MoRF interfaces, with the protein interior in grey.
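The interface areas quoted in this section are half of the total SASA buried when two partners associate. As a minimal sketch of that arithmetic, the Python snippet below computes the buried area from three SASA values. The function name and the numbers are illustrative assumptions and are not taken from the cited studies; in practice the three values would come from a surface-area program run on each isolated partner and on the complex.

```python
def interface_area(sasa_a, sasa_b, sasa_complex):
    # Total SASA lost by both partners when the complex forms.
    buried = sasa_a + sasa_b - sasa_complex
    if buried < 0:
        raise ValueError('complex SASA exceeds the sum of the isolated partners')
    # Halve the total so that the area describes one face of the interface.
    return buried / 2.0


# Made-up SASA values in square angstroms, for illustration only.
area = interface_area(sasa_a=5000.0, sasa_b=2000.0, sasa_complex=6000.0)
print(area)
```

Reporting half of the buried total makes the value comparable to the footprint of one partner on the other, which is how the mean areas above are expressed.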
1.3.3.3 Binding Mechanism
MoRFs’ flexibility and strong enrichment of hydrophobic residues lead to unique binding mechanisms in MoRF-mediated interactions. MoRFs contain some hydrophobic residues but not enough to form the core of an independently folded domain, leading to the suggestion that MoRFs form a hydrophobic core by melding with their partners’ interfaces upon binding (108). Thus, MoRF sequences do not have enough self-interaction energy to fold independently but can do so with the assistance of a globular protein. This idea is the basis for the sequence-based MoRF prediction program ANCHOR, which screens for sequences with self-interaction energies strong enough for binding proteins but not enough for independent folding (115). Although MoRFs do not adopt fixed folds independently, they often exhibit residual structures in their native state, sampling conformations with biases to specific secondary structures. A well-studied MoRF exhibiting residual structure is the TAD domain of p53, which has high helical content prior to binding MDM2, as observed through CD spectroscopy (116). Since the conformational space of p53 is already restricted prior to binding, the entropic cost of binding is reduced (116). A MoRF that samples the bound conformation prior to binding suggests, but does not prove, that the interaction follows the conformation selection mechanism (116). Consistent with this mechanism, sequence variants of p53 with higher native helicity have shown higher affinity (116). Conversely, a sequence with lower helicity has been shown to bind to MDM2 through an enthalpy-driven mechanism, overcoming the entropic cost through stronger interactions with the binding partner. Such interactions, where the folding of the MoRF is induced by interacting with the MoRF-partner, are more likely to follow the induced folding mechanism. Induced folding and conformation selection are likely competing binding regimes that together modulate MoRF interactions.
1.3.3.4 Interaction Dynamics
While MoRFs become more folded upon binding, they often retain varying levels of conformational flexibility in their bound states (107). Conformational heterogeneity in the bound state was termed fuzziness. Some MoRFs exhibit fuzziness by binding through multiple modes, while others may interact through a mix of permanent, folded, and transient interactions (117). In the extreme case, protein regions can exist as an ensemble of bound conformations without a well-defined quaternary structure (118, 119). There is a growing interest in “fuzzy” complexes and interactions, and fuzzy mechanisms bring a new level of insight and complexity into PPI research. However, complexes with transient interactions and multiple binding modes pose challenges for structural determination and are likely underrepresented in our structural data. Structural flexibility is not only observed on the MoRF side of the interaction. Vacic et al. observed varying degrees of changes in the MoRF-binding partner conformation and a correlation between more extensive interfaces and more substantial conformational changes (87). An extreme example of a MoRF-partner exhibiting massive conformational changes is the calmodulin interaction with the MoRF of GAD (120). Unlike in peptide complexes, where conformational changes are minimal, identifying pockets or grooves on the globular protein surface may therefore be less fruitful for the prediction of MoRF interfaces. Another implication of the conformational changes observed is the need to test protein interface predictors on the unbound structure of the MoRF-partner. Testing on the bound conformation, where the MoRF occupies the interface, would provide an overly optimistic estimate of prediction performance.
1.4 Hypothesis and Experimental Rationale
Much research has been done on PPIs in general, but the growing recognition of the importance of IDR-mediated interactions creates a need for analyses and computational tools focusing on this subcategory of PPIs. A commonly used dataset for training and testing protein interface predictors is the Docking Benchmark 5 (121). Docking Benchmark 5 consists of protein complexes where the unbound structures of both participants of the interaction are available (121), which implies that they are all independently folding proteins (i.e., globular proteins). Consequently, globular interfaces are the primary source of training data for general interface predictors, and these are the most sophisticated predictors available (e.g., ECLAIR and ISPRED4) (4, 57). Notably, protein interface predictors specializing in peptide interfaces do exist (62, 63), but their performance on MoRF interfaces has not been evaluated. Therefore, one of the objectives of this work is to investigate how well existing protein interface predictors perform on MoRF interfaces. Based on the distinctive properties of MoRF interfaces, we hypothesized that a protein interface predictor specific for MoRF interfaces could be created. In Chapter 2, we describe a novel prediction method named IDRBind. To our knowledge, IDRBind is the only predictor explicitly developed for MoRF interfaces, so we believed that IDRBind would outperform existing methods for this task, making it a useful addition to existing computational tools. To compare IDRBind to state-of-the-art predictors, we benchmarked both globular and peptide interface predictors on our MoRF interface dataset. To test whether IDRBind is specific for MoRF interfaces, we also benchmarked it on peptide and globular interfaces.
Chapter 3 presents a more detailed characterization of PPIs mediated by MoRFs, as well as IDRs in general, including their feature scores, interface size, and enrichment for benign and disease-causing mutations. Using the feature scores developed for IDRBind, we characterized and compared the MoRF, peptide, and globular interface classes. MoRF and peptide partner-interfaces are expected to show similarities in sequence and structural properties due to the flexibility of IDRs. The analysis of mutation enrichment will demonstrate the functional importance of the interface residues. We anticipated that the differences in binding mechanism and function of IDR-mediated PPIs would lead to differences in the distribution of disease-causing mutations and benign mutations. The most significant contrast is expected from the interface residues of IDRs given their stronger residue conservation as well as their greater interaction surface per residue. Studies of mutation localization on the broader IDRs, encompassing non-interacting residues, have shown depletion of disease mutations and enrichment of benign mutations compared to globular domains (122). However, higher levels of disease mutations have been reported in conserved short sequence motifs and predicted interaction regions of IDRs, hinting at their functional importance (123, 124). Using IDR interaction structures, we aimed to ascertain the enrichment of disease mutations in the interacting residues of IDRs and IDR-partners with comparisons to globular PPIs.
Chapter 2: Predicting Protein–Protein Interfaces that Bind Intrinsically Disordered Protein Regions
2.1 Overview
A long-standing goal in biology is the complete annotation of function and structure for all protein–protein interactions, a large fraction of which is mediated by intrinsically disordered protein regions (IDRs).
However, knowledge derived from experimental structures of such protein complexes is disproportionately small due, in part, to challenges in studying interactions of IDRs. Here, we introduce IDRBind, a computational method that combines gradient boosted trees and conditional random field models to predict binding sites of IDRs with performance approaching that of state-of-the-art globular interface predictions, making it suitable for proteome-wide applications. Although designed and trained with a focus on molecular recognition features, which are long interaction-mediating elements in IDRs, IDRBind also predicts the binding sites of short peptides more accurately than existing specialized predictors. Therefore, IDRBind accurately predicts binding sites of IDRs with a broad range of lengths, bridging the gap between globular and peptide interface predictors.
2.2 Introduction
Protein–protein interactions have a fundamental role in most biological processes and exhibit a remarkable diversity in structure. They enable the assembly of large cellular machines or transient complexes that mediate the transmission of cellular signals. Disruption of protein interactions by mutations can change cellular phenotypes and even lead to disease in multicellular organisms (125). Therefore, significant effort has been made to map and characterize the molecular aspects of all protein interactions present in cellular systems (126). Protein–protein interaction interfaces share common characteristics that make them differentiable from non-interface surfaces (49, 127). The most consistently discernible property is stronger evolutionary conservation of residues at the interface. Residue composition also differs between interface and non-interface surfaces, particularly when dividing the interface into core and rim regions (128). The rim is the outer region that remains partially solvent accessible upon binding and typically consists of polar and charged residues.
The core is the central region of the interface that is occluded from solvent upon binding and is generally hydrophobic, which is important since hydrophobic interactions are often the dominant contributors to binding affinity (129). Concordantly, the core is enriched in hotspots, which are interface residues that contribute a large fraction of the total interaction energy. Strong conservation of residues and enrichment in disease-causing mutations further highlight the importance of the core region in protein–protein interactions (130). Properties that distinguish interface from non-interface residues are exploited in numerous computational methods that predict protein interaction sites on their given structures (60). The property most commonly used by these protein interface predictors is evolutionary conservation, which is generally derived from multiple sequence alignments or, more rarely, the conservation of interface residues among structural neighbors (50). Prediction methods can also exploit the physicochemical properties that distinguish interface from non-interface residues. These properties include hydrophobicity and solvent accessibility (56, 57, 62, 63, 68, 69, 84, 131). Taking advantage of the plethora of available approaches, meta-predictors combine multiple predictors to increase prediction accuracy (70). Protein–protein interface prediction is a challenging task, but the accumulation of advances in this field has led to predictors that are highly effective for protein characterization. The majority of available protein interface predictors have been trained and tested on classical protein complexes in which all interaction partners are globular (56, 68, 84), generalizing the term globular to include all independently folding domains. However, this mode of interaction represents only a part of the currently recognized spectrum of interactions.
Research over the last two decades has revealed that a large fraction of eukaryotic proteins contain intrinsically disordered protein regions (IDRs) and that these regions can also be involved in protein interactions. A steadily growing number of identified interactions are mediated by peptide motifs, which are also referred to as short linear motifs (SLiMs) (91). Peptide motifs are typically segments up to 10 residues long that often bind to specific binding domains through conserved consensus sequence motifs (90, 102). A prominent example is certain proline-rich motifs that bind to SH3 domains (103). Some peptide motifs occur in repeats or combinations, resulting in competitive or cooperative interactions (90). In addition to the peptide motifs, much longer interaction-prone regions have been identified in IDRs (99). Molecular recognition feature (MoRF) is a broad term encompassing interaction-mediating segments in IDRs of 10 to 70 residues in length that often fold upon binding (87, 100, 107). They typically lack full-length consensus sequences, although they may contain shorter motifs. MoRFs form a diverse group with heterogeneous structures and interaction mechanisms. Unlike many peptide motifs, MoRFs can gain intricate folds upon binding, including substantial secondary structures and extended conformations that wrap around their binding partners (109). Furthermore, MoRFs often contain regions that transiently sample secondary structures prior to binding, namely, preformed structural elements (106, 132), and their interactions can involve anchoring, flanking, and linker subsegments (133). For example, the 69-residue MoRF of p27, which contains a hydrophobic anchoring region as well as a partially preformed helix, folds up and binds to a heterodimer of globular proteins Cdk2 and cyclin A (101, 134).
Interactions mediated by peptide motifs and MoRFs were initially assumed to result in complexes with well-defined quaternary structures, akin to quaternary structures of obligate complexes consisting of folded domains. However, this assumption was recently shown to be unfounded by increasing evidence of IDRs exhibiting conformational heterogeneity in both their unbound and bound states, which was termed fuzziness in protein interactions (118, 135). Although MoRFs are classically characterized as regions that fold upon binding, they often exhibit multiple binding conformations (118). A growing body of research recognizes fuzzy interaction modes and the contribution of both folded elements and transient contacts to IDR-mediated interactions (117). The dynamic nature of MoRFs and broad energy landscapes of their bound and unbound states are proposed to favor their interactions with multiple partners (136, 137), a trait exploited by hub proteins that are central to signaling networks (9). In essence, the emerging picture illustrates a continuum in the degree of conformational heterogeneity not only of individual protein structures, ranging from independently folding domains to IDRs (138), but also of protein interaction complexes, ranging from well-defined quaternary structures to highly heterogeneous complexes with no dominant binding mode (119, 139). While a comprehensive protein interaction interface predictor would be ideal, predictors have always targeted subsections of the interaction spectrum due to the diversity in attributes and the biases in our protein structure knowledgebase. The performance of predictors trained on globular protein complexes is suboptimal for IDR-mediated interactions because of the differences between the two classes of interactions.
Chief among the features that distinguish interfaces between globular domains (globular interfaces for short) from interfaces between globular domains and peptides (peptide interfaces) are the deep pockets and grooves that are prominent in the latter (62, 104). Because peptides are intrinsically flexible, the conformational entropy cost in binding is a crucial component determining binding affinity. This energetic cost is counteracted by hotspots that mainly consist of hydrophobic and aromatic residues, often constituting the conserved motifs of the binding peptides. A key component of these hydrophobic interactions is the favorable solvent entropy change resulting from the desolvation of hydrophobic surfaces. Another strategy for mitigating the entropic cost of peptide binding was proposed based on the observation of bridging water molecules in unbound peptide interfaces (104). Increasing awareness of differences between globular and peptide interfaces led to the development of specific predictors such as PeptiMap (62, 104). Similar to peptides, MoRFs tend to bind grooves on partner surfaces. However, protein interactions mediated by MoRFs are depleted in hydrogen bonds but further enriched in hydrophobic and electrostatic pairings when compared to interactions mediated by peptides (108, 109). Given the success of available interface predictors for binding sites of globular proteins and peptides, we aimed to develop a predictor for MoRF-binding sites (MoRF interfaces), which are a section of the interaction spectrum that has remained underserved. To achieve our objective of developing a MoRF interface predictor, we combined two different machine learning approaches. The gradient boosted trees method was used first to train two prediction modules to identify core and rim residues. Prediction scores from these modules were then integrated through a conditional random field (CRF) to generate the final classification labels (Figure 2.1).
On a non-redundant test dataset of complexes between MoRF and globular proteins, the final predictor achieves a Matthews correlation coefficient (MCC) of 0.31 at a sensitivity of 51% for the separation of MoRF interface and non-interface residues. Surprisingly, our predictor also achieves an MCC of 0.36 at a sensitivity of 56% on peptide interfaces. In contrast, the predictor has an MCC of only 0.2 at a sensitivity of 33% on globular interfaces. Thus, our predictor, IDRBind, identifies protein interfaces that preferentially bind both peptide motifs and MoRFs, collectively the binding sites of IDRs.
Figure 2.1: Schematic of IDRBind’s architecture. Feature scores extracted from protein structure are used by two gradient boosted trees models to calculate CorePred and RimPred scores separately. Then, the CorePred and RimPred scores are combined by a unique grid-structured conditional random field to predict the final residue labels.
2.3 Results
2.3.1 MoRF Complex Datasets for Predictor Training and Evaluation
Creating a computational tool that identifies MoRF interfaces on protein domains requires the assembly of training and testing datasets that consist exclusively of protein complexes that contain MoRFs and have minimal redundancy (see Methods). We first collected complexes for which at least one partner has experimental evidence for being intrinsically disordered prior to binding. Complexes were identified by using the IDEAL database in conjunction with a literature search (140). Removal of redundant sequences resulted in 84 complexes. Although interface predictors can be trained on bound structures (56, 131), meaning protein conformations found in the complexed form are used in training, performance evaluation requires a more stringent and realistic dataset that consists of protein structures in their unbound state. For 57 of the 84 complexes, we found unbound structures of the MoRF binding partners (MoRF partners).
Thirty of these unbound structures were randomly allocated to the test set, which we named MoRF-test, and the rest were put in MoRF-train (i.e., 54 mixed bound and unbound structures of the MoRF partners). MoRF-test contains 4107 surface residues in total, including 186 core and 532 rim residues, as defined in the Methods section. MoRF-train consists of 8863 residues, of which 427 and 1071 are core and rim, respectively. Although the resulting datasets are small, the stringent selection process allowed us to provide better estimates of the prediction performance and to form conclusions that are specific to MoRF interfaces. While this work is focused mainly on the MoRF partners, the MoRFs were characterized to clarify the types of interaction elements contained in our datasets. The MoRFs range in length from 10 to 72 residues and have a mean length of 25 residues. While they all have evidence of intrinsic disorder, some MoRFs in the two datasets transiently sample secondary structures in their unbound states (Appendix A). Furthermore, 62% of the MoRFs contain helices in their partner-bound form (141). To assess the conformational diversity (i.e., fuzziness) of individual MoRFs, we calculated the mean RMSDs between NMR models of MoRFs after alignment of the MoRF partner structures. MoRFs in our datasets display a wide distribution of mean Cα RMSDs ranging between 0.6 and 17.0 Å (Figure 2.2A). However, the primary interface residues, which we defined as MoRF residues that participate in the interaction in at least 50% of the NMR models, are not as structurally diverse as the remaining “flanking” residues. Notably, the segments containing primary interface residues of MoRFs in 18 out of 33 structures have mean Cα RMSDs below 3 Å.
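The per-MoRF mean RMSD reported here is the quadratic mean of the pairwise Cα RMSDs between NMR models after the models have been aligned on the MoRF partner. A minimal Python sketch of that summary statistic follows. It assumes that the models are already superposed on the partner and that each model is supplied as a list of Cα coordinates in the same residue order. The function names and the toy coordinates are hypothetical, and this is not the code used in this work.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    # RMSD between two equally long coordinate lists, without refitting.
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError('models must contain the same non-zero number of atoms')
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def mean_pairwise_rmsd(models):
    # Quadratic mean (root mean square) of the RMSDs over all model pairs.
    pair_values = [rmsd(a, b) for a, b in combinations(models, 2)]
    if not pair_values:
        raise ValueError('at least two models are required')
    return math.sqrt(sum(v * v for v in pair_values) / len(pair_values))


# Three toy models of a two-residue segment (coordinates in angstroms).
toy_models = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 4.0, 0.0), (0.0, 4.0, 0.0)],
]
fuzziness = mean_pairwise_rmsd(toy_models)
print(round(fuzziness, 3))
```

Restricting the coordinate lists to the primary interface residues, or to the flanking residues, would give the region-specific values discussed in the text.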
An example with relatively rigid primary interface residues is provided by the complex between XPC and p62, which is a subunit of transcription factor II H complex (TFIIH) that is recruited by XPC in the nucleotide excision repair pathway (PDB ID 2RVB (142); Figure 2.2B). In contrast, the structures of the complex between the activation domain of the herpes simplex virus protein VP16 and the human transcription initiation factor TFIIB (54) provide an example of a MoRF where models differ substantially in structure, thus resulting in high mean RMSD for the primary interface residues (Figure 2.2B). Notably, while the MoRFs themselves display a wide conformational range, the residues on the partner surface to which they associate remain largely preserved across the models, providing a firm basis for the definition of MoRF interfaces that we aim to predict.
Figure 2.2: Conformational diversity of individual MoRFs in our dataset. A) Density plot of the mean RMSDs of MoRFs. The mean RMSD of each MoRF was calculated as the quadratic mean of pairwise RMSDs between NMR models after aligning the models on the MoRF partner (see Methods). All structural alignments and RMSD calculations were done using Cα atoms. B) Two examples illustrate the mean RMSDs in different MoRF regions. TFIIH subunit p62 in complex with the MoRF of XPC (left; PDB ID 2RVB) and TFIIB in complex with the MoRF of VP16, the activation domain from herpesvirus 1 strain 17 (right; PDB ID 2PHG). The MoRFs are colored in yellow, and the MoRF partners are in gray. C) Density plot of RMSDs between bound and unbound states of the MoRF partner. For each structure, structure alignment and Cα RMSD calculations were performed over all MoRF partner residues for the black line. 39 and 35 out of 57 of these RMSDs are no higher than 3 Å and 2.5 Å, respectively. Non-interface residues were aligned, and then Cα RMSDs were calculated from the interface residues for the blue line.
38 and 37 out of 57 of these RMSDs are no higher than 3 Å and 2.5 Å, respectively.
2.3.2 Several Features Are Associated with Core and Rim Interface Residues
Separating MoRF interface residues from non-interface residues is a binary classification task. Therefore, we decided to use receiver operating characteristic curves (ROC curves) to evaluate whether features that are classically used in the identification of interface residues are also useful in the prediction of MoRF interfaces. Moreover, we used ROC curves to gauge whether these features are better at predicting all interface residues or segregated core and rim residues. We utilized structure- and sequence-based features that are either incorporated in existing interface prediction methods or deemed relevant to the prediction of MoRF interfaces. These features can be grouped broadly into six categories: residue composition, residue evolutionary conservation, residue relative accessible surface area (rASA), protein surface geometry, B-factor estimates, and electrostatics of surface residues or their local environment. The scores of these features were calculated for the surface residues of each protein in the MoRF-train set, some of them using existing methods (see Methods and Appendix B for details). The ROC curves and the corresponding areas under the curve (AUCs) of a selection of the features tested are shown in Figure 2.3A. We found that conservation feature scores separate interface and non-interface surface residues moderately well (AUC: 0.65). However, other feature scores lack the discriminative power to separate all interface from non-interface residues. Next, we segregated interface residues into core and rim and assessed the predictive power of the same features on discriminating non-interface residues from core and rim, respectively.
Consistent with the literature, conservation scores are better at separating core residues and non-interface residues than all interface and non-interface residues (Figure 2.3B). Most importantly, the other features selected are also able to separate core from non-interface residues. Two feature scores worth noting are a surface geometry feature designed to identify grooves and pockets (groove score) and rASA, a solvent-accessible surface area (SASA) score normalized per residue. Core residues tend to have higher groove scores than non-interface residues, suggesting that they are situated at the center of grooves in the protein surface. Correspondingly, core residues are likely to be less accessible than the bulk of surface residues. Some of the features correlate with core and rim residues in opposite directions, which is reflected in AUCs above and below 0.5 for core and rim (Figure 2.3B, C). For instance, rASA has a distinctive ability for differentiating core and rim residues because of the tendencies for core residues to be in grooves and rim residues to be more solvent exposed. Similarly, the B-factor estimates show that core residues are more rigid compared to both rim and non-interface residues. Overall, this analysis reveals that the predictive power of many features is amplified by separating the interface into core and rim regions.
Figure 2.3: ROC curves for interface versus non-interface classification using different individual feature scores. Non-interface residues are classified against all interface residues (A), core residues (B), and rim residues (C). Selected feature scores are shown from left to right: residue conservation score, groove score, rASA score, and B-factor estimate score. The dotted diagonal lines represent random classification, which has an AUC of 0.5. See Methods and Appendix B for detailed descriptions of the features.
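The single-feature AUCs discussed in this section can be understood as the probability that a randomly chosen positive residue (for example, a core residue) receives a higher feature score than a randomly chosen non-interface residue. The Python sketch below computes the AUC directly from that definition. The function name, scores, and labels are invented for illustration, and the analysis in this work was not necessarily performed this way.

```python
def roc_auc(scores, labels):
    # labels: 1 for the positive class (e.g., core), 0 for non-interface.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError('both classes must be present')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive residue ranked above the negative one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Invented feature scores for five surface residues.
feature = [0.9, 0.8, 0.3, 0.6, 0.2]
is_core = [1, 1, 0, 0, 1]
auc = roc_auc(feature, is_core)
# Negating the score reverses the ranking, which mirrors the AUC around 0.5.
auc_flipped = roc_auc([-s for s in feature], is_core)
print(auc, auc_flipped)
```

The mirrored value illustrates why a feature with an AUC below 0.5 is still informative: a score that is anti-correlated with a class, as rASA is with core residues, separates the classes just as well once its sign is taken into account.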
2.3.3 Modules That Predict Core and Rim Residues
Based on these findings, we decided to develop two distinct prediction modules that separate core and rim residues, respectively, from non-interface residues. To do so, we used gradient boosted trees models. The XGBoost software library (80) in R was employed to train the two classification models. Thus, one gradient boosted trees model was trained to classify core and non-interface residues, while the other was trained to classify rim and non-interface residues. All features mentioned above, as well as others (see Appendix B for the full list), were calculated for residues in the MoRF-train set and used in training. In addition, aggregated feature scores were created using surface residue patches. Surface residue patches are defined by combining each surface residue with its neighbors within a distance threshold (see Methods). For all residues in each surface patch, feature scores are aggregated (e.g., by taking averages or maximum scores) and added to the feature set of the central residue. Such aggregated features have been shown to boost predictor performance (4). From 2182 original features, 30 were chosen via feature selection. Feature selection and hyperparameters used by XGBoost were optimized for the two models separately through 20-fold cross-validation, measuring improvements in terms of AUC (see Methods). Using the optimized features and hyperparameters, the final models were retrained on the entire MoRF-train set. The resulting core and rim predictors, named CorePred and RimPred, achieved AUCs of 0.84 and 0.75 on the independent MoRF-test set (Figure 2.4, left). For comparison, we employed the same optimizations used for RimPred to train a gradient boosted trees model that separates all interface from non-interface residues. CorePred and RimPred show an advantage in performance compared to this model trained with all interface residues (Figure 2.4, right).
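The patch-based aggregation described above can be sketched in a few lines of Python. The sketch assumes one representative coordinate per surface residue and a single Euclidean distance cutoff; both are simplifications chosen for illustration, since the actual patch definition is given in Methods. The function name, coordinates, scores, and radius are hypothetical.

```python
import math


def patch_aggregates(coords, scores, radius):
    # coords: one representative (x, y, z) point per surface residue.
    # scores: one feature score per residue, in the same order as coords.
    features = []
    for centre in coords:
        # The patch holds the central residue and every residue within radius.
        patch = [scores[j] for j, other in enumerate(coords)
                 if math.dist(centre, other) <= radius]
        features.append({'mean': sum(patch) / len(patch), 'max': max(patch)})
    return features


# Three toy residues along a line; the third is isolated from the other two.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_scores = [0.2, 0.8, 0.5]
aggregated = patch_aggregates(toy_coords, toy_scores, radius=5.0)
print(aggregated)
```

Each dictionary would be appended to the feature vector of the corresponding central residue, so that a residue with a modest score of its own can still be flagged when it sits in a high-scoring neighborhood.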
The advantage of CorePred and RimPred over the single model suggests that the gains observed in the individual features upon separating the interface into two regions translate to the CorePred and RimPred models.
Figure 2.4: ROC curves for core, rim and interface residue predictions. Left, ROC curves for the classification of core (A) and rim (B) residues versus non-interface residues made by CorePred and RimPred. Right, ROC curves for the classification of core (A) and rim (B) residues versus non-interface residues made by a single model trained to classify interface versus non-interface residues.
The performance of CorePred and RimPred was also evaluated on the MoRF-train set through the leave-one-out cross-validation procedure, using the existing optimized features and hyperparameters. In more detail, variants of CorePred and RimPred models were trained on all MoRF-train complexes except one, which was used for testing. This training and testing procedure was repeated until all complexes were tested. The mean AUCs of these CorePred and RimPred models are 0.88±0.03 (95% confidence interval) and 0.74±0.04, respectively. These AUC values are close to the performance measured on MoRF-test, suggesting that the performance estimates are robust.
2.3.4 CorePred and RimPred Scores Are Combined Using CRFs
Next, we combined CorePred and RimPred outputs with a CRF model to create the final output of our new predictor that we call IDRBind. CRFs are discriminative undirected probabilistic graphical models that take “neighborhood” context into account, in contrast to more common classifiers like CorePred and RimPred that evaluate each residue individually. Our CRF can be interpreted as having two components: a scoring and an adjacency component (Figure 2.5A). The scoring component is trained to integrate residue-level CorePred and RimPred scores. In addition to these residue-level scores, the scoring component also uses the protein-average of CorePred and RimPred scores.
The incorporation of these average scores enabled better handling of small proteins, for which predictions are known to be affected by a size bias (143) (discussed below and in Methods). The adjacency component of our CRF incorporates information from neighboring residues. It enforces penalties for spatially isolated core or rim residue predictions and enhances clusters of the same class, thereby smoothing the prediction results and removing outliers. While CRFs have been applied in this field, other predictors have used the protein sequence or distance thresholds to define adjacent residues (57, 144, 145). Instead, we chose to use a network mesh generated from the molecular surface of proteins. Specifically, a quadrilateral surface mesh was generated (Figure 2.5B and Methods). Each face in the quadrilateral surface was mapped to the closest residue. Because there are many more faces than residues, each residue of the protein can be mapped to multiple faces. Adjacent faces were defined as those sharing an edge in the mesh, so each face has a maximum of four neighbors.
Figure 2.5: Schematic of IDRBind CRF (A) and the quadrilateral mesh protein surface representation (B) used in it. (A) Variables used in the CRF are represented by ovals and factors by rectangles. There are feature variables (i.e., observed variables) in gray and labeled variables (i.e., unobserved variables) in white, which are the residue nodes for which we are predicting class labels. Each labeled variable is a face in the quadrilateral mesh surface. Factors connect the variables, describing their compatibility using weights and feature functions. The scoring component of the graph model is on top, while the adjacency component is highlighted with a yellow background below. (B) Quadrilateral mesh protein surface representation used to construct the graph model. Top, triangulated surface representation of a protein. Bottom, the same surface in quadrilateral representation.
Quadrilateral faces identically colored in red, magenta, green and tan identify groups of faces that are mapped to the same protein residue.
Faces of the protein mesh surface are represented by labeled variables (nodes) in the CRF that we generated through the software library FACTORIE (146). Core, rim or non-interface labels of these variables are the prediction outputs of the CRF (Figure 2.5A) and thus of IDRBind. Each labeled variable is connected to other variables through factors. Factors contain weights that describe the labels' compatibility with other variables. The scoring component of the CRF has the observed (input) feature variables CorePred, RimPred, CorePred average, and RimPred average. The adjacency component consists of pair-factors that connect adjacent labeled variables (i.e., faces sharing an edge). The prediction output of core, rim, and non-interface labels is obtained through approximate inference, for which we use a belief propagation algorithm for maximum a posteriori estimation (see Methods).
As IDRBind predictions are categorical, we calculated well-established performance metrics instead of ROC curves to evaluate IDRBind's performance on MoRF-test (Table 2.1). In separating non-interface from interface residues, the latter being the union of the rim and core classes, IDRBind achieves an MCC (147) of 0.31 at a sensitivity (true-positive rate, or TPR) of 51%. While MCC is a general indicator of predictive power, sensitivity measures the fraction of the true interface that the predictor identifies.
To assess whether the CRF improved predictions, we compared IDRBind's predictions with the combined predictions of CorePred and RimPred. A naïve but effective way to combine CorePred and RimPred is to take the larger of the two scores for each residue. Table 2.2 shows the performance of this naïve model at varying thresholds. It reveals that for any threshold and associated TPR, the MCC of this naïve model is well below that of IDRBind on MoRF-test.
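The naïve combination and the two ways of summarizing MCC (over all residues, and as a mean over individual structures) can be sketched as follows. This is a minimal illustration in Python and not the code used for IDRBind: the function names, the toy scores, and the 0.25 threshold are assumptions made for the example. Structures for which MCC is undefined are skipped when averaging, which is an assumption in the spirit of the NA count reported in Table 2.1.

```python
import math


def mcc(tp, fp, tn, fn):
    '''Matthews correlation coefficient, or None when it is undefined.'''
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / math.sqrt(denom)


def naive_labels(core_scores, rim_scores, threshold):
    '''Call a residue interface when the larger of its two scores reaches the threshold.'''
    return [max(c, r) >= threshold for c, r in zip(core_scores, rim_scores)]


def confusion(predicted, truth):
    '''Return (tp, fp, tn, fn) for two equally long lists of booleans.'''
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    return tp, fp, tn, fn


def evaluate(structures, threshold):
    '''Return (MCC over all residues, mean of per-structure MCCs).'''
    totals = [0, 0, 0, 0]
    per_structure = []
    for core, rim, truth in structures:
        counts = confusion(naive_labels(core, rim, threshold), truth)
        totals = [a + b for a, b in zip(totals, counts)]
        value = mcc(*counts)
        if value is not None:  # skip structures where MCC cannot be calculated
            per_structure.append(value)
    mean = sum(per_structure) / len(per_structure) if per_structure else None
    return mcc(*totals), mean


# Toy data (hypothetical): (CorePred scores, RimPred scores, true interface flags).
toy = [
    ([0.90, 0.10, 0.05, 0.30], [0.20, 0.40, 0.05, 0.10], [True, True, False, False]),
    ([0.05, 0.60, 0.10, 0.02], [0.10, 0.10, 0.30, 0.02], [False, True, False, False]),
]
overall, mean = evaluate(toy, threshold=0.25)
print(round(overall, 3), round(mean, 3))  # prints 0.6 0.577
```

Because the overall value pools residues from all structures, larger proteins weigh more heavily in it, whereas the mean gives every structure the same weight.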
Next, we analyzed the effect of protein size on predictions made by IDRBind and the naïve model. As mentioned before, size biases can negatively affect predictions. This issue is particularly important for small proteins that get over-scored by many interface predictors, meaning interfaces are predicted to be too large. This size bias can be eliminated by ranking scores for each protein individually (143), but IDRBind's categorical output cannot be ranked. Consequently, we calculated MCCs for each MoRF-test protein individually and then averaged over the full set. While the mean MCC of IDRBind of 0.31 is identical to the MCC calculated using all interface and non-interface residues in MoRF-test (Table 2.1 and Table 2.3 for IDRBind performance on individual structures), the mean MCC of the naïve model drops for the majority of the thresholds tested (Table 2.2). In comparison, the scoring component of the CRF in isolation offers little improvement over the naïve model at comparable decision thresholds, that is, thresholds where the naïve model achieves sensitivity around 0.5 to 0.6, but the higher mean MCC provided by the scoring component relative to the naïve model is notable (Table 2.4). These observations suggest that the scoring component contributes to mitigating the size bias, while the adjacency component provides most of the performance gain over the CorePred and RimPred models.
Table 2.1: Performance evaluation of interface predictors on MoRF-test dataset.
Predictor       TPR   FPR   precision  specificity  accuracy  F1    MCC   meanMCC  SD Error  NA count  n
IDRBind         0.51  0.17  0.39       0.83         0.77      0.44  0.31  0.31     0.04      0         30
ACCLUSTER3 (a)  0.58  0.23  0.34       0.77         0.73      0.42  0.28  0.24     0.04      0         30
PeptiMap        0.32  0.13  0.34       0.87         0.78      0.33  0.20  0.18     0.05      0         30
ISMBLab         0.48  0.26  0.28       0.74         0.70      0.35  0.18  0.22     0.04      1         30
VORFFIP         0.78  0.51  0.24       0.49         0.54      0.36  0.20  0.15     0.03      1         30
CPORT           0.35  0.18  0.28       0.82         0.74      0.31  0.15  0.14     0.03      0         30
cons-PPISP      0.24  0.12  0.29       0.88         0.77      0.26  0.12  0.11     0.04      0         30
PredUs2-SVM     0.41  0.18  0.32       0.82         0.75      0.36  0.21  0.20     0.04      1         28
PrISE           0.45  0.31  0.23       0.69         0.65      0.30  0.11  0.07     0.03      0         30
ISPRED4         0.46  0.28  0.25       0.72         0.67      0.32  0.14  0.05     0.03      0         30
BindML          0.70  0.55  0.21       0.45         0.49      0.32  0.12  0.12     0.04      1         27
GHECOM          0.30  0.17  0.26       0.83         0.74      0.28  0.12  0.13     0.06      2         30
Fpocket         0.37  0.21  0.27       0.79         0.72      0.31  0.14  0.16     0.05      0         30
TPR: true positive rate (i.e., sensitivity); FPR: false positive rate; MCC: Matthews correlation coefficient; meanMCC: mean of MCC calculated per structure; SD Error: standard error of meanMCC; NA count: number of structures lacking either positive or negative predictions, such that MCC cannot be calculated; n: number of structures processed.
(a) Number following ACCLUSTER denotes the number of top prediction clusters evaluated.
Table 2.2: A model that combines CorePred and RimPred scores by taking the larger of the two for each residue, evaluated on MoRF-test dataset at varying thresholds.
Threshold  TPR (overall)  TPR (mean)  MCC (overall)  MCC (mean)
0.1        0.79           0.83        0.25           0.22
0.125      0.69           0.74        0.26           0.24
0.15       0.63           0.67        0.26           0.24
0.175      0.56           0.60        0.26           0.25
0.2        0.51           0.56        0.25           0.24
0.225      0.45           0.50        0.26           0.25
0.25       0.41           0.46        0.26           0.25
0.275      0.37           0.42        0.26           0.26
0.3        0.35           0.40        0.26           0.26
0.35       0.29           0.33        0.26           0.26
Overall: measured over all residues. Mean: mean of performance measured over individual structures.
Table 2.3: IDRBind prediction performance on individual structures of MoRF-test.
Structure     TPR   FPR   Precision  Specificity  Accuracy  F1    MCC
2GWJ          0.36  0.06  0.38       0.94         0.88      0.37  0.31
1I6C          1.00  0.30  0.47       0.70         0.76      0.64  0.57
2KC9          0.34  0.16  0.63       0.84         0.62      0.44  0.21
1K04          0.73  0.30  0.26       0.70         0.70      0.38  0.30
1PFJ          0.40  0.31  0.32       0.69         0.61      0.36  0.09
2MW5          0.29  0.14  0.40       0.86         0.72      0.33  0.17
1M2O          0.31  0.11  0.22       0.89         0.83      0.25  0.17
1RRF          0.77  0.16  0.38       0.84         0.83      0.51  0.46
2CR7          0.94  0.45  0.42       0.55         0.65      0.58  0.44
1TFB          0.36  0.11  0.38       0.89         0.81      0.37  0.26
2LD1          0.38  0.38  0.26       0.62         0.56      0.31  0.01
3O8Z          0.72  0.26  0.42       0.74         0.74      0.53  0.39
4IPC          0.77  0.32  0.28       0.68         0.69      0.41  0.32
2X7Z          0.77  0.18  0.65       0.82         0.81      0.71  0.57
2F5J          0.73  0.12  0.71       0.88         0.84      0.72  0.60
1FQK          0.65  0.11  0.30       0.89         0.88      0.41  0.38
2KPK          0.79  0.14  0.63       0.86         0.84      0.70  0.60
1G1Q          0.00  0.12  0.00       0.88         0.70      0.00  -0.17
2KLY          0.13  0.19  0.17       0.81         0.65      0.15  -0.07
1Y5O          0.69  0.15  0.46       0.85         0.82      0.55  0.46
2M52          0.73  0.38  0.29       0.63         0.64      0.42  0.27
3B7B          0.68  0.15  0.37       0.85         0.83      0.48  0.41
1MUZ          0.70  0.19  0.61       0.81         0.78      0.65  0.49
1QGV          0.44  0.14  0.39       0.86         0.79      0.41  0.28
1GCJ          0.67  0.20  0.30       0.80         0.78      0.41  0.34
1PCF          0.42  0.24  0.41       0.76         0.66      0.41  0.18
1KHX          0.28  0.16  0.41       0.84         0.69      0.33  0.14
1MM2          0.94  0.39  0.59       0.61         0.73      0.72  0.55
3AVS          0.47  0.09  0.46       0.91         0.85      0.46  0.37
4Z0O          0.39  0.16  0.58       0.84         0.67      0.47  0.26
All residues  0.51  0.17  0.39       0.83         0.77      0.44  0.31
Table 2.4: The performance of the scoring component of the CRF on MoRF-test dataset.
Threshold  TPR (overall)  TPR (mean)  MCC (overall)  MCC (mean)
Binary     0.58           0.62        0.27           0.27
Overall: measured over all residues. Mean: mean of performance measured over individual structures.
2.3.5 IDRBind Outcompetes Existing Predictors in the Identification of MoRF-binding Sites
We compared the performance of IDRBind on MoRF-test with that of various predictors of protein interfaces. Many more predictors have been developed than we can benchmark, so we selected representatives that allow for informative comparisons.
The general interface predictors are trained predominantly on complexes between globular proteins (globular interface). Of this group, ISPRED4 (57), cons-PPISP (56), BindML (131), and CPORT (70) represent state-of-the-art as well as popular predictors that are often used for performance comparisons. PredUs (84) and PrISE (67) were chosen because they belong to a special class of predictors that use structural homology. We also tested predictors of peptide-binding sites on globular proteins (peptide interface), namely ACCLUSTER (63), PeptiMap (62), ISMBLab (peptide version) (68), and VORFFIP (i.e., peptide prediction from Multi-VORFFIP (69)). Lastly, because binding pockets are a distinctive feature of peptide binding sites, the pocket-finding methods GHECOM (148) and Fpocket (53) were also included.
The performance measures used here are threshold dependent. Therefore, the threshold for each predictor was selected to maximize its MCC on MoRF-test. IDRBind's MCC of 0.31 is the highest of all predictors tested on the classification of interface versus non-interface residues in MoRF-test (Table 2.1). Predictors with the closest performance to IDRBind are peptide interface predictors. IDRBind also ranks the highest in terms of precision, followed by ACCLUSTER and PeptiMap. High precision is significant because it indicates the fraction of predicted interface residues that genuinely belong to an interface, which can be interpreted as the usefulness of the predicted interface. Comparison of the mean MCCs gives IDRBind an even greater lead. Most importantly, the highest precision of the compared tools in conjunction with a sensitivity of 51% indicates that IDRBind produces conservative predictions with adequate coverage of the true interfaces, which makes it practical for MoRF interface predictions.
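Selecting a decision threshold by MCC, as described above for the score-based predictors, amounts to a simple sweep over candidate thresholds. The sketch below is illustrative only: the candidate thresholds, toy scores, and function names are assumptions, and it is not the benchmarking pipeline used in this chapter.

```python
import math


def metrics_at(scores, truth, threshold):
    '''TPR, precision and MCC when residues scoring >= threshold are called interface.'''
    tp = fp = tn = fn = 0
    for score, is_interface in zip(scores, truth):
        predicted = score >= threshold
        if predicted and is_interface:
            tp += 1
        elif predicted:
            fp += 1
        elif is_interface:
            fn += 1
        else:
            tn += 1
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return {
        'threshold': threshold,
        'tpr': tp / (tp + fn) if tp + fn else None,
        'precision': tp / (tp + fp) if tp + fp else None,
        'mcc': (tp * tn - fp * fn) / math.sqrt(denom) if denom else None,
    }


def best_by_mcc(scores, truth, candidates):
    '''Return the metrics row of the candidate threshold with the highest defined MCC.'''
    rows = [metrics_at(scores, truth, c) for c in candidates]
    rows = [row for row in rows if row['mcc'] is not None]
    return max(rows, key=lambda row: row['mcc'])


# Toy residue scores and true interface flags (hypothetical values).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
truth = [True, True, False, True, False, False]
best = best_by_mcc(scores, truth, candidates=[0.2, 0.5, 0.7])
print(best['threshold'], round(best['mcc'], 3))  # prints 0.7 0.707
```

In this toy case the MCC-optimal threshold yields a precision of 1.0 at a TPR of 2/3, which illustrates how an MCC-selected operating point can trade some sensitivity for precision.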
2.3.6 IDRBind Accurately Predicts the Interface of Well-Known MoRF Interaction Partners
Several MoRF-test proteins are well-known proteins that play key roles in regulatory pathways of eukaryotic cells. We highlight some individual examples here to make the discussion of IDRBind's output more tangible. MRG15 is a transcription factor with various roles, including regulating the cell cycle. It associates with an intrinsically disordered region (IDR) of the protein Pf1, forming part of the Rpd3S/Sin3S corepressor complex (149, 150). The interaction involves two distinct hydrophobic sites on MRG15 that are both predicted accurately by IDRBind (Figure 2.6A).
Figure 2.6: Prediction results of select proteins mapped on their structures. Molecular surfaces of MRG15 (A), importin-beta (B), Pin1 (C), and Bcl-2 (D) are colored by IDRBind prediction labels for core, rim and non-interface in red, blue and gray, respectively. In panels E and F, Bcl-2's surface is colored by the core and rim scores provided by CorePred (E) and RimPred (F). The prediction labels (IDRBind), scores (CorePred or RimPred) and the protein surface are from the unbound structure. To visualize consistency between predictions and actual MoRF interfaces, MoRFs were placed on the unbound structures by aligning only the globular domains of the bound and unbound structures. The PDB IDs of the unbound and bound structures, respectively, are 2F5J and 2LKM (A), 1GCJ and 1M5N (B), 1I6C and 1I8H (C), and 1GJH and 5VAY (D–F).
Importin-β is another protein found in MoRF-test. Critical to the transport of macromolecules between nucleus and cytoplasm, it interacts with cargo as well as adaptor proteins, including both MoRFs and globular proteins. In this case, the MoRF complex is between the human importin-β and the parathyroid hormone-related protein (PTHrP) it transports (151, 152).
The prediction was made using the unbound structure of mouse importin-β, which shares 98% sequence identity with the aligned region of the human protein. The concave surface at the N-terminus of importin-β that associates with various binding partners was also correctly identified by IDRBind (Figure 2.6B).
An example of a small MoRF partner protein that could be detrimentally affected by over-scoring due to size bias is the WW domain of Pin1, which interacts with tau as well as Cdc25 in phosphorylation-dependent interactions that act as regulatory switches (153). The WW domain consists of only 39 residues and has a flat binding site, increasing the prediction difficulty. However, IDRBind can correctly identify some of the non-interface residues despite their scarcity (Figure 2.6C).
Taking an example from outside MoRF-test, we tested IDRBind on Bcl-2, a key regulator of apoptotic processes whose dysfunction is associated with multiple diseases, including cancer. The IDRBind prediction correctly identifies the interface of Bcl-2 that binds to the MoRF BH3 domain of Beclin-1 (PDB ID: 1GJH (154); Figure 2.6D). Figure 2.6E demonstrates CorePred's high specificity for grooves, practically excluding all protruding surfaces. RimPred prefers convex and solvent-exposed residues but produces more false positives than CorePred (Figure 2.6F; for comparison of core and rim residue curvature scores, see also Figure 2.7). However, the CRF integrates CorePred and RimPred scores and reduces false positives by suppressing residues that have relatively high RimPred scores but are isolated from the main interface patch (Figure 2.6D).
Figure 2.7: Curvature scores of rim and core in MoRF-train. Left, ROC curves for the classification of core (top) and rim (bottom) residues versus non-interface residues made by curvature Z-scores. Right, ROC curves for the classification of core (top) and rim (bottom) residues versus non-interface residues made by curvature scores.
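The suppression of isolated calls depends on the adjacency defined on the quadrilateral mesh, where faces that share an edge are neighbors and each face is mapped to a residue. The sketch below builds that adjacency for a toy mesh and then applies a hard rule that resets interface calls lacking any interface-labelled neighbor. The hard rule is a deliberately simplified stand-in chosen for illustration: in IDRBind the smoothing arises from learned pair-factor weights and belief propagation over the mesh faces, not from this rule. All names and the toy mesh are hypothetical.

```python
from collections import defaultdict


def face_adjacency(faces):
    '''Map each face index to the set of faces sharing an edge with it.

    faces: list of 4-tuples of vertex indices, listed in cyclic order.
    '''
    edge_to_faces = defaultdict(list)
    for index, quad in enumerate(faces):
        for k in range(4):
            edge = frozenset((quad[k], quad[(k + 1) % 4]))
            edge_to_faces[edge].append(index)
    neighbors = defaultdict(set)
    for sharing in edge_to_faces.values():
        for a in sharing:
            for b in sharing:
                if a != b:
                    neighbors[a].add(b)
    return neighbors


def residue_adjacency(faces, face_to_residue):
    '''Treat two residues as adjacent when any of their faces share an edge.'''
    adjacent = defaultdict(set)
    for face, others in face_adjacency(faces).items():
        for other in others:
            first, second = face_to_residue[face], face_to_residue[other]
            if first != second:
                adjacent[first].add(second)
    return adjacent


def drop_isolated(labels, adjacent):
    '''Simplified stand-in for CRF smoothing: reset interface calls with no interface neighbor.'''
    cleaned = dict(labels)
    for residue, label in labels.items():
        if label == 'non-interface':
            continue
        has_support = any(labels[n] != 'non-interface' for n in adjacent.get(residue, ()))
        if not has_support:
            cleaned[residue] = 'non-interface'
    return cleaned


# Toy strip of five quadrilateral faces; residue R1 owns two faces.
faces = [(0, 1, 7, 6), (1, 2, 8, 7), (2, 3, 9, 8), (3, 4, 10, 9), (4, 5, 11, 10)]
face_to_residue = ['R1', 'R1', 'R2', 'R3', 'R4']
labels = {'R1': 'core', 'R2': 'rim', 'R3': 'non-interface', 'R4': 'rim'}
adjacent = residue_adjacency(faces, face_to_residue)
cleaned = drop_isolated(labels, adjacent)
print(cleaned)
```

Here the rim call on R4 is reset because its only neighbor, R3, is non-interface, while the adjacent core and rim calls on R1 and R2 support each other and are kept.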
Since MoRFs can exhibit significant structural diversity in their bound states, we specifically selected a MoRF that has diverse bound structures. The transcription activator–coactivator interaction between MED15–ABD1 and the MoRF of Gcn4–cAD constitutes a fuzzy complex (155). Figure 2.8 shows the successful IDRBind prediction made on the bound structure of MED15–ABD1. While prediction on the bound structure may not reflect true performance, there are reasons to expect positive results. The interaction involves prominent electrostatics, a shallow groove, and hydrophobic patches, all of which are typical of MoRF interactions.
Figure 2.8: IDRBind prediction result for the MoRF interface of MED15’s activator-binding domain 1 (ABD1). The prediction labels and figure were generated from the MoRF-bound model PDB ID 2LPB (156). Labels provided by the IDRBind prediction are colored as in Figure 2.6. All NMR models of Gcn4 binding to the MED15 ABD1 are shown as yellow ribbons.
Because IDRBind relies on structural information, accurate predictions could be hindered by large conformational changes that occur in the partner upon binding, and calmodulin is a particularly challenging example of this problem. Calmodulin is a calcium-binding protein that is involved in many regulatory processes. The interaction between calcium-loaded calmodulin and GAD leads to substantial conformational changes in calmodulin. IDRBind identifies the part of the interface on the C-terminal domain of calmodulin but misses the region on the N-terminal domain (PDB ID 1DMO, 1NWD; Figure 2.9) (120, 157). As illustrated by this example, large conformational changes can distort the interface region, leaving some interface residues spatially isolated or buried in the unbound structure, thus making predictions more difficult. Calmodulin is certainly an extreme case compared to the other MoRF-test and MoRF-train structures.
While calmodulin interface residues have a Cα RMSD of 14 Å between the bound and unbound state, 67% of the interfaces on MoRF partners in our sets have an RMSD of less than 3 Å between their bound and unbound states (Figure 2.2C). Certain conformational changes are acceptable for IDRBind prediction as evidenced by the calmodulin example and the evaluation of MoRF-test partners that show substantial conformational changes (Appendix A and Table 2.3).
Figure 2.9: IDRBind prediction for calmodulin. Unbound structures of calmodulin (PDB ID: 1DMO) with molecular surfaces colored by IDRBind prediction labels (A) and true interface labels (B). The true interface labels are defined based on the complex between calmodulin and GAD (PDB ID: 1NWD). C) The complex between calmodulin and GAD with calmodulin colored in gray and GAD colored in yellow. For all three figures, the calmodulin N-terminal domain is on the left, and the C-terminal domain is on the right.
2.3.7 IDRBind Identifies Peptide but Not Globular Interfaces
Our finding that predictors developed for peptide interfaces come closest in their performance to IDRBind on MoRF-test motivated us to test IDRBind on peptide interfaces, specifically, the peptide interface dataset (PEP) used in the development of the peptide interface predictor PeptiMap (62). IDRBind achieves an MCC and mean MCC of 0.36 and 0.38, respectively, on this peptide interface dataset (Table 2.5). These MCCs are the highest of the tested predictors, except for the peptide-binding model of VORFFIP, which was trained on some complexes included in the PEP. When VORFFIP was evaluated on the subset of PEP that it was not trained on, its performance was estimated to be substantially lower (VORFFIP-NR in Table 2.5).
Table 2.5: Performance measures on PEP.
Predictor       TPR   FPR   precision  specificity  accuracy  F1    MCC   meanMCC  SD Error  NA count  n
IDRBind         0.56  0.13  0.38       0.87         0.83      0.45  0.36  0.38     0.05      0         25
ACCLUSTER1      0.42  0.10  0.38       0.90         0.84      0.40  0.31  0.31     0.09      0         25
ACCLUSTER3      0.63  0.26  0.26       0.74         0.73      0.37  0.26  0.26     0.05      0         25
ISMBLab         0.61  0.22  0.29       0.78         0.76      0.39  0.30  0.30     0.05      0         25
VORFFIP         0.57  0.09  0.48       0.91         0.87      0.52  0.44  0.45     0.07      1         25
VORFFIP-NR      0.32  0.06  0.39       0.94         0.87      0.35  0.28  0.23     0.08      0         11
CPORT           0.48  0.20  0.26       0.80         0.76      0.33  0.22  0.19     0.04      0         25
VORFFIP-NR reports the performance of VORFFIP on a subset of PEP that excludes peptide partners with greater than 95% sequence identity and 90% aligned length with their training data (see Methods).
To complete the comparison, we also evaluated IDRBind on globular interfaces. Docking Benchmark 5 (121) is a popular benchmark for docking and is dominated by complexes between globular proteins, so we used it as our globular interface dataset, named DB5 (we customized DB5 for our needs; see Methods). IDRBind achieves an MCC of 0.20 on DB5 (Table 2.6). Cons-PPISP and PrISE, globular interface predictors that were chosen for their facilities for processing large datasets, achieve higher MCCs than IDRBind on DB5. Notably, IDRBind's sensitivity of 0.33 is especially low when compared to its performance on MoRF-test and PEP. However, this sensitivity does mean that a small portion of globular interface residues is positively predicted by IDRBind, which is well illustrated by the interface prediction made for the cytokine interleukin-2. Interleukin-2's interfaces with its globular receptor proteins coincide with IDRBind's two relatively small predicted interface patches, one of which contains a hotspot that is known to be susceptible to small-molecule binding (Figure 2.10; PDB ID 1M4C, 2B5I, 1PY2 (158–160)).
Overall, this analysis demonstrates that IDRBind performs well not only on MoRF but also on peptide interfaces while having a much lower MCC on globular interfaces, meaning IDRBind exhibits specificity for IDR interfaces.
Figure 2.10: IDRBind prediction for interleukin-2. Interleukin-2 is a protein involved in T cell activation that has a globular interface shown to be susceptible to disruption by small molecule inhibitors. A, B) The complex between interleukin-2 and three receptor protein chains: common cytokine receptor gamma, interleukin-2 receptor alpha, interleukin-2 receptor beta (PDB ID: 2B5I). The complex structure is shown in two different orientations in A and B to highlight two predicted interface patches. Molecular surfaces of interleukin-2 are colored by IDRBind prediction labels. These predictions partially overlap with the globular interfaces. One interpretation of this finding is that this globular interface resembles MoRF interfaces too closely to be distinguished. Interestingly, one of the predicted interfaces overlaps with a pocket known to bind small molecule inhibitors. The interaction between interleukin-2 and the small-molecule inhibitor (PDB ID: 1PY2) is shown in C. Predictions for interleukin-2 were made on an unbound structure (PDB ID: 1M4C), and the binding partners of interleukin-2 were superimposed by aligning interleukin-2 structures.
Table 2.6: Performance measures on DB5.
Predictor       TPR   FPR   precision  specificity  accuracy  F1    MCC   meanMCC  SD Error  NA count  n
IDRBind         0.33  0.13  0.31       0.87         0.79      0.32  0.20  0.19     0.013     3         339
cons-PPISP      0.28  0.07  0.39       0.93         0.83      0.33  0.24  0.20     0.016     5         337
PrISE           0.48  0.20  0.28       0.80         0.76      0.36  0.23  0.26     0.012     0         331
See Table 2.1 for more details.
2.4 Discussion
The accurate prediction of protein interface residues is a key challenge in computational biology that pushes the boundaries of our understanding of protein–protein interactions and machine learning.
Numerous methods have been developed that predict interaction interfaces, and the majority are intended for general interface prediction, which is dominated by globular interfaces (48, 49, 56, 58, 59, 67, 70, 131, 144). More recently, the need for more specialized prediction was recognized, leading to peptide interface predictors (62, 63, 68, 69). Here, we introduced IDRBind, a structure-based prediction method designed for identifying protein surface residues with the potential to interact with IDRs, with an emphasis on MoRFs, filling the gap between peptide and globular interface prediction.
A differentiating factor of IDRBind is its architecture. In the first step, the two prediction modules CorePred and RimPred predict core and rim interface residues based on their distinct characteristics. In the second step, the CorePred and RimPred scores are combined by a unique grid-structured CRF that considers the compatibility of neighboring residue predictions and favors the output of interface patches. This architecture solves several challenges we identified for predicting interface residues.
One of the challenges is in fully utilizing the distinctiveness of core and rim residues despite the overlap between the two regions. The core region is enriched in conserved residues critical for the interaction and is distinct in multiple features, most notably surface geometries such as grooves and curvatures that are specific to binding sites of IDRs, including both MoRFs and peptides. Furthermore, core residues are fewer in number and clustered in a narrow region. Accordingly, CorePred predicts fewer residues but identifies core residues with higher accuracy measured at performance-optimizing thresholds. Analogous to a gasket that keeps water out, rim residues surround the core and are more often polar or charged.
Rim residues are spatially close to non-interface residues, and many of them are likely not functionally critical, which is reflected in their weak correlation with conservation scores. The weakly discernible characteristics of individual rim residues lead RimPred to rely more on the surface patch feature scores and produce noisier predictions compared to CorePred. Despite the differences, the two regions overlap since they are defined based on artificial rASA thresholds that do not directly correspond to the other physicochemical differences between the residue classes (see Methods). The goal is to use the characteristics of core and rim to predict the interface and not to classify core versus rim. To that end, the usage of two models allows explicit optimization on core and rim independently while forgoing penalties for misclassification between them during training.
In the subsequent step, the CRF handles the challenge of integrating CorePred and RimPred while adjusting for large-scale dependencies between residues. First, because the ratio of interface to non-interface residues is disproportionately large in small proteins, protein interface predictors tend to over-score smaller proteins, leading to the size bias. This problem could be exacerbated by the graph model during inference (prediction) as the long-distance coupling between labeled variables in these over-scored proteins drives all labels to the interface classes (161). By applying a penalty to proteins with high average CorePred and RimPred scores, we increased the mean MCC value on MoRF-test, which suggests improved handling of the size bias. Second, gradient boosted trees, support-vector machines and random forests are more commonly used and well-proven methods, but they account for spatial relationships only weakly, through features that include local neighborhood properties, and do not generate the interface clusters that are desired.
CRFs can better handle the spatial dependencies of protein residue classification, and there is precedent for this use of CRFs (57, 144, 145). ISPRED4 uses a CRF for this purpose, and the authors noted an increase in precision, which is consistent with our observations for IDRBind. This increase in precision likely comes from removing outlying residues that score relatively high on CorePred or RimPred but do not have adjacent interface residues. The use of the CRF leads to a noticeable improvement in performance. Notably, the improvement is achieved despite the fact that some neighborhood properties are already integrated into CorePred and RimPred, demonstrating the high effectiveness of the CRF.
Thanks to this architecture, IDRBind achieves a high MCC of 0.31 on the MoRF-test set, which is comparable to the MCCs of general predictors on globular interfaces. Specifically, compared with two routinely benchmarked predictors, IDRBind's MCC surpasses or is comparable to the MCCs that cons-PPISP and PredUs achieve on globular datasets (57, 162). For more perspective, PredUs's performance always closely follows the top predictor in benchmarks showcasing state-of-the-art general predictors (67, 68). This implies that the usability of MoRF interface predictions approaches that of predictions made for globular interfaces. This is also true regarding peptide interfaces.
However, benchmarking results should always be interpreted with care. Comparing between predictors is inherently hard due to differences in evaluation methodology as well as unknown overlaps between training and testing datasets (50). A striking example of the effect of overlapping datasets is presented in Table 2.5, where the exclusion of PEP peptide interfaces that VORFFIP was exposed to during training reduces its estimated performance by a large margin.
Further complicating the matter is the incomplete annotation of interfaces due to either the criteria used in defining interfaces or incomplete structural data.
Inadequacies aside, IDRBind outcompetes existing predictors on MoRF-test. Specifically, IDRBind achieves the highest MCC on this set. Comparison of MCCs reveals that peptide interface predictors such as ACCLUSTER and PeptiMap are the most comparable to IDRBind. Just as IDRBind was created for MoRFs yet performs well on peptide interfaces, these peptide interface predictors show that the same is true in reverse. Based on the similarity between MoRF and peptide interface predictors, we can infer that the two subcategories of IDR share similarities in binding-site preferences. Furthermore, peptide interfaces are typically easier to predict because they tend to be more pronounced binding pockets. MoRFs are longer than peptides, allowing MoRFs to adopt more complex folded conformations and resulting in more variation in interface size and morphology. MoRF interfaces span a broad range in sizes, and the larger interaction surfaces bring the characteristics of MoRF interfaces closer to globular interfaces. An attribute that correlates with interface size is the magnitude of binding-induced conformational change in globular complexes (87, 163). Concordantly, the relatively small peptide interfaces have been demonstrated to be essentially unchanged upon binding (104), in contrast with the range in RMSDs between the bound and unbound structures of MoRF partners that we found. Conformational changes most directly affect geometry-based features such as the surface groove score, which are key indicators of IDR interfaces (Figure 2.11). These factors increase the difficulty of predicting MoRF interfaces compared to peptides and likely prompted IDRBind to have a higher sensitivity to these features. The result is a predictor that can handle a broader range of IDR lengths compared to peptide interface predictors.
Figure 2.11: A) CorePred and B) RimPred importance plot for selected features calculated by XGBoost. The contribution (x-axis) of each feature (y-axis) used by the two predictors is reported. The features that make the most significant contributions to CorePred are the groove and conservation scores. The importance of groove and conservation scores for discriminating core residues is in agreement with our ROC curve analysis (Figure 2.3). The core region is also enriched in hydrophobic residues, which explains why a hydrophobicity feature score (PCA1_13) is ranked third. In contrast, none of the features used by RimPred reach an importance score as high as the three most important features used by CorePred, meaning no feature stands out as more essential than the rest. Electrostatic potential’s top rank might reflect the importance of polar and charged pairings in MoRF interactions. rASA is a distinguishing feature of rim residues, as shown by the ROC curves, thus explaining rASA’s second position. Rim residues often protrude more than both core and non-interface residues, forming the ridge around the binding groove, making rASA a key feature for RimPred.
Given the prevalence of IDR-mediated interactions in complex biological pathways, IDRBind could be useful for high throughput analysis. With the accumulating protein sequence and structural data, computational analysis is fast becoming a key method for understanding biological pathways. For example, Interactome INSIDER is a tool that integrates protein–protein interactions, genetic variation, and structural information from both experiments and prediction (4). However, given an estimated 122,000 MoRFs in the human proteome (90), a specialized method like IDRBind could be useful for filling in the gaps from experiments and other prediction methods.
There is eagerness in the community for mapping protein–protein interaction interfaces as well as modeling their structures, and for IDR-mediated interactions even more so due to their potential as interesting drug targets. Protein–protein interfaces are considered more difficult to target than traditional small-molecule binding sites because protein interfaces are relatively flat and much larger (164). It is likely easier to design drugs to disrupt MoRF and peptide interfaces, which have grooves and pockets. In fact, the MoRF-binding site of Bcl-2 is the target of drugs under development (Figure 2.12) (165).
Figure 2.12: IDRBind prediction on the unbound structure of Bcl-2. The unbound structure of Bcl-2 (PDB ID 1GJH) is superimposed with a Navitoclax analog by aligning with the drug-protein complex structure (PDB ID 4MAN) (154, 165). The molecular surface is colored to reflect IDRBind prediction (core = red, rim = blue, non-interface = gray).
In summary, IDRBind is a structure-based prediction method designed for identifying protein surface residues with the potential to interact with MoRFs. Because MoRF and peptide interfaces share characteristic features that our predictor exploits, IDRBind is a leading predictor of IDR interfaces spanning both MoRFs and peptides. To facilitate the usage of the new method, IDRBind predictions are available through a web server at https://idrbind.msl.ubc.ca/ (Appendix C).
2.5 Methods
2.5.1 Construction of Protein Complex Datasets
MoRF complex datasets were built using the IDEAL database and a literature search (140). MoRF complexes comprise a MoRF interacting with one or more folded domains (i.e., globular proteins), except for interactions between two MoRFs. A folded protein segment mediating an interaction can be classified as a MoRF if there is experimental evidence that it is intrinsically disordered, meaning it does not have one well-defined tertiary structure in its unbound state.
The July 2016 release of the IDEAL database was used, which provided 183 MoRF complex structures. When combined with our literature search, this resulted in 229 MoRF complexes. Complexes with redundant MoRF partners were removed using a length-dependent sequence identity threshold defined by Rost with parameter n of zero (166). We identified unbound structures of MoRF partners through a BLAST search of the PDB database, excluding structures with DNA and RNA molecules. We used NCBI's blast2 program to search for proteins with 95% or higher identity and 90% or higher aligned length (167). The quality of the structures was checked manually, excluding antibodies and some unbound structures with small molecules blocking the MoRF interface, resulting in 84 non-redundant MoRF partners, 57 of which have unbound structures. Thirty unbound structures were selected randomly for MoRF-test, and the remainder were placed into MoRF-train.

Globular (DB5) and peptide (PEP) protein datasets are based on established datasets of bound and unbound structures that have already been used to benchmark the performance of multiple existing predictors (104, 158). Docking Benchmark 5 was modified by removing antibody complexes, complexes where the bound and unbound protein sequences were not well-aligned, and complexes where the interface in the unbound structure was obstructed. The resulting DB5 has 360 interacting domains, with some domains appearing more than once but with unique interactions. Importantly, because our pipeline for benchmarking MoRF and peptide interfaces focuses on one specific interface for each complex, as opposed to accounting for all general interfaces given a multimeric complex, benchmarks on DB5 will underestimate the performance with regard to general interface predictions. Extensive assessments of general interface predictors on globular interfaces were done in other studies and were not repeated here (67, 162).
For instance, ISPRED4's performance evaluation on Docking Benchmark 5 represents a more accurate assessment for predictors on general interfaces (57).

The PEP dataset was derived from the 30-peptide-complex test dataset of ACCLUSTER, which was sourced from the database peptiDB (63, 104). Because peptide and MoRF complexes are similar, peptide binding partners (i.e., peptide partners) in PEP that are also in the MoRF datasets had to be removed. CD-HIT-2D from the CD-HIT package was employed, using the parameters “word_length” of five, sequence identity of 0.6, and length difference of 0.5 to remove redundant peptide partners (168). Five out of the original 30 PEP structures were removed. All the peptide interface predictors evaluated on MoRF-test were also benchmarked with PEP, excluding PeptiMap because it was optimized on PEP.

2.5.2 Surface, Core, and Rim Residue Definition

Surface, core, and rim residue classes were defined using SASA calculations. Areaimol from the CCP4 suite was used to calculate SASA (169). The segregation of core and rim residues has been shown to be effective in analyzing protein interfaces (46). A modified set of criteria for core, rim, and surface residues was devised to increase the number of surface residues and maximize the effectiveness of core and rim predictions. For each complex structure (i.e., MoRF bound to a MoRF partner), Areaimol was used to calculate the SASA of both the protein complex and the MoRF partner protein in isolation, where the isolated MoRF partner is simply the complex structure with the MoRF atoms removed. The SASA of each residue was then normalized by its SASA in a Gly-X-Gly peptide in an extended conformation to generate rASAs (46). Surface residues were then defined as those with an rASA > 5% in the isolated MoRF partner. Of these surface residues, those with a change in rASA when going from the isolated MoRF partner to the complex structure were identified as interface residues.
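The residue classes defined in this section can be sketched in a few lines. The following is a minimal illustration, not the thesis pipeline: it assumes rASA values are supplied as percentages in two dictionaries keyed by residue ID, uses the 5% surface cutoff stated above and the 7% core cutoff stated in the sentence that follows, and treats any decrease in rASA on binding as a change. The function name, the label strings, and the `min_change` tolerance are hypothetical.

```python
def classify_residues(rasa_isolated, rasa_complex,
                      surface_cutoff=5.0, core_cutoff=7.0, min_change=0.0):
    '''Label each residue from its rASA (percent) in the isolated partner
    and in the complex. Names and the min_change tolerance are assumptions.'''
    labels = {}
    for res_id, iso in rasa_isolated.items():
        cpx = rasa_complex[res_id]
        if iso <= surface_cutoff:
            # Not exposed enough in the isolated partner to count as surface.
            labels[res_id] = 'buried'
        elif iso - cpx <= min_change:
            # Surface residue whose exposure does not change on binding.
            labels[res_id] = 'non-interface'
        elif cpx < core_cutoff:
            # Interface residue that becomes almost fully buried in the complex.
            labels[res_id] = 'core'
        else:
            # Remaining interface residues.
            labels[res_id] = 'rim'
    return labels


example = classify_residues(
    {'A1': 2.0, 'A2': 40.0, 'A3': 30.0, 'A4': 50.0},
    {'A1': 2.0, 'A2': 40.0, 'A3': 3.0, 'A4': 20.0},
)
```

In this toy input, A3 drops from 30% to 3% rASA and is labeled core, while A4 stays at 20% in the complex and is labeled rim.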
The core contains interface residues with less than 7% rASA in the complex, while the remaining interface residues form the rim. Lastly, the residue class labels were mapped to an unbound structure if available. The same definitions were used for globular–globular and peptide–globular complexes.

2.5.3 Prediction Evaluation Comparisons with Existing Predictors

There are differences between predictors, so our evaluation method aims to convert the different prediction outputs to a consistent format while making the assessments as fair as possible. For instance, IDRBind predicts core and rim class labels, which must be merged into a single interface class for evaluation (see the Quantification and statistical analysis section for details on scoring metrics). Some predictors, including IDRBind and ISPRED4 (57), only make predictions on surface residues, so only surface residues were evaluated across all predictors. Individual residues without prediction scores were ignored and not held against the predictor. This could occur because of differences in definitions of surface residues or issues with interpreting the input structure.

Some predictors output multiple interface clusters, as opposed to scores for individual residues. These clusters must be converted to interface and non-interface residue labels to conform to our evaluation scheme. This is more common among predictors of ligand binding pockets, such as GHECOM (148) and Fpocket (53), but ACCLUSTER (63) and PeptiMap (62) also do this. Their prediction outputs are in the form of multiple binding sites. While each site could denote a separate interaction, we chose to combine them because the IDR interfaces are large in comparison, especially those of MoRFs. Each site is composed of either a cluster of partner residues or a cluster of probes mimicking the predicted ligand positions. In the latter case, we used a distance threshold of 4.5 Å to label the probes' adjacent protein residues as interface for evaluation.
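The probe-to-residue conversion can be illustrated with a short sketch. It assumes that a residue counts as adjacent when any of its atoms lies within 4.5 Å of any probe, which is one reasonable reading of the threshold; the data layout and the function name are hypothetical.

```python
import math


def residues_near_probes(atoms, probes, cutoff=4.5):
    '''Return the residue IDs with at least one atom within `cutoff`
    angstroms of any probe.

    atoms  : list of (residue_id, (x, y, z)) tuples (assumed layout)
    probes : list of (x, y, z) probe coordinates
    '''
    interface = set()
    for res_id, xyz in atoms:
        if res_id in interface:
            continue  # already labeled through another atom
        if any(math.dist(xyz, probe) <= cutoff for probe in probes):
            interface.add(res_id)
    return interface


atoms = [('A10', (0.0, 0.0, 0.0)), ('A10', (1.0, 0.0, 0.0)), ('A11', (10.0, 0.0, 0.0))]
probes = [(4.0, 0.0, 0.0)]
labeled = residues_near_probes(atoms, probes)
```

A brute-force scan like this is quadratic in the number of atoms and probes; a spatial index would be the usual replacement for large structures.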
Furthermore, the individual clusters are often ranked based on prediction confidence. Thus, we tried to select and merge the optimum number of the top-ranking clusters that provided the best performance for the predictor. The number of clusters selected for ACCLUSTER is indicated by the number following its name in Table 2.1 and Table 2.5. Using the three top clusters from the PeptiMap predictions resulted in its highest performance on MoRF-test, whereas the top two pockets were used from the Fpocket predictions. Lastly, only the top pocket was evaluated from the GHECOM predictions. Similarly, other predictors were also evaluated such that their reported performance is reasonably optimized for the respective benchmark datasets. Predictors that output individual residue scores were evaluated at multiple thresholds that were selected to give reasonable estimates of maximum MCC on each testing dataset. The performance at the threshold resulting in the highest MCC is reported for each predictor in Table 2.1, Table 2.5, and Table 2.6.

In addition, some servers provide multiple models. ISMBLab (68) consists of a family of predictors, including one for peptide interfaces, which we evaluated in Table 2.1 and Table 2.5. Multi-VORFFIP also provides multiple models. We evaluated the peptide interface model (i.e., PDB file output labeled as EBS) from Multi-VORFFIP (69). Analogously, the PredUs2 model combines two separate components. The SVM component of PredUs2 provided better results on MoRF-test, so we chose to report that instead of the performance of the full predictor (84).

There are two exceptional cases where we altered the datasets to evaluate the predictors better. One such case is in the evaluation of Multi-VORFFIP on the PEP dataset, where there is a substantial overlap between their training set and PEP. The dataset used by Multi-VORFFIP was sourced from Petsalaki et al. (170).
We ran CD-HIT-2D with a word length of 4 and thresholds of 95% identity and 80% alignment length. CD-HIT-2D clusters the sequences and returns a subset of PEP with the redundant sequences removed. The result is reported in Table 2.5. Another case is in the evaluation of ACCLUSTER, which allows the option to input peptide sequences up to 30 residues in length. Thus, for completeness, Table 2.7 reports the evaluation on the subset of MoRF-test whose MoRF sequences satisfy this length restriction.

Table 2.7: Evaluation on a subset of MoRF-test for which ACCLUSTER was given MoRF sequences as input, limiting MoRFs to those under 31 residues long.

Predictor    TPR   FPR   precision  specificity  accuracy  F1    MCC   meanMCC  SD Error  NA count  n
IDRBind      0.54  0.19  0.39       0.81         0.76      0.45  0.31  0.31     0.05      0         21
ACCLUSTER3   0.58  0.24  0.35       0.76         0.73      0.44  0.29  0.25     0.05      0         21

See Table 2.1 for more details.

Having outlined the general criteria used in the evaluation process, we list some of the parameters and options we selected for individual predictors for interested readers. Most often, we used default parameters because we assumed them to be the suggested settings. However, we did try some parameters that seemed better suited to our use case. ISMBLab outputs numeric scores as well as clustered binary predictions, and we chose the latter due to the higher performance. On the other hand, we chose the numeric predictions over the binary predictions for cons-PPISP. Similarly, the probability output of PrISE at the threshold chosen based on MCC provided higher performance than its binary output for both MoRF-test and DB5 (67). PrISE server predictions were calculated with the inclusion of highly homologous proteins to increase its performance. CPORT predictions were carried out with the threshold set to sensitive to provide sensitivity comparable to the rest of the predictors (70). BindML predictions were obtained through their web server instead of their downloadable software.
The parameters used for the two pocket detection methods were set in an effort to obtain larger interface pockets compared to the default options. GHECOM predictions were calculated using multi-scale pocket detection with parameters “-rs” of 2.5, “-rli” of 4, “-rlx” of 13, and “-br” of 1. Fpocket predictions were calculated using parameters “-M” of 13, “-I” of 45, “-r” of 5, and “-D” of 2.

2.5.4 Calculating Feature Scores

2.5.4.1 Conservation

Conservation scores were calculated using methods and software from Capra and Singh (44). The calculation of these conservation scores required multiple sequence alignments (MSA), which we generated using a procedure we optimized for protein interfaces. Specifically, sequence homologs from the NCBI nr table (ftp://ftp.ncbi.nlm.nih.gov/mmdb/nrtable/) (171) were identified using BLAST (blastp version 2.2.28+) with an E-value threshold of 0.000001. Homologous sequences were only used when the sequence identity was greater than 60% by default, but a lower threshold of 40% was used if fewer than 50 sequences were initially returned. MSAs were built from the homologous sequences using MAFFT (version 7.407) with the fftns option (172, 173). Sequences that were aligned to less than 80% of the reference sequence length were removed. Subsequently, alignment gaps were removed from the reference sequence, and a new MSA was built with MAFFT. Conservation was calculated using this MSA. Two Z-scores were calculated for the conservation scores over individual proteins, with one based on all residues and the other only on surface residues (see Appendix B).

2.5.4.2 Electrostatics and Groove

Electrostatics-derived features and groove scores were both measured from a 3D grid built around the protein. The grid points were spaced 1.2 Å apart and extended five grid points beyond the X, Y, and Z coordinate extremities of the protein.
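As a rough sketch of the grid layout, the per-axis coordinates can be generated from the protein's coordinate extremes. The source gives the spacing (1.2 Å) and the padding (five grid points) but not how the grid is anchored, so starting each axis exactly five spacings below the minimum coordinate is an assumption, as are the function names.

```python
import math


def grid_axis(lo, hi, spacing=1.2, pad=5):
    '''Coordinates along one axis, covering [lo, hi] plus `pad` grid
    points on each side. Anchoring at lo - pad * spacing is an assumption.'''
    start = lo - pad * spacing
    n_points = math.ceil((hi - lo) / spacing) + 1 + 2 * pad
    return [start + i * spacing for i in range(n_points)]


def build_grid_axes(coords, spacing=1.2, pad=5):
    '''Return the X, Y, and Z axis coordinates for a list of (x, y, z) atoms.'''
    return [
        grid_axis(min(c[k] for c in coords), max(c[k] for c in coords), spacing, pad)
        for k in range(3)
    ]


# Toy input with 1.0 spacing so the arithmetic is easy to follow.
x_axis, y_axis, z_axis = build_grid_axes([(0.0, 0.0, 0.0), (3.0, 1.0, 2.0)], spacing=1.0)
```

The full grid is the Cartesian product of the three axes; keeping the axes separate avoids materializing every point until it is needed.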
To determine whether each grid point is solvent exposed, we used EDTSurf to calculate the triangulated solvent accessible surface and used Inpolyhedron, a MATLAB module by contributor Sven in File Exchange (https://www.mathworks.com/matlabcentral/fileexchange), to identify the grid points outside the triangulated surface (i.e., solvent-exposed) (174). To calculate groove scores, we then used an approach very similar to the one employed by the pocket-finding method LIGSITE (175). In short, binding pockets are large indentations on the protein surface that can be identified by searching for void spaces that are surrounded by the protein on multiple sides. In detail, at each solvent-exposed grid point, we scanned along several evenly spaced lines (e.g., the axes) that intersect the target point, searching for the protein surface. Lines that intersect the protein on both ends were counted for the calculation of groove scores. Thus, if we scan along six lines, the score will range from 0 to 6, where a score of 6 suggests that the grid point is surrounded by protein in 12 directions. We can determine whether each line intersects the protein because the lines were placed along the 3D grid where each point was labeled as inside or outside of the protein. Each line was extended from the target point by 12 grid points in both directions. Two groove score variants were calculated with 6 and 13 lines per grid point, respectively, and the feature scores were mapped onto protein residues using a 4.5-Å threshold to define adjacent atoms. The score was then averaged over each residue.

The electrostatic potential was calculated using DelPhi, and the related feature scores were derived from the potential and electrostatic field measured at the solvent-exposed grid points (176). To prepare the protein structure for DelPhi, Profix from the JACKAL package was used to fill in missing atoms (i.e., with parameters “-prm 2 -fix 0”), and Reduce was used to replace or add hydrogens (177–179).
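Returning to the groove score described above, the line-scanning step can be sketched as follows. This is an illustration under stated assumptions rather than the thesis code: the protein interior is given as a set of integer grid indices, and the 13-line variant is assumed to use the 13 unique directions of a cubic lattice (3 axes, 6 face diagonals, 4 body diagonals). The source does not list the directions used for either variant, and all names here are hypothetical.

```python
from itertools import product


def lattice_directions():
    '''The 13 unique line directions on a cubic lattice (one per +/- pair).'''
    directions = []
    for step in product((-1, 0, 1), repeat=3):
        if step == (0, 0, 0):
            continue
        opposite = tuple(-c for c in step)
        if opposite not in directions:
            directions.append(step)
    return directions


def hits_protein(inside, point, step, reach=12):
    '''True if walking up to `reach` grid points from `point` along `step`
    lands on a grid point labeled as inside the protein.'''
    x, y, z = point
    for k in range(1, reach + 1):
        if (x + k * step[0], y + k * step[1], z + k * step[2]) in inside:
            return True
    return False


def groove_score(inside, point, directions, reach=12):
    '''Count the lines through `point` that meet the protein on both ends.'''
    score = 0
    for step in directions:
        opposite = tuple(-c for c in step)
        if hits_protein(inside, point, step, reach) and hits_protein(inside, point, opposite, reach):
            score += 1
    return score


directions = lattice_directions()
# Protein on both sides along X, but only on one side along Y.
inside = {(3, 0, 0), (-2, 0, 0), (0, 4, 0)}
score = groove_score(inside, (0, 0, 0), directions)
```

Mapping the per-grid-point scores to residues within 4.5 Å and averaging would follow as a separate step.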
The processed PDB file was then formatted with CHARMM param22 atom names using the MMTSB Tool Set (180, 181). The parameters used for DelPhi are similar to those used previously (i.e., probe radius 1.4, ion radius 2, salt 0.1, percentage fill 65, 1 grid/Å, 2,000 iterations of the linear Poisson-Boltzmann equation (PBE), and dipolar boundary conditions) (109, 182). Electrostatic potential and electrostatic field gradients were mapped to the residue atoms within a 4.5-Å threshold and then averaged over each residue, resulting in feature scores for potential and field. The electrostatic potential scale has both positive and negative values, so we also calculated the absolute value of the potential, which could better reflect the strength of the potential. Additionally, for each grid point, the field gradient towards the closest protein atom was calculated by taking the dot product of the field gradient and the unit vector towards the closest atom. From these values, FieldToSurface was calculated by mapping and averaging with the same method used for the potential and field scores. FieldAtom and FieldFarAtom were instead calculated with respect to the surface atoms. For each protein surface atom, the field gradient in its direction was calculated as the dot product of a grid point’s field gradient with the unit vector in the direction of the atom, and subsequently averaged over all the grid points within set distances from the atom. We calculated this per atom, averaging over grid points within 4.5 Å and grid points that fall within the range of 4.5 to 6 Å, which were then averaged over each residue to get FieldAtom and FieldFarAtom, respectively.

2.5.4.3 Residue Composition

Residue composition scores were calculated using principal components of amino acid indices generated from the protr package in R (183).
This package provided 531 complete indices from the AAindex database (39) for the 20 canonical amino acids, and principal component analysis (PCA) was used to obtain orthogonal indices. The first five principal components were used, which account for 95% of the variance of the original indices. Each residue was assigned its indexed scores for principal components 1 to 5 to obtain residue composition scores 1 to 5, and we also calculated surface patch averaged scores for each residue together with its neighboring residues within a 13-Å radius (see list in Appendix B). Notably, residue composition score 1 has a strong inverse correlation with hydrophobicity, with lower values for hydrophobic residues [Pearson correlation with hydrophobicity index 0.93 (AAindex ID: FASG890101); p = 1.6 × 10^−9].

2.5.4.4 B-Factor

B-factor estimate scores were based on the thermal fluctuation estimates from Gaussian network modeling (GNM) provided by the ProDy Python package (184). GNM calculations for proteins were carried out with a 7.3-Å cutoff for pairwise interactions between α-carbons.

2.5.4.5 Curvature and Roughness

Besides groove scores, we also used the surface geometry features curvature and roughness. The curvature scores were calculated with the help of Surface Racer (185). Surface Racer was set to calculate the molecular surface using a probe radius of approximately 1.4 Å, adjusting the radius in 0.01-Å increments if the program failed on a protein. The roughness score of protein surfaces was calculated using rufness from the HotPatch package (52, 186).

2.5.4.6 Feature Patterns 1 to 5

To capture the unique elongated profile of MoRF interfaces, we devised the additional feature scores pattern 1 to 5. MoRF interfaces have core residues in the center with rim residues surrounding them, creating elongated donut shapes. Core residues have higher groove and conservation scores that contrast with those of the rim.
Therefore, we combined the groove and conservation scores and searched the protein surface for such patterns. These pattern features do not represent any established physicochemical property measures. Their correlations to MoRF interfaces are not very strong, but the pattern features are not very reliant on the magnitudes of the feature scores from which they are derived. Therefore, they can provide additional orthogonal information.

Pattern scores were calculated by combining the Z-scores of the groove score calculated with 13 line tracings (GrooveFine_Z in Appendix B; see Methods on groove scores) and conservation Z-scores (Conservation_Z in Appendix B), which are both expected to be highest at the core of the MoRF interface and lower at the rim. Thus, high scoring residues are expected at the center of the interfaces, tracing the axes of elongated MoRF interfaces.

Groove and Conservation scores were combined using a Bayesian update, for which these scores must first be converted to normally distributed scores. GrooveFine_Z and Conservation_Z scores were converted to normal distributions through a normal quantile transformation in the following steps. The scores from the MoRF-train dataset were ranked and then scaled between 0 and 1. Each set of scores was then treated as the quantile in a cumulative distribution function of a normal distribution. The qnorm function in the Perl CDF module calculates the Z-score when given a quantile. We then converted the Z-score to a score x, given that x belongs to a normal distribution with mean µ = 0.5 and standard deviation σ = 0.1:

$$Z = \frac{x - \mu}{\sigma}$$

The normal scores were then treated as probabilities and combined using Bayes’ rule.
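A minimal sketch of the transformation and the combination step is given below. Two points are assumptions rather than details taken from the text: ranks are scaled as rank/(n + 1) so that the quantiles stay strictly between 0 and 1, and the two probabilities are combined with the product-form Bayes update under equal priors. Ties are not handled, and the function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_scores(train_scores, mu=0.5, sigma=0.1):
    '''Map raw scores to values drawn from N(mu, sigma) by rank.
    Scaling ranks as rank / (n + 1) is an assumption.'''
    n = len(train_scores)
    order = sorted(range(n), key=lambda i: train_scores[i])
    target = NormalDist(mu, sigma)
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        # inv_cdf returns x = mu + sigma * Z for the given quantile.
        transformed[index] = target.inv_cdf(rank / (n + 1))
    return transformed


def bayes_combine(p1, p2):
    '''Combine two independent probabilities of the same event (equal priors).'''
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


groove_p = normal_quantile_scores([3.0, 1.0, 2.0])
combined = bayes_combine(0.8, 0.8)
```

With µ = 0.5 and σ = 0.1, almost all transformed scores fall well inside the interval from 0 to 1, which is what allows them to be treated as probabilities in the update.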
Treating p1 and p2 as probabilities of the residues being in the center of the interface based on evidence from the GrooveFine_Z and Conservation_Z scores, the updated probability is:

$$p_c = \frac{p_1 p_2}{p_1 p_2 + (1 - p_1)(1 - p_2)}$$

The GrooveFine_Z and Conservation_Z scores of new query proteins were ranked and combined based on the MoRF-train score distributions.

By defining surface patches and mapping the combined score (i.e., pc) to them, we can calculate vectors that point towards the direction of the highest score. A circular plane was defined around a central residue by the least-squares fitting of the geometric centers of all residue side chains within a 13-Å radius. These neighboring residues were then mapped to the plane. Consider a circular plane centered at the origin in Cartesian coordinates. We divided the plane into four overlapping regions: front and back, and core and rim. The front and back were divided in the Y direction, with the front of the plane occupying Y > 0. Core and rim were divided in the X direction, with the core occupying −0.25 < X < 0.25. We defined a unit vector pointing to the front region (i.e., Y-axis) of the plane, $\vec{v} = \langle 0, 1 \rangle$. As we allow the vector and plane to rotate about the origin, the combined scores in each of the four regions were averaged independently. The final orientation was set to minimize the squared error between the averaged combined scores and the following step functions:

$$FB(y) = \begin{cases} 1, & y \ge 0 \\ 0, & y < 0 \end{cases} \qquad CR(x) = \begin{cases} 1, & |x| \le 0.25 \\ 0, & |x| > 0.25 \end{cases}$$

As a result, the vector of the reoriented plane will point towards the higher scoring residues. The CR (i.e., core/rim) step function keeps the vector of core residues aligned to the axis of the elongated interface. The FB (i.e., front/back) step function keeps the vector pointing to the higher scoring residues.

With an oriented vector calculated for each residue, we then calculated the pattern feature scores.
Ideally, vectors of rim residues will point toward the core residues while vectors of core residues will point parallel to the axis of the MoRF interface. The plane around each central residue C = (0,0) was divided into regions 1 to 3 containing neighboring residues i:

$$R_1 = \{(x_i, y_i) : |x_i| \le 4\,\text{Å},\ y_i \ge -13\,\text{Å}/4\}$$
$$R_2 = \{(x_i, y_i) : |x_i| > 4\,\text{Å}\}$$
$$R_3 = \{(x_i, y_i) : |x_i| \le 4\,\text{Å},\ y_i \le -13\,\text{Å}/4\}$$

where R1 corresponds to the front of the plane in the Y-axis direction, R2 corresponds to the two sides, and R3 corresponds to the back.

[Figure: regions R1 to R3 of the plane around the central residue]

Pattern 1 expects core residues to have neighboring vectors pointed to them. The dot product of the vector for each Region2 residue (vR2) and its unit vector towards the direction of the central residue (vR2->C) was calculated and then averaged over the N residues in Region2:

$$P_1 = \frac{1}{N} \sum_{R_2} \vec{v}_{R2} \cdot \vec{v}_{R2 \to C}$$

Pattern 2 scores high for core residues whose vector is perpendicular to the vectors of Region2 residues.

P2 = [equation not recoverable from the extracted text]

Pattern 3 scores high for core residues with a vector that is parallel to the other vectors of Region1, averaged over the N residues in Region1:

$$P_3 = \frac{1}{N} \sum_{R_1} \left| \vec{v}_{C} \cdot \vec{v}_{R1} \right|$$

Pattern 4 is high for rim residues whose vectors are perpendicular to the core residues. We first calculated the unit vector of residues in region R3 together with central residue C.

P4 = [equation not recoverable from the extracted text]

Pattern 5 scores high for rim residues that have vectors parallel to their neighbors.

P5 = [equation not recoverable from the extracted text]

2.5.4.7 Aggregated Feature Scores

Additional feature scores were derived by aggregating neighboring residue scores. Surface patches were defined using 6-, 9-, and 12-Å thresholds between α-carbons. The individual feature scores of the residues in each surface patch were then aggregated. The aggregation functions we used were the average, the average of absolute values, the maximum value, the minimum value, the average of the top 50% of values, and the average of the bottom 50% of values. In addition, Z-scores were calculated based on the original feature scores and the aggregated feature scores.
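The six aggregation functions can be illustrated with a short sketch. It assumes each patch contains the central residue plus every residue whose α-carbon lies within the threshold, and that the top and bottom halves use integer division with at least one value each; neither detail is stated in the text, and the names are hypothetical.

```python
import math


def patch_members(ca_coords, center_id, cutoff):
    '''Residue IDs whose alpha-carbon is within `cutoff` angstroms of the
    central residue's alpha-carbon (the central residue is included).'''
    center = ca_coords[center_id]
    return [rid for rid, xyz in ca_coords.items() if math.dist(xyz, center) <= cutoff]


def aggregate_patch(values):
    '''Apply the six aggregation functions to the scores of one patch.'''
    ordered = sorted(values, reverse=True)
    half = max(1, len(ordered) // 2)  # assumption for odd-sized patches

    def mean(xs):
        return sum(xs) / len(xs)

    return {
        'mean': mean(values),
        'mean_abs': mean([abs(v) for v in values]),
        'max': ordered[0],
        'min': ordered[-1],
        'mean_top_half': mean(ordered[:half]),
        'mean_bottom_half': mean(ordered[-half:]),
    }


ca = {'A1': (0.0, 0.0, 0.0), 'A2': (5.0, 0.0, 0.0), 'A3': (8.0, 0.0, 0.0), 'A4': (20.0, 0.0, 0.0)}
members_6 = patch_members(ca, 'A1', 6.0)
stats = aggregate_patch([4.0, -2.0, 1.0, 3.0])
```

Running `patch_members` at 6, 9, and 12 Å for each residue and feeding the members' feature scores to `aggregate_patch` yields one set of aggregated features per threshold.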
The Z-scores of each residue were calculated with respect to the surface residue scores of the protein to which it belongs.

2.5.5 Constructing CorePred, RimPred, and IDRBind CRF

CorePred and RimPred were both trained on the MoRF-train set using the XGBoost library in R (80). They are both gradient boosted trees models built using the binary logistic objective. These two models were optimized through feature selection and hyperparameter optimization, and 20-fold cross-validation was used for evaluation in both processes. Residues from the same protein were grouped together during cross-validation since adjacent residues may have very similar feature scores. For feature selection, features were grouped into conservation, groove, curvature, roughness, field, potential, B-factor estimate, rASA, residue composition 1 (i.e., hydrophobicity), residue composition 2, residue composition 3, residue composition 4, residue composition 5, Pattern1/2/3, and Pattern4/5. Grouping of features into categories was inspired by Meyer et al. (4). This reduces the search space of feature selection in an expert-guided manner. The initial set of 30 features was selected based on the highest Pearson correlation to the core or rim residues within each feature group, selecting two from each group.

The R package rBayesianOptimization was used to optimize XGBoost hyperparameters based on AUC performance. Sampling points of the hyperparameter space were determined by rBayesianOptimization's “ucb” function at kappa of 1 and epsilon of 0. Unfortunately, optimization of more than five parameters at once is time intensive with this method. Therefore, we chose to optimize two sets of XGBoost hyperparameters separately. Parameters1 consists of nrounds, max_delta_step, and gamma, while Parameters2 consists of max_depth, eta, colsample_bytree, min_child_weight, and subsample.
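The protein-grouped cross-validation mentioned above can be sketched as a fold assignment in which every residue of a protein lands in the same fold. The shuffling, the round-robin assignment, and the function name are assumptions for illustration; the thesis does not describe how proteins were distributed across the 20 folds.

```python
import random


def grouped_folds(protein_ids, k=20, seed=0):
    '''Split residue indices into k folds so that all residues of one
    protein share a fold.

    protein_ids : list giving the protein of each residue (one entry per residue)
    '''
    proteins = sorted(set(protein_ids))
    rng = random.Random(seed)
    rng.shuffle(proteins)
    # Round-robin assignment of whole proteins to folds (an assumption).
    fold_of = {protein: i % k for i, protein in enumerate(proteins)}
    folds = [[] for _ in range(k)]
    for residue_index, protein in enumerate(protein_ids):
        folds[fold_of[protein]].append(residue_index)
    return folds


residue_proteins = ['p1', 'p1', 'p2', 'p3', 'p3', 'p3']
folds = grouped_folds(residue_proteins, k=3, seed=1)
```

Each fold then serves once as the held-out set, so no protein contributes residues to both the training and the evaluation side of a split.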
Starting with a rough set of manually selected parameters, we first optimized Parameters1 with 80 random initial points and 20 iterations of Bayesian optimization. This was followed by optimization of Parameters2 with 100 random initial points and 30 iterations of Bayesian optimization. The resulting set of hyperparameters was used during feature selection, which involved iterating through the 30 features, replacing each with other features of the same category, and keeping the feature that resulted in the highest AUC in cross-validation. For CorePred, the aggregated feature scores were removed, except for residue composition features averaged over 13-Å surface patches. Following feature selection, fine adjustments were made to Parameters2 with 100 initial points and 30 iterations of Bayesian optimization.

To generate the CRFs, a network representation of the protein surface was created for each protein using the EDTSurf and Instant Meshes programs. EDTSurf was used to calculate a triangulated molecular surface of the protein using a 1.42-Å probe (174). Instant Meshes takes the triangulated surface and converts it to a quadrilateral mesh (Figure 2.5B) (187), which we used to define the graph nodes and edges. The desired scale of the edges was set to 3.5 Å along with the following settings: two smoothing steps, deterministic algorithms, intrinsic mode, and align to boundaries. The scaling of the mesh was selected with the goal of representing each low rASA residue through at least one quadrilateral face while limiting the number of faces associated with each highly solvent-exposed residue. The averaged coordinate of the four vertices of each face was mapped to the closest residue with an upper limit distance of 6 Å. Faces that share an edge were considered neighbors, defining the edges of the adjacency component of our CRF.
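The neighbor definition can be made concrete with a small sketch that derives face adjacency from a list of quadrilateral faces. It assumes each face is given as a tuple of four vertex indices in order around the face; the function name and data layout are hypothetical and do not reflect the Instant Meshes output format.

```python
from collections import defaultdict


def face_adjacency(faces):
    '''Map each face index to the set of faces that share an edge with it.

    faces : list of vertex-index tuples, ordered around each face
    '''
    edge_to_faces = defaultdict(list)
    for face_index, vertices in enumerate(faces):
        for i in range(len(vertices)):
            a, b = vertices[i], vertices[(i + 1) % len(vertices)]
            # frozenset makes the edge independent of traversal direction.
            edge_to_faces[frozenset((a, b))].append(face_index)

    neighbors = defaultdict(set)
    for sharing in edge_to_faces.values():
        for i in sharing:
            for j in sharing:
                if i != j:
                    neighbors[i].add(j)
    return dict(neighbors)


# Three quads: face 0 shares an edge with faces 1 and 2; faces 1 and 2
# touch only at vertex 4, so they are not neighbors.
quads = [(0, 1, 4, 3), (1, 2, 5, 4), (3, 4, 7, 6)]
adjacency = face_adjacency(quads)
```

Because a quadrilateral has four edges, each face on a manifold mesh ends up with at most four neighbors.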
The CRF model combines the output scores from CorePred and RimPred while accounting for the compatibility between neighboring residue labels (Figure 2.5A). FACTORIE, a software library for creating factor graph models, was used to generate CRFs (146). Factor graphs are bipartite graphs with factors and variables. Factors contain the functions describing the compatibility between variables with weights that are optimized during training. The factors we used are template factors, meaning that factors of the same type share the same weights across the whole graph and all graphs. The variables include the feature variables (observed) and the labeled variables (unobserved), where the labeled variables are defined by the faces on the mesh surface described above and each one has a discrete label core, rim, or non-interface. The feature variables were generated as a vector of four continuous values: CorePred score, RimPred score, and the protein averages of the two scores. The protein-averaged scores were introduced to counteract a protein size bias in prediction scores (143). Thus, our model applies a penalty to core and rim labels for proteins with high averaged CorePred and RimPred scores.

Our CRF consists of the scoring component and adjacency component. For each labeled variable, the scoring component is composed of a feature variable vector, a factor describing the feature variables' compatibility with the labeled variable, as well as a factor associated with class bias (Figure 2.5A). The class bias factor provides one weight value per label class, allowing adjustments to the weight of each class during training. The adjacency component consists of the paired factors defined using the surface mesh, connecting neighboring labeled variables. Using a quadrilateral mesh to define connections of the adjacency component of our graph model ensures that each labeled variable is limited to a maximum of four neighbors.
Thus, the paired factors making the connections will exert constraints evenly across all labeled variables. In summary, the scoring component combines CorePred and RimPred by applying weights to both and adjusting for size bias, while the adjacency component accounts for the compatibility of neighboring labeled variables.

The two components of our graph model were optimized sequentially. First, the scoring component was trained on the CorePred and RimPred scores from MoRF-train that were calculated using the leave-one-out method, calculating prediction scores on one protein by using XGBoost models built on the rest. For training, we adjusted the representation of each class label by oversampling. We did this for each protein individually by starting with the initial set of labeled variables and adding to it by sampling with replacement from the under-represented label classes until we had 33% core, 33% rim, and 34% non-interface. Thus, the three classes of labeled variables were roughly equally represented during training. In FACTORIE, a likelihood Example is a data structure that encapsulates a training example and calculates gradients using maximum likelihood during training. Each labeled variable provided two training examples: one that treats the interface as a single class, and one that also evaluates core versus rim classification. This encouraged the model to learn interface detection instead of putting excessive weights on core and rim segregation. The weight parameters were optimized through 60 iterations of batch training on the Examples with the AdaGrad optimizer and parameter averaging (146, 188).

Subsequently, the adjacency component was optimized on top of the fully trained scoring component based on the same MoRF-train set. The adjacency component consists of just one type of factor, which is a paired factor touching two labeled variables.
The weight parameters for these paired factors were decided upon with the intent to favor clustering of nodes with the same class label, counting core and rim as one class. The idealized model of the interface regions also places the rim around the core, so the case of core being adjacent to non-interface was made to be the most unfavorable. More specifically, the weight tensor of the paired factors between labeled variables was set manually to 1 for matching classes, 1 between core and rim, -1 between core and non-interface, and -0.5 between rim and non-interface. This weight tensor was then multiplied by a coupling strength parameter c. For the optimization of this parameter c, the MCC was evaluated for predictions made on MoRF-train while varying c. The resulting MCC was plotted against c, and a polynomial of degree two was fitted in order to select the c that maximizes the MCC.

For the final prediction, maximum a posteriori (MAP) estimation on the combined two-component model is accomplished through loopy belief propagation (189). Ten steps of loopy belief propagation are enough for convergence. Finally, the predicted graph labels are mapped back to the protein residues. This is done hierarchically with core taking precedence over rim and rim taking precedence over non-interface. For example, if at least one labeled variable associated with a residue is predicted to have a core label, the whole residue takes on the core prediction.

2.5.6 Figure Generation

Figures of protein structures were generated using VMD with the molecular surfaces calculated from MSMS (190, 191). Figure 2.5B is the exception where the molecular surface was calculated with EDTSurf and the figures are screenshots from Instant Meshes (174, 187). The placement of MoRFs onto the unbound structures of MoRF partners was done through aligning the bound and unbound MoRF partner structures only using the MultiSeq alignment tool in VMD (192).
Figures of protein structures were edited using GIMP (https://www.gimp.org/). The R package ggplot2 was used for the density plots (193).
2.5.7 Quantification and Statistical Analysis
To evaluate the full range of values for the feature scores as well as prediction modules such as CorePred and RimPred, the ROC curves and AUCs were generated using the ROCR package (194) in R (https://www.r-project.org/). For the evaluation of predictions with binary values, or with thresholds applied to continuous values, we use the following measures. True positive (TP) and false positive (FP) are the numbers of residues correctly and incorrectly classified as interface, respectively. True negative (TN) and false negative (FN) are the numbers of residues correctly and incorrectly classified as non-interface, respectively. TPR, also referred to as sensitivity, is defined as TP/(TP + FN). False-positive rate (FPR) is defined as FP/(FP + TN). Precision is TP/(TP + FP). Specificity is 1 − FPR. Accuracy is (TP + TN)/(TP + TN + FP + FN). F1 score is defined as 2TP/(2TP + FP + FN). MCC is (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)].
2.5.8 Structure RMSD Calculations
Structure alignment and RMSD calculations presented in Figure 2.2 and Appendix A were performed using the McLachlan algorithm implemented in the program ProFit (http://www.bioinf.org.uk/software/profit/) (195). All alignments and calculations were carried out on Cα atoms. For each NMR MoRF complex structure, the MoRF's primary interface residues were defined as residues classified as part of the interface in at least 50% of the models using the rASA definition of interface residues described above. The flanking residues consist of the residues that are not part of the primary interface. The RMSDs of the MoRFs were calculated by aligning the structure of the MoRF partner before calculating the RMSD of the specified MoRF residues (i.e., all residues, primary interface, flanking).
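The performance measures of Section 2.5.7 can be computed directly from the four residue counts. The following is a minimal sketch using the standard definitions of accuracy, F1 score, and MCC; the function name and the returned dictionary keys are chosen here for illustration. Returning an MCC of 0 when its denominator is zero is a common convention assumed here, not something specified in this work, and the other ratios assume non-zero denominators.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    # tp, fp, tn, fn: residue counts for the interface (positive) class.
    tpr = tp / (tp + fn)                      # sensitivity
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        'tpr': tpr,
        'fpr': fpr,
        'precision': tp / (tp + fp),
        'specificity': 1 - fpr,
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'f1': 2 * tp / (2 * tp + fp + fn),
        # Assumed convention: MCC is 0 when the denominator vanishes.
        'mcc': (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Unlike accuracy, MCC remains informative when non-interface residues greatly outnumber interface residues, which is the typical situation for protein surfaces.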
The mean RMSDs were calculated by taking the quadratic mean of pairwise RMSDs between all NMR models (196). Five out of 38 MoRFs were omitted from the analysis because they were missing flanking regions. The interface RMSD between the bound and unbound states of MoRF partners was calculated by aligning the non-interface residues and subsequently calculating the RMSD of the interface residues. The RMSD of the whole MoRF partner structure was also calculated for comparison. Whenever multiple NMR models exist for the MoRF partner structure, the first model was used.
2.5.9 Prediction Server
A web server was developed to allow users to utilize our prediction method. Given a protein structure and sequence submitted in PDB and FASTA formats, respectively, the IDRBind server returns results in a modified PDB file with the class label in the B-factor column for viewing in programs such as VMD (191). In addition, the user can download a text file containing the IDRBind class labels, CorePred scores, and RimPred scores. By default, prediction jobs are placed in a queue displayed on the queue page, where links to the result page of each job are shown (Appendix C). Users can also opt to have their jobs remain private, which hides them from the queue page.
Chapter 3: Interactions of IDRs — Attributes and Mutations
3.1 Overview
The characteristics of interaction interfaces between folded protein domains have been the focus of numerous studies. While interactions between folded (globular) proteins dominate the available catalog of protein structures, many interactions are mediated by intrinsically disordered regions (IDRs), which are protein regions that do not adopt a folded structure independently. Because IDRs are flexible, their binding sites on globular proteins should have characteristics that differ from those of the interfaces between globular domains.
The following study analyzes the differentiating sequence and structural properties of the binding sites of IDRs on globular proteins (IDR-partner). Multiple properties show that the IDR-partner interfaces are generally differentiable from interfaces between globular proteins (globular interfaces), but there is also substantial overlap. Through measuring the enrichment of mutations in protein interface regions, we examine whether the different molecular characteristics of IDR interactions affect their susceptibility and tolerance to mutations. IDR interactions appear to be just as susceptible to disease mutations, but there is a possibility of increased tolerance to neutral mutations in IDR-partner interfaces.
3.2 Introduction
Interactions mediated by IDRs are common in the human interactome, but structural data on them are scarce relative to globular–globular protein complexes. Therefore, the characteristics of IDR-mediated interactions have not been comprehensively investigated. Unlike globular domains, which are constrained by their three-dimensional folds, IDRs can sample many conformations in their free states, and many IDRs even retain some residual conformational freedom in their bound states (135). It is unclear what consequences these differences have for the binding sites of IDRs on globular proteins, as compared to classical globular interfaces. These binding sites will be referred to as IDR-partner interfaces to avoid confusion with interfaces in IDRs. The sequence properties of IDRs are well studied. Their interaction-mediating residues are well conserved and hydrophobic, which is in stark contrast to the residues in IDRs that are not involved in interactions, allowing accurate predictions of interacting IDRs based on sequence (111, 197). The interaction-mediating elements within IDRs (interacting IDRs) have traditionally been studied as two separate categories: peptide motifs (peptides) and molecular recognition features (MoRFs).
Interacting peptides typically contain a strongly conserved sequence motif consisting of residues critical for binding. These short sequences of 5 to 13 residues often bind to deep pockets on the surface of binding partners (104, 107). MoRFs are longer sequences within IDRs that also bind to concave surfaces, but they can form more elaborate folds and contain preformed structural elements (106). Importantly, IDRs cannot adopt a static fold on their own, meaning they are well-differentiated from globular domains in terms of both sequence and structure. On the other hand, the interaction partners of interacting IDRs (IDR-partners) are most often globular proteins, so IDR-partner interfaces share commonalities with classical globular interfaces. However, the predictor of IDR-partner interfaces we previously developed, named IDRBind, showed higher sensitivity to IDR-partner interfaces than globular interfaces (see Chapter 2). IDRBind’s selectivity demonstrates that binding sites of peptides and MoRFs share similarities while both are differentiated from globular interfaces, but the distinguishing properties and the degree of differentiation have not been characterized. Therefore, we are interested in quantifying the differences between IDR-partner and globular interfaces.
In the first part of this chapter, we present analyses and comparisons of sequence and structural attributes of IDR-partner and globular interfaces. A noteworthy interface attribute is the surface curvature because it is a good measure for identifying the grooves and pockets to which IDRs bind. Since the surface of a globular object is convex on average (185), concave areas above a certain size are strongly contrasted on the protein surface. B-factor estimates of protein interface residues were also compared because peptide interfaces are known to be more rigid (104). Another property that might differentiate IDR binding sites is electrostatic potential, which is especially strong in MoRF interactions (109).
Hydrophobicity and rASA are well-established features for protein interface prediction (60), so differences in these features will have implications for IDR-partner interface prediction. In addition to sequence and structural characteristics, we also investigated the functional importance of interface residues. The functional importance of interface residues implies their greater susceptibility to amino acid substitutions that lead to change in or loss of function, as compared to non-interacting surface residues. This susceptibility, or low tolerance to mutations, can be assessed directly via mutagenesis and binding assays (11). However, mutagenesis studies typically target individual proteins and are resource-intensive, limiting their applicability in bioinformatics studies of large groups of proteins, such as IDR interactions. An alternative approach is to measure evolutionary conservation, which is akin to assessing the results of mutagenesis experiments performed by nature (42). Through evolution, random mutations occur naturally and are either rejected or fixed as a result of purifying and positive selection. Sequence alignments of homologous proteins allow evaluation of the conservation of each sequence position, and the strength of conservation correlates with selective pressures. In other words, residue changes in important positions are rarely observed because they would likely disrupt biological function (i.e., have deleterious effects). Protein interface residues are known to be conserved, and multiple protein interface predictors have successfully utilized this feature (60). IDR interactions do not appear to be an exception. Sequence conservation score is a prominent feature for IDR-partner interface prediction by IDRBind and also for predictors of interacting IDRs, such as MoRFpred and MoRFchibi (110, 111).
Thanks to new high-throughput sequencing technologies and the steadily growing number of human genome sequences, it is now also possible to assess protein sequence conservation within the human population (198, 199). SNVs and their observed frequency data from gnomAD describe the natural sequence variations in the human population. Mutations at key functional regions would more likely cause disease or even lethality, so SNVs across all frequencies are depleted from functionally critical regions such as protein interfaces (130), and thus their enrichment shows an inverse correlation with sequence conservation. Studies of SNVs often make a distinction between common and rare variants. Common SNVs, which are often defined as those with greater than 1% frequency in the population, are often neutral (200, 201). These common alleles may even provide a selective advantage and may be beneficial for adaptation of the population to environmental stressors (202). On the other hand, rare variants account for the majority of variants in the population (203). Rare SNVs contain many random and novel mutations and are enriched in disease mutations (204, 205). Furthermore, studies have suggested an inverse correlation between frequency and phenotypic effect (206). Importantly, typical IDRs are under weaker purifying selection, and there is also evidence of stronger positive selection and higher evolutionary rates in IDRs (207, 208). Additionally, the flexibility of IDRs might allow IDR-partner interfaces to accommodate more SNVs. Therefore, we anticipate different SNV enrichment patterns in IDR interactions.
A caveat to correlating depletion of SNVs with functional importance is that SNVs can have neutral or deleterious effects. Conversely, annotated disease mutations are more directly associated with disruption of function. In contrast to the SNVs occurring in the healthy population, disease mutations are more often found in the functionally important regions.
Mutation of residues in active sites will generally be deleterious (12, 204). Similarly, disease-associated mutations are also overrepresented in buried residues in globular domains, where residues are tightly packed and intolerant to substitutions compared to those on the surface (130, 209). Disruption of PPIs is a common mechanism through which mutations lead to disease. An investigation of mutation localization on protein complex structures showed disease mutations are enriched in protein interfaces relative to the full-length proteins (15). Subsequently, a seminal study tested numerous mutations and measured changes in PPIs, and their results indicated that two-thirds of the disease mutations disrupted PPIs (11). Some of these mutations disrupt all interactions made by a protein through destabilizing the protein, which is typical of substitutions in the buried regions. The other mutations, which disrupt a subset of a protein’s interactions, were named edgetic mutations, and they are enriched in protein interfaces in PDB structures. Within the interface, disease mutations are more common in the interface core, especially the hotspot residues that contribute most to affinity (130). In summary, PPIs are often disrupted by disease mutations, many of which are localized to the protein interface core. Because IDR interactions and globular interactions exhibit differences in both structure and function, results of disease mutation analysis on globular interactions might not generalize to IDR interactions. IDR-partners are typically globular domains, so IDR-partner interfaces likely share similar trends in disease mutation enrichment, with most mutations localizing in the interface core. However, the mutation enrichment patterns in IDR-partner interfaces have yet to be investigated. In contrast, mutation localization on the IDR side of the interaction has received more attention, partly due to the availability of sequence-based prediction methods of IDRs.
With an estimated 22% of disease mutations located in IDRs and a greater concentration of these mutations in interacting IDRs, the importance of understanding the mutations in IDRs cannot be overstated (210). The evolutionary rates have been shown to be high in IDRs, primarily due to weaker purifying selection pressure compared to residues in globular domains (211). On the surface, the higher evolutionary rates and the presence of disease mutations appear contradictory. However, studies typically do not separately analyze the interacting IDRs, and studies that analyze interacting IDRs rely on sequence-based predictors to define the interacting regions (124, 210). On the other hand, a study focusing on peptides with conserved sequence motifs has identified enrichment of disease mutations compared to benign mutations (123). Therefore, identifying the specific IDR residues participating in interactions may be crucial for revealing disease mutation enrichment patterns in interacting IDRs.
In the second part of this chapter, we studied the localization of mutations in protein complex structures, focusing on the IDR interactions and increasing contrast by dividing interfaces into core and rim regions. Compared to globular interfaces, IDR-partner interface residues may be under less constraint if the flexible interacting IDRs can adapt to changes in binding sites. To test this hypothesis, disease mutations from SwissVar and COSMIC databases, as well as single nucleotide variants (SNVs) from the gnomAD database, were mapped onto protein complex structures (14, 198, 212). The IDR interactions in the complex structures were identified so that comparisons could be made between IDR and globular interactions. Results reveal a strong presence of disease mutations and depletion of SNVs in the IDR-partner interfaces, both of which are contrary to our expectations.
However, these results match certain interpretations of the observed IDR-partner interface physicochemical properties, including high residue conservation and rigidity. Moreover, disease mutations in interface regions of IDRs are at least as likely as in globular protein interfaces, especially when comparing the core regions. These observations are concordant with studies that have associated IDRs with numerous diseases, especially cancer (98, 213–215).
3.3 Results
3.3.1 Properties That Differentiate IDR-Partner and Globular Interfaces
The IDR-partner interfaces, including MoRF and peptide interfaces, can be differentiated from globular interfaces based on several physicochemical properties. The predictor IDRBind was developed based on the hypothesis that MoRF interfaces are different from globular interfaces, and this hypothesis was proven correct when IDRBind demonstrated selectivity for MoRF interfaces over globular interfaces. Furthermore, IDRBind is even better at identifying binding sites of peptides. Thus, there must be feature scores used by IDRBind that rank IDR-partner and globular interfaces differently. To test this hypothesis, we analyzed the sequence and structural attributes of MoRF, peptide, and globular interfaces. The interface properties were extracted from interaction structure datasets used in the testing of IDRBind, which include MoRF, peptide, and globular complex datasets that were named MoRF-test, PEP, and DB5, respectively (datasets introduced in Chapter 2.3). An advantage of choosing these datasets is their lack of overlap with the training data of IDRBind (MoRF-train). The interface region of each protein was identified through relative accessible surface area (rASA) calculations. Briefly, interface residues are the subset of solvent-exposed surface residues, i.e., residues with high rASA, that change in solvent accessibility between their bound and unbound states.
More specifically, interface residues increase in rASA when the partner protein chain in the complex structure is removed. We analyzed the protein surface residues’ sequence and structural properties, including curvature, electrostatic potential, conservation, rASA, B-factor estimate, and hydrophobicity. These properties are the feature scores used in IDRBind, and their calculations were previously described (Chapter 2.5.4). These per-residue feature scores were aggregated over each protein interface for Figure 3.1 (see Chapter 3.5 for calculations). Figure 3.1 shows the averaged Z-scores for several distinguishing features of residues found in the interfaces of the three types of complexes. As a reference, we also provide the interface size for each (Figure 3.1G). Z-scores were chosen because they are more informative for features such as conservation and B-factor estimates, which vary between individual proteins and thus need to be normalized. Both MoRF and peptide interfaces have significantly higher averaged feature scores than globular interfaces for curvature, electrostatic surface potential, and conservation. The feature that best distinguishes IDR-partner interfaces from globular interfaces is curvature, suggesting that the former have more concave surfaces than the latter. Consistent with this interpretation, IDR-partner interfaces have significantly lower rASA and B-factor estimate values compared to globular interfaces, indicating that residues in IDR-partner interfaces tend to be more buried in those concave surface regions and are generally less dynamic. While these feature score distributions appear to place MoRF and peptide interfaces closer together, justifying the term IDR-partner interfaces, the overlap in all three classes is substantial. Furthermore, the un-normalized values of the same features (Figure 3.2) place MoRF in the middle of the three interface types or even closer to globular interfaces.
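To make the idea of an averaged interface Z-score concrete, the sketch below standardizes one per-residue feature within a single protein and then averages the standardized values over the interface residues. The exact calculation used for Figure 3.1 is given in Chapter 3.5; the choices made here (standardizing over all scored surface residues of the protein, using the population standard deviation, and the function and argument names) are assumptions for illustration only.

```python
from statistics import mean, pstdev

def interface_zscore(scores, interface):
    # scores: dict mapping residue id -> raw feature score for the
    #         scored (surface) residues of one protein.
    # interface: set of residue ids that belong to the interface.
    mu = mean(scores.values())
    sigma = pstdev(scores.values())  # assumption: population SD
    if sigma == 0:
        return 0.0  # feature is constant on this protein
    return mean((scores[r] - mu) / sigma for r in interface)
```

Standardizing within each protein removes protein-specific offsets (for example, an overall high conservation level), so that interfaces from different proteins can be compared on a common scale.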
Figure 3.1: Feature comparison for globular (blue), MoRF (yellow), and peptide (brown) interfaces. Compared are interface curvature (A), potential (B), conservation (C), rASA (D), B-factor estimate (E), negative hydrophobicity score (F), and size (G). Wilcoxon rank sum tests are used to calculate p-values, and significance is denoted by asterisks: *p < 0.05, **p < 0.01, ***p < 0.001.
Figure 3.2: Non-standardized scores of features for globular (blue), MoRF (yellow), and peptide (brown) interfaces. See Figure 3.1.
3.3.2 Classification of IDR-Partner and Globular Interfaces
Despite the significant overlap of the feature scores, we wondered whether it would be possible to develop a predictor that can distinguish IDR-partner from globular interfaces. Our tests on IDRBind had shown that it preferentially identifies MoRF and peptide interfaces, but it does also identify globular interfaces, albeit at a much lower sensitivity. Importantly, globular interfaces were not part of the negative training set of IDRBind, and thus IDRBind was not explicitly trained not to recognize globular interfaces. Assuming there is sufficient distinction between MoRF and globular interfaces, we aimed to develop an additional predictor that distinguishes globular and IDR-partner interfaces by using interface feature scores. To train a predictor that classifies IDR-partner versus globular interfaces, we used MoRF-train, which is the MoRF dataset that we used to train IDRBind, in addition to a new non-redundant globular interface dataset. From MoRF-train, we selected only interface residues for the positive training set of this new predictor. The negative training set consists of globular interfaces from non-redundant complex structures.
This negative training set of 58 structures (3DC) was assembled by selecting 100 structures from the 3DComplex database and removing proteins with sequence similarity to MoRF-train, MoRF-test, and DB5 (see Methods for details). Although our scope of interest spans the range of IDR-partner interfaces, training was done only on MoRF and globular interfaces to be consistent with previous analyses and to reserve our single peptide dataset for testing. The model, named MoRFint, is a gradient boosted trees model built using the XGBoost library in R. The per-residue feature scores were aggregated over each protein interface, for example by taking the average, to obtain interface feature scores (see Methods). Five of the resulting interface feature scores were selected based on minimum redundancy and maximum relevance (216). The MoRFint model with five selected features was trained and optimized through the same procedure used for IDRBind (see Chapter 2.5.5). The interfaces of the MoRF-test and DB5 datasets, i.e., the testing datasets of IDRBind, are the positive and negative classes for testing MoRFint. The receiver operating characteristic (ROC) curve was plotted to measure the performance of MoRFint on separating MoRF and globular interfaces, resulting in an area under the curve (AUC) of 0.767, which is middling performance compared to AUCs of 0.5 and 1.0 expected from random and perfect classifiers, respectively (Figure 3.3A). The importance plot created by XGBoost illustrates the relative contributions of individual features to the MoRFint model (Figure 3.3B). According to the importance plot, the interface feature score derived from surface curvature leads in relative contribution by a large margin. Among the selected interface features are also modified scores of residue composition (principal component 5 of amino acid indices of physicochemical properties, i.e., PCA5), variance in B-factor estimates, and surface patterns 2 and 5.
Surface patterns 2 and 5 were features engineered for searching for protein surface patterns in conservation score and groove score, which is a score for identifying concave surface pockets (see Chapter 2.5.4). The surface pattern scores are thus based on spatial data that are relatively orthogonal to other features, which may explain why they were selected here. More importantly, because curvature was by far the most important interface feature, we used a ROC curve to measure the ability of the curvature score to separate MoRF and globular interfaces (Figure 3.3C left). The resulting AUC of 0.7 confirms that the curvature of the interface accounts for most but not all of the performance of MoRFint (AUC 0.767). The ROC curve of curvature score was also plotted for the classification of peptide versus globular interfaces, i.e., using PEP dataset interfaces as the positive class (Figure 3.3C right). Curvature alone was enough to classify peptide versus globular interfaces with an AUC of 0.78. Interface curvature is the main distinguishing interface feature between IDR-partner and globular interfaces, and peptide interfaces are at the high end of the curvature spectrum, making peptide interfaces the most discernible (Figure 3.3 B and C).
Figure 3.3: Classification of binding sites of IDRs versus globular domains. A) ROC curve of MoRFint using MoRF-test and DB5 as positive and negative data sets, respectively. B) Importance plot of MoRFint. The contribution (x-axis) of each feature (y-axis) used by MoRFint is reported. C) ROC curves generated by using the averaged curvature score only. The averaged curvature score is the top-ranking score in the importance plot (B). On the left is a ROC curve calculated by using MoRF-test and DB5 as positive and negative data sets, respectively. On the right is a ROC curve calculated by using PEP and DB5 as positive and negative data sets, respectively.
D) Density plot of the MoRFint score distributions of MoRF-test, PEP, and DB5 datasets in yellow, orange, and blue, respectively.
To get a better understanding of MoRFint’s modest performance, we created a density plot of the distribution of MoRFint prediction scores for the three types of interfaces: MoRF (MoRF-test), peptide (PEP), and globular (DB5) interfaces (Figure 3.3D). If MoRFint were a more accurate classifier of IDR-partner and globular interfaces, IDR-partner interfaces would all have high scores, and their score distribution would be heavily skewed to the right of the density plot. Conversely, globular interfaces would ideally all have low scores with a score distribution skewed to the left. Instead, MoRFint’s score distributions for MoRF-test, PEP, and DB5 strongly overlap, illustrating the model’s modest classification power. However, peptide interfaces typically have the highest MoRFint scores, even though MoRFint was not trained on peptide interfaces. MoRFint scores of MoRF interfaces are distributed between peptide and globular interfaces, which is consistent with the analysis of interface feature scores (Figure 3.1 and Figure 3.2). Moreover, the score distributions suggest that MoRF interfaces are more similar to peptide than globular interfaces, justifying the merger of the two categories under IDR-partner interfaces. Nonetheless, the overlap between IDR-partner and globular interfaces underscores the difficulty in distinguishing the two classes with the features we have tested. Having developed a method for classifying IDR-partner versus globular interfaces, we wondered whether it could be used to refine predictions made by IDRBind, our previously developed method for identifying binding sites of IDRs. Globular interfaces identified by IDRBind are unlikely to be true IDR-partner interfaces, which means they are false positives that would ideally be classified as globular interfaces by MoRFint and be removed.
If successful, combining the methods would increase the specificity of the IDR-partner interface prediction. To test this idea, we first used IDRBind to predict interfaces on MoRF-test and DB5. The predicted interfaces were then scored using MoRFint, i.e., the predicted interfaces from MoRF-test were the positive class while the predicted interfaces from DB5 were the negative class. Unfortunately, the MoRFint scores were unable to separate the predicted interfaces on the two datasets (AUC ~ 0.5), meaning that MoRFint cannot be used to improve the specificity of IDRBind predictions to IDR-partner interfaces. However, this result could also mean that at least some interfaces predicted by IDRBind in the DB5 dataset may have IDR-binding characteristics. MoRFint provides higher scores for interfaces with stronger IDR-binding characteristics (Figure 3.3D), so MoRFint scores might correlate with the performance of IDR-partner interface predictors, which perform better on IDR-partner interfaces than globular interfaces. In other words, interfaces with high MoRFint scores are expected to be more accurately predicted by IDR-partner interface predictors. To get a set of interfaces with a broad range of MoRFint scores that span the range of IDR-binding characteristics, we first combined MoRF-test and PEP datasets. In addition to IDRBind, a peptide interface predictor, ACCLUSTER, was also tested. ACCLUSTER was chosen because it is a chemistry-based predictor that does not rely on machine learning and feature scores, which were used to create MoRFint and IDRBind. For each protein complex structure in MoRF-test and PEP, prediction performance was calculated based on the correct classification of the interface and non-interface residues using the Matthews correlation coefficient (MCC), a performance measure ranging from −1 to 1, with 1 indicating perfect prediction and 0 indicating performance no better than random.
Despite the large variance of prediction performance across individual protein structures, the MCCs of ACCLUSTER correlate with MoRFint scores of the protein interfaces (Pearson correlation = 0.36; p-value = 0.008). Therefore, ACCLUSTER predicts interfaces more accurately on proteins with higher MoRFint scores. IDRBind performance also appears to correlate with MoRFint scores, but the correlation is weaker, and statistical significance is only seen after including DB5 structures, which have much lower MoRFint scores (MoRF-test and PEP: Pearson correlation = 0.23, p-value = 0.09; MoRF-test, PEP, and DB5: Pearson correlation = 0.22, p-value = 8 × 10⁻⁶). Therefore, interfaces with stronger IDR-binding characteristics are easier for both ACCLUSTER and IDRBind to predict. Put differently, the performance trends of IDR-partner interface predictors provide supporting evidence for the differentiated but overlapping distributions of IDR-partner and globular interfaces in a spectrum of IDR-binding character.
3.3.3 Analysis of Mutation Localization on Protein Structures
For the comparison of IDR-partner and globular interfaces, an orthogonal strategy to the analyses of structural properties is to investigate their enrichment for mutations. Similar to conservation scores, the concentration of disease mutations reflects the functional importance of protein regions. However, unlike the conservation scores calculated based on sequence homologs across species, these mutations are sequence variants in matching proteins from many human samples, meaning stronger correlations to protein function and implications for human health. Because our previous datasets included many species and were designed to be non-redundant for machine learning purposes, new datasets of IDR and globular interactions with only human proteins were defined for the work presented below.
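The correlation analysis at the end of Section 3.3.2 pairs the MCC of each structure with the MoRFint score of its interface. A minimal sketch of the Pearson correlation coefficient for such paired values is given below; the function name and the inputs are hypothetical, and the p-values reported in the text would additionally require a significance test (for example, a t-test on the coefficient), which is not shown.

```python
import math

def pearson(xs, ys):
    # xs, ys: equal-length sequences, e.g. per-structure MoRFint scores
    # and per-structure MCCs of a predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

A coefficient near 0.2 to 0.4, as reported here, indicates a real but weak linear trend, which is consistent with the large per-structure variance in prediction performance.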
Protein complexes used here were experimentally determined structures from the Protein Data Bank (PDB), and the IDR interactions were defined by either curated data or sequence prediction. Each interaction consists of a pair of human proteins, and it was labeled as an IDR interaction if at least one of the protein chains was labeled as an IDR. A dataset of curated IDR interactions was extracted from the MobiDB database (see Methods) because human proteins in MoRF-train and MoRF-test were insufficient in number and limited to MoRFs. To create a second, larger, but lower-confidence IDR interaction dataset, the sequence-based IDR predictor ESpritz was used. When a protein sequence predicted by ESpritz to be disordered was found in a protein complex deposited in the PDB, it was labeled as an interacting IDR (see Methods). Finally, the globular interaction dataset consists of all human protein complexes deposited in the PDB, which is justifiable because the total number of interaction structures is much higher than either of the two IDR interaction datasets. Furthermore, the DB5 and 3DC datasets used in the previous sections for interface attribute characterization also made similar approximations since globular structures dominate the PDB. In summary, the three datasets are the globular, MobiDB, and ESpritz interaction datasets, with the latter two representing high and low confidence IDR interactions. Protein residues were divided into commonly defined structural regions known to have different physicochemical and functional characteristics. Specifically, residues were divided into surface, buried, and interface categories using rASA thresholds defined in an article by Levi (46). Interface residues were further divided into core and rim, where rim residues remain solvent-exposed in the complex structure while core residues become buried (see Methods). Remaining protein residues that are outside the target structural regions were labeled as external.
For example, the external region of a protein with an interacting IDR, i.e., an IDR observed in PDB complexes, contains all residues outside the interacting IDR. Figure 3.4 illustrates the protein regions defined for this study.

Figure 3.4: Protein regions in mutation localization analysis. The interface region consists of rim and core regions. The structure region consists of surface, buried, and interface regions. However, the buried regions of the IDR datasets are essentially too small to analyze. The region outside the structure region is labeled the external region. Notably, the structure regions of the IDR and IDR-partner datasets contain only protein structures defined as IDR interaction complexes.

To identify enrichments for mutations in the different protein regions that we defined, we integrated the structural information with recently collected human SNV and disease mutation data: missense single-nucleotide variants from the Genome Aggregation Database (gnomAD) (198), germline disease-annotated mutations from the SwissVar database (14), and somatic cancer mutations from the COSMIC database (cancer.sanger.ac.uk) (212). The mutation data and structural data were merged by mapping them jointly to UniProt protein sequences, which allowed the comparison of mutation densities in the different protein regions. Summaries of the datasets are provided in Table 3.1. For instance, a total of 8726 mutated positions from SwissVar were mapped to the globular dataset, covering proteins with a total length of 626,900 residues, 195,784 of which have structural data. The total number of residues in the globular dataset varies between mutation sets because proteins with no mutation data were excluded from analysis (see Methods). Notably, the interacting IDRs defined in the MobiDB dataset consist of only 2541 residues, which is a very small subset of the globular dataset.

Table 3.1: Datasets of interaction structures with mutation mapping.
Each cell gives the number of residues / the number of mutated positions.

A) SwissVar
Region | Globular | MobiDB IDR-partner | MobiDB IDR | ESpritz IDR-partner | ESpritz IDR
External | 431116 / 4617 | 59959 / 514 | 103102 / 1407 | 90964 / 809 | 204062 / 2655
Buried | 75858 / 1947 | 8495 / 272 | 14 / 0 | 20297 / 490 | 92 / 6
Surface | 77691 / 1063 | 9576 / 152 | 665 / 12 | 22052 / 251 | 1657 / 32
Rim | 20182 / 401 | 1297 / 23 | 1291 / 31 | 2530 / 39 | 2839 / 57
Core | 22053 / 698 | 1704 / 64 | 571 / 21 | 3171 / 100 | 1313 / 49
Total | 626900 / 8726 | 81031 / 1025 | 105643 / 1471 | 139014 / 1689 | 209963 / 2799
Structure | 195784 / 4109 | 21072 / 511 | 2541 / 64 | 48050 / 880 | 5901 / 144
Interface | 42235 / 1099 | 3001 / 87 | 1862 / 52 | 5701 / 139 | 4152 / 106

B) COSMIC
Region | Globular | MobiDB IDR-partner | MobiDB IDR | ESpritz IDR-partner | ESpritz IDR
External | 754085 / 19640 | 133776 / 6203 | 185189 / 9210 | 201314 / 7227 | 373370 / 11618
Buried | 114670 / 3890 | 15336 / 1022 | 25 / 1 | 38395 / 1414 | 150 / 11
Surface | 122355 / 4384 | 17909 / 1101 | 877 / 55 | 42461 / 1503 | 2389 / 99
Rim | 31958 / 1305 | 2518 / 156 | 2351 / 172 | 4973 / 221 | 4774 / 209
Core | 34001 / 1299 | 3185 / 199 | 1047 / 81 | 6370 / 275 | 2289 / 114
Total | 1057069 / 30518 | 172724 / 8681 | 189489 / 9519 | 293513 / 10640 | 382972 / 12051
Structure | 302984 / 10878 | 38948 / 2478 | 4300 / 309 | 92199 / 3413 | 9602 / 433
Interface | 65959 / 2604 | 5703 / 355 | 3398 / 253 | 11343 / 496 | 7063 / 323

C) gnomAD high-frequency
Region | Globular | MobiDB IDR-partner | MobiDB IDR | ESpritz IDR-partner | ESpritz IDR
External | 1039686 / 5928 | 176508 / 838 | 250811 / 1112 | 275715 / 1357 | 558479 / 2646
Buried | 197765 / 532 | 25172 / 52 | 40 / 0 | 63984 / 146 | 348 / 1
Surface | 214620 / 1036 | 30011 / 120 | 1545 / 6 | 74028 / 289 | 5898 / 22
Rim | 64266 / 242 | 4377 / 16 | 3623 / 12 | 9270 / 35 | 10069 / 38
Core | 67589 / 242 | 5474 / 26 | 1626 / 4 | 11334 / 47 | 4892 / 12
Total | 1583926 / 7980 | 241542 / 1052 | 257645 / 1134 | 434331 / 1874 | 579686 / 2719
Structure | 544240 / 2052 | 65034 / 214 | 6834 / 22 | 158616 / 517 | 21207 / 73
Interface | 131855 / 484 | 9851 / 42 | 5249 / 16 | 20604 / 82 | 14961 / 50

D) gnomAD low-frequency
Region | Globular | MobiDB IDR-partner | MobiDB IDR | ESpritz IDR-partner | ESpritz IDR
External | 1039686 / 377492 | 176508 / 60168 | 250811 / 82054 | 275715 / 94038 | 558479 / 186471
Buried | 197765 / 55082 | 25172 / 5793 | 40 / 9 | 63984 / 16014 | 348 / 80
Surface | 214620 / 73631 | 30011 / 9238 | 1545 / 557 | 74028 / 23227 | 5898 / 1934
Rim | 64266 / 20648 | 4377 / 1078 | 3623 / 1260 | 9270 / 2630 | 10069 / 3445
Core | 67589 / 18146 | 5474 / 1017 | 1626 / 464 | 11334 / 2492 | 4892 / 1388
Total | 1583926 / 544999 | 241542 / 77294 | 257645 / 84344 | 434331 / 138401 | 579686 / 193318
Structure | 544240 / 167507 | 65034 / 17126 | 6834 / 2290 | 158616 / 44363 | 21207 / 6847
Interface | 131855 / 38794 | 9851 / 2095 | 5249 / 1724 | 20604 / 5122 | 14961 / 4833

The frequencies of mutations in the different protein regions were compared using odds ratios, closely replicating comparisons previously carried out by David and Sternberg for globular proteins (130). More precisely, the probability of a mutated position within a protein region was estimated as the observed number of mutated positions divided by the length of the region. The odds of mutation were then calculated by dividing the probability of mutation by the probability of no mutation (see Methods for equations). The division of the odds of mutation of two regions, i.e., the ratio of odds or odds ratio (OR), is a metric for the enrichment (OR > 1) or depletion (OR < 1) of mutations in the first/numerator region. Two types of ORs were calculated. Firstly, ORs were calculated by comparing specific protein regions with full-length proteins. For example, the OR of the interacting IDR core measures the enrichment of mutations in the core region of IDRs involved in interactions relative to the entire sequence of the IDR-containing proteins. Secondly, ORs were also calculated using the odds of mutations in equivalent regions of different datasets, which we will differentiate as odds ratio between regions (ORR).
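The region-versus-full-length OR described above can be sketched in a few lines of Python. This is a minimal illustration that follows only the prose definition; the equations in Methods are not reproduced here, and the function and variable names are hypothetical. The counts are the globular SwissVar entries of Table 3.1.

```python
def odds(mutated, residues):
    # Probability of a mutated position, then odds = p / (1 - p).
    p = mutated / residues
    return p / (1.0 - p)


def odds_ratio(region, reference):
    # OR of a region relative to a reference (here: the full-length proteins).
    return odds(*region) / odds(*reference)


# (mutated positions, residues) for the globular dataset, SwissVar panel of Table 3.1.
GLOBULAR_SWISSVAR = {
    'total': (8726, 626900),
    'structure': (4109, 195784),
    'buried': (1947, 75858),
    'core': (698, 22053),
    'rim': (401, 20182),
}

for region in ('structure', 'buried', 'core', 'rim'):
    ratio = odds_ratio(GLOBULAR_SWISSVAR[region], GLOBULAR_SWISSVAR['total'])
    print(region, round(ratio, 1))
```

Under these assumptions the sketch yields ORs of about 1.5 (structure), 1.9 (buried), 2.3 (core), and 1.4 (rim) for the globular dataset.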
For example, ORR was calculated using the odds of mutations in the interacting IDR interface core and the globular interface core to see which interaction region has a higher likelihood of mutation.

3.3.4 SNVs Are Depleted in IDR Interactions.

SNVs from gnomAD are mutations observed in a population of healthy individuals, so SNVs are typically not directly associated with diseases. David and Sternberg studied non-disease-associated SNVs annotated in UniProt and showed that these variants are depleted from functionally critical regions (130). Specifically, they revealed enrichment in the rim and surface regions and depletion in the buried and interface core regions. In addition, gnomAD SNVs come from large-scale genome sequencing projects, which allows the study of rare SNVs that were previously not detectable. Therefore, we divided the SNV dataset into high-frequency, i.e., frequency between 0.1 and 0.001, and low-frequency, i.e., frequency between 0.001 and 0.000001. High-frequency SNVs are most likely benign. In contrast, studies have suggested that SNVs with very low frequencies can have deleterious effects (205, 217). We first used the gnomAD data to calculate the odds ratios (ORs) for finding low-frequency SNVs in different protein regions of the MobiDB and ESpritz interaction datasets, using the globular set as a point of reference (Figure 3.5). The ORs indicate that these low-frequency SNVs are depleted in structured parts of proteins. This depletion holds for globular structures in general (blue bars) as well as the subset of globular structures that bind to IDRs (IDR-partners; orange bars). Looking at different regions of the structures, we find, consistent with David and Sternberg’s results, that low-frequency SNVs are depleted from the buried and interface core regions of globular proteins relative to the full-length protein. Compared to globular proteins, the IDR-partner ORs show similar trends.
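The frequency split used in this section can be sketched as a small helper. The two ranges are the ones stated in the text; how the shared boundary at 0.001 is assigned is not specified there, so placing it in the high-frequency bin is an assumption, and the names are hypothetical.

```python
# Allele-frequency ranges as stated in the text.
HIGH_FREQ_RANGE = (0.001, 0.1)
LOW_FREQ_RANGE = (0.000001, 0.001)


def frequency_class(allele_frequency):
    # Assumption: the shared boundary (0.001) goes to the high-frequency bin.
    high_lo, high_hi = HIGH_FREQ_RANGE
    low_lo, low_hi = LOW_FREQ_RANGE
    if high_lo <= allele_frequency <= high_hi:
        return 'high'
    if low_lo <= allele_frequency < low_hi:
        return 'low'
    # Variants outside both ranges are not assigned to either set.
    return None
```

For example, `frequency_class(0.05)` returns `'high'`, `frequency_class(0.0001)` returns `'low'`, and a variant with frequency 0.5 falls outside both sets.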
Particularly depleted of low-frequency SNVs is the core region of IDR-partners (MobiDB OR = 0.5; ESpritz OR = 0.6). This finding contrasts with the more moderate depletion of low-frequency SNVs in the core of the interacting IDRs (MobiDB and ESpritz OR = 0.8). However, interacting IDRs as a whole are not depleted in low-frequency SNVs (MobiDB and ESpritz OR = 1.0; Figure 3.5 Structure region). Together, these findings highlight the functional importance of the core residues in both IDR-partners and interacting IDRs themselves.

Figure 3.5: Odds ratios of low-frequency SNVs in protein regions over all residues. The bar graphs plot the odds ratios (OR; Y-axis) of each protein region (X-axis). Each OR is the odds of mutation in the specific region divided by the odds of the full-length parent proteins. The Y-axis is centered at one, and ORs > 1 show enrichment while ORs < 1 show depletion. The IDR-partner (orange) and interacting IDR (yellow) are from the A) MobiDB dataset and B) ESpritz dataset. The globular dataset (blue) is the same in both panels.

Next, we assessed the enrichment/depletion of low-frequency SNVs when comparing the same protein regions in different datasets. Specifically, an interface region (core or rim) was selected (column 1 in Table 3.2), and the ORR was calculated by dividing the odds of mutation for that region in the IDR-partner or interacting IDR (column 2 in Table 3.2) by the odds of the same region in the globular set. ORRs between core and rim regions of IDR-partners and globular complexes show that the IDR-partner interface is more depleted of low-frequency SNVs (Table 3.2). In other words, the interface cores of IDR-partners are significantly more depleted in low-frequency SNVs than the interface cores of globular proteins in general (MobiDB ORR = 0.6; ESpritz ORR = 0.8). Similarly, a more pronounced depletion of low-frequency SNVs was also found for the rim of IDR-partners.
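As a worked illustration of the ORR, the sketch below divides the odds of a low-frequency SNV in the IDR-partner interface core by the odds in the globular interface core, using the gnomAD low-frequency counts of Table 3.1. It follows only the prose definition; the names are hypothetical, and the significance testing behind the adjusted p-values is not covered.

```python
def odds(mutated, residues):
    # Odds = mutated positions over non-mutated positions, equivalent to p / (1 - p).
    return mutated / (residues - mutated)


def odds_ratio_between_regions(region, globular_region):
    # ORR: the same structural region compared across two datasets.
    return odds(*region) / odds(*globular_region)


# (mutated positions, residues) for the interface core, gnomAD low-frequency panel.
LOW_FREQ_CORE = {
    'globular': (18146, 67589),
    'mobidb_idr_partner': (1017, 5474),
    'espritz_idr_partner': (2492, 11334),
}

for name in ('mobidb_idr_partner', 'espritz_idr_partner'):
    orr = odds_ratio_between_regions(LOW_FREQ_CORE[name], LOW_FREQ_CORE['globular'])
    print(name, round(orr, 1))
```

With these counts the sketch gives ORRs of about 0.6 (MobiDB) and 0.8 (ESpritz) for the IDR-partner core.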
In contrast, the core and rim regions of the interacting IDR are slightly enriched in low-frequency SNVs when compared to the globular dataset, although not significantly so for the interface core. It is worth noting that the trends observed in both MobiDB and ESpritz datasets are generally in agreement, with the MobiDB interaction dataset regions typically exhibiting stronger enrichment and depletion.

Table 3.2: Odds ratios of low-frequency SNV in IDR interaction proteins over globular proteins.

In the next step, we calculated the odds for common variants. Common SNVs are typically considered benign (218). While common SNVs are often defined as those above frequency 0.01, the shortage of data limited our analysis to a frequency range of 0.1 to 0.001, which we refer to as high-frequency SNVs. Even so, high-frequency SNVs are relatively scarce at the core regions of IDR interactions, limiting the statistical significance of the corresponding ORs. Similar to the low-frequency variants, high-frequency SNVs are also depleted in the structured parts of proteins, particularly among the buried residues (Figure 3.6). Also, the core and rim residues of globular interfaces are depleted in common variants. Interestingly, not only the core but also the rim residues of interacting IDRs are depleted in high-frequency SNVs, in contrast to the core of the IDR-partner, which is enriched in common variants. However, calculations of ORR revealed that none of the regional differences in high-frequency SNV occurrences are significant when compared across datasets (Table 3.3). Comparisons of high-frequency SNV enrichment between globular and IDR-partner or interacting IDR interactions have to be interpreted with care due to the limited size of the data. Fortunately, more evidence can be gathered from annotated disease mutations, which are expected to be enriched in the functionally important regions, providing contrast with SNV localization.
Figure 3.6: Odds ratios of high-frequency SNVs in protein regions over all residues. See Figure 3.5 for details.

Table 3.3: Odds ratios of high-frequency SNV in IDR interaction proteins over globular proteins.

3.3.5 Disease Mutations Are Enriched in IDR Interfaces.

Previous studies have found that annotated disease mutations are enriched in the buried and interface core regions of globular complexes (11, 130, 219). We also analyzed the enrichment of known disease mutations in IDR and globular interfaces to confirm previous findings and reveal differences for IDR interfaces. We first used mutations deposited and annotated as disease-causing in the SwissVar database (212). In agreement with previous observations, our globular dataset also shows enrichment of SwissVar disease mutations in the globular buried region (OR = 1.9; Figure 3.7). Buried residues of globular domains are typically more critical to the structure and stability of the protein, so the substitution of these residues will more likely disrupt function and lead to disease. In comparison, enrichment is weaker for the broader structure region of the globular dataset (OR = 1.5), which includes the less vital surface residues. The globular interface core has a higher OR than the buried region (OR = 2.3), whereas the rim has a much lower OR of 1.4, which indicates a similar level of mutation propensity to the structure region. Mutations located in PPI interfaces often specifically disrupt interactions; such amino acid substitutions are called edgetic mutations (11). By dividing the interface into core and rim regions, we see that edgetic mutations are more likely to hit the interface core.

Figure 3.7: Odds ratios of SwissVar mutations in protein regions over all residues. See Figure 3.5 for more detail.

Similar to the globular interface core, the high ORs in the interacting IDR and IDR-partner core suggest these residues are not tolerant of mutations (Figure 3.7).
The enrichment of disease mutations in IDR-partner interface core regions appears even higher than in the globular dataset (MobiDB OR = 3.0; ESpritz OR = 2.6). The disease mutation ORs of interacting IDR interface core regions are equally high (MobiDB OR = 2.9; ESpritz OR = 2.7). This analysis of disease mutations from SwissVar appears to suggest that the interface cores of interacting IDRs and IDR-partners are very susceptible to disruption by amino acid substitutions, which is contrary to our hypothesis of a malleable IDR interaction interface. To more directly compare IDR and globular interactions, ORRs were calculated. When comparing interfaces of IDR and globular interactions, a similar level of enrichment for disease mutations is found. ORRs of interface core regions of both IDR-partner and interacting IDR relative to the globular dataset indicate that disease mutations are enriched in IDR interactions at least as strongly as in globular interactions (Table 3.4). Furthermore, most ORRs in Table 3.4 are close to one, indicating that the proteins in the IDR interaction datasets are similar to the globular dataset in terms of disease mutation localization. Therefore, these results oppose the hypothesis that interactions between IDRs and IDR-partners are more malleable and thus able to accommodate mutations without adverse effects on function. For additional evidence, we also performed a comparison using cancer mutations from the COSMIC database.

Table 3.4: Odds ratios of SwissVar mutations in IDR interaction proteins over globular proteins.

Similar to SwissVar mutations, COSMIC mutations preferentially localize to structure regions, although the enrichment is weaker compared to disease-associated mutations (122). Moreover, previous studies have also shown that cancer mutations have a distinct tendency to occur more often in protein interfaces than in the buried regions (219, 220).
Our integration of the COSMIC data confirms that cancer mutations are enriched in buried as well as core and rim interface residues of the globular dataset (Figure 3.8). ORs closer to one indicate weaker enrichment patterns compared to SwissVar mutations, especially in the ESpritz IDR-partner dataset, which could be attributed to the presence of passenger cancer mutations in the COSMIC dataset. Notably, the globular buried region is not the most enriched, which is the result of more mutations residing on the protein surface and is in agreement with previous studies (209, 220). Moreover, the globular rim contains relatively high amounts of COSMIC mutations, and the rim regions of both MobiDB and ESpritz IDR-partners also show enrichment levels as high as their core regions. Importantly, the highest ORs are observed in the interacting IDR core (MobiDB and ESpritz OR = 1.6). The enrichment in the interacting IDR core is in contrast with the known depletion of COSMIC mutations in general IDRs (122). However, the ORs of the ESpritz IDR-partner regions are very close to one, which is incongruent with previous trends, but direct comparisons with the globular dataset could clarify these results.

Figure 3.8: Odds ratios of COSMIC mutations in protein regions over all residues. See Figure 3.5 for details.

Interestingly, proteins in our IDR interaction datasets are enriched in COSMIC mutations compared to the globular dataset across all regions. COSMIC mutation ORRs of core and rim regions of interacting IDR and IDR-partner are above one (Table 3.5). IDR-partners appear to have large amounts of COSMIC mutations across the entire protein sequence, with the MobiDB and ESpritz IDR-partner full-length proteins having odds of mutation 1.78 and 1.27 times those of the globular set, respectively. That is why IDR-partner core regions have low ORs relative to the protein but maintain higher enrichment relative to the globular core.
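The relationship between the two measures can be written out explicitly. Writing O for the odds of mutation, the ORR of a region factors into the ratio of the two region-level ORs multiplied by the ratio of the full-length odds. The numerical values below are not reported in the text (apart from 1.78); they were recomputed from the COSMIC counts in Table 3.1 for the MobiDB IDR-partner and globular interface cores, assuming the ORs are computed exactly as in the prose definition, and the notation is introduced here for illustration only.

```latex
\mathrm{ORR}_{\mathrm{core}}
  = \frac{O_{\mathrm{core}}^{\mathrm{partner}}}{O_{\mathrm{core}}^{\mathrm{glob}}}
  = \frac{O_{\mathrm{core}}^{\mathrm{partner}} / O_{\mathrm{full}}^{\mathrm{partner}}}
         {O_{\mathrm{core}}^{\mathrm{glob}} / O_{\mathrm{full}}^{\mathrm{glob}}}
    \times \frac{O_{\mathrm{full}}^{\mathrm{partner}}}{O_{\mathrm{full}}^{\mathrm{glob}}}
  = \frac{\mathrm{OR}_{\mathrm{core}}^{\mathrm{partner}}}{\mathrm{OR}_{\mathrm{core}}^{\mathrm{glob}}}
    \times \frac{O_{\mathrm{full}}^{\mathrm{partner}}}{O_{\mathrm{full}}^{\mathrm{glob}}}
  \approx \frac{1.26}{1.34} \times 1.78 \approx 1.7
```

A core OR that is slightly lower than the globular one (about 1.26 versus 1.34) is therefore compatible with an ORR well above one, because the full-length odds of the IDR-partner proteins are themselves 1.78 times those of the globular set.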
The ORRs of MobiDB IDR interaction regions have significant adjusted p-values, but most of the ORRs from the ESpritz interaction regions are not significant. The difference between the two datasets could be an indication of study bias in the proteins with MobiDB annotations, or it could also stem from biases in IDR prediction. Nonetheless, the overall trend of cancer mutation enrichment in IDR interactions agrees with the known correlation between cancer-associated proteins and the presence of IDRs (124, 221).

Table 3.5: Odds ratios of COSMIC mutations in IDR interaction proteins over globular proteins.

3.3.6 Predicted Structural Regions Show Enrichment of Disease Mutations

Our integration of structural and mutation data confirmed the functional importance of IDR-partner interface residues, so we wondered if the IDR-partner interfaces predicted by our predictor IDRBind exhibit the same enrichment. To answer this question, IDRBind was used to predict IDR-partner interfaces on a non-redundant dataset of PDB protein structures (see Methods). The structural data, together with the predicted interfaces, were mapped to UniProt sequences and combined with SwissVar and COSMIC mutation data. The SwissVar mutation OR for the IDRBind interface core is higher than that for the rim (IDRBind core OR = 2.3; Figure 3.9A), in contrast to the very similar core and rim ORs calculated with COSMIC mutations (IDRBind core and rim OR = 1.5; Figure 3.9B). These trends closely match the enrichment patterns in the globular interfaces. In particular, the enrichment of SwissVar mutations in the interface core is approximately the same for the predicted IDR-partner and globular datasets. The COSMIC mutation enrichment is stronger in the IDRBind-predicted core and rim regions compared to the globular dataset, and the rim regions of both datasets are characteristically high in COSMIC mutation enrichment.
Unfortunately, IDRBind interface regions did not exactly replicate the levels of enrichment observed for the MobiDB IDR-partner interfaces. Protein interface predictors such as IDRBind identify putative interfaces on a given structure, so we are assuming the presence of IDR-binding sites for all globular protein structures in this crude analysis. Consequently, the differences between IDRBind-predicted interfaces and true IDR-partner interfaces are not surprising. Nevertheless, the high propensities for disease mutations in IDRBind-predicted core and rim regions do demonstrate their stronger functional importance and thus the usefulness of IDRBind for bioinformatics analyses.

Figure 3.9: Odds ratios of mutations in predicted protein regions over all residues. A) Disease mutations from SwissVar. B) Cancer mutations from COSMIC. The IDR-partner ORs are based on the MobiDB dataset. See Figure 3.5 for more details.

As an additional control, we tested Pfam domain predictions across the human sequences of UniProt, which should be comparable to the structure region of the globular dataset. The OR of SwissVar disease mutations for Pfam domains is very similar to that of the structure region of the globular dataset (Figure 3.9A). However, the enrichment of COSMIC mutations is lower in the Pfam domains (Figure 3.9B). The Pfam dataset is much bigger than the structure-based datasets, and thus it provides a conservative estimate for the enrichment of disease mutations in globular domains.

3.4 Discussion

IDR interactions are recognized not only for their critical role in cellular function and disease but also for their differences in molecular mechanisms compared to the classical globular interactions.
The characteristic flexibility and binding mechanism of IDRs are accompanied by differences across multiple sequence and structural properties in IDR-partner interfaces compared to globular interfaces, with more significant differentiation observed in partner interfaces of peptides than MoRFs. It is reasonable to assume that structural properties of protein interfaces will affect the susceptibility of interface residues to disease mutations and tolerance to SNVs. Nonetheless, we report evidence that suggests IDR interactions consist of conserved and structurally optimized residues that are just as enriched in disease mutations as globular interactions. Higher sensitivity to mutations could also be a consequence of the functional roles of the IDR interactions, which are often transient and specific interactions involved in signaling and regulation (9). The abundance of some proteins with long IDRs is under tight regulation (222), which would imply higher sensitivity to changes in binding affinity as well. However, the enrichment of high-frequency SNVs in the IDR-partner interface points to a possible greater tolerance to mutations despite the functional importance of these residues.

Our IDR-partner interface predictor, IDRBind, has shown specificity for IDR-partner interfaces, especially peptide interfaces (Chapter 2). Therefore, we have investigated the prediction model’s feature scores to identify properties that favor identification of IDR-partner interfaces over globular interfaces. The most distinguishing feature scores for IDR-partner interfaces are their surface curvature and rASA, both of which indicate that IDRs bind to concave surfaces (i.e., binding pockets and grooves). Binding pockets have been suggested to provide surfaces onto which peptides can latch, enhancing interface packing and increasing binding free energy (104).
The distinct feature distributions present in peptide, MoRF, and globular interfaces may be, in part, explained by the distinct thermodynamic and kinetic aspects of binding of these protein regions. Being intrinsically disordered, both MoRFs and peptides incur a higher average cost in entropy for binding than globular domains. Likely compensating for the entropy cost, MoRF and peptide interfaces show tendencies to consist of residues that are more hydrophobic, buried, and rigid, which are properties that thermodynamically favor binding (104). Interacting IDRs are also noted for their central role in many signaling pathways. Notably, strong electrostatic interactions are associated with regulatory proteins (223). The electrostatic potential is marginally stronger at MoRF and peptide interfaces of the unbound partner structures, as measured by the electrostatic potential feature score (Figure 3.1B). Strikingly, the electrostatic contributions to binding depend on the complementarity between the binding proteins, yet the electrostatic potential on the IDR-partner surface alone was strong enough to make significant contributions to the IDRBind model. Regarding IDRs, this long-range force is proposed to promote fast on and off kinetics beneficial to signaling and regulatory switches (224).

Peptide and MoRF interfaces are overlapping subcategories of IDR-partner interfaces, which are generally differentiable from globular interfaces. Peptides are short sequences that typically interact through conserved motifs, and they tend to contribute greater interaction surface area per IDR residue compared to MoRFs (104), which they can achieve through embedding deeply into binding pockets in extended conformations. On the other hand, MoRFs may adopt more sophisticated folds and bind via anchoring, linking, and preformed elements (133) to IDR-partner surfaces that are less concave on average (Figure 3.1).
This diminishes the IDR-binding properties of MoRF interfaces relative to peptide interfaces, making MoRF interfaces more similar to globular interfaces in size and surface geometry. As evidence, the MoRFint classifier that was trained on MoRF and globular interfaces showed only modest performance in separating MoRF interfaces, but MoRFint is much better at differentiating peptide and globular interfaces (PEP versus DB5 AUC = 0.84). Importantly, MoRF interfaces do appear to be closer in characteristics to peptide interfaces than globular interfaces, as demonstrated by their MoRFint score distributions. MoRFint is a relatively rudimentary model that relied on only 54 examples of MoRF interfaces for training. Consequently, the differentiation between IDR-partner and globular interfaces shown by MoRFint is likely a conservative estimate. Additionally, examples of long MoRFs are uncommon in the PDB (133), so examining IDR interactions as a broader group is justifiable while keeping in mind that MoRF interfaces bear some resemblance to globular interfaces in terms of structure.

Since IDR-partner and globular interfaces are generally discernible by their differences in structural properties, and the interacting IDRs are very distinct from both, we investigated whether these interface categories also exhibit differences in mutation enrichment patterns. Of the three datasets of mutations analyzed in this work, the SwissVar disease mutation dataset is the most natural to interpret due to the more direct connection between these mutations and disease. Consistent with previous findings (130), the SwissVar disease mutations are mainly enriched in the interface core, as opposed to the rim. Compared to the globular dataset, the core regions of the interacting IDRs and IDR-partners exhibit just as much mutation enrichment.
The IDR-partner interfaces are under similar structural constraints as globular interfaces, and thus very high SwissVar mutation ORs suggest that the flexibility of the interacting IDRs does not diminish the IDR-partner interface residues’ functional importance. Supporting this view are the structural and sequence attributes of IDR-partner interfaces that show high rigidity, hydrophobicity, and conservation, suggesting the residues are highly optimized and critical to the interactions. High disease mutation ORs in interacting IDRs can be the result of mutations that disrupt their highly conserved and hydrophobic hotspot residues, which are often part of conserved motifs, e.g., an SH3-binding motif (103, 123). Additionally, interacting IDRs are also sensitive to mutations that cause disorder-to-order transitions (210). Mutations that change an interacting IDR’s conformational flexibility or shift the equilibrium between its folded and unfolded states will upset its binding affinity, which would be particularly damaging in fine-tuned regulatory functions (91). The function of IDRs can be modulated by their unbound conformations, which is exemplified by mutations in p53 that alter its residual helical structure and, consequently, change its affinity to MDM2 (9, 116). However, it is unclear whether the core regions of IDR interactions are enriched in SwissVar mutations compared to the globular interface core, unlike the results from COSMIC mutations.

The previously described enrichment of COSMIC mutations in structure regions (122), and more specifically, in the protein interface residues (220), is confirmed in our globular interactions dataset. Dividing the interface regions into core and rim revealed relatively strong enrichment in the globular rim regions. Importantly, the high enrichment in the interface rim is also confirmed in the IDR-partners from both MobiDB and ESpritz datasets.
Preferential localization to the rim regions, which consist of polar and charged solvent-exposed residues, is only observed in the COSMIC dataset, but it is in agreement with the reported tendency for cancer mutations to disrupt PPIs through changing charged residues and perturbing the electrostatic component of binding affinity (220, 225). Greater representation of mutations affecting electrostatic contributions to binding could also be associated with the higher electrostatic potential found in IDR interactions and, more generally, in proteins with regulatory functions (223). Interestingly, compared to the globular core, the interacting IDR core has even higher ORs, indicating the stronger preference of COSMIC mutations within proteins with interacting IDRs to localize to the IDR core. While cancer-associated mutations have been shown to be more enriched in globular domains than predicted interacting IDRs (124), our results indicate stronger enrichment in structurally defined interacting IDRs and emphasize the importance of identifying the core residues.

The odds of COSMIC mutations in the interface core and rim regions are higher in IDR interactions compared to globular interactions, but the odds of mutation in the full-length proteins are also higher in the IDR interaction datasets than in the globular interaction dataset. Many studies have demonstrated positive correlations between IDRs and cancer (123, 124, 213, 226). Nevertheless, the bias in mutation propensities between protein datasets and the relatively low ORs point to limitations in this analysis. The enrichment of COSMIC mutations is likely confounded by the presence of passenger mutations, as evidenced by the weaker enrichment patterns (122). On top of noise from passenger mutations, cancer mutations are heavily concentrated in proteins with specific functions, such as cell proliferation (124).
The attention that cancer research receives could also lead to biases in the data, which could contribute to the overrepresentation of cancer mutations in the curated MobiDB IDR interactions. Notably, there are known differences between oncogenes and tumor suppressors in terms of mutation localization. Mutations in tumor suppressors form larger spatial clusters and are more enriched in the buried region (225, 227), supporting the view that tumor suppressors are susceptible to a wider range of disruptive mutations, whereas mutations in oncogenes are more site-specific. These observations lead us to speculate that oncogenes may have contributed more to the COSMIC mutation enrichment in the interface rim. Consequently, further investigation using cancer driver mutations/genes and accounting for the function and pathways of the proteins could yield additional results.

In contrast to the strongly deleterious disease mutations, the SNVs from gnomAD are mostly benign, and as such, are typically scarce in regions with greater functional importance. In the globular dataset, low-frequency SNVs are most strongly depleted in the buried and core regions, which closely matches previous observations (130). As in the globular dataset, low-frequency SNVs in IDR-partners are more likely found in the surface and rim regions, supporting the stronger conservation of the residues in the buried and core regions. While the core region of interacting IDRs is moderately depleted of low-frequency SNVs, the rim region of interacting IDRs is slightly enriched. The weaker depletion of low-frequency SNVs suggests an overrepresentation of random and novel variants in the interacting IDR, particularly in the rim region. Although individual SNVs generally do not cause disease, one study suggested that 70% of rare SNVs are mildly deleterious (205). Therefore, the high-frequency SNVs are likely more indicative of the tolerance to amino acid variation.
High-frequency SNVs are considered to be benign due to their common presence in the population. As evidence, we observed that buried regions of globular proteins, where substitutions would likely have the most damaging effect, are relatively devoid of high-frequency SNVs. The interface core of the globular dataset is less depleted in high-frequency SNVs than the buried region. The depletion of high-frequency SNVs in the interacting IDR core and rim regions is comparable to that in the globular dataset. More surprising is the lack of depletion in the IDR-partner core, which could be a sign of the increased tolerance to mutations that we postulated due to the flexibility of interacting IDRs. While high-frequency SNVs are benign, many may have an impact on function (228). Increased high-frequency SNVs in the interface regions could contribute to the genetic diversity in the population, which is important for evolutionary adaptation (202). However, only a small number of high-frequency SNVs mapped to our IDR interaction structures, so their localization in IDR interactions deserves further investigation.

Our structure-based analyses provided many insights into IDR interactions and their many differences from classical globular interactions. IDR-partner interfaces differ from globular interfaces across multiple structural and sequence attributes, chief among them the concave binding surfaces. The IDR-partner interfaces are highly susceptible to disease mutations, potentially more so than the globular interfaces, especially with regard to cancer mutations. At the same time, the enrichment of high-frequency SNVs suggests that IDR-partner interfaces might also be more tolerant to amino acid variance, which could be a consequence of the flexibility of IDRs. The IDR complex structures also allowed us to identify the core residues of the interacting IDRs, which is not possible with current sequence-based prediction methods.
Once the core residues were highlighted, we observed strong enrichment of disease mutations in interacting IDRs. Not only are IDR interactions critical for cellular function and hence involved in many diseases, but the concave surfaces of IDR-partner interfaces are also more accessible to drug targeting (164), making them priority targets for research.

3.5 Methods

3.5.1 Calculating Interface Sequence and Structural Properties

Four protein complex structure datasets constructed previously for training and testing of the predictor IDRBind were used for the analysis of interface properties: MoRF-test, MoRF-train, PEP, and DB5 (see Chapter 2.5). The residues of protein structures were divided into the surface, interface, and buried categories using rASA thresholds. Areaimol from the CCP4 suite was used to calculate the solvent accessible surface area (SASA) of each residue (169). To calculate rASAs, the SASA of each residue was then normalized by its SASA in a Gly-X-Gly peptide in an extended conformation (46). Residues with rASA > 0.05 in the isolated protein chain (unbound state) were classified as surface, and the remaining residues were classified as buried and were not analyzed. Interface residues are a subset of surface residues that decrease in rASA, i.e., become more buried, in the complexed structure (bound state). Notably, this is a different rASA threshold for defining surface residues than the one used for our mutation mapping analysis. The structural properties we analyzed were feature scores used to develop IDRBind, so they were calculated with an rASA threshold that was optimized for interface prediction. The methods for calculating feature scores were described previously (see Chapter 2.5). The feature scores plotted in Figure 3.1 and Figure 3.2 were calculated by averaging the scores of all interface residues of each protein. For each feature, the Z-score of each residue was calculated relative to the surface residues of its parent protein.
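As an illustration of the interface score calculation described above, the following Python sketch standardizes one feature against the surface residues of a protein and averages the resulting Z-scores over its interface residues. It is a minimal sketch rather than the code used in this work: the function name, the list-based inputs, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

SURFACE_CUTOFF = 0.05  # rASA threshold for surface residues in Section 3.5.1


def interface_mean_zscore(scores, rasa_unbound, rasa_bound, cutoff=SURFACE_CUTOFF):
    # Hypothetical helper: all three inputs are per-residue lists of equal length.
    # Surface residues: rASA above the cutoff in the isolated (unbound) chain.
    surface = [i for i, a in enumerate(rasa_unbound) if a > cutoff]
    # Interface residues: surface residues that lose rASA in the bound state.
    interface = [i for i in surface if rasa_bound[i] < rasa_unbound[i]]
    if not interface:
        raise ValueError('protein has no interface residues')
    surface_scores = [scores[i] for i in surface]
    mu = mean(surface_scores)
    sd = pstdev(surface_scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError('feature is constant over the surface residues')
    # Z-score each interface residue against the surface, then average.
    return mean((scores[i] - mu) / sd for i in interface)


example = interface_mean_zscore(
    scores=[1.0, 2.0, 3.0, 4.0, 10.0],
    rasa_unbound=[0.5, 0.5, 0.5, 0.5, 0.01],
    rasa_bound=[0.5, 0.5, 0.2, 0.1, 0.01],
)
print(round(example, 4))
```

In this toy input the last residue is buried and is therefore ignored, so only the four surface residues define the mean and standard deviation used for standardization.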
Interface size is the number of interface residues on the globular protein. The R package ggplot2 was used for the violin plots (193). The p-value labels were calculated using the Wilcoxon rank-sum test in R.

3.5.2 Training and Testing MoRFint: IDR-Partner Interface and Globular Interface Classifier

MoRFint is an XGBoost model that was trained to classify MoRF versus globular interfaces using aggregated interface feature scores (80). A separate globular interface dataset was required for training MoRFint because DB5 was reserved for testing. The globular interface dataset 3DC consists of 58 complex structures that were selected from the 3D Complex database, which contains predominantly globular complexes (6). Starting from 100 random heterodimer complexes, proteins that share sequence identity with protein sequences in the MoRF-test, MoRF-train, and DB5 datasets were removed using CD-HIT-2D with parameters “word_length” of four, sequence identity of 0.6, and length difference of 0.5 (168). The resulting dataset, 3DC, was treated as the negative class, while the MoRF-train dataset was treated as the positive class, for the training of MoRFint.

Each interface example consists of a set of aggregated feature scores from only the interface residues. Individual residue feature scores were aggregated over each protein interface, ignoring all the non-interface residues. In addition to averaging, we calculated the maximum, minimum, and variance of the feature scores for each interface. We also calculated the average of the top-ranking half, quarter, and eighth of the feature scores, and we repeated the same for the bottom-ranking scores. Because each example in the training and testing sets was an interface instead of an individual residue, the number of examples is much smaller. Consequently, the mRMRe R module was used to select five features based on minimum redundancy and maximum relevance to reduce the complexity of the model (216).
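The aggregation of residue scores into interface-level features described above can be sketched in Python as follows. This is an illustrative sketch only: the function and key names are hypothetical, the subset sizes are rounded down with a minimum of one residue, and the population variance is used, none of which is specified in the text.

```python
from statistics import mean, pvariance


def aggregate_interface_scores(scores):
    # Hypothetical helper: 'scores' holds one feature's values for the
    # interface residues of a single protein (non-interface residues excluded).
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    features = {
        'mean': mean(ranked),
        'max': ranked[0],
        'min': ranked[-1],
        'var': pvariance(ranked),  # assumption: population variance
    }
    # Averages over the top- and bottom-ranking half, quarter, and eighth.
    for name, divisor in (('half', 2), ('quarter', 4), ('eighth', 8)):
        k = max(1, n // divisor)  # assumption: floor, but at least one residue
        features['top_' + name] = mean(ranked[:k])
        features['bottom_' + name] = mean(ranked[-k:])
    return features


features = aggregate_interface_scores([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
for key, value in features.items():
    print(key, value)
```

Each input feature therefore yields ten aggregated values per interface, which is why a feature selection step is useful when the number of interfaces is small.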
Using the five selected interface features, MoRFint was trained and optimized using the same steps used for developing the RimPred component of IDRBind (see Chapter 2.5). In short, the XGBoost module in R was used to create the gradient boosted trees model, MoRFint, and to calculate the importance scores in Figure 3.3B (80). The optimal hyperparameters were identified through 20-fold cross-validation with Bayesian optimization using the rBayesianOptimization module in R (https://www.r-project.org/). The MoRF-test, PEP, and DB5 datasets were used for testing MoRFint and for creating the MoRFint score density plot (Figure 3.3). The interfaces of MoRF-test and PEP were treated as the positive class, while interfaces of the DB5 dataset were treated as the negative class for the ROC curves. The ROC curves and AUCs were generated using the ROCR package in R (194). The original PEP dataset contained 25 proteins, but one protein that overlapped with 3DC was removed from Figure 3.3. The Pearson correlations between MoRFint scores and the prediction performance of IDRBind and ACCLUSTER were calculated using R (63). The performance measures of IDRBind and ACCLUSTER were calculated in our previous work (see Chapter 2).

3.5.3 Structural Data for Mutation Localization Analysis

The structural data of protein interactions consists of human proteins downloaded from the PDB in September 2018 (http://www.rcsb.org/). For each structure, the model of the biological unit was selected whenever available, and the first model was used when the PDB file contained multiple models. Complex structures that only consist of carbon-alpha coordinates or are extremely large were removed. Protein interactions were analyzed pairwise by iterating through all pairs of protein chains in the PDB file, focusing only on human heteromeric interactions and removing pairs with no physical interaction, i.e., pairs in which the solvent accessibility of the protein chains is unchanged between the bound and unbound states.
For each interaction pair, the relative solvent accessible surface area (rASA) of each residue was calculated to categorize the residues into protein regions. FreeSASA was used for fast calculation of the solvent accessible surface area (SASA) of the residues of each protein chain in their bound and unbound states (229). The rASA of each residue was calculated with the same method as above. The residues were placed into structural categories based on their rASA in their bound and unbound states (46). Surface and buried regions consist of residues above and below 0.25 rASA in the isolated protein chain, respectively. All residues with a change in rASA between the bound and unbound states were defined as interface residues. The interaction interface consists of the rim residues, which have rASA > 0.25 in the bound state, and the core residues, which have rASA < 0.25 in the bound state. Subsequently, the categorized residues were mapped to UniProt sequences (230).

3.5.4 Defining IDR Interaction Datasets

IDRs were defined using a curated dataset and sequence-based prediction. The curated dataset of IDRs was extracted from the MobiDB database that was downloaded in September 2018 (231). The MobiDB database contains protein regions annotated as curated linear interacting peptides (LIPs), which consist of IDRs aggregated from multiple databases. For this study, entries from ELM were excluded since they contain short linear motifs (SLiMs) that are found in structured regions as well as disordered regions (232). The IDRs in our ESpritz dataset were defined using ESpritz sequence-based IDR prediction (233). ESpritz predictions were made using the “disprot” option and “Best Sw” threshold. Both sets of IDRs were mapped onto UniProt sequences. For each UniProt sequence with IDRs defined by MobiDB, we iterated through all interaction pair structures to identify all instances of the IDRs.
For an interaction pair structure to be labeled as an IDR interaction, one of the protein chains must overlap with an IDR sequence. A protein chain was labeled as an interacting IDR if more than 50% of the interface residues were within an IDR defined in MobiDB. Furthermore, protein chains with more than nine buried residues were excluded from interacting IDRs. Once all the interacting IDR and IDR-partner interaction pair structures were defined, the remaining structures were excluded from the MobiDB dataset. The same procedure was carried out for the ESpritz dataset, with the additional criterion that more than 50% of the structured residues of an interacting IDR must be predicted as intrinsically disordered.

3.5.5 Mapping Mutations to Globular and IDR Interaction Structural Data

The mutations from SwissVar, COSMIC, and gnomAD were mapped to the interaction pair structures through their shared UniProt sequences (14, 198, 212, 230). The SwissVar mutation dataset consists of disease-related germline mutations. The COSMIC mutation dataset consists of curated cancer mutations, excluding mutations annotated with genome-wide screens and single nucleotide polymorphisms. The gnomAD data uses Ensembl transcripts, and there are often multiple transcripts per protein, so we only mapped SNVs from the transcript that best matches each UniProt sequence (234). With all mutations mapped to UniProt sequences, we subsequently iterated through all the interaction pair structures and merged all the structural and mutation data for each UniProt sequence. There are three mutation datasets, two IDR interaction datasets, and one globular interaction dataset. Therefore, there are nine combinations of datasets. For each of the nine combined datasets, UniProt sequences lacking either structural data or mutations were removed. In case of overlap between structures, the structural label of each residue was decided by priority, from highest to lowest: core, rim, buried, and surface.
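The region definitions from Section 3.5.3 and the priority rule above can be sketched together in Python. This is a hedged illustration, not the pipeline used in this work: the function names, the tolerance for detecting a change in rASA, and the treatment of residues at exactly 0.25 rASA (grouped with core or buried here) are assumptions.

```python
RASA_CUTOFF = 0.25  # surface/buried and rim/core threshold for mutation mapping
PRIORITY = {'core': 3, 'rim': 2, 'buried': 1, 'surface': 0}


def classify_residue(rasa_unbound, rasa_bound, cutoff=RASA_CUTOFF, tol=1e-6):
    # Hypothetical helper: label one residue in one interaction pair structure.
    # Any change in rASA between the two states marks an interface residue.
    if abs(rasa_unbound - rasa_bound) > tol:
        return 'rim' if rasa_bound > cutoff else 'core'
    return 'surface' if rasa_unbound > cutoff else 'buried'


def merge_labels(labels):
    # Hypothetical helper: resolve labels of one UniProt residue observed in
    # several structures, using the priority core > rim > buried > surface.
    return max(labels, key=PRIORITY.__getitem__)


observations = [(0.60, 0.60), (0.60, 0.40), (0.60, 0.10)]
labels = [classify_residue(u, b) for u, b in observations]
print(labels, merge_labels(labels))
```

Giving the interface labels the highest priority means that a residue seen at an interface in any structure keeps that label even if other structures of the same protein show it as exposed surface.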
In the case of the two IDR interaction datasets, the UniProt residues were also labeled as interacting IDR or IDR-partner.

3.5.6 Predicted Structural Regions for Mutation Mapping

Mutation localization was also analyzed for predicted structural regions from Pfam and IDRBind (Figure 3.9). Pfam version 32 annotations that were mapped to UniProt sequences were used (235). All the regions labeled as “domains” in Pfam were treated as structure regions for mutation mapping (see Figure 3.4). IDRBind was used to predict IDR-partner interfaces on a non-redundant dataset of protein chains in the PDB. The non-redundant dataset was downloaded from RCSB PDB in December 2018 (236). This dataset was generated through the blastclust algorithm at 100% sequence identity to remove redundant sequences (http://www.rcsb.org/pdb/statistics/clusterStatistics.do). The resulting protein PDB chains that also have matching human UniProt sequences were then processed by IDRBind individually. IDRBind outputs core, rim, buried, and surface residue labels, which were merged with mutation data, akin to the mapping of IDR-partner structural data.

3.5.7 Odds Ratio Calculations

The calculation of odds ratios used to compare mutation enrichment between protein regions was described by David and Sternberg (130). The probability of mutation (p) in region i was given by the number of mutated positions (m) in region i divided by the number of residues (r) in region i, i.e.:

\[ p_i = \frac{m_i}{r_i} \]

The odds ratio of mutations in region i over j is:

\[ OR_{ij} = \frac{p_i / (1 - p_i)}{p_j / (1 - p_j)} \]

The standard error for the natural log of the odds ratio is (15):

\[ SE_{\ln(OR)} = \sqrt{\frac{1}{m_i} + \frac{1}{r_i - m_i} + \frac{1}{m_j} + \frac{1}{r_j - m_j}} \]

The standard error for the natural log of the odds ratio was used to estimate the standard error of the odds ratio (15):

\[ SE_{OR} \cong OR_{ij} \cdot SE_{\ln(OR)} \]

The standard error of the odds ratio was used to define the error bars in the bar plots of ORs, which were generated using the ggplot2 module in R (193).
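The odds ratio calculation above can be illustrated with a short Python sketch. It assumes the standard formulation in which the standard error of ln(OR) is the square root of the summed reciprocals of the mutated and non-mutated position counts in the two regions; the function names and the example counts are hypothetical and are not taken from the datasets in this work.

```python
import math


def mutation_odds(m, r):
    # Odds of mutation in a region with m mutated positions out of r residues.
    p = m / r
    return p / (1 - p)


def odds_ratio_with_se(m_i, r_i, m_j, r_j):
    # Hypothetical helper: odds ratio of region i over region j, plus errors.
    # All four counts (m and r - m in each region) must be greater than zero.
    or_ij = mutation_odds(m_i, r_i) / mutation_odds(m_j, r_j)
    # Standard error of ln(OR) from the four cell counts.
    se_ln_or = math.sqrt(1 / m_i + 1 / (r_i - m_i) + 1 / m_j + 1 / (r_j - m_j))
    # Approximate standard error of the OR itself, as used for the error bars.
    se_or = or_ij * se_ln_or
    return or_ij, se_ln_or, se_or


# Made-up counts: 20 of 100 residues mutated in region i, 10 of 100 in region j.
or_ij, se_ln_or, se_or = odds_ratio_with_se(20, 100, 10, 100)
print(or_ij, se_ln_or, se_or)
```

With these made-up counts the odds are 0.25 and 0.111, so mutations are 2.25 times more likely, in odds terms, to fall in region i than in region j.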
The p-values for the odds ratios in the tables (ORRs) were calculated using the chi-square test in R. The adjusted p-values were calculated using the Bonferroni method.

Chapter 4: Conclusion

IDRs constitute a significant fraction of eukaryotic protein sequences, and one of their primary functions is mediating protein–protein interactions (PPIs) (107). The flexibility and accessibility of IDRs make them ideal for mediating signaling and regulatory interactions, and their functional relevance is highlighted by their association with numerous diseases, including cancer and neurodegeneration (214). Because they are flexible and are typically involved in transient interactions, IDR complex structures are challenging to determine experimentally and are underrepresented in structural databases (115, 140). With limited structural data, research on IDRs has been greatly assisted by sequence-based prediction methods for both general IDRs and interacting IDRs, allowing proteome-scale studies of complex biological mechanisms (115, 210). In contrast, research on the IDR-partner has been limited by the availability of structural data. There are few computational methods available for predicting binding sites of IDRs, with none developed for MoRFs.

Therefore, this work began with the development of an IDR-partner interface predictor named IDRBind. The hypothesis is that the distinctive flexibility and binding mechanisms of IDRs will translate to differences in sequence and structural properties in the IDR-partner interfaces, which would allow training of a predictor with specificity for IDR-partner interfaces. A broader goal is to use the growing amount of structural data to investigate a wide range of differentiating properties of IDR-partner interfaces and, more generally, IDR interactions. This work revealed characteristics that differentiate IDR interactions from the classical globular interactions.
The differences in molecular structure and function highlight the importance of spreading awareness and promoting further investigations of this crucial class of PPIs.

4.1 IDRBind – Predictor of IDR-Partner Interfaces

Numerous computational methods have been developed for the prediction of protein–protein interfaces (60). These prediction methods contribute to the community in two significant ways. Firstly, the prediction of protein interfaces is useful for characterizing proteins with incomplete structural interaction data. For example, computational interface prediction is a quick way to identify potential interface residues, which can then be prioritized for experimental study. Furthermore, interface prediction can be the first step towards the prediction of the complex structure (62). Secondly, the development of protein interface predictors expands our understanding of protein interfaces by revealing the protein features and prediction strategies that best encapsulate the identification of interaction surfaces.

Chapter 2 reports the development of an IDR-partner interface predictor named IDRBind. Predictors of protein interfaces have traditionally focused on globular–globular interfaces (globular interfaces) (56), and predictors of peptide binding sites on globular proteins (peptide interfaces) have only been made available more recently (62, 63). The incomplete coverage of existing methods left an opportunity to develop a predictor that can identify binding sites of molecular recognition features (MoRFs) (100), which are a class of interaction-mediating IDRs that are longer than peptides. Therefore, we developed IDRBind, which is capable of predicting the binding sites of a broad range of IDRs, including MoRFs. Furthermore, IDRBind demonstrated the usefulness of various protein interface features and prediction strategies.
While features traditionally used for interface prediction proved to be essential, prediction of IDR-partner interfaces benefited from pocket-finding approaches inspired by peptide- and ligand-binding site predictors (237). The colocalization pattern of core and rim regions also helped identify the IDR-partner interface patch.

4.1.1 Significance, Limitations, and Applications

IDRBind is currently the state of the art for the prediction of IDR-partner interfaces, an interface category encompassing binding sites of both MoRFs and peptides. Using datasets of MoRF and peptide interfaces, we have shown that general globular interface predictors are unsuitable for the prediction of IDR-partner interfaces. While existing peptide interface predictors are better suited than globular interface predictors, IDRBind is a significant step forward in prediction performance for both MoRF and peptide interfaces in terms of MCC. Furthermore, the strategies that were proven successful by IDRBind highlight essential attributes of IDR-partner interfaces, which could potentially help the development of future predictors.

The accurate predictions of IDRBind are partly attributed to the combination of elements borrowed from both globular and peptide interface predictors. The long history of globular interface predictor development has uncovered multiple staple features that differentiate interface from non-interface residues, including strong hydrophobicity (i.e., physicochemical properties of residues) and evolutionary conservation. Furthermore, research in globular interfaces also revealed the crucial distinctions between the solvent-exposed rim and the buried core of interfaces, where the features associated with protein interfaces are most distinguishable in the core region. These properties appear just as useful for IDR interface prediction.
On the other hand, advanced peptide interface predictors take advantage of similarities between peptide and ligand binding sites by using docking approaches to find binding pockets on the protein surface (62, 63). Taking a cue from pocket-finding methods, we used curvature and groove feature scores to identify grooved surfaces. Our approach combined globular interface features with pocket-finding methods to create a hybrid prediction method.

IDRBind demonstrates the distinction between the core and rim regions of IDR-partner interfaces and reveals a possible advantage in separating them for interface prediction. IDR-partner and globular interface core regions are similar, aside from the tendency for the IDR-partner interface core to localize at the center of grooved surfaces. The importance of predicting the interface core is well recognized, and the globular interface predictor ISMBLab demonstrated the effectiveness of this strategy (68). However, IDRBind is unique in explicitly using separate models for the core region (CorePred) and rim region (RimPred). Unlike globular interfaces, the rim region of an IDR-partner interface forms the ridge of the binding groove, and thus the rim residues have much higher rASA than the core residues. IDR interactions involve a strong electrostatic component (109), and the rim region contains most of the polar and charged interface residues, so RimPred exploits the strong electrostatic potential to identify rim residues. As a consequence of the distinct IDR-partner interface core and rim regions, training and optimizing CorePred and RimPred independently created models that were better at classifying core and rim residues from non-interface residues. The two independent models also created an opportunity to account for the spatial relationship of core and rim residues using a conditional random field model.
The conditional random field (CRF) used to combine CorePred and RimPred helped tackle the challenges involved in the prediction of MoRF interfaces. One challenge involved in all protein interface prediction is the size bias, i.e., the tendency to over-score smaller globular proteins (143). As an alternative to ranking propensity scores to correct for size bias, we chose to measure the performance of discrete prediction labels on each protein independently. Individually benchmarked proteins could also make comparisons with future prediction methods easier. For IDRBind, it was also critical to reduce the over-scoring of CorePred and RimPred on small proteins because the adjacency component of IDRBind’s CRF, which encourages agreement between neighboring residue labels, could drive all residues of small proteins to the interface class. We chose to apply a penalty to over-scored proteins in the scoring component of IDRBind’s CRF because the ranking of scores may be less suitable for predicting IDR-partner interfaces, which vary significantly in size.

A challenge more specific to predicting MoRF interfaces is the variance in interface size and shape, since MoRFs range from 10 to more than 70 residues in length and could wrap around the IDR-partner structure. Instead of predicting interface patches of a specific size or normalizing prediction scores (49, 58, 238), the adjacency component of IDRBind assists in prediction by identifying the spatial pattern of core and rim residues. We speculate that this architecture contributed to IDRBind's low false-positive rate in the peptide dataset, which is notable because IDRBind was trained on the larger MoRF interfaces. Whether or not this strategy could be applied to globular interface prediction remains to be tested, but the rim residues of globular interfaces are likely more difficult to distinguish from non-interface surface residues.
While IDRBind has demonstrated better performance than existing methods, there is still room for improvement. A major obstacle in the development of IDRBind was the quality and quantity of structures for training and testing. Only 84 examples of MoRF complexes were used in developing IDRBind. Such a small sample size meant significant variance in performance estimates, which change between random samples of training and testing data, and a higher risk of overfitting the CorePred and RimPred models. Another obstacle was the noisy data labels. Structural data for PPIs is incomplete, so some interface residues would inevitably be labeled as non-interface, and the reverse is also observed in non-native interfaces such as crystal contacts in X-ray crystallography. The XGBoost algorithm is suggested to be robust to noisy data, and increasing regularization during training can reduce overfitting to noise, but noisy labels are still detrimental to model training (239).

A prospective solution to both data size and label limitations is to exhaustively map all known MoRF interface residues from multiple structures for each protein, which could reduce mislabeling to an extent limited by the number of additional MoRF complexes available. Given the similarity between MoRF and peptide interfaces, IDRBind will likely perform better if it is trained on all IDRs. Moreover, data labels would be even more complete if globular interfaces were included, as long as the different interface categories are incorporated into the model. After all, peptide, MoRF, and globular interfaces exhibit more commonalities than differences in multiple properties, especially hydrophobicity and conservation. With increasing quantity and quality of structural data for training, future protein interface predictors will likely predict more specific categories of interfaces. We also anticipate the increasing use of deep learning algorithms, which excel with larger datasets.
To the best of our knowledge, IDRBind is the first method to tackle the challenging task of MoRF interface prediction, and thus is a leading method for predicting binding sites of the full range of IDRs. There is an abundance of IDR interactions in the human proteome with essential regulatory functions that are often transient. Furthermore, drug development has shown success in targeting IDR-partner interfaces (164, 165), likely due to the better accessibility of surface grooves and pockets to small-molecule binding. Therefore, IDRBind could be useful for prioritizing IDR-partner proteins for further investigation. IDRBind could also be used in docking pipelines. For instance, IDRBind could potentially be used to drive flexible peptide docking using the HADDOCK protocol to predict peptide-protein complex structures (240). Another application of IDRBind was demonstrated in Chapter 3, where it was used as a high-throughput method to predict IDR-partner interfaces in over 5000 proteins for studying disease mutation enrichment in protein interfaces.

4.2 Interface Attributes and Mutation Localization in IDR Interactions

Investigation of the structure and molecular mechanisms of PPIs is a classical undertaking in structural biology. Much of what is known comes from studies of globular–globular interactions, but IDR interactions are typically differentiated in both structure and function from globular interactions (87, 104, 107). While the differences between the interacting IDR and globular interfaces are immediately recognizable, the differences between IDR-partner and globular interfaces are more nuanced. Furthermore, it is unclear what the differences between the structures of IDR complexes and globular complexes would mean for the functional importance of the interface residues and their susceptibility to mutations. The overarching goal in Chapter 3 is to characterize and identify differentiating factors of IDR interactions compared to classical globular interactions.
We used the feature scores of IDRBind, which has shown specificity for MoRF and peptide interfaces, to examine the sequence and structural attributes of MoRF, peptide, and globular interfaces. To evaluate the functional importance and the tolerance to variation in the interface regions of IDR interactions and globular interactions, we analyzed the enrichment patterns of disease mutations and SNVs. This study reveals several attributes that separate MoRF, peptide, and globular interfaces, and we highlight curvature as the dominant differentiating factor. Furthermore, peptide and MoRF interfaces are more similar to each other than to globular interfaces, supporting their categorization and analysis as subclasses of IDR-partner interfaces. Although preliminary, the comparative analyses of IDR and globular interactions in terms of disease, cancer, and SNV mutation enrichment reveal contrasts that provide valuable insight into molecular mechanisms and functions. We showed that IDR interactions are just as enriched in disease mutations as globular interactions. Based on observations in SNVs, we further speculate that IDR-partner interfaces may have a higher tolerance to neutral SNVs while interacting IDRs may contain more random and novel mutations.

4.2.1 Significance and Limitations

This study showed that peptide interfaces are, on average, highest in surface concavity, evolutionary conservation, and hydrophobicity while lowest in rASA and B-factor estimate. Based on these feature scores, MoRF interfaces generally lie between peptide and globular interfaces. Therefore, MoRF interfaces are not merely larger peptide interfaces but are instead surfaces that share similarities with both peptide and globular interfaces. One exceptional attribute is electrostatic potential, which is equally strong or even stronger in MoRF interfaces compared to peptide interfaces.
The stronger concavity, electrostatic potential, conservation, rigidity, and hydrophobicity are consistent with previous findings and support the view that IDR-partner binding sites, especially for peptides, are optimized to pre-pay for the entropic cost of IDR binding (87, 104). However, these attributes also demonstrated the similarities between the three interface classes. Consequently, we created a classifier model, namely MoRFint, to evaluate how well the MoRF, peptide, and globular interfaces could be distinguished. Three main observations were made based on MoRFint. Firstly, the model’s main component for separating MoRF and globular interfaces is surface geometry, with MoRFs binding to concave surfaces while globular proteins typically bind to flatter surfaces. Secondly, MoRFint can classify MoRF versus globular interfaces with modest performance (AUC = 0.76), but it performed better in distinguishing peptide and globular interfaces (AUC = 0.84), which is consistent with trends in feature scores. Lastly, MoRF interfaces are more similar to peptide interfaces than to globular interfaces. The last point is notable for analyses of IDR interactions. Interactions mediated by MoRFs exhibit differences from peptide-mediated interactions, but the similarity between them and the overwhelmingly higher number of peptide motifs (133) justify overlooking the distinctions.

An orthogonal way to compare IDR interactions is the analysis of mutation enrichment. Many recent articles have explored the localization of disease mutations and SNVs because the disruption of protein function is a major underlying mechanism of disease (11, 130). In particular, many disease mutations localize to PPI interfaces, likely perturbing PPIs. A significant conclusion from this work is that IDR interactions show high susceptibility to disease mutations. Disease mutations from SwissVar are enriched to a similar level in the interface regions of the IDR-partner, interacting IDR, and globular datasets.
COSMIC cancer mutations show even higher enrichment in IDR interactions than in globular interactions. Notably, previous studies have observed the depletion of disease mutations in IDRs (124, 210). A crucial distinction in this study is the specific identification of the core and rim interface residues of IDRs based on structure. The definition of core and rim regions was established in globular domains and is not widely adopted in analyzing IDRs (163). However, core residues of interacting IDRs are typically more hydrophobic, more conserved, and contribute more interaction surface to protein binding than rim residues (109). The enrichment of SwissVar mutations underscores the importance of the core residues of IDR interactions.

This study also demonstrated the enrichment of somatic cancer mutations from the COSMIC database in the rim regions of globular proteins. While the core region of interfaces is typically more critical to interactions than the rim, COSMIC mutations in the rim region show enrichment at least as strong as in the core region in the globular dataset and the IDR-partners. This finding is in agreement with the tendency for cancer mutations to disrupt the electrostatic component of PPIs previously observed by other studies (220, 225). Moreover, this study confirms evidence of cancer mutations in globular proteins preferentially localizing to PPI interfaces instead of buried regions, in contrast with monogenic disease mutations. It was proposed that mutations in the buried region would more likely be lethal to cancer cells, and thus protein function is instead modulated via mutations at the interface (209). Therefore, we further speculate that cancer mutations in the rim region lead to perturbation of function while limiting the impact on the fitness of cancer cells. However, the MobiDB dataset appears to contain a higher concentration of COSMIC mutations. We suspect the discrepancy is due to biases between the curated and predicted datasets.
Further investigation is required to determine the identities and functions of these proteins as well as to separate the driver from the passenger cancer mutations. Unlike the SwissVar disease mutations, which are strongly deleterious, SNVs from a healthy population are mostly benign, making them unlikely to be found in functionally critical regions. The depletion of SNVs from important functional regions generally holds for our globular and IDR interactions. However, interacting IDRs showed an unusually high propensity for low-frequency SNVs. The low-frequency SNVs, although not typically associated with diseases, may include mildly deleterious mutations under negative selective pressure as well as random and novel mutations. Therefore, we suspect that interacting IDRs could be enriched with random and novel mutations. While a higher rate of mutation is associated with IDRs in general (208, 241), it is unclear whether it holds for the interacting IDR core and rim regions. If confirmed, the enrichment of low-frequency SNVs in interacting IDRs could have implications for human health, as there is increasing recognition of the complex and cumulative contributions of SNVs to the risk of disease (200). On the other hand, high-frequency SNVs are common in the healthy population. The enrichment of high-frequency SNVs in the IDR-partner core supports our hypothesis that IDR-partner interfaces have a higher tolerance for mutations due to the flexibility of IDRs. Furthermore, high-frequency SNVs are also more strongly depleted in the buried region than in the core region in the globular dataset. In contrast with mutations in the buried region, which would more likely destabilize the protein fold, mutations in the protein interface would likely have milder effects on molecular functions that could be negative or positive (228). The presence of common SNVs in the protein interface leads us to speculate on their positive contributions to genetic variance and the adaptability of the population (202). 
However, the preliminary nature of this analysis and the size of the datasets limit what we can infer from it. This work has uncovered several properties in which IDR interactions differ from globular interactions. Furthermore, mutation enrichment trends in interacting IDR residues demonstrate their susceptibility to disease mutations as well as their possible tolerance to variance, and thus lend support to our hypothesis on the malleability of IDR interactions. However, more details could potentially be revealed if protein regions were categorized even more precisely. PPIs exhibit a broad spectrum of structural diversity, and we are increasingly able to characterize and categorize them in more detail thanks to the growing availability of structural data. Different types of binding motifs have been recognized in IDRs, including conserved motifs, preformed elements, and anchoring sites (133). Furthermore, there is increasing recognition of residual dynamics in IDRs upon binding, namely fuzziness (135). There are also potential differences between IDR-partner interfaces that bind to different types of IDR elements. However, high-throughput investigation of these interacting IDR elements requires more structural data as well as community consensus on how to define them. Bibliography 1. O. Keskin, A. Gursoy, B. Ma, R. Nussinov, Principles of protein-protein interactions: what are the preferred ways for proteins to interact? Chem. Rev. 108, 1225–44 (2008). 2. C. Chothia, J. Janin, Principles of protein-protein recognition. Nature 256, 705–8 (1975). 3. T. Rolland, et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). 4. M. J. Meyer, et al., Interactome INSIDER: a structural interactome browser for genomic studies. Nat. Methods 15, 107–114 (2018). 5. R. Mosca, A. Céol, P. Aloy, Interactome3D: Adding structural details to protein networks. Nat. Methods 10, 47–53 (2013). 6. E. D. Levy, J. B. Pereira-Leal, C. Chothia, S. A. 
Teichmann, 3D complex: a structural classification of protein complexes. PLoS Comput. Biol. 2, e155 (2006). 7. I. M. A. Nooren, J. M. Thornton, Diversity of protein-protein interactions. EMBO J. 22, 3486–92 (2003). 8. L. J. Colwell, et al., The emergence of protein complexes: quaternary structure, dynamics and allostery. Biochem. Soc. Trans. 40, 475–491 (2012). 9. P. E. Wright, H. J. Dyson, Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol. 16, 18–29 (2015). 10. A. G. Ngounou Wetie, et al., Investigation of stable and transient protein-protein interactions: Past, present, and future. Proteomics 13, 538–57 (2013). 11. N. Sahni, et al., Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161, 647–660 (2015). 12. S. Stefl, H. Nishi, M. Petukh, A. R. Panchenko, Molecular Mechanisms of Disease-Causing Missense Mutations. J. Mol. Biol. 425, 3919–3936 (2013). 13. M. Lek, et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016). 14. A. Mottaz, F. P. A. David, A.-L. Veuthey, Y. L. Yip, Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics 26, 851–2 (2010). 15. X. Wang, et al., Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol. 30, 159–164 (2012). 16. R. L. Baldwin, Energetics of protein folding. J. Mol. Biol. 371, 283–301 (2007). 17. A.-C. Gavin, et al., Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631–6 (2006). 18. R. Oughtred, et al., The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47, D529–D541 (2019). 19. M. P. H. Stumpf, et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U. S. A. 105, 6959–64 (2008). 20. K. Venkatesan, et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). 21. S. Hayes, B. Malacrida, M. Kiely, P. A. 
Kiely, Studying protein-protein interactions: progress, pitfalls and solutions. Biochem. Soc. Trans. 44, 994–1004 (2016). 22. H. Jeong, S. P. Mason, A.-L. Barabási, Z. N. Oltvai, Lethality and centrality in protein networks. Nature 411, 41–42 (2001). 23. M. W. Hahn, A. D. Kern, Comparative Genomics of Centrality and Essentiality in Three Eukaryotic Protein-Interaction Networks. Mol. Biol. Evol. 22, 803–806 (2005). 24. X. He, J. Zhang, Why do hubs tend to be essential in protein networks? PLoS Genet. 2, e88 (2006). 25. Y. Feng, Q. Wang, T. Wang, Drug Target Protein-Protein Interaction Networks: A Systematic Perspective. Biomed Res. Int. 2017, 1289259 (2017). 26. D. Petrey, B. Honig, Structural bioinformatics of the interactome. Annu. Rev. Biophys. 43, 193–210 (2014). 27. S. C. Shoemaker, N. Ando, X-rays in the Cryo-Electron Microscopy Era: Structural Biology’s Dynamic Future. Biochemistry 57, 277–285 (2018). 28. B. A. Manjasetty, K. Büssow, S. Panjikar, A. P. Turnbull, Current methods in structural proteomics and its applications in biological sciences. 3 Biotech 2, 89–113 (2011). 29. S. Kosol, S. Contreras-Martos, C. Cedeño, P. Tompa, Structural characterization of intrinsically disordered proteins by NMR spectroscopy. Molecules 18, 10802–28 (2013). 30. K. K. Frederick, M. S. Marlow, K. G. Valentine, A. J. Wand, Conformational entropy in molecular recognition by proteins. Nature 448, 325–9 (2007). 31. M. R. Jensen, M. Zweckstetter, J. R. Huang, M. Blackledge, Exploring free-energy landscapes of intrinsically disordered proteins at atomic resolution using NMR spectroscopy. Chem. Rev. 114, 6632–6660 (2014). 32. M. Varadi, et al., pE-DB: a database of structural ensembles of intrinsically disordered and of unfolded proteins. Nucleic Acids Res. 42, D326–35 (2014). 33. O. Vinogradova, J. Qin, NMR as a unique tool in assessment and complex determination of weak protein-protein interactions. Top. Curr. Chem. 326, 35–45 (2012). 34. A. 
Bartesaghi, et al., Atomic Resolution Cryo-EM Structure of β-Galactosidase. Structure 26, 848–856.e3 (2018). 35. D. P. Frueh, A. C. Goodrich, S. H. Mishra, S. R. Nichols, NMR methods for structural studies of large monomeric and multimeric proteins. Curr. Opin. Struct. Biol. 23, 734–9 (2013). 36. P. Argos, An investigation of protein subunit and domain interfaces. Protein Eng. 2, 101–13 (1988). 37. S. Jones, J. M. Thornton, Principles of protein-protein interactions. Proc. Natl. Acad. Sci. U. S. A. 93, 13–20 (1996). 38. S. Jones, J. M. Thornton, Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol. 272, 121–132 (1997). 39. S. Kawashima, H. Ogata, M. Kanehisa, AAindex: Amino Acid Index Database. Nucleic Acids Res. 27, 368–9 (1999). 40. G. Bajic, L. Yatime, R. B. Sim, T. Vorup-Jensen, G. R. Andersen, Structural insight on the recognition of surface-bound opsonins by the integrin I domain of complement receptor 3. Proc. Natl. Acad. Sci. U. S. A. 110, 16426–31 (2013). 41. B. Ma, T. Elkayam, H. Wolfson, R. Nussinov, Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl. Acad. Sci. U. S. A. 100, 5772–5777 (2003). 42. O. Lichtarge, H. R. Bourne, F. E. Cohen, An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257, 342–358 (1996). 43. F. Johansson, H. Toh, A comparative study of conservation and variation scores. BMC Bioinformatics 11, 388 (2010). 44. J. A. Capra, M. Singh, Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875–82 (2007). 45. B. Lee, F. M. Richards, The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55, 379–400 (1971). 46. E. D. Levy, A Simple Definition of Structural Regions in Proteins and Its Use in Analyzing Interface Evolution. J. Mol. Biol. 403, 660–670 (2010). 47. A. Porollo, J. 
Meller, Prediction-based fingerprints of protein-protein interactions. Proteins 66, 630–45 (2007). 48. H. Zellner, et al., PresCont: predicting protein-protein interfaces utilizing four residue properties. Proteins 80, 154–68 (2012). 49. S. Jones, J. M. Thornton, Prediction of protein-protein interaction sites using patch analysis. J. Mol. Biol. 272, 133–43 (1997). 50. R. Esmaielbeiki, K. Krawczyk, B. Knapp, J.-C. Nebel, C. M. Deane, Progress and challenges in predicting protein interfaces. Brief. Bioinform. 17, 1–15 (2016). 51. J.-F. Xia, X.-M. Zhao, J. Song, D.-S. Huang, APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics 11, 174 (2010). 52. F. K. Pettit, J. U. Bowie, Protein surface roughness and small molecular binding sites. J. Mol. Biol. 285, 1377–82 (1999). 53. V. Le Guilloux, P. Schmidtke, P. Tuffery, Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10, 168 (2009). 54. H. R. A. Jonker, R. W. Wechselberger, R. Boelens, G. E. Folkers, R. Kaptein, Structural properties of the promiscuous VP16 activation domain. Biochemistry 44, 827–39 (2005). 55. L. Young, R. L. Jernigan, D. G. Covell, A role for surface hydrophobicity in protein-protein recognition. Protein Sci. 3, 717–29 (1994). 56. H. Chen, H.-X. Zhou, Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins 61, 21–35 (2005). 57. C. Savojardo, P. Fariselli, P. L. Martelli, R. Casadio, ISPRED4: interaction sites PREDiction in protein structures with a refining grammar model. Bioinformatics 33, 1656–1663 (2017). 58. H. Neuvirth, R. Raz, G. Schreiber, ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J. Mol. Biol. 338, 181–99 (2004). 59. S. J. de Vries, A. D. J. van Dijk, A. M. J. J. Bonvin, WHISCY: what information does surface conservation yield? 
Application to data-driven docking. Proteins 63, 479–89 (2006). 60. L. C. Xue, D. Dobbs, A. M. J. J. Bonvin, V. Honavar, Computational prediction of protein interfaces: A review of data driven methods. FEBS Lett. 589, 3516–3526 (2015). 61. H.-X. Zhou, S. Qin, Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics 23, 2203–9 (2007). 62. A. Lavi, et al., Detection of peptide-binding sites on protein surfaces: The first step towards the modeling and targeting of peptide-mediated interactions. Proteins 81, 2096–2105 (2013). 63. C. Yan, X. Zou, Predicting peptide binding sites on protein surfaces by clustering chemical interactions. J. Comput. Chem. 36, 49–61 (2015). 64. N. London, B. Raveh, O. Schueler-Furman, Peptide docking and structure-based characterization of peptide binding: from knowledge to know-how. Curr. Opin. Struct. Biol. 23, 894–902 (2013). 65. Q. C. Zhang, D. Petrey, R. Norel, B. H. Honig, Protein interface conservation across structure space. Proc. Natl. Acad. Sci. U. S. A. 107, 10896–901 (2010). 66. L. C. Xue, D. Dobbs, V. Honavar, HomPPI: a class of sequence homology based protein-protein interface prediction methods. BMC Bioinformatics 12, 244 (2011). 67. R. A. Jordan, Y. EL-Manzalawy, D. Dobbs, V. Honavar, Predicting protein-protein interface residues using local surface structural similarity. BMC Bioinformatics 13, 41 (2012). 68. C.-T. Chen, et al., Protein-protein interaction site predictions with three-dimensional probability distributions of interacting atoms on protein surfaces. PLoS One 7, e37706 (2012). 69. J. Segura, P. F. Jones, N. Fernandez-Fuentes, A holistic in silico approach to predict functional sites in protein structures. Bioinformatics 28, 1845–50 (2012). 70. S. J. de Vries, A. M. J. J. Bonvin, CPORT: A Consensus Interface Predictor and Its Performance in Prediction-Driven Docking with HADDOCK. PLoS One 6, e17695 (2011). 71. Z. Zhao, X. 
Gong, Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans. Comput. Biol. Bioinforma. 5963, 1–1 (2017). 72. B. Huang, M. Schroeder, Using protein binding site prediction to improve protein docking. Gene 422, 14–21 (2008). 73. J. Segura, M. A. Marín-López, P. F. Jones, B. Oliva, N. Fernandez-Fuentes, VORFFIP-driven dock: V-D2OCK, a fast and accurate protein docking strategy. PLoS One 10, e0118107 (2015). 74. P. Robustelli, K. Kohlhoff, A. Cavalli, M. Vendruscolo, Using NMR Chemical Shifts as Structural Restraints in Molecular Dynamics Simulations of Proteins. Structure 18, 923–933 (2010). 75. C. Dominguez, R. Boelens, A. M. J. J. Bonvin, HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125, 1731–7 (2003). 76. M. Rasheed, R. Bettadapura, C. Bajaj, X-ray, Cryo-EM, and computationally predicted protein structures used in integrative modeling of HIV Env glycoprotein gp120 in complex with CD4 and 17b. Data Br. 6, 833–9 (2016). 77. Q. C. Zhang, D. Petrey, J. I. Garzón, L. Deng, B. Honig, PrePPI: a structure-informed database of protein–protein interactions. Nucleic Acids Res. 41, D828–D833 (2012). 78. S. Liang, C. Zhang, S. Liu, Y. Zhou, Protein binding site prediction using an empirical scoring function. Nucleic Acids Res. 34, 3698–707 (2006). 79. K. P. Murphy, Machine learning: a probabilistic perspective (adaptive computation and machine learning series) (2012). 80. T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, (ACM Press, 2016), pp. 785–794. 81. C. Sutton, An Introduction to Conditional Random Fields. Found. Trends® Mach. Learn. 4, 267–373 (2012). 82. A. McCallum, K. Schultz, S. 
Singh, “FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, A. Culotta, Eds. (Curran Associates, Inc., 2009), pp. 1249–1257. 83. Q. C. Zhang, et al., PredUs: A web server for predicting protein interfaces using structural neighbors. Nucleic Acids Res. 39, 283–287 (2011). 84. H. Hwang, D. Petrey, B. Honig, A hybrid method for protein-protein interface prediction. Protein Sci. 25, 159–165 (2016). 85. C. J. Bendell, et al., Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor. BMC Bioinformatics 15, 82 (2014). 86. D. La, M. Kong, W. Hoffman, Y. I. Choi, D. Kihara, Predicting permanent and transient protein-protein interfaces. Proteins 81, 805–18 (2013). 87. V. Vacic, et al., Characterization of molecular recognition features, MoRFs, and their binding partners. J. Proteome Res. 6, 2351–66 (2007). 88. A. Hoffmann, R. Kwok, P. Compton, “Using Subclasses to Improve Classification Learning” in (Springer, Berlin, Heidelberg, 2001), pp. 203–213. 89. P. Tompa, The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett. 579, 3346–54 (2005). 90. P. Tompa, N. E. Davey, T. J. Gibson, M. M. Babu, A Million peptide motifs for the molecular biologist. Mol. Cell 55, 161–169 (2014). 91. M.-H. Seo, P. M. Kim, The present and the future of motif-mediated protein-protein interactions. Curr. Opin. Struct. Biol. 50, 162–170 (2018). 92. M. Buljan, et al., Alternative splicing of intrinsically disordered regions and rewiring of protein interactions. Curr. Opin. Struct. Biol. 23, 443–50 (2013). 93. P. M. Kim, A. Sboner, Y. Xia, M. Gerstein, The role of disorder in interaction networks: a structural analysis. Mol. Syst. Biol. 4, 179 (2008). 94. C. Haynes, et al., Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. 
PLoS Comput. Biol. 2, e100 (2006). 95. X. Zhang, T. Perica, S. A. Teichmann, Evolution of protein structures and interactions from the perspective of residue contact networks. Curr. Opin. Struct. Biol. 23, 954–63 (2013). 96. J. Gsponer, M. M. Babu, The rules of disorder or why disorder rules. Prog. Biophys. Mol. Biol. 99, 94–103 (2009). 97. F. Heinkel, et al., Phase separation and clustering of an ABC transporter in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. U. S. A. 116, 16326–16331 (2019). 98. H. Wu, M. Fuxreiter, The Structure and Dynamics of Higher-Order Assemblies: Amyloids, Signalosomes, and Granules. Cell 165, 1055–1066 (2016). 99. P. E. Wright, H. J. Dyson, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol. 293, 321–31 (1999). 100. A. Mohan, et al., Analysis of molecular recognition features (MoRFs). J. Mol. Biol. 362, 1043–59 (2006). 101. E. R. Lacy, et al., Molecular basis for the specificity of p27 toward cyclin-dependent kinases that regulate cell division. J. Mol. Biol. 349, 764–73 (2005). 102. C. Mooney, G. Pollastri, D. C. Shields, N. J. Haslam, Prediction of short linear protein binding regions. J. Mol. Biol. 415, 193–204 (2012). 103. N. Kurochkina, U. Guha, SH3 domains: modules of protein-protein interactions. Biophys. Rev. 5, 29–39 (2013). 104. N. London, D. Movshovitz-Attias, O. Schueler-Furman, The Structural Basis of Peptide-Protein Binding Strategies. Structure 18, 188–199 (2010). 105. F. F. Vajdos, S. Yoo, M. Houseweart, W. I. Sundquist, C. P. Hill, Crystal structure of cyclophilin A complexed with a binding site peptide from the HIV-1 capsid protein. Protein Sci. 6, 2297–307 (1997). 106. M. Fuxreiter, I. Simon, P. Friedrich, P. Tompa, Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J. Mol. Biol. 338, 1015–26 (2004). 107. R. van der Lee, et al., Classification of intrinsically disordered regions and proteins. Chem. Rev. 
114, 6589–631 (2014). 108. B. Mészáros, P. Tompa, I. Simon, Z. Dosztányi, Molecular principles of the interactions of disordered proteins. J. Mol. Biol. 372, 549–61 (2007). 109. E. T. C. Wong, D. Na, J. Gsponer, On the Importance of Polar Interactions for Complexes Containing Intrinsically Disordered Proteins. PLoS Comput. Biol. 9, e1003192 (2013). 110. F. M. Disfani, et al., MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics 28, i75–i83 (2012). 111. N. Malhis, M. Jacobson, J. Gsponer, MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res. 44, W488–93 (2016). 112. C. J. Oldfield, et al., Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics 9 Suppl 1, S1 (2008). 113. P. H. Kussie, et al., Structure of the MDM2 oncoprotein bound to the p53 tumor suppressor transactivation domain. Science 274, 948–53 (1996). 114. I. Radhakrishnan, et al., Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell 91, 741–52 (1997). 115. B. Mészáros, I. Simon, Z. Dosztányi, Prediction of protein binding regions in disordered proteins. PLoS Comput. Biol. 5, e1000376 (2009). 116. S. Yadahalli, J. Li, D. P. Lane, S. Gosavi, C. S. Verma, Characterizing the conformational landscape of MDM2-binding p53 peptides using Molecular Dynamics simulations. Sci. Rep. 7, 15600 (2017). 117. M. Fuxreiter, Fold or not to fold upon binding - does it really matter? Curr. Opin. Struct. Biol. 54, 19–25 (2018). 118. P. Tompa, M. Fuxreiter, Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem. Sci. 33, 2–8 (2008). 119. A. Borgia, et al., Extreme disorder in an ultrahigh-affinity protein complex. Nature 555, 61–66 (2018). 120. K. L. Yap, T. Yuan, T. 
K. Mal, H. J. Vogel, M. Ikura, Structural basis for simultaneous binding of two carboxy-terminal peptides of plant glutamate decarboxylase to calmodulin. J. Mol. Biol. 328, 193–204 (2003). 121. T. Vreven, et al., Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J. Mol. Biol. 427, 3031–3041 (2015). 122. H.-C. Lu, S. S. Chung, A. Fornili, F. Fraternali, Anatomy of protein disorder, flexibility and disease-related mutations. Front. Mol. Biosci. 2, 47 (2015). 123. B. Uyar, R. J. Weatheritt, H. Dinkel, N. E. Davey, T. J. Gibson, Proteome-wide analysis of human disease mutations in short linear motifs: neglected players in cancer? Mol. Biosyst. 10, 2626–2642 (2014). 124. M. Pajkos, B. Mészáros, I. Simon, Z. Dosztányi, Is there a biological cost of protein disorder? Analysis of cancer-associated mutations. Mol. Biosyst. 8, 296–307 (2012). 125. E. Porta-Pardo, L. Garcia-Alonso, T. Hrabe, J. Dopazo, A. Godzik, A Pan-Cancer Catalogue of Cancer Driver Protein Interaction Interfaces. PLoS Comput. Biol. 11 (2015). 126. O. Keskin, N. Tuncbag, A. Gursoy, Predicting Protein-Protein Interactions from the Molecular to the Proteome Level. Chem. Rev. 116, 4884–909 (2016). 127. N. J. Burgoyne, R. M. Jackson, Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces. Bioinformatics 22, 1335–42 (2006). 128. P. Chakrabarti, J. Janin, Dissecting protein-protein recognition sites. Proteins 47, 334–43 (2002). 129. B. Vallone, A. E. Miele, P. Vecchini, E. Chiancone, M. Brunori, Free energy of burying hydrophobic residues in the interface between protein subunits. Proc. Natl. Acad. Sci. U. S. A. 95, 6103–7 (1998). 130. A. David, M. J. E. Sternberg, The Contribution of Missense Mutations in Core and Rim Residues of Protein-Protein Interfaces to Human Disease. J. Mol. Biol. 427, 2886–98 (2015). 131. D. La, D. 
Kihara, A novel method for protein-protein interaction site prediction using phylogenetic substitution models. Proteins 80, 126–41 (2013). 132. B. Ma, M. Shatsky, H. J. Wolfson, R. Nussinov, Multiple diverse ligands binding at a single protein site: a matter of pre-existing populations. Protein Sci. 11, 184–97 (2002). 133. R. Pancsa, M. Fuxreiter, Interactions via intrinsically disordered regions: what kind of motifs? IUBMB Life 64, 513–20 (2012). 134. A. A. Russo, et al., Crystal structure of the p27Kip1 cyclin-dependent-kinase inhibitor bound to the cyclin A-Cdk2 complex. Nature 382, 325–331 (1996). 135. M. Fuxreiter, Fuzziness in Protein Interactions-A Historical Perspective. J. Mol. Biol. 430, 2278–2287 (2018). 136. V. N. Uversky, Unusual biophysics of intrinsically disordered proteins. Biochim. Biophys. Acta - Rev. Cancer 1834, 932–51 (2013). 137. P. Jemth, X. Mu, Å. Engström, J. Dogan, A frustrated binding interface for intrinsically disordered proteins. J. Biol. Chem. 289, 5528–33 (2014). 138. M. M. Babu, R. W. Kriwacki, R. V. Pappu, Versatility from Protein Disorder. Science 337, 1460–1461 (2012). 139. D. A. Merle, et al., Increased Aggregation Tendency of Alpha-Synuclein in a Fully Disordered Protein Complex. J. Mol. Biol. (2019) https://doi.org/10.1016/j.jmb.2019.04.031 (May 31, 2019). 140. S. Fukuchi, et al., IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature. Nucleic Acids Res. 40, D507–D511 (2012). 141. W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–637 (1983). 142. M. Okuda, M. Kinoshita, E. Kakumu, K. Sugasawa, Y. Nishimura, Structural Insight into the Mechanism of TFIIH Recognition by the Acidic String of the Nucleotide Excision Repair Factor XPC. Structure 23, 1827–1837 (2015). 143. J. Martin, Benchmarking protein-protein interface predictions: why you should care about protein size. 
Proteins 82, 1444–52 (2014). 144. Z. Dong, et al., CRF-based models of protein surfaces improve protein-protein interaction site predictions. BMC Bioinformatics 15, 277 (2014). 145. T. Wierschin, K. Wang, M. Welter, S. Waack, M. Stanke, Combining features in a graphical model to predict protein binding sites. Proteins 83, 844–52 (2015). 146. A. McCallum, K. Schultz, S. Singh, FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. Proc. 22nd Int. Conf. Neural Inf. Process. Syst., 1249–1257 (2009). 147. B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta - Protein Struct. 405, 442–451 (1975). 148. T. Kawabata, Detection of multiscale pockets on protein surfaces using mathematical morphology. Proteins 78, 1195–211 (2010). 149. T. Xie, et al., Structural basis for molecular interactions involving MRG domains: implications in chromatin biology. Structure 20, 151–60 (2012). 150. P. Zhang, et al., The MRG domain of human MRG15 uses a shallow hydrophobic pocket to interact with the N-terminal region of PAM14. Protein Sci. 15, 2423–34 (2006). 151. G. Cingolani, J. Bednenko, M. T. Gillespie, L. Gerace, Molecular basis for the recognition of a nonclassical nuclear localization signal by importin beta. Mol. Cell 10, 1345–53 (2002). 152. S. J. Lee, et al., The adoption of a twisted structure of importin-beta is essential for the protein-protein interaction required for nuclear transport. J. Mol. Biol. 302, 251–64 (2000). 153. R. Wintjens, et al., 1H NMR Study on the Binding of Pin1 Trp-Trp Domain with Phosphothreonine Peptides. J. Biol. Chem. 276, 25150–25156 (2001). 154. A. M. Petros, et al., Solution structure of the antiapoptotic protein bcl-2. Proc. Natl. Acad. Sci. U. S. A. 98, 3012–7 (2001). 155. L. M. Tuttle, et al., Gcn4-Mediator Specificity Is Mediated by a Large and Dynamic Fuzzy Protein-Protein Complex. Cell Rep. 22, 3251–3264 (2018). 156. P. S. 
Brzovic, et al., The acidic transcription activator Gcn4 binds the mediator subunit Gal11/Med15 using a simple protein interface forming a fuzzy complex. Mol. Cell 44, 942–53 (2011). 157. M. Zhang, T. Tanaka, M. Ikura, Calcium-induced conformational transition revealed by the solution structure of apo calmodulin. Nat. Struct. Biol. 2, 758–67 (1995). 158. M. R. Arkin, et al., Binding of small molecules to an adaptive protein-protein interface. Proc. Natl. Acad. Sci. U. S. A. 100, 1603–8 (2003). 159. C. D. Thanos, M. Randal, J. A. Wells, Potent Small-Molecule Binding to a Dynamic Hot Spot on IL-2. J. Am. Chem. Soc. 125, 15280–15281 (2003). 160. X. Wang, M. Rickert, K. C. Garcia, Structure of the quaternary complex of interleukin-2 with its alpha, beta, and gammac receptors. Science 310, 1159–63 (2005). 161. J. Besag, Statistical analysis of dirty pictures. J. Appl. Stat. 20, 63–87 (1993). 162. S. Maheshwari, M. Brylinski, Predicting protein interface residues using easily accessible on-line resources. Brief. Bioinform. 16, 1025–1034 (2015). 163. L. Lo Conte, C. Chothia, J. Janin, The atomic structure of protein-protein recognition sites. J. Mol. Biol. 285, 2177–98 (1999). 164. J. A. Wells, C. L. McClendon, Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450, 1001–9 (2007). 165. A. J. Souers, et al., ABT-199, a potent and selective BCL-2 inhibitor, achieves antitumor activity while sparing platelets. Nat. Med. 19, 202–208 (2013). 166. B. Rost, Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999). 167. S. F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–402 (1997). 168. Y. Huang, B. Niu, Y. Gao, L. Fu, W. Li, CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics 26, 680–682 (2010). 169. F. M. 
Richards, The interpretation of protein structures: total volume, group volume distributions and packing density. J. Mol. Biol. 82, 1–14 (1974). 170. E. Petsalaki, A. Stark, E. García-Urdiales, R. B. Russell, Accurate Prediction of Peptide Binding Sites on Protein Surfaces. PLoS Comput. Biol. 5, e1000335 (2009). 171. J. F. Gibrat, T. Madej, S. H. Bryant, Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6, 377–85 (1996). 172. K. Katoh, D. M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30, 772–780 (2013). 173. K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–66 (2002). 174. D. Xu, Y. Zhang, Generating Triangulated Macromolecular Surfaces by Euclidean Distance Transform. PLoS One 4, e8140 (2009). 175. M. Hendlich, F. Rippmann, G. Barnickel, LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J. Mol. Graph. Model. 15, 359–363 (1997). 176. L. Li, et al., DelPhi: a comprehensive suite for DelPhi software and associated resources. BMC Biophys. 5, 9 (2012). 177. D. Petrey, et al., Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53 Suppl 6, 430–5 (2003). 178. J. M. Word, S. C. Lovell, J. S. Richardson, D. C. Richardson, Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. 285, 1735–47 (1999). 179. S. Sarkar, et al., DelPhi Web Server: A comprehensive online suite for electrostatic calculations of biological macromolecules and their complexes. Commun. Comput. Phys. 13, 269–284 (2013). 180. A. D. MacKerell, et al., All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102, 3586–616 (1998). 181. M. Feig, J. 
Karanicolas, C. L. Brooks 3rd, MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology. J. Mol. Graph. Model. 22, 377–395 (2004). 182. F. B. Sheinerman, B. Honig, On the role of electrostatic interactions in the design of protein-protein interfaces. J. Mol. Biol. 318, 161–177 (2002). 183. N. Xiao, D.-S. Cao, M.-F. Zhu, Q.-S. Xu, protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31, 1857–9 (2015). 184. A. Bakan, L. M. Meireles, I. Bahar, ProDy: protein dynamics inferred from theory and experiments. Bioinformatics 27, 1575–7 (2011). 185. O. V. Tsodikov, M. Thomas Record, Y. V. Sergeev, Novel computer program for fast exact calculation of accessible and molecular surface areas and average surface curvature. J. Comput. Chem. 23, 600–609 (2002). 186. F. K. Pettit, E. Bare, A. Tsai, J. U. Bowie, HotPatch: a statistical approach to finding biologically relevant features on protein surfaces. J. Mol. Biol. 369, 863–79 (2007). 187. W. Jakob, M. Tarini, D. Panozzo, O. Sorkine-Hornung, Instant field-aligned meshes. ACM Trans. Graph. 34, 1–15 (2015). 188. J. Duchi, E. Hazan, Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). 189. J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference (Morgan Kaufmann, 1988) (February 16, 2019). 190. M. F. Sanner, A. J. Olson, J. C. Spehner, Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38, 305–20 (1996). 191. W. Humphrey, A. Dalke, K. Schulten, VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–8 (1996). 192. E. Roberts, J. Eargle, D. Wright, Z. Luthey-Schulten, MultiSeq: unifying sequence and structure data for evolutionary analysis. BMC Bioinformatics 7, 382 (2006). 193. H. Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag New York, 2016). 194. T. 
Sing, O. Sander, N. Beerenwinkel, T. Lengauer, ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005). 195. A. D. McLachlan, Rapid comparison of protein structures. Acta Crystallogr. Sect. A 38, 871–873 (1982). 196. A. Kuzmanic, B. Zagrovic, Determination of ensemble-average pairwise root mean-square deviation from experimental B-factors. Biophys. J. 98, 861–71 (2010). 197. B. Mészáros, I. Simon, Z. Dosztányi, Prediction of protein binding regions in disordered proteins. PLoS Comput. Biol. 5, e1000376 (2009). 174 198. K. J. Karczewski, et al., Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv, 531210 (2019). 199. A. Stein, D. M. Fowler, R. Hartmann-Petersen, K. Lindorff-Larsen, Biophysical and Mechanistic Models for Disease-Causing Protein Variants. Trends Biochem. Sci. 44, 575–588 (2019). 200. A. Saint Pierre, E. Génin, How important are rare variants in common disease? Brief. Funct. Genomics 13, 353–61 (2014). 201. Y. Choi, G. E. Sims, S. Murphy, J. R. Miller, A. P. Chan, Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012). 202. Y.-T. Lai, et al., Standing genetic variation as the predominant source for adaptation of a songbird. Proc. Natl. Acad. Sci. U. S. A. 116, 2152–2157 (2019). 203. X. Yi, et al., Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science (80-. ). 329, 75–78 (2010). 204. J. A. Tennessen, et al., Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–9 (2012). 205. G. V Kryukov, L. A. Pennacchio, S. R. Sunyaev, Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am. J. Hum. Genet. 80, 727–39 (2007). 206. E. Marouli, et al., Rare and low-frequency coding variants alter human adult height. Nature 542, 186–190 (2017). 207. A. Afanasyeva, M. 
Bockwoldt, C. R. Cooney, I. Heiland, T. I. Gossmann, Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome 175 Res. 28, 975–982 (2018). 208. S. Forcelloni, A. Giansanti, Evolutionary forces on different flavors of intrinsic disorder in the human proteome. bioRxiv, 653063 (2019). 209. A. David, R. Razali, M. N. Wass, M. J. E. Sternberg, Protein-protein interaction sites are hot spots for disease-associated nonsynonymous SNPs. Hum. Mutat. 33, 359–363 (2012). 210. V. Vacic, et al., Disease-Associated Mutations Disrupt Functionally Important Regions of Intrinsic Protein Disorder. PLoS Comput. Biol. 8 (2012). 211. A. Afanasyeva, M. Bockwoldt, C. R. Cooney, I. Heiland, T. I. Gossmann, Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res. 28, 975–982 (2018). 212. J. G. Tate, et al., COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019). 213. A. Deiana, S. Forcelloni, A. Porrello, A. Giansanti, Intrinsically disordered proteins and structured proteins with intrinsically disordered regions have different functional roles in the cell. PLoS One 14, e0217889 (2019). 214. H. Anbo, M. Sato, A. Okoshi, S. Fukuchi, Functional Segments on Intrinsically Disordered Regions in Disease-Related Proteins. Biomolecules 9, 88 (2019). 215. M. M. Babu, R. van der Lee, N. S. de Groot, J. Gsponer, Intrinsically disordered proteins: regulation and disease. Curr. Opin. Struct. Biol. 21, 432–440 (2011). 216. N. De Jay, et al., mRMRe: an R package for parallelized mRMR ensemble feature selection. Bioinformatics 29, 2365–8 (2013). 217. H. Nishi, J. Nakata, K. Kinoshita, Distribution of single-nucleotide variants on protein-protein interaction sites and its relationship with minor allele frequency. Protein Sci. 25, 176 316–21 (2016). 218. N. Malhis, S. J. M. Jones, J. Gsponer, Improved measures for evolutionary conservation that exploit taxonomy distances. 
Nat. Commun. 10, 1556 (2019). 219. D. Vitkup, C. Sander, G. M. Church, The amino-acid mutational spectrum of human genetic disease. Genome Biol. 4, R72 (2003). 220. H. Nishi, et al., Cancer Missense Mutations Alter Binding Properties of Proteins and Their Interaction Networks. PLoS One 8, e66273 (2013). 221. S. Malhotra, et al., Understanding the impacts of missense mutations on structures and functions of human cancer-related genes: A preliminary computational analysis of the COSMIC Cancer Gene Census. PLoS One 14, e0219935 (2019). 222. J. Gsponer, M. E. Futschik, S. A. Teichmann, M. M. Babu, Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science (80-. ). 322, 1365–1368 (2008). 223. C. Nilofer, A. Sukhwal, A. Mohanapriya, P. Kangueane, Protein-protein interfaces are vdW dominant with selective H-bonds and (or) electrostatics towards broad functional specificity. Bioinformation 13, 164–173 (2017). 224. D. Ganguly, et al., Electrostatically accelerated coupled binding and folding of intrinsically disordered proteins. J. Mol. Biol. 422, 674–684 (2012). 225. C. Dincer, T. Kaya, O. Keskin, A. Gursoy, N. Tuncbag, 3D spatial organization and network-guided comparison of mutation profiles in Glioblastoma reveals similarities across patients. PLOS Comput. Biol. 15, e1006789 (2019). 226. K. Meyer, et al., Mutations in Disordered Regions Can Cause Disease by Creating Dileucine Motifs. Cell 175, 239-253.e17 (2018). 177 227. H. B. Engin, J. F. Kreisberg, H. Carter, Structure-Based Analysis Reveals Cancer Missense Mutations Target Protein Interaction Interfaces. PLoS One 11, e0152929 (2016). 228. Y. Mahlich, et al., Common sequence variants affect molecular function more than rare variants? Sci. Rep. 7, 1608 (2017). 229. S. Mitternacht, FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5, 189 (2016). 230. M. Magrane, U. 
Consortium, UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, bar009 (2011). 231. D. Piovesan, et al., MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Res. 46, D471–D476 (2018). 232. M. Gouw, et al., The eukaryotic linear motif resource – 2018 update. Nucleic Acids Res. 46, D428–D434 (2018). 233. I. Walsh, A. J. M. Martin, T. Di Domenico, S. C. E. Tosatto, ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 28, 503–509 (2012). 234. D. R. Zerbino, et al., Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018). 235. S. El-Gebali, et al., The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019). 236. H. M. Berman, et al., The Protein Data Bank. Nucleic Acids Res 28, 235–242 (2000). 237. J. a Capra, R. a Laskowski, J. M. Thornton, M. Singh, T. a Funkhouser, Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput. Biol. 5, e1000585 (2009). 238. J. Fernández-Recio, M. Totrov, R. Abagyan, Identification of protein-protein interaction sites from docking energy landscapes. J. Mol. Biol. 335, 843–865 (2004). 178 239. A. Shanthini, G. Vinodhini, R. M. Chandrasekaran, P. Supraja, A taxonomy on impact of label noise and feature noise using machine learning techniques. Soft Comput. 23, 8597–8607 (2019). 240. M. Trellet, A. S. J. Melquiond, A. M. J. J. Bonvin, A Unified Conformational Selection and Induced Fit Approach to Protein-Peptide Docking. PLoS One 8 (2013). 241. T. Khan, G. M. Douglas, P. Patel, A. N. Nguyen Ba, A. M. Moses, Polymorphism Analysis Reveals Reduced Negative Selection and Elevated Rate of Insertions and Deletions in Intrinsically Disordered Protein Regions. Genome Biol. Evol. 7, 1815–1826 (2015). 179 Appendices Appendix A MoRF-train and MoRF-test Datasets. A.1 MoRF-test Reference and Structure Annotations. 
A.2 MoRF-train Reference and Structure Annotations.
Appendix B List of Features Used and Analyzed for IDRBind.
Each feature name consists of the name of the original feature, followed by an underscore and a descriptor of the aggregation function applied.
Feature name: Description
B-factor-estimate: Square fluctuation prediction of ProDy
B-factor-estimate_absAvg9: Absolute values of B-factor-estimates in a 9Å surface patch averaged
B-factor-estimate_maxhalf12Z: Sort B-factor-estimates in a 12Å surface patch, take average of the top half of the scores, and calculate Z-score over protein surface residues
B-factor-estimate_Z: Z-score of B-factor-estimates over protein surface residues
Conservation: Evolutionary conservation
Conservation_min6Z: Take the lowest Conservation value in 6Å surface patch, and calculate Z-score over protein surface residues
Conservation_min9Z: Take the lowest Conservation value in 9Å surface patch, and calculate Z-score over protein surface residues
Conservation_Z: Z-score of Conservation over protein surface residues
Conservation_allZ: Z-score of Conservation over surface and buried residues
Curvature: Curvature score from Surface Racer program
Curvature_avg12: Curvature values in a 12Å surface patch averaged
Curvature_Z: Z-score of Curvature score over protein surface residues
Field: Electrostatic field from DelPhi mapped to protein residues
FieldAtom_max12: Maximum value of FieldAtom (see SI Supplementary Methods) scores in 12Å surface patch
FieldAtom_min12: Minimum value of FieldAtom (see SI Supplementary Methods) scores in 12Å surface patch
FieldToSurface: See SI Supplementary Methods
Groove: Groove score calculated with 6 line tracings
Groove_maxhalf6: Sort Groove scores in a 6Å surface patch, take average of the top half of the scores
Groove_Z: Z-score of Groove score over protein surface residues
GrooveFine: Groove score calculated with 13 line tracings
GrooveFine_Z: Z-score of GrooveFine score over protein surface residues
Pattern1: See SI Supplementary Methods
Pattern1_Z: Z-score of Pattern1 score over protein surface residues
Pattern2_min9: Minimum value of Pattern2 (see SI Supplementary Methods) scores in 9Å surface patch
Pattern3_Z: Z-score of Pattern3 (see SI Supplementary Methods) score over protein surface residues
Pattern4_max9: Maximum value of Pattern4 (see SI Supplementary Methods) scores in 9Å surface patch
Pattern5: See SI Supplementary Methods
Pattern5_minhalf9: Sort Pattern5 scores in a 9Å surface patch, take average of the lower half of the scores
Pattern5_Z: Z-score of Pattern5 score over protein surface residues
PCA1: Mapping of each residue to principal component 1 of amino acid indices, which correlates with hydrophobicity
PCA1_13: PCA1 (hydrophobicity) values in a 13Å surface patch averaged
PCA1_avg6Z: Take average of PCA1 values in a 6Å surface patch, and calculate Z-score over protein surface residues
PCA1_Z: Z-score of PCA1 score over protein surface residues
PCA2_13: PCA2 values in a 13Å surface patch averaged
PCA2_max6: Maximum value of PCA2 scores in 6Å surface patch
PCA2_Z: Z-score of PCA2 score over protein surface residues
PCA3: See methods
PCA3_absAvg12: Absolute values of PCA3 in a 12Å surface patch averaged
PCA3_minhalf12Z: Sort PCA3 scores in a 12Å surface patch, take average of the lower half of the scores, and calculate Z-score over protein surface residues
PCA3_Z: Z-score of PCA3 score over protein surface residues
PCA4: See methods
PCA4_13: PCA4 values in a 13Å surface patch averaged
PCA4_max12: Maximum value of PCA4 scores in 12Å surface patch
PCA4_max9Z: Take the maximum PCA4 value in 9Å surface patch, and calculate Z-score over protein surface residues
PCA5: See methods
PCA5_13: PCA5 values in a 13Å surface patch averaged
PCA5_absAvg12: Absolute values of PCA5 in a 12Å surface patch averaged
PCA5_min12: Minimum value of PCA5 scores in 12Å surface patch
Potential: Electrostatic potential from DelPhi mapped to protein residues
Potential_Z: Z-score of Potential score over protein surface residues
PotentialAbs: Take the absolute values of Potential scores
PotentialAbs_Z: Take the absolute values of Potential scores, and calculate Z-score over protein surface residues
PotentialAbs_avg9Z: Take average of the absolute value of Potential scores in a 6Å surface patch, and calculate Z-score over protein surface residues
rASA: Normalized solvent accessible surface area from Areaimol
rASA_absAvg12: Absolute values of rASA in a 12Å surface patch averaged
rASA_Z: Z-score of rASA score over protein surface residues
Roughness: Roughness score from rufness program in the HotPatch package
Roughness_max12: Maximum value of Roughness scores in 12Å surface patch
Roughness_Z: Z-score of Roughness score over protein surface residues
Appendix C The IDRBind Server.
PDB files containing one or more protein chains can be submitted to the IDRBind server. The results are available by email or on the public jobs queue page.
C.1 The Submission Page of IDRBind Server.
C.2 The Public Jobs Queue Page of IDRBind Server. "@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2020-05"@en ; edm:isShownAt "10.14288/1.0387125"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Biochemistry and Molecular Biology"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@* ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@* ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Prediction and characterization of protein–protein interfaces that bind intrinsically disordered protein regions"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/72797"@en .