UBC Theses and Dissertations

Feature analysis and in silico prediction of lower solubility proteins in three eukaryotic model systems Chan, Gerard 2015

Feature analysis and in silico prediction of lower solubility proteins in three eukaryotic model systems

by

Gerard Chan

B.Sc. (Hons.) in Life Sciences, National University of Singapore, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Genome Science and Technology)

The University of British Columbia
(Vancouver)

April 2015

© Gerard Chan, 2015

Abstract

Regulation of protein solubility, or the ability of proteins to remain soluble within the cell, is an important part of protein homeostasis. This is highlighted by the association of disrupted protein homeostasis and dysregulated solubility with various neurodegenerative diseases. Using quantitative mass spectrometry and computational analyses, we identify low solubility proteins under unstressed conditions in three eukaryotic model systems: yeast cells, human neuroblastoma cells, and mouse brain tissue. Using an internal reference, we account for protein abundance, which allows proteins to be analyzed based on their partitioning between the soluble and insoluble fractions, rather than purely on their abundance within the insoluble fraction. We identified several intrinsic traits, such as length, disorder, abundance, molecular recognition features, and low complexity regions, which are correlated with protein solubility. These features have previously been shown to be associated with protein-protein interactions. This suggests that, under unstressed conditions, lower solubility in proteins may be linked to functional aggregation rather than aberrant aggregation. We then present two predictors, built using the many traits examined in this work, which may be used to predict the in vivo solubility of proteins. The linear regression model is able to give estimates of protein solubility, although proteins near the threshold between low and normal solubility may be misclassified.
The Support Vector Machine is able to reliably distinguish between low and high solubility proteins, but is unable to reliably distinguish low and normal solubility proteins. We have identified several traits that distinguish low solubility proteins from other proteins, as well as developed two models that are able to estimate the solubility of proteins.

Preface

The majority of the work presented in this thesis has been published in Albu et al., 2014. Mice used in this study were grown by members of the Johnson lab, with brain tissues harvested by Taghizadeh, Hu, and Mehran. Preparation of the human and mouse biological samples and GO analysis were done by Dr. Razvan Albu. Preparation of the yeast biological samples, RNase experiments, Western Blots, and generation of the amino acid compass were performed by Mang Zhu. Source code for the amino acid compass was written by Alex Ng and previously published in Ng et al., 2013. Processing of data for CAI, secondary structure, ANCHOR MoRFs, ELMs, IUPred, Pfam, and disulfide bond prediction was performed by Eric Wong. Analysis of all other protein properties, generation of boxplots, and generation of models were carried out by Gerard Chan.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
1.1 Protein homeostasis
1.2 Aggregation and diseases
1.3 Functional aggregation
1.4 Predicting aggregation
1.5 Aims and scope of project
2 Methods
2.1 Quantitative proteomic mass spectrometry
2.1.1 Biological sample preparation
2.1.2 Sample preparation and offline fractionation
2.1.3 Liquid chromatography-tandem mass spectrometry
2.2 Biochemical assays
2.3 Computational analysis
2.3.1 Identification of lower solubility proteins
2.3.2 Plotting and statistical analyses
2.3.3 Protein properties
2.4 Models for prediction of lower solubility propensity
2.4.1 Multiple regression model
2.4.2 Support vector machine
3 Results
3.1 Identification and feature analysis of low solubility proteins
3.1.1 Isolation of lower solubility proteins
3.1.2 Identification of lower solubility proteins
3.1.3 Lower solubility (LS) proteins are longer than higher solubility (HS) proteins
3.1.4 LS proteins are predicted to be more aggregation prone in yeast but not in human or mouse
3.1.5 LS proteins contain biases for particular amino acids
3.1.6 LS proteins are more highly charged and are less hydrophobic
3.1.7 Choice of detergent does not significantly affect solubility of LS proteins
3.1.8 LS proteins are more disordered and contain more molecular recognition features (MoRFs) and eukaryotic linear motifs (ELMs) than higher solubility (HS) proteins
3.1.9 RNase treatment increases the solubility of RNA associated proteins, but does not affect the overall properties of low solubility proteins
3.1.10 Coding sequences for LS proteins contain a lower GC content in yeast
3.1.11 LS proteins possess numerous traits that distinguish them from HS proteins
3.2 Models to predict protein solubility
3.2.1 Multiple regression model
3.2.2 Support vector machine
4 Discussion
4.1 Ratios obtained from quantitative mass spectrometry are not directly indicative of absolute ratios
4.2 Feature analysis of LS proteins highlights differences between organisms and points to association between functional aggregation and low solubility
4.2.1 Analysis of features of LS proteins highlights inter-organism differences
4.2.2 LS proteins possess distinct features that differentiate them from other proteins
4.2.3 LS proteins may be involved in functional aggregation
4.3 Generation of models to predict solubility of proteins
4.4 Future work
5 Conclusion
Bibliography
A Supporting Materials

List of Tables

Table 3.1 MSE and correlation coefficients from each of the five iterations of cross-validation, as well as the average values obtained.
Table 3.2 Estimates of the coefficients from the regularized multiple regression model.
Table 3.3 The prediction performance of the SVM on distinguishing LS and HS proteins.
Table 3.4 The prediction performance of the SVM on distinguishing LS and non-LS proteins.
Table A.1 GO analysis (biological processes) for yeast LS proteins
Table A.2 GO analysis (molecular function) for yeast LS proteins
Table A.3 GO analysis (biological processes) for human LS proteins
Table A.4 GO analysis (molecular function) for human LS proteins
Table A.5 GO analysis (biological processes) for mouse LS proteins
Table A.6 GO analysis (molecular function) for mouse LS proteins
Table A.7 Analysis for enrichment of Pfam domains for yeast LS proteins
Table A.8 Analysis for enrichment of Pfam domains for human LS proteins
Table A.9 Analysis for enrichment of Pfam domains for mouse LS proteins
Table A.10 Table of p-values for feature analysis of yeast proteins
Table A.11 Table of p-values for feature analysis of human proteins
Table A.12 Table of p-values for feature analysis of mouse proteins
Table A.13 Enrichment analysis for low complexity regions in yeast LS proteins relative to NS proteins
Table A.14 Enrichment analysis for low complexity regions in human LS proteins relative to NS proteins
Table A.15 Enrichment analysis for low complexity regions in mouse LS proteins relative to NS proteins
Table A.16 GO analysis (biological processes) for yeast LS proteins without RNase treatment
Table A.17 GO analysis (biological processes) for yeast LS proteins with RNase treatment

List of Figures

Figure 3.1 Overview of the approach used to identify lower solubility proteins.
Figure 3.2 Analysis of raw LC-MS/MS quantification data and validation of low solubility delimitation.
Figure 3.3 Average solubilities of proteins within a complex
Figure 3.4 Abundance of LS proteins in this study and compared to previous studies
Figure 3.5 Comparison of the lengths of proteins by randomly picking versus averaging over protein groups
Figure 3.6 Aggregation propensity as predicted by TANGO and AGGRESCAN
Figure 3.7 Number of low complexity regions (LCRs) per unit length of proteins in each bin, as found on LPS-annotate
Figure 3.8 Amino acid composition of proteins in the LS and HS bins of yeast, human and mouse samples
Figure 3.9 Percentage abundance of particular types of amino acids
Figure 3.10 Number of phosphorylation sites and disulfide bonds in each of the three model organisms
Figure 3.11 Analysis of net charge and hydrophobicity of proteins
Figure 3.12 Comparison of results obtained using the two detergents NP40 and Triton X-100
Figure 3.13 Comparison of protein percentage disorder as predicted by DISOPRED and IUPRED
Figure 3.14 Number of disordered regions in each protein, and the number of MoRFs within each such region
Figure 3.15 Number of molecular recognition features (MoRFs) within each protein, as predicted by the ANCHOR database
Figure 3.16 Average number of MoRFs as determined by ANCHOR within each disordered patch
Figure 3.17 Number of eukaryotic linear motifs (ELMs) within each protein, as predicted by the ANCHOR database
Figure 3.18 Analysis of RNase treatment on protein solubility
Figure 3.19 Percentage of coding sequences of each protein comprised of GC
Figure 3.20 Illustration of SVM hyperplanes
Figure 4.1 SVM and distinguishing multiple subtypes of a class

Glossary

AAindex  amino acid index
AD  Alzheimer's disease
AUC  area under curve
CAI  codon adaptation index
ELMs  eukaryotic linear motifs
FDR  false discovery rate
GO  gene ontology
GRAVY  grand average of hydrophobicity
HA  hemagglutinin
HPLC  high performance liquid chromatography
HS  higher solubility
IPOD  insoluble protein deposit
JUNQ  juxtanuclear quality control
LC  liquid chromatography
LCRs  low complexity regions
LPS  lowest-probability subsequences
LOWESS  locally weighted scatterplot smoothing
LS  lower solubility
LTQ  linear-trapping quadrupole
MCC  Matthews correlation coefficient
MoRFs  molecular recognition features
MS  mass spectrometry
MSE  mean squared error
NS  normal solubility
PD  Parkinson's disease
RSPG  random sampling of proteins within groups
SGD  Saccharomyces Genome Database
SILAC  stable isotope labeling by amino acids in cell culture
SVM  support vector machine
UPS  ubiquitin proteasome system

Acknowledgments

First and foremost, I would like to thank my supervisor Professor Thibault Mayor as well as my committee members Professor Jörg Gsponer and Professor Christopher Loewen.
You all have given me a great deal of guidance on technical matters as well as soft skills such as keeping track of the big picture while not losing sight of small details. Without your patience and encouragement, this project would not have been possible.

Professor Jörg Gsponer, your guidance and supervision on the bioinformatics of this project have been crucial to the development of the models in this project. The feedback and advice have played an important role in guiding me in this endeavor. It would not have been possible without you.

Many thanks go out to Dr Razvan Albu and Mang Zhu. Your preparation of the mass spectrometry (MS) samples used in this study, upstream data processing of MS spectra, and execution of the biochemical assays form an indispensable part of this project. Numerous discussions with you have helped shape this project into the work it currently is.

Eric Wong and Alex Cumberworth deserve many thanks for their invaluable advice with regard to bioinformatic tools and best practices for programming. Prior to my rotation in the Gsponer lab, I had no experience with programming. With your help and guidance, I was able to understand the fundamentals of programming and gain the confidence to pursue a project that heavily utilised this skill set.

Nawar Malhis has been instrumental in giving a lot of helpful advice on the building of the model. Your guidance has helped develop the model beyond the simple linear model and helped to refine it.

All the members of the Mayor, Gsponer and Loewen labs have also provided numerous informal discussions. The constant feedback and examination of ideas and approaches have been crucial in bringing the project from its nascent state to what it is today.

The Genome Science and Technology program has provided me with a stable yet flexible platform to explore the many options available to me.
The strong culture of collaboration and the numerous opportunities to build a network of contacts with a variety of areas of expertise are excellent. The wide breadth it provides has set it apart and proven to be valuable in my development.

I would like to thank all the administrators and support staff in the GSAT program, the NCE, as well as the MSL. Your contributions and dedication have provided a stable and conducive environment without which all of the work in this thesis would not have been possible.

My friends, old and new, have been a huge boon to me these past years. From old friends I have known for over a decade, to new friends I have gotten to know in Canada, you have all played a huge part in my life and made it that much richer.

I would also like to thank my parents, Richard and Lucy, as well as my brother Alvin, and my fiancée Rui Qing, for their support. Moving to a different continent to pursue my graduate studies was a huge step for me, and your encouragement and support were invaluable to me as I progressed through this phase of my life. Without you, I would not be the person I am today.

Chapter 1

Introduction

The introduction of this thesis will cover several areas.
• Section 1.1: Protein homeostasis
• Section 1.2: Aggregation and diseases
• Section 1.3: Functional aggregation
• Section 1.4: Predicting aggregation
• Section 1.5: Aims and scope of project

1.1 Protein homeostasis

Protein homeostasis, also known as proteostasis, is crucial to the well-being of cells. Given the high concentration of biological molecules such as proteins within cells, misfolded or damaged proteins present considerable risks. The folding of a protein can be disrupted during synthesis or even after it has attained its native conformation. Factors such as mutations, translation errors and stresses, including but not limited to extreme temperatures, pressure, and pH, can cause a protein to misfold and potentially form amyloid or amorphous aggregates [18, 41].
This is detrimental to the cell due to the loss of function [76] as well as the potentially toxic nature of amyloid and amorphous aggregates [92], which have the ability to form non-native interactions with cellular machinery and impair its function [14].

Cells rely on the protein quality control network to prevent the accumulation of aberrant protein species, either by refolding them with the help of chaperones [66] or by disposing of them via proteolysis. The ubiquitin proteasome system (UPS) plays a major role in clearing aberrant proteins in the cell [22], targeting them to the proteasome for degradation via covalent attachment of ubiquitin [67, 73]. Failure of the UPS to effectively clear these proteins can lead to detrimental outcomes brought about by the accumulation of aberrant proteins [10, 59, 116].

Another means by which cells address the issue of aberrant proteins is by sequestering them within quality control compartments [114] such as aggresomes [62, 69], Q-bodies [39], and the juxtanuclear quality control (JUNQ) and insoluble protein deposit (IPOD) compartments [82]. These compartments may then be cleared by macroautophagy [61, 69] or by asymmetrical partitioning of these structures upon cell division [2, 15, 115, 134].

Macroautophagy is one mechanism by which cells can dispose of aberrant proteins sequestered in quality control compartments. The body to be disposed of is engulfed in a double membrane to form the autophagosome, which then fuses with the lysosome, resulting in the degradation of autophagosomal contents by lysosomal enzymes [74]. The ability of processes such as macroautophagy to maintain homeostasis is known to decline with age, contributing to age-related neurodegenerative diseases [124].

1.2 Aggregation and diseases

Misfolded proteins have the potential to assemble into large, insoluble structures held together by hydrophobic intermolecular interactions. Such structures can be classified into amyloid or amorphous aggregates.
Amyloids display a characteristic fibrillar structure consisting of β-sheets whose strands run perpendicular to the axis of the fibrils [34]. Studies have shown that short protofibrils in the early stages of fibril formation may in fact be more toxic than mature fibrils [14]. In contrast to the ordered structure of amyloid aggregates, amorphous aggregates are assemblies that do not contain such ordered intermolecular bonds [133].

When the numerous quality control mechanisms designed to dispose of aberrant proteins and mitigate the damage they cause are overcome, various pathologies can arise. Protein aggregation has been associated with more than 40 diseases in humans [19, 106]. Of these, neurodegenerative diseases display some of the most crippling symptoms, which has made them the focus of intense research efforts. α-Synuclein has been associated with Parkinson's disease (PD) [85] and amyloid-beta fibrils with Alzheimer's disease (AD) [130], with recent studies highlighting other inclusions and their associations with various pathologies [129]. Many protein deposits in various disease contexts contain ubiquitin [6, 79], suggesting that they were targeted for degradation but somehow managed to evade the quality control pathways in the cell. Studies have shown [80] that disease-associated protein aggregates may be able to act as a nucleus for the aggregation of endogenous proteins, potentially allowing for the propagation of the disease state to otherwise healthy cells [7, 8, 46, 52, 56, 104]. Some have proposed a model in which all proteins are theoretically able to form amorphous or amyloid aggregates, but certain proteins simply possess a higher propensity to form them under a given set of conditions [20, 68]. Certain inherent traits, such as stretches of high hydrophobicity, high beta-sheet propensity, and low charge, are associated with a higher propensity to form amorphous or amyloid aggregates [21].
Transfer of these stretches from an amyloidogenic protein domain to a non-amyloidogenic protein has been shown to induce aggregation [123]. Improving our understanding of protein aggregation and solubility will be important for the development of therapies for proteopathies.

1.3 Functional aggregation

While many amorphous and amyloid aggregates have negative consequences for cells, functional aggregates are a class of aggregates that are part of normal cellular processes. Amyloid fibrils, characterized by their fibrillar cross β-sheet structure, have commonly been thought to be detrimental. However, they have been shown to be utilized by bacteria and fungi as a structural component, due to the high yield strength and protease-resistant nature of amyloids [44]. p53 is an example of a well-known protein that can form functional aggregates as part of its normal function, existing as a homotetramer in its active form [94, 95]. Some peptide and secretory hormones utilize the optimized packing of amyloid-like cross β-sheet rich conformations for their storage [81]. Other proteins such as TIA-1 in yeast [48], ataxin-1 in humans [94], and Pumilio in flies [110] have also been shown to be able to form functional aggregates. Functional aggregates have also been associated with other functions such as epigenetic inheritance [111] and formation of stress granules [48]. This highlights that although aggregation can be a detrimental scenario that cells need to manage, it can also serve a functional role in cells. Interestingly, recent studies have suggested that functional and dysfunctional aggregation are indeed promoted via similar forces, and that regulation of these forces is crucial for maintaining the balance between these two competing pathways [94, 96].

Several traits have been associated with the ability of proteins to form functional aggregates.
Low complexity regions in proteins such as TIA1, FUS, CIRBP, RBM3, hnRNPA1, hnRNPA2 and SUP35 have been shown to be necessary and sufficient to cause aggregation of the proteins [64]. The work of Kato et al. showed that truncations of RNA-binding proteins that lacked the RNA binding domains and contained only their low complexity domains were capable of forming hydrogels, networks of interacting proteins with an aqueous phase contained within [3]. Truncations lacking the low complexity regions, in contrast, did not display the ability to form hydrogels. The work by Salazar et al. highlights how Q/N rich regions are important for the regulation of Pumilio function. In the absence of the Q/N rich region, the suppression of toxicity caused by Pumilio expression was not observed. Disordered proteins have been associated with the formation of functional assemblies at Woronin bodies in filamentous fungi [72]. The family of proteins known as septal pore-associated proteins, which localize with Woronin bodies, are highly charged and enriched in amino acids typically found in disordered proteins. Low complexity regions as well as disorder have thus been associated with lower solubility and functional aggregation.

1.4 Predicting aggregation

Due to the pathological association of amyloid aggregates with disease, many amyloid aggregation predictors have been developed [23, 27, 40, 60, 83, 89, 93, 118, 119, 135]. TANGO [40] makes predictions on the aggregation propensity of proteins by calculating the partitioning of the segments of the protein between the aggregation state and the non-aggregation state. AGGRESCAN [23] utilizes experimentally derived aggregation propensities [30] and considers local stretches in proteins to determine aggregation propensity. Both were shown to reliably identify known aggregation-prone proteins. Given that not all aggregates are amyloid in nature, we wanted to explore and characterize a wider range of potentially aggregating proteins.
While many studies have characterized aggregation under stress conditions such as heat shock [87, 88, 131] and proteasomal inhibition [128], we were interested in studying protein aggregation under steady state conditions. Using insights gleaned from the analysis of protein solubility, we decided to build a model that would then be able to predict solubility under unstressed conditions in silico.

1.5 Aims and scope of project

We hypothesized that even under unstressed conditions in cells, some proteins are more prone to lower solubility than others, and that there are specific traits that distinguish lower solubility proteins from other proteins. Using quantitative mass spectrometry (MS), we have identified low solubility proteins in three eukaryotic model organisms: budding yeast, human neuroblastoma tissue culture cells, and mouse brain tissue. Analysis of these low solubility proteins highlights traits that draw a link to functional aggregation and macromolecular assemblies. Using these traits, two models (a linear model and a support vector machine) were built to enable the in silico identification of low solubility proteins.

The aims of the project are as follows:
• To identify proteins more prone to low solubility
• To identify traits that distinguish low solubility proteins from other proteins
• To use traits identified to build a model capable of predicting low solubility propensity

Chapter 2

Methods

Samples from three model systems were prepared and analyzed by quantitative proteomic MS. Computational and bioinformatic analysis of the proteins identified allowed the identification of certain traits that correlate with protein solubility.
Fitting the data obtained to supervised learning models allowed for the prediction of the solubility of a protein based on its properties.

The methods section of this thesis will cover several areas.
• Section 2.1: Quantitative proteomic mass spectrometry
• Section 2.2: Biochemical assays
• Section 2.3: Computational analysis
• Section 2.4: Models for prediction of lower solubility propensity

2.1 Quantitative proteomic mass spectrometry

Work in the following section was carried out by Dr Razvan F. Albu and Mang Zhu. Full details of the methods used can be found in [4].

2.1.1 Biological sample preparation

The biological samples from each of the three model organisms were prepared using native lysis to allow for the identification of proteins within the lower solubility fraction. Denaturing lysis would prevent the identification of proteins within the lower solubility fraction.

Yeast

Yeast stable isotope labeling by amino acids in cell culture (SILAC) strains were labeled with light or heavy arginine and lysine residues in order to carry out a quantitative mass spectrometry analysis. Cells were grown for at least 7 generations at 25 °C. Cultures were grown to mid log phase (OD600 0.8-1) before harvesting and lysis in lysis buffer (100 mM HEPES, 250 mM KCl, 1 mM PMSF, 1× PIC, 1 mM phenanthroline, 10 mM chloroacetamide, and 1% Triton X-100). To verify that detergent choice did not significantly influence results, the experiment was repeated with Triton X-100 replaced by 1% Igepal CA-630 and 0.5% deoxycholate. Lysate was pre-cleared before detergent-insoluble proteins were pelleted by centrifugation at 16,000 rcf at 4 °C for 15 min. The pellet was washed twice before resuspension. The protein concentrations of both the supernatant and pellet fractions were measured using the DC Protein Assay (BioRad). For the RNase treatment, RNase (Roche) was used at a final concentration of 20 µg/ml.
RNA extraction was carried out using TRIzol® Reagent (Life Science Technologies) according to the manufacturer's protocol. Effectiveness of RNase treatment was validated by agarose gel electrophoresis.

Human cells

Human neuroblastoma tissue culture cells (SH-SY5Y) were labeled with light or heavy arginine or lysine residues for a SILAC analysis. Cells were grown at 37 °C for at least 11 divisions (determined by cell counting). Confluent cells were harvested and lysed in lysis buffer (50 mM TrisHCl, pH 8.5, 150 mM NaCl, 0.5% Na-deoxycholate, 1% Igepal CA-630, 1× PIC, 1 mM PMSF, 1 mM phenanthroline, 0.5 mM DTT). Lysate was pre-cleared before detergent-insoluble proteins were pelleted by centrifugation at 50,000 rcf for 1 h at 4 °C. The pellet was washed twice before resuspension. The supernatant was subjected to methanol chloroform precipitation and the extracted proteins were resuspended in HU buffer (8 M urea, 100 mM HEPES, pH 8.0). Protein concentrations were measured using the DC Protein Assay (BioRad).

Mouse brain tissue

Female C57BL/6 mice (non-littermates) were raised to 11 weeks of age, after which brain tissue was harvested. Harvested tissue was flash frozen in liquid N2 and then lysed by cryogrinding. Brain samples from three mice were pooled for analysis. After resuspension in lysis buffer, lysate was pre-cleared by centrifugation and detergent-insoluble proteins were pelleted by centrifugation at 50,000 rcf for 1 h at 4 °C. The pellet was washed twice before resuspension. The supernatant was subjected to methanol chloroform precipitation and the extracted proteins were resuspended in HU buffer. Protein concentrations were measured using the DC Protein Assay (BioRad).

2.1.2 Sample preparation and offline fractionation

Yeast samples were subjected to in-gel trypsin digestion, while human and mouse samples were subjected to in-solution trypsin digestion.
The heavy-labeled insoluble pellet fraction and light-labeled soluble supernatant fraction from each of the organisms were mixed in a 1:1 ratio by mass. For experiments involving RNase, light-labeled soluble supernatant from cells untreated with RNase was mixed with medium- and heavy-labeled insoluble pellets (without and with RNase treatment, respectively) in a 1:1:1 ratio by mass (as shown in Figure 3.18a). Mouse brain samples were labeled after tryptic digestion using formaldehyde-cyanoborohydride, attaching a 28 Da (light) or a 32 Da (heavy) moiety to primary amines as described in [113]. Consistent with the labeling scheme for human and yeast, the insoluble pellet fraction was labeled heavy and the soluble supernatant fraction was labeled light.

Approximately 200 µg of tryptic peptides were fractionated by offline high pH reverse-phase chromatography. 96 fractions of 40 seconds each were collected and pooled in a non-contiguous manner [120, 132], with 9 pooled fractions for yeast and 10 pooled fractions for human and mouse. Yeast label swap experiments and experiments comparing light- and heavy-labeled insoluble pellets did not undergo an offline fractionation step.

2.1.3 Liquid chromatography-tandem mass spectrometry

Each of the fractions prepared in 2.1.2 was analyzed using liquid chromatography (LC)-MS/MS (tandem mass spectrometry) on a linear-trapping quadrupole (LTQ) Orbitrap Velos (Thermo) coupled to an Agilent 1100 Series Nanoflow high performance liquid chromatography (HPLC) system as described in [4].

Spectra were searched by the ANDROMEDA algorithm, in the MaxQuant environment (version 1.5.0.0), against the Saccharomyces Genome Database (SGD) [17] for yeast (Feb 3, 2011), and the UniProt human [117] (Apr 16, 2014) and mouse (Feb 19, 2014) databases. The search was configured largely using the default MaxQuant parameters. We allowed for a 1% false discovery rate at both the peptide and protein level.

2.2 Biochemical assays

Work in the following section was carried out by Mang Zhu.
Full details of the methods can be found in [4].

Western blotting

Two proteins each from the lower solubility (LS) and higher solubility (HS) bins were selected. The endogenous copies of these proteins were tagged with a triple-hemagglutinin (HA) tag amplified from parent vector pFA6a-3HA-His3MX6. The pellet and supernatant fractions were separated as detailed in 2.1.1 and normalized in a 10:1 ratio. Samples were resolved on 4%-20% gradient gels (BioRad) and transferred onto 0.45 µm nitrocellulose membranes (BioRad). Immunodetection was carried out using anti-HA primary antibodies (1:2000 dilution, 12CA5, AbLab) and LI-COR secondary antibodies (1:10000 dilution, LI-COR Biosciences). Images were acquired using the CLx Odyssey (LI-COR Biosciences) and quantification was carried out using Image Studio v3.1 (LI-COR Biosciences).

2.3 Computational analysis

Processing of data for CAI, secondary structure, ANCHOR MoRFs, ELMs, IUPred, Pfam, and disulfide bond prediction was performed by Eric Wong.

2.3.1 Identification of lower solubility proteins

From the proteins identified via MaxQuant search results in 2.1.3, proteins flagged as contaminants or reverse hits were removed, as well as proteins identified by fewer than two peptides. Proteins that did not have a reported quantification value due to inconsistent labeling or orphaned analyte issues were also removed. Duplicate database entries with identical amino acid sequences were removed. The quantification ratios were ranked and a plot of log2 ratios against rank (highest to lowest) was generated (Figure 3.2a-c). To separate proteins into different categories based on their solubility, we first devised a method to determine appropriate cutoffs. A smoothed curve was generated by application of a locally weighted scatterplot smoothing (LOWESS) function (python module statsmodels, lowess function, 3 iterations, no linear interpolation, 0.667 of data used for each y-value estimate) and the 10% of points with the lowest gradients were chosen.
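The selection of the flattest part of the smoothed curve can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: it assumes the curve has already been smoothed (for example with the statsmodels function named above), that ranks are evenly spaced one unit apart, and that the gradient is taken as the absolute finite difference between neighbouring smoothed values, which the text does not specify. The function name lowest_gradient_indices and the example values are hypothetical.

```python
def lowest_gradient_indices(smoothed, fraction=0.10):
    """Return the sorted indices of the flattest points on a smoothed curve.

    smoothed: smoothed log2 ratios ordered by rank (assumed unit spacing).
    fraction: share of points to keep (0.10 mirrors the 10% in the text).
    """
    n = len(smoothed)
    if n < 2:
        raise ValueError("need at least two points to estimate a gradient")

    gradients = []
    for i in range(n):
        if i == 0:
            g = smoothed[1] - smoothed[0]              # forward difference
        elif i == n - 1:
            g = smoothed[-1] - smoothed[-2]            # backward difference
        else:
            g = (smoothed[i + 1] - smoothed[i - 1]) / 2.0  # central difference
        gradients.append(abs(g))

    k = max(1, int(round(fraction * n)))
    # sorted() is stable, so ties are resolved in favour of the lower rank.
    flattest = sorted(range(n), key=lambda i: gradients[i])[:k]
    return sorted(flattest)


# Hypothetical smoothed curve: steep at both ends, flat in the middle.
curve = [20, 14, 9, 5, 4, 4, 3, 0, -6, -12]
flat_points = lowest_gradient_indices(curve, fraction=0.2)
```

The first and last of the returned indices would then delimit the starting window of unsmoothed points from which the normal solubility region is grown.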
The R2 value of all points between the first and last unsmoothed equivalent of these points was calculated. Data points of increasing log2 ratio were added until the corresponding R2 value dropped below 0.99. The process was repeated with points of decreasing log2 ratio. Points that were thus defined were considered to correspond to proteins of normal solubility (NS). A trendline was fitted to these points and the intercepts of that trendline at the first and last ranked point on the plot were taken as the cutoff for the lower solubility (LS), i.e. high ratio, and higher solubility (HS), i.e. low ratio, bins respectively. If the LS or HS cutoff was less extreme than the most extreme point in the NS bin, the most extreme point in the NS bin was used as the cutoff instead. In most cases, there were points that lay between the LS and NS as well as NS and HS bins. These points were excluded from further analysis as their solubility was deemed ambiguous. Next, protein groups containing proteins flagged as having one or more transmembrane domains (as determined by Uniprot annotations) were removed. The CYC2008 [97] (for yeast) and Quorum [108] (for mouse and human) databases were scanned to determine which proteins were part of complexes.

2.3.2 Plotting and statistical analyses

Figures were generated using R (www.r-project.org) and the python library Matplotlib [58]. Statistical tests were carried out using NumPy [122], SciPy [91], and R. For boxplots, boxes show the middle 50% of data points, and the whiskers represent 1.5 times the inter-quartile range. n corresponds to the number of proteins in each bin, o to the number of outliers, which are not shown. p-values calculated by the Mann-Whitney U test and listed in Table A.10, Table A.11, and Table A.12 are displayed above the figures. A dotted gray line represents a p-value lower than 0.05 that was not significant after multiple testing correction, while solid lines denote values that remained significant after correction.
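The line-style convention just described can be illustrated with a short sketch. It assumes the Bonferroni correction used in this work is applied by multiplying each p-value by the number of comparisons and capping the result at 1; the number of comparisons passed in below is a hypothetical value, and the function name significance_style is not taken from the original analysis.

```python
def significance_style(p_value, n_tests, alpha=0.05):
    """Return 'solid', 'dotted', or 'none' for a raw p-value.

    solid:  significant after Bonferroni correction
    dotted: below alpha before correction, but not after
    none:   not below alpha even before correction
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")

    adjusted = min(1.0, p_value * n_tests)  # Bonferroni-adjusted p-value
    if adjusted < alpha:
        return "solid"
    if p_value < alpha:
        return "dotted"
    return "none"


# Hypothetical family of 20 comparisons.
styles = [significance_style(p, n_tests=20) for p in (1e-5, 0.01, 0.2)]
```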
Statistical comparisons were made between the LS and NS bins, as well as the NS and HS bins. The multiple testing correction used was the Bonferroni correction. The Amino Acid Compass was plotted using an in-house R script as described previously [87].

2.3.3 Protein properties

Mouse and human proteomes were omitted during the analysis due to the large number of isoforms as well as the tissue specific nature of many proteins. Protein length and amino acid composition were calculated based on sequences in the databases used in 2.1.3. The random sampling of proteins within groups (RSPG) analysis was carried out to determine the length variation of proteins within protein groups. For each of the 1000 iterations, one protein was picked randomly from each protein group and the median value of each bin (LS, HS, or NS) was computed. Yeast was omitted from the RSPG analysis due to the small number of protein groups observed.

Matching of mouse and human orthologs was carried out using Roundup [31]. Gene ontology (GO) analysis was performed using DAVID [32]. Pfam data files were downloaded from the Pfam database [43] on June 6, 2014. pfam_scan.pl version 1.5 (used with HMMER-3.1b1, downloaded from hmmer.org [42]) from the Pfam database was used at default settings to search for matches to each protein sequence. For protein groups containing multiple entries, the first protein in the group was used for Pfam and GO analysis. Protein abundance data from previous studies [47][29] were utilized. Aggregation propensity of proteins was predicted using TANGO [40] (aggregation prone: at least 1 stretch of 7 residues with aggregation tendency above 50%; non-aggregation prone: no residue in entire protein with aggregation tendency above 50%) as well as AGGRESCAN [23] (aggregation prone: at least 1 stretch of 8 residues with a4v value greater than 0.5; non-aggregation prone: no stretch longer than 5 residues with a4v value greater than 0.5), using previously defined cut-offs [50].
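The two classification rules above share the same shape: find the longest run of consecutive residues whose score exceeds a threshold, then compare that run length with an upper and a lower bound. The sketch below illustrates this under the assumption that a "stretch" means consecutive residues and that proteins meeting neither criterion are left unclassified; the function names and the example score lists are hypothetical, and the per-residue scores would in practice come from the TANGO or AGGRESCAN output.

```python
def longest_run_above(scores, threshold):
    """Length of the longest run of consecutive scores above threshold."""
    longest = current = 0
    for score in scores:
        current = current + 1 if score > threshold else 0
        longest = max(longest, current)
    return longest


def classify_by_stretch(scores, threshold, min_prone, max_non_prone):
    """Classify one protein from its per-residue aggregation scores.

    min_prone:     shortest run that counts as aggregation prone
    max_non_prone: longest run still counted as non-aggregation prone
    """
    run = longest_run_above(scores, threshold)
    if run >= min_prone:
        return "aggregation-prone"
    if run <= max_non_prone:
        return "non-aggregation-prone"
    return "unclassified"


def classify_tango(scores):
    # Cut-offs quoted in the text: 7 residues above 50%, or no residue above 50%.
    return classify_by_stretch(scores, 50.0, min_prone=7, max_non_prone=0)


def classify_aggrescan(a4v):
    # Cut-offs quoted in the text: 8 residues above 0.5, or no run longer than 5.
    return classify_by_stretch(a4v, 0.5, min_prone=8, max_non_prone=5)


# Hypothetical per-residue scores.
tango_call = classify_tango([0] * 3 + [60] * 7 + [0] * 2)
aggrescan_call = classify_aggrescan([0.6] * 6 + [0.1] * 4)
```

Keeping an explicit "unclassified" outcome makes it visible that the two criteria are not complementary, so some proteins fall into neither category.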
Low complexity regions (LCRs) were retrieved from lowest-probability subsequences (LPS)-annotate, which has defined LCRs in a number of organisms as previously published [55]. Disulphide bond prediction was performed using DIpro v2.0 [16] at default settings. Phosphorylation sites were retrieved from PhosphoSitePlus [57] for mouse and human, and from PhosphoGRID for yeast [109]. PSIPRED [63] was used to obtain secondary structure predictions. DISOPRED2 [126] and IUPred [35] were used for the prediction of protein disorder. Stretches of disordered residues longer than 5 as predicted by DISOPRED were taken to be disordered patches. Molecular recognition features (MoRFs), intrinsically disordered stretches on proteins that assume an ordered conformation upon protein-protein interaction, were retrieved from ANCHOR [36]. Eukaryotic linear motifs (ELMs) were retrieved from the ELM database on the ANCHOR server [99]. When normalizing against disorder, proteins with less than 10% disorder were excluded from the analysis. The codon adaptation index (CAI) of proteins was calculated using CAIcal [98]. Reference tables were obtained from the CAIcal server on June 10, 2014 (mouse and yeast) and June 16, 2014 (human). Coding sequences were obtained on June 6, 2014 from SGD [17] (for yeast) and the UniParc database [117] (for mouse and human). Yeast SGD accession numbers were mapped to Uniprot accession numbers using the Uniprot mapping tool. Coding sequences were also used to calculate GC content, as well as the number of codons one substitution away from a stop codon (close stop). Localization information for yeast was retrieved from previously published data [77]. The number of transmembrane helices was predicted using TMHMM v2 [70] (yeast only). Hydrophobicity was calculated using the grand average of hydropathy (GRAVY) index [71]. The number of codons per protein was calculated using the coding sequences previously mentioned, as published in [1].
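The two coding-sequence features mentioned above, GC content and the count of "close stop" codons, can be sketched as follows. This is an illustrative sketch rather than the original implementation: it assumes the standard genetic code (stop codons TAA, TAG, TGA), an in-frame DNA coding sequence, and that stop codons themselves are not counted as close stops. The function names and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def gc_content(cds):
    """Fraction of G and C bases in a coding sequence."""
    cds = cds.upper()
    if not cds:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in cds) / len(cds)


def is_close_stop(codon):
    """True if a single base substitution turns the codon into a stop codon."""
    codon = codon.upper()
    if codon in STOP_CODONS:
        return False  # already a stop codon, so not counted here
    for pos in range(3):
        for base in "ACGT":
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if mutant in STOP_CODONS:
                    return True
    return False


def count_close_stops(cds):
    """Number of in-frame codons one substitution away from a stop codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return sum(is_close_stop(codon) for codon in codons)


# Hypothetical four-codon sequence: ATG TAT GGC TAA.
example_cds = "ATGTATGGCTAA"
example_gc = gc_content(example_cds)
example_close = count_close_stops(example_cds)
```

In the example only TAT qualifies, since changing its third base to A gives TAA.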
Amino acid abundance by percentage, as well as the percentage of positive (H,R,K), negative (D,E), polar (S,T,Y,C,N,Q), hydrophobic (G,A,V,L,I,F,M), aromatic (F,Y,W), and rare (C,W,H,M) amino acids, as well as net charge per protein, were calculated using the sequences used for searching in 2.1.3. Patches of aromatic, hydrophobic, positive, negative, and polar (Q and N, as well as S and T) residues were calculated from the database sequences in 2.1.3 using the same methods described in [55], with categories of amino acids rather than individual species.

2.4 Models for prediction of lower solubility propensity

The protein traits derived from Section 2.3.3 were used to build machine learning models to predict lower solubility propensity.

2.4.1 Multiple regression model

Multiple linear regressions were carried out in R with the glmnet library [45]. The model was built using stepwise linear regression in the forward direction with elastic net regularization. Within the training set, 5-fold cross-validation was used to choose the lambda values that optimized the mean squared error (MSE). Model performance was evaluated by 5-fold cross-validation: the data were divided randomly into 5 equal parts, and in each iteration one part was used as the test set and the remaining parts were used for training. MSE values and correlation coefficients were obtained from each iteration of the test set.

2.4.2 Support vector machine

The support vector machine (SVM) model was built in R using the e1071 library [84]. One fifth of the data was randomly chosen as the test set in turn, and the remaining data was used for training. The model was built (cost = 10, type = "C-classification", kernel = "linear") using parameters determined by the built-in tuning function. 5-fold cross-validation was carried out.

Chapter 3

Results

This section of the thesis will describe the results of the mass spectrometry experiments which were used to identify low solubility proteins, as well as the bioinformatic analyses carried out.
The results of those analyses were then used to build two machine learning models to predict protein solubility.

Most of the results presented here were published in [4]. Sample processing and mass spectrometry were performed by Dr. Razvan Albu (human and mouse tissues) and Mang Zhu (yeast cells). RNase experiments, Western blots, and generation of the amino acid compass were performed by Mang Zhu. Processing of data for CAI, secondary structure, ANCHOR MoRFs, ELMs, IUPred, Pfam, and disulfide bond prediction was performed by Eric Wong. All other computational analyses were performed by Gerard Chan (myself).

• Section 3.1: Identification and feature analysis of low solubility proteins
  – Section 3.1.1: Isolation of lower solubility proteins
  – Section 3.1.2: Identification of lower solubility proteins
  – Section 3.1.3: Lower solubility (LS) proteins are longer than higher solubility (HS) proteins
  – Section 3.1.4: LS proteins are more aggregation prone in yeast but not in human or mouse
  – Section 3.1.5: LS proteins contain biases for particular amino acids
  – Section 3.1.6: LS proteins are more highly charged and are less hydrophobic
  – Section 3.1.7: Choice of detergent does not significantly affect solubility of LS proteins
  – Section 3.1.8: LS proteins are more disordered and contain more molecular recognition features (MoRFs) and eukaryotic linear motifs (ELMs) than higher solubility (HS) proteins
  – Section 3.1.9: LS RNase treatment increases the solubility of RNA associated proteins, but does not affect the overall properties of low solubility proteins
  – Section 3.1.10: Coding sequences for LS proteins contain a lower GC content
  – Section 3.1.11: LS proteins possess numerous traits that distinguish them from HS proteins
• Section 3.2: Models to predict protein solubility
  – Section 3.2.1: Regularized linear model
  – Section 3.2.2: Support vector machine

3.1 Identification and feature analysis of low solubility proteins

In this section, we outline how we distinguished lower solubility proteins
from other proteins in our experiments, and then examined certain features that distinguished these lower solubility proteins from other proteins.

3.1.1 Isolation of lower solubility proteins

Our aim was to identify features that distinguished lower solubility proteins in eukaryotic cells under steady state conditions. While many studies have examined systems under non-steady state conditions, we felt that the solubility landscape under steady state conditions would allow us to examine features intrinsic to the proteins and proteomes, free from the influence of stresses and chemical inhibitors.

Figure 3.1: Overview of the approach used to identify lower solubility proteins. (a) Proteins in the low solubility fraction (P: pellet) are mixed with soluble proteins (S: supernatant) in equal ratios by mass and analyzed by quantitative LC-MS/MS. Normalization for abundance is achieved by comparison between the two fractions. (b) Various combinations of soluble and lower solubility fractions from light- or heavy-labeled samples. The results of experiment 1 and 2 are shown in panels c and d respectively. (c, d) Results of the label-swap experiments carried out for yeast (c) and human (SH-SY5Y) cells (d). The log2 ratio values of the proteins quantified in both experiments were plotted, and the coefficient of determination (R2) is shown in each figure. Figure reproduced from [4].

The yeast model organism Saccharomyces cerevisiae was chosen for its relative simplicity and well-characterized nature. Mouse brain and human neuroblastoma SH-SY5Y cells [12] were used as disruption of protein homeostasis in neuronal tissue is associated with various pathologies.
To account for the variation between individual mice, brain tissue from three young adult (approximately 11 weeks old) mice was combined.

Cells were grown and harvested as detailed in Section 2.1.1. Harvested cells were lysed by cryogrinding under native conditions. Low solubility proteins were defined to be those that formed a pellet after centrifugation of native lysate, while those that remained in the supernatant were defined as soluble proteins. The pellet and supernatant fractions of the lysate were mixed as shown in Figure 3.1a, then analysed by quantitative mass spectrometry (see Section 2.1.2 and Section 2.1.3).

The quantification ratio of the lower solubility vs soluble proteins was obtained by directly comparing the lower solubility and soluble fractions. Without taking the soluble fraction into account, abundant proteins with a modest proportion present in the lower solubility fraction would be overrepresented in relation to less abundant proteins with a higher proportion present in the low solubility fraction. Direct quantification of proteins in the lower solubility fraction alone would reflect the abundance of proteins in that fraction, rather than their solubility. SILAC was used to label yeast and human cells, and dimethylation via isotopically-tagged formaldehyde was used to label mouse sample proteins. Equal amounts of light-labeled soluble and heavy-labeled low solubility proteins from each organism were mixed and analyzed as detailed in Section 2.1.2 and Section 2.1.3. Given that the soluble and low solubility fractions in the yeast and human samples were derived from distinct populations of cells, a label swap experiment (Figure 3.1b) was performed in order to ascertain the degree of variation between populations, as well as the variation due to handling and labeling efficiency. For the label swap experiment, the heavy-labeled soluble fraction was mixed with the light-labeled low solubility fraction.
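The comparison between a standard and a label-swapped experiment can be sketched as follows. In the standard design the heavy/light ratio corresponds to pellet/soluble, whereas in the swapped design it corresponds to soluble/pellet, so the swapped log2 ratios are negated before the two experiments are compared. The sketch assumes that the coefficient of determination is taken as the squared Pearson correlation over proteins quantified in both experiments; the function names, protein identifiers, and ratios are hypothetical.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def label_swap_r2(forward_hl, swapped_hl):
    """Compare heavy/light ratios from a standard and a label-swapped run.

    forward_hl: protein -> H/L ratio where heavy = pellet, light = soluble
    swapped_hl: protein -> H/L ratio where heavy = soluble, light = pellet
    """
    shared = sorted(set(forward_hl) & set(swapped_hl))
    pellet_vs_soluble_1 = [math.log2(forward_hl[p]) for p in shared]
    # Negating the log2 ratio converts soluble/pellet back to pellet/soluble.
    pellet_vs_soluble_2 = [-math.log2(swapped_hl[p]) for p in shared]
    return r_squared(pellet_vs_soluble_1, pellet_vs_soluble_2)


# Hypothetical ratios; only proteins A, B and C are shared.
forward = {"A": 4.0, "B": 1.0, "C": 0.25, "D": 2.0}
swapped = {"A": 0.25, "B": 1.0, "C": 4.0, "E": 3.0}
agreement = label_swap_r2(forward, swapped)
```

A value close to 1 indicates that the partitioning measured for each protein does not depend on which fraction carried the heavy label.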
The yeast and human samples had R2 values of 0.84 and 0.91 respectively (Figure 3.1c and d) when compared to the non-label-swapped experiments, indicating a high degree of similarity. This reassured us that there were minimal artefacts attributable to handling, labeling, and variation between populations.

Figure 3.2: Analysis of raw LC-MS/MS quantification data and validation of low solubility delimitation. (a-c) Quantitative mass spectrometry data comparing the low solubility to the soluble fraction for the yeast (a), human (b) and mouse (c) samples, using a base-2 logarithmic scale. The calculated trendline is shown as a dashed dark gray line with indicated equation. Individual data points are colored according to their assigned bin: red for lower solubility (LS), gray for normal solubility (NS), blue for higher solubility (HS), and black for data points not included in any category.
(d) Western blots depicting the relative amounts of proteins in the supernatant and low solubility fraction, with ratio of low solubility to soluble expressed as a percentage. Proteins are marked on the plot in (a). Figure reproduced from [4].

3.1.2 Identification of lower solubility proteins

Using quantitative mass spectrometry, 1738, 2584, and 2326 protein groups were quantified in yeast, human and mouse respectively. Based on the peptides identified during the MS experiment, it is sometimes not possible to distinguish which of several proteins one or more peptides were derived from. In such cases, all possible proteins are represented as a protein group, comprised of two or more protein candidates.

We next divided the proteins into three bins for further analysis. As shown in Figure 3.2a-c, for each of the three organisms, a trendline was fitted to as many points as possible while maintaining a high R2 value of 0.99 or higher, as detailed in Section 2.3.1. The intercepts of the trendline were used as cutoffs to delimit the lower solubility (LS) and higher solubility (HS) bins. The points used for the generation of the trendline were categorized as normal solubility (NS). These proteins were considered to be of non-extreme solubility. NS proteins had fairly similar solubilities, as seen by the gentler slope in that part of the plot. We thus decided to use the NS bin as a point of reference when making comparisons to the LS and HS bins. In the human and yeast samples, the majority of the proteins fell inside the NS bin, with a small portion of proteins in the LS and HS bins. In the mouse sample, a larger proportion of the protein groups fell inside the LS bin than in human and yeast.
This could be due to the increased complexity of the brain tissue used in the mouse sample, compared to the simpler cultured yeast cells as well as undifferentiated human neuronal cells. Unlike the yeast and human tissue culture cells, neuronal cells in the mouse brain are less capable of reducing the amount of aberrantly aggregated protein per cell via asymmetrical division [2, 15].

After binning, proteins annotated as having one or more transmembrane domains were removed, as their solubility would be dependent on the choice of detergent used for lysis [9]. 257 yeast proteins, 187 human proteins, and 243 mouse proteins were thus removed. Proteins in complexes were not removed as no trend was observed between complex size (number of different proteins in the complex) and solubility (average solubility of all partners in the complex) (Figure 3.3). Removal of transmembrane proteins did not substantially affect the distribution of solubilities. From the filtered data, the lower solubility (LS) bin contained 96, 170, and 530 proteins from yeast, human, and mouse respectively. The higher solubility (HS) bin contained 180, 343, and 51 proteins from yeast, human, and mouse respectively. The NS bin contained approximately 2/3 of the quantified proteins, with 1095, 1200, and 1254 proteins in yeast, human, and mouse respectively. Proteins that were not included in the NS bin, and did not meet the LS or HS cutoff, were excluded from further analysis due to ambiguity regarding their solubility.

Figure 3.3: Average solubilities of proteins within a complex. The number of components of known complexes in our dataset was plotted against the average solubility (based on components identified) for each protein complex in yeast, human, and mouse. Figure reproduced from [4].

To verify the identification and quantification from the mass spectrometry experiments, two proteins from the LS bin (Gsy2 and Spo14) and two proteins from the HS bin (Pdi1 and Trx2) in yeast had their endogenous copy appended with a C-terminal 3x-hemagglutinin (HA) tag, and their solubility validated by Western blotting. As seen in Figure 3.2d, LS proteins tend to be about 20 times more abundant in the supernatant than the pellet, compared with HS proteins that tend to be about 100-200 times more abundant in the supernatant, making LS proteins approximately 5-10 times less soluble than their HS counterparts.

Gene ontology (GO) analysis revealed that a number of LS proteins from yeast, human, and mouse were associated with RNA processing as a biological process (Table A.2, Table A.4, Table A.6). Molecular function showed many LS proteins associated with RNA binding. Unique to the mouse sample, the LS bin also showed an enrichment of proteins associated with cytoskeletal organization and structural molecule activity. Association with structural molecule activity is consistent with the high levels of tubulin and neurofilaments in neurons [33]. We checked if there was an enrichment of Pfam domains (Table A.7, Table A.8, Table A.9) in the LS proteins compared to the NS bin, and found that RRM_1, an RNA binding domain, and filament were enriched in human and mouse. Human also showed an enrichment in septin, while mouse showed an enrichment in spectrin. However, the GO annotations found did not represent a large portion of the LS proteins we identified, suggesting that LS proteins are of a diverse nature.

Interestingly, chaperone proteins were not strongly enriched among the LS proteins. Ssa1p, a major cytosolic Hsp70 chaperone, was found within the NS bin in yeast.
It is possible that, under our unstressed experimental conditions, chaperones associated with non-misfolded proteins may remain largely in the soluble supernatant fraction, while only a small portion of the chaperone population is bound to misfolded proteins.

Proteins reported in other studies to exhibit low solubility displayed a low degree of overlap with proteins in our LS bin [26, 92, 103, 128]. This can be attributed to differences in isolation methods and the solubility of proteins in stressed and unstressed conditions. Our approach also accounts for abundance of proteins by normalizing against the soluble fraction, which may be important in accounting for the bias of mass spectrometry towards more abundant proteins. As seen in Figure 3.4a, proteins identified as insoluble without normalization to the soluble proteins were of much higher abundance (ion intensity and number of molecules per cell previously published in [29] and [47] respectively). In contrast, after normalization to the soluble fraction, it can be seen that LS proteins are actually less abundant than HS proteins (Figure 3.4b). This is consistent with previous work [50] showing that proteins which are prone to forming aberrant aggregates are subject to strict regulation at the transcriptional, translational and degradation level, keeping their concentration below the critical concentration for aggregation, while highly expressed proteins are subject to strong evolutionary pressure toward lower aggregation propensities [105, 107].

Figure 3.4: Abundance of LS proteins in this study compared to previous studies. (a) Abundance values of proteins identified in the LS bin compared to those identified in the low solubility pellet (Insol) from a separate mass spectrometry analysis of low solubility proteins alone [4]. Protein abundances were derived based on ion intensities in mass spectrometry published by [29] (left) and levels of endogenously tagged proteins as published by [47] (right). (b) Abundance of proteins in the LS, NS and HS bins, as well as in the proteome as a whole (P), determined for yeast. Figure reproduced from [4].

3.1.3 Lower solubility (LS) proteins are longer than higher solubility (HS) proteins

We first sought to determine whether protein length was correlated with solubility, given that ubiquitinated proteins which are less soluble after heat shock have previously been shown to be longer [87]. In protein groups with more than one member, the average length was taken as representative of the whole group.
In all three organisms, proteins in the LS bin tended to be longer than proteins in the NS bin, and proteins in the HS bin tended to be shorter than proteins in the NS bin (Figure 3.5a), consistent with the findings in the previous study.

Figure 3.5: Comparison of the lengths of proteins by randomly picking versus averaging over protein groups. (a) Boxplots of the distributions of protein length (in amino acids) for yeast, human and mouse brain samples. Length of each protein group was obtained by averaging over lengths of all proteins in the group. (b) Distributions of the median protein length values after randomly selecting one protein per protein group for one thousand iterations during the random sampling of proteins within groups (RSPG) experiment. There were 115 and 298 protein groups with at least two proteins in the human and mouse datasets, respectively. Figure reproduced from [4].

As mentioned earlier, peptides identified by mass spectrometry often correspond to multiple proteins, and it is not uncommon for there to be insufficient information to allow for definitive identification of proteins. For example, in yeast one particular protein group contains two proteins: YKL156W and YHR021C. Based on the peptides identified by mass spectrometry, the peptides could have been derived from either or both of those proteins. In our experiments, 55 of 1738, 2062 of 2584, and 1296 of 2326 protein groups contained more than one protein in yeast, human, and mouse respectively. The protein groups that result from such ambiguous identification may contain proteins that are substantially different from each other. In order to gauge the effect of this variability on the outcome of subsequent analysis, we carried out a random sampling of proteins within groups (RSPG). For each of the thousand iterations, one random protein from each group was selected and the median length for each of the bins defined in Section 3.1.2 was calculated. This analysis was not performed in yeast due to the small number of protein groups with two or more members observed. In human and mouse, the median values obtained in the RSPG experiment (Figure 3.5b) did not differ by a large margin from the median values obtained by averaging over the protein group, as done above and shown in Figure 3.5a. The averaging of values over the protein group was thus regarded as an appropriate approximation.

3.1.4 LS proteins are predicted to be more aggregation prone in yeast but not in human or mouse

Since aggregation prone proteins should in principle display lower solubility, we next checked whether the LS bin contained more proteins predicted to be amyloid aggregation prone. In yeast, AGGRESCAN and TANGO both predicted a larger fraction of the LS bin to be amyloid aggregation prone than of the NS and HS bins (Figure 3.6a-b).
The HS bin was predicted to contain a higher proportion of proteins that are not amyloid aggregation prone. However, this trend was not observed in human and mouse, where either no trend was observed, or amyloid aggregation prone proteins were more frequently observed in the HS bin. Given that TANGO and AGGRESCAN both look for features associated with amyloid aggregation, it is possible that human and mouse low solubility proteins possess other traits not characteristic of amyloids that contribute to their lower solubility.

Figure 3.6: Aggregation propensity as predicted by TANGO and AGGRESCAN.
Aggregation prediction (shown as percent of all proteins in each bin) for the proteins in the indicated bins in yeast, human and mouse using the AGGRESCAN (a) and TANGO (b) algorithms, as shown. Figure reproduced from [4].

3.1.5 LS proteins contain biases for particular amino acids

Since we were searching for intrinsic properties of proteins that could contribute to their low solubility, we considered the possibility that LS proteins might contain biases for certain amino acids, prompting us to examine the local and global amino acid composition of proteins. Given that certain neurodegenerative diseases have been associated with stretches of polyglutamine within proteins [54, 136], we examined LS proteins for the presence of low complexity regions (LCRs), which are local stretches of a protein that contain a bias for one or more amino acids. While proteins in the LS bin contained more LCRs per unit length (Figure 3.7), the LCRs did not show any consistent bias for any particular amino acid across all three organisms (Table A.13, Table A.14, Table A.15).

[Figure 3.7 image: y-axis: LCRs/length (aa) x10^-3; samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.7: Number of low complexity regions (LCRs) per unit length of proteins in each bin, as found on LPS-annotate.

[Figure 3.8 image: amino acid composition of the low solubility and high solubility bins relative to the normal solubility (reference) bin; samples: yeast, human, mouse brain.]

Figure 3.8: Amino acid composition of proteins in the LS and HS bins of yeast, human and mouse samples. Each data point represents the median value within the entire bin, expressed as a fold enrichment over the NS bin. Statistically significant differences are indicated by asterisks. Figure adapted from [4].

[Figure 3.9 image: boxplots of serine content, cysteine content and hydrophobic content (percent per protein); samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.9: Percentage abundance of particular types of amino acids. Boxplots showing the distribution of values for the indicated amino acids (percent per protein) in the indicated sample. Shown here are the analyses for serine (a), cysteine (b), and hydrophobic residues (c). Residues included in the hydrophobic analysis were glycine, alanine, valine, leucine, phenylalanine and methionine. Figure adapted from [4].

With regard to general amino acid composition (Figure 3.8a), several amino acids tended to be either enriched or depleted in the LS bin relative to the NS bin, with the opposite effect observed between the NS and HS bins. Interestingly, there was little overlap between yeast and the other two organisms, but the mouse and human samples tended to display more similar biases. This could reflect the closer evolutionary relationship between mouse and human, compared to yeast. The LS fractions tended to be enriched in glutamines, but depleted in asparagines (Figure 3.8a). Both amino acids have been previously linked to amyloid formation, with polyglutamine associated with benign amyloids and polyasparagine linked to more toxic species [53]. Serine was more enriched in the LS fraction and more depleted in the HS fraction of all three organisms (Figure 3.9a). In mouse and human, fewer cysteines were observed in LS proteins (Figure 3.9b), suggesting lower potential for stabilization of structure by disulfide bonds. Yeast and mouse also displayed significantly fewer hydrophobic residues in the LS bin (Figure 3.9c). The same difference was observed in human but was not significant. All in all, there are specific amino acid biases that set low solubility proteins apart from normal and high solubility proteins.

With certain amino acids such as serine and cysteine being associated with features such as phosphorylation and disulfide bonds, we also examined whether these features were more or less prominent in LS compared to HS proteins. Phosphorylation sites were only enriched in LS proteins for human.
In yeast, it was the HS proteins that had more phosphorylation sites (Figure 3.10a), which is surprising given the enrichment for serine observed in Figure 3.9a. LS proteins in human and mouse contained fewer predicted disulfide bonds (Figure 3.10b), consistent with the previous observation of fewer cysteines (Figure 3.9b).

3.1.6 LS proteins are more highly charged and are less hydrophobic

Because charge on proteins can be used to mediate protein-protein interactions [28], we decided to examine whether LS proteins might carry more net charge than their HS counterparts. In yeast and mouse, LS proteins tended to have a higher magnitude of net charge (regardless of sign) than NS and HS proteins (Figure 3.11a). When sign is taken into account, LS proteins in all three organisms had a more positive charge than HS proteins (Figure 3.11b), but not significantly more so than NS proteins. HS proteins are more negatively charged than their LS and NS counterparts. Given the ability of charged residues to mediate inter- and intramolecular interactions, it is plausible that because HS proteins tend to be less strongly charged, they might have fewer interaction partners, as large assemblies brought about by large networks of interaction would intuitively be of lower solubility.

[Figure 3.10 image: (a) phosphosites/length x10^-2 for yeast, human and mouse brain; (b) predicted disulfide bonds/length x10^-2 for human and mouse brain; bins: LS, NS, HS.]

Figure 3.10: Number of phosphorylation sites and disulfide bonds in each of the three model organisms. (a) The number of phosphorylation sites normalized to length for yeast, human and mouse, as indicated. (b) Number of predicted disulfide bonds normalized to length.
Figure reproduced from [4].

In light of hydrophobic interfaces being able to mediate interactions between proteins [125], we also examined whether the hydrophobicity of proteins differed between proteins in the LS and HS bins. The grand average of hydropathy (GRAVY) index [71] was chosen as it assigns each residue type its own hydrophobicity score, allowing a more precise representation than the previous examination of what proportion of each protein was composed of hydrophobic residues (Figure 3.9c). Consistent with the observation that LS proteins were composed of fewer hydrophobic residues, LS proteins in mouse and human had lower hydrophobicity scores (Figure 3.11c), as calculated by the GRAVY index.

[Figure 3.11 image: (a) net charge per protein squared; (b) net charge per protein; (c) average hydrophobicity (GRAVY) of proteins; samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.11: Analysis of net charge and hydrophobicity of proteins. (a-b) Net charge of proteins was calculated by assigning His, Arg, and Lys a charge of +1, Glu and Asp a charge of -1, and all other residues a charge of 0. The charges of all residues in the protein were then added together to obtain net charge. This net charge was then either squared and plotted (a) or plotted directly (b). (c) Average hydrophobicity of proteins, as calculated by the grand average of hydropathy (GRAVY) index.

[Figure 3.12 image: (a) log2(P/S) with Triton X-100 plotted against log2(P/S) with NP40 and DOC, Pearson correlation coefficient 0.868; (b) amino acid composition of LS and HS proteins in Triton X-100 and in NP40.]

Figure 3.12: Comparison of results obtained using the two detergents NP40 and Triton X-100. (a) Comparison of ratios for proteins in yeast identified in common using different detergents for lysis. (b) Comparison of amino acid abundances in yeast for cells lysed with different detergents. Figure reproduced from [4].

3.1.7 Choice of detergent does not significantly affect solubility of LS proteins

There was a possibility that the choice of detergent might affect the perceived solubility of proteins, as well as the analyses that were based on said perceived solubility. To rule that out, we repeated the experiment in yeast (originally carried out using Triton X-100) using Igepal (labelled NP40 in Figure 3.12) as the detergent, and observed little disparity between the results obtained with the two detergents. The ratios obtained in the two experiments displayed a Pearson correlation coefficient of 0.868 (Figure 3.12a). There was also little variation observed in the amino acid enrichment between the samples treated with different detergents (Figure 3.12b).
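As an illustration of the comparison described above, the agreement between two detergent experiments can be quantified by correlating the log2 pellet-to-supernatant ratios of the proteins identified in both. The following is a minimal sketch and not the analysis code used in this work; the function names, the dictionary layout, and the example values are assumptions made purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlate_shared(ratios_a, ratios_b):
    """Correlate log2(P/S) ratios for proteins quantified in both experiments.

    ratios_a, ratios_b: dicts mapping a protein ID to its log2(P/S) ratio
    (a hypothetical layout chosen for this sketch).
    Returns (number of shared proteins, Pearson r).
    """
    shared = sorted(set(ratios_a) & set(ratios_b))
    r = pearson([ratios_a[p] for p in shared], [ratios_b[p] for p in shared])
    return len(shared), r


# Made-up values for illustration only; these are not data from this work.
triton = {"protA": 1.2, "protB": -0.4, "protC": 2.1, "protD": -1.5}
igepal = {"protA": 1.0, "protB": -0.1, "protC": 1.8, "protE": 0.3}
n_shared, r = correlate_shared(triton, igepal)
print(n_shared, round(r, 3))
```

Restricting the calculation to proteins found in both experiments mirrors the "identified in common" wording of Figure 3.12a; proteins seen with only one detergent do not contribute to the coefficient.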
The differences observed between organisms are thus better explained by interspecies variation than by differences in the detergents used.

3.1.8 LS proteins are more disordered and contain more molecular recognition features (MoRFs) and eukaryotic linear motifs (ELMs) than higher solubility (HS) proteins

The study by Ng et al. also highlighted that longer, less soluble proteins that are ubiquitinated also tended to be more disordered, which would be consistent with our observation that LS proteins are more highly charged and less hydrophobic. To ascertain if this was true in our dataset, we utilized two disorder prediction algorithms, DISOPRED and IUPRED, to analyze the disorder of proteins. We found that LS proteins were more disordered than their NS and HS counterparts, as shown in Figure 3.13a and Figure 3.13b. This is consistent with the findings of the previous study [87] that proteins in the low solubility fraction after heat shock tended to be more disordered than ones that remained in the soluble fraction. LS proteins were also found to contain more coil regions (Figure 3.13c), as predicted by PSIPRED. Given that coil regions are essentially regions that were not classified as having a fixed helical or sheet structure, this is consistent with LS proteins being more disordered.

While LS proteins contained more disorder, we also wanted to see whether this disorder resided in distinct regions of the protein or was dispersed throughout the protein, prompting us to examine how many disordered patches the proteins contained. In all three organisms, LS proteins contained more disordered patches (Figure 3.14a) of at least five residues in length.
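The patch count used here can be sketched as follows. This is only an illustration under stated assumptions: the per-residue disorder calls are represented as a string of 'D' (disordered) and 'O' (ordered) characters, which is a hypothetical encoding rather than the native output format of DISOPRED or IUPRED, and the function names are invented for this example.

```python
from itertools import groupby


def disordered_patches(calls, min_len=5, disordered="D"):
    """Return (start, end) pairs (0-based, end exclusive) for every run of
    consecutive disordered residues that is at least min_len residues long."""
    patches = []
    pos = 0
    for symbol, run in groupby(calls):
        length = sum(1 for _ in run)
        if symbol == disordered and length >= min_len:
            patches.append((pos, pos + length))
        pos += length
    return patches


def patch_stats(calls, min_len=5):
    """Number of disordered patches per protein and per residue."""
    if not calls:
        raise ValueError("empty sequence")
    n = len(disordered_patches(calls, min_len))
    return {"patches": n, "patches_per_residue": n / len(calls)}


# Toy per-residue calls (hypothetical): runs of 6, 3 and 5 disordered residues.
calls = "OOO" + "D" * 6 + "OOO" + "D" * 3 + "OOOO" + "D" * 5 + "OO"
print(disordered_patches(calls))
print(patch_stats(calls))
```

Dividing the patch count by protein length gives a per-residue value, which is one way to check whether a higher patch count simply reflects a longer protein.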
This trend was only preserved in mouse after normalization to length (Figure 3.14b), suggesting that the increased number of disordered patches in yeast and human was likely to be merely due to a correlation with length.

Disordered regions that assume an ordered conformation upon binding to partners, also known as molecular recognition features (MoRFs) [121], were enriched in the LS fraction (Figure 3.15a). Given that MoRFs must occur in regions of disorder, and that LS proteins have been shown earlier to be more disordered, it was logical to examine whether the enrichment for MoRFs was due to the increased disorder of LS proteins. However, the enrichment for MoRFs in LS proteins was still observed even after normalizing to percentage disorder of the proteins (Figure 3.15b).

While LS proteins had more MoRFs and contained more patches of disorder, it was possible that this could simply be dependent on length, and that each disordered patch would generally contain the same average number of MoRFs. In mouse and human, LS proteins contained more MoRFs per disordered patch (Figure 3.16), indicating that not only were there more MoRFs, they were also more densely clustered in disordered regions.

Short stretches of conserved amino acid sequences, known as eukaryotic linear motifs (ELMs), are often involved in mediating protein-protein interactions [25]. In contrast with MoRFs, ELMs are not necessarily within disordered regions. A search of the ELM database on ANCHOR showed that yeast LS proteins contained more linear motifs per unit length (Figure 3.17a). However, this trend was no longer significant after normalization to disorder (Figure 3.17b).

[Figure 3.13 image: (a) percent disorder (DISOPRED); (b) percent disorder (IUPRED); (c) percent of coil (PSIPRED); samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.13: Comparison of protein percentage disorder as predicted by DISOPRED and IUPRED. (a-b) Distributions of the percent disorder of proteins as predicted by DISOPRED (a) and IUPRED (b) in the indicated bins. (c) Percent of each protein predicted by PSIPRED to be an unstructured coil. Figure reproduced from [4].

[Figure 3.14 image: (a) number of disordered patches per protein; (b) number of disordered patches per unit length; samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.14: Number of disordered regions in each protein, and the number of MoRFs within each such region. (a) Number of disordered regions (at least 5 residues long) per protein. (b) Number of disordered regions (at least 5 residues long) per unit length.

[Figure 3.15 image: number of MoRFs (ANCHOR), shown as MoRFs/length x10^-2 and MoRFs/% disorder x10^-1; bins: LS, NS, HS.]

Figure 3.15: Number of molecular recognition features (MoRFs) within each protein, as predicted by the ANCHOR database. (a and b) The number of MoRFs per protein, normalized to protein length (a) and percentage disorder (b). Only proteins predicted to be at least 10% disordered were considered. Figure reproduced from [4].

[Figure 3.16 image: number of MoRFs (ANCHOR) per disordered patch; samples: yeast, human, mouse brain; bins: LS, NS, HS.]

Figure 3.16: Average number of MoRFs as determined by ANCHOR within each disordered patch (of 5 residues or longer).

3.1.9 RNase treatment increases the solubility of RNA associated proteins, but does not affect the overall properties of low solubility proteins

An enrichment of RNA-associated GO terms was previously observed in Section 3.1.2. We wanted to investigate the possibility that binding to RNA-associated assemblies might increase the propensity of proteins to precipitate during centrifugation. In order to do so, we carried out a triple-SILAC experiment with one sample (heavy) RNase treated during lysis but prior to separation of low solubility proteins, and two more samples not treated with RNase.
The pellet from the heavy and medium experiments, as well as the supernatant from the light experiment, were mixed in a 1:1:1 ratio by mass and analyzed by quantitative mass spectrometry (Figure 3.18a). We identified a group of proteins that do indeed appear more soluble following RNase treatment (Figure 3.18b and c). Many of these proteins were annotated with RNA-associated GO terms (Table A.16, Table A.17). However, there was no significant change to the trends observed for yeast, save for a slight reduction in beta sheet content for the LS bin proteins (Figure 3.18d). Therefore, while the solubility of RNA-binding proteins was affected by the presence or absence of RNase treatment, the solubility of most LS proteins was not affected, and most LS proteins remained in the lower solubility bin upon RNase treatment.

3.1.10 Coding sequences for LS proteins contain a lower GC content in yeast

After examining numerous traits of polypeptides that could influence their solubility, we next looked at nucleic acid-based traits to see if any of them correlated with solubility. Coding sequences for LS proteins tended to have a lower GC content than genes encoding NS and HS proteins (Figure 3.19).
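For reference, GC content is the fraction of guanine and cytosine bases in a sequence. A minimal sketch is given below; the function name and the toy sequence are illustrative assumptions, and ignoring ambiguous bases such as N is a choice made for this example rather than something specified in this work.

```python
def gc_content(cds):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a
    coding sequence. Ambiguous symbols such as N are ignored (an assumption)."""
    seq = cds.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


# Toy sequence for illustration only; not a real coding sequence from this work.
toy_cds = "ATGGCCAAATGA"
print(gc_content(toy_cds))
```

Computed per coding sequence, this value can then be compared between the LS, NS and HS bins in the same way as the protein-level traits.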
Higher GC content has been associated with lower translation rates [100]. This appears to be consistent with findings that suggest lower translation rates provide more time for proteins to fold, and help to minimize misfolding [112] and thus aberrant amyloid or amorphous aggregation, potentially allowing for increased solubility.

Figure 3.17: Number of eukaryotic linear motifs (ELMs) within each protein, as predicted by the ANCHOR database. (a and b) The number of ELMs per protein, normalized to protein length (a) and percentage disorder (b). Only proteins predicted to be at least 10% disordered were considered. Figure reproduced from [4].

Figure 3.18: Analysis of RNase treatment on protein solubility. (a) Experimental setup used to determine RNA influence on protein solubility. (b) The log2 ratio distribution of low solubility proteins from yeast cell lysate treated with RNase (PRNase) against those without the treatment (PNormal) (top left). The scatter plot shows the comparison of log2 ratios between the two experiments; proteins which became more soluble after RNase treatment are highlighted in red. (c) Solubility change of proteins annotated with RNA-associated GO terms and proteins without these GO terms after RNase treatment. (d) Boxplots showing the distribution in samples with and without RNase treatment of protein length (top), percent disorder (middle), and percent beta-sheets (bottom). Figure reproduced from [4].

Figure 3.19: Percentage of coding sequences of each protein comprised of GC.

3.1.11 LS proteins possess numerous traits that distinguish them from HS proteins

We have identified a number of traits that distinguish LS proteins from HS proteins (p-value summary in Table A.10, Table A.11, and Table A.12). In contrast with previous findings [29, 47], they are actually less abundant than HS proteins. They tend to be longer and more disordered than HS proteins. While they tend to be more aggregation prone (at least in yeast), their higher propensity for features that mediate protein-protein interactions such as ELMs and MoRFs, as well as features such as LCRs, suggests a relationship between low solubility and functional aggregation.
While a small number of proteins associated with RNA-associated GO terms displayed a higher solubility upon treatment with RNase, this trend was not observed for the majority of the proteins analyzed. In addition to traits of polypeptides affecting solubility, we also observed that the coding sequences of LS proteins in yeast have a lower GC content than those of HS proteins, which could potentially allow HS proteins more time to fold during translation, contributing to their increased solubility.

3.2 Models to predict protein solubility

After analyzing various traits to check for relationships with solubility, we next aimed to use those traits to generate models that would aid in the prediction of solubility. With these models, it is hoped that we can predict which proteins are of low solubility via in silico methods, without requiring a priori knowledge derived from experiments such as mass spectrometry and Western blotting as previously detailed. Yeast was chosen as the organism to model because more comprehensive data is available for it than for mouse and human, and because it provides a simpler system to build a model on, as it possesses a simpler proteome without tissue-specific compositional differences. The reduced time needed to generate strains for experiments would also speed up validation of the models.

Table 3.1: Mean squared error (MSE) and correlation coefficients from each of the five iterations of cross-validation, as well as the average values obtained.

Test set    MSE           Correlation coefficient
1           0.9246397     0.5158573
2           0.7163011     0.410455
3           0.8629595     0.4492167
4           0.6873905     0.5446077
5           0.7896966     0.4339884
Average     0.79619748    0.47082502

3.2.1 Multiple regression model

One model we attempted to build in order to model protein solubility was a multiple linear regression model.
A linear regression model has the general form

$$ y_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + c + \varepsilon_i $$

where $y_i$ denotes one instance of the response variable (which is solubility in this case), $x_{i1}, \dots, x_{ip}$ denote the values of various traits such as length for that instance, $b_1, \dots, b_p$ denote the coefficients associated with each trait, $c$ denotes a constant, and $\varepsilon_i$ denotes the error for this instance.

The 85 protein traits analyzed (shown in Table A.10) were fitted in a stepwise fashion to a multiple regression model with elastic net regularization using the glmnet library for R [45]. This was done by first finding the value at which the intercept alone would have the least deviance from the actual values, then at each subsequent iteration adding the variable that explained the most variation. Elastic net regularization adds a penalty for model complexity in order to minimize overfitting, in which the model fits noise within the data and loses predictive performance. The available data was randomly split into 5 equally sized sets. Each set in turn was used as a test set to assess the accuracy of the model, while the remaining 4 were used as the training set to build the model.

Table 3.2: Estimates of the coefficients from the regularized multiple regression model.

Trait                                Estimate
(Intercept)                          -8.17E-02
Membrane localization                 2.18E+00
Vacuole localization                  1.82E+00
Average charge per residue            1.43E+00
Number of disulphide bonds            4.00E-02
Percent abundance of Leu              2.07E-02
Number of close stop codons           8.64E-03
Net charge per protein                1.11E-03
Alpha helix propensity                2.32E-04
Net charge squared per protein        5.32E-05
Percent coiled coil                  -2.53E-03
Percent disorder (IUPRED)            -2.80E-03
Percent negative residues            -5.50E-03
Percent abundance of Gln             -6.58E-03
Percent abundance of Glu             -1.42E-02
Percent abundance of Pro             -1.78E-02
Percent abundance of Ala             -3.29E-02
Number of MoRFs per unit length      -2.28E-01
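The five-fold scheme described above can be sketched in a few lines. The snippet below is only an illustration in Python, not the R/glmnet code used in this work: the solubility values are hypothetical, the function names are invented for this example, and the "model" is a placeholder that predicts the training-set mean, standing in for the regularized regression fit.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def cross_validate(y, fit, predict, k=5, seed=0):
    """Return the test MSE of each fold; every fold serves as the test set once."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit(train_idx)
        predictions = predict(model, test_idx)
        scores.append(mean_squared_error([y[i] for i in test_idx], predictions))
    return scores

# Hypothetical log2 solubility ratios (illustrative values, not thesis data).
y = [-2.1, 0.3, 1.4, -0.7, 2.2, 0.0, -1.5, 0.9, 1.1, -0.4]

def fit_mean(train_idx):
    """Placeholder model: remember the mean of the training responses."""
    return sum(y[i] for i in train_idx) / len(train_idx)

def predict_mean(model, test_idx):
    """Placeholder prediction: the training mean for every test protein."""
    return [model] * len(test_idx)

fold_mse = cross_validate(y, fit_mean, predict_mean)
average_mse = sum(fold_mse) / len(fold_mse)
```

Swapping `fit_mean` and `predict_mean` for a real fitting routine would leave the splitting and scoring logic unchanged, which is the point of keeping them as separate arguments.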
The squared differences between observed and predicted values were averaged over all data points to give the mean squared error (MSE), a measure of the accuracy of the model.

The averaged MSE value over the 5 folds of cross-validation was 0.796 (Table 3.1), which corresponds to a root mean squared error of roughly 0.89. Table 3.2 shows the estimates of the coefficients for each trait that was included in the model. Given that the values of solubility ratios obtained from the mass spectrometry experiments range from roughly -5 to 5, the MSE obtained only affords a crude approximation. Given the performance of the multiple regression model, other models were explored with the aim of improved performance.

Figure 3.20: Illustration of SVM hyperplanes. While both the lines (hyperplanes) L1 and L2 can divide the two classes (black and white), L2 has a larger margin of separation and is chosen for higher performance. Two-dimensional space is used in this example.

3.2.2 Support vector machine

We next built a support vector machine (SVM) to distinguish between LS and HS proteins, using the R library e1071 and the same traits that were considered for the regression model (Table A.10), save for the amyloid aggregation propensities as predicted by TANGO and AGGRESCAN, as the SVM is not compatible with categorical variables. Support vector machines project the data into a higher dimensional space using mapping functions, called kernel functions, and aim to find a multi-dimensional plane, referred to as a hyperplane, within this high dimensional space that can most effectively separate the two classes (in this case, LS and HS proteins) of data points. As shown in Figure 3.20, even though multiple hyperplanes might be able to divide the two classes, the one with the maximum separation distance is chosen.
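The idea of preferring the hyperplane with the largest margin can be illustrated with a small calculation. The sketch below is in Python and uses hypothetical two-dimensional points and two hand-picked candidate lines standing in for L1 and L2 of Figure 3.20. It only compares the geometric margin of given lines; it does not perform the optimization that an SVM library such as e1071 carries out, and it ignores the cost parameter.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    The value is positive only if the line classifies every point correctly.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        label * (w[0] * x + w[1] * y + b) / norm
        for (x, y), label in zip(points, labels)
    )

# Hypothetical 2-D data: label -1 for one class, +1 for the other.
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (4.0, 4.0), (5.0, 4.5), (4.5, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Two candidate separating lines, each given as (w, b) for w.x + b = 0.
candidates = {
    "L1": ((1.0, 0.0), -2.5),   # vertical line x = 2.5
    "L2": ((1.0, 1.0), -5.75),  # diagonal line x + y = 5.75
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}

# Both lines separate the classes, but the one with the wider margin is preferred.
best = max(margins, key=margins.get)
```

Both candidates have a positive margin here, so both separate the toy classes; the selection step simply keeps the line that leaves the most room between itself and the nearest point.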
The two extreme bins from our dataset were used as an initial test to see if it was possible to distinguish the most dissimilar proteins based on the traits we had already examined. Thus, the LS and HS bins were pooled and then randomly split into five equally sized partitions. Each of the partitions in turn was used as a test set while the rest were used as the training set. The SVM was built with a linear kernel and a cost of 10. The cost parameter applies a penalty to each possible hyperplane based on how many incorrect classifications that hyperplane makes. Higher cost values penalize incorrect classification more, but are prone to overfitting.

Table 3.3: The prediction performance of the SVM on distinguishing LS and HS proteins.

Test set   Sensitivity   Specificity   Precision   Accuracy   FDR      MCC      AUC
1          0.6667        0.9677        0.6667      0.8696     0.3333   0.6972   0.8831
2          0.8333        0.8571        0.8333      0.8478     0.1667   0.6844   0.8392
3          0.6875        0.9000        0.6875      0.8261     0.3125   0.6081   0.8147
4          0.8000        0.8889        0.8000      0.8696     0.2000   0.6471   0.8039
5          0.6842        0.8889        0.6842      0.8043     0.3158   0.5925   0.8063
Average    0.7343        0.9005        0.7343      0.8435     0.2657   0.6459   0.8294

The performance of the SVM is shown in Table 3.3. Several measures were used to assess the performance of the SVM. Sensitivity, or true positive rate, is the ratio of true positives called by the model to the total number of positive data points. Specificity, or true negative rate, is the ratio of true negatives called by the model to the total number of negative data points. Precision is the ratio of true positives called by the model to the total number of positives called by the model. Accuracy is the proportion of data points correctly called by the model. The false discovery rate (FDR) is the proportion of positives called by the model that are false positives.
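The measures defined above, together with the MCC column of Table 3.3, all follow from the four counts of a confusion matrix. The Python sketch below shows the standard formulas; the counts are hypothetical and the function name is invented for this example, so it should be read as an illustration rather than as the code used to produce Table 3.3.

```python
import math

def ratio(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, fp, tn, fn):
    """Compute common performance measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "precision": ratio(tp, tp + fp),            # called positives that are real
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "fdr": ratio(fp, tp + fp),                  # called positives that are false
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts for one test fold (LS taken as the positive class).
metrics = classification_metrics(tp=10, fp=2, tn=28, fn=5)
```

Because precision and FDR share the same denominator, they always sum to 1, which matches the Precision and FDR columns of Table 3.3.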
The Matthews correlation coefficient (MCC) is a measure of the quality of the predictions by the model, ranging from -1 to 1, with 1 being perfect prediction, -1 being total disagreement with the observations, and 0 being no better than random chance. The area under the curve (AUC) represents the probability that the model will rank a randomly chosen positive data point higher than a randomly chosen negative one, under the assumption that positive data points should rank higher. As shown in Table 3.3, the model is able to distinguish LS and HS proteins.

The SVM model obtained an FDR of 0.2657, which means that roughly one in four proteins called LS would be a false positive. If the model is used as a preliminary in silico screen before biological validation, this error rate is low enough to still allow for narrowing down of candidates.

Next, the SVM was built using the LS, NS, and HS bins, rather than just the LS and HS bins, as that would provide a more useful tool for biologists. Performance of this model (Table 3.4) is lower than that on just the LS and HS proteins, suggesting that it is indeed more challenging to distinguish between LS and NS proteins than it is to distinguish between LS and HS proteins. The high FDR of 0.664 poses a problem in that, more often than not, a protein identified as LS by the model will not actually be an LS protein. More improvements to the model will need to be made in order for it to be useful in analyzing proteins in silico prior to biological validation.

Table 3.4: The prediction performance of the SVM on distinguishing LS and non-LS proteins.

Test set   Sensitivity   Specificity   Precision   Accuracy   FDR      MCC       AUC
1          0             0.9512        0           0.9123     1        -0.0458   0.4785
2          0.0909        0.9313        0.0909      0.8772     0.9091   0.0213    0.5102
3          0.5           0.9563        0.5         0.9244     0.5      0.4397    0.7119
4          0.375         0.9509        0.375       0.924      0.625    0.2805    0.6207
5          0.7143        0.9329        0.7143      0.924      0.2857   0.4403    0.6498
Average    0.336         0.9445        0.336       0.9124     0.664    0.2272    0.5942

One possible avenue by which both models could be improved is the inclusion of more traits that correlate well with protein solubility.
Future efforts to improve the models will include searching for additional traits that can help improve their predictive power. It is hoped that, with sufficient additions to both models, they can provide a tool that allows users to predict the solubility of proteins in silico without first having to run more time-consuming and costly experiments such as mass spectrometry.

Chapter 4

Discussion

This section of the thesis will cover several areas:

• Section 4.1: Ratios obtained from quantitative mass spectrometry are not directly indicative of absolute ratios
• Section 4.2: Feature analysis of LS proteins highlights differences between organisms and points to link between functional aggregation and low solubility
  – Section 4.2.1: Analysis of features of LS proteins highlights inter-organism differences
  – Section 4.2.2: LS proteins possess distinct features that differentiate them from other proteins
  – Section 4.2.3: LS proteins may be involved in functional aggregation
• Section 4.3: Generation of models to predict solubility of proteins
• Section 4.4: Future work

4.1 Ratios obtained from quantitative mass spectrometry are not directly indicative of absolute ratios

The aim of this project was to identify proteins that displayed lower solubility and characterize features that may contribute to a protein's solubility or lack thereof. In order to gauge the proportion of a protein that is present in the low solubility fraction, as opposed to the amount of protein within the low solubility fraction, usage of the soluble fraction as a reference is necessary. In the absence of a reference such as the soluble fraction, it is not possible to distinguish between a highly abundant protein with high solubility and a low solubility protein with low abundance, as both proteins could very well have the same absolute abundance within the low solubility fraction.
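A small numerical illustration may help here. In the Python sketch below, two hypothetical proteins have the same amount in the pellet, so pellet abundance alone cannot tell them apart, whereas the log2 pellet-to-soluble ratio can. The abundance values and protein labels are invented for this example.

```python
import math

# Hypothetical abundances in arbitrary units (not measured values).
proteins = {
    "abundant_soluble": {"pellet": 10.0, "soluble": 1000.0},
    "scarce_insoluble": {"pellet": 10.0, "soluble": 5.0},
}

def log2_partition_ratio(pellet, soluble):
    """log2(pellet / soluble); higher values indicate lower solubility."""
    return math.log2(pellet / soluble)

ratios = {
    name: log2_partition_ratio(amounts["pellet"], amounts["soluble"])
    for name, amounts in proteins.items()
}

# Identical pellet abundance, but the ratios fall on opposite sides of zero.
same_pellet_amount = (
    proteins["abundant_soluble"]["pellet"] == proteins["scarce_insoluble"]["pellet"]
)
```

Here the abundant, mostly soluble protein receives a strongly negative ratio while the scarce, mostly insoluble one receives a positive ratio, even though an analysis of the pellet alone would treat them as equivalent.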
While the ratios obtained by the normalization method used are reflective of the partitioning of a protein between the low solubility and soluble fractions, they should not be taken directly as an absolute ratio of said partitioning. For example, in yeast the amount of protein recovered from the low solubility pellet was typically 2% of the amount recovered from the supernatant. Mixing the proteins obtained from each fraction in a 1:1 ratio by mass would therefore over-represent proteins in the pellet by a corresponding amount (roughly 50-fold). The main advantage of the method used here is that it accounts for protein abundance, and it thus represents an improvement over previously used absolute quantification [29, 47].

4.2 Feature analysis of LS proteins highlights differences between organisms and points to association between functional aggregation and low solubility

After obtaining ratios representing the solubility of various proteins, we categorized proteins into lower solubility (LS), normal solubility (NS), and higher solubility (HS) bins based on their solubility. We then moved on to examine what traits distinguished LS proteins from other proteins. We examined several traits and found that some aid in distinguishing LS proteins from other proteins, as well as in better understanding what role LS proteins might have in relation to the proteome.

4.2.1 Analysis of features of LS proteins highlights inter-organism differences

Some of the features we examined, such as length, showed a consistent trend across all three model organisms. In many cases, however, there was agreement between mouse and human, but not with yeast. This could be due to the greater evolutionary distance between yeast and the other two species.

One such case was that amyloid aggregation-prone proteins were predicted to be more prevalent in the LS bin in yeast, but more prevalent in the HS bin in mouse and human. One possibility is that S.
cerevisiae is more tolerant of amyloid aggregation-prone proteins as it is able to retain harmful aggregates within the mother cell during budding, allowing the daughter cell to be free of harmful aggregated species [115]. For a unicellular fungal organism, this ensures the generation of offspring with greater fitness. Another possibility is that S. cerevisiae is simply less able to disaggregate proteins as it lacks Hsp110 disaggregases [101], resulting in amyloid aggregation-prone proteins forming a larger portion of the LS fraction than in metazoans. This is consistent with the observation that LS proteins are predicted to be more amyloid aggregation prone in yeast but not in our human and mouse samples (Figure 3.6). Mouse and human cells do not possess these mechanisms for dealing with aggregated species. In multicellular organisms, the fitness of the whole organism would necessitate some mechanism of disposing of aggregated proteins, as both daughter cells after cell division are still part of the whole organism and contribute to its fitness. In view of this, it is likely that the significance of LS proteins in yeast differs greatly from that of LS proteins in metazoans.

4.2.2 LS proteins possess distinct features that differentiate them from other proteins

As mentioned in Section 4.2.1, LS proteins tended to display more amyloid aggregation propensity in yeast, but not in mouse and human. The first possibility discussed was the evolutionary distance between yeast, a fungal organism, and the human and mouse samples, which are metazoan. A second explanation for this observation might be that, since the algorithms used rely on hydrophobicity as well as beta-sheet propensity to predict amyloid aggregation [78], proteins with lower hydrophobicity scores would be predicted to be less amyloid aggregation prone. Consistent with this, mouse and human LS proteins did in fact obtain lower hydrophobicity scores relative to NS and HS proteins.
LS proteins in the mouse and human datasets had a lower abundance of certain residues possessing hydrophobic side chains, such as isoleucine, leucine, and valine, which might result in these proteins being deemed less amyloid aggregation prone by the algorithms. Given that many proteins in the LS bin have previously not been characterized as amyloid aggregation prone, it is possible that they possess novel features contributing to their low solubility that are not taken into consideration by existing algorithms.

Several other features we examined highlighted trends that are consistent with previously published studies. In human and mouse, proteins in the LS bin were found to be enriched in coils, relative to the NS and HS bins, consistent with the higher percentage of predicted disorder. Our finding that low solubility proteins tended to be more disordered is consistent with published work by Lai et al. showing that disordered proteins can participate in the assembly of functional aggregates. The finding by Ng et al. that less soluble, ubiquitinated proteins (albeit after heat stress) are more disordered, and that longer proteins were depleted in the soluble fraction, supports this notion. Longer proteins potentially have more capacity to contain regions capable of participating in interactions with other proteins. Even after normalizing for length, proteins in the LS bin also contained more MoRFs, ELMs, and LCRs, features which are known to mediate intermolecular interactions between proteins. LCRs and MoRFs have been shown to bind multiple partners [90], consistent with the idea of disordered proteins forming interactions with multiple other partner proteins. Previous work by Kato et al.
showed that LCRs are necessary and sufficient for the formation of hydrogels by proteins, underscoring the role of LCRs in functional assemblies.

Phosphorylation within unstructured regions of disordered proteins has also been shown to regulate formation of aggregates [51]. Mouse and human LS proteins were enriched in serine residues, which are commonly utilized as phosphorylation targets. Given the association of aberrant hyperphosphorylation and glycosylation with pathologies such as Alzheimer's disease (AD) [5, 13, 49], an enrichment for serine that would normally result in benign structures may be responsible for the assembly of harmful aggregates in the case of certain proteins.

Proteins associated with RNA were enriched in the LS bin, which is consistent with the functional aggregation hypothesis. RRM-1 domains, which are known to be involved in RNA binding, were indeed enriched in the LS bin in the human and mouse datasets. GO analysis also highlighted the enrichment of RNA processing and RNA binding proteins within the LS bin of all three organisms. Consistent with this, the LS bin in mouse showed an enrichment for arginine, which is commonly involved in binding to nucleic acids, as well as a tendency to have a more positive net charge. LCRs are also known to play a role in the assembly of RNA granules, functional aggregates that store mRNAs and allow an additional layer of regulation of gene expression [64, 102]. RNA packaging and transport to cellular extremities is essential to the complex architecture of neuronal cells [37]. While the solubility of RNA-related proteins was affected by the presence of RNase, these proteins were not exclusive to the low solubility fraction, and trends observed in the absence of RNase persisted after RNase addition.
Macromolecular assemblies containing RNA are therefore unlikely to be the dominating feature of the low solubility fraction.

Many of the features that low solubility proteins possess suggest that their low solubility status might be due to biologically relevant interactions with other macromolecules within the cell, rather than merely aberrant interactions that need to be abrogated.

4.2.3 LS proteins may be involved in functional aggregation

LS proteins were found to possess several traits which are involved in protein-protein interactions, such as linear motifs, MoRFs, LCRs, and RNA binding regions. The tendency to contain more of these features raises the possibility that LS proteins are multivalent, allowing a single protein to form interactions with multiple partners simultaneously. Multivalent proteins can thus interact and form the building blocks for functional macromolecular complexes. This suggests that low solubility proteins may be involved in the formation of functional aggregates [24], which are distinct from toxic aggregates caused by aberrant folding or other insults. Functional aggregates are macromolecular assemblies that form in biological systems in a dynamic and reversible fashion [127]. Such functional assemblies can be formed via liquid-liquid demixing [75] and phase transitions in vivo, which result in such assemblies forming a phase distinct from the aqueous solution. This separation from the aqueous phase would be consistent with their presence in the low solubility fraction. LS proteins in yeast were also enriched in glutamine, which is known to form non-toxic aggregates [53], consistent with the idea that aggregates formed by LS proteins are functional rather than toxic.

The traits that distinguish LS proteins are associated with protein-protein interactions. Coupled with a tendency to be longer, LS proteins may thus be able to assemble into functional macromolecular complexes.
This highlights the notion that LS proteins may be of lower solubility because they are assembled into functional, rather than toxic, aggregates.

4.3 Generation of models to predict solubility of proteins

After looking at many traits and examining their relationship with solubility, we then utilized these traits to build models to predict the solubility of proteins in silico. A linear regression model was built in a stepwise fashion, adding in traits that could help improve the accuracy of the model, with elastic net regularization to help prevent overfitting. A support vector machine was also built to distinguish between LS and HS proteins. Both models utilized cross-validation, splitting the data into equally sized portions and using each as a test set in turn, with the remaining portions being used for training the model.

While the regularized regression model is not able to provide precise estimates of the solubility ratio of proteins, it does allow for an approximation of said ratios. Interestingly, not all of the traits that displayed an ability to distinguish LS proteins from NS and HS proteins were selected for inclusion in the model. Notably, length, linear motifs, and LCRs were not selected. This suggests that they could be correlated with one or more of the traits that were included in the model, which would limit their contribution to the predictive power of the model and possibly prevent their inclusion. The number of close stop codons (codons one base substitution away from a stop codon) [1] was included in the model, highlighting how features not in the actual protein sequence may be correlated with solubility and thus useful in predicting it. The fact that the model predicts a numerical ratio without further categorization, coupled with the current margins of error, makes it difficult to use the model as a preliminary step prior to biological validation.
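To make the "close stop codon" trait mentioned above concrete, the following Python sketch counts in-frame codons of a coding sequence that are a single base substitution away from a stop codon. It assumes the standard stop codons TAA, TAG, and TGA and does not count stop codons themselves; whether the trait used in the model was computed in exactly this way is an assumption, and the example sequence is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
BASES = "ACGT"

def is_close_to_stop(codon):
    """True if a sense codon becomes a stop codon via one base substitution."""
    if codon in STOP_CODONS:
        return False
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutated = codon[:position] + base + codon[position + 1:]
            if mutated in STOP_CODONS:
                return True
    return False

def count_close_stop_codons(cds):
    """Count in-frame codons that are one substitution away from a stop codon."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable_length, 3)]
    return sum(is_close_to_stop(codon) for codon in codons)

# Hypothetical coding sequence: ATG AAA TGG GCT TAA
example_count = count_close_stop_codons("ATGAAATGGGCTTAA")
```

In this toy sequence, AAA (one change from TAA) and TGG (one change from TGA or TAG) are counted, while ATG, GCT, and the terminal stop codon are not.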
Incorporating a cutoff to identify proteins as LS would be an approach to explore, and could improve usability of the model.

The support vector machine is able to effectively distinguish between LS and HS proteins with a low FDR and high accuracy. However, the model is unable to reliably detect LS proteins and distinguish them from NS proteins. This is likely because LS and NS proteins are much more similar to each other than LS and HS proteins are, and thus more challenging to distinguish. Identifying more traits that differentiate LS proteins from other proteins will be crucial in improving the predictive power of the models. With sufficiently high accuracy and low FDR, the model will be able to provide a useful preliminary in silico step to identify proteins of lower solubility, which a user can then attempt to validate using in vivo or in vitro methods.

Figure 4.1: SVM and distinguishing multiple subtypes of a class. Potential subtypes of LS proteins might reduce separation margins if subtypes are grouped together (L1). Separating subtypes for classification may improve separation margin (L2 and L3).

Currently, both models work on the assumption that LS proteins are a specific subset of proteins that can be defined by a common set of traits. Given that proteins could conceivably be detected as low solubility by being within toxic aggregates as well as by being part of large functional assemblies, it is quite possible that LS proteins comprise two or more distinct classes of proteins. If this is indeed so, the models might have difficulty distinguishing LS proteins from non-LS proteins. For instance, if LS proteins were composed of subtype 1, with high traits A, B, C and low D, E, F, and subtype 2, with low A, B, C and high D, E, F, the linear regression model would be hard pressed to find a single set of coefficient values whereby both subtypes 1 and 2 would be scored highly.
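The difficulty can be checked numerically under a simplifying assumption. In the Python sketch below, each trait is idealized as 1 (high) or 0 (low), and a single linear score is applied to the two hypothetical LS subtypes and to two hypothetical non-LS profiles (all traits high, all traits low). Because the two subtype scores always sum to the same total as the two non-LS scores, no set of weights can place both subtypes strictly above both non-LS profiles; the random search simply confirms this. The binary encoding, weight range, and function names are assumptions of this example.

```python
import random

TRAITS = ["A", "B", "C", "D", "E", "F"]

# Idealized binary trait profiles (1 = high, 0 = low); purely illustrative.
subtype_1 = [1, 1, 1, 0, 0, 0]  # LS subtype 1: high A, B, C and low D, E, F
subtype_2 = [0, 0, 0, 1, 1, 1]  # LS subtype 2: low A, B, C and high D, E, F
all_high = [1, 1, 1, 1, 1, 1]   # non-LS profile scoring high on every trait
all_low = [0, 0, 0, 0, 0, 0]    # non-LS profile scoring low on every trait

def linear_score(weights, intercept, profile):
    """Score of one profile under a single set of coefficients."""
    return intercept + sum(w * x for w, x in zip(weights, profile))

def both_subtypes_outscore_non_ls(weights, intercept):
    """True if both LS subtypes score strictly above both non-LS profiles."""
    lowest_ls = min(
        linear_score(weights, intercept, subtype_1),
        linear_score(weights, intercept, subtype_2),
    )
    highest_non_ls = max(
        linear_score(weights, intercept, all_high),
        linear_score(weights, intercept, all_low),
    )
    return lowest_ls > highest_non_ls

rng = random.Random(0)
trials = 10000
successes = sum(
    both_subtypes_outscore_non_ls(
        [rng.uniform(-5, 5) for _ in TRAITS], rng.uniform(-5, 5)
    )
    for _ in range(trials)
)
```

Modeling each subtype with its own coefficients removes this constraint, since each score then only has to separate one subtype from the non-LS profiles.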
Contributions to the prediction from A, B, and C would tend to oppose those from D, E, and F, making it difficult to distinguish LS proteins from proteins that scored high in all of the traits as well as from proteins that scored low in all of the traits. In order to address this, each subtype might be modeled separately and assigned its own set of coefficients.

Likewise, for the SVM, if the two subtypes of LS proteins are different enough, it is possible that a separating hyperplane such as L1 (Figure 4.1) might be able to distinguish both LS protein subtypes (in blue and black) from non-LS proteins (red) with only a small separation margin, resulting in reduced performance of the model. Attempting to distinguish just one subtype from non-LS proteins, such as with separating hyperplane L2, might allow for a wider separation margin. This wider separation margin would allow for more confident and accurate separation of proteins. In addition to identification of additional traits, future work will also involve clustering proteins to determine if there are indeed various subtypes of LS proteins. Adjusting the models to distinguish one specific subtype at a time is a potential avenue to improve the performance of the SVM and the regression model.

4.4 Future work

The traits examined here have provided a plausible explanation for why low solubility proteins are indeed of low solubility. Further work could follow up by modifying certain traits correlated with solubility (such as length), while keeping other traits as close to unchanged as possible, and monitoring any change in solubility. This work, while technically challenging, will aid in establishing whether the relationship seen in this study is causal or merely correlative.

Examining additional traits may also shed more light on factors that contribute to the solubility of a protein. The amino acid index (AAindex) [65] is a database
The amino acid index (AAindex) [65] is a database52of numerical indices representing various physicochemical and biochemical prop-erties of amino acids and pairs of amino acids. Given the presence of certain lo-calized features, such as MoRFs, it would be interesting to assess the scores ofsliding windows of various sizes across proteins using the amino acid index. Thisapproach aims to identify certain stretches within proteins that may be correlatedwith solubility. There are also large scale datasets characterizing half-lives of yeastproteins [11] as well as localization upon stress [86] that could be examined fortheir correlation to solubility. Identifying additional factors such as these and in-cluding them in the models will also likely contribute to the quality of the modelsdeveloped.In addition to examining various protein traits, future work will also involveinvestigating whether LS proteins are composed of multiple subtypes of proteins.Clustering algorithms such as Markov clustering [38] could be used to cluster theLS proteins and determine if there are indeed different subtypes of LS proteins.If LS proteins are comprised of different subtypes, modelling individual subtypeswould be useful in improving the performance of the regression and SVM modelsin predicting LS proteins.Currently, the yeast data which was used to build the models only covers ap-proximately one-third of the proteome. A deeper mass spectrometry run whichcovers much more of the proteome would provide more data points with whichto train the model on, increasing predictive power, while ironically reducing thenumber of proteins not identified in the mass spectrometry run that a user mightactually need to predict via the model. 
Certain proteins might be of too low abundance to be detected and quantified by mass spectrometry, or might be removed in the pre-clearance step, and so be unobtainable by our current methods, which could limit how many more data points we can acquire with a deeper mass spectrometry experiment.

Both of the models in Section 3.2 have various improvements that can be made to them. The linear model could be converted into a binary classifier to hopefully overcome its large error margin, as well as to increase usability, with predictions of "Low solubility" or "Not low solubility" being more intuitive and easier to work with than a continuous ratio. The SVM, while it has performed reasonably in distinguishing LS proteins from their HS counterparts, could potentially be more useful and powerful if it were trained on the full dataset. This would allow it to distinguish between NS and LS proteins, which would require more power and precision than the current model that distinguishes LS and HS proteins.

By gaining a better understanding of which traits contribute to protein solubility, we can better understand the mechanisms of protein solubility. With tools and models that allow for the prediction of protein solubility in silico, it will be possible to design experiments with protein solubility in mind, without having to assess the solubility of the entire proteome empirically.

Chapter 5

Conclusion

Protein solubility is an integral component of protein homeostasis. Disruption of homeostasis can result in toxic aggregates that are detrimental to cell fitness. Neurodegenerative diseases are crippling diseases that have been associated with protein aggregation in cells.
Understanding more about protein solubility and aggregation will be crucial to gleaning insight into the pathologies and designing treatments.

In order to examine traits associated with low solubility, we utilized quantitative mass spectrometry and an internal standard to account for protein abundance, allowing us to obtain the solubility of proteins. After classifying proteins as low, normal, or high solubility, we examined several features of proteins and analyzed them for correlation with solubility. We have thus identified a number of features that distinguish low solubility proteins under unstressed conditions.

Several of the features we examined exhibited trends that were consistent across all three model organisms studied, in spite of the vast evolutionary distances between some organisms. In many cases, human and mouse samples showed a similarity that was not shared by yeast proteins, highlighting the evolutionary disparity between the fungal and metazoan systems, and suggesting that factors underlying solubility might differ greatly in these systems. Proteins found to be of low solubility were longer, less abundant, and more disordered. These proteins also contained more coiled regions, LCRs, ELMs, and MoRFs, suggesting a relationship between solubility and number of potential interaction partners. This points to a possible connection between low solubility proteins and functional aggregates. LS protein-encoding genes also had a lower GC content, highlighting a relationship between coding sequence and the solubility of the encoded protein.

We also generated two models with which to estimate protein solubility, a linear regression model as well as a support vector machine. Both models provide usable estimates for solubility, but improving their accuracy will require uncovering more traits that correlate with protein solubility.
Accurate algorithms to predict protein solubility will aid greatly in the experimental biology that will be crucial in understanding this complex aspect of protein homeostasis.

The work presented here highlights several traits that characterize low solubility proteins, as well as the possibility that low solubility proteins are of low solubility due to a role in functional aggregation. The models generated are starting steps towards providing a high-throughput in silico platform for predicting protein solubility.

Bibliography

[1] F. Agostini, M. Vendruscolo, and G. G. Tartaglia. Sequence-based prediction of protein solubility. Journal of Molecular Biology, 421(2-3):237–241, Aug. 2012. ISSN 1089-8638. doi:10.1016/j.jmb.2011.12.005. → pages 12, 50
[2] H. Aguilaniu. Asymmetric Inheritance of Oxidatively Damaged Proteins During Cytokinesis. Science, 299(5613):1751–1753, Mar. 2003. ISSN 00368075, 10959203. doi:10.1126/science.1080418. URL http://www.sciencemag.org/cgi/doi/10.1126/science.1080418. → pages 2, 19
[3] E. M. Ahmed. Hydrogel: Preparation, characterization, and applications. Journal of Advanced Research, July 2013. ISSN 20901232. doi:10.1016/j.jare.2013.07.006. URL http://linkinghub.elsevier.com/retrieve/pii/S2090123213000969. → pages 4
[4] R. F. Albu, G. T. Chan, M. Zhu, E. T. C. Wong, F. Taghizadeh, X. Hu, A. E. Mehran, J. D. Johnson, J. Gsponer, and T. Mayor. A feature analysis of lower solubility proteins in three eukaryotic systems. Journal of Proteomics, Oct. 2014. ISSN 1876-7737. doi:10.1016/j.jprot.2014.10.011. → pages iii, 6, 9, 14, 16, 18, 20, 22, 23, 25, 26, 27, 29, 31, 33, 35, 37, 38
[5] A. d. C. Alonso, T. Zaidi, M. Novak, I. Grundke-Iqbal, and K. Iqbal. Hyperphosphorylation induces self-assembly of τ into tangles of paired helical filaments/straight filaments. Proceedings of the National Academy of Sciences, 98(12):6923–6928, June 2001. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.121119298. URL http://www.pnas.org/cgi/doi/10.1073/pnas.121119298. → pages 48
[6] A.
Alves-Rodrigues, L. Gregori, and M. E. Figueiredo-Pereira. Ubiquitin, cellular inclusions and their role in neurodegeneration. Trends in Neurosciences, 21(12):516–520, Dec. 1998. ISSN 01662236. doi:10.1016/S0166-2236(98)01276-4. URL http://linkinghub.elsevier.com/retrieve/pii/S0166223698012764. → pages 3
[7] E. Angot, J. A. Steiner, C. M. Lema Tomé, P. Ekström, B. Mattsson, A. Björklund, and P. Brundin. Alpha-Synuclein Cell-to-Cell Transfer and Seeding in Grafted Dopaminergic Neurons In Vivo. PLoS ONE, 7(6):e39465, June 2012. ISSN 1932-6203. doi:10.1371/journal.pone.0039465. URL http://dx.plos.org/10.1371/journal.pone.0039465. → pages 3
[8] S. Aulić, T. T. Le, F. Moda, S. Abounit, S. Corvaglia, L. Casalis, S. Gustincich, C. Zurzolo, F. Tagliavini, and G. Legname. Defined α-synuclein prion-like molecular assemblies spreading in cell culture. BMC Neuroscience, 15(1):69, 2014. ISSN 1471-2202. doi:10.1186/1471-2202-15-69. URL http://www.biomedcentral.com/1471-2202/15/69. → pages 3
[9] M. Babu, J. Vlasblom, S. Pu, X. Guo, C. Graham, B. D. M. Bean, H. E. Burston, F. J. Vizeacoumar, J. Snider, S. Phanse, V. Fong, Y. Y. C. Tam, M. Davey, O. Hnatshak, N. Bajaj, S. Chandran, T. Punna, C. Christopolous, V. Wong, A. Yu, G. Zhong, J. Li, I. Stagljar, E. Conibear, S. J. Wodak, A. Emili, and J. F. Greenblatt. Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae. Nature, 489(7417):585–589, Sept. 2012. ISSN 0028-0836, 1476-4687. doi:10.1038/nature11354. URL http://www.nature.com/doifinder/10.1038/nature11354. → pages 19
[10] F. Bardag-Gorce, J. Vu, L. Nan, N. Riley, J. Li, and S. W. French. Proteasome inhibition induces cytokeratin accumulation in vivo. Experimental and Molecular Pathology, 76(2):83–89, Apr. 2004. ISSN 00144800. doi:10.1016/j.yexmp.2003.11.004. URL http://linkinghub.elsevier.com/retrieve/pii/S0014480003001382. → pages 2
[11] A. Belle, A. Tanay, L. Bitincka, R. Shamir, and E. K.
O’Shea. Quantification of protein half-lives in the budding yeast proteome. Proceedings of the National Academy of Sciences of the United States of America, 103(35):13004–13009, Aug. 2006. ISSN 0027-8424. doi:10.1073/pnas.0605420103. → pages 53
[12] J. L. Biedler, S. Roffler-Tarlov, M. Schachner, and L. S. Freedman. Multiple neurotransmitter synthesis by human neuroblastoma cell lines and clones. Cancer Research, 38(11 Pt 1):3751–3757, Nov. 1978. ISSN 0008-5472. → pages 17
[13] M. Broncel, J. Falenski, S. Wagner, C. Hackenberger, and B. Koksch. How Post-Translational Modifications Influence Amyloid Formation: A Systematic Study of Phosphorylation and Glycosylation in Model Peptides. Chemistry - A European Journal, 16(26):7881–7888, May 2010. ISSN 09476539. doi:10.1002/chem.200902452. URL http://doi.wiley.com/10.1002/chem.200902452. → pages 48
[14] M. Bucciantini, E. Giannoni, F. Chiti, F. Baroni, L. Formigli, J. Zurdo, N. Taddei, G. Ramponi, C. M. Dobson, and M. Stefani. Inherent toxicity of aggregates implies a common mechanism for protein misfolding diseases. Nature, 416(6880):507–511, Apr. 2002. ISSN 0028-0836. doi:10.1038/416507a. → pages 1, 2
[15] M. R. Bufalino, B. DeVeale, and D. van der Kooy. The asymmetric segregation of damaged proteins is stem cell-type dependent. The Journal of Cell Biology, 201(4):523–530, May 2013. ISSN 0021-9525, 1540-8140. doi:10.1083/jcb.201207052. URL http://www.jcb.org/cgi/doi/10.1083/jcb.201207052. → pages 2, 19
[16] J. Cheng, H. Saigo, and P. Baldi. Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins: Structure, Function, and Bioinformatics, 62(3):617–629, Nov. 2005. ISSN 08873585. doi:10.1002/prot.20787. URL http://doi.wiley.com/10.1002/prot.20787. → pages 12
[17] J. M. Cherry, E. L. Hong, C. Amundsen, R. Balakrishnan, G. Binkley, E. T. Chan, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. R. Engel, D. G. Fisk, J. E. Hirschman, B. C. Hitz, K. Karra, C. J.
Krieger, S. R. Miyasato, R. S. Nash, J. Park, M. S. Skrzypek, M. Simison, S. Weng, and E. D. Wong. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Research, 40(Database issue):D700–705, Jan. 2012. ISSN 1362-4962. doi:10.1093/nar/gkr1029. → pages 9, 12
[18] F. Chiti. Mutational analysis of the propensity for amyloid formation by a globular protein. The EMBO Journal, 19(7):1441–1449, Apr. 2000. ISSN 14602075. doi:10.1093/emboj/19.7.1441. URL http://emboj.embopress.org/cgi/doi/10.1093/emboj/19.7.1441. → pages 1
[19] F. Chiti and C. M. Dobson. Protein misfolding, functional amyloid, and human disease. Annual Review of Biochemistry, 75:333–366, 2006. ISSN 0066-4154. doi:10.1146/annurev.biochem.75.101304.123901. → pages 3
[20] F. Chiti and C. M. Dobson. Amyloid formation by globular proteins under native conditions. Nature Chemical Biology, 5(1):15–22, Jan. 2009. ISSN 1552-4450. doi:10.1038/nchembio.131. URL http://www.nature.com/doifinder/10.1038/nchembio.131. → pages 3
[21] F. Chiti, M. Stefani, N. Taddei, G. Ramponi, and C. M. Dobson. Rationalization of the effects of mutations on peptide and protein aggregation rates. Nature, 424(6950):805–808, Aug. 2003. ISSN 0028-0836, 1476-4679. doi:10.1038/nature01891. URL http://www.nature.com/doifinder/10.1038/nature01891. → pages 3
[22] S. A. Comyn, G. T. Chan, and T. Mayor. False start: cotranslational protein ubiquitination and cytosolic protein quality control. Journal of Proteomics, 100:92–101, Apr. 2014. ISSN 1876-7737. doi:10.1016/j.jprot.2013.08.005. → pages 2
[23] O. Conchillo-Solé, N. S. de Groot, F. X. Avilés, J. Vendrell, X. Daura, and S. Ventura. AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC Bioinformatics, 8(1):65, 2007. ISSN 14712105. doi:10.1186/1471-2105-8-65. URL http://www.biomedcentral.com/1471-2105/8/65. → pages 4, 12
[24] A. Cumberworth, G. Lamour, M. Babu, and J. Gsponer.
Promiscuity as a functional trait: intrinsically disordered regions as central players of interactomes. Biochemical Journal, 454(3):361–369, Sept. 2013. ISSN 0264-6021, 1470-8728. doi:10.1042/BJ20130545. URL http://www.biochemj.org/bj/454/bj4540361.htm. → pages 49
[25] N. E. Davey, K. Van Roey, R. J. Weatheritt, G. Toedt, B. Uyar, B. Altenberg, A. Budd, F. Diella, H. Dinkel, and T. J. Gibson. Attributes of short linear motifs. Molecular BioSystems, 8(1):268, 2012. ISSN 1742-206X, 1742-2051. doi:10.1039/c1mb05231d. URL http://xlink.rsc.org/?DOI=c1mb05231d. → pages 34
[26] D. C. David, N. Ollikainen, J. C. Trinidad, M. P. Cary, A. L. Burlingame, and C. Kenyon. Widespread Protein Aggregation as an Inherent Part of Aging in C. elegans. PLoS Biology, 8(8):e1000450, Aug. 2010. ISSN 1545-7885. doi:10.1371/journal.pbio.1000450. URL http://dx.plos.org/10.1371/journal.pbio.1000450. → pages 21
[27] M. P. C. David, G. P. Concepcion, and E. A. Padlan. Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies. BMC bioinformatics, 11:79, 2010. ISSN 1471-2105. doi:10.1186/1471-2105-11-79. → pages 4
[28] S. J. Davis, E. A. Davies, M. G. Tucknott, E. Y. Jones, and P. A. van der Merwe. The role of charged residues mediating low affinity protein-protein recognition at the cell surface by CD2. Proceedings of the National Academy of Sciences of the United States of America, 95(10):5490–5494, May 1998. ISSN 0027-8424. → pages 28
[29] L. M. F. de Godoy, J. V. Olsen, J. Cox, M. L. Nielsen, N. C. Hubner, F. Fröhlich, T. C. Walther, and M. Mann. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature, 455(7217):1251–1254, Oct. 2008. ISSN 0028-0836, 1476-4687. doi:10.1038/nature07341. URL http://www.nature.com/doifinder/10.1038/nature07341. → pages 11, 21, 22, 37, 46
[30] N. S. de Groot, F. X. Aviles, J. Vendrell, and S. Ventura. Mutagenesis of the central hydrophobic cluster in Abeta42 Alzheimer’s peptide.
Side-chain properties correlate with aggregation propensities. The FEBS journal, 273(3):658–668, Feb. 2006. ISSN 1742-464X. doi:10.1111/j.1742-4658.2005.05102.x. → pages 5
[31] T. F. DeLuca, J. Cui, J.-Y. Jung, K. C. St. Gabriel, and D. P. Wall. Roundup 2.0: enabling comparative genomics for over 1800 genomes. Bioinformatics, 28(5):715–716, Mar. 2012. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/bts006. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bts006. → pages 11
[32] G. Dennis, B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane, and R. A. Lempicki. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology, 4(5):P3, 2003. ISSN 1465-6914. → pages 11
[33] E. W. Dent and P. W. Baas. Microtubules in neurons as information carriers. Journal of Neurochemistry, 129(2):235–239, Apr. 2014. ISSN 00223042. doi:10.1111/jnc.12621. URL http://doi.wiley.com/10.1111/jnc.12621. → pages 21
[34] C. M. Dobson. Protein folding and misfolding. Nature, 426(6968):884–890, Dec. 2003. ISSN 1476-4687. doi:10.1038/nature02261. → pages 2
[35] Z. Dosztanyi, V. Csizmok, P. Tompa, and I. Simon. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21(16):3433–3434, Aug. 2005. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/bti541. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bti541. → pages 12
[36] Z. Dosztanyi, B. Meszaros, and I. Simon. ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 25(20):2745–2746, Oct. 2009. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/btp518. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp518. → pages 12
[37] E. Doxakis. RNA binding proteins: a common denominator of neuronal function and dysfunction. Neuroscience Bulletin, 30(4):610–626, Aug. 2014. ISSN 1673-7067, 1995-8218.
doi:10.1007/s12264-014-1443-7. URL http://link.springer.com/10.1007/s12264-014-1443-7. → pages 49
[38] A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584, Apr. 2002. ISSN 1362-4962. → pages 53
[39] S. Escusa-Toret, W. I. M. Vonk, and J. Frydman. Spatial sequestration of misfolded proteins by a dynamic chaperone pathway enhances cellular fitness during stress. Nature Cell Biology, 15(10):1231–1243, Sept. 2013. ISSN 1465-7392, 1476-4679. doi:10.1038/ncb2838. URL http://www.nature.com/doifinder/10.1038/ncb2838. → pages 2
[40] A.-M. Fernandez-Escamilla, F. Rousseau, J. Schymkowitz, and L. Serrano. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature Biotechnology, 22(10):1302–1306, Oct. 2004. ISSN 1087-0156. doi:10.1038/nbt1012. URL http://www.nature.com/doifinder/10.1038/nbt1012. → pages 4, 12
[41] A. D. Ferrao-Gonzales, S. O. Souto, J. L. Silva, and D. Foguel. The preaggregated state of an amyloidogenic protein: Hydrostatic pressure converts native transthyretin into the amyloidogenic state. Proceedings of the National Academy of Sciences, 97(12):6445–6450, June 2000. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.97.12.6445. URL http://www.pnas.org/cgi/doi/10.1073/pnas.97.12.6445. → pages 1
[42] R. D. Finn, J. Clements, and S. R. Eddy. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39(suppl):W29–W37, July 2011. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkr367. URL http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkr367. → pages 11
[43] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E. L. L. Sonnhammer, J. Tate, and M. Punta. Pfam: the protein families database. Nucleic Acids Research, 42(D1):D222–D230, Jan. 2014. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkt1223.
URL http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkt1223. → pages 11
[44] D. M. Fowler, A. V. Koulov, W. E. Balch, and J. W. Kelly. Functional amyloid – from bacteria to humans. Trends in Biochemical Sciences, 32(5):217–224, May 2007. ISSN 0968-0004. doi:10.1016/j.tibs.2007.03.003. → pages 3
[45] J. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010. ISSN 1548-7660. → pages 13, 40
[46] B. Frost, R. L. Jacks, and M. I. Diamond. Propagation of Tau Misfolding from the Outside to the Inside of a Cell. Journal of Biological Chemistry, 284(19):12845–12852, May 2009. ISSN 0021-9258, 1083-351X. doi:10.1074/jbc.M808759200. URL http://www.jbc.org/cgi/doi/10.1074/jbc.M808759200. → pages 3
[47] S. Ghaemmaghami, W.-K. Huh, K. Bower, R. W. Howson, A. Belle, N. Dephoure, E. K. O’Shea, and J. S. Weissman. Global analysis of protein expression in yeast. Nature, 425(6959):737–741, Oct. 2003. ISSN 0028-0836, 1476-4679. doi:10.1038/nature02046. URL http://www.nature.com/doifinder/10.1038/nature02046. → pages 11, 21, 22, 37, 46
[48] N. Gilks. Stress Granule Assembly Is Mediated by Prion-like Aggregation of TIA-1. Molecular Biology of the Cell, 15(12):5383–5398, Sept. 2004. ISSN 1059-1524. doi:10.1091/mbc.E04-08-0715. URL http://www.molbiolcell.org/cgi/doi/10.1091/mbc.E04-08-0715. → pages 3, 4
[49] C.-X. Gong, F. Liu, I. Grundke-Iqbal, and K. Iqbal. Post-translational modifications of tau protein in Alzheimer's disease. Journal of Neural Transmission, 112(6):813–838, June 2005. ISSN 0300-9564, 1435-1463. doi:10.1007/s00702-004-0221-0. URL http://link.springer.com/10.1007/s00702-004-0221-0. → pages 48
[50] J. Gsponer and M. Babu. Cellular Strategies for Regulating Functional and Nonfunctional Protein Aggregation. Cell Reports, 2(5):1425–1437, Nov. 2012. ISSN 22111247. doi:10.1016/j.celrep.2012.09.036. URL http://linkinghub.elsevier.com/retrieve/pii/S2211124712003671. → pages 12, 21
[51] J. Gsponer, M.
E. Futschik, S. A. Teichmann, and M. M. Babu. Tight Regulation of Unstructured Proteins: From Transcript Synthesis to Protein Degradation. Science, 322(5906):1365–1368, Nov. 2008. ISSN 0036-8075, 1095-9203. doi:10.1126/science.1163581. URL http://www.sciencemag.org/cgi/doi/10.1126/science.1163581. → pages 48
[52] J. L. Guo and V. M.-Y. Lee. Seeding of Normal Tau by Pathological Tau Conformers Drives Pathogenesis of Alzheimer-like Tangles. Journal of Biological Chemistry, 286(17):15317–15331, Apr. 2011. ISSN 0021-9258, 1083-351X. doi:10.1074/jbc.M110.209296. URL http://www.jbc.org/cgi/doi/10.1074/jbc.M110.209296. → pages 3
[53] R. Halfmann, S. Alberti, R. Krishnan, N. Lyle, C. O’Donnell, O. King, B. Berger, R. Pappu, and S. Lindquist. Opposing Effects of Glutamine and Asparagine Govern Prion Formation by Intrinsically Disordered Proteins. Molecular Cell, 43(1):72–84, July 2011. ISSN 10972765. doi:10.1016/j.molcel.2011.05.013. URL http://linkinghub.elsevier.com/retrieve/pii/S1097276511003807. → pages 28, 49
[54] S. L. Hands and A. Wyttenbach. Neurotoxic protein oligomerisation associated with polyglutamine diseases. Acta Neuropathologica, 120(4):419–437, Oct. 2010. ISSN 1432-0533. doi:10.1007/s00401-010-0703-0. → pages 25
[55] P. M. Harrison and M. Gerstein. A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes. Genome Biology, 4(6):R40, 2003. ISSN 1465-6914. doi:10.1186/gb-2003-4-6-r40. → pages 12, 13
[56] B. B. Holmes and M. I. Diamond. Prion-like Properties of Tau Protein: The Importance of Extracellular Tau as a Therapeutic Target. Journal of Biological Chemistry, 289(29):19855–19861, July 2014. ISSN 0021-9258, 1083-351X. doi:10.1074/jbc.R114.549295. URL http://www.jbc.org/cgi/doi/10.1074/jbc.R114.549295. → pages 3
[57] P. V. Hornbeck, I. Chabra, J. M. Kornhauser, E. Skrzypek, and B. Zhang. PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation.
PROTEOMICS, 4(6):1551–1561, June 2004. ISSN 1615-9853, 1615-9861. doi:10.1002/pmic.200300772. URL http://doi.wiley.com/10.1002/pmic.200300772. → pages 12
[58] J. D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3):90–95, 2007. ISSN 1521-9615. doi:10.1109/MCSE.2007.55. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4160265. → pages 11
[59] D.-H. Hyun, M. Lee, B. Halliwell, and P. Jenner. Proteasomal inhibition causes the formation of protein aggregates containing a wide range of proteins, including nitrated proteins. Journal of Neurochemistry, 86(2):363–373, July 2003. ISSN 0022-3042. → pages 2
[60] S. Idicula-Thomas and P. V. Balaji. Understanding the relationship between the primary structure of proteins and their amyloidogenic propensity: clues from inclusion body formation. Protein engineering, design & selection: PEDS, 18(4):175–180, Apr. 2005. ISSN 1741-0126. doi:10.1093/protein/gzi022. → pages 4
[61] A. Iwata. HDAC6 and Microtubules Are Required for Autophagic Degradation of Aggregated Huntingtin. Journal of Biological Chemistry, 280(48):40282–40292, Sept. 2005. ISSN 0021-9258, 1083-351X. doi:10.1074/jbc.M508786200. URL http://www.jbc.org/cgi/doi/10.1074/jbc.M508786200. → pages 2
[62] J. A. Johnston. Aggresomes: A Cellular Response to Misfolded Proteins. The Journal of Cell Biology, 143(7):1883–1898, Dec. 1998. ISSN 00219525. doi:10.1083/jcb.143.7.1883. URL http://www.jcb.org/cgi/doi/10.1083/jcb.143.7.1883. → pages 2
[63] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, Sept. 1999. ISSN 0022-2836. doi:10.1006/jmbi.1999.3091. → pages 12
[64] M. Kato, T. Han, S. Xie, K. Shi, X. Du, L. Wu, H. Mirzaei, E. Goldsmith, J. Longgood, J. Pei, N. Grishin, D. Frantz, J. Schneider, S. Chen, L. Li, M. Sawaya, D. Eisenberg, R. Tycko, and S. McKnight.
Cell-free Formation of RNA Granules: Low Complexity Sequence Domains Form Dynamic Fibers within Hydrogels. Cell, 149(4):753–767, May 2012. ISSN 00928674. doi:10.1016/j.cell.2012.04.017. URL http://linkinghub.elsevier.com/retrieve/pii/S0092867412005144. → pages 4, 48, 49
[65] S. Kawashima, H. Ogata, and M. Kanehisa. AAindex: Amino Acid Index Database. Nucleic Acids Research, 27(1):368–369, Jan. 1999. ISSN 0305-1048. → pages 52
[66] Y. E. Kim, M. S. Hipp, A. Bracher, M. Hayer-Hartl, and F. U. Hartl. Molecular chaperone functions in protein folding and proteostasis. Annual Review of Biochemistry, 82:323–355, 2013. ISSN 1545-4509. doi:10.1146/annurev-biochem-060208-092442. → pages 2
[67] G. Kleiger and T. Mayor. Perilous journey: a tour of the ubiquitin–proteasome system. Trends in Cell Biology, 24(6):352–359, June 2014. ISSN 09628924. doi:10.1016/j.tcb.2013.12.003. URL http://linkinghub.elsevier.com/retrieve/pii/S0962892413002274. → pages 2
[68] T. P. J. Knowles, M. Vendruscolo, and C. M. Dobson. The amyloid state and its association with protein misfolding diseases. Nature Reviews Molecular Cell Biology, 15(6):384–396, May 2014. ISSN 1471-0072, 1471-0080. doi:10.1038/nrm3810. URL http://www.nature.com/doifinder/10.1038/nrm3810. → pages 3
[69] R. R. Kopito. Aggresomes, inclusion bodies and protein aggregation. Trends in Cell Biology, 10(12):524–530, Dec. 2000. ISSN 09628924. doi:10.1016/S0962-8924(00)01852-3. URL http://linkinghub.elsevier.com/retrieve/pii/S0962892400018523. → pages 2
[70] A. Krogh, B. Larsson, G. von Heijne, and E. L. Sonnhammer. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3):567–580, Jan. 2001. ISSN 0022-2836. doi:10.1006/jmbi.2000.4315. → pages 12
[71] J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1):105–132, May 1982. ISSN 0022-2836. → pages 12, 29
[72] J. Lai, C. H. Koh, M. Tjota, L. Pieuchot, V.
Raman, K. B. Chandrababu, D. Yang, L. Wong, and G. Jedd. Intrinsically disordered proteins aggregate at fungal cell-to-cell channels and regulate intercellular connectivity. Proceedings of the National Academy of Sciences, 109(39):15781–15786, Sept. 2012. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1207467109. URL http://www.pnas.org/cgi/doi/10.1073/pnas.1207467109. → pages 4, 48
[73] S. H. Lecker. Protein Degradation by the Ubiquitin-Proteasome Pathway in Normal and Disease States. Journal of the American Society of Nephrology, 17(7):1807–1819, June 2006. ISSN 1046-6673, 1533-3450. doi:10.1681/ASN.2006010083. URL http://www.jasn.org/cgi/doi/10.1681/ASN.2006010083. → pages 2
[74] B. Levine, N. Mizushima, and H. W. Virgin. Autophagy in immunity and inflammation. Nature, 469(7330):323–335, Jan. 2011. ISSN 1476-4687. doi:10.1038/nature09782. → pages 2
[75] P. Li, S. Banjade, H.-C. Cheng, S. Kim, B. Chen, L. Guo, M. Llaguno, J. V. Hollingsworth, D. S. King, S. F. Banani, P. S. Russo, Q.-X. Jiang, B. T. Nixon, and M. K. Rosen. Phase transitions in the assembly of multivalent signalling proteins. Nature, 483(7389):336–340, Mar. 2012. ISSN 1476-4687. doi:10.1038/nature10879. → pages 49
[76] S. Li, T. Izumi, J. Hu, H. H. Jin, A.-A. A. Siddiqui, S. G. Jacobson, D. Bok, and M. Jin. Rescue of enzymatic function for disease-associated RPE65 proteins containing various missense mutations in non-active sites. The Journal of Biological Chemistry, 289(27):18943–18956, July 2014. ISSN 1083-351X. doi:10.1074/jbc.M114.552117. → pages 1
[77] T.-h. Lin, R. F. Murphy, and Z. Bar-Joseph. Discriminative motif finding for predicting protein subcellular localization. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 8(2):441–451, Apr. 2011. ISSN 1557-9964. doi:10.1109/TCBB.2009.82. → pages 12
[78] R. Linding, J. Schymkowitz, F. Rousseau, F. Diella, and L. Serrano.
A Comparative Study of the Relationship Between Protein Structure and β-Aggregation in Globular and Intrinsically Disordered Proteins. Journal of Molecular Biology, 342(1):345–353, Sept. 2004. ISSN 00222836. doi:10.1016/j.jmb.2004.06.088. URL http://linkinghub.elsevier.com/retrieve/pii/S0022283604007715. → pages 47
[79] J. Lowe, A. Blanchard, K. Morrell, G. Lennox, L. Reynolds, M. Billett, M. Landon, and R. J. Mayer. Ubiquitin is a common factor in intermediate filament inclusion bodies of diverse type in man, including those of Parkinson’s disease, Pick’s disease, and Alzheimer’s disease, as well as Rosenthal fibres in cerebellar astrocytomas, cytoplasmic bodies in muscle, and mallory bodies in alcoholic liver disease. The Journal of Pathology, 155(1):9–15, May 1988. ISSN 0022-3417, 1096-9896. doi:10.1002/path.1711550105. URL http://doi.wiley.com/10.1002/path.1711550105. → pages 3
[80] K. Lundmark, G. T. Westermark, S. Nystrom, C. L. Murphy, A. Solomon, and P. Westermark. Transmissibility of systemic amyloidosis by a prion-like mechanism. Proceedings of the National Academy of Sciences, 99(10):6979–6984, May 2002. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.092205999. URL http://www.pnas.org/cgi/doi/10.1073/pnas.092205999. → pages 3
[81] S. K. Maji, M. H. Perrin, M. R. Sawaya, S. Jessberger, K. Vadodaria, R. A. Rissman, P. S. Singru, K. P. R. Nilsson, R. Simon, D. Schubert, D. Eisenberg, J. Rivier, P. Sawchenko, W. Vale, and R. Riek. Functional Amyloids As Natural Storage of Peptide Hormones in Pituitary Secretory Granules. Science, 325(5938):328–332, July 2009. ISSN 0036-8075, 1095-9203. doi:10.1126/science.1173155. URL http://www.sciencemag.org/cgi/doi/10.1126/science.1173155. → pages 3
[82] L. Malinovska, S. Kroschwald, M. C. Munder, D. Richter, and S. Alberti. Molecular chaperones and stress-inducible protein-sorting factors coordinate the spatiotemporal distribution of protein aggregates. Molecular Biology of the Cell, 23(16):3041–3056, Aug. 2012.
ISSN 1059-1524. doi:10.1091/mbc.E12-03-0194. URL http://www.molbiolcell.org/cgi/doi/10.1091/mbc.E12-03-0194. → pages 2
[83] S. Maurer-Stroh, M. Debulpaep, N. Kuemmerer, M. Lopez de la Paz, I. C. Martins, J. Reumers, K. L. Morris, A. Copland, L. Serpell, L. Serrano, J. W. H. Schymkowitz, and F. Rousseau. Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nature Methods, 7(3):237–242, Mar. 2010. ISSN 1548-7105. doi:10.1038/nmeth.1432. → pages 4
[84] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, C.-C. Chang, and C.-C. Lin. Misc Functions of the Department of Statistics (e1071). Sept. 2014. → pages 13
[85] E. Mezey, A. Dehejia, G. Harta, M. Papp, M. Polymeropoulos, and M. Brownstein. Alpha synuclein in neurodegenerative disorders: Murderer or accomplice? Nature Medicine, 4(7):755–757, July 1998. ISSN 1078-8956. doi:10.1038/nm0798-755. URL http://www.nature.com/doifinder/10.1038/nm0798-755. → pages 3
[86] R. Narayanaswamy, M. Levy, M. Tsechansky, G. M. Stovall, J. D. O’Connell, J. Mirrielees, A. D. Ellington, and E. M. Marcotte. Widespread reorganization of metabolic enzymes into reversible assemblies upon nutrient starvation. Proceedings of the National Academy of Sciences of the United States of America, 106(25):10147–10152, June 2009. ISSN 1091-6490. doi:10.1073/pnas.0812771106. → pages 53
[87] A. H. M. Ng, N. N. Fang, S. A. Comyn, J. Gsponer, and T. Mayor. System-wide Analysis Reveals Intrinsically Disordered Proteins Are Prone to Ubiquitylation after Misfolding Stress. Molecular & Cellular Proteomics, 12(9):2456–2467, Sept. 2013. ISSN 1535-9476, 1535-9484. doi:10.1074/mcp.M112.023416. URL http://www.mcponline.org/cgi/doi/10.1074/mcp.M112.023416. → pages iii, 5, 11, 22, 32, 48
[88] J. D. O’Connell, M. Tsechansky, A. Royall, D. R. Boutz, A. D. Ellington, and E. M. Marcotte. A proteomic survey of widespread protein aggregation in yeast. Molecular bioSystems, 10(4):851–861, Apr. 2014. ISSN 1742-2051. doi:10.1039/c3mb70508k.
→ pages 5
[89] C. W. O’Donnell, J. Waldispühl, M. Lis, R. Halfmann, S. Devadas, S. Lindquist, and B. Berger. A method for probing the mutational landscape of amyloid structure. Bioinformatics (Oxford, England), 27(13):i34–42, July 2011. ISSN 1367-4811. doi:10.1093/bioinformatics/btr238. → pages 4
[90] C. J. Oldfield, J. Meng, J. Y. Yang, M. Q. Yang, V. N. Uversky, and A. K. Dunker. Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics, 9(Suppl 1):S1, 2008. ISSN 1471-2164. doi:10.1186/1471-2164-9-S1-S1. URL http://www.biomedcentral.com/1471-2164/9/S1/S1. → pages 48
[91] T. E. Oliphant. Python for Scientific Computing. Computing in Science & Engineering, 9(3):10–20, 2007. ISSN 1521-9615. doi:10.1109/MCSE.2007.58. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4160250. → pages 11
[92] H. Olzscha, S. M. Schermann, A. C. Woerner, S. Pinkert, M. H. Hecht, G. G. Tartaglia, M. Vendruscolo, M. Hayer-Hartl, F. U. Hartl, and R. M. Vabulas. Amyloid-like Aggregates Sequester Numerous Metastable Proteins with Essential Cellular Functions. Cell, 144(1):67–78, Jan. 2011. ISSN 00928674. doi:10.1016/j.cell.2010.11.050. URL http://linkinghub.elsevier.com/retrieve/pii/S0092867410013723. → pages 1, 21
[93] M. T. Pastor, A. Esteras-Chopo, and L. Serrano. Hacking the code of amyloid formation: the amyloid stretch hypothesis. Prion, 1(1):9–14, Mar. 2007. ISSN 1933-690X. → pages 4
[94] A. Pastore and P. A. Temussi. The two faces of Janus: functional interactions and protein aggregation. Current Opinion in Structural Biology, 22(1):30–37, Feb. 2012. ISSN 0959440X. doi:10.1016/j.sbi.2011.11.007. URL http://linkinghub.elsevier.com/retrieve/pii/S0959440X11002016. → pages 3, 4
[95] N. P. Pavletich, K. A. Chambers, and C. O. Pabo. The DNA-binding domain of p53 contains the four conserved regions and the major mutation hot spots. Genes & Development, 7(12B):2556–2564, Dec. 1993. ISSN 0890-9369. → pages 3
[96] S. Pechmann, E. D. Levy, G. G.
Tartaglia, and M. Vendruscolo.Physicochemical principles that regulate the competition betweenfunctional and dysfunctional association of proteins. Proceedings of theNational Academy of Sciences of the United States of America, 106(25):10159–10164, June 2009. ISSN 1091-6490.doi:10.1073/pnas.0812414106. → pages 4[97] S. Pu, J. Wong, B. Turner, E. Cho, and S. J. Wodak. Up-to-date cataloguesof yeast protein complexes. Nucleic Acids Research, 37(3):825–831, Feb.2009. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkn1005. URLhttp://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkn1005. → pages 10[98] P. Puigb, I. G. Bravo, and S. Garcia-Vallve. CAIcal: a combined set oftools to assess codon usage adaptation. Biology Direct, 3:38, 2008. ISSN1745-6150. doi:10.1186/1745-6150-3-38. → pages 1270[99] P. Puntervoll. ELM server: a new resource for investigating short functionalsites in modular eukaryotic proteins. Nucleic Acids Research, 31(13):3625–3630, July 2003. ISSN 1362-4962. doi:10.1093/nar/gkg545. URLhttp://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkg545. → pages 12[100] X. Qu, J.-D. Wen, L. Lancaster, H. F. Noller, C. Bustamante, and I. Tinoco.The ribosome uses two active mechanisms to unwind messenger RNAduring translation. Nature, 475(7354):118–121, July 2011. ISSN1476-4687. doi:10.1038/nature10126. → pages 37[101] H. Rampelt, J. Kirstein-Miles, N. B. Nillegoda, K. Chi, S. R. Scholz, R. I.Morimoto, and B. Bukau. Metazoan Hsp70 machines use Hsp110 to powerprotein disaggregation. The EMBO Journal, 31(21):4221–4235, Nov. 2012.doi:10.1038/emboj.2012.264. → pages 47[102] M. A. M. Reijns, R. D. Alexander, M. P. Spiller, and J. D. Beggs. A role forQ/N-rich aggregation-prone regions in P-body localization. Journal of CellScience, 121(15):2463–2472, Aug. 2008. ISSN 0021-9533, 1477-9137.doi:10.1242/jcs.024976. URLhttp://jcs.biologists.org/cgi/doi/10.1242/jcs.024976. → pages 49[103] P. Reis-Rodrigues, G. Czerwieniec, T. W. Peters, U. S. Evani, S. Alavez,E. A. Gaman, M. 
Vantipalli, S. D. Mooney, B. W. Gibson, G. J. Lithgow,and R. E. Hughes. Proteomic analysis of age-dependent changes in proteinsolubility identifies genes that modulate lifespan: Aging, protein solubilityand lifespan in C. elegans. Aging Cell, 11(1):120–127, Feb. 2012. ISSN14749718. doi:10.1111/j.1474-9726.2011.00765.x. URLhttp://doi.wiley.com/10.1111/j.1474-9726.2011.00765.x. → pages 21[104] P.-H. Ren, J. E. Lauckner, I. Kachirskaia, J. E. Heuser, R. Melki, and R. R.Kopito. Cytoplasmic penetration and persistent infection of mammaliancells by polyglutamine aggregates. Nature Cell Biology, 11(2):219–225,Feb. 2009. ISSN 1465-7392, 1476-4679. doi:10.1038/ncb1830. URLhttp://www.nature.com/doifinder/10.1038/ncb1830. → pages 3[105] J. Reumers, S. Maurer-Stroh, J. Schymkowitz, and F. Rousseau. Proteinsequences encode safeguards against aggregation. Human Mutation, 30(3):431–437, Mar. 2009. ISSN 1098-1004. doi:10.1002/humu.20905. →pages 22[106] C. A. Ross and M. A. Poirier. Protein aggregation and neurodegenerativedisease. Nature Medicine, 10(7):S10–S17, July 2004. ISSN 1078-8956.71doi:10.1038/nm1066. URLhttp://www.nature.com/doifinder/10.1038/nm1066. → pages 3[107] F. Rousseau, L. Serrano, and J. W. H. Schymkowitz. How evolutionarypressure against protein aggregation shaped chaperone specificity. Journalof Molecular Biology, 355(5):1037–1047, Feb. 2006. ISSN 0022-2836.doi:10.1016/j.jmb.2005.11.035. → pages 22[108] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone,M. Stransky, B. Waegele, T. Schmidt, O. N. Doudieu, V. Stumpflen, andH. W. Mewes. CORUM: the comprehensive resource of mammalianprotein complexes. Nucleic Acids Research, 36(Database):D646–D650,Dec. 2007. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkm936. URLhttp://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkm936. → pages 10[109] I. Sadowski, B.-J. Breitkreutz, C. Stark, T.-C. Su, M. Dahabieh,S. Raithatha, W. Bernhard, R. Oughtred, K. Dolinski, K. Barreto, andM. Tyers. 
The PhosphoGRID Saccharomyces cerevisiae proteinphosphorylation site database: version 2.0 update. Database, 2013(0):bat026–bat026, May 2013. ISSN 1758-0463.doi:10.1093/database/bat026. URLhttp://database.oxfordjournals.org/cgi/doi/10.1093/database/bat026. →pages 12[110] A. M. Salazar, E. J. Silverman, K. P. Menon, and K. Zinn. Regulation ofsynaptic Pumilio function by an aggregation-prone domain. The Journal ofNeuroscience: The Official Journal of the Society for Neuroscience, 30(2):515–522, Jan. 2010. ISSN 1529-2401.doi:10.1523/JNEUROSCI.2523-09.2010. → pages 4[111] T. R. Serio and S. L. Lindquist. Protein-only inheritance in yeast:something to get [PSI+]-ched about. Trends in Cell Biology, 10(3):98–105,Mar. 2000. ISSN 0962-8924. → pages 4[112] M. Y. Sherman and S.-B. Qian. Less is more: improving proteostasis bytranslation slow down. Trends in Biochemical Sciences, 38(12):585–591,Dec. 2013. ISSN 0968-0004. doi:10.1016/j.tibs.2013.09.003. → pages 37[113] A. Shevchenko, M. Wilm, O. Vorm, O. N. Jensen, A. V. Podtelejnikov,G. Neubauer, A. Shevchenko, P. Mortensen, and M. Mann. A strategy foridentifying gel-separated proteins in sequence databases by MS alone.Biochemical Society Transactions, 24(3):893–896, Aug. 1996. ISSN0300-5127. → pages 872[114] E. M. Sontag, W. I. Vonk, and J. Frydman. Sorting out the trash: the spatialnature of eukaryotic protein quality control. Current Opinion in CellBiology, 26:139–146, Feb. 2014. ISSN 09550674.doi:10.1016/j.ceb.2013.12.006. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0955067413001932. → pages 2[115] R. Spokoini, O. Moldavski, Y. Nahmias, J. England, M. Schuldiner, andD. Kaganovich. Confinement to Organelle-Associated Inclusion StructuresMediates Asymmetric Inheritance of Aggregated Protein in Budding Yeast.Cell Reports, 2(4):738–747, Oct. 2012. ISSN 22111247.doi:10.1016/j.celrep.2012.08.024. URLhttp://linkinghub.elsevier.com/retrieve/pii/S2211124712002641. → pages2, 47[116] F. Sun, V. Anantharam, D. Zhang, C. 
Latchoumycandane, A. Kanthasamy,and A. G. Kanthasamy. Proteasome inhibitor MG-132 inducesdopaminergic degeneration in cell culture and animal models.NeuroToxicology, 27(5):807–815, Sept. 2006. ISSN 0161813X.doi:10.1016/j.neuro.2006.06.006. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0161813X06001689. → pages 2[117] The UniProt Consortium. Activities at the Universal Protein Resource(UniProt). Nucleic Acids Research, 42(D1):D191–D198, Jan. 2014. ISSN0305-1048, 1362-4962. doi:10.1093/nar/gkt1140. URLhttp://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkt1140. → pages 9,12[118] J. Tian, N. Wu, J. Guo, and Y. Fan. Prediction of amyloid fibril-formingsegments based on a support vector machine. BMC bioinformatics, 10Suppl 1:S45, 2009. ISSN 1471-2105. doi:10.1186/1471-2105-10-S1-S45.→ pages 4[119] A. C. Tsolis, N. C. Papandreou, V. A. Iconomidou, and S. J. Hamodrakas.A consensus method for the prediction of ’aggregation-prone’ peptides inglobular proteins. PloS One, 8(1):e54175, 2013. ISSN 1932-6203.doi:10.1371/journal.pone.0054175. → pages 4[120] N. D. Udeshi, P. Mertins, T. Svinkina, and S. A. Carr. Large-scaleidentification of ubiquitination sites by mass spectrometry. NatureProtocols, 8(10):1950–1960, Sept. 2013. ISSN 1754-2189, 1750-2799.doi:10.1038/nprot.2013.120. URLhttp://www.nature.com/doifinder/10.1038/nprot.2013.120. → pages 873[121] R. van der Lee, M. Buljan, B. Lang, R. J. Weatheritt, G. W. Daughdrill,A. K. Dunker, M. Fuxreiter, J. Gough, J. Gsponer, D. T. Jones, P. M. Kim,R. W. Kriwacki, C. J. Oldfield, R. V. Pappu, P. Tompa, V. N. Uversky, P. E.Wright, and M. M. Babu. Classification of Intrinsically DisorderedRegions and Proteins. Chemical Reviews, 114(13):6589–6631, July 2014.ISSN 0009-2665, 1520-6890. doi:10.1021/cr400525m. URLhttp://pubs.acs.org/doi/abs/10.1021/cr400525m. → pages 32[122] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: AStructure for Efficient Numerical Computation. 
Computing in Science &Engineering, 13(2):22–30, Mar. 2011. ISSN 1521-9615.doi:10.1109/MCSE.2011.37. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5725236.→ pages 11[123] S. Ventura, J. Zurdo, S. Narayanan, M. Parreno, R. Mangues, B. Reif,F. Chiti, E. Giannoni, C. M. Dobson, F. X. Aviles, and L. Serrano. Shortamino acid stretches can mediate amyloid formation in globular proteins:The Src homology 3 (SH3) case. Proceedings of the National Academy ofSciences, 101(19):7258–7263, May 2004. ISSN 0027-8424, 1091-6490.doi:10.1073/pnas.0308249101. URLhttp://www.pnas.org/cgi/doi/10.1073/pnas.0308249101. → pages 3[124] D. Vilchez, I. Saez, and A. Dillin. The role of protein clearancemechanisms in organismal ageing and age-related diseases. NatureCommunications, 5:5659, 2014. ISSN 2041-1723.doi:10.1038/ncomms6659. → pages 2[125] B. S. Wang, R. A. Grant, and C. O. Pabo. Selected peptide extensioncontacts hydrophobic patch on neighboring zinc finger and mediatesdimerization on DNA. Nature Structural Biology, 8(7):589–593, July 2001.ISSN 1072-8368. doi:10.1038/89617. → pages 29[126] J. Ward, J. Sodhi, L. McGuffin, B. Buxton, and D. Jones. Prediction andFunctional Analysis of Native Disorder in Proteins from the ThreeKingdoms of Life. Journal of Molecular Biology, 337(3):635–645, Mar.2004. ISSN 00222836. doi:10.1016/j.jmb.2004.02.002. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0022283604001482. → pages12[127] S. Weber and C. Brangwynne. Getting RNA and Protein in Phase. Cell,149(6):1188–1191, June 2012. ISSN 00928674.74doi:10.1016/j.cell.2012.05.022. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0092867412006344. → pages49[128] I. B. Wilde, M. Brack, J. M. Winget, and T. Mayor. ProteomicCharacterization of Aggregating Proteins after the Inhibition of theUbiquitin Proteasome System. Journal of Proteome Research, 10(3):1062–1072, Mar. 2011. ISSN 1535-3893, 1535-3907.doi:10.1021/pr1008543. URLhttp://pubs.acs.org/doi/abs/10.1021/pr1008543. 
→ pages 5, 21[129] R. L. Woltjer. Proteomic determination of widespread detergentinsolubility, including A but not tau, early in the pathogenesis ofAlzheimer’s disease. The FASEB Journal, Sept. 2005. ISSN 0892-6638,1530-6860. doi:10.1096/fj.05-4263fje. URLhttp://www.fasebj.org/cgi/doi/10.1096/fj.05-4263fje. → pages 3[130] C. W. Wong, V. Quaranta, and G. G. Glenner. Neuritic plaques andcerebrovascular amyloid in Alzheimer disease are antigenically related.Proceedings of the National Academy of Sciences of the United States ofAmerica, 82(24):8729–8732, Dec. 1985. ISSN 0027-8424. → pages 3[131] G. Xu, S. M. Stevens, F. Kobiessy, H. Brown, S. McClung, M. S. Gold, andD. R. Borchelt. Identification of Proteins Sensitive to Thermal Stress inHuman Neuroblastoma and Glioma Cell Lines. PLoS ONE, 7(11):e49021,Nov. 2012. ISSN 1932-6203. doi:10.1371/journal.pone.0049021. URLhttp://dx.plos.org/10.1371/journal.pone.0049021. → pages 5[132] F. Yang, Y. Shen, D. G. Camp, and R. D. Smith. High-pH reversed-phasechromatography with fraction concatenation for 2d proteomic analysis.Expert Review of Proteomics, 9(2):129–134, Apr. 2012. ISSN 1478-9450,1744-8387. doi:10.1586/epr.12.15. URLhttp://informahealthcare.com/doi/abs/10.1586/epr.12.15. → pages 8[133] Y. Yoshimura, Y. Lin, H. Yagi, Y.-H. Lee, H. Kitayama, K. Sakurai, M. So,H. Ogi, H. Naiki, and Y. Goto. Distinguishing crystal-like amyloid fibrilsand glass-like amorphous aggregates from their kinetics of formation.Proceedings of the National Academy of Sciences, 109(36):14446–14451,Sept. 2012. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1208228109.URL http://www.pnas.org/cgi/doi/10.1073/pnas.1208228109. → pages 2[134] C. Zhou, B. Slaughter, J. Unruh, A. Eldakak, B. Rubinstein, and R. Li.Motility and Segregation of Hsp104-Associated Protein Aggregates in75Budding Yeast. Cell, 147(5):1186–1196, Nov. 2011. ISSN 00928674.doi:10.1016/j.cell.2011.11.002. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0092867411012918. → pages 2[135] S. 
Appendix A

Supporting Materials

Table A.1: GO analysis (biological processes) for yeast LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0016072 | rRNA metabolic process | 35 | 248 | 1.40E-19 |
| GO:0006364 | rRNA processing | 34 | 239 | 5.70E-19 |
| GO:0042254 | Ribosome biogenesis | 39 | 351 | 1.20E-18 |
| GO:0022613 | Ribonucleoprotein complex biogenesis | 39 | 397 | 1.00E-16 |
| GO:0034470 | ncRNA processing | 35 | 335 | 2.70E-15 |
| GO:0034660 | ncRNA metabolic process | 36 | 393 | 6.00E-14 |
| GO:0000462 | Maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 19 | 83 | 1.10E-12 |
| GO:0030490 | Maturation of SSU-rRNA | 19 | 85 | 1.70E-12 |
| GO:0006396 | RNA processing | 36 | 515 | 2.50E-10 |
| GO:0000480 | Endonucleolytic cleavage in 5’-ETS of tricistronic rRNA transcript (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 9 | 26 | 5.30E-06 |
| GO:0000447 | Endonucleolytic cleavage in ITS1 to separate SSU-rRNA from 5.8S rRNA and LSU-rRNA from tricistronic rRNA transcript (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 10 | 38 | 7.80E-06 |
| GO:0000472 | Endonucleolytic cleavage to generate mature 5’-end of SSU-rRNA from (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 9 | 28 | 1.00E-05 |
| GO:0000479 | Endonucleolytic cleavage of tricistronic rRNA transcript (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 10 | 40 | 1.30E-05 |
| GO:0000478 | Endonucleolytic cleavages during rRNA processing | 10 | 40 | 1.30E-05 |
| GO:0000967 | rRNA 5’-end processing | 9 | 29 | 1.40E-05 |
| GO:0034471 | ncRNA 5’-end processing | 9 | 29 | 1.40E-05 |
| GO:0000460 | Maturation of 5.8S rRNA | 12 | 69 | 1.50E-05 |
| GO:0000466 | Maturation of 5.8S rRNA from tricistronic rRNA transcript (SSU-rRNA; 5.8S rRNA; LSU-rRNA) | 12 | 69 | 1.50E-05 |
| GO:0000966 | RNA 5’-end processing | 9 | 30 | 1.90E-05 |
| GO:0000469 | Cleavages during rRNA processing | 10 | 59 | 4.40E-04 |
| GO:0045943 | Positive regulation of transcription from RNA polymerase I promoter | 6 | 12 | 6.70E-04 |
| GO:0006356 | Regulation of transcription from RNA polymerase I promoter | 6 | 18 | 6.60E-03 |

Table A.2: GO analysis (molecular function) for yeast LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0030515 | snoRNA binding | 10 | 19 | 1.10E-09 |
| GO:0003723 | RNA binding | 25 | 504 | 3.00E-04 |
| GO:0004386 | Helicase activity | 11 | 106 | 1.80E-03 |
| GO:0003724 | RNA helicase activity | 7 | 42 | 1.10E-02 |
| GO:0070035 | Purine NTP-dependent helicase activity | 9 | 84 | 1.30E-02 |
| GO:0008026 | ATP-dependent helicase activity | 9 | 84 | 1.30E-02 |

Table A.3: GO analysis (biological processes) for human LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0008380 | RNA splicing | 20 | 284 | 4.40E-09 |
| GO:0016071 | mRNA metabolic process | 22 | 370 | 7.00E-09 |
| GO:0006397 | mRNA processing | 20 | 321 | 3.70E-08 |
| GO:0000398 | Nuclear mRNA splicing; via spliceosome | 15 | 153 | 5.10E-08 |
| GO:0000377 | RNA splicing; via transesterification reactions with bulged adenosine as nucleophile | 15 | 153 | 5.10E-08 |
| GO:0000375 | RNA splicing; via transesterification reactions | 15 | 153 | 5.10E-08 |
| GO:0006333 | Chromatin assembly or disassembly | 14 | 127 | 6.20E-08 |
| GO:0006325 | Chromatin organization | 21 | 378 | 8.10E-08 |
| GO:0051276 | Chromosome organization | 22 | 485 | 1.00E-06 |
| GO:0006396 | RNA processing | 23 | 547 | 1.60E-06 |
| GO:0034621 | Cellular macromolecular complex subunit organization | 17 | 357 | 7.10E-05 |
| GO:0045449 | Regulation of transcription | 47 | 2601 | 5.60E-04 |
| GO:0034622 | Cellular macromolecular complex assembly | 15 | 318 | 5.80E-04 |
| GO:0007049 | Cell cycle | 23 | 776 | 7.60E-04 |
| GO:0034728 | Nucleosome organization | 9 | 93 | 1.10E-03 |
| GO:0006350 | Transcription | 40 | 2101 | 1.80E-03 |
| GO:0006974 | Response to DNA damage stimulus | 15 | 373 | 3.80E-03 |
| GO:0043933 | Macromolecular complex subunit organization | 20 | 710 | 1.00E-02 |
| GO:0007017 | Microtubule-based process | 12 | 253 | 1.00E-02 |
| GO:0065003 | Macromolecular complex assembly | 19 | 665 | 1.50E-02 |
| GO:0033554 | Cellular response to stress | 17 | 566 | 2.70E-02 |
| GO:0006259 | DNA metabolic process | 16 | 506 | 2.80E-02 |
| GO:0006281 | DNA repair | 12 | 284 | 3.00E-02 |

Table A.4: GO analysis (molecular function) for human LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0003677 | DNA binding | 54 | 2331 | 1.20E-08 |
| GO:0003723 | RNA binding | 29 | 718 | 1.60E-08 |
| GO:0003682 | Chromatin binding | 11 | 150 | 2.50E-04 |

Table A.5: GO analysis (biological processes) for mouse LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0006412 | Translation | 64 | 319 | 2.40E-36 |
| GO:0008380 | RNA splicing | 51 | 201 | 2.40E-33 |
| GO:0016071 | mRNA metabolic process | 58 | 302 | 2.00E-31 |
| GO:0006397 | mRNA processing | 53 | 262 | 1.60E-29 |
| GO:0006396 | RNA processing | 58 | 437 | 1.00E-22 |
| GO:0022900 | Electron transport chain | 26 | 112 | 2.50E-14 |
| GO:0006091 | Generation of precursor metabolites and energy | 33 | 261 | 6.70E-11 |
| GO:0007010 | Cytoskeleton organization | 31 | 326 | 6.30E-07 |
| GO:0000377 | RNA splicing; via transesterification reactions with bulged adenosine as nucleophile | 12 | 37 | 1.00E-06 |
| GO:0000375 | RNA splicing; via transesterification reactions | 12 | 37 | 1.00E-06 |
| GO:0000398 | Nuclear mRNA splicing; via spliceosome | 12 | 37 | 1.00E-06 |
| GO:0030029 | Actin filament-based process | 18 | 176 | 2.00E-03 |
| GO:0006403 | RNA localization | 11 | 67 | 6.50E-03 |
| GO:0043244 | Regulation of protein complex disassembly | 9 | 43 | 1.10E-02 |
| GO:0043242 | Negative regulation of protein complex disassembly | 8 | 35 | 2.40E-02 |
| GO:0051236 | Establishment of RNA localization | 10 | 66 | 4.00E-02 |
| GO:0050657 | Nucleic acid transport | 10 | 66 | 4.00E-02 |
| GO:0050658 | RNA transport | 10 | 66 | 4.00E-02 |
| GO:0022613 | Ribonucleoprotein complex biogenesis | 14 | 137 | 4.10E-02 |

Table A.6: GO analysis (molecular function) for mouse LS proteins

| GO annotation | GO term | No. in LS | No. in category | p-value |
|---|---|---|---|---|
| GO:0003735 | Structural constituent of ribosome | 59 | 151 | 1.30E-50 |
| GO:0005198 | Structural molecule activity | 82 | 450 | 2.80E-43 |
| GO:0003723 | RNA binding | 97 | 672 | 5.70E-43 |
| GO:0003779 | Actin binding | 37 | 288 | 1.40E-12 |
| GO:0008092 | Cytoskeletal protein binding | 42 | 414 | 4.80E-11 |
| GO:0008137 | NADH dehydrogenase (ubiquinone) activity | 11 | 24 | 5.60E-08 |
| GO:0003954 | NADH dehydrogenase activity | 11 | 24 | 5.60E-08 |
| GO:0050136 | NADH dehydrogenase (quinone) activity | 11 | 24 | 5.60E-08 |
| GO:0016655 | Oxidoreductase activity; acting on NADH or NADPH; quinone or similar compound as acceptor | 11 | 27 | 2.20E-07 |
| GO:0003729 | mRNA binding | 14 | 54 | 2.70E-07 |
| GO:0016651 | Oxidoreductase activity; acting on NADH or NADPH | 12 | 51 | 1.90E-05 |
| GO:0000166 | Nucleotide binding | 92 | 2183 | 5.50E-04 |
| GO:0019843 | rRNA binding | 8 | 24 | 5.60E-04 |
| GO:0003697 | Single-stranded DNA binding | 8 | 33 | 5.60E-03 |
| GO:0005516 | Calmodulin binding | 13 | 114 | 1.30E-02 |
| GO:0015078 | Hydrogen ion transmembrane transporter activity | 11 | 82 | 1.60E-02 |
| GO:0015077 | Monovalent inorganic cation transmembrane transporter activity | 11 | 87 | 2.60E-02 |

Table A.7: Analysis for enrichment of Pfam domains for yeast LS proteins

| Domain | In LS | Total LS | In NS | Total NS | p-value | Significant |
|---|---|---|---|---|---|---|
| WD40 | 11 | 96 | 23 | 1095 | 3.19E-05 | No |
| Histone | 4 | 96 | 1 | 1095 | 1.87E-04 | No |
| AAA 12 | 3 | 96 | 0 | 1095 | 5.09E-04 | No |
| AAA 11 | 3 | 96 | 0 | 1095 | 5.09E-04 | No |
| Glyco hydro 72 | 3 | 96 | 0 | 1095 | 5.09E-04 | No |
| Utp12 | 3 | 96 | 0 | 1095 | 5.09E-04 | No |
| DEAD | 6 | 96 | 16 | 1095 | 6.06E-03 | No |
| NOP5NT | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| NOSIC | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| Nop | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| XPG I | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| XPG N | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| Myosin TH1 | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| Tubulin | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| Tubulin C | 2 | 96 | 0 | 1095 | 6.43E-03 | No |
| PH | 2 | 96 | 1 | 1095 | 1.83E-02 | No |
| HA2 | 2 | 96 | 1 | 1095 | 1.83E-02 | No |
| OB NTP bind | 2 | 96 | 1 | 1095 | 1.83E-02 | No |
| Helicase C | 6 | 96 | 24 | 1095 | 2.84E-02 | No |
| Myosin head | 2 | 96 | 2 | 1095 | 3.47E-02 | No |
| PI3 PI4 kinase | 2 | 96 | 2 | 1095 | 3.47E-02 | No |

Table A.8: Analysis for enrichment of Pfam domains for human LS proteins

| Domain | In LS | Total LS | In NS | Total NS | p-value | Significant |
|---|---|---|---|---|---|---|
| RRM 1 | 18 | 167 | 20 | 1190 | 6.52E-08 | Yes |
| Septin | 7 | 170 | 0 | 1201 | 4.04E-07 | Yes |
| Filament | 6 | 170 | 0 | 1201 | 3.36E-06 | Yes |
| Bromodomain | 7 | 170 | 2 | 1201 | 1.17E-05 | No |
| Histone | 5 | 169 | 0 | 1201 | 2.71E-05 | No |
| HMG box | 5 | 168 | 2 | 1201 | 4.50E-04 | No |
| PHD | 5 | 168 | 2 | 1200 | 4.51E-04 | No |
| HMG box 2 | 4 | 170 | 1 | 1201 | 1.03E-03 | No |
| AAA 33 | 3 | 169 | 0 | 1199 | 1.86E-03 | No |
| WHIM3 | 3 | 170 | 0 | 1201 | 1.88E-03 | No |
| Homeobox | 3 | 169 | 1 | 1201 | 6.72E-03 | No |
| SAP | 3 | 167 | 2 | 1199 | 1.49E-02 | No |
| LTD | 2 | 169 | 0 | 1201 | 1.51E-02 | No |
| WHIM1 | 2 | 169 | 0 | 1201 | 1.51E-02 | No |
| Filament head | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| CUT | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| BAR | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| EFhand Ca insen | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| Macro | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| EF1 GNE | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| EF-1 beta acid | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| GATA | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| DLIC | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| zf-PARP | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| Rtt106 | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| NOPS | 2 | 170 | 0 | 1201 | 1.53E-02 | No |
| RRM 6 | 5 | 170 | 9 | 1201 | 2.17E-02 | No |
| Chromo | 2 | 169 | 1 | 1201 | 4.17E-02 | No |
| SWIRM | 2 | 170 | 1 | 1201 | 4.21E-02 | No |
| ANTH | 2 | 170 | 1 | 1201 | 4.21E-02 | No |
| 2-oxoacid dh | 2 | 170 | 1 | 1201 | 4.21E-02 | No |
| I LWEQ | 2 | 170 | 1 | 1201 | 4.21E-02 | No |
| zf-C2H2 4 | 2 | 170 | 1 | 1200 | 4.22E-02 | No |

Table A.9: Analysis for enrichment of Pfam domains for mouse LS proteins

| Domain | In LS | Total LS | In NS | Total NS | p-value | Significant |
|---|---|---|---|---|---|---|
| RRM 1 | 50 | 525 | 5 | 1255 | 4.00E-22 | Yes |
| Spectrin | 12 | 530 | 0 | 1255 | 4.30E-07 | Yes |
| Filament | 10 | 530 | 0 | 1255 | 5.01E-06 | Yes |
| Myosin head | 8 | 530 | 0 | 1255 | 5.82E-05 | No |
| PDZ | 17 | 521 | 10 | 1251 | 3.51E-04 | No |
| RRM 6 | 8 | 527 | 1 | 1255 | 3.75E-04 | No |
| Guanylate kin | 9 | 528 | 2 | 1253 | 5.03E-04 | No |
| EFhand Ca insen | 6 | 530 | 0 | 1255 | 6.72E-04 | No |
| Filament head | 6 | 530 | 0 | 1255 | 6.72E-04 | No |
| Ras | 1 | 530 | 25 | 1254 | 1.87E-03 | No |
| RRM 5 | 5 | 527 | 0 | 1255 | 2.23E-03 | No |
| SH3 2 | 10 | 528 | 5 | 1253 | 3.19E-03 | No |
| EF-hand 6 | 6 | 530 | 1 | 1254 | 3.53E-03 | No |
| C2 | 12 | 530 | 7 | 1254 | 3.63E-03 | No |
| SAM 1 | 7 | 529 | 2 | 1255 | 3.90E-03 | No |
| I-set | 7 | 530 | 2 | 1255 | 3.93E-03 | No |
| Band 7 | 4 | 529 | 0 | 1255 | 7.67E-03 | No |
| Histone | 4 | 529 | 0 | 1255 | 7.67E-03 | No |
| Myosin tail 1 | 4 | 529 | 0 | 1255 | 7.67E-03 | No |
| Sorb | 4 | 530 | 0 | 1255 | 7.71E-03 | No |
| Linker histone | 4 | 530 | 0 | 1255 | 7.71E-03 | No |
| DUF1899 | 4 | 530 | 0 | 1255 | 7.71E-03 | No |
| NAC | 4 | 530 | 0 | 1255 | 7.71E-03 | No |
| VGCC beta4Aa N | 4 | 530 | 0 | 1255 | 7.71E-03 | No |
| SAP | 5 | 529 | 1 | 1255 | 1.02E-02 | No |
| TPR 11 | 0 | 530 | 13 | 1254 | 1.37E-02 | No |
| Proteasome | 0 | 530 | 14 | 1255 | 1.45E-02 | No |
| GKAP | 3 | 529 | 0 | 1255 | 2.60E-02 | No |
| dsrm | 3 | 529 | 0 | 1255 | 2.60E-02 | No |
| CaMKII AD | 3 | 529 | 0 | 1255 | 2.60E-02 | No |
| Ribosomal L7Ae | 3 | 529 | 0 | 1255 | 2.60E-02 | No |
| Sec7 | 3 | 529 | 0 | 1255 | 2.60E-02 | No |
| Collagen | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| CNH | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| C1q | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| BAG | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| DUF2051 | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| LTD | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| Myosin N | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| Fox-1 C | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| Cast | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| PurA | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| Agenet | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| Tropomyosin | 3 | 530 | 0 | 1255 | 2.61E-02 | No |
| PH 9 | 4 | 528 | 1 | 1255 | 2.92E-02 | No |
| IQ | 4 | 530 | 1 | 1255 | 2.94E-02 | No |
| DUF1900 | 4 | 530 | 1 | 1255 | 2.94E-02 | No |
| SH3 1 | 10 | 529 | 8 | 1251 | 3.38E-02 | No |
| TPR 1 | 0 | 530 | 10 | 1252 | 3.89E-02 | No |
| Aldedh | 0 | 530 | 10 | 1255 | 3.90E-02 | No |

Table A.10: Table of p-values for feature analysis of yeast proteins

| Analysis | LS vs NS | HS vs NS | LS vs HS |
|---|---|---|---|
| Percent G | 2.44E-04 | 5.38E-02 | 1.49E-01 |
| Percent A | 7.27E-04 | 2.35E-02 | 3.20E-05 |
| Percent V | 2.12E-03 | 2.29E-04 | 9.70E-01 |
| Percent L | 4.89E-01 | 3.26E-06 | 3.21E-03 |
| Percent I | 5.10E-01 | 5.30E-10 | 3.23E-03 |
| Percent P | 3.20E-02 | 1.90E-02 | 3.09E-03 |
| Percent F | 3.92E-01 | 8.62E-03 | 3.54E-01 |
| Percent Y | 9.15E-01 | 8.93E-05 | 1.82E-02 |
| Percent W | 8.45E-01 | 4.78E-03 | 1.34E-01 |
| Percent H | 4.09E-01 | 3.21E-03 | 2.46E-01 |
| Percent M | 2.15E-01 | 7.91E-01 | 3.08E-01 |
| Percent C | 1.88E-01 | 3.57E-06 | 1.17E-04 |
| Percent S | 7.42E-10 | 1.71E-01 | 3.54E-08 |
| Percent T | 3.04E-01 | 1.67E-03 | 1.03E-02 |
| Percent K | 6.61E-01 | 6.82E-05 | 1.86E-02 |
| Percent D | 5.97E-01 | 2.09E-04 | 1.59E-02 |
| Percent E | 1.11E-01 | 7.46E-09 | 8.35E-06 |
| Percent N | 1.09E-04 | 7.17E-01 | 4.68E-03 |
| Percent Q | 9.07E-02 | 4.02E-02 | 1.21E-02 |
| Percent R | 4.52E-01 | 2.38E-03 | 4.36E-02 |
| ER likelihood | 2.12E-01 | 3.03E-08 | 9.79E-05 |
| Golgi likelihood | 4.27E-02 | 6.83E-12 | 8.52E-08 |
| Vacuole likelihood | 1.07E-05 | 6.12E-06 | 1.82E-11 |
| Membrane likelihood | 3.41E-03 | 1.20E-10 | 1.67E-07 |
| Secretory likelihood | 8.41E-03 | 1.14E-01 | 1.41E-03 |
| Cytosol likelihood | 6.67E-04 | 4.00E-03 | 2.76E-01 |
| Peroxisome likelihood | 1.94E-02 | 1.12E-04 | 7.75E-01 |
| Mitochondria likelihood | 8.00E-04 | 6.88E-02 | 1.08E-01 |
| Nucleus likelihood | 5.00E-04 | 1.54E-03 | 4.05E-01 |
| Length | 7.37E-06 | 8.58E-23 | 5.99E-17 |
| Disorder prediction (DISOPRED) | 2.16E-04 | 4.11E-07 | 7.39E-01 |
| Disorder prediction (IUPRED) | 2.40E-03 | 3.10E-08 | 3.11E-01 |
| Number of LCRs | 4.30E-12 | 4.40E-01 | 1.31E-10 |
| Number of MoRFs (ANCHOR) | 4.24E-04 | 1.58E-02 | 9.16E-02 |
| Number of ELMs | 1.36E-08 | 3.58E-21 | 2.94E-18 |
| Number of LCRs per unit length | 2.66E-07 | 2.09E-02 | 4.69E-02 |
| Number of MoRFs (ANCHOR) per unit length | 1.28E-02 | 2.27E-08 | 7.56E-02 |
| Number of ELMs per unit length | 1.03E-07 | 2.88E-02 | 9.70E-07 |
| Percent helix | 7.61E-02 | 3.91E-01 | 3.29E-01 |
| Percent sheet | 9.57E-02 | 1.29E-02 | 5.65E-01 |
| Percent coil | 1.50E-03 | 1.66E-01 | 1.37E-01 |
| Percent polar | 2.03E-06 | 1.63E-02 | 1.31E-06 |
| Percent hydrophobic | 4.08E-06 | 1.50E-03 | 1.49E-01 |
| Percent positive | 4.12E-01 | 2.72E-01 | 8.75E-01 |
| Percent negative | 1.06E-01 | 2.46E-10 | 3.78E-06 |
| Abundance | 3.01E-10 | 4.49E-04 | 5.41E-14 |
| Number of disulfide bonds | 1.40E-04 | 2.31E-18 | 3.96E-15 |
| Number of phosphorylation sites | 1.02E-01 | 6.64E-01 | 1.09E-01 |
| Number of disordered regions per unit length | 2.03E-02 | 5.43E-08 | 7.78E-02 |
| Number of disordered regions | 4.47E-07 | 2.22E-02 | 6.90E-09 |
| Number of coiled coil regions | 6.63E-01 | 1.26E-01 | 5.74E-01 |
| Percent coiled coil | 7.55E-01 | 8.03E-02 | 4.57E-01 |
| Number of MoRFs (ANCHOR) per percent disorder | 2.86E-02 | 4.80E-13 | 6.70E-03 |
| Number of ELMs per percent disorder | 7.66E-01 | 5.23E-23 | 7.35E-11 |
| Number of LCRs per percent disorder | 5.33E-09 | 7.11E-04 | 3.89E-14 |
| Number of MoRFs per percent disorder | 1.68E-02 | 9.83E-01 | 2.37E-02 |
| Number of MoRFs per disordered patch | 8.37E-01 | 1.01E-01 | 4.29E-01 |
| Codon Adaptation Index (CAI) | 6.59E-01 | 1.30E-02 | 2.83E-01 |
| Number of transmembrane helices | 2.68E-01 | 6.51E-01 | 3.54E-01 |
| Hydrophobicity (GRAVY index) | 7.55E-02 | 4.96E-12 | 8.15E-03 |
| Number of codons | 7.76E-06 | 6.52E-16 | 1.89E-14 |
| Percent GC content | 2.09E-04 | 6.49E-05 | 1.15E-07 |
| Number of close stop codons | 9.82E-01 | 1.72E-07 | 1.40E-03 |
| Percent rare amino acids (C W H M) | 6.94E-01 | 1.66E-04 | 5.97E-02 |
| Number of MoRF residues (ANCHOR) | 3.35E-04 | 2.32E-02 | 6.14E-02 |
| Number of MoRF residues (ANCHOR) per unit length | 4.54E-03 | 2.00E-06 | 4.59E-01 |
| Number of MoRFs (ANCHOR) per disordered patch | 5.60E-03 | 2.37E-04 | 9.20E-01 |
| Percent aromatic residues (F Y W) | 8.22E-01 | 8.88E-06 | 1.49E-02 |
| Number of aromatic patches | 3.01E-04 | 2.44E-06 | 2.77E-09 |
| Number of hydrophobic patches | 1.21E-03 | 9.13E-09 | 1.50E-08 |
| Number of negatively charged patches | 3.55E-03 | 2.61E-04 | 3.35E-06 |
| Number of positively charged patches | 5.69E-01 | 2.74E-08 | 9.87E-05 |
| Number of polar patches (Q N) | 2.76E-03 | 2.22E-03 | 2.35E-05 |
| Number of polar patches (S T) | 4.88E-02 | 6.06E-03 | 7.83E-04 |
| Number of aromatic patches per unit length | 8.09E-01 | 1.91E-01 | 6.44E-01 |
| Number of hydrophobic patches per unit length | 4.97E-01 | 7.56E-02 | 2.01E-01 |
| Number of negatively charged patches per unit length | 7.60E-01 | 8.62E-05 | 1.71E-02 |
| Number of positively charged patches per unit length | 3.38E-02 | 6.44E-01 | 1.63E-01 |
| Number of polar patches (Q N) per unit length | 7.80E-01 | 6.18E-03 | 1.33E-01 |
| Number of polar patches (S T) per unit length | 4.25E-02 | 1.94E-05 | 1.06E-04 |
| Net charge per protein | 3.95E-04 | 6.31E-08 | 2.62E-07 |
| Net charge squared per protein | 2.02E-09 | 5.26E-05 | 1.63E-12 |
| Net charge per residue | 2.21E-02 | 1.10E-05 | 3.32E-05 |
| Net charge per residue squared | 9.49E-04 | 7.26E-02 | 7.29E-02 |

Table A.11: Table of p-values for feature analysis of human proteins

| Analysis | LS vs NS | HS vs NS | LS vs HS |
|---|---|---|---|
| Percent G | 1.64E-01 | 3.83E-04 | 6.70E-01 |
| Percent A | 8.17E-01 | 9.18E-05 | 5.22E-02 |
| Percent V | 1.36E-03 | 7.57E-08 | 8.90E-09 |
| Percent L | 1.27E-03 | 3.67E-03 | 2.75E-06 |
| Percent I | 8.96E-09 | 5.01E-03 | 3.59E-12 |
| Percent P | 2.59E-01 | 6.41E-06 | 3.10E-04 |
| Percent F | 1.08E-04 | 3.35E-12 | 1.34E-12 |
| Percent Y | 5.81E-01 | 3.03E-05 | 6.69E-03 |
| Percent W | 1.23E-03 | 4.33E-08 | 1.08E-09 |
| Percent H | 3.56E-02 | 9.66E-04 | 8.09E-01 |
| Percent M | 1.53E-01 | 1.32E-02 | 4.78E-03 |
| Percent C | 4.83E-04 | 5.74E-06 | 3.01E-09 |
| Percent S | 1.19E-07 | 4.31E-09 | 6.81E-17 |
| Percent T | 5.51E-02 | 8.91E-01 | 1.16E-01 |
| Percent K | 6.15E-01 | 1.77E-06 | 1.72E-02 |
| Percent D | 5.80E-01 | 7.93E-08 | 3.55E-04 |
| Percent E | 1.18E-01 | 1.11E-01 | 6.36E-01 |
| Percent N | 9.56E-02 | 7.67E-01 | 1.07E-01 |
| Percent Q | 8.70E-01 | 1.35E-05 | 1.21E-02 |
| Percent R | 1.86E-01 | 8.43E-20 | 1.05E-10 |
| Length | 5.92E-02 | 1.21E-10 | 7.14E-09 |
| Disorder prediction (DISOPRED) | 7.67E-13 | 1.79E-39 | 1.91E-37 |
| Disorder prediction (IUPRED) | 8.01E-11 | 3.91E-26 | 7.09E-31 |
| Number of LCRs | 1.11E-08 | 1.70E-24 | 1.06E-28 |
| Number of MoRFs (ANCHOR) | 6.55E-09 | 1.46E-29 | 5.43E-32 |
| Number of ELMs | 1.85E-02 | 4.03E-14 | 4.67E-12 |
| Number of LCRs per unit length | 1.04E-08 | 7.87E-20 | 6.17E-25 |
| Number of MoRFs (ANCHOR) per unit length | 2.24E-09 | 1.56E-23 | 1.82E-27 |
| Number of ELMs per unit length | 1.95E-02 | 5.08E-13 | 7.06E-10 |
| Percent helix | 1.11E-01 | 3.59E-03 | 1.21E-03 |
| Percent sheet | 4.97E-03 | 1.03E-08 | 1.91E-09 |
| Percent coil | 8.68E-05 | 3.57E-11 | 2.89E-13 |
| Percent polar | 9.50E-02 | 2.57E-04 | 2.26E-04 |
| Percent hydrophobic | 5.51E-02 | 5.13E-18 | 8.48E-11 |
| Percent positive | 5.42E-02 | 1.35E-31 | 3.58E-18 |
| Percent negative | 1.09E-01 | 5.57E-05 | 4.01E-01 |
| Abundance | 5.92E-01 | 4.11E-24 | 1.87E-09 |
| Number of disulfide bonds | 1.53E-01 | 1.02E-03 | 5.92E-01 |
| Number of phosphorylation sites | 4.51E-06 | 3.27E-08 | 2.08E-13 |
| Number of disordered regions per unit length | 4.52E-04 | 1.14E-15 | 4.91E-15 |
| Number of disordered regions | 1.53E-04 | 1.11E-31 | 9.83E-23 |
| Number of coiled coil regions | 5.01E-04 | 3.66E-06 | 9.75E-10 |
| Percent coiled coil | 6.23E-04 | 1.25E-05 | 4.46E-09 |
| Number of MoRFs (ANCHOR) per percent disorder | 5.87E-08 | 1.93E-17 | 8.23E-21 |
| Number of ELMs per percent disorder | 8.30E-05 | 1.62E-10 | 6.66E-13 |
| Number of LCRs per percent disorder | 1.43E-03 | 2.22E-11 | 2.36E-12 |
| Number of MoRFs per percent disorder | 2.80E-03 | 6.42E-11 | 2.22E-11 |
| Number of MoRFs per disordered patch | 1.48E-01 | 1.17E-12 | 5.79E-09 |
| Codon Adaptation Index (CAI) | 5.42E-01 | 3.36E-01 | 9.78E-01 |
| Percent rare amino acids (C W H M) | 4.20E-05 | 2.13E-04 | 3.01E-09 |
| Number of MoRF residues (ANCHOR) | 2.79E-08 | 2.14E-30 | 1.29E-31 |
| Number of MoRF residues (ANCHOR) per unit length | 1.55E-08 | 3.09E-26 | 3.15E-28 |
| Number of MoRFs (ANCHOR) per disordered patch | 3.63E-07 | 7.32E-23 | 6.13E-28 |
| Percent aromatic residues (F Y W) | 3.04E-03 | 7.43E-15 | 1.00E-10 |
| Net charge per protein | 9.19E-01 | 5.79E-43 | 6.17E-17 |
| Net charge squared per protein | 2.02E-01 | 1.06E-24 | 2.02E-13 |
| Net charge per residue | 6.94E-01 | 1.23E-35 | 2.00E-13 |
| Net charge per residue squared | 3.69E-01 | 3.41E-09 | 6.54E-03 |

Table A.12: Table of p-values for feature analysis of mouse proteins

| Analysis | LS vs NS | HS vs NS | LS vs HS |
|---|---|---|---|
| Percent G | 6.47E-01 | 4.51E-01 | 5.66E-01 |
| Percent A | 1.12E-01 | 9.13E-01 | 8.15E-01 |
| Percent V | 1.44E-19 | 2.36E-03 | 9.69E-01 |
| Percent L | 1.51E-12 | 3.38E-02 | 3.45E-05 |
| Percent I | 7.14E-13 | 2.00E-01 | 1.78E-01 |
| Percent P | 8.77E-05 | 1.67E-01 | 1.13E-02 |
| Percent F | 7.95E-14 | 7.48E-01 | 9.76E-03 |
| Percent Y | 4.86E-01 | 1.12E-01 | 2.23E-01 |
| Percent W | 2.54E-04 | 2.10E-01 | 9.18E-01 |
| Percent H | 9.03E-01 | 9.16E-01 | 9.69E-01 |
| Percent M | 5.76E-06 | 6.70E-01 | 4.35E-02 |
| Percent C | 6.09E-18 | 4.17E-01 | 3.75E-02 |
| Percent S | 3.49E-08 | 8.72E-01 | 2.40E-02 |
| Percent T | 1.23E-03 | 8.43E-02 | 5.46E-01 |
| Percent K | 8.22E-01 | 2.64E-04 | 2.14E-03 |
| Percent D | 3.32E-06 | 7.00E-02 | 5.84E-04 |
| Percent E | 6.95E-01 | 1.17E-01 | 1.46E-01 |
| Percent N | 5.26E-01 | 4.44E-01 | 5.81E-01 |
| Percent Q | 1.70E-03 | 4.68E-01 | 9.83E-02 |
| Percent R | 4.43E-37 | 2.02E-04 | 7.61E-12 |
| Length | 5.06E-10 | 3.90E-11 | 3.15E-13 |
| Disorder prediction (DISOPRED) | 5.49E-42 | 1.81E-01 | 7.06E-08 |
| Disorder prediction (IUPRED) | 3.31E-36 | 1.12E-01 | 6.73E-07 |
| Number of LCRs | 4.74E-38 | 5.82E-02 | 1.30E-10 |
| Number of MoRFs (ANCHOR) | 6.46E-38 | 2.17E-03 | 8.00E-13 |
| Number of ELMs | 1.90E-13 | 3.39E-11 | 2.73E-14 |
| Number of LCRs per unit length | 3.18E-30 | 4.84E-01 | 2.12E-05 |
| Number of MoRFs (ANCHOR) per unit length | 3.32E-33 | 2.45E-01 | 1.50E-05 |
| Number of ELMs per unit length | 7.19E-12 | 7.72E-03 | 4.77E-06 |
| Percent helix | 1.30E-08 | 4.21E-03 | 6.33E-05 |
| Percent sheet | 2.44E-12 | 3.95E-03 | 3.47E-01 |
| Percent coil | 5.30E-19 | 9.31E-03 | 2.01E-06 |
| Percent polar | 1.74E-03 | 1.90E-01 | 2.62E-02 |
| Percent hydrophobic | 7.73E-21 | 9.72E-01 | 6.94E-04 |
| Percent positive | 4.53E-22 | 2.55E-01 | 1.18E-02 |
| Percent negative | 5.55E-02 | 3.59E-02 | 1.14E-02 |
| Abundance | 5.79E-05 | 3.57E-04 | 1.69E-06 |
| Number of disulfide bonds | 5.89E-01 | 3.66E-07 | 1.29E-04 |
| Number of phosphorylation sites | 1.47E-06 | 3.62E-01 | 5.95E-03 |
| Number of disordered regions per unit length | 1.98E-12 | 5.20E-01 | 6.29E-03 |
| Number of disordered regions | 5.79E-25 | 3.86E-06 | 5.99E-14 |
| Number of coiled coil regions | 6.38E-11 | 4.76E-02 | 2.45E-04 |
| Percent coiled coil | 2.82E-09 | 6.14E-02 | 6.47E-04 |
| Number of MoRFs (ANCHOR) per percent disorder | 7.24E-24 | 7.09E-02 | 4.15E-01 |
| Number of ELMs per percent disorder | 3.25E-11 | 5.21E-02 | 8.91E-01 |
| Number of LCRs per percent disorder | 4.44E-19 | 1.22E-01 | 2.00E-06 |
| Number of MoRFs per percent disorder | 5.33E-18 | 3.10E-04 | 6.53E-11 |
| Number of MoRFs per disordered patch | 2.93E-06 | 9.85E-01 | 1.72E-01 |
| Codon Adaptation Index (CAI) | 8.61E-03 | 5.06E-02 | 2.91E-01 |
| Hydrophobicity (GRAVY index) | 1.20E-41 | 6.50E-01 | 2.16E-05 |
| Percent rare amino acids (C W H M) | 8.10E-11 | 7.29E-01 | 1.88E-02 |
| Number of MoRF residues (ANCHOR) | 2.33E-37 | 4.46E-03 | 2.11E-12 |
| Number of MoRF residues (ANCHOR) per unit length | 4.95E-34 | 1.63E-01 | 8.41E-07 |
| Number of MoRFs (ANCHOR) per disordered patch | 9.38E-30 | 5.25E-01 | 3.64E-05 |
| Percent aromatic residues (F Y W) | 4.99E-07 | 4.46E-01 | 3.66E-01 |
| Net charge per protein | 5.31E-18 | 6.98E-03 | 3.69E-07 |
| Net charge squared per protein | 9.52E-32 | 2.38E-04 | 2.10E-13 |
| Net charge per residue | 4.04E-16 | 8.91E-02 | 2.84E-05 |
| Net charge per residue squared | 9.21E-14 | 1.02E-01 | 1.47E-01 |

Table A.13: Enrichment analysis for low complexity regions in yeast LS proteins relative to NS proteins

| Residue | LS with LCR | LS without LCR | NS with LCR | NS without LCR | p-value | Trend direction |
|---|---|---|---|---|---|---|
| A | 0 | 207 | 21 | 1220 | 6.00E-02 | depleted |
| C | 4 | 203 | 14 | 1227 | 3.10E-01 | enriched |
| D | 15 | 192 | 111 | 1130 | 5.10E-01 | depleted |
| E | 24 | 183 | 199 | 1042 | 1.20E-01 | depleted |
| F | 5 | 202 | 19 | 1222 | 3.70E-01 | enriched |
| G | 3 | 204 | 33 | 1208 | 4.70E-01 | depleted |
| H | 3 | 204 | 15 | 1226 | 7.30E-01 | enriched |
| I | 5 | 202 | 22 | 1219 | 5.80E-01 | enriched |
| L | 47 | 160 | 294 | 947 | 7.90E-01 | depleted |
| K | 10 | 197 | 23 | 1218 | 1.90E-02 | enriched |
| M | 0 | 207 | 6 | 1235 | 6.00E-01 | depleted |
| N | 31 | 176 | 172 | 1069 | 6.70E-01 | enriched |
| P | 3 | 204 | 39 | 1202 | 2.60E-01 | depleted |
| Q | 6 | 201 | 60 | 1181 | 2.80E-01 | depleted |
| R | 5 | 202 | 12 | 1229 | 8.30E-02 | enriched |
| S | 41 | 166 | 146 | 1095 | 2.40E-03 | enriched |
| T | 2 | 205 | 23 | 1218 | 5.60E-01 | depleted |
| V | 0 | 207 | 9 | 1232 | 3.70E-01 | depleted |
| W | 0 | 207 | 7 | 1234 | 6.00E-01 | depleted |
| Y | 3 | 204 | 15 | 1226 | 7.30E-01 | enriched |

Table A.14: Enrichment analysis for low complexity regions in human LS proteins relative to NS proteins

| Residue | LS with LCR | LS without LCR | NS with LCR | NS without LCR | p-value | Trend direction |
|---|---|---|---|---|---|---|
| A | 16 | 521 | 91 | 2143 | 2.60E-01 | depleted |
| C | 15 | 522 | 84 | 2149 | 3.00E-01 | depleted |
| D | 14 | 523 | 59 | 2175 | 1.00E+00 | depleted |
| E | 107 | 430 | 375 | 1859 | 8.70E-02 | enriched |
| F | 1 | 536 | 13 | 2221 | 4.90E-01 | depleted |
| G | 35 | 502 | 124 | 2109 | 4.10E-01 | enriched |
| H | 7 | 530 | 25 | 2209 | 6.60E-01 | enriched |
| I | 0 | 537 | 3 | 2231 | 1.00E+00 | depleted |
| L | 80 | 457 | 349 | 1885 | 7.40E-01 | depleted |
| K | 5 | 531 | 47 | 2187 | 7.70E-02 | depleted |
| M | 4 | 532 | 13 | 2221 | 7.60E-01 | enriched |
| N | 3 | 534 | 10 | 2223 | 7.30E-01 | enriched |
| P | 70 | 466 | 331 | 1903 | 3.40E-01 | depleted |
| Q | 49 | 488 | 203 | 2031 | 1.00E+00 | enriched |
| R | 40 | 496 | 145 | 2089 | 4.40E-01 | enriched |
| S | 66 | 471 | 293 | 1941 | 6.70E-01 | depleted |
| T | 7 | 529 | 28 | 2206 | 8.30E-01 | enriched |
| V | 3 | 534 | 9 | 2225 | 7.10E-01 | enriched |
| W | 0 | 537 | 5 | 2229 | 5.90E-01 | depleted |
| Y | 9 | 527 | 19 | 2215 | 9.30E-02 | enriched |

Table A.15: Enrichment analysis for low complexity regions in mouse LS proteins relative to NS proteins

| Residue | LS with LCR | LS without LCR | NS with LCR | NS without LCR | p-value | Trend direction |
|---|---|---|---|---|---|---|
| A | 53 | 1235 | 65 | 1186 | 2.20E-01 | depleted |
| C | 37 | 1251 | 46 | 1204 | 2.70E-01 | depleted |
| D | 15 | 1273 | 39 | 1212 | 8.00E-04 | depleted |
| E | 213 | 1075 | 241 | 1010 | 7.80E-02 | depleted |
| F | 5 | 1283 | 8 | 1243 | 4.20E-01 | depleted |
| G | 77 | 1211 | 59 | 1191 | 1.90E-01 | enriched |
| H | 22 | 1266 | 10 | 1240 | 4.90E-02 | enriched |
| I | 2 | 1286 | 4 | 1247 | 4.50E-01 | depleted |
| L | 132 | 1156 | 174 | 1076 | 5.00E-03 | depleted |
| K | 3 | 1285 | 21 | 1229 | 1.20E-04 | depleted |
| M | 8 | 1280 | 11 | 1239 | 5.00E-01 | depleted |
| N | 9 | 1279 | 9 | 1241 | 1.00E+00 | depleted |
| P | 227 | 1061 | 204 | 1046 | 4.00E-01 | enriched |
| Q | 120 | 1167 | 111 | 1140 | 7.30E-01 | enriched |
| R | 108 | 1180 | 35 | 1215 | 5.10E-10 | enriched |
| S | 199 | 1089 | 160 | 1090 | 6.00E-02 | enriched |
| T | 15 | 1273 | 23 | 1227 | 1.90E-01 | depleted |
| V | 4 | 1283 | 14 | 1236 | 1.70E-02 | depleted |
| W | 2 | 1286 | 2 | 1249 | 1.00E+00 | depleted |
| Y | 30 | 1258 | 7 | 1243 | 1.80E-04 | enriched |

Table A.16: GO analysis (biological processes) for yeast LS proteins without RNase treatment

| GO annotation | GO term | Genes | p-value |
|---|---|---|---|
| GO:0006364 | rRNA processing | YLR197W; YOL144W; YGL078C; YJL069C; YGR090W; YEL026W; YLR129W; YDR449C; YPL126W; YOL077C; YPR137W; YMR229C; YHR196W; YJL109C; YGR128C; YOR078W; YJR041C; YKL014C; YPL043W; YOR004W; YDL208W; YER082C; YOR310C; YKL172W; YCR057C; YPL266W; YLL011W; YGL120C; YDR324C; YLR196W; YOR119C; YBL004W; YCR031C; YHR148W; YGL171W; YLR222C; YJR002W; YMR093W; YDL213C; YCL059C; YPL157W; YDL014W; YDR398W; YLR175W | 3.84E-28 |
p-valueGO:0016072 rRNAmetabolicprocessYLR197W; YOL144W; YGL078C;YJL069C; YGR090W; YEL026W;YLR129W; YDR449C; YPL126W;YOL077C; YPR137W; YMR229C;YHR196W; YJL109C; YGR128C;YOR078W; YJR041C; YKL014C;YPL043W; YOR004W; YDL208W;YER082C; YOR310C; YKL172W;YCR057C; YPL266W; YLL011W;YGL120C; YDR324C; YLR196W;YOR119C; YBL004W; YCR031C;YHR148W; YGL171W; YLR222C;YJR002W; YMR093W; YDL213C;YCL059C; YPL157W; YDL014W;YDR398W; YLR175W1.96E-27102Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0042254 ribosomebiogenesisYDR060W; YLR197W; YOL144W;YGL078C; YJL069C; YGR090W;YEL026W; YLR129W; YDR449C;YPL126W; YOL077C; YPR137W;YMR229C; YHR196W; YJL109C;YGR128C; YNR053C; YOR078W;YJR041C; YKL014C; YPL043W;YOR004W; YDL208W; YER082C;YOR310C; YKL172W; YCR057C;YPL266W; YLL011W; YGL120C;YDR324C; YLR196W; YOR119C;YBL004W; YCR031C; YHR148W;YGL171W; YLR222C; YJR002W;YMR093W; YDL213C; YCL059C;YPL157W; YDL014W; YDR398W;YJR066W; YLR175W; YLR003C2.80E-25103Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0022613 ribonucleoproteincomplexbiogenesisYDR060W; YLR197W; YOL144W;YGL078C; YJL069C; YGR090W;YEL026W; YLR129W; YDR449C;YPL126W; YOL077C; YPR137W;YMR229C; YHR196W; YJL109C;YGR128C; YNR053C; YOR078W;YJR041C; YKL014C; YPL043W;YOR004W; YDL208W; YER082C;YOR310C; YKL172W; YCR057C;YPL266W; YLL011W; YGL120C;YDR324C; YLR196W; YOR119C;YBL004W; YCR031C; YHR148W;YGL171W; YLR222C; YJR002W;YMR093W; YDL213C; YCL059C;YPL157W; YDL014W; YDR398W;YJR066W; YLR175W; YLR003C7.73E-23GO:0000462 maturation ofSSU-rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL069C; YGR090W; YLR129W;YEL026W; YDR449C; YPL126W;YMR229C; YJL109C; YHR196W;YGR128C; YOR078W; YOR004W;YER082C; YOR310C; YCR057C;YLL011W; YGL120C; YDR324C;YOR119C; YBL004W; YCR031C;YLR222C; YJR002W; YMR093W;YCL059C; YDL014W; YDR398W3.16E-22104Table A.16: GO analysis (biological processes) for yeast LS 
proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0030490 maturation ofSSU-rRNAYJL069C; YGR090W; YLR129W;YEL026W; YDR449C; YPL126W;YMR229C; YJL109C; YHR196W;YGR128C; YOR078W; YOR004W;YER082C; YOR310C; YCR057C;YLL011W; YGL120C; YDR324C;YOR119C; YBL004W; YCR031C;YLR222C; YJR002W; YMR093W;YCL059C; YDL014W; YDR398W6.39E-22GO:0034470 ncRNAprocessingYLR197W; YOL144W; YGL078C;YJL069C; YGR090W; YEL026W;YLR129W; YDR449C; YPL126W;YOL077C; YPR137W; YMR229C;YHR196W; YJL109C; YGR128C;YOR078W; YJR041C; YKL014C;YPL043W; YOR004W; YDL208W;YER082C; YOR310C; YKL172W;YCR057C; YPL266W; YLL011W;YGL120C; YDR324C; YLR196W;YOR119C; YBL004W; YCR031C;YHR148W; YGL171W; YLR222C;YJR002W; YMR093W; YDL213C;YCL059C; YPL157W; YDL014W;YDR398W; YLR175W7.64E-22105Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0034660 ncRNAmetabolicprocessYLR197W; YOL144W; YGL078C;YJL069C; YGR090W; YEL026W;YLR129W; YDR449C; YPL126W;YOL077C; YPR137W; YMR229C;YHR196W; YJL109C; YGR128C;YOR078W; YJR041C; YKL014C;YPL043W; YOR004W; YDL208W;YER082C; YOR310C; YKL172W;YCR057C; YPL266W; YLL011W;YGL120C; YDR324C; YLR196W;YOR119C; YBL004W; YCR031C;YHR148W; YGL171W; YLR222C;YJR002W; YMR093W; YDL213C;YCL059C; YPL157W; YDL014W;YDR398W; YLR175W5.33E-19106Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0006396 RNAprocessingYLR197W; YOL144W; YGL078C;YJL069C; YGR090W; YEL026W;YLR129W; YDR449C; YPL126W;YOL077C; YPR137W; YMR229C;YHR196W; YJL109C; YGR128C;YOR078W; YJR041C; YKL014C;YPL043W; YOR004W; YDL208W;YER082C; YOR310C; YKL172W;YCR057C; YPL266W; YLL011W;YGL120C; YDR324C; YLR196W;YOR119C; YBL004W; YCR031C;YHR148W; YGL171W; YLR222C;YJR002W; YMR093W; YDL213C;YCL059C; YPL157W; YDL014W;YDR398W; YLR175W2.13E-14GO:0000479 endonucleolyticcleavage oftricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; 
YJR002W;YOR078W; YCL059C; YOR004W;YER082C; YDL208W; YOR310C9.60E-12GO:0000478 endonucleolyticcleavagesduring rRNAprocessingYCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YCL059C; YOR004W;YER082C; YDL208W; YOR310C9.60E-12107Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0000472 endonucleolyticcleavage togenerate mature5’-end ofSSU-rRNAfrom(SSU-rRNA;5.8S rRNA;LSU-rRNA)YCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YOR004W; YER082C;YOR310C4.46E-11GO:0000967 rRNA 5’-endprocessingYCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YOR004W; YER082C;YOR310C7.48E-11GO:0034471 ncRNA 5’-endprocessingYCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YOR004W; YER082C;YOR310C7.48E-11GO:0000966 RNA 5’-endprocessingYCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YOR004W; YER082C;YOR310C1.22E-10108Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0000447 endonucleolyticcleavage inITS1 toseparateSSU-rRNAfrom 5.8SrRNA andLSU-rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YCL059C; YOR004W;YER082C; YOR310C1.25E-10GO:0000480 endonucleolyticcleavage in5’-ETS oftricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YMR229C; YLR222C; YCR057C;YJL109C; YJR002W; YJL069C;YLR129W; YDR449C; YOR004W;YER082C; YBL004W; YOR310C6.21E-10GO:0000460 maturation of5.8S rRNAYCR057C; YJL069C; YLR129W;YGL120C; YDR449C; YBL004W;YMR229C; YLR222C; YJL109C;YJR002W; YOR078W; YCL059C;YKL014C; YOR004W; YER082C;YOR310C2.62E-09109Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0000466 
maturation of5.8S rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YCR057C; YJL069C; YLR129W;YGL120C; YDR449C; YBL004W;YMR229C; YLR222C; YJL109C;YJR002W; YOR078W; YCL059C;YKL014C; YOR004W; YER082C;YOR310C2.62E-09GO:0000469 cleavagesduring rRNAprocessingYCR057C; YJL069C; YLR129W;YDR449C; YBL004W; YMR229C;YJL109C; YLR222C; YJR002W;YOR078W; YCL059C; YOR004W;YER082C; YDL208W; YOR310C3.97E-09GO:0045943 positiveregulation oftranscriptionfrom RNApolymerase IpromoterYJL109C; YHR196W; YGR128C;YMR093W; YDR324C; YPL126W;YDR398W3.39E-05GO:0006356 regulation oftranscriptionfrom RNApolymerase IpromoterYJL109C; YHR196W; YGR128C;YMR093W; YDR324C; YPL126W;YDR398W6.16E-04GO:0042274 ribosomal smallsubunitbiogenesisYJR002W; YPL266W; YDL213C;YLR129W; YCL059C; YER082C;YLR003C; YHR148W; YCR031C6.27E-03110Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0000154 rRNAmodificationYLR197W; YPL266W; YDL014W;YDL208W; YLR175W; YPR137W7.47E-03GO:0045941 positiveregulation oftranscriptionYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C1.10E-01GO:0010628 positiveregulation ofgene expressionYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C1.16E-01GO:0051173 positiveregulation ofnitrogencompoundmetabolicprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C1.80E-01GO:0045935 positiveregulation ofnucleobase;nucleoside;nucleotide andnucleic acidmetabolicprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C1.80E-01111Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0010557 positiveregulation ofmacromoleculebiosyntheticprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; 
YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C2.99E-01GO:0031328 positiveregulation ofcellularbiosyntheticprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C3.47E-01GO:0009891 positiveregulation ofbiosyntheticprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C3.47E-01GO:0010604 positiveregulation ofmacromoleculemetabolicprocessYDR224C; YDR324C; YGL150C;YPL126W; YJL109C; YHR196W;YGR128C; YGR270W; YMR093W;YGL133W; YDR169C; YDR398W;YBR009C5.41E-01GO:0045893 positiveregulation oftranscription;DNA-dependentYJL109C; YHR196W; YGR128C;YGR270W; YMR093W; YDR324C;YGL150C; YPL126W; YDR169C;YDR398W6.52E-01112Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0051254 positiveregulation ofRNA metabolicprocessYJL109C; YHR196W; YGR128C;YGR270W; YMR093W; YDR324C;YGL150C; YPL126W; YDR169C;YDR398W7.43E-01GO:0006355 regulation oftranscription;DNA-dependentYGR122W; YDR224C; YDR324C;YGL150C; YPL126W; YDR310C;YJL109C; YHR196W; YGR270W;YGR128C; YMR093W; YBL052C;YBL054W; YMR247C; YER088C;YMR080C; YGL133W; YDR169C;YMR307W; YDR398W7.77E-01GO:0009451 RNAmodificationYLR197W; YPL266W; YGL078C;YPL157W; YDL014W; YOR004W;YDL208W; YLR175W; YPR137W8.36E-01GO:0051252 regulation ofRNA metabolicprocessYGR122W; YDR224C; YDR324C;YGL150C; YPL126W; YDR310C;YJL109C; YHR196W; YGR270W;YGR128C; YMR093W; YBL052C;YBL054W; YMR247C; YER088C;YMR080C; YGL133W; YDR169C;YMR307W; YDR398W8.46E-01GO:0045814 negativeregulation ofgeneexpression;epigeneticYBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W9.28E-01113Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0006342 chromatinsilencingYBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W9.28E-01GO:0051172 negativeregulation 
ofnitrogencompoundmetabolicprocessYOL081W; YGR122W; YDR224C;YBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W; YBR009C9.61E-01GO:0045934 negativeregulation ofnucleobase;nucleoside;nucleotide andnucleic acidmetabolicprocessYOL081W; YGR122W; YDR224C;YBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W; YBR009C9.61E-01GO:0040029 regulation ofgeneexpression;epigeneticYBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W9.78E-01GO:0016458 gene silencing YBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W9.82E-01GO:0016481 negativeregulation oftranscriptionYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; YGL133W;YMR307W; YBR009C9.84E-01114Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0010629 negativeregulation ofgene expressionYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; YGL133W;YMR307W; YBR009C9.90E-01GO:0031327 negativeregulation ofcellularbiosyntheticprocessYOL081W; YGR122W; YDR224C;YBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W; YBR009C9.91E-01GO:0006348 chromatinsilencing attelomereYBL052C; YMR247C; YGL150C;YER088C; YDR310C; YGL133W9.93E-01GO:0009890 negativeregulation ofbiosyntheticprocessYOL081W; YGR122W; YDR224C;YBL052C; YMR247C; YGL150C;YER088C; YDR310C; YMR080C;YGL133W; YMR307W; YBR009C9.93E-01GO:0045892 negativeregulation oftranscription;DNA-dependentYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; YGL133W;YMR307W9.97E-01GO:0051253 negativeregulation ofRNA metabolicprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; YGL133W;YMR307W9.97E-01115Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0010558 negativeregulation ofmacromoleculebiosyntheticprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; 
YGL133W;YMR307W; YBR009C1.00E+00GO:0045449 regulation oftranscriptionYGR122W; YDR224C; YGL150C;YDR324C; YPL126W; YDR310C;YJL109C; YHR196W; YGR270W;YGR128C; YMR093W; YBL052C;YGR040W; YBL054W; YMR247C;YER088C; YMR080C; YGL133W;YDR169C; YMR307W; YDR398W;YBR009C1.00E+00GO:0000463 maturation ofLSU-rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YMR229C; YGL120C; YKL014C 1.00E+00GO:0000470 maturation ofLSU-rRNAYMR229C; YGL120C; YKL014C 1.00E+00GO:0010605 negativeregulation ofmacromoleculemetabolicprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YER088C;YDR310C; YMR080C; YGL133W;YMR307W; YBR009C1.00E+00116Table A.16: GO analysis (biological processes) for yeast LS proteins withoutRNase treatmentGOannotationGO term Genes p-valueGO:0042255 ribosomeassemblyYDR060W; YGL078C; YKL014C;YOL077C; YCR031C1.00E+00GO:0042273 ribosomal largesubunitbiogenesisYDR060W; YOL144W; YGL078C;YGL120C; YOL077C1.00E+00GO:0031118 rRNApseudouridinesynthesisYDL208W; YLR175W 1.00E+00Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0006364 rRNAprocessingYLR197W; YOL144W; YGL078C;YPL266W; YEL026W; YGL120C;YBL004W; YJL109C; YJR041C;YPL157W; YKL014C; YDL014W;YDL208W; YLR175W; YOR310C4.40E-02GO:0016072 rRNAmetabolicprocessYLR197W; YOL144W; YGL078C;YPL266W; YEL026W; YGL120C;YBL004W; YJL109C; YJR041C;YPL157W; YKL014C; YDL014W;YDL208W; YLR175W; YOR310C6.48E-02GO:0000154 rRNAmodificationYLR197W; YPL266W; YDL014W;YDL208W; YLR175W8.29E-02117Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0006338 chromatinremodelingYFL013C; YOR141C; YNL088W;YGL150C; YGL133W; YDL002C;YBR245C; YLR095C1.24E-01GO:0042254 ribosomebiogenesisYLR197W; YOL144W; YPL266W;YGL078C; YEL026W; YGL120C;YBL004W; YJL109C; YNR053C;YJR041C; YPL157W; YKL014C;YDL014W; YJR066W; YDL208W;YLR175W; YOR310C2.24E-01GO:0016568 chromatinmodificationYFL013C; YBL052C; YPL116W;YOR141C; YNL088W; 
YMR247C;YGL150C; YGL133W; YDL002C;YBR245C; YLR095C; YBR009C2.38E-01GO:0051172 negativeregulation ofnitrogencompoundmetabolicprocessYGR122W; YDR224C; YGL150C;YDR310C; YLL004W; YBR245C;YOL081W; YBL052C; YMR247C;YGL133W; YNL167C; YMR307W;YBR009C2.55E-01GO:0045934 negativeregulation ofnucleobase;nucleoside;nucleotide andnucleic acidmetabolicprocessYGR122W; YDR224C; YGL150C;YDR310C; YLL004W; YBR245C;YOL081W; YBL052C; YMR247C;YGL133W; YNL167C; YMR307W;YBR009C2.55E-01118Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0006325 chromatinorganizationYFL013C; YPL116W; YDR224C;YOR141C; YGL150C; YDL002C;YBR245C; YBL052C; YNL088W;YMR247C; YGL133W; YBR009C;YLR095C2.65E-01GO:0016481 negativeregulation oftranscriptionYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W; YBR245C; YBR009C3.32E-01GO:0031327 negativeregulation ofcellularbiosyntheticprocessYGR122W; YDR224C; YGL150C;YDR310C; YLL004W; YBR245C;YOL081W; YBL052C; YMR247C;YGL133W; YNL167C; YMR307W;YBR009C3.67E-01GO:0010629 negativeregulation ofgene expressionYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W; YBR245C; YBR009C3.68E-01GO:0009890 negativeregulation ofbiosyntheticprocessYGR122W; YDR224C; YGL150C;YDR310C; YLL004W; YBR245C;YOL081W; YBL052C; YMR247C;YGL133W; YNL167C; YMR307W;YBR009C3.89E-01119Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0022613 ribonucleoproteincomplexbiogenesisYLR197W; YOL144W; YPL266W;YGL078C; YEL026W; YGL120C;YBL004W; YJL109C; YNR053C;YJR041C; YPL157W; YKL014C;YDL014W; YJR066W; YDL208W;YLR175W; YOR310C6.19E-01GO:0010558 negativeregulation ofmacromoleculebiosyntheticprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W; YBR245C; YBR009C6.47E-01GO:0034470 ncRNAprocessingYLR197W; YOL144W; YGL078C;YPL266W; YEL026W; YGL120C;YBL004W; YJL109C; 
YJR041C;YPL157W; YKL014C; YDL014W;YDL208W; YLR175W; YOR310C7.49E-01GO:0051276 chromosomeorganizationYFL013C; YFL037W; YDR224C;YPL116W; YOR141C; YGL150C;YDL002C; YBR245C; YBL052C;YNL088W; YMR247C; YGL133W;YPL157W; YML085C; YLR095C;YBR009C8.64E-01GO:0045892 negativeregulation oftranscription;DNA-dependentYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W8.89E-01120Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0051253 negativeregulation ofRNA metabolicprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W8.98E-01GO:0045449 regulation oftranscriptionYFL013C; YGR122W; YDR224C;YPL116W; YOR141C; YIL038C;YGL150C; YDR310C; YLL004W;YDL002C; YBR245C; YJL109C;YGR270W; YBL052C; YGR040W;YBL054W; YMR247C; YGL133W;YNL167C; YMR307W; YLR095C;YBR009C9.56E-01GO:0000742 karyogamyduringconjugationwith cellularfusionYFL037W; YDR356W; YML085C;YHR073W9.66E-01GO:0010605 negativeregulation ofmacromoleculemetabolicprocessYGR122W; YDR224C; YBL052C;YMR247C; YGL150C; YDR310C;YGL133W; YNL167C; YLL004W;YMR307W; YBR245C; YBR009C9.84E-01GO:0006355 regulation oftranscription;DNA-dependentYGR122W; YDR224C; YPL116W;YIL038C; YGL150C; YDR310C;YLL004W; YDL002C; YBR245C;YJL109C; YGR270W; YBL052C;YBL054W; YMR247C; YGL133W;YNL167C; YMR307W9.86E-01121Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0000741 karyogamy YFL037W; YDR356W; YML085C;YHR073W9.93E-01GO:0045814 negativeregulation ofgeneexpression;epigeneticYBL052C; YMR247C; YGL150C;YDR310C; YGL133W; YLL004W;YMR307W9.94E-01GO:0006342 chromatinsilencingYBL052C; YMR247C; YGL150C;YDR310C; YGL133W; YLL004W;YMR307W9.94E-01GO:0051252 regulation ofRNA metabolicprocessYGR122W; YDR224C; YPL116W;YIL038C; YGL150C; YDR310C;YLL004W; YDL002C; YBR245C;YJL109C; YGR270W; YBL052C;YBL054W; YMR247C; YGL133W;YNL167C; YMR307W9.94E-01GO:0006997 
nucleusorganizationYFL037W; YDR356W; YPL157W;YFR028C; YML085C; YHR073W9.95E-01GO:0034660 ncRNAmetabolicprocessYLR197W; YOL144W; YGL078C;YPL266W; YEL026W; YGL120C;YBL004W; YJL109C; YJR041C;YPL157W; YKL014C; YDL014W;YDL208W; YLR175W; YOR310C9.97E-01122Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0010551 regulation ofspecifictranscriptionfrom RNApolymerase IIpromoterYGR270W; YPL116W; YGL150C;YGL133W9.99E-01GO:0040029 regulation ofgeneexpression;epigeneticYBL052C; YMR247C; YGL150C;YDR310C; YGL133W; YLL004W;YMR307W9.99E-01GO:0016458 gene silencing YBL052C; YMR247C; YGL150C;YDR310C; YGL133W; YLL004W;YMR307W9.99E-01GO:0000462 maturation ofSSU-rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL109C; YGL120C; YEL026W;YDL014W; YBL004W; YOR310C1.00E+00GO:0030490 maturation ofSSU-rRNAYJL109C; YGL120C; YEL026W;YDL014W; YBL004W; YOR310C1.00E+00GO:0006357 regulation oftranscriptionfrom RNApolymerase IIpromoterYGR122W; YGR270W; YDR224C;YPL116W; YIL038C; YBL054W;YGL150C; YGL133W; YNL167C;YDL002C; YBR245C1.00E+00123Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0000747 conjugationwith cellularfusionYFL037W; YGR040W; YDR356W;YPR122W; YML085C; YHR073W;YKR031C1.00E+00GO:0009451 RNAmodificationYLR197W; YPL266W; YGL078C;YPL157W; YDL014W; YDL208W;YLR175W1.00E+00GO:0006396 RNAprocessingYLR197W; YOL144W; YDR194C;YPL266W; YGL078C; YEL026W;YIL038C; YGL120C; YBL004W;YJL109C; YJR041C; YPL157W;YKL014C; YDL014W; YDL208W;YLR175W; YOR310C1.00E+00GO:0032583 regulation ofgene-specifictranscriptionYGR270W; YPL116W; YGL150C;YGL133W1.00E+00GO:0006348 chromatinsilencing attelomereYBL052C; YMR247C; YGL150C;YDR310C; YGL133W1.00E+00GO:0000746 conjugation YFL037W; YGR040W; YDR356W;YPR122W; YML085C; YHR073W;YKR031C1.00E+00GO:0000478 endonucleolyticcleavagesduring rRNAprocessingYJL109C; YDL208W; YBL004W;YOR310C1.00E+00124Table A.17: GO 
analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0000479 endonucleolyticcleavage oftricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL109C; YDL208W; YBL004W;YOR310C1.00E+00GO:0045941 positiveregulation oftranscriptionYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0000460 maturation of5.8S rRNAYJL109C; YGL120C; YKL014C;YBL004W; YOR310C1.00E+00GO:0000466 maturation of5.8S rRNAfromtricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL109C; YGL120C; YKL014C;YBL004W; YOR310C1.00E+00GO:0010628 positiveregulation ofgene expressionYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00125Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0010552 positiveregulation ofspecifictranscriptionfrom RNApolymerase IIpromoterYGR270W; YPL116W; YGL150C 1.00E+00GO:0051173 positiveregulation ofnitrogencompoundmetabolicprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0045935 positiveregulation ofnucleobase;nucleoside;nucleotide andnucleic acidmetabolicprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0007126 meiosis YOR373W; YLR219W; YFL037W;YNL088W; YFL009W; YJR066W;YML085C; YKR031C1.00E+00GO:0051327 M phase ofmeiotic cellcycleYOR373W; YLR219W; YFL037W;YNL088W; YFL009W; YJR066W;YML085C; YKR031C1.00E+00126Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0006260 DNAreplicationYCR028C-A; YGR109W-B;YNL088W; YDR310C; YPL157W;YLL004W; YHR031C1.00E+00GO:0051321 meiotic cellcycleYOR373W; YLR219W; YFL037W;YNL088W; YFL009W; YJR066W;YML085C; YKR031C1.00E+00GO:0010557 positiveregulation ofmacromoleculebiosyntheticprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0043193 positiveregulation 
ofgene-specifictranscriptionYGR270W; YPL116W; YGL150C 1.00E+00GO:0031328 positiveregulation ofcellularbiosyntheticprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0009891 positiveregulation ofbiosyntheticprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00127Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0006259 DNA metabolicprocessYLR219W; YDR224C; YOR141C;YCR028C-A; YGL150C; YDR310C;YLL004W; YJR144W; YHR031C;YIL128W; YNL088W; YGR109W-B;YPL157W; YBR009C1.00E+00GO:0051181 cofactortransportYNL078W; YPR122W; YBL037W;YPL249C1.00E+00GO:0007000 nucleolusorganizationYPL157W; YFR028C 1.00E+00GO:0031118 rRNApseudouridinesynthesisYDL208W; YLR175W 1.00E+00GO:0006350 transcription YFL013C; YPL116W; YOR141C;YIL038C; YGL150C; YDR310C;YDL002C; YGR097W; YBR245C;YJL109C; YBL052C; YIL128W;YGL133W; YNL167C; YJR066W;YLR095C1.00E+00GO:0000279 M phase YOR373W; YLR219W; YFL037W;YNL088W; YDR356W; YFL009W;YJR066W; YML085C; YLR175W;YPL124W; YKR031C1.00E+00128Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0000480 endonucleolyticcleavage in5’-ETS oftricistronicrRNA transcript(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL109C; YBL004W; YOR310C 1.00E+00GO:0045132 meioticchromosomesegregationYOR373W; YFL037W; YML085C 1.00E+00GO:0010604 positiveregulation ofmacromoleculemetabolicprocessYJL109C; YGR270W; YDR224C;YPL116W; YGL150C; YGL133W;YBR245C; YBR009C1.00E+00GO:0000469 cleavagesduring rRNAprocessingYJL109C; YDL208W; YBL004W;YOR310C1.00E+00GO:0000472 endonucleolyticcleavage togenerate mature5’-end ofSSU-rRNAfrom(SSU-rRNA;5.8S rRNA;LSU-rRNA)YJL109C; YBL004W; YOR310C 1.00E+00129Table A.17: GO analysis (biological processes) for yeast LS proteins withRNase treatmentGOannotationGO term Genes p-valueGO:0034728 nucleosomeorganizationYDR224C; YGL150C; YBR245C;YBR009C1.00E+00GO:0000967 rRNA 
5’-endprocessingYJL109C; YBL004W; YOR310C 1.00E+00GO:0034471 ncRNA 5’-endprocessingYJL109C; YBL004W; YOR310C 1.00E+00130
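
As a rough illustration of how one row of counts in Tables A.13 to A.15 relates to a trend direction and a p-value, the Python sketch below compares the fraction of LS and NS proteins carrying a given LCR type and runs a two-sided Fisher's exact test on the 2x2 table. This is not the code used in the thesis: the function names are hypothetical, and the choice of Fisher's exact test is an assumption, since the tables do not name the test. The p-values it returns should therefore not be expected to match the tabulated ones exactly.

```python
from math import comb


def trend_direction(ls_with, ls_without, ns_with, ns_without):
    """Compare the fraction of LS and NS proteins that carry the LCR type."""
    ls_fraction = ls_with / (ls_with + ls_without)
    ns_fraction = ns_with / (ns_with + ns_without)
    return "enriched" if ls_fraction > ns_fraction else "depleted"


def fisher_exact_two_sided(ls_with, ls_without, ns_with, ns_without):
    """Two-sided Fisher's exact test on the 2x2 table of counts.

    Assumption: the tables do not name the test used; this is only
    one common choice for a 2x2 enrichment table.
    """
    n_ls = ls_with + ls_without
    n_ns = ns_with + ns_without
    n_with = ls_with + ns_with
    denominator = comb(n_ls + n_ns, n_with)

    def probability(k):
        # Hypergeometric probability that k of the LCR-containing proteins are LS.
        return comb(n_ls, k) * comb(n_ns, n_with - k) / denominator

    observed = probability(ls_with)
    lowest = max(0, n_with - n_ns)
    highest = min(n_ls, n_with)
    # Sum the probabilities of every table that is no more likely than the observed one.
    p_value = sum(
        p
        for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Lysine (K) row of Table A.13: LS with/without LCR = 10/197, NS with/without LCR = 23/1218.
print(trend_direction(10, 197, 23, 1218))
print(fisher_exact_two_sided(10, 197, 23, 1218))
```

The trend direction here depends only on which group has the larger fraction of LCR-containing proteins, which agrees with the direction column for the rows shown in the tables (for example, 10 of 207 LS proteins against 23 of 1241 NS proteins for lysine in yeast).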
