A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE) Stacey, R. G; Skinnider, Michael A; Scott, Nichollas E; Foster, Leonard J Oct 23, 2017

RESEARCH ARTICLE  Open Access

A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE)

R. Greg Stacey1*, Michael A. Skinnider1, Nichollas E. Scott1,2 and Leonard J. Foster1,3*

Abstract

Background: An organism’s protein interactome, or complete network of protein-protein interactions, defines the protein complexes that drive cellular processes. Techniques for studying protein complexes have traditionally applied targeted strategies such as yeast two-hybrid or affinity purification-mass spectrometry to assess protein interactions. However, given the vast number of protein complexes, more scalable methods are necessary to accelerate interaction discovery and to construct whole interactomes. We recently developed a complementary technique based on the use of protein correlation profiling (PCP) and stable isotope labeling in amino acids in cell culture (SILAC) to assess chromatographic co-elution as evidence of interacting proteins. Importantly, PCP-SILAC is also capable of measuring protein interactions simultaneously under multiple biological conditions, allowing the detection of treatment-specific changes to an interactome. Given the uniqueness and high dimensionality of co-elution data, new tools are needed to compare protein elution profiles, control false discovery rates, and construct an accurate interactome.

Results: Here we describe a freely available bioinformatics pipeline, PrInCE, for the analysis of co-elution data. PrInCE is a modular, open-source library that is computationally inexpensive, able to use label and label-free data, and capable of detecting tens of thousands of protein-protein interactions. Using a machine learning approach, PrInCE offers greatly reduced run time, more predicted interactions at the same stringency, prediction of protein complexes, and greater ease of use over previous bioinformatics tools for co-elution data. PrInCE is implemented in Matlab (version R2017a).
Source code and standalone executable programs for Windows and Mac OSX are available at, where usage instructions can be found. An example dataset and output are also provided for testing purposes.

Conclusions: PrInCE is the first fast and easy-to-use data analysis pipeline that predicts interactomes and protein complexes from co-elution data. PrInCE allows researchers without bioinformatics expertise to analyze high-throughput co-elution datasets.

Keywords: Interactome, Protein-protein interaction, Co-fractionation, Co-elution, Protein correlation profiling, Proteomics, System biology, Data analysis, Software

* Correspondence:; foster@msl.ubc.ca
1Michael Smith Laboratories, University of British Columbia, Vancouver V6T 1Z4, Canada
Full list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

Stacey et al. BMC Bioinformatics (2017) 18:457 DOI 10.1186/s12859-017-1865-8

Background

The association of proteins into complexes is common across all domains of life [1, 2]. Indeed, most proteins in well-studied proteomes are involved in at least one protein complex [3, 4]. Therefore, understanding the roles, mechanisms, and interplay of protein complexes is central to understanding life.

A proteome of 1500 proteins has over one million possible binary protein-protein interactions (PPIs) and many more potential higher-order complexes. Because of this combinatorial explosion, even relatively simple proteomes can yield rich, complex interactomes.
High-throughput or high-content methods that identify many PPIs simultaneously are therefore valuable to efficiently map these networks. There are currently three general methods for doing this. The first, yeast two-hybrid (Y2H), operates by incorporating modified bait and prey proteins in a genetically modified yeast cell, such that a PPI between bait and prey drives transcription of a reporter gene. Affinity purification mass spectrometry (AP-MS), a second technique, involves immunoprecipitation of proteins of interest (baits) [5]. While powerful, both techniques face limitations. For one, tagging proteins, typically with Gal4 in the case of Y2H or an epitope-antibody combination for AP-MS, creates non-endogenous conditions that can disrupt protein binding sites and increase the number of false negatives.

The third general approach, collectively termed co-fractionation approaches, involves resolving complexes by either chromatography or electrophoresis and assigning interacting partners based on the similarity of fractionation profiles [6–8]. While there are similarities in how the data from these methods are treated, there are also unique considerations for each one. Being more established methods, Y2H and AP-MS have several excellent approaches for data analysis [5, 9, 10]. However, there does not yet exist a gold standard tool for analyzing co-fractionation data. We [11] and others have previously reported pipelines for analyzing co-fractionation data, although existing approaches use other external sources of data, e.g. co-evolution, in addition to co-fractionation data [6, 12]. Optimally though, an interactome should be derived from co-fractionation data alone, using other data only for benchmarking. To this end, here we describe an open-source pipeline for analyzing co-fractionation data: PrInCE (Prediction of Interactomes from Co-Elution).
PrInCE represents a major conceptual advance over preliminary bioinformatics treatments published by our lab, which provided basic data extraction and curve fitting tools for co-elution data [8, 11]. Improvements include ranked interactions, improved user interface, and extensive documentation. Importantly, PrInCE uses machine learning methods which greatly improve its performance. We benchmarked the performance of PrInCE versus a previous version [11] and demonstrate a 1.5-to-2-fold improvement in the number of predicted PPIs at a given false discovery rate with a 97% decrease in computational cost. This pipeline is freely available for download [13].

Methods

Pipeline overview

The workflow of the pipeline is divided into five modules: 1) identification of Gaussian-like peaks in the co-fractionation profiles (GaussBuild.m); 2) correction for slight differences in the separation dimension between replicates (Alignment.m); 3) comparison of differences in protein amounts, i.e. fold changes, between conditions (FoldChange.m); 4) prediction of PPIs within each condition (Interactions.m); and 5) construction of protein complexes from the predicted PPIs (Complexes.m). The first two modules, i.e. GaussBuild.m and Alignment.m, are pre-processing steps, while the remaining three modules compute protein abundance changes and predict protein interactions and complexes (Fig. 1).

Requirements

Software and hardware

PrInCE is available as a standalone program for Windows or Mac OSX, as well as a Matlab package. Matlab is not required to run standalone versions of PrInCE, but it was selected initially due to superior curve fitting tools compared to other environments. After downloading and saving to a dedicated folder containing co-elution data, standalone PrInCE is directly accessed through its own icon. PrInCE can be downloaded for free [13].
Detailed documentation of all the code as well as further instructions for running the software are provided.

Datasets

This pipeline requires co-fractionation profiles of single proteins, where co-elution is evidence of co-complex membership. Each co-fractionation profile, e.g. a chromatogram, is a row in a .csv file. Co-fractionation profiles are grouped by both experimental condition and replicate number. Separate .csv files are used for different experimental conditions, and the replicate number of each chromatogram is recorded by a column in each file. We provide a test dataset on Github as an example of correct formatting.

Reference database of known complexes

This pipeline requires a reference database of known protein complexes. A portion of the proteins in these reference complexes must also be quantified in the experimental data, as the reference complexes provide the template by which novel interactions are predicted. We found that manually curated databases that rely on experimental evidence, such as CORUM [14], lead to a high number of predicted interactions.

Pipeline workflow

Data pre-processing (GaussBuild.m, Alignment.m)

Module GaussBuild.m uses Gaussian model fitting to identify the location, width, and height of peaks in the co-fractionation data. Any co-fractionation profile with data in at least five fractions is chosen for model fitting. First, single missing values in co-fractionation profiles are imputed as the mean of neighbouring data points. Remaining missing values are imputed as zeros, and co-fractionation profiles are smoothed by a sliding average with a width of 5 data points. Five Gaussian mixture models are fit to each profile. These models are mixtures of 1, 2, 3, 4 or 5 Gaussians, respectively. Fitted parameters A, μ, and σ are the Gaussian height, center, and width, respectively.
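The pre-processing steps just described (imputation, smoothing, and the Gaussian mixture form) can be illustrated with a minimal Python sketch. This is not the PrInCE Matlab code: the function names `impute_and_smooth` and `gaussian_mixture` are hypothetical, the shrinking of the averaging window at the ends of a profile is an assumption, and the mixture is assumed to be a plain sum of A·exp(−(x − μ)²/(2σ²)) terms.

```python
import math
from typing import List, Optional, Sequence, Tuple


def impute_and_smooth(profile: Sequence[Optional[float]], window: int = 5) -> List[float]:
    """Impute missing values (None) and smooth one co-fractionation profile."""
    n = len(profile)
    filled: List[Optional[float]] = list(profile)
    for i in range(1, n - 1):
        left, right = profile[i - 1], profile[i + 1]
        # A "single" missing value has observed neighbours on both sides.
        if profile[i] is None and left is not None and right is not None:
            filled[i] = (left + right) / 2.0
    # Any remaining missing value is imputed as zero.
    values = [0.0 if v is None else float(v) for v in filled]
    # Sliding average; the window shrinks at the profile edges (assumption).
    half = window // 2
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def gaussian_mixture(x: float, peaks: Sequence[Tuple[float, float, float]]) -> float:
    """Evaluate a sum of Gaussians, each given as (A, mu, sigma)."""
    return sum(a * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) for a, mu, sigma in peaks)
```

A fitting routine would then adjust the (A, μ, σ) triples for mixtures of one to five Gaussians so that `gaussian_mixture` matches the smoothed profile; the fitting itself is left out of this sketch.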
In order to reduce the sensitivity to outliers, robust fitting is performed using the L1 norm. For each profile, model selection is performed by selecting minimum AIC values.

Slight differences between the elution time of replicates are corrected by module Alignment.m, using the assumption that proteins with a single, well-defined chromatogram peak should elute in the same fraction in every replicate [11].

Fold changes between conditions (FoldChanges.m)

Within a single replicate, the protein abundance ratio, i.e. fold change, is calculated between conditions for each protein (FoldChanges.m). If there are multiple replicates, this module also calculates significance using a paired t-test. Fold changes are calculated using data centered on the Gaussian peaks identified by GaussBuild.m [11].

Predicting interactions (Interactions.m)

Quantifying co-fractionation with distance measures

PPI prediction begins by calculating the effective distance between the co-fractionation profiles of every pair of proteins. We use five distance measures to quantify different aspects of co-fractionation profile similarity. For all distance measures, a value close to zero signals high similarity between co-fractionation profiles. These five metrics are not exhaustive, but in practice we found there was little value in additional measures. For a pair of co-fractionation profiles ci, cj, these distance measures are:

One minus correlation coefficient, 1 − Rcorr: One minus the Pearson correlation coefficient between ci and cj.
Correlation p-value, pcorr: Corresponding p-value to 1 − Rcorr.
Euclidean distance, E: Euclidean distance between co-fractionation profiles ci and cj.
Peak location, P: Calculated as the difference, in fractions, between the locations of the maximum values of ci and cj.
Co-apex score, CA: Euclidean distance between the closest (μ, σ) pairs, where μ and σ are Gaussian parameters fitted to ci and cj. For example, if ci is fit by two Gaussians with (μ, σ) equal to (5, 1) and (45, 3), and cj is fit by one Gaussian with parameters (45, 2), then $CA = \sqrt{(45 - 45)^2 + (3 - 2)^2} = 1$. Thus chromatograms with at least one pair of similar Gaussian peaks will have a low (similar) Co-apex score.

Fig. 1 Pipeline overview. a. Co-fractionation profiles from known interactors, ribosomal proteins P61247 (black) and P62899 (grey). b. Co-fractionation profiles from non-interacting protein pair, Q6IN85 (black) and E9PGT1 (grey). c. Pipeline workflow. Raw data consists of co-fractionation profiles grouped by replicate and condition. In pre-processing, Gaussian mixture models are fit to each co-fractionation profile to obtain peak height, width, and center. If there are multiple replicates, the Alignment module adjusts profiles such that Gaussian peaks for the same protein occur in the same fraction across replicates. Changes in protein amounts between conditions, i.e. fold changes, are computed in the FoldChange module. Interactions between pairs of proteins are predicted by first calculating distance measures between each pair of proteins and feeding these into a Naive Bayes supervised learning classifier. Known (non-)interactions from a reference database, e.g. CORUM, are used for training. Finally, the list of predicted pairwise interactions is processed by an optimized ClusterONE algorithm [16] to predict protein complexes

Predicting interactions via similarity to reference

Combined with a reference database such as CORUM, these five distance measures can be used to predict novel PPIs. Our pipeline uses a machine learning classifier to do this [6, 15].
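As a concrete illustration of the distance measures defined above, the following Python sketch computes four of the five for a pair of profiles. It is an illustrative reading of the definitions rather than the PrInCE implementation: the function names are hypothetical, the peak-location difference is taken as an absolute value (an assumption), and the correlation p-value is omitted because it needs a t-distribution routine that the standard library does not provide.

```python
import math
from typing import Sequence, Tuple


def one_minus_pearson(ci: Sequence[float], cj: Sequence[float]) -> float:
    """1 - Rcorr; undefined (division by zero) if either profile is flat."""
    n = len(ci)
    mean_i, mean_j = sum(ci) / n, sum(cj) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(ci, cj))
    var_i = sum((a - mean_i) ** 2 for a in ci)
    var_j = sum((b - mean_j) ** 2 for b in cj)
    return 1.0 - cov / math.sqrt(var_i * var_j)


def euclidean(ci: Sequence[float], cj: Sequence[float]) -> float:
    """E: Euclidean distance between the two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, cj)))


def peak_location(ci: Sequence[float], cj: Sequence[float]) -> int:
    """P: distance, in fractions, between the maxima of the two profiles."""
    return abs(ci.index(max(ci)) - cj.index(max(cj)))


def co_apex(
    gauss_i: Sequence[Tuple[float, float]],
    gauss_j: Sequence[Tuple[float, float]],
) -> float:
    """CA: smallest Euclidean distance between any two fitted (mu, sigma) pairs."""
    return min(
        math.hypot(mu_i - mu_j, sd_i - sd_j)
        for mu_i, sd_i in gauss_i
        for mu_j, sd_j in gauss_j
    )
```

Calling `co_apex([(5, 1), (45, 3)], [(45, 2)])` reproduces the worked example from the text and returns 1.0.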
Specifically, we train a Naïve Bayes classifier, which evaluates how closely the distance measures for a candidate protein-protein pair resemble the distance measures observed for reference interactions. Distance measures are normalized such that their means are 0 and standard deviations 1. To reject uninformative distance measures, feature selection is performed prior to classification using a Fisher ratio > 2. The contribution of each feature to prediction performance depends on the dataset, although in general the most-informative (least-rejected) features are 1 − Rcorr, P, and CA. Distance measures are combined across replicates (but not conditions) for each protein-protein pair. Class labels are assigned based on the reference database. Reference protein pairs that occur in the same complex are gold standard interactions (interacting or “intra-complex” label). Proteins that are found in the reference database individually but do not occur within the same complex are labeled non-interacting (“inter-complex”) and are treated as false positive interactions [6]. Novel interactions are those where one or both members are not in the reference database.

The Naïve Bayes classifier returns the probability that putative protein pairs are interacting. Interaction probabilities are calculated separately for each experimental condition. We use a k-fold cross-validation scheme to avoid over-fitting. k = 15 is used as a tradeoff between computation time and classification accuracy. The classifier calculates an interaction probability for every protein pair.
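The normalization and feature-selection steps described above can be sketched in a few lines of Python. The text does not spell out the Fisher ratio, so this sketch assumes one common definition, the squared difference of class means divided by the sum of class variances, and uses population variances; the function names and the example feature names are hypothetical.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List, Sequence, Tuple


def zscore(values: Sequence[float]) -> List[float]:
    """Rescale one distance measure to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def fisher_ratio(intra: Sequence[float], inter: Sequence[float]) -> float:
    """Assumed definition: (mean_intra - mean_inter)^2 / (var_intra + var_inter)."""
    num = (mean(intra) - mean(inter)) ** 2
    den = pvariance(intra) + pvariance(inter)
    if den == 0:
        return float("inf") if num > 0 else 0.0
    return num / den


def select_features(
    features: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    threshold: float = 2.0,
) -> List[str]:
    """Keep measures whose Fisher ratio exceeds the threshold (2 in the text)."""
    return [
        name
        for name, (intra, inter) in features.items()
        if fisher_ratio(intra, inter) > threshold
    ]
```

For example, a measure whose intra-complex values cluster near 0.1 and whose inter-complex values cluster near 1.0 separates the classes well and is kept, whereas a measure with identical distributions in both classes has a ratio of 0 and is rejected.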
Self-interactions are not considered.

By applying a threshold to interaction probability returned by the classifier, protein pairs are separated into predicted interactions and predicted non-interactions. The probability threshold is chosen so that the resulting interaction list has a desired ratio of true positives (intra-complex) and false positives (inter-complex), quantified as precision TP/(TP + FP), where TP and FP are the number of true positives and false positives. The desired precision is chosen by the user.

Finally, we express the confidence of each predicted interaction by reformulating interaction probability as an interaction score. A predicted interaction’s score is equal to the precision of all predicted interactions with an interaction probability greater than or equal to it. Although interaction probability and score are largely equivalent, interaction score has two advantages. First, interaction score is more human readable, since the dynamic range of predicted interaction probabilities is often quite small. Second, the use of interaction score makes it trivial to generate interaction lists with a desired precision.

Predicting complexes (Complexes.m)

Complexes are predicted from the list of pairwise interactions using the ClusterONE algorithm [16]. The primary benefit of ClusterONE over other algorithms is that ClusterONE can predict the same protein in multiple complexes. Two parameters, p and dens, are optimized via grid search to produce the most reference-like complexes. p represents the number of unknown pairwise interactions, and dens is a threshold for the minimum density of a complex, where complex density is defined as the sum of weighted internal edges divided by N(N − 1)/2. Parameters are optimized to maximize either the matching ratio [16] or geometric accuracy [17] between predicted and reference complexes.
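The complex density used by the dens threshold can be computed directly from its definition above. The sketch below is illustrative only; the function name `complex_density` and the representation of edge weights as a dictionary keyed by unordered protein pairs are assumptions, not part of ClusterONE or PrInCE.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def complex_density(members: Iterable[str], weights: Dict[FrozenSet[str], float]) -> float:
    """Sum of weighted internal edges divided by N(N - 1)/2."""
    nodes = sorted(set(members))
    n = len(nodes)
    if n < 2:
        return 0.0
    # Only edges with both endpoints inside the complex count as internal;
    # pairs with no predicted interaction contribute a weight of zero.
    internal = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return internal / (n * (n - 1) / 2)
```

With three members, internal edge weights of 0.9 and 0.6, and one missing internal edge, the density is (0.9 + 0.6) / 3 = 0.5; an edge to a protein outside the complex does not change the result.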
Since there are possibly multiple interaction lists – a list of all predicted interactions as well as lists specific to each experimental condition – complexes can be built for each experimental condition separately, as well as an overall complex set from the aggregate interactome.

Test datasets

For this study, we tested PrInCE on four co-fractionation datasets, each composed of thousands of co-fractionation profiles (Table 1). D1, D2, and D4 were collected for recently published PCP-SILAC experiments (D1 [18], D2 [11], D4 [8]). D3 is the raw intensity values of the medium channel of D1, which we included as a surrogate for non-SILAC data, and label-free data more generally.

Gold standard references

We tested how the choice of gold standard reference affects the interactions predicted by PrInCE. First, we predicted interactions using subsets of CORUM drawn under two different schemes. The first scheme was designed to test the effects of the size of the reference set: a fraction of CORUM complexes were drawn randomly (10%, 20%, …, 100% of complexes) and interactions were predicted from dataset D1. The second scheme was designed to test whether interactions could be predicted consistently for different reference sets. To control the number of PPIs we performed a paired analysis, where we divided CORUM into two halves with equal numbers of gold standard PPIs in the data. These halves have no PPIs in common, and interactions were predicted from both halves using a single replicate of dataset D1. The first scheme was repeated 10 times, and the second scheme 50 times. Second, we predicted interactions from all four datasets using two additional gold standard references: IntAct [19] and hu.MAP [20].

Validation of PrInCE output

Using these four datasets, we performed computational validations of PrInCE output.
First, we tested whether our metric for ranking predicted interactions (interaction score) is consistent with other known evidence for protein interaction. To do so, we calculated the Spearman correlation coefficient between interaction score and these four other, independent measures of protein interaction: (i) whether protein pairs shared at least one Gene Ontology term within GO slim, a condensed version of the full GO ontology [21, 22]; (ii) the Pearson correlation coefficient of protein abundance across 30 human tissues, as taken from the Human Proteome Map [23]; (iii) whether protein pairs shared at least one subcellular localization annotation within the Human Protein Atlas Database [24]; and (iv) whether protein pairs shared a structurally resolved domain-domain interface, as identified by the database of three-dimensional interacting domains (3did) [25]. This validation was performed on predicted interaction lists with an interaction score of 0.50 or greater.

Second, we investigated whether predicted interactions were enriched over non-interactions for the same four measures (shared GO terms, tissue-dependent proteome abundance correlation, shared subcellular localization terms, and shared structurally resolved interfaces). For these interacting versus non-interacting enrichment analyses, we imposed a 10% breadth cutoff on all annotation terms, such that only annotation terms common to less than 10% of all proteins in the sample were used. As in [26], we also used the Jaccard index between protein pairs to quantify the extent of shared annotation terms across the entire Gene Ontology. This validation was performed on more stringent interaction lists (interaction score 0.75 or greater).

Third, we re-estimated the precision of our predicted interaction lists using an independent, previously described method [27]. Our definition of false positives as “inter-complex interactions” likely overestimates the number of false positives.
To quantify the magnitude of this overestimation, we added random interactions between non-interacting proteins within the reference set to bring the average expression correlation coefficient of all interacting proteins within the reference dataset to the same level as in the predicted interactome under investigation. To avoid training and testing on the same reference interactions, we randomly withheld 1/3 of CORUM complexes as a validation set, and used the remaining 2/3 as a training set to train the Naive Bayes classifier and predict interactions. The average Pearson correlation coefficient in tissue proteome abundance was calculated for the resulting predicted interactions, and it was compared to interactions from the 1/3 of CORUM withheld for testing. We bootstrapped this procedure 100 times to re-estimate the precision of the protein interaction network.

Table 1 Test dataset summary
Dataset | Conditions | Replicates | Fractions | Protein IDs | Interactions (0.50) | Interactions (0.75)
D1(a) | 2 | 3 | 55 | 3216 | 19,740 | 3416
D2(b) | 2 | 4 | 45–50 | 3438 | 7240 | 1447
D3 | 1 | 3 | 55 | 3198 | 5691 | 1160
D4(c) | 2 | 3 | 50 | 3844 | 16,430 | 2072
(a) [18], (b) [11], (c) [8]

Finally, following the network analysis of [26], we explored the topological properties of the predicted subgraphs by sequentially removing interactions under one of three schemes: (i) highest interaction score first, (ii) lowest interaction score first, or (iii) randomly. This analysis tests whether the interaction network consists of cores of tightly connected proteins linked by weaker or more spurious connections.
If this is the case, removing weakest interactions first will fragment the network, increasing the number of unconnected subgraphs and lowering their average size, whereas removing the highest scoring interactions first will not fragment the network.

Results

PrInCE uses a machine learning approach to predict conditional interactomes from co-fractionation data. Four datasets were used to benchmark PrInCE versus a previous pipeline [11], which showed that PrInCE can discover twice the number of predicted PPIs (Fig. 2a) in less than one tenth the time (Fig. 2b). This improved runtime also includes the complex-building module, Complexes.m, that was not present in the previous version.

Predicting PPIs (Interactions.m)

Predicting protein-protein interactions (PPIs) is one of the primary functions of this pipeline. Figure 3 illustrates this process using a subset of D1 that contains ribosomal and proteasomal proteins. Each potential interaction, i.e. protein pair, is first identified as either a reference interaction (white), reference non-interaction, i.e. proteins in the reference that do not interact (black), or unknown (grey; Fig. 3a). To score each potential interaction, the similarity of each pair of co-fractionation profiles is then quantified using the five distance measures (Additional file 1: Figure S1; see Methods for definitions). Using these as input to the machine learning classifier, an interaction probability for each protein pair is then calculated, expressing how well each protein pair resembles the collection of reference PPIs (Fig. 3b).

By applying a threshold to interaction probabilities outputted by the classifier, a final interaction list can be generated at a precision specified by the user. For example, the user can generate a more stringent list containing an estimated 75% true positives (white), or a more inclusive list with an estimated 50% true positives (cyan; Fig. 3c).
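The relationship between classifier probability, interaction score, and the final interaction list can be made concrete with a short Python sketch. It follows the definition given in Methods (a pair's score is the precision, TP/(TP + FP), of all pairs with probability greater than or equal to its own), but it is not the PrInCE code: the function names and the input format of (name, probability, label) tuples are assumptions, with label True for intra-complex pairs, False for inter-complex pairs, and None for novel pairs.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, float, Optional[bool]]


def interaction_scores(pairs: Sequence[Pair]) -> Dict[str, float]:
    """Map each pair to the precision of all pairs ranked at or above it."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    scores: Dict[str, float] = {}
    tp = fp = 0
    i = 0
    while i < len(ordered):
        j = i
        # Pairs tied on probability are counted together and share one score.
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            if ordered[j][2] is True:
                tp += 1
            elif ordered[j][2] is False:
                fp += 1
            j += 1
        # Precision is undefined until at least one reference pair is seen.
        precision = tp / (tp + fp) if tp + fp else float("nan")
        for name, _, _ in ordered[i:j]:
            scores[name] = precision
        i = j
    return scores


def predicted_interactions(scores: Dict[str, float], min_score: float) -> List[str]:
    """Return pairs whose interaction score meets the requested precision."""
    return [name for name, score in scores.items() if score >= min_score]
```

Novel pairs carry no label, so they inherit the precision estimated from the reference pairs ranked alongside them. Calling `predicted_interactions(scores, 0.75)` and `predicted_interactions(scores, 0.50)` then yields the stringent and the inclusive list, respectively.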
In general, there is a tradeoff between quantity and quality when predicting PPIs, meaning that more PPIs can be predicted at the cost of lowering the precision (Fig. 3d).

Fig. 2 Improvements to predictive power and run time. a. Number of interactions predicted at 50% (D1, D3, D4) or 41% precision (D2). For previously published datasets (D1, D2, D4), precision values and interaction numbers reflect published interaction lists (“Old”). Precision values for “New” output, i.e. from the current pipeline, were chosen to match the Old precision values. CORUM version 2012 was used as a gold standard reference. b. Run time for all modules on a non-performance PC using either the previously published version (“Old (2015)”, [11]) or the current version (“New”)

How does the number of quantified proteins affect the number of predicted interactions? To investigate, we analyzed random subsets of each dataset. Although there was considerable variability between datasets, in general there is an N² relationship between the number of proteins used as input to PrInCE and the number of interactions returned as output (Additional file 1: Figure S2). For all datasets, fewer than 500 quantified proteins resulted in fewer than 1000 interactions at 50% precision. It is important to note that while PrInCE is designed to predict reference-like PPIs, it would be useless if it didn’t also predict novel interactions. That is, PrInCE must predict interactions that are not simply contained in the reference database. Indeed, for the subset of proteins shown in Fig. 3 it can be seen that novel interactions are predicted (Fig. 3c, protein numbers 113 to 237). More broadly, all three datasets we used for benchmarking had thousands of novel PPIs predicted at 50% precision and hundreds to thousands of PPIs at 75% precision (Fig. 2a, Table 1).
In particular, at 50% precision 16,019 interactions were predicted from D1 that are not contained in the reference.

PrInCE uses a supervised learning algorithm to predict protein-protein interactions (PPIs), meaning it requires examples of both interacting and non-interacting proteins, i.e. a gold standard reference of protein complexes. We sought to investigate how characteristics of the reference impact the interactions predicted by PrInCE. Using subsets of CORUM to simulate the effects of a smaller reference, we see that the number of predicted interactions can vary widely when using relatively small references (Additional file 1: Figure S3A, B). This is likely due to misestimation of the precision of predicted interactions owing to increased effects of noise for smaller references, with spuriously high precision values leading to erroneously large numbers of predicted interactions. However, the predicted interactions that differ between these predicted interactomes tend to be lower scoring, with the highest scoring interactions predicted regardless of the reference (Additional file 1: Figure S3C). Further, entirely non-overlapping CORUM reference sets (Additional file 1: Figure S3D) lead to predicted interactions with >94% overlap, on average (average Jaccard index = 0.943 ± 0.2 st.d. between interaction lists predicted from entirely non-overlapping halves of CORUM; Additional file 1: Figure S3E). Therefore, for a given MS/MS dataset, PrInCE tends to predict the same, higher scoring interactions regardless of the reference, although small references can lead to errors in the number of predicted interactions. For large enough references, PrInCE predicts a stable set of interactions, even when gold standard references are incomplete.

Fig. 3 Predicting interactions (Interactions.m). a. Reference database. Subset of the CORUM reference database, including ribosomal and proteasomal proteins, expressed as a square pairwise matrix. Intra-complex interactions (white) are pairs of proteins from the same reference complex, inter-complex interactions (black) are pairs of proteins contained in the reference that are not co-complex members, and unknown/novel pairs (grey) have one or more protein not contained in the reference. Proteins are sorted according to their peak location. b. Interaction probability for each pair of proteins using the labels in (a) and distance measures. c. Square pairwise matrix of predicted interactions at two precision levels, 50% (0.50) and 75% (0.75). Interactions are predicted by applying a constant threshold to interaction score. d. Precision versus accumulated number of interactions. e. Overlap between three gold standard references (CORUM, IntAct, and hu.MAP). f. Predicted interactions using gold standard references from (e). 5527 interactions were commonly predicted from all three gold standards (intersection)

Second, we compared the performance of PrInCE trained on CORUM to PrInCE trained on two other gold standards: IntAct, a manually curated database of 1855 protein complexes [19], and hu.MAP, a database synthesized from three high throughput datasets totaling over 9000 mass spectrometry experiments [20]. Although these three gold standards are largely independent, with few common PPIs (average pairwise Jaccard index = 0.03; Fig. 3e), they lead to predicted interactions with a greater degree of overlap (average pairwise Jaccard index = 0.30; Fig. 3f; Additional file 1: Table S1). Across all four datasets, there is a pattern for CORUM and IntAct to predict more interactions than hu.MAP (Additional file 1: Figure S4A-C), possibly because CORUM and IntAct are hand-curated.
Indeed, gold standard chromatogram pairs given by CORUM and IntAct are more correlated than chromatogram pairs given by hu.MAP, suggesting that hu.MAP contains more false positives (Additional file 1: Figure S4D). However, the larger number of interactions predicted by IntAct may also be an artifact produced by IntAct’s relatively small size (130 human complexes) (Additional file 1: Figure S3A). Over all datasets, we find that interactions predicted from multiple gold standards are higher scoring (average interaction score = 0.72) than interactions only predicted using a single gold standard (average score = 0.62). Similarly to our analysis of CORUM subsets, this suggests a stable set of higher-scoring interactions are predicted regardless of the choice of reference (e.g. Fig. 3f).

Predicting protein complexes (Complexes.m)

Building on predicted PPIs, the second major output of PrInCE is protein complexes. Because buffer conditions in PCP-SILAC are relatively gentle on protein complexes, this module potentially identifies complexes that are unlikely to be identified by immunoprecipitation techniques. To do so, PPIs predicted by Interactions.m are weighted by their interaction score and input into the ClusterONE algorithm [16] to cluster individual PPIs into complexes.

Sorting co-fractionation profiles by their peak location (Fig. 4a) reveals the tendency for groups of proteins to co-elute (Fig. 4b). After analysis with PrInCE, some groups are predicted to be co-complex members. Figure 4c shows an example protein complex predicted by Complexes.m. The predicted complex (orange and purple) largely overlaps with the 20S proteasome contained in the CORUM reference database (black and purple). One member (P28065, orange) was predicted to be participating in the complex. Notably, while P28065 is not in the CORUM database, it is annotated as a proteasomal protein.
Thus, using co-elution as the only source of evidence, PrInCE predicted this known co-complex member of the 20S proteasome even though it was missing from the reference.

PrInCE is also capable of predicting entirely novel protein complexes. For example, a four-member complex was predicted in dataset D1, of which no proteins were in CORUM (Fig. 4d). Reassuringly, these four proteins (P61923, P53621, P48444, O14579) are all subunits of the coatomer protein complex, a known complex that, while not present in the CORUM database, has substantial low-throughput [28–30] and high-throughput evidence [6, 8, 15] supporting its existence. For all complexes predicted by the pipeline (e.g. Fig. 4e; D1, 71 complexes, median size 14), each complex predicted by ClusterONE is matched to a reference complex when possible. Of the 71 protein complexes predicted for D1, 20 were entirely novel, i.e. had no matching reference complex. In general, PrInCE predicts both entirely novel protein complexes and those that recover existing complexes while predicting novel members. The four datasets analyzed in this study produced a total of 291 protein complexes, of which 169 were at least partially matched to a CORUM complex. On average, 31% of complex subunits were recovered from known complexes while the remaining were novel subunits (Fig. 4f).

Fig. 4 Predicting complexes (Complexes.m). a. 2311 co-fractionation profiles from a single replicate of D1, sorted by peak location. Fourteen 20S proteasomal proteins group together (protein numbers 851–864). b. Square connection matrix for same proteins as (a). Colour shows interaction score for all 19,740 interactions with score greater than 0.50. Inset: Close up of the 14 × 14 connection matrix for 20S proteasomal members plus other proteins (protein numbers 851–865). c. Co-fractionation profiles for the 14 proteins from the inset in (b), which also correspond to a predicted complex. Profiles of complex members (left) all have a similar shape. When compared to its closest match in CORUM, the 20S proteasome, this predicted complex had 13 overlapping proteins (purple), as well as one protein in the predicted complex that was not in the 20S proteasome (orange). Additionally, there was a single protein from the 20S proteasome that was not in the predicted complex (black). d. Example predicted complex with no match in the CORUM database. e. Force diagrams for all 71 predicted complexes from 19,740 interactions in D1. Same colouring scheme as (c and d). Proteins in known complexes that were not predicted (i.e. Reference-only, black) are omitted for clarity. f. Predicted complexes are composed of known (“recovered”) subunits and novel subunits. Data is from all four datasets. The size of each predicted complex is the sum of novel and recovered members.

Validation of predicted interactions and complexes

No method for determining protein interactions is perfect, and higher-throughput methods tend to recover noise along with biologically meaningful signal. We estimate how much noise is in the final interaction list by comparing it to a reference of known interactions, e.g. CORUM, and quantifying the signal to noise ratio in terms of precision, i.e. TP/(TP + FP). In order to validate that we are separating signal from noise in a biologically meaningful way, we sought to establish the biological significance of interaction lists generated by PrInCE using independent evidence. First, we wanted to confirm that the measure we use to rank the confidence of predicted interactions, interaction score, is a useful way to identify which interactions are more likely to be true positives. To do so, we tested whether proteins in high score PPIs are more likely to share annotation terms than proteins in low score interactions. Indeed, for every GO-slim annotation category, as interaction score increased, so did the proportion of interactions sharing at least one annotation term (Fig. 5a, Additional file 1: Table S2). Similarly, interacting protein pairs were more likely to be coexpressed across human tissues (Pearson correlation coefficient ≥ 0.75) (Fig. 5b), share at least one subcellular localization term (Additional file 1: Figure S5A), and have a structurally resolved domain-domain interaction (Additional file 1: Figure S5B). Therefore, the ranking system used by this pipeline is biologically meaningful, as demonstrated by independent sources of evidence.

Fig. 5 Predicted interactions are enriched for biologically significant attributes, and the degree of enrichment reflects interaction score. a. Fraction of interacting proteins with at least one shared GO-slim term as a function of interaction score and ontological domain. Triangle: biological process. Square: cellular component. Circle: molecular function. b. Tissue proteome abundance [23] correlation (Pearson correlation coefficient) as a function of interaction score. c. Interacting proteins in the apoptosis dataset are enriched for shared GO-slim terms relative to non-interacting protein pairs at diverse GO term breadths. d. Distribution of tissue proteome abundance correlations (Pearson correlation coefficients) for interacting and non-interacting protein pairs in D1.

How do predicted interactions differ from predicted non-interactions? A well-performing pipeline returns predicted classes that are, at least by some measures, cleanly separated. To assess this, we first compared Jaccard indices [26], which measure the degree to which protein pairs share annotation terms, between non-interacting protein pairs (cyan), medium-confidence predictions (orange), and high-confidence predictions (purple; Additional file 1: Figures S5C, S6A-C).
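As an illustration of the Jaccard index mentioned above, the following minimal Python sketch computes it for pairs of annotation-term sets. It is not taken from the PrInCE code base (which is written in Matlab); the function name `jaccard_index`, the protein names, and the GO identifiers are hypothetical and chosen only for the example.

```python
def jaccard_index(terms_a, terms_b):
    """Jaccard index of two annotation-term collections: |A & B| / |A | B|."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        # Neither protein has any annotation; treat as no shared terms.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical annotations for three proteins (illustrative only).
annotations = {
    "protA": {"GO:0000502", "GO:0005634", "GO:0006511"},
    "protB": {"GO:0000502", "GO:0006511"},
    "protC": {"GO:0005739"},
}

# protA and protB share two of three distinct terms; protA and protC share none.
shared_ab = jaccard_index(annotations["protA"], annotations["protB"])
shared_ac = jaccard_index(annotations["protA"], annotations["protC"])
print(shared_ab, shared_ac)
```

A value near 1 means the two proteins carry nearly identical annotation sets, and a value of 0 means they share no terms. This is why a shift towards larger indices among predicted interactions can be read as independent support.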
Compared to non-interacting proteins, high-confidence interactions show a bias towards larger Jaccard indices, as do medium-confidence interactions, although to a lesser degree.

We next used enrichment values to quantify the tendency for predicted interacting proteins to share annotation terms. In general, interacting proteins were about 10× more likely to share GO annotation terms than non-interacting proteins (Fig. 5c, Additional file 1: Figure S6D-F). Moreover, enrichment was relatively independent of the breadth of the annotation terms, where breadth describes the number of annotated proteins per annotation term [31]. We found that interacting proteins were significantly enriched for nearly all validation measures used here (Table 2). Finally, comparing how well tissue-dependent protein abundance correlates between protein pairs [23] shows that protein abundance is more correlated between predicted interacting protein pairs than between predicted non-interactions (Fig. 5d, Additional file 1: Figure S6G-J). Therefore, predicted interactions returned by PrInCE are more enriched than predicted non-interactions for external evidence of interaction. Importantly, this external evidence is independent of the evidence used within the pipeline. The same analysis was repeated to compare interactions predicted by PrInCE to previously published interaction lists [8, 11]. To do so, we matched the number of interactions in the published lists by taking that number of top-ranked interactions predicted by PrInCE. In 15 out of 18 comparisons of enrichment values, interactions predicted by PrInCE were measured to be more enriched for external evidence of interaction than previously published lists (Additional file 1: Table S3).

Calculating the precision of the interactions predicted by PrInCE is crucial for minimizing the number of false positives. To estimate precision, both the numbers of true and false positives must be calculated.
The reference database provides a list of true positive interactions (intra-complex). However, since no comparable database of false positive interactions exists, we make the assumption that pairs of interacting proteins which are both present in the reference, but not reported by the reference to interact, are false positives (inter-complex). Several of these false positives are likely to be true interactions that simply have not been previously discovered and thus not included in the reference, meaning that PrInCE likely underestimates the true precision of the interactions. Using the method outlined in [27] to re-estimate precision, we found that, indeed, the stated precision is a conservative estimate of the confidence of the predicted interaction list (Fig. 6).

Table 2 Interacting versus non-interacting terms for shared annotation terms (GO, Subcellular Localization), tissue-dependent proteome abundance, and shared structurally resolved binding domains

Dataset   GO CC     GO BP     GO MF     Proteome Abundance   Subcellular Localization   Structurally Resolved
D1        1.2       19.6      13.6      8.7                  2.7                        13
          0.13      <1e-300   <1e-300   <1e-300              6e-21                      2e-275
D2        1.94      12.2      10.2      7.7                  3.2                        14
          2e-8      <1e-300   2e-266    4e-264               2e-8                       4e-267
D3        2.15      16.8      13.7      12                   2.5                        15
          1e-4      <1e-300   1e-288    6e-281               3e-4                       1e-135
D4        3.13      16.1      13.5      10                   2.4                        11
          1e-51     <1e-300   <1e-300   <1e-300              2e-6                       <1e-300

Fold values (top numbers) and hypergeometric test p-values (bottom numbers). Annotation terms were first filtered using a 10% breadth cutoff.

Finally, we explored the topological properties of the predicted network, i.e. how the network is connected. Specifically, as is postulated for other PPI networks returned by high-throughput techniques [26], we validated the hypothesis that predicted networks should consist of small subsets of highly connected proteins, which are more loosely linked to each other by relatively weak connections.
This connectivity structure denotes well-defined subgraphs connected by weaker signaling and/or spurious false positive interactions. To analyze the topology, we used an approach described by [26], wherein interactions are removed sequentially from the network: removing the lowest confidence interactions first should fragment the network by revealing islands of isolated subgraphs; removing the highest confidence interactions first should lead to no fragmentation. Indeed, removing low confidence interactions first produced a network with a greater number (Additional file 1: Figure S7A, purple) of relatively smaller subgraphs (Additional file 1: Figure S7B), i.e. fragmentation. Removing interactions in this order rapidly fragmented the largest subgraph (Additional file 1: Figure S7C). Removing high-confidence interactions first did not have this effect (Additional file 1: Figure S7, orange). Similar results were obtained for other datasets (Additional file 1: Figure S7E-P).

Discussion

A machine learning classifier provides improvements over simply sorting protein-protein pairs by how similarly they co-elute, as it provides an automated method for combining multiple measures of co-elution. We chose the Naive Bayes classifier because it is computationally inexpensive and surprisingly powerful given its relative simplicity. Indeed, when comparing the Naive Bayes (“fitcnb”, Matlab) to a Support Vector Machine classifier (“fitcsvm”, Matlab) we found the Naive Bayes predicted similar or greater numbers of interactions at a given precision level, depending on the dataset (data not shown).

One limitation of our technique is that it requires a suitable gold standard reference of known protein complexes. For mammalian datasets we recommend using the CORUM database, as it is large enough, entirely hand-curated, and accurately describes co-elution data. For yeast or E. coli datasets we recommend the IntAct database.
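To make the role of a complex-based reference concrete, the sketch below shows one way the intra-complex and inter-complex labels described in the validation section could be derived and used to estimate precision as TP/(TP + FP). It is an illustrative Python sketch under stated assumptions, not the PrInCE implementation (which is in Matlab); the function names `reference_pairs` and `estimate_precision` and the toy complexes are hypothetical.

```python
from itertools import combinations


def reference_pairs(complexes):
    """Return (intra_pairs, proteins) for a list of reference complexes.

    intra_pairs holds every pair of proteins that share at least one complex,
    stored as sorted tuples; proteins holds every protein in the reference.
    """
    intra_pairs = set()
    proteins = set()
    for members in complexes:
        unique_members = sorted(set(members))
        proteins.update(unique_members)
        intra_pairs.update(combinations(unique_members, 2))
    return intra_pairs, proteins


def estimate_precision(predicted, complexes):
    """Estimate precision = TP / (TP + FP) for predicted protein pairs.

    TP: predicted pair is intra-complex in the reference.
    FP: both proteins are in the reference but never share a complex
        (inter-complex). Pairs involving a protein absent from the
        reference cannot be labelled and are skipped.
    Returns (precision, tp, fp); precision is None if nothing was labelled.
    """
    intra_pairs, proteins = reference_pairs(complexes)
    tp = fp = 0
    for a, b in predicted:
        if a == b:
            continue  # ignore self-pairs
        if tuple(sorted((a, b))) in intra_pairs:
            tp += 1
        elif a in proteins and b in proteins:
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else None
    return precision, tp, fp


# Toy reference with two hypothetical complexes.
toy_complexes = [["A", "B", "C"], ["D", "E"]]
# Two intra-complex pairs, one inter-complex pair, one unlabelled pair.
toy_predictions = [("A", "B"), ("C", "B"), ("A", "D"), ("A", "X")]
print(estimate_precision(toy_predictions, toy_complexes))
```

In this scheme negative labels only arise because complex membership implies which reference proteins do not share a complex; a list of binary pairs alone carries no such information.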
Because false positives are defined as inter-complex pairs that are not also intra-complex pairs, gold standard databases of binary protein pairs, such as STRING, are not suitable. If neither CORUM nor IntAct is suitable, in general we recommend that the reference be large enough (>10,000 gold standard protein pairs in the MS/MS dataset, Additional file 1: Figure S3B) and accurately describe the dataset, measured, for example, by high correlation (Pearson R > 0.4) between gold standard co-elution curves (Additional file 1: Figure S4D). Since protein complexes are variable, not all known interactions will occur at any one time or under one set of biological conditions. Therefore, the suitability of a reference database, determined by the fraction of gold standard interactions that were indeed physically interacting in the sample, is crucial. Failure of the gold standard reference to accurately describe the data will result in poor classification performance and, ultimately, a short or empty list of predicted interactions (e.g. hu.MAP, Additional file 1: Figure S4A-C).

Early versions of this pipeline were designed for the analysis of (PCP-) SILAC datasets. A major strength of SILAC experiments is that they allow conditional experiments to be performed simultaneously, minimizing experimental variability between conditions. However, the analysis here of dataset D3, a surrogate for a non-SILAC labelled dataset, demonstrates that PrInCE is not limited to analyzing SILAC data. In fact, PrInCE can analyze any dataset with co-fractionation profiles for single proteins where co-fractionation is meaningful evidence of co-complex membership, and for which there exists a suitable reference.

Conclusions

PrInCE provides a powerful tool for predicting interactomes from co-fractionation experiments. It greatly simplifies the task of analyzing co-fractionation datasets, requiring at most installation and simple command line tools.
Building on preliminary versions of a bioinformatics treatment [8, 11], PrInCE predicts nearly twice as many protein interactions at the same stringency with a 97% decrease in run time (Fig. 2). PrInCE also offers increased functionality over previous versions, providing a module for automated, optimized prediction of protein complexes using the ClusterONE algorithm [16]. Importantly, PrInCE is available as a standalone executable program, meaning access to Matlab is not required. Finally, at the same number of interactions, interactions predicted by PrInCE are more supported by external, validating evidence than previous versions, as quantified by a greater enrichment of shared annotation terms (Additional file 1: Table S3).

Fig. 6 PrInCE precision of the predicted interaction list is a conservative estimate of the number of false positives. Predicted interaction lists were generated for dataset D1 at multiple user-defined precision levels (PrInCE precision), and their precision was re-estimated (Mrowka precision [27]). PrInCE lists were generated using a random 2/3 subset of the CORUM reference and precision was re-estimated using the remaining 1/3. Median values from 100 iterations are shown, and bars show the interquartile range.

Additional file

Additional file 1: Supplementary Figures and Tables. (DOCX 4945 kb)

Abbreviations

AP-MS: Affinity purification mass spectrometry; PCP: Protein correlation profiling; PPI: Protein-protein interaction; PrInCE: Predicting interactomes from co-elution; SILAC: Stable isotope labelling by amino acids in cell culture; Y2H: Yeast two-hybrid

Acknowledgements

We thank A. Prudova and A. McAfee for critical suggestions. M.A.S. is supported by a CIHR Frederick Banting and Charles Best Canada Graduate Scholarship, a UBC Four Year Fellowship, and a Vancouver Coastal Health-CIHR-UBC MD/PhD Studentship Award.

Funding

This work was supported by funding from Genome Canada and Genome British Columbia (project 214PRO) and the Canadian Institutes of Health Research (MOP77688) to L.J.F. Funding bodies played no role in the design or conclusions of this work.

Availability of data and materials

The three published co-elution datasets analyzed during the current study are available from doi: 10.1038/nmeth.2131, doi: 10.1016/j.jprot.2014.10.024, and doi: 10.15252/msb.20167067. The PrInCE analysis pipeline is available at [URL missing].

Authors’ contributions

RGS analyzed and interpreted the data, wrote the PrInCE software, and drafted and revised the manuscript. MS performed the validation analysis and critically revised the manuscript. NS made significant contributions to conception and design of the study and acquired two of the datasets analyzed here. LF made substantial contributions to conception and design of the study and critically revised the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

1 Michael Smith Laboratories, University of British Columbia, Vancouver V6T 1Z4, Canada. 2 Doherty Institute, University of Melbourne, Melbourne, Australia. 3 Department of Biochemistry, University of British Columbia, Vancouver V6T 1Z3, Canada.

Received: 2 June 2017 Accepted: 9 October 2017

References

1. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, et al. Large-scale identification of protein–protein interaction of Escherichia coli K-12. Genome Res. 2006;16:686–91.
2. Gavin A-C, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440:631–6.
3. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440:637–43.
4. Kühner S, van Noort V, Betts MJ, Leo-Macias A, Batisse C, Rode M, et al. Proteome organization in a genome-reduced bacterium. Science. 2009;326:1235–40.
5. Dunham WH, Mullin M, Gingras A-C. Affinity-purification coupled to mass spectrometry: basic principles and strategies. Proteomics. 2012;12:1576–90.
6. Havugimana PC, Hart GT, Nepusz T, Yang H, Turinsky AL, Li Z, et al. A census of human soluble protein complexes. Cell. 2012;150:1068–81.
7. Heide H, Bleier L, Steger M, Ackermann J, Dröse S, Schwamb B, et al. Complexome profiling identifies TMEM126B as a component of the mitochondrial complex I assembly complex. Cell Metab. 2012;16:538–49.
8. Kristensen AR, Gsponer J, Foster LJ. A high-throughput approach for measuring temporal changes in the interactome. Nat Methods. 2012;9:907–9.
9. Brückner A, Polge C, Lentze N, Auerbach D, Schlattner U. Yeast two-hybrid, a powerful tool for systems biology. Int J Mol Sci. 2009;10:2763–88.
10. Choi H, Larsen B, Lin Z-Y, Breitkreutz A, Mellacheruvu D, Fermin D, et al. SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat Methods. 2011;8:70–3.
11. Scott NE, Brown LM, Kristensen AR, Foster LJ. Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments. J Proteome. 2015;118:112–29.
12. Wan C, Liu J, Fong V, Lugowski A, Stoilova S, Bethune-Waddell D, et al. ComplexQuant: high-throughput computational pipeline for the global quantitative analysis of endogenous soluble protein complexes using high resolution protein HPLC and precision label-free LC/MS/MS. J Proteome. 2013;81:102–11.
13. PrInCE: Bioinformatics pipeline for predicting protein interactomes via co-elution. Matlab. Foster lab; 2016. 26 May 2017.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, et al. CORUM: the comprehensive resource of mammalian protein complexes - 2009. Nucleic Acids Res. 2010;38(suppl 1):D497–501.
15. Wan C, Borgeson B, Phanse S, Tu F, Drew K, Clark G, et al. Panorama of ancient metazoan macromolecular complexes. Nature. 2015;525:339–44.
16. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012;9:471–2.
17. Brohee S, Van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinf. 2006;7:488.
18. Scott NE, Rogers LD, Prudova A, Brown NF, Fortelny N, Overall CM, et al. Interactome disassembly during apoptosis occurs independent of caspase cleavage. Mol Syst Biol. 2017;13:906.
19. Meldal BH, Forner-Martinez O, Costanzo MC, Dana J, Demeter J, Dumousseau M, et al. The complex portal - an encyclopaedia of macromolecular complexes. Nucleic Acids Res. 2014;43:D479–84.
20. Drew K, Lee C, Huizar RL, Tu F, Borgeson B, McWhite CD, et al. Integration of over 9,000 mass spectrometry experiments builds a global map of human protein complexes. Mol Syst Biol. 2017;13:932.
21. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
22. Consortium GO. Gene ontology consortium: going forward. Nucleic Acids Res. 2015;43:D1049–56.
23. Kim M-S, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature. 2014;509:575–81.
24. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Tissue-based map of the human proteome. Science. 2015;347:1260419.
25. Mosca R, Céol A, Stein A, Olivella R, Aloy P. 3did: a catalog of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 2013;42:D374–9.
26. Hein MY, Hubner NC, Poser I, Cox J, Nagaraj N, Toyoda Y, et al. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell. 2015;163:712–23.
27. Mrowka R, Patzak A, Herzel H. Is there a bias in proteome research? Genome Res. 2001;11:1971–3.
28. Eugster A, Frigerio G, Dale M, Duden R. COP I domains required for coatomer integrity, and novel interactions with ARF and ARF-GAP. EMBO J. 2000;19:3905–17.
29. Faulstich D, Auerbach S, Orci L, Ravazzola M, Wegehingel S, Lottspeich F, et al. Architecture of coatomer: molecular characterization of d-COP and protein interactions within the complex. J Cell Biol. 1996;135:53–62.
30. Harter C, Wieland FT. A single binding site for dilysine retrieval motifs and p23 within the γ subunit of coatomer. Proc Natl Acad Sci. 1998;95:11649–54.
31. Simonis N, Rual J-F, Carvunis A-R, Tasan M, Lemmens I, Hirozane-Kishikawa T, et al. Empirically controlled mapping of the Caenorhabditis elegans protein-protein interactome network. Nat Methods. 2009;6:47–54.

