UBC Faculty Research and Publications

Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). Gillis, Jesse; Pavlidis, Paul. Apr 22, 2013.

PROCEEDINGS - Open Access

Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA)

Jesse Gillis1, Paul Pavlidis2*

From Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. Vienna, Austria, 15-16 July 2011.

Abstract

The assignment of gene function remains a difficult but important task in computational biology. The establishment of the first Critical Assessment of Functional Annotation (CAFA) was aimed at increasing progress in the field. We present an independent analysis of the results of CAFA, aimed at identifying challenges in assessment and at understanding trends in prediction performance. We found that well-accepted methods based on sequence similarity (i.e., BLAST) have a dominant effect. Many of the most informative predictions turned out to be either recovering existing knowledge about sequence similarity or were "post-dictions" already documented in the literature. These results indicate that deep challenges remain in even defining the task of function assignment, with a particular difficulty posed by the problem of defining function in a way that is not dependent on either flawed gold standards or the input data itself. In particular, we suggest that using the Gene Ontology (or other similar systematizations of function) as a gold standard is unlikely to be the way forward.

Introduction

In computational biology, critical assessment of algorithms plays an important role in keeping the field honest about utility by ensuring progress is measurable and in a direction that is helpful in solving biological problems. The recognition of the need for assessment dates back to the first Critical Assessment of Structure Prediction (CASP), which aimed to determine the state of the art in protein structure prediction [1].
CASP's ongoing assessment has proven highly successful in characterizing progress, and 20 years later CASP largely defines the field of protein structure prediction. CASP has a number of features that are important to its success, some of which were built in from the start and others which were the result of lessons learned along the way. Among those features are forcing participants to make true predictions rather than blinded post-dictions (limiting over-training), the use of fully representative evaluation metrics (limiting artifactual performance), and the recognition of sub-problems that are treated as distinct tasks (allowing for different strategies, e.g. "template-free" vs. "template-based" prediction). In addition to these inherent lessons, CASP has taught the field that progress in "template-free" (ab initio) prediction, while substantial, is slower than prediction that can directly leverage existing protein structures thanks to sequence similarity. CASP's results have also shown that the aggregation of algorithms is an effective way to reduce the effect of "noisy" low-quality predictions [2].

* Correspondence: paul@chibi.ubc.ca
2 Centre for High-Throughput Biology and Department of Psychiatry, University of British Columbia, 177 Michael Smith Laboratories, 2185 East Mall, Vancouver, Canada, V6T 1Z4. Full list of author information is available at the end of the article.

Gillis and Pavlidis BMC Bioinformatics 2013, 14(Suppl 3):S15. http://www.biomedcentral.com/1471-2105/14/S3/S15
© 2013 Gillis and Pavlidis; licensee BioMed Central Ltd.

CASP has inspired numerous other critical assessments, including the topic of this paper, the first Critical Assessment of Functional Annotation (CAFA). CAFA was aimed at assessing the ability of computational
methods to predict gene function, starting from protein sequences. The general approach for predicting gene function is often referred to as "guilt by association" (GBA) [3]. In a computational GBA framework, the input data takes the form of similarities among genes (sometimes this is treated as a graph or gene network) and some initial functional labeling (often based on the Gene Ontology or a related scheme [4]). Genes which are in some sense "near" genes with a given label might be proposed to share that label (function) with them. While a common measure of "guilt" uses sequence similarity, numerous other data types have been used alone or in combination, such as coexpression, protein interactions, patterns of conservation and genetic interactions. All of these have been shown to be predictive of gene function to varying degrees when tested by cross-validation, and numerous algorithms of varying levels of sophistication have been proposed. However, independent assessment of computational GBA is not routine. CAFA represented an attempt to fill this gap. This is a very challenging goal due to problems in defining gold standards and evaluation metrics [5].

In the first CAFA, in which we were only observers, approximately 47,000 protein sequences from UniProt were selected as targets. These sequences were chosen because they lacked high-quality evidence codes ("EXP", "TAS" or "IC") on any of their Gene Ontology (GO) annotations (if they had any), and thus were considered to have "unknown function" (http://biofunctionprediction.org/node/262). Participants were asked to assign GO terms to the proteins, with no more than 1000 terms assigned to any given target.
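As an illustration of the neighbor-voting flavor of GBA described above, the following is a minimal sketch; the network, gene names, and the label "GO:X" are invented for the example and are not CAFA data.

```python
# Minimal neighbor-voting "guilt by association" sketch (toy data).
def gba_scores(network, labeled, label):
    """Score each unlabeled gene by the fraction of its neighbors
    that carry `label`: a crude quantification of "guilt"."""
    scores = {}
    for gene, neighbors in network.items():
        if label in labeled.get(gene, set()):
            continue  # gene is already annotated with this label
        if not neighbors:
            continue  # no evidence either way
        hits = sum(1 for n in neighbors if label in labeled.get(n, set()))
        scores[gene] = hits / len(neighbors)
    return scores

# Toy network; in practice the edges might come from sequence similarity,
# coexpression, or protein interactions.
net = {"g1": {"g2", "g3"}, "g2": {"g1", "g3"}, "g3": {"g1", "g2"}, "g4": {"g3"}}
labels = {"g1": {"GO:X"}, "g2": {"GO:X"}}
print(gba_scores(net, labels, "GO:X"))  # -> {'g3': 1.0, 'g4': 0.0}
```

Published methods differ mainly in how the similarity graph is built and how the votes are weighted, but this captures the shared logic.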
Importantly, the GO terms that would eventually be used for assessment were not known in advance, and participants were left to decide which data to use as input to their algorithms, without restrictions. The large number of targets in CAFA and the unpredictable nature of which sequences would be used for assessment in the end ensured that it would be difficult for participants to "game" the system.

After participants submitted their predictions, a six-month waiting period ensued. At the end of this period, UniProt/GOA was checked for updates to the GO annotations of the targets. New GO annotations which had "EXP", "TAS" or "IC" evidence codes were treated as the gold standard "truth" to which each participant's predictions would be compared. Such new GO annotations were available for ~750 sequences. As set out by the CAFA rules, performance for a submission was to be measured for each target by comparing the GO annotation predicted by the algorithm with the truth. To capture the idea of "near misses", novel measures of precision and recall were devised by the organizers using the number of common (up-propagated) GO terms shared by the truth and prediction.

CAFA was structured differently from an earlier assessment that had similar motivations, Mousefunc [6]. Mousefunc provided participants with a blinded set of prepared data (gene networks and the like), and an honor system was used to prevent the nine participating groups from reverse-engineering the coding. In addition to classifying a training set of genes against the target GO terms, participants made predictions for a held-out 10% of the genes, which were unmasked for the assessment. The conclusions of the Mousefunc assessors were that in general the methods performed fairly similarly (combined yielding 41% precision at 20% recall), with one method standing out as the "winner" by a modest margin; and that "molecular function" terms were easier to predict than those for the "biological process" aspect of GO.
By far the most informative data sets were those based on protein sequence (while sequences were not directly used, two of the data sets were founded on protein sequence patterns). A set of 36 novel predictions (that is, high-scoring apparent "false positives") were evaluated by hand and found to have literature support at an encouraging rate of around 60% [6]. Recently, we reanalyzed the Mousefunc data and many other gene networks and showed that much of the learnability of gene function in cross-validation is explained by node degree effects (to a first approximation, assigning all functions to "hubs" is a surprisingly effective strategy) [7]. We hypothesized that the problem of prediction specificity would play a similar role in CAFA results.

In this paper, we report the results of our independent assessment of a portion of the CAFA results made available to us. Our intention was to assist the CAFA organizers in making the most of the assessment, and to gain insight into how gene function prediction performs "in the wild". Our results suggest that most prediction algorithms struggle to beat BLAST. An evaluation based on an information-based metric suggests that informative predictions are made at a rate of at best 15%, and that many of the informative predictions made could be inferred from the pre-existing literature and GO annotations. In agreement with our previous results on multifunctionality, informative predictions tended to be made to GO groups containing highly multifunctional genes. We find a comparison of the CASP and CAFA tasks to be informative, in terms of some of the lessons learned through CASP and how they might be applied to CAFA in the future.
However, the evidence suggests that many of the challenges are fundamentally due to reliance on an imperfect gold standard.

Methods

All data and methods we describe for CAFA were based on information that was publicly available to non-participants at the time of the assessment or shortly after the summer 2011 workshop. As noted in the discussion, there are a few points of variance in the assessment that was finally done by the organizers. Our methods reflect our understanding of the state of affairs as they would have appeared to a participant in the assessment.

Data

The CAFA participants were contacted by the organizers at our behest, asking them if they would be willing to provide their predictions to us for analysis. It was made clear that we would be focusing on overall patterns and not on the performance of individual identified algorithms. Positive responses were received from 16 groups (no negative responses were obtained; the remainder were apparently non-responses). This yielded a set of results for 16 out of 56 algorithms that were entered in the assessment; for the sake of discussion we assume this subset is representative. In accordance with our agreement with the participants, in this paper we do not identify the specific algorithms or the participants. We were also not provided with any information about the input data used by the algorithms other than the target sequences that were provided. We note that it was straightforward for participants to determine existing annotations for the sequences, if they desired, so we assume this information was available for the purposes of our evaluation.

The format of the data we were given was as follows, for the Molecular Function (MF) and Biological Process (BP) categories separately.
For each of up to 1000 GO terms per target, a score in the interval (0.00, 1.00] was provided, where 1.00 was to indicate the strongest prediction (some algorithms only provided binary scores). Non-predictions were indicated by missing values (that is, the value 0.00 was not allowed). Not all algorithms made predictions for all targets, and not all algorithms made the maximum of 1000 predictions for all targets. One submission did not provide predictions in the BP category. We only received predictions made for the evaluation targets and were thus unable to examine the results for the other ~46,000 targets.

In addition to the official entries, the organizers provided results of using BLAST [8] as a predictive method (assigning GO terms based on the top BLAST hit in the "nr" database, using default settings) and results from a BLAST-based method, GOtcha [9]. GOtcha takes into account information about the structure of the Gene Ontology in combining information from multiple high-scoring BLAST hits. Another data set was provided to us in which sequences were "predicted" to have GO terms according to the proportion of sequences which had a given term in the existing GO annotations. Thus all proteins were assigned the terms "Biological Process" and "Molecular Function" with weights 1.0, and terms "lower down" in the GO hierarchy with decreasing weights until 1000 GO terms were reached. We refer to this as the "Prevalence" data set. For reasons to be described, the Prevalence data set is best considered a control, not a real entry; thus our evaluation focuses on 18 entries including BLAST and GOtcha. To create an aggregate classifier, the results from all 18 algorithms were combined using the average normalized rank of the scores provided for each algorithm. We chose this method of aggregation because it is as naïve an aggregation as we could imagine, involving no prior expectation as to performance or training based on data.
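The average-normalized-rank aggregation described above can be sketched as follows; the scores and term names are invented, and tie handling is omitted for brevity.

```python
# Naive rank-aggregation sketch (toy data; ties not handled).
def normalized_ranks(scores):
    """Map {item: score} to ranks scaled to (0, 1], best score -> 1.0."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {item: (i + 1) / n for i, item in enumerate(ordered)}

def aggregate(score_sets):
    """Mean normalized rank per item across algorithms; items missing
    from an algorithm simply contribute nothing (a simplification)."""
    totals, counts = {}, {}
    for scores in score_sets:
        for item, r in normalized_ranks(scores).items():
            totals[item] = totals.get(item, 0.0) + r
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}

alg_a = {"GO:1": 0.9, "GO:2": 0.2, "GO:3": 0.5}
alg_b = {"GO:1": 0.6, "GO:2": 0.1, "GO:3": 0.8}
agg = aggregate([alg_a, alg_b])
assert agg["GO:1"] > agg["GO:2"]  # both algorithms rank GO:1 above GO:2
```

Because it uses only ranks, the aggregation needs no calibration of the heterogeneous score scales the submitters used.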
We stress that it should not be regarded as a competitor algorithm, since it merely averages the other algorithms' prediction data after the fact.

The annotations treated as a gold standard for evaluation involve 558 sequences in the biological process (BP) evaluation category, and 454 for molecular function (MF), from 10 different taxa. These sequences were selected for evaluation from the starting set of 46,997 because they had received GO annotations with "strong" evidence codes ("EXP", "TAS" or "IC") during the six-month waiting period. After propagating these terms upwards in the GO hierarchy, a total of 2,457 terms (out of 18,982 possible) were annotated at least once in the BP category and 709 (out of 8,728) in MF.

Assessment metrics

The primary assessment metric proposed by the organizers is gene-centric (http://biofunctionprediction.org/node/262). For each gene, the terms assigned are propagated up the GO hierarchy to the root, yielding a set of terms. This is performed for each scored term a gene is given, starting with the highest scoring and working downward, adding to the number of predicted terms. This propagation could be done by the submitter, or would be done by the organizers for all non-zero predictions. The same propagation was done for the gold standard terms. Any terms overlapping between these sets were considered "correct". Given these sets, precision and recall can be computed in the usual way (precision = |correct predicted terms| / |predicted terms|, recall = |correct predicted terms| / |true terms|). To generate a precision-recall curve, the predictions for each gene were treated as a ranked list of GO terms (with the non-assigned GO terms being tied at the bottom). When a term is expanded by up-propagation, the best score given for the term is the one retained (since submitters may have submitted an explicit score for a GO term as well as one or more scores implied by the structure of GO), unless the submitter had provided a specific lower score for that term.
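The up-propagation and gene-centric precision/recall just described can be sketched with a toy ontology; the term names and the `parents` map below are invented for illustration.

```python
# Toy ontology: each term maps to its direct parents (invented terms).
parents = {
    "mf_root": [],
    "binding": ["mf_root"],
    "dna_binding": ["binding"],
    "kinase": ["mf_root"],
}

def propagate(terms):
    """Up-propagate a set of terms to the root of the ontology."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents[term])
    return seen

def precision_recall(predicted, truth):
    pred, true = propagate(predicted), propagate(truth)
    correct = pred & true
    return len(correct) / len(pred), len(correct) / len(true)

p, r = precision_recall({"binding"}, {"dna_binding"})
print(p, r)  # predicting only the parent term: precision 1.0, recall 2/3
```

Note how a vague prediction (the parent "binding") already earns perfect precision under this scheme; this behavior matters for the results below.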
Precisions at each recall were computed, generating a precision-recall curve. The assessment rules did not specify a way in which to summarize these curves, so we used the average precision [10]. We refer to this as the "CAFA score".

We present results from two additional metrics. One is function-centric and is a standard area under the receiver operating characteristic curve (AUROC). We first propagated GO terms as described above. For each GO term in the gold standard, we considered it for ROC computation if it had 10-100 members. For each such GO term, across all evaluation targets, we used the scores provided by the submitter to rank the predictions, with non-predictions treated as tied. The area under the ROC curve was computed using standard methods. Because we did not get information on the unevaluated targets, the AUCs we computed are probably depressed, as we suspect that the prediction scores for the other ~46,000 targets would tend to be lower than those for the evaluation targets, due to biases in which of them received new GO annotations. However, this does not preclude valid internal comparisons based on the evaluation targets.

The second metric we used is gene-centric but attempts to overcome problems we encountered with the CAFA score. The measure is the semantic similarity between the actual and predicted function, using the measures of Resnik [11] or Lin [12]. These measures take into account the information content of each GO term, based on the usage of terms in existing annotations [13]. Thus getting a rarely-used GO term correct in a prediction is given higher weight than correctly guessing a generic one.
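The two measures can be sketched as follows, with information content defined as IC(t) = -log p(t); the annotation frequencies and ancestor sets below are invented toy data, not real GO statistics.

```python
import math

# Invented annotation frequencies and ancestor sets for a toy ontology.
freq = {"root": 1.0, "binding": 0.5, "dna_binding": 0.05, "kinase": 0.1}
ancestors = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "dna_binding": {"dna_binding", "binding", "root"},
    "kinase": {"kinase", "root"},
}

def ic(term):
    """Information content: rarely used terms carry more information."""
    return -math.log(freq[term])

def resnik(a, b):
    """IC of the most informative common ancestor."""
    return max(ic(t) for t in ancestors[a] & ancestors[b])

def lin(a, b):
    """Resnik similarity normalized by the terms' own ICs."""
    denom = ic(a) + ic(b)
    return 2 * resnik(a, b) / denom if denom else 0.0

# Matching a rare term exactly scores far above sharing only the root:
assert resnik("dna_binding", "dna_binding") > resnik("dna_binding", "kinase")
```

Under either measure, correctly predicting "dna_binding" is rewarded, while predictions that only share the root with the truth score zero.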
We note that the use of these semantic similarity measures has been previously proposed for assessment of predictions [5].

Gene Ontology annotations

To perform a retrospective analysis, we obtained annotations from the GOA [14] FTP archives, dated January 11, 2011. For mouse this corresponds to GOA release 79; for human it is release 93; for E. coli, release 91; rat, release 79; Arabidopsis, release 66. The deadline for CAFA submissions was January 18, 2011, so the file represents GO annotations that were available to the participants at the time of submission and could in principle have been used for inference. Note that our analysis is based on the presence of the annotations in the files for the date given, not the "date of annotation" column in the latest files; the latter appears to be misleading or incomplete.

Algorithm aggregation

To combine the predictions into a single set, the gene rankings for each algorithm were normalized to the range 0-1. We computed the mean of this normalized rank for each gene, and ranked the resulting means to obtain the final ranking.

Results

Gene-centric evaluation using the CAFA score

The 18 methods evaluated yielded mean CAFA scores of 0.45 for the Molecular Function ontology (MF) and 0.21 for Biological Process (BP) (Figure 1). The scores obtained by aggregating across all algorithms were better than any of the individual algorithms (0.40 for BP and 0.67 for MF). Whether these values are good in an absolute sense depends on what one considers the null distribution. If one completely ignores any prior knowledge of GO structure, the values appear to be very high. However, a more appropriate null is provided by the Prevalence data. Any algorithm that cannot do better than assignment of gene functions based on their existing prevalence among annotated genes is not useful in practice.
The fact that the Prevalence method is not really a method but a "background" is highlighted by the fact that it gives the same predictions to all genes, and that it can be computed without any information about the target genes. Considering Prevalence as the baseline, only three algorithms perform better than null on BP, and six perform better than null on MF (Figure 1B and 1D). BLAST performed worse than the null (0.19 vs. 0.31 for BP, 0.44 vs. 0.52 for MF).

Further investigation revealed that this result is not due to terrible performance of the algorithms, but that the CAFA score is problematic. Consider an algorithm which simply predicts "Molecular function" (GO:0003674, the root of the MF hierarchy) for every target. This is always a correct (precise) prediction, because all MF evaluation targets are assigned the term "molecular function", due to the propagation of annotations to the root. One might hope this single correct prediction is counterbalanced by poor recall; however, the recall turns out not to be that bad, because the evaluation targets for CAFA have few annotations, even after propagation (7.3 terms on average in MF). Thus simply assigning "Molecular function" to each target yields a respectable CAFA score of 0.19, which is clearly erroneous. The performance of the "Prevalence" data represents a generalization of this problem. In theory, the submitters could have compensated for this issue by making sure they had at least made the predictions they could make with little chance of decreasing their precision. If this was done in a consistent manner, the scores would at least be internally comparable, but there is no evidence to suggest the submitters took this into account.
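The root-only degenerate case can be checked with toy arithmetic, taking a target with seven up-propagated true terms (close to the observed MF average of 7.3):

```python
# A root-only prediction against a target with 7 up-propagated true terms.
true_terms = 7          # includes the ontology root after propagation
predicted_total = 1     # the submission predicts only the root term
predicted_correct = 1   # the root is always correct after propagation
precision = predicted_correct / predicted_total
recall = predicted_correct / true_terms
print(precision, round(recall, 2))  # -> 1.0 0.14: perfect precision for free
```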
The implication of this analysis is that the CAFA score is not giving a useful representation of the performance of prediction algorithms in any absolute sense, and it also cannot be used to reliably compare the CAFA submissions to each other.

In prediction one can rank functions with respect to a gene ("Which functions are most associated with this gene?"), or one can rank genes with respect to a function ("Which genes are most associated with this function?"). The primary CAFA score is gene-centric. However, for a prediction to be useful it should be valid from either perspective. Otherwise one might be able to argue that all genes have all functions, in some weak or indirect sense. To make a useful functional assignment, we must consider whether a gene is "more involved" in a function than other genes. We therefore considered a more traditional "function's point of view" evaluation, and finally an alternative gene-centric evaluation that takes into account specificity of predictions and is thus indirectly comparative.

Evaluation by area under ROC curves

A common way to evaluate predictions is to ask, for each function, which gene is most likely to possess that function. This is challenging to implement in the context of CAFA because of the capricious nature of which GO terms and genes were available for evaluation. For many GO terms there are simply too few genes which received a prediction. Selecting GO terms that had between 10 and 100 genes assigned to them yielded 245 terms for BP and 45 for MF. Because of the small number of targets overall, these tend to be somewhat "high level" GO groups.

Switching to this function-centric perspective, the 18 algorithms now score quite well in BP, with an average AUROC of 0.63 (Figure 2; note that the Prevalence data is guaranteed to score exactly 0.5 AUROC, as it ranks all genes equally for each function).
The best single method is BLAST, with a mean AUROC of 0.75. Once again, the aggregate outperforms the individual algorithms, with a mean AUROC of 0.77. AUROCs for MF were also generally high, with the average performance being 0.66, BLAST scoring 0.71, and the aggregate performing better than any individual algorithm (0.84). We note that these values are likely to be somewhat artificially depressed, because they do not take account of the ~46,000 sequences that were left out of the evaluation due to still not being annotated with good evidence codes at the time of assessment; we expect that if the submitted scores for these were available, AUROC values would be higher (this assumes that the 46,000 would be biased towards sequences for which predictions would tend not to be made, due to lack of sequence similarity to annotated proteins, for example).

Figure 1. Summaries of performance using the "precision-recall"-based CAFA score. A and B show results for the BP ontology; C and D for MF. Submitted results are shown in black, and the null Prevalence data are represented in grey. A and C plot the distribution of scores for all evaluation targets, averaged across algorithms. B and D show the distribution of scores across algorithms, averaged across targets.

We also evaluated species-based variation in functional assignment, to account for the possibility of issues such as GO terms being used in only one species. For each GO term, sequences were ranked by the incidence of the term in the species from which the sequence was derived (that is, all sequences from a given species were given the same ranking). This yields a mean AUROC of 0.55 for BP (low, but significantly different from 0.5), while assessment per-species yielded AUROCs comparable to the overall results reported above. This suggests that species biases in term usage are not a major influence.
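The function-centric evaluation can be sketched with a rank-based (Mann-Whitney) AUROC in which tied comparisons count half; the gene names and scores below are invented. The tie rule is also why a method that ranks all genes equally, like Prevalence, scores exactly 0.5.

```python
# Rank-based AUROC sketch with ties counted as half-wins (toy data).
def auroc(scores, positives):
    """Probability that a random positive outranks a random negative."""
    pos = [g for g in scores if g in positives]
    neg = [g for g in scores if g not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if scores[p] > scores[n]:
                wins += 1.0
            elif scores[p] == scores[n]:
                wins += 0.5  # tie, e.g. two genes with no prediction
    return wins / (len(pos) * len(neg))

scores = {"g1": 0.9, "g2": 0.4, "g3": 0.0, "g4": 0.0}  # g3, g4 unpredicted
print(auroc(scores, {"g1"}))                    # perfectly ranked -> 1.0
print(auroc(scores, {"g3"}))                    # only the g4 tie helps -> 1/6
print(auroc({g: 1.0 for g in scores}, {"g1"}))  # all tied -> exactly 0.5
```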
However, the molecular function assignments were badly distorted by species-specific variation in assignments. In particular, E. coli assignments in MF were anomalous in having exceptionally high depth (large numbers of GO terms assigned per gene; Additional file 1). Terms for E. coli also had an unusually high correlation within the BP (Additional file 1). The reason for this is not clear, but may partially reflect the smaller number of annotations assigned to E. coli in the BP ontology relative to other species (17.6 vs. 30.5). However, this elevated annotation depth outside E. coli in the BP ontology was extremely variable (standard deviation of 31.3), suggesting it cannot fully explain the species-wide correlation pattern. Curation practices in EcoCyc ([15], the source of E. coli GO annotations) may be a more important causal factor.

Evaluation using information content

The main problem with the CAFA gene-centred metric is its failure to account for the structure of GO in considering how informative a prediction is. We therefore considered a metric based on the semantic similarity between the hypothesized assignment and the true assignment. This allowed us to measure whether prediction algorithms

Figure 2. Summaries of performance using ROC curves. Results are only presented for BP because the MF results were too strongly affected by biases due to the E. coli annotations. A. Distribution of AUROCs for the GO terms evaluated. The mean performance across algorithms is shown in black. A simple aggregation algorithm does much better on average, shown in grey. B. Density plot showing the overlay of the ROC curves that make up the results shown for the aggregation algorithm in A, with areas of high density shown in lighter shades. Scattered light areas are artifacts due to the effects of GO groups with smaller numbers of genes.
Note that the Prevalence method is guaranteed to generate AUROCs of 0.5 for all functions, since it ranks all genes equally.

ever assign a function which is "surprisingly" precise (unlikely to be made by chance). We quantified "surprise" by asking whether the most specific predictions are higher-scoring than the best-scoring prediction made by the Prevalence method. Thus a novel prediction had to be more informative than the best "prediction" that could be made simply based on the null. All of the algorithms passed this threshold at least once within the BP category, and on average 7.6% of their predictions were more informative than the best prediction ever made by the Prevalence method (for any of the 558 genes), with a maximum of 14.7% (Figure 3A). BLAST yielded 5.6% while GOtcha yielded 9.3%. These findings were not substantially altered by the choice of semantic similarity measure (Resnik or Lin) or different thresholds for which of a method's top N picks could be considered proposed functions. We obtained comparable results for the MF evaluation, with all algorithms generating at least some unusually informative predictions (again, relative to prevalence), with 10.0% of predictions being informative, on average. These results show that the algorithms can make correct and specific predictions, though at low proportions.

As shown in Figure 3B, algorithms often gave informative predictions for the same sequences. The majority of sequences had no strongly informative predictions made for them by any algorithm. While we do not know the methods or data sources used by the submitters (other than in the case of BLAST and GOtcha), the results suggest that these targets had some feature that made them especially predictable.
The good performance of GOtcha led us to hypothesize that information on sequence similarity was responsible for these unusually good predictions.

To test the effect of sequence similarity we took advantage of the fact that many of the evaluation sequences already had GO terms prior to the assessment (in the BP, an average of 15 functions per sequence; we assessed this only for mouse and human). These annotations were present because it is routine for genes which have high sequence similarity to other better-annotated genes to be assigned GO terms with "weak" evidence codes such as "IEA", indicating a computational prediction. Over time, as experimental evidence is obtained, these evidence codes can be upgraded to those which were considered part of the CAFA evaluation (e.g., "TAS"). We tested whether using the pre-existing GO annotations as an entry to CAFA would have been competitive. Indeed, this yields performance tied with the best-performing algorithm in terms of the number of "informative" predictions (32 out of 275 in MF), with 16 due to the same sequences as the best-performing algorithm. This strongly suggests that in many cases, computational methods are simply recomputing the "IEA" annotations, and by happenstance some of those were upgraded in GO during the waiting period, making them part of the gold standard for the purposes of CAFA. We note that, presumably, the upgrading of the GO annotations is partly driven by the presence of the IEA annotations in the first place (or, at least, by the underlying sequence similarity); sequences which are already annotated in GO are also more likely to be experimentally tested, a form of confirmation bias. Thus these apparent "informative predictions" could be considered in part

Figure 3. Summaries of performance based on information content. Results are only shown for BP because of the distorting effect of E. coli annotations in MF. A. The fraction of predictions considered informative per algorithm. B.
Overlaps among informative predictions. Most sequences received no informative predictions (peak at 0), while numerous predictions are made by multiple algorithms.

successful guessing of which sequences are likely to attract attention from experimentalists.

Effect of gene multifunctionality

While we did not have access to the data used by the submitters, we wished to see if any underlying biases in the data could be ascertained from the behaviour of the algorithms. In particular we hypothesized that assignments would tend to be made to GO groups that contain multifunctional genes [7]. Indeed, functions populated more heavily by multifunctional genes were preferentially (rank correlation ~0.30) the ones whose assignment caused a rise in the Lin semantic similarity (which, unlike Resnik similarity, is sensitive to false positives).

Manual examination of informative biological process predictions

To gain further insight into how predictions are made, we more closely examined some of the "most informative" predictions (the top ten such predictions for BP from the aggregated algorithms are listed in Table 1). We used GO annotations from before the start of CAFA (early January 2011) and compared them to the annotations that appeared during the waiting period. This analysis was assisted by the UniProtKB [16] and QuickGO [17] web sites but relied primarily on the annotation files provided by GOA [14].

As shown in Table 1, in seven of the top 10 cases, the aggregate algorithm included the "right answer" among the predictions meeting the threshold established by the Prevalence baseline (although it was never the top prediction; not shown). Table 1 shows that in nearly every case, very closely related GO terms were already present before CAFA, in agreement with the systematic analysis described above. A possible exception is SOX11.
Note that these “similar” terms might not be in the BP ontology, but because of correlations among the GO hierarchies, such terms are likely to be informative (for example, the cellular components “vacuole” for Atp6v0c and “late endosomal membrane” for CHMP4B). In all ten cases, the source of the annotation was available before CAFA, in some cases decades before (last column of Table 1).

Discussion

CAFA provided a unique opportunity to evaluate computational gene function assignments. Our results provide some new insights into the behaviour of gene function prediction methods, and into the challenges in providing an adequate and fair evaluation. Some of these challenges have been noted prospectively [5] so it is interesting to see how practice meets theory. We focus to some extent on comparing CAFA to CASP, and where helpful lessons could be learned.

Table 1. The sequences with the top ten “most informatively” predicted correct annotations by the aggregate algorithm are summarized.

Sequence | Gene symbol | Gold standard | Closest informative prediction | Pre-existing IEA terms (representative) | Pub date
BHMT1_MOUSE | Bhmt | methionine biosynthetic process (GO:0009086) | methionine biosynthetic process (GO:0009086) | methionine biosynthetic process | 2004
IPO13_RAT | Ipo13 | steroid hormone receptor nuclear translocation (GO:0002146) | protein import into nucleus, translocation (GO:0000060) | protein import into nucleus | 2006
ARGB_ECOLI | argb | arginine biosynthetic process (GO:0006526) | arginine biosynthetic process (GO:0006526) | arginine biosynthetic process | 2007
CAF1K_ARATH | CAF1-11 | nuclear-transcribed mRNA poly(A) tail shortening (GO:0000289) | nuclear-transcribed mRNA poly(A) tail shortening (GO:0000289) | poly(A)-specific ribonuclease activity | 2009
CFAB_MOUSE | Cfb | complement activation, alternative pathway (GO:0006957) | complement activation, alternative pathway (GO:0006957) | complement activation | 1983
CHM4B_HUMAN | CHMP4B | endosome transport (GO:0016197) | endosome transport (GO:0016197) | protein transport; late endosome membrane | 2010 (Reactome)
HA15_MOUSE | H2-T23 | antigen processing and presentation of endogenous peptide antigen via MHC class Ib via ER pathway (GO:0002488) | antigen processing and presentation of endogenous peptide antigen via MHC class Ib (GO:0002476) | antigen processing and presentation of peptide antigen via MHC class I | 1992
SOX11_MOUSE | Sox11 | positive regulation of hippo signaling pathway (GO:0035332) | embryonic digestive tract morphogenesis (GO:0048557) | cell differentiation; nervous system development | 2010
TGT_ECOLI | tgt | tRNA wobble guanine modification (GO:0002099) | queuosine metabolic process (GO:0046116) | queuine tRNA-ribosyltransferase activity | 1982
VATL_MOUSE | Atp6v0c | lysosomal lumen acidification (GO:0007042) | lysosomal lumen acidification (GO:0007042) | proton-transporting V-type ATPase, V0 domain; vacuole | 2001

The date of publication of the gold standard source is given in the last column.

Task categorization

One area where CAFA could follow CASP is in the definition of tasks. Currently CASP differentiates between three categories of tasks, all of which have direct analogies with function prediction tasks.

The CASP “template-based” prediction task is analogous to the case of trying to predict function when the gene has sequence similarity to already functionally annotated genes. In such cases, methods like BLAST provide a baseline for what can be learned readily. Our analysis shows that many of the CAFA targets already had “IEA” functions assigned, and to an extent CAFA successes are simply recovering these. Thus perhaps unsurprisingly, BLAST did well in the part of CAFA we had access to, and we expect that other high-scoring methods are using sequence similarity information. Tasks which exploit sequence similarity should be considered a distinct category of function prediction problems.
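The simplest form of such a “template-based” baseline is to transfer GO terms from similar annotated sequences, scoring each candidate term by the best similarity among the hits that carry it. The following sketch is purely illustrative: the sequence identifiers, similarity values, and annotations are hypothetical, and a real pipeline would derive the hit list from BLAST output.

```python
def transfer_annotations(hits, annotations):
    """Annotation-transfer baseline.

    hits: list of (subject_id, similarity in [0, 1]) for one query.
    annotations: subject_id -> set of GO terms.
    Returns a dict mapping GO term -> confidence, where confidence is
    the best similarity of any annotated sequence carrying that term.
    """
    scores = {}
    for subject, sim in hits:
        for term in annotations.get(subject, ()):
            scores[term] = max(scores.get(term, 0.0), sim)
    return scores

# Hypothetical query hits and subject annotations.
hits = [("P1", 0.92), ("P2", 0.40)]
annotations = {
    "P1": {"GO:0009086"},                # methionine biosynthetic process
    "P2": {"GO:0009086", "GO:0006526"},  # + arginine biosynthetic process
}
pred = transfer_annotations(hits, annotations)
# GO:0009086 is backed by the strong hit; GO:0006526 only by the weak one.
```

This is essentially what populates the “IEA” annotations discussed above, which is why recovering them computationally amounts to recomputing existing knowledge rather than making a genuinely new prediction.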
Similarly, the CASP “template-free” prediction task is akin to the task of predicting gene function when no sequence similarity information is available (or at least, not used).

The CASP “structure refinement” task [18] might be analogous to the task of “function refinement”, where an already functionally annotated gene is given new or more specific functions. We believe this could be treated as a different task from assigning functions to a completely unannotated “orphan” gene (one lacking even IEA annotations). Among methods that fall into this category are those which use GO itself as a measure of “guilt” [19,20]. Thus if two genes share nine out of ten GO terms, the tenth one is a pretty good bet. Even if they don’t explicitly rely on existing annotations, algorithms that are good at “refinement” might not be very good at “template-based” assignment (and vice versa).

We propose that some scheme like this be adopted for future CAFA assessments, to more clearly differentiate between cases where sequence similarity is highly informative and those where it is not, and possibly to extend the competition to include targets which already have some functions assigned with “strong” evidence codes.

The importance of evaluation metrics

Over the years, CASP has modified its assessment metrics and now has an agreed-upon set of metrics. As we have shown, the primary performance metric initially proposed for CAFA is unsatisfactory. This is illustrated by the fact that by this measure, a null “prediction method” outperforms most methods. The problem with the CAFA score is that it is not comparative across genes. When one is predicting a function for a gene, the goal is to say that “this gene has function X more than other genes do” in some sense.
Otherwise, the definition of function becomes degenerate, and simply assigning all genes the same functions becomes reasonable.

We have applied two alternative measures: one which is gene-centric and focuses on the information content of a prediction, and a standard metric (AUROC) which is function-centric. The information-based metric is implicitly comparative, because it uses information on the distribution of GO terms across genes as well as a threshold set by the null predictor. The AUROC metric also ranks genes against each other. By these measures, it can be seen that the prediction algorithms (including BLAST) are providing meaningful performance. The problem with the function-centric measure is that it depends on having more than one prediction for the function to be scored, which cannot be guaranteed given the nature of the CAFA task. The differences among annotation practices for different organisms (notably for E. coli in the current data) make assessment even harder, as criteria vary for what is considered good annotation.

The power of aggregation

In recent years, the top algorithms for CASP have tended to be meta-algorithms which aggregate the results of individual methods. If our experience is representative, the same is likely to be true for CAFA. The aggregate algorithm outperforms all the individual algorithms, apparently because aggregation allows a few “confident” predictions to rise to the top, while less confident predictions (which turn out to be poor) are “averaged out”. A similar phenomenon was reported in the DREAM5 assessment of gene network inference [21].

The benefit of having a clear goal

The points raised thus far are predicated on the idea that function prediction is like protein structure prediction. However, in a fundamental way this is not the case, at least not yet. Algorithms that perform well in CASP are considered to do well at “structure prediction”.
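The “averaging out” behaviour described under the power of aggregation can be sketched by taking the mean of per-gene scores across methods, so that a prediction survives only if several methods agree on it. All scores below are invented for illustration; real aggregators may use rank-based or weighted schemes instead of a plain mean.

```python
def aggregate(score_lists):
    """Mean-score aggregation across prediction methods.

    score_lists: one dict per method, mapping gene -> score in [0, 1].
    A gene missing from a method is treated as scored 0 by that method,
    so a single confident but unsupported prediction is damped.
    """
    genes = set().union(*score_lists)
    n = len(score_lists)
    return {g: sum(s.get(g, 0.0) for s in score_lists) / n for g in genes}

# Three hypothetical methods predicting one function.
method_a = {"geneX": 0.9, "geneY": 0.2}
method_b = {"geneX": 0.8, "geneZ": 0.9}  # confident one-off vote for geneZ
method_c = {"geneX": 0.7, "geneY": 0.3}
agg = aggregate([method_a, method_b, method_c])
# geneX, supported by all three methods, outranks geneZ's single vote.
```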
That is, the CASP tasks are well-aligned with what the field agrees the “real life” task is. This is basically because protein structure is fairly easy to define (position of atoms in space). In contrast, “gene function” does not have an agreed-upon definition. Certainly there is no consensus that the Gene Ontology is even close to biological reality, rather than just being convenient. Since function assignment/prediction methods always use experimental data as inputs, there may be more value in simply trusting those data than in trying to “align” predictions to a gold standard that is acknowledged by its creators to be problematic for such uses [22]. Tuning algorithms to be good at predicting GO annotations is probably never going to be satisfying. The task of interest is predicting gene function, not predicting GO annotations as an end in itself, and the fact that these two tasks are not congruent presents a challenge.

CASP also differs from CAFA in having targets defined by specific experimental results held back from participants, who are given key underlying data to use for computational analysis. The same approach is taken in the Critical Assessment of Genome Interpretation (CAGI, https://genomeinterpretation.org/). This has the benefit of anchoring the assessment in a very specific outcome measure (that is: did the computation agree with the experiment). However, it limits the number and scope of tasks that can be assessed. This model is not likely to be applicable to the general problem of gene function assignment, but might be useful as a way to make progress on more specialized problems.

It is worth mentioning that there are “function prediction” tasks that are not based on GO (or similar schemes) in the same way as CAFA. For example, some groups attempt to predict mutant phenotypes [19] (some of the CAGI tasks are of this type).
The roles of the issues we raise in such situations are not entirely clear, but we note that the types of data used are the same as those used in the CAFA-style annotation task, and GO often figures prominently in such work, especially as a source of validation [23].

Predicting evidence codes and “post-diction”

With the caveat that CAFA’s evaluation is based on a relatively small number of proteins, and our analysis on a subset of the CAFA entries, there are some important themes that emerged in terms of which informative predictions were made. The evidence strongly suggests that a major factor is the availability of sequence similarity information. Finding a set of proteins which are not annotated at all was difficult, so many of the evaluation targets already had “IEA” annotations (presumably often based on BLAST or a similar approach). The successful predictions are in part simply guessing that those existing annotations are likely to be supported by experimental evidence once they are tested, and thus upgraded in GO. The strong influence of sequence similarity was also suggested by the Mousefunc study [6].

The fact that many of the most predictable annotations were based on literature reports that predate CAFA further suggests that a bottleneck in filling in GO is information retrieval from the literature, not prediction per se. Strictly speaking, many of the CAFA evaluation targets are “post-dictions”. The short time window available to CAFA probably helped ensure this would be a factor; there was little chance that many experimental reports would be published and also curated in a six-month period. The organizers were aware of this, and it is unlikely that CAFA participants would have been able to efficiently exploit the existence of publications describing the functions of proteins in the target set. On the other hand, for all we know some of the entries may have used text mining methods as a tool for making predictions (see Addendum, below).
This might be considered yet another category of automated annotation task. But we stress that all the predictions are based on experimental data of one type or another, so this distinction may not be helpful.

This returns us to the issue of the relationship between function prediction and GO. If computational predictions are based on experimental data that could be used by curators to populate GO, then the task of prediction is reduced to simply copying that information into GO (with appropriate evidence codes), rather than considering GO to be some independent entity that algorithms should attempt to match.

Conclusions

Our analysis is based on a subset of a single assessment of automated annotation of a relatively small number of proteins, but some general principles and concerns emerged which are very likely to be relevant to any assessment of function assignment. Sequence similarity appears to be a dominant influence on the ability to predict function. In terms of performing assessment, clearly a major challenge is the relatively slow pace and unpredictable nature of the entry of experimentally-defined functions into the annotation databases. But perhaps the deepest issue is the difficulty of deciding what it means to predict function in a useful way, as the current gold standards are deeply problematic. The first CAFA was a bold attempt to put gene function prediction on a firmer footing, and we expect that future iterations will continue to promote progress in this difficult area of computational biology.

Addendum

Since our manuscript was submitted and revised, a detailed paper describing the outcome of the CAFA evaluation appeared [24], which enhances the interpretation of our analysis.
First, it is now clear that our assessment included one of the top-performing methods, Argot2 [25], as finally judged by the organizers, lending weight to our reassessment as a fair representation of the entries. Second, it is reported that the organizers of CAFA did not evaluate the results using the precise rules which were released before the assessment (those which were available to us as non-participants). For example, target annotations for a specific term (“protein binding”) were “not considered informative” and thus excluded from the main evaluation [24]; including this term pushes the naïve Prevalence score to be among the top performers (Supplementary Figure 3 of [24]). The organizers also interpreted “pick the highest scoring term among all non-zero predictions and propagate it to the root of the ontology” as excluding the root itself for evaluation (this has a minimal effect on the value of Prevalence since it is merely the exemplary case of a general problem). Finally, the number of evaluation targets reported in [24] varies from ours, apparently because the set was expanded after the initial assessment.

The most striking distinction between Radivojac et al. and our results, at first glance, is that we found that simple methods relying on sequence similarity were highly competitive, while Radivojac et al. ranked BLAST poorly, stating that “top algorithms are useful and outperform BLAST considerably,” and, “BLAST is largely ineffective at predicting functional terms related to the Biological Process ontology” [24]. This conclusion was apparently based on the CAFA score; the authors did not report per-algorithm (function-oriented) AUROCs, on which basis BLAST ranks highest in our analysis.
The issue is readily resolved by stressing that based on the CAFA score, the Prevalence score outperforms BLAST (sometimes even after removing “protein binding”) and indeed other more sophisticated methods [24].

In any case it is clear that sequence similarity was the bedrock of function prediction in CAFA. As noted by Radivojac et al., nearly all of the methods submitted use sequence similarity. All of the top performers use such methods, and for some (e.g., Argot2) it was the primary or sole source of data (not counting the use of GO itself). The overall top-scoring group (“Jones-UCL”) reported that “the largest contribution to correct predictions came from homology-based function prediction” (supplement of [24]). The only useful non-sequence source of information cited for the Jones-UCL method was text-mining of UniProt, presumably amounting to post-dictions of the type we report in Table 1. The Jones-UCL method also took into “account the GO ontology structure to produce final GO results”. Argot2 also leverages the structure of GO, using information related to the Prevalence score [25]. This reinforces our concern that tuning algorithms to match the assessment, while beneficial in a rarefied sense, could be misleading about generalization to many real applications (see also [26]).

Funding

Supported by NIH Grant GM076990, salary awards to PP from the Michael Smith Foundation for Health Research and the Canadian Institutes for Health Research, and a postdoctoral fellowship to JG from the Michael Smith Foundation for Health Research.

Additional material

Additional file 1: Taxon-specific effects on annotation. The GO annotations used as evaluation targets were used (not predictions). For each sequence, a binary vector of GO annotations was created (1 = sequence is annotated), and the correlation among these vectors is plotted, with lighter shades indicating high correlations. The sequences are organized by taxon, with the E. coli sequences indicated. It is evident that the E.
coli sequences have very high correlations in their annotations in BP (A), very low correlations in MF (B), and consistently high depth (number of terms assigned per sequence within the MFO; C). Depth of coverage exhibits no visually clear trend for E. coli within BPO, but is significantly depressed relative to other species (p < 10^-6, rank-sum test).

Authors’ contributions

Experiments were designed by JG and PP. JG performed most of the analysis and prepared the figures. PP drafted the manuscript with input from JG.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We are extremely grateful to the CAFA organizers (Iddo Friedberg, Predrag Radivojac and Sean Mooney) for their efforts and the community spirit that underlies their motives in promoting critical assessment of function assignment, and for their encouragement of our work. We thank Wyatt T. Clark for assistance with the data. We thank Stefano Toppo for comments on an earlier draft of the manuscript. We also thank the CAFA participants who generously provided their predictions prior to publication.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3

Declarations

Publication funding obtained from NIH Grant GM076990.

Author details

1 Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 196 Genome Research Center, 500 Sunnyside Boulevard, Woodbury, NY, 11797, USA. 2 Centre for High-Throughput Biology and Department of Psychiatry, University of British Columbia, 177 Michael Smith Laboratories, 2185 East Mall, Vancouver, Canada, V6T 1Z4.

Published: 22 April 2013

References

1. Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction.
Curr Opin Struct Biol 2005, 15(3):285-289.
2. Zhang Y: Progress and challenges in protein structure prediction. Curr Opin Struct Biol 2008, 18(3):342-348.
3. Oliver S: Guilt-by-association goes global. Nature 2000, 403(6770):601-603.
4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
5. Godzik A, Jambon M, Friedberg I: Computational protein function prediction: are we making progress? Cell Mol Life Sci 2007, 64(19-20):2505-2511.
6. Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, et al: A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 2008, 9(Suppl 1):S2.
7. Gillis J, Pavlidis P: The role of indirect connections in gene networks in predicting function. Bioinformatics 2011, 27(13):1860-1866.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
9. Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5:178.
10. Gillis J, Pavlidis P: “Guilt by association” is the exception rather than the rule in gene networks. PLoS Comput Biol 2012, 8(3):e1002444.
11. Resnik P: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 1999, 11:95-130.
12. Lin D: An information-theoretic definition of similarity. Proc 15th International Conf on Machine Learning 1998, 296-304.
13. Lord PW, Stevens RD, Brass A, Goble CA: Semantic similarity measures as tools for exploring the gene ontology. Pac Symp Biocomput 2003, 601-612.
14.
Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ, Bely B, Browne P, Mun Chan W, Eberhardt R, et al: The UniProt-GO Annotation database in 2011. Nucleic Acids Res 2012, 40(Database):D565-570.
15. Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, et al: EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res 2011, 39(Database):D583-590.
16. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A: UniProtKB/Swiss-Prot. Methods Mol Biol 2007, 406:89-112.
17. Binns D, Dimmer E, Huntley R, Barrell D, O’Donovan C, Apweiler R: QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics 2009, 25(22):3045-3046.
18. Valencia A: Protein refinement: a new challenge for CASP in its 10th anniversary. Bioinformatics 2005, 21(3):277.
19. McGary KL, Lee I, Marcotte EM: Broad network-based predictability of Saccharomyces cerevisiae gene loss-of-function phenotypes. Genome Biol 2007, 8(12):R258.
20. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al: Gene prioritization through genomic data fusion. Nat Biotechnol 2006, 24(5):537-544.
21. Marbach D, Costello JC, Kuffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Consortium D, Kellis M, Collins JJ, et al: Wisdom of crowds for robust gene network inference. Nat Methods 2012, 9(8):796-804.
22. Thomas PD, Wood V, Mungall CJ, Lewis SE, Blake JA, on behalf of the Gene Ontology Consortium: On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report. PLoS Comput Biol 2012, 8(2):e1002386.
23. Lee I, Lehner B, Crombie C, Wong W, Fraser AG, Marcotte EM: A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet 2008, 40(2):181-188.
24.
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, et al: A large-scale evaluation of computational protein function prediction. Nat Methods 2013.
25. Falda M, Toppo S, Pescarolo A, Lavezzo E, Di Camillo B, Facchinetti A, Cilia E, Velasco R, Fontana P: Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics 2012, 13(Suppl 4):S14.
26. Pavlidis P, Gillis J: Progress and challenges in the computational prediction of gene function using networks. F1000Research 2012, 1:14 [http://f1000research.com/articles/1-14/v1].

doi:10.1186/1471-2105-14-S3-S15

Cite this article as: Gillis and Pavlidis: Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). BMC Bioinformatics 2013, 14(Suppl 3):S15.

