UBC Faculty Research and Publications

Gene Ontology term overlap as a measure of gene functional similarity Mistry, Meeta; Pavlidis, Paul Aug 4, 2008

BMC Bioinformatics (BioMed Central)
Open Access
Research article

Gene Ontology term overlap as a measure of gene functional similarity

Meeta Mistry (1) and Paul Pavlidis* (2)

Address: (1) CIHR/MSFHR Graduate Program in Bioinformatics, University of British Columbia, Canada and (2) Department of Psychiatry and Centre for High-throughput Biology, University of British Columbia, British Columbia, Canada

Email: Meeta Mistry - mmistry@bioinformatics.ubc.ca; Paul Pavlidis* - paul@bioinformatics.ubc.ca
* Corresponding author

Abstract

Background: The availability of various high-throughput experimental and computational methods allows biologists to rapidly infer functional relationships between genes. It is often necessary to evaluate these predictions computationally, a task that requires a reference database for functional relatedness. One such reference is the Gene Ontology (GO). A number of groups have suggested that the semantic similarity of the GO annotations of genes can serve as a proxy for functional relatedness. Here we evaluate a simple measure of semantic similarity, term overlap (TO).

Results: We computed the TO for randomly selected gene pairs from the mouse genome. For comparison, we implemented six previously reported semantic similarity measures that share the feature of using computation of probabilities of terms to infer information content, in addition to three vector based approaches and a normalized version of the TO measure. We find that the overlap measure is highly correlated with the others but differs in detail. TO is at least as good a predictor of sequence similarity as the other measures. We further show that term overlap may avoid some problems that affect the probability-based measures.
Term overlap is also much faster to compute than the information content-based measures.

Conclusion: Our experiments suggest that term overlap can serve as a simple and fast alternative to other approaches which use explicit information content estimation or require complex pre-calculations, while also avoiding problems that some other measures may encounter.

Published: 4 August 2008
Received: 22 February 2008
Accepted: 4 August 2008
BMC Bioinformatics 2008, 9:327 doi:10.1186/1471-2105-9-327
This article is available from: http://www.biomedcentral.com/1471-2105/9/327
© 2008 Mistry and Pavlidis; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

In this paper we consider the problem of deciding if two genes are functionally related using computational methods. In particular, we are interested in how existing information about gene function can be used to enhance or evaluate computational predictions of functional relationships among genes.

Many genes have been functionally characterized by experimental methods, sequencing efforts, and high-throughput techniques, and as a consequence those genes then appear in public databases annotated with terms or concepts representative of their deduced function or biological role in the cell. The Gene Ontology (GO) is a structured, controlled vocabulary of terms providing consistency in annotating how a given gene product behaves in a cellular context, and many genes are now annotated with terms from GO [1]. It is increasingly common to attempt to define functional relatedness using "semantic similarity" of genes using GO annotations [2-7].
While many measures have been used, their relative benefits and drawbacks are unclear. The current work involves an examination of the behaviour of various semantic similarity measures that have been proposed, including one that has not been previously considered in comparisons.

Six of the measures we consider in this work make use of the hierarchical structure of the GO. Each term in GO is assigned to one of the three root ontologies: molecular function, biological process and cellular component. The terms in each ontology are linked to one another in a directed acyclic graph by two types of relationships: 'is-a', which represents a simple class-subclass relationship and 'part-of', which indicates a component relationship [8]. Importantly, terms can have more than one direct parent term, because a single child term can be defined in a number of different contexts. For example, the biological process term "hexose biosynthetic process" (GO:0019319) has two direct parent terms, "hexose metabolic process" and "monosaccharide biosynthetic process". This is because biosynthesis is a subtype of metabolism, and a hexose is also a type of monosaccharide (Figure 1). The GO terms used by genome database curators in the direct annotation of a gene are usually more specific, lower level terms. However, the graph structure of the GO implies that a gene that is associated with a low level term is also associated with higher level terms. Thus a gene involved in "hexose metabolic process" can also be given the annotation "metabolic process" by inference.

Previous work on the use of the GO to measure functional similarity focussed on the use of information content [9,10]. The information content (IC) of a term is related to how often the term is applied to genes in the database, such that rarely used terms are ascribed higher IC. The IC for GO terms decreases monotonically as one follows the graph from a leaf term towards the root term.
Intuitively, terms low in the hierarchy are "more detailed" and impart more information about function than high-level terms such as "metabolism". Semantic similarity measures based on IC make use of the idea that genes sharing terms with high IC are expected to be more functionally similar than genes that share terms with low IC. Indeed, it was shown that some IC-based measures correlate with other measures of functional relatedness such as sequence similarity [9,10].

A second set of semantic similarity measures are variations on the Vector Space Model (VSM) [11], an algebraic model originally developed for use in information retrieval. Unlike the IC-based measures, these methods do not account for hierarchical relations in the GO, and instead refer to GO terms in a 'flat' matrix format. The requirement of individual gene vectors to be generated is an extra complexity cost that is incurred prior to the actual similarity computation itself. In this study we implement three such methods [12-14] for comparison against our measure.

Our proposed method, Term Overlap (TO), was used previously by Lee et al. [4] in a study of gene coexpression analysis, where it was shown that TO correlates with increasing confidence in coexpression. Although Lee et al. [4] first implemented the TO measure, it was not thoroughly evaluated nor was it compared to other similarity measures. In this study we sought to test whether TO is an adequate substitute for other measures that have been put forward. In contrast to the other semantic similarity measures, TO does not use an explicit information content computation, and is less algorithmically complex. Here we explore the properties of TO in more detail and carry out a more formal evaluation of the approach, and find that TO has a number of attractive features that may recommend it as an alternative to other semantic similarity measures.

Methods

Data Sources

Sets of gene pairs were generated by random pairwise selection from the mouse genome. Each of the genes was annotated with its respective GO terms as it appears in the NCBI Gene database [15] (downloaded on January 8, 2008). Genes which were not annotated with any GO terms were not considered, leaving a set of 18,161 genes. Several sets containing 10,000 gene pairs each were evaluated initially, with a final dataset of 100,000 gene pairs generated for which the results are displayed in this paper ("100 k"). The 100 k set has pairs covering the entire corpus of mouse genes. The 100 k set with associated statistics is available as supplementary data from http://bioinformatics.ubc.ca/pavlidis/lab/gometric/.

Information Content and Semantic Similarity Measures

Several of the measures we considered require the computation of the information content of each GO term. These measures were originally described for the analysis of any corpus of text, and were adapted for use with GO by Lord et al. (2003), where full details are given. The information content of a GO term $t_i$ is:

$$IC(t_i) = -\log(p(t_i)) \tag{1}$$

where $p(t_i)$ is the probability of a term occurring in the corpus:

$$p(t_i) = \frac{freq(t_i)}{freq(root)} \tag{2}$$

where the corpus is the set of annotations for all genes under consideration. "Root" represents one of the three root ontology terms and $freq(root)$ is the number of times a gene is annotated with any term within that ontology. $freq(t_i)$ is given by:

$$freq(t_i) = annot(t_i) + \sum_{c \in children(t_i)} annot(c) \tag{3}$$

where $children(t_i)$ is the set of all children terms for the term $t_i$ (that is, the set of all terms for which $t_i$ is a parent term, either directly or indirectly).

In our analysis we focus on three IC based measures adapted from the work of Resnik [16], Lin [17], and Jiang and Conrath [18]. Resnik's measure calculates the similarity between two terms by using only the IC of the lowest common ancestor (LCA) shared between two terms $t_1$ and $t_2$:

$$sim_{Res}(t_1, t_2) = IC(LCA) \tag{4}$$

Lin's measure of similarity takes into consideration the IC values for each of terms $t_1$ and $t_2$ in addition to the LCA shared between the two terms and is defined as follows [17]:

$$sim_{Lin}(t_1, t_2) = \frac{2 \log(p(LCA))}{\log(p(t_1)) + \log(p(t_2))} \tag{5}$$

Figure 1: Structure of Gene Ontology. Depicted here is a graphical representation of the GO term "hexose biosynthetic process" and its associated parent terms, adapted from the AmiGO website http://www.geneontology.org/. Only partial paths are shown here. The three paths branch up to higher level parent terms, leading back to the root term "biological process". Arrows between terms represent 'is-a' relationships. The hierarchy of each ontology is structured as a directed acyclic graph (DAG), with the more specific, lower level terms, having one or more direct parent terms associated with it. This is because a single child term can be defined in a number of different contexts. Terms shown: alcohol biosynthetic process (GO:0046165), monosaccharide biosynthetic process (GO:0046364), monosaccharide metabolic process (GO:0005996), carbohydrate biosynthetic process (GO:0016051), cellular carbohydrate metabolic process (GO:0044262), carbohydrate metabolic process (GO:0005975), hexose biosynthetic process (GO:0019319), hexose metabolic process (GO:0019318).

Jiang and Conrath proposed an IC based semantic distance, which can be transformed into a similarity measure [6]:

$$sim_{Jiang}(t_1, t_2) = \frac{1}{-\log(p(t_1)) - \log(p(t_2)) + 2 \log(p(LCA)) + 1} \tag{6}$$

For each of the three measures, a higher score indicates a higher semantic similarity between two terms. The lowest score for all three measures is 0.
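As an illustration only, the information content definitions and the three term-level measures described above can be sketched in a few lines of Python. The ontology, the term names and the annotation counts below are hypothetical toy values, not data from the study; `annot(t)` is assumed to be the number of genes annotated directly with term `t`, the natural logarithm is assumed because the text does not state a base, and the zero-denominator guard in `sim_lin` is a choice made for this sketch.

```python
import math

# Hypothetical toy ontology: term -> list of direct parents (not real GO terms).
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"], "D": ["C"]}
# Hypothetical direct annotation counts, playing the role of annot(t).
ANNOT = {"root": 0, "A": 2, "B": 1, "C": 3, "D": 2}


def ancestors(term):
    """All parent terms of `term`, direct or indirect (the term itself excluded)."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def freq(term):
    """freq(t) = annot(t) + sum of annot(c) over all direct or indirect children c."""
    return ANNOT[term] + sum(ANNOT[c] for c in PARENTS if term in ancestors(c))


def p(term):
    """Probability of a term: freq(t) / freq(root)."""
    return freq(term) / freq("root")


def ic(term):
    """Information content: -log p(t) (natural log assumed)."""
    return -math.log(p(term))


def lca(t1, t2):
    """Shared term (either input or a common ancestor) with the highest IC."""
    shared = (ancestors(t1) | {t1}) & (ancestors(t2) | {t2})
    return max(shared, key=ic)


def sim_resnik(t1, t2):
    return ic(lca(t1, t2))


def sim_lin(t1, t2):
    denom = math.log(p(t1)) + math.log(p(t2))
    if denom == 0:  # both terms are the root; guard chosen for this sketch
        return 0.0
    return 2 * math.log(p(lca(t1, t2))) / denom


def sim_jiang(t1, t2):
    distance = -math.log(p(t1)) - math.log(p(t2)) + 2 * math.log(p(lca(t1, t2)))
    return 1 / (distance + 1)


for name, fn in [("Resnik", sim_resnik), ("Lin", sim_lin), ("Jiang", sim_jiang)]:
    print(name, round(fn("D", "C"), 3))
```

In this toy graph, terms "A" and "B" share only the root, so their Resnik similarity is zero, while comparing a term with itself gives the maximum of 1 under both Lin and Jiang.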
The highest score for Lin and Jiang is 1, and Resnik's measure has no upper bound.

These measures are intended to score the similarity between two GO terms, and must be extended to compare genes, each of which can have multiple GO terms. Following the approach of [9], let us compare two gene products g1 and g2. Every term in the direct annotation set for gene g1 is compared against every term in the direct annotation set for gene g2. For each pairwise comparison, if two direct annotations are identical, that term is then considered the LCA. If two direct annotations are not identical, we then retrieve the parent term sets induced for the two annotation terms, and the shared parent term with the highest information content is considered the LCA. The similarity score is then calculated for that pair of terms. The scores generated for all pairs of GO terms are used to produce a final score for the gene pair in one of two ways: i) scores can be averaged across all possible term pairs for the two genes [9] or ii) only the maximum score resulting from all possible term pairs for the two genes is used, as proposed by [19]. We refer to these as the "average" and "maximum" methods in the following. Thus we consider six IC-based measures: Lin, Jiang and Resnik, each with average and maximum variants.

Vector Space Model Measures

These similarity measures first require an m × n gene-term annotation matrix be compiled, where m is the total number of genes in the corpus and n is the total number of GO terms. Each row in the matrix represents a gene vector of its annotations. Each vector is binary valued, with 1 representing the presence of the GO term in the gene's annotation and 0 representing its absence.
The Cosine similarity can be calculated using the vectors $v_1$ and $v_2$ for the two genes in the pair [14]:

$$sim_{cos}(g_1, g_2) = \frac{v_1 \cdot v_2}{|v_1| |v_2|} \tag{7}$$

A variation on the Cosine measure, which has been previously used in ontology-based similarity, first generates a weight, $w_t$, for each GO term based on the frequency of its occurrence in the corpus [12]:

$$w_t = \log(N / n_t) \tag{8}$$

where N is the total number of genes in the corpus and $n_t$ is the number of genes in the corpus annotated with that term t. These weights replace the non-zero values in the binary vector and the cosine measure is then calculated as in (7). We refer to this method as the Weighted Cosine measure in this study.

Finally, Huang et al. also propose a vector-based similarity measure integrated in the DAVID Gene Functional Classification Tool [13]. For a given gene pair, binary gene vectors are extracted from the compiled matrix as described above. Kappa statistics are then used to measure co-occurrence of annotation between gene pairs. The algorithm can be found in detail in [13].

Term Overlap Measure

When calculating the term overlap between two gene products we consider the set of all direct annotations for each gene and all of their associated parent terms (excluding the root of the hierarchy) as a gene product annotation set, $annot_{g_1}$ for gene g1 and $annot_{g_2}$ for gene g2. The term overlap score for two genes is then calculated as the number of terms that occur in the intersection set of the two gene product annotation sets:

$$sim_{TO}(g_1, g_2) = |annot_{g_1} \cap annot_{g_2}| \tag{9}$$

As with the other measures, the higher the score the higher the similarity between two genes. The lowest term overlap score is zero and there is no upper bound. The similarity of genes where one or both lacks GO terms is zero, though as mentioned these were not considered here.
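A minimal Python sketch of the term overlap calculation follows, shown next to a cosine score on binary vectors for comparison. The ontology and the gene annotations are hypothetical toy values loosely modelled on the hexose example, and the edges are illustrative rather than taken from GO. The cosine function assumes that each binary gene vector marks the same annotation set used for TO (direct terms plus parents), in which case the dot product is simply the size of the intersection.

```python
import math

# Hypothetical toy ontology: term -> list of direct parents (edges are illustrative).
PARENTS = {
    "root": [],
    "metabolic": ["root"],
    "hexose_metabolic": ["metabolic"],
    "biosynthetic": ["metabolic"],
    "hexose_biosynthetic": ["hexose_metabolic", "biosynthetic"],
    "transport": ["root"],
}


def annotation_set(direct_terms, parents=PARENTS, root="root"):
    """Direct annotations plus all of their parent terms, excluding the root."""
    result = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term != root and term not in result:
            result.add(term)
            stack.extend(parents[term])
    return result


def term_overlap(direct1, direct2):
    """TO: number of terms in the intersection of the two annotation sets."""
    return len(annotation_set(direct1) & annotation_set(direct2))


def cosine_binary(direct1, direct2):
    """Cosine of two binary vectors built from the annotation sets.

    For 0/1 vectors the dot product equals the intersection size and each
    vector norm equals the square root of the set size.
    """
    a, b = annotation_set(direct1), annotation_set(direct2)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


gene_a = ["hexose_biosynthetic"]  # hypothetical direct annotations
gene_b = ["hexose_metabolic"]
gene_c = ["transport"]

print(term_overlap(gene_a, gene_b))   # the two genes share two non-root terms
print(term_overlap(gene_a, gene_c))   # only the excluded root is shared
print(round(cosine_binary(gene_a, gene_b), 3))
```

Because the score is a single set intersection, no corpus-wide statistics are needed before two genes can be compared.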
A variant method we also considered is the normalized term overlap (NTO), in which the term overlap score is divided by the annotation set size for the gene with the lower number of GO annotations:

$$sim_{NTO}(g_1, g_2) = \frac{sim_{TO}(g_1, g_2)}{\min(|annot_{g_1}|, |annot_{g_2}|)} \tag{10}$$

Traditional cardinality-based similarity measures such as Jaccard and Dice [14] are computed similarly to NTO, but use the union or sum, respectively, of the two gene annotation set sizes as the normalizing factor. The Czekanowski-Dice distance used in the functional analyses module of GOToolBox [20] calculates a distance by normalizing the number of symmetric differences between the two gene term sets with the sum of the intersection and union sets. In this way, the scale for the distance is reversed from that of NTO, with genes having no GO terms in common scoring a distance of 1 and highly functionally related genes scoring closer to 0. Since these measures are very similar to NTO, we chose not to include them in our study.

Sequence analysis

The NCBI Consensus Coding Sequence (CCDS) protein sequences were obtained from the NCBI FTP site ftp://ftp.ncbi.nih.gov/blast/. We used the NCBI blast suite program "bl2seq" to analyze the similarity between the protein sequences for each of the 100,000 gene pairs [21]. Similarity for this analysis was measured using the bit score values. CCDS did not include sequences for some of the genes considered, yielding similarity scores for 67,179 pairs.
Correlations and their statistical significance were determined using the cor.test function in R [22].

Results

For each data set (consisting of randomly selected pairs of mouse genes), we computed similarity scores using TO, NTO, each of the IC-based similarity measures (Resnik, Lin and Jiang, using both the "average" and "maximum" variants for each), and each of the three vector-based measures (Cosine, Weighted Cosine and Kappa) for a total of eleven measures. We then sought to see how well the scores generated from the different measures correlated with one another. A correlation analysis was repeated for several sets of 10,000 randomly chosen gene pairs. We found the variation between the results for each of the sets to be negligible (Additional file 1), thus only data from a final 100 k gene pair set was studied in detail.

Figure 2A and 2B are heat-map representations of the Pearson and rank correlations, respectively, among each of the eleven measures. Some numerical data are shown in Table 1 (the full data set can be found in Additional file 2). Several trends are evident: all of the measures are positively correlated with TO, and furthermore the other measures are also correlated with each other. The relationship between TO and Resnik-maximum was amongst the strongest of all TO measure correlations and also relatively linear, giving high values of 0.87 and 0.77 for the rank and Pearson correlations, respectively. Both Cosine and Kappa methods also showed strong correlations with TO, with slightly higher rank values of 0.89 and 0.90 respectively. TO shows notably lower correlations with the Lin and Jiang measures. In contrast, NTO showed reasonably high correlations with all measures, including Lin and Jiang's (Additional file 2). NTO and TO are also highly correlated (rank correlation 0.82, Table 1).
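For readers who want to reproduce this kind of comparison, the sketch below computes Pearson and Spearman rank correlations between two lists of scores. The study itself used the cor.test function in R; this is a plain-Python stand-in that omits significance testing, and the scores are invented for illustration rather than taken from the 100 k set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented scores for five gene pairs (illustration only).
to_scores = [0, 2, 3, 10, 25]
resnik_max_scores = [0.0, 0.4, 0.5, 1.0, 1.2]

print(round(pearson(to_scores, resnik_max_scores), 3))
print(round(spearman(to_scores, resnik_max_scores), 3))
```

In this toy data the two scores rise together but not linearly, so the rank correlation is higher than the Pearson correlation, which is the same pattern Table 1 shows for most measures.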
Figures 3 and 4 present scatter plots of TO plotted against the six IC-based measures (similar figures can be found for the vector based measures and NTO in Additional file 3 and 4, respectively). The higher correlations of TO with the measures computed using the "maximum" method (Figure 3) than "average" (Figure 4) are evident. We also noted that the Lin and Jiang measures yield a number of gene pairs with similarity values of 1.0 over a wide range of TO values, especially with the "maximum" method (30% of all gene pairs; Figures 3B and 3C). We also present scatter plots for the "average" variants of the IC-based methods to illustrate how they compare against one another (Figure 5).

Figure 2: Measuring correlation between similarity scores. Pearson (A) and Spearman rank correlation (B) values were calculated to measure the degree of agreement between scores generated using each of the various measures. Scores were generated for the 100 k dataset for each method. Correlation was evaluated for scores between all possible pairs of measures.

We next considered whether semantic similarity is related to another measure of functional similarity, protein sequence similarity (Figure 6 and Additional file 5). As previously reported, the Resnik-average score was positively rank-correlated with BLAST scores (0.086, p < 10^-16). Resnik-maximum showed a higher correlation, and TO showed the highest correlation with BLAST scores (rank correlation 0.125, Table 2). We note that Lord et al. (2003) reported correlations of up to ~0.6.
This discrepancy appears to be explained by the binning Lord et al. performed prior to computing correlations.

To help understand the basis for the good agreement between TO/NTO and the information-content measures, we studied the relationship between the location of a term in the GO hierarchy and information content. This is relevant because TO and NTO are fundamentally based on location in the hierarchy. As shown in Figure 7A, the position of a term in the hierarchy is correlated with information content: terms with few parents (near the root of the hierarchy) tend to have low IC (rank correlation 0.26). This is expected because IC increases monotonically as one follows parent-child relationships. The overall good agreement suggests that depth in the hierarchy is a reasonable surrogate for IC. However, the cluster of data points occupying the upper left diagonal of Figure 7A shows that there are many terms near the root with high information content. Figure 7B displays the relationship between IC and the number of child terms (rank correlation 0.68). While the trend is similar to that for the number of parents, there are few cases of terms with many children and high IC. Exceptions can be found, for example "viral reproduction" (GO:0016032) which has 157 children, but an IC of 9.3.

Finally, we examined the running times of the various algorithms on the 100 k dataset (Table 3). The TO measure is over 10 times as fast as the IC-based measures. These times exclude the cost of computing the IC values from the corpus (>1 hour, primarily due to the cost of database lookups).

Discussion

In this work we conducted a detailed study of a measure of gene functional similarity based on Gene Ontology terms, Term Overlap. We found that TO compares very well to other semantic similarity measures, and is easier and faster to compute. This suggests that TO can be used as an alternative to the more complex measures that have been proposed.
In addition we demonstrate that in general, the various measures are all highly correlated, with some important exceptions. Here we discuss some of the reasons for differences in performance among the methods.

We find that with the IC-based methods, the scores from the TO correlate best with Resnik-maximum and Resnik-average scores. Recent studies have shown that the similarity measure proposed by Resnik out-performs the Lin and Jiang methods in terms of correlation with gene sequence similarities [9,10] and gene expression profiles [23], consistent with our findings. Our sequence analysis findings indicate that TO correlates with sequence similarity comparably to Resnik (even slightly more strongly). This suggests that TO is at least as reflective of "true" gene function as the measure used by Lord et al. (2003). However, we point out that overall correlations are low; the difference might be corpus-specific, and there is no unassailable "gold-standard" for evaluating semantic similarity measures.

Figure 3: Relationship between Term Overlap and alternate methods ("maximum" variants). The data for the 100 k set of randomly selected gene pairs are presented as raw points (light grey) as well as density (hexagons, plotted using the R package "hexbin"). Darker colors indicate increasing density of points. The plots represent TO vs. Resnik (A), Jiang (B), and Lin (C). Axes: Term Overlap (x) against Resnik Max Score, Lin Max Score and Jiang Max Score (y).

We found that the Lin and Jiang measures correlate relatively poorly with most of the other methods, while being most similar to NTO. It has been previously shown that the Lin and Jiang methods suffer from what is referred to as the "shallow annotation problem" [7,23].
This is because Lin and Jiang both use the IC of the query genes as well as the LCA. As a result, genes that are annotated at only very shallow levels of the GO hierarchy (e.g., "metabolism") can yield very high semantic similarities. Such pairs are therefore not distinguishable from high-scoring pairs of genes that have "deep" annotations.

Figure 4: Relationship between Term Overlap and alternate methods ("average" variants). The plots represent TO vs. Resnik (A), Jiang (B), and Lin (C). For details see Figure 3. Axes: Term Overlap (x) against Jiang Score, Lin Score and Resnik Score (y).

Figure 5: Comparisons among information-content methods. Similarity scores were calculated using the averaged variant of Resnik, Lin and Jiang similarity measures for every gene pair in the 100 k set of randomly selected gene pairs. The scores generated were then plotted against each other to illustrate the correlation amongst them, A) Resnik versus Lin; B) Resnik versus Jiang; C) Lin versus Jiang.

The effect of shallow annotations can be seen in Figures 3 and 4, where both Lin and Jiang measures have large numbers of points with scores of 1.0, distributed over a wide range of TO scores, including very low TO values. Thus, although the Lin and Jiang methods attempt to capture the nature of the hierarchy in their methods, the effect of the shallow annotation problem shows that these methods can produce misleading results. For example, the gene pair Akap1 (A kinase anchor protein 1) and Bbs9 (Bardet-Biedl syndrome 9) scores a similarity of 1.0 using the Lin and Jiang methods.
Akap1 is a trans-membrane protein that participates in second messenger signalling [24,25], and has 29 GO terms associated with it (including parents). The function of Bbs9 on the other hand is poorly understood [26] and it has only 3 associated terms, including "extracellular space", which it happens to share with Akap1. Despite this weak link, according to both Lin and Jiang methods these genes are not only similar but they generate the maximum attainable score for those measures.

Scrutiny of the data leads us to believe that the NTO scores also suffer from the shallow annotation problem. This is because even if a gene is annotated with only one term, and it shares that term with another gene, the NTO is 1.0. Using the previously mentioned gene pair Akap1 and Bbs9, the NTO measure generates a high score of 0.75, whereas the TO measure generates a more appropriate low score of 3.0. The Jaccard, Dice and Czekanowski-Dice methods [14,20], which are computed in a similar fashion to NTO but use larger normalizing factors, will ameliorate the shallow annotation problem. However, shallow annotation artifacts will persist when comparing pairs of genes where both have few terms. For this reason we favour the raw TO over the normalized overlap measures. On the other hand, the normalized measures have the useful property of yielding values restricted between zero and one.

Table 1: Correlation of TO scores with various other similarity measures

Measure           Pearson   Spearman
Resnik            0.56      0.77
ResnikMax         0.77      0.87
Lin               0.47      0.74
LinMax            0.64      0.83
Jiang             0.36      0.65
JiangMax          0.58      0.78
NTO               0.65      0.82
Kappa             0.76      0.89
Cosine            0.75      0.82
Weighted Cosine   0.51      0.90

Pearson and rank correlation values were calculated for the scores generated by TO versus each of the ten other measures for the final dataset R100K, containing 100,000 random gene pairs.

Figure 6: Comparing sequence and semantic similarity. A BLAST sequence analysis was performed to calculate a sequence similarity score for each gene pair in the 100 k set for which sequence data was available. Of those gene pairs we considered only the 53,264 which obtained a score greater than zero. Pairs were binned by bit score and the average in each bin plotted against (A) TO scores and (B) Resnik-max scores. Thick horizontal lines indicate medians, boxes indicate interquartile ranges, and whiskers are drawn at 1.5 times the quartile, or the maximum (whichever is closer to the median). Axes: ln[Bit Score] (x) against Term Overlap (A) and Resnik Max Score (B).

Others have suggested that using IC values can induce substantial artifacts for terms that are rarely used, but not necessarily very specific [7]. This problem will be particularly acute for organisms with sparse GO annotations. We refer to this problem as the "corpus bias". For example, a high level, general term such as "cell growth" (GO:0016049) should have a high background probability and low IC. However, if the corpus does not contain many genes involved in cell growth, the term will score a low probability and will be incorrectly identified as a specific, high-IC term. As shown in Figure 7A, most terms near the top of the hierarchy have low IC. However, there are exceptions, as noted above. These terms located near the top of the hierarchy but with high IC have two potential causes. First, it is possible that depth in the hierarchy is not always related to semantic information content. In other words, there may be terms near the top of the hierarchy that are as "specific" as terms deep in the hierarchy. Second, there may be true corpus bias where terms with many children are rarely used. We can partly distinguish between these possibilities by examining the relationship between IC and the number of child terms (Figure 7B).
This showed that high-IC terms almost always have few child terms, arguing against corpus bias. It is still unclear whether the cases of terms with few children and few parents are truly "specific" terms or just parts of the GO which are not yet fully fleshed out. For example, the term "chemoattractant activity" (GO:0042056) is a direct child of the root of the molecular function hierarchy, but has no child terms.

TO appears to avoid the shallow annotation problem. However, it may be questioned whether depth in the hierarchy is a strong enough correlate with IC. TO is in effect based entirely on how many parents a term has, with no consideration of the frequency of use by annotators. Thus two genes annotated with low level terms falling far from the root term, and sharing all parent terms, would obtain a high similarity score. On the other hand, two genes annotated with high level terms falling closer to the root term, also sharing all parent terms, will obtain a low similarity score. This will work so long as the depth in the hierarchy is a reasonably uniform measure of semantic specificity (in effect, information content). The data in Figure 7A suggest this might not be an entirely safe assumption, but the overall good behaviour of the overlap statistic indicates that the assumption is not completely without basis.

Table 2: Rank correlation of semantic similarity with sequence similarity

Measure      Rank correlation
TO           0.125
NTO          0.112
Resnik       0.086
ResnikMax    0.110
Lin          0.088
LinMax       0.104
Jiang        0.078
JiangMax     0.10
Kappa        0.106
Cosine       0.113
Wt.cosine    0.114

Rank correlation values were calculated for the semantic similarity scores and the sequence similarity bit scores for each gene pair.

Figure 7: The relationship between the depth of the GO hierarchy and IC. For each GO term, we retrieved the total number of parent terms leading back to the root term, and the total number of children terms. Each dot on the plot represents a GO term, for a total of 8424 terms that have been used in at least one gene annotation within the mouse corpus. The log10 values are plotted each against the pre-calculated IC for that term; jitter was added to each point to reduce overlapping data. Axes: log[Number of parent terms] (A) and log[Number of children terms] (B) against Information content.

The scores generated using VSM based measures also show a high correlation with TO (>0.8). These methods rely upon a gene-term annotation matrix that essentially flattens the redundant and structured GO terms into a collection of 'independent' terms. The Kappa and Cosine methods weight each of the GO terms equally by using a binary valued matrix, which would explain their high correlation with each other, and with TO. On the other hand, using weighted values in the matrix delivers scores that still correlate fairly well with TO (higher than the correlation with Resnik), as we found with the Weighted Cosine measure. Both Cosine measures also correlate very well with the NTO measure. This is not surprising, since the dot product of two gene vectors equates to the term overlap, and thus the two measures differ only in the factor by which they normalize the TO value. This is also in general agreement with results presented recently by Chagoyen et al. (8th Spanish Symposium on Bioinformatics and Computational Biology, 2008).

TO (and NTO) also differs from the other measures in algorithmic complexity.
First, computing the IC for each term is expensive, as is the compilation of the gene-term annotation matrix required for the vector-based methods. The IC requires obtaining counts of each term in GO for all genes, as does the Weighted Cosine method. Making matters worse, in both cases the data generated should in principle be recomputed for the entire database every time the annotations are updated. TO completely avoids this step. In addition, the computation of overlap for a pair of genes is O(N), where N is the number of terms. Computation of the IC-based measures requires pairwise comparison of all terms for the pair of genes, which is O(N^2).

Table 3: Running times for some of the similarity measures

Measure                    Time (s)
Term Overlap               226.8
Normalized Term Overlap    260.5
Resnik                     2979
Lin                        3092
Jiang                      3165

Time shown is total time taken for the 100,000 gene pair set R100K and excludes the cost of computing the IC values from the corpus. Averaged and maximum variants of the semantic similarity measures show no difference in time, thus only one value is displayed.

In summary, given the generally high correlation among the various measures, it seems reasonable to use the simplest and fastest method when high throughput is necessary. Therefore we expect that TO will be of use for rapidly evaluating algorithms predicting gene functional relationships, and in exploring high-throughput experimental data. For example, in Lee et al. (2004), TO was used to evaluate the performance of an algorithm for predicting gene function on the basis of expression profile similarity. TO is fast enough to use in on-line applications, and is used in the Gemma system http://www.bioinformatics.ubc.ca/Gemma to display gene semantic similarities (Hamer et al., in preparation). Semantic similarity computed by TO could be used to evaluate and examine results of high-throughput studies such as yeast 2-hybrid screens or proteomic studies.

Authors' contributions
PP proposed the overlap measure, and oversaw design and execution of the study. MM implemented algorithms, designed and performed experiments, and performed all of the analysis. PP and MM wrote the manuscript. Both authors read and approved the final manuscript.

Additional material

Additional File 1: TO correlation scores for multiple sets.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-327-S1.doc]

Additional File 2: Correlation values amongst various similarity measures.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-327-S2.doc]

Additional File 3: TO scores versus scores generated using vector-based measures. For every gene pair in the 100 k set of gene pairs, the term overlap was calculated and plotted against the scores generated by Cosine, Kappa, and Weighted Cosine measures.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-327-S3.doc]

Additional File 4: NTO scores versus TO, Resnik, Lin, and Jiang scores. For every gene pair in the 100 k set of gene pairs, the normalized term overlap was calculated and plotted against the term overlap scores (A), the averaged variant scores of each of the three semantic similarity measures (B-D), and the maximum variant scores of each of the three semantic similarity measures (E-G).
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-327-S4.doc]

Additional File 5: Comparing sequence and semantic similarity ("average" variants). A BLAST sequence analysis was carried out to calculate a sequence similarity score for each gene pair in the 100 k set for which sequence data was available. Of those gene pairs we considered only the 53,264 which obtained a score greater than zero. Intervals were taken along the x-axis ln [Bit Score] and (A) Resnik, (B) Lin and (C) Jiang scores for the corresponding gene pairs were averaged and plotted.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-327-S5.doc]

BMC Bioinformatics 2008, 9:327 http://www.biomedcentral.com/1471-2105/9/327

Acknowledgements
Supported by a Canadian Institutes for Health Research/Michael Smith Foundation for Health Research graduate scholarship to MM, a Michael Smith Foundation for Health Research Career Investigator Award to PP and NIH grant GM076990 to PP. We thank Kelsey Hamer for technical support.

References
1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
2. Chagoyen M, Carmona-Saez P, Gil C, Carazo JM, Pascual-Montano A: A literature-based similarity metric for biological processes. BMC Bioinformatics 2006, 7:363.
3. Del Pozo A, Pazos F, Valencia A: Defining functional distances over Gene Ontology. BMC Bioinformatics 2008, 9(1):50.
4. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14(6):1085-1094.
5. Lim WK, Wang K, Lefebvre C, Califano A: Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics 2007, 23(13):i282-288.
6. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7:302.
7. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23(10):1274-1281.
8. Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425-1433.
9. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275-1283.
10. Lord PW, Stevens RD, Brass A, Goble CA: Semantic similarity measures as tools for exploring the gene ontology. Pac Symp Biocomput 2003:601-612.
11. Baeza-Yates R, Ribeiro-Neto B: Modern Information Retrieval. New York, Harlow, England: Addison-Wesley; 1999.
12. Chabalier J, Mosser J, Burgun A: A transversal approach to predict gene product networks from ontology-based similarity. BMC Bioinformatics 2007, 8:235.
13. Huang da W, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA: The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 2007, 8(9):R183.
14. Popescu M, Keller JM, Mitchell JA: Fuzzy measures on the Gene Ontology for gene product similarity. IEEE/ACM Trans Comput Biol Bioinform 2006, 3(3):263-274.
15. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007:D61-65.
16. Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995.
17. Lin D: An information-theoretic definition of similarity. In 15th International Conf on Machine Learning. Morgan Kaufmann, San Francisco, CA; 1998:296-304.
18. Jiang JJ, Conrath DW: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. ROCLING X: 1997; Taiwan 1997.
19. Schlicker A, Albrecht M: FunSimMat: a comprehensive functional similarity database. Nucleic Acids Res 2008:D434-439.
20. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5(12):R101.
21. Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174(2):247-250.
22. R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008.
23. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinform 2005, 2(4):330-338.
24. Dong F, Feldmesser M, Casadevall A, Rubin CS: Molecular characterization of a cDNA that encodes six isoforms of a novel murine A kinase anchor protein. J Biol Chem 1998, 273(11):6533-6541.
25. Livigni A, Scorziello A, Agnese S, Adornetto A, Carlucci A, Garbi C, Castaldo I, Annunziato L, Avvedimento EV, Feliciello A: Mitochondrial AKAP121 links cAMP and src signaling to oxidative metabolism. Mol Biol Cell 2006, 17(1):263-271.
26. Nishimura DY, Swiderski RE, Searby CC, Berg EM, Ferguson AL, Hennekam R, Merin S, Weleber RG, Biesecker LG, Stone EM, Sheffield VC: Comparative genomics and gene expression analysis identifies BBS9, a new Bardet-Biedl syndrome gene. Am J Hum Genet 2005, 77(6):1021-1033.

