UBC Faculty Research and Publications

Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity Na, Dokyun; Son, Hyungbin; Gsponer, Jörg Dec 11, 2014

Full Text

SOFTWARE (Open Access)

Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity

Dokyun Na1,2, Hyungbin Son2 and Jörg Gsponer1*

Na et al. BMC Genomics 2014, 15:1091
http://www.biomedcentral.com/1471-2164/15/1091

Abstract

Background: Communalities between large sets of genes obtained from high-throughput experiments are often identified by searching for enrichments of genes with the same Gene Ontology (GO) annotations. The GO analysis tools used for these enrichment analyses assume that GO terms are independent and the semantic distances between all parent–child terms are identical, which is not true in a biological sense. In addition these tools output lists of often redundant or too specific GO terms, which are difficult to interpret in the context of the biological question investigated by the user. Therefore, there is a demand for a robust and reliable method for gene categorization and enrichment analysis.

Results: We have developed Categorizer, a tool that classifies genes into user-defined groups (categories) and calculates p-values for the enrichment of the categories. Categorizer identifies the biologically best-fit category for each gene by taking advantage of a specialized semantic similarity measure for GO terms. We demonstrate that Categorizer provides improved categorization and enrichment results of genetic modifiers of Huntington’s disease compared to a classical GO Slim-based approach or categorizations using other semantic similarity measures.

Conclusion: Categorizer enables more accurate categorizations of genes than currently available methods. This new tool will help experimental and computational biologists analyzing genomic and proteomic data according to their specific needs in a more reliable manner.

Keywords: Gene ontology, Categorization, Enrichment analysis, Semantic similarity, Neurodegenerative diseases

* Correspondence: gsponer@chibi.ubc.ca
1Department of Biochemistry and Molecular Biology, Centre for High-throughput Biology, University of British Columbia, 2125 East Mall, Vancouver, BC V6T 1Z4, Canada
Full list of author information is available at the end of the article

© 2014 Na et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

During the last decade, high-throughput technologies have allowed scientists to collect large sets of genomic and proteomic data. These data sets are then often screened for groups of genes that are over-represented or depleted when compared to a reference set or the entire genome/proteome of a specific organism. Therefore, great efforts have been made to develop computational methods to translate the flourishing raw data into meaningful biological knowledge.

Gene Ontology (GO) is a dictionary of controlled biological vocabularies to annotate genes at different levels of granularity [1]. The GO dictionary can be envisioned as a graph that has, in a first approximation, the architecture of an upside down tree in which connected nodes, i.e., related GO terms, have a parent–child relationship and all nodes can be connected back to the three root nodes (biological process, molecular function and cellular component). This well-structured knowledge has been utilized to identify specific biological processes or functions enriched within sets of genes. There are many tools that can carry out this task: David, FuncAssociate, BiNGO, etc. [2-5]. These tools output lists of all individual GO terms that are significantly enriched in the analyzed data set. However, listed GO terms often refer to the same biological process. In addition, many GO terms are highly specific and difficult to interpret in the larger biological context that is investigated. As the research of most scientists is focused on a specific area, scientists are often less interested in the enrichment of specific GO terms in a set of genes but more in the enrichment of all GO terms that are associated with their area of interest. In order to reach this goal, researchers define categories of interest and manually assign genes into one of these categories according to their GO annotations [6,7]. This is a laborious process and categorization results may differ from person to person.

There are tailored cut-down versions of GO, GO Slims, to give a broad overview of the ontology content without the details [8], and there is also a script that automatically maps GO terms to GO Slim terms [9]. However, this script simply searches for ancestor GO terms that are also included in the GO Slim list and assumes that the GO graph has the architecture of a perfect tree in which all parent–child GO pairs are separated by the same distance. This is not true. Not all parent–child relationships in the hierarchical structure of GO have the same closeness in a biological sense. For instance, the relationship of ‘protein serine/threonine kinase activity’ (GO:0004674) and ‘IkappaB kinase activity’ (GO:0008384) is definitely much closer than that of ‘biological process’ (GO:0008150) and ‘cellular process’ (GO:0009987). Moreover, the GO architecture is more accurately represented by a directed acyclic graph, not a tree.
Therefore, there can be more than one path from a GO term up to the root node and one GO term can be mapped to multiple GO Slim terms.

As a result of the differences in parent–child closeness, a child term having two parent terms may be much closer to one of them in a biological sense. An example for such a case is shown in Figure 1A. According to the GO Slim mapping script, the term ‘gamma-aminobutyric acid import’ (G8) can belong to ‘amino acid imports’ (G4) and ‘gamma-aminobutyric acid transport’ (G5), since the graphical distances to both parent terms are identical (Figure 1B). However, the term ‘gamma-aminobutyric acid import’ is closer to ‘gamma-aminobutyric acid transport’ than to ‘amino acid imports’ in the biological sense. Accordingly it is more reasonable to say that ‘gamma-aminobutyric acid import’ belongs to ‘gamma-aminobutyric acid transport’, which cannot be deduced from the graphical distance alone. Hence, the biological closeness of GO terms, not the graphical distance, should be utilized for reliable GO analyses. Another problem with the tools mentioned before is that they assume independence of GO terms. Functions and processes encompassed by a specific GO term are a subset of the functions and processes encompassed by its parent term. Thus, GO terms cannot be independent but are associated. The use of redundant terms can lead to overestimation or underestimation of relevant biological processes or functions. Therefore, GO annotations should be analyzed in the context of the hierarchical structure of GO and by taking semantic distances between terms into account.

Figure 1 Categorization based on the graphical structure of GO (GO Slim approach). A. Section of the GO structure. GO terms of interest are colored in green and categories to which GO terms are assigned are colored in blue. In this example, each of the blue GO terms refers to a particular category. B. Mapping results by the GO Slim approach, which takes only the graphical structure of GO into account.

Several semantic similarity measures have been developed recently in order to approach some of these problems [10,11]. As the name indicates, they provide a measure for how close two annotation terms are, and their calculation is based on the information content (IC), i.e., the specificity, of each annotation term. In the determination of the specificity of annotations, it is assumed that more frequently used terms are less specific [12-14]. Different types of semantic similarity measures have been introduced [10,11,15-17] and used in very diverse applications including the clustering of microarray data [18], the comparison of sets of genes and proteins from different species (GOTax) [19], the assessment of functional similarity of genes or proteins (G-SESAME) [20,21] and the identification of new disease genes based on known disease annotations (MedSim, ACGR) [22,23]. However, if one uses semantic similarity measures for the categorization of genes into groups of interest, one has to consider that categorization is not about predicting how close two terms are but assessing how well two terms go together.

To meet these demands, we developed Categorizer, which assigns genes to pre-defined biological functions or processes based on their GO annotations. As biological functions or processes of interest are different from field to field, this new tool allows users to define their own categories.

Implementation

Categorizer was implemented using a platform-independent language, Python, and thus it can run on any operating system. For the user’s convenience, we also provide a pre-compiled version of Categorizer that runs on the Windows operating system.

The overall scheme of the approach implemented in Categorizer is shown in Figure 2A. Categorizer employs three steps to categorize genes.
(i) The IC scores of GO terms are calculated from the occurrence of GO annotations in UniProtKB-GOA [24]. This score denotes the biological relevance of a GO term, i.e., the more frequently a term is used the less relevant the term is. (ii) A semantic similarity score is calculated for all GO parent–child pairs based on their IC and the hierarchical structure in GO. (iii) According to the semantic similarity scores, genes with annotations are assigned to biologically appropriate categories. In the following sections, the details of these three steps are described.

Information content (IC)

An IC score has to be assigned first to all GO terms in order to calculate semantic similarities. The IC represents the significance of GO terms in a biological sense. We assume that more frequently used GO terms are less significant [25]. We therefore counted all the occurrences of GO terms in a reference database. Here, we used all the proteins and their annotations in UniProtKB-GOA [24]. The hierarchical structure of GO was taken into account when counting occurrences. For example, when a protein has the annotation of ‘G21’ in Figure 2B, we also counted its parent terms, ‘G11’ and ‘G0’. When another protein has the annotation of ‘G22’, we also increased the occurrences of ‘G11’ and ‘G0’. The overall occurrences in the given example are then ‘G0’ (+2), ‘G11’ (+2), ‘G21’ (+1) and ‘G22’ (+1). The occurrences are then divided by the number of annotations (which is two in the given example) in order to get occurrence probabilities of GO terms, p(x):

$$p(x) = \frac{\text{number of all occurrences of } x}{\text{number of occurrences of the root node of } x}$$

In this equation, orphan GO terms that have not been used in UniProtKB-GOA were assigned the lowest p(x) value of the terms, meaning the maximal IC. The probability is finally converted into an IC score that denotes the significance of each GO term:

$$I(x) = -\log(p(x))$$

In the given example, I(G0) is zero, which means that annotations with G0 are biologically meaningless.
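The counting scheme of the G0/G11/G21/G22 example can be illustrated with a minimal sketch. It is not the Categorizer source code: the `PARENTS` dictionary, the function names, the single root, and the use of the natural logarithm are all assumptions made for illustration (the text does not state the base of the logarithm).

```python
import math
from collections import Counter

# Hypothetical toy hierarchy mirroring the G0/G11/G21/G22 example:
# each term maps to its direct parents (the root has none).
PARENTS = {"G0": [], "G11": ["G0"], "G21": ["G11"], "G22": ["G11"]}

def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def information_content(annotations, parents, root):
    """Compute I(x) = -log p(x) from a list of annotated terms."""
    counts = Counter()
    for term in annotations:
        for t in ancestors(term, parents):  # propagate each count upwards
            counts[t] += 1
    # Orphan (never used) terms receive the lowest observed count,
    # i.e. the lowest p(x) and therefore the maximal IC.
    floor = min(counts[t] for t in parents if counts[t] > 0)
    return {t: -math.log((counts[t] or floor) / counts[root]) for t in parents}

# Two proteins, annotated with G21 and G22 respectively.
ic = information_content(["G21", "G22"], PARENTS, root="G0")
```

With these two annotations the root and G11 are each counted twice, so both have p(x) = 1 and an IC of zero, while G21 and G22 have p(x) = 0.5.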
As the calculation of the IC score I(x) for all GO terms is computationally expensive, Categorizer comes with a file that contains pre-computed values. In addition, we also provide a script with which users can pre-compute other I(x) values taking their annotations of interest, e.g. UniProtKB-GOA (no IEA), Human GOA, or customized annotation files. Synthetic IC scores for each term in our example are shown in Figure 2B.

Figure 2 Approach used in Categorizer to assign genes to categories. A. Three steps used in the categorization process: (i) Information content calculation, (ii) semantic similarity score calculations for parent–child pairs and (iii) categorization according to the semantic similarity scores. See the main text for details. B. Illustrative (synthetic) example for the calculation of semantic similarity scores. Information content scores (I) are shown for each GO term. G0 is a root term. In this example, a user defined two categories (A and B) and assigned G22 to category A (orange), and G23 to category B (blue). Semantic similarity scores (S) of several terms are also shown.

Semantic similarity

When a specific GO term needs to be categorized, Categorizer searches for its parent terms that are assigned to a category and calculates semantic similarity scores with them. The semantic similarity scores are calculated as follows. IC scores are used to calculate (α) the semantic distance of a category-assigned parent GO term from the root node, (β) the semantic distance of the GO term to be categorized and its category-assigned parent term from their most informative child terms and (γ) the semantic distance between the category-assigned parent term and the GO term to be categorized (Figure 2A). All three scores are then combined in a final semantic similarity score [26]. Given two GO terms, for instance G32 (term to be categorized) and G22 (one of G32’s parent terms) (Figure 2B), we calculate the semantic similarity of these two terms as follows [26]:

Distance of a category-assigned parent GO term from the root node (α)

First, the semantic distance of the category-assigned parent term (G22) to the root term is calculated, which is defined as the difference in IC scores between the two GO terms (x1 and x2):

$$d(x_1, x_2) = |I(x_2) - I(x_1)|$$

$$\alpha = d(r, p)$$

where r is the root term and p is the parent. Thus, the distance of G22 from the root term is α = 12.20.

Distance from the most informative child terms (β)

Next, the average distance of the GO term to be categorized (G32) and its category-assigned parent term (G22) from their most informative child terms is calculated. The distance is defined as below:

$$\beta = \frac{d(x_1, c_1) + d(x_2, c_2)}{2}$$

where c1 and c2 denote the most informative child node of x1 and x2, respectively. If c1 and/or c2 do not exist, they are set to x1 and x2, respectively.

In our example, G32 has the child term G41 and G22 has the child terms G31, G32, G41, and G43. The most informative child term of G32 is G41 and that of G22 is G43.
Therefore, β = (d(G22, G43) + d(G32, G41))/2 = ((13.90 − 12.20) + (13.17 − 12.31))/2 = 1.28.

Distance of a category-assigned parent GO term and a GO term to be determined (γ)

The third step is to calculate the distance between a category-assigned GO term (G22) and a GO term to be categorized (G32), which is defined as:

$$\gamma = d(p, x_1)$$

where p is a parent term assigned to a category and x1 is the term whose category is to be determined.

In this example, γ = d(G22, G32) = (12.31 − 12.20) = 0.11.

Semantic similarity score

The final step is to calculate the semantic similarity score (S) from the three values, α, β, and γ:

$$S(x_1, x_2) = \frac{1}{1 + \gamma} \cdot \frac{\alpha}{\alpha + \beta}$$

whereby 0 ≤ S(x1, x2) ≤ 1.

Consequently, the semantic similarity score between G22 and G32 is

$$S(G22, G32) = \frac{1}{1 + 0.11} \cdot \frac{12.20}{12.20 + 1.28} = 0.815$$

Categorization

Conventional semantic similarity measures were developed to assess how similar two GO terms are, but categorization is about assessing how well a specific term belongs to another term or a group of other terms. Thus, we use semantic similarity in the categorization process but require that a categorized term is a child of any term in the assigned category. For instance, two sibling terms, ‘DNA-templated transcription initiation (GO:0006352)’ and ‘DNA-templated transcription elongation (GO:0006354)’, are semantically very similar. They could be categorized to their parent term ‘RNA biosynthetic process (GO:0032774)’ because transcription initiation and elongation are both important steps in RNA biosynthesis. However, they cannot be categorized to each other because transcription initiation and elongation are two different molecular processes. Therefore, Categorizer first determines whether a term to be categorized is a child of one or of several category-assigned terms. If it is the child of only one term that has a category assignment, the similarity score of this parent–child pair is set to 1 and the term is assigned to the corresponding category.
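The worked value S(G22, G32) = 0.815 can be checked with a few lines of Python. This is only a sketch of the α, β, γ formulas given in the text, not the Categorizer code; the `IC` dictionary holds the synthetic values quoted in the example, and the function and parameter names (`parent_mic`, `term_mic` for the most informative child terms) are hypothetical.

```python
# Synthetic IC values quoted in the worked example; the root G0 has I = 0.
IC = {"G0": 0.0, "G22": 12.20, "G32": 12.31, "G41": 13.17, "G43": 13.90}

def distance(x1, x2):
    """Semantic distance d(x1, x2) = |I(x2) - I(x1)|."""
    return abs(IC[x2] - IC[x1])

def similarity(parent, term, root, parent_mic, term_mic):
    """S = 1/(1 + gamma) * alpha/(alpha + beta) for a parent-term pair."""
    alpha = distance(root, parent)             # parent vs. root
    beta = (distance(parent, parent_mic)       # each term vs. its most
            + distance(term, term_mic)) / 2    # informative child term
    gamma = distance(parent, term)             # parent vs. term
    return (1 / (1 + gamma)) * (alpha / (alpha + beta))

score = similarity("G22", "G32", root="G0", parent_mic="G43", term_mic="G41")
print(round(score, 3))  # 0.815
```

Because γ appears in the denominator of the first factor and β in the denominator of the second, the score decreases both when the term is far from its parent and when either term still has much more specific children.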
For a term that is a child of two or more category-assigned terms, Categorizer assesses semantic similarity between this term and all category-assigned terms and then assigns it to the category with the highest semantic similarity score. We demonstrate the procedure in the following examples:

In the example shown in Figure 2B, the user assigned the term G22 to category A and the term G23 to category B. First, Categorizer automatically identifies child terms that belong to a single category only (e.g. G31 → A, G33 → B and G42 → B). For GO terms that have multiple parents, i.e. could belong to two or more categories (G32, G41, and G43), semantic similarity scores are calculated with the GO terms that are assigned to a category and their parents. Then the GO terms of interest are assigned to a category with the highest semantic similarity score.

Assignment example G32

Categorizer calculates pairwise semantic similarities of G32 with all the GO terms that belong to category A and are a parent of G32: S(G22, G32). In the same way, Categorizer also calculates semantic similarities of G32 with the terms in category B: S(G23, G32). Since S(G22, G32) = 0.815 and S(G23, G32) = 0.078, a gene with the annotation of G32 is more likely to belong to category A.

Assignment example G41

Categorizer calculates the pairwise semantic similarities S(G22, G41) and S(G23, G41). Since S(G22, G41) = 0.475 and S(G23, G41) = 0.071, a gene with the annotation of G41 should belong to category A.

Assignment example G43

Categorizer calculates the pairwise semantic similarities S(G31, G43), S(G22, G43), S(G23, G43), and S(G33, G43). Since S(G31, G43) = 0.350, S(G22, G43) = 0.346, S(G33, G43) = 0.291 and S(G23, G43) = 0.064, we can infer that the term G43 is closer to G31 than G33 in a biological sense and accordingly a gene with the annotation of G43 should belong to category A.

One can allow a GO term to go into multiple categories if its semantic similarity score is above a user-defined threshold.
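The assignment rule illustrated by the three examples can be sketched as below. The function `assign_categories` and its input layout are hypothetical, and the handling of a term whose scores all fall below the threshold (it is left unassigned here) is an assumption, since the text does not describe that case.

```python
def assign_categories(scores, threshold=None):
    """Pick categories for one GO term.

    scores maps (category, category_assigned_parent) -> similarity S.
    Without a threshold, the single best category is returned; with a
    threshold, every category whose best score reaches it is returned.
    """
    best = {}
    for (category, _parent), s in scores.items():
        best[category] = max(s, best.get(category, 0.0))
    if len(best) == 1:
        # All category-assigned parents share one category: assign directly.
        return sorted(best)
    if threshold is None:
        return [max(best, key=best.get)]
    return sorted(c for c, s in best.items() if s >= threshold)

# Scores of the G43 example from the text.
g43 = {("A", "G31"): 0.350, ("A", "G22"): 0.346,
       ("B", "G33"): 0.291, ("B", "G23"): 0.064}
print(assign_categories(g43))       # ['A']
print(assign_categories(g43, 0.3))  # ['A']
```

Lowering the threshold to 0.25 in this sketch would place G43 in both categories, because the best category B score (0.291) would then pass.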
For instance, a gene with the annotation of G32 can belong to category A and/or B depending on the semantic similarities and the user-defined threshold. The default threshold is set at 0.3 in Categorizer. This threshold value was determined by calculating an average semantic similarity score for two randomly selected GO terms that are linked directly or indirectly in a parent and child relationship. The average score was 0.10 ± 0.12 and accordingly Categorizer uses 0.3 as a default cutoff value for reliable categorization. After assignment of genes to one or several categories, enrichments of the categories are calculated.

Enrichment analysis

Most GO enrichment analysis tools use simple statistical methods, including hypergeometric distribution, chi-square, Fisher’s exact test, and binomial probability [2]. When these methods are used to assess enrichment of categories, it is assumed that categories are independent. However, one gene may belong to two or more categories, and thus some categories may co-occur more frequently than others. Recently, a random model-based statistical enrichment analysis has been proposed [27]. Following this suggestion, Categorizer first calculates the probabilities of each category in a reference gene set:

$$p(c) = \frac{\sum_{i=1}^{N} f_i(c)}{\sum_{c=1}^{M} \sum_{i=1}^{N} f_i(c)}$$

where p(c) denotes a probability of category c in a reference gene set, N denotes the number of genes in a reference set, M denotes the number of categories, and fi(c) is 1 if the gene i is assigned into the category c, otherwise, 0. Then, the genes in the reference set are randomly assigned to categories according to the category probability, p(c), while retaining the number of assigned categories to each gene in order to keep the degree of categories. L different genes are randomly chosen from the reference, where L denotes the number of screened genes or genes of interest. The frequency of each category is then counted. These randomizations are repeated 1,000 times to obtain an average frequency and standard deviation of each category. With these averages and standard deviations, z-scores for each category are calculated as below:

$$z(c) = \frac{\sum_{i=1}^{L} f_i(c) - \mu(c)}{\sigma(c)}$$

The μ(c) and σ(c) denote an average number and standard deviation of category c obtained from the randomization. The p-values for each category are calculated from the z-scores.

Utilization

For practical categorization, the following key steps are carried out. First, a user has to define categories that are of interest and assign key GO terms to each category; a category is defined as a set of one or more GO terms. The user does not need to assign all GO terms to newly defined categories because Categorizer is capable of identifying semantically close GO terms and, by doing so, decides whether a GO term belongs to a category or not. Alternatively, the user can select among three commonly used category sets that are shipped with Categorizer: biological processes, cellular localizations, and enzyme functions (Table 1). For instance, the “biological processes” set contains 27 sub-categories. To run the software, at least three files (marked in yellow in Figure 3A) should be provided: (i) a category file defining categories and their GO terms, (ii) an annotation file containing gene-to-GO annotations, and (iii) a gene file containing the list of genes to be analyzed. A background gene file has to be provided for the category enrichment analysis.

Table 1 Categories provided with Categorizer
Groups                  Categories
Biological processes    Cell cycle, Cytoskeleton, Metabolism, Transcription, Translation, Protein folding, Proteolysis, Signaling, mRNA processing, Splicing, Transmembrane transport, Intracellular localization, Protein transport, Nuclear transport, Vesicles, Golgi/ER, Mitochondria, Endo- and exo-cytosis, Lysosome, Peroxisome, Ribosomes, Phagocytosis/phagosome, Autophagy, Apoptosis, DNA repair, DNA replication, Receptors
Cellular localization   Cytoplasm, Mitochondria, Golgi, Nucleus, Cytoskeleton, Vesicle/Lysosome, ER, Extracellular
Enzyme functions        Hydrolase, Isomerase, Ligase, Lyase, Oxidoreductase, Transferase

Results and discussion

Genetic modifiers of Huntington’s disease

In order to demonstrate the functionality of Categorizer, we first analyzed the enrichment of specific categories in a set of genes that have been identified as genetic modifiers in Drosophila models of Huntington’s disease (HD). The data was compiled from NeuroGeM, a database of genetic modifiers of neurodegenerative diseases including HD, Alzheimer’s, Parkinson’s, Amyotrophic lateral sclerosis, and several Spinocerebellar ataxia types [28,29]. Modifiers are genes that are capable of modulating disease phenotypes; in this case the neuronal cell death caused by protein aggregation.

We categorized genetic modifiers into 9 groups that are of interest to researchers studying HD: cell cycle (cell cycle, GO:0007049), cytoskeleton (cytoskeleton organization, GO:0007010), metabolism (metabolic process, GO:0008152), protein synthesis (gene expression, GO:0010467), protein folding (protein folding, GO:0006457), proteolysis (proteolysis, GO:0006508), signaling (signal transduction, GO:0007165), splicing (RNA splicing, GO:0008380), and transport (transport, GO:0006810). We loaded the Drosophila gene-to-GO annotation file (downloaded from FlyBase in March 2014), and entered the list of high-confidence genetic modifiers of HD (210 genes) obtained from NeuroGeM. As a reference, we entered all Drosophila genes that had been tested experimentally as modifiers (7896 genes).
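Given a test set and a reference set of this kind, the randomization scheme described under Enrichment analysis can be sketched as follows. This is an illustrative reading of the procedure, not the Categorizer implementation: the function names, the input layout (`reference` as a gene-to-categories dictionary), re-drawing the random categories in every iteration, and the one-sided normal approximation used to turn a z-score into a p-value are all assumptions.

```python
import random
import statistics
from statistics import NormalDist

def category_zscores(reference, test_genes, n_iter=1000, seed=0):
    """reference maps gene -> set of assigned categories (background set).

    Returns {category: z-score} for the genes listed in test_genes.
    """
    rng = random.Random(seed)
    genes = sorted(reference)
    cats = sorted({c for cs in reference.values() for c in cs})
    # Category totals; used as weights they are proportional to p(c).
    weights = [sum(c in reference[g] for g in genes) for c in cats]

    def random_categories(k):
        # Draw k distinct categories with probability proportional to p(c),
        # so every gene keeps its original number of categories.
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choices(cats, weights=weights)[0])
        return chosen

    samples = {c: [] for c in cats}
    for _ in range(n_iter):
        shuffled = {g: random_categories(len(reference[g])) for g in genes}
        picked = rng.sample(genes, len(test_genes))  # L random genes
        for c in cats:
            samples[c].append(sum(c in shuffled[g] for g in picked))

    z = {}
    for c in cats:
        observed = sum(c in reference[g] for g in test_genes)
        mu = statistics.mean(samples[c])
        sigma = statistics.stdev(samples[c])
        z[c] = (observed - mu) / sigma if sigma > 0 else 0.0
    return z

def p_value(z):
    """One-sided p-value under a normal approximation (an assumption)."""
    return 1 - NormalDist().cdf(z)
```

Because each gene keeps its number of categories while the categories themselves are re-drawn according to p(c), the null model preserves how heavily annotated each gene is, which is the property the text refers to as keeping the degree of categories.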
We allowed a gene to be included into multiple categories with the default cutoff value of 0.3.

With this information, Categorizer assigned the genetic modifiers to the defined categories. As shown in Figure 3B, categorization results for each gene are reported in the middle of the graphical user interface (GUI), i.e., the categories that each gene is assigned to are listed together with the semantic similarity score in parentheses. On the left side of the GUI, there is a pie chart that displays the category statistics. In this example, the metabolism category is the largest while the protein folding category is the smallest. On the right side of the GUI, category enrichment analysis results are shown (see Enrichment analysis in Implementation for details). Consistent with the knowledge on the importance of the protein folding machinery in the pathogenesis of neurodegenerative diseases [30,31], the category of protein folding is highly enriched among genetic modifiers of HD, though its genes account for only a small portion of the genetic modifiers. Additionally, the categories of cell cycle, cytoskeleton, protein synthesis and splicing are also enriched among the genetic modifiers of HD. This finding is consistent with recent research data on neurodegeneration and HD in particular [32-39].

Figure 3 Snapshots of the GUI of Categorizer. A. Initial window for setting up the categorization parameters: category definitions, gene annotations, gene test set, background genes, and categorization options. B. Categorization results: category statistics (left), detailed categorization results (middle), and enrichment analysis result (right).

In the given example, we categorized genetic modifiers of HD into broad biological processes and calculated their enrichment. However, if a user is interested in signal transduction, one could define categories such as NF-kappaB cascade or TOR signaling.
It is up to the user to decide how specific or broad the defined categories are.

Comparison of analysis results generated with Categorizer and classical approaches using GO Slim terms

Categorizer is different from GO Slim-based methods in that it identifies biologically relevant categories by using both the graphical structure of GO and the semantic similarities between GO terms. Therefore, we decided to compare the performance of Categorizer with that of the classical methods using GO Slim. First, we assessed the accuracies of category assignment by comparing assignment results of Categorizer and the GO Slim approach for a gold standard set of genes. Second, we evaluated the quality of enrichment analyses by comparing the results of the two approaches for the 210 high-confidence genetic modifiers of HD (used in Figure 3).

Zhang et al. have previously categorized genetic modifiers, which they had identified in a high-throughput screen, manually into a few broad biological processes (categories) based on the GO annotations of the modifiers [6]. We used these categorized genes as a gold standard to evaluate the accuracies of Categorizer and GO Slim-based methods. For this comparison, we customized a GO Slim ontology that is composed of the same nine GO terms that we used for Categorizer (see above). Then, Drosophila GO annotations were mapped to these nine terms with the help of the GO Slim assignment script map2slim.pl [9]. Categorization accuracy was calculated as below:

$$\text{Accuracy} = \frac{\sum_{g=1}^{N} \sum_{c=1}^{NC} F(g, c)}{N \times NC}$$

where N denotes the number of genes in the gold standard set; NC denotes the number of total categories. As categorization is a multi-class problem, it is necessary to include both correct assignment to true categories and correct non-assignment to false categories when calculating accuracy. Briefly, we built two matrices named G and P denoting true answers and predictions, respectively. G(g, c) = 1 if a gene g in the gold standard set belongs to a category c and G(g, c) = 0 if the gene g does not belong to category c. P(g, c) = 1 if a gene g is categorized into a category c by Categorizer or the GO Slim-based method, respectively, and P(g, c) = 0 if the gene g is not categorized into category c. Thus, F(g, c) = 1 if G(g, c) = P(g, c) and 0 otherwise. The categorization accuracies of Categorizer and GO Slim are 81% and 70%, respectively (Figure 4A). As a control, a random predictor was built that randomly assigns genes to three categories. Three categories were chosen because the average number of assigned categories per gene in the gold standard by Categorizer and the GO Slim approach was 2.5. The accuracy of this random predictor is 65%. Since many genes are categorized into only a few categories, the number of correctly non-assigned genes has a big impact on this accuracy measure (hence the high accuracy of the random predictor). In order to deal with this issue, we also calculated the classical Matthews correlation coefficient (MCC), which is a suitable measure for evaluating unbalanced datasets. The MCC values of Categorizer, the GO Slim-based method, and the random predictor are 0.32, 0.17 and 0.0, respectively. Overall, these tests demonstrate that the category assignment of Categorizer is more accurate than a classical categorization with GO Slim terms and, thereby, underline the importance of semantic similarity for this task.

Figure 4 Comparison of results generated by using Categorizer and a GO-Slim-based approach. A. Overall accuracies of Categorizer, GO Slim and a random predictor. The categories of genetic modifiers obtained from a high-throughput screening study (Zhang et al.) were used as a gold standard. B. Enrichment of the 9 categories. All the genes tested for HD were used as a reference. C. Numbers of genes for each category in the test and randomized reference sets. Since we allowed multiple categorization, one gene may appear in several categories. Significantly enriched categories (p < 10−2) were marked as *.

Next, we compared the quality of enrichment analyses of the two approaches. We did so by analyzing the enrichment results of the two methods for the 210 genetic modifiers of HD used in Figure 3. The statistics of categories and enrichment results generated by using Categorizer and by the GO Slim-based approach are shown in Figure 4B and C. The GO Slim approach identified the categories ‘cell cycle’ and ‘cytoskeleton’ as significantly enriched among the genetic modifiers of HD, which is consistent with the results found by Categorizer (Figure 4B). However, the three categories of ‘protein folding’, ‘protein synthesis’ and ‘splicing’ were not identified as enriched categories by the GO Slim approach (p-value > 10−2). This result of the GO Slim approach is in stark contrast to the literature on modifiers of neurodegenerative diseases, including HD. Genes whose products are involved in protein folding, protein synthesis and splicing are found in most screens for modifiers of neurodegenerative diseases that have been carried out to date [31,40-42]. As shown in Figure 4C, both Categorizer and GO Slim assigned the same number of genes to the categories of protein folding and splicing. However, the GO Slim approach assigned more genes to these categories in the randomized model of the reference gene set than did Categorizer. Therefore, the p-values obtained by the GO Slim method were larger than those obtained by Categorizer. Interestingly, Categorizer identifies protein synthesis as enriched in contrast to the GO Slim approach, although Categorizer assigned fewer genes to the protein synthesis category than GO Slim.
The solution to this conundrum is that Categorizer assigned far fewer genes in the reference set to protein synthesis than GO Slim. Overall, these comparisons reveal that Categorizer provides more reliable categorization and enrichment results than the conventional GO analysis method.

Figure 5 Performance comparison of different semantic similarity measures and the one implemented in Categorizer. MCC values were calculated with HD modifiers as done for the comparison of Categorizer and GO Slim in Figure 4A.

Comparison with other semantic similarity measures

As different flavors of semantic similarity measures have been introduced [11], we assessed the accuracy of category assignment as a function of the semantic similarity measure. As a gold standard, we again used the genetic modifiers of HD that were already categorized manually by experts in that field. Hence, we categorized HD modifiers based on different semantic similarity measures and assessed the accuracy of the categorization as we did for the GO Slim-based categorization in Figure 4A. We tested the semantic similarity measures developed by Lin, Resnik, Wang et al., and Zhang et al., as well as XGraSM and GO-Universal [11,15,43-45]. Cutoff values for each measure were determined, as for Categorizer, from the average similarity scores of randomly selected GO terms, and measures were calculated using the annotations in UniProtKB including IEA. A key difference between these metrics is the method used to calculate the IC. The approaches of Lin and Resnik use only the IC of the most informative common ancestor for similarity calculations, while XGraSM uses the averaged IC of all informative common ancestors (for details see [11,15,43-45]). The combined methods, XGraSM-Lin and XGraSM-Resnik, calculate semantic similarities based on Lin's and Resnik's semantic similarity metrics, but use the averaged IC of XGraSM.
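The difference between using the most informative common ancestor and using the averaged IC can be made concrete with a small sketch. The toy ontology, the term names and the annotation probabilities below are hypothetical, and treating an ancestor as "informative" when its IC is greater than zero is an assumption made for this illustration; the cited publications give the exact definitions.

```python
from math import log

# Hypothetical toy ontology: each term maps to its ancestors (itself included).
ANCESTORS = {
    "root": {"root"},
    "metabolism": {"root", "metabolism"},
    "protein_metabolism": {"root", "metabolism", "protein_metabolism"},
    "protein_folding": {"root", "metabolism", "protein_metabolism", "protein_folding"},
    "translation": {"root", "metabolism", "protein_metabolism", "translation"},
}

# Hypothetical probability that a gene is annotated to a term or its descendants.
PROB = {
    "root": 1.0,
    "metabolism": 0.5,
    "protein_metabolism": 0.2,
    "protein_folding": 0.05,
    "translation": 0.1,
}


def ic(term: str) -> float:
    """Information content: IC(t) = -log p(t)."""
    return log(1.0 / PROB[term])


def common_ancestor_ics(t1: str, t2: str) -> list[float]:
    return [ic(a) for a in ANCESTORS[t1] & ANCESTORS[t2]]


def resnik(t1: str, t2: str) -> float:
    """IC of the most informative common ancestor."""
    return max(common_ancestor_ics(t1, t2))


def xgrasm_resnik(t1: str, t2: str) -> float:
    """Average IC over common ancestors with IC > 0 (assumed definition)."""
    informative = [v for v in common_ancestor_ics(t1, t2) if v > 0]
    return sum(informative) / len(informative) if informative else 0.0


def _lin_normalize(shared_ic: float, t1: str, t2: str) -> float:
    denominator = ic(t1) + ic(t2)
    return 2.0 * shared_ic / denominator if denominator > 0 else 0.0


def lin(t1: str, t2: str) -> float:
    """Lin similarity: shared IC normalized by the ICs of both terms."""
    return _lin_normalize(resnik(t1, t2), t1, t2)


def xgrasm_lin(t1: str, t2: str) -> float:
    return _lin_normalize(xgrasm_resnik(t1, t2), t1, t2)


pair = ("protein_folding", "translation")
for name, measure in [("Resnik", resnik), ("Lin", lin),
                      ("XGraSM-Resnik", xgrasm_resnik), ("XGraSM-Lin", xgrasm_lin)]:
    print(f"{name}: {measure(*pair):.3f}")
```

Because the averaged variants also take less specific shared ancestors into account, they cannot exceed the corresponding most-informative-ancestor scores for the same pair of terms.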
As shown in Figure 5, Categorizer outperforms all other measures commonly employed for assessing semantic similarity. Consistent with previous findings [11,46], XGraSM provides the best categorization results of all the older methods (Figure 5). It is interesting to note that the MCC values calculated for XGraSM-Lin and XGraSM-Resnik are slightly higher than the MCC calculated for GO Slim. This finding provides further support for the importance of semantic similarity in categorization.

Conclusion

Here we developed a flexible and extendable tool that can be used to find over-represented categories within sets of genes. Categorizer classifies genes into categories according to their biological meaning and assesses their enrichment. Thus, Categorizer offers a new way of enrichment analysis that allows focusing on processes that are of specific interest to the user.

Availability and requirements

Project name: Categorizer
Project home page: http://chibi.ubc.ca/gsponer/categorizer, http://ssbio.cau.ac.kr/software/categorizer
Operating system: Platform independent
Programming languages: Python
Other requirements: None
License: Apache License 2.0
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DN implemented the software, HS tested the software and assisted in manuscript preparation, and JG supervised this project and assisted with writing the manuscript. All authors read, reviewed, and approved the full manuscript.

Acknowledgements

This research was supported by NSERC and the Chung-Ang University Research Grants in 2013. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1003444).
We would like to thank Alex Cumberworth and Guillaume Lamour for their help in developing our web site.

Author details

1 Department of Biochemistry and Molecular Biology, Centre for High-throughput Biology, University of British Columbia, 2125 East Mall, Vancouver, BC V6T 1Z4, Canada. 2 School of Integrative Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 156-756, Republic of Korea.

Received: 2 June 2014 Accepted: 4 December 2014 Published: 11 December 2014

References

1. Gene Ontology. [http://geneontology.org]
2. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009, 37:1–13.
3. Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2008, 4:44–57.
4. Maere S, Heymans K, Kuiper M: BiNGO: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21:3448–3449.
5. Berriz GF, Beaver JE, Cenik C, Tasan M, Roth FP: Next generation software for functional trend analysis. Bioinformatics 2009, 25:3043–3044.
6. Zhang S, Binari R, Zhou R, Perrimon N: A genomewide RNA interference screen for modifiers of aggregates formation by mutant Huntingtin in Drosophila. Genetics 2010, 184:1165–1179.
7. Doumanis J, Wada K, Kino Y, Moore AW, Nukina N: RNAi screening in Drosophila cells identifies new modifiers of mutant huntingtin aggregation. PLoS ONE 2009, 4:e7275.
8. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32:D258–D261.
9. GO Slims. [http://www.geneontology.org/GO.slims.shtml]
10. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM: Semantic similarity in biomedical ontologies. PLoS Comput Biol 2009, 5:e1000443.
11. Mazandu GK, Mulder NJ: Information content-based gene ontology semantic similarity approaches: toward a unified framework theory. Biomed Res Int 2013, 2013:Article ID 292063.
12. Resnik P: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence. Volume 1; 1995:448–453.
13. Lin D: An information-theoretic definition of similarity. In Proceedings of the 15th Conference on Machine Learning. 1998:296–304.
14. Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 15th International Conference on Research in Computational Linguistics. 1997:19–33.
15. Mazandu GK, Mulder NJ: DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures. BMC Bioinformatics 2013, 14:284.
16. Xu Y, Guo M, Shi W, Liu X, Wang C: A novel insight into Gene Ontology semantic similarity. Genomics 2013, 101:368–375.
17. Couto FM, Silva MJ: Disjunctive shared information between ontology concepts: application to Gene Ontology. J Biomed Sem 2011, 2:5.
18. Speer N, Spieth C, Zell A: A memetic clustering algorithm for the functional partition of genes based on the gene ontology. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004); 2004:252–259.
19. Schlicker A, Rahnenführer J, Albrecht M, Lengauer T, Domingues FS: GOTax: investigating biological processes and biochemical activities along the taxonomic tree. Genome Biol 2007, 8:R33.
20. Du Z, Li L, Chen C-F, Yu PS, Wang JZ: G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Res 2009, 37:W345–W349.
21. del Pozo A, Pazos F, Valencia A: Defining functional distances over gene ontology. BMC Bioinformatics 2008, 9:50.
22. Schlicker A, Lengauer T, Albrecht M: Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics 2010, 26:i561–i567.
23. Yilmaz S, Jonveaux P, Bicep C, Pierron L, Smaïl-Tabbone M, Devignes MD: Gene-disease relationship discovery based on model-driven data integration and database view definition. Bioinformatics 2009, 25:230–236.
24. UniProtKB. [http://www.uniprot.org]
25. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19:1275–1283.
26. Wu X, Pang E, Lin K, Pei Z-M: Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 2013, 8:e66745.
27. Glass K, Girvan M: Annotation enrichment analysis: an alternative method for evaluating the functional properties of gene sets. Sci Rep 2014, 4:4191.
28. Na D, Rouf M, O'Kane CJ, Rubinsztein DC, Gsponer J: NeuroGeM, a knowledgebase of genetic modifiers in neurodegenerative diseases. BMC Med Genomics 2013, 6:52.
29. NeuroGeM. [http://chibi.ubc.ca/neurogem]
30. Lu B, Vogel H: Drosophila models of neurodegenerative diseases. Annu Rev Pathol Mech Dis 2009, 4:315–342.
31. Shorter J: Hsp104: a weapon to combat diverse neurodegenerative disorders. Neurosignals 2008, 16:63–74.
32. Li S-H, Li X-J: Huntingtin–protein interactions and the pathogenesis of Huntington's disease. Trends Genet 2004, 20:146–154.
33. Nucifora FC, Sasaki M, Peters MF, Huang H, Cooper JK, Yamada M, Takahashi H, Tsuji S, Troncoso J, Dawson VL, Dawson TM, Ross CA: Interference by huntingtin and atrophin-1 with cbp-mediated transcription leading to cellular toxicity. Science 2001, 291:2423–2428.
34. Huang CC, Faber PW, Persichetti F, Mittal V, Vonsattel J-P, MacDonald ME, Gusella JF: Amyloid formation by mutant Huntingtin: threshold, progressivity and recruitment of normal polyglutamine proteins. Somat Cell Mol Genet 1988, 24:217–233.
35. Li SH, Cheng AL, Zhou H, Lam S, Rao M, Li H, Li XJ: Interaction of Huntington disease protein with transcriptional activator Sp1. Mol Cell Biol 2002, 22:1277–1287.
36. Takano H, Gusella JF: The predominantly HEAT-like motif structure of huntingtin and its association and coincident nuclear entry with dorsal, an NF-kB/Rel/dorsal family transcription factor. BMC Neurosci 2002, 3:15.
37. Steffan JS, Kazantsev A, Spasic-Boskovic O, Greenwald M, Zhu YZ, Gohler H, Wanker EE, Bates GP, Housman DE, Thompson LM: The Huntington's disease protein interacts with p53 and CREB-binding protein and represses transcription. Proc Natl Acad Sci 2000, 97:6763–6768.
38. Culver BP, Savas JN, Park SK, Choi JH, Zheng S, Zeitlin SO, Yates JR, Tanese N: Proteomic analysis of wild-type and mutant Huntingtin-associated proteins in mouse brains identifies unique interactions and involvement in protein synthesis. J Biol Chem 2012, 287:21599–21614.
39. Mills JD, Janitz M: Alternative splicing of mRNA in the molecular pathology of neurodegenerative diseases. Neurobiol Aging 2012, 33:1012.e11–1012.e24.
40. Branco J, Al-Ramahi I, Ukani L, Pérez AM, Fernandez-Funez P, Rincón-Limas D, Botas J: Comparative analysis of genetic modifiers in Drosophila points to common and distinct mechanisms of pathogenesis among polyglutamine diseases. Hum Mol Genet 2007, 17:376–390.
41. Pallos J, Bodai L, Lukacsovich T, Purcell JM, Steffan JS, Thompson LM, Marsh JL: Inhibition of specific HDACs and sirtuins suppresses pathogenesis in a Drosophila model of Huntington's disease. Hum Mol Genet 2008, 17:3767–3775.
42. Fujikake N, Nagai Y, Popiel HA, Okamoto Y, Yamaguchi M, Toda T: Heat shock transcription factor 1-activating compounds suppress polyglutamine-induced neurodegeneration through induction of multiple molecular chaperones. J Biol Chem 2008, 283:26188–26197.
43. Mazandu GK, Mulder NJ: A topology-based metric for measuring term similarity in the gene ontology. Adv Bioinformatics 2012, 2012:975783.
44. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23:1274–1281.
45. Zhang P, Zhang J, Sheng H, Russo JJ, Osborne B, Buetow K: Gene functional similarity search tool (GFSST). BMC Bioinformatics 2006, 7:135.
46. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22:967–973.

doi:10.1186/1471-2164-15-1091
Cite this article as: Na et al.: Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity. BMC Genomics 2014 15:1091.

