UBC Faculty Research and Publications

Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs) Cheung, Warren A; Ouellette, BF F; Wasserman, Wyeth W Sep 27, 2012

Full Text

Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs)

Cheung et al. BMC Bioinformatics 2012, 13:249
http://www.biomedcentral.com/1471-2105/13/249

RESEARCH ARTICLE (Open Access)

Warren A Cheung(1,2), BF Francis Ouellette(3,4)* and Wyeth W Wasserman(1)*

Abstract

Background: MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of its contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over 50+ years, provides a broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important to both understand the importance of related concepts and discover new relationships.

Results: We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated to a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, MeSHOPs can be compared using standard techniques such as hierarchical clustering.
The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms.

Conclusions: MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.

Background

The MEDLINE®/PubMed® bibliographic database of the U.S. National Library of Medicine (NLM) is an actively maintained central repository of over 18.5 million biomedical literature references [1]. To navigate this growing body of published information, the MEDLINE®/PubMed® references are indexed by subject experts at the NLM using Medical Subject Headings (MeSH) [2], a structured controlled vocabulary of 26,000 biomedical descriptors. The MeSH annotations are intended to facilitate the identification of relevant papers for research scientists. As MEDLINE®/PubMed® currently grows at a rate exceeding 600,000 references per year, researchers face a daunting challenge to assess the body of work about entities (genes, drugs, authors, etc.) arising in the course of their research.

Encapsulating the bibliography for a biomedical entity of interest in a form both understandable and informative is an increasingly important challenge in biomedical informatics [3,4]. One approach to succinctly summarise a bibliography (e.g. a set of key papers) for a biomedical topic is to identify the MeSH terms most strongly associated to the papers.
Previous reports which introduced summaries of over-represented MeSH terms for a set of papers include a study of enriched annotations for groups of differentially expressed genes [5] and a method to identify MeSH terms enriched in articles retrieved in a query of the PubMed database [6]. These initial approaches to MeSH annotation analysis applied ad hoc measures of association over small sets of articles to demonstrate the potential value for MeSH annotation summarization.

* Correspondence: francis@oicr.on.ca; wyeth@cmmt.ubc.ca
(3) Ontario Institute for Cancer Research, Toronto, ON, Canada
(1) Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Full list of author information is available at the end of the article

© 2012 Cheung et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Key to accelerating the research process is the development of systematic approaches to quantitatively represent bibliometric information and infer functionally important relationships between entities. Addressing this goal, we introduce MeSH Over-representation Profiles (MeSHOPs) to quantitatively describe the properties of genes, diseases or any other entity associated with a set of articles represented in MEDLINE®/PubMed®. The entire MEDLINE®/PubMed® database (hereafter referred to as MEDLINE) is analyzed. For each MeSHOP, the over-representation of MeSH annotations across a bibliography of articles is statistically evaluated for a biomedical topic.
MeSHOPs convey characteristics of the subject entity, facilitating discovery of novel relationships across classes of entities. We demonstrate the use of MeSHOPs to facilitate property visualization, subject to the use of appropriate corrections for background annotation properties. To assess the utility of MeSHOPs for high-throughput generation of quantitative annotation, the capacity of the process to re-derive a subset of Gene Ontology annotation of genes is measured. Using a class of biomedical entities – vitamins – as an example, MeSHOP comparisons are shown to provide a quantitative measure of similarity between each member of the class. Profiles can be similarly compared across entity classes, as demonstrated in an analysis of the similarities between gene MeSHOPs and brain disease MeSHOPs. MeSH Over-representation Profiles fill an important niche in computational biology, allowing quantitative annotation descriptions to be generated for any entity for which a set of research articles indexed in the MEDLINE database can be defined.

Methods

Calculating MeSH over-representation profiles

A MeSHOP is a quantitative representation of the annotations associated with a set of articles, where the set is composed of articles that address a specific entity (such as a gene or disease). The computation of a MeSHOP starts from a set of articles that address a specific entity and returns a set of over-represented MeSH terms, each term with a p-value reflecting over-representation based on its rate of occurrence in the set of articles (see Figure 1). Comparing the observed frequency of each annotated MeSH term to the background rate returns a measure of over-representation. A MeSHOP is a vector of tuples <(t_1, m_1), (t_2, m_2), ..., (t_n, m_n)>. For each tuple (t_i, m_i) in a MeSHOP, t_i is a distinct MeSH term in the MeSH vocabulary and m_i is the numeric measure of the over-representation of MeSH term t_i in the set of articles.
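The vector-of-tuples definition above can be made concrete with a small sketch. It assumes the over-representation measure is the upper tail of the hypergeometric distribution named in the abstract (equivalent to a one-sided Fisher's Exact Test); the one-sided choice, the function names and the toy counts are illustrative assumptions, not the authors' implementation. Note that here the background totals include the entity's own articles.

```python
from math import comb


def overrepresentation_p(fg_with, fg_total, bg_with, bg_total):
    """One-sided hypergeometric p-value for one MeSH term (illustrative).

    fg_with  -- articles about the entity that carry the term
    fg_total -- all articles about the entity
    bg_with  -- background articles with the term (entity articles included)
    bg_total -- all background articles (entity articles included)
    """
    denom = comb(bg_total, fg_total)
    upper = min(fg_total, bg_with)
    # P(X >= fg_with) when fg_total articles are drawn at random from the background.
    tail = sum(
        comb(bg_with, k) * comb(bg_total - bg_with, fg_total - k)
        for k in range(fg_with, upper + 1)
    )
    return tail / denom


def meshop(term_counts, fg_total, bg_counts, bg_total):
    """Return a profile as a list of (term, p_value) tuples, smallest p first."""
    profile = [
        (term, overrepresentation_p(n, fg_total, bg_counts[term], bg_total))
        for term, n in term_counts.items()
    ]
    return sorted(profile, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical toy counts: 10 entity articles against 1000 background articles.
    counts = {"Huntington Disease": 8, "Humans": 9}
    background = {"Huntington Disease": 20, "Humans": 900}
    for term, p in meshop(counts, 10, background, 1000):
        print(f"{term}\t{p:.3g}")
```

Exact binomial coefficients keep the sketch dependency-free but are only practical for small counts; at MEDLINE scale a statistics package would be the natural choice.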
For this study, several large classes of entities were analyzed, such as the human genes in Entrez Gene and the diseases specified formally within MeSH. MeSHOPs are generated for each member of a class by assessing the set of all linked MEDLINE records for each member. We use Fisher’s Exact Test to determine p-values, computed from a 2x2 contingency table comprised of: 1) the number of articles addressing the entity of interest that are annotated with the term t_i; 2) the number of articles addressing the entity of interest without the term t_i; 3) the number of articles in the background set annotated with the term t_i that do not address the entity of interest; and 4) the remaining number of articles in the background set that neither carry the term t_i nor address the entity of interest. The universal background studied is the set of 17 million MEDLINE articles assigned MeSH terms, with class-specific backgrounds comprising subsets of these articles (see Additional file 1: Table S2 for more details).

Annotation data

Over 18 million biomedical references in MEDLINE have been evaluated by NLM staff subject experts. These curators assigned appropriate MeSH terms corresponding to the topics covered by the paper. The MeSH terms chosen are intended to be the most specific terms relevant to the topic covered in the paper – for example, if the term “Alzheimer Disease” is attached to the paper, the more general (‘parental’) term “Brain Disease” would not be associated. For our analysis, we therefore consider a paper annotated by a MeSH term to also be annotated with all ‘parents’ (and ‘grand-parents’, and ‘great grand-parents’, and so on to the root of the hierarchy) of that MeSH term. When indexing articles using MeSH terms, the scope of a complex topic often cannot be covered using a single MeSH term. In this case, multiple separate terms are “coordinated”.
For example, the topic “medical staff in teaching hospitals” is covered by annotating with the two separate distinct MeSH terms “Medical Staff, Hospitals” and “Hospitals, Teaching”. There is no indication that the two terms are linked within the record.

Generating disease MeSHOPs

For each MeSH term from the disease category (Category C), the entire bibliography of annotated articles in MEDLINE was considered. Disease-article linkages are drawn directly from MEDLINE via the curator-assigned MeSH terms. To generate MeSH term literature profiles for diseases, all MeSH terms from the disease category – Category C – were used; a set composed of 4 494 terms in MeSH 2011 linking to over 10 million articles.

Generating gene MeSHOPs

All human genes in Entrez Gene were considered (45 333 in Entrez Gene 2011). Two sources for gene-article linkages from Entrez Gene were evaluated: Gene Reference Into Function (GeneRIF, http://goo.gl/SzRui) and gene2pubmed (http://goo.gl/bUEDU). GeneRIF is a curated set of links provided by annotators at the NLM and public submissions, where each set of PubMed articles refers to a briefly described function of the gene. gene2pubmed is a set of links to PubMed articles relating to the gene, generally broader in scope than GeneRIFs. GeneRIFs link 15 312 human genes to 213 595 articles. gene2pubmed links 30 324 human genes to 302 629 articles.

Generating chemical compound MeSHOPs

We examine all chemical compounds annotated to MEDLINE articles. These include chemical compounds that are part of the main MeSH hierarchy (Category D), as well as chemical compounds that are part of the Supplementary Concept Records.

Generating MeSHOP word clouds

The MeSHOP [term, -log(p-value)] pairs are submitted to the online cloud generating software Wordle (http://www.wordle.net) and visualized using the “Horizontal” layout.
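As a rough illustration of the [term, -log(p-value)] pairs, the sketch below converts a list of (term, p-value) tuples into word-cloud weights. The floor of 10^-30 follows the cap on p-values that the text gives for word clouds; the base-10 logarithm, the function names and the "term:weight" line format are assumptions made for this example.

```python
import math

P_FLOOR = 1e-30  # smallest p-value used for weighting, per the cap in the text


def cloud_weights(profile, floor=P_FLOOR):
    """Map (term, p_value) pairs to (term, -log10(p)) weights.

    Base 10 is an assumption; the text only says "-log(p-value)".
    """
    return [(term, -math.log10(max(p, floor))) for term, p in profile]


def to_weighted_lines(weights):
    """Format weights as 'term:weight' lines (assumed paste format)."""
    return "\n".join(f"{term}:{weight:.2f}" for term, weight in weights)


if __name__ == "__main__":
    # Hypothetical p-values for illustration only.
    example = [("Huntington Disease", 1e-50), ("Trinucleotide Repeat Expansion", 1e-12)]
    print(to_weighted_lines(cloud_weights(example)))
```

Flooring the p-value bounds the largest font weight, so a single extremely significant term cannot dwarf every other term in the cloud.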
We cap the minimum for the p-values at 10^-30. Each MeSH term for a given MeSHOP is laid out in a random, non-overlapping manner, with the font size of the term scaled proportional to the weight in the vector. Users can generate a word cloud from a MeSHOP via a single click on the results page, or by copying and pasting the MeSHOP term-value pairs into the Wordle Advanced submission page.

Implementation

The analysis was performed using Python (http://www.python.org/), XSLT (http://www.w3.org/TR/xslt), and the MySQL database system (http://www.mysql.com/). Fisher’s Exact Test p-values and hierarchical clustering using complete linkage and Euclidean distance were computed using the R statistics package (http://www.r-project.org/). Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine (http://gridengine.sunsource.net/). A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM.

Datasets were downloaded from Entrez Gene (ftp://ftp.ncbi.nlm.nih.gov/gene/). We analyze the 2011 MeSH-annotated MEDLINE®/PubMed® baseline (http://www.nlm.nih.gov/databases/leased.html). See Additional file 1: Table S1 for details of the size and contents of the datasets.

Web Interface for generating and obtaining MeSHOPs

To enable reader exploration of the profiles, we provide pre-computed MeSHOPs for biomedical entities such as genes, diseases and chemical compounds (http://meshop.oicr.on.ca). All MeSH-annotated articles available through the 2011 full year release are incorporated into the profiles. Diseases include all specified by MeSH terms under the parent term “Diseases”. Chemical compounds are all compounds appearing in the MeSH supplemental concepts. Genes are not consistently defined as MeSH terms. As MeSHOPs may be generated for any set of articles, gene MeSHOPs were derived from existing mappings of genes onto PubMed article identifiers. For the pre-computed datasets (genes, diseases and chemical compounds), we also provide the number of entities (e.g. n = 4411 diseases), allowing users to calculate Bonferroni-corrected p-values if they desire. Users seeking to generate MeSHOPs for other biomedical entities – for example, using entity-article mappings from another resource – can use the results of a PubMed search query or directly provide a list of PubMed Identifiers (PMIDs) to compute MeSHOPs.

Figure 1. Workflow for Generating a MeSHOP. Starting from a set of articles relating to a biological concept or entity (the foreground set), the associated MeSH terms for each PubMed record of each article are extracted. The prevalence of each MeSH term across the set of articles is compared to a background. Fisher’s Exact Test is applied to measure the statistical over-representation of each term in the foreground set.

Results

MeSHOPs quantitatively represent the association of medical terms to a topic of interest, based on the bibliography for the topic compared to a background set of articles. We examine methods for generating MeSHOPs, and show how MeSHOPs can be used to reveal terms associated with a topic.

MeSHOPs for biomedical entities

To quantitatively describe the annotation properties of a biomedical entity using MeSH terms attached to a set of articles about the entity, we evaluated multiple procedures. At the simplest, one could count the number of times each MeSH term is attached to the corpus of articles (Figure 2A). Such an approach fails to account for the number of articles in the corpus, so one could normalize the frequency. While such a correction may facilitate comparisons between distinct MeSHOPs, it fails to account for the importance of the individual terms and has no impact on the visual representation (data not shown).
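A minimal sketch of the two simplest procedures, raw counts and counts normalized by corpus size, may help show why normalization alone leaves the picture unchanged: every weight is divided by the same constant, so the relative sizes of the terms stay the same. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def term_counts(article_terms):
    """Count how many articles in the corpus carry each MeSH term."""
    counts = Counter()
    for terms in article_terms:
        counts.update(set(terms))  # count each term at most once per article
    return counts


def normalized(counts, n_articles):
    """Divide each count by the corpus size."""
    return {term: n / n_articles for term, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical four-article corpus for illustration only.
    corpus = [
        ["Humans", "Huntington Disease"],
        ["Humans"],
        ["Humans", "Huntington Disease"],
        ["Humans", "Mice"],
    ]
    counts = term_counts(corpus)
    freqs = normalized(counts, len(corpus))
    # The ratio between any two terms is identical before and after normalization.
    print(counts["Humans"] / counts["Huntington Disease"])
    print(freqs["Humans"] / freqs["Huntington Disease"])
```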
Some terms, such as ‘human’, are attached frequently, but provide little information to distinguish between distinct biomedical entities. To place the quantitative emphasis on distinguishing terms, we elect to calculate a p-value reflecting the significance of observing the number of annotations with a MeSH term in a set of articles of the given corpus size (Figure 2B). The statistical model balances the number of articles related to the entity being profiled against the prevalence of the term in the background, providing greater emphasis on the occurrence of rare terms. The p-value computed therefore controls for the number of articles associated to an entity, against the null hypothesis of independent random assignment of MeSH terms to the articles related to the entity. This MeSHOP generation process (Figure 1) underlies all subsequent analysis in this report.

Simplifying large MeSHOPs

Inspecting the raw MeSHOPs revealed two issues that become increasingly important when analyzing larger bibliographies: (i) highly correlated terms within the MeSH hierarchy result in concept redundancy in the profiles; and (ii) the universal background rate of term frequency results in uninformative class-enriched terms. Two corrections were introduced to address these issues.

As an example of the first problem, consider the term “Alzheimer Disease”, which implies the more general term “Brain Disease”, rendering the observed over-representation of “Brain Disease” uninformative in a profile (see Figure 3). The tree-like structure of the MeSH vocabulary provides a direct method to determine term relationships.
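One way to read term relationships off the tree is sketched below, assuming MeSH-style dot-separated tree numbers in which every ancestor's number is a prefix of its descendants' numbers. The tree numbers in the toy table are invented for illustration and are not real MeSH identifiers; the function names are likewise hypothetical.

```python
def ancestor_numbers(tree_number):
    """Return the tree numbers of all ancestors, nearest parent first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def implied_terms(term, tree):
    """Return every more general term implied by `term`.

    `tree` maps each term name to the list of tree numbers where it appears;
    a term may sit at several positions in the hierarchy.
    """
    by_number = {num: name for name, nums in tree.items() for num in nums}
    implied = set()
    for num in tree[term]:
        for ancestor in ancestor_numbers(num):
            if ancestor in by_number:
                implied.add(by_number[ancestor])
    implied.discard(term)
    return implied


# Toy hierarchy with made-up tree numbers (not real MeSH identifiers).
TOY_TREE = {
    "Nervous System Diseases": ["C10"],
    "Central Nervous System Diseases": ["C10.1"],
    "Brain Diseases": ["C10.1.1"],
    "Neurodegenerative Diseases": ["C10.2"],
    "Tauopathies": ["C10.2.1"],
    "Mental Disorders": ["F03"],
    "Delirium, Dementia, Amnestic, Cognitive Disorders": ["F03.1"],
    "Dementia": ["C10.1.1.1", "F03.1.1"],
    "Alzheimer Disease": ["C10.1.1.1.1", "C10.2.1.1", "F03.1.1.1"],
}

if __name__ == "__main__":
    print(sorted(implied_terms("Alzheimer Disease", TOY_TREE)))
```

The same lookup serves both directions of the analysis: propagating an annotation up to its parents, and recognising that a general term in a profile is already implied by a more specific one.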
A more succinct representation can be generated by removing more general terms, limiting MeSHOPs to include only the most specific significantly associated terms from the MeSH tree (see Figure 2C).

As an example of the second problem, the initial MeSHOP for the gene BRCA1 includes the term “polymorphism, single nucleotide”; however, this term is enriched for 29% of human genes using the universal background set of articles. To address this issue, we calculate the enrichment statistics based on class-specific article backgrounds. For human genes, the background is restricted to articles addressing at least one human gene. Similarly, for diseases, the background is all articles annotated with at least one MeSH disease term. Using class-specific backgrounds, the statistical test highlights terms unusually enriched for the specific member, de-emphasizing terms common to all members of the class (see Figure 2D).

Visualizing MeSHOPs

MeSHOPs can be directly converted into word clouds to provide a convenient graphical depiction of the annotation properties that enables rapid visual comparison of the relative importance of terms (see Figure 2). Word clouds provide a visual representation of a MeSHOP, allowing for immediate evaluation of the most important terms as well as their relative importance, in a manner similar to sequence logos [7]. We have introduced above two approaches that improve over-representation profiles: (i) filtering to retain only the most specific MeSH terms and (ii) selecting an appropriate background for the statistical comparisons. A word cloud for a MeSHOP is generated using the associated MeSH terms and the negative log of the corresponding calculated p-values, directly translating the statistical significance of each term proportionally into the size of the font for the associated term.

Figure 2. Alternative Approaches for Generating MeSHOPs depicted as Word Clouds. All MeSHOPs depict annotation of the HTT gene that is causal for Huntington Disease. (A) Raw counts. (B) Statistical enrichment scores. The top 150 terms in the profile are shown, with the font size of each term proportional to the negative log p-value for the term. Note the presence of many general terms which are implied by more specific terms, such as “Vertebrates”, “Primates”, “Chordata” and “Mammals” being present, but covered by the term “Humans”. Also, when studying a set of human genes, the terms “Humans” and “Genes” are commonly occurring and should be down-weighted accordingly. (C) Redundancy Filtered HTT Gene Biomedical Term Word Cloud. This is a word cloud where the more general terms have been filtered out from (B), leaving only the most specific terms in the profile. For example, the term “Repetitive Sequences, Nucleic Acid” seen in (B) has been filtered out due to the presence of the term “Trinucleotide Repeat Expansion”. (D) Redundancy Filtered HTT Gene Biomedical Term Word Cloud using human gene background. This is a word cloud when taking only the subset of PubMed articles related to human genes as the background, while also applying the filtering seen in (C).

Properties of gene and disease annotation

Examination of the number of articles linked to human genes and diseases reveals substantial differences between these data sources. Most gene bibliographies have few linked articles, with the distribution decreasing toward an extreme tail of well-studied genes with many links. For the GeneRIF article links from Entrez Gene (accessed 2007-02-13), genes have a mean of 369 assigned articles, but a median of only 15 articles (see Figure 4A). Similarly, for the gene2pubmed article links, the mean is 637 articles, yet the median is only 20 articles (see Figure 4B). Diseases have a more balanced distribution, but still a characteristic extreme tail of certain well-studied diseases, with the distinct difference that very few diseases have only a couple of articles. In the 2007 release of PubMed, a mean of 19 431 articles is linked to each disease, but a median of only 1 912 articles – still substantially more than the median for genes (see Figure 4C). Of the 24 357 MeSH 2007 terms, 15 674 terms are represented in gene MeSHOPs (via the 2007 gene2pubmed article links), and 23 473 terms are found in disease MeSHOPs (via 2007 PubMed). We expect that as genes become better annotated with more comprehensive bibliographies, their annotation pattern will come to resemble that of the more comprehensively annotated diseases.

Re-deriving gene ontology annotations with MeSHOPs

MeSHOPs may be most advantageous as an approach to generate quantitative annotation profiles in a high-throughput manner for any set of biomedical entities that can be associated with sets of research articles. To measure the performance of the procedure in regenerating relevant annotations, we assessed the sensitivity of MeSHOPs for detecting the directly mappable subset of Gene Ontology terms annotated to genes. Using the Unified Medical Language System (UMLS) mapping of MeSH terms to Gene Ontology terms, we identified 396 GO terms with one-to-one equivalent MeSH terms. As depicted in Additional file 1: Figure S1A, we observe that the sensitivity of MeSHOPs for representing these terms for the corresponding genes ranges from 77% (at a p-value threshold of 0.05) to 95% (at a threshold of 0.31). As GO annotations are not comprehensive, there is no direct means to assess the specificity of the method.
In lieu of specificity, we plot the total number of MeSH terms mapped per gene relative to the threshold values, with 162 terms per gene at a p-value threshold of 0.05 (Additional file 1: Figure S1B).

Temporal changes of MeSHOPs

MeSHOPs can be used to identify changing knowledge and properties for an entity. For example, by taking a subset of the articles for a biomedical entity at different timepoints, we can track the changes in research focus for the entity over time. Two areas of research, defined by the MeSH terms “Computational Biology” and “Stem Cells”, were analyzed. At each selected time point, the fifty most recent articles for that year were taken to represent the state of the field at that time, and MeSHOPs were computed using the universal MEDLINE background. Analyzing the MeSHOPs for “Computational Biology” over the past decade allows us to quantitatively evaluate the evolution of the field (see Figure 5). For this analysis, all years indicate the inclusion of articles to the end of that calendar year. The MeSHOP from 1999 reveals significant topics such as “Human Genome Project”, a major informatics focus at that time point, that are completely absent when we examine the corresponding MeSHOP from 2008. “Genetic Research”, present in both MeSHOPs, is followed in the recent MeSHOP by other terms for biological disciplines and techniques such as “Genomics”, “Genetic Techniques”, “Proteomics” and “Sequence Analysis, Protein”, demonstrating how computational biology techniques are being more tightly integrated with biomedical research (see Additional file 1: Table S3). As seen in Figure 5C, data from MeSHOPs can be used to chart the gradual decline in significance of “Information Services” as the focus of the research switches from storage of the data, and the corresponding rise in association to “Biochemistry”, demonstrating its tighter coupling with scientific study. Similarly, we can track the changes in “Stem Cells” since the introduction of the term in 1984 (see Additional file 1: Figure S2). By 1985, we see “Hematopoietic Stem Cells” and “Bone Marrow Cells” as a significant focus. This is followed by the surge in importance of “Stem Cell Transplantation” by 2000, whereas by 2009 we see the focus shifting to “Mesenchymal Stem Cells”, “Cell Differentiation” and “Embryonic Stem Cells”.

MeSHOPs provide both a qualitative visual summary of the shifting focus of research over time for an entity of interest and a method to quantitatively track the progression of association of biomedical subjects as they relate to the entity of interest.

Figure 3. Subset of the MeSH Tree for Alzheimer Disease. The entries in the Medical Subject Heading tree leading to Alzheimer disease. Note that the term Alzheimer Disease occurs in three places in the tree, and under two separate subheadings in the Disease category – once under “Central Nervous System Diseases” due to its location in the human body, and once under “Neurodegenerative Diseases” and “Tauopathies” due to the type of disease.
Paths shown in the tree:
All Categories > Psychiatry and Psychology Category > Mental Disorders > Delirium, Dementia, Amnestic, Cognitive Disorders > Dementia > Alzheimer Disease
All Categories > Diseases Category > Nervous System Diseases > Central Nervous System Diseases > Brain Diseases > Dementia > Alzheimer Disease
All Categories > Diseases Category > Nervous System Diseases > Neurodegenerative Diseases > Tauopathies > Alzheimer Disease

Intra-group MeSHOP similarity

MeSHOPs can also be used to investigate relationships between a set of related entities. For the set of entities comprising the 13 human vitamins, we first use MeSHOPs to examine the co-occurrence of vitamin MeSH terms in MEDLINE (see Additional file 1: Figure S3A) by considering, for each vitamin entity, the subset of the MeSHOP relating to vitamins.
In this case, the MeSHOPs provide a measure of co-occurrence strength between any two vitamins, allowing us to visualize and cluster the vitamins via their bibliographic topic co-occurrence. We see the vitamins separating, with the fat-soluble vitamins A, D, E and K together, whereas the water-soluble vitamins (Ascorbic Acid and the B complex vitamins) are grouped separately. This graphic also reveals publication trends – for example, of the fat-soluble vitamins, all co-occur except for vitamins A and K, and the water-soluble vitamins cluster into three distinct groups, with Niacin separated from Pantothenic Acid, Biotin and Thiamine, which are also separate from the rest of the B complex vitamins, which group with Ascorbic Acid.

Figure 4. (A) Distribution of Genes by Number of Associated GeneRIF References. The distribution shows that the bulk of the genes have very few references, with an extreme tail of a small fraction of genes having a very large number of references. (B) Distribution of Genes by Number of Associated gene2pubmed References. Although the overall average number of references is higher due to the larger number of gene2pubmed references, the distribution remains very similar to (A). (C) Distribution of Diseases by Number of Associated PubMed References. Unlike the distributions of gene references, disease MeSH terms have substantial literature support, although there remains an extreme tail of a small fraction of MeSH terms having an extremely large number of articles.

Using the entirety of the vitamin MeSHOPs, we compare vitamins based on the similarity of the strength of association to biomedical subjects, taking the Euclidean distance of the log of the p-values for the shared terms in their MeSHOPs. Co-occurrence is limited to informing about entities that are discussed together in literature, and cannot predict entities that have not yet appeared in the same report.
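The profile comparison described above, a Euclidean distance over the log p-values of shared terms, can be sketched as follows. The base-10 logarithm, the p-value floor that guards against log(0), the handling of profiles with no shared terms and the function name are assumptions of this example rather than details given in the text.

```python
import math


def profile_distance(profile_a, profile_b, floor=1e-30):
    """Euclidean distance between log10 p-values over shared terms.

    Each profile maps a MeSH term to its p-value. Returns None when the
    two profiles share no terms (an assumed convention).
    """
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return None
    squared = sum(
        (math.log10(max(profile_a[t], floor)) - math.log10(max(profile_b[t], floor))) ** 2
        for t in shared
    )
    return math.sqrt(squared)


if __name__ == "__main__":
    # Hypothetical profiles for illustration only.
    vitamin_x = {"Antioxidants": 1e-3, "Diet": 1e-2, "Liver": 0.5}
    vitamin_y = {"Antioxidants": 1e-6, "Diet": 1e-2, "Kidney": 0.1}
    print(profile_distance(vitamin_x, vitamin_y))
```

A matrix of such pairwise distances is the kind of input a hierarchical clustering routine with complete linkage would take.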
Profile comparison allows any pair of entities to be compared indirectly through their shared biomedical terms, with the additional advantage of inherently compensating for differing amounts of literature for each entity. Comparing the results of co-occurrence to the profile comparisons in Additional file 1: Figure S3B reveals that clustering by profile gives results similar to bibliographical co-occurrence, such as Vitamin A clustering with Vitamin D and Pantothenic Acid clustering with Thiamine. Profile similarity clustering can, however, emphasize different similarities from co-occurrence, such as Niacin being more similar to Pantothenic Acid and Thiamine than to Biotin, and a similarity in annotations between Vitamin E and Ascorbic Acid. MeSHOPs allow us to analyze a set of biomedical entities to highlight known and expected relationships through strength of co-occurrence in biomedical literature, as well as revealing similarities of annotation profiles.

Inter-group MeSHOP similarity

To explore the challenges arising with inter-group MeSHOP comparisons, we sought to identify links between a subset of genes and brain disorders. We examined the genes of the Notch, Wnt and Hh signaling pathways, with the list of genes for each pathway extracted from KEGG (accessed June 2011) (see Additional file 1: Figure S4). These signaling pathway genes were grouped using the subset of MeSHOPs involving MeSH terms that are the immediate children of the MeSH term “Brain Diseases”. Clustering using their association to the pathway genes, the “Brain Diseases” are arranged into categories, with “Brain Neoplasms” being the most strongly associated to the genes, with “Hypothalamic Diseases” and “Dementia” also broadly associated. “Brain Injuries”, “Intracranial Hypertension” and “Hydrocephalus” are weakly associated to these genes by MeSHOP comparison. We grouped the pathway genes based on the “Brain Diseases” subset of their MeSHOPs.
Rather than grouping distinctly by pathway, the genes are spread across different clusters. A broad spectrum of the pathway genes is strongly associated with “Brain Neoplasms”, with a subset also strongly associated with “Hypothalamic Diseases”. Another distinct set of genes associated with “Cerebellar Diseases” is not associated with the previous two groups (see Additional file 1: Figure S4C). MeSHOPs provide a unique quantitative method of visualizing the gene landscape for a particular topic through the associated MeSH annotations.

Figure 5. MeSHOP for “Computational Biology”. MeSHOPs were generated for the 50 most recent articles annotated with the MeSH term “Computational Biology” from the year 1999 (A) and the year 2008 (B). MeSHOPs were computed using the universal background from PubMed Baseline 2011 (covering articles through 2010). The MeSH term for “Computational Biology” and its parent terms were excluded from the MeSHOP. (C) Change in Significance of Biomedical Terms for Computational Biology over Time. The p-values for the terms “Biochemistry” and “Information Services” and their association to “Computational Biology” over time. For each time point, a MeSHOP using the most recent 50 articles for that year was generated to obtain the p-values for the terms.

Discussion

MeSHOPs are quantitative annotation profiles based on over-representation analysis of MeSH terms attached to sets of articles, where each set or bibliography is associated to a specific biomedical entity such as a gene, disease or chemical. Conveniently visually depicted as word clouds, a MeSHOP includes both common terms frequently arising in a bibliography and rare concepts that arise more than expected by chance.
In this report we demonstrate the capacity of the MeSHOP generation procedure to recover known gene annotations (as curated with Gene Ontology terms), use temporal restrictions to demonstrate how MeSHOPs change over time, and introduce methods for the comparison of MeSHOPs for both intra- and inter-class similarity analyses. MeSHOPs can be expected to be widely used by researchers, as they may be generated for any biomedical entity and provide quantitative annotation without extensive curation.

We anticipate that researchers will be most attracted to the convenient generation of annotation images by converting MeSHOPs to word clouds. Convenient visualization methods in bioinformatics have made substantial impacts on communication, as evident in such methods as sequence logos for motifs [7], circos plots for genomics [8], pip-plots and dotter images [9,10] for sequence alignments, and network diagrams for protein systems [11]. MeSHOPs are likely to provide a similar level of convenience for summarizing complex topics for accelerated interpretation. The use of word clouds, of course, has been extensive, including for the display of gene annotation [12,13]. The key advantage of MeSHOPs is that they draw upon the expert curation underlying MEDLINE.

Technical challenges

MeSHOPs directly measure the significance of the annotated biomedical topics for a bibliography. The significant terms in a MeSHOP are therefore implicated by co-occurrence (guilt by association). The reliability of such over-representation analysis is dependent on the annotation used to generate the results. MeSH terms and Supplemental MeSH Concepts are annotated to MEDLINE articles by subject area experts to indicate the major and minor topics addressed by an article. There are two caveats to the over-representation analysis. Firstly, a co-occurring MeSH term may not apply to the biomedical topic despite appearing in the same paper.
This form of erroneous linkage is mitigated when significant p-values are supported by multiple co-occurrences in the bibliography addressing the entity. Secondly, co-occurrence can indicate a negative association, as negative associations are annotated in MeSH if they are an important topic of the paper. However, a negative association is unlikely to provoke substantial further literature support, unless it is of substantial research interest or the result is inconclusive, at which point the MeSH term emerges as important to the biomedical topic. Thus it is our expectation that further development of MeSHOPs will need to explore measures of confidence for small bibliographies.

Related work

The use of statistical tests to assign significance values for annotation terms appearing in a text or across gene annotations has been frequently observed in bioinformatics. We calculate p-values using Fisher’s Exact Test. These p-values have a specific, well-defined interpretation that is well suited to over-representation analysis: the probability that the term would be found at least as prevalently in an equivalent-sized set of articles drawn uniformly at random from the background set of articles. This makes it possible to set meaningful confidence thresholds and evaluate the scores. These scores highlight strength of association by correcting for the background frequency of occurrence. Fisher’s Exact Test is commonly used in classic Gene Ontology annotation over-representation tools for gene set analysis such as DAVID [14] and as a measure of over-representation of transcription factor binding sites across a set of genes or sequences [15].

A number of publications have incorporated MeSH terms into the analysis of sets of articles. Many studies have attempted to find common themes for groups of genes arising in experimental studies [5,16-18]. Three papers are more similar to the work described here, although each has distinct characteristics.
The LigerCat system was developed to provide a more convenient interface for PubMed searching [6]. The system generates a word cloud for MeSH terms arising in articles reported by an initial user query (which could be a single entity such as a gene or drug). The user can then click on the individual terms within the cloud to restrict results in the PubMed search. Comparisons of MeSH-based gene profiles were performed by Sarkar and Agrawal [19] using hierarchical clustering, but only with profiles composed of binary values (whether a term is present in or absent from the profile), where a positive setting was made if there was at least one abstract in which the gene name and assigned MeSH term co-occurred.

Agarwal and Searls describe the use of Fisher’s p-value for evaluating the association for the genes present in the articles relating to disease [20]. They combine gene2pubmed and GeneRIF with computationally flagged gene names in the titles and abstracts of the PubMed entries. The tool gene2mesh [21] provides gene profiles with a universal background. MeSHOPs demonstrate that the same statistical analysis can be applied and visualized for any entity associated with biomedical articles. The Gendoo system [22,23] allows users to see MeSH terms associated with a gene or drug, and provides an information gain score to indicate which genes or drugs are most closely linked to a MeSH term. There is no quantitative profile provided, nor the capacity to perform comparisons of distinct entities.

Analysis of biomedical topics over time has been previously performed by Agarwal and Searls [24], where they examine the progression of individual MeSH terms in biomedical articles and genes over time. They contrast the number of articles published for a given topic against other factors such as relative disease burden, the topic areas for a set of high-tier journals and patent filings, showing that the extent of publication growth can identify potentially important areas for research. Rajpal et al. [25] examine the significance of topics related to obesity from 2005 to 2009. Their trend analysis compares, using Fisher’s Exact Test, the prevalence of biomedical topics and genes in 2005 to their prevalence in 2009. These studies demonstrate the importance and relevance of bibliometric analyses such as MeSHOPs in identifying the focus of existing research.

Other data sources can be analyzed by MeSHOPs. Clinical applications for MeSHOPs are indicated by previous work [26] using electronic health records as an alternative source of annotated biomedical literature. Diagnoses and symptoms from the free-text problem summary lists in the health records are examined to highlight associations to patients. Alternatively, the same methodology used for MeSHOPs could leverage other web services such as RANSUM [27] (which has been expanded to STOP [28]) to investigate over-representation of different ontology terms in datasets available at the National Center for Biomedical Ontology. LePendu et al. [29] demonstrate that GO annotations are a high-quality source of articles linked to genes, and demonstrate over-representation analysis using the Disease Ontology. Good et al. [30] show that Gene Wiki articles are a suitable source of biomedical knowledge that can be automatically annotated with ontology terms.

Future directions

Many extensions of MeSHOPs remain to be explored. Incorporation of the finer shades of MeSH annotation may be feasible. We describe here the use of the MeSH terms in isolation; however, MeSH terms may be assigned ‘subheadings’ by curators. Such subheadings further specify the context of a MeSH term (e.g. a disease reference may be coupled to “diagnosis” or “therapy”).
As well, some MeSH terms are marked as major topics; future analysis could use these more nuanced features to refine the MeSHOP approach.

The organizational structure of the MeSH terms could be better addressed for MeSHOP generation. GO is structured as a directed acyclic graph, thus a term may have multiple parent terms. Grossmann et al. apply a variant of Fisher’s Exact Test: rather than comparing a term against the background frequency for the class, each term is compared against the frequency of its parental terms using “parent–child-union” and “parent–child-intersection” rules [31]. Future work on how to account for parent–child relationships in the MeSH hierarchy in this vein is thus warranted.

As evident with disease MeSHOPs, there is a positive correlation between the number of articles in a bibliography and the number of over-represented MeSH terms. Improved methods to highlight the most relevant biomedical topics may be required to account for this bias. It may be necessary to cap the size of MeSHOPs, or develop a more Bayesian approach for the statistical measurement of term over-representation that accounts for the number of papers contributing to the profile.

MeSHOPs can be generated using any source for bibliographies. Automated extraction of gene symbols from PubMed abstracts, using technology such as iHOP [32], could supply improved gene bibliographies. Subclasses of MeSHOPs, such as species-specific gene profiles, could be generated and compared. A drug MeSHOP could be supplemented with the MeSHOPs of other chemical compounds of the same family.

The quantitative comparison of entities through their MeSHOPs opens the possibility of discovery of novel information. Hierarchical clustering is shown here to group entities with known relationships together, but also provides the opportunity for discovery of new relationships by indirectly linking together entities based on the similarity of their topics.
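To make the grouping step concrete, the following is a minimal sketch of agglomerative clustering over profile vectors, assuming Euclidean distance and complete linkage (the settings reported for the hierarchical analysis in this work). It is an illustration only, not the authors' code: the function name `complete_linkage_clusters`, the dictionary input format and the toy vectors are hypothetical.

```python
from itertools import combinations
from math import dist


def complete_linkage_clusters(profiles, n_clusters):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    profiles: entity name -> equal-length numeric vector (assumed input format).
    Returns a list of clusters, each a list of entity names.
    """
    if not 1 <= n_clusters <= len(profiles):
        raise ValueError("n_clusters must be between 1 and the number of entities")
    clusters = [[name] for name in profiles]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters whose farthest members are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy score vectors, invented purely for illustration.
toy_profiles = {
    "entity_a": (0.0, 0.0),
    "entity_b": (0.0, 1.0),
    "entity_c": (5.0, 5.0),
    "entity_d": (5.0, 6.0),
}
toy_clusters = complete_linkage_clusters(toy_profiles, n_clusters=2)
```

The sketch stops at a requested number of clusters rather than building a full dendrogram, which keeps it short; swapping the `max` in `linkage` for `min` or a mean would give single or average linkage instead.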
We apply Euclidean distance and complete linkage to perform our hierarchical analysis, methods that could be rapidly computed for our MeSHOP profiles and which have previously been successfully applied in other bioinformatics clustering applications involving vectors of continuous data such as gene expression profiles. Other forms of linkage could be applied to emphasize different groupings of entities, and there exist a plethora of similarity measures that could be adapted for comparison of numerical p-value vectors.

Conclusion

MeSHOPs quantitatively represent the MeSH biomedical terms enriched across a set of papers associated with a specific biomedical entity such as a gene, disease or drug. Visual display of MeSHOPs using word clouds provides a convenient way to convey annotation properties to readers. Comparison between MeSHOPs allows for the generation of hypotheses, opening new avenues for applied text analysis in bioinformatics.

Additional file

Additional file 1: Available online. Additional Figures S1-S4 and Additional Tables S1 and S2.

Competing interests

The authors declare that they have no competing interests.

Authors’ contribution

All authors contributed to the design of the method and the analysis and interpretation of the data. WAC implemented and carried out the study. All authors read and approved the final manuscript.

Acknowledgement

The authors are grateful to Drs. Leon French, Paul Pavlidis and Raf Podowski for comments and discussion on the research and Joseph Yamada for help with the website.

Funding

This work was supported by the Canadian Institutes of Health Research [to W.W.W.]; the Ontario Institute for Cancer Research through funding by the government of Ontario [to B.F.F.O.]; the Natural Sciences and Engineering Research Council of Canada [to W.W.W. and W.A.C.]; the Michael Smith Foundation for Health Research (MSFHR) [to W.W.W. and W.A.C.]; the National Institute of General Medical Sciences [R01GM084875 to W.W.W.]; and the Canadian Institutes of Health Research/MSFHR Strategic Training Program in Bioinformatics [to W.A.C.].

Author details

1 Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada. 2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada. 3 Ontario Institute for Cancer Research, Toronto, ON, Canada. 4 Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada.

Received: 23 February 2012 Accepted: 24 September 2012 Published: 27 September 2012

References

1. Sayers E, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer L, Helmberg W, Kapustin Y, Landsman D, Lipman D, Madden T, Maglott D, Miller V, Mizrachi I, Ostell J, Pruitt K, Schuler G, Sequeira E, Sherry S, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova T, et al: Database resources of the national center for biotechnology information. Nucleic Acids Res 2009, 37.
2. Chapter 11 Relationships in Medical Subject Headings: http://www.nlm.nih.gov/mesh/meshrels.html.
3. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006, 7:119–129.
4. Hirschman L, Hayes W, Valencia A: Knowledge acquisition from the biomedical literature. In Semantic Web 2007, 53–81.
5. Djebbari A, Karamycheva S, Howe E, Quackenbush J: MeSHer: identifying biological concepts in microarray assays based on PubMed references and MeSH terms. Bioinformatics (Oxford, England) 2005, 21:3324–6.
6. Sarkar IN, Schenk R, Miller H, Norton CN: LigerCat: using “MeSH clouds” from journal, article, or gene citations to facilitate the identification of relevant biomedical literature. In Information Retrieval. Med Inform Assoc 2009, 1:563–567.
7. Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18:6097–100.
8. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res 2009, 19:1639–45.
9. Sonnhammer EL, Durbin R: A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 1995, 167:GC1–10.
10. Schwartz S: MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31:3518–3524.
11. Snel B, Lehmann G, Bork P, Huynen MA: STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 2000, 28:3442–4.
12. Baroukh C, Jenkins S, Dannenfelser R, Ma’ayan A: Genes2WordCloud: a quick way to identify biological themes from gene lists and free text. Source code for biology and medicine 2011, 6:15.
13. Desai J, Flatow JM, Song J, Zhu LJ, Du P, Huang C-c, Lin SM, Kibbe WA: Advances in computational biology. Cancer 2011, 680:709–715.
14. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol 2003, 4:P3.
15. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW: oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res 2007, 35:W245–52.
16. Kumar V: Omics and literature mining. In Bioinformatics for Omics Data: Methods and Protocols. 719th edition. Edited by Mayer B. Totowa, NJ: Humana Press; 2011:457–477.
17. Jani SD, Argraves GL, Barth JL, Argraves WS: GeneMesh: a web-based microarray analysis tool for relating differentially expressed genes to MeSH terms. BMC Bioinforma 2010, 11:166.
18. Hur J, Schuyler AD, States DJ, Feldman EL: SciMiner: web-based literature mining tool for target identification and functional enrichment analysis. Bioinformatics (Oxford, England) 2009, 25:838–40.
19. Sarkar IN, Agrawal A: Literature based discovery of gene clusters using phylogenetic methods. AMIA . . . Annual symposium proceedings/AMIA symposium. AMIA Symposium 2006, 689–93.
20. Agarwal P, Searls DB: Literature mining in support of drug discovery. Brief Bioinform 2008, 9:479–492.
21. Gene2MeSH: [Internet] [http://gene2mesh.ncibi.org].
22. Nakazato T, Takinaka T, Mizuguchi H, Matsuda H, Bono H, Asogawa M: BioCompass: A novel functional inference tool that utilizes MeSH hierarchy to analyze groups of genes. In Silico Biology 2007, 8(1):53–61.
23. Nakazato T, Bono H, Matsuda H, Takagi T: Gendoo: functional profiling of gene and disease features using MeSH vocabulary. Nucleic Acids Res 2009, 37(suppl 2):W166–W166.
24. Agarwal P, Searls DB: Can literature analysis identify innovation drivers in drug discovery? Nature reviews. Drug discovery 2009, 8:865–78.
25. Rajpal DK, Kumar V, Agarwal P: Scientific literature mining for drug discovery: a case study on obesity. Drug Dev Res 2011, 72:201–208.
26. Hanauer DA, Rhodes DR, Chinnaiyan AM: Exploring clinical associations using “-omics” based enrichment analyses. PLoS One 2009, 4:e5203.
27. Tirrell R, Evani U, Berman AE, Mooney SD, Musen MA, Shah NH: An ontology-neutral framework for enrichment analysis. AMIA . . . Annual symposium proceedings/AMIA symposium. AMIA Symposium 2010, 2010:797–801.
28. Statistical Tracking of Ontological Phrases (STOP): http://www.mooneygroup.org/stop/input.
29. LePendu P, Musen MA, Shah NH: Enabling enrichment analysis with the human disease ontology. J biomed inform 2011, 44(Suppl 1):S31–8.
30. Good BM, Howe DG, Lin SM, Kibbe WA, Su AI: Mining the Gene Wiki for functional genomic knowledge. BMC genomics 2011, 12:603.
31. Grossmann S, Bauer S, Robinson PN, Vingron M: Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics (Oxford, England) 2007, 23:3024–31.
32. Hoffmann R, Valencia A: A gene network for navigating the literature. Nat Genet 2004, 36:664.

doi:10.1186/1471-2105-13-249

Cite this article as: Cheung et al.: Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs). BMC Bioinformatics 2012 13:249.

