Open Collections

UBC Faculty Research and Publications

Compensating for literature annotation bias when predicting novel drug-disease relationships through… Cheung, Warren A; Ouellette, BF F; Wasserman, Wyeth W May 7, 2013


RESEARCH Open Access

Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity

Warren A Cheung1,2, BF Francis Ouellette3,4*, Wyeth W Wasserman1,5*

From Second Annual Translational Bioinformatics Conference (TBC 2012), Jeju Island, Korea. 13-16 October 2012

Abstract

Background: Using annotations to the articles in MEDLINE®/PubMed®, over six thousand chemical compounds with pharmacological actions have been tracked since 1996. Medical Subject Heading Over-representation Profiles (MeSHOPs) quantitatively leverage the literature associated with biological entities such as diseases or drugs, providing the opportunity to reposition known compounds towards novel disease applications.

Methods: A MeSHOP is constructed by counting the number of times each medical subject term is assigned to an entity-related research publication in the MEDLINE database and calculating the significance of the count by comparing against the count of the term in a background set of publications. Based on the expectation that drugs suitable for treatment of a disease (or disease symptom) will have similar annotation properties to the disease, we successfully predict drug-disease associations by comparing MeSHOPs of diseases and drugs.

Results: The MeSHOP comparison approach delivers an 11% improvement over bibliometric baselines. However, novel drug-disease associations are observed to be biased towards drugs and diseases with more publications. To account for the annotation biases, a correction procedure is introduced and evaluated.

Conclusions: By explicitly accounting for the annotation bias, unexpectedly similar drug-disease pairs are highlighted as candidates for drug repositioning research.
MeSHOPs are shown to provide a literature-supported perspective for discovery of new links between drugs and diseases based on pre-existing knowledge.

* Correspondence: wyeth@cmmt.ubc.ca
1 Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, University of British Columbia, Vancouver, BC, Canada
3 Ontario Institute for Cancer Research, Toronto, ON, Canada
Full list of author information is available at the end of the article

Cheung et al. BMC Medical Genomics 2013, 6(Suppl 2):S3
© 2013 Cheung et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Using previously studied and approved pharmaceutical compounds and applying them towards novel diseases or phenotypes - so-called ‘drug repositioning’ - has emerged as a key issue in biomedical research [1,2]. The cost of developing a new chemical or molecular entity with proven therapeutic benefit and established safety was estimated at over $1.8 billion in 2010, and continues to rise rapidly [3]. Therefore, using compounds with a known biochemical mechanism of action and an established safety record for new purposes is an alternative to the high cost of de novo compound research [4]. Advances in drug repositioning research have identified potential treatments for Crohn’s disease [5,6], and have raised hopes for advances in the treatment of rare, orphan disorders [7].

Informatics-based approaches to drug repositioning are exemplified by the identification of known drug targets in genes arising in genome-wide association studies [8], the prediction of structural suitability of a known compound for a new protein target [9,10], systems biology using gene expression patterns [6,11], and the study of side effects [12]. Underlying many of these informatics approaches has been the availability of reference databases containing information about the relationship between genes, drugs and diseases, such as DrugBank [13], Pharmacogenomics Knowledge Base [14,15], and the Comparative Toxicogenomics Database [16]. The broader informatics approaches to drug repositioning have been recently reviewed [2].

Advances in literature and text analysis methods offer a promising path to drug repositioning based on established knowledge. Text analysis methods have been applied to the study of FDA package inserts in the SIDER database [17] to identify side effects, to the comparison of word utilization between drug and disease-related abstracts [18,19], and to the analysis of similarity between gene ontology process annotations assigned to a known drug target and genes in disease-associated pathways. Literature-based drug repositioning has been reviewed [20,21].

The foundation of any text-based analysis is an organized resource of the primary research literature describing the properties contained in the text. The central information source for biomedical literature is the MEDLINE®/PubMed® database, encompassing over 20 million indexed articles in 2012. PubMed provides a citation resource tailored to biomedical researchers, globally accessible at no charge. This comprehensive database of medically relevant citations is curated by expert annotators at the National Library of Medicine, who index each article with topics from the controlled vocabulary of Medical Subject Headings (MeSH) [22]. MeSH terms include medically relevant categories such as Anatomy, Disease, Chemical Compounds (including pharmacologic compounds) and Psychiatric Disorders.
In addition to the topics in the main MeSH hierarchy, additional chemical compounds are indexed through the Supplementary MeSH vocabulary.

Despite the increasing wealth of raw literature knowledge, having means to evaluate and navigate the entirety of this knowledge becomes progressively more challenging. We previously introduced Medical Subject Heading Over-Representation Profiles (MeSHOPs) as a convenient quantitative representation of the properties enriched in a bibliography of scientific literature from MEDLINE [23]. MeSHOPs succinctly describe the most highly associated MeSH terms for an entity of interest. The quantitative comparison of MeSHOPs is shown here to allow the predictive inference of entity-entity relationships in a study of relationships between drugs and diseases. However, we observe that the magnitude of research literature introduces a strong bias into the study of entity-entity relationships, with the most popular diseases more likely to be linked to drugs in the future, and vice-versa. This bias parallels the effect seen when predicting gene-disease relationships via MeSHOPs, where the most popular genes are more likely to be linked to diseases, and vice-versa [24]. It is important to be aware of biases and trends in research that may influence the results of text analysis, and to correct for these biases to better direct research efforts [25,26].

In this report, we investigate the capacity of MeSHOP comparisons to detect functional relationships between pharmaceutical compounds and diseases with an emphasis on the ranking of candidates for drug repositioning research. We demonstrate that MeSHOPs capture the properties of drugs, and that such information can be compared to disease MeSHOPs to reveal functionally relevant relationships.
Entities with limited associated literature, such as some rare diseases, are shown to have disproportionate scores in initial MeSHOP comparisons. To account for existing annotation levels of drug and disease entities and identify MeSHOP similarity, we measure the annotation strength for drug and disease entities and incorporate this prior information into the scoring of prediction strength. Using this improved comparison metric we demonstrate that drug and disease MeSHOP comparisons are improved, as validated by the identification of novel associations observed in future publications and against a curated reference collection.

Methods

Pharmacological substances

In this paper, we examine the set of drugs, defined as all chemical compounds annotated as having a Pharmacologic Action, taken from both the Medical Subject Headings (MeSH) and Supplemental MeSH vocabularies. Since 1996, indexers at the National Library of Medicine have tracked articles where the action of a drug is discussed (MeSH Basics). In 2003, the MeSH Category “Pharmacologic Action” was created, in order to delineate chemical compounds which are used therapeutically as pharmacologic agents. Such annotations are conservatively assigned, requiring a minimum of 20 supporting research articles. We analyze these 6 512 drugs with respect to the diseases in the MeSH hierarchy.

Constructing drug and disease MeSHOPs

The construction of MeSHOPs has been previously described in detail [23], but we provide a description here for the reader’s convenience. A MeSHOP is a quantitative representation of the MeSH annotations associated with a set of articles, where the unifying property of the articles is that each addresses the same, specific entity (for example, all articles discussing the entity “Acetaminophen”). Each article has a curator-assigned set of MeSH terms available in MEDLINE.
Comparing the observed frequency of each MeSH term annotated to the set of articles relative to the background rate for each term returns a measure of over-representation (see below for additional details). A MeSHOP is a vector of tuples $\langle (t_1, m_1), (t_2, m_2), \ldots, (t_n, m_n) \rangle$. For each tuple $(t_i, m_i)$ in a MeSHOP, $t_i$ is a distinct MeSH term in the MeSH vocabulary and $m_i$ is the over-representation measure for the term $t_i$. To account for the tree structure of MeSH, for each MeSH term associated with an article, the article is considered associated to all of the parent terms of that MeSH term.

We consider the 6 512 pharmacologic compounds identified in MeSH 2007 as the drug entities. The 4 229 terms in MeSH 2007 in Category C “Diseases” compose the set of disease entities. We take as the set of articles for a specific entity all the MEDLINE articles annotated by the associated MeSH term. These MeSH annotations are manually curated by domain experts at the National Library of Medicine.

Predicting drug-disease associations

A drug and a responsive disease are anticipated to share common literature annotations, such as metabolic pathways, cellular processes and symptoms, even if no links between the drug and the disease have been previously reported in the literature. To infer novel relationships between a drug and a disease, we perform quantitative pairwise comparisons of MeSHOPs between members of each class. We hypothesize that a previously unassociated disease $t$ is likely to be associated with a drug $d$ if the MeSHOP $P_t$ for the disease $t$ is highly similar to the drug’s MeSHOP $P_d$.
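The profile construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, terms are represented by dotted tree-number-like identifiers so that parent terms can be taken as prefixes (an assumption, since a real MeSH term may have several tree numbers), and the over-representation measure is taken to be the upper tail of the hypergeometric distribution, computed exactly with the standard library.

```python
import math
from collections import Counter


def ancestors(tree_number):
    """Return every proper prefix of a dotted identifier, e.g. 'C11.270.060' -> ['C11', 'C11.270']."""
    parts = tree_number.split(".")
    return [".".join(parts[:k]) for k in range(1, len(parts))]


def term_counts(articles):
    """Count, for each term, the articles annotated with it or with any of its descendants."""
    counts = Counter()
    for terms in articles:
        expanded = set()
        for term in terms:
            expanded.add(term)
            expanded.update(ancestors(term))
        counts.update(expanded)  # each article contributes at most once per term
    return counts


def overrep_pvalue(k, n, K, N):
    """Hypergeometric upper tail P(X >= k): k of n entity articles carry the term,
    K of N background articles carry the term."""
    tail = sum(math.comb(K, j) * math.comb(N - K, n - j) for j in range(k, min(n, K) + 1))
    return tail / math.comb(N, n)


def meshop(entity_articles, background_articles):
    """Map each term seen in the entity's articles to its over-representation p-value.
    Assumes the entity's articles are a subset of the background set."""
    fg = term_counts(entity_articles)
    bg = term_counts(background_articles)
    n, N = len(entity_articles), len(background_articles)
    return {term: overrep_pvalue(k, n, bg[term], N) for term, k in fg.items()}
```

For a background of millions of articles the exact binomial coefficients become expensive, so a statistics package would be the practical choice; the sketch only makes the counting and the direction of the test explicit.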
When many biomedical terms are common between two profiles, the likelihood for a future association between the entities profiled is expected to increase.

Sixteen distinct similarity measures were evaluated using Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores, from counting measures such as term overlap and term coverage to calculated measures such as Euclidean (L2) and cosine distance of p-value profiles (see Table 1). The scores evaluate the shared characteristics from both the drug and the disease MeSHOPs to make predictions. Two baselines are presented for comparison: the number of terms in the drug MeSHOP, and the number of terms in the disease MeSHOP. These baselines consider only the drug MeSHOP alone, or the disease MeSHOP alone, respectively, not using any information from the other MeSHOP.

After implementing and evaluating the scoring metrics using AUC scores, a consistently effective metric was determined to be the Euclidean distance of the log of the p-value for the overlapping terms between the drug and the disease. P-values were reported by Fisher’s Exact Test based on a hypergeometric distribution of term utilization across a background set of articles. For this report, two background sets are considered. When working within a specific class of entities (e.g. drugs), the background is most appropriately all articles that are associated with one or more members of the entity class. For comparisons between entity classes, a universal background is used. For this study, the universal set contained 17 million MEDLINE articles assigned MeSH terms in MEDLINE 2007.

Correcting for pre-existing literature annotation

Given the significant impact of annotation bias on pairwise MeSHOP comparison, we introduce a correction of our similarity scores for these pre-existing literature effects.
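For reference, the uncorrected score that this correction operates on, the Euclidean distance of log p-values over the terms shared by the two profiles, might be sketched as below. The function name and the dictionary representation of a profile (term mapped to p-value) are assumptions made for illustration, as is the `floor` parameter, which guards against taking the logarithm of a p-value that has underflowed to zero.

```python
import math


def l2_log_p_overlap(drug_profile, disease_profile, floor=1e-300):
    """Euclidean distance between log p-values, restricted to terms present in both profiles.

    Each profile maps a MeSH term to its over-representation p-value.
    `floor` is a hypothetical safeguard, not part of the published method.
    """
    shared = drug_profile.keys() & disease_profile.keys()
    squared = 0.0
    for term in shared:
        diff = math.log(max(drug_profile[term], floor)) - math.log(max(disease_profile[term], floor))
        squared += diff * diff
    return math.sqrt(squared)
```

Terms found in only one profile do not contribute, so two profiles with no shared terms receive a score of zero under this sketch; how such pairs are handled in practice is not stated in the text.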
This correction aims to normalize the scores with respect to existing literature annotation, correcting for inherent biases in the scoring methods and revealing associations that are due to the similarity of annotation rather than the amount of annotation (the research “popularity” of the entity).

Expressed formally, let us consider drug-disease relationships, with scores $X_s$, drug annotation levels $X_c$ and disease annotation levels $X_d$, where the annotation level is the number of MeSH terms annotated to articles in MEDLINE for the drug or disease. For a given drug $c$ and disease $d$ with drug annotation level $x_c$, disease annotation level $x_d$ and drug-disease score $x_s$, we want to determine the probability that $x_s$ is more extreme than a random drug-disease relationship score with drug annotation level $x_c$ and disease annotation level $x_d$:

$$P(X_s > x_s \mid (X_c = x_c) \wedge (X_d = x_d))$$

However, this probability can only be directly computed when the set of drugs and diseases is sufficiently large that there are many drugs and many diseases with the same level of annotation. In order to correct for the previously observed bias, we instead adjust the significance based on the local distribution of scores observed for similarly annotated entities:

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d))$$

By the definition of conditional probability, this can be computed as

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d)) = \frac{P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))}{P((X_c \approx x_c) \wedge (X_d \approx x_d))}$$

Since the events $X_c \approx x_c$ and $X_d \approx x_d$ are independent, this can be further simplified to

$$P(X_s > x_s \mid (X_c \approx x_c) \wedge (X_d \approx x_d)) = \frac{P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))}{P(X_c \approx x_c)\,P(X_d \approx x_d)}$$

We select $P(X_c \approx x_c) = P(X_d \approx x_d) = 0.1$, and compare against the 10% of the drugs that are most similar, annotation level-wise, to the drug in the relationship of interest, and likewise for 10% of the diseases. Specifically, we take the drugs within ±5 percentile of annotated term counts, and likewise the diseases within ±5 percentile of annotated term counts.
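The peer-based comparison set up here can be sketched as an empirical tail probability. The sketch below is an illustration under stated assumptions rather than the authors' code: `peers`, `corrected_p` and `half_width_pct` are hypothetical names, peers are chosen by rank position within ±5% of the ranked list, and "more extreme" is read as "greater than", following the formula. For a distance-style score, where smaller values mean greater similarity, the inequality would presumably be reversed.

```python
def peers(levels, target, half_width_pct=5):
    """Entities whose annotation-level rank lies within +/- half_width_pct percent of the target's rank.

    `levels` maps an entity name to its annotation level (number of MeSH terms).
    """
    ranked = sorted(levels, key=lambda name: (levels[name], name))
    window = len(ranked) * half_width_pct // 100  # integer number of ranks on each side
    idx = ranked.index(target)
    return ranked[max(0, idx - window): idx + window + 1]


def corrected_p(scores, drug_levels, disease_levels, drug, disease):
    """Fraction of peer drug-disease pairs whose score exceeds the score of (drug, disease).

    `scores` maps (drug, disease) tuples to raw similarity scores.
    """
    observed = scores[(drug, disease)]
    pool = [
        scores[(c, d)]
        for c in peers(drug_levels, drug)
        for d in peers(disease_levels, disease)
        if (c, d) in scores
    ]
    return sum(s > observed for s in pool) / len(pool)
```

The pool always contains the pair under test, so the denominator is never zero. Conditioning on the peer group in this way is equivalent to dividing the joint probability by the probability of the peer group, as in the formula above.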
The similarity scores for each possible drug-disease pairing between these selected groups are extracted. By comparison against these scores, an empirical significance score of the candidate drug-disease pairing is assigned. Given the 4 229 diseases and 6 512 drugs, selecting 10% yields hundreds of drug and disease peers, and several hundred thousand scores with which to compare.

$P((X_s > x_s) \wedge (X_c \approx x_c) \wedge (X_d \approx x_d))$ is computed by dividing the number of drug-disease relationships with score greater than $x_s$ and with drug and disease annotation similar to $x_c$ and $x_d$ respectively, by the total number of drug-disease relationships. The correction described allows us to separate the effect of the level of annotation for the drug and disease from the similarity of the concepts and allows the user to distinguish high-scoring drug-disease relationships that are primarily due to the annotation level of the drug or disease concept, from high-scoring relationships that arise due to sharing significant profile similarity.

Table 1 Explanation of the scoring functions evaluated.

| Scoring Method | Description |
| --- | --- |
| Cosine Distance of Term Frequency-Inverse Document Frequency | $\sum_{j \in M} c_i(j)\,d_i(j) \Big/ \left( \sqrt{\sum_{j \in M} c_i(j)^2}\, \sqrt{\sum_{j \in M} d_i(j)^2} \right)$ |
| Cosine Distance of p-values | $\sum_{i \in M} c_p(i)\,d_p(i) \Big/ \left( \sqrt{\sum_{i \in M} c_p(i)^2}\, \sqrt{\sum_{i \in M} d_p(i)^2} \right)$ |
| Cosine Distance of term fractions | $\sum_{i \in M} c_f(i)\,d_f(i) \Big/ \left( \sqrt{\sum_{i \in M} c_f(i)^2}\, \sqrt{\sum_{i \in M} d_f(i)^2} \right)$ |
| Sum of the log of combined p-values | $\sum_{i \in M} \log\left(c_p(i) + d_p(i) - c_p(i)\,d_p(i)\right)$ |
| Sum of the differences of log p values | $\sum_{i \in M} \left\lvert \log\frac{c_p(i)}{d_p(i)} \right\rvert = \sum_{i \in M} \left\lvert \log c_p(i) - \log d_p(i) \right\rvert$ |
| L2 of log-p of overlapping terms only | $\sqrt{\sum_{i \in (C \cap D)} \left(\log c_p(i) - \log d_p(i)\right)^2}$ |
| L2 of term fractions of overlapping terms only | $\sqrt{\sum_{i \in (C \cap D)} \left(c_f(i) - d_f(i)\right)^2}$ |
| L2 of log of p-values | $\sqrt{\sum_{i \in M} \left(\log\frac{c_p(i)}{d_p(i)}\right)^2} = \sqrt{\sum_{i \in M} \left(\log c_p(i) - \log d_p(i)\right)^2}$ |
| L2 of p-values | $\sqrt{\sum_{i \in M} \left(c_p(i) - d_p(i)\right)^2}$ |
| L2 of term fractions | $\sqrt{\sum_{i \in M} \left(c_f(i) - d_f(i)\right)^2}$ |
| L2 of term frequency | $\sqrt{\sum_{i \in M} \left(c(i) - d(i)\right)^2}$ |
| Term Coverage | $\lvert C \cup D \rvert$ |
| Term Overlap | $\lvert C \cap D \rvert$ |
| Number of Drug MeSH Terms | $\lvert C \rvert$ |
| Number of Disease MeSH Terms | $\lvert D \rvert$ |

$M$ refers to the set of all MeSH terms; $C$ and $D$ refer to the MeSH terms for the drug and disease profile respectively. $c(i)$, $c_f(i)$, $c_p(i)$ and $c_i(i)$ refer to the frequency, term fraction, hypergeometric p-value and term frequency-inverse document frequency for the MeSH term $i$ of the drug profile. $d(i)$, $d_f(i)$, $d_p(i)$ and $d_i(i)$ refer to the frequency, term fraction, hypergeometric p-value and term frequency-inverse document frequency for the MeSH term $i$ of the disease profile.

Validating drug-disease associations

To evaluate drug-disease associations predicted by MeSHOP similarity, we analyzed the 2007 baseline release of MEDLINE to generate predictions, and measured our predictive performance against annotations appearing in future releases of MEDLINE. The annual MEDLINE baseline releases 2007 and 2010 were used as the source of MeSH annotations for articles and were obtained directly from the NLM. The drug and disease MeSHOPs, computed for the MEDLINE baseline 2007, were compared using a panel of 16 similarity scores. Future disease-drug relationships are predicted if MeSHOP comparison similarity scores exceed an applied threshold. Predictions were validated against drug-disease co-occurrences that appeared in the future MEDLINE releases which had not appeared in articles before 2007. A true positive novel association means an article referring to a previously unconnected drug-disease pair was published in the interim period between the 2007 and 2010 MEDLINE baselines.

As a second validation set, the Comparative Toxicogenomics Database (CTD) was used as a source of curated drug-disease relationships. We matched drugs from the 2011 CTD to the drugs defined in MEDLINE 2007, and defined a reference collection of 291 novel drug-disease relationships for those entries in CTD that were defined by publications appearing in the period of 2007-2011.
The reference collection contains 191 unique drugs and 150 unique diseases.

Using these validation sets, we evaluate the candidate scoring methods by computing the Receiver Operating Characteristic (ROC) curve for predictions from analysis of the baseline 2007 data and reporting the Area Under the ROC Curve (AUC). Novel drug-disease pairs from the two reference sets are defined as “true positives”, and all other drug-disease pairings are defined as “true negatives” (which is recognized to be conservative, as such pairs may be validated in future studies). All drug-disease pairs reported prior to 2007 are excluded from the AUC analysis.

The gold standard dataset analysed by the PREDICT algorithm [27] was mapped, with 574 of the 593 drugs mapping to 2007 MeSH pharmacologic compounds and 190 of the 313 diseases mapping to MeSH Category C disease terms. A small number of drugs were not identified as pharmacologic compounds in 2007 MeSH. Diseases mapping to a combination of multiple MeSH disease phenotypes, or mapping to MeSH terms that were not in the Disease Category of MeSH, were not included. Overall, 924 of the 1933 associations from the gold standard were mapped, comprising 406 drugs and 160 diseases. For the purposes of calculating the ROC validation curves, all other drug-disease associations between the mapped drugs and diseases are considered to be false positives.
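The AUC used throughout the validation can be illustrated with its rank-based definition: the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as half. The sketch below is a plain pairwise implementation for small inputs and is not the evaluation code used in the study; the function name is hypothetical, and it assumes that higher scores indicate stronger predictions, so a distance-style score would need to be negated first.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `scores` are prediction scores (higher means more likely positive);
    `labels` are truthy for positive pairs and falsy for all other pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both positive and negative examples are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

A score that carries no information yields 0.5 and a perfect ranking yields 1.0. With millions of candidate pairs, a sort-based computation would be preferable to this quadratic loop.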
All the drug and disease mapped terms, as well as all the mapped gold standard drug-disease relationships, are available for download.

Analysis was performed using Python, XSLT, and the MySQL database system. Fisher’s Exact Test p-values were computed using the R statistics package. Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine. A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM.

Data was leased and downloaded from MEDLINE/PubMed. The Comparative Toxicogenomics Database validation set was taken from the drug-disease relationships dataset. [...] are freely accessible on the web. Source code implemented in Python is available ([...] and disease profile analysis; evaluation and validation of results).

Results

Generation of drug MeSH Over-representation Profiles (MeSHOPs)

MeSHOPs provide a quantitative overview of the biomedical knowledge associated with an entity of interest through the indexed biomedical terms. Following the described methods, MeSHOPs for all indexed diseases and drugs in MEDLINE were generated using archived MEDLINE data up until 2007. A drug MeSHOP is presented for acetaminophen (Figure 1), and a disease MeSHOP is presented for Aniridia (Figure 2). The scores within MeSHOPs are influenced by the background correction for the expectation of MeSH term frequency.
If one takes the background rate from all articles in MEDLINE, MeSH terms preferentially associated with drugs are likely to be emphasized in the drug MeSHOPs, such as ‘pharmaceutical preparation’. The strong scores for such drug-related terms can be corrected for by using class-specific backgrounds - such as the subset of articles that address one or more drugs. For comparisons of MeSHOPs across categories, as will follow, we select the universal background as a common background for all entities being compared.

Figure 1 MeSHOP for Acetaminophen. All terms in the Acetaminophen MeSHOP with a p-value of 0 are presented in this word cloud. The size of each term in the word cloud is proportional to the number of related articles for the term.

Figure 2 MeSHOP for Aniridia. The top 150 terms in the profile for the disease Aniridia are shown, where the font size of each MeSH term is proportional to the negative log p-value for the term.

Predicting drug-disease associations

We examine the utility of drug-disease MeSHOP similarity scores for the prediction of drug-disease co-annotation in future publications. Table 2 demonstrates that comparison of drug and disease MeSHOPs predicts future drug-disease co-occurrence in subsequent years (2007-2011). The most effective similarity score is the Euclidean distance of log-p of overlapping terms only, which produces an AUC score of 0.95 for the prediction of future co-occurrence in publications:

$$\sqrt{\sum_{i \in (C \cap D)} \left(\log(c_p(i)) - \log(d_p(i))\right)^2}$$

($C$ and $D$ refer to the MeSH terms of drug and disease MeSHOPs respectively; $c_p(i)$ and $d_p(i)$ refer to the p-value for the MeSH term $i$ of the drug or disease profile respectively.)

Enthusiasm for the performance is tempered, however, by the fact that a simple metric of the number of MeSH terms associated with a disease, when used as a prediction ranking, produces an AUC score of 0.84 (and counts for drug-associated MeSH terms produce a score of 0.80). Randomly assigned scores will produce an AUC of 0.5. These results are consistent with a process in which well-studied diseases (or drugs) are more likely to be the subject of future research publications and therefore more likely to co-occur with drugs than diseases that have few publications. These scores reflect a systematic limitation in the scoring procedure that needed to be resolved to allow for the identification of drugs suitable for orphan disorders, as well as to produce a more refined list of candidates to pursue.

When we examine the mapped validation set evaluated by the PREDICT algorithm, we see a non-random but weaker predictive ability from the number of terms for the disease (AUC of 0.60) and the number of terms for the drug (AUC of 0.58).

Comparing drug-disease MeSHOP profiles can yield AUC of up to 0.87, comparing favorably to the AUC of 0.90 reported by the PREDICT algorithm on the unmapped gold standard dataset (see Table 2).

Annotation bias observed for curated drug-disease relationships

Predicted novel drug-disease relationships were alternatively assessed against a curated reference collection from CTD that contains bona fide drug uses (i.e. not just co-occurrence in a paper, but manually assessed evidence that the drug is used as a treatment for a disorder). As seen in Table 2, similarity of MeSHOPs is able to accurately predict novel associations by comparing MeSHOPs of drugs and diseases, achieving ROC AUC of 0.93 (for the sum of the log of combined p-values). The Euclidean distance of overlapping terms metric, which performed best in the previous MeSHOP comparison performance tests, produces a similar ROC AUC of 0.92. As displayed in Figure 3, well-studied drugs and diseases are substantially over-represented in the validation set.
Over half of the 191 drugs are in the top 10% of all drugs in terms of amount of associated MeSH annotation (the peak to the left of the histogram). Only slightly less biased, of the 150 diseases, over half are in the top 15% of diseases, in terms of associated MeSH annotation. Consistent with these properties, using the baseline MeSH term counts for drug or disease annotation levels as scores, a ROC AUC of 0.83 is achieved. As for the co-occurrence measure, it is clear that annotation bias is a strong predictor for bona fide interactions.

Controlling for annotation bias

The influence of annotation on the MeSHOP comparison scores can be visualized using heatmaps. As seen in Figure 4, and fully consistent with the AUC scores above, there is a high degree of correlation between the amount of annotation for the disease (as measured by the number of MeSH terms in the disease profile), and the drug-disease score (Pearson correlation of -0.82). A correlation of -0.33 is observed when comparing drug-disease scores against the degree of drug annotation (see Figure 5). For a candidate list for drug repositioning, this annotation bias must be eliminated to allow for more rarely studied drugs or diseases to emerge from the analysis as candidates. We introduce a corrected scoring procedure for MeSHOP comparisons that computes the significance of similarity scores based on the distribution of scores for drug-disease tuples with similar annotation levels. In short, the observed similarity score should be remarkable given the level of annotation of the drug and disease in the tuple.
After applying this correction for drug-disease annotation bias, both disease annotation level and drug annotation level have very low correlation to the drug-disease score (0.08 and 0.05 respectively) (see Figure 6 and Figure 7). Table 3 demonstrates how the correction re-ranks the candidate drugs, shifting focus away from general compounds like monoclonal antibodies, immunoglobulin G, epinephrine and iron to compounds more directly related to Arthritis and Gout. This also highlights some similar compounds that have not previously been linked to gout such as glucametacin and imidazole-2-hydroxybenzoate. We see similar results for the candidate drug lists for Asthma, Cardiac Arrhythmias, Jaundice and Lupus and provide the entire list of drug-disease relationships with raw and corrected scores online (see Additional file 1 and Supplementary Table 2).

Table 2 Performance of a selection of drug-disease similarity scores.

| Scoring Method | Direct Connection Validation AUC | CTD Validation AUC | PREDICT Validation AUC |
| --- | --- | --- | --- |
| Corrected drug-disease p-value | 0.65 | 0.76 | 0.66 |
| Cosine distance tf-idf | 0.88 | 0.91 | **0.87** |
| Cosine distance of p-values | 0.64 | 0.70 | 0.52 |
| Cosine distance of term fractions | 0.78 | 0.83 | 0.80 |
| Sum of the log of combined p-values | 0.92 | **0.93** | 0.80 |
| Sum of the differences of log p values | 0.89 | 0.86 | 0.58 |
| L2 of log-p of intersecting terms | **0.95** | 0.92 | 0.66 |
| L2 of term fractions of intersecting terms only | 0.64 | 0.55 | 0.57 |
| L2 of log of p-values | 0.88 | 0.84 | 0.57 |
| L2 of p-values | 0.87 | 0.82 | 0.56 |
| L2 of term fractions P(s < S) | 0.85 | 0.90 | 0.78 |
| L2 of term frequency | 0.87 | 0.83 | 0.62 |
| Total number of terms | 0.90 | 0.87 | 0.62 |
| Number of Intersecting Terms | 0.91 | 0.91 | 0.63 |
| Number of Drug Terms | 0.80 | 0.83 | 0.58 |
| Number of Disease Terms | 0.84 | 0.83 | 0.60 |

Performance validated using novel direct drug-disease co-occurrences from MEDLINE, and novel drug-disease relationships from the CTD. Top scores for each validation set are presented in boldface type.

Figure 3 Distribution of drug annotation (A) and disease annotation (B) in the new drug-disease associations of the CTD validation set. The x-axis represents the quantile of the MeSH term counts for the drugs (part A) and diseases (part B) in the CTD reference collection. The histograms indicate that both drugs and diseases within the CTD reference collection are biased toward greater numbers of associated MeSH terms.

Figure 4 The degree of disease annotation plotted against MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using Euclidean Distance. The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is highly inversely correlated (Pearson correlation score of -0.82) with the similarity score.

Discussion

In this report, we introduce a new literature-based procedure for the analysis of drug-disease similarity with a focus on the identification of candidates for drug-repositioning. Using MeSH Over-representation Profiles (MeSHOPs) as quantitative representatives for biological entities, we seek to identify drugs and diseases with similar annotation under the expectation that such similarity may be suggestive of potential for repositioning. Drug-disease MeSHOP similarity scores, using a panel of metrics, are found to be strongly influenced by the level of annotation of drugs and diseases. The most heavily studied diseases and drugs are disproportionately emphasized by the comparison scores. A new corrected scoring procedure is introduced to account for the background expectation of similarity scores for comparably annotated drugs and diseases.
The new procedure is demonstrated to account for the bias. Application of the MeSHOP similarity scoring procedure reveals a set of candidate drugs for future repositioning research.

The assessment of drug repositioning candidate predictions is necessarily problematic. Given the expense of validating drug efficacy, there is no reference collection against which to measure performance. In this report we elected to take two approaches as references. First, we predicted future co-occurrence in the research literature. This measure is indirect, as co-occurrence does not necessarily reflect a functional tie between the drug and disease. Furthermore, this measure is particularly susceptible to annotation influence - well-studied drugs and diseases have a higher rate of future publications and are thus more likely to be linked. The second reference collection tested was extracted from the CTD, which records bona fide drug-disease links. The performance measurements reflect a similar literature bias on the CTD results, which may reflect a tendency for well-studied drugs to be tested for utility in well-studied disease therapy.

Figure 5 The degree of drug annotation vs. MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance. The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is inversely correlated (Pearson correlation score of -0.33) with the similarity score.

Within this report, we observe that the MeSHOP comparisons perform better than simple annotation measures, which indicates that the similarity assessment has value. Furthermore, we were able to identify and correct for the annotation bias influence on the analysis.
It is our hope that future annotation-based similarity measures will be evaluated for the biases we observe here.

The source of the annotation biases identified in the validation sets may lie in methodological bias or be intrinsic to the nature of drug-disease relationships. The case for methodological bias notes the relationship between the existence of experimental protocols and the publication of related research. The study of a disease depends on the availability of appropriate animal models, a family with a history of the condition, a large-scale association study, and an accurate protocol to diagnose the condition. The rarity and severity of the disease also change the degree of research interest. Likewise, the study of drugs benefits from animal models, bioassays to detect the compound, the ability and ease of generating the compound, and the ability to deliver an appropriate dosage of the compound to the targets of interest. Other factors motivating research directions are the availability of funding and the focus of existing lab personnel and their research towards more popular directions of research.

However, the bias may also be intrinsic to the nature of the disease or of the drug. Gillis and Pavlidis [28] have previously observed that multifunctional genes are a strong driver in gene function prediction. They identify gene multifunctionality through protein interaction and co-expression datasets, which encompass previous definitions of the “hub-ness” of a particular gene. A drug may have a more global effectiveness, due to targeting these multifunctional genes or their pathways, and thereby be involved in more drug-disease associations. Similarly, there may be diseases that are involved in key processes and are therefore the target of many potential drugs.

Whether the biases are intrinsic to the biology of drugs and diseases, primarily introduced by human nature in the research, or some combination of these factors will ultimately be revealed by the results of future research. As our knowledge of the nature of drugs and diseases increases and matures, the human elements and methodological biases will become increasingly less significant, leaving us to identify the degree to which this bias is due to the biological mechanism and nature of the drugs and diseases.

Figure 6 Disease annotation vs. corrected MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels. The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is no longer correlated (Pearson correlation score of 0.08) once corrected.

Figure 7 Drug annotation vs. corrected MeSHOP comparison score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels. The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is no longer correlated (Pearson correlation score of 0.05) once corrected.

Table 3 Comparison of top drug candidates for gout.

      Corrected Predictions                                Original Predictions
Rank  Drug                                Score  Articles  Drug                          Score     Articles
1     Kebuzone                            0.14   2         Antibodies, Monoclonal        3.66E+08  4
2     Alminoprofen                        0.17   1         Glucose                       3.64E+08  5
3     Benziodarone                        0.24   3         Insulin                       3.24E+08  5
4     Proquazone                          0.26   1         Norepinephrine                3.14E+08  1
5     Isoxicam                            0.35   1         Tetradecanoylphorbol Acetate  3.02E+08  1
6     Glucametacin                        0.48             Immunoglobulin G              2.98E+08  11
7     proglumetacin                       0.52   2         Nitric Oxide                  2.97E+08  5
8     imidazole-2-hydroxybenzoate         0.52             Interferon-gamma              2.88E+08  1
9     Prenazone                           0.57   1         Serotonin                     2.69E+08  4
10    diclofenac hydroxyethylpyrrolidine  0.59             Antibodies                    2.62E+08  5

We compare the top 10 drug candidates after applying our described correction against the top 10 candidates before correction. Articles lists the number of MEDLINE articles in which the MeSH term “Gout” and the drug co-occur.

The underlying principle motivating the comparison approach to reveal novel drug repositioning candidates is that there will be shared characteristics of the drug actions and disease properties. While the current approach utilizes universal comparisons across all MeSH terms, it may be beneficial to restrict the analysis to a subset of more relevant MeSH terms. Development of a procedure to restrict the terms (the features) of MeSHOPs may allow for more specific drug repositioning candidates to emerge in the future.

Future work

MeSH provides a wide spectrum of medically relevant topics; however, some applications may be better served by a vocabulary with more specific terms in the field of interest.
For example, there are only eight terms in MeSH (Akathisia, Drug-Induced; Drug Eruptions; Drug Toxicity; Dyskinesia, Drug-Induced; Epidermal Necrolysis, Toxic; Erythema Nodosum; Serotonin Syndrome; Serum Sickness) relating directly to adverse drug events. Instead, there are several subheadings, including “adverse effects”, “poisoning”, “toxicity” and “contraindications”, which can occur with drug terms, and the “chemically induced” and “complications” subheadings, which occur with adverse outcomes. Expanding the analysis to look specifically for these subheading modifiers could allow us to extract a subset of articles directly relevant to adverse drug reactions for MeSHOP analysis. Alternatively, another source linking side effects to articles could be employed to supplement our existing analysis with side-effect data.

CitationRank [29] was used to highlight genes involved in adverse drug reactions by analyzing the co-occurrence of genes in articles relating to an adverse drug reaction. Looking at the comprehensive network of MeSHOP similarity between genes, drugs and diseases would allow a similar network-style analysis, adding the information of the gene entities.

Rather than predicting drug-disease associations directly, another application of the method could be to highlight potential links between drugs and mechanisms of action. Drug therapies can be effective even when the understanding of the underlying mechanism of action is incomplete.
These predicted drug-mechanism links could also be related back to relevant diseases, indirectly helping to form hypotheses about the biology of a disease and effective mechanisms for treatment.

Conclusions

Comparing MeSHOPs allows quantitative analysis of MeSH biomedical topics shared between drugs and diseases through their MEDLINE-indexed primary literature. Quantitatively measuring MeSHOP similarity is shown to infer functional relationships between drugs and diseases. Specifically, the similarity between drug MeSHOPs and disease MeSHOPs is highly predictive of future drug-disease ties. The best similarity metric, using Euclidean distance of the log-p of overlapping terms, achieves a mean AUC of 0.94, an 11% improvement over baseline. However, bibliometric characteristics, such as the number of terms in the disease MeSHOP, are demonstrated to have a strong bias in drug-disease association. We describe here a correction that eliminates this bias in the scoring metrics, separating the effects of the similarity scoring from the annotation bias.

Funding

This work was supported by the Canadian Institutes of Health Research [to WWW]; the Ontario Institute for Cancer Research through funding by the government of Ontario [to BFFO]; the Natural Sciences and Engineering Research Council of Canada [to WWW and WAC]; the Michael Smith Foundation for Health Research (MSFHR) [to WWW and WAC]; the National Institute of General Medical Sciences [R01GM084875 to WWW]; and the Canadian Institutes of Health Research/MSFHR Strategic Training Program in Bioinformatics [to WAC].

Additional material

Additional file 1: Comparison of drug-disease candidates for five disorders. The top 20 drug candidates for gout, cardiac arrhythmia, lupus, jaundice and asthma are provided. We contrast the corrected and uncorrected drug candidate lists for each disorder.
The uncorrected list is heavily biased to general compounds such as Monoclonal Antibodies, Norepinephrine and Iron, whereas the corrected drug candidates focus on drugs that are much more specific to the disorder. This file is in Excel format.

Authors’ contributions

All authors contributed to the design of the method and the analysis and interpretation of the data. WAC implemented and carried out the study. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The authors are grateful to Leon French, Paul Pavlidis and Raf Podowski for comments and discussion on the research and Joseph Yamada for help with the website development.

Declarations

The publication costs for this article were funded by the Centre for Molecular Medicine and Therapeutics (funds awarded to WWW) and the Ontario Institute for Cancer Research (funds awarded to BFFO).
This article has been published as part of BMC Medical Genomics Volume 6 Supplement 2, 2013: Selected articles from the Second Annual Translational Bioinformatics Conference (TBC 2012). The full contents of the supplement are available online.

Author details

1 Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, University of British Columbia, Vancouver, BC, Canada.
2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada.
3 Ontario Institute for Cancer Research, Toronto, ON, Canada.
4 Department of Cells and Systems Biology, University of Toronto, Toronto, ON, Canada.
5 Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Published: 7 May 2013

References

1. Ashburn TT, Thor KB: Drug repositioning: identifying and developing new uses for existing drugs. Nature reviews Drug discovery 2004, 3:673-83.
2. Dudley JT, Deshpande T, Butte AJ: Exploiting drug-disease relationships for computational drug repositioning. Brief Bioinform 2011, 12:303-311.
3. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL: How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nature reviews Drug discovery 2010, 9:203-14.
4. Deftereos SN, Andronis C, Friedla EJ, Persidis A, Persidis A: Drug repurposing and adverse event prediction using high-throughput literature analysis. Wiley interdisciplinary reviews. Systems biology and medicine 2011, 3:323-34.
5. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte AJ: Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Science Translational Medicine 2011, 3:96ra77-96ra77.
6. Dudley JT, Sirota M, Shenoy M, Pai RK, Roedder S, Chiang AP, Morgan AA, Sarwal MM, Pasricha PJ, Butte AJ: Computational Repositioning of the Anticonvulsant Topiramate for Inflammatory Bowel Disease. Science Translational Medicine 2011, 3:96ra76-96ra76.
7. Sardana D, Zhu C, Zhang M, Gudivada RC, Yang L, Jegga AG: Drug repositioning for orphan diseases. Brief Bioinform 2011, 12:346-356.
8. Sanseau P, Agarwal P, Barnes MR, Pastinen T, Richards JB, Cardon LR, Mooser V: Use of genome-wide association studies for drug repositioning. Nature Biotechnology 2012, 30:317-320.
9. Kinnings SL, Liu N, Buchmeier N, Tonge PJ, Xie L, Bourne PE: Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS computational biology 2009, 5:e1000423.
10. Li YY, An J, Jones SJM: A large-scale computational approach to drug repositioning. Genome informatics. International Conference on Genome Informatics 2006, 17:239-47.
11. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte AJ: Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Science Translational Medicine 2011, 3:96ra77-96ra77.
12. Yang L, Agarwal P: Systematic drug repositioning based on clinical side-effects. PloS one 2011, 6:e28025.
13. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research 2008, 36:D901-6.
14. Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, Hewett M, Lin Z, Liu Y, Liu S, Oliver DE, Rubin DL, Shafa F, Stuart JM, Altman RB: Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. The pharmacogenomics journal 2001, 1:167-70.
15. Hewett M, Oliver D, Rubin D, Easton K, Stuart J, Altman R, Klein T: PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res 2002, 30:163-165.
16. Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ: The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res 2011, 39:D1067-D1072.
17. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: A side effect resource to capture phenotypic effects of drugs. Molecular systems biology 2010, 6:343.
18. Swanson DR: Somatomedin C and arginine: implicit connections between mutually isolated literatures. Perspectives in biology and medicine 1990, 33:157-86.
19. Frijters R, Van Vugt M, Smeets R, Van Schaik R, De Vlieg J, Alkema W: Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases. PLoS Computational Biology 2010, 6:e1000943.
20. Andronis C, Sharma A, Virvilis V, Deftereos S, Persidis A: Literature mining, ontologies and information visualization for drug repurposing. Briefings in bioinformatics 2011, 12:357-68.
21. Plake C, Schroeder M: Computational polypharmacology with text mining and ontologies. Current pharmaceutical biotechnology 2011, 12:449-57.
22. Chapter 11 Relationships in Medical Subject Headings. [].
23. Cheung WA, Ouellette BF, Wasserman WW: Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs). BMC bioinformatics 2012, 13:249.
24. Cheung WA, Ouellette BF, Wasserman WW: Inferring novel gene-disease associations using medical subject heading over-representation profiles. Genome medicine 2012, 4:75.
25. Fedorov O, Müller S, Knapp S: The (un)targeted cancer kinome. Nature chemical biology 2010, 6:166-169.
26. Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH: Too many roads not taken. Nature 2011, 470:163-5.
27. Gottlieb A, Stein GY, Ruppin E, Sharan R: PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology 2011, 7:496.
28. Gillis J, Pavlidis P: The Impact of Multifunctional Genes on “Guilt by Association” Analysis. PLoS ONE 2011, 6:e17258.
29. Yang L, Xu L, He L: A CitationRank algorithm inheriting Google technology designed to highlight genes responsible for serious adverse drug reaction. Bioinformatics 2009, 25:2244-2250.

doi:10.1186/1755-8794-6-S2-S3
Cite this article as: Cheung et al.: Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity. BMC Medical Genomics 2013 6(Suppl 2):S3.

