Inferring novel relationships through over-representation analysis of medical subjects in biomedical… Cheung, Warren A. 2012

Inferring Novel Relationships through Over-Representation Analysis of Medical Subjects in Biomedical Bibliographies

by

Warren A. Cheung

B.Sc., The University of British Columbia, 2002
M.Sc., The University of British Columbia, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2012

© Warren A. Cheung, 2012

Abstract

MEDLINE®/PubMed® is a richly annotated resource of over 21 million article citations, currently growing at a rate of over 600,000 citations annually. One grand challenge of bioinformatics is analysing the extensive literature for a biomedical entity such as a gene or disease. This thesis explores using over-representation to extract pertinent biomedical annotation from the research articles for an entity. The quantitative profiles generated are compared to predict novel associations between entities.

Medical Subject Heading Over-representation Profiles (MeSHOPs) are constructed from the primary literature of an entity of interest. Medical subject annotations for each article are extracted. Statistical tests evaluate the significance of each term’s frequency across the set of articles, compared against an appropriate background set. The resulting MeSHOP is composed of each term and its corresponding enrichment p-value. MeSHOPs can be computed for any entity with an associated bibliography of PubMed articles.

We evaluate the predictive performance of quantitatively comparing MeSHOPs to discover novel associations between gene and disease entities, achieving up to 16% improvement in accuracy compared to gene or disease baseline features (measured as increased Receiver Operating Characteristic Area Under the Curve). We observed that the level of literature annotation strongly biases the predictive performance for future gene-disease associations.
We observe similar results in a parallel analysis of associations between drugs and disease.

Efficiently identifying authors with similar research interests is a challenge in science. During the peer review process, authors seek scientists with similar expertise. MeSHOPs are generated for individual authors, identifying their research foci. By extending the methods to allow comparison across large sets of entities, we identified overlapping research interests between researchers. The predictive performance was evaluated for its capacity to identify authors working in the same research domains.

Biomedical annotation analysis of primary literature provides insight into the areas of research focus, and is demonstrated to link entities through similarities in their MeSHOPs. We quantitatively confirm the trend where well-studied genes, diseases and drugs are more likely to be the focus of further research. MeSHOP analysis demonstrates that knowledge in the annotated primary literature can be efficiently mined, and the untapped knowledge therein can be discovered computationally.

Preface

Together with my supervisors, Francis Ouellette and Wyeth Wasserman, I identified and designed the research program that is explored in this thesis. I implemented and performed the experiments and the analysis of the research data under their co-supervision. We were all involved in preparing the manuscripts and this thesis.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Chapter 1: Introduction
    Motivation
    Overview of the Thesis Dissertation
    Review of Existing Literature
        Biomedical Citation Database
        Biomedical Subject Indexes
        Knowledge Discovery
        Visualisation
        Conclusion of Literature Review
    Formal Summary of Chapters
Chapter 2: Quantitative Biomedical Annotation using Medical Subject Heading Over-representation Profiles (MeSHOPs)
    Synopsis
    Introduction
    Methods
        MeSH Over-representation Profiles
        Medical Subject Heading Annotation Data
        Generating Disease MeSHOPs
        Generating Gene MeSHOPs
        Implementation
    Results
        Calculating MeSHOPs for Biomedical Entities
        Simplifying Large MeSHOPs
        Visualising MeSHOPs
        Web Interface for Generating and Obtaining MeSHOPs
        Properties of Gene and Disease Annotation
        Re-deriving Gene Ontology Annotations with MeSHOPs
        Temporal Changes of MeSHOPs
        Intra-group MeSHOP Similarity
        Inter-group MeSHOP Similarity
    Discussion
        Technical Challenges
        Related Work
        Future Directions
    Conclusion
Chapter 3: Inferring Novel Gene-Disease Associations Using Medical Subject Heading Over-representation Profiles
    Synopsis
    Background
    Results
        Generation of MeSHOPs
        Quantitative Comparison of Gene and Disease MeSHOPs for Prediction of Future Co-Occurrence in Research Publications
        Gene and Disease Predictive Bibliometric Baselines
        MeSHOP Similarity Measures
        Predicting Association to Disease
        Comparative Assessment of Predictions with a Literature-based System: Candidate Genes for Alzheimer Disease
        Application to Diabetes Association Study
        Application to Pancreatic Cancer Study
    Discussion
        Comparison to Other MeSH-related Methods
        Future Directions
    Methods
        MeSHOP Generation for Genes and Diseases
        Inferring Novel Gene-Disease Association
        MEDLINE®/PubMed® Data
        Curated Gene-Disease Relationships
        Implementation
    Conclusions
Chapter 4: Separation of Literature Annotation Effects from Topic Similarity in Medical Subject Heading Over-representation Profiles (MeSHOPs) for Drugs and Diseases
    Synopsis
    Introduction
    Results
        Generation of Drug MeSH Over-representation Profiles (MeSHOPs)
        Predicting Drug-Disease Associations
        Annotation Bias Observed for Curated Drug-Disease Relationships
        Controlling for Annotation Bias
    Discussion and Related Work
    Future Work
    Methods
        Pharmacological Substances
        Constructing Drug and Disease MeSHOPs
        Predicting Drug-Disease Associations
        Correcting for Pre-Existing Literature Annotation
        Validating Drug-Disease Associations
        Implementation
    Conclusions
Chapter 5: Finding Similar Authors through Rapid Comparison of Shared Biomedical Research Themes
    Synopsis
    Introduction
    Results and Discussion
        Global Summary Bit-Vectors
        Specific MeSH Term Summaries
        Comparing Authors from Different Institutions
        Author Domain Comparison
    Methods
        Computing p-values
        MeSHOP Bit-Vectors
        Bit-Vector Comparison
        Most Significant Terms
        Validation Sets
        Implementation
    Future Directions
    Conclusion
Chapter 6: Conclusion
    Overview of the Thesis Work
    Highlights of MeSHOP Research
    Future Directions
Bibliography

List of Tables

Table 1-1. Increase in size of the PubMed/MEDLINE Baseline over time.
Table 1-2. List of Existing Algorithms for Candidate Disease Gene Selection and Data Sources Analysed by these Methods.
Table 2-1. Datasets Used in the Analysis with Details on Size and Relevant Contents.
Table 2-2. Analysis of Over-representation of the MeSH Term Alzheimer Disease in the 31 Articles Linked via GeneRIF to the Gene A2M (alpha-2-macroglobulin, Entrez Gene ID 2).
Table 3-1. Comparison of MeSHOP Results for Pancreatic Cancer Candidate Genes.
Table 3-2. Analysis of Over-representation of the MeSH Term Alzheimer Disease in the 31 Articles Linked via GeneRIF to the Gene A2M (Entrez Gene ID 2).
Table 3-3. Performance of Gene Characteristics at Predicting Association with Disease.
Table 3-4. Comparison of the Performance of Gene ID to Gene-related Scores in MEDLINE®/PubMed®.
Table 3-5. Performance Using GeneRIF as the Gene-Literature Data Source.
Table 3-6. Performance Using gene2pubmed as the Gene-Literature Data Source.
Table 3-7. Top 50 Alzheimer Disease Candidate Genes by MeSHOP Similarity.
Table 3-8. Explanation of the Scoring Functions Evaluated.
Table 3-9. Summary of MeSHOP Performance.
Table 3-10. Summary of Diabetes Loci Ranked by MeSHOP Similarity.
Table 3-11. Comparison of MeSHOP Results for Pancreatic Cancer Candidate Genes.
Table 3-12. Datasets Used in the Analysis with Details on Size and Relevant Contents.
Table 4-1. Performance of a Selection of Drug-Disease Similarity Scores.
Table 4-2. Table of Top-scoring Drug-Disease Relationships After Literature Correction.
Table 4-3. Most Prevalent Highly Associated Drugs.
Table 5-1. Performance of Author Similarity for Identifying Author Research Domain.
Table 5-2. Author Co-publication Counts.
Table 5-3. Top 20 MeSH Term Overlap Counts.
Table 5-4. Five Leading Journals from Genetics, Oncology and Psychiatry Selected for Validation.
Table 5-5. Precision and Recall of the Domain-specific Relationships from MeSH Profile Overlap.
Table 5-6. Performance of Author Similarity for Separating CSHL Speakers.

List of Figures

Figure 1-1. Schematic of the Research Project.
Figure 2-1. Workflow for Generating a MeSHOP.
Figure 2-2. Alternative Approaches for Generating MeSHOPs Depicted as Word Clouds.
Figure 2-3. Subset of the MeSH Tree for Alzheimer Disease.
Figure 2-4. Example of MeSHOPs for four different entities.
Figure 2-5. Distributions of Genes by Associated Literature References.
Figure 2-6. Distribution of Diseases by Number of Associated PubMed References.
Figure 2-7. p-values of MeSH Term Mapped Gene Ontology (GO) Human Gene Annotation.
Figure 2-8. MeSHOP for “Computational Biology”.
Figure 2-9. Change in Significance of Biomedical Terms for “Computational Biology” over Time.
Figure 2-10. MeSHOP for “Stem Cells”.
Figure 2-11. Clustering Vitamin MeSHOPs.
Figure 2-12. Signaling Pathway Gene Co-occurrence with Brain Disease Annotation.
Figure 3-1. Comparing Gene and Disease MeSHOPs.
Figure 3-2. Comparison of Performance of Gene Characteristics.
Figure 3-3. Comparison of Mean MeSHOP Performance Scores.
Figure 3-4. Comparing the Performance of Similarity Scores to Gene Characteristics.
Figure 3-5. Comparing the Performance of Similarity Scores.
Figure 3-6. Comparison of the Top 500 Gene Predictions for Alzheimer Disease from Génie and MeSHOP Similarity.
Figure 4-1. MeSHOP for Acetaminophen.
Figure 4-2. MeSHOP for Aniridia.
Figure 4-3. Distribution of Drug Annotation and Disease Annotation in the New Drug-Disease Associations of the CTD Validation Set.
Figure 4-4. The Degree of Disease Annotation Plotted against MeSHOP Comparison Score.
Figure 4-5. The Degree of Drug Annotation vs. MeSHOP Comparison Score.
Figure 4-6. Disease Annotation vs. Corrected MeSHOP Comparison Score.
Figure 4-7. Drug Annotation vs. Corrected MeSHOP Comparison Score.
Figure 5-1. Histogram of the Authors by the Count of their Publications in PubMed.
Figure 5-2. MeSH Terms vs. Average p-value.
Figure 5-3. Frequency Polygon of the p-values for Several MeSH Terms.
Figure 5-4. Heatmap of Co-publication Rates of Principal Investigators from Four Institutions.
Figure 5-5. Similarity Matrix for Principal Investigators from Four Institutions.
Figure 5-6. Comparing Author MeSHOPs Using their Top 20 MeSH.
Figure 5-7. Top 20 MeSHOP Term Overlaps Between Three Authors.
Figure 5-8. Network of CSHL Meeting Speakers Linked by JACCARD Score.
Figure 5-9. Network of CSHL Meeting Speakers Linked by Top 20 MeSH Term Overlap Score.
Figure 5-10. JACCARD Similarity Using the 1000-bit Profile Bit-vector Against the L2 distance Score using the Full MeSHOPs.
Figure 5-11. Distribution of PubMed Articles Associated to MeSH terms.

Acknowledgements

I wish to acknowledge the invaluable guidance of my supervisors, Francis Ouellette and Wyeth Wasserman. Thank you for always making time for my research, for coordinating this project across half a country or more, and for the genuine kindness and sincere care that have made this such a fulfilling endeavour.

I also wish to acknowledge the members of my thesis committee: Angela Brooks-Wilson, Jennifer Bryan, Raphael Gottardo and Kendall Ho. Thank you for the generous gift of your time and expertise to bring me your unique perspectives at each stage of my research.

I would also like to thank all the members of the Wasserman and Ouellette laboratories, as well as the students and professors of the Bioinformatics Training Program. Thank you for always sharing your wonder and your trials, for all your different forms of assistance, and for reminding me of all the different facets of science and perspectives on the wide world that surrounds us.

I thank my parents, Jean and Philip Cheung, for their constant support through this long trek over the last three decades. Thank you for teaching me joy in my achievements, for being forthright with the truth, and for always encouraging me to be a better citizen of the world.
I want to thank the rest of my family and my friends for all they have given me (knowledge and wisdom, smiles and laughter, sincerity and happiness) and for tripping the light fantastic with me throughout this entire process.

This work was supported by the Canadian Institutes of Health Research [to W.W.W.]; the Ontario Institute for Cancer Research through funding by the government of Ontario [to B.F.F.O.]; the Natural Sciences and Engineering Research Council of Canada [to W.W.W. and W.A.C.]; the Michael Smith Foundation for Health Research (MSFHR) [to W.W.W. and W.A.C.]; the National Institute of General Medical Sciences [R01GM084875 to W.W.W.]; and the Canadian Institutes of Health Research/MSFHR Strategic Training Program in Bioinformatics [to W.A.C.].

Chapter 1: Introduction

The entire process of developing this doctoral thesis was a journey. Initially, while exploring the literature for the genetics of Purkinje cell-specific expression, I recognised the need for some way to keep abreast of the latest and most relevant scientific literature. I wanted to filter out the unnecessary, while retaining the elements that I wished to discover. In the pursuit of an improved way of navigating scientific knowledge, I realised benefits far beyond my original aims. This thesis describes the motivation to pursue this area of research, the steps taken to design and validate the methodologies, the use of these methods in a scientific context and how these approaches enable new paths for scientific discovery.

Motivation

A fundamental challenge of research has always been the extraction of pertinent knowledge from a vast pool of resources, and the ability to successfully combine and utilise this knowledge to spur new directions of research.
As we enter the era where nearly all knowledge is digitised and electronically accessible, and physical geographical separation is superseded by instantaneous electronic access, the complexity of this task balloons exponentially with the availability of information. We stand at the limits of human capability in rationalising this deluge of information. On the other hand, technology has made equally impressive strides, providing the ability to coordinate entire assemblies of increasingly powerful computers, capable of storing and manipulating immense datasets. However, is this new technology serving our needs in a meaningful way? Are we getting the information we need, without missing anything important?

This thesis focuses on demonstrating that the MEDLINE®/PubMed® biomedical citation database can be exploited to satisfy two principal goals. The first goal is the computational understanding of the indexed knowledge of PubMed by extracting, in a quantitative manner, the most important and relevant topics relating to a biomedical entity of interest directly from the primary literature. The second goal builds on the results of the first, and asks whether we can predict previously unknown relationships by comparing our primary literature-derived profiles. While evaluating predictions made from the primary literature, we also use the research to measure and confirm pre-existing biases and trends in the direction of biomedical research. This introduction expands on the background for these goals, providing examples of the current state of the art and placing this work in the context of the existing field.

The common theme running through this thesis is using computational resources to analyse the extensive, annotated biomedical knowledge available in PubMed.
Computational resources allow access to this knowledge from anywhere reachable by cellular phone, and networked series of computers allow the entire immense body of knowledge to be readily analysed. Yet, for any research topic, the amount of knowledge only grows over time, as researchers only add to the body of knowledge (except in rare cases where results are retracted); PubMed averages over a thousand new citations every day. As well, all biomedical knowledge is intertwined: experiments rely on assays and tests, hypotheses are tested under different experimental conditions, and the observations allow us to understand biochemical processes and biological phenotypes. For example, the development of new imaging modalities directly allows researchers to non-invasively examine not only the structure but also the functioning of the brain, so that scientists can detect abnormalities in advance of the onset of neurological disorders. Scientific endeavour seeks new links and relationships, rather than exact replication of existing results. However, as this web of interconnected knowledge thickens and entangles, it becomes progressively more difficult to simply observe the most significant and unusual themes. On the other hand, modern computational techniques make it possible to analyse large data sets with the power of tens or hundreds of computers at once, a necessity as we move beyond small subsets of citations to examining over twenty million citations in PubMed. We therefore leverage these computational innovations through the application of bioinformatics methods. This gives an unbiased perspective on the totality of the literature for a topic, identifying not only the most important related biomedical topics but also potential novel avenues to guide further experimentation.
We present this introduction to outline the current methods, resources and challenges most pertinent to this perspective of informatics analysis of biomedical bibliographies.

Overview of the Thesis Dissertation

This thesis describes the development of novel methods for: (1) extracting the most relevant biomedical topics from a set of related research articles to profile an entity of interest; and (2) comparing profiles of biomedical entities to predict new associations. Chapter 1 provides an introduction and foundation to scientific databases and searches related to them. Chapter 2 describes the development and application of Medical Subject Heading Over-representation Profiles (MeSHOPs). Over-representation analysis quantifies the unexpectedness of medical terms, based on their annotation to the PubMed articles of an entity of interest. We generate MeSHOPs for several different classes of entities – human genes, diseases, pharmacological compounds and biomedical authors. Subsequent chapters involve comparing entities using their MeSHOPs to discover novel relationships through their MeSHOP similarity. More specifically, Chapter 2 focuses on the problem of extracting the most significant topics from a biomedical bibliography in the context of biomedical entities. We address the following problems:

  • Can the key topics related to a biomedical entity be automatically extracted using statistical methods over the entirety of the entity’s bibliography?
  • Can we filter the topics and highlight the most unexpected and interesting topics related to the entity, and present this information in a visual way?
  • Can we use the topics extracted to compare related entities and determine in what ways they are similar?

Our approach applies bioinformatics methods over the entirety of the PubMed database to comprehensively analyse any bibliography involving articles in PubMed.
We adapt statistical over-representation to extract a quantitative measure of association to all the topics in a given bibliography. We demonstrate that this can be adapted to show an overall degree of association to the indexed biomedical terms, and also show techniques to highlight the most specific and unusually relevant terms for a particular application. We demonstrate that this can be applied to different entities, as well as to other, more specific bibliographies such as the state of an entity at a particular point in time. We show that these profiles can be depicted as a graphic. To demonstrate the utility of the method developed here, we show that profiles allow us to naturally group related entities by similarity of their associated topics. Chapters 3 and 4 further examine the relationships between entities, and ways to discern this from their profiles. To accomplish this, we set out to address the following issues and questions:

  • Generating a testbed for validating results using new predictions
  • Discovering biases and trends in biomedical research using our technique
  • Predicting new relationships between entities by comparing their profiles

This section of the thesis aims to demonstrate that MeSHOPs are not only useful for highlighting and understanding the knowledge currently known about a particular biomedical entity, but can also be used to inform novel inferences that are hitherto not self-evident. To validate our predictions, we stored versions of all the data at yearly intervals since 2007, allowing us to generate profiles and predictions from that year and compare against new relationships that appear subsequent to that time point. In this regard, Chapter 3 focuses on gene-disease relationships while Chapter 4 adapts the same methodology to drug-disease relationships. Chapter 5 then applies profiles to look at the network of biomedical authors to identify closely related authors sharing biomedical research subject interests.
Our hypothesis in these sections is that comparison of MeSHOPs allows us to identify novel relationships via similarity of the profiles extracted from the primary literature of the entities being compared. We tested this hypothesis by generating profiles using an archived version of the databases, and evaluating the accuracy of our predictions against associations appearing in the literature after the archival date. Our methodology used MeSHOPs to generate profiles for entities, which were compared using a diverse set of distance metrics for the smaller gene, disease and drug datasets. We also demonstrated methods to approximate this comparison for extremely large datasets, and applied these methods to compare the many biomedical authors. Our validation was compared against a set of baseline gene and disease features, both to investigate previously observed biases in new biomedical research and to place our research results in context.

Review of Existing Literature

As this work describes the development of a novel method for the extraction and analysis of specified terms or categories in various scientific databases, it is first necessary to describe these datasets and existing methods of searching and data extraction. It is also necessary to describe why there is need for improvement: given the well-publicized accounts of ever-increasing scientific knowledge, it can already be intuitively understood that the analysis of biomedical literature presents an entire domain of computational and informatics problems. These problems of literature mining range from information retrieval to prediction and hypothesis generation (Jensen, Saric, & Bork, 2006). We focus in this review on the state of the art directly pertaining to extraction and inference from biomedical literature. We show in Figure 1-1 a schematic of the relevant related research.
As the input to the system, we review the sources for biomedical literature citations and annotations linked to these literature citations. We then look at existing methods of information extraction for summarising and integrating the important concepts from sets of citations. Finally, we examine methods that use literature to search for novel associations.

Biomedical Citation Database

Central to the thesis is a need for a source of biomedical literature, to encompass the primary literature for the biomedical entities of interest (see Figure 1-1A). The database should be comprehensive, to maximise coverage of the primary literature articles relevant to biomedical entities, such as genes, diseases, drugs and authors. The MEDLINE®/PubMed® (hereafter referred to as PubMed) database stores over 21 million citations, growing at a modern annual rate exceeding six hundred thousand articles (Sayers et al., 2009) (see Table 1-1). PubMed not only provides a centralised resource for scientists in search of primary literature on biomedical subjects; its long-term indexing efforts have also made the entirety of this freely accessible repository uniquely computationally accessible. Subject area experts at the National Library of Medicine (NLM) have linked the database records to the Medical Subject Headings (MeSH) controlled vocabulary, providing a hand-curated and maintained set of annotations (Lipscomb, 2000). The NLM grants licenses for the use of PubMed to both U.S. and non-U.S. individuals and organisations and currently does not impose charges of any kind (License Agreement for NLM Data, http://www.nlm.nih.gov/databases/license/license.pdf). Licensees are allowed to download 98% of the records in the PubMed database in XML format. The NLM indexes nearly all the articles with MeSH. Many databases include links from their entities to the PubMed records, such as the Online Mendelian Inheritance in Man (OMIM) and Entrez Gene.
The Excerpta Medica Database (Embase) is a biomedical citation database owned by the Elsevier publishing company, encompassing the PubMed records from the NLM as well as an additional 5 million records not covered by PubMed. Embase entries are indexed using Emtree, Elsevier’s Life Sciences Thesaurus. All PubMed entries are also merged into Embase by the Embase curators. Embase can only be accessed by users who have purchased a subscription.

SciVerse Scopus and Web of Science are also citation indexes covering the broader domain of science, and both describe over 40 million articles. One of their primary goals is providing links from each citation record to other citations in the bibliography for the record, information unavailable through PubMed. Scopus is also owned by Elsevier and includes all the content of Embase, including its index terms, but does not incorporate Embase’s search capabilities. Web of Science, on the other hand, includes in its article entries only general subject categories, keywords from the title, and “Keywords Plus”, keywords frequently cited by the article. Both of these are also commercial products requiring a subscription fee, with no standard procedure to download the data. There are also specialised databases for specific domains. For example, PsycINFO is an abstracting and indexing database providing more than 3 million records covering behavioural and mental health, indexed by the Thesaurus of Psychological Index Terms. The Cumulative Index to Nursing and Allied Health Literature (CINAHL) provides over 2.2 million records for nursing and allied health, with subject headings originally based on MeSH. Searching these databases can yield different and complementary results for equivalent queries (Wilkins, Gillies, & Davies, 2005).
Therefore, applying the methods developed in this thesis to other sources of biomedical citations for entities can yield an alternative set of primary literature and result in a different perspective on the same biomedical entities. PubMed remains the de facto standard, as it is a comprehensive, globally and freely accessible database. Moreover, it is the only database in its class that also allows the complete download of its database at no charge. PubMed has been embraced by the biomedical community, and many entities in other databases are linked to PubMed articles via PubMed identifiers (PMIDs), from the diseases in OMIM to the genes in Entrez Gene.

Details of MEDLINE®/PubMed®

PubMed is a searchable citation database at the National Center for Biotechnology Information (NCBI), and now serves as a global portal to a curated index of biomedical literature. The primary subset of the bibliographical citation information in PubMed is composed of the NLM MEDLINE database. Domain experts at the National Library of Medicine index the incoming citations for MEDLINE by their most relevant biomedical topic terms. The scope of PubMed is biomedicine and health, with increased coverage of related life sciences beginning in 2000. PubMed focuses primarily on scholarly journals, but also includes a small number of relevant articles from newspapers, magazines and newsletters (e.g. Time Magazine). There are also legacy articles from OLDMEDLINE and other initiatives that experimented with indexing other scientific literature. As well, unlike Scopus and Web of Science, advance electronic editions of articles are also added to PubMed. All articles are initially added to PubMed, and once the published version is indexed, the citation is moved to MEDLINE. PubMed records citation information including title, authors and source publication of the articles. Articles indexed for MEDLINE will also have index terms and supplementary concepts.
Additionally, PubMed Central identifiers are provided for articles in PubMed Central, and the abstract is stored when provided by the source publication. Articles have been indexed manually since 1879 by the National Library of Medicine as the Index Medicus. The Index Medicus was originally a supplement to the Index-Catalogue of the Library of the Surgeon-General's Office (Greenberg & Gallagher, 2009), the bibliography of the world’s largest medical library since 1895. MEDLARS (Medical Literature Analysis and Retrieval System) is a computerized bibliographic retrieval system for biomedical literature, originally unveiled in 1964 both to facilitate the publication of the bibliography and to enable rapid computer-oriented bibliography retrieval (Dee, 2007). MEDLINE (MEDLARS Online) is a bibliographical database maintained by the National Library of Medicine with free access via PubMed since 1997, eventually supplanting the Index Medicus at the end of 2004. Approximately 5000 journals are now indexed, with over 21 million citations in its database. Entries from the Index Medicus are included from 1964 as MEDLINE. As well, OLDMEDLINE entries, currently covering over 2 million citations from 1946 through 1965, are being converted and added, although they use outdated subject headings and do not include abstracts. Towards the middle of November of each year, the National Library of Medicine performs what is essentially annual maintenance on PubMed, updating records received throughout the year and creating a fresh Baseline. Throughout the year, approximately weekly updates are applied to the baseline. The MEDLINE®/PubMed® Baseline Repository stores the yearly baselines for each year since 2002. The journal citations of MEDLINE®/PubMed® are available to lease from NLM at no charge for download in XML format. The current (2012) baseline is 89 GB of data.
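To illustrate what the XML distribution makes available for analysis, the following is a minimal Python sketch that pulls the MeSH annotations out of a single citation record. It assumes the element names of the NLM citation XML (MedlineCitation, PMID, MeshHeadingList, MeshHeading, DescriptorName, QualifierName). The record itself is a hand-written toy fragment with a placeholder PMID, not a real citation, and the function name extract_mesh is hypothetical rather than part of any NLM software.

```python
import xml.etree.ElementTree as ET

# Hand-written toy record for illustration only; the PMID is a placeholder.
TOY_RECORD = """
<MedlineCitation>
  <PMID>0</PMID>
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName MajorTopicYN="N">Liver</DescriptorName>
      <QualifierName MajorTopicYN="N">metabolism</QualifierName>
      <QualifierName MajorTopicYN="N">pathology</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName MajorTopicYN="Y">Alzheimer Disease</DescriptorName>
    </MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""


def extract_mesh(citation_xml):
    """Return (pmid, [(descriptor, [qualifiers])]) for one citation record."""
    root = ET.fromstring(citation_xml.strip())
    pmid = root.findtext("PMID")
    headings = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        # A heading may carry zero or more qualifiers (subheadings).
        qualifiers = [q.text for q in heading.findall("QualifierName")]
        headings.append((descriptor, qualifiers))
    return pmid, headings


pmid, headings = extract_mesh(TOY_RECORD)
for descriptor, qualifiers in headings:
    print(pmid, descriptor, qualifiers)
```

For files on the scale of a full baseline, a streaming parser such as ET.iterparse would likely be preferable to loading whole documents into memory; that choice is a suggestion, not a description of the pipeline used in this thesis.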
The NLM licenses the data to entities all around the world, which allows direct access to the data for analysis; however, their software is not made available. The NCBI provides programmatic access to their search software through E-utilities. Alternatively, a copy of MEDLINE®/PubMed® can be downloaded for personal use, and small amounts of data can be retrieved via PubMed without a formal license. Therefore, PubMed permits access to their citations both through their own software and through access to the entirety of the citation data for direct analysis. While extremely permissive, especially in comparison to other services which do not allow direct access to the entirety of the citation data, the lease remains subject to terms imposed by the NLM.

Biomedical Subject Indexes

In tandem with the database for biomedical citations is the need to identify the pertinent topics for each citation (see Figure 1-1B). There is substantial work on biomedical vocabularies, which range from straightforward lists of terms to formally correct ontologies with precisely defined semantic relationships. Biomedical vocabularies are used to annotate and index data, to standardise formats to facilitate data exchange, and to enable knowledge discovery (O Bodenreider, 2008), something that is becoming ever more difficult as the pool of primary literature continues to grow (Howe et al., 2010). MEDLINE®/PubMed® and Embase each index their citations using their own controlled vocabulary of subject terms. Vocabularies also exist for use in specific subject domains, such as annotation of medical causes of death, or for biomedical entities such as genes. All MEDLINE entries in the PubMed citation database are annotated by NLM curators with Medical Subject Headings.
While the MeSH vocabulary has been shown to have significant breadth and depth in coverage of concepts occurring in biomedical research articles (Yao, Divoli, Mayzus, Evans, & Rzhetsky, 2011), we also provide here an overview of the other most important medically relevant vocabularies to place our choice in context. In all vocabularies, each term represents a distinct concept, which may have many different spellings and may be referred to by different phrases in the literature. Controlled vocabularies often also provide a hierarchical tree structure, where terms are arranged in order of increasing specificity. Contrasting this are ontologies, where relationships between terms are more specifically defined. In ontologies, each relationship is explicitly categorised – the “is a subtype of” relationship is different from the “is a part of” relationship. For example, the mitochondrion is a part of the cell, but is a subtype of organelle. By explicitly codifying the relationships, much more complex queries can be computed; however, this also involves a greater overhead for those developing and maintaining the vocabulary as well as those curating the annotations.

Many vocabularies are designed for a specific purpose. A number of vocabularies are focused on annotating human disease. For example, the International Classification of Diseases (ICD) tools ICD-9 and ICD-10 were developed by the World Health Organization (WHO) and are employed by WHO member states to report mortality and morbidity statistics. SNOMED CT (Cornet & de Keizer, 2008) was developed by the College of American Pathologists with a focus on recording clinical data; it was acquired by the International Health Terminology Standards Development Organisation and is used by the Canada Health Infoway in its electronic health record standards.
Disease Ontology (Schriml et al., 2012) is being developed to provide an open-source ontology for human disease designed for semantic computation, while mapping to other disease vocabularies such as ICD-9 and SNOMED CT as well as more general medical terminologies such as MeSH. Another important domain-specific vocabulary is Gene Ontology (GO). GO is a formal ontology, and the GO Consortium is a set of organism databases, protein databases, and biological research communities, all actively involved in developing and applying GO (Ashburner et al., 2000). The GO Consortium now coordinates an effort to annotate twelve reference genomes, and when applicable, PMIDs are one of the primary reference sources to support annotations. GO is also important as the biological vocabulary that has been the focus of annotation over-representation analysis (Khatri & Drăghici, 2005). However, while it covers gene-related concepts such as molecular function, biological processes and cellular compartments, its scope does not encompass other medically relevant topics such as diseases. To complement our use of PubMed as our source for biomedical literature, we choose to focus our results by using MeSH terms as the basis for our profiles. MeSH allows us to use a single biomedical vocabulary for all the entities that will be studied in this thesis – drugs, genes, diseases and biomedical authors. MeSH has been shown to have high inter-annotator consistency (Funk & Reid, 1983), and was identified as one of the best vocabularies for predicting gene-disease associations using the TXTGate biomedical literature mining system (Yu, van Vooren, Tranchevent, de Moor, & Moreau, 2008). However, while we focus our methods here on the biomedical citations provided by PubMed, our methodology could be adapted to other biomedical citation databases to provide a complementary perspective or a more specific use case.
In such applications, it would be natural to also use a matching biomedical terminology – for example, when mining the Embase database of citations, the proprietary EMTREE thesaurus would provide a ready source of relevant annotations.

Medical Subject Headings

Our work focuses on MeSH, a controlled vocabulary thesaurus of over 26,000 descriptors as of 2011 (http://www.nlm.nih.gov/pubs/factsheets/mesh.html). Each of the articles indexed by MEDLINE is annotated with MeSH terms for information retrieval. These terms are arranged in a general hierarchical structure, ranging from broad terms such as “Anatomy” and “Mental Disorders” to more specific terms such as “Alzheimer Disease”. Additionally, more than 199,000 headings – Supplemental Concept Records (formerly Supplemental Chemical Records) – are annotated and collected in a separate thesaurus (http://www.nlm.nih.gov/mesh/intro_preface.html). MeSH is updated continually by staff subject specialists. Based on the Subject Heading Authority List, the Medical Subject Headings list appeared in 1960 in tandem with the inception of the Index Medicus, envisioned as a single subject authority list for all medical periodicals and books (Lipscomb, 2000). Since the 1960s, MeSH has been used by the National Library of Medicine for bibliographies and for cataloguing both books and periodical articles, and now all of the over 18 million modern MEDLINE®/PubMed® article references are indexed using MeSH, making this one of the most comprehensive, freely accessible biomedical bibliographical resources available. The MeSH terms are organised into 16 main categories, and then further divided into subcategories. Within the subcategories, MeSH terms are arranged hierarchically from most general to most specific. Each MeSH term occurs at least once in these branching MeSH trees, but may also appear in additional places as appropriate (http://www.nlm.nih.gov/mesh/introduction.html).
For example, the term “Alzheimer Disease” appears under both the term “Dementia” and the term “Tauopathies” in the “Diseases” Category. “Dementia” and “Alzheimer Disease” also appear in the “Psychiatry and Psychology” Category. MeSH terms are used by both periodical indexers and book cataloguers to represent the key topics in each citation entry in MEDLINE. Several basic principles guide curators when annotating MEDLINE entries. When a MEDLINE entry covers several different topics, each topic is represented by the appropriate MeSH terms. As well, for more complex topics, several MeSH terms can be coordinated to accurately represent the topic. Finally, when several headings could describe a topic for the MEDLINE entry, the most specific heading is chosen.

As the index to all the records in MEDLINE, MeSH terms enable services such as citation retrieval by PubMed to perform more accurately by acting as a subject thesaurus. MeSH terms allow matching to a concept independent of the specific terminology used in the original texts. They also allow for inclusive retrieval, by using the tree structure to allow general searches to include articles annotated with more specific terms (http://www.nlm.nih.gov/mesh/catpractices.html). A set of natural categories frequently discussed in relation to a term, called subheadings, can be attached to a MeSH term. If several subheadings are appropriate, separate entries of the MeSH term associated with each subheading are added to the record. For example, an article that covers liver metabolism and pathology is indexed with both “Liver/metabolism” and “Liver/pathology”. Of the over eighty possible subheadings, only relevant subheadings are allowed for each MeSH term, and certain combinations are invalid because a corresponding MeSH term exists – for example, “Arm/Injuries” is not possible because the concept “Arm Injuries” is already in MeSH. Subheadings are also grouped into subheading trees arranged by specificity.
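The inclusive retrieval through the tree structure described above can be sketched as prefix matching on tree positions: a search for a general term also returns articles annotated with any term located beneath it. The following Python sketch is illustrative only. The dotted codes are invented placeholders and not the real NLM tree numbers, the article identifiers are toy values, and the names TREE, ARTICLES, descendants and inclusive_search are hypothetical.

```python
# Toy tree positions: each term maps to one or more dotted tree codes.
# These codes are invented placeholders, NOT the real NLM tree numbers.
TREE = {
    "Dementia": ["X01.100", "Y02.300"],
    "Tauopathies": ["X01.200"],
    "Alzheimer Disease": ["X01.100.050", "X01.200.010", "Y02.300.050"],
    "Liver": ["Z09.400"],
}

# Toy article annotations keyed by placeholder article ids.
ARTICLES = {
    "article-1": {"Alzheimer Disease"},
    "article-2": {"Dementia"},
    "article-3": {"Liver"},
}


def descendants(term):
    """Return the term plus every term located beneath any of its positions."""
    prefixes = TREE[term]
    found = {term}
    for other, positions in TREE.items():
        for pos in positions:
            # A position lies beneath a prefix if it extends it by a dot.
            if any(pos.startswith(prefix + ".") for prefix in prefixes):
                found.add(other)
    return found


def inclusive_search(term):
    """Return ids of articles annotated with the term or a more specific term."""
    wanted = descendants(term)
    return sorted(a for a, terms in ARTICLES.items() if terms & wanted)


print(inclusive_search("Dementia"))
print(inclusive_search("Tauopathies"))
```

Because a term may occupy several positions, the sketch checks every position of every term; this mirrors the way “Alzheimer Disease” is reachable from both “Dementia” and “Tauopathies”.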
The Supplementary Concept Records are a supplementary vocabulary that complements the “Chemicals and Drugs” Category. While the terms in the main MeSH Categories are updated annually, Supplementary Concept Records are updated daily. Additionally, Supplementary Concepts are a controlled vocabulary; while they include a link to related MeSH terms, they are not formally included in the MeSH hierarchy and are not associated with MeSH subheadings. Starting in 1996, the pharmacologic action terms were coordinated in conjunction with a MeSH chemical term (either a Main Heading or a Supplementary Concept Record) to index articles describing the action of a drug. However, this indexing strategy does not explicitly link the pharmacologic action and the drug involved. Therefore, in 2003, the “Pharmacological Action” category was created, which includes pharmacologic actions and the substances that are shown in the literature to use this action. To be included in this list, a drug must have a confirmed pharmacologic action, including at least 20 papers supporting this action.

Automated Entity Recognition

Expert-curated subject annotations remain the gold standard to which all other methods are compared. However, this method is the most resource intensive, requiring both expertise and time from human curators. Automatic entity recognition is another technique, where abstracts and/or the full text of articles are analysed to identify the topics mentioned in an article. Automated efforts have mapped other vocabularies to PubMed articles, such as GoPubMed for Gene Ontology (Doms & Schroeder, 2005). The reverse problem, attaching MeSH terms to entities, has also been explored for UniProt proteins by mapping MeSH terms to UniProt comment lines via text analysis (Mottaz, Yip, Ruch, & Veuthey, 2008), and for OMIM diseases through the analysis of PubMed bibliographies and PubMed searches (Nakazato, Bono, Matsuda, & Takagi, 2009).
As a general resource, the Unified Medical Language System (UMLS) (Olivier Bodenreider, 2004), developed since 1986, is a medical metathesaurus that integrates and bridges over 100 source vocabularies. In addition to the Metathesaurus of terms from multiple source vocabularies, the UMLS also incorporates a Semantic Network. Associated with the UMLS initiative is the MetaMap tool (Aronson & Lang, 2010), which maps biomedical text to concepts in the UMLS Metathesaurus. As well, UMLS has been mapped to other terminologies not currently part of the thesaurus, such as the proprietary EMTREE thesaurus that indexes Embase (Taboada, Lalín, & Martínez, 2009). Automated efforts still cannot replace manual curation, as seen in efforts comparing automatically annotated MeSH terms to manually annotated terms (Trieschnigg et al., 2009). The authors find that retrieval systems that use manual annotations are more effective at document retrieval tasks than systems using annotations derived from more automated methods. Automated techniques, however, provide effective mechanisms to assist curators in the task of assigning subject terms, and have been shown to have recall that is potentially superior to non-expert human curators (Ruau, Mbagwu, Dudley, Krishnan, & Butte, 2011). Furthermore, computational methods can provide a complementary source of annotations when a large-scale manual curation effort is not possible.

Knowledge Discovery

Linking Concepts through Literature

Swanson and Smalheiser suggested, originally in 1986, that two sets of literature could, when brought together, reveal previously “undiscovered public knowledge” that is not apparent when the two literature sets are analysed in isolation (Don R Swanson & Smalheiser, 1996). Specifically, they propose the ABC model – discovering the relationship between an entity A and a second entity C through a common intermediate entity B.
For example, they propose that an environmental factor A can be discovered to be linked to a disease C, through a certain physiological condition B relevant to both A and C. They consider such relationships to be of particular interest when the sets of literature are “mutually isolated and noninteractive” – articles involving A and C neither cite each other nor are co-cited. This concept is centrally important for the thesis research that will follow. Smalheiser (2012) describes this methodology as literature-based discovery that emulates “intuitive common-sense” strategies employed by domain scientists. Profiles can generate hypotheses in a method inspired by Swanson (Srinivasan, 2004), where the profile for a topic A contains the MeSH term B, which is discovered to link to the profile of another MeSH term C. Alternatively, Srinivasan compares the profiles associated with the two MeSH terms A and C, investigating whether they have any linking MeSH terms B that are shared between their profiles. The work presented in this thesis takes this methodology as inspiration and aims to extend it through automated computational analysis, evaluating all potential B terms quantitatively and simultaneously to assess the strength of the link between A and C.

Annotation Over-representation

The first part of our problem is to build a profile of the biomedical topics most related to an entity of interest from its primary literature (see Figure 1-1C). Moving beyond the naïve extraction of all terms appearing in the articles, we focus on methods to quantitatively organise these topics based on their importance to the entity they describe. This problem maps to the problem of identifying over-represented terms in a set. Over-representation has been extensively studied in bioinformatics in the context of GO annotation analysis (Khatri & Drăghici, 2005).
In this classic problem, determining whether an individual annotation is present more frequently than expected in a group of entity annotations can be performed using well-studied statistical methods – the binomial test, the hypergeometric distribution and Fisher’s Exact test, the chi-squared test – implemented in tools such as the Database for Annotation, Visualization and Integrated Discovery (DAVID) (Dennis et al., 2003). These methods classically compare the rate of occurrence of the annotation in the set studied against the background occurrence rate of the annotation. Other variations of this problem look at the over-representation of child terms in relation to a parent term. For example, in the case of Gene Ontology, there is assigned to each term a “true” parent in the hierarchy, which has been used for this analysis (Grossmann, Bauer, Robinson, & Vingron, 2007). In addition to Gene Ontology, over-representation has also been applied in the genomic context to evaluate the over-representation of transcription factor binding sites in a set of sequences, compared to a background, using Fisher’s Exact Test and the normal approximation of the binomial test (Ho Sui, Fulton, Arenillas, Kwon, & Wasserman, 2007), and the same analysis has been applied to the MeSH terms for a set of genes (Kumar, 2011). There are existing methods of computing MeSH term associations. The most basic method is the naïve binary presence or absence of an annotation in the literature (Djebbari, Karamycheva, Howe, & Quackenbush, 2005; Indra Neil Sarkar & Agrawal, 2006). In addition, methods include normalised term frequency * inverse document frequency (TF*IDF) adapted from text mining (Srinivasan, 2004), information gain from information theory (Nakazato et al., 2007), and other ad-hoc frequency-based measures (Indra Neil Sarkar, Schenk, Miller, & Norton, 2009). These examples demonstrate that many alternative methods exist to measure the degree of association.
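The classical comparison of a term's rate in the studied set against its background rate can be illustrated with the upper tail of the hypergeometric distribution, which gives the same probability as a one-sided Fisher's Exact test on the corresponding 2×2 table. The Python sketch below is a minimal illustration under stated assumptions rather than the implementation used in this thesis: the function names are hypothetical, the bibliography is assumed to be a subset of the background, and the counts are invented toy values.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided p-value P(X >= k) under the hypergeometric distribution.

    k: articles in the entity's bibliography annotated with the term
    n: size of the entity's bibliography
    K: articles in the background annotated with the term
    N: size of the background (assumed to contain the bibliography)
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability of observing k, k+1, ..., upper annotated articles.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def build_profile(term_counts, n, background_counts, N):
    """Map each term seen in the bibliography to its enrichment p-value."""
    return {
        term: enrichment_p_value(k, n, background_counts[term], N)
        for term, k in term_counts.items()
    }


# Invented toy counts: a 20-article bibliography in a 1000-article background.
background = {"Alzheimer Disease": 50, "Humans": 800}
observed = {"Alzheimer Disease": 5, "Humans": 16}
profile = build_profile(observed, 20, background, 1000)
for term, p in sorted(profile.items(), key=lambda item: item[1]):
    print(term, p)
```

In this toy example, a term that is ubiquitous in the background receives a much larger p-value than a rarer term seen at several times its background rate, which is the behaviour that makes the resulting profile informative.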
We consider in this work Fisher’s Exact test, which remains a widely used statistical test for annotation over-representation, resulting in directly interpretable p-values in a well-understood probabilistic model. MeSH term association statistics through hypergeometric p-values have previously been studied in the context of genes (Jani, Argraves, Barth, & Argraves, 2010); however, this preliminary analysis is limited to the MeSH categories and first-level MeSH sub-categories. Over-representation analysis generates quantitative profiles, succinctly distilling the annotations of the primary literature for each entity analysed. In this thesis, we investigate profiles generated for the gene, disease and drug entities. By comparing profiles for several different entities, we predict previously unknown connections (see Figure 1-1D).

Gene-Disease Prediction

One classic problem at the intersection of bioinformatics and medical informatics is the prediction of novel gene-disease relationships. These two classes of entities are both well-suited for advanced literature analysis. Diseases form an entire category of MeSH terms, and are therefore well annotated in the PubMed literature. The NCBI Entrez Gene annotation resource provides curated literature citations associated with genes.

Most of the existing methods for the computational prediction of linkages between genes and disease take as input a preliminary list of candidate genes (e.g. genes in a genomic region linked in a genetic study to a disease), and return as output either a reduced or a ranked list. The underlying approaches differ substantively between methods, but all examine characteristics of the genes to identify which genes have the greatest similarity to a disease. Examples of characteristics used in the methods include numerical features derived from the raw sequence of genes and/or encoded proteins, existing annotations of proteins and genes, and abstracts or articles directly referring to the gene.
The current methods focus on using properties from a representative set of genes to identify similar genes from the candidate set. Here I will present an overview of nine recent methods used to link genes to disease (See Table 1-2). One method for identifying disease-related genes involved clustering the diseases in OMIM(Freudenberg & Propping, 2002), rather than the genes themselves, using features such as tissue, age of onset, primary etiology, episodic occurrence and mode of inheritance. A measure of similarity between any two diseases is calculated based on weighted contributions of each of these indices. Once the clusters are determined (using a strategy that involves manual thresholding by a human expert), the candidate genes are compared to the disease genes underlying the diseases in each cluster using the GO annotations. For a candidate gene in a disease cluster, each GO term is considered. If the GO term does not match the candidate gene, the ratio is 0. Otherwise, the ratio of the occurrences of the GO term in the cluster and the occurrences of the GO term in all disease genes is computed. The score for a candidate gene in a disease cluster is then the average of all the GO term ratios for that gene. This score is then downscaled by the number of genes in the cluster. Validation was assessed using leave-one-out cross-validation. They observed that predictions were most powerful in cases where disease phenotype and gene functions are clearly similar among members of a disease family. Syndromal disease phenotypes proved more difficult, potentially due to only partly understood causal mechanisms or weakness in the ability to accurately index complex phenotypes. Rather than ranking candidates, an alternative approach is to restrict a candidate gene set to those genes which meet some of a set of specific properties. 
GeneSeeker (Driel et al., 2005) can find genes within a chromosomal location that are localized in particular tissues, by looking at human and mouse expression data. Another method of associating disease genes to anatomical locations (Nicki Tiffin et al., 2005) performed text mining of PubMed abstracts to associate eVOC anatomical ontology terms to gene names.

Machine learning approaches can be used when a representative set of disease genes is available to use as training data. In DGP (López-Bigas & Ouzounis, 2004), a decision tree classification approach is used to find features common to disease genes based on a training set composed of sample disease and control proteins. Features were protein length, BLASTP ratios (conservation score) between a protein and its highest scoring homologue within taxonomic groups (representing phylogenetic conservation and extent) and the conservation score with the closest paralogue. The study indicates that, on average, hereditary disease genes (genes taken from OMIM) in comparison to randomly selected genes are longer, more conserved, phylogenetically extended and without close paralogues.

PROSPECTR (Adie, Adams, Evans, Porteous, & Pickard, 2005) uses a wide variety of features, including the length of the gene, the length of its coding sequence, the length of its cDNA, length of the protein, GC content and percentage protein identity with its nearest homologue in various species (mouse, worm, fly). The investigators used an alternating decision tree, taking genes from OMIM and comparing against genes not found in OMIM. They also generated two independent test sets – one using genes from the Human Gene Mutation Database with randomly selected control genes, and another set of 54 genes not in OMIM, again with a set of randomly selected control genes. They show the ability to enrich candidate disease gene lists by training on sequence-based features.
POCUS (Turner, Clutterbuck, & Semple, 2003) takes another machine learning approach, using a selected training set of genes linked to the target disease. POCUS identifies common features between all the training genes – InterPro domains, GO annotations, similar expression profile – and assesses the probability that such common features would be shared by chance. This method depends on a carefully selected training set of genes, and focuses on the likelihood of these genes all sharing common, disease-related properties, in contrast to methods that focus on over-representation of properties among the training genes.

G2D (Perez-Iratxeta, Bork, & Andrade, 2002; Perez-Iratxeta, Wjst, Bork, & Andrade, 2005) links genes from a specified genomic locus to diseases based on PubMed MeSH disease and chemical term annotation and RefSeq GO annotations. MeSH disease terms are initially linked to MeSH chemical terms via co-occurring annotation of PubMed articles. Similarly, RefSeq GO annotations were linked to the MeSH chemical terms via the PubMed references in the GO annotations. Scores were generated for these pair-wise associations as the ratio of the cardinality of the intersection to that of the union. The score for the combined disease-chemical-gene relation is defined as the product of the two pair-wise relations, and the score for a disease-gene relation is simply the maximum of all possible scores. The most recent update (Perez-Iratxeta, Bork, & Andrade-Navarro, 2007) includes additional methods of inferring disease-gene associations, involving the user providing genes from other genomic regions related to the disease. The first new method uses a set of example genes involved or suspected to be involved in the disease, and looks for GO annotation similarity based on Resnik scores (Resnik, 1977).
The second method added takes a second chromosomal region of interest, and considers protein-protein interactions (provided by the STRING database) linking the genes in the input region of interest and the second region. G2D uses the MeSH annotations of PubMed articles and integrates these with GO annotations, and additionally provides alternative methods to link candidate genes to an existing set of genes and to another implicated region; however, it treats these methods individually.

Endeavor (Aerts et al., 2006) aims to create an extendible system for prioritizing disease genes using heterogeneous data sources. The input to the system is a training set of genes, and comparisons are made against all genes in the genome. For attribute-based data (GO annotations, EST expression, InterPro domains and KEGG pathways), over-represented annotations were each given p-values. Genes are ranked using Fisher's omnibus meta-analysis to generate a new p-value from a chi-square distribution. For vector-based data (literature frequency profile of GO terms, microarray expression, human-mouse conserved promoter regions scored by TRANSFAC profiles), the feature vectors for each gene were compared against a trained average vector using Pearson correlation. BLAST similarity, BIND interaction partner overlap and a genetic algorithm-derived transcription factor binding site model are used to score and rank the genes. The ranks from the methods are combined using order statistics. The performance of the system was evaluated against monogenic diseases (extracted from OMIM), polygenic diseases (six genes recently determined to be involved in polygenic disease) and also for functional role in regulatory pathways (by looking for differential mRNA expression via real-time quantitative PCR). They performed functional validation in zebrafish. DiGeorge syndrome (DGS) candidate genes were identified from a training set of genes causing DGS and DGS-like symptoms.
This resulted in the prioritisation of TBX1, a known DGS-related gene, and YPEL1, which yielded DGS-like defects when expression was knocked down in vivo. Endeavor combines the results of multiple analysis methods, and represents one of the most popular tools for gene-disease association analysis.

More recently, CAESAR (Gaulton, Mohlke, & Vision, 2007) takes as input a text about a disease and analyses it for the presence of ontology terms and gene names. Ontology terms are used to rank candidate genes based on matching identified phenotypes when mutated (e.g. mammalian phenotypes in the mouse genome database), or showing expression in the tissues mentioned (e.g. eVOC anatomy terms in the UniProt database). Identified genes are used to rank candidate genes based on the number of protein-protein interactions, shared common pathways in the Kyoto Encyclopedia of Genes and Genomes, or conserved functional domains in the InterPro database. The gene ranks are then integrated using four scoring functions: sum, mean, maximum and a transformed score that considers both the rank of a gene for each data source and the number of genes returned by that data source. CAESAR mimics the training of an expert through its direct use of the primary literature.

To compare and contrast the results from multiple techniques, a large integrative analysis with most of the methods was performed to predict genes potentially linked to diabetes and/or obesity (N Tiffin et al., 2006). In addition to directly comparing and contrasting the results of diverse methods, the study demonstrates the utility of meta-analysis of the combined results. Diversity in methods to analyse and predict candidates allows minimisation of bias from any single data source, and so the MeSHOP analysis discussed here brings an alternative viewpoint as well as methods to identify and compensate for literature bias.
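The Fisher's omnibus meta-analysis mentioned above for Endeavor combines several p-values into one through a chi-square distribution. The sketch below shows the textbook form of that combination only; it assumes independent, strictly positive p-values, the function name `fisher_omnibus` is hypothetical, and it is not a reproduction of Endeavor's own code.

```python
from math import exp, log


def fisher_omnibus(p_values):
    """Combine independent p-values with Fisher's omnibus method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis (k = number of
    p-values). For even degrees of freedom the upper tail has the closed
    form exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # (X/2)^i / i!
        total += term
    return exp(-half_x) * total


# Hypothetical p-values for one gene from two attribute-based data sources.
combined = fisher_omnibus([0.05, 0.05])
print(f"combined p-value: {combined:.4f}")
```

With a single input the method returns that p-value unchanged, and two moderately small p-values combine into a smaller one, which is the behaviour a ranking step relies on.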
The field continues to progress, with the update to G2D and several of the above-mentioned methods, including CAESAR and Endeavor, emerging since the review study. This sampling of the various prediction algorithms currently available provides a taste of the diversity of methods available to be applied to predicting novel associations; however, in practice researchers have limited time and resources to dedicate towards generating and analysing predictions. There remains substantial work to be done before in silico predictions can supplant experimental analysis; therefore, any predictions need to be supplemented and validated. From a practical standpoint, a tool such as Endeavor, integrating many diverse data sources with state-of-the-art algorithms for its predictions, provides a convenient single source for an ensemble view of the prediction landscape.

Tools like Endeavor provide a unified framework for predicting novel gene-disease relationships by integrating several data sources in a quantitative manner, but these methods rely on qualitatively determined scoring mechanisms. Our focus in this thesis, rather than integrating multiple methods, is to identify and specifically focus on the literature aspect of the prediction of interactions. From a research standpoint, what has been lacking in previous analyses is a focus on the effectiveness of biomedical literature knowledge alone, to quantify its predictive ability as well as identify and understand biases in the biomedical research.

Drug-Disease Prediction

Pharmaceutical compounds are amongst the most heavily studied chemicals. As part of the supplemental MeSH vocabulary, PubMed entries are annotated for the chemical compounds that are central to the studies.
We focus here on investigating the value of biomedical annotation of compounds with confirmed pharmacologic action, building on previous work exploring general trends in the medical topics in drug research through MeSH and MEDLINE®/PubMed® bibliometric data(Agarwal & Searls, 2009). The identification of new indications for existing pharmacologic compounds is increasingly important. As the development of novel drugs becomes more costly and time-consuming, the repurposing of previously approved drugs becomes more attractive. Existing drugs have the advantage of already-known toxicity and contraindications, resulting in a shorter approval process. Similar to turning off-label prescriptions of doctors for existing compounds into new applications of drugs, we examine here exploiting known properties of drugs to infer potential applications towards other diseases. While existing databases for drugs such as DrugBank(Wishart et al., 2008) store pharmacogenomics information such as drug targets, there also exist several databases focused on the interconnections of genes, diseases and chemical compounds. PharmGKB (Hewett et al., 2002; Klein et al., 2001) and the Comparative Toxicogenomics Database (CTD) (Davis et al., 2010) focus on curating knowledge on human genetic variants and links from these to pharmaceutical compounds and disease. These databases curate relationships from both PubMed articles as well as other sources of information such as the US Food and Drug Administration list of genetic biomarkers in drug labels. Although manually curated drug-gene interaction data is the gold standard for many researchers, computational approaches can provide broad coverage. The automatically populated database SuperTarget (Gunther et al., 2008) focuses on extracting interactions of drugs and their targets from MEDLINE®/PubMed® and other interaction databases, in the context of disease diagnosis or treatment. 
SuperTarget is paired with a manually annotated companion resource Matador. Text mining tools have been integrated into the biocuration pipeline of CTD to identify interactions between genes, diseases and drugs (Wiegers, Davis, Cohen, Hirschman, & Mattingly, 2009). Although automated text analysis is error-prone, they find the emerging automated tools can be applied to increase biocurator throughput and efficiency by improving prioritization of articles to curate.

Beyond assisting in curation, bibliometric and text analysis tools, from text retrieval to natural language processing, have been coupled with computational techniques for drug discovery (Agarwal & Searls, 2008). They look at applications involving genes and drugs, deriving the relations between the two by downloading and analysing PubMed data. Sardana et al. (2011) explore the need and opportunity of bioinformatic analyses to help find new applications for existing drugs to treat rare diseases. Automated knowledge-based analysis can be used to systematically assist in such drug repositioning in providing quantitative starting evidence. However, the existing methods focus on common (non-orphan) diseases, and so a general methodology applicable to diseases regardless of their rarity, focusing instead on the pre-existing biomedical research, remains an important area of study.

Repositioning research has often focused on approaches that are independent of pre-existing research literature. For instance, groups have analyzed quantitative structural chemical properties (Fjell et al., 2009) and quantitative high-throughput experiments (Aerts et al., 2006). PREDICT (Gottlieb, Stein, Ruppin, & Sharan, 2011) combines existing drug-disease associations extracted using the Unified Medical Language System with drug-drug and disease-disease similarity to rank a query drug-disease association based on the most similar drug-disease evidence.
Recent advances in re-positioning have incorporated large-scale docking simulations (Y. Y. Li, An, & Jones, 2011) and gene expression profiling (Sirota et al., 2011a).

Author Similarity

The concept of similarity between entities applies broadly. As a novel approach for information retrieval in the life sciences, we became interested in the identification of similarity between authors. Every article in PubMed is annotated with a list of authors, representing a vast pool of entities. Unlike genes, diseases, drugs or any other kind of medical subject annotation attached to an article, the author is a definitive assignment by the creator of the work. It is expected that each author has characteristic research interests which should be captured by the topics addressed in their work. Given a profile of each author, it should be possible to identify similarities between authors to facilitate collaborations and interactions within the scientific community.

Identifying and comparing authors based on their research interests has been pursued using abstract similarity (Errami, Wren, Hicks, & Garner, 2007). The same class of text comparison methods has been applied to detect similar and potentially plagiarised work (Errami, Sun, Long, George, & Garner, 2009). Our work builds on this prior body of literature. Rather than focus on the text and phrases, we elect to construct quantitative author MeSHOPs and perform large-scale comparisons.

Author comparisons are confounded by the challenge of author ambiguity. As names are not exclusive, with some names being especially common and shared among many individuals, one must address the problem of differentiating authors based on surname and initials. Since 2002 the full names of the authors have been deposited in PubMed; however, many author names exactly match.
Further complicating the disambiguation problem is the converse problem, where one individual may appear in the database under several different names – for instance variations including or omitting some or all initials, misspellings, or legal name changes. Biomedical subject annotation similarity has been used to distinguish articles from the same author from those by distinct authors with the same name (Torvik & Smalheiser, 2009). In addition to the computational methods, several initiatives, including OpenID, ResearcherID (Bourne & Fink, 2008) and PubMed Author ID (NLM, 2011) are in progress to explicitly identify the authors of articles with an identification number. Once mature, these efforts will eliminate ambiguity in the identification of authors and allow the generation of more accurate and complete profiles.

Visualisation

While the previously discussed applications of the primary literature data allow us to quantitatively associate a biomedical entity directly with its related topics or with predicted related topics, the presentation of profiles and results can assist in the interpretation of results. For quantitative annotation profiles, the ultimate result is a list of terms with a numerical score indicating the strength of association. Beyond presenting the results as an ordered list of these pairs and highlighting the strongest terms, we look at methods that provide additional information such as allowing direct comparison of the relative importance of the terms.

In bioinformatics, relative font size has been previously used to convey quantitative scores through sequence logos (Schneider & Stephens, 1990). Word clouds have previously been used in the analysis of free text, presenting terms from the text in size relative to their occurrence from a variety of biomedical sources (Baroukh, Jenkins, Dannenfelser, & Ma’ayan, 2011).
These word clouds can be generated by removing common words and measuring the enrichment of terms remaining relative to the background rate. LigerCat (I.N. Sarkar, Schenk, Miller, & Norton, 2009) extracts MeSH terms from the articles returned from PubMed queries or bibliographies from GenBank records, and presents their scored results as a word cloud to highlight the most unusual terms. Additionally, the word cloud is transformed into an interactive entity, allowing the user to form a new query by combining terms of interest from the cloud. We build on these visual representations of medical terms, introducing filtration of redundant terms, and demonstrate how changing the comparative background can highlight class-specific terms. These methods demonstrate that MeSHOPs readily allow visual representation of the information encoded in the primary literature, but also reveal this to be a nascent area rich for future development.

Conclusion of Literature Review

As evidenced by all the prior research efforts, dealing with Big Data is an ongoing problem that only increases in difficulty and complexity as our pool of knowledge expands. MEDLINE®/PubMed®, with its comprehensive coverage of biomedical articles indexed by expert curators, provides a rich resource of human-verified knowledge in computable form. Existing methods hint at the importance of literature as both a source of knowledge currently used to guide research and as a means to inform future predictions. This thesis will demonstrate that this pool of knowledge can be used to inform us about preexisting knowledge, by directly and quantitatively extracting the unusually associated topics for an entity and uncovering biases influencing relationships between entities.
Moreover, we quantitatively evaluate the extent that the knowledge extracted can be used to uncover previously unidentified links, opening the way to more sophisticated analyses while cautioning about the significant effects of literature annotation bias. These applications are only the tip of the iceberg – PubMed provides the opportunity for bioinformaticians to use their computational tools to not only extract knowledge from a single paper, but to agglomerate the knowledge of all experts in the field to assist in forging a more complete understanding of the domain. Ultimately, we wish to enable a future where biomedical researchers are given clear access to the totality of the existing knowledge, and can use this knowledge to its fullest extent to guide their future endeavors.

Formal Summary of Chapters

This thesis focuses on a novel method for extracting the most relevant biomedical topics from a set of related research articles to profile an entity of interest (Chapter 2), and uses comparison of these entity profiles to predict new associations (Chapters 3-5).

Chapter 2 describes the extraction of MeSHOPs from annotated biomedical citations. This statistical method describes the extraction of highly relevant topic terms over-represented in a set of citations compared to a background set of articles. We evaluate the effect of several methods to refine and focus the over-representation results, and demonstrate how MeSHOPs allow entities to be grouped based on common subject headings in their profiles.

Chapter 3 shows how MeSHOPs for genes and diseases can be computed and compared, and that similarity of MeSHOPs can predict novel gene-disease relationships. We evaluate a variety of measures for MeSHOP similarity. We validate the predictions from the analysis of data from 2007, and show high accuracy at predicting new relationships appearing in subsequent years.
Importantly, we investigate several bibliographical baselines, demonstrating that the degree of annotation for genes and diseases is predictive of future association.

Chapter 4 further examines the relationships between entities, focusing on MeSHOPs for pharmacologic compounds and comparing these to the disease MeSHOPs. To overcome the annotation bias introduced by entity annotation, we devise a method to correct for the influence of literature annotation in the similarity scores.

Chapter 5 then applies profiles to look at the network of biomedical authors. With over three million authors, several orders of magnitude greater than the four thousand MeSH diseases or the thirty thousand human genes, this introduces an additional computational obstacle as explicitly comparing every author to every other author is no longer a feasible option. We investigate and present computational methods for rapid high throughput entity comparison.

Figure 1-1. Schematic of the Research Project. The project focuses on the analysis of biomedical research literature citations (A) and the major biomedical terms annotated to such citations (B). We look at sets of citations and associated annotation related to an entity of interest (gene, disease, drug or author), to analyse for the presence of over-represented terms (C). We then explore the prediction of novel associations between entities based on the biomedical annotation profiles (D).
Baseline   Created                            Number of Citations   Increase in Citations
2002       Approximately November 21, 2001    11,299,108
2003       Between November 1-4, 2002         11,847,524            548,416
2004       Between November 14-18, 2003       12,421,396            573,872
2005       November 20, 2004                  14,792,864            2,371,468
2006       November 18 & 19, 2005             15,433,668            640,804
2007       November 17 & 18, 2006             16,120,074            686,406
2008       November 16 & 17, 2007             16,880,015            759,941
2009       November 21 & 22, 2008             17,764,826            884,811
2010       November 20, 2009                  18,502,916            738,090
2011       November 19, 2010                  19,569,568            1,066,652
2012       November 18, 2011                  20,494,848            925,280

Table 1-1. Increase in size of the PubMed/MEDLINE Baseline over time. Numbers taken from http://mbr.nlm.nih.gov/

Table 1-2. List of Existing Algorithms for Candidate Disease Gene Selection and Data Sources Analysed by these Methods.
Methods (columns): Disease Clustering, eVOC system, GeneSeeker, DGP, PROSPECTR, G2D, POCUS, CAESAR, Endeavor.
Rows: Input Type, followed by the Data Sources Analysed: PubMed Abstracts, eVOC annotation, Sequence data, GO annotation, Protein data, Phenotype / Expression Libraries, Orthologous genes (mouse), Orthologous genes (other), Protein Domain, Interaction Partners, Cis-Regulatory Modules, OMIM.

Chapter 2: Quantitative Biomedical Annotation using Medical Subject Heading Over-representation Profiles (MeSHOPs)

Synopsis

Background. MEDLINE®/PubMed® indexes over 21 million biomedical articles, providing curated annotation of contents using a controlled vocabulary known as Medical Subject Headings (MeSH).
The MeSH vocabulary, developed over 50+ years, provides broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important to both understand the importance of related concepts and discover new relationships.

Results. We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated to a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, MeSHOPs can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms.

Conclusions. MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. MeSHOPs can be generated and visualised via a web interface.

Introduction

The MEDLINE®/PubMed® (hereafter referred to as PubMed) bibliographic database of the U.S.
National Library of Medicine (NLM) is an actively maintained central repository of over 18.5 million biomedical literature references (Sayers et al., 2009). To navigate this growing body of published information, the PubMed references are indexed by subject experts at the NLM using Medical Subject Headings (MeSH) (Nelson, Johnston, Humphreys, Bean, & Green, 2001), a structured controlled vocabulary of 26,000 biomedical descriptors. The MeSH annotations are intended to facilitate the identification of relevant papers for research scientists. As PubMed grows at a modern rate exceeding 600,000 references per year, researchers face a daunting challenge to assess the body of work about entities (genes, drugs, authors, etc.) arising in the course of their research. Encapsulating the bibliography for a biomedical entity of interest in a form both understandable and informative is an increasingly important challenge in biomedical informatics (Hirschman, Hayes, & Valencia, 2007b; Jensen et al., 2006). One approach to succinctly summarise a bibliography (i.e. a set of key papers) for a biomedical topic is to identify the MeSH terms most strongly associated to the papers. Previous reports which introduced summaries of over-represented MeSH terms for a set of papers include a study of enriched annotations for groups of differentially expressed genes (Djebbari et al., 2005) and a method to identify MeSH terms enriched in articles retrieved in a query of the PubMed database (I.N. Sarkar et al., 2009). These initial approaches to MeSH annotation analysis applied ad hoc measures of association over small sets of articles to demonstrate the potential value for MeSH annotation summarization. Key to accelerating the research process is the development of systematic approaches to quantitatively represent bibliometric information and infer functionally important relationships between entities. 
Addressing this goal, we introduce MeSH Over-representation Profiles (MeSHOPs) to quantitatively describe the properties of genes, diseases or any other entity associated with a set of articles represented in PubMed. The entire PubMed database is analyzed. For each MeSHOP, the over-representation of MeSH annotations across a bibliography of articles is statistically evaluated for a biomedical topic. MeSHOPs convey characteristics of the subject entity, facilitating discovery of novel relationships across classes of entities. We demonstrate the use of MeSHOPs to facilitate visualization of associated properties, subject to the use of appropriate corrections for background annotation properties. To assess the utility of MeSHOPs for high-throughput generation of quantitative annotation, the capacity of the process to re-derive a subset of Gene Ontology annotation of genes is measured. Using a set of biomedical entities - vitamins - as an example, MeSHOP comparisons are shown to provide a quantitative measure of similarity between each member of the class. Profiles can be similarly compared across entity classes, as demonstrated in an analysis of the similarities between gene MeSHOPs and brain disease MeSHOPs. MeSH Over-representation Profiles fill an important niche in computational biology, allowing quantitative annotation descriptions to be generated for any entity for which a set of research articles indexed in the PubMed database can be defined.

Methods

MeSH Over-representation Profiles

A MeSHOP is a quantitative representation of the annotations associated with a set of articles, where the set is composed of articles that address a specific entity (such as a gene or disease). The computation of a MeSHOP initiates from a set of articles that address a specific entity and returns a set of over-represented MeSH terms, each term with a p-value reflecting over-representation based on its rate of occurrence in the set of articles (see Figure 2-1).
Comparing the observed frequency of each MeSH term annotated to the background rate returns a measure of over-representation. A MeSHOP is a vector of tuples < (t1, m1), (t2, m2), … (tn, mn) >. For each tuple (ti, mi) in a MeSHOP, ti is a distinct MeSH term in the MeSH vocabulary and mi is the numeric measure of the over-representation of MeSH term ti in the set of articles (the computed over-representation p-value, which can also be negative logtransformed). For this study, several large classes of entities were analyzed such as the human genes in Entrez Gene and the diseases specified formally within MeSH. MeSHOPs are generated for each entity in a class by assessing the set of all linked PubMed records for each member. We use Fisher’s Exact Test to determine p-values, computed from a 2x2 contingency table comprised of: 1) the frequency of occurrence of the term ti in the set of articles addressing the entity of interest; 2) the remainder of articles addressing the entity not having the term ti; 3) the frequency of the term ti in a specified background set of articles, not including articles addressing the entity; and 4) the remaining number of articles in the background set that do not refer to the term ti.or the entity. To illustrate this, we compute the association of the term “Alzheimer Disease” to the gene A2M in Table 2-2.  29  Medical Subject Heading Annotation Data Over 18 million biomedical references in PubMed have been evaluated by NLM staff subject experts. These curators assigned appropriate MeSH terms corresponding to the topics covered by the paper. The MeSH terms chosen are intended to be the most specific terms relevant to the topic covered in the paper – for example, if the term “Alzheimer Disease” is attached to the paper, the more general (‘parental’) term “Brain Disease” would not be associated. For our analysis, we therefore consider a paper annotated by a MeSH term to also be annotated with all ‘parents’ (and ‘grand-parents’, etc.) 
of that MeSH term. When indexing articles using MeSH terms, complex topics often cannot be covered by a single MeSH term – in this case multiple terms are “coordinated”. For example, the topic “medical staff in teaching hospitals” is covered by using the two MeSH terms “Medical Staff, Hospitals” and “Hospitals, Teaching”. All major topics in a report are indexed, even if the findings are negative. Previous studies have shown that MeSH provides higher-specificity but lower-sensitivity retrieval of relevant MEDLINE articles than direct text searching (Chang, Heskett, & Davidson, 2006; Jenuwine & Floyd, 2004). This is attributed to manual curation being able to accurately determine when the topic term is the subject of an article. As MeSH is a standardised vocabulary, it also allows for disambiguation of concepts that use the same words, as well as resolving different terminology referring to the same concept. The MeSH vocabulary is updated annually during the Annual MEDLINE®/PubMed® year-end processing (http://www.nlm.nih.gov/bsd/policy/yep_background.html), where terms may be deleted or changed. In most cases, experts at the NLM will specify a replacement MeSH term that can be applied automatically to all cases. Additionally, the experts may also manually revise some citations to preserve the intent of the existing indexing effort. New MeSH terms can also be added. In general, pre-existing citations are not re-indexed with a new term, except in cases where the new term redefines a previously ambiguous heading.

Generating Disease MeSHOPs

For each MeSH term from the disease category (Category C), the entire bibliography of annotated articles in PubMed was considered. Disease-article linkages are drawn directly from PubMed via the curator-assigned MeSH terms.
To generate MeSH term literature profiles for diseases, all MeSH terms from the disease category – Category C – were used; a set composed of 4 229 terms in MeSH 2007 linking to over 8 million articles.

Generating Gene MeSHOPs

All human genes in Entrez Gene were considered (38 604 in Entrez Gene 2007). Two sources for gene-article linkages from Entrez Gene were evaluated: Gene Reference Into Function (GeneRIF, http://goo.gl/SzRui) and gene2pubmed (http://goo.gl/bUEDU). GeneRIF is a curated set of links provided by annotators at the NLM and public submissions, where each set of PubMed articles refers to a briefly described function of the gene. gene2pubmed is a set of links to PubMed articles relating to the gene, generally broader in scope than GeneRIFs. GeneRIFs link 11 750 human genes to 142 396 articles. gene2pubmed links 26 510 human genes to 226 615 articles.

Implementation

The analysis was performed using Python (http://www.python.org/), XSLT (http://www.w3.org/TR/xslt), and the MySQL database system (http://www.mysql.com/). Fisher’s Exact Test p-values were computed using the R statistics package (http://www.r-project.org/). Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine (http://gridengine.sunsource.net/). A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM. Datasets were downloaded from Entrez Gene (ftp://ftp.ncbi.nlm.nih.gov/gene/) and PubMed (http://www.nlm.nih.gov/databases/leased.html). See Table 2-1 for details of the size and contents of the datasets.

Results

MeSHOPs quantitatively represent the association of medical terms to a topic of interest, based on the bibliography for the topic compared to a background set of articles. We examine methods for generating MeSHOPs, and show how MeSHOPs can be used to reveal terms associated with a topic.
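The over-representation test described in the Methods can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in this work (which computed p-values in R). It assumes a one-sided test (probability of seeing at least the observed count), evaluates the hypergeometric tail with log-gamma functions, and uses hypothetical function names (`enrichment_p_value`, `build_meshop`) and made-up counts.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's Exact Test p-value for a 2x2 table.

    a: entity articles annotated with the term
    b: entity articles without the term
    c: background articles (excluding the entity's) with the term
    d: background articles with neither the term nor the entity
    The one-sided alternative is an assumption of this sketch.
    """
    n_entity = a + b
    n_term = a + c
    n_total = a + b + c + d
    log_denominator = log_choose(n_total, n_entity)
    p = 0.0
    # Sum hypergeometric probabilities for counts >= the observed count.
    for k in range(a, min(n_entity, n_term) + 1):
        log_p = (log_choose(n_term, k)
                 + log_choose(n_total - n_term, n_entity - k)
                 - log_denominator)
        p += exp(log_p)
    return min(p, 1.0)


def build_meshop(entity_counts, n_entity, background_counts, n_background):
    """Return [(term, p_value), ...] sorted by increasing p-value.

    Both count dictionaries map a MeSH term to a number of articles;
    the background is assumed to exclude the entity's own articles.
    """
    profile = []
    for term, a in entity_counts.items():
        c = background_counts.get(term, 0)
        p = enrichment_p_value(a, n_entity - a, c, n_background - c)
        profile.append((term, p))
    return sorted(profile, key=lambda pair: pair[1])


# Illustrative counts only (not data from this study).
example = build_meshop(
    {"Alzheimer Disease": 3, "Humans": 4}, 4,
    {"Alzheimer Disease": 1, "Humans": 4}, 4,
)
```

In this toy example a term attached to every article in both sets (“Humans”) receives a p-value of 1, while the term concentrated in the entity's articles ranks first.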
Calculating MeSHOPs for Biomedical Entities

We evaluate multiple procedures to quantitatively describe the annotation properties of a biomedical entity using MeSH terms attached to a set of articles about the entity. At the simplest, one could count the number of times each MeSH term is attached to the corpus of articles (Figure 2-2A). Such an approach fails to account for the number of articles in the corpus, so one could normalize the frequency. While such a correction may facilitate comparisons between distinct MeSHOPs, it fails to account for the importance of the individual terms and has no impact on the visual representation (data not shown). Some terms, such as ‘human’, are attached frequently, but provide little information to distinguish between distinct biomedical entities. We elect to calculate a p-value reflecting the significance of observing the number of annotations with a MeSH term in a set of articles of the given corpus size, as detailed in the Methods. The p-values allow us to place the quantitative emphasis on distinguishing terms while correcting for the number of articles involved (Figure 2-2B). The p-values are computed under the model of a hypergeometric distribution via Fisher’s Exact Test. The universal background applied in this case is the set of 17 million PubMed articles assigned MeSH terms (see Table 2-2 for more details). The p-values measure the co-annotation or co-occurrence of MeSH terms with the entity. This MeSHOP generation process (Figure 2-1) underlies all subsequent analysis in this report.

Simplifying Large MeSHOPs

Inspecting the raw MeSHOPs revealed two issues that become increasingly important when analyzing larger bibliographies: (i) highly correlated terms within the MeSH hierarchy result in concept redundancy in the profiles; and (ii) terms enriched across the entities of a class are uninformative about the specific entity being profiled. Two corrections were introduced to address these issues.
As an example of the first problem, consider the term “Alzheimer Disease”, which implies the more general term “Brain Disease”, rendering the observed over-representation of “Brain Disease” redundant in a profile (see Figure 2-3). The tree-like structure of the MeSH vocabulary provides a direct method to determine term relationships. A more succinct representation can be generated by removing more general terms, limiting MeSHOPs to include only the most specific significantly associated terms from the MeSH tree (see Figure 2-2C). As an example of the second problem, the initial MeSHOP for the gene BRCA1 includes the term “polymorphism, single nucleotide”; however, this term is enriched for 29% of human genes using the universal background set of articles. To address this issue, we calculate the enrichment statistics based on class-specific article backgrounds. For human genes, the background is restricted to articles addressing at least one human gene. Similarly, for diseases, the background is all articles annotated with at least one MeSH disease term. Using class-specific backgrounds, the statistical test highlights terms unusually enriched for the specific member, de-emphasising terms common to all members of the class (see Figure 2-2D).

Visualising MeSHOPs

MeSHOPs can be directly converted into word clouds to provide a convenient graphical depiction of the annotation properties that enables rapid visual comparison of the relative importance of terms (see Figure 2-2). Word clouds allow for immediate evaluation of the most important terms in a MeSHOP as well as their relative importance, in a manner similar to sequence logos (Schneider & Stephens, 1990). We introduced in the previous subsection two approaches that improve over-representation profiles: (i) filtering to retain only the most specific MeSH terms and (ii) selecting an appropriate background for the statistical assessment.
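The first of these corrections, retaining only the most specific significant terms, can be sketched as follows. This is an illustrative reading of the filter rather than the code used in this work. It assumes that each term's positions in the hierarchy are given as MeSH tree numbers, that a term is more general than another when one of its tree numbers is a dotted prefix of one of the other's, and that a p-value threshold of 0.05 defines significance. The function name and the tree numbers shown are for illustration and should be checked against the MeSH release in use.

```python
def most_specific_terms(profile, tree_numbers, threshold=0.05):
    """Drop significant terms that are ancestors of another significant term.

    profile: dict mapping MeSH term -> p-value
    tree_numbers: dict mapping MeSH term -> iterable of tree-number strings
    threshold: significance cut-off (an assumption of this sketch)
    """
    significant = {t: p for t, p in profile.items() if p <= threshold}

    def is_ancestor(general, specific):
        # "A.B" is an ancestor of "A.B.C": compare on dotted prefixes.
        return any(
            s.startswith(g + ".")
            for g in tree_numbers.get(general, ())
            for s in tree_numbers.get(specific, ())
        )

    return {
        term: p
        for term, p in significant.items()
        if not any(
            other != term and is_ancestor(term, other)
            for other in significant
        )
    }


# Illustrative profile and tree numbers (not data from this study).
toy_profile = {
    "Brain Diseases": 1e-8,
    "Alzheimer Disease": 1e-12,
    "Humans": 0.9,
}
toy_tree_numbers = {
    "Brain Diseases": ["C10.228.140"],
    "Alzheimer Disease": ["C10.228.140.380.100"],
    "Humans": ["B01.050.150.900.649.313.988.400.112.400.400"],
}
filtered = most_specific_terms(toy_profile, toy_tree_numbers)
```

Here “Brain Diseases” is removed because the more specific “Alzheimer Disease” is also significant, and “Humans” is removed because it does not pass the threshold.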
A word cloud for a MeSHOP is generated using the associated MeSH terms and the negative log of the corresponding calculated p-values, directly translating the statistical significance of each term proportionally into the size of the font for the associated term. The MeSHOP [term, -log(p-value)] pairs are submitted to the online cloud generating software Wordle (http://www.wordle.net) and visualized using the “Horizontal” layout. Each MeSH term for a given MeSHOP is laid out by the Wordle software in a random, non-overlapping manner, with the font size of the term scaled proportional to the weight in the vector.

Web Interface for Generating and Obtaining MeSHOPs

To enable reader exploration of the profiles, we provide pre-computed MeSHOPs for biomedical entities such as genes, diseases and pharmaceutical compounds (http://meshop.oicr.on.ca). All MeSH-annotated articles available through the most recent full year release (2010) are incorporated into the profiles. Diseases include all specified by MeSH terms under the parent term “Diseases”. Pharmaceutical compounds were defined as compounds appearing in the MeSH supplemental concepts for which an indication of ‘pharmaceutical action’ was attached. Genes are not consistently defined as MeSH terms. As MeSHOPs may be generated for any set of articles, gene MeSHOPs were derived from existing mappings of genes onto PubMed article identifiers. Users seeking to generate MeSHOPs for other biomedical entities may provide a list of PubMed Identifiers (PMIDs). We provide, as an example, MeSHOPs for four different entities in Figure 2-4.

Properties of Gene and Disease Annotation

Examination of the number of articles linked to human genes and diseases reveals substantial differences between these data sources. Most genes have few linked articles, the distribution decreasing with an extreme tail of well-studied genes with many links.
For the GeneRIF article links from Entrez Gene (accessed 2007-02-13), genes have a mean of 369 assigned articles, but a median of only 15 articles (see Figure 2-5A). Similarly, for the gene2pubmed article links, the mean is 637 articles, yet the median is only 20 articles (see Figure 2-5B). Diseases have a more balanced distribution, but still a characteristic extreme tail of certain well-studied diseases, with the key difference that very few diseases have only a couple of articles. In the 2007 release of PubMed, a mean of 19 431 articles linked to each disease but a median of only 1 912 articles – still substantially more than the median for genes (see Figure 2-6). Of the 24 357 MeSH 2007 terms, 15 674 (64%) terms are represented in gene MeSHOPs (via the 2007 gene2pubmed article links), and 23 473 (96%) terms are found in disease MeSHOPs (via 2007 PubMed). We expect that as genes become better annotated with more comprehensive bibliographies, their annotation pattern will come to resemble that of the more comprehensively annotated diseases.

Re-deriving Gene Ontology Annotations with MeSHOPs

MeSHOPs may be most advantageous as an approach to generate quantitative annotation profiles in a high-throughput manner for any set of biomedical entities that can be associated with sets of research articles. To provide a measure of the performance of the procedure to regenerate curated annotations, we assessed the sensitivity of MeSHOPs for detecting Gene Ontology terms annotated to genes. Using the Unified Medical Language System (UMLS) mapping of MeSH terms to Gene Ontology terms, we identified 396 GO terms with one-to-one equivalent MeSH terms. As depicted in Figure 2-7A, the sensitivity of MeSHOPs for representing these terms for the corresponding genes ranges from 77% (at a p-value threshold of 0.05) to 95% (at a threshold of 0.31). As GO annotations are not comprehensive, there is no direct means to assess the specificity of the method.
In lieu of specificity we plot the average total number of MeSH terms included per gene relative to the threshold values, with 162 terms per gene at a p-value threshold of 0.05 (Figure 2-7B).

Temporal Changes of MeSHOPs

MeSHOPs can be used to identify changing knowledge and properties for an entity. For example, by taking a subset of the articles for a biomedical entity at different timepoints, we can track the changes in research focus for the entity over time. Two areas of research, defined by the MeSH terms “Computational Biology” and “Stem Cells”, were analysed. At each selected time point, the fifty most recent articles for that year were taken to represent the state of the field at that time, and MeSHOPs were computed using the universal PubMed background. Analysing the MeSHOPs for “Computational Biology” over the past decade allows us to quantitatively evaluate the evolution of the field (see Figure 2-8). For this analysis, all years indicate the inclusion of articles to the end of that calendar year. The MeSHOP from 1999 reveals significant topics such as “Human Genome Project”, a major informatics focus at that time point, that are completely absent when we examine the corresponding MeSHOP from 2009. “Genetic Research”, present in both MeSHOPs, is followed in the recent MeSHOP with other terms for biological disciplines and techniques such as “Genomics”, “Genetic Techniques”, “Proteomics” and “Sequence Analysis, Protein”, demonstrating how computational biology techniques are being more tightly integrated with biomedical research. As seen in Figure 2-9, data from MeSHOPs can be used to chart the gradual decline in significance of “Information Services” as the focus of the research switches from storage of the data, and the corresponding rise in association to “Biochemistry” demonstrating a tighter coupling with scientific study. Similarly, we can track the changes in “Stem Cells” since the introduction of the term in 1984 (see Figure 2-10).
By 1985, we see “Hematopoietic Stem Cells” and “Bone Marrow Cells” as a significant focus. This is followed by the surge in importance of “Stem Cell Transplantation” by 2000, whereas by 2009 we see the focus shifting to “Mesenchymal Stem Cells”, “Cell Differentiation” and “Embryonic Stem Cells”. MeSHOPs provide both a qualitative visual summary of the shifting focus of research over time for an entity of interest, as well as a method to quantitatively track the progression of association of biomedical subjects as they relate to the entity of interest.

Intra-group MeSHOP Similarity

MeSHOPs can also be used to investigate relationships between a set of related entities. For the set of entities comprising the 13 human Vitamins, we first use MeSHOPs to examine the co-occurrence of Vitamin MeSH terms in PubMed (see Figure 2-11A) by considering, for each vitamin entity, the subset of the MeSHOP relating to vitamins. In this case, the MeSHOPs measure the co-occurrence strength between any two vitamins, allowing us to visualise and cluster the vitamins via their bibliographic co-occurrence using Fisher’s Exact Test. The vitamins separate via the clustering with the fat-soluble vitamins A, D, E and K together, whereas the water-soluble vitamins (Ascorbic Acid and the B complex vitamins) are grouped separately. This graphic also reveals publication trends – for example, of the fat-soluble vitamins, all co-occur except for vitamins A and K, and the water-soluble vitamins cluster into three distinct groups, with Niacin separated from Pantothenic Acid, Biotin and Thiamine, which are also separate from the rest of the B complex vitamins, which group with Ascorbic Acid. Using the entirety of the vitamin MeSHOPs, we can compute the similarity of the strength of association to biomedical subjects, taking the Euclidean distance of the log of the p-values for the shared terms in their MeSHOPs.
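The profile comparison just described, a Euclidean distance over the log p-values of shared terms, can be sketched as below. This is a minimal illustration under stated assumptions: profiles are dictionaries from MeSH term to p-value, the natural logarithm is used, and p-values of zero are clamped to a small floor before taking the log. The function name `meshop_distance`, the floor value and the toy profiles are hypothetical.

```python
from math import log, sqrt


def meshop_distance(profile_a, profile_b, floor=1e-300):
    """Euclidean distance between log p-values of terms shared by two MeSHOPs.

    profile_a, profile_b: dicts mapping MeSH term -> p-value
    floor: lower bound applied before the log so that p-values of zero
           do not raise an error (an assumption of this sketch)
    """
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        raise ValueError("profiles share no MeSH terms")
    return sqrt(sum(
        (log(max(profile_a[t], floor)) - log(max(profile_b[t], floor))) ** 2
        for t in shared
    ))


# Illustrative profiles (not data from this study).
profile_x = {"Term X": 1e-2, "Term Y": 1e-4, "Term Z": 0.5}
profile_y = {"Term X": 1e-2, "Term Y": 1e-2}
distance = meshop_distance(profile_x, profile_y)
```

Only “Term X” and “Term Y” are shared here, so “Term Z” does not contribute; a matrix of such pairwise distances is the kind of input a hierarchical clustering routine would take.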
Comparing the results of co-occurrence to the profile comparisons in Figure 2-11B reveals that the results of clustering by profile are similar to the bibliographical co-occurrence, such as Vitamin A clustering with Vitamin D, as well as Pantothenic Acid clustering with Thiamine. Profile similarity clustering, however, can emphasise different similarities from co-occurrence, such as Niacin being more similar to Pantothenic Acid and Thiamine than to Biotin, and a similarity in annotations between Vitamin E and Ascorbic Acid. MeSHOPs allow us to analyse a set of biomedical entities to highlight known and expected relationships through strength of co-occurrence in biomedical literature, as well as revealing similarities of annotation profiles.

Inter-group MeSHOP Similarity

To explore the challenges arising with inter-group MeSHOP comparisons, we sought to identify links between a subset of genes and brain disorders. We examined the genes of the Notch (48 genes), Wnt (152 genes) and Hh (57 genes) signalling pathways, with the list of genes for each pathway extracted from KEGG (accessed June 2011) (see Figure 2-12). These signalling pathway genes were profiled against the subset of MeSH terms that are the immediate children of the MeSH term “Brain Diseases”. Clustering using their association to the pathway genes, the “Brain Diseases” are arranged into categories, with “Brain Neoplasms” being the most strongly associated to the genes, with “Hypothalamic Diseases” and “Dementia” also broadly associated. “Brain Injuries”, “Intracranial Hypertension” and “Hydrocephalus” are weakly associated to these genes by MeSHOP comparison. We grouped the pathway genes based on the “Brain Diseases” subset of their MeSHOPs. Rather than grouping distinctly by pathway, the genes are spread across different clusters. A broad spectrum of the pathway genes is strongly associated with “Brain Neoplasms”, while a subset is also strongly associated with “Hypothalamic Diseases”.
Another distinct set of genes associated to “Cerebellar Diseases” are not associated with the previous two groups (see Figure 2-12C). MeSHOPs provide a unique quantitative method of visualising the gene landscape for a particular topic through the associated MeSH annotations.

Discussion

MeSHOPs are quantitative annotation profiles based on over-representation analysis of MeSH terms attached to sets of articles, where each set or bibliography is associated to a specific biomedical entity such as a gene, disease or chemical. Conveniently visually depicted as word clouds, a MeSHOP includes both common terms frequently arising in a bibliography and rare concepts that arise more than expected by chance. In this report we demonstrate the capacity of the MeSHOP generation procedure to recover known gene annotations (as curated with Gene Ontology terms), use temporal restrictions to demonstrate how MeSHOPs change over time, and introduce methods for the comparison of MeSHOPs for both intra- and inter-group similarity analyses. MeSHOPs can be expected to be widely used by researchers, as they may be generated for any biomedical entity and provide quantitative annotation without extensive curation. We anticipate that researchers will be most attracted to the convenient generation of annotation images by converting MeSHOPs to word clouds. Convenient visualization methods in bioinformatics have made substantial impacts on communication, as evident in such methods as sequence logos for motifs (Schneider & Stephens, 1990), circos plots for genomics (Krzywinski et al., 2009), pip-plots and dotter images (Schwartz, 2003; Sonnhammer & Durbin, 1995) for sequence alignments, and network diagrams for protein systems (Snel, Lehmann, Bork, & Huynen, 2000). MeSHOPs are likely to provide a similar level of convenience for summarizing complex topics for accelerated interpretation.
The use of word clouds, of course, has been extensive, including for the display of gene annotation (Baroukh et al., 2011; Desai et al., 2011). The key advantage of MeSHOPs is that they draw upon the expert curation underlying PubMed.

Technical Challenges

MeSHOPs directly measure the significance of the annotated biomedical topics for a bibliography. The significant terms in a MeSHOP are therefore implicated by co-occurrence (guilt by association). The reliability of such over-representation analysis is dependent on the annotation used to generate the results. MeSH terms and Supplemental MeSH Concepts are annotated to PubMed articles by subject area experts to indicate the major and minor topics addressed by an article. There are two caveats to the over-representation analysis. Firstly, a co-occurring MeSH term may not apply to the biomedical topic despite appearing in the same paper. This form of erroneous linkage is mitigated when significant p-values are supported by multiple co-occurrences in the bibliography addressing the entity. Secondly, co-occurrence can indicate a negative association, as negative associations are annotated in MeSH if they are an important topic of the paper. However, a negative association is unlikely to provoke substantial further literature support, unless it is of substantial research interest or the result inconclusive, at which point the MeSH term emerges as important to the biomedical topic. Thus it is our expectation that further development of MeSHOPs will need to explore measures of confidence for small bibliographies. Manual annotation by domain experts, such as the annotation of MeSH terms to PubMed, is the gold standard that other large-scale, more automated techniques are compared against. However, manual annotation introduces the potential for human bias, although MeSH annotation has been shown to have high inter-annotator consistency (Funk & Reid, 1983).
MeSH is also continually updated, incorporating new terms and updating old terms. Deprecated old terms are migrated to new terms in this process; however, old articles are not revisited to be associated with new terms. This can be further complicated when, as knowledge of a disease grows, classifications blur or change. One example of this is the change in classification for autism and schizophrenia. In the first two editions of the Diagnostic and Statistical Manual of Mental Disorders (DSM), autistic disorder and childhood-onset schizophrenia were not differentiated, whereas in further editions, childhood-onset schizophrenia is classified with adult schizophrenia. Autism, on the other hand, has been proposed for reclassification in the fifth edition of the DSM into a category of autism spectrum disorders that includes other disorders such as Asperger’s syndrome. MeSH annotation therefore will be most appropriate and accurate for recent articles, and these issues are mitigated once our knowledge and understanding of a topic has matured.

Related Work

The use of statistical tests to assign significance values for annotation terms appearing in a text or across gene annotations has been frequently observed in bioinformatics. We calculate p-values using Fisher’s Exact Test, which has a specific, well-defined interpretation well-suited for over-representation analysis – the probability that the term would be found as prevalently in an equivalent-sized set of articles drawn uniformly at random from the background set of articles – making it possible to set meaningful confidence thresholds and evaluate the scores. These scores highlight strength of association by correcting for the background frequency of occurrence.
Fisher’s Exact Test is commonly used in classic Gene Ontology annotation over-representation tools for gene set analysis such as DAVID (Dennis et al., 2003) and as a measure of over-representation of transcription factor binding sites across a set of genes or sequences (Ho Sui et al., 2007). Grossmann extended the statistical approach to account for the influence of hierarchical annotation, allowing for the frequency of parental terms to contribute to the analysis of child terms (Grossmann et al., 2007). A number of publications have incorporated MeSH terms into the analysis of sets of articles. Many studies have attempted to find common themes for groups of genes arising in experimental studies (Djebbari et al., 2005; Jani et al., 2010; Kumar, 2011). Three papers are more similar to the work described here, although each has distinct characteristics. The LigerCat system was developed to provide a more convenient interface for PubMed searching (I.N. Sarkar et al., 2009). The system generates a word cloud for MeSH terms arising in articles reported by an initial user query (which could be a single entity such as a gene or drug). The user then can click on the individual terms within the cloud to restrict results in the PubMed search. Comparisons of MeSH-based gene profiles were performed by Sarkar and Agrawal (Indra Neil Sarkar & Agrawal, 2006), using hierarchical clustering, but only using profiles composed of binary values (whether a term is present or absent from the profile), where a positive setting was made if there was at least one abstract in which the gene name and assigned MeSH term co-occurred. The most similar work was described in two publications about the Gendoo system (Nakazato et al., 2007, 2009). The Gendoo system allows users to see MeSH terms associated with a gene or drug, and provides an information gain score to indicate which genes or drugs are most closely linked to a MeSH term.
There is no quantitative profile provided, nor the capacity to perform comparisons of distinct entities.

Future Directions

Many extensions of MeSHOPs remain to be explored. We describe here the use of the MeSH terms alone for over-representation; however, MeSH terms may be assigned ‘subheadings’ by curators. Such subheadings specify the context of a MeSH term more precisely (e.g. a disease reference may be coupled to “diagnosis” or “therapy”). As well, some MeSH terms are marked as major topics – future analysis could use this to place more emphasis on these MeSH terms. Incorporation of the finer shades of MeSH annotation may be feasible. As evident with disease MeSHOPs, there is a positive correlation between the number of articles in a bibliography and the number of over-represented MeSH terms. Improved methods to highlight the most relevant biomedical topics may be worthy of future investigation. Ambiguities such as topics occurring by chance or negatively associated topics may be addressed through natural language processing text mining techniques to semantically identify exactly how MeSH terms are linked to the biomedical topic of interest. Each article could be weighted to emphasise articles with more importance. At a rough level of scale, the impact factor of the publication could be used to estimate the relevance of the research, but more specific measures may be available, such as the number of citations. Another potential application of this weighting could be to emphasise more recent findings, weighting recent work more strongly as recent work may supersede previous knowledge. The Web 2.0 era has introduced an additional resource that could be mined for data on annotations and article importance – the scientists themselves. Information can be gleaned passively from the behaviour of users. For example, the impact of an article can be directly measured by the number of times it has been accessed.
Information can also be actively obtained from the community, by leveraging community curation projects such as Gene Wiki for functional annotation (Good, Howe, Lin, Kibbe, & Su, 2011). More directly, services such as Amazon’s Mechanical Turk provide a way to quickly recruit a large pool of annotators via crowdsourcing. The collective intelligence of these annotators can be used to generate more specific annotations for a set of articles, or could be used to identify articles relevant to an entity, allowing MeSHOPs to be generated for entities without a well-defined source of related articles. MeSHOPs can be generated using any source for bibliographies. Automated extraction of gene symbols from PubMed abstracts, using technology such as iHOP (Hoffmann & Valencia, 2004), could create larger gene bibliographies. Subclasses of MeSHOPs, such as species-specific gene profiles, could be generated and compared. A drug MeSHOP could be supplemented with the MeSHOPs of other chemical compounds of the same family.

Conclusion

MeSHOPs quantitatively represent the MeSH biomedical terms enriched across a set of papers associated with a specific biomedical entity such as a gene, disease or drug. Visual display of MeSHOPs using word clouds provides a convenient way to convey annotation properties to readers. Comparison between MeSHOPs allows for the generation of hypotheses, opening new avenues for applied text analysis in bioinformatics.

Figure 2-1. Workflow for Generating a MeSHOP. Starting from a set of articles relating to a biological concept or entity (the foreground set), the associated MeSH terms for each PubMed record of each article are extracted. The prevalence of each MeSH term across the set of articles is compared to a background. Fisher’s Exact Test is applied to measure the statistical over-representation of each term in the foreground set.

Figure 2-2. Alternative Approaches for Generating MeSHOPs Depicted as Word Clouds.
All MeSHOPs depict annotation of the HTT gene that is causal for Huntington Disease. (A) Raw counts. (B) Statistical enrichment scores. The top 150 terms in the profile are shown, with the font size of each term proportional to the negative log p-value for the term. Note the presence of many general terms which are implied by more specific terms, such as “Vertebrates”, “Primates”, “Chordata” and “Mammals” being present, but covered by the term “Humans”. Also, when studying a set of human genes, the terms “Humans” and “Genes” are commonly occurring and should be down-weighted accordingly. (C) Redundancy Filtered HTT Gene Biomedical Term Word Cloud. This is a word cloud where the more general terms have been filtered out from (B), leaving only the most specific terms in the profile. For example, the term “Repetitive Sequences, Nucleic Acid” seen in (B) has been filtered out due to the presence of the term “Trinucleotide Repeat Expansion”. (D) Redundancy Filtered HTT Gene Biomedical Term Word Cloud using human gene background. This is a word cloud when taking only the subset of PubMed articles related to human genes as the background, while also applying the filtering seen in (C).

Figure 2-3. Subset of the MeSH Tree for Alzheimer Disease. The entries in the Medical Subject Heading tree leading to Alzheimer disease. Note that the term Alzheimer Disease occurs in three places in the tree, and under two separate subheadings in the Disease category – once under “Central Nervous System Diseases” due to its location in the human body, and once under “Neurodegenerative Diseases” and “Tauopathies” due to the type of disease.

Figure 2-4. Example of MeSHOPs for four different entities. (A) MeSHOP for the human gene PAX6, generated from all the gene’s gene2pubmed references. (B) MeSHOP for the disease Aniridia, generated from all PubMed articles with the MeSH term “Aniridia”.
(C) MeSHOP for the drug Acetaminophen, from all PubMed articles with the chemical compound “Acetaminophen”. These are all terms in this MeSHOP with p-value of zero; the size of the font is proportional to the number of articles for these terms. (D) MeSHOP for the author Craig Venter, from all articles with “Craig Venter” listed as an author.

Figure 2-5. Distributions of Genes by Associated Literature References. (A) Distribution of Genes by Number of Associated GeneRIF References. The distribution shows that the bulk of the genes have very few references, with an extreme tail of a small fraction of genes having a very large number of references. (B) Distribution of Genes by Number of Associated gene2pubmed References. Although the overall average number of references is higher due to the larger number of gene2pubmed references, the distribution remains very similar to (A).

Figure 2-6. Distribution of Diseases by Number of Associated PubMed References. Unlike the distributions of gene references, the 4112 disease MeSH terms have substantial literature support, although there remains an extreme tail of a small fraction of MeSH terms having an extremely large number of articles.

Figure 2-7. p-values of MeSH Term Mapped Gene Ontology (GO) Human Gene Annotation. (A) Fraction of human gene to mapped GO term annotations recovered (sensitivity) at each p-value threshold. 77% of gene-GO annotations are recovered in MeSHOPs with p-value scores of 0.05 or less, indicating that most gene-GO annotations are very strongly associated in the MeSHOPs for the genes. (B) The average size (in number of MeSH terms) of the gene MeSHOP when filtered against p-value thresholds. 396 GO terms were mapped to MeSH using UMLS.

Figure 2-8. MeSHOP for “Computational Biology”. MeSHOPs were generated for the 50 most recent articles annotated with the MeSH term “Computational Biology” from the year 1999 (A) and the year 2009 (B).
MeSHOPs were computed using the universal background from PubMed Baseline 2010 (covering articles through 2009). The MeSH term for “Computational Biology” was excluded from the MeSHOP.

Figure 2-9. Change in Significance of Biomedical Terms for “Computational Biology” over Time. The p-values for the terms “Biochemistry” and “Information Services” and their association to “Computational Biology” over time. For each time point, a MeSHOP using the most recent 50 articles for that year was generated to obtain the p-values for the terms.

Figure 2-10. MeSHOP for “Stem Cells”. MeSHOP generated taking the 50 most recent articles annotated with the MeSH term “Stem Cells” from the year 1985 (A), the year 2000 (B) and the year 2009 (C). MeSHOPs were computed using the universal background from PubMed Baseline 2010. The MeSH term for “Stem Cells” was removed to highlight the other terms present.

Figure 2-11. Clustering Vitamin MeSHOPs. (A) Co-occurrence of Vitamins through MeSHOPs. Each row represents the MeSHOP for a particular Vitamin. Each column in a row plots the p-value for the MeSH term of the column for the MeSHOP of the row. P-values were computed using the universal Baseline 2010 background. P-values were plotted as a heatmap where red indicates low p-values and green indicates high p-values. The dendrogram was constructed using hierarchical clustering in R. (B) Vitamins clustered through similarity of MeSHOPs. The MeSHOPs for the Vitamins were compared using Euclidean distance of the log of the p-values for overlapping terms, and the similarity measures were plotted in a heatmap. The vitamins were clustered by their similarity scores, and the dendrogram for the hierarchical clustering plotted on the y-axis. Red indicates low p-values and green indicates high p-values in the heatmap. For comparison, the dendrogram from (A) was plotted on the x-axis.

Figure 2-12. Signaling Pathway Gene Co-occurrence with Brain Disease Annotation.
(A) The MeSHOPs for signaling pathway genes from the Notch, Wnt and Hh pathways (columns) were plotted, showing the p-values for their associated Brain Disease MeSH terms (rows). The MeSHOPs were computed using the universal Baseline 2010 background. Hierarchical clustering was performed on each axis, and in the heatmap, red indicates low p-values and green indicates high p-values. (B) The same data as (A) but with the axes swapped to show more detail of the genes. The MeSHOPs for signaling pathway genes from the Notch, Wnt and Hh pathways (rows) were plotted, showing the p-values for their associated Brain Disease MeSH terms (columns). The MeSHOPs were computed using the universal Baseline 2010 background. Hierarchical clustering was performed on each axis, and in the heatmap, red indicates low p-values and green indicates high p-values. (C) A subset of the genes from (B) is shown here. Each gene is labeled by its Entrez Gene ID, followed by Wnt, Hh and/or Notch indicating its presence in the respective KEGG pathway.

| Dataset | Measure | February 2007 | January 2009 | April 2010 |
| Entrez Gene | Total Genes | 2 460 748 | 4 710 910 | 5 999 558 |
| Entrez Gene | Human Genes | 38 604 | 40 183 | 45 423 |
| PubMed | Release | Baseline 2007 (Nov 2006) | Baseline 2009 (Nov 2008) | Baseline 2010 (Nov 2009) |
| PubMed | Total Articles | 16 120 073 | 17 764 232 | 18 502 915 |
| gene2pubmed (Linking Entrez Gene and PubMed) | Total Links | 3 081 413 | 12 960 489 | 5 979 167 |
| gene2pubmed (Linking Entrez Gene and PubMed) | Total Human Gene Links | 272 123 | 445 650 | 527 821 |

Table 2-1. Datasets Used in the Analysis with Details on Size and Relevant Contents. Although the number of human genes has not increased much over the years, the number of non-human links has increased substantially since 2007, while the human gene links have increased at a more moderate rate. Previously, PubMed links from genomic sequence were propagated to all related genes.
This practice was discontinued in March 2009, resulting (at the time) in a 60% decrease in links and the disparity in the number of overall links from 2009 to 2010.

| | A2M articles | Remainder of PubMed articles | Total |
| Articles referring to Alzheimer Disease | 8 | 39 265 | 39 273 |
| Articles without Alzheimer Disease reference | 73 | 16 080 727 | 16 080 800 |
| Total | 81 | 16 119 992 | 16 120 073 |

Table 2-2. Analysis of Over-representation of the MeSH Term Alzheimer Disease in the 31 Articles Linked via GeneRIF to the Gene A2M (alpha-2-macroglobulin, Entrez Gene ID 2). The raw p-value computed from this table using Fisher’s exact test is 1.45E-11, and after Bonferroni multiple testing correction for 25 183 genes, the p-value remains significant at 3.65E-07, indicating a strong research focus of A2M in the field of Alzheimer Disease in existing biomedical literature.

Chapter 3: Inferring Novel Gene-Disease Associations Using Medical Subject Heading Over-representation Profiles

Synopsis

Background: MEDLINE®/PubMed® currently indexes over 21 million biomedical articles, providing unprecedented opportunity and challenges for text analysis. Using Medical Subject Heading Over-representation Profiles (MeSHOPs), an entity of interest can be robustly summarized, quantitatively identifying associated biomedical terms and predicting novel indirect associations.

Methods: A procedure is introduced for quantitative comparison of MeSHOPs derived from a group of PubMed articles for a biomedical topic (e.g. articles for a specific gene or disease). Similarity scores are used to compare MeSHOPs of genes and diseases.

Results: Similarity scores successfully infer novel associations between diseases and genes. The number of papers addressing a gene or disease has a strong influence on predicted associations, revealing an important bias for gene-disease relationship prediction.
Predictions derived from comparisons of MeSHOPs achieve up to 16% improvement in the identification of gene-disease relationships compared to gene or disease baseline rates.

Conclusion: MeSHOP comparisons are demonstrated to provide predictive capacity for novel relationships between genes and human diseases. We demonstrate the impact of literature bias on the performance of gene-disease prediction methods. MeSHOPs provide a usable form of annotation to facilitate relationship discovery in biomedical informatics.

Background

A key focus of genomic medicine is the identification of relationships between phenotype and genotype. Genome-wide association studies and exome/genome sequencing can reveal hundreds of candidate genes that may contribute to human disease. Given such a set of candidate genes, the prioritization of these genes for functional validation emerges as a key challenge in biomedical informatics (Makrythanasis & Antonarakis, 2011). Much focus has been placed upon the development of methods for the quantitative association of genes with disease (Nicki Tiffin, Andrade-Navarro, & Perez-Iratxeta, 2009).

Across biomedical research fields, scientific publications are the currency of knowledge. One near-universal tool of life scientists to access this “bibliome” is the MEDLINE®/PubMed® bibliographic database of the U.S. National Library of Medicine, an actively maintained central repository for biomedical literature references (Sayers et al., 2009). Over 21 million citations have been indexed by MEDLINE®/PubMed®, at a modern rate exceeding 600,000 articles per year. Researchers face increasing difficulty navigating the growing body of published information in search of novel hypotheses. Encapsulating the bibliome for a disease or gene of interest in a form both understandable and informative is an increasingly important challenge in biomedical informatics (Hirschman, Hayes, & Valencia, 2007a; Jensen et al., 2006).
MEDLINE®/PubMed® provides data structures and curated annotations to assist scientists with the challenge of extracting pertinent articles from the bibliome of a biomedical entity. In an ongoing process, curators at the National Library of Medicine identify key topics addressed in each publication and attach corresponding Medical Subject Headings (MeSH) (Nelson et al., 2001) terms as annotations to each publication’s record in MEDLINE®/PubMed®, covering over 97% of all PubMed-indexed citations. The National Center for Biotechnology Information (NCBI) PubMed portal utilizes the annotated MeSH terms to empower search of the citation database, extending the reach of users beyond naïve word matching to topic matching. As one of the pantheon of NCBI resources, PubMed citations are further linked to gene entries in Entrez Gene where appropriate, with over 450 000 PubMed citations linked to an Entrez Gene entry for a human gene.

The analysis of gene annotation properties and gene-related literature is a core challenge within computational biology. Biomedical keywords for properties of genes, drawn from structured vocabularies, have been identified from unstructured gene annotations (Grossmann et al., 2007; Prüfer et al., 2007), as well as directly from the primary literature (Bundschus, Dejori, Stetter, Tresp, & Kriegel, 2008; Nakazato et al., 2007, 2009). Sets of descriptive terms can be visualized as “tag clouds” (Good, Kawas, Kuo, & Wilkinson, 2006; Indra Neil Sarkar et al., 2009). Comparison of gene annotation profiles can group genes, expanding protein-protein interaction and phenotype networks, deriving regulatory networks and predicting other gene-gene relationships (Kim, Park, & Drake, 2007; Lage et al., 2007; S. Li, Wu, & Zhang, 2006; Loscalzo, Kohane, & Barabasi, 2007; Perez-Iratxeta et al., 2007; Rodríguez-Penagos, Salgado, Martínez-Flores, & Collado-Vides, 2007).
Annotation analysis enables prioritization of candidate genes in genetics studies (Bundschus et al., 2008; Gaulton et al., 2007; Yu et al., 2008) and, when integrated with other information sources, predicts novel properties of genes (Aerts et al., 2006; Chen, Xu, Aronow, & Jegga, 2007). Existing tools and techniques demonstrate the value, and suggest the high potential impact, of annotation analysis. Significant research opportunities remain to improve annotation and annotation-based analysis methods.

The development of computational disease information resources has run parallel to the aforementioned gene-based efforts. Controlled vocabularies for medical descriptions (Olivier Bodenreider, 2004; Cornet & de Keizer, 2008) and disease-specific annotations (Hamosh, Scott, Amberger, Bocchini, & McKusick, 2005; Osborne et al., 2009) are emerging to facilitate medical information systems. Within MEDLINE®/PubMed®, a disease division of the Medical Subject Headings has been developed over 50 years, providing an extensive inventory of medical disorders. By 2011, over 4494 MeSH disease terms had been established.

Key to accelerating the identification of gene-disease relationships is the development of systematic approaches to quantitatively represent bibliometric information and infer functionally important relationships between entities. We have previously introduced MeSH Over-representation Profiles (MeSHOPs) as a convenient tool for constructing quantitative annotations for sets of papers in MEDLINE®/PubMed® where each paper refers to the same entity (such as a gene or a disease) (Cheung, Ouellette, & Wasserman, 2012). To demonstrate the fidelity of the MeSHOP knowledge representation at measuring features important for prediction, we generate the MeSHOPs for human genes and diseases and compare these MeSHOPs to predict novel associations.
Predictive performance for gene-disease relationships is validated against co-occurrence in future publications and curated databases. Comparing MeSHOPs is demonstrated to be an effective way to identify novel relationships between genes and diseases.

Results

Generation of MeSHOPs

Disease and Gene MeSHOPs provide a concise quantitative representation of the biomedical knowledge associated with an entity (Figure 3-1). For this study, two large classes of entities were analyzed – the human genes in Entrez Gene and the diseases specified formally within MeSH. MeSHOPs were generated for the classes disease and human gene by assessing the set of all linked MEDLINE®/PubMed® records for each entity.

All human genes present in Entrez Gene were considered (38 604 in Entrez Gene 2007). Two sources for gene-article linkages from Entrez Gene were evaluated: Gene Reference Into Function (GeneRIF) and gene2pubmed. GeneRIF is a curated set of links provided by annotators at the NLM and public submissions, where each set of PubMed articles refers to a described function of the gene. gene2pubmed is a set of links to PubMed articles relating to the gene, generally broader in scope than GeneRIFs. GeneRIFs link 11 750 human genes to 142 396 articles. gene2pubmed links 26 510 human genes to 226 615 articles. The two MeSHOP gene collections are analyzed separately in the subsequent sections.

Disease MeSHOPs were generated directly from MEDLINE®/PubMed® via the curator-assigned MeSH disease terms. To generate MeSHOPs for diseases, all terms from the disease category – MeSH Category C – were used; a set composed of 4 229 terms in MeSH 2007 linking to over 8 million articles.

Quantitative Comparison of Gene and Disease MeSHOPs for Prediction of Future Co-Occurrence in Research Publications

We hypothesize that a disease is likely to be associated with a gene if the disease MeSHOP is highly similar to the gene MeSHOP.
For example, a disease with a functional relationship to a gene may share MeSH terms between profiles, such as localization, metabolic pathways, cellular processes and symptoms, even if no links between the gene and the disease have been previously reported in the literature. When many biomedical terms are common between two profiles, the likelihood for a future association between the entities profiled is expected to increase.

In the subsequent sections gene-disease relationship predictions using MeSHOPs are validated against gene-disease co-occurrences that appear in subsequent MEDLINE®/PubMed® releases (i.e. using data not represented in the MeSHOPs). A validated prediction means the first article referring to both the gene and the disease was published during a subsequent time period (as reported in the future 2009 or 2010 MEDLINE®/PubMed® release). Two overlapping validation sets (2007-2009 and 2007-2010) were extracted: (i) 95 845 novel gene-disease co-occurrences for gene-article mappings from gene2pubmed for 2007-2009; (ii) 183 407 novel gene-disease co-occurrences for mappings from gene2pubmed for 2007-2010; (iii) 95 085 novel gene-disease co-occurrences for gene-article mappings from GeneRIF for 2007-2009; and (iv) 169 723 novel gene-disease co-occurrences for mappings from GeneRIF for 2007-2010. This approach is similar to the validation scheme presented in (Yetisgen-Yildiz & Pratt, 2009).

Using these validation sets, we evaluate scoring methods by computing the Receiver Operating Characteristic (ROC) curve for predictions from analysis of the baseline 2007 data and reporting the Area Under the ROC Curve (AUC). MeSHOP comparisons are defined as predictions of future disease-gene co-occurrence if a similarity score exceeds an applied threshold. To calculate the ROC curve, we classify the novel gene-disease co-occurrences appearing in the future gene MeSHOPs as “true positives”, and all other gene-disease pairings as “true negatives”.
An ideal prediction method will produce an AUC score of 1, while random predictions are expected to generate an AUC score of 0.5.

Gene and Disease Predictive Bibliometric Baselines

There is little quantitative information about baseline performance against which to compare gene-disease association prediction methods. Intrinsic characteristics of genes were investigated for predictive ability of future gene-disease term co-occurrence (see Table 3-3). For these baseline controls, scores were obtained from the quantitative characteristics of each gene. These scores represent gene-specific properties and do not account for disease properties. All disease-gene pairs were ranked based solely on the indicated characteristic of the gene in the pair, and the AUC scores calculated (see Figure 3-2). Gene-specific characteristics evaluated were: percentage of G/C mononucleotide content of the primary RefSeq transcript, total number of associated cDNA sequences reported in Entrez Gene, RefSeq transcript length, genomic length (from the annotated Ensembl gene/transcript start to end) and the Entrez Gene identification (ID) numbers. All data were extracted using BioMart (April 2009).

The following features produced random AUC scores (~ 0.5): GC content, number of transcripts, transcript length and genomic length (Table 3-3). Strikingly, Entrez Gene ID is predictive of a gene’s likelihood to be linked to disease, with genes having lower Entrez Gene IDs more likely to co-occur with a disease in future publications (AUC ranging from 0.64 to 0.78). Entrez Gene IDs reflect no direct biological feature of the gene itself, but are sequentially assigned as genes are added to the database, indirectly measuring the length of time the gene has been studied.
Therefore, the publication date of the oldest publication, estimating the length of publication history, and the number of publications, estimating the breadth of publication history, were examined for each gene using the Entrez Gene Feb 2007 dataset (See Table 3-4). The AUC for the oldest publication for each gene exhibits higher predictive performance than the Entrez Gene ID number (AUC of 0.66 to 0.80), and the AUC for the number of publications is the highest of all gene-related characteristics observed (AUC of 0.73 to 0.85). Correlation of Entrez Gene ID to a richer and older publication history was reported by Leong and Kipling (Leong & Kipling, 2009). As the number of publications for a gene is correlated to the number of MeSH terms in the corresponding gene MeSHOP, it is not surprising that high AUC scores were obtained for MeSH term counts (See Table 3-4 as well as Figure 3-2).

As observed for gene-only score ranking, disease-only score rankings can be non-random. The MeSH term counts for the disease MeSHOPs were predictive for future gene-disease co-occurrence in the literature (AUC from 0.76 to 0.90) (See Table 3-5 and Table 3-6). Across both gene and disease entities and across all validation sets, an entity that is highly annotated is substantially more likely to co-occur with another entity in future publications.

MeSHOP Similarity Measures

Quantitative comparison of gene and disease MeSHOPs improves prediction of future gene-disease co-occurrence over the baseline features (based strictly on only genes or disease) established above. Sixteen distinct similarity measures were evaluated using AUC scores, from counting measures such as term overlap and term coverage to calculated measures such as Euclidean (L2) and cosine distance of p-value profiles (See Table 3-8). The scores evaluate the shared characteristics from both the gene and the disease MeSHOPs to make predictions.
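To make the kinds of measures named above concrete, the following is a minimal sketch of three of them, assuming each MeSHOP is held as a mapping from MeSH term to p-value. The function names, the `floor` guard against p-values of zero, the convention that a smaller distance means a closer match, and the example p-values are all illustrative assumptions rather than the definitions used in this work (see Table 3-8 for those).

```python
import math

def term_overlap(gene, disease):
    """Counting measure: number of MeSH terms present in both profiles."""
    return len(gene.keys() & disease.keys())

def l2_log_p(gene, disease, floor=1e-300):
    """Euclidean (L2) distance between log p-values over shared terms only.

    `floor` is an assumption that keeps log() defined when a p-value is 0.
    """
    shared = gene.keys() & disease.keys()
    return math.sqrt(sum(
        (math.log(max(gene[t], floor)) - math.log(max(disease[t], floor))) ** 2
        for t in shared
    ))

def cosine_distance(gene, disease):
    """One possible cosine distance between the two p-value profiles,
    treating a term missing from a profile as contributing 0."""
    shared = gene.keys() & disease.keys()
    dot = sum(gene[t] * disease[t] for t in shared)
    norm_g = math.sqrt(sum(v * v for v in gene.values()))
    norm_d = math.sqrt(sum(v * v for v in disease.values()))
    if norm_g == 0 or norm_d == 0:
        return 1.0
    return 1.0 - dot / (norm_g * norm_d)

# Hypothetical p-values, for illustration only.
gene_profile = {"Alzheimer Disease": 1e-11, "Amyloid beta-Peptides": 1e-6, "Humans": 0.2}
disease_profile = {"Alzheimer Disease": 1e-9, "Amyloid beta-Peptides": 1e-8, "Brain": 1e-4}

print(term_overlap(gene_profile, disease_profile))
print(l2_log_p(gene_profile, disease_profile))
print(cosine_distance(gene_profile, disease_profile))
```

Working on log p-values keeps very small p-values from collapsing to indistinguishable magnitudes, which is one plausible reason a log-scale distance separates profiles better than a raw-scale one.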
Three previously assessed baselines are presented for comparison: Entrez Gene ID, the number of terms in the gene MeSHOP, and the number of terms in the disease MeSHOP. The MeSHOP prediction scores produced AUCs ranging from random at 0.51 to a nearly optimal AUC of 0.99, depending on the measure and the validation set (see Table 3-5 and Table 3-6 for the AUC results of each score under each validation set). Each individual score was consistent across multiple validation sets and the GeneRIF or gene2pubmed article links, with the relative rank of the scores remaining nearly identical. Although scores such as Term Overlap and Term Coverage (mean AUC of 0.87) have high scores compared to random, these are only on par with the best baseline scores (see Figure 3-3 and Table 3-9). The most effective similarity score is the L2 of log-p of overlapping terms only:

\[ \sqrt{\sum_{i \in G \cap D} \left( \log gp(i) - \log dp(i) \right)^{2}} \]

(G and D refer to the MeSH terms of gene and disease MeSHOPs respectively, gp(i) and dp(i) refer to the p-value for the MeSH term i of the gene or disease profile respectively), which generates a mean AUC of 0.94 (See Table 3-9 and Figure 3-4). Although bibliometric baseline scores – number of article links for a gene, number of MeSH terms in the gene MeSHOP and number of terms in the disease MeSHOP – are predictive of a future paper that refers to the gene and a disease, a distinct improvement in prediction is achieved by comparing gene and disease MeSHOPs using this L2 score, which will be used for MeSHOP comparisons going forward.

Mean Test Rank

As an alternative assessment to AUC scores, one can assess a score’s ability to correctly rank a list of candidate genes. For a particular disease and validation set, a list of n genes (e.g. n=200 genes) is constructed – one random disease-associated gene and n-1 random non-associated genes. The list of genes is ranked by the comparison score, and the test repeated.
In the case of a perfect metric, the mean test rank for the disease-associated gene would be 1, and in the case of completely random predictions, the mean rank would be n/2. For test lists of 200 candidate genes, the top four MeSHOP comparison scores have Mean Test Ranks from 12 to 20, nearly all ranking on average within the top 10% of the list. To compare, the Mean Test Rank for scoring by the number of gene MeSH terms is 39 and scoring using Gene ID is 59 (See Table 3-9).

Predicting Association to Disease

Co-occurrence of gene and disease references in the same article does not confirm a functional relationship between the gene and the disease; such co-occurrence could be observed for studies in which a gene-disease relationship is found to be false or not significant. To address this issue, the predictive capacity of MeSHOP comparison is evaluated against curated gene-disease relationships from the Comparative Toxicogenomics Database (CTD) (Davis et al., 2011; Wiegers et al., 2009). CTD curators extract relationships for genes identified as biomarkers, therapeutic targets in treatment or playing a role in the etiology of the disease from published literature and the OMIM database. These known gene-disease relationships are taken as the positive associations for ROC curve analysis to assess the MeSHOP predictions.

Performance of the MeSHOP scores on the CTD validation sets is consistent with the performance seen when inferring novel disease terms for gene profiles – bibliometric baselines exhibiting up to AUC 0.85 while the best MeSHOP similarity scores achieve AUCs over 0.9 (see Table 3-4 and Table 3-9, and Figure 3-5). Results confirm the effectiveness of MeSHOP comparison to recover the CTD bona fide gene-disease relationships. AUCs shift by less than 0.08 when compared to the updated CTD April 2010 gene-disease relationship data (Table 3-5 and Table 3-6).
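The two evaluation measures used above, AUC and Mean Test Rank, can be sketched as follows. This is an illustrative outline under stated assumptions, not the evaluation code used for the reported results: the function names, the convention that a higher score means a stronger predicted association, the tie handling, and the toy gene identifiers are all hypothetical.

```python
import random

def roc_auc(scores, labels):
    """AUC via the rank-sum identity; tied scores receive average ranks.

    Assumes a higher score predicts a positive (label 1) and that both
    classes are present.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_test_rank(score, associated, non_associated, n=200, trials=1000, seed=0):
    """Average rank of one associated gene hidden among n-1 non-associated genes.

    Rank 1 is best. Ties are resolved in favour of the associated gene,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        target = rng.choice(associated)
        decoys = rng.sample(non_associated, n - 1)
        target_score = score(target)
        total += 1 + sum(score(g) > target_score for g in decoys)
    return total / trials

# Toy data: identifiers and the scoring function are hypothetical.
associated_genes = ["g1", "g2"]
other_genes = ["n%d" % i for i in range(50)]
perfect = lambda g: 1.0 if g.startswith("g") else 0.0

print(roc_auc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
print(mean_test_rank(perfect, associated_genes, other_genes, n=20, trials=25))
```

For a distance-style score where smaller means more similar, the score can be negated before being passed to either function.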
Comparative Assessment of Predictions with a Literature-based System: Candidate Genes for Alzheimer Disease

To place MeSHOP comparisons in relationship to a top literature-based candidate gene prediction tool, we evaluated predictions for Alzheimer disease-gene relationships by MeSHOP comparison and a leading tool. We took the top 500 gene candidates (top 3% of genes) for Alzheimer disease (AD) identified by MeSHOP comparison and by the Génie system (Fontaine et al. 2011), plotting the relationships between the ranks in Figure 3-6. The top 50 candidate genes are most strongly correlated, overlapping for 32 of the genes (see Table 3-7 for the top 50 candidate genes). Within Table 3-7, the gene candidates previously investigated in the context of AD are indicated (46/50 genes) with the number of articles in gene2pubmed for which both the gene and the AD MeSH term are attached. For Génie, 48 of the top 50 candidates co-occur with the AD MeSH term (not shown); a 49th gene – Notch3 – co-occurs with AD in two abstracts (and thus was detected as a direct association by Génie) but these papers were not curated in the gene2pubmed or GeneRIF sets as Notch3-focused articles. MeSHOP comparison ranked Notch3 in the top 100 candidates for AD, despite the lack of curated co-occurrence. Both systems provide highly relevant lists of genes, with MeSHOP analysis reporting more novel candidates in this particular case study.

Focusing on these novel genes with no pre-existing links in the literature, the two methods both implicate the HTT gene, which is known to be the causative gene for the neurodegenerative disorder Huntington Disease. MeSHOP comparison also ranks the XRCC3 gene highly; XRCC3 is a DNA repair gene that could be involved in apoptosis and neuronal cell death (both of which are mechanisms associated to AD in the literature). The most striking candidates identified may be the F2 and the F5 genes, which are involved in the blood coagulation pathway.
The widely studied AD-related beta-amyloid protein has been shown to interact with fibrinogen, linking abnormalities in coagulation to the pathology of AD in recent papers (Cortes-Canteli et al. 2010; Ahn et al. 2010).

Application to Diabetes Association Study

To further illustrate the utility of the MeSHOP comparison method, we apply MeSHOP comparisons to predict gene-disease pairs arising in a genome-wide association study (GWAS) of diabetes. In 2007, Sladek et al. (Sladek et al., 2007) reported a GWAS identifying novel risk loci for type 2 diabetes in a French cohort. Comparing the reported genes to the MeSHOP profiles (see Table 3-10), TCF7L2 (Entrez Gene ID 6934) already had eight articles linking it to Type 2 Diabetes and hence a significant association was detected (Bonferroni corrected p=0.018 / raw p=1.3e-7). As well, IDE (Entrez Gene ID 3416) had a weaker established link in four articles (Bonferroni corrected p=0.50 / raw p=3.15e-6). No other genes emerging from the report had an established relationship to Type 2 Diabetes. The MeSHOP predictions using publications preceding the GWAS report ranked HHEX (Entrez Gene ID 3087) in the top 19% of genes linked to Diabetes Mellitus, Type 2 (MeSH D003924). The link to HHEX was supported by a subsequent study (Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, 2007). By 2009, almost all of the genes from the study had been investigated for potential links in the literature (Diabetes Mellitus, MIM 125853), except ALX4. This demonstrates that MeSHOPs both identify known gene-disease linkages and corroborate the evidence of a GWAS.

Application to Pancreatic Cancer Study

To show that MeSHOP comparison ranks not only support but also supplement existing analysis, we examine a study by Jones et al. combining sequenced RNA transcripts from protein-coding genes with microarray-based detection of homozygous deletions and amplifications in pancreatic cancer (Jones et al., 2008).
Using the list of candidate genes from the Jones study with at least two genetic alterations identified (n=83), we rank the genes most highly associated to “Pancreatic Neoplasms” via MeSHOP comparison. This ranked list is compared to the statistical model used by Jones et al. to differentiate causal genes from passenger genes. In Table 3-1, we list the genes from the Jones 83 gene set which are also among the top 5% of all human genes as ranked by MeSHOPs for association to pancreatic cancer (full list presented in Table 3-11). Five of the predicted causal genes are in the top six genes in the list, supporting the statistical analysis from a purely bibliometric perspective. The remaining gene in the top six, EP300, has recently been shown to be downregulated by miRNA in highly metastatic pancreatic ductal adenocarcinomas (Mees et al., 2009), demonstrating the ability of MeSHOP comparison to find candidate genes missed by the statistical analysis. Therefore, the MeSHOP comparison provides a bibliometric view that both reinforces and complements existing analytical methods.

Discussion

Quantitative annotation profiles based on MeSH annotations, MeSHOPs, are shown to facilitate the identification of gene-disease relationships. In assessing the baseline properties of gene-disease relationship predictions, we observe a striking bias introduced by the level of annotation of the entities (gene and/or disease), such that simply predicting future gene-disease relationships based on the most studied genes (or diseases) is better than random guessing. Accounting for this bias, we demonstrate that comparison metrics using MeSHOPs have high capacity to predict gene-disease co-occurrence in future research publications. Functional relationships between genes and diseases were predicted using reference collections, and shown to perform better than baselines based on genes or diseases alone.
Overall, MeSHOP comparison is shown to be a useful tool for applied bioinformatics.

Strong performance of bibliometric baselines quantitatively indicates researchers may tend to explore additional relationships for existing well-characterised genes and diseases, echoing the imbalanced research activity seen by Agarwal and Searls (Agarwal & Searls, 2009). On the other hand, this may rather reflect methodological biases emphasizing easier to characterise genes and diseases. Well-studied genes have pre-existing protocols and materials such as animal models and PCR primers. Well-studied diseases may be more commonly and reliably identified through better-established diagnostic methods and physician familiarity. As well, the direction of gene and disease research is driven by funding practices, which are also slow to change.

Rather than bias, the trends uncovered may reflect properties of a subset of genes and diseases. Certain types of genes and diseases are involved in key processes, similar to multifunctional proteins in interaction networks (Gillis & Pavlidis, 2011). A “hub” gene may be involved in many pathways, and could cause many phenotypes when disrupted. Similarly, some phenotypes may actually be the result of many different molecular processes, each of which, when misregulated due to a gene, can cause different variations of the disease phenotype. As well, there are not just the causative genes for a disease; many other genes may regulate the severity or provide protective immunity against the disease phenotype.

Regardless of origin of these observed predictive biases, we strongly recommend that all future gene-disease prediction methods be contrasted to gene and disease bibliometric baseline characteristics – ideally against the strongest metrics which evaluate the degree of annotation (the number of MeSH terms in the MeSHOPs for the gene and disease).
Bibliometric baseline comparison allows direct comparative assessment of the predictive ability of methods compared to these universal trends. While bibliometric biases are more immediately apparent in text mining and text analysis-based methods, it remains important to compare all methods to these baselines to ensure that the results are not simply reproducing this effect, just as the “hub gene” effect can be seen to influence interaction predictions. The bibliometric bias effect may also prove to be important in many other prediction applications – as we see in Chapter 4, there is a similar effect when we look at the literature associated with a drug when performing drug-disease predictions. This effect may be important to consider and control for in any prediction application when the annotation level of an entity can be measured.

Previous work demonstrated that gene length, cDNA length and protein length significantly differ between control and disease genes (Adie et al., 2005; López-Bigas & Ouzounis, 2004). Our literature-based analysis shows neither genomic length nor transcript length have significant predictive ability in our current validation sets, suggesting these previous biases are no longer predictive of future gene-disease association. Advances in methodology such as high-resolution microarrays and sequencing may have removed the influence of the bias, suggesting that literature bias favouring well-studied genes may correct itself as more genes become better characterised.

Comparison to Other MeSH-related Methods

The related but different method of CoPub Discovery (Frijters et al., 2010) seeks to identify hidden links between genes and diseases through shared keywords in MEDLINE abstracts. They assess predictions using historical entries from before 2000, identifying genes and keywords from entries using text mining. For their comparison scores, they employ a straightforward sum of the minimum score of the shared MeSH terms.
In contrast, our predictions use the larger corpus of PubMed up to 2007, and our MeSHOP-based method builds on curated MeSH terms and Entrez Gene article links, enabling a broad range of applications. We also evaluate several measures of MeSH term association strength for generating MeSHOPs, and many different scores for comparing gene and disease MeSHOPs.

Srinivasan (Srinivasan, 2004) extracts MeSH terms of importance to summarise a set of articles related to an entity, and considers common MeSH terms between profiles as potential paths connecting two entities. MeSHOPs use a statistical scoring method to compute p-values for the profiles; we further evaluate a large number of different methods for generating and comparing MeSHOPs, analyzing all terms between profiles computationally. Sarkar et al. (Indra Neil Sarkar et al., 2009) use weighted profiles of MeSH terms and visualize the terms as a MeSH cloud to summarise a collection of documents retrieved from MEDLINE and to facilitate further investigation of related articles in MEDLINE.

MeSHOPs share conceptual similarities with the method of CAESAR (Gaulton et al., 2007). CAESAR scores the occurrences of extracted keyword terms in an authoritative text that summarises the topic of interest. MeSHOPs use all relevant articles, each with individual associated MeSH biomedical terms, reflecting both the main directions of research and associated topics.

Future Directions

The use of MeSHOPs to infer novel associations need not be limited to the attachment of disease terms to genes and vice versa. This methodology could be expanded to the attachment of any subset of MeSH terms to biomedical topics of interest. Furthermore, MeSH is just one source of biomedical term annotations for PubMed articles. Biases in the human annotation of MeSH terms could be reduced, and more comprehensive coverage achieved, by merging these results with other sources of biomedical annotation.
MeSHOPs could be explored for gene-disease associations in species other than human: preliminary analyses predicting mouse genes associated with MeSH disease terms have achieved similar performance. Human disease gene prioritization has been shown to improve through incorporation of mouse phenotype data (Chen et al., 2007), suggesting that orthology data could be used to improve predictions. New candidates for complex diseases could also be evaluated through their similarity to known genes related to the disease of interest, as seen in an analysis by Taniya et al. for rheumatoid arthritis and prostate cancer using several other sources of gene annotation (Taniya et al., 2011).

MeSH does not classify certain Mental Disorders under Category C, “Diseases”; some mental disorders, such as Autism or Traumatic Stress Disorders, are therefore found only under Category F, “Psychiatry and Psychology”. This particular case study looked at predicting association of genes to one category of MeSH; however, given a greater amount of computation time, there is no reason the analysis could not be expanded to predict association to any MeSH term.

Methods

MeSHOP Generation for Genes and Diseases

The construction of MeSHOPs was previously described (Cheung et al., 2012), but is summarized here for the convenience of readers. A MeSHOP is a quantitative representation of the MeSH annotations associated with a set of articles, where the set is composed of articles that address a specific entity (such as a gene or disease). The computation of a MeSHOP starts from such a set of articles. Each article has a curator-assigned set of MeSH terms available in MEDLINE®/PubMed®. Comparing the observed frequency of each MeSH term annotated to the set of articles against the background rate for that term gives a measure of over-representation. A MeSHOP is a vector of tuples < (t1, m1), (t2, m2), … (tn, mn) >.
For each tuple (ti, mi) in a MeSHOP, ti is a distinct MeSH term in the MeSH vocabulary and mi is the numeric measure of the strength of association of the MeSH term ti to the set of articles (e.g. the over-representation measure). To account for the tree structure of MeSH, an article associated with a MeSH term is also considered associated with all of the parent terms of that term.

Several scoring metrics have been implemented to report quantitatively the strength of association between an entity and a MeSH term. A basic measure is the raw count of articles annotated with each term. To address the degree of annotation, such counts can be normalized by dividing the raw count by the total number of term annotations for the particular gene or disease. Such counting methods fail to account for statistical significance: the frequency with which each term appears in MEDLINE®/PubMed® should be taken into account. To address this deficiency, p-values can be computed based on a hypergeometric distribution via Fisher’s Exact Test. A universal background of MeSH term frequency is applied in this case, derived from a set of 17 million MEDLINE®/PubMed® articles assigned MeSH terms (see Table 3-2 for more details).

Inferring Novel Gene-Disease Association

To infer entity-MeSH annotation relationships, we hypothesize that a previously unassociated MeSH term t is likely to be associated with an entity e if the MeSHOP Pt for the MeSH term t is highly similar to the entity’s MeSHOP Pe. The similarity was scored with the panel of formulae presented in Table 3-8. These similarity scores can be seen as measuring the degree of co-citation, where the gene and disease may not be directly linked but are both linked to the same MeSH terms.

Receiver operating characteristic (ROC) curves were computed for each of the similarity scores evaluated. The area under the curve (AUC) was measured to assess the accuracy of the scoring metrics.
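
As a rough illustration of the AUC evaluation described here, the following Python sketch computes the AUC as the fraction of positive-negative pairs in which the positive scores higher, counting ties as half. It also applies the AUC-to-mean-rank relation used in this section, 1 + (1 - AUC)(n - 1). The function names and the example scores are hypothetical and are not taken from the thesis code; the sketch also assumes that a higher score means a stronger predicted association, so a distance-based score would need to be negated first.

```python
from itertools import product


def roc_auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, matching the diagonal segments of a ROC
    curve. Assumes higher scores mean stronger predicted association.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def mean_rank(auc, n):
    """Expected rank of a positive among n candidates: 1 + (1 - AUC)(n - 1)."""
    return 1 + (1 - auc) * (n - 1)


if __name__ == "__main__":
    # Hypothetical scores for one disease, for illustration only.
    positives = [0.9, 0.8]
    negatives = [0.7, 0.8, 0.1]
    auc = roc_auc(positives, negatives)
    print(f"AUC = {auc:.3f}, mean rank among 200 = {mean_rank(auc, 200):.2f}")
```

The pairwise loop is quadratic in the number of candidates; it is written for clarity, and a rank-based computation would be preferable for genome-scale candidate lists.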
In the case where there are no ties, the ROC curve is composed of horizontal and vertical sections; in the case of ties, diagonal sections also occur. AUC values can be converted to mean rankings by noting that the AUC reports the mean probability that, for a random disease, given a random positive gene and a random negative gene, the positive gene is scored higher than the negative gene. The ranking of the positive is the result of n-1 Bernoulli trials, where the positive is compared to each of the negatives. Each “failure” in this case causes the rank to drop by 1. The average rank is given by 1 + (1 - AUC)(n - 1).

MEDLINE®/PubMed® Data

The annual MEDLINE®/PubMed® Baseline releases 2007, 2009 and 2010 were used as the source of MeSH annotations for articles. All gene-disease co-occurrences (i.e. the gene and the disease directly linked to the same article) were extracted for each release.

Curated Gene-Disease Relationships

Two sets of curated gene-disease relationships were extracted from the Comparative Toxicogenomics Database (CTD). The first set was all gene-disease tuples involving MeSH disease terms downloaded Nov. 2008: 3685 tuples for the gene2pubmed-based MeSHOPs, and 3227 tuples for the GeneRIF-based MeSHOPs. A second dataset was all gene-disease tuples, involving MeSH disease terms, added between Nov. 2008 and Apr. 2010: 1836 tuples for the gene2pubmed-based MeSHOPs and 1672 tuples for the GeneRIF-based MeSHOPs.

Implementation

The analysis was performed using Python (http://www.python.org/), XSLT (http://www.w3.org/TR/xslt), and the MySQL database system (http://www.mysql.com/). Fisher’s Exact Test p-values were computed using the R statistics package (http://www.r-project.org/). Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine (http://gridengine.sunsource.net/). A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM.
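
The p-values in this work were computed in R. Purely as an illustration of the underlying calculation, the Python sketch below evaluates the upper tail of the hypergeometric distribution, which is equivalent to a one-sided Fisher's Exact Test for over-representation. Whether the reported p-values were one-sided or two-sided is not stated here, so the output should not be expected to reproduce the reported values exactly. The function names are hypothetical; the counts are those of the A2M and Alzheimer Disease contingency table (Table 3-2), and the Bonferroni factor of 25 183 is taken from that table's caption.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overrepresentation_p(k, n, K, N):
    """One-sided p-value P(X >= k) under a hypergeometric null.

    k: entity articles annotated with the term
    n: all entity articles
    K: background articles annotated with the term
    N: all background articles
    """
    lowest = max(k, 0, n - (N - K))  # smallest feasible count that is >= k
    highest = min(n, K)              # largest feasible count
    log_total = log_choose(N, n)
    tail = sum(
        exp(log_choose(K, i) + log_choose(N - K, n - i) - log_total)
        for i in range(lowest, highest + 1)
    )
    return min(1.0, tail)


def bonferroni(p, tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * tests)


if __name__ == "__main__":
    # Counts from Table 3-2: 8 of 81 A2M articles and 39 273 of
    # 16 120 073 MEDLINE/PubMed articles refer to Alzheimer Disease.
    p = overrepresentation_p(k=8, n=81, K=39273, N=16120073)
    print(f"raw p = {p:.3g}, corrected p = {bonferroni(p, 25183):.3g}")
```

Working in log space with `lgamma` avoids overflow in the binomial coefficients, which would otherwise be astronomically large for a background of millions of articles.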
Datasets were downloaded from Entrez Gene (ftp://ftp.ncbi.nlm.nih.gov/gene/) and MEDLINE®/PubMed® (http://www.nlm.nih.gov/databases/leased.html). The Comparative Toxicogenomics Database validation set was taken from the gene-disease relationships dataset (http://ctd.mdibl.org/downloads/). See Table 3-12 for details of the size and contents of the datasets.

Availability and Implementation: Results are freely accessible to the public on the web at http://meshop.oicr.on.ca/meshop/. Source code implemented in Python is available at http://github.com/wac/meshop/ (gene and disease profile analysis) and http://github.com/wac/cmpmeshop/ (evaluation and validation of results).

Conclusions

MeSHOPs quantitatively represent the MeSH biomedical terms associated with any defined entity with an identified set of MEDLINE®/PubMed®-indexed papers. Results demonstrate that MeSHOP similarity can infer functional annotation of genes and diseases. Specifically, the similarity between gene MeSHOPs and disease MeSHOPs is highly predictive of future gene-disease ties. Although bibliometric characteristics, such as the number of terms in the disease MeSHOP, are predictive of gene-disease association, our best predictions, using Euclidean distance of log-p of overlapping terms, achieve a mean AUC of 0.94, a 7% accuracy improvement over the strongest baseline. The consistency of the results over five validation sets duplicated over two sources of gene-article links demonstrates that the predictive performance of our methods is stable and replicable. Beyond the prediction of annotation, MeSHOP comparison predicts genes with functional roles in disease processes, validated using curated gene-disease relationships in CTD and in case studies.

Figure 3-1. Comparing Gene and Disease MeSHOPs. A graphical representation of the comparison of the MeSH over-representation profiles for the human gene Pax6 and the disease Aniridia.
We show the most strongly associated terms for each profile as a word cloud, scaling the size of each term with the degree of association. Blue lines link shared terms between the profiles; the similarity scores quantitatively evaluate the difference between the profiles by comparing all shared terms between profiles.

Figure 3-2. Comparison of Performance of Gene Characteristics. Receiver Operating Characteristic curves are shown comparing predictive gene characteristics. Characteristics are computed from a 2007 Entrez Gene dataset and the MEDLINE®/PubMed® Baseline 2007, predicting all new disease terms associated to gene MeSHOPs between February 2007 and April 2010. The number of articles for the gene and the number of MeSH terms for the gene are very similar, with a Pearson correlation of 0.79 and a Spearman correlation of 0.97.

Figure 3-3. Comparison of Mean MeSHOP Performance Scores. Histogram compares the mean ROC AUC of the scoring methods (See Table 3-9). [Bar chart titled “Summary of MeSHOP Performance”; horizontal axis: Scoring Method; vertical axis: Mean ROC AUC, from 0 to 1.]

Figure 3-4. Comparing the Performance of Similarity Scores to Gene Characteristics. Receiver Operating Characteristic curves for the L2 of log-p of overlapping terms Gene-Disease Profile Comparison Score, compared against curves for Gene ID, the number of terms in the gene MeSHOP and the number of terms in the disease MeSHOP.

Figure 3-5. Comparing the Performance of Similarity Scores.
ROC curves are shown with AUC, computed for the top five similarity metrics and the disease number of MeSH terms baseline. These scores demonstrate predictions of gene-disease relationships using February 2007 data validated against the Comparative Toxicogenomics Database (11/2008) dataset.

Figure 3-6. Comparison of the Top 500 Gene Predictions for Alzheimer Disease from Génie and MeSHOP Similarity. The 215 genes ranked in the top 500 gene predictions for both Génie and MeSHOP Similarity are compared, showing a correlation of 0.38. 79 of the genes ranked in the top 500 by Génie did not have MeSHOPs and therefore did not have a computed MeSHOP similarity score to rank.

Table 3-1. Comparison of MeSHOP Results for Pancreatic Cancer Candidate Genes. This table contrasts our MeSHOP similarity scores with data from (Jones et al., 2008), Supplementary Table S7, where they estimate the probability that the nucleotide changes observed are caused by pancreatic cancer (passenger probability rate). This table lists all genes identified with similarity (via the L2 of log-p of overlapping terms only metric) in the top 5%. Passenger probability rates < 0.05 are italicized.

Gene      Entrez Gene ID  MeSHOP Similarity Score  Rank  Percentile  Mutations  Deletions  Passenger Probability (Low Rates)
TP53      7157            1.24E+08                 11    100         18         2          <0.001
CDKN2A    1029            8.29E+07                 135   99          2          16         <0.001
KRAS      3845            6.95E+07                 266   99          24         0          <0.001
TGFBR2    7048            6.76E+07                 288   99          3          1          <0.001
EP300     2033            6.37E+07                 351   99          2          0          0.176
SMAD4     4089            6.14E+07                 386   98          8          6          <0.001
ELN       2006            5.57E+07                 509   98          2          0          0.115
F8        2157            5.51E+07                 530   98          2          0          0.165
SCN5A     6331            5.18E+07                 629   98          2          0          0.176
PRKCG     5582            4.77E+07                 798   97          2          0          0.115
TPO       7173            4.71E+07                 831   97          2          0          0.115
PPP1R3A   5506            4.50E+07                 946   96          2          0          0.115
SMARCA4   6597            4.04E+07                 1243  95          2          0          0.062

Table 3-2. Analysis of Over-representation of the MeSH Term Alzheimer Disease in the 31 Articles Linked via GeneRIF to the Gene A2M (Entrez Gene ID 2).
The raw p-value computed using Fisher’s exact test is 1.45E-11, and after Bonferroni multiple testing correction for 25 183 genes, the p-value remains significant at 3.65E-07, indicating a strong research focus of A2M in the field of Alzheimer Disease in existing biomedical literature.

                                              A2M articles  Remainder of MEDLINE®/PubMed® articles  Total
Articles referring to Alzheimer Disease       8             39 265                                  39 273
Articles without Alzheimer Disease reference  73            16 080 727                              16 080 800
Total                                         81            16 119 992                              16 120 073

Table 3-3. Performance of Gene Characteristics at Predicting Association with Disease. Characteristics were compared against the 02/2007-11/2008 validation sets using gene2pubmed and GeneRIF gene references, as well as the 11/2008 CTD validation set.

                       gene2pubmed                                            GeneRIF
Scoring Method         Validation         Validation         CTD Validation   Validation         Validation         CTD Validation
                       (02/2007-01/2009)  (02/2007-04/2010)  (11/2008)        (02/2007-01/2009)  (02/2007-04/2010)  (11/2008)
% GC content           0.50               0.50               0.51             0.50               0.50               0.51
Number of Transcripts  0.53               0.53               0.55             0.51               0.51               0.53
Transcript Length      0.51               0.52               0.50             0.52               0.52               0.53
Genomic Length         0.52               0.52               0.50             0.51               0.51               0.52
Gene ID                0.73               0.71               0.78             0.64               0.63               0.69

Table 3-4. Comparison of the Performance of Gene ID to Gene-related Scores in MEDLINE®/PubMed®. The oldest publication for a gene has comparable performance to Gene ID, as measured by the AUC, however, the number of publications for a gene proves to be even more predictive than the Gene ID.
                           gene2pubmed                                            GeneRIF
Feature                    Validation AUC     Validation AUC     CTD Validation   Validation AUC     Validation AUC     CTD Validation
                           (02/2007-01/2009)  (02/2007-04/2010)  (11/2008)        (02/2007-01/2008)  (02/2007-04/2010)  (11/2008)
Number of MeSH Terms       0.74               0.73               0.81             0.80               0.85               0.82
Number of Publications     0.75               0.73               0.80             0.80               0.85               0.82
Oldest Publication (Year)  0.67               0.66               0.73             0.73               0.76               0.73
Gene ID                    0.64               0.64               0.66             0.69               0.75               0.73

Table 3-5. Performance Using GeneRIF as the Gene-Literature Data Source. Area under ROC of the described scoring methods were compared and tested on the validation sets.

Scoring Method                                                Validation AUC     Validation AUC     CTD AUC    CTD AUC            Training AUC  Mean  Rank
                                                              (02/2007-01/2009)  (02/2007-04/2010)  (11/2008)  (11/2008-04/2010)  (02/2007)
Cosine Distance of Term Frequency-Inverse Document Frequency  0.90               0.89               0.93       0.91               0.98          0.92  2
Cosine Distance of p-values                                   0.56               0.57               0.60       0.56               0.53          0.56  15
Cosine Distance of term fractions                             0.86               0.84               0.91       0.88               0.96          0.89  4
Sum of the log of combined p-values                           0.86               0.85               0.92       0.90               0.94          0.90  3
Sum of the differences of log p values                        0.91               0.91               0.77       0.83               0.93          0.87  6
L2 of log-p of overlapping terms only                         0.94               0.93               0.91       0.92               0.98          0.94  1
L2 of term fractions of overlapping terms only                0.56               0.55               0.55       0.56               0.51          0.55  16
L2 of log of p-values                                         0.90               0.90               0.76       0.83               0.93          0.86  9
L2 of p-values                                                0.90               0.90               0.76       0.81               0.92          0.86  11
L2 of term fractions                                          0.86               0.85               0.89       0.88               0.94          0.88  5
L2 of term frequency                                          0.90               0.90               0.76       0.83               0.93          0.86  10
Term Coverage                                                 0.91               0.90               0.77       0.83               0.93          0.87  7
Term Overlap                                                  0.82               0.82               0.86       0.86               0.87          0.85  12
Number of Gene MeSH Terms                                     0.74               0.73               0.80       0.80               0.81          0.78  13
Number of Disease MeSH Terms                                  0.90               0.90               0.77       0.83               0.93          0.87  8
Gene ID                                                       0.64               0.64               0.69       0.69               0.66          0.66  14

Table 3-6. Performance Using gene2pubmed as the Gene-Literature Data Source. Area under ROC of the described scoring methods were compared and tested on the validation sets.

Scoring Method                                                Validation AUC     Validation AUC     CTD AUC    CTD AUC            Training AUC  Mean  Rank
                                                              (02/2007-01/2009)  (02/2007-04/2010)  (11/2008)  (11/2008-04/2010)  (02/2007)
Cosine Distance of Term Frequency-Inverse Document Frequency  0.92               0.91               0.95       0.93               0.98          0.94  2
Cosine Distance of p-values                                   0.53               0.51               0.65       0.63               0.53          0.57  16
Cosine Distance of term fractions                             0.90               0.89               0.93       0.91               0.96          0.92  5
Sum of the log of combined p-values                           0.91               0.89               0.94       0.94               0.94          0.92  3
Sum of the differences of log p values                        0.91               0.91               0.77       0.83               0.93          0.87  7
L2 of log-p of overlapping terms only                         0.96               0.95               0.92       0.94               0.99          0.95  1
L2 of term fractions of overlapping terms only                0.64               0.62               0.57       0.60               0.53          0.59  15
L2 of log of p-values                                         0.90               0.90               0.76       0.83               0.93          0.86  10
L2 of p-values                                                0.89               0.89               0.75       0.81               0.92          0.86  12
L2 of term fractions                                          0.92               0.90               0.91       0.92               0.95          0.92  4
L2 of term frequency                                          0.90               0.90               0.76       0.82               0.93          0.86  11
Term Coverage                                                 0.90               0.91               0.77       0.83               0.93          0.87  8
Term Overlap                                                  0.91               0.89               0.90       0.92               0.90          0.90  6
Number of Gene MeSH Terms                                     0.85               0.82               0.85       0.88               0.83          0.85  13
Number of Disease MeSH Terms                                  0.90               0.90               0.76       0.83               0.93          0.86  9
Gene ID                                                       0.75               0.73               0.78       0.79               0.74          0.76  14

Table 3-7. Top 50 Alzheimer Disease Candidate Genes by MeSHOP Similarity. Genes are ranked by MeSHOP similarity score, and compared against the ranked list of Génie candidate genes for Alzheimer Disease (a full analysis considering all possible orthologs).
Also provided is a list of the number of articles related to Alzheimer Disease in the gene2pubmed references for the gene, when present. Highlighted rows indicate high-ranking predictions that have no prior association to Alzheimer Disease in the literature.  Gene ID  Rank  Gene Name  Score  Alzheimer Disease gene2pubmed references  Génie Rank  1  348 APOE  1.18E+04  4  812  2  351 APP  1.22E+04  2  576  3  4137 MAPT  1.23E+04  1  211  4  5663 PSEN1  1.27E+04  3  249  5  6622 SNCA  1.27E+04  6  30  6  627 BDNF  1.28E+04  9  47  7  1312 COMT  1.29E+04  87  10  8  1401 CRP  1.29E+04  210  5  9  6532 SLC6A4  1.30E+04  43  23  10  3064 HTT  1.30E+04  44  11  5444 PON1  1.30E+04  204  16  12  1813 DRD2  1.30E+04  114  1  13  4846 NOS3  1.30E+04  118  18  14  23621 BACE1  1.30E+04  5  86  15  2950 GSTP1  1.30E+04  470  4  16  5621 PRNP  1.31E+04  12  28  17  5054 SERPINE1 1.31E+04  18  1636 ACE  1.31E+04  19  2952 GSTT1  1.31E+04  20  5071 PARK2  1.31E+04  13  6  21 120892 LRRK2  1.31E+04  18  6  #N/A  #N/A  3 32  #N/A  45 3  85  22  3553 IL1B  1.31E+04  39  32  23  4023 LPL  1.31E+04  172  7  24  6647 SOD1  1.31E+04  36  6  25  3356 HTR2A  1.31E+04  121  16  26  10 NAT2  1.31E+04  333  4  27  7515 XRCC1  1.31E+04  #N/A  2  28  2944 GSTM1  1.31E+04  #N/A  3  29  3552 IL1A  1.31E+04  30  36  30  3569 IL6  1.32E+04  60  28  31  5664 PSEN2  1.32E+04  7  78  32  6648 SOD2  1.32E+04  131  4  33  2153 F5  1.32E+04  #N/A  1.32E+04  #N/A  1 2  34  338 APOB  #N/A  35  7421 VDR  1.32E+04  #N/A  36  2147 F2  1.32E+04  #N/A  1.32E+04  #N/A  2  37  183 AGT  #N/A  38  1543 CYP1A1  1.32E+04  #N/A  1  39  154 ADRB2  1.32E+04  #N/A  1  40  4524 MTHFR  1.32E+04  57  30  41  1071 CETP  1.32E+04  197  8  42  3557 IL1RN  1.32E+04  278  7  43  4318 MMP9  1.32E+04  219  5  44  1565 CYP2D6  1.32E+04  238  9  45  335 APOA1  1.32E+04  135  7  46  7517 XRCC3  1.32E+04  #N/A  47  3990 LIPC  1.32E+04  #N/A  #N/A 2 86  48  4153 MBL2  1.32E+04  49  23435 TARDBP  1.32E+04  50  345 APOC3  1.32E+04  
#N/A  1 10  #N/A  10 2

Table 3-8. Explanation of the Scoring Functions Evaluated. M refers to the set of all MeSH terms, G and D refer to the MeSH terms for the gene and disease profile respectively. g(i), gf(i), gp(i) and gi(i) refer to the frequency, term fraction, hypergeometric p-value and term frequency-inverse document frequency for the MeSH term i of the gene profile. d(i), df(i), dp(i) and di(i) refer to the frequency, term fraction, hypergeometric p-value and term frequency-inverse document frequency for the MeSH term i of the disease profile.

Scoring Method                                                Description
Cosine Distance of Term Frequency-Inverse Document Frequency
Cosine Distance of p-values
Cosine Distance of term fractions
Sum of the log of combined p-values
Sum of the differences of log p values
L2 of log-p of overlapping terms only
L2 of term fractions of overlapping terms only
L2 of log of p-values
L2 of p-values
L2 of term fractions
L2 of term frequency
Term Coverage                                                 GD
Term Overlap                                                  GD
Number of Gene MeSH Terms                                     G
Number of Disease MeSH Terms                                  D
Gene ID                                                       Entrez Gene ID of the gene.

Table 3-9. Summary of MeSHOP Performance. The AUC mean, standard deviation and ranking for the MeSHOP scores and the gene and disease baselines are described, over all validation sets and both GeneRIF and gene2pubmed reference sets.
Scoring Method                                                Mean AUC  AUC Standard Error  Mean Test Rank (n=200)  Overall Rank
Cosine Distance of Term Frequency-Inverse Document Frequency  0.93      0.03                15.03                   2
Cosine Distance of p-values                                   0.57      0.05                87.25                   16
Cosine Distance of term fractions                             0.90      0.04                20.21                   4
Sum of the log of combined p-values                           0.91      0.03                18.88                   3
Sum of the differences of log p values                        0.87      0.06                26.97                   7
L2 of log-p of overlapping terms only                         0.94      0.03                12.06                   1
L2 of term fractions of overlapping terms only                0.57      0.04                86.70                   15
L2 of log of p-values                                         0.86      0.07                28.05                   10
L2 of p-values                                                0.86      0.07                29.62                   12
L2 of term fractions                                          0.90      0.03                20.39                   5
L2 of term frequency                                          0.86      0.06                28.31                   11
Term Coverage                                                 0.87      0.06                27.14                   8
Term Overlap                                                  0.87      0.03                26.17                   6
Number of Gene MeSH Terms                                     0.81      0.05                38.69                   13
Gene ID                                                       0.71      0.06                58.78                   14

Table 3-10. Summary of Diabetes Loci Ranked by MeSHOP Similarity. Loci identified by (Sladek, et al., 2007) were ranked by MeSHOP similarity (L2 of log-p of overlapping terms only). Direct Association scores are the Bonferroni corrected p-values generated using the Feb-2007 datasets.

Locus      Entrez Gene ID  Predicted Similarity Score  Rank  Percentile  Direct Association
IDE        3416            7.59E+07                    186   0.01        7.93E-02
TCF7L2     6934            5.91E+07                    421   0.02        3.30E-03
EXT2       2132            2.96E+07                    2616  0.10        N/A
HHEX       3087            2.18E+07                    4631  0.18        N/A
KIF11      3832            1.87E+07                    5985  0.24        N/A
ALX4       60529           1.55E+07                    8313  0.33        N/A
SLC30A8    169026          1.55E+07                    8352  0.33        N/A
LOC387761  387761          N/A                         N/A   N/A         N/A

Table 3-11. Comparison of MeSHOP Results for Pancreatic Cancer Candidate Genes. This table shows all genes from (Jones, et al., 2008), Supplementary Table S7, listing by strength of MeSHOP similarity score (via the L2 of log-p of overlapping terms only metric).
Gene           Entrez Gene  Predicted Similarity  Rank   Percentile  Mutations  Deletions  Passenger Probability  Passenger Probability  Passenger Probability
                                                                                           Low Rates              Mid Rates              High Rates
TP53           7157         1.24E+08              11     100         18         2          <0.001                 <0.001                 <0.001
CDKN2A         1029         8.29E+07              135    99          2          16         <0.001                 <0.001                 <0.001
KRAS           3845         6.95E+07              266    99          24         0          <0.001                 <0.001                 <0.001
TGFBR2         7048         6.76E+07              288    99          3          1          <0.001                 0.001                  0.003
EP300          2033         6.37E+07              351    99          2          0          0.176                  0.482                  0.984
SMAD4          4089         6.14E+07              386    98          8          6          <0.001                 <0.001                 <0.001
ELN            2006         5.57E+07              509    98          2          0          0.115                  0.372                  0.413
F8             2157         5.51E+07              530    98          2          0          0.165                  0.482                  0.853
SCN5A          6331         5.18E+07              629    98          2          0          0.176                  0.482                  1.000
PRKCG          5582         4.77E+07              798    97          2          0          0.115                  0.372                  0.413
TPO            7173         4.71E+07              831    97          2          0          0.115                  0.375                  0.694
PPP1R3A        5506         4.50E+07              946    96          2          0          0.115                  0.477                  0.694
SMARCA4        6597         4.04E+07              1243   95          2          0          0.062                  0.183                  0.413
COL5A1         1289         3.72E+07              1518   94          2          0          0.176                  0.482                  0.984
MEP1A          4224         3.38E+07              1895   92          2          0          0.062                  0.183                  0.413
IL2RG          3561         2.95E+07              2652   89          1          0          0.004                  0.016                  0.997
ATP10A         57194        2.77E+07              2974   88          2          0          0.176                  0.482                  1.000
MYH2           4620         2.71E+07              3063   88          2          0          0.165                  0.477                  0.853
GRIA3          2892         2.62E+07              3281   87          1          1          0.017                  0.069                  0.999
ABCA7          10347        2.56E+07              3426   86          2          0          0.033                  0.139                  0.201
DLG3           1741         2.51E+07              3540   86          1          0          0.003                  0.015                  0.997
DLC1           10395        2.47E+07              3645   86          2          0          0.176                  0.482                  1.000
GLTSCR1        29998        2.06E+07              5082   80          2          0          0.062                  0.183                  0.405
PCSK6          5046         2.02E+07              5240   79          2          0          0.176                  0.482                  0.911
EVPL           2125         2.00E+07              5329   79          2          0          0.176                  0.482                  0.942
NRG2           9542         1.95E+07              5537   78          2          0          0.165                  0.477                  0.853
SLITRK5        26050        1.93E+07              5655   78          2          0          0.165                  0.477                  0.853
SEMA5B         54437        1.92E+07              5713   77          2          0          0.062                  0.183                  0.413
DPP6           1804         1.86E+07              6025   76          3          0          0.009                  0.079                  0.201
PCDH15         65217        1.84E+07              6162   76          4          0          <0.001                 0.017                  0.048
FMN2           56776        1.82E+07              6266   75          2          0          0.176                  0.482                  0.911
CACNA2D1       781          1.77E+07              6597   74          1          0          0.001                  0.004                  0.989
DLEC1          9940         1.70E+07              7039   72          2          0          0.176                  0.482                  0.911
MLL3           58508        1.69E+07              7090   72          6          0          <0.001                 <0.001                 <0.001
PB1            55193        1.63E+07              7597   70          2          0          0.165                  0.477                  0.853
LRRN3          54674        1.60E+07              7856   69          2          0          0.062                  0.183                  0.405
CYFIP1         23191        1.56E+07              8225   67          3          0          0.009                  0.079                  0.201
SF3B1          23451        1.55E+07              8290   67          3          0          0.009                  0.079                  0.201
PXDN           7837         1.55E+07              8302   67          2          0          0.176                  0.482                  1.000
TNR            7143         1.54E+07              8453   66          2          0          0.176                  0.482                  0.911
SN             6614         1.53E+07              8484   66          2          0          0.176                  0.482                  1.000
SLC6A15        55117        1.53E+07              8488   66          2          0          0.062                  0.183                  0.405
ARID1A         8289         1.51E+07              8688   66          2          0          0.176                  0.482                  0.984
SLC1A6         6511         1.48E+07              8908   65          2          0          0.115                  0.477                  0.694
LRRTM4         80059        1.46E+07              9064   64          2          0          0.062                  0.183                  0.413
GALNT13        114805       1.42E+07              9651   62          2          0          0.062                  0.183                  0.405
GUCY1A2        2977         1.39E+07              9964   60          2          0          0.062                  0.183                  0.405
ZNF638         27332        1.37E+07              10174  60          2          0          0.115                  0.375                  0.694
PDZRN3         23024        1.33E+07              10522  58          2          0          0.033                  0.082                  0.201
DOCK2          1794         1.33E+07              10612  58          2          0          0.062                  0.183                  0.405
MIZF           25988        1.32E+07              10714  58          2          0          0.062                  0.183                  0.405
DACH2          117154       1.30E+07              10883  57          1          1          0.022                  0.088                  1.000
ST6GAL2        84620        1.26E+07              11302  55          2          0          0.115                  0.375                  0.694
KBTBD11        9920         1.19E+07              12083  52          1          1          0.006                  0.025                  0.998
CNTN5          53942        1.18E+07              12231  51          2          0          0.115                  0.375                  0.694
ABLIM2         84448        1.17E+07              12471  51          2          0          0.062                  0.183                  0.405
PCDH18         54510        1.14E+07              12864  49          2          0          0.115                  0.375                  0.694
ADAMTS20       80070        1.09E+07              13668  46          2          0          0.176                  0.482                  0.911
CDH10          1008         1.09E+07              13703  46          3          0          <0.001                 0.017                  0.048
KIAA1024       23251        1.09E+07              13715  46          2          0          0.115                  0.375                  0.694
TBX18          9096         1.08E+07              13821  45          2          0          0.062                  0.183                  0.413
LRFN5          145581       1.07E+07              13894  45          2          0          0.062                  0.183                  0.405
DEPDC2         80243        1.07E+07              13953  45          3          0          0.055                  0.183                  0.405
FMNL3          91010        1.05E+07              14376  43          2          0          0.055                  0.179                  0.405
TM7SF4         81501        1.03E+07              14681  42          2          0          0.055                  0.179                  0.405
OR10R2         343406       1.02E+07              15126  40          2          0          0.033                  0.139                  0.317
GPR133         283383       1.02E+07              15188  40          2          0          0.062                  0.183                  0.405
PCDH17         27253        1.01E+07              15355  39          2          0          0.062                  0.183                  0.405
BAI3           577          9.58E+06              16457  35          3          0          0.033                  0.082                  0.201
KIAA0774       23281        9.49E+06              16660  34          2          0          0.176                  0.482                  0.984
CTNNA2         1496         9.42E+06              16781  33          3          0          0.033                  0.179                  0.405
KLHDC4         54758        8.66E+06              18571  26          2          0          0.033                  0.082                  0.201
ZAN            7455         8.45E+06              19030  25          2          0          0.176                  0.482                  0.984
DKFZP586P0123  26005        7.38E+06              20579  18          2          0          0.165                  0.477                  0.853
UNC13C         440279       7.38E+06              20835  17          2          0          0.115                  0.372                  0.694
FLJ39155       133584       7.38E+06              21333  15          2          0          0.176                  0.482                  0.942
RASSF6         166824       7.38E+06              21543  15          2          0          0.062                  0.183                  0.405
OVCH1          341350       5.79E+06              24923  1           2          0          0.165                  0.477                  0.853
Q9H5F0_HUMAN                No MeSHOP available                      3          0          <0.001                 0.004                  0.009
Q9H8A7_HUMAN                No MeSHOP available                      2          0          0.165                  0.477                  0.853
FLJ46481       389197       No MeSHOP available                      2          0          0.062                  0.183                  0.405
XR_017918.1                 No MeSHOP available                      2          0          0.062                  0.183                  0.405
LOC441136      441136       No MeSHOP available                      2          0          0.009                  0.079                  0.201

Table 3-12. Datasets Used in the Analysis with Details on Size and Relevant Contents. Although the number of human genes has not increased much over the years, the number of non-human links has increased substantially since 2007, while the human gene links have increased at a more moderate rate.
Dataset                                                              February 2007             January 2009              April 2010
Entrez Gene                                  Total Genes             2 460 748                 4 710 910                 5 999 558
                                             Human Genes             38 604                    40 183                    45 423
MEDLINE®/PubMed®                             Release                 Baseline 2007 (Nov 2006)  Baseline 2009 (Nov 2008)  Baseline 2010 (Nov 2009)
                                             Total Articles          16 120 073                17 764 232                18 502 915
gene2pubmed (Linking Entrez Gene             Total Links             3 081 413                 12 960 489                5 979 167
and MEDLINE®/PubMed®)                        Total Human Gene Links  272 123                   445 650                   527 821

Chapter 4: Separation of Literature Annotation Effects from Topic Similarity in Medical Subject Heading Over-representation Profiles (MeSHOPs) for Drugs and Diseases

Synopsis

Medical Subject Heading Over-representation Profiles (MeSHOPs) quantitatively summarise the literature associated with biological entities such as diseases or drugs. A profile is constructed by counting the number of times each MeSH term is assigned to an entity-related research publication in the MEDLINE/PubMed database and calculating the significance of the count relative to a background expectation. Based on the expectation that drugs suitable for treatment of a disease (or disease symptom) will have similar annotation properties to the disease, we successfully predict drug-disease associations by comparing MeSHOPs of diseases and drugs. The MeSHOP comparison approach delivers an 11% improvement over bibliometric baselines. However, a significant bias in novel drug-disease associations was observed towards drugs and diseases with more publications. To account for the annotation biases, a correction procedure is introduced and evaluated. By explicitly accounting for the annotation bias, unexpectedly similar drug-disease pairs are highlighted as candidates for drug repositioning research.
Introduction
Using previously studied pharmaceutical compounds and applying them to novel diseases or phenotypes, so-called ‘drug repositioning’, has emerged as a key issue in biomedical research (Ashburn & Thor, 2004; Joel T Dudley, Deshpande, & Butte, 2011). The cost of developing a new chemical or molecular entity with proven therapeutic benefit and established safety was estimated at over $1.8 billion in 2010 and continues to rise rapidly (Paul et al., 2010). Therefore, using compounds with a known biochemical mechanism of action and an established safety record for new purposes is an alternative to the high cost of de novo compound research (Deftereos, Andronis, Friedla, Persidis, & Persidis, 2011). Advances in drug repositioning research have identified potential treatments for Crohn’s disease (J. T. Dudley et al., 2011; Sirota et al., 2011a), and have raised hopes for advances in the treatment of rare, orphan disorders (Sardana et al., 2011).
Informatics-based approaches to drug repositioning are exemplified by the identification of known drug targets in genes arising in genome-wide association studies (Sanseau et al., 2012), the prediction of structural suitability of a known compound for a new protein target (Kinnings et al., 2009; Y. Y. Li, An, & Jones, 2006), systems biology using gene expression patterns (J. T. Dudley et al., 2011; Sirota et al., 2011b), and the study of side effects (Yang & Agarwal, 2011). Underlying many of these informatics approaches has been the availability of reference databases containing information about the relationships between genes, drugs and diseases, such as DrugBank (Wishart et al., 2008), the Pharmacogenomics Knowledge Base, and the Comparative Toxicogenomics Database (Davis et al., 2010; Hewett et al., 2002; Klein et al., 2001). The broader informatics approaches to drug repositioning have been recently reviewed (Joel T Dudley et al., 2011).
Advances in literature and text analysis methods offer a promising path to drug repositioning based on established knowledge. Text analysis methods have been applied to FDA package inserts in the SIDER database (Kuhn, Campillos, Letunic, Jensen, & Bork, 2010) to identify side effects, to the comparison of word utilization between drug-related and disease-related abstracts (Frijters et al., 2010; D R Swanson, 1990), and to the analysis of similarity between gene ontology process annotations assigned to a known drug target and genes in disease-associated pathways. Literature-based drug repositioning has been reviewed (Andronis, Sharma, Virvilis, Deftereos, & Persidis, 2011; Plake & Schroeder, 2011). Underlying any text-based analysis is the organization and properties of the text within an accessible database. The central information source for biomedical literature is the MEDLINE®/PubMed® database, encompassing over 20 million articles in 2012. MEDLINE/PubMed provides a citation resource tailored to biomedical researchers, globally accessible at no charge. This comprehensive database of medically relevant citations is curated at the National Library of Medicine, where domain experts index each article with topics from the controlled vocabulary of Medical Subject Headings (Nelson et al., 2001). MeSH terms include medically relevant categories such as Anatomy, Disease, Chemical Compounds and Psychiatric Disorders. In addition to the topics in the main MeSH hierarchy, chemical compounds (including pharmacologic compounds) are included in a Supplementary MeSH vocabulary. As the wealth of raw literature knowledge increases, evaluating and navigating the entirety of this knowledge becomes progressively more challenging.
We introduced Medical Subject Heading Over-representation Profiles (MeSHOPs) as a convenient quantitative representation of the properties enriched in a bibliography of scientific literature from MEDLINE®/PubMed®. MeSHOPs succinctly describe the most highly associated MeSH terms for an entity of interest. The quantitative comparison of MeSHOPs has been demonstrated to allow the predictive inference of entity-entity relationships in a study of relationships between genes and diseases (see Chapter 3). However, we observe that the magnitude of research literature introduces a strong bias into the study of entity-entity relationships, with the most popular genes more likely to be linked to diseases in the future, and vice versa. In this report we investigate the capacity of MeSHOP comparisons to detect functional relationships between pharmaceutical compounds and diseases, with an emphasis on the ranking of candidates for drug repositioning research. We demonstrate that MeSHOPs capture the properties of drugs, and that such information can be compared to disease MeSHOPs to reveal functionally relevant relationships. It is important to be aware of biases and trends in research that may influence the results of text analysis, and to correct for these biases to better direct research efforts (Edwards et al., 2011; Fedorov, Müller, & Knapp, 2010). Entities with limited associated literature, such as some rare diseases, are shown to have disproportionate scores in initial MeSHOP comparisons. To account for existing annotation levels of drug and disease entities and identify MeSHOP similarity, we measure the annotation strength for drug and disease entities and incorporate this prior information into the scoring of prediction strength.
Using this improved comparison metric we demonstrate that drug and disease MeSHOP comparisons are improved, as validated by the identification of novel associations observed in future publications and against a curated reference collection.
Results
Generation of Drug MeSH Over-representation Profiles (MeSHOPs)
MeSHOPs provide a quantitative overview of the biomedical knowledge associated with an entity of interest through the indexed biomedical terms. Following the described methods, MeSHOPs for all indexed diseases and drugs in PubMed were generated using archived PubMed data up until 2007. A drug MeSHOP is presented for acetaminophen (Figure 4-1), and a disease MeSHOP is presented for Aniridia (Figure 4-2). The scores within MeSHOPs are influenced by the background correction for the expectation of MeSH term frequency. If one takes the background rate from all articles in MEDLINE/PubMed, MeSH terms preferentially associated with drugs, such as ‘pharmaceutical preparation’, are likely to be emphasized in the drug MeSHOPs. The strong scores for such drug-related terms can be corrected for by using class-specific backgrounds, such as the subset of articles that address one or more drugs. For comparisons of MeSHOPs across categories, as will follow, we select the universal background as a common background for all entities being compared.
Predicting Drug-Disease Associations
We examine the utility of drug-disease MeSHOP similarity scores for the prediction of drug-disease co-annotation in future publications. To compare to past performance observed in MeSHOP comparisons, 16 similarity scoring metrics were assessed (see the similar analysis in Chapter 3). Two bibliometric baselines, the amount of drug annotation (i.e. the number of MeSH terms linked to the drug) and the amount of disease annotation (i.e. the number of MeSH terms linked to the disease), were included to assess the effect of annotation bias on predictions.
Table 4-1 demonstrates that comparison of drug and disease MeSHOPs predicts future drug-disease co-occurrence in subsequent years (2007-2011). The most effective similarity score is the Euclidean distance of log-p of overlapping terms only (see Methods), which produces an AUC score of 0.95 for the prediction of future co-occurrence in publications. Enthusiasm for the performance is tempered, however, by the fact that the number of MeSH terms associated with a disease, used alone as a prediction ranking, produces an AUC score of 0.84 (and counts for drug-associated MeSH terms produce a score of 0.80). Randomly assigned scores will produce an AUC of 0.5. These results are consistent with a process in which well-studied diseases (or drugs) are more likely to be the subject of future research publications, and therefore more likely to co-occur with drugs (or diseases), than those that have few publications. These scores reflect a systematic limitation in the scoring procedure that needs to be resolved to allow for the identification of drugs suitable for orphan disorders, as well as to produce a more refined list of candidates to pursue.
Annotation Bias Observed for Curated Drug-Disease Relationships
Predicted novel drug-disease relationships may be alternatively assessed against curated reference collections that contain bona fide drug uses (i.e. not just co-occurrence in a paper, but evidence that the drug is used as a treatment for a disorder). We downloaded curated drug-disease relationships reported in the Comparative Toxicogenomics Database (CTD). We matched drugs from the 2011 CTD to the drugs defined in PubMed 2007, and defined a reference collection of 291 novel drug-disease pairs for those entries in CTD that were defined by publications appearing in the period of 2007-2011. The reference collection contains 191 unique drugs and 150 unique diseases.
As seen in Table 4-1, comparing the MeSHOPs of drugs and diseases accurately predicts novel associations, achieving an ROC AUC of 0.93 (for the sum of the log of combined p-values). The Euclidean distance of intersecting terms metric, which performed best in previous MeSHOP comparison performance tests, produces a similar score of 0.92. As displayed in Figure 4-3, well-studied drugs and diseases are over-represented in a substantial fraction of the validation set. Over half of the 191 drugs are in the top 10% of all drugs in terms of amount of associated MeSH annotation (the peak to the left of the histogram). The diseases are only slightly less biased: of the 150 diseases, over half are in the top 15% of diseases in terms of associated MeSH annotation. Consistent with these properties, using the baseline MeSH term counts for drug or disease annotation levels as scores, an ROC AUC of 0.83 is achieved. As for the co-occurrence measure, it is clear that annotation bias is a strong predictor for bona fide interactions.
Controlling for Annotation Bias
The influence of annotation on the MeSHOP comparison scores can be visualized using heatmaps. As seen in Figure 4-4, and fully consistent with the AUC scores above, there is a high degree of correlation between the amount of annotation for the disease (as measured by the number of MeSH terms in the disease profile) and the drug-disease score (Pearson correlation of -0.82). A correlation of -0.33 is observed when comparing drug-disease scores against the degree of drug annotation (see Figure 4-5). For a candidate list for drug repositioning, this annotation bias must be eliminated to allow more rarely studied drugs or diseases to emerge from the analysis as candidates. As described in the Methods, we introduce a corrected scoring procedure for MeSHOP comparisons that assigns a significance to similarity scores based on the distribution of scores for drug-disease tuples with similar annotation levels.
In short, the observed similarity score should be remarkable given the level of annotation of the drug and disease in the tuple. After applying this correction for drug-disease annotation bias, both disease annotation level and drug annotation level have very low correlation to the drug-disease score (0.08 and 0.05 respectively) (see Figure 4-6 and Figure 4-7). The top scoring candidate drug-disease predictions are reported in Table 4-2 and can be browsed online (http://meshop.oicr.on.ca/meshop/browse_dpc_results.html).
Discussion and Related Work
In this report we introduce a new literature-based procedure for the analysis of drug-disease similarity with a focus on the identification of candidates for drug repositioning. Using MeSH Over-representation Profiles (MeSHOPs) as quantitative representatives for biological entities, we seek to identify drugs and diseases with similar annotation, under the expectation that such similarity may be suggestive of potential for repositioning. Drug-disease MeSHOP similarity scores using a panel of metrics are found to be strongly influenced by the level of annotation of drugs and diseases. In short, the most heavily studied diseases and drugs are inappropriately linked by the comparison. A new corrected scoring procedure is introduced to account for the background expectation of similarity scores for comparably annotated drugs and diseases. The new procedure is demonstrated to account for the bias. Application of the MeSHOP similarity scoring procedure reveals a set of candidate drugs for future repositioning research. The assessment of drug repositioning candidate predictions is necessarily problematic. Given the expense of validating drug efficacy, there is no reference collection against which to measure performance directly. In this report we elected to use two reference approaches. First, we predicted future co-occurrence in the research literature.
This measure is indirect, as co-occurrence does not necessarily reflect a functional tie between the drug and disease. Furthermore, this measure is particularly susceptible to annotation influence: well-studied drugs and diseases have a higher rate of future publications and are thus more likely to be linked. Within this report, we observe that the MeSHOP comparisons perform better than simple annotation measures, which indicates that the similarity assessment has value. Furthermore, we were able to identify and correct for the annotation bias influence on the analysis. It is our hope that future annotation-based similarity measures will be evaluated for the biases we observe here. The second reference collection tested was extracted from the CTD, which records bona fide drug-disease links. The performance measurements reflect a similar literature bias on the CTD results, which may reflect a tendency for well-studied drugs to be tested for utility in well-studied disease therapy. The source of the annotation biases identified in the validation sets may lie in methodological bias or be intrinsic to the nature of drug-disease relationships. The case for methodological bias notes the relationship between the existence of experimental protocols and the publication of related research. The study of a disease requires the existence of appropriate animal models, a family with a history of the condition, a large-scale study, or an accurate protocol to diagnose the condition. The rarity and severity of the disease will also change the degree of research interest. Likewise, the study of drugs benefits from animal models, bioassays to detect the compound, the ability and ease to generate the compound, and the ability to deliver an appropriate dosage of the compound to the targets of interest. Other factors motivating research directions are the availability of funding and the focus of existing lab personnel and their research.
However, the bias may also be intrinsic to the nature of the disease or of the drug. Gillis and Pavlidis (2011) have previously observed that multifunctional genes are a strong driver in gene function prediction. They identify gene multifunctionality through protein interaction and coexpression datasets, which encompass previous definitions of the “hub-ness” of a particular gene. A drug may have a more global effectiveness, due to targeting these multifunctional genes or their pathways, and thereby be involved in more drug-disease associations. Similarly, there may be diseases that are involved in key processes and thus present many potential drug targets. Whether the biases are intrinsic to the biology of drugs and diseases, primarily introduced by human factors in the research, or some combination of these will hopefully be revealed by the direction and results of future research. As our knowledge of the nature of drugs and diseases increases and matures, the human elements and methodological biases will become less significant, leaving us to identify the degree to which this bias is due to the biological mechanism and nature of the drugs and diseases. The application of informatics analysis for drug repositioning may reveal specific candidates for study. The use of MeSHOP comparisons to reveal relationships between entities can be extended to enhance the chances for novel insights. The underlying principle motivating the comparison approach is that there will be shared characteristics of the drug actions and disease properties. While the current approach utilizes universal comparisons across all MeSH terms, it may be beneficial to restrict the analysis to a subset of MeSH terms more likely to reflect these shared properties. Development of a procedure to restrict the terms (the features) of MeSHOPs may allow for more specific drug repositioning candidates to emerge in the future.
Future Work
MeSH provides a wide spectrum of medically relevant topics; however, some applications may be better served by a vocabulary with more specific terms in the field of interest. For example, there are only eight terms in MeSH (Akathisia, Drug-Induced; Drug Eruptions; Drug Toxicity; Dyskinesia, Drug-Induced; Epidermal Necrolysis, Toxic; Erythema Nodosum; Serotonin Syndrome; Serum Sickness) relating directly to adverse drug events. Instead, there are several subheadings, including “adverse effects”, “poisoning”, “toxicity” and “contraindications”, which can occur with drug terms, and “chemically induced” and “complications” subheadings, which can occur with adverse outcomes. Expanding the analysis to look specifically for these subheading modifiers could allow us to extract a subset of articles directly relevant to adverse drug reactions for MeSHOP analysis. Alternatively, another source linking side effects to articles could be employed to supplement our existing analysis with side-effect data. Other sources of pharmaceutical literature could be incorporated to improve the analysis. In addition to the existing, on-the-shelf drugs available for repositioning, there are a similar number of drugs that fail Phase II trials due to lack of efficacy (DiMasi, 2001). These drugs, however, are not likely to be broadly described in the literature. Incorporating internal research reports and other literature within a pharmaceutical company would provide literature on a much larger suite of compounds, but would require access to this confidential information. The methods may also need to be adapted if a proprietary internal vocabulary was used, or text mining approaches may be required to annotate this literature. CitationRank (Yang, Xu, & He, 2009) was used to highlight genes involved in adverse drug reactions by analyzing the co-occurrence of genes in articles relating to an adverse drug reaction.
Looking at the comprehensive network of MeSHOP similarity between genes, drugs and diseases would allow a similar network-style analysis, adding the information of the gene entities. Rather than predicting drug-disease associations directly, another application of the method could be to highlight potential links between drugs and mechanisms of action. Drug therapies can be effective even when the understanding of the underlying mechanism of action is incomplete. These predicted drug-mechanism links could also be related back to relevant diseases, helping to generate hypotheses about the biology of a disease and effective mechanisms for treatment. While the correction presented here uses local empirical distributions, further insight into the nature of these distributions could be obtained by attempting to fit them to known distributions. This would be particularly useful in cases involving more extreme levels of annotation: currently we match a ten-percentile range in terms of annotation, which covers a varying range of annotation levels. As well, this could ultimately lead to improved ways of modeling the assignment of annotation in the background, and therefore lead to improved statistics for generating MeSHOPs.
Methods
Pharmacological Substances
In this paper, we examine the set of drugs, defined as all chemical compounds in both the Medical Subject Headings (MeSH) and Supplemental MeSH vocabularies that are also annotated as having a Pharmacologic Action. Since 1996, indexers at the National Library of Medicine have tracked articles where the action of a drug is discussed (MeSH Basics, http://www.nlm.nih.gov/bsd/disted/mesh/paterms.html). In 2003, a MeSH Category “Pharmacologic Action” was created in order to delineate chemical compounds which are used therapeutically as pharmacologic agents. Such annotations are conservatively assigned, requiring a minimum of 20 supporting research articles. We analyze these 6884 drugs with respect to the diseases in the MeSH hierarchy.
Constructing Drug and Disease MeSHOPs
The construction of MeSHOPs, previously described in Chapter 2, is summarized here for the convenience of readers. A MeSHOP is a quantitative representation of the MeSH annotations associated with a set of articles, where the unifying property of the articles is that each addresses the same, specific entity (such as “Acetaminophen”). The computation of a MeSHOP starts from a set of articles that address a specific entity. Each article has a curator-assigned set of MeSH terms available in MEDLINE/PubMed. Comparing the observed frequency of each MeSH term annotated to the set of articles relative to the background rate for each term returns a measure of over-representation (see below for additional details). A MeSHOP is a vector of tuples < (t1, m1), (t2, m2), … (tn, mn) >. For each tuple (ti, mi) in a MeSHOP, ti is a distinct MeSH term in the MeSH vocabulary and mi is the numeric measure of the strength of association of the MeSH term ti to the set of articles (e.g. the over-representation measure). To account for the tree structure of MeSH, for each MeSH term associated with an article, the article is considered associated to all of the parent terms of that MeSH term. After evaluating multiple scoring metrics (see Chapter 3) for the strength of association between an entity (i.e. drug or disease) and a MeSH term, an effective measure was determined to be the log of the p-value reported by Fisher’s Exact Test, based on a hypergeometric distribution of term utilization across a background set of articles. For this report, two background sets are considered. When working within a specific class of entities (e.g. drugs), the background is most appropriately all articles that are associated with one or more members of the entity class. For comparisons between entity classes, a universal background is used.
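As an illustration of the construction just described, the following is a minimal sketch and not the thesis implementation (which computed Fisher's Exact Test p-values in R). It assumes a one-sided upper-tail hypergeometric p-value, a toy in-memory background, and a hypothetical `parents` mapping for the MeSH tree; the function names are invented for this example.

```python
from math import comb, log

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): n entity articles drawn from N background articles,
    K of which carry the term (one-sided test, an assumption here)."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, upper + 1))
    return hits / total

def expand_with_ancestors(terms, parents):
    """Associate an article with every ancestor of each assigned term."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def build_meshop(entity_articles, background, parents):
    """Return {term: (p_value, log_p)} for terms seen in the entity's articles.

    background: dict mapping article id -> set of assigned MeSH terms;
    entity_articles: ids of the background articles that address the entity.
    """
    expanded = {a: expand_with_ancestors(ts, parents) for a, ts in background.items()}
    N, n = len(expanded), len(entity_articles)
    bg_counts, fg_counts = {}, {}
    for terms in expanded.values():
        for t in terms:
            bg_counts[t] = bg_counts.get(t, 0) + 1
    for a in entity_articles:
        for t in expanded[a]:
            fg_counts[t] = fg_counts.get(t, 0) + 1
    profile = {}
    for t, k in fg_counts.items():
        p = hypergeom_upper_tail(k, n, bg_counts[t], N)
        profile[t] = (p, log(p) if p > 0 else float("-inf"))
    return profile

# Toy data (hypothetical): four background articles, two about the entity.
parents = {"Liver Failure": ["Liver Diseases"]}
background = {1: {"Liver Failure"}, 2: {"Liver Diseases"}, 3: {"Pain"}, 4: {"Pain"}}
profile = build_meshop([1, 2], background, parents)
```

In this toy case the parent term "Liver Diseases" is credited to both entity articles through propagation, so it receives a smaller p-value than "Liver Failure". Exact integer arithmetic keeps the sketch dependency-free, but it would be slow at the scale of the full MEDLINE/PubMed background.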
For this study, the universal set contained 17 million MEDLINE/PubMed articles assigned MeSH terms (see Table 3-2 for more details). We consider the 6 512 pharmacologic compounds identified in MeSH 2007 as the drug entities. The 4 229 terms in MeSH 2007 in Category C “Diseases” composed the set of disease entities.
Predicting Drug-Disease Associations
A drug and a responsive disease are anticipated to share common literature annotations, such as metabolic pathways, cellular processes and symptoms, even if no links between the drug and the disease have been previously reported in the literature. To infer novel relationships between a drug and a disease, we perform quantitative pairwise comparisons of MeSHOPs between members of each class. We hypothesize that a previously unassociated disease t is likely to be associated with a drug d if the MeSHOP Pt for the disease t is highly similar to the drug’s MeSHOP Pd. When many biomedical terms are common between two profiles, the likelihood of a future association between the entities profiled is expected to increase. We have previously evaluated pairwise comparison procedures (see Chapter 3) and determined that Euclidean distance effectively detects entity relationships between genes and diseases. However, the study revealed a substantial bias in scores introduced by the number of articles associated with each entity: entities with greater annotation tended to be more similar to other entities.
Correcting for Pre-Existing Literature Annotation
Given the significant impact of annotation bias on pairwise MeSHOP comparison, we introduce a correction of our similarity scores for these pre-existing literature effects. This correction aims to normalize the scores with respect to existing literature annotation, correcting for inherent biases in the scoring methods and revealing associations that are due to the similarity of annotation rather than the amount of annotation (the “popularity” of the entity).
The correction essentially compares each similarity score against a local empirical distribution of similarity scores, taken from all similarity scores involving drugs and diseases having a similar amount of annotation.
Expressed formally, let us consider drug-disease relationships with scores $X_s$, drug annotation levels $X_c$ and disease annotation levels $X_d$, where the annotation level is the number of MeSH terms annotated to articles in PubMed/MEDLINE for the drug or disease. For a given drug $c$ and disease $d$ with drug annotation level $x_c$, disease annotation level $x_d$ and drug-disease score $x_s$, we want to determine the probability that $x_s$ is more extreme than a random drug-disease relationship score with drug annotation level $x_c$ and disease annotation level $x_d$:

$$P\big(x_s \succeq X_s \mid (x_c = X_c) \wedge (x_d = X_d)\big)$$

where $\succeq$ is read as "is at least as extreme as". However, this probability can only be directly computed when the set of drugs and diseases is sufficiently large that there are many drugs and many diseases with the same level of annotation. In order to correct for the previously observed bias, we will seek to adjust the significance based on the local distribution of scores observed for similarly annotated entities:

$$P\big(x_s \succeq X_s \mid (x_c \approx X_c) \wedge (x_d \approx X_d)\big)$$

This can be computed by incorporating the properties of conditional probability as

$$P\big(x_s \succeq X_s \mid (x_c \approx X_c) \wedge (x_d \approx X_d)\big) = \frac{P\big((x_s \succeq X_s) \wedge (x_c \approx X_c) \wedge (x_d \approx X_d)\big)}{P\big((x_c \approx X_c) \wedge (x_d \approx X_d)\big)}$$

As well, since the events $x_c \approx X_c$ and $x_d \approx X_d$ are independent, this can be further simplified to

$$P\big(x_s \succeq X_s \mid (x_c \approx X_c) \wedge (x_d \approx X_d)\big) = \frac{P\big((x_s \succeq X_s) \wedge (x_c \approx X_c) \wedge (x_d \approx X_d)\big)}{P(x_c \approx X_c)\, P(x_d \approx X_d)}$$

We select $P(x_c \approx X_c) = P(x_d \approx X_d) = 0.1$, and specifically compare against the 10% of the drugs that are most similar, annotation level-wise, to the drug in the relationship of interest, and likewise for 10% of the diseases.
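The local empirical comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the thesis code: it assumes scores are distances (so a smaller score counts as more extreme), that "similar annotation" means the nearest 10% of entities by absolute difference in MeSH term count, and that all pairwise scores fit in a dictionary. The function names are hypothetical.

```python
def nearest_by_annotation(levels, target, fraction=0.1):
    """Ids of the `fraction` of entities closest to `target` in annotation level.

    levels: dict mapping entity id -> number of MeSH terms (annotation level).
    The target itself is always included (its difference is zero).
    """
    k = max(1, round(fraction * len(levels)))
    ranked = sorted(levels, key=lambda e: (abs(levels[e] - levels[target]), e))
    return set(ranked[:k])

def corrected_p(drug, disease, scores, drug_levels, disease_levels, fraction=0.1):
    """Fraction of locally comparable drug-disease scores that are at least
    as extreme (here: at least as small, an assumption) as the observed one."""
    drugs = nearest_by_annotation(drug_levels, drug, fraction)
    diseases = nearest_by_annotation(disease_levels, disease, fraction)
    local = [scores[(c, d)] for c in drugs for d in diseases if (c, d) in scores]
    observed = scores[(drug, disease)]
    return sum(s <= observed for s in local) / len(local)

# Toy data (hypothetical): ten drugs and ten diseases with made-up scores.
drug_levels = {f"c{i}": i for i in range(10)}
disease_levels = {f"d{j}": j for j in range(10)}
scores = {(f"c{i}", f"d{j}"): float(10 * i + j) for i in range(10) for j in range(10)}
p_local = corrected_p("c5", "d5", scores, drug_levels, disease_levels, fraction=0.3)
```

Because the neighbourhood is defined by rank rather than by a fixed width of annotation level, it always contains the same share of entities, which mirrors the fixed 10% selection in the text; the range of annotation levels it spans will vary.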
The correction described allows us to separate the effect of the level of annotation for the drug and disease from the similarity of the concepts. It allows the user to distinguish high-scoring drug-disease relationships that are primarily due to the annotation level of the drug or disease concept from high-scoring relationships that arise from significant profile similarity.
Validating Drug-Disease Associations
To evaluate drug-disease associations predicted by MeSHOP similarity, we analyzed the 2007 baseline release of MEDLINE®/PubMed® and measured our predictive performance against annotations appearing in future releases. The annual MEDLINE®/PubMed® Baseline releases 2007 and 2010 were used as the source of MeSH annotations for articles and were obtained directly from the NLM. The drug and disease MeSHOPs, computed for the MEDLINE®/PubMed® Baseline 2007, were compared using the same panel of similarity scores as we applied to predict gene-disease relationships (see Chapter 3 for details). We highlight here that the most effective similarity score of this panel is the Euclidean distance of log-p of overlapping terms only:

$$\sqrt{\sum_{i \in C \cap D} \big(\log c_p(i) - \log d_p(i)\big)^2}$$

($C$ and $D$ refer to the MeSH terms of the drug and disease MeSHOPs respectively; $c_p(i)$ and $d_p(i)$ refer to the p-value for the MeSH term $i$ of the drug or disease profile respectively.) MeSHOP comparisons are defined as predictions of future drug-disease co-occurrence if a similarity score exceeds an applied threshold. Predictions were validated against drug-disease co-occurrences that appeared in the future MEDLINE®/PubMed® releases and had not appeared in articles before 2007. A true positive novel association means an article referring to a previously unconnected drug-disease pair was published in the interim period between the 2007 and 2010 MEDLINE/PubMed Baselines. The Comparative Toxicogenomics Database was used as a source of curated drug-disease relationships.
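For concreteness, the Euclidean distance of log-p over overlapping terms described in this section can be sketched as below. This is a minimal sketch, assuming each profile is a dictionary from MeSH term to p-value; the `floor` parameter is an assumption added here so that p-values reported as 0 do not break the logarithm, and is not described in the text.

```python
from math import log, sqrt

def overlap_log_p_distance(drug_profile, disease_profile, floor=1e-300):
    """Euclidean (L2) distance between log p-values over shared terms only.

    drug_profile, disease_profile: dict mapping MeSH term -> p-value.
    floor: hypothetical lower bound applied before taking logs.
    """
    shared = drug_profile.keys() & disease_profile.keys()
    return sqrt(sum(
        (log(max(drug_profile[t], floor)) - log(max(disease_profile[t], floor))) ** 2
        for t in shared
    ))

# Toy profiles (hypothetical values): only "Liver" is shared.
drug = {"Liver": 0.01, "Analgesics": 0.5}
disease = {"Liver": 0.0001, "Eye": 0.2}
distance = overlap_log_p_distance(drug, disease)
```

Note that two profiles with no terms in common yield a distance of 0 in this sketch, so a real pipeline would need to treat empty overlaps separately.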
All drug-disease relationships from the 2010 release were extracted, and those relationships reported by articles appearing after the 2007 MEDLINE/PubMed Baseline were defined as the curated validation set. Using these validation sets, we evaluate the candidate scoring methods by computing the Receiver Operating Characteristic (ROC) curve for predictions from analysis of the baseline 2007 data and reporting the Area Under the ROC Curve (AUC). Novel drug-disease pairs from the two reference sets are defined as “true positives”, and all other drug-disease pairings are defined as “true negatives” (which is recognized to be conservative, as such pairs may be validated in future studies). All drug-disease pairs reported prior to 2007 are excluded from the AUC analysis.
Implementation
The analysis was performed using Python (http://www.python.org/), XSLT (http://www.w3.org/TR/xslt), and the MySQL database system (http://www.mysql.com/). Fisher’s Exact Test p-values were computed using the R statistics package (http://www.r-project.org/). Results were generated using 50 CPUs of a compute cluster running under Sun GridEngine (http://gridengine.sunsource.net/). A typical cluster machine is a 64-bit dual processor 3 GHz Intel Xeon with 16 GB of RAM. Data was downloaded from MEDLINE®/PubMed® (http://www.nlm.nih.gov/databases/leased.html). The Comparative Toxicogenomics Database validation set was taken from the drug-disease relationships dataset (http://ctd.mdibl.org/downloads/). Results are freely accessible on the web at http://meshop.oicr.on.ca/meshop/. Source code implemented in Python is available at http://github.com/wac/meshop/ (gene and disease profile analysis) and http://github.com/wac/cmp-meshop/ (evaluation and validation of results).
Conclusions
Comparing MeSHOPs allows quantitative analysis of MeSH biomedical topics shared between drugs and diseases through their MEDLINE®/PubMed®-indexed primary literature.
Quantitatively measuring MeSHOP similarity is shown to infer functional relationships between drugs and diseases. Specifically, the similarity between drug MeSHOPs and disease MeSHOPs is highly predictive of future drug-disease ties. The best similarity metric, using Euclidean distance of the log-p of overlapping terms, achieves a mean AUC of 0.94, an 11% improvement over baseline. However, bibliometric characteristics, such as the number of terms in the disease MeSHOP, are demonstrated to have a strong bias in drug-disease association. We describe here a correction that eliminates this bias in the scoring metrics, separating the effects of the similarity scoring from the annotation bias.
Figure 4-1. MeSHOP for Acetaminophen. All terms in the Acetaminophen MeSHOP with a p-value of 0 are presented in this MeSHOP word cloud. The size of each term in the word cloud is proportional to the number of related articles for the term.
Figure 4-2. MeSHOP for Aniridia. The top 150 terms in the profile for the disease Aniridia are shown, where the font size of each MeSH term is proportional to the negative log p-value for the term.
Figure 4-3. Distribution of Drug Annotation and Disease Annotation in the New Drug-Disease Associations of the CTD Validation Set. The x-axis represents the quantile of the MeSH term counts for the drugs (part A) and diseases (part B) in the CTD reference collection. The histograms indicate that both drugs and diseases within the CTD reference collection are biased toward greater numbers of associated MeSH terms.
Figure 4-4. The Degree of Disease Annotation Plotted against MeSHOP Comparison Score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using Euclidean Distance.
The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is highly inversely correlated (Pearson correlation score of -0.82) with the similarity score.

Figure 4-5. The Degree of Drug Annotation vs. MeSHOP Comparison Score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance. The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is inversely correlated (Pearson correlation score of -0.33) with the similarity score.

Figure 4-6. Disease Annotation vs. Corrected MeSHOP Comparison Score. The figure displays a heatmap depicting the number of drug-disease tuples for a disease annotation level (MeSH terms attached to the disease MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels. The degree of disease annotation, measured as the total number of distinct MeSH terms associated with a disease, is no longer correlated (Pearson correlation score of 0.08) with the corrected similarity score.

Figure 4-7. Drug Annotation vs. Corrected MeSHOP Comparison Score. The figure displays a heatmap depicting the number of drug-disease tuples for a drug annotation level (MeSH terms attached to the drug MeSHOP) on the x-axis and a corrected MeSHOP comparison score on the y-axis. MeSHOP similarity scores were calculated using L2 distance, but were corrected as described in the text to account for background annotation levels.
The degree of drug annotation, measured as the total number of distinct MeSH terms associated with a drug, is no longer correlated (Pearson correlation score of 0.05) with the corrected similarity score.

Table 4-1. Performance of a Selection of Drug-Disease Similarity Scores. Performance validated using novel direct drug-disease co-occurrences from PubMed, and novel drug-disease relationships from the CTD. Top scores for each validation set are presented in boldface type.

Scoring Method Corrected drug-disease p-value Cosine distance tf-idf Cosine distance of p-values Cosine distance of term fractions Sum of the log of combined p-values Sum of the differences of log p values L2 of log-p of intersecting terms L2 of term fractions of intersecting terms only L2 of log of p-values L2 of p-values L2 of term fractions P(s < S) L2 of term frequency Total number of terms Number of Intersecting Terms Number of Drug Terms Number of Disease Terms  Direct Connection Validation AUC  CTD Validation AUC  0.65 0.88 0.64 0.78  0.76 0.91 0.70 0.83  0.92 0.89 0.95 0.64 0.88 0.87 0.85 0.87 0.90 0.91 0.80 0.84  0.93 0.86 0.92 0.55 0.84 0.82 0.90 0.83 0.87 0.91 0.83 0.83

Table 4-2. Table of Top-scoring Drug-Disease Relationships After Literature Correction. We present a filtered list of the drug-disease relationships with corrected p-value of 0, ordered by the strength of the literature correction applied.
We remove the most prevalent drugs (presented in Table 4-3) and exclude terms involving selected disorders judged to be poor targets for drug intervention (Malocclusion, Dental, Vegetative, Bacterial, Infection, Dislocation, Smear, Poisoning, Fractures, Edentulous, Decapitation, Injuries, Reperfusion)  Disease Anomia Anomia Fused Teeth Anomia Anomia Joint Instability Dens in Dente Fused Teeth Joint Instability Fused Teeth Joint Instability Joint Instability Dens in Dente Fused Teeth Joint Instability Joint Instability Joint Instability Alveolar Bone Loss Alveolar Bone Loss Joint Instability Joint Instability Alveolar Bone Loss Joint Instability Alveolar Bone Loss Alveolar Bone Loss Joint Instability Neoplasms, Multiple Primary Neoplasms, Multiple Primary Alveolar Bone Loss Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Lymphatic Metastasis Dens in Dente  Drug Fuji Ortho LC Prime & Bond poly(maleic acid-styrene)neocarzinostatin Gluma single bond Fuji Ortho LC bismuth tripotassium dicitrate bismuth tripotassium dicitrate Prime & Bond tirofiban corticosteroid methanetriol mixture Gluma peginterferon alfa-2a peginterferon alfa-2a Herculite XR isometamidium chloride poly(maleic acid-styrene)neocarzinostatin isometamidium chloride poly(maleic acid-styrene)neocarzinostatin single bond enfuvirtide enfuvirtide Zanamivir Zanamivir bismuth tripotassium dicitrate Abscisic Acid Scotchbond Multi-Purpose single bond Abscisic Acid Zanamivir Oseltamivir tert-butylbicyclophosphorothionate abacavir Abscisic Acid Autoantibodies  Literature Correction 0.7644 0.6636 0.5922 0.5796 0.462 0.3913 0.3515 0.3478 0.3397 0.3384 0.3182 0.2967 0.2945 0.2914 0.2752 0.2752 0.2709 0.2496 0.2457 0.2365 0.2236 0.2028 0.1849 0.1677 0.1443 0.0946 0.0928 0.088 0.0858 0.0688 0.0656 0.064 0.0624 0.0242 0.0095 116  Dens in Dente Cor Triatriatum Fused Teeth Fused Teeth Positive-Pressure Respiration, Intrinsic Cementoma Monieziasis Optic Nerve 
Glioma Bone Malalignment Mansonelliasis Mansonelliasis Joint Instability Alveolar Bone Loss Aneurysm, Dissecting Gingival Recession Gingival Recession Gingival Recession Gingival Recession Gingival Recession Leg Length Inequality Tooth Demineralization Neoplasms, Multiple Primary Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Joint Instability Periodontal Pocket Periodontal Pocket Periodontal Pocket Periodontal Pocket Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss  Interleukin-2 Autoantibodies Autoantibodies Interleukin-2 Autoantibodies Autoantibodies Autoantibodies Autoantibodies Interleukin-2 Autoantibodies Interleukin-2 Carbachol Carbachol Carbachol Angiotensin II Autoantibodies Hydrogen Peroxide Indomethacin Interleukin-2 Hydrogen Peroxide Tetrodotoxin Tetrodotoxin Angiotensin II Autoantibodies Cycloheximide Dactinomycin Dinoprostone Histamine Hydrogen Peroxide Indomethacin Interleukin-2 Propranolol Angiotensin II Hydrogen Peroxide Indomethacin Interleukin-2 Angiotensin II Autoantibodies Cycloheximide Dactinomycin Dinoprostone Ethanol  0.0095 0.0094 0.0094 0.0094 0.0094 0.0092 0.0092 0.0091 0.009 0.009 0.009 0.0086 0.0078 0.007 0.0064 0.0064 0.0064 0.0064 0.0064 0.0064 0.0051 0.0048 0.0043 0.0043 0.0043 0.0043 0.0043 0.0043 0.0043 0.0043 0.0043 0.0043 0.004 0.004 0.004 0.004 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 117  Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Alveolar Bone Loss Graft Occlusion, Vascular Graft Occlusion, Vascular Neuroma, Acoustic Neuroma, Acoustic Neuroma, Acoustic Neuroma, Acoustic Neuroma, Acoustic Neuroma, Acoustic Virilism Virilism Orbital Neoplasms Orbital Neoplasms Polycystic Ovary Syndrome Polycystic Ovary Syndrome Paranasal Sinus Neoplasms Aneurysm, Dissecting Aneurysm, Dissecting Aneurysm, Dissecting 
Aneurysm, Dissecting Aneurysm, Dissecting Aneurysm, Dissecting Glaucoma, Open-Angle Osteoporosis, Postmenopausal Osteoporosis, Postmenopausal Pregnancy, Ectopic Pregnancy, Ectopic Renal Artery Obstruction Tooth Demineralization Tooth Demineralization Mandibular Diseases Neoplasms, Multiple Primary Neoplasms, Multiple Primary Prosthesis Failure Ventricular Dysfunction, Left Ventricular Dysfunction Ventricular Dysfunction  Histamine Hydrocortisone Hydrogen Peroxide Indomethacin Interleukin-2 Propranolol Hydrogen Peroxide Interleukin-2 Angiotensin II Autoantibodies Dinoprostone Hydrogen Peroxide Indomethacin Interleukin-2 Hydrogen Peroxide Interleukin-2 Hydrogen Peroxide Indomethacin Hydrogen Peroxide Interleukin-2 Hydrogen Peroxide Autoantibodies Cycloheximide Dinoprostone Hydrogen Peroxide Indomethacin Interleukin-2 Interleukin-2 Hydrogen Peroxide Interleukin-2 Hydrogen Peroxide Interleukin-2 Hydrogen Peroxide Calcimycin Carbachol Carbachol Calcimycin Carbachol Interleukin-2 Interleukin-2 Hydrogen Peroxide Interleukin-2  0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0039 0.0038 0.0038 0.0037 0.0037 0.0037 0.0037 0.0036 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 0.0034 0.0034 0.0032 0.0032 0.0032 0.0026 0.0025 0.0023 0.0023 118  Atrial Fibrillation Atrial Fibrillation Lymphatic Metastasis Neurilemmoma Digestive System Fistula Pain, Postoperative Pain, Postoperative Pain, Postoperative Cell Transformation, Viral Periodontitis Aortic Aneurysm Neuroma Poultry Diseases Poultry Diseases Poultry Diseases Speech Disorders Speech Disorders Speech Disorders Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth Demineralization Tooth 
Demineralization Tooth Demineralization Tooth Demineralization Mandibular Diseases Mandibular Diseases Mandibular Diseases Mandibular Diseases Mandibular Diseases Mandibular Diseases Mandibular Diseases  Hydrogen Peroxide Interleukin-2 Carbachol Hydrogen Peroxide Hydrogen Peroxide Autoantibodies Hydrogen Peroxide Interleukin-2 Propranolol Angiotensin II Hydrogen Peroxide Hydrogen Peroxide Angiotensin II Cisplatin Indomethacin Hydrogen Peroxide Indomethacin Interleukin-2 Angiotensin II Autoantibodies Cisplatin Cycloheximide Dactinomycin Dinoprostone Follicle Stimulating Hormone Histamine Hydrocortisone Hydrogen Peroxide Indomethacin Interleukin-2 Interleukin-4 Luciferases Morphine Prednisone Propranolol Angiotensin II Autoantibodies Cycloheximide Dinoprostone Ethanol Histamine Hydrocortisone  0.0022 0.0022 0.0022 0.0021 0.002 0.002 0.002 0.002 0.0019 0.0019 0.0018 0.0018 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0017 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 119  Mandibular Diseases Mandibular Diseases Mandibular Diseases Mandibular Diseases Neoplasms, Ductal, Lobular, and Medullary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Neoplasms, Multiple Primary Peripheral Nervous System Neoplasms Peripheral Nervous System Neoplasms Peripheral Nervous System Neoplasms Peripheral Nervous System Neoplasms Peripheral Nervous System Neoplasms Peripheral Nervous System Neoplasms Peripheral Nervous System 
Neoplasms Peripheral Nervous System Neoplasms Lymphatic Metastasis Lymphatic Metastasis Lymphatic Metastasis Lymphatic Metastasis Lymphatic Metastasis Lymphatic Metastasis  Hydrogen Peroxide Indomethacin Interleukin-2 Propranolol Indomethacin Adenosine Angiotensin II Autoantibodies Cycloheximide Dinoprostone Edetic Acid Ethanol Glycine Heparin Histamine Hydrocortisone Hydrogen Peroxide Indomethacin Interleukin-2 Interleukin-4 Morphine Propranolol Superoxides Thrombin Angiotensin II Autoantibodies Cycloheximide Dinoprostone Hydrogen Peroxide Indomethacin Interleukin-2 Propranolol Angiotensin II Dinoprostone Histamine Hydrogen Peroxide Indomethacin Propranolol  0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0016 0.0011 0.0011 0.0011 0.0011 0.0011 0.0011

Table 4-3. Most Prevalent Highly Associated Drugs. List of drugs and their prevalence among the drug-disease relationships with corrected p-value of 0.

Drug                            Number of Drug-Disease Relationships
Antibodies, Monoclonal          3700
Glucose                         3696
Norepinephrine                  3686
Immunoglobulin G                3634
Insulin                         3611
Tetradecanoylphorbol Acetate    3577
Serotonin                       3525
Antibodies, Viral               3494
Dopamine                        3466
Acetylcholine                   3417
Interferon Type II              3314
Nitric Oxide                    3052
Antibodies, Bacterial           1818
Cyclophosphamide                1616
Immunoglobulin M                1335
Antibodies                      1311
Epinephrine                     1050
Green Fluorescent Proteins      607
Iron                            598
Progesterone                    595
Isoproterenol                   556
Atropine                        539

Chapter 5: Finding Similar Authors through Rapid Comparison of Shared Biomedical Research Themes

Synopsis

MEDLINE®/PubMed® provides a repository of over 21 million articles from biomedical researchers, manually indexed against over 25 thousand medical subjects.
Medical Subject Heading Over-representation allows us to analyse the research output of a biomedical researcher and quantitatively associate researchers with their most strongly related terms. Researchers that share these terms can then be linked together, and these links can be used in a number of different ways in the biomedical community. In this chapter we explore how these relationships can be used to find related authors for peer review exercises or scientific collaborations. We develop efficient methods for evaluating and comparing the medical topic profiles of the large pool of authors, enabling the extraction of an author’s network of most relevant related authors based on common research interests.

Introduction

With over half a million new authors added to PubMed each year, biomedical expertise is increasing at an unprecedented rate. Electronic publishing and the Internet now make articles instantly available anywhere in the world, making it possible to read the findings of any researcher in a related field without the hurdle of physical access. Traditional boundaries, such as the geographical location of your current institution, are becoming less important as technology makes it possible to discuss and collaborate with almost anyone, anywhere. PubMed facilitates finding information by presenting a portal for searching research articles by biomedical topic. At the heart of the search engine is the indexing effort by the National Library of Medicine, which annotates all articles in MEDLINE®/PubMed® with biomedical topic terms from the Medical Subject Headings (MeSH) vocabulary. However, as PubMed continues to expand in size, the number of articles and authors of these articles continues to grow, and sorting through this sea of authors in search of relevant and related experts sharing common biomedical research interests is an increasingly difficult problem.
Finding authors with similar interests is an important task in several practical contexts. One common use case is finding potential reviewers for papers – in the world of peer-reviewed scientific publications, it is always necessary to find arms-length researchers with relevant expertise. Researchers with similar interests are also potential collaborators for new and existing projects, while approaching similar problems from their own unique directions. Keeping abreast of new research, and finding papers by others in a particular field that an author is likely to care about, is an important part of remaining relevant, as these papers provide unique perspectives on common interests. Previously, eTBLAST (Errami et al., 2007) and JANE (Schuemie & Kors, 2008) were developed to search for PubMed articles similar to a natural language free text query such as an abstract or a preliminary document. After extracting articles similar to the query, both systems process the matching article results to extract top-ranked authors and journals. These free-text comparison methods have also been applied to the related topic of detecting highly similar and potentially plagiarised citations – citations that are extremely similar to the query are likely to be (un)intentional duplicates (Errami et al., 2009). Our work builds on this prior body of literature, moving beyond the focus on the text and phrases. We shift the focus to the extraction and comparison of distinguishing over-represented topics, using the statistical methods developed for MeSHOPs to control for the prevalence of terms and emphasise the most unusual and unexpected subjects, while providing a quantitative measure of the association. To identify the most relevant and distinctive topics studied by biomedical authors, we adapt our previous work analysing the Medical Subject Heading annotations of PubMed articles (Cheung et al., 2012).
We profile an author by extracting all the articles in PubMed where he or she is listed as an author, and compute the statistical over-representation of the MeSH terms found in these articles. These profiles allow quantitative measurement of the significance of the main topics studied by the author, as evaluated through their prevalence in the author’s research output. By comparing these profiles, we provide a quantitative measure of the similarity of the authors through their common interests. However, as the number of entities to compare increases, an exhaustive comparison of all the profiles rapidly becomes computationally very expensive, as the number of comparisons among n entities grows on the order of O(n²). This is especially important once we consider classes with extremely large numbers of entities, such as the analysis of all authors in PubMed, which can involve several hundreds of thousands to millions of entities, compared to the tens of thousands of human genes. An increase of a hundred times in the number of entities results in a ten thousand-fold increase in computational effort. With over 9 million author profiles in PubMed, even comparing a single profile against this entire database is prohibitively expensive, and the full set of comparisons results in over 4 × 10¹³ pairs of authors. The dramatic increase in difficulty necessitates novel methods to rapidly compare entities. To accommodate this information space, we introduce and evaluate two methods of comparing large numbers of MeSH profiles for authors to quickly identify highly similar authors. We achieve this by approximating each profile as a vector of bits and comparing these bit-vectors. This allows comparison of any existing profile against an entire database of profiles without needing to explicitly compare against each MeSH term in every profile.
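The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the function name `pairwise_comparisons` is illustrative, and the figure of 20,000 genes is an assumed round number standing in for "tens of thousands of human genes", not a count taken from the text.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n entities: n * (n - 1) / 2."""
    return comb(n, 2)


authors = 9_000_000  # "over 9 million author profiles"
genes = 20_000       # assumed round figure, for illustration only

# Full all-against-all comparison of author profiles.
print(pairwise_comparisons(authors))  # 40499995500000, i.e. about 4 x 10^13

# A hundred-fold increase in entities costs roughly ten thousand times more.
ratio = pairwise_comparisons(100 * genes) / pairwise_comparisons(genes)
print(round(ratio))
```

The quadratic growth is what motivates replacing exhaustive profile comparison with a cheaper approximation.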
We apply this method to the comparison of prolific authors in PubMed to illuminate relationships between authors beyond those visible from co-publication, enabling authors to find potential related peer reviewers and collaborators, based on the similarity of their published research.

Results and Discussion

MeSHOPs are a succinct, quantitative measure of association between an entity and key biomedical topics. By computing the MeSHOPs for all the authors in PubMed, we provide an insight into the research of a biomedical author. We examine the MeSHOPs of authors from 2010 and investigate the network of links between authors based on the similarity of their MeSHOPs. While the number of authors is extremely large – PubMed up to 2010 encompasses over 9 million unique author names – we successfully computed the complete author profiles, resulting in over 100Gb of profiles. However, it is even more computationally prohibitive to pre-compute and store the similarity of all these profiles. Even when we reduce the set to only profiles from authors that have over 15 articles in PubMed and eliminate author names that are likely to be ambiguous (having over 1000 articles), this still leaves over 800,000 authors. We developed two complementary methods to rapidly compare MeSHOPs, approximating a full comparison of all the MeSH terms in the MeSHOPs. To evaluate these methods, we apply them towards analysis of similarity between authors of articles from journals in three different domains, and for classifying members of four separate research institutions. The approximations balance sufficient breadth to cover the diversity in different combinations of topics covered by the authors, while retaining sufficient specificity to remain informative about the profiles. One approximation focuses on comparing a broad selection of MeSH terms at relaxed p-value thresholds, while the other matches the most strongly associated terms to the profile.
Both approximations consider only a subset of the MeSH terms rather than the entire profile, which forces a trade-off between the specificity of the terms and the number of authors strongly associated with each term. The most informative and interesting MeSH terms for any author are the most strongly associated (i.e. with low p-value) terms, yet to be useful for comparison, they need to be associated to a large proportion of authors. As seen in Figure 5-2, terms with strong average association are only present in MeSHOPs of few authors – for example, “RNA, Ribosomal, Self-Splicing” has an average p-value of 3.6 × 10⁻⁵, but is only present in the MeSHOPs of eight authors. On the other hand, terms annotated to many authors have very high average p-values – for example, “Humans” is present in the MeSHOPs of over half a million authors, but has an average p-value of 1.0. When comparing the number of authors for a term to the average p-value for the term, we see a Pearson correlation of 0.94 and a Spearman correlation of 0.98. To balance strength of association against the prevalence of author annotation for a term, we introduce a weighted overall average p-value, where authors not having the MeSH term are also included in the average with an effective p-value for the term of 1.0.

Global Summary Bit-Vectors

Our first approximation method involves summarising the profiles by selecting a subset of MeSH terms and thresholds to distil each profile into a succinct numeric representation. This method is inspired by locality sensitive hashing techniques, transforming the profile of MeSH terms into a single value which can then be efficiently compared. By treating the profile as a multidimensional vector of values, we test a subset of the dimensions (i.e. MeSH terms) against a set of thresholds, resulting in a distinctive bit-vector for each author entity. The more bits two authors have in common, the more similar they are.
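The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the names `to_bitvector` and `jaccard` are hypothetical, the p-values and the uniform threshold of 0.9 are invented toy values, and the standard Jaccard index is assumed here as one way of scoring shared bits.

```python
from typing import Dict, Sequence


def to_bitvector(profile: Dict[str, float],
                 terms: Sequence[str],
                 thresholds: Dict[str, float]) -> int:
    """Pack one profile into an int: bit i is set when terms[i] is present
    in the profile with a p-value below that term's threshold."""
    bits = 0
    for i, term in enumerate(terms):
        p = profile.get(term)
        if p is not None and p < thresholds[term]:
            bits |= 1 << i
    return bits


def jaccard(a: int, b: int) -> float:
    """Bits set in both vectors divided by bits set in either (0.0 if none)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Toy data: p-values and thresholds are invented for illustration only.
terms = ["Injections", "Cell Count", "Leukemia, Myeloid"]
thresholds = {"Injections": 0.9, "Cell Count": 0.9, "Leukemia, Myeloid": 0.9}
author_a = {"Injections": 0.2, "Cell Count": 0.95}
author_b = {"Injections": 0.5, "Leukemia, Myeloid": 0.1}

vec_a = to_bitvector(author_a, terms, thresholds)  # only "Injections" passes
vec_b = to_bitvector(author_b, terms, thresholds)  # "Injections" and "Leukemia, Myeloid" pass
print(jaccard(vec_a, vec_b))  # 0.5: one shared bit out of two set overall
```

Storing each vector as a single integer keeps the comparison to a pair of bitwise operations and a bit count, which is the property that makes comparison against a large database of profiles cheap.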
We extract the top 1000 MeSH terms with the lowest mean p-value from the fourth level of the MeSH hierarchy. We select tier four of the MeSH hierarchy as this depth covers the greatest breadth of distinct topics of the twelve hierarchical levels. We use 1000 MeSH terms to allow the bit-vector comparisons to be performed in time on the order of several minutes on our web servers. We select the overall average as the threshold for each bit, and select the terms with the lowest overall average p-value to identify the MeSH terms that are associated to the most authors with low p-value. Using the mean for the p-value as a threshold, all highly published authors were evaluated. In this case, we examine authors associated with a minimum of 15 articles in PubMed. While MeSHOPs can be computed for authors with any number of associated articles, profiles generated over a small set of articles are less likely to accurately portray the author. The JACCARD score based on the bit-vectors is highly correlated to the Euclidean distance score computed using the entire MeSHOP (See Figure 5-10). Even though comparisons of bit-vectors are no longer direct quantitative measures of the differences between the actual p-values of the author MeSHOPs, the bit-vector approximations provide sufficient granularity to allow comparison of disparate authors across fields in different institutions, with a strong correlation of -0.97. While global summary bit-vectors allow all authors to be compared directly, due to the diversity in MeSH terms and the specificity of authors, the MeSH terms selected in this manner are by necessity relatively broad, with such terms as “Injections”, “Cell Count”, “Regulatory Sequences, Nucleic Acid” and “Leukemia, Myeloid”. As well, the overall average results in thresholds which are relatively high and therefore are not specific to highly associated MeSH terms in MeSHOPs. This allows authors that are only moderately associated with terms to be included.
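The term selection and thresholding just described can be illustrated with a short Python sketch. It assumes profiles are held as dictionaries mapping a MeSH term to its p-value; the function names `overall_average_p` and `select_terms` and the toy p-values are hypothetical, chosen only to show the calculation.

```python
from typing import Dict, List

Profile = Dict[str, float]  # MeSH term -> p-value for one author


def overall_average_p(term: str, profiles: List[Profile]) -> float:
    """Mean p-value of a term over ALL authors; an author whose profile
    lacks the term contributes an effective p-value of 1.0."""
    return sum(profile.get(term, 1.0) for profile in profiles) / len(profiles)


def select_terms(candidates: List[str], profiles: List[Profile], n: int) -> Dict[str, float]:
    """Pick the n candidate terms with the lowest overall average p-value.
    Each term maps to that average, which doubles as its bit threshold."""
    averages = {t: overall_average_p(t, profiles) for t in candidates}
    chosen = sorted(averages, key=averages.get)[:n]
    return {t: averages[t] for t in chosen}


# Toy profiles with invented p-values.
profiles = [
    {"Cell Count": 0.25, "Injections": 0.5},
    {"Cell Count": 0.75},
    {"Molecular Biology": 0.5},
    {},
]
candidates = ["Cell Count", "Injections", "Molecular Biology"]
print(select_terms(candidates, profiles, n=2))
# {'Cell Count': 0.75, 'Injections': 0.875}
```

Because absent terms count as 1.0, a term held strongly by only a handful of authors still averages close to 1.0, so the ranking favours terms that are both reasonably significant and widely annotated. This is also why the resulting thresholds are relatively high.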
The most stringent p-value cut-off is 0.81 (for the MeSH term “Analysis of Variance”), and the least stringent cut-off considered is 0.96 (for “Molecular Biology”).

Specific MeSH Term Summaries

We therefore implemented a second method of comparing author MeSHOPs, extracting the top 20 most strongly associated MeSH terms from each author’s profile. This allows MeSH terms from the entire MeSH hierarchy to be available for the summary profile for each author, and in particular, the most specific and relevant terms from each MeSHOP can be included. To compare these partial profiles of authors, we again look at the number of terms shared between the authors. Even when limiting each profile to twenty MeSH terms, the summaries cover 25,231 of the 26,140 (96%) MeSH terms. The result is a much larger database of 16Gb in size, and attempting to increase the profiles to include the top fifty MeSH terms resulted in a database of 36Gb in size. In comparison, the global summary bit-vectors are 1Gb in size.

Comparing Authors from Different Institutions

We examine the relationships of authors from differing institutions through their MeSHOPs. The hypothesis we explore is that MeSHOP comparison can group related authors together based on their common research interests. We examine a subset of the principal investigators from four disparate institutions – the Centre for Molecular Medicine and Therapeutics, the Genome Sciences Centre, the Ontario Institute for Cancer Research and the University of British Columbia Department of Psychiatry. One existing method to identify relationships between authors is the analysis of co-authorship on biomedical research papers. This direct relationship highlights existing collaborations between authors. We see, in Figure 5-4 and Table 5-2, the co-publications between the authors of our studied institutions.
Immediately apparent is that co-publication data is often extremely sparse – even among members of the same institution, there are many researchers who have never been co-authors on a publication. Among the few co-publishers, we observe Yatham and Lam from UBC Psychiatry co-publishing, as well as Jones and Holt from the GSC. Co-publication can identify closely related researchers, although these may be primarily researchers from the same institution, or collaborators with complementary expertise. However, another metric is needed to compare less tightly associated authors, and to discover authors with similar interests that have not yet interacted. We examine in Figure 5-5 the comparison of the authors from these institutions using the JACCARD bit-vector profile similarity metric. We are able to extract several tightly grouped, institution-specific subsets. Jones, Marra and Holt are very tightly clustered together as investigators at the GSC. Earle, Grunfeld, Dancey, Fenster, Haider and Yaffe from the OICR are grouped together, as are Raymond, Craig, Murphy and MacVicar from UBC Psychiatry. Other groups have a strong subject area focus – for example, Wouters, Bell, Dick and Sidhu from the OICR cluster with Sader and Karsan from the GSC, all studying cancer. Simpson and Goldowitz from the CMMT cluster with Honer and Snutch from the GSC, studying the brain. Clustering in this manner not only allows us to visualise the relationships between two authors, but also provides a broader context. When we examine the overlap of the top 20 most specific MeSH terms from the author MeSHOPs (See Figure 5-6 and Table 5-3), we see several highly related authors clustering together. We again see co-publishers Marra, Jones and Holt from the GSC clustering together, as well as Yatham and Lam from UBC Psychiatry.
However, this analysis also brings together MacVicar and Murphy from UBC Psychiatry with Snutch at the GSC through their work on T and N-Type calcium channels (See Figure 5-7). As we are only comparing twenty terms from each profile, there are many authors that do not overlap at all and are therefore not comparable at this level. However, the relationships that are discovered are the result of specific, over-represented topic terms.

Author Domain Comparison

Web of Knowledge Domains

We also examine the ability of our comparison approximations to differentiate authors from one domain from authors in other domains. We select three subject categories for comparison: Genetics, Oncology and Psychiatry. Our hypothesis is that authors within a domain will be more similar than authors in different domains, and that the similarity measures would find, in general, that two authors from the same domain are closer than two authors from different domains. As seen in Table 5-1, the naïve ALLBITS comparison is the least effective, achieving at most 65% accuracy under all the validation sets. By focusing only on the MeSH terms below the significance thresholds using ONEBITS and combining it with the ALLBITS metric, we achieve over 60% accuracy in all domains. However, the JACCARD score is consistently the most accurate of our measures – we achieve over 70% accuracy (as measured by the ROC AUC) for all three domains. Like ONEBITS, this measure has the strength of focusing on the more significant MeSH terms, but it also includes a correction for the number of significant MeSH terms in each profile. Comparison of the top 20 MeSH terms from each author yielded relatively poor performance over all the tested subject domains (56% accuracy).
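The top 20 term comparison discussed here amounts to a set intersection over each author's most strongly associated terms. The Python sketch below is illustrative only: the function names are hypothetical, the p-values are invented, the alphabetical tie-break is an assumption, and `k` is lowered to 2 in the example so that the toy profiles remain readable.

```python
from typing import Dict, Set


def top_terms(profile: Dict[str, float], k: int = 20) -> Set[str]:
    """The k terms with the smallest p-values (ties broken alphabetically)."""
    ranked = sorted(profile.items(), key=lambda item: (item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def shared_top_terms(a: Dict[str, float], b: Dict[str, float], k: int = 20) -> Set[str]:
    """Terms appearing in the top-k summaries of both authors."""
    return top_terms(a, k) & top_terms(b, k)


# Toy profiles with invented p-values.
author_x = {"Calcium Channels, T-Type": 1e-9, "Neurons": 1e-4, "Humans": 1.0}
author_y = {"Calcium Channels, N-Type": 1e-8, "Calcium Channels, T-Type": 1e-7, "Humans": 1.0}

print(shared_top_terms(author_x, author_y, k=2))  # {'Calcium Channels, T-Type'}
```

With so few terms kept per author, most pairs share nothing at all, which matches the observation that this measure finds only the closest matches in topic.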
As we noticed previously when comparing authors from different institutions, this measure was only able to identify highly similar authors – by the very nature of only focusing on the most specific biomedical topics, it is only able to identify the closest authors in topic similarity. We hypothesize that while this measure is able to detect extremely closely related authors, it is unable to distinguish authors in the same domain but working on different topics from authors in completely different domains. We confirm this by examining the fraction of relationships with a non-zero overlap, and note that we have a precision reaching up to 79%, but with a recall rate that never exceeds 15%, when we consider all authors with at least one of their top 20 MeSH terms overlapping (See Table 5-5). The precision is lower for Genetics, indicating that the specific terms for authors in Genetics are more often shared with authors in Oncology and Psychiatry. Overall, we note that the JACCARD metric is able to distinguish authors in a domain from authors in other domains at a rate over 70%. The poor performance of the highly specific top 20 MeSH term profiles, however, shows that these more specialised profiles are not as useful for such global comparisons. Nevertheless, for the authors that are comparable using their top 20 MeSH term profiles, we achieve precision of over 75% for Oncology and Psychiatry.

Cold Spring Harbor Laboratory Meetings

We investigate our metrics by taking twenty invited speakers from each of four different Cold Spring Harbor Laboratory (CSHL) Meetings: PTEN Pathways & Targets, Gene Expression and Signaling in the Immune System, Molecular Chaperones and Stress Responses, and Neuronal Circuits. We expect the author similarity metrics to highlight speakers from the same meeting as being more similar to one another than to speakers in another meeting.
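One way to quantify this expectation is the ROC AUC over author pairs: the probability that a randomly chosen same-group pair scores higher than a randomly chosen different-group pair. The Python sketch below assumes this pairwise formulation, which the text does not spell out; the function name `pairwise_auc`, the speaker labels and the similarity scores are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict


def pairwise_auc(groups: Dict[str, str],
                 similarity: Callable[[str, str], float]) -> float:
    """ROC AUC for separating same-group pairs from different-group pairs.
    Ties between a positive and a negative pair count as half a win."""
    same, diff = [], []
    for a, b in combinations(groups, 2):
        (same if groups[a] == groups[b] else diff).append(similarity(a, b))
    wins = sum((s > d) + 0.5 * (s == d) for s in same for d in diff)
    return wins / (len(same) * len(diff))


# Toy example: four speakers, two meetings, invented similarity scores.
groups = {"a1": "M1", "a2": "M1", "b1": "M2", "b2": "M2"}
scores = {
    frozenset(("a1", "a2")): 0.8, frozenset(("b1", "b2")): 0.3,
    frozenset(("a1", "b1")): 0.1, frozenset(("a1", "b2")): 0.4,
    frozenset(("a2", "b1")): 0.2, frozenset(("a2", "b2")): 0.0,
}
auc = pairwise_auc(groups, lambda x, y: scores[frozenset((x, y))])
print(auc)  # 0.875: 7 of the 8 same-versus-different comparisons are won
```

An AUC of 0.5 corresponds to a metric that cannot tell same-meeting pairs from other pairs, so values such as 0.76 indicate a moderate but real separation.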
As before, we see that the JACCARD score is the most effective overall, again successfully identifying speakers from the same meeting with an accuracy of 0.76 (see Table 5-6 and Figure 5-8). The other metrics, including the top 20 MeSH term profiles, achieve accuracies from 0.60 to 0.67. Again, using the top 20 MeSH terms means that links found between authors involve very strongly associated MeSH terms; however, the number of terms that can be compared between authors is very limited, so that the links between authors involve at most three shared MeSH terms (see Figure 5-9).

Methods

Computing p-values

To compute the association of MeSH terms to an author, we extract from PubMed all articles for the author. P-values are then computed with the Fisher Exact Test, comparing the frequency of occurrence of the MeSH term amongst the articles for the author with the occurrences of the MeSH term in all of PubMed. We use this methodology to associate authors with biomedical themes in their published research, just as we associate genes, diseases and chemical compounds with their relevant biomedical topics in our previous work (Cheung et al., 2012). As we consider all authors in PubMed, the background selected in this case is the set of all PubMed articles.

MeSHOP Bit-Vectors

We apply two simplifications to drastically reduce the computation time of finding similar authors, which is especially important when comparing against a set of over half a million authors. First, we consider a filtered subset of n = 1000 MeSH terms (of the 26 142 terms of MeSH 2011). MeSH terms vary widely in their annotation to PubMed articles (see Figure 5-11); therefore, we focus on the most strongly associated MeSH terms based on average associated p-value.
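As an illustration of the p-value computation described under Computing p-values, the following minimal Python sketch evaluates a Fisher Exact Test for the over-representation of a single MeSH term. The function name, the argument names and the choice of a one-sided (upper-tail) test are assumptions made for this example, since the text does not state the sidedness of the test, and the counts are toy values rather than PubMed data.

```python
from math import comb


def fisher_overrepresentation_p(k, n, K, N):
    """One-sided Fisher Exact Test p-value for over-representation.

    k: articles by the author annotated with the MeSH term
    n: all articles by the author
    K: background articles annotated with the MeSH term
    N: all background articles (here, all of PubMed)

    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    Assumption: the author's articles are a subset of the background.
    """
    if not (0 <= k <= n <= N and k <= K <= N):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # ways to draw the author's n articles from the background
    # Sum the probability of observing k or more annotated articles.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Toy counts (hypothetical): 12 of an author's 40 articles carry a term
# that annotates 500 of 100,000 background articles.
p_value = fisher_overrepresentation_p(12, 40, 500, 100_000)
print(f"p = {p_value:.3g}")
```

Exact integer arithmetic keeps the sketch dependency-free. For counts at the scale of PubMed, a library routine such as SciPy's `scipy.stats.fisher_exact` would likely be a more practical choice.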
For each MeSH term i, we further simplify the profile: rather than storing the p-value for the term, we compare the p-value to a predefined significance threshold t_i and store only a single bit. The bit is 1 if i is present in the profile and has a p-value less than t_i; otherwise the bit is 0. This reduces each profile to a vector of 1000 bits, which we can compare much more quickly than the original profiles, where any of the MeSH terms could be present as a 32-bit numeric value.

The MeSH terms are assigned to one or more categories, each of which arranges the terms in a hierarchy, linking a general term to a more specific term in a parent-child relationship. We minimize overlaps in topic areas by selecting all MeSH terms from the same depth in the tree (that is, the same number of levels removed from the topmost terms in the category). We select terms from tier 4, comprising 9608 MeSH terms, to balance the specificity of the MeSH terms against the coverage of biomedical authors. As seen in the example MeSH term distributions in Figure 5-3, for each MeSH term a substantial fraction of authors are only weakly associated with the term, with a p-value near 1.0. To remove these weakly associated authors, we employ p-value thresholds. However, customary p-value cut-offs such as 0.05 prove too aggressive, as they exclude a large number of authors with an intermediate level of association, rendering the bit-vectors uninformative for a large fraction of authors. Therefore, to focus on the most representative MeSH terms for each author while also covering the largest number of authors, we use the overall average as the threshold, which is sufficiently high to allow the inclusion of a large fraction of authors.

Bit-Vector Comparison

To compare these approximate profiles, we contrast the performance of two metrics.
The first counts all the bits that are the same between two bit-vector profiles p and q:

\[ \mathrm{ALLBITS}(p, q) = \mathrm{BITCOUNT}(\mathrm{NOT}(\mathrm{XOR}(p, q))) \]

This metric emphasizes both the terms that are strongly associated with both profiles and the terms that are not strongly associated with either profile. However, one consequence of the low prevalence of strongly associated MeSH terms, as seen in Figure 5-2, is that a large fraction of the bits in all profiles are zero, and therefore most of the matched bits are zeroes. The second metric counts only the bits set to one that are shared by both profiles:

\[ \mathrm{ONEBITS}(p, q) = \mathrm{BITCOUNT}(\mathrm{AND}(p, q)) \]

In contrast to ALLBITS(p, q), ONEBITS(p, q) focuses only on the MeSH terms strongly associated with both authors. However, the resulting score is biased, as authors with more bits set to one will tend to have higher scores. To normalise for the differing number of bits set to one in the profiles, we consider the Jaccard metric:

\[ \mathrm{JACCARD}(p, q) = \frac{\mathrm{BITCOUNT}(\mathrm{AND}(p, q))}{\mathrm{BITCOUNT}(\mathrm{OR}(p, q))} \]

This metric compares the number of bits set to one in both p and q to the number of bits set to one in either p or q. Therefore, large profiles that are very different will be penalised, and the best scoring profiles are the ones which have many overlapping terms and very few other terms that do not overlap. To demonstrate that these metrics are effective approximations of full profile similarity comparisons, we correlate these scores with actual author similarity scores calculated for the two different validation subsets in the following sections. The domain of authors for this analysis consisted of authors with a minimum of 15 articles and a maximum of 1000 articles, limiting the pool to authors with a substantial body of associated literature, while eliminating very general names which are likely the combination of several author entities.
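The bit-vector construction and the three comparison metrics can be sketched in Python as follows. This is a minimal illustration rather than the scripts used for the analysis: profiles are assumed to be dictionaries mapping MeSH terms to p-values, each bit-vector is packed into a Python integer, and the helper names and the convention of scoring two empty profiles as 0.0 are choices made for this example.

```python
def to_bitvector(profile, terms, thresholds):
    """Pack a profile into an int: bit i is 1 if terms[i] is present
    in the profile with a p-value below thresholds[i], otherwise 0."""
    bits = 0
    for i, term in enumerate(terms):
        p_value = profile.get(term)
        if p_value is not None and p_value < thresholds[i]:
            bits |= 1 << i
    return bits


def bitcount(x):
    """Number of bits set to one in a non-negative integer."""
    return bin(x).count("1")


def allbits(p, q, n_bits):
    """Count positions where p and q agree (both one or both zero)."""
    mask = (1 << n_bits) - 1  # restrict NOT to the n_bits of the profile
    return bitcount(~(p ^ q) & mask)


def onebits(p, q):
    """Count positions set to one in both p and q."""
    return bitcount(p & q)


def jaccard(p, q):
    """Shared one-bits divided by one-bits in either profile."""
    union = bitcount(p | q)
    # Assumption: two profiles with no bits set score 0.0.
    return bitcount(p & q) / union if union else 0.0


# Toy example with four hypothetical terms and a common threshold of 0.5.
terms = ["A", "B", "C", "D"]
thresholds = [0.5, 0.5, 0.5, 0.5]
author_1 = to_bitvector({"A": 0.01, "B": 0.2, "C": 0.9}, terms, thresholds)
author_2 = to_bitvector({"B": 0.1, "D": 0.3}, terms, thresholds)
print(allbits(author_1, author_2, len(terms)),
      onebits(author_1, author_2),
      jaccard(author_1, author_2))
```

Packing each profile into a single integer lets every comparison run as a few bitwise operations, which is the property the bit-vector approximation relies on when comparing very large numbers of author pairs.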
Most Significant Terms

As a second method to approximately compare author MeSHOPs, we examine the most specific and significant terms associated with each author’s profile. For each author, we extracted the 20 most significant MeSH terms, regardless of position in the hierarchy, as measured by the p-value associated with the term. To compare authors, we count the overlapping terms among the top 20 MeSH terms for the pair of authors compared.

Validation Sets

We expect authors from the same research domain to be more similar to one another than authors from different domains. We select three distinct domains (Genetics, Oncology and Psychiatry) and compare authors from these domains. For each of the domains, we select five journals specific to the field from the top-ranking journals listed in the ISI Web of Knowledge (see Table 5-4). From each journal, we select the 1000 authors associated with the most articles published in the journal after 2005. We then pool all the authors from the five journals for a domain as the authors for the domain. Authors who occur in the lists for multiple domains are excluded, to prevent ambiguity arising from interdisciplinary authors. We then evaluate the ROC of the similarity metrics at measuring similarity of pairs of authors for a particular domain. Relationships where both authors are from the particular domain are treated as “true positive” relationships for the validation, and all relationships between an author from the particular domain and an author in another domain are treated as “true negatives”. The ROC AUC therefore measures the accuracy of recovering authors from a particular domain. As a case study, we compare principal investigators from four institutions: the Centre for Molecular Medicine and Therapeutics, the Genome Sciences Centre, the Ontario Institute for Cancer Research and the UBC Department of Psychiatry.
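The top-term overlap score and the ROC AUC evaluation described in this section can be sketched as follows. This Python sketch rests on several assumptions: profiles are dictionaries mapping MeSH terms to p-values, ties between equal p-values are broken by sort order, and the AUC is computed by direct pairwise comparison of positive and negative scores (counting ties as one half). That approach is quadratic and intended only for small illustrative inputs. All names are hypothetical.

```python
def top_terms(profile, k=20):
    """Return the k terms with the smallest p-values in the profile."""
    return set(sorted(profile, key=profile.get)[:k])


def top_overlap(profile_a, profile_b, k=20):
    """Number of terms shared by the top-k terms of two profiles."""
    return len(top_terms(profile_a, k) & top_terms(profile_b, k))


def roc_auc(positive_scores, negative_scores):
    """Probability that a randomly chosen positive pair scores higher
    than a randomly chosen negative pair, with ties counted as 0.5."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy profiles (hypothetical terms and p-values), compared on their top 2 terms.
author_a = {"x": 0.001, "y": 0.01, "z": 0.5}
author_b = {"y": 0.002, "z": 0.003, "w": 0.9}
shared = top_overlap(author_a, author_b, k=2)

# Toy similarity scores: same-domain pairs are positives, cross-domain pairs negatives.
auc = roc_auc([3, 2], [1, 2])
print(shared, auc)
```

In this framing, an AUC of 0.5 corresponds to a similarity score that does not separate same-domain pairs from cross-domain pairs, and 1.0 to perfect separation.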
We extracted the subset of all authors with at least twenty articles, matching a full first name and last name from PubMed. We also look at speakers at four Cold Spring Harbour Laboratory Meetings. For each of the meetings, we extract twenty authors, each with at least 15 articles and a maximum of 1000 articles, matching a full first name and last name in PubMed.

Implementation

MEDLINE®/PubMed® data was downloaded from the National Library of Medicine as the 2011 Baseline. The bit-vectors of MeSH terms for authors were extracted and compared using Python scripts. The top MeSH terms for all authors were stored and compared using a MySQL database.

Future Directions

Author disambiguation is an important area of research highly relevant to the generation of author MeSHOPs. In the past, authors in PubMed were identified only by their initials and last name. This results in ambiguous cases, as multiple authors with the same last name and the same initials are indistinguishable in the system. Even with the introduction in 2002 of storing full first names when available [NLM Technical Bulletin, Nov.-Dec. 2001], there are still cases where multiple people have the same first and last name. As well, because PubMed stores author names exactly as provided by the publication, errors in spelling and the presence or absence of middle initials further confound this issue. Conversely, certain profiles, such as those generated from author names with extremely large numbers of publications, are likely to be a list merged from several ambiguous authors. Author similarity has been used to disambiguate different people with the same author name in PubMed. MeSH terms have also been used for author disambiguation, for example by Torvik & Smalheiser (2009), who used MeSH term overlap, filtered using a list of MeSH terms, to determine whether two articles bearing the same or highly similar author names appear to be by the same person.
While author disambiguation has not yet been solved as of the writing of this thesis, preliminary solutions to this problem are being actively developed, such as OpenID and ResearcherID (Bourne & Fink, 2008), and PubMed has been developing its own solution in the form of a PubMed Author ID (NLM, 2011). Resolving the author disambiguation issue will result in more relevant MeSHOPs encompassing the complete body of literature for authors. Another potential effect on an author’s MeSHOP is the shift in research focus over the author’s career. Examining cross-sections of this longitudinal evolution could differentiate the stages of an author’s career. Alternatively, articles could be weighted by date, allowing the MeSHOP to focus on an author’s more current work and assist in identifying associations that are current.

Our method examines the large, publicly available data of MEDLINE®/PubMed® and the curated MeSH terms. Another source of information about research articles and authors is citation indexes such as SciVerse Scopus and Web of Science. While these do not index the topics of the articles, they provide bibliographical citation information. Author similarity has also been previously studied in medical informatics using author co-citation analysis (Andrews, 2003), grouping authors in a domain by their co-occurrence in bibliographies. Impact Factor and Eigenfactor are metrics commonly applied in bibliometric analysis to identify preeminent journals. Impact Factor is derived from the citation frequency (Dong, Loh, & Mondry, 2005), while Eigenfactor looks at a weighted network of citations (Rizkallah & Sin, 2010). For authors, the H-Index combines citation rate and number of authored articles to evaluate the scientific impact of an author’s research (Hirsch, 2005).
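To make the H-Index concrete, the following short Python sketch computes it from a list of per-article citation counts, using the standard definition: the largest h such that the author has h articles with at least h citations each. The function name and the citation counts are hypothetical and serve only as an illustration.

```python
def h_index(citations):
    """Largest h such that h articles each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` articles all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for one author's five articles.
print(h_index([10, 8, 5, 4, 3]))
```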
Measures of author influence could be incorporated in the author analysis to highlight highly influential related authors in the list of similar authors. Another domain closely related to the author MeSHOPs studied here is MeSHOPs for biomedical journals. Journal MeSHOPs can also be compared, providing an automated, unbiased view of the relationships between the journals based on their published articles. This allows journals to be identified by their most outstanding topics, and provides a mechanism for matching journals to authors through their biomedical subject areas.

While the techniques described here demonstrate that large-scale profile comparison is now feasible, further performance improvements are still possible. The similarity comparisons are implemented in system-independent Python; further gains could be made by rewriting the bit-comparison routines to take advantage of hardware-specific bit-manipulation features. As well, each author-author comparison is an independent task, and could therefore be parallelised if a cluster were available to the webserver.

Conclusion

MeSHOPs provide a direct analysis of the primary literature for biomedical authors, extracting unusually prevalent annotation directly from the author’s work. To allow comparison of authors from the extremely large pool of biomedical authors in PubMed, we implement two approximations of the full profile comparison. We examine bit-vectors of the one thousand MeSH terms with the lowest mean p-values, and compare all authors directly on these common terms. Alternatively, we look at the twenty most significant p-values for each author and compare the overlap of these terms. This allows all the most specific MeSH terms related to each author to be used. These methods highlight that even extremely large sets of related entities can be compared efficiently, and demonstrate the effectiveness of computational methods to discover new relationships between these entities.
[Figure 5-1 plot: Author PubMed Publication Distribution. x-axis: Author PubMed Citations (1 to >1000); y-axis: Author Count (logarithmic scale, 1 to 10,000,000).]

Figure 5-1. Histogram of the Authors by the Count of their Publications in PubMed. The number of authors decreases exponentially, from over 3.4 million authors with a single citation to 266,000 authors with five citations.

[Figure 5-2 panels: A) Term Average; B) Overall Average.]

Figure 5-2. MeSH Terms vs. Average p-value. Each point represents a MeSH term, with the number of authors having the MeSH term on the x-axis and the average p-value for the MeSH term on the y-axis. Colour indicates the number of terms in the hexagonal cell. A) The average p-value among all the authors with the term is plotted against the number of authors with the term. B) MeSH terms are plotted again with the number of authors having the MeSH term on the x-axis, but the average p-value on the y-axis in this case also treats every author not annotated with the MeSH term as having an effective p-value of 1.

Figure 5-3. Frequency Polygon of the p-values for Several MeSH Terms. The frequency of authors associated with MeSH terms taken from various tiers of the MeSH tree is shown at each p-value level. For example, for “immunoassay”, there are over 20,000 authors with the term at a p-value of about 0.35. Far fewer authors have “immunoassay” at p-values of 0.6 through 0.9, but there is another concentration of authors that have the term with a p-value near 1.0.

Figure 5-4. Heatmap of Co-publication Rates of Principal Investigators from Four Institutions. Institutions covered were the Centre for Molecular Medicine and Therapeutics, the Genome Sciences Centre, the Ontario Institute for Cancer Research and UBC Department of Psychiatry.
The colour palette ranges from purple (no co-publications) through dark blue and green to yellow and pale orange, which indicates the greatest number of co-publications. Nearly all authors have no common publications.

Figure 5-5. Similarity Matrix for Principal Investigators from Four Institutions. Colour indicates the degree of similarity as measured by the JACCARD metric. Full names are listed on the right-hand y-axis and last names (in the same order) are listed on the x-axis. The left-hand colours on the y-axis indicate the institution: CMMT (turquoise), GSC (pale blue), OICR (orange) and UBC Dept. of Psychiatry (purple).

Figure 5-6. Comparing Author MeSHOPs Using their Top 20 MeSH Terms. Colour indicates the degree of similarity as measured by the number of overlapping MeSH terms among the top 20 MeSH terms for the profiles compared. Full names are listed on the right-hand y-axis and last names (in the same order) are listed on the x-axis. The left-hand colours on the y-axis indicate the institution: CMMT (turquoise), GSC (pale blue), OICR (orange) and UBC Dept. of Psychiatry (purple).
[Figure 5-7 diagram: top 20 MeSHOP terms for SNUTCH, TP; MURPHY, TH; and MACVICAR, BA, with shared terms placed in the overlapping regions. Terms shown: Allethrin; Penfluridol; Intralaminar Thalamic Nuclei; Aldicarb; Rats, Transgenic; Rotarod Performance Test; Heteroduplex Analysis; Urocortins; Migraine with Aura; omega-Conotoxin GVIA; Flunarizine; GTP-Binding Protein alpha Subunits, Gq-G11; Night Blindness; Pimozide; omega-Conotoxins; Calcium Channel Agonists; Calcium Channels, R-Type; Dendritic Spines; Calcium Channels, T-Type; Calcium Channels, N-Type; Adenosine A1 Receptor Agonists; Adenosine A1 Receptor Antagonists; Voltage-Dependent Anion Channels; Carbenoxolone; Receptor, Adenosine A1; Mossy Fibers, Hippocampal; Receptors, Purinergic P2X7; Receptor, Muscarinic M1; Muscarine; Uricosuric Agents; Purinergic P2 Receptor Agonists; Pericytes; Receptor, Muscarinic M3; Apyrase; Wallerian Degeneration; Purinergic P2 Receptor Antagonists; Voltage-Sensitive Dye Imaging; Lipoylation; Calcium Channels, Q-Type; Calcium Channels, P-Type; Amino Acid Transport System y+; Inhibitory Postsynaptic Potentials; Rhodotorula; D-Amino-Acid Oxidase; Microscopy, Fluorescence, Multiphoton; Touch Perception; Amino Acid Transport Systems, Basic; 1-Methyl-4-phenylpyridinium; NF-E2-Related Factor 2; ADP-Ribosylation Factors; Cadmium Chloride; GAP-43 Protein; Long-Term Synaptic Depression.]

Figure 5-7. Top 20 MeSHOP Term Overlaps Between Three Authors. Shown here are the top 20 MeSHOP terms, and the overlaps between three biomedical authors. MacVicar and Murphy are affiliated with UBC Psychiatry, whereas Snutch is a researcher at the Genome Sciences Centre.

Figure 5-8. Network of CSHL Meeting Speakers Linked by JACCARD Score. Nodes are coloured by the CSHL meeting the speaker was identified from: PTEN Pathways & Targets (green), Gene Expression and Signaling in the Immune System (red), Molecular Chaperones and Stress Responses (blue), and Neuronal Circuits (purple).
The colour of the band is a mix of the colours of the speakers involved, and the thickness indicates the magnitude of the JACCARD score for the association. Only associations with a JACCARD score greater than 0.20 are shown here.

Figure 5-9. Network of CSHL Meeting Speakers Linked by Top 20 MeSH Term Overlap Score. Nodes are coloured by the CSHL meeting the speaker was identified from: PTEN Pathways & Targets (red), Gene Expression and Signaling in the Immune System (purple), Molecular Chaperones and Stress Responses (green), and Neuronal Circuits (blue). The colour of the band is a mix of the colours of the speakers involved, and the thickness indicates the number of overlapping MeSH terms.

Figure 5-10. JACCARD Similarity Using the 1000-bit Profile Bit-vector Against the L2 Distance Score Using the Full MeSHOPs. All pairwise author-author comparison scores for the authors of the CMMT/OICR/GSC/UBC Dept. of Psychiatry validation set are shown here.

Figure 5-11. Distribution of PubMed Articles Associated to MeSH Terms. The distribution of the number of articles for the 25 562 MeSH terms from 2010 MeSH is plotted in this frequency polygon.

Table 5-1. Performance of Author Similarity for Identifying Author Research Domain. Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores demonstrate the accuracy of profile similarity comparison at differentiating authors within a research area from authors in other research areas. This compares 3749 authors in Genetics, 4807 authors in Oncology and 4244 authors in Psychiatry.

Metric                 Genetics  Oncology  Psychiatry
JACCARD                0.75      0.78      0.74
ALLBITS x ONEBITS      0.68      0.78      0.68
ONEBITS                0.65      0.77      0.65
ALLBITS                0.65      0.53      0.65
Top 20 MeSH Overlap    0.56      0.56      0.56

Table 5-2. Author Co-publication Counts. Number of co-publications for principal investigators from the CMMT, GSC, OICR and UBC Dept. of Psychiatry.
Author pairs with at least one co-publication are listed, sorted by number of articles. Each author pair is ordered alphabetically by last name; for example, JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) is listed rather than MARRA, MARCO A (MA) | JONES, STEVEN J M (SJ).

Principal Investigators | Articles
HAYDEN, MICHAEL R (MR) | LEAVITT, BLAIR R (BR) | 42
LAM, RAYMOND W (RW) | YATHAM, LAKSHMI N (LN) | 32
JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) | 31
HOLT, ROBERT A (RA) | MARRA, MARCO A (MA) | 25
HOLT, ROBERT A (RA) | JONES, STEVEN J M (SJ) | 16
HAYDEN, MICHAEL R (MR) | RAYMOND, LYNN A (LA) | 16
JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) | 12
LEAVITT, BLAIR R (BR) | RAYMOND, LYNN A (LA) | 7
MURPHY, TIMOTHY H (TH) | RAYMOND, LYNN A (LA) | 6
JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) | 6
EARLE, CRAIG C (CC) | GRUNFELD, EVA (E) | 5
SIMPSON, ELIZABETH M (EM) | WASSERMAN, WYETH W (WW) | 4
HOLT, ROBERT A (RA) | JONES, STEVEN J M (SJ) | 4
MARRA, MARCO A (MA) | SADAR, MARIANNE D (MD) | 3
MARRA, MARCO A (MA) | MCPHERSON, JOHN D (JD) | 3
JONES, STEVEN J M (SJ) | SIMPSON, ELIZABETH M (EM) | 3
JONES, STEVEN J M (SJ) | SADAR, MARIANNE D (MD) | 3
HUDSON, THOMAS J (TJ) | STEIN, LINCOLN D (LD) | 3
HUDSON, THOMAS J (TJ) | PALMER, LYLE J (LJ) | 3
HOLT, ROBERT A (RA) | SIMPSON, ELIZABETH M (EM) | 3
MARRA, MARCO A (MA) | SIMPSON, ELIZABETH M (EM) | 2
LEAVITT, BLAIR R (BR) | SIMPSON, ELIZABETH M (EM) | 2
JONES, STEVEN J M (SJ) | WASSERMAN, WYETH W (WW) | 2
HUDSON, THOMAS J (TJ) | MCPHERSON, JOHN D (JD) | 2
HOLT, ROBERT A (RA) | WASSERMAN, WYETH W (WW) | 2
HOLT, ROBERT A (RA) | LEAVITT, BLAIR R (BR) | 2
HOLT, ROBERT A (RA) | HONER, WILLIAM G (WG) | 2
HAYDEN, MICHAEL R (MR) | SIMPSON, ELIZABETH M (EM) | 2
HAYDEN, MICHAEL R (MR) | MURPHY, TIMOTHY H (TH) | 2
MCPHERSON, JOHN D (JD) | STEIN, LINCOLN D (LD) | 1
MARRA, MARCO A (MA) | WASSERMAN, WYETH W (WW) | 1
MARRA, MARCO A (MA) | STEIN, LINCOLN D (LD) | 1
KARSAN, ALY (A) | MARRA, MARCO A (MA) | 1
JONES, STEVEN J M (SJ) | MCPHERSON, JOHN D (JD) | 1
JANG, KERRY L (KL) | YATHAM, LAKSHMI N (LN) | 1
JANG, KERRY L (KL) | LAM, RAYMOND W (RW) | 1
HONER, WILLIAM G (WG) | YATHAM, LAKSHMI N (LN) | 1
HONER, WILLIAM G (WG) | LEAVITT, BLAIR R (BR) | 1
HONER, WILLIAM G (WG) | LAM, RAYMOND W (RW) | 1
HONER, WILLIAM G (WG) | JANG, KERRY L (KL) | 1
HOLT, ROBERT A (RA) | SADAR, MARIANNE D (MD) | 1
HAYDEN, MICHAEL R (MR) | WASSERMAN, WYETH W (WW) | 1
HAYDEN, MICHAEL R (MR) | HUDSON, THOMAS J (TJ) | 1
HAYDEN, MICHAEL R (MR) | HONER, WILLIAM G (WG) | 1
HAYDEN, MICHAEL R (MR) | HOLT, ROBERT A (RA) | 1
HAIDER, MASOOM A (MA) | YAFFE, MARTIN J (MJ) | 1
GOLDOWITZ, DAN (D) | LEAVITT, BLAIR R (BR) | 1
GOLDOWITZ, DAN (D) | HAYDEN, MICHAEL R (MR) | 1
DANCEY, JANET (J) | EARLE, CRAIG C (CC) | 1
BOUTROS, PAUL C (PC) | STEIN, LINCOLN D (LD) | 1
BOUTROS, PAUL C (PC) | MCPHERSON, JOHN D (JD) | 1
BOUTROS, PAUL C (PC) | HUDSON, THOMAS J (TJ) | 1
BELL, JOHN C (JC) | WOUTERS, BRADLY G (BG) | 1

Table 5-3. Top 20 MeSH Term Overlap Counts. Number of overlapping terms, of the Top 20 MeSH terms from their MeSHOPs, for principal investigators analysed from the CMMT, GSC, OICR and UBC Dept. of Psychiatry. Each author pair is arranged alphabetically by last name. For example, JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) is listed rather than MARRA, MARCO A (MA) | JONES, STEVEN J M (SJ).

Principal Investigators | Terms
JONES, STEVEN J M (SJ) | MARRA, MARCO A (MA) | 7
HOLT, ROBERT A (RA) | JONES, STEVEN J M (SJ) | 7
HAYDEN, MICHAEL R (MR) | LEAVITT, BLAIR R (BR) | 7
SIMPSON, ELIZABETH M (EM) | WASSERMAN, WYETH W (WW) | 6
LAM, RAYMOND W (RW) | YATHAM, LAKSHMI N (LN) | 6
EARLE, CRAIG C (CC) | GRUNFELD, EVA (E) | 6
MURPHY, TIMOTHY H (TH) | SNUTCH, TERRANCE P (TP) | 3
MARRA, MARCO A (MA) | STEIN, LINCOLN D (LD) | 3
MACVICAR, BRIAN A (BA) | SNUTCH, TERRANCE P (TP) | 3
MACVICAR, BRIAN A (BA) | MURPHY, TIMOTHY H (TH) | 3
LEAVITT, BLAIR R (BR) | RAYMOND, LYNN A (LA) | 3
HOLT, ROBERT A (RA) | MARRA, MARCO A (MA) | 3
HOLT, ROBERT A (RA) | HONER, WILLIAM G (WG) | 3
STEIN, LINCOLN D (LD) | WASSERMAN, WYETH W (WW) | 2
MCPHERSON, JOHN D (JD) | SIMPSON, ELIZABETH M (EM) | 2
MCPHERSON, JOHN D (JD) | OUELLETTE, B F (BF) | 2
LEAVITT, BLAIR R (BR) | WASSERMAN, WYETH W (WW) | 2
LEAVITT, BLAIR R (BR) | SNUTCH, TERRANCE P (TP) | 2
LEAVITT, BLAIR R (BR) | SIMPSON, ELIZABETH M (EM) | 2
JONES, STEVEN J M (SJ) | WASSERMAN, WYETH W (WW) | 2
HUDSON, THOMAS J (TJ) | STEIN, LINCOLN D (LD) | 2
HOLT, ROBERT A (RA) | SIMPSON, ELIZABETH M (EM) | 2
HOLT, ROBERT A (RA) | MCPHERSON, JOHN D (JD) | 2
HAYDEN, MICHAEL R (MR) | RAYMOND, LYNN A (LA) | 2
FENSTER, AARON (A) | YAFFE, MARTIN J (MJ) | 2
DICK, JOHN E (JE) | KARSAN, ALY (A) | 2
CRAIG, ANN MARIE (AM) | HONER, WILLIAM G (WG) | 2
BOUTROS, PAUL C (PC) | WASSERMAN, WYETH W (WW) | 2
BELL, JOHN C (JC) | KARSAN, ALY (A) | 2
RAYMOND, LYNN A (LA) | SIMPSON, ELIZABETH M (EM) | 1
PALMER, LYLE J (LJ) | YAFFE, MARTIN J (MJ) | 1
OUELLETTE, B F (BF) | WASSERMAN, WYETH W (WW) | 1
MURPHY, TIMOTHY H (TH) | WASSERMAN, WYETH W (WW) | 1
MURPHY, TIMOTHY H (TH) | RAYMOND, LYNN A (LA) | 1
MCPHERSON, JOHN D (JD) | WASSERMAN, WYETH W (WW) | 1
MCPHERSON, JOHN D (JD) | STEIN, LINCOLN D (LD) | 1
MARRA, MARCO A (MA) | SADAR, MARIANNE D (MD) | 1
MARRA, MARCO A (MA) | MCPHERSON, JOHN D (JD) | 1
MACVICAR, BRIAN A (BA) | RAYMOND, LYNN A (LA) | 1
LEAVITT, BLAIR R (BR) | MURPHY, TIMOTHY H (TH) | 1
KARSAN, ALY (A) | SIDHU, SACHDEV S (SS) | 1
KARSAN, ALY (A) | MARRA, MARCO A (MA) | 1
JONES, STEVEN J M (SJ) | SIMPSON, ELIZABETH M (EM) | 1
JONES, STEVEN J M (SJ) | SADAR, MARIANNE D (MD) | 1
JONES, STEVEN J M (SJ) | MCPHERSON, JOHN D (JD) | 1
JANG, KERRY L (KL) | YATHAM, LAKSHMI N (LN) | 1
JANG, KERRY L (KL) | YAFFE, MARTIN J (MJ) | 1
JANG, KERRY L (KL) | SIMPSON, ELIZABETH M (EM) | 1
JANG, KERRY L (KL) | PALMER, LYLE J (LJ) | 1
JANG, KERRY L (KL) | LAM, RAYMOND W (RW) | 1
IVERSON, GRANT L (GL) | YATHAM, LAKSHMI N (LN) | 1
IVERSON, GRANT L (GL) | LAM, RAYMOND W (RW) | 1
HUDSON, THOMAS J (TJ) | SIDHU, SACHDEV S (SS) | 1
HUDSON, THOMAS J (TJ) | PALMER, LYLE J (LJ) | 1
HONER, WILLIAM G (WG) | YATHAM, LAKSHMI N (LN) | 1
HONER, WILLIAM G (WG) | RAYMOND, LYNN A (LA) | 1
HONER, WILLIAM G (WG) | MACVICAR, BRIAN A (BA) | 1
HONER, WILLIAM G (WG) | LEAVITT, BLAIR R (BR) | 1
HOLT, ROBERT A (RA) | WASSERMAN, WYETH W (WW) | 1
HOLT, ROBERT A (RA) | STEIN, LINCOLN D (LD) | 1
HAYDEN, MICHAEL R (MR) | WASSERMAN, WYETH W (WW) | 1
HAYDEN, MICHAEL R (MR) | SNUTCH, TERRANCE P (TP) | 1
HAYDEN, MICHAEL R (MR) | SIMPSON, ELIZABETH M (EM) | 1
HAYDEN, MICHAEL R (MR) | OUELLETTE, B F (BF) | 1
HAYDEN, MICHAEL R (MR) | MURPHY, TIMOTHY H (TH) | 1
HAIDER, MASOOM A (MA) | YAFFE, MARTIN J (MJ) | 1
HAIDER, MASOOM A (MA) | JONES, STEVEN J M (SJ) | 1
GRUNFELD, EVA (E) | SADAR, MARIANNE D (MD) | 1
GOLDOWITZ, DAN (D) | WASSERMAN, WYETH W (WW) | 1
GOLDOWITZ, DAN (D) | SIMPSON, ELIZABETH M (EM) | 1
GOLDOWITZ, DAN (D) | RAYMOND, LYNN A (LA) | 1
GOLDOWITZ, DAN (D) | OUELLETTE, B F (BF) | 1
FENSTER, AARON (A) | SADAR, MARIANNE D (MD) | 1
FENSTER, AARON (A) | HAIDER, MASOOM A (MA) | 1
FENSTER, AARON (A) | GRUNFELD, EVA (E) | 1
EARLE, CRAIG C (CC) | HAIDER, MASOOM A (MA) | 1
DICK, JOHN E (JE) | STEIN, LINCOLN D (LD) | 1
DANCEY, JANET (J) | YAFFE, MARTIN J (MJ) | 1
DANCEY, JANET (J) | HAIDER, MASOOM A (MA) | 1
DANCEY, JANET (J) | EARLE, CRAIG C (CC) | 1
CRAIG, ANN MARIE (AM) | WASSERMAN, WYETH W (WW) | 1
CRAIG, ANN MARIE (AM) | SNUTCH, TERRANCE P (TP) | 1
CRAIG, ANN MARIE (AM) | SIMPSON, ELIZABETH M (EM) | 1
CRAIG, ANN MARIE (AM) | SIDHU, SACHDEV S (SS) | 1
CRAIG, ANN MARIE (AM) | MURPHY, TIMOTHY H (TH) | 1
CRAIG, ANN MARIE (AM) | MCPHERSON, JOHN D (JD) | 1
CRAIG, ANN MARIE (AM) | LEAVITT, BLAIR R (BR) | 1
CRAIG, ANN MARIE (AM) | FENSTER, AARON (A) | 1
BOUTROS, PAUL C (PC) | WOUTERS, BRADLY G (BG) | 1
BOUTROS, PAUL C (PC) | STEIN, LINCOLN D (LD) | 1
BOUTROS, PAUL C (PC) | HONER, WILLIAM G (WG) | 1
BOUTROS, PAUL C (PC) | HAYDEN, MICHAEL R (MR) | 1
BELL, JOHN C (JC) | WOUTERS, BRADLY G (BG) | 1
BELL, JOHN C (JC) | LEAVITT, BLAIR R (BR) | 1
BELL, JOHN C (JC) | HAYDEN, MICHAEL R (MR) | 1

Table 5-4. Five Leading Journals from Genetics, Oncology and Psychiatry Selected for Validation. Five journals were selected from each of three different research categories in the ISI Web of Knowledge, by number of publications as well as for specificity to the subject category. The top 1000 authors from each of these journals, ranked by total publications in the journal after 2005, were selected to form our validation set.

Genetics: BMC genomics; PLoS genetics; Human molecular genetics; American journal of medical genetics; Genetics
Oncology: Cancer research; Journal of clinical oncology : official journal of the American Society of Clinical Oncology; Anticancer research; BMC cancer; International journal of radiation oncology, biology, physics
Psychiatry: The American journal of psychiatry; Archives of general psychiatry; Biological psychiatry; Journal of neurology, neurosurgery, and psychiatry; The British journal of psychiatry : the journal of mental science

Table 5-5. Precision and Recall of the Domain-specific Relationships from MeSH Profile Overlap. Performance measures when examining relationships between authors where at least one of the top 20 MeSH terms of their MeSHOPs overlaps.

Measure     Genetics  Oncology  Psychiatry
Precision   0.62      0.79      0.76
Recall      0.14      0.15      0.15

Table 5-6. Performance of Author Similarity for Separating CSHL Speakers.
Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores demonstrate the accuracy of profile similarity comparison at differentiating speakers from the same CSHL meeting from speakers from different meetings. This compares 20 speakers from each of four different CSHL meetings.

Metric                 CSHL Meeting Speakers
JACCARD                0.76
ALLBITS x ONEBITS      0.67
ONEBITS                0.65
ALLBITS                0.60
Top 20 MeSH Overlap    0.64

Chapter 6: Conclusion

Overview of the Thesis Work

It is generally understood that, in its simplest terms, a thesis is a description of the generation of a hypothesis and its subsequent validation. As is the case for so many theses, this work can also be considered a scientific journey and its narration. The journey began in the Introduction with the idea that the ever-increasing body of scientific literature is becoming so vast (over 20 million abstracts in PubMed) that a better method is needed to mine it efficiently for relevant information. Taking cues from bioinformatics, and leveraging the excellent heading assignments made by the scientific database curators, this thesis describes how I devised a method that employs computational biology and statistics to meaningfully determine and visualize associations between two entities of interest. Chapter 2 describes how MeSHOPs were generated for genes, diseases and vitamins in order to gather any novel relationships within each group. The implementation and visualization of the computed relationships within each group are described, showing that the technique is useful for determining new relationships that would otherwise have been too overwhelming to find by conventional methods. With the MeSHOP technique having demonstrated both its utility and its validity, Chapter 3 represents the logical step of using MeSHOPs to explore gene-disease relationships.
We apply MeSHOPs to analyse the relationships between genes and diseases in the domains of diabetes and pancreatic cancer. In both cases the MeSHOP methodology supported existing research and predicted novel relationships, providing a valuable counterpoint to conventional methodology. In Chapter 4, I further refine the MeSHOP technique by using drugs and diseases as the principal discovery entities. There, I devised ways to deal with annotation bias in order to provide a more relevant description of present and predicted drug-disease interactions, with clear implications of the latter for drug discovery. Chapter 5 represents a more personal application of MeSHOPs, finding associations between authors and their respective scientific pursuits. This chapter demonstrated that some scientifically or medically relevant findings may be borne out by looking at the works of authors in related fields. In addition, this technique also indicates how some researchers may be of value to a new group or collaboration, as their MeSHOP indicates a contribution to a particular field or discipline that is not necessarily obvious. These five chapters combined represent the synthesis, validation and demonstration of a novel technique for mining various databases to derive not only useful information but also new discoveries from work that, ironically, has already been done. The utility, the time and effort saved, and the highlighting of new directions to explore, all based on the MeSHOP approach, are described in this thesis.

Highlights of MeSHOP Research

The thesis focuses on the statistical over-representation of biomedical literature annotations for a given entity, captured in a data vector termed a MeSH Over-representation Profile (MeSHOP). The information in these profiles can be compared to accurately predict novel associations between entities.
The substantial knowledge sequestered in research articles is shown here to be computationally accessible through the rich manual curation of biomedical annotation provided in the MEDLINE database. Within the thesis we demonstrate that it is feasible to utilize the rich MeSH annotations across the entirety of the biomedical knowledge library, and to construct MeSHOPs and compare them for entire classes of entities such as drug compounds, human genes, diseases and authors. We quantitatively compare MeSHOPs for entities as numerical vectors, and evaluate 16 measures of MeSHOP similarity, including three annotation-based baselines, over five validation sets. Euclidean distance of shared terms is found to be the best performing method across multiple data collections. The most effective MeSHOP similarity comparison metrics achieve an improvement in accuracy of over 16% relative to the best of these baselines. We confirm for drug-disease relationships the presence of a strong annotation bias in both the similarity scores and the validation sets. We formulate a method both to measure the degree of annotation of the entities and to compensate for the bias in the similarity scores. We develop and evaluate efficient vector comparison methods to handle the challenge of comparing on the order of 10^6 authors.

Future Directions

MeSHOPs demonstrate that the knowledge encoded in biomedical citations can be efficiently extracted to elucidate information for end-users as well as to infer novel relationships. As an example of high-throughput, comprehensive coverage of entities, we envision up-to-date biomedical annotation profiles eventually being readily available or computable for any combination of entity bibliography and biomedical vocabulary. These profiles would give researchers a rapid, easily visualized model of entity properties. Prevalence and widespread availability of annotation profiles could be used to inform biomedical search.
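To make the comparison step summarised in the highlights concrete, the following sketch computes a Euclidean distance restricted to the terms two profiles share. It assumes that each profile maps a MeSH term to a p-value and that terms are scored as -log10(p). The scoring transform, the function name and the example profiles are assumptions made for illustration rather than the thesis's exact formulation.

```python
from math import log10, sqrt


def shared_term_distance(profile_a, profile_b, floor=1e-300):
    """Euclidean distance over the terms present in both profiles.

    Each p-value is converted to a score of -log10(p), with p clamped to
    `floor` so that a p-value of zero does not break the logarithm.
    Returns None when the profiles have no term in common.
    """
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return None

    def score(p):
        return -log10(max(p, floor))

    return sqrt(sum((score(profile_a[t]) - score(profile_b[t])) ** 2 for t in shared))


# Hypothetical profiles for a gene and a disease.
gene = {"Insulin": 1e-8, "Pancreas": 1e-3, "Mice": 0.5}
disease = {"Insulin": 1e-6, "Pancreas": 1e-3, "Obesity": 1e-4}
distance = shared_term_distance(gene, disease)
```

Only "Insulin" and "Pancreas" contribute here, since the unshared terms are ignored. One consequence of this design is that a pair sharing very few terms can appear close, which is one way the degree of annotation can influence similarity scores.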
MeSHOPs bring knowledge apparent to expert researchers to those without extensive knowledge in a particular field. For example, a researcher confronted with a list of genes and an interest in a specific topic represented in the MeSH hierarchy can immediately retrieve a ranked list of genes based on the enrichment of the given MeSH annotation in their profiles. In the same way, MeSHOPs allow clinicians to obtain an overview of the rapidly-changing landscape of biomedical research literature. They can provide clinicians with a view of the most pertinent topics covered in the recent research for an entity of interest, such as a disease diagnosis, a drug being prescribed, or the results of a genetic test. The keywords highlighted in a MeSHOP can be combined with search technologies to identify specific articles related to the entity of interest. By supplying a MeSHOP for a topic of interest, researchers could retrieve newly published articles which have MeSH annotations overlapping with the query, ranked by the importance of the individual query terms.

MeSHOPs demonstrate a direct, single term-by-term quantitative analysis of the association of a term with the entity of interest. Machine learning techniques could be used to learn the relative importance of all the terms in the hierarchy, by measuring their ability to predict future annotation. Dimensionality reduction techniques such as principal component analysis could be used to combine terms and reduce the dimensionality of the profiles. Expanding on our methods of filtering general terms in the presence of more specific terms, the significance analysis could also incorporate the relationships in the hierarchy of the vocabulary directly in the computation.

MeSHOP analysis at this point considers the importance of each article to be the same. It has been argued that articles that deal with a wide variety of topics or entities are of lesser importance due to their lack of specificity.
A large number of citations of an article may be an indicator of the importance of the article with respect to the topics it covers. As more recent research supersedes past results, emphasis could also be placed on terms arising in more recent articles. MeSHOPs could therefore be weighted to emphasise the differing levels of importance of the articles. The publication date and the degree of annotation for an article could be used to assign a recency and generality score for articles, and the citation network can be used to identify the most highly cited and connected articles. This could be incorporated directly using data from existing databases and metrics that estimate author, article or journal importance. Alternatively, just as MeSHOPs can be equally applied to biomedical topics or biomedical authors, existing methods of determining author importance in bibliometric knowledge networks could be adapted to examine the association of biomedical topics.

Moving beyond a static reporting system, the process could be interactive, allowing the researcher to provide their own expertise and knowledge. Based on their experience, other prior knowledge (confidence in certain replicated experiments, or the removal of retracted, dubious or inapplicable results) could be incorporated in the analysis.

MeSHOP comparison demonstrates that similarities in the profiles can be used to link previously unassociated entities. One natural extension would be to apply existing machine learning techniques, adapted for the extremely large scale of available biomedical knowledge. This remains, however, a significant step away from being able to computationally reason and hypothesize about the connections.

Emerging in informatics research are methods for semantic knowledge representation. One vision for the next generation internet has entities associated with properties that can be computationally manipulated and logically reasoned with in an algorithmic fashion.
A grand challenge in the transition to a semantic internet is converting knowledge into a computable form. The MeSHOP procedures developed in this thesis may provide a suitable means to develop approximations to the semantic methods using the available annotations within the domain of biomedical informatics. Ultimately, with the availability of semantic knowledge linked to concepts, MeSHOP comparisons could one day support the quantitative predictions and hypotheses with plausible reasoning: not merely connecting disparate topics, but opening the “black box” so that the predictions are supported by the reasoning behind them.

MeSHOPs enable an objective, comprehensive view of the literature for any entity of interest, applying the indexed subject terms already present to inform about past and present research. As our knowledge grows more comprehensive and diverse, techniques that allow us to understand the important themes become increasingly important. As well, we demonstrate that biomedical annotation knowledge can computationally inform hypotheses, while also providing a bibliometric view of the biases and directions of research. Using modern computational resources and techniques, MeSHOPs unlock the entirety of biomedical annotation knowledge as an accessible, information-rich resource.
