UBC Theses and Dissertations

Semantic models in biomedicine : building interoperating ontologies for biomedical data representation… Courtot, Melanie 2014

Semantic models in biomedicine: Building interoperating ontologies for biomedical data representation and processing in pharmacovigilance

by

MELANIE COURTOT

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

May 2014

© MELANIE COURTOT 2014

Abstract

It is increasingly challenging to analyze the data produced in biomedicine, even more so when relying on manual analysis methods. My hypothesis is that using a common representation of knowledge, implemented via standard tools and logically formalized, can make those datasets computationally amenable, help with data integration from multiple sources, and make it possible to answer complex queries. The first part of this dissertation demonstrates that ontologies can be used as common knowledge models, and details several use cases where they have been applied to existing information in the domain of biomedical investigations, clinical data and vaccine representation. In the second part, I address current issues in developing and implementing ontologies, and propose solutions to make ontologies and the datasets they are applied to available on the Semantic Web, increasing their visibility and reuse. The last part of my thesis builds upon the first two and applies their results to pharmacovigilance, specifically to the analysis of reports of adverse events following immunization. I encoded existing standard clinical guidelines from the Brighton Collaboration in the Web Ontology Language (OWL) in the Adverse Events Reporting Ontology (AERO), which I developed within the framework of the Open Biological and Biomedical Ontologies Foundry. I show that it is possible to automate the classification of adverse events using the AERO with very high specificity (97%). I also demonstrate that AERO can be used with other types of guidelines.
Finally, my pipeline relies on open and widely used data standards (Resource Description Framework (RDF), OWL, SPARQL) for implementation, making the system easily transposable to other domains. This thesis validates the usefulness of ontologies as semantic models in biomedicine, enabling automated, computational processing of large datasets. It also fulfills the goal of raising awareness of semantic technologies in the clinical community of users. Following my results, the Brighton Collaboration is moving towards providing a logical representation of their guidelines.

Preface

• In Chapter 3, a version of section 3.2 was published as “Ryan R Brinkman, Mélanie Courtot, Dirk Derom, Jennifer M Fostel, Yongqun He, Phillip Lord, James Malone, Helen Parkinson, Bjoern Peters, Philippe Rocca-Serra, et al. Modeling biomedical experimental processes with OBI. J Biomed Semantics, 1(Suppl 1):S7, 2010”. I was the core developer in charge of most development in the OBI consortium at this time, and participated in development of the general framework as well as implementation of the models. I produced the release OWL file on which the manuscript is based. I participated in the implementation of all the use cases and in addressing their representational needs within OBI and IAO, as a core developer of both those resources. My work focused on the neuroscience investigation (Use case 1) in collaboration with Dirk Derom and Alan Ruttenberg, as well as the vaccine protection investigation (Use case 2) in collaboration with Yongqun He. I reviewed and edited the manuscript. A version of section 3.3 was published as “Yongqun He, Zuoshuang Xiang, Thomas Todd, Mélanie Courtot, RR Brinkman, Jie Zheng, Christian J Stoeckert Jr, James Malone, Philippe Rocca-Serra, Susanna-Assunta Sansone, et al. Ontology representation and ANOVA analysis of vaccine protection investigation.
In Bio-Ontologies 2010: Semantic Applications in Life Sciences, 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB): 2010; Boston, MA, USA. August 11, volume 13, page 4, 2010.” I participated in the implementation of the use cases in VO, OBI and IAO. Yongqun He, Zuoshuang Xiang and Thomas Todd applied it to the Brucella case. I reviewed and edited the manuscript.

• Portions of Chapter 4 were prepared for submission as “Yongqun He, Zuoshuang Xiang, Lindsay Cowell, Alexander D. Diehl, Harry Mobley, Bjoern Peters, Alan Ruttenberg, Richard H. Scheuermann, Ryan R. Brinkman, Mélanie Courtot, Chris Mungall, Fang Chen, Thomas Todd, Lesley Colby, Howard Rush, Trish Whetzel, Mark A. Musen, Brian D. Athey, Gilbert S. Omenn, Barry Smith. VO: Vaccine Ontology”. I participated in the development of the resources, developer discussions, calls and meetings, as well as manuscript preparation and editing. I was a core developer of the Vaccine Ontology (VO), and participated actively in establishing the original framework in terms of classes and relations. I directly contributed to all terms described in this chapter, amongst others. I formalized knowledge for Canadian vaccines, while my collaborators added US ones. Edits to the OWL file were done by Yongqun Oliver He following our discussions. VIOLIN and literature-based mining were done at the University of Michigan. Permission to reproduce parts of this paper for the purpose of this thesis was obtained from all co-authors.

• A version of Chapter 5 was published as “Philippe Rocca-Serra, Alan Ruttenberg, Martin J O’Connor, Patricia L Whetzel, Daniel Schober, Jay Greenbaum, Mélanie Courtot, Ryan R Brinkman, Susanna Assunta Sansone, Richard Scheuermann, et al. Overcoming the ontology enrichment bottleneck with quick term templates. Applied Ontology, 6(1):13-22, 2011”, and is reprinted with permission from IOS Press.
I was a core developer of the OBI consortium and contributed extensively to all aspects of development, including Quick Term Template. Philippe Rocca-Serra led this work, to which I contributed with 6 other authors. All authors (11 + the consortium) participated in the manuscript preparation.

• In Chapter 6, a version of section 6.2 was published as “Mélanie Courtot, Chris Mungall, Ryan R. Brinkman, and Alan Ruttenberg. Building the OBO Foundry - one policy at a time. In Proceedings of the International Conference on Biomedical Ontology (ICBO2011), 2011”. I worked in collaboration with Alan Ruttenberg and Chris Mungall on devising and implementing the policies described. I wrote the original draft of the ID specification and the documentation for the common metadata scheme, and was the lead developer of MIREOT. I drafted the original manuscript. A version of section 6.3 was published as “M. Courtot, F. Gibson, A.L. Lister, J. Malone, D. Schober, R.R. Brinkman and A. Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23-33, 2011”, and is reprinted with permission from IOS Press. In collaboration with Alan Ruttenberg, I articulated the problems, devised the guidelines supporting the methodology and provided an implementation of the specification. I drafted the original manuscript. A version of section 6.4 was published as “Zuoshuang Xiang, Mélanie Courtot, Ryan R Brinkman, Alan Ruttenberg and Yongqun He. Ontofox: web-based support for ontology reuse. BMC research notes, 3(1):175, 2010”. Ontofox implements the MIREOT mechanism I developed. I participated in the system development via initial prototype development, discussion, testing, feedback and suggestions.
I contributed extensively to editing of the original draft manuscript. Zuoshuang Xiang was in charge of the server implementation and maintenance.

• A version of Chapter 7 was prepared for submission for peer-review publication as “Zuoshuang Xiang, Mélanie Courtot, Chris Mungall, Alan Ruttenberg, and Yongqun He. Ontobee: A Linked Data Server for OWL Ontology Terms”. Ontobee implements the dereferencing prototype mechanism I developed with Alan Ruttenberg. I identified issues, reviewed existing work and developed the original prototype for publication of OBO ontologies on the Semantic Web (Linked Ontology Data) on which Ontobee is based. I participated in Ontobee’s development via discussion, testing, feedback and suggestions. I contributed extensively to editing of the original draft manuscript. Zuoshuang Xiang was in charge of the server implementation and maintenance. Figures were produced by Yongqun He, with the exception of Figure 7.6, which I made with Alan Ruttenberg.

• Parts of Chapter 8 were published as “Mélanie Courtot, Jie Zheng, Chris Stoeckert, Ryan Brinkman and Alan Ruttenberg. Diagnostic criteria and clinical guidelines standardization to automate case classification. Proceedings of the International Conference on Biomedical Ontologies (ICBO) 2013.” and “M. Courtot, R. R. Brinkman, and A. Ruttenberg. The logic of surveillance guidelines: an analysis of vaccine adverse event reports from an ontological perspective.” In both cases I performed the ontology development and drafted the manuscript. Jie Zheng implemented the Malaria use case described in Section 8.5.

• A version of Chapter 9 was accepted by PLoS ONE on February 25th as “Mélanie Courtot, Ryan R. Brinkman, and Alan Ruttenberg. The logic of surveillance guidelines: An analysis of vaccine adverse event reports from an ontological perspective”.
I collected the datasets, performed the experiments, analyzed the data and wrote the manuscript draft.

• No ethics approval was required for this research, as confirmed by the UBC BCCA Research Ethics Board and supported by article 2.4 of the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans document [1], which states that “REB review is not required for research that relies exclusively on secondary use of anonymous information, or anonymous human biological materials, so long as the process of data linkage or recording or dissemination of results does not generate identifiable information.”

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Abbreviations
Acknowledgements
Dedication
1 Overview
  1.1 Research questions
  1.2 Contributions and impact
2 Background
  2.1 Adverse Events Following Immunization (AEFIs)
    2.1.1 What is an adverse event?
    2.1.2 What may cause an adverse event?
    2.1.3 When are adverse events reported?
  2.2 Brighton Collaboration
    2.2.1 Brighton publications
    2.2.2 Automatic Brighton Classification (ABC) tool
    2.2.3 Anaphylaxis according to Brighton
  2.3 Ontologies
  2.4 The OBO Foundry
  2.5 OWL and the Semantic Web
    2.5.1 The Semantic Web
    2.5.2 Components of an ontology
3 Representing biomedical investigations
  3.1 Introduction
  3.2 The Ontology for Biomedical Investigations (OBI)
    3.2.1 Use case 1: Neuroscience investigation
    3.2.2 Use case 2: Vaccine protection investigation
    3.2.3 Discussion
  3.3 Ontology representation and ANOVA analysis of Brucella vaccine protection investigation
    3.3.1 Methods
    3.3.2 Results
  3.4 Conclusion
4 Representing vaccine data
  4.1 Introduction
  4.2 Vaccine ontology overview
  4.3 Specific terms defined in the vaccine ontology
    4.3.1 VO definition of the term ‘vaccine’
    4.3.2 VO definition of the term ‘vaccination’
    4.3.3 VO representation of immune response to a vaccine
  4.4 Vaccine ontology applications
    4.4.1 Naming vaccine-specific terms
    4.4.2 Vaccine data exchange and integration
    4.4.3 Development of vaccine knowledgebase and semantic web
    4.4.4 VO-based literature mining
  4.5 Discussion
  4.6 Conclusion
5 Semi-automated ontology building using design patterns
  5.1 Introduction
  5.2 Methodology and results
    5.2.1 Step 1: Develop the representation of the parent class
    5.2.2 Step 2: Derive tabular Quick Term Template
    5.2.3 Step 3: Domain experts populate the template
    5.2.4 Step 4: Submission processing
  5.3 Implementation
  5.4 Conclusion
6 Working with large biomedical resources
  6.1 Introduction
  6.2 OBO Policies
    6.2.1 Common unique identifier policy
    6.2.2 Improving documentation by sharing metadata through the Information Artifact Ontology (IAO)
    6.2.3 Discussion
  6.3 The Minimum Information to Reference an External Ontology Term (MIREOT) guideline
    6.3.1 Introduction
    6.3.2 Policy
    6.3.3 Implementation
    6.3.4 Discussion
  6.4 OntoFox
    6.4.1 Introduction
    6.4.2 Methods
    6.4.3 Results
    6.4.4 Discussion
  6.5 Conclusion
7 Publishing biomedical resources on the Semantic Web
  7.1 Introduction
    7.1.1 Requirements
    7.1.2 Previous work
  7.2 Implementation
    7.2.1 Overview
    7.2.2 Access to descriptions of entities referred to by term IRIs
    7.2.3 Use of PURLs
    7.2.4 Ontology retrieval and preprocessing
    7.2.5 Retrieval of information about a term
    7.2.6 Generation of RDF and HTML outputs
    7.2.7 Search
  7.3 Results
    7.3.1 RDF
    7.3.2 Web interface
    7.3.3 Search
    7.3.4 Scalability
    7.3.5 Community adoption
    7.3.6 Evaluation
    7.3.7 Future work
8 Representing pharmacovigilance data
  8.1 Introduction
  8.2 Rationale for Adverse Events Reporting Ontology (AERO) and development practice
  8.3 Guideline representation and evaluation in AERO
    8.3.1 Adverse event class
    8.3.2 Application of the guidelines
    8.3.3 Guidelines
    8.3.4 Anaphylaxis representation
  8.4 The has component relation
  8.5 The World Health Organization (WHO) severe malaria guideline representation
  8.6 Results
  8.7 Discussion
  8.8 Conclusion
9 Automated adverse events classification
  9.1 Introduction
  9.2 AERO ontology
    9.2.1 Assessment pipeline
  9.3 VAERS dataset
  9.4 Data loading and processing
  9.5 Brighton classification results
  9.6 Automated case screening
  9.7 Discussion
    9.7.1 Using an OWL-based approach
    9.7.2 Limitations of the results
    9.7.3 Formalization of the case definition
    9.7.4 Time gain in signal detection
    9.7.5 Use of the ontology for reporting
    9.7.6 Going forward: proposed implementation
  9.8 Conclusion
10 Conclusion and future directions
  10.1 Summary
  10.2 Perspectives and future work
    10.2.1 Coordinated maintenance of resources
    10.2.2 Evolution of the AERO
    10.2.3 Implementation in reporting systems
    10.2.4 Application to other guidelines and other domains
    10.2.5 Data integration and text-mining
  10.3 Conclusion
Bibliography
Appendices
  A Canadian Adverse Events Following Immunization Surveillance System (CAEFISS) sample data
  B Vaccine Adverse Event Reporting System (VAERS) sample data
  C OBO Foundry principles
  D SPARQL query for FluMist vaccine
  E List of IAO annotation properties
  F The anaphylactic reaction Standardised MedDRA Query
  G The list of significant MedDRA terms
  H Seeker collaboration

List of Tables

2.1 Potential immune-mediated reactions to vaccines
2.2 Case definition of anaphylaxis
2.3 Major and minor criteria used in the case definition of anaphylaxis
3.1 Ontology terms used in the use cases
3.2 Ontology terms for 17 variables in the Brucella vaccine protection assay
4.1 VO enhanced literature search
5.1 A basic QTT for submitting an analyte assay term request
6.1 The 15 source ontologies currently available in OntoFox
6.2 OntoFoxed ontologies in VO
7.1 Summary of selected ontologies available in Ontobee
9.1 Classification results
9.2 Comparison of different classification methods
9.3 Contingency table per MedDRA term

List of Figures

2.1 Reporting pipeline in Canada
2.2 The vaccine approval process
2.3 Details of some guidelines for anaphylaxis assessment
2.4 OBO Foundry coverage
2.5 Meaning of URIs
2.6 The Semantic Web architecture
2.7 RDF triple
3.1 OBI modeling of a single trial in the neuroscience study (a fragment)
3.2 OBI modeling of vaccine protection investigation (a fragment)
3.3 Representation of ANOVA analysis process
3.4 Representation of a protection assay with Brucella vaccine RB51
4.1 Representation of vaccination (VO_0000002) using VO and OBI
4.2 Hierarchy of vaccine-induced immune response in VO
4.3 Comparison of Afluria and FluMist influenza vaccines using VO
5.1 Overview of the process using the Ontology for Biomedical Investigations (OBI) modeling of the class ‘analyte assay’
5.2 OWL restrictions that logically define the analyte assay class in OBI
5.3 Analyte assay class in OBI
5.4 Template expressions in MappingMaster’s DSL
6.1 Diagram of the MIREOT mechanism as implemented by OBI
6.2 Template SPARQL query
6.3 Template SPARQL query for import from the NCBI taxonomy database
6.4 Screenshot of the Protege editor
6.5 OntoFox retrieval of the term ‘homo sapiens’
6.6 OntoFox retrieval of PATO term ‘volume’ and its annotations
6.7 OntoFox algorithm for extracting computed intermediate classes
6.8 OntoFox demonstration of the includeComputedIntermediates setting
6.9 OntoFox SPARQL-based algorithm for retrieval of related terms
7.1 Ontobee system architecture design
7.2 An Ontobee RDF output file
7.3 Ontobee HTML rendering of the VO term ‘vaccine’ as seen in Firefox
7.4 Demonstration of the search capability and the HTML rendering
7.5 Records of Ontobee daily web page visitors according to Google Analytics
7.6 Comparison of Ontobee features vs. other tools
8.1 The disorder hierarchy as built in AERO
8.2 Entities represented in patient examination and recording of findings
8.3 Details of the implementation of the level 1 of anaphylaxis according to Brighton
8.4 Implementation of the WHO severe malaria guideline
9.1 Automatic case classification according to the Brighton criteria
9.2 The elements of an assessment of anaphylaxis according to Brighton as implemented in AERO
9.3 Class hierarchy excerpt in the AERO
9.4 Cosine similarity Receiver Operating Characteristic (ROC) curve
9.5 Time gain using the ontology-based method
10.1 Diagnosis confirmation

Abbreviations

ABC Automatic Brighton Classification
AEFI Adverse Event Following Immunization
AEO Adverse Events Ontology
AERO Adverse Events Reporting Ontology
AERS Adverse Event Reporting System
AUC Area Under the Curve
BCCD Brighton Collaboration Case Definition
BCPNN Bayesian Confidence Propagation Neural Network
BFO Basic Formal Ontology
CAEFI Canadian Adverse Event Following Immunization
CAEFISS Canadian Adverse Events Following Immunization Surveillance System
CARO Common Anatomy Reference Ontology
CDC Centers for Disease Control and Prevention
CFEP Canadian Field Epidemiology Program
ChEBI Chemical Entities of Biological Interest
CL Cell Type
DC Dublin Core
EBS Empiric Bayesian Screening
EDC Electronic Data Capture
EHR Electronic Health Record
FDA US Food and Drug Administration
FMA Foundational Model of Anatomy
FOIA Freedom of Information Act
GO Gene Ontology
GPS Gamma Poisson shrinkage
IAO Information Artifact Ontology
IC Information component
IDO Infectious Disease Ontology
ILI Influenza-like Illness
IMPACT Immunization Monitoring Program ACTive
InfluenzO Influenza Ontology
IR Information Retrieval
IRI Internationalized Resource Identifier
LOD Linked Open Data
MedDRA Medical Dictionary for Regulatory Activities
MGPS Multi-gamma Poisson shrinker
MIREOT Minimum Information to Reference an External Ontology Term
MMR Measles, Mumps, and Rubella
MP Mammalian Phenotype Ontology
MSSO Maintenance and Support Services Organization
NCBO National Centre for Biomedical Ontology
NEMO Neural ElectroMagnetic Ontologies
NIAID/FAAN National Institute of Allergy and Infectious Diseases/Food Allergy and Anaphylaxis Network
NIF Neuroscience Information Framework
NLP Natural Language Processing
OAE Ontology of Adverse Events
OBI Ontology for Biomedical Investigations
OBO Open Biomedical Ontologies
OGMS Ontology for General Medical Science
OMRE Ontology of Medically Relevant Entities
OWL Web Ontology Language
PATO Phenotypic Quality Ontology
PCIRN PHAC/CIHR Influenza Research Network
PHAC Public Health Agency of Canada
PRO Protein Ontology
PRR Proportional Reporting Ratios
PHSA Provincial Health Services Authority
QTT Quick Term Template
RDF Resource Description Framework
RDFS Resource Description Framework (RDF) Schema
RIF Rule Interchange Format
RO Relations Ontology
ROC Receiver Operating Characteristic
ROR Reporting Odds Ratios
RR Relative Risk
RRR Relative Report Rate
SKOS Simple Knowledge Organization System
SMQ Standardised MedDRA Query
SNOMED-CT Systematized NOmenclature of MEDicine Clinical Terms
SPARQL SPARQL Protocol and RDF Query Language
SO Sequence Ontology
UO Unit Ontology
URI Uniform Resource Identifier
VAERS Vaccine Adverse Event Reporting System
VENICE Vaccine European New Integrated Collaboration Effort
VO Vaccine Ontology
WHO World Health Organization

Acknowledgements

I would like to thank my supervisor, Dr. Ryan Brinkman, who encouraged me to start graduate studies after having worked as an engineer in his lab. I am extremely grateful for his continuous help and support, both personally and professionally.

This thesis would not have been possible without contributions from my thesis committee, Drs Paul Pavlidis, Margaret-Anne Storey and Raymond Ng, and members of my defence examining committee: Dr Haydn Pritchard (chair), Drs Julie Bettinger and Rachel Pottinger (university examiners) and Dr. Pascal Hitzler (external reviewer). Thanks to Dr. Mark Wilkinson for his support on multiple occasions, and his participation in the committee at initial stages.

I have been immensely blessed to have the opportunity to work with amazingly talented individuals, who all spent time discussing and explaining their area of expertise: thanks to all my collaborators. I am especially thankful to Alan Ruttenberg for his ongoing advice and collaboration, helping me improve my projects on multiple occasions, and introducing me to other areas and types of problems in the general field of data sharing and query answering.
I also wish to acknowledge my colleagues Drs Jie Zheng, James Malone, Bjoern Peters, Christian Stoeckert, William Bug, Chris Mungall, Barry Smith, Albert Goldfain, Richard Scheuermann and Lindsey Cowell for their collaboration on various ontology development projects. Drs Robert Pless, Jan Bonhoeffer, Barbara Law, Jean-Paul Collet and Ms Julie Laflèche helped me understand vaccine safety aspects and supported my work on multiple occasions.

Thanks to Dr Nicolas Le Novère for giving me the first opportunity to work on knowledge representation and develop the Systems Biology Ontology, as well as my other past supervisors, Drs Franc Pattus, Renaud Wagner and Christos Ouzounis. You all showed me what research could be like and inspired me to take the leap and go back to school.

This thesis was partly supported by funding from the Public Health Agency of Canada/Canadian Institutes of Health Research Influenza Research Network (PCIRN), and the Michael Smith Foundation for Health Research.

Finally, thanks to my family for their unrelenting support: my parents, Jean-Claude and Claudine, my sister Julie, my partner Brian and my daughter Hannah.

To Hannah

Chapter 1
Overview

Assessment of pharmacovigilance data is a largely manual, time-consuming process [2]. Additionally, analysis of large datasets, such as those in current reporting systems, can be challenging [3]. As a result, rapid detection of safety issues can be hampered by the methods used for surveillance, even more so when a large volume of data is being collected, as in the 2009-2010 H1N1 pandemic. In that context, I hypothesized that ontologies and Semantic Web technologies can be used to make biomedical research in general, and pharmacovigilance in particular, more accurate and reproducible.

1.1 Research questions

This dissertation is divided into three major sections.
The first section aims at providing a means of representing knowledge, specifically in the biomedical domain, using ontologies and the Web Ontology Language (OWL). The second section introduces some of the issues raised by developing multiple resources aimed at working together in supporting multiple applications, across a large consortium of ontology developers, the OBO Foundry. Finally, the third section relies on the other two and applies their findings to the domain of pharmacovigilance, leading to the development of the AERO and its application to automated classification of adverse events.

Specifically, I investigate and answer the following research questions:

1. Can ontologies be used to encode biomedical knowledge, and specifically biomedical investigations and pharmacovigilance, in a standard, unambiguous way, allowing semantic querying (i.e., be complex enough to encode all logical aspects while maintaining reasoning capabilities)?
• Can biomedical investigations and pharmacovigilance be accurately represented using ontologies? I hypothesized that a standard way of modeling information would improve description of experimental processes, and hence data comparison and integration.

2. What are the elements required for supporting large consortiums of ontology developers building compatible resources for publication on the Semantic Web?
• How can a suitable framework for development of collaborative, interoperating resources be provided?
• What are some of the issues encountered when working with those large interoperating biomedical resources, and how can they be overcome?

3. Can adverse event classification in pharmacovigilance be improved through the use of ontologies to automate the process?
• Can the logic of a pharmacovigilance clinical guideline be encoded as an ontology? Does a standard and logically formalized representation of the Brighton Collaboration case definitions enhance data quality and allow for automatic processing of adverse event reports? I hypothesized that a standard and logically formalized representation of the Brighton Collaboration case definitions would enhance data quality and allow for automatic processing of adverse event reports.
• Will establishing a mapping between this ontology and another resource (terminology, other ontology) used to annotate existing AE report datasets allow us to infer that the data is of the type of a specific ontology class (i.e., derive a diagnosis according to the selected guideline)? I hypothesized that using the AERO and a custom mapping, adverse event reports could be automatically classified according to a Brighton case definition.
• What is the efficiency of this classification? I hypothesized that the classification would be more efficient in terms of time and cost than classification performed by human review.

1.2 Contributions and impact

1. The first part of my thesis details how I solved representation issues in the biomedical domain in areas that formed the basis of my later work. I actively contributed to the development of seven ontologies addressing different kinds of problems and domains:
• The Ontology for Biomedical Investigations (OBI), which models investigations, including their plans and objectives, their realization by experimental processes, as well as participants involved,
• The Information Artifact Ontology (IAO), which addresses the need for representation of data and information entities, such as data items, directive information entities (including guidelines), e-records etc.,
• The Basic Formal Ontology (BFO), an upper-level ontology supporting analysis and integration,
• The Vaccine Ontology (VO), whose focus is on representation of vaccination and associated immunologic responses, as well as vaccines and vaccine components,
• The Infectious Disease Ontology (IDO) and the Ontology for General Medical Science (OGMS), which aim at representing infectious diseases and clinical data,
• The Adverse Events Reporting Ontology (AERO), an ontology representing guidelines used in pharmacovigilance.

I was in each case part of the core developers group and contributed significantly to building the resources, either the general framework, such as critical terms and relations between them, or specific parts, such as the representation of clinical guidelines in the context of the Ontology for General Medical Science (OGMS). I brought a pharmacovigilance perspective to this work, and generally worked on representation that I expected would contribute to my research goal. Representation work culminated with my creation of the AERO.

Chapter 3 describes how biomedical investigations can be modeled in a standard, unambiguous way which allows semantic querying. It details how some representation issues in the biomedical domain, in areas that formed the basis of the following chapters, were solved. Specifically, it presents three use cases that were modeled within the Ontology for Biomedical Investigations (OBI).
I participated in implementing all the use cases and in addressing their representational needs within OBI and IAO, as a core developer of both resources. My work focused on the neuroscience investigation (Use case 1) in collaboration with Dirk Derom and Alan Ruttenberg, as well as the vaccine protection investigation (Use case 2) in collaboration with Yongqun He. Larissa Soldatova was the main developer of Use case 3.

Chapter 4 introduces the Vaccine Ontology (VO), another resource related to my work towards pharmacovigilance data representation. Amongst others, details of the vaccination process and vaccine composition can be captured via the VO. I participated in the implementation of the use case in VO, OBI and IAO, while my collaborators at the University of Michigan applied it to the Brucella case.

Having a common, standardized representation of biomedical knowledge will improve the ability to exchange and integrate data, with the goal of answering complex queries across multiple data sources.

2. The content of an ontology is only part of what is necessary for adoption. The second part of my thesis concerns how to support development and dissemination of resources, such as ontologies, that are collaboratively constructed within a consortium of biomedical specialists. I used an emerging technology, the Semantic Web, as a publication medium, developing methods and practices enabling dissemination of these resources using Semantic Web technologies.
• I developed the MIREOT mechanism to make it feasible to work with parts of other ontologies, particularly when tools such as editors and reasoners could not effectively work with full versions of those ontologies.
• I co-designed OntoFox, a web server implementing the MIREOT mechanism through an accessible web interface.
• With one collaborator I developed the original prototype for publication of OBO ontologies on the Semantic Web (Linked Ontology Data). OBO format ontologies contain many essential terms which were previously not as easily usable, and which were unavailable for use in the Semantic Web.
• I was one of the designers of Ontobee, a server implementing this prototype, which is now the default server for terms from all OBO ontologies.
• I was one of the designers of Quick Term Template (QTT), which makes it easier for scientists to define many ontology classes whose definitions follow a common pattern by using common spreadsheet applications.

Chapter 5 details the rationale, design and implementation of the Quick Term Templates (QTT) tool, which allows semi-automated addition of multiple OWL classes to an ontology when those classes all adhere to the same design pattern.

Chapter 6 concerns how to support development and dissemination of resources, such as ontologies, that are collaboratively constructed within a consortium of biomedical specialists. It explores how large biomedical resources can be made practically (re)usable by the community of ontology developers. The MIREOT mechanism allowing reuse of terms from external resources, as well as its implementation within the OntoFox server, are described in Sections 6.3 and 6.4 respectively.

An emerging technology, the Semantic Web, is used as a publication medium, developing methods and practices that enable dissemination of these resources using Semantic Web technologies. Chapter 7 details how, after resources are built using the mechanisms detailed in earlier chapters, they can be published on the Semantic Web. The Ontobee server was developed to provide a human-friendly HTML interface as well as RDF for consumption by machines.

By enabling ontology building using Semantic Web technologies, my work helps fulfill goals of both communities, towards improving understanding of data semantics by machines.

3. In the third part of my thesis I describe my implementation of a system for adverse event classification in pharmacovigilance based on the approaches I developed in my earlier work. Specifically,
• I created the Adverse Events Reporting Ontology (AERO) with the goal of creating a more rigorous encoding of guidelines about Adverse Events Following Immunization (AEFIs), with the Brighton guidelines for Anaphylaxis forming the nucleus of this effort.
• I collected several adverse events datasets, and translated them into a Brighton-annotated format.
• I exported them as OWL documents represented using AERO and classified the adverse events using recently developed reasoners - a central component of Semantic Web technology.
• I validated my classification results against existing tools/gold standards.
• I developed a screening algorithm that is more efficient than those previously published.

Chapter 8 details the development of the Adverse Events Reporting Ontology (AERO) and how it allows logical translation of clinical guidelines used in pharmacovigilance, such as the Brighton Anaphylaxis guideline.

Chapter 9 describes the implementation of a system for adverse event classification in pharmacovigilance based on the approaches developed in earlier chapters. Specifically, several adverse events datasets were collected and translated into a Brighton-annotated format, then exported as OWL (Web Ontology Language) documents represented using AERO. Adverse events were then classified using recently developed reasoners - a central component of Semantic Web technology. Classification results were evaluated against existing tools/gold standards, and a screening algorithm that is more efficient than those previously published was developed.

Based on experience developing the system, and in collaboration with the Brighton Collaboration and the Public Health Agency of Canada (PHAC), ways to improve AEFI reporting standards and systems are proposed.
This last part of the thesis exemplifies how, in practice, in clinical settings and with a real-world example of importance to public health, semantic resources can help improve processing of the ever-increasing volume of collected data.

Chapter 2
Background

Pharmacovigilance focuses on safety of medicinal products, with the specific tasks of collecting, detecting, assessing, monitoring and preventing adverse events they may cause [4].

The thalidomide disaster of the 1960s [5] had a profound impact on drug safety assessment and regulatory aspects [6], and the WHO international monitoring of drug safety was established shortly thereafter. As of the end of 2010, 134 countries were part of the WHO pharmacovigilance program. While all practitioners agree on the importance of reporting adverse events in increasing public health safety, current methods used for spontaneous adverse event reporting are not sufficient, limiting their usefulness.

[Figure 2.1 diagram: adverse event reports pass from CLINIC to PROVINCIAL LEVEL to NATIONAL LEVEL (CAEFISS); MedDRA coding, diagnosis, vaccine manufacturer data and adverse event rate are shown at the national level.]
Figure 2.1: The adverse event reporting pipeline in Canada. Reports are entered at the clinic level, and forwarded to the provincial health agency. Reports are aggregated and sent to the CAEFISS database at the national level, where MedDRA codes are added and medical officers try to assess or confirm the diagnosis. Finally, based on information such as the number of doses manufactured, an adverse event rate is estimated.

In Europe, the Vaccine European New Integrated Collaboration Effort (VENICE) [7] group reports [8] that only 71% (17/24) of the countries have adopted a classification of AEFIs, and that the chosen classifications are heterogeneous: 38% WHO (see footnote 2) and 62% other or not specified. In the US, monitoring is done via the Adverse Event Reporting System (AERS) [9] and Vaccine Adverse Event Reporting System (VAERS) [10] systems for drugs and vaccines respectively.
In Canada, the Public Health Agency of Canada (PHAC) administers the Canadian Adverse Events Following Immunization Surveillance System (CAEFISS). In both countries, systems aggregate data at a national level and rely on the Medical Dictionary for Regulatory Activities (MedDRA) to encode adverse events. Several studies highlight the potential issues in using MedDRA for adverse event reporting, ranging from inaccurate reporting, as several terms are non-exact synonyms, to a lack of semantic grouping features, which impairs processing in pharmacovigilance [11, 12, 13, 14]. Additionally, in many systems only the adverse event code as determined by the system (e.g., resulting from parsing the textual input) is saved, and information about the signs and symptoms used in the determination of that code is lost. This limits the ability of analysts to review the set of symptoms observed in order to establish a consistent diagnosis. Finally, this code is not linked to any definition. This in turn may lead to heterogeneity in the diagnoses recorded [15] - physicians may have slightly different interpretations of what constitutes an anaphylactic reaction, for example, as shown in Figure 2.3.

The resultant lack of consistency limits the ability to query and assess important safety issues the resulting datasets might otherwise support.

Footnote 2: The reports can be difficult to even interpret - the WHO Adverse Reaction Terminology (WHO-ART) is a non-open terminology and only a 1997 version appears to be publicly visible. It lacks many terms that are essential for AE reporting, such as those related to seizure. More importantly, WHO-ART follows a 4-level structure similar to MedDRA, and therefore suffers some of the same defects.

2.1 Adverse Events Following Immunization (AEFIs)

2.1.1 What is an adverse event?

The Guidance for Clinical Safety Data Management: Definitions and Standards for Expedited Reporting [16] defines an adverse event as “Any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have to have a causal relationship with this treatment.” The guide then adds “An adverse event (AE) can therefore be any unfavourable and unintended sign (including an abnormal laboratory finding, for example), symptom, or disease temporally associated with the use of a medicinal product, whether or not considered related to the medicinal product.” The Report of Adverse Event Following Immunization (AEFI) user guide [17] from the Public Health Agency of Canada (PHAC) adheres to this definition and adapts it for AEFI reporting: “An AEFI is any untoward medical occurrence in a vaccinee which follows immunization and which does not necessarily have a causal relationship with the administration of the vaccine”. Not detailed in this statement is the additional fact that reporting guidelines often provide protocols for determining and reporting the likelihood that specific pathological processes have occurred, and that such protocols and reporting conventions differ from jurisdiction to jurisdiction, from investigation to investigation, and by symptom and severity. Therefore adverse events as recorded in reports, contrary to what might otherwise be presupposed, are not necessarily processes, are not necessarily of the type reports say they are, are not necessarily causally related to the intervention which led to them being reported, and the terms used to describe them are not necessarily univocal. In particular, adverse event is distinguished from adverse side effect, where the latter is of a type determined to be causally related to the intervention. This matches the usage made, for example, within the VAERS [10], which mentions “VAERS collects data on any adverse event following vaccination, be it coincidental or truly caused by a vaccine”.

2.1.2 What may cause an adverse event?

Different etiologies may be at play in causing adverse events.
The most obvious cause is the vaccination itself: either in terms of poor injection technique or stress generated by the process, which may in turn result in vaso-vagal events, including fainting or hypotonic/hyporesponsive episodes. Components of the vaccine themselves may also cause diverse reactions. For example, the Bacille Calmette-Guérin (BCG) vaccine has been shown to cause local swelling of the lymph nodes (suppurative lymphadenitis [18, 19]), even more so when more virulent strains were used for vaccination (such as with the vaccine BCG-Pasteur Intradermal P, Charge R 5520 [20]). The host immune response plays a major role in the occurrence of adverse events, as described in Table 2.1. The most common one is probably local inflammation due to the innate immune response, which results in redness and swelling at the injection site. The Arthus reaction [21], a type III hypersensitivity reaction, similarly causes redness and swelling, but with severe associated pain - it is linked to antigen deposit meeting high quantities of antibody in presensitized patients who already had circulating antibody. Systemic inflammatory responses such as fever, irritability, nausea, vomiting or general muscle aches can also occur; their etiology is less clear, though host factors (e.g., age, gender, genetics) seem to play a role in susceptibility. Type I hypersensitivity reactions include urticaria, angioedema and anaphylaxis - the latter being used as a case study throughout this thesis. As is shown in Table 2.1, it can be hard to distinguish between anaphylactoid reactions (non-IgE mediated, row 4) and true anaphylaxis reactions (IgE mediated, row 1). For example, a new type of adverse event, the oculo-respiratory syndrome (ORS), was identified in Canada in 2000 [22], and only skin testing [23] showed that it was not a type I (i.e., IgE mediated) hypersensitivity reaction.

Table 2.1: Potential immune-mediated reactions to vaccines. Adapted from [24]
Immune mediated reaction | Frequent clinical manifestation
IgE mediated | Urticaria, angioedema, rhinoconjunctivitis, bronchospasm, anaphylaxis, gastrointestinal disorders (diarrhea, abdominal cramping, vomiting)
Immune complex (IgG) | Vasculitis, myocarditis
T-cell mediated | Maculopapular exanthema, eczema, acute generalised exanthematous pustulosis (AGEP), erythema multiforme
Non-IgE mediated (pseudo-allergic) | Urticaria, angioedema, anaphylactoid reactions, gastrointestinal disorders
Autoimmune, inflammatory | Thrombocytopenia, vasculitis, polyradiculoneuritis, macrophagic myofasciitis, rheumatoid arthritis, Reiter’s syndrome, sarcoidosis (juvenile), bullous pemphigoid, lichen planus, Guillain-Barré syndrome, polymyalgia

2.1.3 When are adverse events reported?

Current guidelines [17] specify that events should be reported on the basis of their temporal association with the medical intervention. For example, in the case of AEFIs the reporting window depends on (i) the type of immunizing agent (30 days after live vaccine or 7 days after killed or subunit vaccine) or (ii) the biological mechanism (up to 8 weeks for immune-mediated events). Even though in some cases, and based on their personal experience, clinicians may think that some adverse events are most probably caused by the intervention, and even take action to guard the patient’s health based on this assessment, they nonetheless must report any event occurring in the corresponding time frame. In that way, records accumulated from many clinicians may be reviewed by safety committees, where evidence towards establishing causality will be reviewed and policy recommendations, based on unbiased evidence, can be made.

Reports of AEFIs are important elements in the assessment of safety of vaccines and play a major role in public health policy. For vaccination campaigns to be effective the general population needs to be adequately informed so that they maintain confidence in and trust individuals responsible for managing vaccination efficiency and safety [25]. As shown in Figure 2.2, prior to market approval, vaccines are rigorously tested for efficacy and safety through randomized clinical trials. However, the focus of those trials is efficacy, particularly in the case of widespread, easily transmissible infections such as influenza, where it is hard to fully assess safety due to the limited number of subjects. Additionally, these trials introduce multiple biases:
• They concentrate on a specific subset of the population, and often do not account for variability in gender, age, race, etc. as per their inclusion/exclusion criteria.
• They cannot detect rare adverse events, the cohort of subjects enrolled being restricted in size.
• They are limited in time, and will not be able to detect those events for which there is a longer onset period.

Effects in the larger population and in specific subpopulations such as children, pregnant women and the elderly can only be studied post-licensing. Chronic effects, or effects of concomitant administration of other drugs, become evident only after several years of surveillance. As a consequence, there is a need to encourage long-term, widespread post-licensing surveillance.
Generally, spontaneous reporting systems are used to monitor for adverse effects in the general population [26]. Each report includes information about adverse events that are at least temporally linked to the vaccination process. Some of those events are causally associated with the vaccine (e.g., it is known that reactions such as rash at the injection site are caused by injectable vaccines), some may or may not be related (e.g., patient experiencing a loss of consciousness 3h post-vaccination) and some are probably coincidental (e.g., worker being injured by metallic shard). Analysis of events in large collections of AEFI reports aims to identify signals highlighting differences in frequency of events after administration of a certain vaccine (e.g., a seasonal influenza vaccine), or in certain populations (e.g., children under the age of 2). When such signals are detected, health authorities use that information to prompt investigation of potential safety issues. Depending on their findings, health officials can make choices such as withdrawing the vaccine from general use or mandating further clinical studies.

Figure 2.2: The vaccine approval process. During clinical trials, only limited numbers of subjects are studied, warranting observation of the vaccine effectiveness and safety in the general population post market approval (numbers are given as an example only).

Opposition to vaccination is not new and has existed since the first vaccination against smallpox. Poland and Jacobson [27] detail how it is today more than ever an issue to be contended with, and how efficient reporting and public information will contribute to defeating anti-vaccination campaigns.
One of the most famous roots of the vaccine controversy can be traced to the 1998 Lancet article, “Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children” [28], retracted 12 years later due to fraudulent research. This article claimed a causal relation between the Measles, Mumps, and Rubella (MMR) vaccine and autism in children. A survey of 5000 internet users in February 2010 (immediately after the retraction of the Lancet paper) showed that sixty-five percent of Canadian women and seventy-two percent of Canadian men surveyed either (a) believed that the vaccine was unsafe or (b) were unsure whether the MMR vaccine could cause autism. This fear of vaccine side effects has dramatic consequences, causing a drop in vaccination rates in the population. Herd immunity occurs when enough people are vaccinated to provide protection for individuals, such as newborns, who have not yet developed immunity. Partly due to parental refusal of vaccination, herd immunity for vaccine-preventable diseases, such as pertussis, is now compromised [29]. In 2010, 9,143 cases of pertussis (including ten infant deaths) were reported throughout California - the state’s worst whooping cough epidemic in 50 years. The importance of pharmacovigilance as a tool for global health policies has been well described [30], even more so in vaccine risk communication, which has been shown [25] to have a direct impact on decisions to immunize in the general public. The resulting drop in vaccination rates is a probable underlying cause of the recent resurgence of vaccine-preventable diseases such as pertussis [29] or even the recent (September 2013) measles outbreaks [31, 32].

Reporting issues - the case for standard guideline representation

Reports are collected from a multitude of sources - physicians and nurses from many different practices, coming from many training backgrounds.
Variability in report quality [33] and size of the dataset [3] are significant challenges to deriving adverse event signals - situations which, with some regularity, predictably lead to some kind of adverse effect - from them. In North America, the assessment of reports is performed by medical officers at the national level, who have no access to the patient and need to rely on the information reported from the primary point of care, which is error prone and may lead to information being missed or erroneously interpreted.

This assessment is often done following a clinical guideline, a protocol (see footnote 4) that has the objective of guiding medical decision making, assessing patient state, and determining diagnosis or giving treatment. Clinical guidelines might be deployed in at least two places: when first assessing the patient, guiding what should be reported about the patient and how; or when reviewing the reports, guiding how to interpret them together, adjusting for differences in the reports. In current practice, while some efforts are being made to standardize reporting forms, they focus on the kinds of information to include in a report, but not on the terminology to be used when doing so. For example, the VAERS report form [34] specifies that a report should include medications the patient is taking, but does not detail how a medication should be encoded. Or it might specify that respiratory condition or severity of condition be recorded, but not indicate that a controlled vocabulary be used.

Footnote 4: OBI defines protocol as “a protocol is a plan specification which has sufficient level of detail and quantitative information to communicate it between domain experts, so that different domain experts will reliably be able to independently reproduce the process.”

However, in order to look for patterns of symptoms and medication, one needs to be able to count, for example, how often a given symptom occurs - despite it being described in different ways. Such normalization issues may occur even when a specified terminology is used, in cases where the terms are not well documented, or when different terms can be used to describe an equivalent situation. Finally, whereas a clinician might report a cluster of undesirable conditions a patient has experienced, it can be the case that different sets of conditions come from one underlying disease or condition, the primary reason for reporting. In order to have the most power to detect a safety signal, such diseases or conditions need to be recognized and recorded. For example, a goal of pharmacovigilance is to identify all cases of anaphylaxis (an extreme allergic reaction) in a given population, which can manifest via rashes, swelling of the tongue, difficulty breathing etc. It is important not only that those individual manifestations be recorded, but also that the primary cause of reporting be identified when possible.

Assessment choices differ on which cluster of symptoms signifies an underlying condition, or how reliably the information in a report supports an assessment. Figure 2.3 shows elements from four guidelines that assess whether anaphylaxis has taken place, the application of which will result in different clinical assessments, even for the same condition. While the language of reports currently ranges from controlled vocabulary such as MedDRA to free text, current controlled vocabularies do not sufficiently constrain the meaning of reports [15, 35].
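The counting problem described above, where one symptom is reported under several wordings, can be illustrated with a minimal Python sketch. The synonym table is entirely hypothetical (its entries are not drawn from MedDRA or from any reporting system), and the function names `normalize` and `count_symptoms` are introduced here for illustration only.

```python
from collections import Counter

# Hypothetical mapping from reported wording to one canonical term.
# These entries are illustrative only; they are not drawn from MedDRA.
CANONICAL = {
    "hives": "urticaria",
    "urticaria": "urticaria",
    "nettle rash": "urticaria",
    "wheeze": "wheezing",
    "wheezing": "wheezing",
}

def normalize(term):
    """Return the canonical term, or the cleaned input if it is unknown."""
    key = term.strip().lower()
    return CANONICAL.get(key, key)

def count_symptoms(reports):
    """Count each canonical symptom at most once per report."""
    counts = Counter()
    for symptoms in reports:
        # A set prevents two wordings of one symptom from counting twice.
        counts.update({normalize(s) for s in symptoms})
    return counts

reports = [
    ["Hives", "wheeze"],
    ["urticaria"],
    ["nettle rash", "Wheezing", "hives"],
]
counts = count_symptoms(reports)
print(counts["urticaria"], counts["wheezing"])  # 3 2
```

Without the mapping step, the three reports would yield five distinct strings and no symptom would appear more than once, which is the loss of signal the paragraph above describes.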
To take a step in remedying these problems, a good standard for describing patient conditions is still needed, as well as a consistent and computable manner of describing criteria expressed using that standard.

2.2 Brighton Collaboration

To allow for comparability of data, it is desirable that a global standard for case definitions and guidelines be used for AEFI reporting [35]. The Brighton Collaboration [36] is a global network of experts that aims to provide high quality vaccine safety information. It has done extensive work towards standardizing the assessment and reporting of adverse events following vaccination [40].

2.2.1 Brighton publications

The case definitions provided by the Brighton Collaboration relate symptoms and signs to assessments of whether a particular type of pathological process has occurred, assigning qualitative levels of certainty. They provide guidelines for three activities (data collection, analysis, and presentation of results), aiming to make collected data comparable, informed by the case definitions. By developing and publishing these guidelines, the collaboration creates methodological standards that enable accurate risk assessment. The case definitions neither require nor assess a causal relation between a given adverse event and the vaccination process. Rather, the case definitions are designed to define levels of diagnostic certainty based on known information about AEFIs.

Figure 2.3: Details of some guidelines for anaphylaxis assessment. Horizontal grouping is done by anatomical system in which the manifestation takes place: dermatological/mucosal, cardiovascular, respiratory and others. Colored boxes indicate each of the guidelines considered. Different logical operators (AND, OR, AND/OR) are used to assemble individual manifestations depending on the guideline considered: Brighton Collaboration [36], Australasian Society of Clinical Immunology and Allergy (ASCIA, [37]), National Institute for Health and Clinical Excellence (NICE, [38]), World Allergy Organization (WAO, [39]).
2.2.2 Automatic Brighton Classification (ABC) tool

The Automatic Brighton Classification (ABC) tool [41] is the only automated classification system that allows users to work with the Brighton case definitions. Given a set of symptoms and a tentative diagnosis, one can confirm the level of diagnostic certainty of an AEFI. Alternatively, given a set of symptoms, the tool can compare them to all Brighton case definitions and report putative diagnoses and their probabilities. Four limitations warrant development of an ontology that would replace the ABC tool:

1. The different signs and symptoms are not defined within the Brighton tool, making it hard at the time of diagnosis confirmation to know if individual findings are those mentioned in the case definitions [15]. While the Brighton guidelines do not provide those definitions, the ontology uses those of the PHAC glossary [42].
2. The tool is embedded within the Brighton portal, and access requires individual login. There is no public API or web service available, making it unsuitable for processing large amounts of data. While there is a mechanism to upload multiple Excel files, this still requires human intervention and raises the issue of sharing medical data with servers located outside of the originating institution.
3. The ABC tool cannot be integrated into other systems, and only remote access is available.
4. The rules of classification are hard coded into the ABC tool, which makes it hard to maintain and extend as new case definitions are developed [43]. In contrast, ontologies allow for the guidelines to be encoded independently of the application code itself: an update to the ontology does not require updating the business logic of the tools relying on it.

Additionally, use of an ontology allows for text mining of large corpora of data [44], and mapping towards external resources such as MedDRA, which is required when attempting to reconcile existing MedDRA annotations with different guidelines used for their assessment, as shown later in this thesis.

Anaphylaxis according to Brighton

Of special interest for this thesis, the Brighton Collaboration published an anaphylaxis guideline in 2007 [45]. It describes anaphylaxis as “an acute hypersensitivity reaction with multi-organ-system involvement that can present as, or rapidly progress to, a severe life-threatening reaction.” The Brighton case definitions have been adopted by the Vaccine Working Group at PHAC and are captured to some extent in the national Canadian Adverse Event Following Immunization Reporting form [46].

Table 2.2: Case definition of anaphylaxis.

For all levels of diagnostic certainty
Anaphylaxis is a clinical syndrome characterized by
• Sudden onset AND
• Rapid progression of signs and symptoms AND
• Involving multiple (≥2) organ systems, as follows

Level 1 of diagnostic certainty
• ≥1 major dermatological AND
• ≥1 major cardiovascular AND/OR 1 major respiratory criterion

Level 2 of diagnostic certainty
• ≥1 major cardiovascular AND 1 major respiratory criterion OR
• ≥1 major cardiovascular OR respiratory criterion AND
• ≥1 minor criterion involving 1 different system (other than cardiovascular or respiratory systems) OR
• ≥ (1 major dermatologic) AND (1 minor cardiovascular AND/OR minor respiratory criterion)

Level 3 of diagnostic certainty
• ≥1 minor cardiovascular OR respiratory criterion AND
• ≥1 minor criterion from each of ≥2 different systems criterion
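The level logic of Table 2.2 can be written down as executable rules. The following is a minimal sketch only; it is neither the OWL encoding used in AERO nor the implementation of the ABC tool. The `Report` structure, the system names and the function `brighton_level` are hypothetical names introduced for illustration. The sketch assumes one particular reading of the AND/OR nesting in the table, assumes that a major criterion does not also count as a minor one, and assumes that the "≥2 different systems" of Level 3 are systems other than the one supplying the minor cardiovascular or respiratory criterion.

```python
from dataclasses import dataclass, field

# Hypothetical system labels, following the groupings used in the case definition.
SYSTEMS = ("dermatologic", "cardiovascular", "respiratory",
           "gastrointestinal", "laboratory")


@dataclass
class Report:
    """Hypothetical container for one case report."""
    sudden_onset: bool
    rapid_progression: bool
    major: dict = field(default_factory=dict)  # system -> number of major criteria met
    minor: dict = field(default_factory=dict)  # system -> number of minor criteria met


def brighton_level(r):
    """Return 1, 2 or 3 under the assumed reading of Table 2.2, or None."""
    def maj(s):
        return r.major.get(s, 0) >= 1

    def mnr(s):
        return r.minor.get(s, 0) >= 1

    # Conditions required for all levels of diagnostic certainty.
    involved = {s for s in SYSTEMS if maj(s) or mnr(s)}
    if not (r.sudden_onset and r.rapid_progression and len(involved) >= 2):
        return None

    # Level 1: major dermatologic AND (major cardiovascular AND/OR major respiratory).
    if maj("dermatologic") and (maj("cardiovascular") or maj("respiratory")):
        return 1

    # Level 2: any one of the three alternatives (assumed nesting).
    minor_elsewhere = any(
        mnr(s) for s in SYSTEMS if s not in ("cardiovascular", "respiratory")
    )
    if ((maj("cardiovascular") and maj("respiratory"))
            or ((maj("cardiovascular") or maj("respiratory")) and minor_elsewhere)
            or (maj("dermatologic") and (mnr("cardiovascular") or mnr("respiratory")))):
        return 2

    # Level 3: a minor cardiovascular or respiratory criterion, plus minor
    # criteria in at least two other systems (assumed interpretation).
    for s in ("cardiovascular", "respiratory"):
        if mnr(s) and sum(1 for t in SYSTEMS if t != s and mnr(t)) >= 2:
            return 3

    return None
```

Keeping the rules in one function makes the ambiguity of the textual guideline visible: each assumption listed above corresponds to a line that another reader of the table might reasonably write differently.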
Table 2.3: Major and minor criteria used in the case definition of anaphylaxis.

Dermatologic or mucosal system
Major criteria
• Generalized urticaria (hives) or generalized erythema
• Angioedema, localized or generalized
• Generalized pruritus with skin rash
Minor criteria
• Generalized pruritus without skin rash
• Generalized prickle sensation
• Localized injection site urticaria
• Red and itchy eyes

Cardiovascular system
Major criteria
• Measured hypotension
• Clinical diagnosis of uncompensated shock, indicated by the combination of at least 3 of the following:
  – Tachycardia
  – Capillary refill time >3 s
  – Reduced central pulse volume
  – Decreased level of consciousness or loss of consciousness
Minor criteria
• Reduced peripheral circulation as indicated by the combination of at least 2 of
  – Tachycardia and
  – A capillary refill time of >3 s without hypotension
  – A decreased level of consciousness

Respiratory system
Major criteria
• Bilateral wheeze (bronchospasm)
• Stridor
• Upper airway swelling (lip, tongue, throat, uvula, or larynx)
• Respiratory distress - 2 or more of the following:
  – Tachypnoea
  – Increased use of accessory respiratory muscles (sternocleidomastoid, intercostals etc)
  – Recession
  – Cyanosis
  – Grunting
Minor criteria
• Persistent dry cough
• Hoarse voice
• Difficulty breathing without wheeze or stridor
• Sensation of throat closure
• Sneezing, rhinorrhea

Gastrointestinal system
Minor criteria
• Diarrhoea
• Abdominal pain
• Nausea
• Vomiting

Laboratory
Minor criteria
• Mast cell tryptase elevation > upper normal limit

However, and despite their completeness, the textual, article-like format of the Brighton case definitions makes it both problematic for clinicians to confirm that they see the relevant symptoms when making the adverse event diagnosis and difficult to automate [15]. As shown in Tables 2.2 and 2.3, the anaphylaxis guideline is complex, and this limits its use in pharmacovigilance [15].
Two main barriers for adoption of the anaphylaxis Brighton Collaboration Case Definition (BCCD) have been identified [15]:

1. Health practitioners may not report enough signs and symptoms to allow application of the BCCD,
2. Signs and symptoms terms are not consistently used.

In this thesis, I propose and demonstrate that using an ontology addresses both those issues, by (1) providing logical encoding of the BCCD, which could be used for consistency checking at reporting time and for “prompting” users for missing information, and (2) providing human readable definitions for terms used in reporting, based on the PHAC glossary [42].

2.3 Ontologies

The project of enabling effective communication and discovery in the biological domain, and pharmacovigilance in particular, is complex. While free-text descriptions can capture relevant experimental details, as exemplified by the methods section of research papers, making the data available for reanalysis and comparison with other related datasets requires a much more systematic and computable approach to capturing information about experiments. In a re-evaluation of 18 peer-reviewed Nature Genetics microarray articles, it was reported that the inability of researchers to reproduce analyses was directly linked to data unavailability, incomplete data annotation, or incomplete specification of data processing and analysis [47]. It is a significant challenge to unify diverse data sets in a consistent way when the biological relevance of the same entity is labeled differently in different resources. Using a common knowledge representation, such as ontologies, will help provide a stable and consistent context for the information within them [48, 49].

Requirements for a controlled medical vocabulary are described by Cimino in [50], including:

1. Vocabulary content: the controlled vocabulary should cover the use cases and domain of knowledge,
2. Concept orientation: terms must correspond to at least one meaning (“non vagueness”) and no more than one meaning (“non ambiguity”), and meanings must correspond to no more than one term (“non redundancy”),
3. Concept permanence: a term may be flagged obsolete or deprecated but once created it is never deleted,
4. Nonsemantic concept identifier: terms should use numerical, non-semantic identifiers,
5. Polyhierarchy: allowing “tree walking” (i.e., browsing the items of a tree via the connections between parents and children) along different paths depending on the information available and the context of access,
6. Formal definitions: include textual definitions as well as the logical ones created by the position in the hierarchy,
7. Terminologies should support inferencing: humans and computers should be able to draw conclusions from the information captured [51].

Ontologies are formal representations of knowledge with definitions of concepts, their attributes and relations between them expressed in terms of axioms in some well-defined logic [52]. They specifically address the requirements detailed by Cimino in his “desiderata”. They model a domain of interest, and provide a unique, unambiguous definition, both human readable and computer amenable, for each of their terms. Biomedical ontologies are sets of terms and relations that represent entities in the scientific world and how they relate to each other. Terms are associated with documentation and definitions, which are, ideally, expressed in formal logic in order to support automated reasoning [53, 54, 55]. Ontologies have dramatically changed how biomedical research is conducted. For example, since the Gene Ontology (GO) was first published in 2000 [53], it has been used and cited in more than 2000 peer-reviewed journal articles [56]. Ontologies have been used in various applications, such as gene expression data analysis [53], literature mining [44], and as the underpinning of a semantic web [57].
There are currently more than 150 biomedical ontologies and 700,000 entities in the National Center for Biomedical Ontology (NCBO) BioPortal. With new resources continuously being developed, maximizing ontology sharing and interoperability has become a growing concern [58, 59].

In addition to the development of a biomedical ontology covering the domain of adverse event reporting (AERO) (described in Chapter 8), Chapter 6 specifically addresses items from the desiderata [50]. The ID policy provides a standard scheme for numerical identifiers and formalizes a versioning system. It also explores a deprecation policy that ensures terms are never deleted, so that their identifiers remain unique and maintained. The MIREOT mechanism (see 6.3) I developed ensures that only one term is created for each entity to be represented, by providing a URI sharing strategy. As terms in the AERO are related to each other by relations in addition to subsumption, one can, for example, browse adverse event information from the set of all adverse events, or select only those involving motor manifestations (i.e., has part some motor manifestation). Finally, reasoners such as Pellet [60] can be used to check the consistency of the ontology and infer new facts based on the knowledge captured in the hierarchy.

Use of ontologies falls within three main categories [61]:

1. Knowledge management: Annotation of resources (e.g., the Gene Ontology [62] for gene products, or the Mammalian Phenotype Ontology [63] for phenotypic information) increases recall and precision when retrieving biomedical information. In [64], subclasses of the originally searched taxon were automatically included in results, such that a search for “mammalian models” would return those pertaining to human, mouse etc.
2. Data integration: Ontologies such as TAMBIS [65] facilitate information exchange, semantic interoperability and data integration. While TAMBIS accesses only five resources (Swiss-Prot, Enzyme, Cath, blast and Prosite), modern SPARQL endpoints allow for querying across multiple datasets [66, 67, 68]. In [57], Ruttenberg et al. describe a query that retrieves gene records and the names of signal transduction related processes that the gene products participate in and that are related to pyramidal neurons, by querying across multiple data sources hosted on Neurocommons.
3. Decision support and reasoning: For example in [69], an ontology based on the NCI Thesaurus is used to grade glioma tumors automatically and compare the classification with 11 pathology reports.

In this thesis, building an ontology means addressing three essential roles [70]:

1. An ontology in a given domain is a collection of representations of the important types of things in that domain, with an understanding that instance representations of any of these types should be considered proxies for things in the world. The truth of assertions made on the representations is judged by the facts about that which the representations serve as proxies.
2. An ontology is an active computational artifact. The assertions that are made are in a subset of first order logic which can be checked for consistency and from which logical consequences can be computed. I aim to take advantage of this by asserting as many axioms as feasible. As a way of improving quality, these axioms maximize the opportunity for consistency checks to turn up errors. Each case report is classified when the specifics of the case satisfy a guideline for the purpose of diagnosis and screening. For example, a query can be executed for all the cases of adverse effects that affect a specified anatomical system such as ‘skin rash located at some dermatological-mucosal system’.
3. An ontology facilitates scholarly and technical communication.
   • Working in a large community of ontology developers who split the labor and use each other’s work, the OBO Foundry [55], and who together work out principles that encourage quality through careful analysis,
   • Mediating communication between clinicians and technical specialists when the practice of having literate documentation about the types in the ontology is followed, so that clinicians and co-workers can have reasonable expectations of what the data means,
   • Being part of the package distributed so that other researchers can reproduce results.

2.4 The OBO Foundry

Biomedical investigations use empirical approaches to investigate causal relationships among a large range of variables. The wide range of possible investigations presents a number of challenges when building tools to describe experimental processes. There are varying levels of complexity and granularity and a wide range of material and equipment is used. Furthermore, the use of varying terminology by different communities makes data integration problematic when representing and integrating biomedical investigations across different fields of study.

The use of ontologies has been successful in biological data integration and representation [71, 72] and there have been multiple efforts to develop ontologies aimed at providing clearer semantics for data (GO [53], FuGO [73], MGED [74], EXPO [75], LABORS [76], MSI ontology [77]). Work in the transcriptomics, proteomics and metabolomics communities has proceeded in parallel, producing ontologies with overlapping scopes. Though each focuses on particular types of experimental processes, many terms, such as investigation and assay, are common to all. Merging common aspects of these formalisms is useful as it provides a mechanism by which terms can be used and understood by all, reducing ambiguity and difficulties associated with post-hoc attempts to integrate data.

The practice of consolidating representations is endorsed by organizations such as the OBO Foundry [55], which requires all member ontologies to define a term only once among them (orthogonality). OBO Foundry members use a common set of relations from the Relations Ontology (RO) [78] and the upper level Basic Formal Ontology (BFO) [79] in order to facilitate cross-ontology consistency and to support automated reasoning [55]. OBO ontologies also adhere to common naming conventions in order to make it easier to learn and understand them: this common metadata set is described in section 6.2.2.

The development of a new biomedical ontology covering a specific domain is often an ambitious, time-consuming project, usually requiring extensive cross-community collaboration [80, 81]. The Open Biomedical Ontologies (OBO) Foundry is an open community that has established a set of principles for ontology development with the goal of creating a suite of interoperable reference ontologies in the biomedical domain [55]. These principles require that member ontologies be open, orthogonal, expressed in a common shared syntax, and designed to possess a common space of identifiers. One way of meeting the goal of interoperability is to reuse existing resources by importing them into the to-be-created ontology. For example, the VO [82], described in Chapter 4, relies on many terms (e.g., administering substance in vivo) already described by other biomedical ontologies, such as the OBI [83].
Authors of resources submitted to the OBO Foundry library commit to working together to increase quality of resources.

As a result of this collaborative work, resources that are part of the OBO Foundry are orthogonal in scope (i.e., each resource describes a specific, non-overlapping domain), and common policies are devised and followed [84]. To increase interoperability, ontologies use a common upper ontology (Basic Formal Ontology (BFO)) [85] and a common set of relations (RO) [78]. Policy adoption at the level of the OBO Foundry is done by decision of the OBO coordinators, a set of individuals helping build a community adhering to the OBO principles shown in Appendix C. Sharing development principles and domains aims at decreasing the workload for ontology developers, and at ensuring that each domain is covered adequately by experts in the area. The idea of working collaboratively in a “Foundry” type of framework has also been adopted by the MInimum reporting guidelines for Biological and Biomedical Investigations (MIBBI) project [86], with goals similar to the OBO Foundry.

In addition to creating a suite of reference ontologies, the OBO Foundry also promotes their use in the annotation of multiple datasets in the interest of enabling effective integration of data in this field. Biomedical ontologies are, typically, consensus-based controlled biomedical vocabularies of terms for classes and relations associated with natural language definitions and logical axioms formulated to promote automated reasoning. A key challenge in establishing such consensus and reaping the consequent benefits of widespread use is the wide dissemination of the terms from these ontologies, making them discoverable, understandable, and (re)usable.
Chapter 7 describes the Ontobee server that was implemented by my collaborators at the University of Michigan in the context of the OBO Foundry for this purpose.

In the context of the OBO Foundry, resources are developed in a modular way, as shown in Figure 2.4. A top-level ontology, BFO, provides the basic scaffolding on which others can build. BFO describes high-level entities, such as occurrents (those things that occur at some time, such as processes) and continuants (those things that perdure through time, such as material entities or their attributes, such as qualities). Other resources are built under BFO with the goal of providing representations for specific domains. For example, OBI extends bfo:processes to encompass assays and material or data transformations in the domain of biomedical investigations. Organisms are subclasses of material entities, and bear different roles, such as principal investigator or specimen, or qualities, such as radioactive. Similarly, OGMS and VO, described in Chapter 4, aim at representing clinical and vaccine data. Finally, AERO is discussed in Chapter 8 for application to adverse events.

2.5 OWL and the Semantic Web

2.5.1 The Semantic Web

Integrating heterogeneous data from multiple sources or databases is a well-known problem [87, 88]. With the advent of the Web, there is an additional need to unify multiple data sources and serve the end user with a unified view they can browse and query [89, 90]. The semantic web aims at extending the existing web of documents into a web of data designed to also be processed automatically. It relies on providing unambiguous names for things, such as classes and relationships between them, that are well organized and documented in ontologies. Data is expressed using standard knowledge representation languages, such as Resource Description Framework (RDF) [91], OWL [92], and Rule Interchange Format (RIF) [93], and can be queried using, for example, the SPARQL Protocol and RDF Query Language (SPARQL) [94].
This enables computationally assisted exploitation of information: machines can work with data, allowing for consistency checking, querying and inferences over the datasets. Finally, data is integrated from different sources. As described in [95], “The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.” A variety of systems are imagined to benefit from a web of data, ranging from agents that understand enough of such data to reliably act on behalf of their users, to systems that are able to discover previously unknown relations among data in such a web of data [95].

Figure 2.4: Coverage and distribution of OBO Foundry ontologies. Figure reproduced from a work by Barry Smith, licensed under CC-by-nc-sa.

In his linked data note [96], Sir Tim Berners-Lee highlighted four underlying principles (with minor paraphrasing):

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up (i.e., dereference) those names.
3. When URIs are dereferenced, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs in that useful information, so that more things can be discovered.

Figure 2.5: Meaning of URIs. The URI denotes cells in the real world. This URI can be dereferenced (e.g., via a web browser) and upon doing so provides useful information using RDF. This information may contain links to other relevant resources. (Diagram labels: web document, real-world entity; resolves to, describes, denotes. Picture CC0 from Wikipedia.)

This “meaning of URIs” is depicted in Figure 2.5. The Linked Open Data (LOD) [68] data cloud includes Bio2RDF [66], Uniprot [97], DBPedia [98], and Neurocommons [57], and aims at integrating even more of those datasets.
Bio2RDF encompasses information from many key bioinformatics resources (e.g., KEGG [99], PubMed [100], Uniprot [97]). This allows, for example, the querying of the Uniprot dataset using a PubMed node identifier, because the identifier of the PubMed resource is shared between the two datasets. Figure 2.6 illustrates the architecture of the semantic web. Reusing the technologies it relies upon provides some of the building blocks for data sharing in the biological domain. Logical languages such as RDF and OWL have effective tool implementations, such as HermiT [101], Pellet [60] or Fact++ [102]. Additionally, there is no need to design individual “data models” (a common source of non-integrable data), as a standard one is provided by RDF, as well as databases supporting it, such as Virtuoso [103], Stardog [104], OWLIM [105], or Sesame [106]. Finally, SPARQL [94] is a standard query language for RDF.

Section 6.2.1 describes how I implemented the linked data principles by formalizing a common ID policy normalizing use and format of HTTP URIs across the OBO Foundry, while Chapter 7 details the dereferencing mechanism implemented as default for resources relying on this ID scheme. Throughout this thesis, resources have been built in OWL, to which the next section provides an introduction.

Figure 2.6: The Semantic Web architecture, as described by Tim Berners-Lee. Image in the public domain, from Wikipedia.

2.5.2 Components of an ontology

Ontology languages

As shown in Figure 2.6, RDF, RDF Schema (RDFS) and OWL are core components of the semantic web. In RDF, data is represented by triples, each a set of subject, predicate and object, in which each entity can be denoted by its corresponding Uniform Resource Identifier (URI). Figure 2.7 details how the triple eukaryotic cell has part nucleus can be encoded in RDF.
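As a minimal sketch of the triple structure just described, the snippet below writes the triple eukaryotic cell has part nucleus as one line of N-Triples, a simple line-based RDF syntax. The `http://example.org/` namespace, the local names and the helper `to_ntriple` are hypothetical placeholders chosen for illustration; they are not the identifiers used in the ontologies discussed in this thesis.

```python
# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def to_ntriple(subject, predicate, obj):
    """Serialize one triple of URIs as a single N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# The triple "eukaryotic cell has part nucleus": each position holds a URI.
triple = (BASE + "eukaryotic_cell", BASE + "has_part", BASE + "nucleus")
line = to_ntriple(*triple)
print(line)
```

Because every position is a URI, two datasets that use the same URI for the same entity can be merged by simply concatenating their triples.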
Figure 2.7: Encoding of triples in RDF. (Diagram: subject EUKARYOTIC CELL, predicate has_part, object NUCLEUS.)

RDFS allows the addition of simple relations to RDF, such as rdfs:subClassOf, which supports basic inference. For example, by asserting the following (subject, predicate, object) triples: animal cell rdfs:subClassOf eukaryotic cell and eukaryotic cell rdfs:subClassOf cell, a reasoner can infer that animal cell rdfs:subClassOf cell. OWL provides a higher level of expressivity. For example, axioms such as eukaryotic cell disjointWith prokaryotic cell cannot be expressed using RDFS [107]. However, the more expressive a language is, the more difficult it is to reason over representations built using that language [108]. OWL was chosen as a means to provide a balance between the expressivity required, computation capability and tool support. The OWL DL subset was indeed suitable to represent the axioms required (described in Chapter 8), while being computationally decidable [109]: reasoners are guaranteed to finish and produce a result, though in practice the performance is not guaranteed, i.e., the memory space and time required for computation can be prohibitively large. In Chapter 9 I discuss some reasoning issues encountered during this thesis, and solutions I implemented to address them.

While many formats can be used to encode OWL files, the Manchester OWL Syntax [110] was chosen as a more user friendly option for examples in this dissertation.

Ontology structure

An ontology can be split into a TBox and an ABox [111]. The TBox, or Terminological box, contains intensional knowledge: classes and relationships amongst them, which represent the general knowledge about a domain. For example, a woman can be defined as a female person. On the other hand, the ABox, or Assertions box, encodes the extensional knowledge, which is specific to the individuals of the domain, via instances. For example, the individual ANNA is a female person [112].
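The split between TBox and ABox, and the subclass inference described earlier in this section, can be illustrated with a small sketch. It is a naive transitive closure over asserted subclass axioms, not how an OWL reasoner such as HermiT or Pellet works internally. The dictionaries `TBOX` and `ABOX`, the individual name `C1` and both functions are hypothetical names introduced for illustration.

```python
# TBox: general knowledge, here asserted subclass axioms (child -> direct parents).
TBOX = {
    "animal cell": {"eukaryotic cell"},
    "eukaryotic cell": {"cell"},
}

# ABox: assertions about individuals (individual -> asserted types).
ABOX = {"C1": {"animal cell"}}


def superclasses(cls, tbox):
    """Return every class reachable from cls by following subclass axioms."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in tbox.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(individual, abox, tbox):
    """Return the asserted types of an individual plus all inherited ones."""
    types = set(abox.get(individual, ()))
    for cls in list(types):
        types |= superclasses(cls, tbox)
    return types


print(sorted(superclasses("animal cell", TBOX)))
print(sorted(inferred_types("C1", ABOX, TBOX)))
```

Only one type is asserted for `C1`; the other two follow from the TBox, which is why guideline updates can be made in the ontology without touching the data.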
While the TBox is often compared to relational database schemas, the ABox can only approximately be seen as the set of instances in the database. Relational databases store a finite amount of data, and data that is not present is simply inferred to be false. However, in OWL the Open World Assumption (OWA) prevails, and data that is not explicitly stated is considered unknown rather than false [113]. For example, if the statement ANNA is a female person is made, and the question “is Mary a female person?” is asked, a closed-world system (such as SQL) will answer “No”, while an open-world system will answer “unknown”. As a consequence, it may be required in some cases to add negative axioms to represent knowledge extracted from a database, as shown in Chapter 9. Additionally, ontologies contain documentation, in the form of metadata either on the classes, properties and instances in the model, or on the model itself. Details of the shared metadata set I developed for the OBO Foundry are in Section 6.2.2 for annotation on classes, properties and relations, while Section 6.2.1 describes some ontology metadata such as versioning information. Finally, ontologies can reuse each other via the owl:imports statement, which allows referencing another OWL ontology containing definitions, whose meaning is considered to be part of the meaning of the importing ontology [114]. A discussion of the import mechanism and its limitations is available in Section 6.3.

Classes

In OWL, classes are sets of individuals; the set of members of a class is called its class extension [114]. Classes can be described by either a class name (or URI) or a description of their extension (e.g., all cells that have a nucleus are eukaryotic cells, or cells are prokaryotic or eukaryotic). In OWL, 6 descriptions are allowed [114]:

1. A class identifier
2. An enumeration of individuals (using owl:oneOf)
3. A property restriction (value or cardinality constraint)
4. An intersection of class descriptions (AND)
5. A union of class descriptions (OR)
6. A complement of a class description (NOT)

Class descriptions can be turned into class axioms using one of three constructs:

1. rdfs:subClassOf: the subject class extension is a subset of the parent class extension. For example, animal cell subClassOf cell.
2. owl:equivalentClass: the subject and object class extensions are identical. For example ‘eukaryotic cell’ and ‘cell that has a nucleus’.
3. owl:disjointWith: the subject and object class extensions share no common member. For example ‘eukaryotic cell’ and ‘prokaryotic cell’.

Properties

OWL defines different types of properties [114]. Object properties link instances to instances, while data properties link instances to data values. OWL DL further specifies annotation properties on classes, individuals and properties, provided some conditions are respected (for example, annotation, object and data properties must be disjoint). In the context of the OBO Foundry, common properties used to be included within the RO [78]. Common relations have now been included in the upcoming version of BFO, BFO2.0.

Individuals

Individuals are the members of a class extension. A reasoner can infer their class membership via asserted facts on those individuals. For example, if one asserts that in their current experiment the cells C1 have a nucleus, then a reasoner could infer that C1 is a member of the class extension of “eukaryotic cell”. Such facts, allowing classification of individuals into their respective types, are necessary and sufficient conditions (it is necessary AND sufficient to have a nucleus to be a eukaryotic cell; see footnote 6). Other types of facts, necessary conditions, are required but don’t allow for type inference. For example, all bone cells are part of the bone element, but not everything that is part of the bone element is a bone cell.
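The difference between necessary and sufficient conditions and merely necessary ones can be sketched as follows. This is an illustration under simplifying assumptions, not the behaviour of an OWL reasoner: facts are reduced to (property, value) pairs, and the names `FACTS`, `DEFINED`, `NECESSARY`, `X1` and `classify` are hypothetical.

```python
# Asserted facts about individuals, as (property, value) pairs.
FACTS = {
    "C1": {("has_part", "nucleus")},
    "X1": {("part_of", "bone element")},
}

# Necessary AND sufficient conditions: satisfying them implies class membership.
DEFINED = {"eukaryotic cell": {("has_part", "nucleus")}}

# Necessary-only conditions: true of every member, but they do not imply
# membership, so they are deliberately not used for classification.
NECESSARY = {"bone cell": {("part_of", "bone element")}}


def classify(individual):
    """Return the classes whose defining conditions the individual satisfies."""
    facts = FACTS.get(individual, set())
    return {cls for cls, conditions in DEFINED.items() if conditions <= facts}


print(classify("C1"))  # C1 has a nucleus, so it is classified
print(classify("X1"))  # being part of the bone element proves nothing
```

`X1` is left unclassified even though it satisfies the condition listed for bone cells, which mirrors the statement that not everything that is part of the bone element is a bone cell.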
Finally, when individuals are asserted or can be inferred as members of two disjoint classes, an inconsistency occurs [115].

Footnote 6: As an interesting case, red blood cells are eukaryotic cells despite losing their nucleus during maturation. A temporal qualification of existing relations, such as has part, is being worked upon in the context of the development of BFO2.

Chapter 3
Representing biomedical investigations

3.1 Introduction

Before being able to describe pharmacovigilance processes accurately and in a standard way, a general framework for biomedical investigations is required. In this chapter, I describe parts of the OBI development in Section 3.2, and how it can be extended to represent vaccine protection assays and statistical analysis thereof. I was a coordinator for the OBI consortium and the core developer in charge of most development at this time, and participated in implementation of the models: all made use of the infrastructure I developed. Coordinators make decisions on general guidance of the OBI, while core developers are directly concerned with the editing and implementation required. I produced the release OWL file on which this chapter is based. My work focused on the neuroscience investigation (Use case 1) in collaboration with Dirk Derom and Alan Ruttenberg, as well as the vaccine protection investigation (Use case 2) in collaboration with Yongqun He. OBI is applied to model, among others, an investigation of vaccine protection against influenza viral infection. The vaccine protection investigation measures how efficient a vaccine or vaccine candidate is at inducing protection against a virulent pathogen infection in vivo. Section 3.3 then describes the application of the vaccine protection pattern to the ANOVA analysis of variables involved in Brucella vaccine protection efficacy.

3.2 The Ontology for Biomedical Investigations (OBI)

The OBI Consortium is developing an integrated ontology for the description of biological and clinical investigations.
This includes a set of ‘universal’ terms that are applicable across various biological and technological domains, as well as domain-specific terms. OBI supports the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology represents the design of an investigation; the protocols, instrumentation and material used; the data generated; and the type of analysis performed. OBI also represents roles and functions used in biomedical investigations. OBI has been used in experimental investigations in different communities, for example Bioinvindex, isa-tools, and IEDB.

OBI defines an investigation as a process with several parts, including planning an overall study design, executing the designed study, and documenting the results. An investigation typically includes interpreting data to draw conclusions. Biomedical experimental processes involve numerous sub-processes, involving experimental materials such as whole organisms, organ sections and cell cultures. These experimental materials are represented as subclasses of the BFO class material entity. OBI uses BFO’s material entity as the basis for defining physical things:
• A material entity is an independent continuant, a continuant that is a bearer of quality and realizable entity(s), in which other entities inhere and which itself cannot inhere in anything [79].
• Material entities are entities that are spatially extended, whose identity is independent of that of other entities, and which persist through time, for example organism, test tube, and centrifuge.
• Material entities can bear roles, typically socially defined, which are realized in the context of a process, e.g., study subject role, host role, specimen role, patient role; and functions, results of design or evolution that depend on their physical structure, e.g., measure function, separation function and environment control function.
The function is considered to inhere in the material entity and be realized by the role that material entity plays in a process.

To assess the completeness of the OBI release and demonstrate the use of OBI for annotation, two representative use cases are presented. These demonstrate how to model entities and relations between entities involved in experimental processes using OBI. The first use case models a neuroscience experiment described in a journal article [116] and shows how logical definitions are constructed using parts of external ontologies imported into OBI. The second use case details how OBI is used to model vaccine studies. The ability to integrate across multiple domains is of particular interest for this thesis: as an example, consider that a vaccine candidate against Alzheimer disease may induce specific changes in the brains of transgenic mice or human patients [117]. Enabling queries across the domains of vaccinology and neuroscience would therefore be useful in conducting such research.

3.2.1 Use case 1: Neuroscience investigation

This investigation studied the role of the primate caudate nucleus in the expectation of reward following action [116]. While the caudate nucleus responds preferentially to eye movements in different directions, the response begins prior to eye movement and is dramatically increased when there is expectation of reward for the preferred direction. Here a single trial is represented, in which the visual target, a light, is presented to the animal and the neural response is recorded as data. This single trial model contains two processes (Figure 3.1):
1. Stimulating monkey with a light source, which is an example of presentation of stimulus. The Japanese macaque monkey participates as the subject and the light source as the stimulus, during the process of a measuring neural activity in the caudate nucleus assay.
2. Measuring neural activity in the caudate nucleus: this process is a subclass of the process extracellular electrophysiology recording, which unfolds in the caudate nucleus that is part of the Macaca fuscata, of which the Japanese macaque monkey is an example. The anatomical term caudate nucleus is imported from the Neuroscience Information Framework standardized (NIFSTD) ontology [118] and used in the logical definition of the assay.

The light on the tangent screen here is a light source used to present the stimulus to the study subject. The function of the microelectrode, part of the single unit recorder (an example of processed material), is realized in the measuring neural activity in the caudate nucleus process. The process measuring neural activity in the caudate nucleus has the specified input a neuron and the specified output a neuronal spike train datum.

Figure 3.1: OBI modeling of a single trial in the neuroscience study (a fragment). In this and subsequent figures, boxes represent instances, labeled by the class they are an instance of, and relationships are shown as links labeled in italics. In several cases the parent class is also noted with the class label. Note that in typical use only some instances would be explicitly created; others would be inferred as a consequence of OBI’s definitions. Some processes in this experimental trial are presentation of stimulus, measuring neural activity in the caudate nucleus, and stimulating monkey with light source. Some continuants are Macaca fuscata, study subject role, spike train.

3.2.2 Use case 2: Vaccine protection investigation

A vaccine protection investigation (also known as a vaccine challenge experiment) measures how efficiently a vaccine or vaccine candidate induces protection against a virulent pathogen infection in vivo. Figure 3.2 demonstrates how to use OBI to represent a typical vaccine protection investigation via the following three sub-processes:
1. A vaccination is a kind of administering substance in vivo process that realizes some material to be added role, borne by a vaccine (e.g., VacX), as well as a target of material addition role borne by an organism that also bears a host role (e.g., mouse). The term vaccination is imported from the Vaccine Ontology [119]. The vaccination process realizes the injection function inhering in a syringe (itself a processed material).
2. A pathogen challenge is also a kind of administering substance in vivo process. It realizes a number of roles: a pathogen role and material to be added role borne by the challenge organism (e.g., Influenza Virus), and a target of material addition role and host role borne by another organism (e.g., mouse). An injection function that inheres in a syringe is realized by the pathogen challenge process.
3. A survival assessment is an assay that measures the survival rate (occurrence of death events) in one or more organisms that are monitored over time. The survival assessment is a protection efficiency assay that has specified input a number of organisms (e.g., mouse) and has specified output a survival rate, in this case a measurement datum that records that 75% of mice survived the pathogen challenge.

Figure 3.2: OBI modeling of vaccine protection investigation (a fragment). Major processes are vaccination and pathogen challenge, both of which are subtypes of administering substance in vivo. The roles target of material addition and material to be added role are defined with respect to this parent class. Some objects are syringe, mouse, host role, target of material addition role, VacX and a portion of Influenza Virus. Note that while the figure shows a single input for the survival assessment, in fact there would be many replicates of the experiment shown, with observations of mouse survival from all of them input to the survival assessment.

3.2.3 Discussion

OBI was built to provide a comprehensive and versatile representation of biomedical investigations. In the two use cases above, the individual experimental steps (the two processes in the neuroscience use case and the three processes in the vaccine protection use case) all fall under planned process in OBI.

In the neuroscience investigation use case, the construction of logical definitions of the experimental process prompted questions to domain experts, because the details to capture were not explicit in the publication. For example, was the location of the micro-electrode extra- or intra-cellular? Was all spike train data recorded from the caudate nucleus? How does a spike train relate to the GO biological process regulation of action potential [GO:0001508]? Based on the answers, OBI’s existing assays were augmented, and several terms from external ontologies, for example NIFSTD, were imported. When relations not yet present in OBI were needed, relations from ro_proposed were used rather than defined de novo. For example, unfolds in specifies that an occurrent (process) happens in a certain location (i.e., the assay of spike trains in the caudate nucleus). Finally, the NCBI taxonomy [120] was used to describe the species involved in this experiment. As described further in Section 6.3, re-use of external resources fulfills two purposes. First, as domain experts have already devoted time to defining terms in these external ontologies, substantial effort is saved by not replicating that work.
Second, by re-using existing resources that others already use, the potential for future data integration is improved, because it becomes unnecessary to map between different identifiers denoting the same entity.

In developing the neuroscience use case, some decisions about choosing an appropriate level of detail were challenging: in this use case instances of the classes were not included. Instead, the focus was on adding classes that can be re-used for other use cases and communities. The analysis and the classes defined can then serve as design patterns for other neuroscience assays. OBI is intended to model whatever level of detail (granularity) a use case requires, from molecular-level experiments to higher-level biomedical investigations, depending on the needs of the user community.

In the second use case, the vaccine protection investigation includes three processes. The processes vaccination and pathogen challenge are disjoint subclasses of administering substance in vivo. The process survival assessment is a type of assay (Table 3.1). All these required processes, as well as all other entities described in the use case, could be represented using OBI idioms. Syringe is a processed material that participates in different processes. Entities such as vaccine are types of material entity.
Host role, pathogen role, and material to be added role are types of roles.

Table 3.1: Ontology terms used in the use cases (note: instances are not included).

Ontology term | Source and term ID | Parent class | Use cases

Classes
administering substance in vivo | OBI: OBI_0600007 | material combination | 2
assay | OBI: OBI_0000070 | planned process | 1
caudate nucleus | NeuroLex: birnlex_1373 | anatomical entity | 1
extracellular electrophysiology recording | OBI: OBI_0000454 | assay | 1
function | snap#Function | realizable entity | 1,2
host role | OBI: OBI_0000725 | role | 2
IndependentContinuant | | continuant |
injection function | OBI: OBI_0005246 | function | 2
light source | OBI: OBI_0400065 | processed material | 1
Macaca fuscata | NCBI Taxon: NCBITaxon_9542 | organism | 1
material combination | OBI: OBI_0000652 | planned process | 2
material to be added role | OBI: OBI_0000319 | role | 2
material entity | snap#MaterialEntity | Independent continuant | 1,2
measure function | OBI: OBI_0000453 | function | 1
measurement device | OBI: OBI_0000832 | processed material | 1
measurement datum | IAO: IAO_0000109 | data item | 1,2
measuring neural activity in the caudate nucleus | OBI: OBI_0000812 | extracellular electrophysiology recording | 1
micro electrode | OBI: OBI_0000816 | processed material | 1
neuron | FMA: FMA:54527 | anatomical entity | 1
organism | OBI: OBI_0100026 | material entity | 2
pathogen challenge | OBI: OBI_0000712 | administering substance in vivo | 2
pathogen role | OBI: OBI_0000718 | role | 2
presentation of stimulus | OBI: OBI_0000807 | process | 1
process | span#Process | processual entity | 1,2
processed material | OBI: OBI_0000047 | material entity | 1,2
role | snap#Role | realizable entity | 1,2
spike train datum | OBI: OBI_0000801 | measurement datum | 1
study subject role | OBI: OBI_0000097 | role | 1
survival assessment | OBI: OBI_0000699 | assay | 2
survival rate | OBI: OBI_0000789 | measurement datum | 2
syringe | OBI: OBI_0000422 | processed material | 2
target of material addition role | OBI: OBI_0000444 | role | 2
vaccination | VO: VO_0000002 | administering substance in vivo | 2
vaccine | VO: VO_0000001 | material entity | 2

Property terms
bearer of | RO: OBO_REL#bearer_of | | 1
has participant | ro.owl#has_participant | | 1
has specified input | OBI: OBI_0000293 | | 1,2
has specified output | OBI: OBI_0000299 | | 1,2
inheres in | RO: OBO_REL#inheres_in | | 1,2
is a | RO: OBO_REL:is_a | | 1,2
is realized by | IAO: IAO_0000122 | | 1,2
location of | ro.owl#location_of | | 1
part of | ro.owl#part_of | | 1
unfolds in | RO: OBO_REL#unfolds_in | | 1

That OBI can be used to represent experimental processes for different applications and domains is appealing because it suggests that biomedical investigation work can be better leveraged. For the domain of vaccine investigation, approximately 400 vaccines have been manually curated and stored in the Vaccine Investigation and Online Information Network (VIOLIN) vaccine database system [121], described in Chapter 4.
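As a rough illustration of the kind of structured query that the vaccine protection model of Section 3.2.2 makes possible, the sketch below encodes the three sub-processes as plain Python records. This is only an approximation under stated assumptions: OBI is expressed in OWL, not Python, and the record types `RoleBearing` and `ProcessInstance`, the helper functions, and the instance labels are hypothetical. The subclass links follow the parent classes listed in Table 3.1.

```python
from dataclasses import dataclass, field

# Asserted subclass links, following the parent classes listed in Table 3.1.
SUBCLASS_OF = {
    "vaccination": "administering substance in vivo",
    "pathogen challenge": "administering substance in vivo",
    "administering substance in vivo": "material combination",
    "material combination": "planned process",
    "survival assessment": "assay",
    "assay": "planned process",
}


@dataclass
class RoleBearing:
    """A role realized in a process, together with the entity bearing it."""
    role: str
    bearer: str


@dataclass
class ProcessInstance:
    """Hypothetical, simplified stand-in for an OWL individual of a process class."""
    label: str
    process_type: str
    realizes: list = field(default_factory=list)
    specified_output: str = ""


def is_a(process_type, ancestor):
    """Walk the asserted subclass chain upward from process_type."""
    current = process_type
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False


investigation = [
    ProcessInstance("vaccination of mouse with VacX", "vaccination", [
        RoleBearing("material to be added role", "VacX"),
        RoleBearing("target of material addition role", "mouse"),
        RoleBearing("host role", "mouse"),
    ]),
    ProcessInstance("challenge of mouse with Influenza Virus", "pathogen challenge", [
        RoleBearing("pathogen role", "Influenza Virus"),
        RoleBearing("material to be added role", "Influenza Virus"),
        RoleBearing("target of material addition role", "mouse"),
        RoleBearing("host role", "mouse"),
    ]),
    ProcessInstance("survival assessment", "survival assessment",
                    specified_output="survival rate: 75%"),
]


def administered_materials(processes):
    """List (process label, material) for every 'administering substance in vivo'."""
    return [
        (p.label, r.bearer)
        for p in processes
        if is_a(p.process_type, "administering substance in vivo")
        for r in p.realizes
        if r.role == "material to be added role"
    ]


if __name__ == "__main__":
    for label, material in administered_materials(investigation):
        print(f"{label}: {material}")
```

Because vaccination and pathogen challenge share a parent class, a single query over that parent retrieves the administered material from both processes, which free-text records would not allow.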
Currently, the vaccine protection experimental data in VIOLIN are stored in plain text and can be difficult to interpret. The lack of a common ontology to aid in representing this data has prevented optimal use of the VIOLIN vaccine data. Applying the representation described above to that data would enable advanced querying both within the data and across data from other biomedical communities that represent their data using OBI.

3.3 Ontology representation and ANOVA analysis of Brucella vaccine protection investigation

Brucella is an intracellular bacterium that causes brucellosis, the most common zoonotic disease worldwide. Vaccine challenge studies are only performed in animal models, and typically occur at the preclinical stage. They are critical in determining whether a vaccine can yield the desired immune response. The hypothesis investigated in this section is that some experimental variables contribute significantly to Brucella vaccine protection efficacy while others do not. To investigate this hypothesis, the vaccine protection investigation was represented using VO and OBI. This model was then evaluated by my collaborators at the University of Michigan using literature-curated data.

3.3.1 Methods

The following methods were applied in this study:
1. Ontology representation of ANOVA statistical analysis: The analysis of variance (ANOVA) was modeled primarily in OBI. A design pattern was generated. The use case in this study is ANOVA in terms of a linear model.
2. Ontology-based representation of vaccine protection investigation: All variables in this use case are represented using different ontologies as needed. The main ontologies used include VO, OBI, and IAO.
3. Literature curation of individual Brucella vaccine protection data: Peer-reviewed Brucella vaccine protection research papers were obtained from a PubMed search. These papers were manually curated to identify variables potentially important for vaccine protection efficacy and to extract the values taken by these variables. The data were stored in an OWL file.
4. Ontology-based ANOVA analysis of Brucella vaccine protection results: ANOVA was applied to study the Brucella vaccine protection investigation instance data. The results were also represented in the ontology.

I performed the ontology implementation (items 1 and 2), while my collaborators at the University of Michigan executed items 3 and 4.

3.3.2 Results

Ontology design pattern of ANOVA data analysis

The analysis of variance (ANOVA) provides a statistical test of whether or not the means of several groups are all equal. In statistics, ANOVA includes a collection of statistical models (e.g., linear models), and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. The ontology-based ANOVA data analysis design pattern is illustrated in Figure 3.3. ANOVA is a subclass of data transformation process in OBI. F-test is part of the ANOVA process. ANOVA has specified input some data item; these data items come from two sources. They can be the output of individual processes (e.g., CFU reduction assay) or of a discretization process that discretizes non-measurable data (e.g., mouse age) into categorized measurement data (e.g., 1 for young mouse, 2 for middle-aged mouse, and 3 for old mouse). One approach to obtaining the data items necessary for ANOVA analysis is through data item extraction from a journal article (IAO_0000443). In this case, the input is some journal article, and the output is data. The ANOVA output is a p-value data set, which includes a set of p-value results for a predefined set of independent variables. ANOVA is a concretization of some ANOVA protocol.
The ANOVA protocol includes a predictive model that specifies a testable hypothesis model (Figure 3.3).

Figure 3.3: Representation of ANOVA analysis process.

Ontology representation of Brucella vaccine protection investigation

A vaccine protection investigation includes three processes (or steps): vaccination, pathogen challenge, and vaccine protection efficacy assessment. For those pathogens that kill a model animal (e.g., mouse), survival assessment is used for assessing vaccine protection efficacy [122]. Since virulent Brucella does not kill mice, the survival of pathogen-challenged mice cannot be used to assess Brucella vaccine efficacy. Instead, a colony forming unit (CFU) reduction assay is used to determine the difference in live bacteria recovered from vaccinated and non-vaccinated mice [123]. This use case was used to derive an instance-level representation based on the formal semantic representation of ANOVA analysis (Figures 3.3 and 3.4).

To determine which variables play significant roles in changing the Brucella vaccine protection efficacy, collaborators at the University of Michigan manually curated more than 40 papers to obtain instance data corresponding to these variables. In total, 151 data instances were collected from the literature and represented in OWL format. When variables did not already exist in the ontology they were added. An ANOVA analysis was performed and indicated that six variables do not contribute statistically significantly to the protection (p-value > 0.05). These six variables are IL-12 vaccine adjuvant, mouse sex, vaccination route, mouse age at vaccination, vaccination-challenge interval, and challenge dose. The other 10 variables contribute statistically significantly to the vaccine protection (p-value < 0.05) (Table 3.2).
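For readers unfamiliar with the F-test that is part of the ANOVA process, the following is a minimal sketch of the F statistic for a one-way ANOVA in plain Python. It is deliberately simpler than the multi-variable linear model used in this study, and the function name `one_way_anova_f`, the group labels and the numerical values are invented for illustration; they are not curated data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variance explained by group membership (between-group sum of squares).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variance inside each group (within-group sum of squares).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up protection values for three hypothetical levels of one variable.
example_groups = {
    "level A": [1.0, 2.0, 3.0],
    "level B": [2.0, 3.0, 4.0],
    "level C": [5.0, 6.0, 7.0],
}

if __name__ == "__main__":
    f_stat, df_b, df_w = one_way_anova_f(list(example_groups.values()))
    print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

For these invented values the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13 with (2, 6) degrees of freedom. The p-value is then obtained from the F distribution with those degrees of freedom, a step left out here to keep the sketch free of external dependencies.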
Table 3.2: Ontology terms for 17 variables in the Brucella vaccine protection assay. The first variable is the dependent variable, and the others are independent variables. The last six variables did not contribute significantly to the vaccine protection (p-value > 0.05).

No. | Classes / ANOVA variables | Sources and term IDs
1 | vaccine protection efficacy | VO: VO_0000456
2 | vaccine strain | VO: VO_0001180
3 | vaccine viability | VO: VO_0001139
4 | vaccine protective antigen | VO: VO_0000457
5 | mutated gene in vaccine strain | VO: VO_0001195
6 | vaccination mouse strain | VO: VO_0001189
7 | vaccination dose specification | VO: VO_0001160
8 | pathogen strain for challenge | VO: VO_0001194
9 | pathogen challenge (subclass) | OBI: OBI_0000712
10 | CFU per volume | UO: UO_0000212
11 | CFU reduction | VO: VO_0001164
12 | IL-12 vaccine adjuvant | VO: VO_0001147
13 | biological sex | PATO: PATO_0000047
14 | vaccination (subclass) | VO: VO_0000002
15 | animal age at vaccination | VO: VO_0000897
16 | vaccination-challenge interval | VO: VO_0001191
17 | challenge dose specification | VO: VO_0001161

Figure 3.4: Representation of a protection assay with Brucella vaccine RB51 [123]. Boxes represent OWL individuals. Terms from different ontologies (e.g., OBI, VO, IAO) are used. Italicized text in the middle of arrows represents relations. The bold terms represent three major processes in the vaccine protection investigation.

3.4 Conclusion

In this chapter, examples of how to represent experimental processes with OBI were described through three real world use cases. Experience such as this helps validate OBI’s design choices, and shows how to extend it in domain specific ways.
It also generates competency questions that allow us to identify parts of OBI that are insufficiently expressive and to identify external resources that can be used to extend OBI’s coverage.

A major challenge when developing models is the requirement to import terms from other ontologies to construct logical definitions: due to its broad scope, OBI spans multiple existing ontological resources. Importing those resources in full carries a significant cost, as reasoning becomes slower and the ontology is harder to navigate. To solve this problem I developed the MIREOT mechanism [124], described in Section 6.3, which preserves the namespaces of imported terms and allows their direct use in OBI and other resources.

While OBI provides a general framework for biomedical investigations, and has contributed substantially to other efforts such as the IAO [125] and the BFO [85], it does not describe clinical information such as encounters, disease and disorder processes, signs or symptoms, which are critical elements when considering pharmacovigilance and adverse event reports. Those fall within the scope of the Ontology for General Medical Science (OGMS); Chapter 8 describes how AERO extends OGMS. Additionally, my work focuses on adverse events following immunization, and in that context it is highly relevant to have an accurate representation of vaccines and their components, as well as of the vaccination process. Chapter 4 describes the Vaccine Ontology (VO), which targets this domain specifically.

Chapter 4
Representing vaccine data

4.1 Introduction

Vaccine research, development, testing, and clinical use involve complex processes whose computational representation requires a large number of data types and significant data volume. Several vaccine types are available; for example, live attenuated vaccines, subunit vaccines, and DNA vaccines.
Vaccines are developed using multiple approaches including studies of gene and protein expression, molecular and cellular interactions, and tissue and whole body responses, as well as extensive epidemiological modeling. Currently there are more than 200,000 vaccine-related articles in PubMed [56]. In addition to the wealth of peer-reviewed literature on vaccines, there are many public vaccine databases including the USA CDC Vaccine Information Statements system, the licensed vaccine information provided by the U.S. FDA, and the Vaccine Resource Library. These databases emphasize the clinical uses and regulatory oversight of existing vaccines. With the large number of vaccine data types and publications available, it is a challenge to develop an efficient strategy for vaccine data standardization, retrieval, and integration. High-throughput computational processes are needed for efficient integration of complex and large volumes of data. It is also increasingly challenging to identify and annotate vaccine data from this large and diverse literature, which no one scientist or team can fully master. However, computational analysis is not possible unless the various data types are individually represented in a form that computers can interpret. This limited capability for data integration in turn hinders efficient computational reasoning. Therefore, it was necessary to develop a common, community-supported ontology for vaccine research with both natural language and logical definitions of the terms involved.

To promote vaccine data standardization, integration, and computer-assisted reasoning, the Vaccine Ontology (VO) was developed by the VO developers group, whose active members at the time of this work were Oliver He (University of Michigan), Bjoern Peters (La Jolla Institute for Allergy & Immunology), Alan Ruttenberg (University at Buffalo) and myself.
This chapter introduces the overall VO design, some core VO terms, and examples of how the VO can be used to answer specific questions in the vaccine domain.

4.2 Vaccine ontology overview

The VO was developed using OWL [114] and the Protégé editor [126]. In compliance with the OBO Foundry ID policy described in Section 6.2.1, the latest version of VO is always available at a stable URL. In addition, VO has been deposited in the NCBO BioPortal [127], and is listed on the OBO website [55].

Most of the VO terms are for specific vaccines, reflecting the ontology’s focus on the categorization and relationships of vaccines and vaccine components, vaccination investigation, and vaccine-host interactions. Vaccine-induced immune responses and vaccine protection against targeted diseases or pathogens are derived from the fundamental vaccine-host interaction and are emphasized in VO.

Some terms assigned VO identifiers may not be vaccine specific but cannot be found in external ontologies. For example, the term ‘edible’ indicates that a material entity (e.g., a vaccine) is orally ingestible. This term may be better located in other ontologies such as the Phenotypic Quality Ontology (PATO) [128], whose focus is on terms describing qualities of entities. In this case, a unique VO identifier has been assigned to this term for now, and a term request has been submitted to PATO, a typical approach to collaborative work following the OBO Foundry principles. VO is interdisciplinary and interoperable with other ontologies, especially the OBO Foundry candidate ontologies. Terms from other ontologies are imported in order to avoid duplication and to support interoperability of the scientific data annotated with them, data that typically span disciplinary boundaries. VO utilizes the Basic Formal Ontology (BFO) [79] as an upper level ontology. The relation terms defined in the RO [54] have been used in VO for representing commonly used relations.
VO also utilizes the IAO [125], an ontology of information entities based on the BFO. VO imports BFO, RO and IAO entirely, as these ontologies are the most frequently used and their relatively small sizes do not hinder efficient editing.

However, as current editing tools fail to handle larger ontologies, ontologies such as FMA [129] or the NCBI taxonomy [130] cannot be imported into VO in their entirety. In addition, these resources cover a broader scope than the VO, and many of their terms are not required in most cases. To import ontology terms from such large external ontologies, and to avoid duplicating terms already defined in other ontologies, VO relies on the MIREOT standard described in Section 6.3. OntoFox, described in Section 6.4, was used to import external ontology terms into VO. Currently, VO has imported terms from 12 external ontologies, such as the OBI [122], the Infectious Disease Ontology (IDO) [131], and the PATO [128]. For example, VO imports the term ‘pathogen’ from IDO using OntoFox.

4.3 Specific terms defined in the vaccine ontology

4.3.1 VO definition of the term ‘vaccine’

VO defines a vaccine as a processed material with the function that, when administered, it prevents or ameliorates a disorder in a target organism by inducing or modifying adaptive immune responses specific to the antigens in the vaccine. In Manchester syntax [110], ‘vaccine’ is a defined class; the constraints above translate into the following logical restrictions:

    Class: vaccine
        EquivalentTo:
            'processed material'
            and ('has function at some time' some
                ('vaccine function'
                 and ('realized in' some 'vaccine immunization')))
            and (is_specified_output_of some 'vaccine preparation')
        SubClassOf:
            'processed material'

To translate this to prose, a vaccine is designed to perform a specific vaccine function. The ‘vaccine function’ can be a ‘preventive vaccine function’ or a ‘therapeutic vaccine function’.
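The equivalent-class definition of ‘vaccine’ given above can be read as a conjunction of three conditions. The following Python sketch spells that reading out as a closed-world check over a flat description of an entity. It is only an approximation: an OWL reasoner works under the open-world assumption, and the dictionary layout, the helper names and the example entities (`vac_x`, `buffer_solution`) are hypothetical.

```python
# Toy closed-world check mirroring the 'vaccine' definition; not an OWL reasoner.
# The dictionary layout and helper names are hypothetical.

# The two kinds of vaccine function named in the text.
FUNCTION_PARENTS = {
    "preventive vaccine function": "vaccine function",
    "therapeutic vaccine function": "vaccine function",
}


def is_vaccine_function(function_type):
    """True for 'vaccine function' itself or one of its two asserted subclasses."""
    return (
        function_type == "vaccine function"
        or FUNCTION_PARENTS.get(function_type) == "vaccine function"
    )


def is_vaccine(entity):
    """Check the three conjuncts of the equivalent-class definition of 'vaccine'."""
    # Conjunct 1: 'processed material'
    is_processed_material = "processed material" in entity.get("types", [])
    # Conjunct 2: has some vaccine function realized in some vaccine immunization
    has_vaccine_function = any(
        is_vaccine_function(f["type"])
        and "vaccine immunization" in f.get("realized_in", [])
        for f in entity.get("functions", [])
    )
    # Conjunct 3: is_specified_output_of some 'vaccine preparation'
    from_vaccine_preparation = "vaccine preparation" in entity.get(
        "is_specified_output_of", []
    )
    return is_processed_material and has_vaccine_function and from_vaccine_preparation


vac_x = {
    "types": ["processed material"],
    "functions": [
        {"type": "preventive vaccine function", "realized_in": ["vaccine immunization"]}
    ],
    "is_specified_output_of": ["vaccine preparation"],
}
buffer_solution = {"types": ["processed material"], "functions": []}

if __name__ == "__main__":
    print("vac_x is a vaccine:", is_vaccine(vac_x))
    print("buffer_solution is a vaccine:", is_vaccine(buffer_solution))
```

Being a processed material is therefore necessary but not sufficient: the buffer solution satisfies the SubClassOf condition yet fails the other two conjuncts.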
The preventive vaccine function is a vaccine function realized by the process of vaccination and leading to induction of an adaptive immune response to the antigens in a vaccine, which protects against a specific disorder, or in Manchester syntax:

    Class: 'preventive vaccine function'
        SubClassOf:
            'realized in' some 'disorder prevention',
            'realized in' some 'induction of adaptive immune response to antigen',
            'vaccine function',
            'realized in' some vaccination

The ‘therapeutic vaccine function’ is defined similarly. Correspondingly, there are two types of vaccines: preventive vaccine and therapeutic vaccine. According to the Ontology for General Medical Science (OGMS [132]), a disorder is the physical basis of a disease such as infectious disease, cancer, allergy, or autoimmune disease. VO uses these disorders to build its asserted structure and define vaccines. For example, the class ‘human immunodeficiency virus vaccine’ is defined as a viral vaccine that is administered to prevent an infection of human immunodeficiency virus, or as represented using the Manchester syntax:

    'viral vaccine'
    and administered_to_prevent some (infection_of some 'Human immunodeficiency virus')

At the University of Michigan, VO was used to develop VIOLIN, a web-based vaccine database and analysis system to store and analyze research data concerning commercial vaccines and vaccines under clinical trials or in early stages of development [121]. Based on VIOLIN and user requirements, 301 vaccines or vaccine candidates for 20 different genera or species of animals have been included in VO, including all 146 vaccines licensed for human use in the USA and Canada. Many of these vaccines are also used in other countries. More efforts are under way to include additional licensed vaccines in VO.

Vaccines can also be classified depending on the vaccine preparation method, such as ‘inactivated vaccine’ and ‘subunit vaccine’.
To facilitate research and development of these different vaccines, these terms have been included. However, asserted multiple inheritance (i.e., a child term linked to multiple parent terms) arises if a vaccine is classified both under a vaccine that induces immunity in vivo against infection by a pathogen (e.g., Influenza virus) and under a vaccine that is prepared by inactivating the whole pathogen and using the inactivated pathogen as vaccine antigen. For example, a vaccine (e.g., Afluria) may be categorized as both an ‘Influenza virus vaccine’ and an ‘inactivated vaccine’. To increase explicitness, modularity, and maintainability, asserted multiple inheritance should be avoided during ontology development [133]. To address this, in VO, the asserted hierarchy is based on the OGMS disorder hierarchy and OWL reasoners are used to infer additional information. For example, Afluria is asserted under Influenza virus vaccine. The Afluria vaccine antigen is the whole viral organism that has quality “inactivated”. The Afluria vaccine is therefore declared as bearing the quality “vaccine organism inactivated”. As “inactivated vaccine” is defined as:

    EquivalentTo:
        vaccine
        and ('has quality at all times' some 'vaccine organism inactivated')

a reasoner will classify Afluria correctly as an “inactivated vaccine”.

4.3.2 VO definition of the term ‘vaccination’

The term ‘vaccination’ is another core term in VO. Vaccination is the process of administering a vaccine into an organism (e.g., human).
The definition of this VO term relies on three OBI terms. Specifically, vaccination (VO 0000002) is modeled as a process of ‘administering substance in vivo’ (OBI 0600007), in which some ‘vaccine’ realizes the ‘material to be added role’ (OBI 0000319) to an organism (OBI 0100026):

    'administering substance in vivo'
    and realizes some ('material to be added role' and role_of some vaccine)
    and realizes some ('target of material addition role' and role_of some organism)

Figure 4.1 is an example of ‘vaccination’ with Afluria influenza vaccine. Specifically, the Afluria vaccine which bears the ‘material to be added role’ is administered in vivo into a mouse (‘target of material addition role’). The vaccine is contained in a vial (‘containing function’) and drawn into a syringe (‘injection function’) for vaccine injection. This vaccination is implemented with an administration dose of 0.2 ml (‘administration dose role’) and through the intramuscular route (‘administration route role’). As a result of this work, the whole process of administering Afluria can be described in an OWL file allowing computers to understand and parse a vaccination process, and thus support automated reasoning.

Figure 4.1: Representation of vaccination (VO 0000002) using VO and OBI. All relation terms are italicized.

4.3.3 VO representation of immune response to a vaccine

The study of immune responses in an organism administered with a vaccine is critical to vaccine research and development. The classes under the VO term ‘vaccine-induced host immune response’ are shown in Figure 4.2. Those immune responses important for protection against various diseases are emphasized. Specifically, a vaccine induces adaptive immune responses including immunities mediated by B cells or T cells, and T helper type 1 or 2 immune responses. Antigen processing and presentation is undertaken in B cells and other professional antigen presenting cells (e.g.
macrophages and dendritic cells). B cells give rise to antibody-mediated immune responses, while T cells give rise to cytotoxic T lymphocyte activities. A T helper 1 (Th1) type immune response is normally required to protect against infections caused by viruses (e.g., Poliovirus) and intracellular bacteria (e.g., Brucella), while a T helper 2 (Th2) type immune response is usually required to protect against extracellular bacteria (e.g., E. coli). Meanwhile, a vaccine also induces activation of various cells including dendritic cells and lymphocytes. The above information has been included in VO (Figure 4.2). In addition, VO includes information about vaccine-induced innate immune responses, which are often stimulated by vaccine components (e.g., adjuvant [134]). Vaccine-induced activation of various cell types is also included in VO (Figure 4.2).

Figure 4.2: Hierarchy of vaccine-induced immune response in VO.

Although not explicitly stated, the current VO terms of vaccine-induced immune responses reference corresponding immune responses introduced in the Gene Ontology (GO) [53]. A vaccine-induced immune response (e.g., vaccine-induced T-helper 1 type immune response) can be considered as a cross product between a corresponding GO term (e.g., T-helper 1 type immune response) and a VO-specific term (e.g., vaccine-induced adaptive immune response). The GO term ‘T-helper 1 type immune response’ (GO:0042088) is associated with 219 gene products [135]. Further investigations are required to determine whether all or a portion of these 219 gene products are indeed associated with a vaccine-induced T-helper 1 type immune response.

4.4 Vaccine ontology applications

4.4.1 Naming vaccine-specific terms

VO contains different aspects of vaccine composition and biology and can, therefore, be used to model individual vaccines. An example of modeling two influenza vaccines, Afluria and FluMist, is illustrated in Figure 4.3. Afluria is an inactivated influenza vaccine manufactured by CSL Limited and administered intramuscularly. FluMist is a live attenuated influenza vaccine manufactured by MedImmune and is administered intranasally. Both Afluria and FluMist share many similar allergens (e.g., chicken egg protein). Due to their different vaccination routes, different types of adverse events may be induced. For example, Afluria induces injection-site pain and muscle ache, while FluMist induces cough and sore throat. The similarities and differences shown in Figure 4.3A can also be transferred into the computer-readable ontological representation (Figure 4.3B).

This model contains many terms unique to VO, such as vaccine, influenza vaccine, and the names of these two vaccines, all of which have been assigned VO specific identifiers in the VO namespace. This model also contains many terms that originate from other ontologies. For example, Influenza virus A and B are imported from the NCBI taxonomy [130] using the MIREOT [124] system and retain their original identifiers. This approach allows VO to maintain an optimized structure for modeling vaccine-specific features while integrating with existing ontological resources, thus ensuring orthogonality and synergy of different ontologies. The represented ontology terms and computer-interpretable format can further be used for development of different computational tools for automated reasoning. Conversely, other biomedical ontologies such as OBI [122] and the Influenza Ontology (InfluenzO [136]) import specific VO terms as part of their development process.

4.4.2 Vaccine data exchange and integration

The VO is loaded on the Ontobee server (described in Chapter 7), and can be queried via the corresponding SPARQL endpoint. For example, the SPARQL queries shown in Appendix D retrieve all information about the FluMist vaccine (VO 0000044).
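To make the idea of such a query concrete, here is a minimal Python sketch that only builds the query text; it is not the query from Appendix D. It assumes that VO terms are identified by OBO-style IRIs of the form http://purl.obolibrary.org/obo/VO_0000044, and the helper names obo_iri and term_description_query are hypothetical. The endpoint address is not reproduced here, so the sketch does not send the query anywhere.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"  # assumption: OBO-style PURL base


def obo_iri(term_id: str) -> str:
    """Convert an identifier such as 'VO 0000044' or 'VO:0000044' to an OBO-style IRI."""
    normalized = term_id.strip().replace(":", " ").replace("_", " ")
    prefix, _, local = normalized.partition(" ")
    local = local.strip()
    if not prefix.isalnum() or not local.isdigit():
        raise ValueError(f"unrecognized term identifier: {term_id!r}")
    return f"{OBO_BASE}{prefix}_{local}"


def term_description_query(term_id: str) -> str:
    """Build a SPARQL query listing every property/value pair asserted on the term."""
    return (
        "SELECT ?property ?value\n"
        "WHERE {\n"
        f"  <{obo_iri(term_id)}> ?property ?value .\n"
        "}"
    )


if __name__ == "__main__":
    # Prints the query text for the FluMist term mentioned in the text.
    print(term_description_query("VO 0000044"))
```

The resulting string could be submitted to any SPARQL endpoint that serves VO; restricting the pattern to a named graph, if the endpoint uses one per ontology, is left out of this sketch.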
SPARQL queries can also span multiple biomedical ontologies, allowing for integration with other resources.

4.4.3 Development of vaccine knowledgebase and semantic web

A VO-based vaccine knowledgebase can be generated by representing the data curated in the VIOLIN vaccine database [121] as instances of VO in a standardized approach to comply with the VO requirements. VIOLIN contains a large amount of data about vaccine protection experiments. All VIOLIN vaccine protection data can be represented as VO instances using the OWL format by using and expanding the VO modeling of vaccine protection assays. Instances of vaccine protection assays using the data from the VIOLIN database have been generated. Such an integration approach allows users to query vaccine protection experimental data using complex SPARQL queries and apply the results for advanced vaccine analysis.

4.4.4 VO-based literature mining

VO can be used to facilitate vaccine literature mining. Progress in vaccine research has led to a dramatic increase in the number of vaccine-related papers. As a result, it has become increasingly challenging to retrieve relevant vaccine data for research purposes. There are currently more than 200,000 vaccine-related journal publications based on a search of “vaccine OR vaccination” in the PubMed literature database [56]. PubMed articles are annotated with the Medical Subject Headings (MeSH, [137]). However, MeSH contains limited vaccine-specific information. For example, Brucella is an intracellular bacterium that causes brucellosis, the most common zoonotic disease worldwide [138]. MeSH contains the term Brucella vaccine but does not include any subclasses under this term, limiting the search for Brucella vaccine in PubMed. However, 40 specific Brucella vaccines are currently represented in VO as subclasses (children) of the VO term Brucella vaccine. Each subclass in VO has an ‘is a’ relationship with its parent class.
This ensures that all subclasses (e.g., Brucella RB51) can be included when a parent class (e.g., “Brucella vaccine”) is searched. Inclusion of these 40 specific Brucella vaccines and their synonyms as keywords in a PubMed search for Brucella vaccine increased the search results by 25%, from 1296 to 1619 (as of September 19, 2009). Specific annotations of different vaccines in VO can also be used for literature searching.

One use case is a search for “live attenuated Brucella vaccine” in PubMed. As of June 16, 2009, a direct PubMed search of this string of keywords returned 58 papers (or PubMed hits). A search using VO information, performed at the University of Michigan, increased the recall of searching “live attenuated Brucella vaccine” 13-fold (693/55) compared to searching without VO (Table 4.1).

Those results also showed that the precision of the search remains high (96%), demonstrating that VO can be used to significantly improve PubMed searching efficacy in the vaccine domain.

Table 4.1: VO enhanced literature search.

    PubMed Search Keywords                              Hits   True   Precision
    live attenuated Brucella vaccine                      58     55     95%
    Consider live attenuated Brucella vaccine in VO:
      Brucella (RB51 OR SRB51)                           182    182    100%
      Brucella (strain 19 OR S19)                        537    510     95%
      Brucella Rev.                                      145    144     99%
      B. suis (strain 2 OR S2)                            11     10     91%
      Brucella bacA mutant vaccine                         1      1    100%
      Other 12 live attenuated Brucella vaccines in VO    62     59     95%
    Total (unique ones)                                  720    693     96%

4.5 Discussion

As a collaborative community-based effort, VO is closely related to many other biomedical ontologies. As vaccines are integral in the prevention of many infectious diseases, VO has strong ties with IDO. VO encompasses vaccines against various vaccine-preventable diseases with a particular emphasis on infectious diseases.
Since a vaccine can be developed against different stages of the life cycle of an infectious pathogen, the combined application of VO and IDO will provide a superior means for analyzing differing vaccine strategies. VO development has been and continues to be closely related to the development of the OBI (described in Chapter 3). Many VO terms (e.g., vaccination) pertain to various vaccine experiments that are in the purview of OBI. Continued close collaborations between these projects will ensure coordinated evolution of the many different resources.

Some key challenges remain for future development in VO and sister ontologies such as IDO, OGMS and OBI. For example, the relations among disease/disorder, organism, and infectious disposition are currently under debate. In the current version of VO many new relation terms have been defined, such as “administered to prevent” and “has route specification”. However, using the RO defined relations would allow easier querying across different ontologies, and provide a clearer understanding of relations between entities [54]. A common representation between those efforts is therefore preferable. The new relation terms are currently defined in VO as an intermediate step in VO development, and will be reexamined in future work, looking for opportunities to reuse existing relations defined in RO, in collaboration with the RO developers.

Another challenge is that those ontologies whose terms are often needed for VO development are still under development and extension. For example, extensions are needed for the NCBI taxonomy. An infection of human by an influenza virus is a disorder that forms the physical basis of a human influenza disease. In the NCBI Taxonomy, Influenzavirus A-C are three genera under the Orthomyxoviridae family. ‘Unidentified influenza virus’ is one species under unclassified Orthomyxoviridae.
There is no single term called Influenzavirus that covers all these different Influenza viruses, which all may cause an influenza disease. Currently there are 20 licensed Influenza vaccines stored in VO. In most cases, each Influenza vaccine covers more than one Influenza genus or species. To simplify the description, a new term Influenzavirus may need to be generated in NCBI Taxonomy in order to cover these four types of influenza viruses. In this case, a suggestion to the NCBI Taxonomy team for inclusion of the new term Influenzavirus should be submitted. In summary, a collaborative effort is required to progress VO and sister ontologies to address different needs and challenges.

4.6 Conclusion

VO is targeted to include all licensed vaccines in different countries and regions as well as vaccines in clinical trials and undergoing development in research laboratories. This inclusion will allow advanced integration and intelligent analysis of the large amount of vaccine data produced around the world. Continuing development of VO will include additional information in such aspects of vaccines as vaccine clinical trials and vaccine surveillance. By structuring complex vaccine data types and data volumes, this approach will promote a shared understanding of vaccines.

Figure 4.3: Comparison of Afluria and FluMist influenza vaccines using VO. The labels of ontology terms are shown in (A). The same content can also be represented by ontology identifiers understandable by computer programs (B). Each arrow represents a direction of a relation between two classes shown in boxes. All relations are italicized.

Chapter 5
Semi-automated ontology building using design patterns

5.1 Introduction

A complex, expressive and logically rigorous domain representation that can be practically maintained and validated by reasoners, such as Pellet [60] and FaCT++ [102], can be constructed through the creation of OWL [92] classes with logically necessary and sufficient definitions.
However, manually adding classes and logical axioms is a time-consuming [139], possibly error-prone process, in spite of using advanced ontology editors such as Protégé [126]. Also, using such editors requires rather extensive knowledge of OWL. This requirement significantly limits the number of people who can contribute productively to enriching the ontology. Considering that each year many hundreds of candidate terms are being submitted to large resources such as OBI, the process of defining them must not become a bottleneck.

The approach described here is motivated by the observation that definitions of a significant proportion of term requests can be accommodated by a limited number of pre-defined design patterns. It falls into the realm of ontology design patterns (ODP), which cover those techniques used to solve common and recurring representation problems [140, 141, 142, 143]. Relying on such ODP, a practical solution, geared toward bioontology developers and editors, has been developed. In order to engage domain experts without extensive practice in ontology development, the required input for each such design pattern was formulated as a Quick Term Template (QTT), which can be edited as an Excel spreadsheet. The Excel spreadsheet format was chosen as being the most ubiquitous, familiar, and easy to use for scientists. The work on QTT was led by Philippe Rocca-Serra, and I participated in the implementation and testing. In the following, an example of a common term request is illustrated, namely assays that measure the concentration of a specified molecular compound in a given material, which is typical for clinical chemistry assays. Requests for terms to identify such assays come from diverse communities, including EBI’s BioInvestigation Index [144, 145], the Immune Epitope Database [146] and the Influenza Virus BioHealthbase [147].
This example illustrates the QTT process as a proof of principle.

5.2 Methodology and results

The Quick Term Template submission process has four main steps, as shown in Figure 5.1:
1. Agreement by the OBI consortium on the logical definition of the parent class for submissions matching a certain pattern;
2. Identification of entities that can be varied with respect to the parent class (the differentia), for which a QTT spreadsheet containing one column for each such entity is generated;
3. Population of a QTT spreadsheet by domain experts;
4. Processing the QTT submission to generate new classes and definitions, and returning the label and identifier of an OBI class for each valid row of the submission.

5.2.1 Step 1: Develop the representation of the parent class

The example used throughout this section is a QTT for subclasses of ‘analyte assay’ in OBI (OBI:0000443). Such assays measure the concentration of a specified molecular entity relative to a given material entity, such as measuring glucose concentration in blood in units of µg per liter. Each logical definition relates the material in which the concentration is measured (the evaluant; e.g., blood), the molecular entity that is detected (the analyte; e.g., glucose), and the units of the measurement being made (e.g., microgram per liter). The full logical definition of this class is shown in Figure 5.2: an analyte assay achieves planned objective some analyte measurement objective. During the analyte assay, the evaluant role is realized (e.g., by the blood) and the analyte role is borne by some scattered aggregate constituted of homogeneous grains (e.g., glucose in the blood). The output of the assay is information about concentration, a relational quality of the analyte towards the evaluant. Figure 5.3 shows how the analyte measuring assay is modeled in OBI.
The corresponding textual definition is: “An analyte assay is an assay with the objective to determine the concentration of one substance (bearer of the analyte role) that is present in (part of) another (bearer of the evaluant role)”.

[Figure 5.1 panel text: Ontology builders and knowledge representation experts deliver an axiomatized class model in OWL. Domain experts leverage the modeling pattern by filling in the Quick Term Template spreadsheet. Varying parts of class restrictions are specified in the header and filled in columns of a Quick Term Template spreadsheet. A QTT processing tool reports back labels and identifiers of requested terms, and generates an OWL file with any new definitions that need to be incorporated into OBI.]

Figure 5.1: Overview of the process using the OBI modeling of the class ‘analyte assay’ as a starting point or seed class (Step 1). The variable parts (differentiae) of the representation are used to derive a Quick Term Template made up of 3 fields (Steps 2 & 3). The OWL file is generated using a dedicated tool (Step 4). For simplicity, identifiers were omitted in this figure.

5.2.2 Step 2: Derive tabular Quick Term Template

A large number of current requests for terms are subclasses of analyte assay. Their differentiae are the analyte (i.e., what the concentration is being detected of), the evaluant (i.e., the material in which the analyte concentration is detected) and the unit, which is used to qualify the measurement datum. Consequently, a Quick Term Template for an analyte assay needs columns for only those three entities. Table 5.1 depicts a QTT with several example entities, as they would be seen in a spreadsheet by a submitter.
Each column is to be filled with elements that are of a specified general type. The analyte column is expected to be a subclass of molecular entity, and the evaluant column any material entity.

    Analyte assay:
        achieves planned objective some analyte measurement objective
        and realizes some ('evaluant role' and (role of some material entity))
        and realizes some ('analyte role' and
            (role of some ('scattered molecular aggregate' and
                has grain only molecular entity)))
        and has_specified_output
            some ('scalar measurement datum'
                and ('is quality measurement of' some 'molecular concentration')
                and ('has measurement unit label' some 'concentration unit label'))

Figure 5.2: OWL restrictions that logically define the analyte assay class in OBI.

5.2.3 Step 3: Domain experts populate the template

The template hides the complexity of modeling by only identifying the differentiating entities needed for the definition of the class while hiding the actual relations binding those entities together. The burden of adding logical definitions is displaced advantageously from the users to the machine, which reliably and automatically populates class specifications from the template and the differentiating entities supplied by the user. A template such as this one would be accompanied by guidelines for users explaining what values are allowed in the columns, and how they will be interpreted in building the assay.

5.2.4 Step 4: Submission processing

Following submission of a completed QTT, QTT processing goes through the steps outlined below.
1. Identify referenced classes from external ontologies, and import them as necessary via the MIREOT mechanism [124]. OBI relies on this mechanism to reference classes in external ontologies. If an entity is absent from the external resource, a new term request to that resource is necessary. Request processing was quick enough not to be perceived as a hindrance to the process.
2. Create an OWL class description by substituting values from the spreadsheet.
3. Use the constructed class description to do a query for equivalent classes already in OBI. If an equivalent class exists, store its URI and label and continue to the next row.
4. If an equivalent class does not already exist in OBI, create a unique OBI URI and associate with it a new class defined by the constructed class description. Add metadata such as label and definition. As a QTT submission creates fully logically defined classes, the creation of labels and textual definitions can be automated. For the examples in Table 5.1, the class defined by Row 1 is assigned the label ‘glucose concentration measurement in blood in units of mmol per liter’, the class corresponding to Row 5 the label ‘interferon gamma concentration measurement in cell culture supernatant in units of µg per liter’.
5. Use a reasoner to perform a consistency check and classification of the combination of the existing OBI file and one with the newly created classes.
6. Report on processing of the template, and return a list of URIs and labels corresponding to the rows of the submitted QTT spreadsheet.

Figure 5.3: Analyte assay class in OBI. Adapted with permission from Alan

5.3 Implementation

The implementation of steps 1-3 in the QTT approach requires no automation as their focus is on providing the template to be used throughout. The ‘analyte assay’ example is shown in Figure 5.2 and Table 5.1. Step 4 requires implementation of an automated QTT handler.
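As a rough illustration of what such a handler has to do for items 2 to 4 and 6 of Step 4, the following Python sketch fills a simplified class expression from each row, reuses an existing class when the same expression is already known, and otherwise mints an identifier and an automatic label. It is a sketch under stated assumptions, not the handler described in this chapter: the names QttRow, EXPRESSION and process are hypothetical, the "EX:" identifiers are placeholders rather than OBI URIs, the expression is a shortened stand-in for Figure 5.2, and plain string equality stands in for the reasoner-based equivalence query. The import and consistency-check steps (items 1 and 5) are omitted.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class QttRow:
    """One row of the analyte assay template (labels only, for simplicity)."""
    analyte: str
    evaluant: str
    unit: str


# Shortened stand-in for the class expression of Figure 5.2 (assumption).
EXPRESSION = (
    "'analyte assay' "
    "and (realizes some ('evaluant role' and (role_of some '{evaluant}'))) "
    "and (realizes some ('analyte role' and (role_of some "
    "('scattered molecular aggregate' and ('has grain' only '{analyte}'))))) "
    "and (has_specified_output some ('scalar measurement datum' "
    "and ('has measurement unit label' some '{unit}')))"
)


def build_expression(row: QttRow) -> str:
    """Item 2: substitute the spreadsheet values into the class expression."""
    return EXPRESSION.format(analyte=row.analyte, evaluant=row.evaluant, unit=row.unit)


def build_label(row: QttRow) -> str:
    """Item 4: labels follow the pattern quoted in the text."""
    return f"{row.analyte} concentration measurement in {row.evaluant} in units of {row.unit}"


def process(rows, known, id_counter):
    """Items 3, 4 and 6: reuse or create a class per row and report (id, label).

    `known` maps class expressions to (id, label) and is updated in place.
    String equality is a placeholder for a reasoner's equivalence query.
    """
    report = []
    for row in rows:
        expression = build_expression(row)
        if expression not in known:
            known[expression] = (f"EX:{next(id_counter):07d}", build_label(row))
        report.append(known[expression])
    return report


if __name__ == "__main__":
    submitted = [
        QttRow("glucose", "blood", "mmol per liter"),
        QttRow("interferon gamma", "cell culture supernatant", "µg per liter"),
        QttRow("glucose", "blood", "mmol per liter"),  # duplicate request
    ]
    for identifier, label in process(submitted, {}, count(1)):
        print(identifier, label)
```

In this simplified form the duplicate third row resolves to the identifier already created for the first row, which mirrors the intent of the equivalence check.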
In order to validate the approach, a prototype of a QTT handler was created using Perl. Plain OWL templates were derived from the previously described class representation as found in the ontology and populated with token values parsed out from the incoming QTT spreadsheet template (Step 4.5). This prototype implementation delivered the expected results and helped in refining requirements of the workflow for Step 4. However, running this implementation required extensive manual intervention, making it not end-user friendly. Therefore other options were considered.

As OBI developers are heavy users of the Protégé editor, its plugin library was explored for alternative implementation options. Three add-ons (the Matrix (Drummond, 2008), Excel Import [148] and OPPL [143] plugins for Protégé 4) were evaluated. The Matrix plugin enables tabular visualization of the axioms of an existing class viewed in Protégé. Potentially, this allows rapid crafting of a Quick Term Template from a class; however, the lack of a persistence mechanism for saving such a template significantly limits direct applicability of the Matrix plugin for the QTT approach. The Excel Import plugin has a fairly explicit aim: taking in an Excel spreadsheet and creating OWL classes by relying on a set of rules to declare the relations between columns. This is technically very close to what is required for implementing the QTT specification. However, while assessing the relevance of the Excel Import plugin, two major stumbling blocks appeared. First, the incoming spreadsheet had to be explicit, meaning that all restrictions and filler placeholders must be present as column headers for the OWL generation to occur properly. This requirement defeats the purpose of the QTT, which aims to conceal some of the modeling complexity from end users. Second, it is not possible to create nested axioms on a class X such as ‘X (realizes some (’evaluant role’ and (role of some Y)))’ from the tool’s Restriction Generator pane, which does not provide such an option. This second limitation means that Excel Import could be used for fairly simple and direct class restrictions but is incompatible with some of the more advanced patterns being tested by OBI developers.

More flexibility in specifying axioms and manipulating OWL ontologies is provided by the OPPL plugin, which relies on the Manchester syntax [110]. However, the OPPL plugin requires Protégé 4 while some of OBI development still relies on Protégé 3.4 features. All three of these tools are highly useful and fully functional for their intended applications, but each was missing functionality necessary to develop an end-to-end prototype for QTT template processing, as detailed above. Specifically, none provided the ability to build class expression templates of the complexity shown in Figure 5.2, persist such templates, and populate them by parsing information from a spreadsheet.

Therefore the process was designed to use a prerelease version of MappingMaster, a plugin for Protégé 3.4 [149] for mapping spreadsheet content into OWL that is under active development. The following section describes experience with this tool. MappingMaster provides a Domain Specific Language (DSL) that is based on the Manchester Syntax to define these mappings. In this DSL, any reference to an OWL named class, OWL property, OWL individual, or a data value can be substituted with a reference to one or more cells within a spreadsheet.
Any expressions containing such references are preprocessed and the relevant spreadsheet content specified by these references is imported. This content can then be used in four main ways:
(1) It can be used to directly name OWL entities that are created on demand.
(2) It can be used to annotate OWL entities that are created on demand.
(3) The content may reference existing OWL entities, either directly as a URI or through an annotation.
(4) The content may be used as a literal data value.

Using one of these approaches, each reference within an expression is thus resolved during preprocessing to a named OWL entity or a data value and the resolved value is substituted for its associated reference. A standard Manchester syntax processor (e.g., the OWLAPI [150]) can then interpret the resulting expression and generate the OWL equivalent statement. Declaratively specifying mappings in this way has several advantages. No programming or scripting expertise is required to write those mappings, and they can be easily shared using the MappingMaster plugin where they can be persistently stored as OWL files. The mappings can then easily be executed repeatedly on different spreadsheets with the same structure. Since MappingMaster is available as a Protégé plugin, the results of mapping processing can be examined immediately within the ontology editor, and the mappings modified as needed and immediately re-executed, speeding the development process. MappingMaster also includes an interactive editor for the mapping DSL that supports on-the-fly entity name checking and dynamic expansion of entity references.

The DSL expressions used to convert the QTT template into OWL classes as shown in Figure 5.4 are passed to MappingMaster. Running the tool generates an additional OWL file that contains all newly created classes.
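The preprocessing idea described above, replacing each cell reference by the content of the referenced cell before a Manchester syntax processor sees the expression, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not MappingMaster's implementation: it assumes that a reference of the form @B* means "column B of the row currently being processed", it ignores any parenthesized qualifier that follows a reference, and it only covers the case where the cell content directly names an existing entity. The function names and the column layout of the sample row are assumptions.

```python
import re

# "@" + column letters + "*", optionally followed by a qualifier in parentheses
# (the qualifier is ignored in this sketch).
REFERENCE = re.compile(r"@([A-Z]+)\*(?:\([^()]*\))?")


def column_index(letters: str) -> int:
    """Convert spreadsheet column letters to a zero-based index (A -> 0, Z -> 25, AA -> 26)."""
    index = 0
    for letter in letters:
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index - 1


def resolve(template: str, row: list) -> str:
    """Replace every cell reference in the template with the value from the given row."""
    def substitute(match):
        return row[column_index(match.group(1))]
    return REFERENCE.sub(substitute, template)


if __name__ == "__main__":
    # Assumed layout: label and identifier columns for analyte, evaluant and unit.
    row = ["glucose", "CHEBI:17234", "blood", "FMA:670", "mmol per liter", "UO:0000300"]
    template = (
        "(realizes some ('evaluant role' and (role_of some @D*(material_entity)))) and "
        "(realizes some ('analyte role' and (role_of some "
        "('scattered molecular aggregate' and ('has grain' only @B*('molecular entity'))))))"
    )
    print(resolve(template, row))
```

Running the same template over each row of a spreadsheet yields one resolved expression per requested term, which is the property that makes declarative mappings reusable across spreadsheets with the same structure.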
All the material is available from the OBI wiki [151].

    Class: @A*(rdfs:label 'analyte assay')
    EquivalentTo:
        (achieves_planned_objective some 'analyte measurement objective') and
        (realizes some ('evaluant role' and (role_of some @D*(material_entity)))) and
        (realizes some ('analyte role' and
            (role_of some ('scattered molecular aggregate' and
                ('has grain' only @B*('molecular entity'))))))
    SubClassOf:
        has_specified_output some
            ('scalar measurement datum' and
            ('is quality measurement of' some 'molecular concentration') and
            ('has measurement unit label' some @F*('measurement unit label')))

Figure 5.4: Template expressions in MappingMaster’s DSL based on the Manchester Syntax. References to spreadsheet cells are prefixed with “@”. Cell values are substituted into the template by MappingMaster to generate class descriptions associated with the QTT.

5.4 Conclusion

The QTT process outlined here provides two benefits. First, it provides a method to incorporate a large number of classes considered of high value by domain experts in the communities OBI is designed to serve. For example, the IUPAC clinical chemistry resources [152] contain hundreds of assays describing analyte measurements. Second, the approach allows domain experts to directly populate templates without having to learn OWL syntax. The evaluation of the MappingMaster Protégé plug-in as a QTT handler produced encouraging results. This early version already exhibits key features such as flexible creation of axioms thanks to a domain specific language based on the Manchester Syntax, as well as capabilities to automatically generate names for the newly created defined classes by passing user-defined expressions to the rdfs:label field. Finally, it tries to avoid class duplication by inspecting the target ontology for existing entries matching the input received from the template.
In mid 2010, Carlo Torniai at Oregon Health & Science University successfully used QTT to add 100+ instruments from the Eagle-i [153] project to OBI.

As always with ‘off the shelf’ solutions such as MappingMaster, there are some caveats. Portions of the QTT specifications are not entirely supported and so reaching production grade reliability will require further work. Several rounds of evaluation have led to a number of feature requests that would facilitate performing the entire procedure. In particular, three areas would benefit from such efforts:
• First, it is desired to have the capability to perform automatic class resolution when dealing with external ontologies. This would require implementing the MIREOT mechanism. At present, external reference resolution in a QTT template is done manually prior to running MappingMaster, with processing aborted if any term is not found. It should be noted that ISACreator [145, 154], a spreadsheet editor geared towards managing experimental metadata, ships with an embedded ontology lookup service. It could be harnessed to create and populate Quick Term Templates in order to address this limitation.
• Second, development could be made more efficient by checking class membership and equivalence as run-time queries rather than relying on full reclassification of the ontology, which can be very time consuming. In the analyte assays example presented, this would allow quick detection of classes declared as analyte (first column in Table 5.1) but which are not subtypes of ‘molecular entity’, as expected. Reporting such errors immediately would speed up debugging of submitted QTT spreadsheets.
• Third, adding class metadata should be better supported. It is currently not possible to create class annotations such as cross-references, editor notes or alternative names, which would be easily supplied when creating the QTT spreadsheet. Future releases of the MappingMaster plugin will cater for this need.

Table 5.1: A basic QTT for submitting an analyte assay term request. This specification includes classes defined in several ontologies: Chemical Entities of Biological Interest (ChEBI) [155], the Foundational Model of Anatomy (FMA) [129], the Unit Ontology (UO) [156], the Protein Ontology (PRO) [157] and BFO.

    Analyte label     Analyte ID     Evaluant label            Evaluant ID         Measurement unit label  Measurement unit ID
    glucose           CHEBI:17234    blood                     FMA:670             mmol per liter          UO:0000300
    sodium chloride   CHEBI:26710    blood plasma              OBI:0100016         mmol per liter          UO:0000300
    chromium-51       CHEBI:50076    cell culture supernatant  OBI:1000023         ppm                     UO:0000169
    glucose           CHEBI:17234    material entity           BFO:MaterialEntity  mmol per liter          UO:0000300
    interferon gamma  PRO:000000017  cell culture supernatant  OBI:1000023         µg per liter            UO:0000301

Since this initial implementation of the QTT, a new version of the Protégé editor, Protégé 4, was released. Unfortunately, the MappingMaster plugin was not ported to this new version. However, a web-based server, Ontorat [158], was created to allow easy incorporation of multiple terms in resources following the QTT guideline. It takes one Excel spreadsheet as input, and returns the spreadsheet containing the IDs of the terms, as well as an OWL file that can either be copied into the source file or simply imported. Some features are still missing from Ontorat. For example, where MappingMaster could create new terms that did not exist in the ontology, Ontorat requires that all terms used already have an associated URI. Also, while Ontorat allows adding annotations on terms (or editing existing ones), it does not currently handle instances, which means some of those, such as the curation status annotation described in Section 6.2.2, are not supported.
Ontorat developers are actively working towards addressing those issues.

Chapter 6
Working with large biomedical resources

6.1 Introduction

In this chapter, I describe my investigation of what elements are required to support a large consortium of ontology developers developing compatible resources for publication on the Semantic Web. After building the biomedical resources described in Chapters 3 and 4, there remains a need to address how they can be used together, and in conjunction with other relevant resources, specifically in the framework of the OBO Foundry described in Section 2.4. The ability to work with multiple resources is critical to allow developers to concentrate on new requirements rather than duplicate existing efforts.

In order to harmoniously build on several distinct bodies of work originating from different communities, guidelines should be established and followed. While several OBO Foundry principles are already adopted or under discussion (see Section 2.4), critical gaps remained to be filled. To address some of those, several OBO policies were developed and are presented in Section 6.2. For example, the adoption of a common ID policy is crucial to fulfill the Semantic Web requirement of using URIs as identifiers, which will be critical for publishing resources as shown in Chapter 7. Additionally, sharing a common metadata set, as described in Section 6.2.2, not only provides a reliable, consistent behavior to the end user, but also allows building tools which support multiple resources consistently, as is the case with MIREOT (see Section 6.3), OntoFox (see Section 6.4), or Ontobee (see Chapter 7).

One of the core principles of the OBO Foundry is to maintain orthogonality between resources: no resource should duplicate work done by another, to prevent heterogeneity in representation of entities and duplication of effort.
In this context, it was critical that a mechanism be devised to allow selective usage of specific classes or portions from external resources: the MIREOT. MIREOT, described in Section 6.3, allows integration of multiple ontologies and taxonomies without being hindered by the increasing size of the result. However, to be useful to the community, MIREOT needs to be easily available and implemented such that non-computer specialists can use the system. To that effect, a web-based tool that implements the MIREOT guideline in a user-friendly way was created: OntoFox, described in Section 6.4.

6.2 OBO Policies

6.2.1 Common unique identifier policy

The OBO Foundry currently hosts resources under the OBO [159] and OWL [160] formats, and aims at providing tools such as the OWLAPI mapping for OBO format to allow their interconversion. In order to do so, one key requirement is to rely on a common system to handle unique identifiers for entities. A policy, normative for Foundry resources, includes a Foundry-compliant URI scheme, and rules to map from current OBO IDs and OBO legacy URIs towards them. In collaboration with Alan Ruttenberg and Chris Mungall, I devised an ID policy for OBO resources.

Following a common ID policy allows URIs to be more reliable, and ensures they are unique within the Foundry consortium. It also helps in building tools relying on this ID scheme. For example, the OBI [161] developers do not deal with ID management when creating entities; rather, a script is run pre-release to check and homogenize URIs for format and stability (e.g., was any URI deleted since the last release?). Another feature is to allow dereferencing and provide useful information to a user trying to resolve terms' URIs. The Ontobee browser, described in Chapter 7, displays an HTML page that provides human-readable information on each term, such as label and textual definition, while the page source is RDF that can be machine-processed.
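As an illustration of the kind of mapping such an ID policy defines, the following minimal Python sketch converts between OBO-style identifiers (e.g., CL:0000767) and URIs. The base URI, the regular expression and the function names are assumptions made for this example and are not quoted from the policy text; the underscore-separated local name follows the form of identifiers such as IAO_0000115 used elsewhere in this chapter.

```python
import re

# Assumed base URI, for illustration only; substitute the base mandated
# by the ID policy actually in force.
OBO_BASE = "http://purl.obolibrary.org/obo/"

# Simplified pattern: an alphanumeric prefix, a colon, a numeric local ID.
_OBO_ID = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):(\d+)$")


def obo_id_to_uri(obo_id, base=OBO_BASE):
    """Map an OBO-style ID such as 'CL:0000767' to '<base>CL_0000767'."""
    match = _OBO_ID.match(obo_id)
    if match is None:
        raise ValueError(f"not an OBO-style identifier: {obo_id!r}")
    prefix, local_id = match.groups()
    return f"{base}{prefix}_{local_id}"


def uri_to_obo_id(uri, base=OBO_BASE):
    """Inverse mapping: '<base>CL_0000767' back to 'CL:0000767'."""
    if not uri.startswith(base):
        raise ValueError(f"URI does not start with {base!r}: {uri!r}")
    # Split on the last underscore so prefixes containing '_' survive.
    prefix, sep, local_id = uri[len(base):].rpartition("_")
    if not sep or not prefix or not local_id.isdigit():
        raise ValueError(f"unexpected local name in {uri!r}")
    return f"{prefix}:{local_id}"
```

Keeping the mapping purely syntactic is what allows a pre-release script to check identifiers for format without consulting the ontology itself.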
Finally, the ID policy specifies versioning rules for ontology releases, effectively creating a version history for resources. By doing so, users are always able to access the latest published version and get the most up-to-date developments, or instead use a specifically dated release and maintain stability of their own resource. They can also test different versions and ensure no conflicts are created between versions before deciding to update. The ID policy has been adopted throughout the OBO Foundry.

6.2.2 Improving documentation by sharing metadata through the IAO

The IAO is an ontology of information entities, which aims at providing high-level blocks upon which specific resources can build. It describes classes such as directive information entity, which can for example be extended in a clinical-focused ontology by the clinical guideline subclass (see Section 8.3.3 for an example in AERO). As part of the IAO project, a distinct file defining common metadata properties has been created. This file can be imported independently of the "core" IAO, and used by any developer. The IAO common metadata set contributes to the realization of the principle of documenting ontologies within the OBO Foundry.

Other efforts already exist to formalize metadata, such as the Simple Knowledge Organization System (SKOS) [162] and the Dublin Core (DC) Metadata element set [163]. However, considering the case of dc:creator, its definition reads "An entity primarily responsible for making the resource", where the resource is described by the class bearing this property. For example, in a book description, the dc:creator property value is set to the name of the author of the book, and does not capture the name of the author of the book description, which is what is intended with iao:definition_editor.
Similarly, skos:definition is intended for defining concepts, which is not suitable in this case [164].

In the IAO metadata set, common and expected annotation properties, such as definition and editor preferred term, are documented, allowing tool developers to rely on them to build their user interface. Other properties such as definition source or definition editor were created to record any references used in developing the definition and who created the term. This allows resource consumers to go back and check on the origin of the term and what its intended meaning is, and/or contact the relevant individual should they need more clarification about its usage. The importance of having human-readable definitions was described in Section 2.2.3: for example, the AERO relies on the PHAC glossary, and includes references to the appropriate source via the definition source annotation property.

Curators of the ontology can add example of usage and editor note annotations to further clarify what the term denotes and what its intended usage is. Other slightly more complex properties have been designed to enable quality assessment of the terms. Namely, the curation status specification class provides a list of predefined instances (i.e., 'example to be eventually removed', 'metadata complete', 'organizational term', 'ready for release', 'metadata incomplete', 'uncurated', 'pending final vetting', 'to be replaced with external ontology term', 'requires discussion') that can be used on each class to mark its degree of "readiness" and stability. Similarly, the class obsolescence reason specification offers a list of predefined values that can be used on obsoleted terms to give more information as to why that term was deprecated and indicate (in conjunction with, for example, an editor note) what the term replacement is.
Finally, an OBO Foundry unique label annotation property was recently added in the ontology-metadata file to allow disambiguation between terms local to a resource when they are taken in the whole set of OBO Foundry ontologies. OBO Foundry unique labels are automatically generated based on regular expressions provided by each ontology, when processed by the OBO package manager currently being written by the OBO Foundry custodians. Appendix E provides further description of the IAO annotation properties. The IAO metadata set is distributed as a file independently of the main IAO (which deals with representation of information entities), allowing resources to selectively import this ontology-metadata file.

6.2.3 Discussion

Despite the progress made on homogenizing some aspects of the OBO Foundry consortium guidelines, work remains to be done in several aspects. For example, sometimes terms need to be retired as ontologies evolve. The OBO Foundry doesn't currently formalize a standard deprecation policy, which leads to the problem of different policies within resources. As a general guideline, deprecated terms are not deleted from the ontology: (1) deleting terms contravenes the Cimino desiderata presented in Section 2.3 and (2) removing a term that has been used in the past can be confusing for users. Some discrepancies exist between the practice of the GO [165] and other resources, such as OBI: in the GO [166], when terms are merged one term effectively disappears from the ontology file and its identifier is maintained as an alt_id annotation property on the term it is merged with. By contrast, in the OBI, one term is deprecated, and its obsolescence reason specification is set to "term merged", with the addition of an editor note indicating the replacement term. As a consequence, tools such as MIREOT expect to find the URI of classes in their declaration (and not as a secondary ID).
MIREOT scripts are therefore unable to retrieve the external information in the GO merging case, resulting in a loss of terms on the importing ontology side, such as recently happened with some PATO terms [128]. A common deprecation policy, following the example of what has been done regarding the ID policy, would help formalize expected behavior and guide tool developers. A review of the current reasons for obsolescence in the GO would be useful to ensure adequacy between the instances defined by the IAO and the needs of the curators.

Most proposed policies have been adopted fairly recently, and evaluation is very preliminary. Although the relative costs and benefits could be difficult to quantify, a number of use cases illustrate the advantage of relying on numerical identifiers. When choosing to use numerical IDs for terms, it is anticipated that tooling issues will hinder adoption of those standards: nobody wants to type in OBI_0001234 when doing a SPARQL query. However, it is believed that in the long term (i) tooling issues will be resolved and (ii) using numerical IDs will be beneficial for maintenance of the resources and their necessary evolution. As illustration of these respective points, see for example the recent threads mentioning how (i) the Protégé [126] team added a new menu "render by rdfs:label" to their interface and (ii) issues faced by the developer of GoodRelations [167] to rename some classes.

Those policies also need to evolve with time and accommodate, for example, legacy resources. An update has recently been proposed to enable the Protein Ontology (PRO) to reuse identifiers from existing databases, such as UniProt [168], in the interest of (1) making the connection to the original resource more explicit and (2) not having to mint new identifiers for each existing identifier in the UniProt database.
However, a corollary to that update is managing the dereferencing of those additional terms, which implies correspondingly updating the redirection rules to accommodate the new identifier format, without breaking existing support. A description of the redirection rules current as of November 2013 is available online.

6.3 The MIREOT guideline

6.3.1 Introduction

The ability to share and reuse existing ontological resources is an important consideration when developing a new ontology. For example, when developing an ontology related to the biomedical domain, it may be useful to include terms from the GO [165] to represent biological processes or from the PATO [128] to represent properties of entities. Ontologies such as GO and PATO are built collaboratively by communities of experts and are the products of substantial effort. Recapitulating this work instead of reusing it represents a duplication of development effort and results in multiple ontologies covering the same domain. It could also result in projects having different unique identifiers to denote the same entity, which would require post-hoc, potentially error-prone, identifier mapping systems to enable data integration.

While it appears that building upon existing ontologies is the best way to proceed, developers are faced with a number of practical challenges when trying to do so. The easiest way to integrate an existing ontology is to rely on the owl:imports [160] mechanism, which imports the external resource as a whole. However, current limitations in tools and reasoners can sometimes make this impractical. Popular OWL tools (e.g., Protégé [126] and Pellet [60]) can neither load nor reason over very large ontologies such as the NCBI Taxonomy Database [169] or the Foundational Model of Anatomy [170]. Furthermore, external ontologies may have been constructed using design principles which do not align with the principles of the ontology requiring their import.
In this instance, wholly importing such ontologies could lead to inconsistencies or unintended inferences [171]. Other import options are possible, for instance using software that extracts a module [172] of the external ontology.

A module can be seen as a subset of an external ontology that, when imported by another ontology, allows the same inferences to be drawn with respect to the classes of interest as if the whole ontology had been imported, and allows queries to be answered without losing any reasoning power. However, if an extracted module is to be useful, the external ontology needs to be structured in a way that is compatible with the importing ontology (e.g., using the same upper ontology and relationship types), and the logical axioms need to accurately represent existing knowledge, which is not always the case at the current stage of development of some resources. For example, during the development of the OBI [83], importing the root class of the Common Anatomy Reference Ontology (CARO) [173] was not desired as its definition intersected multiple classes in OBI, making it difficult to determine how the two ontologies aligned. Specifically, the root of CARO, anatomical entity, encompasses both material and immaterial entities, which belong to different hierarchies in OBI. In addition, although software tools that extract modules are available, most are in early stages of development.

Several modularization tools [60, 174, 175, 176] were tried. All of them discarded annotations, resulting in modules containing only the class declarations and no annotation properties, such as labels or definitions.
There were also software crashes on large ontologies (the size of the ontologies capable of being loaded varying with the tool; for example, the Chemical Entities of Biological Interest (ChEBI) Ontology [155] can be loaded with SWOOP but not with Protégé 3.4). One tool [176] had undocumented assumptions about the form of URIs used as class names and therefore extracted empty modules. The other tools described were able to extract modules by automatically determining their size. This resulted in either a single term or a large number of terms being extracted, depending on the provided arguments, as the tools attempt to approximate a module without discarding potentially useful information. These large modules undermine the goal of having imports of a manageable size. In conclusion, the current ontology modularization tool set is in the early stages of development and, though promising, does not address current needs.

To address these issues, a set of guidelines was developed for importing terms from multiple resources while avoiding the overhead of importing the complete ontology from which the terms derive. In collaboration with Alan Ruttenberg, I created the MIREOT guidelines to aid the development of OBI. OBI uses the BFO [79] as an upper-level ontology and has been submitted for inclusion in the OBO Foundry [55]. MIREOT enables reuse, where appropriate, of existing ontology resources, therefore avoiding duplication of effort and ensuring orthogonality (i.e., non-overlapping scope), and contributing to the realization of a fundamental principle of the OBO Foundry. MIREOT is a guideline independent of any design principle, and provides a mechanism by which external ontology terms can be selectively imported, even if they do not use a particular upper ontology or OWL DL [109].

6.3.2 Policy

In deciding upon a minimum unit of import, the first step was to consider the practice of other ontology efforts.
For example, in the GO, the intended denotation of classes remains stable such that even when the ontology is repaired or reorganized, the effects of such changes do not affect the intended meaning of individual terms. Rather, the changes are towards more carefully expressing the logical relations between them. When a term's meaning changes, the term is deprecated [166]. Therefore a term can be considered as stable, in isolation from the rest of the ontology, and terms (i.e., individual classes in isolation from the ontology) can be used as the basic unit of import. The current implementation of MIREOT has been limited to the import of terms from other ontologies that aspire to be a part of the OBO Foundry, and so adhere to a similar deprecation policy.

The minimum amount of information needed to reference an external term is its URI (i.e., the identifier for this term) and its source ontology URI (i.e., where the term comes from). Generally, these items remain stable and can be used to unambiguously reference the external term. The minimum amount of information needed to then integrate this class in the importing ontology is its desired position in the hierarchy, specifically the URI of its direct superclass (i.e., under which class the term is to be asserted).

Taken together, the following minimal set is enough to consistently reference an external term:

1. Source ontology URI: the logical URI of the ontology containing the external term to be imported.
2. Source term URI: the logical URI of the specific term to import.
3. Target direct superclass URI: the logical URI of the direct asserted superclass in the importing ontology.

While physical URIs may evolve over time, logical URIs are stable and can be used to unambiguously refer to the same term.
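The minimal set above can be pictured as a small record with three fields. The following Python sketch is only an illustration under stated assumptions: the class name, the field names, the `ex:importedFrom` placeholder property and the example.org URIs are hypothetical and are not taken from the OBI scripts, which are written in Perl and Lisp.

```python
from dataclasses import dataclass

# Hypothetical placeholder for whatever annotation property the importing
# ontology uses to record where a term comes from.
IMPORTED_FROM = "ex:importedFrom"


@dataclass(frozen=True)
class MinimalImport:
    """The three URIs needed to reference one external term."""
    source_ontology_uri: str     # ontology containing the external term
    source_term_uri: str         # the external term itself
    target_superclass_uri: str   # asserted parent in the importing ontology

    def triples(self):
        """Return the statements a file such as external.owl would need."""
        return [
            (self.source_term_uri, "rdf:type", "owl:Class"),
            (self.source_term_uri, "rdfs:subClassOf", self.target_superclass_uri),
            (self.source_term_uri, IMPORTED_FROM, self.source_ontology_uri),
        ]


# Example with made-up URIs: import "basophil" under "cell".
basophil = MinimalImport(
    source_ontology_uri="http://example.org/ontologies/cl.owl",
    source_term_uri="http://example.org/terms/CL_0000767",
    target_superclass_uri="http://example.org/terms/CL_0000000",
)
```

Because the record holds only stable identifiers, it can be kept under version control while labels, definitions and other volatile annotations are regenerated separately.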
To ease development of the importing ontology, it is also recommended, although not required, that additional information about the external class be added, such as its label and textual definition, or any other kind of information that may be deemed useful by the ontology developers. This additional information, when appropriate, is mapped into the importing ontology's annotation properties. As it is prone to modification by the source ontology developers (e.g., when updating a definition), it is stored in a separate file that can be removed and rebuilt on a regular basis, allowing for regular updates within the importing target ontology.

6.3.3 Implementation

I implemented the MIREOT guidelines in the context of the OBI project (Figure 6.1). The implementation can be decomposed into a two-step process:

1. Gather the minimum information for the external class.
2. Use this minimum information to fetch additional elements, like labels and definitions.

Once the external term is identified for import, the first step is to gather the corresponding minimum information set. This set is stored in a file called external.owl (all scripts and files are available under the OBI Subversion Repository [177]). In the current implementation, a Perl script, add-to-external.pl, can be used to append the minimum information set for a given external term to the external.owl file. The script takes as arguments the identifier of the external class to be imported and its parent class in the target hierarchy. In addition, a mapping between the prefix used in the identifier and the external source ontology URI is built into the script. For example, when requesting the term CL:0000767 (see below), the script maps the CL: prefix to its source ontology URI. Curators therefore need only specify the ID of the external class to import (rather than the full URI) and the ID of the class it should be imported under. Upon addition of an external class, a visual check can be performed as the Perl script returns to standard output the OWL excerpt added to the file.

[Figure 6.1 diagram elements: ontology curator; minimal information (URI of the term, URI of the source ontology, superclass in the target ontology); external.owl; external-templates.txt; SPARQL endpoint; externalDerived.owl; target ontology; arrows numbered 1 to 5]

Figure 6.1: Diagram of the MIREOT mechanism as implemented by OBI. 1. The ontology editor gathers the minimal information for the class to import and adds it into the external.owl file. 2. A script parses the external.owl file, and for each class selects the appropriate SPARQL CONSTRUCT template. 3. The SPARQL query is executed against a SPARQL endpoint (e.g., Neurocommons). 4. The results of the SPARQL queries are combined into the externalDerived.owl file. 5. The target ontology imports the external.owl and externalDerived.owl files.

In the current implementation, the additional information can be obtained programmatically via SPARQL [94] CONSTRUCT queries (Figure 6.2). While access to a SPARQL endpoint is not compulsory to use the MIREOT mechanism, it provides easy access, using standard protocols, to the information needed. These queries [178] specify, for each source ontology, which extra elements about the term are to be extracted, such as the definition and preferred label, and how to map these into the corresponding OBI annotation properties.

prefix rdf: <>
prefix rdfs: <>
prefix owl: <>
prefix obi: <>
prefix obo: <>
prefix iao: <>

construct
{
  _ID_GOES_HERE_ rdf:type owl:Class.
  _ID_GOES_HERE_ iao:IAO_0000111 ?label.
  _ID_GOES_HERE_ rdfs:label ?label.
  _ID_GOES_HERE_ iao:IAO_0000115 ?definition.
}
where
{
  { _ID_GOES_HERE_ rdfs:label ?label. }
  UNION
  { _ID_GOES_HERE_ obo:hasDefinition ?blank.
    ?blank rdfs:label ?definition }
}

Figure 6.2: Template SPARQL query. For convenience, alias:preferredTerm and alias:definition are used to reference annotation properties IAO_0000111 and IAO_0000115 [125] respectively.
The _ID_GOES_HERE_ pattern will be replaced by the script when building the CONSTRUCT query.

For example, in the current OWL rendering of OBO files, definitions are individuals and the rdfs:label of those individuals records the text of the definitions. Within the OBI implementation of the MIREOT guidelines, the value of the rdfs:label of the oboInOwl:Definition is copied to iao:definition. Only annotation properties which map directly to the target ontology's own metadata are copied; new properties, if not specified in the source ontology, are not created.

Finally, a script, create-external-derived.lisp, iterates through the minimum information stored in external.owl. Depending on the source ontology URI of each of the imported terms, it selects the correct SPARQL template and substitutes the relevant ID. The queries are then executed against the Neurocommons OBO SPARQL endpoint [57, 179]. This supplementary information is stored in a second file, externalDerived.owl. This file can be removed on an ad-hoc basis (e.g., before releasing new versions of the importing ontology) so that it can then be rebuilt via script based on external.owl in order to refresh the additional information (e.g., label). The two files, external.owl and externalDerived.owl, are then imported by the target ontology, providing the necessary information to the editors while at the same time keeping it independent from the importing ontology's proprietary classes. This introduces an additional level of modularity, separating the domain ontology of interest from the external ontologies.

In the following sections I present three different cases of application of the MIREOT guidelines, implemented during the OBI development.

Use Case One - Basophil and Cell classes

The OBI cell class was replaced with that from the Cell Type (CL) ontology [180].
CL is part of the OBO Foundry effort, and the cell class as defined by this resource should be reused, instead of creating another class denoting the same entity. This class can subsequently be chosen as the parent of another imported term as needed. For example, the following invocation of the add-to-external.pl script:

perl add-to-external.pl CL:0000767 CL:0000000

will add the class basophil (CL:0000767) as a subclass of the class cell (CL:0000000), and set the source ontology URI to the one mapped from the CL: prefix. Once imported, the basophil and cell classes can be used like any other OBI class. For example, the material entity CD3+ T cell culture is defined as:

Class: CD3+ T cell culture
    SubClassOf:
        'cell culture'
        and 'has grain' some (cell
            and (has_part some 'CD3 subunit with immunoglobulin domain'))

Use Case Two - Taxonomic information

The cell use case highlights what is likely to be the most common import scenario (i.e., a simple import of one external term, making it available for direct use in the target ontology). However, in some cases, more than that single external term may be required, and to account for this MIREOT has been devised to be flexible.

Consider the scenario in which there are two experiments, one in human and one in mouse. The files are annotated with the classes human and mouse from the ontology, which are in turn mapped from the NCBI taxonomy database. Somebody may want to query for all experiments in mammals, without specifying the exact species. In this case, one needs to know that human and mouse are subclasses (even indirect) of mammals in the NCBI taxonomy. The root term of the NCBI taxonomy database is an example of a term OBI didn't want to include, as it encompasses viroids, unclassified sequences and other sequences, which were not considered useful when defining organisms. Therefore, when mapping towards an NCBI term, it was decided to also retrieve all its superclasses up to the Archaea, Bacteria, Eukaryota and Viruses levels of the NCBI taxonomy database (Figure 6.3).
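The retrieval of superclasses up to a chosen set of top-level taxa amounts to a walk up the parent hierarchy that stops at one of those taxa. OBI performs this step with a SPARQL query; the Python below is only a minimal sketch of the same idea. The function name and the parent map are hypothetical, and the lineage shown is a heavily abridged toy example, not the actual NCBI taxonomy.

```python
# Top-level taxa at which retrieval stops, as named in the text.
STOP_TAXA = {"Archaea", "Bacteria", "Eukaryota", "Viruses"}

# Toy, heavily abridged child -> parent map (NOT the real NCBI lineage).
TOY_PARENTS = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Eukaryota",
    "Eukaryota": "cellular organisms",
    "cellular organisms": "root",
}


def ancestors_up_to(term, parent_of, stop_at):
    """Collect superclasses of `term`, nearest first, up to and including
    the first one found in `stop_at`; anything above it is left out."""
    chain = []
    seen = {term}
    current = term
    while current not in stop_at:
        parent = parent_of.get(current)
        if parent is None:
            break  # reached the top without meeting a stop taxon
        if parent in seen:
            raise ValueError(f"cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


mouse_lineage = ancestors_up_to("Mus musculus", TOY_PARENTS, STOP_TAXA)
human_lineage = ancestors_up_to("Homo sapiens", TOY_PARENTS, STOP_TAXA)
```

With both lineages imported, a query for experiments in mammals can match the human and the mouse experiments through their shared superclass, while the unwanted root term is never pulled in.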
When the create-external-derived.lisp script parses the external.owl file and encounters an NCBI taxonomy ID, it will invoke a specific SPARQL query (Figure 6.3). As per the mechanism described above, the minimum information about the imported external class (e.g., Mus musculus) is defined in external.owl, whereas the additional rank information (e.g., genus, kingdom, phylum, i.e., its superclasses) is stored in externalDerived.owl. On the same model, any information that the importing ontology editors would require could be added in the externalDerived.owl file: the only requirement is to write the corresponding SPARQL query.

Use Case Three - Unit instances

Finally, the most recent use case addresses the need for OBI to represent units of measurement. The Unit Ontology (UO) [156] tackles this effort, and currently encompasses more than 2000 classes. However, the representation of units as classes doesn't comply with the design pattern chosen by OBI and the IAO, which take the stance that in the absence of a satisfactory unit representation theory, things that are understood, i.e., unit labels, should be represented. Therefore the UO classes corresponding to specific units (such as gram or meter) were imported as instances of the IAO class measurement unit label. Figure 6.4 shows the result of this addition into the OBI hierarchy.

Figure 6.4: Screenshot of the Protégé editor showing the class temperature unit and its instance degree celsius, as imported using the MIREOT mechanism from the UO ontology.

Work is in progress with the developers of the UO to reach agreement on the best way to represent measurement units in a consistent manner, and it is expected the different resources will align as part of the OBO Foundry collaborative effort.

6.3.4 Discussion

MIREOT offers a lightweight mechanism for importing specific classes from external ontologies.
The approach is decoupled from the importing ontology, allowing a computational update mechanism which does not interfere with the primary ontology under development. MIREOT is currently implemented and used by several ontology efforts, including OBI, the IAO, the VO [119], the IDO [131] and the Influenza Ontology (InfluenzO) [136]. In the context of OBI, 472 terms are currently explicitly imported, which in turn leads to actual integration of 1447 classes (due to the automatic retrieval of parents when using the NCBI taxonomy).

With broader use of the MIREOT mechanism by OBI and other resources, several minor issues arose. The first issue is a case of cyclic imports between resources: for example, IAO developers required import of the term investigation, a class which already exists in OBI. However, OBI imports IAO, and therefore re-imports, via IAO, its own investigation class. This is not problematic in general, as duplication of triples in OWL files is of no consequence. However, when OBI curators decided to update the definition of the investigation class, the information natively in OBI and that imported from IAO became out of sync: two different definitions were displayed to the curators. Moreover, one of them could not be edited, as it is outside the remit of OBI to edit IAO definitions. One solution to this problem would be to update the IAO import, but this requires a release of OBI with the updated investigation definition, its upload on Neurocommons, and for the IAO developers to update their information and produce a new release of IAO. At best, this implies a delay of a few days, more realistically of a few weeks, until the information in both files is again synchronized. Such a solution also has consequences; when updating the information from the SPARQL endpoint, a specification of which RDF graph [181] the term originally belonged to is required.
Taking again the example of the investigation class, when querying based on its URI without specifying the RDF graph, the OBI class, but also the one distributed by IAO, will be returned. This is not the desired behavior; in this example, the IAO annotation property values are now out of date compared to the original and authoritative OBI file. A better solution would be for tools to recognize and prioritize the origin of a class based on its URI. Ontology editing tools would then display only the information originating from the target ontology when editing the target ontology file. This issue remains to be addressed.

Additionally, when updating imported information, the SPARQL endpoint where the information resides must be up to date. The implementation currently relies on the OBO Foundry resources at the Neurocommons OBO distribution. This is updated nightly with the latest information from the OBO server, and can therefore be reasonably relied upon for accessing current resources. The timeliness of the information may not always be known if extending the mechanism to another SPARQL endpoint, or other sets of ontological resources.

The MIREOT standard presents an approach to importing classes from external ontologies that removes the overhead of full ontology imports whilst maintaining a decoupled but usable reference to the external classes. There is a clear trade-off that MIREOT offers between practicality and full, axiomatic completeness. Being a lightweight import mechanism, only the desired parts of an external ontology are imported, at the risk that inferences drawn may be incomplete or incorrect; correct inference using the external classes is only guaranteed if the full ontology, or a module, is imported. It does however present the important advantage of overcoming the obstacle presented by ontologies which are not fully interoperable at present.
Since only partial, reasoner-supported consistency checking is undertaken, extra care is taken when assertions about an imported term are made. In adding axioms, such as the subclass axiom when importing the external term, the aim for the ontology editor is to only assert true statements, which do not contradict or alter the meaning of the term in its source ontology. With the more stable OBO ontologies, the denotation of the term, as explained in the definition or documentation, is clearer and more correct than the axiomatization, the former being easier and quicker to formulate. It is anticipated that some of the statements added by the importing ontology may migrate to the source ontologies at some point in the future; a fruit of the collaborative nature of OBO Foundry ontology development.

When deciding to import an external term the textual definition is reviewed and, if required (e.g., if the definition is ambiguous), discussion with the original editor is undertaken. An important aspect of the MIREOT mechanism is maintaining the term’s meaning, and ensuring that if it changes the term gets deprecated, and therefore it is recommended to use resources adhering to a deprecation policy. As imports are done from OBO Foundry candidate ontologies there is a community process for monitoring change, a shared understanding of the basics of the domain, and the intention to eventually share the same upper-level ontology. Therefore, it is expected that terms will be deprecated if there is a significant change in meaning, and the MIREOT mechanism is flexible enough to adjust and update import of terms as the other ontologies start enhancing their logical definitions.

Finally, the original implementation of the MIREOT guidelines relied on command-line scripts and specific libraries, making it difficult for curators with no programmatic skills to use.
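
The script-based step mentioned above, in which a template query has its _ID_GOES_HERE_ placeholder replaced by a term identifier (see the caption of Figure 6.3), can be sketched in Python as follows. This is an illustrative sketch, not the original OBI scripts: the template is a shortened stand-in for Figure 6.3 with prefix declarations omitted, and the function name `build_construct_query` is an assumption made for this example.

```python
# Shortened stand-in for the Figure 6.3 template (prefix declarations omitted).
TEMPLATE = """construct { ?super rdfs:subClassOf ?parent . ?super rdfs:label ?label . }
where {
  { _ID_GOES_HERE_ rdfs:subClassOf ?super .
    ?super rdfs:subClassOf ?parent .
    ?super rdfs:label ?label . }
  UNION
  { ?super rdfs:subClassOf ?parent .
    ?super rdfs:label ?label .
    FILTER (?super = _ID_GOES_HERE_) }
}"""

PLACEHOLDER = "_ID_GOES_HERE_"


def build_construct_query(template, term_id):
    """Return the template with every placeholder replaced by term_id."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the placeholder")
    return template.replace(PLACEHOLDER, term_id)


# One query per imported term; the identifiers follow the tax:_<id> form
# used in Figure 6.3 (9606 and 10090 are NCBI Taxonomy identifiers).
queries = {tid: build_construct_query(TEMPLATE, tid)
           for tid in ["tax:_9606", "tax:_10090"]}
print(queries["tax:_9606"])
```

Each generated query would then be sent to the SPARQL endpoint and the resulting triples merged into the import file.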
Subsequently, a web service, OntoFox [182], described in the following section, has been developed to facilitate the process.

6.4 OntoFox

6.4.1 Introduction

MIREOT is being used in an increasing number of ontology projects, for example, OBI, VO, InfluenzO, the Neural ElectroMagnetic Ontologies (NEMO), ontologies developed in the Neuroscience Information Framework (NIF), and as part of the eagle-i project. While editing tools commonly provide means to reference an external term by directly setting its URI, one must also manually enter auxiliary information necessary for practical editing, such as the label and definition, and update such information if the source ontology changes. Manual entry is time consuming and duplicates already existing information. Also, such terms would be hard to distinguish from those in the current resources, making their update a tedious process. In addition, it is often desirable to import additional related terms. For example, when the VO imports a species term, the inclusion of some of its superclasses allows for queries at different taxonomic ranks (e.g., kingdom, phylum, and species). To address these issues, an initial implementation based on MIREOT was created to facilitate managing the tedious aspects of this process automatically.

Alternatives such as computing modules [183] were investigated. Structural approaches use the syntax of the axioms of ontologies and mostly only consider the induced is-a hierarchy [176, 184]. Logic-based approaches take into account the consequences of ontologies and require that the extracted module captures the meaning of the imported terms used, i.e., includes all axioms relevant to the meaning of these terms. However, Grau et al. [172] proved that it is undecidable, even for description logics simpler than OWL-DL, to determine whether a subset of an ontology is a minimal logic-based module.
These approaches are relatively new, experience using them is limited, and experience with current Web-based implementations has found them to be unreliable. Moreover the methods do not provide ways to avoid import of certain terms or axioms that might not be considered desirable, or have other issues that prevent their easy use. Nonetheless the syntactic locality approach these methods use is applicable to single-term import and so is compatible with the MIREOT approach.

In section 6.3, an implementation of the MIREOT mechanism that demonstrates the feasibility of the approach is described. It is, however, command line-based and requires the specification of terms either by command-line scripts or construction of an ontology document. Specification of which ancillary information should be incorporated is by writing SPARQL queries, restricting its adoption by less technically able users. To facilitate application of the MIREOT guideline by the wider ontology community a more user-friendly system facilitating the import and update of external terms into a target ontology is desired. In addition, while MIREOT provides a practical yet simple approach to specifying external ontology terms, the OBI implementation does not provide the ability to consider restrictions on imported terms that a user may desire to import. To preserve the meaning of the imported terms, ontology developers might like to use ontology module extraction, e.g., extraction of the target class and its transitively related (via restriction) closure [176]. Ontology developers may also want the flexibility of including no superclasses, only one direct superclass, all superclasses to the top class, or a subset of all superclasses for a term, in order to provide additional relevant domain terms for their users.

To address these needs for ontology reuse, OntoFox, a web-based application implementing the MIREOT and related ontology term extraction strategies, was developed.
OntoFox facilitates ontology development by automatically fetching properties, annotations, and related terms from external ontology terms and saving the results as OWL serialized as RDF/XML [181] suitable for use with the OWL import directive. OntoFox provides a web-based package of solutions for ontology developers to extract, for subsequent import, different sets of ontology terms by following and expanding the initial MIREOT implementation and by developing related ontology term extraction methods based on SPARQL [94]. The following sections describe the general OntoFox web system, how users can choose which properties and related terms should be imported, and demonstrate how OntoFox is used in the VO development.

6.4.2 Methods

OntoFox system architecture

OntoFox uses a simple text format and web forms for data input in a user-friendly implementation, and is designed to not require any programming skills. OntoFox is implemented using a three-tier system architecture. At the front-end, data can be submitted using either web forms or by uploading a plain text input file. The input data are then processed using PHP and Java (middle-tier, application server), and SPARQL queries are then executed against an RDF triple store (back-end, database server), currently the Neurocommons SPARQL endpoint [67]. The web server then processes the result of each SPARQL query sent by the back-end server; as a result an RDF/XML file is created and offered for download to the user.

As OntoFox is a web-based system, it is accessible everywhere through the Internet without need for additional software installation. The techniques used in the OntoFox web application were chosen for maximum compatibility by using established W3C standards, specifically, OWL as a web ontology language, RDF/XML as its serialization, and SPARQL for queries.

OntoFox three-tier structure implementation

1. OntoFox web interface. The OntoFox web interface is designed based on iterative testing, thus far informal usability testing and feedback from initial users, following a spiral software development model [27]. It accepts the input from the user, via either web forms or uploading of a local text file, and presents the output data after query processing. Finding and entering the URIs for desired terms can be tedious. To speed up the term specification process, an ontology term suggestion feature based on auto-completion of the string of text entered by users after selecting the desired source ontology was implemented. The OntoFox server offers a list of potential matches, and upon selection, the associated term ID will show up in an input box next to the label. Additionally, the “Detail” hyperlink next to the term ID provides easy access to an interactive ontology browser allowing visual confirmation of the term definition and its position in the hierarchical ontology tree structure. Lastly, as shown in Figure 6.5, the user can click “Add” next to “Detail” to insert the full URL of the selected term into the input text box on the web interface.

    # give names to the top taxa
    alias:bacteria=tax:_2
    alias:eukaryota=tax:_2759
    alias:archaea=tax:_2157
    alias:viruses=tax:_10239
    alias:cellularOrganism=tax:_131567
    prefix rdf: <>
    prefix rdfs: <>
    prefix owl: <>
    prefix obi: <>
    prefix tax: <>
    prefix iao: <>
    construct
    { ?super rdf:type owl:Class.
      ?super rdfs:subClassOf ?parent.
      ?super iao:IAO_0000111 ?label.
      ?super rdfs:label ?label.
      ?super alias:importedFrom <>
    }
    where
    {
      { # We harvest the transitive superclass annotations
        _ID_GOES_HERE_ rdfs:subClassOf ?super.
        graph <>
        { ?super rdfs:subClassOf ?parent.
          ?super rdfs:label ?label.
        }
      }
      UNION
      { graph <>
        { ?super rdfs:subClassOf ?parent.
          ?super rdfs:label ?label.
          FILTER (?super=_ID_GOES_HERE_)
        }
      }
      FILTER (!((?super=alias:bacteria) || (?super=alias:eukaryota) || (?super=alias:viruses)
        || (?super=alias:archaea) || (?super = alias:cellularOrganism) || (?parent = alias:cellularOrganism)))
    }

Figure 6.3: Template SPARQL query for import from the NCBI taxonomy database. The _ID_GOES_HERE_ pattern will be replaced by the relevant NCBITax ID dynamically via script when building the CONSTRUCT query. This query allows retrieval of the class of interest and its parents up to a set of defined root classes. Note that the graph <> contains the source ontology, but the Neurocommons triple store includes inferred subClassOf triples.

Figure 6.5: OntoFox retrieval of the term ‘homo sapiens’ from the NCBI Taxonomy Ontology (NCBITaxon). Input data can be entered via web-based forms (A) or text file upload (B). The output OWL file [Additional file 1] can be visualized using Protégé (C). All terms from ‘homo sapiens’ up to Eukaryota are retrieved. Synonyms used to annotate each term are also included.

2. Data processing by the web application. The OntoFox application server runs on a Dell PowerEdge 2580 server running the Red Hat Linux operating system (Red Hat Enterprise Linux 5 server).
PHP and Java are used as programming languages in the web application server. General web-based programming and query submission are written using PHP. The OWLAPI [150], a Java API for manipulating OWL files, is used in OntoFox to read, process, and rewrite OWL files and save the final results as one OWL file after merging individual query results.

3. Data storage and access. The OntoFox internal RDF database server runs on a separate Dell PowerEdge 2580 server. The database server is powered by the OpenLink Virtuoso database engine [103]. While VO is loaded within this Virtuoso server, OntoFox also uses RDF data stored in other web accessible servers, for example, the Neurocommons knowledge management platform [57]. Fifteen biomedical ontologies generally used within the OBO community are available for users to select as source ontologies within OntoFox (Table 6.1). These ontologies, initially chosen to support VO development, were selected based on their specificity, community support, and maturity. They all adhere to a strict deprecation policy, ensuring that the meaning of each term remains stable until the term is deprecated. Though nothing bars serving more resources, these 15 ontologies were all that were required to cover all information needed for import via MIREOT during the VO development. Users can choose to provide another source ontology URI and corresponding SPARQL endpoint, allowing retrieval of terms outside of the OntoFox source ontology repository resources; however this is done at their own risk as term stability is not guaranteed.

Evaluation of OntoFox SPARQL retrieval of related terms

To compare the performance of the OntoFox SPARQL related term retrieval approach with the OWLAPI modularization, three sets of signature data were used. I performed the OWLAPI modularization, while Zuoshuang Xiang ran the queries on OntoFox. The first two sets of signature data include either one term (e.g., the OBI term ‘antigen’) or a list of OBI terms that were imported to VO.
The third set of signature terms for modularization includes all terms in the NIF Lexicon ontology (nif.owl). The nif.owl file uses approximately 30 external files. The OntoFox method and the OWLAPI modularization method were separately performed and compared. For the OWLAPI modularization, the OWLAPI SyntacticLocalityModuleExtractor with STAR module type was used.

6.4.3 Results

MIREOT implementation

As described in Section 6.3, the MIREOT guideline suggests the following minimal set: (1) source term URI, (2) target direct superclass URI, and (3) source ontology URI. These are the first parameters taken as input by OntoFox:

1. Source ontology URI. Box 1 of the OntoFox web input system includes a list of the 15 ontologies a user can select as source ontology (Figure 6.5). Alternatively, a user can request an unlisted source ontology in Box 2, in which case the URL of a SPARQL endpoint where this new source ontology can be accessed must be provided. For each external ontology term, OntoFox adds an importedFrom annotation property, which indicates the URI of the source ontology.

2. Low level source term URI. This parameter is equivalent to the source term URI in the MIREOT guideline. Box 3 allows users to input one or multiple source term URIs, entering one URI per line. For example: sapiens

3. Target direct superclass URI.
This is the URI of the direct superclass of the top-level source term chosen above (i.e., where to position the newly imported term(s) in the target ontology). This parameter is entered alongside the top-level source term URI in Box 4 using the directive “subClassOf” (see more detail below).

These three data items together unambiguously define a single term from the source ontology and where to position it (i.e., which class it is a subclass of) in the target ontology.

Annotation properties management

OntoFox provides several settings/directives allowing users to select which annotation properties to retrieve, and more importantly, in which format those should be returned.

1. Source term annotation URIs: By default (i.e., if no annotation URI is specified), OntoFox will not fetch any of the annotation properties of the selected term. A user can choose to retrieve specific annotation properties by specifying their URIs, or use the OntoFox command ‘includeAllAxioms’ to fetch all annotation properties associated with source ontology terms. This parameter is entered in Box 6 in the web input format (Figure 6.5).

2. “copyTo”: This directive is used to map an ontology term annotation to a new annotation property created in the target ontology, resulting in a duplication of the annotation property value in the output file. It is used at the beginning of a line, followed by an annotation URI used in the target ontology. For example, the “copyTo” command is used in Figure 6.6: copyTo #preferred term
This duplicates the value of the rdfs:label property into the “preferred term” annotation (IAO_0000111 from the IAO [125]), and both annotations are included in the output file. This directive can be used in the web form (Box 6) or in the OntoFox input text file.

3. “mapTo”: This directive allows mapping of an ontology term annotation: it will replace an existing annotation property in the target ontology with the value of another annotation property from the source ontology.
It is used at the beginning of a line, followed by an annotation URI from the target ontology. For example, Figure 6.6 contains an example of using the “mapTo” directive: mapTo #definition annotation property

As ontologies don’t always use a common set of annotation properties, this feature provides an easy way to integrate information from a source ontology into a target ontology while retaining a consistent metadata style. For example, the OBO2OWL script, used to automatically generate OWL versions of OBO ontologies within the OBO Foundry, uses the property “hasDefinition” to relate a term to an instance whose rdfs:label is that term’s definition. However VO uses the IAO metadata scheme, and directly relates the term to its definition via the definition annotation property. The mapTo directive instructs OntoFox to map the definition used in the source to the value of the VO annotation property for definition. This mapping directive is used in Box 6 of the web form input method or in OntoFox input text format.

Figure 6.6: OntoFox retrieval of PATO term ‘volume’ and its annotations. (A) OntoFox input data; (B) Protégé display of OntoFox output data. All terms from ‘volume’ up to ‘quality of continuant’ in PATO have been imported and positioned under the BFO term Quality. The desired annotation properties (IAO_0000111 ‘preferred term’ and IAO_0000115 ‘definition’) have been specified using OntoFox directives ‘copyTo’ and ‘mapTo’.

Managing incorporation of related terms

OntoFox provides a number of mechanisms for selecting related terms for import, all based on structural approaches and that have been used within VO development. Methods are provided for selective retrieval of parent terms, transitive retrieval of restrictions inspired by structural-based modularization techniques, and the extraction of a subtree rooted at a given term. In this section these mechanisms are detailed.
The setting “Top level source term URI” is designed to work in conjunction with another term specification when retrieving parent terms between lower and upper level source terms. A typical use is when importing some or all of the superclasses of a species term to allow for queries at different taxonomic ranks (e.g., kingdom, phylum, and species). For example, in between ‘homo sapiens’ and Eukaryota (the chosen top-level term) in the NCBI Taxonomy, there are 27 intermediate terms (cf. Figure 6.5). It would be very tedious to find, copy and then paste all those 29 terms into the new ontology. By specifying ‘homo sapiens’ as the low level source term, Eukaryota as the upper level source term, and the setting “includeAllIntermediates”, the 27 intermediate terms are automatically retrieved by OntoFox (Figure 6.5). In addition to this retrieval of all parent terms, OntoFox uses an algorithm to compute and retrieve intermediate source terms that are the closest ancestors of more than one low-level source term, and to remove intermediate terms that have only one parent term and one child term (Figure 6.7), leaving only terms that present alternatives for query. This setting, “includeComputedIntermediates”, provides an option to reduce the number of extracted ontology terms by retrieving fewer intermediate ontology terms than with the setting “includeAllIntermediates” (Figure 6.5), while still fulfilling many users’ requirements.

Figure 6.7: OntoFox algorithm for extracting computed intermediate classes. It removes any intermediate classes that have only one parent class and only one child class. Only intermediate terms with at least two children classes are kept.

Figure 6.8 demonstrates the usage of this setting. Eleven commonly used animal species are included as the low-level source terms. Using the setting “includeAllIntermediates”, 70 intermediate terms will be included. However, only six intermediate terms are included after the “includeComputedIntermediates” setting is applied (Figure 6.8).

Figure 6.8: OntoFox demonstration of the includeComputedIntermediates setting. Terms that are common ancestors (e.g., Bovidae) to at least 2 external terms are kept in the resulting hierarchy, in addition to the terms (e.g., Primates) explicitly requested.

Each of these six intermediate terms (e.g., Euarchontoglires) is the immediate parent class for at least two child terms (e.g., Primates and Homo sapiens). Primates and mammals are not leaf nodes in the taxonomy hierarchy when the sole parent term is Homo sapiens. Since these terms should be included as well in the final result of the OntoFox output file, they were intentionally included as low-level source terms.

A third choice for including selected terms is inspired by structural modularization techniques. Given a set of signature terms, OntoFox retrieves restrictions that are parent classes of a term. This choice is implemented using OntoFox’s SPARQL-based related term retrieval algorithm (Figure 6.9).

Where a restriction mentions another class, restrictions on that class are queried, and so on, until a fixed point is reached. The method gives useful results with the ontologies at the typical level of complexity encountered. It also has the benefit of being straightforwardly implemented in SPARQL and is highly scalable: current modularization algorithms use in-memory representations that require excessive memory for ontologies such as NCBI Taxonomy. Within the OntoFox user interface, users select this choice by choosing “includeAllAxioms”.

Figure 6.9: OntoFox SPARQL-based algorithm for retrieval of related terms. Its goal is to extract related terms and annotations associated with a set of signature terms (stored in Su) from an external ontology. This method was performed in OntoFox when the setting “includeAllAxioms” is selected.

To test OntoFox’s SPARQL method to retrieve related terms, three sets of signature terms (individual term, small subset of terms, larger ontology file) were given as input to the OntoFox method and OWLAPI modularization method. In all three cases, both methods generated identical results. One comparative test was performed using the set of terms in the Neurodegenerative Disease Phenotype Ontology (NDPO) that imports the NIF Lexicon ontology. The imports closure of this OWL file contains some 50,000 classes in 87 MB of OWL files. Applying the OWLAPI to the classes and object properties in NDPO.owl yielded a module with 1351 classes and 7 object properties, roughly a 2.5M OWL file including annotations. OntoFox generated the same results as measured by the ontology metrics provided by Protégé 4.1. These results support the claim that the OntoFox approach is an effective method for extracting related ontology terms. Finally, OntoFox can extract the whole branch of ontology terms below a specific ontology term.

Choices such as which terms in a parent hierarchy should be included are preliminary to module extraction techniques, which take as input a set of terms (signature) that the ontology developer has identified as being of interest. OntoFox supports experimentation by offering more than one choice for making such a term selection.

OntoFox data input and result output

Besides the web form-based data input, data can be uploaded as a text file to the OntoFox web server. This input file contains the same information as the web form input method, but makes it easier to submit batch jobs. The file upload method also makes it possible to keep track of submissions and easily update the input. An OntoFox sample input file has been developed for users to quickly understand and use the required format. The OntoFox input file can also be automatically generated using the button “Generate OntoFox Input File” from data in the web forms (Figure 6.5).
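
Returning briefly to the “includeComputedIntermediates” setting described under “Managing incorporation of related terms” (Figure 6.7), the pruning it performs can be sketched in Python. This is a minimal illustration, not the OntoFox code: it assumes a single-parent hierarchy held in a dictionary, it always keeps the top-level term, and the function names and the heavily simplified toy lineage are introduced for this example only.

```python
def all_intermediates(parent, low_level, top):
    """Collect every term on a path from a low-level term up to `top`."""
    selected = set()
    for term in low_level:
        node = term
        while True:
            selected.add(node)
            if node == top:
                break
            node = parent[node]  # KeyError if `top` is not an ancestor
    return selected


def computed_intermediates(parent, low_level, top):
    """Return a child -> parent map keeping only the requested terms, the
    top term, and intermediates with at least two children in the selection."""
    selected = all_intermediates(parent, low_level, top)
    children = {}
    for node in selected:
        if node != top:
            children.setdefault(parent[node], set()).add(node)
    kept = {top, *low_level}
    kept |= {n for n in selected if len(children.get(n, ())) >= 2}
    pruned = {}
    for node in kept:
        if node == top:
            continue
        ancestor = parent[node]
        while ancestor not in kept:  # skip removed single-child intermediates
            ancestor = parent[ancestor]
        pruned[node] = ancestor
    return pruned


# Toy, heavily simplified lineage (many real taxonomic ranks are omitted).
PARENT = {
    "Homo sapiens": "Primates", "Primates": "Euarchontoglires",
    "Mus musculus": "Rodentia", "Rodentia": "Euarchontoglires",
    "Euarchontoglires": "Mammalia",
    "Bos taurus": "Bovidae", "Bovidae": "Mammalia",
    "Mammalia": "Chordata", "Chordata": "Eukaryota",
}
RESULT = computed_intermediates(
    PARENT, ["Homo sapiens", "Mus musculus", "Bos taurus"], "Eukaryota")
for child, ancestor in sorted(RESULT.items()):
    print(child, "->", ancestor)
```

In this toy input, Primates, Rodentia, Bovidae and Chordata each have a single child in the selection and are dropped, while Euarchontoglires and Mammalia are kept because they are the closest common ancestors of several requested terms.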
Finally, jobs can be programmatically submitted to the OntoFox server via a script. As an example, the following command line can be used to provide an input file (input.txt) and retrieve the corresponding output file (output.owl):

    curl -s -F file=@/tmp/input.txt -o /tmp/output.owl

An OntoFox query can result in either a processing error, in which case an explicit message is provided to the user, or in the production of an OWL file serialized in the RDF/XML format. This OWL file constitutes an ontology on its own and can be visualized using the Protégé ontology editor [126] and directly imported into the target ontology using the OWL import directive. The OntoFox process can be executed at different times to import updated information of external ontology terms. By keeping and updating the original OntoFox input text file, users can subsequently query the OntoFox server on a regular basis and get up to date information with little effort.

OntoFox application in Vaccine Ontology (VO) development

Using OntoFox, VO currently imports approximately 1000 terms from 12 external ontologies such as GO [53], NCBI Taxonomy, OBI, PATO [128], and Mammalian Phenotype Ontology (MP) [63] (Table 6.2). When using OntoFox to develop VO, it was desirable to apply different settings depending on the source ontology considered, and therefore one OWL file to be imported was generated per external resource. Once imported into VO, external terms can be used exactly in the same way as other vaccine-specific VO terms. Different OntoFox settings have been applied for generating these 12 ontology subsets for VO imports (Table 6.2). In terms of superclass extraction, six were generated with the OntoFox setting “includeNoIntermediates”, which is particularly useful when the intermediate superclasses do not generate much more information needed for the target ontology. The setting “includeComputedIntermediates” was used for extracting ontology terms
from three external resources, including NCBITaxon, PATO, and the PRO [31]. In the case of the NCBI taxonomy it reduces the number of imported classes without losing the information of the most recent ancestor superclasses (Figure 6.8). Finally, the setting “includeAllIntermediates” has been used for extraction from OBI, ro proposed, and the Sequence Ontology (SO) (Table 6.2). These three external ontologies are closely related to VO, and their original hierarchies should be maintained for those terms imported to VO. Similarly, different annotation property settings have been applied (Table 6.2). Typically, VO follows the IAO’s ontology metadata scheme and uses the properties “rdfs:label” or “iao:definition”. To make the annotation styles consistent among all ontology terms in VO, the OntoFox directives “copyTo” and “mapTo” were used (Table 6.2).

6.4.4 Discussion

While an implementation of the MIREOT strategy has been performed in the context of OBI, it relies on command line scripts, making its use impractical for the average ontology curator and limiting its adoption by interested users. Comparatively, OntoFox provides a convenient web-based approach to use MIREOT that does not require programmatic skills and allows users to specify their requirements via simple text formats. In addition, the OntoFox server provides additional options for users to add and rewrite annotations, to include superclasses or subclasses, or select terms via related restrictions (transitively). This last option performs comparably to existing structural modularization methods. OntoFox uses an RDF triple store and SPARQL for information storage and retrieval, resulting in a system that scales better than in-memory modularization techniques. For the Neurodegenerative Disease Phenotype Ontology, OntoFox extracted the same module that a more sophisticated modularization technique did. While these more sophisticated techniques may be desirable, there are issues with their use.
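
The transitive selection via restrictions mentioned above (the “includeAllAxioms” setting, iterated until a fixed point is reached) can be sketched in Python as a worklist loop. This is an illustrative sketch only: a dictionary stands in for the SPARQL queries that would be sent to the triple store, and the function name, the `ex:` class names and the toy data are all invented for this example.

```python
from collections import deque


def restriction_closure(signature, restriction_targets):
    """Return the signature extended with every class reachable through
    restrictions, stopping when no new class is found (the fixed point).

    restriction_targets maps a class to the set of classes mentioned in the
    restrictions that are superclasses of that class. In a triple-store
    setting each lookup would be one query; here it is a dictionary access.
    """
    closure = set(signature)
    frontier = deque(signature)
    while frontier:
        term = frontier.popleft()
        for target in restriction_targets.get(term, ()):
            if target not in closure:  # only unseen classes are queried again
                closure.add(target)
                frontier.append(target)
    return closure


# Toy data with invented class names; note the cycle between ex:B and ex:D.
RESTRICTIONS = {
    "ex:A": {"ex:B", "ex:C"},
    "ex:B": {"ex:D"},
    "ex:D": {"ex:B"},
    "ex:E": {"ex:F"},
}
print(sorted(restriction_closure({"ex:A"}, RESTRICTIONS)))
```

Because each class enters the worklist at most once, the loop terminates even when restrictions refer to each other cyclically, and only the classes actually reached need to be held in memory.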
While OntoFox uses simpler methods to retrieve terms and axioms related to MIREOT specified terms, it provides a simpler and more understandable approach to reuse. This is particularly useful in conjunction with the fact that OntoFox provides an easy approach to incorporate frequent updates from source ontologies that are under active development.

The provision of a simple mechanism for importing selected terms from external ontologies does not shield the user from general issues associated with using external terms. When using terms from other ontologies, care must be taken to avoid a situation in which the meaning of an ontology term in the source ontology is different from the meaning of the term used in the target ontology. To avoid this problem, users are advised to exercise due diligence when selecting terms to import. OntoFox helps prevent this confusion, first by offering a limited set of 15 selected ontologies with good documentation and second by importing annotation properties, providing immediate access to the textual definitions. Where an ontology developer has questions as to the meaning of a term it is recommended that they contact the developers of the source ontology and ask for clarification and enhanced documentation. The 15 initially selected ontologies generally have trackers and mailing lists where questions can be posted.

Another issue is the evolution problem associated with using ontologies that are under active development, as is the case with most current biomedical ontologies. Although at a certain time point a certain term is used in the source and target ontologies equivalently, over time the usage of the term (and the associated classes) in both ontologies may change. It is considered good practice to not use terms from external ontologies in ways not consistent with their definition, and for ontologies to deprecate old and define new terms rather than changing the meaning of terms.
While OntoFox provides a way for users to automatically update the annotations of imported terms, it cannot monitor changes in meaning. Therefore it is up to developers to choose ontologies that have practices that will let them monitor for such changes and make adjustments as appropriate. OntoFox’s 15 initial ontologies were chosen because they tend to have predictable practices related to ontology evolution.

6.5 Conclusion

The common ID policy has been adopted as a normative principle for Foundry resources within the OBO Consortium, and it is expected that OBO library resources will abide by it. One strong incentive for developers to do so was coupling the Ontobee dereferencing service with obtaining the common prefix and using OBO types of PURLs. The implementation of Ontobee is described in the following chapter, Chapter 7. The IAO common metadata set is being used by multiple ontologies: it provides the common annotations supported by OntoFox and Ontobee. Work is in progress to augment it with the annotation properties required to enable automated OBO to OWL conversion.

While the current implementation of the MIREOT mechanism is tailored towards OWL ontologies, a similar mechanism could be applied to OBO format resources. It is also expected that an option in the released version of ontologies, such as OBI, will in the future enable the replacement of external.owl with imports.owl, a file of imports statements generated by extracting the ontology URIs mentioned in external.owl. Users would then be able to import all of the external resources, therefore replacing the MIREOT selected terms.

In the case of OntoFox, more ontologies will be included in the list of source ontology repositories. These ontologies may come from the OBO Foundry or other reliable sources. Developing an OntoFox plugin for ontology editors (e.g., Protégé) is also under consideration.
Editors of OBO format ontologies desire a similar facility, and while OntoFox currently supports resources in OWL, integrating an automatic conversion for OBO files could directly support the OBO format. As usability testing of the web interface has thus far been informal, more careful usability studies will be designed, such as a survey to solicit feedback from the community. A drawback to the current OntoFox approach is that it requires maintaining independent text files with the import directives (compared to the original MIREOT mechanism which reads in an existing OWL file). However, given that there are no editing tools available in either case, the human readable format seems suitable.

Finally, as module extraction technology matures, the ability to use such mechanisms for doing targeted imports, on a source-by-source basis, will be included.

Table 6.1: The 15 source ontologies currently available in OntoFox

Ontology    Source ontology URI    Example of source ontology term URI
CARO

Table 6.2: OntoFoxed ontologies in VO

#    Ontology name    # of signature terms    # of imported terms    Intermediates    Annotations
1    CARO             2                       2                      No               rdfs:label copyTo iao:preferredTerm; oboInOwl:hasDefinition copyTo iao:definition; oboInOwl:hasSynonym mapTo iao:alternativeTerm
2    CHEBI            13                      13                     No
3    DOID             10                      57                     All
4    FMA              2                       2                      No
5    GO               2                       2                      No
6    IDO              1                       2                      No
7    NCBITaxon        143                     198                    Computed
8    OBI              41                      48                     All              rdfs:label, iao:definition
9    PATO             15                      17                     Computed         rdfs:label copyTo iao:preferredTerm; oboInOwl:hasDefinition copyTo iao:definition; oboInOwl:hasSynonym mapTo iao:alternativeTerm
10   PRO              2                       2                      Computed
11   ro proposed      7                       9                      All
12   SO               1                       1                      No

Chapter 7

Publishing biomedical resources on the Semantic Web

7.1 Introduction

The goals of the OBO Foundry, and of biomedical ontology in general, are very much in line with those of the semantic web, and using semantic web technologies for using and sharing biomedical data annotated with ontology terms is desired.
One such technology, now existing in a number of implementations due to the popularity of LOD, is the practice around serving linked data so that it can be browsed and accessed at the granularity of instances [185]. While such services also somewhat address serving ontology terms and relations, existing implementations did not satisfy current needs. For example, I am not aware of a linked data service that accurately renders logical axioms, such as property restrictions, expressed in OWL. To make ontology terms accessible there is a need to, as with linked data, (1) provide human users with adequate information to understand what the term means, while also (2) making the documentation of these terms available in a form that automated tools can use. Each of these goals is here modified from the case where typically instances are browsed.

The human browsable presentation of ontology terms is intended for a few key audiences. The biomedical community is a diverse community with different practices and perspectives, and frequently uses different words to describe the same types of entity. Ontology developers are responsible for building ontologies. They need to be able to easily navigate their own work as well as the content of other ontologies in order to be able to find existing terms (and confirm they are indeed relevant) that should be reused rather than created de novo. Curators and annotators work with existing datasets and must be able to discover and understand the terms applicable to their data.
7.1.1 Requirements

Experience and iterations of discussions have led me (in collaboration with Alan Ruttenberg) to summarize two sets of requirements, one aimed at providing a useful user experience (U1-U9) and a second set of goals related to engineering (E1-E6).

Based on my experience within the community of practice, and after discussion with prospective users, the requirements aimed at providing a useful user experience are to:

U1) Provide a service with predictable behaviour across the whole body of OBO ontologies.

U2) Ensure that useful information is displayed. This includes at a minimum documentation, attribution, and provenance. Terms should be displayed with labels rather than identifiers, but identifiers should be accessible.

U3) Be clear as to what IRIs identify.

U4) Term IRIs should be used in scholarly citations. Common ways of bookmarking should yield the term IRI.

U5) Deliver RDF that is accurate when compared to the source ontology. Ontology writers use specific relations and axioms to communicate, and expect that users of their work will be able to retrieve them as they were written.

U6) Present both ontology-centered and term-centered views. Access will be via two common routes: either start at a given ontology and explore or search within it, or directly request the page for a term via its IRI or a search result.

U7) Display OWL expressions in a readable syntax. The RDF-centric rendering of OWL is difficult to understand beyond simple statements.

U8) Be able to customize views as desired by the ontology developers. Often ontologies, such as UniProt [97], have web sites presenting their work, and some have term browsers. Their developers don’t want to lose their “branding” by having a different site be the destination for viewing terms but at the same time wish to take advantage of the services Ontobee provides.

U9) Provide tools to aid navigation to ontology terms of interest.
For example, any ontology term appearing anywhere on the page should be clickable to view information about it, and the ability to search for terms should always be easily available. While ontology navigation and visualization is an ongoing area of research, efforts have been made to incorporate what is known, as well as users’ feedback, into this work.

On the engineering side, there is a different set of requirements, motivated by the desire to take advantage of already widely available semantic web technologies as well as promote their uptake in the community.

E1) Adherence to documented web and semantic web practices. The relevant specifications are for RDF [91], RDF/XML [181], OWL 2 [186], SPARQL [187], and XSLT [188]. Using these standards means there are a number of implementations to choose from, making it possible to take advantage of advances in performance and functionality without changing the underlying code base.

E2) Predictable access to RDF/XML assertions relevant to computing with the term. To ensure predictability, advertised policies are needed that let developers rely on what they will be able to retrieve.

E3) Ability to have generated HTML reused by other applications.

E4) Visibility in search results of popular search engines.

E5) Scalability as the number of terms served increases, and as the number of clients increases.

E6) Transparency. For example, an interested user should be able to see the queries that are used to collect the information that is assembled into the web presentation.

Finally, a general requirement of this work is that it be built on an open source platform. Doing so allows collaborative development, extensions, and experimental forks by others.

7.1.2 Previous work

A variety of LOD and ontology browsers have been developed. The NCBO BioPortal [189], Ontology Lookup Service (OLS) [49], Manchester Ontology Browser [190], and AmiGO [135] primarily offer views and navigation of biomedical ontologies without particular attention to operating in the framework of LOD. Ontotext’s Linked Life Data [191] and the Bio2RDF effort [66] are the projects that are closest to Ontobee. DBpedia [192] and Virtuoso’s data spaces are the exemplars of LOD most associated with the movement, and both are primarily instance oriented. DBpedia serves linked data derived from Wikipedia, and Virtuoso’s data spaces aggregate many different sources of data and present them as linked data.

The Ontology Lookup Service (OLS) provides both a web-based interface and a programmer’s API based on SOAP. The web-based search interface provides autocompletion for terms in OBO format ontologies and presents three different views: one of the term alone, one of the term in hierarchical context, and one graphical visualization of the term with either ancestors or descendants [193]. OLS differs in that it doesn’t support parsing of resources in the Web Ontology Language (OWL) and does not provide RDF/XML format data (requirement E1). Instead there is the SOAP-based API that provides access to search facilities, terms, and relations between terms. Still, the OLS is appreciated within a large community of biomedical annotators. Of note is the inclusion of a clickable graphical display of either the path from the term to the ontology root, or of children of the term.

The BioPortal provides a variety of services including textual and graphical term browsing and search [189], as well as REST-based APIs for accessing and using ontology terms, including several that deliver RDF. Textual presentation of ontology terms is only partially covered - classes are browsable, but not relations or instances. Of note, textual views do not satisfy U7 in that property names and axioms that are displayed are not hyperlinked to pages that describe the terms.
It only trivially satisfies U5, in that it omits logical axioms. This can be demonstrated by comparing the display of the OBI term OBI_0001705 in BioPortal and Ontobee [194]. Although there is an RDF service, it is not coupled with the user-oriented interface as it would be for a typical LOD browser, and it fails U3. RDF retrieved by the API call does not contain all the axioms from the original ontology, and rewriting the OWL changes some assertions on some properties to assertions on different ones, typically from the SKOS vocabulary [195]. In addition the BioPortal has recently provided an RDF triple store and SPARQL query browser [196]. However the RDF generated, similar to the service, does not satisfy U3 in that it also rewrites certain properties and does not provide a way to get all the assertions for a term. It also skolemizes blank nodes, yielding, in the case of SPARQL CONSTRUCT queries, RDF/XML that will be rejected as invalid by tools that parse OWL [197].

The Manchester Ontology Browser [190] is intended as an on-line ontology browser but not specifically as a linked data browser. It focuses on the accurate and accessible display of fully reasoned-over OWL 2 ontology content including logical axioms, but does not serve RDF for individual terms. It is notable for clear display of OWL axioms, consistent use of labels, and for presenting the reasoned-over ontology so that consequences beyond the asserted axioms can be understood by the user. The Manchester Ontology Browser is an ontology-centric interface. Access is typically via ontology IRI followed by search or navigation to a term, and IRIs displayed in the address bar are neither the term’s IRI nor persistent, though some element of persistence is offered via permalinks.

AmiGO [135] is another example of a web-based ontology browser, in this case developed by the Gene Ontology Consortium [198] in order to browse the Gene Ontology.
AmiGO has been developed over time to meet the needs of its community (typically biologists). It provides search by label, various hierarchical displays, including some generated by simple inference, and display of gene products that are annotated by the class in focus. However, with respect to the requirements above, AmiGO has several shortcomings. It does not serve RDF, and so does not operate as a linked data server. It is oriented towards the display of OBO format assertions, which do not always easily map to the OWL equivalents, which are not displayed. It is notable, as it is the best example known of an ontology browser that serves the needs of a specific community.

Ontotext’s linked life browser [199] provides a web page interface as well as the ability to retrieve RDF for a term as RDF/XML and several other RDF syntaxes. While formats are offered as links with distinct IRIs, content negotiation is also active. However httpRange-14 is not followed, leading to a failure of U3. As with the BioPortal and Bio2RDF, the standard IRIs for OBO terms, such as for terms from the Gene Ontology [191], are not used, and the assertions are adaptations of the original assertions expressed using the SKOS vocabulary. Somewhat confusing is the handling of subclass assertions. In the default web view, and in the RDF, they are simply omitted. However an option on the web page under the title “inference”, when chosen, shows the existence of skos:broaderTransitive relations in place of subClassOf. Even with this setting, RDF retrieved using the links or with content negotiation does not contain these relations, which, even in translated form, are an essential feature of the GO. Thus U2 and U5 suffer. Bare URIs are displayed when they are objects of predicates. Search is via keyword, and is uncomfortable in that a limited number of results are shown and there appears to not be any sorting for relevance.
This is exemplified by a search for “biological process”, an upper level term from the GO that nonetheless does not appear on the first few pages of search results.

Bio2RDF [66] is another effort to create a linked data view of both biomedical ontologies and databases. In what seems to be the common pattern, it issues new IRIs for existing resources and rewrites those resources according to another schema. Display of terms varies. Version 1 resources use the Pubby [200] software for term display, generally favoring IRIs over rdfs:labels for display, and with no provision for displaying OWL axioms other than as raw triples. For example, shows the target of some of the subClassOf relations as “(Unnamed RDF node)” rather than a readable OWL restriction [201]. Provenance for terms is often absent. Whereas for a GO term such as go:0032283 there is a dc:license link that can be interpreted as provenance, the term ‘scalar measurement datum’ [202] from the Information Artifact Ontology, authored natively in OWL 2, is rendered as [203]. In that rendering none of the axioms are displayed, labels are neither used nor even visible as a property value, and there is no indication of where the single triple displayed originates. RDF/XML is available via hyperlink, and by content negotiation, and matches the HTML display.

The Bio2RDF Release 2 resources are presented as web pages using OpenLink’s Virtuoso Faceted Browser. An example of term display for a term in NCBI taxonomy, ‘bos taurus’ [204] is [205]. Here the view is notable for the combination of bare IRIs used as property labels with a fixed width display.
This yields IRIs in which the middle part has been replaced with an ellipsis, for example The link from the title of the page actually resolves to the Pubby display, which is not completely concordant with the original page, contributing to further confusion.

DBpedia extracts structured information from Wikipedia and makes this information available on the Web [192]. DBpedia uses content negotiation to return RDF descriptions when accessed by Semantic Web agents and an HTML view of the same information to Web browsers. The HTML view does not provide a hierarchical tree structure. DBpedia is focused on linking instance data (mostly outside the life science domain) rather than ontology terms. The DBpedia Ontology has been developed to support data linkage and mapping within the DBpedia datasets [192].

This work addresses those issues by implementing a service that balances following Web and Semantic Web specifications with care about ontological issues, as detailed below.

7.2 Implementation

7.2.1 Overview

The Ontobee server is currently a single HP server running the Red Hat Linux operating system (Red Hat Enterprise Linux 5 server). The open source Apache HTTP Server is used. Programming is done with PHP, Java, SPARQL 1.0, and JavaScript. OpenLink Software’s Virtuoso Open-Source Edition is used as an RDF triple store. The same machine both runs the triple store and generates the documents needed to implement the web interface.

As shown in Figure 7.1, Ontobee provides access to RDF and HTML documents with information for ontology terms. RDF documents can be accessed at the published identifier for the term, implementing the “httpRange-14” recipe of responding with a 303 status code that redirects to a document [206]. To provide a user-friendly web interface for users to identify related links and detailed information, the RDF document includes a stylesheet directive [188], which the browser uses to generate HTML.
In addition, direct access to HTML is provided at a different IRI as shown below.

It is not uncommon for there to be different versions of an ontology available. This leads to a design choice regarding what happens when different ontologies loaded into Ontobee import different versions of the same ontology - should the most recent version be loaded and all imports modified to use that version? Or should each ontology and its imports be kept segregated so that different versions for different imports are supported? For now, Ontobee chooses the latter strategy, using a separate named graph for each top-level ontology it loads. While it prevents issues that could be created by “forcing” a resource to use a newer version of an imported ontology, it means that on the same server multiple versions of the same resource are available, which may be confusing for users.

7.2.2 Access to descriptions of entities referred to by term IRIs

A common method by which RDF and web pages are associated with a term is to have the term IRI accessed via a server configured to do content negotiation. In that scenario, the user agent issues a GET with the term IRI, and sets the HTTP Accept header with the MIME type desired, typically application/rdf+xml or, for the web browser, text/html. Content negotiation was forgone in favor of using the mechanism described in the resolution of httpRange-14, under the rationale that

a) Documents are in the domain of discourse of our field, and returning different documents in response to access requests for a single IRI is confusing because it becomes unclear what the IRI denotes.
For example, in working with a commonly available database such as one at NCBI, one might want to identify the class of protein properties of its instances, or the evolution of the web page that is displayed when the IRI is dereferenced in the browser, or run a processing pipeline on the information about the protein class formatted in XML. If all three are given the same IRI it is difficult to record assertions specifically about one or the other. However if each has a distinct IRI, one or the other can be referred to by using the appropriate IRI.

b) Doing so promotes predictability. Extant servers vary in their implementation of Accept header processing, commonly returning content types other than what is requested.

c) It is not always possible to easily set Accept headers in requests. For example, few web browsers offer the ability to change the Accept header. Programming APIs may or may not expose this functionality. With APIs that do allow this, documentation can be buried in details that are easy to miss.

d) httpRange-14 offers a solution that is simple and uniform.

The W3C Technical Architecture Group (TAG) resolved issue httpRange-14 by saying the following:

If an “http” resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;

If an “http” resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;

If an “http” resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.

The server architecture implements this resolution by having all HTTP access requests for entities named in an ontology result in a response status code of 303. By doing so, any potentially confusing implications that the entity is an “information resource” are avoided.
The 303 response also includes a redirect to Ontobee, which provides an RDF/XML document describing the entity. This is the middle case above in the httpRange-14 resolution.

Following the redirect, the client requests the RDF/XML document, which has its own IRI, and the server responds with a 200 status, indicating an information resource as per the httpRange-14 resolution.

Figure 7.1: Ontobee system architecture design. A query for an ontology term IRI, from a web browser or from an HTTP client of a Semantic Web or LOD application, sends a GET (ontology term IRI) request to the Ontobee server. Once a request is received, the Ontobee server issues a SPARQL query against an RDF triple store and returns an RDF document that dereferences the IRI. The RDF document is returned to the Semantic Web or LOD application (no web browser involvement). For a web browser, the browser notices the XSL stylesheet and asks for that. The XSL stylesheet returns the HTML with an XSLT wrapper. The browser applies that to the RDF, gets the HTML and renders it.

7.2.3 Use of PURLs

Within the OBO community, the common (and encouraged) practice is to create PURLs for ontology ids using the domain. This facilitates changing the servers used to respond to access requests, without having to change the IRI of the term. Where to redirect PURLs is a choice of ontology developers. While the use of Ontobee is recommended, there is also the option for custom redirection. In order to reduce the cost of administering a PURL server, the Online Computer Library Center (OCLC) granted permission to set up a Canonical Name Record (CNAME) for, making it an alias for This allows us to leverage existing infrastructure - OCLC’s PURL resolver - while providing a backup mechanism should OCLC’s server stop being available. In that case the DNS entry for could still be redirected to a different PURL server or implementation.
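The httpRange-14 resolution quoted in Section 7.2.2 can be sketched in a few lines. The following is a minimal illustration in Python, not Ontobee's implementation (which relies on Apache and PHP). The function names, the in-memory `responses` table standing in for a server, and the example.org IRIs are all hypothetical.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class ResourceKind(Enum):
    INFORMATION_RESOURCE = "information resource"
    ANY_RESOURCE = "could be any resource"
    UNKNOWN = "nature unknown"


def classify(status_code: int) -> ResourceKind:
    """Map an HTTP status code to what httpRange-14 lets a client conclude."""
    if 200 <= status_code < 300:
        return ResourceKind.INFORMATION_RESOURCE
    if status_code == 303:
        return ResourceKind.ANY_RESOURCE
    if 400 <= status_code < 500:
        return ResourceKind.UNKNOWN
    raise ValueError(f"httpRange-14 says nothing about status {status_code}")


# Hypothetical stand-in for a server: IRI -> (status code, redirect target).
Responses = Dict[str, Tuple[int, Optional[str]]]


def dereference(term_iri: str, responses: Responses) -> Tuple[ResourceKind, str]:
    """Follow at most one 303 redirect.

    Returns what may be concluded about the term IRI, together with the IRI
    that gave the final response (the describing document after a 303).
    """
    status, location = responses[term_iri]
    kind = classify(status)
    if status != 303:
        return kind, term_iri
    if location is None:
        raise ValueError("303 response without a redirect target")
    doc_status, _ = responses[location]
    if classify(doc_status) is not ResourceKind.INFORMATION_RESOURCE:
        raise ValueError("redirect target is not an information resource")
    return kind, location


# Hypothetical IRIs: the term answers 303, the describing document answers 200.
responses = {
    "http://example.org/obo/TERM_0000001": (303, "http://example.org/doc/TERM_0000001.rdf"),
    "http://example.org/doc/TERM_0000001.rdf": (200, None),
}
kind, document_iri = dereference("http://example.org/obo/TERM_0000001", responses)
```

Because the term IRI and the document IRI stay distinct, assertions about the class and assertions about the document describing it can be recorded separately, which is the rationale given in point a) of Section 7.2.2.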
7.2.4 Ontology retrieval and preprocessing

Ontobee retrieves, pre-processes, and loads ontologies into its triple store. The server currently loads OBO Foundry ontologies as well as a selection from the OBO Library [207]. These ontologies are distributed in either OBO format or in OWL. The obo2owl pipeline [208] is used to generate an OWL version of OBO ontologies according to the translation for OBO Format 1.4 [209].

A set of PHP/Java scripts retrieves the OBO library ontologies on a regular basis (currently daily). Ontobee does no OWL reasoning on the source ontologies, though some developers release a pre-reasoned version including inferred axioms.

7.2.5 Retrieval of information about a term

SPARQL was used to retrieve information from the triple store. Many of the queries are SPARQL ‘describe’ and ‘select’ functions against the RDF store. The information retrieved includes:

• Annotations on the term
• Restriction superclasses of the term, when a class is specified; equivalentClass assertions where the term is one of the equivalents
• The ancestors of the term
• The direct instances of the term
• Other terms in the ontology that reference the term in their axioms
• Other ontologies, of the ones in Ontobee, that also reference the term
• Ontology header and ontology annotations
• The SPARQL queries used for retrieval

The motivation behind using the above information should for the most part be obvious. However some items deserve mentioning. In particular, the ontology annotations are queried so that attribution information, such as Dublin Core (dc) metadata values of dc:contributor and dc:creator, can be known to a casual user who accesses only one or a handful of terms. The SPARQL queries are collected so that they can be shown to users of the web interface who wish to learn more about how to use semantic web technologies.
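Of the items listed above, the ancestors of a term are the transitive closure of its direct superclass assertions. A minimal sketch in Python, assuming a hypothetical in-memory mapping in place of the triple store. It illustrates the idea only and is not Ontobee's code, which issues SPARQL from PHP and Java; the function name and the toy hierarchy are invented for this example.

```python
from collections import deque
from typing import Dict, List


def ancestors(term: str, direct_superclasses: Dict[str, List[str]]) -> List[str]:
    """Return every ancestor of `term` once, in breadth-first order."""
    found: List[str] = []
    seen = {term}  # also guards against cycles in the hierarchy
    queue = deque([term])
    while queue:
        current = queue.popleft()
        # Each lookup here stands in for one query against the store.
        for parent in direct_superclasses.get(current, []):
            if parent not in seen:
                seen.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


# Hypothetical toy hierarchy (placeholder labels, not real ontology content).
toy_hierarchy = {
    "D": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
}
result = ancestors("D", toy_hierarchy)
```

The loop performs one lookup per visited class, so the cost of a step-by-step traversal grows with the depth and breadth of the class tree.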
A variety of choices have been made to decrease the number of queries executed, and to reduce the size and increase the readability of the resulting RDF. An example is the use of the ‘transitive’ and ‘CBD’ (i.e., Concise Bounded Description [210, 211]) options in Virtuoso SPARQL for some queries. For example, in order to retrieve the ancestors of a term, one might have to issue a number of separate queries as the class tree is traversed. Use of the ‘transitive’ option in Virtuoso’s implementation of SPARQL obviates that need. Retrieving the axioms for a term can also take a number of SPARQL select queries.

The current approach to retrieval effectively offers a compromise between expressivity and query execution time. Therefore the set of assertions about a term is not necessarily complete, but generally shows enough to be useful. For example, if the loaded ontology does not have inferred superclasses, such as those in an equivalentClass axiom, these would not be retrieved. It is expected that SPARQL implementations over OWL will improve in time, and the technology used will be reexamined periodically.

To minimize the size and increase legibility of the RDF/XML document, the raw output from SPARQL queries is reformatted. For example, automatically generated namespace prefixes (e.g., xmlns:n0pred) are rewritten using more widely used ones (e.g., xmlns:obo), and the RDF is re-rendered using the OWLAPI [150].

7.2.6 Generation of RDF and HTML outputs

Upon request for an ontology term IRI, Ontobee generates a representation of that term in RDF. The RDF is returned, in RDF/XML format, including a stylesheet directive. When the request is from a web browser agent, the stylesheet is retrieved. While this stylesheet could programmatically transform the generated RDF, it is easier to generate the HTML in a more fully featured programming language.
Therefore the HTML is generated using PHP and then encapsulated within a trivial XSLT transformation that translates the root RDF node into the resultant HTML. This HTML is subsequently used by the browser to render the page. In addition to the term IRI, these generated documents are given their own IRIs and are accessible from the server independently. Although they may not be of primary interest to a user browsing the site, this access was provided should it be useful to retrieve them directly in developing another application or in order to make assertions about them, for example in a study of how the term’s logical definition has changed over time. As a final touch, the RDF is made valid OWL by adding an ontology header and imports statements from the original ontology. Lightweight linked data clients will ignore these statements, but doing this makes it possible to open the RDF document in an OWL-aware tool and get the correct inferences should the reasoner be employed.

7.2.7 Search

A keyword-based search facility with autocompletion is available. On the client side, autocompletion is implemented using the jQuery JavaScript Library [212] and AJAX [213]. Users can make calls to the server which implements the search. Search over the entire set of ontologies is available at Each ontology term page also has a search box that is scoped over terms from the same ontology.

7.3 Results

Ontobee was initially deployed in late 2009. Near the beginning of 2012 it became the default location for the bulk of OBO ontologies listed at the OBO Foundry web site. Ontobee undergoes continuing development as suggestions are received and usage evolves. Source code for the server can be found in the subversion repository hosted by Sourceforge at, and is distributed under the Apache License, Version 2.0.
Below, the current web interface, scale, and adoption are discussed.

7.3.1 RDF

Figure 7.2 shows a portion of the RDF/XML generated for the term ‘vaccine’ from the Vaccine Ontology,, as seen when one chooses “view source” in a web browser. This file is a valid OWL-DL document - a small ontology centered around a single term - made so by the addition of the ontology header and import statement. As a result users can open an individual term IRI directly in an OWL editor such as Protégé [126], and run a reasoner. Reformatting with the OWL API provides indentation to make logical definitions easier to read. Not shown, the RDF includes further information that was judged to be of utility in a linked-data context, such as attribution information about the ontology and labels for any term used in the RDF. The precise contents of the RDF are evolving as experience grows. For example, in the future, a modularization algorithm may be used to select relevant assertions. The RDF and HTML documents are accessible separately. For the class ‘vaccine’, the IRIs would be: denoting the class, denoting the RDF/XML document, denoting the HTML document.

Figure 7.2: An Ontobee RDF output file, which is the source page of the ontology term (label: ‘vaccine’ from VO). (1) The RDF document includes an XSL stylesheet directive. (2) Part of the logical definition, in this case an equivalent class axiom. (3) Text definition.

7.3.2 Web interface

Figure 7.3 shows the same term, ‘vaccine’, in the web interface. The items and order of items have been chosen to give ample information to help the user understand the term, and to emphasize elements that contribute to reuse. rdfs:labels, when present, are used for display. To encourage discovery, any visual element that is an ontology term is a link to the page for that term.
Below is a description of the elements of the page and the motivation for their placement and inclusion.

At the top there are the type of term, its definition, and the term IRI, bolded to emphasize that it is what the user should copy should they want to cite the term. All annotations on the term, such as editor notes, synonyms, and term editor, are just underneath. After the definition, these are the most understandable documentation. Next are equivalents, the strongest form of logical definition as they are both necessary and sufficient conditions. Here and elsewhere the display is formatted using a variant of Manchester Syntax in which term labels are displayed instead of IRIs. The hierarchical context gives the user more information about the term and draws their attention to more general and more specific terms that may be appropriate for their work. Direct superclasses and class axioms follow. These refine the logical definition, provide links to related terms, and serve a pedagogical role by presenting patterns that might be used for other definitions. Other terms in the ontology whose axioms make some reference to this page’s term are next. These axioms give more information about how the term can be used in practice as well as help understand the term by seeing it in use. Other ontologies within Ontobee that use the ontology term are then displayed. Reuse is encouraged, and showing that there is already use in other ontologies lets the user know where this is happening as well as navigate to the other ontology for further examples of using the term. Finally, there is an offer to show the SPARQL queries used to generate the page. Users with technical experience are encouraged to adopt SPARQL and other semantic web technologies. By exposing the queries they can learn more about the technology and try it out by themselves.

7.3.3 Search

Ontobee provides a simple textual search (Figure 7.4). On the Ontobee front page, one can query ontology terms across all ontologies.
When viewing an ontology page, or a term from an ontology, the search is across terms in that ontology and its imports. As the user types, commencing on the third character, a drop-down menu appears with terms whose label contains the string typed so far (Fig. 7.4A). Selecting a specific term in the drop-down menu leads to the web page dereferencing the ontology term (Fig. 7.4B). Alternatively one can choose “Search terms” to get a page that lists all matches, sorted in order to first show terms that start with the search string, shortest to longest, then terms that include the string, shortest to longest (Fig. 7.4C).

Figure 7.3: Ontobee HTML rendering of the VO term ‘vaccine’ as seen in Firefox.

7.3.4 Scalability

Ontobee scales well in practice. Currently it provides access to over 1,300,000 ontology terms from more than 100 ontologies without appreciable delay, including very large resources such as the NCBI Taxonomy. Table 7.1 shows a selection of these ontologies. Scalability is achieved in large part by using a triple store and SPARQL. Performance of triple stores and SPARQL is an active area of research, has been improving over the last few years, and is expected to continue to improve. Since the common SPARQL technology is used for querying and web page display, different triple store implementations can be tested as the technology or needs evolve.

7.3.5 Community adoption

Ontobee was initially prototyped with the VO and OBI ontology groups. Over the last year it has begun to serve as the default destination for IRIs of terms in ontologies listed on the OBO Foundry site.

Google Analytics data shows that the number of Ontobee daily users has steadily increased since early 2012 (Figure 7.5). During the year of 2012, there were over 10,000 unique visitors. On
average, each visitor spent nearly 4 minutes on the site and browsed about 4 pages.

A variety of projects use Ontobee as their preferred term visualization tool. Eagle-i [153] is one such project, the result of an NIH-funded effort to help scientists discover research resources. Eagle-i provides access to information about more than 50,000 resources and is deployed at a growing network of universities.

Figure 7.4: Demonstration of the search capability and the HTML rendering of the VO term ‘vaccine’ in Ontobee, as seen in the Firefox web browser. (A) On the Ontobee home page, a search of ‘vaccine’ returns a number of results from different ontologies. (B) Selecting the VO term ‘vaccine’ from the search result list navigates to the page for the VO term ‘vaccine’. (C) A click on the “Search terms” button results in the display of links for all terms with labels that match ‘vaccine’ in Ontobee.

Table 7.1: Summary of selected ontologies available in Ontobee. The number of terms includes terms defined inside the ontology as well as terms imported from other ontologies.

Ontology name                                    Number of terms
GO (Gene Ontology)                               38,562
PR (Protein Ontology)                            35,342
PATO (Phenotype Ontology)                        2,331
IAO (Information Artifact Ontology)              244
IDO (Infectious Disease Ontology)                549
OBI (Ontology for Biomedical Investigation)      3,804
CL (Cell Type Ontology)                          4,401
VO (Vaccine Ontology)                            5,348
OAE (Ontology of Adverse Events)                 2,558
NCBITaxon (NCBI organismal classification)       847,760
ERO (Eagle-i Reagent Ontology)                   3,541
CLO (Cell Line Ontology)                         38,689

Figure 7.5: Records of Ontobee daily web page visitors according to Google Analytics.

Figure 7.6: Comparison of Ontobee features vs. other tools (columns: BioPortal, OLS, Manchester Browser, AmiGO, LLD, Bio2RDF, DBPedia, Ontobee; rows: U1 All OBO, predictable; U2 Useful information; U3 Denotation clear?; U4 IRIs citable; U5 Accurate RDF; U6 Both ontology & term views; U7 Readable OWL; U8 Customized views; U9 Navigation tools; E1 Use of standards; E2 Predictable RDF assertions; E3 Re-usable HTML; E4 Search visibility; E5 Scalability; E6 Transparency; Open source)

7.3.6 Evaluation

Figure 7.6 compares the features of Ontobee and other tools. Ontobee has been designed and implemented with the requirements listed in Section 7.1.1 in mind. Ontobee provides a service with predictable behavior across the whole body of OBO library ontologies (U1). Although the httpRange-14 resolution does not strictly require redirection to RDF/XML about the entity, this has been common practice. While less ambiguity in the specifications would be desirable, narrowing the httpRange-14 approach by always returning RDF/XML about the entity is preferable. In Ontobee, useful information is displayed (U2), and what the ontology term IRIs identify is made clear (U3). The term IRI is bolded in Ontobee to suggest its usefulness for copy/paste and citation (U4). Care is taken to ensure that the RDF output is accurate and inclusive in terms of relation and axiom specifications compared to the source ontology (U5). In Ontobee, the term IRI always resolves to the latest version. While there has been a tension between the version used and a unified view, it is possible to use the SPARQL endpoint and SPARQL queries to solve this issue. Ontobee presents both ontology-centered and term-centered views well through hierarchical tree visualization (U6). Readable OWL expressions are displayed in Ontobee RDF output (U7). Ontobee does not yet allow customized views (U8), which will be considered for future Ontobee development (see Section 7.3.7). Ontobee also provides ways to aid navigation to ontology terms of interest (U9).
More methods can be developed to improve the implementation of this requirement, for example by providing links to graphics in OLS and links to where the term is used in GO.

Ontobee also follows the requirements on the engineering side. Ontobee adheres to the specifications of RDF, OWL, SPARQL, and XSLT (E1). It provides predictable access to RDF/XML assertions by accurately following the original ontology definitions (E2). Ontobee generates HTML that can be visualized and reused by other applications (E3). Allowing the customization of the HTML code will make Ontobee more powerful in this regard. Ontobee has not demonstrated good visibility in the search results of popular search engines (e.g., Google) (E4). This is due to the inability of these search engines to index RDF search results. Different approaches are being evaluated to solve this issue. Ontobee has scaled comfortably so far, as explained in Section 7.3.4 (E5). Ontobee’s performance will be monitored given a possible significant increase of users in the future. Ontobee provides all SPARQL code for each ontology term IRI display, supporting transparency and education (E6).

7.3.7 Future work

Currently Ontobee uses SPARQL 1.0 with Virtuoso extensions. Further development includes using SPARQL 1.1 and no extensions. More Ontobee features are expected to be developed. For example, the HTML rendering in Ontobee is currently not fully styled. Use of Cascading Style Sheets (CSS) would make it fully customizable so that individual ontologies could supply customized CSS. A community-specific view choice will also be offered so that projects like Neurolex [214] do not have to make new ids to have new pages.
The RDF output content, i.e., which triples are included or excluded, can be improved with suggestions from Ontobee users and developers. Besides the current RDF contents for each ontology term, there are other alternatives to explore, for example, using modularity algorithms [174, 175] to construct a self-contained ontology that includes the term. Addition of other content types is planned, such as using foaf:depiction from the Friend of a Friend (FOAF) project [215] to show images. A Wikipedia setting may also be attached for users to comment, provide feedback, and point to trackers and other resources.

Chapter 8
Representing pharmacovigilance data

8.1 Introduction

In this chapter, I describe how I developed the AERO, with the hypothesis that a standard and logically formalized representation of the Brighton Collaboration case definitions would enhance data quality and allow for automatic processing of adverse event reports. The AERO was used to encode the anaphylaxis case definition, the most complex Brighton guideline, which had previously been used in feasibility studies [15]. Several challenges that arose during implementation are detailed, such as the definition of core terms (e.g., “adverse event”), the relation between adverse events and the underlying biological entities (i.e., how does a finding of erythema relate to the physical manifestation erythema) and how to represent the assessment of an adverse event according to the Brighton guideline in a rigorous ontological framework.

8.2 Rationale for AERO and development practice

A formal and logical description of vaccine adverse events would allow their automated processing. For example, software tools must currently deal with variation in the names of symptoms. A tool based on an ontology could present only relevant items and their definitions in a checklist, making it easier both to enter and to validate data at reporting time.
As detailed in 2.2.3, the availability of the AERO, an ontology representing the Brighton guidelines, in addition to the existing human-readable format, would increase accuracy and quality of reporting. This, in turn, would facilitate further automated analyses of clinical data, potentially allowing detection of adverse events in a large population at a fraction of the time and cost currently incurred.

When developing AERO, care was taken to reuse, when possible, work done in the context of other efforts. Reusing terms from other resources allowed us to rely on the knowledge of the domain experts who curated them and to dedicate more work time to terms that needed to be created de novo. When only a few relevant terms were identified in an external ontology, these were imported using the Minimum Information to Reference an External Ontology Term (MIREOT) guideline [30]. For example, in order to define vaccine adverse events, the VO [119] term vaccination [216], defined as “administering substance in vivo that involves in adding vaccine into a host (e.g., human, mouse) in vivo with the intent to invoke a protective immune response”, is imported. That definition, in turn, uses the term administering substance in vivo [217], which OBI [83] defines. Similarly, the OGMS [132] has terms for pathological entities, diseases and diagnoses which AERO also uses as building blocks. In other cases, external ontologies have been imported as a whole: (i) the RO [78] contains a set of common relations, (ii) the IAO [125] deals with information entities and metadata, and (iii) the BFO is used as upper-level ontology. Finally, AERO is a driving effort for the Ontology of Medically Relevant Entities (OMRE) [218], to which it submits all sign and symptom definitions, as those are not specific to AERO but rather intended to be used by other efforts. These resources are commonly used by the OBO Foundry [55] ontologies, of which AERO aims to be a part.
Reusing terms from OBO Foundry ontologies, where applicable, also improves the ability to interoperate with other resources that use ontologies developed within the Foundry framework.

8.3 Guideline representation and evaluation in AERO

8.3.1 Adverse event class

Consider the following cases in which the clinician wishes to report adverse events:

• sensorineural deafness reported after measles, mumps, and rubella vaccination. This disturbance of the cochlea or auditory nerve results in hearing impairment, often loss of ability to hear high frequencies [219],
• infection such as in the case of leflunomide in treatment of arthritis [220],
• any of the dermatological adverse events observed in patients treated with etanercept [221],
• headaches reported following use of proton pump inhibitors such as lansoprazole [222],
• rashes, extremely common for example at the injection site.

These cases indicate that the type of an adverse event can be either of BFO’s upper level classes, occurrent or continuant. OGMS currently defines sign as “A quality of a patient, a material entity that is part of a patient, or a processual entity that a patient participates in, any one of which is observed in a physical examination and is deemed by the clinician to be of clinical significance.” and symptom as “A quality of a patient that is observed by the patient or a processual entity experienced by the patient, either of which is hypothesized by the patient to be a realization of a disease.”. Those classes are siblings of the bfo:continuant and bfo:occurrent classes, directly asserted under bfo:entity.
Adverse events clearly match those definitions: they can be a quality of the patient (for example, pallor or cyanosis), a material entity that is part of the patient (e.g., rash), or a processual entity that parts of a patient participate in (e.g., seizure).

Following this, aero:adverse event is logically defined as the union of aero:adverse event process and aero:disorder resulting from an adverse event process (i.e., the adverse event continuant described above). An aero:adverse event process is “a processual entity occurring in a predetermined time frame following administration of a compound or usage of a device”; this can be logically translated as (using the Manchester OWL syntax [110]):

    Class: 'adverse event process'
        EquivalentTo:
            processual_entity
            and (preceded_by some
                ('adding a material entity into a target'
                 or 'administering substance in vivo'))

where the classes adding a material entity into a target and administering substance in vivo are imported from the OBI [161]. The AERO definition of adverse event process is meant to be inclusive, and to cover cases such as those described by the Manufacturer and User Facility Device Experience (MAUDE); for example, the case of a patient fitted with bioprosthetic heart valves who dies within the following 4 months²². It is also worth noting that this definition of adverse event does not imply causation between the sign observed and the compound administration/device utilization, but is rather based on temporal association.

The adverse event continuant hierarchy was built under the ogms:disorder class (Figure 8.1), which is defined as “A material entity which is clinically abnormal and part of an extended organism. Disorders are the physical basis of disease.” To avoid the language ambiguity of associating the terms event and continuant in the label of the class adverse event continuant, it was renamed disorder resulting from an adverse event process.
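The temporal, non-causal reading of adverse event process given above can be illustrated procedurally. The following is a minimal Python sketch, not part of AERO: the record types, the function name and the explicit `window` parameter are assumptions introduced here (the OWL expression itself only encodes precedence, while the textual definition mentions a predetermined time frame).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Administration:
    """Administration of a compound or use of a device (hypothetical record)."""
    label: str
    when: datetime


@dataclass(frozen=True)
class ObservedProcess:
    """A process observed in the patient, e.g. a seizure (hypothetical record)."""
    label: str
    when: datetime


def is_adverse_event_process(process, administrations, window):
    """True if the process follows some administration within `window`.

    Only temporal association is tested; no causal claim is made.
    """
    return any(
        timedelta(0) < process.when - adm.when <= window
        for adm in administrations
    )


vaccination = Administration("vaccination", datetime(2010, 1, 5, 9, 0))
seizure = ObservedProcess("seizure", datetime(2010, 1, 5, 15, 0))
print(is_adverse_event_process(seizure, [vaccination], timedelta(days=2)))
```

A process observed before the administration, or after the window has elapsed, is not classified as an adverse event process under this sketch, whatever its clinical nature.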
Terms in use by clinicians and ontological usage in the context of the OBO Foundry can conflict: it may be confusing to associate the word “event” with a hierarchical position under continuant. As a general way of overcoming this potential issue, the OBO Foundry unique label IAO annotation property is used. Classes such as adverse event rash (EquivalentTo: disorder resulting from an adverse event process and rash) will therefore have an OBO Foundry unique label annotation with value “rash resulting from an adverse event process”.

Figure 8.1: The disorder hierarchy as built in AERO, under the ogms:disorder class. The class adverse event rash is logically defined as the intersection of disorder resulting from an adverse event process and rash.

8.3.2 Application of the guidelines

As discussed above, AERO is developed with extensive use of existing OBO resources. Some orientation is needed before presenting how I represent and compute with guidelines. Figure 8.2 depicts the representation of a patient examination, a typical way in which a set of findings is collected in post-licensing signal detection work. The process representation is from OBI. A patient examination is a planned process with (at least) three participants: the patient being examined, the clinician doing the examination, and the collection of findings created as a result. The class clinical finding comprises information entities that are about medically relevant entities: material entities, qualities, processes and dispositions that are typically localized in an anatomical system or region. Medically relevant entities are to be considered a generalization of symptoms or conditions and are directly related to the patient or part of the patient. For example, the entity omre:low blood pressure [223] is localized to the cardiovascular system.
Clinical findings relate to the medically relevant entities and to the body systems using subproperties of iao:is about [224], a general relation between information and things in the world.

The collection of findings produced in the examination is an exam report. However, generally speaking, a distinction between reports, findings and diagnoses is not made. Each can have compositional structure, with parts related using aero:has component [225], and each is information about a patient that involves observation and judgment. A convenience relation aero:found to exhibit [226] is defined that relates a patient to findings about them. It is common to say that the patient has some finding, but there is no essential relation from patient to finding. On the other hand, the finding is dependent on the patient.

Figure 8.2: Entities represented in patient examination and recording of findings. During an obi:planned process (red surrounding box) a clinician examines a patient (the specified input) and produces a report which is a set of ogms:clinical findings (the specified output). Each finding iao:is about a medically relevant entity, mre (here a rash or low blood pressure), as well as the proximate anatomical system or part (here the skin or cardiovascular system). The report is a set of findings, each related to the report by the aero:has component relation.
In signal detection it is the exam report or a derivative of it that is the primary input to analysis.

8.3.3 Guidelines

Although there are a variety of kinds of clinical guidelines, the focus of the AERO is guidelines that are diagnostic in the sense that they provide, essentially, a recipe for taking a set of findings in the adverse event report and determining whether some specific medically relevant entity is implied to exist. In the case of the Brighton guidelines the assessment also quantifies how certain one should be about whether the entity exists, by defining for example Level 1, 2 and 3 of certainty for the adverse events. A recipe is represented as an information entity, an iao:directive information entity [227]. The recipe and the Brighton case definition are related to the process of assessment by a composition of relations defined in IAO and BFO. As an information entity, the recipe is the sort of thing that can have many “copies”. Each copy is represented as connected to the case definition using the iao:is concretization of relation. Such a directive information entity is meant to be acted on, to be a representation of a plan, like a recipe. Plans (however they happen to be embodied) are represented using bfo:realizable entity [228], which connects the plan to a process in which the plan is carried out. The relation between the bfo:realizable entity and the process, should it occur, is called is realized by.

The current implementation accomplishes this classification by defining classes that correspond to the criteria by which each of the possibilities is determined. For example, Brighton gives a set of conditions which, if met, provide the strongest evidence that a case of anaphylaxis has occurred: aero:level 1 of certainty of anaphylaxis according to Brighton [229].
aero:level 1 of certainty of anaphylaxis according to Brighton is given a complete logical definition which is the expression encoding the criteria depicted in the lower middle of the figure. If the report has a set of finding components which together satisfy this class, then the report is classified as aero:level 1 of certainty of anaphylaxis according to Brighton.

The main classes used in the representation of guidelines are:

1. The ogms:clinical finding class [230]. A clinical finding is defined as “A representation that is either the output of a clinical history taking or a physical examination or an image finding, or some combination thereof.” It takes into account historical information such as that gathered from the patient’s medical records, as well as results of assays such as blood tests or observations made about the patient by the physician. Clinical findings can themselves be diagnoses, allowing nesting of criteria as shown below in the case of uncompensated shock. A new relation, found to exhibit, has been created to link the patient to the clinical findings, such as “patient found to exhibit some nausea finding”. It can also be used to link anatomical entities, such as a heart, to associated findings, such as “malfunctioning heart valve”, allowing for diagnosis at multiple levels of granularity. In AERO, diagnoses are types of findings: it is often the case that the output of a diagnostic process is used to support further reasoning, such as when a physician establishes a diagnosis of respiratory distress based on a difficulty breathing finding in a first step, and then relies on that respiratory distress diagnosis to infer, in conjunction with other findings, a diagnosis of anaphylaxis.

2. The classes of medically relevant entities from the OMRE. Those classes are of type pathological entities or formation as defined by the OGMS, as well as some processes.
Signs and symptoms are separated from the assessment made of them, so that they can be considered significant or not according to the specific guideline being used. For example, depending on the guideline considered, an increase in temperature will be considered a fever only if the temperature is above 37.8 °C (for example, in older adult residents [231]) or 38.3 °C for neutropenic patients [232].

3. The anatomical entities which exhibit those findings. For example, chest tightness finding [233] involves the respiratory system, while a measured hypotension finding [234] involves the cardiovascular system. AERO doesn’t define anatomical entities; rather they are imported from Uberon [235].

8.3.4 Anaphylaxis representation

In the AERO, the Brighton case definition for the anaphylaxis level 1 of certainty is modeled as an equivalent class as shown in Figure 8.3. It has as components the different findings, grouped in sets according to their importance in the establishment of the diagnosis. For example, the major cardiovascular criteria set for anaphylaxis according to Brighton is the disjoint union of a clinical diagnosis of uncompensated shock and a measured hypotension finding. A clinical diagnosis of uncompensated shock is a clinical finding, but also a diagnosis established based on the presence of 3 or more uncompensated shock signs, but at most one of each type, as shown in the Manchester syntax [110]:

    Class: 'clinical diagnosis of uncompensated shock'
        EquivalentTo:
            ('has component' min 3 'uncompensated shock sign finding')
            and ('has component' max 1 'tachycardia finding')
            and ('has component' max 1 'capillary refill time > 3s finding')
            and ('has component' max 1 'reduced central pulse volume finding')
            and ('has component' max 1 'decreased level of consciousness
                 or loss of consciousness finding')
        SubClassOf:
            'clinical finding'

Figure 8.3: Details of the implementation of the level 1 of anaphylaxis according to Brighton. Sets of criteria are modeled as disjoint union classes, representing each of the findings that should be assessed by the physician. [Diagram. Level 1 of certainty of anaphylaxis according to Brighton: has component some major dermatological criterion for anaphylaxis according to Brighton AND has component some major cardiovascular criterion for anaphylaxis according to Brighton OR major respiratory criterion for anaphylaxis according to Brighton. Major dermatological criteria set (DisjointUnionOf): generalized urticaria or generalized erythema finding, angioedema finding, generalized pruritus with skin rash finding. Major cardiovascular criteria set (DisjointUnionOf): clinical diagnosis of uncompensated shock, measured hypotension finding. Major respiratory criteria set (DisjointUnionOf): respiratory distress diagnosis, bilateral wheeze finding, stridor finding, upper airway swelling finding. Each finding is a clinical finding and is_about some entity, the corresponding disorder or process.]

8.4 The has component relation

A relation, has component [225], was defined to relate clinical findings with the signs and symptoms that compose them. has component is a sub property of has part [236], which could not be used due to limitations on the use of non-simple properties in cardinality restrictions²³.
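To make the cardinality restrictions placed on has component concrete, the definition of clinical diagnosis of uncompensated shock can be mimicked procedurally. The Python sketch below is an illustration only and rests on assumptions not stated in the text: findings are reduced to string labels, the four listed finding types are taken to be the only kinds of uncompensated shock sign finding, and the list of components is read as complete (a closed-world reading, whereas OWL reasoning is open-world). The function name is hypothetical.

```python
from collections import Counter

# Finding types named in the class definition of
# 'clinical diagnosis of uncompensated shock' (labels used as plain strings).
SHOCK_SIGN_TYPES = {
    "tachycardia finding",
    "capillary refill time > 3s finding",
    "reduced central pulse volume finding",
    "decreased level of consciousness or loss of consciousness finding",
}


def is_uncompensated_shock(components):
    """Check the 'min 3' / 'max 1' pattern over a list of finding labels."""
    counts = Counter(c for c in components if c in SHOCK_SIGN_TYPES)
    at_least_three_signs = sum(counts.values()) >= 3
    at_most_one_of_each = all(n <= 1 for n in counts.values())
    return at_least_three_signs and at_most_one_of_each


report = [
    "tachycardia finding",
    "capillary refill time > 3s finding",
    "reduced central pulse volume finding",
    "nausea finding",  # unrelated findings are ignored
]
print(is_uncompensated_shock(report))
```

Two tachycardia findings plus one other sign reach a count of three but violate the max 1 restriction, so they do not satisfy the definition; this is the situation the combined min and max restrictions are meant to exclude.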
Additionally, using has part didn’t seem accurate; the clinical diagnosis of uncompensated shock doesn’t have a tachycardia finding as a part, rather the finding is a component of the diagnosis.

8.5 The WHO severe malaria guideline representation

Figure 8.4: Implementation of the WHO severe malaria guideline. (A) Detailed representation of the WHO severe malaria criteria as the union of various criteria specified in the WHO guideline. (B) Example of classification: a diagnosis of severe malaria is inferred for patient1 based on laboratory data according to the WHO guideline.

The WHO divides malaria into two categories, severe malaria and mild (or uncomplicated) malaria. Severe malaria is a life-threatening form of the disease requiring immediate hospital care, and therefore correct classification of malaria is critical for appropriate patient treatment. The WHO specifies a list of criteria for severe malaria diagnosis [237]. The criteria include severe anemia, hyperparasitemia, hyperlactatemia, hypoglycemia, and over ten other different signs or symptoms. Severe malaria is diagnosed when any of the criteria are present. Otherwise, the diagnosis is considered to be mild malaria. Most of the symptoms/signs are determined through laboratory tests and specified in the WHO guideline. For example, severe anemia is determined according to laboratory (or assay) results, hematocrit < 15% or hemoglobin < 5 g/dL, and a plasma lactate level greater than 5 mmol/l means hyperlactatemia [237].

The WHO severe malaria guideline is not as complex as the Brighton guideline as it does not need to relate symptoms and signs to specific anatomical systems. The severe malaria guideline does define the assessment of symptoms and signs based on laboratory measurement data, in keeping with the approach described in the “Guideline representation in AERO” section but without a detailed implementation component. The iao:scalar measurement datum class [238] is used to represent measurement data to facilitate the diagnosis process.
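The thresholds quoted above, together with the rule that any single criterion suffices, can be sketched outside OWL as follows. This is a minimal Python illustration, not the AERO implementation: the `ScalarMeasurement` type, the function names and the string labels are assumptions introduced here, only the two criteria whose thresholds are quoted in the text are covered, and the input is assumed to come from a patient already known to have malaria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarMeasurement:
    """A numeric value plus a unit label (hypothetical stand-in for a
    scalar measurement datum)."""
    kind: str
    value: float
    unit: str


def severe_malaria_criteria(measurements):
    """Return the names of the quoted criteria met by the measurements."""
    met = set()
    for m in measurements:
        if m.kind == "hematocrit" and m.unit == "%" and m.value < 15:
            met.add("severe anemia")
        elif m.kind == "hemoglobin" and m.unit == "g/dL" and m.value < 5:
            met.add("severe anemia")
        elif m.kind == "plasma lactate" and m.unit == "mmol/l" and m.value > 5:
            met.add("hyperlactatemia")
    return met


def classify_malaria(measurements):
    """Severe if any criterion is met, otherwise mild (uncomplicated)."""
    if severe_malaria_criteria(measurements):
        return "severe malaria"
    return "mild malaria"


labs = [
    ScalarMeasurement("hematocrit", 12.0, "%"),
    ScalarMeasurement("plasma lactate", 3.1, "mmol/l"),
]
print(classify_malaria(labs), sorted(severe_malaria_criteria(labs)))
```

Keeping the unit next to the value means a measurement reported in an unexpected unit matches no criterion instead of being silently compared against the wrong threshold.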
A scalar measurement datum is defined as “a measurement datum that is composed of two parts, numerals and a unit label”. For example, hematocrit 17% can be logically represented as:

    'has measurement unit label' 'volume percentage'
    'has measurement value' "17"^^decimal
    instance of 'hematocrit measurement datum' subClassOf 'scalar measurement datum'

The diagnosis pipeline for severe malaria is similar to the assessment of anaphylaxis level 1 according to the Brighton guideline. Applying the pattern developed in AERO, severe malaria is modeled as the union of the different criteria specified by the WHO. The formal and logical representation of the severe malaria diagnosis and some related criteria using Manchester syntax is shown in the top part of Figure 8.4. It was tested against the laboratory results and clinician’s diagnoses published by Krupka, et al. [239].

8.6 Results

The pattern developed in AERO allows for automated classification of patients based on the set of signs and symptoms they present, and the associated clinical findings assessed by their physician in compliance with a selected guideline, as shown in Figure 8.2. In that example, the signs and symptoms assessed by the physician during a patient examination yield findings of type generalized urticaria finding and measured hypotension finding. These two clinical findings can then be inferred to be of type major dermatological criterion for anaphylaxis according to Brighton and major cardiovascular criterion for anaphylaxis according to Brighton, respectively. A diagnosis of level 1 of anaphylaxis is reached as they match the Brighton case definition for the components required.

Krupka, et al. [239] provided selected clinical findings and laboratory data of five patients associated with different malaria status. Using the WHO guideline, those patients were manually diagnosed as severe malaria and four of them with severe anemia before treatment.
The automatic diagnostic classification results obtained from the implementation shown in Section 8.5 are consistent with the manual assessment. The bottom part of Figure 8.4 shows the detailed implementation of selected laboratory results associated with patient1 at the first visit. Based on clinical findings, the patient is classified as severe malaria according to the WHO criteria.

8.7 Discussion

It is critical in health care in general, and in analysis of adverse events in particular, to be able to store medical data as well as the guideline that was used to assess it. Gagnon et al. [240] demonstrate that the number of anaphylaxis cases after injection of the adjuvanted H1N1 pandemic vaccine varies depending on the guideline considered. The National Institute of Allergy and Infectious Diseases/Food Allergy and Anaphylaxis Network (NIAID/FAAN) considers that reduced blood pressure is enough to diagnose anaphylaxis after exposure to allergens [241], while two or more organ systems need to be involved as per Brighton. During vaccination, a decrease in blood pressure is frequently caused by fear of the syringe or the vaccine, and may lead to false positives when diagnosed with the NIAID/FAAN guideline.

Knowing which guideline was used to establish a diagnosis is therefore important in order to weigh cases as more or less important depending on their evidence, and on whether or not they support detection of a safety signal and further actions by health authorities. An additional possible contribution is to allow for various versions of the same guidelines to be encoded. Different changes, such as scientific research progress, may warrant guideline updates [242], and the representation needs, at a minimum, to accommodate their co-existence. Ideally, the versions could be partly reconciled, facilitating migration of data encoded in the previous version to the newer one.
8.8 Conclusion

These results demonstrate that the pattern defined in AERO is applicable to the automated classification of AEFI according to the Brighton guidelines. It can be implemented in other applications, such as automatic malaria classification based on the WHO severe malaria guideline. The latter illustrates the potential to generalize the AERO diagnosis guideline pattern to the formal and logical description of various diagnosis guidelines and to facilitate automated disease diagnosis and validation. A standard representation of diagnosis criteria and clinical guidelines allows one to unambiguously refer to a set of carefully defined signs and symptoms at the time of data entry, as well as to choose an overall diagnosis that retains provenance links to its source, definition, and associated signs and symptoms. Such a diagnosis is formally expressed, making it amenable to further querying for statistical analysis and other applications, and it supports queries at different levels of specificity. Finally, cases encoded according to different guidelines may be reconciled; for example, based on their respective definitions, all cases of anaphylaxis according to the Brighton guidelines are also cases of anaphylaxis as per the NIAID/FAAN guideline (while the reverse is not true).

Chapter 9
Automated adverse events classification

9.1 Introduction

In this chapter, I apply the pattern developed in AERO, and described in Chapter 8, to large report collections from current reporting systems to allow those reports to be classified according to the Brighton criteria. This, in turn, will help identify potential cases on which human review should be focused and decrease cost and time by reducing manual evaluation.

Currently, efficient analysis of adverse event reports is a time-consuming task, requiring qualified medical personnel.
For example, a team of 12 medical officers worked for over three months to review 6,000 post-H1N1 vaccination reports for positive cases, only a fraction of the total number of reports received [243]. Ideally, enabling automatic case classification from specialized reporting systems such as the VAERS [244] used in the United States and the CAEFISS implemented in Canada would allow analysts to confirm or discard diagnoses made by physicians and identify additional probable cases for further investigation. However, both those datasets are imperfect. While the Brighton guidelines have been adopted as standard by PHAC, their usage in practice is scarce. They are not implemented in the reporting pipeline, except for a partial implementation in the form of check boxes in a PDF form [245]. In practice, the free text part of the reports is manually annotated, and part of the reports is then reviewed by medical experts. Additionally, both VAERS and CAEFISS currently rely on MedDRA to encode adverse events data. This section describes how, using a mapping to convert MedDRA codes to AERO annotations, I was able to process the existing MedDRA annotations on the data and infer whether a Brighton criterion has been met or not, as shown in Figure 9.1.

Figure 9.1: Automatic case classification according to the Brighton criteria. Classified case reports allow for signal detection and policy makers information, impacting public health.

9.2 AERO ontology

In Chapter 8 the development of the AERO was described.
Here I present how it is being used in practice to enable automation of adverse events classification, by assessing whether they correspond to the Brighton case definition criteria.

9.2.1 Assessment pipeline

Figure 9.2 shows how the various entities are related in AERO to form a diagnosis pipeline for the assessment of anaphylaxis according to the Brighton guideline. The patient examination by the physician results in a set of clinical findings that are part of a report (upper left). The report findings are input to a process of diagnosis which uses the case definition. The case definition is concretized as the plan to use the guidelines in a process of diagnosis, and this process realizes the plan (shown in the figure as manifests as). The case definition includes different criteria concerning the findings, each of which, when satisfied, yields some assessment of the certainty of anaphylaxis being present. For example, the lower middle stack represents the criteria for diagnosing a level one of certainty according to the Brighton anaphylaxis guideline. When findings in the report together satisfy these criteria, the output of the diagnostic process is a determination of the level 1 of diagnostic certainty of anaphylaxis according to Brighton.

Figure 9.3 gives two extracts from the class hierarchy related to terms in the figure. It reads: Every ‘level 1 of diagnostic certainty of anaphylaxis according to Brighton’ is a ‘Brighton diagnosis of anaphylaxis as an AEFI’, which in turn is a ‘Brighton diagnosis’, itself a ‘clinical finding’. Every ‘Brighton case definition of anaphylaxis as an AEFI’ is a ‘Brighton case definition’, which in turn is a ‘diagnosis guideline’.

Figure 9.2: The elements of an assessment of anaphylaxis according to Brighton as implemented in AERO. Performing a diagnosis involves assessing a number of criteria, each (e.g., lower middle box) implemented as a class expression that classifies a set of findings.
The diagnosis of Level 1 of certainty of anaphylaxis is made by the clinician if the written criteria apply, and by the OWL implementation if the class expression subsumes the set of findings, shown in the illustration as a Clinical Report.

9.3 VAERS dataset

The VAERS [26] is a post-market passive surveillance system, under joint authority of the Centers for Disease Control and Prevention (CDC) and the US Food and Drug Administration (FDA). It provides self-reporting tools for individuals and health practitioners, and its datasets are publicly available. VAERS reports are semi-structured. A free text field contains the report notes, and another field contains a list of MedDRA terms that correspond to the report.

Figure 9.3: Class hierarchy excerpt in the AERO. Every ‘level 1 of diagnostic certainty of anaphylaxis according to Brighton’ is a ‘Brighton diagnosis of anaphylaxis as an AEFI’, which in turn is a ‘Brighton diagnosis’, itself a ‘clinical finding’. Every ‘Brighton case definition of anaphylaxis as an AEFI’ is a ‘Brighton case definition’, which in turn is a ‘diagnosis guideline’.

The dataset described in [243] was obtained through a series of Freedom of Information Act (FOIA) requests. It consists of 6034 reports received from the end of 2009 through early 2010, all following H1N1 vaccination, after the FDA was alerted of a possible anaphylaxis safety signal by the PHAC. However, data surrounding the 100 confirmed anaphylaxis cases in the original report were unobtainable as they were deemed lost. All reports in this set were evaluated by specialists and so provide a gold standard for comparison.
A series of FOIA requests was also used to obtain the dataset describing classification results on the same dataset using the ABC tool, the Standardized MedDRA Queries (SMQs) as well as a custom information retrieval method [246]. However, details of the original analysis approach necessary for reproducing the original results were not made available, and I could only hypothesize about the cause of results that were not in concordance with the original publication.

To demonstrate that the AERO can be used to effectively encode a logical formalization of the Brighton guidelines, the output of classification using the ABC tool was compared with the results of the classification using the ontology.

9.4 Data loading and processing

To streamline the analysis process, Python was used to perform the following steps, semi-automatically:

1. Load the VAERS reports into MySQL. The VAERS data was provided as a set of Excel spreadsheets, and MedDRA is distributed as ASCII files and a corresponding database schema. Both were loaded into a relational database for easier processing.
2. Apply the mapping ([246], Electronic Supplementary Material, Appendix 3) from the existing MedDRA annotations to the Brighton terms. Each MedDRA ID was mapped to the corresponding AERO ID, and a mapping table was created in the database.
3. Partition the dataset. As working with the complete dataset in OWL was neither efficient nor necessary, the data was partitioned into smaller files as follows.
4. Export the dataset into a series of RDF files and perform pre-processing. For each report (i.e., each VAERS ID), all information in that report was collected for RDF serialization. Next, MedDRA terms were mapped to assertions using AERO.
Because the OWL representation required more information than was available in the reports, choices had to be made before classification could proceed, specifically (1) setting some Brighton required values to true, as they cannot be encoded in the current version of MedDRA, and (2) adding negation to reports to simulate the closed world assumption made in the reports. These steps are both further explained below. Serialization was done using the FuXI framework [247], which provides a syntax for OWL [160] entities in Python that is more amenable to coding than RDF/XML.
5. Apply an OWL reasoner to classify reports. The reasoning step was performed with the HermiT reasoner [101], via the OWLAPI [150]. In series, each RDF file was loaded, the reasoner computed inferred axioms, including individual type assertions, and those axioms were recorded into another RDF file.
6. Load each of the original RDF files and associated inferred axioms, as well as AERO, into a Sesame triplestore [106]. I found it was more user friendly to use Sesame’s interface for querying.

9.5 Brighton classification results

I was able to successfully classify a subset of just over 6000 VAERS records in just over 2 h on a Mac OS X laptop with a 2.4 GHz Intel Core i5 and 8 GB of memory. The triplestore was then queried to retrieve reports in each of the Brighton case definition categories; results are shown in Table 9.1.

Table 9.1: Classification results. The first row shows the results of running the ABC tool online, as described in [246]. The second row is the initial ontology-based classification, using the same rules and with the addition of the negation for information not present in the reports. The last row is the ontology-based classification without the addition of the negation. The Level 1, 2 and 3 columns represent the existing Brighton classification categories. Level 2 updated and Level 3 updated represent the categories as they should have been encoded based on communication with the Brighton Collaboration.

                            Positive cases                                   Negative cases
                            Level 1  Level 2  Level 2  Level 3  Level 3     Insufficient  Not a   No
                                              updated           updated     evidence      case    evidence
ABC tool                    101      221      N/A      7        N/A         488           2844    2373
Ontology with negation      98       223      223      8        8           3             3078    2622
Ontology without negation   98       178      223      4        8           3             3078    2622

However, three issues were identified, relating to the annotation standard being used (such as MedDRA), the quality and availability of the information in the reporting systems (such as VAERS), and the interpretation of the guideline (such as the Brighton case definitions).

First, there are critical limits to the temporality representation in MedDRA. Temporality information is needed for causality assessment. It is a necessary (though not sufficient) condition that the temporal association be consistent with the vaccination. Temporality data is also needed for diagnosis determination (which is of interest for the classification) to represent dynamic disease conditions, such as onset, progression (rapid, chronic?) and relapsing. In the specific case of anaphylaxis, there are no MedDRA terms allowing encoding of ‘sudden onset’ and ‘rapid progression’, which are necessary conditions to reach any positive level in the Brighton classification of anaphylaxis. The strict application of the Brighton guidelines to the VAERS dataset as-is would result in a value of ‘don’t know’ for those criteria, and consequently classify all reports as negative (insufficient evidence/not a case).

Second, there are no distinctions between unknown, missing and non-applicable information in the reporting systems.
In the case of ‘generalized pruritus without skin rash’, when the report does not provide any information about ‘skin rash’, it is impossible to know whether that information is unknown (the physician did not check for presence or absence of skin rash), missing (the physician did check but the information was not recorded) or was negative and therefore not included in the report (the physician checked and did not see a skin rash, but the negative finding was not included in the report).

To remedy those two major issues, and for the purpose of research, the condition that the ‘rapid progression’ and ‘sudden onset’ criteria are not required for diagnosis was added to the VAERS dataset. The negation of those signs or symptoms that were not positively stated on each report was also added.

For example, report 369695, whose clinical findings are defined as follows (shown using the Manchester syntax [110]):

    Individual: 369695
        Types:
            ‘clinical finding’,
            ‘has component’ some ‘generalized erythema finding’,
            ‘has component’ some ‘generalized urticaria finding’,
            ‘has component’ some ‘difficulty breathing finding’

does not reach a Brighton level of diagnosis certainty.
However, with the addition of the restrictions

            not (‘has component’ some ‘bilateral wheeze finding’),
            not (‘has component’ some ‘stridor finding’)

the condition for the minor respiratory criterion is fulfilled (‘difficulty breathing without wheeze or stridor’) and the report is classified as Level 2 of certainty.

Third, when translating the Brighton guidelines into their logical form, different interpretations of the same human readable content were observed, and I conducted extensive discussion with the Brighton Collaboration to clarify the formalization due to this ambiguity.

Upon realizing that the addition of negation to the dataset would be required (which I established was also the case in [246], though this is unpublished), further enquiries were made with the Brighton Collaboration as to whether those negations were logically and clinically required or if they were added to allow human readers to distinguish between minor and major criteria. For example, in ‘pruritus with or without skin rash’, which is a major or minor criterion respectively, ‘pruritus’ ought to be enough as a minor criterion; there should be no need to require the presence of ‘no skin rash’ (which is currently required in the ABC tool). Practically, consider a report annotated with ‘Rapid progression of signs and symptoms’, ‘Sudden onset of signs and symptoms’, ‘Hypotension, measured’ and ‘Pruritus, generalized’. With the addition of ‘Skin rash: Yes’ it classifies as expected as Level 1. With the addition of ‘Skin rash: No’ it classifies as expected as Level 2. However, with the addition of ‘Skin rash: Don’t know’ it classifies as ‘insufficient level of evidence’, which is incorrect: even if it is unknown, there was either presence of skin rash or not, so this report should at a minimum classify as level 2 of diagnostic certainty. Another outcome of this work is that compound terms should be represented as an association of individual terms.
For example, ‘capillary refill time of >3s without hypotension’ should be encoded as ‘Capillary refill time > 3 sec’ and not ‘Hypotension, measured’. There are currently two entries in the ABC tool: one can either select ‘Capillary refill time > 3 sec: Yes’ and ‘Hypotension, measured: No’, or one can select ‘Capillary refill time > 3 sec, no hypotension’. While the former behaves as expected when applied to an anaphylaxis report for which ‘capillary refill time of >3s without hypotension’ is a cardiovascular criterion, the latter does not allow for correct classification. Similarly, in the pruritus case above, ‘generalized pruritus with skin rash’ should be ‘generalized pruritus’ and ‘skin rash’. This allows differentiating between a major dermatological criterion (‘generalized pruritus with skin rash’) and the corresponding minor dermatological criterion (‘generalized pruritus’ and not ‘generalized pruritus without skin rash’). By systematically reviewing and applying this to other criteria, I was able to overcome the need for the addition of negation in the dataset. This can be more or less complex depending on the number of such negated criteria in the original case definition.

Also, there exist different human interpretations of the same guideline, often linked to ambiguity in the textual representation of the criteria. For example, the case definition of anaphylaxis (described in Table 2.2) states that a level 3 of diagnostic certainty is reached when the following are observed:

• ≥1 minor cardiovascular OR respiratory criterion AND
• ≥1 minor criteria from each of ≥ 2 different systems/categories

This was interpreted as (1 minor cardiovascular OR respiratory criterion) AND 2 minors from systems that are neither respiratory nor cardiovascular (dermatologic, gastrointestinal, laboratory systems), and was translated this way in the ABC tool. However, it should have been read as “if there is a minor cardiovascular criterion, then 2 other systems need to be involved, including respiratory, dermatologic, gastrointestinal and laboratory” (and vice versa for a respiratory criterion).

Following discussion of those results with the Brighton Collaboration, an updated version of the Brighton guidelines was encoded and added to the AERO, in addition to the existing ones, to reflect those changes. Based on these changes I was able to reason over the dataset without the addition of negation, and simultaneously compare the different cases, shown in Table 9.1 under the columns ‘Level 2 updated’ and ‘Level 3 updated’ (there were no modifications to Level 1 or its associated criteria). Using the updated logical translation of the Brighton guidelines, the intended results were achieved. In the row ‘Ontology without negation’, there are 223 cases for ‘Level 2 updated’ and 8 cases for ‘Level 3 updated’.

By comparison, the original algorithm misses cases and detects only 178 cases for ‘Level 2’ and 4 cases for ‘Level 3’ (20% and 50% missed respectively). Finally, a rather large difference was observed for the category ‘Insufficient evidence’. Running the ABC tool as shown in [246], Botsis et al. found that 488 cases were classified as ‘Insufficient evidence’. However, according to the Brighton guideline, the full label for this category is ‘reported anaphylaxis with insufficient evidence’, and it is meant to identify cases for which there may have been misdiagnosis by the reporting physician, or not enough evidence according to the Brighton criteria to establish the anaphylaxis diagnosis. In the original dataset, only 12 reports were annotated with an ‘anaphylaxis’ MedDRA term (including anaphylaxis-like terms, e.g., anaphylactic reaction, anaphylactic shock). Out of those 12 reports, only 3 were lacking supporting evidence, as shown in Table 9.1, column ‘Insufficient evidence’, rows 2 and 3.
This results from the fact that the online ABC tool that was used for classification provides a ‘diagnosis confirmation’ tool, which implies that the user wants to confirm an anaphylaxis diagnosis that they established. Consequently, those 488 cases were incorrectly categorized as ‘reported anaphylaxis with insufficient evidence’.

9.6 Automated case screening

In the previous section the Brighton guidelines were translated into their logical representation, and the AERO was applied to automate classification of vaccine adverse event reports from VAERS. As shown in Table 9.2, while the resulting specificity is very high (97%), the corresponding sensitivity is fairly low (57%).

Table 9.2: Comparison of different classification methods. * indicates that the result was taken from [246] (values for the testing set). In the Brighton Collaboration section, the ABC tool and ontology-based classification have similar outputs (the small difference in terms of sensitivity can be explained as Botsis et al. split their dataset into training and testing). In the SMQ section, the expanded SMQ yields better results in terms of sensitivity and specificity compared to the existing SMQ categories and the IR approach proposed in [246]. CI: confidence interval.

                              Sensitivity (95% CI)   Specificity (95% CI)   AUC (95% CI)
Brighton Collaboration
  ABC tool*                   0.64 (0.52-0.75)       0.97 (0.96-0.98)       NA
  Ontology classification     0.57 (0.51-0.64)       0.97 (0.96-0.97)       0.77 (0.74-0.80)
  IR approach*                0.86 (0.75-0.93)       0.7861 (0.76-0.80)     NA
SMQ
  SMQ categories (combined)*  0.54 (0.42-0.66)       0.97 (0.96-0.98)       NA
  IR approach*                0.85 (0.73-0.92)       0.86 (0.84-0.87)       NA
  Expanded SMQ                0.92 (0.89-0.95)       0.88 (0.87-0.89)       0.96 (0.95-0.97)

This can however be easily understood when remembering that the Brighton guidelines were never meant for screening, but instead are reporting and diagnosis confirmation guidelines.
The guidelines themselves were designed to identify only a portion of the cases (low sensitivity) but do so extremely accurately (high specificity). Sensitivity needs to be significantly increased for the purpose of automated identification of rare adverse events. To address the issue of detecting similarity between the diagnosis text and the adverse event reports, the well-established information retrieval technique of cosine similarity [248] was used. Each document (gold standard query or report) was decomposed into its corresponding vector of terms (e.g., ‘skin rash’, ‘generalized pruritus’). The angle those vectors form can be used to measure the similarity between them: the cosine of the angle is 1.0 for identical vectors and 0.0 for orthogonal ones. Terms in the vectors were weighted using the term frequency-inverse document frequency (tf-idf) scheme, which numerically expresses the importance of each term as a function of its frequency in a given document (tf) and the inverse of its frequency across the whole dataset (idf). This method can be used to compare the vector of terms extracted from each adverse event report against the chosen gold standard, such as Brighton or MedDRA terms. In [246], the authors divide the whole dataset into training and testing subsets (details of which are unpublished), and use the training subset to identify which terms are correlated with the outcome, which they then use to classify reports in the testing set. This method leads to an 85% sensitivity and 86% specificity (Table 9.2, section SMQ, row IR approach).

Upon inspection of the MedDRA SMQ and the MedDRA terms used to annotate the reports, I realized that some terms which should presumably be highly correlated with an anaphylaxis diagnosis (e.g., “hypersensitivity”) were not included in the existing MedDRA SMQ and therefore not considered for diagnosis assessment.
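The cosine similarity and tf-idf weighting described above can be sketched as follows. This is a minimal illustration, not the implementation used in this chapter or in [246]: the term lists are hypothetical, the function names are invented for the example, and the smoothed idf formula is an assumption of this sketch.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, weighted by tf * idf."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        if not doc:
            vectors.append({})
            continue
        counts = Counter(doc)
        vectors.append({
            # Smoothed idf (an assumption of this sketch) keeps weights
            # positive even for terms present in every document.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical gold standard query and reports, each a list of terms.
query = ["skin rash", "generalized pruritus", "hypotension"]
report_a = ["skin rash", "generalized pruritus", "hypotension"]
report_b = ["injection site pain", "headache"]
report_c = ["generalized pruritus", "headache"]

q_vec, a_vec, b_vec, c_vec = tf_idf_vectors([query, report_a, report_b, report_c])
score_a = cosine_similarity(q_vec, a_vec)  # same terms as the query
score_b = cosine_similarity(q_vec, b_vec)  # no term in common
score_c = cosine_similarity(q_vec, c_vec)  # one term in common
```

A report is then flagged as a potential case when its score exceeds a chosen cut-off; how that cut-off is selected is a separate decision.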
Therefore, rather than creating a bag of words de novo based on keyword extraction from a training set of reports, I chose to expand a known, already widely implemented screening method, i.e., the MedDRA SMQs. To identify which terms correlate significantly with the outcome, the 2273 different MedDRA terms were extracted and, using the classified dataset, a contingency table was built for each term and the associated χ² and p-value computed, as shown in Table 9.3.

A significance level α of 0.05 (arbitrarily chosen) at one degree of freedom corresponds to a critical χ² value of 3.841.

Table 9.3: Contingency table per MedDRA term

                    MedDRA term x    Not MedDRA term x
Anaphylaxis         a                b
Not anaphylaxis     c                d

The 120 MedDRA terms above this threshold were selected (see Appendix G), to which the 77 terms from the existing MedDRA SMQ were added, and then duplicates were removed. The remaining 168 MedDRA terms were used to perform the cosine similarity based classification: they form the gold standard vector against which each report vector is compared. I first performed the analysis using a 50/50 training/testing data split: half the dataset (training) was used to build the MedDRA contingency tables, and classification was performed on the second half of the data (testing). The cosine similarity values obtained for each report were used to build a ROC curve, and the best threshold value was obtained using the shortest Euclidean distance between the curve and the top left corner as well as the Youden index. At the best cut-off point (r = 0.051) I obtained 92% sensitivity (86-96% at 95% CI) and 81% specificity (80-82% at 95% CI) in the testing set, with an AUC of 0.93 (0.9-0.95 at 95% CI). I then classified the whole dataset and, as shown in Figure 9.4, this expanded MedDRA SMQ significantly improves sensitivity (92% against 85% in [246]) with a slight increase in specificity (88% against 86%).
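The per-term test can be sketched as below, using the cell names a, b, c and d of Table 9.3. This is an illustration under stated assumptions: it computes the Pearson χ² statistic without continuity correction (the text does not say whether a correction was applied), and the term names and counts are hypothetical, not taken from the dataset.

```python
CHI2_CRITICAL = 3.841  # critical value for alpha = 0.05, one degree of freedom


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: anaphylaxis reports annotated with the term
    b: anaphylaxis reports without the term
    c: non-anaphylaxis reports annotated with the term
    d: non-anaphylaxis reports without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty: the test is undefined, treat as no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts per term, for illustration only.
tables = {
    "term A": (30, 206, 100, 5698),  # far more frequent among positive cases
    "term B": (2, 234, 50, 5748),    # about equally frequent in both groups
}

selected = [
    term for term, cells in tables.items()
    if chi_square_2x2(*cells) > CHI2_CRITICAL
]
```

Note that the statistic only measures the strength of the association, not its direction, so a term that is significantly rarer among positive cases would also pass the threshold and may need to be filtered separately.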
The Area Under the Curve (AUC) was also high (0.96) compared to 0.80 in Botsis et al.’s training set: using my approach, a randomly chosen positive report receives a higher similarity score than a randomly chosen negative report in almost 96% of the cases. Full classification results are shown in Table 9.2.

Figure 9.4: Cosine similarity ROC curve. ROC curve showing the sensitivity (True Positive Rate, TPR) vs. 1 - specificity (False Positive Rate, FPR) when measuring cosine similarity of the expanded MedDRA SMQ built from the existing SMQ and augmented with the terms identified as being significantly correlated with the outcome based on contingency tables. Statistics were computed using the R pROC package [249].

9.7 Discussion

These results indicate that using a logical formalization of existing guidelines helps identify missing elements in the reporting pipeline, as well as errors in the interpretation and application of the guidelines. The Brighton guidelines are also not optimally suited for case identification in the currently existing reporting systems. Although the ontological representation of the information is efficient, standardized and accurate, the guidelines were not designed for this purpose. By providing a suitable formalism and method, and encoding multiple versions of the Brighton guidelines, I demonstrated that the AERO can represent multiple guidelines, and allows for immediate comparison of classification across them. Additionally, this work suggests that relying only on the MedDRA encoded anaphylaxis (and associated synonyms such as ‘anaphylactic reaction’) in VAERS [250] may cause severe underestimation of the number of actual cases, as it was found that only 12 reports were reported as anaphylaxis in a dataset in which careful manual review identified 236 potentially positive cases.
Finally, I demonstrated that automated adverse event screening can reach high sensitivity and specificity (92% and 88% respectively for anaphylaxis) by building a specific bag of words (SMQ or guideline based) for each AEFI, based on the best query terms I identified.

9.7.1 Using an OWL-based approach

The current state of the art for automated use of the Brighton case definitions is the ABC tool; however, as shown above, it is not suitable for automated classification. My approach not only addresses the limitations of the ABC tool, but also provides an open and extensible foundation which can be incorporated into future classification tools. Despite the Brighton guidelines not being optimally suited for the screening problem in the current context, there are multiple benefits in choosing to adopt a logical formalization of the surveillance guidelines considered, detailed below. Regarding the choice of the formalism, OWL is an accepted standard for knowledge representation, and comes with a large suite of tools allowing editing, storage and, more importantly, reasoning, which is supported by various software packages [60, 101, 103, 126, 150, 160]. This work demonstrates that even complex guidelines, such as the Brighton anaphylaxis one, can be encoded using OWL 2, and successfully lead to the desired inferences.

9.7.2 Limitations of the results

The main limitation of the results is that only the reports’ annotations are analyzed. The ability to use Natural Language Processing (NLP) methods on the textual part would potentially allow further discrimination, and provide supporting evidence in decision making. Additionally, a mapping between MedDRA and Brighton was used for part of the classification pipeline. This mapping is subjective and may not be identical to the one another group would produce. Finally, while I could have worked towards increasing the sensitivity/specificity of the classification results using the AERO, I decided that this would change the purpose of the Brighton guidelines and was not desired.
However, one could imagine that a ‘Brighton screening guideline’ could be created for that purpose.

9.7.3 Formalization of the case definition

Having a formal representation of the guideline, which could be distributed alongside a manuscript, would help both prevent misinterpretation (such as that observed as a result of not taking into consideration the underlying assumption that it performs diagnosis confirmation), and enable homogenized implementation of the chosen standard in electronic systems. Several studies [251, 252] rely on the number of adverse events detected in VAERS to hypothesize whether their rate is higher than expected with a certain vaccine. It is not currently possible to compare those studies, not even in cases in which they concern the same adverse event. For example, in [253], the authors define anaphylaxis in a less restrictive way than the Brighton criteria. In [254], yet another set of criteria is used, even though the two papers share authors. In [255], the authors acknowledge that different criteria were used for anaphylaxis identification, including the Brighton criteria, but conclude that they could not use the latter as this was not compatible with existing published reports. It is critical to ensure not only that reporters use standards for reporting, but also that medical officers know which standards were used, and are able to compare different ones. This is not only crucial for VAERS, but also, and more importantly, critical to reach the goal of having an international assessment of vaccine safety [256]. Finally, several projects have recently been concerned with addressing the need for reporting guidelines, such as the CARE guidelines [257], the PROSPER Consortium guidance document [258] or the integration of guidelines into asthma electronic records [259], the latter two specifically advocating for the use of taxonomies.

9.7.4 Time gain in signal detection

The approach I developed allows for earlier identification of a safety signal indicating a high level of adverse events related to vaccination, potentially preventing further adverse events. Figure 9.5 illustrates this time gain using the ontology-based method over the manual analysis. The VAERS dataset comprises just over 6000 reports which were collected over 2 months, and required 3 months for manual analysis by 12 medical officers [246]. By contrast, those 6000 reports can be analyzed in a matter of hours (Section 9.5) using the ontology-based, automated approach; the only delay is due to the time needed to accumulate enough reports for analysis. As a result, in this case, the time gain would be at least a month during the flu season, which could translate into earlier detection of a safety signal, and subsequent forwarding of the information to relevant health authorities. Whether to automate the process of adverse event report analysis is a health policy decision. It can be hypothesized that increases in cost and/or number of reports (for example as more provinces adopt an electronic reporting system) are two critical factors.

Figure 9.5: Time gain using the ontology-based method. As soon as the 6000 reports in the VAERS dataset are accumulated (2 months) they can be automatically analyzed, by contrast with the manual analysis which requires 3 months for 12 medical officers.

9.7.5 Use of the ontology for reporting

Another way to improve detection of adverse events is to standardize the reporting step. Currently, reports are centralized and then annotated with MedDRA terms by specialized coders. These individuals do not see the patient, and if they deem it necessary, they need to request more detailed medical reports after the fact.
If a tool that allows unambiguous and consistent reporting of the signs and symptoms they observe were provided to the person reporting the event, at data entry time, this information could be captured within the submitted report, and subsequent complex data-mining of the reports to classify them would not be needed. Using the ontology at data entry time would provide two distinct advantages: (1) the ontology provides textual definitions for all the criteria terms and (2) the ontology can be used to enforce consistency checking at data entry time. Regarding (1), one of the requirements of my collaboration with PHAC was that the resource developed would be usable by humans as well as machines. Not only were the logical axioms derived from the Brighton case definitions encoded, but the human readable labels and textual definitions were also added, most of those provided from [15]. Regarding (2), upon development of a data capture form capturing the Brighton criteria, the ontology can be used locally to check whether conditions for the diagnosis establishment are met. For example, when a physician reports ‘anaphylaxis’, the system could automatically ask for relevant signs and symptoms and store whether they have been observed or not. This would also help with respect to capturing whether the information that is not present in the report is missing or unknown.

9.7.6 Going forward: proposed implementation

As rare adverse events are considered, there is a need to ensure all potential cases are retrieved, and to the best of my knowledge these results are the best obtained to date. I recommend a hybrid approach where both the SMQ information retrieval method and the AERO classification approach are used in parallel. The output of the high sensitivity classifier allows for extraction of a subset of the original dataset, even though there will be false positives (12.3%). Here, 5082 true negatives were rightfully discarded.
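The false positive rate quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the 6034 reports of Section 9.3 split into the 236 potentially positive cases identified by manual review (Section 9.7) and negatives for the remainder; that split is an assumption of this illustration, and the variable names are invented for it.

```python
total_reports = 6034    # reports in the VAERS dataset (Section 9.3)
gold_positives = 236    # potentially positive cases from manual review (Section 9.7)
true_negatives = 5082   # reports correctly discarded by the screening step

# Assumption of this sketch: every report that is not a gold positive is a negative.
gold_negatives = total_reports - gold_positives
false_positives = gold_negatives - true_negatives

false_positive_rate = false_positives / gold_negatives
specificity = true_negatives / gold_negatives

print(f"false positives: {false_positives}")
print(f"false positive rate: {false_positive_rate:.1%}")
print(f"specificity: {specificity:.1%}")
```

Under this assumption the false positive rate rounds to the 12.3% given in the text, and the specificity to the 88% reported for the expanded SMQ in Table 9.2.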
Where the two outputs intersect, the Brighton-confirmed cases can be subtracted from this subset, allowing curators to focus on the remaining reports. Also, a fast screening method applied when data is being sent in would allow automatic identification of potentially positive cases, at which point a more detailed form (such as the Brighton-based reporting form from PHAC) can be immediately provided to the reporter.

9.8 Conclusion

Standardizing and improving the reporting process made it possible to automate diagnosis confirmation. By allowing medical experts to prioritize reports, such a system can accelerate the identification of adverse reactions to vaccines and the response of regulatory agencies. Future reporting systems should provide a web-based interface (or a form in their electronic data capture systems) that reflects the criteria being used for case classification. This would help ensure that the information being captured is standardized and that potentially missing information can be immediately supplied through consistency checks. While this chapter provides a way of improving standardization in passive, spontaneous reporting systems such as VAERS, other avenues can be explored to improve surveillance, such as promoting active systems [260]. At a minimum, providers of guidelines should recognize issues such as those described here, and commit to providing logical representations of their work. Based on our partnership and results, the Brighton Collaboration is moving towards providing such a representation for their case definitions.

Chapter 10

Conclusion and future directions

10.1 Summary

The first part of my thesis shows how I co-developed multiple ontological resources, focusing on the OBI in Chapter 3 and the VO in Chapter 4.
Various use cases were presented, each exempli-fying application in a different domain, demonstrating that ontologies allow for unambiguous andstandardized representation of biomedical knowledge.The second part of my thesis describes collaborative development in the context of the SemanticWeb. Building large, interoperable ontological resources necessitated addressing some issues suchas enabling rapid addition of similar terms following a pre-established pattern (QTT, Chapter 5),devising common policies and guidelines (OBO ID policy and IAO metadata in Section 6.2) andgenerally supporting reuse of existing ontologies to avoid duplication of efforts and multiplicity ofURIs (MIREOT and OntoFox, described in Sections 6.3 and 6.4). Publication of those resources toimprove their visibility and make them available via the Semantic Web was realized via the Ontobee,described in Chapter 7. Finally, the last part of my thesis shows that, relying on these efforts,adverse event classification in pharmacovigilance can be improved and automated through the useof ontologies. I built the AERO to encode pharmacovigilance guidelines and data (Chapter 8) andvalidated it against a manually curated dataset, demonstrating high specificity in Chapter 9.10.2 Perspectives and future work10.2.1 Coordinated maintenance of resourcesDisappearance of online resources in the biomedical domain is a known issue [261, 262, 263].Throughout this thesis, data standards have been used to alleviate some of the concerns - for exam-ple, there is no need for integration of multiple database schemas or languages. Another deliberatechoice was to publish all codes and dataset on publicly available content management system, suchas Sourceforge [264] and Google Code [265], and rely on the OCLC PURL infrastructure to rem-15210.2. Perspectives and future workedy disappearing URLs. 
It is also anticipated that consortium development in the context of the OBO Foundry will provide community support of resources, thereby decreasing their chance of vanishing. To help address some of those issues, as well as maintain the infrastructure described in Section 6.2, a new group was established in June 2012, with the mission to streamline the OBO Foundry operations and support its coordinated maintenance. As part of this OBO Operations Committee (OBOFOC), a dedicated technical group aims at supporting the OBO global infrastructure as well as the desiderata of ontology developers. As part of this group, I authored four documents describing the details of the systems currently deployed [266, 267, 268, 269]. More effort is needed to address legacy documentation, as well as to consolidate existing infrastructure, for example by implementing backup/mirror systems in case of failure.

10.2.2 Evolution of the AERO

The AERO is open-access and publicly available. Dr Jan Bonhoeffer from the Brighton Collaboration has expressed interest in translating the BCCDs within the recently launched ADVANCE (Accelerated development of vaccine benefit-risk collaboration in Europe) network. A Brighton working group has also been created, and it is expected that it will, at a minimum, keep the different interested parties in contact with respect to application of the AERO to the remaining BCCDs. Following a meeting in Buffalo in June 2012, several ontology developers, including representatives from the OGMS, the IDO and the Ontology of Adverse Events (OAE), expressed interest in building a global infrastructure for all surveillance projects within the OBO Foundry. Other parties, such as members of the Network of Relevant Ontologies for Epidemiology consortium in charge of the Epidemiology Ontology [270], and representatives of the FDA Medical Device Safety division, were also looking forward to a common representation of medical interventions and the events that follow them.
To that aim, the Medical Surveillance Ontology (MSrv, [271]) has been created. I had extensive discussions with the developers of the OAE, and while there is agreement with respect to integration of some very high-level terms, several issues have not been addressed, and consequently the MSrv is still in the very early stages of development.

10.2.3 Implementation in reporting systems

In [15], completeness of the information recorded is identified as the limiting factor. The authors suggest that “One possible solution, which may allow any of the BCCDs to be applied, would be to educate health care providers on what specific symptoms, signs and investigations should be captured.”

Figure 10.1: Diagnosis confirmation. An automated system can help confirm diagnosis at the time of data entry, by suggesting additional criteria to disambiguate diagnoses. In this case, observation of the patient's skin color is enough information to determine whether the event reported is a seizure or a hypotonic hypo-responsive episode.
(Diagram labels: Patient; Clinician; Unresponsiveness; Hypotonia; Pallor/cyanosis; Check skin color; Seizure level 2 or Hypotonic Hyporesponsive Episode (HHE); Validated and detailed report; Data repository.)

One application of the work done within this thesis is to enable the use of an ontology-based system at the time of data entry, which will increase data accuracy and completeness. For example, when clinicians select “seizure” as the adverse event, they will be offered a list of symptoms that may have manifested. Once they select the ones they did observe, the system will be able to confirm their diagnosis, potentially refining it, for instance by assigning a level of certainty based on the Brighton case definition. The system will also be able to call the diagnosis into question, by warning that the set of events selected does not allow for unambiguous interpretation, as shown in Figure 10.1.
In the latter case, the system will also provide a list of the additional observations that would allow a determination. This will enable clinicians, at the time of data entry, to refer unambiguously to a specific set of symptoms, each carefully defined, and to establish a diagnosis that remains linked to its associated symptoms. The adverse event will also be formally expressed, making it amenable to further querying, for example for statistical analysis (“what percentage of patients presented with motor manifestations?”), at different levels of granularity (e.g., “what percentage of patients presented with tonic-clonic motor manifestations?”).

This system not only addresses the concern that not all required signs and symptoms are being reported, but could additionally check the consistency of the reported information. For example, if the health care provider makes an anaphylaxis diagnosis, the system can prompt them to record the required observations in each of the organ systems that must be involved for such a diagnosis, and also check that, taken together, they are consistent with the anaphylaxis BCCD. Enabling such interaction at the point of care would be beneficial for the reporting systems, as it may limit the number of exchanges required between the local organization and the national surveillance group, which often needs to request more information (e.g., detailed medical records) to confirm diagnoses.

Additionally, two main barriers to adoption of the anaphylaxis BCCD were identified [15]: (1) not all signs and symptoms required for application of the BCCD are reported; and (2) those signs and symptoms are not consistently described and reported. While the use of a glossary promises to address point (2), a single system that solves both issues would be preferable.
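
As an illustration of the data-entry interaction described in this section (confirming a diagnosis, flagging an ambiguous set of findings, and listing the observations that would resolve the ambiguity), the following minimal Python sketch may help. It is not the AERO implementation and does not use an OWL reasoner. The class `CaseDefinition`, the function `assess`, and the two criteria sets are hypothetical placeholders, chosen only so that skin colour is the deciding observation, as in Figure 10.1; they are not the Brighton case definitions and have no clinical validity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseDefinition:
    """Toy case definition: findings that must be present, and findings that must be absent."""
    name: str
    required: frozenset
    excluded: frozenset = frozenset()


# Placeholder criteria (assumption, for illustration only): NOT the Brighton
# case definitions. They are arranged so that skin colour decides the outcome.
DEFINITIONS = [
    CaseDefinition(
        "HHE (toy)",
        required=frozenset({"unresponsiveness", "hypotonia", "pallor_or_cyanosis"}),
    ),
    CaseDefinition(
        "seizure (toy)",
        required=frozenset({"unresponsiveness", "hypotonia"}),
        excluded=frozenset({"pallor_or_cyanosis"}),
    ),
]


def assess(definitions, present, absent):
    """Return (confirmed, still_possible, findings_to_check).

    `present` and `absent` are sets of findings the reporter has explicitly
    recorded; a finding in neither set is treated as not yet assessed.
    """
    confirmed, possible, to_check = [], [], set()
    for d in definitions:
        if d.required & absent or d.excluded & present:
            continue  # contradicted by a recorded finding
        # Findings whose status is still needed to settle this definition.
        unknown = (d.required - present) | (d.excluded - absent)
        if unknown:
            possible.append(d.name)
            to_check |= unknown
        else:
            confirmed.append(d.name)
    return confirmed, possible, sorted(to_check)


if __name__ == "__main__":
    recorded = {"unresponsiveness", "hypotonia"}
    # Ambiguous: both toy definitions remain possible, skin colour must be checked.
    print(assess(DEFINITIONS, recorded, set()))
    # Once pallor/cyanosis is recorded, only one definition remains.
    print(assess(DEFINITIONS, recorded | {"pallor_or_cyanosis"}, set()))
```

The sketch distinguishes findings recorded as present, recorded as absent, and not yet assessed, because an unassessed finding is what the form should ask about next. In an ontology-based system the same behaviour would come from the logical definitions themselves rather than from hand-written sets.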
The ontology can both support reporting of required signs and symptoms and check their consistency, and can also offer help to the user via either textual definitions (which have currently been integrated in the AERO from the PHAC glossary) or even via their logical representation. Indeed, nothing precludes nesting of ontology terms. For example, fever can be a diagnosis when the fever BCCD is applied, but it can be a sign or symptom when the Seizure BCCD is applied.

10.2.4 Application to other guidelines and other domains

Formally expressing the signs and symptoms via the AERO allows for integration of multiple perspectives in health care. In Chapter 8, I showed that the AERO can be applied to the WHO Malaria guidelines. Additionally, considering that it is unlikely that all systems worldwide will adopt the same guidelines, and that drugs are shared across countries, aggregating adverse event information internationally is critical, as was first shown in the Thalidomide case [6], which led to the establishment of the WHO Programme for International Drug Monitoring. With the AERO, users have the ability to encode their specific guideline, relying on common building blocks from the OGMS or the OMRE. A reasoner can then be used to automatically classify documents into one or several categories. For example, in the US, the CDC defines Influenza-like Illness (ILI) as “fever over 100 °F AND cough and/or sore throat”. In Canada, PHAC uses the definition “Acute onset of respiratory illness with fever and cough and with one or more of the following - sore throat, arthralgia, myalgia, or prostration which is likely due to influenza.” While both case definitions require fever, the PHAC one goes further: it requires cough (where the CDC definition also accepts sore throat alone) and at least one additional sign or symptom. Using the AERO, both guidelines can be encoded, and individual patient instances can be classified in one or more categories: all patients that are
PHAC ILI are also CDC ILI (but the reverse is not true), assuming both definitions use the same fever threshold.

10.2.5 Data integration and text-mining

Very early work has been done in linking the VAERS dataset with other resources, thus fulfilling the last of the linked data principles, “Include links to other URIs in that useful information, so that more things can be discovered.” In doing so, some issues arose, such as missing terms in the VO. Work is ongoing with VO developers to add the remaining information to their ontology. It is expected that this will enable more complex querying, such as “are there differences in the type of adverse events observed with different types of vaccines?”. A limitation of my work is that only structured annotations (the MedDRA terms) were considered. However, each VAERS report also includes a textual part, which can be more or less detailed. Some work has been done to apply Natural Language Processing methods to the analysis of adverse event reports [243]. Together with a private company, Seeker solutions, a preliminary analysis of the textual content of VAERS reports was performed, the results of which are attached in Appendix H.

While theoretically promising, many hurdles stand in the way of properly exploiting the textual part of the reports. First, the content itself varies greatly in terms of length and quality. A number of medical abbreviations are used, and some reports are filled in a foreign language (e.g., Spanish in the case of the VAERS). Also, it is unclear whether the text is comprehensive: are all the observed signs and symptoms reported? This is an issue I mentioned in Chapter 9, and one that could be overcome with better reporting methods at the time of data entry.

Finally, it would be very interesting to pursue a combination of text-mining and data integration, which would leverage the content of the reports and the power of the Semantic Web.
For example, if it were possible to extract the names of the drugs used to treat the patient, which are often mentioned in the text, they could be linked with information from DrugBank [272]. Knowing that a patient was treated with Benadryl, and via DrugBank that Benadryl is an anti-allergic agent, one could infer supportive (though weaker) evidence for potential anaphylaxis.

10.3 Conclusion

This thesis forms a coherent body of work showing how existing biomedical knowledge can be encoded using formal representations. It details several resources I contributed to, and my involvement within the OBO Foundry to support interoperability of resources and their publication on the Semantic Web. Using the pharmacovigilance domain, it demonstrates how ontologies can be used to improve standardization of knowledge, as well as automate some manual processes, such as classification of adverse event reports. Additionally, it proposes some ways ontologies could be practically implemented to improve the reporting process. Finally, this thesis achieves the goal of raising awareness in the clinical community: following my results, the Brighton Collaboration is moving towards providing an ontological representation of their existing and future guidelines. This will hopefully pave the way for other organizations to understand and rely on ontology-based applications.

Bibliography

[1] Government of Canada Panel on Research Ethics. TCPS 2, 2nd edition of Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans, Accessed Feb 2014. (Cited on page iv.)
[2] James A Singleton, Jenifer C Lloyd, Gina T Mootrey, Marcel E Salive, and Robert T Chen. An overview of the vaccine adverse event reporting system (VAERS) as a surveillance system. Vaccine, 17(22):2908–2917, 1999. (Cited on page 1.)
[3] A. Sinha, G. Hripcsak, and M. Markatou. Large datasets in biomedicine: a discussion of salient analytic issues.
Journal of the American Medical Informatics Association, 16(6):759–767, 2009. (Cited on pages 1 and 13.)
[4] World Health Organization (WHO). The importance of pharmacovigilance: safety monitoring of medicinal products. Geneva: World Health Organization, pages 1–48, 2002. (Cited on page 7.)
[5] Bara Fintel, Athena T. Samaras, and Carias Edson. The thalidomide tragedy: lessons for drug safety and regulation, 2009. (Cited on page 7.)
[6] Frances O Kelsey. Thalidomide update: regulatory aspects. Teratology, 38(3):221–226, 1988. (Cited on pages 7 and 155.)
[7] Vaccine European New Integrated Collaboration Effort, June 2011. (Cited on page 7.)
[8] VENICE project. Final Report on the Survey on AEFI Monitoring Systems in Member States, June 2011. (Cited on page 8.)
[9] US Food and Drug Administration. Adverse Event Reporting System Data, June 2011. (Cited on page 8.)
[10] US Food and Drug Administration. Vaccine Adverse Event Reporting System Data, June 2011. (Cited on pages 8 and 9.)
[11] Gary H. Merrill. The MedDRA paradox. AMIA Annual Symposium Proceedings, 2008:470–474, 2008. (Cited on page 8.)
[12] Krischer J Richesson R, Fung K. Heterogeneous but standard coding systems for adverse events: Issues in achieving interoperability between apples and oranges. Contemp Clin Trials, 29, 2008. (Cited on page 8.)
[13] P. Mozzicato. Standardised MedDRA queries: their role in signal detection. Drug Saf, 30(7):617–619, 2007. (Cited on page 8.)
[14] June Almenoff, Joseph M Tonning, A Lawrence Gould, Ana Szarfman, Manfred Hauben, Rita Ouellet-Hellstrom, Robert Ball, Ken Hornbuckle, Louisa Walsh, Chuen Yee, et al. Perspectives on the use of data mining in pharmacovigilance. Drug safety, 28(11):981–1007, 2005. (Cited on page 8.)
[15] Michael S. Gold, Jane Gidudu, Mich Erlewyn-Lajeunesse, and Barbara Law. Can the Brighton Collaboration case definitions be used to improve the quality of Adverse Event Following Immunization (AEFI) reporting?: Anaphylaxis as a case study.
Vaccine, 28(28):4487–4498, 2010. (Cited on pages 8, 14, 16, 21, 123, 150, 153, and 155.)
[16] Canada Minister of Health. Clinical Safety Data Management Definitions and Standards for Expedited Reporting, June 2011. (Cited on page 8.)
[17] Public Health Agency of Canada. User Guide: Report of Adverse Events Following Immunization (AEFI), June 2011. (Cited on pages 9 and 10.)
[18] JS Goraya and VS Virdi. Bacille Calmette-Guérin lymphadenitis. Postgraduate medical journal, 78(920):327–329, 2002. (Cited on page 9.)
[19] Lis Halkieer-Lassen. Suppurative lymphadenitis following intradermal BCG vaccination of pre-school children. Bull. Org. mond. Sante, 12:143–167, 1955. (Cited on page 9.)
[20] Paul Hengster, J Schnapka, M Fille, and G Menardi. Occurrence of suppurative lymphadenitis after a change of BCG vaccine. Archives of disease in childhood, 67(7):952–955, 1992. (Cited on page 9.)
[21] Maurice Arthus. Injections repetees de serum de cheval chez le lapin. Comptes Rendus des Seances de la Societe de Biologie et de ses Filiales, pages 817–820, 1903. (Cited on page 9.)
[22] Danuta M Skowronski, Barbara Strauss, Gaston De Serres, Diane MacDonald, Stephen A Marion, Monika Naus, David M Patrick, and Perry Kendall. Oculo-respiratory syndrome: a new influenza vaccine-associated adverse event? Clinical infectious diseases, 36(6):705–713, 2003. (Cited on page 10.)
[23] Danuta M Skowronski, Gaston De Serres, Jacques Hebert, Donald Stark, Richard Warrington, Jane Macnabb, Ramak Shadmani, Louis Rochette, Diane MacDonald, David M Patrick, and Bernard Duval. Skin testing to evaluate oculo-respiratory syndrome (ORS) associated with influenza vaccination during the 2000–2001 season. Vaccine, 20(21):2713–2719, 2002. (Cited on page 10.)
[24] Philipp J Fritsche, Arthur Helbling, and Barbara K Ballmer-Weber. Vaccine hypersensitivity–update and overview. Swiss Med Wkly, 140(17-18):238–246, 2010. (Cited on page 10.)
[25] Leslie K. Ball, Geoffrey Evans, and Ann Bostrom.
Risky business: Challenges in vaccine risk communication. Pediatrics, 101(3):453–458, 1998. (Cited on pages 11 and 13.)
[26] R.T. Chen, S.C. Rastogi, J.R. Mullen, S.W. Hayes, S.L. Cochi, J.A. Donlon, and S.G. Wassilak. The vaccine adverse event reporting system (VAERS). Vaccine, 12(6):542–550, 1994. (Cited on pages 11 and 137.)
[27] Gregory A. Poland and Robert M. Jacobson. The age-old struggle against the antivaccinationists. New England Journal of Medicine, 364(2):97–99, 2011. (Cited on page 12.)
[28] Aj Wakefield, Sh Murch, A Anthony, J Linnell, Dm Casson, M Malik, M Berelowitz, Ap Dhillon, Ma Thomson, P Harvey, A Valentine, Se Davies, and Ja Walker-Smith. RETRACTED: ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351:637–641, February 1998. (Cited on page 12.)
[29] Jason M. Glanz, David L. McClure, David J. Magid, Matthew F. Daley, Eric K. France, Daniel A. Salmon, and Simon J. Hambidge. Parental refusal of pertussis vaccination is associated with an increased risk of pertussis infection in children. Pediatrics, 123(6):1446–1451, June 2009. (Cited on page 13.)
[30] Jerry Avorn and Daniel H. Solomon. Cultural and economic factors that (mis)shape antibiotic use: The nonpharmacologic basis of therapeutics. Annals of Internal Medicine, 133(2):128–135, 2000. (Cited on page 13.)
[31] Fraser Health. Measles cluster in Fraser East, September. Accessed Sep 2013. (Cited on page 13.)
[32] CNN. U.S. measles cases in 2013 may be most in 17 years, September. Accessed Sep 2013. (Cited on page 13.)
[33] F. Varricchio, J. Iskander, F. Destefano, R. Ball, R. Pless, M.M. Braun, and R.T. Chen. Understanding vaccine safety information from the vaccine adverse event reporting system. The Pediatric infectious disease journal, 23(4):287–294, 2004. (Cited on page 13.)
[34] US Center for Disease Control and Food Drug Administration. VAERS Reporting form, Accessed Feb 2014.
(Cited on page 13.)
[35] Katrin S Kohl, Jan Bonhoeffer, M Miles Braun, Robert T Chen, Philippe Duclos, Harald Heijbel, Ulrich Heininger, and Elisabeth Loupi. The Brighton Collaboration: Creating a global standard for case definitions (and guidelines) for adverse events following immunization. AHRQ: Advances in Patient Safety. Concepts and Methodology. Rockville, AHRQ, 2:87–102, 2005. (Cited on page 14.)
[36] The Brighton Collaboration, June. Accessed Jun 2011. (Cited on pages 14 and 15.)
[37] Australasian Society of Clinical Immunology and Allergy, December 2012. (Cited on page 15.)
[38] National Institute for Health and Clinical Excellence, December 2012. (Cited on page 15.)
[39] World Allergy Organization, December 2012. (Cited on page 15.)
[40] J. Bonhoeffer, K. Kohl, R. Chen, P. Duclos, H. Heijbel, U. Heininger, T. Jefferson, and E. Loupi. The Brighton Collaboration: addressing the need for standardized case definitions of adverse events following immunization (AEFI). Vaccine, 21(3):298–302, 2002. (Cited on page 14.)
[41] The Brighton Collaboration. ABC tool (registration required), June 2010. (Cited on page 16.)
[42] Public Health Agency of Canada. User Guide: Report of Adverse Events Following Immunization (AEFI), Accessed Dec 2013. (Cited on pages 16 and 21.)
[43] Rachel Regier, Rupali Gurjar, and Roberto A Rocha. A clinical rule editor in an electronic medical record setting: development, design, and implementation. In AMIA Annual Symposium Proceedings, volume 2009, page 537. American Medical Informatics Association, 2009. (Cited on page 16.)
[44] Irena Spasic, Sophia Ananiadou, John McNaught, and Anand Kumar. Text mining and ontologies in biomedicine: making sense of raw text. Briefings in bioinformatics, 6(3):239–251, 2005. (Cited on pages 16 and 22.)
[45] Jens U. Ruggeberg, Michael S. Gold, Jose-Maria Bayas, Michael D.
Blum, Jan Bonhoeffer, Sheila Friedlander, Glacus de Souza Brito, Ulrich Heininger, Babatunde Imoukhuede, Ali Khamesipour, Michel Erlewyn-Lajeunesse, Susana Martin, Mika Makela, Patricia Nell, Vitali Pool, and Nick Simpson. Anaphylaxis: Case definition and guidelines for data collection, analysis, and presentation of immunization safety data. Vaccine, 25(31):5675–5684, 2007. (Cited on page 16.)
[46] Public Health Agency of Canada. Canadian immunization guide, Accessed Dec 2013. (Cited on page 17.)
[47] J. P. Ioannidis, D. B. Allison, C. A. Ball, I. Coulibaly, X. Cui, A. C. Culhane, M. Falchi, C. Furlanello, L. Game, G. Jurman, J. Mangion, T. Mehta, M. Nitzberg, G. P. Page, E. Petretto, and V. van Noort. Repeatability of published microarray gene expression analyses. Nature genetics, 41(2):149–155, Feb 2009. (Cited on page 21.)
[48] C. J. Penkett and J. Bahler. Navigating public microarray databases. Comparative and Functional Genomics, 5(6-7):471–479, 2004. (Cited on page 21.)
[49] R. G. Cote, P. Jones, L. Martens, R. Apweiler, and H. Hermjakob. The Ontology Lookup Service: more data and better tools for controlled vocabulary queries. Nucleic acids research, 36(Web Server issue):W372–6, Jul 1 2008. (Cited on pages 21 and 105.)
[50] James J Cimino. Desiderata for controlled medical vocabularies in the twenty-first century. Methods of information in medicine, 37(4-5):394, 1998. (Cited on pages 21 and 23.)
[51] James J Cimino. In defense of the Desiderata. Journal of biomedical informatics, 39(3):299–306, 2006. (Cited on page 22.)
[52] D. L. Rubin, N. H. Shah, and N. F. Noy. Biomedical ontologies: a functional perspective. Briefings in bioinformatics, 9(1):75–90, Jan 2008. (Cited on page 22.)
[53] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E.
Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000. (Cited on pages 22, 24, 54, and 97.)
[54] Barry Smith, Werner Ceusters, Bert Klagges, Jacob Köhler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan L Rector, and Cornelius Rosse. Relations in biomedical ontologies. Genome biology, 6(5):R46, 2005. (Cited on pages 22, 49, and 57.)
[55] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology, 25(11):1251–1255, 2007. (Cited on pages 22, 24, 25, 49, 76, and 124.)
[56] PubMed Home. (Cited on pages 22, 48, and 56.)
[57] Alan Ruttenberg, Jonathan A Rees, Matthias Samwald, and M Scott Marshall. Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in bioinformatics, 10(2):193–204, 2009. (Cited on pages 22, 23, 28, 80, and 90.)
[58] Jyotishman Pathak, Thomas M Johnson, and Christopher G Chute. Survey of modular ontology techniques and their applications in the biomedical domain. Integrated Computer-Aided Engineering, 16(3):225–242, 2009. (Cited on page 22.)
[59] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. Modular reuse of ontologies: Theory and practice. J. Artif. Intell. Res. (JAIR), 31:273–318, 2008. (Cited on page 22.)
[60] Evren Sirin, Bijan Parsia, Bernardo Cuenca Grau, Aditya Kalyanpur, and Yarden Katz. Pellet: A practical OWL-DL reasoner. Web Semantics: Science, Services and Agents on the World Wide Web, 5(2):51, 2007. (Cited on pages 23, 28, 60, 75, and 147.)
[61] Olivier Bodenreider et al.
Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform, 47:67–79, 2008. (Cited on page 23.)
[62] Judith A Blake and Carol J Bult. Beyond the data deluge: data integration and bio-ontologies. Journal of biomedical informatics, 39(3):314–320, 2006. (Cited on page 23.)
[63] Cynthia L Smith, Carroll-Ann W Goldsmith, and Janan T Eppig. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome biology, 6(1):R7, 2004. (Cited on pages 23 and 97.)
[64] Nicolas Le Novere, Benjamin Bornstein, Alexander Broicher, Mélanie Courtot, Marco Donizelli, Harish Dharuri, Lu Li, Herbert Sauro, Maria Schilstra, Bruce Shapiro, et al. Biomodels database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic acids research, 34(suppl 1):D689–D691, 2006. (Cited on page 23.)
[65] Robert Stevens, Patricia Baker, Sean Bechhofer, Gary Ng, Alex Jacoby, Norman W Paton, Carole A Goble, and Andy Brass. Tambis: transparent access to multiple bioinformatics information sources. Bioinformatics, 16(2):184–186, 2000. (Cited on page 23.)
[66] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics, 41(5):706–716, 2008. (Cited on pages 23, 28, 105, and 107.)
[67] Neurocommons sparql endpoint. (Cited on pages 23 and 86.)
[68] LOD SPARQL Endpoint. (Cited on pages 23 and 28.)
[69] Gwenaëlle Marquet, Olivier Dameron, Stephan Saikali, Jean Mosser, and Anita Burgun. Grading glioma tumors using OWL-DL and NCI Thesaurus. In AMIA Annual Symposium Proceedings, volume 2007, page 508. American Medical Informatics Association, 2007. (Cited on page 23.)
[70] Barry Smith. Ontology (science). In FOIS, pages 21–35, 2008.
(Cited on page 23.)
[71] Midori A Harris, Jennifer I Deegan, Jane Lomax, Michael Ashburner, Susan Tweedie, Seth Carbon, Suzanna Lewis, Chris Mungall, John Day-Richter, Karen Eilbeck, et al. The gene ontology project in 2008. Nucleic Acids Res, 36:D440–D444, 2008. (Cited on page 24.)
[72] P De Matos, M Ennis, M Darsow, M Guedj, K Degtyarenko, and R Apweiler. ChEBI-chemical entities of biological interest. Nucleic Acids Research, 2006. (Cited on page 24.)
[73] Patricia L Whetzel, Ryan R Brinkman, Helen C Causton, Liju Fan, Dawn Field, Jennifer Fostel, Gilberto Fragoso, Tanya Gray, Mervi Heiskanen, Tina Hernandez-Boussard, et al. Development of FuGO: an ontology for functional genomics investigations. OMICS: A journal of integrative biology, 10(2):199–204, 2006. (Cited on page 24.)
[74] Patricia L Whetzel, Helen Parkinson, Helen C Causton, Liju Fan, Jennifer Fostel, Gilberto Fragoso, Laurence Game, Mervi Heiskanen, Norman Morrison, Philippe Rocca-Serra, et al. The mged ontology: a resource for semantics-based description of microarray experiments. Bioinformatics, 22(7):866–873, 2006. (Cited on page 24.)
[75] Larisa N Soldatova and Ross D King. An ontology of scientific experiments. Journal of the Royal Society Interface, 3(11):795–803, 2006. (Cited on page 24.)
[76] Ross D King, Jem Rowland, Stephen G Oliver, Michael Young, Wayne Aubrey, Emma Byrne, Maria Liakata, Magdalena Markham, Pinar Pir, Larisa N Soldatova, et al. The automation of science. Science, 324(5923):85–89, 2009. (Cited on page 24.)
[77] Susanna-Assunta Sansone, Daniel Schober, Helen J Atherton, Oliver Fiehn, Helen Jenkins, Philippe Rocca-Serra, Denis V Rubtsov, Irena Spasic, Larisa Soldatova, Chris Taylor, et al. Metabolomics standards initiative: ontology working group work in progress. Metabolomics, 3(3):249–256, 2007. (Cited on page 24.)
[78] B. Smith, W. Ceusters, B. Klagges, J. Kohler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, and C. Rosse. Relations in biomedical ontologies.
Genome biology, 6(5):R46, 2005. (Cited on pages 25, 32, and 124.)
[79] P. Grenon, B. Smith, and L. Goldberg. Biodynamic ontology: applying BFO in the biomedical domain. Studies in health technology and informatics, 102:20–38, 2004. (Cited on pages 25, 34, 49, and 76.)
[80] Alexander C Yu. Methods in biomedical ontology. Journal of biomedical informatics, 39(3):252–266, 2006. (Cited on page 25.)
[81] Alan L Rector et al. Clinical terminology: why is it so hard? Methods of information in medicine, 38(4/5):239–252, 1999. (Cited on page 25.)
[82] Yongqun He, Lindsay Cowell, Alexander D Diehl, HL Mobley, Bjoern Peters, Alan Ruttenberg, Richard H Scheuermann, Ryan R Brinkman, Mélanie Courtot, Chris Mungall, Zuoshuang Xiang, Fang Chen, Thomas Todd, Lesley Colby, Howard Rush, Trish Whetzel, Mark A. Musen, Brian D. Athey, Gilbert S. Omenn, and Barry Smith. VO: vaccine ontology. In The 1st International Conference on Biomedical Ontology (ICBO 2009) Nature Precedings, pages 24–26, 2009. (Cited on page 25.)
[83] OBI consortium. Ontology for Biomedical Investigations (OBI), June 2011. (Cited on pages 25, 75, and 124.)
[84] Mélanie Courtot, Chris Mungall, Ryan R. Brinkman, and Alan Ruttenberg. Building the OBO Foundry - one policy at a time. In Proceedings of the International Conference on Biomedical Ontology (ICBO2011), 2011. (Cited on page 25.)
[85] The Basic Formal Ontology (BFO), Accessed Dec 2013. (Cited on pages 25 and 47.)
[86] Chris F Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology, 26(8):889–896, 2008. (Cited on page 25.)
[87] Amit P Sheth and James A Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys (CSUR), 22(3):183–236, 1990.
(Cited on page 26.)
[88] Gomer Thomas, Glenn R Thompson, Chin-Wan Chung, Edward Barkmeyer, Fred Carter, Marjorie Templeton, Stephen Fox, and Berl Hartman. Heterogeneous distributed database systems for production use. ACM Computing Surveys (CSUR), 22(3):237–266, 1990. (Cited on page 26.)
[89] Richard Hull. Managing semantic heterogeneity in databases: a theoretical prospective. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 51–61. ACM, 1997. (Cited on page 26.)
[90] Alon Y Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001. (Cited on page 26.)
[91] Dan Brickley and Ramanathan V Guha. Resource Description Framework (RDF) Schema Specification 1.0: W3C Candidate Recommendation 27 March 2000, Accessed Nov 2013. (Cited on pages 26 and 105.)
[92] World Wide Web Consortium (W3C). OWL Web Ontology Language Guide, 02/10/2004. (Cited on pages 26 and 60.)
[93] Leora Morgenstern, Chris Welty, and Harold Boley. RIF Primer. World-Wide Web Consortium, 2010. (Cited on page 26.)
[94] SPARQL Query Language for RDF. (Cited on pages 26, 29, 78, and 86.)
[95] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific american, 284(5):28–37, 2001. (Cited on page 27.)
[96] Tim Berners-Lee. Linked Data, Accessed Dec 2013. (Cited on page 27.)
[97] UniProt Consortium. The universal protein resource (UniProt). Nucleic acids research, 36(Database issue):D190–5, Jan 2008. (Cited on pages 28 and 104.)
[98] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007. (Cited on page 28.)
[99] M. Kanehisa and S. Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, Jan 1 2000. (Cited on page 28.)
[100] D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, W.
Helmberg, D. L. Kenton, O. Khovayko, D. J. Lipman, T. L. Madden, D. R. Maglott, J. Ostell, J. U. Pontius, K. D. Pruitt, G. D. Schuler, L. M. Schriml, E. Sequeira, S. T. Sherry, K. Sirotkin, G. Starchenko, T. O. Suzek, R. Tatusov, T. A. Tatusova, L. Wagner, and E. Yaschenko. Database resources of the National Center for Biotechnology Information. Nucleic acids research, 33(Database issue):D39–45, Jan 1 2005. (Cited on page 28.)
[101] Rob Shearer, Boris Motik, and Ian Horrocks. HermiT: A Highly-Efficient OWL Reasoner. In OWLED, volume 432, 2008. (Cited on pages 28, 139, and 147.)
[102] Dmitry Tsarkov and Ian Horrocks. Fact++ description logic reasoner: System description. In Automated reasoning, pages 292–297. Springer, 2006. (Cited on pages 28 and 60.)
[103] OpenLink Virtuoso: Open-Source Edition. (Cited on pages 28, 90, and 147.)
[104] Clark & Parsia. Stardog - the RDF database, Accessed Dec 2013. (Cited on page 28.)
[105] Atanas Kiryakov, Damyan Ognyanov, and Dimitar Manov. OWLIM–a pragmatic semantic repository for OWL. In Web Information Systems Engineering–WISE 2005 Workshops, pages 182–192. Springer, 2005. (Cited on page 28.)
[106] Aduna. The Sesame triplestore, August. Accessed Aug 2013. (Cited on pages 29 and 139.)
[107] Jeen Broekstra, Michel Klein, Stefan Decker, Dieter Fensel, Frank Van Harmelen, and Ian Horrocks. Enabling knowledge representation on the web by extending rdf schema. Computer networks, 39(5):609–634, 2002. (Cited on page 30.)
[108] Ronald J Brachman and Hector J Levesque. The tractability of subsumption in frame-based description languages. In AAAI, volume 84, pages 34–37, 1984. (Cited on page 30.)
[109] Ian Horrocks, Oliver Kutz, and Ulrike Sattler. The Even More Irresistible SROIQ. KR, 6:57–67, 2006. (Cited on pages 30 and 76.)
[110] M. Horridge, N. Drummond, J. Goodwin, A. Rector, R. Stevens, and H.H. Wang. The Manchester OWL Syntax.
In Bernardo Cuenca Grau, Pascal Hitzler, Conor Shankey, and Evan Wallace, editors, Proceedings of OWL Experiences and Directions Workshop (OWLED2006), 2006. (Cited on pages 30, 50, 66, 125, 129, and 141.)
[111] Mike Bergman. Thinking ‘Inside the Box’ with Description Logics, Accessed Dec 2013. (Cited on page 30.)
[112] Franz Baader. The description logic handbook: theory, implementation, and applications. Cambridge university press, 2003. (Cited on page 30.)
[113] Boris Motik, Ian Horrocks, and Ulrike Sattler. Bridging the gap between OWL and relational databases. Web Semantics: Science, Services and Agents on the World Wide Web, 7(2):74–89, 2009. (Cited on page 31.)
[114] Sean Bechhofer, Frank Van Harmelen, Jim Hendler, Ian Horrocks, Deborah L McGuinness, Peter F Patel-Schneider, Lynn Andrea Stein, et al. OWL web ontology language reference. W3C recommendation, 10:2006–01, 2004. (Cited on pages 31, 32, and 49.)
[115] Samantha Bail. Common reasons for ontology inconsistency, Accessed Dec 2013. (Cited on page 32.)
[116] Johan Lauwereyns, Katsumi Watanabe, Brian Coe, and Okihide Hikosaka. A neural correlate of response bias in monkey caudate nucleus. Nature, 418(6896):413–417, 2002. (Cited on pages 34 and 35.)
[117] Christoph Hock, Uwe Konietzko, Andreas Papassotiropoulos, Axel Wollmer, Johannes Streffer, Ruth C von Rotz, Gabriela Davey, Eva Moritz, and Roger M Nitsch. Generation of antibodies specific for β-amyloid by vaccination of patients with Alzheimer disease. Nature medicine, 8(11):1270–1275, 2002. (Cited on page 35.)
[118] William J Bug, Giorgio A Ascoli, Jeffrey S Grethe, Amarnath Gupta, Christine Fennema-Notestine, Angela R Laird, Stephen D Larson, Daniel Rubin, Gordon M Shepherd, Jessica A Turner, et al. The NIFSTD and BIRNLex vocabularies: building comprehensive ontologies for neuroscience. Neuroinformatics, 6(3):175–194, 2008. (Cited on page 35.)
[119] The Vaccine Ontology, June 2011. (Cited on pages 37, 82, and 124.)
[120] Eric W.
Sayers, Tanya Barrett, Dennis A. Benson, Stephen H. Bryant, Kathi Canese, Vyacheslav Chetvernin, Deanna M. Church, Michael DiCuccio, Ron Edgar, Scott Federhen, Michael Feolo, Lewis Y. Geer, Wolfgang Helmberg, Yuri Kapustin, David Landsman, David J. Lipman, Thomas L. Madden, Donna R. Maglott, Vadim Miller, Ilene Mizrachi, James Ostell, Kim D. Pruitt, Gregory D. Schuler, Edwin Sequeira, Stephen T. Sherry, Martin Shumway, Karl Sirotkin, Alexandre Souvorov, Grigory Starchenko, Tatiana A. Tatusova, Lukas Wagner, Eugene Yaschenko, and Jian Ye. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37(suppl 1):D5–D15, 2009. (Cited on page 38.)
[121] Zuoshuang Xiang, Thomas Todd, Kim P Ku, Bethany L Kovacic, Charles B Larson, Fang Chen, Andrew P Hodges, Yuying Tian, Elizabeth A Olenzek, Boyang Zhao, et al. VIOLIN: vaccine investigation and online information network. Nucleic acids research, 36(suppl 1):D923–D928, 2008. (Cited on pages 41, 51, and 55.)
[122] Ryan R Brinkman, Mélanie Courtot, Dirk Derom, Jennifer M Fostel, Yongqun He, Phillip Lord, James Malone, Helen Parkinson, Bjoern Peters, Philippe Rocca-Serra, et al. Modeling biomedical experimental processes with OBI. J Biomed Semantics, 1(Suppl 1):S7, 2010. (Cited on pages 44, 50, and 55.)
[123] Gerhardt G Schurig, R Martin Roop, T Bagchi, S Boyle, D Buhrman, and N Sriranganathan. Biological properties of RB51; a stable rough strain of Brucella abortus. Veterinary microbiology, 28(2):171–188, 1991. (Cited on pages 44 and 46.)
[124] M. Courtot, F. Gibson, A. L. Lister, J. Malone, D. Schober, R. R. Brinkman, and A. Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, 2011. (Cited on pages 47, 55, and 63.)
[125] The Information Artifact Ontology (IAO), June 2011. (Cited on pages 47, 49, 79, 92, and 124.)
[126] The Protégé Ontology Editor and Knowledge Acquisition System. (Cited on pages 49, 60, 74, 75, 97, 114, and 147.)
[127] M.
Musen, N. Shah, N. Noy, B. Dai, M. Dorf, N. Griffith, J. D. Buntrock, C. Jonquet, M. J. Montegut, and D. L. Rubin. Bioportal: Ontologies and data resources with the click of a mouse. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, pages 1223–1224, 2008. (Cited on page 49.)
[128] G. V. Gkoutos, E. C. Green, A. M. Mallon, J. M. Hancock, and D. Davidson. Using ontologies to describe mouse phenotypes. Genome Biol, 6(1):R8, 2005. 1465-6914 Journal Article. (Cited on pages 49, 50, 74, 75, and 97.)
[129] Cornelius Rosse and José LV Mejino Jr. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of biomedical informatics, 36(6):478–500, 2003. (Cited on pages 49 and 69.)
[130] National Institutes of Health National Center for Biotechnology Information (NCBI), National Library of Medicine. The NCBI Entrez Taxonomy Homepage. (Cited on pages 49 and 55.)
[131] IDO consortium. The Infectious Disease Ontology, Accessed Dec 2009. (Cited on pages 50 and 82.)
[132] The OGMS developers group. Ontology for General Medical Science (OGMS), Accessed Jun 2011. (Cited on pages 51 and 124.)
[133] Alan L Rector. Modularisation of domain ontologies implemented in description logics and related formalisms including OWL. In Proceedings of the 2nd international conference on Knowledge capture, pages 121–128. ACM, 2003. (Cited on page 52.)
[134] Ali M Harandi, Gwyn Davies, and Ole F Olesen. Vaccine adjuvants: scientific challenges and strategic initiatives. Expert Review of Vaccines, 2009. (Cited on page 53.)
[135] Seth Carbon, Amelia Ireland, Christopher J Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, et al. AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2):288–289, 2009. (Cited on pages 54, 105, and 107.)
[136] FLU consortium. The Influenza Ontology, Accessed Dec 2009. (Cited on pages 55 and 82.)
[137] Henry J Lowe and G Octo Barnett.
Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. JAMA: the journal of the American Medical Association, 271(14):1103–1108, 1994. (Cited on page 56.)
[138] Zuoshuang Xiang, Wenjie Zheng, and Yongqun He. BBP: Brucella genome annotation with literature mining and curation. BMC bioinformatics, 7(1):347, 2006. (Cited on page 56.)
[139] Olivier Bodenreider and Robert Stevens. Bio-ontologies: current trends and future directions. Briefings in bioinformatics, 7(3):256–274, 2006. (Cited on page 60.)
[140] Mikel Egaña Aranguren, Erick Antezana, Martin Kuiper, and Robert Stevens. Ontology design patterns for bio-ontologies: a case study on the cell cycle ontology. BMC bioinformatics, 9(Suppl 5):S1, 2008. (Cited on page 60.)
[141] Lora Aroyo, Grigoris Antoniou, Eero Hyvönen, Annette Ten Teije, Heiner Stuckenschmidt, Liliana Cabral, and Tania Tudorache. The Semantic Web: Research and Applications: 7th European Semantic Web Conference, ESW 2010, Heraklion, Crete, Greece, May 30-June 3, 2010, Proceedings, volume 2. Springer, 2010. (Cited on page 60.)
[142] Oscar Corcho, Catherine Roussey, LM Vilches-Blázquez, and Iván Perez Dominguez. Pattern-based OWL ontology debugging guidelines. In OWLED, 2009. (Cited on page 60.)
[143] Luigi Iannone, Mikel Egaña Aranguren, Alan L Rector, and Robert Stevens. Augmenting the expressivity of the ontology pre-processor language. In OWLED, volume 432, 2008. (Cited on pages 60 and 65.)
[144] The Bio Investigation Index. BII: The Bio Investigation Index, Accessed Nov 2013. (Cited on page 61.)
[145] Dawn Field, Susanna-Assunta Sansone, Amanda Collis, Tim Booth, Peter Dukes, Susan K. Gregurick, Karen Kennedy, Patrik Kolar, Eugene Kolker, Mary Maxon, Sian Millard, Alexis-Michel Mugabushaka, Nicola Perrin, Jacques E. Remacle, Karin Remington, Philippe Rocca-Serra, Chris F. Taylor, Mark Thorley, Bela Tiwari, and John Wilbanks. ’Omics Data Sharing.
Science, 326(5950):234–236, 2009. (Cited on pages 61 and 68.)
[146] Bjoern Peters and Alessandro Sette. Integrating epitope data into the emerging web of biomedical knowledge resources. Nature Reviews Immunology, 7(6):485–490, 2007. (Cited on page 61.)
[147] Burke Squires, Catherine Macken, Adolfo Garcia-Sastre, Shubhada Godbole, Jyothi Noronha, Victoria Hunt, Roger Chang, Christopher N Larsen, Ed Klem, Kevin Biersack, et al. BioHealthBase: informatics support in the elucidation of influenza virus host–pathogen interactions and virulence. Nucleic acids research, 36(suppl 1):D497–D503, 2008. (Cited on page 61.)
[148] Jay Kola. ExcelImport - co-ode-owl-plugins - Get data from a spreadsheet into your ontology, Accessed Nov 2013. (Cited on page 65.)
[149] Martin J O’Connor, Christian Halaschek-Wiener, and Mark A Musen. M2: A Language for Mapping Spreadsheets to OWL. In OWLED, 2010. (Cited on page 66.)
[150] Matthew Horridge and Sean Bechhofer. The OWLAPI: A Java API for OWL ontologies. Semantic Web, 2(1):11–21, 2011. (Cited on pages 66, 90, 113, 139, and 147.)
[151] Philippe Rocca-Serra. Quick Term Templates - OBI Ontology, Accessed Nov 2013. (Cited on page 67.)
[152] International Union of Pure and Applied Chemistry. Subcommittee on Nomenclature, Properties, and Units in Laboratory Medicine, Accessed Nov 2013. (Cited on page 67.)
[153] Eagle i consortium. eagle-i, Accessed Nov 2013. (Cited on pages 68 and 118.)
[154] E. Maguire, P. Rocca-Serra, and S. Sansone. ISA Infrastructure - isacreator, Accessed Nov 2013. (Cited on page 68.)
[155] Kirill Degtyarenko, Paula de Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic acids research, 36(suppl 1):D344–D350, 2008. (Cited on pages 69 and 76.)
[156] Georgios V Gkoutos, Paul N Schofield, and Robert Hoehndorf.
The Units Ontology: a tool for integrating units of measurement in science. Database: The Journal of Biological Databases and Curation, 2012, 2012. (Cited on pages 69 and 81.)
[157] Darren A Natale, Cecilia N Arighi, Winona C Barker, Judith Blake, Ti-Cheng Chang, Zhangzhi Hu, Hongfang Liu, Barry Smith, and Cathy H Wu. Framework for a protein ontology. BMC bioinformatics, 8(Suppl 9):S1, 2007. (Cited on page 69.)
[158] Zuoshuang Xiang, Yu Lin, and Yongqun He. Ontorat web server for automatic generation and annotations of new ontology terms. In International conference on biomedical Ontology (ICBO), 2012. (Cited on page 69.)
[159] John Day-Richter. The OBO Flat File Format Specification, version 1.2, Accessed Nov 2013. (Cited on page 71.)
[160] Web Ontology Language (OWL). (Cited on pages 71, 75, 139, and 147.)
[161] OBI Ontology, June 2011. (Cited on pages 71 and 125.)
[162] W3C. Simple Knowledge Organization System (SKOS), Accessed Nov 2013. (Cited on page 72.)
[163] Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, Accessed Nov 2013. (Cited on page 72.)
[164] Barry Smith. Beyond concepts: ontology as reality representation. In Proceedings of the third international conference on formal ontology in information systems (FOIS 2004), pages 73–84, 2004. (Cited on page 72.)
[165] Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic acids research, 32(90001):D258–D261, 01/01/2004. (Cited on pages 73 and 75.)
[166] GO consortium. GO editorial style guide, Accessed Oct 2013. (Cited on pages 73 and 76.)
[167] Martin Hepp. Goodrelations: An ontology for describing products and services offers on the web. In Aldo Gangemi and Jerome Euzenat, editors, Knowledge Engineering: Practice and Patterns, volume 5268 of Lecture Notes in Computer Science, pages 329–346. Springer Berlin / Heidelberg, 2008. (Cited on page 74.)
[168] UniProt Consortium et al.
Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic acids research, 41(D1):D43–D47, 2013. (Cited on page 74.)
[169] Eric W Sayers, Tanya Barrett, Dennis A Benson, Evan Bolton, Stephen H Bryant, Kathi Canese, Vyacheslav Chetvernin, Deanna M Church, Michael DiCuccio, Scott Federhen, et al. Database resources of the National Center for Biotechnology Information. Nucleic acids research, 39(suppl 1):D38–D51, 2011. (Cited on page 75.)
[170] C. Golbreich, S. Zhang, and O. Bodenreider. The foundational model of anatomy in OWL: Experience and perspectives. Web semantics (Online), 4(3):181–195, 2006. (Cited on page 75.)
[171] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. Ontology reuse: Better safe than sorry. Description Logics, 250, 2007. (Cited on page 75.)
[172] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. Extracting modules from ontologies: A logic-based approach. In Heiner Stuckenschmidt and Stefano Spaccapietra, editors, Ontology Modularization. Springer, 2008. (Cited on pages 75 and 85.)
[173] Melissa A Haendel, Fabian Neuhaus, David Osumi-Sutherland, Paula M Mabee, José LV Mejino Jr, Chris J Mungall, and Barry Smith. CARO–the common anatomy reference ontology. In Anatomy Ontologies for Bioinformatics, pages 327–349. Springer, 2008. (Cited on page 75.)
[174] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. Just the right amount: extracting modules from ontologies. In Proceedings of the 16th international conference on World Wide Web, pages 717–726. ACM, 2007. (Cited on pages 75 and 122.)
[175] Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Ulrike Sattler, Thomas Schneider, and Rafael Berlanga. Safe and economic re-use of ontologies: A logic-based methodology and tool support. In The Semantic Web: Research and Applications, pages 185–199. Springer, 2008. (Cited on pages 75 and 122.)
[176] Julian Seidenberg and Alan Rector.
Web ontology segmentation: analysis, classification and use. In Proceedings of the 15th international conference on World Wide Web, pages 13–22. ACM, 2006. (Cited on pages 75, 76, and 85.)
[177] OBI scripts. (Cited on page 77.)
[178] OBI consortium. SPARQL queries template file, Accessed Dec 2009. (Cited on page 79.)
[179] Science Commons. Neurocommons OBO SPARQL endpoint, Accessed Dec 2009. (Cited on page 80.)
[180] Jonathan Bard, Seung Y Rhee, and Michael Ashburner. An ontology for cell types. Genome biology, 6(2):R21, 2005. (Cited on page 80.)
[181] Dave Beckett and Brian McBride. RDF/XML syntax specification (revised), 2004. (Cited on pages 83, 86, and 105.)
[182] Zuoshuang Xiang, Mélanie Courtot, Ryan R Brinkman, Alan Ruttenberg, and Yongqun He. Ontofox: web-based support for ontology reuse. BMC research notes, 3(1):175, 2010. (Cited on page 84.)
[183] Roman Kontchakov, Luca Pulina, Ulrike Sattler, Thomas Schneider, Petra Selmer, Frank Wolter, and Michael Zakharyaschev. Minimal Module Extraction from DL-Lite Ontologies Using QBF Solvers. In IJCAI, volume 9, pages 836–841, 2009. (Cited on page 85.)
[184] Natalya F Noy and Mark A Musen. Specifying ontology views by traversal. In The Semantic Web–ISWC 2004, pages 713–725. Springer, 2004. (Cited on page 85.)
[185] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data-the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):1–22, 2009. (Cited on page 103.)
[186] W3C OWL Working Group and others. OWL 2 Web Ontology Language document overview, 2009. (Cited on page 105.)
[187] Prud’hommeaux E and Seaborne A. Resource Description Framework (RDF) / W3C Semantic Web Activity. (Cited on page 105.)
[188] James Clark. XSL transformations (XSLT), Accessed Nov 2013. (Cited on pages 105 and 109.)
[189] Patricia L Whetzel, Natalya F Noy, Nigam H Shah, Paul R Alexander, Csongor Nyulas, Tania Tudorache, and Mark A Musen.
BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic acids research, 39(suppl 2):W541–W545, 2011. (Cited on pages 105 and 106.)
[190] CO-ODE project. Ontology-browser: An OWL Ontology and RDF (Linked Open Data) Browser, Accessed Nov 2013. (Cited on pages 105 and 106.)
[191] Ontotext. Linked Life Data, Accessed Nov 2013. (Cited on pages 105 and 107.)
[192] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia-A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009. (Cited on pages 105 and 108.)
[193] Steven Vercruysse, Aravind Venkatesan, and Martin Kuiper. OLSVis: an animated, interactive visual browser for bio-ontologies. BMC bioinformatics, 13(1):116, 2012. (Cited on page 106.)
[194] Compare to, Accessed Dec 2012. (Cited on page 106.)
[195] BioPortal RDF retrieval of a SKOS term:, Accessed Jan 2013. (Cited on page 106.)
[196] NCBO BioPortal. BioPortal SPARQL, Accessed Nov 2013. (Cited on page 106.)
[197] NCBO BioPortal. BioPortal result, Accessed Nov 2013. (Cited on page 106.)
[198] JA Blake, J Corradi, JT Eppig, DP Hill, JE Richardson, M Ringwald, et al. Creating the gene ontology resource: design and implementation, 2001. (Cited on page 107.)
[199] Ontotext. Linked Life Data, Accessed Nov 2013. (Cited on page 107.)
[200] Richard Cyganiak and Christian Bizer. Pubby-a linked data frontend for SPARQL endpoints, Accessed Nov 2013. (Cited on page 107.)
[201] Bio2RDF Term search, Accessed Nov 2013. (Cited on page 107.)
[202] IAO Term search, Accessed Nov 2013. (Cited on page 108.)
[203] Bio2RDF Term search, Accessed Nov 2013. (Cited on page 108.)
[204] Bio2RDF Term search, Accessed Nov 2013. (Cited on page 108.)
[205] OpenLink. Openlink’s Virtuoso Faceted Browser term display, Accessed Nov 2013.
(Cited on page 108.)
[206] Roy Fielding, Jim Gettys, Jeffrey Mogul, Henrik Frystyk, Larry Masinter, Paul Leach, and Tim Berners-Lee. Hypertext transfer protocol–HTTP/1.1, 1999. (Cited on page 109.)
[207] OBO Foundry. OBO library, Accessed Nov 2013. (Cited on page 112.)
[208] Christopher J Mungall. OBO2OWL pipeline, Accessed Nov 2013. (Cited on page 112.)
[209] Christopher J Mungall. OBO flat file format 1.4 syntax and semantics [draft]. Technical report, Lawrence Berkeley National Laboratory. Accessed 5 Mar, 2012. (Cited on page 112.)
[210] Patrick Stickler. CBD-concise bounded description - W3C Member Submission, Accessed Nov 2013. (Cited on page 113.)
[211] OpenLink Software. SPARQL describe, June 2011. (Cited on page 113.)
[212] jQuery API documentation, Accessed Nov 2013. (Cited on page 114.)
[213] Jesse James Garrett et al. Ajax: A new approach to web applications, 2005. (Cited on page 114.)
[214] Fahim T Imam, Stephen D Larson, Anita Bandrowski, Jeffery S Grethe, Amarnath Gupta, Maryann E Martone, et al. Development and use of ontologies inside the neuroscience information framework: a practical approach. Frontiers in genetics, 3, 2012. (Cited on page 122.)
[215] Dan Brickley and Libby Miller. FOAF vocabulary specification 0.98 - document, Accessed Nov 2013. (Cited on page 122.)
[216] Vaccine Ontology, vaccination. (Cited on page 124.)
[217] The Ontology for Biomedical Investigations, administering substance in vivo. (Cited on page 124.)
[218] OMRE developers group. Ontology of Medically Relevant Entities (OMRE), December 2012. (Cited on page 124.)
[219] B J Stewart and P U Prabhu. Reports of sensorineural deafness after measles, mumps, and rubella immunisation. Archives of Disease in Childhood, 69(1):153–154, 1993. (Cited on page 124.)
[220] Madhok R Alcorn N, Saunders S. Benefit-risk assessment of leflunomide: an appraisal of leflunomide in rheumatoid arthritis 10 years after licensing. Drug Safety, 32(12):1123–34, 2009.
(Cited on page 124.)
[221] Lidian L. A. Lecluse, Emmilia A. Dowlatshahi, C. E. Jacqueline M. Limpens, Menno A. de Rie, Jan D. Bos, and Phyllis I. Spuls. Etanercept: An overview of dermatologic adverse events. Arch Dermatol, 147(1):79–94, 2011. (Cited on page 124.)
[222] Angela A. M. C. Claessens, Eibert R. Heerdink, Jacques T. H. M. van Eijk, Cornelis B. H. W. Lamers, and Hubert G. M. Leufkens. Determinants of Headache in Lansoprazole Users in The Netherlands: Results from a Nested Case-Control Study. Drug Safety, 25(4), 2002. (Cited on page 124.)
[223] The Ontology of Medically Relevant Entities, low blood pressure. (Cited on page 126.)
[224] Information Artifact Ontology, is about. (Cited on page 126.)
[225] The Adverse Event Reporting Ontology, has component. (Cited on pages 127 and 131.)
[226] The Adverse Event Reporting Ontology, found to exhibit. (Cited on page 127.)
[227] Information Artifact Ontology, directive information entity. (Cited on page 128.)
[228] Basic Formal Ontology, realizable entity. (Cited on page 128.)
[229] The Adverse Event Reporting Ontology, level 1 of certainty of anaphylaxis according to Brighton. (Cited on page 128.)
[230] The Ontology for General Medical Science, clinical finding. (Cited on page 128.)
[231] Kevin P. High, Suzanne F. Bradley, Stefan Gravenstein, David R. Mehr, Vincent J. Quagliarello, Chesley Richards, and Thomas T. Yoshikawa. Clinical practice guideline for the evaluation of fever and infection in older adult residents of long-term care facilities: 2008 update by the Infectious Diseases Society of America. Clinical Infectious Diseases, 48(2):149–171, 2009. (Cited on page 129.)
[232] Walter T. Hughes, Donald Armstrong, Gerald P. Bodey, Eric J. Bow, Arthur E. Brown, Thierry Calandra, Ronald Feld, Philip A. Pizzo, Kenneth V. I. Rolston, Jerry L. Shenep, and Lowell S. Young. 2002 guidelines for the use of antimicrobial agents in neutropenic patients with cancer. Clinical Infectious Diseases, 34(6):730–751, 2002.
(Cited on page 129.)
[233] The Adverse Event Reporting Ontology, chest tightness finding. (Cited on page 129.)
[234] The Adverse Event Reporting Ontology, measured hypotension finding. (Cited on page 129.)
[235] C.J. Mungall, C. Torniai, G.V. Gkoutos, S.E. Lewis, and M.A. Haendel. Uberon, an integrative multi-species anatomy ontology. Genome Biology, 13(1):R5, 2012. (Cited on page 129.)
[236] The Relation Ontology, has part. (Cited on page 131.)
[237] World Health Organization. Severe falciparum malaria. Transactions of the Royal Society of Tropical Medicine and Hygiene, 94:1–90, 2000. (Cited on page 132.)
[238] Information Artifact Ontology, scalar measurement datum. (Cited on page 132.)
[239] M. Krupka, K. Seydel, C.M. Feintuch, K. Yee, R. Kim, C.Y. Lin, R.B. Calder, C. Petersen, T. Taylor, and J. Daily. Mild Plasmodium falciparum malaria following an episode of severe malaria is associated with induction of the interferon pathway in Malawian children. Infection and immunity, 80(3):1150–1155, 2012. (Cited on pages 132 and 133.)
[240] Remi Gagnon, Marie Noel Primeau, Anne Des Roches, Chantal Lemire, Rhoda Kagan, Stuart Carr, Manale Ouakki, Mélanie Benoît, and Gaston De Serres. Safe vaccination of patients with egg allergy with an adjuvanted pandemic H1N1 vaccine. Journal of Allergy and Clinical Immunology, 126(2):317–323, 2010. (Cited on page 133.)
[241] National Institute of Allergy and Infectious Diseases/Food Allergy and Anaphylaxis Network anaphylaxis guideline, December 2012. (Cited on page 133.)
[242] Paul Shekelle, Martin P Eccles, Jeremy M Grimshaw, and Steven H Woolf. When should clinical guidelines be updated? BMJ, 323(7305):155–157, 7 2001. (Cited on page 133.)
[243] Taxiarchis Botsis, Michael D Nguyen, Emily Jane Woo, Marianthi Markatou, and Robert Ball. Text mining for the vaccine adverse event reporting system: medical text classification using informative feature selection.
Journal of the American Medical Informatics Association: JAMIA, 18(5):631–638, October 2011. (Cited on pages 135, 138, and 156.)
[244] Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration (FDA), agencies of the U.S. Department of Health and Human Services. Vaccine Adverse Event Reporting System (VAERS), Retrieved June 2012. (Cited on page 135.)
[245] Report of Adverse Events Following Immunization form, December 2012. (Cited on page 135.)
[246] Taxiarchis Botsis, Emily Jane Woo, and Robert Ball. Application of information retrieval approaches to case classification in the vaccine adverse event reporting system. Drug Safety, 36(7):573–582, 2013. (Cited on pages 138, 139, 140, 141, 143, 144, 145, 146, and 149.)
[247] FuXi 1.0: A Python-based, bi-directional logical reasoning system, August 2013. (Cited on page 139.)
[248] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001. (Cited on page 144.)
[249] Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Muller. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12:77, 2011. (Cited on page 146.)
[250] Barbara A Slade, Laura Leidel, Claudia Vellozzi, Emily Jane Woo, Wei Hua, Andrea Sutherland, Hector S Izurieta, Robert Ball, Nancy Miller, M Miles Braun, et al. Postlicensure safety surveillance for quadrivalent human papillomavirus recombinant vaccine. JAMA: the journal of the American Medical Association, 302(7):750–757, 2009. (Cited on page 147.)
[251] WHO Global Advisory Committee on Vaccine Safety. Report of meeting held 12-13 June 2013, Accessed Oct 2013. (Cited on page 148.)
[252] Pedro L Moro, Theresa Harrington, Tom Shimabukuro, Maria Cano, Oidda I Museru, David Menschik, and Karen Broder. Adverse events after Fluzone Intradermal vaccine reported to the Vaccine Adverse Event Reporting System (VAERS), 2011–2013.
Vaccine, 2013. (Cited on page 148.)
[253] John M Kelso, Gina T Mootrey, and Theodore F Tsai. Anaphylaxis from yellow fever vaccine. Journal of allergy and clinical immunology, 103(4):698–701, 1999. (Cited on page 148.)
[254] Lauren DiMiceli, Vitali Pool, John M Kelso, Sean V Shadomy, John Iskander, and VAERS Team. Vaccination of yeast sensitive individuals: review of safety data in the US vaccine adverse event reporting system (VAERS). Vaccine, 24(6):703–707, 2006. (Cited on page 148.)
[255] Nicole P Lindsey, Betsy A Schroeder, Elaine R Miller, M Miles Braun, Alison F Hinckley, Nina Marano, Barbara A Slade, Elizabeth D Barnett, Gary W Brunette, Katherine Horan, et al. Adverse event reports following yellow fever vaccination. Vaccine, 26(48):6077–6082, 2008. (Cited on page 148.)
[256] John K Iskander, Elaine R Miller, and Robert T Chen. Vaccine adverse event reporting system (VAERS). Pediatr Ann, 33:599, 2004. (Cited on page 148.)
[257] Joel J Gagnier, Gunver Kienle, Douglas G Altman, David Moher, Harold Sox, and David Riley. The CARE guidelines: consensus-based clinical case report guideline development. Journal of clinical epidemiology, 2013. (Cited on page 148.)
[258] Anjan K Banerjee, Sally Okun, I Ralph Edwards, Paul Wicks, Meredith Y Smith, Stephen J Mayall, Bruno Flamion, Charles Cleeland, and Ethan Basch. Patient-reported outcome measures in safety event reporting: Prosper consortium guidance. Drug Safety, pages 1–21, 2013. (Cited on page 148.)
[259] Janice Minard, Suzanne M Dostaler, Jennifer G Olajos-Clow, Todd W Sands, Chris J Licskai, and M Diane Lougheed. Development and implementation of an electronic asthma record for primary care: Integrating guidelines into practice. Journal of Asthma, pages 1–29, 2013. (Cited on page 148.)
[260] David W Scheifele and Scott A Halperin. Immunization monitoring program, active: a model of active surveillance of vaccine safety. In Seminars in pediatric infectious diseases, volume 14, pages 213–219. Elsevier, 2003.
(Cited on page 151.)
[261] Stella Veretnik, J Lynn Fink, and Philip E Bourne. Computational biology resources lack persistence and usability. PLoS computational biology, 4(7):e1000136, 2008. (Cited on page 152.)
[262] Jonathan D Wren and Alex Bateman. Databases, data tombs and dust in the wind. Bioinformatics, 24(19):2127–2128, 2008. (Cited on page 152.)
[263] Jonathan D Wren. URL decay in MEDLINE–a 4-year follow-up study. Bioinformatics, 24(11):1381–1385, 2008. (Cited on page 152.)
[264] Inc. Dice Holdings. Sourceforge, Accessed Jan 2014. (Cited on page 152.)
[265] Google. Google code, Accessed Jan 2014. (Cited on page 152.)
[266] Mélanie Courtot and the OBO TWG. PURL Guide, Accessed Jan 2014. (Cited on page 153.)
[267] Mélanie Courtot and the OBO TWG. OBO PURL domain, Accessed Jan 2014. (Cited on page 153.)
[268] Mélanie Courtot and the OBO TWG. Setting up Protege to work with OBO ontologies, Accessed Jan 2014. (Cited on page 153.)
[269] Mélanie Courtot and the OBO TWG. Policy for OBO namespace and associated PURL requests, Accessed Jan 2014. (Cited on page 153.)
[270] Catia Pesquita, João D Ferreira, Francisco M Couto, and Mário J Silva. The epidemiology ontology: an ontology for the semantic annotation of epidemiological resources. Journal of Biomedical Semantics, 5(1):4, 2014. (Cited on page 153.)
[271] MSrv developers. MSrv - An ontology of medical surveillance, Accessed Jan 2014. (Cited on page 153.)
[272] David S Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research, 36(suppl 1):D901–D906, 2008.
(Cited on page 156.)

Appendix A

Canadian Adverse Events Following Immunization Surveillance System (CAEFISS) sample data

2011-09-02_exportPDF.pdf

Canada Vigilance
Summary of Reported Adverse Reactions

Report Runtime: 2011-09-02 - 05:27:16 PM
Initial Received Date: 1965-01-01 to 2011-03-31
Latest Received Date: N/A
Total Number of Reports: 10 Report(s)

Brand Name/Active Ingredient: 'GARDASIL'
Search Date Criteria: 1965-01-01 to 2011-03-31
Reaction Term(s): All/Tous
Serious report?: Both
Feature of Report: All
Type of Report: All
Source of Report: All
Gender: All
Report Outcome: All
Age: All

CAVEAT: This summary is based on information from adverse reaction reports submitted by health professionals and laypersons either directly to Health Canada or via market authorization holders. Each report represents the suspicion, opinion or observation of the individual reporter. The Canada Vigilance Program is a spontaneous reporting system that is suitable to detect signals of potential health product safety issues during the post-market period. The data has been collected primarily by a spontaneous surveillance system in which adverse reactions to health products are reported on a voluntary basis. Under reporting of adverse reactions is seen with both voluntary and mandatory spontaneous surveillance systems. Accumulated case reports should not be used as a basis for determining the incidence of a reaction or estimating risk for a particular product as neither the total number of reactions occurring, nor the number of patients exposed to the health product is known. Because of the multiple factors that influence reporting, quantitative comparisons of health product safety cannot be made from the data. Some of these factors include the length of time a drug is marketed, the market share, size and sophistication of the sales force, publicity about an adverse reaction and regulatory actions.
In some cases, the reported clinical data is incomplete and there is not certainty that these health products caused the reported reactions. A given reaction may be due to an underlying disease process or to another coincidental factor. This information is provided with the understanding that the data will be appropriately referenced and used in conjunction with this caveat statement.

Report Information (**AER = Adverse Reaction Report)
Adverse Reaction Report Number: 000358593
Latest AER** Version Number: 0
Initial Received Date: 2010-12-24
Latest Received Date: 2010-12-24
Source of Report: MAH
Market Authorization Holder AER Number: 2010004848
Feature of Report: Adverse Reaction
Type of Report: Spontaneous
Reporter Type: Health Professional
Serious report? Death: Yes Disability: Congenital Anomaly:Yes Life Threatening: Hospitalization: Yes Other Medically Important Conditions: Yes

Patient Information
Age: 14 Years
Gender: Female
Height:
Weight:
Report Outcome: Death

Link / Duplicate Report Information
Record Type / Link AER** Number: No duplicate or linked report.

Product Information
Columns: Product Description, Health Product Role, Dosage Form, Route of Administration, Dose, Frequency, Therapy Duration
GARDASIL Suspect Unknown 1.0 Day(s)
GARDASIL Suspect SUSPENSION INTRAMUSCULAR Subcutaneous 1.0 Day(s)
INFLUENZA VACCINE Concomitant NOT SPECIFIED Unknown
YASMIN 21 Suspect TABLET Unknown 61.0 Day(s)

Adverse Reaction Term Information
Columns: Adverse Reaction Term(s), MedDRA Version, Reaction Duration
Abasia v.14.0
Asthenia v.14.0
Basilar migraine v.14.0
Blood glucose increased v.14.0
Cardiac arrest v.14.0
Confusional state v.14.0
Dizziness postural v.14.0
Drowning v.14.0
Loss of consciousness v.14.0
Nausea v.14.0
Syncope v.14.0
Vomiting v.14.0

Report Information (**AER = Adverse Reaction Report)
Adverse Reaction Report Number: 000359862
Latest AER** Version Number: 0
Initial Received Date: 2011-01-17
Latest Received Date: 2011-01-17
Source of Report: Community
Market Authorization Holder AER Number:
Feature of Report: Adverse Reaction
Type of Report: Spontaneous
Reporter Type: Physician

Serious report?
Death:
Disability:
Congenital Anomaly: Yes
Life Threatening: Yes
Hospitalization: Yes
Other Medically Important Conditions:

Patient Information
Age: 14 Years
Gender: Female
Height: 158 Centimetres
Weight: 80 Kilograms
Report Outcome: Unknown

Link / Duplicate Report Information
Record Type: Duplicate
Link AER** Number: 000360728

Product Information
Columns: Product Description, Health Product Role, Dosage Form, Route of Administration, Dose, Frequency, Therapy Duration
GARDASIL, Suspect, SUSPENSION INTRAMUSCULAR, Unknown, 1.0 Dosage forms, Once

Adverse Reaction Term Information
Columns: Adverse Reaction Term(s), MedDRA Version, Reaction Duration
Nervous system disorder, v.14.0
Ventricular fibrillation, v.14.0

Appendix B
Vaccine Adverse Event Reporting System (VAERS) sample data

Appendix C
List of OBO Foundry principles (as of November 2013)

Principle ID: Description

Accepted

FP 001 open: The ontology must be open and available to be used by all without any constraint other than (a) its origin must be acknowledged and (b) it is not to be altered and subsequently redistributed under the original name or with the same identifiers. The OBO ontologies are for sharing and are resources for the entire community. For this reason, they must be available to all without any constraint or license on their use or redistribution. However, it is proper that their original source is always credited and that after any external alterations, they must never be redistributed under the same name or with the same identifiers.

FP 002 format: The ontology is in, or can be expressed in, a common shared syntax. This may be either the OBO syntax, extensions of this syntax, or OWL. The reason for this is that the same tools can then be usefully applied. This facilitates shared software implementations. This criterion is not met in all of the ontologies currently listed, but we are working with the ontology developers to have them available in a common OBO syntax.

FP 003 URIs: The ontologies possess a unique identifier space within the OBO Foundry. The source of a term (i.e. class) from any ontology can be immediately identified by the prefix of the identifier of each term. It is, therefore, important that this prefix be unique.

FP 004 versioning: The ontology provider has procedures for identifying distinct successive versions.

FP 005 delineated content: The ontology has a clearly specified and clearly delineated content. The ontology must be orthogonal to other ontologies already lodged within OBO. The major reason for this principle is to allow two different ontologies, for example anatomy and process, to be combined through additional relationships. These relationships could then be used to constrain when terms could be jointly applied to describe complementary (but distinguishable) perspectives on the same biological or medical entity. As a corollary to this, we would strive for community acceptance of a single ontology for one domain, rather than encouraging rivalry between ontologies.

FP 006 textual definition: The ontologies include textual definitions for all terms.
Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader.

FP 007 relations: The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.

FP 008 documented: The ontology is well documented.

FP 009 users: The ontology has a plurality of independent users.

FP 010 collaboration: The ontology will be developed collaboratively with other OBO Foundry members.

FP 011 locus of authority: There should be a single person who is responsible for the ontology, for ensuring continued maintenance in light of scientific advance and prompt response to user feedback. Contact information for this person should be provided on the ontology website, and listed in the OBO Library Metadata File.

FP 012 naming conventions: The ontology follows the OBO set of naming conventions.

FP 016 maintenance: OBO is an open community and, by joining the initiative, the authors of an ontology commit to its maintenance in light of scientific advance and to working with other members to ensure the improvement of these principles over time.

Under discussion

FP 013 genus differentia: All definitions of the genus-differentia form, utilizing (some) cross-products.

FP 014 BFO: Ontologies should be conceivable as the result of populating downwards from some fragment of BFO 2.0.

FP 015 Single inheritance: Single asserted is_a inheritance (= each ontology should be conceived as consisting of a core of asserted single inheritance links, with further is_a relations inferred).

FP 017 instantiability: All the types represented by the terms in the ontology should be instantiable.

FP 018 orthogonality: For each domain there should be convergence upon a single ontology that is recommended for use by those who wish to become involved with the Foundry initiative.

FP 019 content: The ontology must be a faithful representation of the domain and fit for the stated purpose.

Appendix D
SPARQL query for FluMist vaccine

DEFINE sql:describe-mode "CBD"
describe <>
FROM <>
================================================================
prefix rdfs: <>
prefix rdf: <>
prefix owl: <>
select * from <>
where {
  ?nodeID owl:annotatedSource <>.
  #?nodeID rdf:type owl:Annotation.
  ?nodeID owl:annotatedProperty ?annotatedProperty.
  ?nodeID owl:annotatedTarget ?annotatedTarget.
  ?nodeID ?aaProperty ?aaPropertyTarget.
  OPTIONAL {?annotatedProperty rdfs:label ?annotatedPropertyLabel}.
  OPTIONAL {?aaProperty rdfs:label ?aaPropertyLabel}.
  FILTER (isLiteral(?annotatedTarget)).
  FILTER (not (?aaProperty in(owl:annotatedSource, rdf:type, owl:annotatedProperty, owl:annotatedTarget)))
}
================================================================
prefix rdfs: <>
prefix rdf: <>
prefix owl: <>
SELECT DISTINCT ?ref ?refp ?label ?o
FROM <>
WHERE {
  ?ref ?refp ?o.
  FILTER (?refp IN (owl:equivalentClass, rdfs:subClassOf)).
  OPTIONAL {?ref rdfs:label ?label}.
  {
    {
      SELECT ?s ?o FROM <>
      WHERE {
        ?o ?p ?s .
        FILTER (?p IN (rdf:first, rdf:rest, owl:intersectionOf, owl:unionOf, owl:someValuesFrom, owl:hasValue, owl:allValuesFrom, owl:complementOf, owl:inverseOf, owl:onClass, owl:onProperty))
      }
    }
    OPTION (TRANSITIVE, t_in(?s), t_out(?o), t_step(?s) as ?link).
    FILTER (?s= <>)
  }
}
ORDER BY ?label
================================================================
PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX owl: <>
SELECT DISTINCT ?s ?o ?sc
FROM <>
WHERE {
  {
    ?s rdfs:subClassOf <> .
    FILTER (isIRI(?s)).
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc rdfs:subClassOf ?s}
  }
  UNION
  {
    ?s owl:equivalentClass ?s1 .
    FILTER (isIRI(?s)).
    ?s1 owl:intersectionOf ?s2 .
    ?s2 rdf:first <> .
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc rdfs:subClassOf ?s}
  }
  UNION
  {
    ?s rdfs:subClassOf <> .
    FILTER (isIRI(?s)).
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc owl:equivalentClass ?s1 .
      ?s1 owl:intersectionOf ?s2 .
      ?s2 rdf:first ?s}
  }
  UNION
  {
    ?s owl:equivalentClass ?s1 .
    FILTER (isIRI(?s)).
    ?s1 owl:intersectionOf ?s2 .
    ?s2 rdf:first <> .
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc owl:equivalentClass ?s3 .
      ?s3 owl:intersectionOf ?s4 .
      ?s4 rdf:first ?s}
  }
}
================================================================
prefix rdfs: <>
prefix rdf: <>
prefix owl: <>
SELECT ?path ?link ?label
FROM <>
WHERE
{
  {
    SELECT ?s ?o ?label
    WHERE
    {
      {
        ?s rdfs:subClassOf ?o .
        FILTER (isURI(?o)).
        OPTIONAL {?o rdfs:label ?label}
      }
      UNION
      {
        ?s owl:equivalentClass ?s1 .
        ?s1 owl:intersectionOf ?s2 .
        ?s2 rdf:first ?o .
        FILTER (isURI(?o))
        OPTIONAL {?o rdfs:label ?label}
      }
    }
  }
  OPTION (TRANSITIVE, t_in(?s), t_out(?o), t_step (?s) as ?link, t_step ('path_id') as ?path).
  FILTER (isIRI(?o)).
  FILTER (?s= <>)
}
================================================================
prefix rdfs: <>
prefix rdf: <>
prefix owl: <>
SELECT ?s ?label
FROM <>
WHERE
{
  ?s rdf:type <> .
  ?s rdfs:label ?label
}
================================================================
SELECT distinct ?g
WHERE
{
  graph ?g {<> ?p ?o}
}
================================================================
SELECT *
FROM <>
WHERE {
  ?s <> ?o.
  FILTER (?s in(<http://null>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>))
}
================================================================
SELECT *
FROM <>
WHERE {
  ?s <> ?o.
  FILTER (?s in(<http://null>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>))
}
================================================================
PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX owl: <>
SELECT DISTINCT ?s ?o ?sc
FROM <>
WHERE {
  {
    ?s rdfs:subClassOf <> .
    FILTER (isIRI(?s)).
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc rdfs:subClassOf ?s}
  }
  UNION
  {
    ?s owl:equivalentClass ?s1 .
    FILTER (isIRI(?s)).
    ?s1 owl:intersectionOf ?s2 .
    ?s2 rdf:first <> .
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc rdfs:subClassOf ?s}
  }
  UNION
  {
    ?s rdfs:subClassOf <> .
    FILTER (isIRI(?s)).
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc owl:equivalentClass ?s1 .
      ?s1 owl:intersectionOf ?s2 .
      ?s2 rdf:first ?s}
  }
  UNION
  {
    ?s owl:equivalentClass ?s1 .
    FILTER (isIRI(?s)).
    ?s1 owl:intersectionOf ?s2 .
    ?s2 rdf:first <> .
    OPTIONAL {?s rdfs:label ?o} .
    OPTIONAL {?sc owl:equivalentClass ?s3 .
      ?s3 owl:intersectionOf ?s4 .
      ?s4 rdf:first ?s}
  }
}
================================================================

Appendix E
List of IAO annotation properties used as common metadata set

Label: editor preferred term
Definition: The concise, meaningful, and human-friendly name for a class or property preferred by the ontology developers. (US-English)
Cardinality: 1:1

Label: definition
Definition: The official definition, explaining the meaning of a class or property. Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions.
Cardinality: 1:1

Label: definition editor
Definition: Name of editor entering the definition in the file. The definition editor is a point of contact for information regarding the term. The definition editor may be, but is not always, the author of the definition, which may have been worked upon by several people.
Cardinality: 1:n

Label: definition source
Definition: formal citation, e.g., identifier in external database to indicate / attribute source(s) for the definition. Free text indicates attributes source(s) for the definition. EXAMPLE: Author Name, URI, MeSH Term C04, PUBMED ID, Wiki URI on 31.01.2007
Cardinality: 1:n

Label: curation status specification
Definition: The curation status of a class or property. The allowed values must come from this enumerated list of predefined terms:
• placeholder: This isn’t a class that the ontology will keep - it’s a placeholder for edits that are underway.
The class name should start with an underscore.
• uncurated: Nothing done yet beyond assigning a unique class ID and proposing a preferred term.
• metadata incomplete: Class is being worked on; however, the metadata (including definition) are not complete or sufficiently clear to the editors.
• metadata complete: Class has all its metadata, but is either not guaranteed to be in its final location in the asserted IS A hierarchy or refers to another class that is not complete.
• pending final vetting: All definitions, placement in the asserted IS A hierarchy and required minimal metadata are complete. The class is awaiting a final review by someone other than the definition editor.
• ready for release: Class has undergone final review, is ready for use, and will be included in the next release. Any class lacking “ready for release” should be considered likely to change place in hierarchy, have its definition refined, or be obsoleted in the next release. Those classes deemed “ready for release” will also be derived from a chain of ancestor classes that are also “ready for release.”
Cardinality: 1:1

Label: example of usage
Definition: A phrase describing how a class name should be used. May also include other kinds of examples that facilitate immediate understanding of a class semantics, such as widely known prototypical subclasses or instances of the class. Although essential for high level terms, examples for low level terms (e.g., Affymetrix HU133 array) are not
Cardinality: 0:n

Label: alternative term
Definition: An alternative name for a class or property which means the same thing as the preferred name (semantically equivalent)
Cardinality: 0:n

Label: editor note
Definition: A note containing points under consideration for further term development that may be included in released versions of the ontology. It should contain nothing embarrassing and something potentially useful for end users to understand the ontology. Editor notes should include the date of edit (YYYYMMDD) and the author.
Cardinality: 0:n

Label: curator note
Definition: An administrative note intended for the curator of the ontology. It will not be included in the released versions of the ontology, so it should contain nothing necessary for end users to understand the ontology. Curator notes should include the date of edit (YYYY/MM/DD) and the author.
Cardinality: 0:n

Label: obsolescence reason specification
Definition: The obsolescence reason of a class or property. The allowed values must come from this enumerated list of predefined terms:
• failed exploratory term: The term was used in an attempt to structure part of the ontology but in retrospect failed to do a good job.
• terms merged: An editor note should explain what were the merged terms and the reason for the merge.
• term split: This is to be used when a term has been split in two or more new terms. An editor note should indicate the reason for the split and indicate the URIs of the new terms created.
• placeholder removed: This is to be used when the original term has been replaced by a term imported from another ontology. An editor note should indicate what is the URI of the new term to use.
• term imported: This is to be used when the original term has been replaced by a term imported from another ontology.
An editor note should indicate what is the URI of the new term to use.
Cardinality: 1:1

Label: OBO foundry unique label
Definition: An alternative name for a class or property which is unique across the OBO Foundry.
Cardinality: 1:1

Appendix F
The anaphylactic reaction Standardised MedDRA Query

Individual SMQs. SMQ Introductory Guide V13.1, September 2010, MSSO-DI-6226-13.1.0

2.7 Anaphylactic reaction (SMQ)
(Production Release November 2005)

2.7.1 Definition
• An acute systemic reaction characterized by pruritus, generalized flush, urticaria, respiratory distress and vascular collapse
• Occurs in a previously sensitized person upon re-exposure to the sensitizing antigen
• Other signs and symptoms: agitation, palpitation, paresthesias, wheezing, angioedema, coughing, sneezing and difficulty breathing due to laryngeal spasm or bronchospasm
  - Less frequent clinical presentations: seizures, vomiting, abdominal cramps and incontinence

2.7.2 Inclusion/Exclusion Criteria
• Included:
  - Any terms, at the PT level, representing events which may be noted during anaphylaxis
  - In a spreadsheet format, the testing pharmaceutical company’s list and the testing regulator’s list were positioned alongside the MedDRA SSC list for anaphylaxis, and this three-column table was then systematically reviewed top-down.
  - Unanimous agreement for/against inclusion of each term was achieved by the group
• Excluded:
  - Terms for signs and symptoms that do not fall within the three defined categories (Upper Airway/Respiratory, Angioedema/Urticaria/Pruritus/Flush, and Cardiovascular/Hypotension) in the broad search are excluded.
NOTE: There are two SMQs related to anaphylaxis: Anaphylactic reaction (SMQ) and Anaphylactic/anaphylactoid shock conditions (SMQ). The two SMQs have different focuses. Anaphylactic/anaphylactoid shock (SMQ) is specific for more severe anaphylactic manifestations, i.e. those that result in shock, and not less severe ones such as rash. Anaphylactic reaction (SMQ) widens the search beyond shock conditions by including such terms as PT Type I hypersensitivity.

2.7.3 Algorithm
The SMQ Anaphylactic reaction consists of three parts:
• A narrow search containing PTs that represent core anaphylactic reaction terms;
• A broad search that contains additional terms that are added to those included in the narrow search. These additional terms are signs and symptoms possibly indicative of anaphylactic reaction;
• An algorithmic approach which combines a number of anaphylactic reaction symptoms in order to increase specificity. A case must include either:
  - A narrow term or a term from Category A;
  - A term from Category B - (Upper Airway/Respiratory) AND a term from Category C - (Angioedema/Urticaria/Pruritus/Flush);
  - A term from Category D - (Cardiovascular/Hypotension) AND [a term from Category B - (Upper Airway/Respiratory) OR a term from Category C - (Angioedema/Urticaria/Pruritus/Flush)]

2.7.4 Notes on Implementation and/or Expectation of Query Results
In addition to narrow and broad searches, Anaphylactic reaction (SMQ) is an algorithmic SMQ. The algorithm is a combination of broad search terms among various categories to further refine the identification of cases of interest. The algorithm can be implemented in a post-retrieval process as noted below:
• First, retrieve relevant cases by applying the SMQ query as a narrow/broad SMQ (see section
• Post-retrieval process, software applies the algorithmic combination to screen the cases retrieved above. For small data sets of retrieved cases, the algorithm may be applied on manual review of cases. The algorithm for Anaphylactic reaction (SMQ) is A or (B and C) or (D and (B or C)). Cases filtered by the algorithm can be listed for output.

2.7.5 List of References for Anaphylactic reaction (SMQ)
• The Merck Manual. 15th edition.
Merck, Sharp & Dohme Research Laboratories. (1987): 306-7

Appendix G
The list of significant MedDRA terms based on contingency tables test

Supplementary material
Appendix 1: MedDRA terms with a chi-square value over 3.841

MedDRA term | Chi-square | P-value
Hypersensitivity | 1578.605353 | 0
Dyspnoea | 553.3557 | 2.34E-122
Throat tightness | 551.5865009 | 5.69E-122
Pruritus | 297.906177 | 9.42E-67
Chest discomfort | 296.2635345 | 2.15E-66
Pharyngeal oedema | 251.7630256 | 1.07E-56
Urticaria | 231.0682725 | 3.49E-52
Wheezing | 205.1667372 | 1.56E-46
Swelling face | 203.0038003 | 4.62E-46
Anaphylactic reaction | 198.3924991 | 4.68E-45
Oedema | 181.4914781 | 2.29E-41
Swelling | 179.028501 | 7.90E-41
Lip swelling | 177.3909311 | 1.80E-40
Discomfort | 160.1597406 | 1.04E-36
Swollen tongue | 157.5517954 | 3.88E-36
Throat irritation | 154.4938506 | 1.81E-35
Eye swelling | 141.3551256 | 1.35E-32
Tic | 122.0267653 | 2.28E-28
Dysphagia | 83.93452989 | 5.11E-20
Vaccination complication | 81.70570956 | 1.58E-19
Rash | 68.93363732 | 1.02E-16
Anxiety | 56.33309817 | 6.12E-14
Paraesthesia oral | 51.40599746 | 7.51E-13
Dermatitis allergic | 50.13558624 | 1.43E-12
Oxygen saturation | 49.73241883 | 1.76E-12
Flushing | 49.3121747 | 2.18E-12
Allergy to vaccine | 44.76216274 | 2.22E-11
Heart rate increased | 41.07021225 | 1.47E-10
Electrocardiogram normal | 40.11780423 | 2.39E-10
Palpitations | 37.25210863 | 1.04E-09
Dysphonia | 36.7245365 | 1.36E-09
Erythema | 34.31261596 | 4.69E-09
Oxygen saturation normal | 33.65197646 | 6.59E-09
Cough | 33.12717418 | 8.63E-09
Electrocardiogram | 32.54342042 | 1.17E-08
Chest pain | 31.8973366 | 1.63E-08
Eye pruritus | 31.06355091 | 2.50E-08
Oedema peripheral | 28.64424038 | 8.70E-08
Heart rate | 28.6141026 | 8.83E-08
Oral pruritus | 28.13879224 | 1.13E-07
Idiopathic urticaria | 26.77190018 | 2.29E-07
Angioedema | 24.88169145 | 6.10E-07
Tachycardia | 24.1470991 | 8.93E-07
Ocular hyperaemia | 23.56285888 | 1.21E-06
Dizziness | 21.90501031 | 2.86E-06
Pruritus generalised | 20.41537534 | 6.23E-06
Hyperventilation | 20.28914823 | 6.66E-06
X-ray normal | 18.62066883 | 1.59E-05
Rash erythematous | 17.79906026 | 2.46E-05
Chest X-ray normal | 17.02743339 | 3.68E-05
Non-cardiac chest pain | 16.87767466 | 3.99E-05
Oxygen saturation decreased | 16.50541696 | 4.85E-05
Adverse drug reaction | 15.84086837 | 6.89E-05
Asthma | 14.94011526 | 0.000110978
Hypertension | 13.76604066 | 0.000207045
Rhinitis | 13.68760651 | 0.000215874
Food allergy | 13.58133526 | 0.000228446
Rash macular | 12.90979478 | 0.000326867
Blood glucose increased | 12.39650931 | 0.000430137
Bronchial hyperreactivity | 11.95078269 | 0.000546244
Oedema mouth | 11.95078269 | 0.000546244
Dry throat | 11.78175253 | 0.000598141
Respiratory rate | 11.513761 | 0.000690829
Chest X-ray | 10.74988156 | 0.00104286
Paraesthesia | 10.46235549 | 0.001218318
Tension | 9.777425891 | 0.001766675
Pyrexia | 9.460175584 | 0.00209981
Feeling abnormal | 9.424379867 | 0.002141195
Presyncope | 9.414183846 | 0.002153134
Altered state of consciousness | 9.010832195 | 0.002683842
Respiratory rate decreased | 9.010832195 | 0.002683842
Rhinitis allergic | 9.010832195 | 0.002683842
Red blood cell count normal | 9.010832195 | 0.002683842
Respiration abnormal | 9.010832195 | 0.002683842
Skin test | 9.010832195 | 0.002683842
X-ray | 8.958551384 | 0.002761738
Eyelid oedema | 8.395515718 | 0.003761478
Hypoaesthesia oral | 8.260899726 | 0.004050805
Feeling hot | 8.222546332 | 0.004137311
Face oedema | 8.081313552 | 0.004472402
Immediate post-injection reaction | 7.72106081 | 0.005458032
Blood glucose | 7.72106081 | 0.005458032
Stridor | 7.064359153 | 0.007863244
No reaction on previous exposure to drug | 6.745371787 | 0.009399119
Blood pressure | 5.971226848 | 0.014541161
Dermatitis | 5.813875639 | 0.015900215
Feeling jittery | 5.685593271 | 0.017104755
Lymph node palpable | 5.624025895 | 0.017715909
Activated partial thromboplastin time shortened | 5.624025895 | 0.017715909
Panic disorder | 5.624025895 | 0.017715909
Skin test negative | 5.624025895 | 0.017715909
Arrhythmia supraventricular | 5.624025895 | 0.017715909
Steroid therapy | 5.624025895 | 0.017715909
Oropharyngeal spasm | 5.624025895 | 0.017715909
Soft tissue inflammation | 5.624025895 | 0.017715909
Laryngospasm | 5.624025895 | 0.017715909
Vaccination site erythema | 5.624025895 | 0.017715909
Barium swallow normal | 5.624025895 | 0.017715909
Lip discolouration | 5.624025895 | 0.017715909
Plantar fasciitis | 5.624025895 | 0.017715909
Food aversion | 5.624025895 | 0.017715909
Computerised tomogram thorax normal | 5.624025895 | 0.017715909
Oropharyngeal swelling | 5.624025895 | 0.017715909
Vaccination site pruritus | 5.624025895 | 0.017715909
Scan myocardial perfusion normal | 5.624025895 | 0.017715909
Vasoconstriction | 5.624025895 | 0.017715909
Blood electrolytes decreased | 5.624025895 | 0.017715909
Venous thrombosis | 5.624025895 | 0.017715909
Troponin | 5.474670434 | 0.019293998
Pain in extremity | 5.123595971 | 0.023602658
Bronchitis | 4.782141775 | 0.028756334
Myalgia | 4.763685188 | 0.029066252
Blood pressure decreased | 4.744564425 | 0.029390993
Metabolic function test | 4.672009897 | 0.030658028
Oxygen supplementation | 4.300705957 | 0.038096556
Productive cough | 4.18625334 | 0.040753069
Serum sickness | 3.874258039 | 0.049031977
Hypokalaemia | 3.874258039 | 0.049031977
Bronchospasm | 3.874258039 | 0.049031977
Hypoventilation | 3.874258039 | 0.049031977

Appendix H
Summary of the Seeker collaboration work and results

Brinkman Report
Brinkman-Seeker Collaboration 2012
Automated Classification of Adverse Events
www.seekersolutions.com

Brinkman-Seeker Collaboration 2012 Paper

Table of Contents
Abstract
Background
The Challenge
Methodology
Data availability
Adverse Events Selected for Identification
The Technology
The Experiments
Results
Going Forward
References

Abstract
The field of medical informatics has become an important area of research in the healthcare industry. This unique field unites researchers with backgrounds in computer science, engineering, the life sciences and healthcare.
Due to this diverse set of skill requirements, now more than ever, strong partnerships between academia and industry are needed to develop efficient and intelligent solutions for a wide variety of healthcare issues. In this pilot study, Seeker developers partnered with researchers from the BC Cancer Agency to investigate the task of identifying adverse events following immunizations using machine learning classification with simple language features. While previous work demonstrated that more advanced feature engineering is required for the identification of structurally complex adverse events, the results of this pilot study find that simple features perform well in some circumstances, and warrant further investigation and collaborative research. More importantly, the partnership between Seeker and the BC Cancer Agency demonstrates a successful dialogue between industry partners and academic researchers, and shows how fruitful collaborative work can be in the medical informatics domain.

Background
Public health authorities across North America are searching for ways to improve the safety and cost efficiency of many healthcare system components. One interesting and important area of public health focuses on the incidence of adverse events following an immunization (AEFI). As defined by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, an adverse event (AE) is:
Any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have to have a causal relationship with this treatment.
An adverse event (AE) can therefore be any unfavourable and unintended sign (including an abnormal laboratory finding, for example), symptom, or disease temporally associated with the use of a medicinal product, whether or not considered related to the medicinal product.[1]

In Canada, the Canadian Adverse Event Following Immunization Surveillance System (CAEFISS) exists to monitor the frequency and severity of AEFIs, and provides valuable data to help public health authorities make decisions related to immunization programs[2]. The process to submit an AEFI report to CAEFISS involves several steps, as indicated in Figure 1. When an AEFI occurs, a health care provider such as a nurse or physician compiles a report, and submits it to their local Provincial or Territorial Health Unit. The exact format and content of the AEFI report varies based on the standards and processes established by the Province or Territory, and does not necessarily mirror the fields and format of the nationally available AEFI report form[3] provided by the Public Health Agency of Canada (PHAC). It is important to note that the reporting clinician provides immediate treatment of the adverse event prior to submitting the AEFI report to their local Health Unit, ensuring that the patient receives timely resolution of their symptoms. Once collected, Provincial and Territorial Health Units remove personally identifiable information from the report, and forward it to PHAC to be included in CAEFISS for aggregation and study.

The Challenge
While AEFI report forms contain highly structured fields, there are sections that allow for the input of free text as supplementary information. This type of supplementary information is extremely valuable since its proper analysis could be used to improve the consistency and accuracy of the structured fields of the AEFI report.
In turn, these improvements could directly impact the quality of the decisions public health authorities make related to immunization programs and protocols. Free text analysis of the supplementary information fields is where Seeker Solutions Inc. (Seeker) decided to explore the application of Natural Language Processing (NLP) and Machine Learning (ML) technologies.

Figure 1. Information flow from Provincial / Territorial AEFI reporting systems to CAEFISS.

In late 2012, a team of Seeker data scientists partnered with researchers from the BC Cancer Agency: Dr. Ryan Brinkman (Associate Professor, Medical Genetics, UBC; Senior Scientist, BC Cancer Agency) and Mélanie Courtot (PhD Candidate, UBC). Their goal was to determine if simple NLP and ML techniques and tools could be used to identify AEFIs within the free text fields of an AEFI report, and to tentatively identify a scale of difficulty for computationally identifying different types of AEFIs.

Methodology

Data availability
A key issue for the project was to identify and obtain data that could be used for testing and proof of concept. Given the tight timeline to produce a proof of concept, Mélanie Courtot suggested that the team analyze data sets derived from the United States Vaccine Adverse Event Reporting System (VAERS). Similar to CAEFISS, VAERS is a national program designed to collect AEFI reports for the purposes of post-market vaccine safety monitoring [4]. However, a key difference between CAEFISS and VAERS is that VAERS data is made available to the public after reports have been suitably anonymized. In addition, while the VAERS AEFI reporting forms differ from those used in Canada, the bulk of the form is composed of a free text field used to capture details about the AEFI. This wealth of free text provided a good starting point for Seeker to apply NLP and ML technology, given its similarity to the supplementary information fields found on Canadian AEFI reports. The dataset already contains annotations for many different adverse events, as defined by the Medical Dictionary for Regulatory Activities (MedDRA)[5]. Finally, medical officers from the U.S. Food and Drug Administration (FDA) manually reviewed and positively coded 237 reports for anaphylaxis.

Adverse Events Selected for Identification
Two different adverse events were selected for the study:
1. Paresthesia: a burning or prickling sensation usually felt in the feet and hands that is idiomatically described as “tingling” or “pins and needles”[6].
2. Anaphylaxis: a severe, life-threatening, multi-system allergic reaction that occurs after contact with an allergen that may include some compounds found in a vaccine[7].
To diagnose anaphylaxis with various levels of certainty, the Brighton Collaboration Allergic Reactions Working Group has produced a case definition that describes symptoms that must be present in the dermatologic, cardiovascular, gastrointestinal, and respiratory systems[8]. For example, at the first level of diagnostic certainty, Brighton-defined criteria must be present from a dermatological system, combined with symptoms present in a cardiovascular and/or respiratory system. Therefore, a physician may use terms such as “throat” and “swell” to describe the respiratory distress experienced by a patient, and “rash” and “hives” to describe their dermatological symptoms. According to the Brighton case definitions, both groups of symptoms must be present and have appeared with a sudden onset and rapid progression before a diagnosis of anaphylaxis can be certain. To the best of our knowledge, no similar case definition exists for paresthesia.

The Technology
The act of classifying a report as positive or negative for a condition is a well-known task in the ML community.
For example, a classifier can learn to identify suspected cases of anaphylaxis or paresthesia through empirical evidence. To do so, the classifier is provided with a training dataset containing a large number of reports that have already been positively or negatively labeled. The classifier then constructs a model by associating features that appear in the dataset with the positive or negative label. In a typical classification task, there may be hundreds or thousands of features that the classifier observes. Each feature reflects an interesting aspect of the data, such as the length of the document or the frequency of a word. Many features such as these can be discovered within a free text report using various NLP techniques. Once the learning process is complete, a trained classifier can use features within a test dataset (or in novel data) to make positive or negative predictions against its constructed model.

Past work has revealed that classifying anaphylaxis based on free text is challenging[9]. This is due to the variety of language that can be used to describe various systemic reactions, combined with a strict set of requirements for their valid configurations. Botsis et al. demonstrated that the production of a highly accurate classifier involves advanced feature engineering to incorporate medical domain knowledge into the classifier. However, in many similar ML classification tasks, simple approaches have historically yielded decent results. Thus Seeker’s approach to the classification problem was to use individual words as features (known as a bag-of-words) instead of investing in advanced feature engineering. From an NLP standpoint, a bag-of-words requires little domain knowledge, is computationally inexpensive, and allows a classifier to be constructed quickly.
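The bag-of-words representation described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names (`tokenize`, `bag_of_words`, `vectorize`) and the two example reports are invented for illustration, and the tokenizer simply keeps lowercase alphabetic words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Map each distinct word in the text to its number of occurrences."""
    return Counter(tokenize(text))


def vectorize(bags):
    """Turn word-count bags into fixed-length vectors over a shared, sorted vocabulary."""
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, vectors


# Invented example reports (assumption: not taken from VAERS).
reports = [
    "Patient reported tingling in both hands; tingling resolved overnight.",
    "Rash and hives, then throat swelling.",
]
bags = [bag_of_words(report) for report in reports]
vocabulary, vectors = vectorize(bags)
```

Each report becomes one row of word counts, so any standard classifier can consume the rows directly. Word order and clinical relationships between terms are discarded, which is why the representation is cheap to build but carries no domain knowledge.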
The Experiments

To classify paresthesia, 32,885 reports were selected from the VAERS database between January 1, 2009 and December 31, 2009. Of these reports, 1,167 contained a MedDRA annotation for paresthesia. For each report, a bag-of-words was created. Each word was stemmed such that morphological inflections were removed (for example, the word infected would be stemmed to infect). A 10-fold cross validation was then performed on the dataset. The cross validation technique works by randomly generating 10 individual views or folds of the data such that each fold reserves 90% of the data for training and 10% of the data for testing. For each fold, a classifier was trained using the training data for the fold, while performance metrics were collected using the testing data for the fold. Aggregate performance metrics were compiled from the results of each fold, and included the positive predictive value (known in the ML community as precision), sensitivity (known in the ML community as recall), and F-measure. In terms of classifiers, a variety of well-known types were used in the experiment, including Support Vector Machines, Random Forests, and Logistic Regression.

A similar process was used to classify anaphylaxis, making use of 6,034 reports from the VAERS database. Of these reports, 237 were positively coded for anaphylaxis by the FDA. Some pre-processing on the FDA data was necessary to match the records back to their original VAERS report IDs.

RESULTS

For the paresthesia classification task, the calculated precision following a 10-fold cross validation ranged from 51-89% across the bundle of classifiers used. Recall ranged from 73-79%, and F-measure ranged from 62-80%. Overall, the most performant classifier model had a precision of 88%, a recall of 73%, and an F-measure of 80%.
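The cross-validation procedure and the reported metrics can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the folds are assumed to have disjoint test sets (standard k-fold), and the F-measure is assumed to be the balanced harmonic mean of precision and recall.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and yield (train, test) index lists for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test sets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_labels, predicted_labels):
    """Compute precision, recall and F-measure for binary (0/1) labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# One split per fold for a dataset the size of the paresthesia set (32,885 reports).
splits = list(k_fold_indices(32885, k=10))

# Consistency check on the reported best paresthesia classifier.
best_f = f_measure(0.88, 0.73)
```

With a precision of 0.88 and a recall of 0.73, `f_measure` gives approximately 0.80, which agrees with the F-measure reported for the best paresthesia classifier.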
While preliminary in nature, these results demonstrate that simple NLP and ML techniques can be used to obtain relatively good results for AEFIs such as paresthesia, and that not all AEFIs require deep domain knowledge for identification. In terms of a scale of difficulty, we consider these types of AEFIs to be relatively uncomplicated, due mainly to the fact that inexpensive feature engineering (such as the bag-of-words approach) results in features that are sufficiently informative to produce a good classifier. In practical terms, this preliminary finding is fairly significant, since it suggests that classifiers for uncomplicated AEFIs will remain relatively cheap to build and deploy. However, further work is needed for an in-depth error analysis, and to close the performance gap in precision and recall.

The anaphylaxis classification task tells a different story. As expected, the simple bag-of-words approach performed poorly when compared to the results obtained by Botsis et al. This is likely due to its inability to model medically significant domain knowledge. In terms of performance, precision ranged from 29-61%, recall ranged from 18-61%, and F-measure ranged from 28-46%. Overall, the most performant classifier had a precision of 42%, a recall of 50%, and an F-measure of 46%. On a scale of difficulty, we consider AEs similar to anaphylaxis to be structurally complex due to the fact that many simple features need to be combined or parsed according to a domain-specific recipe to generate more informative ones. More work similar to Botsis et al.
is needed to understand what medical domain knowledge is needed to build highly accurate classifiers for AEFIs that are structurally complex, as well as work to discover other informative features that may be useful for their identification and classification.

GOING FORWARD

The results of the experiments demonstrate that the difficulty of identifying various AEFIs can vary widely, and advanced feature engineering is not always required. However, much more work is needed to understand the structure of other AEFIs, and to find informative features that exist in the text to help with their classification and discovery. Further work is also needed to close the performance gap in precision and recall for both uncomplicated and structurally complex AEFIs. Overall, given the wide range of AEFIs, collaboration between the ML and NLP communities and domain experts will continue to be a necessity. This work demonstrates how fruitful these types of collaborations can be, from the discovery of new data sources to the transfer of domain knowledge between academia and industry. While the duration of the partnership was short, both parties agree that the simple experiments and their preliminary results show promising potential for complex algorithmic processing in the medical domain.

REFERENCES

1  Clinical Safety Data Management: Definitions and Standards for Expedited Reporting E2A. ICH Harmonised Tripartite Guideline. (Retrieved November 8, 2013)
2  Canadian Adverse Events Following Immunization Surveillance System (CAEFISS). Public Health Agency of Canada. (Retrieved October 17, 2013)
3  Report of Adverse Events Following Immunization (AEFI). Public Health Agency of Canada. (Retrieved October 10, 2013)
4  VAERS: Vaccine Adverse Event Reporting System. (Retrieved October 17, 2013)
5  Understanding MedDRA - The Medical Dictionary for Regulatory Activities. MedDRA. (Retrieved November 8, 2013)
6  NINDS Paresthesia Information Page. National Institute of Neurological Disorders and Stroke. (Retrieved October 17, 2013)
7  M. Erlewyn-Lajeunesse et al. (2007) Anaphylaxis as an adverse event following immunisation. Journal of Clinical Pathology, 60(7), pages 737-739. (Retrieved October 30, 2013)
8  J.U. Ruggeberg et al. (2007) Anaphylaxis: Case Definition and Guidelines for Data Collection, Analysis, and Presentation of Immunization Safety Data. Vaccine, 25(31), pages 5675-5684.
9  T. Botsis et al. (2011) Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association, 18, pages 631-638.

Seeker Solutions
www.seekersolutions.com
Corporate Head Office
400-1112 Fort St, Victoria, BC V8V 3K8
250 483 4129

