CONSTRUCTING AND APPLYING SEMANTIC MODELS OF CLINICAL PHENOTYPES TO SUPPORT WEB-EMBEDDED CLINICAL RESEARCH by SOROUSH SAMADIAN B.Sc., Sharif University of Technology, 2001 M.Sc., University of Tarbiat Modares, 2003 M.Sc., University of Sheffield, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Bioinformatics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) July 2013 © Soroush Samadian, 2013 Abstract The adoption of the Semantic Web by the life sciences provides exciting opportunities for exploring novel ways to conduct biomedical research. In particular, the approval of the Web Ontology Language (OWL) by the World Wide Web Consortium (W3C) has provided a global standard for the shared representation of biomedical knowledge in the form of ontologies. However, though there are numerous examples of bio-ontologies being used to describe “what is” (i.e., a universal view of reality), there is a dearth of examples where ontologies are used to describe “what might be” (i.e., a hypothetical view of reality). This thesis proposes that, to achieve scientific rigor, it is important to consider approaches for explicitly representing subjective knowledge in a particular domain - in particular, we examine phenotypic classifications in the clinical domain. We provide supporting evidence that OWL is suitable for formally representing these subjective perspectives. Envisioning ontologies as hypothetical, contextual and subjective is a notable departure from the commonly-held view in the life-sciences, where ontologies (ostensibly) represent some "universal truth". We support these arguments with both empirical and quantitative studies. We demonstrate that, when expressed in OWL, many phenotypic classification systems can be accurately modeled in silico. We then demonstrate that these knowledge-models can be "personalized", and show that such models can enable the automated analysis of data in a transparent manner. This results in more rigorous clinical research, while simultaneously allowing the clinicians to maintain their role as the final arbiters of decisions. Finally, we investigate methodologies that might facilitate the encoding and sharing of personalized expert-knowledge by non-knowledge-engineers - a necessary step in making these ideas useful to the clinical community. The knowledge-acquisition bottleneck is the primary barrier to the widespread use of ontologies in life sciences. Thus, we investigate data-driven methodologies to automatically extract knowledge from existing data-systems, and show that it is possible to "boot-strap" the construction of knowledge models through various data-mining algorithms. Taken together, these studies begin to reveal a path toward Web-embedded, in silico clinical research, where knowledge is explicit, transparent, personalized, modular, globally-shared, re-used, and dynamically applied to the interpretation and analysis of clinical datasets. Preface Chapter 1 describes the overarching hypothesis of the research conducted in this thesis. Portions of section 1.2 have appeared in the publication “SADI, SHARE, and the in silico scientific method” in BMC Bioinformatics (2010), Wilkinson MD, McCarthy L, Vandervalk B, Withers B, Kawas E and Samadian S. I wrote the cardiovascular and clinical ontology, and implemented several of the SADI services. The rest of the chapter was created from scratch.
Chapter 3 describes a generic, Semantic Web-based approach to automatically resolve measurement-unit conflicts in clinical settings. The idea was conceived jointly by Mark Wilkinson and me as a by-product of the research conducted in chapter 4. The majority of the design and programming work for the system, including the core query engine and the live web demonstration, was done by me. Portions of this work have appeared in “Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web,” Journal of Biomedical Semantics (2012). The original manuscript was split into two separate manuscripts as suggested by the reviewers. The work in chapter 3 is the expanded and updated version of this manuscript, submitted for a separate publication. Chapter 4 describes a generic framework for extending existing clinical terminologies such that they can be used to automatically classify raw clinical observation data. Portions of this work have appeared in “Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web,” Journal of Biomedical Semantics (2012). Mark Wilkinson and I planned this work, and jointly wrote this manuscript. I executed the data migration, ontology design and extension, Web Service deployment, and overall analysis. Bruce McManus generated and initially analyzed the source clinical data-set, and discussed and validated our approach and choice of clinical standards. Chapter 6 describes data-driven approaches to bootstrap the creation of an ontological framework de novo from unstructured clinical data. Portions of this chapter have appeared in Soroush Samadian, Benjamin M. Good, Bruce McManus, Mark D. Wilkinson (2012), “A data-driven approach to automatic discovery of prescription drugs in cardiovascular risk management”, Bio-Ontologies 2012. Mark Wilkinson, Benjamin Good and I planned this work, and jointly wrote this manuscript. I designed and implemented the framework and executed the overall analyses. Bruce McManus generated and initially analyzed the source clinical data-set, and discussed and validated our approach and choice of clinical standards. Table of contents Abstract..................................................................................................................................... ii Preface......................................................................................................................................
iii Table of contents ........................................................................................................................v List of tables ..............................................................................................................................ix List of figures..............................................................................................................................x Glossary....................................................................................................................................xii Acknowledgements ...................................................................................................................xv Dedication ................................................................................................................................xv 1 Introduction ..........................................................................................................................1 1.1 Dissertation overview ...................................................................................................1 1.2 Dissertation objectives and chapter summaries ................................................................4 1.3 Dissertation objectives and chapter summaries ................................................................5 2 Background and state of the art ............................................................................................10 2.1 Bioinformatics and clinical informatics .........................................................................10 2.2 Brief history of the Web ...............................................................................................11 2.3 Semantic Web .............................................................................................................11 2.4 Formal data, information and knowledge representation .................................................13 2.4.1 Resource Description Framework (RDF) ...............................................................14 2.4.2 SPARQL query language ......................................................................................16 2.4.3 RDF adaptation in Bioinformatics .........................................................................17 2.4.4 Ontology..............................................................................................................18 2.4.5 Knowledgebase ....................................................................................................20 2.4.6 Ontology components ...........................................................................................20 2.4.7 The Web Ontology Language (OWL) ....................................................................22 2.4.8 Closed world assumption vs. 
open world assumption ..............................................25 2.5 Bio-ontologies .............................................................................................................25 2.5.1 Knowledge management: data integration, indexing, annotation, retrieval ................27 2.5.2 Clinical decision support.......................................................................................28 2.5.3 Biomedical data classification ...............................................................................29 2.6 Ontological knowledge acquisition ...............................................................................32 2.6.1 Expert-driven ontology construction ......................................................................32 2.6.2 Data-driven ontology construction .........................................................................34 2.6.3 Data-driven formal concept learning ......................................................................36 2.7 Data integration and interoperability tools in bioinformatics ...........................................37 2.7.1 Semantic Automated Data Integration (SADI)........................................................40 vi 2.7.2 Semantic Health and Research Environment (SHARE) ...........................................41 2.7.3 SADI/SHARE Demonstration ...............................................................................41 3 Automatic detection and resolution of measurement-unit conflicts in aggregated data..............45 3.1 Synopsis .....................................................................................................................45 3.2 Introduction ................................................................................................................46 3.3 Background and related work .......................................................................................47 3.4 Materials and methods .................................................................................................52 3.4.1 Data set and data collection ...................................................................................52 3.4.2 Data transformation ..............................................................................................54 3.4.3 Clinical data model...............................................................................................56 3.4.4 Semantic services .................................................................................................57 3.5 Results and discussion .................................................................................................58 3.6 Conclusion and future work..........................................................................................62 4 Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web ............................................................................................................................63 4.1 Synopsis .....................................................................................................................63 4.2 Background.................................................................................................................64 4.3 Methods ......................................................................................................................67 4.3.1 Datasets and data collection ..................................................................................67 4.3.2 
Overview of the approach .....................................................................................68 4.3.3 Ontologies used....................................................................................................70 4.3.4 Ontological mapping, extensions, and algorithmic services .....................................70 4.3.5 Refactoring the legacy dataset ...............................................................................73 4.3.6 Approach to binary patient classification (“at risk” versus “not at risk”) ...................75 4.3.7 Approach to ternary risk assessments .....................................................................79 4.4 Results ........................................................................................................................83 4.4.1 Evaluation of automated binary risk classification ..................................................83 4.4.2 Discrepancies between automated and expert binary classifications .........................85 4.4.3 Discrepancies between automated and expert ternary classifications ........................89 4.5 Discussion...................................................................................................................90 4.5.1 Interpreting discrepancies between automated and manual risk classification ...........90 4.5.2 Broader implications of "personalizing" OWL ontologies .......................................92 4.6 Conclusion ..................................................................................................................92 5 Using public drug data to assist in automatic phenotypic classification of patient records in Semantic Web ............................................................................................................................94 5.1 Synopsis .....................................................................................................................94 5.2 Introduction ................................................................................................................95 vii 5.3 Background.................................................................................................................97 5.4 Methods .................................................................................................................... 102 5.4.1 Overview of the drug description and prescribed medicines architecture ................ 102 5.4.2 Dataset .............................................................................................................. 103 5.4.3 Patient data transformation.................................................................................. 103 5.5 Evaluation: clinical case studies.................................................................................. 108 5.5.1 Evaluation of canonicalization service ................................................................. 108 5.5.2 Case study 1 - Automatic determination of the treatment status implicated in cardiovascular risk assessment ........................................................................................... 109 5.5.3 Case Study 2 - Automatic detection of potentially harmful drug-drug interaction .... 116 5.6 Discussion................................................................................................................. 120 5.6.1 Discussion of blood pressure, diabetes and cholesterol treatment statuses............... 
120 5.6.2 Discussion of Framingham risk scores experimental results................................... 122 5.6.3 Multiple drug interaction .................................................................................... 123 5.6.4 NDF-RT evaluation ............................................................................................ 123 5.7 Related work ............................................................................................................. 125 5.7.1 Ontological-based decision support systems ......................................................... 125 5.7.2 Drug identification in structured and unstructured text .......................................... 127 5.8 Conclusion ................................................................................................................ 127 5.9 Limitations and future work ....................................................................................... 128 6 A data-driven approach to learning OWL expressions in clinical health care ......................... 129 6.1 Synopsis ................................................................................................................... 129 6.2 Introduction .............................................................................................................. 129 6.3 Related work ............................................................................................................. 131 6.3.1 Ontology evaluation ........................................................................................... 131 6.3.2 Ontology enrichment .......................................................................................... 132 6.4 Experiments .............................................................................................................. 133 6.5 Clinical case study 1: automatic discovery of prescription drugs in cardiovascular risk management ......................................................................................................................... 134 6.5.1 Summary of the experiment ................................................................................ 134 6.5.2 Overview of the method...................................................................................... 134 6.5.3 Dataset and data collection .................................................................................. 135 6.5.4 NDF-RT Ontology ............................................................................................. 136 6.5.5 Preprocessing and data transformation ................................................................. 136 6.6 Results for case study 1 .............................................................................................. 137 6.7 Broader implications of the case study 1...................................................................... 141 6.7.1 Data-driven knowledge discovery........................................................................ 141 viii 6.7.2 Potential for improvement of NDF-RT................................................................. 142 6.7.3 Medication ranking proposal for clinical use ........................................................ 142 6.8 Clinical case study 2: DL Synthesis of clinical phenotypes for septic shock patients: An experiment with DL-learner .................................................................................................. 
143 6.8.1 Summary of the experiment ................................................................................ 143 6.8.2 Control experiment............................................................................................. 143 6.8.3 Experiment: OWL representation of clinical phenotypes in VASST dataset............ 147 6.9 Limitations................................................................................................................ 158 6.10 Future work............................................................................................................... 158 6.10.1 Subjective evaluation.......................................................................................... 158 6.10.2 Combining DL-learner and reasoners................................................................... 159 6.11 Conclusion ................................................................................................................ 159 7 Conclusion and future work ............................................................................................... 161 7.1 Summary .................................................................................................................. 161 7.1.1 Chapter 3: Measurement unit conflicts in clinical data .......................................... 161 7.1.2 Chapter 4: Formal classifications of patient phenotype .......................................... 161 7.1.3 Chapter 5: Linking public drug ontologies to legacy patient data ........................... 163 7.1.4 Chapter 6: Data-driven approaches for ontology learning ...................................... 163 7.2 Limitations and future work ....................................................................................... 164 7.3 Theoretical perspectives and methodologies ................................................................ 166 7.4 Overall analysis and an outlook on future .................................................................... 167 Bibliography ............................................................................................................................ 170 Appendix A: Supporting material for chapter 2........................................................................... 188 Appendix B: Supporting material for chapter 3........................................................................... 193 Appendix C: Supporting material for chapter 4........................................................................... 204 Appendix D: Supporting material for chapter 5........................................................................... 219 Appendix E: Supporting material for chapter 6 ........................................................................... 
228 ix List of tables Table ‎2.1 Basic syntax rules for DL .............................................................................................23 Table ‎3.1 The first two rows of the dataset used in the original format ...........................................52 Table ‎3.2 American Heart Association classification for systolic blood pressure .............................59 Table ‎3.3 Units and values before and after conversion .................................................................60 Table ‎4.1 American Heart Association classification for systolic and diastolic blood pressure .........66 Table ‎4.2 Part of the first row of dataset used in Microsoft excel sheet ...........................................68 Table ‎4.3 American Heart Association classification for systolic and diastolic blood pressure .........76 Table ‎4.4 American Heart Association classification for cholesterol, HDL, and triglycerides...........77 Table ‎4.5 American Heart Association classification for BMI........................................................78 Table ‎4.6 LDL guidelines ...........................................................................................................79 Table ‎4.7 Estimated risk of general cardiovascular disease in men .................................................81 Table ‎4.8 10-year risk for general CVD by total Framingham risk score.........................................82 Table ‎4.9 Comparison between manual and automatic binary risk classifications ............................85 Table ‎5.1 Prescription information for the first two patients ........................................................ 101 Table ‎5.2 A number of medications for which no spelling suggestions ........................................ 109 Table ‎5.3 Estimated risk of general cardiovascular disease .......................................................... 109 Table ‎5.4 Complete results of analysis for HBP b) diabetes and C) cholesterol treatment statuses .. 112 Table ‎5.5 Comparison between classification metrics between before and after incorporating drug information for Framingham risk classifications ......................................................................... 116 Table ‎5.6 Potentially harmful drug interactions........................................................................... 120 Table ‎5.7 An extreme example of an aberrant clinical classification ............................................. 121 Table ‎5.8 List of medications for which no therapeutic intent was found in NDF-RT .................... 125 Table ‎6.1 The first two rows of the dataset used in the original format ......................................... 136 Table ‎6.2 The list showing the relationships between medications and diseases-of-interest ............ 140 Table ‎6.3 A list of drugs associated with QTc prolongation ......................................................... 152 x List of figures Figure ‎2.1 Semantic Web layer cake ............................................................................................13 Figure ‎2.2 The RDF graph ..........................................................................................................15 Figure ‎2.3 Merging two RDF datasets.. ........................................................................................16 Figure ‎2.4 Ontology spectrum: there are a variety of artifacts all collectively called ontologies........26 Figure ‎2.5 Ontology spectrum division. 
.......................................................................................30 Figure ‎2.6 Layers of ontology development process with examples of each layer............................35 Figure ‎2.7 Architecture of distributed mediator systems ................................................................39 Figure ‎2.8 Inferred classification based on BMI............................................................................43 Figure ‎2.9 SADI service for BMI calculation ...............................................................................43 Figure ‎3.1 High level representation of the unit “inch” in QUDT ...................................................50 Figure ‎3.2 OM representation of cubic centimeter.........................................................................51 Figure ‎3.3 Extending clinical concepts to incorporate numerical quantities .....................................57 Figure ‎4.1 CardioSHARE architecture: increasingly complex ontological layers organize data into more abstract concept .................................................................................................................69 Figure ‎4.2 Data schema using concepts in legacy Ontologies.........................................................74 Figure ‎4.3 The schematic diagram of the SADI Web Service interface to BMI calculation service ...79 Figure ‎4.4 SPARQL queries (Prefixes not shown) followed by a small snapshot of the results for automatic classification of patients into “high risk” (A) and “low risk” (B) for blood pressure. ........84 Figure ‎4.5 SPARQL queries and a small snapshot of the results for automatic classification of patients into “High Risk”, “Medium Risk” and “Low Risk”, respectively. ......................................89 Figure ‎5.1 The representation of drug ASPIRIN 120MG ...............................................................99 Figure ‎5.2 High level representation of the proposed schema using concepts in legacy ontologies.. 102 Figure ‎5.3 High level representation of the RDF data schema ...................................................... 105 Figure ‎5.4 Flowchart algorithm of the canonicalizer service to migrate erroneous excel sheet data into a standardized format................................................................................................................ 107 Figure ‎5.5 Automatic vs. manual determination of high blood pressure treatment status ................ 112 Figure ‎5.6 High level representation of drug data model merged with clinical observation model .. 114 Figure ‎5.7 Automatic determination of “high risk” patients. ........................................................ 122 Figure ‎6.1 Layers of ontology development process.................................................................... 132 xi Figure ‎6.2 Migration from legacy data into the format suitable for machine learning algorithm ..... 136 Figure ‎6.3 Inferring components of formal class definitions from instance data ............................ 137 Figure ‎6.4 Comparison of the two ontologies in Protégé.............................................................. 146 Figure ‎6.5 Rules suggested by DL-learner for class Phosphate Protoprotein ................................. 147 Figure ‎6.6 Data schema using concepts in legacy ontologies ....................................................... 149 Figure ‎6.7 Representation of normal (top) and abnormal (bottom) QTc waveform ........................ 
150 xii Glossary The majority of the terminologies defined here have a variety of different meanings. The provided definitions are specific to the use of the defined word in the context of this dissertation. Terms that are explicitly defined in the text of the introduction, may not have been included here. Artificial intelligence (AI): The attempt to endow computers with human-like cognitive abilities. Atomic class (Atomic Concept): A named class or concept which cannot be expressed using combination of other concepts. Catalog (as a type of ontology): A list of terms providing an unambiguous interpretation of terms (e.g. assign a unique identifier to medication Acetaminophen.) Defined class (Defined concept): the set of concepts whose description is both necessary and sufficient for class membership. Description Logics (DLs): A group of knowledge representation formalisms, which can be utilized to formally represent class descriptions. Efficient algorithms exist for automatically reasoning over DL formalisms. Expert System: A computer program that simulates the judgement and behavior of a human or an organization that has expert knowledge and experience in a particular field Frames: Ontologies that are equipped property information for classes. Frames with value restrictions : Ontologies that include value restrictions for the properties. In such ontologies one might place restrictions on what can fill a property. Glossary (as a type of ontology): A list of terms and meanings specified (usually) as natural language statements. They provide sufficient semantics for human users to interpret them; however would not meet the criteria of being machine processable. Hypothesis : Any provisional idea, explanation or statements for an observable and testable phenomenon. By the virtue of this definition a hypothesis is only subject to falsification rather than proof; since there is always the possibility that a future experiment will show that it is false. In silico: an expression used to mean "performed on computer or via computer simulation. Knowledge acquisition: The translation of knowledge from unstructured sources such as human minds or texts into formalisms useful for automatic processing. Knowledge discovery: The creation of non-trivial and previously unknown knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources Machine learning (ML): A collection of algorithms used in programs that improve their performance based on the results of their operations. There are many kinds of machine learning. The xiii main type used within this dissertation is known as ‘supervised learning’. In supervised learning, algorithms learn predictive functions from labelled training data. Metadata are data providing information about one or more aspects of the data. Natural language processing (NLP): A field of artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages Named class (Named Concept): A group of individuals sharing a specific set of characteristics to which a name attributed such as class “Person”. By convention only singular names are used to represent classes. We use the terms, “concept”, “class” and “category” interchangeably throughout this thesis; although more formal and precise definitions will be given where necessary in the thesis. 
Ontologies with informal is-a relationship: Ontologies that do provide an explicit hierarchy; however, the hierarchy is not a strict subclass or “is-a” hierarchy- a pattern that occurs frequently on the Web. Without true subclass ( “is-a”) relationships, reasoning with ontologies become problematic. Ontologies with formal is-a relationship include strict subclass hierarchies. In such ontologies if A is a superclass of B, then if c is an instance of B it necessarily follows that c is an instance of A. Strict subclass hierarchies are necessary for exploitation of inheritance. Ontologies with formal instance are ontologies that provide include instances of each class in addition to formal hierarchical (is-a) relationships. Pattern discovery: The act of discovering and reporting the interesting relationships in data Primitive class (Primitive Concept): Those concepts which only have necessary conditions (in terms of their properties) for membership of the class. Query: A request for data that satisfy a set of specific criteria. Semantic Web service : A Web service that is built around standards for the interchange of semantic data, with the aim of combining semantic data from different resources and services Thesauri: glossaries that provide additional semantics in their relations between terms. They provide information such as synonym or acronym relationships. Thesauri do not usually provide explicit hierarchical relationships between concepts; though with narrower and broader term specifications, simple hierarchies might be deduced. Unnamed class (Category): A group of individuals sharing a specific set of characteristics to which no specific name is attributed such as “A person who is a male and is blind”. An unnamed class is created by conjunction of a number on Named classes and properties (Person, Male and Blind in the case of our example) Web Ontology Language (OWL): A family of knowledge representation languages for authoring ontologies. xiv Web Ontology Language-Description Logics (OWL-DL): A sublanguage of OWL corresponding to description logics (DLs). OWL-DL is designed to support the existing Description Logic business segment and has desirable computational properties for reasoning systems. Web service: A software system designed to support interoperable machine-to-machine interaction over a network. Workflow: A set of connected and possibly atomic steps required to accomplish a specific task. Each atomic step of a workflow comprises the following three parameters: input, transformation algorithms, and output. xv Acknowledgements I wish to immensely thank my supervisor, Mark Wilkinson, for his unrelenting and uncompromising supports throughout the entire process. I wish to thank him for his crucial contributions, which have made him a backbone of the research conducted in this thesis. His originality, diligence and patience have contributed substantially to my gradual scientific maturity. I thank him for standing by me in the face of hardship and for patiently teaching me the core scientific writing skills. I am grateful in every possible way and hope to have the privilege to proactively collaborate with him in the future. I would wish to thank my co-supervisor, Bruce McManus, for his guidance, support and his encouragement throughout this period. Also, I wish to thank him for patiently teaching me the required cardiovascular knowledge and for his essential contributions to the experimental designs. 
Additionally, I would like to thank him for helping me gradually improve my communication skills. I wish to thank my committee, Wyeth Wasserman and Raymond Ng, for their unconditional support and their diligence in giving me valuable suggestions and advice to better conduct my research. I wish to thank Benjamin Good, a former PhD student in our laboratory, for his guidance and for laying the philosophical ground for part of my thesis. I thank Benjamin Vandervalk, Ed Kawas and Luke McCarthy, former members of the Wilkinson laboratory, for their constant support with technical problems and for helping me through the trials of graduate school. In addition, thanks to Chris Fjell for providing useful discussions and his constructive feedback on my thesis. Finally, I would like to thank the entire team involved in the Vasopressin and Septic Shock Trial (VASST) study, especially John Boyd, Keith Walley and Nattachai Anantasit, for training me in the required biology and for generously providing me with the valuable data required to conduct my research. This work is part of the CardioSHARE initiative, founded through a special initiatives award from the Heart and Stroke Foundation of British Columbia and Yukon, with subsequent funding from Microsoft Research and an operating grant from the Canadian Institutes for Health Research (CIHR). Core laboratory funding is derived through an award from the Natural Sciences and Engineering Research Council of Canada. The authors recognize the fiscal, operational and scientific support of the NCE CECR PROOF Centre of Excellence. Dedication I would like to express my gratitude to my parents for their support and motivation. My parents are the pillar of my life and I wish to dedicate this thesis to them for their unconditional love and support. They supported me throughout the entire process and I feel immensely privileged to have them in my life. 1 Introduction “A paradigm shift is where the state before does not have the words or concepts to describe the state after” Tim Berners-Lee, 2006 1.1 Dissertation overview Reproducibility is a cornerstone of science. To be truly reproducible, an experiment should be explicit and thorough in describing every stage of the analysis, starting with the initial question or hypothesis, continuing on through the methodology by which candidate data were selected and analyzed, and finishing with a fully-documented result, including all provenance information (which resource, which version, when, and why). As modern biology becomes increasingly in silico-based, many of these best-practices in reproducibility are being managed with high efficiency. The emergence of analytical workflows (e.g. those created using the Taverna [1] workflow authoring tool) as first-class reference-able and sharable objects in bioinformatics has led to a high level of precision in describing in silico “materials and methods”, as well as the ability to automate the collection of highly detailed provenance information. However, the earlier stages in the scientific process – the posing of the hypothesis and the selection of candidate data – are still largely limited to human cognition; we pose our hypotheses in the form of sentences, and we select/screen candidate data often based on expert knowledge or intuition. This is particularly acute in the biomedical sciences, where the experts are the ultimate arbiters of patient phenotypic classification, often based entirely on their personal expert opinion.
The emergence and uptake of Semantic Web technologies by life science practitioners provides exciting opportunities for exploring novel ways to conduct biomedical research. These standards have allowed scientists to explicitly express “knowledge”. In particular, the adoption of the Web Ontology Language (OWL) [1] and its variants such as OWL-DL [2] by the World Wide Web Consortium (W3C) [3] has provided a global standard for knowledge representation which is showing particularly rapid adoption within the life sciences and health sciences communities [4]. (Portions of section 1.2 have appeared in the publication “SADI, SHARE, and the in silico scientific method”, Wilkinson MD, McCarthy L, Vandervalk B, Withers B, Kawas E and Samadian S; I wrote the cardiovascular and clinical ontology, and implemented several of the SADI services. The detailed background on the concepts discussed in this chapter, e.g. the Semantic Web and its components, is provided in chapter 2.) Though there are numerous examples of ontologies being used to describe “what is” (i.e. to describe a particular aspect of biological reality) [5], there are far fewer examples of ontologies being used to describe “what might be” (i.e. a hypothetical view of biological reality). Given the constantly changing nature of biomedical “reality”, the distinction seems, in our opinion, to be somewhat artificial. In this thesis, we demonstrate that, to achieve scientific rigor and reproducibility, it is important to consider, create, and evaluate approaches for explicitly representing hypotheses and phenotypic classification systems, in particular in the clinical domain. This thesis can be broadly divided into two sections. In the first section, we will investigate the suitability of OWL for representing these hypothetical and/or subjective perspectives in a concrete way. We will support these arguments with empirical and quantitative studies. In the second section of the thesis, we will explore and evaluate the feasibility of designing frameworks to support clinical researchers as they model their hypotheses and classification systems in an unfamiliar and complex language such as OWL. We will empirically demonstrate the utility of the approach by showing that, when expressed in OWL, many real-life biomedical classifications can be automatically evaluated in silico using novel Semantic Web technologies. Specifically, we pursue the following inter-related research questions:
- Can we design a generic, Semantic Web-based approach to 1) express measurement-unit information for quantitative and qualitative data measures, 2) automatically detect when an integrated dataset contains conflicting units, and 3) automatically resolve these conflicts? This is undertaken specifically in the context of integrating clinical data, imagining that future clinical studies will automatically discover and integrate data from disparate, non-coordinating sites on the Web. Unit homogenization, therefore, will be required to make any meaningful interpretation of these clinical data. Subsequent chapters build on this idea by adding more layers of information representation and abstraction.
- How do we alter and/or extend existing clinical terminologies such that they can be used to automatically classify raw clinical observation data? What modifications to traditional data capture and representation must be made in order to make these data amenable to logical inferences?
Can we replace (or at a minimum, guide) expert clinical annotators in their interpretation of clinical data, thus enhancing the reproducibility and rigor of clinical research, and if so, with what level of accuracy can this be achieved?
- Can we extend the analysis carried out for quantitative clinical observations into the more complex pharmaceutical portion of legacy clinical datasets – effectively asking the machine to make phenotypic interpretations based on the drugs a patient is taking? What kind of clinical and phenotypic inferences become possible (and/or more accurate) when structured clinical data are connected, using semantic technologies, to the extensive structured knowledge present in pharmaceutical databases? Could such inferences enhance the quality of clinical research and/or patient care?
- Can we utilize data-driven approaches to bootstrap the creation of an ontological framework de novo from unstructured clinical data? Can we detect additional subtleties known to be present in such datasets, such as multiple valid classification-solutions which differ in popularity? How can such data-driven approaches improve the knowledge-acquisition bottleneck which is ubiquitous in ontology engineering? How can we utilize data-driven approaches, in practice, to help clinicians automatically build and evaluate hypothetical clinical classifications in the form of formal ontologies? What do we learn from their (subjective) experience with the framework?
Each of the projects mentioned above contributes a component of the ultimate goal of the project, which is to support end-users in the in silico generation and evaluation of hypotheses in Clinical Informatics [6]. The importance of this work is both scientific and pragmatic. From an Information Science perspective, we will evaluate the boundaries of the knowledge-representation space where ontologies and clinical categorizations overlap. More importantly, this thesis helps shed light on the process of building bio-ontologies using lay users' (experts who are not ontology engineers) knowledge, and hence converting informal lay users' knowledge into formal expert systems. (An expert system is “a computer program that simulates the judgement and behavior of a human or an organization that has expert knowledge and experience in a particular field” [258].) Additionally, from the perspective of clinical practice, this project proposes a framework which allows for different, disparate, and cross-domain world views to coexist, enabling the process of evoking and evaluating conjecture about the data based on distributed expert clinical knowledge. 1.2 Dissertation objectives and chapter summaries We carried out the research described in this dissertation to evaluate methodologies that would assist clinicians and biologists to formally represent and evaluate their personal expert opinion as ontologies. The first part of the project was a feasibility study, in which we undertook to empirically evaluate the ability to model clinical observations, and hypotheses generated from these observations, using ontologies. Following the successful demonstration that many kinds of clinical observations can be formally modeled by ontologists to achieve classifications similar to those of a clinical expert, we undertook the inverse approach, evaluating whether data-driven approaches might guide the construction of formal, axiom-based classification models by clinical researchers themselves or other non-ontologists.
This step necessitates the development and evaluation of knowledge acquisition frameworks that enable the efficient acquisition of knowledge from domain experts. Overall, the research conducted in this thesis is related to three overlapping but distinct sub-disciplines of knowledge engineering, as follows:
- Ontology-based clinical data integration and classification (chapters 3 and 4)
- Ontology-based clinical decision support (chapter 5)
- Data-driven ontological knowledge acquisition (chapter 6)
We have chosen to focus the research conducted in all of the above sub-disciplines on the context of cardiovascular diseases. The choice of cardiovascular diseases for the case study is supported by the fact that they are the leading cause of death worldwide, which provides a socially and economically important use-case for the analysis [7]. The choice is also supported by the availability of ethics-approved and well-curated datasets relating to heart disease; however, the methodologies are applicable to other domains [8]. We briefly present the chapter summaries in the context of the mentioned sub-disciplines in the following. 1.3 Dissertation objectives and chapter summaries As mentioned, this thesis covers two core themes. The first theme encompasses chapters 3 through 5, and focuses on the empirical study of modeling clinical observations and hypotheses using the Semantic Web. The second theme spans chapter 6, and focuses on data-driven approaches to bootstrapping the creation of ontological concepts. The remainder of this thesis is structured as follows: Chapter 2: In chapter 2, we provide general background information on tools and technologies together with the underlying considerations and approaches used by our research group. Together, these provide the context necessary to follow the various threads of investigation in this thesis. Specifically, in sections 2.2 and 2.3 we discuss the key concepts of the World Wide Web and its (proposed) successor, known as the “Semantic Web”. In section 2.4 we discuss fundamental concepts in formal knowledge representation pertinent to this work, such as background information about ontologies and their components, the Resource Description Framework (RDF) [9] and the Web Ontology Language (OWL). We present a review of existing biomedical ontologies and their role in knowledge management, data integration and classification, and decision support in section 2.5. In section 2.6 we present the existing methodologies proposed to overcome the well-known “knowledge acquisition bottleneck” in ontology construction and evaluation [10]; we specifically focus on formal concept learning, relevant to the second theme of this thesis. Finally, in section 2.7, we discuss two recently published pieces of technology that are frequently utilized in this thesis - Semantic Automated Discovery and Integration (SADI) [11] and the Semantic Health and Research Environment (SHARE) [12]. Chapter 3: In chapter 3, we propose an extensible framework to represent clinical measurement units that uses Web Ontology Language-Description Logics (OWL-DL) semantics. The motivation behind this investigation was that, in order to make any meaningful interpretation of numerical data, particularly those derived from automated data integration, the first step in the analytical pipeline is harmonization of the measurement units. Existing Semantic Web technologies provide neither a built-in method for representing measurement units [13] nor an automated way to convert between them. As such, in chapter 3 we define and demonstrate a generalized approach to expressing measurement-unit information, automatically detecting when an integrated dataset contains unit conflicts, and automatically resolving those conflicts. We discuss the current challenges to, and the various resources that can be applied to, the semantic representation of measurement units. Subsequently, we present the design architecture of our proposed solution for automatically detecting and resolving conflicts in measurement units. We then evaluate our framework using a legacy clinical dataset.
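To make the unit-conflict problem concrete, the following minimal Python sketch illustrates the kind of detection and harmonization step that must precede any comparison of aggregated values; the record layout, attribute names and conversion table here are invented for illustration and are not the OWL-based representation developed in chapter 3:

# Hypothetical aggregated records: the same attribute reported in different units.
records = [
    {"patient": "p1", "attribute": "height", "value": 172.0, "unit": "cm"},
    {"patient": "p2", "attribute": "height", "value": 68.0, "unit": "inch"},
]

# Assumed conversion table keyed by (source unit, target unit); 1 inch = 2.54 cm.
CONVERSIONS = {("inch", "cm"): lambda v: v * 2.54}

def harmonize(records, target_unit="cm"):
    """Detect a unit conflict and rewrite every value into the target unit."""
    if len({r["unit"] for r in records}) > 1:  # conflict detected
        for r in records:
            if r["unit"] != target_unit:
                r["value"] = round(CONVERSIONS[(r["unit"], target_unit)](r["value"]), 1)
                r["unit"] = target_unit
    return records

print(harmonize(records))  # both heights are now expressed in centimetres

In the framework itself, the equivalent information (the unit attached to each measurement, and the conversions between units) is expressed declaratively in OWL and applied by Web services rather than hard-coded as above.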
Chapter 4: In chapter 4, we address the problem of patient phenotypic classification and risk stratification in healthcare. We detail a disruptive problem within clinical research, namely, that the basis for a patient's phenotypic classification is usually not transparent within the dataset, making the analysis difficult (and at times, impossible) to reproduce. We present a migration path - both for the terminologies/classification systems and the data - that enables rich automated clinical classification using formal logical representation of well-established standards [14]. This is achieved by establishing a simple and flexible core data model, which is combined with a layered ontological framework utilizing both logical reasoning and analytical algorithms to iteratively "lift" clinical data through increasingly complex layers of interpretation and classification. This migration path is applied to data that have been homogenized and modeled using the approach described in chapter 3, with the goal of comparing automated analyses to those of clinical experts. The source data came from a case study of a cardiovascular cohort collected and clinically classified over two decades ago. Specifically, by extending OWL-DL classes in existing clinical ontologies (the GALEN ontology [15] was used for this experiment) and combining those OWL-DL classes with analytical Semantic Web services, we undertake to model two risk-assessment schemes that were used to annotate this legacy patient dataset: a binary risk score ("at risk", "not at risk") assigned to individual clinical observations such as blood pressure, and an overall cumulative risk score using the Framingham risk measurement [7]. The comparison between manual and automatic classifications outlines a number of discrepancies; such discrepancies are further used to refine the ontological models, finally arriving at ontologies that mirror the expert opinion of the individual clinical researcher [14]. We demonstrate that the combination of semantically-explicit data, logically rigorous OWL-DL models of clinical guidelines, and publicly-accessible Semantic Web Services can be used to execute automated, rigorous and reproducible clinical classifications with an accuracy approaching that of an expert. As a result, in chapter 4 we show that "personalized" ontologies may represent a re-usable and transparent approach to modeling individual clinical expertise, which can lead to more reproducible science. Finally, we outline discrepancies of a more complex nature where the experts' classifications could not be emulated; we identify the possible factors causing such discordances and propose a solution to ameliorate the problem, which leads to the research conducted in chapter 5.
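As a plain-language illustration of what one of these ontological "layers" encodes, the following Python sketch mimics a binary blood-pressure rule; the 140 mmHg cut-off is used here only as an example threshold, and the actual chapter 4 models are OWL-DL class definitions derived from published guideline tables and evaluated by a reasoner rather than procedural code:

# Hypothetical patient observations, assumed to be unit-harmonized already (chapter 3).
patients = {
    "p1": {"systolic_bp_mmHg": 152},
    "p2": {"systolic_bp_mmHg": 118},
}

EXAMPLE_CUTOFF = 140  # illustrative threshold only

def classify(observation):
    # Stand-in for an OWL-DL defined class along the lines of
    # "AtRiskPatient equivalentTo Patient and (hasSystolicBP some value >= cutoff)".
    return "at risk" if observation["systolic_bp_mmHg"] >= EXAMPLE_CUTOFF else "not at risk"

for patient_id, observation in patients.items():
    print(patient_id, classify(observation))

Because the DL version of such a rule is declarative, it remains inspectable, shareable and editable by the clinician, which is precisely what makes the resulting classifications transparent and reproducible.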
Chapter 5: In chapter 5, we focus on the role of evaluating prescribed medications with respect to improving automated clinical decision support. As revealed by the previous analysis, there were cases where it was not possible to reliably mimic expert classifications using the proposed ontological "layers" described in chapter 4. This suggested that other clinical observations – not accounted for by the technological implementation described in chapter 4 – had affected the expert's clinical risk classification. In consultation with the clinical researchers who had generated the dataset, we determined that the patients' prescribed medicines were the most likely and important factor not taken into account by the previous models. As a result, in chapter 5 we propose and demonstrate a simple framework for formalizing patient drug data and connecting them to public drug knowledgebases such as the National Drug File - Reference Terminology (NDF-RT) [16]. The proposed framework uses the extensive knowledge available in public knowledgebases to make automated and clinically relevant inferences about patients, based on their prescribed medications. As mentioned, the research in this chapter was motivated by the previous case study and hence the primary goal was to attempt to successfully automate the classification of raw patient clinical data into Framingham risk categories, using semantic models together with OWL-DL logical reasoning. This involves determining, based on observations of their prescriptions, patients' treatment statuses (e.g. "Treated for diabetes"), and using that information to more accurately model the risk schema. We evaluate how much improvement is gained in mimicking experts' classifications following the incorporation of drug information. As a secondary objective, we then use the proposed framework to address a perennial problem in healthcare systems, which is the accidental prescription of contraindicated medications to patients. We were able to demonstrate that, using the proposed framework, we can efficiently identify patients to whom potentially dangerous combinations of drugs have been prescribed. Chapter 6: In chapter 6, we discuss data-driven methodologies for automatic and semi-automatic ontology building and evaluation. These methodologies are developed to leverage machine-learning (ML) approaches for the creation, extension and evaluation of ontologies. We first provide the theoretical foundations of formal concept learning in OWL in the context of inductive logic programming (ILP) [17]. Subsequently, we present existing frameworks, namely Ontoloki [18] and DL-Learner [19], together with a description of their algorithms and their evaluation schema. Next, we discuss the conceptual framework of our design, focusing on its implications for clinical practice. We evaluate the framework on various clinical scenarios, including automatically identifying medications used in the treatment of cardiovascular disease. We demonstrate how legacy patient data can be mined to discover interesting rules that mirror the expert knowledge encoded in modern clinical ontologies (NDF-RT) [20]. As a secondary objective, we propose a generic data-driven methodology that dynamically discovers both generalized and personalized rankings for prescription medications. Finally, we address the primary end-point of this thesis, where we utilize the available methodologies for learning OWL-DL class expressions for clinical phenotypes, observing the effectiveness of these frameworks in aiding clinicians to automatically (or semi-automatically) build
Finally, we address the primary end-point of this thesis, where we utilize the available methodologies for learning OWL-DL class expressions for clinical phenotypes observing the effectiveness of these frameworks in aiding clinicians to automatically(or semi-automatically) build Introduction 8 and evaluate their personal clinical classification mental-models, now formalized as “personalized” OWL-DL ontologies. Chapter 7: In chapter 7 we summarize the findings and contributions, present conclusions and provide an outlook. The research contributions of this thesis are twofold: The first contribution is a framework to support personalization of clinical knowledge using formal ontologies. We note that to date, the existing clinical research-oriented ontologies are built and refined by collaborative groups of experts, which requires a considerable investment of resources. Such ontologies are created based on the assumption that there is a universal and static “reality” about clinical knowledge, and this reality is derived by the consensus of this small group of experts. However, given the complexity and diversity of clinical ontologies and constant changing of biomedical knowledge, it is unrealistic to expect that a single, universal ontological model can be identified. Thus, existing ontologies cannot accommodate the flexibility required by clinical science where knowledge is often incomplete, contextual and subjective. In this thesis we design and implement an ontological framework to enable the formal representation of the clinical knowledge in a transparent manner. Using ontologies for personalized representation of knowledge is a clear shift from the more widespread use of ontologies as the universal truth; it provides a platform in which heterogeneous clinical views can coexist and new interpretation about the data can be made. The ability to represent personalized ontological models can be utilized to support the unambiguous construction, sharing, comparison and modification of clinical knowledge. Encoding clinica l knowledge as personalized ontologies offer a number of significant advantages: the models are unambiguous, extensible (by third parties) and computationally testable , making clinical research tasks more accurate, transparent and reproducible. The second contribution of is a framework to facilitate the process of discovering expressions for definition of concepts in clinical ontologies can be modeled as solving supervised classification problems in machine learning. This was motivated by the observation that generally the existing clinical ontologies lack formal definitions of the clinical concepts significantly hindering their use for automated classification of clinical data. Thus, provided the abundance of legacy databases, methods for automated acquisition of class definition from data are required. In this thesis, we demonstrate that the process of generating formal expressions for concept definition in ontologies can be modeled as solving supervised classification problems in machine learning. Consequently we design and implement a system in which subgroups of existing machine learning techniques can be applied to Introduction 9 learning concept definition in Web Ontology Language (OWL) which is the de facto standard for representing ontologies. 
Finally, we evaluate our design on benchmark problems and on real-life, challenging clinical use-cases, demonstrating that concrete learning problems in clinical science can be solved using OWL as the knowledge representation formalism combined with current techniques in machine learning.

2 Background and state of the art

2.1 Bioinformatics and clinical informatics

The past few decades have witnessed an unprecedented explosion in the amount of biomedical and biological data generated, thanks to the development of high-throughput data collection methodologies. As a result, the need for efficient data coding, sharing and integration to process and interpret this massive amount of biomedical data has increased in parallel. This period of explosive growth in biological data has coincided with numerous paradigm shifts in computational technologies and information sciences that enabled efficient data processing. Bioinformatics emerged as a result of the application of powerful computational technologies to biology and medicine. Thus Bioinformatics is defined as a subfield of information science that focuses on the development of approaches for investigating biological information, with the ultimate goal of answering biological and/or biomedical problems. Genomics and molecular biology were the biomedical sub-disciplines that first embraced Bioinformatics and devised infrastructures to incorporate techniques developed in the information sciences into some of their most ambitious undertakings, of which the Human Genome Project (HGP) [13] is arguably the most well-known. Bioinformatics can itself be divided into different subcategories. One such subcategory is Clinical Informatics, which mainly focuses on the clinical aspects of Bioinformatics' application. In [6] a comprehensive definition of Clinical Informatics is provided as follows: "Clinical Informatics is the scientific discipline that seeks to enhance human health by implementing novel information technology, computer science and knowledge management methodologies to prevent disease, deliver more efficient and safer patient care, increase the effectiveness of translational research, and improve biomedical knowledge access." Many general approaches developed by information scientists, such as databases, knowledge representation, data mining algorithms and communication protocols, are applied in Clinical Informatics. The frameworks presented in this thesis were motivated by clinically-relevant problems in the context of Clinical Informatics and were evaluated using clinical and biological datasets. The majority of these frameworks are conceptually applicable to other domains. As such, the research presented in this thesis can be considered both as Clinical Informatics, based on the motivations and the nature of the experiments, and as Information Science, based on its relevance to other domains. The ultimate aim of the project is to design and develop methodologies that will enable more effective use of the increasingly massive and diverse amount of available biological and medical information.

2.2 Brief history of the Web

The World Wide Web (the Web, or WWW) is a system of interlinked hypertext documents that can be accessed through the Internet.
At its core, the WWW is composed of three main technologies: a universally unique naming system by which all documents can be uniquely identified; a syntax for presenting the content of documents in the form of hypertext; and a protocol by which these unique names can be resolved to the documents they represent. Hypertext is defined as "a body of written or pictorial material interconnected in such a complex way that it could not conveniently be presented or represented on paper" [14]. Such interconnections are defined by hyperlinks within the documents that make it possible, via the Web, to navigate to and from each document. The most crucial feature of the Web is its universality [15], which makes it possible for all users to act both as generators and consumers of information. That anyone is able to interact with the Web has made it the most ubiquitous medium for information transfer in the world; and, as we shall see, the strategies used in this thesis rely on this "democratic" aspect of the Web. The Web owes its existence to the advent and uptake of a global identifier system. In order for documents to be retrievable on the Web, every document must be associated with a unique address called a Uniform Resource Identifier (URI) [16], much as a street address is required in order to communicate with a given individual.

2.3 Semantic Web

The World Wide Web can be viewed as a massive information network, where webpages are the nodes of the network, and the links connecting those pages form an intertwined, gigantic network [17]. The documents on the World Wide Web are meant to be read, understood and retrieved exclusively by human users; the body of such documents is written in natural language. The main vision of the Semantic Web is to create machine-understandable (and processable) data analogous to the human-readable network of the World Wide Web. The "vision" of the Semantic Web is almost as old as the Web itself; however, it took several more years to develop the infrastructures (described below) necessary to realize this vision. Tim Berners-Lee talked about semantics on the Web during the first World Wide Web conference [18], where he mentioned the following: "To a computer, then, the Web is a flat, boring world devoid of meaning. This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them. …. Adding semantics to the Web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values. Only when we have this extra level of semantics will we be able to use computer power to help us exploit the information to a greater extent than our own reading." The Semantic Web is built upon a conceptual model called the Semantic Network Model, proposed by Allan M. Collins, which is a form of structural knowledge representation [19]. Using semantic network models, the Semantic Web extends the graph of human-readable Web pages and hyperlinks by adding machine-readable information (and metadata, i.e., data providing information about one or more aspects of the data) about the information-content of Web pages, including how these information-blocks are related to one another. Machines thus become able to understand, access and process the information within the Web much more efficiently [20]. In order for machines to fulfill these tasks, it is required that the relevant information sources be semantically structured.
To achieve this goal, the standards and technologies shown in Figure 2.1 were adopted by the World Wide Web Consortium (W3C) [3]. Together, these standards and technologies enable the effective and formal representation of knowledge [3].

Figure 2.1 Semantic Web Layer Cake [18]

In the next section we will discuss formal knowledge representation and its key components. For a more in-depth description of all the components shown in Figure 2.1, the interested reader is referred to [3], [18], [21].

2.4 Formal data, information and knowledge representation

In the context of this thesis, formal knowledge representation is defined as the process of encoding existing knowledge into statements that are comprehensible by machines [22]. Recently, standards have emerged to carry out the process of formal data, information and knowledge representation. Unfortunately, these terms are used in an inconsistent and often ambiguous manner in different research areas [23]. The unclear distinction between data, information, and knowledge has historically caused numerous problems for integrative research [23]. Thus, we first define the terms data, information, and knowledge in the context of the Semantic Web. The definitions are adapted from [23].

• Data are defined as patterns with no specific meaning; they are the first entry to an interpretation process.
• Information is defined as data with meaning understandable by a human (or machine); it can be the output from data interpretation as well as the input to, and output from, the interpretation process.
• Knowledge is information incorporated in an agent's (human or machine) reasoning resources, which makes intelligent decisions/actions possible; it may be the output of a learning process.

In the context of the Semantic Web, the smallest unit and the most commonly accepted schema for data representation is called a triple (or statement; in the context of the Semantic Web, "statement" and "triple" can be considered to represent the same entities, though this is not universally true) [12]. A triple can be defined as a set containing a subject, a predicate and an object. Axioms are statements that are asserted to be true in the domain being described [24]. Axioms serve as a starting point for deducing and inferring other statements [25]. (The Oxford dictionary defines an axiom as "a proposition that commends itself to general acceptance; a well-established or universally conceded principle; a maxim, rule, law"; an axiom's "truth" is therefore taken for granted, without mathematical proof, within the particular domain of analysis.) The essential task of knowledge representation is to specifically and unambiguously express knowledge as axioms (relationships) between the triples.

2.4.1 Resource Description Framework (RDF)

The Resource Description Framework (RDF) [9] is the data model for the Semantic Web. Under the RDF model, a dataset is encoded as a set of triples, where each triple represents an atomic statement. Each triple consists of a subject, a predicate, and an object, and each of these three parts is identified by a URI [26]. The key advantage of RDF triples over HTML hyperlinks is that the links are explicitly labeled: the intent (semantics) of the relationship between the two entities is thus computationally accessible through URI resolution [27]. In the context of the Semantic Web, all subjects, objects and predicates are individual URIs [15]. For example, the statement "Soroush hasSystolicBloodPressure High" can be decomposed into different URIs for "Soroush" (subject), "hasSystolicBloodPressure" (predicate) and "High" (object) and placed in a triple store. Figure 2.2 shows the RDF description of a resource, "Dr Eric Miller", who is described as a person with the specified name, email address, and title.
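To make the triple model concrete, the following minimal sketch uses the Python rdflib library to build the blood-pressure statement above and print it in Turtle syntax; the http://example.org/ namespace and the specific resource names are hypothetical placeholders introduced purely for illustration, not part of any published vocabulary.

    from rdflib import Graph, Namespace

    # Hypothetical namespace, used only for this illustration
    EX = Namespace("http://example.org/clinical#")

    g = Graph()
    # A single triple: (subject, predicate, object), each part identified by a URI
    g.add((EX.Soroush, EX.hasSystolicBloodPressure, EX.High))

    # Serialize the one-triple graph in Turtle, a common textual syntax for RDF
    print(g.serialize(format="turtle"))

Each additional observation about the same subject simply becomes another triple added to the same graph, which is what allows RDF datasets to grow and merge without a fixed schema.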
As mentioned previously, the resources are represented by URIs, and thus can be identified globally (resources that need no global identifier can instead be represented as blank nodes: RDF nodes that do not themselves contain any data, but serve as parent nodes to a grouping of data [259]). The graph represented in Figure 2.2 can equivalently be written in RDF/XML syntax, which, together with N3 (and Turtle) [28], is among the formats most commonly used for the textual representation of RDF graphs.

Figure 2.2 The RDF graph representing "Dr. Eric Miller" (taken from [29])

A set of such triples can be merged to create an RDF dataset. Figure 2.3 shows how two RDF data stores, one describing metabolic pathways (green box), and the other mapping two of the proteins to their corresponding terms in the Gene Ontology, can be merged to form a triple store. As shown in this example, the use of globally unique identifiers in RDF provides a straightforward mechanism for matching entities across datasets, which can facilitate data integration and merging. In addition to simplifying the merging process, RDF triple stores can provide machines with some degree of automatic reasoning, since the predicates are explicitly specified and labeled in the graph(s).

Figure 2.3 Merging two RDF datasets. The green dashed box on the left shows four different proteins (A, B, C, and D) participating in a common pathway, while the blue box maps two of those four proteins to their corresponding GO terms (adapted from [18]).

2.4.2 SPARQL query language

SPARQL has become the standard query language for retrieving data from RDF triple stores. It is similar to the SQL language for querying relational databases. SPARQL contains capabilities for querying RDF datasets [30]. The results of SPARQL queries can be result sets or RDF graphs. A triple pattern is a triple in which any part may be a variable (denoted by "?" in the SPARQL query). A set of triple patterns is a basic graph pattern [31]. A basic graph pattern matches a subgraph of the RDF knowledgebase when RDF terms from the subgraph can be substituted for the variables in the query [31]. The following SPARQL query selects all persons with the full name "Eric Miller":

    PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
    SELECT ?person
    WHERE {
      ?person rdf:type contact:Person .
      ?person contact:fullName "Eric Miller" .
    }

The SELECT clause identifies the variables to occur in the results. In this case, ?person can be replaced by the green oval shown in the center of Figure 2.2 (http://www.w3.org/People/EM/contact#me) to obtain a match [31]. SPARQL alone does not provide reasoning power beyond the simple unique identification of resources; however, recent tools have enhanced SPARQL with the reasoning capabilities of Description Logics (DL), a field of research that studies logics (discussed later). Appendix A provides some additional information about SPARQL and related technologies (e.g. SPARQL endpoints) that may be helpful in better understanding the framework.
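As a minimal illustration of how such a query can be executed programmatically, the sketch below (Python with rdflib) loads a two-triple graph mirroring part of Figure 2.2 and runs the "Eric Miller" query against it; the data are inlined here rather than retrieved from the Web, and the namespaces are simply those assumed in the query above.

    from rdflib import Graph

    # A tiny inlined dataset mirroring part of the Figure 2.2 example
    data = """
    @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .

    <http://www.w3.org/People/EM/contact#me>
        rdf:type contact:Person ;
        contact:fullName "Eric Miller" .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    query = """
    PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
    SELECT ?person
    WHERE {
      ?person rdf:type contact:Person .
      ?person contact:fullName "Eric Miller" .
    }
    """

    for row in g.query(query):
        print(row.person)   # expected: http://www.w3.org/People/EM/contact#me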
2.4.3 RDF adaptation in Bioinformatics

At the time of writing this thesis, RDF (and consequently SPARQL) is not widely adopted for publishing data in biomedical informatics; however, some projects have undertaken initiatives to provide publicly accessible, third-party translations of existing bioinformatics resources in RDF form. Three major efforts in this area are UniProt [32], Bio2RDF [33] and Linked Open Drug Data (LODD) [34]. UniProt unites protein databases such as Swiss-Prot [35] and TrEMBL [35], and is represented entirely in RDF. Bio2RDF aims to provide RDF versions of popular protein and gene databases such as KEGG [36] and PDB [37]. LODD provides RDF versions of drug-related resources such as DrugBank [38], DailyMed [39] and SIDER [40]. There are a few possible reasons for the slow adoption of the RDF–SPARQL framework in the life sciences, such as minimal tool support for early adopters, ad hoc and partial adoption of standards within the community [27], and the intrinsic complexity of the RDF data model and RDF/XML serialization [41]. As mentioned by [41]: "…Clearly, there is a great deal of work to be done in establishing RDF as a core technology that adds value to the widely adopted XML syntax." The interested reader is referred to [28], [40], and [51] for a comprehensive study of the state of the art and the existing challenges for the adoption of RDF (and other Semantic Web standards) in health care and the life sciences. RDF (and SPARQL) is the necessary core technology underpinning the Semantic Web and has paved the way for Semantic Web applications; however, the Semantic Web owes much of its extended capabilities to the adaptation of the concept of "ontologies" from philosophy and the advent of the Web Ontology Language (OWL) [2] for representing these to machines. We will first present an overview of ontologies and their core components, and subsequently introduce OWL in section 2.4.6.

2.4.4 Ontology

The term "ontology" in computer science is borrowed from philosophy [43]. In the context of the Semantic Web and computer science, the term is used in a rather narrower sense than in philosophy. For the purpose of this thesis, "ontology" may be defined as follows: "An ontology is a formal explicit description of concepts in a domain of interest, properties of each concept describing various features and attributes of the concept, and restrictions on those features and attributes" [44]. Ontologies are often envisioned as "an explicit specification of a conceptualization" [43]. A conceptualization is defined as an abstract representation of the entities and their relationships to one another in a particular domain. Most computing technologies aim to make this "explicit" aspect of ontologies both human- and machine-readable. Making such conceptualizations machine-readable offers a number of advantages, some of which are discussed below.

2.4.4.1 Encode knowledge and enable reasoning

Since the early days of computers, professionals anticipated (rather optimistically) the time when computers would facilitate the decision-making process [45]; the health domains were among the early adopters of reasoning over encoded knowledge. For instance, rule-based systems are an example of heuristic-based approaches (heuristics are strategies using readily accessible, though loosely applicable, information to control problem solving in human beings and machines [260]), in which individual logical rules (for example, in the form "if condition X then diagnosis Y") are established [46].
The most prominent of such rule-based expert systems was MYCIN, intended to select appropriate antibacterial therapy for a patient [47], [48]. The work on MYCIN led to extensions, including EMYCIN [49], GUIDON [50] and TEIRESIAS [51]. Refer to [45–47] for a more detailed historical perspective on the evolution of formal knowledge encoding in medicine.

2.4.4.2 Integrate and share data (and knowledge) among people or software agents

As mentioned, the idea of automating reasoning with formalized knowledge is not novel; however, the vision of using the encoded knowledge (ontologies) to share and exchange knowledge between different people (and possibly machines) is relatively new and has sparked considerable interest in ontology development. At the same time, the emergence of the Web has resulted in an increased need to systematically share escalating amounts of data, a major driver for the incorporation of ontologies in computer health science [15]. Even though the World Wide Web infrastructure allows humans to share their knowledge (in a human-understandable format), without ontologies there is no effective mechanism to share knowledge across software agents and applications. Data integration may be defined as aggregating disparate data that describe the same entity (e.g., gene, disease, and so on) from independent resources. Data integration is crucial to knowledge discovery because it allows different information about the same entity to be related (possibly) in new ways [52]. There are several advantages to using ontologies for data integration: they provide a rich, predefined terminology that serves as a conceptual interface to the databases and is independent of the database schemas; they also support consistent management and the recognition of inconsistent data [52]. For example, suppose different websites contain medical information about cardiovascular diseases. If these sources share and publish the same underlying ontology of terms that they collectively use, a computer agent will be able to extract, aggregate and integrate information from these websites [53]. Subsequently, machines can use the integrated knowledge to answer questions (queries) that they would not otherwise be able to answer. In section 2.5, we will discuss in more detail the roles ontologies play and the advantages they offer, focusing on the existing ontologies in biomedical sciences.

2.4.4.3 Reuse existing domain knowledge

In addition to providing a platform to share a common understanding of the structure of information, ontologies facilitate the reuse of domain knowledge. For instance, many domains of science use temporal information. This representation may consist of time intervals, units of time and so on. In such a case, if one sufficiently detailed "Time" ontology exists, other groups can simply use that ontology (or portions of it) and/or tailor it to their needs [54]. By re-using ontologies, it becomes easier to integrate data, since one gains the assurance that the "words" being used in one dataset are synonymous with those used by others. Sharing and re-use of the domain knowledge behind these datasets creates an extremely powerful platform within which to do integrative research. (Additional harmonization can be achieved by using shared "upper ontologies" – ontologies that describe core aspects of knowledge shared between different domains, such as the GFO [261]; these can be used to "ground" new ontologies such that machines can understand the core nature of the data.)
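To make the preceding points about shared identifiers and shared vocabularies concrete, the following minimal sketch (Python with rdflib) merges two toy RDF sources that happen to describe the same protein; because both use the same URI for that protein, simply loading them into one graph links the pathway statement from one source with the Gene Ontology annotation from the other. All namespaces, resource names and the GO identifier below are illustrative placeholders rather than the real Bio2RDF or GO URIs.

    from rdflib import Graph

    # Source 1 (hypothetical): pathway membership
    pathway_data = """
    @prefix ex: <http://example.org/bio#> .
    ex:ProteinA ex:participatesIn ex:PathwayX .
    """

    # Source 2 (hypothetical): a Gene Ontology annotation for the same URI
    go_data = """
    @prefix ex: <http://example.org/bio#> .
    ex:ProteinA ex:hasGOAnnotation ex:GO_0006468 .
    """

    merged = Graph()
    merged.parse(data=pathway_data, format="turtle")
    merged.parse(data=go_data, format="turtle")

    # Both facts now hang off the same node and can be queried together
    for s, p, o in merged:
        print(s, p, o)

Because the two sources agreed on a single identifier (and, implicitly, a shared vocabulary of predicates), no mapping or schema-alignment step was needed; this is the essence of the integration argument made above.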
This feature of domain ontologies is extensively used in chapters 3 and 4, where we extend existing ontologies to automatically classify clinical data.

2.4.5 Knowledgebase

An ontology together with a set of individual instances of classes constitutes a knowledgebase. For example, consider an ontological class called HypertensivePatient. The class and its definition belong to the original ontology (also termed the T-Box). When we begin filling the class with instances fulfilling the membership criteria, we effectively create the knowledgebase (also termed the A-Box). In practice, it is often difficult to dissociate an ontology from its knowledgebase, and the distinction is often somewhat artificial [53].

2.4.6 Ontology components

The main components of an ontology are concepts (classes), relations, instances and axioms. A concept represents a set or class of entities or "things" within a domain. For example, "Blood Vessel" is a concept within the domain of cardiology. Concepts are the main focus of most ontologies. Concepts can broadly be divided into two categories:

• Primitive concepts are those which only have necessary conditions (in terms of their class-defining properties) for membership of the class [55]. For example, we may state that it is necessary that a Patient be a Person; obviously, the inverse relationship does not hold (being a Person is not sufficient to make something a Patient).
• Defined concepts are those whose description is both necessary and sufficient for a thing to be a member of the class. For example, the Generalized Architecture for Language Encyclopedias and Nomenclatures (GALEN) [56] defines Mild Hypertension as follows: MildHypertension = Hypertension AND hasSeverity MildSeverity. Based on this definition, any entity that fulfills the right-hand side of the equation is inferred to be a member of the class MildHypertension, and vice versa.

Broadly, there are two main types of relations:

• Taxonomies, which organize concepts into sub-/super-concept hierarchical structures. Common examples of such relationships include:
- Specialization relationships, commonly known as the "is a" or "is a kind of" relationship. For example, MildHypertension is a kind of Hypertension.
- Partitive relationships, which describe concepts that are part of other concepts. For instance, Heart hasComponent Heart Valve.
• Associative relationships, which relate concepts across tree structures. Examples include:
- Nominative relationships, which describe the names of concepts, e.g. Disease hasName CongestiveHeartFailure.
- Locative relationships, which describe the location of one concept with respect to another [55], e.g. Heart Valve hasPhysicalLocation Heart.
- Associative relationships that represent properties that can be attributed to a concept [55], e.g. Congestive Heart Failure playsPathologicalRole Disease.

Relationships can often also be hierarchical. For example, the relation hasName may be subdivided into hasDiseaseName and hasPatientName. Relations also have features (such as transitivity and cardinality) that capture further knowledge about the relationships between concepts (see Appendix A for some common examples).
• Instances are the "things" represented by a concept; Soroush's BloodPressureRecord is an instance of the concept BloodPressureRecord. However, the decision about whether to model something as an instance or a concept is highly contextual, and depends on the purpose of the knowledgebase [55]. For example, in one ontology, UnitOfPressure might be a concept and "millimeterOfMercuryColumn" might be an instance of that concept. In another ontology, however, millimeterOfMercuryColumn might be a sub-concept of UnitOfPressure, and individual blood-pressure measurements from patients would be the instances. This is a well-known and open question in knowledge management research, and further discussion of this issue will only happen tangentially in this thesis – in the experiments here, the decision about the granularity at which concept and instance are separated was made purely pragmatically.

2.4.7 The Web Ontology Language (OWL)

Different ontology languages exist. The ontology language approved by the World Wide Web Consortium (W3C) for use on the Web is called the Web Ontology Language (OWL). It is designed for use by applications that need to automatically process the information content of Web-based data, beyond just presenting the information to humans. OWL facilitates greater machine interoperability (of Web content) than is supported by previous standards (e.g. XML). First, it provides the additional vocabulary and structural rules for encoding RDF datasets, in a way analogous to schemas for relational databases; secondly, it is extensible, allowing the formal creation of new rules and schema structures relevant to a particular discipline, which can then be used to encode the knowledge about a specific domain in a machine-readable form such that it can be used to make inferences over RDF data. OWL has three sub-languages which differ in their logical expressivity: OWL Lite, OWL DL, and OWL Full, in order of increasing complexity and expressivity [1]. OWL-DL is so called because of its connection with Description Logics (DLs), which provide the formal basis of OWL [1]. OWL-DL provides a rich formalism for capturing knowledge about a domain, together with a classifier providing reasoning capabilities [1]. An OWL ontology represents a set of axioms in a description logic (DL) [57], and thus, in order to understand how OWL ontologies achieve the tasks mentioned above, it is necessary to understand the basic principles of DL.

• Description Logic

DL is the name of a family of knowledge representation (KR) formalisms. They emerged from earlier KR formalisms like semantic networks and frames [31]. One of the characteristic features of DLs, different from predecessor logics, is that DLs are equipped with a formal, logic-based semantics. Another distinguishing attribute is that DLs support "reasoning" (the extraction of implicit knowledge from the explicit knowledge represented in the knowledgebase), enabling tasks such as the classification of concepts and individuals that occur in many applications of information processing systems [58]. Table 2.1 shows the basic syntax rules for DL for hypothetical concepts C and D and a role R.
Table 2.1 Basic syntax rules for DL

Symbol | Description | Example | Read
⊤ | all concept names | ⊤ | top
⊥ | empty concept | ⊥ | bottom
⊓ | intersection or conjunction of concepts | C ⊓ D | C and D
⊔ | union or disjunction of concepts | C ⊔ D | C or D
¬ | negation or complement of concepts | ¬C | not C
∀ | universal restriction | ∀R.C | all R-successors are in C
∃ | existential restriction | ∃R.C | an R-successor exists in C
⊑ | concept inclusion | C ⊑ D | all instances of C are D
≡ | concept equivalence | C ≡ D | C is equivalent to D

For example, suppose we define the concept "patient" as a "person who has a disease". According to Table 2.1, the DL syntax for this statement is as follows:

Patient ≡ Person ⊓ ∃hasDisease.⊤

With respect to individual members of a class, the members may be specified explicitly (e.g. Soroush is-a Patient); however, classes may also be defined more generally in terms of axioms. Such axioms describe either necessary, or necessary and sufficient, conditions regarding individuals that are members of a class. For example, a "Heart Patient" can be defined as follows:

HeartPatient ≡ Patient ⊓ ∃hasDisease.HD,   HD ⊑ D

where HD denotes heart disease, which is itself asserted to be a subclass of disease (D). The expression ∃hasDisease.HD above is also called a property restriction (in this example, an "existential" property restriction; see Table 2.1). There exist other types of property restrictions, such as universal restrictions, asserting that all values of a property must belong to a certain class, and cardinality restrictions, asserting that an instance must have a certain number of distinct values for a property [59]. The use of axioms to define classes and properties makes it possible to automatically detect contradictions within an ontology, which is called consistency checking [59]. For instance, in the previous example, if we have an instance p that is declared to be an instance of "Heart Patient" but is not linked to an instance of Heart Disease by the existential property hasDisease, an automatic inference system (known as an OWL reasoner; see below) can detect this inconsistency. In addition, it is possible to test whether a given RDF dataset complies with the axioms in an ontology, and/or to deduce facts not explicitly asserted in the RDF data by inferring new facts based on axiomatic analysis of the facts expressed in the dataset. Besides consistency checking, OWL-DL enables other logical operations such as the classification of instances into various logically-valid classes, subsumption (the automated creation of a hierarchy of classes based on their axioms), and so on [55]. Some examples of these logical operations are [59]:

• Are all instances of class X also instances of class Y? (Subsumption)
• Does individual x belong to class X? (Realization)
• Is it possible for any individual to be a member of class X? (Satisfiability)
• Are there any logical contradictions in the ontology? (Consistency)

Finally, the combinatorial nature of OWL-DL allows new concepts to be created from the combination of existing ones, and allows these to be automatically placed in their proper location within an ontological network or hierarchy.
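To give a concrete feel for how such defined classes and reasoning tasks look in practice, the following minimal sketch uses the Python owlready2 library; it is offered purely as an illustration (the ontology IRI, class and property names are hypothetical, and this is not the tooling, nor the GALEN-based ontology, used in the thesis itself). The two equivalent-class axioms mirror the Patient and HeartPatient definitions above; invoking a DL reasoner then performs realization, re-classifying the individual below because it satisfies the existential restriction, and the same call also checks the ontology for inconsistencies.

    from owlready2 import Thing, ObjectProperty, get_ontology, sync_reasoner

    onto = get_ontology("http://example.org/heart-demo.owl")  # hypothetical IRI

    with onto:
        class Disease(Thing): pass
        class HeartDisease(Disease): pass          # HeartDisease ⊑ Disease
        class Person(Thing): pass

        class hasDisease(ObjectProperty):
            domain = [Person]
            range  = [Disease]

        # Defined (necessary-and-sufficient) classes, mirroring the DL axioms above
        class Patient(Thing):
            equivalent_to = [Person & hasDisease.some(Thing)]          # Person ⊓ ∃hasDisease.⊤
        class HeartPatient(Thing):
            equivalent_to = [Person & hasDisease.some(HeartDisease)]   # Person ⊓ ∃hasDisease.HD

        # An individual asserted only as a Person linked to one heart disease
        mi = HeartDisease("myocardial_infarction_case")
        p  = Person("patient_001", hasDisease=[mi])

    # Realization via an external DL reasoner (owlready2 calls HermiT, so a Java
    # runtime is required); the same call also performs consistency checking.
    sync_reasoner()
    print(p.is_a)   # expected to now list HeartPatient

The same mechanism is what later chapters rely on when raw clinical observations, expressed as RDF instances, are automatically classified under expert-defined phenotype classes.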
The reasoning services provided by OWL reasoners (software applications that can make logical inferences based on OWL axioms) have many applications, which are particularly well suited for classification systems in the biomedical domain (section 2.5). The motivations discussed, together with OWL-DL's computational features (Appendix A) such as completeness (it is possible, and guaranteed, through logical reasoning, to find all solutions to a question, if such solutions exist), have made it the accepted standard representation language of the W3C [3]. Thus, to be standards-compliant, we will exclusively use OWL-DL for the frameworks developed in this thesis.

2.4.8 Closed world assumption vs. open world assumption

In formal logic, the Closed World Assumption (CWA) is the assumption that what is not known to be true must be false. The Open World Assumption (OWA) is the opposite; in other words, it is the assumption that what is not known to be true is simply unknown (which, in practice, takes on the meaning "might be true" in the context of logical reasoning) (reviewed by [57]). The OWA applies when we represent knowledge within a system as we discover it, but cannot guarantee that we have discovered, or will discover, complete information. This is normally the case with information on the Web; thus it is logical that Semantic Web languages such as OWL make the open world assumption. On the other hand, many database languages make the closed world assumption. For instance, if a database of criminals does not contain a person's name, it means that the person's criminal record is clear. The down-side of databases, particularly within clinical research, then, is that large numbers of records in this domain contain unknowns or are incomplete. It has been shown that, despite the use of the OWA by Semantic Web technologies, in some cases (e.g. when using machine learning techniques to learn definitions of concepts [60]) the CWA is preferable [61] and offers additional computational and reasoning benefits. Thus, some Semantic Web tools (such as the one used in chapter 6 of this thesis) implement the CWA in their frameworks. (A small sketch contrasting the two assumptions is given below, following Figure 2.4.)

2.5 Bio-ontologies

The last two decades have witnessed an upsurge in the number of computational and human-readable artifacts aimed at representing biomedical knowledge. A few common examples of such artifacts are biomedical thesauri, controlled vocabularies, taxonomies and ontologies. Though attempts have been made to precisely delineate the distinctions between such artifacts (for example, see [62]), in practice the term "ontology" is used inconsistently to refer to any of these artifacts. As a result, in reality, the term "ontology" is used to describe a variety of different entities that are collectively called ontologies. Figure 2.4 shows the "Ontology Spectrum", and includes some common examples of each point (in red) in the spectrum. Appendix A provides a thorough definition of the artifacts shown in the figure.

Figure 2.4 Ontology spectrum: there are a variety of artifacts all collectively called ontologies (adapted from [63]). The artifacts (defined in the text) are shown in black and increase in complexity from left to right. Examples of existing bio-ontologies compatible with these artifacts are shown in red.
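Returning briefly to the open- versus closed-world distinction of section 2.4.8, the following minimal sketch (again Python with owlready2, with hypothetical names, offered only as an illustration of the idea rather than of the tools used in this thesis) shows why the difference matters for classification. Under the OWA, the absence of an asserted disease does not allow the reasoner to conclude that an individual has no disease; only after the world is explicitly "closed" is the missing information treated as false.

    from owlready2 import (Thing, ObjectProperty, Not, get_ontology,
                           sync_reasoner, close_world)

    onto = get_ontology("http://example.org/owa-demo.owl")  # hypothetical IRI

    with onto:
        class Disease(Thing): pass
        class Person(Thing): pass
        class hasDisease(ObjectProperty):
            domain = [Person]
            range  = [Disease]

        # "Healthy person" = person with no disease at all
        class HealthyPerson(Thing):
            equivalent_to = [Person & Not(hasDisease.some(Disease))]

        alice = Person("alice")   # note: no hasDisease assertion at all

    # Open world: "no known disease" is not the same as "no disease",
    # so alice is not classified as a HealthyPerson.
    sync_reasoner()
    print(HealthyPerson in alice.is_a)   # expected: False

    # Closing the world adds axioms stating that the asserted facts are complete;
    # only then is the absence of a disease treated as a fact.
    close_world(onto)
    sync_reasoner()
    print(HealthyPerson in alice.is_a)   # expected: True

The exact behaviour of close_world depends on the library version, but the principle is what matters here: a closed-world treatment of instance data is what the concept-learning tool used in chapter 6 relies on when it counts an example as not belonging to a class.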
The existing literature has addressed the commonly used biomedical ontologies from different aspects, including their design and structure [64–66], their usability in specific applications such as drug discovery (e.g. [67], [68]), and a functional perspective – addressing how ontologies can assist biomedical researchers to find and interpret pertinent information [69]. Here we present a quick review of highly influential biomedical ontologies from a functional perspective; we create three axes of interest relevant to this thesis – knowledge management, clinical decision support and biomedical data classification – and then describe the functional role of these ontologies on each axis. The ontologies we include span much of the ontology spectrum to the right of simple terminologies, and include the Gene Ontology (GO) [70], the National Cancer Institute ontology (NCI) [71], the Foundational Model of Anatomy (FMA) [72], SNOMED-CT [73], the Generalized Architecture for Language Encyclopedias and Nomenclatures (GALEN) [56], RxNorm [74] (including the National Drug File - Reference Terminology, NDF-RT [75]), Medical Subject Headings (MeSH) [76], the International Classification of Diseases (ICD-10) [77], Logical Observation Identifiers Names and Codes (LOINC) [78], and the Unified Medical Language System (UMLS) Semantic Network [79], in which the above ontologies are integrated (with the exception of GALEN and FMA; we will exploit GALEN extensively as the base ontology for the experiments in chapters 3 and 4, where a detailed explanation of GALEN and its characteristics is presented).

2.5.1 Knowledge management: data integration, indexing, annotation, retrieval

Generally, existing high-impact bio-ontologies facilitate the integration of datasets by providing a common reference for the biomedical entities in several datasets [80]. For instance, the ontologies mentioned above share certain features of data and knowledge management in that they use unique, permanent, accessioned identifiers for major biomedical entities to ensure cross-reference validity, and they use controlled vocabularies to enable knowledge management and information retrieval by facilitating tasks such as indexing, coding, annotation, text mining and mapping across resources. Indexing is used to assign entities from controlled vocabularies to documents. Indexing large volumes of documents significantly facilitates effective access to, and accurate retrieval of, biomedical information ("accurate" in this context refers to high precision and recall). Hierarchical controlled vocabularies such as MeSH, UMLS and NCI have been extensively used for indexing the biomedical literature [20]. The indexing of clinical documents is usually referred to as coding [81]. One prominent example of such coding systems is ICD-10, which has long been used for coding different diseases. SNOMED-CT is another well-known example of a coding system. SNOMED-CT provides a comprehensive ontology for health care systems and has recently been adopted by some countries as a standard terminology for electronic medical records (EMR) [82]. To date, the majority of indexing and coding processes are done manually; however, automatic machine learning (ML) techniques have recently been developed (with varying degrees of success) to automate the indexing task (e.g. [83]). Besides enabling document indexing (and coding), bio-ontologies assist with the task of annotating more fine-grained biomedical data. Annotation can be defined as "the act of or the product of associating metadata with a particular resource" [15].
For instance, gene annotation can be used to identify the locations of genes and of all the coding regions in a genome, and to determine the functions of those genes. Many of the mentioned bio-ontologies have been used to annotate biological data. For example, MeSH has been used to annotate publicly available gene expression data in order to generate a gene–disease network [84], and NCI and SNOMED have been used to annotate and query a public microarray repository [85]. In addition, ontologies provide support for text mining in biomedicine through a variety of Natural Language Processing (NLP) techniques. The importance of text mining approaches for biology and medicine stems from the fact that published manuscripts are the main source of biological knowledge, and manual processing of this volume of literature is extremely difficult, if not impossible (e.g. PubMed alone contains between 17 and 19 million abstracts [86]). Finally, bio-ontologies have been extensively utilized for mapping biomedical text to ontological entities and for discovering the ontological concepts referred to in text. Some examples of features provided by existing text mining systems include the detection of author-defined acronyms/abbreviations, the ability to browse the ontologies for concepts (even loosely) related to the input text, the detection of negation in a phrase, word sense disambiguation (WSD), and so on. Some examples include [87], which identifies concepts from GO in a text corpus; MetaMap [88], which provides access to the concepts in the UMLS Metathesaurus from biomedical text; and Whatizit [89], which is a service-oriented text mining framework that incorporates several resources including DrugBank [90], UniProtKb [91], ChEBI [92], GO and UMLS.

2.5.2 Clinical decision support

In the context of information technology, computer-based clinical decision support (CDS), which is the major focus of chapter 5 of this thesis, can be defined as follows: "use of the information and communication technology to bring relevant knowledge to bear on the health care and wellbeing of patients" [46]. CDS assists and influences biomedical experts with decision-making tasks such as determining a patient's diagnosis. Some potential benefits of integrating computer-based decision support systems (CDSS) in biomedical practice include the ability to [93]:

• combine related pieces of information
• facilitate access to pertinent information and better accessibility of data
• generate patient-specific and context-specific alerts and reminders
• identify patterns in patient data to focus experts' attention

The application of computers as decision support tools is relatively old; however, the application of complex ontologies, with large networks of associative relations among the entities, to support the interpretation of data using a myriad of different techniques is an active area of research. CDSS benefit from bio-ontologies in several ways. First, ontologies can provide a standard terminology for biomedical concepts, which can facilitate data integration [94]. In addition to standardization and coding, ontologies are equipped with reasoning capabilities that can be of further benefit to clinical decision support systems (see Appendix A).
With these studies as a foundation, chapter 5 describes how we utilize both coding and reasoning capabilities, together with automated data integration and tool-interoperability frameworks, to design and implement a novel ontology-based CDSS that uses patient drug data to address two clinically important use cases: 1. the automatic determination of adverse drug reactions, and 2. cardiovascular risk stratification and patient phenotype classification.

2.5.3 Biomedical data classification

Ontological classification, or categorization, is the organization of a set of entities into groups, based on their essences and possible relations [95]. When talking about classification systems, we should bear in mind that there is usually not a single "perfect" system, as summarized by Clay Shirky: "In working classification systems, success is not 'Did we get the ideal arrangement?' but rather 'How close did we come, and on what measures?' [95]" Classification is a field of study with a long history rooted in ancient philosophy, with Aristotle being (arguably) the originator of formal classification systems (see [96] and [97] for a review and discussion of the historical roots of classification). In the modern era, the classification of entities is almost ubiquitous in biomedicine, and plays a pivotal role in biology and the clinical sciences. Bio-ontologies have been used for many diverse classification purposes, such as the categorization of biomedical documents, the functional categorization of genes, the detection of related topics, and so on. In most cases, these classification activities are accomplished manually. In the context of this thesis, we use the term classification to refer to phenotypic classification in the clinical sense (for example, categories of patients being treated for a given disorder based on the types of drugs they have been prescribed, or phenotypic classifications such as risk stratification based on clinical observations) and, moreover, we are speaking only of automated classification; that is, where machines compare a dataset with a classification system and automatically determine which data fit into which category. Accomplishing this requires that we not only declare that a category exists, but that we also define (in a machine-readable way) what it means to be a member of that category. Figure 2.5 shows the ontology spectrum again (Figure 2.4), overlaid by a red line distinguishing between the existing ontologies in terms of their capability for data classification [63]. To date, the majority of ontologies in the life sciences (such as MeSH, UMLS, and GO) are located on the left side of the red line drawn in Figure 2.5 and are built as class hierarchies lacking formal definitions for their classes. Such ontologies assist experts (as explained) in the organization of data, by providing a set of standardized terminologies, and in the analysis of data, by providing a shared terminology that can be utilized by both computers and humans; however, though they can assist in the interpretation of data once the classification has been done by experts, they offer little help in automating the classification process itself, i.e. the lack of formal (machine-understandable) axiom-based definitions of the classes in the ontology hinders the automatic classification of data.
Figure 2.5 Ontology spectrum division: among the artifacts all called ontologies, in this thesis we focus on the ones shown on the right side of the red line, since they provide unambiguous and axiomatic definitions of classes and can be used more efficiently as a basis for inference [63].

Ontologies located on the right side of Figure 2.5, on the other hand, require the delineation of precise and machine-readable definitions for each class in the ontology. If ontologies contain information about properties and value restrictions on those properties, then the following features can be realized: 1) automatic consistency checking, and 2) the automatic classification of instances. Two examples of such ontologies (on the right side of the spectrum) are as follows:

Phosphabase: Phosphabase [98] is an OWL ontology for describing protein phosphatases based on their domain architecture. For example, the class "Receptor Tyrosine Phosphatase" is defined to be equivalent to the set of Tyrosine Phosphatases that contain at least one "Transmembrane Domain". In OWL-DL, the class definition says that having such a domain is both necessary and sufficient to be a member of that class. It is therefore possible to use reasoners, such as Pellet [99] or FaCT++ [100], to automatically classify proteins using that ontology, based exclusively on their protein-domain composition, without any human intervention. Since the classes in Phosphabase are equipped with highly consistent OWL-DL axiom-based definitions, it was used in chapter 6 for the control experiment.

TAMBIS Ontology (TaO): TaO [101] is the ontology used by the TAMBIS [102] mediator framework (mediator frameworks are explained in 2.8.2). TAMBIS was designed to provide biologists with a common interface for choosing, combining and interacting with multiple external data resources, using TaO to answer biological questions by providing a unifying lingua franca between identical concepts across the resources [102]. A major difference between TaO and most other ontologies is that it is a "dynamic" ontology, since it can grow without the need for either conceptualising or encoding new knowledge [55]. TaO uses rules within the ontology to join existing concepts to form new concepts via novel inter-concept relationships. As a result, concepts are formally defined as they are created and, since concepts represent instances, a concept can effectively represent a question (query). For instance, imagine we create a concept "Receptor Protein", representing instances of proteins which function as biological receptors; gathering the instances of this class effectively answers a biological question [55]. Building on this conceptual approach of TAMBIS, in part 2 of this thesis we investigate machine learning methodologies that utilize features in the datasets to automatically enhance ontologies by adding necessary and/or sufficient restrictions to classes within the ontologies. Effectively, these data-driven approaches facilitate the transition of ontologies from the left to the right side of the spectrum shown in Figure 2.5. As such, in the next section we present a background on knowledge acquisition algorithms and approaches that address data-driven, automatic knowledge discovery applied to ontology engineering (more specific details will be presented within the chapters).
2.6 Ontological knowledge acquisition

Knowledge Acquisition (KA) is the process of extracting, structuring and organizing knowledge from one or multiple sources, usually human experts [103]. Traditionally, the knowledge engineer acquires knowledge from one or more domain experts, who explain the domain at a sufficient level of granularity that the knowledge engineer can then build a model of that domain capturing the expert knowledge in a formal manner. During this process, detailed information on the domain knowledge, and on the entities and relations in that domain, is obtained [104]. The obtained knowledge is subsequently represented in a language that is, ideally, context-independent [104]. Being context-independent is one of the basic principles upon which ontologies are built. It is not surprising, then, that ontologies have provided an ideal medium for representing the output of knowledge acquisition processes. Thus, although KA covers many more areas than merely ontology construction, building ontologies to encapsulate experts' knowledge is an important component of KA systems. Problematically, however, building and maintaining ontologies is labor intensive, requiring a broad range of specialist expertise, and is therefore both time-consuming and expensive [105]. This is one reason that "the knowledge acquisition bottleneck" was identified as the most significant barrier to overcome in the creation of knowledge-based systems [106]. Thus it would be desirable to automate (or at least semi-automate) the process of ontology-building. A wide range of algorithms and approaches have been developed to address (semi-)automatic KA applied to ontology engineering. Recent work on ontological KA can be divided into two broad categories, expert-driven and data-driven ontology construction, which are discussed in the following sections.

2.6.1 Expert-driven ontology construction

To date, the majority of popular approaches to building ontologies focus on the centralized development of ontologies, in that a relatively small number of knowledge engineers and domain experts (generally, the domain experts and the knowledge engineer are not the same individuals; when they are, the knowledge acquisition phase becomes more straightforward) contribute to, and exert control over, the process of ontology construction in a domain [107]. This process may involve scheduling series of face-to-face meetings, teleconferences, and interviews between knowledge engineers and domain experts, and/or using collaborative desktop ontology editing environments such as Protégé (http://protege.stanford.edu) to build ontologies. That a small number of individuals contribute to building and maintaining ontologies has several disadvantages. Most importantly, it exacerbates the knowledge acquisition bottleneck problem explained earlier, since it does not allow a larger community to participate in the ontology construction process. A broad range of methodologies have been proposed in recent years focusing on different aspects of the ontological knowledge acquisition problem. Several projects focus on facilitating collaborative ontology building, editing, merging and evaluation. For instance, Ontolingua [108] provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. OntoEdit [109] is an ontology editor that focuses on guided and collaborative development of ontologies.
OilEd [96] focuses on using inference to support ontology engineering. OntoRama [110] is built as a robust integrated environment, or suite, that provides technological support for most of the activities in the ontology lifecycle. For a comprehensive comparative study of existing tools and the existing challenges, refer to [111]. Though the mentioned decentralized methodologies have achieved more success compared to localized (e.g. desktop) applications, they still impose a high entrance barrier for potential users [107], a fact which is evidenced by their relatively small numbers of active contributors. To address the entry-barrier problem, another approach to ontology engineering has focused on harnessing the "wisdom of the crowd" by using the infrastructures and open culture of the "Wiki philosophy" (http://en.wikipedia.org/wiki/Wiki) as the knowledge-capture framework. Using these frameworks, anybody can add new elements to the ontology and refine or modify existing ones (e.g. OntoWiki [107], Semantic MediaWiki [112], and Freebase [113]). Two examples in healthcare are WikiNeuron [114], which specifically focuses on neuroscience and neurological disorders, and Gene Wiki [115], which aims to create a continuously-updated, collaboratively-written, and community-reviewed review article for every gene in the human genome. Overall, Semantic Wikis enhance the ability for effective collaboration and cooperation with a simple documentation process [116]; however, they face challenges such as inconsistency of content, limited reasoning capabilities and integration problems [117]. To summarize, the methodologies discussed (both distributed ontology editing environments and wiki-based approaches) have significantly facilitated the process of collaborative ontology building; however, to date, they still rely heavily on human experts (hence the name expert-driven) as their main source for building ontologies. These approaches do not provide practical mechanisms for the efficient integration of data to exploit machine learning algorithms in the process of ontology building.

2.6.2 Data-driven ontology construction

In data-driven ontology construction approaches, exemplar domain data are used as a means to establish the conceptual knowledge required by the ontology that will later be used to categorize those data. Data-driven ontology construction approaches can be categorized according to the types of input data from which they are learned [118]. These types of data are unstructured (e.g. natural-language text documents) and (semi-)structured (e.g. database schemas) [118]. Unstructured data is the most difficult starting data-type. Existing methodologies for extracting knowledge from raw text are generally based on a combination of Natural Language Processing (NLP) techniques with statistical methods (e.g. [119]). Generally, although they almost universally provide some degree of support for manual editing of the resulting ontology, the existing methods for learning from unstructured data have not shown good performance for de novo ontology construction [120]; their performance is further reduced in the biomedical domain because of the high level of expert knowledge required.
It has been shown [121] that the quality of the results of ontology learning methods using semi-structured data is much better than that of methods using completely unstructured data, making the former more suitable for biomedical domains [120]. Therefore, in this thesis we focus on data-driven ontology learning approaches starting with semi-structured input data. A more detailed explanation of data-driven approaches – both structured and unstructured, since they both follow a similar pattern – requires a certain amount of background and terminological definition. Figure 2.6 shows the different "layers" involved in ontology learning, together with examples for each layer.

Figure 2.6 Layers of the ontology development process, with examples of each layer (adapted from [120], originally presented in [119]); SBP: systolic blood pressure, DBP: diastolic blood pressure. (Example layers, from bottom to top: terms – disease, hypertension, high blood pressure, hospital; synonyms – {hypertension, high blood pressure}; concepts – Hypertension, Disease; taxonomy – Hypertension is-a Disease; axioms – Hypertension ≡ SBP > 140 and DBP > 90.)

Terms are the linguistic realization of domain-specific concepts. Term extraction is necessary for all aspects of ontology learning [119]. Terms can be used to refer to, or be associated with, a specific concept (e.g. Hypertension in the figure). The literature provides many examples of term extraction methods that could be used as a first step in ontology learning [119]. Synonyms are used to enable the acquisition of semantic term variants and help to avoid redundancy in the concepts [120]. Much of the work in this area has focused on using word-sense disambiguation and the integration of dictionaries such as WordNet (http://wordnet.princeton.edu) for the acquisition of synonyms. Concepts are then extracted from this non-redundant set of terms; as can be seen from the example shown in Figure 2.6, not all of the extracted candidate terms qualify as concepts, and the extraction of concepts from terms (especially from unstructured data) is generally application-dependent and not a straightforward process [119]. Once concepts are extracted, methodologies can be used to elicit instances of the identified concepts from within the dataset (for example, if the system determined that "President" was an important concept in the dataset, then the subsequent step would be to discover "George W. Bush" and "Barack Obama" as examples/instances of that concept). Once instances have been identified, inductive rule learning methodologies can be used to elicit the axiomatic definition (top box in Figure 2.6) of those concepts (discussed next) – effectively, what is it that George W. Bush and Barack Obama have in common that makes them both instances of the President concept. In addition, methodologies exist that apply different algorithms (e.g. hierarchical clustering algorithms) to identify the hierarchical relationships (generalization and specialization) between concepts based on these properties; for example, the system might identify the concept "President" as well as "U.S. President" and "French President". Barack Obama would be an example of a U.S. President, and François Mitterrand an example of a French President, where both French President and U.S. President are defined as specializations of the "President" concept based on the country of origin of their individuals.
Term, synonym and taxonomy extraction methodologies are mostly used for learning ontologies from text (which is not the focus of this thesis), and thus are not discussed further (please refer to [118] and [119] for additional information). The part of ontology learning that relates to the extraction of formal concept definitions based on examples (dubbed "Formal Concept Learning"; green boxes in Figure 2.6) is of most relevance to this thesis.

2.6.3 Data-driven formal concept learning
Formal concept learning can be defined as the process of eliciting a logical description of a concept from given instance members and non-members of that concept. In the context of this thesis, the notion of "logical description" can be taken to mean formal axioms describing the rules of concept/category membership, represented in OWL-DL. This logical description of a concept, which is the output of the learning process, is called a hypothetical class or a hypothesis, since it is a suggestive description of why the instances are members (or non-members) of a specific concept[61]. Members of a concept are referred to as positive examples and non-members are called negative examples. A common feature of concept learning algorithms is the use of background knowledge (instances). Concept learning methodologies apply machine learning algorithms over the background knowledge in order to generate a set of tentative expressions that might act as the concept definition. This approach allows more complex learning scenarios, since rich logical descriptions of the given (positive and negative) examples can be used by the learning algorithm. Concept learning methods have attracted considerable attention in biology and biomedicine, mostly due to 1) the abundance of legacy (semi-)structured data in biomedicine and 2) the structural complexity of existing biomedical data. Overall, class expression learning methods offer several advantages for ontology engineering, including enriching ontologies with formal (and human-readable) definitions for classes, finding similar instances using the generated rules alongside DL reasoning, and classifying new instances based on their properties[61]. Disadvantages of these methodologies are the dependency on the existence of instance data (background knowledge), the need for an a priori ontological "scaffold", and the requirements on the quality of the ontology, i.e., modelling errors in the ontology can reduce the quality of the results[124]. Concept learning in description logics falls under the much broader umbrella of inductive logic programming (ILP)[125]. ILP uses inductive reasoning (as opposed to deductive reasoning) to find high-level representations of concepts. The main difference between inductive and deductive reasoning is that in deductive reasoning it is formally shown whether a statement follows from a given knowledgebase, whereas in inductive learning new statements are generated de novo[61]. ILP has become one of the most prominent machine-learning research and application areas of the past two decades. Numerous books and publications address its theoretical foundations and its different application areas, such as probabilistic ILP (e.g., [126]) and statistical relational learning (e.g., [127]). Nevertheless, we focus on the methods from ILP systems that are applicable to knowledgebases represented in OWL-DL.
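As a sketch of what such a learned hypothesis might look like, consider the hypertension example of Figure 2.6: given hypertensive patient records as positive examples and normotensive records as negative examples, a concept learning algorithm might propose an OWL-DL class expression such as the following (the property names here are illustrative assumptions, not taken from any particular ontology):

Class: Hypertension
    EquivalentTo: (hasSystolicBloodPressure some float[> "140.0"^^float])
        and (hasDiastolicBloodPressure some float[> "90.0"^^float])

The algorithm's task is to search the space of such expressions for one that covers as many positive examples as possible while excluding the negative ones; the knowledge engineer can then accept, refine, or reject the suggested axiom.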
The major obstacle to adapting ILP systems to learning problems on the Semantic Web is that traditional ILP systems utilize logic programs for knowledge representation, whereas learning problems on the Semantic Web focus on knowledgebases represented in the OWL-DL formalism[128]. Few studies have focused on combining ILP with OWL-DL formalisms to jointly target the concept learning problem in ontology engineering. The most prominent example of such frameworks is the recently published DL-Learner algorithm CELOE (Class Expression Learning for Ontology Engineering)[124], which is integrated into the ontology engineering tool Protégé. CELOE aims to use information about instances in an ontology to infer new axioms that might more precisely define an existing class. One interesting feature of CELOE is that the plugin implemented for Protégé is interactive, allowing the knowledge engineer to accept or reject a new axiom. As we will see, the DL-Learner framework is utilized in chapter 6 as the primary tool for conducting the experiments.

2.7 Data integration and interoperability tools in bioinformatics
The experiments described in chapters 3 to 5 of this thesis utilize two recently-published data integration and interoperability technologies: Semantic Automated Discovery and Integration (SADI) and the Semantic Health and Research Environment (SHARE). Thus, we will provide a quick overview of prior art in data integration and interoperability in bioinformatics, and subsequently introduce SADI and SHARE. Data integration from the Web is one of the well-known challenges faced by bioinformatics[129]. Many questions in bioinformatics can be answered only by combining data from multiple resources. Usually, biomedical scientists are required to collect information from a wide variety of resources and analytical tools and to tailor the collected information to the specific problem at hand[42]. Two problems are associated with this traditional approach[129]: 1) a large number of existing data resources have to be incorporated, and 2) the semantic and syntactic heterogeneity of existing biomedical data, together with differing access methods, makes the integration process labor-intensive and time-consuming. Frameworks have been designed to mitigate the barriers to data integration and analysis across multiple resources. Broadly, these approaches can be categorized as either centralized or distributed[42]. The centralized approach, also dubbed data warehousing, is a conceptually straightforward (though quite complex in practice) approach to data integration. Data warehouses are constructed by combining a set of related sources into a single database with a unified schema[130],[59]. Several warehouses have been developed by integrating common databases (e.g., Atlas[131], the Sequence Retrieval System (SRS)[132], and BioMart[133]). One drawback of the warehousing approach is that queries are limited to a fixed set of databases collected with a specific use-case in mind, and that use-case may not necessarily match the problem of any given user[42]. Another problem is that warehouses are vulnerable to changes made in the original data sources, resulting in a very high level of maintenance for both the (large) integrated schema and the data-importing system that keeps the warehouse "fresh"[134]. On the other hand, distributed systems (also called mediator systems) do not aggregate data into one massive warehouse.
Instead, they translate a user query into smaller subqueries and issue them remotely against a data source or computational service at its native network location[42]. Contrary to warehouses, distributed systems are not sensitive to changes in the original data; however, they face the problem of transferring large datasets across the network. Additionally, mediator systems rely on the availability of remote data provided by third parties, which cannot be guaranteed[26]. They are sensitive to changes in semantics at all sources, but are somewhat less sensitive to source-syntactic changes, since the mediation layer for each data source will generally be designed to harmonize the syntax prior to merging the data. Effectively, mediator systems aim to create the illusion (from the user's point of view) of a single query language, a single data model and a single location[102]. The process is depicted schematically in Figure 2.7 for an original query that is divided into three hypothetical subqueries, each of which is directed to a different resource for resolution; the black arrows depict query submission and the red arrows show the results. The resources in this approach can be accessed via discovery and invocation of Web Services (or orchestration of multiple Web Services), as shown for subquery 2. (Web Services are designed to support computer-to-computer interaction over a network, as opposed to Web applications, which are targeted at human users.) Finally, the result from each subquery is integrated into one answer and returned to the user. The majority of existing data in bioinformatics is in the "Deep Web" – that is, the part of the Web not accessible to search engines like Google (e.g., subqueries 2 and 3 in Figure 2.7). "Deep Web" data is "hidden" either behind remote query interfaces or exists only transiently as output from an analytical algorithm exposed as a Web Service[135].

Figure 2.7 Architecture of distributed mediator systems (adapted from [59] and [136]).

Traditional XML-based Web Services offer limited interoperability unless their semantics are properly described[137]. Semantic Web Service (SWS) frameworks are built around universal standards for the interchange of semantic data, to facilitate the combination of data from different sources. These frameworks utilize Semantic Web technology in an attempt to enhance Web Service interoperability[135]. Prominent examples of Semantic Web based systems in healthcare include SSWAP[138], BioMoby[139], myGrid[140], BioFlow[141], TAMBIS (focusing on molecular biology[102],[142]; the TAMBIS project is no longer active) and caGrid[136] (focusing on cancer research). Generally, these frameworks have significantly ameliorated the interoperability problem, and some have exhibited highly integrative behaviours (e.g., BioMoby and SSWAP)[135]; however, each has its own limitations, especially with respect to the automatic service discovery process[135]. Some recent projects have attempted to address these shortcomings by taking novel approaches to the semantic representation of bioinformatics resources on the Web. The most relevant to this thesis is Semantic Automated Discovery and Integration (SADI).

2.7.1 Semantic Automated Discovery and Integration (SADI)
SADI provides a novel interoperability platform for the Semantic Web.
The motivations that led to the development of SADI include 1) the observation that the majority of Web Services in bioinformatics exhibit a distinct subset of behaviours compared to the wide range of behaviours exhibited by Web Services in other domains[135], and 2) the observation that, when searching for Web Services to accommodate their needs, biomedical scientists are interested in the relationships between pieces of biological information, as opposed to a specific business model of how biological information will be analysed[135]. The main components of the SADI framework are (a) a set of best practices for providing Web Services on the Semantic Web, and (b) a public registry of machine-readable Web Service descriptions[11],[143]. The SADI registry contains information about SADI-compliant Web Services, such as the full logical definition of their input and output datatypes. In addition, the SADI registry indexes a unique piece of service information – the set of predicates that describe the biological relationship between the service's input and what will be returned after the service is executed[135]. This additional piece of captured metadata – the semantic relationship between the input and output of a service – enables the construction of a novel and fully-automated Web Service discovery, pipelining, and execution system, the behaviour of which we have utilized throughout this thesis. The ability of SADI services to be automatically pipelined is a unique feature of this Semantic framework, and the software that achieves this will now be described in some detail.

2.7.2 Semantic Health and Research Environment (SHARE)
SHARE is a SADI client application that allows SADI services to be discovered during the process of SPARQL-DL query evaluation[12]. Effectively, SHARE augments OWL reasoners and SPARQL query engines with the ability to retrieve data that is dynamically generated from remote resources at the time of query execution and reasoning. When an ontological concept is present in a query, the SHARE Web Service registry is used to discover relevant services by first decomposing the concept into smaller sub-concepts, and then mapping each sub-concept to one or more Web Services that generate data related to that sub-concept. Finally, the results from each of the Web Service invocations are integrated into the final result-set over which SPARQL then resolves the original query. To achieve the goal of automating the integration of queries, the system must be capable of pipelining the services into a workflow, such that the output of one service is fed in as the input of the next, and so on. The BioMoby[139] project, the progenitor of SADI, achieved this goal by establishing a shared ontology of datatypes. The major difference (and advantage) of SADI, however, is that SADI datatypes are defined in OWL, their definitions are not centralized, and users of SADI may extend or customize the set of datatypes as they see fit. This can be done by publishing a new OWL class online. Eliminating the need for a centralized ontology of "valid" datatypes better facilitates the utilization of distributed knowledge and significantly reduces the (centralized) cost of building and maintaining interoperability systems.

2.7.3 SADI/SHARE demonstration
To demonstrate the semantic behaviours of SHARE, we present an example use case for the SHARE query engine, based in the biomedical domain. (This example is a simplified version of part of the experiment conducted in chapter 4, published by the author and colleagues[179].)
Suppose that a researcher wants to identify patients who are overweight, possibly because such patients are at higher risk of developing cardiovascular diseases. Figure 2.8 shows an example SPARQL query the clinician poses to retrieve the overweight patients:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cardio: <...>   # the researcher's local cardiovascular ontology; URI not shown in the original
SELECT ?patient ?bmi
FROM <./patients.rdf>
WHERE {
  ?patient rdf:type cardio:Overweight .
  ?patient cardio:bmi ?bmi
}

The prefix "cardio" in the query refers to the local ontology in which the researcher has defined her personal definition of the concept Overweight, and the "patients.rdf" file is the RDF representation of the patient data shown on the left side of Figure 2.9. Suppose the clinician chose to use Body Mass Index (BMI) as the metric for obesity (BMI is a statistical measure comparing a person's weight and height, defined as mass in kilograms divided by the square of height in metres, and is used to estimate a healthy body weight for a given height) and uses the World Health Organization (WHO) definition of overweight, namely patients whose BMI value is greater than 25:

OverweightPatient ≡ Patient
  and (hasBMI some (BodyMassIndex
    and (hasValue some float[> "25.0"^^float])))

As shown on the left side of Figure 2.9, the input RDF file does not contain the data required to answer the query (i.e., it lacks a BMI value). SHARE detects that this information is missing, examines the data that is available, matches this with a Web Service that can utilize height and weight to calculate BMI, and sends the data to this service. The service returns the data with the cardio:hasBMI property attached, and this is then added to the set of measurements for each patient in order to allow the query to proceed. As a result, the inferred classification of the input data is retrieved, which allows for multiple, on-the-fly interpretations of the input data from different perspectives. The input and output of the "BMI calculator" service are defined as follows in OWL, where all the concepts and relations come from the local ontology (cardio in this example) and are shown in Figure 2.9 for two hypothetical patients:

Input Class: hasMass some Mass and hasHeight some Height
Output Class: hasBMI some BMI
Generated Property: hasBMI

Figure 2.8 Inferred classification based on BMI: a query where the class "Overweight" has been defined, representing a personalized world-view of how to interpret the property-constraints to find members of the class.

Figure 2.9 SADI service for BMI calculation: SADI discovers and invokes the service, which generates and attaches the BMI value (green circle) to the input RDF file on the left. The OWL-DL reasoner can subsequently carry out reasoning over the attached property at run time and classify patients based on personal world-views. The bottom-left part of the figure shows the original data in tabular format; as shown, the input file does not have a BMI property.
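For concreteness, a minimal sketch of what the patients.rdf input of Figure 2.9 might contain for one patient, and of the triple the BMI-calculator service would attach, is shown below. The URIs, the hasValue property, and the numeric values are illustrative assumptions rather than the actual contents of the figure:

cardio:patient1 rdf:type cardio:Patient ;
    cardio:hasHeight [ cardio:hasValue "1.82"^^xsd:float ] ;   # metres (illustrative value)
    cardio:hasMass   [ cardio:hasValue "85.0"^^xsd:float ] .   # kilograms (illustrative value)

# triple attached by the discovered "BMI calculator" SADI service at query time:
cardio:patient1 cardio:hasBMI [ cardio:hasValue "25.7"^^xsd:float ] .

With the cardio:hasBMI value in place (85.0 / 1.82² ≈ 25.7), an OWL reasoner can classify cardio:patient1 as a member of the clinician's Overweight class, since 25.7 > 25.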
As shown in this demonstration, the concept OverweightPatient represents a concept from the clinician's personal world-view, and may be reused, extended, or replaced as new hypotheses are developed, without any effect on the underlying data. As we will see, the flexibility offered by the SADI/SHARE framework is utilized in conducting the experiments presented in chapters 3 to 5. To summarize, in this chapter we provided a broad overview of Semantic Web tools and technologies, together with the underlying considerations and approaches used by our research group. We focused on providing prominent examples of existing biomedical ontologies, emphasizing the role they play in biomedical data integration, data classification and decision support. Together, these provide the context necessary to understand the findings presented in chapters 3 through 5. We then presented the existing methodologies proposed to overcome the well-known "knowledge acquisition bottleneck" in ontology construction and evaluation, focusing specifically on the formal concept learning methodologies that are relevant to the second theme of this thesis.

3 Automatic detection and resolution of measurement-unit conflicts in aggregated data
(A version of this chapter has been submitted for publication: Soroush Samadian, Bruce McManus and Mark Wilkinson, "Automatic detection and resolution of measurement-unit conflicts in aggregated data".)

3.1 Synopsis
"Units are basic scientific tools that render meaning to numerical data"[144]
Measurement-unit conflicts are a perennial problem in integrative research, and as the number of multi-national collaborations increases, new quantitative measurement instruments proliferate, and Linked Data and Semantic Web data-integration infrastructures become increasingly pervasive, the number of such conflicts will similarly increase. This problem tends to be more severe in biomedical data, due to its complex, multi-dimensional and heterogeneous nature, where researchers often attempt to integrate data from multiple and disparate datasets[145]. In this chapter we explain the motivation and challenges associated with formal measurement-unit representation and integration on the Semantic Web. We then introduce the related ontologies and research projects. Different design choices for modeling units, and their consequences, specifically in the context of biomedical informatics, are discussed. Subsequently, we present our proposed approach to 1) automatically detecting when an integrated dataset contains unit conflicts, and 2) automatically resolving these conflicts. We compare our approach with existing projects and justify the important design and representation choices. Finally, we present a case study that was identified by our clinical partners in the context of patient phenotype classification; the dataset for this case study, a cardiovascular dataset, is used to evaluate our approach.

3.2 Introduction
Precise measurements of physical quantities are essential for the development of engineering and the sciences. Massive volumes of scientific data are generated in scientific experiments and biomedical research on a daily basis, and with the advent of high-throughput technologies, the amount of generated biomedical data continues to rise. This growth has increased the need to integrate data from multiple sources, and for any meaningful integration, comparison and interpretation of quantitative data, the first step is to ensure that quantitative values are represented with appropriate measurement-units. Failure to do so has historically contributed to fatal outcomes.
For instance, even NASA (http://www.nasa.gov/) has fallen prey to dangerous and expensive errors by failing to detect and account for measurement-unit conflicts[146]. This work was motivated by a clinically-oriented study in which we envisioned being able to gather clinical phenotype data from various participating groups, and then automatically categorize individual patients over a number of health-risk criteria. Prior to undertaking the study, we became aware of the potential for measurement-unit conflicts in these integrated datasets. Rather than creating an ad hoc solution, we attempted to define a lightweight, standards-compliant, and generalized solution to this widespread and pervasive problem that could be re-used both by our own group and by other integrative teams, especially in the context of biology and biomedicine. The rest of this chapter is organized as follows. In section 3.3, background and related work are reviewed (the detailed theoretical foundation of measurement theory is beyond the scope of this work; the interested reader is referred to "Foundations of Measurement"[262] and "Basic Measurement Theory"[263]). We explain the challenges associated with formal measurement-unit representation and integration on the Semantic Web, and discuss related tools and resources. We describe possible design choices for modeling units, and the problems or benefits of these alternatives. In section 3.4, we present the high-level system architecture and the methodology used to ground existing unit ontologies (e.g., OM) in clinically-oriented ontologies (e.g., GALEN), with the goal of attaching quantitative values to existing cardiovascular concepts. We then discuss our proposed framework, and justify the core design and representation choices, through a case study of patient phenotype classification within the cardiovascular domain.

3.3 Background and related work
That the standardization and formalization of measurement-units is crucial for scientific research was recognized many decades ago. Several efforts were initiated to achieve the standardization of units, of which the most notable is the International System of Units (SI)[147], adopted in 1960 as a universal system for measurements in all areas of science[148]. However, the adoption of standard unit systems such as SI has in practice proved insufficient to ensure the reliable integration of quantitative measurements[149]; as a result, a consistent methodology to interpret and integrate the units in datasets is warranted[148–150]. The advent of Semantic Web technologies has provided opportunities for some recent frameworks to address the problem of representation and integration of quantitative research data. At their core, these frameworks utilize ontologies for the representation of units. Existing biomedical (and non-biomedical) ontologies such as GALEN[56] and SNOMED-CT[73] each provide limited terminologies for unit representation. The majority of existing biomedical ontologies use the de facto Semantic Web standards Resource Description Framework (RDF)[9] and Web Ontology Language (OWL)[2]: RDF for the data-encoding layer, and OWL for the knowledge layer, i.e., the inferences that can be made about the data.
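As a minimal illustration of this split (with hypothetical URIs and property names), the data layer might record a bare observation as an RDF triple, while the knowledge layer expresses how such data should be interpreted:

The data layer (RDF), a bare numeric literal:
:pt1 :hasSystolicBloodPressure "128"^^xsd:float .

The knowledge layer (OWL, Manchester-like syntax), an interpretive class definition over such data:
Class: HypertensivePatient
    EquivalentTo: Patient and (hasSystolicBloodPressure some float[>= "140.0"^^float])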
However, RDF does not inherently define a method for representing the measurement unit of any numerical datum[151], which causes a syntactic problem. For example, RDF literal nodes can represent discrete numeric values such as "273" or "37", but a literal node does not capture which unit is used to express the value. Additionally, these ontologies generally have limited coverage of measurement units, and lack systematic relationships amongst measurement units, causing a semantic problem. For instance, the GALEN concept MilligramPerDeciLitre is defined as a subclass of the concept ConcentrationUnit, but lacks any indication that this unit is composed of a combination of two base units (gram and litre) and two prefixes (milli and deci). Over the past few years, ontologies have been developed to specifically address the problem of a formal theory of measurement representation and integration. Prominent examples of such ontologies are the Unit Ontology (UO)[144], EngMath[150], the Measurement Unit Ontology (MUO)[151], the Ontology of Units of Measure (OM)[152], and QUDT[153]. The following concepts are shared between these ontologies. Base units are units that are not derived from any other unit; base units can be used to derive other units (such as the SI base units). Derived units are units obtained from combinations of base units, representing "derived physical qualities". Below, we discuss these ontologies, their design choices, similarities and differences, and the advantages and disadvantages of each (a more detailed analysis of existing frameworks is presented in Appendix B).

Unit Ontology (UO): UO is a unit ontology specifically focusing on units of measurement within the biomedical domain[144]. Regarding semantics, UO provides the relationship between units that are based on the same unit. For instance, "gram" and "kilogram" are based on the same unit and thus are both considered subclasses (or instances, depending on the version) of "gram based units". Such units are defined using a prefix (in this case kilo) via the "has-prefix" relationship in the ontology. However, UO does not provide the mathematical meaning of such prefixes; for instance, an OWL reasoner would not be able to determine the conversion factor between kilogram and gram using the ontology alone. Furthermore, the conversion factor between units used for the same quantity that are not connected by any prefix is missing (e.g., the relationship between "inch" and "meter"). Finally, the semantic relationship between derived units and their components is lacking ("square meter" has no relationship to "meter").

MUO: MUO is a modular ontology specifically designed to represent units in a combinatorial manner. In MUO, complex measurement units can be derived from base ones in a modular fashion. Derived units in MUO are divided into simple derived units (e.g., millimeter) and complex derived units (e.g., kg/m2). MUO defines the property muo:derivesFrom to express the relationship between a derived unit and the units it is derived from. Simple derived units are derived from exactly one base unit; for instance, the millimeter (mm) can be derived from the meter (m). These are units that can be defined by attaching a prefix to a base unit.
MUO also recognizes a type of unit that, although derived from exactly one base unit, has a different dimension; for instance, SquareMeter (m2). For such cases, an additional property, muo:dimensionalSize, is used to account for the dimensionality difference. Complex derived units are units derived from more than one base unit. For instance, "kilogram per square meter" is defined as follows:

kilogram-per-meter-square rdf:type muo:ComplexDerivedUnit ;
    muo:derivesFrom ucum:kilogram ;
    muo:derivesFrom :meter-squared .
meter-squared rdf:type muo:SimpleDerivedUnit ;
    muo:derivesFrom ucum:meter ;
    muo:dimensionalSize "2"^^xsd:float .

The main advantage of MUO over UO is that it proposes a convenient framework for defining new units of measurement in terms of existing ones. However, while MUO defines metric prefixes (e.g., centi- and kilo-), which could be used to automate conversion for SI-based measurements, it lacks quantitative or formula-based definitions for converting between SI units and similar quantities in other unit systems (e.g., inch and cm).

Ontology for Engineering Mathematics (EngMath): EngMath[150] is an ontology for mathematical modelling in engineering, written in Ontolingua[154]. It provides conceptual foundations for representing mathematical and physical entities such as scalars, vectors, tensors (http://mathworld.wolfram.com/Scalar.html; http://mathworld.wolfram.com/Vector.html; http://mathworld.wolfram.com/Tensor.html), physical quantities, physical dimensions and units, explicitly designed for knowledge-sharing applications in engineering[150]. Regarding the unit-representation problem, the main feature of EngMath (absent in UO and MUO) is the component "physical dimensions". The physical dimension of a quantity is an abstraction of the quantity that ignores magnitude, sign and direction[152]. The dimension of a quantity can be expressed in terms of an independent set of base dimensions[152]. For instance, the quantity Body Mass Index (BMI) has a dimension that can be decomposed into the base dimensions mass (M) and length (L): dim(BMI) = M · L⁻². The base dimensions in the SI system are length (L), mass (M), time (T), electric current (I), temperature (Θ), amount of substance (N) and luminous intensity (J). EngMath (as opposed to MUO and UO) provides enough semantic information to convert between any pair of units of the same dimension that are either defined as basic units or composed from basic units[154]. The main problem with EngMath is that, because it was developed prior to the uptake and adoption of Semantic Web standards, it is not available in OWL. This problem is addressed in two more recent ontologies, QUDT and OM.

QUDT: The Quantities, Units, Dimensions and Types (QUDT)[155] ontologies are a group of ontologies currently being developed by NASA. QUDT defines "quantity dimensions", which allows for automatic consistency checking of different quantities. QUDT also includes several major unit systems, such as the CGS system of units, SI and others, but relates all other unit systems back to SI using two data properties – "conversion offset" and "conversion multiplier" – that enable automated conversion between any non-SI unit and its SI-based equivalent.
All compound units (e.g., kilogram per cubic meter) are of the type "Derived Units", with no further logical specification of their precise constituents.

Figure 3.1 High-level representation of the unit "inch" in QUDT. The properties are shown in different colors on the right side of the figure.

Figure 3.1 shows the high-level representation of the unit "inch" in QUDT, together with a number of units used for representing length (not all are shown in the figure). In terms of coverage of base units, QUDT is fairly comprehensive; however, it lacks a number of derived units (e.g., the centimeter of mercury column, commonly used for clinical measurements of blood pressure). QUDT uses the two data properties "conversion offset" and "conversion multiplier" to provide the conversion between any non-SI unit and its SI-based equivalent. For instance, for the unit "inch" the conversion offset is zero and the conversion multiplier is 0.0254, while for "degree Fahrenheit" the conversion offset and multiplier are 255.370 and 0.556 respectively, to convert to Kelvin. Generally, the following formula is used to carry out this conversion:

Quantity value in SI = (conversion multiplier) × (Quantity value in non-SI unit) + (conversion offset)

This quantitative relationship between different units does not exist in MUO (or UO) and is a major advantage of QUDT. The conversion is possible only if the units or scales have the same dimension, represented by the property "quantityKind".

OM: The Ontology of Units of Measure and related concepts (OM) models concepts and relations important to scientific research[156], focusing on units and quantities. OM and QUDT are similar in terms of high-level design features, and hence we discuss only the differences (the interested reader is referred to [152] for specific details about OM). One difference is that OM defines "quantity kinds" as OWL classes, and thus reasoning with their hierarchies is more straightforward[152]. Furthermore, although QUDT provides an explicit conversion formula between non-SI and SI-based units, it does not represent sub-multiple units in terms of their components; for example, the unit centimeter has offset and conversion factors of 0 and 0.01 respectively, yet it has no "Prefix" property. Additionally, OM relates compound units (e.g., kilogram per cubic meter) to their individual constituents (kilogram and cubic meter), whereas QUDT treats units derived from one base unit (e.g., cubic meter) and units derived from multiple base units (e.g., kilogram per cubic meter) all as individuals of the class "Derived Units". In OM, the top-level class "compound unit" is further divided into three classes: "unit division" (e.g., meter per second), "unit exponentiation" (e.g., meter squared) and "unit multiplication" (e.g., meter kilogram). For instance, for the unit "millimole per cubic centimeter" (mmol/cm3), the numerator and denominator are defined as "millimole" and "cubic centimeter" respectively, where millimole is related to mole by the prefix milli (om:factor = 1e-3) and cubic centimeter is an instance of "unit exponentiation", as shown in Figure 3.2.

Figure 3.2 OM representation of cubic centimeter.

This additional piece of metadata provided by OM allows us to automatically check dimension compatibility and to generate on-the-fly relationships between compatible compound units.
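As a worked illustration of why this decomposition is useful (the arithmetic below is ours, for illustration; it is not taken from OM itself), the conversion factor between the compound unit millimole per cubic centimeter and its SI-based counterpart mole per cubic meter can be assembled automatically from the factors of its constituents:

millimole: the prefix milli contributes a factor of 1e-3, so 1 mmol = 1e-3 mol
cubic centimeter: the prefix centi contributes 1e-2, and the exponent 3 gives (1e-2 m)3 = 1e-6 m3
therefore 1 mmol/cm3 = (1e-3 mol) / (1e-6 m3) = 1e3 mol/m3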
As an example, consider the units used for concentration and density in clinical data (Appendix B). The labels of these units are highly structured, i.e., the label of a compound unit can be constructed from the labels of its constituents. Thus, instead of manual curation of individual units, the relationships in OM make it possible to automatically generate the mathematical relationship between quantities. Finally, OM provides Web Services, based on the OM ontology, that facilitate more complex tasks such as unit conversion and checking the consistency of dimensions (see next section)[157].

3.4 Materials and methods
3.4.1 Data set and data collection
As mentioned in the introductory section, this work was originally an integral part of an experiment (presented in the next chapter) in which we undertook to model cardiovascular phenotype data and then automatically classify individual patients over several health-risk criteria. The dataset used for our experiments comprises the clinical measurement records of a cardiovascular patient cohort collected from a referral hospital in Nebraska, USA, between 1986 and 1989, including 536 unique patients (see next chapter). Table 3.1 shows some columns from two rows of the dataset used in this study.

Table 3.1 The first two rows of the dataset in its original format. In the last four columns, "1" represents "high risk" and "0" represents "low risk" for the condition listed in the header. The intended meaning of the acronyms in each column header (e.g., SBP for Systolic Blood Pressure) was confirmed with the clinician who originally collected the dataset.

ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0

Regarding this dataset, a few observations can be made:
1. The measurement-units are not represented explicitly together with the quantities they describe. For instance, the "Height" measurement is 1.82 for "pt1" and 179 for "pt2", without any explicit reference to a unit; it can only be guessed (with high confidence, based on the values) that the unit is "meter" for the first row and "centimeter" for the second.
2. Different rows of the same dataset are represented in different measurement units (possibly because the measurements in these rows were recorded by different health practitioners). Though it cannot be stated with 100% certainty, the ranges of the values suggest that SBP is represented in "millimeter of mercury column" in the first row and "centimeter of mercury column" in the second, and that CHOL (cholesterol) and HDL (HDL cholesterol) are represented in "milligram per decilitre" in the first row and "millimole per litre" in the second.
3. There is no consistency in the "system of units" used, even between different columns of the same row. For instance, HEIGHT in the first row is represented in the SI system, whereas WEIGHT is represented in the Imperial (and United States customary) measurement system.
4. Colloquial and non-standard terms are used in the dataset. For example, WEIGHT properly denotes the force exerted on an object as a result of gravity (W = m·g, where m represents mass and g is the gravitational acceleration). Though WEIGHT and MASS are frequently confused with each other, they are different quantities with different dimensions.
These observations highlight the need for a practical solution to the problem of measurement-unit conflict resolution in health care. In this study we focus on observations 2 and 3 above.
Assumptions: Automatic labeling and automatic concept mapping are not a major focus of this study. Thus, we assume that the input numerical data for each quantity are expressed in a unit that is an "allowable" unit for that quantity according to QUDT (e.g., centimeter, inch, meter, and so on for Length). Additionally, we assume that the cardiovascular concepts named in the dataset can be grounded in cardiovascular ontologies; for instance, we map SBP in our data to SystolicBloodPressure in the GALEN ontology.

3.4.2 Data transformation
A few ontologies and standards are used in multiple chapters of this dissertation; we introduce them here and only refer to them in later chapters to avoid repetition.

GALEN: The GALEN Common Reference Model (CRM) is a rich compositional ontology of the medical domain, covering anatomy, function, pathology, diseases, symptoms, drugs, and procedures[56]. It was developed by the Department of Computer Science at the University of Manchester (http://www.cs.manchester.ac.uk/). It is available in both GRAIL and OWL formalisms ([56],[158]). The version used in this study is the OWL version, dated August 2011, consisting of 2749 classes and 500 object properties. Several groups have investigated various aspects of the GALEN ontology, including its expressivity, representation, and suitability for specific applications (e.g., [159]). Based on these studies, and on our own investigation of the suitability of its terminological domain, we selected GALEN as our core ontology for describing cardiovascular concepts. In this thesis we primarily focus on concepts in GALEN that are relevant to cardiovascular risk monitoring, and describe an approach for re-factoring and extending the cardiovascular-relevant classes of GALEN such that they can be used to automatically classify clinical data.

Semanticscience Integrated Ontology (SIO): The Semanticscience Integrated Ontology (SIO) is an effort to create a coherent formal ontology with rigorous attention to concrete and clearly-stated design patterns[160]. SIO takes the "realist" position, in which things exist independently of conceptual or linguistic schemes, and firmly acknowledges that terms used in a discourse denote one or more individuals or classes, where the latter may have zero or more instances[161]. The choice of properties in the development of any ontology is crucial and non-trivial[162], and the use of a minimal set of re-usable relations is essential in building consistent, interoperable and well-formed knowledgebases[162]. For instance, the following two OWL property constraints might be considered to describe the same data feature:

1. Patient hasAttribute some SystolicBloodPressure
2. Patient hasSystolicBloodPressure some Attribute

With respect to re-usability these two representations are considerably different.
When designing ontologies to support logical reasoning, it is considered good practice to encode the complexity of data in class definitions (statement #1) rather than through a proliferation of properties (statement #2)[162]. The relationships defined by SIO are highly generic (e.g., "has attribute", defined by SIO's property SIO_000008), and this forces us, as the data modelers, to follow these good design patterns and to formalize data-types through the elaboration of ontological classes which are, whenever possible, distinct in their properties from all other ontological classes. We adhered to this design principle as closely as possible in this study. Finally, SIO is extensively used by analytical tools exposed as SADI Semantic Web Services, and thus our adoption of SIO also allows us to more easily take advantage of existing analytical tools published through the SADI framework, as well as to rapidly publish and integrate new tools as needed for our study.

SADI and SHARE: As discussed in detail in chapter 2, SADI[11] is a set of standards-compliant design principles for exposing stateless Web Services on the Semantic Web. SADI services consume and produce RDF data, where the input and output data properties are described by OWL classes. These classes are, similarly, utilized to discover services of interest through their registration in the SADI service registry. SHARE[59] is an enhanced SPARQL query engine which is capable of (a) decomposing OWL classes into their constituent property restrictions, and (b) discovering and invoking SADI services based on the properties those services consume/produce. When a query is entered into SHARE, a workflow is dynamically generated in which each query clause, and each OWL class, is matched against a SADI service capable of producing data of the desired type.

OM: Overall, the choice of which ontology to use as the basis for unit representation depends heavily on the specific application and on practical concerns. For instance, a question of the type "what are the practical implications (advantages/disadvantages) of representing quantity kinds as OWL classes versus instances?" has no generic answer, since the best (rather than the "right") answer depends on context. Previous research[152] outlines several applications that can benefit from the formalization of quantities and units, such as:
1. Checking the consistency of units with the corresponding quantity kind
2. Semi-automatic annotation of quantities in legacy data using quantity ranges and specific domain knowledge
3. Checking the dimensional consistency of mathematical formulae
4. Automatic unit conversion and conflict resolution
The focus of the specific experiment conducted in this clinical case study is "automatic unit conversion and conflict resolution for clinical units" (item 4 above), and both QUDT and OM are reasonably well-suited to this task. QUDT is more comprehensive in terms of unit coverage; however, OM (as explained earlier) provides a better conceptualization of compound units. Since compound units are extremely common in clinical measurements, we chose OM as the core unit ontology for the remainder of our experiments.

3.4.3 Clinical data model
We converted the clinical dataset into RDF by extending concepts in GALEN using SIO and OM.
For example, we extend the GALEN concept SystolicBloodPressure into our local concept measure:SystolicBloodPressure in OWL as follows:

measure:SystolicBloodPressure ≡ galen:SystolicBloodPressure
  and ("sio:has measurement value" some ("sio:measurement"
    and ("sio:has unit" some ("om:unit of measure"
      and ("om:dimension" value "om:pressure or stress dimension")))
    and ("sio:has value" some rdfs:Literal)))

In the above, measure:SystolicBloodPressure is explicitly declared to carry a sio:measurement whose unit must be linked to the "pressure or stress dimension" in OM; this guarantees dimension compatibility, i.e., all instances are associated with the "pressure or stress dimension" via the om:dimension relationship. (It should be noted that, although blood pressure is not always measured — it may, for example, be inferred, predicted or calculated — it is generally measured in clinical settings; thus the concepts "blood pressure" and "measured blood pressure" usually denote the same entities in a clinical context.) Figure 3.3 shows the schematic view of the data model for systolic blood pressure.

Figure 3.3 Extending clinical concepts to incorporate numerical quantities.

3.4.4 Semantic services
A series of generic SADI-compliant Web Services was constructed to carry out the major types of measurement-unit conversion for the quantities most frequently used in clinical settings: Pressure, Length, Mass, Temperature, and Concentration. We first focus on "Pressure or Stress" and "Concentration" in OM; the other quantities follow the same principle (with less complexity). The input and output classes for the pressure-conversion service are defined by the following OWL classes:

Input:
"sio:has measurement value" some
  (("sio:has unit" some ("om:unit of measure"
    and ("om:dimension" value "om:pressure or stress dimension")))
  and ("sio:has value" some rdfs:Literal))

Output:
("sio:has measurement value" some ("sio:has unit" value "om:kilopascal"))
and ("sio:has measurement value" some ("sio:has unit" value "om:torr"))
and ("sio:has measurement value" some ("sio:has unit" value "measure:inch of mercury column"))
and ("sio:has measurement value" some ("sio:has unit" value "measure:centimeter of mercury column"))

(Torr is an alternative name for millimeter of mercury column.) We defined the units commonly used in clinical science for pressure that are missing from OM (the units with the prefix "measure", i.e., inch of mercury column and centimeter of mercury column) using OM's framework for defining new units in terms of existing ones. For performance reasons, the output class above was constructed to include only the units most commonly used in clinical settings; however, it can be extended to all units that are "allowed" for representing pressure (e.g., pound-force per square inch).

3.5 Results and discussion
Patient data were converted into RDF format using the schema shown in Figure 3.3. A SPARQL query was then entered into SHARE to extract the high blood pressure measurements from the dataset:

SELECT ?record ?convertedvalue ?convertedunit
FROM <./patient.rdf>
WHERE {
  ?record rdf:type measure:HighSystolicBloodPressureMeasurement .
  ?record sio:hasMeasurement ?measurement .
  ?measurement sio:hasValue ?convertedvalue .
  ?record cardio:ExpertClassification ?riskgrade .
}

where measure:HighSystolicBloodPressureMeasurement is defined in OWL, according to the criteria used by the clinician (Table 3.2), as:

measure:SystolicBloodPressure
  and (sio:hasMeasurement some (sio:Measurement
    and ("sio:has unit" value om:kilopascal)
    and (sio:hasValue some double[>= "18.7"^^double])))

The prefix "measure" is used throughout this chapter to denote our local ontology.
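For illustration, one of the records from Table 3.3 (cm_hg1), expressed according to the data model of Figure 3.3, might look roughly as follows in RDF. The exact SIO and OM property IRIs, the unit and dimension identifiers, and the blank-node structure are our own simplification rather than the precise serialization used in the study:

measure:cm_hg1 rdf:type measure:SystolicBloodPressure ;
    sio:hasMeasurement [
        rdf:type sio:Measurement ;
        sio:hasValue "15"^^xsd:double ;
        sio:hasUnit measure:centimeterOfMercuryColumn
    ] .
measure:centimeterOfMercuryColumn om:dimension om:pressureOrStressDimension .

Because the unit individual carries an om:dimension link to the pressure dimension, the record satisfies the input class of the pressure-conversion service, so SHARE can invoke that service to attach the equivalent value in kilopascal before the reasoner evaluates the ≥ 18.7 kPa threshold.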
Also, note that in this definition we deliberately expressed the guideline in a different unit (kilopascal) from the ones used in our database (mmHg and cmHg).

Table 3.2 American Heart Association classification for systolic blood pressure (adapted from [163])
Classification      Systolic pressure (mmHg)   Systolic pressure (kPa)
Normal              90-119                     12.0-15.9
Pre-hypertension    120-139                    16.0-18.5
Stage 1             140-159                    18.7-21.2
Stage 2             ≥160                       ≥21.3

In evaluating the SPARQL query above, SHARE decomposes the HighSystolicBloodPressureMeasurement class into the constituents defined above. It then examines the unit clause and (a) detects that some of the measurements in its database (in this case, all of them) are not represented in the target unit (om:kilopascal), (b) queries the SADI registry to find a SADI service that provides pressure unit conversion, finds the unit-conversion service and determines that each record fulfills that service's input criterion (by having a unit with the "om:pressure or stress dimension"), and then (c) executes that service on the data prior to completing the query. The offset and coefficient parameters for conversion between the source unit (in the input data) and the target unit (required by the user in the OWL class) are automatically extracted by integrating two functions from OM into the SADI Web Service: GetUnitConversionFactor(source unit, target unit) and GetUnitConversionOffset(source unit, target unit). Table 3.3 shows a small snapshot of the results of the SPARQL query above (the original "Start Unit" and "Start Value" are not retrieved by the SPARQL query and are presented only for the sake of clarity). Using a combination of Semantic Web Services and the Pellet reasoner[99], the system was able to identify all 134 high systolic blood pressure measurements with no false positives. The majority of these records (123) were represented in mmHg, and the remainder in cmHg.

Table 3.3 Units and values before and after conversion (3-digit precision). Record IDs starting with "cm" and "mm" indicate that the pressure value was originally represented in cmHg and mmHg, respectively.
RecordID   Start Val   Start Unit   End Val   End Unit
cm_hg1     15          cmHg         19.998    KiloPascal
cm_hg2     14.6        cmHg         19.465    KiloPascal
mm_hg1     148         mmHg         19.731    KiloPascal
mm_hg2     146         mmHg         19.465    KiloPascal

Although we did not find any misclassifications, the possibility of rounding errors introduced as a result of conversion cannot be ruled out; thus it would be preferable to adjust the conversion engine so that it yields an acceptable value for a specific unit. For example, if we were using mmHg as the reference unit in a clinical setting, the value 129.999 could be converted to 130 with high confidence.

Concentration (Density)
The term concentration most frequently refers to the amount of a substance in a solution. Two major types of units are used to represent concentration in clinical settings, usually to denote the concentration of different chemicals in plasma (e.g., hemoglobin).
The first type represents the mass of substance per unit volume (e.g., gram per litre) and the second type represents the amount of substance (number of moles) per unit volume (e.g., mole per litre). Based on our experience with several clinical datasets, the molar-based units most frequently used for the plasma concentration of chemicals (see Appendix B) are millimole per litre (mmol/L), micromole per litre (µmol/L) and picomole per litre (pmol/L), and the gram-based units most frequently used are gram per litre (g/L), milligram per decilitre (mg/dL) and nanogram per litre (ng/L). Some of these concentration units were not present in the QUDT ontology, possibly because it is not a clinically oriented ontology. (Regarding mmol/L, the SI-based unit "mole per cubic meter" (mol/m3) does exist in the ontology and is numerically equal to mmol/L — both numerator and denominator are multiplied by 1000 — though the use of mmol/L is more appropriate for chemicals in plasma.) We adopted OM's framework (as described above), including prefixes, numerator, and denominator, to define these units in terms of existing ones. Unlike pressure, for which there is only one SI-based unit (the pascal), for concentration both kg/m3 and mol/m3 are considered SI-based derived units. The main distinction between the two is that kg/m3 has a dimension relation to the "density dimension", whereas mol/m3 has a dimension relation to the "amount of substance dimension". Thus the dimensions of mol/m3 (N · L⁻³) and kg/m3 (M · L⁻³) are not compatible. As such, there have been criticisms of the concept of the mole as a unit[164] (formally speaking, the "mole" is not a true metric (i.e., measuring) unit; rather, it is a parametric unit, and amount of substance is a parametric base quantity[164]). The use of molar-based units is generally discouraged; however, these units are routinely used in medicine and physiology, and hence we developed a generic Web Service to handle the conversion. The conversion factor between mmol/L and mg/dL depends on the molar mass of the specific molecule for which the measurement was made; for example, to convert from mmol/L to mg/dL we multiply by 88.57 for triglyceride and by 38.67 for HDL cholesterol. The input and output classes for the service are defined by the following OWL classes:

Input:
"sio:has measurement" some
  (("sio:has unit" some ("om:unit of measure"
    and (("om:dimension" value "om:amount of substance concentration")
      or ("om:dimension" value "om:density"))))
  and ("sio:has value" some rdfs:Literal))

Output:
("sio:has measurement" some ("sio:has unit" value "measure:milligramPerDeciliter"))
and ("sio:has measurement" some ("sio:has unit" value "measure:millimolePerLiter"))
and ...

For performance reasons, we implemented only the units most frequently used in the clinical sciences; however, the framework is easily extensible to include all units. Distinct from the pressure case, the SADI Web Service needs to know the specific molecule (e.g., glucose, HDL cholesterol, etc.) for which the unit conversion is being calculated. We implemented the conversion for the molecules most frequently encountered in clinical data, shown in Appendix B (not for all concepts in the Appendix). Subsequently, analyses similar to those for systolic blood pressure were carried out for cholesterol and HDL classification (Table 3.1), and SPARQL queries were constructed to identify measurements considered to be "high" (e.g., "High Cholesterol Measurement"). For both cholesterol and HDL, the majority of the records classified as "high" by the expert (253 out of 255 for cholesterol and 271 out of 273 for HDL) were originally presented in mg/dL.
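As a worked example of the molecule-specific conversion described above (our own arithmetic, for illustration only), the HDL cholesterol value of patient pt2 in Table 3.1 converts as follows:

1.7 mmol/L × 38.67 ≈ 65.7 mg/dL, and conversely 65.7 mg/dL ÷ 38.67 ≈ 1.7 mmol/L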
Given that the data were predominantly in mg/dL, to further test the accuracy and reliability of our system we deliberately encoded the expert's guideline in mmol/L in OWL. As with systolic blood pressure, the system was able to correctly classify all of the records, and once again rounding errors did not cause any classification errors.

3.6 Conclusion and future work
Busy health researchers should not need to concern themselves with data-integration barriers at this trivial, but frustratingly obstructive, level. Here we have used a combination of semantic standards and frameworks to demonstrate, with a unit-conversion exemplar, that these types of data integration problems can, and should, be dealt with by the machines themselves. By encoding data with semantic transparency, it becomes possible for machines to detect such conflicts and to use semantic systems such as SADI + SHARE to automatically resolve them. The work presented here can be extended in several directions. Firstly, we note that different domains use different measurement-units in practice; there are numerous units that are theoretically possible but are not used in practice (e.g., tonne per cubic centimeter!). We plan to conduct an empirical study to precisely outline the recommended and common units specific to the clinical and biomedical domains; we predict that this will significantly reduce human errors and increase computational performance. Secondly, we note that a large number of measurement units in clinical practice involve more complex patterns than the ones we modeled. For instance, the majority of units used for drug dosage and clearance include temporal elements (e.g., mg/kg/hour for drug dosage) that are not modeled in this study. To the best of our knowledge, such patterns have not been modeled in any existing ontology. As such, we plan to extend our framework to include more complex patterns, together with temporal units and their conversion. Finally, we plan to extend our study to include different datasets from multiple centers and to evaluate the usability of our approach in more complex biomedical scenarios.

4 Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web
(A version of this chapter has appeared in the publication: S. Samadian, B. McManus, and M. D. Wilkinson, "Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web," Journal of Biomedical Semantics, vol. 3, no. 1, p. 6, Jul. 2012.)

"The word 'definition' has come to have a dangerously reassuring sound, owing no doubt to its frequent occurrence in logical and mathematical writings." – Willard Van Orman Quine

4.1 Synopsis
Clinical phenotypes and disease-risk stratifications are most often determined through the direct observations of clinicians, in conjunction with published standards and guidelines, with the clinical expert as the final arbiter of the patient's classification.
While this "human" approach is highly desirable in the context of personalized and optimal patient care, it is problematic in a healthcare research setting because the basis for the patient's classification is not transparent, and is likely not reproducible from one clinical expert to another. This sits in opposition to the rigor required to execute, for example, genome-wide association analyses and other high-throughput studies in which a large number of variables are compared to a complex disease phenotype. Most clinical classification systems are not structured for automated classification, and similarly, clinical data are generally not represented in a form that lends itself to automated integration and interpretation. Here we apply Semantic Web technologies to the problem of automated, transparent interpretation of clinical data for use in high-throughput research environments, and explore migration paths for existing data and legacy semantic standards. Using a dataset from a cardiovascular cohort collected two decades ago, we present a migration path - both for the terminologies/classification systems and the data - that enables rich automated clinical classification using well-established standards. This is achieved by establishing a simple and flexible core data model, which is combined with a layered ontological framework utilizing both logical reasoning and analytical algorithms to iteratively "lift" clinical data through increasingly complex layers of interpretation and classification. We compare our automated analysis to that of the clinical expert, and discrepancies are used to refine the ontological models, finally arriving at ontologies that mirror the expert opinion of the individual clinical researcher. Other discrepancies, however, could not be as easily modeled, and we evaluate what information we are lacking that would allow these discrepancies to be resolved in an automated manner. We demonstrate that the combination of semantically explicit data, logically rigorous models of clinical guidelines, and publicly accessible Semantic Web Services can be used to execute automated, rigorous and reproducible clinical classifications with an accuracy approaching that of an expert. Discrepancies between the manual and automatic approaches reveal, as expected, that clinicians do not always rigorously follow established guidelines for classification; however, we demonstrate that "personalized" ontologies may represent a re-usable and transparent approach to modeling individual clinical expertise, leading to more reproducible science.
4.2 Background
Terminologies and nosologies (nosology is the branch of medicine that deals with the classification of diseases) have long been used by clinicians and clinical researchers as a means of more consistently annotating observations. It is not surprising, then, that the emergence of the Semantic Web found fertile ground in the clinical and life science communities, and formal Semantic Web standards have been rapidly adopted by these communities to migrate existing annotation systems into these modern frameworks and syntaxes. While this largely syntactic migration is a useful exercise, in that it becomes possible to do simple reasoning over manual annotations, it does not enable the full power of modern semantic technologies to be applied to these important biomedical datasets. This is, in part, because these semantic resources continue to be used largely as controlled vocabularies rather than as rich descriptors for logical classification. The Semantic Web languages Resource Description Framework (RDF) [9] and Web Ontology Language (OWL) [1] are the World Wide Web Consortium's recommended standards for semantically explicit encoding of data and knowledge representation on the Semantic Web (respectively), and as such, these were the languages chosen for this study. Given the ability of RDF and OWL to be used to interpret, rather than simply annotate, data, it would be useful to examine the migration path - both for the terminologies and the data - that enables such rich interpretive reasoning to be applied. How do we alter and/or extend existing terminologies such that they can be used to classify clinical data? What modifications to traditional data capture and representation must be made in order to make these data amenable to such logical inferences? Can we replace (or at a minimum, guide) expert clinical annotators in their interpretation of clinical data, and with what level of accuracy can this be achieved? In this report, we explore one such migration path, and discuss our observations and results, as well as the barriers and the resulting manual interventions that were employed to accomplish the goal of creating a reasoned environment for clinical data evaluation and interpretation. We base our exploration on a real-world use case, using clinical data collected and annotated 20 years ago in the context of a study of patient outcomes after various cardiovascular interventions. Heart and blood vessel diseases have a high rate of mortality and morbidity, and pose a significant disease burden on healthcare systems worldwide. In such diseases, asymptomatic biological "diseases" typically precede the clinical manifestation of symptomatic diseases. Most of the time, the development of biological disease into a symptomatic event can be significantly mitigated or prevented through a combination of medication and lifestyle changes. It is widely accepted that several risk factors, including age, sex, high blood pressure, smoking, dyslipidemia, diabetes, obesity and inactivity, are major factors in the development of a variety of heart and blood vessel diseases [165]. To assist with the comparison and interpretation of patient data, clinical researchers have developed guidelines for classifying patients phenotypically into various categories based on a wide variety of raw clinical measurements. For instance, Table 4.1 shows the American Heart Association (AHA) [163] guidelines for phenotypic classification of hypertension based on systolic and diastolic blood pressure observations. Although this classification system appears relatively straightforward, it is important to note that it represents only one of several different classification systems for the same phenotypic phenomenon (systemic hypertension), some of which include the informal expert opinion of the clinicians themselves. As such, the same patient clinical observations might be categorized as "hypertensive" using one standard but categorized as "normal" using a different standard.
This leads to problems when attempting to compare and integrate patient data between studies or even between Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 66 different clinicians/centers in the same study, particularly when the annotation (“normal” versus “hypertensive”) is published in the dataset in lieu of the primary clinical measurements. To complicate matters further, health sciences communities continuously modify and update their guidelines in the light of new biomedical knowledge. For example, Global Initiative for Chronic Obstructive Lung Disease (GOLD) was comprehensively updated in 2006 which lead to different criteria for phenotypic classification with respect to previous years [166]. As such, even data from the same institution may be subject to slightly different interpretations over time. These interpretations become encoded in published datasets and, unfortunately, it is rare for the standards under which an interpretation was made to be rigorously recorded together with that interpretation. This issue leads to potentially erroneous re-interpretation of data, particularly when integrating data over long periods of time, or between disparate institutions. The emergence and uptake of Semantic Web technologies such as OWL and RDF by the Life Sciences, and the ability to use these technologies to enable dynamic classification of data, provides exciting opportunities for exploring novel ways to evaluate the feasibility of doing such clinical annotation dynamically. Table 4.1 American Heart Association classification for systolic/diastolic blood pressure (adapted from [163]) Classification Systolic Pressure Diastolic Pressure mmHg kPa mmHg kPa Normal 90-119 12-15.9 60-79 8.0-10.5 Pre-hypertension 120-139 16.0-18.5 80-89 10.7-11.9 Stage 1 140-159 18.7-21.2 90-99 12.0-13.2 Stage 2 ≥160 ≥21.3 ≥100 ≥13.3 Isolated systolic hypertension ≥140 ≥18.7 <90 <12.0 In this largely methodological study we undertook to create an environment in which “legacy” clinical data and annotation terminologies are modified such that they can be used together to automate the dynamic "on-demand" analysis and logical classification of patients into various cardiovascular disease risk groups under a variety of clinical classification guidelines. Specifically, we undertook a data remodeling process, migrating data from traditional databases and spreadsheets into a graph-based data framework (RDF); we utilize OWL to extend the cardiovascular-specific portion of an existing clinical annotation system namely GALEN[56] such that it can be utilized as an interpretation layer over this patient data; we then created a series of analytical Web Services which Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 67 will be used to execute the statistical analyses of patient data in cases where pure logical reasoning is insufficient for classification; and finally, we executed our automated analyses/classifications, and compared them to the manual annotations done by an expert cardiovascular clinician two decades prior. Any differences were then examined in detail to determine the source of the discrepancy, and we evaluate and discuss our ability to modify the interpretive layers to account for differences between the clinician's manually annotated data and the automated annotations. 
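For readers who prefer procedural pseudocode to tables, the following short Python sketch shows one possible reading of the AHA ranges in Table 4.1. It is illustrative only and not part of the framework described in this chapter; in particular, the rule for combining discordant systolic and diastolic readings, and the treatment of values below the table's lowest row, are assumptions, since the table itself does not state them.

def aha_bp_category(sbp_mmhg: float, dbp_mmhg: float) -> str:
    """Classify a blood-pressure reading against the AHA ranges in Table 4.1.

    Assumption: when systolic and diastolic fall in different rows, the more
    severe category wins; readings below the 'Normal' row are reported as Normal.
    """
    if sbp_mmhg >= 140 and dbp_mmhg < 90:
        return "Isolated systolic hypertension"
    if sbp_mmhg >= 160 or dbp_mmhg >= 100:
        return "Stage 2 hypertension"
    if sbp_mmhg >= 140 or dbp_mmhg >= 90:
        return "Stage 1 hypertension"
    if sbp_mmhg >= 120 or dbp_mmhg >= 80:
        return "Pre-hypertension"
    return "Normal"

# The first record of the dataset (128/80.1 mmHg, Table 4.2) falls in "Pre-hypertension",
# which the study's clinicians treated as "not at risk".
print(aha_bp_category(128, 80.1))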
4.3 Methods 4.3.1 Datasets and data collection The dataset used for this experiment consists of clinical observations of a cardiovascular patient cohort collected from several hospitals in Nebraska, USA from the period from August 1986 to July 1989. A total number of 636 unique patients with a total of 1723 encounters were recorded. The database was originally collected as a part of a study comparing the cardiovascular disease risk- profile changes over a period of one year post procedure/surgery for patients undergoing Coronary Allograft Bypass Graft (CABG) versus those undergoing percutaneous coronary intervention (PCI). An individual's risk can be assessed using available risk-prediction tools such as Framingham( [7],[167]), and Reynolds risk scores[168], which incorporate information on established risk factors such as blood lipids, blood pressure, Body Mass Index(BMI), age, gender, and smoking status. In this dataset two risk-assessment schemes were used to annotate patient data: a binary risk score ("at risk", "not at risk") assigned to individual clinical observations such as blood pressure, and an overall cumulative risk score using the Framingham risk measurement (see results section). The clinical observations used in this analysis were as follows: Age, Gender, Height, Weight, Body Mass Index(BMI), Systolic Blood Pressure(SBP), Diastolic Blood Pressure(DBP) Glucose, Cholesterol, Low Density Lipoprotein (LDL), High Density Lipoprotein (HDL), Triglyceride(TG) As an exemplar, the first row of the data set is shown in Table 4.2. The intended meaning of acronyms for each column header (e.g., SBP for Systolic Blood Pressure) was confirmed with the clinician who owned the dataset. The table contains two types of data: clinical observations (un- shaded cells), and the clinician-assessed binary risk - 1 or 0 for "at risk" or "not at risk", respectively (shaded cells; e.g., HDL GR for High Density Lipoprotein Risk Grade). The final column (RISK GR) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 68 indicates the ternary overall risk assessment - 1 for low, 2 for moderate, and 3 for high risk - which the clinician indicated to us was based on the Framingham Risk Score algorithms. Table 4.2 Part of the first row of dataset used in Microsoft excel sheet SBP DBP TOTALCHOL HDL TG AGE GENDER HEIGHT WEIGHT TG GR HDL GR LDL GR CHOL GR BMI GR DBP GR SBP GR RISK GR 128 80.1 227 55 84 77 M 1.8288 78.1818 0 0 0 1 0 0 0 1 4.3.2 Overview of the approach In 2005 our group (Wilkinson et al.) proposed a semantic data classification architecture in which raw clinical measurements would be "lifted" through increasingly conceptual/interpretive layers of ontologies in order to complete an analysis, evaluation, or query[169] . This would be achieved through a combination of logical reasoning over the data and ontologies, in parallel with the discovery of Web Services that aggregated and analyzed the data, thereby dynamically identifying individuals logically compliant with the ontological classes at each layer. This hybrid approach is necessary because (useful) OWL reasoning is limited to a decidable fragment of first-order logic - effectively, it is possible to define the conditions under which an individual would be a member of a particular set/category, and it is possible to infer through a series of logical statements about the data, whether those conditions exist for a particular data record. 
However, while it is possible to infer that particular data properties must exist as a logical consequence of the existence of other data properties, it is not possible to derive data through algorithmic calculations using OWL reasoning alone. For these cases, we have written and published a series of Semantic Web Services that consume clinical data, execute various algorithmic analyses on them, and then return the dataset with new, derived data properties attached. These derived properties can then be used by the OWL reasoner to further classify the clinical data and "lift" it into increasingly complex clinical phenotypic categories. While this approach is not reliant on any additional technologies for its success, one of our secondary goals in undertaking this project was to demonstrate that certain frameworks and practices established by our group could be used, with very little effort, to automate this interaction between OWL models and analytical Web Services. This automation reduces the complexity of analysis and evaluation of clinical data for the end-user. While the iterative process of reasoning, identification of appropriate analytical algorithms, execution of those algorithms, re-integration of output data, and re-reasoning Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 69 could be done entirely manually (as would be the current practice), automating the "semantic lifting" process is enabled by two recently published pieces of technology - Semantic Automated Discovery and Integration (SADI) [11]and the Semantic Health And Research Environment (SHARE)[59][170] explained previously in chapter 2. When an ontological concept is present in a SHARE query, it will exhaustively "decompose" that concept into its complete set of property restrictions, importing any additional ontological classes as necessary. Once "decomposed", it then utilizes SADI to discover and execute services capable of creating those properties based on any data SHARE already has in its database. Figure 4.1 provides a diagrammatic representation of the "semantic lifting" process. By referring to an ontological concept in the SPARQL query (Layer 4 in the diagram), raw data is "lifted" through the ontological layers via an iterative process of reasoning, service discovery, and execution. This is our first attempt to deploy such an architecture over bona fide clinical data. Figure 4.1 Card ioSHARE architecture: Increasingly complex ontological layers organize data into more abstract concept Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 70 In order for this approach to be successful, we must first migrate the legacy data, and any legacy terminologies, into a more rigorous logical framework that is capable of being interpreted by OWL reasoners. We will now describe that process in detail. 4.3.3 Ontologies used GALEN We selected GALEN[158] as our core Ontology describing cardiovascular concepts (see chapter 3) SIO For reasons mentioned in chapter 3[161] ,SIO[160] is our ontology of choice for representing relationships (properties) in clinical data. OM When extracting datasets from disparate sites, particularly over international boundaries, it is not uncommon for the de facto unit of measurement to be different for any given clinical observation. Therefore, we must define a practical approach that allows clinical measurements to carry different units while not sacrificing interoperability. 
The lack of a standard approach to representing measurable quantities in RDF has led to different configurations being used in different ontologies and RDF data repositories. Since this was a potential problem in our analysis, and is a significant problem in science generally, we created a series of publicly accessible Semantic Web Services capable of automatically detecting when unit conflicts exist in an aggregated dataset, and of automatically resolving those conflicts to whichever canonical measurement unit is desired. As mentioned in chapter 3, OM [152] proposes a clear and convenient framework for defining new units of measurement in terms of existing ones, and this was used to derive any units required by our investigation that were not explicitly defined by those ontologies (please refer to chapter 3 for details).
4.3.4 Ontological mapping, extensions, and algorithmic services
The set of OWL classes required to describe our dataset is as follows: Age, Sex, Mass, Height, BodyMassIndex, SystolicBloodPressure, DiastolicBloodPressure, BloodSugarConcentration, SerumCholesterolConcentration, SerumLDLCholesterolConcentration, SerumHDLCholesterolConcentration, SerumTriglycerideConcentration.
We explored GALEN to search for the cardiovascular concepts listed above, and found it to be sufficiently comprehensive in terms of coverage of these concepts; however, there were some minor differences in terminology between the labels in our dataset and the GALEN terms. For example, the term SerumHDLCholesterol appears in GALEN, while the acronym HDL was used in our clinical dataset. Similarly, the concept Glucose exists in GALEN, while BloodSugarConcentration was the label applied to the (semantically) equivalent measurement in our clinical dataset. Such discrepancies were manually mapped based on consultation with expert clinicians, using their preferred labels. Our intent was to select the label/class-name that best semantically described the intended meaning of the concept; while we admit that this approach is somewhat arbitrary, we could think of no way to reliably automate these mappings. The concept Height did not exist in GALEN, though the class Length did; to avoid over-loading the semantics of the existing GALEN class, we defined a new class Height, and made this a subclass (owl:subClassOf) of GALEN's Length. Our proposed "layered" semantic framework requires us to identify concepts which are "core" (based on direct observations - Layer 1 of Figure 4.1), and concepts which are "derived" (based on calculations over core observations - Layer 2 and higher). For instance, current lipid measurement protocols do not generally measure LDL particles directly, but instead estimate them using the Friedewald equation [171]:
L = C - H - kT (2)
where H is HDL cholesterol, L is LDL cholesterol, C is total cholesterol, T is triglycerides, and k is 0.20 if the quantities are measured in mg/dL and 0.45 if they are measured in mmol/L. Thus, triglycerides and cholesterol are "core" measurements, while LDL is a "derived" measurement. Similarly, BMI is calculated from a relationship between height and mass, and would be considered "derived".
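Both "derived" quantities can be computed directly from the core measurements. The following minimal Python sketch (illustrative only; the deployed implementations were SADI Web Services) shows the Friedewald estimate of Equation 2 and the BMI calculation, applied to the values of the first record in Table 4.2.

def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float,
                   unit: str = "mg/dL") -> float:
    """Estimate LDL cholesterol via Equation 2: L = C - H - k*T,
    with k = 0.20 for mg/dL and 0.45 for mmol/L."""
    k = {"mg/dL": 0.20, "mmol/L": 0.45}[unit]
    return total_chol - hdl - k * triglycerides

def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = mass / height^2 (kg/m^2), the other 'derived' measurement."""
    return weight_kg / height_m ** 2

# First record of Table 4.2: total cholesterol 227, HDL 55, TG 84 (mg/dL);
# height 1.8288 m, weight 78.1818 kg.
print(round(friedewald_ldl(227, 55, 84), 1))        # ~155.2 mg/dL
print(round(body_mass_index(78.1818, 1.8288), 1))   # ~23.4 kg/m^2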
We manually examined the protocols for obtaining the measurements in our dataset and consider the following GALEN classes to represent "core" measurements: Age, Sex, Mass, Height, SystolicBloodPressure, DiastolicBloodPressure, BloodSugarConcentration, SerumCholesterolConcentration, SerumHDLCholesterolConcentration, SerumTriglycerideConcentration. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 72 We henceforth will refer to these Classes as the "Grounding Classes" - classes whose members will be directly tied to the dataset through explicit declaration of a piece of data as being a member of that Class. The OWL definition of each Grounding Class was created by extending the corresponding GALEN Class definition to include the defining features of the "Attribute" OWL Class from SIO; thus all members of these classes will (logically) be both GALEN individuals, and SIO Attributes. This involved adding axioms for the SIO properties hasMeasurement, hasUnit and hasValue to the GALEN class definitions. The example below shows how GALEN class for Systolic Blood Pressure is extended using external classes and properties (the prefix before “:” shows the ontological namespace of each entity; the prefix "cardio" is the namespace used to indicate the ontological classes we have defined) cardio:SystolicBloodPressure: Galen:SystolicBloodPressure and (sio:hasMeasurement some cardio:pressuremeasurement) cardio:pressuremeasurement: sio:measurement and (sio:has unit some om:unit of measure) and (om:dimension value om:pressure or stress dimension) and has value some Literal) The remainder of the measurements in our clinical dataset are "derived", based on calculations performed over the core measurements, and their corresponding "Derivative Classes" in GALEN are: SerumLDLCholesterolConcentration, BodyMassIndex The class definition for these was generated using the same approach as for the Grounding Classes; however, since members of these Derivative Classes can only be determined through algorithmic analysis of 'core' measurements, we also created a set of SADI Semantic Web Services that expose the necessary algorithms, consuming members of the relevant Grounding Classes, and generating Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 73 members of the Derivative Classes in response. Thus, data from Layer 1 can be "raised" into Layer 2 Classes (and above) through invocation of these algorithmic services. 4.3.5 Refactoring the legacy dataset Assumptions about data collection and measurements Since the exact protocols describing how the clinical observations were made were not available, we made the assumption that they were derived from the most common measurement protocols. For example, for blood pressure measurements, we assumed that the measurements were made in a clinical setting (as opposed to casual home monitoring), using conventional mercury manometers applied on the left arm. The units used for each measurement were not explicitly stated in the datasets (table 4.2) itself, so we made a best-guess based on the range of the measurement values and confirmed those with clinical experts. The units used to represent measurements are shown below. 
Height: meter Weight: kilogram BodyMassIndex: kilogram per square meter SystolicBloodPressure, DiastolicBloodPressure: millimeter of mercury column SerumHDLCholesterolConcentration, SerumLDLCholesterolConcentration, SerumTriglycerideConcentration, SerumCholesterolConcentration, BloodSugarConcentration : milligram per deciliter Data schema Our primary objective in designing an ontological model to represent the clinical data was to support dynamic re-interpretation of that data under a variety of different hypothetical scenarios (e.g., re- interpretation as analytical or classification standards change over time). Importantly, it was not our intention to design a data model with sufficient complexity to represent every aspect of a clinical record; rather we were focused on modeling individual clinical measurements in a way that would allow them to be automatically analyzed and interpreted. Constraining ourselves to modeling only this small aspect of the clinical record should, we believe, allow existing comprehensive clinical record models to be easily adapted to the framework we propose here. Figure 4.2 shows the schematic view of the data model, described as follows. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 74 Figure 4.2 Data schema using concepts in legacy Ontologies. The additional features shown on the Mass class are present on all classes in that row, but are hidden to improve readability 1. We defined a class "PatientRecord", as a subclass of the SIO "record" class. PatientRecord will include all of the observations about a patient, keeping in-mind that we considered each patient- encounter to be a different patient record for the purposes of this study (i.e., the longitudinality of the data was not considered). 2. Patient clinical observations were divided into Grounding Classes and Derived Classes as described above, and were modeled as owl:Individuals of these classes, with the corresponding unit and value attached by the SIO hasUnit and hasValue properties. 3. Each resulting Grounding Class member was attached as an attribute of the PatientRecord using the hasAttribute property from SIO. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 75 Finally, methodologies described in chapter 3, we defined the units kilogram, kilogram-per-meter- squared, millimeter-of mercury-column, milli-gram-per-deci-liter, and milli-mole-per-liter, which were used for various individual studies as described in the Results section. 4.3.6 Approach to binary patient classification (“at risk” versus “not at risk”) In our dataset the clinical researchers used a binary system to classify patients as being "at risk" or "not at risk" based on each of the following measures: blood pressure, cholesterol, HDL, LDL, triglycerides, and BMI. Thus, we created OWL classes representing each of these categories - for example, "HighRiskSBPRecord" to represent patient records reflecting a high-risk score with respect to systolic blood pressure, and "LowRiskSBPRecord" to represent patient records reflecting a low- risk score with respect to Systolic Blood Pressure. These categories would then be used in SHARE SPARQL queries to trigger data "lifting", and to compare the result of the resulting automated categorization of patient records with the expert annotation of the clinical researchers two decades ago. 
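A minimal sketch of what one patient-record fragment looks like under this schema is given below, written in Python with the rdflib library. The namespaces and property names are illustrative stand-ins for the SIO, GALEN-derived "cardio", and OM terms named in the text; the actual URIs used in the study differ.

# pip install rdflib
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

SIO = Namespace("http://example.org/sio#")        # stand-in for the SIO namespace
CARDIO = Namespace("http://example.org/cardio#")  # stand-in for our "cardio" namespace
EX = Namespace("http://example.org/data/")        # example data namespace

g = Graph()
record = EX["patient1-encounter1"]
sbp = EX["patient1-encounter1-sbp"]
measurement = EX["patient1-encounter1-sbp-measurement"]

# 1. The patient encounter is a PatientRecord (a subclass of the SIO 'record' class).
g.add((record, RDF.type, CARDIO.PatientRecord))

# 2. The systolic blood pressure observation is an individual of a Grounding Class,
#    with its unit and value attached through an SIO-style measurement node.
g.add((sbp, RDF.type, CARDIO.SystolicBloodPressure))
g.add((sbp, SIO.hasMeasurement, measurement))
g.add((measurement, SIO.hasUnit, CARDIO["millimeter-of-mercury-column"]))
g.add((measurement, SIO.hasValue, Literal(128.0, datatype=XSD.double)))

# 3. The observation is attached to the record as an attribute.
g.add((record, SIO.hasAttribute, sbp))

print(g.serialize(format="turtle"))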
Working through one example in detail - Table 4.3 shows the American Heart Association's classification of systolic and diastolic blood pressure values. Although they indicate five different ranges (Normal, Prehypertension, Stage 1 hypertension, etc.) the clinical researchers who generated our dataset had only two categories - "at risk", and "not at risk". Through discussions with the researchers, they indicated that they considered Normal and Prehypertension to be "not at risk" (in Table 4.3) and all other categories to be "at risk". In Tables 4.3 through 4.6 the light shaded area represents “low risk” whereas the dark shaded area represents the “high risk” groups as defined by the guidelines. As such, we modeled an ontological class "HighRiskSBPRecord" in OWL as follows: HighRiskSBPRecord = cardio:PatientRecord and (sio:has Attribute some (cardio:SystolicBloodPressure and sio:hasMeasurement some (sio:Measurement and (sio:has unit value cardio:milli-meter-of-mercury-column) and (sio:has value some double[>= "140.0"^^double])))) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 76 Table 4.3 American Heart Association classification for systolic and diastolic blood pressure [163] Classification Systolic Pressure Diastolic Pressure mmHg kPa mmHg kPa Normal 90-119 12-15.9 60-79 8.0-10.5 Pre-hypertension 120-139 16.0-18.5 80-89 10.7-11.9 Stage 1 140-159 18.7-21.2 90-99 12.0-13.2 Stage 2 ≥160 ≥21.3 ≥100 ≥13.3 Isolated systolic hypertension ≥140 ≥18.7 <90 <12.0 In tables 4.3 through 4.6 the light shaded area represents “low risk” whereas the dark shaded area represents the “high risk” groups as defined by the guidelines. In a somewhat different scenario, Table 4.4 shows the American Association risk stratification for cholesterol, HDL and triglycerides, each of which has three categories - high, medium, and low - compared to our clinician's binary categorization of high and low. As above, we attempted to create OWL classes to model these risks; however, in this case we had no guidance from the clinician as to what to do with intermediate measurements, as their original policy had not been recorded. As such, in our initial analysis, we defined "high risk" and "low risk" records as being congruent with the high and low risk categories of the official guidelines, and ignored all data in the intermediate category. We describe how we modified these models, and our ability to determine the actual clinician's risk threshold, in the results section. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 77 Table 4.4 American Heart Association classification for cholesterol, HDL, and trig lycerides([172],[173]) Level (mg/dl) Level (mmol/L) Interpretation Cholesterol <200 <5 Desirable level corresponding to lower risk 200-240 5.2-6.2 Borderline high risk >240 >6.2 High risk Level (mg/dl) Level (mmol/L) Interpretation HDL <40 for men, <50 for women <1.03 Low HDL cholesterol, heightened risk 40-59 1.03-1.55 Medium HDL level >60 >1.55 High HDL level, optimal condition Level (mg/dl) Level (mmol/L) Interpretation Triglyceride <150 <1.69 Normal Range: low risk 150-199 1.70-2.25 Borderline high 200-499 2.26-5.65 High >500 >5.65 Very high: high risk Modeling BMI and LDL risks were slightly more complex, since these two measurements are derived by algorithmic analysis of one or more 'core' measurements. 
BMI is calculated using a person’s weight and height (Equation 1) [174], and the guidelines were modeled in OWL following the American Heart Association guidelines in Table 4.5. The resulting OWL classes representing BMI and HighRiskBMI measurements respectively were as follows: cardio:BodyMassIndex = galen:BodyMassIndex and (sio:hasMeasurement some cardio:measurement) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 78 cardio:measurement = sio:measurement and (sio:hasUnit some om:unit of measure and (om:dimension value om:unit of area density) sio:hasValue some Literal) HighRiskBMI= PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some (sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[>= 25.0])))) Table 4.5 American Heart Association classification for BMI [30] BMI (kg/m 2 ) Category Below 18.5 Underweight 18.5 to 24.9 Healthy weight 25.0 to 29.9 Overweight 30 to 39.9 Obese 40 and above Morbidly obese The schematic diagram of the SADI Web Service interface for BMI calculation is shown in Figure 4.3. The input and output of the Service are as follows (sample data, and instructions on how to send this data to the SADI service, are provided in Appendix C and supplementary materials[175]). Input: (sio:hasAttribute some cardio:Height) and (sio:hasAttribute some cardio:Mass) Output: sio:hasAttribute some (cardio:BodyMassIndex and (sio:hasMeasurement some ( sio:hasMeasurement and (sio:hasUnit value cardio:kilogram-per-meter-squared and (sio:hasValue some Literal))) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 79 Figure 4.3 The schematic diagram of the SADI Web Serv ice interface to the BMI calcu lation Service. The property-restriction imposed on the output, when detected by SHARE, triggers the discovery and invocation of the Service that attaches the BMI class with appropriate units and value properties attached to it. Subsequently, we calculated LDL in a similar fashion using SADI-compliant Semantic Web Services. LDL is calculated based on HDL measurements via the Friedewald equation (Equation 2[171]) and the guidelines were modeled in OWL following the guidelines in Table 4.6. Note that the Friedewald equation includes a constant that is sensitive to the units HDL is measured in; however since we are explicitly declaring and automatically converting units, this service is able to automatically determine which is the correct constant to use in every case. Table 4.6 LDL guidelines[172] Level (mg/dl) Level (mmol/L) Interpretation <129 <3.3 Desirable level 130-159 3.3-4.1 Borderline high risk >160 >4.1 High risk 4.3.7 Approach to ternary risk assessments In addition to binary risk assessments described above, we wish to determine whether more complex clinical phenotype and risk classifications can be automated using the same infrastructure. For example, some clinicians seek to estimate the probability of a patient developing a certain type of cardiovascular disease within a specific period of time. Researchers have developed a variety of Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 80 algorithms for estimating a patient’s statistical probability of death (from cardiovascular disease) or of developing a variety of cardiovascular diseases, with one of the most widely adopted being the Framingham Risk Scores ([165], [7], [167]). 
There are a number of different Framingham risk scores centered around different cardiovascular diseases (e.g., Congestive Heart Failure versus Atrial Fibrillation), the period of time under which the risk assessment is calculated (e.g., 5 year versus 10 year risk), and the precise Framingham standard used. For instance, the same patient clinical observations might be categorized as “high risk” using Canadian Standards, but categorized as “medium –high risk” using American or European Standards. To test our ability to automatically classify patients into complex risk-stratification models such as Framingham, we created OWL models of the Framingham Risk Scores for General Cardiovascular Disease in Men [176]. Table 4.7 shows the scoring framework proposed by the Framingham study to calculate the estimated risk score for “General Cardiovascular Disease” in men based on the mean values for clinical observations. Similar tables exist for women and other cardiovascular diseases including Arterial Fibrillation, Congestive Heart Failure, Coronary Heart Disease, General Cardiovascular Disease, Hard Coronary Heart Disease, Intermittent Claudication, Recurring Coronary Heart Disease, and Stroke after Atrial Fibrillation. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 81 Table 4.7 Estimated risk of general cardiovascular disease in men[176] Points Age, y HDL Total Cholesterol SBP Not Treated SBP Treated Smoker Diabetic −2 60+ <120 −1 50-59 0 30-34 45-49 <160 120-129 <120 No No 1 35-44 160-199 130-139 2 35-39 <35 200-239 140-159 120-129 3 240-279 160+ 130-139 No 4 280+ 140-159 Yes 5 40-44 160+ 6 45-49 7 8 50-54 9 10 55-59 11 60-64 12 65-69 13 14 70-74 15 75+ In our dataset, the clinician had annotated the records with three scores: “high risk”, “low risk” and “moderate risk”. For this study, we only considered the records of male patients and records with no missing values for the various observations required to make a risk evaluation. The conventional classification used in the Canadian health care system is based on three levels of quantization (0–9: low Risk, 10–19: Medium risk, > = 20: High risk) over the accumulated individual risk score (Table 4.8). Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 82 Table 4.8 10-year risk for general CVD by total Framingham Risk Score[176] Total Points 10-Year Risk <9 <1 % 9 1 % 10 1 % 11 1 % 12 1 % 13 2 % 14 2 % 15 3 % 16 4 % 17 5 % 18 6 % 19 8 % 20 11 % 21 14 % 22 17 % 23 22 % 24 27 % 25 or more ≥30 % The input and output classes for SADI Web Service to calculate the Framingham Risk Score are defined as follows: Input: PatientRecord and (sio:hasAttribute some cardio:Age) and (sio:hasAttribute some cardio:SerumCholesterolConcentration)and (sio:hasAttribute some cardio:SerumHDLCholesterolConcentration) and (sio:hasAttribute some cardio:SystolicBloodPressure) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 83 Output: sio:hasAttribute some (GeneralCVDFraminghamRiskScore and (sio:hasValue some Literal)) Since an OWL class representing Risk Score did not exist in any of the Ontologies we were using, we defined a class named RiskScore and a second, GeneralCVDFraminghamRiskScore, which is a subclass of the former. 
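Once the total Framingham points have been computed by the SADI service, the ternary classification itself is a simple quantization. The sketch below (illustrative only) encodes the three-level Canadian scheme described above; the point totals shown are arbitrary examples.

def framingham_ternary_category(total_points: int) -> str:
    """Quantize a total Framingham risk score into the ternary classification
    used in the Canadian health care system: 0-9 low, 10-19 medium, >=20 high."""
    if total_points >= 20:
        return "high risk"
    if total_points >= 10:
        return "medium risk"
    return "low risk"

for points in (7, 15, 22):    # arbitrary example totals
    print(points, framingham_ternary_category(points))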
4.4 Results 4.4.1 Evaluation of automated binary risk classification Evaluation of our ability to dynamically reproduce the original clinical classifications, using the approaches described above, was undertaken as follows: In the dataset, when the clinician had indicated the patient was "at risk" for a given type of observation, this was represented as a numeric "1", while if they indicated the patient was not at risk, we represented this as a numeric "0". We then used our "HighRisk" and "LowRisk" OWL Classes in SPARQL queries, calling-up the clinician- annotated numerical score in the same query. For each HighRisk query, we would expect the clinicians score to be "1" in all cases if our automated analysis is functioning correctly, and should be "0" in all cases for the LowRisk queries. Figure 4.4 shows two queries for SBP measurements and their clinician-assigned risk grade, together with a screen-shot of the abbreviated output for each query. If the system is calculating risk correctly, then all results of the query for high risk (Figure 4.4A) should be assigned a score of "1" by the clinician, and similarly the results of the query for low risk (Figure 4.4B) should be assigned a score of "0". Similar queries were issued for DBP, Chol, HDL, TG, and BMI attributes. Table 4.9 shows the comparison between manual and automatic risk classification for all attributes in the dataset. In most cases, our automated analysis of the data was entirely concordant with the expert annotations of the clinician; however, there were several cases of discrepancy as discussed in the next section. More detailed query/result pairs, plus before/after categorization data for all clinical observations can be found in supplementary materials[175]. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 84 Figure 4.4 SPARQL queries (Prefixes not shown) followed by a small snapshot of the results for automatic classification of patients into “high risk” (A) and “low risk” (B) for Systolic Blood Pressure. Note that, because of unit conversion layer, the units used to model the guideline may or may not be the same as the unit used to model clinical data. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 85 Table 4.9 Comparison between manual and automatic b inary risk classifications according to the degree of fidelity of automatic classification to that of the expert. True positive rate “at risk” % False positive rate “at risk”% SBP 100 0 DBP 100 0 CHOL 92.6 0 HDL 100 56.5 TG 100 8.5 BMI 100 18.8 LDL 100 0 4.4.2 Discrepancies between automated and expert binary classifications Systolic and diastolic blood pressure risks Classifying patients as being “high risk” or “low risk” based on blood pressure was consistent with manual curation of experts in every case. LDL Similar to SBP and DBP, for LDL manual and automatic classifications were consistent. Total cholesterol risk Some patient risk classifications differed between our automated analysis and the expert annotations. In each case, the risk score fell between 5 and 5.2. Interestingly, in the American Heart Association guidelines (Table 4.4) there is a gap in their measurement-continuum, resulting in a lack of any interpretation-guidance for measurements between 5 and 5.2. Our automated analysis therefore revealed that the clinical expert had compensated for this gap by assigning these measurements to the "low risk" category. 
By modifying our OWL model to change the low risk cut-off level from 5 to 5.2, we were then able to achieve perfect correspondence with the clinical expert; moreover, this correspondence shows that the clinician had used this 5.2 boundary as their upper limit for low-risk when undertaking their binary classification. The original and modified guidelines in OWL are shown below. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 86 Original AHA guideline in OWL: HighRiskCholesterolRecord= PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0])))) LowRiskCholesterolRecord= PatientRecord and (sio:hasAttribute some (cardio: SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[< 5.0])))) Modified model: HighRiskCholesterolRecord= PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2])))) LowRiskCholesterolRecord= PatientRecord and (sio:hasAttribute some (cardio: SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[< 5.2])))) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 87 HDL and triglyceride risk Having no guidance on how to build the model in these cases where the clinical (binary) classification had no correspondence to the three or four level categorization system of the official guidelines, we first modeled the extreme cases (high/low, ignoring borderline/medium categories), expecting to find complete congruence with the expert annotation at least for these patients. Surprisingly, in neither case did our automated categorization match the expert clinical categorization. We determined (by manual inspection) that in these cases the clinician did not follow the guideline boundaries for their binary classification; rather, they "invented" boundaries reflecting their personal opinion of risk. In the case of HDL, the boundary was well under the official lower limit (0.89 mmol/L compared to the official boundary of 1.03 mmol/L), whereas for triglyceride measurements the clinician chose a cutoff between the guidelines range for "High" risk (2.26-5.65 mmol/L). The original OWL models and the adjusted OWL models are represented in supplementary section provided in Appendix C. The adjusted models provided perfect correspondence with the expert clinical classification when used in our automated framework. Body Mass Index risk Similarly, we determined from our results that the guideline used by the expert in his classification was more relaxed than the AHA guidelines. By changing the threshold in our OWL class definition from 25 to 26, we were able to achieve perfect correspondence with the expert’s annotations. It is important to point-out, with respect to this measurement, that we did not modify the analytical Web Service in order to achieve this correspondence - only the OWL model needed to be adapted to match the interpretation of the clinical expert. The significance of this point will be discussed later. 
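In this pilot study the clinician's implicit cut-offs were found by visual inspection, but the same boundaries could in principle be estimated directly from the labelled records. The following toy Python sketch (illustrative only; the BMI values and labels are invented) shows one naive way to recover such an implied threshold when the expert's binary labels are perfectly separable, foreshadowing the data-mining approaches of later chapters.

def implied_cutoff(values, labels):
    """Estimate the cutoff an annotator implicitly used for a binary risk label.

    values: numeric clinical measurements (e.g. BMI); labels: the expert's 0/1
    risk grades. Returns the midpoint between the highest value labelled 0 and
    the lowest value labelled 1, assuming the labels separate cleanly.
    """
    low = [v for v, label in zip(values, labels) if label == 0]
    high = [v for v, label in zip(values, labels) if label == 1]
    return (max(low) + min(high)) / 2.0

bmi = [22.4, 24.9, 25.3, 25.8, 26.1, 27.4, 31.0]   # hypothetical measurements
grade = [0, 0, 0, 0, 1, 1, 1]                      # hypothetical expert labels
print(implied_cutoff(bmi, grade))                  # ~25.95, i.e. a cutoff near 26 rather than 25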
Below is the BMI guideline in OWL before and after the modification. Original AHA guideline in OWL: HighRiskBMIRecord= PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[>= 25.0])))) Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 88 LowRiskBMIRecord= PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[< 25.0])))) Modified model: HighRiskBMIRecord= PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[>= 26.0])))) LowRiskBMIRecord= PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[< 26.0])))) Evaluation of automated ternary risk classification Figure 4.5 shows the SPARQL query which automatically classifies patient records into the "moderate risk" Framingham guidelines OWL model, compared with the annotations done manually by experts (see supplementary materials[175] for other Framingham guideline query/result pairs). Below this is an abbreviated table of exemplar query output specifically showing rows of discrepancy which are of particular interest for discussion. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 89 Figure 4.5 SPARQL queries and a small snapshot of the results for automatic classification of patients into “High Risk”, “Medium Risk” and “Low Risk”, respectively . 4.4.3 Discrepancies between automated and expert ternary classifications Framingham risk No expert-annotated “high risk record” was classified as “low risk record” by automatic classification or vice versa; only the “moderate risk records” were differentially-classified by our automated approach compared to the clinical expert classification. Interestingly, however, the automated interpretations included both higher- and lower-risk classifications compared to the expert annotations. As can be seen in the first two rows of the Figure 4.5 results table, the same calculated Framingham risk-score of 15 was classified as being “low risk” and “medium risk” respectively by the expert clinician, while the "medium risk" score of 19 was classified as "high risk" by the expert in some cases. After trouble-shooting the code and the ontological definitions, we examined the scores to determine if, as with the binary classifications above, it would be possible to improve our performance by relaxing or tightening certain constraints in the OWL class definitions, however we determined that this was not possible. This suggests that other factors, not captured by the guidelines, have led the clinical expert to select one or the other risk category for any given patient, even given the same scores. In discussions with the clinician, we learned that the patients were under varying regimes of pharmaceutical blood pressure treatment, and that this would have affected their risk- Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 90 assessment. 
The subsequent chapter of this thesis details how we attempted to semantically encode these pharmaceutical interventions in order to more accurately model clinical risk. In the next chapter, we semantically model and add drug treatment regimens to the patient's profiles and our risk models to determine if this is sufficient to resolve all cases of mis-classification by our automated system or if there remain yet additional factors that are being used by clinicians to make their risk assessment. Regardless, we may not ever be able to determine, with any certainty, the bases for the original risk classifications, and this is an important point for discussion. 4.5 Discussion 4.5.1 Interpreting discrepancies between automated and manual risk classification The data in our study - in particular, the risk classifications of the patients - were not used for the purpose of selecting an intervention in the course of the patient's clinical care. We presumed that clinical researchers would use existing published guidelines for categorization in the course of their clinical research, but this was not necessarily a valid premise. In our results, we note a variety of discrepancies between our initial OWL models' rigorous adherence to published clinical standards, and the evaluation and phenotypic classification by the expert clinical researcher. Some of these were due to missing data in the guidelines themselves, where we were able to, with reasonable confidence, guess what the clinician's interpretation of the guideline was and model that interpretation. Others were due to the researcher "bending" the guidelines either to match their personal beliefs, or because it was more appropriate for the research question they were asking. We were similarly able to accurately modify the OWL models to match the clinician's expert opinion in many of these cases. Some cases, however, have so far eluded our ability to capture, in OWL, what the intent or rationale of the researcher was. Nevertheless, assuming that the decisions are not "arbitrary", we are confident that with further study we will be able to construct OWL models that correspond to these complex clinical interpretations. Moreover, while in this pilot study, we manually modified the cutoff levels of the OWL models after visual inspection of the data, our subsequent studies (described in later chapters of this thesis) undertake to determine these boundaries using data- mining and machine-learning approaches, thus this should not be considered an insurmountable weakness of the current work. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 91 That experts (at least in our case, but we believe it is likely to be true for many cases) do not strictly follow published guidelines when classifying patients in their clinical research is, in itself, not surprising; however, it does have implications for both reproducibility of clinical studies, as well as the accuracy and interpretation of statistically sensitive high-throughput studies such as GWAS. Potential factors that influence experts to deviate from guidelines may include clinical observations outside of those that make up the guideline, or other non-clinical yet measurable/detectable features. 
Regardless, it is important for reproducibility and rigor that experimental methods be fully explained and detailed, yet at the same time it is undesirable (in fact, likely impossible) to force clinical researchers to follow guidelines which go against their expert beliefs. As such, a middle-ground is needed where experts retain their "personalized" classification system, and yet have this system formally encoded in a transparent, publishable, and re-usable manner. In this study, we demonstrate that the semantic modeling approach we advocate here provides re- usable, rigorous models which are nevertheless flexible, allowing individual, personalized expert- knowledge to be encoded, published, and shared. Moreover, these rigorous yet personalized ontological models can be used to drive the automated analysis of data, removing the individual from the analytical process. This is important because "analysis tweaking", based in human intervention in the analytical or interpretative process, historically has gone unrecorded and thus led to non- reproducible science[177]. Our approach, while not preventing the expert from imposing their own interpretation on the data, ensures that in order to "tweak" the analysis, such an intervention must be made explicit in their ontological model; moreover, the resulting ontology can be published together with the study results to ensure transparency. Not only does this facilitate reproducibility of the study by making the personal expert opinion/interpretation accessible to other researchers, but it also allows explicit and accurate comparison between the formally-encoded expert opinions of a diverse community of clinical researchers, and the ability to use a third-party interpretation to investigate your own data - i.e. the ability to "see your data through the eyes of another". We think this is a powerful new approach to transparent and reproducible clinical research, where ideas and interpretation-regimes are explicitly recorded, shared, and compared. Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 92 4.5.2 Broader implications of "personalizing" OWL ontologies Our use of OWL in this study differs markedly from the norm in the biomedical community, where ontologies are used primarily to compel harmonization around a particular world view, thus facilitating cross-study comparisons by dis-allowing individual opinion or interpretation. In contrast, we began from the perspective that individual clinical researchers would insist upon their authority, as experts, to classify patients in whatever way they thought was correct for a particular study, and would resist forced adherence to guidelines (in fact, in some cases it is the guidelines themselves that are the topics of investigation and evaluation). Indeed, we demonstrate in this study that clinicians frequently deviate from established clinical guidelines, yet we also demonstrate that OWL classes can be constructed to model an individual clinician's expert perspective, thereby making their interpretation transparent, and re-usable in a rigorous manner. Most importantly, however, our inability in some cases to accurately reproduce the interpretation of the expert post facto, even after manually re-modeling the guidelines, shows the danger of not capturing these personal, expert perspectives in some formal framework such as OWL at the time the experiment is being run. 
These ontologies, representing individual perspectives on how data should be interpreted, resemble in silico hypotheses - the belief system of the individual undertaking the study, which may or may not be correct and/or shared by any other researcher. In this study, we demonstrate that these clinical hypotheses can be automatically evaluated over real patient data using existing Semantic Web tools and frameworks. 4.6 Conclusion This study had several, largely methodological, objectives. First, there are a large number of legacy datasets that would be of benefit to researchers if they were published on the Semantic Web. We demonstrated a workable path for conversion and publication of these datasets that provided advantages beyond simply making the data available as "triples", but also in making it semantically transparent such that it could be easily re-analyzed by third-party researchers using their own classification frameworks. Second, the majority of ontologies available in the life sciences to date are class hierarchies, where the labels of each class are largely used to standardize annotations. The ability to logically reason over these labels is quite limited, thus inhibiting their use for automated annotation and classification of data. Nevertheless, these ontologies are increasingly comprehensive and reflect expert consensus of what concepts are relevant in a given domain. Here, we proposed and demonstrated a path for extending an existing ontology such that it could be utilized by DL reasoners Extending and encoding existing biological terminologies and datasets for use in the reasoned Semantic Web 93 to dynamically classify and interpret datasets - a process that is currently done largely by experts. Third, we demonstrated that clinical phenotype classification systems could be modeled in the OWL language by taking advantage of the rich, axiomatic structure of OWL-DL ontologies, and a variety of analytical Web Services. We showed how this combination of ontologies and Services can be used to make clinical data analyses both more transparent and more automated. Finally, we showed that individual clinicians deviate from established clinical guidelines at every layer of an analysis, and this demonstrates the need for a formal, yet personalized clinical interpretation framework to ensure transparency and reproducibility. We demonstrate that this can be achieved by creating and publishing "personalized" OWL ontologies. Future experiments should attempt to move beyond modeling well established standards and begin and attempt ontologically model complex research classifications and hypothesis. Using a similar approach with published research as our gold standard, experiments should evaluate the ability to model and automatically evaluate research hypotheses in silico and compare the conclusions with those drawn by the experts. Using public drug data to assist in automatic phenotypic classification of patient records in Semantic Web 94 5 Using public drug data to assist in automatic phenotypic classification of patient records in Semantic Web33 “Human error is inevitable. It happens in health care systems as it does in all other complex systems, and no measure of attention, training, dedication, or punishment is going to stop it”[178] 5.1 Synopsis Pharmaceutical databases have a wealth of available information that can be used in interpretation and analysis of clinical data. 
This can be achieved by providing an infrastructure that facilitates linking legacy patient data to the multiple drug resources available on the Web. This is currently problematic because the data are distributed over heterogeneous, mostly file-based, erroneous, and incomplete resources, which are therefore inconsistent in both representation and content. The massive amount of available pharmaceutical data has increased the need for a practical solution for integrating pharmaceutical data with clinical data. The work in this chapter was motivated by a clinical study in which we undertook to computationally model various Framingham risk classification guidelines (see chapter 4). There we found discrepancies between manual and automated risk classifications, suggesting that factors not captured by the guidelines led the clinical expert to select one or the other risk category for any given patient. In discussions with the clinician, we learned that the patients were under varying regimes of medications that would have affected their expert risk-assessment. In this study we propose, and demonstrate, a generic framework for formalizing patient drug data and connecting it to a knowledgebase of drug/phenotype interactions. Using the same cardiovascular dataset presented in chapter 4, we conducted a case study to evaluate the practicality of using a semantic representation, and analysis, of prescription medications to reduce the discrepancies between the automatic and manual Framingham risk groups, and showed that we could markedly improve our automated Framingham risk assessments compared to the expert clinical assessment. To demonstrate the flexibility of the semantic approach to representing prescribed drugs, we then used this same framework to undertake a second case study, to automatically determine potentially harmful drug interactions in the patient's list of prescribed medications. We detected 16 cases of potentially contra-indicated prescriptions (~4% of the patients). This study is of both methodological and clinical interest. From the methodological perspective, we presented a workable framework to link the prescription medications in legacy datasets to public drug databases. This linkage allows third parties to define complex phenotypic classifications in terms of prescription medications and to use them to dynamically classify patient data. From the practical perspective, we demonstrated the central role that public drug ontologies can play in providing decision support to clinicians and in reducing the human errors that are ubiquitous in healthcare.

33 Portions of this chapter have been prepared for the publication: Soroush Samadian, Bruce McManus and Mark Wilkinson, "Using public drug data to assist with automatic phenotypic classification of patient clinical records in semantic web".

5.2 Introduction

Analysis of clinical data should benefit from formal semantic approaches to dynamically linking clinical observations with the rich knowledge contained in pharmaceutical databases. The ultimate aim of such linkages would be, for example, to make automated inferences about the treatment-status of patients on various combinations of drugs.
Unfortunately, clinical research datasets are heterogeneous in the way prescription information is recorded; in "real world" clinical settings, where much research data is collected, physicians and primary-care givers have neither the time nor the motivation to carefully record pharmaceutical/prescription information with the rigor required for automated analysis. Moreover, there has been little study of approaches for semantically describing the connections and interactions between a patient's clinical phenotype, the drugs they are prescribed, and the range of afflictions those drugs are used to treat. Such knowledge-based infrastructures would benefit both front-line clinicians and clinical researchers, by facilitating detailed data analysis for the latter, and by acting as a potential real-time warning and decision-support system (e.g. for dangerous drug interactions and/or unexpected phenotypic responses to drug combinations) for the former. While this could be implemented via a series of local rules connected to the patient's data-stream and/or Electronic Health Record (EHR), the benefit of semantic encoding is that it then becomes possible to dynamically import, from numerous sources on the Web, continuously updated knowledge related to drug interactions, side-effects, and warnings.

The recent emergence of data and knowledge representation technologies such as the Resource Description Framework (RDF) [9] and the Web Ontology Language (OWL) [1], often referred to as "semantic technologies", seems to provide the tools required to achieve these goals. What remains to be defined, however, is the migration path from legacy, unstructured clinical data, or even data in non-semantic EHRs, to data that can be automatically processed by - and even interpreted by - these semantic technologies. What kind of clinical and phenotypic inferences become possible when structured clinical data is dynamically connected to the extensive structured knowledge present in pharmaceutical databases worldwide? Could such inferences enhance the quality of clinical research and/or patient care? In the previous chapter [179], we demonstrated that, for individual risk factors (e.g. patients with blood pressure that was clinically "at high risk"), we were able to build ontological models that, in almost all cases, could mirror the phenotypic classification of the expert clinician. However, in that same study we found that we were unable to build models that accurately classified patients according to more complex, combinatorial risk factors [167]. This suggested to us that other clinical observations, not captured by our prototype implementation, affected the expert's clinical risk classification. In consultation with the clinical researchers who had generated the dataset, we determined that the patients' prescribed medicines were likely to represent the most significant factors we had not accounted for in our original models. In this study we propose, and demonstrate, a simple framework for formalizing patient drug data and connecting it to a public knowledgebase of drug/phenotype interactions. Our primary objective is to successfully automate the classification of raw patient clinical data into Framingham risk categories, using semantic models of Framingham risk scores together with logical reasoning.
This involves determining, based on observations of their prescriptions, the clinical phenotypes a patient is being treated for (e.g. "Patient treated for high blood pressure"), and subsequently using that information in logical models to interpret when, for example, the patient's raw blood pressure record might be misleading our risk classifier. We then explore two other facets that became apparent to us once this infrastructure was in place. First, the possibility of detecting, in near-real-time, adverse drug events such as dangerous drug interactions by dynamically pulling in public drug/drug interaction data, or of detecting when the administration of a drug (perhaps in combination with other drugs) is not having the expected phenotypic effect. Second, we explore the knowledgebase itself and describe some idiosyncrasies in that resource that interfered with our early attempts to accomplish these goals.

5.3 Background

NDF-RT

The National Drug File Reference Terminology (NDF-RT) is a resource developed by the Department of Veterans Affairs (VA) Veterans Health Administration as an extension of the VA National Drug File [75]. NDF-RT is freely available for download as an OWL file. NDF-RT is a concept-oriented terminology - a collection of concepts, each of which represents a single, unique "meaning". Every concept has one fully-specified name and an arbitrary number of other names, all of which are intended to mean the same thing and are therefore synonymous terms. NDF-RT assigns an alphanumeric unique identifier (NUI) to every concept, maintained across releases to label and track that meaning [75]. The version used in this study is the latest OWL version available, dated 2011. It consists of 1751 active moieties (ingredients) and 4695 clinical drugs (VA products). NDF-RT also includes links to external drug resources such as RxNorm [74], MeSH [76], and UMLS [79]. NDF-RT is based on a model that adheres to Health Level 7 (HL7) guidelines [180]. HL7 has become the de facto standard for drug data representation and has been widely adopted by many public drug ontologies, including the Clinical Observation Interoperability ontology (COI) [4][181], RxNorm and others. NDF-RT expresses what a clinician might order for a patient (orderable drugs) and the type of order a pharmacy might receive. In NDF-RT nomenclature, drug names take a semantic normal form (SNF) that reflects their active ingredients, strength, and form. As such, NDF-RT includes a distinct name for every strength and dose of every available combination of active ingredients.

NDF-RT Content Model

We use the following nomenclature conventions for the rest of this chapter: the top-level classes described below (e.g. DRUG_KIND) are uppercase, underscore-separated, and set in a distinct (serif) font; properties (roles) are set in lower-case italics (e.g. "may_treat"); instances of INGREDIENT_KIND are set in lower-case bold italics (e.g. acetaminophen); and instances of DRUG_KIND are set in upper-case bold italics (e.g. ASPIRIN 120MG TAB). We use upper-case italics (ACETAMINOPHEN) to indicate the entities recorded within the legacy dataset (i.e., the raw format, prior to analysis).
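As a concrete illustration of these conventions, and of the content model described in the remainder of this section, the following minimal sketch builds a toy NDF-RT-like fragment and infers a therapeutic intent by walking the is_a and may_treat links. Every namespace, identifier and disease name in the sketch is an illustrative assumption, not the actual NDF-RT vocabulary.

# Toy NDF-RT-like fragment; all IRIs and role names are illustrative placeholders.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

NDF = Namespace("http://example.org/ndfrt#")   # placeholder namespace, not the real NDF-RT IRI

g = Graph()
# The orderable drug (DRUG_KIND) is_a its generic drug concept.
g.add((NDF.ASPIRIN_120MG_TAB, RDFS.subClassOf, NDF.ASPIRIN))
# Roles attached at the generic level, inherited by the orderable drug.
g.add((NDF.ASPIRIN, NDF.has_ingredient, NDF.aspirin))
g.add((NDF.ASPIRIN, NDF.may_treat, NDF.Pain))
g.add((NDF.ASPIRIN, NDF.may_prevent, NDF.Myocardial_Infarction))

def therapeutic_intents(concept):
    """Collect may_treat targets from a concept and everything it is_a."""
    seen, frontier, intents = set(), [concept], set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        intents.update(g.objects(node, NDF.may_treat))
        frontier.extend(g.objects(node, RDFS.subClassOf))
    return intents

print(therapeutic_intents(NDF.ASPIRIN_120MG_TAB))
# The orderable drug inherits the may_treat role of its parent concept.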
Every concept in NDF-RT is assigned to one of a set of general, non-overlapping top-level classes such as DRUG_KIND, DISEASE_KIND, INGREDIENT_KIND, and so on. Figure 5.1 shows the high-level representation of the drug ASPIRIN 120MG TAB, where DRUG_KIND, DISEASE_KIND, INGREDIENT_KIND, and VA_CLASSES are shown in red, purple, orange and blue respectively. The is_a relationship (blue) links NDF orderable drugs to both their VA classes and their generic ingredients. Additionally, the red arrows relate generic ingredients to INGREDIENT_KIND (the has_ingredient role) and to DISEASE_KIND (the may_treat and may_prevent roles). Two additional roles (not shown in Figure 5.1), has_PE and has_MoA, link generic ingredients (INGREDIENT_KIND) to their Physiological Effect (PE) and Mechanism of Action (MoA), respectively. Orderable drugs (instances of DRUG_KIND) inherit these properties (roles) from their associated active ingredients and are additionally described in terms of their strength, units and dosage. For instance, the orderable drug "CODEINE 15MG/ACETAMINOPHEN 300MG TAB" is linked to the generic ingredients "acetaminophen" and "codeine" via the has_ingredient property.

Figure 5.1 The representation of the drug ASPIRIN 120MG. Note that all the properties of ASPIRIN as a drug are inherited by its subclasses, and thus a logical inference system is capable of inferring the therapeutic intent of ASPIRIN 120MG by linking it to ASPIRIN (red block).

In the current implementation of NDF-RT, all the properties of generic ingredients are inherited by their subclasses. For instance, a logical inference system is capable of inferring the therapeutic intent of ASPIRIN 120MG TAB by linking it to ASPIRIN, shown as a red block in Figure 5.1. Furthermore, as shown in Figure 5.1, there are two entities under the name ASPIRIN: one is a subclass of DRUG_KIND (red box), and the other, "aspirin", is a subclass of INGREDIENT_KIND (green box), which is itself a subclass of "salicylic acid". As we will see in the sections that follow, we argue that this may not be the best naming convention.

Programmatic Tools

RxNav 34 is a browser for several drug information sources, including RxNorm, RxTerms and NDF-RT. RxNav displays links from drugs, both branded and generic, to their active ingredients, drug components and related brand names (brand names do not exist in NDF-RT) [74]. The RxTerms record for a given drug can be accessed through RxNav, as can clinical information from NDF-RT, including pharmacologic classes, mechanisms of action, physiologic effects and drug-drug interactions. RxNav provides two major programmatic APIs that are of importance to this study: the RxNav API and the NDF-RT API. In this study, RxNav was used to provide a standardized representation of the drug names in our legacy data. This makes the data amenable to interpretation, as explained next.

34 http://rxnav.nlm.nih.gov/RxNormAPI.html

ChemSpider is a free chemical structure database providing access to over 28 million structures, properties and associated information [182]. ChemSpider integrates and links multiple databases, providing free access to chemical and pharmaceutical data.
ChemSpider offers text and structure searching to find compounds of interest, and provides services to improve these data by curation and annotation and to integrate them with users' applications [182]. ChemSpider provides programmatic Web Services (including Java-based Web Services) to access and query the different databases. ChemSpider was used as an integral part of the drug data canonicalization service, as explained later.

Drug Representation Problem in Legacy Patient Data

Due to the rushed conditions under which clinical data is collected, and the minimal training that front-line medical professionals receive in data collection protocols, legacy clinical datasets often contain mixtures of medication brand names, generic names, active moieties, abbreviations and acronyms, together with inconsistencies in, for example, lower/upper case and the presence or absence of dosage information. Even within the same dataset there may be different representations (e.g. because different data rows may be entered by different individuals). Table 5.1 shows the first few rows of the dataset that was used for this study. In that dataset, we see that Patient 1 has been prescribed the following: ASPRIN 125MG, VASOTEC, PERSANTINE, ASCRIPTIN, and DIPYRIDAMOLE 100 MG. However, consulting public drug databases and ontologies, we note that:

1. ASPIRIN is spelled incorrectly;
2. VASOTEC is a brand name for the generic drug ENALAPRIL;
3. a combination of lower and upper case is used to represent medications;
4. there exist some duplicate entries;
5. ASCRIPTIN and ASPIRIN comprise the same ingredients (duplication); and
6. the prescribed dosage/administration can be present implicitly within the drug-name entry (e.g. DIPYRIDAMOLE 100MG), entered as a separate entry (e.g. a separate "dosage" row), or not present at all.

Table 5.1 Prescription information for the first two patients. DLY: daily; TID: three times a day.

PATIENTID                      patient1               Patient2
DRUG 1                         ASPRIN*                ASCRIPTIN
DRUG 1 DOSAGE                  1 DLY                  1 DLY, 10MG AS NEEDED
DRUG 2                         PROCARDIA              PERSANTINE
DRUG 2 DOSAGE                  10MG 1 3X DLY          75MG TID
DRUG 3                         BUFFERIN               Lopid
DRUG 3 DOSAGE                  1 DLY                  4X300MG DLY
DRUG 4                         VASOTEC                DICUMAROL
DRUG 4 DOSAGE                  2 DLY
DRUG 5                         XSD ASCRIPTIN          TRANRENE
DRUG 5 DOSAGE
DRUG 6                         DIPYRIDAMOLE 100MG     PERSANTINE
DRUG 6 DOSAGE                  1                      75MG, 3X DLY
DRUG 7                                                VASOTEC
DRUG 7 DOSAGE
Treated for HBP                1                      1
Treated for diabetes           1                      1
Treated for high cholesterol   0                      1
*misspelled entity

As noted in Figure 5.1, only ASPIRIN as a DRUG_KIND is attached to its therapeutic properties in NDF-RT. As a result, the starting dataset required extensive curation before it could be used to determine the therapeutic intent of the prescriber; moreover, we attempted to define a re-usable workflow to accomplish this in a reliable and automated manner, as will be described.

5.4 Methods

5.4.1 Overview of the drug description and prescribed medicines architecture

In the previous chapter [179], we proposed a semantic data architecture in which raw clinical measurements would be automatically "lifted" through increasingly conceptual/interpretive layers of ontologies in order to complete an analysis, evaluation, or query.
This would be achieved through a combination of logical reasoning over the data and ontologies, together with the discovery of Web Services that aggregate and analyze the data, dynamically identifying individuals logically compliant with the ontological classes at each layer. In this study, we create a similar framework to both normalize, and then interpret, the drug portion of a patient clinical record. We utilized SADI and SHARE [11], [59], [135] to achieve this goal, similar to the model we proposed for clinical observations.

Figure 5.2 High-level representation of the proposed data schema using concepts in legacy ontologies.

In the current study, we are interested in evaluating our ability to automatically determine 1) the treatment status of a patient, based on the drugs they are taking, and 2) patient records exhibiting potentially contraindicative medications. Since all of the tools we use are based on semantic technologies, it is necessary to build models of patient treatment-status using such tools. Figure 5.2 shows the high-level diagram of the migration workflow that enables the definition of patient phenotypes in terms of prescribed medications. By referring to an ontological concept in the SPARQL query (Layer D in the diagram), raw drug data is "lifted" through the ontological layers via an iterative process of reasoning, service discovery, and execution. In the following, we describe how we model both the data and the patient phenotypes.

5.4.2 Dataset

The dataset was the same dataset used to conduct the experiments presented in the previous chapter; however, in this experiment we focused on the medication portion of the data. The entire cohort consisted of 636 unique patients, of which 246 were annotated with treatment statuses ("Treated" vs. "Not under treatment") for "hypertension", "hypercholesterolemia", and "diabetes" (Table 5.1). Relevant clinical observations, medications, and treatment status were recorded for every patient.

5.4.3 Patient data transformation

Converting legacy data into RDF

It was not our intention to define a comprehensive model for the drug portion of a patient clinical record. Our primary goal was to define a simplistic record containing only the information we thought would be necessary to achieve our downstream research goals. We explicitly used the concept of arbitrary collections of medications (as opposed to single medications) in our data model, since the number and types of drugs prescribed can vary dramatically from patient to patient, making it difficult to design a more formally-enumerated data model. By attributing a collection of medications to patients, their phenotypic classifications may be affected by the sum-total of the collection of medications prescribed to them, as opposed to the result of individual drugs considered alone (for example, in the case of drug-drug interactions). In this study, then, the drugs prescribed to a patient are modeled as an un-ordered, un-ranked collection of individual drugs. While RDF does provide native approaches for representing "collections" of things (e.g. rdf:Bag), these are strongly discouraged by the Semantic Web development community, and more recently by the W3C itself 35.

35 http://www.w3.org/2011/rdf-wg/track/issues/24
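To make this collection-based record concrete (anticipating the SIO-based pattern described in the next paragraph), the minimal sketch below builds the layer-B RDF for one patient with rdflib. The namespaces and property names used here are illustrative assumptions, not the project's actual IRIs.

# Sketch of a layer-B drug record: a patient linked to an unordered collection
# whose members carry only the raw drug-name strings as identifiers.
from rdflib import Graph, Namespace, BNode, Literal, RDF

CARDIO = Namespace("http://example.org/cardio#")   # placeholder namespaces
SIO = Namespace("http://example.org/sio#")

g = Graph()
patient = CARDIO.patient1
collection = BNode()                       # the collection of prescribed drugs

g.add((patient, RDF.type, CARDIO.PatientRecord))
g.add((patient, CARDIO.isPrescribed, collection))
g.add((collection, RDF.type, CARDIO.NonCanonicalizedCollection))

for raw_name in ["ASPRIN", "VASOTEC", "DIPYRIDAMOLE 100MG"]:
    member = BNode()                       # "un-typed" member at layer B
    g.add((collection, SIO.has_member, member))
    g.add((member, SIO.has_identifier, Literal(raw_name)))

print(g.serialize(format="turtle"))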
The Semantic Science Integrated Ontology (SIO; [160]) is our ontology of choice for representing scientific data (see [179]), and in this situation it provides a formal approach to building collections of entities that is amenable to logical reasoning [183]. A Collection is defined by SIO as "a set for which i) there exists at least one member at any given time and ii) each item is of the same type", where "type" indicates the semantic type (e.g. DRUG_KIND). For instance, if a collection contains only one kind of member, then we can write a class axiom as follows:

'Collection of x-type valued items' owl:equivalentClass
    'Collection'
    and 'has member' some ('x' and 'has identifier' some Literal)          (1)

Figure 5.3 shows the schematic view of the data model for converting layer A to layer B in Figure 5.2. Note that at layer B we still do not know whether the string representation of the medication will resolve to DRUG_KIND, INGREDIENT_KIND, or neither. Thus, the collection (the red block in the figure) consists of "un-typed" members (empty ovals in Figure 5.3) that are linked to drug names by the sio:has_identifier property as follows:

'has member' some ('has identifier' some Literal)

Figure 5.3 High-level representation of the RDF data schema using concepts in legacy ontologies.

The data were migrated from the Excel sheet to the RDF schema diagrammed in Figure 5.3 using a set of Perl scripts [184]. Subsequently, the collection is assigned to the patient through the property cardio:isPrescribed, as shown in Figure 5.3.

Canonicalization of RDF drug data

Once the patient data is represented in RDF, we use a combination of the RxNav and ChemSpider APIs to analyze that data. However, in their native format these APIs are not compatible with the SADI services workflow. Thus, to make them compatible with SADI's semantic framework, we wrapped the RxNav and ChemSpider APIs as a set of SADI-compliant Semantic Web Services. The reason for using the ChemSpider API as an integral part of the canonicalizer service was that, for multiple-word medications (e.g. "XSD ASCRIPTIN") where the RxNav spelling suggestion was not able to retrieve any medication, the ChemSpider tokenizer service [182] performs more reliably. The output of this step (layer C in Figure 5.2) is harmonized drug data, with links from drugs to their active ingredients and therapeutic intents - a step that is necessary for the automatic "lifting" of the data.

The input and output of the canonicalization service are as follows:

Input (NonCanonicalizedCollection):
    sio:has_member some (sio:has_identifier some Literal)

Output (CanonicalizedDrugCollection):
    sio:has_member some (NDF:Drug_Kind
        and sio:has_identifier some ((Rx_NUI some Literal) and (Rx_CUI some Literal)))

Figure 5.4 shows the flowchart algorithm of the canonicalizer service. As a result of the discovery and invocation of the canonicalization service, the unstructured drug information from the raw clinical record is transformed into a set of well-structured, shared drug identifiers, which can then be used to make associations with treatment status.
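The sketch below mirrors, procedurally, the fallback logic of that flowchart: normalize the raw string, try an RxNav lookup, and fall back to ChemSpider for strings that RxNav cannot resolve. The two lookup helpers are hypothetical stand-ins for the actual RxNav and ChemSpider services; only the control flow is illustrated.

# Control-flow sketch of the canonicalizer; the lookup helpers are hypothetical
# stand-ins for the RxNav and ChemSpider services wrapped as SADI services.
from typing import Dict, Optional

def rxnav_lookup(name: str) -> Optional[Dict[str, str]]:
    """Hypothetical wrapper: return {'rx_cui': ..., 'rx_nui': ...} or None."""
    return None

def chemspider_suggest(name: str) -> Optional[str]:
    """Hypothetical wrapper: return a corrected/tokenized drug name, or None."""
    return None

def canonicalize(raw_name: str) -> Optional[Dict[str, str]]:
    name = " ".join(raw_name.strip().upper().split())   # basic string normalization
    hit = rxnav_lookup(name)                             # exact / approximate match
    if hit is None:
        suggestion = chemspider_suggest(name)            # fallback for multi-word strings
        if suggestion is not None:
            hit = rxnav_lookup(suggestion)
    return hit      # None means the string could not be mapped to an NDF-RT drug

A drug name that still resolves to nothing at the end of this flow corresponds to the small set of unmappable medications reported in the evaluation below.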
Figure 5.4 Flowchart algorithm of the canonicalizer service used to migrate erroneous spreadsheet data into a standardized format. Once standardized, the model can be used for additional inferences and patient phenotypic classification.

The output of the canonicalizer service is a collection whose members are NDF-RT DRUG_KINDs with the unique identifiers Rx_CUI and Rx_NUI attached using the sio:has_identifier property. Once the medication is standardized in this way (layer C in Figure 5.2), we are able to retrieve information (see next section) in terms of the properties linking those medications to other concepts in NDF-RT (e.g. may_treat, has_ingredient, and so on).

5.5 Evaluation: clinical case studies

5.5.1 Evaluation of canonicalization service

We evaluated the canonicalization step (the transition from layer B to layer C in Figure 5.2) separately, as it is required for both of the case studies that follow. Once the data had been converted into RDF (layer A in Figure 5.2), a SPARQL query (shown below, before Table 5.2) was entered into SHARE to extract the canonicalized collections. This query triggers the automatic discovery and invocation of the canonicalization service. The alternative workflow paths taken by each string representation of a drug from the original file (Figure 5.4) were separately recorded to obtain general statistics about the drug data 36; this, however, could be done en masse in a non-research environment. A total of 181 unique medication identifiers were automatically mapped to NDF-RT identifiers. Using RxNav alone, 52 unique medications could not be mapped to NDF-RT identifiers (the majority of which were multiple-word medications); however, once ChemSpider was integrated within the canonicalization service (Figure 5.4), only a small number of medications (7 unique medications) could not be mapped to an NDF-RT medication, reducing the failure rate from approximately 22% to 4%. Table 5.2 shows some of the medications for which no spelling suggestion was initially found (using the RxNav API) but for which ChemSpider was able to find the correct medication automatically. For a complete list and general statistics, please refer to Appendix D.

36 The final results of this query were not of particular interest; however, the alternative routes taken by the string representations of medications in the original RDF file were recorded to yield general statistics.
PREFIX cardio: <...>
PREFIX rdf: <...>
SELECT ?drugcollection
FROM <...>
WHERE {
    ?drugcollection rdf:type cardio:CanonicalizedDrugCollection
}

Table 5.2 Medications for which no spelling suggestion was returned by the RxNav API but which were resolved using ChemSpider (with the exception of "U-100 REGULAR"). The medications listed are those that affect the treatment status for HBP (red), cholesterol (yellow), and diabetes (green) according to the NDF ontology.

Medication       ChemSpider's suggestion   Frequency
CALAN SR         Vasolan                   6
LOPID 100MG      Gemfibrozil               4
LOPRESSOR        Gemfibrozil               3
U-100 REGULAR    -                         2
LY DIABENESE     Chlopropamide             2
NPH INSULIN      Phosphinamine             2

5.5.2 Case study 1 - Automatic determination of the treatment status implicated in cardiovascular risk assessment

Cardiovascular risk assessment and monitoring is of critical importance in health care. It is widely accepted that several risk factors, including high blood pressure, dyslipidemia, and diabetes, are major factors in developing cardiovascular disease. An individual's risk can be assessed using available risk-prediction tools such as the Framingham Risk Score, which incorporates information on established risk factors. Table 5.3 shows the scoring schema proposed by the Framingham study to calculate the estimated risk of general cardiovascular disease within 10 years in men, based on the mean values of the clinical observations.

Table 5.3 Estimated risk of general cardiovascular disease (part of the table shown) [167]

Points   Age, y   HDL     Total Cholesterol   SBP Not Treated   SBP Treated   Smoker   Diabetic
-2                60+                         <120
-1                50-59
 0       30-34    45-49   <160                120-129           <120          No       No
 1                35-44   160-199             130-139
 2       35-39    <35     200-239             140-159           120-129
 3                        240-279             160+              130-139       Yes
 4                        280+                                  140-159                Yes

Previously, we aimed to model the Framingham risk score using Semantic Web technologies; however, we found some major discrepancies between the manually and automatically annotated risk stratifications. One possible explanation for these discrepancies is the fact that, as shown in Table 5.3, the same SBP measurement can be assigned different risk scores depending on whether or not the patient is being treated for high blood pressure - a factor we had not included in the original study. This led us to carry out an experiment to evaluate how current Semantic Web technologies can be exploited to automatically evaluate the extent to which the medications prescribed to patients reflect their treatment status for various conditions. Subsequently, we re-evaluated our previous study (chapter 4) to assess how closely the automatically calculated Framingham Risk Score matches the risk classifications made manually by experts. As discussed, treatment statuses were present for only 246 patients in our dataset, and those records were used for evaluation. In the Framingham risk schema for the different cardiovascular diseases, the treatment statuses for "hypertension", "high cholesterol" and "diabetes" affect the total score assigned to a patient. Thus, although our framework can easily be extended to other conditions, only those three conditions were used to evaluate the semantic framework we designed here. The patient treatment-status for these conditions was modeled in OWL-DL.
For example, the OWL class for "Patient under treatment for hypertension" is defined as follows:

galen:Patient
    and cardio:isPrescribed some (cardio:CanonicalizedDrugCollection
        and sio:has_member some cardio:HypertensionTreatmentMedication)

where HypertensionTreatmentMedication is defined as:

cardio:CanonicalDrugRecord and (ndf:may_treat some ndf:Hypertension)

CanonicalDrugRecord is the standardized drug class definition that has unique concept identifiers (RxNorm_CUI and Rx_NUI) attached to it as data properties, as discussed previously. Since the framework is designed to be used for decision support, we decided that retrieving false positives would be more tolerable than missing true positives. Moreover, since the dataset annotations did not specify the precise subcategory of a condition (e.g. "Hypertension" in the above definition includes all subclasses of "Hypertension", such as "Pulmonary Hypertension"), we considered all the concepts listed as subclasses of that specific condition. Finally, and most importantly, a given drug may be used to treat a few different conditions; for example, ASPIRIN may be used both as a pain-relieving and as an anti-coagulant agent. In our definitions, therefore, we assumed that if a drug is ever used to treat a certain condition, the patient prescribed that drug is potentially suffering from that condition, and should be defined as being treated for that condition (even if that were not the therapeutic intent of the prescribing clinician). Thus, we anticipated a certain degree of inconsistency (mostly false positives) in our results compared to the clinical assessment of the expert. The input and output of the SADI service that links medications to their therapeutic intent are as follows:

Input (CanonicalizedDrug):
    sio:has_identifier some ((ndf:Rx_NUI some Literal) and (ndf:Rx_CUI some Literal))

Output (CanonicalizedDrug):
    ndf:may_treat some ndf:Disease

Results for case study 1 - individual treatment status

The OWL-DL definitions for specific patient phenotypes (such as "Patient treated for hypertension") were composed. Subsequently, SPARQL queries were used to retrieve the patient records automatically determined to be under treatment for a specific condition, and these were compared with the patient list annotated manually by experts for that same condition. The SPARQL query used to retrieve patients classified as being treated for hypertension is shown below:

PREFIX cardio: <...>
PREFIX rdf: <...>
SELECT ?patientrecord ?expertbptreatmentstatus
FROM <...>
WHERE {
    ?patientrecord rdf:type cardio:PatientUnderTreatmentHypertension .
    ?patientrecord cardio:BP_TREATMENT_STATUS ?expertbptreatmentstatus .
}

In the above query, cardio:BP_TREATMENT_STATUS is the manual expert annotation (Table 5.1) for a specific record, with "1" representing that the patient is being treated and "0" representing that the patient is not being treated. In Figure 5.5, the bold record(s) show where there were discrepancies between the automatically determined and manually annotated treatment statuses. Precision-recall and sensitivity-specificity metrics were used to evaluate the performance of the automatic system, with the manual annotation as the reference. We subsequently carried out similar analyses for cholesterol and diabetes.
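These metrics can be computed directly from the confusion counts; the short sketch below is generic, and is shown applied, as an example, to the hypertension counts reported in Table 5.4 below.

# Evaluation metrics computed from a confusion matrix (TP, FP, TN, FN).
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),              # i.e. sensitivity
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypertension counts from Table 5.4a: TP=166, FP=35, TN=28, FN=17
print(metrics(166, 35, 28, 17))
# precision ~0.826, recall ~0.907, specificity ~0.444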
Tables 5.4a through 5.4c show the comparison between automatic and manual determination of treatment statuses for hypertension, diabetes and hypercholesterolemia, respectively.

Figure 5.5 A small section of the results for automatic vs. manual determination of high blood pressure treatment status. The bold record shows where the automatic and manual treatment statuses differ. "1": treated, "0": untreated.

Patient ID   Automatic grade (based on drugs prescribed)   Expert-assigned grade (BP_TREATMENT_STATUS)
uri4627      1                                             1
uri4275      1                                             1
uri822       1                                             0
uri893       1                                             1

Table 5.4 Complete results of the analysis for a) HBP, b) diabetes and c) cholesterol treatment status. TP: true positive, FP: false positive, TN: true negative, FN: false negative.

                 TP    FP   TN    FN   Precision   Recall   Specificity
a) HBP           166   35   28    17   0.8259      0.9071   0.4444
b) Diabetes      37    1    194   14   0.9736      0.7254   0.9948
c) Cholesterol   58    8    169   11   0.8787      0.8405   0.9548

Results for case study 1 - modified Framingham risk assessment

Using the information available from the treatment statuses, we re-implemented the Framingham risk groups originally implemented in the previous chapter and made comparisons between the automatic and manual annotations. For clinical measurements such as SBP and cholesterol, we utilized our previous model. In our dataset, clinicians had annotated the records with three grades: "high risk", "low risk" and "moderate risk". For this study, we considered only those patient records with no missing values for any of the clinical observations required to make a definitive risk evaluation. The conventional classification used in the Canadian health care system is based on three levels of quantization over the accumulated individual risk scores (0-9: low risk, 10-19: medium risk, >=20: high risk). SPARQL queries were used to identify the patients classified as high, low and moderate risk.

Case study 1 - approach

The core data model for this analysis is obtained by merging the patient model for clinical observations developed in our previous study (see previous chapter) with the drug model. Figure 5.6 shows the high-level representation of the merged model in RDF (for details please refer to [3]), where the left-hand side of the figure shows the drug representation portion and the right-hand side shows the clinical observation portion.

Figure 5.6 High-level representation of the drug data model merged with the clinical observation model. The left side of the figure shows the drug representation portion and the right side shows the clinical observation portion.

As an exemplar case, consider the SPARQL query below, used for determining the high-risk patients. In this query, the automatic risk score is the accumulation of the individual risk scores for age, HDL, cholesterol, SBP, and diabetes (Table 5.3). Specifically, for the "10-Year General Cardiovascular Risk Score" analyzed here, the individual scores that are affected by drug information are those attributed to SBP and diabetes.

PREFIX cardio: <...>
PREFIX rdf: <...>
SELECT ?patientrecord ?calculatedrisk ?riskgrade
FROM <...>
WHERE {
    ?patientrecord rdf:type cardio:HighRiskFraminghamScoreRecord .
    ?patientrecord cardio:ExpertFraminghamGrade ?riskgrade .
    ?patientrecord cardio:hasAttribute ?attr .
    ?attr rdf:type cardio:GeneralCVD10YearFraminghamRiskScore .
    ?attr cardio:hasValue ?calculatedrisk
}

The input and output of the SADI service that calculates the risk score are similar to those of our previous study (chapter 4), with an additional constraint placed on the input (cardio:isPrescribed some cardio:CanonicalizedDrugCollection):

Input:
    (cardio:isPrescribed some cardio:CanonicalizedDrugCollection)
    and (sio:hasAttribute some cardio:Age)
    and (sio:hasAttribute some cardio:SerumCholesterolConcentration)
    and (sio:hasAttribute some cardio:SerumHDLCholesterolConcentration)
    and (sio:hasAttribute some cardio:SystolicBloodPressure)

Output:
    sio:hasAttribute some (cardio:GeneralCVDFraminghamRiskScore and (sio:hasValue some Literal))

The input and output of the service that annotates patients with their treatment statuses (cardio:isTreatedFor) are defined as follows:

Input:
    cardio:isPrescribed some cardio:CanonicalizedDrugCollection

Output:
    cardio:isTreatedFor some NDF:Disease_Kind

Evaluation Metrics

Binary class prediction algorithms are often evaluated based on the relations between the sets of true and false positive and negative predictions [185]. These relations are quantified with measures of accuracy, precision and recall, defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. Table 5.5 summarizes the results of the classification in terms of accuracy, precision and recall before and after the incorporation of drug information. We elaborate on the findings presented in Table 5.5 and their implications in the discussion section.

Table 5.5 Comparison of the classification metrics for the Framingham risk classifications before and after incorporating drug information (each cell: before -> after).

                 Accuracy        Precision       Recall
High Risk        0.82 -> 0.89    0.71 -> 0.84    0.30 -> 0.61
Moderate Risk    0.68 -> 0.73    0.68 -> 0.72    0.71 -> 0.74
Low Risk         0.76 -> 0.83    0.55 -> 0.65    0.80 -> 0.81

5.5.3 Case Study 2 - Automatic detection of potentially harmful drug-drug interactions

Polypharmacy - the concomitant use of multiple medications - is a common and widespread problem in healthcare ([186], [187]). A 1992 study published in Medical Care examined prescriptions given to people being discharged from a community hospital [187]. The results of this study are quite surprising, both in what they reveal about doctors' prescribing practices and as evidence of the potential damage that these practices can do to older adults. The study highlighted several problems, among them the fact that 44 percent of the patients in that study were found to have been given a combination of drugs that can result in harmful drug interactions. Although polypharmacy may occur in all age groups, it is a common occurrence in elderly people. Studies conducted during the past 10 years have revealed that patients aged 65 and older use an average of two to six prescribed medications and one to 3.4 non-prescribed medications on a regular basis. The elderly are at particularly increased risk of drug side effects and interactions between multiple drugs because of the changing physiology of aging [188].
This age-related effect is even more pronounced for patients affected by specific health problems (e.g. cardiovascular diseases); for instance, in the dataset used for the current study, each patient was prescribed approximately 7 medications on average. Thus, designing optimal drug-prescription monitoring strategies for the elderly is an area of active research. As such, we carried out the first experiment of its kind to evaluate whether, using a formal Semantic Web approach to drug representation, we can automatically detect potentially harmful drug-drug interactions by linking patient data to publicly available drug data and knowledge resources. For the purposes of this study, a negative drug interaction is a situation in which a substance (e.g. food or another drug) affects the activity of a given drug, either by increasing or decreasing its effects, or by producing a new effect.

Case study 2 - approach

Using OWL-DL, we defined a subclass of CanonicalizedDrugCollection as follows:

ContraIndicatedDrugCollection:
    cardio:CanonicalizedDrugCollection
    and (cardio:hasContraindicationProfile some cardio:ContraIndication)

An object property, hasContraindicationProfile, was defined as a property of DrugCollection that may appear one or more times. The range of this property is the class ContraIndication, which is a collection whose members are the interacting drug-pairs, together with the severity of each interaction and a reference to the resource from which the original contraindication data were retrieved. In our experiment, the reference knowledge-set for adverse drug interactions is NDF-RT; however, the framework is designed so that it can be extended to include other, potentially multiple, drug-interaction knowledge sources anywhere on the Web. The ContraIndication class is defined using the approach suggested by SIO for defining collections of paired data. Thus, the ContraIndication class is defined (in OWL) as follows:

sio:has_member some (cardio:DrugPair
    and (sio:has_component_part exactly 2 cardio:CanonicalizedDrug)
    and (ndf:severity some Literal)
    and (sio:is_referenced_by some Literal))

The input and output of the SADI service that provides the drug interaction information for any given collection containing two canonicalized prescribed medications are as follows:

Input:
    sio:has_component_part exactly 2 cardio:CanonicalizedDrug

Output:
    (sio:has_component_part exactly 2 cardio:CanonicalizedDrug)
    and (ndf:severity some Literal)
    and (sio:is_referenced_by some Literal)

Results for case study 2 - automatic drug-drug interaction detection

A DrugCollection is considered "contraindicated" if at least two medications present in that collection are found to be contraindicative. Thus, ContraIndicatedPatientRecord is defined as follows:

cardio:PatientRecord and cardio:isPrescribed some cardio:ContraIndicatedDrugCollection

We evaluated our framework on 394 patients selected from our data. A SPARQL query (shown below) was used to identify the patients considered contraindicated, for whom an alert should be issued by the system; this query does not take into account the severity of the interaction between the drugs, nor does it retrieve the reference from which the interaction information is taken.
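Before turning to the query and its results, the sketch below illustrates, procedurally, the pairwise check that the ContraIndicatedDrugCollection class encodes declaratively. The interaction table, drug names and severity labels are placeholders; in the study itself this knowledge is retrieved from NDF-RT through a SADI service.

# Pairwise drug-drug interaction check over a patient's canonicalized collection.
# INTERACTIONS is a placeholder for the NDF-RT-derived interaction knowledge.
from itertools import combinations

INTERACTIONS = {
    frozenset({"ASPIRIN", "METHOTREXATE"}): "Critical",
    frozenset({"ASPIRIN", "IBUPROFEN"}): "Significant",
}

def contraindicated_pairs(drug_collection):
    """Return (drug1, drug2, severity) for every interacting pair in the collection."""
    hits = []
    for d1, d2 in combinations(sorted(set(drug_collection)), 2):
        severity = INTERACTIONS.get(frozenset({d1, d2}))
        if severity is not None:
            hits.append((d1, d2, severity))
    return hits

print(contraindicated_pairs(["ASPIRIN", "METHOTREXATE", "ENALAPRIL"]))
# -> [('ASPIRIN', 'METHOTREXATE', 'Critical')]

A patient whose collection yields at least one such pair corresponds to a ContraIndicatedPatientRecord in the OWL formulation above.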
PREFIX cardio: <...>
PREFIX rdf: <...>
SELECT ?patient
FROM <...>
WHERE {
    ?patient rdf:type cardio:ContraIndicatedPatient .
}

This query revealed a total of 16 patients (~4.1%) to whom contraindicated drugs had been prescribed. We then composed a second SPARQL query to retrieve more detailed interaction information:

PREFIX cardio: <...>
PREFIX rdf: <...>
SELECT DISTINCT ?patient ?drug1 ?drug2 ?severitylevel
FROM <...>
WHERE {
    ?patient rdf:type cardio:ContraIndicatedPatient .
    ?patient cardio:isPrescribed ?drugcollection .
    ?drugcollection cardio:counterInteractionProfile ?interactionprofile .
    ?interactionprofile sio:has_member ?drugPair .
    ?drugPair sio:has_component ?drug1 .
    ?drugPair sio:has_component ?drug2 .
    ?drugPair cardio:has_interaction_severity ?severity .
    ?severity sio:has_value ?severitylevel .
}

Table 5.6 shows the results of this analysis. As shown in Table 5.6, we found that of the 16 total contraindications, 5 are listed as critical interactions by NDF-RT. For instance, patient uri550 was prescribed ASPIRIN and METHOTREXATE simultaneously (as indicated by the time stamps associated with each prescription record), and in another case the branded medications RHUEMATREX and ASCRIPTIN were both recorded as having been simultaneously prescribed to the patient. We did not encounter any record that had more than one pair of contraindicated drugs.

Table 5.6 Potentially harmful drug interactions. N0000145918 (Aspirin), N0000146197 (Ibuprofen), N0000147814 (Diltiazem), N0000148001 (Propranolol), N0000148054 (Verapamil), N0000148057 (Warfarin), N0000147602 (Lovastatin), N0000147916 (Methotrexate), N0000146243 (Indometacin), N0000145859 (Ascorbic Acid), N0000146187 (Tolazamide), N0000146241 (Tolbutamide), N0000148034 (Timolol), N0000148010 (Quinidine), N0000178380 (Ketorolac). Potentially critical interactions are those with severity level "Critical".

PatientID   Drug1 (NDF-ID)   Drug2 (NDF-ID)   Severity level
uri122      N0000146197      N0000145918      Significant
uri550      N0000145918      N0000147916      Critical
uri735      N0000145918      N0000146197      Significant
uri2648     N0000148001      N0000148054      Critical
uri3045     N0000148057      N0000145918      Significant
uri4203     N0000148001      N0000146243      Significant
uri3464     N0000148057      N0000145918      Significant
uri449      N0000145859      N0000145918      Critical
uri1142     N0000146187      N0000146784      Significant
uri4189     N0000145918      N0000146197      Significant
uri1470     N0000146241      N0000148034      Significant
uri3309     N0000148010      N0000147814      Significant
uri2779     N0000145918      N0000178380      Critical
uri2526     N0000148054      N0000148001      Critical
uri678      N0000147602      N0000147814      Significant
uri4218     N0000147814      N0000147602      Significant

5.6 Discussion

5.6.1 Discussion of blood pressure, diabetes and cholesterol treatment statuses

With respect to hypertension, due to the wide variety of potential candidate drugs and their combinations, and the different conditions all classified under "hypertension", we were expecting to find a relatively large number of false positives (results shown in Table 5.4a). Surprisingly, the converse was true; there were records automatically determined to be "not under treatment" that were nevertheless annotated by the expert as "under treatment". We confirmed these results using other external drug databases such as DrugBank [38] (Appendix D).
However, we could not rule out the possibility of a medication being primarily prescribed for a purpose other than hypertension treatment. This suggests that prescribing clinicians do not consistently take into account the multiple pharmaceutical effects of the drugs they are prescribing; effectively, they may sometimes consider only the treatment outcome for which they are prescribing the medication, and not consider (or may not have other suitable options for) the other pharmacological effects that the medication will have on the patient. On the other hand, examination of the dataset revealed records that were annotated as being "treated for HBP" without any specific medications accounting for the condition. For instance, uri123 in Table 5.7 is not assigned any medications, yet is annotated as being "treated". All of the false negatives (also for diabetes and cholesterol) were of this type. There are two possible explanations for this observation: either the physician was considering factors other than prescribed medications (e.g. a sodium-free diet or exercise regime) when classifying the patient's treatment-status, or they had neglected to enter certain medications that had been prescribed to the patient. Regardless, the rationales for these clinical phenotypic classifications were not recorded, and could not be reproduced in an automated manner.

Table 5.7 An extreme example of an aberrant clinical classification, showing a patient to whom no medications were prescribed, yet who was annotated as "treated" for HBP.

PATIENT_ID   DRUG 1   DRUG 2   DRUG 3   DRUG 4   DRUG 5   Treated for BP   Treated for Diabetes   Treated for High Cholesterol
uri123                                                    1                0                      0

The higher precision and recall for the cholesterol and diabetes statuses might be explained by the relatively lower diversity of medications prescribed in our dataset for those two conditions compared with hypertension. There were 7 major drugs (considering insulin and its family as one) used in our dataset for diabetes, 10 for cholesterol and 28 for hypertension (FISH OIL is common to both hypertension and hypercholesterolemia, and is highlighted); see Appendix D for more information. The lower diversity of prescription drugs for diabetes and cholesterol may account for the low number of false positives for these two conditions. Additionally, this lower diversity may make the manual annotation of records less prone to human error, resulting in a very low false-positive rate. The differing diversity of medications used for these three conditions is also reflected in NDF-RT: there are 617 unique medications in NDF-RT linked to hypertension via the may_treat property, but only 254 and 118 unique medications for diabetes and hypercholesterolemia, respectively. The large number of medications implicated in hypertension may partially explain the lower recall rate. Finally, with respect to hypertension, it may be argued that patients under specific strict diets and lifestyle changes may be considered "treated"; however, based on the existing literature on hypertension [189], these factors are not likely to account for the relatively large number of false negatives. Once again, this highlights the need for more "transparency" in recording clinical phenotypic classifications.
5.6.2 Discussion of Framingham risk scores experimental results

As stated in the introductory section, the results of the automatic Framingham calculations in the original study (before adding drug information) were highly inconsistent with the manual risk classification by the expert clinician. Following the use of drug information, the results showed substantial improvement according to the standard evaluation metrics shown in Table 5.5. However, discrepancies between the manual and automatic classifications remained. The majority of these inconsistencies are caused by records for which the drug information failed to determine the individual treatment status for diabetes and/or SBP. Other inconsistencies are possibly caused by the expert not considering a patient to be under treatment where the automatic system suggested the contrary. For instance, in Figure 5.7, both uri3231 and uri219 have a calculated risk score of 19; nevertheless, one is annotated as high risk by the expert whereas the other is classified as medium risk. However, if we remove the medication information, the calculated risk for uri3231 remains 19, while the calculated risk for uri219 drops to 17 (the prescribed medications suggest that this patient was under treatment for high blood pressure). Thus, the inconsistencies in the Framingham risk groups were caused by inconsistencies in the determination of the treatment statuses for hypertension and diabetes.

Figure 5.7 A small section of the results for automatic vs. manual determination of "high risk" Framingham patients. The bold record shows a record for which the automatic and manual classifications differ.

PatientID   calculatedrisk   expertriskgrade
uri4218     21               high
uri2192     20               high
uri3231     19               high
uri219      19               medium

The results of these analyses therefore highlight two primary observations directly related to the thesis objectives. First, the need for more rigorous and formal definitions of patient phenotype classifications by clinical researchers, such that the classification can be reproduced and, perhaps, automated and re-used by others. Second, we show that connecting to third-party drug knowledge-bases can improve automated clinical classification, as well as potentially provide real-time advice to clinicians about hazardous drug interactions. This study also led to other observations related to the quality of our dataset, compared to other datasets used in similar studies, which allow us to discuss whether our results will be more broadly applicable to the general domain of clinical research. In addition, we were able to make some observations about the suitability, structure and accuracy of the various external resources we utilized. We now discuss these auxiliary observations.

5.6.3 Multiple drug interaction

The number of contraindications found in our dataset was considerably lower than the number observed by H. L. Lipton et al. [187] (4% vs. 44%), which could be due to the different disease focus of our study compared to theirs (and thus the different drugs being examined), to the diligence of the clinician from whom we obtained our dataset, or to incomplete coverage of the RxNav services for drug-drug interactions. However, the number of contraindicated records in this study is higher than that found in a recent study reported by Stevens et al. [190] (between 0.56% and 1.25%).
This may be due to the built-in mechanisms in their EHR framework that impose certain restrictions at the time the drug information is entered (such as forcing the user to select medications from a drop-down menu and to enter the prescription time/duration), which may help clinicians to detect possible interactions. Regardless of the underlying reasons for the large variations reported by different groups, our study once again underscored a well-recognized problem with concomitant medications, and proposed a low-cost and seemingly effective approach to detecting these dangerous situations.

5.6.4 NDF-RT evaluation

Our study allowed us to make some observations about the suitability of NDF-RT for the purposes of clinical treatment-classification and for the detection of drug-drug interactions. These observations, which can be used to suggest improvements to NDF-RT, were as follows:

• Improvements to the NDF-RT DL axiomatic class expressions

In the current version of NDF-RT, no mappings exist between "ingredients" and conditions. For instance, although ASPIRIN, as a DRUG_KIND, is linked to therapeutic conditions (Figure 5.1), aspirin, as an ingredient, is not linked to any conditions or diseases. It seems erroneous that an ingredient named aspirin should exist at all; it would seem preferable to link ASPIRIN as a DRUG_KIND directly to the ingredient "salicylic acid" (which is the superclass of aspirin-as-ingredient in the current version). However, no additional property exists to link "salicylic acid" to any medicinal uses; thus reasoners will be unable to infer the pharmacologic intent of that ingredient. In addition, since each medication may have one or more ingredients, each of which may be implicated in different medicinal uses, it would be preferable and more conceptually sound to link individual ingredients to their medicinal uses and to use the reasoner to infer the therapeutic intent of the combination of ingredients in each medication, adding more granularity and clarity to the reasoning process. The same argument holds for detecting possibly harmful drug-drug interactions.

• Improvements to NDF-RT medication coverage

Our study revealed medications that exist as DRUG_KIND but lack any links (may_treat) to conditions or diseases. Table 5.8 shows the medications for which no specific therapeutic intent is listed (after removing vitamins). Of the medications listed, DIPYRIDAMOLE, RANITIDINE, and METHYSERGIDE are linked only by the may_prevent property, and the rest of the medications are not linked to any conditions by any property. The literature seems to be consistent about the preventive nature of DIPYRIDAMOLE [191]; however, RANITIDINE is listed as both a treatment and a preventive agent [192], and the literature was not unanimous about the implication of METHYSERGIDE in vascular headaches.

Table 5.8 List of medications for which no therapeutic intent (may_treat) was found in NDF-RT, along with their frequency of occurrence.

Display Name (NDF-ID)            Frequency of occurrence
DIPYRIDAMOLE (N0000146237)       185
RANITIDINE (N0000148012)         7
CALCIUM (N0000146031)            6
POTASSIUM (N0000146069)          3
PHENYLBUTAZONE (N0000146718)     2
TERFENADINE (N0000147493)        2
CASCARA SAGRADA (N0000147280)    1
COD LIVER OIL (N0000145787)      1
KETOROLAC (N0000178380)          1
ZINC (N0000146033)               1
CASTOR OIL (N0000145784)         1
PEPSIN (N0000146741)             1
METHYSERGIDE (N0000147921)       1
PHENFORMIN (N0000171785)         1

Finally, the medication PHENFORMIN (1 occurrence), which is a commonly prescribed medication for diabetes, has no link to the disease in NDF-RT. Thus, a patient taking this drug will be misclassified in our system as being "not under treatment", causing some degree of discordance between the manual and automatic classifications.
These results highlight the need for continual revision, curation, and updating of drug knowledge-bases. In [105] (see next chapter) we developed a data-driven framework that utilizes machine learning to enhance existing ontologies, with a specific focus on improving drug-disease relationships in a more systematic manner, as a way of at least partially automating this process.
5.7 Related work
The related work can broadly be divided into two general categories. The first addresses the role of ontologies in providing decision support in clinical practice using drug information. The second research area focuses on identifying and annotating drug names in structured, semi-structured and unstructured data. We will describe both in turn.
5.7.1 Ontology-based decision support systems
Clinical decision support systems generally benefit from ontologies in two principal ways[20]. First, ontologies provide a standard vocabulary for biomedical entities, helping to integrate data sources[20]. Second, ontologies are a source of computable domain knowledge that can be exploited for decision support purposes, often in combination with logical axioms. Studies have addressed the problem of designing ontology-based decision support systems in the context of clinical observations, symptoms, and so on. With respect to prescribed medications, studies have aimed to specifically address the application of ontologies in prescribing and scheduling medications. For instance, [193] presents a computerized decision support system for empirical antibiotic prescribing that takes into account antibiotic susceptibility and patient characteristics. OntoQuest [194] is a decision support system that mines a hospital database for previous decisions made. For example, OntoQuest displays a list of the medications prescribed to patients sharing similar histories (encoded as ICD-9 ontological diagnoses), from which the physician may compare the choice of medication for the patient. In [190] the authors developed software to automatically record drug-drug interactions in a very large cohort using the British National Formulary (BNF) as their underlying database for interactions. There are two major conceptual differences between their approach and ours in addressing the drug-drug interaction problem. First, their system requires expensive software installed locally, making their analyses difficult to reproduce, whereas our system utilizes a public drug database for drug interactions. Second, their software requires the user to enter the drug information in a standardized format at the time the prescription is recorded, which demands additional time and effort from individual clinicians.
One interesting observation however, is that the rate of contra-indicated prescriptions in our dataset was notably higher than they reported, which might be due to the fact that their built-in automated system will allow clinicians to more accurately detect the possible interactions at prescribing-time. The examples above are just a few of many studies that have attempted to incorporate prescription drug information for decision support. To date, most studies have been focused on formalized representation of medications for current clinical studies; to the best of our knowledge, fewer studies investigate post facto integration of unstructured “legacy” patient data into existing semantic frameworks. [195] proposes an interesting approach to defining external pharmacological classes in NDF-RT in terms of ingredients, to provide support for the anti-coagulation medications ; however, their methodology does not provide a formal framework for standardized representation of patient data, and also ontological layers and Semantic Web services that are laid over patient data to automate patient classification process. As a result their scope of analyses becomes relatively limited requiring extensive manual intervention. Using public drug data to assist in automatic phenotypic classification of patient records in Semantic Web 127 5.7.2 Drug identification in structured and unstructured text Projects such as ChemSpider ([182],[196]) and ChemList[197] have addressed the first category and have developed sophisticated Natural Language Processing(NLP) methodologies for the identification of small molecules and drugs in free text corpus. However, since clinical datasets may generally be considered as semi-structured-usually with direct reference to medications (e.g. Medication column), identifying drug names is not as challenging a task as for the case of free text. Though we did not take advantage of such capabilities, we acknowledge that using NLP methods in tandem with our methodologies is potentially useful in extracting information from unstructured data (e.g. physician’s note) providing additional decision support. 5.8 Conclusion This study is of both methodological and practical interest. The practical objective of this study was to help clinicians and specialists more reliably classify the patients for different cardiovascular conditions; it also proposes a generic and easily implementable framework to help physicians avoid prescribing contraindicative medications. From the methodological perspective, we presented a generic and flexible framework to transform the prescription medication in legacy datasets into a format suitable for interpretation and integration. Our proposed framework enabled us to dynamically link the legacy patient data to existing drug Ontologies together with generic Semantic Web services, to perform complex reasoning tasks. We demonstrated that clinically relevant patient phenotypes can automatically be classified by integrating legacy patient data sets to existing public drug knowledgebases. By formalizing the phenotype-definitions, and thereby enabling subsequent automation of data analysis using these explicit and self-resolvable classes, we remove the human from the analytical process, thus reducing the number of manual data-manipulations which, by-and- large, go unrecorded and thus lead to non-reproducible science. 
The use cases illustrate the central role potentially played by public drug ontologies in providing more accurate decision-support to expert clinicians, but also point-out certain areas where these resources fall-short of the requirements for this purpose. Using public drug data to assist in automatic phenotypic classification of patient records in Semantic Web 128 5.9 Limitations and future work Due to limited information available in the dataset used in this study, we were unable to incorporate complex dosage information and temporal (longitudinal) reasoning that are required to address more complex and dynamic questions related to treatment regimes and patient history. In the next phase of the project we aim to consider temporal and dosage information attached to patient records and use the framework to define ever-more complex patient phenotypes. We are currently extending this current framework to include temporal, dosage and unit information on a much larger cohort. By using accurate temporal information, we predict that we will be able to do more complex analyses. For instance we may be able to extend the analysis to non-chronic conditions that are more sensitive to time and duration of prescription. Another limitation of this study is in the small number of inaccuracies in information-extraction from the patient record. While spelling correction suggestions are embedded in the data-cleansing framework, as with any other spelling suggestion systems, these are imperfect. For instance the string “ASPRON” is corrected to “MEPRON” using RxNav spelling suggestion system; however “ASPIRON” is corrected to “ASPIRIN”. The design of optimal system to cope with noisy drug data, and use of other statistical and logical approaches to increase the accuracy of mapping, would be an interesting extension to the current system. Additionally, with respect to performance, currently the SPARQL queries take about 2 hours to run on a personal computer, and hence improvements on the computational performance would be desirable. Finally, we wish to extend our current framework to enable more complex reasoning analyses with respect to medications for example to automatically determine the precise class of medication(s) prescribed to patients. For instance, there are four major groups of medication treatments for hypertension and several less frequently used classes: Beta-blockers, Diuretics, Calcium channel blockers, and ACE inhibitors. Automatically mapping patient data to the precise class of medications can provide more reasoning power that can potentially improve decision support systems. A data-driven approach to learning OWL expressions in clinical health care 129 6 A data-driven approach to learning OWL expressions in clinical health care 37 6.1 Synopsis The advent of high throughput technologies has given rise to a rapid increase in the amount of biological and biomedical data. This exponential growth of data in the life sciences has created the need for biological and biomedical ontologies. While the number of bio-ontologies increases, the creation, maintenance, sharing and integration of these ontologies still remain a challenge. More specifically, generation of formal class expressions for biomedical concepts is one of the most useful and at the same time challenging aspects of ontology engineering. For reasons discussed earlier, OWL-DL has become the de facto standard for knowledge representation in the Semantic Web. 
Thus in this chapter, we investigate how we can automatically learn OWL-DL class expressions in support of semi-automated ontology engineering; specifically focusing on clinical phenotypes. In this chapter, we discuss data-driven methodologies in for automatic and semi-automatic ontology building and evaluation. These methodologies are developed to leverage machine-learning (ML) approaches for the creation, extension and evaluation of ontologies. We first present the related work and existing frameworks. Then, we address the key research objectives of this thesis, where we utilize the available methodologies for learning OWL-DL class expressions for clinical phenotypes. We evaluate the effectiveness of our framework in aiding clinicians to automatically (or semi- automatically) build and evaluate their clinical classifications and hypotheses in the form of “personalized” OWL-DL ontologies. 6.2 Introduction Today, ontologies have become an indispensable element for carrying out biological and biomedical research, and numerous bio-medical ontologies have been built to assist in both clinical and research activities. Building such ontologies is labor intensive, requiring a broad range of specialist-expertise, and therefore, is both time-consuming and expensive. In particular, creating logical class expressions 37 Portions of the first section of this chapter has appeared in the publication: Soroush Samadian,Benjamin M. Good,Bruce McManus,Mark D. Wilkinson (2012) A data-driven approach to automatic discovery of prescription drugs in cardiovascular risk management. Bio-Ontologies 2012 A data-driven approach to learning OWL expressions in clinical health care 130 constitutes one of the more demanding aspects of ontology engineering[31], and in most cases requires considerable domain-training. Thus it would be desirable to automate (or at least semi- automate) the process of creating class expressions in support of non-expert ontology engineering. In this chapter, we focus on data-driven approaches to address the learning class expression problems. Data-driven approaches to ontology enrichment are particularly useful in scenarios where extensional information (i.e. facts, instance data) is abundant, whereas the corresponding intentional information (schema) is missing or not expressive enough to allow powerful reasoning over the data in a useful way[31]. Such situations frequently occur in clinical and biomedical databases where there is extensive legacy data available. The issue becomes, therefore, defining approaches where the existing data can be automatically mined to give clues as to the optimal schema within which it could be classified. In the previous chapters we conducted the feasibility study where we undertook to empirically evaluate our ability to model, using manually-constructed ontologies, clinical observations. In this chapter we examine whether using data-driven approaches can guide the specification of classifications (ontologies) automatically. This chapter has two major sections. First, we discuss the conceptual framework of our design focusing on interesting clinical cases. Second, we evaluate the framework for a number of different clinical scenarios including automatically identifying patterns that mirror the expert knowledge encoded in existing clinical ontologies. 
Once we demonstrate that rule-discovery approaches can ontologically (with OWL-DL axioms) model healthcare domain knowledge available in existing ontologies, we then undertake to evaluate whether data-driven approaches can guide the specification of OWL-DL axiom-based classification rules by clinical researchers or other non-ontologists.
6.3 Related work
The related work can be broadly categorized into two areas: ontology evaluation, which is indirectly related, and ontology enrichment, which is more directly related to our study, as discussed in the following.
6.3.1 Ontology evaluation
Ontology evaluation is the problem of assessing a given ontology from the point of view of a particular criterion of application, typically in order to determine which of several ontologies would best suit a particular purpose[198]. Ontology evaluation is crucial for a wide variety of applications, both in the Semantic Web and in other semantically enabled technologies. A wide array of approaches to ontology evaluation have been considered in the literature, depending on the specific aspect of ontology evaluation being addressed, such as coverage of terminology, correctness of hierarchical relationships (is-a), and so on[198]. The specific aspect of ontology evaluation most related to this study is the consistency of instance classification, defined as the ability of an ontology to provide a precise delineation of the class definitions it contains. To the best of our knowledge, the only framework that has specifically addressed this aspect of ontology evaluation is Ontoloki[15], which originally inspired the experiments presented later in this chapter; as such, the Ontoloki framework will be described in some detail now.
 Ontoloki
Ontoloki is an approach for data-driven ontology evaluation developed by Benjamin Good[15]. Machine learning algorithms are used to empirically determine the rules (patterns of properties) that allow class membership to be predicted reliably. More specifically, Ontoloki takes as its input an OWL/RDF knowledgebase containing an ontology, instances associated with each of the classes in the ontology, and the properties of those instances. Ontoloki was originally designed to evaluate the ability of an ontology to classify data based on its properties; however, the project had an interesting side-effect: it proved capable of novel discoveries (in terms of properties) about the ontology's classification axioms. Such discoveries can enhance its capabilities in anticipating instance memberships for existing ontologies (e.g., the Gene Ontology (GO)[22]). As a result, Ontoloki can be considered a framework that can be used for both ontology evaluation and ontology enrichment purposes.
6.3.2 Ontology enrichment
The term "enrichment" is generally defined as the extension of a knowledgebase schema[199] (i.e., extension of the T-Box, as defined in chapter 2) by a process of gradually increasing the expressiveness of a knowledgebase[199]. Knowledgebase enrichment can be considered a sub-discipline of ontology learning. Figure 6.1 (repeated from chapter 2) shows the incremental layers of complexity that are traversed during ontology learning. Ontology enrichment is usually used when data already existing in the knowledgebase is exploited to improve the richness of the ontology.
As discussed in chapter 2, a wide range of algorithms and methodologies spanning several research areas such as machine learning (ML), statistical analyses, Wiki-based environments and natural language processing techniques (NLP) have been developed to address ontology enrichment. In the context of this chapter, we consider applying techniques to find axioms and resulting instance-relationships that can be added to an existing ontology (the green boxes shown in Figure 6.1). The task can be defined as the process of eliciting an OWL-DL description of a concept for given instance members and non-members of the concepts. Figure 6.1 Layers of Ontology Development Process (adapted from [120] orig inally presented in [119]) Finding class-defining axioms is one of the most demanding and complex tasks in ontology engineering. It is closely related to inductive logic programming (ILP) [200]and more specifically supervised learning in description logics[199]. The use of ILP in description logic is well-established (e.g.[201]) where the goal is to find the most specific description for a concept. The field has been further advanced by the invention of several pruning(e.g. refinement operators 38 ) and optimization algorithms [61]and several applications were implemented that utilized these approaches, such as YINYANG[202] and DL-FOIL[203]. [204] proposes a framework for reasoner-aided relational 38 Refinement operators are used to narrow down the searching process in the space of possible expressions A data-driven approach to learning OWL expressions in clinical health care 133 exploration of OWL DL ontologies alongside a tool developed called RELExo which is developed to support the acquisition and refinement of complex class descriptions[204]. RELExo is an interactive application focusing on acquisition of role restrictions by asking the knowledge engineer a series of questions. The knowledge engineer has to either positively answer the question or provide a counterexample. Another enrichment task is called knowledgebase completion(not to be confused with knowledgebase enrichment) where the aim is to make the knowledgebase complete in a particular sense[205]. For example, a goal could be to ensure that all subclass relationships between named classes should automatically be inferred[205]. Some examples of the work in this area include [206–210]. It has been demonstrated that the existing algorithms generally tend to produce long and complex expressions that may be hard to interpret by the knowledge engineers([31],[205]). [61] proposes an algorithm to address some of the problems associated with these systems. The algorithm is called Class Expression Learning for Ontology Engineering (CELOE) and tool developed to implement the algorithm is called DL-Learner. Compared to other frameworks, CELOE is biased towards short solutions for expression to increase readability of the suggestions made[31]. Additionally, CELOE uses a reasoning procedure for instance checks which follows the closed world assumption (CWA) to increase the reasoning speed. The second part of the study presented in this chapter is based on the approach proposed by CELOE where we evaluate rule-discovery approaches to guide the specification of OWL-DL-based cardiovascular “classifications” by clinical researchers. 6.4 Experiments We conducted two clinical experiments for automatic learning of OWL-DL class expressions in OWL. 
Both experiments are motivated by important real-life clinical problems, but the clinical backgrounds of the two studies are different. There were two core differences between the first and second experiment: 1. In the first, we conducted a study to test our ability to identify a set of known (“gold standard”) rules with respect to classification consistency. Thus, in the first experiment the classification rules are known a priori whereas in the second experiment we do not know the classification rules in advance. A data-driven approach to learning OWL expressions in clinical health care 134 2. In the second experiment we asked our clinical partners to identify concepts of interest whereas in the first experiment these concepts were defined by existing ontologies. We then attempted to create the classification rules for each concept de novo. Each study (datasets, methods, results and conclusion) is presented separately in detail. 6.5 Clinical case study 1: automatic discovery of prescription drugs in cardiovascular risk management 6.5.1 Summary of the experiment The objective of this study was to evaluate data-driven approaches for automatically identifying medications used in the treatment of cardiovascular diseases and consider how the learned rules might be applied to the process of ontology curation and evaluation. In this experiment, we mined the clinical records of a large cardiovascular patient cohort, focusing on their clinical phenotype and their prescribed medications. Machine learning algorithms from WEKA[211] detected rules linking medications to patient’s treatment-status. These rules were then compared to axioms encoded in the National Drug File and Reference Terminology (NDF-RT)[75] Ontology. For most medications in the dataset we were able to re-discover, with high precision, the prescriptive rules present in the NDF-RT; however, we discovered partial coverage (4/19 possible rules) for medications linked to Chronic Heart Failure, and no rules for medications linked to Hypertension. Subsequently, we show that, in some cases, these rules contain more detailed information than is present in the NDF-RT itself. This experiment demonstrates how data-driven approaches might be used to ameliorate the knowledge acquisition problem for ontology design. We show that the learned rules could be used to evaluate and improve an existing ontology (NDF-RT). We propose that these rules could be used to automatically construct ontological axioms, thus semi-automating the process of de novo ontology construction for a given domain. 6.5.2 Overview of the method As explained in the introductory section a wide range of algorithms and approaches have been developed to address automatic knowledge discovery applied to ontology engineering; of which most relevant/similar to the current problem are the CELOE and Ontoloki algorithms. We model our current approach after these studies, in that we utilize patterns in the properties of instance data to suggest classification rules; however our approach differs in two meaningful ways. First, the previous A data-driven approach to learning OWL expressions in clinical health care 135 studies started with "structured" data already in an ontological/linked-data framework, whereas we begin with "unstructured", manually-entered legacy data stored in tabular format. Second, we know a priori that our data - clinical records of prescribed drugs - contains information about prescribing "preference"; i.e., a "ranking" of rules. 
While the CELOE (and Ontoloki) system provides a ranking of discovered rules from which the user may select one to turn into a formal class-defining axiom, the rankings we are trying to detect are conceptually distinct in that all of the ranked rules are "correct" with varying degrees of coverage. Therefore we would like to be able to preserve this ranked-rule information for the case of non-Description Logic-based classification. Our primary questions, therefore, are (a) can we utilize OWL class expression learning methodologies to boot-strap the creation of an ontological framework de novo from unstructured data, and (b) can we detect subtleties such as preferential ranking of otherwise valid, but distinct, classification-solutions. In this experiment, we mine data from the drug component of legacy clinical datasets in an attempt to automatically discover how drugs are being used to treat diseases. We focus on rule discovery between several well-known cardiovascular risk factors and conditions including Diabetes Mellitus (DM), Hypercholesterolemia (HC), Chronic Heart Failure (HF), Stroke (ST) and Hypertension (HT) and the medications prescribed to such patients. Discovered rules are then evaluated against the widely used National Drug File Reference Terminology (NDF-RT), which (arguably) represents a "gold standard" for correct prescriptions. 6.5.3 Dataset and data collection The dataset used for our experiments involved the clinical records of a cardiovascular patient cohort collected from a referral hospital in Nebraska, USA, between 1986 and 1989, including 536 unique patients over 1523 “encounters”. Each encounter is considered the status of the patient as a specific point in time. Clinical observations, medications, and treatment status were recorded for every patient. Table 6.1 shows a small section of the dataset used in this study. For instance the first patient (row 1 in Table 6.1) is prescribed Vasotec, Perstantine and etc. at the specific time the data is recorded. Additionally the patient is considered “under treatment” for hypertension and Diabetes; however, the explicit explanation as to “why” this patient is considered to be under treatment is not recorded. A data-driven approach to learning OWL expressions in clinical health care 136 Table 6.1 The first two rows of the dataset used in the original format . In the last four columns “1” represents “treated” and “0” represents a “not under treatment” for the condition listed in the header. 6.5.4 NDF-RT Ontology The NDF-RT is a public resource developed by the Department of Veterans Affairs (VA) and is available in OWL[212]. NDF-RT is a concept-oriented terminology - a collection of concepts, each of which has a single, unique “meaning”. NDF-RT’s medications are described in terms of their active ingredients (has_ingredient), mechanisms of action (has_MoA), physiologic effects (has_PE), and therapeutics (may_treat, may_prevent). Of particular interest in this study is the "may_treat" property as it links medications with clinical phenotypes they are intended to treat. This knowledgebase is used as the standard against which we evaluate our predicted rules. A detailed description of the ontology was provided in the previous chapter. 6.5.5 Preprocessing and data transformation Medications in our dataset were recorded as a combination of brand-names, brand-names with specified doses, generic names, active moieties, abbreviations, and acronyms, all of which were prone to mis-spellings and other typographical variations. 
Thus to maximize the efficacy of our data-mining we first needed to standardize the data elements. The block diagram shown in Figure 2.1 shows the four major standardization steps we undertook (green blocks in the Figure 2.1). Figure 6.2 Migration from legacy data into the format suitable for Machine learning algorithm Using a similar approach presented in chapter 5, we used Rx_Norm[74] and Rx_Nav[213] to harmonize the disparate medication names in our dataset. After harmonization, a unique ID was assigned to each medication corresponding to the numeric identifiers in the NDF-RT Ontology. In total, 150 unique medications were detected in our dataset. We then removed the medications which A data-driven approach to learning OWL expressions in clinical health care 137 occurred less than three times, resulting in a final total of 56 unique and standardized medications. A matrix was then constructed associating each patient with each medication, where values of "1" and "0" indicate that the patent is prescribed the medication, or not, respectively. Additional columns are then added to each patient row showing whether the clinician annotated them as being treated or not treated for Diabetes, High Cholesterol, Hypertension, Stroke, and Heart Failure, where any given patient may be simultaneously treated for numerous clinical disorders with a wide array of drugs. Based on prior experience from other similar studies [15]we selected the JRip algorithm from WEKA[214] to discover relationships between medications and treatment status. We predict we should discover disorder-drug associations in this dataset corresponding to the "may_treat" relationships in NDF-RT. The example shown in Figure 6.2, shows how the OWL axiom “may treat some Diabetes” present in the NDF-RT ontology could be learned directly from recorded uses of this drug in patient records. This basic pattern can be used to expand class definitions, to evaluate definitions based on real world data and to provide continuous estimates of confidence for qualitative logical definitions. Figure 6.3 Inferring components of formal class definitions from instance data. The axiom “may treat some Diabetes” (DM) is learned from legacy data automatically and can be added to an existing OW L-DL Class. 6.6 Results for case study 1 The rules discovered by JRip of most interest in this study are of the form “If (Medication), Then (Condition)”. Consider the first rule generated for diabetes (DM). (INSULIN = 1) => DM = Treated (14.0/0.0) A data-driven approach to learning OWL expressions in clinical health care 138 The above rule can be interpreted in natural language as follows. If Insulin is prescribed for a patient, the patient is treated for diabetes, and this rule is true for 14 instances with no exceptions (numbers in the parenthesis at the end of the rule). Below we report the rules learned for each condition using the entire dataset and the results of 6-fold cross-validation experiments in terms of accuracy (acc), precision (P), and recall (R). The cross- validation metrics provide a measure to estimate the likely performance of this system on new data sets from a purely computational perspective while the rules themselves can be assessed based on clinicians' knowledge. 
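Before the per-condition rules are listed, the following sketch shows how this rule-learning step can be reproduced with the WEKA Java API; the ARFF file name and the "Treated"/"Not_Treated" class labels are assumptions standing in for the patient-by-medication matrix described above, not artifacts from the study itself.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: learn "(MEDICATION = 1) => CONDITION = Treated" style rules with JRip.
// "patients_dm.arff" is a hypothetical export of the 0/1 medication matrix with the
// Diabetes treatment flag as the last (class) attribute.
public class DrugRuleLearner {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("patients_dm.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);            // class = Treated / Not_Treated

        JRip jrip = new JRip();
        jrip.buildClassifier(data);
        System.out.println(jrip);                                // prints the induced rule set

        Evaluation eval = new Evaluation(data);                  // 6-fold cross-validation, as reported in the text
        eval.crossValidateModel(new JRip(), data, 6, new Random(1));
        int treated = data.classAttribute().indexOfValue("Treated");
        System.out.printf("acc: %.2f%% P: %.2f R: %.2f%n",
                eval.pctCorrect(), eval.precision(treated), eval.recall(treated));
    }
}

The printed rule set is what appears, per condition, in the listings that follow.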
 Diabetes (DM) acc: 90.63 % P: 0.89 R: 0.62
(INSULIN = 1) => DM = Treated (14.0/0.0)
(CHLORPROPAMIDE = 1) => DM = Treated (4.0/0.0)
(GLIPIZIDE = 1) => DM = Treated (6.0/1.0)
(GLYBURIDE = 1) => DM = Treated (3.0/0.0)
(TOLBUTAMIDE = 1) => DM = Treated (2.0/0.0)
=> DM = Not_Treated (163.0/11.0)

 Hypercholesterolemia (CH) acc: 72.4 % P: 0.83 R: 0.53
(GEMFIBROZIL = 1) => CHOL = Treated (23.0/4.0)
(LOVASTATIN = 1) => CHOL = Treated (17.0/2.0)
(CHOLESTYRAMINERESIN = 1) => CHOL = Treated (8.0/0.0)
(PROBUCOL = 1) => CHOL = Treated (3.0/0.0)
=> CHOL = Not_Treated (141.0/47.0)

 Stroke (ST) acc: 91.15 % P: 0.63 R: 0.78
(WARFARIN = 1) => ST = Treated (28.0/8.0)
(PENTOXIFYLLINE = 1) => ST = Treated (5.0/2.0)
=> STROKE = Not_Treated (159.0/3.0)

 Heart Failure (HF) acc: 89.58% P: 0.30 R: 0.19
(NITROGLYCERIN = 1) and (FUROSEMIDE = 1) => HF = Treated (7.0/3.0)
(ATENOLOL = 1) and (NIFEDIPINE = 1) => HF = Treated (6.0/2.0)
=> HF = Not_Treated (179.0/8.0)

 Hypertension (HT) acc: 71.3% P: 0.71 R: 1.0
=> HT = Treated (192.0/55.0)

Table 6.2 shows the medications and diseases-of-interest compared by the presence (x) / absence (-) of a may_treat association in the NDF-RT; according to Table 6.2, we make the following observations for the different conditions.

Table 6.2 The list of medications and diseases-of-interest compared by the presence (x) / absence (-) of a may_treat association in the NDF-RT (DI. = Diabetes, CH. = Hypercholesterolemia, ST. = Stroke, HT. = Hypertension, HF. = Heart Failure)

Medication              DI.  CH.  ST.  HT.  HF.
ACETOHEXAMIDE           x    -    -    -    -
ATENOLOL                -    -    -    x    x
BENDROFLUMETHIAZIDE     -    -    -    x    x
CAPTOPRIL               -    -    -    x    x
CHLOROTHIAZIDE          -    -    -    x    x
CHLORPROPAMIDE          x    -    -    -    -
CHLORTHALIDONE          -    -    -    x    x
DIGOXIN                 -    -    -    -    x
DILTIAZEM               -    -    -    x    -
ENALAPRIL               -    -    -    x    x
FUROSEMIDE              -    -    -    x    x
GEMFIBROZIL             -    x    -    -    -
GLIPIZIDE               x    -    -    -    -
GLYBURIDE               x    -    -    -    -
HYDROCHLOROTHIAZIDE     -    -    -    x    x
INSULIN                 x    -    -    -    -
ISOSORBIDE              -    -    -    -    x
LOVASTATIN              -    x    -    -    -
CHOLESTYRAMINE          -    x    -    -    -
METHYLDOPA              -    -    -    -    x
METOPROLOL              -    -    -    x    x
NADOLOL                 -    -    -    x    x
NIFEDIPINE              -    -    -    -    x
NITROGLYCERIN           -    -    -    x    x
PENTOXIFYLLINE          -    -    x    -    -
PROBUCOL                -    x    -    -    -
PROPRANOLOL             -    -    -    x    x
SPIRONOLACTONE          -    -    -    x    x
TIMOLOL                 -    -    -    x    x
TOLBUTAMIDE             x    -    -    -    -
VERAPAMIL               -    -    -    -    x
WARFARIN                -    -    x    -    -

 Diabetes
All six medications discovered by JRip are in a may_treat relationship to diabetes in the NDF-RT, and only these six drugs from our dataset have this relationship in the NDF-RT, indicating that our rule discovery was comprehensive (see Table 6.2). Additionally, JRip provides a ranking of medications by the order in which the rules are generated. With the exception of METFORMIN, which has become the "drug of choice" in recent years[215] but was not present in our data (since it was not available in the United States until 1995), the rest of the medication rules were discovered in the same order of ranking vis-à-vis actual diabetes treatment recommendations 39, indicating that we are able to detect and codify quite detailed usage preferences.
39 http://www.drugs.com/condition/diabetes-mellitus-type-ii.html

 Hypercholesterolemia
Similar to the Diabetes case, JRip was able to discover all four medications linked to Hypercholesterolemia in the NDF-RT. However, unlike the diabetes case, the first medication discovered by the rule is primarily used to treat hyperlipoproteinemia and is only a second-line therapy for Hypercholesterolemia according to DrugBank [38].

 Stroke
JRip discovered both drugs used to treat stroke, in order of their recommendation in existing guidelines. Warfarin, which is an oral anticoagulant, has been the first medication of choice for stroke treatment (and prevention) for over 50 years [216].

 Heart Failure
The rules discovered by JRip for HF have a different form from the previously described rules in that their antecedents are each composed of a logical conjunction of two constraints. This is due to JRip's parsimony in rule selection, which finds the most comprehensive conjunctive rule. Though the
translation of these rules into axioms is not as straightforward as in the previous cases, all four of the medications identified by JRip are in fact used to treat chronic heart failure. However, there are 19 medications (not all of them are shown in Table 6.2) in our dataset that are linked to the condition in the NDF-RT. It is likely that JRip was not able to find rules for those drugs due to the scarcity of records annotated as "treated for HF" (14 records = ~7% of the dataset) compared to the large diversity of medications that are used for HF treatment.

 Hypertension (HT)
JRip was unable to find any interesting rules in the data for HT. In contrast to the missing rules in HF, the missing rules for HT are almost certainly the result of a disproportionately high fraction of records annotated as Hypertensive (71% of the records) and also the large number of medications used for the condition (16 medications), which makes any detected rules more likely to occur by chance. The wide variety of drugs prescribed may be due to the maze of contra-indications, co-morbidities, and physician preferences, making it difficult to achieve "signal" over the noise of real clinical data. This problem might be resolved by using a larger dataset or by reducing the number of co-dependencies in the data; i.e., building a dataset that focused specifically on the hypertension phenotype.
6.7 Broader implications of case study 1
6.7.1 Data-driven knowledge discovery
This pilot study demonstrates how legacy patient data can be mined to discover interesting rules that mirror the expert knowledge encoded in modern clinical ontologies. We propose that these rules can, therefore, facilitate semi-automatic knowledge acquisition and ontology building by automatically identifying rules that could be used to "bootstrap" formal ontological axioms where none exist. The overall consistency of our results in matching the NDF-RT may_treat property suggests that knowledge-bases constructed from these rules would have a relatively high level of accuracy compared to those constructed manually, thus providing an inexpensive and easily-constructed ontological "scaffold" which can then be manually edited by tools such as those provided by the DL-Learner project. While in this case our results represent, in a way, a "self-fulfilling prophecy" (since the clinicians whose data we are examining were following the rules that we are trying to automatically discover), this will not always be the case.
There are a wide variety of datasets, both A data-driven approach to learning OWL expressions in clinical health care 142 clinical and otherwise, for which rules are not known a priori (for example, consider the extension of the analysis to non-chronic and non-common conditions). This study demonstrates that we can, with considerable precision, automatically generate at least a set of template classification rules for such datasets. The work presented in the second experiment builds on these ideas and hypotheses. 6.7.2 Potential for improvement of NDF-RT The JRip algorithm generated rules that can be used to rank medications in terms of their relevance to specific conditions. Due to limited data this pilot study was not conclusive; however we provide some preliminary evidence that properties such as may_treat can be refined into more granular formal-logic properties (e.g. "optimal_treatment_for"), or could be used to generate more statistical (non-DL) classification rules. Nevertheless, by combining rankings from a large number of diverse datasets, a better measure of a given drug's efficacy may be obtained. This observation is quite important in the context of ontology maintenance and evaluation, since it provides an approach to measuring the "correctness" or "precision" of an ontology relative to any given data-set by simply comparing the automatically-generated rules to the ontologically defined rules. Using such data-driven approaches, an ontology curator would be able to evaluate the ability to accurately represent, for example, a newly acquired dataset, or evaluate the accuracy and precision of their ontological knowledge-representation as their underlying dataset grows and changes over time - a common scenario in the life sciences. 6.7.3 Medication ranking proposal for clinical use In addition to the importance of our study from the knowledge engineering perspective, our project had some implications that are of relevance to our clinical partners (Dr. Bruce McManus, Dr. John Boyd and Dr. Keith Walley). The task of rating medications is highly subjective and, given clinician- to-clinician variations and "habits", different geographical locations, availability, cost, differences in demographics pre-existing conditions and previous medications prescribed is difficult to determine with acceptable accuracy 40 . However, once we can collect multiple datasets from multiple centers targeted for demographic diversity, we argue that we may better cope with the “noise” and achieve more reliable results. As we saw the generated rules provided an implicit ranking of medications according to a specific dataset. Thus we propose to extend the framework to multiple databases and to use more elaborate ML techniques (see Appendix E for an example case). 40 http://www.drugs.com/members_comments_add.php?ddc_id=3280&brand_name_id=0&condition_id=432 A data-driven approach to learning OWL expressions in clinical health care 143 6.8 Clinical case study 2: DL Synthesis of clinical phenotypes for septic shock patients: An experiment with DL-learner 6.8.1 Summary of the experiment In the previous experiment we showed how legacy patient data can be mined to discover patterns that mirror the expert knowledge encoded in modern clinical ontologies. The previous study generated results that were in a way expected; the clinicians whose data we used to elicit rules were following rules that we attempt to automatically discover. 
However, this will not always be the case, as there are a large number of datasets for which rules are not known a priori. Additionally, in the previous study we did not involve clinicians in the process of building OWL expressions for classes. In this study, we extend our approach to determine if data-driven rule-discovery techniques can guide the construction of OWL expressions for clinical phenotypes in settings where the expressions are not known a priori. We used a portion of a cohort of septic shock patients (VASST) [217] to build our knowledge-base and evaluate our approach. The evaluations begin with a control experiment intended to show that the implementation is successful in producing the expected OWL class expressions for known concepts (concepts for which clear and unambiguous classification rules exist). This is followed by the evaluation of our approach on more complex phenotypes for which no clear classification rules exist. Finally, we evaluate the correctness of "novel" class expressions in collaboration with experts.
6.8.2 Control experiment
To evaluate the ability of our method, we begin by identifying a gold standard with regard to classification consistency. We used two ontologies for this purpose: 1) an existing ontology from the biomedical domain and 2) our artificially constructed ontology (the VASST Ontology). In the following we discuss the two experiments.
6.8.2.1 Phosphabase ontology
To minimize possible modeling biases, in the first phase we use a real-life ontology from the biomedical domain that has previously been shown to have highly consistent class expressions. The Phosphabase[218] ontology is an OWL ontology that contains information for describing protein phosphatases based on their domain composition. Classes in Phosphabase are equipped with OWL-DL axiom-based definitions and hence Phosphabase can serve as a suitable control. Figure 6.3 shows a snapshot taken from the Phosphabase ontology in the Protégé environment. As shown, the class "classical_tyrosine_phosphatase" [219] (left side of the figure) is defined as follows:
containsDomain some IPR000242
Thus, if a new protein is discovered that meets the above definition, existing reasoners 41 such as Pellet[99] and FaCT++[100] can be used to automatically place the newly discovered protein as a member of "classical_tyrosine_phosphatase". Generally, ontologies such as Phosphabase that are constructed using OWL class expressions can be used to benchmark OWL class-expression learning algorithms. The formal, computable restrictions on class membership that they encode can be used to guarantee that the instances of each class demonstrate a specific pattern of properties[15]. The ability of the learning algorithms to rediscover the decision boundaries (expressed in the properties of the instances assigned to the classes) serves as a positive control on our implementation[15]. If any algorithms fail to rediscover such straightforward and consistent patterns, they will probably fail to identify the more complex patterns in noisy clinical data.
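The instance-classification step just described can be made concrete with a few lines against the OWL API; the namespace and entity IRIs below are placeholders rather than the actual Phosphabase identifiers, and HermiT is used here only because it exposes the same OWLReasoner interface as Pellet and FaCT++.

import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

// Minimal sketch: assert the Phosphabase-style definition
//   classical_tyrosine_phosphatase EquivalentTo (containsDomain some IPR000242)
// and let a DL reasoner place a new protein under that class.
// All IRIs are illustrative placeholders, not the real Phosphabase identifiers.
public class PhosphataseClassification {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager mgr = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = mgr.getOWLDataFactory();
        String ns = "http://example.org/phospho#";
        OWLOntology ont = mgr.createOntology(IRI.create(ns));

        OWLClass ctp = df.getOWLClass(IRI.create(ns + "classical_tyrosine_phosphatase"));
        OWLClass ipr = df.getOWLClass(IRI.create(ns + "IPR000242"));
        OWLObjectProperty containsDomain = df.getOWLObjectProperty(IRI.create(ns + "containsDomain"));

        // classical_tyrosine_phosphatase EquivalentTo containsDomain some IPR000242
        mgr.addAxiom(ont, df.getOWLEquivalentClassesAxiom(
                ctp, df.getOWLObjectSomeValuesFrom(containsDomain, ipr)));

        // a newly observed protein that carries an IPR000242 domain
        OWLNamedIndividual protein = df.getOWLNamedIndividual(IRI.create(ns + "newProtein"));
        OWLNamedIndividual domain = df.getOWLNamedIndividual(IRI.create(ns + "domain1"));
        mgr.addAxiom(ont, df.getOWLClassAssertionAxiom(ipr, domain));
        mgr.addAxiom(ont, df.getOWLObjectPropertyAssertionAxiom(containsDomain, protein, domain));

        // HermiT shown here; any OWLReasonerFactory implementation works the same way
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ont);
        System.out.println("classified as classical_tyrosine_phosphatase: "
                + reasoner.getInstances(ctp, false).containsEntity(protein));
    }
}

Rediscovering exactly this kind of someValuesFrom boundary from the instance data, rather than asserting it by hand, is what the learning experiments below evaluate.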
Figure 6.3 A snapshot from the Protégé ontology editing environment showing the class "classical_tyrosine_phosphatase" (left side) and the class-defining axiom (right side)
41 A reasoner is a piece of software able to draw logical inferences from a set of asserted axioms (see chapter 2)
In addition to being a highly axiomatic ontology, Phosphabase has previously been used as a control experiment in the Ontoloki[15] study; as such, it can be used to benchmark the performance of the CELOE algorithm (and its variants) against the best algorithms used by Ontoloki (JRip and Chi_25JRip[211]).
6.8.2.2 Phosphabase knowledgebase generation
We used the RDF knowledgebase previously created by Benjamin Good (i.e., by querying UniProt[91]). For the control experiment, the only properties considered were the InterPro domains (the containsDomain property) and the presence of transmembrane regions (containsTransembraneRegion), since they were the only features used in the Phosphabase class restrictions. Once the instances were retrieved and their annotations mapped to the Phosphabase representation, they were submitted to the Pellet reasoner and thereby classified within the ontology. This process results in a knowledgebase where the properties of the instances are guaranteed to contain the information required to identify the decision boundaries represented in the original class definitions. Using a similar approach to the one used in Ontoloki, we created two versions of the ontology as follows:
1) A schema-level (T-Box) version of the ontology, including all the OWL-DL class expressions for the classes (left side of Figure 6.4) but no instances. This ontology includes the axioms that, ideally, the learner should be producing.
2) An instance-level (A-Box) version containing all the instances (individuals) of the classes within the ontology; however, this time we removed all the OWL-DL class expressions from the classes, thus creating a simple class hierarchy (right side of Figure 6.4).
A snapshot of these two ontologies in Protégé is shown in Figure 6.4 for comparison. The ontology on the right side includes instances (bottom right). These instances are used by DL-learner to learn the class expressions shown on the left side.
Figure 6.4 Comparison of the two ontologies in Protégé. The ontology on the right side includes instances (bottom right). These instances are used by DL-learner to learn the class expressions shown on the left side
6.8.2.3 Phosphabase ontology results
We used the DL-learner Protégé plugin to learn OWL class expressions for 19 classes that had previously been learned by Ontoloki. These were classes that were defined using the containsDomain property and had at least 5 positive and 5 negative examples. One major difference between DL-learner and Ontoloki is that DL-learner can be adjusted to make up to 100 suggestions per class and rank them in order of decreasing accuracy, whereas Ontoloki can only make one suggestion per class. Ideally, the rule with the highest accuracy should be the one encoded within the ontology. For the control experiment we set the number of rules per class to 5. The JRip algorithm had previously identified 18 out of 19 class definitions correctly[15].
DL-learner was able to identify all of the expressions used to define the classes and, in most cases, the first discovered rule was the rule originally encoded in the ontology (For detailed information and setting the parameters for learning expressions please refer to Appendix E). For instance Figure 6.5 shows the 5 rules learned for class PPP (Phosphoprotein Phosphatase). Out of the 5 rules learned for the definition of PPP, only the first one (outlined in the figure), already exist in the ontology. As we will see, in cases where the rules are not known a priori, it is the knowledge engineer’s (or domain expert’s) job to decide which OWL-DL expressions can/should be added to the ontology. A data-driven approach to learning OWL expressions in clinical health care 147 Figure 6.5 Rules suggested by DL-learner for class PPP. Only the first rule already exists in the ontology. The knowledge engineer will have to decide which rule(s) may be added to the ontology (“Add” button on the right) 6.8.3 Experiment: OWL representation of clinical phenotypes in VASST dataset The purpose in of the control experiments was to show that the DL-learner can successfully “rediscover” the OWL class expressions definition of classes in cases where those expressions are guaranteed to exist. In the next phase of the experiment, we extended the analysis to include clinical legacy datasets. This experiment was motivated by a clinically-oriented study in which we envisioned being able to find OWL expressions for clinical phenotype in VASST study. The main conceptual differences between this step and the Phosphabase experiment are: 1) In this experiment we use a legacy (and noisy) dataset to generate our knowledgebase 2) The ontological scaffold (hierarchical structure of the ontology) to represent clinical data does not exist beforehand and hence we need to generate it de novo. 3) Some of the attributes (e.g. clinical measurements) in the ontology are numerical; since the existing algorithms in DL-learner are unable to identify class expressions for numerical attributes, we need to extend the DL-learner algorithm so that it is capable of finding OWL- DL constraints on numerical attributes. 4) There is no a priori “gold standard” to compare the generated OWL-DL against. The majority of rules must therefore be evaluated using expert knowledge. In the first phase, we evaluate the DL learner approach for its ability to correctly identify patterns for which clear and unambiguous definitions exist in the form of established guidelines. These A data-driven approach to learning OWL expressions in clinical health care 148 phenotypes are defined by the VASST criteria for including patients in the study, based on the values recorded at the time of admittance (for some measurements one maximum and one minimum value were recorded) for each of the physiological variables: Fever, Tachycardia, Tachypnea, Abnormal QTc waveform, Pathologic WBC count, Renal failure, Respiratory failure, and Coagulopathy. Once we conduct the first phase of the experiment, we extend the analysis to the “interesting” phenotypes (as selected by clinicians) for which clear diagnostic guidelines do not exist in advance. This step will lead to generation of class expressions for clinical phenotypes that are currently only loosely defined, and/or part of the "vocabulary" used by individual clinical teams. We then evaluate the correctness of these class expressions in collaboration with the clinicians whose expertise we are attempting to formally encode. 
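As an illustration of how such guideline-defined phenotypes translate into example sets for learning, the sketch below labels positive and negative examples for a "Fever" phenotype from admission temperatures. The 38.0 °C cut-off and the patient identifiers are illustrative assumptions only; the actual VASST inclusion criteria are those referenced in Appendix E.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch: split patients into positive and negative examples for one
// guideline-defined phenotype ("Fever") based on a recorded admission value.
public class ExampleSetBuilder {
    public static void main(String[] args) {
        Map<String, Double> maxTemperature = Map.of(   // patientId -> max recorded temperature (degrees C)
                "patient_001", 39.1,
                "patient_002", 36.8,
                "patient_003", 38.4);

        List<String> positives = new ArrayList<>();    // individuals asserted as Fever examples
        List<String> negatives = new ArrayList<>();    // individuals asserted as non-examples

        for (Map.Entry<String, Double> e : maxTemperature.entrySet()) {
            if (e.getValue() >= 38.0) positives.add(e.getKey());   // illustrative threshold only
            else negatives.add(e.getKey());
        }

        System.out.println("positive examples: " + positives);
        System.out.println("negative examples: " + negatives);
        // These two lists are what a class-expression learner such as DL-Learner
        // takes as input when learning an OWL-DL definition for the phenotype.
    }
}

In the actual experiment these example sets are asserted in the OWL/RDF knowledgebase rather than kept in memory, but the labeling logic is the same.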
6.8.3.1 Overview of methodology The general high level structure of the learning framework is illustrated in Figure 6.4. The entities in legacy data are divided into two broad categories of measurements and phenotypes followed by the approach proposed in [220]. Three separate ontologies were used to standardize the elements of clinical data: raw clinical measurements, prescription medications and patient phenotypes. A data-driven approach to learning OWL expressions in clinical health care 149 Figure 6.6 Three separate ontologies were adopted to standardize the elements of a clinical data: raw clinical measurements, prescription medications and patient phenotypes. Two knowledgebases were subsequently used; one for control and the other for the non-control experiments Two knowledgebases were subsequently used: 1) phenotypes for which classification guideline exist, and 2) phenotypes for which classification guideline does not exist. The knowledgebases were fed into the DL-learner Protégé plugin for finding possible relationships patterns between Electrocardiogram (ECG) phenotypes and medications prescribed to a patient. The proposed framework is generic and can be extended to include additional clinical legacy databases. 6.8.3.2 VASST Dataset The Vasopressin and Septic Shock Trial (VASST) was a multicenter, randomized, double-blind, controlled trial evaluating the efficacy of vasopressin versus norepinephrine in 779 patients who were A data-driven approach to learning OWL expressions in clinical health care 150 diagnosed with septic shock according to the current consensus definition[221]. All patients were enrolled within 24 hours of meeting the definition of septic shock. The research ethics boards of all participating institutions approved this trial, and written informed consent was obtained from all patients or their authorized representatives[217]. 6.8.3.3 Clinical problem statement: drug-induced abnormalities in ECG phenotypes Electrocardiogram(ECG also EKG) is defined as interpretation of the electrical activity of the heart over a period of time, as detected by electrodes attached to the surface of the skin and recorded by a device external to the body[222]. Normally several parameters are recorded for each ECG measurements which are discussed below. QT (QTc): QT wave form represents the time from the electrical stimulation (depolarization) of the heart’s pumping chambers (ventricles), to their recharging (repolarization)[223]. It is measured in seconds on the ECG and closely approximates the time from the beginning of the ventricles’ contraction until the end of relaxation[224]. QT interval varies with the heart rate. It shortens as the rate increases and lengthens as the rate decreases. Thus in order to determine if a given QT is appropriate for a given heart rate, the QT is corrected for the heart rate using a simple mathematical formula, and this quantity is called the QTc [223].Normally, the QTc interval varies from 0.35 to 0.46 seconds (350-460 milliseconds). Figure 6.7 provides an example of a normal and prolonged QTc interval. Figure 6.7 Representation of normal (top) and abnormal (bottom) QTc waveform[225] A data-driven approach to learning OWL expressions in clinical health care 151 QRS: The QRS interval represents the time it takes for depolarization of the ventricles. Normal depolarization requires normal function of the right and left bundle branches[226]. QRS may vary based on the size of the heart and heart rate and it ranges from 40 to 120 milliseconds. 
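For reference, the heart-rate correction of the QT interval mentioned above is most commonly the Bazett correction; the text does not state which formula was applied, so this is given only as the conventional choice:

QTc = QT / \sqrt{RR}

where QT is measured in seconds and RR is the interval between successive R waves in seconds (i.e., 60 divided by the heart rate in beats per minute).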
PR: The PR interval of the surface ECG is measured from the onset of atrial depolarization (P wave) to the beginning of ventricular depolarization (QRS complex). Normally, this interval should be between 120 and 200 milliseconds in the adult population[227].  Drug induced ECG changes Medications can affect ECG waveforms[97]. For example “Quinidine” (originally developed as an antimalarial medication) was the first medication to be discovered to change ECG waveforms[223]. Table 6.3 shows some of these drugs with their corresponding therapeutic class. Some of the drug classes shown in Table 6.3 are in turn categorized into smaller sub-classes 42 . 42 For example antiarrhythmic drugs are divided into type 1a antiarrhythmic, type 1c antiarrhythmic and membrane depressant beta-blockers. A data-driven approach to learning OWL expressions in clinical health care 152 Table 6.3 Some drugs associated with QTc prolongation with their therapeutic class(Adapted from [228]). Some of the drug classes shown in Table 6.3 are in turn categorized into smaller sub -classes. For example Antiarrhythmic drugs are divided into Type 1a Antiarrhythmic, Type 1c Antiarrhythmic and membrane depressant beta-blockers. The diagnosis and management of patients with an abnormal ECG readings triggered by specific toxicity can challenge experienced physicians[97]. The problem becomes more evident when we note that on average patients are prescribed a large number of medications at the same time. Patients admitted for to VASST study were taking 15 medications on average. After discussion with our clinical partners([229],[230]), we decided to focus on a portion of the VASST dataset containing ECG wave measurements and medications prescribed to the patients. Our clinical partners anticipated that mining the drug prescription patterns may reveal novel relationships between medications and certain ECG phenotype. This makes the use case interesting both from the perspective of knowledge-engineering and from the perspective of novel clinical diagnostics. A data-driven approach to learning OWL expressions in clinical health care 153 6.8.3.4 Integrating legacy data in existing ontologies The first step in this experiment was to systematically convert the data from a tabular format into a schema represented in RDF/OWL. The network and hierarchical data structures used in ontologies are significantly different from flat tabular relational models used in biomedical databases. Thus the integration of relational databases into ontologies is not generally a straightforward process[231]. Tools and algorithm have been developed to facilitate the process of converting relational data into OWL format ([231–233]) and using different metrics to assess the fidelity of conversion. One important strategic choice for this study was to determine which attributes in the original data should be regarded as complex classes and which should be regarded as atomic classes. We used an approach originally proposed by Nguyen et al[234]. In their model they first take attributes in the tabular data as the basic elements of the ontology. Subsequently, the Boolean (binary) attribute is treated as a complex concept that may be learned and the categorical and numerical variable are treated as atomic concepts. As such, concepts such as Prolonged_QT_Interval are considered as complex (Phenotype) for which the DL-learner is used to learn OWL-DL class expressions and concepts such as QT_Interval are considered atomic concepts (Clinical measurements). 
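To make the atomic/complex distinction concrete, the sketch below (Apache Jena) converts one tabular row into the corresponding RDF: the QT_Interval measurement becomes an individual carrying a literal value, and the Boolean Prolonged_QT_Interval column becomes a class assertion that the learner will later try to explain. The thesis performed this conversion with Perl scripts; the Java shown here, the example.org namespace, and the has_value/has_measurement property names are illustrative stand-ins for the CMO/SIO/HPO terms discussed below.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

// Minimal sketch: one patient row converted into RDF with an atomic measurement
// (QT_Interval) and an asserted complex phenotype (Prolonged_QT_Interval_on_EKG).
// All IRIs and property names are placeholders, not the real ontology terms.
public class RowToRdf {
    public static void main(String[] args) {
        String cardio = "http://example.org/cardio#";
        Model model = ModelFactory.createDefaultModel();

        Resource qtMeasurement = model.createResource(cardio + "qt_measurement_001")
                .addProperty(RDF.type, model.createResource(cardio + "QT_Interval"))
                .addLiteral(model.createProperty(cardio + "has_value"), 512);   // milliseconds

        model.createResource(cardio + "patient_001")
                .addProperty(model.createProperty(cardio + "has_measurement"), qtMeasurement)
                // the Boolean column becomes a class assertion to be learned against
                .addProperty(RDF.type, model.createResource(cardio + "Prolonged_QT_Interval_on_EKG"));

        model.write(System.out, "TURTLE");   // serialize the instance-level fragment
    }
}

Serialized in this shape, the Boolean columns become exactly the named classes for which DL-Learner is later asked to propose defining expressions.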
We used the Clinical Measurement Ontology (CMO)[220] based on its coverage of the relevant terms. The CMO provides the standardized vocabulary necessary to indicate the type of measurement made to assess a phenotype[220]. We explored CMO to search for the concepts used in the experiment and found it to be sufficiently comprehensive in terms of coverage of these concepts; however, there were some minor differences in terminology between the labels in our dataset and the CMO terms, which were manually mapped (for example, CMO contains the term "Timed urine volume" while our clinical dataset used the label "Urine output", and the dataset used acronyms such as HDL). We extended concepts in CMO using properties in SIO, as discussed in previous chapters. CMO was originally developed in the OBO[5] format, so we used the OBO converter to convert it to OWL. For instance, the concept PR_Interval is extended in our ontology (prefix cardio) as follows:
cardio:PR_Interval = cmo:PR_Interval and "sio:has value" some owl:Literal
where the prefixes "cmo" and "sio" denote the CMO and SIO ontologies respectively.
6.8.3.5 Representation of clinical phenotypes
The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease[235]. Terms in the HPO describe phenotypic abnormalities, which makes the HPO especially suitable for this experiment (since the phenotypes for which we aim to find class expressions are abnormal). We explored the OWL version of the HPO for phenotypes and mapped the concepts used in the guidelines to concepts in the HPO. Since class expression learning will pose additional restrictions on these classes, it is reasonable to define them as subclasses of their equivalent classes in the HPO. For instance, for the phenotype "Prolonged QT interval" the following statement is added to the ontology:
cardio:Prolonged_QT_Interval_on_EKG rdfs:subClassOf hpo:Prolonged_QT_Interval_on_EKG
The terms in the cardio ontology can then be extended with further property-restrictions, without altering the original term-definitions in the HPO (this is a "polite" approach to altering third-party ontological definitions, which minimizes the effects on the open-world use of those ontologies).
6.8.3.6 Representation of medications
As discussed previously, NDF-RT was used together with the RxNav and RxNorm APIs to standardize the disparate medication names in the dataset. Following standardization, a unique ID was assigned to each medication, corresponding to the numeric identifiers in the NDF-RT ontology. A total of 363 unique medications were automatically mapped to NDF-RT identifiers. We then removed the medications which occurred fewer than 5 times, resulting in a final total of 39 unique and standardized medications.
6.8.3.7 Knowledgebase generation
Once the T-Box (schema) for the ontology is created using the three ontologies, the instance data are stored in the ontology by creating instances of these classes and placing the data in the appropriate class. We used Perl scripts[184] to convert our data into RDF format. Once an OWL/RDF knowledgebase is assembled, an A-Box consisting of positive and negative examples for the phenotypes of interest is generated. These consist of abnormal and normal phenotypes for the QTc, QRS and PR complex.
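As a minimal sketch of this assembly step (Python and rdflib are used here purely for illustration rather than the Perl scripts cited above; the cardio URIs and the hasMeasurement property are hypothetical, and SIO_000300 is assumed to be the identifier of sio:has value), each patient's measurement can be asserted as an instance carrying a literal value, while the Yes/No phenotype labels accumulate into positive and negative example lists:

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

CARDIO = Namespace("http://example.org/cardio#")          # hypothetical base URI
SIO = Namespace("http://semanticscience.org/resource/")

g = Graph()
g.bind("cardio", CARDIO)
g.bind("sio", SIO)

positives, negatives = [], []

def add_patient(patient_id, pr_interval_ms, prolonged_qt):
    """Assert one patient's PR measurement and record the QTc phenotype label."""
    patient = CARDIO["patient_" + patient_id]
    measurement = CARDIO["pr_measurement_" + patient_id]
    g.add((measurement, RDF.type, CARDIO.PR_Interval))
    g.add((measurement, SIO.SIO_000300, Literal(pr_interval_ms, datatype=XSD.decimal)))  # sio:has value
    g.add((patient, CARDIO.hasMeasurement, measurement))   # hypothetical linking property
    (positives if prolonged_qt else negatives).append(patient)

add_patient("001", 210, prolonged_qt=True)
add_patient("002", 160, prolonged_qt=False)
g.serialize("vasst_abox.ttl", format="turtle")

In the actual experiment, these example lists correspond to the abnormal and normal phenotype groups just described.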
Additionally, the patient ECG dataset has a phenotype classification (Yes/No) for each patient for the following concepts (some of these classes are mutually disjoint OWL classes):
o First-degree AV block (1st AVB)
o Sinus tachycardia
o Atrial Fibrillation
o Right bundle branch block
o Sinus bradycardia
o Ectopic rhythm
The learning problem is, therefore, to find a class expression which closely fits each of the positive examples. In addition to these phenotypes, we generated positive and negative instances for patient outcomes that were defined as "interesting" by the clinicians (e.g. "28-day survival" and "unexpected renal failure" following admittance), even though no clear definitions of these outcomes exist.
6.8.3.8 Adaptation of the DL-Learner plugin for the experiment
One interesting feature of the Class Expression Learner for Ontology Engineering (CELOE) is that the plugin implemented for Protégé is interactive, allowing the knowledge engineer to accept or reject a new axiom. Additionally, DL-Learner is open-source and component-based, and thus can easily be extended to include new learning algorithms. The current version of DL-Learner only includes Random Cluster, Genetic Programming and CELOE[61]. Currently, CELOE is the best algorithm available in DL-Learner for categorical attributes, as demonstrated in [31]. However, the main problem with all of the existing algorithms for ontology learning is that none of them can handle numerical attributes. We demonstrated in the first part of this chapter that the JRip algorithm outperformed similar compatible algorithms; it was capable of generating highly accurate rules mimicking expert classification. Additionally, JRip is capable of handling numerical data and finding decision-boundary thresholds. Thus, we extended the DL-Learner Protégé plugin to include the JRip algorithm to generate rules for numerical attributes. Once the numerical constraint axioms were added to the class definitions, we returned to the CELOE algorithm to generate the remainder of the proposed axioms.
6.8.3.9 Results for classification of phenotypes based on numerical clinical measurements
The classification rules generated for clinical measurements are presented in Appendix E. Overall, the generated rules were highly consistent with the guideline rules used to classify patients (see Appendix E). This experiment demonstrates how we can start from undefined classes in our ontology and use the instance data to find numerical constraints that further define those classes. The results of this first phase of the experiment are mostly of knowledge-engineering interest and have already been discussed; please refer to Appendix E for details.
6.8.3.10 Novel hypotheses generated: prediction results for complex ECG phenotypes
We used DL-Learner to learn general models for abnormal ECG phenotypes. We used 10-fold cross-validation to assess the results and started with the suggested default parameters; we then empirically varied the parameters to obtain rules with accuracy > 65%. Ten rules (covering both equivalent-class and superclass axioms) were learned for some of the classes mentioned above. Some "interesting" rules (indicated by our clinical partners) and those with potential clinical importance are presented below (for the complete set of rules, refer to Appendix E).
We note that, in general, the evaluation of this set of rules is challenging since, as opposed to the control experiment (where a "gold standard" exists a priori), experts are the final arbiters in making a subjective evaluation of the algorithms' success in detecting meaningful (and useful) rules (see discussion).
Atrial Fibrillation (AF)
(PROLONGED_PR and isPrescribed some (DIPHENHYDRAMINE or SODIUM_CHANNEL_BLOCKERS)) accuracy 78.93%
(PROLONGED_PR and isPrescribed some (DIPHENHYDRAMINE or Class_III_ANTIDYSRITHMICS)) accuracy 77.01%
The rules above should be read as: a patient is considered to be suffering from "Atrial Fibrillation" if and only if the patient is suffering from a "Prolonged PR interval" and has been prescribed either the medication Diphenhydramine or a medication from the "Sodium Channel Blockers" class (or the "Class III Antidysrithmics" class for the second rule). Interestingly, the existing ECG literature supports both expressions generated above (for the suggestive correlation between PR prolongation and Atrial Fibrillation refer to [236], and for the relationship with Sodium Channel Blockers and Class III Antidysrithmics refer to [223]); note that not all the phenotypes in this experiment are independent, as suggested by the learned rules. Though the rules above do not suggest definite causal relationships between phenotypes and medications, they can be used as attention-focusing and decision-support tools in a clinical context.
First-degree AV block (1st AVB)
(PROLONGED_PR and isPrescribed some (DILTIAZEM or Beta_blockers)) accuracy 78.87%
(ABNORMAL_ECG_PHENOTYPE and isPrescribed some Cardiovascular_medication) accuracy 78.87%
The second rule above is in fact a generalization of the first, since PROLONGED_PR is a subclass of ABNORMAL_ECG_PHENOTYPE and Beta_blockers is a subclass of "Cardiovascular medications". In such cases the more specific definition is usually the more informative one (here the accuracies are the same). The implication of beta-blockers and calcium channel blockers (Diltiazem) in 1st AVB is supported by the literature, as both can alter cardiac conduction times [236].
Sinus Bradycardia
(ABNORMAL_ECG_PHENOTYPE and isPrescribed min 2 Antipsychotic_medication) accuracy 67.68%
(ABNORMAL_ECG_PHENOTYPE and isPrescribed some (Antipsychotic_medication or Antidepressant_medication)) accuracy 67.68%
Antipsychotic medications are listed as potentially causing sinus bradycardia[189]; however, they are listed alongside other groups of medications. One possible explanation of this observation is the small number of positive examples (only 3) in the data for this class. Additionally, antipsychotic and antidepressant medications are frequently prescribed together, which can bias the rule-learning algorithm. We plan to conduct further investigations in collaboration with our clinical partners to determine any specific relationships.
Prolonged QTc
(ABNORMAL_ECG_PHENOTYPE and isPrescribed min 6 Cardiovascular_Medications and isPrescribed min 2 Calcium_Channel_Blockers) accuracy 80.0%
The above rule hints at the possibility that patients with an abnormal QTc are, on average, taking a large number of cardiovascular medications, including calcium channel blockers. From this rule alone, the causality of this relationship cannot be established with certainty; however, it may serve as a good starting point for further investigation.
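To make the reported accuracies concrete, the following minimal sketch (hypothetical class and property URIs; accuracy here is simply the fraction of correctly covered positive and correctly excluded negative examples, and no DL reasoning is performed) shows how a candidate expression such as the first Atrial Fibrillation rule could be checked against the instance data with a SPARQL ASK query:

from rdflib import Graph, Namespace

CARDIO = Namespace("http://example.org/cardio#")   # hypothetical base URI

g = Graph().parse("vasst_abox.ttl", format="turtle")

# Does a given patient satisfy:
#   PROLONGED_PR and isPrescribed some (DIPHENHYDRAMINE or SODIUM_CHANNEL_BLOCKERS) ?
COVERS = """
ASK {
  ?patient a cardio:PROLONGED_PR ;
           cardio:isPrescribed ?drug .
  { ?drug a cardio:DIPHENHYDRAMINE } UNION { ?drug a cardio:SODIUM_CHANNEL_BLOCKERS }
}
"""

def covered(patient_uri):
    result = g.query(COVERS, initNs={"cardio": CARDIO}, initBindings={"patient": patient_uri})
    return bool(result.askAnswer)

def accuracy(positives, negatives):
    # Correctly covered positives plus correctly excluded negatives, over all examples.
    correct = sum(covered(p) for p in positives) + sum(not covered(n) for n in negatives)
    return correct / (len(positives) + len(negatives))

A DL reasoner (or DL-Learner's own evaluation) would additionally count patients whose prescriptions are only asserted as members of more specific drug classes, which this purely syntactic query would miss.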
Overall, the rules generated in the previous experiment score consistently well on standard machine-learning evaluation metrics such as precision, recall and F-measure (Appendix E), which can be considered a success from the perspective of knowledge engineering.
6.9 Limitations
We demonstrated that, by and large, our framework is applicable in a real-life clinical context. However, we can envisage some limitations to our study, as discussed below. One major limitation of the system is its dependency on the availability of instance data in the knowledgebase, in addition to the original ontological model. The creation of a high-quality knowledgebase (especially from legacy patient data) is arguably one of the most challenging tasks. As such, the derivation of efficient methods to generate high-quality, relevant and standardized knowledgebases is a high priority for future research. This study demonstrated how legacy patient data can be mined to discover interesting logical expressions useful for explicitly defining patient phenotypes. These phenotypes can broadly be divided into two categories: 1) those for which definitions exist in the form of clinical guidelines, and 2) those for which formal definitions are lacking, or where the cluster of symptoms has not previously been identified as a named clinical phenotype. The evaluation of OWL expressions for the second category is particularly challenging since, in the absence of any "gold standard", it relies entirely on expert knowledge in making a subjective evaluation of the learning algorithms' success in detecting something "meaningful".
6.10 Future work
6.10.1 Subjective evaluation
The rules generated in the previous experiment score consistently well on standard machine-learning evaluation metrics such as precision, recall and F-measure, which can be considered a success from the perspective of knowledge engineering; however, as discussed, the focus of this experiment was to generate rules/classifications that are of potential importance to clinicians. As such, it is important that we were able to support the classification rules suggested by the system using the existing literature on ECG. It remains unclear, however, how we might objectively and quantitatively evaluate the usefulness of the resulting system from the perspective of healthcare professionals. We are conducting follow-up experiments with the VASST study group to answer this question. More specifically, we seek scenarios in which the generated hypotheses are unexpected (or even surprising) to researchers. Subsequent clinical experiments will then be required to validate or refute such tentative hypotheses. If successful, this will demonstrate that this kind of machine learning is a novel way to generate new and plausible clinical hypotheses.
6.10.2 Combining DL-Learner and reasoners
One interesting extension to the current framework is the possibility of using an iterative approach of reasoning and learning to define logical combinations (usually intersections) of classes. Consider an example where a researcher is interested in learning OWL class expressions for patients who have both a prolonged PR interval and a prolonged QRS complex. In such cases a reasoner would first be used to find the data-instances that are members of both classes, and subsequently the DL-Learner would be applied to examine those individuals.
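A minimal sketch of this reasoner-then-learner step is shown below, using owlready2 for the reasoning side; the ontology file, the class names and the learn_expression call are hypothetical placeholders (DL-Learner itself is a Java tool that would in practice be driven through its own configuration), so this only illustrates the flow of the iteration:

from owlready2 import get_ontology, sync_reasoner

onto = get_ontology("file://cardio_kb.owl").load()   # hypothetical knowledgebase file

# Step 1: run a DL reasoner (HermiT by default) to materialize class membership,
# then take the intersection of the two phenotype classes of interest.
with onto:
    sync_reasoner()
combined = set(onto.Prolonged_PR_Interval.instances()) & set(onto.Prolonged_QRS_Complex.instances())

# Step 2: hand the intersection members to the class-expression learner as positive
# examples; the remaining patients become the negatives.
positives = combined
negatives = set(onto.Patient.instances()) - combined
# candidate = learn_expression(positives, negatives)   # stand-in for a call out to DL-Learner/CELOE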
If the DL-Learner is able to find an expression (with acceptable accuracy) for the combined class, those classes can be merged into a new, more complex phenotypic concept in the ontology. This process more closely simulates expert classification and, at the same time, enables the iterative enrichment of ontologies.
6.11 Conclusion
Classifications of patient phenotypes are most often determined through direct observations by clinicians, in conjunction with published standards and guidelines, where the clinical expert is the ultimate arbiter of patients' classifications. This "manual" classification approach is generally problematic in clinical research, since the basis for the clinician's decision is not (always) transparent, resulting in non-reproducible science. In previous experiments, we addressed this problem and demonstrated that, by using a combination of semantically-explicit data, logical reasoning and analytical services, rigorous and reproducible clinical classifications with accuracy approaching that of an expert can be achieved. However, in those experiments the class expressions for the ontologies were generated manually, which is error-prone and time-consuming. Thus, in this chapter we proposed a data-driven framework to mitigate the well-known problem of ontology construction and enrichment. We utilized the available frameworks for learning OWL-DL class expressions to answer specific clinical questions of substantial importance. We evaluated the effectiveness of our framework in its ability to automatically (or semi-automatically) generate OWL class expressions based on real data, in the domain of the effects of medications on electrocardiogram (ECG) phenotypes. Due to the lack of a sufficiently large dataset, this pilot study was not conclusive; however, we provided evidence that, using our proposed framework, both prior knowledge and potentially novel hypotheses can be detected and presented for expert evaluation, where experts might then decide to conduct follow-up studies to experimentally validate these hypotheses. In future work we will collaborate with our clinical partners to refine the promising hypotheses and design experiments to evaluate them.
7 Conclusion and future work
The diverse topics discussed in this dissertation address different aspects of using Semantic Web technologies to formally represent clinical classifications. In this chapter, we first summarize the research work, then highlight the main results and novel observations. Subsequently, we present future research directions drawing on the work in each chapter. Finally, we present an overall commentary and a perspective on where such research could lead, and what impact it could have.
7.1 Summary
7.1.1 Chapter 3: Measurement unit conflicts in clinical data
In chapter 3, we proposed a generic approach to the perennial problem of the formal representation of measurement units in healthcare. The problem is likely to be exacerbated over time, since the massive amounts of data now being generated (e.g. through high-throughput studies, and mobile and static sensors) have increased the need to integrate data from multiple sources. To enable meaningful integration, comparison and interpretation of quantitative data, the first step is to ensure that quantitative data are represented in appropriate measurement-units.
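As a toy illustration of why this matters (this is only a minimal, hard-coded sketch, not the SADI-based framework developed in chapter 3; the conversion factor shown is the standard one for glucose), consider reconciling serum glucose values reported by two sites in mg/dL and mmol/L before comparing them:

# Minimal illustration of a measurement-unit conflict and its resolution.
# For glucose, 1 mmol/L corresponds to about 18.016 mg/dL (molar mass ~180.16 g/mol).
MG_DL_PER_MMOL_L = 18.016

def glucose_to_mmol_per_l(value, unit):
    """Normalize a glucose measurement to mmol/L, flagging unrecognized units."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return value / MG_DL_PER_MMOL_L
    raise ValueError("Unrecognized unit: " + unit)

site_a = (99.0, "mg/dL")    # looks very different from site_b until the units are reconciled
site_b = (5.5, "mmol/L")
print(glucose_to_mmol_per_l(*site_a), glucose_to_mmol_per_l(*site_b))   # ~5.49 vs 5.5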
Thus, in chapter 3 we proposed a generic framework to automatically detect when an integrated dataset contains unit conflicts, and to automatically resolve these conflicts. We exploited a combination of semantic standards and novel detection frameworks to demonstrate that measurement unit conflicts in clinical data can be automatically resolved by machines.
7.1.2 Chapter 4: Formal classifications of patient phenotype
In chapter 4 we addressed the problem of patient phenotype classification, which is ubiquitous in biomedical research, healthcare and public health. The problem stems from the fact that the majority of clinical classification systems are not structured for automated classification and, similarly, clinical data is generally not represented in a form that is suitable for automated interpretation. As such, the classifications in clinical data are by and large not reproducible (neither automatically nor, in some cases, even manually). In chapter 4 we applied Semantic Web technologies to enable the automated, transparent interpretation of clinical data for use in high-throughput research environments. We demonstrated that the combination of semantically-explicit data, logically rigorous models of clinical guidelines, and publicly-accessible Semantic Web Services can be used to execute automated, rigorous and reproducible clinical classifications with an accuracy approaching that of an expert.
The study had several findings. First, we showed how legacy datasets would be of benefit to researchers if they were published on the Semantic Web. We demonstrated a workable path for the conversion and publication of these datasets that provided advantages beyond simply making the data available as "triples": it also made the data semantically transparent, such that it could be re-analyzed by third-party researchers using their own classification frameworks. Second, the majority of ontologies available in the life sciences to date are structured as class hierarchies, where the labels of each class are largely used to standardize annotations. The ability to logically reason over these labels is quite limited, thus inhibiting their use for the automated annotation and classification of data. Nevertheless, these ontologies are increasingly comprehensive and reflect expert consensus on the concepts relevant to a given domain. In chapter 4, we proposed and demonstrated a path for extending an existing ontology such that it could be utilized by DL reasoners to dynamically classify and interpret datasets - a process that is currently done largely by experts. Third, we demonstrated that clinical phenotype classification systems could be modeled in the OWL language by taking advantage of the rich, axiomatic structure of OWL-DL ontologies and a variety of analytical Web Services. We showed how this combination of ontologies and services can be used to make clinical data analyses both more transparent and more automated. Finally, we showed that individual clinicians deviate from established clinical guidelines at every layer of an analysis, and this demonstrates the need for a formal, yet personalized, clinical interpretation framework to ensure transparency and reproducibility. We demonstrated that this can be achieved by creating and publishing "personalized" OWL ontologies.
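A minimal sketch of what such a "personalized" ontology might look like is given below (the IRI, class names, property and thresholds are all hypothetical and are not the actual Framingham models of chapter 4); two clinicians encode different systolic blood-pressure cut-offs as their own OWL class definitions, and a reasoner can then classify the same patient instances under either view, making the basis of any disagreement explicit:

from owlready2 import (Thing, DataProperty, FunctionalProperty,
                       ConstrainedDatatype, get_ontology)

onto = get_ontology("http://example.org/personalized-cardio.owl")   # hypothetical IRI

with onto:
    class Patient(Thing): pass

    class has_systolic_bp(DataProperty, FunctionalProperty):
        domain = [Patient]
        range = [float]

    # Clinician A's personalized view: hypertensive at a systolic pressure of 140 mmHg or above.
    class Hypertensive_ClinicianA(Patient):
        equivalent_to = [Patient &
                         has_systolic_bp.some(ConstrainedDatatype(float, min_inclusive=140.0))]

    # Clinician B applies a stricter personal cut-off of 130 mmHg.
    class Hypertensive_ClinicianB(Patient):
        equivalent_to = [Patient &
                         has_systolic_bp.some(ConstrainedDatatype(float, min_inclusive=130.0))]

onto.save(file="personalized-cardio.owl")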
However, in some cases we demonstrated that clinical phenotype classification (in our case, for Framingham risk groups) could not be accurately modeled to mimic the expert's annotations using OWL-DL; our inability, in some cases, to accurately reproduce the interpretation of the expert post facto, even after manually re-modeling the guidelines, shows the danger of not capturing these personal, expert perspectives in some formal framework such as OWL at the time the experiment is being run. In discussions with the clinician, we learned that the patients were under varying regimes of pharmaceutical blood pressure and diabetes treatment affecting their risk assessment; this discussion motivated the research in chapter 5.
7.1.3 Chapter 5: Linking public drug ontologies to legacy patient data
In chapter 5 we proposed, and demonstrated, a simple framework for formalizing patient drug data and connecting it to a public knowledgebase of drug/phenotype interactions. The study was originally motivated by the inability of automatic models to mimic the expert's classifications for complex phenotypes such as Framingham risk groups. Hence, our primary goal was to attempt to successfully automate the classification of raw patient clinical data into Framingham risk categories, using semantic models of Framingham risk scores together with semantic models of a patient's prescribed medicines and semantic models of the pharmacological effects of each drug. This involved determining the treatment statuses of patients for several chronic conditions (e.g. "under treatment for hypertension") based on the pharmacological effect of their drug regime, and using that information in our logical models to interpret when, for example, a patient's raw blood pressure record might be misleading our risk classifier. We then explored two other facets that became apparent to us once this infrastructure was in place. First, the possibility of detecting, in near-real-time, adverse drug events such as dangerous drug interactions by dynamically pulling in public drug/drug interaction data, or detecting when the administration of a drug (perhaps in combination with other drugs) is not having the expected phenotypic effect. Second, we explored the knowledgebase itself and described some idiosyncrasies in that resource that interfered with our early attempts to accomplish these goals.
This study was of both methodological and practical interest. The practical objective of this study was to help clinicians and specialists more reliably classify patients for different cardiovascular conditions; it also proposed a generic and easily implementable framework to help physicians avoid prescribing contraindicated medications. From the methodological perspective, we presented a generic and flexible framework to transform the prescription medication data in legacy datasets into a format suitable for interpretation and integration. Our proposed framework enabled us to dynamically link the legacy patient data to existing drug ontologies, together with a number of generic Semantic Web services, to perform complex reasoning tasks.
7.1.4 Chapter 6: Data-driven approaches for ontology learning
In the previous chapters, the task of ontology construction was largely undertaken manually, and we then demonstrated the utility of these logical models in both clinical diagnosis and in the transparency and reproducibility of clinical research.
If the outcomes of our research are to be generally useful, however, in a domain where classifications and definitions are constantly changing and there are no resident knowledge-engineers, the task of knowledge capture and representation must become at least partially automated. As such, in chapter 6 we focused on data-driven approaches to mitigate the well-known problem of ontology construction and enrichment. We discussed data-driven methodologies for automatic and semi-automatic ontology building and evaluation. These methodologies were developed to leverage machine-learning approaches for the creation, extension and evaluation of ontologies. More specifically, we focused on creating class definitions for clinical phenotypes, which constitutes one of the most demanding aspects of ontology engineering and in most cases requires considerable domain training. We then utilized the available methodologies for learning OWL-DL class expressions for specific scenarios in clinical practice. We evaluated the effectiveness of our framework in its ability to automatically (or semi-automatically) generate OWL class expressions based on real data for electrocardiogram (ECG) phenotypes.
7.2 Limitations and future work
Chapter 3: Different domains use only a limited variety of measurement-units in practice; i.e., there are many units that are theoretically possible but never used. This adds unnecessary "noise" to the task of automatically predicting which units are being used by a given study in which they are not specified. We plan to conduct an empirical study to precisely outline the recommended and common units specific to the clinical and biomedical domains. We predict that this will significantly reduce errors and increase computational performance. Secondly, we note that a large number of measurement units in clinical practice involve more complex unit patterns than the ones we modeled. For instance, the majority of units used for drug dosage and clearance include temporal elements (e.g., mg/kg/hour for drug dosage) that were not modeled in this study. To the best of our knowledge, such patterns have not been modeled in any existing ontologies. As such, the framework can be extended to include more complex patterns, together with temporal units and their conversion. Finally, the framework can be extended to include real-life datasets from multiple centers, to evaluate the usability and scalability of our approach in more complex biomedical scenarios.
Chapter 4: Future experiments should attempt to move beyond modeling well-established standards (classifications for which a clear and unambiguous guideline exists; e.g. Framingham risk groups) and begin to ontologically model complex research classifications and hypotheses. Using a similar approach, with published research as the gold standard, experiments should evaluate the ability to model and automatically evaluate research hypotheses in silico and compare the conclusions with those drawn by the experts[237]. We predict that a large subset of hypotheses can be modeled using exclusively OWL-DL axioms or their conjunctions. Hypotheses encoded in OWL would have a number of significant advantages over hypotheses represented in natural language; they would be unambiguous, extensible by third parties, and could be tested computationally. These features would make such OWL constructs an excellent platform for scientific discourse and disagreement.
However, given that OWL reasoners are currently only able to compute inferences for a single, locally stored dataset, the testing and comparison of such hypotheses would be constrained by the human labour of gathering and integrating data from many sources. We demonstrated that, using SADI/SHARE, these hurdles can be significantly mitigated. Thus we anticipate that future experiments should undertake to evaluate the ability to model and automatically evaluate research hypotheses in silico and compare the conclusions with those drawn by the experts.
Chapter 5: Due to the limited information available in the dataset used in the study conducted in chapter 5, we were unable to incorporate the complex dosage information and temporal (longitudinal) reasoning that are required to address more complex and dynamic questions related to treatment regimes and patient history. The project can be extended by incorporating temporal and dosage information attached to patient records and using the framework to define more complex patient phenotypes. Another limitation of this study was a number of inaccuracies in information extraction from the patient records. The design of an optimal system to cope with noisy drug data, and the use of other statistical and logical approaches to increase the accuracy of the mapping, would be an interesting extension to the current system. Finally, the framework could be extended to enable more complex reasoning analyses with respect to medications, for example to automatically determine the class of medication(s) (e.g. antipsychotics) prescribed to patients. Automatically mapping patient data to the precise class of medications can provide more reasoning power that can potentially improve decision-support systems.
Chapter 6: Future experiments include evaluating the learning algorithms on a variety of clinical scenarios and using them to generate novel hypotheses. From the perspective of the clinical sciences, the suggested class expressions discovered by OntoLoki and DL-Learner are, effectively, hypotheses automatically generated by the system, and these can and should be validated experimentally. We are currently collaborating with clinical researchers on a few interesting projects. One of these involves medication ranking, where we plan to collect multiple datasets from several geographically dispersed centers targeted for demographic diversity. We plan to extend the framework to model data and phenotypes of a more complex nature. For instance, in our most recent study we are evaluating the possible effects of the prescription dosage of vasopressors (see http://en.wikipedia.org/wiki/Antihypotensive_agent) on primary patient outcomes. Finally, an interesting area for future extension is the incorporation of uncertainty in the knowledgebases. For example, Pronto[238] is a probabilistic description logic reasoner capable of performing probabilistic reasoning on the Semantic Web. It can handle uncertainty in terminological and DL axioms, allowing the expression of uncertainty and incomplete knowledge - a situation which is the norm, rather than the exception, in the biomedical sciences.
7.3 Theoretical perspectives and methodologies
The major theoretical contributions of this thesis are twofold. From the perspective of clinical research, the majority of existing clinical research-oriented ontologies are built and refined by collaborative groups of experts, which requires a considerable investment of resources.
Moreover, they are built on the premise that there is an "objective truth" about clinical knowledge, and that this truth is derived by the consensus of this small group of experts. In reality, it is apparent that clinical knowledge is too complex to be captured and represented in a single universally-accepted model. As such, existing ontological frameworks cannot provide the flexibility required by clinical science, where knowledge is often incomplete, contextual and subjective. In this thesis we demonstrated that ontologies can be used to formally represent these individualized perspectives in a transparent manner. Envisioning ontologies as personalized and extensible is a major shift from the commonly-held view of ontologies, and allows different and disparate world views to coexist, enabling the process of evaluating conjectures about the data based on distributed expert clinical knowledge. This ability to represent individualized knowledge can be exploited to support the explicit construction, sharing, comparison and iterative modification of clinical hypotheses, which are a key component of the scientific method.
From the perspective of knowledge representation theory, we note that existing ontologies, by and large, lack formal definitions of their classes. The ability to logically reason over these ontologies is rather limited, thus inhibiting their use for the automated classification of data. Given that an increasingly large number of legacy databases are becoming available, methods for the automated acquisition of class definitions from data are required. In this thesis, we demonstrated that the process of discovering OWL expressions for class definitions in ontologies can be modeled as solving supervised classification problems in machine learning. We then demonstrated that subgroups of existing machine-learning techniques are applicable to learning class expressions in OWL. Finally, we empirically demonstrated that concrete biomedical learning problems can be solved with existing frameworks, using description logics as the knowledge-representation formalism and current machine-learning techniques.
7.4 Overall analysis and an outlook on the future
The complexity of biomedical research is increasing, not only due to the increased volumes of molecular data coming from high-throughput technologies, but also because of the increased intersection of molecular and clinical research, and the increasing pervasiveness of data-gathering devices and sensors. In such a complex, interdisciplinary environment, researchers are required to combine data from multiple heterogeneous sources and, somehow, pull it all together in the context of prior knowledge and expertise in order to draw a rapid and accurate conclusion. Unfortunately, making combined use of data sets and software packages often requires an insurmountable amount of work to reconcile incompatible formats, schemas, and interfaces. These incompatibilities exist because the various databases and programs are developed independently by different research laboratories, and without reference to a shared set of standards. Semantic Web technologies have proven their great potential in addressing the problems mentioned above; however, in practice, the adoption of Semantic Web technologies in healthcare has proved to be a major challenge. As Hyman G. Rickover[239] points out: "Good ideas are not adopted automatically.
They must be driven into practice with courageous patience. Once implemented they can be easily overturned or subverted through apathy or lack of follow-up, so a continuous effort is required."[239] Although the Semantic Web has the potential to be truly transformative in healthcare and the biological sciences, and despite substantial time and effort by both computer scientists and healthcare professionals, there is still a dearth of large-scale examples of how such knowledge can be used in practical scenarios. Ironically, even after recent national and international efforts to encourage the use of systems such as Electronic Health Records (EHRs), and other grounds on which health-related semantic technologies can flourish, progress continues to be slow. Overall, the adoption and usage of Semantic Web and related technologies in healthcare has been mostly limited to academic and research centers.
We argue that to fully leverage Semantic Web technologies to improve clinical and biomedical research, a number of hurdles must be overcome. First, we should note that the adoption of Semantic Web-related technologies in healthcare is apparently as much a social as a technological challenge[27]. While the technical problem-solving capacity of Semantic Web-based frameworks has been repeatedly demonstrated, particularly in the healthcare space (e.g. [4]), their adoption, even within academia, has been limited. As such, the challenge would appear to be motivational, rather than technical. There are many motivations that should draw people toward the adoption of Semantic technologies: the rapidly growing amounts of biological data and knowledge, harsh time constraints on healthcare professionals, the minimization of human error, the reproducibility and transparency of clinical research, and the promise of "personalized medicine" enabled by genomics are just a few. These, alone, are compelling reasons to adopt Semantic technologies in the biomedical sciences, in order to dramatically improve overall efficacy in healthcare[46]. The limited adoption might, therefore, be the result of a gap (or a chasm!) between what has been demonstrated by the Semantic Web community and the needs, expectations, and motivations of stakeholders (physicians, biologists, academics and hospitals)[240]. For example, in most cases, the use of Semantic Web technologies by physicians not only requires extra time and effort from the physicians (especially those not trained in Semantic Web technologies), adding substantial costs to healthcare, but the immediate benefits are also not apparent, either to the physicians or to the patients.
Another major problem is the gap between the academic and industrial environments in the field of bioinformatics. The problem is rooted in the different immediate needs and motivations of the two communities: the academic community is research-focused, whereas industry is mostly product-driven. This is not surprising in itself; however, as a side effect it results in the development of tools (by both communities) that are not easy to interact with, are insufficiently stable, insufficiently rigorous, insufficiently standards-compliant, and/or require significant effort to understand on the part of healthcare professionals. As such, given the generally tenuous acceptance of new technologies by medical practitioners and researchers, both target communities tend towards existing systems that exhibit reasonable performance and with which they are already comfortable.
We are poised at a point where the need to accelerate efforts for Semantic Web adoption is great. As we seek to develop methodologies for sharing and integrating biomedical knowledge, we also need to adopt more creative approaches to learn how best to incorporate Semantic technologies so as to maximize usability and acceptance by end-users. This opens up the process to more industry-driven initiatives to lower the existing barriers. A glimpse of the future can be both exciting and challenging: exciting when we witness how emerging technologies can address challenging biological problems, but challenging when we recognize that these technologies must be applied effectively and efficiently to deliver cost-effective healthcare. The question will be how we can incorporate these technologies so that future systems are designed and implemented to optimize their usability for individual health practitioners.
This thesis was an effort to develop a common ground for addressing some of the challenges mentioned here. We motivated the need for the efficient adoption of Semantic Web techniques in the field of clinical informatics. In designing and conducting the experiments, special attention was paid to providing results that are of interest to both the Semantic Web and clinical research communities, using real data and challenging clinical case studies. We specifically investigated approaches that facilitate the construction of semantic models of clinical phenotypes to support Web-embedded research. In accordance with the multidisciplinary nature of this thesis, our long-term vision is twofold. From the perspective of the clinical sciences, we hope to broaden the scope of research by incorporating multiple data sources and making use of the large amount of knowledge available on the Semantic Web. From the perspective of the Semantic Web, we wish to mitigate the knowledge-acquisition barrier by facilitating the creation and maintenance of expressive knowledgebases. We hope that this thesis provides a worthwhile contribution towards achieving these goals.
Bibliography
[1] "OWL Web Ontology Language Overview." [Online]. Available: http://www.w3.org/TR/owl-features/. [Accessed: 29-Sep-2012].
[2] "OWL Web Ontology Language Reference." [Online]. Available: http://www.w3.org/TR/owl-ref/#Sublanguages. [Accessed: 11-Oct-2012].
[3] "World Wide Web Consortium (W3C)." [Online]. Available: http://www.w3.org/. [Accessed: 28-Sep-2012].
[4] "Semantic Web Health Care and Life Sciences (HCLS) Interest Group." [Online]. Available: http://www.w3.org/blog/hcls/. [Accessed: 11-Oct-2012].
[5] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, N. Leontis, P. Rocca-Serra, A. Ruttenberg, S.-A. Sansone, R. H. Scheuermann, N. Shah, P. L. Whetzel, and S. Lewis, "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration," Nature biotechnology, vol. 25, no. 11, pp. 1251–5, Nov. 2007.
[6] "What is clinical informatics? - SCCI - Stanford University School of Medicine." [Online]. Available: https://clinicalinformatics.stanford.edu/background.html. [Accessed: 26-Sep-2012].
[7] "Risk Score Profiles Framingham Heart Study." [Online]. Available: http://www.framinghamheartstudy.org/risk/index.html. [Accessed: 11-Oct-2012].
[8] R. S. Kutty, R. S. Kutty, and S. K. Nair, Surgery for coronary artery disease, vol. 26, no. 12. 2008, pp. 501–509.
[9] "RDF - Semantic Web Standards." [Online].
Available: http://www.w3.org/RDF/. [Accessed: 28-Sep-2012]. [10] A. G. Castro, P. Rocca-Serra, R. Stevens, C. Taylor, K. Nashar, M. A. Ragan, and S.-A. Sansone, “The use of concept maps during knowledge elicitation in ontology development processes--the nutrigenomics use case.,” BMC bioinformatics, vol. 7, p. 267, Jan. 2006. [11] M. D. Wilkinson, B. Vandervalk, and L. McCarthy, “The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation.,” Journal of biomedical semantics, vol. 2, no. 1, p. 8, Jan. 2011. [12] B. P. Vandervalk, E. L. Mccarthy, and M. D. Wilkinson, The Semantic Web, vol. 5926. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 367–369. [13] “About the Human Genome Project.” [Online]. Available: http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml. [Accessed: 26- Sep-2012]. Bibliography 171 [14] T. H. Nelson, “Complex information processing,” in Proceedings of the 1965 20th national conference on -, 1965, pp. 84–100. [15] B. M. Good, “Strategies for amassing, characterizing, and applying third-party metadata in bioinformatics-PhD Thesis” University of British Columbia, 2009. [16] “URIs, URLs, and URNs: Clarifications and Recommendations 1.0.” [Online]. Available: http://www.w3.org/TR/uri-clarification/. [Accessed: 16-Oct-2012]. [17] H. Gao, L. Lim, W. Wang, C. Li, and L. Chen, Eds., Web-age information management, vol. 7418. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. [18] “Plenary talk by Tim BL at WWWF94: Overview.” [Online]. Available: http://www.w3.org/Talks/WWW94Tim/Overview.html. [Accessed: 06-Nov-2012]. [19] A. M. Collins and E. F. Loftus, “A spreading-activation theory of semantic processing.” [20] O. Bodenreider, “Biomedical ontologies in action: role in knowledge management, data integration and decision support.,” Yearbook of medical informatics, pp. 67–79, Jan. 2008. [21] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web: Scientific American,” Scientific American, 2001. [22] S. Samadian, “Supporting End-user Synthesis of Description Logic Constructs to Aid Hypothesis Generation and Evaluation in Clinical Informatics-PhD proposal” 2010. [23] A. Aamodt and M. Nygård, “Different roles and mutual dependencies of data, information, and knowledge — An AI perspective on their integration,” Data and knowledge engineering, vol. 16, no. 3, pp. 191–222, Sep. 1995. [24] “OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax.” [Online]. Available: http://www.w3.org/TR/owl2-syntax/. [Accessed: 16-Oct-2012]. [25] T. Sundstrom, Mathematical Reasoning: Writing and Proof (2nd Edition). Pearson, 2006, p. 544. [26] B. Vandervalk, “CardioSHARE: Automated Integration and Analysis of Data from Multiple Sources.” [27] B. M. Good and M. D. Wilkinson, “The Life Sciences Semantic Web is full of creeps!,” Briefings in bioinformatics, vol. 7, no. 3, pp. 275–86, Sep. 2006. [28] “Notation 3 Logic.” [Online]. Available: http://www.w3.org/DesignIssues/Notation3.html. [Accessed: 20-Nov-2012]. [29] “RDF Primer.” [Online]. Available: http://www.w3.org/TR/rdf-primer/. [Accessed: 20-Nov- 2012]. Bibliography 172 [30] “SPARQL Protocol for RDF.” [Online]. Available: http://www.w3.org/TR/rdf-sparql- protocol/. [Accessed: 29-Sep-2012]. [31] J. Lehmann, “Learning OWL Class Expressions - PhD Thesis,” Learning, 2010. [32] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. 
O’Donovan, N. Redaschi, and L.-S. L. Yeh, “The Universal Protein Resource (UniProt).,” Nucleic acids research, vol. 33, no. Database issue, pp. D154–9, Jan. 2005. [33] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and J. Morissette, “Bio2RDF: towards a mashup to build bioinformatics knowledge systems.,” Journal of biomedical informatics, vol. 41, no. 5, pp. 706–16, Oct. 2008. [34] M. Samwald, A. Jentzsch, C. Bouton, C. S. Kallesøe, E. Willighagen, J. Hajagos, M. S. Marshall, E. Prud’hommeaux, O. Hassenzadeh, E. Pichler, and S. Stephens, “Linked open drug data for pharmaceutical research and development.,” Journal of cheminformatics, vol. 3, no. 1, p. 19, Jan. 2011. [35] B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O’Donovan, I. Phan, S. Pilbout, and M. Schneider, “The SWISS- PROT protein knowledgebase and its supplement TrEMBL in 2003.,” Nucleic acids research, vol. 31, no. 1, pp. 365–70, Jan. 2003. [36] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori, “The KEGG resource for deciphering the genome.,” Nucleic acids research, vol. 32, no. Database issue, pp. D277–80, Jan. 2004. [37] J. L. Sussman, D. Lin, J. Jiang, N. O. Manning, J. Prilusky, O. Ritter, and E. E. Abola, “Protein Data Bank (PDB): Database of Three-Dimensional Structural Information of Biological Macromolecules,” Acta Crystallographica Section D Biological Crystallography , vol. 54, no. 6, pp. 1078–1084, Nov. 1998. [38] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, and J. Woolsey, “DrugBank: a comprehensive resource for in silico drug discovery and exploration.,” Nucleic acids research, vol. 34, no. Database issue, pp. D668–72, Jan. 2006. [39] “DailyMed.” [Online]. Available: http://dailymed.nlm.nih.gov/dailymed/about.cfm. [Accessed: 20-Nov-2012]. [40] “SIDER Side Effect Resource.” [Online]. Available: http://sideeffects.embl.de/. [Accessed: 20-Nov-2012]. [41] F. Lillehagen and J. Krogstie, Active Knowledge Modeling of Enterprises (Google eBook). Springer, 2008, p. 436. [42] B. P. Vandervalk, E. L. McCarthy, and M. D. Wilkinson, “Moby and Moby 2: creatures of the deep (web).,” Briefings in bioinformatics, vol. 10, no. 2, pp. 114–28, Mar. 2009. Bibliography 173 [43] T. R. Gruber, “Toward principles for the design of ontologies used for knowledge sharing?,” International Journal of Human-Computer Studies, vol. 43, no. 5–6, pp. 907–928, Nov. 1995. [44] N. Noy and D. Mcguinness, “Ontology Development 101: A Guide to Creating Your First Ontology,” no. KSL-01–05, 2001. [45] L. M. Fagan, Medical Informatics: Computer Applications in Health Care and Biomedicine (Health Informatics). Springer, 2003, p. 854. [46] Clinical Decision Support: The Road Ahead. Academic Press, 2006, p. 544. [47] Mycin: a rule-based computer program for advising physicians regarding antimicrobial therapy selection. 1974. [48] “Shortliffe (1976) Computer-based medical consultations, MYCIN.” [Online]. Available: http://www.getcited.org/pub/101600773. [Accessed: 01-Oct-2012]. [49] “BibSonomy :: publication :: EMYCIN: A Knowledge Engineer’s Tool for Constructing Rule- Based Expert Systems.” [Online]. Available: http://www.bibsonomy.org/bibtex/53c846985f956a320faf600afcab689b. [Accessed: 01-Oct- 2012]. [50] W. J. Clancey, Knowledge-Based Tutoring: The GUIDON Program (MIT Press Series in Artificial Intelligence). The MIT Press, 1987, p. 404. [51] “Knowledge-based systems in artificial intelligence.” [Online]. 
Available: http://books.google.com/books/about/Knowledge_based_systems_in_artificial_in.html?id=Mp VQAAAAMAAJ. [Accessed: 01-Oct-2012]. [52] L. Zhang, Y. Ma, and G. Wang, “An Extended Hybrid Ontology Approach to Data Integration,” in 2009 2nd International Conference on Biomedical Engineering and Informatics, 2009, pp. 1–4. [53] N. Noy and D. {mcguinness}, “Ontology Development 101: A Guide to Creating Your First Ontology,” 2001. [54] E. Simperl, “Reusing ontologies on the Semantic Web: A feasibility study,” Data & Knowledge Engineering, vol. 68, no. 10, pp. 905–925, Oct. 2009. [55] R. Stevens, C. A. Goble, and S. Bechhofer, “Ontology-based knowledge representation for bioinformatics.,” Briefings in bioinformatics, vol. 1, no. 4, pp. 398–414, Nov. 2000. [56] “OpenGALEN Mission Statement.” [Online]. Available: http://www.opengalen.org/index.html. [Accessed: 29-Sep-2012]. [57] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, “The description logic handbook: theory, implementation, and applications,” Jan. 2003. [58] F. Baader and W. Nutt, “Basic description logics,” pp. 43–95, Jan. 2003. Bibliography 174 [59] B. Vandervalk, “The SHARE System: A Semantic Web Based Approach for Evaluating Queries Across Distributed Bioinformatics Databases and Software-MSc Thesis,” 2011. [60] L. Badea and S.-H. Nienhuys-Cheng, “A Refinement Operator for Description Logics,” pp. 40–59, Jul. 2000. [61] J. Lehmann, S. Auer, L. Bühmann, and S. Tramp, “Class expression learning for ontology engineering,” Web Semantics: Science, Services and Agents on the World Wide Web , vol. 9, no. 1, pp. 71–81, Mar. 2011. [62] M. Gruninger, O. Bodenreider, F. Olken, L. Obrst, and P. Yim, “Ontology Summit 2007 - Ontology, taxonomy, folksonomy: Understanding the distinctions,” Applied Ontology, vol. 3, no. 3, pp. 191–200, Aug. 2008. [63] “Ontologies Come of Age.” [Online]. Available: http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with- citation).htm. [Accessed: 17-Oct-2012]. [64] O. Bodenreider and R. Stevens, “Bio-ontologies: current trends and future directions.,” Briefings in bioinformatics, vol. 7, no. 3, pp. 256–74, Sep. 2006. [65] J. J. Cimino and X. Zhu, “The practical impact of ontologies on biomedical informatics.,” Yearbook of medical informatics, pp. 124–35, Jan. 2006. [66] A. C. Yu, “Methods in biomedical ontology.,” Journal of biomedical informatics, vol. 39, no. 3, pp. 252–66, Jun. 2006. [67] D. Qi, R. D. King, A. L. Hopkins, G. R. J. Bickerton, and L. N. Soldatova, “An ontology for description of drug discovery investigations.,” Journal of integrative bioinformatics, vol. 7, no. 3, Jan. 2010. [68] L. M. Refolo, H. Snyder, C. Liggins, L. Ryan, N. Silverberg, S. Petanceska, and M. C. Carrillo, “Common Alzheimer’s Disease Research Ontology: National Institute on Aging and Alzheimer's Association collaborative project.,” Alzheimer’s & dementia : the journal of the Alzheimer's Association, vol. 8, no. 4, pp. 372–5, Jul. 2012. [69] D. L. Rubin, N. H. Shah, and N. F. Noy, “Biomedical ontologies: a functional perspective.,” Briefings in bioinformatics, vol. 9, no. 1, pp. 75–90, Jan. 2008. [70] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock , “Gene ontology: tool for the unification of biology. 
The Gene Ontology Consortium.,” Nature genetics, vol. 25, no. 1, pp. 25–9, May 2000. [71] F. W. Hartel, S. de Coronado, R. Dionne, G. Fragoso, and J. Golbeck, “Modeling a description logic vocabulary for cancer research.,” Journal of biomedical informatics, vol. 38, no. 2, pp. 114–29, Apr. 2005. Bibliography 175 [72] C. Rosse and J. L. V Mejino, “A Reference Ontology for Bioinformatics: The Foundational Model of Anatomy.” 01-Jan-2003. [73] “SNOMED CT.” [Online]. Available: http://www.ihtsdo.org/snomed-ct/. [Accessed: 17-Oct- 2012]. [74] L. Peters and O. Bodenreider, “Using the RxNorm web services API for quality assurance prposes.,” AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, pp. 591–5, Jan. 2008. [75] “National Drug File – Reference Terminology (NDF-RTTM) Documentation.” [Online]. Available: http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT Documentation.pdf. [76] S. J. Nelson, M. Schopen, A. G. Savage, J.-L. Schulman, and N. Arluk, “The MeSH translation maintenance system: structure, interface design, and implementation.,” Studies in health technology and informatics, vol. 107, no. Pt 1, pp. 67–9, Jan. 2004. [77] “ICD-10 Version:2010.” [Online]. Available: http://apps.who.int/classifications/icd10/browse/2010/en. [Accessed: 05-May-2013]. [78] “Document Ontology — LOINC.” [Online]. Available: http://loinc.org/discussion- documents/document-ontology. [Accessed: 19-Oct-2012]. [79] B. L. Humphreys, D. A. Lindberg, H. M. Schoolman, and G. O. Barnett, “The Unified Medical Language System: an informatics research collaboration.,” Journal of the American Medical Informatics Association : JAMIA, vol. 5, no. 1, pp. 1–11. [80] J. A. Blake and C. J. Bult, “Beyond the data deluge: data integration and bio-ontologies.,” Journal of biomedical informatics, vol. 39, no. 3, pp. 314–20, Jun. 2006. [81] RHIA and C. E. by K. Giannangelo, Healthcare Code Sets, Clinical Terminologies, and Classification Systems. American Health Information Management Association (AHIMA), 2006, p. 321. [82] C. Daniel-Le Bozec, O. Steichen, T. Dart, and M.-C. Jaulent, “The role of local terminologies in electronic health records. The HEGP experience.,” Studies in health technology and informatics, vol. 129, no. Pt 1, pp. 780–4, Jan. 2007. [83] S. V. S. Pakhomov, J. D. Buntrock, and C. G. Chute, “Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques.,” Journal of the American Medical Informatics Association : JAMIA, vol. 13, no. 5, pp. 516–25. [84] A. J. Butte and R. Chen, “Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics.,” AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, pp. 106–10, Jan. 2006. [85] N. H. Shah, D. L. Rubin, K. S. Supekar, and M. A. Musen, “Ontology-based annotation and query of tissue microarray data.,” AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, pp. 709–13, Jan. 2006. Bibliography 176 [86] T.-K. Kim, J.-S. Oh, W.-S. Cho, G. H. Ko, S. Lee, and B. K. Hou, “PubMine: An Ontology- Based Text Mining System for Deducing Relationships among Biological Entities,” Interdisciplinary Bio Central, vol. 3, no. 2, pp. 1–6, Apr. 2011. [87] C. E. Crangle and A. Zbyslaw, “Identifying gene ontology concepts in natural-language text.,” Conference proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, vol. 4, pp. 
2821–3, Jan. 2004. [88] A. R. Aronson and F.-M. Lang, “An overview of MetaMap: historical perspective and recent advances.,” Journal of the American Medical Informatics Association : JAMIA, vol. 17, no. 3, pp. 229–36. [89] D. Rebholz-Schuhmann, M. Arregui, S. Gaudan, H. Kirsch, and A. Jimeno, “Text processing through Web services: calling Whatizit.,” Bioinformatics (Oxford, England), vol. 24, no. 2, pp. 296–8, Jan. 2008. [90] “DrugBank.” [Online]. Available: http://www.drugbank.ca/. [Accessed: 24-Oct-2012]. [91] “UniProt.” [Online]. Available: http://www.uniprot.org/. [Accessed: 24-Oct-2012]. [92] “Ontology Lookup Service (OLS).” [Online]. Available: http://www.ebi.ac.uk/ontology- lookup/browse.do?ontName=CHEBI. [Accessed: 24-Oct-2012]. [93] M.-M. Bouamrane, A. Rector, and M. Hurrell, “Experience of using OWL ontologies for automated inference of routine pre-operative screening tests,” pp. 50–65, Nov. 2010. [94] M. A. Musen, “Dimensions of knowledge sharing and reuse.,” Computers and biomedical research, an international journal, vol. 25, no. 5, pp. 435–67, Oct. 1992. [95] “Shirky: Ontology is Overrated -- Categories, Links, and Tags.” [Online]. Available: http://www.shirky.com/writings/ontology_overrated.html. [Accessed: 05-Nov-2012]. [96] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens, “OilEd: A Reason-able Ontology Editor for the Semantic Web,” Lecture Notes in Computer Science, vol. 2174, 2001. [97] R. M. Millis, Advances in Electrocardiograms - Clinical Applications. null. [98] K. Wolstencroft, P. Lord, L. Tabernero, A. Brass, and R. Stevens, “Protein classification using ontology classification.,” Bioinformatics (Oxford, England), vol. 22, no. 14, pp. e530–8, Jul. 2006. [99] “Pellet: OWL 2 Reasoner for Java.” [Online]. Available: http://clarkparsia.com/pellet/. [Accessed: 26-Oct-2012]. [100] “OWL : FaCT++.” [Online]. Available: http://owl.man.ac.uk/factplusplus/. [Accessed: 21- Nov-2012]. Bibliography 177 [101] P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass, “An ontology for bioinformatics applications,” Bioinformatics, vol. 15, no. 6, pp. 510–520, Jun. 1999. [102] R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and A. Brass, “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources,” Bioinformatics, vol. 16, no. 2, pp. 184–186, Feb. 2000. [103] “Knowledge Acquisition.” [Online]. Available: https://engineering.purdue.edu/~engelb/abe565/knowacq.htm. [Accessed: 06-Nov-2012]. [104] G.-L. Cui, “Acquisition of domain concepts and ontology construction,” Jan. 2010. [105] M. D. W. Soroush Samadian, Benjamin M. Good, Bruce McManus, “A data-driven approach to automatic discovery of prescription drugs in cardiovascular risk management,” Bio- Ontologies 2012. [106] M. Stefik, D. G. Bobrow, S. Mittal, and L. Conway, “Knowledge programming in loops: report on an experimental course,” pp. 493–503, Apr. 1989. [107] D. B. Martin Hepp, “OntoWiki: Community-driven Ontology Engineering and Ontology Usage based on Wikis.” [108] “Ontolingua Home Page.” [Online]. Available: http://www.ksl.stanford.edu/software/ontolingua/. [Accessed: 08-Nov-2012]. [109] Y. Sure, J. Angele, and S. Staab, “OntoEdit: Guiding Ontology Development by Methodology and Inferencing,” pp. 1205–1222, Oct. 2002. [110] P. Eklund, N. Roberts, and S. Green, “OntoRama: Browsing RDF Ontologies Using a Hyperbolic-Style Browser,” p. 0405, Nov. 2002. [111] M. N. Ahmad and R. M. Colomb, “Managing ontologies: a comparative study of ontology servers,” pp. 
13–22, Mar. 2007. [112] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, and R. Studer, “Semantic Wikipedia,” in Proceedings of the 15th international conference on World Wide Web - WWW ’06, 2006, p. 585. [113] “Main Page - Freebase.” [Online]. Available: http://wiki.freebase.com/wiki/Main_Page. [Accessed: 08-Nov-2012]. [114] “WikiNeuron: Semantic Wiki of Collective Minds in Neuroscience | bioontology.org.” [Online]. Available: http://www.bioontology.org/WikiNeuron. [Accessed: 08-Nov-2012]. [115] B. M. Good, E. L. Clarke, L. de Alfaro, and A. I. Su, “The Gene Wiki in 2011: community intelligence applied to human gene annotation.,” Nucleic acids research, vol. 40, no. Database issue, pp. D1255–61, Jan. 2012. Bibliography 178 [116] F. H. Zaidan and M. P. Bax, “Semantic wikis and the collaborative construction of ontologies: case study,” JISTEM Journal of information systems and technology management, vol. 8, no. 3, pp. 539–554, Dec. 2011. [117] M. Buffa and F. Gandon, “SweetWiki,” in Proceedings of the 2006 international symposium on Wikis - WikiSym ’06, 2006, p. 69. [118] G. Perez and M. Mancho, “{A Survey of Ontology Learning Methods and Techniques},” 2003. [119] W. Wong, W. Liu, and M. Bennamoun, “Ontology learning from text,” ACM computing surveys, vol. 44, no. 4, pp. 1–36, Aug. 2012. [120] R. Drumond, L. and Girardi, “A Survey of Ontology Learning Procedures,” Proceedings of the 3rd workshop on ontologies and their applications, 2008. [121] M. R. Nathalie Pernelle, “Automatic Construction and Refinement of a Class Hierarchy over Semistructured Data.” [122] E. M. Alexander Maedche, “The TEXT-TO-ONTO Ontology Learning Environment.” [123] B. M. Paul Buitelaar, “Ontology Learning from Text: An Overview.” [124] J. Lehmann, S. Auer, L. Bühmann, and S. Tramp, “Class expression learning for ontology engineering,” Web Semantics: Science, Services and Agents on the World Wide Web , vol. 9, no. 1, pp. 71–81, Mar. 2011. [125] S.-H. Nienhuys-Cheng, R. de/Siekmann Wolf, and J. G. Carbonell, “Foundations of Inductive Logic Programming,” Jan. 1997. [126] L. Raedt, P. Frasconi, K. Kersting, and S. Muggleton, Eds., Probabilistic inductive logic programming, vol. 4911. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. [127] K. Kersting, “An Inductive Logic Programming Approach to Statistical Relational Learning,” pp. 1–228, May 2005. [128] F. A. Lisi, “Reasoning with OWL-DL in Inductive Logic Programming,” Proc. of the Third International workshop, OWL: experiences and directions (OWLED 2007), 2007. [129] X.-A. Zhang, “Supporting on-the-fly data integration for bioinformatics,” Jan. 2007. [130] T. J. Lee, Y. Pouliot, V. Wagner, P. Gupta, D. W. J. Stringer-Calvert, J. D. Tenenbaum, and P. D. Karp, “BioWarehouse: a bioinformatics database warehouse toolkit. ,” BMC bioinformatics, vol. 7, no. 1, p. 170, Jan. 2006. [131] S. P. Shah, Y. Huang, T. Xu, M. M. S. Yuen, J. Ling, and B. F. F. Ouellette, “Atlas - a data warehouse for integrative bioinformatics.,” BMC bioinformatics, vol. 6, no. 1, p. 34, Jan. 2005. Bibliography 179 [132] T. Etzold, A. Ulyanov, and P. Argos, “SRS: information retrieval system for molecular biology data banks.,” Methods in enzymology, vol. 266, pp. 114–28, Jan. 1996. [133] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and A. Kasprzyk, “BioMart--biological queries made easy.,” BMC genomics, vol. 10, no. 1, p. 22, Jan. 2009. [134] L. Stein, “Creating a bioinformatics nation.,” Nature, vol. 417, no. 6885, pp. 119–20, May 2002. 
[135] “SADI Semantic Web Services -- ‘cause you can’t always GET what you want!.”[Online]. Available: http://arnetminer.org/publication/sadi-semantic-web-services-cause-you-can-t- always-get-what-you-want- 2835554.html;jsessionid=59A6E0D13AA6BD7806BC64921627081E.tt. [Accessed: 17-Oct- 2012]. [136] J. Saltz, S. Oster, S. Hastings, S. Langella, T. Kurc, W. Sanchez, M. Kher, A. Manisundaram, K. Shanbhag, and P. Covitz, “caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid.,” Bioinformatics (Oxford, England), vol. 22, no. 15, pp. 1910–6, Aug. 2006. [137] A. Dogac, G. B. Laleci, S. Kirbas, Y. Kabak, S. S. Sinir, A. Yildiz, and Y. Gurcan, “Artemis: Deploying semantically enriched Web services in the healthcare domain,” Information systems, vol. 31, no. 4–5, pp. 321–339, Jun. 2006. [138] D. D. G. Gessler, G. S. Schiltz, G. D. May, S. Avraham, C. D. Town, D. Grant, and R. T. Nelson, “SSWAP: A Simple Semantic Web Architecture and Protocol for semantic web services.,” BMC bioinformatics, vol. 10, p. 309, Jan. 2009. [139] M. D. Wilkinson, “BioMOBY: An open source biological web services proposal,” Briefings in bioinformatics, vol. 3, no. 4, pp. 331–341, Jan. 2002. [140] R. D. Stevens, A. J. Robinson, and C. A. Goble, “myGrid: personalised bioinformatics on the information grid,” Bioinformatics, vol. 19, no. Suppl 1, pp. i302–i304, Jul. 2003. [141] H. Jamil and B. El-Hajj-Diab, “BioFlow: A Web-Based Declarative Workflow Language for Life Sciences,” in 2008 IEEE Congress on services - Part I, 2008, pp. 453–460. [142] R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and A. Brass, “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources,” Bioinformatics, vol. 16, no. 2, pp. 184–186, Feb. 2000. [143] M. D. Wilkinson, L. McCarthy, B. Vandervalk, D. Withers, E. Kawas, and S. Samadian, “SADI, SHARE, and the in silico scientific method.,” BMC bioinformatics, vol. 11 Suppl 1, no. Suppl 12, p. S7, Jan. 2010. [144] G. V Gkoutos, P. N. Schofield, and R. Hoehndorf, “The Units Ontology: a tool for integrating units of measurement in science.,” Database : the journal of biological databases and curation, vol. 2012, p. bas033, Jan. 2012. Bibliography 180 [145] J. P. Higgins and S. Green, Eds., Cochrane handbook for systematic reviews of interventions. Chichester, UK: John Wiley & Sons, Ltd, 2008. [146] N. Leveson, “Role of software in spacecraft accidents,” Journal of spacecraft and rockets, vol. 41, no. 4, pp. 564 – 575, 2004. [147] N. US Department of Commerce, “International System of Units (SI).” [148] G. V Gkoutos, P. N. Schofield, and R. Hoehndorf, “The Units Ontology: a tool for integrating units of measurement in science.,” Database : the journal of biological databases and curation, vol. 2012, no. 0, p. bas033, Jan. 2012. [149] G. Schadow, C. J. McDonald, J. G. Suico, U. Föhring, and T. Tolxdorff, “Units of measure in clinical information systems.,” Journal of the American medical informatics association : JAMIA, vol. 6, no. 2, pp. 151–62. [150] G. R. O. Thomas R. Gruber, “An Ontology for Engineering Mathematics Abstract.” [151] D. B. and L. Polo, “MUO: Measurement Units Ontology.” [Online]. Available: http://idi.fundacionctic.org/muo/muo-vocab.html. [Accessed: 03-Dec-2012]. [152] “Ontology of Units of Measure and Related Concepts | www.semantic-web-journal.net.” [Online]. Available: http://www.semantic-web-journal.net/content/ontology-units-measure- and-related-concepts. [Accessed: 03-Dec-2012]. 
[153] “QUDT - Quantities, Units, Dimensions and Types.” [Online]. Available: http://qudt.org/. [Accessed: 28-Nov-2012]. [154] “Ontolingua Theory STANDARD-UNITS.” [Online]. Available: http://www- ksl.stanford.edu/knowledge-sharing/ontologies/html/standard-units/index.html. [Accessed: 15- Dec-2012]. [155] “QUDT - Quantities, Units, Dimensions and Types.” [Online]. Available: http://www.qudt.org/. [Accessed: 04-Dec-2012]. [156] “wurvoc.org - Ontology of units of Measure (OM).” [Online]. Available: http://www.wurvoc.org/vocabularies/om-1.6/. [Accessed: 04-Feb-2013]. [157] “wurvoc.org - OM web services.” [Online]. Available: http://www.wurvoc.org/services/oum.jsp. [Accessed: 04-Feb-2013]. [158] A. Rector, J. Rogers, and P. Pole, “The GALEN High Level Ontology,” Proceedings MIE 96, pp. 174 – 178, 1996. [159] C. G. Chute, “Clinical classification and terminology: some history and current observations.,” Journal of the American medical informatics association : JAMIA, vol. 7, no. 3, pp. 298–303. Bibliography 181 [160] “SIO - semanticscience - The Semanticscience Integrated Ontology (SIO) - Scientific Knowledge Discovery - Google Project Hosting.” [Online]. Available: http://code.google.com/p/semanticscience/wiki/SIO. [Accessed: 05-Dec-2012]. [161] M. Dumontier, “why and how SIO differs from the OBO Foundry effort.” [162] D. M. Villanueva-Rosales N, “Modeling Life Science Knowledge with OWL 1.1.” [163] A. V Chobanian, G. L. Bakris, H. R. Black, W. C. Cushman, L. A. Green, J. L. Izzo, D. W. Jones, B. J. Materson, S. Oparil, J. T. Wright, and E. J. Roccella, “The Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure: the JNC 7 report.,” JAMA : Journal of the American medical association, vol. 289, no. 19, pp. 2560–72, May 2003. [164] I. Johansson, “Metrological thinking needs the notions of parametric quantities, units and dimensions,” Metrologia, vol. 47, no. 3, pp. 219–230, Jun. 2010. [165] W. B. Kannel and D. L. McGee, “Diabetes and cardiovascular risk factors: the Framingham study,” Circulation, vol. 59, no. 1, pp. 8–13, Jan. 1979. [166] F. P. Gómez and R. Rodriguez-Roisin, “Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines for chronic obstructive pulmonary disease.,” Current opinion in pulmonary medicine, vol. 8, no. 2, pp. 81–6, Mar. 2002. [167] T. R. DAWBER, F. E. MOORE, and G. V MANN, “Coronary heart disease in the Framingham study.,” American journal of public health and the nation’s health , vol. 47, no. 4 Pt 2, pp. 4–24, Apr. 1957. [168] P. M. Ridker, J. E. Buring, N. Rifai, and N. R. Cook, “Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the Reynolds Risk Score.,” JAMA : Journal of the American medical association, vol. 297, no. 6, pp. 611–9, Feb. 2007. [169] M. D. Wilkinson, “BioMOBY Interoperability Today, Integration Tomorrow.” [Online]. Available: http://sadiframework.org/documentation/MOBY_IMB_UQ2005.ppt. [170] B. Vandervalk, L. McCarthy, and M. Wilkinson, The Semantic Web, vol. 5926. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 367 – 369. [171] W. T. Friedewald, R. I. Levy, and D. S. Fredrickson, “Estimation of the Concentration of Low-Density Lipoprotein Cholesterol in Plasma, Without Use of the Preparative Ultracentrifuge,” Clinical chemistry, vol. 18, no. 6, pp. 499–502, Jun. 1972. [172] “The Cholesterol Levels.” [173] “What Your Cholesterol Levels Mean.” Bibliography 182 [174] G. 
Eknoyan, “Adolphe Quetelet (1796-1874)--the average man and indices of obesity.,” Nephrology, dialysis, transplantation : official publication of the European dialysis and transplant association - European Renal Association, vol. 23, no. 1, pp. 47–51, Jan. 2008. [175] “Supplementary Materials.” [176] “Framingham Heart Study.” [177] A. Casadevall and F. C. Fang, “Reproducible science.,” Infection and immunity, vol. 78, no. 12, pp. 4972–5, Dec. 2010. [178] D. L. Welch, “Human error and human factors engineering in health care.,” Biomedical instrumentation & technology / association for the advancement of medical instrumentation, vol. 31, no. 6, pp. 627–31. [179] S. Samadian, B. McManus, and M. D. Wilkinson, “Extending and encoding existing biological terminologies and datasets for use in the reasoned semantic web.,” Journal of biomedical semantics, vol. 3, no. 1, p. 6, Jul. 2012. [180] “Health Level Seven International.” [181] “Mappings From Clinical Trial Eligibility Drug Information to Prescription in Patient Data Using Drug Ontology.” [Online]. Available: http://www.w3.org/wiki/HCLS/ClinicalObservationsInteroperability/DrugMapping.html. [182] H. E. Pence and A. Williams, “ChemSpider: An Online Chemical Information Resource,” Journal of Chemical Education, vol. 87, no. 11, pp. 1123–1124, Nov. 2010. [183] M. Dumontier., “ODPCollection.” [Online]. Available: http://code.google.com/p/semanticscience/wiki/ODPCollection. [184] “The Perl Programming Language - www.perl.org.” [Online]. Available: http://www.perl.org/. [Accessed: 19-Jan-2013]. [185] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (The Morgan Kaufmann Series in Data Management Systems) . Morgan Kaufmann, 1999, p. 371. [186] P. D. and D. A. M. Pharm.D. and J. E. Tisdale, Drug-induced diseases: prevention, detection, and management. ASHP, 2005, p. 890. [187] H. L. Lipton, L. A. Bero, J. A. Bird, and S. J. McPhee, “The impact of clinical pharmacists’ consultations on physicians' geriatric drug prescribing. A randomized controlled trial.,” Medical care, vol. 30, no. 7, pp. 646–58, Jul. 1992. [188] E. R. Hajjar, A. C. Cafiero, and J. T. Hanlon, “Polypharmacy in elderly patients.,” The American journal of geriatric pharmacotherapy, vol. 5, no. 4, pp. 345–51, Dec. 2007. [189] “Cardiology Explained.” Remedica, 2004. Bibliography 183 [190] R. G. Stevens and D. Balon, “Detection of hazardous drug/drug interactions in a community pharmacy and subsequent intervention,” International journal of pharmacy practice, vol. 5, no. 3, pp. 142–148, Sep. 1997. [191] “Dipyridamole medical facts from Drugs.com.” [Online]. Available: http://www.drugs.com/mtm/dipyridamole.html. [Accessed: 25-Jan-2013]. [192] “Ranitidine Information from Drugs.com.” [Online]. Available: http://www.drugs.com/ranitidine.html. [Accessed: 25-Jan-2013]. [193] T. J. Bright, E. Yoko Furuya, G. J. Kuperman, J. J. Cimino, and S. Bakken, “Development and evaluation of an ontology for guiding appropriate antibiotic prescribing.,” Journal of biomedical informatics, vol. 45, no. 1, pp. 120–8, Feb. 2012. [194] M. Popescu and G. Arthur, “OntoQuest: a physician decision support system based on ontological queries of the hospital database.,” AMIA Annual symposium proceedings / AMIA symposium. AMIA symposium, pp. 639–43, Jan. 2006. [195] F. M. and A. B. Olivier Bodenreider, “Automatic determination of anticoagulation status with NDF-RT,” in Proceedings of the 13th ISMB’2010 SIG meeting “Bio-ontologies”, 2010, pp. 140–143. 
[196] A. J. Williams, V. Tkachenko, S. Golotvin, R. Kidd, and G. McCann, “ChemSpider - building a foundation for the semantic web by hosting a crowd sourced databasing platform for chemistry,” Journal of cheminformatics, vol. 2, no. Suppl 1, p. O16, 2010. [197] “Regulated Chemicals - CHEMLIST.” [Online]. Available: http://www.cas.org/content/regulated-chemicals. [Accessed: 18-Jan-2013]. [198] M. G. Janez Brank, “A survey of ontology evaluation techniques.” [199] A. Teije, J. Völker, S. Handschuh, H. Stuckenschmidt, M. d’Acquin, A. Nikolov, N. Aussenac-Gilles, and N. Hernandez, Eds., Knowledge engineering and knowledge management, vol. 7603. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. [200] S. Muggleton and L. De Raedt, “Inductive Logic Programming: Theory and Methods,” Journal of Logic Programming, vol. 19/20, pp. 629 – 679, 1994. [201] F. Baader, B. Sertkaya, and A.-Y. Turhan, “Computing the least common subsumer w.r.t. a background terminology,” Journal of applied logic, vol. 5, no. 3, pp. 392–420, Sep. 2007. [202] “YINYANG System.” [Online]. Available: http://www.di.uniba.it/~iannone/yinyang/. [Accessed: 08-Feb-2013]. [203] B. Conference Chair-Smith and C. Conference Chair-Welty, “FOIS introduction,” in Proceedings of the international conference on formal ontology in information systems - FOIS ’01, 2001, vol. 2001, p. .3–.9. Bibliography 184 [204] “relexo - Relational Exploration for Learning Expressive Ontologies - Google Project Hosting.” [Online]. Available: http://code.google.com/p/relexo/. [Accessed: 08-Feb-2013]. [205] J. . Bühmann, L., Lehmann, “Universal OWL axiom enrichment for large knowledge bases,” EKAW 2012, 2012. [206] B. G. Franz Baader, “Completing description logic knowledge bases using formal concept analysis.” [207] S. Rudolph, “Exploring relational structures via FLE.” [208] B. Sertkaya, “Explaining User Errors in Knowledge Base Completion.” [209] D. V. Johanna Völker, “Learning disjointness.” [210] F. Baader, B. Ganter, B. Sertkaya, and U. Sattler, “Completing description logic knowledge bases using formal concept analysis,” pp. 230–235, Jan. 2007. [211] “Weka 3 - Data Mining with Open Source Machine Learning Software in Java.” [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 18-Feb-2013]. [212] M. S. T. Michael J. Lincoln, Steven H. Brown, Viet Nguyen, Tim Cromwell, John Carter, Mark Erlbaum, “Department of Veterans Affairs Enterprise Reference Terminology strategic overview,” Studies in health and technology informatics, no. 107, pp. 391–395, 2004. [213] “RxNav: A Semantic Navigation Tool for Clinical Drugs.” [214] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques: Practical Machine Learning Tools and Techniques (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, 2011. [215] E. Standl, “Metformin: drug of choice for the prevention of type 2 diabetes and cardiovascular complications in high-risk subjects.,” Diabetes & metabolism, vol. 29, no. 4 Pt 2, pp. 6S121– 2, Sep. 2003. [216] C. T. Ruff and E. Braunwald, “Will warfarin ever be replaced?,” Journal of cardiovascular pharmacology and therapeutics, vol. 15, no. 3, pp. 210–9, Sep. 2010. [217] S. A. Thair, K. R. Walley, T. Nakada, M. K. McConechy, J. H. Boyd, H. Wellman, and J. A. Russell, “A single nucleotide polymorphism in NF-κB inducing kinase is associated with mortality in septic shock.,” Journal of immunology (Baltimore, Md. : 1950), vol. 186, no. 4, pp. 2321–8, Feb. 2011. 
[218] “PhosphaBase - A PTPNET Initiative.” [Online]. Available: http://www.bioinf.manchester.ac.uk/phosphabase/. [Accessed: 29-Mar-2013]. [219] “Protein-tyrosine phosphatase, KIM-containing (IPR008356) < InterPro < EMBL-EBI.” [Online]. Available: Bibliography 185 http://www.ebi.ac.uk/interpro/entry/IPR008356;jsessionid=68DEFFE5EABC4E1B52BBE45E 58EFE67A. [Accessed: 29-Mar-2013]. [220] M. Shimoyama, R. Nigam, L. S. McIntosh, R. Nagarajan, T. Rice, D. C. Rao, and M. R. Dwinell, “Three ontologies to define phenotype measurement data.,” Frontiers in genetics, vol. 3, p. 87, Jan. 2012. [221] W. J. Sibbald and J. L. Vincent, “Round table conference on clinical trials for the treatment of sepsis.,” Critical care medicine, vol. 23, no. 2, pp. 394–9, Feb. 1995. [222] Textbook of Cardiovascular Medicine (Book with CD-ROM). Lippincott Williams & Wilkins, 2002, p. 2008. [223] P. Kannankeril, D. M. Roden, and D. Darbar, “Drug-induced long QT syndrome.,” Pharmacological reviews, vol. 62, no. 4, pp. 760–81, Dec. 2010. [224] K. Kunkler, “Acquired long QT syndrome: risk assessment, prudent prescribing and monitoring, and patient education.,” Journal of the American academy of nurse practitioners, vol. 14, no. 9, pp. 382–9, Sep. 2002. [225] “Long QT Syndrome.” [Online]. Available: http://www.mykentuckyheart.com/information/LongQTSyndrome.htm. [Accessed: 02-Apr- 2013]. [226] H. Wang, F. Azuaje, B. Jung, and N. Black, “A markup language for electrocardiogram data acquisition and analysis (ecgML).,” BMC medical informatics and decision making, vol. 3, no. 1, p. 4, May 2003. [227] “First-Degree Atrioventricular Block.” [Online]. Available: http://emedicine.medscape.com/article/161829-overview. [Accessed: 02-Apr-2013]. [228] P. K. Rohan Jayasinghe, “Drugs and the QTc interval.” [229] “Dr. John Boyd - UBC James Hogg Research Centre, Institute for Heart + Lung Health.” [Online]. Available: http://www.hli.ubc.ca/research/PIs/Boyd.html. [Accessed: 17-Mar-2013]. [230] “Dr. Keith Walley - UBC James Hogg Research Centre, Institute for Heart + Lung Health.” [Online]. Available: http://www.hli.ubc.ca/research/PIs/Walley.html. [Accessed: 17-Mar- 2013]. [231] D. L. Rubin, M. Hewett, D. E. Oliver, T. E. Klein, and R. B. Altman, “Automating data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML.,” Pacific symposium on biocomputing. pacific symposium on biocomputing, pp. 88–99, Jan. 2002. [232] “ConverterToRdf - W3C Wiki.” [Online]. Available: http://www.w3.org/wiki/ConverterToRdf. [Accessed: 18-Mar-2013]. Bibliography 186 [233] A. Skowron and Z. Suraj, Eds., Rough sets and intelligent systems - Professor Zdzisław Pawlak in Memoriam, vol. 42. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. [234] T.-L. Tran, Q.-T. Ha, T.-L.-G. Hoang, L. A. Nguyen, H. S. Nguyen, and A. Szalas, “Concept Learning for Description Logic-Based Information Systems,” in 2012 Fourth international conference on knowledge and systems engineering, 2012, pp. 65–73. [235] “Human Phenotype Ontology Website - The Human Phenotype Ontology.” [Online]. Available: http://www.human-phenotype-ontology.org/. [Accessed: 25-Mar-2013]. [236] S. Cheng, M. J. Keyes, M. G. Larson, E. L. McCabe, C. Newton-Cheh, D. Levy, E. J. Benjamin, R. S. Vasan, and T. J. Wang, “Long-term outcomes in individuals with prolonged PR interval or first-degree atrioventricular block.,” JAMA : Journal of the American Medical Association, vol. 301, no. 24, pp. 2571–7, Jun. 2009. [237] I. Wood, B. Vandervalk, L. 
McCarthy, and M. D. Wilkinson, Leveraging Applications of Formal methods, verification and validation. applications and case studies, vol. 7610. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 56–66. [238] P. Klinov, “Pronto: a non-monotonic probabilistic description logic reasoner,” pp. 822–826, Jun. 2008. [239] “Doing a Job - The Management Philosophy of Adm. Hyman G. Rickover.” [Online]. Available: http://govleaders.org/rickover.htm. [Accessed: 03-Apr-2013]. [240] R. S. Rudin, S. R. Simon, L. A. Volk, M. Tripathi, and D. Bates, “Understanding the decisions and values of stakeholders in health information exchanges: experiences from Massachusetts.,” American journal of public health, vol. 99, no. 5, pp. 950–5, May 2009. [241] “SPARQL endpoint - semanticweb.org.” [Online]. Available: http://semanticweb.org/wiki/SPARQL_endpoint. [Accessed: 22-Nov-2012]. [242] O. Erling and I. Mikhailov, “{RDF Support in the Virtuoso DBMS},” vol. 221, 2007. [243] J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson, “Jena,” in Proceedings of the 13th international World Wide Web conference on alternate track papers & posters - WWW Alt. ’04, 2004, p. 74. [244] A. K. Jeen Broekstra, “Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema.” [245] “Bio2RDF.” [Online]. Available: http://bio2rdf.org/. [Accessed: 22-Nov-2012]. [246] “The Pharmacogenomics Knowledge Base [PharmGKB].” [Online]. Available: http://www.pharmgkb.org/. [Accessed: 22-Nov-2012]. [247] B. Quilitz and U. Leser, “Querying distributed RDF data sources with SPARQL,” pp. 524– 538, Jun. 2008. Bibliography 187 [248] “Yahoo!” [Online]. Available: http://www.yahoo.com/?s=https. [Accessed: 26-Feb-2013]. [249] G. Marquet, O. Dameron, S. Saikali, J. Mosser, and A. Burgun, “Grading glioma tumors using OWL-DL and NCI Thesaurus.,” AMIA Annual symposium proceedings / AMIA symposium. AMIA symposium, pp. 508–12, Jan. 2007. [250] D. L. Rubin, O. Dameron, Y. Bashir, D. Grossman, P. Dev, and M. A. Musen, “Using ontologies linked with geometric models to reason about penetrating injuries.,” Artificial intelligence in medicine, vol. 37, no. 3, pp. 167–76, Jul. 2006. [251] “PATO:Main Page - OBOFoundry.” [Online]. Available: http://obofoundry.org/wiki/index.php/PATO:Main_Page. [Accessed: 05-Dec-2012]. [252] “OBO Flat File Format 1.4 Syntax and Semantics [DRAFT].” [Online]. Available: http://oboformat.googlecode.com/svn/branches/2011-11-29/doc/obo-syntax.html. [Accessed: 14-Dec-2012]. [253] “MUO: Measurement Units Ontology.” [Online]. Available: http://idi.fundacionctic.org/muo/muo-vocab.html#PhysicalQuality. [Accessed: 16-Dec-2012]. [254] “DOLCE-Lite-Plus.” [Online]. Available: http://www.w3.org/2001/sw/BestPractices/WNET/DLP3941_daml.html#measurement-unit #5. [Accessed: 15-Dec-2012]. [255] S. B. Claudio Masolo, “Qualities in formal ontology.” [256] F. O. Cardarelli, Encyclopaedia of scientific units, weights and measures: their SI equivalences and origins. Springer, 2003, p. 848. [257] “SI Units for Clinical Data.” [Online]. Available: http://www.unc.edu/~rowlett/units/scales/clinical_data.html. [Accessed: 11-Jan-2013]. [258] “What is expert system? - Definition from WhatIs.com.” [Online]. Available: http://searchcio- midmarket.techtarget.com/definition/expert-system. [Accessed: 11-Oct-2012]. [259] “What is an RDF Blank Node?” [Online]. Available: http://msdn.microsoft.com/en- us/library/aa303696.aspx. [Accessed: 20-Nov-2012]. [260] J. 
Pearl, “Heuristics: intelligent search strategies for computer problem solving,” May 1984. [261] “General Formal Ontology (GFO).” [262] D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky, Foundations of Measurement Volume I: Additive and Polynomial Representations (Dover Books on Mathematics). Dover Publications, 2006, p. 624. [263] “Basic Measurement Theory.”

Appendix A: Supporting material for chapter 2

SPARQL endpoints

A SPARQL endpoint is an HTTP interface that accepts SPARQL queries as input, enabling users (human or machine) to query a knowledgebase via the SPARQL language[30], [241]. A SPARQL endpoint can be created using existing triple stores such as Virtuoso (http://virtuoso.openlinksw.com/)[242], Jena[243], and Sesame[244]. An increasing number of SPARQL endpoints have become publicly available in recent years; for instance, Bio2RDF[245] provides SPARQL endpoints for major resources such as PharmGKB[246] and UniProt[91]. To date, the majority of existing SPARQL endpoints are not capable of querying across multiple RDF files stored at different locations on the Web[42]; recently, however, a number of frameworks have been developed for querying across multiple SPARQL endpoints (see for example [247]). These frameworks attempt to break a single query into a number of subqueries and execute those subqueries against different endpoints using the SPARQL protocol[42]; however, it is still necessary to locate these endpoints a priori. Moreover, the existing frameworks (with the exception of SHARE) do not incorporate OWL-DL reasoning into the query resolution process. Frameworks designed to address these issues are discussed under the broader context of data integration and analysis tools in bioinformatics (section 2.8), once the required basic concepts have been introduced.

Some features of relations in an ontology

Relationships in an ontology can themselves be hierarchical. For example, the relation hasName may be subdivided into hasDiseaseName and hasPatientName. Relations also have features that capture further knowledge about the relationships between concepts; some common examples include the following (an illustrative encoding of several of these features is sketched after the list):
- Whether it is universally necessary that a relationship holds on a concept. For example, we might want to say that Patient hasBloodPressure High holds universally for all hypertensive patients.
- Whether a relationship can optionally hold on a concept. For example, Drug hasAlternativeName AlternativeName describes only the possibility that a medication has an alternative name, as not all instances of Drug necessarily have one.
- Whether the concept a relationship links to is restricted to certain kinds of concepts. For example, Drug hasFunction AntiHypertensiveDrug restricts the hasFunction relation to link only to concepts that are specific kinds of drug (such as Captopril).
- The cardinality of the relationship. For example, a particular BloodPressureRecord is associated with only one patient; however, a single patient may have several instances of BloodPressureRecord.
- Whether the relationship is transitive, i.e. inherited down a chain of relationships. For example, if HeartValve isPartOf Heart and Heart isPartOf Body, then HeartValve isPartOf Body. On the other hand, the property isMotherOf is not transitive: if A isMotherOf B and B isMotherOf C, we cannot conclude that A isMotherOf C.
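As a concrete illustration, the fragment below sketches how a few of these features could be stated in OWL (Turtle syntax). The property and class names (the cardio: terms) are illustrative only and are not drawn from any published ontology.

@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix cardio: <http://example.org/cardio#> .

# A transitive relation: part-of chains are inherited (HeartValve -> Heart -> Body)
cardio:isPartOf a owl:ObjectProperty , owl:TransitiveProperty .

# A range restriction: hasFunction may only point to kinds of drug function
cardio:hasFunction a owl:ObjectProperty ;
    rdfs:range cardio:DrugFunction .

# A cardinality restriction: every BloodPressureRecord is associated with exactly one patient
cardio:BloodPressureRecord rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty cardio:hasPatient ;
    owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ;
    owl:onClass cardio:Patient
] .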
Decidability in DL

Contrary to F-Logic, the majority of DLs are decidable, i.e. the reasoning process in a DL is guaranteed to terminate, both for positive and for negative answers. However, the guarantee of an answer in finite time does not necessarily mean that the answer is retrieved in "reasonable" time; the time it takes to retrieve an answer depends on the complexity of the problem. Decidability and complexity of reasoning depend on the expressive power of the specific DL, and investigating the trade-off between expressivity and the complexity of the associated reasoning problems has been one of the most important research areas in logic[50].

Individuals vs. logical axioms in DL

In DL, classes are represented as unary predicates (e.g. Patient(x), meaning "x is a Patient") and roles are represented as binary predicates (e.g. hasDisease(x,y), meaning "x has disease y"). From the structure of these two example statements, it is clear that they follow the familiar subject-predicate-object "triple" structure described earlier; the natural syntax for encoding OWL DL logic is therefore RDF triples. In Semantic Web knowledgebases, both the data (individuals) and the logic and knowledge used to interpret that data (the ontology) are represented in RDF.

Definition of terms used in the Ontology Spectrum (Figure 2.4)

Catalog is a list of terms providing an unambiguous interpretation of terms[63] (e.g. assigning a unique identifier to the medication Acetaminophen).

Glossary is a list of terms whose meanings are specified (usually) as natural language statements. Glossaries provide sufficient semantics for human users to interpret them, but do not meet the criteria of being machine processable[63].

Thesauri are glossaries that provide additional semantics through relations between terms, such as synonym or acronym relationships. Thesauri do not usually provide explicit hierarchical relationships between concepts[63], though with narrower and broader term specifications, simple hierarchies might be deduced[63].

Ontologies with informal is-a relationships do provide an explicit hierarchy; however, the hierarchy is not a strict subclass ("is-a") hierarchy, a pattern that occurs frequently on the Web. For instance, on the Yahoo! website[248] the term "Health" includes a subcategory "Fitness", a relationship that may not hold globally true. Without true subclass ("is-a") relationships, reasoning with ontologies becomes problematic.

Ontologies with formal is-a relationships include strict subclass hierarchies. In such ontologies, if A is a superclass of B and c is an instance of B, it necessarily follows that c is an instance of A. Strict subclass hierarchies are necessary for the exploitation of inheritance[63].

Ontologies with formal instances additionally provide instances of each class, in addition to formal hierarchical (is-a) relationships.

Frames are ontologies that are equipped with property information for classes[63]. For instance, if the class City contains the properties isLocatedInCountry, hasLatitude and hasLongitude, we may then say that Vancouver isLocatedInCountry Canada, and has a hasLatitude of 49.25° N and a hasLongitude of 123.11° W (a possible RDF encoding of this example is sketched below).
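The following Turtle fragment sketches how the City/Vancouver frame example above might look as RDF triples, which also illustrates the subject-predicate-object encoding noted under "Individuals vs. logical axioms in DL"; the URIs and the coordinate datatypes are illustrative only.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/geo#> .

# Class membership: "Vancouver is a City"
ex:Vancouver rdf:type ex:City ;
    # Property values attached to the individual (the "frame" slots)
    ex:isLocatedInCountry ex:Canada ;
    ex:hasLatitude  "49.25"^^xsd:decimal ;    # degrees North
    ex:hasLongitude "-123.11"^^xsd:decimal .  # degrees West, expressed as a negative value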
Frames with Value Restrictions additionally include value restrictions on the properties; in such ontologies one can place restrictions on what may fill a property[63]. For example, we may say that isLocatedInCountry can only be filled (i.e. the range of the property is restricted) by countries listed in an encyclopedia.

Logical Constraints allow additional constraints to be placed on the properties[63]. For instance, we may place specific mathematical limitations on the properties Systolic Blood Pressure (e.g. above 140 mmHg) and Diastolic Blood Pressure (above 90 mmHg) for a person to be considered Hypertensive. Some ontologies allow only a limited set of constraints, while others allow arbitrary logical restrictions to be placed on the classes; a sketch of such a logically constrained class definition follows.
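As an illustration of such logical constraints, the class Hypertensive could be defined roughly as follows, in the same informal OWL (Manchester-style) notation used in Appendix C. The property names hasSystolicBloodPressure and hasDiastolicBloodPressure are invented here for brevity; the definitions actually used in this thesis attach values through sio:hasAttribute, sio:hasMeasurement and sio:hasValue, and specify an explicit unit.

Hypertensive =
    Patient
    and (
        (hasSystolicBloodPressure  some double[> 140.0])
        or
        (hasDiastolicBloodPressure some double[> 90.0])
    )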
The use of ontologies in clinical decision support

CDSS benefit from bio-ontologies in several major ways. First, as just explained, ontologies can provide a standard terminology for biomedical concepts, which facilitates data integration[94]. For instance, consider the wide array of medication-related decision support applications that have been developed to provide suggestions regarding drug allergies, contraindications of drugs with certain conditions (e.g. pregnancy), drug-drug interactions, weight-corrected dosages, and so on[55]. In a system developed for such drug-related decision support, ontologies can assist in resolving local drug names (such as brand names, generic names, and abbreviations) into standardized codes; once standardized, the patient data can be mapped to existing knowledgebases (e.g. NDF-RT) that contain information relevant to the decision task. In addition to standardization and coding, ontologies are equipped with reasoning capabilities that can be of further benefit to clinical decision support systems. For example, the system described in [93] applies logical inference over domain knowledge (in OWL), combined with an individual patient's medical context, to generate personalised patient reports consisting of a risk assessment and clinical recommendations, including relevant pre-operative screening tests. Additionally, the FMA, UMLS, and the NCI Thesaurus have been used extensively for inference-based decision support; for instance, the OWL version of the NCI Thesaurus has been used for automatic grading of brain tumors[249]. Additional studies ([46], [93], [94], [249], [250]) relate to other aspects of semantics in clinical decision support, such as the interaction between the human and the machine; however, since this thesis is focused on exploring systems that do not rely on human intervention at all, these more social and behavioural aspects of the field of CDSS will not be discussed further.

Appendix B: Supporting material for chapter 3

Existing unit ontologies

Unit Ontology (UO): UO is a unit ontology focusing specifically on units of measurement within the biomedical domain. The development of UO was initiated in 2005 as a separate part of the Phenotype and Trait Ontology (PATO)[251] for describing qualitative and quantitative observations in biology. UO is available both in OBO[252] and OWL[1] format. UO provides different versions of the unit ontology that are suitable (as claimed by the authors) for different applications[144]. The main distinctions between these different versions are as follows:

1) Whether to treat units as instances or classes in OWL. There has been ongoing debate over whether units such as "kilogram" should be modeled as classes or instances (if "kilogram" is a class, the philosophical question arises of what its instances conceptually represent). Generally, the choice is a non-trivial problem ([144], [150], [151]) with no globally correct answer; the choice of representation depends both on philosophical considerations and on the type of application[144]. UO provides different versions to cover units both as classes and as instances (e.g. uo-without-units-as-classes.owl, and so on). The original version of UO (uo.owl) maps unit instances as "singletons" of an OWL class (i.e. each class has only one instance), an approach also adopted by other groups such as the Semanticscience Integrated Ontology (SIO)[160], and yields the same reasoning results as when units are treated as instances.

2) Whether or not the PATO ontology of physical qualities is included. In PATO, physical qualities are the kinds of properties that can be quantified, i.e. that can be perceived, measured or even calculated[253] (e.g. mass, pressure). In practice, some applications only require identifiers for units and do not need the link to the physical qualities the units represent; in these applications OWL is effectively used as a container, and the semantic relationship between the units and qualities is not used. UO provides different lightweight versions for such applications[144]. (When qualities are present, UO uses disjunction (logical OR) in OWL for cases where a unit may be the unit of more than one quality, increasing the computational complexity of reasoning.)

Regarding semantics, UO provides the relationship between units that are based on the same base unit. For instance, "gram" and "kilogram" are based on the same unit and are thus both considered subclasses (or instances, depending on the version) of "gram based units". Such units are defined as having a prefix (in this case, kilo) using the has-prefix relationship in the ontology. However, UO does not provide the mathematical meaning of such prefixes: an OWL reasoner would not be able to determine the conversion factor between kilogram and gram from the ontology alone. Furthermore, the conversion factor between units used for the same quantity that are not connected by any prefix is missing (e.g. the relationship between "inch" and "meter"). Finally, the semantic relationship between derived units and their components is lacking ("square meter" has no relationship to "meter"). The sketch below illustrates this style of representation and its limitations.
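The following Turtle fragment is a simplified, purely illustrative rendering of the UO-style pattern just described; real UO terms use numeric identifiers (e.g. UO:0000021), and the URIs and property names below are invented for readability. Note that nothing in the fragment tells a reasoner that the prefix kilo corresponds to a factor of 1000.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/uo-sketch#> .

ex:gram     rdfs:subClassOf ex:gram_based_unit .
ex:kilogram rdfs:subClassOf ex:gram_based_unit ;
            ex:has-prefix   ex:kilo .           # the prefix is named, but carries no numeric factor

# No statement links ex:inch to ex:meter, and no statement relates
# ex:square_meter to ex:meter, so conversion factors cannot be inferred.
ex:inch         a ex:length_unit .
ex:meter        a ex:length_unit .
ex:square_meter a ex:area_unit .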
Measurement Unit Ontology (MUO): MUO is a modular ontology specifically designed to represent units in a combinatorial fashion. MUO includes definitions of classes and properties conforming to the general design principles of the upper-level ontology DOLCE[254]. The ontology models three main, disjoint entities: units of measurement, physical qualities, and common prefixes for units of measurement. Similar to UO, MUO defines URIs for the most common units of measurement, "physical qualities"[255], and prefixes, to be shared and reused in different domain ontologies. Contrary to UO, every unit of measurement is attributed to a physical quality.

The most important feature of MUO, which is not present in UO, is that complex units of measure can be derived from base units in a modular fashion. Two types of units are generally dealt with in MUO (as in EngMath, QUDT and OM), Base Units and Derived Units, defined as follows.

Base Units are units that are not derived from any other unit; base units can be used to derive other units (as with the SI base units). It should be noted that even though "kilogram" is considered a base unit in SI, it is composed of the prefix kilo plus the base unit "gram"; in this sense, kilogram is an exception among the base units of the SI system, which otherwise defines independent base units such as the meter (m). Here, "gram" is treated as the base and kilogram is defined as an extension of it.

Derived Units are units obtained from combinations of base units to represent "derived physical qualities" as defined by DOLCE. In the formal representation of physical qualities and their associated units, MUO defines the property muo:derivesFrom to express the relationship between a derived unit and the units it is derived from. Derived units are further divided into simple and complex derived units.

Simple Derived Units are units derived from exactly one base unit. For instance, the millimeter (mm) can be derived from the meter (m); these are units that can be defined by attaching a prefix to a base unit. MUO also recognizes units that, although derived from exactly one base unit, have a different dimension, for instance the square meter (m2). For such cases another property, muo:dimensionalSize, is added to account for the dimensionality difference.

Complex Derived Units are units derived from more than one base unit. For instance, consider the Body Mass Index (BMI), a statistical measure that compares a person's weight and height and is used to estimate a healthy body weight based on a person's height. In the International System of Units (SI)[15], BMI is defined as[16]:

BMI = body mass (kg) / (height (m))^2

BMI defined in this way has the unit kilogram per square meter (kg/m2), and this unit can be defined as follows using MUO:

kilogram-per-meter-square rdf:type muo:ComplexDerivedUnit ;
    muo:derivesFrom ucum:kilogram ;
    muo:derivesFrom :meter-squared .

meter-squared rdf:type muo:SimpleDerivedUnit ;
    muo:derivesFrom ucum:meter ;
    muo:dimensionalSize "2"^^xsd:float .

The main advantage of MUO over UO is that it proposes a convenient framework for defining new units of measurement in terms of existing ones. However, similar to UO, it lacks the numerical factors required for unit conversion, and it is not capable of expressing the relationship between units that are used for the same physical quality but are not linked together by a prefix (e.g. inch and centimeter). Moreover, the upper class "derived physical quality" does not exist in MUO, and thus a newly defined composite unit cannot be linked to a "derived physical quality" using the muo:measuresQuality property; for instance, the complex unit (kg/m2) defined above cannot be linked to any physical quality in MUO. In addition, the physical qualities are defined as OWL instances, and thus a hierarchy between physical qualities cannot easily be established; for example, it is not straightforward to define a new physical quality "angular velocity" as a specialization of "velocity".
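To make this last point concrete, the fragment below contrasts the two modelling styles; the URIs are illustrative, not actual MUO terms. When qualities are modelled as classes, the standard rdfs:subClassOf property provides the hierarchy directly, whereas instances would require an ad hoc specialization property.

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/qualities#> .

# Qualities modelled as classes: the hierarchy is expressed with rdfs:subClassOf
ex:AngularVelocity rdfs:subClassOf ex:Velocity .

# Qualities modelled as instances (the MUO style): rdfs:subClassOf is not applicable,
# so a dedicated, non-standard property would be needed to relate the two qualities.
ex:angular-velocity rdf:type ex:PhysicalQuality .
ex:velocity         rdf:type ex:PhysicalQuality .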
Ontology for Engineering Mathematics (EngMath): EngMath[150] is an ontology for mathematical modelling in engineering, written in Ontolingua[108]. The ontology provides conceptual foundations for representing mathematical and physical entities such as scalars (http://mathworld.wolfram.com/Scalar.html), vectors (http://mathworld.wolfram.com/Vector.html), tensors (http://mathworld.wolfram.com/Tensor.html), physical quantities (quantifiable qualities), physical dimensions and units, explicitly designed for knowledge-sharing applications in engineering[150]. (Note that "physical quantities" in EngMath, and also in QUDT and OM, are conceptually equivalent to the "physical qualities" of DOLCE and PATO; this may cause some confusion.)

With respect to the unit representation problem, the main additional feature in EngMath (absent from UO and MUO) is the notion of "physical dimensions". The physical dimension of a quantity is an abstraction of the quantity that ignores its magnitude, sign and direction[152], and can be expressed in terms of an independent set of base dimensions[152]. For instance, the quantity Body Mass Index (BMI) has a dimension that can be decomposed into the base dimensions mass (M) and length (L): dim(BMI) = M · L^-2. The base dimensions in the SI system are length (L), mass (M), time (T), electric current (I), temperature (Θ), amount of substance (N) and luminous intensity (J). Additionally, each unit of measure in EngMath is defined with its relationship to the SI units for the fundamental dimensions[154]. It should be noted that there are cases where distinct physical qualities have the same dimension symbol; this often occurs where physical laws are discovered and formalized independently of each other but reduce to the same base quantity kinds[155] (a commonly quoted example is the dimensional equivalence of mechanical torque and energy, which have the same dimensions but are conceptually defined in different manners[155]). EngMath (as opposed to MUO and UO) provides enough information to convert among any pair of units of the same dimension that are either defined as basic units or composed from basic units[154]. The main problem with EngMath is that, since it was developed prior to the uptake and adoption of Semantic Web standards, it is not available in OWL. This problem is addressed by two more recent ontologies, QUDT and OM, which are explained next.

QUDT: The Quantities, Units, Dimensions and Types (QUDT) ontologies are a group of ontologies currently being developed by TopQuadrant and NASA. Originally, they were developed for the NASA Exploration Initiatives Ontology Models (NExIOM) project, a Constellation Program initiative at the Ames Research Center (ARC)[155]. QUDT uses a different terminology from PATO and DOLCE for defining concepts pertaining to measurements. QUDT defines a "Quantity" as an observable property of an object, event or system that can be measured and quantified numerically[155]. Quantities are further differentiated by two attributes, "Quantity Kind" and "Quantity Magnitude". A Quantity Kind is defined as an observable property that can be measured and quantified numerically, such as length or mass; thus "Quantity Kind" in QUDT is conceptually similar to "physical quality" in PATO and DOLCE, though not equivalent, and the mapping between them is not straightforward (see [152] for more information).
A Quantity Magnitude expresses the numerical value of a quantity with respect to a specific measurement unit[155]. Similar to MUO, QUDT defines base and derived units as OWL instances; however, in addition to MUO, QUDT defines base and derived quantity kinds (e.g. Area, which is derived from Length). Base quantity kinds are chosen so that they are orthogonal (no base quantity kind can be expressed as an algebraic relation of one or more other base quantity kinds); a quantity kind that can be expressed as an algebraic relation of one or more base quantity kinds is called a derived quantity kind. Similar to EngMath, QUDT defines "quantity dimensions". The major advantage of defining quantity dimensions is that, with few exceptions that are not considered in QUDT (using the mechanical torque and energy example again, the former is a pseudo-vector while the latter is a scalar[155]), the same units of measure can be applied to quantities with equal dimensionality. QUDT also defines dimensionless quantities and units, such as counts and ratios. Similar to MUO, quantity kinds are OWL instances; however, QUDT provides a property, qudt:generalization, to model hierarchical relationships between quantity kinds. For example, to define a new quantity kind such as "Height", one should link it to Length using the qudt:generalization property. Figure 3.1 shows the high-level representation of the unit "inch" in QUDT, together with a number of the units used for length representation (not all shown in the figure). In terms of coverage of base units, QUDT is fairly comprehensive; however, it lacks a number of derived units (e.g. the centimeter of mercury column commonly used for clinical measurements of blood pressure).

Figure 3.1 High-level representation of the unit "inch" in QUDT. The properties are shown in different colors on the right side of the figure.

QUDT covers a number of major unit systems, such as the CGS system of units, the US customary unit system, the system of imperial units, and so on[256]. However, QUDT emphasizes the International System of Units (SI) for unit representation (for example, as shown in Figure 3.1, the unit "inch" is a member of the class "Not used with SI") and expresses all other units in terms of SI units. It uses two data properties, "conversion offset" and "conversion multiplier", to provide the conversion between any non-SI-based unit and its SI-based equivalent. For instance, for the unit inch, the conversion offset is set to zero and the conversion multiplier is set to 0.0254, while for "degree Fahrenheit" the conversion offset and multiplier are 255.370 and 0.556, respectively, for conversion to Kelvin. These parameters exist even for SI-based units, where the multiplier is (not surprisingly) 1 and the offset is 0. Generally, the following formula is used to carry out the conversion:

Quantity value in SI = (conversion multiplier) × (Quantity value in non-SI unit) + (conversion offset)

This quantitative relationship between different units does not exist in MUO (or UO) and is a major advantage provided by QUDT. It should be noted that the conversion is possible only between units or scales that have the same dimension, represented by the property "quantityKind".
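As an illustration of how this conversion metadata can be exploited at query time, the SPARQL 1.1 sketch below converts a Fahrenheit temperature reading to its SI value using the multiplier-and-offset pattern just described. The instance data (ex: terms) is hypothetical, and the exact QUDT namespaces and property URIs (shown here as qudt:conversionMultiplier and qudt:conversionOffset) should be treated as assumptions, since they vary between QUDT releases.

PREFIX qudt: <http://qudt.org/schema/qudt#>
PREFIX unit: <http://qudt.org/vocab/unit#>
PREFIX ex:   <http://example.org/records#>

SELECT ?reading ?value ?siValue
WHERE {
  # A hypothetical observation recorded in degrees Fahrenheit
  ?reading ex:hasValue ?value ;
           ex:hasUnit  unit:DegreeFahrenheit .

  # Conversion metadata attached to the unit in QUDT
  unit:DegreeFahrenheit qudt:conversionMultiplier ?m ;
                        qudt:conversionOffset     ?b .

  # SI value = multiplier * value + offset (yielding Kelvin in this case)
  BIND ((?m * ?value) + ?b AS ?siValue)
}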
OM: The Ontology of Units of Measure and related concepts (OM) models concepts and relations important to scientific research[156], focusing on units and quantities. OM and QUDT are similar in terms of their high-level design features, and hence we only discuss the differences (the interested reader is referred to [152] for specific details about OM). One difference is that OM defines quantity kinds as OWL classes, and thus reasoning with their hierarchies is more straightforward[152]. Furthermore, although QUDT provides an explicit conversion formula between non-SI and SI-based units, it does not represent submultiple units in terms of their components; for example, the unit centimeter has offset and conversion factors of 0 and 0.01, respectively, but has no "Prefix" property. Additionally, OM (similar to MUO) relates compound units (e.g. kilogram per cubic meter) to their individual constituents (kilogram and cubic meter), whereas QUDT treats units derived from one base unit (e.g. cubic meter) and units derived from multiple base units (e.g. kilogram per cubic meter) all as individuals of the class "Derived Units". (The major difference between MUO and OM in dealing with compound units is that OM specifically defines two properties, numerator and denominator, whereas MUO only provides the property derivesFrom, which effectively treats the numerator and denominator of a compound unit in the same way. The granularity provided by OM allows additional reasoning capabilities that are not possible using MUO, as discussed in the next section.) In OM, the top-level class "compound units" is further divided into three classes: "unit division" (e.g. meter per second), "unit exponentiation" (e.g. meter squared) and "unit multiplication" (e.g. meter kilogram). For instance, for the unit "millimole per cubic centimeter" (mmol/cm3), the numerator and denominator are defined as "millimole" and "cubic centimeter", respectively, where millimole is related to mole by the prefix milli (om:factor = 1e-3) and cubic centimeter is an instance of "unit exponentiation", as shown in Figure 3.2.

Figure 3.2 OM representation of cubic centimeter.

This additional metadata provided by OM allows us to automatically check dimension compatibility and to generate on-the-fly relationships between compatible compound units. As an example, consider the units used for concentration and density in clinical data (below): the labels of such units are highly structured, i.e. compound units can be constructed from the labels of their constituents. Thus, instead of manual curation of individual units, the relationships in OM make it possible to automatically generate the mathematical expressions relating such quantities. Finally, OM provides a number of Web Services, based on the OM ontology, which facilitate more complex tasks such as unit conversion and checking the consistency of dimensions[157].

Clinical unit conversion

Factors for converting conventional units to SI units for selected clinical data[257]. Conversion:
- To convert from the conventional unit to the SI unit, multiply by the conversion factor;
- To convert from the SI unit to the conventional unit, divide by the conversion factor.
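As a worked example of applying the table below, a total cholesterol measurement reported in the conventional unit mg/dL is converted to the SI unit mmol/L by multiplying by the factor 0.0259 listed for Cholesterol:

200 mg/dL × 0.0259 ≈ 5.18 mmol/L

Conversely, dividing by the same factor recovers the conventional value: 5.18 mmol/L ÷ 0.0259 ≈ 200 mg/dL.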
Component Conventional Unit Conversion Factor SI Unit Acetaminophen µg/mL 6.62 µmol/L Acetoacetic acid mg/dL 0.098 mmol/L Acetone mg/dL 0.172 mmol/L Alanine mg/dL 112.2 µmol/L Albumin g/dL 10 g/L Aldosterone ng/dL 0.0277 nmol/L Aluminum ng/mL 0.0371 µmol/L Aminobutyric acid mg/dL 97 µmol/L Amitriptyline ng/mL 3.61 nmol/L Ammonia (as NH3) µg/dL 0.587 µmol/L Androstenedione ng/dL 0.0349 nmol/L Angiotensin I pg/mL 0.772 pmol/L Angiotensin II pg/mL 0.957 pmol/L Antidiuretic hormone pg/mL 0.923 pmol/L Antithrombin III mg/dL 10 mg/L Apolipoprotein A mg/dL 0.01 g/L Apolipoprotein B mg/dL 0.01 g/L Arginine mg/dL 57.4 µmol/L Appendix B: Supporting material for chapter 3 201 Asparagine mg/dL 75.7 µmol/L Bilirubin mg/dL 17.1 µmol/L Bromide mg/dL 0.125 mmol/L Calcium mg/dL 0.25 mmol/L Carotene µg/dL 0.0186 µmol/L Chloride mEq/L 1.0 mmol/L Cholesterol mg/dL 0.0259 mmol/L Citrate mg/dL 52.05 µmol/L Copper µg/dL 0.157 µmol/L Cortisol µg/dL 27.59 nmol/L Cotinine ng/mL 5.68 nmol/L Creatine mg/dL 76.26 µmol/L Creatinine mg/dL 88.4 µmol/L Desipramine ng/mL 3.75 nmol/L Diazepam µg/mL 3.512 µmol/L Digoxin ng/mL 1.281 nmol/L Epinephrine pg/mL 5.46 pmol/L Estradiol pg/mL 3.671 pmol/L Ferritin ng/mL 2.247 pmol/L Fibrinogen mg/dL 0.0294 µmol/L Fluoride µg/mL 52.6 µmol/L Folate ng/mL 2.266 nmol/L Fructose mg/dL 55.5 µmol/L Galactose mg/dL 55.506 µmol/L Glucagon pg/mL 1.0 ng/L Glucose mg/dL 0.0555 mmol/L Glutamine mg/dL 68.42 µmol/L Glycerol (free) mg/dL 108.59 µmol/L Glycine mg/dL 133.3 µmol/L Haptoglobin mg/dL 0.10 µmol/L High-density lipoprotein cholesterol (HDL-C) mg/dL 0.0259 mmol/L Histidine mg/dL 64.45 µmol/L Homocysteine (total) mg/L 7.397 µmol/L Hydroxybutyric acid mg/dL 96.05 µmol/L Hydroxyproline mg/dL 76.3 µmol/L Immunoglobulin A (IgA) mg/dL 0.01 g/L Immunoglobulin D (IgD) mg/dL 10 mg/L Appendix B: Supporting material for chapter 3 202 Immunoglobulin E (IgE) mg/dL 10 mg/L Immunoglobulin G (IgG) mg/dL 0.01 g/L Immunoglobulin M (IgM) mg/dL 0.01 g/L Insulin µIU/mL 6.945 pmol/L Iron, total µg/dL 0.179 µmol/L Iron binding capacity, total µg/dL 0.179 µmol/L lsoleucine mg/dL 76.24 µmol/L lsopropanol mg/L 0.0166 mmol/L Lactate (lactic acid) mg/dL 0.111 mmol/L Lead µg/dL 0.0483 µmol/L Leucine mg/dL 76.237 µmol/L Lipids (total) mg/dL 0.01 g/L Lipoprotein (a) mg/dL 0.0357 µmol/L Lithium mEq/L 1.0 mmol/L Low-density lipoprotein cholesterol (LDL-C) mg/dL 0.0259 mmol/L Lysine mg/dL 68.5 µmol/L Magnesium mg/dL 0.411 mmol/L Manganese ng/mL 18.2 nmol/L Myoglobin µg/L 0.0571 nmol/L Nicotine mg/L 6.164 µmol/L Nitrogen, nonprotein mg/dL 0.714 mmol/L Norepinephrine pg/mL 0.00591 nmol/L Osteocalcin µg/L 0.171 nmol/L Parathyroid hormone pg/mL 1.0 ng/L Phenobarbital mg/L 4.31 µmol/L Phenylalanine mg/dL 60.54 µmol/L Phenytoin µg/mL 3.96 µmol/L Phosphorus mg/dL 0.323 mmol/L Plasminogen mg/dL 0.113 µmol/L Potassium mEq/L 1.0 mmol/L Progesterone ng/mL 3.18 nmol/L Proline mg/dL 86.86 µmol/L Prostate-specific antigen ng/mL 1.0 µg/L Protein, total g/dL 10.0 g/L Prothrombin g/L 13.889 µmol/L Protoporphyrin, erythrocyte µg/dL 0.01777 µmol/L Appendix B: Supporting material for chapter 3 203 Pyruvate mg/dL 113.6 µmol/L Quinidine µg/mL 3.08 µmol/L Salicylate mg/L 0.00724 mmol/L Serine mg/dL 95.2 µmol/L Serotonin (5-hydroxytryptamine) ng/mL 0.00568 µmol/L Sodium mEq/L 1.0 mmol/L Testosterone ng/dL 0.0347 nmol/L Theophylline µg/mL 5.55 µmol/L Threonine mg/dL 83.95 µmol/L Thyroglobulin ng/mL 1.0 µg/L Transferrin mg/dL 0.01 g/L Triglycerides mg/dL 0.0113 mmol/L Tryptophan mg/dL 48.97 µmol/L Tyrosine mg/dL 55.19 µmol/L Urea nitrogen mg/dL 0.357 
mmol/L Uric acid mg/dL 59.48 µmol/L Vitamin A (retinol) µg/dL 0.0349 µmol/L Vitamin B6 (pyridoxine) ng/mL 4.046 nmol/L Vitamin B12 (cyanocobalamin) pg/mL 0.738 pmol/L Vitamin C (ascorbic acid) mg/dL 56.78 µmol/L Vitamin D( 25-Hydroxyvitamin D) ng/mL 2.496 nmol/L Vitamin E mg/dL 23.22 µmol/L Vitamin K ng/mL 2.22 nmol/L Warfarin µg/mL 3.247 µmol/L Zinc µg/dL 0.153 µmol/L Appendix C: Supporting material for chapter 4 204 Appendix C: Supporting material for chapter 4 A B DBP binary risk classification SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:HighRiskDBPRecord . ?patientrecord cardio:ExpertDiastolicGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:DiastolicBloodPressure . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-meter-of-mercury-column . } SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:LowRiskDBPRecord . ?patientrecord cardio:ExpertDiastolicGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:DiastolicBloodPressure . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-meter-of-mercury-column . } Appendix C: Supporting material for chapter 4 205 A B Chol binary risk classification SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:HighRiskCholesterolRecord . ?patientrecord cardio::ExpertCholesterolGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumCholesterolConcentration . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-mol-per-liter } SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:LowRiskCholesterolRecord . ?patientrecord cardio::ExpertCholesterolGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumCholesterolConcentration . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-mol-per-liter } 206 A B HDL binary risk classification SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:HighRiskHDLRecord . ?patientrecord cardio:ExpertHDLGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumHDLConcentration . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-mol-per-liter } SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:LowRiskHDLRecord . ?patientrecord cardio::ExpertHDLGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumHDLConcentration . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-mol-per-liter } // 207 A B TG binary risk classification SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:HighRiskTriglycerideRecord . ?patientrecord cardio::ExpertTriglycerideGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumTriglycerideConcentration . ?attr cardio:hasMeasurement ?msr . ?msr cardio:hasValue ?val . ?msr cardio:hasUnit cardio:milli-mol-per-liter } SELECT ?patientrecord ?val ?riskgrade FROM WHERE { ?patientrecord rdf:type cardio:LowRiskTriglycerideRecord . ?patientrecord cardio::ExpertTriglycerideGrade ?riskgrade . ?patientrecord cardio:hasAttribute ?attr . ?attr rdf:type cardio:SereumTriglycerideConcentration . 
  ?attr cardio:hasMeasurement ?msr .
  ?msr cardio:hasValue ?val .
  ?msr cardio:hasUnit cardio:milli-mol-per-liter
}

LDL binary risk classification: (A) high-risk records; (B) low-risk records.

(A)
SELECT ?patientrecord ?val ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:HighRiskLDLRecord .
  ?patientrecord cardio:ExpertLDLGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:SerumLDLConcentration .
  ?attr cardio:hasMeasurement ?msr .
  ?msr cardio:hasValue ?val .
}

(B)
SELECT ?patientrecord ?val ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:LowRiskLDLRecord .
  ?patientrecord cardio:ExpertLDLGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:SerumLDLConcentration .
  ?attr cardio:hasMeasurement ?msr .
  ?msr cardio:hasValue ?val .
}

BMI binary risk classification: (A) high-risk records; (B) low-risk records.

(A)
SELECT ?patientrecord ?val ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:HighRiskBMIRecord .
  ?patientrecord cardio:ExpertBMIGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:BodyMassIndex .
  ?attr cardio:hasMeasurement ?msr .
  ?msr cardio:hasValue ?val .
}

(B)
SELECT ?patientrecord ?val ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:LowRiskBMIRecord .
  ?patientrecord cardio:ExpertBMIGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:BodyMassIndex .
  ?attr cardio:hasMeasurement ?msr .
  ?msr cardio:hasValue ?val .
}

Framingham risk classification: low-risk and high-risk record queries.

Low-risk records:
SELECT ?patientrecord ?calculatedrisk ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:LowRiskFraminghamScoreRecord .
  ?patientrecord cardio:ExpertFraminghamGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:GeneralCVD10YearFraminghamRiskScore .
  ?attr cardio:hasValue ?calculatedrisk
}

High-risk records:
SELECT ?patientrecord ?calculatedrisk ?riskgrade
FROM
WHERE {
  ?patientrecord rdf:type cardio:HighRiskFraminghamScoreRecord .
  ?patientrecord cardio:ExpertFraminghamGrade ?riskgrade .
  ?patientrecord cardio:hasAttribute ?attr .
  ?attr rdf:type cardio:GeneralCVD10YearFraminghamRiskScore .
  ?attr cardio:hasValue ?calculatedrisk
}

Summary of binary and Framingham risk classification for 100 randomly selected patients. The numbers of false positives (FP) and false negatives (FN) are calculated with respect to the manual annotations as the reference classes. Note that totals may vary slightly across measurements due to missing values. For SBP, DBP, and LDL, no adjustment to the guidelines was required. For the other measurements, the original and modified guidelines are presented in OWL following each table.
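As an illustration of how the queries above support the error counts reported below, a false-positive count for DBP could be obtained directly with an aggregate query of the following form. This is a sketch only: it assumes that the expert grade is encoded numerically, with 0 denoting "not at risk", as in the sample spreadsheet rows shown later in this appendix.

SELECT (COUNT(DISTINCT ?patientrecord) AS ?falsePositives)
WHERE {
  ?patientrecord rdf:type cardio:HighRiskDBPRecord .        # automatically classified as at risk
  ?patientrecord cardio:ExpertDiastolicGrade ?riskgrade .
  FILTER (?riskgrade = 0)                                   # assumed encoding: 0 = expert grade "not at risk"
}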
SBP | Expert | Automatic | # False Positives | # False Negatives
At Risk | 22 | 22 | 0 | 0
Not At Risk | 69 | 69 | 0 | 0

DBP | Expert | Automatic | # False Positives | # False Negatives
At Risk | 18 | 18 | 0 | 0
Not At Risk | 73 | 73 | 0 | 0

LDL | Expert | Automatic | # False Positives | # False Negatives
At Risk | 21 | 21 | 0 | 0
Not At Risk | 70 | 70 | 0 | 0

Original AHA guideline in OWL (total cholesterol):

HighRiskCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[>= 5.0]))))

LowRiskCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[< 5.0]))))

Modified guideline in OWL (total cholesterol):

HighRiskCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[>= 5.0]))))

LowRiskCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[< 5.0]))))

Original AHA guideline in OWL (HDL cholesterol):

HighRiskHDLCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumHDLCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[<= 1.03]))))

LowRiskHDLCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumHDLCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[> 1.55]))))

Modified guideline in OWL (HDL cholesterol):

HighRiskHDLCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumHDLCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[<= 0.89]))))

LowRiskHDLCholesterolRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumHDLCholesterolConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[> 0.89]))))

Original AHA guideline in OWL (triglycerides):

HighRiskTriglycerideRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumTriglycerideConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[>= 2.26]))))

LowRiskTriglycerideRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumTriglycerideConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[< 1.69]))))

Modified guideline in OWL (triglycerides):

HighRiskTriglycerideRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumTriglycerideConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[>= 2.63]))))

LowRiskTriglycerideRecord = PatientRecord and
  (sio:hasAttribute some (cardio:SerumTriglycerideConcentration and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:mili-mole-per-liter) and
      (sio:hasValue some double[< 2.63]))))

Original AHA guideline in OWL (BMI):

HighRiskBMIRecord = PatientRecord and
  (sio:hasAttribute some (cardio:BodyMassIndex and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:kilogram-per-meter-squared) and
      (sio:hasValue some double[>= 25.0]))))

LowRiskBMIRecord = PatientRecord and
  (sio:hasAttribute some (cardio:BodyMassIndex and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:kilogram-per-meter-squared) and
      (sio:hasValue some double[< 25.0]))))

Modified guideline in OWL (BMI):

HighRiskBMIRecord = PatientRecord and
  (sio:hasAttribute some (cardio:BodyMassIndex and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:kilogram-per-meter-squared) and
      (sio:hasValue some double[>= 26.0]))))

LowRiskBMIRecord = PatientRecord and
  (sio:hasAttribute some (cardio:BodyMassIndex and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:kilogram-per-meter-squared) and
      (sio:hasValue some double[< 26.0]))))

The original representation of two rows of the dataset in Excel. Risk grades shown in green are those for which the manual and automatic annotations agree; risk grades shown in red are those for which they disagree. As can be seen for the second record, there is considerable disagreement between the manual and automatic classifications. This is because the numerical measurement values lie in the "gray areas" of the classification; for instance, the BMI value is greater than 25 but less than 26, and so on. All but the Framingham risk grade (last column) can be "fixed" by adjusting the thresholds slightly. The FRS is complicated by several other factors that may not be present in the data (see discussion). The RDF representation of the file is available at the link that follows.

PatientID | SBP | DBP | TOTALCHOL | HDL | TG | AGE | GENDER | HEIGHT | WEIGHT | TGGR | HDLGR | LDLGR | CHOLGR | BMIGR | DBPGR | SBPGR | RISKGR
1 | 128 | 80.1 | 227 | 55 | 84 | 77 | M | 1.8288 | 78.1818 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
2 | 130 | 84.2 | 202 | 37 | 167 | 68 | M | 1.8034 | 82.2727 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2

Sample data and instructions for the BMI SADI service:
Test input RDF for the BMI service: http://cardio-soroush.rhcloud.com/inputbmi.rdf
Output RDF for the BMI service: http://cardio-soroush.rhcloud.com/outputbmi.rdf
SADI service URI: http://cardio-soroush.rhcloud.com/BMICalculator

To test the BMI service using the Poster plugin, Mozilla Firefox must be installed on your machine. Once Firefox is installed, go to https://addons.mozilla.org/en-US/firefox/addon/poster/ to install the plugin; this requires restarting Firefox. Then go to Tools -> Poster. You should see a window similar to the snapshot shown below. In the URL field, enter the URL of the service (http://cardio-soroush.rhcloud.com/BMICalculator), and in the File field browse to the content of inputbmi.rdf (http://cardio-soroush.rhcloud.com/inputbmi.rdf) stored locally on your machine, since the Poster plugin requires the input to be provided as a file rather than as a URL. Click Post, and a pop-up window appears containing the output (which should be equal to outputbmi.rdf).
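As a rough sanity check on the service, the first sample record above (height 1.8288 m, weight 78.1818 kg) gives BMI = weight / height^2 = 78.1818 / 1.8288^2 ≈ 23.4 kg/m^2, which is below both the original (25) and modified (26) thresholds; such a record would therefore be classified as a LowRiskBMIRecord, consistent with its BMIGR value of 0 in the table above.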
Appendix D: Supporting material for chapter 5

- Total # of medications: 1663
- Average # of medications per patient: 6.9
- # of medications for which no spelling suggestion was found: 68
- # of unique medications after canonicalization: 181
- Medication histogram (frequency >= 5) given in the table below; the complete histogram is shown in the figure that follows.

Medication | Frequency
ASPIRIN | 350
DIPYRIDAMOLE | 265
DILTIAZEM | 88
NIFEDIPINE | 67
NITROGLYCERIN | 59
DIGOXIN | 41
ISOSORBIDEDINITRATE | 39
HYDROCHLOROTHIAZIDE | 35
GEMFIBROZIL | 26
ATENOLOL | 23
LOVASTATIN | 22
INSULIN | 20
PROPRANOLOL | 17
ACETAMINOPHEN | 15
WARFARIN | 15
ISOSORBIDE | 14
ENALAPRIL | 13
FUROSEMIDE | 13
METOPROLOL | 13
VERAPAMIL | 13
POTASSIUMCHLORIDE | 12
CHOLESTYRAMINERESIN | 11
ESTROGENS | 11
PROCAINAMIDE | 11
QUINIDINE | 11
CONJUGATED(USP) | 10
THYROXINE | 10
ALLOPURINOL | 9
CHLORPROPAMIDE | 9
CIMETIDINE | 9
IBUPROFEN | 8
ALPRAZOLAM | 7
GLYBURIDE | 7
PIROXICAM | 7
RANITIDINE | 7
CALCIUM | 6
DIAZEPAM | 6
FISHOILS | 6
INDOMETHACIN | 6
NAPROXEN | 6
PROBUCOL | 6
CAPTOPRIL | 5
DOCUSATE | 5
NADOLOL | 5
NIACIN | 5
PREDNISONE | 5

- Medications for which no treatment/prevention disease is attributed in NDF-RT (this is surprising, and potentially a suggestion for improving the coverage of NDF-RT):
  Total # of occurrences: 283
  Total # of unique medications for which no disease is attributed:

[Figure: complete medication-frequency histogram; frequency (0-400) on the y-axis, medication names on the x-axis.]

The highlighted rows in the table below show where the NDF-RT API yields results that differ from the ontology; however, this is of no consequence for the purposes of this study.
- IRON (N0000146077) is redirected to FERROUS SULFATE (N0000146041)
- ALUMINUM (N0000146330) to ALUMINUM HYDROXIDE (N0000146920)
- COLCHCINE (N0000148412) to COLCHICINE (N0000145804)
- FLUORIDE (N0000146026) to SODIUM FLUORIDE (N0000146042)

Display Name (NDF-RT ID) | Frequency of occurrence
DIPYRIDAMOLE (N0000146237) | 185
ISOSORBIDE DINITRATE (N0000146183) | 36
RANITIDINE (N0000148012) | 7
CALCIUM (N0000146031) | 6
NICOTINIC ACID (N0000029451) | 5
IRON (N0000146077) | 4
MULTIVITAMINS (N0000020121) | 3
POTASSIUM (N0000146069) | 3
VITAMIN E (N0000029271) | 3
ALUMINUM (N0000146330) | 3
COLCHCINE (N0000148412) | 2
PHENYLBUTAZONE (N0000146718) | 2
TERFENADINE (N0000147493) | 2
VITAMIN C (N0000146025) | 2
VITAMIN B (N0000148535) | 2
CASCARA SAGRADA (N0000147280) | 1
COD LIVER OIL (N0000145787) | 1
POVIDONE IODINE (N0000147472) | 1
KETOROLAC (N0000178380) | 1
ESTROGENS (N0000029181) | 1
ZINC (N0000146033) | 1
CASTOR OIL (N0000145784) | 1
PEPSIN (N0000146741) | 1
METHYSERGIDE (N0000147921) | 1
FLUORIDE (AS SODIUM FLUORIDE) (N0000146026) | 1

Medications (raw dataset entries) for which the NDF-RT API was originally unable to suggest a name:

AMINOPHYLLIN 100 KOLYLANTAL ASCRIPTIN AD LASIX/THEODIU ASCRIPTIN/PERSANTINE LOPID 100MG ASPIRIN CHILDRENS LOPRESSOR 50MG ASPIRIN-BABY LY DIABENESE BECLOVEN INHALER MAX EPA BLOCADREN5/DIPYRID MG DIPYRIDAMALE BUTONOLOM MG PERSONTINE CALAN SR MODUERTEI CHILDRENS ASPIRIN NITRO PATCH CHLOROTHIA NOVAFED A CAP DARVACET NPH INSULIN DAYALETS + IRON PM-MAXITROL OPTH DBECOVISE INHL PROMEGA FISH OIL DIPIRIDANIDE QUESTRAN POWDER DIPYIDAINDLE QUINDEX EXT DIURIL QUINIDEX EXTENTA DYAZIDE CAP Q-VIL DYSYUDAMOLE SPIRIZIDE EEDPROMETHAZ VC COD TENERELIC ADALAT 10DPS PROCARD THERN COMBEX H-P TRNS DRMP NITRO TID SPIRIZIDE U-100 REGULAR TIDCHOYDILL XDLMAX EPA TRENTAL EST TEMAZEPAM TRENTAL 400 INDERAL L.A TRENTALHOECHST ISOSORBIDE DINIT

Medications implicated in diabetes, together with their relative frequency:

NDF ID | Freq. | Display Name
N0000147876 | 20 | INSULIN
N0000146198 | 9 | CHLORPROPAMIDE
N0000146417 | 7 | GLYBURIDE
N0000145940 | 3 | INSULIN,REGULAR,HUMAN/SEMISYNTHETIC
N0000146241 | 3 | TOLBUTAMIDE
N0000145950 | 3 | INSULIN,REGULAR,HUMAN/SEMISYNTHETIC
N0000146830 | 3 | GLIPIZIDE
N0000146187 | 1 | TOLAZAMIDE
N0000145888 | 1 | ACETOHEXAMIDE
N0000145944 | 1 | INSULIN,NPH,HUMAN/rDNA
N0000145959 | 1 | INSULIN,NPH,HUMAN/SEMISYNTHETIC

Medications implicated in Hypercholesterolemia, together with their relative frequency:

NDF ID | Display Name | Frequency
N0000146952 | GEMFIBROZIL | 25
N0000147602 | LOVASTATIN | 22
N0000147114 | CHOLESTYRAMINE | 11
N0000148773 | FISH OIL | 6
N0000146924 | PROBUCOL | 6
N0000146039 | NIACIN | 5
N0000147570 | NICOTINIC ACID | 5
N0000146804 | CLOFIBRATE | 3
N0000148004 | PSYLLIUM | 3
N0000148255 | PRAVASTATIN | 1

Confirming the results for hypertension using other external sources

We examined the cases classified as "under treatment" by the machine and "not under treatment" by the expert, and re-evaluated those records using the RDF versions of DrugBank [20] and Diseasome [21], exposed as SADI lookup services. We generated the following SPARQL query, resolved through SHARE, to list the drugs linked to hypertension. In all cases we were able to find at least one medication linked to "hypertension" through the property diseasome:possibleDrug, confirming the results obtained from NDF-RT.

PREFIX drugbank:
PREFIX diseasome:
SELECT ?drugName
WHERE {
  ?hypertension diseasome:name "Hypertension" .
  ?hypertension diseasome:possibleDrug ?drug .
  ?drug drugbank:genericName ?drugName .
}
The list of medications and their frequencies in the studied dataset which, according to NDF-RT, may be used (may_treat property) to treat (a) HBP, (b) Hypercholesterolemia, and (c) Diabetes, respectively. The medication FISH OIL is in a may_treat relationship with both Hypercholesterolemia and Hypertension.

(a) Hypertension (HBP):

NDF ID | Display Name | Frequency
N0000147814 | DILTIAZEM | 88
N0000146717 | NIFEDIPINE | 66
N0000145909 | NITROGLYCERIN | 59
N0000145995 | HYDROCHLOROTHIAZIDE | 35
N0000146784 | ATENOLOL | 23
N0000148001 | PROPRANOLOL | 17
N0000147835 | ENALAPRIL | 13
N0000146188 | FUROSEMIDE | 13
N0000147924 | METOPROLOL | 13
N0000148054 | VERAPAMIL | 13
N0000148773 | FISH OIL | 6
N0000145994 | CAPTOPRIL | 5
N0000145983 | NADOLOL | 5
N0000146248 | METHYLDOPA | 4
N0000148034 | TIMOLOL | 4
N0000147986 | PRAZOSIN | 3
N0000147699 | AMILORIDE | 2
N0000146099 | CHLOROTHIAZIDE | 2
N0000146240 | CHLORTHALIDONE | 2
N0000147886 | LABETALOL | 2
N0000146234 | SPIRONOLACTONE | 2
N0000146253 | TRIAMTERENE | 2
N0000145992 | BENDROFLUMETHIAZIDE | 1
N0000146207 | CLONIDINE | 1
N0000147063 | PINDOLOL | 1
N0000146925 | TRICHLORMETHIAZIDE | 1
N0000146756 | UREA | 1

(b) Hypercholesterolemia:

NDF ID | Display Name | Frequency
N0000146952 | GEMFIBROZIL | 25
N0000147602 | LOVASTATIN | 22
N0000147114 | CHOLESTYRAMINE | 11
N0000148773 | FISH OIL | 6
N0000146924 | PROBUCOL | 6
N0000146039 | NIACIN | 5
N0000147570 | NICOTINIC ACID | 5
N0000146804 | CLOFIBRATE | 3
N0000148004 | PSYLLIUM | 3
N0000148255 | PRAVASTATIN | 1

(c) Diabetes:

NDF ID | Display Name | Frequency
N0000147876 | INSULIN | 20
N0000146198 | CHLORPROPAMIDE | 9
N0000146417 | GLYBURIDE | 7
N0000146241 | TOLBUTAMIDE | 3
N0000145950 | INSULIN,REGULAR,HUMAN/SEMISYNTHETIC | 3
N0000146830 | GLIPIZIDE | 3
N0000146187 | TOLAZAMIDE | 1
N0000145888 | ACETOHEXAMIDE | 1
N0000145944 | INSULIN,NPH,HUMAN/rDNA | 1

** FISH OIL is common to the HBP and Hypercholesterolemia lists.

Framingham risk classification before and after incorporating drug information

Before incorporation of drug information:

Risk class | Manual | FP | FN | TP | TN | Precision | Recall | Accuracy
High | 54 | 6 | 38 | 15 | 189 | 0.71 | 0.28 | 0.82
Moderate | 125 | 41 | 37 | 87 | 81 | 0.68 | 0.71 | 0.68
Low | 67 | 44 | 13 | 53 | 134 | 0.55 | 0.80 | 0.76

After incorporation of drug information:

Risk class | Manual | FP | FN | TP | TN | Precision | Recall | Accuracy
High | 54 | 6 | 21 | 33 | 186 | 0.84 | 0.61 | 0.89
Moderate | 125 | 34 | 32 | 91 | 89 | 0.72 | 0.74 | 0.73
Low | 67 | 29 | 13 | 54 | 150 | 0.65 | 0.81 | 0.83
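For reference, the derived columns follow the standard definitions. For example, for the High class before drug information was incorporated: Precision = TP / (TP + FP) = 15 / (15 + 6) ≈ 0.71; Recall = TP / (TP + FN) = 15 / (15 + 38) ≈ 0.28; Accuracy = (TP + TN) / (TP + TN + FP + FN) = (15 + 189) / 248 ≈ 0.82.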
Appendix E: Supporting material for chapter 6

Medication Ranking Proposal

The patient datasets for this study should meet the following criteria, which we believe are met by a large number of legacy datasets.

- A list of patients annotated for a number of conditions, such as "diabetes" or "chronic heart failure", or ideally annotated as "being treated" for those conditions, as shown in figure 1. Preferably the database will also have annotations for subtypes of the conditions (e.g., "Pulmonary Hypertension", "Systemic Hypertension", "Type 1 Diabetes", and so on).
- A list of medications prescribed to those patients, ideally with temporal information (time of prescription, frequency, duration), form (tablet, injection, etc.), and dosage information. The medications are attributed to the patients as a cocktail.
- Multiple resources, multiple physicians (to account for physician choice), and diverse demographics, ideally from diverse geographical locations, so that meaningful head-to-head comparisons can be made between different medications, physician choices, availability in different geographical locations, cost, and so on. Based on our experience, we believe we could start with 4-5 resources, each containing data for roughly 200 or more patients, so that the data can be fed into machine-learning algorithms.

Vision

If successful, our framework should be able to dynamically answer both simple and sophisticated queries, such as:

Question 1: What are the medications of choice for diabetes, in order of their popularity?
Answer, in order: e.g., Insulin, Tolbutamide, Metformin (each with a relative score).

Question 2: A 65-year-old male Caucasian patient, who suffers from the chronic condition hypertension and takes medications X, Y, and Z, is recently diagnosed with diabetes; what is the medication ranking for this person if he lives in a rural area in eastern Canada?
Answer, in order: e.g., Metformin, Insulin, Glipizide (each with a relative score).

Guidelines for classification of phenotypes in the VASST dataset

- Fever: Body temperature > 38 degrees Celsius OR Body temperature < 36 degrees Celsius (note that the definition of fever in this context includes hypothermia in addition to hyperthermia)
- Tachycardia: Heart rate > 90 beats per minute
- Tachypnea: Respiration rate > 20 breaths per minute OR Carbon dioxide tension (PaCO2) < 32 mmHg OR Patient on mechanical ventilation
- Pathologic white blood cell (WBC) count: WBC count > 12000 cells/cubic millimeter OR WBC count < 4000 cells/cubic millimeter
- Respiratory failure: Patient is ventilated AND PaO2/FiO2 < 300 (PaO2/FiO2: ratio of partial pressure of arterial oxygen to fraction of inspired oxygen)
- Coagulopathy (bleeding disorder): Platelet count < 80000 per cubic millimeter
- Abnormal QTc waveform: QTc duration < 350 OR QTc duration > 460

Rules learned for the definition of phenotypes using the entire dataset; the results of 10-fold cross-validation experiments, in terms of accuracy (acc), precision (P), and recall (R), are summarized in the table below.

- Fever: (TEMP highest value <= 38) OR (TEMP lowest value >= 35.9)
- Tachycardia: (HEART_RATE highest value <= 92)
- Tachypnea: No relevant rules were found for Tachypnea, as only approximately 1% of patients (8 of 800) were classified as not having Tachypnea. Additionally, the majority of patients classified as having Tachypnea (770 of 792) were under mechanical ventilation, which makes the process of finding patterns extremely difficult (similar to the Hypertension case discussed earlier).
- Pathologic WBC count (note that the units are different from the ones used in the guideline): (WBC_COUNT highest value > 12.3) OR (WBC_COUNT lowest value <= 4)
- Coagulopathy (note that the units are different from the ones used in the guideline): (PLATELET <= 87000)
- Abnormal QTc: (QTc < 397) OR (QTc > 565)

Precision, recall, and accuracy of the rules learned by JRip for the different phenotypes listed above:

Metric | Fever | Tachycardia | Tachypnea | Pathological WBC | Coagulopathy | Abnormal QTc
Precision | 0.89 | 0.972 | - | 0.705 | 0.759 | 0.671
Recall | 0.94 | 0.972 | - | 0.728 | 0.854 | 0.981
Accuracy (%) | 86.85 | 94.74 | - | 89.11 | 89.22 | 93.33
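For illustration, a criterion such as the fever guideline above could be expressed as an OWL class restriction under the same modelling conventions as the guideline classes of Appendix C. The sketch below follows the Appendix C style of sio: properties; the class name FeverPatientRecord and the terms cardio:BodyTemperature and cardio:degree-Celsius are illustrative assumptions rather than terms taken from the deployed ontology.

FeverPatientRecord = PatientRecord and
  (sio:hasAttribute some (cardio:BodyTemperature and
    sio:hasMeasurement some (sio:Measurement and
      (sio:hasUnit value cardio:degree-Celsius) and
      (sio:hasValue some (double[> 38.0] or double[< 36.0])))))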