The SHARE System: A Semantic Web Based Approach for Evaluating Queries Across Distributed Bioinformatics Databases and Software

by Ben Vandervalk
B. Math. (Computer Science), The University of Waterloo, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
April, 2011
© Ben Vandervalk 2011

Abstract

Many bioinformatics studies require the combined use of data sets and software developed by different research labs. At the current time, accomplishing such studies requires the development of custom scripts that act as “glue” for the independent resources, performing transformations on the data sets that will allow them to be loaded into a single database and/or shuttled through different pieces of software. Due to the tedium and inefficiency of manual data/software integration, many institutions and research groups have sought a more reliable and automatic approach. The most significant integration project in recent years has been the Semantic Web activity of the World Wide Web Consortium (W3C), which aims to automate data integration not only in bioinformatics, but on the WWW as a whole. The goal of the Semantic Web is to interlink data on the web in a manner similar to the way that HTML pages are linked, while at the same time making the data available in a universal form that can be easily processed by software. In this thesis, the author describes a distributed query system called SHARE (Semantic Health and Research Environment) which demonstrates how the available standards and tools of the Semantic Web can be assembled into a framework for automating data and software integration in bioinformatics. We find that while SHARE has a similar architecture to existing query systems, the use of Semantic Web technologies has important advantages for the implementation, maintenance, and addition of new data sources to the system. After reviewing the mechanics of SHARE, we examine the crucial problem of optimizing queries in an environment where statistics about the data sources are typically not available. A query evaluation procedure called GREEDY is presented that addresses this challenge by: (i) interleaving the planning and execution phases of a query, and (ii) learning statistics from the execution of previous queries. We conclude by highlighting the unique strengths of SHARE and GREEDY in relation to other integration systems, and review important areas for future work.

Preface

The design of SHARE was a joint effort of the Wilkinson laboratory. The majority of the programming work for the system, including the core query engine and the live web demonstration, was done by Luke McCarthy, a full-time employee of the Wilkinson laboratory. My own contributions to the programming were the query optimizer and the ability to query SPARQL endpoints as data sources. Chapter 2, which describes the mechanics of the SHARE system, is an expanded and more up-to-date version of material published in: Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson, “SHARE: A Semantic Web Query Engine for Bioinformatics,” in The Semantic Web (ASWC 2009), vol. 5926, pp. 367-369, 2009. The text and figures for Chapter 2 were created from scratch. Chapter 3, which describes the optimization algorithm for SHARE, is a further development of work published in: Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson, “Optimization of Distributed SPARQL Queries Using Edmonds’ Algorithm and Prim’s Algorithm,” in 12th IEEE International Conference on Computational Science and Engineering (CSE 2009), vol. 4, pp. 29-31, 2009.
In particular, the idea for the GREEDY algorithm is based on Section 7 of that paper, titled “Adaptive query execution using Prim’s algorithm”. The text and figures for Chapter 3 were created from scratch.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 The Motivation for Data and Software Integration in Bioinformatics
  1.2 Fundamental Problems of Data and Software Integration
  1.3 Integration Technologies and Related Research Projects
    1.3.1 Data Warehouses
    1.3.2 XML
    1.3.3 The Semantic Web
    1.3.4 Web Services
    1.3.5 Mediator Systems
  1.4 Thesis Outline

2 SHARE System Architecture
  2.1 Introduction
  2.2 The SPARQL Query Language
  2.3 SHARE Data Sources
    2.3.1 SADI Services
    2.3.2 SPARQL Endpoints
  2.4 SHARE Query Resolution
    2.4.1 Overview
    2.4.2 Resolution Process of an Example SHARE Query
    2.4.3 Pseudocode for SHARE Query Resolution
    2.4.4 Validating Inputs for Services
  2.5 SHARE Online Demonstration
  2.6 Source Code
  2.7 Discussion

3 An Adaptive Query Evaluation Algorithm for SHARE
  3.1 Introduction
  3.2 Related Work
  3.3 Key Points of SHARE Query Resolution
    3.3.1 Resolving Triple Patterns
    3.3.2 OWL Inferences About Predicates
  3.4 The Distributed SPARQL Query Optimization Problem
    3.4.1 An Illustrative Example
    3.4.2 Assumptions
  3.5 Secondary Optimization: Intersections of Variable Bindings
  3.6 The GREEDY Optimization Algorithm
    3.6.1 Overview
    3.6.2 Cost Estimation for Query Patterns
  3.7 Evaluation
    3.7.1 Evaluation Procedure
    3.7.2 Results
  3.8 Future Work
  3.9 Conclusion
  3.10 Source Code

4 Conclusion

Bibliography

Appendices

A Supporting Material for Chapter 2
  A.1 An Example RDF Service Description for a SADI Service

B Supporting Material for Chapter 3
  B.1 Variants of Training Queries
    B.1.1 Variants of Query 1
    B.1.2 Variants of Query 3
    B.1.3 Variants of Query 5
  B.2 Randomly Generated Orderings for Test Queries
    B.2.1 Orderings for Query 2
    B.2.2 Orderings for Query 4
    B.2.3 Orderings for Query 6

List of Tables

2.1 Results for example SPARQL SELECT query
2.2 List of SPARQL endpoints currently indexed by SHARE
2.3 Pointers to SHARE source code
3.1 Description of execution steps for an inefficient query ordering
3.2 Description of execution steps for an efficient query ordering
3.3 Description of execution steps without the variable bindings optimization
3.4 Description of execution steps with the variable bindings optimization
3.5 Description of execution steps for Query 2 under BASIC
3.6 Description of execution steps for Query 2 under untrained GREEDY
3.7 Description of execution steps for Query 2 under trained GREEDY
3.8 Description of execution steps for Query 6 under BASIC
3.9 Description of execution steps for Query 6 under untrained GREEDY
3.10 Description of execution steps for Query 6 under trained GREEDY

List of Figures

1.1 Example illustrating the problem of merging two XML datasets
1.2 An example RDF dataset
1.3 Example of a successful RDF-merge
1.4 Example of an unsuccessful RDF-merge
1.5 Using different OWL ontologies with the same RDF dataset
1.6 Example of a Taverna workflow
1.7 Mediator architecture
2.1 An example RDF dataset, to illustrate SPARQL usage
2.2 An example SPARQL SELECT query
2.3 An example SPARQL CONSTRUCT query
2.4 An example SADI service
2.5 Map of generated properties for SADI services
2.6 Querying of remote SPARQL endpoints in SHARE
2.7 A map of generated properties for SPARQL endpoints
2.8 An overview of SHARE query resolution
2.9 Screenshot of the SHARE demonstration site
3.1 Example of an inefficient ordering of triple patterns
3.2 Example of an efficient ordering of triple patterns
3.3 Pseudocode for the variable bindings optimization
3.4 Query to demonstrate the variable bindings optimization
3.5 Execution plans with and without the variable bindings optimization
3.6 Example regression line to determine baseTime and timePerInput
3.7 Six example SPARQL queries for evaluating GREEDY
3.9 State of predicate statistics during experiment
3.10 Execution plans for Query 2 under BASIC, untrained GREEDY, and trained GREEDY
3.11 Execution plans for Query 6 under BASIC, untrained GREEDY, and trained GREEDY

Acknowledgements

Thanks to my supervisor, Mark Wilkinson, for his unfaltering support, patience, and enthusiasm, and thanks to Luke McCarthy for helping me to build off of his excellent work. Thanks to Paul Pavlidis and Laks Lakshmanan, who have sacrificed their time to be on my advisory committee and to review this thesis. This research was funded by the Heart and Stroke Foundation of BC and Yukon, and additional support for the development of SHARE was also provided by CIHR and NSERC. Ongoing development and promotion of the SADI web service standard is made possible by CANARIE.
All research of the Wilkinson Lab is supported by the James Hogg Research Centre of the Providence Heart and Lung Institute.

Chapter 1
Introduction

1.1 The Motivation for Data and Software Integration in Bioinformatics

The development of high-throughput methods in molecular biology, such as next generation sequencing [1], microarrays [2], and chromatin immunoprecipitation [3], has led to an explosion in the quantity and diversity of biological data that is available on the web. In turn, the new data has driven the development of new computational tools for tasks such as gene prediction [4] and protein structure prediction [5, 6]. The rapid growth of bioinformatics data and software is evidenced by the annual “database issue” and “web server issue” of the Nucleic Acids Research (NAR) journal, which have now published papers describing ∼1200 online databases and ∼1400 online analysis tools, respectively. NAR provides an online list of the databases [7], while the analysis tools have been catalogued by the Bioinformatics Links Directory [8].

One of the most intriguing things about the body of available bioinformatics resources is the large number of logical interconnections. In general, the resources describe entities that are closely related (such as genes [9], transcripts [10], and proteins [11]) or describe different aspects of the same entities (such as the sequence [11], localization [12, 13], interaction partners [14–16], or structure of a protein [17]). As such, there are many motivations for integrating datasets and software that have been produced by different laboratories. Some applications of integration include:

assembling comprehensive datasets: For most studies or reviews, researchers will want to obtain the most complete dataset possible, which usually requires extracting and merging data from many publications and websites. This need has led to the development of many databases that play the role of shared repositories for a single type of information. Such repositories exist for mutations in cancer [18], antibody sequences [19], protein interactions [15, 16, 20], protein structures [17], drug targets [21], microarray experiments [22–24], and biological pathways [25–27], to name a few.

combining evidence: Many activities in bioinformatics involve combining evidence in order to increase the confidence of predictions. One such task is the annotation of a newly-sequenced genome, where researchers must integrate direct experimental evidence, computational predictions, and known information about related organisms in order to identify gene-coding regions, non-coding RNAs, synteny blocks, and putative protein functions [28]. There are many other areas of study for which the integration of evidence is essential, such as the reconstruction of phylogenetic trees [29], the prediction of subcellular protein localization [30], and the prediction of protein-protein interactions [20, 31].

interpreting experimental results: Genome-wide association studies [32] and microarray experiments identify potentially long lists of genes that have a significant role in a particular disease or experimental condition. However, these experiments do not explain why these “hits” are significant or how they are biologically related. The researcher may have a large number of questions about the top-ranked genes which require external datasets or analysis, such as:
⇒ “For what other diseases or conditions was this gene found to be significant?”
⇒ “What papers have been published on this gene?”
⇒ “What metabolic, signaling, and regulatory pathways does this gene participate in?”
⇒ “Where do the protein product(s) of this gene localize within the cell?”
⇒ “What other tissues is the gene transcribed in?”
⇒ “What are known interaction partners for the protein product(s) of this gene?”
⇒ “Is this gene alternatively spliced?”
⇒ ...

Such contextual information is crucial for forming hypotheses and for prioritizing genes for further study or validation. In fact, several data integration systems have been designed specifically for prioritizing gene lists [33–35].

comparing experiments: Meta-analyses of microarray experiments are an important example of data integration in bioinformatics. The most common goal of such analyses is to validate differentially expressed genes for a certain disease or set of experimental conditions across experiments [36]; this method of validation is promising because it is significantly more cost-effective than experimentally validating results by PCR. Other types of meta-analyses have been conducted to determine the reliability of microarray results across platforms [37] and to validate pairs of coexpressed genes [38].

executing multi-step data analyses: Many analyses in bioinformatics are conceptualized as pipelines or workflows, in which each step performs a different transformation on the data. For instance, a “first-pass” phylogenetic tree for a gene or protein is typically constructed by: (i) identifying the homologs of the gene/protein with a BLAST search, (ii) creating a multiple sequence alignment (MSA) of the homologs, and (iii) using a phylogenetic tree program to infer evolutionary distances from the MSA [39]. Other tasks that are frequently implemented as pipelines are genome annotation, drug discovery, and microarray analysis.

1.2 Fundamental Problems of Data and Software Integration

The current body of websites, databases, and software that is available to bioinformaticians has been produced by a large number of research groups acting independently and without a shared set of standards. This has inevitably resulted in a large number of incompatible schemas and software interfaces that make the combined use of resources unnecessarily difficult and time-consuming [40]. As the field has grown, the bioinformatics community has recognized the need for standardization, and many XML formats have been developed for the exchange of data within specific domains such as microarrays, molecular interactions, and biological pathways [41]. Unfortunately, there are several integration-related problems which XML alone cannot solve, as will be discussed in Section 1.3.

While all of the problems related to data and software integration can be (and typically are) solved manually on a case-by-case basis, the principal goal of integration systems is to resolve these problems automatically. In particular, the main tasks that integration systems seek to automate are:

discovery of relevant resources: The first step of integration is to identify the available databases and software that need to be combined for the desired analysis.
In current practice, resources are usually discovered manually through internet search engines, literature, and link directories.

matching entities across datasets: Whenever two related datasets are combined, the common entities (e.g. genes) between those datasets must be identified and merged. The ideal solution to this problem is to establish a system of globally-shared identifiers for biological entities; however, no such system yet exists. The LSID [42] identifier and resolution scheme was recently proposed for this purpose, but unfortunately it has failed to gain widespread acceptance by the community. LSID identifiers were based on a URN scheme, and aimed to support guaranteed persistence of identifiers and immutability of data records, among other features. The most common objection to the LSID system was that the resolution of identifiers required specialized software and could not be performed with a standard web browser (for example, see [42, 43]). In the absence of global identifiers, more sophisticated methods must be employed for matching entities across datasets, based on the content of the records. A variety of methods for entity matching have been studied within the field of database research, including rule-based systems, decision trees, and other supervised learning methods [44].

schema integration: Data models such as the relational model, XML, and RDF impose only general constraints on the structure of a dataset. Under these models, a dataset must be encoded as a set of cross-referenced tables, a tree, or a directed graph, respectively. Any remaining decisions about structure are left to the data owner, and so the same dataset may be organized according to a multitude of different schemas. A fundamental problem arises when two parties wish to share datasets that are logically related, but whose schemas are mutually incompatible. Data cannot be transferred directly from one database to the other, but must first be mapped between the two schemas. Typically this mapping must be worked out by hand, and the complexity of the problem grows with the number of databases to be integrated. Schema integration is the central problem of data integration, and has been widely studied in the context of relational databases [45], XML (e.g. [46]), and OWL [47].

connecting software interfaces: In order to accomplish multi-step analyses, an integration system must be able to route data from the output of one program to the input of another in an automated fashion. Pipes [48] are a well-known facility of the Unix command line that was designed precisely for this purpose. A standard Unix program reads its input data from a special file descriptor called STDIN, and writes its output to a special file descriptor called STDOUT. When a series of programs is connected by Unix pipes, the STDOUT of each program is sent to the STDIN of the next, thereby creating a continuous processing pipeline. While the Unix command line is a sturdy and versatile tool, specialized workflow systems have been developed for bioinformatics that go beyond the functionality of Unix pipes in a number of ways. Typically, the functional units of such systems are Web Services rather than locally installed programs, which obviates the need for downloading, installing, and configuring software on the user’s local machine.
More importantly, workflow frameworks automate (or partially automate) a number of fundamental tasks such as service discovery, service invocation, identification of services that may be connected, and execution of an overall pipeline. Beyond this, the most advanced workflow systems are capable of translating an abstract, query-based representation of a workflow into a concrete implementation. SHARE, the subject of this thesis, is one such system.

1.3 Integration Technologies and Related Research Projects

In this section we introduce the main technologies that have been developed for data and software integration, discuss the advantages and disadvantages of each, and highlight related systems that have been developed for use in bioinformatics.

1.3.1 Data Warehouses

Conceptually, warehousing is the most straightforward approach to data integration. Data warehouses are constructed by combining a set of related sources into a single database with a unified schema [49, 50]. Beyond the task of designing a unified schema, the creators of a warehouse must develop scripts for extracting data from the original sources, transforming the data to fit the schema, and loading the data into the warehouse. In the database literature, this is known as the Extract-Transform-Load (ETL) cycle. Procedures for extraction and transformation must typically be written by hand after detailed study of the original sources.

Warehousing is a popular methodology in bioinformatics, and numerous warehouses have been constructed for specific types of data such as structural families of proteins [51], sequences of fungal species [52], ligands [53], and crop-related data [54], to name a few. In addition, several warehouses have been built to integrate popular databases such as GenBank, UniProt, and GO; two warehouses of this type are Atlas [55] and SeqHound [56]. The most widely known warehousing system in biology is the Sequence Retrieval System (SRS) [57]. SRS is a framework for integrating databases that are encoded as collections of plain text files (one file per record), and can be configured to parse a wide variety of syntaxes. The principal functionality offered by SRS is the ability to execute efficient, field-specific keyword searches across a large collection of databases. (EBI maintains a publicly-accessible installation of SRS that currently includes 280 different databases [58].) Some recent data warehousing efforts have utilized RDF as their data model, such as the Neurocommons Knowledge Base [59] and the Pathway Knowledge Base [60]; this approach serves to simplify some aspects of the schema integration problem, as will be discussed in Section 1.3.3.

The warehousing approach to integration is often compared to the opposing methodology of view integration that is used by mediator systems; mediator systems function by translating a user query into a number of subqueries that are issued against the databases in their native locations. The principal advantages of warehouses over mediators are speed and reliability; warehouse users need not rely on the availability or performance of numerous remote data sources, and query processing is not delayed by the transfer of data across the network. A further advantage of warehouses is that the provider has complete control over the organization of the data.
This allows the provider to include additional processing steps that result in a cleaner and simpler database, such as filtering, entity-matching, and translation to a uniform identifier scheme. However, centralized control is also a disadvantage in the sense that it places the full burden of maintenance on the warehouse provider. Warehouses must be updated regularly from the original sources, and data extraction scripts must be repaired whenever the schemas of the original sources change.

1.3.2 XML

XML (eXtensible Markup Language) [61] is a framework for defining application-specific data formats. All XML files are text-based and share a general syntax consisting of tags (names enclosed by angle brackets) and content organized in a hierarchical manner. Listing 1.1 shows an example XML file that encodes pathway information, in accordance with a hypothetical schema called PathwayML. (Hypothetical schemas are used in this section to avoid unnecessarily complicated examples; however, the examples used here are valid XML.) The central standard of XML is XML Schema [46], which allows the user to rigorously specify the rules of an XML format. An XML schema primarily specifies structural relationships between tags, such as parent/child relationships, the relative order of sibling tags, and the number of legal occurrences of a tag in a given context. It also describes the type of content (i.e. datatype) for each tag, and specifies an analogous set of rules for XML attributes (such as “reactionID” in Listing 1.1). Listing 1.2 shows the XML Schema for the PathwayML file of Listing 1.1.

<?xml version="1.0" encoding="UTF-8"?>
<pathway pathwayID="KEGG:hsa00234" xmlns="http://www.pathwayml.org/schema">
  <reaction reactionID="KEGG:R00335">
    <enzyme enzymeID="UniProt:P04035">
      <substrate>PubChem:3649</substrate>
      <product>PubChem:3708</product>
    </enzyme>
  </reaction>
</pathway>

Listing 1.1: Example XML file describing part of a metabolic pathway

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="pathway" type="pathwayType"/>
  <xsd:complexType name="pathwayType">
    <xsd:sequence>
      <xsd:element name="reaction" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="enzyme" minOccurs="1" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="substrate" type="xsd:string"/>
                  <xsd:element name="product" type="xsd:string"/>
                </xsd:sequence>
                <xsd:attribute name="enzymeID" type="xsd:string" use="required"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
          <xsd:attribute name="reactionID" type="xsd:string" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="pathwayID" type="xsd:string" use="required"/>
  </xsd:complexType>
</xsd:schema>

Listing 1.2: Example XML Schema (“PathwayML”) describing the XML file of Listing 1.1

One of the principal features of XML Schema is that schema files are “machine readable”.
This means that, in contrast to a natural language description of a schema, an XML Schema file has a simple and rigid structure that makes automated processing by software a relatively straightforward task. (XML data files are themselves machine readable, as are many other formats such as ASN.1 [62] and RDF [63].) The machine-readability of XML Schema has enabled the development of software libraries (e.g. [64]) that are capable of validating and parsing XML files with respect to any user-defined schema. Thus, XML has eliminated the need to create a custom file parser for every application.

Many XML standards have been developed for the exchange of specific types of data within bioinformatics, such as sequence annotations [65, 66], structures [67, 68], alignments [69], protein interactions [70], expression data [71, 72], and pathways [73, 74]. While initially there was a multitude of competing XML formats for each type of data, most areas of bioinformatics appear to be consolidating. For instance, there were once a number of different XML standards for annotating sequences, such as BSML, AGAVE, GAME, and BioML; however, EMBL, NCBI, and DDBJ now offer their annotations for download in a common XML format called INSDSeq (International Nucleotide Sequence Database). Another example of consolidation is the fact that several protein interaction databases such as DIP [14], MINT [75], and IntAct [76] now publish their data in a common XML format called PSI-MI [70]. Similarly, the pathway standard SBML [73] is now being utilized by a large number of simulation systems and databases [77]. One important area where standardization has not yet been fully achieved is the sharing of microarray expression data. In spite of ongoing standardization work by the MGED society [78], including the development of the MAGE-ML [71] and MAGE-TAB [79] formats, the three largest microarray repositories [22–24] still publish their datasets and descriptions of experimental conditions in different (although highly similar) text-based formats. Only ArrayExpress is currently using the standards developed by MGED.

The establishment of XML standards is an important step forward for bioinformatics because it greatly facilitates the exchange of data between databases. While database owners might still need to develop scripts for importing and exporting XML, in most cases they will only need to worry about mapping their database schema to one or two XML schemas, rather than a myriad of possible formats. However, it is important to note that the existence of XML standards does not address the problem of integrating different types of data. For example, suppose that in addition to the PathwayML file of Listing 1.1, a researcher also has a TargetML file containing data about putative drug targets, as shown in Fig. 1.1. The integration of XML schemas is no easier to automate than the integration of relational schemas [80], and thus in order to create an XML file that incorporates both datasets, the researcher must design a new XML schema by hand. This barrier to merging the two datasets prevents a researcher from readily answering questions such as “What are the natural substrates of enzymes that are targeted by the drug Pravastatin?”
<?xml version="1.0" encoding="UTF-8"?>
<drug drugbankID="DB00175" xmlns="http://www.drugml.org/schema">
  <drugname>Pravastatin</drugname>
  <target>UniProt:P04035</target>
  <target>UniProt:P01234</target>
</drug>

<?xml version="1.0" encoding="UTF-8"?>
<pathway pathwayID="KEGG:hsa00234" xmlns="http://www.pathwayml.org/schema">
  <reaction reactionID="KEGG:R00335">
    <enzyme enzymeID="UniProt:P04035">
      <substrate>PubChem:3649</substrate>
      <product>PubChem:3708</product>
    </enzyme>
  </reaction>
</pathway>

Figure 1.1: An example illustrating the problem of merging two XML datasets. One dataset identifies the targets of the drug Pravastatin (top), while the other dataset gives pathway information (bottom). Although the datasets are related (by the common protein UniProt:P04035), there is no automated procedure for merging them into a single, integrated dataset.

1.3.3 The Semantic Web

The vision of the Semantic Web is the establishment of a global, interconnected network of machine-readable data, analogous to the network of human-readable documents that currently populates the World Wide Web (WWW) [81]. The idea was first conceived by the inventor of the WWW (Tim Berners-Lee), and under his direction, a large amount of work has been conducted by the World Wide Web Consortium (W3C) [82] in order to develop the requisite standards [83]. Although the standards work is largely complete, the Semantic Web is still more a dream than a reality; the vast majority of data owners on the web have not yet shown an interest in participating [84]. Regardless of its long-term success, the Semantic Web effort has provided a number of valuable tools for addressing the data and software integration problems that currently exist in bioinformatics. In this section, we provide an introduction to RDF and OWL, the core standards for the Semantic Web. We highlight the main advantages of the RDF/OWL framework for data integration, in comparison to the prevailing models of relational databases and XML, and describe several bioinformatics projects where the standards are being used.

RDF

RDF (Resource Description Framework) is the data model for the Semantic Web. Under the RDF model, a dataset is encoded as a set of triples, where each triple represents an atomic statement or “fact”. Each triple consists of a subject, a predicate, and an object, and each of these three parts is identified by a URI (Uniform Resource Identifier). URIs are globally unique identifiers for the WWW; in the context of RDF, URIs may represent anything: people, material objects, categories, types of relationships, etc. (The most widely used form of URIs in RDF are URLs, Uniform Resource Locators, such as http://www.yahoo.com.) For instance, supposing that http://www.genome.jp/R00335 represents a certain reaction, http://www.genome.jp/hasEnzyme represents the reaction-to-enzyme relationship, and http://www.uniprot.org/P04035 represents a certain protein, the following triple asserts that the protein is an enzyme for the reaction:

(kegg:R00335, kegg:hasEnzyme, uniprot:P04035)

(These URIs, and the others used in the examples of this section, are hypothetical. At the current time, the majority of data providers in bioinformatics do not yet publish their data in RDF, and thus they do not have established URI schemes for referencing their database records; UniProt is an exception.) A set of such triples constitutes an RDF dataset, and can be visualized as a graph where the subjects and objects are nodes, and the predicates are edges, as in Fig. 1.2. Thus, RDF is a graph-based data model, whereas the relational model is table-based, and the XML model is tree-based. RDF datasets are typically serialized as TURTLE [85] or RDF/XML [86], and may be stored in specialized databases called triple stores [87]. Triple stores are typically queried via the SPARQL query language [88], which will be introduced in Section 2.2.

At the current time, RDF is not a widely used format for publishing data within bioinformatics [89].
In fact, the author is aware of only one data provider that natively publishes RDF: the UniProt protein database [11]. However, several projects have endeavoured to provide publicly accessible, third-party translations of existing bioinformatics resources in RDF form. The largest and longest-standing effort in this area is the Bio2RDF [90] project, which provides “RDFized” versions of popular databases such as KEGG [25], Entrez Gene [9], and PDB [17]. More recently, the LODD (Linked Open Drug Data) project [91] has produced RDF versions of drug-related resources such as DrugBank [21], DailyMed [92], and SIDER [93]. Bio2RDF and LODD house the RDF versions of each database in a separate triple store; however, there have also been several projects that have built RDF-based data warehouses, such as the Neurocommons Knowledge Base [59] (neuroscience resources), the Pathway Knowledge Base [60] (pathway resources), and BioGateway [94] (general resources such as Uniprot and GO).

@prefix kegg: <http://www.genome.jp/> .
@prefix uniprot: <http://uniprot.org/> .
@prefix pubchem: <http://pubchem.ncbi.nlm.nih.gov/> .

kegg:hsa00234 kegg:hasReaction kegg:R00335 .
kegg:R00335 kegg:hasEnzyme uniprot:P04035 .
uniprot:P04035 kegg:hasSubstrate pubchem:C3649 .
uniprot:P04035 kegg:hasProduct pubchem:C3708 .

(a) RDF serialized as TURTLE

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:kegg="http://www.genome.jp/"
         xmlns:uniprot="http://uniprot.org/"
         xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://www.genome.jp/hsa00234">
    <kegg:hasReaction rdf:resource="http://www.genome.jp/R00335"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.genome.jp/R00335">
    <kegg:hasEnzyme rdf:resource="http://uniprot.org/P04035"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://uniprot.org/P04035">
    <kegg:hasSubstrate rdf:resource="http://pubchem.ncbi.nlm.nih.gov/C3649"/>
    <kegg:hasProduct rdf:resource="http://pubchem.ncbi.nlm.nih.gov/C3708"/>
  </rdf:Description>
</rdf:RDF>

(b) RDF serialized as RDF/XML

(c) RDF visualized as a graph: the chain kegg:hsa00234 → kegg:R00335 → uniprot:P04035 → {pubchem:C3649, pubchem:C3708}, connected by the edges kegg:hasReaction, kegg:hasEnzyme, kegg:hasSubstrate, and kegg:hasProduct

Figure 1.2: An example RDF dataset describing part of a metabolic pathway
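To make the querying model concrete, the following is a minimal sketch of how the dataset of Fig. 1.2 might be queried with SPARQL (the language itself is introduced in Section 2.2). As elsewhere in this section, the kegg: URIs are hypothetical.

PREFIX kegg: <http://www.genome.jp/>

# Find each enzyme participating in a reaction of pathway
# hsa00234, together with the enzyme's substrate.
SELECT ?enzyme ?substrate
WHERE {
  kegg:hsa00234 kegg:hasReaction  ?reaction .
  ?reaction     kegg:hasEnzyme    ?enzyme .
  ?enzyme       kegg:hasSubstrate ?substrate .
}

Against the triples of Fig. 1.2, this query would bind ?enzyme to uniprot:P04035 and ?substrate to pubchem:C3649.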
RDF has proven useful in warehouse construction because it greatly simplifies the merging of datasets from different sources, as will be described below.

The use of globally unique identifiers in RDF provides a mechanism for matching entities across datasets, which is essential for data integration (see “matching entities across datasets” in Section 1.2). However, the manner in which these global identifiers should be assigned to entities is not entirely clear. URIs were chosen to play the role of identifiers in RDF because the scheme guarantees uniqueness and is already well established on the WWW. In addition, its connection to the DNS framework gives a clear mechanism for the ownership of URLs, in which a single party has control of each URL and the data that it resolves to. However, the use of URLs in RDF has also been an ongoing source of confusion [89, 95]. For instance, RDF providers have tended to invent their own URLs for every entity that they reference in their data, regardless of whether URLs for those entities already exist elsewhere. This behaviour has arisen from the fact that an HTTP URL can only resolve to one network location, and thus the use of a “foreign URL” to identify an entity (e.g. a protein) will prevent browsing or crawling to the provider’s own unique data about that entity. An additional problem is that, even in cases where an RDF data provider is willing to use an external URL for an entity, he/she has no straightforward method for determining whether such a URL exists. These problems (and several others) motivated the recent development of the LSID (Life Science Identifier) [42] identifier and resolution system. LSID utilizes features of the DNS system to implement resolution to multiple sources, and in addition seeks to guarantee the persistence and immutability of identifiers and their associated data records. However, there has been a great deal of argument about whether specialized resolution mechanisms are truly necessary for the Semantic Web (e.g. [42, 43]), and LSID has not seen widespread acceptance. At the current time, the issues surrounding the creation and resolution of identifiers in RDF remain unresolved.

The principal advantage of RDF as a data model is that it supports an automated procedure for merging data sets. To perform an RDF-merge, one simply adds the set of triples in one dataset to the set of triples in another dataset (discarding duplicates), as depicted in Fig. 1.3. An RDF-merge will yield a coherent dataset provided that:

1. common entities and common relationships are identified by the same URIs, and
2. whenever the same type of data is encoded in both datasets (e.g. reactions), the graph structures used for representing that data are the same.

Figure 1.3: An example of a successful RDF-merge. The pathway triples of Fig. 1.2 and two drug-target triples, (uniprot:P04035, drugbank:targetOf, drugbank:DB00175) and (uniprot:P04035, drugbank:targetOf, drugbank:DB003946), share the node uniprot:P04035, so their union forms a single connected graph.

For instance, Fig. 1.4 shows a scenario where different graph structures are used to encode biochemical reactions. The merged dataset is not coherent and cannot be uniformly queried to extract information.

Figure 1.4: An example of an unsuccessful RDF-merge. The two original datasets represent reaction data using different graph structures: one attaches substrates and products to the enzyme (kegg:hasSubstrate, kegg:hasProduct), while the other attaches them to the reaction (reactiondb:reactant, reactiondb:product). uniprot:P04035 and uniprot:P07096 represent two alternate enzymes that catalyze the same reaction.

The problem illustrated in Fig. 1.4 is a variant of the schema integration problem, which also exists under the relational model and the XML model. On the Semantic Web, the problem is addressed through the use of OWL ontologies, as will be discussed in the following section. Although RDF does not provide a universal solution to the schema integration problem, it is important to note that RDF does provide a solution when the domains of the two datasets differ and common identifiers are used, as in Fig. 1.3.
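As a sketch of the payoff, the query below answers the question posed at the end of Section 1.3.2 (“What are the natural substrates of enzymes that are targeted by the drug Pravastatin?”) directly against the merged graph of Fig. 1.3. The drugbank: namespace URI is an assumption here, as are the other URIs in this section.

PREFIX kegg:     <http://www.genome.jp/>
PREFIX drugbank: <http://drugbank.ca/>

# Natural substrates of enzymes targeted by Pravastatin (DB00175).
SELECT ?substrate
WHERE {
  ?enzyme drugbank:targetOf drugbank:DB00175 .
  ?enzyme kegg:hasSubstrate ?substrate .
}

On the merged triples this binds ?substrate to pubchem:C3649; under the two XML encodings of Fig. 1.1, by contrast, the same question cannot be answered without first designing a unified schema by hand.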
OWL

OWL (Web Ontology Language) is a standard for defining ontologies for use with the Semantic Web. An ontology is often defined as a “formal conceptualization of a domain” [96]. In this context, a “conceptualization” is a set of classes of things (e.g. proteins, genes, chromosomes), together with the relationships that connect them (e.g. a protein is encoded by a gene); a “domain” is any possible area of study (e.g. tertiary structures of proteins). OWL ontologies on the Semantic Web serve two purposes:

1. They provide the vocabulary and structural rules for encoding an RDF dataset for a given domain. In this respect, OWL ontologies are the functional analog of schemas for relational databases and XML.

2. They encode knowledge about a domain in machine-readable form, such that it can be used to make automated inferences about RDF data.
For example, an OWL reasoner might infer that an instance of the class Ligase has at least one substrate and at least one product, given that the class Ligase is a subclass of Enzyme. An OWL ontology represents a set of axioms in a description logic [97], and so it is necessary to understand the basic principles of description logics in order to understand how OWL ontologies achieve the two tasks above. Every description logic represents four kinds of elements:

Individuals: Individuals are the entities that are described and related within the data. In the RDF/OWL framework, individuals are represented by the subject and object URIs of the triples in an RDF dataset.

Properties: Properties are relationships between individuals, and are also referred to as predicates or roles. An OWL ontology provides a vocabulary of predicate URIs that may be used when encoding RDF triples in a given domain; for instance, a pathway ontology corresponding to the dataset in Fig. 1.2 would contain the predicate URIs kegg:hasReaction, kegg:hasEnzyme, etc. In addition, the pathway ontology could assert axioms about predicates, such as “the subject URI of any triple with the kegg:hasReaction predicate belongs to the class Pathway”:

hasReaction(x, y) ⇒ Pathway(x)

In a description logic, classes are represented as unary predicates (e.g. Pathway(x)) and properties are represented as binary predicates (e.g. hasReaction(x, y)). (Classes will be discussed below.) Many other types of axioms regarding predicates are also possible, such as the assertion that one property is the inverse of another:

isReactionOf(y, x) ⇔ hasReaction(x, y)

Assertions: Assertions are statements about individuals. In the RDF/OWL framework, assertions are represented by the triples of an RDF dataset.

Classes: Classes are sets of individuals, and are also referred to as concepts. Classes constitute another part of the vocabulary that is provided by an OWL ontology. The members of a class may be specified explicitly; however, classes may also be defined more generally in terms of axioms. Such axioms describe conditions that are either necessary, or necessary and sufficient, for an individual to be a member of the class. For instance, a class axiom may assert that whenever an individual x belongs to class Y, it also belongs to class X:

Y(x) ⇒ X(x)

This axiom asserts that Y is a subclass of class X. More complex relationships between classes can also be encoded in description logics, by building expressions involving the intersection, union, and complement of classes. Classes may also be defined in terms of the properties of an individual. For instance, one might define a class called Enzyme as the set of individuals that have at least one hasSubstrate property and at least one hasProduct property:

Enzyme(x) ⇔ ∃y hasSubstrate(x, y) ∧ ∃y hasProduct(x, y)

In OWL, the atoms ∃y hasSubstrate(x, y) and ∃y hasProduct(x, y) are called property restrictions. More specifically, they are existential property restrictions. Other types of property restrictions are universal restrictions, which assert that all values of a property must belong to a certain class, and cardinality restrictions, which assert that an individual must have a certain number of distinct values for a particular property.
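As a concrete sketch, the Enzyme definition above, together with the Ligase example from the start of this subsection, might be written in OWL’s Turtle serialization as follows. The kegg: URIs are the hypothetical ones used throughout this section, and owl:Thing is used as an unrestricted filler class.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix kegg: <http://www.genome.jp/> .

# Enzyme(x) ⇔ ∃y hasSubstrate(x, y) ∧ ∃y hasProduct(x, y)
kegg:Enzyme a owl:Class ;
    owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf (
            [ a owl:Restriction ;
              owl:onProperty kegg:hasSubstrate ;
              owl:someValuesFrom owl:Thing ]
            [ a owl:Restriction ;
              owl:onProperty kegg:hasProduct ;
              owl:someValuesFrom owl:Thing ] ) ] .

# Ligase(x) ⇒ Enzyme(x): from this axiom alone, a reasoner can
# infer that every Ligase has at least one substrate and product.
kegg:Ligase rdfs:subClassOf kegg:Enzyme .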
The use of axioms to define classes and properties makes it possible to automatically find contradictions within an ontology; this operation is called consistency checking, and it is one of the principal tasks that can be performed by an OWL reasoner (e.g. [98–100]). It is also possible to use consistency checking to test whether a given RDF dataset conforms to the rules of an ontology; thus, an OWL ontology can be used to specify the structural rules of RDF data in a very sophisticated manner. Beyond consistency checking, an OWL reasoner may be used to deduce facts that are not explicitly stated in the data. Typical questions that an OWL reasoner can answer are:

⇒ Are all instances of class X also instances of class Y? (Subsumption)
⇒ Does individual x belong to class X? (Realization)
⇒ Is it possible for any individual to be a member of class X? (Satisfiability)
⇒ Are there any logical contradictions in the ontology? (Consistency)

While this type of “reasoning” may seem somewhat limited, the ability to automatically apply a classification system to biological data has many applications. For instance, an OWL ontology can be used to group proteins into families according to their domains, as described in [101], or to organize a database of small molecules by their functional groups, as in [102].

Description logics are distinguished by the particular constructs (e.g. existential restrictions) that they permit in the construction of their axioms. For example, there are several variants of OWL based on different description logics; the most widely used variant, OWL DL, is a representation of the description logic SHOIN(D). Each letter in the name indicates one or more constructs:

S: the constructs of the description logic ALC, which include negation of class expressions, intersections of classes, universal restrictions, and existential quantification
H: subproperty axioms
O: nominals (classes may be defined as enumerated lists of individuals)
I: inverse properties
N: cardinality restrictions
(D): datatype properties (properties which take a number or string as their value, rather than an individual)

The set of constructs used by a description logic determines its expressivity. One of the main subjects of research in description logics is the tradeoff between expressivity and the computational complexity of reasoning. It is generally understood that the more expressive a logic, the more computationally expensive its inferencing services (e.g. subsumption) become, with undecidability as the worst case [103].

Several projects have created OWL ontologies for specific domains of bioinformatics, such as BioPAX [104] for modelling pathways and interactions, OBI [105] for modelling experimental procedures and conditions, and the MGED ontology [78] for modelling microarray experiments. Larger OWL ontologies have also been constructed that cover multiple domains. For instance, the NCI (National Cancer Institute) Thesaurus [106] models many areas of biology and clinical science such as diseases, genes, drugs, pathways, taxonomy, and cancer diagnoses. In addition, a number of description logic ontologies predate OWL, such as the GALEN [107] and SNOMED-CT [108] ontologies for medical terms, and the TAMBIS ontology [109] for general biological concepts. It is worth noting here that the word “ontology” is more commonly used in bioinformatics as a synonym for controlled vocabularies (simple hierarchies of terms) such as the GO (Gene Ontology) [110].
GO has been a very successful means of ensuring that scientists use consistent terms when annotating data records, and has enabled the querying of databases using only one standardized term per concept (e.g. “transcription”).

As it relates to data integration, the principal advantage of the RDF/OWL framework over the relational and XML models is that a single dataset can conform to multiple schemas (ontologies) at the same time. For example, the dataset of Fig. 1.3 could be queried using two independently developed ontologies for drug targets and enzymes, as shown in Fig. 1.5. This “mixing and matching” of ontologies is possible under OWL because class definitions do not require individuals to have a fixed number of property types; in the example, an instance of the class DrugTarget is permitted to have any number of other properties in addition to “targetOf” (such as “hasSubstrate” and “hasProduct”). Figure 1.5 demonstrates how the RDF/OWL framework solves the schema integration problem when two ontologies describe different types of data about the same entities. However, if the domains of two ontologies overlap (e.g. two ontologies that describe pathways), the representation of common entities must agree (as demonstrated in Fig. 1.4).

Figure 1.5: An example illustrating how different OWL ontologies may be used to provide alternate interpretations of the same RDF dataset. In the first case (left), uniprot:P04035 is identified as an instance of the class Enzyme, whereas in the second case (right), uniprot:P04035 is identified as an instance of the class DrugTarget.

1.3.4 Web Services

The fundamental idea underlying web services is to make the functionality of software accessible over a network. This can be accomplished by setting up a server that listens for input messages from remote clients, performs computations on those input messages, and responds with output messages that contain the results. The client-server model is already widely used for a large number of internet applications such as web servers and FTP servers; the idea is merely to extend the pattern to other types of software. It is important to note that the intended “users” of web services are not people but rather machines; web services are a mechanism for accessing external software from within a program, analogous to a procedure call. Thus, there is a strong emphasis on machine-readability and automation.

A number of standards have been developed to promote uniform interaction, description, and discovery of web services, which together form the WS-* stack. First, SOAP (Simple Object Access Protocol) [111] is a simple XML “wrapper” that allows control information to be attached to web service messages, for purposes such as application-specific routing and security.
Second, WSDL (Web Service Description Language) [112] is an XML standard for describing the message formats, calling pattern, location, and protocol (e.g. HTTP) of a web service. WSDL files describe message formats using XML Schema. Lastly, UDDI (Universal Description, Discovery and Integration) [113] is a distributed, XML-based registry system for web services, which can itself be queried as a web service.

Apart from the WS-* stack, an alternative model for web services also exists in the literature, called RESTful services. Requests to RESTful services are typically encoded as HTTP GET URLs and are invoked by a simple request-response pattern; in other words, RESTful services are invoked in the same manner that a web browser retrieves a web page. The term REST (REpresentational State Transfer) describes a general set of design principles for distributed systems that were employed in the development of the HTTP protocol; the essential characteristics of a REST system are the separation of interactions into clients and servers, the statelessness of servers, the use of global identifiers for resources, and the support of a generic interface for retrieving representations of resources [114]. Currently, there is an ideological battle being waged between supporters of the RESTful model and the “Big Services” model [115]; however, the two views are in many ways compatible, because the WS-* stack supports a superset of the behaviour described by RESTful services. The majority of web services in bioinformatics communicate via SOAP, are described in WSDL, and behave in a RESTful manner (in the sense that they are stateless, have a simple request-response calling pattern, and operate over HTTP). Examples include EBI’s dbfetch [116] service, the KEGG API [117], BioMoby [118], and BioMart’s “MartServices” [119].

While the WSDL and SOAP standards are widely used in bioinformatics, the UDDI standard is not; in current practice, bioinformaticians usually discover web services through papers, internet search engines, and word of mouth. In an effort to fill this gap, a number of projects have created specialized registries for bioinformatics services with capabilities beyond those of UDDI. For example, the BioMoby [118] registry includes a shared ontology of XML datatypes that represent common entities such as DNA sequences and GO terms. Services that wish to participate in the BioMoby framework must consume and generate one of the BioMoby datatypes (or extend the shared ontology), and thus the framework encourages the creation of interoperable services. The Feta [120] registry takes an alternate, annotation-based approach. Feta service providers annotate services and their parameters with terms from the myGrid ontology [121], enabling users to discover services based on the types of parameters they consume/generate (e.g. “Protein Sequence”), and the types of analyses they perform (e.g. “Sequence Alignment”). Most recently, the BioCatalogue [122] provides an open web-based service registry which can be searched by keyword, service category (e.g. “Phylogeny”), and user-annotated tags. The BioCatalogue now contains more than 1000 services.

Beyond the basic tasks of locating and invoking individual services, it is also possible to chain services into workflows to perform complex analyses. (Fig. 1.6 depicts an example workflow, constructed in Taverna [123].) Workflows are useful for a number of tasks in bioinformatics.
For example, one might build a workflow to automate the mapping of differentially expressed genes to pathways [124], the retrieval of gene information for a QTL [125], or the construction of a phylogenetic tree [126]. A number of software tools have been developed to aid in the (manual) construction of workflows, such as Taverna [123] and Kepler [127].

Figure 1.6: Example Taverna workflow which generates a BLAST report, multiple sequence alignment, and phylogenetic tree for an input protein sequence.

Unfortunately, building workflows is still not easy. Chaining services from different providers tends to be difficult and time-consuming [128] for the same reason that merging data from different providers is difficult: the schema integration problem. Virtually all web services utilize XML-based message formats, and the XML schemas for these formats are developed independently by different service providers. Thus, at the current time, building analysis pipelines with web services has few advantages over building pipelines with locally installed software. One advantage that web services do have over ordinary software is platform independence; a web service may be invoked from virtually any programming language on any operating system, and requires no downloading, installation, or configuration prior to use. (The price of this convenience is dependency on a remote server.) In addition, web services are a useful mechanism for “wrapping” a related set of databases and software packages, so that they are exposed with a uniform set of interfaces.

A number of projects have been dedicated to building more sophisticated infrastructure on top of the existing web service standards. For example, grid architectures seek to unify a collection of distributed servers and allow them to be used as a single computational resource. Thus, the functionality that is supplied by a grid system is analogous to the functionality supplied by an operating system: job control (running, pausing, or killing a program), job monitoring, file storage/transfer, and security. In contrast to users of stand-alone web services, grid users must supply both the input data and the software to be run as a job on the grid. The general architecture and functionality of a grid system is described in the OGSA (Open Grid Services Architecture) document [129], and the de facto standard implementation of OGSA is the Globus Toolkit [130]. Both the BIRN (Biomedical Informatics Research Network) [131] and caGrid [132] grid projects are based on the Globus Toolkit, as are a number of other grid projects outside of bioinformatics.
A related term, cloud computing, is used to describe grid-like services that have recently become available from Amazon [133], Google [134], and Microsoft [135]; these services allow application developers to run multiple instances of an application on virtual machines that are supplied by the vendor. The main feature of cloud computing services is that the number of virtual machines running at any given time may be readily increased or decreased (in some cases automatically). The most common target application of cloud computing is the creation of websites that can readily scale in response to increasing popularity.

Another area of research related to web services is Semantic Web Services [136], which seeks to apply Semantic Web standards to the problem of automatically constructing workflows. The general approach taken in this field is borrowed from the planning domain [137] of artificial intelligence, in which the condition of the world before and after any action is modeled by a set of state variables. The user’s goal in executing the workflow is expressed as a desired set of values for the state variables, and the job of a service broker (e.g. [138]) is to find a path from the initial conditions, through a number of web service invocations, to the goal. Two standards-in-progress have emerged for describing web services along these lines, called OWL-S [139] and WSMO [140].

1.3.5  Mediator Systems

A mediator system (also called a multidatabase system or a data integration system) provides a common query interface to a collection of data sources that are distributed across a network. A mediator answers a query by translating it into a set of subqueries that are issued against the individual data sources (Fig. 1.7). Typically, users formulate their queries against a “virtual” schema or ontology that provides a coherent view of all available data; thus, designing the virtual schema, and mapping it to the schemas of the individual sources, is one of the central tasks of implementing a mediator system. This design/mapping work is yet another instance of the schema integration problem that has been encountered under several different scenarios in this chapter. As in the previous instances, the mappings must be made manually after careful study of the sources. One common technique for distributing the work of schema integration is for each participant to implement a wrapper interface (i.e. a web service) over his/her data source, as depicted in Fig. 1.7. The wrapper layer also masks other types of heterogeneity, such as the different interfaces used for invoking an analytical program and querying a database.

Figure 1.7: Mediator architecture. A user query is translated by the mediator into subqueries, each of which is issued through a wrapper to a database or to analytical software.

A number of mediator systems have been developed for bioinformatics. One of the earliest systems was Kleisli [141]. Kleisli uses a query language called CPL (Collections Programming Language) [142] that allows for the querying and transformation of nested data structures consisting of lists and sets3. This is particularly suitable for bioinformatics resources, which tend to have complex structures, such as the sequence features (e.g. exon regions within a DNA sequence) that are embedded in a GenBank record.
In the late 90’s, a library of Kleisli “drivers” (i.e. wrappers) was built [143] that enabled querying across a number of important bioinformatics resources such as GenBank, GDB, and BLAST. One caveat of Kleisli is that it requires the user to be aware of the complete set of available data sources, and the particular record structures of each; in other words, there is no virtual schema. The TAMBIS [144] project addressed this issue by mapping Kleisli drivers to the concepts and relationships of a large description logic ontology [109] for bioinformatics. In addition, TAMBIS provided an intuitive graphical user interface for composing queries against the ontology. (The TAMBIS project is no longer active.) BioFlow [145] is another bioinformatics mediator system, which implements extensions to SQL syntax that allow for the mapping of remote XML files to tables, and the mapping of web forms to SQL functions.

The mediator approach to data integration is often contrasted with the data warehousing approach. On one hand, mediators are not affected by the “staleness” issue of data warehouses, because they obtain their data directly from the original sources. On the other hand, interaction with the original sources often requires large data transfers across the network, and thus mediator queries are generally slower than warehouse queries. In addition, mediators depend on the availability and performance of remote servers, which may be unpredictable. Both the warehouse and mediator approaches typically require the creation of complex schemas and schema mapping rules, which entail a significant investment of human labour. In addition, ongoing maintenance of the schema and associated mappings is required, as new data sources are added and the schemas of existing sources change. One advantage of the mediator approach with respect to maintenance is that the work can be distributed across data providers, if each provider is responsible for maintaining the wrapper of his/her own resource.

3 CPL is similar in syntax to JavaScript Object Notation (JSON).

1.4  Thesis Outline

This thesis describes the design and implementation of SHARE (Semantic Health and Research Environment), a general-purpose mediator system for bioinformatics. The objective of SHARE is to provide a framework that allows for flexible and efficient querying across distributed databases and software, while at the same time allowing for the ongoing addition of resources by third parties. The principal challenges of implementing a successful mediator system relate to schema integration and query performance. In the following chapters, we propose new approaches for addressing these problems based on the use of Semantic Web technologies. In Chapter 2, we describe the system architecture for SHARE, with an emphasis on the design decisions that support open addition of resources, while avoiding the requirement for centralized development and maintenance of an overarching schema. In Chapter 3, we investigate the problem of optimizing queries in SHARE. We describe an adaptive query execution algorithm called GREEDY, which has been designed to cope with the problem of planning queries in the absence of statistics about the data sources. Chapter 4 concludes by highlighting the contributions of SHARE to bioinformatics and the Semantic Web, and also the weaknesses of the system and areas for future work.
Chapter 2

SHARE System Architecture

2.1  Introduction

SHARE (Semantic Health and Research Environment) is a mediator system that enables simultaneous querying of databases and analytical programs that are distributed across the web. SHARE differs from other mediators, both inside and outside of bioinformatics, because its design is based entirely on Semantic Web standards. All services (i.e. data sources) consume and generate RDF [63], input/output datatypes are matched using OWL [146] reasoning, and user queries are expressed in SPARQL [88], the standard query language for RDF. In this chapter, we will demonstrate how these standards can be employed to simplify the implementation, maintenance, and addition of data sources to a mediator system. In particular, there are two characteristics that are unique to SHARE:

1. There is no master schema.
2. If there is a chain of service calls that will convert datatype A to datatype B, that chain can be discovered and invoked automatically.

We will return to these two claims and discuss their implications after describing the mechanics of the system. We begin by introducing SPARQL, the language for SHARE user queries, in Section 2.2. Section 2.3 describes the different types of data sources that are queried by SHARE, and Section 2.4 explains the process of query resolution. Section 2.5 describes the online demonstration of SHARE, Section 2.6 points to the source code, and Section 2.7 concludes by discussing the main strengths and weaknesses in the design of the system.

2.2  The SPARQL Query Language

?structure    ?ligand    ?name
pdb:2DN1      pdb:OXY    "OXYGEN MOLECULE"
pdb:2DN1      pdb:MBN    "TOLUENE"

Table 2.1: The results for the SELECT query of Fig. 2.2, when issued against the dataset in Fig. 2.1.

A SPARQL [88] SELECT query searches an RDF dataset for a subgraph with a specified structure, as demonstrated by the example query in Fig. 2.2. The subgraph of interest is defined by a set of triple patterns in the WHERE clause, in the same manner that an RDF graph is described by a set of triples. The difference between an RDF triple and a triple pattern is that a triple pattern may have a variable (denoted by a ‘?’ prefix) in any combination of its three positions. Variables denote placeholders for unknown values, and the result of a SPARQL query is a table of bindings for the variables. A set of triple patterns forms a basic graph pattern, which may be combined with the UNION (logical OR) and OPTIONAL operators to build up complex query expressions. The PREFIX syntax at the top of Fig. 2.2 is used to declare abbreviations for URIs that are used in the query.

@prefix :        <http://sadiframework.org/ontologies/predicates.owl#> .
@prefix uniprot: <http://lsrn.org/UniProt:> .
@prefix pdb:     <http://lsrn.org/PDB:> .

uniprot:P00214  :has3DStructure   pdb:1FRI .
uniprot:P68871  :has3DStructure   pdb:2DN1 .
pdb:1FRI        :hasLigand        pdb:F3S .
pdb:F3S         :hasChemicalName  "FE3-S4 CLUSTER" .
pdb:2DN1        :hasLigand        pdb:OXY .
pdb:OXY         :hasChemicalName  "OXYGEN MOLECULE" .
pdb:2DN1        :hasLigand        pdb:MBN .
pdb:MBN         :hasChemicalName  "TOLUENE" .

(a) target dataset (N3 format)

(b) target dataset (graph visualization)

Figure 2.1: An example RDF dataset, shown in N3 and graph form.
PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?structure ?ligand ?name
WHERE {
    uniprot:P68871  :has3DStructure   ?structure .
    ?structure      :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}

Figure 2.2: A SPARQL SELECT query which asks “What are the ligand(s) of Hemoglobin subunit beta (UniProt protein P68871)?”. The results for the query, when issued against the dataset in Fig. 2.1, are shown in Table 2.1.

A SPARQL CONSTRUCT query generates an RDF graph as output, rather than a table of variable bindings. CONSTRUCT queries are useful for performing structural transformations on RDF, as demonstrated by Fig. 2.3. For further details about SPARQL query syntax, see [88]; however, the explanation provided here is sufficient for the remainder of the discussion in this chapter.

PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

CONSTRUCT {
    uniprot:P68871  :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}
WHERE {
    uniprot:P68871  :has3DStructure   ?structure .
    ?structure      :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}

(a) example CONSTRUCT query

(b) result graph for the CONSTRUCT query

Figure 2.3: A SPARQL CONSTRUCT query which tells the system to “Construct an RDF graph in which Hemoglobin subunit beta is directly annotated by its ligands.” The resulting RDF graph for this query, when issued against the dataset in Fig. 2.1, is depicted in part (b).

2.3  SHARE Data Sources

2.3.1  SADI Services

[Figure: an input RDF document, in which uniprot:P04637 carries an owl:hasAminoAcidSequence value ("MEEPQSDPV..."), is sent by HTTP POST to the SADI BLAST service; the output RDF document in the response attaches owl:hasHomolog properties linking uniprot:P04637 to uniprot:P56424, uniprot:P13481, and uniprot:Q9TTA1.
Input OWL Class: owl:ProteinWithSequence ≡ (≥1 owl:hasAminoAcidSequence)
Output OWL Class: owl:HomologSet ≡ (≥1 owl:hasHomolog)
Generated Property: owl:hasHomolog]

Figure 2.4: An example SADI service which performs a BLAST search. The structure of the input and output data is simplified; a real BLAST service would likely provide additional data such as expect values, pairwise sequence alignments, etc. Note that the output RDF file does not need to include statements (triples) about the input URIs that were given in the input RDF document, because these facts are already known to the client. For example, the output document need not specify that (uniprot:P04637, owl:hasAminoAcidSequence, "MEEPQSDPV...").

SADI (Semantic Automated Discovery and Integration) services are stateless web services that are invoked by issuing an HTTP POST request. Both the request and the response consist of a single RDF document. The input RDF document contains one or more input URIs, and may contain any number of statements (i.e. triples) about those URIs; the purpose of these statements is to provide the service with the information that is required to process the inputs. For instance, the example BLAST service in Fig. 2.4 requires one or more amino acid sequences associated with each input protein, in addition to the URIs identifying the proteins themselves4.

4 It would be preferable to specify that each input protein has exactly one amino acid sequence. However, standard OWL reasoners can never deduce instances of a class that is defined by an exact cardinality restriction, due to the open world assumption [147].
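For instance, the messages exchanged with the BLAST service of Fig. 2.4 might look like the following N3. This is a minimal sketch: the blast: prefix is invented (Fig. 2.4 abbreviates the service’s ontology namespace as “owl:”, which is avoided here to prevent confusion with the OWL namespace itself).

@prefix blast:   <http://example.org/blast-service.owl#> .
@prefix uniprot: <http://lsrn.org/UniProt:> .

# Input RDF document (HTTP POST body): one input URI, plus the
# statements the service needs in order to process it.
uniprot:P04637  blast:hasAminoAcidSequence  "MEEPQSDPV..." .

# Output RDF document (HTTP response): new statements that attach
# the generated property to the same input URI.
uniprot:P04637  blast:hasHomolog  uniprot:P56424 ,
                                  uniprot:P13481 ,
                                  uniprot:Q9TTA1 .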
The set of properties (i.e. predicates) that must be supplied for each input URI is encoded by the input OWL class of the SADI service. Each SADI service also has an output OWL class that serves a similar function; it specifies the set of properties that are attached to the input URIs as a result of the service call. In the example service of Fig. 2.4, the input OWL class is the set of nodes having one or more hasAminoAcidSequence properties, and the output class is the set of nodes with one or more hasHomolog properties. While knowledge of the input OWL class is required to successfully invoke a SADI service, knowledge of the output OWL class is required to determine the type of data that the service generates.

The input and output OWL classes for a SADI service are specified by the service provider as part of the service description, and this description is accessed by performing an HTTP GET on the service URL. The service description is encoded in RDF using the myGrid service ontology, and may include other details such as a natural language description of the service. An example service description is provided in Appendix A. Any third party may register their service with the central SADI registry by submitting the service URL to http://sadiframework.org/registry; the SADI registry then obtains all other necessary information about the service by downloading and processing the service description.

In most respects, the SADI standard is a straightforward application of RDF and OWL to the domain of web service messaging. The key constraint of SADI is that both the input and output OWL classes are defined with respect to the input URIs. As a result, the difference between the input and output classes can be computed to determine the set of generated properties for a service. For example, the service in Fig. 2.4 has one generated property, which is hasHomolog. The SADI registry stores the generated properties for each service, and allows clients to retrieve services based on a property of interest. The discovery of services by individual properties is unique to the SHARE system, and is responsible for its ability to automatically assemble workflows from SPARQL queries, as will be discussed in Section 2.4.

The SADI services that are currently available in the registry generate properties for a number of common identifier types in bioinformatics, such as UniProt [11], GO [110], KEGG [117], and PDB [17]; these identifiers must be expressed as URIs according to the LSRN URI scheme (e.g. http://lsrn.org/UniProt:P68871). The generated properties for each identifier type are depicted in the predicates diagram at http://biordf.net/cardioSHARE/predicates.html (Fig. 2.5), which represents the schema for querying the system. A user may extend the pool of queriable services/properties by registering his or her own SADI service at http://sadiframework.org/registry. Provided that the service has been implemented correctly according to the SADI standard, the service will be immediately available for use during query resolution.
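As an illustration of how such class-based interface descriptions might be written down, the following N3 sketch defines the input and output classes of the BLAST service of Fig. 2.4, again using an invented blast: namespace:

@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix blast: <http://example.org/blast-service.owl#> .

# Input class: any URI with at least one amino acid sequence.
blast:ProteinWithSequence  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty blast:hasAminoAcidSequence ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .

# Output class: the same URIs, now carrying at least one homolog.
blast:HomologSet  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty blast:hasHomolog ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .

The difference between the two definitions, the single property hasHomolog, is what the registry records as the service’s generated property.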
Instructions for building SADI services are available at http://sadiframework.org.

2.3.2  SPARQL Endpoints

SPARQL endpoints are HTTP services that accept SPARQL queries as input. Virtually all triple stores (e.g. Jena [148], Sesame [149], Virtuoso [150]) can be configured to act as SPARQL endpoints, and there are a growing number of SPARQL endpoints on the web that provide access to RDF versions of biological databases. At the current time, these endpoints are not hosted by the original data providers but are rather supplied by third parties such as Bio2RDF [90] and the Linked Open Drug Data (LODD) [91] project. Bio2RDF provides RDF mirrors of general resources such as the UniProt [11] protein database and the KEGG [117] pathway database, while LODD provides mirrors of drug-related resources such as DrugBank [21] and the SIDER database of side effects [93].

The SHARE query engine is capable of querying across SPARQL endpoints. SHARE implements this functionality by means of a mediator-side service wrapper for each SPARQL endpoint. Viewed from the query engine side, each service wrapper makes a SPARQL endpoint appear as if it is a SADI service; just as for a SADI service, each SPARQL endpoint has a set of generated properties and may be invoked using one or more URIs as input. On the network side, the service wrapper translates the input URIs to CONSTRUCT queries and sends them to the relevant endpoint, as depicted in Fig. 2.6. The set of generated properties for a SPARQL endpoint is the set of predicates that occur in one or more triples in the endpoint. In principle, this list of predicates can be obtained automatically by issuing the following query to the endpoint:

SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
Figure 2.5: The generated properties for the SADI services that are currently available through the public registry http://sadiframework.org/registry. Currently, this diagram is maintained by hand, but in future work it will be constructed automatically from the contents of the registry.

[Figure: the query engine passes an input URI (e.g. uniprot:P68871) to a SPARQL service wrapper; the wrapper issues CONSTRUCT { uniprot:P68871 ?p ?o } WHERE { uniprot:P68871 ?p ?o } against the remote SPARQL endpoint and returns the resulting RDF document.]

Figure 2.6: Querying of remote SPARQL endpoints in SHARE. Generated properties for the remote SPARQL endpoint are indicated in green.

However, this is a relatively expensive query, because it requires iterating through all of the triples in the database. In practice, it tends to fail for large triple stores because the HTTP POST request that issues the query times out before the database finishes processing the query. A further problem is that SPARQL endpoints may disallow the execution of queries that are deemed to be too expensive. At the current time, such endpoints are instead indexed based on a single URI representing a “typical” database record (e.g. http://bio2rdf.org/uniprot:P05067); in order to accommodate complex record structures, the RDF graph rooted at the URI is crawled using a breadth first search. Clearly, this is not an ideal solution, because it is not fully automatic and is not guaranteed to be complete. In future work, the author plans to develop scripts for different types of SPARQL endpoints (e.g. Virtuoso [150], D2R [151]) that will build standardized indices on the endpoint side.

In addition to a list of generated properties, SHARE’s indices for SPARQL endpoints also contain regular expressions for the subject/object URIs that occur in the endpoint. As with generated properties, the regular expressions can be built by querying the endpoints, but for large endpoints they must be constructed using the record-based approach. The SPARQL endpoint registry for SHARE currently contains indices for the LODD and Bio2RDF endpoints, as listed in Table 2.2. These indices are rebuilt on a weekly basis. A map of the generated properties for the endpoints is shown in Fig. 2.7; as the map is very large, only a small portion is shown in detail here5. The map was generated by performing a series of breadth first traversals across the data in the endpoints, with a depth limit of 7 edges for each traversal. The root URIs for the traversals were chosen by querying for a list of distinct record types (i.e. rdf:types) from each endpoint, and then randomly sampling 3 URIs of each type. As with the SPARQL indices, the map is not guaranteed to be complete; improved methods for constructing the map will be a subject of future work.

5 To view the map in full detail, files may be downloaded in N3, Cytoscape, SVG, and PNG formats from http://sadiframework.org/sparqlmap. Viewing the map with Cytoscape or Inkscape (SVG viewer) is recommended, as PNG requires too much memory to be loaded in most image viewers.
SPARQL Endpoint URL                                   Provider
http://www4.wiwiss.fu-berlin.de/dailymed/sparql       LODD
http://www4.wiwiss.fu-berlin.de/diseasome/sparql      LODD
http://www4.wiwiss.fu-berlin.de/sider/sparql          LODD
http://www4.wiwiss.fu-berlin.de/drugbank/sparql       LODD
http://www4.wiwiss.fu-berlin.de/medicare/sparql       LODD
http://www4.wiwiss.fu-berlin.de/stitch/sparql         LODD
http://uniprot.bio2rdf.org/sparql                     Bio2RDF
http://omim.bio2rdf.org/sparql                        Bio2RDF
http://reactome.bio2rdf.org/sparql                    Bio2RDF
http://mesh.bio2rdf.org/sparql                        Bio2RDF
http://ec.bio2rdf.org/sparql                          Bio2RDF
http://hgnc.bio2rdf.org/sparql                        Bio2RDF
http://inoh.bio2rdf.org/sparql                        Bio2RDF
http://mgi.bio2rdf.org/sparql                         Bio2RDF
http://protein.bio2rdf.org/sparql                     Bio2RDF
http://unists.bio2rdf.org/sparql                      Bio2RDF
http://obo.bio2rdf.org/sparql                         Bio2RDF
http://uniparc.bio2rdf.org/sparql                     Bio2RDF
http://cpd.bio2rdf.org/sparql                         Bio2RDF
http://affymetrix.bio2rdf.org/sparql                  Bio2RDF
http://irefindex.bio2rdf.org/sparql                   Bio2RDF
http://kegg.bio2rdf.org/sparql                        Bio2RDF
http://cpath.bio2rdf.org/sparql                       Bio2RDF
http://biocyc.bio2rdf.org/sparql                      Bio2RDF
http://chebi.bio2rdf.org/sparql                       Bio2RDF
http://taxonomy.bio2rdf.org/sparql                    Bio2RDF
http://homologene.bio2rdf.org/sparql                  Bio2RDF
http://pubchem.bio2rdf.org/sparql                     Bio2RDF
http://biocarta.bio2rdf.org/sparql                    Bio2RDF

Table 2.2: List of SPARQL endpoints currently indexed by SHARE

Figure 2.7: A map of the generated properties for SPARQL endpoints available under SHARE. The callouts show sections of the map containing predicates about chemical reactions, proteins, and drugs. Each node represents a distinct record type (i.e. rdf:type) that occurs in one or more of the endpoints.

2.4  SHARE Query Resolution

2.4.1  Overview

Standard SPARQL query engines [148, 150] have limited capabilities for querying remote data [152]. Users may include “FROM” clauses at the beginning of a SPARQL query that indicate the URLs of target RDF files, and these files will be downloaded in bulk as the first step of query processing (an example is given at the end of this subsection). The main caveat of this approach is that it does not work well for large datasets. In addition, it precludes the possibility of querying data that is generated by analytical software. SHARE addresses these issues by acting as a transparent data gathering layer on top of a standard SPARQL query engine. An overview of the process is depicted in Fig. 2.8. In the first phase of query resolution, SHARE retrieves the data required to answer the query by issuing a series of requests to available services (SADI services and SPARQL endpoints); the process by which SHARE maps a query to a set of service requests will be explained in detail in the remainder of this section. As data from the services is retrieved, it is aggregated in a local triple store. In the second phase of query resolution, the user’s original query is run against the local triple store using a standard SPARQL query engine.

Figure 2.8: An overview of SHARE query resolution. SHARE acts as a transparent data gathering layer on top of a standard SPARQL engine. In the first phase of query resolution, SHARE translates the user’s query to a series of service requests, in order to gather the data needed to answer the query. This data is accumulated in a local triple store as it is retrieved. In the second phase of query resolution, the user’s original query is executed against the local triple store using a standard SPARQL query engine.

Conceptually, SHARE provides access to a large dataset that consists of the output from each SADI service for each possible input URI, together with all triples stored by all known SPARQL endpoints. Hereafter, we will refer to this dataset as the virtual graph. The objective of SHARE is to download a subset of the virtual graph that is sufficient to generate a complete solution for the user’s query. In this context, “complete” means that the solution set has all of the solutions that would be found by running the user’s query on a complete, materialized version of the virtual graph.
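Returning to the “FROM” mechanism mentioned at the start of this subsection: a standard engine evaluates a query such as the following (the file URL is invented for illustration) by first downloading the named RDF file in its entirety, which is exactly the bulk-transfer behaviour that SHARE is designed to avoid.

PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?structure
FROM <http://example.org/proteins.rdf>
WHERE {
    uniprot:P68871 :has3DStructure ?structure .
}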
2.4.2  Resolution Process of an Example SHARE Query

1: PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
2: PREFIX uniprot: <http://lsrn.org/UniProt:>
3:
4: SELECT *
5: WHERE {
6:     uniprot:P47989 sadi:isEncodedBy ?gene .
7:     ?gene sadi:isParticipantIn ?pathway .
8: }

Listing 2.1: An example SHARE query which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?”

To illustrate the process by which SHARE retrieves and assembles a dataset for a SPARQL query, we will use an example. Consider the SHARE query shown in listing 2.1, which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?” The PREFIX lines 1-2 define abbreviations for URIs that make the query easier to read. The “SELECT *” line indicates that all variables mentioned in the query (?gene and ?pathway) should be included in the table of solutions. Lines 5-8 define the WHERE clause, which specifies the search criteria for the query. Here, the WHERE clause consists of two triple patterns (lines 6 and 7), which together constitute a single basic graph pattern. The criteria specified by each of the triple patterns are combined with the logical AND operation. Thus, each solution for the query of listing 2.1 is a combination of values for ?gene and ?pathway such that:

1. Xanthine dehydrogenase/oxidase (uniprot:P47989) is encoded by ?gene, AND
2. ?gene is a participant in ?pathway6

SHARE builds a dataset for a basic graph pattern by retrieving data for each triple pattern in sequence. Each triple pattern is mapped to a set of service requests, and the output data from those requests is used to determine bindings for any unbound variables that occur in the pattern. To illustrate, we will consider the resolution of the example query in listing 2.1. SHARE resolves the first triple pattern, (uniprot:P47989, sadi:isEncodedBy, ?gene), by the following steps:

1. Matching services are identified. Matching services are services that output triples of the form (<input URI>, sadi:isEncodedBy, *), where <input URI> is an input URI in the service request, and * denotes any possible value. Such services are said to “attach” sadi:isEncodedBy to their input URIs, or (equivalently) to have sadi:isEncodedBy as one of their generated properties.

2. Matching services are invoked. In this case, each service request contains a single input URI: uniprot:P47989. It is possible to invoke services with more than one input URI; this would be necessary if the subject of the triple pattern were a variable with multiple bindings, rather than a constant URI.
3. The output RDF from the services is loaded into the local triple store.

4. Bindings are assigned to any unbound variables in the current triple pattern. SHARE queries the local triple store for triples of the form (uniprot:P47989, sadi:isEncodedBy, *) that were loaded in Item 3. The full set of distinct values for “*” becomes the set of bindings for the variable ?gene.

After resolving the first triple pattern, SHARE will proceed to gather data for the second triple pattern, (?gene, sadi:isParticipantIn, ?pathway), in a similar manner. However, for the second pattern, the subject is a bound variable (?gene) rather than a constant URI (uniprot:P47989). The bindings for ?gene, as determined by the resolution of the first triple pattern, become the input URIs for the service requests that gather data for the second triple pattern. After performing the service calls for the two patterns, SHARE has aggregated a dataset that is sufficient to answer the user’s query. The original query is then issued against the local triple store using a standard SPARQL query engine, and the solutions are displayed.

2.4.3  Pseudocode for SHARE Query Resolution

In the example query of the preceding section, all triple patterns were resolved by using the bindings (or constant values) of the subjects as input URIs for the service invocations. We will refer to this as resolving the triple pattern in the forward direction (left-to-right). SHARE is also capable of resolving triple patterns in the reverse direction (right-to-left). Given a triple pattern (s, p, o), the system decides which direction to use according to the following rules:

Case 1: s is a bound variable or a URI. The triple pattern is resolved in the forward direction. The value(s) for s are used as input URIs to all services that generate property p; if o is an unbound variable, then the set of retrieved values for property p become the bindings of o.

Case 2: s is an unbound variable and o is a bound variable or a URI. The triple pattern is resolved in the reverse direction. The value(s) for o are used as input URIs to all services that generate inverse(p), where inverse(p) is the OWL inverse of property p. The set of retrieved values for inverse(p) become the bindings of s.

6 Readers may observe that the second pattern is not strictly correct; the participants of metabolic pathways are proteins and metabolites rather than genes. This is an artifact of the way data is modelled in the KEGG database. In KEGG, the assertion that a gene participates in a metabolic pathway means that the gene codes for at least one enzyme that participates in the pathway.

The pseudocode for the data gathering procedure of SHARE is shown in listing 2.2. Note that the above two cases are the only types of triple patterns that SHARE is capable of resolving. In particular, it is not possible for the system to resolve a triple pattern in which the predicate is a variable, or a pattern in which both the subject and the object are unbound variables. The former case is unresolvable because the predicate URI is the basis for discovering relevant services, and the latter case is unresolvable because there is no input to send to the services in order to generate data. Solutions for these limitations will be a subject of future work.
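Reverse resolution presupposes that the OWL inverse of a predicate has been declared somewhere SHARE can see it. A minimal sketch of such a declaration in N3, assuming for illustration that the properties ontology pairs isEncodedBy with an encodes property:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix sadi: <http://sadiframework.org/ontologies/properties.owl#> .

# With this axiom, a pattern such as (?protein, sadi:isEncodedBy, g),
# where only the gene g is bound, can be resolved by invoking the
# services that generate sadi:encodes, using g as the input URI.
sadi:isEncodedBy  owl:inverseOf  sadi:encodes .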
2.4.4  Validating Inputs for Services

One aspect of the data gathering process that remains to be explained is the manner in which SHARE confirms that a URI is a valid input for a service. (This check occurs when getValidInputs is called inside of invokeServicesByProperty, in listing 2.2.) In the case of SADI services, a URI is a valid input if and only if it is an instance of the service’s input OWL class, and thus SHARE must use an OWL reasoner to validate an input URI. However, standard OWL reasoners [98–100] are currently limited to reasoning about locally stored data, just as SPARQL engines are limited to querying local data. Thus, SHARE must perform an additional data-gathering procedure to collect properties that are relevant to an OWL class definition. For example, suppose that the following query were issued to SHARE:

SELECT ?homolog
WHERE {
    uniprot:P68871 :hasHomolog ?homolog .
}

Further, suppose that the only hasHomolog-generating service available is the BLAST service shown in Fig. 2.4. This service requires that every input URI belongs to the class ProteinWithSequence, and a URI is an instance of ProteinWithSequence if and only if it has at least one hasAminoAcidSequence property. However, at the time the triple pattern (uniprot:P68871, :hasHomolog, ?homolog) is resolved by SHARE, no attempt has been made to gather hasAminoAcidSequence properties for uniprot:P68871. SHARE addresses this problem with a secondary, OWL-based data-gathering procedure. This procedure automatically decomposes the definition of the input OWL class ProteinWithSequence, determines that the hasAminoAcidSequence property is required for uniprot:P68871, and invokes all services that generate hasAminoAcidSequence with uniprot:P68871 as input. The hasAminoAcidSequence-generating services have their own input OWL classes, which may themselves require additional data; in this manner, the data-gathering process may proceed recursively through an arbitrarily long chain of dependent services. A base case is reached when a service is discovered that consumes bare URIs. In theory, such services would have “owl:Thing” as their input OWL class. (owl:Thing is a special, predefined OWL class that contains all URIs.) In practice, SHARE contains rules to assign bare URIs to OWL classes representing different types of database records, based on their syntax. For example, uniprot:P68871 is represented as http://lsrn.org/UniProt:P68871 and is assigned to the OWL class http://purl.oclc.org/SADI/LSRN/UniProt_Record. The URLs for SHARE are based on the LSRN (Life Sciences Resource Name) URL scheme [153].

The pseudocode for the OWL-based data gathering procedure is shown in listing 2.3. The main work is performed by the gatherDataForTypePattern procedure; the name for this procedure comes from the convention that instances of an OWL class are indicated by triples of the form (<subject URI>, rdf:type, <OWL class URI>). gatherDataForTypePattern is recursive in two ways. First, calling services to gather new data requires resolving the input OWL classes of those services, which may in turn require gathering additional data (as described in the previous paragraph). Second, OWL class definitions may themselves be nested. For example, an AnnotatedProtein might be defined as any URI that has at least one hasAnnotation property with a value (i.e. object URI) from the class GOTerm. (We call this a “class restriction” in the pseudocode of listing 2.3; in OWL, it is also called a “someValuesFrom” restriction.) In this case, the OWL data gathering procedure will not only have to retrieve hasAnnotation properties for the subject URI, but also properties for the values of hasAnnotation that are relevant to the definition of GOTerm.
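A hedged N3 rendering of that nested definition (the namespace and both class names are invented for illustration):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/ontology#> .

# A nested class definition: membership requires a hasAnnotation
# value, and that value must itself be an instance of GOTerm.
ex:AnnotatedProtein  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty ex:hasAnnotation ;
        owl:someValuesFrom ex:GOTerm
    ] .

To validate an input URI against this class, gatherDataForTypePattern must first gather hasAnnotation triples for the URI, and then recurse into the definition of ex:GOTerm for each retrieved value.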
resolve(query)
    bindings = empty hash table    // initialize variable bindings
    for each triple pattern t in query
        if (t.subject is not a variable)
            bindings[t.subject] = singletonSet(t.subject)
        if (t.predicate is a variable)
            error "cannot resolve pattern where predicate is a variable"
        if (t.object is not a variable)
            bindings[t.object] = singletonSet(t.object)

    // gather data
    for each triple pattern (s, p, o) in query
        processPattern(s, p, o, bindings)

    // hand over the gathered data to a standard SPARQL engine
    return SPARQLEngine.query(query, localTripleStore)

processPattern(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else
        // both subject and object are already bound
        invokeServicesByProperty(subjects, p)

    // record bindings for any variables that were unbound
    // (a null binding set acts as a wildcard when matching)
    if (subjects == null)
        bindings[s] = empty set
    if (objects == null)
        bindings[o] = empty set
    for each triple t in localTripleStore.getMatchingTriples(subjects, p, objects)
        if (subjects == null)
            bindings[s].add(t.subject)
        if (objects == null)
            bindings[o].add(t.object)

invokeServicesByProperty(inputURIs, property)
    services = empty set
    services.add(SADIRegistry.getServicesByProperty(property))
    services.add(SPARQLRegistry.getServicesByProperty(property))
    for each service s in services
        validInputURIs = getValidInputs(inputURIs, s)
        resultTriples = s.invoke(validInputURIs)
        localTripleStore.add(resultTriples)

Listing 2.2: Pseudocode for the main data gathering procedure of SHARE. resolve is the top-level method that answers a SHARE query. resolve invokes processPattern for each triple pattern in the query, in order to find solutions for unbound variables. invokeServicesByProperty is a helper procedure for processPattern that invokes services for a particular generated property and set of input URIs.

getValidInputs(inputURIs, service)
    validInputURIs = empty set
    if service is a SPARQL endpoint
        for each URI in inputURIs
            if (service.matchesRegEx(URI))
                validInputURIs.add(URI)
    else
        // service is a SADI service
        inputClass = service.getInputClass()
        // gather data that is relevant to the
        // definition of inputClass
        s = generateNewVariableName()
        bindings[s] = inputURIs
        gatherDataForTypePattern(s, inputClass, bindings)
        // check each URI against the inputClass
        for each URI in inputURIs
            if (OWLReasoner.isInstance(URI, inputClass))
                validInputURIs.add(URI)
    return validInputURIs

gatherDataForTypePattern(s, inputClass, bindings)
    // OWL classes may have explicitly declared
    // parent classes and equivalent classes
    for each parent class P of inputClass
        gatherDataForTypePattern(s, P, bindings)
    for each equivalent class E of inputClass
        gatherDataForTypePattern(s, E, bindings)

    // OWL classes may be constructed using set operators
    if inputClass is a union of classes
        for each class C in the union
            gatherDataForTypePattern(s, C, bindings)
    else if inputClass is an intersection of classes
        for each class C in the intersection
            gatherDataForTypePattern(s, C, bindings)
    else if inputClass is the complement of class C
        gatherDataForTypePattern(s, C, bindings)
    else
        // Base case: inputClass is defined as a set of property restrictions
        for each property restriction r of inputClass
            p = r.getProperty()
            o = generateNewVariableName()
            if r is a value restriction
                bindings[o] = set(r.getValue())
                processPattern(s, p, o, bindings)
            else if r is a class restriction
                bindings[o] = null
                processPattern(s, p, o, bindings)
                gatherDataForTypePattern(o, r.getClass(), bindings)
            else
                // r is a cardinality restriction
                bindings[o] = null
                processPattern(s, p, o, bindings)

Listing 2.3: Pseudocode for validating inputs to services. gatherDataForTypePattern is a helper method for getValidInputs which gathers all available data that is related to an OWL class definition. generateNewVariableName generates a new variable name that has not been used in the user’s query or by previous temporary variables.

In addition to validating inputs for services, gatherDataForTypePattern is also used by SHARE to directly find instances of OWL classes from within a query. For example, the query

SELECT ?kinase
WHERE {
    ?kinase rdf:type myOntology:Kinase .
}

will identify instances of the class Kinase, provided that the property restrictions used to define myOntology:Kinase map to SADI services.

2.5  SHARE Online Demonstration

The SHARE demonstration site allows users to issue queries against a sample set of SADI services and SPARQL endpoints, and may be accessed at http://biordf.net/cardioSHARE/query. The main page provides a form for entering SPARQL queries, and a list of example queries is provided at http://biordf.net/cardioSHARE/queries.html. Fig. 2.9 shows the result of executing the SPARQL query from listing 2.1.

Figure 2.9: A screenshot of the SHARE demonstration site, after running the query from listing 2.1, which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?”

The SPARQL endpoints provided by the Bio2RDF [90] and Linked Open Drug Data (LODD) [91] projects are also queriable by the system. Bio2RDF provides SPARQL mirrors for popular databases such as UniProt [11], KEGG [117], and Pfam [154], while LODD provides endpoints for drug-related resources such as Drugbank [21] and SIDER [93].
At the current time, the generated properties of these endpoints are not incorporated in the predicate diagram of Fig. 2.5; however, it is feasible to generate a complete diagram automatically from the SPARQL and SADI service registries, with further work.

2.6  Source Code

The source code for the SHARE query engine may be accessed at http://sadi.googlecode.com, under the sadi.share folder (i.e. http://sadi.googlecode.com/svn/trunk/sadi.share). The real methods corresponding to the pseudocode methods of listings 2.2 and 2.3 may be found within the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase), as per Table 2.3. The source code for the demonstration servlet and website is available separately from http://cardioshare.googlecode.com.

Pseudocode Method              Real Method
resolve                        executeQuery
processPattern                 processPattern
invokeServicesByProperty       gatherTriples
getValidInputs                 filterByInputClass
gatherDataForTypePattern       processTypePattern

Table 2.3: The methods in the SHARE source code which correspond to the pseudocode methods of listings 2.2 and 2.3. All relevant methods are members of the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase), which is located within the sadi.share folder at http://sadi.googlecode.com.

2.7  Discussion

At the current time, XML is the de facto standard for encoding data files and for interacting with web services. Thus, the main caveat of SHARE as an RDF/OWL-based system is that it is not compatible with the majority of existing data and services on the web. However, the RDF/OWL data model has significant advantages over XML, with respect to the implementation and maintenance of a mediator system. In particular, it is possible to automate several tasks that traditionally require hand-coded schema mappings:
While several basic RDF voculabaries such as the Dublin Core Metadata Initiative (document metadata) [155], SKOS (taxonomies) [156], and FOAF (relationships between people) [157] have now gained widespread acceptance, it remains to be seen whether similar standardization can be achieved in more complex domains such as bioinformatics. integrating output data from multiple services During SHARE query execution, the output of each SADI service is automatically integrated into the local triple store by an RDF-merge operation. In contrast, integrating the results of an XML-based service into a relational database requires a hand-coded mapping of the output XML to the database schema. matching service input/output datatypes Under the current paradigm of XML-based web services, connecting the output of a service A to the input of a service B requires that the output XML schema of A is equal to the input XML schema of B. However, there are many scenarios where the schema are not equal, despite the fact that the two schemas describe the same type of entity, and the connection itself would be logical. Consider a scenario in which a bioinformatician wants to connect the output of an XML-based BLAST service to the input of an XML-based multiple sequence alignment (MSA) service. The BLAST output is a collection of sequences and the MSA input is likewise a collection of sequences. However, the precise information provided/required about each sequence may differ between the two services. For example, the MSA might require the source organism of each input sequence in order to improve the quality of the alignment, while this information might not be included in the output of the BLAST service. In this case, the two services are not directly compatible, and the bioinformatician must construct an extra “shim” [128] step to translate between the two schemas. This type of problem arises because XML schema are atomic, and there is no mechanism by which they can be automatically decomposed or merged. In contrast, SHARE’s use of the RDF data model allows it to automatically assemble input datatypes from individual properties, as is done in the gatherDataForTypePattern method of listing 2.3. There are a number of areas where future work will be needed, in order to make SHARE a practical tool for bioinformaticians. First, there will need to be a mechanism to inspect the precise chain of sources and assertions that have led to each query solution (provenance tracking). This functionality is particularly important given the ability of any third party to add services to the system, and thus the potential for sabotage by services that make false or non-sensical assertions. Second, there will need to be controls for data source selection, algorithm selection, and algorithm parameters (e.g. gap penalty). It is difficult to incorporate such controls into the query language itself, as it would destroy the abstraction of property-based service discovery, and greatly would complicate the query interface. However, SHARE could in principle be used 37  2.7. Discussion to generate a template workflow (e.g. a SCUFL [158] file) to be edited in workflow software such as Taverna [123]. Although not fully automated, this approach would eliminate a large part of the manual work involved in identifying and wiring together relevant services for an analysis.  
Chapter 3

An Adaptive Query Evaluation Algorithm for SHARE

3.1  Introduction

One of the main challenges in implementing a distributed query system is determining efficient execution plans for queries. The database operations that are used to answer a query (e.g. selections, projections, joins) can often be rearranged into a large number of logically equivalent plans, and these plans can differ greatly in their overall efficiency. The same problem is encountered when evaluating queries against standalone relational databases, and has been studied extensively in that context (for an overview of the standard approach to query optimization, see [159] and [160]). However, in a distributed scenario, the optimization problem is more difficult for a number of reasons. First, estimating the cost of a query plan is more complex. Whereas traditional query optimization is based on minimizing the number of I/O operations (i.e. hard drive reads), the time required to process requests at remote servers and the time required to transfer data across the network are also significant factors for distributed systems. Another challenge is that the query system may have little or no statistics about the data that is available from remote sources, whereas in standalone databases such statistics are essential for constructing efficient plans. In addition, for many mediator systems such as SHARE, the data sources may be software services (e.g. BLAST) for which there is no standard means of computing statistics.

In this chapter we describe an algorithm called GREEDY that is used by SHARE to evaluate distributed SPARQL queries in an efficient manner. More specifically, the goal of GREEDY is to optimize the ordering of joins across distributed RDF data services. GREEDY differs from the traditional approach to query optimization because the full execution plan for a query is not determined prior to running the query. Instead, GREEDY performs one join at a time, and uses the cardinalities of previously completed joins to decide which join to perform next. Thus, GREEDY is an adaptive optimization algorithm. A further difference between GREEDY and the standard approach to query optimization is that statistics about the data are learned during the execution of queries; these statistics are then used to help the optimizer plan future queries.

We begin by briefly discussing related work in Section 3.2. Next, we review the procedure used by SHARE to resolve queries in Section 3.3, in order to provide context for the optimization problem. In Section 3.4, we present the optimization problem by means of an example. In Section 3.5, we describe a simple optimization relating to the assignment of variable bindings, and in Section 3.6 we present the main optimization algorithm, GREEDY. We evaluate the performance of GREEDY using a small set of bioinformatics queries from the literature in Section 3.7, and we discuss areas for future work in Section 3.8. Finally, we summarize the strengths and weaknesses of GREEDY in Section 3.9 and provide pointers to the relevant source code in Section 3.10.

3.2  Related Work

GREEDY is an extension of an idea published in [161], in "Section 7: Adaptive query execution using Prim's algorithm". A key difference between the algorithm described here and the original pseudocode is that multiple inputs to the same service are batched into a single request (strictly speaking, this is only true for SADI services; there is no mechanism for batching queries to SPARQL endpoints). This simplifies the pseudocode for the algorithm, and is also more efficient.
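The effect of batching can be illustrated with a small sketch. The types and method names here are hypothetical stand-ins for the actual SADI client machinery: instead of paying one round trip per input URI, all inputs destined for the same service are carried in a single request.

    import java.util.List;

    public class BatchedInvocation {
        // Hypothetical stand-in for a SADI service client; one call corresponds
        // to one HTTP request carrying an RDF document describing the inputs.
        interface Service {
            void invoke(List<String> inputUris);
        }

        // Unbatched: |inputUris| round trips, paying network latency each time.
        static void invokeOneByOne(Service service, List<String> inputUris) {
            for (String uri : inputUris) {
                service.invoke(List.of(uri));
            }
        }

        // Batched (the approach used by GREEDY): a single request for all inputs.
        static void invokeBatched(Service service, List<String> inputUris) {
            service.invoke(inputUris);
        }
    }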
Another improvement over the original paper is that specific methods have been developed for gathering and computing predicate statistics.

Much work has been done on distributed query processing in the context of relational databases; for a good overview, see [162]. Some of the main techniques used in distributed query processing are dynamic programming (for enumeration of query plans), row blocking, and semi-joins. Applying the dynamic programming approach to a mediator system requires that the wrappers expose methods for estimating the costs of queries to the individual sources. In addition, the enumeration of plans is typically exhaustive, and thus quickly becomes expensive for complex queries. For these reasons, the dynamic programming approach was deemed unsuitable for SHARE, and the GREEDY algorithm was developed instead. The technique of row blocking simply involves sending the rows of a table across a network in blocks, rather than individually, when performing a join across sites. In SHARE, this roughly corresponds to the batching of inputs to SADI services. The semi-join is a technique for reducing data transfer when joining two tables at different sites. Rather than transferring an entire table from one site to another, a semi-join first sends a reduced version of Table A to Site B, containing only the join columns. Site B then performs a join against the reduced table to determine the matching rows in Table B. Finally, the matching rows of Table B are transferred to Site A, and a join is performed between Table A and the matching rows of Table B. SHARE uses a technique similar to a semi-join when performing joins, where the values of the join column correspond to the bindings of a SPARQL variable.
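As a rough sketch of this semi-join-like technique, the following hypothetical code builds a SPARQL query that ships only the join column, i.e. the current bindings of a variable, to a remote endpoint. The SPARQL 1.1 VALUES clause is used purely for illustration; this is not SHARE's actual mechanism, and the endpoint predicate and binding shown are examples only.

    import java.util.List;

    public class SemiJoinSketch {
        // Transfer only the join column (the variable's bindings) to the remote
        // site, rather than an entire intermediate table.
        static String buildSemiJoinQuery(String predicateUri, List<String> bindings) {
            StringBuilder values = new StringBuilder();
            for (String uri : bindings) {
                values.append("<").append(uri).append("> ");
            }
            return "SELECT ?s ?o WHERE { ?s <" + predicateUri + "> ?o . "
                 + "VALUES ?s { " + values + "} }";
        }

        public static void main(String[] args) {
            // Hypothetical binding list for the join variable.
            String query = buildSemiJoinQuery(
                "http://purl.uniprot.org/core/organism",
                List.of("http://bio2rdf.org/uniprot:P01344"));
            System.out.println(query);
        }
    }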
Work has also been done on distributed query optimization in the context of RDF and SPARQL specifically. Other distributed SPARQL systems that have been described in the literature are DARQ [152], SemWIQ [163], and the Distributed SAIL Extension [164] for Sesame [149]. The DARQ paper describes a procedure for query planning that is based on the average selectivities of predicates when resolved in either the forward (bound subject) or reverse (bound object) directions. An important caveat of the DARQ approach is that the selectivity values for the various predicates are obtained from "service description" files that must be constructed by the user in advance. Another important difference between GREEDY and the DARQ optimizer is that GREEDY is adaptive, whereas DARQ follows the more traditional approach of static query optimization. DARQ implements an additional type of optimization in the form of query rewriting rules, which perform transformations such as the merging of basic graph patterns and the pushing of FILTER expressions into subqueries; this type of optimization is currently not implemented in SHARE. For Sesame, [165] describes a join procedure that makes use of semi-joins and row blocking. Query optimization is a subject of future work for the SemWIQ system.

The GREEDY algorithm differs from most query optimizers in that the processes for planning and executing the query are interleaved. In the literature, this type of approach is termed adaptive query execution. Two prior examples of adaptive query systems are Tukwila [166] and Telegraph [167], both of which are relational systems. Tukwila uses traditional System R-style optimization, but additionally allows for the creation of partial plans when adequate statistics are not available, and for replanning when cardinality estimates prove to be inaccurate. Telegraph is a more radical departure from traditional query processing. It replaces the standard tree-of-operators representation of a query plan with a circular structure called an eddy that connects all of the operators. The idea of the eddy is to dynamically change the routing of the tuples between the operators in response to the run-time performance of the operators.

3.3  Key Points of SHARE Query Resolution

3.3.1  Resolving Triple Patterns

In Section 2.4, we described the procedure that SHARE uses to answer a SPARQL query over a distributed set of services; we summarize the key points of the procedure here. SHARE answers a query by resolving each triple pattern of the query in sequence. A triple pattern (s, p, o) may be resolved in either the forward direction or the reverse direction:

• If a triple pattern is resolved in the forward direction, all services that generate p are invoked using the bindings of s as the input URI(s), and the resulting output data is loaded into the local triple store. A service generates p if it outputs triples of the form (<input URI>, p, *), where * represents any URI or literal value. If o is an unbound variable, the set of distinct values returned for * become the bindings of o.

• If a triple pattern is resolved in the reverse direction, all services that generate inverse(p) are invoked using the bindings of o as the input URI(s), and the resulting output data is loaded into the local triple store. inverse(p) represents an inverse property of p, meaning that statements of the form (X, p, Y) and (Y, inverse(p), X) are logically equivalent. (By convention, the inverse properties of p, if any exist, are defined in the OWL ontology that defines p.) A service generates inverse(p) if it outputs triples of the form (<input URI>, inverse(p), *). If s is an unbound variable, the set of distinct values returned for * become the bindings of s.

Once all of the triple patterns have been resolved, the local triple store will contain a sufficient set of RDF triples to produce the solutions for the query. As the final step, the user's original query is issued against the local triple store and the solutions are displayed.

3.3.2  OWL Inferences About Predicates

SHARE implicitly makes use of OWL reasoning during query resolution. Whenever a new predicate or OWL class is encountered within a triple pattern, the query engine downloads the OWL ontology file for that predicate/class and loads it into the local reasoner. (By convention, the OWL ontology for a given predicate/class may be downloaded by resolving its URI.) One advantage of using OWL reasoning during query processing is that the mapping from predicates to services is more intelligent. The system is aware of which predicates are synonyms, inverses, or connected by parent/child relationships (i.e. subproperties), and can perform the necessary translations between related predicates implicitly. For example, if two predicates that describe the relationship between a drug and a target (e.g. "hasTarget" and "hasDrugTarget") are equivalent, the user may use either predicate within a query and be certain that all of the relevant services will be discovered and invoked.
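As an illustration of how such predicate relationships can be discovered, the sketch below resolves a hypothetical predicate URI, loads the resulting ontology with Jena, and lists any declared synonyms (owl:equivalentProperty) and inverses (owl:inverseOf). In SHARE these relationships are handled implicitly by the reasoner; the sketch merely enumerates the declarations.

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.OWL;

    public class PredicateRelations {
        public static void main(String[] args) {
            String predicateUri = "http://example.org/hasTarget"; // hypothetical
            Model ontology = ModelFactory.createDefaultModel();
            // By convention, the predicate URI resolves to its defining ontology.
            ontology.read(predicateUri);

            Resource p = ontology.createResource(predicateUri);
            StmtIterator synonyms =
                ontology.listStatements(p, OWL.equivalentProperty, (RDFNode) null);
            while (synonyms.hasNext()) {
                System.out.println("synonym: " + synonyms.next().getObject());
            }
            StmtIterator inverses =
                ontology.listStatements(p, OWL.inverseOf, (RDFNode) null);
            while (inverses.hasNext()) {
                System.out.println("inverse: " + inverses.next().getObject());
            }
        }
    }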
However, this additional intelligence comes at a cost. First, downloading ontology files during query resolution entails additional data transfer, and thus additional time when resolving triple patterns. Second, the computation of inferences can be expensive, depending on the number of ontology rules (e.g. synonym relationships between predicates) and the amount of data in the local triple store. For the evaluation of the optimizer in Section 3.7, the author has disabled OWL inferencing in SHARE. This allows the data retrieval aspect of the optimization problem to be studied in isolation.

3.4  The Distributed SPARQL Query Optimization Problem

3.4.1  An Illustrative Example

The ordering of triple patterns in a query does not affect the logical meaning of the query. For example, Fig. 3.1a and Fig. 3.2a show two equivalent SPARQL queries with different orderings; both versions ask the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins" [168]. While the ordering of triple patterns does not change the meaning, it does change the steps that SHARE uses to gather the required data, and thus it can have a significant effect on query efficiency. To illustrate, Fig. 3.1b and Fig. 3.2b depict the alternate steps used by SHARE to resolve the two orderings of the guppy query. The first ordering (Fig. 3.1b) is a bad strategy for resolving the query because there is a "fan-out" effect in Steps 1 and 2: there are a large number of PROSITE motifs (1,456), and each of those motifs is found in a large number of proteins. As a result, the query engine sends 166,986 protein URIs to the services that map proteins to organisms in Step 3, and the query continues running for more than an hour. (After an hour, the query was manually aborted.) In contrast, the second ordering of the query (Fig. 3.2b) begins by retrieving the 48 proteins that belong to Poecilia reticulata. Only a small amount of data is gathered (10,480 triples) to answer the query, and the query completes successfully in about 30 seconds.

Step 1: Retrieve a list of all PROSITE motifs
  Triple pattern: (?motif, rdf:type, prosite:Site)
  Time: 2,749 ms
  Variable bindings assigned: ?motif ⇒ 1,456 bindings

Step 2: For each motif from Step 1, retrieve all proteins that contain that motif
  Triple pattern: (?motif, bio2rdf:xUniProt, ?protein)
  Time: 170,639 ms
  Variable bindings assigned: ?protein ⇒ 166,986 bindings

Step 3: For each protein from Step 2, retrieve the organism it belongs to
  Triple pattern: (?protein, core:organism, taxonomy:8081)
  Time: > 1 hour (query aborted)
  Variable bindings assigned: N/A

Table 3.1: Description of execution steps for an inefficient query ordering, as depicted in Fig. 3.1b.

PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX taxonomy: <http://bio2rdf.org/taxonomy:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX prosite: <http://bio2rdf.org/ns/prosite#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?motif
WHERE {
  ?motif rdf:type prosite:Site .
  ?motif bio2rdf:xUniProt ?protein .
  ?protein core:organism taxonomy:8081 .
}
(a) SPARQL query

(b) execution plan as graph traversal

Figure 3.1: Example of an inefficient ordering of triple patterns for a query, along with its associated execution plan. A textual description of each step in the execution plan is provided in Table 3.1. The query asks the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins."

PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX taxonomy: <http://bio2rdf.org/taxonomy:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX prosite: <http://bio2rdf.org/ns/prosite#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?motif
WHERE {
  ?protein core:organism taxonomy:8081 .
  ?motif bio2rdf:xUniProt ?protein .
  ?motif rdf:type prosite:Site .
}

(a) SPARQL query

(b) execution plan as graph traversal

Figure 3.2: Example of an efficient ordering of triple patterns for a query, along with its associated execution plan. A textual description of each step in the execution plan is provided in Table 3.2. The query asks the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins."

Step 1: Retrieve all proteins that belong to Poecilia reticulata
  Triple pattern: (?protein, core:organism, taxonomy:8081)
  Time: 615 ms
  Variable bindings assigned: ?protein ⇒ 48 bindings

Step 2: Retrieve the motifs contained by each protein from Step 1
  Triple pattern: (?motif, bio2rdf:xUniProt, ?protein)
  Time: 9,112 ms
  Variable bindings assigned: ?motif ⇒ 8 bindings

Step 3: For each motif from Step 2, retrieve its type to determine if it is a PROSITE motif (this step is necessary because the bio2rdf:xUniProt predicate is also used to connect genes to UniProt proteins)
  Triple pattern: (?motif, rdf:type, prosite:Site)
  Time: 19,099 ms
  Variable bindings assigned: N/A

Table 3.2: Description of query execution steps for an efficient query ordering, as depicted in Fig. 3.2b.

3.4.2  Assumptions

The goal of the optimization algorithm is to determine the best ordering for the triple patterns in a distributed SPARQL query; that is, the ordering which will answer the query in the shortest amount of time. To simplify the problem, we assume that the execution time of a query is determined solely by:

1. The time spent by services processing requests
2. The time spent transferring data (service requests and responses) over the network

The use of OWL inferencing in SHARE can add a significant amount of local processing time to query resolution. However, as the optimization of OWL reasoning is a challenging problem in its own right, for the purposes of this chapter we disable OWL reasoning in SHARE and assume that the cost of local query processing is zero. In addition, we assume that the following constraints hold for resolving triple patterns:

1. Variables may not appear in the predicate positions of triple patterns.
2. Triple patterns may only be resolved when either the subject or the object is bound.
These constraints are not unique to SHARE; they also apply to the earlier DARQ [152] system, which uses the same predicate-based method for mapping triple patterns to services. Under constraints 1 and 2 above, the resolution of a distributed SPARQL query may be visualized as a graph traversal, as depicted in Fig. 3.1b. Each constant in the query acts as an independent starting point for one branch of the traversal, and the traversal is complete when all edges have been visited. The traversal of an edge is in effect a distributed join operation, where the outer table is a list of URIs (variable bindings), and the inner table is the set of triples that are "stored" by the relevant services. (The word "stored" is quoted here because triples may be generated computationally by the relevant services; conceptually, the set of triples "stored" by a computational service is the full set of triples generated by invoking the service with all possible input URIs.) Thus, the optimization problem discussed in this chapter is a distributed version of the well-known join-order problem [159] [160].

3.5  Secondary Optimization: Intersections of Variable Bindings

This section describes an optimization for the assignment of variable bindings that is independent of the query evaluation algorithm (GREEDY) presented in Section 3.6. The optimization is implemented by a small change to the processPattern method from Chapter 2 (listing 2.2), as shown in Fig. 3.3. In the original method, the bindings for a variable are determined when the first pattern that references the variable is resolved; thereafter, the bindings of the variable are fixed. In the optimized method, the bindings that are determined for a variable are intersected with any existing bindings.

To demonstrate the difference, consider answering the query shown in Fig. 3.4, which asks "What proteins in the post-synaptic membrane are targeted by the drug Benzthiazide?", using the original processPattern method. Here the variable of interest is ?protein, which occurs in three different patterns (patterns 1, 3, and 4). When pattern 1 is resolved, ?protein is assigned 965 proteins that are located in the post-synaptic membrane. When pattern 2 is resolved, ?target is assigned 5 drug target records for Benzthiazide. The critical step occurs at pattern 3. When pattern 3 is resolved (we assume here that it is resolved in the forward direction; the reverse direction is also possible, but makes no difference for the purposes of this discussion), the 5 bindings for ?target (drug target records) are mapped to corresponding UniProt IDs. However, because ?protein has already been assigned a set of bindings in pattern 1, the 965 existing bindings for ?protein are left unmodified. As a result, resolving pattern 4 requires retrieving the names (rdfs:label) of 965 proteins, which takes ∼2 minutes. The query time can be significantly reduced if the system computes the intersection of the bindings for ?protein from patterns 1 and 3. Using this approach, ?protein is reduced to a single binding, and thus pattern 4 is resolved with 1 input. Fig. 3.5a and Fig. 3.5b depict the execution of the query without and with the variable bindings optimization, respectively.
processPattern(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else // both subject and object are already bound
        invokeServicesByProperty(subjects, p)
    for each triple t in localTripleStore.getMatchingTriples(bindings[s], bindings[p], bindings[o])
        if (subjects == null)
            bindings[s].add(t.subject)
        if (objects == null)
            bindings[o].add(t.object)

(a) original processPattern method

processPatternOptimized(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else // both subject and object are already bound
        invokeServicesByProperty(subjects, p)
    newBindings = empty hash table
    for each triple t in localTripleStore.getMatchingTriples(bindings[s], bindings[p], bindings[o])
        newBindings[s].add(t.subject)
        newBindings[o].add(t.object)
    if (subjects == null)
        bindings[s] = newBindings[s]
    else
        bindings[s] = intersection(bindings[s], newBindings[s])
    if (objects == null)
        bindings[o] = newBindings[o]
    else
        bindings[o] = intersection(bindings[o], newBindings[o])

(b) optimized processPattern method

Figure 3.3: A modification to the processPattern method from Chapter 2 (listing 2.2) which implements the variable bindings optimization described in Section 3.5. Part (a) shows the original pseudocode and part (b) shows the modified pseudocode.

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX go: <http://bio2rdf.org/go:>

SELECT ?proteinName
WHERE {
  ?protein core:classifiedWith go:0045211 .
  drug:DB00562 drugbank:target ?target .
  ?target drugbank:swissprotId ?protein .
  ?protein rdfs:label ?proteinName .
}

(a) SPARQL query

(b) SPARQL query, depicted graphically

Figure 3.4: An example query for SHARE, used to illustrate the variable bindings optimization described in Section 3.5.
The query asks "What proteins in the post-synaptic membrane are targeted by the drug Benzthiazide?". go:0045211 is the Gene Ontology term for "post synaptic membrane", and drug:DB00562 is the identifier for Benzthiazide.

(a) execution plan without the variable bindings optimization

(b) execution plan using the variable bindings optimization

Figure 3.5: The execution plans used by SHARE to solve the query from Fig. 3.4, without and with the variable bindings optimization described in Section 3.5. Textual descriptions of the two execution plans are given in Table 3.3 and Table 3.4, respectively.

Step 1: Retrieve all proteins that have been annotated with Gene Ontology term go:0045211 (post synaptic membrane)
  Triple pattern: (?protein, core:classifiedWith, go:0045211)
  Time: 2,949 ms
  Variable bindings assigned: ?protein ⇒ 965 bindings

Step 2: Retrieve all targets of the drug DB00562 (Benzthiazide)
  Triple pattern: (drug:DB00562, drugbank:target, ?target)
  Time: 3,896 ms
  Variable bindings assigned: ?target ⇒ 5 bindings

Step 3: Retrieve the equivalent UniProt identifier for each target from Step 2
  Triple pattern: (?target, drugbank:swissprotId, ?protein)
  Time: 2,936 ms
  Variable bindings assigned: none (?protein was already assigned a set of bindings in Step 1)

Step 4: For each protein from Step 1, retrieve its name
  Triple pattern: (?protein, rdfs:label, ?proteinName)
  Time: 128,363 ms
  Variable bindings assigned: ?proteinName ⇒ 965 bindings

Table 3.3: Description of execution steps without the variable bindings optimization, as depicted in Fig. 3.5a.

Step 1: Retrieve all proteins that have been annotated with Gene Ontology term go:0045211 (post synaptic membrane)
  Triple pattern: (?protein, core:classifiedWith, go:0045211)
  Time: 2,655 ms
  Variable bindings assigned: ?protein ⇒ 965 bindings

Step 2: Retrieve all targets of the drug DB00562 (Benzthiazide)
  Triple pattern: (drug:DB00562, drugbank:target, ?target)
  Time: 3,673 ms
  Variable bindings assigned: ?target ⇒ 5 bindings

Step 3: Retrieve the equivalent UniProt identifier for each target from Step 2
  Triple pattern: (?target, drugbank:swissprotId, ?protein)
  Time: 2,936 ms
  Variable bindings assigned: ?protein ⇒ 1 binding (this step is where the optimization takes place: resolving the pattern yields 5 distinct values for ?protein, which are intersected with the 965 existing bindings from Step 1; the two sets have one value in common)

Step 4: For each protein from Step 3 (i.e. each protein that is both a target of Benzthiazide and located in the post-synaptic membrane), retrieve its name
  Triple pattern: (?protein, rdfs:label, ?proteinName)
  Time: 12,020 ms
  Variable bindings assigned: ?proteinName ⇒ 1 binding

Table 3.4: Description of execution steps with the variable bindings optimization, as depicted in Fig. 3.5b.
3.6  The GREEDY Optimization Algorithm

3.6.1  Overview

In this section, we describe an algorithm for evaluating distributed SPARQL queries called GREEDY. In the default method of query evaluation for SHARE, which we will hereafter call BASIC, triple patterns are resolved in the same order that they occur in the query. In addition, BASIC resolves all triple patterns in the forward (left-to-right) direction by default, unless only the reverse direction is possible (i.e. the subject is an unbound variable). The pseudocode for the default query evaluation procedure is shown in listing 2.2.

GREEDY uses a more sophisticated approach for determining the ordering and directions of the triple patterns. At each step, the costs of all unvisited triple patterns are estimated and compared pairwise, and the pattern that is cheapest overall is selected for resolution. If a particular pattern is resolvable in both the forward and reverse directions (i.e. both the subject and object of the pattern are bound), then the cost of both directions is estimated and the overall cost of the pattern is taken to be the lesser of the two. The pseudocode for the main procedure of GREEDY is shown in listing 3.1.

GREEDY(queryPatterns)
    visitedPatterns = empty set
    while (visitedPatterns.size() < queryPatterns.size())
        bestPattern = null;
        for each pattern in queryPatterns
            if (visitedPatterns.contains(pattern))
                continue;
            if (bestPattern == null || compareCost(pattern, bestPattern) < 0)
                bestPattern = pattern;
        resolvePattern(bestPattern);
        visitedPatterns.add(bestPattern);

Listing 3.1: Pseudocode for the GREEDY query evaluation algorithm

While the main procedure of GREEDY is straightforward, the operation of compareCost(pattern1, pattern2) remains to be explained. Although no statistics about the services are directly available, the algorithm must somehow estimate the relative costs of the triple patterns. In GREEDY, the relative cost of a triple pattern is estimated in one of two ways:

1. Using statistics about predicates that have been learned from previous queries
2. Using the number of bindings for the subject/object of the triple pattern

The latter method acts as a fallback in the case where no statistics have yet been learned from previous queries. In the next section, we describe the details of estimating the cost of a triple pattern.

3.6.2  Cost Estimation for Query Patterns

Cost Equation

GREEDY estimates the cost of retrieving data for a triple pattern by the following equation:

    estimatedTime = baseTime(p, dir) + numInputs × timePerInput(p, dir)    (3.1)

where

• estimatedTime is the expected time to retrieve the data
• baseTime(p, dir) is the sum of the round-trip network latencies for all services that resolve predicate p in the direction dir (baseTime(p, dir) also includes the cost of establishing a TCP connection)
• numInputs is the number of input URIs that are sent to each service
• timePerInput(p, dir) is the time required to receive the response data for a single input URI, from all services that resolve predicate p in the direction dir
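Eq. (3.1) translates directly into code. The following is a hypothetical rendering (the names are ours, not those of the SHARE source), with the regression parameters assumed to have been looked up from the statistics database described below:

    public class CostModel {
        // Direction in which a triple pattern is resolved (Section 3.3.1).
        enum Direction { FORWARD, REVERSE }

        // Learned parameters for one (predicate, direction) combination.
        static class PredicateStats {
            final double baseTimeMs;      // intercept: round-trip latencies + connection setup
            final double timePerInputMs;  // slope: response time per input URI
            PredicateStats(double baseTimeMs, double timePerInputMs) {
                this.baseTimeMs = baseTimeMs;
                this.timePerInputMs = timePerInputMs;
            }
        }

        // Eq. (3.1): estimatedTime = baseTime(p, dir) + numInputs * timePerInput(p, dir)
        static double estimatedTimeMs(PredicateStats stats, int numInputs) {
            return stats.baseTimeMs + numInputs * stats.timePerInputMs;
        }
    }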
The resolution of a triple pattern (s, p, o) is in effect a join operation between the local triple store and the triples stored by the services that generate p. However, in contrast to traditional cost estimations for joins, which are based on the sizes of the input tables, Eq. (3.1) is based on service response times. There are two reasons for this. First, the time required to transfer data over the network is a significant cost that is not present in a single-database scenario, and it is a cost that may vary for each data source. Second, the data sources can be computational services. The amount of processing time required by different algorithms is likely to vary greatly (e.g. a BLAST query vs. a protein structure prediction), and will also vary with the CPU power and load of the servers that are used to perform the analyses. These factors are represented in Eq. (3.1) by the timePerInput(p, dir) factor.

Learning Statistics from Queries

After each triple pattern is resolved, GREEDY records a sample containing the following information:

• predicate URI
• direction of resolution (forward or reverse)
• number of inputs (i.e. number of bindings of the subject/object of the triple pattern)
• total response time (in milliseconds)

The total response time is the time required to obtain all data for a triple pattern from the available services. The information listed above constitutes a single sample in the statistics database. At periodic intervals, the available samples are used to compute the values of baseTime(p, dir) and timePerInput(p, dir) using linear regression, as depicted in Fig. 3.6. Once the baseTime and timePerInput values have been computed, GREEDY is able to use them to estimate the times for resolving triple patterns, as per Eq. (3.1).

Figure 3.6: An example regression line, computed to determine the cost parameters baseTime(bio2rdf:xUniProt, reverse) and timePerInput(bio2rdf:xUniProt, reverse), as described in Section 3.6.2. The fitted line is estimatedTime = 1166.53 ms + numInputs × 177.98 ms; that is, baseTime = 1166.53 ms and timePerInput = 177.98 ms. The Bio2RDF project uses the bio2rdf:xUniProt predicate to denote a cross-reference from an entity (e.g. a gene) to a UniProt protein record. The prefix "bio2rdf:" is an abbreviation for http://bio2rdf.org/ns/bio2rdf#.
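The regression itself is ordinary least squares over (numInputs, responseTime) samples. A self-contained sketch follows; this is not the SHARE source, which can be consulted at the repository given in Section 2.6.

    public class ResponseTimeRegression {
        // Fit responseTime = baseTime + numInputs * timePerInput by least squares.
        // Returns { baseTime, timePerInput }. As noted in the text, at least two
        // samples with distinct numInputs values are required.
        static double[] fit(double[] numInputs, double[] responseTimes) {
            int n = numInputs.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumX += numInputs[i];
                sumY += responseTimes[i];
                sumXY += numInputs[i] * responseTimes[i];
                sumXX += numInputs[i] * numInputs[i];
            }
            double denom = n * sumXX - sumX * sumX;
            if (denom == 0)
                throw new IllegalArgumentException("need >= 2 distinct numInputs values");
            double timePerInput = (n * sumXY - sumX * sumY) / denom;
            double baseTime = (sumY - timePerInput * sumX) / n;
            return new double[] { baseTime, timePerInput };
        }
    }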
Special Cases for Cost Estimation

In order to compute a regression line for a given predicate and direction, at least two samples with distinct x values (i.e. numInputs values) must be recorded in the statistics database. There are two cases where summary statistics will not be available for a given predicate/direction combination (denoted "(p, dir)" below):

1. No samples have been recorded for (p, dir): If available, the values averageBaseTime and averageTimePerInput will be used instead. averageBaseTime and averageTimePerInput are calculated over the baseTime and timePerInput values for all predicates in the statistics database:

    averageBaseTime = ( Σ_(p,dir) baseTime(p, dir) ) / N

    averageTimePerInput = ( Σ_(p,dir) timePerInput(p, dir) ) / N

where N is the number of predicate/direction combinations that have computed values for baseTime and timePerInput. These values are intended to represent the characteristics of a predicate with average performance. averageBaseTime and averageTimePerInput will only be available if a regression line has been computed for at least one predicate/direction combination in the statistics database. If this is not the case, GREEDY will use numInputs (i.e. the number of subject/object bindings) as the cost of the triple pattern. This is equivalent to assuming that all predicate/direction combinations perform equivalently when given the same number of inputs.

2. One or more samples with the same x value (numInputs) have been recorded for (p, dir): Let the common value of numInputs across the samples be X. An average response time will be computed for the samples, and this value will be used to estimate the cost of any future triple patterns where numInputs is exactly equal to X. Otherwise, the cost of the triple pattern will be computed from averageBaseTime/averageTimePerInput or numInputs, as per Case 1.
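Taken together, the cost estimate for a triple pattern reduces to a cascade of fallbacks. The sketch below is a hypothetical rendering of that logic (not the actual compareCost implementation), using nullable boxed values to indicate which statistics are available:

    public class CostFallback {
        // Estimate the cost of resolving a pattern for one (predicate, direction)
        // combination, falling back through the cases described above.
        static double estimateCost(Double baseTime, Double timePerInput,  // regression line, if fitted
                                   Double avgTimeForThisInputCount,       // Case 2: exact numInputs match
                                   Double averageBaseTime,                // Case 1: global averages
                                   Double averageTimePerInput,
                                   int numInputs) {
            if (baseTime != null && timePerInput != null)
                return baseTime + numInputs * timePerInput;               // Eq. (3.1)
            if (avgTimeForThisInputCount != null)
                return avgTimeForThisInputCount;                          // Case 2
            if (averageBaseTime != null && averageTimePerInput != null)
                return averageBaseTime + numInputs * averageTimePerInput; // Case 1
            return numInputs;  // final fallback: compare patterns by input count alone
        }
    }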
3.7  Evaluation

3.7.1  Evaluation Procedure

At the current time, no benchmarks are available for distributed SPARQL systems. This is not surprising, given that benchmarking distributed systems is considerably more difficult than benchmarking monolithic database systems. Whereas the performance of monolithic systems can be measured independently with standardized data sets, distributed systems must be compared as a group, in order to control for factors such as changing data sources, server load, connection bandwidth, and volume of traffic on the network. Here we evaluate the performance of GREEDY in a simple before-and-after manner; that is, we measure the performance differences between BASIC (optimizer off) and GREEDY (optimizer on) with increasing amounts of training, without comparison to other systems.

In order to base the evaluation of GREEDY on realistic use cases, the author translated 6 bioinformatics queries found in the literature into equivalent SHARE queries (Fig. 3.7). The original queries were taken from the publications of other bioinformatics integration projects, namely TAMBIS [168], GALAXY [169], BioMART [119], and the Linked Open Drug Data (LODD) [91] project. These systems were the most suitable source of third-party queries because they target many of the same data sources as SHARE. Nevertheless, some translations could not be made perfectly. For instance, Query 3 (Fig. 3.7) originally used the UCSC genome database to obtain exon boundaries and SNP coordinates, while the SHARE version uses the SNP annotations provided by UniProt instead. Notes are provided in Fig. 3.7, indicating where such differences exist.

To evaluate the learning aspect of the GREEDY algorithm, the queries of Fig. 3.7 were divided into 3 training queries and 3 test queries. The assignment of queries to the two sets was based on the overlap in subject matter between the queries: Queries 1 & 2 involve SNPs, Queries 3 & 4 involve drugs and associated targets, and Queries 5 & 6 involve biological pathways. For query pairs 1/2 and 5/6, one query was randomly chosen as the training query and the other became the test query. For query pair 3/4, Query 3 was explicitly chosen as the training query because it contains only one constant; this implies that Query 3 has only one possible execution plan, and thus optimization would have no effect. More will be said about queries with a single constant in the conclusion (Section 3.9).

For each training query, 3 additional training variants were generated by substituting the constants in the query with randomly selected replacement values. For example, the training variants for Query 1 were generated by changing the target protein. Similarly, variants for Queries 3 and 5 were generated by changing the target drug and biological pathway, respectively. Generating variants of the training queries was necessary in order to provide adequate training data for GREEDY. If the optimizer were trained by running the same queries repeatedly, the predicates in the training queries would always be resolved using the same number of subject/object bindings (i.e. service inputs), and thus no regression lines could be computed. The variants used for each training query are listed in Appendix B.1. Each training query was run with GREEDY rather than BASIC; this most accurately reflects the real usage of SHARE, where the optimizer is enabled for all queries.

PREFIX uniprot: <http://bio2rdf.org/uniprot:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX core2: <http://bio2rdf.org/core:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?substitution ?start ?end ?dbCrossRef
WHERE {
  uniprot:P01344 core:annotation ?annotation .  # uniprot:P01344 = "IGF-II gene"
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation core:substitution ?substitution .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end .
}

(a) Query 1 (training query): "Find non-synonymous single nucleotide polymorphisms (SNPs) within coding exons of the human insulin-like growth factor II (IGF-II) gene". The original query from the GALAXY project [169] did not specify that the SNPs should be non-synonymous; this was a limitation of using UniProt to obtain SNP information, rather than the UCSC genome database as per the original query.

(b) Query 1 as graph

PREFIX probeset: <http://bio2rdf.org/affymetrix:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX core2: <http://bio2rdf.org/core:>
PREFIX affy: <http://bio2rdf.org/ns/affymetrix#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?geneID ?proteinID ?snpID ?goProcess ?goComponent ?goFunction
WHERE {
  # gene IDs
  probeset:53701_at affy:xEnsembl ?geneID .
  # protein IDs
  probeset:53701_at affy:xSwissProt ?proteinID .
  # GO terms
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  # SNPs
  ?proteinID core:annotation ?annotation .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?snpID .
}

(c) Query 2 (test query): "Map a microarray probeset to associated information: ENSEMBL gene IDs, SwissProt protein IDs, SNPs, and GO terms", a use case from the BioMART project [119]. As in Query 1, the results are limited to non-synonymous SNPs.
(d) Query 2 as graph

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>
PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?targetProtein ?gene ?disease ?diseaseName
WHERE {
  drug:DB00421 drugbank:target ?targetProtein .
  ?targetProtein drugbank:hgncId ?gene .
  ?gene bio2rdf:xOMIM ?omim .
  ?disease diseasome:omim ?omim .
  ?disease rdfs:label ?diseaseName .
}

(e) Query 3 (training query): "For each target protein of the drug Spironolactone, show the gene that encodes the protein, and any known disease associations for the gene", from the Linked Open Drug Data (LODD) project [91]. The original "query" that appears in [170] is answered interactively, by browsing across the LODD datasets from within a web browser. The SHARE query answers the same question using a distributed query across the DrugBank, HGNC, and Diseasome SPARQL endpoints. The subject drug of the original query has been changed from Varenicline to Spironolactone to avoid overlap of training and testing queries during the optimizer evaluation.

(f) Query 3 as graph

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX omim: <http://bio2rdf.org/mim:>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?targetProtein ?alzProtein ?alzProteinName
WHERE {
  drug:DB01273 drugbank:target ?target .   # drug:DB01273 = "Varenicline"
  ?target drugbank:swissprotId ?targetProtein .
  ?alzProtein rdfs:seeAlso omim:104300 .   # omim:104300 = "Alzheimer's Disease"
  ?alzProtein dc:title ?alzProteinName .
  ?targetProtein core:interaction ?interaction .
  ?interaction core:participant ?participant .
  ?participant owl:sameAs ?alzProtein .
}

(g) Query 4 (test query): "Are there any protein-protein interactions between Alzheimer-associated proteins and the targets of Varenicline?", from the Linked Open Drug Data project.
The original "query" posed in [170] was answered by browsing hypotheses in the AlzSWAN KnowledgeBase [171]. In the SHARE version, the question is answered by a distributed query across the UniProt and DrugBank SPARQL endpoints.

(h) Query 4 as graph

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO: <http://lsrn.org/GO:>

SELECT ?protein ?omim
FROM <http://dev.biordf.net/~benv/inferences.owl>
WHERE {
  ?protein sadi:hasFunction GO:0004872 .      # GO:0004872 = "receptor activity"
  ?protein sadi:isParticipantIn GO:0007595 .  # GO:0007595 = "lactation"
  ?protein sadi:isCausallyRelatedWith ?omim .
}

(i) Query 5 (training query): "Retrieve receptor proteins involved in lactation and one or more disease processes", from the TAMBIS project [172].

(j) Query 5 as graph

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO: <http://lsrn.org/GO:>
PREFIX UniProt: <http://lsrn.org/UniProt:>

SELECT DISTINCT ?motif
FROM <http://dev.biordf.net/~benv/inferences.owl>
WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .  # UniProt:Q93038 = LARD protein
  ?protein sadi:isParticipantIn GO:0006915 .     # GO:0006915 = apoptosis
  ?protein sadi:hasMotif ?motif .
}

(k) Query 6 (test query): "Select motifs for proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as LARD)", from the TAMBIS project [173].

(l) Query 6 as graph

Figure 3.7: SHARE translations of six queries from other bioinformatics integration projects, including TAMBIS, GALAXY, BioMART, and Linked Open Drug Data. These queries are used for the evaluation of the GREEDY optimization algorithm.

The overall idea of the evaluation is to measure the performance difference between BASIC (optimizer off) and GREEDY (optimizer on) with increasing amounts of training. However, the ordering of triple patterns in an input query can have a large effect on what this performance difference actually is. For example, if a "smart" user composes a query that already has the optimal ordering of triple patterns, then there will be no difference at all between BASIC and GREEDY for that query. For the optimizer evaluation described here, we make the simplifying assumption that each input query ordering is equally likely. Ideally, we would like to compare the performance of BASIC and GREEDY for all possible orderings of each test query. This would provide us with the complete range of execution times that are possible for a given query; that is, how slow the query can be and how fast it can be, and how the possible orderings for the query are distributed between these two extremes.
Unfortunately, the exhaustive approach is not practical, even for the relatively simple queries presented in Fig. 3.7; for example, Query 2 has 8 triple patterns, and thus has 8! = 40,320 possible orderings. Instead, execution times were measured for 5 randomly generated orderings of each test query, with the understanding that these orderings are merely a sample from the population of possible query orderings. The pseudocode for the benchmarking procedure is shown in listing 3.2. Three trials were recorded for each test query, in order to give an indication of the amount of variance that occurs between executions of the same query due to varying server load and data transfer times. The results in the following section demonstrate that this variance is small relative to the changes in query times that result from reordering the triple patterns of a query.

FOR trainingRun = 0 to 5
    IF (trainingRun > 0)
        FOR trainingQuery = { Query 1, Query 3, Query 5 }
            Run variant trainingRun of trainingQuery and record stats
    FOR testQuery = { Query 2, Query 4, Query 6 }
        FOR testQueryOrdering = 0 to 9
            IF (trainingRun == 0)
                FOR trial = 0 to 2
                    Run testQuery with BASIC and time
            FOR trial = 0 to 2
                Run testQuery with GREEDY and time

Listing 3.2: Pseudocode for the benchmarking procedure for GREEDY

3.7.2  Results

Benchmarking Results

Figure 3.8 shows the execution times for the 5 randomly generated orderings of Test Queries 2, 4, and 6. (For a complete list of generated orderings for each query, see Appendix B.2.) Each group of bars in Fig. 3.8 represents a set of execution times for one particular ordering of a query. Within each group, the execution times were generally expected to decrease from left to right, beginning with the execution time for BASIC (optimizer off) and proceeding through the execution times for GREEDY (optimizer on) with increasing numbers of training runs. A timeout of one hour was imposed on each query to ensure that the overall experiment would complete in a reasonable amount of time.

The solution sets generated for all trials of Queries 2, 4, and 6 were identical, regardless of the ordering of triple patterns, and regardless of whether the queries were run under BASIC or GREEDY with 0-4 training runs. This was verified by recording the solution set for each query execution to a separate file, and comparing the files with the unix 'fdupes' utility for identifying duplicate files. The lines of each solution file were sorted alphabetically prior to comparison, as some trials would generate the rows of the solutions table in different orders.

(a) Execution times for Query 2

(b) Execution times for Query 4
(c) Execution times for Query 6

Figure 3.8: Results of GREEDY evaluation: a comparison of query execution times for Queries 2, 4, and 6 of Fig. 3.7. Each query is executed using 5 randomly generated orderings, and for each ordering, query times are recorded for BASIC (optimizer off) and GREEDY (optimizer on) with varying numbers of training runs.

In order to fully understand the decisions made by the trained versions of GREEDY in the discussion below, it is necessary to see the state of the predicate statistics after each training run. These states are depicted in Fig. 3.9, using two types of graphs:

1. scatterplots (left): show average response times for predicates which only have samples for a single x (numInputs) value.
2. bar graphs (right): show the slope (timePerInput) and intercept (baseTime) for predicates that have sufficient samples for a regression line.

The reader will observe from the bar graphs of Fig. 3.9 that for most of the predicates, a regression line is only ever computed in one of the two possible directions. This happens because the structure of the training queries (Fig. 3.7) only permits resolving these predicates in one direction. For example, in Query 3, the drugbank:hgncId predicate is always resolved in the forward direction because the path of traversal must begin with the sole constant in the query, drug:DB00421. The sadi:isCausallyRelatedWith predicate is a special case; although it is always resolved in the forward direction during training, it is its own inverse property, and so the same regression line describes both directions.

Results for Query 2 (Fig. 3.8a)

Over all trials, we observe 3 distinct query times: ∼45 seconds (best), ∼100 seconds (fair), and timeout after 60 minutes (worst). Examination of the query logs shows that it is the treatment of one particular triple pattern that determines the outcome for each trial:

?annotation rdf:type core2:Natural_Variant_Annotation

One of the conventional uses of rdf:type is to indicate database record types. In this particular triple pattern, Natural_Variant_Annotation represents the type of record that is used to store amino acid sequence variations in the UniProt database. (The majority of these variations are single amino acid polymorphisms (SAPs), which are in turn due to single nucleotide polymorphisms (SNPs) in the coding DNA. In the UniProt database, the category of naturally occurring mutations also includes disease-associated mutations and RNA editing events.) When the pattern is resolved in the reverse direction, a service call is made which retrieves all Natural_Variant_Annotation records from Bio2RDF's UniProt SPARQL endpoint. At the current time, there are 70,629 such records, and retrieving a list of their identifiers requires about 55 seconds.

While resolving the rdf:type pattern in the reverse direction significantly increases the query time, it is not by itself disastrous. More serious performance problems (i.e. timeouts) occur when the query engine proceeds to use the 70,629 bindings for ?annotation as input for solving a later query pattern. To illustrate, Fig. 3.10a shows the sequence of steps used by BASIC for Query Ordering 3, which lead to a query timeout.
The GREEDY heuristic avoids the use of large input sets when resolving query patterns, and thus turning on the optimizer for Query Orderings 3 and 4 produces an immediate and significant improvement. However, without training, the choices of GREEDY are still not optimal. The GREEDY heuristic predicts that resolving the rdf:type pattern in the reverse direction will be a fast operation, since it requires sending only one input (Natural_Variant_Annotation) to the relevant services. The implicit assumption of the GREEDY heuristic is that all predicates will require the same amount of time to resolve, given the same number of inputs. This assumption does not hold true for the rdf:type predicate, and so GREEDY requires training in order to make the correct decision. Fig. 3.10b and Fig. 3.10c show the steps followed by GREEDY for Query Ordering 3, without training and with training respectively. In the trained version, GREEDY delays resolution of the rdf:type pattern until it can be resolved in the forward direction.

(a) 1 Training Run: Average Predicate Response Times

(b) 2 Training Runs: Average Predicate Response Times

(c) 2 Training Runs: Regression Line Parameters

(d) 3 Training Runs: Average Predicate Response Times

(e) 3 Training Runs: Regression Line Parameters
Evaluation  sadi:isFunctionOf sadi:hasFunction sadi:isParticipantIn sadi:hasParticipant  rdf:type  sadi:isFunctionOf sadi:hasFunction  2000 5000 500  50  100  Time (Milliseconds)  500  forward response time reverse response time  10  100  core:annotation  300  400  rdfs:label  sadi:isCausallyRelatedWith  200 Number of Inputs  drugbank:hgncId  100  diseasome:omim  1  rdfs:seeAlso  0  core:range  drugbank:target  core:begin  bio2rdf:xOMIM  5  Average Response Time (seconds)  baseTime forward timePerInput forward baseTime reverse timePerInput reverse  (f) 4 Training Runs: Average Predicate Response (g) 4 Training Runs: Regression Line Parameters Times  Figure 3.9: State of predicate statistics during the experiment. Left: Scatter plots showing the average response times of various predicates for a fixed number of inputs. The predicate/ direction pairs included these graphs have the same x value (i.e. numInputs) for all samples, and thus no regression line is computable. Right: Bar graphs depicting the slope (timePerInput) and intercept (baseTime) for all predicates for which a regression line was computable. Note that there is no bar graph for 1 training run because a regression line can only be computed from two or more samples; each predicate is used in only one training query, and so obtaining adequate samples entails two or more training runs.  65  3.7. Evaluation the forward direction. Finally, the reader will notice from Fig. 3.8a (and also Fig. 3.8c) that GREEDY always requires two training runs before there is any improvement in query times. The reason for this is that a minimum of two samples are required to compute a regression line. Until a regression line has been computed for at least one predicate, all patterns are compared by the number of input bindings. The results for each random query ordering (i.e. group of bars) in Fig. 3.8a differ only with respect to the query times under BASIC. In query orderings 0 and 2, the patterns were ordered such that BASIC resolved the rdf:type in the reverse direction, but did not use the large number of resulting bindings for ?annotation to resolve a later pattern. In query ordering 1, the patterns were ordered such that the rdf:type pattern was resolved in the forward direction. In query orderings 3 and 4, the patterns were ordered such that the rdf:type pattern was resolved in the reverse direction and the resulting bindings of ?annotation were used as input for a later pattern. (The complete list of randomly generated query orderings for the experiment is provided in Appendix B.2.) The results for query ordering 1 illustrate an important caveat of GREEDY, which is that query times prior to training may actually be worse than for BASIC, in the case that a “smart” user specifies the query patterns in an optimal order or is simply lucky.  66  3.7. Evaluation  3  probeset:53701_at 5 1 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  47ms  55ms  ?geneID  (1 binding)  3,696ms  ?goProcess  ?proteinID  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation ?annotation (70,629 bindings)  4 rdf:type  6  53,152ms  rdfs:seeAlso  TIMEOUT  core2:Natural_Variant_Annotation  ?snpID  (a) The steps followed by BASIC, in an attempt to answer Test Ordering 3 of Query 2 (Appendix B.2.1). The query times out after 60 minutes. 
probeset:53701_at  3  5 1 6 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  48ms  42ms  ?geneID  44ms  ?goProcess  ?proteinID  (1 binding)  7  2,968ms  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation  14,782ms  ?annotation (Step 4: 70,629 bindings, Step 7: 2 bindings)  4 rdf:type  8  54,759ms  rdfs:seeAlso 1,997ms  core2:Natural_Variant_Annotation  ?snpID (2 bindings)  (b) The steps followed by GREEDY with 0 training runs, in order to answer Test Ordering 3 of Query 2 (Appendix B.2.1). probeset:53701_at  3  4 1 5 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  45ms  41ms  ?geneID  44ms  ?goProcess  ?proteinID  (1 binding)  6  2,951ms  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation  14,782ms  ?annotation (2 bindings)  7  rdf:type  14,247ms  8  rdfs:seeAlso 165ms  core2:Natural_Variant_Annotation  ?snpID (2 bindings)  (c) The steps followed by GREEDY with 2 training runs, in order to answer Test Ordering 3 of Query 2 (Appendix B.2.1).  Figure 3.10: Execution plans for Query 2 under BASIC, untrained GREEDY, and trained GREEDY. Descriptions of the execution steps for each plan are given in Tables 3.5 to 3.7, respectively.  67  3.7. Evaluation  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  3,696  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  55  ?goProcess => 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  47  ?geneID => 1 binding  4  Retrieve amino acid sequence variation records for all proteins in the UniProt database  (?annotation, rdf:type, core2:Natural Variant Annotation)  53,152  ?annotation => 70,629 bindings  5  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  6  For each amino sequence variation record from Step 4, retrieve the associated SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  TIMEOUT  N/A  =>  1  Table 3.5: Description of execution steps for Query 2 under BASIC, as depicted in Fig. 
3.10a  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  2,968  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  44  ?goProcess => 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  48  ?geneID => 1 binding  4  Retrieve amino acid sequence variation records for all proteins in the UniProt database  (?annotation, rdf:type, core2:Natural Variant Annotation)  54,759  ?annotation => 70,629 bindings  5  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  6  Retrieve UniProt protein identifiers corresponding to Affymetrix probeset 53701 at  (probeset:53701 at, affy:xSwissProt, ?proteinID)  42  ?proteinID => 1 binding  7  Retrieve annotations for each protein from step 6  (?proteinID, core:annotation, ?annotation)  14,782  ?proteinID => 1 binding ?annotation => 2 bindings  8  For each annotation from step 7, retrieve the corresponding SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  1,997  ?annotation => 2 bindings ?snpID => 2 bindings  =>  1  Table 3.6: Description of execution steps for Query 2 under untrained GREEDY, as depicted in Fig. 3.10b  68  3.7. Evaluation  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  2,951  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  44  ?goProcess=> 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  45  ?geneID => 1 binding  4  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  5  Retrieve UniProt protein identifiers corresponding to Affymetrix probeset 53701 at  (probeset:53701 at, affy:xSwissProt, ?proteinID)  41  ?proteinID => 1 binding  6  Retrieve annotations for each protein from step 6  (?proteinID, core:annotation, ?annotation)  11,237  ?proteinID => 1 binding ?annotation => 2 bindings  7  For each annotation record from Step 5, retrieve it?s record type  (?annotation, rdf:type, core2:Natural Variant Annotation)  14,247  ?annotation => 2 bindings  8  For each Natural Variant Annotation record, retrieve the corresponding SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  165  ?annotation => 2 bindings ?snpID => 2 bindings  =>1  Table 3.7: Description of execution steps for Query 2 under trained GREEDY, as depicted in Fig. 3.10c In Figs. 3.10a to 3.10c, the reader will observe that many of the triple patterns are resolved very quickly. For instance, Steps 2, 3, and 5 of Fig. 3.10a are completed in 55, 47, and 65 milliseconds, respectively. This happens because the relevant data for these patterns has already been gathered by the service invocation for Step 1. 
In Step 1, SHARE resolves the pattern: (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent) by issuing the following query to the Bio2RDF Affymetrix endpoint: CONSTRUCT { p r o b e s e t : 5 3 7 0 1 a t ?p ? o . } WHERE { p r o b e s e t : 5 3 7 0 1 a t ?p ? o . }  As a result, Step 1 retrieves not only triples with the affy:xGene Ontology Cellular Component predicate, but all triples in the Affymetrix endpoint with probeset:53701 at as the subject. This includes all triples that match the query patterns of Steps 2, 3, and 5, and thus no additional service invocations need to be made in these steps. The query engine maintains a hash table to track which services have been invoked with which inputs, in order to avoid redundant service calls. Results for Query 4 (Fig. 3.8b) We see that for Query 4, there is no discernible difference between the performance of BASIC and GREEDY. The majority of query times are in the range 30-45 seconds, and this variation occurs across different trials for the same query plan, indicating that it is due to factors separate from the ordering of triple patterns (e.g. server load, network traffic). Query 4 contains no exceptionally expensive patterns, and there is no fan-out effect for any path of query traversal. 69  3.7. Evaluation For example, there are only 2 Alzheimer’s-associated proteins in UniProt and only 2 known targets for the drug Varenicline. It is more efficient to solve the query by searching the proteinprotein interactions of the Varenicline targets, as there is only 1 known interaction, whereas there are 58 known interactions for the Alzheimer’s proteins. However, there are relatively few orderings that would cause BASIC to use the latter strategy, and none of those orderings were included in the randomly generated test orderings. Results for Query 6 (Fig. 3.8c) The results graph for Query 6 shows similar patterns to the results graph for Query 4 (Fig. 3.8a). Here again, there are 3 distinct query times (∼ 140 seconds, ∼ 1300 seconds, and timeout), and the performance of each query depends on the handling of a single pattern: ?protein sadi:isParticipantIn GO:0006915 Similarly to rdf:type in Query 4, the sadi:isParticipantIn predicate is unusually expensive to resolve in the reverse direction. Using GO:0006915 (apoptosis) as input, resolving sadi:isParticipantIn in the reverse direction requires approximately 20 minutes and produces 40,475 bindings for the ?protein variable. It makes intuitive sense that this strategy would be expensive, because there may be a large number of proteins involved in a given biological process, particularly when proteins from all species are considered (as is the case here). In addition, retrieving the proteins associated with a given Gene Ontology term is more computationally intensive than a typical database lookup, because the results must include proteins that are annotated with any subterm (i.e. descendant term) of the target GO term. While these factors have a significant effect on query time, the factor that has the largest effect is the number of services that map to sadi:hasParticipant (the inverse predicate of sadi:isParticipantIn). In total, there are 27 services which map GO Terms to various types of database identifiers15 (GenBank, FlyBase, etc.), and the query engine invokes each of these services sequentially. As in Query 4, enabling GREEDY prevents query timeouts (e.g. Test Ordering 1 of Fig. 3.8c) that result from resolving a pattern with a large number of inputs. 
However, without training, GREEDY can also worsen query times in cases where the ordering of triple patterns is already efficient (e.g. Test Ordering 3 of Fig. 3.8c). Figs. 3.11a to 3.11c show the steps for Query Ordering 1 that lead to a timeout under BASIC (> 60 minutes), improved performance under GREEDY (∼ 20 minutes), and best performance under GREEDY with 2 training runs (∼ 2 minutes), respectively. 15 At the time of the experiment, many of the GO services (24 of 27) were not working correctly and were returning empty result sets. This did not affect the solutions for the query, as only one of the 27 services was actually relevant (GO ⇒ UniProt) in the overall context of the query. If all of the GO services had been working, the resolution of the sadi:isParticipantIn pattern would have been even more expensive.  70  3.7. Evaluation  ?protein (40,475 bindings)  2 sadi:isHomologousTo1 sadi:isParticipantIn 1,197,564ms  TIMEOUT  UniProt:Q93038  GO:0006915  sadi:hasMotif ?motif  (a) The steps followed by BASIC, in an attempt to answer Test Ordering 1 of Query 6 (Appendix B.2.3). The query times out after 60 minutes.  ?protein (Step 1: 40,475 bindings, Step 2: 23 bindings)  2  3  sadi:isHomologousTo1 sadi:isParticipantIn 1,189,735ms  56,286ms  UniProt:Q93038  GO:0006915  sadi:hasMotif 27,231ms  ?motif  (4 bindings)  (b) The steps followed by GREEDY with 0 training runs, in order to answer Test Ordering 1 of Query 6 (Appendix B.2.3).  ?protein (Step 1: 475 bindings, Step 2: 43 bindings)  1  3  sadi:isHomologousTo2 sadi:isParticipantIn 63,426ms  UniProt:Q93038  42,767ms  GO:0006915  sadi:hasMotif 22,709ms  ?motif  (4 bindings)  (c) The steps followed by GREEDY with 2 training runs, in order to answer Test Ordering 1 of Query 6 (Appendix B.2.3).  Figure 3.11: Execution plans for Query 6 under BASIC, untrained GREEDY, and trained GREEDY. Descriptions of the execution steps for each plan are given in Tables 3.8 to 3.10, respectively.  71  3.7. Evaluation  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve proteins annotated with apoptosis or one of its subterms  (?protein, sadi:isParticipantIn, GO:0006915)  1,197,564  ?protein => 40,475 bindings  2  Retrieve homologs for each protein from Step 1  (?protein, sadi:isHomologousTo, UniProt:Q93038)  TIMEOUT  N/A  Table 3.8: Description of execution steps for Query 6 under BASIC, as depicted in Fig. 3.11a  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve proteins annotated with apoptosis or one of its subterms  (?protein, sadi:isParticipantIn, GO:0006915)  1189735  ?protein => 40,475 bindings  2  Retrieve homologs of UniProt:Q93038 (lymphocyte associated receptor of death)  (?protein, sadi:isHomologousTo, UniProt:Q93038)  56286  ?protein => 23 bindings  3  For each protein from Step 2, retrieve the associated motifs  (?protein, sadi:hasMotif, ?motif)  27231  ?protein => 23 bindings motif => 4 bindings  Table 3.9: Description of execution steps for Query 6 under untrained GREEDY, as depicted in Fig. 
3.11b  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve homologs of UniProt:Q93038 (lymphocyte associated receptor of death)  (?protein, sadi:isHomologousTo, UniProt:Q93038)  63,426  ?protein => 475 bindings  2  For each homolog from Step 1, determine if it participates in apoptosis  (?protein, sadi:isParticipantIn, GO:0006915)  42,767  ?protein => 43 bindings  3  For each apoptosis protein from Step 2, retrieve the associated motifs  (?protein, sadi:hasMotif, ?motif)  22,709  ?protein => 43 bindings ?motif => 4 bindings  Table 3.10: Description of execution steps for Query 6 under trained GREEDY, as depicted in 72 Fig. 3.11c  3.8. Future Work  3.8  Future Work  There are several areas where significant performance improvements to SHARE could be made, which are separate from the issue of ordering joins. Requests to different services are currently issued sequentially, whereas in many cases it is possible to issue requests in parallel. In addition, there are cases where calls to irrelevant services could be avoided, by taking the input and output datatypes of the services into account. For instance, the resolution of the sadi:isParticipantIn predicate in Query 6 leads to the invocation of 27 services which map GO terms to various types of database identifiers. However, only one of the GO services (GO ⇒ UniProt) is actually relevant in the overall context of the query. By matching the input/output datatypes of the different triple patterns, it should be possible to eliminate such unnecessary service calls. Another possible area for improvement is the pipelining of data. Most database systems pipeline data between query operators, in order to minimize the time-to-first-result, whereas SHARE determines the full solution set prior to displaying results. The fact that pipelining is not used in SHARE allows the joins to be freely reordered at any point in the query, which has been demonstrated in the experiment to be a significant advantage. However, it is possible that some compromise can be reached between pipelining and flexibility in reordering; for example, this same issue has been dealt with in the Tukwila system [166] by constructing query plans from pipelined “query fragments”.  3.9  Conclusion  The unique characteristics of the GREEDY algorithm are that: (1) it is adaptive, and (2) it can learn from previous queries. The evaluation of the optimizer demonstrates that while the GREEDY heuristic is successful in avoiding the worst execution plans, training is often necessary to get the best performance. In particular, the training aspect of the algorithm is important whenever there is a predicate that is significantly more expensive than average, given the same number of inputs. In the experiment, the two expensive predicates were rdf:type (when resolved in the reverse direction) and sadi:isParticipantIn (when resolved in the reverse direction). It is reasonable to handle rdf:type as a special case because it is part of the RDF standard, and the conventions surrounding its use are well established. However, there are likely to be many other types of relationships in the domain of bioinformatics, such as sadi:isParticipantIn, that must be treated specially in order to answer queries efficiently. For example, any predicates specifying broad criteria such as a target tissue or cellular location will likely have a high cost when resolved in the reverse direction. 
The training aspect of GREEDY provides a uniform and automatic way of dealing with such predicates that avoids manually coding special cases. The experiment demonstrates that join ordering is an important factor for the performance of distributed queries. However, for queries that contain only a single constraint (i.e. only one triple pattern with a constant in the subject/object position), there is only one possible ordering of triple patterns, and thus the optimizer cannot produce any improvement. One of six queries in the evaluation was of this type (Query 3). At this point, it is difficult to say how common such queries will be under real usage of the system. A more ideal evaluation of the optimizer would sample real user queries from the logs of the SHARE demo, if/when the SHARE system is more widely used. One interesting problem that is not addressed by GREEDY is the question of which query plan produces the most complete result set. Although the result sets for all queries in the experiment were verified to be the same, regardless of the execution plan, this is not guaranteed to be true for all SHARE queries in general. For example, cross references between biology databases are typically maintained by hand, and so the set of relationships between Database A and Database B might differ depending on which database is queried (i.e. which direction the predicate is resolved in). One idea for estimating the relative “completeness” of a predicate in the  73  3.10. Source Code forward and reverse directions would be to use a sampling technique. The system could sample predicate outputs from previous user queries and try to map them back to their original inputs, by resolving the same predicate in the opposite direction. In this way, a relative completeness score could be computed for the forward and reverse directions of each predicate. The scores could then be used to achieve a tradeoff between efficiency and completeness.  3.10  Source Code  The source code for SHARE is publicly available at http://sadi.googlecode.com, and may be found in the sadi.share folder (i.e. http://sadi.googlecode.com/svn/trunk/sadi.share). In order to successfully build the project, the user must also check out the sadi.common and sadi.client projects. (The project build is automated with Maven.) The main procedure for the GREEDY algorithm is implemented in the executeQueryAdaptive method of the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase) and the logic for comparing the costs of query patterns is implemented in QueryPatternComparator, which is an inner class of SHAREKnowledgeBase. The code for recording, computing, and retrieving statistics about predicates may be found in the PredicateStatsDB class (ca.wilkinsonlab.sadi.stats.PredicateStatsDB).  74  Chapter 4  Conclusion In the field of bioinformatics, making combined use of data sets and software packages often requires custom scripting work to reconcile incompatible formats, schemas, and interfaces. These incompatibilities exist because the various databases and programs are developed independently by different research laboratories, and without reference to a shared set of standards. Unfortunately, achieving standardization within the realm of bioinformatics is a daunting task due to the numerous, interrelated types of data (e.g. sequences, sequence annotations, molecular structures, chemical equations, molecular interactions, metabolic pathways, controlled vocabulary annotations) and analyses (e.g. 
gene prediction, multiple sequence alignment, protein structure prediction, subcellular location prediction, motif identification, phylogenetic tree construction) that exist. However, data/software integration work is mundane, time-consuming, and error prone, and it impedes scientific studies. The potential benefits of automated data integration in bioinformatics are widely recognized, as evidenced by the large number of research projects that have been devoted to solving the problem (e.g. [57, 118, 119, 132, 141, 145, 168, 169, 174, 175]). In this thesis, we have described a distributed query system called SHARE, which aims to address the data and software integration problems that are currently prevalent in bioinformatics. However, a number of distributed query systems have already been developed for bioinformatics prior to SHARE, such as Kleisli [141], TAMBIS [168], BioMART [119], BioFlow [145], and BioMediator [174]. Moreover, all of these systems (including SHARE) are similar in their design; in each case, the system answers user queries by decomposing them into requests against data sources that support a common “wrapper” interface, as per the mediator architecture depicted in Fig. 1.7. The technical feasibility of these systems has been demonstrated repeatedly; however, none of the frameworks have been a success in the sense that they are widely used in bioinformatics work. In addition to the technical challenges, there is also a social aspect to the problem of integration that must be addressed. In particular, there are two important questions that need to be answered for any integration framework: 1. What incentive do data and software providers have for participating? 2. Will the framework last? The main characteristic that distinguishes SHARE from other distributed query systems is that it is built using Semantic Web standards, and this difference has relevance for both of the questions above. Data and software providers that implement SADI services will not only be compatible with SHARE, but will also be compatible with a growing collection of tools [176] (e.g. triple stores and data browsers) and data sources (e.g. Bio2RDF [90], LODD [91], UniProt [11], Linked Life Data [177]) that are specifically designed to support data integration tasks. Further, SHARE has no special requirements for the encoding of RDF data that is consumed/generated by the data sources. Thus, providers may encode their RDF using any ontology that they wish, and the data can readily be used outside the context of SHARE. While the use of Semantic Web technologies has benefits for data providers, the most significant benefits are related to the implementation of the framework itself. Typically, the creators of a mediator system must develop and maintain a mediated schema, and the wrappers for participating data sources must implement mappings to and from that schema. If the wrappers are implemented on the mediator side, as has been the case for many bioinformatics systems 75  Chapter 4. Conclusion (e.g. Kleisli, TAMBIS, BioFlow, and BioMediator), this poses an additional burden of maintenance. In either case, the success of the framework depends upon the continuing diligence of its creators. Thus, it is a important advantage that SHARE requires no mediated schema or mappings. This is a unique consequence of using RDF as the medium for data exchange. 
Triple stores are capable of storing arbitrary graph structures, and thus when the system aggregates data during a query, there are no restrictions on the structure of the RDF data that is returned from each source. Of course, this unconstrained approach also has disadvantages. The effective “master” schema for SHARE is determined collectively by the data sources, and is likely to be far more chaotic than a centrally curated system. Moreover, meaningful integration will only be possible in cases where data sources use the shared identifiers for common entities and predicates. The main contribution of this thesis is GREEDY, the query evaluation algorithm for SHARE that is described in Chapter 3. GREEDY differs significantly from the standard (i.e. System R style [160]) procedure that is used to process queries in relational databases, in order to address additional challenges that are present when answering queries across distributed resources. One of the main difficulties under the distributed scenario is that statistics about the data sources are usually not available. In a relational database, statistics such as the size of a table and the number of keys in an index are essential for formulating efficient query plans. The GREEDY algorithm offers two complimentary solutions for this problem. First, GREEDY employs an adaptive approach to query planning where one join is executed at a time, and future joins are selected based on the result sizes of previous joins. Second, GREEDY learns statistics about predicates as it executes a query, in order to improve the performance of future queries. One of the advantages of the GREEDY approach is that it is able to optimize queries over web services, rather than databases with a specific type of query interface16 As a result, the system is capable of learning statistics about arbitrary software. This is particularly valuable in the bioinformatics domain, where specialized software tools often play an essential role in data analyses, and where the available programs are just as diverse as the available data. While SHARE is a promising prototype, there are number areas where further work is needed before it is ready for real use by bioinformaticians. One of the most important pieces of missing functionality is provenance tracking. Bioinformaticians must be able to review a “proof” of each solution which shows the chain of supporting facts (i.e. triples), and the particular services that asserted each of those facts. Another requirement for real world use will be fine-grained configuration of analytical tools. For example, if a “homolog” predicate is resolved by a BLAST service, users must be able to configure the parameters for the BLAST service, such as the probability cutoff, the seed size, the similarity matrix, and so on. However, it is not clear how to best support such configuration through the query interface. Moreover, it runs contrary to the basic purpose of SHARE, which is to enable users to ask questions without needing to know the locations and interfaces of the specific services that are used to answer them. Thus, SHARE may be best suited for simple, ad hoc queries such as “Which biological pathways involve both protein A and protein B?” or “Which proteins are most frequently targeted by breast cancer drugs?”. In order to accomplish more sophisticated analyses (e.g. building a phylogenetic tree), a workflow composition tool such as Taverna [123] is more suitable. 
One potential method to combine the strengths of SHARE and Taverna would be to use SHARE for constructing template workflows, and Taverna for fine-grained configuration. Yet another area for further work will be developing user interfaces for SHARE. While the query interface to SHARE is powerful, it requires familiarity with the SPARQL query language, and is not well suited to exploration of the data. One promising avenue for obtaining new interfaces would be to adapt existing tools for exploring RDF data, such as RDF browsers [178, 179], graph browsers [180, 181], and faceted browsers [182, 183], to use SHARE as the underlying data source.  16 Recall  from Section 2.3.2 that SPARQL endpoints are modelled as SADI services within the query engine.  76  Bibliography [1] Elaine R. Mardis. Next-Generation DNA sequencing methods. Genomics and Human Genetics, 9:387–402, 2008. URL http://www.annualreviews.org/doi/abs/10.1146/annurev. genom.9.081307.164359?journalCode=genom. [2] A Schulze and J Downward. Navigating gene expression using microarrays. Nature Cell Biology, 3(8):190–195, 201. URL http://www.ncbi.nlm.nih.gov/pubmed/11483980. [3] Philippe Collas and John Arne Dahl. Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Frontiers in Bioscience, 13:929–943, January 2008. URL http://www.bioscience.org/2008/v13/af/2733/list.htm. [4] Catherine Mathe, Marie-France Sagot, Thomas Schiex, and Pierre Rouze. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research, 30 (19):4103–4117, 2002. URL http://nar.oxfordjournals.org/cgi/content/abstract/30/19/ 4103. [5] Burkhard Rost. Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134(2-3):204–218, May 2001. [6] C.A. Floudas, H.K. Fung, S.R. McAllister, M. Monnigmann, and R. Rajgaria. Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science, 61(3):966–988, February 2006. [7] Nucleic acids research: Categorical list of online biology databases, retrieved Feb 4, 2011. URL http://www.oxfordjournals.org/nar/database/c/. [8] Michelle D. Brazas, Joseph Tadashi Yamada, and B.F. Francis Ouelette. Evolution in bioinformatic resources: 2009 update on the bioinformatics links directory. Nucleic Acids Research, 37:W3–W5. URL http://nar.oxfordjournals.org/cgi/content/full/37/suppl_ 2/W3. [9] Donna Maglott, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. Entrez gene: genecentered information at NCBI. Nucleic Acids Research, 33:D54–D58, 2005. URL http: //nar.oxfordjournals.org/cgi/content/abstract/gkl993v1. [10] UniGene home, retrieved Feb 4, 2011. URL http://www.ncbi.nlm.nih.gov/unigene. [11] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, and et al. The universal protein resource (UniProt). Nucleic Acids Research, 33:D154–D159, 2005. URL http://nar.oxfordjournals.org/cgi/content/abstract/33/suppl_1/D154. [12] Josefine Sprenger, J. Lynn Fink, Seetha Karunaratne, Kelly Hanson, Nicholas A. Hamilton, and Rohan D. Teasdale. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Research, 36:D230–D233, 2008. URL http://nar.oxfordjournals. org/cgi/content/abstract/gkm950v1.  77  Bibliography [13] Sbastien Rey, Michael Acab, Jennifer L. Gardy, Matthew R. Laird, Katalin deFays, Christophe Lambert, and Fiona S. L. Brinkman. PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Research, 33:D164–D168, 2005. 
URL http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D164. [14] Ioannis Xenarios, Lukasz Salwinski, Xiaoqun Joyce Duan, Patrick Higney, Sul-Min Kim, and David Eisenberg. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1): 303–305, 2002. URL http://nar.oxfordjournals.org/cgi/content/full/30/1/303?ijkey= lVvmq2t6UjLVU&ke%EF%BF%BD%C3%9C. [15] Kyungsook Han, Byungkyu Park, Hyongguen Kim, Jinsun Hong, and Jong Park. HPID: the human protein interaction database. Bioinformatics, 20(5):2466–2470, October 2004. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/15/2466. [16] Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, and et al. The BioGRID interaction database: 2008 update. Nucleic Acids Research, 36:D637–D640, 2008. URL http://nar.oxfordjournals.org/cgi/content/abstract/gkm1001v1. [17] J. L. Sussman, D. Lin, N. O. Manning, J. Prilusky, O. Ritter, and E. E. Abola. Protein data bank (PDB): database of Three-Dimensional structural information of biological macromolecules. Biological Crystallography, D54:1078–1084. URL http://scripts.iucr. org/cgi-bin/paper?S0907444998009378. [18] S Bamford, E Dawson, S Forbes, J Clements, and et al. The COSMIC (Catalogue of somatic mutations in cancer) database and website. British Journal of Cancer, 91(2): 355–358, July 2004. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2409828/. [19] M.P. Lefranc, V. Giudicelli, C. Busin, and et al. IMGT, the international ImMunoGeneTics database. Nucleic Acids Research, 26(1):297–303, 1998. URL http://nar. oxfordjournals.org/cgi/content/abstract/26/1/297. [20] Christian von Mering, Lars J. Jensen, Michael Kuhn, and et al. String 7–recent developments in the integration and prediction of protein interactions. Nucleic Acids Research, 35:358–362, 2007. [21] David S. Wishart, Craig Knox, An Chi Guo, and et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34:D668–D672, 2006. URL http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_1/D668. [22] Ron Edgar, Michael Domrachev, and Alex E. Lash. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207–210. URL http://nar.oxfordjournals.org/cgi/content/full/30/1/207?ijkey= oxmpowsears7o&keytype=ref&siteid=nar. [23] Alvis Brazma, Helen Parkinson, Ugis Sarkans, and et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1):68–71, 2003. URL http://nar.oxfordjournals.org/cgi/content/abstract/31/1/68. [24] Gavin Sherlock, Tina Hernandez-Boussard, Andrew Kasarskis, and et al. The stanford microarray database. Nucleic Acids Research, 29(1):152–155. URL http://nar. oxfordjournals.org/cgi/content/abstract/29/1/152. [25] Minoru Kanehisa, Susumu Goto, Shuichi Kawashima, and et al. The KEGG resource for deciphering the genome. Nucleic Acids Research, 32:D277–D280, 2004. URL http: //nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D277. 78  Bibliography [26] Peter D. Karp, Monica Riley, Suzanne M. Paley, and Alida Pellegrini-Toole. The MetaCyc database. Nucleic Acids Research, 30(1):59–61. URL http://nar.oxfordjournals.org/cgi/ content/abstract/30/1/59. [27] G. Joshi-Tope, M. Gillespie, I. Vastrik, and et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research, 33:D428–D432. 
URL http://nar.oxfordjournals.org/ cgi/content/abstract/33/suppl_1/D428. [28] Lincoln Stein. Genome annotation: from sequence to biology. Nature Reviews Genetics, 2:493–503, July 2001. URL http://www.nature.com/nrg/journal/v2/n7/abs/nrg0701_493a. html. [29] Chris Simon, Thomas R. Buckley, Francesco Frati, James B. Stewart, and Andrew T. Beckenbach. Incorporating molecular evolution into phylogenetic analysis, and a new compilation of conserved polymerase chain reaction primers for animal mitochondrial DNA. Annual Review of Ecology, Evolution, and Systematics, 37:545–579, December 2006. URL http://arjournals.annualreviews.org/doi/full/10.1146/annurev.ecolsys.37. 091305.110018?cookieSet=1. [30] Olof Emanuelsson, Soren Brunak, Gunnar von Heijne, and Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols, 2:953–971, 2007. URL http://www.nature.com/nprot/journal/v2/n4/abs/nprot.2007.131.html. [31] Christian von Mering, Roland Krause, Berend Snel, Michael Cornell, Stephen G. Oliver, Stanley Fields, and Peer Bork. Comparative assessment of large-scale data sets of protein– protein interactions. Nature, 417:399–403, May 2002. URL http://scholar.google.ca/ scholar?cluster=14172930869186324954&hl=en. [32] Joel N. Hirschhorn and Mark J. Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6:95–108, February 2005. URL http://www.nature.com/nrg/journal/v6/n2/abs/nrg1521.html. [33] Julie L Morrison, Rainer Breitling, Desmond J Higham, and David R Gilbert. GeneRank: using search engine technology for the analysis of microarray experiments. BMC Bioinformatics, 6(233), September 2005. URL http://www.biomedcentral.com/14712105/6/233/abstract. [34] Xiaotu Ma, Hyunju Lee, Li Wang, and Fengzhu Sun. CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data. Bioinformatics, 23(2):215–221, 2007. URL http://bioinformatics.oxfordjournals.org/cgi/content/ abstract/23/2/215. [35] Aaron N. Chang. Prioritizing genes for pathway impact using network analysis. In Protein Networks and Pathway Analysis, volume 563 of Methods in Molecular Biology, pages 141–156. July 2009. URL http://www.springerprotocols.com/Abstract/doi/10.1007/9781-60761-175-2_8. [36] Vincent Detours, Jacques E. Dumont, Hugues Bersini, and Carine Maenhaut. Integration and cross-validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters, 546(1):98–102. [37] Rafael A Irizarry, Daniel Warren, Forrest Spencer, and et al. Multiple-laboratory comparison of microarray platforms. Nature Methods, 2:345–350, 2005. URL http: //www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html.  79  Bibliography [38] Homin K. Lee, Amy K. Hsu, Jon Sajdak, Jie Qin, and Paul Pavlidis. Coexpression analysis of human genes across many microarray data sets. Genome Research, 14:1085–1094, 2004. URL http://genome.cshlp.org/content/14/6/1085.abstract. [39] Shui Qing Ye. Bioinformatics: A Practical Approach. Chapman & Hall/CRC, August 2007. URL http://www.amazon.ca/Bioinformatics-Practical-Approach-Shui-Qing/ dp/1584888105/ref=sr_1_1?ie=UTF8&s=books&qid=1255984955&sr=1-1. [40] Lincoln Stein. Creating a bioinformatics nation. Nature, 417:119–120, 2002. URL http: //www.nature.com/nature/journal/v417/n6885/full/417119a.html. [41] Alvis Brazma, Maria Krestyaninova, and Ugis Sarkans. Standards for systems biology. Nature Reviews Genetics, 7:593–605, 2006. 
URL http://www.nature.com/nrg/journal/v7/ n8/abs/nrg1922.html. [42] RE: LSID: what’s still needed to make it work within the semantic web?, retrieved Feb 4, 2011. URL http://lists.w3.org/Archives/Public/public-semweb-lifesci/2005Mar/0004. html. [43] My conversation with sean martin about LSIDs, retrieved Feb 4, 2011. URL http:// lists.w3.org/Archives/Public/www-tag/2006Jul/0041. [44] Huimin Zhao and Sudha Ram. Entity identification for heterogeneous database integrationa multiple classifier system approach and empirical evaluation. Information Systems, 30(2):119–132, April 2005. [45] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364, December 1986. URL http://portal.acm.org/citation.cfm?id=27634. [46] Kalpdrum Passil, Louise Lane, Sanjay Madria, and et al. A model for XML schema integration. In E-Commerce and Web Technologies, volume 2455, pages 193–202, Berlin / Heidelberg, 2002. Spring. URL http://www.springerlink.com/content/47lan7d1xgrdqr8f/. [47] Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: The state of the art. The Knowledge Engineering Review, 18(1):1–31, 2003. URL http://journals.cambridge.org/ action/displayAbstract?fromPage=online&aid=183299. [48] Pipeline (Unix), retrieved Feb 4, 2011. URL http://en.wikipedia.org/wiki/Pipeline_ (Unix). [49] Thomas J Lee, Yannick Pouliot, Valerie Wagner, and et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics, 7(170), 2006. URL http: //www.biomedcentral.com/1471-2105/7/170. [50] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65–74, March 1997. URL http://portal.acm.org/citation.cfm?id=248603.248616&coll=GUIDE&dl=ACM&idx= J689&part=periodical&WantType=periodical&title=ACM%20SIGMOD%20Record.  [51] Markus Fischer, Quan K Thai, Melanie Grieb, and Jrgen Pleiss. DWARF - a data warehouse system for analyzing protein families. BMC Bioinformatics, 7(495), 2006. URL http://www.biomedcentral.com/1471-2105/7/495. [52] Jongsun Park, Bongsoo Park, Kyongyong Jung, and et al. CFGP: a web-based, comparative fungal genomics platform. Nucleic Acids Research, 36:D562–D571. URL http: //nar.oxfordjournals.org/cgi/content/abstract/36/suppl_1/D562. 80  Bibliography [53] Zukang Feng, Li Chen, Himabindu Maddula, and et al. Ligand depot: a data warehouse for ligands bound to macromolecules. Bioinformatics, 20(13):2153–2155, 2004. URL http: //bioinformatics.oxfordjournals.org/cgi/content/abstract/20/13/2153. [54] Christian Kuenne, Ivo Grosse, and Inge Matthies. Using data warehouse technology in crop plant bioinformatics. Journal of Integrative Bioinformatics, 4(1), 2007. URL http: //journal.imbio.de/article.php?aid=88. [55] Sohrab P Shah, Yong Huang, Tao Xu, and et al. Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6(34). URL http://www.biomedcentral.com/14712105/6/34. [56] Katerina Michalickova, Gary D Bader, Michel Dumontier, and et al. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics, 3(32), 2002. URL http://www.biomedcentral.com/1471-2105/3/32. [57] T. Etzold, A. Ulyanov, and P. Argos. SRS: information retrieval system for molecular biology data banks. Methods in Enzymology, 266:114–128, 1996. URL http://www.ncbi. nlm.nih.gov/pubmed/8743681. [58] SRS@EBI (srs.ebi.ac.uk), retrieved Feb 4, 2011. 
URL http://srs.ebi.ac.uk/srsbin/cgibin/wgetz?-page+srsq2+-noSession. [59] Alan Ruttenberg, Jonathan A. Rees, Matthias Samwald, and M. Scott Marshall. Life sciences on the semantic web: the neurocommons and beyond. Briefings in Bioinformatics, 10(2):193–204, 2009. URL http://bib.oxfordjournals.org/cgi/content/abstract/bbp004? ijkey=S1KuIBDQvKkvKUV&keytype=ref. [60] Nikesh Kotechna, Kyle Bruck, William Lu, and Nigam Shah. Pathway knowledge base: An integrated pathway resource using BioPAX. Applied Ontology, 3(4):235–245, December 2008. URL http://portal.acm.org/citation.cfm?id=1516156. [61] Extensible markup language (XML), retrieved Feb 4, 2011. URL http://www.w3.org/XML/. [62] The ASN.1 consortium, retrieved Feb 4, 2011. URL http://www.asn1.org/. [63] Resource description framework (RDF), retrieved Feb 4, 2011. URL http://www.w3.org/ RDF/. [64] Xerces java parser, retrieved Feb 4, 2011. URL http://xerces.apache.org/xerces-j/. [65] D Fenyo. The biopolymer markup language. Bioinformatics, 15(4):339–340, 1999. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/15/4/339. [66] Insdc, retrieved Feb 4, 2011. URL http://www.insdc.org/. [67] Allison Waugh, Patrick Gendron, Russ Altman, and et al. RNAML: a standard syntax for exchanging RNA information. RNA, 8:707–717, July 2002. URL http://journals. cambridge.org/action/displayAbstract?fromPage=online&aid=114721. [68] John Westbrook, Nobutoshi Ito, Haruki Nakamura, and et al. PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics, 21(7):988–992, 2005. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/7/988. [69] Phillipp N Seibel, Jan Krger, Sven Hartmeier, and et al. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics, 7(490), November 2006. URL http://www.biomedcentral.com/1471-2105/7/490. 81  Bibliography [70] PSI-MI, retrieved Feb 4, 2011. URL http://www.psidev.info/index.php?q=node/60. [71] Paul T Spellman, Michael Miller, Jason Stewart, and et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology, 3, 2002. URL http://genomebiology.com/2002/3/9/RESEARCH/0046.9/ABSTRACT/ABSTRACT/ abstract/. [72] MINiML (MIAME notation in markup language), retrieved Feb 4, 2011. URL http: //www.ncbi.nlm.nih.gov/geo/info/MINiML.html. [73] M. Hucka, A. Finney, H.M. Sauro, and et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, 2003. URL http://bioinformatics.oxfordjournals.org/ cgi/content/abstract/19/4/524. [74] Catherine M. Lloyd, Matt D. B. Halstead, and Poul F. Nielsen. Progress in biophysics and molecular biology : CellML: its future, present and past. Progress in Biophysics and Molecular Biology, 85(2-3):433–450, 2004. [75] Andrew Chatraryamontri, Arnaud Ceol, Luisa Montecchi Palazzi, and et al. MINT: the molecular INTeraction database. Nucleic Acids Research, 35:D572– D574, 2007. URL http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/ D572?ijkey=dODhj4m5HbrlWs1&keytype=ref. [76] Henning Hermjakob, Luisa Montecchi-Palazzi, Chris Lewington, and et al. IntAct: an open source molecular interaction database. Nucleic Acids Research, 32:D452–D455, 2004. URL http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D452. [77] Lena Strmbck and Patrick Lambrix. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics, 21(4):4401–4407, 2005. 
URL http: //bioinformatics.oxfordjournals.org/cgi/content/abstract/21/24/4401. [78] MGED home, retrieved Feb 4, 2011. URL http://www.mged.org/. [79] Tim F Rayner, Philippe Rocca-Serra, Paul T Spellman, and et al. A simple spreadsheetbased, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics, 7, 2006. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1687205/?tool=pubmed. [80] Hong-Hai Do, Sergey Melnik, and Erhard Rahm. Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems, volume 2593, pages 221– 237, Berlin / Heidelberg, 2009. Springer. URL http://www.springerlink.com/content/ 4wbxb1jbw8w4ek9v/. [81] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):28–37, May 2001. URL http://md1.csa.com/partners/viewrecord.php?requester= gs&collection=TRD&recid=114048AN&q=&uid=788564238&setcookie=yes. [82] World wide web consortium (W3C), retrieved Feb 4, 2011. URL http://www.w3.org/. [83] W3C semantic web activity, retrieved Feb 4, 2011. URL http://www.w3.org/2001/sw/. [84] Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall. The semantic web revisited. IEEE Intelligent Systems, 21(3):96–101, 2006. URL http://portal.acm.org/citation.cfm?id= 1155313.1155373. [85] Turtle - terse RDF triple language, retrieved Feb 4, 2011. URL http://www.w3.org/ TeamSubmission/turtle/. 82  Bibliography [86] RDF/XML syntax specification (Revised), retrieved Feb 4, 2011. URL http://www.w3. org/TR/rdf-syntax-grammar/. [87] LargeTripleStores - ESW wiki, retrieved Feb 4, 2011. URL http://esw.w3.org/topic/ LargeTripleStores. [88] SPARQL query language for RDF, retrieved Feb 4, 2011. URL http://www.w3.org/TR/rdfsparql-query/. [89] Benjamin M. Good and Mark D. Wilkinson. The life sciences semantic web is full of creeps! Briefings in Bioinformatics, 7(3):275–286, 2006. URL http://bib.oxfordjournals.org/ cgi/content/full/7/3/275?ijkey=QgZhvAQ4xvjfqzK&amp;keytype=ref. [90] Fran¸cois Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. [91] Anja Jentzsch, Jun Zhao, and Oktie Hassanzadeh. Linking open drug data. 2009. URL http://triplify.org/files/challenge_2009/LODD.pdf. [92] DailyMed homepage, retrieved Feb 4, 2011. URL http://dailymed.nlm.nih.gov/dailymed/ about.cfm. [93] SIDER side effect resource, retrieved Feb 4, 2011. URL http://sideeffects.embl.de/. [94] Biogateway overview, retrieved Feb 4, 2011. URL http://www.semantic-systems-biology. org/biogateway. [95] Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson. Moby and moby 2: Creatures of the deep (Web). Briefings in Bioinformatics, 10(2):114–128, 2009. URL http://bib.oxfordjournals.org/cgi/content/abstract/bbn051. [96] W.N. Borst, J.M. Akkermans, and J.L. Top. Engineering ontologies. International Journal of Human-Computer Studies, 46(2-3):365–406, 1997. URL http://en.scientificcommons. org/17613581. [97] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, 2003. URL http://portal.acm.org/ citation.cfm?id=885746. [98] Evren Sirin, Bijan Parsia, Bernardo Cuenca Grau, Aditya Kalyanpur, and Yarden Katz. Pellet: A practical OWL-DL reasoner. Web Semantics: Science, Services and Agents on the World Wide Web, 5(2):51–53, June 2007. [99] Volker Haarslev and Ralf Moller. 
Appendix A

Supporting Material for Chapter 2

A.1 An Example RDF Service Description for a SADI Service, Obtained by Performing an HTTP GET on the URL http://sadiframework.org/examples/uniprot2pubmed

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:mygrid="http://www.mygrid.org.uk/mygrid-moby-service#">
  <mygrid:serviceDescription rdf:about="http://sadiframework.org/examples/uniprot2pubmed">
    <mygrid:hasServiceDescriptionText>Returns PubMed ids associated with a UniProt record.</mygrid:hasServiceDescriptionText>
    <mygrid:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UniProt-to-Pubmed</mygrid:hasServiceNameText>
    <mygrid:hasOperation>
      <mygrid:operation rdf:about="http://sadiframework.org/examples/uniprot2pubmed#operation">
        <mygrid:outputParameter>
          <mygrid:parameter rdf:about="http://sadiframework.org/examples/uniprot2pubmed#output">
            <mygrid:objectType rdf:resource="http://sadiframework.org/examples/uniprot2pubmed.owl#AnnotatedUniProtRecord"/>
          </mygrid:parameter>
        </mygrid:outputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter rdf:about="http://sadiframework.org/examples/uniprot2pubmed#input">
            <mygrid:objectType rdf:resource="http://purl.oclc.org/SADI/LSRN/UniProt_Record"/>
          </mygrid:parameter>
        </mygrid:inputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
  </mygrid:serviceDescription>
</rdf:RDF>
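The description above is plain RDF/XML, so any RDF library can consume it. The following sketch is not part of the SHARE codebase; it assumes the Python rdflib package and a live service at the URL above, and simply lists the OWL classes declared as the service's input and output:

from rdflib import Graph, Namespace

MYGRID = Namespace("http://www.mygrid.org.uk/mygrid-moby-service#")

# An HTTP GET on the service URL returns the RDF/XML document shown above.
g = Graph()
g.parse("http://sadiframework.org/examples/uniprot2pubmed", format="xml")

# Each mygrid:parameter node carries a mygrid:objectType triple naming the
# OWL class that the service consumes (input) or produces (output).
for param, _, owl_class in g.triples((None, MYGRID.objectType, None)):
    print(param, "->", owl_class)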
Appendix B

Supporting Material for Chapter 3

B.1 Variants of Training Queries Used to Evaluate the GREEDY Optimization Algorithm

B.1.1 Variants of Query 1

All four variants share the following PREFIX declarations and SELECT clause; they differ only in the fixed UniProt accession and in the order of the triple patterns.

PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core2:   <http://bio2rdf.org/core:>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core:    <http://purl.uniprot.org/core/>
PREFIX uniprot: <http://bio2rdf.org/uniprot:>

SELECT ?substitution ?start ?end ?dbCrossRef

Variant 0):

WHERE {
  uniprot:P01344 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation core:substitution ?substitution
}

Variant 1):

WHERE {
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  uniprot:Q9P126 core:annotation ?annotation .
  ?annotation core:substitution ?substitution .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end
}

Variant 2):

WHERE {
  uniprot:Q9H6R6 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?range core:end ?end .
  ?annotation core:substitution ?substitution .
  ?range core:begin ?start
}

Variant 3):

WHERE {
  uniprot:Q9H116 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation core:substitution ?substitution .
  ?range core:end ?end .
  ?annotation rdfs:seeAlso ?dbCrossRef
}
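As the listing shows, each variant is the same basic graph pattern with a different fixed accession and a different pattern order. A minimal sketch of how such variants might be generated (the generation procedure here is an assumption for illustration, not code from the thesis):

import random

# The shared triple patterns of Query 1, with a placeholder for the accession.
PATTERNS = [
    "uniprot:{acc} core:annotation ?annotation",
    "?annotation core:range ?range",
    "?range core:begin ?start",
    "?range core:end ?end",
    "?annotation rdf:type core2:Natural_Variant_Annotation",
    "?annotation rdfs:seeAlso ?dbCrossRef",
    "?annotation core:substitution ?substitution",
]

def make_variant(accession):
    """Fill in a fixed accession and randomly permute the triple patterns."""
    patterns = [p.format(acc=accession) for p in PATTERNS]
    random.shuffle(patterns)
    return "WHERE {\n  " + " .\n  ".join(patterns) + "\n}"

# e.g. the accessions used in Variants 0-3 above:
for acc in ["P01344", "Q9P126", "Q9H6R6", "Q9H116"]:
    print(make_variant(acc))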
B.1.2 Variants of Query 3

All four variants share the following PREFIX declarations and query body; they differ only in the DrugBank identifier of the fixed drug.

PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank:  <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX bio2rdf:   <http://bio2rdf.org/ns/bio2rdf#>
PREFIX drug:      <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>

Variant 0):

SELECT ?targetProtein ?gene ?disease ?diseaseName
WHERE {
  drug:DB01038 drugbank:target ?targetProtein .
  ?targetProtein drugbank:hgncId ?gene .
  ?gene bio2rdf:xOMIM ?omim .
  ?disease diseasome:omim ?omim .
  ?disease rdfs:label ?diseaseName
}

Variant 1): The same query with drug:DB00161 in place of drug:DB01038.

Variant 2): The same query with drug:DB03881.

Variant 3): The same query with drug:DB04896.
B.1.3 Variants of Query 5

All four variants share the following PREFIX declarations, SELECT clause, and FROM clause.

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX ss:   <http://semanticscience.org/resource/>
PREFIX GO:   <http://lsrn.org/GO:>

SELECT ?protein ?omim
FROM <http://dev.biordf.net/~benv/properties.owl>

Variant 0):

WHERE {
  ?protein sadi:hasFunction GO:0004872 .       # "receptor activity"
  ?protein sadi:isParticipantIn GO:0007595 .   # "lactation"
  ?protein sadi:isCausallyRelatedWith ?omim .
}

Variant 1):

WHERE {
  ?protein sadi:isParticipantIn GO:0048588 .
  ?protein sadi:isCausallyRelatedWith ?omim .
  ?protein sadi:hasFunction GO:0004872
}

Variant 2):

WHERE {
  ?protein sadi:isParticipantIn GO:0007283 .
  ?protein sadi:isCausallyRelatedWith ?omim .
  ?protein sadi:hasFunction GO:0004872
}

Variant 3):

WHERE {
  ?protein sadi:hasFunction GO:0004872 .
  GO:0006260 sadi:hasParticipant ?protein .
  ?protein sadi:isCausallyRelatedWith ?omim
}
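Each run of a training variant yields observations (for example, how many bindings a given predicate returns) that can inform the ordering of later queries. As a toy illustration only (SHARE's actual bookkeeping is described in Chapter 3; the per-predicate running average below is an assumption made for this sketch):

from collections import defaultdict

class PredicateStats:
    """Toy running average of result sizes, keyed by predicate URI."""

    def __init__(self):
        self.totals = defaultdict(int)   # total bindings observed per predicate
        self.samples = defaultdict(int)  # number of observations per predicate

    def record(self, predicate, num_bindings):
        # Called after a triple pattern has been resolved during execution.
        self.totals[predicate] += num_bindings
        self.samples[predicate] += 1

    def expected_bindings(self, predicate, default=float("inf")):
        # A pessimistic default makes patterns with known-small results
        # preferred over never-seen ones when ordering the next query.
        if self.samples[predicate] == 0:
            return default
        return self.totals[predicate] / self.samples[predicate]

stats = PredicateStats()
stats.record("http://purl.uniprot.org/core/annotation", 42)
print(stats.expected_bindings("http://purl.uniprot.org/core/annotation"))  # 42.0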
B.2 Randomly Generated Orderings for Test Queries Used to Evaluate the GREEDY Optimization Algorithm

B.2.1 Orderings for Query 2

All five orderings share the following PREFIX declarations and SELECT clause; only the order of the triple patterns differs.

PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core2:    <http://bio2rdf.org/core:>
PREFIX affy:     <http://bio2rdf.org/ns/affymetrix#>
PREFIX probeset: <http://bio2rdf.org/affymetrix:>
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core:     <http://purl.uniprot.org/core/>

SELECT ?geneID ?proteinID ?snpID ?goProcess ?goComponent ?goFunction

Ordering 0):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?proteinID core:annotation ?annotation .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xEnsembl ?geneID
}

Ordering 1):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xEnsembl ?geneID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent
}

Ordering 2):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID
}

Ordering 3):

WHERE {
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?proteinID core:annotation ?annotation
}

Ordering 4):

WHERE {
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID .
  probeset:53701_at affy:xSwissProt ?proteinID
}
B.2.2 Orderings for Query 4

All five orderings share the following PREFIX declarations and SELECT clause.

PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX owl:      <http://www.w3.org/2002/07/owl#>
PREFIX core:     <http://purl.uniprot.org/core/>
PREFIX omim:     <http://bio2rdf.org/mim:>
PREFIX drug:     <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>

SELECT ?targetProtein ?alzProtein ?alzProteinName

Ordering 0):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  drug:DB01273 drugbank:target ?target .
  ?alzProtein dc:title ?alzProteinName .
  ?target drugbank:swissprotId ?targetProtein .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant .
  ?targetProtein core:interaction ?interaction
}

Ordering 1):

WHERE {
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?participant owl:sameAs ?alzProtein .
  ?targetProtein core:interaction ?interaction .
  ?interaction core:participant ?participant .
  ?alzProtein dc:title ?alzProteinName
}

Ordering 2):

WHERE {
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction .
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?alzProtein dc:title ?alzProteinName .
  ?interaction core:participant ?participant .
  ?participant owl:sameAs ?alzProtein
}

Ordering 3):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant .
  ?alzProtein dc:title ?alzProteinName .
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction
}

Ordering 4):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?alzProtein dc:title ?alzProteinName .
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant
}
B.2.3 Orderings for Query 6

All five orderings share the following PREFIX declarations, SELECT clause, and FROM clause.

PREFIX UniProt: <http://lsrn.org/UniProt:>
PREFIX sadi:    <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO:      <http://lsrn.org/GO:>

SELECT DISTINCT ?motif
FROM <http://dev.biordf.net/~benv/properties.owl>

Ordering 0):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  GO:0006915 sadi:hasParticipant ?protein .
  ?protein sadi:hasMotif ?motif
}

Ordering 1):

WHERE {
  ?protein sadi:isParticipantIn GO:0006915 .
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif
}

Ordering 2):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif .
  GO:0006915 sadi:hasParticipant ?protein
}
Ordering 3):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif .
  ?protein sadi:isParticipantIn GO:0006915
}

Ordering 4):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  GO:0006915 sadi:hasParticipant ?protein .
  ?protein sadi:hasMotif ?motif
}
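A natural use for these fixed orderings is as baselines: execute each one verbatim and compare its wall-clock time and result count against the plan chosen by the optimizer. The sketch below is an illustration only; it assumes a query engine that evaluates patterns in the written order, the Python SPARQLWrapper package, and a placeholder endpoint URL:

import time
from SPARQLWrapper import SPARQLWrapper, JSON

def time_query(endpoint_url, query):
    """Return (seconds elapsed, number of result rows) for one execution."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    start = time.perf_counter()
    results = sparql.query().convert()
    return time.perf_counter() - start, len(results["results"]["bindings"])

# `orderings` would hold the query strings listed above;
# "http://example.org/sparql" is a placeholder, not a real endpoint.
orderings = []  # e.g. [ordering_0, ordering_1, ...]
for i, query in enumerate(orderings):
    secs, n = time_query("http://example.org/sparql", query)
    print("ordering %d: %d results in %.2fs" % (i, n, secs))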
