UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

The SHARE system : a semantic web based approach for evaluating queries across distributed bioinformatics databases and software Vandervalk, Ben

Abstract

Many bioinformatics studies require combined use of data sets and software developed by different research labs. At the current time, accomplishing such studies requires the development of custom scripts that act as “glue” for the independent resources, performing transformations on the data sets that will allow them to be loaded into a single database and/or shuttled through different pieces of software. Due to the tedium and inefficiency of manual data/software integration, many institutions and research groups have sought to find a more reliable and automatic approach. The most significant integration project in recent years has been the Semantic Web activity of the World Wide Web Consortium (W3C), which aims to automate data integration not only in bioinformatics, but on the WWW as a whole. The goal of the Semantic Web is to interlink data on the web in a manner that is similar to the way that HTML pages are linked, while at the same time making the data available in a universal form that can be easily processed by software. In this thesis, the author describes a distributed query system called SHARE (Semantic Health and Research Environment) which demonstrates how the available standards and tools of the Semantic Web can be assembled into a framework for automating data and software integration in bioinformatics. We find that while SHARE has a similar architecture to existing query systems, the use of Semantic Web technologies has important advantages for the implementation, maintenance, and addition of new data sources to the system. After reviewing the mechanics of SHARE, we examine the crucial problem of optimizing queries in an environment where statistics about the data sources are typically not available. A query evaluation procedure called GREEDY is presented that addresses this challenge by: i) interleaving the planning and execution phases of a query, and ii) learning statistics from the execution of previous queries. We conclude by highlighting the unique strengths of SHARE and GREEDY in relation to other integration systems, and review important areas for future work.

Item Media

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International