The SHARE System: A Semantic Web Based Approach for Evaluating Queries Across Distributed Bioinformatics Databases and Software

by Ben Vandervalk
B. Math. (Computer Science), The University of Waterloo, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
April, 2011
© Ben Vandervalk 2011

Abstract

Many bioinformatics studies require the combined use of data sets and software developed by different research labs. At the current time, accomplishing such studies requires the development of custom scripts that act as “glue” for the independent resources, performing transformations on the data sets that will allow them to be loaded into a single database and/or shuttled through different pieces of software. Due to the tedium and inefficiency of manual data/software integration, many institutions and research groups have sought a more reliable and automatic approach. The most significant integration project in recent years has been the Semantic Web activity of the World Wide Web Consortium (W3C), which aims to automate data integration not only in bioinformatics, but on the WWW as a whole. The goal of the Semantic Web is to interlink data on the web in a manner similar to the way that HTML pages are linked, while at the same time making the data available in a universal form that can be easily processed by software. In this thesis, the author describes a distributed query system called SHARE (Semantic Health and Research Environment) which demonstrates how the available standards and tools of the Semantic Web can be assembled into a framework for automating data and software integration in bioinformatics. We find that while SHARE has a similar architecture to existing query systems, the use of Semantic Web technologies has important advantages for the implementation, maintenance, and addition of new data sources to the system. After reviewing the mechanics of SHARE, we examine the crucial problem of optimizing queries in an environment where statistics about the data sources are typically not available. A query evaluation procedure called GREEDY is presented that addresses this challenge by: (i) interleaving the planning and execution phases of a query, and (ii) learning statistics from the execution of previous queries. We conclude by highlighting the unique strengths of SHARE and GREEDY in relation to other integration systems, and review important areas for future work.

Preface

The design of SHARE was a joint effort of the Wilkinson laboratory. The majority of the programming work for the system, including the core query engine and the live web demonstration, was done by Luke McCarthy, a full-time employee of the Wilkinson laboratory. My own contributions to the programming were the query optimizer and the ability to query SPARQL endpoints as data sources. Chapter 2, which describes the mechanics of the SHARE system, is an expanded and more up-to-date version of material published in: Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson, “SHARE: A Semantic Web Query Engine for Bioinformatics,” in The Semantic Web (ASWC 2009), vol. 5926, pp. 367-369, 2009. The text and figures for Chapter 2 were created from scratch. Chapter 3, which describes the optimization algorithm for SHARE, is a further development of work published in: Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson, “Optimization of Distributed SPARQL Queries Using Edmonds’ Algorithm and Prim’s Algorithm,” in 12th IEEE International Conference on Computational Science and Engineering (CSE 2009), vol. 4, pp. 29-31, 2009.
In particular, the idea for the GREEDY algorithm is based on Section 7 of that paper, titled “Adaptive query execution using Prim’s algorithm”. The text and figures for Chapter 3 were created from scratch.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 The Motivation for Data and Software Integration in Bioinformatics
  1.2 Fundamental Problems of Data and Software Integration
  1.3 Integration Technologies and Related Research Projects
    1.3.1 Data Warehouses
    1.3.2 XML
    1.3.3 The Semantic Web
    1.3.4 Web Services
    1.3.5 Mediator Systems
  1.4 Thesis Outline

2 SHARE System Architecture
  2.1 Introduction
  2.2 The SPARQL Query Language
  2.3 SHARE Data Sources
    2.3.1 SADI Services
    2.3.2 SPARQL Endpoints
  2.4 SHARE Query Resolution
    2.4.1 Overview
    2.4.2 Resolution Process of an Example SHARE Query
    2.4.3 Pseudocode for SHARE Query Resolution
    2.4.4 Validating Inputs for Services
  2.5 SHARE Online Demonstration
  2.6 Source Code
  2.7 Discussion

3 An Adaptive Query Evaluation Algorithm for SHARE
  3.1 Introduction
  3.2 Related Work
  3.3 Key Points of SHARE Query Resolution
    3.3.1 Resolving Triple Patterns
    3.3.2 OWL Inferences About Predicates
  3.4 The Distributed SPARQL Query Optimization Problem
    3.4.1 An Illustrative Example
    3.4.2 Assumptions
  3.5 Secondary Optimization: Intersections of Variable Bindings
  3.6 The GREEDY Optimization Algorithm
    3.6.1 Overview
    3.6.2 Cost Estimation for Query Patterns
  3.7 Evaluation
    3.7.1 Evaluation Procedure
    3.7.2 Results
  3.8 Future Work
  3.9 Conclusion
  3.10 Source Code

4 Conclusion

Bibliography

Appendices

A Supporting Material for Chapter 2
  A.1 An Example RDF Service Description for a SADI Service

B Supporting Material for Chapter 3
  B.1 Variants of Training Queries
    B.1.1 Variants of Query 1
    B.1.2 Variants of Query 3
    B.1.3 Variants of Query 5
  B.2 Randomly Generated Orderings for Test Queries
    B.2.1 Orderings for Query 2
    B.2.2 Orderings for Query 4
    B.2.3 Orderings for Query 6

List of Tables

2.1 Results for example SPARQL SELECT query
2.2 List of SPARQL endpoints currently indexed by SHARE
2.3 Pointers to SHARE source code
3.1 Description of execution steps for an inefficient query ordering
3.2 Description of execution steps for an efficient query ordering
3.3 Description of execution steps without the variable bindings optimization
3.4 Description of execution steps with the variable bindings optimization
3.5 Description of execution steps for Query 2 under BASIC
3.6 Description of execution steps for Query 2 under untrained GREEDY
3.7 Description of execution steps for Query 2 under trained GREEDY
3.8 Description of execution steps for Query 6 under BASIC
3.9 Description of execution steps for Query 6 under untrained GREEDY
3.10 Description of execution steps for Query 6 under trained GREEDY

List of Figures

1.1 Example illustrating the problem of merging two XML datasets
1.2 An example RDF dataset
1.3 Example of a successful RDF-merge
1.4 Example of an unsuccessful RDF-merge
1.5 Using different OWL ontologies with the same RDF dataset
1.6 Example of a Taverna workflow
1.7 Mediator architecture
2.1 An example RDF dataset, to illustrate SPARQL usage
2.2 An example SPARQL SELECT query
2.3 An example SPARQL CONSTRUCT query
2.4 An example SADI service
2.5 Map of generated properties for SADI services
2.6 Querying of remote SPARQL endpoints in SHARE
2.7 A map of generated properties for SPARQL endpoints
2.8 An overview of SHARE query resolution
2.9 Screenshot of the SHARE demonstration site
3.1 Example of an inefficient ordering of triple patterns
3.2 Example of an efficient ordering of triple patterns
3.3 Pseudocode for the variable bindings optimization
3.4 Query to demonstrate the variable bindings optimization
3.5 Execution plans with and without the variable bindings optimization
3.6 Example regression line to determine baseTime and timePerInput
3.7 Six example SPARQL queries for evaluating GREEDY
3.9 State of predicate statistics during experiment
3.10 Execution plans for Query 2 under BASIC, untrained GREEDY, and trained GREEDY
3.11 Execution plans for Query 6 under BASIC, untrained GREEDY, and trained GREEDY

Acknowledgements

Thanks to my supervisor, Mark Wilkinson, for his unfaltering support, patience, and enthusiasm, and thanks to Luke McCarthy for helping me to build off of his excellent work. Thanks to Paul Pavlidis and Laks Lakshmanan, who have sacrificed their time to be on my advisory committee and to review this thesis. This research was funded by the Heart and Stroke Foundation of BC and Yukon, and additional support for the development of SHARE was also provided by CIHR and NSERC. Ongoing development and promotion of the SADI web service standard is made possible by CANARIE.
All research of the Wilkinson Lab is supported by the James Hogg Research Centre of the Providence Heart and Lung Institute.

Chapter 1
Introduction

1.1 The Motivation for Data and Software Integration in Bioinformatics

The development of high-throughput methods in molecular biology, such as next generation sequencing [1], microarrays [2], and chromatin immunoprecipitation [3], has led to an explosion in the quantity and diversity of biological data that is available on the web. In turn, the new data has driven the development of new computational tools for tasks such as gene prediction [4] and protein structure prediction [5, 6]. The rapid growth of bioinformatics data and software is evidenced by the annual “database issue” and “web server issue” of the Nucleic Acids Research (NAR) journal, which have now published papers describing ∼1200 online databases and ∼1400 online analysis tools, respectively. NAR provides an online list of the databases [7], while the analysis tools have been catalogued by the Bioinformatics Links Directory [8].

One of the most intriguing things about the body of available bioinformatics resources is the large number of logical interconnections. In general, the resources describe entities that are closely related (such as genes [9], transcripts [10], and proteins [11]) or describe different aspects of the same entities (such as the sequence [11], localization [12, 13], interaction partners [14–16], or structure of a protein [17]). As such, there are many motivations for integrating datasets and software that have been produced by different laboratories. Some applications of integration include:

assembling comprehensive datasets: For most studies or reviews, researchers will want to obtain the most complete dataset possible, which usually requires extracting and merging data from many publications and websites. This need has led to the development of many databases that play the role of shared repositories for a single type of information. Such repositories exist for mutations in cancer [18], antibody sequences [19], protein interactions [15, 16, 20], protein structures [17], drug targets [21], microarray experiments [22–24], and biological pathways [25–27], to name a few.

combining evidence: Many activities in bioinformatics involve combining evidence in order to increase the confidence of predictions. One such task is the annotation of a newly-sequenced genome, where researchers must integrate direct experimental evidence, computational predictions, and known information about related organisms in order to identify gene-coding regions, non-coding RNAs, synteny blocks, and putative protein functions [28]. There are many other areas of study for which the integration of evidence is essential, such as the reconstruction of phylogenetic trees [29], the prediction of subcellular protein localization [30], and the prediction of protein-protein interactions [20, 31].

interpreting experimental results: Genome-wide association studies [32] and microarray experiments identify potentially long lists of genes that have a significant role in a particular disease or experimental condition. However, these experiments do not explain why these “hits” are significant or how they are biologically related. The researcher may have a large number of questions about the top-ranked genes which require external datasets or analysis, such as:
⇒ “For what other diseases or conditions was this gene found to be significant?”
⇒ “What papers have been published on this gene?”
⇒ “What metabolic, signaling, and regulatory pathways does this gene participate in?”
⇒ “Where do the protein product(s) of this gene localize within the cell?”
⇒ “What other tissues is the gene transcribed in?”
⇒ “What are known interaction partners for the protein product(s) of this gene?”
⇒ “Is this gene alternatively spliced?”
⇒ ...

Such contextual information is crucial for forming hypotheses and for prioritizing genes for further study or validation. In fact, several data integration systems have been designed specifically for prioritizing gene lists [33–35].

comparing experiments: Meta-analyses of microarray experiments are an important example of data integration in bioinformatics. The most common goal of such analyses is to validate differentially expressed genes for a certain disease or set of experimental conditions across experiments [36]; this method of validation is promising because it is significantly more cost-effective than experimentally validating results by PCR. Other types of meta-analyses have been conducted to determine the reliability of microarray results across platforms [37] and to validate pairs of coexpressed genes [38].

executing multi-step data analyses: Many analyses in bioinformatics are conceptualized as pipelines or workflows, in which each step performs a different transformation on the data. For instance, a “first-pass” phylogenetic tree for a gene or protein is typically constructed by: (i) identifying the homologs of the gene/protein with a BLAST search, (ii) creating a multiple sequence alignment (MSA) of the homologs, and (iii) using a phylogenetic tree program to infer evolutionary distances from the MSA [39]. Other tasks that are frequently implemented as pipelines are genome annotation, drug discovery, and microarray analysis.

1.2 Fundamental Problems of Data and Software Integration

The current body of websites, databases, and software that is available to bioinformaticians has been produced by a large number of research groups acting independently and without a shared set of standards. This has inevitably resulted in a large number of incompatible schemas and software interfaces that make the combined use of resources unnecessarily difficult and time-consuming [40]. As the field has grown, the bioinformatics community has recognized the need for standardization, and many XML formats have been developed for the exchange of data within specific domains such as microarrays, molecular interactions, and biological pathways [41]. Unfortunately, there are several integration-related problems which XML alone cannot solve, as will be discussed in Section 1.3.

While all of the problems related to data and software integration can be (and typically are) solved manually on a case-by-case basis, the principal goal of integration systems is to resolve these problems automatically. In particular, the main tasks that integration systems seek to automate are:

discovery of relevant resources: The first step of integration is to identify the available databases and software that need to be combined for the desired analysis.
In current practice, resources are usually discovered manually through internet search engines, literature, and link directories.

matching entities across datasets: Whenever two related datasets are combined, the common entities (e.g. genes) between those datasets must be identified and merged. The ideal solution to this problem is to establish a system of globally-shared identifiers for biological entities; however, no such system yet exists. The LSID [42] identifier and resolution scheme was recently proposed for this purpose, but unfortunately it has failed to gain widespread acceptance by the community. LSID identifiers were based on a URN scheme, and aimed to support guaranteed persistence of identifiers and immutability of data records, among other features. The most common objection to the LSID system was that the resolution of identifiers required specialized software and could not be performed with a standard web browser (for example, see [42, 43]). In the absence of global identifiers, more sophisticated methods must be employed for matching entities across datasets, based on the content of the records. A variety of methods for entity matching have been studied within the field of database research, including rule-based systems, decision trees, and other supervised learning methods [44].

schema integration: Data models such as the relational model, XML, and RDF impose only general constraints on the structure of a dataset. Under these models, a dataset must be encoded as a set of cross-referenced tables, a tree, or a directed graph, respectively. Any remaining decisions about structure are left to the data owner, and so the same dataset may be organized according to a multitude of different schemas. A fundamental problem arises when two parties wish to share datasets that are logically related, but whose schemas are mutually incompatible. Data cannot be transferred directly from one database to the other, but must first be mapped between the two schemas. Typically this mapping must be worked out by hand, and the complexity of the problem grows with the number of databases to be integrated. Schema integration is the central problem of data integration, and has been widely studied in the context of relational databases [45], XML (e.g. [46]), and OWL [47].

connecting software interfaces: In order to accomplish multi-step analyses, an integration system must be able to route data from the output of one program to the input of another in an automated fashion. Pipes [48] are a well-known facility of the Unix command line that was designed precisely for this purpose. A standard Unix program reads its input data from a special file descriptor called STDIN, and writes its output to a special file descriptor called STDOUT. When a series of programs is connected by Unix pipes, the STDOUT of each program is sent to the STDIN of the next, thereby creating a continuous processing pipeline. While the Unix command line is a sturdy and versatile tool, specialized workflow systems have been developed for bioinformatics that go beyond the functionality of Unix pipes in a number of ways. Typically, the functional units of such systems are Web Services rather than locally installed programs, which obviates the need for downloading, installing, and configuring software on the user’s local machine.
More importantly, workflow frameworks automate (or partially automate) a number of fundamental tasks such as service discovery, service invocation, identification of services that may be connected, and execution of an overall pipeline. Beyond this, the most advanced workflow systems are capable of translating an abstract, query-based representation of a workflow into a concrete implementation. SHARE, the subject of this thesis, is one such system.

1.3 Integration Technologies and Related Research Projects

In this section we introduce the main technologies that have been developed for data and software integration, discuss the advantages and disadvantages of each, and highlight related systems that have been developed for use in bioinformatics.

1.3.1 Data Warehouses

Conceptually, warehousing is the most straightforward approach to data integration. Data warehouses are constructed by combining a set of related sources into a single database with a unified schema [49, 50]. Beyond the task of designing a unified schema, the creators of a warehouse must develop scripts for extracting data from the original sources, transforming the data to fit the schema, and loading the data into the warehouse. In the database literature, this is known as the Extract-Transform-Load (ETL) cycle. Procedures for extraction and transformation must typically be written by hand after detailed study of the original sources.

Warehousing is a popular methodology in bioinformatics, and numerous warehouses have been constructed for specific types of data such as structural families of proteins [51], sequences of fungal species [52], ligands [53], and crop-related data [54], to name a few. In addition, several warehouses have been built to integrate popular databases such as GenBank, UniProt, and GO; two warehouses of this type are Atlas [55] and SeqHound [56]. The most widely known warehousing system in biology is the Sequence Retrieval System (SRS) [57]. SRS is a framework for integrating databases that are encoded as collections of plain text files (one file per record), and can be configured to parse a wide variety of syntaxes. The principal functionality offered by SRS is the ability to execute efficient, field-specific keyword searches across a large collection of databases. (EBI maintains a publicly-accessible installation of SRS that currently includes 280 different databases [58].) Some recent data warehousing efforts have utilized RDF as their data model, such as the Neurocommons Knowledge Base [59] and the Pathway Knowledge Base [60]; this approach serves to simplify some aspects of the schema integration problem, as will be discussed in Section 1.3.3.

The warehousing approach to integration is often compared to the opposing methodology of view integration that is used by mediator systems; mediator systems function by translating a user query into a number of subqueries that are issued against the databases in their native locations. The principal advantages of warehouses over mediators are speed and reliability; warehouse users need not rely on the availability or performance of numerous remote data sources, and query processing is not delayed by the transfer of data across the network. A further advantage of warehouses is that the provider has complete control over the organization of the data.
This allows the provider to include additional processing steps that result in a cleaner and simpler database, such as filtering, entity-matching, and translation to a uniform identifier scheme. However, centralized control is also a disadvantage in the sense that it places the full burden of maintenance on the warehouse provider. Warehouses must be updated regularly from the original sources, and data extraction scripts must be repaired whenever the schemas of the original sources change.

1.3.2 XML

XML (eXtensible Markup Language) [61] is a framework for defining application-specific data formats. All XML files are text-based and share a general syntax consisting of tags (names enclosed by angle brackets) and content organized in a hierarchical manner. Listing 1.1 shows an example XML file that encodes pathway information, in accordance with a hypothetical schema called PathwayML. (Hypothetical schemas are used in this section to avoid unnecessarily complicated examples; however, the examples used here are valid XML.) The central standard of XML is XML Schema [46], which allows the user to rigorously specify the rules of an XML format. An XML schema primarily specifies structural relationships between tags, such as parent/child relationships, the relative order of sibling tags, and the number of legal occurrences of a tag in a given context. It also describes the type of content (i.e. datatype) for each tag, and specifies an analogous set of rules for XML attributes (such as “reactionID” in Listing 1.1). Listing 1.2 shows the XML Schema for the PathwayML file of Listing 1.1.

<?xml version="1.0" encoding="UTF-8"?>
<pathway pathwayID="KEGG:hsa00234" xmlns="http://www.pathwayml.org/schema">
  <reaction reactionID="KEGG:R00335">
    <enzyme enzymeID="UniProt:P04035">
      <substrate>PubChem:3649</substrate>
      <product>PubChem:3708</product>
    </enzyme>
  </reaction>
</pathway>

Listing 1.1: Example XML file describing part of a metabolic pathway

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="pathway" type="pathwayType"/>
  <xsd:complexType name="pathwayType">
    <xsd:sequence>
      <xsd:element name="reaction" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="enzyme" minOccurs="1" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="substrate" type="xsd:string"/>
                  <xsd:element name="product" type="xsd:string"/>
                </xsd:sequence>
                <xsd:attribute name="enzymeID" type="xsd:string" use="required"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
          <xsd:attribute name="reactionID" type="xsd:string" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="pathwayID" type="xsd:string" use="required"/>
  </xsd:complexType>
</xsd:schema>

Listing 1.2: Example XML Schema (“PathwayML”) describing the XML file of Listing 1.1

One of the principal features of XML Schema is that schema files are “machine readable”.
This means that, in contrast to a natural language description of a schema, an XML Schema file has a simple and rigid structure that makes automated processing by software a relatively straightforward task. (XML data files are themselves machine readable, as are many other formats such as ASN.1 [62] and RDF [63].) The machine-readability of XML Schema has enabled the development of software libraries (e.g. [64]) that are capable of validating and parsing XML files with respect to any user-defined schema. Thus, XML has eliminated the need to create a custom file parser for every application.

Many XML standards have been developed for the exchange of specific types of data within bioinformatics, such as sequence annotations [65, 66], structures [67, 68], alignments [69], protein interactions [70], expression data [71, 72], and pathways [73, 74]. While initially there was a multitude of competing XML formats for each type of data, most areas of bioinformatics appear to be consolidating. For instance, there were once a number of different XML standards for annotating sequences, such as BSML, AGAVE, GAME, and BioML; however, EMBL, NCBI, and DDBJ now offer their annotations for download in a common XML format called INSDSeq (International Nucleotide Sequence Database). Another example of consolidation is the fact that several protein interaction databases such as DIP [14], MINT [75], and IntAct [76] now publish their data in a common XML format called PSI-MI [70]. Similarly, the pathway standard SBML [73] is now being utilized by a large number of simulation systems and databases [77]. One important area where standardization has not yet been fully achieved is the sharing of microarray expression data. In spite of ongoing standardization work by the MGED society [78], including the development of the MAGE-ML [71] and MAGE-TAB [79] formats, the three largest microarray repositories [22–24] still publish their datasets and descriptions of experimental conditions in different (although highly similar) text-based formats. Only ArrayExpress is currently using the standards developed by MGED.

The establishment of XML standards is an important step forward for bioinformatics because it greatly facilitates the exchange of data between databases. While database owners might still need to develop scripts for importing and exporting XML, in most cases they will only need to worry about mapping their database schema to one or two XML schemas, rather than a myriad of possible formats. However, it is important to note that the existence of XML standards does not address the problem of integrating different types of data. For example, suppose that in addition to the PathwayML file of Listing 1.1, a researcher also has a TargetML file containing data about putative drug targets, as shown in Fig. 1.1. The integration of XML schemas is no easier to automate than the integration of relational schemas [80], and thus in order to create an XML file that incorporates both datasets, the researcher must design a new XML schema by hand. This barrier to merging the two datasets prevents a researcher from readily answering questions such as “What are the natural substrates of enzymes that are targeted by the drug Pravastatin?”
<?xml version="1.0" encoding="UTF-8"?>
<drug drugbankID="DB00175" xmlns="http://www.drugml.org/schema">
  <drugname>Pravastatin</drugname>
  <target>UniProt:P04035</target>
  <target>UniProt:P01234</target>
</drug>

<?xml version="1.0" encoding="UTF-8"?>
<pathway pathwayID="KEGG:hsa00234" xmlns="http://www.pathwayml.org/schema">
  <reaction reactionID="KEGG:R00335">
    <enzyme enzymeID="UniProt:P04035">
      <substrate>PubChem:3649</substrate>
      <product>PubChem:3708</product>
    </enzyme>
  </reaction>
</pathway>

Figure 1.1: An example illustrating the problem of merging two XML datasets. One dataset identifies the targets of the drug Pravastatin (top), while the other dataset gives pathway information (bottom). Although the datasets are related (by the common protein UniProt:P04035), there is no automated procedure for merging them into a single, integrated dataset.

1.3.3 The Semantic Web

The vision of the Semantic Web is the establishment of a global, interconnected network of machine-readable data, analogous to the network of human-readable documents that currently populates the World Wide Web (WWW) [81]. The idea was first conceived by the inventor of the WWW (Tim Berners-Lee), and under his direction, a large amount of work has been conducted by the World Wide Web Consortium (W3C) [82] in order to develop the requisite standards [83]. Although the standards work is largely complete, the Semantic Web is still more a dream than a reality; the vast majority of data owners on the web have not yet shown an interest in participating [84]. Regardless of its long-term success, the Semantic Web effort has provided a number of valuable tools for addressing the data and software integration problems that currently exist in bioinformatics. In this section, we provide an introduction to RDF and OWL, the core standards for the Semantic Web. We highlight the main advantages of the RDF/OWL framework for data integration, in comparison to the prevailing models of relational databases and XML, and describe several bioinformatics projects where the standards are being used.

RDF

RDF (Resource Description Framework) is the data model for the Semantic Web. Under the RDF model, a dataset is encoded as a set of triples, where each triple represents an atomic statement or “fact”. Each triple consists of a subject, a predicate, and an object, and each of these three parts is identified by a URI (Uniform Resource Identifier). URIs are globally unique identifiers for the WWW; in the context of RDF, URIs may represent anything: people, material objects, categories, types of relationships, etc. (The most widely used form of URIs in RDF are URLs, Uniform Resource Locators, such as http://www.yahoo.com.) For instance, supposing that http://www.genome.jp/R00335 represents a certain reaction, http://www.genome.jp/hasEnzyme represents the reaction-to-enzyme relationship, and http://www.uniprot.org/P04035 represents a certain protein, the following triple asserts that the protein is an enzyme for the reaction:

(kegg:R00335, kegg:hasEnzyme, uniprot:P04035)

(These URIs, and the others used in the examples of this section, are hypothetical. At the current time, the majority of data providers in bioinformatics do not yet publish their data in RDF, and thus they do not have established URI schemes for referencing their database records; UniProt is an exception.) A set of such triples constitutes an RDF dataset, and can be visualized as a graph where the subjects and objects are nodes, and the predicates are edges, as in Fig. 1.2. Thus, RDF is a graph-based data model, whereas the relational model is table-based, and the XML model is tree-based. RDF datasets are typically serialized as TURTLE [85] or RDF/XML [86], and may be stored in specialized databases called triple stores [87]. Triple stores are typically queried via the SPARQL query language [88], which will be introduced in Section 2.2.

At the current time, RDF is not a widely used format for publishing data within bioinformatics [89].
In fact, the author is aware of only one data provider that natively publishes RDF: the UniProt protein database [11]. However, several projects have endeavoured to provide publicly accessible, third-party translations of existing bioinformatics resources in RDF form. The largest and longest-standing effort in this area is the Bio2RDF [90] project, which provides “RDFized” versions of popular databases such as KEGG [25], Entrez Gene [9], and PDB [17]. More recently, the LODD (Linked Open Drug Data) project [91] has produced RDF versions of drug-related resources such as DrugBank [21], DailyMed [92], and SIDER [93]. Bio2RDF and LODD house the RDF versions of each database in a separate triple store; however, there have also been several projects that have built RDF-based data warehouses, such as the Neurocommons Knowledge Base [59] (neuroscience resources), the Pathway Knowledge Base [60] (pathway resources), and BioGateway [94] (general resources such as Uniprot and GO).

@prefix kegg: <http://www.genome.jp/> .
@prefix uniprot: <http://uniprot.org/> .
@prefix pubchem: <http://pubchem.ncbi.nlm.nih.gov/> .

kegg:hsa00234 kegg:hasReaction kegg:R00335 .
kegg:R00335 kegg:hasEnzyme uniprot:P04035 .
uniprot:P04035 kegg:hasSubstrate pubchem:C3649 .
uniprot:P04035 kegg:hasProduct pubchem:C3708 .

(a) RDF serialized as TURTLE

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:kegg="http://www.genome.jp/"
         xmlns:uniprot="http://uniprot.org/"
         xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://www.genome.jp/hsa00234">
    <kegg:hasReaction rdf:resource="http://www.genome.jp/R00335"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.genome.jp/R00335">
    <kegg:hasEnzyme rdf:resource="http://uniprot.org/P04035"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://uniprot.org/P04035">
    <kegg:hasSubstrate rdf:resource="http://pubchem.ncbi.nlm.nih.gov/C3649"/>
    <kegg:hasProduct rdf:resource="http://pubchem.ncbi.nlm.nih.gov/C3708"/>
  </rdf:Description>
</rdf:RDF>

(b) RDF serialized as RDF/XML

(c) RDF visualized as a graph: the chain kegg:hsa00234 → kegg:R00335 → uniprot:P04035 → {pubchem:C3649, pubchem:C3708}, connected by the edges kegg:hasReaction, kegg:hasEnzyme, kegg:hasSubstrate, and kegg:hasProduct

Figure 1.2: An example RDF dataset describing part of a metabolic pathway
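To make the querying model concrete, the following is a minimal sketch of how the dataset of Fig. 1.2 might be queried with SPARQL (the language itself is introduced in Section 2.2). As elsewhere in this section, the kegg: URIs are hypothetical.

PREFIX kegg: <http://www.genome.jp/>

# Find each enzyme participating in a reaction of pathway
# hsa00234, together with the enzyme's substrate.
SELECT ?enzyme ?substrate
WHERE {
  kegg:hsa00234 kegg:hasReaction  ?reaction .
  ?reaction     kegg:hasEnzyme    ?enzyme .
  ?enzyme       kegg:hasSubstrate ?substrate .
}

Against the triples of Fig. 1.2, this query would bind ?enzyme to uniprot:P04035 and ?substrate to pubchem:C3649.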
RDF has proven useful in warehouse construction because it greatly simplifies the merging of datasets from different sources, as will be described below.

The use of globally unique identifiers in RDF provides a mechanism for matching entities across datasets, which is essential for data integration (see “matching entities across datasets” in Section 1.2). However, the manner in which these global identifiers should be assigned to entities is not entirely clear. URIs were chosen to play the role of identifiers in RDF because the scheme guarantees uniqueness and is already well established on the WWW. In addition, its connection to the DNS framework gives a clear mechanism for the ownership of URLs, in which a single party has control of each URL and the data that it resolves to. However, the use of URLs in RDF has also been an ongoing source of confusion [89, 95]. For instance, RDF providers have tended to invent their own URLs for every entity that they reference in their data, regardless of whether URLs for those entities already exist elsewhere. This behaviour has arisen from the fact that an HTTP URL can only resolve to one network location, and thus the use of a “foreign URL” to identify an entity (e.g. a protein) will prevent browsing or crawling to the provider’s own unique data about that entity. An additional problem is that, even in cases where an RDF data provider is willing to use an external URL for an entity, he/she has no straightforward method for determining whether such a URL exists. These problems (and several others) motivated the recent development of the LSID (Life Science Identifier) [42] identifier and resolution system. LSID utilizes features of the DNS system to implement resolution to multiple sources, and in addition seeks to guarantee the persistence and immutability of identifiers and their associated data records. However, there has been a great deal of argument about whether specialized resolution mechanisms are truly necessary for the Semantic Web (e.g. [42, 43]), and LSID has not seen widespread acceptance. At the current time, the issues surrounding the creation and resolution of identifiers in RDF remain unresolved.

The principal advantage of RDF as a data model is that it supports an automated procedure for merging data sets. To perform an RDF-merge, one simply adds the set of triples in one dataset to the set of triples in another dataset (discarding duplicates), as depicted in Fig. 1.3. An RDF-merge will yield a coherent dataset provided that:

1. common entities and common relationships are identified by the same URIs, and
2. whenever the same type of data is encoded in both datasets (e.g. reactions), the graph structures used for representing that data are the same.

Figure 1.3: An example of a successful RDF-merge. The pathway triples of Fig. 1.2 and two drug-target triples, (uniprot:P04035, drugbank:targetOf, drugbank:DB00175) and (uniprot:P04035, drugbank:targetOf, drugbank:DB003946), share the node uniprot:P04035, so their union forms a single connected graph.

For instance, Fig. 1.4 shows a scenario where different graph structures are used to encode biochemical reactions. The merged dataset is not coherent and cannot be uniformly queried to extract information.

Figure 1.4: An example of an unsuccessful RDF-merge. The two original datasets represent reaction data using different graph structures: one attaches substrates and products to the enzyme (kegg:hasSubstrate, kegg:hasProduct), while the other attaches them to the reaction (reactiondb:reactant, reactiondb:product). uniprot:P04035 and uniprot:P07096 represent two alternate enzymes that catalyze the same reaction.

The problem illustrated in Fig. 1.4 is a variant of the schema integration problem, which also exists under the relational model and the XML model. On the Semantic Web, the problem is addressed through the use of OWL ontologies, as will be discussed in the following section. Although RDF does not provide a universal solution to the schema integration problem, it is important to note that RDF does provide a solution when the domains of the two datasets differ and common identifiers are used, as in Fig. 1.3.
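As a sketch of the payoff, the query below answers the question posed at the end of Section 1.3.2 (“What are the natural substrates of enzymes that are targeted by the drug Pravastatin?”) directly against the merged graph of Fig. 1.3. The drugbank: namespace URI is an assumption here, as are the other URIs in this section.

PREFIX kegg:     <http://www.genome.jp/>
PREFIX drugbank: <http://drugbank.ca/>

# Natural substrates of enzymes targeted by Pravastatin (DB00175).
SELECT ?substrate
WHERE {
  ?enzyme drugbank:targetOf drugbank:DB00175 .
  ?enzyme kegg:hasSubstrate ?substrate .
}

On the merged triples this binds ?substrate to pubchem:C3649; under the two XML encodings of Fig. 1.1, by contrast, the same question cannot be answered without first designing a unified schema by hand.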
OWL

OWL (Web Ontology Language) is a standard for defining ontologies for use with the Semantic Web. An ontology is often defined as a “formal conceptualization of a domain” [96]. In this context, a “conceptualization” is a set of classes of things (e.g. proteins, genes, chromosomes), together with the relationships that connect them (e.g. a protein is encoded by a gene); a “domain” is any possible area of study (e.g. tertiary structures of proteins). OWL ontologies on the Semantic Web serve two purposes:

1. They provide the vocabulary and structural rules for encoding an RDF dataset for a given domain. In this respect, OWL ontologies are the functional analog of schemas for relational databases and XML.

2. They encode knowledge about a domain in machine-readable form, such that it can be used to make automated inferences about RDF data.
For example, an OWL reasoner might infer that an instance of the class Ligase has at least one substrate and at least one product, given that the class Ligase is a subclass of Enzyme. An OWL ontology represents a set of axioms in a description logic [97], and so it is necessary to understand the basic principles of description logics in order to understand how OWL ontologies achieve the two tasks above. Every description logic represents four kinds of elements:

Individuals: Individuals are the entities that are described and related within the data. In the RDF/OWL framework, individuals are represented by the subject and object URIs of the triples in an RDF dataset.

Properties: Properties are relationships between individuals, and are also referred to as predicates or roles. An OWL ontology provides a vocabulary of predicate URIs that may be used when encoding RDF triples in a given domain; for instance, a pathway ontology corresponding to the dataset in Fig. 1.2 would contain the predicate URIs kegg:hasReaction, kegg:hasEnzyme, etc. In addition, the pathway ontology could assert axioms about predicates, such as “the subject URI of any triple with the kegg:hasReaction predicate belongs to the class Pathway”:

hasReaction(x, y) ⇒ Pathway(x)

In a description logic, classes are represented as unary predicates (e.g. Pathway(x)) and properties are represented as binary predicates (e.g. hasReaction(x, y)). (Classes will be discussed below.) Many other types of axioms regarding predicates are also possible, such as the assertion that one property is the inverse of another:

isReactionOf(y, x) ⇔ hasReaction(x, y)

Assertions: Assertions are statements about individuals. In the RDF/OWL framework, assertions are represented by the triples of an RDF dataset.

Classes: Classes are sets of individuals, and are also referred to as concepts. Classes constitute another part of the vocabulary that is provided by an OWL ontology. The members of a class may be specified explicitly; however, classes may also be defined more generally in terms of axioms. Such axioms describe conditions that are either necessary, or necessary and sufficient, for an individual to be a member of the class. For instance, a class axiom may assert that whenever an individual x belongs to class Y, it also belongs to class X:

Y(x) ⇒ X(x)

This axiom asserts that Y is a subclass of class X. More complex relationships between classes can also be encoded in description logics, by building expressions involving the intersection, union, and complement of classes. Classes may also be defined in terms of the properties of an individual. For instance, one might define a class called Enzyme as the set of individuals that have at least one hasSubstrate property and at least one hasProduct property:

Enzyme(x) ⇔ ∃y hasSubstrate(x, y) ∧ ∃y hasProduct(x, y)

In OWL, the atoms ∃y hasSubstrate(x, y) and ∃y hasProduct(x, y) are called property restrictions. More specifically, they are existential property restrictions. Other types of property restrictions are universal restrictions, which assert that all values of a property must belong to a certain class, and cardinality restrictions, which assert that an individual must have a certain number of distinct values for a particular property.
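As a concrete sketch, the Enzyme definition above, together with the Ligase example from the start of this subsection, might be written in OWL’s Turtle serialization as follows. The kegg: URIs are the hypothetical ones used throughout this section, and owl:Thing is used as an unrestricted filler class.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix kegg: <http://www.genome.jp/> .

# Enzyme(x) ⇔ ∃y hasSubstrate(x, y) ∧ ∃y hasProduct(x, y)
kegg:Enzyme a owl:Class ;
    owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf (
            [ a owl:Restriction ;
              owl:onProperty kegg:hasSubstrate ;
              owl:someValuesFrom owl:Thing ]
            [ a owl:Restriction ;
              owl:onProperty kegg:hasProduct ;
              owl:someValuesFrom owl:Thing ] ) ] .

# Ligase(x) ⇒ Enzyme(x): from this axiom alone, a reasoner can
# infer that every Ligase has at least one substrate and product.
kegg:Ligase rdfs:subClassOf kegg:Enzyme .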
The use of axioms to define classes and properties makes it possible to automatically find contradictions within an ontology; this operation is called consistency checking, and it is one of the principal tasks that can be performed by an OWL reasoner (e.g. [98–100]). It is also possible to use consistency checking to test whether a given RDF dataset conforms to the rules of an ontology; thus, an OWL ontology can be used to specify the structural rules of RDF data in a very sophisticated manner. Beyond consistency checking, an OWL reasoner may be used to deduce facts that are not explicitly stated in the data. Typical questions that an OWL reasoner can answer are:

⇒ Are all instances of class X also instances of class Y? (Subsumption)
⇒ Does individual x belong to class X? (Realization)
⇒ Is it possible for any individual to be a member of class X? (Satisfiability)
⇒ Are there any logical contradictions in the ontology? (Consistency)

While this type of “reasoning” may seem somewhat limited, the ability to automatically apply a classification system to biological data has many applications. For instance, an OWL ontology can be used to group proteins into families according to their domains, as described in [101], or to organize a database of small molecules by their functional groups, as in [102].

Description logics are distinguished by the particular constructs (e.g. existential restrictions) that they permit in the construction of their axioms. For example, there are several variants of OWL based on different description logics; the most widely used variant, OWL DL, is a representation of the description logic SHOIN(D). Each letter in the name indicates one or more constructs:

S: the constructs of the description logic ALC, which include negation of class expressions, intersections of classes, universal restrictions, and existential quantification
H: subproperty axioms
O: nominals (classes may be defined as enumerated lists of individuals)
I: inverse properties
N: cardinality restrictions
(D): datatype properties (properties which take a number or string as their value, rather than an individual)

The set of constructs used by a description logic determines its expressivity. One of the main subjects of research in description logics is the tradeoff between expressivity and the computational complexity of reasoning. It is generally understood that the more expressive a logic, the more computationally expensive its inferencing services (e.g. subsumption) become, with undecidability as the worst case [103].

Several projects have created OWL ontologies for specific domains of bioinformatics, such as BioPAX [104] for modelling pathways and interactions, OBI [105] for modelling experimental procedures and conditions, and the MGED ontology [78] for modelling microarray experiments. Larger OWL ontologies have also been constructed that cover multiple domains. For instance, the NCI (National Cancer Institute) Thesaurus [106] models many areas of biology and clinical science such as diseases, genes, drugs, pathways, taxonomy, and cancer diagnoses. In addition, a number of description logic ontologies predate OWL, such as the GALEN [107] and SNOMED-CT [108] ontologies for medical terms, and the TAMBIS ontology [109] for general biological concepts. It is worth noting here that the word “ontology” is more commonly used in bioinformatics as a synonym for controlled vocabularies (simple hierarchies of terms) such as the GO (Gene Ontology) [110].
GO has been a very successful means of ensuring that scientists use consistent terms when annotating data records, and has enabled the querying of databases using only one standardized term per concept (e.g. “transcription”).

As it relates to data integration, the principal advantage of the RDF/OWL framework over the relational and XML models is that a single dataset can conform to multiple schemas (ontologies) at the same time. For example, the dataset of Fig. 1.3 could be queried using two independently developed ontologies for drug targets and enzymes, as shown in Fig. 1.5. This “mixing and matching” of ontologies is possible under OWL because class definitions do not require individuals to have a fixed number of property types; in the example, an instance of the class DrugTarget is permitted to have any number of other properties in addition to “targetOf” (such as “hasSubstrate” and “hasProduct”). Figure 1.5 demonstrates how the RDF/OWL framework solves the schema integration problem when two ontologies describe different types of data about the same entities. However, if the domains of two ontologies overlap (e.g. two ontologies that describe pathways), the representation of common entities must agree (as demonstrated in Fig. 1.4).

Figure 1.5: An example illustrating how different OWL ontologies may be used to provide alternate interpretations of the same RDF dataset. In the first case (left), uniprot:P04035 is identified as an instance of the class Enzyme, whereas in the second case (right), uniprot:P04035 is identified as an instance of the class DrugTarget.

1.3.4 Web Services

The fundamental idea underlying web services is to make the functionality of software accessible over a network. This can be accomplished by setting up a server that listens for input messages from remote clients, performs computations on those input messages, and responds with output messages that contain the results. The client-server model is already widely used for a large number of internet applications such as web servers and FTP servers; the idea is merely to extend the pattern to other types of software. It is important to note that the intended “users” of web services are not people but rather machines; web services are a mechanism for accessing external software from within a program, analogous to a procedure call. Thus, there is a strong emphasis on machine-readability and automation.

A number of standards have been developed to promote uniform interaction, description, and discovery of web services, which together form the WS-* stack. First, SOAP (Simple Object Access Protocol) [111] is a simple XML “wrapper” that allows control information to be attached to web service messages, for purposes such as application-specific routing and security.
Second, WSDL (Web Service Description Language) [112] is an XML standard for describing the message formats, calling pattern, location, and protocol (e.g. HTTP) of a web service. WSDL files describe message formats using XML Schema. Lastly, UDDI (Universal Description, Discovery and Integration) [113] is a distributed, XML-based registry system for web services, which can itself be queried as a web service.

Apart from the WS-* stack, an alternative model for web services also exists in the literature, called RESTful services. Requests to RESTful services are typically encoded as HTTP GET URLs and are invoked by a simple request-response pattern; in other words, RESTful services are invoked in the same manner that a web browser retrieves a web page. The term REST (REpresentational State Transfer) describes a general set of design principles for distributed systems that were employed in the development of the HTTP protocol; the essential characteristics of a REST system are the separation of interactions into clients and servers, the statelessness of servers, the use of global identifiers for resources, and the support of a generic interface for retrieving representations of resources [114]. Currently, there is an ideological battle being waged between supporters of the RESTful model and the “Big Services” model [115]; however, the two views are in many ways compatible, because the WS-* stack supports a superset of the behaviour described by RESTful services. The majority of web services in bioinformatics communicate via SOAP, are described in WSDL, and behave in a RESTful manner (in the sense that they are stateless, have a simple request-response calling pattern, and operate over HTTP). Examples include EBI’s dbfetch [116] service, the KEGG API [117], BioMoby [118], and BioMart’s “MartServices” [119].

While the WSDL and SOAP standards are widely used in bioinformatics, the UDDI standard is not; in current practice, bioinformaticians usually discover web services through papers, internet search engines, and word of mouth. In an effort to fill this gap, a number of projects have created specialized registries for bioinformatics services with capabilities beyond those of UDDI. For example, the BioMoby [118] registry includes a shared ontology of XML datatypes that represent common entities such as DNA sequences and GO terms. Services that wish to participate in the BioMoby framework must consume and generate one of the BioMoby datatypes (or extend the shared ontology), and thus the framework encourages the creation of interoperable services. The Feta [120] registry takes an alternate, annotation-based approach. Feta service providers annotate services and their parameters with terms from the myGrid ontology [121], enabling users to discover services based on the types of parameters they consume/generate (e.g. “Protein Sequence”), and the types of analyses they perform (e.g. “Sequence Alignment”). Most recently, the BioCatalogue [122] provides an open web-based service registry which can be searched by keyword, service category (e.g. “Phylogeny”), and user-annotated tags. The BioCatalogue now contains more than 1000 services.

Beyond the basic tasks of locating and invoking individual services, it is also possible to chain services into workflows to perform complex analyses. (Fig. 1.6 depicts an example workflow, constructed in Taverna [123].) Workflows are useful for a number of tasks in bioinformatics.
For example, one might build a workflow to automate the mapping of differentially expressed genes to pathways [124], the retrieval of gene information for a QTL [125], or the construction of a phylogenetic tree [126]. A number of software tools have been developed to aid in the (manual) construction of workflows, such as Taverna [123] and Kepler [127].

Figure 1.6: Example Taverna workflow which generates a BLAST report, multiple sequence alignment, and phylogenetic tree for an input protein sequence.

Unfortunately, building workflows is still not easy. Chaining services from different providers tends to be difficult and time-consuming [128] for the same reason that merging data from different providers is difficult: the schema integration problem. Virtually all web services utilize XML-based message formats, and the XML schemas for these formats are developed independently by different service providers. Thus, at the current time, building analysis pipelines with web services has few advantages over building pipelines with locally installed software. One advantage that web services do have over ordinary software is platform independence; a web service may be invoked from virtually any programming language on any operating system, and requires no downloading, installation, or configuration prior to use. (The price of this convenience is dependency on a remote server.) In addition, web services are a useful mechanism for “wrapping” a related set of databases and software packages, so that they are exposed with a uniform set of interfaces.

A number of projects have been dedicated to building more sophisticated infrastructure on top of the existing web service standards. For example, grid architectures seek to unify a collection of distributed servers and allow them to be used as a single computational resource. Thus, the functionality that is supplied by a grid system is analogous to the functionality supplied by an operating system: job control (running, pausing, or killing a program), job monitoring, file storage/transfer, and security. In contrast to users of stand-alone web services, grid users must supply both the input data and the software to be run as a job on the grid. The general architecture and functionality of a grid system is described in the OGSA (Open Grid Services Architecture) document [129], and the de facto standard implementation of OGSA is the Globus Toolkit [130]. Both the BIRN (Biomedical Informatics Research Network) [131] and caGrid [132] grid projects are based on the Globus Toolkit, as are a number of other grid projects outside of bioinformatics.
A related term, cloud computing, is used to describe grid-like services that have recently become available from Amazon [133], Google [134], and Microsoft [135]; these services allow application developers to run multiple instances of an application on virtual machines that are supplied by the vendor. The main feature of cloud computing services is that the number of virtual machines running at any given time may be readily increased or decreased (in some cases automatically). The most common target application of cloud computing is the creation of websites that can readily scale in response to increasing popularity.

Another area of research related to web services is Semantic Web Services [136], which seeks to apply Semantic Web standards to the problem of automatically constructing workflows. The general approach taken in this field is borrowed from the planning domain [137] of artificial intelligence, in which the condition of the world before and after any action is modeled by a set of state variables. The user’s goal in executing the workflow is expressed as a desired set of values for the state variables, and the job of a service broker (e.g. [138]) is to find a path from the initial conditions, through a number of web service invocations, to the goal. Two standards-in-progress have emerged for describing web services along these lines, called OWL-S [139] and WSMO [140].

1.3.5  Mediator Systems

A mediator system (also called a multidatabase system or a data integration system) provides a common query interface to a collection of data sources that are distributed across a network. A mediator answers a query by translating it into a set of subqueries that are issued against the individual data sources (Fig. 1.7). Typically, users formulate their queries against a “virtual” schema or ontology that provides a coherent view of all available data; thus, designing the virtual schema, and mapping it to the schemas of the individual sources, is one of the central tasks of implementing a mediator system. This design/mapping work is yet another instance of the schema integration problem that has been encountered under several different scenarios in this chapter. As in the previous instances, the mappings must be made manually after careful study of the sources. One common technique for distributing the work of schema integration is for each participant to implement a wrapper interface (i.e. a web service) over his/her data source, as depicted in Fig. 1.7. The wrapper layer also masks other types of heterogeneity, such as the different interfaces used for invoking an analytical program and querying a database.

Figure 1.7: Mediator architecture. A user query is translated by the mediator into subqueries, each of which is issued through a wrapper to a database or to analytical software.

A number of mediator systems have been developed for bioinformatics. One of the earliest systems was Kleisli [141]. Kleisli uses a query language called CPL (Collections Programming Language) [142] that allows for the querying and transformation of nested data structures consisting of lists and sets3. This is particularly suitable for bioinformatics resources, which tend to have complex structures, such as the sequence features (e.g. exon regions within a DNA sequence) that are embedded in a GenBank record.
In the late 90’s, a library of Kleisli “drivers” (i.e. wrappers) was built [143] that enabled querying across a number of important bioinformatics resources such as GenBank, GDB, and BLAST. One caveat of Kleisli is that it requires the user to be aware of the complete set of available data sources, and the particular record structures of each; in other words, there is no virtual schema. The TAMBIS [144] project addressed this issue by mapping Kleisli drivers to the concepts and relationships of a large description logic ontology [109] for bioinformatics. In addition, TAMBIS provided an intuitive graphical user interface for composing queries against the ontology. (The TAMBIS project is no longer active.) BioFlow [145] is another bioinformatics mediator system, which implements extensions to SQL syntax that allow for the mapping of remote XML files to tables, and the mapping of web forms to SQL functions.

The mediator approach to data integration is often contrasted with the data warehousing approach. On one hand, mediators are not affected by the “staleness” issue of data warehouses, because they obtain their data directly from the original sources. On the other hand, interaction with the original sources often requires large data transfers across the network, and thus mediator queries are generally slower than warehouse queries. In addition, mediators depend on the availability and performance of remote servers, which may be unpredictable. Both the warehouse and mediator approaches typically require the creation of complex schemas and schema mapping rules, which entail a significant investment of human labour. In addition, ongoing maintenance of the schema and associated mappings is required, as new data sources are added and the schemas of existing sources change. One advantage of the mediator approach with respect to maintenance is that the work can be distributed across data providers, if each provider is responsible for maintaining the wrapper of his/her own resource.

3 CPL is similar in syntax to JavaScript Object Notation (JSON).

1.4  Thesis Outline

This thesis describes the design and implementation of SHARE (Semantic Health and Research Environment), a general-purpose mediator system for bioinformatics. The objective of SHARE is to provide a framework that allows for flexible and efficient querying across distributed databases and software, while at the same time allowing for the ongoing addition of resources by third parties. The principal challenges of implementing a successful mediator system relate to schema integration and query performance. In the following chapters, we propose new approaches for addressing these problems based on the use of Semantic Web technologies. In Chapter 2, we describe the system architecture for SHARE, with an emphasis on the design decisions that support open addition of resources, while avoiding the requirement for centralized development and maintenance of an overarching schema. In Chapter 3, we investigate the problem of optimizing queries in SHARE. We describe an adaptive query execution algorithm called GREEDY, which has been designed to cope with the problem of planning queries in the absence of statistics about the data sources. Chapter 4 concludes by highlighting the contributions of SHARE to bioinformatics and the Semantic Web, and also the weaknesses of the system and areas for future work.
Chapter 2

SHARE System Architecture

2.1  Introduction

SHARE (Semantic Health and Research Environment) is a mediator system that enables simultaneous querying of databases and analytical programs that are distributed across the web. SHARE differs from other mediators, both inside and outside of bioinformatics, because its design is based entirely on Semantic Web standards. All services (i.e. data sources) consume and generate RDF [63], input/output datatypes are matched using OWL [146] reasoning, and user queries are expressed in SPARQL [88], the standard query language for RDF. In this chapter, we will demonstrate how these standards can be employed to simplify the implementation, maintenance, and addition of data sources to a mediator system. In particular, there are two characteristics that are unique to SHARE:

1. There is no master schema.
2. If there is a chain of service calls that will convert datatype A to datatype B, that chain can be discovered and invoked automatically.

We will return to these two claims and discuss their implications after describing the mechanics of the system. We begin by introducing SPARQL, the language for SHARE user queries, in Section 2.2. Section 2.3 describes the different types of data sources that are queried by SHARE, and Section 2.4 explains the process of query resolution. Section 2.5 describes the online demonstration of SHARE, Section 2.6 points to the source code, and Section 2.7 concludes by discussing the main strengths and weaknesses in the design of the system.

2.2  The SPARQL Query Language

?structure    ?ligand    ?name
pdb:2DN1      pdb:OXY    "OXYGEN MOLECULE"
pdb:2DN1      pdb:MBN    "TOLUENE"

Table 2.1: The results for the SELECT query of Fig. 2.2, when issued against the dataset in Fig. 2.1.

A SPARQL [88] SELECT query searches an RDF dataset for a subgraph with a specified structure, as demonstrated by the example query in Fig. 2.2. The subgraph of interest is defined by a set of triple patterns in the WHERE clause, in the same manner that an RDF graph is described by a set of triples. The difference between an RDF triple and a triple pattern is that a triple pattern may have a variable (denoted by a ‘?’ prefix) in any combination of its three positions. Variables denote placeholders for unknown values, and the result of a SPARQL query is a table of bindings for the variables. A set of triple patterns forms a basic graph pattern, which may be combined with the UNION (logical OR) and OPTIONAL operators to build up complex query expressions. The PREFIX syntax at the top of Fig. 2.2 is used to declare abbreviations for URIs that are used in the query.

@prefix :        <http://sadiframework.org/ontologies/predicates.owl#> .
@prefix uniprot: <http://lsrn.org/UniProt:> .
@prefix pdb:     <http://lsrn.org/PDB:> .

uniprot:P00214  :has3DStructure   pdb:1FRI .
uniprot:P68871  :has3DStructure   pdb:2DN1 .
pdb:1FRI        :hasLigand        pdb:F3S .
pdb:F3S         :hasChemicalName  "FE3-S4 CLUSTER" .
pdb:2DN1        :hasLigand        pdb:OXY .
pdb:OXY         :hasChemicalName  "OXYGEN MOLECULE" .
pdb:2DN1        :hasLigand        pdb:MBN .
pdb:MBN         :hasChemicalName  "TOLUENE" .

(a) target dataset (N3 format)

(b) target dataset (graph visualization)

Figure 2.1: An example RDF dataset, shown in N3 and graph form.
PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?structure ?ligand ?name
WHERE {
    uniprot:P68871  :has3DStructure   ?structure .
    ?structure      :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}

Figure 2.2: A SPARQL SELECT query which asks “What are the ligand(s) of Hemoglobin subunit beta (UniProt protein P68871)?”. The results for the query, when issued against the dataset in Fig. 2.1, are shown in Table 2.1.

A SPARQL CONSTRUCT query generates an RDF graph as output, rather than a table of variable bindings. CONSTRUCT queries are useful for performing structural transformations on RDF, as demonstrated by Fig. 2.3. For further details about SPARQL query syntax, see [88]; however, the explanation provided here is sufficient for the remainder of the discussion in this chapter.

PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

CONSTRUCT {
    uniprot:P68871  :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}
WHERE {
    uniprot:P68871  :has3DStructure   ?structure .
    ?structure      :hasLigand        ?ligand .
    ?ligand         :hasChemicalName  ?name .
}

(a) example CONSTRUCT query

(b) result graph for the CONSTRUCT query

Figure 2.3: A SPARQL CONSTRUCT query which tells the system to “Construct an RDF graph in which Hemoglobin subunit beta is directly annotated by its ligands.” The resulting RDF graph for this query, when issued against the dataset in Fig. 2.1, is depicted in part (b).

2.3  SHARE Data Sources

2.3.1  SADI Services

[Figure: an input RDF document, in which uniprot:P04637 carries an owl:hasAminoAcidSequence value ("MEEPQSDPV..."), is sent by HTTP POST to the SADI BLAST service; the output RDF document in the response attaches owl:hasHomolog properties linking uniprot:P04637 to uniprot:P56424, uniprot:P13481, and uniprot:Q9TTA1.
Input OWL Class: owl:ProteinWithSequence ≡ (≥1 owl:hasAminoAcidSequence)
Output OWL Class: owl:HomologSet ≡ (≥1 owl:hasHomolog)
Generated Property: owl:hasHomolog]

Figure 2.4: An example SADI service which performs a BLAST search. The structure of the input and output data is simplified; a real BLAST service would likely provide additional data such as expect values, pairwise sequence alignments, etc. Note that the output RDF file does not need to include statements (triples) about the input URIs that were given in the input RDF document, because these facts are already known to the client. For example, the output document need not specify that (uniprot:P04637, owl:hasAminoAcidSequence, "MEEPQSDPV...").

SADI (Semantic Automated Discovery and Integration) services are stateless web services that are invoked by issuing an HTTP POST request. Both the request and the response consist of a single RDF document. The input RDF document contains one or more input URIs, and may contain any number of statements (i.e. triples) about those URIs; the purpose of these statements is to provide the service with the information that is required to process the inputs. For instance, the example BLAST service in Fig. 2.4 requires one or more amino acid sequences associated with each input protein, in addition to the URIs identifying the proteins themselves4.

4 It would be preferable to specify that each input protein has exactly one amino acid sequence. However, standard OWL reasoners can never deduce instances of a class that is defined by an exact cardinality restriction, due to the open world assumption [147].
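For instance, the messages exchanged with the BLAST service of Fig. 2.4 might look like the following N3. This is a minimal sketch: the blast: prefix is invented (Fig. 2.4 abbreviates the service’s ontology namespace as “owl:”, which is avoided here to prevent confusion with the OWL namespace itself).

@prefix blast:   <http://example.org/blast-service.owl#> .
@prefix uniprot: <http://lsrn.org/UniProt:> .

# Input RDF document (HTTP POST body): one input URI, plus the
# statements the service needs in order to process it.
uniprot:P04637  blast:hasAminoAcidSequence  "MEEPQSDPV..." .

# Output RDF document (HTTP response): new statements that attach
# the generated property to the same input URI.
uniprot:P04637  blast:hasHomolog  uniprot:P56424 ,
                                  uniprot:P13481 ,
                                  uniprot:Q9TTA1 .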
The set of properties (i.e. predicates) that must be supplied for each input URI is encoded by the input OWL class of the SADI service. Each SADI service also has an output OWL class that serves a similar function; it specifies the set of properties that are attached to the input URIs as a result of the service call. In the example service of Fig. 2.4, the input OWL class is the set of nodes having one or more hasAminoAcidSequence properties, and the output class is the set of nodes with one or more hasHomolog properties. While knowledge of the input OWL class is required to successfully invoke a SADI service, knowledge of the output OWL class is required to determine the type of data that the service generates.

The input and output OWL classes for a SADI service are specified by the service provider as part of the service description, and this description is accessed by performing an HTTP GET on the service URL. The service description is encoded in RDF using the myGrid service ontology, and may include other details such as a natural language description of the service. An example service description is provided in Appendix A. Any third party may register their service with the central SADI registry by submitting the service URL to http://sadiframework.org/registry; the SADI registry then obtains all other necessary information about the service by downloading and processing the service description.

In most respects, the SADI standard is a straightforward application of RDF and OWL to the domain of web service messaging. The key constraint of SADI is that both the input and output OWL classes are defined with respect to the input URIs. As a result, the difference between the input and output classes can be computed to determine the set of generated properties for a service. For example, the service in Fig. 2.4 has one generated property, which is hasHomolog. The SADI registry stores the generated properties for each service, and allows clients to retrieve services based on a property of interest. The discovery of services by individual properties is unique to the SHARE system, and is responsible for its ability to automatically assemble workflows from SPARQL queries, as will be discussed in Section 2.4.

The SADI services that are currently available in the registry generate properties for a number of common identifier types in bioinformatics, such as UniProt [11], GO [110], KEGG [117], and PDB [17]; these identifiers must be expressed as URIs according to the LSRN URI scheme (e.g. http://lsrn.org/UniProt:P68871). The generated properties for each identifier type are depicted in the predicates diagram at http://biordf.net/cardioSHARE/predicates.html (Fig. 2.5), which represents the schema for querying the system. A user may extend the pool of queriable services/properties by registering his or her own SADI service at http://sadiframework.org/registry. Provided that the service has been implemented correctly according to the SADI standard, the service will be immediately available for use during query resolution.
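As an illustration of how such class-based interface descriptions might be written down, the following N3 sketch defines the input and output classes of the BLAST service of Fig. 2.4, again using an invented blast: namespace:

@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix blast: <http://example.org/blast-service.owl#> .

# Input class: any URI with at least one amino acid sequence.
blast:ProteinWithSequence  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty blast:hasAminoAcidSequence ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .

# Output class: the same URIs, now carrying at least one homolog.
blast:HomologSet  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty blast:hasHomolog ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .

The difference between the two definitions, the single property hasHomolog, is what the registry records as the service’s generated property.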
Instructions for building SADI services are available at http://sadiframework.org.

2.3.2  SPARQL Endpoints

SPARQL endpoints are HTTP services that accept SPARQL queries as input. Virtually all triple stores (e.g. Jena [148], Sesame [149], Virtuoso [150]) can be configured to act as SPARQL endpoints, and there are a growing number of SPARQL endpoints on the web that provide access to RDF versions of biological databases. At the current time, these endpoints are not hosted by the original data providers but are rather supplied by third parties such as Bio2RDF [90] and the Linked Open Drug Data (LODD) [91] project. Bio2RDF provides RDF mirrors of general resources such as the UniProt [11] protein database and the KEGG [117] pathway database, while LODD provides mirrors of drug-related resources such as DrugBank [21] and the SIDER database of side effects [93].

The SHARE query engine is capable of querying across SPARQL endpoints. SHARE implements this functionality by means of a mediator-side service wrapper for each SPARQL endpoint. Viewed from the query engine side, each service wrapper makes a SPARQL endpoint appear as if it is a SADI service; just as for a SADI service, each SPARQL endpoint has a set of generated properties and may be invoked using one or more URIs as input. On the network side, the service wrapper translates the input URIs to CONSTRUCT queries and sends them to the relevant endpoint, as depicted in Fig. 2.6. The set of generated properties for a SPARQL endpoint is the set of predicates that occur in one or more triples in the endpoint. In principle, this list of predicates can be obtained automatically by issuing the following query to the endpoint:

SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
Figure 2.5: The generated properties for the SADI services that are currently available through the public registry http://sadiframework.org/registry. Currently, this diagram is maintained by hand, but in future work it will be constructed automatically from the contents of the registry.

[Figure: the query engine passes an input URI (e.g. uniprot:P68871) to a SPARQL service wrapper; the wrapper issues CONSTRUCT { uniprot:P68871 ?p ?o } WHERE { uniprot:P68871 ?p ?o } against the remote SPARQL endpoint and returns the resulting RDF document.]

Figure 2.6: Querying of remote SPARQL endpoints in SHARE. Generated properties for the remote SPARQL endpoint are indicated in green.

However, this is a relatively expensive query, because it requires iterating through all of the triples in the database. In practice, it tends to fail for large triple stores because the HTTP POST request that issues the query times out before the database finishes processing the query. A further problem is that SPARQL endpoints may disallow the execution of queries that are deemed to be too expensive. At the current time, such endpoints are instead indexed based on a single URI representing a “typical” database record (e.g. http://bio2rdf.org/uniprot:P05067); in order to accommodate complex record structures, the RDF graph rooted at the URI is crawled using a breadth first search. Clearly, this is not an ideal solution, because it is not fully automatic and is not guaranteed to be complete. In future work, the author plans to develop scripts for different types of SPARQL endpoints (e.g. Virtuoso [150], D2R [151]) that will build standardized indices on the endpoint side.

In addition to a list of generated properties, SHARE’s indices for SPARQL endpoints also contain regular expressions for the subject/object URIs that occur in the endpoint. As with generated properties, the regular expressions can be built by querying the endpoints, but for large endpoints they must be constructed using the record-based approach. The SPARQL endpoint registry for SHARE currently contains indices for the LODD and Bio2RDF endpoints, as listed in Table 2.2. These indices are rebuilt on a weekly basis. A map of the generated properties for the endpoints is shown in Fig. 2.7; as the map is very large, only a small portion is shown in detail here5. The map was generated by performing a series of breadth first traversals across the data in the endpoints, with a depth limit of 7 edges for each traversal. The root URIs for the traversals were chosen by querying for a list of distinct record types (i.e. rdf:types) from each endpoint, and then randomly sampling 3 URIs of each type. As with the SPARQL indices, the map is not guaranteed to be complete; improved methods for constructing the map will be a subject of future work.

5 To view the map in full detail, files may be downloaded in N3, Cytoscape, SVG, and PNG formats from http://sadiframework.org/sparqlmap. Viewing the map with Cytoscape or Inkscape (SVG viewer) is recommended, as PNG requires too much memory to be loaded in most image viewers.
SPARQL Endpoint URL                                   Provider
http://www4.wiwiss.fu-berlin.de/dailymed/sparql       LODD
http://www4.wiwiss.fu-berlin.de/diseasome/sparql      LODD
http://www4.wiwiss.fu-berlin.de/sider/sparql          LODD
http://www4.wiwiss.fu-berlin.de/drugbank/sparql       LODD
http://www4.wiwiss.fu-berlin.de/medicare/sparql       LODD
http://www4.wiwiss.fu-berlin.de/stitch/sparql         LODD
http://uniprot.bio2rdf.org/sparql                     Bio2RDF
http://omim.bio2rdf.org/sparql                        Bio2RDF
http://reactome.bio2rdf.org/sparql                    Bio2RDF
http://mesh.bio2rdf.org/sparql                        Bio2RDF
http://ec.bio2rdf.org/sparql                          Bio2RDF
http://hgnc.bio2rdf.org/sparql                        Bio2RDF
http://inoh.bio2rdf.org/sparql                        Bio2RDF
http://mgi.bio2rdf.org/sparql                         Bio2RDF
http://protein.bio2rdf.org/sparql                     Bio2RDF
http://unists.bio2rdf.org/sparql                      Bio2RDF
http://obo.bio2rdf.org/sparql                         Bio2RDF
http://uniparc.bio2rdf.org/sparql                     Bio2RDF
http://cpd.bio2rdf.org/sparql                         Bio2RDF
http://affymetrix.bio2rdf.org/sparql                  Bio2RDF
http://irefindex.bio2rdf.org/sparql                   Bio2RDF
http://kegg.bio2rdf.org/sparql                        Bio2RDF
http://cpath.bio2rdf.org/sparql                       Bio2RDF
http://biocyc.bio2rdf.org/sparql                      Bio2RDF
http://chebi.bio2rdf.org/sparql                       Bio2RDF
http://taxonomy.bio2rdf.org/sparql                    Bio2RDF
http://homologene.bio2rdf.org/sparql                  Bio2RDF
http://pubchem.bio2rdf.org/sparql                     Bio2RDF
http://biocarta.bio2rdf.org/sparql                    Bio2RDF

Table 2.2: List of SPARQL endpoints currently indexed by SHARE

Figure 2.7: A map of the generated properties for SPARQL endpoints available under SHARE. The callouts show sections of the map containing predicates about chemical reactions, proteins, and drugs. Each node represents a distinct record type (i.e. rdf:type) that occurs in one or more of the endpoints.

2.4  SHARE Query Resolution

2.4.1  Overview

Standard SPARQL query engines [148, 150] have limited capabilities for querying remote data [152]. Users may include “FROM” clauses at the beginning of a SPARQL query that indicate the URLs of target RDF files, and these files will be downloaded in bulk as the first step of query processing (an example is given at the end of this subsection). The main caveat of this approach is that it does not work well for large datasets. In addition, it precludes the possibility of querying data that is generated by analytical software. SHARE addresses these issues by acting as a transparent data gathering layer on top of a standard SPARQL query engine. An overview of the process is depicted in Fig. 2.8. In the first phase of query resolution, SHARE retrieves the data required to answer the query by issuing a series of requests to available services (SADI services and SPARQL endpoints); the process by which SHARE maps a query to a set of service requests will be explained in detail in the remainder of this section. As data from the services is retrieved, it is aggregated in a local triple store. In the second phase of query resolution, the user’s original query is run against the local triple store using a standard SPARQL query engine.

Figure 2.8: An overview of SHARE query resolution. SHARE acts as a transparent data gathering layer on top of a standard SPARQL engine. In the first phase of query resolution, SHARE translates the user’s query to a series of service requests, in order to gather the data needed to answer the query. This data is accumulated in a local triple store as it is retrieved. In the second phase of query resolution, the user’s original query is executed against the local triple store using a standard SPARQL query engine.

Conceptually, SHARE provides access to a large dataset that consists of the output from each SADI service for each possible input URI, together with all triples stored by all known SPARQL endpoints. Hereafter, we will refer to this dataset as the virtual graph. The objective of SHARE is to download a subset of the virtual graph that is sufficient to generate a complete solution for the user’s query. In this context, “complete” means that the solution set has all of the solutions that would be found by running the user’s query on a complete, materialized version of the virtual graph.
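Returning to the “FROM” mechanism mentioned at the start of this subsection: a standard engine evaluates a query such as the following (the file URL is invented for illustration) by first downloading the named RDF file in its entirety, which is exactly the bulk-transfer behaviour that SHARE is designed to avoid.

PREFIX : <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?structure
FROM <http://example.org/proteins.rdf>
WHERE {
    uniprot:P68871 :has3DStructure ?structure .
}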
2.4.2  Resolution Process of an Example SHARE Query

1: PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
2: PREFIX uniprot: <http://lsrn.org/UniProt:>
3:
4: SELECT *
5: WHERE {
6:     uniprot:P47989 sadi:isEncodedBy ?gene .
7:     ?gene sadi:isParticipantIn ?pathway .
8: }

Listing 2.1: An example SHARE query which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?”

To illustrate the process by which SHARE retrieves and assembles a dataset for a SPARQL query, we will use an example. Consider the SHARE query shown in listing 2.1, which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?” The PREFIX lines 1-2 define abbreviations for URIs that make the query easier to read. The “SELECT *” line indicates that all variables mentioned in the query (?gene and ?pathway) should be included in the table of solutions. Lines 5-8 define the WHERE clause, which specifies the search criteria for the query. Here, the WHERE clause consists of two triple patterns (lines 6 and 7), which together constitute a single basic graph pattern. The criteria specified by each of the triple patterns are combined with the logical AND operation. Thus, each solution for the query of listing 2.1 is a combination of values for ?gene and ?pathway such that:

1. Xanthine dehydrogenase/oxidase (uniprot:P47989) is encoded by ?gene, AND
2. ?gene is a participant in ?pathway6

SHARE builds a dataset for a basic graph pattern by retrieving data for each triple pattern in sequence. Each triple pattern is mapped to a set of service requests, and the output data from those requests is used to determine bindings for any unbound variables that occur in the pattern. To illustrate, we will consider the resolution of the example query in listing 2.1. SHARE resolves the first triple pattern, (uniprot:P47989, sadi:isEncodedBy, ?gene), by the following steps:

1. Matching services are identified. Matching services are services that output triples of the form (<input URI>, sadi:isEncodedBy, *), where <input URI> is an input URI in the service request, and * denotes any possible value. Such services are said to “attach” sadi:isEncodedBy to their input URIs, or (equivalently) to have sadi:isEncodedBy as one of their generated properties.

2. Matching services are invoked. In this case, each service request contains a single input URI: uniprot:P47989. It is possible to invoke services with more than one input URI; this would be necessary if the subject of the triple pattern were a variable with multiple bindings, rather than a constant URI.
3. The output RDF from the services is loaded into the local triple store.

4. Bindings are assigned to any unbound variables in the current triple pattern. SHARE queries the local triple store for triples of the form (uniprot:P47989, sadi:isEncodedBy, *) that were loaded in Item 3. The full set of distinct values for “*” becomes the set of bindings for the variable ?gene.

After resolving the first triple pattern, SHARE will proceed to gather data for the second triple pattern, (?gene, sadi:isParticipantIn, ?pathway), in a similar manner. However, for the second pattern, the subject is a bound variable (?gene) rather than a constant URI (uniprot:P47989). The bindings for ?gene, as determined by the resolution of the first triple pattern, become the input URIs for the service requests that gather data for the second triple pattern. After performing the service calls for the two patterns, SHARE has aggregated a dataset that is sufficient to answer the user’s query. The original query is then issued against the local triple store using a standard SPARQL query engine, and the solutions are displayed.

2.4.3  Pseudocode for SHARE Query Resolution

In the example query of the preceding section, all triple patterns were resolved by using the bindings (or constant values) of the subjects as input URIs for the service invocations. We will refer to this as resolving the triple pattern in the forward direction (left-to-right). SHARE is also capable of resolving triple patterns in the reverse direction (right-to-left). Given a triple pattern (s, p, o), the system decides which direction to use according to the following rules:

Case 1: s is a bound variable or a URI. The triple pattern is resolved in the forward direction. The value(s) for s are used as input URIs to all services that generate property p; if o is an unbound variable, then the set of retrieved values for property p become the bindings of o.

Case 2: s is an unbound variable and o is a bound variable or a URI. The triple pattern is resolved in the reverse direction. The value(s) for o are used as input URIs to all services that generate inverse(p), where inverse(p) is the OWL inverse of property p. The set of retrieved values for inverse(p) become the bindings of s.

6 Readers may observe that the second pattern is not strictly correct; the participants of metabolic pathways are proteins and metabolites rather than genes. This is an artifact of the way data is modelled in the KEGG database. In KEGG, the assertion that a gene participates in a metabolic pathway means that the gene codes for at least one enzyme that participates in the pathway.

The pseudocode for the data gathering procedure of SHARE is shown in listing 2.2. Note that the above two cases are the only types of triple patterns that SHARE is capable of resolving. In particular, it is not possible for the system to resolve a triple pattern in which the predicate is a variable, or a pattern in which both the subject and the object are unbound variables. The former case is unresolvable because the predicate URI is the basis for discovering relevant services, and the latter case is unresolvable because there is no input to send to the services in order to generate data. Solutions for these limitations will be a subject of future work.
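Reverse resolution presupposes that the OWL inverse of a predicate has been declared somewhere SHARE can see it. A minimal sketch of such a declaration in N3, assuming for illustration that the properties ontology pairs isEncodedBy with an encodes property:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix sadi: <http://sadiframework.org/ontologies/properties.owl#> .

# With this axiom, a pattern such as (?protein, sadi:isEncodedBy, g),
# where only the gene g is bound, can be resolved by invoking the
# services that generate sadi:encodes, using g as the input URI.
sadi:isEncodedBy  owl:inverseOf  sadi:encodes .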
2.4.4  Validating Inputs for Services

One aspect of the data gathering process that remains to be explained is the manner in which SHARE confirms that a URI is a valid input for a service. (This check occurs when getValidInputs is called inside of invokeServicesByProperty, in listing 2.2.) In the case of SADI services, a URI is a valid input if and only if it is an instance of the service’s input OWL class, and thus SHARE must use an OWL reasoner to validate an input URI. However, standard OWL reasoners [98–100] are currently limited to reasoning about locally stored data, just as SPARQL engines are limited to querying local data. Thus, SHARE must perform an additional data-gathering procedure to collect properties that are relevant to an OWL class definition. For example, suppose that the following query were issued to SHARE:

SELECT ?homolog
WHERE {
    uniprot:P68871 :hasHomolog ?homolog .
}

Further, suppose that the only hasHomolog-generating service available is the BLAST service shown in Fig. 2.4. This service requires that every input URI belongs to the class ProteinWithSequence, and a URI is an instance of ProteinWithSequence if and only if it has at least one hasAminoAcidSequence property. However, at the time the triple pattern (uniprot:P68871, :hasHomolog, ?homolog) is resolved by SHARE, no attempt has been made to gather hasAminoAcidSequence properties for uniprot:P68871. SHARE addresses this problem with a secondary, OWL-based data-gathering procedure. This procedure automatically decomposes the definition of the input OWL class ProteinWithSequence, determines that the hasAminoAcidSequence property is required for uniprot:P68871, and invokes all services that generate hasAminoAcidSequence with uniprot:P68871 as input. The hasAminoAcidSequence-generating services have their own input OWL classes, which may themselves require additional data; in this manner, the data-gathering process may proceed recursively through an arbitrarily long chain of dependent services. A base case is reached when a service is discovered that consumes bare URIs. In theory, such services would have “owl:Thing” as their input OWL class. (owl:Thing is a special, predefined OWL class that contains all URIs.) In practice, SHARE contains rules to assign bare URIs to OWL classes representing different types of database records, based on their syntax. For example, uniprot:P68871 is represented as http://lsrn.org/UniProt:P68871 and is assigned to the OWL class http://purl.oclc.org/SADI/LSRN/UniProt_Record. The URLs for SHARE are based on the LSRN (Life Sciences Resource Name) URL scheme [153].

The pseudocode for the OWL-based data gathering procedure is shown in listing 2.3. The main work is performed by the gatherDataForTypePattern procedure; the name for this procedure comes from the convention that instances of an OWL class are indicated by triples of the form (<subject URI>, rdf:type, <OWL class URI>). gatherDataForTypePattern is recursive in two ways. First, calling services to gather new data requires resolving the input OWL classes of those services, which may in turn require gathering additional data (as described in the previous paragraph). Second, OWL class definitions may themselves be nested. For example, an AnnotatedProtein might be defined as any URI that has at least one hasAnnotation property with a value (i.e. object URI) from the class GOTerm. (We call this a “class restriction” in the pseudocode of listing 2.3; in OWL, it is also called a “someValuesFrom” restriction.) In this case, the OWL data gathering procedure will not only have to retrieve hasAnnotation properties for the subject URI, but also properties for the values of hasAnnotation that are relevant to the definition of GOTerm.
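A hedged N3 rendering of that nested definition (the namespace and both class names are invented for illustration):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/ontology#> .

# A nested class definition: membership requires a hasAnnotation
# value, and that value must itself be an instance of GOTerm.
ex:AnnotatedProtein  a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty ex:hasAnnotation ;
        owl:someValuesFrom ex:GOTerm
    ] .

To validate an input URI against this class, gatherDataForTypePattern must first gather hasAnnotation triples for the URI, and then recurse into the definition of ex:GOTerm for each retrieved value.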
resolve(query)
    bindings = empty hash table    // initialize variable bindings
    for each triple pattern t in query
        if (t.subject is not a variable)
            bindings[t.subject] = singletonSet(t.subject)
        if (t.predicate is a variable)
            error "cannot resolve pattern where predicate is a variable"
        if (t.object is not a variable)
            bindings[t.object] = singletonSet(t.object)

    // gather data
    for each triple pattern (s, p, o) in query
        processPattern(s, p, o, bindings)

    // hand over the gathered data to a standard SPARQL engine
    return SPARQLEngine.query(query, localTripleStore)

processPattern(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else
        // both subject and object are already bound
        invokeServicesByProperty(subjects, p)

    // record bindings for any variables that were unbound
    // (a null binding set acts as a wildcard when matching)
    if (subjects == null)
        bindings[s] = empty set
    if (objects == null)
        bindings[o] = empty set
    for each triple t in localTripleStore.getMatchingTriples(subjects, p, objects)
        if (subjects == null)
            bindings[s].add(t.subject)
        if (objects == null)
            bindings[o].add(t.object)

invokeServicesByProperty(inputURIs, property)
    services = empty set
    services.add(SADIRegistry.getServicesByProperty(property))
    services.add(SPARQLRegistry.getServicesByProperty(property))
    for each service s in services
        validInputURIs = getValidInputs(inputURIs, s)
        resultTriples = s.invoke(validInputURIs)
        localTripleStore.add(resultTriples)

Listing 2.2: Pseudocode for the main data gathering procedure of SHARE. resolve is the top-level method that answers a SHARE query. resolve invokes processPattern for each triple pattern in the query, in order to find solutions for unbound variables. invokeServicesByProperty is a helper procedure for processPattern that invokes services for a particular generated property and set of input URIs.

getValidInputs(inputURIs, service)
    validInputURIs = empty set
    if service is a SPARQL endpoint
        for each URI in inputURIs
            if (service.matchesRegEx(URI))
                validInputURIs.add(URI)
    else
        // service is a SADI service
        inputClass = service.getInputClass()
        // gather data that is relevant to the
        // definition of inputClass
        s = generateNewVariableName()
        bindings[s] = inputURIs
        gatherDataForTypePattern(s, inputClass, bindings)
        // check each URI against the inputClass
        for each URI in inputURIs
            if (OWLReasoner.isInstance(URI, inputClass))
                validInputURIs.add(URI)
    return validInputURIs

gatherDataForTypePattern(s, inputClass, bindings)
    // OWL classes may have explicitly declared
    // parent classes and equivalent classes
    for each parent class P of inputClass
        gatherDataForTypePattern(s, P, bindings)
    for each equivalent class E of inputClass
        gatherDataForTypePattern(s, E, bindings)

    // OWL classes may be constructed using set operators
    if inputClass is a union of classes
        for each class C in the union
            gatherDataForTypePattern(s, C, bindings)
    else if inputClass is an intersection of classes
        for each class C in the intersection
            gatherDataForTypePattern(s, C, bindings)
    else if inputClass is the complement of class C
        gatherDataForTypePattern(s, C, bindings)
    else
        // Base case: inputClass is defined as a set of property restrictions
        for each property restriction r of inputClass
            p = r.getProperty()
            o = generateNewVariableName()
            if r is a value restriction
                bindings[o] = set(r.getValue())
                processPattern(s, p, o, bindings)
            else if r is a class restriction
                bindings[o] = null
                processPattern(s, p, o, bindings)
                gatherDataForTypePattern(o, r.getClass(), bindings)
            else
                // r is a cardinality restriction
                bindings[o] = null
                processPattern(s, p, o, bindings)

Listing 2.3: Pseudocode for validating inputs to services. gatherDataForTypePattern is a helper method for getValidInputs which gathers all available data that is related to an OWL class definition. generateNewVariableName generates a new variable name that has not been used in the user’s query or by previous temporary variables.

In addition to validating inputs for services, gatherDataForTypePattern is also used by SHARE to directly find instances of OWL classes from within a query. For example, the query

SELECT ?kinase
WHERE {
    ?kinase rdf:type myOntology:Kinase .
}

will identify instances of the class Kinase, provided that the property restrictions used to define myOntology:Kinase map to SADI services.

2.5  SHARE Online Demonstration

The SHARE demonstration site allows users to issue queries against a sample set of SADI services and SPARQL endpoints, and may be accessed at http://biordf.net/cardioSHARE/query. The main page provides a form for entering SPARQL queries, and a list of example queries is provided at http://biordf.net/cardioSHARE/queries.html. Fig. 2.9 shows the result of executing the SPARQL query from listing 2.1.

Figure 2.9: A screenshot of the SHARE demonstration site, after running the query from listing 2.1, which asks: “What biological pathways does Xanthine dehydrogenase/oxidase participate in?”

The SPARQL endpoints provided by the Bio2RDF [90] and Linked Open Drug Data (LODD) [91] projects are also queriable by the system. Bio2RDF provides SPARQL mirrors for popular databases such as UniProt [11], KEGG [117], and Pfam [154], while LODD provides endpoints for drug-related resources such as Drugbank [21] and SIDER [93].
At the current time, the generated properties of these endpoints are not incorporated in the predicate diagram of Fig. 2.5; however, it is feasible to generate a complete diagram automatically from the SPARQL and SADI service registries, with further work.

2.6  Source Code

The source code for the SHARE query engine may be accessed at http://sadi.googlecode.com, under the sadi.share folder (i.e. http://sadi.googlecode.com/svn/trunk/sadi.share). The real methods corresponding to the pseudocode methods of listings 2.2 and 2.3 may be found within the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase), as per Table 2.3. The source code for the demonstration servlet and website is available separately from http://cardioshare.googlecode.com.

Pseudocode Method              Real Method
resolve                        executeQuery
processPattern                 processPattern
invokeServicesByProperty       gatherTriples
getValidInputs                 filterByInputClass
gatherDataForTypePattern       processTypePattern

Table 2.3: The methods in the SHARE source code which correspond to the pseudocode methods of listings 2.2 and 2.3. All relevant methods are members of the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase), which is located within the sadi.share folder at http://sadi.googlecode.com.

2.7  Discussion

At the current time, XML is the de facto standard for encoding data files and for interacting with web services. Thus, the main caveat of SHARE as an RDF/OWL-based system is that it is not compatible with the majority of existing data and services on the web. However, the RDF/OWL data model has significant advantages over XML, with respect to the implementation and maintenance of a mediator system. In particular, it is possible to automate several tasks that traditionally require hand-coded schema mappings:
While several basic RDF voculabaries such as the Dublin Core Metadata Initiative (document metadata) [155], SKOS (taxonomies) [156], and FOAF (relationships between people) [157] have now gained widespread acceptance, it remains to be seen whether similar standardization can be achieved in more complex domains such as bioinformatics. integrating output data from multiple services During SHARE query execution, the output of each SADI service is automatically integrated into the local triple store by an RDF-merge operation. In contrast, integrating the results of an XML-based service into a relational database requires a hand-coded mapping of the output XML to the database schema. matching service input/output datatypes Under the current paradigm of XML-based web services, connecting the output of a service A to the input of a service B requires that the output XML schema of A is equal to the input XML schema of B. However, there are many scenarios where the schema are not equal, despite the fact that the two schemas describe the same type of entity, and the connection itself would be logical. Consider a scenario in which a bioinformatician wants to connect the output of an XML-based BLAST service to the input of an XML-based multiple sequence alignment (MSA) service. The BLAST output is a collection of sequences and the MSA input is likewise a collection of sequences. However, the precise information provided/required about each sequence may differ between the two services. For example, the MSA might require the source organism of each input sequence in order to improve the quality of the alignment, while this information might not be included in the output of the BLAST service. In this case, the two services are not directly compatible, and the bioinformatician must construct an extra “shim” [128] step to translate between the two schemas. This type of problem arises because XML schema are atomic, and there is no mechanism by which they can be automatically decomposed or merged. In contrast, SHARE’s use of the RDF data model allows it to automatically assemble input datatypes from individual properties, as is done in the gatherDataForTypePattern method of listing 2.3. There are a number of areas where future work will be needed, in order to make SHARE a practical tool for bioinformaticians. First, there will need to be a mechanism to inspect the precise chain of sources and assertions that have led to each query solution (provenance tracking). This functionality is particularly important given the ability of any third party to add services to the system, and thus the potential for sabotage by services that make false or non-sensical assertions. Second, there will need to be controls for data source selection, algorithm selection, and algorithm parameters (e.g. gap penalty). It is difficult to incorporate such controls into the query language itself, as it would destroy the abstraction of property-based service discovery, and greatly would complicate the query interface. However, SHARE could in principle be used 37  2.7. Discussion to generate a template workflow (e.g. a SCUFL [158] file) to be edited in workflow software such as Taverna [123]. Although not fully automated, this approach would eliminate a large part of the manual work involved in identifying and wiring together relevant services for an analysis.  
Chapter 3

An Adaptive Query Evaluation Algorithm for SHARE

3.1  Introduction

One of the main challenges in implementing a distributed query system is determining efficient execution plans for queries. The database operations that are used to answer a query (e.g. selections, projections, joins) can often be rearranged into a large number of logically equivalent plans, and these plans can differ greatly in their overall efficiency. The same problem is encountered when evaluating queries against standalone relational databases, and has been studied extensively in that context (for an overview of the standard approach to query optimization, see [159] and [160]). However, in a distributed scenario, the optimization problem is more difficult for a number of reasons. First, estimating the cost of a query plan is more complex. Whereas traditional query optimization is based on minimizing the number of I/O operations (i.e. hard drive reads), the time required to process requests at remote servers and the time required to transfer data across the network are also significant factors for distributed systems. Another challenge is that the query system may have little or no statistics about the data that is available from remote sources, whereas in standalone databases such statistics are essential for constructing efficient plans. In addition, for many mediator systems such as SHARE, the data sources may be software services (e.g. BLAST) for which there is no standard means of computing statistics.

In this chapter we describe an algorithm called GREEDY that is used by SHARE to evaluate distributed SPARQL queries in an efficient manner. More specifically, the goal of GREEDY is to optimize the ordering of joins across distributed RDF data services. GREEDY differs from the traditional approach to query optimization because the full execution plan for a query is not determined prior to running the query. Instead, GREEDY performs one join at a time, and uses the cardinalities of previously completed joins to decide which join to perform next. Thus, GREEDY is an adaptive optimization algorithm. A further difference between GREEDY and the standard approach to query optimization is that statistics about the data are learned during the execution of queries; these statistics are then used to help the optimizer plan future queries.

We begin by briefly discussing related work in Section 3.2. Next, we review the procedure used by SHARE to resolve queries in Section 3.3, in order to provide context for the optimization problem. In Section 3.4, we present the optimization problem by means of an example. In Section 3.5, we describe a simple optimization relating to the assignment of variable bindings, and in Section 3.6 we present the main optimization algorithm, GREEDY. We evaluate the performance of GREEDY using a small set of bioinformatics queries from the literature in Section 3.7, and we discuss areas for future work in Section 3.8. Finally, we summarize the strengths and weaknesses of GREEDY in Section 3.9 and provide pointers to the relevant source code in Section 3.10.

3.2  Related Work

GREEDY is an extension of an idea published in [161], in "Section 7: Adaptive query execution using Prim's algorithm". A key difference between the algorithm described here and the original pseudocode is that multiple inputs to the same service are batched into a single request (strictly speaking, this is only true for SADI services; there is no mechanism for batching queries to SPARQL endpoints). This simplifies the pseudocode for the algorithm, and is also more efficient.
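The effect of batching can be illustrated with a small sketch. The types and method names here are hypothetical stand-ins for the actual SADI client machinery: instead of paying one round trip per input URI, all inputs destined for the same service are carried in a single request.

    import java.util.List;

    public class BatchedInvocation {
        // Hypothetical stand-in for a SADI service client; one call corresponds
        // to one HTTP request carrying an RDF document describing the inputs.
        interface Service {
            void invoke(List<String> inputUris);
        }

        // Unbatched: |inputUris| round trips, paying network latency each time.
        static void invokeOneByOne(Service service, List<String> inputUris) {
            for (String uri : inputUris) {
                service.invoke(List.of(uri));
            }
        }

        // Batched (the approach used by GREEDY): a single request for all inputs.
        static void invokeBatched(Service service, List<String> inputUris) {
            service.invoke(inputUris);
        }
    }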
Another improvement over the original paper is that specific methods have been developed for gathering and computing predicate statistics.

Much work has been done on distributed query processing in the context of relational databases; for a good overview, see [162]. Some of the main techniques used in distributed query processing are dynamic programming (for enumeration of query plans), row blocking, and semi-joins. Applying the dynamic programming approach to a mediator system requires that the wrappers expose methods for estimating the costs of queries to the individual sources. In addition, the enumeration of plans is typically exhaustive, and thus quickly becomes expensive for complex queries. For these reasons, the dynamic programming approach was deemed unsuitable for SHARE, and the GREEDY algorithm was developed instead. The technique of row blocking simply involves sending the rows of a table across a network in blocks, rather than individually, when performing a join across sites. In SHARE, this roughly corresponds to the batching of inputs to SADI services. The semi-join is a technique for reducing data transfer when joining two tables at different sites. Rather than transferring an entire table from one site to another, a semi-join first sends a reduced version of Table A to Site B, containing only the join columns. Site B then performs a join against the reduced table to determine the matching rows in Table B. Finally, the matching rows of Table B are transferred to Site A, and a join is performed between Table A and the matching rows of Table B. SHARE uses a technique similar to a semi-join when performing joins, where the values of the join column correspond to the bindings of a SPARQL variable.
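As a rough sketch of this semi-join-like technique, the following hypothetical code builds a SPARQL query that ships only the join column, i.e. the current bindings of a variable, to a remote endpoint. The SPARQL 1.1 VALUES clause is used purely for illustration; this is not SHARE's actual mechanism, and the endpoint predicate and binding shown are examples only.

    import java.util.List;

    public class SemiJoinSketch {
        // Transfer only the join column (the variable's bindings) to the remote
        // site, rather than an entire intermediate table.
        static String buildSemiJoinQuery(String predicateUri, List<String> bindings) {
            StringBuilder values = new StringBuilder();
            for (String uri : bindings) {
                values.append("<").append(uri).append("> ");
            }
            return "SELECT ?s ?o WHERE { ?s <" + predicateUri + "> ?o . "
                 + "VALUES ?s { " + values + "} }";
        }

        public static void main(String[] args) {
            // Hypothetical binding list for the join variable.
            String query = buildSemiJoinQuery(
                "http://purl.uniprot.org/core/organism",
                List.of("http://bio2rdf.org/uniprot:P01344"));
            System.out.println(query);
        }
    }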
Work has also been done on distributed query optimization in the context of RDF and SPARQL specifically. Other distributed SPARQL systems that have been described in the literature are DARQ [152], SemWIQ [163], and the Distributed SAIL Extension [164] for Sesame [149]. The DARQ paper describes a procedure for query planning that is based on the average selectivities of predicates when resolved in either the forward (bound subject) or reverse (bound object) directions. An important caveat of the DARQ approach is that the selectivity values for the various predicates are obtained from "service description" files that must be constructed by the user in advance. Another important difference between GREEDY and the DARQ optimizer is that GREEDY is adaptive, whereas DARQ follows the more traditional approach of static query optimization. DARQ implements an additional type of optimization in the form of query rewriting rules, which perform transformations such as the merging of basic graph patterns and the pushing of FILTER expressions into subqueries; this type of optimization is currently not implemented in SHARE. For Sesame, [165] describes a join procedure that makes use of semi-joins and row blocking. Query optimization is a subject of future work for the SemWIQ system.

The GREEDY algorithm differs from most query optimizers in that the processes for planning and executing the query are interleaved. In the literature, this type of approach is termed adaptive query execution. Two prior examples of adaptive query systems are Tukwila [166] and Telegraph [167], both of which are relational systems. Tukwila uses traditional System R-style optimization, but additionally allows for the creation of partial plans when adequate statistics are not available, and for replanning when cardinality estimates prove to be inaccurate. Telegraph is a more radical departure from traditional query processing. It replaces the standard tree-of-operators representation of a query plan with a circular structure called an eddy that connects all of the operators. The idea of the eddy is to dynamically change the routing of the tuples between the operators in response to the run-time performance of the operators.

3.3  Key Points of SHARE Query Resolution

3.3.1  Resolving Triple Patterns

In Section 2.4, we described the procedure that SHARE uses to answer a SPARQL query over a distributed set of services; we summarize the key points of the procedure here. SHARE answers a query by resolving each triple pattern of the query in sequence. A triple pattern (s, p, o) may be resolved in either the forward direction or the reverse direction:

• If a triple pattern is resolved in the forward direction, all services that generate p are invoked using the bindings of s as the input URI(s), and the resulting output data is loaded into the local triple store. A service generates p if it outputs triples of the form (<input URI>, p, *), where * represents any URI or literal value. If o is an unbound variable, the set of distinct values returned for * become the bindings of o.

• If a triple pattern is resolved in the reverse direction, all services that generate inverse(p) are invoked using the bindings of o as the input URI(s), and the resulting output data is loaded into the local triple store. inverse(p) represents an inverse property of p, meaning that statements of the form (X, p, Y) and (Y, inverse(p), X) are logically equivalent. (By convention, the inverse properties of p, if any exist, are defined in the OWL ontology that defines p.) A service generates inverse(p) if it outputs triples of the form (<input URI>, inverse(p), *). If s is an unbound variable, the set of distinct values returned for * become the bindings of s.

Once all of the triple patterns have been resolved, the local triple store will contain a sufficient set of RDF triples to produce the solutions for the query. As the final step, the user's original query is issued against the local triple store and the solutions are displayed.

3.3.2  OWL Inferences About Predicates

SHARE implicitly makes use of OWL reasoning during query resolution. Whenever a new predicate or OWL class is encountered within a triple pattern, the query engine downloads the OWL ontology file for that predicate/class and loads it into the local reasoner. (By convention, the OWL ontology for a given predicate/class may be downloaded by resolving its URI.) One advantage of using OWL reasoning during query processing is that the mapping from predicates to services is more intelligent. The system is aware of which predicates are synonyms, inverses, or connected by parent/child relationships (i.e. subproperties), and can perform the necessary translations between related predicates implicitly. For example, if two predicates that describe the relationship between a drug and a target (e.g. "hasTarget" and "hasDrugTarget") are equivalent, the user may use either predicate within a query and be certain that all of the relevant services will be discovered and invoked.
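As an illustration of how such predicate relationships can be discovered, the sketch below resolves a hypothetical predicate URI, loads the resulting ontology with Jena, and lists any declared synonyms (owl:equivalentProperty) and inverses (owl:inverseOf). In SHARE these relationships are handled implicitly by the reasoner; the sketch merely enumerates the declarations.

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.OWL;

    public class PredicateRelations {
        public static void main(String[] args) {
            String predicateUri = "http://example.org/hasTarget"; // hypothetical
            Model ontology = ModelFactory.createDefaultModel();
            // By convention, the predicate URI resolves to its defining ontology.
            ontology.read(predicateUri);

            Resource p = ontology.createResource(predicateUri);
            StmtIterator synonyms =
                ontology.listStatements(p, OWL.equivalentProperty, (RDFNode) null);
            while (synonyms.hasNext()) {
                System.out.println("synonym: " + synonyms.next().getObject());
            }
            StmtIterator inverses =
                ontology.listStatements(p, OWL.inverseOf, (RDFNode) null);
            while (inverses.hasNext()) {
                System.out.println("inverse: " + inverses.next().getObject());
            }
        }
    }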
However, this additional intelligence comes at a cost. First, downloading ontology files during query resolution entails additional data transfer, and thus additional time when resolving triple patterns. Second, the computation of inferences can be expensive, depending on the number of ontology rules (e.g. synonym relationships between predicates) and the amount of data in the local triple store. For the evaluation of the optimizer in Section 3.7, the author has disabled OWL inferencing in SHARE. This allows the data retrieval aspect of the optimization problem to be studied in isolation.

3.4  The Distributed SPARQL Query Optimization Problem

3.4.1  An Illustrative Example

The ordering of triple patterns in a query does not affect the logical meaning of the query. For example, Fig. 3.1a and Fig. 3.2a show two equivalent SPARQL queries with different orderings; both versions ask the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins" [168]. While the ordering of triple patterns does not change the meaning, it does change the steps that SHARE uses to gather the required data, and thus it can have a significant effect on query efficiency. To illustrate, Fig. 3.1b and Fig. 3.2b depict the alternate steps used by SHARE to resolve the two orderings of the guppy query. The first ordering (Fig. 3.1b) is a bad strategy for resolving the query because there is a "fan-out" effect in Steps 1 and 2: there are a large number of PROSITE motifs (1,456), and each of those motifs is found in a large number of proteins. As a result, the query engine sends 166,986 protein URIs to the services that map proteins to organisms in Step 3, and the query continues running for more than an hour. (After an hour, the query was manually aborted.) In contrast, the second ordering of the query (Fig. 3.2b) begins by retrieving the 48 proteins that belong to Poecilia reticulata. Only a small amount of data is gathered (10,480 triples) to answer the query, and the query completes successfully in about 30 seconds.

Step 1: Retrieve a list of all PROSITE motifs
  Triple pattern: (?motif, rdf:type, prosite:Site)
  Time: 2,749 ms
  Variable bindings assigned: ?motif ⇒ 1,456 bindings

Step 2: For each motif from Step 1, retrieve all proteins that contain that motif
  Triple pattern: (?motif, bio2rdf:xUniProt, ?protein)
  Time: 170,639 ms
  Variable bindings assigned: ?protein ⇒ 166,986 bindings

Step 3: For each protein from Step 2, retrieve the organism it belongs to
  Triple pattern: (?protein, core:organism, taxonomy:8081)
  Time: > 1 hour (query aborted)
  Variable bindings assigned: N/A

Table 3.1: Description of execution steps for an inefficient query ordering, as depicted in Fig. 3.1b.

PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX taxonomy: <http://bio2rdf.org/taxonomy:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX prosite: <http://bio2rdf.org/ns/prosite#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?motif
WHERE {
  ?motif rdf:type prosite:Site .
  ?motif bio2rdf:xUniProt ?protein .
  ?protein core:organism taxonomy:8081 .
}
(a) SPARQL query

(b) execution plan as graph traversal

Figure 3.1: Example of an inefficient ordering of triple patterns for a query, along with its associated execution plan. A textual description of each step in the execution plan is provided in Table 3.1. The query asks the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins."

PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX taxonomy: <http://bio2rdf.org/taxonomy:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX prosite: <http://bio2rdf.org/ns/prosite#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?motif
WHERE {
  ?protein core:organism taxonomy:8081 .
  ?motif bio2rdf:xUniProt ?protein .
  ?motif rdf:type prosite:Site .
}

(a) SPARQL query

(b) execution plan as graph traversal

Figure 3.2: Example of an efficient ordering of triple patterns for a query, along with its associated execution plan. A textual description of each step in the execution plan is provided in Table 3.2. The query asks the system to "List all motifs that occur in Poecilia reticulata (guppy fish) proteins."

Step 1: Retrieve all proteins that belong to Poecilia reticulata
  Triple pattern: (?protein, core:organism, taxonomy:8081)
  Time: 615 ms
  Variable bindings assigned: ?protein ⇒ 48 bindings

Step 2: Retrieve the motifs contained by each protein from Step 1
  Triple pattern: (?motif, bio2rdf:xUniProt, ?protein)
  Time: 9,112 ms
  Variable bindings assigned: ?motif ⇒ 8 bindings

Step 3: For each motif from Step 2, retrieve its type to determine if it is a PROSITE motif (this step is necessary because the bio2rdf:xUniProt predicate is also used to connect genes to UniProt proteins)
  Triple pattern: (?motif, rdf:type, prosite:Site)
  Time: 19,099 ms
  Variable bindings assigned: N/A

Table 3.2: Description of query execution steps for an efficient query ordering, as depicted in Fig. 3.2b.

3.4.2  Assumptions

The goal of the optimization algorithm is to determine the best ordering for the triple patterns in a distributed SPARQL query; that is, the ordering which will answer the query in the shortest amount of time. To simplify the problem, we assume that the execution time of a query is determined solely by:

1. The time spent by services processing requests
2. The time spent transferring data (service requests and responses) over the network

The use of OWL inferencing in SHARE can add a significant amount of local processing time to query resolution. However, as the optimization of OWL reasoning is a challenging problem in its own right, for the purposes of this chapter we disable OWL reasoning in SHARE and assume that the cost of local query processing is zero. In addition, we assume that the following constraints hold for resolving triple patterns:

1. Variables may not appear in the predicate positions of triple patterns.
2. Triple patterns may only be resolved when either the subject or the object is bound.
These constraints are not unique to SHARE; they also apply to the earlier DARQ [152] system, which uses the same predicate-based method for mapping triple patterns to services. Under constraints 1 and 2 above, the resolution of a distributed SPARQL query may be visualized as a graph traversal, as depicted in Fig. 3.1b. Each constant in the query acts as an independent starting point for one branch of the traversal, and the traversal is complete when all edges have been visited. The traversal of an edge is in effect a distributed join operation, where the outer table is a list of URIs (variable bindings), and the inner table is the set of triples that are "stored" by the relevant services. (The word "stored" is quoted here because triples may be generated computationally by the relevant services; conceptually, the set of triples "stored" by a computational service is the full set of triples generated by invoking the service with all possible input URIs.) Thus, the optimization problem discussed in this chapter is a distributed version of the well-known join-order problem [159] [160].

3.5  Secondary Optimization: Intersections of Variable Bindings

This section describes an optimization for the assignment of variable bindings that is independent of the query evaluation algorithm (GREEDY) presented in Section 3.6. The optimization is implemented by a small change to the processPattern method from Chapter 2 (listing 2.2), as shown in Fig. 3.3. In the original method, the bindings for a variable are determined when the first pattern that references the variable is resolved; thereafter, the bindings of the variable are fixed. In the optimized method, the bindings that are determined for a variable are intersected with any existing bindings.

To demonstrate the difference, consider answering the query shown in Fig. 3.4, which asks "What proteins in the post-synaptic membrane are targeted by the drug Benzthiazide?", using the original processPattern method. Here the variable of interest is ?protein, which occurs in three different patterns (patterns 1, 3, and 4). When pattern 1 is resolved, ?protein is assigned 965 proteins that are located in the post-synaptic membrane. When pattern 2 is resolved, ?target is assigned 5 drug target records for Benzthiazide. The critical step occurs at pattern 3. When pattern 3 is resolved (we assume here that it is resolved in the forward direction; the reverse direction is also possible, but makes no difference for the purposes of this discussion), the 5 bindings for ?target (drug target records) are mapped to corresponding UniProt IDs. However, because ?protein has already been assigned a set of bindings in pattern 1, the 965 existing bindings for ?protein are left unmodified. As a result, resolving pattern 4 requires retrieving the names (rdfs:label) of 965 proteins, which takes ∼2 minutes. The query time can be significantly reduced if the system computes the intersection of the bindings for ?protein from patterns 1 and 3. Using this approach, ?protein is reduced to a single binding, and thus pattern 4 is resolved with 1 input. Fig. 3.5a and Fig. 3.5b depict the execution of the query without and with the variable bindings optimization, respectively.
processPattern(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else // both subject and object are already bound
        invokeServicesByProperty(subjects, p)
    for each triple t in localTripleStore.getMatchingTriples(bindings[s], bindings[p], bindings[o])
        if (subjects == null)
            bindings[s].add(t.subject)
        if (objects == null)
            bindings[o].add(t.object)

(a) original processPattern method

processPatternOptimized(s, p, o, bindings)
    subjects = bindings[s]
    objects = bindings[o]
    if (subjects == null && objects == null)
        warn "cannot resolve pattern where both subject and object are unbound variables"
    else if (objects == null)
        invokeServicesByProperty(subjects, p)
    else if (subjects == null)
        invokeServicesByProperty(objects, inverse(p))
    else // both subject and object are already bound
        invokeServicesByProperty(subjects, p)
    newBindings = empty hash table
    for each triple t in localTripleStore.getMatchingTriples(bindings[s], bindings[p], bindings[o])
        newBindings[s].add(t.subject)
        newBindings[o].add(t.object)
    if (subjects == null)
        bindings[s] = newBindings[s]
    else
        bindings[s] = intersection(bindings[s], newBindings[s])
    if (objects == null)
        bindings[o] = newBindings[o]
    else
        bindings[o] = intersection(bindings[o], newBindings[o])

(b) optimized processPattern method

Figure 3.3: A modification to the processPattern method from Chapter 2 (listing 2.2) which implements the variable bindings optimization described in Section 3.5. Part (a) shows the original pseudocode and part (b) shows the modified pseudocode.

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX go: <http://bio2rdf.org/go:>

SELECT ?proteinName
WHERE {
  ?protein core:classifiedWith go:0045211 .
  drug:DB00562 drugbank:target ?target .
  ?target drugbank:swissprotId ?protein .
  ?protein rdfs:label ?proteinName .
}

(a) SPARQL query

(b) SPARQL query, depicted graphically

Figure 3.4: An example query for SHARE, used to illustrate the variable bindings optimization described in Section 3.5.
The query asks "What proteins in the post-synaptic membrane are targeted by the drug Benzthiazide?". go:0045211 is the Gene Ontology term for "post synaptic membrane", and drug:DB00562 is the identifier for Benzthiazide.

(a) execution plan without the variable bindings optimization

(b) execution plan using the variable bindings optimization

Figure 3.5: The execution plans used by SHARE to solve the query from Fig. 3.4, without and with the variable bindings optimization described in Section 3.5. Textual descriptions of the two execution plans are given in Table 3.3 and Table 3.4, respectively.

Step 1: Retrieve all proteins that have been annotated with Gene Ontology term go:0045211 (post synaptic membrane)
  Triple pattern: (?protein, core:classifiedWith, go:0045211)
  Time: 2,949 ms
  Variable bindings assigned: ?protein ⇒ 965 bindings

Step 2: Retrieve all targets of the drug DB00562 (Benzthiazide)
  Triple pattern: (drug:DB00562, drugbank:target, ?target)
  Time: 3,896 ms
  Variable bindings assigned: ?target ⇒ 5 bindings

Step 3: Retrieve the equivalent UniProt identifier for each target from Step 2
  Triple pattern: (?target, drugbank:swissprotId, ?protein)
  Time: 2,936 ms
  Variable bindings assigned: none (?protein was already assigned a set of bindings in Step 1)

Step 4: For each protein from Step 1, retrieve its name
  Triple pattern: (?protein, rdfs:label, ?proteinName)
  Time: 128,363 ms
  Variable bindings assigned: ?proteinName ⇒ 965 bindings

Table 3.3: Description of execution steps without the variable bindings optimization, as depicted in Fig. 3.5a.

Step 1: Retrieve all proteins that have been annotated with Gene Ontology term go:0045211 (post synaptic membrane)
  Triple pattern: (?protein, core:classifiedWith, go:0045211)
  Time: 2,655 ms
  Variable bindings assigned: ?protein ⇒ 965 bindings

Step 2: Retrieve all targets of the drug DB00562 (Benzthiazide)
  Triple pattern: (drug:DB00562, drugbank:target, ?target)
  Time: 3,673 ms
  Variable bindings assigned: ?target ⇒ 5 bindings

Step 3: Retrieve the equivalent UniProt identifier for each target from Step 2
  Triple pattern: (?target, drugbank:swissprotId, ?protein)
  Time: 2,936 ms
  Variable bindings assigned: ?protein ⇒ 1 binding (this step is where the optimization takes place: resolving the pattern yields 5 distinct values for ?protein, which are intersected with the 965 existing bindings from Step 1; the two sets have one value in common)

Step 4: For each protein from Step 3 (i.e. each protein that is both a target of Benzthiazide and located in the post-synaptic membrane), retrieve its name
  Triple pattern: (?protein, rdfs:label, ?proteinName)
  Time: 12,020 ms
  Variable bindings assigned: ?proteinName ⇒ 1 binding

Table 3.4: Description of execution steps with the variable bindings optimization, as depicted in Fig. 3.5b.
3.6  The GREEDY Optimization Algorithm

3.6.1  Overview

In this section, we describe an algorithm for evaluating distributed SPARQL queries called GREEDY. In the default method of query evaluation for SHARE, which we will hereafter call BASIC, triple patterns are resolved in the same order that they occur in the query. In addition, BASIC resolves all triple patterns in the forward (left-to-right) direction by default, unless only the reverse direction is possible (i.e. the subject is an unbound variable). The pseudocode for the default query evaluation procedure is shown in listing 2.2.

GREEDY uses a more sophisticated approach for determining the ordering and directions of the triple patterns. At each step, the costs of all unvisited triple patterns are estimated and compared pairwise, and the pattern that is cheapest overall is selected for resolution. If a particular pattern is resolvable in both the forward and reverse directions (i.e. both the subject and object of the pattern are bound), then the cost of both directions is estimated and the overall cost of the pattern is taken to be the lesser of the two. The pseudocode for the main procedure of GREEDY is shown in listing 3.1.

GREEDY(queryPatterns)
    visitedPatterns = empty set
    while (visitedPatterns.size() < queryPatterns.size())
        bestPattern = null;
        for each pattern in queryPatterns
            if (visitedPatterns.contains(pattern))
                continue;
            if (bestPattern == null || compareCost(pattern, bestPattern) < 0)
                bestPattern = pattern;
        resolvePattern(bestPattern);
        visitedPatterns.add(bestPattern);

Listing 3.1: Pseudocode for the GREEDY query evaluation algorithm

While the main procedure of GREEDY is straightforward, the operation of compareCost(pattern1, pattern2) remains to be explained. Although no statistics about the services are directly available, the algorithm must somehow estimate the relative costs of the triple patterns. In GREEDY, the relative cost of a triple pattern is estimated in one of two ways:

1. Using statistics about predicates that have been learned from previous queries
2. Using the number of bindings for the subject/object of the triple pattern

The latter method acts as a fallback in the case where no statistics have yet been learned from previous queries. In the next section, we describe the details of estimating the cost of a triple pattern.

3.6.2  Cost Estimation for Query Patterns

Cost Equation

GREEDY estimates the cost of retrieving data for a triple pattern by the following equation:

    estimatedTime = baseTime(p, dir) + numInputs × timePerInput(p, dir)    (3.1)

where

• estimatedTime is the expected time to retrieve the data
• baseTime(p, dir) is the sum of the round-trip network latencies for all services that resolve predicate p in the direction dir (baseTime(p, dir) also includes the cost of establishing a TCP connection)
• numInputs is the number of input URIs that are sent to each service
• timePerInput(p, dir) is the time required to receive the response data for a single input URI, from all services that resolve predicate p in the direction dir
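Eq. (3.1) translates directly into code. The following is a hypothetical rendering (the names are ours, not those of the SHARE source), with the regression parameters assumed to have been looked up from the statistics database described below:

    public class CostModel {
        // Direction in which a triple pattern is resolved (Section 3.3.1).
        enum Direction { FORWARD, REVERSE }

        // Learned parameters for one (predicate, direction) combination.
        static class PredicateStats {
            final double baseTimeMs;      // intercept: round-trip latencies + connection setup
            final double timePerInputMs;  // slope: response time per input URI
            PredicateStats(double baseTimeMs, double timePerInputMs) {
                this.baseTimeMs = baseTimeMs;
                this.timePerInputMs = timePerInputMs;
            }
        }

        // Eq. (3.1): estimatedTime = baseTime(p, dir) + numInputs * timePerInput(p, dir)
        static double estimatedTimeMs(PredicateStats stats, int numInputs) {
            return stats.baseTimeMs + numInputs * stats.timePerInputMs;
        }
    }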
The resolution of a triple pattern (s, p, o) is in effect a join operation between the local triple store and the triples stored by the services that generate p. However, in contrast to traditional cost estimations for joins, which are based on the sizes of the input tables, Eq. (3.1) is based on service response times. There are two reasons for this. First, the time required to transfer data over the network is a significant cost that is not present in a single-database scenario, and it is a cost that may vary for each data source. Second, the data sources can be computational services. The amount of processing time required by different algorithms is likely to vary greatly (e.g. a BLAST query vs. a protein structure prediction), and will also vary with the CPU power and load of the servers that are used to perform the analyses. These factors are represented in Eq. (3.1) by the timePerInput(p, dir) factor.

Learning Statistics from Queries

After each triple pattern is resolved, GREEDY records a sample containing the following information:

• predicate URI
• direction of resolution (forward or reverse)
• number of inputs (i.e. number of bindings of the subject/object of the triple pattern)
• total response time (in milliseconds)

The total response time is the time required to obtain all data for a triple pattern from the available services. The information listed above constitutes a single sample in the statistics database. At periodic intervals, the available samples are used to compute the values of baseTime(p, dir) and timePerInput(p, dir) using linear regression, as depicted in Fig. 3.6. Once the baseTime and timePerInput values have been computed, GREEDY is able to use them to estimate the times for resolving triple patterns, as per Eq. (3.1).

Figure 3.6: An example regression line, computed to determine the cost parameters baseTime(bio2rdf:xUniProt, reverse) and timePerInput(bio2rdf:xUniProt, reverse), as described in Section 3.6.2. The fitted line is estimatedTime = 1166.53 ms + numInputs × 177.98 ms; that is, baseTime = 1166.53 ms and timePerInput = 177.98 ms. The Bio2RDF project uses the bio2rdf:xUniProt predicate to denote a cross-reference from an entity (e.g. a gene) to a UniProt protein record. The prefix "bio2rdf:" is an abbreviation for http://bio2rdf.org/ns/bio2rdf#.
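The regression itself is ordinary least squares over (numInputs, responseTime) samples. A self-contained sketch follows; this is not the SHARE source, which can be consulted at the repository given in Section 2.6.

    public class ResponseTimeRegression {
        // Fit responseTime = baseTime + numInputs * timePerInput by least squares.
        // Returns { baseTime, timePerInput }. As noted in the text, at least two
        // samples with distinct numInputs values are required.
        static double[] fit(double[] numInputs, double[] responseTimes) {
            int n = numInputs.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumX += numInputs[i];
                sumY += responseTimes[i];
                sumXY += numInputs[i] * responseTimes[i];
                sumXX += numInputs[i] * numInputs[i];
            }
            double denom = n * sumXX - sumX * sumX;
            if (denom == 0)
                throw new IllegalArgumentException("need >= 2 distinct numInputs values");
            double timePerInput = (n * sumXY - sumX * sumY) / denom;
            double baseTime = (sumY - timePerInput * sumX) / n;
            return new double[] { baseTime, timePerInput };
        }
    }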
Special Cases for Cost Estimation

In order to compute a regression line for a given predicate and direction, at least two samples with distinct x values (i.e. numInputs values) must be recorded in the statistics database. There are two cases where summary statistics will not be available for a given predicate/direction combination (denoted "(p, dir)" below):

1. No samples have been recorded for (p, dir): If available, the values averageBaseTime and averageTimePerInput will be used instead. averageBaseTime and averageTimePerInput are calculated over the baseTime and timePerInput values for all predicates in the statistics database:

    averageBaseTime = ( Σ_(p,dir) baseTime(p, dir) ) / N

    averageTimePerInput = ( Σ_(p,dir) timePerInput(p, dir) ) / N

where N is the number of predicate/direction combinations that have computed values for baseTime and timePerInput. These values are intended to represent the characteristics of a predicate with average performance. averageBaseTime and averageTimePerInput will only be available if a regression line has been computed for at least one predicate/direction combination in the statistics database. If this is not the case, GREEDY will use numInputs (i.e. the number of subject/object bindings) as the cost of the triple pattern. This is equivalent to assuming that all predicate/direction combinations perform equivalently when given the same number of inputs.

2. One or more samples with the same x value (numInputs) have been recorded for (p, dir): Let the common value of numInputs across the samples be X. An average response time will be computed for the samples, and this value will be used to estimate the cost of any future triple patterns where numInputs is exactly equal to X. Otherwise, the cost of the triple pattern will be computed from averageBaseTime/averageTimePerInput or numInputs, as per Case 1.
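Taken together, the cost estimate for a triple pattern reduces to a cascade of fallbacks. The sketch below is a hypothetical rendering of that logic (not the actual compareCost implementation), using nullable boxed values to indicate which statistics are available:

    public class CostFallback {
        // Estimate the cost of resolving a pattern for one (predicate, direction)
        // combination, falling back through the cases described above.
        static double estimateCost(Double baseTime, Double timePerInput,  // regression line, if fitted
                                   Double avgTimeForThisInputCount,       // Case 2: exact numInputs match
                                   Double averageBaseTime,                // Case 1: global averages
                                   Double averageTimePerInput,
                                   int numInputs) {
            if (baseTime != null && timePerInput != null)
                return baseTime + numInputs * timePerInput;               // Eq. (3.1)
            if (avgTimeForThisInputCount != null)
                return avgTimeForThisInputCount;                          // Case 2
            if (averageBaseTime != null && averageTimePerInput != null)
                return averageBaseTime + numInputs * averageTimePerInput; // Case 1
            return numInputs;  // final fallback: compare patterns by input count alone
        }
    }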
3.7  Evaluation

3.7.1  Evaluation Procedure

At the current time, no benchmarks are available for distributed SPARQL systems. This is not surprising, given that benchmarking distributed systems is considerably more difficult than benchmarking monolithic database systems. Whereas the performance of monolithic systems can be measured independently with standardized data sets, distributed systems must be compared as a group, in order to control for factors such as changing data sources, server load, connection bandwidth, and volume of traffic on the network. Here we evaluate the performance of GREEDY in a simple before-and-after manner; that is, we measure the performance differences between BASIC (optimizer off) and GREEDY (optimizer on) with increasing amounts of training, without comparison to other systems.

In order to base the evaluation of GREEDY on realistic use cases, the author translated 6 bioinformatics queries found in the literature into equivalent SHARE queries (Fig. 3.7). The original queries were taken from the publications of other bioinformatics integration projects, namely TAMBIS [168], GALAXY [169], BioMART [119], and the Linked Open Drug Data (LODD) [91] project. These systems were the most suitable source of third-party queries because they target many of the same data sources as SHARE. Nevertheless, some translations could not be made perfectly. For instance, Query 3 (Fig. 3.7) originally used the UCSC genome database to obtain exon boundaries and SNP coordinates, while the SHARE version uses the SNP annotations provided by UniProt instead. Notes are provided in Fig. 3.7, indicating where such differences exist.

To evaluate the learning aspect of the GREEDY algorithm, the queries of Fig. 3.7 were divided into 3 training queries and 3 test queries. The assignment of queries to the two sets was based on the overlap in subject matter between the queries: Queries 1 & 2 involve SNPs, Queries 3 & 4 involve drugs and associated targets, and Queries 5 & 6 involve biological pathways. For query pairs 1/2 and 5/6, one query was randomly chosen as the training query and the other became the test query. For query pair 3/4, Query 3 was explicitly chosen as the training query because it contains only one constant; this implies that Query 3 has only one possible execution plan, and thus optimization would have no effect. More will be said about queries with a single constant in the conclusion (Section 3.9).

For each training query, 3 additional training variants were generated by substituting the constants in the query with randomly selected replacement values. For example, the training variants for Query 1 were generated by changing the target protein. Similarly, variants for Queries 3 and 5 were generated by changing the target drug and biological pathway, respectively. Generating variants of the training queries was necessary in order to provide adequate training data for GREEDY. If the optimizer were trained by running the same queries repeatedly, the predicates in the training queries would always be resolved using the same number of subject/object bindings (i.e. service inputs), and thus no regression lines could be computed. The variants used for each training query are listed in Appendix B.1. Each training query was run with GREEDY rather than BASIC; this most accurately reflects the real usage of SHARE, where the optimizer is enabled for all queries.

PREFIX uniprot: <http://bio2rdf.org/uniprot:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX core2: <http://bio2rdf.org/core:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?substitution ?start ?end ?dbCrossRef
WHERE {
  uniprot:P01344 core:annotation ?annotation .  # uniprot:P01344 = "IGF-II gene"
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation core:substitution ?substitution .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end .
}

(a) Query 1 (training query): "Find non-synonymous single nucleotide polymorphisms (SNPs) within coding exons of the human insulin-like growth factor II (IGF-II) gene". The original query from the GALAXY project [169] did not specify that the SNPs should be non-synonymous; this was a limitation of using UniProt to obtain SNP information, rather than the UCSC genome database as per the original query.

(b) Query 1 as graph

PREFIX probeset: <http://bio2rdf.org/affymetrix:>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX core2: <http://bio2rdf.org/core:>
PREFIX affy: <http://bio2rdf.org/ns/affymetrix#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?geneID ?proteinID ?snpID ?goProcess ?goComponent ?goFunction
WHERE {
  # gene IDs
  probeset:53701_at affy:xEnsembl ?geneID .
  # protein IDs
  probeset:53701_at affy:xSwissProt ?proteinID .
  # GO terms
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  # SNPs
  ?proteinID core:annotation ?annotation .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?snpID .
}

(c) Query 2 (test query): "Map a microarray probeset to associated information: ENSEMBL gene IDs, SwissProt protein IDs, SNPs, and GO terms", a use case from the BioMART project [119]. As in Query 1, the results are limited to non-synonymous SNPs.
(d) Query 2 as graph

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>
PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?targetProtein ?gene ?disease ?diseaseName
WHERE {
  drug:DB00421 drugbank:target ?targetProtein .
  ?targetProtein drugbank:hgncId ?gene .
  ?gene bio2rdf:xOMIM ?omim .
  ?disease diseasome:omim ?omim .
  ?disease rdfs:label ?diseaseName .
}

(e) Query 3 (training query): "For each target protein of the drug Spironolactone, show the gene that encodes the protein, and any known disease associations for the gene", from the Linked Open Drug Data (LODD) project [91]. The original "query" that appears in [170] is answered interactively, by browsing across the LODD datasets from within a web browser. The SHARE query answers the same question using a distributed query across the DrugBank, HGNC, and Diseasome SPARQL endpoints. The subject drug of the original query has been changed from Varenicline to Spironolactone to avoid overlap of training and testing queries during the optimizer evaluation.

(f) Query 3 as graph

PREFIX drug: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX omim: <http://bio2rdf.org/mim:>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?targetProtein ?alzProtein ?alzProteinName
WHERE {
  drug:DB01273 drugbank:target ?target .   # drug:DB01273 = "Varenicline"
  ?target drugbank:swissprotId ?targetProtein .
  ?alzProtein rdfs:seeAlso omim:104300 .   # omim:104300 = "Alzheimer's Disease"
  ?alzProtein dc:title ?alzProteinName .
  ?targetProtein core:interaction ?interaction .
  ?interaction core:participant ?participant .
  ?participant owl:sameAs ?alzProtein .
}

(g) Query 4 (test query): "Are there any protein-protein interactions between Alzheimer-associated proteins and the targets of Varenicline?", from the Linked Open Drug Data project.
The original "query" posed in [170] was answered by browsing hypotheses in the AlzSWAN KnowledgeBase [171]. In the SHARE version, the question is answered by a distributed query across the UniProt and DrugBank SPARQL endpoints.

(h) Query 4 as graph

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO: <http://lsrn.org/GO:>

SELECT ?protein ?omim
FROM <http://dev.biordf.net/~benv/inferences.owl>
WHERE {
  ?protein sadi:hasFunction GO:0004872 .      # GO:0004872 = "receptor activity"
  ?protein sadi:isParticipantIn GO:0007595 .  # GO:0007595 = "lactation"
  ?protein sadi:isCausallyRelatedWith ?omim .
}

(i) Query 5 (training query): "Retrieve receptor proteins involved in lactation and one or more disease processes", from the TAMBIS project [172].

(j) Query 5 as graph

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO: <http://lsrn.org/GO:>
PREFIX UniProt: <http://lsrn.org/UniProt:>

SELECT DISTINCT ?motif
FROM <http://dev.biordf.net/~benv/inferences.owl>
WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .  # UniProt:Q93038 = LARD protein
  ?protein sadi:isParticipantIn GO:0006915 .     # GO:0006915 = apoptosis
  ?protein sadi:hasMotif ?motif .
}

(k) Query 6 (test query): "Select motifs for proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as LARD)", from the TAMBIS project [173].

(l) Query 6 as graph

Figure 3.7: SHARE translations of six queries from other bioinformatics integration projects, including TAMBIS, GALAXY, BioMART, and Linked Open Drug Data. These queries are used for the evaluation of the GREEDY optimization algorithm.

The overall idea of the evaluation is to measure the performance difference between BASIC (optimizer off) and GREEDY (optimizer on) with increasing amounts of training. However, the ordering of triple patterns in an input query can have a large effect on what this performance difference actually is. For example, if a "smart" user composes a query that already has the optimal ordering of triple patterns, then there will be no difference at all between BASIC and GREEDY for that query. For the optimizer evaluation described here, we make the simplifying assumption that each input query ordering is equally likely. Ideally, we would like to compare the performance of BASIC and GREEDY for all possible orderings of each test query. This would provide us with the complete range of execution times that are possible for a given query; that is, how slow the query can be and how fast it can be, and how the possible orderings for the query are distributed between these two extremes.
Unfortunately, the exhaustive approach is not practical, even for the relatively simple queries presented in Fig. 3.7; for example, Query 2 has 8 triple patterns, and thus has 8! = 40,320 possible orderings. Instead, execution times were measured for 5 randomly generated orderings of each test query, with the understanding that these orderings are merely a sample from the population of possible query orderings. The pseudocode for the benchmarking procedure is shown in listing 3.2. Three trials were recorded for each test query, in order to give an indication of the amount of variance that occurs between executions of the same query due to varying server load and data transfer times. The results in the following section demonstrate that this variance is small relative to the changes in query times that result from reordering the triple patterns of a query.

FOR trainingRun = 0 to 5
    IF (trainingRun > 0)
        FOR trainingQuery = { Query 1, Query 3, Query 5 }
            Run variant trainingRun of trainingQuery and record stats
    FOR testQuery = { Query 2, Query 4, Query 6 }
        FOR testQueryOrdering = 0 to 9
            IF (trainingRun == 0)
                FOR trial = 0 to 2
                    Run testQuery with BASIC and time
            FOR trial = 0 to 2
                Run testQuery with GREEDY and time

Listing 3.2: Pseudocode for the benchmarking procedure for GREEDY

3.7.2  Results

Benchmarking Results

Figure 3.8 shows the execution times for the 5 randomly generated orderings of Test Queries 2, 4, and 6. (For a complete list of generated orderings for each query, see Appendix B.2.) Each group of bars in Fig. 3.8 represents a set of execution times for one particular ordering of a query. Within each group, the execution times were generally expected to decrease from left to right, beginning with the execution time for BASIC (optimizer off) and proceeding through the execution times for GREEDY (optimizer on) with increasing numbers of training runs. A timeout of one hour was imposed on each query to ensure that the overall experiment would complete in a reasonable amount of time.

The solution sets generated for all trials of Queries 2, 4, and 6 were identical, regardless of the ordering of triple patterns, and regardless of whether the queries were run under BASIC or GREEDY with 0-4 training runs. This was verified by recording the solution set for each query execution to a separate file, and comparing the files with the unix 'fdupes' utility for identifying duplicate files. The lines of each solution file were sorted alphabetically prior to comparison, as some trials would generate the rows of the solutions table in different orders.

(a) Execution times for Query 2

(b) Execution times for Query 4
(c) Execution times for Query 6

Figure 3.8: Results of GREEDY evaluation: a comparison of query execution times for Queries 2, 4, and 6 of Fig. 3.7. Each query is executed using 5 randomly generated orderings, and for each ordering, query times are recorded for BASIC (optimizer off) and GREEDY (optimizer on) with varying numbers of training runs.

In order to fully understand the decisions made by the trained versions of GREEDY in the discussion below, it is necessary to see the state of the predicate statistics after each training run. These states are depicted in Fig. 3.9, using two types of graphs:

1. scatterplots (left): show average response times for predicates which only have samples for a single x (numInputs) value.
2. bar graphs (right): show the slope (timePerInput) and intercept (baseTime) for predicates that have sufficient samples for a regression line.

The reader will observe from the bar graphs of Fig. 3.9 that for most of the predicates, a regression line is only ever computed in one of the two possible directions. This happens because the structure of the training queries (Fig. 3.7) only permits resolving these predicates in one direction. For example, in Query 3, the drugbank:hgncId predicate is always resolved in the forward direction because the path of traversal must begin with the sole constant in the query, drug:DB00421. The sadi:isCausallyRelatedWith predicate is a special case; although it is always resolved in the forward direction during training, it is its own inverse property, and so the same regression line describes both directions.

Results for Query 2 (Fig. 3.8a)

Over all trials, we observe 3 distinct query times: ∼45 seconds (best), ∼100 seconds (fair), and timeout after 60 minutes (worst). Examination of the query logs shows that it is the treatment of one particular triple pattern that determines the outcome for each trial:

?annotation rdf:type core2:Natural_Variant_Annotation

One of the conventional uses of rdf:type is to indicate database record types. In this particular triple pattern, Natural_Variant_Annotation represents the type of record that is used to store amino acid sequence variations in the UniProt database. (The majority of these variations are single amino acid polymorphisms (SAPs), which are in turn due to single nucleotide polymorphisms (SNPs) in the coding DNA. In the UniProt database, the category of naturally occurring mutations also includes disease-associated mutations and RNA editing events.) When the pattern is resolved in the reverse direction, a service call is made which retrieves all Natural_Variant_Annotation records from Bio2RDF's UniProt SPARQL endpoint. At the current time, there are 70,629 such records, and retrieving a list of their identifiers requires about 55 seconds.

While resolving the rdf:type pattern in the reverse direction significantly increases the query time, it is not by itself disastrous. More serious performance problems (i.e. timeouts) occur when the query engine proceeds to use the 70,629 bindings for ?annotation as input for solving a later query pattern. To illustrate, Fig. 3.10a shows the sequence of steps used by BASIC for Query Ordering 3, which lead to a query timeout.
The GREEDY heuristic avoids the use of large input sets when resolving query patterns, and thus turning on the optimizer for Query Orderings 3 and 4 produces an immediate and significant improvement. However, without training, the choices of GREEDY are still not optimal. The GREEDY heuristic predicts that resolving the rdf:type pattern in the reverse direction will be a fast operation, since it requires sending only one input (Natural_Variant_Annotation) to the relevant services. The implicit assumption of the GREEDY heuristic is that all predicates will require the same amount of time to resolve, given the same number of inputs. This assumption does not hold true for the rdf:type predicate, and so GREEDY requires training in order to make the correct decision. Fig. 3.10b and Fig. 3.10c show the steps followed by GREEDY for Query Ordering 3, without training and with training respectively. In the trained version, GREEDY delays resolution of the rdf:type pattern until it can be resolved in the forward direction.

(a) 1 Training Run: Average Predicate Response Times

(b) 2 Training Runs: Average Predicate Response Times

(c) 2 Training Runs: Regression Line Parameters

(d) 3 Training Runs: Average Predicate Response Times

(e) 3 Training Runs: Regression Line Parameters
Evaluation  sadi:isFunctionOf sadi:hasFunction sadi:isParticipantIn sadi:hasParticipant  rdf:type  sadi:isFunctionOf sadi:hasFunction  2000 5000 500  50  100  Time (Milliseconds)  500  forward response time reverse response time  10  100  core:annotation  300  400  rdfs:label  sadi:isCausallyRelatedWith  200 Number of Inputs  drugbank:hgncId  100  diseasome:omim  1  rdfs:seeAlso  0  core:range  drugbank:target  core:begin  bio2rdf:xOMIM  5  Average Response Time (seconds)  baseTime forward timePerInput forward baseTime reverse timePerInput reverse  (f) 4 Training Runs: Average Predicate Response (g) 4 Training Runs: Regression Line Parameters Times  Figure 3.9: State of predicate statistics during the experiment. Left: Scatter plots showing the average response times of various predicates for a fixed number of inputs. The predicate/ direction pairs included these graphs have the same x value (i.e. numInputs) for all samples, and thus no regression line is computable. Right: Bar graphs depicting the slope (timePerInput) and intercept (baseTime) for all predicates for which a regression line was computable. Note that there is no bar graph for 1 training run because a regression line can only be computed from two or more samples; each predicate is used in only one training query, and so obtaining adequate samples entails two or more training runs.  65  3.7. Evaluation the forward direction. Finally, the reader will notice from Fig. 3.8a (and also Fig. 3.8c) that GREEDY always requires two training runs before there is any improvement in query times. The reason for this is that a minimum of two samples are required to compute a regression line. Until a regression line has been computed for at least one predicate, all patterns are compared by the number of input bindings. The results for each random query ordering (i.e. group of bars) in Fig. 3.8a differ only with respect to the query times under BASIC. In query orderings 0 and 2, the patterns were ordered such that BASIC resolved the rdf:type in the reverse direction, but did not use the large number of resulting bindings for ?annotation to resolve a later pattern. In query ordering 1, the patterns were ordered such that the rdf:type pattern was resolved in the forward direction. In query orderings 3 and 4, the patterns were ordered such that the rdf:type pattern was resolved in the reverse direction and the resulting bindings of ?annotation were used as input for a later pattern. (The complete list of randomly generated query orderings for the experiment is provided in Appendix B.2.) The results for query ordering 1 illustrate an important caveat of GREEDY, which is that query times prior to training may actually be worse than for BASIC, in the case that a “smart” user specifies the query patterns in an optimal order or is simply lucky.  66  3.7. Evaluation  3  probeset:53701_at 5 1 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  47ms  55ms  ?geneID  (1 binding)  3,696ms  ?goProcess  ?proteinID  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation ?annotation (70,629 bindings)  4 rdf:type  6  53,152ms  rdfs:seeAlso  TIMEOUT  core2:Natural_Variant_Annotation  ?snpID  (a) The steps followed by BASIC, in an attempt to answer Test Ordering 3 of Query 2 (Appendix B.2.1). The query times out after 60 minutes. 
probeset:53701_at  3  5 1 6 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  48ms  42ms  ?geneID  44ms  ?goProcess  ?proteinID  (1 binding)  7  2,968ms  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation  14,782ms  ?annotation (Step 4: 70,629 bindings, Step 7: 2 bindings)  4 rdf:type  8  54,759ms  rdfs:seeAlso 1,997ms  core2:Natural_Variant_Annotation  ?snpID (2 bindings)  (b) The steps followed by GREEDY with 0 training runs, in order to answer Test Ordering 3 of Query 2 (Appendix B.2.1). probeset:53701_at  3  4 1 5 2 affy:xEnsembl affy:xSwissProtaffy:xGene_Ontology_Biological_Process affy:xGene_Ontology_Cellular_Component affy:xGene_Ontology_Molecular_Function  45ms  41ms  ?geneID  44ms  ?goProcess  ?proteinID  (1 binding)  6  2,951ms  (1 binding)  ?goComponent (1 binding)  65ms  ?goFunction (1 binding)  core:annotation  14,782ms  ?annotation (2 bindings)  7  rdf:type  14,247ms  8  rdfs:seeAlso 165ms  core2:Natural_Variant_Annotation  ?snpID (2 bindings)  (c) The steps followed by GREEDY with 2 training runs, in order to answer Test Ordering 3 of Query 2 (Appendix B.2.1).  Figure 3.10: Execution plans for Query 2 under BASIC, untrained GREEDY, and trained GREEDY. Descriptions of the execution steps for each plan are given in Tables 3.5 to 3.7, respectively.  67  3.7. Evaluation  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  3,696  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  55  ?goProcess => 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  47  ?geneID => 1 binding  4  Retrieve amino acid sequence variation records for all proteins in the UniProt database  (?annotation, rdf:type, core2:Natural Variant Annotation)  53,152  ?annotation => 70,629 bindings  5  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  6  For each amino sequence variation record from Step 4, retrieve the associated SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  TIMEOUT  N/A  =>  1  Table 3.5: Description of execution steps for Query 2 under BASIC, as depicted in Fig. 
3.10a  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  2,968  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  44  ?goProcess => 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  48  ?geneID => 1 binding  4  Retrieve amino acid sequence variation records for all proteins in the UniProt database  (?annotation, rdf:type, core2:Natural Variant Annotation)  54,759  ?annotation => 70,629 bindings  5  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  6  Retrieve UniProt protein identifiers corresponding to Affymetrix probeset 53701 at  (probeset:53701 at, affy:xSwissProt, ?proteinID)  42  ?proteinID => 1 binding  7  Retrieve annotations for each protein from step 6  (?proteinID, core:annotation, ?annotation)  14,782  ?proteinID => 1 binding ?annotation => 2 bindings  8  For each annotation from step 7, retrieve the corresponding SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  1,997  ?annotation => 2 bindings ?snpID => 2 bindings  =>  1  Table 3.6: Description of execution steps for Query 2 under untrained GREEDY, as depicted in Fig. 3.10b  68  3.7. Evaluation  Query Step  Description  Triple Pattern  Time (ms)  Variable signed  Bindings  As-  1  Retrieve GO cellular component annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent)  2,951  ?goComponent binding  2  Retrieve GO biological process annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Biological Process, ?goComponent)  44  ?goProcess=> 1 binding  3  Retrieve ENSEMBL gene cross-references for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xEnsembl, ?geneID)  45  ?geneID => 1 binding  4  Retrieve GO molecular function annotations for Affymetrix probeset 53701 at  (probeset:53701 at, affy:xGene Ontology Molecular Function, ?goFunction)  65  ?goFunction => 1 binding  5  Retrieve UniProt protein identifiers corresponding to Affymetrix probeset 53701 at  (probeset:53701 at, affy:xSwissProt, ?proteinID)  41  ?proteinID => 1 binding  6  Retrieve annotations for each protein from step 6  (?proteinID, core:annotation, ?annotation)  11,237  ?proteinID => 1 binding ?annotation => 2 bindings  7  For each annotation record from Step 5, retrieve it?s record type  (?annotation, rdf:type, core2:Natural Variant Annotation)  14,247  ?annotation => 2 bindings  8  For each Natural Variant Annotation record, retrieve the corresponding SNP ID  (?annotation, rdfs:seeAlso, ?snpID)  165  ?annotation => 2 bindings ?snpID => 2 bindings  =>1  Table 3.7: Description of execution steps for Query 2 under trained GREEDY, as depicted in Fig. 3.10c In Figs. 3.10a to 3.10c, the reader will observe that many of the triple patterns are resolved very quickly. For instance, Steps 2, 3, and 5 of Fig. 3.10a are completed in 55, 47, and 65 milliseconds, respectively. This happens because the relevant data for these patterns has already been gathered by the service invocation for Step 1. 
In Step 1, SHARE resolves the pattern: (probeset:53701 at, affy:xGene Ontology Cellular Component, ?goComponent) by issuing the following query to the Bio2RDF Affymetrix endpoint: CONSTRUCT { p r o b e s e t : 5 3 7 0 1 a t ?p ? o . } WHERE { p r o b e s e t : 5 3 7 0 1 a t ?p ? o . }  As a result, Step 1 retrieves not only triples with the affy:xGene Ontology Cellular Component predicate, but all triples in the Affymetrix endpoint with probeset:53701 at as the subject. This includes all triples that match the query patterns of Steps 2, 3, and 5, and thus no additional service invocations need to be made in these steps. The query engine maintains a hash table to track which services have been invoked with which inputs, in order to avoid redundant service calls. Results for Query 4 (Fig. 3.8b) We see that for Query 4, there is no discernible difference between the performance of BASIC and GREEDY. The majority of query times are in the range 30-45 seconds, and this variation occurs across different trials for the same query plan, indicating that it is due to factors separate from the ordering of triple patterns (e.g. server load, network traffic). Query 4 contains no exceptionally expensive patterns, and there is no fan-out effect for any path of query traversal. 69  3.7. Evaluation For example, there are only 2 Alzheimer’s-associated proteins in UniProt and only 2 known targets for the drug Varenicline. It is more efficient to solve the query by searching the proteinprotein interactions of the Varenicline targets, as there is only 1 known interaction, whereas there are 58 known interactions for the Alzheimer’s proteins. However, there are relatively few orderings that would cause BASIC to use the latter strategy, and none of those orderings were included in the randomly generated test orderings. Results for Query 6 (Fig. 3.8c) The results graph for Query 6 shows similar patterns to the results graph for Query 4 (Fig. 3.8a). Here again, there are 3 distinct query times (∼ 140 seconds, ∼ 1300 seconds, and timeout), and the performance of each query depends on the handling of a single pattern: ?protein sadi:isParticipantIn GO:0006915 Similarly to rdf:type in Query 4, the sadi:isParticipantIn predicate is unusually expensive to resolve in the reverse direction. Using GO:0006915 (apoptosis) as input, resolving sadi:isParticipantIn in the reverse direction requires approximately 20 minutes and produces 40,475 bindings for the ?protein variable. It makes intuitive sense that this strategy would be expensive, because there may be a large number of proteins involved in a given biological process, particularly when proteins from all species are considered (as is the case here). In addition, retrieving the proteins associated with a given Gene Ontology term is more computationally intensive than a typical database lookup, because the results must include proteins that are annotated with any subterm (i.e. descendant term) of the target GO term. While these factors have a significant effect on query time, the factor that has the largest effect is the number of services that map to sadi:hasParticipant (the inverse predicate of sadi:isParticipantIn). In total, there are 27 services which map GO Terms to various types of database identifiers15 (GenBank, FlyBase, etc.), and the query engine invokes each of these services sequentially. As in Query 4, enabling GREEDY prevents query timeouts (e.g. Test Ordering 1 of Fig. 3.8c) that result from resolving a pattern with a large number of inputs. 
However, without training, GREEDY can also worsen query times in cases where the ordering of triple patterns is already efficient (e.g. Test Ordering 3 of Fig. 3.8c). Figs. 3.11a to 3.11c show the steps for Query Ordering 1 that lead to a timeout under BASIC (> 60 minutes), improved performance under GREEDY (∼ 20 minutes), and best performance under GREEDY with 2 training runs (∼ 2 minutes), respectively. 15 At the time of the experiment, many of the GO services (24 of 27) were not working correctly and were returning empty result sets. This did not affect the solutions for the query, as only one of the 27 services was actually relevant (GO ⇒ UniProt) in the overall context of the query. If all of the GO services had been working, the resolution of the sadi:isParticipantIn pattern would have been even more expensive.  70  3.7. Evaluation  ?protein (40,475 bindings)  2 sadi:isHomologousTo1 sadi:isParticipantIn 1,197,564ms  TIMEOUT  UniProt:Q93038  GO:0006915  sadi:hasMotif ?motif  (a) The steps followed by BASIC, in an attempt to answer Test Ordering 1 of Query 6 (Appendix B.2.3). The query times out after 60 minutes.  ?protein (Step 1: 40,475 bindings, Step 2: 23 bindings)  2  3  sadi:isHomologousTo1 sadi:isParticipantIn 1,189,735ms  56,286ms  UniProt:Q93038  GO:0006915  sadi:hasMotif 27,231ms  ?motif  (4 bindings)  (b) The steps followed by GREEDY with 0 training runs, in order to answer Test Ordering 1 of Query 6 (Appendix B.2.3).  ?protein (Step 1: 475 bindings, Step 2: 43 bindings)  1  3  sadi:isHomologousTo2 sadi:isParticipantIn 63,426ms  UniProt:Q93038  42,767ms  GO:0006915  sadi:hasMotif 22,709ms  ?motif  (4 bindings)  (c) The steps followed by GREEDY with 2 training runs, in order to answer Test Ordering 1 of Query 6 (Appendix B.2.3).  Figure 3.11: Execution plans for Query 6 under BASIC, untrained GREEDY, and trained GREEDY. Descriptions of the execution steps for each plan are given in Tables 3.8 to 3.10, respectively.  71  3.7. Evaluation  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve proteins annotated with apoptosis or one of its subterms  (?protein, sadi:isParticipantIn, GO:0006915)  1,197,564  ?protein => 40,475 bindings  2  Retrieve homologs for each protein from Step 1  (?protein, sadi:isHomologousTo, UniProt:Q93038)  TIMEOUT  N/A  Table 3.8: Description of execution steps for Query 6 under BASIC, as depicted in Fig. 3.11a  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve proteins annotated with apoptosis or one of its subterms  (?protein, sadi:isParticipantIn, GO:0006915)  1189735  ?protein => 40,475 bindings  2  Retrieve homologs of UniProt:Q93038 (lymphocyte associated receptor of death)  (?protein, sadi:isHomologousTo, UniProt:Q93038)  56286  ?protein => 23 bindings  3  For each protein from Step 2, retrieve the associated motifs  (?protein, sadi:hasMotif, ?motif)  27231  ?protein => 23 bindings motif => 4 bindings  Table 3.9: Description of execution steps for Query 6 under untrained GREEDY, as depicted in Fig. 
3.11b  Query Description Step  Triple Pattern  Time (ms)  Variable Bindings Assigned  1  Retrieve homologs of UniProt:Q93038 (lymphocyte associated receptor of death)  (?protein, sadi:isHomologousTo, UniProt:Q93038)  63,426  ?protein => 475 bindings  2  For each homolog from Step 1, determine if it participates in apoptosis  (?protein, sadi:isParticipantIn, GO:0006915)  42,767  ?protein => 43 bindings  3  For each apoptosis protein from Step 2, retrieve the associated motifs  (?protein, sadi:hasMotif, ?motif)  22,709  ?protein => 43 bindings ?motif => 4 bindings  Table 3.10: Description of execution steps for Query 6 under trained GREEDY, as depicted in 72 Fig. 3.11c  3.8. Future Work  3.8  Future Work  There are several areas where significant performance improvements to SHARE could be made, which are separate from the issue of ordering joins. Requests to different services are currently issued sequentially, whereas in many cases it is possible to issue requests in parallel. In addition, there are cases where calls to irrelevant services could be avoided, by taking the input and output datatypes of the services into account. For instance, the resolution of the sadi:isParticipantIn predicate in Query 6 leads to the invocation of 27 services which map GO terms to various types of database identifiers. However, only one of the GO services (GO ⇒ UniProt) is actually relevant in the overall context of the query. By matching the input/output datatypes of the different triple patterns, it should be possible to eliminate such unnecessary service calls. Another possible area for improvement is the pipelining of data. Most database systems pipeline data between query operators, in order to minimize the time-to-first-result, whereas SHARE determines the full solution set prior to displaying results. The fact that pipelining is not used in SHARE allows the joins to be freely reordered at any point in the query, which has been demonstrated in the experiment to be a significant advantage. However, it is possible that some compromise can be reached between pipelining and flexibility in reordering; for example, this same issue has been dealt with in the Tukwila system [166] by constructing query plans from pipelined “query fragments”.  3.9  Conclusion  The unique characteristics of the GREEDY algorithm are that: (1) it is adaptive, and (2) it can learn from previous queries. The evaluation of the optimizer demonstrates that while the GREEDY heuristic is successful in avoiding the worst execution plans, training is often necessary to get the best performance. In particular, the training aspect of the algorithm is important whenever there is a predicate that is significantly more expensive than average, given the same number of inputs. In the experiment, the two expensive predicates were rdf:type (when resolved in the reverse direction) and sadi:isParticipantIn (when resolved in the reverse direction). It is reasonable to handle rdf:type as a special case because it is part of the RDF standard, and the conventions surrounding its use are well established. However, there are likely to be many other types of relationships in the domain of bioinformatics, such as sadi:isParticipantIn, that must be treated specially in order to answer queries efficiently. For example, any predicates specifying broad criteria such as a target tissue or cellular location will likely have a high cost when resolved in the reverse direction. 
The training aspect of GREEDY provides a uniform and automatic way of dealing with such predicates that avoids manually coding special cases. The experiment demonstrates that join ordering is an important factor for the performance of distributed queries. However, for queries that contain only a single constraint (i.e. only one triple pattern with a constant in the subject/object position), there is only one possible ordering of triple patterns, and thus the optimizer cannot produce any improvement. One of six queries in the evaluation was of this type (Query 3). At this point, it is difficult to say how common such queries will be under real usage of the system. A more ideal evaluation of the optimizer would sample real user queries from the logs of the SHARE demo, if/when the SHARE system is more widely used. One interesting problem that is not addressed by GREEDY is the question of which query plan produces the most complete result set. Although the result sets for all queries in the experiment were verified to be the same, regardless of the execution plan, this is not guaranteed to be true for all SHARE queries in general. For example, cross references between biology databases are typically maintained by hand, and so the set of relationships between Database A and Database B might differ depending on which database is queried (i.e. which direction the predicate is resolved in). One idea for estimating the relative “completeness” of a predicate in the  73  3.10. Source Code forward and reverse directions would be to use a sampling technique. The system could sample predicate outputs from previous user queries and try to map them back to their original inputs, by resolving the same predicate in the opposite direction. In this way, a relative completeness score could be computed for the forward and reverse directions of each predicate. The scores could then be used to achieve a tradeoff between efficiency and completeness.  3.10  Source Code  The source code for SHARE is publicly available at http://sadi.googlecode.com, and may be found in the sadi.share folder (i.e. http://sadi.googlecode.com/svn/trunk/sadi.share). In order to successfully build the project, the user must also check out the sadi.common and sadi.client projects. (The project build is automated with Maven.) The main procedure for the GREEDY algorithm is implemented in the executeQueryAdaptive method of the SHAREKnowledgeBase class (ca.wilkinsonlab.sadi.share.SHAREKnowledgeBase) and the logic for comparing the costs of query patterns is implemented in QueryPatternComparator, which is an inner class of SHAREKnowledgeBase. The code for recording, computing, and retrieving statistics about predicates may be found in the PredicateStatsDB class (ca.wilkinsonlab.sadi.stats.PredicateStatsDB).  74  Chapter 4  Conclusion In the field of bioinformatics, making combined use of data sets and software packages often requires custom scripting work to reconcile incompatible formats, schemas, and interfaces. These incompatibilities exist because the various databases and programs are developed independently by different research laboratories, and without reference to a shared set of standards. Unfortunately, achieving standardization within the realm of bioinformatics is a daunting task due to the numerous, interrelated types of data (e.g. sequences, sequence annotations, molecular structures, chemical equations, molecular interactions, metabolic pathways, controlled vocabulary annotations) and analyses (e.g. 
gene prediction, multiple sequence alignment, protein structure prediction, subcellular location prediction, motif identification, phylogenetic tree construction) that exist. However, data/software integration work is mundane, time-consuming, and error prone, and it impedes scientific studies. The potential benefits of automated data integration in bioinformatics are widely recognized, as evidenced by the large number of research projects that have been devoted to solving the problem (e.g. [57, 118, 119, 132, 141, 145, 168, 169, 174, 175]). In this thesis, we have described a distributed query system called SHARE, which aims to address the data and software integration problems that are currently prevalent in bioinformatics. However, a number of distributed query systems have already been developed for bioinformatics prior to SHARE, such as Kleisli [141], TAMBIS [168], BioMART [119], BioFlow [145], and BioMediator [174]. Moreover, all of these systems (including SHARE) are similar in their design; in each case, the system answers user queries by decomposing them into requests against data sources that support a common “wrapper” interface, as per the mediator architecture depicted in Fig. 1.7. The technical feasibility of these systems has been demonstrated repeatedly; however, none of the frameworks have been a success in the sense that they are widely used in bioinformatics work. In addition to the technical challenges, there is also a social aspect to the problem of integration that must be addressed. In particular, there are two important questions that need to be answered for any integration framework: 1. What incentive do data and software providers have for participating? 2. Will the framework last? The main characteristic that distinguishes SHARE from other distributed query systems is that it is built using Semantic Web standards, and this difference has relevance for both of the questions above. Data and software providers that implement SADI services will not only be compatible with SHARE, but will also be compatible with a growing collection of tools [176] (e.g. triple stores and data browsers) and data sources (e.g. Bio2RDF [90], LODD [91], UniProt [11], Linked Life Data [177]) that are specifically designed to support data integration tasks. Further, SHARE has no special requirements for the encoding of RDF data that is consumed/generated by the data sources. Thus, providers may encode their RDF using any ontology that they wish, and the data can readily be used outside the context of SHARE. While the use of Semantic Web technologies has benefits for data providers, the most significant benefits are related to the implementation of the framework itself. Typically, the creators of a mediator system must develop and maintain a mediated schema, and the wrappers for participating data sources must implement mappings to and from that schema. If the wrappers are implemented on the mediator side, as has been the case for many bioinformatics systems 75  Chapter 4. Conclusion (e.g. Kleisli, TAMBIS, BioFlow, and BioMediator), this poses an additional burden of maintenance. In either case, the success of the framework depends upon the continuing diligence of its creators. Thus, it is a important advantage that SHARE requires no mediated schema or mappings. This is a unique consequence of using RDF as the medium for data exchange. 
Triple stores are capable of storing arbitrary graph structures, and thus when the system aggregates data during a query, there are no restrictions on the structure of the RDF data that is returned from each source. Of course, this unconstrained approach also has disadvantages. The effective “master” schema for SHARE is determined collectively by the data sources, and is likely to be far more chaotic than a centrally curated system. Moreover, meaningful integration will only be possible in cases where data sources use the shared identifiers for common entities and predicates. The main contribution of this thesis is GREEDY, the query evaluation algorithm for SHARE that is described in Chapter 3. GREEDY differs significantly from the standard (i.e. System R style [160]) procedure that is used to process queries in relational databases, in order to address additional challenges that are present when answering queries across distributed resources. One of the main difficulties under the distributed scenario is that statistics about the data sources are usually not available. In a relational database, statistics such as the size of a table and the number of keys in an index are essential for formulating efficient query plans. The GREEDY algorithm offers two complimentary solutions for this problem. First, GREEDY employs an adaptive approach to query planning where one join is executed at a time, and future joins are selected based on the result sizes of previous joins. Second, GREEDY learns statistics about predicates as it executes a query, in order to improve the performance of future queries. One of the advantages of the GREEDY approach is that it is able to optimize queries over web services, rather than databases with a specific type of query interface16 As a result, the system is capable of learning statistics about arbitrary software. This is particularly valuable in the bioinformatics domain, where specialized software tools often play an essential role in data analyses, and where the available programs are just as diverse as the available data. While SHARE is a promising prototype, there are number areas where further work is needed before it is ready for real use by bioinformaticians. One of the most important pieces of missing functionality is provenance tracking. Bioinformaticians must be able to review a “proof” of each solution which shows the chain of supporting facts (i.e. triples), and the particular services that asserted each of those facts. Another requirement for real world use will be fine-grained configuration of analytical tools. For example, if a “homolog” predicate is resolved by a BLAST service, users must be able to configure the parameters for the BLAST service, such as the probability cutoff, the seed size, the similarity matrix, and so on. However, it is not clear how to best support such configuration through the query interface. Moreover, it runs contrary to the basic purpose of SHARE, which is to enable users to ask questions without needing to know the locations and interfaces of the specific services that are used to answer them. Thus, SHARE may be best suited for simple, ad hoc queries such as “Which biological pathways involve both protein A and protein B?” or “Which proteins are most frequently targeted by breast cancer drugs?”. In order to accomplish more sophisticated analyses (e.g. building a phylogenetic tree), a workflow composition tool such as Taverna [123] is more suitable. 
One potential method to combine the strengths of SHARE and Taverna would be to use SHARE for constructing template workflows, and Taverna for fine-grained configuration. Yet another area for further work will be developing user interfaces for SHARE. While the query interface to SHARE is powerful, it requires familiarity with the SPARQL query language, and is not well suited to exploration of the data. One promising avenue for obtaining new interfaces would be to adapt existing tools for exploring RDF data, such as RDF browsers [178, 179], graph browsers [180, 181], and faceted browsers [182, 183], to use SHARE as the underlying data source.  16 Recall  from Section 2.3.2 that SPARQL endpoints are modelled as SADI services within the query engine.  76  Bibliography [1] Elaine R. Mardis. Next-Generation DNA sequencing methods. Genomics and Human Genetics, 9:387–402, 2008. URL http://www.annualreviews.org/doi/abs/10.1146/annurev. genom.9.081307.164359?journalCode=genom. [2] A Schulze and J Downward. Navigating gene expression using microarrays. Nature Cell Biology, 3(8):190–195, 201. URL http://www.ncbi.nlm.nih.gov/pubmed/11483980. [3] Philippe Collas and John Arne Dahl. Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Frontiers in Bioscience, 13:929–943, January 2008. URL http://www.bioscience.org/2008/v13/af/2733/list.htm. [4] Catherine Mathe, Marie-France Sagot, Thomas Schiex, and Pierre Rouze. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research, 30 (19):4103–4117, 2002. URL http://nar.oxfordjournals.org/cgi/content/abstract/30/19/ 4103. [5] Burkhard Rost. Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134(2-3):204–218, May 2001. [6] C.A. Floudas, H.K. Fung, S.R. McAllister, M. Monnigmann, and R. Rajgaria. Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science, 61(3):966–988, February 2006. [7] Nucleic acids research: Categorical list of online biology databases, retrieved Feb 4, 2011. URL http://www.oxfordjournals.org/nar/database/c/. [8] Michelle D. Brazas, Joseph Tadashi Yamada, and B.F. Francis Ouelette. Evolution in bioinformatic resources: 2009 update on the bioinformatics links directory. Nucleic Acids Research, 37:W3–W5. URL http://nar.oxfordjournals.org/cgi/content/full/37/suppl_ 2/W3. [9] Donna Maglott, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. Entrez gene: genecentered information at NCBI. Nucleic Acids Research, 33:D54–D58, 2005. URL http: //nar.oxfordjournals.org/cgi/content/abstract/gkl993v1. [10] UniGene home, retrieved Feb 4, 2011. URL http://www.ncbi.nlm.nih.gov/unigene. [11] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, and et al. The universal protein resource (UniProt). Nucleic Acids Research, 33:D154–D159, 2005. URL http://nar.oxfordjournals.org/cgi/content/abstract/33/suppl_1/D154. [12] Josefine Sprenger, J. Lynn Fink, Seetha Karunaratne, Kelly Hanson, Nicholas A. Hamilton, and Rohan D. Teasdale. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Research, 36:D230–D233, 2008. URL http://nar.oxfordjournals. org/cgi/content/abstract/gkm950v1.  77  Bibliography [13] Sbastien Rey, Michael Acab, Jennifer L. Gardy, Matthew R. Laird, Katalin deFays, Christophe Lambert, and Fiona S. L. Brinkman. PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Research, 33:D164–D168, 2005. 
URL http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D164. [14] Ioannis Xenarios, Lukasz Salwinski, Xiaoqun Joyce Duan, Patrick Higney, Sul-Min Kim, and David Eisenberg. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1): 303–305, 2002. URL http://nar.oxfordjournals.org/cgi/content/full/30/1/303?ijkey= lVvmq2t6UjLVU&ke%EF%BF%BD%C3%9C. [15] Kyungsook Han, Byungkyu Park, Hyongguen Kim, Jinsun Hong, and Jong Park. HPID: the human protein interaction database. Bioinformatics, 20(5):2466–2470, October 2004. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/15/2466. [16] Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, and et al. The BioGRID interaction database: 2008 update. Nucleic Acids Research, 36:D637–D640, 2008. URL http://nar.oxfordjournals.org/cgi/content/abstract/gkm1001v1. [17] J. L. Sussman, D. Lin, N. O. Manning, J. Prilusky, O. Ritter, and E. E. Abola. Protein data bank (PDB): database of Three-Dimensional structural information of biological macromolecules. Biological Crystallography, D54:1078–1084. URL http://scripts.iucr. org/cgi-bin/paper?S0907444998009378. [18] S Bamford, E Dawson, S Forbes, J Clements, and et al. The COSMIC (Catalogue of somatic mutations in cancer) database and website. British Journal of Cancer, 91(2): 355–358, July 2004. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2409828/. [19] M.P. Lefranc, V. Giudicelli, C. Busin, and et al. IMGT, the international ImMunoGeneTics database. Nucleic Acids Research, 26(1):297–303, 1998. URL http://nar. oxfordjournals.org/cgi/content/abstract/26/1/297. [20] Christian von Mering, Lars J. Jensen, Michael Kuhn, and et al. String 7–recent developments in the integration and prediction of protein interactions. Nucleic Acids Research, 35:358–362, 2007. [21] David S. Wishart, Craig Knox, An Chi Guo, and et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34:D668–D672, 2006. URL http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_1/D668. [22] Ron Edgar, Michael Domrachev, and Alex E. Lash. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207–210. URL http://nar.oxfordjournals.org/cgi/content/full/30/1/207?ijkey= oxmpowsears7o&keytype=ref&siteid=nar. [23] Alvis Brazma, Helen Parkinson, Ugis Sarkans, and et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1):68–71, 2003. URL http://nar.oxfordjournals.org/cgi/content/abstract/31/1/68. [24] Gavin Sherlock, Tina Hernandez-Boussard, Andrew Kasarskis, and et al. The stanford microarray database. Nucleic Acids Research, 29(1):152–155. URL http://nar. oxfordjournals.org/cgi/content/abstract/29/1/152. [25] Minoru Kanehisa, Susumu Goto, Shuichi Kawashima, and et al. The KEGG resource for deciphering the genome. Nucleic Acids Research, 32:D277–D280, 2004. URL http: //nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D277. 78  Bibliography [26] Peter D. Karp, Monica Riley, Suzanne M. Paley, and Alida Pellegrini-Toole. The MetaCyc database. Nucleic Acids Research, 30(1):59–61. URL http://nar.oxfordjournals.org/cgi/ content/abstract/30/1/59. [27] G. Joshi-Tope, M. Gillespie, I. Vastrik, and et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research, 33:D428–D432. 
URL http://nar.oxfordjournals.org/ cgi/content/abstract/33/suppl_1/D428. [28] Lincoln Stein. Genome annotation: from sequence to biology. Nature Reviews Genetics, 2:493–503, July 2001. URL http://www.nature.com/nrg/journal/v2/n7/abs/nrg0701_493a. html. [29] Chris Simon, Thomas R. Buckley, Francesco Frati, James B. Stewart, and Andrew T. Beckenbach. Incorporating molecular evolution into phylogenetic analysis, and a new compilation of conserved polymerase chain reaction primers for animal mitochondrial DNA. Annual Review of Ecology, Evolution, and Systematics, 37:545–579, December 2006. URL http://arjournals.annualreviews.org/doi/full/10.1146/annurev.ecolsys.37. 091305.110018?cookieSet=1. [30] Olof Emanuelsson, Soren Brunak, Gunnar von Heijne, and Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols, 2:953–971, 2007. URL http://www.nature.com/nprot/journal/v2/n4/abs/nprot.2007.131.html. [31] Christian von Mering, Roland Krause, Berend Snel, Michael Cornell, Stephen G. Oliver, Stanley Fields, and Peer Bork. Comparative assessment of large-scale data sets of protein– protein interactions. Nature, 417:399–403, May 2002. URL http://scholar.google.ca/ scholar?cluster=14172930869186324954&hl=en. [32] Joel N. Hirschhorn and Mark J. Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6:95–108, February 2005. URL http://www.nature.com/nrg/journal/v6/n2/abs/nrg1521.html. [33] Julie L Morrison, Rainer Breitling, Desmond J Higham, and David R Gilbert. GeneRank: using search engine technology for the analysis of microarray experiments. BMC Bioinformatics, 6(233), September 2005. URL http://www.biomedcentral.com/14712105/6/233/abstract. [34] Xiaotu Ma, Hyunju Lee, Li Wang, and Fengzhu Sun. CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data. Bioinformatics, 23(2):215–221, 2007. URL http://bioinformatics.oxfordjournals.org/cgi/content/ abstract/23/2/215. [35] Aaron N. Chang. Prioritizing genes for pathway impact using network analysis. In Protein Networks and Pathway Analysis, volume 563 of Methods in Molecular Biology, pages 141–156. July 2009. URL http://www.springerprotocols.com/Abstract/doi/10.1007/9781-60761-175-2_8. [36] Vincent Detours, Jacques E. Dumont, Hugues Bersini, and Carine Maenhaut. Integration and cross-validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters, 546(1):98–102. [37] Rafael A Irizarry, Daniel Warren, Forrest Spencer, and et al. Multiple-laboratory comparison of microarray platforms. Nature Methods, 2:345–350, 2005. URL http: //www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html.  79  Bibliography [38] Homin K. Lee, Amy K. Hsu, Jon Sajdak, Jie Qin, and Paul Pavlidis. Coexpression analysis of human genes across many microarray data sets. Genome Research, 14:1085–1094, 2004. URL http://genome.cshlp.org/content/14/6/1085.abstract. [39] Shui Qing Ye. Bioinformatics: A Practical Approach. Chapman & Hall/CRC, August 2007. URL http://www.amazon.ca/Bioinformatics-Practical-Approach-Shui-Qing/ dp/1584888105/ref=sr_1_1?ie=UTF8&s=books&qid=1255984955&sr=1-1. [40] Lincoln Stein. Creating a bioinformatics nation. Nature, 417:119–120, 2002. URL http: //www.nature.com/nature/journal/v417/n6885/full/417119a.html. [41] Alvis Brazma, Maria Krestyaninova, and Ugis Sarkans. Standards for systems biology. Nature Reviews Genetics, 7:593–605, 2006. 
URL http://www.nature.com/nrg/journal/v7/ n8/abs/nrg1922.html. [42] RE: LSID: what’s still needed to make it work within the semantic web?, retrieved Feb 4, 2011. URL http://lists.w3.org/Archives/Public/public-semweb-lifesci/2005Mar/0004. html. [43] My conversation with sean martin about LSIDs, retrieved Feb 4, 2011. URL http:// lists.w3.org/Archives/Public/www-tag/2006Jul/0041. [44] Huimin Zhao and Sudha Ram. Entity identification for heterogeneous database integrationa multiple classifier system approach and empirical evaluation. Information Systems, 30(2):119–132, April 2005. [45] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364, December 1986. URL http://portal.acm.org/citation.cfm?id=27634. [46] Kalpdrum Passil, Louise Lane, Sanjay Madria, and et al. A model for XML schema integration. In E-Commerce and Web Technologies, volume 2455, pages 193–202, Berlin / Heidelberg, 2002. Spring. URL http://www.springerlink.com/content/47lan7d1xgrdqr8f/. [47] Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: The state of the art. The Knowledge Engineering Review, 18(1):1–31, 2003. URL http://journals.cambridge.org/ action/displayAbstract?fromPage=online&aid=183299. [48] Pipeline (Unix), retrieved Feb 4, 2011. URL http://en.wikipedia.org/wiki/Pipeline_ (Unix). [49] Thomas J Lee, Yannick Pouliot, Valerie Wagner, and et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics, 7(170), 2006. URL http: //www.biomedcentral.com/1471-2105/7/170. [50] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65–74, March 1997. URL http://portal.acm.org/citation.cfm?id=248603.248616&coll=GUIDE&dl=ACM&idx= J689&part=periodical&WantType=periodical&title=ACM%20SIGMOD%20Record.  [51] Markus Fischer, Quan K Thai, Melanie Grieb, and Jrgen Pleiss. DWARF - a data warehouse system for analyzing protein families. BMC Bioinformatics, 7(495), 2006. URL http://www.biomedcentral.com/1471-2105/7/495. [52] Jongsun Park, Bongsoo Park, Kyongyong Jung, and et al. CFGP: a web-based, comparative fungal genomics platform. Nucleic Acids Research, 36:D562–D571. URL http: //nar.oxfordjournals.org/cgi/content/abstract/36/suppl_1/D562. 80  Bibliography [53] Zukang Feng, Li Chen, Himabindu Maddula, and et al. Ligand depot: a data warehouse for ligands bound to macromolecules. Bioinformatics, 20(13):2153–2155, 2004. URL http: //bioinformatics.oxfordjournals.org/cgi/content/abstract/20/13/2153. [54] Christian Kuenne, Ivo Grosse, and Inge Matthies. Using data warehouse technology in crop plant bioinformatics. Journal of Integrative Bioinformatics, 4(1), 2007. URL http: //journal.imbio.de/article.php?aid=88. [55] Sohrab P Shah, Yong Huang, Tao Xu, and et al. Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6(34). URL http://www.biomedcentral.com/14712105/6/34. [56] Katerina Michalickova, Gary D Bader, Michel Dumontier, and et al. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics, 3(32), 2002. URL http://www.biomedcentral.com/1471-2105/3/32. [57] T. Etzold, A. Ulyanov, and P. Argos. SRS: information retrieval system for molecular biology data banks. Methods in Enzymology, 266:114–128, 1996. URL http://www.ncbi. nlm.nih.gov/pubmed/8743681. [58] SRS@EBI (srs.ebi.ac.uk), retrieved Feb 4, 2011. 
URL http://srs.ebi.ac.uk/srsbin/cgibin/wgetz?-page+srsq2+-noSession. [59] Alan Ruttenberg, Jonathan A. Rees, Matthias Samwald, and M. Scott Marshall. Life sciences on the semantic web: the neurocommons and beyond. Briefings in Bioinformatics, 10(2):193–204, 2009. URL http://bib.oxfordjournals.org/cgi/content/abstract/bbp004? ijkey=S1KuIBDQvKkvKUV&keytype=ref. [60] Nikesh Kotechna, Kyle Bruck, William Lu, and Nigam Shah. Pathway knowledge base: An integrated pathway resource using BioPAX. Applied Ontology, 3(4):235–245, December 2008. URL http://portal.acm.org/citation.cfm?id=1516156. [61] Extensible markup language (XML), retrieved Feb 4, 2011. URL http://www.w3.org/XML/. [62] The ASN.1 consortium, retrieved Feb 4, 2011. URL http://www.asn1.org/. [63] Resource description framework (RDF), retrieved Feb 4, 2011. URL http://www.w3.org/ RDF/. [64] Xerces java parser, retrieved Feb 4, 2011. URL http://xerces.apache.org/xerces-j/. [65] D Fenyo. The biopolymer markup language. Bioinformatics, 15(4):339–340, 1999. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/15/4/339. [66] Insdc, retrieved Feb 4, 2011. URL http://www.insdc.org/. [67] Allison Waugh, Patrick Gendron, Russ Altman, and et al. RNAML: a standard syntax for exchanging RNA information. RNA, 8:707–717, July 2002. URL http://journals. cambridge.org/action/displayAbstract?fromPage=online&aid=114721. [68] John Westbrook, Nobutoshi Ito, Haruki Nakamura, and et al. PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics, 21(7):988–992, 2005. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/7/988. [69] Phillipp N Seibel, Jan Krger, Sven Hartmeier, and et al. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics, 7(490), November 2006. URL http://www.biomedcentral.com/1471-2105/7/490. 81  Bibliography [70] PSI-MI, retrieved Feb 4, 2011. URL http://www.psidev.info/index.php?q=node/60. [71] Paul T Spellman, Michael Miller, Jason Stewart, and et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology, 3, 2002. URL http://genomebiology.com/2002/3/9/RESEARCH/0046.9/ABSTRACT/ABSTRACT/ abstract/. [72] MINiML (MIAME notation in markup language), retrieved Feb 4, 2011. URL http: //www.ncbi.nlm.nih.gov/geo/info/MINiML.html. [73] M. Hucka, A. Finney, H.M. Sauro, and et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, 2003. URL http://bioinformatics.oxfordjournals.org/ cgi/content/abstract/19/4/524. [74] Catherine M. Lloyd, Matt D. B. Halstead, and Poul F. Nielsen. Progress in biophysics and molecular biology : CellML: its future, present and past. Progress in Biophysics and Molecular Biology, 85(2-3):433–450, 2004. [75] Andrew Chatraryamontri, Arnaud Ceol, Luisa Montecchi Palazzi, and et al. MINT: the molecular INTeraction database. Nucleic Acids Research, 35:D572– D574, 2007. URL http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/ D572?ijkey=dODhj4m5HbrlWs1&keytype=ref. [76] Henning Hermjakob, Luisa Montecchi-Palazzi, Chris Lewington, and et al. IntAct: an open source molecular interaction database. Nucleic Acids Research, 32:D452–D455, 2004. URL http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D452. [77] Lena Strmbck and Patrick Lambrix. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics, 21(4):4401–4407, 2005. 
URL http: //bioinformatics.oxfordjournals.org/cgi/content/abstract/21/24/4401. [78] MGED home, retrieved Feb 4, 2011. URL http://www.mged.org/. [79] Tim F Rayner, Philippe Rocca-Serra, Paul T Spellman, and et al. A simple spreadsheetbased, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics, 7, 2006. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1687205/?tool=pubmed. [80] Hong-Hai Do, Sergey Melnik, and Erhard Rahm. Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems, volume 2593, pages 221– 237, Berlin / Heidelberg, 2009. Springer. URL http://www.springerlink.com/content/ 4wbxb1jbw8w4ek9v/. [81] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):28–37, May 2001. URL http://md1.csa.com/partners/viewrecord.php?requester= gs&collection=TRD&recid=114048AN&q=&uid=788564238&setcookie=yes. [82] World wide web consortium (W3C), retrieved Feb 4, 2011. URL http://www.w3.org/. [83] W3C semantic web activity, retrieved Feb 4, 2011. URL http://www.w3.org/2001/sw/. [84] Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall. The semantic web revisited. IEEE Intelligent Systems, 21(3):96–101, 2006. URL http://portal.acm.org/citation.cfm?id= 1155313.1155373. [85] Turtle - terse RDF triple language, retrieved Feb 4, 2011. URL http://www.w3.org/ TeamSubmission/turtle/. 82  Bibliography [86] RDF/XML syntax specification (Revised), retrieved Feb 4, 2011. URL http://www.w3. org/TR/rdf-syntax-grammar/. [87] LargeTripleStores - ESW wiki, retrieved Feb 4, 2011. URL http://esw.w3.org/topic/ LargeTripleStores. [88] SPARQL query language for RDF, retrieved Feb 4, 2011. URL http://www.w3.org/TR/rdfsparql-query/. [89] Benjamin M. Good and Mark D. Wilkinson. The life sciences semantic web is full of creeps! Briefings in Bioinformatics, 7(3):275–286, 2006. URL http://bib.oxfordjournals.org/ cgi/content/full/7/3/275?ijkey=QgZhvAQ4xvjfqzK&amp;keytype=ref. [90] Fran¸cois Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. [91] Anja Jentzsch, Jun Zhao, and Oktie Hassanzadeh. Linking open drug data. 2009. URL http://triplify.org/files/challenge_2009/LODD.pdf. [92] DailyMed homepage, retrieved Feb 4, 2011. URL http://dailymed.nlm.nih.gov/dailymed/ about.cfm. [93] SIDER side effect resource, retrieved Feb 4, 2011. URL http://sideeffects.embl.de/. [94] Biogateway overview, retrieved Feb 4, 2011. URL http://www.semantic-systems-biology. org/biogateway. [95] Ben P. Vandervalk, E. Luke McCarthy, and Mark D. Wilkinson. Moby and moby 2: Creatures of the deep (Web). Briefings in Bioinformatics, 10(2):114–128, 2009. URL http://bib.oxfordjournals.org/cgi/content/abstract/bbn051. [96] W.N. Borst, J.M. Akkermans, and J.L. Top. Engineering ontologies. International Journal of Human-Computer Studies, 46(2-3):365–406, 1997. URL http://en.scientificcommons. org/17613581. [97] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, 2003. URL http://portal.acm.org/ citation.cfm?id=885746. [98] Evren Sirin, Bijan Parsia, Bernardo Cuenca Grau, Aditya Kalyanpur, and Yarden Katz. Pellet: A practical OWL-DL reasoner. Web Semantics: Science, Services and Agents on the World Wide Web, 5(2):51–53, June 2007. [99] Volker Haarslev and Ralf Moller. 
Appendix A

Supporting Material for Chapter 2

A.1 An Example RDF Service Description for a SADI Service, Obtained by Performing an HTTP GET on the URL http://sadiframework.org/examples/uniprot2pubmed

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:mygrid="http://www.mygrid.org.uk/mygrid-moby-service#">
  <mygrid:serviceDescription rdf:about="http://sadiframework.org/examples/uniprot2pubmed">
    <mygrid:hasServiceDescriptionText>Returns PubMed ids associated with a UniProt record.</mygrid:hasServiceDescriptionText>
    <mygrid:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UniProt-to-Pubmed</mygrid:hasServiceNameText>
    <mygrid:hasOperation>
      <mygrid:operation rdf:about="http://sadiframework.org/examples/uniprot2pubmed#operation">
        <mygrid:outputParameter>
          <mygrid:parameter rdf:about="http://sadiframework.org/examples/uniprot2pubmed#output">
            <mygrid:objectType rdf:resource="http://sadiframework.org/examples/uniprot2pubmed.owl#AnnotatedUniProtRecord"/>
          </mygrid:parameter>
        </mygrid:outputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter rdf:about="http://sadiframework.org/examples/uniprot2pubmed#input">
            <mygrid:objectType rdf:resource="http://purl.oclc.org/SADI/LSRN/UniProt_Record"/>
          </mygrid:parameter>
        </mygrid:inputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
  </mygrid:serviceDescription>
</rdf:RDF>
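The description above is plain RDF/XML, so any RDF library can consume it. The following sketch is not part of the SHARE codebase; it assumes the Python rdflib package and a live service at the URL above, and simply lists the OWL classes declared as the service's input and output:

from rdflib import Graph, Namespace

MYGRID = Namespace("http://www.mygrid.org.uk/mygrid-moby-service#")

# An HTTP GET on the service URL returns the RDF/XML document shown above.
g = Graph()
g.parse("http://sadiframework.org/examples/uniprot2pubmed", format="xml")

# Each mygrid:parameter node carries a mygrid:objectType triple naming the
# OWL class that the service consumes (input) or produces (output).
for param, _, owl_class in g.triples((None, MYGRID.objectType, None)):
    print(param, "->", owl_class)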
Appendix B

Supporting Material for Chapter 3

B.1 Variants of Training Queries Used to Evaluate the GREEDY Optimization Algorithm

B.1.1 Variants of Query 1

All four variants share the following PREFIX declarations and SELECT clause; they differ only in the fixed UniProt accession and in the order of the triple patterns.

PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core2:   <http://bio2rdf.org/core:>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core:    <http://purl.uniprot.org/core/>
PREFIX uniprot: <http://bio2rdf.org/uniprot:>

SELECT ?substitution ?start ?end ?dbCrossRef

Variant 0):

WHERE {
  uniprot:P01344 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation core:substitution ?substitution
}

Variant 1):

WHERE {
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  uniprot:Q9P126 core:annotation ?annotation .
  ?annotation core:substitution ?substitution .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?range core:end ?end
}

Variant 2):

WHERE {
  uniprot:Q9H6R6 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?annotation rdfs:seeAlso ?dbCrossRef .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?range core:end ?end .
  ?annotation core:substitution ?substitution .
  ?range core:begin ?start
}

Variant 3):

WHERE {
  uniprot:Q9H116 core:annotation ?annotation .
  ?annotation core:range ?range .
  ?range core:begin ?start .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?annotation core:substitution ?substitution .
  ?range core:end ?end .
  ?annotation rdfs:seeAlso ?dbCrossRef
}
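As the listing shows, each variant is the same basic graph pattern with a different fixed accession and a different pattern order. A minimal sketch of how such variants might be generated (the generation procedure here is an assumption for illustration, not code from the thesis):

import random

# The shared triple patterns of Query 1, with a placeholder for the accession.
PATTERNS = [
    "uniprot:{acc} core:annotation ?annotation",
    "?annotation core:range ?range",
    "?range core:begin ?start",
    "?range core:end ?end",
    "?annotation rdf:type core2:Natural_Variant_Annotation",
    "?annotation rdfs:seeAlso ?dbCrossRef",
    "?annotation core:substitution ?substitution",
]

def make_variant(accession):
    """Fill in a fixed accession and randomly permute the triple patterns."""
    patterns = [p.format(acc=accession) for p in PATTERNS]
    random.shuffle(patterns)
    return "WHERE {\n  " + " .\n  ".join(patterns) + "\n}"

# e.g. the accessions used in Variants 0-3 above:
for acc in ["P01344", "Q9P126", "Q9H6R6", "Q9H116"]:
    print(make_variant(acc))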
B.1.2 Variants of Query 3

All four variants share the following PREFIX declarations and query body; they differ only in the DrugBank identifier of the fixed drug.

PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank:  <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX bio2rdf:   <http://bio2rdf.org/ns/bio2rdf#>
PREFIX drug:      <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>

Variant 0):

SELECT ?targetProtein ?gene ?disease ?diseaseName
WHERE {
  drug:DB01038 drugbank:target ?targetProtein .
  ?targetProtein drugbank:hgncId ?gene .
  ?gene bio2rdf:xOMIM ?omim .
  ?disease diseasome:omim ?omim .
  ?disease rdfs:label ?diseaseName
}

Variant 1): The same query with drug:DB00161 in place of drug:DB01038.

Variant 2): The same query with drug:DB03881.

Variant 3): The same query with drug:DB04896.
B.1.3 Variants of Query 5

All four variants share the following PREFIX declarations, SELECT clause, and FROM clause.

PREFIX sadi: <http://sadiframework.org/ontologies/properties.owl#>
PREFIX ss:   <http://semanticscience.org/resource/>
PREFIX GO:   <http://lsrn.org/GO:>

SELECT ?protein ?omim
FROM <http://dev.biordf.net/~benv/properties.owl>

Variant 0):

WHERE {
  ?protein sadi:hasFunction GO:0004872 .       # "receptor activity"
  ?protein sadi:isParticipantIn GO:0007595 .   # "lactation"
  ?protein sadi:isCausallyRelatedWith ?omim .
}

Variant 1):

WHERE {
  ?protein sadi:isParticipantIn GO:0048588 .
  ?protein sadi:isCausallyRelatedWith ?omim .
  ?protein sadi:hasFunction GO:0004872
}

Variant 2):

WHERE {
  ?protein sadi:isParticipantIn GO:0007283 .
  ?protein sadi:isCausallyRelatedWith ?omim .
  ?protein sadi:hasFunction GO:0004872
}

Variant 3):

WHERE {
  ?protein sadi:hasFunction GO:0004872 .
  GO:0006260 sadi:hasParticipant ?protein .
  ?protein sadi:isCausallyRelatedWith ?omim
}
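Each run of a training variant yields observations (for example, how many bindings a given predicate returns) that can inform the ordering of later queries. As a toy illustration only (SHARE's actual bookkeeping is described in Chapter 3; the per-predicate running average below is an assumption made for this sketch):

from collections import defaultdict

class PredicateStats:
    """Toy running average of result sizes, keyed by predicate URI."""

    def __init__(self):
        self.totals = defaultdict(int)   # total bindings observed per predicate
        self.samples = defaultdict(int)  # number of observations per predicate

    def record(self, predicate, num_bindings):
        # Called after a triple pattern has been resolved during execution.
        self.totals[predicate] += num_bindings
        self.samples[predicate] += 1

    def expected_bindings(self, predicate, default=float("inf")):
        # A pessimistic default makes patterns with known-small results
        # preferred over never-seen ones when ordering the next query.
        if self.samples[predicate] == 0:
            return default
        return self.totals[predicate] / self.samples[predicate]

stats = PredicateStats()
stats.record("http://purl.uniprot.org/core/annotation", 42)
print(stats.expected_bindings("http://purl.uniprot.org/core/annotation"))  # 42.0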
B.2 Randomly Generated Orderings for Test Queries Used to Evaluate the GREEDY Optimization Algorithm

B.2.1 Orderings for Query 2

All five orderings share the following PREFIX declarations and SELECT clause; only the order of the triple patterns differs.

PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core2:    <http://bio2rdf.org/core:>
PREFIX affy:     <http://bio2rdf.org/ns/affymetrix#>
PREFIX probeset: <http://bio2rdf.org/affymetrix:>
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core:     <http://purl.uniprot.org/core/>

SELECT ?geneID ?proteinID ?snpID ?goProcess ?goComponent ?goFunction

Ordering 0):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?proteinID core:annotation ?annotation .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xEnsembl ?geneID
}

Ordering 1):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xEnsembl ?geneID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent
}

Ordering 2):

WHERE {
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID
}

Ordering 3):

WHERE {
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID .
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xSwissProt ?proteinID .
  ?proteinID core:annotation ?annotation
}

Ordering 4):

WHERE {
  ?annotation rdf:type core2:Natural_Variant_Annotation .
  ?proteinID core:annotation ?annotation .
  probeset:53701_at affy:xGene_Ontology_Molecular_Function ?goFunction .
  probeset:53701_at affy:xGene_Ontology_Cellular_Component ?goComponent .
  ?annotation rdfs:seeAlso ?snpID .
  probeset:53701_at affy:xGene_Ontology_Biological_Process ?goProcess .
  probeset:53701_at affy:xEnsembl ?geneID .
  probeset:53701_at affy:xSwissProt ?proteinID
}
B.2.2 Orderings for Query 4

All five orderings share the following PREFIX declarations and SELECT clause.

PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX owl:      <http://www.w3.org/2002/07/owl#>
PREFIX core:     <http://purl.uniprot.org/core/>
PREFIX omim:     <http://bio2rdf.org/mim:>
PREFIX drug:     <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/>

SELECT ?targetProtein ?alzProtein ?alzProteinName

Ordering 0):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  drug:DB01273 drugbank:target ?target .
  ?alzProtein dc:title ?alzProteinName .
  ?target drugbank:swissprotId ?targetProtein .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant .
  ?targetProtein core:interaction ?interaction
}

Ordering 1):

WHERE {
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?participant owl:sameAs ?alzProtein .
  ?targetProtein core:interaction ?interaction .
  ?interaction core:participant ?participant .
  ?alzProtein dc:title ?alzProteinName
}

Ordering 2):

WHERE {
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction .
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?alzProtein dc:title ?alzProteinName .
  ?interaction core:participant ?participant .
  ?participant owl:sameAs ?alzProtein
}

Ordering 3):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant .
  ?alzProtein dc:title ?alzProteinName .
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction
}

Ordering 4):

WHERE {
  ?alzProtein rdfs:seeAlso omim:104300 .
  ?alzProtein dc:title ?alzProteinName .
  drug:DB01273 drugbank:target ?target .
  ?target drugbank:swissprotId ?targetProtein .
  ?targetProtein core:interaction ?interaction .
  ?participant owl:sameAs ?alzProtein .
  ?interaction core:participant ?participant
}
B.2.3 Orderings for Query 6

All five orderings share the following PREFIX declarations, SELECT clause, and FROM clause.

PREFIX UniProt: <http://lsrn.org/UniProt:>
PREFIX sadi:    <http://sadiframework.org/ontologies/properties.owl#>
PREFIX GO:      <http://lsrn.org/GO:>

SELECT DISTINCT ?motif
FROM <http://dev.biordf.net/~benv/properties.owl>

Ordering 0):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  GO:0006915 sadi:hasParticipant ?protein .
  ?protein sadi:hasMotif ?motif
}

Ordering 1):

WHERE {
  ?protein sadi:isParticipantIn GO:0006915 .
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif
}

Ordering 2):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif .
  GO:0006915 sadi:hasParticipant ?protein
}
Ordering 3):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  ?protein sadi:hasMotif ?motif .
  ?protein sadi:isParticipantIn GO:0006915
}

Ordering 4):

WHERE {
  ?protein sadi:isHomologousTo UniProt:Q93038 .
  GO:0006915 sadi:hasParticipant ?protein .
  ?protein sadi:hasMotif ?motif
}
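A natural use for these fixed orderings is as baselines: execute each one verbatim and compare its wall-clock time and result count against the plan chosen by the optimizer. The sketch below is an illustration only; it assumes a query engine that evaluates patterns in the written order, the Python SPARQLWrapper package, and a placeholder endpoint URL:

import time
from SPARQLWrapper import SPARQLWrapper, JSON

def time_query(endpoint_url, query):
    """Return (seconds elapsed, number of result rows) for one execution."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    start = time.perf_counter()
    results = sparql.query().convert()
    return time.perf_counter() - start, len(results["results"]["bindings"])

# `orderings` would hold the query strings listed above;
# "http://example.org/sparql" is a placeholder, not a real endpoint.
orderings = []  # e.g. [ordering_0, ordering_1, ...]
for i, query in enumerate(orderings):
    secs, n = time_query("http://example.org/sparql", query)
    print("ordering %d: %d results in %.2fs" % (i, n, secs))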
