Efficient extraction of ontologies from domain specific text corpora

UBC Theses and Dissertations

Li, Tianyu

Abstract

There is a large body of domain-specific knowledge embedded in free-text repositories such as engineering documents, instruction manuals, medical references and legal files. Extracting ontological relationships (e.g., ISA and HASA) from such corpora can improve user queries and navigation through the corpus, and can benefit applications built for these domains. Current methods for extracting ontological relationships from text usually fail to capture many meaningful relationships because they concentrate on single-word terms or very short phrases. This is particularly problematic in a smaller corpus, where it is harder to find statistically meaningful relationships. We propose a novel pattern-based algorithm that finds ontological relationships between complex concepts by exploiting parsing information to extract concepts consisting of multi-word and nested phrases. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. We compare our algorithm with previous representative ontology extraction algorithms on four real data sets and achieve consistently and significantly better results.

Item Metadata

Title	Efficient extraction of ontologies from domain specific text corpora
Creator	Li, Tianyu
Publisher	University of British Columbia
Date Issued	2011
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2011-12-14
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0052152
URI	http://hdl.handle.net/2429/39686
Degree (Theses)	Master of Science - MSc
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2012-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
