UBC Theses and Dissertations

Efficient extraction of ontologies from domain specific text corpora

Li, Tianyu

Abstract

There is a huge body of domain-specific knowledge embedded in free-text repositories such as engineering documents, instruction manuals, medical references and legal files. Extracting ontological relationships (e.g., ISA and HASA) from this kind of corpus can improve users’ queries and navigation through the corpus, and can benefit applications built for these domains. Current methods for extracting ontological relationships from text usually fail to capture many meaningful relationships because they concentrate on single-word terms or very short phrases. This is particularly problematic in a smaller corpus, where it is harder to find statistically meaningful relationships. We propose a novel pattern-based algorithm that finds ontological relationships between complex concepts by exploiting parsing information to extract concepts consisting of multi-word and nested phrases. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. We compare our algorithm with previous representative ontology extraction algorithms on four real data sets and achieve consistently and significantly better results.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International