UBC Theses and Dissertations

Artificial semantics in text retrieval

Arazy, Ofer

Abstract

The increase in the amount of available information, coupled with the rising importance of information for planning and decision-making purposes, stresses the need for effective information retrieval (IR) techniques. Specifically, we are interested in the retrieval of textual information from general - i.e. large and heterogeneous - collections. One of the most critical problems impeding the performance of retrieval systems is the gap between the way in which people think about information (through semantic representations) and the natural language form of textual documents. Bridging this gap requires that documents be translated to semantic representations. For general document collections, the extraction of semantic representations has to be automated, as manual effort and the use of domain-specific resources are inappropriate. We have identified four types of artificial (i.e. automatically extracted) semantic units that are the building blocks of IR representation: 'Tokens', 'Composite Concepts', 'Synonym Concepts', and 'Topics'. These artificial semantic units have been employed in a variety of retrieval systems; however, the isolated effect of semantic units on retrieval performance has not been studied previously. This dissertation investigates the effect of semantic units on retrieval performance. Our findings suggest that (a) there are significant differences in performance between semantic units, and (b) our proposed combinations of semantic units into a coherent retrieval model result in performance gains. In addition to the academic contribution of this dissertation, our findings are of importance to practitioners interested in the design of retrieval systems.

License

For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
