UBC Theses and Dissertations
Using term proximity measures for identifying compound concepts : an expolatory study Yin, Nawei
With the rapid development of information technology, individuals using the technology are liable to be overwhelmed by the excessive amounts of information available when conducting online (local or remote) document searches. It is important therefore that users specify the correct search terms. However, a user does not always know which terms to use and often the same idea can be described by different terms. Constructing lists of possible search terms for different domains would require a very substantial effort by experts in each domain. To alleviate these problems, automated techniques can be valuable to extract concepts and meaningful phrases for specific domains. This work is an exploratory study of automated extraction of compound concepts from a collection of documents in a specific domain. The concept-extraction methods used in this study employed clustering techniques based on distance measures that reflect term affinity statistics rather than techniques based on similarity measures adopted in most previous works. The study compared the effects of different methods of calculating affinities, depending on the sizes of textual units where terms co-occur and on directionality and asymmetry between terms. The accounting context was used as a case study to provide the data. An accounting expert evaluated the resulting clusters produced by the clustering program. As demonstrated by our results, the method identified meaningful accounting compound concepts and phrases. The research also indicated which affinity types generated better results. For example, affinities based on occurrence of terms within a document produced the poorest results. There was a significant manual effort involved in "preprocessing" the data prior to compound concept identification. However, we believe the techniques explored might be useful for users to search relevant information within individual domains and can be extended to support the construction of domain-specific thesauri.
Item Citations and Data