Experimenting with Automatic Concept Identification in Documents

By Andy Lu Mi
B.Eng., Tsinghua University, 1996

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS, FACULTY OF COMMERCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 2003
© Andy Lu Mi, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Management Information Systems
Faculty of Commerce
The University of British Columbia
Vancouver, Canada

ABSTRACT

Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them manage their data files. This paper expands research into domain concepts and the processes of identifying them automatically from document collections with unrestricted yet narrowly defined domains. An automatic, scalable and consistent model of concept identification is proposed, integrating automatic text indexing techniques (for example stop-wording, stemming and phrase formation), a newly developed affinity measure, and Agglomerative Hierarchical Clustering techniques. To test the results of the proposed approach quantitatively, a system based on the proposed model has been developed and implemented, and three sensitivity studies have been conducted against three collections of technical white papers. This study contributes to the development of a pair-wise word affinity measure based on word co-occurrence and the distance between the words being evaluated, together with a variety of selection criteria and thresholds for index terms (e.g. Total Frequency and Document Frequency). The study's results demonstrate that the proposed model generally delivers positive concept identification outcomes. The results of the sensitivity studies provide empirical evidence regarding the effects on concept identification outcomes of different index term selection thresholds, different sizes of co-occurrence windows, and different characteristics of document collections.

Table of Contents

Abstract
Table of Contents
Acknowledgements
Chapter 1. Introduction and Background
  1.1 Motivation
  1.2 Parsing and Pre-processing
  1.3 Traditional Concept Identification Approach
    1.3.1 Luhn and other scholars
    1.3.2 Clustering techniques used in Concept Identification
    1.3.3 Kullback-Leibler-based clustering
    1.3.4 Mutual Information-based clustering
Chapter 2. Proposed Approach
  2.1 Objectives
  2.2 Theories influencing the current approach
    2.2.1 Semantic Network
    2.2.2 Word Association and co-occurrence study
    2.2.3 Concept Definition
  2.3 Framework of the proposed approach
  2.4 Parsing and Pre-processing
  2.5 Term Identification
  2.6 Building a similarity matrix for important terms
    2.6.1 Association Clusters
    2.6.2 Metric Clusters
    2.6.3 Scalar Clusters
    2.6.4 Other approaches
  2.7 Affinity Value Measure
    2.7.1 Term Affinity Measure
    2.7.2 Definitions
  2.8 Clustering Algorithms
    2.8.1 Hierarchical clustering
    2.8.2 Partitional Clustering
    2.8.3 Selecting a clustering algorithm
  2.9 Identifying concepts
Chapter 3. System Implementation
Chapter 4. Evaluation
  4.1 Measurements
    4.1.1 Relevance score
    4.1.2 Average within-cluster similarity
  4.2 What to measure
  4.3 Questions and Tests
Chapter 5. Experiments and Discussion
  5.1 Experimental Corpus
  5.2 Selection Criteria vs. Final Concept Identification Output
  5.3 The N Factor in the Affinity Value Definition and Concept Identification Outputs
  5.4 Characteristics of the Document Corpus and Concept Identification Outputs
  5.5 Overall results
Chapter 6. Conclusion
  6.1 Conclusions drawn from the Sensitivity Studies
  6.2 General conclusions regarding automatic concept identification
Chapter 7. Summary
  7.1 Contributions of the research
  7.2 Critiques of the research
  7.3 Future Research
References
Appendix A: Top 10 computer / expert identified concepts for the three corpuses
Appendix B: The processed document lists
Appendix C: List of selected terms for all three corpuses
Appendix D: Porter Stemming Algorithm
Appendix E: Notation / formula summary table
Appendix F: Original output files from the system (on CD-ROM)
Appendix G: System program code (on CD-ROM)

Acknowledgements

I would like to thank Professor Yair Wand for giving me invaluable instructions and suggestions, Professor Carson Woo and Professor Jacob Steif for being my thesis committee members, and PhD candidates Ofer Arazy and Edmund Zeto for brainstorming with me on the affinity value definition. I would also like to thank Sam Gao, Jimmy Luo, Eric Hu, and Frank Fan for reviewing and scoring the experimental results as the expert judges.

1. INTRODUCTION AND BACKGROUND

This experimental study of Automatic Concept Identification proposes an automatic, scalable and consistent concept identification approach for document collections with narrowly defined domains, and it analyses empirical evidence to support a proposed model for managing these document collections. In the study, we first describe a framework for automatic concept identification, and then implement a testing system based on the proposed framework. Finally, we evaluate the performance of the implemented system against several document collections.

1.1 Motivation

As businesses and other organizations use large sets of text documents for a growing range of purposes, Automatic Concept Identification is becoming an area of great interest. It has already been used extensively in various applications, notably including information retrieval, text mining, domain information acquisition, ontology engineering, automatic summarization, and automatic indexing.
Most efforts in the development of Automatic Concept Identification have been extended and applied to document collections with specific domains, for example collections of legal cases, biology studies, medical research, and software programming. Most of these identification techniques are suitable only for highly structured technical papers whose content can be organized using certain semantic frames (Paice & Jones, 1993), and which require knowledge of specific domains. However, with the overwhelming volume of information being generated and disseminated across the internet, the majority of which is produced as unstructured textual data, there has been a growing interest in finding effective automatic techniques that can be used in unrestricted domains with minimal need for specialized domain knowledge.

This paper proposes an approach to automatic concept identification that targets the problems attendant to non-specialized document collections. The process developed in this paper comprises three major phases. In the first phase, terms used with the greatest frequency in particular documents are identified. A term is considered to be frequent if it appears in at least δ documents, where δ is a given document frequency threshold (because the size of targeted document collections may vary, it is appropriate to define δ as a relative measure instead of as an absolute measure; therefore δ is defined as the percentage of the documents in which a given term appears, out of the total documents in a targeted document collection), and if the term appears a minimum of φ times in the entire collection, where φ is a given term frequency threshold (similar to the definition of δ, φ is defined as a certain term's frequency relative to the highest term frequency in the entire document collection). The second phase of the concept identification process introduces a new measurement, based on the relations between identified term pairs and the co-occurrence of those terms, to describe the Affinity between each pair of identified terms. Finally, in the third phase, the Affinity measure calculated in the second phase is applied as the similarity function in a hierarchical clustering program to cluster the identified terms into groups. In the clustering process, some terms merge with each other to form "term groups," while other terms remain independent. All of the terms represent concepts that are potentially relevant to organizing the document collection.

In order for the above approach to work, an automatic text indexing technique is necessary to parse and pre-process text documents, and appropriate similarity functions and clustering algorithms are required to cluster identified terms. The next section presents a review of relevant literature on automatic text indexing techniques, traditional concept identification approaches, and different similarity functions used in unsupervised clustering processes applied in the context of concept identification.

1.2 Parsing and Pre-Processing

The overwhelming volume of online information generated and disseminated across computer networks has created a significant burden for researchers and practitioners. For structured numeric data, database management systems (DBMSs) have typically been used to organize records and files. However, in the case of unstructured textual data, the processes of information extraction, processing and retrieval remain very complex and problematic.
To overcome the difficulties of processing unstructured text documents, G. Salton presents a blueprint for document processing in Automatic Text Processing (1989), which includes functions of looking up terms in a dictionary, removing stop words, stemming, and phrase formation. The algorithm for Salton's entire process is as follows:

1. Ignoring case, extract all the words from each document in the document collection.
2. Eliminate non-content-bearing stop words such as "the," "a," "on," "in," etc. (a sample list of stop words is identified by Frakes and Baeza-Yates, 1992, Chapter 7).
3. Apply a stemming algorithm (for example the Porter stemming algorithm: Porter, 1997) to identify the word stem of the remaining words, eliminating plurals, tenses, prefixes, and suffixes (see Frakes and Baeza-Yates, 1992, Chapter 8).
4. Formulate phrases by combining related adjacent words such as "New" and "York."

The above steps outline a simple pre-processing scheme; however, "an efficient implementation of the above scheme would use lexical analyzers, fast and scalable hash tables, advanced stemming algorithms, and advanced phrase extraction algorithms" (Dhillon & Modha, 1999).

1.3 Traditional Concept Identification Approach

1.3.1 Luhn and other scholars

H.P. Luhn's influential 1958 paper entitled "The Automatic Creation of Literature Abstracts" proposed that abstracts can be generated automatically by selecting, from a source text, sentences which contain strong clusters of significant words (Luhn, 1958). This approach requires an understanding of the processes of recognizing significant words; significant words are viewed as content words that are particularly closely associated with the subject matter of the documents in question. To recognize them, first a table is used to eliminate prepositions, articles, conjunctions and other common function words (stop words), and then the most frequent of the remaining word stems are accepted as significant. Luhn's paper has served as a starting point for subsequent research into automatic abstracting, statistical indexing, information retrieval and concept identification. Furthermore, during the past several decades researchers from different disciplines have used ontology, taxonomy, and computer algorithms to develop numerous complex models and techniques for concept identification. However, a problem remains: the frequency with which word clusters appear is by no means the only indication of which concepts are important. As a result, researchers have tended to concentrate their efforts in other directions, including the position of a sentence within a document or paragraph (Baxendale, 1958; Edmundson, 1969); the presence of cue words and expressions such as "important," "definitely," "in particular," "unclear," "perhaps," and "for example" (Edmundson, 1969; Rush et al., 1971); the presence of indicator constructs such as "The purpose of this research is" and "Our investigation has shown that" (Paice, 1981); and the number of semantic links between a sentence and its neighbors (Skorokhod'ko, 1972).

An alternative approach, using techniques from the study of artificial intelligence, entails performing a detailed semantic analysis of a source text, and thus constructing a semantic representation of the meaning of the document. "A set of frames, tailored to the domain of application, is normally used to facilitate the analysis and representation tasks. When analysis is complete, output templates are used to generate a textual summary from the instantiated frames" (DeJong, 1982; Rau et al., 1989).
INCIDENT TYPE: bombing
DATE: March 19
LOCATION: El Salvador: San Salvador (city)
PERPETRATOR: urban guerrilla commandos
PHYSICAL TARGET: power tower
HUMAN TARGET:
EFFECT ON PHYSICAL TARGET: destroyed
EFFECT ON HUMAN TARGET: no injury or death
INSTRUMENTATION: bomb

Figure 1.1 Template of extracted information from a terrorist report (Grishman, 1997)

Unfortunately, the knowledge base required for a system of this kind is of necessity large and complicated, and is moreover specific to the domain of application. Some idea of the complexities involved can be gained from Hahn's descriptions of his TOPIC system (Hahn, 1989, 1990). Although the performance of such systems may be reasonable for documents of a particular domain and specific genre, there seems to be little prospect of broadening such systems to cope with a wider variety of input.

1.3.2 Clustering techniques used in concept identification

Some notable previous studies have outlined the use of clustering techniques for Automatic Concept Identification. Clustering is, by definition, the unsupervised classification of patterns (e.g. patterns found in observations, data items, or feature vectors) into groups (clusters) (Jain and Murty, 1999). Patterns within a valid cluster (i.e. a cluster that is useful for Automatic Concept Identification) are more similar to each other than they are to patterns belonging to different clusters. Thus, in the context of concept identification, clustering techniques can often be used to group relevant terms together into term groups that represent potentially relevant concepts, provided that a proper similarity measure can be established between terms. The two most popular clustering methods currently used by researchers are the Kullback-Leibler clustering method and the Mutual Information clustering method (Chotimongkol and Rudnicky, 1999; McCandless and Glass, 1993).

The tasks involved in clustering include pattern representation, distance definition (i.e. measuring the degree of similarity between different patterns), grouping, and data abstraction (extracting data in a simple and compact form from a data set). Pattern representation refers to the number of classes of patterns available to the computers that perform the clustering, and the quantity, type and scale of the features available to the clustering algorithm. Due to technological and environmental factors, practitioners may be limited in their ability to develop clustering algorithms that can account for all of the characteristics and qualities of the patterns being analyzed. In the discussion of concept identification below, pattern representation is most prominently evident in the terms identified in the study. In other studies and in application, the distance or proximity of similarities is usually measured by a distance function defined over pairs of patterns (in the concept identification task contributing to this paper, distance is defined over pairs of terms). A variety of distance measures have been used by various communities (Anderberg, 1973; Jain and Dubes, 1988; Diday and Simon, 1976). A simple distance measure like the Euclidean distance [1] can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns (Michalski and Stepp, 1983). The Kullback-Leibler clustering method and the Mutual Information clustering method are most evidently distinguishable from each other in this function.

[1] The Euclidean distance is the straight-line distance between two points. In a plane, with p_1 at (x_1, y_1) and p_2 at (x_2, y_2), it is \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}; in N dimensions, the Euclidean distance between two points p and q is \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}, where p_i (or q_i) is the coordinate of p (or q) in dimension i. In some information retrieval systems, researchers use vector space models to represent documents; when they use clustering techniques to cluster documents, a Euclidean function is often used to calculate the distance between two documents, with p and q representing the two documents and p_i (or q_i) the weight of document p (or q) in dimension i; the above formula is then used to calculate the Euclidean distance between the two documents.
The grouping step can be performed in a number of ways. The output clustering can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). "Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion" (Jain & Murty, 1999). Additional techniques for the grouping operation include probabilistic (Brailovski, 1991) and graph-theoretic (Zahn, 1971) clustering methods.

1.3.3 Kullback-Leibler Clustering

In many previous studies, a number of techniques based on an analysis of the distributional statistics of words have been proposed for semantically clustering words or phrases. The Kullback-Leibler clustering technique has been used in many different areas, including Automatic Grammar Induction (McCandless and Glass, 1993), Automatic Concept Identification (Chotimongkol and Rudnicky, 1999), Speech Recognition (Foote and Silverman, 1999), and many data mining applications. In the Kullback-Leibler clustering method, the similarity between clusters is determined by the Kullback-Leibler (KL) distance. The KL distance, also called "relative entropy," is defined as follows. Supposing Y is a collection of data, the relative entropy or Kullback-Leibler distance between two probability distributions P_a and P_b is

D(P_a, P_b) = \sum_{y \in Y} P_a(y) \log \frac{P_a(y)}{P_b(y)}    (1.1)

While D(P_a, P_b) is often called a distance, it is not a true metric because it is not symmetric and does not satisfy the triangle inequality. Therefore, when researchers use this distance function in clustering techniques, they often use the divergence measure:

Div(P_a, P_b) = D(P_a, P_b) + D(P_b, P_a)    (1.2)

The distance between cluster i and cluster j is the sum of the KL distance between the left context probabilities, p(left), of the two clusters and the KL distance between the right context probabilities, p(right), of the two clusters:

Distance(i, j) = Div(p_i(left), p_j(left)) + Div(p_i(right), p_j(right))    (1.3)

"Researchers calculate p(left) and p(right) from bi-gram probability. p_i(left(v_k)) is the probability that word v_k is found to the left of words in cluster i" (Chotimongkol and Rudnicky, 1999). Similarly, p_i(right(v_k)) is the probability that word v_k is found to the right of words in cluster i. Specifically,

p_i^{left}(v_k) = p(v_k \mid cluster_i) = \frac{p(v_k, cluster_i)}{p(cluster_i)}, \qquad p_i^{right}(v_k) = \frac{p(cluster_i, v_k)}{p(cluster_i)}    (1.4)

The above Kullback-Leibler distance defines the distance/similarity function between two clusters, in relation to the entire clustering process. The grouping step of KL clustering in concept identification applications often uses hierarchical clustering algorithms, for reasons that will be explained in the latter part of this paper.
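To make formulas (1.1) and (1.2) concrete, the following is a minimal Python sketch (an illustrative reimplementation, not part of the thesis system; the probability tables and the smoothing constant eps are assumptions) of the KL distance and the symmetrized divergence used to compare two clusters' context distributions:

```python
import math

def kl_distance(pa, pb, eps=1e-12):
    """Relative entropy D(Pa, Pb) of formula (1.1).

    pa and pb map each outcome y to its probability; eps guards against
    zero probabilities in pb (a simple smoothing assumption, not part of
    the thesis definition).
    """
    return sum(p * math.log(p / max(pb.get(y, 0.0), eps))
               for y, p in pa.items() if p > 0.0)

def divergence(pa, pb):
    """Symmetric divergence Div(Pa, Pb) of formula (1.2)."""
    return kl_distance(pa, pb) + kl_distance(pb, pa)

# Illustrative left-context distributions for two clusters.
p_left_i = {"customer": 0.5, "data": 0.3, "system": 0.2}
p_left_j = {"customer": 0.4, "data": 0.4, "market": 0.2}
print(divergence(p_left_i, p_left_j))
```

Formula (1.3) would then add this divergence computed over the left-context distributions of the two clusters to the one computed over their right-context distributions.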
1.3.4 Mutual Information-based Clustering

The Mutual Information clustering method is very popular for processing text documents and natural languages (Brown et al., 1992; Rudnicky et al., 1999). "Mutual Information," or simply "information," was introduced by Shannon in his landmark 1948 paper "A Mathematical Theory of Communication." When Mutual Information clustering is applied to concept identification, the process is very similar to that of KL-based clustering. It defines the similarity between words or clusters based on information that is shared with adjacent words or clusters. The algorithm assigns each word to its own cluster, then iteratively merges clusters in an economical manner, so that at each iteration "the loss in average mutual information is minimized" (Chotimongkol and Rudnicky, 1999). This merging process continues until the desired number of clusters is reached. Average mutual information (AMI) is defined by the following equation:

AMI = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}    (1.5)

where p(i, j) is the bi-gram probability of cluster i and cluster j, i.e. the probability that a word in cluster i precedes a word in cluster j (Chotimongkol and Rudnicky, 1999).

2. Proposed Approach

2.1 Objectives

Previous studies on automatic concept identification can be useful for the development of a new approach to the automatic identification of important concepts in document collections that are unrestricted, yet with narrowly defined domains. We propose several desired features for such an approach:

1. It should be easily portable across different restricted domains, to ensure that minimal specific knowledge regarding the domains is required.
2. The system should be scalable and able to deal with large document collections.
3. The final output should include both concept identification results and concept creation process information, to ensure that the process of concept identification can be examined.
4. The process should accommodate the optional injection of domain-specific knowledge to improve the final outcome.
5. If possible, the system should be independent of any specific language.

Based on the above objectives, it is apparent that automatic text indexing techniques (such as stop-word removal and stemming) are needed to identify relevant index terms for the target document collections, and that unsupervised clustering techniques are needed to classify terms into groups that represent potential concepts. In addition, the study needs to provide empirical evidence of the effects on concept identification results of different selection criteria for index terms, different parameters used in similarity functions, and different characteristics of document collections.
Because a number of previous studies have already focused on the use of unsupervised clustering techniques to group words and phrases for concept identification, the present study focuses primarily on applying the newly developed affinity value function as a similarity function in the clustering process of concept identification tasks.

2.2 Theories Influencing the Current Approach

2.2.1 Semantic Network

The first theory that has inspired the development of the present concept identification framework is the Semantic Network theory developed by Professor Joseph D. Novak at Cornell University in the 1960s. It was originally based on the theories of David Ausubel, who stressed the importance of prior knowledge to the acquisition of new concepts. Novak concluded, "Meaningful learning involves the assimilation of new concepts and propositions into existing cognitive structures" (Jonassen, 1996). David Jonassen has elaborated the relationship between knowledge and learning in more detail:

Semantic networks are knowledge representation schemes involving nodes and links (arcs or arrows) between nodes. The nodes represent objects or concepts and the links represent relations between the nodes. The links are directed and labelled; thus, a semantic network is a directed graph. In print, the nodes are usually represented by circles or boxes and the links are drawn as arrows between the circles. Semantic networks are spatial representations of ideas and their interrelationships. The semantic networks and the maps that represent them are composed of nodes (concepts or ideas) that are connected by links (statements of relationships). (Jonassen, 1996)

"The purpose of semantic networks is to represent the organization of ideas in a content domain" (Jonassen, 1996). As Jonassen has explained, semantic networks can be used to represent structural knowledge, particularly as "structural knowledge is the knowledge of how ideas within a domain are integrated and interrelated" (Jonassen, 1996). Based on Semantic Network theory, a graphic representation of a specific content domain can be built using a diagram of nodes, with each node representing a major knowledge base (concept) within that domain. When a specific domain's document collection is targeted, if the index terms of that document collection can be identified and if interrelationships between the index terms can be established, a Semantic Network (a representation of structural knowledge) of that specific domain's document collection can be built, and both the single-term and multiple-term concepts within the document collection can be identified.

2.2.2 Word Association and Co-occurrence Study

Several researchers have used statistical measures to identify sets of associated words, which can be used in various natural language processing tasks. As Church and Hanks (1990) have said, "word association has been studied for some time in the fields of psycholinguistics (by testing human subjects on words) and linguistics (where meaning is often based on how words co-occur with each other), as well as by those researching natural language processing" (Church and Hanks, 1990). Notable studies that have used statistical measures to identify sets of associated words for use in various natural language processing tasks have been produced by Hindle and Rooth (1990), Dagan (1990), McDonald et al. (1990), and Wilks et al. (1990).
Central to the study of word association is the concept that "words which occur frequently with a given word may be thought of as forming a 'neighborhood' of that word" (Guthrie et al., 1991). Moreover, these neighborhoods assist people in determining precise meanings. For example, as noted by Church and Hanks (1989), the word "bank" has different meanings in different contexts. If we assume that it has only two meanings - a repository for money and the ground at the edge of a river - we would expect the first meaning of "bank" to co-occur frequently with words like "money," "loan," and "robber," while the second meaning would be associated with "river," "bridge," and "earth." In order to disambiguate "bank" in a text, neighborhoods can be produced for each meaning and intersected with the terms within the text, under the assumption that the neighborhood that shares more words with the text will identify the correct sense. In the concept identification task associated with this paper, the true "meaning" of a concept (or term) is identified by studying its "neighborhood" - its associations with other terms and concepts.

2.2.3 Concept Definition

As the preceding discussion notes, a concept's true meaning can only be identified when its associations with other concepts are studied. Likewise, to understand the structural knowledge of a specific domain, the relationships between the concepts included in the domain must be studied. Admittedly, the term "concept" is itself complex, and in a number of papers in the areas of information retrieval and language processing, researchers have used the words "concept" and "term" interchangeably. Therefore, for the purposes of the present study, the definitions of the two words can be combined, and a concept can thus be defined as a word or group of words having a particular meaning, or a general idea derived or inferred from specific instances or occurrences. However, a specific definition of concept is still necessary for use in the study's concept identification task. For example, if we try to identify concepts in a document collection of a specific domain, "CRM system technology, strategy and implementation," we would encounter terms like "business," "system," "data," "intelligence," and "analysis." In their own right, these terms may very well be independent concepts, but in the specific domain, they can merge together to form a whole new concept: "business intelligence and data analysis system," a very important category in the CRM system. Thus, the present concept identification task should identify not only single-term concepts, but also the groups of terms that form complex concepts representing relevant knowledge within the specific domain. Furthermore, the second task is more important and challenging than the first. In the concept identification context, a concept will be defined as a term or a group of terms that refers to a relevant knowledge point of the specific domain. Hereafter, in order to avoid confusion, the final outcomes of the concept identification system will be referred to as concepts, and single words or phrases resulting from automatic text processing will be referred to as terms.

Researchers in human-computer interaction and information science have suggested that important terms should be expanded into concepts, and that terms with similar meanings should be linked. For example, Furnas et al. (1987) have proposed "unlimited aliasing," which creates multiple identities for the same underlying concepts.
In information science, Bates (1986) has proposed the use of a domain-specific dictionary to expand user vocabularies (terms) in order to allow users to "dock" onto the system more easily. The general idea of identifying important terms and linking similar terms is sound, and its usefulness has been verified in previous research. However, as Lin and Chen (1996) have noted, "the bottleneck for such techniques is often the manual process of identifying terms and linking similar or synonymous terms. For example, the effort involved in creating an up-to-date, complete, and subject-specific thesaurus is often overwhelming and the resulting thesaurus may quickly become obsolete unless consistently maintained."

2.3 Framework of the Proposed Approach

The automatic concept identification approach proposed in this study is divided into five phases (see Figure 2.1):

1. Pre-processing. At this phase, all the documents in the targeted domain are prepared for future processing. The preparation includes the conversion of all words to lower case, stop-word removal, word stemming and phrase formation.
2. Term identification. At this phase, all the terms in the document collection are scanned to record their frequency in each document and their total frequency throughout the document set. Based on these two indicators, the index terms (important terms) are identified.
3. Affinity value calculation. At this phase, the affinity value (average affinity value) between each pair of identified terms is calculated, based on the newly developed affinity value formula, and a term-term affinity matrix is generated as an input source for the next phase.
4. Clustering. At this phase, an unsupervised clustering technique is chosen to group terms into term groups based on the affinity value matrix.
5. Identifying concepts. At the last phase, domain knowledge is applied to convert the clustering results (term groups) into concepts.

The detailed processes will be further explained in the next section.

Figure 2.1 The automatic concept identification approach

2.4 Parsing and Pre-Processing

A simple parsing and pre-processing scheme was introduced above, in Section 1.2, with the acknowledgement that an efficient implementation of the scheme would use lexical analyzers, fast and scalable hash tables, advanced stemming algorithms, and advanced phrase extraction algorithms (Dhillon and Modha, 1999). Although pre-processing is an integral part of the automatic concept identification framework being proposed, those techniques are not a focus of the present research. Instead, only the most simple and popular techniques (algorithms) available have been selected for implementation in the empirical experiments conducted in this study. For stop-word removal, the stop-word list used by the Rainbow package (http://www-2.cs.cmu.edu/~mccallum/bow/) has been adopted. (The Rainbow toolkit is designed for statistical language modeling, text retrieval, classification and clustering, and is developed and maintained by Andrew McCallum of CMU.) For the stemming algorithm, the Porter stemming algorithm has been used; it is the most popular and widely used of all existing algorithms. It was originally presented by Porter in 1980, and it has since been reprinted in Jones and Willett (1997).

Removing suffixes by automatic means is an operation that is particularly useful in the field of unstructured text document processing. Typically, a vector of words or terms can represent a document, and terms with a common stem will usually have similar meanings.
For example, the following terms can be treated as equivalent (this example is taken from Porter's originally published algorithm; please refer to the details in Appendix D):

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

Frequently, the performance of a text document processing system will be improved if groups of terms with common stems are conflated into a single term. This may be done by removing the various suffixes (e.g. -ed, -ing, -ion, and -ions) to leave the single term (e.g. CONNECT). In addition, the suffix-stripping process reduces the total number of terms in the document collection, and hence reduces the size and complexity of the data to be processed. In the Porter stemming algorithm, the stemming process is broken into five steps; in each step, certain operations are applied to the term according to specific rules, and complex suffixes are removed bit by bit. For example, GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). A complete and detailed set of examples of the Porter stemming algorithm can be found in Appendix D.

2.5 Term Identification

While automatic indexing identifies terms in the document collection, the relative importance of each term in representing the underlying concepts may vary. That is, some of the terms used may be more important than others. Salton's vector space model (1997) assigns each term a weight to represent its descriptive power (a measure of importance). Among the many probabilistic techniques that have been developed by various information science researchers, techniques that, like Salton's, incorporate term frequency and inverse document frequency have been found useful. In the context of the present study, term frequency is used to refer to the number of times a term appears in a given document, and document frequency refers to the number of documents in the entire database that contain a given term. Because the size of targeted document collections may vary, in the current study it is appropriate to use relative measures instead of absolute measures. Thus, document frequency is defined as the percentage of documents a term appears in, out of the total number of documents in the targeted document collection, and term frequency is defined as the ratio of a given term's frequency to the frequency of the most common term in the entire document collection.

For concept identification research, in contrast to studies of automatic indexing, document frequency is a more suitable means of choosing index terms than inverse document frequency. For information retrieval, where the task is to emphasize the uniqueness of a particular document, inverse document frequency is appropriate, but for concept identification, the primary task is the identification of important terms in the whole document collection; therefore, document frequency is suitable. The specific thresholds for selecting index terms may vary depending on the size of the document collection and specific domain knowledge. Because previous studies have not firmly established standards for choosing index terms, researchers are often forced to rely on their experience and common sense. In the present study, various empirical criteria are posited to set thresholds for index term selection.
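As an illustration of the relative thresholds described above, the following is a minimal Python sketch (not part of the thesis system, which was implemented in Visual Basic; function names, variable names and threshold values are illustrative assumptions) that selects index terms using a relative document frequency threshold δ and a relative term frequency threshold φ:

```python
from collections import Counter

def select_index_terms(documents, delta=0.10, phi=0.05):
    """Select index terms from tokenized documents.

    documents: list of documents, each a list of (already stemmed) tokens.
    delta: minimum fraction of documents a term must appear in
           (relative document frequency threshold).
    phi:   minimum term frequency, relative to the frequency of the most
           frequent term in the whole collection.
    """
    total_freq = Counter()            # collection-wide term frequency
    doc_freq = Counter()              # number of documents containing the term
    for tokens in documents:
        total_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    max_freq = max(total_freq.values())
    return {term for term in total_freq
            if doc_freq[term] / n_docs >= delta
            and total_freq[term] / max_freq >= phi}

# Example: keep terms that appear in at least half of the documents and have
# at least 40% of the most frequent term's collection-wide frequency.
docs = [["custom", "relationship", "manag"], ["custom", "data", "manag"],
        ["market", "custom", "data"]]
print(select_index_terms(docs, delta=0.5, phi=0.4))
```

Because both thresholds are ratios rather than absolute counts, the same settings can be carried across document collections of different sizes, which is the motivation given above for using relative measures.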
2.6 Building a Similarity Matrix for Important Terms

The first stage of cluster analysis theory (Baeza-Yates and Ribeiro-Neto, 1999) involves the conversion of raw data (e.g. terms and weights) into a matrix of distance/similarity measures between any pair of terms. The similarity measure computation is primarily based on the probabilities of terms co-occurring in the documents within a database. The probabilistic weights of the terms indicate their strength of relevance or association. In previous studies, researchers have developed a variety of functions to calculate the similarity/distance matrix. Besides the two similarity/distance functions mentioned in Section 1.3 above (the Mutual Information and Kullback-Leibler distances), several other functions have been used primarily in information retrieval, namely association clusters, metric clusters, and scalar clusters. The following are brief descriptions of these functions, taken from Baeza-Yates and Ribeiro-Neto's Modern Information Retrieval (1999, Chapter 5).

2.6.1 Association Clusters

An association cluster is based on the co-occurrence of terms in documents. In effect, terms that occur together frequently within documents have a synonymity association. Association clusters are generated as follows.

Definition: The frequency of a term t_i in a document d_j is referred to as f(i, j). Let m = (m_{i,j}) be an association matrix with one row per term and one column per document, where m_{i,j} = f(i, j), and let m^t be the transpose of m. The matrix s = m * m^t is a local term-term association matrix. Each element S(u, v) of s expresses a correlation C(u, v) between the terms S_u and S_v, namely

C(u, v) = \sum_{j} f(S_u, j) * f(S_v, j)    (2.1)

The resulting matrix s is the similarity matrix that will be used in the clustering operation.

2.6.2 Metric Clusters

Association clusters are based on the frequency of the co-occurrence of term pairs within documents, and they do not take into account where the terms occur within the documents. However, because two terms that occur in the same sentence are likely to be more closely correlated than two terms that occur far apart in a document, it might be worthwhile to factor the distance between the two terms into the computation of their correlation factor. Metric clusters are based on this idea.

Definition: Let the distance r(k_i, k_j) between two keywords k_i and k_j be given by the number of words between them in the same document. If k_i and k_j are in distinct documents, then r(k_i, k_j) = \infty. A local term-term metric correlation matrix s is defined as follows. Each element S(u, v) of s expresses a metric correlation C(u, v) between the terms S_u and S_v, namely

C(u, v) = \sum_{k_i} \sum_{k_j} \frac{1}{r(k_i, k_j) + 1}    (2.2)

2.6.3 Scalar Clusters

One additional means of deriving a synonymity relationship between two local terms S_u and S_v is by comparing the sets S_u(n) and S_v(n), under the assumption that two terms with similar neighbourhoods have a synonymity relationship. In this case, the relationship can be seen as an indirect one, or as one that is induced by the neighbourhood. One way of quantifying such a neighbourhood relationship is to arrange all correlation values S(u, i) in a vector S_u, and all correlation values S(v, i) in another vector S_v, and then to compare these vectors through a scalar measure. For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure.
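As a small illustration of the scalar (cosine) comparison just described, the sketch below (illustrative only; the association matrix rows shown are made up) computes the cosine similarity between the correlation vectors of two terms:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two correlation vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows of a hypothetical term-term association matrix s for two terms.
s_u = [12, 3, 0, 7]     # correlations S(u, i) of term u with every term i
s_v = [10, 1, 2, 5]     # correlations S(v, i) of term v with every term i
print(cosine(s_u, s_v))   # close to 1.0 when the two neighbourhoods are similar
```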
2.6.4 Other Approaches

There are other approaches to calculating similarity matrices; for example, some approaches consider the unique characteristics of individual terms. These include the position of a term (e.g. a term in a title compared to a term in the abstract of a paper), the number of words in a term, the date when the term appears (i.e. the publication year of the document including the term), and the identifiable type of the term (e.g. a person's name, a subject descriptor, a gene name, a company's name, etc.).

2.7 Affinity Value Measure

2.7.1 Term Affinity Measure

The proposed affinity measure between two terms is based on their co-occurrence and the distance between them when they co-occur in the text documents. If two terms constantly appear close to each other in the documents in which they co-occur, then they have a high affinity measure. Therefore the affinity value captures two aspects of the relationship between the terms: a spatial relationship (the distance between the two terms when they co-occur) and a temporal relationship (how often they co-occur). Based on these theories and assumptions, the affinity measure between two terms can be built from their spatial and temporal relationships. Because two terms that appear in the same sentence seem to have a closer affinity than two terms that occur in different sentences, and because two terms that appear close together have a closer affinity than two terms that occur far apart, it is worthwhile to factor distance and sentence factors into the computation of the affinity measure. Furthermore, because the affinity value represents the relationship between a pair of terms and is the opposite of distance (the shorter the distance, the larger the affinity), in order to be mathematically sound, the affinity value must be symmetric.

2.7.2 Definitions

Distance: The number of terms between the two terms being evaluated, plus one. Therefore, if two terms are next to each other, their distance is 1. D(t_i, t_j) is used to represent the distance between two terms t_i and t_j.

Affinity value: For two terms t_i and t_j, the affinity measure is defined as Affinity(t_i, t_j). If the distance between t_i and t_j is d (see the distance definition above), then the inverse distance factor is 1/d. If the threshold number of sentences used to evaluate the affinity measure is N (N is less than the total number of sentences in the document, as two terms appearing only in different documents are not considered to have an affinity measure), and the number of sentences between t_i and t_j is n (if t_i and t_j are in the same sentence, then n = 0; if t_i and t_j are in two adjacent sentences, then n = 1, and so on), the sentence factor is (N - n). Finally, the normalization factor 1/N is added, and the affinity value is calculated as follows:

A(t_i, t_j) = \sum W / D(t_i, t_j)    (2.7.1)

where the sentence factor W is

W = 1              if t_i and t_j appear in the same sentence
W = (N - n) / N    if n < N
W = 0              if n >= N

D(t_i, t_j) is the distance between term t_i and term t_j (see the distance definition above), and the summation accumulates over all co-occurrences of t_i and t_j in the entire document collection.
For example, if a certain document includes two occurrences of t_i and two occurrences of t_j within the evaluation scope (N), then the co-occurrence is counted four times (the affinity contributions may differ due to different values of d and n), and the contributions are accumulated together under this measurement. For all t_i = t_j, the affinity value between the terms is calculated as zero.

Co-occurrence number: This is the total number of co-occurrences of two terms in the entire document collection. For all term pairs, the co-occurrence number matrix is CN(t_i, t_j).

Average affinity value: Because the total affinity value of two terms is an accumulation over all the co-occurrences of the two terms, it is very important to know the average affinity value of each co-occurrence. Because, as stated before, the affinity value contains two aspects of the terms' relationship to each other, a spatial relationship and a temporal relationship, when the affinity value is used to build the affinity matrix for clustering analysis, it is appropriate to separate the spatial and temporal relationships. Therefore the definition of the average affinity value is:

AVE(t_i, t_j) = A(t_i, t_j) / CN(t_i, t_j)    if CN(t_i, t_j) ≠ 0
AVE(t_i, t_j) = 0                              if CN(t_i, t_j) = 0    (2.7.2)

The detailed algorithm that realizes this definition is as follows:

1. Each token is evaluated from left to right, and as each token is evaluated, it is treated as the pivot point.
2. The term is then compared to all succeeding tokens in the next N sentences (including the one it is in), or until the end of the document.
3. The affinity measure is only given to term pairs which consist of two different terms.
4. When a co-occurrence of t_i and t_j is found, (N - n) / (N * D) is calculated and accumulated into Affinity(t_i, t_j).
5. A co-occurrence is added to CN(t_i, t_j) to record the co-occurrence number of the two terms.
6. Sentences are separated by a period, an exclamation mark, or a question mark.
7. The program runs the scan process from the beginning to the end of each document.
8. After processing all the documents, the program calculates the average affinity value matrix AVE(t_i, t_j), based on the affinity value matrix Affinity(t_i, t_j) and the co-occurrence number matrix CN(t_i, t_j).

The following is an example of the calculation of the affinity value and the average affinity value. Assume there is one document, Doc1, with the following content:

Doc1: "A B. B A." (where A and B are identified terms)

In calculating the affinity value, N = 2 is selected as the parameter.

Pair | Affinity                              | CN (number of co-occurrences) | Average Affinity
A-A  | 0                                     | NA                            | 0
A-B  | 1 + 0.25 = 1.25; 1.25 + 1.25 = 2.5    | 1 + 1 = 2; 2 + 2 = 4          | 0.625
B-A  | 0.25 + 1 = 1.25; 1.25 + 1.25 = 2.5    | 1 + 1 = 2; 2 + 2 = 4          | 0.625
B-B  | 0                                     | NA                            | 0

Table 2.1 Example of affinity and average affinity calculation
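To make the definitions concrete, the following is a minimal Python sketch of formulas (2.7.1) and (2.7.2) and the scanning algorithm above (an illustrative reimplementation, not the thesis's Visual Basic module; the document representation and function names are assumptions). Run on the Doc1 example, it reproduces the values in Table 2.1.

```python
from collections import defaultdict

def affinity_matrices(documents, terms, N=2):
    """Compute Affinity(ti, tj), CN(ti, tj) and AVE(ti, tj).

    documents: list of documents, each a list of sentences, each a list of
    (already stemmed) tokens.  terms: the set of identified index terms.
    N: the sentence window used to evaluate co-occurrence.
    """
    affinity = defaultdict(float)   # accumulated affinity values A(ti, tj)
    cn = defaultdict(int)           # co-occurrence numbers CN(ti, tj)

    for doc in documents:
        flat, pos = [], 0           # (sentence index, document position, token)
        for s, sentence in enumerate(doc):
            for token in sentence:
                flat.append((s, pos, token))
                pos += 1
        for idx, (s_i, p_i, ti) in enumerate(flat):   # each token is a pivot
            if ti not in terms:
                continue
            for s_j, p_j, tj in flat[idx + 1:]:        # succeeding tokens only
                n = s_j - s_i                          # sentences between them
                if n >= N:
                    break                              # outside the window
                if tj == ti or tj not in terms:
                    continue
                d = p_j - p_i                          # terms between them, plus one
                w = 1.0 if n == 0 else (N - n) / N     # sentence factor W
                for pair in ((ti, tj), (tj, ti)):      # the measure is symmetric
                    affinity[pair] += w / d
                    cn[pair] += 1

    ave = {pair: affinity[pair] / cn[pair] for pair in affinity if cn[pair]}
    return affinity, cn, ave

# Doc1 = "A B. B A." with N = 2 reproduces Table 2.1:
doc1 = [["A", "B"], ["B", "A"]]
aff, cn, ave = affinity_matrices([doc1], {"A", "B"}, N=2)
print(aff[("A", "B")], cn[("A", "B")], ave[("A", "B")])   # 2.5 4 0.625
```

Crediting each co-occurrence to both orderings of the pair keeps the resulting matrix symmetric, as required in Section 2.7.1.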
2.8 Clustering Algorithms

In the following sections, several clustering algorithms are briefly reviewed. The following definitions, descriptions and algorithms are taken from "Data Clustering: A Review" (Jain and Murty, 1999).

2.8.1 Hierarchical Clustering

Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and single clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The results of a hierarchical clustering algorithm can be graphically displayed using a tree diagram, called a dendrogram. This tree diagram graphically displays the merging process and the intermediate clusters. The dendrogram in Figure 2.2 demonstrates how four points can be merged into a single cluster: P2 and P3 merge first into (P2, P3), which then merges with P4 into (P2, P3, P4), and finally merges with P1 into (P1, P2, P3, P4).

Figure 2.2 Example of hierarchical clustering

For document clustering, this dendrogram provides a taxonomy, or hierarchical index. There are two basic approaches to generating a hierarchical clustering:

Agglomerative: Begin with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

Divisive: Begin with one all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, each step requires decisions about which cluster to split and how to perform the split.

Agglomerative techniques are more common, and the traditional agglomerative hierarchical clustering procedure is summarized as follows:

Simple Agglomerative Clustering Algorithm
1. Compute the similarity between all pairs of clusters; i.e. calculate a similarity matrix whose ij-th entry gives the similarity between the i-th and j-th clusters.
2. Merge the most similar (closest) two clusters.
3. Update the similarity matrix to reflect the pair-wise similarity between the new cluster and the original clusters.
4. Repeat steps 2 and 3 until only a single cluster remains.
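The procedure above can be written down compactly. The following is a minimal Python sketch (an illustrative reimplementation, not the thesis's Visual Basic clustering module) of simple agglomerative clustering over an affinity matrix; average-link similarity between clusters is used as one reasonable choice for step 3, since the procedure itself does not fix the update rule. The sketch records every merge so that the clustering process, and not just the final partition, can be examined.

```python
def agglomerative_clustering(items, sim):
    """Simple agglomerative hierarchical clustering.

    items: list of terms.  sim: dict mapping (ti, tj) pairs to a similarity
    value (e.g. the average affinity values of Section 2.7).  Returns the
    list of merges, from first to last, so the dendrogram can be inspected.
    """
    clusters = [frozenset([t]) for t in items]

    def cluster_sim(a, b):
        # Average-link: mean pairwise similarity between members (an assumption;
        # the thesis text does not specify the update rule of step 3).
        pairs = [(x, y) for x in a for y in b]
        return sum(sim.get(p, 0.0) for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Step 2: find and merge the most similar (closest) pair of clusters.
        a, b = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b), cluster_sim(a, b)))
    return merges

# Toy affinity values for the CRM example terms (illustrative numbers only).
terms = ["business", "intelligence", "data", "analysis", "system"]
sim = {("business", "intelligence"): 0.9, ("data", "analysis"): 0.8,
       ("intelligence", "data"): 0.3, ("analysis", "system"): 0.4}
sim.update({(b, a): v for (a, b), v in sim.items()})   # keep it symmetric
for left, right, s in agglomerative_clustering(terms, sim):
    print(left, "+", right, round(s, 3))
```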
2.8.2 Partitional Clustering

In contrast to hierarchical techniques, partitional clustering techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of clusters, partitional approaches typically find all K clusters at once. This is in contrast to traditional hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one. Of course, a hierarchical approach can be used to generate a flat partition of K clusters, and likewise, the repeated application of a partitional scheme can provide a hierarchical clustering. K-means is based on the idea that a center point can represent a cluster. In particular, K-means uses the notion of a centroid, the mean or median point of a group of points. Note that a centroid almost never corresponds to an actual data point.

2.8.3 Selecting a Clustering Algorithm

Of the many clustering algorithms, the agglomerative hierarchical clustering (AHC) algorithm has been selected and implemented in the system being developed. Because AHC iteratively merges terms or clusters together in the order of their affinity value, terms are clustered into a treelike structure in which each leaf cluster corresponds to an identified term of the specific document collection. The intermediate nodes that are close to the leaves represent simpler (general) term groups (concepts), while the intermediate nodes that are close to the root represent more complex (specific) term groups (concepts) within the specific document collection.

Using the previous example, concepts are identified in a specific domain document collection, "CRM technology, strategy, implementation, and products." The terms "business," "system," "data," "intelligence," and "analysis" are first identified as important terms. If they are placed into a treelike structure, they are at the leaves of the tree (see Figure 2.3). Agglomerative hierarchical clustering provides a more flexible method of understanding and interpreting the structure of the concepts, inasmuch as a group of concepts can be merged into a more complex and domain-specific concept. In this example, "business" first merges with "intelligence" to form a domain-specific concept: "business intelligence." "Data" merges with "analysis" to form a complex concept: "data analysis." Finally, all the terms merge to form a complex domain-specific concept: "business intelligence, data analysis system," which represents a very important knowledge base in the specific domain. In this way, not only can the final clustering outcome be examined, but also the clustering process, which can be used to evaluate the concept identification (creation) process; this is also highly significant and valuable in the knowledge recognition/development process.

Figure 2.3 Agglomerative hierarchical clustering example (the terms "business," "intelligence," "data," "analysis," and "system" merge into the concept "business intelligence, data analysis system")

2.9 Identifying Concepts

The last phase of the current approach is closely linked to the clustering process. Up to this point, domain-specific knowledge has not been used in the concept identification effort. At this step, however, expert knowledge is introduced to convert the results of the last phase (term groups) into final output concepts. Also, because agglomerative hierarchical clustering stops only when a single cluster remains, expert knowledge must be incorporated to determine the stopping point for each cluster. For example, regarding the above cluster, expert knowledge is necessary to decide whether it should merge with other clusters or stop at this point. Expert knowledge is also needed to summarize the concepts of each cluster of terms. The currently proposed approach involves the submission of the output clusters of each merging step of the clustering process to four individual judges, who rate them on a five-point scale according to their relevance to the specific domain. (The details of the measurement will be introduced in the "Evaluation" section below.) Subsequently, the top ten term clusters (based on their scores) are selected and summarized into concepts as the final output of the system.

3. System Implementation

To create a visual presentation of the entire concept identification process, which includes stop-word removal, tokenizing, statistics collecting, threshold determination, affinity measurement calculation, and clustering, a Windows-based program with a GUI interface has been developed using Visual Basic 6.0 (see Figure 3.1).
Figure 3.1 Screenshot of the application window

Users can load documents into the system, remove stop-words, use the Porter algorithm to stem words into tokens, retrieve important statistics about the document corpus (e.g. document frequency and word frequency), set different thresholds based on the aforementioned statistics to filter out unimportant tokens, calculate affinity matrices based on the newly developed affinity measures, and, finally, put important tokens into several clusters using the selected clustering algorithm. The complete process can be monitored in the application's display window.

Figure 3.2 A document loaded into the application

Figure 3.2 displays a document named "D:\papers\02.txt" loaded into the application. Figure 3.3 then illustrates that "D:\papers\02.txt" has been tokenized, meaning that it has gone through the stop-word removal, stemming and phrase formation processes. The new file is saved as "D:\papers\02.tok."

Figure 3.3 The document has been tokenized

The distributable program is 2.3 MB in size, and can be executed on Windows 98/2000/NT/XP platforms. It consists of a single-form interface and five general modules:

1. The stop-word removal module. This uses the list of stop words from the Rainbow toolkit package (http://www-2.cs.cmu.edu/~mccallum/bow/), which includes 545 common stop words.
2. The stemming module. This uses the Porter stemming algorithm (1997) to remove suffixes.
3. The statistics module. This records important statistics for all terms in the document collection, including the total number of terms, the total frequency of each term, and the proportion of documents in which each term appears. The last two statistics are used to identify important index terms.
4. The affinity measure module. This calculates the affinity value between each pair of index terms based on the newly developed measure. The results are saved in a file in term-term matrix format.
5. The clustering module. The agglomerative hierarchical clustering algorithm has been adopted (please refer to the previous section for details). The final output is saved in a file.

4. Evaluation
4. Evaluation

The ultimate goal of the concept identification task is to obtain a set of clusters that, as precisely as possible, identifies and organizes all of the significant concepts in the domain of interest. Consequently, the quality of the output clusters themselves must be evaluated. Supposing that the correct set of concepts and their members is given, the quality of the output clusters can then be evaluated by comparing them against these reference concepts. However, if the pre-defined concepts are found only in specialized fields requiring advanced technical knowledge, the concepts cannot be obtained beforehand, and a subjective evaluation method must be adopted to test the concept identification system.

4.1 Measurements

4.1.1 Relevance score

A straightforward approach to measuring outcomes involves the recruitment of a group of judges to look at each set of clusters and to identify concepts. Using the results of this exercise, the mean score of every cluster and the correlation among the concepts generated by the judges can be calculated. Higher correlation among judges should indicate better concept identification results. Although this approach is rather subjective and time-consuming, as well as being dependent on the judges' domain knowledge and ability to use the English language, it seems to be the only viable evaluation method available at present. For the present experiments, four individual judges have been employed, all of whom are IT professionals working for high-profile technology companies, with an average of over six years of experience in network and MIS system technology and implementation. A five-point scale measuring instrument has been adopted to assess the results, with a rating of 5 denoting that the posited correlations are highly reasonable, a rating of 1 indicating that proposed pairs are nonsensical, and a rating of 3 indicating a moderate amount of correlation between the items in a pair. For example, one clustering iteration on corpus one (CRM) merges the tokens "communication" and "channel." This may be rated "highly relevant," while another iteration, in which the tokens "enterprise" and "planning" are merged, may be rated 4 by some judges - still "relevant," but less relevant than "communication channel." In contrast, the tokens "make" and "expect," when merged, may be rated 1 or "totally irrelevant." Through this process, the average "relevance" value can be examined, the point at which the "relevance" value for a cluster begins to decline (an indicator to prevent a particular cluster from continuing to merge with other clusters) can be located, and the trends of the average "relevance" value over the clustering process (the number of terms per cluster compared to the average relevance score) can be compared.

4.1.2 Average Within-Cluster Similarity

The average similarity value within a cluster is an objective measurement that indicates the quality of each cluster, calculated as the average affinity value over all members of the cluster. For a cluster "A," ti and tj are two distinct terms in cluster "A," and N is the number of unique terms in cluster "A." The number of pairs in cluster "A" can thus be calculated using the following formula:

    CO(N) = N * (N - 1) / 2    (4.1.1)

The average affinity value within a particular cluster can be defined by the following formula:

    AveSim(cluster A) = Σ(ti, tj) Affinity(ti, tj) / CO(N)    (4.1.2)

where Σ(ti, tj) Affinity(ti, tj) is the accumulated affinity value over all unique combinations of ti and tj (ti ≠ tj), and CO(N) is calculated using formula 4.1.1.
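Formulas 4.1.1 and 4.1.2 translate directly into a few lines of code. The Python sketch below is an illustration only, not the thesis implementation; the affinity lookup is the same hypothetical term-term dictionary assumed earlier.

    def average_within_cluster_similarity(cluster, affinity):
        # cluster: a set of unique index terms; affinity: dict of frozenset pairs -> affinity value.
        terms = sorted(cluster)
        n = len(terms)
        pairs = n * (n - 1) // 2                      # CO(N) = N*(N-1)/2, formula 4.1.1
        if pairs == 0:
            return 0.0                                # a single-term cluster has no pairs
        total = sum(affinity.get(frozenset({terms[i], terms[j]}), 0.0)
                    for i in range(n) for j in range(i + 1, n))
        return total / pairs                          # AveSim, formula 4.1.2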
This measurement is based on the distance between terms in the cluster, term co-occurrence in the cluster, and the clustering method.

4.2 What to Measure

As stated before, because the agglomerative hierarchical clustering algorithm provides a more flexible method of understanding and interpreting the structure of concepts, the process of clustering represents the process of recognizing and developing knowledge. Each step in the clustering process generates a cluster of terms, although the number of terms in the cluster may vary widely. The challenge then is to test whether the cluster (of terms) can become a potential concept for the specific domain. When evaluating the concept identification system, each step of the clustering process is evaluated until the subsequent merging steps cease to be reasonable. Based on the results of the clustering process, the top ten judge-identified concepts (those clusters with the highest relevance scores) and the top ten computer-identified concepts (those clusters with the highest average within-cluster similarity values) are selected as the final concept identification outputs. Limiting the number of concepts to ten may be arbitrary, but it is generally appropriate given the size of the document collections, the number of index terms selected, and the need of human beings to work with a manageable number of concepts.

4.3 Questions and Tests

The following are several questions and tests that were posed before conducting an empirical study of the proposed concept identification system.

Question 1: Do different selection thresholds for index terms have different impacts on concept identification outputs?

Lower thresholds for index term selection lead to the selection of more terms for the clustering process; therefore, it is logical to pose the following test:

Test 1a: Do different selection thresholds for index terms (document frequency and term frequency) affect the recall of concept identification (a measurement of the completeness of concept identification)? The rationale for this question is that an increased quantity of selected terms should lead to the identification of more concepts.

Test 1b: Do different selection thresholds for index terms (while maintaining a certain minimal number of index terms) have any impact on the precision of identifying the top ten concepts (a measure of accuracy, where accuracy requires that a certain number of the top ten concepts identified by the approach are correct reference concepts)?

The rationale behind this question is that the number of documents in which a given term appears and the frequency of that term across the entire set of documents are, generally speaking, correlated (please refer to the token list table, Appendix C), and thus when the thresholds are changed (e.g. expressed as a percentage of the highest term frequency and a percentage of the total number of documents in the corpus), the important terms at the top of the term list (those with both high document frequency and high term frequency) will not change much. In addition, because the affinity value is linear in nature and does not take into account interactions with other, relatively infrequent terms, the affinity value between two terms will not change as the number of index terms changes.
Because the important terms and the affinity values between them remain the same, and because the hierarchical clustering algorithm is also linear in nature, the concept identification output for those terms should remain the same.

Question 2: Does the application of a different N factor (the size of the window of co-occurrence) in the affinity value definition have an impact on concept identification outputs?

The sentence factor is an empirically decided factor that determines the size of the window within which two terms are considered to have affinity. In theory, when a large N is used in the affinity definition (meaning that more sentences are considered), it tends to discover the semantic relationships among terms; alternatively, when a small N is chosen (and fewer sentences are considered), it tends to discover the phrasal relationships among terms. Therefore it is logical to conduct the following tests:

Test 2a: Will semantically relevant concepts (concepts with more terms) have a higher score than phrasally relevant concepts (concepts with fewer terms) if a larger N factor is used to measure affinity values?

Test 2b: Will semantically relevant concepts (concepts with more terms) have a lower score than phrasally relevant concepts (concepts with fewer terms) if a smaller N factor is used to define affinity values?

Question 3: Is the proposed concept identification approach more suitable for document collections with narrowly defined domains than for document collections with less narrowly defined domains?

The rationale behind this question is based on the possibility that the concept identification framework is better suited to document collections with narrowly defined domains. First, these specialized document collections have limited and focused vocabularies, so the proposed method of selecting index terms (using Document Frequency and Term Frequency as indicators) should perform well. Second, specialized document collections are often more structured, so the affinity value definition should perform well in the clustering process and subsequently generate useful clustering outputs (concepts).

5. Experiments and Discussion

5.1 Experimental Corpus

The experiments conducted for this study used three groups of documents. The first corpus of documents was taken from www.searchCRM.com, which contains 23 white papers in the domains of CRM (Customer Relationship Management) system technology, strategy, markets and implementation. The second corpus of documents was taken from the UBC library website, www.library.ubc.ca, and consisted of publications by the Gartner Group: 30 white papers on VPN (Virtual Private Network) technology, markets and product reviews. The third corpus of documents was also taken from www.library.ubc.ca, and consisted of other Gartner Group publications, including 20 white papers on SAN (Storage Area Network) technology reviews, product reviews and market trends. All the statistics of the experimental documents are summarized in the following table, and detailed indexes (title, author, publication date, etc.) for all of the documents in the experiments are provided in Appendix A.

             Number of   Average Length of      Number of       Highest Token
             Documents   the Document (words)   Unique Tokens   Frequency
    Corpus 1     23            1881                 3169            1260
    Corpus 2     30            1794                 2496            1370
    Corpus 3     20            1603                 2245             612

Table 5.1. Statistics of experimental documents
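The corpus statistics in Table 5.1, together with the per-term document frequency and total frequency used later as selection criteria, come from a single counting pass over the tokenized files. The following Python sketch shows one plausible way to gather them; it is an assumption-laden illustration (documents are assumed to have already been stop-worded and stemmed into whitespace-separated tokens, as in Section 3), not the thesis code.

    from collections import Counter

    def corpus_statistics(tokenized_docs):
        # tokenized_docs: list of strings, one per document, containing stemmed tokens.
        total_freq = Counter()     # total frequency (TF) of each token across the corpus
        doc_freq = Counter()       # number of documents in which each token appears (DF)
        lengths = []
        for doc in tokenized_docs:
            tokens = doc.split()
            lengths.append(len(tokens))
            total_freq.update(tokens)
            doc_freq.update(set(tokens))
        return {
            "number_of_documents": len(tokenized_docs),
            "average_length": sum(lengths) / len(lengths) if lengths else 0,
            "unique_tokens": len(total_freq),
            "highest_token_frequency": max(total_freq.values()) if total_freq else 0,
            "total_freq": total_freq,
            "doc_freq": doc_freq,
        }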
5.2 Selection Criteria and Final Concept Identification Output

As previously mentioned, index tokens were selected for the clustering process based on two criteria: total frequency (TF) and document frequency (DF). Because there is a lack of established standards for selecting index terms for concept identification tasks, researchers have often relied on personal experience and common sense in their studies. One example is Chen & Schatz's "Semantic Retrieval for the NCSA Mosaic" (1999), which proposed an approach that would first identify important concepts in the specific domain (their use of "concept" means a term, unlike the concept definition used in the current study). They then applied the concepts to build a concept space model and to complete the semantic retrieval task. They arbitrarily chose certain thresholds (which can be altered by the user) for Document Frequency and Term Frequency to eliminate concept descriptors with low frequency.

The selection of index terms for concept identification determines how many terms are included in the clustering process, which may subsequently affect the final results of the process. In order to determine the effects of index term selection on the concept identification outputs, a sensitivity study has been conducted for the current paper, using different combinations of Document Frequency measures (expressed as a percentage of the total number of documents in the corpus) and Total Frequency measures (expressed as a percentage of the frequency of the most frequent token in the corpus) as criteria to select index terms, and examining the resulting top ten computer-identified and expert-identified concepts. Due to the complexity of the computation, this study was conducted on the second corpus only (VPN technology, products, and markets). The corpus contained 30 white papers with 2496 different tokens; the highest token frequency in the corpus was 1370 (VPN).

                               Total Frequency (%)
    Document Frequency (%)   2% (27)   3% (42)   4% (55)
    30% (9)                    221       134       102
    50% (15)                   149       101        82

Table 5.2: Combinations of document frequency and total frequency, compared to the number of selected tokens

In the above table, the first row shows the different thresholds of Total Frequency: two percent of the highest term frequency, 1370, is 27, so the selected index terms must have a term frequency of at least 27, and so on. The first column shows the different thresholds of Document Frequency: the number of documents in the corpus is 30, and 30 percent of 30 is nine, which means that a selected index term must appear in at least nine documents in the corpus. The other cells show the number of selected index terms that satisfy the two criteria.
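The two-threshold filter described above amounts to a pair of cutoffs applied to the collected statistics. The following is a minimal Python sketch, assuming the total_freq and doc_freq counters from the statistics step; the default percentages correspond to the first cell of Table 5.2 and are assumptions for illustration only.

    def select_index_terms(total_freq, doc_freq, num_docs, tf_pct=0.02, df_pct=0.30):
        # Keep a token only if its total frequency is at least tf_pct of the highest
        # token frequency AND it appears in at least df_pct of the documents.
        tf_cutoff = tf_pct * max(total_freq.values())
        df_cutoff = df_pct * num_docs
        return sorted(t for t in total_freq
                      if total_freq[t] >= tf_cutoff and doc_freq[t] >= df_cutoff)

For the second corpus (highest token frequency 1370, 30 documents) these defaults correspond to cutoffs of 27 occurrences and 9 documents, which Table 5.2 reports as selecting 221 tokens.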
The following table exhibits the results of the concept identification task for the first group (document frequency = 30 percent, total frequency = 2 percent). The results of the other five groups can be found in Appendix A (Tables 1 to 6).

    No.  Tokens                           Average within-cluster similarity   Grade
    1    servic, tim                      1.0                                 4.5
    2    system, cisco                    1.0                                 4.5
    3    access, technologi               1.0                                 4
    4    encryption, wan                  1.0                                 4
    5    ipsec, standard                  1.0                                 4.75
    6    privat, virtual                  1.0                                 4.75
    7    sourc, destin                    1.0                                 4
    8    authent, commun                  1.0                                 4
    9    limit, number                    1.0                                 4
    10   chang, subject                   1.0                                 4

Table 5.3a: Document Frequency (30 percent) and Total Frequency (two percent): the top ten concepts identified by the computer

    No.  Tokens                           Average within-cluster similarity   Grade
    1    privat, virtual, network         0.58654                             5
    2    router, configur, criteria       0.33333                             5
    3    ipsec, standard                  1.0                                 4.75
    4    simultan, encrypt, session       0.50163                             4.75
    5    internet, connect, traffic       0.54857                             4.75
    6    client, softwar, system, cisco   0.38153                             4.5
    7    number, limit, packet            0.46564                             4.5
    8    offic, small, central            0.47862                             4.25
    9    gatewai, connec, enterpr         0.32516                             4.25
    10   end, user, shar, devic           0.28767                             4.25

Table 5.3b: Document Frequency (30 percent) and Total Frequency (two percent): the top ten concepts identified by experts

Table 5.3a lists the top ten computer-identified concepts for the second corpus (VPN technology, products and markets). The ranking is based on the average within-cluster similarity value according to the output of the clustering program, which groups terms based on their affinity values. The average within-cluster similarity value is the average affinity value between all the members of a cluster. It can be calculated using the following formulas:

    AveSim(cluster A) = Σ(ti, tj) Affinity(ti, tj) / CO(N)    (4.1.2)
    CO(N) = N * (N - 1) / 2    (4.1.1)

Please refer to the detailed definition of average within-cluster similarity in Section 4 for the explanations of CO(N), AveSim(cluster A), and Σ(ti, tj) Affinity(ti, tj). The "Grade" column is the average score of each identified concept given by the expert judges, to provide an empirical and quantifiable measurement. To measure the performance of the clustering process and the overall performance of the proposed concept identification framework, four judges were enlisted, and a five-point scale measuring instrument was adopted, as discussed in Section 4.1.1. In sample results from corpus one (CRM system implementation, technology and strategies), two tokens, "communication" and "channel," were merged to form the concept "communication channel," which could be rated 'highly relevant.' Another concept, in which the tokens "enterprise" and "planning" were merged, could be rated 4 - still relevant, but less relevant than "communication channel." If, for example, the tokens "make" and "expect" are merged, the result may be rated 1, or 'totally irrelevant.' Because an agglomerative hierarchical clustering algorithm was used, smaller clusters could be merged with others to form larger clusters in subsequent iterations. For example, in corpus two (VPN technology, markets, and product reviews), the tokens "virtual" and "private" were merged into the concept "virtual private," which in itself may not be very reasonable, but later they merged with the token "network" to form a very relevant concept: "virtual private network."

Table 5.3b lists the top ten expert-identified concepts ranked by their average score. These concepts have more semantic meaning than other concepts do, and they are more specific to the narrowly defined domain. They are the concepts that the concept identification experiments are designed to identify. When the test results were submitted to the four individual judges, the following information was provided:

1. A list of the titles of all documents in the three document collections. For the first document collection (CRM), brief abstracts for all of the documents, as shown in Appendix B, were also provided.
2. An overview of the research approach for this study, and a statement of the study's goals.
3. The final output from the clustering process, including all of the term clusters whose number of terms does not exceed ten.
Based on the information provided, the judges were asked to give the following responses:

1. Grade all term clusters using the aforementioned "relevance" score measure.
2. For the clusters graded better than the neutral (3) score, provide a brief description of the cluster (i.e. what the judges think the concept of the cluster is), if it is not self-descriptive.

In order to present the effects of index term selection on the concept identification task, two matrices are used to present the data. As detailed in the Appendix, the concept identification results of all six groups are listed in the same format as Table 5.3. Then the top ten expert-identified concepts across all six groups are noted. The resulting concepts can be regarded with a certain degree of satisfaction, as they are the ten most relevant concepts for the second corpus (VPN technology, markets, and product reviews). They are listed in the following table, recovered from their token format and returned to their original word forms. The sequences of the terms in the clusters are also adjusted accordingly.

    No.  Concepts                                      Average     Max Average   Mean Average   Std. Dev. of         Average
                                                       Similarity  Similarity    Similarity     Average Similarity   Grade
    1    Virtual, private, network                     0.58654     0.83333       0.4507722      0.1755389            5
    2    Additional, enterprise, router, gateway       0.19980     0.72222       0.3788553      0.1422380            4.75
    3    IPSEC, standard                               1.0         1.0           0.6479464      0.2987877            4.75
    4    Small, office, remote, internet, connection   0.33377     0.64167       0.2687632      0.0665883            4.75
    5    Limit, number, packet                         0.46564     0.83333       0.4507722      0.1755389            4.5
    6    Simultaneous, session, encryption             0.50163     0.83333       0.4507722      0.1755389            4.5
    7    CISCO, software, system, check, point         0.35971     0.64167       0.2687632      0.0665883            4.5
    8    Integrate, secure, authentication, protocol   0.31550     0.72222       0.3788553      0.1422380            4.5
    9    Complete, operation, data                     0.31258     0.83333       0.4507722      0.1755389            4.5
    10   End, user, sharing, device                    0.28767     0.72222       0.3788553      0.1422380            4.25

Table 5.4: Top ten expert-identified concepts of Corpus 2

The next table (Table 5.5) demonstrates how many of the top ten concepts each experiment identified. A perfect match scores 100 percent; a partial match scores a percentage between zero and 100 percent, depending on the degree of similarity in semantic meaning; no match scores zero. For all partial matches, the number of correct terms identified is divided by the total number of terms in the reference concept to determine the appropriate percentage (a small code illustration of this calculation follows Table 5.5). For example, "virtual private" would be a 67 percent match of the top concept "virtual private network." The last column of the following table shows the average hit rate (averaged over all ten reference concepts) of each experiment.

                          C1    C2    C3    C4    C5    C6    C7    C8    C9    C10    Ave
    E1 (DF=30%, TF=2%)   100%   67%  100%   67%  100%  100%   50%    0     0   100%   68.4%
    E2 (DF=30%, TF=3%)   100%  100%   33%   50%  100%  100%   50%    0   100%  100%   73.3%
    E3 (DF=30%, TF=4%)   100%  100%   33%   50%  100%  100%   67%    0   100%  100%   75%
    E4 (DF=50%, TF=2%)   100%  100%  100%  100%  100%  100%  100%    0     0   100%   80%
    E5 (DF=50%, TF=3%)   100%  100%   33%  100%    0     0   100%  100%  100%  100%   73.3%
    E6 (DF=50%, TF=4%)   100%  100%   33%  100%    0     0   100%  100%  100%  100%   73.3%

Table 5.5: Matrix of the sets of top ten concepts identified in six experiments (E1 to E6 refer to the six experiments, C1 to C10 refer to the ten concepts)

As this table reveals, all the experiments have identified a majority of the top ten concepts.
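The per-concept hit rate in Table 5.5 is simply the fraction of a reference concept's terms recovered by an identified cluster. A small Python illustration follows; the term lists are examples taken from the discussion above, not additional data.

    def hit_rate(identified, reference):
        # Perfect match -> 1.0; partial match -> fraction of reference terms recovered; no match -> 0.0.
        return len(set(identified) & set(reference)) / len(set(reference))

    # "virtual private" recovers 2 of the 3 terms of "virtual private network", i.e. about 67%.
    print(hit_rate(["virtual", "private"], ["virtual", "private", "network"]))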
Furthermore, when more index terms are used, the concept identification is more accurate, with the exception of Experiment 4 (document frequency = 50%, total frequency = 2%). Overall, however, there is not much difference between the experiments. This indicates that in identifying the most relevant concepts of a specific domain of documents, as long as a minimal number of index terms are selected, the addition of more index terms will not improve the performance of the proposed concept identification approach. In addition, it is clear that an increased quantity of index terms included in the clustering process will identify more concepts. However, because the affinity value definition only considers terms co-occurring within a certain range, beyond a minimal threshold, even if more index terms are selected for the clustering, their affinity values with other terms will be very small. Therefore the inclusion of those terms will not improve the performance of the concept identification approach; on the contrary, it may degrade the performance by introducing noise into the clustering process.

A review of the top ten computer-identified concepts in each experiment reveals an even greater similarity across all six experiments. The following table demonstrates how many of the top ten computer-identified concepts are the same in each pair of experiments.

    # of common concepts   E1 (30%, 2%)   E2 (30%, 3%)   E3 (30%, 4%)   E4 (50%, 2%)   E5 (50%, 3%)   E6 (50%, 4%)
    E1 (DF=30%, TF=2%)          -
    E2 (DF=30%, TF=3%)          6              -
    E3 (DF=30%, TF=4%)          5              8              -
    E4 (DF=50%, TF=2%)         10              6              5              -
    E5 (DF=50%, TF=3%)          6              9              8              6              -
    E6 (DF=50%, TF=4%)          5              8             10              5              8              -

Table 5.6: Top ten computer-identified concepts across six experiments

In the above table, it is evident that the top ten computer-identified concepts are more sensitive to Total Frequency (TF) than to Document Frequency (DF), and therefore E1 and E4, E2 and E5, and E3 and E6 have almost the same top ten computer-identified concepts.

5.3 The N Factor in Affinity Value Definition and Concept Identification Outputs

In Section 2.7.2, a new affinity value was defined based on the following formula:

    A(ti, tj) = Σ W / D(ti, tj)

where D(ti, tj) is the distance between term ti and term tj for a given co-occurrence, Σ means accumulation over all co-occurrences of terms ti and tj, and the weight W is defined as:

    W = 1              if ti and tj appear in the same sentence
    W = (N - n) / N    if n < N
    W = 0              if n >= N

Here N is the size of the window, in number of sentences, within which two terms are considered to co-occur (please refer to the definition in Section 2.7.2), and n is the number of sentences between the two terms plus one. The N factor is a significant parameter because it decides the size of the window within which two terms are considered to have affinity, and thus the choice of N may have an important effect on the final concept identification output (a code illustration of this computation is given below). In all the previous experiments, N was arbitrarily set at 5. Now, however, a sensitivity study can test the effect of different N values on the final concept identification outputs. Due to computational complexity, this experiment is performed only on the second corpus (VPN technology, markets, and product reviews), and a Document Frequency of 30 percent and a Total Frequency of 4 percent are used consistently. The following six tables present the top ten computer- and expert-identified concepts for the second corpus, with DF=30%, TF=4%, and N=1, 3, 5.
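The affinity definition above can be written out directly in code. The Python sketch below is an illustration of the formula as stated, not the thesis implementation; in particular, it assumes that a document is represented as a list of sentences (each a list of index terms) and that the distance D is measured in tokens, since the exact distance definition of Section 2.7.2 is not reproduced in this chapter.

    def affinity(doc_sentences, t1, t2, N=5):
        # doc_sentences: list of sentences, each a list of index terms (one document).
        # Accumulates W / D(t1, t2) over every co-occurrence of t1 and t2 inside the
        # N-sentence window, following the piecewise definition of W given above.
        occ = []                       # (sentence index, absolute token offset, token)
        offset = 0
        for s, sent in enumerate(doc_sentences):
            for w, tok in enumerate(sent):
                if tok == t1 or tok == t2:
                    occ.append((s, offset + w, tok))
            offset += len(sent)
        total = 0.0
        for s1, p1, a in occ:
            for s2, p2, b in occ:
                if a == t1 and b == t2 and p1 != p2:
                    n = abs(s2 - s1) + 1           # number of sentences between the terms, plus one
                    if s1 == s2:
                        w = 1.0                    # both terms in the same sentence
                    elif n < N:
                        w = (N - n) / N            # partial credit inside the window
                    else:
                        continue                   # n >= N: outside the window, W = 0
                    total += w / abs(p2 - p1)      # W / D for this co-occurrence
        return total

Accumulated over all documents in the collection, these values populate the term-term matrix consumed by the clustering module.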
    No.  Tokens                        Average within-cluster similarity   Grade
    1    vpn, protocol                 1.0                                 4
    2    enterpr, network              1.0                                 4
    3    enterpr, network, connect     1.0                                 4
    4    privat, virtual               1.0                                 4.75
    5    tunnel, require               1.0                                 2
    6    point, check                  1.0                                 4.5
    7    number, limit                 1.0                                 4.0
    8    system, cisco                 1.0                                 4.0
    9    access, technologi            1.0                                 4.0
    10   encryption, wan               1.0                                 4.25

Table 5.7a: DF (30%), TF (4%) and N (=1). Top ten computer-identified concepts

    No.  Tokens                               Average within-cluster similarity   Grade
    1    virtual, privat                      1.0                                 4.75
    2    enterpr, network, connect, traffic   0.75                                4.75
    3    simultan, encrypt, session           0.50163                             4.75
    4    secur, router, gatewai               0.33796                             4.75
    5    system, cisco, check, point          0.50555                             4.5
    6    number, limit, packet                0.66666                             4.5
    7    vpn, protocol, user, authent         0.61574                             4.5
    8    complet, oper, data                  0.37049                             4.5
    9    firewal, vendor, solution            0.54861                             4.5
    10   tunnel, requir, ipsec                0.5                                 4.25

Table 5.7b: DF (30%), TF (4%) and N (=1). Top ten expert-identified concepts

    No.  Tokens                 Average within-cluster similarity   Grade
    1    network, sit           1.0                                 3
    2    tunnel, require        1.0                                 2
    3    system, cisco          1.0                                 4
    4    access, technologi     1.0                                 4.75
    5    encryption, wan        1.0                                 4
    6    privat, virtual        1.0                                 4.75
    7    limit, number          1.0                                 4.5
    8    user, end              1.0                                 4.5
    9    secur, gatewai         0.75                                4.5
    10   remot, small           0.55556                             4

Table 5.8a: DF (30%), TF (4%) and N (=3). Top ten computer-identified concepts

    No.  Tokens                                   Average within-cluster similarity   Grade
    1    virtual, privat                          1.0                                 4.75
    2    secur, gatewai, connec, enterpris        0.51965                             4.75
    3    simultan, encrypt, session               0.50163                             4.75
    4    internet, connect, traffic               0.42460                             4.75
    5    system, cisco, softwar, check, point     0.35317                             4.5
    6    number, limit, packet                    0.52222                             4.5
    7    remot, small, office, connect            0.47076                             4.5
    8    complet, oper, data                      0.37049                             4.5
    9    enterpr, router, solution                0.23751                             4.25
    10   tunnel, encryption, wan, requir, ipsec   0.38683                             4.25

Table 5.8b: DF (30%), TF (4%) and N (=3). Top ten expert-identified concepts

    No.  Tokens                     Average within-cluster similarity   Grade
    1    system, cisco              1.0                                 4.5
    2    access, technologi         1.0                                 4.5
    3    encryption, wan            1.0                                 4
    4    privat, virtual            1.0                                 4.75
    5    limit, number              1.0                                 4
    6    encryption, wan, require   0.73333                             4
    7    remot, small               0.56667                             4
    8    internet, connect          0.54857                             4
    9    oper, data                 0.53636                             4
    10   user, end                  0.50357                             4

Table 5.9a: DF (30%), TF (4%) and N (=5). Top ten computer-identified concepts

    No.  Tokens                                   Average within-cluster similarity   Grade
    1    privat, virtual, network                 0.58654                             5
    2    enterpr, router, addition, gatewai       0.19980                             4.75
    3    simultan, encrypt, session               0.50163                             4.75
    4    internet, connect, traffic               0.54857                             4.75
    5    softwar, system, cisco, check, point     0.38153                             4.5
    6    number, limit, packet                    0.46564                             4.5
    7    remot, small, office                     0.43642                             4.5
    8    complet, oper, data                      0.31258                             4.5
    9    end, user, shar, devic                   0.33568                             4.25
    10   tunnel, encryption, wan, requir, ipsec   0.36191                             4.25

Table 5.9b: DF (30%), TF (4%) and N (=5). Top ten expert-identified concepts

Table 5.10 extrapolates the data from the above tables to compare the average score and the average similarity of the top ten expert-identified concepts.

           Average Similarity   Average number of terms   Average Score
    N=1    0.579664             3.2                       4.575
    N=3    0.478686             3.4                       4.55
    N=5    0.413031             3.6                       4.575

Table 5.10: N=1, 3, 5. Average similarity, average number of terms and average score for the top ten expert-identified concepts
Average similarity is based on the average affinity value between each pair of terms, and thus when the average affinity is calculated, the value should be higher than if N=3 or N=5 The average scores of the top ten expert-identified concepts across three experiments are very close, suggesting that, for identifying the top ten concepts, a different value of N will not greatly affect the performance of the proposed concept identification system. The following table lists the number of identical concepts identified by experts in each pair of experiments. # of identical N=l N=3 N=5 concepts (expert-identified top ten) N=l N=3 7 N=5 6 7 Table 5.11. The number of identical concepts identified by experts across three experiments 53 Although the average scores of the top ten expert-identified concepts are very close across three experiments, the same concepts were not identified. From the above table, it is evident that the number of identical concepts is only 60 percent to 70 percent. When N=l, more information is generated regarding the phrasal structure of the terms, while when N=5, more information is given concerning the semantic structure of the terms. For this reason, although the average scores may be close, they actually identify different concepts. Table 5.10 lists the average number of terms per cluster for the top ten expert-identified concepts throughout all three experiments, indicating that as N increases, the average number of terms per cluster of the top ten concepts increases from 3.2 to 3.6. Although this may not offer conclusive evidence (because only the top 10 concepts are considered, the concepts with fewer terms have a higher average score than concepts with more terms, as discussed later in this chapter), the results suggest that using a larger N factor tends to identify more semantically relevant concepts, and using a small N factor tends to identify concepts that are more relevant to phrasing. Further study may be needed to investigate this point. 5.4 Characteristics of a Document Corpus and Concept Identification Outputs After all three document corpuses were processed, it became evident that the second corpus has a better average score (in terms of both top ten and overall concepts) than the first corpus, particularly in regard to clusters that have more tokens. The total average score is 3.9917. If the clusters that contain more than ten tokens are not counted, the average score is 4.072. 54 The reasons for this difference could be the following: first, the second group of documents are all produced by the same source, the Gartner Group; second, VPN is a narrower domain than CRM; and third, compared to CRM, VPN is more technical, hence the articles about it are more structured. Thus, the characteristics of the document corpus evidently might affect the output of the proposed concept identification system. i A sensitivity study was performed to compare corpus one (CRM system implementation, technology, and markets) with corpus two (VPN technology, markets, and product reviews). Again, due to computational complexity, DF=30%, TF=4% and N=l were set as parameters for the study. No. Tokens Average within-cluster similarity Grade 1 custom, bas 1.0 4.5 2 process, develop 1.0 4 3 cost, benefit 1.0 4 4 manag, relationship 1.0 4.5 5 market, commun 1.0 3 6 inform, provide 1.0 4.0 7 servic, sal(e) 1.0 4.0 8 Product, model 1.0 4.0 9 tim(e), respons 1.0 5.0 10 success, offer 1.0 4.0 Table 5.12a: C R M corpus (D=30%, F=4, N=l). Top ten computer-identified concepts •No. 
    No.  Tokens                                 Average within-cluster similarity   Grade
    1    management, relationship, software     0.6805                              4.75
    2    process, develop, strategy             0.8571                              4.5
    3    time, response, efficiency             0.6667                              4.5
    4    user, environment                      0.5                                 4.25
    5    effect, measure, knowledge             0.6667                              4.25
    6    custom, bas, tool                      0.15944                             4.25
    7    include, target, group                 0.61574                             4
    8    crm, operation, application, support   0.6667                              4
    9    data, integration, technology          0.5708                              4
    10   experience, information, provider      0.66667                             4

Table 5.12b: CRM corpus (DF=30%, TF=4%, N=1). Top ten expert-identified concepts

Above, tables 5.12a and 5.12b list the results for corpus 1 (CRM). The results for corpus 2 (VPN) are shown in tables 5.9a and 5.9b.

               Average score of top ten      Standard deviation of average score       Average similarity of top ten
               expert-identified concepts    for top ten expert-identified concepts    computer-identified concepts
    Corpus 1   4.25                          0.2635                                    1
    Corpus 2   4.575                         0.23717                                   1

Table 5.13: Average score and average similarity of top ten concepts for two corpuses

The next table delineates the characteristics of the two corpuses and the number of tokens selected using the aforementioned DF and TF values.

               Number of   Average length of      Number of       Highest term   Number of selected       Percentage of selected
               documents   the document (words)   unique tokens   frequency      tokens (DF=30%, TF=4%)   tokens relative to total tokens
    Corpus 1       23           1881                  3169           1260               99                      3.12%
    Corpus 2       30           1794                  2496           1370              102                      4.09%

Table 5.14: Statistics for corpus 1 and corpus 2

Table 5.14 indicates that corpus 2 has more documents than corpus 1, and its average document length is slightly shorter. On the other hand, corpus 2 has far fewer unique tokens than corpus 1, and a higher top term frequency. In effect, corpus 2 is defined and structured more narrowly than corpus 1. After the token selection criteria are applied, the numbers of selected tokens are very close (99 and 102). The results in table 5.13 demonstrate that the top ten expert-identified concepts in corpus 2 have a higher average score than those in corpus 1 (the mean and standard deviation indicate that the results are significant).

5.5 Overall Results

Figure 5.1: Highest affinity value at each merging step of clustering (corpus 1)

Figure 5.2: Highest affinity value at each merging step of clustering (corpus 2)

Because the agglomerative hierarchical clustering algorithm was used in the clustering process, the computer identifies the highest affinity value between any two clusters at each step and merges them into one cluster. The above two figures demonstrate the highest affinity value the computer used to merge two clusters at each step.

    Number of tokens in one cluster   Mean
    2                                 4.12
    3                                 3.96
    4                                 4.03
    5                                 3.56
    6                                 3.63
    7                                 3.47
    8                                 2.33
    9                                 2.33

Table 5.15: Average scores compared to the number of tokens in one cluster (corpus 1)

    Number of tokens in one cluster   Mean
    2                                 4.17
    3                                 3.96
    4                                 4.09
    5                                 3.86
    6                                 4.0
    7                                 3.6
    8                                 3.33
    9                                 3.2

Table 5.16: Average scores compared to the number of tokens in one cluster (corpus 2)

    Number of tokens in one cluster   Mean
    2                                 4.15
    3                                 4.06
    4                                 4.01
    5                                 3.76
    6                                 3.53
    7                                 3.17
    8                                 2.03
    9                                 2.34

Table 5.17: Average scores compared to the number of tokens in one cluster (corpus 3)

The above tables reveal the relationships between the number of tokens in one cluster and their average score. When a cluster includes more tokens, the related concept is, on average, more complex.
Thus, the above tables reveal that two-token concepts have the highest average score, and there is a tendency for average scores to decrease as the number of tokens grows. Therefore, the proposed concept identification approach is better suited to identifying simpler concepts (concepts with two, three, or four terms) than more complex concepts (those with more than five terms). The same conclusion is also supported by the following three figures.

Figure 5.3: Mean of average similarity compared to the number of tokens in a cluster (corpus 1: SAN technology)

Figure 5.4: Mean of average similarity compared to the number of tokens in a cluster (corpus 2: CRM marketing and technology)

Figure 5.5: Mean of average similarity compared to the number of tokens in a cluster (corpus 3: VPN technology)

The above three figures illustrate the correlation between average similarity and the number of tokens within a cluster (defined in Section 4.1.2) for all three corpuses. The bars in the figures represent the mean average similarity value of clusters with two, three, or four tokens, and so on, respectively. As the figures demonstrate, the clusters that contain two tokens have the highest average cluster similarity. In general, as the number of tokens per cluster grows, the mean of the average within-cluster similarity decreases. This holds true for all three corpuses. Based on the definition of average within-group affinity accepted for this study, as the number of terms in a concept grows, the average concept affinity (within-group affinity) must decrease. The tendency of average within-cluster similarity values to decrease as the number of tokens per cluster grows is consistent with the tendency of average scores (given by judges) to decrease as the number of tokens per cluster grows (please refer to Tables 5.15, 5.16, and 5.17).

6. Conclusions

A list of desired features for a concept identification system was included at the beginning of Section 2, as follows:

1. It should be easily portable across different restricted domains, and minimal specific knowledge of the domain should be required.
2. The system should be scalable.
3. The output should include concept identification results and information on the concept creation process.
4. The process should accommodate the optional injection of domain-specific knowledge to improve the final outcome.
5. If possible, the system should be language independent.

The concept identification system that has been developed in this study is not perfect, but it achieves most of the objectives set out at the beginning. It is easily portable across different domains, because it does not involve any specific domain knowledge until the last step (summarizing the concepts from term clusters). However, introducing domain knowledge into the clustering process will improve the clustering results and the overall concept identification. Domain knowledge can be used to determine the stopping point in the clustering process, inasmuch as the criteria for stopping points must be decided for the agglomerative hierarchical clustering for each output cluster.
Domain knowledge can also be used to adjust the clustering process: if some merging steps are semantically or conceptually unreasonable (judging by prior familiarity with the domain), the program can be modified to skip those steps. Finally, domain knowledge is necessary to interpret the final clustering outputs, and to summarize term clusters into concepts.

Based on the algorithm used, the concept identification system is scalable, because the calculation of word affinities has linear dependence on the length of the documents and the quantity of documents, and square dependence on the number of terms selected. Furthermore, because the clustering process depends only on the number of index terms selected, clustering will not be affected if the quantity of index terms to be considered is not increased. Therefore the system has no problem dealing with large document collections.

The final output of the system is the step-by-step term merging process. For the same reason, expert knowledge can be injected into the clustering process to determine when the merging process should be stopped and when subsequent steps are irrelevant. A good example can be found in Section 2 above, where the CRM corpus is processed and several terms are identified: "business," "intelligence," "data," "analysis," and "system." This cluster of five terms can be manually combined into a composite concept that is significant in discussions of CRM systems.

Based on the techniques and architecture used in the concept identification system, it is also possible to achieve portability across different languages. The last three steps are language independent, and techniques and methods are already available for automatic indexing in multiple languages such as Chinese, German, Japanese and Spanish.

6.1 Conclusions Drawn from Sensitivity Studies

Three sensitivity studies have been conducted for this paper, to generate empirical evidence to answer the three main questions and four sub-questions in Section 4 (Evaluation). The results of the studies are presented in Section 5, with further questions and tests suggested by the initial studies.

Question 1: Do different selection thresholds for index terms have different impacts on concept identification outputs?

Test 1a: Does the use of different selection thresholds for index terms (document frequency and term frequency) affect the recall of concept identification (a measurement of the completeness of concept identification)?

Test 1b: Does the use of different selection thresholds for index terms (as long as a minimal number of index terms are retained) have any impact on the precision of identifying the top ten concepts (a measure of the accuracy of identifying the top ten concepts)?

The results of the first sensitivity study, assessing selection criteria for index tokens against the concept identification output, reveal that in terms of identifying the top ten concepts from a document set, there is not much difference in the results across the six experiments. In effect, in the process of identifying the most relevant concepts from a specific domain of documents, as long as a minimal number of index terms are selected, the addition of more index terms will not improve the performance of the concept identification approach. Because the affinity value definition considers terms only if they co-occur within a certain range (e.g.
within the same sentence, or within a group of three or five sentences), after a minimal threshold (a minimal number of terms selected), even if more index terms are selected for the clustering, the affinity values among terms beyond that minimal number would be very small. Therefore the inclusion of those terms will not improve the performance of the concept identification approach; on the other hand, it may affect the performance by introducing distortions or noise into the clustering process.

Question 2: Does using a different N factor in the affinity value definition have a different impact on concept identification outputs?

Test 2a: Will semantically relevant concepts (concepts with more terms) receive a higher score than phrasally relevant concepts (concepts with fewer terms) if a larger sentence factor N is used in the affinity value definition?

Test 2b: Will semantically relevant concepts (concepts with more terms) score lower than phrasally relevant concepts (concepts with fewer terms) if a smaller sentence factor N is used in the affinity value definition?

The results of the second sensitivity study, analyzing the use of the sentence factor N in the affinity value definition and its relation to the outputs of the concept identification system, indicate that although the average scores of the top ten expert-identified concepts are very close across the three experiments, the experiments do not identify all of the same concepts. For the N=1 experiment, the average number of terms in the top ten concepts is 3.2, while for the N=3 experiment the number is 3.4, and for the N=5 experiment the number is 3.6. However, there is not enough variation between these numbers to provide definite evidence.

Question 3: Is the proposed concept identification approach more suitable for document collections with narrowly defined domains than for document collections with less narrowly defined domains?

The results of the third sensitivity study, assessing the particular characteristics of a document corpus against the outputs of the concept identification process, indicate that the top ten expert-identified concepts in corpus 2 (VPN technology, markets and product reviews) have higher average scores than those in corpus 1 (CRM strategy, implementation and technology). Thus, it is apparent that the system is indeed more suitable and accurate for document collections with narrow domains.

6.2 General Conclusions Regarding the Automatic Concept Identification Approach

This section presents some broad conclusions that can be drawn from the Automatic Concept Identification study. It was not clear at the outset of the study whether the deployment of newly developed techniques for clustering and for determining affinity values could effectively and efficiently address the goals of concept identification. The overall empirical results demonstrate that the measures and techniques applied in this approach point to a very promising concept identification framework, although there is still a lot of room for improvement. Although the experimental results varied among different document collections, the scores are all above average (3). This means that the overall concept identification results are positive, leading to several implications. First, deploying automatic text processing techniques from modern Information Retrieval systems (stop-wording, stemming, phrase extraction) provides the groundwork for automatic concept identification tasks.
Second, using unsupervised clustering techniques, particularly choosing the proper clustering algorithm to group individual index terms into term groups and subsequently to identify them as concepts, is a viable approach for automatic concept identification. Third, it is critical that a proper distance/similarity function between term pairs is developed for the clustering process. Using the newly developed affinity value to describe the similarity of terms and deploying it in the clustering algorithm has already yielded positive results in our concept identification experiments. Therefore, it should be possible to develop more mathematically accurate and informative affinity measures (e.g. employing certain weighting schemes to measure the effects of term frequency in word co-occurrence, or using non-linear functions to define the relationships among affinity, distance, co-occurrence, and windows of co-occurrence), which may perform better in concept identification tasks. Fourth, it is also possible to improve the performance of the concept identification system by incorporating new algorithms, techniques and approaches from the Information Retrieval and Information Extraction areas (e.g. a stemming algorithm, a clustering algorithm, or automatic indexing approaches).

The results also indicate that as the average number of terms in clusters grows, the average score becomes lower. This result may be explained by a tendency of the system to work better when identifying simple concepts (phrasal relationships), as longer lists of terms might not represent any specific concepts. Alternatively, it could be a problem with the measuring method: it is easier for judges to identify simpler concepts than more complex concepts, and the more difficult it is for them to identify a concept, the more likely they are to give up and give a lower score. The first problem can be addressed by developing a more advanced affinity value definition; for example, the sentence factor (N) can be increased so the affinity value can reveal more semantic relationships among terms. If the problem results from the measuring method, it can be addressed by deploying more objective measurements. For example, if a document collection with a certain number of pre-defined concepts is used, measures such as counting the correct concepts identified, precision, recall, entropy, and similarity matrices could be used.

7. Summary

7.1 Contributions of the Research

The research for this paper makes contributions to a few areas of study. First, it develops an automatic and scalable process for concept identification on unrestricted yet narrowly defined domain document collections. This has been the main purpose of the study, and the results from the empirical studies demonstrate that the proposed system delivers positive performance on concept identification tasks. Second, this study offers a means of reducing the use of domain-specific knowledge (such as taxonomies and ontologies) in concept identification. By deploying automatic text processing techniques and unsupervised clustering algorithms, the need for domain-specific knowledge in the concept identification system is minimized, which makes the system scalable (as it has the ability to deal with large document collections). Third, the study proposes a new measurement: the "Affinity" value between index terms. This measurement defines the relationships between each pair of index terms.
Fourth, the study provides empirical evidence regarding the effects of different parameters that may be used in the proposed approach on the final concept identification outcomes. Those parameters include index term selection thresholds (TF, DF), the size of the window within which two terms are considered to co-occur (N), and particular characteristics of the targeted document collections. Fifth, the research demonstrates that it is possible to reveal semantic relationships among relevant concepts for certain document collections.

7.2 Critique of the Research

Although the outcomes of the research are positive, there are still some aspects that need further improvement. The proposed approach lacks a strong theoretical foundation, although it seems reasonable to expect that the proximity of words to each other increases the chance that the words are connected as parts of a concept. This concept identification study faces problems similar to those encountered in information extraction and information retrieval research: there is no formal, strong and complete theoretical foundation that deals with the kinds of numbers generated in this process. In addition, there is no well-established, formal method for selecting index terms. Index term selection is a very important part of the entire concept identification framework, because the final outcomes, the identified concepts, are all based on the index terms that are selected. Furthermore, the newly developed affinity value lacks a strong theoretical foundation. Although the linear affinity value definition seems simple, straightforward and intuitive, there is no proof that it is the best way to describe the relationships between each pair of index terms. There is no proof that clustering based on affinity values is better for concept identification tasks than clustering based on the KL or mutual information methods. Added to the theoretical limitations, there is also a lack of objective measurement in the present study. The principal measurement used in the study (the average grade of an identified concept) is a subjective measure. It is constrained by the number of judges involved in the process, the domain knowledge of the judges, and the language skills of the judges. If a document collection with a certain number of pre-defined concepts is used, then objective measurements such as precision, recall, entropy, and similarity matrices could be used.

7.3 Future Research

Future research should aim at improving the performance of the concept identification system. Because document pre-processing and automatic index word identification have already been studied extensively by researchers in the field of information retrieval, it would be more productive to focus on developing a more elaborate affinity definition to better describe the similarity of index terms (the affinity value should be based on the distance and co-occurrence of terms, but certain weighting schemes could be used to assess the effect of term frequency, and the basic relationship could be non-linear), finding clustering algorithms that are better suited to the concept identification application, and discovering means of incorporating domain knowledge into the clustering process to achieve better results. For example, researchers might investigate methods of combining spatial clustering (based on average affinity values between pairs of terms) and temporal clustering (based on the co-occurrence of two terms) in the system.
The present approach, on the other hand, uses only spatial clustering (average affinity value), based on the average distance between two terms when they co-occur, in the clustering process. The value of temporal measures (number of co-occurrences) has been omitted because it is too sensitive to term frequency, according to the simple definition used in this study. For example, in the first document collection (CRM technology, strategies, implementation and products), the term "CRM" is the most frequent term in the whole collection, with a term frequency of 1260. This is close to 30 times the average term frequency (45) of all the chosen terms. If the temporal affinity measure is defined only by counting the incidence of co-occurrences, the affinity values between "CRM" and any other terms are always very high, and when such temporal affinity values are used in the clustering process, the results will always show one cluster growing and absorbing other terms. Therefore, in the proposed system, the affinity value is defined as an average affinity, which is the total affinity value divided by the number of co-occurrences of the particular terms. However, this definition ignores one important aspect of affinity, the co-occurrence of terms: if two words often (although not necessarily always) appear together, this indicates a certain relationship between them. The temporal aspects of relationships between terms are therefore also important for concept identification; however, a good definition has not yet been developed that calculates the temporal affinity value based on term co-occurrence while at the same time weighing out the impact of term frequency. There may be two solutions to that problem: first, a good weighting scheme could be identified for counting co-occurrences; and second, high-frequency terms such as "CRM" could be treated as limiting factors in the clustering process, omitted from the early phases of the process and re-introduced in subsequent steps. Another possible focus for future research could involve separation of the spatial (distance) and temporal (co-occurrence) aspects of the affinity value. Spatial clustering could be used first to identify more phrasally relevant concepts, and then temporal affinity clustering could be used to identify more semantically relevant concepts; alternatively, spatial and temporal affinities could be used in alternation at each step of the clustering process.

9. References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. "Topic Detection and Tracking Pilot Study: Final Report." Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. Lansdowne, VA, 1998.

Arai, K., Wright, J., Riccardi, G., and Gorin, A. "Grammar Fragment Acquisition Using Syntactic and Semantic Clustering." Speech Communication. 27(1), 1999.

Baeza-Yates, Ricardo and Ribeiro-Neto, Berthier. Modern Information Retrieval. Addison-Wesley, 1999.

Baxendale, P. B. "Machine-Made Index for Technical Literature - An Experiment." IBM Journal of Research and Development. 2(2): 1958. 354-361.

Brown, P., Della Pietra, V., deSouza, P., Lai, J., and Mercer, R. "Class-Based N-Gram Models of Natural Language." Computational Linguistics. 18(4): 1992. 467-479.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. "Autoclass: Bayesian Classification System." Proceedings of the Fifth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann, 1988.

Chen, H. and Dhar, V. "User Misconceptions of Online Information Retrieval Systems." International Journal of Man-Machine Studies. 32(6): 1990. 673-692.

Chotimongkol, A. and Rudnicky, A.I. "Automatic Concept Identification in Goal-Oriented Conversations." Proceedings of ICSLP 2002. Denver, CO, 2002. 1153-1156.

Church, Kenneth and Hanks, Patrick. "Word Association Norms, Mutual Information, and Lexicography." Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. 1989. 76-83.

Cowie, J., Guthrie, J. A., and Guthrie, L. "Lexical Disambiguation Using Simulated Annealing." Proceedings of the International Conference on Computational Linguistics. 1992. 359-365.

Dagan, I., Lee, L., and Pereira, F. "Similarity-Based Models of Word-Cooccurrence Probabilities." Machine Learning. 34(1-3): 1999. 43-69.

DeJong, G. "An Overview of the FRUMP System." Strategies for Natural Language Processing. Ed. W.G. Lehnert and M.H. Ringle. Lawrence Erlbaum Associates, 1982. 149-176.

Dhillon, I. and Modha, D. "A Data-Clustering Algorithm on Distributed Memory Multiprocessors." Proceedings of the KDD'99 Workshop on High Performance Knowledge Discovery. 1999.

Diday, E. and Simon, J. C. "Clustering Analysis." Digital Pattern Recognition. Ed. K. S. Fu. New York: Springer-Verlag, 1976.

Doszkocs, T.E., Reggia, J., and Lin, X. "Connectionist Models and Information Retrieval." Annual Review of Information Science and Technology (ARIST). 25: 1990. 209-260.

Dubes, R. and Jain, A. K. "Clustering Methodologies in Exploratory Data Analysis." Advances in Computers. Ed. M.C. Yovits. 19: 1980. 113-228.

Edmundson, H. "New Methods in Automatic Abstracting." Journal of the ACM. 16(2): 1969. 264-285.

Foote, J.T. and Silverman, H. F. "A Model Distance Measure for Talker Clustering and Identification." Proceedings of the 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing. Adelaide, SA, 1994.

Frakes, W. and Baeza-Yates, R. (eds.). Information Retrieval. Englewood Cliffs, NJ: Prentice-Hall, 1992.

Fuhr, N. and Buckley, C. "A Probabilistic Learning Approach for Document Indexing." ACM Transactions on Information Systems. 9(3): 1991. 223-248.

Furnas et al. The Vocabulary Problem in Human-System Communication. New York: ACM Press, 1989.

Grishman, Ralph. "Information Extraction: Techniques and Challenges." Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Ed. Maria Teresa Pazienza. Lecture Notes in Artificial Intelligence, Volume 1299. Frascati, Italy: Springer International Summer School, 1997. 10-27.

Griffiths, A., Robinson, L.A., and Willett, P. "Hierarchic Agglomerative Clustering Methods for Automatic Document Classification." Journal of Documentation. 40: 1984. 175-205.

Hindle, Donald and Rooth, Mats. "Structural Ambiguity and Lexical Relations." Meeting of the Association for Computational Linguistics. 1991. 229-236.

Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall, 1988.

Jain, A. K., Murty, M. N., and Flynn, P. J. "Data Clustering: A Review." ACM Computing Surveys. 31(3): 1999. 264-323.

Johnson, F. C., Paice, C. D., Black, W., and Neal, A. "The Application of Linguistic Processing to Automatic Abstract Generation." Journal of Document and Text Management. 1: 1993. 215-241.

Jonassen, D. Computers in the Classroom: Mindtools for Critical Thinking. New Jersey: Merrill, 1996.

Jonassen. Handbook of Research for Educational Communications and Technology. 1996.
Lin, Chung-Hsin and Chen, Hsinchun. "An Automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents." IEEE Transactions on Systems, Man and Cybernetics. 26(1): 1996. 75-88. http://ai.bpa.arizona.edu/papers/chinese93/chinese93.html.
Lovins, J.B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics. 11: 1968. 22-31.
Luhn, H. P. "The Automatic Creation of Literature Abstracts." IBM Journal of Research and Development. 2(2): 1958. 159-165.
Michalski, R.S. and Stepp, R.E. "Learning from Observation: Conceptual Clustering." Machine Learning, An Artificial Intelligence Approach. Ed. R.S. Michalski, J.G. Carbonell, and T.M. Mitchell. Palo Alto, CA: Tioga Publishing Company, 1983. 331-363.
Paice, C. D. and Jones, P. A. "The Identification of Important Concepts in Highly Structured Technical Papers." Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Ed. R. Korfhage, E. Rasmussen, and P. Willett. Pittsburgh, 1993. 69-75.
Pollock, J.J. and Zamora, A. "Automatic Abstracting Research at the Chemical Abstracts Service." Journal of Chemical Information & Computer Science. 15(4): 1975. 226-232.
Porter, M.F. "An Algorithm for Suffix Stripping." Readings in Information Retrieval. Ed. Karen Sparck Jones and Peter Willett. San Francisco: Morgan Kaufmann, 1997.
Rasmussen, E. "Clustering Algorithms." Information Retrieval: Data Structures and Algorithms. Ed. W. B. Frakes and R. Baeza-Yates. Englewood Cliffs, NJ: Prentice Hall, 1992.
Rau, L.F., Jacobs, P.S., and Zernik, U. "Information Extraction and Text Summarization Using Linguistic Knowledge Acquisition." Information Processing & Management. 25(4): 1989. 419-428.
Reimer, Ulrich and Hahn, Udo. "An Overview of the Text Understanding System TOPIC." Linguistic Approaches to Artificial Intelligence. Ed. U. Schmitz, R. Schütz, and A. Kunz. Frankfurt: Lang, 1990. 305-320.
Rush, J.E., Salvador, R., and Zamora, A. "Automatic Abstracting and Indexing. II. Production of Indicative Abstracts by Application of Contextual Inference and Syntactic Coherence Criteria." Journal of the American Society for Information Science. 22(3): 1971. 260-274.
Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
Salton, G., Allan, J., and Buckley, C. "Automatic Structuring and Retrieval of Large Text Files." Communications of the ACM. 37(2): 1994. 97-108.
Siu, K. and Meng, H. "Semi-Automatic Acquisition of Domain-Specific Semantic Structures." Proceedings of EUROSPEECH-99. Budapest, Hungary, 1999.
Skorochod'ko, E. "Adaptive Method of Automatic Abstracting and Indexing." Information Processing 71: Proceedings of the IFIP Congress 71. Ed. C. Freiman. North-Holland Publishing Company, 1973. 1179-1182.
Sparck Jones, Karen and Willett, Peter. Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997.
Stepp, R.E. A Description and User's Guide for CLUSTER/2, a Program for Conceptual Clustering. Urbana, IL: Department of Computer Science, University of Illinois, 1983.
Taycher, L., La Cascia, M., and Sclaroff, S.
Image Digestion and Relevance Feedback in the ImageRover WWW Search Engines, Proceedings of Visual. San Diego, 1997. Wilks et al., "A Tractable Machine Dictionary as a Resource for Computational Semantics." Computational Lexicography for Natural Language Processing. White Plains, NY: Longman, 1989. [avoid using "et al." in the bibliography-all of the authors' names should be listed] Zechner, K. "Automatic Generation of Concise Summaries of Spoken Dialogues in Unrestricted Domains." SIGIR'01. New Orleans, 2001. 77 Appendix A: Top Ten Computer- and Expert-Identified Concepts for Corpus 1, 2, and 3 (Including Sensitivity Studies Results). Table 1: Document Frequency (30%) and Total Frequency (2%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 router, configur, criteria 0.33333 5 3 access, technologi 1.0 4 3 ipsec, standard 1.0 4.75 4 encryption, wan 1.0 4 4 simultan, encrypt, session 0.50163 4.75 5 ipsec, standard 1.0 4.75 5 internet, connect, traffic 0.54857 4.75 6 privat, virtual 1.0 4.75 6 client, softwar, system, cisco 0.38153 4.5 7 sourc, destin 1.0 4 7 number, limit, packet 0.46564 4.5 8 authent, commun 1.0 4 8 offic, small, central 0.47862 4.25 9 Limit, number 1.0 4 9 gatewai, connec, enterpr 0.32516 4.25 10 chang, subject 1.0 4 10 end, user, shar, devic 0.28767 4.25 Table 2: Document Frequency (30%) and Tota Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 access, technologi 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 encryption, wan 1.0 4 4 internet, connect, traffic 0.43174 4.75 Frequency (3%). Top ten concepts. 78 5 privat, virtual 1.0 4.75 5 client, softwar, system, cisco 0.38152 4.5 6 number,limit 1.0 4 6 number, limit, packet 0.46564 4.5 7 encryption, wan, require 0.73333 4 7 remot, small, office 0.43642 4.75 8 access, technologi, level 0.66667 3.75 8 complet, oper, data 0.31258 4.5 9 remot, small 0.56874 4.0 9 end, user, shar, devic 0.33568 4.25 10 internet, connect 0.54857 4.5 10 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 Table 3: Document Frequency (30%) and Total Frequency (4%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 system, cisco 1.0 4.5 1 privat, virtual, network 0.58654 5 2 access, technologi 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 encryption, wan 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 privat, virtual 1.0 4.75 4 internet, connect, traffic 0.54857 4.75 5 limit, number 1.0 4 5 softwar, system, cisco 0.38153 4.5 6 encryption, wan, require 0.73333 4 6 number, limit, packet 0.46564 4.5 7 remot, small 0.56667 4 7 remot, small, office 0.43642 4.5 8 internet, connect 0.54857 4 8 complet, oper, data 0.31258 4.5 9 oper, data 0 .53636 4 9 end, user, shar, devic 0.33568 4.25 10 user, end 0.50357 4 10 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 Table 4: Document Frequency (50%) and Total Frequency (2%). Top ten concepts Computer-Identified Expert-Identified No. Tokens Ave Grade No. 
Tokens Ave Sim Grade Sim 1 servic, tim 1.0 4.5 1 privat, 0.58654 5 79 virtual, network 2 system, cisco 1.0 4.5 2 ipsec, standard 1.0 4.75 3 access, technologi 1.0 4 3 remot, small, office, internet, connect 0.33377 4.75 4 encryption, wan 1.0 4 4 gatewai, additional, plan, enterpr, router 0.32516 4.75 5 ipsec, standard 1.0 4.75 5 number, limit, packet 0.46564 4.5 6 privat, virtual 1.0 4.75 6 simultan, encrypt, session 0.50163 4.5 7 sourc, destin 1.0 4 7 system, cisco, softwar, check, point 0.35971 4.25 8 authent, commun 1.0 4 8 end, user, shar, devic 0.28767 4.25 9 Limit, number 1.0 4 9 servic, tim, authent, commun 0.45009 4.25 10 chang, subject 1.0 4 10 tunnel, encryption, wan, requir, type, ipsec, standard 0.29333 4.25 Table 5: Document Frequency (50%) and Total Frequency (3%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 access, technologi 1.0 4 3 remot, small, office, internet, connect 0.43174 4.75 4 encryption, wan 1.0 4 4 system, cisco, 0.35971 4.5 80 softwar, check, point 5 privat, virtual 1.0 4.75 5 complet, oper, data 0.31258 4.5 6 number,limit 1.0 4 6 end, user, shar, devic 0.33568 4.25 7 encryption, wan, require 0.73333 4 7 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 8 remot, small 0.56667 3.75 8 access, technologi, level 0.66667 4. 9 internet, connect 0.54857 4.5 9 integr, secur, authent, protocol 0.31550 4 10 vendor, sing 0.53636 3.75 10 servic, content, sourc, 0.34285 3.75 Table 6: Document Frequency (50%) and Total Frequency (4%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 system, cisco 1.0 4.5 1 Privat, virtual, network 0.58654 5 2 access, technologi 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 encryption, wan 1.0 4 3 remot, small, office, internet, connect 0.43174 4.75 4 privat, virtual 1.0 4.75 4 system, cisco, softwar, check, point 0.35971 4.5 5 limit, number 1.0 4 5 complet, oper, data 0.31258 4.5 6 encryption, wan, require 0.73333 4 6 end, user, shar, devic 0.33568 4.25 7 remot, small 0.56667 4 7 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 8 internet, connect 0.54857 4 8 access, technologi, level 0.66667 4. 9 oper, data 0 .53636 4 9 integr, secur, authent, protocol 0.31550 4. 10 user, end 0.50357 4 10 servic, content, sourc, 0.34285 3.5 81 Table 7: DF (30%) TF (4%) N (=3) Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 network, sit 1.0 3 1 virtual, privat 1 4.75 2 tunnel, requir 1.0 2 2 secur, gatewai, connec 0.51965 4.75 3 system, cisco 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 access, technologi 1.0 4.75 4 internet, connect, traffic 0.42460 4.75 5 encryption, wan 1.0 4 5 system, cisco, softwar, check, point 0.35317 4.5 6 privat, virtual 1.0 4.75 6 number, limit, packet 0.52222 4.5 7 limit, number 1.0 4.5 7 remot, small, office 0.47076 4.5 8 user, end 1.0 4.5 8 complet, oper, data 0.37049 4.5 9 secur, gatewai 0.75 4.5 9 enterpr, router, solution 0.23751 4.25 10 remot, small 0.55556 4 10 tunnel, encryption, wan, requir, ipsec 0.38683 4.25 Table 8: DF (30%) TF (4%) and N (=1). Top ten.concepts. Computer-Identifiec Expert-Identified No. Tokens Ave Sim Grade No. 
Tokens Ave Sim Grade 1 vpn, protocol 1.0 4 1 virtual, privat 1 4.75 2 enterpr, network 1.0 4 2 enterpr, network, connect, traffic 0.75 4.75 3 enterpr, network, connect 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 privat, virtual 1.0 4.75 4 secur, integr, router, gatewai 0.33796 4.75 5 tunnel, requir 1.0 2 5 system, cisco, check, point 0.50555 4.5 6 point, check 1.0 4.5 6 number, limit, packet 0.66666 4.5 7 number, limit 1.0 4.0 7 vpn, protocol, user, authent 0.61574 4.5 82 8 system, cisco 1.0 4.0 8 complet, oper, data 0.37049 4.5 9 access, technologi 1.0 4.0 9 firewal, vendor, solution 0.54861 4.5 10 encryption, wan 1.0 4.25 10 tunnel, requir, ipsec 0.5 4.25 Table 9: CRM corpus (D=30%, F=4, N=5). Top ten concepts. Computer-Identifie( Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 custom, bas 1.0 4.5 1 manag, relationship, software 0.6805 4.75 2 process, develop 1.0 4 2 process, develop, strategi 0.8571 4.5 3 cost, benefit 1.0 4 3 tim, respons, effici 0.6667 4.5 4 manag, relationship 1.0 4.5 4 user, environ 0.5 4.25 5 market, commun 1.0 3 5 effect, measure, knowledg 0.6667 4.25 6 inform, provide 1.0 4.0 6 custom, bas, tool 0.15944 4.25 7 servic, sal(e) 1.0 4.0 7 includ, target, group 0.61574 4 8 product, model 1.0 4.0 8 crm, oper, applic, support 0.6667 4 9 tim(e), respons 1.0 5.0 9 data, integr, technologi 0.5708 4 10 success, offer 1.0 4.0 10 experi, inform, provid 0.66667 4 Table 10: Top ten computer- and expert-identified concepts for Corpus 3 (DF=50%, F=4%, N=5). Computer-Identifiec Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 switch, fabric 1.0 4.25 1 storage, area, network 0.3562 5 2 EMC, IBM 1.0 4 2 small, computer, system, interface 0.2764 5 3 disk, independ 1.0 4 3 key, business, issue 0.3667 4.5 4 Compaq, 1.0 4.25 4 performance, 0.5 4.5 83 comput high, availability 5 high, availability 1.0 4.25 5 price, configuration 0.6667 4.5 6 capacity, scalability 1.0 4.0 6 independent, array, disk 0.1254 4.5 7 Note, research 1.0 4.0 7 hardware, software 0.4574 4.25 8 network, area 0.8 4.0 8 host, server, protocol 0.2661 4.25 9 vendor, deliver 0.8 4.0 9 capacity, limit 0.2703 4.25 10 channel, fibr 0.8 4.25 10 market, share 0.1667 4.25 84 Appendix B: The Processed Document Lists The detailed document lists and a brief introduction of corpus one (CRM system, implementation, strategies, and review) 1. CRM: Well what is it? by Simon Harper, Director, NHANZ Ltd Is your business designed to "Churn and Burn" (A) or "Learn and Earn" (B)? If your business model is based around model A, then a lot of this white paper is irrelevant as the business focus is about costly one off sales and immediate return. This paper is aimed at Model B businesses working towards long-term customer profitability & reciprocal benefits that in turn create a proactive business and see customer information as a powerful business asset. 2. What Really is Retention Marketing? by Kathleen Goodwin, Chief Executive Officer, iMakeNews This article reviews the latest methods for increasing e-marketing ROI by focusing on retention. It also uncovers the latest e-metrics for segmenting customers into "buckets" according to behaviors/attitudes, narrowcasting content to each bucket accordingly, and improving branding/retention as a result. Read the entire report! 3. Channel Partner e-Communications Workbook by Todd A. Brehe, Director of Communications Products, Gallatin Technologies, Inc. 
Many companies looking to implement a partner relationship management (PRM) application would be better served by first starting a channel partner communications program. This Workbook provides an outline for defining and implementing a world-class partner communications initiative so that channel executives and managers can reach partners and earn their efforts and attention. 4. Secrets of CRM Success: Achieving the Vision Through end user adoption by Kristina Markstrom, Director Client Services, Tech Resource Group End users hold the keys to the CRM kingdom. Sales reps and call center staff that don't support the effort or understand how to use and apply the new system and business practices can thwart successful?CRM implementation efforts. Learn best practices for managing change, training end users, and providing ongoing support. 5. Business Process Integration. Simplified. by Anand Vaidya, CEO and Chief Technologist, AVS SYSTEMS, INC Today? economy demands cohesive handshake among all business process so that companies can understand the real benefits that are hidden underneath the guise of voluminous storage of data spread across disparate systems. 6. Improving Agent Performance While Maintaining High Levels of Motivation by Jose Roza, Product Manager, Altitude Software This white paper outlines critical steps to improve contact center performance and, at the same time, keep high levels of agent motivation. The white paper also addresses several themes, such as how to turnover agents and how to improve agent performance while keeping high customer satisfaction levels. 7. Knowledge Management for Customer Service by Anand Subramaniam, Sr. Marketing Director, eGain Communications Corp. 85 As enterprises increasingly use customer service to differentiate themselves, knowledge management has gained prominence as a strategic initiative. A key enabler, it allows businesses to use their knowledge assets to provide better customer service. This paper discusses the ingredients of success as it relates to knowledge management strategy, technology, people, and processes. 8. Real Time Service & Support by Anand Subramaniam, Sr. Marketing Director, eGain Communications Corp. Web collaboration is delivering strategic and operational benefits to organizations in the form of improved customer satisfaction, reduced costs, and increased revenues. In this paper, we discuss the pain points that are driving the demand for web collaboration, the benefits of this set of technologies, and what to look for in a solution. We also describe how eGain, a leader in this space, can help enterprises provide best-in-class real-time customer service and support on the web. 9. Positioning C R M Programs for Value Creation by Steve LaValle, Marco Cucchi, Scott Lieberman, C R M Vision Practice, IBM Business Consulting Services Today, organizations must focus on identifying clear business cases to justify CRM investments. This white paper outlines how to setup a successful C R M Program focused on value-creating initiatives. 10. Distinguish Your Business Through Excellent Customer Service by Craig Bailey, President, Customer Centricity, Inc. Everyone recognizes that in today? business environment you cannot afford to lose a single profitable customer. However, the current business climate also provides a window of opportunities for companies to leverage excellent customer service to differentiate their business from the competiton and gain a competitive advantage. 11. 
The CIO Agenda: Taking Care of Business by Getronics/IDG Research Many in the IT community talk about ROI, but a comprehensive global survey of IT executives at large organizations finds that CIOs don't use formal ROI tools ~ CFOs do. Conducted by IDG Research and commissioned by global solutions provider Getronics, this paper outlines many key trends for 2003 and beyond. Some stats of note: CIOs are cautiously optimistic for 2003, with 41% predicting they will increase spending. Projects will focus on decreasing costs, increasing customer loyalty, and increasing profits in the upcoming year. CRM will be the top spending priority in the US and security for Europe. 12. Turn your Customer Data into Gold! by Dr. Raj Menon, Managing Partner, The Menon Group, Inc. This white paper outlines a framework for analyzing your existing customer data which will help you develop a consistent, firm-wide understanding of your customer and the notion of customer value. Knowing exactly what your customers are worth is the first step towards developing specifically targeted and customer focused strategies. We also outline seven habits of highly customer focused organizations -which are strategies used by successful companies to become and remain profitable. 13. What Does It Take for CRM Success? 86 by Alan F. Atkinson, Vice President, Consulting Services, Equinox Corporation Little can be added to the literature and positive experiences that describe the value of CRM as a business system and/or culture in a business. Then why are management, users and the IT professionals still wondering why their project has lived up to expectations? Read more... 14. Peek-a-boo at CRM Market by Siddhartha Jain, Consultant, Digitra Infostructure Ltd. This white paper from Digitra delves into the effects of various C R M technologies on your organization and which vendors offer the technology you need. The paper is a structured approach to understand the CRM market. 15. Why CRM Projects Fail by Doug Tanoury & Kit Ireland, President - Founder, Customer Interactions Consulting This is an indepth and comprehensive white paper that discusses common mistakes and pitfalls that CRM initiatives are prone to encounter. More importantly the paper examines how your company can avoid these common mistakes. 16. Web-Enabled CRM: Today? Reality ~ Tomorrow? Vision by Doug Tanoury, President - Founder, Customer Interactions Consulting This paper is a look at CRM in its present state and where it is going based on the migration of current applications to the Internet, and the next generation of web-based tools. While there changes in technology open new opportunities for visionary companies, there are distinct dangers to those who bring old service paradigms to the new media. The Internet has set new standards for responsiveness, quality and customer segmentation. A candid and honest look at successes and failures is provided here and a look to where current technology and trends will take C R M in coming years. 17. CRM Initiatives Taking It Personal 7 Key Steps for Personalization by Thomas Hannigan & Christina Palandrano, Senior Consultant & Senior Project Manager, Chatham System Group This paper presents the 7 critical steps to effectively prepare your company to successfully implement personalization as part of your CRM strategy. 
Based on extensive experience with implementing large-scale CRM initiatives, the authors cover data, technology, organizational and strategic issues that must be addressed during the planning process to avoid costly barriers to success. Each step includes a checklist to use during the planning process. 18. eCRM:Theel l way by Vineet Agrawal & Manish Mittal, Consultants, SHTR Consulting Group With the advent of internet the role of CRM within an organization is changing. Managers and team members need to stay ahead of the competition, collaborate on projects and help the flow of information internally and externally. This white paper is aimed at providing organizations understanding of how a web-based Customer Relationship Management (CRM) system can dramatically improve their ability to effectively run web based campaigns, track leads, convert leads into sales, deliver responsive service and support to internal, external customers and partners. 87 19. Adaptive CRM and Knowledge Turnover by Vince Kellen, Principal, Blue Wolf In the 21st century, businesses will discover that the key driver to improving performance is how quickly and correctly it can discover and apply new customer knowledge. Knowledge turnover is a new metric for measuring how quickly companies turn data into profitable action. Nimble business that quickly learn which products and services to add, change or drop, or can figure out how and which customers to attract and retain will outperform their slower-moving competitors. Companies need to practice Adaptive CRM. 20. Strategic Sales Processes For Improved Customer Relationship Management by Russ Lombardo, President/Owner, PEAK Sales Consulting Customer Relationship Management (CRM), in and of itself, is not a solution; it is a means to an end ?enhancing the sales process so you can better manage your relationships with your customers. In most companies, this involves multiple departments, such as Sales, Marketing, Customer Service, Technical Support, and even Accounting, to name a few. 21. Implementation Is the Easy Part: Action Items for CRM Success by Bryan J. Stone, Manager, Countrywide Stories of high-profile CRM failures abound. Your organization can avoid becoming one of these stories by following four planning strategies to avoid common pitfalls. 22. Defining CRM for Business Success by Hector D. Trestini, The Internet has brought about a new level of productivity and competitiveness for individuals and businesses worldwide. But, this online revolution has also created significant challenges for organizations. As customers rush to use web services that provide them with instant and personalized attention, businesses are struggling with managing increased competition, eroding customer bases, and growing financial pressures through outdated business models that are supported by a multitude of un-integrated technologies in their environments. 
The detailed documents list for Corpus two (VPN technology, markets, and product review) 1 IP VPN Equipment: Worldwide, 2001 Update http://riondel.library.ubc.ca/gartnergrp/research/109000/109002/109002.html 2 Carrier Buying Patterns for IP VPN Connectivity: EMEA, 2002 http://riondel.library.ubc.ca/gartnergrp/research/l 12400/112470/112470.html 3 Firewall and IP VPN Equipment Market: Worldwide, 1H02 http://riondeI.library.ubc.ca/gartnergrp/research/l 12200/112241/112241 .html 4 IP VPN Equipment: Worldwide, 2001 http://riondel.library.ubc.ca/gartnergrp/research/105400/105406/105406.html 5 IP VPN Services: It's Still Early in Asia/Pacific and Japan, but Speed Is Picking Up 88 http://riondel.library.ubc.ca/gartnergrp/research/l 04300/104359/104359.html 6 Firewall and IP VPN Equipment Market: Worldwide, 1H02 http://riondel.library.ubc.ca/gartnergrp/research/l 12200/112242/112242.html 7 Enterprise VPN Product Magic Quadrant Evaluation Criteria http://riondel.library.ubc.ca/gartnergrp/research/109600/109632/109632.html 8 IP VPN Equipment: Worldwide, 2001 Update (Executive Summary) http://riondel.library.ubc.ca/gartnergrp/research/109000/109003/109003.html 9 IP VPN Equipment: Worldwide, 2001 (Executive Summary) http://riondel.library.ubc.ca/gartnergrp/research/105400/105408/105408.html 10 CIO Update: Enterprise VPN Product Magic Quadrant for 2H02 http://riondel.library.ubc.ca/gartnergrp/research/l 10200/110217/110217.html 11 Enterprise VPN Product 2H02 Magic Quadrant http://riondel.library.ubc.ca/gartnergrp/research/109600/109633/109633.html 12 Management Update: Magic Quadrant Update for Enterprise VPN Products http://riondel. 1 ibrary.ubc.ca/gartnergrp/research/99000/99036/ 99036.html 13 Management Update: Enterprise VPN Products Magic Quadrant Update for 1H01 http://riondel.library.ubc.ca/gartnergrp/research/98500/98559/98559.html 14 Enterprise VPN Products Magic Quadrant Update for 1H01 http://riondel.library.ubc.ca/gartnergrp/research/98100/98130/ 98130.html 15 CIO Update: VPN Security and Placement http://riondel.library.ubc.ca/gartnergrp/research/101100/101169/101169.html 16 Management Update: The Pros and Cons of Building a VPN http://riondel.library.ubc.ca/gartnergrp/research/101100/101167/101167.html 17 VPN Security and Placement http://riondel.library.ubc.ca/gartnergrp/research/100500/100595/100595.html 18 The Pros and Cons of Building a VPN http://riondel.library.ubc.ca/gartnergrp/research/100500/100565/100565.html 19 VPN: Define Before You Design! 
http://riondel.library.ubc.ca/gartnergrp/research/100300/100379/100379.html 20 Virtual Private Networks (VPNs): Comparison Columns http://riondel.library.ubc.ca/gartnergrp/research/106500/106563/106563_r.html 21 Virtual Private Networks (VPNs): Comparison Columns http://riondel.library.ubc.ca/gartnergrp/research/106500/106563/106563_p.html 89 22 IP VPNs: 2000 and Beyond http://riondel.library.ubc.ca/gartnergrp/research/97700/97746/ 97746.html 23 IP VPNs, Worldwide: First Half 2001 Update http://riondel.library.ubc.ca/gartnergrp/research/l 03100/103108/103108.html 24 IP VPN Services: It's Still Early in Asia/Pacific and Japan, but Speed Is Picking Up http://riondel.library.ubc.ca/gartnergrp/research/102000/102036/102036.html 25 Lucent Technologies SpringTide IP VPN Product Offering http://riondel.library.ubc.ca/gartnergrp/research/98600/98690/98690.html 26 Lucent Technologies VPN Firewall Brick Appliances http://riondel.library.ubc.ca/gartnergrp/research/97300/97318/ 97318.html 27 Cisco Systems 7100 Series VPN Router http://riondel.library.ubc.ca/gartnergrp/research/96300/96343/ 96343.html 28 Cisco Systems VPN 5000 Concentrator http://riondel.library.ubc.ca/gartnergrp/research/95300/95365/ 95365.html 29 Cisco Systems VPN 3000 Concentrators http://riondel.library.ubc.ca/gartnergrp/research/95100/95143/ 95143.html 30 Nortel Networks Contivity VPN Switches http://riondel.library.ubc.ca/gartnergrp/research/90600/90684/ 90684.html The detailed documents list for Corpus three (SAN technology, markets, and product review) 1. Fibre Channel SAN Components: The Switch in 2000 http://riondel.library.ubc.ca/gartnergrp/research/98500/98542/ 98542.html 2. Compaq's MSA 1000: A SAN for Beginners? http://riondel.library.ubc.ca/gartnergrp/research/103900/103949/103949.html 3. NAS vs. SAN: Technology Overview http://riondel.library.ubc.ca/gartnergrp/research/97400/97417/97417.html 4. iSCSI Could Offer a Cost-Effective Solution for SAN Architectures http://riondel.library.ubc.ca/gartnergrp/research/108200/108257/108257.html 5. Scaling SANs to the Enterprise: Islands Still Recommended http://riondel.Iibrary.ubc.ca/gartnergrp/research/109400/109409/109409.html 6. NAS and SAN Blurring: Fast File Systems and Fibre Channel http://riondel.library.ubc.ca/gartnergrp/research/107900/107930/107930.html 7. SAN 1-Gbps Fibre Channel Switch Magic Quadrant N 90 http://riondel.library.ubc.ca/gartnergrp/research/l 07500/107586/107586.html 8. SAN Solution Magic Quadrant: Second Assessment http://riondel.library.ubc.ca/gartnergrp/research/l 01900/101922/101922.html 9. Semiconductors in the SAN Landscape http://riondel.library.ubc.ca/gartnergrp/research/97000/97038/ 97038.html 10. EMC Celerra: NAS and SAN Together http://riondel.library.ubc.ca/gartnergrp/research/94800/94855/ 94855.html 11. Fibre Channel SAN Extenders: The Last 10,000 Miles http://riondel.library.ubc.ca/gartnergrp/research/94300/94340/ 94340.html 12. SAN Management Tools Are Finally Here http://riondel.library.ubc.ca/gartnergrp/research/102100/102130/102130.html 13. Network Appliance Offers NAS and SAN on One Platform http://riondeI.library.ubc.ca/gartnergrp/research/l 11700/111700/111700.html 14. Mainframe and Open Systems: The Same SAN? http://riondel.library.ubc.ca/gartnergrp/research/94800/94866/ 94866.html 15. StorageApps' SANLink: The Standard for SAN Appliances http://riondel.library.ubc.ca/gartnergrp/research/93600/93617/ 93617.html 16. 
SAN Solution Magic Quadrant: Preliminary Assessment http://riondel.library.ubc.ca/gartnergrp/research/92800/92854/ 92854.html 17. Cisco and Brocade Bridge SAN Islands With IP and DWDM http://riondel.library.ubc.ca/gartnergrp/research/92500/92580/ 92580.html 18. Longer-Term SAN Benefits: Better SRM via Disk Pooling http://riondel.library.ubc.ca/gartnergrp/research/86500/86573/ 86573.html 19. SAN Exploitation in the Future Server Architecture http://riondel.library.ubc.ca/gartnergrp/research/83200/83299/ 83299.html 20. SAN Exploitation for Backup: Will You Go Serverless? http://riondel.library.ubc.ca/gartnergrp/research/82900/82940/ 82940.html 91 Appendix C: Selected Token List of Three Corpus Table 1. Selected Token List for Corpus 1. (Document Frequency=30%, Term Frequency= 4%) ID Token Frequency Number of Doc 10 Custom 1160 23 590 Service 411 22 369 Crm 347 20 216 Busi 311 23 153 Manag 292 23 43 Compani 287 22 105 Bas 227 21 199 Product 219 21 417 Sal 216 20 616 Web 199 12 2 Market 194 20 620 Applic 184 17 13 Inform 179 20 418 process 175 23 664 solution 174 19 487 knowledg 169 18 156 tim 158 21 154 data 152 14 514 center 152 14 15 technologi 151 20 485 provid 144 22 352 cost 143 19 79 support 142 21 207 contact 135 16 595 system 134 19 28 commun 134 17 14 relationship 130 18 518 channel 124 16 592 call 114 19 519 partner 110 9 689 organ 107 18 383 requir 104 22 629 integr 100 14 163 mak 98 21 123 effect 98 17 1516 model 91 8 298 increas 89 18 388 point 87 16 229 tool 86 18 131 strategi 86 16 292 sell 86 16 997 vendor 86 13 170 group 85 19 33 measur 85 14 326 issu 84 16 169 internet 84 15 290 includ 83 22 628 interac 81 14 336 import 79 19 409 respons 79 18 866 enterpr 79 18 126 enabl 79 17 306 mail 79 11 895 autom 78 13 183 success 77 20 302 improv 76 20 579 result 75 18 1059 user 73 14 343 develop 71 16 143 level 68 17 1063 invest 68 16 1000 softwar 67 18 139 target 66 13 436 perform 65 15 547 number 64 16 421 address 61 17 205 offer 61 16 356 gener 61 16 711 industri 61 15 152 databas 60 15 507 expect 59 17 882 revenu 58 18 119 person 58 17 260 environ 57 18 81 year 56 18 921 roi 56 13 214 chang 55 19 233 work 55 17 21 individu 55 13 582 reduc 55 12 132 report 54 15 521 implement 53 17 142 high 52 17 372 achiev 51 17 90 creat 51 15 274 oper 50 20 437 abil 49 19 37 build 49 18 405 specif 49 16 49 experi 48 16 541 benefit 48 15 64 program 48 13 212 Goal 48 13 150 Term 47 16 52 Deliv 47 16 161 Approach 47 16 68 Understand 47 15 78 Effici | 47 14 644 Satisfac 47 11 Table 2. Selected Token List for Corpus 2. (Document Frequency=30%, Term Frequency= 2%) ID Token Frequency Number of Document 5 Vpn 1370 30 104 Service 598 28 11 Network 547 29 12 Product 453 26 125 Manag 418 25 167 Enterpr 377 28 107 Support 341 23 86 Secur 301 30 164 Tunnel 273 25 91 Client 263 22 14 Firewall 254 23 118 System 251 22 3 Softwar 248 23 16 Access 239 26 277 Provid 237 26 349 Vendor 230 23 253 Market 229 25 219 Protocol 198 25 114 Encryption 196 18 59 Bas 193 24 4 Technologi 190 24 179 Ipsec 183 22 141 User 176 28 17 Sit 167 22 • 58 Offer 166 23 140 Authent 166 21 15 Remot 160 23 2 Point 155 23 218 Port 152 13 354 Cisco 145 18 19 Content 138 30 357 Router 133 18 7 Solution 132 26 355 Concentr 128 12 57 Return 123 16 287 Inform 123 30 116 Office 121 20 80 Include 120 25 10 Privat 119 28 68 applianc 115 16 40 internet 114 26 143 capabl 108 18 376 lin 108 24 81 addition 107 22 65 window 104 16 . 
259 custom 100 19 169 power 100 12 520 carrier 99 14 392 busi 99 20 63 server 98 15 90 end 97 26 149 interfac 97 11 96 devic 96 20 172 requir 94 22 13 integr 93 24 6 gatewai 92 18 69 hardwar 92 18 356 seri 88 12 421 sourc 88 30 176 wan 87 18 128 configur 87 19 106 function 87 22 166 larg 86 19 9 virtual 86 28 344 cost 86 24 144 equip 86 25 119 address 82 18 426 complet 82 30 205 bit 82 10 233 bandwidth 81 15 132 featur 81 18 314 applic 80 19 517 figur 77 16 50 unit 77 15 180 rout 75 14 163 simultan 75 12 146 connec 74 20 152 oper 73 18 202 data 72 20 162 number 72 22 366 lucent 68 14 22 pric 68 21 816 level 68 18 138 digit 67 20 25 limit 67 20 1 check 66 18 135 certif 66 18 156 layer 66 19 270 rang 65 17 761 packet 65 15 419 contain 65 30 115 small 63 23 575 global 62 14 863 ethernet 61 11 136 intern 60 18 702 percent 59 11 216 filter 59 10 223 session 58 16 248 stat 58 12 312 connect 58 17 101 shar 57 18 278 traffic 56 13 445 chang 55 30 174 lan 55 19 413 public 55 30 295 pptp 54 17 192 radiu 54 10 350 famili 53 11 306 set 52 21 74 nortel 51 16 1170 switch 51 10 89 enabl 50 17 193 control 50 18 231 tim 50 19 441 result 49 30 617 expan 48 13 264 licens 47 12 443 express 46 30 351 challeng 46 15 293 implem 46 21 783 sha 45 10 251 nat 45 11 688 gener 45 19 565 lack 44 14 190 extern 44 10 120 environ 44 14 210 kei 44 18 224 polici 44 13 439 achiev 44 30 98 high 43 17 93 singl 43 17 725 research 43 17 353 leader 43 15 177 option 43 17 124 central 42 14 448 updat 42 12 592 design 41 15 24 strength 40 17 1028 growth 40 10 159 destin 39 13 29 corpor 39 16 437 selec 39 30 507 deploym 38 15 191 dial 38 18 487 plan 38 17 185 standard 38 19 485 process 37 18 422 reliabl 37 30 196 dynamic 37 11 257 believ 36 30 500 legaci 36 10 483 outsourc 36 10 299 perform 36 15 435 assum 36 30 459 abil 36 10 134 commun 36 18 420 obtain 35 30 408 entir 35 30 490 full 35 18 103 qualiti 35 14 416 written 35 30 288 encrypt 34 14 262 respons 34 .30 ' 476 continu 34 14 659 compani 34 15 220 type 33 17 467 establish 33 15 414 form 33 30 271 implement 33 17 867 suppli 33 10 360 enterasi 33 11 676 increas 32 19 472 issu 32 14 298 upgrad 32 12 411 reserv 32 30 442 opinion 32 30 352 rat 32 18 436 sol 32 30 338 account 32 16 425 accuraci 31 30 318 execut 31 16 440 intend 31 30 345 bundl 31 11 457 criteria 31 10 444 subject 31 30 997 expect 30 10 550 class 30 11 446 notic 30 30 409 copi 30 30 410 right 30 30 412 reproduc 30 30 415 prior 30 30 432 interpret 30 30 417 perm is 30 30 363 intel 30 12 126 person 30 10 418 forbidden 30 30 423 disclaim 30 30 496 creat 30 17 424 warranti 30 30 ( 427 adequaci 30 30 438 materi 30 30 431 inadequaci 30 30 428 liabil 30 30 430 omis 30 30 429 error 30 30 382 prioriti 29 12 139 strong 29 15 735 separ 29 15 297 deliv 29 13 290 not 29 17 267 basic 29 14 609 position 29 14 538 reduc 28 14 522 prefer 28 10 600 messag 28 10 236 directori 28 12 508 branch 28 16 1029 provision 28 11 1034 futur 28 14 88 termin 28 11 Table 3 Selected Token List for Corpus 2. 
(Document Frequency=50%, Term Frequency= 2%) ID Token Frequency Number of Document 5 vpn 1370 30 104 servic 598 28 11 network 547 29 12 product 453 26 125 manag 418 25 167 enterpr 377 28 107 support 341 23 86 secur 301 30 164 tunnel 273 25 91 client 263 22 14 firewal 254 23 118 system 251 22 3 softwar 248 23 16 access 239 26 277 provid 237 26 349 vendor 230 23 253 market 229 25 219 protocol 198 25 114 encryption 196 18 59 bas 193 24 4 technologi 190 24 179 ipsec 183 22 141 user 176 28 17 sit 167 22 140 authent 166 21 98 58 offer 166 23 15 remot 160 23 2 point 155 23 354 cisco 145 18 19 content 138 30 357 router 133 18 7 solution 132 26 57 return 123 16 287 inform 123 30 116 offic 121 20 80 includ 120 25 10 privat 119 28 68 applianc 115 16 40 internet 114 26 143 capabl 108 18 376 lin 108 24 81 addition 107 22 65 window 104 16 259 custom 100 19 392 busi 99 20 90 end 97 26 96 devic 96 20 172 requir 94 22 13 integr 93 24 6 gatewai 92 18 69 hardwar 92 18 421 sourc 88 30 128 configur 87 19 176 wan 87 18 106 function 87 22 9 virtual 86 28 166 larg 86 19 344 cost 86 24 144 equip 86 25 426 complet 82 30 119 address 82 18 132 featur 81 18 314 applic 80 19 517 figur 77 16 146 connec 74 20 152 oper 73 18 162 number 72 22 202 data 72 20 22 pric 68 21 816 level 68 18 25 limit 67 20 138 digit 67 20 1 check 66 18 135 certif 66 18 156 layer 66 19 270 rang 65 17 419 contain 65 30 115 small 63 23 136 intern 60 18 312 connect 58 17 223 session 58 16 101 shar 57 18 413 public 55 30 445 chang 55 30 174 lan 55 19 295 pptp 54 17 306 set 52 21 74 nortel 51 16 89 enabl 50 17 231 tim 50 19 193 control 50 18 441 result 49 30 443 express 46 30 293 implem 46 21 688 gener 45 19 439 achiev 44 30 210 kei 44 18 177 option 43 17 93 singl 43 17 98 high 43 17 725 research 43 17 24 strength 40 17 29 corpor 39 16 437 selec 39 30 185 standard 38 19 487 plan 38 17 191 dial 38 18 485 process 37 18 422 reliabl 37 30 257 believ 36 30 134 commun 36 18 435 assum 36 30 416 written 35 30 490 full 35 18 420 obtain 35 30 408 entir 35 30 262 respons 34 30 271 implement 33 17 220 type 33 17 414 form 33 30 411 reserv 32 30 352 rat 32 18 338 account 32 16 676 increas 32 19 442 opinion 32 30 436 sol 32 30 444 subject 31 30 425 accuraci 31 30 318 execut 31 16 440 intend 31 30 438 materi 30 30 418 forbidden 30 30 417 permis 30 30 423 disclaim 30 30 496 creat 30 17 424 warranti 30 30 412 reproduc 30 30 415 prior 30 30 446 notic 30 30 410 right 30 30 432 interpret 30 30 431 inadequaci 30 30 427 adequaci 30 30 428 liabil 30 30 430 omis 30 30 429 error 30 .. -30 409 copi 30 30 290 not 29 17 508 branch 28 16 Table 4. 
Selected Token List for Corpus 3 (Document Frequency= 50%, Term Frequency = 4%) ID Token Frequency Number of Document 46 storag 612 20 3 san 574 20 211 manag 435 19 253 product 382 19 82 network 288 19 165 vendor 247 18 47 system 232 19 60 server 223 20 26 na 218 11 100 switch 189 18 41 data 189 15 261 market 162 17 103 support 149 19 726 devic 147 13 451 fil 143 13 58 technologi 136 19 106 bas 132 16 199 provid 131 18 212 softwar 129 17 252 applic 123 16 399 emc 114 14 363 gartner 107 20 270 disk 105 14 169 offer 102 14 143 inform 101 20 1 Compaq 100 11 198 requir 94 15 217 solution 91 16 157 includ 87 18 156 user 82 17 52 virtual 82 12 15 arrai 80 10 295 custom 74 15 362 content 71 20 51 enterpr 71 14 455 shar 70 18 654 interoper 67 10 63 function 66 16 139 perform 65 13 315 larg 63 17 335 high 62 16 268 featur 62 14 146 environ 62 12 359 area 60 19 93 channel 60 18 90 port 60 11 12 number 59 20 476 servic 59 16 323 integr 59 13 45 issu 58 20 357 attach 58 14 185 configur 57 14 241 volum 57 13 11 not 56 20 44 kei 56 20 77 standard 55 16 233 level 55 15 667 compani 55 15 130 capac 55 10 168 avail 54 15 61 control 54 14 239 access 53 12 92 fibr 52 17 360 interfac 52 11 214 connect 51 18 247 oper 50 16 216 cost 50 10 571 protocol 49 11 762 develop 47 16 345 continu 47 14 54 architectur 46 13 99 fabric 46 13 537 work 46 12 597 tool 46 11 28 lin 45 16 290 end 45 15 325 deliv 45 15 282 creat 45 13 126 instal 44 16 265 ibm 44 14 50 announc 44 13 634 compon 44 10 178 host 42 14 228 tim 42 13 55 hardwar 42 12 516 gener 41 16 124 addition 41 15 313 limit 41 15 699 approach 41 13 372 contain 40 20 203 multipl 40 12 299 result 39 20 380 complet 39 20 630 busi 39 12 281 mov 39 12 496 seal 39 10 244 chang 38 20 374 sourc 38 20 89 singl 38 10 9 type 37 20 475 block 37 10 29 comput 36 14 813 test 35 10 10 research 34 15 435 mak 34 15 24 pric 34 11 123 scalabl 34 11 367 form 33 20 257 small 33 13 551 serv 33 10 794 implement 33 10 219 part 32 15 358 independ 32 14 526 resourc 32 13 115 ad 32 12 505 increas 32 10 376 reliabl 31 20 49 year 31 17 459 current 31 13 232 set 31 13 223 enabl 31 12 294 percent 31 12 326 improv 30 13 527 common 30 12 465 expand 30 10 391 respons 29 20 213 infrastructur 29 10 21 exist 28 13 479 specif 28 12 320 challeng 27 13 808 todai 27 13 869 verita 27 10 38 public 26 20 394 achiev 26 20 27 bottom 26 13 439 expect 26 12 237 logic 26 11 404 capabl 26 10 361 entir 25 20 383 error 25 20 259 strategi 25 12 312 platform 25 11 16 position 25 10 639 perspect 25 10 Appendix D: Porter Stemming Algorithm M.F.Porter 1980 Originally published in Program. 14(3): 1980. 130-137. 1. Introduction Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vetor of words, or terms. Terms with a common stem will usually have similar meanings, for example: CONNECT CONNECTED CONNECTING CONNECTION CONNECTIONS Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, IONS to leave the single term CONNECT. 
In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous. Many strategies for suffix stripping have been reported in the literature.(e g 1 _ 6 ) The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve IR performance, the suffix stripping program will usually be given an explicit list of siffixes, and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. This is the approach adopted here. The main merits of the present program are that it is small (less than 400 lines of BCPL), fast (it will process a vocabulary of 10,000 different words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and reasonably simple. At any rate, it is simple enough to be described in full as an algorithm in this paper. (The present version in BCPL is freely available from the author. BCPL is itself available on a wide range of different computers, but anyone wishing to use the program should have little difficulty in coding it up in other programming languages.) Given the speed of the program, it would be quite realistic to apply it to every word in a large file of continuous text, although for historical reasons we have found it convenient to apply it only to relatively small vocabulary lists derived from continuous text files. In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means. 105 Perhaps the best criterion for removing suffixes from two words W l and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements 'a document is about W l ' and 'a document is about W2'. So if Wl=CONNECTION' and W2=CONNECT10NS' it seems very reasonable to conflate them to a single stem. But if Wl= RELATE' and W2=RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics. (It should perhaps be added that RELATE and RELATIVITY are conflated together in the algorithm described here.) Between these two extremes there is a continuum of different cases, and given two terms Wl and W2, there will be some variation in opinion as to whether they should be conflated, just as there is with deciding the relevance of some document to a query. The evaluation of the worth of a suffix stripping system is correspondingly difficult. The second point is that with the approach adopted here, i.e. the use of a suffix list with various rules, the success rate for the suffix stripping will be significantly less than 100% irrespective of how the process is evaluated. For example, if SAND and SANDER get conflated, so most probably will WAND and WANDER. The error here is that the -ER of WANDER has been treated as a suffix when in fact it is part of the stem. Equally, a suffix may completely alter the meaning of a word, in which case its removal is unhelpful. 
PROBE and PROBATE for example, have quite distinct meanings in modern English. (In fact these would not be conflated in our present algorithm.) There comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere. Unless this phenomenon is noticed in time, it is very easy for the program to become much more complex than is really necessary. It is also easy to give undue emphasis to cases which appear to be important, but which turn ut to be rather rare. For example, cases in which the root of a word changes with the addition of a suffix, as in DECEIVE/DECEPTION, RESUME/RESUMPTION, INDEX/INDICES occur much more rarely in real vocabularies than one might at first suppose. In view of the error rate that must in any case be expected, it did not seem worthwhile to try and cope with these cases. It is not obvious that the simplicity of the present program is any demerit. In a test on the well-known Cranfield 200 collection7 it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since ! 971/2.6). T h e t e s t was done as follows: the words of the titles and abstracts in the documents were passed through the earlier suffix stripping system, and the resultis stems were used to index the documents. The words of the queries were reduced to stems in the same way, and the documents were ranked for each query using term coordination matching of query against document. From these rankings, recall and precision values were obtained using the standard recall cutoff method. The entire process was then repeated using the suffix stripping system described in this paper, and the results were as follows: earlier system present system precision recall precision recall 0 57.24 0 58.60 10 56.85 10 58.13 20 52.85 20 53.92 30 42.61 30 43.51 40 42.20 40 39.39 106 50 39.06 50 38.85 60 32.86 60 33.18 70 31.64 70 31.19 80 ' 27.15 80 27.52 90 24.59 90 25.85 100 24.59 100 25.85 Geary, the performance is not very different. The important point is that the earlier, more elaborate system certainly performs no better than the present, simple system. (This test was done by prof. C.J. van Rijsbergen.) 2. The Algorithm To present the suffix stripping algorithm in its entirety we will need a few difinitions. A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term "consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel. A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV ... C CVCV ... V VCVC ... C VCVC ... V These may all be represented by the single form [C1VCVC ... [V] where the square brackets denote arbitrary presence of their contents. Using (VC) m to denote VC repeated m times, this may again be written as [C](VCf[V]. m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 TR, EE, TREE, Y, BY. m=l TROUBLE, OATS, TREES, IVY. m=2 TROUBLES, PRIVATE, OATEN, ORRERY. 
The rules for removing a suffix will be given in the form (condition) SI -> S2 This means that if a word ends with the suffix SI, and the stem before SI satisfies the given condition, SI is replaced by S2. The condition is usually given in terms of m, e.g. (m > 1) EMENT -> Here SI is "EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. The 'condition' part may also contain the following: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). 107 And the condition part may also contain expressions with and, or and not, so that (m>l and (*S or *T)) tests for a stem with m>l ending in S or T, while (*d and not (*L or *S or *Z)) tests for a stem ending witha double consonant other than L, S or Z. Elaborate conditions like this are required only rarely. In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching SI for the given word. For example, with SSES -> SS IES ->I SS ->SS s -> (here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for SI. Equally CARESS maps to CARESS (Sl= SS') and CARES to CARE(S1= S'). In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows: Step la SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step lb (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing If the second or third of the rules in Step 1 b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=l and *o) -> E fail(ing) -> fail fil(ing) -> file The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4. 108 Step lc (*v*) Y -> I happy -> happi sky -> sky Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward. Step 2 (m>0) ATIONAL (m>0)TIONAL --> > ATE TION ENCE ANCE IZE ABLE AL ENT -> -> -> -> -> -> -> E -> OUS (m>0) ENCI (m>0) ANC1 (m>0) IZER (m>0) ABL1 (m>0) ALLI (m>0) ENTL1 (m>0) ELI (m>0) OUSL1 (m>0) IZATION -> IZE (m>0)ATION -> ATE (m>0)ATOR -> ATE (m>0)ALISM -> AL (m>0) IVENESS-> IVE (m>0) FULNESS -> FUL (m>0) OUSNESS -> OUS (m>0)ALITI -> AL (m>0)IVITI -> IVE (m>0)BILITI -> BLE relational conditional rational valenci hesitanci > relate -> condition -> rational > valence -> hesitance digitizer -> digitize conformabli -> conformable radicalli -> radical differentli -> different vileli -> vile analogousli -> analogous vietnamization -> vietnamize predication -> predicate operator -> operate feudalism -> feudal decisiveness -> decisive hopefulness -> hopeful callousness -> callous formaliti -> formal sensitiviti -> sensitive sensibiliti -> sensible The test for the string SI can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string SI. 
It will be seen in fact that the SI-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps. Step 3 (m>0)ICATE-> IC (m>0) ATIVE -> (m>0)ALIZE-> AL (m>0)ICITI-> IC (m>0) 1CAL -> IC (m>0)FUL -> (m>0)NESS -> Step 4 (m>l)AL -> (m>l)ANCE -> (m>l)ENCE -> (m>l)ER -> (m>l) IC -> (m>l) ABLE -> (m>l) IBLE -> triplicate -> triplic formative -> form formalize -> formal electriciti -> electric electrical -> electric hopeful -> hope goodness -> good revival -> reviv allowance -> allow inference -> infer airliner -> airlin gyroscopic -> gyroscop adjustable -> adjust defensible -> defens 109 (m>l) ANT -> irritant -> irrit (m>l)EMENT-> replacement -> replac (m>l)MENT -> adjustment -> adjust (m>l)ENT -> dependent -> depend (m>l and (*S or *T)) ION -> adoption -> adopt (m>l)OU -> homologou •..-> homolog (m>l)ISM -> communism -> commun (m>l) ATE -> activate -> activ (m>l) ITI -> angulariti -> angular (m>l)OUS -> homologous -> homolog (m>l)lVE -> effective -> effect (m>l) IZE -> bowdlerize -> bowdler The suffixes are now removed. Al l that remains is a little tidying up. S t e p 5a (m>l)E -> probate -> probat rate -> rate (m=l and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter contrail -> control roll -> roll The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix. For example, in the following two lists: list A list B RELATE DERIVATE PROBATE ACTIVATE CONFLATE DEMONSTRATE PIRATE NECESSITATE PRELATE RENOVATE -ATE is removed from the list B words, but not from the list A words. This means that the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONS- TRABLE, NECESSITATE/NECESSITOUS, will conflate together. The fact that no attempt is made to identify prefixes can make the results look rather inconsistent. Thus PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation. Complex suffixes are removed bit by bit in the different steps. Thus GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows: 110 Suffix stripping of a vocabulary of 10,000 words Number of words reduced in step 1: 3597 tl 2: 766 tl 3: 327 tl 4: 2424 11 5: 1373 Number of words not reduced: 3650 The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third. I l l Appendix E: Notation and Formula Table Formula Notation and Formula Descriptions •£><.p*., ft) =_L g_(r>iog p a i * J (1-1) The relative entropy or Kullback-Liebler distance between two probability distributions Pa and PZ>. Where Y is the data collection and Pa, Pb are two probability distributions ofY. Div (Pa, Pb) = D(Pa, Pb) + D(Pb, Pa) (1.2) Divergence measure of KL distance. D(Pa, Pb), D(Pb, Pa) are defined in formula 1.1. 
Distance(i, j) = Div(Pi(left), Pj(left)) + Div(Pi(right), Pj(right))  (1.3)
The distance between cluster i and cluster j is the sum of the KL distance between the left-context probability distributions, p(left), of the two clusters and the KL distance between the right-context probability distributions, p(right), of the two clusters.

pi(left(vk)) = p(left(vk), cluster i) / p(cluster i), and analogously for pi(right(vk))  (1.4)
Calculation formula of p(left) and p(right). pi(left(vk)) is the probability that word vk is found to the left of words in cluster i (Chotimongkol and Rudnicky, 1999). Similarly, pi(right(vk)) is the probability that word vk is found to the right of words in cluster i.

AMI = Σ(i,j) p(i, j) log [ p(i, j) / (p(i) p(j)) ]  (1.5)
Definition of Average Mutual Information (AMI), where p(i, j) is the bi-gram probability of cluster i and cluster j, i.e. the probability that a word in cluster i precedes a word in cluster j (Chotimongkol and Rudnicky, 1999).

C(u, v) = Σj f(Su, j) × f(Sv, j)  (2.1)
Definition: the frequency of a term t(i) in a document d(j) is referred to as f(i, j). Let m = (m(i, j)) be an association matrix with |Sl| rows and |Dl| columns, where m(i, j) = f(i, j). Let mt be the transpose of m. The matrix s = m × mt is a local term-term association matrix. Each element s(u, v) in s expresses a correlation C(u, v) between the terms Su and Sv, as given by formula 2.1.

C(u, v) = Σi Σj 1 / r(ki, kj)  (2.2)
Definition: let the distance r(ki, kj) between two keywords ki and kj be given by the number of words between them in the same document plus one. If ki and kj are in distinct documents, we take r(ki, kj) = ∞. A local term-term metric correlation matrix s is then defined so that each element s(u, v) expresses a metric correlation C(u, v) between the terms Su and Sv, as given by formula 2.2.

A(ti, tj) = Σ W / D(ti, tj), where W = 1 if ti and tj appear in the same sentence (n = 0); W = (N − n) / N if 0 < n < N; W = 0 if n ≥ N  (2.7.1)
Definition of the Affinity Value measure (not symmetric). Σ means that all occurrences of ti and tj in the whole document collection are accumulated. D is the distance: the number of terms between the two terms being evaluated plus one. N is the number of sentences in the evaluation scope, while n is the number of sentences between the two terms being evaluated plus one (if the two terms are in the same sentence, n = 0).

AVE(ti, tj) = A(ti, tj) / CN(ti, tj) if CN(ti, tj) ≠ 0; otherwise AVE(ti, tj) = 0  (2.7.2)
Definition of the Average Affinity value measure. CN(ti, tj) is the number of co-occurrences of terms ti and tj. A(ti, tj) is defined in formula 2.7.1.

CO(N) = N! / (2 × (N − 2)!)  (4.1.1)
For a cluster a, N is the number of unique terms in cluster a. The number of unique combinations of any two unique terms in cluster a, CO(N), is calculated using formula 4.1.1.

AveSim(cluster a) = Σ(ti, tj) A(ti, tj) / CO(N)  (4.1.2)
The average within-cluster similarity is defined by formula 4.1.2. Σ(ti, tj) means accumulating the affinity value over all unique combinations of ti and tj in cluster a. CO(N) is defined in formula 4.1.1.

Appendix F: Original Output Files from the System (provided on CD-ROM)

Appendix G: System Program Code (provided on CD-ROM)
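For readers who prefer a concrete rendering of the affinity formulas in Appendix E (2.7.1, 2.7.2, 4.1.1 and 4.1.2), the following is a minimal illustrative sketch in Python. It is not the system code referred to in Appendix G: the document representation (a corpus as a list of documents, each document a list of sentences, each sentence a list of stemmed index terms), the function names, and the default window size N are assumptions made purely for illustration.

from itertools import product, combinations

def pair_affinity(doc_sentences, ti, tj, N=5):
    # A(ti, tj) and the co-occurrence count CN(ti, tj) for one document
    # (formula 2.7.1).  doc_sentences: a list of sentences, each a list of
    # stemmed terms.  N: the evaluation scope (sentence window), assumed here.
    occurrences = {ti: [], tj: []}
    position = 0
    for s_index, sentence in enumerate(doc_sentences):
        for term in sentence:
            if term in occurrences:
                occurrences[term].append((s_index, position))
            position += 1
    total, cn = 0.0, 0
    for (si, wi), (sj, wj) in product(occurrences[ti], occurrences[tj]):
        n = abs(si - sj)                  # n = 0 when both terms share a sentence
        if n >= N or wi == wj:            # outside the scope (weight 0), or same token
            continue
        w = 1.0 if n == 0 else (N - n) / N
        d = abs(wi - wj)                  # number of terms between them, plus one
        total += w / d
        cn += 1
    return total, cn

def average_affinity(corpus, ti, tj, N=5):
    # AVE(ti, tj) of formula 2.7.2: affinity accumulated over the whole
    # collection, divided by the number of co-occurrences CN(ti, tj).
    total, cn = 0.0, 0
    for doc in corpus:
        a, c = pair_affinity(doc, ti, tj, N)
        total, cn = total + a, cn + c
    return total / cn if cn else 0.0

def ave_sim(cluster_terms, corpus, N=5):
    # Average within-cluster similarity (formulas 4.1.1 and 4.1.2).  len(pairs)
    # equals CO(N) = N! / (2 * (N - 2)!).  Using the average affinity of 2.7.2
    # as the pairwise value is an assumption of this sketch.
    pairs = list(combinations(sorted(set(cluster_terms)), 2))
    if not pairs:
        return 0.0
    return sum(average_affinity(corpus, a, b, N) for a, b in pairs) / len(pairs)

As a usage illustration, a call such as ave_sim(["privat", "virtual", "network"], corpus) would produce the kind of "Ave Sim" score reported for the expert-identified concepts in Appendix A, up to the exact definitions and parameter settings used by the implemented system.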
