Open Collections
UBC Theses and Dissertations
Experimenting with automatic concept identification in documents
Mi, Andy Lu
Abstract
Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them manage their data files. This paper expands research into domain concepts and the processes of identifying them automatically from document collections with unrestricted yet narrowly defined domains. An automatic, scalable and consistent model of concept identification is proposed, integrating automatic text indexing techniques (for example stop-wording, stemming and phrase formation), a newly developed affinity measure, and Agglomerative Hierarchical Clustering techniques. To test the results of the proposed approach quantitatively, a system based on the proposed model has been developed and implemented, and three sensitivity studies have been conducted against three collections of technical white papers. This study contributes to the development of a word pair-wise affinity measure based on word co-occurrence, the distance between words being evaluated, and a variety of selection criteria and thresholds for index terms (e.g. Total Frequency and Document Frequency). This study's results concerning concept identification demonstrate that the proposed model generally delivers positive concept identification outcomes. The results of the sensitivity studies provide empirical evidence regarding the effects on concept identification outcomes generated by different index term selection thresholds, different sizes of co-occurrence windows, and different characteristics of document collections.
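The pipeline the abstract outlines (index term selection, a window-based pair-wise affinity, then agglomerative clustering) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the stop-word list, the function names, the threshold values, the affinity formula (a sum of 1/distance over co-occurrences inside the window), and the choice of average linkage. Stemming and phrase formation are omitted, and distances are counted in tokens after stop-word removal.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def select_index_terms(docs, min_total_freq=2, min_doc_freq=2):
    """Keep words that meet both a Total Frequency and a Document Frequency threshold."""
    total = Counter(w for doc in docs for w in doc)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w for w in total
            if total[w] >= min_total_freq and doc_freq[w] >= min_doc_freq}


def pairwise_affinity(docs, terms, window=2):
    """Assumed affinity: add 1/distance for each co-occurrence within the window."""
    affinity = defaultdict(float)
    for doc in docs:
        for i, w1 in enumerate(doc):
            if w1 not in terms:
                continue
            for j in range(i + 1, min(i + window + 1, len(doc))):
                w2 = doc[j]
                if w2 in terms and w2 != w1:
                    affinity[frozenset((w1, w2))] += 1.0 / (j - i)
    return affinity


def cluster(terms, affinity, threshold=1.0):
    """Average-linkage agglomerative clustering; stop when the best link is below threshold."""
    clusters = [frozenset([t]) for t in sorted(terms)]

    def link(a, b):
        pair_sum = sum(affinity.get(frozenset((x, y)), 0.0) for x in a for y in b)
        return pair_sum / (len(a) * len(b))

    while len(clusters) > 1:
        score, a, b = max(
            ((link(a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if score < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Toy collection, invented for this example.
raw_docs = [
    "The neural network model trains the neural network.",
    "A neural network needs data storage.",
    "Data storage and data backup for the data storage system.",
]
docs = [tokenize(d) for d in raw_docs]
terms = select_index_terms(docs)
affinity = pairwise_affinity(docs, terms, window=2)
concepts = cluster(terms, affinity, threshold=1.0)
print(sorted(sorted(c) for c in concepts))
```

On this toy input the sketch keeps four index terms and groups them into two candidate concepts. Raising `window` or lowering the frequency thresholds changes which pairs accumulate affinity, which is the kind of effect the sensitivity studies in the abstract examine.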
Item Metadata
Title | Experimenting with automatic concept identification in documents
Creator | Mi, Andy Lu
Publisher | University of British Columbia
Date Issued | 2003
Extent | 5135191 bytes
File Format | application/pdf
Language | eng
Date Available | 2009-10-29
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0090963
Degree Grantor | University of British Columbia
Graduation Date | 2003-11
Scholarly Level | Graduate
Aggregated Source Repository | DSpace