Open Collections
UBC Theses and Dissertations
Improve classification on infrequent discourse relations via training data enrichment
Jiang, Kailang
Abstract
Discourse parsing is a popular technique widely used in text understanding, sentiment analysis, and other NLP tasks. However, for most discourse parsers, the performance varies significantly across different discourse relations. In this thesis, we first validate the underfitting hypothesis, i.e., the less frequent a relation is in the training data, the poorer the performance on that relation. We then explore how to increase the number of positive training instances without resorting to manually creating additional labeled data. We propose a training data enrichment framework that relies on co-training of two different discourse parsers on unlabeled documents. Importantly, we show that co-training alone is not sufficient. The framework requires a filtering step to ensure that only “good quality” unlabeled documents are used for enrichment and re-training. We propose and evaluate two ways to perform the filtering. The first is to use an agreement score between the two parsers. The second is to use only the confidence score of the faster parser. Our empirical results show that the agreement score can help boost the performance on infrequent relations, and that the confidence score is a viable approximation of the agreement score for infrequent relations.
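The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the agreement score is computed, how parses are represented, or which parser's output is kept for re-training, so everything below is an assumption made for illustration: a parse is modeled as a mapping from a span pair to a relation label, `agreement_score` is a hypothetical metric (the fraction of spans labeled identically by both parsers), and `select_for_enrichment` keeps the first parser's output as the pseudo-label.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: a parse maps a (start, end) span pair to a relation label.
Parse = Dict[Tuple[int, int], str]
Parser = Callable[[str], Parse]


def agreement_score(parse_a: Parse, parse_b: Parse) -> float:
    """Hypothetical metric: share of spans that both parsers label identically."""
    spans = set(parse_a) | set(parse_b)
    if not spans:
        return 0.0
    matches = sum(
        1 for s in spans if s in parse_a and s in parse_b and parse_a[s] == parse_b[s]
    )
    return matches / len(spans)


def select_for_enrichment(
    docs: List[str], parser_a: Parser, parser_b: Parser, threshold: float
) -> List[Tuple[str, Parse]]:
    """Keep only unlabeled documents on which the two parsers agree well enough."""
    selected = []
    for doc in docs:
        parse_a, parse_b = parser_a(doc), parser_b(doc)
        if agreement_score(parse_a, parse_b) >= threshold:
            # Assumption: parser A's output serves as the pseudo-label.
            selected.append((doc, parse_a))
    return selected
```

Under the same assumptions, the second filter mentioned in the abstract would replace `agreement_score` with a confidence value reported by the faster parser alone, so that only one parser needs to run over each unlabeled document.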
Item Metadata
Title | Improve classification on infrequent discourse relations via training data enrichment
Creator | Jiang, Kailang
Publisher | University of British Columbia
Date Issued | 2016
Genre | |
Type | |
Language | eng
Date Available | 2017-01-21
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0340024
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor | University of British Columbia
Graduation Date | 2017-02
Campus | |
Scholarly Level | Graduate
Rights URI | |
Aggregated Source Repository | DSpace