UBC Theses and Dissertations
Exploring machine learning design options in discourse parsing Liao, Weicong 2015


Exploring Machine Learning Design Options in Discourse Parsing

by

Weicong Liao
B.Sc., Zhejiang University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2015

© Weicong Liao 2015

Abstract

Discourse parsing has recently attracted increasing interest among researchers, since it is very helpful for text understanding, sentiment analysis and other NLP tasks. In a well-written text, authors often use discourse to better organize the text, and sentences (or clauses) tend to interact with neighboring sentences (or clauses). Each piece of text locally exhibits a finer discourse structure called rhetorical structure, and a document can be organized into a discourse tree (this process is called discourse parsing), which seeks to capture the discourse structure and logically binds the sentences (or clauses) together.

However, useful as discourse parsing is, and although intra-sentential discourse parsing already achieves high performance, multi-sentential discourse parsing remains a big challenge in terms of both accuracy and efficiency. In addition, machine learning techniques have proven successful in many NLP tasks, including discourse parsing. Thus, in this thesis, we try to enhance the performance (e.g., accuracy, efficiency) of discourse parsing by using machine learning techniques. To this aim, we propose a novel two-step discourse parsing system, which first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned conditional random fields (CRFs), and then uses learned log-linear models to tag all discourse relations onto the nodes of the discourse tree.

We analyze different aspects of the problem (e.g., sequential vs. non-sequential model, greedy vs. optimal parsing, joint vs. separate model) and discuss their trade-offs. We also carried out extensive experiments to study the usefulness of different feature families and over-fitting. We find that the most effective feature sets for different tasks differ: part-of-speech (POS) and context features are the most effective for intra- and multi-sentential structure prediction respectively, while ngram features are the most effective for both intra- and multi-sentential relation labeling. Moreover, over-fitting does occur in our experiments, so proper regularization is needed. Our final results show that our system achieves state-of-the-art F-scores of 86.2, 72.2 and 59.2 in structure, nuclearity and relation, and that it is more efficient than Joty's system [19] (training: 40 times faster; testing: 3 times faster).

Preface

This work is motivated by the past research of Shafiq Joty, Giuseppe Carenini, Raymond Ng and Yashar Mehdad on discourse parsing for the RST-DT corpus from the Wall Street Journal. The present writer conducted all the experiments and wrote most of the manuscript for this thesis. Raymond Ng and Giuseppe Carenini were the supervisory authors of this project and were involved throughout the project in concept formation and manuscript editing.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
  1.1 Discourse Parsing
  1.2 Motivation and Contributions
  1.3 Outline
2 Discourse Parsing and Related Work
  2.1 Discourse Analysis and Rhetorical Structure Theory
  2.2 Log-linear Models and Conditional Random Fields
  2.3 Existing Discourse Parsers
  2.4 Model Reasoning and Choice
3 Our Discourse Parsing Framework
  3.1 System Overview
  3.2 Structure Prediction
  3.3 Log-linear Relation Labeling Model
  3.4 Features
4 Experiments and Analysis
  4.1 RST-DT Corpus
  4.2 Evaluation
  4.3 Experiments and Results Analysis
5 Conclusion and Future Work
Bibliography
Appendices
A Complete Discourse Relations

List of Tables

1.1 Highlights of three most recent research studies on discourse parsing
2.1 The set of coherence relations defined by Mann and Thompson (1988)
2.2 Comparison of three most recent discourse parsers
2.3 Intra-sentential parsing reasoning
2.4 Multi-sentential parsing reasoning
2.5 Result comparison of three recent discourse parsing systems
3.1 Structural features
3.2 Semantic features
4.1 18 rhetorical relation classes in RST-DT
4.2 Illustration for our pair-based evaluation approach on measuring correctness of the human-annotated and system-generated discourse trees in Figure 4.2 (a) and (b)
4.3 10-fold CV structure prediction results for intra-sentential parsing
4.4 Level-wise F-score
4.5 Comparison on system accuracy performance
4.6 Running time performance for our system
4.7 Comparison of running time performance on different systems
A.1 Definitions of Subject Matter Relations: Part I
A.2 Definitions of Subject Matter Relations: Part II
A.3 Definitions of Presentational Relations
A.4 Definitions of Multinuclear Relations

List of Figures

1.1 Discourse tree example for a sentence that has three EDUs
2.1 RST tree (discourse tree) examples
2.2 Discourse tree example for reasoning
3.1 Pipeline of our discourse parsing framework
3.2 Intra-sentential Linear Chain CRF Structure Model
3.3 Multi-sentential CRF Structure Model
3.4 Log-linear Relation Labeling Model
4.1 Distribution of most frequent rhetorical relations at the intra- and multi-sentential levels
4.2 Illustration for the constituent-based evaluation approach taken from [19]: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.3 Example for the first type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.4 Example for the second type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.5 Training vs. test F-score in intra-sentential structure prediction with default L2 (0.1): (a) using text organization and POS features (3846 parameters in total); (b) using text organization, POS and ngram features (48846 parameters in total)
4.6 Feature exploration results on structure prediction for intra-sentential parsing
4.7 Feature exploration results on structure prediction for multi-sentential parsing
4.8 Relation labeling test F-score on different feature sets for: (a) intra-sentential; (b) complete parser
4.9 Training vs. test pair-based nuclearity&relation F-score in relation labeling with default L2 (0.1) for: (a) intra-sentential; (b) multi-sentential
4.10 Training vs. test pair-based nuclearity&relation F-score in relation labeling using text organization and selected head and tail ngram features with default L2 (0.1) for: (a) intra-sentential; (b) multi-sentential
4.11 Feature exploration results on relation labeling for intra-sentential parsing
4.12 Feature exploration results on relation labeling for multi-sentential parsing
4.13 Test pair-based nuclearity&relation F-score on top-k relations in relation labeling

List of Abbreviations

BOW    Bag of Words
CRF    Conditional Random Field
CYK    Cocke–Younger–Kasami algorithm
DT     Discourse Tree
EDU    Elementary Discourse Unit
i.i.d. independent and identically distributed
NER    Named Entity Recognition
NLP    Natural Language Processing
POS    Part-of-Speech
RST    Rhetorical Structure Theory
RST-DT RST Discourse Treebank
SVM    Support Vector Machine
UGM    Uni-variate Graphical Model
Acknowledgments

I would like to offer my sincere gratitude to my advisers, Dr. Raymond T. Ng and Dr. Giuseppe Carenini, who both provided me with a perfect balance of guidance and freedom that allowed me to develop and pursue this thesis. During our discussions they both shared useful insights that drove me towards this research.

I would also like to express my gratitude to Dr. Shafiq Joty, who generously made his code publicly available, and to both Dr. Shafiq Joty and Dr. Shima Gerani for their help in running Dr. Shafiq Joty's code.

Last but not least, I would like to thank my family for their unconditional love, encouragement and support during all my studies.

Chapter 1

Introduction

In the past decade, the rapid growth of the Internet has led to an information explosion, which in turn has led to the rise of the term "big data". The "big data" challenge includes analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations for large and complex data sets. All kinds of data, such as images, videos, audio and text in different areas, are publicly available to all Internet users across the world. So scientists and researchers have started to think about questions like:

• Can we retrieve potentially valuable data and store them efficiently?
• How can we discover useful patterns from massive data sets, which may contain noisy or incorrect information?
• Can we do knowledge discovery accurately and efficiently without much manual labor?

All these questions have led to the rapid expansion of research areas and technologies such as information retrieval, data mining, machine learning, computer vision and natural language processing. In this thesis, we mainly focus our study on two of these areas: natural language processing and machine learning. More specifically, we focus on a text data problem, namely discourse parsing, which belongs to the domain of natural language processing, with the help of some well-known machine learning techniques.

1.1 Discourse Parsing

In [19], Joty et al. proposed the following:

"A well-written text is not merely a sequence of independent and isolated sentences (or clauses), but instead a sequence of structured and related sentences (or clauses), where the meaning of a sentence (or clause) relates to the previous and the following ones."

In other words, each sentence (or clause) follows smoothly from the ones before and leads into the ones after. For instance, consider the following two examples from [17]:

• "John took a train from Paris to Istanbul. He has family there."
• "John took a train from Paris to Istanbul. He likes spinach."

Readers find it easy to understand the first example, in which the second sentence provides background for the first sentence, while most of them have problems understanding the second example and consider it "incoherent". Most of the time, authors ensure a consistent coherence structure to make their text interpretable and logical (e.g., authors prefer writing text like the first example above rather than the second). This is the foundation of discourse analysis.

In discourse analysis, the goal is to uncover this kind of coherence structure underneath the text.
Several formal discourse theories have been proposed to describe this kind of coherence structure, by Mann and Thompson (1988), Martin (1992), Asher and Lascarides (2003), Webber (2004) and Danlos (2009) [51, 31, 1, 50, 9]. The most popular one, Rhetorical Structure Theory (RST), proposed by Mann and Thompson (1988) [51], represents texts by labeled hierarchical structures, called discourse trees (DTs), in which leaves correspond to elementary discourse units¹, internal nodes correspond to contiguous text spans (discourse spans), and connections between nodes correspond to discourse relations² between these text spans. For illustration, Figure 1.1 shows a discourse tree example for the sentence "The bank also says, it will use its network to channel investigations" from [40], which has three EDUs. More complex examples of discourse trees will be presented and explained in section 2.1. The process of generating a DT is called discourse parsing.

[Figure 1.1: Discourse tree example for a sentence that has three EDUs]

¹ EDUs are the basic unit we consider in discourse analysis and discourse parsing. As we will see in more detail in section 2.1, these typically correspond to clauses.
² We use the terms "discourse relation", "rhetorical relation" and "coherence relation" interchangeably in this thesis.

One might ask why discourse parsing is useful. Imagine there are thousands of reviews for a camera online and we want to do some textual analysis, like sentiment analysis, on how the users like or dislike the camera. If we treat each review as a bag of words (BOW), we may lose most of the structural and order information of the review. For example, if we use the BOW model to analyze this review: "It has great appearance but I don't like it since it has bad lens", we may get a list of words like "great", "don't like" and "bad". From these words, we just know that the user has mixed opinions (both good and bad) about the camera, but we cannot determine his opinion at a finer granularity. However, with discourse parsing, we can identify the coherence structure of this review: the first part of the sentence, "It has great appearance", is contradictory to the major part of the sentence, "I don't like it"; and the last part of the sentence, "it has bad lens", is the reason for the major part of the sentence, "I don't like it". In the end, we have a clearer picture of the user's opinion: he does not like the camera. As a matter of fact, the coherence structure we identify by discourse parsing has been shown to be beneficial for many Natural Language Processing (NLP) applications, including:

• Text summarization and compression: Louis, Joshi, and Nenkova (2010), Marcu (2000), Daumé and Marcu (2002), Sporleder and Lapata (2005) [26, 28, 29, 41]
• Text generation: Prasad et al. (2005) [37]
• Machine translation evaluation: Guzman et al. (2014) [14]
• Sentiment analysis: Somasundaran (2010), Lazaridou, Titov, and Sporleder (2013), Shima et al. (2014) [39, 25, 13]
• Information extraction: Teufel and Moens (2002), Maslennikov and Chua (2007) [46, 32]
• Question answering: Verberne et al. (2007) [48]

1.2 Motivation and Contributions

As discussed above, we can find many applications related to discourse parsing. However, manually building discourse trees for given texts is not practical, since it takes too much human effort. Thus, many researchers have started to build automatic systems for discourse parsing.
Conventionally, there are two major sub-tasks related to discourse parsing: (1) discourse segmentation, which takes raw text as input and segments it into EDUs, and (2) tree building, which takes EDUs as input and builds a discourse tree for them, representing the discourse structure and relations of the text. The first sub-task, discourse segmentation, is already close to the performance of human annotation, with state-of-the-art accuracy above 90% [20]. However, the second sub-task, tree building, is still far from perfect in terms of both accuracy and efficiency, especially at the multi-sentential level, which deals with many sentences, compared to the intra-sentential level, which deals with only one sentence. Thus, researchers have recently focused on the second sub-task; we also focus on the second sub-task and use the terms "discourse parsing" and "tree building" interchangeably in this thesis.

Some of the most recent research studies have shown substantial improvements on both intra- and multi-sentential discourse parsing. For example, Joty et al. [21] proposed a framework that uses CRF-based joint models and an optimal parsing strategy. This framework is the first to apply the popular and successful (in NLP) CRF machine learning algorithm proposed by Lafferty et al. [24], and its variation proposed by Sutton et al. [45], to discourse parsing, and it achieves relatively high accuracy in both intra- and multi-sentential parsing. However, the major bottleneck of this framework is its inefficiency, due to the fact that it uses a complex joint model for both structure and relation, which makes inference slow. Furthermore, it uses optimal parsing, which is O(n^3), where n is the number of sentences in the document. In this thesis, we argue that it is not necessary to model structure and relation jointly, because the joint model actually does not provide extra predictive power. The reason for this will be discussed in section 2.4.

More recently, similar work by Feng et al. [11] achieved linear time complexity in discourse parsing by adopting bottom-up greedy parsing and separating structure prediction from the relation labeling task. Their careful design involving constraints and post-editing, as well as the usage of linear chain CRFs, achieves good accuracy. However, a linear chain CRF is not only slow in terms of training and inference, but also not necessary for relation labeling. The reason for this will be discussed in section 2.4. Moreover, limiting the length of their linear chain CRF in multi-sentential parsing for scalability purposes seems ad hoc.

In another recent work by Ji and Eisenstein [18], the tree building process is formalized as a sequence of decision problems with the help of a transition-based shift-reduce parser. During training, their parser jointly learns a linear transformation from the BOW lexical features (unigrams) to a lower-dimensional latent space representation, and a large-margin decision classifier (SVM) that makes sequential greedy shift-reduce decisions. Their usage of unigrams shows good results for relation labeling; however, their greedy parsing approach, which only captures local information, fails to achieve good accuracy on structure prediction, which relies more on global context.
| | Joty et al. (2013) | Feng et al. (2013) | Ji and Eisenstein (2014) | We pursue |
|---|---|---|---|---|
| Discourse segmentation | ✓ | ✗ | ✗ | ✗ |
| Discourse parsing | ✓ | ✓ | ✓ | ✓ |
| Use all unigrams as features | ✗ | ✗ | ✓ | ✓ |
| Use separate structure and relation models | ✗ | ✓ | ✓ | ✓ |
| Use sequential model (linear chain CRF) | ✓ | ✓ | ✗ | ✓ |
| Use optimal parsing | ✓ | ✗ | ✗ | ✓ |
| High efficiency | ✗ | ✓ | N/A | ✓ |
| Feature exploration | ✓ (incrementally) | ✗ | ✗ | ✓ (fully) |
| Over-fitting exploration | ✗ | ✗ | ✗ | ✓ |

Table 1.1: Highlights of three most recent research studies on discourse parsing

Table 1.1 highlights the aspects studied in these most recent research studies on discourse parsing and presents what we are pursuing in this thesis (the rightmost column). In this thesis, we propose a novel discourse parsing framework and try to overcome the weaknesses of the previous studies mentioned above and shown in the table. It involves two major components: (1) a structure prediction component, which builds the discourse tree structure, without tagging discourse relations onto it, by applying an optimal parsing algorithm to probabilities inferred from learned CRF models; and (2) a relation labeling component, which tags all the discourse relations onto the tree built by the first component, using learned log-linear models. More details will be discussed in chapter 3.

We have four main contributions. Firstly, we perform deep analysis and comparisons with previous work from various angles, so that we know what options (e.g., sequential vs. non-sequential model, greedy vs. optimal parsing, joint vs. separate model, which families of features to use, and which regularizer to use to prevent over-fitting) researchers have when designing a discourse parsing system.

Secondly, we propose a novel discourse parsing framework that achieves state-of-the-art performance in terms of both accuracy and efficiency. More specifically, our system achieves overall F-scores of 86.2, 72.2 and 59.2 in terms of structure (span), nuclearity and relation correctness, and it is about 40 times faster in terms of training and 3 times faster in terms of testing than Joty's system (we do not have access to the other systems for efficiency comparison). Moreover, our system has a very simple architecture and addresses many limitations of previous studies, as follows: (1) it decomposes the problem into smaller parts (structure prediction and relation labeling), making our model much simpler and more efficient than Joty's; (2) it does not require hand-engineered rules and unnecessary constraints, compared to Feng's system, but still achieves better accuracy for both structure prediction and relation labeling while keeping high efficiency; (3) it finds the globally optimal DT, which has better structure correctness than the locally optimal DT found by Ji's system. In addition, our system adopts good observations and ideas from the previous studies and extends them as follows: (1) it makes use of unigrams and bigrams of the whole unit; (2) it uses a sequential linear chain CRF model when appropriate.

Thirdly, we point out issues with the standard constituent-based evaluation metric for measuring discourse parsing correctness and propose our pair-based evaluation metric to solve these issues.

Lastly, we carry out extensive experiments to explore different aspects of the problem which have not been examined before.
For instance, we study the usefulness of different families of features and find that, for structure prediction, the most effective features are part-of-speech (POS) features and contextual features for the intra- and multi-sentential levels respectively, while for relation labeling, the most effective features are ngram features for both the intra- and multi-sentential levels. Moreover, we study the over-fitting phenomenon and find that over-fitting does occur in both intra- and multi-sentential parsing. Thus, we use proper regularization in our models, resulting in better accuracy and smaller models. In general, we aim at finding out to what extent machine learning techniques can advance this challenging NLP problem.

1.3 Outline

In chapter 2, we give more background knowledge on discourse parsing, as well as some background on the machine learning techniques which we adopt in our research. We also give detailed analysis and summaries of the related research studies. In chapter 3, we describe all the necessary details of the components in our framework, including an intra-sentential linear chain CRF model and a multi-sentential CRF model for structure prediction, as well as a log-linear model for relation labeling. In chapter 4, our experimental results for both structure prediction and relation labeling are presented, followed by a deep analysis of features, over-fitting, etc. Finally, conclusions and future directions are discussed in chapter 5.

Chapter 2

Discourse Parsing and Related Work

In this chapter, we discuss background knowledge and related work in discourse parsing. Firstly, we give some background on discourse parsing in section 2.1, as well as some background on log-linear and CRF models in section 2.2. Then, we give a detailed description and a summary of related research in section 2.3, followed by a deep reasoning on our model choice in section 2.4.

2.1 Discourse Analysis and Rhetorical Structure Theory

In section 1.1, we gave a quick introduction to discourse parsing in order to provide some background for the discussion of our motivation. However, there are some more important details about discourse analysis and Rhetorical Structure Theory from Mann and Thompson [51], which we discuss in this section.

As we saw in section 1.1, the goal of discourse analysis is to uncover the coherence structure embedded in the text and relate different sentences (or clauses) with coherence relations. Then what is a coherence relation? We follow the definition of coherence relation by Stede in [42]:

Definition 1. "Coherence relation: A specific relationship, holding on the semantic or the pragmatic level of description, between adjacent units of text. Definitions can be given in semantic terms, or in terms of speaker intentions (as in Rhetorical Structure Theory, RST). The granularity of a postulated relation set can differ widely, but relatively common are the groups causality, similarity/contrast, and contiguity (temporal or other). In the case of RST, most relations are said to hold between a unit that is more important for the purposes of the speaker (nucleus) and one unit that is less important, or supportive in nature (satellite)." [42]

Following this definition, there are two issues that we need to address:

1. What is a basic unit?
In discourse analysis, the smallest unit that we consider is called an elementary discourse unit (EDU), which serves as a building block.

"It is a span of text, usually a clause, but in general ranging from minimally a noun phrase (NP) to maximally a sentence. It should by themselves be structurally complete and denotes a single event or type of event, serving as a complete, distinct unit of information that the subsequent discourse may connect to. Also an EDU may be structurally embedded in another." [42]

There are many different EDU definitions. In this thesis we use the popular RST-DT data set from [3], which has already segmented the EDUs for us.

2. What are the coherence relations?

The set of coherence relations is not well defined, but the one defined by Mann and Thompson (1988) [51] has proven effective for many purposes. Table 2.1 shows a short version of the relation set they defined. The relations are partitioned into Subject-Matter relations (also called Semantic relations) and Presentational relations (also called Pragmatic relations) based on their functionality. Subject-Matter relations express parts of the subject matter of the text, and Presentational relations facilitate the presentation process. The complete relation set and the relation definitions are shown in Appendix A.

[Table 2.1: The set of coherence relations defined by Mann and Thompson (1988)]

Given all the above details and building blocks about coherence, let us now move on to RST, a complete theory system proposed by Mann and Thompson (1988) [51], that organizes a text as an RST tree which uncovers its coherence structure.

In [42], Stede summarized the RST theory system as follows:

"In RST, Mann and Thompson propose that coherence relations are recursively embedded in one another, and they cover the text completely; no portion may be left behind, because then the text would be not coherent. Next, RST posits that relations always hold between adjacent text segments (EDUs or larger spans). The intuition is that there are no 'non-sequiturs' in the text: There must not be a segment that is not meaningfully connected to its left or to its right neighbor. Finally, they argue that a text segment should always fulfill a single function or 'play a single role' within its context. This amounts to the constraint that a node in the coherence-relational structure have exactly one parent." [42]

Following these theories, Mann and Thompson finally formalized discourse analysis as RST tree building. They defined the RST tree as follows:

Definition 2. "RST tree: A text is segmented into a sequence of EDUs. Coherence relations hold between adjacent EDUs, thus forming a node in the structure, and recursively between adjacent larger segments. Every node has exactly one parent (except for the root node), and all EDUs of the text take part in the tree structure, which does not have any crossing branches. Most relations assign different status (nucleus, satellite) to the segments. Almost all relations span over exactly two segments (exceptions are JOINT, LIST, SEQUENCE)." [42]

For illustration, consider the RST tree (we call it a DT afterward) shown in Figure 2.1 for the following text taken from the RST Discourse Treebank (RST-DT) corpus by Carlson, Marcu, and Okurowski (2002) [3]:
"But he added: 'Some people use the purchasers' index as a leading indicator, some use it as a coincident indicator. But the thing it's supposed to measure – manufacturing strength – it missed altogether last month.'"

The six leaves of this DT correspond to six EDUs. Adjacent EDUs are connected by discourse (or coherence) relations (e.g., ELABORATION, CONTRAST), forming larger discourse units (internal nodes), which in turn are also subject to this relation linking. "Discourse units linked by a rhetorical relation are further distinguished based on their relative importance in the text: nuclei are the core parts of the relation, while satellites are the supportive parts of the relation." [42] Here, ELABORATION is the relation between a nucleus (EDU 4) and a satellite (EDU 5), CONTRAST (on the left) is a relation between two nuclei (EDUs 2 and 3), and SAME-UNIT is a relation between two nuclei (the span of EDUs 4,5 and EDU 6). As we can see, all six EDUs are recursively connected together until a whole discourse tree, whose top node is an ATTRIBUTION relation between a satellite (EDU 1) and a nucleus (the span of EDUs 2,3,4,5,6), is formed.

[Figure 2.1: RST tree (discourse tree) examples]
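To make this tree representation concrete, the following is a minimal sketch of how such a discourse tree could be represented programmatically. It is not code from the thesis: the class name and field names are illustrative assumptions, and the relation joining EDUs 2–3 to EDUs 4–6 is not named in the running example, so it is left unspecified; only the relations and nucleus/satellite assignments described above follow the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DTNode:
    """A discourse tree node: a leaf holds one EDU, an internal node joins two adjacent spans."""
    span: Tuple[int, int]              # first and last EDU index covered by this node
    relation: Optional[str] = None     # discourse relation joining the two children (None for leaves)
    nuclearity: Optional[str] = None   # status of (left, right) children, e.g. "NS", "SN", "NN"
    left: Optional["DTNode"] = None
    right: Optional["DTNode"] = None

def leaf(i):
    return DTNode(span=(i, i))

# Rebuilding the shape of the Figure 2.1 example (six EDUs).
e = {i: leaf(i) for i in range(1, 7)}
contrast    = DTNode((2, 3), "CONTRAST",    "NN", e[2], e[3])         # two nuclei
elaboration = DTNode((4, 5), "ELABORATION", "NS", e[4], e[5])         # nucleus + satellite
same_unit   = DTNode((4, 6), "SAME-UNIT",   "NN", elaboration, e[6])  # two nuclei
body        = DTNode((2, 6), None,          "NN", contrast, same_unit)  # relation not named in the text
root        = DTNode((1, 6), "ATTRIBUTION", "SN", e[1], body)         # satellite EDU 1, nucleus EDUs 2-6
```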
2.2 Log-linear Models and Conditional Random Fields

Machine learning has been a hot field for decades and it has been successfully applied to various areas like computer vision, natural language processing, search engines, bio-informatics, stock market analysis, recommender systems and computational advertising. In this thesis, some very successful machine learning techniques, namely log-linear models and CRFs, are applied to our discourse parsing problem. Thus, we give brief introductions to them in this section.

Log-linear models, one of the most famous families of machine learning models, are very widely used in natural language processing and have proven very successful in NLP tasks like part-of-speech (POS) tagging, noun-phrase (NP) chunking, named entity recognition (NER), parsing, sentiment analysis, speech recognition and relation extraction (RE). A key advantage of log-linear models is their flexibility in allowing richer representations of features than simple logistic regression, as can be seen below.

Assume we have a set of possible inputs X, which denotes random context, and a set of possible labels Y. The goal of a log-linear model is to model the conditional probability p(y|x) for any pair (x, y) such that x ∈ X and y ∈ Y. For example, in the language modeling task we have a finite set of possible words in the language; call this set V. The set Y is simply equal to V. The set X is the set of possible sequences w_1 ... w_{i-1} such that i ≥ 1 and w_j ∈ V for j ∈ 1 ... (i-1). Log-linear models are then defined as follows by Michael Collins [6]:

Definition 3. A log-linear model consists of the following components:

• A set X of possible inputs. It could be finite or infinite.
• A set Y of possible labels. It is assumed to be finite.
• A positive integer d specifying the number of features and parameters in the model.
• A function f : X × Y → R^d that maps any (x, y) pair to a feature vector f(x, y).
• A parameter vector θ ∈ R^d.

For any x ∈ X, y ∈ Y, the model defines a conditional probability:

p(y \mid x; \theta) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{y' \in Y} \exp(\theta \cdot f(x, y'))} \qquad (2.1)

Here p(y|x; θ) is the probability of y conditioned on x, parametrized by θ; exp(x) = e^x; θ · f(x, y) = Σ_{k=1}^{d} θ_k f_k(x, y) is the inner product between θ and f(x, y); f(x, y) ∈ R^d is a feature vector for the pair (x, y), and each component f_k(x, y), for k = 1 ... d, is a feature. The features capture characteristics of the input x in conjunction with the label y. For example, in the language modeling task, where the input x is a sequence of words w_1 ... w_{i-1} and the label y is a word, a feature can be an indicator function which returns 1 if y = "model" and w_{i-1} is an adjective, and 0 otherwise. [6]

Each feature has an associated parameter, θ_k, whose value is estimated using training examples and is the key of the model. The learning process is normally a gradient descent optimization which maximizes the (regularized) sum of log-likelihoods objective function:

L(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) - \frac{\lambda}{2} \sum_{k} \theta_k^2 \qquad (2.2)

The CRF, proposed by Lafferty et al. [24], is a special form of log-linear model which tries to capture sequential dependencies that normal log-linear models do not capture. In CRFs (or linear chain CRFs; in this thesis we only discuss linear-chain-style CRFs), the key difference is that we model p(y_1 ... y_T | x_1 ... x_T) = p(y | x), i.e., the conditional probability of an entire label sequence y = y_1 ... y_T given an entire observation sequence x = x_1 ... x_T, instead of p(y|x) for a single pair (x, y). The feature vector is then defined to map an entire input sequence x_1 ... x_T paired with an entire state sequence y_1 ... y_T to some d-dimensional feature vector. Formally, the linear chain CRF is defined as follows in [44] by Sutton:

Definition 4. Let Y, X be random vectors, θ = {θ_k} ∈ R^K be a parameter vector, and {f_k(y, y', x_t)}_{k=1}^{K} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(y|x) that takes the form:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, x_t) \right\} \qquad (2.3)

Here Z(x) is an instance-specific normalization function (a.k.a. partition function) which ensures the distribution p sums to 1:

Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, x_t) \right\} \qquad (2.4)

As we can see here, another key difference between linear chain CRFs and log-linear models is that the feature function f_k(y_t, y_{t-1}, x_t) in linear chain CRFs captures sequential information by including the state transition pair (y_t, y_{t-1}) as well as the observation x_t. Moreover, the multiplication over t = 1 ... T allows linear chain CRFs to aggregate information over the whole sequence (x_1 ... x_T, y_1 ... y_T). The learning process is the same as for log-linear models: a gradient descent process that optimizes the sum of log-likelihoods objective function. The prediction process is the famous Viterbi algorithm [49], which finds the most likely sequence of hidden states y* = argmax_y p(y|x) by dynamic programming.

However, it is important to mention the time and space issues of linear chain CRFs, which are usually a big deal in practice:

• Time issue: Inference in CRFs, which is necessary during both training and prediction, is done by the forward-backward algorithm (for computing the marginal distributions p(y_t, y_{t-1} | x)) and the Viterbi algorithm (for computing the most likely sequence of hidden states y* = argmax_y p(y|x)). Both algorithms have time complexity O(T·|Y|^2), which can be quite significant given a long sequence (T is big) or many possible labels (|Y| is big).

• Space issue: The space of possible values for y = y_1 ... y_T, i.e., |Y|^T, where T is the length of the sequence, is sometimes very big. Notice that the summation in (2.4) is over these |Y|^T possible assignments to y. For this reason, computing Z(x) is intractable in general, but much work exists on how to approximate it.
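To make Definitions 3 and 4 concrete, here is a small self-contained sketch (not from the thesis; the toy feature functions and weights are invented) that computes the linear-chain CRF probability of Equations (2.3) and (2.4) by brute-force enumeration of all label sequences. With T = 1 and no transition feature it reduces to the log-linear model of Equation (2.1); in practice the enumeration is replaced by the forward-backward and Viterbi algorithms, which run in O(T·|Y|^2).

```python
import math
from itertools import product

def crf_prob(xs, ys, labels, features, theta):
    """p(y|x) for a linear-chain CRF, by enumerating every label sequence (Eqs. 2.3-2.4).

    Each feature is a function f_k(y_t, y_prev, x_t); y_prev is None at the first position.
    """
    def unnormalized(seq):
        score, prev = 0.0, None
        for t, x in enumerate(xs):
            score += sum(th * f(seq[t], prev, x) for th, f in zip(theta, features))
            prev = seq[t]
        return math.exp(score)

    z = sum(unnormalized(seq) for seq in product(labels, repeat=len(xs)))  # partition function Z(x)
    return unnormalized(tuple(ys)) / z

# Toy example in the spirit of structure prediction: y_t = 1 means "units t and t+1 are connected".
labels = (0, 1)
features = [
    lambda y, prev, x: 1.0 if y == 1 and x == "adjacent-clauses" else 0.0,  # observation feature
    lambda y, prev, x: 1.0 if y == 1 and prev == 1 else 0.0,                # transition feature
]
theta = [1.5, -3.0]   # negative transition weight discourages two consecutive "connected" decisions
xs = ["adjacent-clauses", "adjacent-clauses"]
print(crf_prob(xs, (1, 0), labels, features, theta))  # probability of the label sequence C = (1, 0)
```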
2.3 Existing Discourse Parsers

2.3.1 Earlier Studies

Discourse parsing research dates back to the 1990s, when Marcu [27] proposed using machine learning techniques to build a shift-reduce discourse parser. While the major discourse parsers are based on supervised learning, discourse parsers based on unsupervised or semi-supervised learning have also been proposed, as they do not require (or require less) human-labeled data and have some unique characteristics. For example, Hernault et al. (2010) [15] proposed a semi-supervised method to compute the co-occurrence between different features, which is available from unlabeled data, and then extend the feature vector using this co-occurrence to improve the labeling of infrequent relations. However, we are not using any unsupervised or semi-supervised learning based approach in this thesis, and we limit our discussion to supervised learning based approaches.

In 2003, Soricut and Marcu [40] considerably improved the earlier work of Marcu in 1999 by introducing SPADE: Sentence-level Discourse parsing using Syntactic and Lexical Information. The main idea is to learn a generative probabilistic model to infer a probability for every constituent at different levels, and then use a bottom-up optimal parser to construct the optimal tree. In their model, they used a feature derived from the DS-LST (discourse segmented lexicalized syntactic tree), which carries syntactic and lexical information, and they claimed this feature carries strong predictive power for discourse parsing. Kenji [38] used a shift-reduce parser with lexical and syntactic features at both the sentence level and the document level to allow linear time parsing. Subba [43] focused on discourse parsing on an instruction corpus and used Inductive Logic Programming (ILP) to learn from first-order logic representations to determine discourse relations, and then used a shift-reduce parser to greedily build a discourse tree with not only linguistic cues and lexical information but also compositional semantics (when available) and segmented discourse structure data. Hernault [16] introduced a famous discourse parsing system, HILDA, which separately models structure prediction and relation labeling with two SVMs, and then iteratively alternates between choosing the two consecutive units with the highest score provided by the structure SVM model (S) and merging them with the relation inferred by the relation SVM model (R), until all units are merged. This is a greedy process with linear time complexity, and their SVM models use a rich feature set containing textual, lexical, syntactic and structural features. Later on, Feng et al. [10] improved HILDA by incorporating rich linguistic features from other resources and performing feature selection to avoid high dimensionality and sparsity. All these studies advanced discourse parsing to a new chapter. However, none of these studies tried any advanced machine learning models like linear chain CRFs to capture the sequential dependency in the data, and their accuracy and efficiency are still far from perfect. Thus, in our thesis, we are going to study some advanced machine learning models like CRFs and log-linear models.

2.3.2 Recent Works
More recently, three pieces of research, from Joty et al. (2012, 2013) [20, 21], Feng et al. (2013) [11] and Ji and Eisenstein (2014) [18], have advanced both the accuracy and the efficiency of discourse parsers to a higher level with the help of advanced machine learning techniques, like CRFs and representation learning. In order to make a better comparison, we summarize the major differences between the three systems in terms of features, models, learning process, parsing process and time/space complexity in the rows of Table 2.2.

| | Joty et al. (2013) | Feng et al. (2013) | Ji and Eisenstein (2014) |
|---|---|---|---|
| Used features | Textual organization; ngrams; dominance set; lexical chains; context; sub-structure | Features from Joty et al.; syntactic features; entity transitions; cue phrases; post-editing features (depth) | Unigrams; selected features from Joty et al.; head word from each EDU |
| Feature encoding | No encoding, but used as strings (binary) during training | Binary encoded | Learned linear projection to a lower-dimensional space (only unigrams are projected, then concatenated with the other features) |
| Intra- and multi-sentential models | Separate | Separate | Together |
| Parsing model | Linear chain DCRF for intra-sentential parsing; UGM for multi-sentential parsing | 4 linear-chain CRFs for intra/multi-sentential structure/relation prediction | Large-margin (SVM) decision classifier |
| Learning algorithm | L-BFGS | L-BFGS (not mentioned in their paper) | Alternating between standard SVM optimization and stochastic gradient descent |
| Parsing process | Optimal CKY parsing | Greedy bottom-up parsing | Greedy shift-reduce parsing |
| Intra- and multi-sentential combination | 1S-1S (one sentence one sub-tree) or sliding window | 1S-1S | N/A |
| Time complexity | O(n^3) | O(n) | N/A |
| Memory complexity | O(n^3 · d) * | O(n · d) | N/A |

(* n is the number of sentences in a document; d is the dimension of the feature space.)

Table 2.2: Comparison of three most recent discourse parsers

As we can see in the first column, Joty et al. (2013) proposed a discriminative framework which tries to build a discourse tree by applying an optimal parsing algorithm to the probabilities of all the constituents, inferred from two CRFs that model structure and relation jointly:

• a linear chain dynamic CRF (DCRF) for intra-sentential parsing
• a uni-variate graphical model (UGM), which does not model sequential dependency, for multi-sentential parsing

and then combining the parsing results from the above two models into the final discourse tree, linking intra- and multi-sentential parsing with two different approaches:

• 1S-1S (one sentence one sub-tree), which treats a sentence as a basic unit in multi-sentential parsing
• a sliding window, which is designed to treat cases where the discourse structure violates sentence boundaries.

They used many features, including textual organization, ngrams, dominance set, lexical chains, contextual and sub-structure features, to enhance the quality of the classifier. The main contribution of this work is its application of CRFs to the discourse parsing problem, showing rather high accuracy in both intra- and multi-sentential parsing in terms of both structure and relation, as can be seen in Table 2.5. However, the major bottleneck of this system, its inefficiency in terms of both speed and space, as shown in Tables 2.2 and 2.5, makes it cumbersome in large applications.
In the second column, Feng et al. (2013) proposed a linear time discourse parser based on Joty's system, with modifications in the following aspects: (1) additional features which are not used in Joty's system; (2) binary encoded features; (3) usage of the sequential model (linear chain CRF) in both intra- and multi-sentential parsing; (4) separate modeling of structure and relation; (5) constraints on 1-1 transitions and all-0 sequences in the structure CRF's decoding; (6) a greedy bottom-up parsing procedure that allows linear time parsing; (7) the novel idea of post-editing, which does a second parsing pass to incorporate information from upper-level discourse constituents. The main principle of the system is similar to Joty's, but because of the modifications above, it achieves better accuracy in terms of both structure and relation, as can be seen in Table 2.5, and it is more efficient in terms of both speed and space, as shown in Tables 2.2 and 2.5.

In the third column, Ji and Eisenstein (2014) proposed a very different approach, namely a representation learning approach to discourse parsing, which formalizes the discourse tree building process as a sequence of decision problems by using a transition-based shift-reduce parser. It jointly learns a linear transformation from the BOW lexical features (unigrams) to a lower-dimensional latent space representation, and a large-margin (SVM) decision classifier in this space that makes sequential/greedy shift-reduce decisions. The learning procedure alternates between (1) fixing the projection matrix and performing a standard SVM optimization, and (2) fixing the SVM parameters and performing gradient updates on the projection matrix. They used features from Joty's system to further enhance its accuracy. However, this paper does not seem to mention its running time complexity; its major contribution is its application of representation learning to discourse parsing, showing the best accuracy by far on relation labeling, as can be seen in Table 2.5. It also shows the usefulness of unigrams in relation labeling, which we may adopt in our system.

However, as pointed out before in section 1.2, all the studies described in this section have their own disadvantages (which will be discussed in more detail in the next section), and their accuracy and efficiency are still not perfect. Moreover, none of them has fully reported an exploration of the over-fitting issue and feature usefulness (Joty et al. did an incremental but not full exploration of feature usefulness). Thus, in our thesis, we are going to address their disadvantages and report our exploration of over-fitting issues and feature usefulness, in order to provide a better understanding of the discourse parsing problem.

2.4 Model Reasoning and Choice

As we saw in the previous section, the three most recent discourse parsing systems differ in many ways. In this section we further investigate their systems in terms of model characteristics, efficiency and accuracy, to motivate our model reasoning and choice.

First of all, we present two tables (Tables 2.3 and 2.4) for intra- and multi-sentential parsing reasoning (assuming training is already done) and one table (Table 2.5) for results comparison.
The reason why we separate the reasoning of intra-sentential parsing and multi-sentential parsing is that multi-sentential parsing is in principle more difficult, and models that can be applied to intra-sentential parsing sometimes cannot be applied to multi-sentential parsing due to scalability issues, which is the case in Joty et al.'s and Feng et al.'s systems. As a result, they decompose their pipelines into two separate components, an intra-sentential component and a multi-sentential component, to build different levels of the discourse tree, and then combine them together with some strategies. However, the principle and idea of intra-sentential parsing and multi-sentential parsing are very similar.

As we can see in both Tables 2.3 and 2.4, the discourse parsing models mainly differ in three factors (the first three columns) for both intra- and multi-sentential parsing: (1) sequential vs. non-sequential; (2) modeling structure and relation jointly vs. separately; (3) optimal vs. greedy parsing. These are the major factors, besides features, that affect the systems' accuracy and efficiency. We also put the three systems, as well as HILDA and our system, in the last column in order to see which direction researchers have gone in discourse parsing research in terms of these three factors.

Table 2.3: Intra-sentential parsing reasoning (n is the number of EDUs per sentence, ranging from 2 to 10; |R| = 41 is the number of discourse relations; |S| = 2, connected or not connected):

• Sequential, joint, optimal parsing. Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |R×S|. Feasibility: maybe not, since the joint model is too complex and inappropriate. System: Joty et al.
• Sequential, joint, greedy parsing. Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |R×S|.
• Sequential, separate, optimal parsing (for structure). Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |S|. Feasibility: yes, a good area to explore. System: our system.
• Sequential, separate, greedy parsing (for structure). Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |S|. Feasibility: yes. System: Feng et al.
• Non-sequential, joint, optimal parsing. Time: feature preparation O(n^3), inference O(n^3), parsing O(n^3); space: O(n^3), model size O(M^2) with M = |R×S|. Feasibility: yes.
• Non-sequential, joint, greedy parsing. O(n) time for all steps; O(n) space. Feasibility: yes. System: Ji and Eisenstein.
• Non-sequential, separate, optimal parsing (for structure). O(n^3) time for all steps; O(n^3) space. Feasibility: yes.
• Non-sequential, separate, greedy parsing (for structure). O(n) time for all steps; O(n) space. Feasibility: yes. System: HILDA.

Table 2.4: Multi-sentential parsing reasoning (n is the number of sentences per document, ranging from 3 to 190; |R| = 41; |S| = 2):

• Sequential, joint, optimal parsing. Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |R×S|. Feasibility: no, too high complexity.
• Sequential, joint, greedy parsing. Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |R×S|. Feasibility: probably not, since the joint model is too complex and inappropriate.
• Sequential, separate, optimal parsing (for structure). Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |S|. Feasibility: maybe, but only when n is small.
• Sequential, separate, greedy parsing (for structure). Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |S|. Feasibility: yes, a good area to explore. System: Feng et al.
• Non-sequential, joint, optimal parsing. Time: feature preparation O(n^3), inference O(n^3), parsing O(n^3); space: O(n^3), model size O(M^2) with M = |R×S|. Feasibility: yes. System: Joty et al.
• Non-sequential, joint, greedy parsing. O(n) time for all steps; O(n) space. Feasibility: yes. System: Ji and Eisenstein.
• Non-sequential, separate, optimal parsing (for structure). O(n^3) time for all steps; O(n^3) space. Feasibility: yes, very simple. System: our system.
• Non-sequential, separate, greedy parsing (for structure). O(n) time for all steps; O(n) space. Feasibility: yes. System: HILDA.
Table 2.5: Result comparison of three recent discourse parsing systems (the F-score metrics will be explained in chapter 4; running times are for the longest document in RST-DT, 180 sentences):

| System | Method | Span | Nuclearity | Relation | Feature generation | Parsing |
|---|---|---|---|---|---|---|
| Joty et al. | 1 Sentence 1 Sub-tree | 82.56 | 68.32 | 55.83 | 6400s | 1900s |
| Joty et al. | Sliding window | 83.84 | 68.9 | 55.87 | N/A | N/A |
| Feng et al. | Greedy linear-chain CRF | 84.9 | 69.9 | 57.2 | N/A | 41s |
| Feng et al. | Greedy linear-chain CRF with post-editing | 85.7 | 71 | 58.2 | N/A | 85s |
| Ji and Eisenstein | No projection (A = I) | 79.85 | 69.01 | 60.21 | N/A | N/A |
| Ji and Eisenstein | Concatenation form; learned linear projection | 82.08 | 71.13 | 61.63 | N/A | N/A |
| Ji and Eisenstein | General form; learned linear projection | 81.6 | 70.95 | 61.75 | N/A | N/A |
| Human | | 88.7 | 77.72 | 65.75 | | |

For the first factor, whether to use a sequential or non-sequential model, we clearly see in Tables 2.3, 2.4 and 2.5 that systems that use sequential models in both intra- and multi-sentential parsing (except Joty's system in multi-sentential parsing, due to the scalability issue) achieve better results on structure prediction (Feng et al. and Joty et al.). The intuition behind this is that, for the parsing problem, the decisions at each position on whether two adjacent units should be connected are always correlated, which means that i.i.d. (independent and identically distributed) models probably cannot model the data well. To see why the decision at each position is correlated, consider the very simple scenario in Figure 2.2. Since EDU2 and EDU3 are connected, each of them, as a single unit, can never be connected to other EDUs (e.g., EDU2 cannot be connected to EDU1, and EDU3 cannot be connected to the EDUs after it, i.e., EDU4, EDU5, EDU6), since every EDU as a single unit can only be connected to one other EDU or span. Thus, making the decision to connect EDU2 and EDU3 is equivalent to making the decisions not to connect EDU1 and EDU2, and not to connect EDU3 and the EDUs after it. In this sense, decisions are correlated, and to some extent determined by each other. On the other hand, text is sequential. Therefore, sequential models seem to be the best choice for modeling this correlation, and we are going to use a sequential model for structure prediction in our framework (only at the intra-sentential level, due to the scalability issue).

[Figure 2.2: Discourse tree example for reasoning]

However, the conclusion changes when it comes to relation labeling. We clearly see that the best relation labeling result is achieved by Ji and Eisenstein, who use an i.i.d. model (SVM) which makes decisions (of both structure and relation) independently, one after another, greedily. Although there is no clear statistical analysis or theory showing that there is no correlation between the relation label distributions of nearby connected units, it is nearly impossible to model the correlation correctly, even if there is one, if we are only using pair-wise sequential models (e.g., HMMs, linear-chain CRFs). To see why, consider the same example in Figure 2.2: since EDU2 and EDU3 are connected by relation R_{2,3} = CONTRAST, EDU1 and EDU2 will never have a chance to be connected and end up with relation R_{1,2} = NONE. More generally, R_{i,i+1} = SOME-RELATION implies R_{i-1,i} = NONE and R_{i+1,i+2} = NONE, which is 100% determined and does not need to be learned.
However, long-range or hierarchical dependencies may exist. For example, R_{2,3} = CONTRAST and R_{45,6} = SAME-UNIT can be correlated, and R_{4,5} = ELABORATION and R_{45,6} = SAME-UNIT can also be correlated. If we really want to capture this kind of correlation, if there is any, we need a skip-chain CRF model or models that consider long-range or hierarchical dependencies, which is not practical, or too expensive in terms of efficiency, for most scenarios when handling text. Thus, we argue in section 1.2 that it is not appropriate to use a linear chain CRF to model the correlation of relations, as a linear chain CRF does not capture the correlation even if there is one, and it has high complexity, as shown in column 6 of both reasoning tables. Therefore, we are going to use i.i.d. log-linear models for relation labeling in our system.

For the second factor, whether to model structure and relation jointly or separately, we argue that the joint model is inappropriate for the following two reasons. (1) The joint model increases the complexity of the model by at least one order of magnitude, thus making the model much bigger in terms of number of parameters and much slower for both training and inference. (2) The purpose of modeling S (structure) and R (relation) jointly is to model the dependency between S and R. But the dependency is actually totally fixed when S = 0 (S = 0 implies R = NONE), which does not need to be learned. That is, there will not be any data point with S = 1 and R = NONE, or with S = 0 and R = SOME-RELATION, and of course their associated model parameters will be 0 and never get updated during training, if no special technique is used. In this sense, it is unnecessary, and also inappropriate, to model S and R jointly at the expense of increased model complexity and decreased model efficiency. Moreover, decomposing the two tasks allows us to choose appropriate features and models for each task, and we can study each task more deeply, as we will do in this thesis.

Lastly, for the third factor, whether to use optimal or greedy parsing, there is no obvious result in Table 2.5 showing that one dominates the other. All we know is that, in principle, greedy parsing sometimes finds a locally optimal instead of a globally optimal solution. Moreover, here we are using an inferred probability from the parsing model, which is an estimated value (not ground truth), as the building block for our parsing algorithms. Thus, we cannot really conclude that optimal parsing always gives much higher accuracy, as the results shown in Table 2.5 confirm. However, when we can maintain acceptable efficiency, optimal parsing is always the first choice, because optimal parsing will always find a solution that is no worse than greedy parsing's.

In this chapter, some background knowledge on discourse parsing as well as log-linear and CRF models was presented, in order to provide the necessary background on the problem we are studying and the machine learning techniques we will be using. Then, a detailed summary of related research as well as a deep reasoning on the model choice were presented, in order to provide sufficient theoretical support for the design of our novel discourse parsing framework, which will be discussed in the next chapter.

Chapter 3

Our Discourse Parsing Framework

After giving the necessary background and discussion about discourse parsing in chapters 1 and 2, we now start describing our novel discourse parsing framework in this chapter. Firstly, we give a brief overview of the whole system and how each component is connected in section 3.1.
Then, we give detailed descriptions of all components in Sections 3.2 and 3.3. Finally, we present all the features we use, and some techniques for processing these features, in Section 3.4.

3.1 System Overview

In our discourse parsing framework, we separate intra-sentential parsing and multi-sentential parsing due to the scalability issue mentioned in Section 2.4 and, more importantly, due to the fact that, from a linguistic point of view, authors tend to use different rhetorical structures and relations to organize their text at the two different levels (intra-sentential vs. multi-sentential). For instance, authors tend to summarize at the multi-sentential level rather than the intra-sentential level, and a summary is almost always a sentence (or multiple sentences) rather than a segment within a sentence. As a result, the SUMMARY rhetorical relation should have a much higher probability of appearing at the multi-sentential level than at the intra-sentential level. In addition, separating the models allows us to use different sets of features or hyper-parameters for the different levels of parsing, to potentially obtain better performance. Thus, the statistical (machine learning) model we use to describe the rhetorical structure should be different for the intra- and multi-sentential levels.

Moreover, we separate structure prediction and relation labeling due to efficiency and, more importantly, due to the fact that it is not necessary to model them jointly, as discussed in Section 2.4. In other words, we should use different statistical (machine learning) models, as well as different features and hyper-parameters, for structure prediction and relation labeling.

Figure 3.1: Pipeline of our discourse parsing framework

The pipeline of our framework is shown in Figure 3.1. The input to our framework is a raw text document D that has one or more sentences S_1...S_i...S_n, and each sentence S_i is segmented into one or more EDUs EDU_i1...EDU_ij...EDU_im. Then, we apply a learned linear-chain CRF model Model_S-Intra to these EDUs, followed by a CYK-like (Cocke-Younger-Kasami) optimal parsing step, to generate intra-sentential level discourse trees T_S1...T_Si...T_Sn for S_1...S_i...S_n. The CYK algorithm is a bottom-up dynamic programming parsing algorithm for context-free grammars. Previous studies on intra-sentential discourse analysis indicate that most sentences have well-formed discourse sub-trees in the document DT [40]. Therefore, we next treat each sentence S_i as a building-block unit and apply another learned CRF model Model_S-Multi to these units, followed by another CYK-like parsing step, to generate a multi-sentential level discourse tree T_D for the whole document D (this is similar to the "one sentence, one sub-tree" approach from Joty et al. [21]). Up to this point, we have generated the complete tree structure (only) T_D for D, which means that the shape of our final discourse tree is determined; the only thing left is to determine which discourse relation should relate the two spans in each branch of T_D.
For this purpose, we apply two learned log-linear models, Model_R-Intra and Model_R-Multi, to tag discourse relations on all branches of T_D using information from the spans under consideration, and obtain our final discourse tree T_D-final for D.

3.2 Structure Prediction

In this section, we give more details about the structure prediction components of our discourse parsing framework. More specifically, we discuss our linear-chain CRF model for intra-sentential structure prediction (Model_S-Intra in Figure 3.1) and our CRF model for multi-sentential structure prediction (Model_S-Multi in Figure 3.1), which infer the probability that two arbitrary continuous spans of text are connected. We also discuss the CYK-like parsing algorithm, which uses these probabilities to construct the most probable discourse tree by dynamic programming.

3.2.1 Intra-sentential Linear-Chain CRF Structure Model

Assume that a raw text document D has one or more sentences S_1...S_i...S_n and each sentence is segmented into one or more EDUs, as shown in Figure 3.1. The job of the intra-sentential structure model Model_S-Intra is to infer p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) for all adjacent pairs U_k, U_{k+1}, using the EDUs within S_i as building blocks, in order to enable the CYK-like optimal parsing. Here p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) is the probability that two continuous spans of text U_k, U_{k+1} are connected, given model parameters θ_S-Intra; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more EDUs within sentence S_i; C_{k+1} stands for connected(U_k, U_{k+1}), where C = 1 means connected and C = 0 means unconnected. As discussed in Section 2.4, C_{k+1} = 1 implies C_k = 0 and C_{k+2} = 0. Therefore, we adopt a linear-chain CRF model, which naturally captures this sequential dependency during learning, for intra-sentential structure prediction.

Figure 3.2: Intra-sentential linear-chain CRF structure model (a units sequence U_1...U_t at level i with binary connection nodes C_2...C_t)

Figure 3.2 shows our linear-chain CRF graphical model, which is similar to Feng et al.'s [11] and Joty et al.'s [21] (with all the R nodes and their associated edges removed). The first layer of the chain is composed of discourse unit nodes U, and the second layer is composed of binary connection nodes C indicating the probability of connecting adjacent discourse units. This model defines a conditional probability over the sequence U_1...U_j...U_t as follows (see Section 2.2 for the meaning of each term):

p(C_{2:t} | U_{1:t}, S_i, \theta_{S-Intra}) = \frac{1}{Z} \prod_{j=1}^{t-1} \exp\Big\{ \sum_{k=1}^{K} \theta_{S-Intra,k} \, f_k(U_j, U_{j+1}, S_i, C_{j+1}) \Big\}    (3.1)

In order to infer p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) for all adjacent pairs U_k, U_{k+1}, we first apply the above linear-chain CRF to all possible sequences at different levels, and then apply the forward-backward algorithm to each sequence to get the marginal distribution over the C nodes for each pair of adjacent units in that sequence. This process is the same as Joty et al.'s [19], except that we only need to consider one prediction layer of the chain (the binary nodes C) and rule out all the R nodes and their associated edges. For more details about the sequence generation algorithm, which generates all possible sequences, and how this process works in general, please refer to Section 4.1.2 in Joty's paper [19]. Some important things to notice are that our system generates O(n^3) sequences (where n is the number of EDUs in the sentence), and that it takes O(T) time (where T is the number of units in the sequence, which can go up to n) to run the forward-backward algorithm for each sequence, as discussed in Section 2.2.
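To make this inference step concrete, the sketch below shows how the per-position connection marginals could be read off a trained linear-chain CRF with python-crfsuite, a Python binding of the crfsuite package used to train our models (see Section 4.3.1). It is only an illustrative sketch under assumptions: the model file name, the label string '1' for "connected", and the upstream feature extraction are placeholders rather than our exact code.

    import pycrfsuite

    def connection_marginals(model_path, sequence_features):
        """Return p(C_{j+1} = 1) for every adjacent pair in one unit sequence.
        sequence_features: one feature dict per connection position C_2..C_t,
        produced by a feature-extraction step that is assumed to exist."""
        tagger = pycrfsuite.Tagger()
        tagger.open(model_path)        # e.g. "model_s_intra.crfsuite" (hypothetical name)
        tagger.set(sequence_features)  # load the sequence into the tagger
        # marginal(label, t) returns the forward-backward marginal probability
        # of that label at position t for the currently loaded sequence.
        return [tagger.marginal('1', t) for t in range(len(sequence_features))]

These marginals are exactly the quantities that are later fed to the CYK-like parser described in Section 3.2.3.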
Combining these two facts, the time and space complexity of this process is O(n^4). For intra-sentential structure prediction this is not a big issue, since each sentence does not have too many EDUs and n typically ranges from 1 to 10. However, it becomes a bottleneck for multi-sentential structure prediction, where n ranges from 1 to 1000 (or even more), since a document can have as many sentences as the author wants. This is the major reason why we adopt a much simpler CRF model for multi-sentential structure prediction, as we will see in the next section. Lastly, for training this linear-chain CRF, we use the same sequence generation algorithm to generate all possible sequences at different levels of the gold DT and then follow the standard CRF training procedure. For details about how to generate the training sequences for a given training instance, please refer to Figure 7 in Section 4.1.2 of Joty's paper [19].

3.2.2 Multi-sentential CRF Structure Model

Assume that a raw text document D has one or more sentences S_1...S_i...S_n, as shown in Figure 3.1. Similar to the intra-sentential structure model, the job of the multi-sentential structure model Model_S-Multi is to infer p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) for all adjacent pairs U_k, U_{k+1}, using the sentences within D as building blocks, in order to enable the CYK-like optimal parsing. Here p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) is the probability that two continuous spans of text U_k, U_{k+1} are connected, given model parameters θ_S-Multi; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more sentences within D; C_{k+1} stands for connected(U_k, U_{k+1}), where C = 1 means connected and C = 0 means unconnected.

Figure 3.3: Multi-sentential CRF structure model (two adjacent units U_{t-1}, U_t at level i with a single connection node C_t)

Figure 3.3 shows our CRF model for multi-sentential structure prediction. Similar to the intra-sentential structure model, the first layer is composed of the two discourse units U_k, U_{k+1}, and the second layer is a single binary connection node C_{k+1} indicating the probability of connecting U_k, U_{k+1}. The key difference compared to the intra-sentential structure model is that we no longer use a linear-chain CRF, due to the scalability issue. In other words, we cut the sequential dependency edges between the C nodes in the graphical model, which results in a simpler i.i.d. model. We will show later, in Section 4.3.4, that by using this simpler graphical model, which does not consider sequential dependency, we are able to considerably decrease the training/inference time and still achieve high accuracy. This model defines a conditional probability for an adjacent pair of units U_k, U_{k+1} as follows (see Section 2.2 for the meaning of each term):

p(C_{k+1} | U_k, U_{k+1}, D, \theta_{S-Multi}) = \frac{1}{Z} \exp\Big\{ \sum_{m=1}^{K} \theta_{S-Multi,m} \, f_m(U_k, U_{k+1}, D, C_{k+1}) \Big\}    (3.2)

In order to infer p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) for all adjacent pairs U_k, U_{k+1}, we apply the above CRF to all possible pairs of units at different levels to get the marginal distributions over the associated C nodes. This process is almost the same as Joty et al.'s [19], except that we only need to consider one prediction variable C and rule out the variable R and its associated edges. For more details about the pair generation algorithm, which generates all possible adjacent pairs of units, and how this process works in general, please refer to Section 4.1.3 in Joty's paper [19].
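Because the sequential edges are gone, scoring a single pair under Equation (3.2) reduces to a two-class log-linear (logistic) computation over that pair's features. The following minimal sketch illustrates the math; the per-label weight vectors and the feature extraction are assumed to be given (in practice the model is trained and applied through crfsuite, so this is an illustration rather than our actual code path).

    import numpy as np

    def connect_probability(theta_connect, theta_not_connect, pair_features):
        """p(C = 1 | U_k, U_{k+1}, D) for one adjacent pair, as in Eq. (3.2).
        theta_connect / theta_not_connect: learned weight vectors for C = 1 and C = 0.
        pair_features: numpy feature vector f(U_k, U_{k+1}, D)."""
        score_1 = np.dot(theta_connect, pair_features)      # unnormalized log-score for C = 1
        score_0 = np.dot(theta_not_connect, pair_features)  # unnormalized log-score for C = 0
        m = max(score_0, score_1)                            # subtract max for numerical stability
        z = np.exp(score_0 - m) + np.exp(score_1 - m)        # partition function Z
        return np.exp(score_1 - m) / z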
As pointed out in the previous section, our system generates O(n^3) pairs (where n is the number of sentences in the document), and for each pair it takes constant time to run the inference, since we no longer need forward-backward inference. As a result, the time and space complexity of this process is O(n^3). Although the time and space complexity is lower here than for intra-sentential structure prediction, the actual running time is much longer for multi-sentential structure prediction, since n ranges from 1 to 1000 (or even more), compared to 1 to 10 for intra-sentential structure prediction. To fundamentally reduce the running time, the only way would be to use greedy parsing (in this thesis we use CYK-like optimal parsing), which requires inference for only O(n) pairs of units. The reason for this will become clearer after we describe our parsing strategy in the next section.

3.2.3 CYK-like Parsing Algorithm

As shown in Figure 3.1, the CYK-like parsing components immediately follow the intra- and multi-sentential structure models. The job of the parsing component is to find a DT for D based on the information provided by the intra- and multi-sentential structure models. Similar to Joty et al. (2013) [21], we adopt a probabilistic CKY-like optimal parsing algorithm that uses dynamic programming to compute the most likely parse (or most probable DT) [22] for D. Formally, the goal of the CKY-like optimal parsing algorithm is to find:

DT^* = \arg\max_{DT} \; p(DT \mid \theta_{S-Intra}, \theta_{S-Multi})    (3.3)

More specifically, given n units (e.g., EDUs for intra-sentential parsing, sentences for multi-sentential parsing), our CYK-like parsing algorithm uses two n*n 2-D dynamic programming tables to store the probabilities and structures of partial, locally optimal DTs. For example, cell[i, j] stores the probability and structure of the partial, locally optimal DT for the partial document containing unit i, unit i+1, ..., unit j-1, unit j. Our CYK-like parsing algorithm is the same as Joty et al.'s [19], except that we have only two dynamic programming tables instead of three, since we are only considering structure prediction at this point. Any computation related to relation labeling is performed in a following step (see Figure 3.1). For more details about how and why this algorithm works, and how to compute each cell iteratively, please refer to Section 4.2 in Joty's paper [19].

One important thing to notice is that the time complexity of this algorithm is O(n^3), since it takes O(n) time to iterate over n units to pick the best splitting point when computing a cell (here n is the number of EDUs or sentences in the sentence or document, respectively). Despite the high complexity, the algorithm gives the globally optimal DT, assuming the information provided by the previous steps (e.g., the connection probability of each pair of units) is accurate. However, in practice, our intra- and multi-sentential structure models may not give very accurate probabilities, since the models are trained with a rather small amount of data (as we will see in Chapter 4). Thus, whether we should use the optimal parsing strategy, at the expense of increased time complexity, remains an open question. Some studies (e.g., Feng et al. [11]) actually use greedy parsing and achieve both high accuracy and efficiency.
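Before weighing that trade-off, the sketch below illustrates the dynamic program just described. It is a simplified illustration, not our exact implementation: connect_prob(i, k, j) is a hypothetical helper standing in for the connection probability of spans [i..k] and [k+1..j] obtained from the structure models, and nuclearity bookkeeping is omitted.

    def cky_like_parse(n, connect_prob):
        """Probabilistic CKY-like parsing over units 0..n-1.
        best[i][j]  : probability of the best partial DT covering units i..j
        split[i][j] : split point chosen for that best partial DT (None for leaves)
        Returns the probability of the most probable DT and the split table,
        from which the tree shape can be read off recursively."""
        best = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        split = [[None] * n for _ in range(n)]
        for length in range(2, n + 1):                  # span length
            for i in range(0, n - length + 1):
                j = i + length - 1
                for k in range(i, j):                   # O(n) candidate split points
                    p = best[i][k] * best[k + 1][j] * connect_prob(i, k, j)
                    if p > best[i][j]:
                        best[i][j], split[i][j] = p, k
        return best[0][n - 1], split

The three nested loops over spans and split points give exactly the O(n^3) time complexity mentioned above.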
If we adopted a greedy bottom-up parsing approach that greedily merges the pair of units with the highest probability at each step, as Feng et al. do in [11], we would only need O(n) connection probabilities of unit pairs at each iteration, and this greedy process runs exactly n-1 iterations to obtain the full DT at the intra-sentential level. Thus, the time complexity of intra-sentential structure prediction would drop to O(n^2). For multi-sentential structure prediction, we would only need amortized O(1) connection probabilities of unit pairs at each iteration. The reason is that at the first iteration we need to apply our multi-sentential CRF structure model to all n-1 unit pairs to determine which pair to merge, but after that we only need to apply the model to obtain the connection probabilities between the newly merged unit (as one new unit) and its two neighboring units at each remaining iteration. This greedy process also runs exactly n-1 iterations to obtain the full DT, so the time complexity of multi-sentential structure prediction would drop to O(n). For more details on how greedy parsing works, please refer to [11]. With fewer connection probabilities to infer, the time complexities of the intra- and multi-sentential probability inference processes described in Sections 3.2.1 and 3.2.2 would drop from O(n^4) and O(n^3) to O(n^3) and O(n), respectively. This would allow us to use a more complex linear-chain CRF model for multi-sentential structure prediction, which might enhance accuracy (the inference time complexity would then return to O(n^3); in Feng's paper [11], they limit the length of their linear-chain CRF, keeping the time complexity at O(n)). However, optimal parsing has been shown to be more robust and accurate than greedy parsing in many scenarios, including discourse parsing. So there is a trade-off here: we can use either optimal parsing with a simple CRF multi-sentential structure model, or greedy parsing with a more complex linear-chain CRF multi-sentential structure model. In this thesis, we choose the former combination and leave the latter for future work.

3.3 Log-linear Relation Labeling Model

In this section, we give more details about the relation labeling components of our discourse parsing framework. More specifically, we discuss our log-linear models for intra-sentential relation labeling (Model_R-Intra in Figure 3.1) and multi-sentential relation labeling (Model_R-Multi in Figure 3.1), which predict discourse relations given information about the two spans under consideration. We describe our log-linear models for the intra- and multi-sentential levels together, since they have the same graphical model and we use the same learning and inference algorithms for them. The reason for using two separate models is discussed in the first paragraph of Section 3.1. However, for comparison purposes, we also experiment with a unified relation labeling model that does not distinguish between the intra- and multi-sentential cases. More details on this will come in Section 4.3.3.

Assume that we have already generated the discourse tree T_D (structure only) for the whole document D, as shown in Figure 3.1, which means that the shape of our final discourse tree is determined. The job of the log-linear relation models Model_R-Intra and Model_R-Multi is to infer p(R_{k+1} | U_k, U_{k+1}, θ_R) for the connected pair U_k, U_{k+1} in each branch (both intra- and multi-sentential) of T_D.
Here p(R_{k+1} | U_k, U_{k+1}, θ_R) is the probability distribution of the discourse relation between two connected spans of text U_k, U_{k+1}, given model parameters θ_R; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more EDUs (or sentences) within S_i (or D); R_{k+1} stands for the discourse relation that relates U_k, U_{k+1}, together with the associated nuclearity statuses. For instance, R_{k+1} = NS-Elaboration means that satellite U_{k+1} elaborates nucleus U_k.

Figure 3.4: Log-linear relation labeling model (two adjacent units U_{t-1}, U_t at level i with a single discourse relation node R_t)

Figure 3.4 shows our log-linear model for intra- and multi-sentential relation labeling. Similar to the multi-sentential structure model, the first layer is composed of the two discourse units U_k, U_{k+1}, and the second layer is a single multinomial node R_{k+1} indicating the probability distribution of the discourse relation between U_k, U_{k+1}. As discussed in Section 2.4, we do not use a linear-chain CRF for relation labeling, so there are no sequential dependency edges between the R nodes in the graphical model. This model defines a conditional probability for an adjacent pair of units U_k, U_{k+1} as follows (see Section 2.2 for the meaning of each term):

p(R_{k+1} | U_k, U_{k+1}, \theta_R) = \frac{1}{Z} \exp\Big\{ \sum_{m=1}^{K} \theta_{R,m} \, f_m(U_k, U_{k+1}, R_{k+1}) \Big\}    (3.4)

In order to tag the discourse relations in a DT, we only need to extract all connected pair branches (there are n-1 of them for a document with n EDUs), apply the above model to them, and pick the discourse relations with the highest probabilities. Thus, the tagging process is extremely fast and its time complexity is O(n). For training, the process is exactly the same.

3.4 Features

In this section, we present all the features we consider in both the structure prediction and relation labeling models of our discourse parsing framework. These features have either been found to be useful for discourse parsing in previous work, or been proven useful in many other NLP applications. We categorize the features into a structural group and a semantic group in order to reflect their intended functionality: whether they help structure prediction or relation labeling. However, the actual contribution of each group of features remains unknown until we run the experiments. Finally, we discuss feature selection, since the feature dimension for our data set is in the hundreds of thousands.

3.4.1 Structural Features

Table 3.1 shows the list of structural features we consider in our system, exemplified for the second and third EDUs of the following sample sentence from RST-DT (notice that EDU_BREAK marks an EDU boundary at that location, so there are 3 EDUs in this sentence):

"Energetic and concrete action has been taken in Colombia during the past 60 days against the mafiosi of the drug trade , EDU_BREAK but it has not been sufficiently effective , EDU_BREAK because , unfortunately , it came too late . "

Among the listed features: dominance set features are extracted from the Discourse Segmented Lexicalized Syntactic Tree (DS-LST) and are for intra-sentential parsing only; lexical chain features are based on sequences of semantically related words that can indicate topical boundaries in a text and are for multi-sentential parsing only; the depth-of-units feature is obtained once we get the unlabeled DT after structure prediction; sub-structure features are obtained after we finish the whole process of structure prediction and relation labeling, and are used in an adjustment step on the relation labels, similar to Joty [19].
The other features are straightforward and are used for both intra- and multi-sentential parsing. All the structural features we use here are motivated by Joty et al. and Feng et al. [19, 11]; for more details on these features, please refer to their papers [19, 11].

Feature family      | Description                                                                         | Examples (rounded to 1-2 digits)
Text Organization   | Number of EDUs in unit 1 (or unit 2)                                                | 1
                    | Relative number of EDUs, tokens in unit 1 (or unit 2)                               | 0.3, 0.2
                    | Relative distance of unit 1 (or unit 2) to the beginning/end                        | 0.3, 0.3
                    | Relative number of sentences, paragraphs in unit 1 (or unit 2) *                    | 0.5, 0.0
                    | Whether they are in the same paragraph *                                            | true
Dominance Set       | Lexical labels of the head node and the attachment node                             | because, effective
                    | Syntactic heads of the head node and the attachment node                            | SBAR, VP
                    | Dominance relationship between unit 1 and unit 2                                    | unit1
Syntactic           | Head and tail n-gram POS tags of unit 1 and 2 (n = 1, 2)                            | CC, CC_PRP
Lexical Chain *     | Normalized number of chains starting in unit 1 and ending in unit 2                 | 0.08
                    | Normalized number of chains starting (or ending) in unit 1 (or unit 2)              | 0.25, 0.25, 0.67, 0.17
                    | Normalized number of chains skipping both unit 1 and unit 2                         | 0.0
                    | Normalized number of chains skipping unit 1 (or unit 2)                             | 0.0, 0.02
Context             | Previous and next feature vectors                                                   | ... (long vector), ... (long vector)
Sub-structure       | Root rhetorical relation labels of the left and right sub-trees of unit 1 (unit 2)  | leaf, leaf
Depth               | Depth of the units in the DT (labeled or unlabeled)                                 | 7
* multi-sentential parsing only; not applicable to the sample sentence used here
Table 3.1: Structural features

3.4.2 Semantic Features

Table 3.2 shows the list of semantic features we consider in our system, exemplified for the second and third EDUs of the same sentence above. Among the listed features: cue phrases come from a phrase list based on the connectives collected by Knott and Dale (1994) [23]; the ngram features are as described in the table and are used for both intra- and multi-sentential parsing. All the semantic features we use here, except for ngrams of the whole unit and the unit representation vector, are motivated by Joty et al. and Feng et al. [19, 11]; for more details and explanations of these features, please refer to their papers [19, 11]. To the best of our knowledge, no one has reported results for using ngrams of the whole unit or a unit representation vector as features for discourse parsing (Ji et al. [18] report results using unigrams of the whole unit, but not other ngrams of the whole unit or unit representation vector features). The two paragraphs below provide the motivation and more explanation for them.

Feature family             | Description                                                                              | Examples
Cue Phrase                 | Cue phrases appearing in unit 1 (or unit 2)                                              | but; not; because; too
Ngrams                     | Selected head and tail ngrams of unit 1 and 2 (n = 1, 2, 3)                              | <e>; </e> *; <e>_but; ,_</e>; <e>_but_it; unknown **
                           | Unigrams of the whole unit                                                               | <e>; but; it; has; not; been; sufficiently; effective; ,; </e>
                           | Selected bigrams of the whole unit                                                       | it_has
Unit Representation Vector | Component-wise product of the word representation vectors of all words in unit 1 and 2  | [-0.005118, -0.013534, ..., 0.001883, 0.007816] (200 dimensions)
* <e> and </e> stand for the beginning and the end of a sentence.
** uninformative unigrams, bigrams and trigrams are treated as "unknown" in our system to reduce the feature dimension; more details in the next section.
Table 3.2: Semantic features
Motivated by the fact that Ji et al. [18] achieve a high (above 60) F-score for relation labeling by using only unigrams of the whole unit (the two units under consideration) and some basic features, we also want to include such features in our relation labeling log-linear model, and we believe that ngrams of the whole unit provide useful semantic indicators for certain rhetorical relations. There are two types of features we introduce for this purpose. The first type is selected bigrams of the whole unit (besides unigrams of the whole unit, which Ji has already used). We can easily obtain them by tokenizing the two units: if there are 20 tokens in the two units, there will be 20 unigram features and no more than 19 bigram features.

The second type is the unit representation vector. Recently, word representation learning has proven successful in many NLP applications, including discourse parsing in one of the works motivating ours, Ji [18]. Conventionally, we take a word and convert it to a symbolic id, which is then transformed into a sparse feature vector with only one entry on. This feature vector has the same length as the size of the vocabulary. However, this representation of a word suffers from data sparsity and does not capture similarity between different words. These limitations have prompted researchers such as Mikolov et al. [33], Turian et al. [47] and Collobert et al. [7] to investigate distributed representations of words in a lower-dimensional latent space. Their distributed word representations have been shown to significantly improve and simplify many NLP applications [7, 8], so we want to use them as features in our discourse parsing system, too. However, we cannot use them directly, since the basic unit we are considering is an EDU (or a sentence) instead of a word. To overcome this, we compose the word vectors in an EDU (or a sentence) by adopting the multiplicative composition approach, in which we take the component-wise products of all the word vectors in the EDU (or sentence). By doing this, we model the meaning of an EDU (or a sentence) in the word representation latent space. This approach achieved relatively good results in several tasks, as shown in [34]. The dimension of the resulting vector, which we call the unit representation vector and use as features, is the same as the dimension of the word vectors (often from 10 to 2000).

3.4.3 Feature Selection

In many NLP- or genetics-related machine learning tasks, we are usually dealing with potentially huge feature sets (e.g., millions of features), and discourse parsing is no exception. In this setting, we naturally ask these questions:
• Are all these features actually helpful?
• Are our ML models capable of handling such high-dimensional data efficiently?
• Is there a way to pick better features for our task?
Obviously, not all the features we use are helpful, and some of them might be noisy or irrelevant to our task. And as discussed before, efficiency is a big concern for our task: we cannot just throw as many features as we want into our models, since the running time of the training/inference processes (for our CRFs and log-linear models) is slow and roughly proportional to the feature dimension. In this context, feature selection seems to be a good solution.

For our task, bigrams of the whole unit are one of the major contributors to the feature dimension, and many of them do not provide additional predictive power if we already use their two associated unigrams. To select the bigrams that are more likely to provide extra power, we rank all bigrams according to their log-likelihood ratios, which measure how likely two unigrams are to appear together rather than separately. The log-likelihood ratio log λ of a bigram w_1 w_2 is defined as follows:

\log \lambda = \log L(c_{1,2}, c_1, p) + \log L(c_2 - c_{1,2}, N - c_1, p) - \log L(c_{1,2}, c_1, p_1) - \log L(c_2 - c_{1,2}, N - c_1, p_2)    (3.5)

Here L(k, m, x) = x^k (1-x)^{m-k}; c_1, c_2 and c_{1,2} are the frequency counts of w_1, w_2 and w_1 w_2 in the corpus; N is the total number of words in the corpus; and p = c_2 / N, p_1 = c_{1,2} / c_1, p_2 = (c_2 - c_{1,2}) / (N - c_1).
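This ranking can be computed directly with nltk's collocation utilities, which implement the same log-likelihood-ratio statistic. The sketch below is illustrative only: the frequency cut-off and the number of bigrams kept are assumptions made for the example, not the exact settings used in our system.

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    def select_bigrams(corpus_tokens, top_k=5000, min_freq=3):
        """Rank all bigrams in the corpus by log-likelihood ratio and keep the
        top_k most informative ones; the rest are later mapped to the 'unknown'
        placeholder (see Table 3.2)."""
        finder = BigramCollocationFinder.from_words(corpus_tokens)
        finder.apply_freq_filter(min_freq)   # drop very rare bigrams first
        measures = BigramAssocMeasures()
        # score_ngrams returns [((w1, w2), score), ...] sorted by descending score
        ranked = finder.score_ngrams(measures.likelihood_ratio)
        return set(bigram for bigram, _ in ranked[:top_k])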
Besides the log-likelihood ratio, we could also use chi-square or pointwise mutual information as the metric to rank our bigrams, as well as other features. No matter which metric we choose, the whole point is to select the more informative portion of our features and reduce the feature dimension. In addition, we also perform feature selection on the head and tail ngram features.

Other than directly ranking and selecting features, there are also two alternative ways to reduce the feature dimension. One is to use the least absolute shrinkage and selection operator (lasso), i.e., an L1-regularizer, to shrink the weights of some features to zero. We experiment with this method, as will be discussed in Section 4.3. The other is a stepwise bi-directional feature elimination procedure, which combines forward selection and backward elimination. However, this method is extremely slow, since we need to re-train and re-test our model every time we add or eliminate features. Thus, we do not use this method to select features for our problem. But we do adopt the idea behind this procedure and experiment with different combinations of feature families to explore feature usefulness, as will be discussed in Section 4.3.

In this chapter, we presented a brief overview of our discourse parsing system, detailed descriptions of the structure prediction and relation labeling components, and explanations of the features we use in our system. Given all these details about our system, we can now move on to the experiments and evaluation, in order to verify its performance, as discussed in the next chapter.

Chapter 4
Experiments and Analysis

In this chapter, we give a detailed description of our experiments. Firstly, we describe the RST-DT data set we use for our experiments in Section 4.1. Then, we discuss the standard constituent-based evaluation approach and its weaknesses, and propose our own pair-based evaluation approach, in Section 4.2. Finally, we present and analyze our experimental procedure and results in Section 4.3.

4.1 RST-DT Corpus

The corpus we use for experimental evaluation is the standard RST-DT data set (Carlson, Marcu, and Okurowski) [4], containing discourse annotations for 385 Wall Street Journal news articles from the Penn Treebank corpus (Marcus et al.) [30]. It is partitioned by Carlson et al. into a training set of 347 articles containing 7300 sentences and a test set of 38 articles containing 990 sentences. 53 of the 385 articles were randomly selected to be annotated by two annotators, and from these we obtain the human agreement figures. On average, each article has 20 sentences (ranging from 2 to 100) and each sentence has 2.8 EDUs (ranging from 1 to 10), which makes the average number of EDUs per article 56.
In RST-DT, the 25 rhetorical relations defined by Mann and Thompson in 1988 are further divided into the set of 18 rhetorical relation classes shown in Table 4.1. Notice that since we do rhetorical relation labeling and nuclearity labeling at the same time, which means that we attach a nuclearity status to each of the 18 rhetorical relation classes during classification, we have a total of 41 class labels (e.g., NS-Elaboration) for classification. As we mentioned before, the distribution of rhetorical relations varies between the intra- and multi-sentential levels. Figure 4.1 shows the distribution of the most frequent rhetorical relations at the intra- and multi-sentential levels.

Elaboration, Joint, Attribution, Same-Unit, Contrast, Explanation, Background, Temporal, Cause, Enablement, Comparison, Evaluation, Topic-Change, Topic-Comment, Condition, Textual-Organization, Manner-Means, Summary
Table 4.1: 18 rhetorical relation classes in RST-DT

Figure 4.1: Distribution of the most frequent rhetorical relations at the intra- and multi-sentential levels (frequency counts per relation for the two levels)

4.2 Evaluation

In this section, we describe the metrics used to measure how well discourse parsing systems perform when compared to human annotations. Conventionally, the discourse parsing community follows the standard unlabeled and labeled precision, recall and F-score proposed by Marcu [28]. The unlabeled metric measures how accurate the discourse parser is at finding the right structure (or shape) of the DT, while the labeled metrics measure how well the discourse parser identifies the nuclearity statuses and labels the discourse relations of connected tree nodes (or text spans, in the discourse parsing context), in addition to the right structure. We call this standard approach constituent-based, since it measures goodness by the proportion of correct constituents. We do not go into detail about this standard evaluation method, since we use exactly the same procedure as Joty in [19]. Figure 4.2 shows an illustration of this method, taken from [19]. As can be seen in the figure, in terms of structure (Span) correctness, there are seven matched constituents (1-1, 4-4, 7-7, 2-3, 5-6, 2-4, 5-7) out of ten gold constituents (1-1, 4-4, 5-5, 6-6, 7-7, 2-3, 5-6, 2-4, 5-7, 2-7) and ten predicted constituents (1-1, 2-2, 3-3, 4-4, 7-7, 2-3, 5-6, 2-4, 5-7, 1-4); thus the structure (Span) precision, recall and F-score are all 7/10. The labeled nuclearity and relation precision, recall and F-score are computed similarly. For more details, please refer to Section 6.2.2 in [19]. However, there are some limitations to this standard approach, as discussed below in Section 4.2.1, so we propose a very similar pair-based evaluation approach, which computes the unlabeled and labeled precision, recall and F-score over pairs of nodes in DTs, to address those limitations.

Figure 4.2: Illustration of the constituent-based evaluation approach, taken from [19]: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser.

4.2.1 Our Pair-based Approach

Although the constituent-based metric is widely used for discourse parsing evaluation, it has two major weaknesses.
Firstly, for measuring structure prediction correctness, the constituent-based evaluation approach always gives credit to the leaves of the system-generated DTs when gold discourse segmentation is used, since all leaf-level constituents are trivially correct. For example, constituents (1), (4) and (7) in Figure 4.2(b) are trivially correct in structure, since they are leaf-level constituents in both the human-annotated and the system-generated DT (see rows 1, 4 and 7 of the Spans column in Figure 4.2(c)). This boosts the structure F-score by a large margin, since half of the tree nodes (or constituents, in the discourse parsing context) in a DT are leaves. As a result, the constituent-based approach often hides the details of structural errors in system-generated DTs. However, this weakness is minor compared to the second one below, since the approach still gives a strictly higher structure F-score to better DTs. Thus, we still use the standard constituent-based metric in our experimental analysis of structure prediction (Section 4.3.2) and of the complete parser (Section 4.3.4), for comparison purposes.

Figure 4.3: Example of the first type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree (EDU1 as satellite and EDU2 as nucleus, related by Elaboration); (b) the system-generated discourse tree (the same nuclearity, but the relation labeled Background); (c) measuring precision (P) and recall (R) of the discourse parser: Span P = R = 2/2, Nuclearity P = R = 2/2, Relation P = R = 1/2.

Secondly, and more importantly, for measuring nuclearity and relation labeling correctness, the constituent-based approach sometimes gives credit to constituents that are not correctly labeled, and sometimes does not give credit to correctly labeled discourse relations. Consider the example shown in Figure 4.3 for the first type of error: although the system incorrectly labels the discourse relation between constituents (1) and (2) as BACKGROUND, we still get a relation F-score of 0.5, because constituent (2) is trivially correct, due to the fact that we always attach Span to the nucleus (constituent (2) in this example) for single-nuclear relations. Consider the example shown in Figure 4.4 for the second type of error: although the system correctly labels the discourse relation between constituents (1) and (2) as Elaboration, we get a relation F-score of 0, since the system does not identify the correct nucleus and satellite among constituents (1) and (2).

Figure 4.4: Example of the second type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree (both label the relation Elaboration, but with the nucleus and satellite swapped); (c) measuring precision (P) and recall (R) of the discourse parser: Span P = R = 2/2, Nuclearity P = R = 0/2, Relation P = R = 0/2.

From both types of errors, we can clearly see that the constituent-based evaluation approach gives unreasonable F-scores, and sometimes gives a lower relation-labeling F-score to a better DT (in these examples, it gives 0.5 to an incorrect relation label and 0 to a correct one).
Thus, we use our pair-based metric in the experimental analysis of nuclearity and relation labeling for our own system in Section 4.3.

As we saw in the previous paragraphs, all the errors made by the constituent-based approach are simply a result of using the constituent as the basic unit of evaluation. So what about choosing another basic unit, say a pair of constituents? We do exactly this and propose a pair-based evaluation approach as follows: as shown in Table 4.2, our pair-based approach measures how accurate the system is at finding correct pairs of constituents in terms of structure, nuclearity and relation. It only gives credit to those pairs in the system-generated DT that fully match the corresponding pairs in the human-annotated DT. For instance, among the five gold constituent pairs ((2-3)-4, 5-6, (5-6)-7, (2-4)-(5-7), 1-(2-7)) and the five predicted constituent pairs (2-3, (2-3)-4, (5-6)-7, 1-(2-4), (1-4)-(5-7)) for the example in Figure 4.2, only the pairs (2-3)-4 and (5-6)-7 match in terms of structure, and only the pair (2-3)-4 matches in terms of nuclearity and relation. Thus, the structure and nuclearity&relation F-scores are 2/5 and 1/5, respectively. We can clearly see that this approach solves all the issues of the constituent-based approach.

Constituent pair | Span (Human) | Span (System) | Nuclearity&Relation (Human) | Nuclearity&Relation (System)
2-3              |              | *             |                             | SN-Elaboration
(2-3)-4          | *            | *             | NN-Contrast                 | NN-Contrast
5-6              | *            |               | NS-Elaboration              |
(5-6)-7          | *            | *             | NN-Same-Unit                | NN-Joint
1-(2-4)          |              | *             |                             | SN-Attribution
(2-4)-(5-7)      | *            |               | SN-Contrast                 |
(1-4)-(5-7)      |              | *             |                             | NN-Contrast
1-(2-7)          | *            |               | SN-Attribution              |
Correctness      | Span: P = 2/5, R = 2/5       | Nuclearity&Relation: P = 1/5, R = 1/5
Table 4.2: Illustration of our pair-based evaluation approach, measuring the correctness of the human-annotated and system-generated discourse trees in Figure 4.2 (a) and (b).
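As an illustration, the pair-based precision and recall can be computed in a few lines of code once each DT is represented as a mapping from constituent pairs to their nuclearity-and-relation labels. This is a hedged sketch of the metric, not our evaluation script; the dictionary representation of a tree is an assumption made for the example.

    def pair_based_scores(gold_pairs, pred_pairs):
        """gold_pairs / pred_pairs: dicts mapping a constituent pair,
        e.g. ("2-3", "4-4"), to its nuclearity+relation label, e.g. "NN-Contrast".
        Returns (span P, span R, nuclearity&relation P, nuclearity&relation R)."""
        span_hits = len(set(gold_pairs) & set(pred_pairs))
        label_hits = sum(1 for pair, label in pred_pairs.items()
                         if gold_pairs.get(pair) == label)
        return (span_hits / len(pred_pairs), span_hits / len(gold_pairs),
                label_hits / len(pred_pairs), label_hits / len(gold_pairs))

On the example of Table 4.2, this gives a span precision and recall of 2/5 and a nuclearity&relation precision and recall of 1/5.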
4.3 Experiments and Results Analysis

4.3.1 Experiment Setup and Preprocessing

In order to focus our study on discourse parsing (tree building), we use gold EDU segmentation in our experiments, so we do not need to worry about segmentation. For the discourse parsing task, we perform some preprocessing on the RST-DT data set: similar to Joty et al. and Feng et al. [11, 19], non-binary relations are converted into cascades of right-branching binary relations, and the original rhetorical relations are mapped to coarse-grained ones (the 18 rhetorical relation classes shown in Table 4.1). We also extract the necessary features using some existing packages, as follows:
• We compute lexical chains for a document following the approach proposed by Galley and McKeown [12], which extracts lexical chains after performing word sense disambiguation.
• We compute dominance set features using Joty's script from his paper [19].
• We use the Charniak reranking parser [5], one of the state-of-the-art syntactic parsers, to parse our documents for the POS tagging features.
• For word representation vectors, we choose those from Turian et al. [47], since they cover most of the words in our data set and their dimension (200) is ideal for our task, given that the size and training speed of our models are linear (or worse) in the number of features.
• Moreover, we make extensive use of the Python nltk (natural language toolkit) [2] in our feature extraction and selection.
For structure prediction and relation labeling at both the intra- and multi-sentential levels, our models are trained using crfsuite, an open-source CRF package developed by Naoaki Okazaki [35], which has the following characteristics:
• It provides fast training and tagging.
• It supports both L1 and L2 regularization, and uses the OWL-QN (Orthant-Wise Limited-memory Quasi-Newton) method to optimize loss functions with L1 terms. We need this when training our models with both L1 and L2 regularization.
• It supports inference of marginal probabilities for all labels at all positions in a sequence. We need these for our CYK-like parsing.
• It explicitly supports numerical features, which we need since we have numerical features such as the length of an EDU, whereas most other open-source CRF packages, such as Mallet-grmm, CRF++ and CRFSGD, do not (as of 2013). Mallet-grmm does support numerical features implicitly if a small code patch is added, which we did for comparison purposes.
For all other processing, including the CYK-like parsing and the evaluation, we use our own scripts, which will be made publicly available soon.

4.3.2 Structure Prediction Results

In this section, we present and analyze the performance of our system on discourse structure prediction, along with our exploration of the over-fitting phenomenon, feature usefulness, etc. (some structure prediction results will be presented later in Section 4.3.4). One thing to notice is that, in this section, we use the constituent-based F-score for structure prediction results, for comparison purposes.
Then,we run our structure prediction on the standard split of the data set with our ownfeature set {Intra-sentential: text organization, dominance set, POS, context, headand tail ngrams, unigrams of the whole unit; Multi-sentential: text organization,POS, lexical chain, context, head and tail ngrams}, dense feature setting12 andthe default hyper-parameter setting; and get the structure (Span) F-scores of 95.0for intra-sentential parsing and 84.4 for the complete parser (both intra and multi-sentential parsing), which are already impressive compared to human agreement at95.7 and 88.7.However, when we examine our intra and multi-sentential structure models, wefind out that there are 220 thousands and 110 thousands parameters respectively,which is very large compared to the number of training samples (around 70000 and15000 respectively). Thus, we suspect that there may be some over-fitting in ourmodels and we run the over-fitting exploration experiment.Figure 4.5 shows the training v.s. test structure (Span) F-score for two modelsusing text organization, POS features (3846 parameters in total) and using textorganization, POS, ngram features (48846 parameters in total) respectively in intra-sentential structure prediction with default L2 (0.1) and increasing L1. We canclearly see that there is no over-fitting in Figure 4.5(a) since the training F-score isalmost the same as the test F-score no matter what value we set for L1. However,there is a little bit over-fitting in Figure 4.5(b) at the beginning (L1=0, 48846 non-zero parameters) where the training F-score is much higher than the test F-score.But as L1 coefficient increases and the number of non-zero parameters decreases,the training F-score approaches the test F-score, which indicates that we minimize11L1,L2 are the coefficients for L1,L2 penalty term12A dense feature is state features f (xi,yi)s that associate all of possible combinations betweenattribute x’s and label y’s which even do not occur in the training data. It improves our structureF-score by 0.3%-0.5% at the expenses of increasing 10% training/inference time.454.3. Experiments and Results Analysis 93.0 92.8 92.3 91.2 92.4 92.4 92.4 91.7 80.082.084.086.088.090.092.094.096.098.0100.03846 2110 632 850 1 10 100Structure F-score Number of non-zero parameters (above) L1 coefficient (below) (a) TrainTest98.6 95.9 92.8 91.3 91.5 93.0 92.5 91.3 80.082.084.086.088.090.092.094.096.098.0100.048846 10976 1014 920 1 10 100Structure F-score Number of non-zero parameters (above) L1 coefficient (below) (b) TrainTestFigure 4.5: Training v.s. test F-score in intra-sentential structure prediction withdefault L2 (0.1): (a) Using text organization and POS features (3846 parameters intotal). (b) Using text organization, POS and ngram features (48846 parameters intotal)over-fitting. The intuition behind this is simple: we have more features (the ngramfeatures) in Figure 4.5(b) than in Figure 4.5(a) (48846 v.s. 3846) and the number offeatures in Figure 4.5(b) is comparable to the number of training samples (48846v.s around 70000), so our model over-fits the data without proper regularization.Once L1 is big enough (e.g., 10), the number of non-zero parameters in Figure4.5(b) decreases to 1014 and is much less than the sample size (around 70000),464.3. Experiments and Results Analysisso there is no over-fitting anymore. Thus, in order to avoid over-fitting, we chooseproper L1 by cross validation in our final structure prediction models. 
However, forcomparison purpose (in section 4.3.4), we just pick L1 as 10 (as we prefer smallermodels with lower space and time complexity) in order to show that even withouttuning, our system still achieves state-of-the-art structure prediction performance.Since our full feature set for structure prediction is quite large (220 thousandsand 110 thousands for intra and multi-sentential respectively), besides using reg-ularizer to control over-fitting, we also want to explore the usefulness of differentfeature families so that we can have a better understanding of features, identify use-ful and less useful features, and potentially remove less useful features when nec-essary. We design our feature exploration experiment as follows: (1) We choosetext organization as baseline feature set f0 and carry out experiments on f0 withdifferent L1. (2) We create different feature sets f1... fi by adding each featurefamily (except those we have chosen already) to f0 and carry out experiments onf1... fi with different L1. (3) For each feature set in f1... fi that shows significantimprovement to f0, we treat it as new f0 and go back to step (2) to explore a newspace of feature sets recursively. By doing this, we explore necessary combinationsof feature families, which is enough for our purpose of identifying useful and lessuseful features, instead of every possible combination of feature families, whichwould take an extremely long time to run.Figure 4.6 and 4.7 shows the training v.s. test structure (Span) F-score for mod-els using different combinations of feature families we explored in intra-sententialstructure prediction with L1 = 1,L2 = 0.1 and in multi-sentential structure predic-tion with L1 = 10,L2 = 0.1. There are some abbreviations for feature families inthe two figures (and the figures in the next section): Org stands for text organi-zation; Dom stands for dominance set; Ngrams stands for selected head and tailngrams; Lex stands for lexical chain; All stands for our full feature set for structureprediction (mentioned at the beginning of this section). We can see that differ-ent families of features play different roles in intra and multi-sentential structureprediction. More specifically, for intra-sentential structure prediction:• Text organization, head and tail ngrams and POS features all help, as can beseen in Figure 4.6 column Org v.s. Org + Ngrams, Org + POS.• Context features also help, as can be seen in Figure 4.6 column Org + POSv.s Org + POS + Context and column Org + Ngrams + POS v.s. Org +Ngrams + POS + Context.• It is not clear whether dominance set features help, as can be seen in Figure4.6 column Org v.s. Org + Dom and column Org + Ngrams + POS v.s. Org+ Dom + Ngrams + POS.474.3. Experiments and Results Analysis                           88.2 89.1 95.4 92.8 95.9 95.8 95.8 97.6 98.1 88.5 89.0 92.4 92.4 93.0 92.8 95.1 95.2 95.9 80.082.084.086.088.090.092.094.096.098.0100.032 10134 45032 3846 48846 58948 10448 125518 226238Org Org + Dom Org +NgramsOrg + POS Org +Ngrams +POSOrg + Dom +Ngrams +POSOrg + POS +ContextOrg +Ngrams +POS +ContextAllStructure F-score Number of features (above) Combinations of feature families (below) TrainTestFigure 4.6: Feature exploration results on structure prediction for intra-sententialparsing.• Besides text organization features which we choose as the baseline features,POS features seem to be the most effective features in terms of both theirstrong predictive power and small size (384632= 3814), as can be seen inFigure 4.6 column Org + POS. 
For multi-sentential structure prediction:
• Text organization, head and tail ngrams, POS and lexical chain features all help by a small margin, as can be seen in Figure 4.7, column Org vs. Org + Ngrams, Org + POS and Org + Lex.
• Context features help the most, as can be seen in Figure 4.7, column Org + Ngrams + POS + Lex vs. All (Org + Ngrams + POS + Lex + Context). The intuition behind this is that, in multi-sentential parsing, local information is not enough to determine the discourse structure; we need more context before and after.
• The training F-score is lower than the test F-score. A possible explanation is that the F-score here measures the final structure correctness of the DT, not the intermediate correctness (the classification accuracy of the merging decisions for all possible pairs of units), so it is not unreasonable to obtain a higher test F-score. Indeed, when we examine the intermediate correctness of whether two units should be merged or not, we do find that the training accuracy is higher than the test accuracy, as expected.
• Multi-sentential structure prediction has much lower performance than intra-sentential structure prediction, since multi-sentential coherence is much harder to detect and there are many more possible DT configurations as the document size grows.

Figure 4.7: Feature exploration results on structure prediction for multi-sentential parsing.
  Feature combination (number of features): Train / Test structure F-score
  Org (36): 53.3 / 63.0; Org + Ngrams (38570): 53.0 / 65.3; Org + POS (2208): 52.5 / 64.0; Org + Lex (52): 53.2 / 65.3; Org + Ngrams + POS + Lex (40754): 52.9 / 64.5; All (113144): 52.9 / 69.6

We also tried using unigrams of the whole unit as features in multi-sentential structure prediction (as we did in intra-sentential structure prediction). But it turned out to be very inefficient (10 times slower than without the unigrams of the whole unit) and there was no F-score gain (-0.8%). Thus, we exclude them from our experiments and conclude that semantic features do not help structure prediction much, especially at the multi-sentential level; we also do not use bigrams, trigrams or other semantic features in our structure prediction, in order to keep our models efficient.

Finally, as can be seen in the last column of both Figures 4.6 and 4.7, adding more feature families always helps in terms of accuracy, provided we apply proper regularization to the model. However, to keep structure prediction not only accurate but also efficient, only the most informative feature families, i.e., the last column of Figures 4.6 and 4.7, will be used when comparing with other competing systems later in Section 4.3.4.

4.3.3 Relation Labeling Results

In this section, we present and analyze the performance of our system on discourse relation labeling (including nuclearity labeling), along with our exploration of the over-fitting phenomenon, feature usefulness, and the trend of labeling correctness with respect to label frequency (some relation labeling results will be presented later in Section 4.3.4). One thing to notice is that, in this section, we first use the constituent-based F-score and then switch to the pair-based F-score for relation labeling results.
Similar to structure prediction, we also carried out a 10-fold CV, which leads to the same conclusion about the randomness of the RST-DT data set with respect to discourse relation labels (we omit the results here). Then, we run our relation labeling on the standard split of the data set with several different feature sets (e.g., the standard feature set from Joty, and the standard feature set plus one of the following: sub-structure, depth, cue phrase, unit representation vector, and more ngrams), the dense feature setting and the default hyper-parameters (L1 = 0, L2 = 0.1). Figure 4.8 shows the results, both for intra-sentential relation labeling and for the complete parser (both intra- and multi-sentential relation labeling). Some abbreviations for feature families are used in the figure: Std stands for the standard feature set from Joty; Sub-Str stands for sub-structure; Dep stands for depth; Cue stands for cue phrase; Vec stands for the unit representation vector features; more Ngrams stands for all head (and tail) ngrams plus unigrams (and selected bigrams) of the whole unit.

Figure 4.8: Relation labeling test F-score on different feature sets for: (a) intra-sentential; (b) the complete parser.
  (a) Nuclearity / Relation F-score: Std 87.0 / 78.2; Std + Sub-Str 86.8 / 78.0; Std + Dep 87.0 / 78.2; Std + Cue 86.9 / 77.9; Std + Vec 86.3 / 76.5; Std + more Ngrams 87.1 / 78.5
  (b) Nuclearity / Relation F-score: Std 69.1 / 56.4; Std + Sub-Str 68.9 / 56.2; Std + Dep 69.1 / 56.4; Std + Cue 69.3 / 56.2; Std + Vec 68.7 / 55.4; Std + more Ngrams 69.1 / 56.9

There are two major conclusions we draw from these results:
• With default regularization, on top of the standard feature set, none of the features helps nuclearity and relation labeling except more ngrams (all head and tail ngrams, unigrams and selected bigrams of the whole unit), as can be seen by comparing the last column to the first column in Figure 4.8(a) and (b). The potential reasons are as follows: (1) the sub-structure, depth and unit representation vector features are noisy; (2) the cue phrase features do not provide much additional power if we already have the ngram features; (3) we did not apply proper regularization to the models, and there is over-fitting. Thus, for simplicity and efficiency, we exclude these features from our later experiments and from our comparison against competing systems in Section 4.3.4.
• As can be seen in the last column of Figure 4.8(a) and (b), we obtain nuclearity/relation F-scores of 87.1/78.53 for intra-sentential relation labeling and 69.06/56.89 for the complete parser (both intra- and multi-sentential relation labeling), compared to a human agreement of 90.4/83.0 and 77.72/65.75. This is not bad, but we can certainly improve it further, as we will see later.
We also experimented with a unified model that does not distinguish between intra- and multi-sentential relation labeling. The nuclearity/relation F-scores of the unified model, which allows data sharing, are 87.03/78.26 for intra-sentential relation labeling and 68.95/56.30 for the complete parser (both intra- and multi-sentential relation labeling), and are pretty much the same as those of the two separate models shown above in Figure 4.8.
A possible explanation for this is that, although we claimed before that we should treat intra and multi-sentential parsing separately, in practice our log-linear model and the features we use are capable of capturing the differences. One thing to notice is that from now on, we will be using our own pair-based nuclearity&relation F-score (credit is only given to pairs with both the correct nuclearity status and the correct relation label) in this section. Moreover, we only consider the correctly retrieved spans when computing the F-scores, in order to eliminate the effect of structure prediction.

As mentioned before, we may face the over-fitting issue here, since we have a very large feature set. Similar to the last section, we run an over-fitting exploration experiment. Figure 4.9 shows the training vs. test pair-based nuclearity&relation F-scores for models with large feature sets (more than a hundred thousand parameters) in intra and multi-sentential relation labeling with the default L2 (0.1). There are some new abbreviations for feature families in the figure: HT_Ngrams stands for selected head and tail ngrams; W_Unigram stands for unigrams of the whole unit; W_Bigram stands for selected bigrams of the whole unit; All stands for the standard feature set from Joty plus W_Unigram and W_Bigram; All - Context stands for All except context features. We can clearly see that there is large over-fitting in both Figure 4.9(a) and (b), since the training F-scores are almost perfect (100 or 99.9) while the test F-scores are far apart.

To confirm our conclusion, Figure 4.10 shows the training vs. test pair-based nuclearity&relation F-scores for models using text organization and selected head and tail ngram features (1153542 and 741116 parameters in total) in both intra and multi-sentential relation labeling with the default L2 (0.1) and increasing L1. The reason for over-fitting is very straightforward: the number of features in Figure 4.10 is much larger than the number of training samples (1153542 and 741116 vs. around 11000 and 7000), so our model over-fits the data without proper regularization. Once L1 is big enough (e.g., 10), the number of non-zero parameters in Figure 4.10 decreases to 273 and 307, which is much smaller than the sample size (around 11000 and 7000), so there is no over-fitting anymore. Thus, in order to avoid over-fitting, we choose a proper L1 by cross validation in our final relation labeling models. However, for comparison purposes (in Section 4.3.4), we simply pick L1 as 0 (intra-sentential) and 10 (multi-sentential, as we prefer smaller models with lower space and time complexity) in order to show that even without tuning, our system still achieves high relation labeling performance.

Figure 4.9: Training vs. test pair-based nuclearity&relation F-score in relation labeling with default L2 (0.1) for: (a) Intra-sentential. (b) Multi-sentential.
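To make the effect of the L1 term concrete, the following is a minimal sketch, not our actual training code, of fitting an L1+L2 (elastic-net) regularized log-linear classifier and counting its non-zero parameters. The feature matrix X, the label vector y, the data sizes and the mapping of our (L1, L2) coefficients onto scikit-learn's C and l1_ratio parametrization are assumptions made purely for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical sparse binary training data: 1000 samples, 3000 features, 6 labels.
    rng = np.random.default_rng(0)
    X = (rng.random((1000, 3000)) < 0.002).astype(float)
    y = rng.integers(0, 6, size=1000)

    # Elastic-net penalty: a larger share of L1 (higher l1_ratio) drives more
    # coefficients exactly to zero, i.e., yields a smaller effective model.
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.9, C=0.1, max_iter=200)
    clf.fit(X, y)
    print("non-zero parameters:", np.count_nonzero(clf.coef_))

Re-running the sketch with a smaller l1_ratio (mostly L2) keeps far more coefficients non-zero, which mirrors the behaviour we observe in Figure 4.10 as L1 grows.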
Figure 4.10: Training vs. test pair-based nuclearity&relation F-score in relation labeling using text organization and selected head and tail ngram features with default L2 (0.1) for: (a) Intra-sentential. (b) Multi-sentential.

Similar to structure prediction, we also carry out a feature exploration experiment to study the usefulness of features in relation labeling. One thing to notice is that we only study text organization, dominance set, selected head and tail ngrams, unigrams of the whole unit, selected bigrams of the whole unit, lexical chain and context features here, since, as mentioned before, the sub-structure, depth, unit representation vector and cue phrase features are excluded from our experiments. Figures 4.11 and 4.12 show the training vs. test pair-based nuclearity&relation F-scores for models using different combinations of the feature families we explored, in intra-sentential relation labeling with L1 = 1, L2 = 0.1 and in multi-sentential relation labeling with L1 = 10, L2 = 0.1. We can see that different families of features play different roles in intra and multi-sentential relation labeling.

Figure 4.11: Feature exploration results on relation labeling for intra-sentential parsing.

More specifically, for intra-sentential relation labeling:

• Text organization, dominance set, head and tail ngram, unigram (or bigram) of the whole unit, and POS features all help, as can be seen in Figure 4.11, column Org vs. Org + Dom, Org + POS, Org + HT_Ngrams and Org + W_Unigram + W_Bigram.

• Context features do not help, as can be seen in Figure 4.11, column All - CONTEXT vs. All. The potential reason for this is that most sentences only have 2-3 EDUs, so there is not much contextual information to learn or model.

• Besides the text organization features, which we choose as the baseline features, head and tail ngram features seem to be the most effective, as can be seen in Figure 4.11, column Org + HT_Ngrams, since they have a strong capability of capturing semantic information for discourse relation labeling.

• One thing to notice is that all the conclusions above still hold if we use the constituent-based F-score instead of the pair-based F-score. However, we omit the constituent-based results here. For a comparison between the constituent-based and pair-based F-scores, please refer to Table 4.4 in the next section.
For multi-sentential relation labeling:

• Text organization, POS, lexical chain, head and tail ngram, and unigram (or bigram) of the whole unit features all help, as can be seen in Figure 4.12, column Org vs. Org + POS, Org + Lex, Org + HT_Ngrams and Org + W_Unigram + W_Bigram.

• Context features also help, as can be seen in Figure 4.12, column All - CONTEXT vs. All.

• Besides the text organization features, which we choose as the baseline features, head and tail ngram features seem to be the most effective, as can be seen in Figure 4.12, column Org + HT_Ngrams, since they have a strong capability of capturing semantic information for discourse relation labeling.

• Performance is much lower than for intra-sentential relation labeling, since multi-sentential level coherence is much harder to detect.

• One thing to notice is that if we use the constituent-based F-score instead of the pair-based F-score, all the conclusions above still hold with one exception: context features do not help any more (the constituent-based nuclearity/relation F-scores drop from 44.1/23.9 to 43.8/23.8). However, we believe that context features should be helpful in multi-sentential relation labeling, since a document has many sentences and sequential or long-range rhetorical dependencies do exist at the multi-sentential level. In addition, context features have been shown to be helpful by other researchers (e.g., Joty et al. [19]). Thus, this observation in a way empirically shows the weaknesses of the constituent-based evaluation approach.

Figure 4.12: Feature exploration results on relation labeling for multi-sentential parsing.

As can be seen in the last column of both Figures 4.11 and 4.12, adding more feature families always helps if we apply proper regularization to the model. However, to keep relation labeling not only accurate but also efficient, only the most informative feature families, i.e., the second-to-last column in Figure 4.11 and the fourth column in Figure 4.12, will be used when comparing with other competing systems later in Section 4.3.4.

As mentioned before, we have an uneven distribution of relation labels in RST-DT. Thus, we want to examine whether the frequency of rhetorical relations affects the accuracy. Figure 4.13 shows the average pair-based nuclearity&relation F-scores on the standard test set for the top-1, top-2, top-10, 2-10 and all rhetorical relations (with respect to frequency), for both intra and multi-sentential relation labeling. The feature sets we use are the ones mentioned in the previous paragraph, and the hyper-parameter settings are the same as in the feature exploration experiments above: L1 = 1, L2 = 0.1 for the intra-sentential level and L1 = 10, L2 = 0.1 for the multi-sentential level. As can be seen in Figure 4.13, the F-score decreases as we take less frequent relations into consideration.
However, there is one exception, as can be seen in Figure 4.13, column Top-2 vs. Top-1 for intra-sentential relation labeling: the top-2 F-score is higher than the top-1 F-score, indicating that the second most frequent relation (Attribution) has a higher F-score than the most frequent relation (Elaboration) at the intra-sentential level. The reason for this is straightforward: Attribution is very easy to classify, as it can be triggered by ngrams like "he says", whereas Elaboration is not. In general, more frequent relations get higher F-scores, since we have more training samples for them.

Figure 4.13: Test pair-based nuclearity&relation F-score on top-k relations in relation labeling.

4.3.4 Complete Parser Analysis and Comparison

After our discussion of some results for structure prediction and relation labeling, we now present some more results for the complete parser, involving both parts, in this section. Firstly, we examine the test F-score (with the default feature and hyper-parameter settings) at different levels of the DT. As shown in Table 4.4, the F-scores on structure, nuclearity and relation all decrease as spans go higher in the DT, under both constituent-based and pair-based evaluation. This is not surprising, since higher level spans (branches) in the tree have more units (EDUs or sentences), making it harder for our models to find the right discourse structure as well as the right discourse relations. Moreover, the correctness of higher level spans depends on the correctness of lower level spans.

Span level (height) in the DT | Leaf | 1 | 2 | 3 | 4 | 5 | Average
Structure F-score, constituent-based | 100 | 90.4 | 79.8 | 74.4 | 64.0 | 56.2 | 86.2
Structure F-score, pair-based | N/A | 90.4 | 69.9 | 57.6 | 43.1 | 34.8 | 61.6
Nuclearity F-score, constituent-based | 88.5 | 74.5 | 63.7 | 49.7 | 44.3 | 40.0 | 72.2
Relation F-score, constituent-based | 76.0 | 57.4 | 48.9 | 34.8 | 31.0 | 21.5 | 58.9
Nuclearity&Relation F-score, pair-based | N/A | 72.1 | 54.3 | 40.6 | 27.9 | 20.5 | 46.4
* All F-scores are rounded to one decimal place.
Table 4.4: Level-wise F-score

Then, there is an important observation from our experiments: we are able to achieve good enough accuracy by using only state features (functions) f(x_i, y_i) and transition features (functions) f(y_i, y_{i+1}), instead of clique features (functions) f(x_i, y_i, y_{i+1}) with higher-order complexity, in our CRFs and log-linear models. Please refer back to Section 2.2 for what "state", "transition" and "feature (function)" mean.
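As a rough illustration of the distinction, here is a minimal sketch, in plain NumPy rather than our actual implementation, of the unnormalized score a linear-chain CRF assigns to a label sequence when only state features f(x_i, y_i) and transition features f(y_i, y_{i+1}) are used; the feature dimensionality, label set and weights are invented for the example.

    import numpy as np

    # Toy setup: 2 labels (e.g., merge / no-merge) and 4 features per position.
    n_labels, n_feats = 2, 4
    rng = np.random.default_rng(0)
    w_state = rng.normal(size=(n_labels, n_feats))   # one weight vector per label
    w_trans = rng.normal(size=(n_labels, n_labels))  # one weight per label bigram

    def sequence_score(x_seq, y_seq):
        """Unnormalized log-score of y_seq given x_seq under a linear-chain CRF
        restricted to state features f(x_i, y_i) and transitions f(y_i, y_{i+1})."""
        state_part = sum(w_state[y] @ x for x, y in zip(x_seq, y_seq))
        trans_part = sum(w_trans[a, b] for a, b in zip(y_seq[:-1], y_seq[1:]))
        return state_part + trans_part

    x_seq = rng.normal(size=(3, n_feats))  # three adjacent-unit observations
    print(sequence_score(x_seq, [1, 0, 1]))

Clique features f(x_i, y_i, y_{i+1}) would instead require weights indexed jointly by the observation and the label bigram, which is where the extra complexity comes from.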
Now let us compare our system against several competing systems in terms of accuracy. Table 4.5 shows the F-scores for our discourse parsing system as well as for other competing systems from Joty et al., Feng et al. and Ji and Eisenstein [19, 11, 18].

System and method | Intra-sentential: Structure (Span) / Nuclearity / Relation | Complete parser: Structure (Span) / Nuclearity / Relation
Our system | 96.0 / 89.0 / 80.0 | 86.2 / 72.2 / 59.2
Joty et al., 1 Sentence 1 Sub-tree | 96.5 / 89.4 / 79.8 | 82.6 / 68.3 / 55.8
Joty et al., Sliding window | - | 83.8 / 68.9 / 55.7
Feng et al., greedy linear-chain CRF | N/A | 84.9 / 69.9 / 57.2
Feng et al., greedy linear-chain CRF with post-editing | N/A | 85.7 / 71.0 / 58.2
Ji and Eisenstein, concatenation form | N/A | 82.1 / 71.1 / 61.6
Ji and Eisenstein, general form | N/A | 81.6 / 71.0 / 61.7
Human agreement | 95.7 / 90.4 / 83.0 | 88.7 / 77.7 / 65.7
* N/A means we do not have access to their intra-sentential results, or they do not distinguish between intra and multi-sentential parsing.
Table 4.5: Comparison on system accuracy performance

The feature sets and hyper-parameter settings we use for our system are not tuned and are the ones mentioned before: for intra-sentential parsing, we use the feature set {text organization, dominance set, POS, context, head and tail ngrams, unigrams of the whole unit} with hyper-parameters {L1 = 10, L2 = 0.1} for structure prediction, and the feature set {text organization, dominance set, POS, head and tail ngrams, unigrams (and selected bigrams) of the whole unit} with hyper-parameters {L1 = 0, L2 = 0.1} for relation labeling; for multi-sentential parsing, we use the feature set {text organization, POS, lexical chain, context, head and tail ngrams} with hyper-parameters {L1 = 10, L2 = 0.1} for structure prediction, and the feature set {text organization, head and tail ngrams} with hyper-parameters {L1 = 10, L2 = 0.1} for relation labeling.

As can be seen in the table, for intra-sentential parsing both our system and Joty's achieve close-to-human-agreement performance. For complete parsing (both intra and multi-sentential), our system achieves the best F-scores in structure and nuclearity, and they are very close to human agreement (97.2% and 92.9% of it, respectively). However, our system achieves a lower F-score in relation labeling than the best result from Ji and Eisenstein's system; the potential reason is that their mechanism of using unigrams together with representation learning techniques improves the quality of the BOW features for relation labeling. In summary, our system achieves the best combination of F-scores (structure, nuclearity and relation) compared to the other systems.

Finally, let us present some efficiency results for our system as well as a comparison between different discourse parsing systems. The performance of our system is measured in experiments on a desktop with an Intel(R) Core(TM)2 Duo E8135 CPU @ 2.66GHz and 4GB of RAM. Table 4.6 shows the running time of the different components of our system for both training and testing.

Component | Train (7300 sentences, 347 documents) | Test, average (per sentence for intra, per document for multi)
Structure prediction, intra-sentential | 450 s | 0.007 s
Structure prediction, multi-sentential | 80 s | 50 s
Relation labeling, intra-sentential | 200 s | 0.002 s
Relation labeling, multi-sentential | 120 s | 0.04 s
Table 4.6: Running time performance for our system

As can be seen in the table, it takes much longer (450 seconds) to train the intra-sentential structure model than the other models. The intuition behind this is that we are using a linear-chain CRF for intra-sentential structure prediction, which is inherently slow, as explained before in Section 2.2, while we are using the simpler log-linear models for multi-sentential structure prediction and for intra and multi-sentential relation labeling. However, for testing, multi-sentential structure prediction takes much longer on average (50 seconds per document). The reason for this is that we need to infer probabilities for O(n^3) pairs of units in structure prediction, and n ranges from 1 to 1000 (or even more) at the multi-sentential level compared to 1 to 10 at the intra-sentential level; in contrast, intra and multi-sentential relation labeling take only linear time.
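To illustrate where the O(n^3) probabilities come from, here is a schematic sketch of the kind of CKY-style dynamic program that optimal probabilistic parsing implies: every span [i, j] is scored over all split points k, so a probability is needed for every (i, k, j) triple. The function pair_prob and the assumption that subtree scores simply multiply are simplifications for illustration, not our exact parsing model.

    from functools import lru_cache

    def best_tree(n, pair_prob):
        """Schematic CKY-style search for the highest-scoring binary discourse tree
        over units 0..n-1.  pair_prob(i, k, j) is assumed to give the probability
        of merging span [i, k] with span [k+1, j]; subtree scores are multiplied."""
        @lru_cache(maxsize=None)
        def best(i, j):
            if i == j:
                return 1.0, (i, j)  # a single unit is already a (trivial) subtree
            score, tree = 0.0, None
            for k in range(i, j):  # O(n) splits per span, O(n^2) spans -> O(n^3) probabilities
                left_s, left_t = best(i, k)
                right_s, right_t = best(k + 1, j)
                s = left_s * right_s * pair_prob(i, k, j)
                if s > score:
                    score, tree = s, (left_t, right_t)
            return score, tree
        return best(0, n - 1)

    # Toy usage with a made-up probability function that prefers splitting after the left unit.
    print(best_tree(4, lambda i, k, j: 0.9 if k == i else 0.4))

A greedy parser would instead commit to one merge at a time, which is exactly the accuracy/efficiency trade-off discussed in the next chapter.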
Then, Table 4.7 shows a comparison of the running time performance of our system and some other state-of-the-art systems.

System and method | Train, intra-sentential (7300 sentences, 347 documents) | Train, multi-sentential | Test, longest document (180 sentences) in RST-DT
Our system | 650s | 200s | 80s
Joty et al., 1 Sentence 1 Sub-tree | 26000s | 7000s | 1900s
Joty et al., Sliding window | - | N/A | N/A
Feng et al., greedy linear-chain CRF | N/A | N/A | 41s
Feng et al., greedy linear-chain CRF with post-editing | N/A | N/A | 85s
Ji and Eisenstein | N/A | N/A | N/A
* N/A means we do not have access to the running time.
Table 4.7: Comparison of running time performance on different systems

As can be seen in the table, our system is 40 times faster in training and 20 times faster in testing (for the longest document, with 180 sentences, in RST-DT) than Joty's. Thus, we can safely conclude that our separation of structure prediction and relation labeling results in much higher efficiency and lower memory consumption in both training and testing, as well as better accuracy for both structure and relation labels on the standard test data, as can be seen in Tables 4.5 and 4.7. Moreover, our system has comparable testing efficiency (for the longest document, with 180 sentences, in RST-DT) to Feng's system, which requires hand-engineered rules and constraints. In summary, our system achieves the best, or close to the best, performance in every dimension (e.g., efficiency, accuracy, training, test, structure, relation) without tuning.

Chapter 5

Conclusion and Future Work

In this thesis, we summarize several discourse parsing research studies and analyze them from various angles to provide a better understanding of the discourse parsing problem. Based on our analysis and reasoning, we propose a novel two-step discourse parsing system, which first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned CRFs, and then uses learned log-linear models to tag all discourse relations on the nodes of the discourse tree.

We carry out extensive experimental studies on different aspects of our system and find that: for structure prediction, the most effective features are POS features at the intra-sentential level and context features at the multi-sentential level, while for relation labeling, the most effective features are head and tail ngram features at both the intra and multi-sentential levels. Moreover, we find that over-fitting does occur in both intra and multi-sentential parsing, and we apply proper regularization in our models, resulting in better accuracy and smaller models. Not surprisingly, more frequent relations have higher accuracy, and accuracy decreases as the level goes higher in the DT. Finally, empirical evaluation shows that our system achieves state-of-the-art performance in terms of both accuracy and efficiency compared to other systems, and it is very close (above 90%) to the human agreement.

However, there are some open issues and challenges in discourse parsing which we have not fully explored and solved. One issue is that, although our system is rather efficient, it may not be fast enough if we want to do textual analysis online (as opposed to offline).
The fundamental way to lower the time complexity of discourse parsing is to use greedy parsing instead of optimal parsing. Thus, in future work, we may want to study how we can achieve the same accuracy with greedy parsing by using more advanced techniques and strategies.

Another issue is the measure of similarity between rhetorical relations. For example, ELABORATION and BACKGROUND have high similarity, since a piece of text T1 which is a background of another piece of text T2 also provides some degree of elaboration on T2. Imagine a scenario where:

• T1 is a background of another piece of text T2
• T1 provides some degree of elaboration on T2
• A human annotator relates T1 and T2 with BACKGROUND
• DPS1 (discourse parsing system #1) relates T1 and T2 with BACKGROUND
• DPS2 relates T1 and T2 with ELABORATION
• DPS3 relates T1 and T2 with LIST

We can say DPS1 is correct and DPS3 is incorrect. But we cannot say that DPS2 and DPS3 make equally bad choices for the rhetorical relation between T1 and T2. In fact, we cannot even say DPS2 is incorrect, since there is indeed some elaboration on T2 from T1. Thus, in this kind of scenario, we do need a measure of similarity between rhetorical relations to determine whether a discourse parsing system really makes an incorrect choice of rhetorical relation. However, this problem is beyond the scope of this thesis.

From the machine learning perspective, a big challenge of discourse parsing is the imbalanced distribution of rhetorical relations, together with the data scarcity for infrequent rhetorical relations (not enough samples). As can be seen in Section 4.1, the frequency of relations varies from 10 to 7000. This poses a big challenge for a discourse parsing system trying to find correct rhetorical relation labels for samples with infrequent rhetorical relations. Thus, in future work, we may use sampling techniques like down-sampling and up-sampling to handle the imbalance issue, as sketched below.
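As a minimal sketch of the up-sampling idea (not a component of the current system), the following duplicates training instances of infrequent relation labels until every label is as frequent as the most common one; the (features, label) data format is a made-up placeholder.

    import random
    from collections import Counter

    def upsample(samples, seed=0):
        """Duplicate instances of infrequent relation labels at random until every
        label has as many training instances as the most frequent label.
        `samples` is a list of (features, relation_label) pairs (placeholder format)."""
        rng = random.Random(seed)
        by_label = {}
        for feats, label in samples:
            by_label.setdefault(label, []).append((feats, label))
        target = max(len(group) for group in by_label.values())
        balanced = []
        for group in by_label.values():
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
        rng.shuffle(balanced)
        return balanced

    data = [({"ngram": "he says"}, "Attribution")] * 5 + [({"ngram": "because"}, "Cause")] * 2
    print(Counter(label for _, label in upsample(data)))  # both labels now have 5 instances

Down-sampling would instead discard instances of the frequent labels; either way, the resampling should be applied to the training folds only, never to the test set.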
We may also use information from other rhetorical relation data sets, such as the Penn Discourse Treebank (PDTB) [36], to enhance our relation labeling component. More specifically, we can use that information in a semi-supervised manner to enhance the quality of our feature representation, or in a supervised manner to increase the training sample size.

Finally, we hope that our analysis of discourse parsing is helpful not only for researchers who need to design their own discourse parsing systems, but also for researchers who study other NLP problems and want to achieve better performance on their systems. Thus, we will make our system demo and source code publicly available (https://sites.google.com/site/liaoweicong/projects).

Bibliography

[1] Nicholas Asher and Alex Lascarides. Logics of conversation. Cambridge University Press, 2003.
[2] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009.
[3] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. RST Discourse Treebank. LDC2002T07. Philadelphia: Linguistic Data Consortium, 2002.
[4] Lynn Carlson, Mary Ellen Okurowski, Daniel Marcu, Linguistic Data Consortium, et al. RST Discourse Treebank. Linguistic Data Consortium, University of Pennsylvania, 2002.
[5] Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 173–180. Association for Computational Linguistics, 2005.
[6] Michael Collins. Log-linear models. Columbia University Natural Language Processing Tutorial, 2013.
[7] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[8] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[9] Laurence Danlos. D-STAG: a formalism for discourse analysis based on SDRT and using synchronous TAG. In Proceedings of the 14th Conference on Formal Grammar (FG'09), 2009.
[10] Vanessa Wei Feng and Graeme Hirst. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 60–68. Association for Computational Linguistics, 2012.
[11] Vanessa Wei Feng and Graeme Hirst. A linear-time bottom-up discourse parser with constraints and post-editing. ACL (1), 2013.
[12] Michel Galley and Kathleen McKeown. Improving word sense disambiguation in lexical chaining. In IJCAI, volume 3, pages 1486–1488, 2003.
[13] Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. Abstractive summarization of product reviews using discourse structure.
[14] Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. Using discourse structure improves machine translation evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–698, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[15] Hugo Hernault, Danushka Bollegala, and Mitsuru Ishizuka. A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 399–409. Association for Computational Linguistics, 2010.
[16] Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. HILDA: a discourse parser using support vector machine classification. Dialogue & Discourse, 1(3), 2010.
[17] Jerry R. Hobbs. Coherence and coreference. Cognitive Science, 3(1):67–90, 1979.
[18] Yangfeng Ji and Jacob Eisenstein. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24. Association for Computational Linguistics, 2014.
[19] Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. CODRA: A novel discriminative framework for rhetorical analysis.
[20] Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. A novel discriminative framework for sentence-level discourse analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 904–915. Association for Computational Linguistics, 2012.
[21] Shafiq R. Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In ACL (1), pages 486–496, 2013.
[22] Dan Jurafsky and James H. Martin. Speech & Language Processing, chapter 14. Prentice Hall, 2008.
[23] Alistair Knott and Robert Dale. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62, 1994.
[24] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[25] Angeliki Lazaridou, Ivan Titov, and Caroline Sporleder. A Bayesian model for joint unsupervised induction of sentiment, aspect and discourse representations. In ACL (1), pages 1630–1639, 2013.
[26] Annie Louis, Aravind Joshi, and Ani Nenkova. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 147–156. Association for Computational Linguistics, 2010.
[27] Daniel Marcu. A decision-based approach to rhetorical parsing. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 365–372. Association for Computational Linguistics, 1999.
[28] Daniel Marcu. The theory and practice of discourse parsing and summarization. MIT Press, 2000.
[29] Daniel Marcu and Abdessamad Echihabi. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 368–375. Association for Computational Linguistics, 2002.
[30] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics, 1994.
[31] James R. Martin. English text: System and structure. John Benjamins Publishing, 1992.
[32] Mstislav Maslennikov and Tat-Seng Chua. A multi-resolution framework for information extraction from free text. In Annual Meeting-Association for Computational Linguistics, volume 45, page 592. Citeseer, 2007.
[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[34] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, pages 236–244, 2008.
[35] Naoaki Okazaki. CRFsuite: a fast implementation of conditional random fields (CRFs), 2007.
[36] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. The Penn Discourse Treebank 2.0. In LREC. Citeseer, 2008.
[37] Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, and Bonnie Webber. The Penn Discourse Treebank as a resource for natural language generation. In Proc. of the Corpus Linguistics Workshop on Using Corpora for Natural Language Generation, pages 25–32, 2005.
[38] Kenji Sagae. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Conference on Parsing Technologies, pages 81–84. Association for Computational Linguistics, 2009.
[39] Swapna Somasundaran. Discourse-level relations for opinion analysis. PhD thesis, University of Pittsburgh, 2010.
[40] Radu Soricut and Daniel Marcu. Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 149–156. Association for Computational Linguistics, 2003.
[41] Caroline Sporleder and Mirella Lapata. Discourse chunking and its application to sentence compression. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 257–264. Association for Computational Linguistics, 2005.
[42] Manfred Stede. Discourse processing. Synthesis Lectures on Human Language Technologies, 4(3):1–165, 2011.
[43] Rajen Subba and Barbara Di Eugenio. An effective discourse parser that uses rich linguistic information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 566–574. Association for Computational Linguistics, 2009.
[44] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088, 2010.
[45] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. The Journal of Machine Learning Research, 8:693–723, 2007.
[46] Simone Teufel and Marc Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445, 2002.
[47] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
[48] Suzan Verberne, Lou Boves, Nelleke Oostdijk, and Peter-Arno Coppen. Evaluating discourse-based answer extraction for why-question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 735–736. ACM, 2007.
[49] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[50] Bonnie Webber. D-LTAG: Extending lexicalized TAG to discourse. Cognitive Science, 28(5):751–779, 2004.
[51] William Mann and Sandra Thompson. Rhetorical structure theory: Towards a functional theory of text organization. Text, 8(3):243–281, 1988.

Appendix A

Complete Discourse Relations

Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Circumstance | on S: S is not unrealized | S sets a framework in the subject matter within which R is intended to interpret N | R recognizes that S provides the framework for interpreting N
Condition | on S: S presents a hypothetical, future, or otherwise unrealized situation (relative to the situational context of S) | Realization of N depends on realization of S | R recognizes how the realization of N depends on the realization of S
Elaboration | none | S presents additional detail about the situation or some element of subject matter which is presented in N or inferentially accessible in N | R recognizes S as providing additional detail for N; R identifies the element of subject matter for which detail is provided
Evaluation | none | on N + S: S relates N to the degree of W's positive regard toward N | R recognizes that S assesses N and recognizes the value it assigns
Interpretation | none | on N + S: S relates N to a framework of ideas not involved in N itself and not concerned with W's positive regard | R recognizes that S relates N to a framework of ideas not involved in the knowledge presented in N itself
Means | on N: an activity | S presents a method or instrument which tends to make realization of N more likely | R recognizes that the method or instrument in S tends to make realization of N more likely
Non-volitional Cause | on N: N is not a volitional action | S, by means other than motivating a volitional action, caused N; without the presentation of S, R might not know the particular cause of the situation; a presentation of N is more central than S to W's purposes in putting forth the N-S combination | R recognizes S as a cause of N
Table A.1: Definitions of Subject Matter Relations: Part I

Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Non-volitional Result | on S: S is not a volitional action | N caused S; presentation of N is more central to W's purposes in putting forth the N-S combination than is the presentation of S | R recognizes that N could have caused the situation in S
Otherwise | on N: N is an unrealized situation; on S: S is an unrealized situation | Realization of N prevents realization of S | R recognizes the dependency relation of prevention between the realization of N and the realization of S
Purpose | on N: N is an activity; on S: S is a situation that is unrealized | S is to be realized through the activity in N | R recognizes that the activity in N is initiated in order to realize S
Solutionhood | on S: S presents a problem | N is a solution to the problem presented in S | R recognizes N as a solution to the problem presented in S
Unconditional | on S: S conceivably could affect the realization of N | N does not depend on S | R recognizes that N does not depend on S
Unless | none | S affects the realization of N; N is realized provided that S is not realized | R recognizes that N is realized provided that S is not realized
Volitional Cause | on N: N is a volitional action or else a situation that could have arisen from a volitional action | S could have caused the agent of the volitional action in N to perform that action; without the presentation of S, R might not regard the action as motivated or know the particular motivation; N is more central to W's purposes in putting forth the N-S combination than S is | R recognizes S as a cause for the volitional action in N
Volitional Result | on S: S is a volitional action or a situation that could have arisen from a volitional action | N could have caused S; presentation of N is more central to W's purposes than is presentation of S | R recognizes that N could be a cause for the action or situation in S
Table A.2: Definitions of Subject Matter Relations: Part II
Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Antithesis | on N: W has positive regard for N | N and S are in contrast; because of the incompatibility that arises from the contrast, one cannot have positive regard for both of those situations; comprehending S and the incompatibility between the situations increases R's positive regard for N | R's positive regard for N is increased
Background | on N: R won't comprehend N sufficiently before reading the text of S | S increases the ability of R to comprehend an element in N | R's ability to comprehend N increases
Concession | on N: W has positive regard for N; on S: W is not claiming that S does not hold | W acknowledges a potential or apparent incompatibility between N and S; recognizing the compatibility between N and S increases R's positive regard for N | R's positive regard for N is increased
Enablement | on N: N presents an action by R (including accepting an offer), unrealized with respect to the context of N | R comprehending S increases R's potential ability to perform the action in N | R's potential ability to perform the action in N increases
Evidence | on N: R might not believe N to a degree satisfactory to W; on S: R believes S or will find it credible | R's comprehending S increases R's belief of N | R's belief of N is increased
Justify | none | R's comprehending S increases R's readiness to accept W's right to present N | R's readiness to accept W's right to present N is increased
Motivation | on N: N is an action in which R is the actor (including accepting an offer), unrealized with respect to the context of N | Comprehending S increases R's desire to perform the action in N | R's desire to perform the action in N is increased
Preparation | none | S precedes N in the text; S tends to make R more ready, interested or oriented for reading N | R is more ready, interested or oriented for reading N
Restatement | none | on N + S: S restates N, where S and N are of comparable bulk; N is more central to W's purposes than S is | R recognizes S as a restatement of N
Summary | on N: N must be more than one unit | S presents a restatement of the content of N that is shorter in bulk | R recognizes S as a shorter restatement of N
Table A.3: Definitions of Presentational Relations

Relation name | Constraints on each pair of N | Intention of W
Conjunction | The items are conjoined to form a unit in which each item plays a comparable role | R recognizes that the linked items are conjoined
Contrast | No more than two nuclei; the situations in these two nuclei are (a) comprehended as the same in many respects, (b) comprehended as differing in a few respects, and (c) compared with respect to one or more of these differences | R recognizes the comparability and the difference(s) yielded by the comparison being made
Disjunction | An item presents a (not necessarily exclusive) alternative for the other(s) | R recognizes that the linked items are alternatives
Joint | None | None
List | An item is comparable to others linked to it by the List relation | R recognizes the comparability of the linked items
Multinuclear Restatement | An item is primarily a reexpression of one linked to it; the items are of comparable importance to the purposes of W | R recognizes the reexpression by the linked items
Sequence | There is a succession relationship between the situations in the nuclei | R recognizes the succession relationships among the nuclei
Table A.4: Definitions of Multinuclear Relations
