UBC Theses and Dissertations


Exploring machine learning design options in discourse parsing

Liao, Weicong

Abstract

Discourse parsing has recently attracted increasing interest among researchers because it benefits text understanding, sentiment analysis, and other NLP tasks. In a well-written text, authors use discourse to organize their writing, and sentences (or clauses) tend to interact with neighboring sentences (or clauses). Each piece of text locally exhibits a finer discourse structure called rhetorical structure, and a document can be organized into a discourse tree that captures this structure and logically binds its sentences (or clauses) together; the process of building such a tree is called discourse parsing. Although intra-sentential discourse parsing already achieves high performance, multi-sentential discourse parsing remains a major challenge in terms of both accuracy and efficiency. Machine learning techniques have proved successful in many NLP tasks, including discourse parsing.

In this thesis, we therefore try to improve the performance (e.g., accuracy, efficiency) of discourse parsing using machine learning techniques. To this end, we propose a novel two-step discourse parsing system: it first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned conditional random fields (CRFs), and then uses learned log-linear models to tag the nodes of the tree with discourse relations. We analyze different aspects of the problem (e.g., sequential vs. non-sequential models, greedy vs. optimal parsing, joint vs. separate models) and discuss their trade-offs. We also carry out extensive experiments to study the usefulness of different feature families and the effect of over-fitting.

We find that the most effective feature sets differ across tasks: part-of-speech (POS) and context features are the most effective for intra- and multi-sentential structure prediction, respectively, while ngram features are the most effective for both intra- and multi-sentential relation labeling. Moreover, over-fitting does occur in our experiments, so proper regularization is needed. Our final results show that the system achieves state-of-the-art F-scores of 86.2, 72.2, and 59.2 on structure, nuclearity, and relation, respectively, and that it is more efficient than Joty's parser (about 40 times faster in training and 3 times faster at test time).
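To make the first step concrete, here is a minimal sketch (not the thesis implementation) of optimal probabilistic parsing over a sequence of elementary discourse units (EDUs). The scoring hook merge_prob(i, k, j) is hypothetical, standing in for the merge probabilities the thesis infers from learned CRFs; a CKY-style dynamic program then finds the binary tree maximizing the product of merge probabilities, in contrast to greedy bottom-up merging.

```python
# Sketch of optimal probabilistic parsing over EDUs, assuming a
# hypothetical merge_prob(i, k, j) giving the (strictly positive)
# probability that spans [i, k) and [k, j) form one subtree.
# Maximizing the product of probabilities is done as a sum of logs.

from math import log

def optimal_parse(units, merge_prob):
    """units: list of EDUs; returns the highest-scoring binary tree."""
    n = len(units)
    # best[(i, j)] = (best log-score for units[i:j], best split point)
    best = {(i, i + 1): (0.0, None) for i in range(n)}  # leaves
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            best[(i, j)] = max(
                (best[(i, k)][0] + best[(k, j)][0]
                 + log(merge_prob(i, k, j)), k)
                for k in range(i + 1, j)
            )
    return build(best, 0, n, units)

def build(best, i, j, units):
    """Recover the tree from the stored split points."""
    _, split = best[(i, j)]
    if split is None:
        return units[i]
    return (build(best, i, split, units), build(best, split, j, units))
```

This illustrates the greedy-versus-optimal trade-off the thesis discusses: greedy parsing performs only a linear number of merges but can commit to early mistakes, whereas the dynamic program costs cubic time in the number of EDUs but is globally optimal with respect to the given probabilities.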


Rights

Attribution-NonCommercial-NoDerivs 2.5 Canada