Better document-level natural language understanding through data-driven applications of discourse theories

UBC Theses and Dissertations

Huber, Patrick

Abstract

A discourse constitutes a locally and globally coherent text in which words, clauses and sentences are not solely a sequence of independent statements, but follow a hidden structure, encoding the author's underlying communicative goal(s). As such, the meaning of a discourse as a whole goes beyond the meaning of its individual parts, guided by the latent semantic and pragmatic relationships holding between parts of the document. Clearly falling into the area of Natural Language Understanding (NLU), discourse analysis augments textual inputs with structured representations following linguistic formalisms and frameworks. Annotating documents following these elaborate formalisms has led to the computationally inspired research area of discourse parsing, aiming to generate robust and general discourse annotations for arbitrary documents through automated approaches. With computational discourse parsers having great success at inferring valuable structures and supporting prominent real-world tasks such as sentiment analysis, text classification, and summarization, discourse parsing has been established as a valuable source of structured information. However, a significant limitation preventing the broader application of discourse-inspired approaches, especially in the context of modern deep-learning models, is the lack of available gold-standard data, caused by the tedious and expensive human annotation process. To overcome the prevalent data sparsity issue in the areas of discourse analysis and discourse parsing, it is imperative to find new methods to generate large-scale and high-quality discourse annotations, not relying on the restrictive human annotation process. Along these lines, we present a set of novel computational approaches to (partially) overcome the data sparsity issue by proposing distantly and self-supervised methods to automatically generate large-scale, high-quality discourse annotations in a data-driven manner. 
In this thesis, we provide detailed insights into our technical contributions and diverse evaluations. Specifically, we show the competitive and complementary nature of our discourse inference approaches to human-annotated discourse information, partially outperforming gold-standard discourse structures on the important task of "inter-domain" discourse parsing. We further elaborate on our generated discourse annotations with regard to their ability to support linguistic theories and downstream tasks, finding that they have direct applications in linguistics and Natural Language Processing (NLP).

Item Metadata

Title	Better document-level natural language understanding through data-driven applications of discourse theories
Creator	Huber, Patrick
Supervisor	Carenini, Giuseppe
Publisher	University of British Columbia
Date Issued	2022
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2022-10-18
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0421296
URI	http://hdl.handle.net/2429/82885
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2022-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace