UBC Theses and Dissertations

Better document-level natural language understanding through data-driven applications of discourse theories

Huber, Patrick


A discourse constitutes a locally and globally coherent text in which words, clauses and sentences are not solely a sequence of independent statements, but follow a hidden structure, encoding the author's underlying communicative goal(s). As such, the meaning of a discourse as a whole goes beyond the meaning of its individual parts, guided by the latent semantic and pragmatic relationships holding between parts of the document. Clearly falling into the area of Natural Language Understanding (NLU), discourse analysis augments textual inputs with structured representations following linguistic formalisms and frameworks. Annotating documents following these elaborate formalisms has led to the computationally inspired research area of discourse parsing, aiming to generate robust and general discourse annotations for arbitrary documents through automated approaches. With computational discourse parsers having great success at inferring valuable structures and supporting prominent real-world tasks such as sentiment analysis, text classification, and summarization, discourse parsing has been established as a valuable source of structured information. However, a significant limitation preventing the broader application of discourse-inspired approaches, especially in the context of modern deep-learning models, is the lack of available gold-standard data, caused by the tedious and expensive human annotation process. To overcome the prevalent data sparsity issue in the areas of discourse analysis and discourse parsing, it is imperative to find new methods that generate large-scale and high-quality discourse annotations without relying on the restrictive human annotation process. Along these lines, we present a set of novel computational approaches to (partially) overcome the data sparsity issue by proposing distantly and self-supervised methods to automatically generate large-scale, high-quality discourse annotations in a data-driven manner.
In this thesis, we provide detailed insights into our technical contributions and diverse evaluations. Specifically, we show the competitive and complementary nature of our discourse inference approaches to human-annotated discourse information, partially outperforming gold-standard discourse structures on the important task of "inter-domain" discourse parsing. We further elaborate on our generated discourse annotations with regard to their ability to support linguistic theories and downstream tasks, finding that they have direct applications in linguistics and Natural Language Processing (NLP).

Attribution-NonCommercial-NoDerivatives 4.0 International