UBC Theses and Dissertations

Representation learning for computational sociopragmatics
Zhang, Chiyu

Abstract

Natural Language Processing (NLP) has emerged as a critical means of analyzing, manipulating, and understanding human language automatically and computationally, enabling vast amounts of language data to be processed swiftly. NLP systems take numerical matrices or vectors as input, necessitating the conversion of discrete language symbols into a continuous representation space. The quality of these continuous representations is pivotal to building successful NLP systems. With the advent of attention mechanisms, attention-based models have been adopted to learn contextual language representations by pretraining with language modeling (LM) objectives on extensive textual corpora.

Despite the proven effectiveness of attention-based pretrained language models (PLMs) in learning sequence-level representations for various NLP tasks, the integration of social aspects into representation learning remains underexplored. Recent efforts have applied PLMs to derive user-level representations, aiming to enhance the transferability and precision of content-based recommendation systems. However, challenges persist in encoding lengthy user engagement histories, capturing users' diverse interests, and generating precomputable user-level representations.

This dissertation advances language representation learning for sequence-level sociopragmatic meaning (SM) comprehension and user-level content-based recommendation. For sequence-level SM, we introduce a novel weakly supervised method for pretraining and fine-tuning language models (Chapter 2). To further improve representation quality, we propose a new contrastive learning framework for pretraining LMs (Chapter 3). We then extend our approach to the multilingual setting, presenting a unified, massively multilingual evaluation benchmark for SM (Chapter 4), alongside a comprehensive evaluation of state-of-the-art large language models on SM understanding. To address the challenges of learning user-level representations for recommendation, Chapter 5 introduces a novel framework that incorporates multiple poly-attention layers and sparse attention mechanisms, hierarchically fusing the token-level embeddings of session-based user history texts produced by a PLM.
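As an illustration of the weak supervision idea behind Chapter 2, the sketch below derives surrogate sociopragmatic labels from naturally occurring cues such as emojis. The cue inventory, label set, and helper function are hypothetical; the dissertation's actual labeling scheme is not specified in this abstract and may differ.

```python
# Hypothetical emoji-to-label mapping; illustrative only.
EMOJI_LABELS = {"😂": "joy", "😢": "sadness", "😡": "anger"}

def weak_label(text: str):
    """Return (cleaned_text, surrogate_label) if a cue is present, else None.

    The cue is removed from the text so a model cannot trivially
    read the label back off the input during pretraining.
    """
    for emoji, label in EMOJI_LABELS.items():
        if emoji in text:
            return text.replace(emoji, "").strip(), label
    return None

# Usage: build a weakly labeled corpus from raw posts.
posts = ["great news 😂", "so unfair 😡", "no cue here"]
labeled = [pair for pair in map(weak_label, posts) if pair is not None]
```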
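The contrastive framework of Chapter 3 is likewise not detailed here; as a point of reference, the following is a minimal sketch of a generic InfoNCE-style objective with in-batch negatives, a common recipe for learning sentence-level representations. Function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    anchor, positive: (batch, dim) embeddings of two views of the same
    example (e.g., two texts sharing a weak label). For each anchor,
    its matching positive is the target; every other positive in the
    batch serves as a negative.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Scaled cosine-similarity matrix: (batch, batch).
    logits = anchor @ positive.t() / temperature
    # The i-th anchor's positive sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```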
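Finally, a minimal sketch of a poly-attention pooling layer of the kind Chapter 5 builds on: a small set of learnable context codes attends over token-level PLM embeddings to produce several vectors per user, one way to capture diverse interests while keeping user representations precomputable. Class and parameter names are illustrative, and the sparse attention component is omitted.

```python
from typing import Optional

import torch
import torch.nn as nn

class PolyAttention(nn.Module):
    """k learnable context codes, each attending over token embeddings."""

    def __init__(self, dim: int, num_codes: int = 32):
        super().__init__()
        # Context codes act as learned query vectors, one per interest slot.
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)

    def forward(self, token_emb: torch.Tensor,
                attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # token_emb: (batch, seq_len, dim), e.g. PLM outputs over a
        # user's concatenated session history texts.
        scores = torch.einsum("kd,bsd->bks", self.codes, token_emb)
        if attn_mask is not None:
            # attn_mask: (batch, seq_len); 0 marks padding positions.
            scores = scores.masked_fill(attn_mask[:, None, :] == 0, -1e9)
        weights = scores.softmax(dim=-1)  # (batch, k, seq_len)
        # Weighted sums of token embeddings: (batch, k, dim).
        return torch.einsum("bks,bsd->bkd", weights, token_emb)
```

The k output vectors can be computed once per user offline and matched against candidate-item embeddings at serving time, which is one way to reconcile expressive user modeling with precomputability.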

Rights

Attribution-NoDerivatives 4.0 International