UBC Theses and Dissertations

Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization

Liu, Qilin

Abstract

This thesis investigates literary characterization through computational literary studies by introducing a novel analytical unit termed “Character Sentences” (CS): units that explicitly provide descriptive or action-related information about literary characters. The thesis presents three primary contributions: (1) a narratological definition of character sentences and criteria for identifying them; (2) two gold-standard HPCS (Harry Potter Character Sentence) datasets, HPCS_clause-level and HPCS_full-sentence, meticulously annotated from the Harry Potter series and designed to benchmark automated character sentence extraction tasks; and (3) a natural language processing (NLP) pipeline integrating large language models (LLMs) to automatically identify character sentences and accurately attribute them to the corresponding characters, suitable for texts of any length. The gold-standard datasets demonstrate high inter-annotator agreement, with Krippendorff’s α values exceeding 0.80 (α = 0.81 for HPCS_clause-level; α = 0.86 for HPCS_full-sentence). The proposed NLP pipeline comprises four modules: (1) a text cleaning module; (2) a sentence segmentation module aligned with the character sentence definition; (3) a zero-shot LLM processing module employing the LangGPT prompting framework and two-stage coreference resolution reasoning; and (4) a dependency parsing-based filter module that enhances the accuracy of character attribution. Empirical evaluations indicate that the pipeline achieves robust performance, yielding an F1 score of 94.51% in character sentence identification and an accuracy of 84.88% in character attribution on the HPCS_full-sentence dataset. This research is the first to explicitly define character sentences and to develop an automated, theory-informed, sentence-level approach integrated with LLMs for character sentence extraction. It addresses critical gaps in computational literary studies and underscores the efficacy of LLMs and prompt engineering within literary analysis.
The datasets and source code developed in this thesis are publicly accessible on GitHub to facilitate further research and methodological advancements in the field.
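
For readers unfamiliar with the two evaluation figures reported above, the following is a minimal sketch of how an F1 score for character sentence identification and an accuracy for character attribution are conventionally computed. It is not taken from the thesis code: the function names and the toy labels are hypothetical, and it assumes that identification is scored as a binary label per sentence and that attribution accuracy is the share of sentences whose predicted character matches the gold character. The thesis may define the evaluation set differently.

```python
from typing import Sequence


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Binary F1 over per-sentence 'is a character sentence' labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def attribution_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of sentences attributed to the same character as the gold label."""
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Hypothetical toy labels, for illustration only (not HPCS data).
gold_is_cs = [True, True, False, True]
pred_is_cs = [True, False, False, True]
gold_character = ["Harry", "Hermione", "Ron"]
pred_character = ["Harry", "Ron", "Ron"]

identification_f1 = f1_score(gold_is_cs, pred_is_cs)
attribution_acc = attribution_accuracy(gold_character, pred_character)
```

Under these assumptions, F1 balances missed character sentences against false alarms, while attribution accuracy only asks whether the named character is correct.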

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International