Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization

UBC Theses and Dissertations

Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization Liu, Qilin

Abstract

This thesis investigates literary characterization through computational literary studies by introducing a novel analytical unit termed “Character Sentences” (CS): units that explicitly provide descriptive or action-related information about literary characters. The thesis makes three primary contributions: (1) a narratological definition of character sentences and criteria for them; (2) two gold-standard datasets, HPCS_clause-level and HPCS_full-sentence (Harry Potter Character Sentence, at clause level and full-sentence level), annotated from the Harry Potter series and designed to benchmark automated character sentence extraction; and (3) a natural language processing (NLP) pipeline that integrates large language models (LLMs) to automatically identify character sentences and attribute them to the corresponding characters, suitable for texts of any length. The gold-standard datasets show high inter-annotator agreement, with Krippendorff’s α exceeding 0.80 in both cases (α = 0.81 for HPCS_clause-level; α = 0.86 for HPCS_full-sentence). The proposed NLP pipeline comprises four modules: (1) a text cleaning module; (2) a sentence segmentation module aligned with the character sentence definition; (3) a zero-shot LLM processing module that employs the LangGPT prompting framework and two-stage coreference resolution reasoning; and (4) a dependency parsing-based filter module that improves the accuracy of character attribution. Empirical evaluations indicate that the pipeline achieves robust performance, with an F1 score of 94.51% in character sentence identification and an accuracy of 84.88% in character attribution on the HPCS_full-sentence dataset. This research is the first to explicitly define character sentences and to develop an automated, theory-informed, sentence-level approach integrated with LLMs for character sentence extraction. It addresses critical gaps in computational literary studies and underscores the efficacy of LLMs and prompt engineering within literary analysis. The datasets and source code developed in this thesis are publicly accessible on GitHub to facilitate further research and methodological advancements in the field.
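
As a rough illustration of the four-module structure described in the abstract, the sketch below chains a cleaning step, a segmentation step, an attribution step, and a filter step. It is not the thesis code: every function name here is hypothetical, the regex segmenter is only a stand-in for the definition-aligned segmentation module, `stub_llm` is a placeholder for the zero-shot LLM call (it does no coreference resolution), and `attribution_filter` is a trivial substitute for the dependency parsing-based filter. The `f1_score` helper uses the standard F1 definition.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    sentence: str
    character: Optional[str]  # None means "not judged a character sentence"


def clean_text(raw: str) -> str:
    """Module 1 (assumed behaviour): collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def segment_sentences(text: str) -> List[str]:
    """Module 2 (stand-in): naive split after sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def stub_llm(sentence: str, characters: List[str]) -> Optional[str]:
    """Module 3 (placeholder for an LLM call): return the first listed
    character whose name appears in the sentence, else None."""
    for name in characters:
        if re.search(rf"\b{re.escape(name)}\b", sentence):
            return name
    return None


def attribution_filter(candidate: Candidate) -> bool:
    """Module 4 (placeholder for a dependency-based filter): keep a
    candidate only if its attributed name occurs in the sentence."""
    return candidate.character is not None and candidate.character in candidate.sentence


def run_pipeline(
    raw: str,
    characters: List[str],
    llm: Callable[[str, List[str]], Optional[str]] = stub_llm,
) -> List[Candidate]:
    """Run the four stages in order and return the surviving candidates."""
    sentences = segment_sentences(clean_text(raw))
    candidates = [Candidate(s, llm(s, characters)) for s in sentences]
    return [c for c in candidates if attribution_filter(c)]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    text = "Harry  opened the door.\nThe rain fell. Hermione frowned!"
    for c in run_pipeline(text, ["Harry", "Hermione"]):
        print(c.character, "|", c.sentence)
```

Passing the model call in as the `llm` parameter keeps the sketch runnable offline; a real client could be substituted without touching the other stages.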

Item Metadata

Title	Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization
Creator	Liu, Qilin
Supervisor	Yoon, Kyong, 1974-; Gabora, Liane
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-09-02
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0449994
URI	http://hdl.handle.net/2429/92164
Degree (Theses)	Master of Arts - MA
Program (Theses)	Interdisciplinary Studies - Digital Arts and Humanities
Affiliation	Creative and Critical Studies, Faculty of (Okanagan)
Degree Grantor	University of British Columbia
Graduation Date	2025-09
Campus	UBCO
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
