Open Collections
UBC Theses and Dissertations
Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization
Liu, Qilin
Abstract
This thesis investigates literary characterization through computational literary studies by introducing a novel analytical unit termed “Character Sentences” (CS): units that explicitly provide descriptive or action-related information about literary characters. The thesis makes three primary contributions: (1) a narratological definition of, and criteria for, character sentences; (2) two gold-standard datasets, HPCS_clause-level and HPCS_full-sentence (Harry Potter Character Sentence, annotated at the clause level and at the full-sentence level), meticulously annotated from the Harry Potter series and designed to benchmark automated character sentence extraction; and (3) a natural language processing (NLP) pipeline that integrates large language models (LLMs) to automatically identify character sentences and accurately attribute them to the corresponding characters, suitable for texts of any length. The gold-standard datasets show high inter-annotator agreement, with Krippendorff’s α exceeding 0.80 for both (α = 0.81 for HPCS_clause-level; α = 0.86 for HPCS_full-sentence). The proposed NLP pipeline comprises four modules: (1) a text cleaning module; (2) a sentence segmentation module aligned with the character sentence definition; (3) a zero-shot LLM processing module that employs the LangGPT prompting framework and two-stage coreference resolution reasoning; and (4) a dependency parsing-based filter module that improves the accuracy of character attribution. Empirical evaluation on the HPCS_full-sentence dataset shows that the pipeline achieves an F1 score of 94.51% in character sentence identification and an accuracy of 84.88% in character attribution. This research is the first to explicitly define character sentences and to develop an automated, theory-informed, sentence-level approach integrated with LLMs for character sentence extraction. It addresses critical gaps in computational literary studies and underscores the efficacy of LLMs and prompt engineering within literary analysis.
The datasets and source code developed in this thesis are publicly accessible on GitHub to facilitate further research and methodological advancements in the field.
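For readers unfamiliar with the two evaluation figures quoted above, the following is a minimal sketch of how an F1 score for character sentence identification and an accuracy for character attribution are conventionally computed. It is not the thesis's evaluation code: the function names, the toy labels, and the choice to score attribution over all listed sentences are illustrative assumptions.

```python
from typing import Optional, Sequence


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 for the positive class (True = the sentence is a character sentence)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))        # true positives
    fp = sum((not g) and p for g, p in zip(gold, pred))  # false positives
    fn = sum(g and (not p) for g, p in zip(gold, pred))  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def attribution_accuracy(
    gold: Sequence[str], pred: Sequence[Optional[str]]
) -> float:
    """Share of sentences whose predicted character matches the gold character."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one sentence is required")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Toy labels, invented for illustration only (not taken from HPCS).
gold_is_cs = [True, True, False, True, False]
pred_is_cs = [True, False, False, True, True]

gold_character = ["Harry", "Hermione", "Ron"]
pred_character = ["Harry", "Hermione", "Harry"]

identification_f1 = f1_score(gold_is_cs, pred_is_cs)
attribution_acc = attribution_accuracy(gold_character, pred_character)
print(f"F1 = {identification_f1:.4f}, accuracy = {attribution_acc:.4f}")
```

In this toy example both metrics equal 2/3; the thesis's reported values (94.51% and 84.88%) come from its own evaluation on HPCS_full-sentence, whose exact protocol is not described on this page.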
Item Metadata
Title | Character sentence as the quantitative metrics : leveraging large language models to measure literary characterization
Creator | Liu, Qilin
Publisher | University of British Columbia
Date Issued | 2025
Language | eng
Date Available | 2025-09-02
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0449994
Degree Grantor | University of British Columbia
Graduation Date | 2025-09
Scholarly Level | Graduate
Aggregated Source Repository | DSpace