UBC Theses and Dissertations
Automatic glossing for Northern Interior languages featuring cross-lingual enhancements
Stacey, Anna
Abstract
This project explores incorporating cross-lingual transfer into state-of-the-art approaches for the automatic glossing of low-resource languages, with the goal of accelerating the language documentation process. The languages involved are the three Northern Interior members of the Salish family: St’át’imcets, nɬeʔkepmxcín, and Secwepemctsín, spoken in the Pacific Northwest. There is an urgent need to document the world’s languages to support revitalization, and recent advances in machine learning may help accelerate that process. However, these systems typically depend on large amounts of training data, which are rarely available for the languages that would benefit most from these tools. This project therefore implements and evaluates multilingual training that additionally learns from a closely related language, plus several techniques for further potential enhancement: oversampling, fine-tuning, data augmentation, and language-labelling. First, glossed datasets are gathered in each language and checked for formatting consistency. Second, the glossing system is developed using two models: a transformer (Vaswani et al., 2017) for segmentation and conditional random fields (CRFs) (Lafferty, McCallum & Pereira, 2001) for glossing. The two models are applied in succession to form a pipeline from transcription to segmentation to gloss. Finally, evaluation metrics are carefully selected. For St’át’imcets, which has the most plentiful data (976 sentences gathered), monolingual training yielded a relatively successful glossing model (76.58% pipeline word-level accuracy). For nɬeʔkepmxcín, given its small dataset (335 sentences), training drew on its close linguistic ties with St’át’imcets: a multilingual combination of nɬeʔkepmxcín and St’át’imcets data improved performance over monolingual training on nɬeʔkepmxcín data alone, across almost all metrics. The additional enhancements to multilingual training did not lead to clear improvements, but they suggested ideas for better implementations. For Secwepemctsín, with the least glossed data available (101 sentences), a ‘zero-shot’ model was trained on St’át’imcets and nɬeʔkepmxcín data only, reserving all in-language data for the test set. Though the results were unsurprisingly poor, this model segmented words unseen during training (‘out-of-vocabulary’) as well as the models tested on St’át’imcets and nɬeʔkepmxcín (all near 50% segmentation word-level accuracy), showcasing the utility of cross-lingual learning.
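The two-stage architecture described above (a segmentation model followed by a glossing model, run in succession) can be sketched in outline. This is a minimal, hypothetical Python sketch, not the thesis code: the toy suffix rule and gloss lexicon are invented for illustration, standing in for the trained transformer segmenter and CRF glosser.

```python
# Minimal sketch of a transcription -> segmentation -> gloss pipeline.
# The segmenter and glosser below are placeholder stand-ins for the
# trained transformer and CRF models described in the abstract.

from typing import List

# Hypothetical toy lexicon mapping morphemes to gloss labels.
GLOSS_LEXICON = {
    "walk": "walk",
    "s": "3SG",
    "ed": "PST",
}

def segment(word: str) -> List[str]:
    """Placeholder segmenter: greedily strips known suffixes.

    In the real pipeline this step is a neural model that inserts
    morpheme boundaries into each word.
    """
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and word != suffix:
            return [word[: -len(suffix)], suffix]
    return [word]

def gloss(morphemes: List[str]) -> List[str]:
    """Placeholder glosser: per-morpheme lexicon lookup.

    In the real pipeline this step is a CRF that labels each
    morpheme in context.
    """
    return [GLOSS_LEXICON.get(m, "???") for m in morphemes]

def pipeline(sentence: str) -> List[str]:
    """Run segmentation then glossing, word by word."""
    glossed = []
    for word in sentence.split():
        morphemes = segment(word)
        glossed.append("-".join(gloss(morphemes)))
    return glossed
```

For example, `pipeline("walked walks")` returns `["walk-PST", "walk-3SG"]`; the key point is that the glosser consumes the segmenter's output, so segmentation errors propagate downstream.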
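The word-level accuracy figures quoted above can be understood as an exact-match rate over words. The sketch below is one plausible formulation, assuming predictions and references are aligned word-for-word lists of glosses; the thesis itself details the exact metric definitions used.

```python
def word_level_accuracy(predicted, reference):
    """Fraction of words whose predicted gloss exactly matches
    the reference gloss. Assumes the two sequences are aligned
    word-for-word.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned word-for-word")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```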
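The multilingual training mix with two of the listed enhancements, oversampling and language-labelling, can be illustrated with a small data-preparation sketch. This is a hedged illustration only: the tag strings (`<nle>`, `<sta>`) and the oversampling factor are invented, not the thesis's actual configuration.

```python
import random

def build_multilingual_mix(primary, donor, oversample=3, seed=0):
    """Combine a small primary-language dataset with data from a
    closely related donor language: oversample the primary data and
    prefix every example with a language label so the model can
    distinguish the two languages.
    """
    tagged_primary = ["<nle> " + s for s in primary]  # e.g. nɬeʔkepmxcín
    tagged_donor = ["<sta> " + s for s in donor]      # e.g. St’át’imcets
    mix = tagged_primary * oversample + tagged_donor
    random.Random(seed).shuffle(mix)  # deterministic shuffle for reproducibility
    return mix
```

Oversampling keeps the small primary-language dataset from being swamped by the larger donor dataset, while the language labels let a single model condition on which language each example comes from.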
Item Metadata
Title |
Automatic glossing for Northern Interior languages featuring cross-lingual enhancements
|
Creator |
Stacey, Anna
|
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2025
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2025-07-10
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0449331
|
URI | |
Degree (Theses) | |
Program (Theses) | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2025-11
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|