UBC Theses and Dissertations
Automatic glossing for Northern Interior languages featuring cross-lingual enhancements
Stacey, Anna
Abstract
This project explores incorporating cross-lingual transfer into state-of-the-art approaches for the automatic glossing of low-resource languages, with the goal of accelerating the language documentation process. The languages involved are the three Northern Interior members of the Salish family: St’át’imcets, nɬeʔkepmxcín, and Secwepemctsín, spoken in the Pacific Northwest. There is an urgent need to document the world’s languages to support revitalization, and recent advances in machine learning may help accelerate that process. However, these systems typically depend on large amounts of training data, which are rarely available for the languages that would benefit most from these tools. This project therefore implements and evaluates multilingual training that additionally learns from a closely related language, plus several techniques for further potential enhancement: oversampling, fine-tuning, data augmentation, and language-labelling. First, glossed datasets are gathered in each language and checked for formatting consistency. Second, the glossing system is developed using two models: a transformer (Vaswani et al., 2017) for segmentation and conditional random fields (CRFs) (Lafferty, McCallum & Pereira, 2001) for glossing. The two models are applied in succession to form a pipeline from transcription to segmentation to gloss. Finally, evaluation metrics are carefully selected. For St’át’imcets, which has the most plentiful data (976 sentences gathered), monolingual training yielded a relatively successful glossing model (76.58% pipeline word-level accuracy). For nɬeʔkepmxcín, given its small dataset (335 sentences), training drew on its close linguistic ties with St’át’imcets: a multilingual combination of nɬeʔkepmxcín and St’át’imcets data improved performance over monolingual training on nɬeʔkepmxcín data alone, across almost all metrics. The additional enhancements to multilingual training did not lead to clear improvements, but they suggested ideas for better implementations. For Secwepemctsín, with the least glossed data available (101 sentences), a ‘zero-shot’ model was trained on St’át’imcets and nɬeʔkepmxcín data only, reserving all in-language data for the test set. Though the results were unsurprisingly poor, this model segmented words unseen during training (‘out-of-vocabulary’) as well as the models tested on St’át’imcets and nɬeʔkepmxcín (all near 50% segmentation word-level accuracy), showcasing the utility of cross-lingual learning.
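The two-stage architecture described above (a segmentation model followed by a glossing model, run in succession) can be sketched in outline. This is a minimal, hypothetical Python sketch, not the thesis code: the toy suffix rule and gloss lexicon are invented for illustration, standing in for the trained transformer segmenter and CRF glosser.

```python
# Minimal sketch of a transcription -> segmentation -> gloss pipeline.
# The segmenter and glosser below are placeholder stand-ins for the
# trained transformer and CRF models described in the abstract.

from typing import List

# Hypothetical toy lexicon mapping morphemes to gloss labels.
GLOSS_LEXICON = {
    "walk": "walk",
    "s": "3SG",
    "ed": "PST",
}

def segment(word: str) -> List[str]:
    """Placeholder segmenter: greedily strips known suffixes.

    In the real pipeline this step is a neural model that inserts
    morpheme boundaries into each word.
    """
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and word != suffix:
            return [word[: -len(suffix)], suffix]
    return [word]

def gloss(morphemes: List[str]) -> List[str]:
    """Placeholder glosser: per-morpheme lexicon lookup.

    In the real pipeline this step is a CRF that labels each
    morpheme in context.
    """
    return [GLOSS_LEXICON.get(m, "???") for m in morphemes]

def pipeline(sentence: str) -> List[str]:
    """Run segmentation then glossing, word by word."""
    glossed = []
    for word in sentence.split():
        morphemes = segment(word)
        glossed.append("-".join(gloss(morphemes)))
    return glossed
```

For example, `pipeline("walked walks")` returns `["walk-PST", "walk-3SG"]`; the key point is that the glosser consumes the segmenter's output, so segmentation errors propagate downstream.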
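The word-level accuracy figures quoted above can be understood as an exact-match rate over words. The sketch below is one plausible formulation, assuming predictions and references are aligned word-for-word lists of glosses; the thesis itself details the exact metric definitions used.

```python
def word_level_accuracy(predicted, reference):
    """Fraction of words whose predicted gloss exactly matches
    the reference gloss. Assumes the two sequences are aligned
    word-for-word.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned word-for-word")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```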
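The multilingual training mix with two of the listed enhancements, oversampling and language-labelling, can be illustrated with a small data-preparation sketch. This is a hedged illustration only: the tag strings (`<nle>`, `<sta>`) and the oversampling factor are invented, not the thesis's actual configuration.

```python
import random

def build_multilingual_mix(primary, donor, oversample=3, seed=0):
    """Combine a small primary-language dataset with data from a
    closely related donor language: oversample the primary data and
    prefix every example with a language label so the model can
    distinguish the two languages.
    """
    tagged_primary = ["<nle> " + s for s in primary]  # e.g. nɬeʔkepmxcín
    tagged_donor = ["<sta> " + s for s in donor]      # e.g. St’át’imcets
    mix = tagged_primary * oversample + tagged_donor
    random.Random(seed).shuffle(mix)  # deterministic shuffle for reproducibility
    return mix
```

Oversampling keeps the small primary-language dataset from being swamped by the larger donor dataset, while the language labels let a single model condition on which language each example comes from.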
Item Metadata
Title |
Automatic glossing for Northern Interior languages featuring cross-lingual enhancements
|
Creator |
Stacey, Anna
|
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2025
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2025-07-10
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0449331
|
URI | |
Degree (Theses) | |
Program (Theses) | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2025-11
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|