Anglocentric standards in universal text and speech processing

UBC Theses and Dissertations

Anglocentric standards in universal text and speech processing Samir, S M Farhan

Abstract

Even as early as the mid-20th century, researchers like Warren Weaver envisioned designing machines that could understand any language, thereby facilitating seamless communication across languages and cultures. Language technology researchers have been making strides towards this universalizing vision, facilitated by training sequence-generation models on the huge amounts of text and speech recordings that have been uploaded to the web. However, the technoutopian vision of these universal language technologies is complicated by the fact that this research is largely carried out in the context of Anglo-American funding structures and tech companies situated in the United States. Underlying much of this research are thus a number of Anglocentric positions surrounding language ideologies: beliefs about languages, what purposes they serve, and how they should be represented. Many of the standards and practices surrounding language technology scholarship are set in this hegemonic Anglo-American context, so these positions are taken as universal and objective, rather than sociogeographically positioned and normative. I interrogate these positions through three computational studies, thereby shedding light on their Anglocentric basis. First, I demonstrate that even large English webtext collections provide socially positioned, and thus partial and incomplete, representations of facts. English Wikipedia, the largest of the Wikipedia editions with 7M articles and more than 10M editors, still contains substantial data gaps when compared to smaller language editions like French and Russian. Second, turning to the landscape of universal language technologies, I argue that an Anglocentric dependence on the written word has ramifications for training so-called universal speech recognition models, leading to poor digital representations for many language varieties in multilingual transcribed speech corpora. Third, I criticize the theory-free ideal of fully data-driven science, which relies on Anglocentric conceptions of the availability of large-scale datasets for scientific inquiry. With a case study on modeling morphological patterns with the general-purpose Transformer model, I show that modeling complex distributions relying only on statistical associations demands orders of magnitude more data than a hybrid approach that incorporates theoretical linguistic insights. I conclude by discussing how we can move beyond Anglocentric metanarratives about how language technology should be developed and evaluated.

Item Metadata

Title	Anglocentric standards in universal text and speech processing
Creator	Samir, S M Farhan
Supervisor	Zhu, Jian
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-07-28
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0449514
URI	http://hdl.handle.net/2429/91709
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Linguistics
Affiliation	Arts, Faculty of; Linguistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2025-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
