UBC Theses and Dissertations

Anglocentric standards in universal text and speech processing

Samir, S M Farhan

Abstract

As early as the mid-20th century, researchers like Warren Weaver envisioned designing machines that could understand any language, thereby facilitating seamless communication across languages and cultures. Language technology researchers have been making strides towards this universalizing vision, facilitated by training sequence-generation models on the vast amounts of text and speech recordings that have been uploaded to the web. However, the techno-utopian vision of these universal language technologies is complicated by the fact that this research is largely carried out in the context of Anglo-American funding structures and tech companies situated in the United States. Underlying much of this research are thus a number of Anglocentric positions on language ideologies: beliefs about languages, what purposes they serve, and how they should be represented. Many of the standards and practices surrounding language technology scholarship are set in this hegemonic Anglo-American context, so these positions are taken as universal and objective, rather than sociogeographically positioned and normative. I interrogate these positions through three computational studies, thereby shedding light on their Anglocentric basis. First, I demonstrate that even large English webtext collections provide socially positioned, and thus partial and incomplete, representations of facts. English Wikipedia, the largest of the Wikipedia editions with 7M articles and more than 10M editors, still contains substantial data gaps when compared to smaller language editions like French and Russian. Second, turning to the landscape of universal language technologies, I argue that an Anglocentric dependence on the written word has ramifications for training so-called universal speech recognition models, leading to poor digital representations for many language varieties in multilingual transcribed speech corpora. Third, I criticize the theory-free ideal of fully data-driven science, which relies on Anglocentric conceptions of the availability of large-scale datasets for scientific inquiry. With a case study on modeling morphological patterns with the general-purpose Transformer model, I show that modeling complex distributions using only statistical associations demands orders of magnitude more data than a hybrid approach that incorporates theoretical linguistic insights. I conclude with discussions of how we can move beyond Anglocentric metanarratives about how language technology should be developed and evaluated.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International