UBC Theses and Dissertations

Towards Afrocentric natural language processing

Adebara, Ifeoluwanimi

Abstract

This dissertation centers on Natural Language Processing (NLP) for African languages, endeavoring to unravel the progress, challenges, and future prospects within this linguistic context. The research spans language identification, Natural Language Understanding (NLU), and Natural Language Generation (NLG), and culminates in a comprehensive case study on machine translation. The first chapter introduces the problem statement, articulates the motivation for addressing the issue, and presents the innovative solutions developed throughout this research. Chapter two discusses African languages in detail, offering insights into their genealogical classification, the linguistic landscape, and the challenges of multilingual NLP. Building upon this foundation, the third chapter advocates for an Afrocentric approach to technology development, emphasizing the significance of aligning technology with the cultural values and linguistic diversity of African communities. It addresses challenges such as data scarcity and representation bias, spotlighting community-driven initiatives aimed at advancing NLP in the region. The fourth chapter unveils AfroLID, a neural language identification tool designed for 517 African languages and language varieties, establishing itself as the new state-of-the-art solution for African language identification. Chapter five introduces SERENGETI, a massively multilingual language model tailored to support 517 African languages and language varieties. Evaluation on AfroNLU, an extensive benchmark for African NLP, showcases SERENGETI’s superior performance, thereby paving the way for transformative research and development across a diverse linguistic landscape. The sixth chapter addresses NLG challenges in African languages, presenting Cheetah, a language model designed for 517 African languages. Comprehensive evaluations underscore Cheetah’s capacity to generate contextually relevant text across various African languages.
The seventh chapter presents a case study on machine translation, focusing on Bare Nouns (BNs) translation from Yorùbá to English. This study illuminates the challenges posed by information asymmetry in machine translation and provides insights into the linguistic capabilities of Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems. Emphasizing the importance of fine-grained linguistic considerations, the study encourages further research in addressing translation challenges faced by languages with BNs, analytic languages, and low-resource languages. In chapter eight, I conclude and discuss possible directions for future work.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International