Towards Afrocentric natural language processing

UBC Theses and Dissertations

Towards Afrocentric natural language processing Adebara, Ifeoluwanimi

Abstract

This dissertation examines Natural Language Processing (NLP) for African languages, surveying its progress, challenges, and future prospects. The research covers language identification, Natural Language Understanding (NLU), and Natural Language Generation (NLG), and culminates in a case study on machine translation. The first chapter introduces the problem statement, explains the motivation for addressing it, and presents the solutions developed in this research. Chapter two describes African languages in detail, covering their genealogical classification, the linguistic landscape, and the challenges of multilingual NLP. Building on this foundation, the third chapter advocates an Afrocentric approach to technology development, emphasizing the importance of aligning technology with the cultural values and linguistic diversity of African communities. It addresses challenges such as data scarcity and representation bias, and highlights community-driven initiatives aimed at advancing NLP in the region. The fourth chapter presents AfroLID, a neural language identification tool for 517 African languages and language varieties that sets a new state of the art for African language identification. Chapter five introduces SERENGETI, a massively multilingual language model that supports 517 African languages and language varieties. Evaluation on AfroNLU, an extensive benchmark for African NLP, shows SERENGETI’s superior performance, paving the way for further research and development across a diverse linguistic landscape. The sixth chapter addresses NLG challenges in African languages, presenting Cheetah, a language model designed for 517 African languages. Comprehensive evaluations show that Cheetah can generate contextually relevant text across various African languages.
The seventh chapter presents a case study on machine translation, focusing on the translation of Bare Nouns (BNs) from Yorùbá to English. This study highlights the challenges posed by information asymmetry in machine translation and provides insights into the linguistic capabilities of Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems. Emphasizing the importance of fine-grained linguistic considerations, the study encourages further research on the translation challenges faced by languages with BNs, analytic languages, and low-resource languages. In chapter eight, I conclude and discuss possible directions for future work.

Item Metadata

Title	Towards Afrocentric natural language processing
Creator	Adebara, Ifeoluwanimi
Supervisor	Abdul-Mageed, Muhammad
Publisher	University of British Columbia
Date Issued	2024
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2024-02-27
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0440415
URI	http://hdl.handle.net/2429/87483
Degree	Doctor of Philosophy - PhD
Program	Linguistics
Affiliation	Arts, Faculty of; Linguistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2024-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
