Building resilient ML applications using ensembles against faulty training data

Chan, Abraham

Abstract

Supervised machine learning (ML) relies on large datasets, which are prone to faults such as mislabelling, deletion, or repetition. Even popular open datasets like ImageNet contain faults. Our experiments on four multi-class classification datasets show that when training data is corrupted, even highly accurate ML models misclassify test inputs. This thesis investigates techniques against training data faults and identifies ensembles (multiple independently trained ML models combined with simple majority voting) as the most effective. We propose two novel ensemble-based solutions that allow ML applications to tolerate faulty training data while minimizing practitioner effort. (1) We find that ensembles are the most resilient of the existing techniques against mislabelled training data, even in safety-critical domains such as autonomous vehicles and healthcare, and that they require minimal practitioner effort. (2) We find that ensembles are the most generalizable fault tolerance technique, even in problems beyond multi-class classification, such as object detection. Ensembles deliver resilience within acceptable runtime overheads in safety-critical applications. (3) We find that ensembles have a higher resilience to faulty training data than individual models, especially when their constituent models are architecturally diverse. Despite their effectiveness, ensembles face adoption challenges in real-world safety-critical systems. First, there are many ways to construct diverse ensembles, resulting in an exponential factorial search space. How can one systematically build resilient ensembles against faulty training data? Second, ensembles can misclassify test inputs when incorrect models outvote correct ones. Can we reduce incorrect predictions by ensembles during deployment? To address these challenges, this thesis presents two solutions. (1) We present D-semble, a framework that returns the most resilient ensembles within a time budget. D-semble encodes ensemble search as an evolutionary search problem, using diversity as a heuristic. (2) We present ReMlX, a framework to reduce ensemble misclassifications at inference. ReMlX leverages the feature space of ML models, extracted with explainable AI, to maximize diversity in ensembles. In summary, this thesis advances the development of resilient ML systems against faulty training data. By developing comprehensive solutions, this work enables ensembles to be deployed with minimal effort in real-world safety-critical systems.
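
The abstract describes an ensemble as several independently trained models whose predictions are combined by simple majority voting. The following is a minimal sketch of that voting step only, not the thesis's implementation. The function names `majority_vote` and `ensemble_predict`, the integer class labels, and the tie-breaking rule (the earliest model's vote wins a tie) are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label that received the most votes.

    Assumption for this sketch: ties are broken in favour of the label
    voted for by the earliest model in the sequence.
    Raises ValueError if `votes` is empty.
    """
    counts = Counter(votes)
    top = max(counts.values())  # ValueError on an empty sequence
    for label in votes:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


def ensemble_predict(per_model_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine predictions from several models, one input at a time.

    per_model_predictions[m][i] is the label that model m assigns to input i.
    """
    # zip(*...) regroups the labels so each tuple holds all votes for one input.
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Three hypothetical models, three test inputs.
    model_outputs = [
        [0, 1, 2],  # model A
        [0, 1, 1],  # model B
        [2, 1, 1],  # model C
    ]
    print(ensemble_predict(model_outputs))
```

In this sketch, a single model that learned a wrong label for an input is outvoted as long as most other models predict correctly. The second challenge named in the abstract is the reverse case, where incorrect models form the majority.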

Item Metadata

Title	Building resilient ML applications using ensembles against faulty training data
Creator	Chan, Abraham
Supervisor	Pattabiraman, Karthik; Gopalakrishnan, Sathish
Publisher	University of British Columbia
Date Issued	2026
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2026-04-14
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0451918
URI	http://hdl.handle.net/2429/94021
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2026-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
