Open Collections
UBC Theses and Dissertations
Building resilient ML applications using ensembles against faulty training data
Chan, Abraham
Abstract
Supervised machine learning (ML) relies on large datasets, which are prone to faults like mislabelling, deletion, or repetition.
Even popular open datasets like ImageNet contain faults.
Our experiments on four multi-class classification datasets show that when training data is corrupted, even highly accurate ML models misclassify test inputs.
This thesis investigates techniques against training data faults and identifies ensembles (multiple independently trained ML models combined with simple majority voting) as the most effective.
We propose two novel ensemble-based solutions, allowing ML applications to tolerate faulty training data, while minimizing practitioner effort.
(1) We find that ensembles are the most resilient of the existing techniques against mislabelled training data, even in safety-critical domains such as autonomous vehicles and healthcare, and that they require minimal practitioner effort.
(2) We find that ensembles are the most generalizable fault tolerance technique, even in problems beyond multi-class classification, such as object detection. Ensembles deliver resilience within acceptable runtime overheads in safety-critical applications.
(3) We find that ensembles have a higher resilience to faulty training data than individual models, especially when using ensembles with architecturally diverse constituent models.
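The simple majority voting mentioned above can be illustrated with a minimal sketch. The model outputs, the label names, and the tie-breaking rule (first-seen label wins) are assumptions made for illustration only; the abstract does not specify how the thesis resolves ties or how the models are trained.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models.

    `predictions` holds one predicted label per model for a single input.
    Ties go to the label that appears first (an assumption of this sketch).
    """
    counts = Counter(predictions)
    best_count = max(counts.values())
    for label in predictions:
        if counts[label] == best_count:
            return label


# Hypothetical per-model predictions for three test inputs.
model_outputs = [
    ["cat", "dog", "bird"],  # model A
    ["cat", "dog", "dog"],   # model B
    ["dog", "dog", "bird"],  # model C
]

# Transpose so each tuple holds every model's vote for one input.
ensemble_predictions = [majority_vote(list(votes)) for votes in zip(*model_outputs)]
print(ensemble_predictions)
```

In this toy example each input has at least two agreeing models, so a single dissenting model is outvoted. If two of the three models were wrong in the same way, the vote would be wrong as well.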
Despite their effectiveness, ensembles face adoption challenges in real-world safety-critical systems.
First, there are many ways to construct diverse ensembles, resulting in an exponential factorial search space.
How can one systematically build resilient ensembles against faulty training data?
Second, ensembles can misclassify test inputs when incorrect models outvote correct ones.
Can we reduce incorrect predictions by ensembles during deployment?
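To give a rough sense of the first challenge, the sketch below counts how many distinct ensembles can be formed from a pool of candidate models. It counts only unordered subsets of at least two models, which is an assumption of this sketch; the thesis may define its search space differently (for example, by also varying architectures or training configurations).

```python
from math import comb


def count_ensembles(num_models, min_size=2):
    """Count unordered subsets containing at least `min_size` models."""
    return sum(comb(num_models, k) for k in range(min_size, num_models + 1))


for n in (5, 10, 20):
    print(n, count_ensembles(n))
```

Even under this simplified counting, the number of candidate ensembles roughly doubles with each additional model, so exhaustive evaluation quickly becomes impractical.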
Thus, this thesis presents two solutions.
(1) We present D-semble, a framework that returns the most resilient ensembles within a time budget. D-semble encodes ensemble search into an evolutionary search problem, while using diversity as a heuristic.
(2) We present ReMlX, a framework to reduce ensemble misclassifications at inference. ReMlX leverages the feature space of ML models, extracted with explainable AI, to maximize diversity in ensembles.
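Both solutions rely on a notion of ensemble diversity, which the abstract does not define. As a hedged illustration, the sketch below computes mean pairwise disagreement, a common prediction-level diversity measure. It is not the metric used by D-semble or ReMlX (ReMlX is described as working on feature spaces extracted with explainable AI), and the function names and sample predictions are hypothetical.

```python
from itertools import combinations


def disagreement(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)


def mean_pairwise_disagreement(all_preds):
    """Average disagreement over every pair of models in an ensemble."""
    pairs = list(combinations(all_preds, 2))
    if not pairs:
        raise ValueError("need at least two models")
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical class predictions from three models on four inputs.
preds = [
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [0, 2, 1, 2],
]
score = mean_pairwise_disagreement(preds)
print(round(score, 3))
```

A score of 0 means all models always agree, while higher values mean the models err or decide differently more often, which is the property a diversity-guided search would try to reward.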
In summary, this thesis advances the development of resilient ML systems against faulty training data.
By developing comprehensive solutions, this work enables ensembles to be deployed with minimal effort in real-world safety-critical systems.
Item Metadata
| Title | Building resilient ML applications using ensembles against faulty training data |
| Creator | Chan, Abraham |
| Supervisor | |
| Publisher | University of British Columbia |
| Date Issued | 2026 |
| Genre | |
| Type | |
| Language | eng |
| Date Available | 2026-04-14 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451918 |
| URI | |
| Degree (Theses) | |
| Program (Theses) | |
| Affiliation | |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2026-05 |
| Campus | |
| Scholarly Level | Graduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |