Open Collections

UBC Theses and Dissertations

A data-driven ensemble framework for modeling high-dimensional data : theory, methods, algorithms, and applications

Christidis, Anthony-Alexander

Abstract

Sparse and ensemble methods are the two main approaches in the statistical literature for modeling high-dimensional data. On the one hand, sparse methods yield a single predictive model that is generally interpretable and possesses desirable theoretical properties. On the other hand, multi-model ensemble methods can generally achieve superior prediction accuracy, but current ensemble methodology relies on randomization or boosting to generate diverse models, which results in uninterpretable ensembles. In this dissertation, we introduce a new data-driven ensemble framework that combines ideas from sparse modeling and ensemble modeling. We search for optimal ways to select and split the candidate predictors into subsets for the different models that will be combined in an ensemble. Each model in the ensemble provides an alternative explanation for the relationship between the predictor variables and the response variable. The degrees of sparsity of the individual models and diversity among the models are both driven by the data. The task of optimally splitting the candidate predictors into subsets results in a computationally intractable combinatorial optimization problem when the number of predictors is large. We demonstrate the potential of an exhaustive search for the optimal split of the predictors into the different models of an ensemble on specifically designed low-dimensional data which mimic the typical behavior of high-dimensional data. We then propose different computational approaches to the optimal split selection problem. We introduce a multiconvex relaxation in the regression case and develop efficient algorithms to compute solutions for any level of sparsity and diversity. We show that the resulting ensembles yield consistent predictions and consistent individual models, and provide empirical evidence that this method outperforms state-of-the-art sparse and ensemble methods for high-dimensional prediction tasks using simulated data and a chemometrics application. We then extend the methodology, theory, and algorithms to classification ensembles, and investigate the performance of the method on simulated data and a large collection of gene expression datasets. Finally, we propose a direct computational approach to calculate approximate solutions to the optimal split selection problem in the regression case and benchmark the performance of the method on gene expression data.
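To make the idea of an exhaustive search over predictor splits concrete, the following is a minimal sketch and not the dissertation's method. It assumes ordinary least squares for each model, a simple average of the models' predictions, and validation mean squared error as the selection criterion; the function names (`fit_ols`, `predict_ols`, `best_split`) and the simulated data are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return coef[0] + X @ coef[1:]


def best_split(X_train, y_train, X_val, y_val, n_models=2):
    """Try every assignment of predictors to disjoint models.

    Label 0 means "predictor unused"; labels 1..n_models pick a model.
    The selection criterion (validation MSE of the averaged prediction)
    is an assumption made for this sketch.
    """
    p = X_train.shape[1]
    best_mse, best_groups = np.inf, None
    for labels in itertools.product(range(n_models + 1), repeat=p):
        groups = [[j for j in range(p) if labels[j] == g]
                  for g in range(1, n_models + 1)]
        if any(len(g) == 0 for g in groups):
            continue  # every model needs at least one predictor
        preds = [predict_ols(fit_ols(X_train[:, g], y_train), X_val[:, g])
                 for g in groups]
        mse = float(np.mean((y_val - np.mean(preds, axis=0)) ** 2))
        if mse < best_mse:
            best_mse, best_groups = mse, groups
    return best_mse, best_groups


# Small simulated example (hypothetical data): 4 predictors, 2 models.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200)
mse, groups = best_split(X[:100], y[:100], X[100:], y[100:], n_models=2)
print(groups, mse)
```

With p predictors and G models the loop visits (G + 1)^p assignments, which is why a search of this kind is only feasible on low-dimensional data.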

Item Metadata

Title	A data-driven ensemble framework for modeling high-dimensional data : theory, methods, algorithms, and applications
Creator	Christidis, Anthony-Alexander
Supervisor	Zamar, Ruben H.; Van Aelst, Stefan
Publisher	University of British Columbia
Date Issued	2022
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2022-10-21
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0421407
URI	http://hdl.handle.net/2429/82954
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2022-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Item Media

ubc_2022_november_christidis_anthonyalexander.pdf -- 29.62MB

ubc_2022_november_christidis_anthonyalexander_supp.pdf -- 251.7kB
