UBC Theses and Dissertations

A data-driven ensemble framework for modeling high-dimensional data: theory, methods, algorithms, and applications

Christidis, Anthony-Alexander

Abstract

Sparse and ensemble methods are the two main approaches in the statistical literature for modeling high-dimensional data. On the one hand, sparse methods yield a single predictive model that is generally interpretable and possesses desirable theoretical properties. On the other hand, multi-model ensemble methods can generally achieve superior prediction accuracy, but current ensemble methodology relies on randomization or boosting to generate diverse models, which results in uninterpretable ensembles. In this dissertation, we introduce a new data-driven ensemble framework that combines ideas from sparse modeling and ensemble modeling. We search for optimal ways to select and split the candidate predictors into subsets for the different models that will be combined in an ensemble. Each model in the ensemble provides an alternative explanation for the relationship between the predictor variables and the response variable. The degrees of sparsity of the individual models and diversity among the models are both driven by the data. The task of optimally splitting the candidate predictors into subsets results in a computationally intractable combinatorial optimization problem when the number of predictors is large. We demonstrate the potential of an exhaustive search for the optimal split of the predictors into the different models of an ensemble on specifically designed low-dimensional data that mimic the typical behavior of high-dimensional data. We propose different computational approaches to the optimal split selection problem. We introduce a multiconvex relaxation in the regression case and develop efficient algorithms to compute solutions for any level of sparsity and diversity.
We show that the resulting ensembles yield consistent predictions and consistent individual models, and provide empirical evidence that this method outperforms state-of-the-art sparse and ensemble methods for high-dimensional prediction tasks using simulated data and a chemometrics application. We then extend the methodology, theory, and algorithms to classification ensembles, and investigate the performance of the method on simulated data and a large collection of gene expression datasets. We finally propose a direct computational approach to calculate approximate solutions to the optimal split selection problem in the regression case and benchmark the performance of the method on gene expression data. Supplementary materials are available at: http://hdl.handle.net/2429/83086
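
To make the idea of an exhaustive search over predictor splits concrete, the following is a minimal illustrative sketch and not the dissertation's algorithm. It assumes ordinary least squares for each model, a simple average of the models' predictions, and a held-out validation set to score each split; the function names (`fit_ols`, `predict_ols`, `exhaustive_split`) are hypothetical. Each predictor is assigned either to exactly one model or to no model, so the number of candidate splits grows exponentially with the number of predictors, which is why this is only feasible in low dimensions.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    # Least-squares fit with an intercept column prepended (assumed model type).
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    # coef[0] is the intercept, coef[1:] are the slopes.
    return coef[0] + X @ coef[1:]

def exhaustive_split(X_train, y_train, X_val, y_val, n_models=2):
    """Try every assignment of predictors to models (label 0 means unused)."""
    p = X_train.shape[1]
    best_err, best_groups = np.inf, None
    for labels in itertools.product(range(n_models + 1), repeat=p):
        # Disjoint predictor subsets, one per model.
        groups = [[j for j in range(p) if labels[j] == g]
                  for g in range(1, n_models + 1)]
        if any(len(g) == 0 for g in groups):
            continue  # every model needs at least one predictor
        preds = [predict_ols(fit_ols(X_train[:, g], y_train), X_val[:, g])
                 for g in groups]
        # Ensemble prediction is the plain average of the individual models.
        err = float(np.mean((y_val - np.mean(preds, axis=0)) ** 2))
        if err < best_err:
            best_err, best_groups = err, groups
    return best_groups, best_err
```

With `p` predictors and `n_models` models the loop visits `(n_models + 1) ** p` assignments, so even a few dozen predictors put the search out of reach; this is the intractability that motivates the relaxations and approximate approaches described in the abstract.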

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International