UBC Theses and Dissertations
Boosting for regression problems with complex data
Ju, Xiaomeng
Boosting is a highly flexible and powerful approach to prediction in non-parametric settings. By constructing an estimator as a combination of “base learners”, it can achieve high prediction accuracy and scale to data with many explanatory variables. Despite the popularity and practical success of boosting algorithms, little attention has been paid to their generalization to “complex data”, such as data with outliers or functional variables. For such data, we develop new boosting algorithms that fit within the framework of gradient boosting machines (GBM). We illustrate our findings on simulated and real datasets, and we provide openly available R packages implementing our proposals.
For data contaminated with outliers, we propose a two-stage boosting algorithm analogous to robust linear MM-regression: it first minimizes a robust residual scale estimator and then improves on the resulting fit by optimizing a bounded loss function. Unlike previous robust boosting proposals, this approach does not require computing an ad hoc residual scale estimator in each boosting iteration. We address the initialization of our boosting algorithm and provide a permutation-based procedure to robustly measure the importance of each variable.
For data containing functional predictors, we propose a boosting algorithm that uses tree base learners constructed with multiple projections. Our proposal incorporates possible interactions between indices, making it capable of approximating complex regression functions. In addition, our estimator is built from relatively simple regression trees, which are notably easier to compute than the multi-dimensional kernel smoothers used in other proposals.
Finally, we extend the proposal above to robust functional regression in the presence of outliers, which may appear in the measurements of the response, the functional predictors, or both. We explore robust boosting algorithms derived from M-estimators and from MM-estimators, and make suggestions on which method to use based on the type of contamination and the computing budget.
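The bounded-loss idea for outlier-contaminated data can be illustrated with a small sketch. This is not the thesis's algorithm or its R packages. It is a simplified stand-in that makes several assumptions: only the second (bounded-loss) stage is shown; the residual scale is a normalized MAD computed once from a median fit, in place of the scale that the first stage would supply; the base learners are one-split regression stumps on a single predictor; the step size is fixed; and the Tukey bisquare tuning constant is the conventional 4.685. All function names (`tukey_psi`, `fit_stump`, `robust_boost`, `predict`) are hypothetical.

```python
import statistics

def tukey_psi(u, c=4.685):
    """Derivative of Tukey's bisquare loss; exactly zero once |u| exceeds c."""
    if abs(u) > c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2

def fit_stump(x, g):
    """Least-squares stump on one feature (assumes at least two distinct x values)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between tied x values
        left = [g[i] for i in order[:k]]
        right = [g[i] for i in order[k:]]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2.0, ml, mr)
    return best[1:]  # (split point, left leaf value, right leaf value)

def robust_boost(x, y, n_iter=100, shrinkage=0.1):
    """Gradient boosting on a bounded loss with a residual scale fixed up front."""
    init = statistics.median(y)
    # Assumption: normalized MAD stands in for a first-stage robust scale estimate.
    scale = 1.4826 * statistics.median([abs(v - init) for v in y])
    if scale == 0:
        raise ValueError("residual scale is zero")
    fitted = [init] * len(y)
    stumps = []
    for _ in range(n_iter):
        # Negative gradient of the bisquare loss, rescaled to the units of y.
        # The scale is not re-estimated inside the loop.
        g = [scale * tukey_psi((yi - fi) / scale) for yi, fi in zip(y, fitted)]
        split, left, right = fit_stump(x, g)
        stumps.append((split, left, right))
        fitted = [fi + shrinkage * (left if xi <= split else right)
                  for xi, fi in zip(x, fitted)]
    return init, shrinkage, stumps

def predict(model, xi):
    init, shrinkage, stumps = model
    return init + shrinkage * sum(l if xi <= s else r for s, l, r in stumps)

# Toy data: a step function with three gross outliers in the response.
x = [i / 40 for i in range(40)]
y = [0.0 if xi < 0.5 else 1.0 for xi in x]
for i in (5, 15, 25):
    y[i] = 50.0
model = robust_boost(x, y)
```

On this toy data the outliers' standardized residuals fall outside the bisquare cutoff, so they contribute zero gradient. As a result, `predict(model, 0.1)` and `predict(model, 0.9)` stay close to the clean levels 0 and 1. The sketch also shows why initialization matters for bounded losses: a poor starting fit would put clean points beyond the cutoff too.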
Attribution-NonCommercial-NoDerivatives 4.0 International