UBC Theses and Dissertations

Robust estimation and variable selection in high-dimensional linear regression models

Kepplinger, David

Abstract

Linear regression models are commonly used statistical models for predicting a response from a set of predictors. Technological advances allow for simultaneous collection of many predictors, but often only a small number of these are relevant for prediction. Identifying this set of predictors in high-dimensional linear regression models with emphasis on accurate prediction is thus a common goal of quantitative data analyses. While a large number of predictors promises to capture as much information as possible, it bears a risk of containing contaminated values. If not handled properly, contamination can affect statistical analyses and lead to spurious scientific discoveries, jeopardizing the generalizability of findings. In this dissertation I propose robust regularized estimators for sparse linear regression with reliable prediction and variable selection performance in the presence of contamination in the response and one or more predictors. I present theoretical and extensive empirical results underscoring that the penalized elastic net S-estimator is robust towards aberrant contamination and leads to better predictions for heavy-tailed error distributions than competing estimators. Especially in these more challenging scenarios, competing robust methods reliant on an auxiliary estimate of the residual scale are more affected by contamination due to the high finite-sample bias introduced by regularization. For improved variable selection I propose the adaptive penalized elastic net S-estimator. I show that this estimator identifies the truly irrelevant predictors with high probability as sample size increases and estimates the parameters of the truly relevant predictors as accurately as if these relevant predictors were known in advance. For practical applications, robustness of variable selection is essential. This is highlighted by a case study for identifying proteins to predict stenosis of heart vessels, a sign of complication after cardiac transplantation.
High robustness comes at the price of more taxing computations. I present optimized algorithms and heuristics for feasible computation of the estimates in a wide range of applications. With the software made publicly available, the proposed estimators are viable alternatives to non-robust methods, supporting discovery of generalizable scientific results.

Rights

Attribution-ShareAlike 4.0 International