UBC Theses and Dissertations
Robust linear model selection for high-dimensional datasets Khan, Md Jafar Ahmed
This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time. We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS) proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set. The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression. We introduce the idea of "multivariate-Winsorization" which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).
Item Citations and Data