Robust linear model selection for high-dimensional datasets

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Robust linear model selection for high-dimensional datasets Khan, Md Jafar Ahmed

Abstract

This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time. We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS) proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set. The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression. We introduce the idea of "multivariate-Winsorization" which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).

Item Metadata

Title	Robust linear model selection for high-dimensional datasets
Creator	Khan, Md Jafar Ahmed
Publisher	University of British Columbia
Date Issued	2006
Description	This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time. We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS) proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set. The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression. We introduce the idea of "multivariate-Winsorization" which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2011-02-03
Provider	Vancouver : University of British Columbia Library
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI	10.14288/1.0100423
URI	http://hdl.handle.net/2429/31082
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Campus	UBCV
Scholarly Level	Graduate
Aggregated Source Repository	DSpace

Item Media

ubc_2007-267363.pdf -- 5.11MB

Item Citations and Data

Rights

For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.

Open Collections

UBC Theses and Dissertations

Robust linear model selection for high-dimensional datasets Khan, Md Jafar Ahmed

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights