Robust Linear Model Selection for High-Dimensional Datasets

by

MD JAFAR AHMED KHAN
M.Sc., The University of British Columbia, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics)

The University of British Columbia
December 2006
© Md Jafar Ahmed Khan, 2006

Abstract

This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time. We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building.

For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules.

Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS) proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set.

The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression.

We introduce the idea of "multivariate-Winsorization", which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over the univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).

Contents

Abstract
Contents
List of Tables
List of Figures
List of Notation
Acknowledgements
Dedication
1 Introduction
1.1 Motivation
1.2 Model selection strategy
1.2.1 One-step model building
1.2.2 Two-step model building
1.3 Computation of robust correlation matrices
1.4 Organization of subsequent chapters
2 One-step Model Building: Robust Forward Selection and Stepwise Procedures
2.1 Introduction
2.2 Review: classical step-by-step algorithms
2.2.1 Forward Selection (FS)
2.2.2 Stepwise (SW)
2.2.3 Backward Elimination (BE)
2.3 FS and SW Expressed in Correlations
2.3.1 FS expressed in terms of correlations
2.3.2 SW expressed in terms of correlations
2.4 Robustification of FS and SW algorithms
2.4.1 Numerical complexity of the algorithms
2.4.2 Limitation of the proposed algorithms
2.5 A simulation study
2.5.1 Model selection with Spearman's ρ and Kendall's τ
2.6 Examples
2.7 Conclusion
3 Two-step Model Building: Robust Sequencing with Least Angle Regression
3.1 Introduction
3.2 Review: Least Angle Regression (LARS)
3.2.1 Forward Stagewise procedure (Stagewise)
3.2.2 The LARS algorithm
3.2.3 LARS and Shrinkage methods
3.3 LARS expressed in terms of correlations
3.4 Robustification of LARS
3.4.1 Robust Plug-in
3.4.2 Robust Data Cleaning
3.4.3 Simulations
3.5 Size of the reduced set
3.6 Bootstrapped sequencing
3.7 Learning curves
3.8 Examples
3.9 Conclusion
3.10 Chapter Appendix
3.10.1 Determination of γ for one active covariate
3.10.2 Quantities related to the equiangular vector B_A
3.10.3 Determination of γ for two or more active covariates
4 Two-step Model Building: Robust Segmentation
4.1 Introduction
4.2 Review: classical selection criteria
4.2.1 Akaike Information Criterion (AIC)
4.2.2 Mallows' Cp
4.2.3 Final Prediction Error (FPE)
4.2.4 Cross-validation
4.2.5 Bootstrap
4.3 Review: robust selection criteria
4.3.1 Robust AIC
4.3.2 Robust Cp
4.3.3 Robust FPE
4.4 Robust cross-validation
4.4.1 Dealing with numerical complexity
4.5 Robust bootstrap
4.6 Simulation study
4.6.1 Robustness of the estimates
4.6.2 Final model selection
4.7 Examples
4.7.1 Demographic data
4.7.2 Protein data
4.8 Conclusion
5 Properties of Adjusted-Winsorized Correlation Estimate
5.1 Introduction
5.2 Consistency of adjusted-Winsorized estimate
5.3 Influence function of adjusted-Winsorized estimate
5.3.1 Standard error of adjusted-Winsorized estimate
5.4 Choice of c_1 and c_2 for r_W
5.5 Intrinsic bias in adjusted-Winsorized estimate
5.5.1 Achieving (approximate) Fisher-consistency for r_W
5.6 Asymptotic normality of adjusted-Winsorized estimate
5.7 Conclusion
5.8 Chapter Appendix
5.8.1 Proof of Lemma 5.2
5.8.2 Influence function: interchanging differentiation and integration
5.8.3 Asymptotic normality of the adjusted-Winsorized "covariance" estimate
5.8.4 Difficulty with the denominator of (5.2)
6 Conclusion
Bibliography

List of Tables

2.1 Performance of the classical and robust methods in clean and contaminated data for the moderate-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the number of noise variables (Noise) selected are shown.
2.2 Performance of the classical and robust methods in clean and contaminated data for the no-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the average number of noise variables (Noise) selected are shown.
3.1 Percentages of correct sequences obtained by classical and robust methods for univariate and leverage designs with 4 different error distributions.
4.1 First 10 trials: classical and robust estimates of prediction errors.
4.2 Performance of the classical and robust methods of segmentation (evaluation of all possible subsets of the reduced set).
Evaluation of the standard errors of r_W. The empirical SD and formula-based SE are close.

List of Figures

2.1 QQ-plot of the robust partial F values against the theoretical χ²₁ quantiles.
3.1 Limitation of separate univariate-Winsorizations (c = 2). The bivariate outliers are left almost unchanged.
3.2 Bivariate Winsorizations for clean and contaminated data. The ellipse for the contaminated data is only slightly larger than that for the clean data.
3.3 Adjusted-Winsorization (for initial estimate R_0) with c_1 = 2, c_2 = 1. The bivariate outliers are now shrunken to the corner of the smaller square.
3.4 Numerical complexity of different correlation estimates. Each estimate can be computed in O(n log n) time, but Maronna's estimate has a larger multiplication factor.
3.5 Numerical complexity of different techniques. LARS requires O(nd²) time. W plug-in and M plug-in both require O((n log n)d²) time, but M plug-in has a larger multiplication factor.
3.6 Recall curves for a = 9; (a) no correlation (b) low correlation (c) high correlation. The 4 curves for (robust) LARS correspond to 4 levels of contamination.
3.7 Recall curves for a = 15 and moderate correlation with 4 different levels of contamination.
3.8 Recall curves for robust LARS and bootstrapped robust LARS for covariates with moderate correlation; (a) a = 9 (b) a = 15. The 4 curves for each method correspond to 4 levels of contamination.
3.9 Learning curve for Pollution data. A reduced set of 8 covariates is suggested by the plot.
3.10 Learning curve for Demographic data. A reduced set of 12 covariates is suggested by the plot.
3.11 Error densities for the two "best" models for Demographic data. The "best of 12" model gives more stable results.
3.12 Learning curve for Protein data. A reduced set of 5 covariates is suggested by the plot.
5.1 Influence curve of adjusted-Winsorized estimate with (c_1, c_2) = (3, 2). The curve is symmetric about (0, 0).
5.2 Intrinsic bias in adjusted-Winsorized estimates with c_2 = h c_1. The bias in r_W decreases as c_1 increases.
5.3 Intrinsic bias in univariate-Winsorized estimate (c = 0.01).
5.4 Approximate Fisher-consistency for r_W. By using c_2 = c_1(h + 1)/2 we get less intrinsic bias than c_2 = h c_1.

List of Notation

n: number of observations (rows) in the dataset
d: total number of covariates
m: number of covariates in the reduced set
r_{jY}: correlation between covariate X_j and response Y
r_{jY.1}: partial correlation between covariate X_j (j ≠ 1) and response Y adjusted for X_1
\tilde{r}_{jY.1}: quantity proportional to r_{jY.1}
w.l.o.g.: without loss of generality
FS: forward selection procedure
SW: stepwise procedure
LARS: Least Angle Regression
SD: standard deviation
med: median
mad: median absolute deviation from the median
MSPE: mean squared prediction error

Acknowledgements

I would like to express my most sincere and deepest gratitude to my reverend supervisor, Dr. Ruben H. Zamar. This thesis would not have been completed without his excellent guidance, generous support and continuous motivation. In the beginning, when I wanted to quit my study due to my son's illness, he inspired me to stay the course. This thesis, therefore, belongs to him. I am also very grateful to his wife for allowing me to call their residence anytime I wanted.

I offer my very special thanks to Dr. Stefan Van Aelst (University of Ghent, Belgium), my co-supervisor, for his invaluable suggestions and guidance throughout this study, and his sponsorship for my stay in Belgium. I am very grateful for the hospitality I was offered by Stefan and his wife during that time.

My sincere thanks go to Dr. Matias Salibian-Barrera, member of my supervisory committee, for his effective suggestions that helped me save many days of work. I am very grateful to him for allowing me to use some of his user-friendly computer codes.
Respectful  thanks are to Dr. Harry Joe, Dr. Raymond Ng (computer science, UBC), and Dr. Roy E. Welsch (MIT) for  taking their time to read my thesis and provide useful suggestions and comments. I sincerely thank Christine Graham, Rhoda Morgan, Elaine Salameh and Peggy Ng for  their help with administrative matters. Christine and Rhoda are no longer with this department, but they will be remembered forever. My heartfelt  thanks are for  my dear friends  Mike ("The Man from  St. Petersburg"), Jean-Frangois (we all know him as Jeff),  Lawrence, Marin, Justin and other graduate students for  their invaluable support during my journey to PhD. I benefitted  from  helpful discussions with Mike and Jean-Frangois, and from  their brilliant suggestions. (Mike also suggested how to acknowledge his suggestions!) They also made my life  in Vancouver such a pleasant experience! Finally, I would like to pay tribute to robustness with a poem that I wrote during my graduate studies: Ashes to ashes, dust to dust -The statistics I use must be robust! JAFAR AHMED KHAN The  University  of  British  Columbia December 2006 To  Reba, Roja and Aumi Chapter 1 Introduction 1.1 Motivation We consider the problem of  building a linear prediction model when there is a large number d of  candidate covariates. Large datasets usually contain a fraction  of  outliers and other contaminations, which are difficult  to visualize and clean. The classical algorithms are much affected  by these outliers and, therefore,  these algorithms often  fail  to select the 'correct' linear prediction model that would have been chosen if  there were no outliers. We argue that it is not reasonable to attempt to predict future  outliers without knowledge of  the underlying mechanism that produces them. Therefore,  we aim at predicting the future  non-outlying cases by fitting  well the majority  of  the data. For this, we need a robust method that is capable of  selecting the important variables in the presence of  outliers in high-dimensional datasets. Robust model selection has not received much attention in the robustness litera-ture. Seminal papers that address this issue include Ronchetti (1985) and Ronchetti and Staudte (1994) which introduced robust versions of  the selection criteria AIC and Cp, respectively. Yohai (1997) proposed a robust Final Prediction Error (FPE) crite-rion (for  Splus documentation). Ronchetti, Field and Blanchard (1997) proposed robust model selection by cross-validation. Morgenthaler, Welsch and Zenide (2003) constructed a selection technique to simultaneously identify  the correct model structure as well as unusual observations. All these robust methods require the fitting  of  all submodels. One exception is the model selection based on the Wald test (Sommer and Huggins 1996) which requires the computation of  estimates from  the full  model only. A major drawback of  the existing robust model selection methods is that they do not scale up to large dimensions, because fitting  a robust model is a nonlinear optimization problem. As the number d of  possible predictors increases, the number of  submodels (which is 2d — 1) increases dramatically, making the computational burden enormous. Also, the methods that require the fitting  of  only the full  model are not suitable, because i only a few  of  the d covariates are typically included in the final  model, and the fitting  of the full  model increases the numerical complexity of  the methods unnecessarily. 
In this study, we attempt to achieve robustness and computational suitability at the same time. That is, we attempt to develop linear prediction model building strategies that are simultaneously (i) capable of  selecting the important covariates in the presence of  contaminations, and (ii) scalable to high dimensions. The term "scalable" is used to indicate that the numerical complexity of  the statistical methods proposed is "reasonable" (e.g., not exponential). 1.2 Model selection strategy We consider two different  strategies for  the selection of  a linear prediction model for high-dimensional datasets: (a) one-step model building and (b) two-step model building, which are described below. 1.2.1 One-step model building Since for  large values of  d  the computational burden of  all possible subsets regression is enormous, we turn our focus  on step-by-step algorithms like forward  selection (FS) and stepwise (SW) procedures (see, for  example, Weisberg 1985, Chapter 8) that can stop when certain goals are achieved. Classical FS or SW procedures yield poor results when the data contain outliers and other contaminations, since they attempt to select the covariates that will fit  well all the cases (including the outliers). Therefore,  our goal is to develop robust step-by-step algorithms that will select important variables in the presence of  outliers, and predict well the future  non-outlying cases. We express the classical FS and SW algorithms in terms of  sample means, variances and correlations, and replace these sample quantities by their robust counterparts to obtain robust step-by-step algorithms. Similar ideas have been used for  building robust estimators of  regression parameters (see, for  example, Croux, Van Aelst and Dehon 2003, and the references  therein). We also incorporate robust partial F-tests as stopping rules during the implementation of  these robust algorithms. 1.2.2 Two-step model building Our two-step model building procedure is a blend of  all possible subsets regression and step-by-step algorithms. All possible subsets regression is expected to select a better model (with respect to predictive power) than any step-by-step algorithm, but its com-putational burden is extremely high for  large values of  d. We, therefore,  consider ap-plying this procedure on a "reduced set" of  covariates. Thus, we consider proceeding in two steps. The first  step - which we call sequencing - quickly screens out unimportant variables to form  a "reduced set" for  further  consideration. The second step - which we call segmentation  - carefully  examines different  subsets of  the variables in the reduced set for  possible inclusion in the prediction model. These two steps are described below. Sequencing The goal of  the first  step is a drastic reduction of  the number of  candidate covariates. The input variables are sequenced to form  a list such that the good predictors are likely to appear at the beginning of  the list. The first  m covariates of  the list then form  the reduced set from  which the final  prediction model will be obtained. One strategy for  sequencing the candidate covariates is to use one of  the several available step-by-step or stagewise algorithms, e.g., Forward Selection (FS), or Forward Stagewise procedure (Stagewise) (see, for  example, Hastie, Tibshirani and Friedman 2001, Chapter 10). 
We focus  on the powerful  algorithm recently proposed by Efron, Hastie, Johnstone and Tibshirani (2004) called Least Angle Regression (LARS), which is a mathematical solution to the Stagewise problem. LARS is computationally efficient  and has been shown to have clear statistical advantages over other step-by-step and stagewise algorithms. Since LARS is very sensitive to contamination, our goal is to robustify  LARS. We show that LARS can be expressed in terms of  the mean vector and covariance matrix of the data, and we replace these classical ingredients of  LARS by their robust counterparts to obtain robust LARS. We combine robust LARS algorithm with bootstrap to obtain a more stable and reliable list of  covariates. One important issue in the sequencing step is to determine the appropriate value of  the number of  covariates, m, for  the reduced set. The probability that the reduced set contains all the important variables increases with m. Unfortunately,  also the com-putational cost of  the second step, segmentation, increases with m. Therefore,  we aim to determine a "reasonable" value of  m which is large enough to include most of  the important variables but not so large as to make the second step impractical or unfeasi-ble. For this purpose, we introduce a "learning curve" that plots robust R2 values versus dimension. An appropriate value of  m is the dimension corresponding to the point where the curve starts to level off. Segmentation When we have a reduced set of  m covariates for  further  consideration, one reasonable approach to reach the final  model is to perform  all possible subsets regression on this reduced set using an appropriate selection criterion. Again, the classical selection criteria, e.g., Final Prediction Error (FPE), Akaike Information  Criterion (AIC), Mallows' Cp, cross-validation (CV) and bootstrap procedures are not resistant to outliers. The robust AIC procedure (Ronchetti 1985) has certain limitations, which are discussed in this thesis. The robust CV method (Ronchetti, Field and Blanchard 1997) is computationally expensive. In this study, we propose computationally suitable robust CV and robust boot-strap procedures to evaluate the predictive powers of  different  subsets of  the reduced set of  covariates. Our robust bootstrap procedure is similar to the methods proposed by Salibian-Barrera (2000), and Salibian-Barrera and Zamar (2002) to conduct robust statistical inferences  in linear regression. Since the performance  of  robust FPE procedure (Yohai 1997) has not been studied so far,  we also evaluate this method in our study. 1.3 Computation of  robust correlation matrices As mentioned earlier, our approach to robustification  of  FS, SW and LARS consists of  expressing these algorithms in terms of  the mean vector and the correlation matrix of  the data, and then replacing these classical ingredients by their robust counterparts. Therefore,  robust estimation of  the correlation matrix is a major component of  the robust methods we propose in this study. The computation of  robust correlation estimates from  a <i-dimensional data set is very time-consuming, particularly for  large values of  d.  Even the fast  MCD algorithm by Rousseeuw and Van Driessen (1999) is not fast  enough for  the type of  applications we have in mind. Therefore,  we use robust correlations derived from  a pairwise affine equivariant covariance estimator. 
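As a concrete illustration of the bootstrap stabilization mentioned for the sequencing step, the sketch below aggregates the orderings produced on bootstrap resamples by averaging each covariate's rank. This particular aggregation rule is our own simple choice (the thesis develops its own procedure in Section 3.6), and `sequence_fn` is a placeholder for robust LARS or any other sequencing method that returns an ordering of all d covariates.

```python
import numpy as np

def bootstrap_sequence(X, Y, sequence_fn, n_boot=100, random_state=0):
    """Stabilize a covariate-sequencing method by aggregating over bootstrap
    resamples: covariates that tend to appear early in the individual
    sequences get a small average rank and are listed first."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    rank_sum = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # bootstrap resample of the rows
        seq = sequence_fn(X[idx], Y[idx])      # placeholder: e.g. a robust LARS ordering of all d covariates
        ranks = np.empty(d)
        ranks[np.asarray(seq)] = np.arange(d)  # position of each covariate in this sequence
        rank_sum += ranks
    return list(np.argsort(rank_sum))          # aggregated sequence; its first m entries form the reduced set
```

The first m entries of the aggregated sequence would then be examined in the segmentation step, with m chosen from a learning curve as described above.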
Interestingly, the pairwise approach for  robust correlation matrix estimation is not only computationally suitable, it is also more relevant than the d-dimensional approach for  robust step-by-step algorithms. Pairwise approach allows us to compute only the required correlations at each step of  the algorithm. Since we intend to stop as soon as certain goals are achieved, a pairwise approach saves the computation of  the correlations that are not required. We consider robust correlations derived from  a simplified  version of  the bivariate M-estimator proposed by Maronna (1976). This estimate is computationally efficient, affine  equivariant and has a breakdown point of  1/3 in two dimensions. For very large high-dimensional data, however, we need an even faster  robust corre-lation estimator. Therefore,  as a part of  our robust LARS procedure, we propose a new correlation estimate called the "adjusted-Windsorized correlation estimate". Unlike two separate univariate Windsorizations for  X  and Y  (see Huber 1981 and Alqallaf  2003), we propose a joint Windsorization with a larger tuning constant C\ for  the points falling in the two quadrants that contain the majority of  the data, and a smaller constant c2 for  the points in the two minor quadrants. Our estimate has some advantages over the univariate-Windsorized correlation estimate. 1.4 Organization of  subsequent chapters The following  chapters are organized as follows.  In Chapter 2, we present the one-step model selection procedure, where we develop robust versions of  FS and SW algorithms incorporated with robust partial F-tests used as stopping rules. In Chapter 3, we present the first  step (robust sequencing) of  the two-step model selection procedure. Here, we robustify  LARS to sequence the covariates, use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to decide about the size of  the reduced set. For the development of  robust LARS, we introduce the idea of  "multivariate-windsorization" which we use for  robust data cleaning. We also propose a new correlation estimate which we call the "adjusted-Windsorized correlation estimate". Chapter 4 deals with the second step (robust segmentation) of  the two-step model selection procedure. Here, we review the existing classical and robust selection criteria, discuss their limitations, and propose computationally suitable robust CV and robust bootstrap procedures to evaluate the predictive powers of  different  subsets of  the reduced set of  covariates. We also evaluate the performance  of  robust FPE (Yohai 1997). Chapter 5 studies the properties of  the adjusted-Windsorized correlation estimate (the new correlation estimate proposed in Chapter 3). We show that the proposed estimate is consistent and has bounded influence.  We obtain its asymptotic variance and intrinsic bias. We show that the tuning constants of  this estimate can be chosen such that it is approximately Fisher-consistent. We also show that a smoothed version of  this estimate is asymptotically normal. In Chapter 6 we conclude by summarizing the main ideas proposed in this thesis, and the main results obtained. Though the major chapters (Chapters 2-5) are connected conceptually, each of  them is independent of  the others. That is, they may be considered as individual research papers, to a certain extent. 
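To make the idea of Section 1.3 concrete, here is a rough numpy sketch of an adjusted-Winsorized correlation in the spirit of the verbal description above: after a median/mad standardization, points in the pair of opposite quadrants holding the majority of the data are clipped at the larger constant c1, points in the two minor quadrants at the smaller constant c2, and the ordinary correlation of the transformed data is returned. The exact estimator is defined later in the thesis (Chapters 3 and 5); the mad rescaling constant and the way the major quadrants are identified here are our own assumptions, not the thesis's definition.

```python
import numpy as np

def adjusted_winsorized_corr(x, y, c1=2.0, c2=1.0):
    """Sketch of an adjusted-Winsorized correlation: joint Winsorization with
    a larger constant c1 in the two 'major' quadrants and a smaller constant
    c2 in the two minor quadrants, after median/mad standardization."""
    def standardize(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med)) / 0.6745   # rescaled mad (our choice)
        return (v - med) / mad
    u, w = standardize(x), standardize(y)
    # identify the pair of opposite quadrants containing the majority of the points
    same_sign = (u * w) >= 0
    major = same_sign if same_sign.mean() >= 0.5 else ~same_sign
    c = np.where(major, c1, c2)
    # joint, quadrant-dependent Winsorization of both coordinates
    u_t = np.sign(u) * np.minimum(np.abs(u), c)
    w_t = np.sign(w) * np.minimum(np.abs(w), c)
    return np.corrcoef(u_t, w_t)[0, 1]
```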
Chapter 2 One-step Model Building: Robust Forward Selection and Stepwise Procedures 2.1 Introduction When the number d of  candidate covariates is small, one can choose a linear predic-tion model by computing a reasonable criterion (e.g., Cp, AIC, cross-validation error or bootstrap error) for  all possible subsets of  the predictors. However, as d increases, the computational burden of  this approach (sometimes referred  to as all  possible  subsets regression)  increases very quickly. This is one of  the main reasons why step-by-step algo- ' rithms like forward  selection (FS) and stepwise (SW) (see, for  example, Weisberg 1985, Chapter 8) are popular. Unfortunately,  classical FS or SW procedures yield poor results when the data con-tain outliers and other contaminations. These algorithms attempt to select the covariates that will fit  well all the cases (including the outliers), and often  fail  to select the model that would have been chosen if  those outliers were not present in the data. Moreover, aggressive deletion of  outliers is not desirable, because we may end up deleting a lot of observations which are outliers only with respect to the predictors that will not be in the model. We argued earlier that it is not reasonable to attempt to predict the future  outliers without knowledge of  the underlying mechanism that produces them. Therefore,  our goal is to develop robust step-by-step algorithms that will select important variables in the presence of  outliers, and predict well the future  non-outlying cases. We show that the list of  variables selected by classical FS and SW procedures are functions  of  sample means, variances and correlations. We express the two classical algo-rithms in terms of  these sample quantities, and replace them by robust counterparts to obtain the corresponding robust versions of  the algorithms. Once the covariates are se-lected (by using these simple robust selection algorithms), we can use a robust regression estimator on the final  model. Robust correlation matrix estimators for  <i-dimensional data sets are usually de-rived from  affine-equivariant,  robust estimators of  scatter. This is very time-consuming, particularly for  large values of  d. Moreover, the computation of  such robust correlation matrices becomes unstable when the dimension d is large compared to the sample size n. On the other hand, only a few  of  the d covariates are typically included in the fi-• nal model, and the computation of  the whole d-dimensional correlation matrix at once will unnecessarily increase the numerical complexity of  the otherwise computationally suitable step-by-step algorithms. To avoid this complexity, we use an affine-equivariant  bivariate M-estimator of  scat-ter to obtain robust correlation estimates for  all pairs of  variables, and combine these to construct a robust correlation matrix. We call this the pairwise robust correlation approach. Interestingly, this pairwise approach for  robust correlation matrix estimation is not only computationally suitable, but is also more convenient (compared to the full d-dimensional approach) for  robust step-by-step algorithms. The reason is as follows. The sample correlation matrix (R, say) has the property that the correlation matrix of a subset of  variables can be obtained by simply taking the appropriate submatrix of  R. This property allows us to compute only the required correlations at each step of  the algorithm. With the pairwise robust correlation approach we keep this property. 
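The submatrix property described above is what lets a step-by-step algorithm request correlations one pair at a time. As a small illustration (not from the thesis; the class and method names are our own), here is a sketch of a pairwise correlation matrix that is filled lazily with a user-supplied estimator, classical or robust, so only the pairs the algorithm actually inspects are ever computed.

```python
import numpy as np

class LazyPairwiseCorr:
    """Correlation matrix whose entries are computed on demand with a
    user-supplied pairwise estimator, so a step-by-step algorithm only
    pays for the pairs it actually inspects."""
    def __init__(self, data, pairwise_estimator):
        self.data = np.asarray(data)      # n x d matrix of variables
        self.est = pairwise_estimator     # e.g. a robust bivariate correlation
        d = self.data.shape[1]
        self.R = np.full((d, d), np.nan)  # nan marks "not yet computed"
        np.fill_diagonal(self.R, 1.0)
    def __call__(self, j, l):
        if np.isnan(self.R[j, l]):        # compute and cache the pair (j, l)
            r = self.est(self.data[:, j], self.data[:, l])
            self.R[j, l] = self.R[l, j] = r
        return self.R[j, l]
```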
Affine  equivariance and regression equivariance are considered to be important prop-erties for  robust regression estimators (see, e.g., Rousseeuw and Leroy 1987). However, these properties are not required in the context of  variable selection, because we do not consider linear combinations of  the existing covariates. The only transformations  that should not affect  the selection result are linear transformations  of  individual variables, i.e., shifts  and scale changes. Variable selection methods are often  based on correlations among the variables. Therefore,  robust variable selection procedures need to be robust against correlation outliers, that is, outliers that affect  the classical correlation estimates but can not be detected by looking at the individual variables separately. Our approach based on pairwise correlations is robust against correlation outliers and thus suitable for robust variable selection. It should be emphasized that with our approach we consider the problem of  "selecting" a list of  important predictors, but we do not yet "fit"  the selected model. The final  model resulting from  the selection procedure usually contains only a small number of  predictors compared to the initial dimension d,  when d  is large. Therefore,  to robstly fit  the final  model we propose to use a highly robust regression estimator such as an MM-estimator (Yohai 1987) that is resistant to all types of  outliers. Note that we always use models with intercept. Croux, Van Aelst and Dehon (2003) estimated the parameters of  a regression model using S-estimators of  multivariate location and scatter. They also obtained the corre-sponding standard errors. Their estimation method can be adapted for  model-building purposes. However, for  the step-by-step algorithms like FS and SW, our pairwise ap-proach has computational advantages. The rest of  this chapter is organized as follows.  In Section 2.2 we review some classical step-by-step algorithms. In Section 2.3 we decompose the FS and SW procedures in terms of  the correlation matrix of  the data. In Section 2.4, we present robust versions of  these algorihms, along with their numerical complexities. Section 2.5 presents a Monte Carlo study that compares our robust methods with the classical ones by their predicting powers. Section 2.6 contains two real-data applications. Section 2.7 is the conclusion. 2.2 Review: classical step-by-step algorithms In this section, we review the three most important step-by-step algorithms: Forward Se-lection (FS), stepwise (SW) and Backward Elimination (BE). We show a serious drawback of  the BE procedure, which is ,why we did not consider this algorithm for  robustification. 2.2.1 Forward Selection (FS) Let us have d  predictors ..., X d, and a response Y.  Let each variable be standardized using its mean and standard deviation. The FS procedure selects the predictor (Xi, say) that has the largest absolute correlation |riy| with Y, and obtains the residual vector Y  — riyXi.  All the other covariates are then 'adjusted for  Xi  and entered into competition. That is, each Xj is regressed on and the corresponding residual vector Zj.i (which is orthogonal to is obtained. The correlations of  these ZjA with the residual vector Y  — riyXi,  which are also called "the partial correlations between Xj and Y  adjusted for  Xi  \ decide the next variable (X 2, say) to enter the regression model. All the other covariates are then 'adjusted for  X\  and X 2 and entered into further competition, and so on. 
We continue adding one covariate at each step, until a stopping criterion is met. The reason behind the 'orthogonalization', that is, the construction of  Zj,\ from  Xj, is that the algorithm measures what 'additional' contribution Xj makes in explaining the variability of  Y, when Xj joins X\ in the regression model. The R2 produced by (Xx, Z 2) is the same as the R2 produced by (Xl, X 2), and the orthogonalization ensures maximum R2 at each FS step. We stop when the partial F-value for  each covariate that has not yet entered the model is less than a pre-selected number, say "F-IN" (Weisberg 1985). 2.2.2 Stepwise (SW) The SW algorithm is the same as the FS procedure up to the second step. When there are at least two covariates in the model, at each subsequent SW step we either (a) remove a covariate, or (b) exchange two covariates, or (c) add a covariate, or (d) stop. Note that, the "exchange" of  two covariates is not the same as the addition (removal) of  a covariate in one step followed  by the removal (addition) of  another covariate in the next step. Sometimes, a new covariate cannot enter the model because of  an existing covariate, and the existing covariate cannot be removed according to the criterion used. The options at each SW step are considered in the order in which they are mentioned above. A selected covariate is removed if  its partial F-value is less than a pre-selected number, say "F-OUT" (Weisberg 1985). A selected covariate is exchanged with a new one if  the exchange increases R2. A covariate is added if  it has the highest partial F-value among the remaining covariates, and the value is more than F-IN (as in FS). And, we stop when none of  the above (removal, exchange or addition) occurs in a certain step. 2.2.3 Backward Elimination (BE) The BE procedure (see, for  example, Weisberg 1985, Chapter 8) is the opposite of  FS. BE starts with the full  model, and removes one covariate at each step. The covariate to remove is the one that has the smallest partial F-value among all the covariates that are currently in the model. We stop when the partial F-value for  each covariate currently in the model is greater than F-OUT. Limitation of  BE When the number d of  candidate covariates is large, only a few  of  these covariates are typically included in the final  model. However, to apply the BE algorithm, we have to compute the pairwise correlation estimates for  all the d covariates (since BE starts with the full  model). Therefore,  BE has higher numerical complexity than that of  FS (or SW). This problem will be more serious with the computation of  robust correlation estimates (for  robust BE) . Therefore,  we will not consider the BE procedure for  robustification.  • 2.3 FS and SW Expressed in Correlations In order to robustify  the FS and SW procedures, we will now express these algorithm in terms of  the original correlations of  the variables. 2.3.1 FS expressed in terms of  correlations Let the d covariates X x,..., Xj  and the response Y  be standardized using their mean and standard deviation. Let rjy denote the correlation between Xj and Y, and Rx be the correlation matrix of  the covariates. Suppose w.l.o.g. that X\ has the maximum absolute correlation with Y. Then, Xx is the first  variable that enters the regression model. We call the predictors that are in the current regression model "active" predictors. The remaining candidate predictors are called "inactive" predictors. 
We now need the partial correlations between X_j (j ≠ 1) and Y adjusted for X_1, denoted by r_{jY.1}, to determine the second covariate X_2 (say) that enters the model.

The partial correlation r_{jY.1} expressed in terms of original correlations

Each inactive covariate X_j should be regressed on X_1 to obtain the residual vector Z_{j.1} as follows:

Z_{j.1} = X_j - \beta_{j1} X_1,   (2.1)

where

\beta_{j1} = \frac{1}{n} X_1^t X_j = r_{j1}.   (2.2)

We have

\frac{1}{n} Z_{j.1}^t Y = \frac{1}{n} (X_j - \beta_{j1} X_1)^t Y = r_{jY} - r_{j1} r_{1Y},   (2.3)

and

\frac{1}{n} Z_{j.1}^t Z_{j.1} = \frac{1}{n} (X_j - \beta_{j1} X_1)^t (X_j - \beta_{j1} X_1) = 1 - r_{j1}^2.   (2.4)

Therefore, the partial correlation r_{jY.1} is given by

r_{jY.1} = \frac{ Z_{j.1}^t (Y - \beta_{Y1} X_1)/n }{ \sqrt{ Z_{j.1}^t Z_{j.1}/n } \; \mathrm{SD}(Y - \beta_{Y1} X_1) }.   (2.5)

Note that the factor SD(Y - \beta_{Y1} X_1) in the denominator of (2.5) is independent of the covariate X_j (j = 2, ..., d) being considered. Hence, when selecting the covariate X_j that maximizes the partial correlation r_{jY.1}, this constant factor can be ignored, which reduces the computations. It thus suffices to calculate

\tilde{r}_{jY.1} = \frac{ Z_{j.1}^t (Y - \beta_{Y1} X_1)/n }{ \sqrt{ Z_{j.1}^t Z_{j.1}/n } },   (2.6)

which is proportional to the actual partial correlation. \tilde{r}_{jY.1} can be rewritten as follows:

\tilde{r}_{jY.1} = \frac{ Z_{j.1}^t Y/n }{ \sqrt{ Z_{j.1}^t Z_{j.1}/n } }  [since Z_{j.1} and X_1 are orthogonal]
               = \frac{ r_{jY} - r_{j1} r_{1Y} }{ \sqrt{ 1 - r_{j1}^2 } }.   (2.7)

Now, suppose w.l.o.g. that X_2 (or, equivalently, Z_{2.1}) is the new active covariate, because it maximizes |\tilde{r}_{jY.1}| (and thus also the absolute partial correlation |r_{jY.1}|). All the inactive covariates should now be orthogonalized with respect to Z_{2.1}.

Orthogonalization of Z_{j.1} with respect to Z_{2.1}

Each inactive variable Z_{j.1} should be regressed on Z_{2.1} to obtain the residual vector Z_{j.12} as follows:

Z_{j.12} = Z_{j.1} - \beta_{j2.1} Z_{2.1}.

Here,

\beta_{j2.1} = \frac{ Z_{2.1}^t Z_{j.1}/n }{ Z_{2.1}^t Z_{2.1}/n }
            = \frac{ X_2^t Z_{j.1}/n }{ Z_{2.1}^t Z_{2.1}/n }  [because of orthogonality]
            = \frac{ X_2^t (X_j - \beta_{j1} X_1)/n }{ Z_{2.1}^t Z_{2.1}/n }  [using (2.1) and (2.2)]
            = \frac{ r_{2j} - r_{21} r_{j1} }{ 1 - r_{21}^2 }  [using the (squared) denominator of (2.7) for j = 2].   (2.8)

Thus, \tilde{r}_{jY.1} and \beta_{j2.1} are expressed in terms of original correlations.

Lemma 2.1. Given that the numerators and denominators of the following equations,

\tilde{r}_{jY.12\cdots(k-1)} = \frac{ Z_{j.12\cdots(k-1)}^t Y/n }{ \sqrt{ Z_{j.12\cdots(k-1)}^t Z_{j.12\cdots(k-1)}/n } }  for all inactive j,   (2.9)

and

\beta_{jh.12\cdots(h-1)} = \frac{ Z_{h.12\cdots(h-1)}^t Z_{j.12\cdots(h-1)}/n }{ Z_{h.12\cdots(h-1)}^t Z_{h.12\cdots(h-1)}/n }  for h = 2, ..., k; j inactive,   (2.10)

are functions of original correlations, the numerators and denominators of the following quantities can be expressed as functions of original correlations: (a) \tilde{r}_{jY.12\cdots k} and (b) \beta_{j(k+1).12\cdots k}.

Proof. Here, \tilde{r}_{jY.12\cdots(k-1)} determines the next active covariate X_k (or, equivalently, Z_{k.12\cdots(k-1)}). The given \beta_{jk.12\cdots(k-1)} can be used to obtain the residual vector

Z_{j.12\cdots k} = Z_{j.12\cdots(k-1)} - \beta_{jk.12\cdots(k-1)} Z_{k.12\cdots(k-1)}.   (2.11)

Now,

\tilde{r}_{jY.12\cdots k} = \frac{ Z_{j.12\cdots k}^t Y/n }{ \sqrt{ Z_{j.12\cdots k}^t Z_{j.12\cdots k}/n } }.   (2.12)

Using (2.11), the numerator of (2.12) can be written as

\frac{1}{n} Z_{j.12\cdots(k-1)}^t Y - \beta_{jk.12\cdots(k-1)} \frac{1}{n} Z_{k.12\cdots(k-1)}^t Y,

where the first part is the numerator of (2.9), \beta_{jk.12\cdots(k-1)} comes from (2.10) for h = k, and the rest is the numerator of (2.9) for j = k (because X_k was inactive at that point). Thus, the numerator of (2.12) is a function of the original correlations.

Using (2.11), the squared denominator of (2.12) can be written as

\frac{1}{n} Z_{j.12\cdots(k-1)}^t Z_{j.12\cdots(k-1)} - 2 \beta_{jk.12\cdots(k-1)} \frac{1}{n} Z_{k.12\cdots(k-1)}^t Z_{j.12\cdots(k-1)} + \beta_{jk.12\cdots(k-1)}^2 \frac{1}{n} Z_{k.12\cdots(k-1)}^t Z_{k.12\cdots(k-1)},   (2.13)

where the first term is the (squared) denominator of (2.9), the second term involves the numerator of (2.10) for h = k, and the last term is the denominator of (2.10) for h = k. This proves Part (a) of the lemma.
The quantities \tilde{r}_{jY.12\cdots k} determine the next active covariate, X_{k+1} say. Now,

\beta_{j(k+1).12\cdots k} = \frac{ Z_{(k+1).12\cdots k}^t Z_{j.12\cdots k}/n }{ Z_{(k+1).12\cdots k}^t Z_{(k+1).12\cdots k}/n }.   (2.14)

Because of orthogonality, the numerator of (2.14) can be written as

\frac{1}{n} X_{k+1}^t Z_{j.12\cdots k} = \frac{1}{n} X_{k+1}^t \left( X_j - \beta_{j1} X_1 - \beta_{j2.1} Z_{2.1} - \cdots - \beta_{jk.12\cdots(k-1)} Z_{k.12\cdots(k-1)} \right)
 = r_{j(k+1)} - \beta_{j1} r_{(k+1)1} - \beta_{j2.1} \frac{1}{n} Z_{(k+1).1}^t Z_{2.1} - \cdots - \beta_{jk.12\cdots(k-1)} \frac{1}{n} Z_{(k+1).12\cdots(k-1)}^t Z_{k.12\cdots(k-1)},

where the \beta's come from (2.10) for h = 2, ..., k, and the other quantities are the numerators of (2.10) for j = k + 1 and h = 2, ..., k.

For the denominator of (2.14), we can use the relation

Z_{(k+1).12\cdots k} = Z_{(k+1).12\cdots(k-1)} - \beta_{(k+1)k.12\cdots(k-1)} Z_{k.12\cdots(k-1)},

which follows from (2.11) by replacing j by (k + 1). So, the denominator can be written as

\frac{1}{n} Z_{(k+1).12\cdots(k-1)}^t Z_{(k+1).12\cdots(k-1)} - 2 \beta_{(k+1)k.12\cdots(k-1)} \frac{1}{n} Z_{(k+1).12\cdots(k-1)}^t Z_{k.12\cdots(k-1)} + \beta_{(k+1)k.12\cdots(k-1)}^2 \frac{1}{n} Z_{k.12\cdots(k-1)}^t Z_{k.12\cdots(k-1)},

where the first part is the (squared) denominator of (2.9) for j = k + 1, the second part is the numerator of (2.10) for j = k + 1 and h = k, and the last part is the denominator of (2.10) for h = k. Therefore, \beta_{j(k+1).12\cdots k} is a function of original correlations. This completes the proof.

FS steps in correlations

We can now summarize the FS algorithm in terms of correlations among the original variables as follows:

1. To select the first covariate X_{m_1}, determine m_1 = argmax_j |r_{jY}|.
2. To select the kth covariate X_{m_k} (k = 2, 3, ...), calculate \tilde{r}_{jY.m_1\cdots m_{k-1}}, which is proportional to the partial correlation between X_j and Y adjusted for X_{m_1}, ..., X_{m_{k-1}}, and then determine m_k = argmax_j |\tilde{r}_{jY.m_1\cdots m_{k-1}}|.

Partial F-tests for stopping

At each FS step, once the "best" covariate (among the remaining covariates) is identified, we can perform a partial F-test to decide whether to include this covariate in the model (and continue the process) or to stop. The new "best" covariate enters the model only if the partial F-value, denoted by F_partial, is greater than F(0.95, 1, n - k - 1) (say), where k is the current size of the model including the new covariate. Here again, the required quantities can be expressed in terms of correlations among the original variables, as we show below.

Suppose that X_1 is already included in the model, and X_2 has the largest absolute partial correlation with Y after adjusting for X_1. To decide whether X_2 should be included in the model we perform a partial F-test using the statistic F_partial given by

F_{partial} = \frac{ (Y - \beta_{Y1} X_1)^t (Y - \beta_{Y1} X_1) - (Y - \beta_{Y1} X_1 - \beta_{Y2.1} Z_{2.1})^t (Y - \beta_{Y1} X_1 - \beta_{Y2.1} Z_{2.1}) }{ (Y - \beta_{Y1} X_1 - \beta_{Y2.1} Z_{2.1})^t (Y - \beta_{Y1} X_1 - \beta_{Y2.1} Z_{2.1}) / (n - 3) }
 = (n - 3) \frac{ 2 \beta_{Y2.1} Z_{2.1}^t Y/n - \beta_{Y2.1}^2 Z_{2.1}^t Z_{2.1}/n }{ 1 - r_{1Y}^2 - \left( 2 \beta_{Y2.1} Z_{2.1}^t Y/n - \beta_{Y2.1}^2 Z_{2.1}^t Z_{2.1}/n \right) }
 = (n - 3) \frac{ \tilde{r}_{2Y.1}^2 }{ 1 - r_{1Y}^2 - \tilde{r}_{2Y.1}^2 },   (2.15)

where \tilde{r}_{2Y.1} is expressed in terms of correlations in (2.7). Similarly, when (k - 1) covariates X_1, ..., X_{k-1} are already in the model, and w.l.o.g. X_k has the largest absolute partial correlation with Y after adjusting for X_1, ..., X_{k-1}, the partial F-statistic for X_k can be expressed as

F_{partial} = (n - k - 1) \frac{ \tilde{r}_{kY.12\cdots(k-1)}^2 }{ 1 - r_{1Y}^2 - \tilde{r}_{2Y.1}^2 - \cdots - \tilde{r}_{kY.12\cdots(k-1)}^2 },   (2.16)

where the partial correlations can be expressed in terms of the original correlations using Lemma 2.1.
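Because the recursions above involve only correlations, the same forward selection can be driven entirely by a correlation matrix, which is what makes the robust plug-in of Section 2.4 possible. The sketch below is our own minimal Python/numpy illustration (not code from the thesis): it performs FS from a correlation matrix R of (X_1, ..., X_d, Y), using partial-covariance updates equivalent to the recursions of Lemma 2.1 and the partial F rule (2.16) as the stopping criterion. The function name and the use of scipy.stats.f for the F quantile are our own choices.

```python
import numpy as np
from scipy.stats import f as f_dist

def forward_select_from_corr(R, n, alpha=0.95):
    """Forward selection driven only by a (classical or robust) correlation
    matrix R of (X_1, ..., X_d, Y); the last row/column corresponds to Y.
    Returns the 0-based indices of the selected covariates."""
    S = np.array(R, dtype=float)          # partial (co)variances, updated in place
    d = S.shape[0] - 1                    # number of candidate covariates
    y = d                                 # index of the response
    inactive = list(range(d))
    active = []
    while inactive:
        # proportional partial correlations, as in (2.9): S[j, y] / sqrt(S[j, j])
        r_tilde = np.array([S[j, y] / np.sqrt(S[j, j]) for j in inactive])
        j = inactive[int(np.argmax(np.abs(r_tilde)))]
        k = len(active) + 1               # model size if X_j is added
        num = S[j, y] ** 2 / S[j, j]      # part of the remaining Y variance explained by X_j
        den = S[y, y] - num               # residual Y variance after adding X_j
        F_partial = (n - k - 1) * num / den
        if F_partial <= f_dist.ppf(alpha, 1, n - k - 1):   # "F-IN" stopping rule (2.16)
            break
        active.append(j)
        inactive.remove(j)
        # sweep out X_j: update partial (co)variances of the remaining variables and Y
        keep = inactive + [y]
        for a in keep:
            for b in keep:
                S[a, b] = S[a, b] - S[a, j] * S[b, j] / S[j, j]
    return active
```

Feeding this routine the classical sample correlation matrix reproduces ordinary FS; feeding it a robust pairwise correlation matrix gives the robust FS of Section 2.4.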
2.3.2 SW expressed in terms of correlations

The SW algorithm starts as the FS procedure. When there are at least two covariates in the model, at each subsequent SW step we either add a covariate, or drop a covariate, or exchange two covariates, or stop.

To decide whether to add a covariate, the partial correlations of each inactive covariate X_j with Y can be computed as in the case of FS (see (2.12)) to perform a partial F-test (see (2.15) and (2.16)). To decide whether to drop an "active" covariate, we can pretend that the active covariate under consideration entered the model last, and calculate its partial correlation with Y (see (2.12), subscripts modified) to perform a partial F-test (see (2.15) and (2.16), subscripts modified).

Once an "active" covariate is dropped, the "orthogonalizations" of the other covariates (active or inactive) with this covariate that were used before to derive the partial correlations become irrelevant, and the order of the other active covariates in the model cannot be determined. Fortunately, this does not create a problem in deciding the next covariate because, for example, \tilde{r}_{jY.346} = \tilde{r}_{jY.643}. Therefore, we can update all relevant calculations considering the currently active covariates in any order.

Stopping criteria for SW. Unlike the FS algorithm, where a stopping criterion is "optional" (we may choose to sequence all the covariates), SW has to have a built-in stopping rule, because at each step we have to decide whether to add one covariate and/or drop another. We may choose two different theoretical F percentiles as the inclusion and deletion criteria, e.g., F(0.95, 1, n - k_1 - 1) and F(0.90, 1, n - k_2 - 1), respectively, where k_1 and k_2 are the model sizes after inclusion and before deletion.

2.4 Robustification of FS and SW algorithms

In the last section we expressed the FS and SW algorithms in terms of sample means, variances and correlations. Because of these non-robust building blocks, these algorithms are sensitive to contamination in the data, as shown by our simulation and real-data examples later on. A simple robustification of these algorithms can be achieved by replacing the non-robust ingredients of the algorithms by their robust counterparts. For the initial standardization, the choices of fast computable robust center and scale measures are straightforward: median (med) and median absolute deviation (mad). As mentioned earlier, most available robust correlation estimators are computed from the d-dimensional data and therefore are very time consuming (see, for example, Rousseeuw and Leroy 1987). Robust pairwise approaches (Huber 1981) are not affine equivariant and, therefore, are sensitive to two-dimensional outliers.

One solution is to use robust correlations derived from a pairwise affine equivariant covariance estimate. We consider an estimate which is inspired by the computationally suitable multivariate M-estimate proposed by Maronna (1976). We first present Maronna's estimate below.

Definition 2.1. (Maronna's M-estimate of multivariate location and scatter) Let us have n multivariate observations z_i, i = 1, ..., n. Maronna's M-estimate of the location vector t and scatter matrix V is defined as the solution of the system of equations:

\frac{1}{n} \sum_{i=1}^{n} u_1(d_i) (z_i - t) = 0,   (2.17)

\frac{1}{n} \sum_{i=1}^{n} u_2(d_i^2) (z_i - t)(z_i - t)^t = V,   (2.18)

where d_i^2 = (z_i - t)^t V^{-1} (z_i - t), and u_1 and u_2 satisfy a set of general assumptions.

For further computational ease, we considered the following simplified version of the bivariate M-estimate. We used the coordinatewise median as the bivariate location estimate and only solved (2.18) to estimate the scatter matrix and hence the correlation. We used the function u_2(t) = u(t) = min(c/t, 1) with c = 9.21, the 99% quantile of a χ²₂ distribution. For the bivariate observations z_i = (x_i, y_i), i = 1, ..., n, the steps for calculating this correlation estimate are presented below (a code sketch is given after this description):

1. Calculate the medians m_x and m_y, and obtain \tilde{z}_i = (\tilde{x}_i, \tilde{y}_i), i = 1, ..., n, where \tilde{x}_i = x_i - m_x and \tilde{y}_i = y_i - m_y.
2. Calculate the mads s_x and s_y, and set V_0 = diag(s_x^2, s_y^2).
3. Calculate d_i^2 = \tilde{z}_i^t V_0^{-1} \tilde{z}_i, i = 1, ..., n. Then obtain V_1 = \frac{1}{n} \sum_{i=1}^{n} u(d_i^2) \tilde{z}_i \tilde{z}_i^t.
4. Set V_0 \leftarrow V_1.
5. Repeat steps 3 and 4. We stop when |r(V_1) - r(V_0)| < \delta, where \delta > 0 is a pre-selected small number, and r(\cdot) is the correlation coefficient calculated from the bivariate scatter matrix.

Finally, the FS and SW algorithms are implemented using these robust pairwise correlations.
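A minimal numpy sketch of steps 1-5 follows. The weight function u(t) = min(c/t, 1) with c = 9.21 is as in the text; the function name, the tolerance, the iteration cap, and the use of the raw (unrescaled) mads in the initial scatter are our own choices, since the text does not specify them.

```python
import numpy as np

def bivariate_m_corr(x, y, c=9.21, tol=1e-4, max_iter=50):
    """Robust correlation from the simplified bivariate M-estimate of scatter
    (steps 1-5 above): coordinatewise-median location, mad-based initial
    scatter, and weights u(t) = min(c/t, 1)."""
    # Step 1: center by the coordinatewise medians
    z = np.column_stack([x - np.median(x), y - np.median(y)])
    # Step 2: initial scatter from the (raw) mads
    mad = np.median(np.abs(z), axis=0)
    V0 = np.diag(mad ** 2)
    corr = lambda V: V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
    r_old = corr(V0)
    for _ in range(max_iter):
        # Step 3: squared Mahalanobis distances and weighted scatter
        d2 = np.einsum('ij,jk,ik->i', z, np.linalg.inv(V0), z)
        d2 = np.maximum(d2, 1e-12)            # guard against exact zeros
        w = np.minimum(c / d2, 1.0)
        V1 = (z * w[:, None]).T @ z / len(z)  # (1/n) * sum_i u(d_i^2) z_i z_i^t
        r_new = corr(V1)
        # Steps 4-5: iterate until the correlation stabilizes
        V0 = V1
        if abs(r_new - r_old) < tol:
            break
        r_old = r_new
    return r_new
```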
Robust partial F-tests. We replace the classical correlations in the partial F statistic by their robust counterparts to form a robust partial F statistic. We conjecture that the robust pairwise correlations appearing in the numerator of the F statistic are jointly normal. Therefore, under the null hypothesis, the robust F statistic is asymptotically distributed as χ²₁. To assess our conjecture numerically, we conducted the following simulation. We generated X_1, \epsilon_1 and \epsilon_2 from a standard normal distribution. We then generated Y = \beta_0 + \beta_1 X_1 + \sigma_1 \epsilon_1 and X_2 = \gamma_0 + \gamma_1 X_1 + \sigma_2 \epsilon_2, where \beta_0, \beta_1, \gamma_0 and \gamma_1 are generated from a uniform distribution on (-10, 10), \sigma_1 is chosen so that the signal-to-noise ratio equals 2, and \sigma_2 is chosen so that X_1 and X_2 have a particular correlation randomly chosen from a uniform distribution on (0, 1). We generated 2000 datasets of size 100, and calculated the robust partial F statistic for covariate X_2 in each case. Figure 2.1 shows the QQ-plot of the robust partial F values against the theoretical χ²₁ quantiles. Moreover, the average and variance of the robust F values are 0.99 (≈ 1) and 2.02 (≈ 2), respectively. All of these support our conjecture. Therefore, we consider it reasonable to use theoretical F quantiles as our stopping criteria for the robust FS and SW algorithms.

[Figure 2.1: QQ-plot of the robust partial F values against the theoretical χ²₁ quantiles.]

2.4.1 Numerical complexity of the algorithms

If we sequence all d covariates, the standard FS procedure requires O(nd²) time. However, when applied with a stopping criterion, the complexity of FS depends on the number of covariates selected in the model. Assuming that the model size will not exceed a certain number m < d, the complexity of FS is less than or equal to O(ndm). Similarly, the maximum complexity of SW is O(n(dm + m²)) = O(ndm).

Since we used the coordinatewise median as the bivariate location estimate, the correlation based on Maronna's M-estimate can be computed in O(n log n + bn) time, where b is the number of iterations required. Assuming that b does not exceed O(log n) (convergence was achieved after 3 to 5 iterations in our simulations), the complexity of this estimate is O(n log n). As a result, the maximum complexity of robust FS is O((n log n) dm), and the maximum complexity of robust SW is O((n log n)(dm + m²)) = O((n log n) dm).

Though all possible subsets regression is expected to select a better model (with respect to predictive power) than any step-by-step algorithm, its computational burden is extremely high for large values of d, since it requires the fitting of all 2^d - 1 submodels.
The complexity of  the classical algorithms of  this type is 0(2 dnd 2). Since robust model selection methods proposed so far  uses all  possible  subsets regression,  the complexity of  the existing robust algorithms is 0(2 dnd 2) multiplied by the number of  iterations required for  the robust fits. 2.4.2 Limitation of  the proposed algorithms The robust FS and SW procedures based on robust pairwise correlations proposed are resistant to bivariate (correlation) outliers. However, they can be sensitive to three-or higher-dimensional outliers, that is, outliers that are not detected by univariate and bivariate analyses. Also, the correlation matrix obtained from  the pairwise correlation approach may not be positive definite,  forcing  the use of  correction for  positive definite-ness in some cases (see, e.g., Alqallaf  et al. 2002). It should be emphasized here that these are very small prices to pay to make the selection of  covariates possible for  large values of  d. For example, in our simulations (presented later) we used d = 50. It is impossible to apply all  possible  subsets regression on a dataset of  this dimension. If  one robust fit  takes 0.001 cpu second, we would need 250 * 0.001/(3600 * 24 * 365) years to select the final  model. 2.5 A simulation study To compare our robust methods with the classical ones, we carried out a simulation study similar to Frank and Friedman (1993). The total number of  variables is d'=  50. A small number a — 9 or a = 15 of  them are nonzero covariates. We considered 2 correlation structures of  these nonzero covariates: "moderate correlation" case and "no correlation" case, which are described below. For the moderate-correlation case, we considered 3 independent 'unknown' processes, represented by latent variables Li, i = 1, 2, 3, which are responsible for  the systematic variation of  both the response and the covariates. The model is Y = 7Li + 6L2 + 5L3 + e = Signal + e, (2.19) where Li ~ N(0,1),  and e is a normal error not related to the latent variables. The variance of  e is chosen such that the signal-to-noise ratio equals 2, that is Var(e) = 110/4. The nonzero covariates are divided in 3 equal groups, with each group related to exactly one of  the latent variables by the following  relation Xj  — Li + Sj, where Sj ~ iV(0,1). Thus, we have a true correlation of  0.5 between the covariates generated with the same latent variable. For the no-correlation case (a true correlation of  0 between the covariates), inde-pendent predictors Xj  ~ N(0,1)  are considered, and Y  is generated using the a non-zero covariates, with coefficients  (7, 6, 5) repeated three times for  a — 9, and five  times for a = 15. For each case we generated 1000 datasets each of  which was randomly divided into a training sample of  size 100 and a test sample of  size 100. Contamination of  the training data. Each of  the d — a noise variables are contaminated independently. Each observation of  a noise variables is assigned probability 0.003 of  being replaced by a large number. If  this observation is contaminated, then the corresponding observation of  Y  is also replaced by a large number. Thus, the probability that any particular row of  the training sample will be contaminated is 1 — (1 — 0.003)d_a, which is approximately 10% for  a = 15, and 11.6% for  a = 9. For each of  the 4 methods (2 classical and 2 robust), we fitted  the obtained model on the training data, and then used it to predict the test data outcomes. 
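For concreteness, here is a sketch (our own, not code from the thesis) of the moderate-correlation design and contamination scheme just described. The text does not specify the "large number" used for contaminated cells, so the value `big` below is only a placeholder; the grouping of the a nonzero covariates and the error variance 110/4 follow the description above.

```python
import numpy as np

def make_contaminated_data(n=100, d=50, a=9, p_contam=0.003, big=1e3, rng=None):
    """Moderate-correlation simulation design of Section 2.5: three latent
    variables drive Y and the a nonzero covariates; each cell of the d - a
    noise covariates is independently contaminated with probability p_contam,
    in which case the corresponding response is contaminated as well."""
    rng = np.random.default_rng(rng)
    L = rng.standard_normal((n, 3))                          # latent processes L1, L2, L3
    signal = L @ np.array([7.0, 6.0, 5.0])
    Y = signal + np.sqrt(110 / 4) * rng.standard_normal(n)   # Var(e) = 110/4, signal-to-noise ratio 2
    X = np.empty((n, d))
    for j in range(a):                                       # a nonzero covariates in 3 equal groups
        X[:, j] = L[:, j // (a // 3)] + rng.standard_normal(n)
    X[:, a:] = rng.standard_normal((n, d - a))               # d - a pure noise covariates
    bad = rng.random((n, d - a)) < p_contam                  # cells of noise variables to contaminate
    X[:, a:][bad] = big                                      # "large number" (placeholder value)
    Y[bad.any(axis=1)] = big                                 # contaminate Y in the affected rows
    return X, Y
```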
We used MM-estimator (Yohai 1987) to fit  the models obtained by either of  the robust methods, because of  its high breakdown point and high efficiency  at the normal model. For each simulated dataset, we recorded (1) the average squared prediction error on the test sample, (2) the number of  noise variables selected in the model, and (3) the total number of  variables selected in the model. Table 2.1: Performance  of  the classical and robust methods in clean and contaminated data for  moderate-correlation case. The average (SD) of  mean squared prediction error (MSPE) on the test set and the number of  noise variables (Noise) selected are shown. a = 9 a = 15 Data Method MSPE Noise MSPE Noise Clean FS 59.7 (12.0) 4.9 (2.4) 50.2 (9.3) 4.3 (2.2) SW 60.3 (12.3) 4.8 (2.3) 51.2 (9.7) 4.2 (2.1) Rob FS 60.4 (12.2) 5.1 (2.6) 51.5 (10.3) 4.7 (2.5) Rob SW 61.1 (12.8) 5.0 (2.5) 52.8 (10.5) 4.6 (2.4) Contam FS 157.6 (40.8) 13.6 (3.1) 134.5 (32.9) 11.7 (2.9) SW 158.4 (41.3) 13.4 (3.0) 136.3 (33.3) 11.6 (2.8) Rob FS 94.9 (27.9) 2.5 (2.9) 78.9 (23.7) 1.6 (2.9) Rob SW 95.1 (27.8) 2.4 (2.8) 79.3 (23.4) 1.5 (2.6) Table 2.1 shows the average (sd) of  the first  two quantities mentioned above over all generated datasets for  the moderate-correlation case. The average (sd) of  the third quantity (total number of  variables) is similar for  all the methods in the clean data. However, for  the contaminated data, the average increases (decreases) for  the classical (robust) methods. For example, for  a — 15, the average for  the classical methods increases from  13 to 17 (approximately), while for  the robust methods it decreases from  13 to 6. In general, FS performs  as good as SW, and robust FS performs  as good as robust SW. For the clean data, the performance  of  robust FS (SW) is comparable to standard FS (SW). For the contaminated data, the test errors produced by robust methods are much smaller than the classical ones. Also, the models obtained by robust methods contain fewer  noise variables than the classical ones. Table 2.2: Performance  of  the classical and robust methods in clean and contaminated data for  no-correlation case. The average (SD) of  mean squared prediction error (MSPE) on the test set and the average number of  noise variables (Noise) selected are shown. a = 9 a = 15 Data Method MSPE Noise MSPE Noise Clean FS 55.6 (11.6) 5.0 (2.4) 107.0 (21.7) 4.6 (2.3) SW 55.8 (11.8) . 4.8 (2.3) 108.1 (22.1) 4.3 (2.1) Rob FS 56.5 (12.4) 5.1 (2.6) 109.9 (21.6) 4.8 (2.4) Rob SW 56.7 (12.8) 4.9 (2.5) 108.4 (22.4) 4.6 (2.3) Contam FS 161.8 (38.1) 13.6 (3.0) 296.7 (75.3) 11.9 (2.8) SW 162.5 (37.5) 13.4 (2.8) 297.9 (75.9) 11.7 (2.7)' Rob FS 72.5 (13.9) 2.1 (2.4) 124.1 (19.9) 1.2 (1.8) Rob SW 72.6 (13.8) 2.1 (2.3) 124.2 (20.8) 1.2 (1.7) Table 2.2 presents the results for  the no-correlation case. Here, robust FS and SW more drastically outperform  the standard FS and SW, as compared to the moderate-correlation case. Note that the errors presented in this table are not comparable to those of  Table 2.1 since Y  is generated using the non-zero covariates (a  = 9 or a = 15), instead of  the 3 latent variables. Thus, F has much more variability for  a = 15 than for  a =.9. 2.5.1 Model selection with Spearman's p and Kendall's r In Section 2.3 we expressed the classical FS and SW algorithms in terms of  the corre-lation matrix of  the data. In Section 2.4 we replaced these correlations by their robust counerparts to obtain robust FS and SW. 
We can also consider replacing the classical correlations in FS and SW by Spearman's p or Kendall's r, since they are standard estimates of  association that are invariant to monotone transformations  of  the data. They may be good options for  variable selection when there is skewness in the data and no cluster of  multivariate outliers. A small simulation study (not presented here) indicates that the methods based on Spearman's p and Kendall's r may perform  better than the classical FS and SW. Further study is required to investigate their performance  as model building tools and compare them with classical and robust FS and SW. It should be mentioned here that Spearman's p can be computed in C>(nlogn) time, the same as the adjusted-Winsorized correlation estimate (the new correlation estimate proposed later in this thesis). Though Kendall's r separately examines each of  the (™) (order of  n2) pairs of  bivariate observations, there is an algorithm that can calculate Kendall's r in C(nlogn) time (Knight 1966). 2.6 Examples In this section, we used two real-data examples to show the robustness and scalability of our algorithms. Executive data. This dataset is obtained from  Mendenhall and Sincich (2003). The annual salary of  100 executives is recorded as well as 10 potential predictors (7 quantitative and 3 qualitative) such as education, experience etc. We label the candidate predictors from  1 to 10. Classical FS (with F 0.9 as the inclusion criterion) and SW (with F0.9 as both inclusion and deletion criterion) both select the covariates: (1,3,4,2,5). Robust FS and SW (also with F0.9 as inclusion and deletion criterion) select the same model. We then contaminated the data by replacing one small value of  predictor 1 (less than 5) by a large value 100. When FS and SW are applied to the contaminated data, they both now select a larger set of  variables: (7, 3, 4, 2,1, 5,10). Thus, changing a single number in the data set drastically changes the selected model. On the other hand, robust FS and SW select the same model, (1, 3, 4, 2, 5), when applied to the contaminated dataset. Particle data. This quantum physics dataset was used for  the KDD-Cup 2004. Each of  n = 50000 data-points (rows) describes one "example" (particle generated in a high energy collider experiment). There are 80 variables in the data: Example ID, class of  the example (positive examples are denoted by 1, negative examples by 0), and 78 feature  measurements. We considered only the feature  variables in our analysis. We deleted 13 of  the features  (either because they have a large number of  missing values, or they are degenerate with all observations equal to 0), and used the first  feature  as the response. Thus, there are 64 covariates and one response in the selected data. Though this analysis may not be of  particular scientific  interest, it will demonstrate the scalability and robustness of  our algorithms. We first  applied the four  algorithms to a training sample of  size n = 5000. The remaining 45000 cases will be used as a test sample. The classical FS and SW (with F0.g criterion) select the same model. It contains the following  25 covariates: (2,60, 58,18,8,4, 51,53,1,59,5,20,10,6,62,19,38,46,39,47,21,36, 50,48,37). With F 0. 95 criterion, the model has 23 covariates. Interestingly, only one covariate is selected by robust FS and SW (with either F0.9 or F0,95 criterion): Covariate 1. The reason for  this drastic difference  is as follows.  
The robust correlation of  Y  and Covariate 1 is 0.86, while the classical correlation between these variables is only 0.42. About 86% of  the values of  the response variable and 88% of  the values of  Covariate 1 are equal to zero. There are many zeroes in other covriates as well. Classical methods fail  to identify this unusual pattern in the data and therefore  are unable to select a parsimonious model that fits  well the majority the data (as opposed to all the data). The robust methods, on the other hand, successfully  detect the unusual pattern and select a model capable of predicting well 90% of  the data as explained below. We fitted  the selected classical and robust models using the training' data, and then used them to predict the test data outcomes. The 5% and 10% trimmed means of  squared prediction errors for  the classical and (robust) models are: 0.012 (0.043) and 0.008 (0.005), respectively. That is, the robust model with only one covariate predicts 90% of  the data better than the classical model with 25 covariates. To illustrate the scalability of  our algorithm we also used a training sample of  size n = 25000. This time, classical FS and SW select a model of  30 covariates, and robust FS and SW both select one covariate, in this case covariate 2 instead of  covariate 1. (Covariates 1 and 2 have robust correlations 0.82 and —0.85 with Y, respectively.) 2.7 Conclusion The main contribution of  this chapter is that we developed robust step-by-step algorithms as one-step model-building procedures. Classical step-by-step algorithms FS and SW are popular and computationally suitable, but they are sensitive to outliers. We expressed these algorithms in terms of  sample means, variances and correlations, and obtained simple robust versions of  FS and SW by replacing these sample quantities by their robust counterparts. We used robust partial F-tests for  stopping during the implementation of the proposed robust algorithms. For the construction of  the robust correlation matrix of  the required covariates we used a pairwise approach, because it is both computationally suitable, and more consistent with the idea of  step-by-step algorithms. We used robust correlations derived from  a simplified  version of  Maronna's bivariate M-estimator of  the scatter matrix. Our robust methods have much better performance  compared to the standard FS and SW algorithms. Also, they are computationally very suitable, and scalable to large dimensions. Chapter 3 Two-step Model Building: Robust Sequencing with Least Angle Regression 3.1 Introduction In this chapter, we will consider the first  step (sequencing) of  the two-step model building procedure. The candidate covariates will be sequenced to form  a list such that the good predictors are likely to appear at the beginning of  the list . The first  m covariates of  the list will form  the reduced set which will be studied further  in the next chapter. We need a suitable step-by-step algorithm to sequence the covariates! We focus  on the powerful  algorithm recently proposed by Efron,  Hastie, Johnstone and Tibshirani (2004), which is called Least Angle Regression (LARS). LARS is computationally effi-cient and has been shown to have clear statistical advantages over other step-by-step algorithms. • Since LARS is based on sample means, variances and correlations (as will be shown later), it yields poor results when the data are contaminated. This is a potentially serious deficiency.  
Therefore, we propose several approaches to strengthen the robustness properties of LARS without affecting its computational efficiency too much, and compare their behavior.

The rest of this chapter is organized as follows. In Section 3.2, we review LARS in detail. In Section 3.3, we express the LARS procedure in terms of the correlation matrix of the data. In Section 3.4, we illustrate LARS' sensitivity to outliers and introduce two different approaches to robustify LARS. A small simulation study is also presented here to compare the performance and the computing time of LARS to those of the two robust approaches. In Section 3.5, we investigate the selection of the size of the reduced set of candidate predictors. Section 3.6 proposes to use bootstrap to stabilize the results obtained by robust LARS. Section 3.7 introduces "learning curves" as a graphical tool to choose the size of the reduced set. Section 3.8 contains some real-data applications. Section 3.9 concludes, and the chapter appendix contains some technical derivations.

3.2 Review: Least Angle Regression (LARS)

Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004), is closely related to another new algorithm called Forward Stagewise (Hastie, Tibshirani and Friedman 2001, Chapter 10). To better understand LARS, we will review the Forward Stagewise procedure in detail.

3.2.1 Forward Stagewise procedure (Stagewise)

The Forward Stagewise procedure (Stagewise) is related to the classical algorithm Forward Selection (FS). In FS, when the first predictor (X_1, say) is selected, all other predictors are regressed on X_1, and the residual vectors compete for the next entrance into the model. This causes a problem. Important predictors that happen to be correlated with X_1 are, in many cases, eliminated from the competition, which researchers usually want to avoid. In this sense, FS is an aggressive model-building algorithm.

The Forward Stagewise procedure is a less aggressive version of FS. Unlike FS, the Stagewise procedure takes many tiny steps to move towards a final model. We take the zero vector as the initial prediction. If X_1 has the largest absolute correlation with Y, we modify our prediction by moving a 'small' step in the direction of X_1. We obtain the new residual vector and repeat the process, until the required number of predictors are selected. The goal is to obtain the order in which the variables enter the model.

We can assume, without loss of generality, that the covariates have mean 0 and variance 1, and the response has mean 0. Let ε be a positive constant, typically small (less than the absolute values of the regression coefficients). The Stagewise algorithm can be described as follows:

1. Set the prediction vector μ = 0.
2. Calculate c_j = X_j'(Y − μ), j = 1, ..., k, where c_j is proportional to the correlation between X_j and the current residual.
3. Let m = argmax_j |c_j|. Modify the current prediction vector as follows: μ ← μ + ε sign(c_m) X_m.
4. Repeat steps 2 and 3.

At each step, the algorithm updates the prediction, and keeps track of the sequence of covariates as they enter the model. Notice that at each Stagewise step, we maximize the correlation of the current residual vector with a covariate, which is equivalent to minimizing the 'local loss'

Σ_{i=1}^{n} (Y_i − μ_i − β_j X_{ji})²   (3.1)

over all j. (Because of standardization, the minimizing β_m is proportional to c_m.)
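To make the mechanics of this loop concrete, the following minimal R sketch (illustrative only, not code from the thesis; the function name, the fixed step budget and the simulated example are our own assumptions) sequences standardized covariates using exactly the Stagewise updates described above.

```r
# Minimal Forward Stagewise sequencing sketch (assumes X has standardized
# columns and Y is centered, as in the text).
stagewise_sequence <- function(X, Y, eps = 0.01, max_steps = 5000) {
  n <- nrow(X); d <- ncol(X)
  mu <- rep(0, n)                     # current prediction vector
  order_entered <- integer(0)         # covariates in order of first entry
  for (step in seq_len(max_steps)) {
    cj <- drop(crossprod(X, Y - mu))  # c_j = X_j'(Y - mu), proportional to correlations
    m  <- which.max(abs(cj))          # covariate most correlated with the residual
    if (!(m %in% order_entered)) order_entered <- c(order_entered, m)
    mu <- mu + eps * sign(cj[m]) * X[, m]   # small step in the direction of X_m
    if (length(order_entered) == d) break   # all covariates have entered
  }
  order_entered
}

# Example usage with simulated standardized data:
set.seed(1)
X <- scale(matrix(rnorm(100 * 6), 100, 6))
Y <- drop(X[, 1:3] %*% c(7, 5, 3)) + rnorm(100)
Y <- Y - mean(Y)
stagewise_sequence(X, Y, eps = 0.05)
```

The choice of the step size eps matters, as discussed next.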
Stagewise and FS: loss comparison Both Stagewise and FS select the same variable in their first  steps, because they minimize the same loss function.  Suppose, the selected variable is Xi, i.e., its loss (li — PiXu)2 is minimum. Let us consider that, after  many e-steps along Xi, Stagewise is about to choose a second variable from  the contenders ..., Xj. On the other hand, for  the second step, FS considers ..., Zj, which are the residuals of  the corresponding co-variates after  being adjusted for  Xi. To choose the second variable, FS minimizes the loss n &—PiXu—PjZji) 2, i-1 which is same as the loss n (3.2) i=i Note that (3{  is usually different  from  /?i. The loss used in Stagewise is n 2 E - - to) • (3.3) This loss depends on our position on the Xi-vector controlled by e. By choosing a small e, we ensure that f3\  < /?*, so that the variables correlated with X\ have more chance of  staying in the competition. (It should be noted that we are not 'fitting'  a model. We are 'selecting' the covariates.) The minimizer of  the Stagewise loss cannot beat the minimizer of  the FS loss (see equation (3.2)), at least at this stage. This means, if  FS chooses (Xi, X 2) and Stagewise selects (Xi, X3), then FS will yield a greater value of  R2. Because, FS technically considers the residual sum of  squares of  the final  fit  (equation (3.2)). However, this is not necessarily true for  the next stage if  FS selects (X^ X 2, X4) (for  example), and Stagewise selects (Xi, X3, X5). (Because, this Stagewise combination has not been considered by FS so far  in any order.  In other words, FS has taken a different path.) Greater R2 does not necessarily imply more prediction accuracy. Moreover, FS cannot guarantee greater R2 for  a particular subset size in all cases. Therefore,  orthog-onalization of  the subsequent covariates with respect to the active ones is not usually meaningful.  This is why researchers often  prefer  Stagewise to FS. Stagewise and Boosting Boosting, originally developed for  classification  problems, is a procedure that combines the output of  many "weak" classifiers  to produce a powerful  "committee" (Hastie et al. 2001, Chapter 10). In the general setup, boosting is a way of  predicting Yi by combining a set of  simple "basis" functions  additively: K f(x)  = ^f3 kb(x,cx k), k = l,...,K,  (3.4) k=l where (5 k are the expansion coefficients,  and b(x,a k) are real-valued functions  of  multi-variate x with parameters cx.k. Usually, f(x)  is fitted  by considering a loss function J^LU^Mxi,*,)], (3.5) i=i \ k=i J which is minimized with respect to fa's  and cc^'s. Often,  this is computationally intensive, and the solution to (3.5) is approximated by sequentially adding new basis functions without adjusting the parameters and expansion coefficients  of  the existing ones. In this approach, at each iteration k, one updates fk(xi)  = f k-i{xi)  + Pkb(xi,  ak) by minimizing the loss n ^ L {Yi, f k-i(xi) + pkb(xi, ak)). (3.6) i=i In Regularized  boosting,  one uses a parameter e to control the "learning rate" of  the boosting procedure: fk{xi)  = fk-i(xi)  + t(3 kb(xi,  OLk). (3.7) The Stagewise algorithm discussed before  is very similar to this regularized boosting, where we use squared error loss for  L in (3.6), K  (see (3.4)) is the number of  e-steps in Stagewise, b(xi,a k) = xki (the .ith observation of  covariate X( k) chosen in the kth iteration, not necessarily same as Xk), and fa  in (3-7) is replaced by sign{/?^}. 
Choice of  an appropriate e for  Stagewise The choice of  an appropriate e is a problem with the Stagewise procedure, and is the motivation for  LARS. If  e is 'small', the number of  Stagewise steps to reach a final  model may be very large, increasing the computational burden of  the algorithm. On the other hand, if  e is 'large', we have either or both of  the following  two problems: 'Incorrect' ordering of  predictors: If  e —» |cm|, Stagewise will aggressively throw covariates correlated with X m out of  the competition. A closed loop: In many cases, after  the selection of  the mth (say) covariate, the remaining predictors (that are not yet selected) have very small correlations with the current residual vector. Suppose that the correlation of  the currently selected predictor X m is positive. If  e is large, when we make an 'e-step' in the direction of  '+X m\ the correlation of  X m with the updated residuals becomes negative and larger in absolute value than that of  any other competitor. Thus, the next Stagewise step is to make an 'e-step' in the direction of  l—Xm\ These back-and-forth  movements may go on endlessly. Even when there is no closed loop, the Stagewise procedure may require hundreds of  tiny steps to reach the final  model. Therefore,  this algorithm is not computationally suitable. LARS overcomes this problem by taking a mathematical approach. 3.2.2 The LARS algorithm LARS uses a mathematical formula  to accelerate the computations in the Stagewise procedure. Suppose, the first  selected predictor in Stagewise is Xi (i.e., X\ has the largest absolute correlation with Y).  If  we choose a 'small' e, there will be at least several Stagewise steps in the direction of  the vector Xx.  A second predictor X 2 (say) will come in the picture as soon as we cross a certain point in the direction of  X x, a point at which both X\  and X 2 have equal absolute correlation with the residual. LARS uses a mathematical formula  to determine that point, and the prediction is modified  by making a move up to that point in a single step. In Stagewise, when a second predictor X 2 enters the model for  the first  time, we make a few  small steps in the direction of  X 2, but then X\  becomes more correlated with the residual, and we move in the direction of  X x. Thus, we alternate between the two directions, technically maintaining approximately equal absolute correlations of  X\ and X 2 with the residual (until a third predictor comes into the picture). LARS, on the other hand, mathematically determines a direction that has equal angle (correlation) with Xi and X 2, and makes the second LARS move along that direction upto a point (determined mathematically, again) at which a third predictor X 3 has equal absolute correlation with the residual vector, and so on. For the original LARS algorithm, Efron  et al. (2004) is referred  to, which is designed to get the modified  predictions at each step, in addition to the sequence of  the covariates as they enter the model. In Section 3.3 we show that, if  we are interested in the ordering of  the covariates only (and not the modified  predictions), the algorithm can be expressed in terms of  the correlation matrix of  the data (and not the observations themselves). 3.2.3 LARS and Shrinkage methods LARS and Ridge Regression If  there are many correlated variables in a linear model, the estimates may show high variance. This can be prevented by imposing a restriction on the size of  the coefficients. 
The ridge regression (Hoerl and Kennard 1970) minimizes a penalized residual sum of squares i=i j=i where A is the parameter that controls the amount of  shrinkage. Ridge regression shrinks the coefficients  towards zero, but does not set some coeffi-cients exactly equal to zero. Therefore,  it is not suitable for  subset selection, and cannot be compared to LARS. By imposing a different  penalty, the Lasso algorithm (Tibshirani 1996) forces  some of  the coefficients  to zero, which is presented below. LARS and Lasso The Lasso (Tibshirani 1996) estimates are obtained by minimizing n d Y J(Y i-pxiy + \J2\fri  (3-8) i=1 j=l Moderate to large A will cause some of  the Lasso coefficients  to be exactly zero, others will be smaller in magnitude than the corresponding least squares estimates. • i Interestingly, the estimates (and the sequance of  the covariates) obtained by LARS and Lasso are usually quite close, if  not the same. The reason has not been established mathematically, though it is clear that both algorithms can be viewed as less aggressive versions of  the FS procedure. Efron  et al. (2004) suggested a modification  in the LARS algorithm that will yield the Lasso solution, which is as follows.  Let j3 be the current Lasso estimate, and fi  = X/3.  Then, for  the Lasso estimates, sign(^j) = sign (corr(Y - ft,  Xj)), which is not necessarily true for  the LARS estimates. This restriction should be enforced in the LARS algorithm if  we want the Lasso solution. To better understand the LARS-Lasso relationship, let us consider the following definition  of  the Lasso estimate, which is equivalent to (3.8). n ^Lasso = argmin^(Y  - f3' X i)2, (3.9) i=1 d subject to E \/3j\  <t. 3=1 The 'tuning parameter't is varied over a certain range. If  t > Y? j=1 \Pf\,  where Pf  are the least squares estimates, then the Lasso estimates are the least squares estimates. For a LARS-Lasso comparison, suppose that LARS has selected the first  covariate SiA^. (LARS considers the 'signed covariates' to determine the equiangular vectors later.) To select the second covariate, LARS mathematically determines the minimum distance 7 to move along Si-X"i so that a new covariate s2X 2 (say) is equally correlated with the residual vector. We may assume that 7 is determined first  (as in Stagewise, where we make many e-steps to obtain Pe (see 3.3)) so that LARS loss can be written as n n £ - - PjX jz)2 = E ( Yi  - IsXu  - PjXji) 2 ; (3.10) i=l i=1 where j = 2, ..., d, and 7s = s 17 is a restricted regression coeffcient.  Since |7S| is the minimum distance (determined mathematically) to move before  a second variable enters the model, a comparison of  (3.10) and (3.9) makes it evident that, in the Lasso algorithm, t < 17S| =>• only Xi  is in the model (only is nonzero). In LARS, when we have two active covariates Si-Xi and s2X2, we modify  our prediction by moving along the equiangular vector Ba upto a point 74, so that the LARS loss has the form n (Yi ~ IsXu ~ JABAi - PjXji)2 i=1 n = (Yi  - %Xu  - Ja(wiS 1Xu + w2s2X 2i) - PjXjif i=i n = ( Y i ~ + 1aWi)siXu  - -yAW2S 2X2i  - PjXji) 2 , (3.11) where j = 3, ..., d, and w\ and w2 are given by (3.21) (see Chapter Appendix, Sec-tion 3.10.2). Again, since 7a is the minimum distance to move along Ba before  a third covariate comes in the picture, by comparing (3.11) and (3.9) we can say |7s| < t < |7«| + \^a(wi  + w2) \ =>• only Xi  and X 2 will be in the model, and so on. Note that 1(7 + 7^^1)51! 
+ |7^W2S2| = |7S| + |7a(^i +^2)!, since 7, 7^, wx and w2 are all positive at this stage. (The Wj  may not be all positive for  higher dimensions.) Thus, it is not surprising that LARS and Lasso sequences agree in most cases. However, Lasso requires a computationally expensive quadratic programming technique to obtain the estimates, while, the computational cost of  LARS is comparable to the ordinary least squares applied to the full  set of  covariates. LARS and Boosting For a comparison of  Stagewise and boosting we refer  to Section 3.2.1. Since LARS is a mathematical solution of  the Stagewise problem, the LARS algorithm maybe considered as a mathematical alternative to a regularized boosting algorithm. 3.3 LARS expressed in terms of  correlations In this section, we show that the sequence of  covariates obtained by LARS can be derived from  the correlation matrix of  the data (without using the observations themselves). Let Y,Xi,...  ,X d be the variables, standardized using their mean and standard de-viation. Let vjY denote the correlation between Xj and Y, and Rx be the correlation matrix of  the covariates X\,...,  Xa-  Suppose that X m has the maximum absolute corre-lation r with Y  and denote sm = sign(rmy). Then, X m becomes the first  active variable and the current prediction jx 0 should be modified  by moving along the direction of smX m upto a certain distance 7 that can be expressed in terms of  correlations between the variables (see Chapter Appendix, Section 3.10.1, for  details). By determining 7, LARS simultaneously identifies  the new covariate that will enter the model, that is the second active variable. V- ' As soon as we have more than one active variable, LARS modifies  the current prediction along the equiangular  direction,  that is the direction that, has equal angle (correlation) with all active covariates. Moving along this direction ensure that the current correlation of  each active covariate with the residual decreases equally. Let A be the set of  subscripts corresponding to the active variables. In Chapter Appendix (Section 3.10.2) the standardized equiangular vector Ba is derived. Note that we do not need the direction BA itself  to decide which covariate enters the model next. We only need the correlation of  all variables (active and inactive) with Ba- These correlations can be expressed in terms of  the correlation matrix of  the variables as shown in Chapter Appendix (Section 3.10.2). LARS modifies  the current prediction by moving along Ba upto a certain distance 7a which, again, can be determined from  the correlations of  the variables (see Chapter Appendix, Section 3.10.3). Thus, the sequence of  covariates obtained by the LARS algorithm is a function  of the correlation matrix of  the standardized data. We now summarize the LARS algorithm in terms of  correlations rjy between Xj  and Y,  and the correlation matrix Rx of  the covariates: 1. Set the active set, A = 0, and the sign vector sa = 0-2. Determine m = argmax \rjY\,  and sm = sign{rmy}. Let r = smrmy. 3. Put A <-r  A U {m},  and '<— sA U {sm}. 4. Calculate a = [1 'a(DaRaDa)~ 1^-a\~1^ , where 1 a is a vector of  l's, Da = diag(s^), ? and Ra is the submatrix of  Rx corresponding to the active variables. Calculate wA = a (D aRaDa)~11a, and aj = (DATjjCl'wA,  for  j G Ac, where rj A is the vector of  correlations between Xj and the active variables. (Note that, when there is only one active covariate X m, the above quantities simplify  to a = 1, w = 1, and aj = rjm•) 5. 
For j G Ac, calculate 7+ = (r — rjY)/(a  — aj), and 7" = (r + rjY)/(a  + aj), and let 7j = min(7j",7~). Determine 7 = min{7j, j £ A0}, and m, the index corresponding to the minimum 7 = jm. If  7m = 7+, set sm = +1. Otherwise, set sm = —1. Modify  r <— r — 7a, and r^y r^y — 70,-, for  j €  Ac. 6. Repeat steps 3, 4 and 5. 3.4 Robustification  of  LARS From the results in Section 3.3, it is not surprising to see that LARS is sensitive to contamination in the data. To illustrate this, we use a dataset on executives obtained from  Mendenhall and Sincich (2003). The annual salary of  100 executives is recorded as well as 10 potential predictors. (7 quantitative and 3 qualitative) such as education, experience etc. We label the candidate predictors from  1 to 10. LARS sequences the covariates in the following  order: (1, 3,4, 2, 5,6, 9, 8,10, 7). We contaminate the data by replacing one small value of  predictor 1 (less than 5) by the large value 100. When LARS is applied to the contaminated data, we obtain the following  completely different sequence of  predictors: (7,3,2,4,5,1,10,6,8,9). Predictor 7, which was selected last (10th) in the clean data, now enters the model first.  The position of  predictor 1 changes from  first  to sixth. Predictors 2 and 4 interchange their places. Thus, changing a single number in the data set completely changes the predictor sequence, which illustrates the sensitivity of  LARS to contamination. We now introduce two approaches to robustify  the LARS procedure which we call the plug-in  and cleaning  approaches respectively. 3.4.1 Robust Plug-in The plug-in approach consists of  replacing the non-robust building blocks of  LARS (mean, variance and correlation) by robust counterparts. The choices of  fast  computable robust center and scale measures are straightforward:  median (med) and median absolute devi-ation (mad). Unfortunately,  good available robust correlation estimators are computed from  the (i-dimensional data and therefore  are very time consuming (see Rousseeuw and Leroy 1987). Robust pairwise approaches (see Huber 1981) are not affine  equivariant and therefore  are sensitive to two-dimensional outliers. One solution is to use robust correla-tions derived from  a pairwise affine  equivariant covariance estimator. A computationally efficient  choice is a bivariate M-estimator as defined  by Maronna (1976). Alternatively, a bivariate correlation estimator can be computed from  bivariate Winsorized data. Both methods will be explained in detail below. M Plug-in Maronna's bivariate M-estimator of  the location vector t and scatter matrix V  is defined in Chapter 2. It is affine  equivariant and computationally efficient,  and has breakdown point 1/3 in two dimensions. As before,  to further  simplify  computations, we used the coordinatewise median as the bivariate location estimate and only solved (2.18) to estimate the scatter matrix and hence the correlation. In this equation we used the function  u2(t)  = min(c/i, 1) with c = 9.21, the 99% quantile of  a x2 distribution. The bivariate correlations are then ensembled to form  a d  x d  correlation matrix R. Finally, LARS is applied to this robust correlation matrix. We call this the M plug-in method. -3-2-10123 variable 1 Figure 3.1: Limitation of  separate univariate-Winsorizations (c = 2). The bivariate outliers are left  almost unchanged. W Plug-in For very large, high-dimensional data we need an even faster  robust correlation es-timator. 
Huber (1981) introduced the idea of one-dimensional Winsorization of the data, and suggested that classical correlation coefficients be calculated from the transformed data. Alqallaf, Konis, Martin and Zamar (2002) re-examined this approach for the estimation of individual elements of a large-dimension correlation matrix. For n univariate observations x_1, x_2, ..., x_n, the transformation is given by

u_i = ψ_c((x_i − med(x)) / mad(x)), i = 1, 2, ..., n,

where the Huber score function ψ_c is defined as ψ_c(x) = min{max{−c, x}, c}, with c a tuning constant chosen by the user, e.g., c = 2 or c = 2.5. This one-dimensional Winsorization approach is very fast to compute but unfortunately it does not take into account the orientation of the bivariate data. It merely brings the outlying observations to the boundary of a 2c × 2c square, as shown in Figure 3.1. This plot clearly shows that the univariate approach does not resolve the effect of the obvious outliers at the bottom right, which are shrunken to the corner (2, −2) and thus are left almost unchanged.

To remedy this problem, we propose a bivariate Winsorization of the data based on an initial tolerance ellipse for the majority of the data. Outliers are shrunken to the border of this ellipse by using the bivariate transformation

u = min(√(c/D(x)), 1) x,  with x = (x_1, x_2)'.

Here D(x) is the Mahalanobis distance based on an initial bivariate correlation matrix R_0. For the tuning constant c we used c = 5.99, the 95% quantile of the χ²_2 distribution. We call this the W plug-in method. The choice of R_0 will be discussed below. Figure 3.2 shows bivariate Winsorizations for both the complete data set of Figure 3.1 and the data set excluding the outliers. The ellipse for the contaminated data is only slightly larger than that for the clean data. By using bivariate Winsorization the outliers are shrunken to the boundary of the larger ellipse.

The initial correlation estimate. Choosing an appropriate initial correlation matrix R_0 is an essential part of bivariate Winsorization. For computational simplicity we can choose the estimate based on univariate Winsorization explained above. However, we propose an adjusted Winsorization method that is more resistant to bivariate outliers. This method uses two tuning constants: a tuning constant c_1 for the two quadrants that contain the majority of the standardized data and a smaller tuning constant c_2 for the other two quadrants. For example, c_1 is taken equal to 2 or 2.5 as before, and c_2 = h c_1, where h = n_2/n_1 with n_1 the number of observations in the major quadrants and n_2 = n − n_1. We use c_1 = 2 in this chapter.

Figure 3.2: Bivariate Winsorizations for clean and contaminated data. The ellipse for the contaminated data is only slightly larger than that for the clean data.

Figure 3.3 shows how the adjusted Winsorization deals with bivariate outliers, which are now shrunken to the boundary of the smaller square. Thus, adjusted Winsorization handles bivariate outliers much better than univariate Winsorization. The initial correlation matrix R_0 is obtained by computing the classical correlation matrix of the adjusted-Winsorized data. It should be mentioned here that, though we used c_2 = h c_1 in this study, a more reasonable choice would have been c_2 = √h c_1 (i.e., c_2² = h c_1²), because the areas of the two squares should be proportional to the number of observations they contain.
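The two transformations can be sketched as follows in R. This is an illustrative reading of the description above, not code from the thesis: the function names adj_winsor_cor and biv_winsor_cor, the use of the coordinatewise median and MAD for standardization, and the way the major quadrants are identified are our own choices.

```r
# Adjusted-Winsorized correlation of two variables (illustrative sketch).
# Data are standardized with median/MAD; the two "major" quadrants get the
# tuning constant c1 and the two minor quadrants get the smaller constant c2.
adj_winsor_cor <- function(x, y, c1 = 2) {
  zx <- (x - median(x)) / mad(x)
  zy <- (y - median(y)) / mad(y)
  major <- if (sum(zx * zy > 0) >= length(zx) / 2) (zx * zy > 0) else (zx * zy <= 0)
  n1 <- sum(major); n2 <- length(zx) - n1
  c2 <- (n2 / n1) * c1                        # c2 = h * c1 with h = n2 / n1
  cc <- ifelse(major, c1, c2)                 # quadrant-dependent tuning constant
  psi <- function(z, c) pmin(pmax(z, -c), c)  # Huber score (Winsorizing) function
  cor(psi(zx, cc), psi(zy, cc))               # classical correlation of Winsorized data
}

# Bivariate Winsorization: shrink points to a tolerance ellipse based on an
# initial correlation r0 (here taken from adj_winsor_cor), then compute the
# classical correlation of the shrunken data.
biv_winsor_cor <- function(x, y, c = 5.99) {
  zx <- (x - median(x)) / mad(x)
  zy <- (y - median(y)) / mad(y)
  r0 <- adj_winsor_cor(x, y)
  R0inv <- solve(matrix(c(1, r0, r0, 1), 2))
  D <- rowSums((cbind(zx, zy) %*% R0inv) * cbind(zx, zy))  # Mahalanobis distances
  shrink <- pmin(sqrt(c / D), 1)
  cor(zx * shrink, zy * shrink)
}
```

In this sketch c = 5.99 corresponds to the 95% quantile of the χ²_2 distribution mentioned above, and c2 = h c1 follows the choice used in this chapter.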
Figure 3.3: Adjusted-Winsorization (for initial estimate R_0) with c_1 = 2, c_2 = 1. The bivariate outliers are now shrunken to the corner of the smaller square.

Note that the correlations based on both univariate- and adjusted-Winsorized data can be computed in O(n log n) time. The adjusted-Winsorized estimate takes slightly more time for a particular n, but is much more accurate in the presence of bivariate outliers as shown above. The bivariate-Winsorized estimate and Maronna's M-estimate also require O(n log n) time, but Maronna's M-estimate has a larger multiplication factor depending on the number of iterations required. Thus, for large n, the bivariate-Winsorized estimate is much faster to compute than Maronna's M-estimate. Figure 3.4 shows, for each of the four correlation estimates, the mean cpu times in seconds (based on 100 replicates) for 5 different sample sizes: 10000, 20000, 30000, 40000 and 50000. These results confirm that the bivariate-Winsorized estimate is faster to compute than Maronna's M-estimate and that the difference increases with sample size. Numerical results (not presented here) showed that the bivariate-Winsorized estimate is almost as accurate as Maronna's M-estimate, also in the presence of contamination. Note that both the univariate-Winsorized and adjusted-Winsorized correlations are very fast to compute.

Figure 3.4: Numerical complexity of different correlation estimates. Each estimate can be computed in O(n log n) time, but Maronna's estimate has a larger multiplication factor.

3.4.2 Robust Data Cleaning

If the dimension d is not extremely large, an alternative approach to robustifying LARS is to apply it on cleaned data. For example, each standardized d-dimensional data point x = (x_1, ..., x_d)' can be replaced by its Winsorized counterpart u = min(√(c/D(x)), 1) x in the d-dimensional space. Here D(x) = x'V^{-1}x is the Mahalanobis distance of x based on V, a fast computable, robust initial correlation matrix. A reasonable choice for the tuning distance c is c = χ²_d(0.95), the 95% quantile of the χ²_d distribution.

The initial correlation matrix V. The choice of the initial correlation matrix V is an essential part of the Winsorization procedure. Most available high-breakdown, affine-equivariant methods are inappropriate for our purposes because they are too computationally intensive. Therefore, we resort to pairwise approaches, that is, methods in which each entry of the correlation matrix is estimated separately (see Alqallaf et al. 2002). As before, we will use a bivariate M-estimator or the bivariate Winsorized estimator to calculate the correlations in V. The resulting methods are called M cleaning and W cleaning, respectively.

3.4.3 Simulations

To investigate the performance and stability of the four proposed methods we consider a simulation study involving a small number of variables. We used the following design (see Ronchetti et al. 1997). The error distributions considered are (e1) standard normal, (e2) 93% from standard normal and 7% from N(0, 5²), (e3) standard normal divided by a uniform on (0,1), and (e4) 90% from standard normal and 10% from N(30, 1). Two design matrices are considered: the uniform design for which the columns are generated from a uniform distribution on (0,1), and the leverage design which is the same as the uniform design except that it contains a leverage point.
Six variables are used, of which the first three are nonzero and in order of importance. The true regression coefficients for the nonzero variables are 7, 5 and 3, respectively. The sample size equals n = 60 and we generated 200 data sets for each setting. We used two performance measures which we call exact (E) and global (G). The exact measure gives the percentage of times a procedure sequences the important variables in front and in their true order. The global measure gives the percentage of times a procedure sequences the important variables in front in any order.

Table 3.1: Percentages of correct sequences obtained by classical and robust methods for uniform and leverage designs with 4 different error distributions.

                            Uniform                    Leverage
  Method             e1    e2    e3    e4        e1    e2    e3    e4
  LARS         E     97    86    11     8         0     1     1     2
               G    100    89    26    24         0     2     5     7
  M plug-in    E     95    97    53    87        96    96    49    87
               G     99    99    74    95        99    99    68    95
  W plug-in    E     96    97    58    78        92    85    46    59
               G     99    99    77    89        94    86    61    68
  M cleaning   E     96    98    55    89        96    97    50    87
               G     99    99    77    97       100    98    73    97
  W cleaning   E     96    98    54    82        96    94    52    83
               G     99    99    76    92        98    96    71    92

Table 3.1 shows the simulation results. For error distribution e1 (standard normal), the performance of the robust methods is almost as good as that of standard LARS. For the heavy-tailed distributions the robust methods drastically outperform LARS. Overall we see from Table 3.1 that the plug-in approaches are almost as stable as the computationally more expensive data cleaning approaches. Comparing the M and W approaches for both the plug-in and data cleaning procedures, it is reassuring to see that the computationally faster W approach (see Figure 3.5 below) is almost as stable as the M approach.

Numerical complexity of the algorithms

We now compare the computational complexity of the different methods. The standard LARS procedure sequences all d covariates in only O(nd²) time. The plug-in and cleaning procedures based on M-estimators both require O((n log n)d²) time. Based on Winsorization these procedures also require O((n log n)d²) time, but with a much smaller multiplication factor. Moreover, if we are only interested in sequencing the top fraction of a large number of covariates, then the plug-in approach will be much faster than the cleaning approach, because the plug-in approach only calculates the required correlations along the way instead of the 'full' correlation matrix. In this case, the complexity of the plug-in methods reduces to O((n log n)dm), where m is the number of variables being sequenced.

Figure 3.5 shows the mean cpu times based on 10 replicates for LARS, W plug-in and M plug-in for different dimensions d with a fixed sample size n = 2000. The times required by the cleaning methods are not shown because they were similar to the plug-in times since we sequenced all the covariates. As in Figure 3.4, we see that the approaches based on M-estimators are more time-consuming than the Winsorization approaches. The difference increases fast with dimension.

Figure 3.5: Numerical complexity of different techniques. LARS requires O(nd²) time. W plug-in and M plug-in both require O((n log n)d²) time, but M plug-in has a larger multiplication factor.

The cleaning approaches perform slightly better than the plug-in approaches when the number of variables is relatively small, and much smaller than the number of cases (see Table 3.1).
However, plug-in approaches are less time-consuming when only a part of the predictors are sequenced. Since W plug-in has a reasonable performance  compared to the other methods and has favorable  computing times, this method is to be preferred  for large, high-dimensional datasets. The performance  of  W plug-in will be studied further in the next sections and we will call this method robust LARS from  now on. 3.5 Size of  the reduced set To obtain a good final  model, it is important to choose an appropriate value of  m, the size of  the reduced set of  covariates kept from  the sequencing step. The reduced set should be large enough to include most of  the important covariates, but not so large as to make the segmentation step (where we have to evaluate all possible subsets of  the reduced set) impractical. Several factors  can be important when determining the size m such as d, the total number of  variables, the sample size n, the unknown number of  non-zero variables in the optimal model, the correlation structure of  the covariates, and of  course also time and feasibility  of  the segmentation step. For example, for  high-dimensional datasets, including only 1% of  the variables in the reduced set may make the segmentation step already infeasible. To investigate what values of  m are appropriate, we carry out a simulation study similar to Frank and Friedman (1993). The total number of  variables is d = 100. A small number a = 9 or a = 15 of  them are nonzero covariates. We considered 3 correlation structures of  these nonzero covariates: "no-correlation" case, "moderate-correlation" case and "high-correlation" case, which are described below. For the no-correlation case (a true correlation of  0 between the covariates), inde-pendent covariates Xj  ~ AT(0,1) are considered, and Y  is generated using the a non-zero covariates, with coefficients  (7,6, 5) repeated three times for  a = 9, and five  times for a = 15. The variance of  the error term is chosen such that the signal-to-noise ratio equals 2. For the moderate-correlation and high-correlation cases, we consider 3 independent 'unknown' processes, represented by latent variables L,, i = 1,2, 3, which are responsible for  the systematic variation of  both the response and the covariates. The model is Y = 5Li + 4L2 + 3L3 + e = Signal + e, (3.12) where Lj ~ iV(0,1), and e is a normal error not related to the latent variables. The variance of  e is chosen such that the signal-to-noise ratio equals 2, that is Var(e) = 50/4. The nonzero covariates are divided in 3 equal groups, with each group related to exactly one of  the latent variables by the following  relation Xj  = Li + 5j, where 5j ~ N(0,  aj). The value of  aj determines the correlation structure of  the nonzero covariates. The high-correlation case has a true correlation of  0.9 between the covariates generated with the same latent variable, and the moderate-correlation case has a true correlation of  0.5. For each situation we generated 100 samples of  size n = 150. Outliers were added by giving the noise term a large positive mean (asymmetric error). We considered four different  levels of  contamination: 0,5,10 and 20%. For the high-correlation and moderate-correlation cases, though "a" of  the covari-ates are linked to the response Y through the latent variables, it is not clear which of these covariates should be considered important for  explaining Y. 
Even when the true pairwise correlations of  the covariates are zero (no-correlation case), the "best" model not necessarily includes all of  the a non-zero coefficients  because of  the bias-variance trade-off.  Therefore,  for  each simulated dataset we first  find  the "best" model among all possible subsets of  the non-zero covariates that has the minimum prediction error estimated by 5-fold  cross-validation. -1 1 1 1 1— 0.05 0.10 0.15 0.20 0.25 Proportion of variables in reduced set Figure 3.6: Recall curves for  a — 9; (a) no correlation (b) low correlation (c) high correlation. The 4 curves for  (robust) LARS correspond to 4 levels of  contamination. For each simulated dataset, we determine the "recall proportion", i.e., the propor-tion of  important variables (in the sense that they are in the "best" model by cross-validation) that are captured (recalled) by LARS/robust LARS for  a fixed  size of  the reduced sequence. For a = 9, Figure 3.6 plots the average recall proportion against the size of  the reduced set for  the three correlation structures. In each plot, the 4 curves with the same line type correspond to the 4 levels of  contamination, higher curves correspond to lower levels of  contamination. These plots show that, for  each correlation structure considered, we can capture the important variables if  the percentage of  variables in the reduced set is 9 or 10. Robust LARS performs  as good as LARS for  clean data, and much better than LARS for  contaminated data. Figure 3.7 plots the average recall proportion against the size of  the reduced set for the moderate-correlation case with a = 15. This plot can be compared with Figure 3.6(b) to see how the increase in the number of  nonzero variables affects  the recall proportions. In both cases, we observe that the average recall proportions stop increasing even before the size m of  the reduced set exceeds the number a of  non-zero variables. 3.6 Bootstrapped sequencing To obtain more stable and reliable results we can combine robust LARS with bootstrap. Therefore,  we generate a number B of  bootstrap samples from  the dataset, arid use robust LARS to obtain the corresponding sequence of  covariates for  each of  these bootstrap samples. Each sequence ranks the covariates from  1 to d. For each covariate we can take Proportion of variables in reduced set Figure 3.7: Recall curves for  a = 15 and moderate correlation with 4 different  levels of contamination. the average over these B ranks, and the m covariates with the smallest average ranks then form  the reduced set. When resampling from  a high-dimensional dataset (compared to the sample size, e.g., n — 150, d — 100) the probability of  obtaining singular samples becomes very high. Note that even the original sample may already be singular or the dimension d of  the data may exceed the sample size. In these cases it will be impossible to sequence all covariates. We can easily overcome this problem by sequencing only the first  mo < d of  the covariates for  each bootstrap sample, where preferably  mo > m. We then rank the covariates according to the number of  times (out of  B) they are actually sequenced. When ties occur, the order of  the covariates is determined according to the average rank in the sequences. In our simulations, we generated B = 100 bootstrap samples from  each of  the 100 simulated datasets. We sequenced the first  25 covariates in each bootstrap sample. 
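A minimal sketch of this rank-aggregation step is given below (illustrative R code, not from the thesis). It assumes a sequencing routine, here called sequence_fun, that returns the indices of the first m0 sequenced covariates for a given sample, such as the robust LARS procedure.

```r
# Bootstrapped sequencing sketch: aggregate B bootstrap sequences by
# (1) how often a covariate is sequenced at all and (2) its average rank.
bootstrap_sequence <- function(X, Y, sequence_fun, B = 100, m0 = 25) {
  d <- ncol(X); n <- nrow(X)
  times_seq <- numeric(d)            # number of bootstrap samples sequencing each covariate
  rank_sum  <- numeric(d)            # sum of ranks (m0 + 1 is used when not sequenced)
  for (b in seq_len(B)) {
    idx   <- sample.int(n, n, replace = TRUE)
    seq_b <- head(sequence_fun(X[idx, , drop = FALSE], Y[idx]), m0)
    rks <- rep(m0 + 1, d)
    rks[seq_b] <- seq_along(seq_b)
    times_seq[seq_b] <- times_seq[seq_b] + 1
    rank_sum <- rank_sum + rks
  }
  # Rank covariates primarily by how often they were sequenced,
  # with ties broken by smaller average rank; the first m entries
  # of the returned ordering form the reduced set.
  order(-times_seq, rank_sum / B)
}
```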
Figure 3.8 shows the recall curves obtained by robust LARS (solid lines) and boot-strapped robust LARS (dotted lines) for  covariates with moderate correlation. The recall curves obtained by bootstrapped robust LARS perform  better than the initial robust LARS curves for  all levels of  contamination, the difference  being larger with larger con-tamination proportions. This confirms  that by applying the bootstrap we obtain more stable and reliable results. Even with 20% of  contamination, bootstrapped robust LARS with m = 10 (o = 9) or m = 15 (a  — 15) already yields a recall proportion around 90%. To investigate what minimum number of  bootstrap samples is required to obtain significant  improvement over robust LARS, we also tried B = 10, 20 and 50 in the above setups. In each case, B = 10 and B = 20 do not yield much improvement, while with B = 50 the results obtained are almost as stable as with B = 100. 3.7 Learning curves Although the simulation results in the previous sections suggested that it suffices  to select the size of  the reduced set equal to or slightly larger than the number of  predictors in the final  model, we usually have no information  about the number of  predictors that is needed. Hence, a graphical tool to select the size of  the reduced set would be useful.  The following plot can be constructed to determine a reasonable size for  the reduced set. Starting from a model with only 1 variable (the first  one in the sequence), we increase the number of (a) Proportion of variables in reduced set (b) Proportion of variables in reduced set Figure 3.8: Recall curves for  robust LARS and bootstrapped robust LARS for  covariates with moderate correlation; (a) a = 9 (b) a = 15. The 4 curves for  each method correspond to 4 levels of  contamination. variables according to the sequence obtained and each time fit  a robust regression model to compute a robust R2 measure such as R2 = 1 — Median(e2)/MAD2(F), where e is the vector of  residuals from  the robust fit.  We then plot these robust R2 values against the number of  variables in the model to obtain a learning  curve. The size of  the reduced set can be selected as the point where the learning curve does not have a considerable slope anymore. A problem that can occur with a robust R2 measure is that, unlike its classical coun-terpart, it is not always a nondecreasing function  of  the number of  covariates. This can be resolved as follows.  If  the robust R? at any step is smaller than that of  the preceding step, then fit  a robust simple linear regression of  the residuals from  the preceding step on the newly selected covariate. The residuals obtained from  this fit  can be used to compute another robust R2 value. We then use the larger of  the two values. To investigate the performance  of  learning curves, we consider a dataset on air pollution and mortality in 60 Metropolitan areas in the United States. The response variable is the age-adjusted mortality. There are 14 potential predictors, numbered from 1 to 14. Since row 21 contains 2 missing values, we drop this observation from  the data. Based on robust data exploration we identified  4 clear outliers that correspond to.the four  metropolitan areas in California.  We applied 5-fold  cross-validation (CV) to this dataset without the four  outliers, and obtained the "best model" that has the following 7 covariates: (2,3,4,6,7,10,13).. (The order of  the covariates is not relevant here.) 
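The learning-curve construction just described can be sketched in R as follows. This is an illustrative reading of the procedure, not code from the thesis: it uses MASS::lqs for a least median of squares fit, and the function names, the robust R² helper and the exact form of the monotonicity adjustment are our own choices.

```r
library(MASS)  # for lqs(): least median of squares regression

# Robust R^2 of a residual vector, relative to the scale of the response.
robust_r2 <- function(res, y) 1 - median(res^2) / mad(y)^2

# Learning curve: robust R^2 versus number of covariates, following the
# sequence seq_vars (e.g., from bootstrapped robust LARS).
learning_curve <- function(X, Y, seq_vars) {
  r2 <- numeric(length(seq_vars))
  prev_res <- Y - median(Y)
  for (p in seq_along(seq_vars)) {
    fit <- lqs(X[, seq_vars[1:p], drop = FALSE], Y, method = "lms")
    r2p <- robust_r2(residuals(fit), Y)
    if (p > 1 && r2p < r2[p - 1]) {
      # Robust R^2 need not be nondecreasing: refit the previous residuals
      # on the newly added covariate and keep the larger of the two values.
      fit2 <- lqs(X[, seq_vars[p], drop = FALSE], prev_res, method = "lms")
      r2p <- max(r2p, robust_r2(residuals(fit2), Y))
    }
    r2[p] <- r2p
    prev_res <- residuals(fit)
  }
  plot(seq_along(r2), r2, type = "b",
       xlab = "Number of variables in the model", ylab = "Robust R-squared")
  r2
}
```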
Bootstrapped robust LARS applied to this dataset (including the outliers) produced the sequence (7,5,13,4,6,3,2,10,9,1,14,11,8,12). We used this sequence and fitted Least Median of  Squares (Rousseeuw 1984) regressions to obtain the robust R2 values. Figure 3.9: Learning curve for  Pollution data. A reduced set of  8 covariates is suggested by the plot. Figure 3.9 shows the corresponding learning curve. This plot suggests a reduced set of size 8. It is encouraging to notice that the reduced set (first  8 covariates in the sequence above) contains all 7 predictors selected in the "best model" obtained by CV. 3.8 Examples In this section we use two real datasets to evaluate the performance  of  (bootstrapped) robust LARS. The demographic data example further  explores the idea of  "learning curves" to choose the size of  the reduced set. We then use a large dataset (protein data) to demonstrate the scalability as well as stability of  robust LARS. Demographic data. This dataset contains demographical information  on the 50 states of  the United States for  1980. The response variable of  interest is the murder rate per 100,000 residents. There are 25 predictors which we number from  1 to 25. Exploration of  the data using robust estimation and graphical tools revealed one clear outlier. We applied 5-fold  CV to this dataset without the outlier, and obtained the "best of  25" model that has the following  15 covariates (1, 2, 3, 5, 6, 8, 9,10,16,17,18,19, 21, 24, 25). Figure 3.10 shows the learning curve for  the Demographic data based on boot-strapped robust LARS. This plot suggests a reduced set of  size 12 which include the covariates: (22, 20, 4,15,10, 2,19, 25, 8,18, 6, 24). The boldface  numbers correspond to covariates in the sequence that are also in the model obtained by CV. The number of "hits" is 8 out of  12. a . 2 4 6 8 10 Number of variables in the model -1 r 12 14 Figure 3.10: Learning curve for  Demographic data. A reduced set of  12 covariates is suggested by the plot. We applied 5-fold  CV to the clean data using the reduced set of  size 12 obtained by bootstrapped robust LARS. The model selected in this case has the following  9 covariates: (22, 20,4,15, 2,10, 25,18,24). To compare this "best of  12" model with the "best of  25" model above, we estimated the prediction errors of  these two models 1000. times using 5-fold  CV. The two density curves are shown in Figure 3.11. The "best of  12" model has a mean error of  204.8 (median error 201.5) while the "best of  25" model has a mean error of  215.9 (median error 202.0). Also, the standard deviations (mads) of  the errors are 25.6 (22.7) and 74.6 (31.4), respectively. (Some of  the "best of  25" errors are very large and not included in the plot.) Thus, bootstrapped robust LARS gives more stable results in this high-variability dataset. It should be mentioned here that we needed almost 10 days to find  the "best of  25" model, while "best of  12" model requires less than 5 minutes including the time needed to sequence the covariates by bootstrapped robust LARS. (CV on m covariates is times faster  than CV on d covariates.) Figure 3.11: Error densities for  the two "best" models for  Demographic data. The "best of  12" model gives more stable result. Protein data. This dataset of  n — 145751 protein sequences was used for  the KDD-Cup 2004. Each of  the 153 blocks corresponds to a native protein, and each data-point of  a particular block is a candidate homologous protein. 
There are 75 variables in the dataset: the block number (categorical) and 74 measurement's of  protein features. We replace the categorical variable by block indicator variables, and use the first  feature as the response. Though this analysis may not be of  particular scientific  interest, it will demonstrate the scalability and stability of  the robust LARS algorithm. We used the package R to apply robust LARS to this dataset, and obtained a reduced set of  size 25 from  d = 225 covariates (152 block indicators + 73 features)  in only 30 minutes. Given the huge computational burden of  other robust variable selection procedures, our algorithm maybe considered extremely suitable for  computations of  this magnitude. For a thorough investigation of  the performance  of  robust LARS with this dataset, we select 5 blocks with a total of  n = 4141 protein sequences. These blocks were chosen because they contain the highest proportions of  homologous proteins (and hence the •highest proportions of  potential outliers). We split the data of  each block into two almost equal parts to get a training sample of  size n = 2072 and a test sample of  size n = 2069. The number of  covariates is d'=  77, with 4 block indicators (variables 1 — 4) and 73 features.  We apply bootstrapped robust LARS with B = 100 bootstrap samples and we sequence the first  25 variables of  each bootstrap sample. The resulting learning curve is shown in Figure 3.12. This plot suggests that a drastic reduction to a small number of  predictors can be performed,  e.g. m=5 or m=10. The first  10 predictors found  by bootstrapped robust Number of variables in the model (bootstrap sequence) Figure 3.12: Learning curve for  Protein data. A reduced set of  5 covariates is suggested by the plot. LARS are (14,13, 5, 76,73,8,7,40,46, 51). The covariates in this sequence are almost the same as those obtained with the whole dataset (not shown). The standard LARS produced the sequence (14,13,5,8,7,76,18,65,2,46). Note that the two sequences are quite different.  For example, if  we select a model from  the first  five  predictors, then only 3 predictors are contained in both sequences. Using MM-estimators and robust AIC, the best model selected from  the first  five  variables of  the robust sequence contains vari-ables (14,13, 5, 76) while the best model out of  the first  10 predictors contains variables (14,13, 5, 76,40). Hence only 1 variable is added. Using classical AIC, the best model selected from  the first  5 variables of  the LARS sequence contains variables (14,13, 5,8). Variable 76 of  the corresponding robust model is replaced by Variable 8. The best model from  the first  10 predictors contains variables (14,13, 5,8, 76, 2). Note that 2 variables are added to the list compared to 1 variable in the robust case. We fitted  the 4 best models using the training data, and then used them to predict the test data outcomes. The 1%, 5% and 10% trimmed means of  prediction errors for  the smaller robust (classical) model are : 114.92 (117.49), 92.77 (95.66) and 74.82 (78.19), respectively. The corresponding quantities for  the larger robust (classical) model are: 114.37 (115.46), 92.43 (94.84) and 74.34 (76.50), respectively. Notice that the robust models always outperform  the classical models. 3.9 Conclusion The main contribution of  this chapter is that we developed robust versions of  LARS to obtain a reduced set of  covariates for  further  investigation. 
We also introduced the idea of multivariate-Winsorization of the data (when the dimension is not too large). We can perform computationally suitable classical multivariate analyses on the transformed data to obtain reliable results. We also proposed a new robust correlation estimate for bivariate data which we called the "adjusted-Winsorized correlation estimate."

LARS is a very effective, time-efficient model building tool, but it is not resistant to outliers. We introduced two different approaches to construct robust versions of the LARS technique. The plug-in approach replaces the classical correlations in LARS by easily computable robust correlation estimates. The cleaning approach first transforms the dataset by shrinking the outliers towards the bulk of the data, and then applies LARS to the transformed data. Both approaches use robust pairwise correlation estimates which can be computed efficiently using bivariate Winsorization or bivariate M-estimates.

The data cleaning approach is limited in use because the sample size needs to be (much) larger than the number of candidate predictors to ensure that the resulting correlation matrix is positive definite. Moreover, the data cleaning approach is more time consuming than the plug-in approach, certainly when only part of the predictors is being sequenced. Since the plug-in approach has good performance, is faster to compute and is more widely applicable, we prefer this method. Comparing bivariate M-estimates with bivariate Winsorization, we showed that the latter is faster to compute, with important time differences when the number of candidate predictors becomes high.

We propose using the robust LARS technique to sequence the candidate predictors and as such identify a reduced set of most promising predictors from which a more refined model can be selected in a second segmentation step. We recommend combining W plug-in with bootstrap to obtain more stable and reliable results. The reduced sets obtained by bootstrapped robust LARS contain more of the important covariates than the reduced sets obtained by initial robust LARS.

It is important to select the number of predictors to use for the second step. This number is a trade-off between the success rate, that is, the number of important predictors captured in the reduced set, and the feasibility of the segmentation step. Our simulation study indicated that the reduced set can have size comparable to the actual number of relevant candidate predictors. However, this number is usually unknown. To still get an idea about an appropriate size for the reduced set we introduced a learning curve that plots robust R² values versus dimension. An appropriate size can be selected as the dimension corresponding to the point where the curve starts to level off.

3.10 Chapter Appendix

3.10.1 Determination of γ for one active covariate

Assume that the first selected covariate is +X_m. The current prediction μ ← 0 should be modified as

μ ← γ X_m.

The distance γ should be such that the modified residual (Y − μ) has equal correlation with +X_m and another signed covariate X_j. We have

corr(Y − γX_m, +X_m) = [X_m'(Y − γX_m)/n] / SD(Y − γX_m) = (r − γ) / SD(Y − γX_m),   (3.13)

and

corr(Y − γX_m, +X_j) = [X_j'(Y − γX_m)/n] / SD(Y − γX_m) = (r_jY − γ r_jm) / SD(Y − γX_m).   (3.14)

Equating (3.13) to (3.14), we have

γ(+X_j) = (r − r_jY) / (1 − r_jm).   (3.15)

Similarly, equating (3.13) with the correlation of the modified residual and −X_j, we have

γ(−X_j) = (r + r_jY) / (1 + r_jm).   (3.16)

We should take the minimum of (3.15) and (3.16), and the minimum over all inactive (not yet selected) j. The signed covariate that will enter the model at this point is determined along with it.

3.10.2 Quantities related to the equiangular vector B_A

Here, A is the set of 'active' subscripts. Let X_A = (· · · s_l X_l · · ·), l ∈ A, where s_l is the sign of X_l as it enters the model. The standardized equiangular vector B_A is obtained using the following three conditions.

B_A is a linear combination of the active signed predictors:

B_A = X_A w_A, where w_A is a vector of weights.   (3.17)

B_A has unit variance:

(1/n) B_A'B_A = 1.   (3.18)

B_A has equal correlation (a, say) with each of the active predictors. Since the covariates and B_A are standardized,

(1/n) X_A'B_A = a 1_A, where 1_A is a vector of 1's.   (3.19)

Using equation (3.17) in equation (3.18), we have (1/n) w_A'X_A'X_A w_A = 1, so that

w_A'R_A^(s) w_A = 1,   (3.20)

where R_A^(s) is the correlation matrix of the active signed variables. Using (3.17) in (3.19), we have R_A^(s) w_A = a 1_A, so that the weight vector w_A can be expressed as w_A = a (R_A^(s))^{-1} 1_A. Let R_A be the correlation matrix of the unsigned active covariates, i.e., R_A is a submatrix of R_X. Let s_A be the vector of signs of the active covariates (we get the sign of each covariate as it enters the model). We have

w_A = a (D_A R_A D_A)^{-1} 1_A,   (3.21)

where D_A is the diagonal matrix whose diagonal elements are the elements of s_A. Finally, using equation (3.21) in equation (3.20), we get

a = [1_A'(D_A R_A D_A)^{-1} 1_A]^{-1/2}.   (3.22)

The correlation of an inactive covariate X_j with B_A, denoted by a_j, can be expressed as follows:

a_j = (1/n) X_j'B_A = (1/n) X_j'X_A w_A = (D_A r_jA)'w_A,   (3.23)

where r_jA is the vector of correlation coefficients between the inactive covariate X_j and the (unsigned) selected covariates. Thus, we need only (a part of) the correlation matrix of the data (not the observations themselves) to determine the above quantities.

3.10.3 Determination of γ for two or more active covariates

Let us update r ← (r − γ), see (3.13), and r_jY ← (r_jY − γ r_jm), see (3.14). The correlation of an active covariate with the 'current' residual Y − μ is r/SD(Y − μ), and the correlation of the active covariate with the current equiangular vector B_A is a. Therefore, the correlation between an active covariate and the 'modified' residual (Y − μ − γ_A B_A) is

(r − γ_A a) / SD(Y − μ − γ_A B_A).

An inactive covariate +X_j, j ∈ A^c, has correlation r_jY/SD(Y − μ) with the 'current' residual, and it has correlation a_j with B_A. Therefore, the correlation between +X_j, j ∈ A^c, and the 'modified' residual is

(r_jY − γ_A a_j) / SD(Y − μ − γ_A B_A).

Equating the above two quantities, we get

γ_A(+X_j) = (r − r_jY) / (a − a_j).   (3.24)

Similarly,

γ_A(−X_j) = (r + r_jY) / (a + a_j).   (3.25)

We have to choose the minimum possible γ_A over all inactive covariates. Note that when A has only one covariate, (3.24) and (3.25) reduce to (3.15) and (3.16), respectively.

Chapter 4

Two-step Model Building: Robust Segmentation

4.1 Introduction

In Chapter 3 we developed robust sequencing methods to obtain a reduced set of covariates from which the final prediction model can be selected. According to the notation used before, we have m predictors X_1, ..., X_m in the reduced set. In this chapter we consider methods of segmentation (evaluation of all possible subsets of the reduced set of covariates) in order to select the final prediction model.
To compare different  subsets of  covariates, we require an appropriate robust selection criterion. For this purpose, we review some classical selection criteria in Section 4.2, and their robust counterparts in Section 4.3. We use /3 p to denote the estimate of  f3 p for  the p-parameter submodel (p  predictors including the intercept) under consideration. Many of  the methods below require and estimate the variance of  the error term under the "true" model, a2. In such cases, a2 is estimated using the full  model (with k — m + 1 parameters). 4.2 Review: classical selection criteria In this section, we review some important classical selection criteria: Final Prediction Error (FPE), Akaike Information  Criterion (AIC), Mallows' Cp, cross-validation and bootstrap. 4.2.1 Akaike Information  Criterion (AIC) A measure of  the similarity between the fitted  distribution f{y\0 p) and the true distri-bution g(y\(3)  is the Kullback-Leibler information  number (Kullback and Leibler 1951) It can be shown that (i) I(gJ)>  0, (ii) I(g,  /) = 0 g(y)  = f(y)  almost everywhere (Lebesgue measure). Our purpose is to minimize I(gJ)  = E{\ogg(Y\(3)}-E{\ogf(Y\P p)}, where only the second term is important in evaluating the fitted  model. This term is unknown, and it seems reasonable to consider the log-likelihood n i=1 as an estimate of  nE jlog/(y|/3p) j. However, this estimate has a bias, since the same data are used to find  the estimates (3p and to calculate the log-likelihood. Akaike (1973) showed that the expected value of  the bias ~ p. Therefore,  the corrected estimate of nE {log  f(Y\P p)} is L'0p)=L0p)-p. Based on this, Akaike (1973) proposed to choose the model that minimizes the Akaike Information  Criterion: AIC = -2 L0p) + 2 p, Bhansali and Downham (1977) proposed to generalize AIC by choosing a model that minimizes, for  a chosen fixed  a, AIC(p,a)  = -2L(P p) + ap. For normal errors, D C C AIC(p, a) = K(n,  a) + + ap, (4.1). where K(n,  a) is a constant depending on the marginal distribution of  the covariates, RSSp is the residual sum of  squares, and a2 is the estimate of  a2 from  the full  model. 4.2.2 Mallows' Cp Let us consider the following  submodel of  p parameters: Yi  = (3' pXi  + a, i = 1, 2, • • • , n, (4.2) where e* are independent observations from  the distribution F with mean zero and vari-ance a2 (when the current submodel is the true model). This subset model may produce biased fitted  values, i.e., Eft)  ^ E(Yi), where Yi = (5pXi. The bias may be tolerable if  it is offset  by a reduced variance. Therefore,  Mallows (1973) considered the mean square error for  each fitted  value, and defined  the following  criterion for  model evaluation: 1 n J p = — Eraseft) <7 i=l <7 Eft  - E(F,))2 (4.3) yii - niyii)^2 ,i=l The value of  J p has to be estimated from  the data. Mallows (1973) proposed the following estimate: D cc Cp = J p = ?jpL + 2p-n, (4.4) where a2 is the estimate of  a2 from  the full  model. It can be shown that, for  the full model with k = m + 1 parameters, C k = k. It is interesting to note that, for  normal errors, the C p statistic is equivalent to AIC(p, 2) (see (4.1)). 4.2.3 Final Prediction Error (FPE) Akaike (1969, 1970) proposed a criterion for  the selection of  predictors in the context of  autoregressive processes. The author minimized an estimate of  the expected squared error in predicting the observations that are independent of  the available data, but have the same distribution. Consider the subset model (4.2). 
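As a small numerical illustration of the criteria reviewed so far (both are simple functions of RSS_p), the sketch below computes Mallows' Cp with the error variance estimated from the full model; for normal errors, AIC(p, 2) orders submodels in the same way. The code and its helper names are ours, and the full-model variance estimate RSS/(n − k) is one common choice, not necessarily the one used in the thesis.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

def mallows_cp(X_sub, X_full, y):
    """Cp = RSS_p / sigma2_hat + 2p - n, as in (4.4), with sigma2_hat from the full model."""
    n = len(y)
    k = X_full.shape[1] + 1                 # full-model parameters (k = m + 1)
    sigma2_hat = rss(X_full, y) / (n - k)   # one common full-model variance estimate
    p = X_sub.shape[1] + 1                  # submodel parameters (incl. intercept)
    return rss(X_sub, y) / sigma2_hat + 2 * p - n

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(80)
print(mallows_cp(X, X, y))          # full model: C_k = k = 5
print(mallows_cp(X[:, :2], X, y))   # submodel with the two relevant covariates
```

With these RSS-based criteria in hand, the derivation of FPE for the subset model (4.2) continues as follows.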
Suppose that we are trying to predict, using the estimates j3p, the values Y*  satisfying Y*  = (3' pX l + e*, i = l,2,-.-,n, (4.5) where e*'s have the same distribution F, but they are independent of  the Cj's. The Final Prediction Error (FPE) of  the current model is defined  as i=i (4.6) It is interesting to note that, for  the linear regression setup considered above FPE = i=1 n 1 ^E i=i J p + n, a rE Eft  -,i=i where J p is defined  in (4.3). Therefore,  based on (4.4), an estimate of  FPE is given by (4.7) FPE  = ^ + 2 p, c-where a2 is the estimate of  a2 from  the full  model. Note that, for  the evaluation of  linear prediction models, FPE is equivalent to the C p statistic. 4.2.4 Cross-validation Cross-validation (CV) obtains an estimate of  the error-rate of  a prediction rule by split-ting the n data points into a training sample of  size nt (used for  fitting  the predic-tion model, i.e., for  estimating the model parameters) and a validation sample of  size nv = n — nt (used for  assessing the model). We calculate the average prediction error based on all or some of  the (^ ) different  validation samples, and use it as a criterion to select a prediction model. It is often  called leave-ri„-out cross-validation, or CV(nv). The vast majority of  papers on this topic deals with leave-one-out cross-validation, denoted by CV(1). Lachenbruch and Mickey (1968) proposed the use of  CV(1) in dis-criminant analysis. The method is furthered  by Allen (1974), Stone (1974), and Geisser (1975). The asymptotic equivalence of  CV(1) and AIC is shown by Stone (1977). Efron  (1983) used CV(1) to estimate the error rate of  a prediction rule in the situa-tion where the response Y  is dichotomous. We can easily generalize the author's approach to a continuous response. Suppose that we have an nxj) dataset Z = {Zi,i = 1, 2, ••• , n}, where each case (row) = (ccj, yi) is an observation of  the random quantity (X,  Y),  with X  being a row-vector of  (p — 1) covariates and Y  being a real-valued response variable. The dataset Z  is a random sample from  distribution H  on the p-dimensional sample space Rp. We want to evaluate a prediction rule r)(x,  Z)  constructed based on the given dataset. An example of  rj(x,Z)  is j3zx where j3z is the linear regression coefficient  of  Y  on X. We want to estimate the error rate of  r](x,  Z)  when rj(x 0, Z)  is used to predict a future value yo of  Y  for  a given predictor value Xo. Let Q[y 0,rj(x 0, Z)]  denote the. "error" in predicting y0 from  x0. For example, we can consider the squared error Q[y 0,V(x 0,Z)]  = {y 0-ri{x 0,Z)) 2. (4.8) True error rate The true error rate Err(Z,  H)  of  the prediction rule 77(2:0, Z)  can be defined  as Exx(Z,H)  = EH(Q[Y 0,r,(X 0,Z)}),  (4.9) the expectation being taken over (.Xo, Yo) ~ H  with Z  fixed  at its observed value. Apparent error rate The most obvious estimate of  the true error rate Err(Z,  H)  is the apparent error rate err (Z,H): err(Z,H)  = - V]  Q[y h V(x u Z)],  (4.10) n 1' i=i which usually underestimates Err(Z, H),  because the same data have been used both to construct and to evaluate the prediction rule rj(x,  Z). CV error rate CV attempts to overcome the problem of  underestimation of  Err(Z, H) by dividing the 1 given dataset into the training and the validation parts. For CV(1), let Z^) be the training set with case Zi removed, and rj(x,Z(,))  be the corresponding prediction rule. 
The CV(1) estimate'of  Err(Z,  H)  is given by 1 n Err = (4.11) i=i s Shao (1993) used CV(n„) for  model selection in regression using a random selection of  the (^) possible validation samples. 4.2.5 Bootstrap Efron  (1983) used bootstrap to estimate the true error rate Err(Z,  H)  (see (4.9)). Since the apparent error rate err(Z, H)  (see (4.10)) is an underestimate of  Err(Z,  H),  a correc-tion is required. Let op(Z, H) be defined  as op(Z, H)  = Err(Z, H)  — evv(Z,  H).  (4.12) The expectation of  op(Z,  H),  denoted by w(H),  is given by r(H)=E H{Evv(Z,H)-evv(Z,H)},  (4.13) w which could be the ideal correction if  it were known. Note that, though the true error rate and the apparent error rate are defined  for  particular dataset Z, the target correction is the expectation over all datasets. It is not easy to find  an estimate for  (4.12) which is defined  for  a particular Z. The unknown w(H)  can be estimated using the bootstrap procedure to get the bootstrap estimate of  Err(Z, H) as Err(B°0t) = err +w(Boot). (4.14) To obtain w^Boot\ let Z*  be a bootstrap sample, i.e., a random sample of  size n from  H. Based on (4.13), w(Boot) can be written as -(Boot) = ^jErr^tf)!  jerr(Z*,if)} = E* -E* (^PfQlyiMvuZ*)]^  , (4.15) where E* is the expectations over all bootstrap samples, and P* is the proportion of times a particular case Zj occurs in the bootstrap sample Z*,  i.e., #{z* = zA n The expression inside the first  pair of  parentheses of  (4.15) suggests that the prediction rule be constructed with the bootstrap sample Z*, and an average error of  this rule be calculated on the given dataset Z.  The expression inside the second pair of  parentheses of  (4.15) suggests that the prediction rule be constructed with the bootstrap sample Z*, and an average error of  this rule be calculated on the same bootstap sample Z*. 4.3 Review: robust selection criteria In this section we present the robust counterparts of  the classical selection criteria AIC, Cp and FPE (in order of  appearance in the robustness literature), and discuss their limitations. We discuss robust counterparts of  cross-validation and bootstrap procedures in Section 4.4 and Section 4.5, respectively. Ronchetti (1985) proposed a robust counterpart of  the AIC statistic. The extension of  AIC to AICR is inspired by the extension of  maximum likelihood estimation to M-estimation. The author derived AICR for  an error distribution with density For a given constant a and a given function  x, we can choose the model that minimizes 4.3.1 Robust AIC /(e) = ifexp(-x( e)). (4.16) n AICR(p, a, x) = 2 J2  x(ri) + (4.17) / \ / where r^ = (Yi  — p Xi)/a,  a is some robust estimate of  a, and (3  is the M-estimator defined  as the implicit solution of n ^2^(ri)Xi = 0, i=i with ijj = x'• The author also proposed a choice for  the parameter a, which is given by a = 2E[^2(e)]/E[#)]. (4.18) Limitation. The author considered that the M-estimate was the maximum likelihood estimate for  the density in (4.16). Unfortunately,  this only hold for  unbounded x func-tions, and in such cases the breakdown point of  the M-estimate is 0. 4.3.2 Robust Cp Ronchetti and Staudte (1994) pointed out the sensitivity of  the classical Cp to outlying points, and proposed a robust Cp statistic denoted by RCP. Consider an M-estimator 0 with weights Wi = V,0"i)/rij where r^ is the residual for  the ith observation. 
Using these weights, the author defined  a weighted version of  J p (see (4.3)) as follows ,i=l The author proposed the estimate of  Tp, i.e., the robust version of  C p (see (4.4)), as W RCP = -^-(U p-V p), (4.19) where W p — ^ wfrf  is the weighted residual sum of  squares, a2 is a robust and consistent estimate of  a2 from  the full  model, and U p and V p are constants depending on the weight function  and the number of  parameters p. When the weights are identically 1, RCP reduces to Mallows C p. 4.3.3 Robust FPE The robust analogue to the classical FPE criterion is proposed by Yohai (1997). Let s be an estimate of  the scale a from  the full  model, and j3 be the M-estimator of  the particular model under consideration. < 71 Pp = argmin^x((2/i - 0P*i)/s).  (4.20) i=i When we are trying to predict y* (equation 4.5) using the estimate j3 , the robust FPE (RFPE) is defined  as RFPE = J2E vt - p'vXi x' (4.21) \ U  I i L \ / . where the expectations are taken in the y*'s as well as in (3p. Note that when x( u) = u2> RFPE reduces to the classical FPE. Using second order Taylor expansions with respect to /3 p (assuming that the current model is the true model), RFPE is expressed as (4.22) where A = E(ip 2(e/a)),  B = E(ip'(e/a)),  and ip — x'- Therefore,  an estimate of  RFPE is given by (4-23) where A = n"1 Er=i(^2(n/s)), B = n"1 Y.U^'in/s)),  and r{ = yz - Ppxz. The performance  of  RFPE has not been studied so far.  In Section 4.6 we carry out a simulation study to evaluate RFPE. 4.4 Robust cross-validation Ronchetti, Field and Blanchard (1997) proposed a robust cross-validation procedure which is a robust version of  the cross-validation method proposed by Shao (1993). The authors used estimators that have optimal bounded influence  for  prediction. However, their method is computationally expensive. Hubert and Engelen (2004) proposed a fast cross-validation method in the context of  robust covariance estimation with MCD and robust principal component analysis. In this section we propose a robust CV procedure which is computationally suitable. First, let us consider a simple robustification  of  the CV procedure achieved by (a) con-structing a robust prediction rule, denoted by r]R(x,Z),  based on the given dataset Z, and (b) calculating robust summary statistics of  the prediction errors Q[yi,  V R( xi, ^(i))]-For the construction of  a robust prediction rule, we consider the regression MM-estimates proposed by Yohai (1987) because of  its high breakdown point and high efficiency  at the normal model. This estimate is defined  as follows. Definition  4.1. (Regression MM-estimate) Let Xo  R ->• R and Xi  '• ® ~5> ® be two score functions  such that  Xo( u) < Xi( u)i u £ ^ and each x satisfies  the following  set of regularity  conditions: 1- x(-u)  = x(u),  ueR, 2. x is non-decreasing  on [0,oo), 3. x(0) - 0, and X(oo) - 1, 4- X  i-s continuously  differentiable. Let $ be a high-breakdown-point  "initial"  estimate  for  (3,  and a be the estimate  of  scale of  the residuals  based on (3  satisfying lY,Xo({yi-x\0)/&)=b, (4.24) i= 1 where b £ (0,1] is the expectation  of  xo(-) under  the central  model.  Then,  the regression MM-estimate  is defined  as the solution  of n Exi ((y i-xtf)/a)x i = 0. (4.25) i=i A reasonable choice for  the initial estimate /3 in Definition  4.1 is the regression S-estimate proposed by Rousseeuw and Yohai (1984) because of  its high breakdown point. This estimate is defined  as follows. Definition  4.2. 
(Regression S-estimate) Let xo • —i• K  be the score function  de-scribed  above. The  regression  S-estimate  (3  is defined  as 0 = argmina(/3), (4.26) /3 where <j((3)  solves ^Y,xi((Vi-x ti0)/*(0))=b-  (4-27) i=i The  corresponding  S-estimate  of  scale, a, is given by' a = inf  cr(f3)  = d(f3).  (4.28) For a robust summary of  the (squared) prediction errors we will use the trimmed means with different  amounts of  trimming. The a-trimmed mean of  X, denoted by ma(X),  is the sample mean obtained after  dropping the largest 100o;% observations of X.  Let U\  < U 2 < • • • < U n be the ordered observations of  X,  and k = [n(l — a)], where [n(l — a)] means the integer part of  n(  1 — a). Then, n—k j=1 — (CV) The robust counterpart of  Err is now given by E?r(RCV) = ma (Q[y h r,R( X i, Z(i))]) . (4.30) •—• (RCV) ' Note that Err does not estimate Err(Z,  H) (see equation 4.9). Instead, it estimates ErrK(Z,H)  = EAH (Q[y 0, V R(x 0, Z)}) = /  Q[y 0,vR(xo,Z)]dH.  (4.31) 1 - OL J 0 The use of  a-trimmed mean will help us identify  the robust model(s) that can be expected to predict 100(1 — a)% of  the future  data better than other models. 4.4.1 Dealing with numerical complexity The computation of  the MM estimates of  regression for  each training sample, i.e., the computation of  (3^,  i = 1, 2, • • • , n, is very computer intensive. We propose to remedy this problem as follows.  We express the MM estimates of  regression based on all the observations on the current set of  covariates as a weighted least squares fit,  and obtain the weighted least squares fit  for  each training sample by using the selected cases and their corresponding weights. We elaborate the proposed method below. Let Ti = yi — $ Xi be the residuals obtained from  the fit  (3  (based on all the ob-servations on the current set of  covariates). Once the robust fit  is complete, (3  can be (4.32) (4.33) For further  computational ease, we will assume that a^) ~ a, where a^) is the S-scale based on the training sample Z^y Now, a computationally suitable version of  the regression MM-estimate can be calculated as ( n \ ~ ^ n E wi xi xtJ ) E wi xi Vy (4-34) / j¥=i " (0) Note that no robust fitting  is needed for  the calculation of  (3^. One-step adjustment Based on a small simulation study (not presented here), we consider a one-step correction to (3^ to make it closer to fi^y  Let r^ = yj — x^fl^,  j = 1, 2, • • • , i — 1, i + 1, • • • , n. The updated set of  weights w^ can be expressed as = Xi (rf/a)Irf,  j = 1, 2, • • • , i - 1, i + 1, • • • , n. (4.35) Thus, an adjusted estimate of  (3^ is given by ( n \~1n expressed as a weighted least squares regression coefficient  as follows: ( n \ n i=l / i=l with the weights Wi expressed as Wi  = %i (ri/a)/ri,  i = 1, 2, • • • , n. 4.5 Robust bootstrap For the purpose of  making robust statistical inferences  about the linear regression co-efficient  (3,  Salibian-Barrera (2000), and Salibian-Barrera and Zamar (2002) developed the robust bootstrap procedure. The author(s) considered the regression MM-estimate and generated a large number of  re-calculated /3*'s to estimate the asymptotic co-variance matrix and the distribution function  of  the robust estimate 0. This robust procedure is computationally suitable, because a linear system of  equations is solved for each bootstrap sample. We propose to use a similar approach to develop a computationally suitable robust — (Boot) counterpart of  the bootstrap estimate Err of  the true prediction error Err(Z,H). 
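Before turning to the robust bootstrap, the following rough sketch illustrates the fast cross-validation idea described above: once a robust fit of the current submodel has produced case weights w_i and a scale estimate, each leave-one-out coefficient vector is obtained from a weighted least-squares system that re-uses those weights, and the squared prediction errors are summarized by an α-trimmed mean as in (4.29)-(4.30). The `crude_robust_fit` below is only a stand-in for the S/MM-estimates of Definitions 4.1-4.2 (it is an iteratively reweighted least-squares fit with a least-squares start, so it is not high-breakdown), and the one-step adjustment (4.35)-(4.36) is omitted.

```python
import numpy as np

def bisquare_weight(u, c=4.685):
    """Tukey bisquare weight psi(u)/u, zero outside [-c, c]."""
    w = (1.0 - (u / c) ** 2) ** 2
    return np.where(np.abs(u) <= c, w, 0.0)

def crude_robust_fit(X, y, n_iter=50):
    """Stand-in for the MM-fit: IRLS with a least-squares start and MAD scale.
    Returns (coefficients, scale, case weights w_i)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    for _ in range(n_iter):
        r = y - Xd @ beta
        scale = 1.4826 * np.median(np.abs(r - np.median(r)))   # MAD, not the S-scale
        w = bisquare_weight(r / scale)
        XtW = Xd.T * w
        beta = np.linalg.solve(XtW @ Xd, XtW @ y)
    return beta, scale, w

def fast_loo_errors(X, y, w):
    """Approximate leave-one-out squared prediction errors: for each i, solve the
    weighted least-squares system without case i, re-using the full-fit weights."""
    Xd = np.column_stack([np.ones(len(y)), X])
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        XtW = Xd[keep].T * w[keep]
        beta_i = np.linalg.solve(XtW @ Xd[keep], XtW @ y[keep])
        errors[i] = (y[i] - Xd[i] @ beta_i) ** 2
    return errors

def trimmed_mean(x, alpha=0.1):
    """alpha-trimmed mean: drop the largest 100*alpha% of the values, average the rest."""
    u = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(len(u) * (1 - alpha)))
    return u[:k].mean()

# Robust CV criterion for one candidate subset, on data with a few outliers.
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 3))
y = X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.standard_normal(60)
y[:5] += 20                                   # contaminated responses
beta, scale, w = crude_robust_fit(X, y)
print(trimmed_mean(fast_loo_errors(X, y, w), alpha=0.10))
```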
Let be the MM-estimate, /3 be the (initial) S-estimate and a be the S-scale. The robust counterpart of  the apparent error rate err(Z) (see equation 4.10) is given by errR(Z) - ma {Q[ V l, r,{ X i, Z))), (4.37) where ma(.)  is the cn-trimmed mean defined  before,  and r)K(xi,  Z) uses the MM-estimate /3. Let ri and f;  be the residuals associated with the MM- and S-estimates, respectively. Once the robust fit  is complete, f3  and a can be expressed as a weighted least squares fit.  Equation (4.32) shows the weighted average representation of  (3  with the weights Wi defined  in Equation (4.33). The scale estimate a can be expressed as n a = ^vi(yi-ptxi), ' (4.38) i=1 with the weights Vi defined  as Vi  = ^Xo  {fi/cr)/fi,  i = 1, 2, • • • , n: (4.39) Let Z*  — .{(xj, yi), i = 1, 2, • • • , n} be a bootstrap sample from  Z.  The unadjusted bootstrap estimates can be calculated as ( n \ — ^ n i=1 / i=l n K  = (4-41) i=1 where w* = Wj  and v* = Vj  when (a:*, y*) = (xj,  yj). The corrected bootstrap estimate p can be obtained as j3*. = 0 + M0l-0)  + d(K-&),  (4.42) where M  and d are the linear correction factors  (see Salibian-Barrera and Zamar 2002). The robust prediction rules r]R(x,  Z*)  can be based on the /3 above. Now the robust counterpart of  w^Boot' (see equation 4.15) is given by -(RBoot) = ^ |mQ (g[yi> ^R^ } _ E% (mQ (Q[yh vR{xh £.)]) | . (4.43) Finally, the robust bootstrap estimate of  ErrR(Z, H)  (see equation 4.31) can be expressed as Err(EB°0t) = errR(Z) + w(RBoot). (4.44) 4.6 Simulation study At first,  we carry out a small simulation (Section 4.6.1) to show that the classical CV and bootstrap estimates of  true error rate (see (4.9)) are sensitive to outliers while the robust estimates are resistant to outliers. We then conduct another study (Section 4.6.2) where we use these methods along with FPE and RFPE to select the "best" models, and compare the predictive powers of  these models. Since AIC and Cp are equivalent to FPE for  linear regression setup with normal errors, and robust AIC and robust Cp have some limitations, we do not consider these criteria for  our simulation study. 4.6.1 Robustness of  the estimates •—• (CV) -» (Boot) •—• (RCV) — (RBoot) We evaluate the 4 estimates Err , Err , Err and Err using simulated clean and contaminated datasets. Since the true error rates Err(Z,H)  and Errr(Z,H) are different  (the latter uses the trimmed mean), we multiply the robust estimates by EH o(Q[y,r)*(x,Z)}) E«H o(Q[y^(x,Z)}) to make the results more comparable with the classical results. We considered two standard normal covariates X\  and X 2, and generated Y  = 2 Xi  + X 2 + e, where e ~ N(0,  4). We simulated 100 datasets, and for  each dataset we calculated the estimates mentioned above. We then contaminated each dataset as follows.  Each of  the 3 variables (2 covariates and the response) is contaminated independently. Each observation of  a variable is assigned probability 0.03 of  being replaced by a large number. Therefore,  the probability that any particular row of  the dataset will be contaminated is 1 — (1 — 0.03)3, which means approximately 9% of  the rows will be contaminated. For each contaminated dataset we obtained the 4 estimates mentioned above. Table 4.1 presents the results for  the first  10 trials. The average Err(Z,  H0) (the average true error rate for  the clean data) is 4.12, while the average ErrR(Z, H0) (multiplied by A) is 4.13. 
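For concreteness, here is a minimal sketch of the cell-wise contamination scheme used in the simulation just described: each of the three variables is independently replaced with probability 0.03, so a row is contaminated with probability 1 − 0.97³ ≈ 0.09. The value used for the "large number" is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps, big = 100, 0.03, 1e3                 # 'big' plays the role of "a large number"

# Clean data: Y = 2*X1 + X2 + e with e ~ N(0, 4), as in the study above.
X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
Y = 2 * X1 + X2 + rng.normal(0.0, 2.0, n)
data = np.column_stack([X1, X2, Y])

# Each cell is independently replaced by a large value with probability 0.03 ...
mask = rng.random(data.shape) < eps
data[mask] = big

# ... so a row is contaminated with probability 1 - (1 - 0.03)**3, about 8.7%.
print(mask.any(axis=1).mean(), 1 - (1 - eps) ** 3)
```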
Table 4.1 shows that, for the clean data, both the classical and robust methods estimate the true error rates very well. However, for the contaminated data, the robust estimates perform much better than the classical methods.

Table 4.1: First 10 trials: classical and robust estimates of prediction errors.

  Trial      CV                Boot              RCV               RBoot
           Clean   Contam    Clean   Contam    Clean   Contam    Clean   Contam
    1      4.60    12.01     4.61    11.65     4.18    5.80      4.15    5.47
    2      4.05    10.20     4.02     8.99     3.48    5.18      3.42    5.05
    3      3.86    10.35     3.88    10.64     3.70    5.72      3.67    5.66
    4      4.95    11.56     4.97    11.80     5.45    6.14      5.43    6.47
    5      3.95    14.92     3.94    11.77     4.29    5.57      4.25    5.33
    6      4.54    12.56     4.52    10.91     5.11    6.37      5.07    6.30
    7      5.22    10.28     5.16    10.53     4.72    6.37      4.74    6.65
    8      4.03     8.54     4.04     8.63     4.04    5.43      4.04    5.45
    9      4.16    10.45     4.20    10.60     4.21    6.59      4.21    6.36
   10      4.57     9.82     4.53     9.72     4.75    5.90      4.70    6.54
  mean     4.14    10.94     4.13    10.28     4.17    5.32      4.16    5.22
  (sd)    (0.58)   (2.57)   (0.58)   (2.13)   (0.67)   (0.94)   (0.69)   (0.93)

4.6.2 Final model selection

In this simulation study we use the classical segmentation methods CV, Boot and FPE along with their robust counterparts RCV, RBoot and RFPE to select the "best" models, and compare the predictive powers of these models. The study is similar to Frank and Friedman (1993). We considered 2 latent variables L_i, i = 1, 2, to generate Y = 6 L_1 + 5 L_2 + e, where L_i ~ N(0, 1), and e is a normal error not related to the latent variables. We considered a total of m = 8 covariates. Of them, a = 4 are related to the two latent variables, with 2 covariates related to L_1 and the other two related to L_2.

We generated 100 datasets, each of which was randomly divided into a training sample of size 100 and a test sample of size 100. Each training dataset was then contaminated as follows. A number of rows (10%) were chosen randomly, and for these rows the covariate values were replaced by large positive numbers while the response values were replaced by large negative numbers. We used all 6 methods on the clean and contaminated training data to select and fit the final models, and then used them to predict the test data outcomes. For each simulated dataset, we recorded the number of noise variables in the model, and the average squared prediction error on the test sample.

Table 4.2 shows the average test error and the average number of noise variables selected by each method. For the clean data, the robust methods perform as well as the classical methods. For the contaminated data, the robust methods produce much smaller test errors than the classical methods. Also, the robust models contain fewer noise variables. The performance of the three robust methods is similar.

Table 4.2: Performance of the classical and robust methods of segmentation (evaluation of all possible subsets of the reduced set).

                        Test error            Noise
  Method              Clean    Contam     Clean    Contam
  Classical  CV       41.81    56.49      0.00     0.60
             Boot     41.32    54.88      0.00     0.50
             FPE      41.93    55.17      0.02     0.60
  Robust     RCV      42.97    43.62      0.06     0.08
             RBoot    41.59    44.80      0.08     0.08
             RFPE     42.73    44.91      0.06     0.06

4.7 Examples

In this section we use two real datasets to evaluate the performance of the classical and robust methods for the segmentation of the reduced set. Both of these datasets were used in Chapter 3 for the evaluation of robust sequencing.

4.7.1 Demographic data

This dataset contains n = 50 observations on d = 25 covariates and a response. For more details, see Section 3.8. Using the learning curve based on standard LARS, we selected the reduced set (22, 20, 4, 15, 25, 2, 14, 5, 3, 17, 24, 23).
The robust bootstrapped LARS produced the reduced set (22, 20, 4,15,10, 2,19, 25, 8,18, 6, 24). We applied the classical segmentation methods CV, Boot and FPE on the first  re-duced set above. The. covariates selected by these methods are (22,4, 25, 2,14,17, 24, 23), (22,4,15, 25, 2,17, 24), and (22, 20,4,15,25, 2,14,17, 24,23), respectively. We then ap-plied the robust methods RCV, RBoot and RFPE on the second reduced set. The covari-ates selected are (22,4,15,10,19,25,18,24), (22,20,4,10,19,25,18,24), and (15,6,24), respectively. Interestingly, RFPE selects a very small model compared to others. To compare the models obtained by the classical and robust methods, we used the clean data (dropping one clear outlier) to estimate the prediction errors of  these models 1000 times using 5-fold  CV. The mean prediction errors for  the models are: CV 199.3, Boot 198.2, FPE 207.6, RCV 195.8, RBoot 197.5 and RFPE 246.9. The robust method RCV performs  slightly better than RBoot, and both of  them perform  much better than the classical methods and RFPE. 4.7.2 Protein data This KDD-Cup 2004 dataset was used in Section 3.8. We considered n = 4141 protein sequences from  5 blocks. The number of  covariates is d  = 77, with 4 block indicators (variables 1 — 4) and 73 features.  The data were split to get a training sample of  size n = 2072 and a test sample of  size n = 2069. We considered a reduced set of  size 5 using the learning curve based on standard LARS on the training data, which contains the covariates (14,13,5,8,7). Robust boot-strapped LARS gives the reduced set (14,13, 5,,76, 73). We applied the 3 classical meth-ods of  segmentation on the first  reduced set. They all select the same model, and it includes the covariates (14,13, 5, 8). The robust methods used on the second reduced set select the covariates (14,13, 5,76). We fitted  the 2 models using the training data, and then used them to predict the test data outcomes. The 1%, 5% and 10% trimmed means of  prediction errors for  the robust (classical) model are : 114.92 (117.49), 92.77 (95.66) and 74.82 (78.19), respectively. It is encouraging to note that the robust methods outperform  the classical methods for  the majority of  the data. 4.8 Conclusion The main contribution of  this chapter is that we developed computationally suitable robust methods of  segmentation (evaluation of  all possible subsets of  the reduced set obtained in Chapter 3) to select the final  model. Classical selection criteria FPE, AIC, Cp, CV and bootstrap are sensitive to out-liers. We also identified  certain limitations of  Robust AIC (Ronchetti 1985) and robust CV (Ronchetti, Field and Blanchard 1997) methods. We proposed computationally suit-able robust versions of  CV and bootstrap procedures. We evaluated our methods using both simulated and real datasets, and compared them with the classical methods as well as robust FPE proposed by Yohai (1997). According to the simulation study, the performance  of  the three robust methods are similar, and better than the classical meth-ods. In the real datasets, robust CV (RCV) and robust bootstrap (RBoot) have better performance  compared to RFPE. Chapter 5 Properties of  Adjusted-Winsorized Correlation Estimate 5.1 Introduction In Chapter 3 we proposed a new correlation estimate for  bivariate data, which we called the adjusted-Winsorized estimate. 
Unlike two separate univariate Winsorizations for X  and Y  (Huber 1981 and Alqallaf  2003), we proposed a joint Winsorization with a larger tuning constant c\ for  the points falling  in the two major quadrants, and a smaller constant c2 for  the points in the two minor quadrants. In this chapter we will establish the consistency and derive the influence  function  of the proposed correlation estimate. We will then discuss the asymptotic normality of  this estimate. Definition  5.1. (Adjusted-Winsorization) The  adjusted-Winsorization  of  (u,  v) £ R2, denoted  by tyc(u,v)  with c — (c\,  c2), is defined  as rt \ (^(uUcM.u^O, = (Mu), Mv)) = < (5.1). [ VVfcfa))  . UV < 0, where ip is a non-decreasing  symmetric  function,  and C\ and c2 are chosen constants. Definition  5.2. (Adjusted-Winsorized estimate of  correlation) Let (Xi,  Yi),  % — 1, 2, • • • , n, be a random  sample from  a bivariate  distribution  with location  parameters Hx  and fiy,  and scale parameters ax and ay, respectively.  Let 6 = (fix,  fx,  cy), and 0 = (jlx,  £I-Y,  &Y) be an estimate  of  6. Denote Ui  = (Xi  — fix)/&x,  and Vi  = (Yi  — fly)  j ay. Let (jJi,  V^j  = (^ c(Ui),ip c(vSj  be as defined  in (5.1).  Then,  the adjusted-  Winsorized  estimate  rw of  the correlation  between X  and Y  is given by 11Mm c(Vi)  - (i  t MUi))  (i  t MVi) ~ 1 V 1=1 / V i=1 /  2) itr cm-(itMUi))\ kntr cm-('-tmvS 2 n •<•—' <•/  in < t<~\ »/  i \ n ' Tc\ <•/  i n i=1 \ i=1 / V »=1 \ i=1 For the validity of  the results obtained in the subsequent sections, we need some assumptions on the functions  ipCl and ipc2 used for  the adjusted-Winsorization of  the data. Let ip : 3R. —y 1R satisfy  the following  set of  regularity conditions: Al. Ip(-u)  = —ifi(u),  u G R, A2. ip is non-decreasing, A3, ip is continuously differentiate, A4. ip, i)' and ip'(u)u  are bounded. For the adjusted-Winsorization of  the data, we will use the S-scales &x and ay-defined  in Chapter 4. Let us assume that the score function  x : R —> E used in the S-scales satisfy  the following  set of  regularity conditions: Bl. x{-u)  = x{u),  ueR, x(0) = 0, and X(oo) = 1, B2. x is non-decreasing on [0, oo), B3. .x is continuously differentiate, B4. Xj X' and x'( u)u are bounded. 5.2 Consistency of  adjusted-Winsorized estimate The following  theorem shows that under certain regularity conditions the adjusted-Winsorized correlation estimates are consistent, provided that the location and scale estimates are consistent. Theorem 5.1. (Consistency of  adjusted-Winsorized estimate) Let (Xj, Yi),  i = 1,2,•••, n, be a random  sample from  a bivariate  distribution  with location  parameters Hx  and fx Y,  and scale parameters ox and ay, respectively.  Let Ui  = (Xj — Hx)/&x, and Vi  = (Yi  — /j,y)/cry  be the standardized  variables.  Let 0 = (nx,  Hy,  cry), and Q = (fax,  Ay, &X,  <5y) be an estimate  of  0. Then,  if 0n —>• 0, 71—> OO then p Tyy ^ , n—>00 where f w is the adjusted-  Winsorized  estimate  of  correlation  between X  and Y,  and To prove this theorem, we need an extension of  "Serfling's  Lemma" (Serfling  1980, page 253). Lemma 5.1. (Extension of  Serfling's  Lemma) Let Zi — (Xi,  Yi),  i = 1, 2, • • • , n, be a sequence of  independent  random  variables  having an identical  bivariate  distribution with parameter vector 0. Let g(z,  t) : R2 x R4 —R be continuous in t uniformly  on z G Ac(0,  A) for  all  A > 0, and P(z  G A(0,  A)) 0 as A 0, with P(z  e A(0,  0)) = 0. Assume that  |g(z,  t)\ < K  for  all  z € R2. 
Let 0n be a sequence of  random  vectors such E [ip c(U)ip c(V)]  - EIMU)}  EIMV)} (5.3) that  0n —0. Then (5.4) Proof.  We have to show that, for  any given e > 0 (5.5) Now, i n 1 n n 0n)I(zi  e A(0,A))-E[g(Z,  0)1 (Z  G A(0,  A))] + - V  g( Z i, 0n)I( Z i G Ac(0,  A)) - E[g(Z,  0)I(Z  G Ac(0,  A))] n < * n < 1 n -'Yg(z i, 9n)I(zi  G A(9,  A)) n i=i + E[g(Z,  9)I(ZeA(9,  A))] + - £ g(zu  9n)I(zi  G Ac(9,  A)) - E[g(Z,  0)1 (Z  G Ac(9,  A))] n (5.6) < k~y/I(z ieA(9,A)) n 1' + i=l n + i V g(z u 9n)I(zi  G ,4C(0, A))--V g(zi,  9)I( Zi G .4C(0, A)) n n ' 1 = 1 2 = 1 n £ g(zi,  0)I(zi  G Ac(0, A)) - 0)1 (Z  G Ac(0, A))] + n i=i , (5.7) Q (say). Note that the last expression in (5.6) is bounded by the sum of  the last two expressions in (5.7). We will now deal with each of  the four  parts in (5.7). As n oo, e p( z e A{0,  A)), and P(z G A(0, 0)) = 0. n i=i Therefore,  for  any given e > 0, there exists A > 0 such that lim P lk-Yl(zi  € 4(0, A)) < e/4 ) = 1, i->oo I  n -f—'  I i=1 and k E[I(Z  G A(0, A))] = k P{Z  G A(0, A)) < e/4. (5.8) (5.9) We now focus  on the third part of  (5.7). Since 0 n > 0, we have, for  any 6 > 0, n—>oo n—>oo lim P (||0„ - 0|| < 5) = 1. (5.10) Now, for  any e > 0, we can choose <5 = 6(A)  (where A has been chosen before)  such that 19n - 0|| < 5 \g(z,  9n) - g(z,  9) | < e/4, z G Ac(9,  A). (5.11) That is, 5 is chosen in such a way that it will ensure the uniform  continuity of  g(z,  0) in 0 on z e Ac(0,  A). Using (5.10) and (5.11), we have, for  any e > 0, lim P n—>oo [g{z,  0n)-g(z,  0))l(zeA c(0,  A)) <e/4 =1, which gives 1 n <e/4 •)='!. (5.12) lim pf-fj^,  0n)I( Z i e Ac(0,  A)) n—>oo \ n z—' x i=l n Y,9(zi,  0)I{zieA c{0,  A)) j=i For the fourth  part of  (5.7), we can use the Weak Law of  Large Numbers. For any e > 0, n Y9(^0)I(zieAc(0, A)) i=1 Using inequalities (5.8), (5.9), (5.12) and (5.13) in (5.7), we have, lim P n—>oo <e/4 =1. (5.13) lim P (Q  < e) — 1, n—>oo which completes the proof. To prove Theorem 5.1, we also need the following  lemma, which is similar to Lemma 7.7 (Salibian-Barrera 2000, page 217), where the author deals with p-functions (^-functions  according to our notation). Lemma 5.2. (Uniform  continuity of  -0-functions)  Let ip : M —>• E be a continuous function  such that  il>(u)  = —c for  u < —c, and ip(u)  = c for  u > c, where c is a finite constant.  Let m €  A4 and s £ S, where M.  and S are bounded  real intervals,  and inf  <S > 0. Then r/  \ , ( U ~ m\ _ , . „ f{u,m,s)  — ip[  , m e M,  s €  o, is conutinuous in m and s uniformly  in u. The proof  of  this lemma is presented in Chapter Appendix (Section 5.8.1). ( Proof  of  Theorem 5.1 With the use of  Lemma 5.1 and Lemma 5.2 the proof  is straightforward.  We have Z  = (X,Y),  0 = (fx x, Hy,  ®x, cry), and t = (fi x, fiy,  <Jx,  oy). First, let us deal with the second term in the numerator of  Equation 5.2. We have \IIUUI) i=1 Consider g(z,  0) Since the tuning constant of  our score function  ip changes with quadrants, to apply Lemma 5.1 we set A(0,A)  = {z  = (x,y)  : \x - fi x\ < A or \y - fi Y\  < A} . We have to choose 5 in (5.11) such that if  (x — fix,  y — A4y) £ Ac(0,A)  belongs to a particular quadrant, then (x  — fix,  y ~ My)  belongs to the same quadrant. If,  for  ex-ample, (x  — fi x)(y  — A1Y) > 0, then (x — fix)(y  ~ AY) > 0, and, using Lemma 5.2, nti v \ E ^ (^r^) - - m > °) i=1 ^ i=i v X-jjLx Ox I{(X  - fix)(Y  - fiy)  < 0). = "0C = v. x-fix Ox + H( X  ~ Mx)(Y  ~ MY) < 0). ipc = 4>Cl (^f^) is continuous in fi x and ax uniformly  on z G A°(0,  A). 
There-fore,  using Lemma 5.1, we have n ' \ ox ' n->no i=1 v OX That is, E[MU)}-  (5.14) n < * ri-VfYi n i= 1 Similarly, -f^MVi)  E[MV)]. (5.15) n ' * n-ioo n 1=1 Let us now deal with the first  term in the numerator of  Equation 5.2. We have n^ n^ \ ax J  \ oy J i=i i=i \ /  \ / mf)  m^) - MV  - m > o) Considering g(Z,  0) = & rpc = wJU)'P c<V) : we have i y>c(&) <MVi) E[MU)MV)]- (5.16) n n-¥  oo i=l Using (5.16), (5.14) and (5.15) in the numerator of  (5.2), we have ^YMUr)  MVi)  -E [MU)tc{V)]  - E [ip c(U)}  E [MV)}  . (5.17) n ' \ n *—' /  \ n i=1  i=l /  \ i=l Let u's now focus  on the denominator of  Equation 5.2. We have = ktv&r-) i=i i=i x ' X—  1 + ; XX (mr),((x ~M(Y ~M < 0)' Consider g(Z,  0) =  We then have n ' n—>oo i=l Similarly, £[^00]- (5.19) rl f  * n.—inn ^ J n i=1 Using (5.18), (5.19), (5.14) and (5.15), we can say that the denominator of  (5.2) converges in probability to the denominator of  (5.3). • 5.3 Influence  function  of  adjusted-Winsorized esti-mate The following  theorem gives the influence  function  of  the adjusted-Winsorized correlation estimate, when the influence  functions  of  the scale estimates are well-defined. Theorem 5.2. (Influence  function  of  the adjusted-Winsorized estimate) Let (X,  Y)  follow  a continuous distribution  H.  Consider  the adjusted-Winsorized  correlation functional  rw(H)  given by rw(H) = (5.20) with N(H)  = EH X-m x(H) sx(H) — EH X-m x(H) Sx(H) Y  -mY(H) SY(H) (5.21) and 21 V2 2ll/2 (5.22) where mx(H)  and my (H)  are location  functionals,  and sx(H)  andSy(H)  are dispersion Junctionals.  Suppose that 1. The  central  model  Hq  is symmetric  about (m x(Ho),my(H 0)). We  can assume, without  loss of  generality,  that  mx(Ho)  = 0, my(H 0) = 0, sx{H 0) = 1, and Sy(H 0) = 1. 2. The  influence  functions  IF(s x,u, H0) and IF(sy,v,  Hq) are well-defined  for  all z = (u, d)6R2. The  influence  function  of  rW(H)  at Hq and z, denoted  by IF(r w, z, Hq), is given by IF{r w, z, H0) D0N0 - N0DP Dl where No  = EHo{MX)MY)}, Do = yfE H o{iP*(X)}E H o{iPl(Y)} No  = -EH o {MX)MY)}  + - IF(sy,v,H 0) EHO{MX)4'C(Y)Y} - IF(S X,U,H 0)EH o{MY)^' c(X)X}, and Dn EhM( Y)} 2EHo{W(X)} + IF(s x,u,H 0) EHO{2iP C(X)€(X)X}  j Eh 2EHO{R C{Y)} + IF(S y, v, H0) EHO{2MY)  I!>'C(Y)  Y} Proof.  Let H  be given by H tiX  = (l-t)H 0 + tS g. (5.23) For a fixed  z = (u,  v), using (5.21) we can express N(H t>z) = N(t,  z) = N(t)  as N(t)  = 'u-mx(t ) \ /  u M*) my(t)X sy(t) ; sxW yj v — mY(t) sY(t) (5.24) sx(t) Sy{t) +t^c[ u'mx}t))^J v-mY{t ) sx(t) sY(t) - (1 - 2*) < L^F1) } EHo (Y ~ mY[t) sY{t) V  — mY(t) sy(t) (5.25) sy(t)  J  J rc V where o(t)  includes the terms involving t2. Now, since mx(0)  = 0, my(0)  = 0, Sx(0) = 1, and sy(0) = 1, we have d_ dt N(t) = -eHo {MX)MY)}  + Mv)Mv) t=o d ^ f,  fX-m x(t)\  . (Y  — my(t) The last term in (5.26) can be written as d„ J , ( X-m x(t )\ , yr-mv(t) t=0 (5.26) t=o EH 0 < V, y - mr(t) C1 1 Sjc(i) J  m V M*) /f(X-m x(t))(F-my(t))>0 + d_ dt Sx(t)  ) ^ 2 V 3y{t)  J l((X-m x{t))(Y-my{t))<  0 . (5.27) t=0 Interchanging the operations of  differentiation  and integration (see Chapter Appendix, Section 5.8.2, for  the justification),  we can express (5.27) as E Ho <=oJ + E Ho d t=o. 
(5.28) The first  term of  (5.28) can be expressed as ENA dt sx(t) sY(t) t=0J Sx(t)  ;rci v sY(t) sY(t)f t[m Y(t)]-j{s Y(t)}(Y-m Y(t)) 4(t) l(XY  > 0) sY(t)  J  ^ V sx(t) sx(t)£[m x(t)}-£ls x(t)]{X-rn x(t)) 4W I(XY  > 0) t=o = -EH Q^pCI(X)^' CL(Y)[LF(m Y,v,H 0) + IF{S y,V,.H Q)Y] L{XY  > 0) + ^Cl(Y^' Ci(X)[lF(m x,u,H 0) + IF(s x,u,H 0)X}l(XY>0)  (5.29) Similarly, the second term of  (5.28) can be expressed as E Ho d r f x-mx{t) dt  {*«  {-T^tT 1 Y — MY sY(t)  J Q-) l(XY  < 0) t=o. = -EHOL^CL(X)^ CL(Y)[LF(m Y,v,H 0) +.IF(S Y,V,H 0)Y] L{XY  < 0) • +i;CL(Y)i; ,CL(X)[LF(m x,u,H0) + IF(s x,u,HO)X]L(XY  <0) |>. (5.30) Using (5.29) and (5.30) in (5.28), we have = -IF(m Y,v,H 0)EH o{MX)^c(Y)}  ~ IF(m x,u,H 0) EH o{MY)  iP' C(X)} . - IF(S y,V,H Q) EHO{MX)4>C(Y)Y}  - IF(s x,u,H0) EHO{MY)  iP' C(X)  X} = - IF(S y, v, H0) EH0{MX)  €(Y)  Y} - IF(s x, u, H0) EHO{MY)  ^C(X)  X}. (5.31) Using (5.31) in (5.26), we have d_ dt N(t) = -EH0 {IP C(X)IP C(Y)}  + MV)MV) t=o - IF(S y, v, H0) EH0 {M*)  VJX)  Y) - IF(S X, u,H 0) EHO{MY)  ip' c(X)  X}.  (5.32) Using (5.23) in (5.22), we have D(t)  = D!(t)  Dl(t), (5.33) where Di(t) (L -1) EHO \ T x - mx(t) sx(t) + tip c u - mx(t) sx(t) , (5.34) and (1  - t) EHO t Y  - mY(t) + ti> c v — mY{t )\ sr{t)  J  j V sy(J)  J Differentiating  both sides of  (5.33) w.r.t. t, and setting t = 0, we have (5.35) d_ dt D(t) t=o Di(t)  dt D2{t)  dt (5.36) t=o From (5.34), we have t=o + 1>l(u). (5.37) t= o <trao iv sx(t)  ) Using similar arguments as in the case of  the numerator (see Chapter Appendix, Sec-tion 5.8.2), d 2 f  X  — mx(t) E»° < s* Sjf(t) t=0 4=0 d a: - mx(t) sx(t) I(XY  < 0) t=o (5.38) sx(t)  J  ^ V  sx(t) -sx(t)  i[mx(t)] - j[s x(t)}(X  - mx(t)) + 2V, 4 (t) X-m x{t)\  fX-m x(t) tPc I(XY  > 0) sX(t)  J YC 2\ sx(t) sx(t)f t[rn x(t)}-i[s x(t)](X-m x(t)) s2x(t) L(XY  < 0) t=o = -EHO |2 VC1 M'CL {X)  [LF(m x, u, HO) + IF(s x, u, H0) X]  l(XY  > 0) + 2 VC2 (X)iP' C2 (X)  [.IF(m x, u, H0) + IF(s x, u, H0) X] l(XY  < 0) = -IF(m x,u,Ho)E H o{2MX)^' C(X)}  - IF(S X,U,H 0)EH o{2MXU' c(X)X} = -IF(S X,U,H 0)EH o{2MX)^ c(X)X}.  (5.39) Using (5.39) in (5.37), we have = -EHO{RT(X)}  - IF(s x,u,H0)EHO{2MXMX)X}  +^c(u).  (5.40) t=0 Similarly, I*® = -E„a{i,l(Y)}  - IF{sy,v,H 0)EBa{2MY)€(Y)Y}  + i>l{v)-  (5.41) t=0 Using (5.40) and (5.41) in (5.36), we have dt D(t) EHOWX)} t=o EH O{1>1(X)}  - j&u) 2EHO{R C(X)} + IF(s x, u, H 0) EHO{2^ c{X)  iP'C{X)  X) EhM( X)} EhMP)}-2 EHO{IPI(Y)} + IF(s y,V,H 0)EH o{2My)€(y)y]  )• (5.42) We also have and W(*)L=O  = EHO{MX)MY)}  = No  (say), D(t)  |t=0 = JE H o{r c{X)}E H o{r c{Y)}  = Do (say). (5.43) (5.44) Finally, differentiating  both sides of rw(t) w.r.t. t, and setting t = 0, we have IF{r w,z,H 0) m D(t) DQNQ - NQDQ Dl (5.45) (5.46) where N0 = FTN(t)\ t=Q and D0 = f tD(t)| t=Q are obtained from  (5.32) and (5.42), respec-tively, and N0 and D0 are obtained from  (5.43) and (5.44), respectively. • Figure 5.1 shows a 3D-plot of  IF(r w, z, H 0) against z = (u,v),  with u,v E [-10,10], H 0 = N(0,  £), and / 1 P P 1 We used p = 0.5 for  the bivariate normal distribution, and (ci, c2) — (3, 2) for  rw. \ (5.47) Figure 5.1: Influence  curve of  adjusted-Winsorized estimate with (ci,c2) = (3,2). The curve is symmetric about (0,0). Based on Equation 5.46 and Figure 5.1, we can make the following  comments on the influence  function  of  the adjusted-Winsorized estimate: • Because of  Regularity Condition A4, the expectations in (5.46) are bounded. Also, Do > 0. Therefore,  IF(r w, z, H 0) is bounded when IF(sx,  u, H 0) and IF(sy,  v, H 0) are bounded. 
Figure 5.1 also exhibits the boundedness of the influence function.

• Since the ψ-functions are symmetric, the influence function IF(r_W, z, H_0) is symmetric about (u, v) = (m_X(H_0), m_Y(H_0)) = (0, 0) if IF(s_X, u, H_0) is symmetric about u = 0 and IF(s_Y, v, H_0) is symmetric about v = 0.

• By setting c_1 = c_2 = c in IF(r_W, z, H_0), the influence function of the univariate-Winsorized correlation functional can be obtained.

5.3.1 Standard error of the adjusted-Winsorized estimate

Let Z_i = (X_i, Y_i), i = 1, 2, ..., n, be i.i.d. according to a continuous distribution H, and let r_W(H) be the adjusted-Winsorized correlation functional. The influence function of r_W(H) at the central model H_0 and z ∈ R^2, denoted by IF(r_W, z, H_0), is given by (5.46). Using this, the asymptotic variance of the adjusted-Winsorized correlation estimator r̂_W for the central model H_0 can be obtained as (see, for example, Hampel, Ronchetti, Rousseeuw and Stahel 1986, page 85)

AV(r_W, H_0) = E_{H_0}[IF^2(r_W, Z, H_0)].   (5.48)

For a sufficiently large n, the standard error of r̂_W, denoted by SE(r̂_W), is given by

SE(r̂_W) = sqrt(AV(r_W, H_0)/n).   (5.49)

Since it is difficult to get a closed-form expression for (5.48), we can use numerical integration to obtain approximations to the asymptotic variance and the standard error of r_W.

To evaluate the accuracy of the (approximate) standard error of the adjusted-Winsorized correlation estimates, we carried out the following simulation study. We generated 2000 random samples of size n from a bivariate normal distribution with mean vector 0 and covariance matrix Σ = Σ(ρ) given by (5.47). We considered 3 different sample sizes: n = 25, 100 and 400, and 3 different correlation coefficient values: ρ = 0.1, 0.5 and 0.9. The values of (c_1, c_2) chosen for these correlation coefficients are (3, 3), (3, 2) and (3, 1), respectively. The choice of c_1 and c_2 is discussed later, in Section 5.4.

Table 5.1: Evaluation of the standard errors of r_W. The empirical SD and formula-based SE are close.

            n = 25            n = 100           n = 400
    ρ      SD      SE        SD      SE        SD      SE
   0.10   0.203   0.199     0.100   0.099     0.049   0.050
   0.50   0.136   0.146     0.069   0.073     0.034   0.037
   0.90   0.035   0.039     0.018   0.019     0.009   0.010

Table 5.1 presents the obtained results. For each n, the first column gives the empirical standard deviations of the adjusted-Winsorized correlation estimates (based on 2000 samples), while the second column shows the standard errors calculated using (5.49). In general, the differences between the numbers in the two columns are reasonably small, particularly for large sample sizes.

An estimate of the asymptotic variance in (5.48) is given by

AV(r_W, H_n) = (1/n) Σ_{i=1}^{n} IF^2(r_W, z_i, H_n),   (5.50)

and the estimated standard error of r_W, denoted by SE(r̂_W), is given by

SE(r̂_W) = sqrt(AV(r_W, H_n)/n).   (5.51)

5.4 Choice of c_1 and c_2 for r̂_W

It is important to note that the robustness, efficiency and intrinsic bias (to be discussed in the next section) of the adjusted-Winsorized correlation estimate depend on the values chosen for c_1 and c_2. Figure 3.3 in Chapter 3 shows how the bivariate outliers are handled by r̂_W. If we choose large values for c_1 and c_2, r_W will be less resistant to the outliers. On the other hand, this will lead to a decrease in the standard error (an increase in efficiency) and a decrease in the intrinsic bias, both of which are desirable as well. Thus, the choice of c_1 and c_2 may depend on our goal in a particular situation.
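To make the role of the tuning constants concrete, here is a rough sketch of how the adjusted-Winsorized correlation estimate of Definition 5.2 can be computed, using the Huber score function. The median/MAD standardization is our simplification (the theory in this chapter uses S-scales), the major quadrants are taken to be the diagonal pair containing more of the standardized points, and the default c_2 = h c_1 is the data-driven choice discussed next; any other choice of c_2 can be passed in instead.

```python
import numpy as np

def huber_psi(u, c):
    """Huber score function: u truncated at -c and c."""
    return np.clip(u, -c, c)

def adjusted_winsorized_corr(x, y, c1=3.0, c2=None):
    """Rough sketch of the adjusted-Winsorized correlation estimate: joint
    Winsorization with constant c1 in the two major quadrants and a smaller c2
    in the two minor quadrants, then the sample correlation of the transformed
    data.  Median/MAD standardization is used here for simplicity."""
    u = (x - np.median(x)) / (1.4826 * np.median(np.abs(x - np.median(x))))
    v = (y - np.median(y)) / (1.4826 * np.median(np.abs(y - np.median(y))))

    # Major quadrants: the diagonal pair holding the majority of the points.
    pos = u * v >= 0
    major = pos if pos.sum() >= len(u) - pos.sum() else ~pos

    if c2 is None:
        h = (len(u) - major.sum()) / major.sum()   # minor-to-major count ratio
        c2 = h * c1                                # the choice c2 = h * c1

    c = np.where(major, c1, c2)                    # quadrant-dependent tuning constant
    return np.corrcoef(huber_psi(u, c), huber_psi(v, c))[0, 1]

# Correlated normal data with a few bivariate outliers placed in a minor quadrant.
rng = np.random.default_rng(4)
z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
z[:10] = [6.0, -6.0]
print(np.corrcoef(z[:, 0], z[:, 1])[0, 1],           # classical estimate
      adjusted_winsorized_corr(z[:, 0], z[:, 1]))    # adjusted-Winsorized sketch
```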
For application purposes, we can first select an appropriate value for c_1 (the larger tuning constant for the two "major" quadrants, i.e., the quadrants that contain the majority of the standardized data). Then, for the two "minor" quadrants, we can use c_2 = h c_1, where h is the ratio of the number of observations in the minor quadrants to the number of observations in the major quadrants. Note that, as |ρ| increases from 0 to 1, the asymptotic value of h decreases from 1 to 0.

5.5 Intrinsic bias in the adjusted-Winsorized estimate

Let Z = (X, Y) have a continuous distribution H, and let r_W(H) be the adjusted-Winsorized correlation functional. Let the central model H_0 be given by H_0 = N(0, Σ), where Σ = Σ(ρ) is given by (5.47), and ρ = ρ(H_0) is the true correlation coefficient of X and Y.

The intrinsic bias of r_W occurs at the central model H_0 because of the data transformation. Since the adjusted-Winsorized data have a slightly different correlation coefficient than that of the original data, r_W(H_0) ≠ ρ(H_0). Therefore, the intrinsic bias of r_W, denoted by IB(r_W), is given by

IB(r_W) = r_W(H_0) − ρ(H_0).   (5.52)

To compare r_W(H_0) and ρ(H_0) empirically, we generated random samples of size n = 100000 from a bivariate normal distribution with mean 0 and covariance matrix Σ = Σ(ρ). We considered several values of ρ from −1 to 1. To calculate r_W = r_W(H_0) we used the Huber score function with different values of c_1, with c_2 = h c_1, where h is defined in the last section.

Figure 5.2 displays the plots of r_W against ρ for c_1 = 0.01, 1, 2 and 3. Based on these plots we can make the following comments:

• The intrinsic bias of r_W decreases as c_1 increases.

• r_W is a non-decreasing function of ρ.

• Consider 0 < ρ < 1. The magnitude of the intrinsic bias increases from zero to reach its maximum at ρ = 0.5, and then decreases to zero. Similar behavior is observed for −1 < ρ < 0.

• For −1 < ρ < 0 the intrinsic bias is negative, while for 0 < ρ < 1 the intrinsic bias is positive. This is the exact opposite of the results obtained for the univariate-Winsorized estimate (see Alqallaf 2003, Figure 4.5, page 97), for which the bias is positive when −1 < ρ < 0, and negative when 0 < ρ < 1.

[Figure 5.2: Intrinsic bias in adjusted-Winsorized estimates with c_2 = h c_1, for c_1 = 0.01, 1.00, 2.00 and 3.00. The bias in r_W decreases as c_1 increases.]

To compare the behavior of r_W (with c_1 ≈ 0, c_2 ≈ 0) with the univariate-Winsorized correlation estimate r = r(H_0) (with c ≈ 0), we plotted r with c = 0.01 against ρ in Figure 5.3 (which is a reproduction of the top left plot of Figure 4.5, Alqallaf 2003). This plot is the exact opposite of the top left plot of Figure 5.2 in terms of the sign of the intrinsic bias. The reason may be as follows. Though c_1 = 0 implies c_2 = 0 for the adjusted-Winsorized estimates, c_2 = h c_1 approaches zero faster than c_1, making the limit of r_W different from the limit of r as c tends to zero (the limit in the latter case being the quadrant correlation estimate).

[Figure 5.3: Intrinsic bias in the univariate-Winsorized estimate (c = 0.01).]

5.5.1 Achieving (approximate) Fisher-consistency for r_W

Let H_0 = N(0, Σ(ρ)) and let r_W(ρ, c_1, c_2) be the asymptotic value of r_W(H_0) when we use tuning constants c_1 and c_2.
A plot of  rw(p,  Ci, c2) against p is exhibited by Figure 5.2 for different  values of  C\ and c2 = hci. Recall that h is the ratio of  the number of  observations in the minor quadrants to the number of  observations in the major quadrants. To fix ideas and without loss of  generality, let us assume that p > 0. We notice that when c2 = C\ (univariate-Winsorized estimate, Alqallaf  2003) rw(p,c\,c\)  < p, while rw(p,ci,hci)  > p when c2 = hc\. Therefore,  to achieve Fisher-consistency for  a fixed  value of  p, we could use an appropriate value of  c2 between c2 = hci and c2 = c\. That is, we could use c2 = aci, where a G (h,  1) is such that rw(p,  c\, aci) = p. In practice, we could approximate a by numerical means. For a fixed  value of  Ci we could obtain a table relating p and a — gi (p).  Since h is a decreasing function  of  p we have that a = g(h). Tables relating a and h could be constructed by numerical means for  any desired value of  C\. We will not elaborate this approach further,  since a simple approach that works remarkably well is presented below. To avoid the construction and use of  numerous tables, we can use a simple alternative that does not require any table. Since c2 = hc\ and c2 = c\ give biases of  similar (though not same) magnitudes with opposite signs, we can use c2 = Ci(h + l)/2, that is, a = (h  + l)/2. Figure 5.4 displays the plots of  rw against p for  Ci = 1 with c2 = hc\ and c2 = ci(h  + l)/2. Though the first  plot (c2 = hc\) is the same as the top right plot of Figure 5.2, it is presented again here to make its scale more comparable to that of  the. second plot. The degree of  Fisher-consistency achieved by using c2 = ci(h  +1)/2 is quite satisfactory. Note that c2 = Ci(h + l)/2 < ci. Therefore,  with this choice of  c2 the adjusted-Winsorized estimate is still more resistant to bivariate outliers than the univariate-Winsorized estimate. At the same time, the extra tuning constant c2 allows us to make our estimate approximately Fisher-consistent. c1 = 1, c2 = hc1 c1 = 1, c2 = c1 (h+1 )/2 p p Figure 5.4: Approximate Fisher-consistency for  rw. By using c2 = Ci(h  + l)/2  we get less intrinsic bias than c2 = hc\. We mentioned in Chapter 3 that though we used c2 = hc\ in this study, a more reasonable choice would have been c2 = Vhci  (i.e., = hcf),  because the areas of the two squares should be proportional to the number of  observations they contain. Interestingly, (h  + l)/2 (the shrinkage factor  that gives approximate Fisher-consistency) is the first-order  Taylor expansion of  \fh. 5.6 Asymptotic normality of  adjusted-Winsorized es-timate Since the indicator functions  involved in the adjusted-Winsorized correlation estimate are not differentiate,  proving the asymptotic normality of  this estimate is extremely difficult.  One approach we can consider is to replace the sample indicator functions  by the true ones, which we can do only if  the amount of  error due to the replacement is o(l/y/n).  See Chapter Appendix (Section 5.8.3) for  more details of  this idea, where we use this approach to establish the asymptotic normality of  the numerator of  (5.2), i.e., the asymptotic normality of  the adjusted-Winsorized "covariance" estimates. Unfortunately,  for  the denominator of  (5.2), the amount of  error due to replacing the sample indicator functions  by the true ones is 0(1/\/n) (see Chapter Appendix, Section 5.8.4), and cannot be ignored. 
Therefore,  the above approach cannot be used to establish the asymptotic normality of  the adjusted-Winsorized "correlation" estimates. As a remedy of  this, we can use differentiate  versions of  the indicator functions, denoted by 71 (x Y ) and 72(0 = 1 — 7i(-)> respectively. For example, 71 can be the distribution function  of  a continous random variable with support (—e, e), for any small e > 0. Using the functions  71 and 72, we now define  the smoothed adjusted-/ Winsorization of  the data, and the smoothed adjusted-Winsorized correlation estimates as follows. Definition  5.3. (Smoothed adjusted-Winsorization) The  smoothed  adjusted-Winsorization  of  (u,v)  €  R2, denoted  by \&f(u,  t>), is defined  as ^sc(u,v)  = (^ sc(u)^ sc(v)),  (5.53) where $c( u) = ^Cl(uhi(uv)  + ^2(^)72  (uv), ipc( v) = ipa(vhi(uv)  + i)C2(v)-y 2(uv), ip is a non-decreasing  symmetric  function,  and c\ and c2 are chosen constants. Definition  5.4. (Smoothed adjusted-Winsorized estimate of  correlation) Let (Xi,  Yi),  i = 1,2, • • • , n, be a random  sample from  a bivariate  distribution  with loca-tion parameters fix  and and scale parameters ox and cry, respectively.  Let 0 = (fix,  IJ-Y,  &Y), and 6 = (fax,  fay,  by) be an estimate  of  0. Denote Ui  = (Xi  — fax)/bx>  and Vi  = (Yi  — fay)jby.  Then,  the smoothed  adjusted-Winsorized  correlation estimate,  fg,  is defined  as rs = i= 1 ie€{m scm- iE) UE i=1 i=l KTHM i=i i=l mmvi) 2 = 1 (5.54) The following  theorem states the asymptotic normality of  the smoothed adjusted-Winsorized correlation estimates, provided that the ip- and x-functions  satisfy  the above conditions, and the location estimates are consistent. Theorem 5.3. (Asymptotic normality of  the smoothed adjusted-Winsorized estimate) Let (Xi,  Yi),  i = 1, 2, • • • , n, be a random  sample from  an elliptically  sym-metric  bivariate  distribution  with location  parameters p,x and /j,y,  and scale parameters ax and ay, respectively.  Let fax  and fay  be consistent  estimates  of  p,x and [ly,  and ax and by are S-estimates  of  ax and ay with score functions  satisfying  conditions  B1 - B4-Then,  ~ \fc(r S-rs) ~^N(Q,AV), 71—>00 where f w is defined  in (5.54)  ip-functions  satisfying  conditions  Al - A4, E rs = IE IE S(Y-hy c \ ay and the variance AV  of  the limiting  distribution  is given by AV  = v;ev9, where £ = {an},  with an — Cov (Qi,  OA; i — 1,2, 3; j = 1, 2, 3; (3x3) Ql = M^A^I^A-^-X^-^ Q2 q3 € ox  X-fix Ox Y-fiy ay ay D X Ox K w (Y  - fly X D Y ay #17 f  X  - fi X\  #18 (Y  - fly D; -x ox D, -x aY K W  (Y  - fly X D Y 1 ay u #20 / X  - fix DxX\  Ox U W g \y/VW'  2VVVW'  2wvvwj ' with U  = E V  = E W  = E s i X  - fix Wc Ox Ipc Y-fly ay Ox S ~ MX ay ,2 I Y  - My c and the constants  used in the above expressions are specified  below: , (X-  fi X\  f  X  — fix Dx — E Dy = E X Ox /  \ Ox , fY  - fiy\  (Y  - fly X ay ay (5.55) (5.56) ox E < 2 + — ox x — fix  \ (X  — fi X\  , (  Y  fly  \ 2 fx  ~ fi XY  - fly V>ci I J 7l E ox / V ox X-fix Oy Ox Oy Ci OX . 
Y  - fly  \ FX  - fix  Y  - fiy V>ci I —:— ) 71 Oy Ox Oy 7i , f  X  - fix  Y  - fiy\  f  X  - fl XY  - fiy Ox 128 aY Ox Oy (5.57) K 2 is obtained  by interchanging  X  and Y  in (5.57),  K 3 is obtained  by replacing  c\ by c2, and 71 by 72 in (5.57),  K±  is obtained  by interchanging  X  and Y,  and replacing  C\ by c2, and 71 by 72 in (5.57), K 5 = —E ox € C1 x - Hx\  (X  - Hx\  , (Y  - hy Wc2 Ox /  \ ox ' x - \ixy - hy 7i Ox OY 72 oY X  - (1 X  Y  - jJLy Ox oy H  E ox Ox 1 ^W, (•7;(x-Y-"" Ox X  - ij, xY  - hy c\ ox oY  X  - nx ox oy J  \ ox X  — Hx  Y  — fly oy C2 72 . Ox Y~  fly Oy 7i oy X  — PxY  — fly Ox oy X  - nxY  - fjL Y\  f  X  - nxY  - ny 72 I —' —II — —) , (5.58) Ox ^ OY J \ Ox Oy J J is obtained  by interchanging  X  and Y,  and c\ and c2 in (5.58),  K-j  is obtained  by interchanging  C\ and c2 in (5.58),  K&  is obtained  by interchanging  X  and Y  in (5.58), K 9 •= K\  + + + K 7, K w = K 2 + K a + K 6 + K 8, (5.59) (5.60) K 11 -E ox c 1 X-llx Ox Ox € c 1 X  - Hx  \ f  X  - fjLx 7i X  - fl X  Y  - [ly Ox J  \ Ox J  \ Ox X  - Hx\  f  X  - Hx  Y  -/ly^ 7i oy 7i Ox J  \ Ox X  - fjL XY  - fJ,y Ox OY oY X  - tlx  Y  - JJLy Ox oY , (5.61) K 12 -E oy X Vx Ox 7i X - nxY - jiY ox oY X  - /j, xY  - /ly ox oY X  - nxY  - ny Ox Oy , (5.62) #13 and #14 are obtained  by replacing  c\ by and 71 by 72 in (5.61)  and (5.62)  (re-spectively), K 15 ox € X  - nx\ fX  - fjL X c 1 7i Ox 7i H  E ox 2 „ + — E ox Ox / V ox X  - fi xY  - fly Ox Oy *» V ^ X  - fi xY  - fiy Ox Oy X-IM X\ C2 72 X-fix Ox X  - fix  Y  - fly ox MX € oy X C2 Mx 72 /  \ ox X  - HxY  - fly Ox C1 c 2 Ox J X  - fix  Y  - fly X  — fix Vi Oy X  - fix  Y  - fly TP, ci a^ C7y X-flx Ox J  \ Ox X  - fl XY  - fly Oy OX C2 72 72 . Ox X-fix Ox - MX  Y 7i Mr ay Oy X  - fix  Y  - fly Ox Oy X  - fix  Y  - fly Ox oy , (5.63) #16 = -E oY C1 X-fix C2 X-fix Ox J  \ Ox X  - fix  Y  - fly 7i X  - fix  Y  - fly Oy Ox 'X  -Oy MX C1 ox 72 Ox Oy X  - fj XY  - fly ox Mx^ C2 72 , 7i ff*  / X  - fl X  Y  - fly OX Oy Oy X  - fl X  Y  - fly Ox Oy X  - fl X  Y  - fiy Ox Oy , (5.64) #17 = #11 +#13 + #15, #18 = #12 + #14 + #16, (5.65) (5.66) and, finally,  #i9 and #2o are obtained  by interchanging  X  and Y  in (5.65)  and (5.66), respectively. Sketch of  the Proof. The numerator of  (5.54) can be written as Tl  /  -rr ~ ' -I n /  ST Yi  - fiy ns=1-j:M ^k -1- E (1E* V J V OY ) nf^  V °x Jnfrl c • ay (5.67) We can express the first  term of  (5.67) as ip^yi^) Xi-  fi x\ f  Xi  - fi x Yi  - fi Y\  , , (Xi-  fx x\ (Xi  - jj,xYi-  fi Y 7i —= ~ + —r 72 Ox J  \ Ox OY J \ OX J \ OX OY x {^  7i (+ ^ ( Yl^Y) l2( Xi~ ft*  * -OY  / \ Ox OY  J  \ OY  J  \ ox oy 1 v^ /  (  Xi  ~ MX  \ , (  Yi  - fly  \ 2 ( Xi  - MX  Yi  - fly n •<—' \ Ox J  \ oY  ) \ Ox oY n \ ox ) \ OY ) \ OX OY + Y - Ax V ( Yi  - fl Y  \ [ Xj-flxYj-  fl y^j ^ fXj  - fix  Yj  - fly  ' \ Ox J  \ Oy J  \ Ox Oy ) \ Ox Oy - fly  \ (  Xi  - fl X  Yi  - fly\  (  Xi~  fl XYi~  fly* 7i : : 72 f Oy J  \ Ox Oy J  \ Ox Oy J (5.68) Let us consider the first  term of  (5.68). Using Taylor expansion about {/J>x,ox, My,oy), we can write - f>,  (fe^ ^ Y': n \ Ox ) \ Oy ) \ Ox Oy J n \ °x ) \ oY  ) \ ox oY  J - -4- XX (^v^fe  (^k f^L^) {6x _ nax ^ \ ox J  ox . 
Sketch of the Proof. The numerator of (5.54) can be written as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\,\psi_c^s(V_i) - \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(V_i). \qquad (5.67)$$
We can express the first term of (5.67) as
$$\frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\gamma_1(U_iV_i) + \psi_{c_2}(U_i)\gamma_2(U_iV_i)\big\}\big\{\psi_{c_1}(V_i)\gamma_1(U_iV_i) + \psi_{c_2}(V_i)\gamma_2(U_iV_i)\big\}$$
$$= \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\gamma_1^2(U_iV_i) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,\gamma_2^2(U_iV_i) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_2}(V_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_1}(V_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i). \qquad (5.68)$$
Let us consider the first term of (5.68). Using Taylor expansion about $(\mu_X, \sigma_X, \mu_Y, \sigma_Y)$, and writing $\tilde X_i = (X_i - \mu_X)/\sigma_X$ and $\tilde Y_i = (Y_i - \mu_Y)/\sigma_Y$, we can write
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\gamma_1^2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,\gamma_1^2(\tilde X_i\tilde Y_i)$$
$$\quad - \frac{1}{n\sigma_X}\sum_{i=1}^{n}\big\{\psi_{c_1}'(\tilde X_i)\,\tilde X_i\,\psi_{c_1}(\tilde Y_i)\,\gamma_1^2(\tilde X_i\tilde Y_i) + 2\,\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,\gamma_1(\tilde X_i\tilde Y_i)\,\gamma_1'(\tilde X_i\tilde Y_i)\,\tilde X_i\tilde Y_i\big\}\,(\hat\sigma_X - \sigma_X)$$
$$\quad - \frac{1}{n\sigma_Y}\sum_{i=1}^{n}\big\{\psi_{c_1}(\tilde X_i)\,\psi_{c_1}'(\tilde Y_i)\,\tilde Y_i\,\gamma_1^2(\tilde X_i\tilde Y_i) + 2\,\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,\gamma_1(\tilde X_i\tilde Y_i)\,\gamma_1'(\tilde X_i\tilde Y_i)\,\tilde X_i\tilde Y_i\big\}\,(\hat\sigma_Y - \sigma_Y),$$
since the coefficients of $(\hat\mu_X - \mu_X)$ and $(\hat\mu_Y - \mu_Y)$ converge to zero in probability (here and in what follows, "$\doteq$" means "asymptotically equivalent"). Thus,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\gamma_1^2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,\gamma_1^2(\tilde X_i\tilde Y_i) - K_1(\hat\sigma_X - \sigma_X) - K_2(\hat\sigma_Y - \sigma_Y), \qquad (5.69)$$
where $K_1$ and $K_2$ are as defined in the statement of the theorem. Similarly, the next three terms of (5.68) can be expressed as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,\gamma_2^2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(\tilde X_i)\,\psi_{c_2}(\tilde Y_i)\,\gamma_2^2(\tilde X_i\tilde Y_i) - K_3(\hat\sigma_X - \sigma_X) - K_4(\hat\sigma_Y - \sigma_Y), \qquad (5.70)$$
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_2}(V_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_2}(\tilde Y_i)\,\gamma_1(\tilde X_i\tilde Y_i)\,\gamma_2(\tilde X_i\tilde Y_i) - K_5(\hat\sigma_X - \sigma_X) - K_6(\hat\sigma_Y - \sigma_Y), \qquad (5.71)$$
and
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_1}(V_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,\gamma_1(\tilde X_i\tilde Y_i)\,\gamma_2(\tilde X_i\tilde Y_i) - K_7(\hat\sigma_X - \sigma_X) - K_8(\hat\sigma_Y - \sigma_Y), \qquad (5.72)$$
respectively, where $K_3$, $K_5$, $K_7$ and $K_8$ are as defined in the statement of the theorem. Using (5.69), (5.70), (5.71) and (5.72) in (5.68), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\,\psi_c^s(V_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(\tilde X_i)\psi_{c_1}(\tilde Y_i)\gamma_1^2(\tilde X_i\tilde Y_i) + \psi_{c_2}(\tilde X_i)\psi_{c_2}(\tilde Y_i)\gamma_2^2(\tilde X_i\tilde Y_i) + \psi_{c_1}(\tilde X_i)\psi_{c_2}(\tilde Y_i)\gamma_1(\tilde X_i\tilde Y_i)\gamma_2(\tilde X_i\tilde Y_i) + \psi_{c_2}(\tilde X_i)\psi_{c_1}(\tilde Y_i)\gamma_1(\tilde X_i\tilde Y_i)\gamma_2(\tilde X_i\tilde Y_i)\big\} - K_9(\hat\sigma_X - \sigma_X) - K_{10}(\hat\sigma_Y - \sigma_Y), \qquad (5.73)$$
where $K_9 = K_1 + K_3 + K_5 + K_7$ and $K_{10} = K_2 + K_4 + K_6 + K_8$.
Now, it is straightforward to show that the second term of (5.67) is $o(1/\sqrt{n})$. Therefore, using (5.73) in (5.67), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\,\psi_c^s(V_i) - \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(V_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(\tilde X_i)\,\psi_c^s(\tilde Y_i) - K_9(\hat\sigma_X - \sigma_X) - K_{10}(\hat\sigma_Y - \sigma_Y). \qquad (5.74)$$
Note that the first 4 terms of the right-hand-side of (5.73) are combined to get a single term in (5.74). Since $\hat\sigma_X$ and $\hat\sigma_Y$ are S-scales, we have (see Alqallaf 2003, page 156)
$$\hat\sigma_X - \sigma_X \doteq \frac{1}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big], \qquad (5.75)$$
and
$$\hat\sigma_Y - \sigma_Y \doteq \frac{1}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big]. \qquad (5.76)$$
Using (5.75) and (5.76) in (5.74), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\,\psi_c^s(V_i) - \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(V_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(\tilde X_i)\,\psi_c^s(\tilde Y_i) - \frac{K_9}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big] - \frac{K_{10}}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big], \qquad (5.77)$$
where $D_X$ and $D_Y$ are defined in the statement of the theorem (equations 5.55 and 5.56).
The first term of the denominator of (5.54) can be written as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)^2 - \Big(\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\Big)^2. \qquad (5.78)$$
We can express the first term of (5.78) as
$$\frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\gamma_1(U_iV_i) + \psi_{c_2}(U_i)\gamma_2(U_iV_i)\big\}^2 = \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}^2(U_i)\,\gamma_1^2(U_iV_i) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}^2(U_i)\,\gamma_2^2(U_iV_i) + \frac{2}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_2}(U_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i). \qquad (5.79)$$
As in the case of the numerator, by using Taylor expansion about $(\mu_X, \sigma_X, \mu_Y, \sigma_Y)$, the three terms on the right-hand-side of (5.79) can be expressed as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}^2(U_i)\,\gamma_1^2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}^2(\tilde X_i)\,\gamma_1^2(\tilde X_i\tilde Y_i) - K_{11}(\hat\sigma_X - \sigma_X) - K_{12}(\hat\sigma_Y - \sigma_Y), \qquad (5.80)$$
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}^2(U_i)\,\gamma_2^2(U_iV_i) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}^2(\tilde X_i)\,\gamma_2^2(\tilde X_i\tilde Y_i) - K_{13}(\hat\sigma_X - \sigma_X) - K_{14}(\hat\sigma_Y - \sigma_Y), \qquad (5.81)$$
$$\frac{2}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_2}(U_i)\,\gamma_1(U_iV_i)\,\gamma_2(U_iV_i) \doteq \frac{2}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_2}(\tilde X_i)\,\gamma_1(\tilde X_i\tilde Y_i)\,\gamma_2(\tilde X_i\tilde Y_i) - K_{15}(\hat\sigma_X - \sigma_X) - K_{16}(\hat\sigma_Y - \sigma_Y), \qquad (5.82)$$
where $K_{11}$, $K_{12}$, $K_{13}$, $K_{14}$, $K_{15}$ and $K_{16}$ are as defined in the statement of the theorem. Using (5.80), (5.81) and (5.82) in (5.79), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)^2 \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(\tilde X_i)^2 - K_{17}(\hat\sigma_X - \sigma_X) - K_{18}(\hat\sigma_Y - \sigma_Y), \qquad (5.83)$$
where $K_{17} = K_{11} + K_{13} + K_{15}$ and $K_{18} = K_{12} + K_{14} + K_{16}$. Now, it can be shown that the second term of (5.78) is $o(1/\sqrt{n})$. Therefore, using (5.83) in (5.78), and using (5.75) and (5.76), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)^2 - \Big(\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(U_i)\Big)^2 \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(\tilde X_i)^2 - \frac{K_{17}}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big] - \frac{K_{18}}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big]. \qquad (5.84)$$
Similarly, for the second term of the denominator of (5.54) we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(V_i)^2 - \Big(\frac{1}{n}\sum_{i=1}^{n}\psi_c^s(V_i)\Big)^2 \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c^s(\tilde Y_i)^2 - \frac{K_{19}}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big] - \frac{K_{20}}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big]. \qquad (5.85)$$
Using (5.77), (5.84) and (5.85), and ignoring the terms which are $o(1/\sqrt{n})$, we can state that
$$\sqrt{n}\left(\begin{bmatrix} \frac{1}{n}\sum_{i=1}^{n} Q_1(X_i, Y_i) \\ \frac{1}{n}\sum_{i=1}^{n} Q_2(X_i, Y_i) \\ \frac{1}{n}\sum_{i=1}^{n} Q_3(X_i, Y_i) \end{bmatrix} - \begin{bmatrix} U \\ V \\ W \end{bmatrix}\right) \xrightarrow{d} N(0, \Sigma), \qquad (5.86)$$
where $\Sigma$ is as defined in the statement of the theorem. Finally, we can use the $\delta$-method (Billingsley 1986) on (5.86) to complete the proof. $\Box$

5.7 Conclusion

The main contribution of this chapter is that we established some asymptotic properties of the adjusted-Winsorized correlation estimate (the new robust correlation estimate proposed in this thesis). Our estimate is consistent and has bounded influence. We obtained its asymptotic variance and intrinsic bias. The tuning constants of this estimate can be chosen such that we have approximate Fisher-consistency. A smoothed version of this estimate is asymptotically normal. The computing time of this estimate is $O(n \log n)$, the same as that of the univariate-Winsorized correlation estimate, but our estimate is more resistant to bivariate outliers.

5.8 Chapter Appendix

5.8.1 Proof of Lemma 5.2

We need to show that, for any $\epsilon > 0$, there exist $\delta_m > 0$ and $\delta_s > 0$ such that, for all $u \in \mathbb{R}$,
$$|m_1 - m_2| < \delta_m, \; |s_1 - s_2| < \delta_s \;\Longrightarrow\; |f(u, m_1, s_1) - f(u, m_2, s_2)| < \epsilon.$$
Following Salibian-Barrera (2000), we will first show that there exists a closed and bounded interval $U$ such that, for any $m_1, m_2 \in \mathcal{M}$, $s_1, s_2 \in \mathcal{S}$,
$$f(u, m_1, s_1) = f(u, m_2, s_2) \quad \text{for all } u \notin U. \qquad (5.87)$$
It is given that $\underline{m} \le m \le \overline{m}$ and $\underline{s} \le s \le \overline{s}$. Consider $u > \overline{m} + \overline{s}c$. For any $m \le \overline{m}$ and $s \le \overline{s}$ we have $(u - m)/s > c$. Now, consider $u < \underline{m} - \overline{s}c$. For any $m \ge \underline{m}$ and $s \le \overline{s}$ we have $(u - m)/s < -c$. Thus, for $U = [\underline{m} - \overline{s}c,\; \overline{m} + \overline{s}c]$, (5.87) holds. For $u \in U$, $\psi$ is uniformly continuous. Therefore, we only need to show that
$$\left|\frac{u - m_1}{s_1} - \frac{u - m_2}{s_2}\right| = \left|\frac{u(s_2 - s_1) + (m_2 s_1 - m_1 s_2)}{s_1 s_2}\right| \le |u|\,\frac{|s_2 - s_1|}{s_1 s_2} + \frac{|m_2 s_1 - m_1 s_2|}{s_1 s_2} \le K_U\,\frac{|s_2 - s_1|}{s_1 s_2} + |m_2|\,\frac{|s_2 - s_1|}{s_1 s_2} + \frac{|m_2 - m_1|}{s_1},$$
where $K_U = \sup\{|u| : u \in U\}$. Thus, $\left|\frac{u - m_1}{s_1} - \frac{u - m_2}{s_2}\right| < \delta$ if $|m_2 - m_1|$ and $|s_2 - s_1|$ are sufficiently small.
5.8.2 Influence function: interchanging differentiation and integration

The first term of the right-hand-side of (5.27) can be expressed as
$$\frac{d}{dt}\left[\int_{m_X(t)}^{\infty}\int_{m_Y(t)}^{\infty}\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)f(x,y)\,dy\,dx\right]_{t=0} + \frac{d}{dt}\left[\int_{-\infty}^{m_X(t)}\int_{-\infty}^{m_Y(t)}\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)f(x,y)\,dy\,dx\right]_{t=0}. \qquad (5.88)$$
The first part of (5.88) can be written as
$$\lim_{t \to 0}\frac{1}{t}\left\{\int_{m_X(t)}^{\infty}\int_{m_Y(t)}^{\infty}\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)f(x,y)\,dy\,dx - \int_{0}^{\infty}\int_{0}^{\infty}\psi_{c_1}(x)\,\psi_{c_1}(y)\,f(x,y)\,dy\,dx\right\}$$
$$= \lim_{t \to 0}\frac{1}{t}\int_{m_X(t)}^{\infty}\int_{m_Y(t)}^{\infty}\left\{\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big) - \psi_{c_1}(x)\,\psi_{c_1}(y)\right\}f(x,y)\,dy\,dx - \lim_{t \to 0}\frac{1}{t}\int_{0}^{m_X(t)}\int_{0}^{\infty}\psi_{c_1}(x)\,\psi_{c_1}(y)\,f(x,y)\,dy\,dx - \lim_{t \to 0}\frac{1}{t}\int_{0}^{\infty}\int_{0}^{m_Y(t)}\psi_{c_1}(x)\,\psi_{c_1}(y)\,f(x,y)\,dy\,dx + \lim_{t \to 0}\frac{1}{t}\int_{0}^{m_X(t)}\int_{0}^{m_Y(t)}\psi_{c_1}(x)\,\psi_{c_1}(y)\,f(x,y)\,dy\,dx \qquad (5.89)$$
$$= \lim_{t \to 0}\frac{1}{t}\int_{0}^{\infty}\int_{0}^{\infty}\left\{\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big) - \psi_{c_1}(x)\,\psi_{c_1}(y)\right\}I(x - m_X(t) > 0)\,I(y - m_Y(t) > 0)\,f(x,y)\,dy\,dx, \qquad (5.90)$$
since the last three terms of (5.89) are zero. Now, assuming that $\psi(u)$, $\psi'(u)$ and $\psi'(u)u$ are bounded for all $u \in \mathbb{R}$ (under Regularity Condition A4), we can show that the difference quotient inside (5.90) is bounded. Therefore, using Lebesgue's Dominated Convergence Theorem (see, for example, Bartle 1995, page 44), we have
$$\frac{d}{dt}\left[\int_{m_X(t)}^{\infty}\int_{m_Y(t)}^{\infty}\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)f(x,y)\,dy\,dx\right]_{t=0} = \int_{0}^{\infty}\int_{0}^{\infty}\frac{d}{dt}\left[\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)\right]_{t=0}f(x,y)\,dy\,dx. \qquad (5.91)$$
Similarly,
$$\frac{d}{dt}\left[\int_{-\infty}^{m_X(t)}\int_{-\infty}^{m_Y(t)}\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)f(x,y)\,dy\,dx\right]_{t=0} = \int_{-\infty}^{0}\int_{-\infty}^{0}\frac{d}{dt}\left[\psi_{c_1}\!\Big(\frac{x - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{y - m_Y(t)}{s_Y(t)}\Big)\right]_{t=0}f(x,y)\,dy\,dx. \qquad (5.92)$$
Combining (5.91) and (5.92), we have
$$\frac{d}{dt}\,E_{H_0}\!\left[\psi_{c_1}\!\Big(\frac{X - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{Y - m_Y(t)}{s_Y(t)}\Big)I\big((X - m_X(t))(Y - m_Y(t)) > 0\big)\right]_{t=0} = E_{H_0}\!\left[\frac{d}{dt}\left\{\psi_{c_1}\!\Big(\frac{X - m_X(t)}{s_X(t)}\Big)\psi_{c_1}\!\Big(\frac{Y - m_Y(t)}{s_Y(t)}\Big)\right\}_{t=0}I(XY > 0)\right]. \qquad (5.93)$$
We can now write (5.28) from (5.27). In a similar way, we can show that
$$\frac{d}{dt}\,E_{H_0}\!\left[\psi_{c_1}^2\!\Big(\frac{X - m_X(t)}{s_X(t)}\Big)I\big((X - m_X(t))(Y - m_Y(t)) > 0\big)\right]_{t=0} = E_{H_0}\!\left[\frac{d}{dt}\left\{\psi_{c_1}^2\!\Big(\frac{X - m_X(t)}{s_X(t)}\Big)\right\}_{t=0}I(XY > 0)\right], \qquad (5.94)$$
which gives (5.38).

5.8.3 Asymptotic normality of the adjusted-Winsorized "covariance" estimate

The numerator of (5.2) can be written as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) - \frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big). \qquad (5.95)$$
Now, we can express the first term of (5.95) as
$$\frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\hat I_i(c_1) + \psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,\hat I_i(c_2)\big\}, \qquad (5.96)$$
where $\hat I_i(c_1) = I\big((X_i - \hat\mu_X)(Y_i - \hat\mu_Y) > 0\big)$, and $\hat I_i(c_2) = I\big((X_i - \hat\mu_X)(Y_i - \hat\mu_Y) \le 0\big)$. Denote $I_i(c_1) = I\big((X_i - \mu_X)(Y_i - \mu_Y) > 0\big)$, and $I_i(c_2) = I\big((X_i - \mu_X)(Y_i - \mu_Y) \le 0\big)$. Let us focus on the cases when $\hat I_i(c_1)$ and $I_i(c_1)$ take different values. Assuming (without loss of generality) that $\mu_X < \hat\mu_X$, we can argue that $\hat I_i(c_1) \ne I_i(c_1)$ when $\mu_X < X_i < \hat\mu_X$. (Or, $\mu_Y < Y_i < \hat\mu_Y$.) Now, $\hat\mu_X - \mu_X$ is $O(1/\sqrt{n})$, which means $\hat I_i(c_1) - I_i(c_1)$ is $O(1/\sqrt{n})$. Also, for the $X_i$'s above, $\psi_{c_1}(U_i)$ is $O(1/\sqrt{n})$ since $\psi$ is linear in the neighborhood of zero, while $\psi_{c_1}(V_i)$ is bounded. Therefore,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\big(\hat I_i(c_1) - I_i(c_1)\big) = O(1/n). \qquad (5.97)$$
Similarly,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,\big(\hat I_i(c_2) - I_i(c_2)\big) = O(1/n). \qquad (5.98)$$
Using (5.97) and (5.98) in (5.96), we have
$$\frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,\hat I_i(c_1) + \psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,\hat I_i(c_2)\big\} \doteq \frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,I_i(c_1) + \psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,I_i(c_2)\big\}, \qquad (5.99)$$
where "$\doteq$" means "asymptotically equivalent". Let us now focus on the second term of (5.95). We can write
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c(U_i) = \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\hat I_i(c_1) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\hat I_i(c_2) = \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,I_i(c_1) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,I_i(c_2) + O(1/\sqrt{n}). \qquad (5.100)$$
Using Taylor expansion about $(\mu_X, \sigma_X)$ in (5.100), and expressing all the terms that involve $\hat\mu_X - \mu_X$ or $\hat\sigma_X - \sigma_X$ as $O(1/\sqrt{n})$, we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c(U_i) = \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,I_i(c_1) + \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(\tilde X_i)\,I_i(c_2) + O(1/\sqrt{n}), \qquad (5.101)$$
where $\tilde X_i = (X_i - \mu_X)/\sigma_X$ and $\tilde Y_i = (Y_i - \mu_Y)/\sigma_Y$. Since $E\big[\psi_{c_1}(\tilde X)\,I(c_1)\big] = 0$, we have $\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,I_i(c_1) = O(1/\sqrt{n})$. Similarly, $\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(\tilde X_i)\,I_i(c_2) = O(1/\sqrt{n})$.
Using these results in (5.101), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big) = O(1/\sqrt{n}). \qquad (5.102)$$
Similarly,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) = O(1/\sqrt{n}). \qquad (5.103)$$
Therefore, we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) = o(1/\sqrt{n}). \qquad (5.104)$$
Using (5.99) and (5.104) in (5.95), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) - \frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\;\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) \doteq \frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,I_i(c_1) + \psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,I_i(c_2)\big\}. \qquad (5.105)$$
Now, using Taylor expansion about $(\mu_X, \sigma_X)$, we have
$$\psi_{c_1}(U_i) = \psi_{c_1}(\tilde X_i) - \psi_{c_1}'\!\Big(\frac{X_i - \mu_X^*}{\sigma_X^*}\Big)\frac{1}{\sigma_X^*}(\hat\mu_X - \mu_X) - \psi_{c_1}'\!\Big(\frac{X_i - \mu_X^*}{\sigma_X^*}\Big)\frac{X_i - \mu_X^*}{(\sigma_X^*)^2}(\hat\sigma_X - \sigma_X), \qquad (5.106)$$
for some $\mu_X^*$ between $\mu_X$ and $\hat\mu_X$, and $\sigma_X^*$ between $\sigma_X$ and $\hat\sigma_X$. Similarly,
$$\psi_{c_1}(V_i) = \psi_{c_1}(\tilde Y_i) - \psi_{c_1}'\!\Big(\frac{Y_i - \mu_Y^*}{\sigma_Y^*}\Big)\frac{1}{\sigma_Y^*}(\hat\mu_Y - \mu_Y) - \psi_{c_1}'\!\Big(\frac{Y_i - \mu_Y^*}{\sigma_Y^*}\Big)\frac{Y_i - \mu_Y^*}{(\sigma_Y^*)^2}(\hat\sigma_Y - \sigma_Y), \qquad (5.107)$$
for some $\mu_Y^*$ between $\mu_Y$ and $\hat\mu_Y$, and $\sigma_Y^*$ between $\sigma_Y$ and $\hat\sigma_Y$. Using (5.106) and (5.107) in the first part of the right-hand-side of (5.105), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,I_i(c_1) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,I_i(c_1) - \frac{1}{n\sigma_Y}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}'(\tilde Y_i)\,\tilde Y_i\,I_i(c_1)\,(\hat\sigma_Y - \sigma_Y) - \frac{1}{n\sigma_X}\sum_{i=1}^{n}\psi_{c_1}(\tilde Y_i)\,\psi_{c_1}'(\tilde X_i)\,\tilde X_i\,I_i(c_1)\,(\hat\sigma_X - \sigma_X), \qquad (5.108)$$
where the other terms are ignored since they are $o(1/\sqrt{n})$. Using Lemma 5.1, we have
$$\frac{1}{n\sigma_Y}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}'(\tilde Y_i)\,\tilde Y_i\,I_i(c_1) \xrightarrow[n\to\infty]{} \frac{1}{\sigma_Y}\,E\big[\psi_{c_1}(\tilde X)\,\psi_{c_1}'(\tilde Y)\,\tilde Y\,I(c_1)\big] = A_{c_1}, \qquad (5.109)$$
and
$$\frac{1}{n\sigma_X}\sum_{i=1}^{n}\psi_{c_1}(\tilde Y_i)\,\psi_{c_1}'(\tilde X_i)\,\tilde X_i\,I_i(c_1) \xrightarrow[n\to\infty]{} \frac{1}{\sigma_X}\,E\big[\psi_{c_1}(\tilde Y)\,\psi_{c_1}'(\tilde X)\,\tilde X\,I(c_1)\big] = B_{c_1}, \qquad (5.110)$$
where $I(c_1) = I\big((X - \mu_X)(Y - \mu_Y) > 0\big)$. Therefore,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,I_i(c_1) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,I_i(c_1) - A_{c_1}(\hat\sigma_Y - \sigma_Y) - B_{c_1}(\hat\sigma_X - \sigma_X). \qquad (5.111)$$
Since $\hat\sigma_X$ and $\hat\sigma_Y$ are S-scales, using (5.75) and (5.76) in (5.111), we have
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(U_i)\,\psi_{c_1}(V_i)\,I_i(c_1) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_1}(\tilde X_i)\,\psi_{c_1}(\tilde Y_i)\,I_i(c_1) - \frac{A_{c_1}}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big] - \frac{B_{c_1}}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big], \qquad (5.112)$$
where
$$D_X = E\big[\chi'(\tilde X)\,\tilde X\big], \qquad (5.113)$$
and
$$D_Y = E\big[\chi'(\tilde Y)\,\tilde Y\big]. \qquad (5.114)$$
Similarly,
$$\frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(U_i)\,\psi_{c_2}(V_i)\,I_i(c_2) \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_{c_2}(\tilde X_i)\,\psi_{c_2}(\tilde Y_i)\,I_i(c_2) - \frac{A_{c_2}}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big] - \frac{B_{c_2}}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big], \qquad (5.115)$$
where
$$A_{c_2} = \frac{1}{\sigma_Y}\,E\big[\psi_{c_2}(\tilde X)\,\psi_{c_2}'(\tilde Y)\,\tilde Y\,I(c_2)\big], \qquad (5.116)$$
$$B_{c_2} = \frac{1}{\sigma_X}\,E\big[\psi_{c_2}(\tilde Y)\,\psi_{c_2}'(\tilde X)\,\tilde X\,I(c_2)\big], \qquad (5.117)$$
and $D_X$ and $D_Y$ are as above. Using (5.112) and (5.115) in (5.105), we have
$$\frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}(U_i)\psi_{c_1}(V_i)I_i(c_1) + \psi_{c_2}(U_i)\psi_{c_2}(V_i)I_i(c_2)\big\} \doteq \frac{1}{n}\sum_{i=1}^{n}\psi_c(\tilde X_i)\,\psi_c(\tilde Y_i) - \frac{A_c}{D_Y}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde Y_i) - b\big] - \frac{B_c}{D_X}\,\frac{1}{n}\sum_{i=1}^{n}\big[\chi(\tilde X_i) - b\big], \qquad (5.118)$$
where $A_c = A_{c_1} + A_{c_2}$, and $B_c = B_{c_1} + B_{c_2}$. Using (5.118), we can state that
$$\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) - \frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\,\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{Y_i - \hat\mu_Y}{\hat\sigma_Y}\Big) - E\big\{\psi_c(\tilde X)\,\psi_c(\tilde Y)\big\}\right] \xrightarrow{d} N(0, Q),$$
where
$$Q = \mathrm{Var}\!\left[\psi_c(\tilde X)\,\psi_c(\tilde Y) - \frac{A_c}{D_Y}\,\chi(\tilde Y) - \frac{B_c}{D_X}\,\chi(\tilde X)\right].$$

5.8.4 Difficulty with the denominator of (5.2)

Let us focus on the first term in the denominator of (5.2). It can be expressed as
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^2\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big) - \left(\frac{1}{n}\sum_{i=1}^{n}\psi_c\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big)\right)^2. \qquad (5.119)$$
As in the case of the numerator, we can write
$$\frac{1}{n}\sum_{i=1}^{n}\psi_c^2\!\Big(\frac{X_i - \hat\mu_X}{\hat\sigma_X}\Big) = \frac{1}{n}\sum_{i=1}^{n}\big\{\psi_{c_1}^2(U_i)\,\hat I_i(c_1) + \psi_{c_2}^2(U_i)\,\hat I_i(c_2)\big\}. \qquad (5.120)$$
Unfortunately, this time we cannot replace $\hat I_i(\cdot)$ by $I_i(\cdot)$. The reason is as follows. When $\hat I_i(c_1) \ne I_i(c_1)$, we have either (i) $\mu_X < X_i < \hat\mu_X$, or (ii) $\mu_Y < Y_i < \hat\mu_Y$. In the first case, as in the case of the numerator (see Section 5.8.3), we can show that the amount of error due to replacing $\hat I_i(\cdot)$ by $I_i(\cdot)$ is $O(1/n)$. However, in the second case, though $\hat\mu_Y - \mu_Y$ is $O(1/\sqrt{n})$, the term $\psi_{c_1}\big((Y_i - \hat\mu_Y)/\hat\sigma_Y\big)$ (which is $O(1/\sqrt{n})$ for the $Y_i$'s above) is not there to make the product $O(1/n)$. Therefore, the amount of error in replacing the sample indicator functions by the true ones is $O(1/\sqrt{n})$, and cannot be ignored.
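For comparison with the smoothed estimate sketched after Theorem 5.3, the following is a minimal sketch of the estimate analysed in Sections 5.8.3 and 5.8.4, in which the smooth weights $\gamma_1, \gamma_2$ are replaced by the hard indicators $\hat I_i(c_1)$ and $\hat I_i(c_2)$. As before, the median/MAD standardization and the numerical values of $c_1$ and $c_2$ are illustrative assumptions, not the data-dependent choices developed in the thesis.

```python
import numpy as np

def psi_huber(u, c):
    """Huber-type score, repeated here so this sketch is self-contained."""
    return np.clip(u, -c, c)

def adjusted_winsorized_corr(x, y, c1=2.5, c2=1.0):
    """Indicator-based adjusted-Winsorized correlation (cf. Sections 5.8.3-5.8.4).

    Uses cutoff c1 when (X_i - mu_X_hat)(Y_i - mu_Y_hat) > 0 and c2 otherwise.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    mad = lambda z: 1.4826 * np.median(np.abs(z - np.median(z)))
    u = (x - np.median(x)) / mad(x)
    v = (y - np.median(y)) / mad(y)
    conc = (u * v) > 0                       # hat-I_i(c1): same-sign quadrants
    pu = np.where(conc, psi_huber(u, c1), psi_huber(u, c2))
    pv = np.where(conc, psi_huber(v, c1), psi_huber(v, c2))
    num = np.mean(pu * pv) - np.mean(pu) * np.mean(pv)
    den = np.sqrt((np.mean(pu**2) - np.mean(pu)**2) *
                  (np.mean(pv**2) - np.mean(pv)**2))
    return num / den
```

In practice the two versions give very similar values when $\epsilon$ is small; the smoothed form is the one for which Theorem 5.3 establishes asymptotic normality, because, as Section 5.8.4 shows, the hard indicators in the denominator prevent the argument of Section 5.8.3 from carrying over.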
Chapter 6

Conclusion

In this study, we considered the problem of selecting linear prediction models for large high-dimensional datasets that possibly contain a fraction of contaminations. Our goal was to achieve robustness and scalability at the same time. We considered one-step and two-step model building procedures, the latter consisting of sequencing and segmentation steps.

We will now summarize the main ideas proposed in this thesis, and the main results obtained.

One-step model building

We proposed robust versions of the step-by-step algorithms FS and SW. We expressed these classical algorithms in terms of sample means, variances and correlations, and replaced these sample quantities by their robust counterparts to obtain the robust algorithms. We used robust correlations derived from a simplified version of bivariate M-estimates of the scatter matrix. We proposed robust partial F-tests for stopping during the implementation of the robust FS and SW procedures.

Our robust methods have much better performance compared to the standard FS and SW algorithms. Also, they are computationally very suitable, and scalable to large dimensions.

Two-step model building

Robust sequencing

We considered the time-efficient algorithm LARS to sequence (some of) the d covariates to form a list such that the good predictors are likely to appear in the beginning. Since LARS is not resistant to outliers, we proposed two different approaches to robustify LARS. In the plug-in approach, we replaced the classical correlations in LARS by easily computable robust correlation estimates. In the data-cleaning approach, we first transformed the dataset by shrinking the outliers towards the bulk of the data (which we call multivariate-Winsorization), and then applied standard LARS on the transformed data. The data-cleaning approach is more time-consuming than the plug-in approach when only some of the predictors are being sequenced.

For both approaches (plug-in and data-cleaning), we used robust correlations derived from a simplified version of the bivariate M-estimates of the scatter matrix. We also proposed correlation estimates using bivariate-Winsorization of the data. We showed that the latter is faster to compute, with important time differences when the number of candidate predictors becomes large.

We recommend combining robust LARS with bootstrap to obtain more stable and reliable results. The reduced sets obtained by bootstrapped robust LARS contain more of the important covariates than the reduced sets obtained by initial robust LARS. To obtain a reduced set of m covariates for further investigation, we introduced a learning curve that plots robust R2 values versus dimension. An appropriate value of m is the dimension corresponding to the point where the curve starts to level off.

Robust segmentation

We performed all possible subsets regression on the reduced set of covariates obtained in the first step. Since the classical selection criteria FPE, AIC, Cp, CV and bootstrap are sensitive to outliers, we needed robust selection criteria for this purpose. We identified certain limitations of the Robust AIC (Ronchetti 1985) and robust CV (Ronchetti, Field and Blanchard 1997) methods. We proposed computationally suitable robust CV and robust bootstrap procedures in this thesis. We evaluated our methods using simulated and real datasets, and compared them with the classical methods as well as the robust FPE proposed by Yohai (1997). Our robust CV and robust bootstrap methods have better performance compared to the classical methods and robust FPE.
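To illustrate the flavour of the segmentation step, the sketch below scores every candidate subset of a (small) reduced set with a robust cross-validation criterion: each subset is fit on the training folds by a Huber M-estimate of regression computed with iteratively reweighted least squares, and scored by a trimmed mean of absolute out-of-fold prediction errors. The fitting method, the trimming fraction and the scoring rule are simple stand-ins chosen for illustration; they are not the specific robust CV or robust bootstrap procedures developed in this thesis.

```python
import numpy as np
from itertools import combinations

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate of regression via iteratively reweighted least squares."""
    Xd = np.column_stack([np.ones(len(y)), X])                 # add intercept
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - Xd @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12   # robust residual scale
        w = np.clip(c / (np.abs(r) / s + 1e-12), None, 1.0)        # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)[0]
    return beta

def robust_cv_score(X, y, cols, k=5, trim=0.1, seed=0):
    """Trimmed mean of absolute K-fold prediction errors for one candidate subset."""
    rng = np.random.default_rng(seed)          # same folds for every subset
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[l] for l in range(k) if l != j])
        beta = huber_irls(X[np.ix_(train, cols)], y[train])
        pred = np.column_stack([np.ones(len(test)), X[np.ix_(test, cols)]]) @ beta
        errs.append(np.abs(y[test] - pred))
    errs = np.sort(np.concatenate(errs))
    keep = int(np.floor((1.0 - trim) * len(errs)))              # drop the largest errors
    return errs[:keep].mean()

def robust_all_subsets(X, y, max_size=None):
    """All-possible-subsets search on the (small) reduced set of covariates."""
    p = X.shape[1]
    max_size = p if max_size is None else max_size
    best = None
    for size in range(1, max_size + 1):
        for cols in combinations(range(p), size):
            score = robust_cv_score(X, y, list(cols))
            if best is None or score < best[0]:
                best = (score, cols)
    return best
```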
Adjusted-Winsorized correlation estimate

For the development of robust LARS, we proposed this new correlation estimate for bivariate data. The proposed estimate is consistent and has bounded influence. We obtained its asymptotic variance and intrinsic bias. The tuning constants of this estimate can be chosen such that we have approximate Fisher-consistency. A smoothed version of this estimate is asymptotically normal. The computing time of this estimate is the same (approximately) as that of the univariate-Winsorized correlation estimate, but our estimate is more resistant to bivariate outliers.

Bibliography

Akaike, H. (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21: 243-247.

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22: 203-217.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, Academiai Kiado, Budapest, pages 267-281.

Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16: 125-127.

Alqallaf, F. A. (2003). A new contamination model for robust estimation with large high-dimensional datasets. PhD thesis, Department of Mathematics (Institute of Applied Mathematics), University of British Columbia.

Alqallaf, F. A., Konis, K. P., Martin, R. D., and Zamar, R. H. (2002). Scalable robust covariance and correlation estimates for data mining. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, pages 14-23.

Bartle, R. G. (1995). The Elements of Integration and Lebesgue Measure. John Wiley & Sons.

Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika, 67: 546-551.

Billingsley, P. (1986). Probability and Measure. John Wiley & Sons, 2nd edition.

Croux, C., Van Aelst, S., and Dehon, C. (2003). Bounded influence regression using high breakdown scatter matrices. Ann. Inst. Statist. Math., 55(2): 265-285.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78: 316-331.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, 32(2): 407-499.

Frank, I. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109-148.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70: 320-328.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics. John Wiley & Sons.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12: 55-67.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Hubert, M. and Engelen, S. (2004). Fast cross-validation of high-breakdown resampling methods for PCA. Unpublished manuscript.

Knight, W. R. (1966). A computer method for calculating Kendall's tau with ungrouped data. Journal of the American Statistical Association, 61: 436-439.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22: 79-86.

Lachenbruch, P. and Mickey, M. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10: 1-11.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15: 661-675.
Mallows, C. L. (1995). More comments on Cp. Technometrics, 37: 362-372.

Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. The Annals of Statistics, 4: 51-67.

Mendenhall, W. and Sincich, T. (2003). A Second Course in Statistics: Regression Analysis. Pearson Education, Inc., New Jersey, 6th edition.

Morgenthaler, S., Welsch, R. E., and Zenide, A. (2003). Algorithms for robust model selection in linear regression. Theory and Applications of Recent Robust Methods, eds. M. Hubert, G. Pison, A. Struyf, and S. Van Aelst, Basel (Switzerland): Birkhauser-Verlag.

Ronchetti, E. (1985). Robust model selection in regression. Statistics and Probability Letters, 3: 21-23.

Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. Journal of the American Statistical Association, 92: 1017-1023.

Ronchetti, E. and Staudte, R. G. (1994). A robust version of Mallows's Cp. Journal of the American Statistical Association, 89: 550-559.

Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79: 871-880.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York.

Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41: 212-223.

Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regression by means of S-estimators. Robust and Nonlinear Time Series Analysis (J. Franke, W. Hardle, and R. D. Martin, eds.), Lecture Notes in Statistics 26, Springer Verlag, New York: 256-272.

Salibian-Barrera, M. (2000). Contributions to the theory of robust inference. PhD thesis, Department of Statistics, University of British Columbia.

Salibian-Barrera, M. and Zamar, R. H. (2002). Bootstrapping robust estimates of regression. The Annals of Statistics, 30: 556-582.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

Shao, J. (1996). Bootstrap Model Selection. Journal of the American Statistical Association, 91: 655-665.

Sommer, S. and Huggins, R. M. (1996). Variable selection using the Wald Test and Robust Cp. Journal of the Royal Statistical Society, Ser. B, 45: 15-29.

Stone, M. (1974). Cross-validation choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Ser. B, 36: 111-147.

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Ser. B, 39: 44-47.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Ser. B, 58: 267-288.

Weisberg, S. (1985). Applied Linear Regression. Wiley-Interscience, New York, 2nd edition.

Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. The Annals of Statistics, 15: 642-656.

Yohai, V. J. (1997). A new robust model selection criterion for linear models: RFPE. Unpublished manuscript.
