Robust Linear Model Selection for High-Dimensional Datasets

by

MD JAFAR AHMED KHAN
M.Sc., The University of British Columbia, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics)

The University of British Columbia
December 2006

© Md Jafar Ahmed Khan, 2006

Abstract

This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contaminations that are difficult to visualize and clean. We aim at predicting the future non-outlying cases. Therefore, we need methods that are robust and scalable at the same time.

We consider two different strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are sequenced to form a list such that the good predictors are likely to appear in the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set.

The second step (of the two-step model building procedure) - which we call segmentation - carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) to conduct robust inferences in linear regression.

We introduce the idea of "multivariate-Winsorization", which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over the univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).

Contents

Abstract
Contents
List of Tables
List of Figures
List of Notation
Acknowledgements
Dedication

1 Introduction
1.1 Motivation
1.2 Model selection strategy
1.2.1 One-step model building
1.2.2 Two-step model building
1.3 Computation of robust correlation matrices
1.4 Organization of subsequent chapters

2 One-step Model Building: Robust Forward Selection and Stepwise Procedures
2.1 Introduction
2.2 Review: classical step-by-step algorithms
2.2.1 Forward Selection (FS)
2.2.2 Stepwise (SW)
2.2.3 Backward Elimination (BE)
2.3 FS and SW Expressed in Correlations
2.3.1 FS expressed in terms of correlations
2.3.2 SW expressed in terms of correlations
2.4 Robustification of FS and SW algorithms
2.4.1 Numerical complexity of the algorithms
2.4.2 Limitation of the proposed algorithms
2.5 A simulation study
2.5.1 Model selection with Spearman's ρ and Kendall's τ
2.6 Examples
2.7 Conclusion

3 Two-step Model Building: Robust Sequencing with Least Angle Regression
3.1 Introduction
3.2 Review: Least Angle Regression (LARS)
3.2.1 Forward Stagewise procedure (Stagewise)
3.2.2 The LARS algorithm
3.2.3 LARS and Shrinkage methods
3.3 LARS expressed in terms of correlations
3.4 Robustification of LARS
3.4.1 Robust Plug-in
3.4.2 Robust Data Cleaning
3.4.3 Simulations
3.5 Size of the reduced set
3.6 Bootstrapped sequencing
3.7 Learning curves
3.8 Examples
3.9 Conclusion
3.10 Chapter Appendix
3.10.1 Determination of γ for one active covariate
3.10.2 Quantities related to the equiangular vector B_A
3.10.3 Determination of γ for two or more active covariates

4 Two-step Model Building: Robust Segmentation
4.1 Introduction
4.2 Review: classical selection criteria
4.2.1 Akaike Information Criterion (AIC)
4.2.2 Mallows' C_p
4.2.3 Final Prediction Error (FPE)
4.2.4 Cross-validation
4.2.5 Bootstrap
4.3 Review: robust selection criteria
4.3.1 Robust AIC
4.3.2 Robust C_p
4.3.3 Robust FPE
4.4 Robust cross-validation
4.4.1 Dealing with numerical complexity
4.5 Robust bootstrap
4.6 Simulation study
4.6.1 Robustness of the estimates
4.6.2 Final model selection
4.7 Examples
4.7.1 Demographic data
4.7.2 Protein data
4.8 Conclusion

5 Properties of Adjusted-Winsorized Correlation Estimate
5.1 Introduction
5.2 Consistency of adjusted-Winsorized estimate
5.3 Influence function of adjusted-Winsorized estimate
5.3.1 Standard error of adjusted-Winsorized estimate
5.4 Choice of c_1 and c_2 for r_w
5.5 Intrinsic bias in adjusted-Winsorized estimate
5.5.1 Achieving (approximate) Fisher-consistency for r_w
5.6 Asymptotic normality of adjusted-Winsorized estimate
5.7 Conclusion
5.8 Chapter Appendix
5.8.1 Proof of Lemma 5.2
5.8.2 Influence function: interchanging differentiation and integration
5.8.3 Asymptotic normality of the adjusted-Winsorized "covariance" estimate
5.8.4 Difficulty with the denominator of (5.2)

6 Conclusion

Bibliography

List of Tables

2.1 Performance of the classical and robust methods in clean and contaminated data for the moderate-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the average number of noise variables (Noise) selected are shown.
2.2 Performance of the classical and robust methods in clean and contaminated data for the no-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the average number of noise variables (Noise) selected are shown.
3.1 Percentages of correct sequences obtained by classical and robust methods for univariate and leverage designs with 4 different error distributions.
4.1 First 10 trials: classical and robust estimates of prediction errors.
4.2 Performance of the classical and robust methods of segmentation (evaluation of all possible subsets of the reduced set).
Evaluation of the standard errors of r_w. The empirical SD and formula-based SE are close.

List of Figures

2.1 QQ-plot of the robust partial F values against the theoretical χ²_1 quantiles.
3.1 Limitation of separate univariate-Winsorizations (c = 2). The bivariate outliers are left almost unchanged.
3.2 Bivariate Winsorizations for clean and contaminated data. The ellipse for the contaminated data is only slightly larger than that for the clean data.
3.3 Adjusted-Winsorization (for initial estimate R_0) with c_1 = 2, c_2 = 1. The bivariate outliers are now shrunken to the corner of the smaller square.
3.4 Numerical complexity of different correlation estimates. Each estimate can be computed in O(n log n) time, but Maronna's estimate has a larger multiplication factor.
3.5 Numerical complexity of different techniques. LARS requires O(nd²) time. W plug-in and M plug-in both require O((n log n)d²) time, but M plug-in has a larger multiplication factor.
3.6 Recall curves for a = 9; (a) no correlation (b) low correlation (c) high correlation. The 4 curves for (robust) LARS correspond to 4 levels of contamination.
3.7 Recall curves for a = 15 and moderate correlation with 4 different levels of contamination.
3.8 Recall curves for robust LARS and bootstrapped robust LARS for covariates with moderate correlation; (a) a = 9 (b) a = 15. The 4 curves for each method correspond to 4 levels of contamination.
3.9 Learning curve for Pollution data. A reduced set of 8 covariates is suggested by the plot.
3.10 Learning curve for Demographic data. A reduced set of 12 covariates is suggested by the plot.
3.11 Error densities for the two "best" models for Demographic data. The "best of 12" model gives a more stable result.
3.12 Learning curve for Protein data. A reduced set of 5 covariates is suggested by the plot.
5.1 Influence curve of adjusted-Winsorized estimate with (c_1, c_2) = (3, 2). The curve is symmetric about (0, 0).
5.2 Intrinsic bias in adjusted-Winsorized estimates with c_2 = h·c_1. The bias in r_w decreases as c_1 increases.
5.3 Intrinsic bias in univariate-Winsorized estimate (c = 0.01).
5.4 Approximate Fisher-consistency for r_w. By using c_2 = c_1(h + 1)/2 we get less intrinsic bias than with c_2 = h·c_1.

List of Notation

n - number of observations (rows) in the dataset
d - total number of covariates
m - number of covariates in the reduced set
r_jY - correlation between covariate X_j and response Y
r_{jY.1} - partial correlation between covariate X_j (j ≠ 1) and response Y adjusted for X_1
r̃_{jY.1} - quantity proportional to r_{jY.1}
w.l.o.g. - without loss of generality
FS - forward selection procedure
SW - stepwise procedure
LARS - Least Angle Regression
SD - standard deviation
med - median
mad - median absolute deviation from the median
MSPE - mean squared prediction error

Acknowledgements

I would like to express my most sincere and deepest gratitude to my revered supervisor, Dr. Ruben H. Zamar. This thesis would not have been completed without his excellent guidance, generous support and continuous motivation. In the beginning, when I wanted to quit my study due to my son's illness, he inspired me to stay the course. This thesis, therefore, belongs to him. I am also very grateful to his wife for allowing me to call their residence anytime I wanted.
I offer my very special thanks to Dr. Stefan Van Aelst (University of Ghent, Belgium), my co-supervisor, for his invaluable suggestions and guidance throughout this study, and his sponsorship of my stay in Belgium. I am very grateful for the hospitality I was offered by Stefan and his wife during that time.
My sincere thanks go to Dr. Matias Salibian-Barrera, member of my supervisory committee, for his effective suggestions that helped me save many days of work. I am very grateful to him for allowing me to use some of his user-friendly computer codes.
Respectful thanks are due to Dr. Harry Joe, Dr. Raymond Ng (Computer Science, UBC), and Dr. Roy E.
Welsch (MIT) for taking the time to read my thesis and provide useful suggestions and comments.
I sincerely thank Christine Graham, Rhoda Morgan, Elaine Salameh and Peggy Ng for their help with administrative matters. Christine and Rhoda are no longer with this department, but they will be remembered forever.
My heartfelt thanks go to my dear friends Mike ("The Man from St. Petersburg"), Jean-François (we all know him as Jeff), Lawrence, Marin, Justin and other graduate students for their invaluable support during my journey to the PhD. I benefitted from helpful discussions with Mike and Jean-François, and from their brilliant suggestions. (Mike also suggested how to acknowledge his suggestions!) They also made my life in Vancouver such a pleasant experience!
Finally, I would like to pay tribute to robustness with a poem that I wrote during my graduate studies:

Ashes to ashes, dust to dust -
The statistics I use must be robust!

JAFAR AHMED KHAN
The University of British Columbia
December 2006

To Reba, Roja and Aumi

Chapter 1
Introduction

1.1 Motivation

We consider the problem of building a linear prediction model when there is a large number d of candidate covariates. Large datasets usually contain a fraction of outliers and other contaminations, which are difficult to visualize and clean. The classical algorithms are much affected by these outliers and, therefore, often fail to select the 'correct' linear prediction model that would have been chosen if there were no outliers. We argue that it is not reasonable to attempt to predict future outliers without knowledge of the underlying mechanism that produces them. Therefore, we aim at predicting the future non-outlying cases by fitting well the majority of the data. For this, we need a robust method that is capable of selecting the important variables in the presence of outliers in high-dimensional datasets.
Robust model selection has not received much attention in the robustness literature. Seminal papers that address this issue include Ronchetti (1985) and Ronchetti and Staudte (1994), which introduced robust versions of the selection criteria AIC and C_p, respectively. Yohai (1997) proposed a robust Final Prediction Error (FPE) criterion (for the Splus documentation). Ronchetti, Field and Blanchard (1997) proposed robust model selection by cross-validation. Morgenthaler, Welsch and Zenide (2003) constructed a selection technique to simultaneously identify the correct model structure as well as unusual observations. All these robust methods require the fitting of all submodels. One exception is the model selection based on the Wald test (Sommer and Huggins 1996), which requires the computation of estimates from the full model only.
A major drawback of the existing robust model selection methods is that they do not scale up to large dimensions, because fitting a robust model is a nonlinear optimization problem. As the number d of possible predictors increases, the number of submodels (which is 2^d - 1) increases dramatically, making the computational burden enormous. Also, the methods that require the fitting of only the full model are not suitable, because only a few of the d covariates are typically included in the final model, and the fitting of the full model increases the numerical complexity of the methods unnecessarily. In this study, we attempt to achieve robustness and computational suitability at the same time.
That is, we attempt to develop linear prediction model building strategies that are simultaneously (i) capable of selecting the important covariates in the presence of contaminations, and (ii) scalable to high dimensions. The term "scalable" is used to indicate that the numerical complexity of the proposed statistical methods is "reasonable" (e.g., not exponential).

1.2 Model selection strategy

We consider two different strategies for the selection of a linear prediction model for high-dimensional datasets: (a) one-step model building and (b) two-step model building, which are described below.

1.2.1 One-step model building

Since for large values of d the computational burden of all possible subsets regression is enormous, we turn our focus to step-by-step algorithms like the forward selection (FS) and stepwise (SW) procedures (see, for example, Weisberg 1985, Chapter 8) that can stop when certain goals are achieved. Classical FS and SW procedures yield poor results when the data contain outliers and other contaminations, since they attempt to select the covariates that will fit all the cases well (including the outliers). Therefore, our goal is to develop robust step-by-step algorithms that will select important variables in the presence of outliers, and predict well the future non-outlying cases.
We express the classical FS and SW algorithms in terms of sample means, variances and correlations, and replace these sample quantities by their robust counterparts to obtain robust step-by-step algorithms. Similar ideas have been used for building robust estimators of regression parameters (see, for example, Croux, Van Aelst and Dehon 2003, and the references therein). We also incorporate robust partial F-tests as stopping rules during the implementation of these robust algorithms.

1.2.2 Two-step model building

Our two-step model building procedure is a blend of all possible subsets regression and step-by-step algorithms. All possible subsets regression is expected to select a better model (with respect to predictive power) than any step-by-step algorithm, but its computational burden is extremely high for large values of d. We, therefore, consider applying this procedure to a "reduced set" of covariates. Thus, we proceed in two steps. The first step - which we call sequencing - quickly screens out unimportant variables to form a "reduced set" for further consideration. The second step - which we call segmentation - carefully examines different subsets of the variables in the reduced set for possible inclusion in the prediction model. These two steps are described below.

Sequencing

The goal of the first step is a drastic reduction of the number of candidate covariates. The input variables are sequenced to form a list such that the good predictors are likely to appear at the beginning of the list. The first m covariates of the list then form the reduced set from which the final prediction model will be obtained.
One strategy for sequencing the candidate covariates is to use one of the several available step-by-step or stagewise algorithms, e.g., Forward Selection (FS) or the Forward Stagewise procedure (Stagewise) (see, for example, Hastie, Tibshirani and Friedman 2001, Chapter 10). We focus on the powerful algorithm recently proposed by Efron, Hastie, Johnstone and Tibshirani (2004) called Least Angle Regression (LARS), which is a mathematical solution to the Stagewise problem.
LARS is computationally efficient and has been shown to have clear statistical advantages over other step-by-step and stagewise algorithms. Since LARS is very sensitive to contamination, our goal is to robustify LARS. We show that LARS can be expressed in terms of the mean vector and covariance matrix of the data, and we replace these classical ingredients of LARS by their robust counterparts to obtain robust LARS. We combine the robust LARS algorithm with bootstrap to obtain a more stable and reliable list of covariates.
One important issue in the sequencing step is to determine the appropriate value of the number of covariates, m, for the reduced set. The probability that the reduced set contains all the important variables increases with m. Unfortunately, the computational cost of the second step, segmentation, also increases with m. Therefore, we aim to determine a "reasonable" value of m which is large enough to include most of the important variables but not so large as to make the second step impractical or infeasible. For this purpose, we introduce a "learning curve" that plots robust R² values versus dimension. An appropriate value of m is the dimension corresponding to the point where the curve starts to level off.

Segmentation

When we have a reduced set of m covariates for further consideration, one reasonable approach to reach the final model is to perform all possible subsets regression on this reduced set using an appropriate selection criterion. Again, the classical selection criteria, e.g., Final Prediction Error (FPE), Akaike Information Criterion (AIC), Mallows' C_p, cross-validation (CV) and bootstrap procedures, are not resistant to outliers. The robust AIC procedure (Ronchetti 1985) has certain limitations, which are discussed in this thesis. The robust CV method (Ronchetti, Field and Blanchard 1997) is computationally expensive.
In this study, we propose computationally suitable robust CV and robust bootstrap procedures to evaluate the predictive powers of different subsets of the reduced set of covariates. Our robust bootstrap procedure is similar to the methods proposed by Salibian-Barrera (2000), and Salibian-Barrera and Zamar (2002) to conduct robust statistical inferences in linear regression. Since the performance of the robust FPE procedure (Yohai 1997) has not been studied so far, we also evaluate this method in our study.

1.3 Computation of robust correlation matrices

As mentioned earlier, our approach to the robustification of FS, SW and LARS consists of expressing these algorithms in terms of the mean vector and the correlation matrix of the data, and then replacing these classical ingredients by their robust counterparts. Therefore, robust estimation of the correlation matrix is a major component of the robust methods we propose in this study. The computation of robust correlation estimates from a

where the first part is the (squared) denominator of (2.9) for j = k + 1, the second part is the numerator of (2.10) for j = k + 1 and h = k, and the last part is the denominator of (2.10) for h = k. Therefore, ρ_{j(k+1).12...k} is a function of the original correlations. This completes the proof.

FS steps in correlations

We can now summarize the FS algorithm in terms of correlations among the original variables as follows:

1. To select the first covariate X_{m1}, determine m_1 = argmax_j |r_jY|.
2. To select the kth covariate X_{mk} (k = 2, 3, ...), calculate r̃_{jY.m1...m(k-1)}, which is proportional to the partial correlation between X_j and Y adjusted for X_{m1}, ..., X_{m(k-1)}, and then determine m_k = argmax_j |r̃_{jY.m1...m(k-1)}|.
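For illustration, a minimal Python sketch of this correlation-based FS ordering, assuming the standard partial-correlation recursion rather than the proportional quantities r̃ used in the derivation above; the function and variable names are illustrative.

    import numpy as np

    def forward_selection_order(R, k):
        # R: (d+1) x (d+1) correlation matrix with the response in the last row/column.
        # At each step, the candidate with the largest absolute partial correlation with Y
        # (adjusted for the covariates already selected) enters the model.
        d = R.shape[0] - 1
        P = R.copy()                          # partial correlations given the current active set
        active, candidates = [], list(range(d))
        for _ in range(k):
            j_best = max(candidates, key=lambda j: abs(P[j, d]))
            active.append(j_best)
            candidates.remove(j_best)
            # Condition the remaining correlations on the newly selected covariate
            Q = P.copy()
            for a in candidates + [d]:
                for b in candidates + [d]:
                    denom = np.sqrt((1 - P[a, j_best] ** 2) * (1 - P[b, j_best] ** 2))
                    Q[a, b] = (P[a, b] - P[a, j_best] * P[b, j_best]) / denom
            P = Q
        return active

Because only the correlation matrix R enters this loop, plugging in a matrix of robust pairwise correlations gives the robust FS variant described in this chapter; the fixed number of steps k can be replaced by the partial F-test stopping rule discussed next.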
Partial F-tests for stopping

At each FS step, once the "best" covariate (among the remaining covariates) is identified, we can perform a partial F-test to decide whether to include this covariate in the model (and continue the process) or to stop. The new "best" covariate enters the model only if the partial F-value, denoted by F_partial, is greater than F(0.95, 1, n - k - 1) (say), where k is the current size of the model including the new covariate. Here again, the required quantities can be expressed in terms of correlations among the original variables, as we show below. Suppose that X_1 is already included in the model, and

(1/n) Σ_{i=1}^n u_2(d_i²) (z_i - t)(z_i - t)' = V,   (2.18)

where d_i² = (z_i - t)' V^{-1} (z_i - t), and u_1 and u_2 satisfy a set of general assumptions. For further computational ease, we considered the following simplified version of the bivariate M-estimate. We used the coordinatewise median as the bivariate location estimate and only solved (2.18) to estimate the scatter matrix and hence the correlation. We used the function u_2(t) = u(t) = min(c/t, 1) with c = 9.21, the 99% quantile of a χ²_2 distribution. For the bivariate observations z_i = (x_i, y_i), i = 1, ..., n, the steps for calculating this correlation estimate are presented below:

1. Calculate the medians m_x and m_y, and obtain z̃_i = (x̃_i, ỹ_i), i = 1, ..., n, where x̃_i = x_i - m_x and ỹ_i = y_i - m_y.
2. Calculate the mads s_x and s_y, and set V_0 ← diag(s_x², s_y²).
3. Calculate d_i² = z̃_i' V_0^{-1} z̃_i, i = 1, ..., n. Then obtain V_1 = (1/n) Σ_{i=1}^n u(d_i²) z̃_i z̃_i'.
4. Set V_0 ← V_1.
5. Repeat steps 3 and 4. We stop when |r(V_1) - r(V_0)| < δ, where δ > 0 is a pre-selected small number, and r(·) is the correlation coefficient calculated from the bivariate scatter matrix.

Finally, the FS and SW algorithms are implemented using these robust pairwise correlations.

Robust partial F-tests. We replace the classical correlations in the partial F statistic by their robust counterparts to form a robust partial F statistic. We conjecture that the robust pairwise correlations appearing in the numerator of the F statistic are jointly normal. Therefore, under the null hypothesis, the robust F statistic is asymptotically distributed as χ²_1. To assess our conjecture numerically, we conducted the following simulation. We generated X_1, e_1 and e_2 from a standard normal distribution. We then generated Y = β_0 + β_1 X_1 + σ_1 e_1, and X_2 = γ_0 + γ_1 X_1 + σ_2 e_2, where β_0, β_1, γ_0 and γ_1 are generated from a uniform distribution on (-10, 10), σ_1 is chosen so that the signal-to-noise ratio equals 2, and σ_2 is chosen so that X_1 and X_2 have a particular correlation randomly chosen from a uniform distribution on (0, 1). We generated 2000 datasets of size 100, and calculated the robust partial F statistic for covariate X_2 in each case. Figure 2.1 shows the QQ-plot of the robust partial F values against the theoretical χ²_1 quantiles. Moreover, the average and variance of the robust F values are 0.99 (≈ 1) and 2.02 (≈ 2), respectively. All of these support our conjecture. Therefore, we consider it reasonable to use theoretical F quantiles as our stopping criteria for the robust FS and SW algorithms.

Figure 2.1: QQ-plot of the robust partial F values against the theoretical χ²_1 quantiles.
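A minimal Python sketch of the simplified bivariate M-iteration in steps 1-5 above, assuming the weight function u(t) = min(c/t, 1) with c = 9.21 and unscaled mads for the initial scatter; all names are illustrative.

    import numpy as np

    def robust_corr_maronna(x, y, c=9.21, delta=1e-4, max_iter=50):
        # Step 1: center each coordinate by its median
        z = np.column_stack((x - np.median(x), y - np.median(y)))
        # Step 2: initial scatter V0 = diag(s_x^2, s_y^2) built from the mads
        s = np.median(np.abs(z), axis=0)
        V0 = np.diag(s ** 2)
        r = lambda V: V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
        for _ in range(max_iter):
            # Step 3: Mahalanobis distances and weighted scatter update
            d2 = np.sum(z @ np.linalg.inv(V0) * z, axis=1)
            w = np.minimum(c / np.maximum(d2, 1e-12), 1.0)   # u(t) = min(c/t, 1)
            V1 = (z * w[:, None]).T @ z / len(z)             # (1/n) sum u(d_i^2) z_i z_i'
            # Step 5: stop when the implied correlation stabilizes
            if abs(r(V1) - r(V0)) < delta:
                return r(V1)
            V0 = V1                                          # Step 4
        return r(V0)

In our simulations this kind of iteration converged after 3 to 5 updates, which is what makes the pairwise approach fast enough for large d.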
2.4.1 Numerical complexity of the algorithms

If we sequence all d covariates, the standard FS procedure requires O(nd²) time. However, when applied with a stopping criterion, the complexity of FS depends on the number of covariates selected in the model. Assuming that the model size will not exceed a certain number m < d, the complexity of FS is less than or equal to O(ndm). Similarly, the maximum complexity of SW is O(n(dm + m²)) = O(ndm).
Since we used the coordinatewise median as the bivariate location estimate, the correlation based on Maronna's M-estimate can be computed in O(n log n + bn) time, where b is the number of iterations required. Assuming that b does not exceed O(log n) (convergence was achieved after 3 to 5 iterations in our simulations), the complexity of this estimate is O(n log n). As a result, the maximum complexity of robust FS is O((n log n) dm), and the maximum complexity of robust SW is O((n log n)(dm + m²)) = O((n log n) dm).
Though all possible subsets regression is expected to select a better model (with respect to predictive power) than any step-by-step algorithm, its computational burden is extremely high for large values of d, since it requires the fitting of all 2^d - 1 submodels. The complexity of the classical algorithms of this type is O(2^d nd²). Since the robust model selection methods proposed so far use all possible subsets regression, the complexity of the existing robust algorithms is O(2^d nd²) multiplied by the number of iterations required for the robust fits.

2.4.2 Limitation of the proposed algorithms

The proposed robust FS and SW procedures, based on robust pairwise correlations, are resistant to bivariate (correlation) outliers. However, they can be sensitive to three- or higher-dimensional outliers, that is, outliers that are not detected by univariate and bivariate analyses. Also, the correlation matrix obtained from the pairwise correlation approach may not be positive definite, forcing the use of a correction for positive definiteness in some cases (see, e.g., Alqallaf et al. 2002). It should be emphasized here that these are very small prices to pay to make the selection of covariates possible for large values of d. For example, in our simulations (presented later) we used d = 50. It is impossible to apply all possible subsets regression on a dataset of this dimension: if one robust fit takes 0.001 cpu second, we would need 2^50 × 0.001/(3600 × 24 × 365) years to select the final model.

2.5 A simulation study

To compare our robust methods with the classical ones, we carried out a simulation study similar to Frank and Friedman (1993). The total number of variables is d = 50. A small number a = 9 or a = 15 of them are nonzero covariates. We considered 2 correlation structures for these nonzero covariates: a "moderate correlation" case and a "no correlation" case, which are described below.
For the moderate-correlation case, we considered 3 independent 'unknown' processes, represented by latent variables L_i, i = 1, 2, 3, which are responsible for the systematic variation of both the response and the covariates. The model is

Y = 7L_1 + 6L_2 + 5L_3 + e = Signal + e,   (2.19)

where L_i ~ N(0, 1), and e is a normal error not related to the latent variables. The variance of e is chosen such that the signal-to-noise ratio equals 2, that is, Var(e) = 110/4. The nonzero covariates are divided into 3 equal groups, with each group related to exactly one of the latent variables by the relation X_j = L_i + δ_j, where δ_j ~ N(0, 1).
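A minimal Python sketch of this data-generating scheme for the moderate-correlation case, assuming the training-sample size n = 100 used below and assuming the d - a noise covariates are independent standard normal; names are illustrative.

    import numpy as np

    def generate_moderate_correlation_data(n=100, d=50, a=9, seed=0):
        rng = np.random.default_rng(seed)
        L = rng.standard_normal((n, 3))                 # latent variables L1, L2, L3
        signal = L @ np.array([7.0, 6.0, 5.0])          # Y = 7*L1 + 6*L2 + 5*L3 + e
        e = rng.standard_normal(n) * np.sqrt(110 / 4)   # Var(e) chosen so the SNR equals 2
        Y = signal + e
        X = np.empty((n, d))
        for j in range(a):                              # a nonzero covariates in 3 equal groups
            X[:, j] = L[:, j // (a // 3)] + rng.standard_normal(n)   # X_j = L_i + delta_j
        X[:, a:] = rng.standard_normal((n, d - a))      # assumed: independent N(0,1) noise covariates
        return X, Y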
Thus, we have a true correlation of 0.5 between the covariates generated with the same latent variable.
For the no-correlation case (a true correlation of 0 between the covariates), independent predictors X_j ~ N(0, 1) are considered, and Y is generated using the a nonzero covariates, with coefficients (7, 6, 5) repeated three times for a = 9, and five times for a = 15.
For each case we generated 1000 datasets, each of which was randomly divided into a training sample of size 100 and a test sample of size 100.
Contamination of the training data. Each of the d - a noise variables is contaminated independently. Each observation of a noise variable is assigned probability 0.003 of being replaced by a large number. If this observation is contaminated, then the corresponding observation of Y is also replaced by a large number. Thus, the probability that any particular row of the training sample will be contaminated is 1 - (1 - 0.003)^(d-a), which is approximately 10% for a = 15, and 11.6% for a = 9.
For each of the 4 methods (2 classical and 2 robust), we fitted the obtained model on the training data, and then used it to predict the test data outcomes. We used the MM-estimator (Yohai 1987) to fit the models obtained by either of the robust methods, because of its high breakdown point and high efficiency at the normal model. For each simulated dataset, we recorded (1) the average squared prediction error on the test sample, (2) the number of noise variables selected in the model, and (3) the total number of variables selected in the model.

Table 2.1: Performance of the classical and robust methods in clean and contaminated data for the moderate-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the average number of noise variables (Noise) selected are shown.

                              a = 9                          a = 15
  Data     Method      MSPE           Noise          MSPE           Noise
  Clean    FS          59.7 (12.0)    4.9 (2.4)      50.2 (9.3)     4.3 (2.2)
           SW          60.3 (12.3)    4.8 (2.3)      51.2 (9.7)     4.2 (2.1)
           Rob FS      60.4 (12.2)    5.1 (2.6)      51.5 (10.3)    4.7 (2.5)
           Rob SW      61.1 (12.8)    5.0 (2.5)      52.8 (10.5)    4.6 (2.4)
  Contam   FS          157.6 (40.8)   13.6 (3.1)     134.5 (32.9)   11.7 (2.9)
           SW          158.4 (41.3)   13.4 (3.0)     136.3 (33.3)   11.6 (2.8)
           Rob FS      94.9 (27.9)    2.5 (2.9)      78.9 (23.7)    1.6 (2.9)
           Rob SW      95.1 (27.8)    2.4 (2.8)      79.3 (23.4)    1.5 (2.6)

Table 2.1 shows the average (SD) of the first two quantities mentioned above over all generated datasets for the moderate-correlation case. The average (SD) of the third quantity (total number of variables) is similar for all the methods in the clean data. However, for the contaminated data, the average increases (decreases) for the classical (robust) methods. For example, for a = 15, the average for the classical methods increases from 13 to 17 (approximately), while for the robust methods it decreases from 13 to 6.
In general, FS performs as well as SW, and robust FS performs as well as robust SW. For the clean data, the performance of robust FS (SW) is comparable to standard FS (SW). For the contaminated data, the test errors produced by the robust methods are much smaller than those of the classical ones. Also, the models obtained by the robust methods contain fewer noise variables than those obtained by the classical ones.

Table 2.2: Performance of the classical and robust methods in clean and contaminated data for the no-correlation case. The average (SD) of mean squared prediction error (MSPE) on the test set and the average number of noise variables (Noise) selected are shown.

                              a = 9                          a = 15
  Data     Method      MSPE           Noise          MSPE           Noise
  Clean    FS          55.6 (11.6)    5.0 (2.4)      107.0 (21.7)   4.6 (2.3)
           SW          55.8 (11.8)    4.8 (2.3)      108.1 (22.1)   4.3 (2.1)
           Rob FS      56.5 (12.4)    5.1 (2.6)      109.9 (21.6)   4.8 (2.4)
           Rob SW      56.7 (12.8)    4.9 (2.5)      108.4 (22.4)   4.6 (2.3)
  Contam   FS          161.8 (38.1)   13.6 (3.0)     296.7 (75.3)   11.9 (2.8)
           SW          162.5 (37.5)   13.4 (2.8)     297.9 (75.9)   11.7 (2.7)
           Rob FS      72.5 (13.9)    2.1 (2.4)      124.1 (19.9)   1.2 (1.8)
           Rob SW      72.6 (13.8)    2.1 (2.3)      124.2 (20.8)   1.2 (1.7)

Table 2.2 presents the results for the no-correlation case. Here, robust FS and SW outperform the standard FS and SW even more drastically than in the moderate-correlation case. Note that the errors presented in this table are not comparable to those of Table 2.1, since Y is generated using the nonzero covariates (a = 9 or a = 15) instead of the 3 latent variables. Thus, Y has much more variability for a = 15 than for a = 9.

2.5.1 Model selection with Spearman's ρ and Kendall's τ

In Section 2.3 we expressed the classical FS and SW algorithms in terms of the correlation matrix of the data. In Section 2.4 we replaced these correlations by their robust counterparts to obtain robust FS and SW. We can also consider replacing the classical correlations in FS and SW by Spearman's ρ or Kendall's τ, since they are standard estimates of association that are invariant to monotone transformations of the data. They may be good options for variable selection when there is skewness in the data and no cluster of multivariate outliers. A small simulation study (not presented here) indicates that the methods based on Spearman's ρ and Kendall's τ may perform better than the classical FS and SW. Further study is required to investigate their performance as model building tools and to compare them with classical and robust FS and SW.
It should be mentioned here that Spearman's ρ can be computed in O(n log n) time, the same as the adjusted-Winsorized correlation estimate (the new correlation estimate proposed later in this thesis). Though Kendall's τ separately examines each of the n-choose-2 (order of n²) pairs of bivariate observations, there is an algorithm that can calculate Kendall's τ in O(n log n) time (Knight 1966).

2.6 Examples

In this section, we use two real-data examples to show the robustness and scalability of our algorithms.
Executive data. This dataset is obtained from Mendenhall and Sincich (2003). The annual salary of 100 executives is recorded as well as 10 potential predictors (7 quantitative and 3 qualitative) such as education, experience, etc. We label the candidate predictors from 1 to 10. Classical FS (with F_0.9 as the inclusion criterion) and SW (with F_0.9 as both inclusion and deletion criterion) both select the covariates (1, 3, 4, 2, 5). Robust FS and SW (also with F_0.9 as inclusion and deletion criterion) select the same model. We then contaminated the data by replacing one small value of predictor 1 (less than 5) by a large value, 100. When FS and SW are applied to the contaminated data, they both now select a larger set of variables: (7, 3, 4, 2, 1, 5, 10). Thus, changing a single number in the dataset drastically changes the selected model. On the other hand, robust FS and SW select the same model, (1, 3, 4, 2, 5), when applied to the contaminated dataset.
Particle data. This quantum physics dataset was used for the KDD-Cup 2004. Each of n = 50000 data points (rows) describes one "example" (a particle generated in a high-energy collider experiment). There are 80 variables in the data: Example ID, class of the example (positive examples are denoted by 1, negative examples by 0), and 78 feature measurements.
We considered only the feature variables in our analysis. We deleted 13 of the features (either because they have a large number of missing values, or because they are degenerate with all observations equal to 0), and used the first feature as the response. Thus, there are 64 covariates and one response in the selected data. Though this analysis may not be of particular scientific interest, it will demonstrate the scalability and robustness of our algorithms.
We first applied the four algorithms to a training sample of size n = 5000. The remaining 45000 cases will be used as a test sample. The classical FS and SW (with the F_0.9 criterion) select the same model. It contains the following 25 covariates: (2, 60, 58, 18, 8, 4, 51, 53, 1, 59, 5, 20, 10, 6, 62, 19, 38, 46, 39, 47, 21, 36, 50, 48, 37). With the F_0.95 criterion, the model has 23 covariates. Interestingly, only one covariate is selected by robust FS and SW (with either the F_0.9 or the F_0.95 criterion): Covariate 1.
The reason for this drastic difference is as follows. The robust correlation of Y and Covariate 1 is 0.86, while the classical correlation between these variables is only 0.42. About 86% of the values of the response variable and 88% of the values of Covariate 1 are equal to zero. There are many zeroes in other covariates as well. The classical methods fail to identify this unusual pattern in the data and therefore are unable to select a parsimonious model that fits well the majority of the data (as opposed to all the data). The robust methods, on the other hand, successfully detect the unusual pattern and select a model capable of predicting well 90% of the data, as explained below.
We fitted the selected classical and robust models using the training data, and then used them to predict the test data outcomes. The 5% and 10% trimmed means of squared prediction errors for the classical (robust) models are 0.012 (0.043) and 0.008 (0.005), respectively. That is, the robust model with only one covariate predicts 90% of the data better than the classical model with 25 covariates.
To illustrate the scalability of our algorithm we also used a training sample of size n = 25000. This time, classical FS and SW select a model of 30 covariates, while robust FS and SW both select one covariate, in this case Covariate 2 instead of Covariate 1. (Covariates 1 and 2 have robust correlations 0.82 and -0.85 with Y, respectively.)

2.7 Conclusion

The main contribution of this chapter is that we developed robust step-by-step algorithms as one-step model-building procedures. The classical step-by-step algorithms FS and SW are popular and computationally suitable, but they are sensitive to outliers. We expressed these algorithms in terms of sample means, variances and correlations, and obtained simple robust versions of FS and SW by replacing these sample quantities with their robust counterparts. We used robust partial F-tests as stopping rules during the implementation of the proposed robust algorithms.
For the construction of the robust correlation matrix of the required covariates we used a pairwise approach, because it is both computationally suitable and more consistent with the idea of step-by-step algorithms. We used robust correlations derived from a simplified version of Maronna's bivariate M-estimator of the scatter matrix. Our robust methods perform much better than the standard FS and SW algorithms. Also, they are computationally very suitable, and scalable to large dimensions.
Chapter 3
Two-step Model Building: Robust Sequencing with Least Angle Regression

3.1 Introduction

In this chapter, we consider the first step (sequencing) of the two-step model building procedure. The candidate covariates will be sequenced to form a list such that the good predictors are likely to appear at the beginning of the list. The first m covariates of the list will form the reduced set, which will be studied further in the next chapter.
We need a suitable step-by-step algorithm to sequence the covariates. We focus on the powerful algorithm recently proposed by Efron, Hastie, Johnstone and Tibshirani (2004), which is called Least Angle Regression (LARS). LARS is computationally efficient and has been shown to have clear statistical advantages over other step-by-step algorithms.
Since LARS is based on sample means, variances and correlations (as will be shown later), it yields poor results when the data are contaminated. This is a potentially serious deficiency. Therefore, we propose several approaches to strengthen the robustness properties of LARS without affecting its computational efficiency too much, and compare their behavior.
The rest of this chapter is organized as follows. In Section 3.2, we review LARS in detail. In Section 3.3, we express the LARS procedure in terms of the correlation matrix of the data. In Section 3.4, we illustrate LARS' sensitivity to outliers and introduce two different approaches to robustify LARS. A small simulation study is also presented here to compare the performance and the computing time of LARS to those of the two robust approaches. In Section 3.5, we investigate the selection of the size of the reduced set of candidate predictors. Section 3.6 proposes to use bootstrap to stabilize the results obtained by robust LARS. Section 3.7 introduces "learning curves" as a graphical tool to choose the size of the reduced set. Section 3.8 contains some real-data applications. Section 3.9 concludes, and the chapter appendix contains some technical derivations.

3.2 Review: Least Angle Regression (LARS)

Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004), is closely related to another new algorithm called Forward Stagewise (Hastie, Tibshirani and Friedman 2001, Chapter 10). To better understand LARS, we will review the Forward Stagewise procedure in detail.

3.2.1 Forward Stagewise procedure (Stagewise)

The Forward Stagewise procedure (Stagewise) is related to the classical algorithm Forward Selection (FS). In FS, when the first predictor (X_1, say) is selected, all other predictors are regressed on X_1, and the residual vectors compete for the next entrance into the model. This causes a problem: important predictors that happen to be correlated with X_1 are eliminated from the competition in many cases, which researchers usually want to avoid. In this sense, FS is an aggressive model-building algorithm.
The Forward Stagewise procedure is a less aggressive version of FS. Unlike FS, the Stagewise procedure takes many tiny steps to move towards a final model. We take the zero vector as the initial prediction. If X_1 has the largest absolute correlation with Y, we modify our prediction by moving a 'small' step in the direction of X_1. We obtain the new residual vector and repeat the process, until the required number of predictors has been selected. The goal is to obtain the order in which the variables enter the model.
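A minimal Python sketch of this ε-stepping iteration (the formal algorithm is stated next); the step size eps, number of steps and function names are illustrative.

    import numpy as np

    def stagewise_order(X, Y, eps=0.01, n_steps=5000):
        # X: n x d matrix of standardized covariates; Y: centered response vector.
        mu = np.zeros(len(Y))                         # current prediction, started at the zero vector
        order = []
        for _ in range(n_steps):
            c = X.T @ (Y - mu)                        # proportional to correlations with the residual
            m = int(np.argmax(np.abs(c)))
            if m not in order:                        # record covariates in order of first entry
                order.append(m)
            mu = mu + eps * np.sign(c[m]) * X[:, m]   # small step along the chosen covariate
        return order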
We can assume, without loss of generality, that the covariates have mean 0 and variance 1, and that the response has mean 0. Let ε be a positive constant, typically small (less than the absolute value of the regression coefficients). The Stagewise algorithm can be described as follows:

1. Set the prediction vector μ = 0.
2. Calculate c_j = X_j'(Y - μ), j = 1, ..., d, where c_j is proportional to the correlation between X_j and the current residual.
3. Let m = argmax_j |c_j|. Modify the current prediction vector as follows: μ ← μ + ε sign(c_m) X_m, where ε is a positive constant.
4. Repeat steps 2 and 3.

At each step, the algorithm updates the prediction and keeps track of the sequence of covariates as they enter the model.
Notice that at each Stagewise step, we maximize the correlation of the current residual vector with a covariate, which is equivalent to minimizing the 'local loss'

Σ_{i=1}^n (Y_i - μ_i - β_j X_ji)²   (3.1)

over all j. (Because of standardization, β_m is proportional to c_m.)

Stagewise and FS: loss comparison

Both Stagewise and FS select the same variable in their first steps, because they minimize the same loss function. Suppose the selected variable is X_1, i.e., its loss Σ_i (Y_i - β_1 X_1i)² is minimum. Let us consider that, after many ε-steps along X_1, Stagewise is about to choose a second variable from the contenders X_2, ..., X_d. On the other hand, for the second step, FS considers Z_2, ..., Z_d, which are the residuals of the corresponding covariates after being adjusted for X_1. To choose the second variable, FS minimizes the loss

Σ_{i=1}^n (Y_i - β_1 X_1i - β_j Z_ji)²,

which is the same as the loss

Σ_{i=1}^n (Y_i - β_1* X_1i - β_j X_ji)².   (3.2)

Note that β_1* is usually different from β_1. The loss used in Stagewise is

Σ_{i=1}^n (Y_i - β_1^ε X_1i - β_j X_ji)².   (3.3)

This loss depends on our position on the X_1-vector, which is controlled by ε. By choosing a small ε, we ensure that β_1^ε < β_1*, so that the variables correlated with X_1 have more chance of staying in the competition. (It should be noted that we are not 'fitting' a model. We are 'selecting' the covariates.)
The minimizer of the Stagewise loss cannot beat the minimizer of the FS loss (see equation (3.2)), at least at this stage. This means that if FS chooses (X_1, X_2) and Stagewise selects (X_1, X_3), then FS will yield a greater value of R², because FS technically considers the residual sum of squares of the final fit (equation (3.2)). However, this is not necessarily true for the next stage if FS selects (X_1, X_2, X_4) (for example) and Stagewise selects (X_1, X_3, X_5), because this Stagewise combination has not been considered by FS so far in any order. In other words, FS has taken a different path.
A greater R² does not necessarily imply more prediction accuracy. Moreover, FS cannot guarantee a greater R² for a particular subset size in all cases. Therefore, orthogonalization of the subsequent covariates with respect to the active ones is not usually meaningful. This is why researchers often prefer Stagewise to FS.

Stagewise and Boosting

Boosting, originally developed for classification problems, is a procedure that combines the output of many "weak" classifiers to produce a powerful "committee" (Hastie et al. 2001, Chapter 10). In the general setup, boosting is a way of predicting Y_i by combining a set of simple "basis" functions additively:

f(x) = Σ_{k=1}^K β_k b(x, α_k),   (3.4)

where the β_k are the expansion coefficients, and the b(x, α_k) are real-valued functions of multivariate x with parameters α_k.
Usually, f(x) is fitted by considering a loss function

Σ_{i=1}^n L(Y_i, Σ_{k=1}^K β_k b(x_i, α_k)),   (3.5)

which is minimized with respect to the β_k's and α_k's. Often this is computationally intensive, and the solution to (3.5) is approximated by sequentially adding new basis functions without adjusting the parameters and expansion coefficients of the existing ones. In this approach, at each iteration k, one updates f_k(x_i) = f_{k-1}(x_i) + β_k b(x_i, α_k) by minimizing the loss

Σ_{i=1}^n L(Y_i, f_{k-1}(x_i) + β_k b(x_i, α_k)).   (3.6)

In regularized boosting, one uses a parameter ε to control the "learning rate" of the boosting procedure:

f_k(x_i) = f_{k-1}(x_i) + ε β_k b(x_i, α_k).   (3.7)

The Stagewise algorithm discussed before is very similar to this regularized boosting, where we use squared error loss for L in (3.6), K (see (3.4)) is the number of ε-steps in Stagewise, b(x_i, α_k) = x_{(k)i} (the ith observation of the covariate X_{(k)} chosen in the kth iteration, not necessarily the same as X_k), and β_k in (3.7) is replaced by sign(β_k).

Choice of an appropriate ε for Stagewise

The choice of an appropriate ε is a problem with the Stagewise procedure, and is the motivation for LARS. If ε is 'small', the number of Stagewise steps to reach a final model may be very large, increasing the computational burden of the algorithm. On the other hand, if ε is 'large', we have either or both of the following two problems:
'Incorrect' ordering of predictors: If ε → |c_m|, Stagewise will aggressively throw covariates correlated with X_m out of the competition.
A closed loop: In many cases, after the selection of the mth (say) covariate, the remaining predictors (that are not yet selected) have very small correlations with the current residual vector. Suppose that the correlation of the currently selected predictor X_m is positive. If ε is large, when we make an 'ε-step' in the direction of '+X_m', the correlation of X_m with the updated residuals becomes negative and larger in absolute value than that of any other competitor. Thus, the next Stagewise step is to make an 'ε-step' in the direction of '-X_m'. These back-and-forth movements may go on endlessly.
Even when there is no closed loop, the Stagewise procedure may require hundreds of tiny steps to reach the final model. Therefore, this algorithm is not computationally suitable. LARS overcomes this problem by taking a mathematical approach.

3.2.2 The LARS algorithm

LARS uses a mathematical formula to accelerate the computations in the Stagewise procedure. Suppose the first selected predictor in Stagewise is X_1 (i.e., X_1 has the largest absolute correlation with Y). If we choose a 'small' ε, there will be at least several Stagewise steps in the direction of the vector X_1. A second predictor X_2 (say) will come into the picture as soon as we cross a certain point in the direction of X_1, a point at which both X_1 and X_2 have equal absolute correlation with the residual. LARS uses a mathematical formula to determine that point, and the prediction is modified by making a move up to that point in a single step.
In Stagewise, when a second predictor X_2 enters the model for the first time, we make a few small steps in the direction of X_2, but then X_1 becomes more correlated with the residual, and we move in the direction of X_1. Thus, we alternate between the two directions, technically maintaining approximately equal absolute correlations of X_1 and X_2 with the residual (until a third predictor comes into the picture).
LARS, on the other hand, mathematically determines a direction that has equal angle (correlation) with X_1 and X_2, and makes the second LARS move along that direction up to a point (again determined mathematically) at which a third predictor X_3 has equal absolute correlation with the residual vector, and so on.
For the original LARS algorithm we refer to Efron et al. (2004); it is designed to obtain the modified predictions at each step, in addition to the sequence of the covariates as they enter the model. In Section 3.3 we show that, if we are interested in the ordering of the covariates only (and not the modified predictions), the algorithm can be expressed in terms of the correlation matrix of the data (and not the observations themselves).

3.2.3 LARS and Shrinkage methods

LARS and Ridge Regression

If there are many correlated variables in a linear model, the estimates may show high variance. This can be prevented by imposing a restriction on the size of the coefficients. Ridge regression (Hoerl and Kennard 1970) minimizes a penalized residual sum of squares

Σ_{i=1}^n (Y_i - β'x_i)² + λ Σ_{j=1}^d β_j²,

where λ is the parameter that controls the amount of shrinkage. Ridge regression shrinks the coefficients towards zero, but does not set some coefficients exactly equal to zero. Therefore, it is not suitable for subset selection, and cannot be compared to LARS. By imposing a different penalty, the Lasso algorithm (Tibshirani 1996) forces some of the coefficients to zero, as presented below.

LARS and Lasso

The Lasso (Tibshirani 1996) estimates are obtained by minimizing

Σ_{i=1}^n (Y_i - β'x_i)² + λ Σ_{j=1}^d |β_j|.   (3.8)

A moderate to large λ will cause some of the Lasso coefficients to be exactly zero; the others will be smaller in magnitude than the corresponding least squares estimates.
Interestingly, the estimates (and the sequence of the covariates) obtained by LARS and Lasso are usually quite close, if not the same. The reason has not been established mathematically, though it is clear that both algorithms can be viewed as less aggressive versions of the FS procedure. Efron et al. (2004) suggested a modification of the LARS algorithm that will yield the Lasso solution, which is as follows. Let β̂ be the current Lasso estimate, and μ̂ = Xβ̂. Then, for the Lasso estimates, sign(β̂_j) = sign(corr(Y - μ̂, X_j)), which is not necessarily true for the LARS estimates. This restriction should be enforced in the LARS algorithm if we want the Lasso solution.
To better understand the LARS-Lasso relationship, let us consider the following definition of the Lasso estimate, which is equivalent to (3.8):

β̂_Lasso = argmin Σ_{i=1}^n (Y_i - β'x_i)²,  subject to Σ_{j=1}^d |β_j| ≤ t.   (3.9)

If t ≥ Σ_{j=1}^d |β_j^F|, where the β_j^F are the least squares estimates, then the Lasso estimates are the least squares estimates.
For a LARS-Lasso comparison, suppose that LARS has selected the first covariate s_1 X_1. (LARS considers the 'signed covariates' to determine the equiangular vectors later.) To select the second covariate, LARS mathematically determines the minimum distance γ to move along s_1 X_1 so that a new covariate s_2 X_2 (say) is equally correlated with the residual vector. We may assume that γ is determined first (as in Stagewise, where we make many ε-steps to obtain β_1^ε (see (3.3))), so that the LARS loss can be written as

Σ_{i=1}^n (Y_i - γ s_1 X_1i - β_j X_ji)² = Σ_{i=1}^n (Y_i - γ_s X_1i - β_j X_ji)²,   (3.10)

where j = 2, ..., d, and γ_s = s_1 γ is a restricted regression coefficient.
Since |γ_s| is the minimum distance (determined mathematically) to move before a second variable enters the model, a comparison of (3.10) and (3.9) makes it evident that, in the Lasso algorithm, t < |γ_s| implies that only X_1 is in the model (only β̂_1 is nonzero).
In LARS, when we have two active covariates s_1 X_1 and s_2 X_2, we modify our prediction by moving along the equiangular vector B_A up to a point γ_A, so that the LARS loss has the form

Σ_{i=1}^n (Y_i - γ_s X_1i - γ_A B_Ai - β_j X_ji)²
  = Σ_{i=1}^n (Y_i - γ_s X_1i - γ_A (w_1 s_1 X_1i + w_2 s_2 X_2i) - β_j X_ji)²
  = Σ_{i=1}^n (Y_i - (γ + γ_A w_1) s_1 X_1i - γ_A w_2 s_2 X_2i - β_j X_ji)²,   (3.11)

where j = 3, ..., d, and w_1 and w_2 are given by (3.21) (see Chapter Appendix, Section 3.10.2). Again, since γ_A is the minimum distance to move along B_A before a third covariate comes into the picture, by comparing (3.11) and (3.9) we can say that |γ_s| < t < |γ_s| + |γ_A (w_1 + w_2)| implies that only X_1 and X_2 will be in the model, and so on. Note that |(γ + γ_A w_1) s_1| + |γ_A w_2 s_2| = |γ_s| + |γ_A (w_1 + w_2)|, since γ, γ_A, w_1 and w_2 are all positive at this stage. (The w_j may not all be positive in higher dimensions.)
Thus, it is not surprising that the LARS and Lasso sequences agree in most cases. However, Lasso requires a computationally expensive quadratic programming technique to obtain the estimates, while the computational cost of LARS is comparable to that of ordinary least squares applied to the full set of covariates.

LARS and Boosting

For a comparison of Stagewise and boosting we refer to Section 3.2.1. Since LARS is a mathematical solution of the Stagewise problem, the LARS algorithm may be considered a mathematical alternative to a regularized boosting algorithm.

3.3 LARS expressed in terms of correlations

In this section, we show that the sequence of covariates obtained by LARS can be derived from the correlation matrix of the data (without using the observations themselves). Let Y, X_1, ..., X_d be the variables, standardized using their means and standard deviations. Let r_jY denote the correlation between X_j and Y, and let R_X be the correlation matrix of the covariates X_1, ..., X_d. Suppose that X_m has the maximum absolute correlation r with Y, and denote s_m = sign(r_mY). Then X_m becomes the first active variable, and the current prediction μ_0 should be modified by moving along the direction of s_m X_m up to a certain distance γ that can be expressed in terms of correlations between the variables (see Chapter Appendix, Section 3.10.1, for details). By determining γ, LARS simultaneously identifies the new covariate that will enter the model, that is, the second active variable.
As soon as we have more than one active variable, LARS modifies the current prediction along the equiangular direction, that is, the direction that has equal angle (correlation) with all active covariates. Moving along this direction ensures that the current correlation of each active covariate with the residual decreases equally. Let A be the set of subscripts corresponding to the active variables. In the Chapter Appendix (Section 3.10.2) the standardized equiangular vector B_A is derived. Note that we do not need the direction B_A itself to decide which covariate enters the model next. We only need the correlations of all variables (active and inactive) with B_A. These correlations can be expressed in terms of the correlation matrix of the variables, as shown in the Chapter Appendix (Section 3.10.2).
LARS modifies the current prediction by moving along B_A up to a certain distance γ_A which, again, can be determined from the correlations of the variables (see Chapter Appendix, Section 3.10.3). Thus, the sequence of covariates obtained by the LARS algorithm is a function of the correlation matrix of the standardized data.
We now summarize the LARS algorithm in terms of the correlations r_jY between X_j and Y, and the correlation matrix R_X of the covariates:

1. Set the active set A = ∅, and the sign vector s_A = ∅.
2. Determine m = argmax_j |r_jY|, and s_m = sign(r_mY). Let r = s_m r_mY.
3. Put A ← A ∪ {m}, and s_A ← s_A ∪ {s_m}.
4. Calculate a = [1_A'(D_A R_A D_A)^{-1} 1_A]^{-1/2}, where 1_A is a vector of 1's, D_A = diag(s_A), and R_A is the submatrix of R_X corresponding to the active variables. Calculate w_A = a (D_A R_A D_A)^{-1} 1_A, and a_j = (D_A r_jA)' w_A for j ∈ A^c, where r_jA is the vector of correlations between X_j and the active variables. (Note that, when there is only one active covariate X_m, the above quantities simplify to a = 1, w = 1, and a_j = r_jm.)
5. For j ∈ A^c, calculate γ_j^+ = (r - r_jY)/(a - a_j) and γ_j^- = (r + r_jY)/(a + a_j), and let γ_j = min(γ_j^+, γ_j^-). Determine γ = min{γ_j, j ∈ A^c}, and m, the index corresponding to the minimum, γ = γ_m. If γ_m = γ_m^+, set s_m = +1; otherwise, set s_m = -1. Modify r ← r - γa, and r_jY ← r_jY - γa_j for j ∈ A^c.
6. Repeat steps 3, 4 and 5.

3.4 Robustification of LARS

From the results in Section 3.3, it is not surprising to see that LARS is sensitive to contamination in the data. To illustrate this, we use a dataset on executives obtained from Mendenhall and Sincich (2003). The annual salary of 100 executives is recorded as well as 10 potential predictors (7 quantitative and 3 qualitative) such as education, experience, etc. We label the candidate predictors from 1 to 10. LARS sequences the covariates in the following order: (1, 3, 4, 2, 5, 6, 9, 8, 10, 7). We contaminate the data by replacing one small value of predictor 1 (less than 5) by the large value 100. When LARS is applied to the contaminated data, we obtain the following completely different sequence of predictors: (7, 3, 2, 4, 5, 1, 10, 6, 8, 9). Predictor 7, which was selected last (10th) in the clean data, now enters the model first. The position of predictor 1 changes from first to sixth. Predictors 2 and 4 interchange their places. Thus, changing a single number in the data set completely changes the predictor sequence, which illustrates the sensitivity of LARS to contamination.
We now introduce two approaches to robustify the LARS procedure, which we call the plug-in and cleaning approaches, respectively.

3.4.1 Robust Plug-in

The plug-in approach consists of replacing the non-robust building blocks of LARS (mean, variance and correlation) by robust counterparts. The choices of fast computable robust center and scale measures are straightforward: the median (med) and the median absolute deviation (mad). Unfortunately, good available robust correlation estimators are computed from the d-dimensional data and therefore are very time consuming (see Rousseeuw and Leroy 1987). Robust pairwise approaches (see Huber 1981) are not affine equivariant and therefore are sensitive to two-dimensional outliers. One solution is to use robust correlations derived from a pairwise affine equivariant covariance estimator. A computationally efficient choice is a bivariate M-estimator as defined by Maronna (1976). Alternatively, a bivariate correlation estimator can be computed from bivariate Winsorized data. Both methods are explained in detail below.

M Plug-in

Maronna's bivariate M-estimator of the location vector t and scatter matrix V is defined in Chapter 2. It is affine equivariant and computationally efficient, and has breakdown point 1/3 in two dimensions. As before, to further simplify computations, we used the coordinatewise median as the bivariate location estimate and only solved (2.18) to estimate the scatter matrix and hence the correlation. In this equation we used the function u_2(t) = min(c/t, 1) with c = 9.21, the 99% quantile of a χ²_2 distribution. The bivariate correlations are then assembled to form a d × d correlation matrix R. Finally, LARS is applied to this robust correlation matrix. We call this the M plug-in method.

Figure 3.1: Limitation of separate univariate-Winsorizations (c = 2). The bivariate outliers are left almost unchanged.

W Plug-in

For very large, high-dimensional data we need an even faster robust correlation estimator. Huber (1981) introduced the idea of one-dimensional Winsorization of the data, and suggested that classical correlation coefficients be calculated from the transformed data. Alqallaf, Konis, Martin and Zamar (2002) re-examined this approach for the estimation of individual elements of a large-dimensional correlation matrix. For n univariate observations x_1, x_2, ..., x_n, the transformation is given by u_i = ψ_c((x_i - med(x_i))/mad(x_i)), i = 1, 2, ..., n, where the Huber score function ψ_c is defined as ψ_c(x) = min{max{-c, x}, c}, with c a tuning constant chosen by the user, e.g., c = 2 or c = 2.5. This one-dimensional Winsorization approach is very fast to compute, but unfortunately it does not take into account the orientation of the bivariate data. It merely brings the outlying observations to the boundary of a 2c × 2c square, as shown in Figure 3.1. This plot clearly shows that the univariate approach does not resolve the effect of the obvious outliers at the bottom right, which are shrunken to the corner (2, -2) and thus are left almost unchanged.
To remedy this problem, we propose a bivariate Winsorization of the data based on an initial tolerance ellipse for the majority of the data. Outliers are shrunken to the border of this ellipse by using the bivariate transformation u = min(√(c/D(x)), 1) x, with x = (x_1, x_2)'. Here D(x) is the Mahalanobis distance based on an initial bivariate correlation matrix R_0. For the tuning constant c we used c = 5.99, the 95% quantile of the χ²_2 distribution. We call this the W plug-in method. The choice of R_0 will be discussed below.
Figure 3.2 shows bivariate Winsorizations for both the complete data set of Figure 3.1 and the data set excluding the outliers. The ellipse for the contaminated data is only slightly larger than that for the clean data. By using bivariate Winsorization the outliers are shrunken to the boundary of the larger ellipsoid.
The initial correlation estimate. Choosing an appropriate initial correlation matrix R_0 is an essential part of bivariate Winsorization. For computational simplicity we can choose the estimate based on univariate Winsorization explained above. However, we propose an adjusted Winsorization method that is more resistant to bivariate outliers. This method uses two tuning constants: a tuning constant c_1 for the two quadrants that contain the majority of the standardized data, and a smaller tuning constant c_2 for the other two quadrants. For example, c_1 is taken equal to 2 or 2.5 as before and

Figure 3.2: Bivariate Winsorizations for clean and contaminated data.
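A minimal Python sketch of the univariate and bivariate Winsorizations described above, under stated assumptions: D(x) is read as the squared Mahalanobis distance (so that c = 5.99, the 95% quantile of χ²_2, is on the right scale), the mads are unscaled, and the initial R_0 is taken from the simpler univariate-Winsorized correlation rather than the adjusted Winsorization proposed here; all names are illustrative.

    import numpy as np

    def huberize(x, c=2.0):
        # One-dimensional Winsorization: psi_c((x - med(x)) / mad(x)), psi_c(z) = min(max(-c, z), c)
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        return np.clip((x - med) / mad, -c, c)

    def bivariate_winsorize(x, y, R0, c=5.99):
        # Shrink standardized points to the tolerance ellipse of the initial correlation matrix R0:
        # u = min(sqrt(c / D(x)), 1) * x, with D(x) taken here as the squared Mahalanobis distance.
        zx = (x - np.median(x)) / np.median(np.abs(x - np.median(x)))
        zy = (y - np.median(y)) / np.median(np.abs(y - np.median(y)))
        Z = np.column_stack((zx, zy))
        D = np.sum(Z @ np.linalg.inv(R0) * Z, axis=1)
        shrink = np.minimum(np.sqrt(c / np.maximum(D, 1e-12)), 1.0)
        return Z * shrink[:, None]

    def w_plugin_corr(x, y, c1=2.0, c2=5.99):
        # Pairwise correlation in the spirit of the W plug-in: R0 from the (simpler) univariate
        # Winsorization, then the classical correlation of the bivariate-Winsorized data.
        r0 = np.corrcoef(huberize(x, c1), huberize(y, c1))[0, 1]
        R0 = np.array([[1.0, r0], [r0, 1.0]])
        W = bivariate_winsorize(x, y, R0, c=c2)
        return np.corrcoef(W[:, 0], W[:, 1])[0, 1]

Computing such a pairwise estimate for every pair of variables and assembling the results into a d × d matrix gives the correlation input on which LARS (or robust FS and SW) can then be run.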
M Plug-in

Maronna's bivariate M-estimator of the location vector $t$ and scatter matrix $V$ is defined in Chapter 2. It is affine equivariant and computationally efficient, and has breakdown point 1/3 in two dimensions. As before, to further simplify computations, we used the coordinatewise median as the bivariate location estimate and only solved (2.18) to estimate the scatter matrix and hence the correlation. In this equation we used the function $u_2(t) = \min(c/t, 1)$ with $c = 9.21$, the 99% quantile of a $\chi^2_2$ distribution. The bivariate correlations are then assembled to form a $d \times d$ correlation matrix $R$. Finally, LARS is applied to this robust correlation matrix. We call this the M plug-in method.

Figure 3.1: Limitation of separate univariate-Winsorizations (c = 2). The bivariate outliers are left almost unchanged.

W Plug-in

For very large, high-dimensional data we need an even faster robust correlation estimator. Huber (1981) introduced the idea of one-dimensional Winsorization of the data, and suggested that classical correlation coefficients be calculated from the transformed data. Alqallaf, Konis, Martin and Zamar (2002) re-examined this approach for the estimation of individual elements of a large-dimension correlation matrix. For $n$ univariate observations $x_1, x_2, \ldots, x_n$, the transformation is given by $u_i = \psi_c\bigl((x_i - \mathrm{med}(x))/\mathrm{mad}(x)\bigr)$, $i = 1, 2, \ldots, n$, where the Huber score function $\psi_c$ is defined as $\psi_c(x) = \min\{\max\{-c, x\}, c\}$, with $c$ a tuning constant chosen by the user, e.g., $c = 2$ or $c = 2.5$. This one-dimensional Winsorization approach is very fast to compute, but unfortunately it does not take into account the orientation of the bivariate data. It merely brings the outlying observations to the boundary of a $2c \times 2c$ square, as shown in Figure 3.1. This plot clearly shows that the univariate approach does not resolve the effect of the obvious outliers at the bottom right, which are shrunken to the corner $(2, -2)$ and thus are left almost unchanged.

To remedy this problem, we propose a bivariate Winsorization of the data based on an initial tolerance ellipse for the majority of the data. Outliers are shrunken to the border of this ellipse by using the bivariate transformation $u = \min\bigl(\sqrt{c/D(x)},\, 1\bigr)\,x$ with $x = (x_1, x_2)^t$. Here $D(x)$ is the Mahalanobis distance based on an initial bivariate correlation matrix $R_0$. For the tuning constant $c$ we used $c = 5.99$, the 95% quantile of the $\chi^2_2$ distribution. We call this the W plug-in method. The choice of $R_0$ will be discussed below. Figure 3.2 shows bivariate Winsorizations for both the complete data set of Figure 3.1 and the data set excluding the outliers. The ellipse for the contaminated data is only slightly larger than that for the clean data. By using bivariate Winsorization the outliers are shrunken to the boundary of the larger ellipsoid.

Figure 3.2: Bivariate Winsorizations for clean and contaminated data. The ellipse for the contaminated data is only slightly larger than that for the clean data.

The initial correlation estimate. Choosing an appropriate initial correlation matrix $R_0$ is an essential part of bivariate Winsorization. For computational simplicity we can choose the estimate based on univariate Winsorization explained above. However, we propose an adjusted Winsorization method that is more resistant to bivariate outliers. This method uses two tuning constants: a tuning constant $c_1$ for the two quadrants that contain the majority of the standardized data and a smaller tuning constant $c_2$ for the other two quadrants. For example, $c_1$ is taken equal to 2 or 2.5 as before, and $c_2 = h c_1$, where $h = n_2/n_1$, with $n_1$ the number of observations in the major quadrants and $n_2 = n - n_1$. We use $c_1 = 2$ in this chapter.
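The following is a minimal R sketch of the adjusted-Winsorized correlation just described. It assumes the Huber score function for $\psi_c$, uses $c_2 = h c_1$ as in this chapter, and determines the major quadrants by a simple count; the function names (psi_huber, adj_wins_cor) are ours and do not come from any package, and this is not the exact code used in the thesis.

# Adjusted-Winsorized correlation of two vectors (a sketch).
# Data are standardized with median/MAD, then Winsorized with the Huber
# function, using c1 in the two major quadrants and c2 = h*c1 elsewhere.
psi_huber <- function(u, c) pmin(pmax(u, -c), c)

adj_wins_cor <- function(x, y, c1 = 2) {
  u <- (x - median(x)) / mad(x)
  v <- (y - median(y)) / mad(y)
  in13  <- u * v >= 0                         # quadrants 1 and 3
  n1    <- max(sum(in13), sum(!in13))         # size of the major quadrants
  c2    <- ((length(u) - n1) / n1) * c1       # c2 = h*c1 with h = n2/n1
  major <- if (sum(in13) >= sum(!in13)) in13 else !in13
  cc    <- ifelse(major, c1, c2)              # tuning constant per observation
  cor(psi_huber(u, cc), psi_huber(v, cc))     # classical correlation of Winsorized data
}

The classical correlation matrix of data transformed in this way gives the initial estimate $R_0$ used for the bivariate Winsorization above.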
Figure 3.3 shows how the adjusted Winsorization deals with bivariate outliers, which are now shrunken to the boundary of the smaller square. Thus, adjusted Winsorization handles bivariate outliers much better than univariate Winsorization. The initial correlation matrix $R_0$ is obtained by computing the classical correlation matrix of the adjusted-Winsorized data. It should be mentioned here that, though we used $c_2 = h c_1$ in this study, a more reasonable choice would have been $c_2 = \sqrt{h}\, c_1$ (i.e., $c_2^2 = h c_1^2$), because the areas of the two squares should be proportional to the number of observations they contain.

Figure 3.3: Adjusted-Winsorization (for initial estimate $R_0$) with $c_1 = 2$, $c_2 = 1$. The bivariate outliers are now shrunken to the corner of the smaller square.

Note that the correlations based on both univariate- and adjusted-Winsorized data can be computed in $O(n \log n)$ time. The adjusted-Winsorized estimate takes slightly more time for a particular $n$, but is much more accurate in the presence of bivariate outliers as shown above. The bivariate-Winsorized estimate and Maronna's M-estimate also require $O(n \log n)$ time, but Maronna's M-estimate has a larger multiplication factor depending on the number of iterations required. Thus, for large $n$, the bivariate-Winsorized estimate is much faster to compute than Maronna's M-estimate. Figure 3.4 shows, for each of the four correlation estimates, the mean cpu times in seconds (based on 100 replicates) for 5 different sample sizes: 10000, 20000, 30000, 40000 and 50000. These results confirm that the bivariate-Winsorized estimate is faster to compute than Maronna's M-estimate, and the difference increases with sample size.

Figure 3.4: Numerical complexity of different correlation estimates. Each estimate can be computed in $O(n \log n)$ time, but Maronna's estimate has a larger multiplication factor.

Numerical results (not presented here) showed that the bivariate-Winsorized estimate is almost as accurate as Maronna's M-estimate also in the presence of contamination. Note that both the univariate-Winsorized and adjusted-Winsorized correlations are very fast to compute.

3.4.2 Robust Data Cleaning

If the dimension $d$ is not extremely large, an alternative approach to robustifying LARS is to apply it on cleaned data. For example, each standardized $d$-dimensional data point $x = (x_1, \ldots, x_d)^t$ can be replaced by its Winsorized counterpart $u = \min\bigl(\sqrt{c/D(x)},\, 1\bigr)\,x$ in the $d$-dimensional space. Here $D(x) = x^t V^{-1} x$ is the Mahalanobis distance of $x$ based on $V$, a fast computable, robust initial correlation matrix. A reasonable choice for the tuning constant $c$ is $c = \chi^2_d(0.95)$, the 95% quantile of the $\chi^2_d$ distribution.

The initial correlation matrix $V$. The choice of the initial correlation matrix $V$ is an essential part of the Winsorization procedure. Most available high-breakdown, affine-equivariant methods are inappropriate for our purposes because they are too computationally intensive. Therefore, we resort to pairwise approaches, that is, methods in which each entry of the correlation matrix is estimated separately (see Alqallaf et al. 2002). As before, we will use a bivariate M-estimator or the bivariate Winsorized estimator to calculate the correlations in $V$. The resulting methods are called M cleaning and W cleaning, respectively.
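A minimal R sketch of the cleaning transformation in the spirit of W cleaning is given below. It assumes the columns of Z have already been standardized with median and MAD, reuses adj_wins_cor() from the earlier sketch to build the pairwise initial correlation matrix, and assumes that matrix is positive definite; the function name w_clean is ours.

# Multivariate Winsorization used for data cleaning (a sketch).
w_clean <- function(Z, c1 = 2) {
  d  <- ncol(Z)
  R0 <- diag(d)
  for (j in 1:(d - 1)) {
    for (k in (j + 1):d) {
      R0[j, k] <- R0[k, j] <- adj_wins_cor(Z[, j], Z[, k], c1)  # pairwise initial correlations
    }
  }
  D2     <- mahalanobis(Z, center = rep(0, d), cov = R0)   # squared distances x' R0^{-1} x
  shrink <- pmin(sqrt(qchisq(0.95, df = d) / D2), 1)       # c = 95% chi-square(d) quantile
  Z * shrink                                               # outliers pulled to the tolerance ellipsoid
}

LARS applied to the returned matrix (and the correspondingly cleaned response) gives the cleaning version of the procedure described above.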
3.4.3 Simulations

To investigate the performance and stability of the four proposed methods we consider a simulation study involving a small number of variables. We used the following design (see Ronchetti et al. 1997). The error distributions considered are (e1) standard normal, (e2) 93% from standard normal and 7% from $N(0, 5^2)$, (e3) standard normal divided by a uniform on (0, 1), and (e4) 90% from standard normal and 10% from $N(30, 1)$. Two design matrices are considered: the uniform design, for which the columns are generated from a uniform distribution on (0, 1), and the leverage design, which is the same as the uniform design except that it contains a leverage point. Six variables are used, of which the first three are nonzero and in order of importance. The true regression coefficients for the nonzero variables are 7, 5 and 3, respectively. The sample size equals n = 60 and we generated 200 data sets for each setting.

We used two performance measures which we call exact (E) and global (G). The exact measure gives the percentage of times a procedure sequences the important variables in front and in their true order. The global measure gives the percentage of times a procedure sequences the important variables in front in any order.

Table 3.1: Percentages of correct sequences obtained by classical and robust methods for the uniform and leverage designs with 4 different error distributions.

                              Uniform                    Leverage
  Method            e1    e2    e3    e4        e1    e2    e3    e4
  LARS         E    97    86    11     8         0     1     1     2
               G   100    89    26    24         0     2     5     7
  M plug-in    E    95    97    53    87        96    96    49    87
               G    99    99    74    95        99    99    68    95
  W plug-in    E    96    97    58    78        92    85    46    59
               G    99    99    77    89        94    86    61    68
  M cleaning   E    96    98    55    89        96    97    50    87
               G    99    99    77    97       100    98    73    97
  W cleaning   E    96    98    54    82        96    94    52    83
               G    99    99    76    92        98    96    71    92

Table 3.1 shows the simulation results. For error distribution e1 (standard normal), the performance of the robust methods is almost as good as that of standard LARS. For the heavy-tailed distributions the robust methods drastically outperform LARS. Overall we see from Table 3.1 that the plug-in approaches are almost as stable as the computationally more expensive data cleaning approaches. Comparing the M and W approaches for both the plug-in and data cleaning procedures, it is reassuring to see that the computationally faster W approach (see Figure 3.5 below) is almost as stable as the M approach.

Numerical complexity of the algorithms

We now compare the computational complexity of the different methods. The standard LARS procedure sequences all $d$ covariates in only $O(nd^2)$ time. The plug-in and cleaning procedures based on M-estimators both require $O((n \log n)d^2)$ time. Based on Winsorization these procedures also require $O((n \log n)d^2)$ time, but with a much smaller multiplication factor. Moreover, if we are only interested in sequencing the top fraction of a large number of covariates, then the plug-in approach will be much faster than the cleaning approach, because the plug-in approach only calculates the required correlations along the way instead of the 'full' correlation matrix. In this case, the complexity for plug-in methods reduces to $O((n \log n)dm)$, where $m$ is the number of variables being sequenced. Figure 3.5 shows the mean cpu times based on 10 replicates for LARS, W plug-in and M plug-in for different dimensions $d$ with a fixed sample size $n = 2000$.
The times required by the cleaning methods are not shown because they were similar to the plug-in times, since we sequenced all the covariates. As in Figure 3.4, we see that the approaches based on M-estimators are more time consuming than the Winsorization approaches. The difference increases fast with dimension.

Figure 3.5: Numerical complexity of different techniques. LARS requires $O(nd^2)$ time. W plug-in and M plug-in both require $O((n \log n)d^2)$ time, but M plug-in has a larger multiplication factor.

The cleaning approaches perform slightly better than the plug-in approaches when the number of variables is relatively small, and much smaller than the number of cases (see Table 3.1). However, plug-in approaches are less time-consuming when only a part of the predictors are sequenced. Since W plug-in has a reasonable performance compared to the other methods and has favorable computing times, this method is to be preferred for large, high-dimensional datasets. The performance of W plug-in will be studied further in the next sections, and we will call this method robust LARS from now on.

3.5 Size of the reduced set

To obtain a good final model, it is important to choose an appropriate value of $m$, the size of the reduced set of covariates kept from the sequencing step. The reduced set should be large enough to include most of the important covariates, but not so large as to make the segmentation step (where we have to evaluate all possible subsets of the reduced set) impractical. Several factors can be important when determining the size $m$, such as $d$, the total number of variables, the sample size $n$, the unknown number of non-zero variables in the optimal model, the correlation structure of the covariates, and of course also the time and feasibility of the segmentation step. For example, for high-dimensional datasets, including only 1% of the variables in the reduced set may already make the segmentation step infeasible.

To investigate what values of $m$ are appropriate, we carry out a simulation study similar to Frank and Friedman (1993). The total number of variables is $d = 100$. A small number $a = 9$ or $a = 15$ of them are nonzero covariates. We considered 3 correlation structures for these nonzero covariates: the "no-correlation" case, the "moderate-correlation" case and the "high-correlation" case, which are described below.

For the no-correlation case (a true correlation of 0 between the covariates), independent covariates $X_j \sim N(0, 1)$ are considered, and $Y$ is generated using the $a$ non-zero covariates, with coefficients $(7, 6, 5)$ repeated three times for $a = 9$, and five times for $a = 15$. The variance of the error term is chosen such that the signal-to-noise ratio equals 2.

For the moderate-correlation and high-correlation cases, we consider 3 independent 'unknown' processes, represented by latent variables $L_i$, $i = 1, 2, 3$, which are responsible for the systematic variation of both the response and the covariates. The model is
$$Y = 5L_1 + 4L_2 + 3L_3 + e = \mathrm{Signal} + e, \qquad (3.12)$$
where $L_i \sim N(0, 1)$, and $e$ is a normal error not related to the latent variables. The variance of $e$ is chosen such that the signal-to-noise ratio equals 2, that is, $\mathrm{Var}(e) = 50/4$. The nonzero covariates are divided into 3 equal groups, with each group related to exactly one of the latent variables by the relation $X_j = L_i + \delta_j$, where $\delta_j \sim N(0, \sigma_\delta^2)$. The value of $\sigma_\delta$ determines the correlation structure of the nonzero covariates.
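The following R sketch generates data from the latent-variable design (3.12) described above. The function name and arguments are ours; sd_delta stands for $\sigma_\delta$, and within a group the covariate correlation is $1/(1 + \sigma_\delta^2)$, so sd_delta controls which correlation case is produced.

# Data generator for the latent-variable design (3.12): a sketch.
gen_latent_data <- function(n = 150, d = 100, a = 9, sd_delta = 1) {
  L      <- matrix(rnorm(n * 3), n, 3)            # latent variables L1, L2, L3
  signal <- L %*% c(5, 4, 3)
  y      <- signal + rnorm(n, sd = sqrt(50 / 4))  # Var(e) = 50/4, signal-to-noise ratio 2
  X      <- matrix(rnorm(n * d), n, d)            # d candidate covariates, noise by default
  group  <- rep(1:3, each = a / 3)                # a/3 nonzero covariates per latent variable
  X[, 1:a] <- L[, group] + matrix(rnorm(n * a, sd = sd_delta), n, a)
  list(X = X, y = as.vector(y))
}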
The high-correlation case has a true correlation of 0.9 between the covariates generated with the same latent variable, and the moderate-correlation case has a true correlation of 0.5. For each situation we generated 100 samples of size $n = 150$. Outliers were added by giving the noise term a large positive mean (asymmetric error). We considered four different levels of contamination: 0, 5, 10 and 20%.

For the high-correlation and moderate-correlation cases, though $a$ of the covariates are linked to the response $Y$ through the latent variables, it is not clear which of these covariates should be considered important for explaining $Y$. Even when the true pairwise correlations of the covariates are zero (no-correlation case), the "best" model does not necessarily include all of the $a$ non-zero coefficients because of the bias-variance trade-off. Therefore, for each simulated dataset we first find the "best" model among all possible subsets of the non-zero covariates, that is, the subset with the minimum prediction error estimated by 5-fold cross-validation.

Figure 3.6: Recall curves for a = 9; (a) no correlation (b) low correlation (c) high correlation. The 4 curves for (robust) LARS correspond to 4 levels of contamination.

For each simulated dataset, we determine the "recall proportion", i.e., the proportion of important variables (in the sense that they are in the "best" model by cross-validation) that are captured (recalled) by LARS/robust LARS for a fixed size of the reduced sequence. For $a = 9$, Figure 3.6 plots the average recall proportion against the size of the reduced set for the three correlation structures. In each plot, the 4 curves with the same line type correspond to the 4 levels of contamination; higher curves correspond to lower levels of contamination. These plots show that, for each correlation structure considered, we can capture the important variables if the percentage of variables in the reduced set is 9 or 10. Robust LARS performs as well as LARS for clean data, and much better than LARS for contaminated data. Figure 3.7 plots the average recall proportion against the size of the reduced set for the moderate-correlation case with $a = 15$. This plot can be compared with Figure 3.6(b) to see how the increase in the number of nonzero variables affects the recall proportions. In both cases, we observe that the average recall proportions stop increasing even before the size $m$ of the reduced set exceeds the number $a$ of non-zero variables.

3.6 Bootstrapped sequencing

To obtain more stable and reliable results we can combine robust LARS with the bootstrap. We generate a number $B$ of bootstrap samples from the dataset, and use robust LARS to obtain the corresponding sequence of covariates for each of these bootstrap samples. Each sequence ranks the covariates from 1 to $d$. For each covariate we can take the average over these $B$ ranks, and the $m$ covariates with the smallest average ranks then form the reduced set.

Figure 3.7: Recall curves for a = 15 and moderate correlation with 4 different levels of contamination.

When resampling from a high-dimensional dataset (compared to the sample size, e.g., n = 150, d = 100) the probability of obtaining singular samples becomes very high. Note that even the original sample may already be singular, or the dimension $d$ of the data may exceed the sample size. In these cases it will be impossible to sequence all covariates. We can easily overcome this problem by sequencing only the first $m_0 < d$ of the covariates for each bootstrap sample, where preferably $m_0 > m$. We then rank the covariates according to the number of times (out of $B$) they are actually sequenced. When ties occur, the order of the covariates is determined according to the average rank in the sequences, as sketched below.
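A minimal R sketch of this rank-aggregation step follows. Here robust_lars_seq() is a hypothetical helper that returns the indices of the first m0 covariates sequenced by robust LARS on a given sample; it stands in for the W plug-in procedure and is not a function from any package.

# Bootstrapped sequencing (a sketch); robust_lars_seq() is hypothetical.
boot_sequence <- function(X, y, B = 100, m0 = 25) {
  d <- ncol(X); n <- nrow(X)
  count   <- numeric(d)                      # times each covariate is sequenced
  ranksum <- numeric(d)                      # sum of its ranks when sequenced
  for (b in 1:B) {
    idx   <- sample(n, replace = TRUE)
    seq_b <- robust_lars_seq(X[idx, ], y[idx], m0)
    count[seq_b]   <- count[seq_b] + 1
    ranksum[seq_b] <- ranksum[seq_b] + seq_along(seq_b)
  }
  avg_rank <- ifelse(count > 0, ranksum / count, Inf)
  order(-count, avg_rank)    # most frequently sequenced first; ties broken by average rank
}

The first m entries of the returned ordering form the reduced set.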
In our simulations, we generated $B = 100$ bootstrap samples from each of the 100 simulated datasets. We sequenced the first 25 covariates in each bootstrap sample. Figure 3.8 shows the recall curves obtained by robust LARS (solid lines) and bootstrapped robust LARS (dotted lines) for covariates with moderate correlation. The recall curves obtained by bootstrapped robust LARS are better than the initial robust LARS curves for all levels of contamination, the difference being larger for larger contamination proportions. This confirms that by applying the bootstrap we obtain more stable and reliable results. Even with 20% contamination, bootstrapped robust LARS with $m = 10$ ($a = 9$) or $m = 15$ ($a = 15$) already yields a recall proportion around 90%.

To investigate what minimum number of bootstrap samples is required to obtain a significant improvement over robust LARS, we also tried $B = 10$, 20 and 50 in the above setups. In each case, $B = 10$ and $B = 20$ do not yield much improvement, while with $B = 50$ the results obtained are almost as stable as with $B = 100$.

Figure 3.8: Recall curves for robust LARS and bootstrapped robust LARS for covariates with moderate correlation; (a) a = 9 (b) a = 15. The 4 curves for each method correspond to 4 levels of contamination.

3.7 Learning curves

Although the simulation results in the previous sections suggested that it suffices to select the size of the reduced set equal to or slightly larger than the number of predictors in the final model, we usually have no information about the number of predictors that is needed. Hence, a graphical tool to select the size of the reduced set would be useful. The following plot can be constructed to determine a reasonable size for the reduced set. Starting from a model with only 1 variable (the first one in the sequence), we increase the number of variables according to the sequence obtained and each time fit a robust regression model to compute a robust $R^2$ measure such as $R^2 = 1 - \mathrm{med}(e^2)/\mathrm{MAD}^2(Y)$, where $e$ is the vector of residuals from the robust fit. We then plot these robust $R^2$ values against the number of variables in the model to obtain a learning curve. The size of the reduced set can be selected as the point where the learning curve no longer has a considerable slope.

A problem that can occur with a robust $R^2$ measure is that, unlike its classical counterpart, it is not always a nondecreasing function of the number of covariates. This can be resolved as follows. If the robust $R^2$ at any step is smaller than that of the preceding step, then fit a robust simple linear regression of the residuals from the preceding step on the newly selected covariate. The residuals obtained from this fit can be used to compute another robust $R^2$ value. We then use the larger of the two values.
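The following R sketch computes such a learning curve. It uses lmrob() from the robustbase package (an MM-estimator) as one convenient robust fit, although other robust regressions, such as the LMS fits used later in this section, could be substituted; the cummax() step is only a simple stand-in for the residual-refitting correction described above, and the function name is ours.

# Robust R^2 learning curve along a given sequence of covariates (a sketch).
library(robustbase)

learning_curve <- function(X, y, seq_vars) {
  X  <- as.data.frame(X)
  r2 <- numeric(length(seq_vars))
  for (k in seq_along(seq_vars)) {
    dat   <- data.frame(y = y, X[, seq_vars[1:k], drop = FALSE])
    fit   <- lmrob(y ~ ., data = dat)                      # robust fit on first k sequenced covariates
    r2[k] <- 1 - median(residuals(fit)^2) / mad(y)^2       # robust R^2 = 1 - med(e^2)/MAD^2(Y)
  }
  cummax(r2)        # enforce a nondecreasing curve (simplified correction)
}
# plot(learning_curve(X, y, seq_vars), type = "b",
#      xlab = "Number of variables in the model", ylab = "robust R-squared")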
To investigate the performance of learning curves, we consider a dataset on air pollution and mortality in 60 metropolitan areas in the United States. The response variable is the age-adjusted mortality. There are 14 potential predictors, numbered from 1 to 14. Since row 21 contains 2 missing values, we drop this observation from the data. Based on robust data exploration we identified 4 clear outliers, which correspond to the four metropolitan areas in California. We applied 5-fold cross-validation (CV) to this dataset without the four outliers, and obtained the "best model", which has the following 7 covariates: (2, 3, 4, 6, 7, 10, 13). (The order of the covariates is not relevant here.) Bootstrapped robust LARS applied to this dataset (including the outliers) produced the sequence (7, 5, 13, 4, 6, 3, 2, 10, 9, 1, 14, 11, 8, 12). We used this sequence and fitted Least Median of Squares (Rousseeuw 1984) regressions to obtain the robust $R^2$ values.

Figure 3.9: Learning curve for Pollution data. A reduced set of 8 covariates is suggested by the plot.

Figure 3.9 shows the corresponding learning curve. This plot suggests a reduced set of size 8. It is encouraging to notice that the reduced set (the first 8 covariates in the sequence above) contains all 7 predictors selected in the "best model" obtained by CV.

3.8 Examples

In this section we use two real datasets to evaluate the performance of (bootstrapped) robust LARS. The demographic data example further explores the idea of "learning curves" to choose the size of the reduced set. We then use a large dataset (protein data) to demonstrate the scalability as well as the stability of robust LARS.

Demographic data. This dataset contains demographic information on the 50 states of the United States for 1980. The response variable of interest is the murder rate per 100,000 residents. There are 25 predictors, which we number from 1 to 25. Exploration of the data using robust estimation and graphical tools revealed one clear outlier. We applied 5-fold CV to this dataset without the outlier, and obtained the "best of 25" model, which has the following 15 covariates: (1, 2, 3, 5, 6, 8, 9, 10, 16, 17, 18, 19, 21, 24, 25).

Figure 3.10 shows the learning curve for the Demographic data based on bootstrapped robust LARS. This plot suggests a reduced set of size 12, which includes the covariates (22, 20, 4, 15, 10, 2, 19, 25, 8, 18, 6, 24). The boldface numbers correspond to covariates in the sequence that are also in the model obtained by CV. The number of "hits" is 8 out of 12.

Figure 3.10: Learning curve for Demographic data. A reduced set of 12 covariates is suggested by the plot.

We applied 5-fold CV to the clean data using the reduced set of size 12 obtained by bootstrapped robust LARS. The model selected in this case has the following 9 covariates: (22, 20, 4, 15, 2, 10, 25, 18, 24). To compare this "best of 12" model with the "best of 25" model above, we estimated the prediction errors of these two models 1000 times using 5-fold CV. The two density curves are shown in Figure 3.11. The "best of 12" model has a mean error of 204.8 (median error 201.5), while the "best of 25" model has a mean error of 215.9 (median error 202.0). Also, the standard deviations (mads) of the errors are 25.6 (22.7) and 74.6 (31.4), respectively. (Some of the "best of 25" errors are very large and not included in the plot.) Thus, bootstrapped robust LARS gives more stable results in this high-variability dataset. It should be mentioned here that we needed almost 10 days to find the "best of 25" model, while the "best of 12" model requires less than 5 minutes, including the time needed to sequence the covariates by bootstrapped robust LARS. (All-subsets CV on $m$ covariates is $2^{d-m}$ times faster than on $d$ covariates.)
Figure 3.11: Error densities for the two "best" models for Demographic data. The "best of 12" model gives more stable results.

Protein data. This dataset of n = 145751 protein sequences was used for the KDD-Cup 2004. Each of the 153 blocks corresponds to a native protein, and each data point of a particular block is a candidate homologous protein. There are 75 variables in the dataset: the block number (categorical) and 74 measurements of protein features. We replace the categorical variable by block indicator variables, and use the first feature as the response. Though this analysis may not be of particular scientific interest, it will demonstrate the scalability and stability of the robust LARS algorithm. We used the package R to apply robust LARS to this dataset, and obtained a reduced set of size 25 from d = 225 covariates (152 block indicators + 73 features) in only 30 minutes. Given the huge computational burden of other robust variable selection procedures, our algorithm may be considered extremely suitable for computations of this magnitude.

For a thorough investigation of the performance of robust LARS with this dataset, we select 5 blocks with a total of n = 4141 protein sequences. These blocks were chosen because they contain the highest proportions of homologous proteins (and hence the highest proportions of potential outliers). We split the data of each block into two almost equal parts to get a training sample of size n = 2072 and a test sample of size n = 2069. The number of covariates is d = 77, with 4 block indicators (variables 1-4) and 73 features. We apply bootstrapped robust LARS with B = 100 bootstrap samples, and we sequence the first 25 variables of each bootstrap sample. The resulting learning curve is shown in Figure 3.12. This plot suggests that a drastic reduction to a small number of predictors can be performed, e.g., m = 5 or m = 10.

Figure 3.12: Learning curve for Protein data. A reduced set of 5 covariates is suggested by the plot.

The first 10 predictors found by bootstrapped robust LARS are (14, 13, 5, 76, 73, 8, 7, 40, 46, 51). The covariates in this sequence are almost the same as those obtained with the whole dataset (not shown). Standard LARS produced the sequence (14, 13, 5, 8, 7, 76, 18, 65, 2, 46). Note that the two sequences are quite different. For example, if we select a model from the first five predictors, then only 3 predictors are contained in both sequences. Using MM-estimators and robust AIC, the best model selected from the first five variables of the robust sequence contains variables (14, 13, 5, 76), while the best model out of the first 10 predictors contains variables (14, 13, 5, 76, 40). Hence only 1 variable is added. Using classical AIC, the best model selected from the first 5 variables of the LARS sequence contains variables (14, 13, 5, 8). Variable 76 of the corresponding robust model is replaced by variable 8. The best model from the first 10 predictors contains variables (14, 13, 5, 8, 76, 2). Note that 2 variables are added to the list, compared to 1 variable in the robust case. We fitted the 4 best models using the training data, and then used them to predict the test data outcomes. The 1%, 5% and 10% trimmed means of prediction errors for the smaller robust (classical) model are 114.92 (117.49), 92.77 (95.66) and 74.82 (78.19), respectively. The corresponding quantities for the larger robust (classical) model are 114.37 (115.46), 92.43 (94.84) and 74.34 (76.50), respectively.
Notice that the robust models always outperform the classical models.

3.9 Conclusion

The main contribution of this chapter is that we developed robust versions of LARS to obtain a reduced set of covariates for further investigation. We also introduced the idea of multivariate-Winsorization of the data (when the dimension is not too large). We can perform computationally suitable classical multivariate analyses on the transformed data to obtain reliable results. We also proposed a new robust correlation estimate for bivariate data, which we called the "adjusted-Winsorized correlation estimate."

LARS is a very effective, time-efficient model building tool, but it is not resistant to outliers. We introduced two different approaches to construct robust versions of the LARS technique. The plug-in approach replaces the classical correlations in LARS by easily computable robust correlation estimates. The cleaning approach first transforms the dataset by shrinking the outliers towards the bulk of the data, and then applies LARS to the transformed data. Both approaches use robust pairwise correlation estimates, which can be computed efficiently using bivariate Winsorization or bivariate M-estimates.

The data cleaning approach is limited in use because the sample size needs to be (much) larger than the number of candidate predictors to ensure that the resulting correlation matrix is positive definite. Moreover, the data cleaning approach is more time consuming than the plug-in approach, certainly when only part of the predictors is being sequenced. Since the plug-in approach has good performance, is faster to compute and is more widely applicable, we prefer this method. Comparing bivariate M-estimates with bivariate Winsorization, we showed that the latter is faster to compute, with important time differences when the number of candidate predictors becomes high.

We propose using the robust LARS technique to sequence the candidate predictors and as such identify a reduced set of most promising predictors from which a more refined model can be selected in a second segmentation step. We recommend combining W plug-in with the bootstrap to obtain more stable and reliable results. The reduced sets obtained by bootstrapped robust LARS contain more of the important covariates than the reduced sets obtained by the initial robust LARS.

It is important to select the number of predictors to use for the second step. This number is a trade-off between the success rate, that is, the number of important predictors captured in the reduced set, and the feasibility of the segmentation step. Our simulation study indicated that the reduced set can have size comparable to the actual number of relevant candidate predictors. However, this number is usually unknown. To still get an idea about an appropriate size for the reduced set we introduced a learning curve that plots robust $R^2$ values versus dimension. An appropriate size can be selected as the dimension corresponding to the point where the curve starts to level off.

3.10 Chapter Appendix

3.10.1 Determination of $\hat\gamma$ for one active covariate

Assume that the first selected covariate is $+X_m$. The current prediction $\hat\mu \leftarrow 0$ should be modified as $\hat\mu \leftarrow \hat\gamma X_m$. The distance $\hat\gamma$ should be such that the modified residual $(Y - \hat\mu)$ has equal correlation with $+X_m$ and another signed covariate $X_j$.
We have
$$\mathrm{corr}(Y - \hat\gamma X_m,\, X_m) = \frac{X_m'(Y - \hat\gamma X_m)/n}{\mathrm{SD}(Y - \hat\gamma X_m)} = \frac{r - \hat\gamma}{\mathrm{SD}(Y - \hat\gamma X_m)}, \qquad (3.13)$$
and
$$\mathrm{corr}(Y - \hat\gamma X_m,\, X_j) = \frac{X_j'(Y - \hat\gamma X_m)/n}{\mathrm{SD}(Y - \hat\gamma X_m)} = \frac{r_{jY} - \hat\gamma r_{jm}}{\mathrm{SD}(Y - \hat\gamma X_m)}. \qquad (3.14)$$
Equating (3.13) to (3.14), we have
$$\hat\gamma(+X_j) = \frac{r - r_{jY}}{1 - r_{jm}}. \qquad (3.15)$$
Similarly, equating (3.13) with the correlation of the modified residual and $-X_j$, we have
$$\hat\gamma(-X_j) = \frac{r + r_{jY}}{1 + r_{jm}}. \qquad (3.16)$$
We should take the minimum of (3.15) and (3.16), and the minimum over all inactive (not yet selected) $j$. The signed covariate that will enter the model at this point is determined along the way.

3.10.2 Quantities related to the equiangular vector $B_A$

Here, $A$ is the set of 'active' subscripts. Let $X_A = (\cdots\ s_l X_l\ \cdots)$, $l \in A$, where $s_l$ is the sign of $X_l$ as it enters the model. The standardized equiangular vector $B_A$ is obtained using the following three conditions.

$B_A$ is a linear combination of the active signed predictors:
$$B_A = X_A w_A, \quad \text{where } w_A \text{ is a vector of weights.} \qquad (3.17)$$
$B_A$ has unit variance:
$$\frac{1}{n} B_A' B_A = 1. \qquad (3.18)$$
$B_A$ has equal correlation ($a$, say) with each of the active predictors. Since the covariates and $B_A$ are standardized,
$$\frac{1}{n} X_A' B_A = a\,\mathbf{1}_A, \quad \text{where } \mathbf{1}_A \text{ is a vector of 1's.} \qquad (3.19)$$
Using equation (3.17) in equation (3.18), we have $\frac{1}{n} w_A' X_A' X_A w_A = 1$, so that
$$w_A' R_A^{(s)} w_A = 1, \qquad (3.20)$$
where $R_A^{(s)}$ is the correlation matrix of the active signed variables. Using (3.17) in (3.19), we have $R_A^{(s)} w_A = a\,\mathbf{1}_A$, so that the weight vector $w_A$ can be expressed as $w_A = a\,(R_A^{(s)})^{-1}\mathbf{1}_A$. Let $R_A$ be the correlation matrix of the unsigned active covariates, i.e., $R_A$ is a submatrix of $R_X$. Let $s_A$ be the vector of signs of the active covariates (we get the sign of each covariate as it enters the model). We have
$$w_A = a\,(D_A R_A D_A)^{-1}\mathbf{1}_A, \qquad (3.21)$$
where $D_A$ is the diagonal matrix whose diagonal elements are the elements of $s_A$. Finally, using equation (3.21) in equation (3.20), we get
$$a = [\mathbf{1}_A'(D_A R_A D_A)^{-1}\mathbf{1}_A]^{-1/2}. \qquad (3.22)$$
The correlation of an inactive covariate $X_j$ with $B_A$, denoted by $a_j$, can be expressed as
$$a_j = \frac{1}{n} X_j' B_A = \frac{1}{n} X_j' X_A w_A = (D_A r_{jA})' w_A, \qquad (3.23)$$
where $r_{jA}$ is the vector of correlation coefficients between the inactive covariate $X_j$ and the (unsigned) selected covariates. Thus, we need only (a part of) the correlation matrix of the data (not the observations themselves) to determine the above quantities.

3.10.3 Determination of $\gamma_A$ for two or more active covariates

Let us update $r \leftarrow (r - \hat\gamma)$, see (3.13), and $r_{jY} \leftarrow (r_{jY} - \hat\gamma r_{jm})$, see (3.14). The correlation of an active covariate with the 'current' residual $Y - \hat\mu$ is $r/\mathrm{SD}(Y - \hat\mu)$, and the correlation of the active covariate with the current equiangular vector $B_A$ is $a$. Therefore, the correlation between an active covariate and the 'modified' residual $(Y - \hat\mu - \gamma_A B_A)$ is
$$\frac{r - \gamma_A a}{\mathrm{SD}(Y - \hat\mu - \gamma_A B_A)}.$$
An inactive covariate $+X_j$, $j \in A^c$, has correlation $r_{jY}/\mathrm{SD}(Y - \hat\mu)$ with the 'current' residual, and it has correlation $a_j$ with $B_A$. Therefore, the correlation between $+X_j$, $j \in A^c$, and the 'modified' residual is
$$\frac{r_{jY} - \gamma_A a_j}{\mathrm{SD}(Y - \hat\mu - \gamma_A B_A)}.$$
Equating the above two quantities, we get
$$\gamma_A(+X_j) = (r - r_{jY})/(a - a_j). \qquad (3.24)$$
Similarly,
$$\gamma_A(-X_j) = (r + r_{jY})/(a + a_j). \qquad (3.25)$$
We have to choose the minimum possible $\gamma_A$ over all inactive covariates. Note that when $A$ has only one covariate, (3.24) and (3.25) reduce to (3.15) and (3.16), respectively.

Chapter 4

Two-step Model Building: Robust Segmentation

4.1 Introduction

In Chapter 3 we developed robust sequencing methods to obtain a reduced set of covariates from which the final prediction model can be selected.
According to the notation used before, we have $m$ predictors $X_1, \ldots, X_m$ in the reduced set. In this chapter we consider methods of segmentation (evaluation of all possible subsets of the reduced set of covariates) in order to select the final prediction model. To compare different subsets of covariates, we require an appropriate robust selection criterion. For this purpose, we review some classical selection criteria in Section 4.2, and their robust counterparts in Section 4.3. We use $\hat\beta_p$ to denote the estimate of $\beta_p$ for the $p$-parameter submodel ($p$ predictors including the intercept) under consideration. Many of the methods below require an estimate of the variance of the error term under the "true" model, $\sigma^2$. In such cases, $\sigma^2$ is estimated using the full model (with $k = m + 1$ parameters).

4.2 Review: classical selection criteria

In this section, we review some important classical selection criteria: Final Prediction Error (FPE), Akaike Information Criterion (AIC), Mallows' $C_p$, cross-validation and the bootstrap.

4.2.1 Akaike Information Criterion (AIC)

A measure of the similarity between the fitted distribution $f(y|\hat\beta_p)$ and the true distribution $g(y|\beta)$ is the Kullback-Leibler information number (Kullback and Leibler 1951)
$$I(g, f) = E\bigl[\log\{g(Y|\beta)/f(Y|\hat\beta_p)\}\bigr].$$
It can be shown that (i) $I(g, f) \ge 0$, and (ii) $I(g, f) = 0 \iff g(y) = f(y)$ almost everywhere (Lebesgue measure). Our purpose is to minimize
$$I(g, f) = E\{\log g(Y|\beta)\} - E\{\log f(Y|\hat\beta_p)\},$$
where only the second term is important in evaluating the fitted model. This term is unknown, and it seems reasonable to consider the log-likelihood
$$L(\hat\beta_p) = \sum_{i=1}^{n}\log f(Y_i|\hat\beta_p)$$
as an estimate of $nE\{\log f(Y|\hat\beta_p)\}$. However, this estimate has a bias, since the same data are used to find the estimate $\hat\beta_p$ and to calculate the log-likelihood. Akaike (1973) showed that the expected value of the bias is approximately $p$. Therefore, the corrected estimate of $nE\{\log f(Y|\hat\beta_p)\}$ is $L'(\hat\beta_p) = L(\hat\beta_p) - p$. Based on this, Akaike (1973) proposed to choose the model that minimizes the Akaike Information Criterion
$$\mathrm{AIC} = -2L(\hat\beta_p) + 2p.$$
Bhansali and Downham (1977) proposed to generalize AIC by choosing the model that minimizes, for a chosen fixed $\alpha$,
$$\mathrm{AIC}(p, \alpha) = -2L(\hat\beta_p) + \alpha p.$$
For normal errors,
$$\mathrm{AIC}(p, \alpha) = K(n, \alpha) + \frac{\mathrm{RSS}_p}{\hat\sigma^2} + \alpha p, \qquad (4.1)$$
where $K(n, \alpha)$ is a constant depending on the marginal distribution of the covariates, $\mathrm{RSS}_p$ is the residual sum of squares, and $\hat\sigma^2$ is the estimate of $\sigma^2$ from the full model.

4.2.2 Mallows' $C_p$

Let us consider the following submodel with $p$ parameters:
$$Y_i = \beta_p' x_i + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad (4.2)$$
where the $\epsilon_i$ are independent observations from the distribution $F$ with mean zero and variance $\sigma^2$ (when the current submodel is the true model). This subset model may produce biased fitted values, i.e., $E(\hat Y_i) \ne E(Y_i)$, where $\hat Y_i = \hat\beta_p' x_i$. The bias may be tolerable if it is offset by a reduced variance. Therefore, Mallows (1973) considered the mean square error of each fitted value, and defined the following criterion for model evaluation:
$$J_p = \frac{1}{\sigma^2}\sum_{i=1}^{n}\mathrm{MSE}(\hat Y_i) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\Bigl[\mathrm{Var}(\hat Y_i) + \bigl(E(\hat Y_i) - E(Y_i)\bigr)^2\Bigr]. \qquad (4.3)$$
The value of $J_p$ has to be estimated from the data. Mallows (1973) proposed the following estimate:
$$C_p = \hat J_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} + 2p - n, \qquad (4.4)$$
where $\hat\sigma^2$ is the estimate of $\sigma^2$ from the full model. It can be shown that, for the full model with $k = m + 1$ parameters, $C_k = k$. It is interesting to note that, for normal errors, the $C_p$ statistic is equivalent to $\mathrm{AIC}(p, 2)$ (see (4.1)).
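The short R sketch below computes $C_p$ from (4.4) and the normal-error form of AIC$(p, \alpha)$ from (4.1) for a candidate subset, dropping the constant $K(n, \alpha)$ since it does not affect model comparisons; the function name and arguments are ours, and $\hat\sigma^2$ is taken from the full model as in the text.

# Mallows' Cp and AIC(p, alpha) for a candidate submodel (a sketch).
cp_aic <- function(y, X, subset_cols, alpha = 2) {
  n      <- length(y)
  dat    <- data.frame(y = y, X)
  full   <- lm(y ~ ., data = dat)
  sigma2 <- sum(residuals(full)^2) / (n - ncol(X) - 1)   # sigma^2 from the full model
  sub    <- lm(y ~ ., data = data.frame(y = y, X[, subset_cols, drop = FALSE]))
  rss_p  <- sum(residuals(sub)^2)
  p      <- length(subset_cols) + 1                      # parameters, including intercept
  c(Cp  = rss_p / sigma2 + 2 * p - n,                    # equation (4.4)
    AIC = rss_p / sigma2 + alpha * p)                    # equation (4.1), without K(n, alpha)
}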
4.2.3 Final Prediction Error (FPE)

Akaike (1969, 1970) proposed a criterion for the selection of predictors in the context of autoregressive processes. The author minimized an estimate of the expected squared error in predicting observations that are independent of the available data but have the same distribution. Consider the subset model (4.2). Suppose that we are trying to predict, using the estimates $\hat\beta_p$, the values $Y_i^*$ satisfying
$$Y_i^* = \beta_p' x_i + \epsilon_i^*, \quad i = 1, 2, \ldots, n, \qquad (4.5)$$
where the $\epsilon_i^*$ have the same distribution $F$, but are independent of the $\epsilon_i$. The Final Prediction Error (FPE) of the current model is defined as
$$\mathrm{FPE} = \frac{1}{\sigma^2}\,E\Bigl[\sum_{i=1}^{n}\bigl(Y_i^* - \hat\beta_p' x_i\bigr)^2\Bigr]. \qquad (4.6)$$
It is interesting to note that, for the linear regression setup considered above,
$$\mathrm{FPE} = \frac{1}{\sigma^2}\sum_{i=1}^{n}E\bigl(Y_i^* - \hat\beta_p' x_i\bigr)^2 = J_p + n,$$
where $J_p$ is defined in (4.3). Therefore, based on (4.4), an estimate of FPE is given by
$$\widehat{\mathrm{FPE}} = \frac{\mathrm{RSS}_p}{\hat\sigma^2} + 2p, \qquad (4.7)$$
where $\hat\sigma^2$ is the estimate of $\sigma^2$ from the full model. Note that, for the evaluation of linear prediction models, FPE is equivalent to the $C_p$ statistic.

4.2.4 Cross-validation

Cross-validation (CV) obtains an estimate of the error rate of a prediction rule by splitting the $n$ data points into a training sample of size $n_t$ (used for fitting the prediction model, i.e., for estimating the model parameters) and a validation sample of size $n_v = n - n_t$ (used for assessing the model). We calculate the average prediction error based on all or some of the $\binom{n}{n_v}$ different validation samples, and use it as a criterion to select a prediction model. This is often called leave-$n_v$-out cross-validation, or CV($n_v$). The vast majority of papers on this topic deals with leave-one-out cross-validation, denoted by CV(1). Lachenbruch and Mickey (1968) proposed the use of CV(1) in discriminant analysis. The method was furthered by Allen (1974), Stone (1974), and Geisser (1975). The asymptotic equivalence of CV(1) and AIC was shown by Stone (1977).

Efron (1983) used CV(1) to estimate the error rate of a prediction rule in the situation where the response $Y$ is dichotomous. We can easily generalize the author's approach to a continuous response. Suppose that we have an $n \times p$ dataset $Z = \{z_i,\ i = 1, 2, \ldots, n\}$, where each case (row) $z_i = (x_i, y_i)$ is an observation of the random quantity $(X, Y)$, with $X$ being a row-vector of $(p - 1)$ covariates and $Y$ being a real-valued response variable. The dataset $Z$ is a random sample from a distribution $H$ on the $p$-dimensional sample space $\mathbb{R}^p$. We want to evaluate a prediction rule $\eta(x, Z)$ constructed from the given dataset. An example of $\eta(x, Z)$ is $\hat\beta_Z x$, where $\hat\beta_Z$ is the linear regression coefficient of $Y$ on $X$. We want to estimate the error rate of $\eta(x, Z)$ when $\eta(x_0, Z)$ is used to predict a future value $y_0$ of $Y$ for a given predictor value $x_0$. Let $Q[y_0, \eta(x_0, Z)]$ denote the "error" in predicting $y_0$ from $x_0$. For example, we can consider the squared error
$$Q[y_0, \eta(x_0, Z)] = \bigl(y_0 - \eta(x_0, Z)\bigr)^2. \qquad (4.8)$$
True error rate. The true error rate $\mathrm{Err}(Z, H)$ of the prediction rule $\eta(x_0, Z)$ can be defined as
$$\mathrm{Err}(Z, H) = E_H\bigl(Q[Y_0, \eta(X_0, Z)]\bigr), \qquad (4.9)$$
the expectation being taken over $(X_0, Y_0) \sim H$ with $Z$ fixed at its observed value.

Apparent error rate. The most obvious estimate of the true error rate $\mathrm{Err}(Z, H)$ is the apparent error rate $\mathrm{err}(Z, H)$:
$$\mathrm{err}(Z, H) = \frac{1}{n}\sum_{i=1}^{n} Q[y_i, \eta(x_i, Z)], \qquad (4.10)$$
which usually underestimates $\mathrm{Err}(Z, H)$, because the same data have been used both to construct and to evaluate the prediction rule $\eta(x, Z)$.

CV error rate. CV attempts to overcome the problem of underestimation of $\mathrm{Err}(Z, H)$ by dividing the given dataset into the training and validation parts.
For CV(1), let $Z_{(i)}$ be the training set with case $z_i$ removed, and let $\eta(x, Z_{(i)})$ be the corresponding prediction rule. The CV(1) estimate of $\mathrm{Err}(Z, H)$ is given by
$$\widehat{\mathrm{Err}}^{(\mathrm{CV})} = \frac{1}{n}\sum_{i=1}^{n} Q[y_i, \eta(x_i, Z_{(i)})]. \qquad (4.11)$$
Shao (1993) used CV($n_v$) for model selection in regression using a random selection of the $\binom{n}{n_v}$ possible validation samples.

4.2.5 Bootstrap

Efron (1983) used the bootstrap to estimate the true error rate $\mathrm{Err}(Z, H)$ (see (4.9)). Since the apparent error rate $\mathrm{err}(Z, H)$ (see (4.10)) is an underestimate of $\mathrm{Err}(Z, H)$, a correction is required. Let $\mathrm{op}(Z, H)$ be defined as
$$\mathrm{op}(Z, H) = \mathrm{Err}(Z, H) - \mathrm{err}(Z, H). \qquad (4.12)$$
The expectation of $\mathrm{op}(Z, H)$, denoted by $\omega(H)$, is given by
$$\omega(H) = E_H\{\mathrm{Err}(Z, H) - \mathrm{err}(Z, H)\}, \qquad (4.13)$$
which would be the ideal correction if it were known. Note that, though the true error rate and the apparent error rate are defined for the particular dataset $Z$, the target correction is the expectation over all datasets; it is not easy to find an estimate of (4.12), which is defined for a particular $Z$. The unknown $\omega(H)$ can be estimated using the bootstrap procedure to get the bootstrap estimate of $\mathrm{Err}(Z, H)$ as
$$\widehat{\mathrm{Err}}^{(\mathrm{Boot})} = \mathrm{err} + \hat\omega^{(\mathrm{Boot})}. \qquad (4.14)$$
To obtain $\hat\omega^{(\mathrm{Boot})}$, let $Z^*$ be a bootstrap sample, i.e., a random sample of size $n$ from $\hat H$. Based on (4.13), $\hat\omega^{(\mathrm{Boot})}$ can be written as
$$\hat\omega^{(\mathrm{Boot})} = E_*\{\mathrm{Err}(Z^*, \hat H)\} - E_*\{\mathrm{err}(Z^*, \hat H)\}
= E_*\Bigl\{\frac{1}{n}\sum_{i=1}^{n} Q[y_i, \eta(x_i, Z^*)]\Bigr\} - E_*\Bigl\{\sum_{i=1}^{n} P_i^*\, Q[y_i, \eta(x_i, Z^*)]\Bigr\}, \qquad (4.15)$$
where $E_*$ is the expectation over all bootstrap samples, and $P_i^*$ is the proportion of times a particular case $z_i$ occurs in the bootstrap sample $Z^*$, i.e.,
$$P_i^* = \frac{\#\{z_j^* = z_i\}}{n}.$$
The expression inside the first pair of braces of (4.15) suggests that the prediction rule be constructed with the bootstrap sample $Z^*$, and an average error of this rule be calculated on the given dataset $Z$. The expression inside the second pair of braces of (4.15) suggests that the prediction rule be constructed with the bootstrap sample $Z^*$, and an average error of this rule be calculated on the same bootstrap sample $Z^*$.
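The R sketch below implements this bootstrap correction, equations (4.10), (4.14) and (4.15), for a least-squares prediction rule with squared error loss; the function name boot_err is ours and the number of bootstrap samples B is an arbitrary illustrative choice.

# Bootstrap estimate of the true error rate (a sketch).
boot_err <- function(X, y, B = 200) {
  dat <- data.frame(y = y, X)
  n   <- nrow(dat)
  sq_err <- function(fit, newdat) mean((newdat$y - predict(fit, newdata = newdat))^2)
  fit0    <- lm(y ~ ., data = dat)
  err_app <- sq_err(fit0, dat)                  # apparent error rate (4.10)
  omega   <- numeric(B)
  for (b in 1:B) {
    idx   <- sample(n, replace = TRUE)
    fit_b <- lm(y ~ ., data = dat[idx, ])
    # first term of (4.15): error of the bootstrap rule on the original data;
    # second term: its error on the bootstrap sample itself
    omega[b] <- sq_err(fit_b, dat) - sq_err(fit_b, dat[idx, ])
  }
  err_app + mean(omega)                         # bootstrap estimate (4.14)
}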
4.3 Review: robust selection criteria

In this section we present the robust counterparts of the classical selection criteria AIC, $C_p$ and FPE (in order of appearance in the robustness literature), and discuss their limitations. We discuss robust counterparts of the cross-validation and bootstrap procedures in Section 4.4 and Section 4.5, respectively.

4.3.1 Robust AIC

Ronchetti (1985) proposed a robust counterpart of the AIC statistic. The extension of AIC to AICR is inspired by the extension of maximum likelihood estimation to M-estimation. The author derived AICR for an error distribution with density
$$f(\epsilon) \propto \exp\bigl(-\chi(\epsilon)\bigr). \qquad (4.16)$$
For a given constant $\alpha$ and a given function $\chi$, we can choose the model that minimizes
$$\mathrm{AICR}(p, \alpha, \chi) = 2\sum_{i=1}^{n}\chi(r_i) + \alpha p, \qquad (4.17)$$
where $r_i = (Y_i - \hat\beta_p' x_i)/\hat\sigma$, $\hat\sigma$ is some robust estimate of $\sigma$, and $\hat\beta_p$ is the M-estimator defined as the implicit solution of
$$\sum_{i=1}^{n}\psi(r_i)x_i = 0,$$
with $\psi = \chi'$. The author also proposed a choice for the parameter $\alpha$, given by
$$\alpha = 2E[\psi^2(\epsilon)]/E[\psi'(\epsilon)]. \qquad (4.18)$$
Limitation. The author considered that the M-estimate was the maximum likelihood estimate for the density in (4.16). Unfortunately, this only holds for unbounded $\chi$ functions, and in such cases the breakdown point of the M-estimate is 0.

4.3.2 Robust $C_p$

Ronchetti and Staudte (1994) pointed out the sensitivity of the classical $C_p$ to outlying points, and proposed a robust $C_p$ statistic denoted by RC$_p$. Consider an M-estimator $\hat\beta$ with weights $w_i = \psi(r_i)/r_i$, where $r_i$ is the residual for the $i$th observation. Using these weights, the authors defined a weighted version $\Gamma_p$ of $J_p$ (see (4.3)), based on the weighted fitted values $w_i \hat Y_i$. The authors proposed the following estimate of $\Gamma_p$, i.e., the robust version of $C_p$ (see (4.4)):
$$\mathrm{RC}_p = \frac{W_p}{\hat\sigma^2} - (U_p - V_p), \qquad (4.19)$$
where $W_p = \sum w_i^2 r_i^2$ is the weighted residual sum of squares, $\hat\sigma^2$ is a robust and consistent estimate of $\sigma^2$ from the full model, and $U_p$ and $V_p$ are constants depending on the weight function and the number of parameters $p$. When the weights are identically 1, RC$_p$ reduces to Mallows' $C_p$.

4.3.3 Robust FPE

The robust analogue of the classical FPE criterion was proposed by Yohai (1997). Let $s$ be an estimate of the scale $\sigma$ from the full model, and let $\hat\beta_p$ be the M-estimator for the particular model under consideration:
$$\hat\beta_p = \arg\min_{\beta}\sum_{i=1}^{n}\chi\bigl((y_i - \beta' x_i)/s\bigr). \qquad (4.20)$$
When we are trying to predict the $y_i^*$ (equation 4.5) using the estimate $\hat\beta_p$, the robust FPE (RFPE) is defined as
$$\mathrm{RFPE} = \sum_{i=1}^{n} E\Bigl[\chi\Bigl(\frac{y_i^* - \hat\beta_p' x_i}{\sigma}\Bigr)\Bigr], \qquad (4.21)$$
where the expectations are taken over the $y_i^*$'s as well as over $\hat\beta_p$. Note that when $\chi(u) = u^2$, RFPE reduces to the classical FPE. Using second-order Taylor expansions with respect to $\beta_p$ (assuming that the current model is the true model), RFPE can be expressed as
$$\mathrm{RFPE} \approx n\,E\bigl[\chi(\epsilon/\sigma)\bigr] + \frac{pA}{2B}, \qquad (4.22)$$
where $A = E(\psi^2(\epsilon/\sigma))$, $B = E(\psi'(\epsilon/\sigma))$, and $\psi = \chi'$. Therefore, an estimate of RFPE is given by
$$\widehat{\mathrm{RFPE}} = \sum_{i=1}^{n}\chi(r_i/s) + \frac{p\hat A}{\hat B}, \qquad (4.23)$$
where $\hat A = n^{-1}\sum_{i=1}^{n}\psi^2(r_i/s)$, $\hat B = n^{-1}\sum_{i=1}^{n}\psi'(r_i/s)$, and $r_i = y_i - \hat\beta_p' x_i$. The performance of RFPE has not been studied so far. In Section 4.6 we carry out a simulation study to evaluate RFPE.

4.4 Robust cross-validation

Ronchetti, Field and Blanchard (1997) proposed a robust cross-validation procedure which is a robust version of the cross-validation method proposed by Shao (1993). The authors used estimators that have optimal bounded influence for prediction. However, their method is computationally expensive. Hubert and Engelen (2004) proposed a fast cross-validation method in the context of robust covariance estimation with the MCD and robust principal component analysis. In this section we propose a robust CV procedure which is computationally suitable.

First, let us consider a simple robustification of the CV procedure achieved by (a) constructing a robust prediction rule, denoted by $\eta_R(x, Z)$, based on the given dataset $Z$, and (b) calculating robust summary statistics of the prediction errors $Q[y_i, \eta_R(x_i, Z_{(i)})]$. For the construction of a robust prediction rule, we consider the regression MM-estimates proposed by Yohai (1987) because of their high breakdown point and high efficiency at the normal model. This estimate is defined as follows.

Definition 4.1. (Regression MM-estimate) Let $\chi_0 : \mathbb{R} \to \mathbb{R}$ and $\chi_1 : \mathbb{R} \to \mathbb{R}$ be two score functions such that $\chi_1(u) \le \chi_0(u)$, $u \in \mathbb{R}$, and each $\chi$ satisfies the following set of regularity conditions:

1. $\chi(-u) = \chi(u)$, $u \in \mathbb{R}$,
2. $\chi$ is non-decreasing on $[0, \infty)$,
3. $\chi(0) = 0$, and $\chi(\infty) = 1$,
4. $\chi$ is continuously differentiable.

Let $\tilde\beta$ be a high-breakdown-point "initial" estimate of $\beta$, and let $\hat\sigma$ be the estimate of scale of the residuals based on $\tilde\beta$ satisfying
$$\frac{1}{n}\sum_{i=1}^{n}\chi_0\bigl((y_i - x_i'\tilde\beta)/\hat\sigma\bigr) = b, \qquad (4.24)$$
where $b \in (0, 1]$ is the expectation of $\chi_0(\cdot)$ under the central model. Then the regression MM-estimate is defined as a solution of
$$\sum_{i=1}^{n}\chi_1'\bigl((y_i - x_i'\hat\beta)/\hat\sigma\bigr)x_i = 0. \qquad (4.25)$$
A reasonable choice for the initial estimate $\tilde\beta$ in Definition 4.1 is the regression S-estimate proposed by Rousseeuw and Yohai (1984) because of its high breakdown point. This estimate is defined as follows.

Definition 4.2. (Regression S-estimate) Let $\chi_0 : \mathbb{R} \to \mathbb{R}$ be the score function described above.
The regression S-estimate $\tilde\beta$ is defined as
$$\tilde\beta = \arg\min_{\beta}\,\hat\sigma(\beta), \qquad (4.26)$$
where $\hat\sigma(\beta)$ is the M-scale of the residuals $y_i - x_i'\beta$, defined as in (4.24).

$$\cdots\ -\ E_*\Bigl\{m_\alpha\bigl(Q[y_i, \eta_R(x_i, Z^*)]\bigr)\Bigr\}. \qquad (4.43)$$
Finally, the robust bootstrap estimate of $\mathrm{Err}_R(Z, H)$ (see equation 4.31) can be expressed as
$$\widehat{\mathrm{Err}}^{(\mathrm{RBoot})} = \mathrm{err}_R(Z) + \hat\omega^{(\mathrm{RBoot})}. \qquad (4.44)$$

4.6 Simulation study

First, we carry out a small simulation (Section 4.6.1) to show that the classical CV and bootstrap estimates of the true error rate (see (4.9)) are sensitive to outliers, while the robust estimates are resistant to outliers. We then conduct another study (Section 4.6.2) where we use these methods along with FPE and RFPE to select the "best" models, and compare the predictive powers of these models. Since AIC and $C_p$ are equivalent to FPE for the linear regression setup with normal errors, and robust AIC and robust $C_p$ have some limitations, we do not consider these criteria in our simulation study.

4.6.1 Robustness of the estimates

We evaluate the 4 estimates $\widehat{\mathrm{Err}}^{(\mathrm{CV})}$, $\widehat{\mathrm{Err}}^{(\mathrm{Boot})}$, $\widehat{\mathrm{Err}}^{(\mathrm{RCV})}$ and $\widehat{\mathrm{Err}}^{(\mathrm{RBoot})}$ using simulated clean and contaminated datasets. Since the true error rates $\mathrm{Err}(Z, H)$ and $\mathrm{Err}_R(Z, H)$ are different (the latter uses the trimmed mean), we multiply the robust estimates by
$$\lambda = \frac{E_{H_0}\bigl(Q[y, \eta(x, Z)]\bigr)}{E^{\alpha}_{H_0}\bigl(Q[y, \eta(x, Z)]\bigr)}$$
to make the results more comparable with the classical results.

We considered two standard normal covariates $X_1$ and $X_2$, and generated $Y = 2X_1 + X_2 + e$, where $e \sim N(0, 4)$. We simulated 100 datasets, and for each dataset we calculated the estimates mentioned above. We then contaminated each dataset as follows. Each of the 3 variables (2 covariates and the response) was contaminated independently: each observation of a variable was assigned probability 0.03 of being replaced by a large number. Therefore, the probability that any particular row of the dataset is contaminated is $1 - (1 - 0.03)^3$, which means that approximately 9% of the rows are contaminated. For each contaminated dataset we obtained the 4 estimates mentioned above.

Table 4.1 presents the results for the first 10 trials. The average $\mathrm{Err}(Z, H_0)$ (the average true error rate for the clean data) is 4.12, while the average $\mathrm{Err}_R(Z, H_0)$ (multiplied by $\lambda$) is 4.13.

Table 4.1: First 10 trials: classical and robust estimates of prediction errors.

               CV              Boot             RCV             RBoot
  Trial   Clean  Contam   Clean  Contam   Clean  Contam   Clean  Contam
    1      4.60   12.01    4.61   11.65    4.18    5.80    4.15    5.47
    2      4.05   10.20    4.02    8.99    3.48    5.18    3.42    5.05
    3      3.86   10.35    3.88   10.64    3.70    5.72    3.67    5.66
    4      4.95   11.56    4.97   11.80    5.45    6.14    5.43    6.47
    5      3.95   14.92    3.94   11.77    4.29    5.57    4.25    5.33
    6      4.54   12.56    4.52   10.91    5.11    6.37    5.07    6.30
    7      5.22   10.28    5.16   10.53    4.72    6.37    4.74    6.65
    8      4.03    8.54    4.04    8.63    4.04    5.43    4.04    5.45
    9      4.16   10.45    4.20   10.60    4.21    6.59    4.21    6.36
   10      4.57    9.82    4.53    9.72    4.75    5.90    4.70    6.54
  mean     4.14   10.94    4.13   10.28    4.17    5.32    4.16    5.22
  (sd)    (0.58)  (2.57)  (0.58)  (2.13)  (0.67)  (0.94)  (0.69)  (0.93)

Table 4.1 shows that, for the clean data, both the classical and robust methods estimate the true error rates very well. However, for the contaminated data, the robust estimates perform much better than the classical methods.
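The following R sketch gives a robust K-fold cross-validation in the spirit of Section 4.4: the prediction rule is an MM-estimator (lmrob from the robustbase package) and the squared prediction errors are summarized by an alpha-trimmed mean instead of the ordinary average. The function name is ours, and the exact correction factors and computational shortcuts used in the thesis are not reproduced here.

# Robust K-fold cross-validation (a sketch).
library(robustbase)

robust_cv <- function(X, y, K = 5, trim = 0.1) {
  dat  <- data.frame(y = y, X)
  n    <- nrow(dat)
  fold <- sample(rep(1:K, length.out = n))
  errs <- numeric(n)
  for (k in 1:K) {
    test <- fold == k
    fit  <- lmrob(y ~ ., data = dat[!test, ])                     # robust (MM) prediction rule
    errs[test] <- (y[test] - predict(fit, newdata = dat[test, ]))^2
  }
  mean(errs, trim = trim)   # robust summary of the prediction errors
}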
4.6.2 Final model selection

In this simulation study we use the classical segmentation methods CV, Boot and FPE along with their robust counterparts RCV, RBoot and RFPE to select the "best" models, and compare the predictive powers of these models. The study is similar to Frank and Friedman (1993). We considered 2 latent variables $L_i$, $i = 1, 2$, to generate $Y = 6L_1 + 5L_2 + e$, where $L_i \sim N(0, 1)$ and $e$ is a normal error not related to the latent variables. We considered a total of $m = 8$ covariates. Of these, $a = 4$ are related to the two latent variables, with 2 covariates related to $L_1$ and the other two related to $L_2$.

We generated 100 datasets, each of which was randomly divided into a training sample of size 100 and a test sample of size 100. Each training dataset was then contaminated as follows. A number of rows (10%) were chosen randomly, and for these rows the covariate values were replaced by large positive numbers while the response values were replaced by large negative numbers.

We used all 6 methods on the clean and contaminated training data to select and fit the final models, and then used them to predict the test data outcomes. For each simulated dataset, we recorded the number of noise variables in the model and the average squared prediction error on the test sample. Table 4.2 shows the average test error and the average number of noise variables selected by each method. For the clean data, the robust methods perform as well as the classical methods. For the contaminated data, the robust methods produce much smaller test errors than the classical methods. Also, the robust models contain fewer noise variables. The performance of the three robust methods is similar.

Table 4.2: Performance of the classical and robust methods of segmentation (evaluation of all possible subsets of the reduced set).

                            Test error              Noise
  Method               Clean     Contam       Clean     Contam
  Classical   CV       41.81      56.49        0.00       0.60
              Boot     41.32      54.88        0.00       0.50
              FPE      41.93      55.17        0.02       0.60
  Robust      RCV      42.97      43.62        0.06       0.08
              RBoot    41.59      44.80        0.08       0.08
              RFPE     42.73      44.91        0.06       0.06

4.7 Examples

In this section we use two real datasets to evaluate the performance of the classical and robust methods for the segmentation of the reduced set. Both of these datasets were used in Chapter 3 for the evaluation of robust sequencing.

4.7.1 Demographic data

This dataset contains $n = 50$ observations on $d = 25$ covariates and a response. For more details we refer to Section 3.8. Using the learning curve based on standard LARS, we selected the reduced set (22, 20, 4, 15, 25, 2, 14, 5, 3, 17, 24, 23). The robust bootstrapped LARS produced the reduced set (22, 20, 4, 15, 10, 2, 19, 25, 8, 18, 6, 24).

We applied the classical segmentation methods CV, Boot and FPE to the first reduced set above. The covariates selected by these methods are (22, 4, 25, 2, 14, 17, 24, 23), (22, 4, 15, 25, 2, 17, 24), and (22, 20, 4, 15, 25, 2, 14, 17, 24, 23), respectively. We then applied the robust methods RCV, RBoot and RFPE to the second reduced set. The covariates selected are (22, 4, 15, 10, 19, 25, 18, 24), (22, 20, 4, 10, 19, 25, 18, 24), and (15, 6, 24), respectively. Interestingly, RFPE selects a very small model compared to the others.

To compare the models obtained by the classical and robust methods, we used the clean data (dropping one clear outlier) to estimate the prediction errors of these models 1000 times using 5-fold CV. The mean prediction errors for the models are: CV 199.3, Boot 198.2, FPE 207.6, RCV 195.8, RBoot 197.5 and RFPE 246.9. The robust method RCV performs slightly better than RBoot, and both of them perform much better than the classical methods and RFPE.
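A minimal R sketch of the segmentation step itself (exhaustive evaluation of all subsets of a reduced set) is given below. It scores each subset with the robust_cv() sketch defined after Section 4.6.1 and returns the subset with the smallest estimated prediction error; the function name is ours, and the $2^m - 1$ candidate subsets make this feasible only for a modest reduced-set size m.

# Exhaustive segmentation of a reduced set (a sketch); reuses robust_cv().
segment_reduced_set <- function(X, y, reduced, K = 5, trim = 0.1) {
  best <- NULL
  best_err <- Inf
  for (size in 1:length(reduced)) {
    for (cols in combn(reduced, size, simplify = FALSE)) {       # every subset of the reduced set
      err <- robust_cv(X[, cols, drop = FALSE], y, K = K, trim = trim)
      if (err < best_err) { best <- cols; best_err <- err }
    }
  }
  list(model = best, cv_error = best_err)
}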
4.7.2 Protein data

This KDD-Cup 2004 dataset was used in Section 3.8. We considered $n = 4141$ protein sequences from 5 blocks. The number of covariates is $d = 77$, with 4 block indicators (variables 1-4) and 73 features. The data were split to get a training sample of size $n = 2072$ and a test sample of size $n = 2069$.

We considered a reduced set of size 5 using the learning curve based on standard LARS on the training data, which contains the covariates (14, 13, 5, 8, 7). Robust bootstrapped LARS gives the reduced set (14, 13, 5, 76, 73). We applied the 3 classical methods of segmentation to the first reduced set. They all select the same model, which includes the covariates (14, 13, 5, 8). The robust methods used on the second reduced set select the covariates (14, 13, 5, 76). We fitted the 2 models using the training data, and then used them to predict the test data outcomes. The 1%, 5% and 10% trimmed means of prediction errors for the robust (classical) model are 114.92 (117.49), 92.77 (95.66) and 74.82 (78.19), respectively. It is encouraging to note that the robust methods outperform the classical methods for the majority of the data.

4.8 Conclusion

The main contribution of this chapter is that we developed computationally suitable robust methods of segmentation (evaluation of all possible subsets of the reduced set obtained in Chapter 3) to select the final model.

The classical selection criteria FPE, AIC, $C_p$, CV and bootstrap are sensitive to outliers. We also identified certain limitations of the robust AIC (Ronchetti 1985) and robust CV (Ronchetti, Field and Blanchard 1997) methods. We proposed computationally suitable robust versions of the CV and bootstrap procedures. We evaluated our methods using both simulated and real datasets, and compared them with the classical methods as well as with the robust FPE proposed by Yohai (1997). According to the simulation study, the performance of the three robust methods is similar, and better than that of the classical methods. In the real datasets, robust CV (RCV) and robust bootstrap (RBoot) have better performance compared to RFPE.

Chapter 5

Properties of Adjusted-Winsorized Correlation Estimate

5.1 Introduction

In Chapter 3 we proposed a new correlation estimate for bivariate data, which we called the adjusted-Winsorized estimate. Unlike two separate univariate Winsorizations for $X$ and $Y$ (Huber 1981 and Alqallaf 2003), we proposed a joint Winsorization with a larger tuning constant $c_1$ for the points falling in the two major quadrants, and a smaller constant $c_2$ for the points in the two minor quadrants. In this chapter we will establish the consistency and derive the influence function of the proposed correlation estimate. We will then discuss the asymptotic normality of this estimate.

Definition 5.1. (Adjusted-Winsorization) The adjusted-Winsorization of $(u, v) \in \mathbb{R}^2$, denoted by $\Psi_c(u, v)$ with $c = (c_1, c_2)$, is defined as
$$\Psi_c(u, v) = \bigl(\psi_c(u), \psi_c(v)\bigr) = \begin{cases} \bigl(\psi_{c_1}(u), \psi_{c_1}(v)\bigr), & uv \ge 0, \\ \bigl(\psi_{c_2}(u), \psi_{c_2}(v)\bigr), & uv < 0, \end{cases} \qquad (5.1)$$
where $\psi$ is a non-decreasing symmetric function, and $c_1$ and $c_2$ are chosen constants.

Definition 5.2. (Adjusted-Winsorized estimate of correlation) Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, be a random sample from a bivariate distribution with location parameters $\mu_X$ and $\mu_Y$, and scale parameters $\sigma_X$ and $\sigma_Y$, respectively. Let $\theta = (\mu_X, \mu_Y, \sigma_X, \sigma_Y)$, and let $\hat\theta = (\hat\mu_X, \hat\mu_Y, \hat\sigma_X, \hat\sigma_Y)$ be an estimate of $\theta$. Denote $\hat U_i = (X_i - \hat\mu_X)/\hat\sigma_X$ and $\hat V_i = (Y_i - \hat\mu_Y)/\hat\sigma_Y$. Let $(\tilde U_i, \tilde V_i) = \bigl(\psi_c(\hat U_i), \psi_c(\hat V_i)\bigr)$ be as defined in (5.1).
Then, the adjusted- Winsorized estimate rw of the correlation between X and Y is given by 11Mm c(Vi) - (i t MUi)) (i t MVi) ~ 1 V 1=1 / V i=1 / 2) itr cm-(itMUi))\ kntr cm-('-tmvS 2 n •<•—' <•/ in < t<~\ »/ i \ n ' T c \ <•/ i n i=1 \ i=1 / V »=1 \ i=1 For the validity of the results obtained in the subsequent sections, we need some assumptions on the functions ipCl and ipc2 used for the adjusted-Winsorization of the data. Let ip : 3R. —y 1R satisfy the following set of regularity conditions: Al. Ip(-u) = —ifi(u), u G R, A2. ip is non-decreasing, A3, ip is continuously differentiate, A4. ip, i)' a n d ip'(u)u are bounded. For the adjusted-Winsorization of the data, we will use the S-scales &x and ay-defined in Chapter 4. Let us assume that the score function x : R —> E used in the S-scales satisfy the following set of regularity conditions: Bl. x{-u) = x{u), ueR, x(0) = 0, and X(oo) = 1, B2. x is non-decreasing on [0, oo), B3. .x is continuously differentiate, B4. Xj X' a n d x'( u)u a r e bounded. 5.2 Consistency of adjusted-Winsorized estimate The following theorem shows that under certain regularity conditions the adjusted-Winsorized correlation estimates are consistent, provided that the location and scale estimates are consistent. Theorem 5.1. (Consistency of adjusted-Winsorized estimate) Let (Xj, Yi), i = 1 ,2 , • • • , n, be a random sample from a bivariate distribution with location parameters Hx and fx Y, and scale parameters ox and ay, respectively. Let Ui = (Xj — Hx)/&x, and Vi = (Yi — /j,y)/cry be the standardized variables. Let 0 = (nx, Hy, cry), and Q = (fax, Ay, &X, <5y) be an estimate of 0. Then, if 0n —>• 0, 71—> O O then p Tyy ^ , n—>00 where f w is the adjusted- Winsorized estimate of correlation between X and Y, and To prove this theorem, we need an extension of "Serfling's Lemma" (Serfling 1980, page 253). Lemma 5.1. (Extension of Serfling's Lemma) Let Zi — (Xi, Yi), i = 1, 2, • • • , n, be a sequence of independent random variables having an identical bivariate distribution with parameter vector 0. Let g(z, t) : R2 x R4 —R be continuous in t uniformly on z G Ac(0, A) for all A > 0, and P(z G A(0, A)) 0 as A 0, with P(z e A(0, 0)) = 0. Assume that |g(z, t)\ < K for all z € R 2 . Let 0n be a sequence of random vectors such E [ip c(U)ip c(V)] - EIMU)} EIMV)} (5.3) that 0n — 0 . Then (5.4) Proof. We have to show that, for any given e > 0 (5.5) Now, i n 1 n n 0n)I(zi e A(0,A))-E[g(Z, 0)1 (Z G A(0, A))] + - V g( Z i, 0n)I( Z i G Ac(0, A)) - E[g(Z, 0)I(Z G Ac(0, A))] n < * n < 1 n -'Yg(z i, 9n)I(zi G A(9, A)) n i=i + E[g(Z, 9)I(ZeA(9, A))] + - £ g(zu 9n)I(zi G Ac(9, A)) - E[g(Z, 0)1 (Z G Ac(9, A))] n (5.6) < k~y /I(z ieA(9,A)) n 1' + i = l n + i V g(z u 9n)I(zi G ,4C(0, A ) ) - - V g(zi, 9)I( Zi G .4C(0, A)) n n ' 1 = 1 2 = 1 n £ g(zi, 0)I(zi G Ac(0, A)) - 0)1 (Z G Ac(0, A))] + n i=i , (5.7) Q (say). Note that the last expression in (5.6) is bounded by the sum of the last two expressions in (5.7). We will now deal with each of the four parts in (5.7). As n oo, e p ( z e A{0, A)), and P(z G A(0, 0)) = 0. n i=i Therefore, for any given e > 0, there exists A > 0 such that lim P lk-Yl(zi € 4(0, A)) < e/4 ) = 1, i->oo I n -f—' I i = 1 and k E[I(Z G A(0, A))] = k P{Z G A(0, A)) < e / 4 . (5.8) (5.9) We now focus on the third part of (5.7). Since 0 n > 0, we have, for any 6 > 0, n—>oo n—>oo lim P (||0„ - 0|| < 5) = 1. (5.10) Now, for any e > 0, we can choose <5 = 6(A) (where A has been chosen before) such that 19n - 0|| < 5 \g(z, 9n) - g(z, 9) | < e / 4 , z G Ac(9, A). 
(5.11) That is, 5 is chosen in such a way that it will ensure the uniform continuity of g(z, 0) in 0 on z e Ac(0, A). Using (5.10) and (5.11), we have, for any e > 0, lim P n—>oo [g{z, 0n)-g(z, 0))l(zeA c(0, A)) < e / 4 = 1 , which gives 1 n < e / 4 • ) = ' ! . (5.12) lim p f - f j ^ , 0n)I( Z i e Ac(0, A)) n—>oo \ n z — ' x i = l n Y,9(zi, 0)I{zieA c{0, A)) j=i For the fourth part of (5.7), we can use the Weak Law of Large Numbers. For any e > 0, n Y 9 ( ^ 0 ) I ( z i e A c ( 0 , A)) i=1 Using inequalities (5.8), (5.9), (5.12) and (5.13) in (5.7), we have, lim P n—>oo < e / 4 = 1 . (5.13) lim P (Q < e) — 1, n—>oo which completes the proof. To prove Theorem 5.1, we also need the following lemma, which is similar to Lemma 7.7 (Salibian-Barrera 2000, page 217), where the author deals with p-functions (^-functions according to our notation). Lemma 5.2. (Uniform continuity of -0-functions) Let ip : M —>• E be a continuous function such that il>(u) = —c for u < —c, and ip(u) = c for u > c, where c is a finite constant. Let m € A4 and s £ S, where M. and S are bounded real intervals, and inf 0. Then r/ \ , ( U ~ m \ _ , . „ f{u,m,s) — ip[ , m e M, s € o, is conutinuous in m and s uniformly in u. The proof of this lemma is presented in Chapter Appendix (Section 5.8.1). ( Proof of Theorem 5.1 With the use of Lemma 5.1 and Lemma 5.2 the proof is straightforward. We have Z = (X,Y), 0 = (fx x, Hy, ®x, cry), and t = (fi x, fiy, 0, then (x — fix)(y ~ AY) > 0, and, using Lemma 5.2, n t i v \ E ^ ( ^ r ^ ) - - m > °) i=1 ^ i=i v X-jjLx Ox I{(X - fix)(Y - fiy) < 0). = "0C = v. x-fix Ox + H( X ~ Mx)(Y ~ MY) < 0). ipc = 4>Cl ( ^ f ^ ) i s continuous in fi x and ax uniformly on z G A°(0, A). There-fore, using Lemma 5.1, we have n ' \ ox ' n->no i=1 v OX That is, E[MU)}- (5.14) n < * ri-VfYi n i= 1 Similarly, - f ^ M V i ) E [ M V ) ] . (5.15) n ' * n - i o o n 1=1 Let us now deal with the first term in the numerator of Equation 5.2. We have n ^ n ^ \ ax J \ oy J i=i i=i \ / \ / m f ) m ^ ) - MV - m > o) Considering g(Z, 0) = & rpc = wJU)'P c c ( & ) oo i = l Similarly, £ [ ^ 0 0 ] - (5.19) rl f * n.—inn ^ J n i=1 Using (5.18), (5.19), (5.14) and (5.15), we can say that the denominator of (5.2) converges in probability to the denominator of (5.3). • 5.3 Influence function of adjusted-Winsorized esti-mate The following theorem gives the influence function of the adjusted-Winsorized correlation estimate, when the influence functions of the scale estimates are well-defined. Theorem 5.2. (Influence function of the adjusted-Winsorized estimate) Let (X, Y) follow a continuous distribution H. Consider the adjusted-Winsorized correlation functional rw(H) given by r w (H) = (5.20) with N(H) = E H X-m x(H) sx(H) — EH X-m x(H) Sx(H) Y -mY(H) SY(H) (5.21) and 21 V2 2 l l / 2 (5.22) where mx(H) and my (H) are location functionals, and sx(H) andSy(H) are dispersion Junctionals. Suppose that 1. The central model Hq is symmetric about (m x(Ho),my(H 0)). We can assume, without loss of generality, that mx(Ho) = 0, my(H 0) = 0, sx{H 0) = 1, and Sy(H 0) = 1. 2. The influence functions IF(s x,u, H0) and IF(sy,v, Hq) are well-defined for all z = (u, d ) 6 R 2 . 
The influence function of rW(H) at Hq and z, denoted by IF(r w, z, Hq), is given by IF{r w, z, H0) D0N0 - N0DP Dl where No = EHo{MX)MY)}, Do = yfE H o{iP*(X)}E H o{iPl(Y)} No = -EH o {MX)MY)} + - IF(sy,v,H 0) EHO{MX)4'C(Y)Y} - IF(S X,U,H 0)EH o{MY)^' c(X)X}, and Dn EhM( Y)} 2EHo{W(X)} + IF(s x,u,H 0) EHO{2iP C(X)€(X)X} j Eh 2EHO{R C{Y)} + IF(S y, v, H0) EHO{2MY) I!>'C(Y) Y} Proof. Let H be given by H tiX = (l-t)H 0 + tS g. (5.23) For a fixed z = (u, v), using (5.21) we can express N(H t>z) = N(t, z) = N(t) as N(t) = 'u-mx(t ) \ / u M * ) my(t)X s y ( t ) ; sxW yj v — mY(t) sY(t) (5.24) sx(t) Sy{t) +t^c[ u'mx}t))^J v-mY{t ) sx(t) sY(t) - (1 - 2*) < L ^ F 1 ) } EHo ( Y ~ m Y [ t ) sY{t) V — mY(t) sy(t) (5.25) sy(t) J J r c V where o(t) includes the terms involving t2. Now, since mx(0) = 0, my(0) = 0, Sx(0) = 1, and sy(0) = 1, we have d_ dt N(t) = - e H o {MX)MY)} + Mv)Mv) t=o d ^ f, fX-m x(t)\ . (Y — my(t) The last term in (5.26) can be written as d„ J , ( X-m x(t )\ , y r - m v ( t ) t=0 (5.26) t = o EH 0 < V, y - mr(t) C1 1 S j c ( i ) J m V M * ) / f ( X - m x ( t ) ) ( F - m y ( t ) ) > 0 + d_ dt Sx(t) ) ^ 2 V 3y{t) J l((X-m x{t))(Y-my{t))< 0 . (5.27) t=0 Interchanging the operations of differentiation and integration (see Chapter Appendix, Section 5.8.2, for the justification), we can express (5.27) as E Ho <=oJ + E Ho d t=o. (5.28) The first term of (5.28) can be expressed as ENA dt sx(t) sY(t) t = 0 J Sx(t) ; r c i v sY(t) sY(t)f t[m Y(t)]-j{s Y(t)}(Y-m Y(t)) 4 ( t ) l(XY > 0) sY(t) J ^ V sx(t) sx(t)£[m x(t)}-£ls x(t)]{X-rn x(t)) 4 W I(XY > 0) t=o = -EH Q^pCI(X)^' CL(Y)[LF(m Y,v,H 0) + IF{S y,V,.H Q)Y] L{XY > 0) + ^Cl(Y^' Ci(X)[lF(m x,u,H 0) + IF(s x,u,H 0)X}l(XY>0) (5.29) Similarly, the second term of (5.28) can be expressed as E Ho d r f x-mx{t) dt {*« {-T^tT 1 Y — MY sY(t) J Q-) l(XY < 0) t=o. = -EHOL^CL(X)^ CL(Y)[LF(m Y,v,H 0) +.IF(S Y,V,H 0)Y] L{XY < 0) • +i;CL(Y)i; ,CL(X)[LF(m x,u,H0) + IF(s x,u,HO)X]L(XY <0) |>. (5.30) Using (5.29) and (5.30) in (5.28), we have = -IF(m Y,v,H 0)EH o{MX)^c(Y)} ~ IF(m x,u,H 0) EH o{MY) iP' C(X)} . - IF(S y,V,H Q) EHO{MX)4>C(Y)Y} - IF(s x,u,H0) EHO{MY) iP' C(X) X} = - IF(S y, v, H0) EH0{MX) €(Y) Y} - IF(s x, u, H0) EHO{MY) ^C(X) X}. (5.31) Using (5.31) in (5.26), we have d_ dt N(t) = - E H 0 {IP C(X)IP C(Y)} + MV)MV) t=o - IF(S y, v, H0) EH0 {M*) VJX) Y) - IF(S X, u,H 0) EHO{MY) ip' c(X) X}. (5.32) Using (5.23) in (5.22), we have D(t) = D!(t) Dl(t), (5.33) where Di(t) (L -1) EHO \ T x - mx(t) sx(t) + tip c u - mx(t) sx(t) , (5.34) and (1 - t) EHO t Y - mY(t) + ti> c v — mY{t )\ sr{t) J j V sy(J) J Differentiating both sides of (5.33) w.r.t. t, and setting t = 0, we have (5.35) d_ dt D(t) t=o Di(t) dt D2{t) dt (5.36) t=o From (5.34), we have t=o + 1>l(u). (5.37) t= o 0) sX(t) J Y C 2 \ sx(t) sx(t)f t[rn x(t)}-i[s x(t)](X-m x(t)) s2x(t) L(XY < 0) t=o = -EHO | 2 VC 1 M'CL {X) [LF(m x, u, HO) + IF(s x, u, H0) X] l(XY > 0) + 2 VC2 (X)iP' C2 (X) [.IF(m x, u, H0) + IF(s x, u, H0) X] l(XY < 0) = -IF(m x,u,Ho)E H o{2MX)^' C(X)} - IF(S X,U,H 0)EH o{2MXU' c(X)X} = -IF(S X,U,H 0)EH o{2MX)^ c(X)X}. (5.39) Using (5.39) in (5.37), we have = -EHO{RT(X)} - IF(s x,u,H0)EHO{2MXMX)X} +^c(u). 
(5.40) t=0 Similarly, I*® = -E„a{i,l(Y)} - IF{sy,v,H 0)EBa{2MY)€(Y)Y} + i>l{v)- (5.41) t=0 Using (5.40) and (5.41) in (5.36), we have dt D(t) EHOWX)} t=o EH O{1>1(X)} - j&u) 2EHO{R C(X)} + IF(s x, u, H 0) EHO{2^ c{X) iP'C{X) X) EhM( X)} E h M P ) } -2 EHO{IPI(Y)} + IF(s y,V,H 0)EH o{2My)€(y)y] )• (5.42) We also have and W(*)L=O = EHO{MX)MY)} = No (say), D(t) | t = 0 = JE H o{r c{X)}E H o{r c{Y)} = Do (say). (5.43) (5.44) Finally, differentiating both sides of rw(t) w.r.t. t, and setting t = 0, we have IF{r w,z,H 0) m D(t) DQNQ - NQDQ Dl (5.45) (5.46) where N0 = FTN(t)\ t=Q and D0 = f tD(t)| t=Q are obtained from (5.32) and (5.42), respec-tively, and N0 and D0 are obtained from (5.43) and (5.44), respectively. • Figure 5.1 shows a 3D-plot of IF(r w, z, H 0) against z = (u,v), with u,v E [-10,10], H 0 = N(0, £), and / 1 P P 1 We used p = 0.5 for the bivariate normal distribution, and (ci, c2) — (3, 2) for rw. \ (5.47) Figure 5.1: Influence curve of adjusted-Winsorized estimate with (ci,c2) = (3,2). The curve is symmetric about (0,0). Based on Equation 5.46 and Figure 5.1, we can make the following comments on the influence function of the adjusted-Winsorized estimate: • Because of Regularity Condition A4, the expectations in (5.46) are bounded. Also, Do > 0. Therefore, IF(r w, z, H 0) is bounded when IF(sx, u, H 0) and IF(sy, v, H 0) are bounded. Figure 5.1 also exhibits the boundedness of the influence function. • Since the •j/'-functions are symmetric, the influence function IF(r w, z, Ho) is sym-metric about (u,v) — (mx(Ho), my{H 0)) — (0,0) if IF(sx,u, H 0) is symmetric about u = 0, and IF(s Y, v, H 0) is symmetric about v = 0. • By setting c\ = c2 = c in IF(r w, z, HQ), the influence function of the univariate-Winsorized correlation functional can be obtained. 5.3.1 Standard error of adjusted-Winsorized estimate Let Zi = (Xi,Yi), i = 1, 2, • • • , n, be i.i.d. according to a continuous distribution H, and rw(H) be the adjusted-Winsorized correlation functional. The influence function of rw(H) at the central model H 0 and z e l 2 , denoted by IF(r w, z, H 0), is given by (5.46). Using this, the asymptotic variance of the adjusted-Winsorized correlation estimator f w for the central model H 0 can be obtained as (see, for example, Hampel, Ronchetti, Rousseeuw and Stahel 1986, page 85) For a sufficiently large n, the standard error of f w, denoted by SE(r w), is given by Since it is difficult to get a closed form expression for (5.48), we can use numerical integration to obtain approximations to the asymptotic variance and the standard error of r w . (5.48) SE(f w) = VAV{r w,H 0)/n. (5.49) To evaluate the accuracy of the (approximate) standard error of the adjusted-Winsorized correlation estimates, we carried out the following simulation study. We generated 2000 random samples of size n from a bivariate normal distribution with mean vector 0 and covariance matrix £ = E(p) given by (5.47). We considered 3 different sample sizes: n = 25, 100 and 400, and 3 different correlation coefficient values: p = 0.1, 0.5 and 0.9. The values of (ci, c2) chosen for these correlation coefficients are (3,3), (3, 2) and (3,1), respectively. The choice of C\ and c2 is discussed later on in Section 5.4. Table 5.1: Evaluation of the standard errors of r w . The empirical SD and formula-based SE are close. p n = 25 SD SE n = 100 SD SE n = 400 SD SE 0.10 0.50 0.90 0.203 0.199 0.136 0.146 0.035 0.039 0.100 0.099 0.069 0.073 0.018 0.019 0.049 0.050 0.034 0.037 0.009 0.010 Table 5.1 presents the obtained results. 
For each n, the first column gives the empirical standard deviations of the adjusted-Winsorized correlation estimates (based on 2000 samples), while the second column shows the standard errors calculated using (5.49). In general, the differences between the numbers in the two columns are reasonably small, particularly for large sample sizes.
An estimate of the asymptotic variance in (5.48) is given by

\widehat{AV}(r_w, H_n) = \frac{1}{n} \sum_{i=1}^{n} IF^2(r_w, z_i, H_n),   (5.50)

and the estimated standard error of r̂_w, denoted by ŜE(r̂_w), is given by

\widehat{SE}(\hat{r}_w) = \sqrt{\widehat{AV}(r_w, H_n)/n}.   (5.51)

5.4 Choice of c_1 and c_2 for r̂_w
It is important to note that the robustness, efficiency and intrinsic bias (to be discussed in the next section) of the adjusted-Winsorized correlation estimate depend on the values chosen for c_1 and c_2. Figure 3.3 in Chapter 3 shows how the bivariate outliers are handled by r̂_w. If we choose large values for c_1 and c_2, r̂_w will be less resistant to the outliers. On the other hand, this will lead to a decrease in the standard error (increase in efficiency) and a decrease in the intrinsic bias, both of which are desirable as well. Thus, the choice of c_1 and c_2 may depend on our goal in a particular situation.
For application purposes, we can first select an appropriate value for c_1 (the larger tuning constant for the two "major" quadrants, i.e., the quadrants that contain the majority of the standardized data). Then, for the two "minor" quadrants, we can use c_2 = h c_1, where h is the ratio of the number of observations in the minor quadrants to the number of observations in the major quadrants. Note that, as |ρ| increases from 0 to 1, the asymptotic value of h decreases from 1 to 0.

5.5 Intrinsic bias in adjusted-Winsorized estimate
Let Z = (X, Y) have a continuous distribution H, and r_w(H) be the adjusted-Winsorized correlation functional. Let the central model H_0 be given by H_0 = N(0, Σ), where Σ = Σ(ρ) is given by (5.47), and ρ = ρ(H_0) is the true correlation coefficient of X and Y.
The intrinsic bias of r_w occurs at the central model H_0 because of the data transformation. Since the adjusted-Winsorized data have a slightly different correlation coefficient than that of the original data, r_w(H_0) ≠ ρ(H_0). Therefore, the intrinsic bias of r_w, denoted by IB(r_w), is given by

IB(r_w) = r_w(H_0) - \rho(H_0).   (5.52)

To compare r_w(H_0) and ρ(H_0) empirically, we generated random samples of size n = 100000 from a bivariate normal distribution with mean 0 and covariance matrix Σ = Σ(ρ). We considered several values of ρ from −1 to 1. To calculate r_w = r_w(H_0) we used the Huber score function with different values of c_1, with c_2 = h c_1, where h is defined in the last section.
Figure 5.2 displays the plots of r_w against ρ for c_1 = 0.01, 1, 2 and 3. Based on these plots we can make the following comments:
• The intrinsic bias of r_w decreases as c_1 increases.
• r_w is a non-decreasing function of ρ.
• Consider 0 < ρ < 1. The magnitude of the intrinsic bias increases from zero to reach its maximum at ρ = 0.5, and then decreases to zero. Similar behavior is observed for −1 < ρ < 0.
• For −1 < ρ < 0 the intrinsic bias is negative, while for 0 < ρ < 1 the intrinsic bias is positive. This is the exact opposite of the results obtained for the univariate-Winsorized estimate (see Alqallaf 2003, Figure 4.5, page 97), for which the bias is positive when −1 < ρ < 0, and negative when 0 < ρ < 1.
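The Monte Carlo experiment behind these observations is easy to reproduce. Below is a minimal sketch, not the code used in the thesis: it assumes a Huber ψ, uses the median and MAD as simple stand-ins for the location and scale estimates used in the text, takes the major quadrants to be those with uv ≥ 0 (appropriate for ρ ≥ 0), and sets c_2 = h c_1 as described in Section 5.4; the function names are illustrative only.

```python
import numpy as np

def huber_psi(u, c):
    # Huber score function: identity on [-c, c], clipped outside.
    return np.clip(u, -c, c)

def adjusted_winsorized_corr(x, y, c1=1.0, c2=None):
    # Standardize with median/MAD (a simple stand-in for the location and
    # S-scale estimates used in the thesis).
    u = (x - np.median(x)) / (1.4826 * np.median(np.abs(x - np.median(x))))
    v = (y - np.median(y)) / (1.4826 * np.median(np.abs(y - np.median(y))))
    major = u * v >= 0                      # major quadrants when rho >= 0
    if c2 is None:
        h = (~major).sum() / major.sum()    # minor-to-major count ratio
        c2 = h * c1                         # c2 = h * c1, as in Section 5.4
    uw = np.where(major, huber_psi(u, c1), huber_psi(u, c2))
    vw = np.where(major, huber_psi(v, c1), huber_psi(v, c2))
    # The estimate is the sample correlation of the adjusted-Winsorized values.
    return np.corrcoef(uw, vw)[0, 1]

rng = np.random.default_rng(0)
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T
    rw = adjusted_winsorized_corr(x, y, c1=1.0)
    print(f"rho={rho:.1f}  rw={rw:.3f}  intrinsic bias={rw - rho:+.3f}")
```

Printing r_w − ρ over a grid of ρ values in this way should reproduce the qualitative pattern described above (a positive bias for 0 < ρ < 1 that is largest near ρ = 0.5 and shrinks as c_1 grows).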
To compare the behavior of r_w (with c_1 ≈ 0, c_2 ≈ 0) with the univariate-Winsorized correlation estimate r = r(H_0) (with c ≈ 0), we plotted r with c = 0.01 against ρ in Figure 5.3 (which is a reproduction of the top left plot of Figure 4.5, Alqallaf 2003). This plot is the exact opposite of the top left plot of Figure 5.2 in terms of the sign of the intrinsic bias. The reason may be as follows. Though c_1 = 0 implies c_2 = 0 for the adjusted-Winsorized estimates, c_2 = h c_1 approaches zero faster than c_1, making the limit of r_w different from the limit of r as c tends to zero (the limit in the latter case being the quadrant correlation estimate).

Figure 5.2: Intrinsic bias in adjusted-Winsorized estimates with c_2 = h c_1 (panels: c_1 = 0.01, 1.00, 2.00 and 3.00). The bias in r_w decreases as c_1 increases.

Figure 5.3: Intrinsic bias in univariate-Winsorized estimate (c = 0.01).

5.5.1 Achieving (approximate) Fisher-consistency for r_w
Let H_0 = N(0, Σ(ρ)) and let r_w(ρ, c_1, c_2) be the asymptotic value of r_w(H_0) when we use tuning constants c_1 and c_2. A plot of r_w(ρ, c_1, c_2) against ρ is exhibited in Figure 5.2 for different values of c_1 and c_2 = h c_1. Recall that h is the ratio of the number of observations in the minor quadrants to the number of observations in the major quadrants.
To fix ideas and without loss of generality, let us assume that ρ > 0. We notice that when c_2 = c_1 (univariate-Winsorized estimate, Alqallaf 2003) r_w(ρ, c_1, c_1) < ρ, while r_w(ρ, c_1, h c_1) > ρ when c_2 = h c_1. Therefore, to achieve Fisher-consistency for a fixed value of ρ, we could use an appropriate value of c_2 between c_2 = h c_1 and c_2 = c_1. That is, we could use c_2 = a c_1, where a ∈ (h, 1) is such that r_w(ρ, c_1, a c_1) = ρ. In practice, we could approximate a by numerical means. For a fixed value of c_1 we could obtain a table relating ρ and a = g_1(ρ). Since h is a decreasing function of ρ, we have that a = g(h). Tables relating a and h could be constructed by numerical means for any desired value of c_1. We will not elaborate on this approach further, since a simple approach that works remarkably well is presented below.
To avoid the construction and use of numerous tables, we can use a simple alternative that does not require any table. Since c_2 = h c_1 and c_2 = c_1 give biases of similar (though not the same) magnitudes with opposite signs, we can use c_2 = c_1(h + 1)/2, that is, a = (h + 1)/2. Figure 5.4 displays the plots of r_w against ρ for c_1 = 1 with c_2 = h c_1 and c_2 = c_1(h + 1)/2. Though the first plot (c_2 = h c_1) is the same as the top right plot of Figure 5.2, it is presented again here to make its scale more comparable to that of the second plot. The degree of Fisher-consistency achieved by using c_2 = c_1(h + 1)/2 is quite satisfactory.
Note that c_2 = c_1(h + 1)/2 < c_1. Therefore, with this choice of c_2 the adjusted-Winsorized estimate is still more resistant to bivariate outliers than the univariate-Winsorized estimate. At the same time, the extra tuning constant c_2 allows us to make our estimate approximately Fisher-consistent.

Figure 5.4: Approximate Fisher-consistency for r_w (panels: c_1 = 1, c_2 = h c_1 and c_1 = 1, c_2 = c_1(h+1)/2). By using c_2 = c_1(h + 1)/2 we get less intrinsic bias than with c_2 = h c_1.
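Reusing the hypothetical adjusted_winsorized_corr helper sketched in Section 5.5 (again an illustrative stand-in, not the thesis code), the effect of the shrinkage factor a = (h + 1)/2 can be checked in the same way: compute r_w once with c_2 = h c_1 and once with c_2 = c_1(h + 1)/2 and compare both to ρ.

```python
# Hedged follow-up to the earlier sketch: compare c2 = h*c1 with the
# approximately Fisher-consistent choice c2 = c1*(h+1)/2 at the bivariate
# normal model. Assumes adjusted_winsorized_corr from the previous sketch.
import numpy as np

rng = np.random.default_rng(1)
c1 = 1.0
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T
    u = (x - np.median(x)) / (1.4826 * np.median(np.abs(x - np.median(x))))
    v = (y - np.median(y)) / (1.4826 * np.median(np.abs(y - np.median(y))))
    h = (u * v < 0).sum() / (u * v >= 0).sum()   # minor-to-major count ratio
    r_h = adjusted_winsorized_corr(x, y, c1=c1, c2=h * c1)
    r_a = adjusted_winsorized_corr(x, y, c1=c1, c2=c1 * (h + 1) / 2)
    print(f"rho={rho:.1f}  bias(h*c1)={r_h - rho:+.3f}  "
          f"bias(c1*(h+1)/2)={r_a - rho:+.3f}")
```

If the approximation works as described in the text, the second column of biases should be noticeably closer to zero across the grid of ρ values.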
We mentioned in Chapter 3 that though we used c_2 = h c_1 in this study, a more reasonable choice would have been c_2 = √h c_1 (i.e., c_2² = h c_1²), because the areas of the two squares should be proportional to the number of observations they contain. Interestingly, (h + 1)/2 (the shrinkage factor that gives approximate Fisher-consistency) is the first-order Taylor expansion of √h about h = 1.

5.6 Asymptotic normality of adjusted-Winsorized estimate
Since the indicator functions involved in the adjusted-Winsorized correlation estimate are not differentiable, proving the asymptotic normality of this estimate is extremely difficult. One approach we can consider is to replace the sample indicator functions by the true ones, which we can do only if the amount of error due to the replacement is o(1/√n). See the Chapter Appendix (Section 5.8.3) for more details of this idea, where we use this approach to establish the asymptotic normality of the numerator of (5.2), i.e., the asymptotic normality of the adjusted-Winsorized "covariance" estimates. Unfortunately, for the denominator of (5.2), the amount of error due to replacing the sample indicator functions by the true ones is O(1/√n) (see Chapter Appendix, Section 5.8.4), and cannot be ignored. Therefore, the above approach cannot be used to establish the asymptotic normality of the adjusted-Winsorized "correlation" estimates.
As a remedy, we can use differentiable versions of the indicator functions, denoted by γ_1(·) and γ_2(·) = 1 − γ_1(·), respectively. For example, γ_1 can be the distribution function of a continuous random variable with support (−ε, ε), for any small ε > 0. Using the functions γ_1 and γ_2, we now define the smoothed adjusted-Winsorization of the data, and the smoothed adjusted-Winsorized correlation estimates, as follows.

Definition 5.3. (Smoothed adjusted-Winsorization) The smoothed adjusted-Winsorization of (u, v) ∈ R², denoted by Ψ^s_c(u, v), is defined as

\Psi^s_c(u, v) = (\psi^s_c(u), \psi^s_c(v)),   (5.53)

where

\psi^s_c(u) = \psi_{c_1}(u)\,\gamma_1(uv) + \psi_{c_2}(u)\,\gamma_2(uv),
\psi^s_c(v) = \psi_{c_1}(v)\,\gamma_1(uv) + \psi_{c_2}(v)\,\gamma_2(uv),

ψ is a non-decreasing symmetric function, and c_1 and c_2 are chosen constants.

Definition 5.4. (Smoothed adjusted-Winsorized estimate of correlation) Let (X_i, Y_i), i = 1, 2, ..., n, be a random sample from a bivariate distribution with location parameters μ_X and μ_Y, and scale parameters σ_X and σ_Y, respectively. Let θ = (μ_X, μ_Y, σ_X, σ_Y), and θ̂ = (μ̂_X, μ̂_Y, σ̂_X, σ̂_Y) be an estimate of θ. Denote Û_i = (X_i − μ̂_X)/σ̂_X and V̂_i = (Y_i − μ̂_Y)/σ̂_Y. Then, the smoothed adjusted-Winsorized correlation estimate, r̂_S, is defined as

\hat{r}_S = \frac{\frac{1}{n}\sum_{i=1}^{n}\psi^s_c(\hat{U}_i)\,\psi^s_c(\hat{V}_i) - \left(\frac{1}{n}\sum_{i=1}^{n}\psi^s_c(\hat{U}_i)\right)\left(\frac{1}{n}\sum_{i=1}^{n}\psi^s_c(\hat{V}_i)\right)}{\sqrt{\left[\frac{1}{n}\sum_{i=1}^{n}\{\psi^s_c(\hat{U}_i)\}^2 - \left(\frac{1}{n}\sum_{i=1}^{n}\psi^s_c(\hat{U}_i)\right)^2\right]\left[\frac{1}{n}\sum_{i=1}^{n}\{\psi^s_c(\hat{V}_i)\}^2 - \left(\frac{1}{n}\sum_{i=1}^{n}\psi^s_c(\hat{V}_i)\right)^2\right]}}   (5.54)

The following theorem states the asymptotic normality of the smoothed adjusted-Winsorized correlation estimates, provided that the ψ- and χ-functions satisfy the above conditions, and the location estimates are consistent.

Theorem 5.3. (Asymptotic normality of the smoothed adjusted-Winsorized estimate) Let (X_i, Y_i), i = 1, 2, ..., n, be a random sample from an elliptically symmetric bivariate distribution with location parameters μ_X and μ_Y, and scale parameters σ_X and σ_Y, respectively.
Let fax and fay be consistent estimates of p,x and [ly, and ax and by are S-estimates of ax and ay with score functions satisfying conditions B1 - B4-Then, ~ \fc(r S-rs) ~^N(Q,AV), 71—>00 where f w is defined in (5.54) ip-functions satisfying conditions Al - A4, E rs = IE IE S(Y-hy c \ ay and the variance AV of the limiting distribution is given by AV = v ; e v 9 , where £ = {an}, with an — Cov (Qi, OA; i — 1,2, 3; j = 1, 2, 3; ( 3 x 3 ) Q l = M ^ A ^ I ^ A - ^ - X ^ - ^ Q2 q 3 € ox X-fix Ox Y-fiy ay ay D X Ox K w (Y - fly X D Y ay #17 f X - fi X\ #18 (Y - fly D; -x ox D, -x aY K W (Y - fly X D Y 1 ay u #20 / X - fix DxX\ Ox U W g \y/VW' 2VVVW' 2wvvwj ' with U = E V = E W = E s i X - fix Wc Ox Ipc Y-fly ay Ox S ~ MX ay ,2 I Y - My c and the constants used in the above expressions are specified below: , (X- fi X\ f X — fix Dx — E Dy = E X Ox / \ Ox , fY - fiy\ (Y - fly X ay ay (5.55) (5.56) ox E < 2 + — ox x — fix \ (X — fi X\ , ( Y fly \ 2 fx ~ fi XY - fly V>ci I J 7l E ox / V ox X-fix Oy Ox Oy C i OX . Y - fly \ FX - fix Y - fiy V>ci I — : — ) 71 Oy Ox Oy 7i , f X - fix Y - fiy\ f X - fl XY - fiy Ox 128 aY Ox Oy (5.57) K 2 is obtained by interchanging X and Y in (5.57), K 3 is obtained by replacing c\ by c2, and 71 by 72 in (5.57), K± is obtained by interchanging X and Y, and replacing C\ by c2, and 71 by 72 in (5.57), K 5 = — E ox € C1 x - Hx\ (X - Hx\ , (Y - hy Wc2 Ox / \ ox ' x - \ i x y - hy 7i Ox OY 72 oY X - (1 X Y - jJLy Ox oy H E ox Ox 1 ^ W , ( • 7 ; ( x - Y - " " Ox X - ij, xY - hy c\ ox oY X - nx ox oy J \ ox X — Hx Y — fly oy C2 72 . Ox Y~ fly Oy 7i oy X — PxY — fly Ox oy X - nxY - fjL Y\ f X - nxY - ny 72 I — ' — I I — —) , (5.58) Ox ^ OY J \ Ox Oy J J is obtained by interchanging X and Y, and c\ and c2 in (5.58), K-j is obtained by interchanging C\ and c2 in (5.58), K& is obtained by interchanging X and Y in (5.58), K 9 •= K\ + + + K 7, K w = K 2 + K a + K 6 + K 8, (5.59) (5.60) K 11 -E ox c 1 X-llx Ox Ox € c 1 X - Hx \ f X - fjLx 7i X - fl X Y - [ly Ox J \ Ox J \ Ox X - Hx\ f X - Hx Y -/ly^ 7i oy 7i Ox J \ Ox X - fjL XY - fJ,y Ox OY oY X - tlx Y - JJLy Ox oY , (5.61) K 12 -E oy X Vx Ox 7i X - n x Y - j iY ox oY X - /j, xY - /ly ox oY X - nxY - ny Ox Oy , (5.62) #13 and #14 are obtained by replacing c\ by and 71 by 72 in (5.61) and (5.62) (re-spectively), K 15 ox € X - nx\ fX - fjL X c 1 7i Ox 7i H E ox 2 „ + — E ox Ox / V ox X - fi xY - fly Ox Oy *» V ^ X - fi xY - fiy Ox Oy X-IM X\ C2 72 X-fix Ox X - fix Y - fly ox MX € oy X C2 Mx 72 / \ ox X - HxY - fly Ox C1 c 2 Ox J X - fix Y - fly X — fix Vi Oy X - fix Y - fly TP, ci a ^ C7y X-flx Ox J \ Ox X - fl XY - fly Oy OX C2 72 72 . Ox X-fix Ox - MX Y 7i Mr a y Oy X - fix Y - fly Ox Oy X - fix Y - fly Ox oy , (5.63) # 1 6 = -E oY C1 X-fix C2 X-fix Ox J \ Ox X - fix Y - fly 7i X - fix Y - fly Oy Ox 'X -Oy MX C1 ox 72 Ox Oy X - fj XY - fly ox Mx^ C2 72 , 7i ff* / X - fl X Y - fly OX Oy Oy X - fl X Y - fly Ox Oy X - fl X Y - fiy Ox Oy , (5.64) #17 = #11 +#13 + #15, #18 = #12 + #14 + #16, (5.65) (5.66) and, finally, #i9 and #2o are obtained by interchanging X and Y in (5.65) and (5.66), respectively. Sketch of the Proof. 
The numerator of (5.54) can be written as Tl / -rr ~ ' -I n / ST Yi - fiy n s = 1 - j : M ^ k -1- E ( 1 E * V J V OY ) n f ^ V ° x J n f r l c • ay (5.67) We can express the first term of (5.67) as ip^yi^) Xi- fi x\ f Xi - fi x Yi - fi Y\ , , (Xi- fx x\ (Xi - jj,xYi- fi Y 7 i — = ~ + — r 72 Ox J \ Ox OY J \ OX J \ OX OY x { ^ 7i ( + ^ ( Yl^Y) l 2 ( X i ~ ft* * -OY / \ Ox OY J \ OY J \ ox oy 1 v^ / ( Xi ~ MX \ , ( Yi - fly \ 2 ( Xi - MX Yi - fly n •<—' \ Ox J \ oY ) \ Ox oY n \ ox ) \ OY ) \ OX OY + Y - Ax V ( Yi - fl Y \ [ Xj-flxYj- fl y^j ^ fXj - fix Yj - fly ' \ Ox J \ Oy J \ Ox Oy ) \ Ox Oy - fly \ ( Xi - fl X Yi - fly\ ( Xi~ fl XYi~ fly* 7 i : : 72 f Oy J \ Ox Oy J \ Ox Oy J (5.68) Let us consider the first term of (5.68). Using Taylor expansion about {/J>x,ox, My,oy), we can write - f > , ( f e ^ ^ Y ' : n \ Ox ) \ Oy ) \ Ox Oy J n \ °x ) \ oY ) \ ox oY J - - 4 - X X ( ^ v ^ f e ( ^ k f ^ L ^ ) { 6 x _ n a x ^ \ ox J ox . \ oY J \ ox oY J fe, ( 7 1 ( • x ' : : A y ) \ ox J \ oY J \ ox oY J t (Xi — /ix Yi — p,Y \ Xi — fix Yi — fiy , - s X 7i = = = : V °x oY ) ox oY - £ > , ( ^ ^ V« C ^ ) ^ ^ f (Sy -nCrY j^f \ 0X J 1 \ Oy J Cry \ 0X Oy J - 4 - X > , ( V ( x < : ^ Y ' : nay ^ \ Ox J \ Oy J \ Ox Oy J , f Xi — fi x Yi — fi Y\ Xi — fix Yi — fiy .n \ X 7l I = = ~ = (oy-oy), \ Ox Oy J Ox oy since the coefficients of (fix — px) and (fiy — fiy) converge to zero in probability. Thus, - X>. ( ^ ^ W ^ W * : Y ' : i i Y ) n \ Ox ) \ Oy ) \ Ox Oy J \ Ox J \ Oy J \ Ox o y ) -K 1(O x-ox)-K 2(OY-O y), (5.69) where K x and K 2 are as defined in the statement of the theorem. Similarly, the next three terms of (5.68) can be expressed as 1 X > ( j ^ W m ^ f r W * : ** y ' : ^) \ Ox ), \ Oy ) \ Ox Oy J = - t ( ^ ^ W ( • ( • * " >* Yi - " r ) •n~i \ Ox J \ Oy J \ Ox Oy J - K 3(o X - Ox) - K 4(oy - Oy), (5.70) n \ OX J \ OY . J \ 0X OY ) \ Ox OY J - l y ^ (^IM^ / Fj - / Xj - fix Yj ~ My\ (Xi - fjixYi- fi Y n j C l I n^r ) °2 I / T i / / ^ X / T i r / T x r / ^ f \ ox J \ oY / \ OX Oy J \ OX OY 2=1 s ' - K5(o x - Ox) - K6(cr Y - oY), (5.71) and l y ^ J j ( x i ~ M x\ .,. / ~ M y \ ( Xj~ fi x Yj - fi Y \ / - Ax Yi ~ M y \ n ^ V / C 1 v / 7 1 V oY ) 7 2 V ^ <5y / ^ l y ^ / - M.v A ^ - My ^ / X j - /ix - My \ ~ Mx Yj - fi Y A n C2 \ crj J 01 \ oY / A cry / \ ox oY J -K7{O X-OX)-K8(OY-OY), (5.72) respectively, where K 3, K 5, K 7 and K 8 are as defined in the statement of the theorem. Using (5.69), (5.70), (5.71) and (5.72) in (5.68), we have i=i x Xi- fj,x\ , ( Yj - fl y \ 2(Xi- fix Yi- fly Wcx I _ 17i Ox J \ oy ) \ Ox Oy + - E f e ( ^ ^ W ^ ^ ) t i f * ~ " " y n ~ t \ °X J \ Oy J \ Ox oy +1 Y ^ (Xi ~ » x J ^ A n ( Xi V ( Xi" »xYi ~ \ Ox J \ Oy J \ Ox Oy J \ Ox Oy 1 1 Yip f X i~ M y\ Xi ~ MX Yj ~ fl y \ ^ fXj — fix Yj — fly n \ ox J \ oY / \ ox oy J \ ox oY - K 9(o X - Ox) - K l0(oy - Oy), (5.73) where K 9 = K x + K 3 + K s + K 7 and K w = K 2 + K 4 + K 6 + K s. Now, it is straightforward to show that the second term of (5.67) is o{l/y/n). There-fore, using (5.73) in (5.67), we have = ( I ^ Z ) tf - K 9(O X - ax) - K w(a Y - aY). (5.74) \ OX J \ ° Y J Note that the first 4 terms of the right-hand-side of (5.73) are combined to get a single term in (5.74). Since a x and by are S-scales, we have (see Alqallaf 2003, page 156) ° x ~ o x = w* 1 — — : (5.75) t=i x ' x ' and ay - a y ^ ; 1 — - . 
(5.76) w ) V ) Using (5.75) and (5.76) in (5.74), we have i=l N Xi — Hx \ ,s(Yi~ ^Y 1>\ Ox J C v AY n / / -it- ' \ \ ts i n - £ H ^ T " ) - b ) - w » £ where Dx and Dy are defined in the statement of the theorem (equations 5.55 and 5.56). The first term of the denominator of (5.54) can be written as We can express the first term of (5.78) as i s I Xi — fax i=i v ^ \ / \ ox oY (Xi — fx x\ (Xi - fix Yi ~ ^ 2 + — : 72 — ; — V Ox J \ Ox Oy 1 , 2 ( Xi — fix \ 2 ( Xi — fix Yi — Ay V^ 12 V>X \ 2 = — T — h i n 7 7 V Ox J \ Ox Oy 1=1 v ' N 1 V^ ,2 (Xi - fix\ 2 { Xi - fixYi- fiy\ + n i=l , 2 V^ , (Xi- fix\ , (Xi- fix\ (Xi- fixYi- fiy + A — I ) 7i n ^ \ Ox J \ Ox / \ Ox Oy ^( X i:fixY-fi y\ \ Ox O y ) As in the case of the numerator, by using Taylor expansion about (fix, ox, HY,oy), the three terms on the right-hand-side of (5.79) can be expressed as E 2 ( Xi — fix\ 2 ( Xi — fixYi — fiy\ n i=i i=1 N Xi — fix \ 2 / Xi — fixYi — fly !7I Ox J \ Ox Oy K n(bx ~ ox) -K l2(a Y - av), (5.80) i=1 V ' Xi — flx\ 2 ( Xi — fix Yi — fly 72 — : Ox J \ Ox oy - K n(a x - ax) ~ K u(a Y - aY), (5.81) - ( ( X i : i l x \ n (Xi:flxY i:fl YYJXi ~ ~ ^ n ' \ Ox J \ Ox J \ Ox Oy J \ ax . Oy = - Y ^ ^ ( 7 l ( X,-^Y-fly\ 72 / X^flxY-fly^ N ^ V OX J V Ox J \ Ox Oy J \ ax Oy J -K l5(ax-o x)-K 16(by-ay), (5.82) where Kn, K\ 2 , #13, #14, #15 and K w are as defined in the statement of the theorem. Using (5.80), (5.81) and (5.82) in (5.79), we have Xi - fix Ox Xi - fix Ox K17(OX ~ Ox) - K18{O y - O y ) , ' (5.83) where Kn = Kn + #13 + #15 and K w = K12 + #14 + KI 6. Now, it can be shown that the second term of (5.78) is o(l/y/n). Therefore, using (5.83) in (5.78), and using (5.75) and (5.76), we have 'Xi - nx' 2 K„1 n Dxnj -T, n i=1 ox n -b Yi ~ My X Dy n ' V V °y Similarly, for the second term of the denominator of (5.54) we have 2 (5.84) f Xj ~ jjLX Dx n \ \ <7* (5.85) Using (5.77), (5.84) and (5.85), and ignoring the terms which are o(l/^/n), we can state that n L V ( E \ E E Y-tXy cry 2 )] \ (5.86) where E is as defined in the statement of the theorem. Finally, we can use the 5-method (Billingsley 1986) on (5.86) to complete the proof. • 5.7 Conclusion The main contribution of this chapter is that we established some asymptotic properties of the adjusted-Winsorized correlation estimate (the new robust correlation estimate proposed in this thesis). Our estimate is consistent and has bounded influence. We obtained its asymptotic variance and intrinsic bias. The tuning constants of this estimate can be chosen such that we have approximate Fisher-consistency. An smoothed version of this estimate is asymptotically normal. The computing time of this estimate is 0(n log n), the same as that of the univariate-Winsorized correlation estimate, but our estimate is more resistant to bivariate outliers. 5.8 Chapter Appendix 5.8.1 Proof of Lemma 5.2 We need to show that, for any e > 0, there exists 5m > 0 and 5S > 0 such that, for all ueR, \m1-m2\<5m, |si-s2| \f{u,rni,s- L)-f{u,m 2,s2)\ fh + sc. For any m < m and s < s we have (u — m)/s > c. Now, consider u < m — sc. For any m > m and s > s we have (u — m)/s < —c. Thus, for U = [m — sc, m + sc], (5.87) holds. For u E U, ip is uniformly continuous. Therefore, we only need to show that U — MI U — 777.2 Si S 2 . . |s2 — Slj |m 2 5 i ' - 777iS2| = \u\ 1 u < Ku-S\S2 SiS2 |s2 - Si I , I , - Si I |m2 - mil 1- \m2\ h s2 SiS2 SiS2 SiS2 | s 2 - S i | _ | s 2 - S i | \m2 — mi\ + m- + where K u = s u p { | u | : u G U). Thus, sufficiently small. 
mi u—m,2 s 2 < 5 if \m2 — mi| and |s2 — Si| are 5.8.2 Influence function: interchanging differentiation and in-tegration The first term of the right-hand-side of (5.27) can be expressed as a t Jm x{t) Jm Y(t) sx(t) sy(t) t=0 d ' [ m x { t ) f mY{t ) L fx- mx(i)\ fy -mY(t) sy{t) /mx{t) r -co J - 00 (5.88) t=0 The first part of (5.88) can be written as dt r r Jm x(t)Jm Y(t) V SX[t) ) \ SY{t) J t=0 li -i—>0 t r r J mx(t) J ray (t) sx{t) sY(t) oo roo / tJj Cl(x)ip Cl{y) f(x,y)dydx Jo lim -t-X) t ir r k c-.Jm x(t) Jm Y{t) (. \ x - mx(t)\ (y- mY(t) sx(t) J ^ V sY(t) V'd (x)^ Cl (y) }f{x,y) dydx I rmx{t) r oo t I / il>a{x)il>c 1(y)f{x,y)dydx 1 Jo Jo 2 PCX) rmY(t) / / Vci (x) if> Cl (y) f{x, y) dy dx Jo Jo •y pmx(t) rrnY(t) / / ipCl {x)ip Cl [y) f{x,y)dydx (5.89) Jo Jo — lim lim t-> o t lim • t->o t lim - k ( ' ^ M ( Y ^ I ) - ( x ) A i („) UO JO t sx(t) J ™ V sY(t) I\x - mx{t) > 0) I(y- mY{t) > 0)f(x,y) dydx , (5.90) since the last three terms of (5.89) are zero. Now, assuming that ip(u), ip'(u) and ip'(u)u are bounded for all u e l (under Regularity Condition A4), we can show that (X~sx(*) t ]) \t=o i s bounded. Therefore, using Lebesgue's Dominated Convergence Theorem (see, for example, Bartle 1995, page 44), we have d_ dt f f Jm x(t)Jm Y(t) V W J / V SY[t) J t=0 U™ 7 (v'd f——77T^) V'ci (- ] _ -0C1 (a;) ^ (y) k dydx JO JO / V SY(t) f{x,y) dy dx. t=o (5.91) Similarly, d r^xit) rmY(t) dt sx(t) SY(t) y - mY(t) sY(t) t=o f(x,y) dy dx. t=o (5.92) Combining (5.91) and (5.92), we have d_ dt Sx(t) mx{t))(Y-m Y(t)) > 0 t=o = E H0 X-m x{t)\ L (Y-m Y{t) sY{t) I(XY > 0) i=0 (5.93) We can now write (5.28) from (5.27). In a similar way, we can show that d_ dt EH 0 < t ,2 \ X ~ mx{t) Jci sx(t) I^(X-m x(t))(Y-m Y(t))> 0 - F \ d (,1.2 ( X~ mx(t) - Eh° dt { sx(t) t=0 I(XY > 0) t=0 , (5.94) which gives (5.38). 5.8.3 Asymptotic normality of the adjusted-Winsorized "co-variance" estimate The numerator of (5.2) can be written as * = I ± , c tak) * ( ) _ I ± (X^M) I £ , c . n ^ V - fiy) > 0), and £(c2) = l((Xi - fi x)(Yi -fiy) < 0). Denote /*( C l ) = l((Xi - fi x)(Yi - fiy) > 0), and I fa) = l((Xi - fi x)(Yi - fiy) < 0). Let us focus on the cases when /;(ci) and Ii(c\) take different values. Assuming (without loss of generality) that fi x < fi x, we can argue that ij(ci) Ii(ci) when fi x < Xi < fi x. (Or, fiy C l = 0 ( V ^ - Similarly, J E V>Cl f ^i(ci) = o (1/x/n). Using these results in (5.101), we have n i = 1 V / (5.102) i / v s ) . i=l N Similarly, Therefore, we have n E * . i=i ~ AY C 2 Ox Oy Ii{c 2). (5.105) Now, using Taylor expansion about (fix,ox), we have ( = - ( ^ M | fc - , x ) Ox \ Ox Ox \ Ox (5.106) Ox \ Ox \ Ox for fi x < fi x < Mx and ox < ox < oX-Similarly, Oy J \ Oy J Oy V Oy (5.107) oy 1 V Oy J \ Oy for fiy < fiy < fx Y and dy < Oy < Oy. Using (5.106) and (5.107) in the first part of the right-hand-side of (5.105), we have i=i Ox Oy = l ^ M Y i „ ) 7 i(C l) n ~ l \ Ox J \ Oy 1 noy i=1 Xi — fix \ I, ( Yi — jj>Y\ ( Yi — fly Ox ci Oy Oy nox X ^ C I i=l Yi — HY\ , (Xi — fi x \ (Xi — fi x Oy c 1 /i(ci) (oo Oy ci Vv1 Oy X-fix Ox Oy /i(ci) o-y 0Y (5.109) and nox t e^. ( ^ k ( ( " i ' 0). 
Therefore, n E ^ i i=i i = l v oY X l ^ X ( — — ) /i(ci) ~ AC1 (x\ i (Yi — fiy i=1 Ox E^ci ( ^ JV'dl )h{ci) OY n 2 > i=1 Ox i=i X <7y cry - M -5 C 1 1 E i=i X - /-fx <7X where and Similarly, £>x = E Dy = E X i ( X — fj-x \ ( X — fix ox ox X , (Y - (i Y\ (Y - fj, Y OY OY -b , (5.112) (5.113) (5.114) .n i=1 Ox oY MY Oy Yi - My Ii(c 2) y Dy n V \ c ry 1 = 1 x v 2 = 1 Xj - Mx Ox where AC2 = —E o x BC2 = —E Oy Ox J \ Oy J \ Oy Y-MY\„ l, (X - Hx \ ( x - Hx Ox V>c2 = , Oy ) \ 0 X I(C2) (5.115) (5.116) (5.117) and Dx and Dy are as above. Using (5.112) and (5.115) in (5.105), we have 'Xi-p, X\ l o , /Yi-fly n t t V i=1 x XI — AX V/. ( YI ~ MY \ 1 n oy i=l i=l Oy Xi — fix \ , (Yi — fly Vc Ox An 1 Dy n z E i=l X C7y - My ay B c 1 Dx n ^ E X Xj - jJL X Ox where Ac = + AC2, and Bc — Bc, + DCo. b), (5.118) ci ^ C2 • Using (5.118, we can state that \fn N -M^M^)}] + AT(0 ,Q), where Q = Var Ox Oy lc i Y - My D\ X Oy B, c _ , X - Hx D) X ox 5.8.4 Difficulty with the denominator of (5.2) Let us focus on the first term in the denominator of (5.2). It can be expressed as n ' 1 V^ , 2 f Xi — fix i=i Ox n i=i Xi - fix Ox (5.119) As in the case of the numerator, we can write (5.120) Unfortunately, this time we cannot replace /;(.) by /j(.). The reason is as follows. When h{c 1) "7~ Ji(ci), we have either (i) jl x < Xi < /.i x, or (ii) fiy < Yi < Hy- In the first case, as in the case of the numerator (see Section 5.8.3), we can show that the amount of error due to replacing /;(.) by /j(.) is 0(l/n). However, in the second case, though jJiy — fiy is 0(1/Vn), the term V. ( X ijf L) (which is also 0(1/^/n) for the l^'s above) is not there to make the product 0 ( l /n ) . Therefore, the amount of error in replacing the sample indicator functions by the true ones is 0(1/ \ /n) , and cannot be ignored. Chapter 6 Conclusion In this study, we considered the problem of selecting linear prediction models for large high-dimensional datasets that possibly contain a fraction of contaminations. Our goal was to achieve robustness and scalability at the same time. We considered one-step and two-step model building procedures, the latter consisting of sequencing and segmentation steps. We will now summarize the main ideas proposed in this thesis, and the main results obtained. One-step model building We proposed robust versions of step-by-step algorithms FS and SW. We expressed these classical algorithms in terms of sample means, variances and correlations, and replaced these sample quantities by their robust counterparts to obtain the robust algorithms. We used robust correlations derived from a simplified version of bivariate M-estimates of the scatter matrix. We proposed robust partial F-tests for stopping during the implementa-tion of robust FS and SW procedures. Our robust methods have much better performance compared to the standard FS and SW algorithms. Also, they are computationally very suitable, and scalable to large dimensions. Two-step model building Robust sequencing We considered time-efficient algorithm LARS to sequence (some of) the d covariates to form a list such that the good predictors are likely to appear in the beginning. Since LARS' is not resistant to outliers, we proposed two different approaches to robustify LARS. In the plug-in approach, we replaced the classical correlations in LARS by easily computable robust correlation estimates. 
In the data-cleaning approach, we first transformed the dataset by shrinking the outliers towards the bulk of the data (which we call multivariate-Winsorization), and then applied standard LARS on the transformed data. The data-cleaning approach is more time-consuming than the plug-in approach when only some of the predictors are being sequenced.
For both approaches (plug-in and data-cleaning), we used robust correlations derived from a simplified version of the bivariate M-estimates of the scatter matrix. We also proposed correlation estimates using bivariate-Winsorization of the data. We showed that the latter is faster to compute, with important time differences when the number of candidate predictors becomes large.
We recommend combining robust LARS with bootstrap to obtain more stable and reliable results. The reduced sets obtained by bootstrapped robust LARS contain more of the important covariates than the reduced sets obtained by initial robust LARS. To obtain a reduced set of m covariates for further investigation, we introduced a learning curve that plots robust R² values versus dimension. An appropriate value of m is the dimension corresponding to the point where the curve starts to level off.

Robust segmentation
We performed all possible subsets regression on the reduced set of covariates obtained in the first step. Since the classical selection criteria FPE, AIC, Cp, CV and bootstrap are sensitive to outliers, we needed robust selection criteria for this purpose. We identified certain limitations of the robust AIC (Ronchetti 1985) and robust CV (Ronchetti, Field and Blanchard 1997) methods. We proposed computationally suitable robust CV and robust bootstrap procedures in this thesis. We evaluated our methods using simulated and real datasets, and compared them with the classical methods as well as with the robust FPE proposed by Yohai (1997). Our robust CV and robust bootstrap methods have better performance than the classical methods and robust FPE.

Adjusted-Winsorized correlation estimate
For the development of robust LARS, we proposed this new correlation estimate for bivariate data. The proposed estimate is consistent and has bounded influence. We obtained its asymptotic variance and intrinsic bias. The tuning constants of this estimate can be chosen such that we have approximate Fisher-consistency. A smoothed version of this estimate is asymptotically normal. The computing time of this estimate is approximately the same as that of the univariate-Winsorized correlation estimate, but our estimate is more resistant to bivariate outliers.

Bibliography
Akaike, H. (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21: 243-247.
Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22: 203-217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, Academiai Kiado, Budapest, pages 267-281.
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16: 125-127.
Alqallaf, F. A. (2003). A new contamination model for robust estimation with large high-dimensional datasets. PhD thesis, Department of Mathematics (Institute of Applied Mathematics), University of British Columbia.
Alqallaf, F. A., Konis, K. P., Martin, R. D., and Zamar, R. H. (2002). Scalable robust covariance and correlation estimates for data mining.
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, pages 14-23.
Bartle, R. G. (1995). The Elements of Integration and Lebesgue Measure. John Wiley & Sons.
Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika, 67: 546-551.
Billingsley, P. (1986). Probability and Measure. John Wiley & Sons, 2nd edition.
Croux, C., Van Aelst, S., and Dehon, C. (2003). Bounded influence regression using high breakdown scatter matrices. Ann. Inst. Statist. Math., 55(2): 265-285.
Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78: 316-331.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, 32(2): 407-499.
Frank, I. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109-148.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70: 320-328.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics. John Wiley & Sons.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12: 55-67.
Huber, P. J. (1981). Robust Statistics. Wiley, New York.
Hubert, M. and Engelen, S. (2004). Fast cross-validation of high-breakdown resampling methods for PCA. Unpublished manuscript.
Knight, W. R. (1966). A computer method for calculating Kendall's tau with ungrouped data. Journal of the American Statistical Association, 61: 436-439.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22: 79-86.
Lachenbruch, P. and Mickey, M. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10: 1-11.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15: 661-675.
Mallows, C. L. (1995). More comments on Cp. Technometrics, 37: 362-372.
Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. The Annals of Statistics, 4: 51-67.
Mendenhall, W. and Sincich, T. (2003). A Second Course in Statistics: Regression Analysis. Pearson Education, Inc., New Jersey, 6th edition.
Morgenthaler, S., Welsch, R. E., and Zenide, A. (2003). Algorithms for robust model selection in linear regression. Theory and Applications of Recent Robust Methods, eds. M. Hubert, G. Pison, A. Struyf, and S. Van Aelst, Basel (Switzerland): Birkhauser-Verlag.
Ronchetti, E. (1985). Robust model selection in regression. Statistics and Probability Letters, 3: 21-23.
Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. Journal of the American Statistical Association, 92: 1017-1023.
Ronchetti, E. and Staudte, R. G. (1994). A robust version of Mallows' Cp. Journal of the American Statistical Association, 89: 550-559.
Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79: 871-880.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York.
Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41: 212-223.
Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regression by means of S-estimators. Robust and Nonlinear Time Series Analysis (J. Franke, W. Hardle, and R. D. Martin, eds.), Lecture Notes in Statistics 26, Springer Verlag, New York: 256-272.
Salibian-Barrera, M. (2000). Contributions to the theory of robust inference. PhD thesis, Department of Statistics, University of British Columbia.
Salibian-Barrera, M. and Zamar, R. H. (2002). Bootstrapping robust estimates of regression. The Annals of Statistics, 30: 556-582.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shao, J. (1996). Bootstrap Model Selection. Journal of the American Statistical Association, 91: 655-665.
Sommer, S. and Huggins, R. M. (1996). Variable selection using the Wald Test and Robust Cp. Journal of the Royal Statistical Society, Ser. B, 45: 15-29.
Stone, M. (1974). Cross-validation choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Ser. B, 36: 111-147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Ser. B, 39: 44-47.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Ser. B, 58: 267-288.
Weisberg, S. (1985). Applied Linear Regression. Wiley-Interscience, New York, 2nd edition.
Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. The Annals of Statistics, 15: 642-656.
Yohai, V. J. (1997). A new robust model selection criterion for linear models: RFPE. Unpublished manuscript.