Robust Estimation andInference under Cellwise andCasewise ContaminationbyAndy Chun Yin LeungB.Sc., The University of British Columbia, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2016c© Andy Chun Yin Leung 2016AbstractCellwise outliers are likely to occur together with casewise outliers in datasets ofrelatively large dimension. Recent work has shown that traditional high breakdownpoint procedures may fail when applied to such datasets. In this thesis, we considerthis problem when the goal is to (1) estimate multivariate location and scatter matrixand (2) estimate regression coefficients and confidence intervals for inference, whichboth are cornerstones in multivariate data analysis.To address the first problem, we propose a two-step procedure to deal with case-wise and cellwise outliers, which generally proceeds as follows: first, it uses a filterto identify cellwise outliers and replace them by missing values; then, it applies arobust estimator to the incomplete data to down-weight casewise outliers. We showthat the two-step procedure is consistent under the central model provided the filteris appropriately chosen.The proposed two-step procedure for estimating location and scatter matrix isthen applied in regression for the case of continuous covariates by simply adding athird step, which computes robust regression coefficients from the estimated robustmultivariate location and scatter matrix obtained in the second step. We show thatthe three-step estimator is consistent and asymptotically normal at the central model,for the case of continuous covariates. Finally, the estimator is extended to handleboth continuous and dummy covariates.Extensive simulation results and real data examples show that the proposed meth-ods can handle both cellwise and casewise outliers similarly well.iiPrefaceThis dissertation was prepared under the supervision of Professor Ruben Zamar andit is mainly based on the three papers coauthored with the supervisor and othercollaborators. Two of the papers have been published and a third one is submittedfor publication.Chapter 2 is mostly based on the published discussion paper in TEST, “Robustestimation of multivariate location and scatter in the presence of cellwise and casewisecontamination” (Agostinelli et al., 2015). The problem addressed in this paper, theproposed two-step procedures and the theoretical developments resulted from a broaddiscussion among the authors. Moreover, the author of this dissertation developeda bivariate filter version of the procedure which is included in the third paper. Theauthor also designed and conducted the empirical study and prepared the first versionof the manuscript, as well as the rejoinder, followed by the revisions proposed by thecoauthors. Finally, the author implemented the proposed procedure, as well as itsworkhorse, the generalized S-estimator (GSE) (Danilov et al., 2012), that is used inthe second step. This resulted in an R package, GSE (Leung et al., 2015), availableon CRAN (R Core Team, 2015).Chapter 3 follows up on the discussion and rejoinder of the discussion paperin TEST (Agostinelli et al., 2015). Professor Ricardo Maronna in his discussionsuggested replacing the second step of the two-step procedure with a GSE of Danilovet al. (2012) with a Rocke-type ρ function. This way the procedure could achieverobustness in higher dimensions. The author of the dissertation investigated thesuggestion and extended the GSE to its Rocke-type counterpart. The author inventeda new fast subsampling procedure for computing a needed initial estimator. Theauthor used the same empirical study design as presented in Chapter 2 and preparediiiPrefaceall the empirical results. The author also wrote the first version of this manuscriptfollowed by the revisions proposed by the supervisor. The work has been submittedfor publication. Finally, all the relevant procedures proposed in this chapter aremade available in the above mentioned R package, GSE.Chapter 4 is based on the published paper in Computational Statistics & DataAnalysis, “Robust regression estimation and inference in the presence of cellwiseand casewise contamination” (Leung et al., 2016). The problem addressed in thispaper, the proposed procedures and the theoretical developments resulted from abroad discussion among the authors. In particular, the author came up with thekey idea to skip the filtering step if cellwise contamination is not suspected, whichconsiderably helps the theoretical analysis of the approach. The first version ofthe manuscript, including the asymptotic results and the empirical study, were allprepared by the author, followed by the revisions proposed by the supervisor. Finally,the author implemented the three-step approach in the R package robreg3S (Leunget al., 2015), also available on CRAN.In addition to the aforementioned contributions, the supervisor made severalsuggestions regarding the presentation of material in this dissertation, relevant liter-ature, and motivation of the research. He also checked all the proofs and made somechanges in the writing to improve the flow of ideas in the dissertation and papers.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Real data examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.1 Geochemical data . . . . . . . . . . . . . . . . . . . . . . . . 31.2.2 Micro-cap stock returns data . . . . . . . . . . . . . . . . . . 61.3 Contributions and outline of the thesis . . . . . . . . . . . . . . . . . 92 Robust Estimation of Multivariate Location and Scatter in the Pres-ence of Cellwise and Casewise Contamination in Moderate Dimen-sion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2 Global-robust estimation under THCM and ICM . . . . . . . . . . . 162.2.1 Step I: Filtering cellwise outliers . . . . . . . . . . . . . . . . 162.2.2 Step II: Dealing with casewise outliers . . . . . . . . . . . . . 21vTable of Contents2.3 Consistency of GSE on filtered data . . . . . . . . . . . . . . . . . . 242.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.5 Real data example: geochemical data revisit . . . . . . . . . . . . . . 332.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.6.1 Controlling the robustness and efficiency for large p . . . . . . 372.6.2 Computational issue . . . . . . . . . . . . . . . . . . . . . . . 372.6.3 Remarks on handling large and flat data sets . . . . . . . . . 382.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Robust Estimation of Multivariate Location and Scatter in the Pres-ence of Cellwise and Casewise Contamination in High Dimension 413.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.2 Generalized Rocke S-estimators . . . . . . . . . . . . . . . . . . . . . 423.3 Computing issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.3.1 Initial estimator . . . . . . . . . . . . . . . . . . . . . . . . . 453.3.2 Another computing issue . . . . . . . . . . . . . . . . . . . . 483.4 Two-step estimation and extended simulations . . . . . . . . . . . . 483.5 Real data example: small-cap stock returns data . . . . . . . . . . . 523.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564 Robust Regression Estimation and Inference in the Presence of Cell-wise and Casewise Contamination . . . . . . . . . . . . . . . . . . . . 584.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.2 Consistent filter for general continuous data . . . . . . . . . . . . . . 604.3 Three-step regression . . . . . . . . . . . . . . . . . . . . . . . . . . 624.3.1 The estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 624.3.2 Models with continuous and dummy covariates . . . . . . . . 654.4 Asymptotic properties of three-step regression . . . . . . . . . . . . . 674.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.5.1 Models with continuous covariates . . . . . . . . . . . . . . . 704.5.2 Models with continuous and dummy covariates . . . . . . . . 75viTable of Contents4.6 Real data example: Boston Housing data . . . . . . . . . . . . . . . 774.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83AppendicesA Supplementary material for Chapter 2 . . . . . . . . . . . . . . . . . 89A.1 Additional tables from the simulation study in Section 2.4 . . . . . . 89A.2 Proofs of propositions and theorem . . . . . . . . . . . . . . . . . . . 91B Supplementary material for Chapter 3 . . . . . . . . . . . . . . . . . 101B.1 Efficiency of GRE and tuning parameter α in Rocke-ρ function . . . 101B.2 Performance comparison between GSE and GRE . . . . . . . . . . . 102C Supplementary material for Chapter 4 . . . . . . . . . . . . . . . . . 104C.1 Additional figures from the simulation study in Section 4.5 . . . . . . 104C.2 Investigation on the performance on non-normal covariates . . . . . . 105C.3 Supplementary material for the Boston housing data analysis . . . . 107C.4 Proofs of lemmas and theorems . . . . . . . . . . . . . . . . . . . . . 107viiList of Tables2.1 Maximum average LRT distances under cellwise contamination. Thesample size is n = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . 312.2 Maximum average LRT distances under casewise contamination. Thesample size is n = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . . 322.3 Finite sample efficiency for first order autoregressive correlations, AR1(ρ),with ρ = 0.9. The sample size is n = 10p. . . . . . . . . . . . . . . . . 332.4 Average “CPU time” – in seconds of a 2.8 GHz Intel Xeon – evaluatedusing the R command, system.time. The sample size is n = 10p. . . 333.1 Maximum average LRT distances under cellwise contamination. Thesample size is n = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . 493.2 Maximum average LRT distances under casewise contamination. Thesample size is n = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . 503.3 Finite sample efficiency for random correlations. The sample size isn = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513.4 Average “CPU time” – in seconds of a 2.8 GHz Intel Xeon – evaluatedusing the R command, system.time. The sample size is n = 10p. . . 524.1 Maximum MSE in all the considered scenarios for models with con-tinuous covariates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2 Average lengths of confidence intervals for clean data and for cellwiseand casewise contamination. . . . . . . . . . . . . . . . . . . . . . . . 744.3 Maximum MSE in all the considered scenarios for models with con-tinuous and dummy covariates. . . . . . . . . . . . . . . . . . . . . . 76viiiList of Tables4.4 Estimates and p-values of the regression coefficients for the originalBoston Housing data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 784.5 Pairwise squared norm distances between the estimates for the originalBoston housing data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 784.6 Estimates and p-values of the regression coefficients for the imputedBoston Housing data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 794.7 Pairwise squared norm distances between the estimates for the im-puted Boston housing data. . . . . . . . . . . . . . . . . . . . . . . . 79A.1 Maximum average LRT distances under cellwise contamination. Thesample size is n = 5p. . . . . . . . . . . . . . . . . . . . . . . . . . . 89A.2 Maximum average LRT distances under casewise contamination. Thesample size is n = 5p. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90A.3 Finite sample efficiency for first order autoregressive correlations, AR1(ρ),with ρ = 0.9. The sample size is n = 5p. . . . . . . . . . . . . . . . . 90B.1 Finite sample efficiency and maximum average LRT distances forGRE-C with various values of α. The sample size is n = 10p. . . . . . 101B.2 Maximum average LRT distances. The sample size is n = 10p. . . . . 102B.3 Finite sample efficiency. The sample size is n = 10p. . . . . . . . . . . 103C.1 MSE for clean data and cellwise contaminated data. . . . . . . . . . 106C.2 Description of the variables in the Boston Housing data . . . . . . . . 107ixList of Figures1.1 Illustration of (a) casewise contamination and (b) cellwise contamina-tion in a data matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 (a) Histograms and (b) quantile–quantile plots of the compound con-tent in the geochmical data. . . . . . . . . . . . . . . . . . . . . . . . 41.3 Squared Mahalanobis distances of the samples in the geochemical databased on estimates calculated on (a) the original data and (b) thecleaned data. Samples that contain one or more flagged components(large cellwise outliers) are in green. . . . . . . . . . . . . . . . . . . . 61.4 (a) Histograms and (b) quantile–quantile plots of the weekly returnsin the micro-cap asset returns data. . . . . . . . . . . . . . . . . . . . 71.5 Squared Mahalanobis distances of the weekly observations in the micro-cap asset returns data based on estimates calculated on (a) the origi-nal data and (b) the cleaned data. Large distances are truncated forbetter visualization. Observations that contain one flagged compo-nents (large cellwise outliers) are in blue and those contain at leasttwo flagged components are in green. The weeks corresponding to the2008 financial crisis are enclosed by the vertical dashed lines. . . . . . 81.6 Results of UBF-GRE-C applied to the original data. . . . . . . . . . . 92.1 Average LRT distances for various contamination values, k, under 10%cellwise contamination. The dimension is p = 20 and sample size isn = 10p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31xList of Figures2.2 Squared Mahalanobis distances of the samples in the geochemical databased on different estimates. Large distances are truncated for bettervisualization. Samples that contain one or more flagged components(large cellwise outliers) are in green. . . . . . . . . . . . . . . . . . . 342.3 Univariate scatterplots of the 10 components in the geochemical data.Each component is standardized by its median and MAD. Points arerandomly jittered to avoid much overlapping. The points flagged bythe univariate filter are in blue, and those additionally flagged by thebivariate filter are in orange. . . . . . . . . . . . . . . . . . . . . . . . 352.4 Pairwise scatterplots of the geochemical data for V 2 versus V 8, V 2versus V 3, and V 3 versus V 9. Points with components flagged bythe univariate filter are in blue. Points with components additionallyflagged by the bivariate filter are in orange. . . . . . . . . . . . . . . . 363.1 Weight functions of the Tukey bisquare and the Rocke for p = 40.Chi-square density functions are also plotted in blue for comparison.All the functions are scaled so that their maximum is 1 to facilitatecomparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.2 Average LRT distance behaviors of UBF-GSE and UBF-GRE-C forrandom correlations under 10% casewise contamination. The dimen-sion is p = 30 and the sample size is n = 10p. . . . . . . . . . . . . . 513.3 Normal quantile–quantile plots of weekly returns. Weekly returns thatare three MAD away from the coordinatewise-median are shown ingreen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.4 Squared Mahalanobis distances of the weekly observations in the small-cap asset returns data based on the MLE, the Rocke, the UF-GSE,and the UBF-GSE estimates. Weeks that contain one or more assetreturns with values three MAD away from the coordinatewise-medianare in green. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54xiList of Figures3.5 Pairwise scatterplots of the asset returns data for WTS versus HTLD,HTLD versus WSBC, and WSBC versus SUR. Points with compo-nents flagged by the univariate filter are in blue. Points with compo-nents additionally flagged by the bivariate filter are in orange. . . . . 553.6 Squared Mahalanobis distances of the weekly observations in the small-cap asset returns data based on the UBF-GSE and the UBF-GRE-Cestimates. Weeks that contain one or more asset returns with valuesthree MAD away from the coordinatewise-median are in green. . . . 564.1 MSE for various cellwise and casewise contamination values, k, formodels with continuous covariates. The sample size is n = 300. . . . . 724.2 CR for clean data and for cellwise and casewise contaminated data ofvarious sample size, n. . . . . . . . . . . . . . . . . . . . . . . . . . . 744.3 MSE for various cellwise and casewise contamination values, k, formodels with continuous and dummy covariates. The sample size isn = 300. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76B.1 Average LRT distances for various contamination sizes, k, for randomcorrelations under 10% casewise contamination. The dimension isp = 30 and the sample size is n = 10p. . . . . . . . . . . . . . . . . . 103C.1 MSE for various cellwise and casewise contamination values, k, formodels with p = 15 continuous covariates. The sample size is n = 150.For details see Section 4.5.1 in the chapter. . . . . . . . . . . . . . . 104C.2 MSE for various cellwise and casewise contamination values, k, formodels with px = 12 continuous and pd = 3 dummy covariates. Thesample size is n = 150. For details see Section 4.5.2 in the chapter. . . 105xiiDedicationI dedicate this to my mother.xiiiChapter 1Introduction1.1 BackgroundStatistical estimations and inferences are generally based on some assumptions aboutthe underlying situation (e.g., model distributions of the data). In practice, theseassumptions may not always be fulfilled. For instance, observations may deviatefrom specified model distribution or may contain gross errors. All such differentlybehaving observations are generally called contaminated observations. It is wellknown that some of the most common statistical procedures are extremely sensitiveto the presence of contaminated observations, and therefore, many robust alternativeshave been proposed.Most of the traditional robust procedures are based on the assumption that themajority of the observations or cases in the data follow a specified model distribu-tion, while only a minority follow an arbitrary, unspecified distribution. This as-sumption usually refers to casewise contamination or rowwise contamination, whosename comes from representing an observed data set in a matrix where rows are casesand columns are variables (see Figure 1.1). Traditional robust procedures then flagand down-weight contaminated cases entirely, and they have been shown to workwell under these assumption.However, the casewise contamination model does not always hold for real data be-cause observations may only be partially contaminated. This type of contaminationoften appears as single outlying cells in a data matrix and can be modeled as inde-pendent cellwise contamination (see Figure 1.1). Under the cellwise contaminationmodel, the traditional practice of down-weighting entire cases is no longer appropri-ate because it could entail a serious loss of information. Nonetheless, it still works11.1. Background(a) (b)Figure 1.1: Illustration of (a) casewise contamination and (b) cellwise contaminationin a data matrix.under the circumstance that the fraction of cases affected by cellwise contaminationis small, which is usually the case for small data sets (small number of variables).Datasets in recent years often contain a large number of variables, and as such,they could suffer from a large fraction of cellwise contaminated cases. For example,if the proportion of cellwise contamination is ε = 0.05 and the dimension is p = 10,then the probability ε that at least one component of a case is contaminated isε = 1− (1− ε)p = 0.40;if ε = 0.05 and p = 20, then ε = 0.64; and if ε = 0.05 and p = 30, then ε = 0.79.This phenomenon is referred to as the propagation of (cellwise) outliers, challengingthe fundamental assumption required by the traditional robust procedures.21.2. Real data examplesIn real life situations, data sets may even contain both casewise and cellwiseoutliers, further complicating the problem. The following two examples provide ev-idence of the occurrence of cellwise and casewise outliers in real data and illustratethe following fact: Traditional casewise-robust procedures are not sufficient for deal-ing with these two types of outliers simultaneously. As a result, they fail to providea good fit to the bulk of the data and miss out real outliers.1.2 Real data examples1.2.1 Geochemical dataConsider the geochemical data in Smith et al. (1984). The data contain contentmeasure (in parts per million) for 20 chemical compounds in 53 samples of rocks inWestern Australia. In this example, we focus on a subset of 10 chemical compoundswith the most suspected cellwise contamination. As the original data are skewed, weapply a log transformation to the data to make them more symmetric.Figure 1.2 presents the distribution of the content measure for the 10 compoundsin histograms and normal quantile–quantile plots. We notice a relatively large num-ber of outliers in compound V9 and V17, and we suspect a few outliers in the othercompounds.Consider the following outlier detection rule. Denote a sample of content mea-sure by X1, . . . , Xn. Consider a pair of location and dispersion estimators T0n andS0n. Here we use the median and the median absolute deviance for T0n and S0n,respectively. Suppose the content measures have normal distributions. We flag acontent measure ifXi < T0,n − 2.81 · S0,n or Xi > T0,n + 2.81 · S0,n, (1.1)so that approximately only 0.5% of the measures would be flagged, assuming thatT0,n and S0,n are close to true parameter values. Applying this rule, we find in total5.1% of cellwise outliers that propagates to 41.5% of the cases, which is close to the31.2. Real data examplesV2 V3 V6 V7 V8V9 V10 V16 V17 V18log10(content)count(a) Distributions of compound contentlllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllll ll lllllllllllllllllllllllllllllllllllllllllll l llllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllV2 V3 V6 V7 V8V9 V10 V16 V17 V18standard normallog10(content)(b) Normal quantile–quantile plots of compound contentFigure 1.2: (a) Histograms and (b) quantile–quantile plots of the compound contentin the geochmical data.41.2. Real data examplesmaximum contamination that traditional robust procedures can handle (i.e., 50%).Let’s investigate the effect of this propagation of cellwise outliers on traditionalrobust estimates. A common way for detecting outlying cases in high dimension isto use the squared Mahalanobis distance (MD):d(x,mˆ, Cˆ ) = (x − mˆ)tCˆ−1(x − mˆ),where x is a p-dimensional data vector, mˆ is a multivariate location estimate, and Cˆis a covariance matrix estimate. Under the assumption that the data are normallydistributed and that mˆ and Cˆ are close to the true parameter values m and C , thedistance d(x,m,C ) has an approximate chi-squared distribution with p degrees offreedom. With reasonably large sample sizes, the estimates will be close to their truevalues provided they are robust and consistent. Therefore, it is common practiceto compare the squared Mahalanobis distances with a high percentile (such as the99.5-th percentile) of a chi-squared with p degrees of freedom. Points exceeding thisthreshold are flagged as possible outliers.We consider the MLE approach (sample mean and covariance matrix) and a state-of-the-art robust approach (against casewise contamination), the Rocke-S-estimator(Maronna & Yohai, 2015). Figure 1.3a shows the squared Mahalanobis distances ofthe samples based on the two estimates. Cases that contain one or more flaggedcomponents are shown in green in the figure. Clearly, the MLE estimates are muchaffected by the outliers in the data; the corresponding MD’s flag no green cases.The robust estimates are also affected by the propagation of cellwise outliers; thecorresponding MD’s fail to flag more than half of the suspected green cases.Next, we replace the flagged cells by their coordinate medians in an attemptto control the effect of outliers propagation. Then, we re-estimate the multivariatelocation and covariance matrix using these “cleaned” data. The resulting squaredMahalanobis distances of the observations are given in Figure 1.3b. The robustestimates now recognize most of the cellwise contaminated observations as outlyingobservations. In addition, they unmask two new outlying observations (samples 2and 35). These two observations do not have any outlying cellwise components and51.2. Real data examplesl llll lll l l lll lll l l l llll lll llll l l l lll l lll ll l l l lll l lll l2 35lllllllllll ll lll l l l ll lllll llll l l l ll l l lll ll l l l lll l l l l l2 35MLERocke−S025507510012502550751001255 10 15 20 25 30 35 40 45 50SampleSquared Mahalanobis distances(a) Original estimatesl lllllllll l ll llll l l ll llll ll lll l l l llll lllll l l l llllllll2 35lllllllllllll lll ll l ll lllllllll l l l llllll llll l l llllllll2 35MLERocke−S025507510012502550751001255 10 15 20 25 30 35 40 45 50SampleSquared Mahalanobis distances(b) Cleaned estimatesFigure 1.3: Squared Mahalanobis distances of the samples in the geochemical databased on estimates calculated on (a) the original data and (b) the cleaned data.Samples that contain one or more flagged components (large cellwise outliers) are ingreen.they did not appear outlying based on the original robust estimates (see Figure 1.3a).It is clear that cellwise data-preprocessing followed by some casewise-robust pro-cedure are indeed necessary for capturing the full extent of contamination in thesedata (cellwise and casewise).1.2.2 Micro-cap stock returns dataThese data contain the weekly returns from 01/08/2008 to 12/28/2010 (n = 730weeks) in a portfolio of 20 micro-cap stocks (p = 20) in Martin (2013).Figure 1.4 shows the distributions and normal QQ-plots of the 20 micro-capstocks returns in the portfolio. Overall, the returns do not seem to deviate muchfrom normal, but they do contain a few outliers. Applying the outlier detection rule in(1.1) to these data, we identify 7.2% of outlying returns in the portfolio, propagatingto 55.4% of the cases. Most of the weeks with outlying returns correspond to the2008 financial crisis (from late-2008 to mid-2009).Figure 1.5a shows the squared Mahalanobis distances of the weekly observations61.2. Real data examplesAE AMIC ATX CYBE DGASDXYN FEIM FLXS INCB LCRDNOBH NUHC PMD PNBC REFRRMCF SCX SGC SLI TGXWeekly returnscount(a) Distributions of weekly returnsllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l llllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll ll lllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllll lll lllllllllllllllllllllllllllllllllllllllll l lll llllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllll ll l lllllllllllllllllllllllllllllllllllllllllll l lAE AMIC ATX CYBE DGASDXYN FEIM FLXS INCB LCRDNOBH NUHC PMD PNBC REFRRMCF SCX SGC SLI TGXstandard normalWeekly returns(b) Normal quantile–quantile plots of weekly returnsFigure 1.4: (a) Histograms and (b) quantile–quantile plots of the weekly returns inthe micro-cap asset returns data.71.2. Real data exampleslllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll9 30 2008lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll9 30 2008MLERocke−S0501001502002500501001502002501/1/20084/8/20087/8/200810/7/20081/6/20094/7/20097/7/200910/6/20091/5/20104/6/20107/6/201010/5/2010WeekSquared Mahalanobis distances(a) Original Datalllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll9 30 2008lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll9 30 2008MLERocke−S0501001502002500501001502002501/1/20084/8/20087/8/200810/7/20081/6/20094/7/20097/7/200910/6/20091/5/20104/6/20107/6/201010/5/2010WeekSquared Mahalanobis distances(b) Cleaned DataFigure 1.5: Squared Mahalanobis distances of the weekly observations in the micro-cap asset returns data based on estimates calculated on (a) the original data and (b)the cleaned data. Large distances are truncated for better visualization. Observa-tions that contain one flagged components (large cellwise outliers) are in blue andthose contain at least two flagged components are in green. The weeks correspondingto the 2008 financial crisis are enclosed by the vertical dashed lines.based on the MLE and the Rocke-S estimates calculated on the original data. A totalof 35 weeks (22.3% of the cases) contain one flagged component and are shown in bluein the figures. Another 52 weeks (33.1% of the cases) contain two or more flaggedcomponents and are shown in green. The financial crisis is the period between thetwo vertical dashed lines. Notice that all the weeks in the financial crisis are eitherblue or green.As expected, the MLE estimates are adversely affected by the outliers and con-sider many of the weeks during the crisis as normal weeks, contradicting intuition.They fail to flag 35 green and 32 blue cases. The casewise-robust estimate is also up-set by the propagation of cellwise outliers and misses 7 green and 29 blue cases, twoof them during the financial crisis. The propagation of cellwise outliers has distorted81.3. Contributions and outline of the thesislllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll9 30 2008UBF−GRE_C0501001502002501/1/20084/8/20087/8/200810/7/20081/6/20094/7/20097/7/200910/6/20091/5/20104/6/20107/6/201010/5/2010WeekSquared Mahalanobis distancesFigure 1.6: Results of UBF-GRE-C applied to the original data.the robust estimates and corresponding MD’s.To try to control the effect of propagation of cellwise outliers, we repeat theanalysis but this time replacing the flagged cells by their coordinate medians. Figure1.5b shows the squared Mahalanobis distances based on the new estimates calculatedon the cleaned data. The MLE and the robust estimates now flag all the weeks duringthe crisis as outliers, no longer contradicting intuition. Also, interestingly, the robustestimates now unmask additional outlying weeks (e.g., Week 09/30/2008) that arecasewise outliers masked in the original analysis.Figure 1.6 shows the results from more sophisticated approach developed inChapter 3, called UBF-GRE-C. This procedure is able to flag all the green weeks,all but 3 of the blue weeks and 40 new casewise outlying weeks which were maskedby the propagation of cellwise outliers in previous analysis.1.3 Contributions and outline of the thesisFrom the examples, we see that casewise and cellwise outliers could co-exist in realdata. We also see from these examples that traditional robust procedures are inade-quate to provide reliable estimates and to detect outliers in the presence of cellwise-outliers propagation. To address these problems, we propose and study new robustmethods for dealing with cellwise and casewise contamination in this thesis. Our91.3. Contributions and outline of the thesiscontributions are listed below.• We provide several real data examples with convincing evidence of simultaneousoccurrence of cellwise and casewise contamination.• We study the problem of robust estimation of multivariate location and scat-ter matrices in the presence of cellwise and casewise contamination. Thesequantities are of great importance as they are cornerstones in multivariatedata analysis. We propose a new procedure and show that this procedure canefficiently deal with cellwise outliers. Moreover, we show that the proposedprocedure can deal with casewise outliers for datasets of moderate dimension(p ≤ 15, say), performing similarly to traditional robust methods. Further-more, we show that, under no contamination, the procedure is consistent andhighly efficient. Part of the procedure is published as a discussion paper inTEST (see Agostinelli et al., 2015).• We revisit the same problem above but for higher dimensional data (p > 15,say). We improve the proposed procedure in TEST in two different aspects:robustness under casewise contamination and computation speed. We equipthe procedure with a new robust estimator for incomplete data that can si-multaneously attain high robustness and reasonable efficiency for moderate tolarge dimension. We also develop a new fast subsampling method for comput-ing initial estimates for the procedure when the dimension is large. This workis submitted for publication.• We study the classic problem of multiple linear regression in the same paradigm.We propose a new procedure for estimating regression coefficients that can han-dle cellwise and casewise outliers similarly well. We prove that the procedureis consistent and asymptotically normal for the case of continuous covariates.This allows for statistical inference, at least for large sample sizes. Furthermore,the procedure is extended to handle both continuous and dummy covariatesusing an iterative algorithm in estimation. To the best of our knowledge, the101.3. Contributions and outline of the thesisproposed procedure is the first robust regression methods that can achieve ro-bust estimation and inference in the presence of cellwise and casewise contam-ination, and can deal with numerical and dummy covariates. The procedureis published as a methodology paper in Computational Statistics and DataAnalysis (CSDA) (see Leung et al., 2016).• Finally, we develop two R packages for robust analysis in the presence of cell-wise and casewise contamination. The first R package, GSE (Leung et al., 2015),implements our proposed procedure for estimating multivariate location andscatter (Agostinelli et al., 2015). The package also implements several robustmultivariate location and scatter estimators for incomplete data that are heav-ily used by our procedure such as the generalized S-estimator (Danilov, 2010).The second R package, robreg3S (Leung et al., 2015), implements our proposedregression estimator for robust estimation and inference in multiple linear re-gression (see Leung et al., 2016). The two R packages are freely available onCRAN (R Core Team, 2015).The rest of the thesis is organized as follows. Chapter 2 and 3 is dedicated to ourwork on robust estimation of multivariate location and scatter and Chapter 4 is onrobust linear regression analysis. The thesis concludes in Chapter 5, where some ofthe challenges that remain to be solved and the directions we foresee for future workare presented.11Chapter 2Robust Estimation of MultivariateLocation and Scatter in thePresence of Cellwise and CasewiseContamination in ModerateDimension2.1 IntroductionOutliers are a common problem for data analysts because they may have a bigdetrimental effect on estimation, inference and prediction. In robust statistics it isgenerally assumed that a relatively small fraction of cases may be contaminated,however, it has also been recently noticed that the majority of the cases (and evenall of them) could be partially contaminated for moderate and high-dimensionaldata. The problem of interest in this chapter is robust estimation of multivariatelocation and scatter matrix in consideration of the latter case. The estimation ofthese parameters is a corner stone in many applications such as principal componentanalysis, factor analysis, and multiple linear regression.Classical contamination modelTo fix ideas, suppose that a multivariate data set is organized in a table with rows ascases and columns as variables, that is, X = (X 1, . . . ,Xn)t, withX i = (Xi1, . . . , Xip).122.1. IntroductionThe vast majority of procedures for robust analysis of multivariate data are basedon the classical Tukey–Huber contamination model (THCM), or sometimes calledcasewise contamination model, where a small fraction of rows in the data tablemay be contaminated. In THCM, the contamination mechanism is modeled as amixture of two distributions: one corresponding to the nominal model and the othercorresponding to the outliers. More precisely, THCM considers the following familyof distributions:H = {H = (1− )H0 + H˜ : H˜ is any distribution on Rp} (2.1)whereH0 is a central parametric distribution such as the multivariate normalNp(µ,Σ)and H˜ is an unspecified outlier generating distribution. We then assume a case fol-lows a distribution from the above family, that is X i ∼ H where H ∈ H. Thekey feature of this model is that when is small we have X i ∼ H0 most of thetime, therefore, detection and down-weighting of outlying cases make sense and workwell in practice. High breakdown point affine equivariant estimators such as MVE(Rousseeuw, 1985), MCD (Rousseeuw, 1985), S (Davies, 1987), MM (Tatsuoka &Tyler, 2000) and Stahel–Donoho estimators (Stahel, 1981; Donoho, 1982) proceed inthis general way.Independent contamination modelIn many applications, however, the contamination mechanism may be different inthat individual components (or cells) in X are independently contaminated. Forinstance, in the case of high-dimensional data, variables could be gathered separatelyand therefore, exposed to contamination independently. The cellwise contaminationmechanism may in principle seem rather harmless, but in fact it has far reachingconsequences including the possible breakdown of classical high breakdown pointestimators.The new contamination framework, called independent contamination model(ICM), was presented and formalized in Alqallaf et al. (2009). In the ICM framework,132.1. Introductionwe consider a different family of distribution:I = {H : H is the distribution of X = (I −B )X 0 +B X˜}, (2.2)where X 0 ∼ H0, X˜ ∼ H˜, and B = diag(B1, . . . , Bp), where the Bj are independentBin(1, ). In other words, each component ofX has a probability of being indepen-dently contaminated. Furthermore, the probability that at least one component ofX is contaminated is now = 1− (1− )p.This implies that even if is small, could be large for large p, and could exceed the0.5 breakdown point of highly robust affine equivariant estimators under THCM. Forexample, if = 0.1 and p = 10, then = 0.65; if = 0.05 and p = 20, then = 0.64and if = 0.01 and p = 100, then = 0.63.Alqallaf et al. (2009) showed that for this type of contamination, the breakdownpoint of all the traditional 0.5 breakdown point and affine equivariant location esti-mators is 1− 0.51/p → 0 as p→∞. It can be shown that the same holds for robustand affine equivariant scatter estimators. Hence, we have a new manifestation of thecurse of dimensionality : when p is large, traditional robust estimators break downfor a rather small fraction of independent contamination.To remedy this problem, some researchers have proposed to Winsorize potentialoutliers for each variable separately. For instance, Alqallaf et al. (2002) revisited Hu-berized Pairwise Covariance (Huber, 1981), which is constructed using transformedcorrelation coefficients calculated separately on Huberized data as basic buildingblocks. Huberization is a form of Winsorization. Although pairwise robust esti-mators show some robustness under ICM, they cannot deal with casewise outliersfrom THCM, as well as finely shaped multivariate data. Another approach to dealwith ICM outliers was proposed in Van Aelst et al. (2012). They modified theStahel–Donoho (SD) estimator (Stahel, 1981; Donoho, 1982) by calculating the SD-outlyingness measure and weights on Huberized data instead of the raw data. Inour simulation study, this estimator performs very well under THCM, but is notsufficiently robust under ICM.142.1. IntroductionAn alternative approach consists of replacing cellwise outliers by NA’s, which wecall the approach filtering, like how oversize particles can be separated in filtration.The use of filtering to fend against cellwise contamination has been suggested byvarious authors (e.g., Danilov, 2010; Van Aelst et al., 2012; Farcomeni, 2014a). Inparticular, Danilov (2010) has compared maximum likelihood estimates of covariancematrix computed for various pre-processed data, and empirically found that filtereddata yield the most robust estimates against large cellwise contamination. Farcomeni(2014a) proposed an interesting idea to estimate location and scatter matrix byoptimizing some maximum likelihood over the parameters of interest, as well as thefiltering set with a fixed size (the filtering operation was called snipping in the paper).The original method of Farcomeni (2014a) was for clustering multivariate data whereeach cluster has an unknown location and scatter matrix, but it can be easily adaptedto our problem by fixing the number of clusters to one. In our simulation study, theestimators of Danilov (2010) and Farcomeni (2014a) perform very well under ICM,but neither is sufficiently robust under THCM.The main goal of this chapter is to emphasize the need for a new generation ofglobal–robust estimators that can simultaneously deal with outliers from ICM andTHCM, as well as to to define new robust estimators that can deal with them.In Section 2.2, we introduce a global–robust estimator of multivariate location andscatter. In Section 2.3, we show that our estimation procedure is strongly consistent.That is, the multivariate location estimator converges a.s. to the true location andthe scatter matrix estimator converges a.s. to a scalar multiple of the true scattermatrix, for a general elliptical distribution. Moreover, for a normal distribution thescalar factor is equal to one. In Section 2.4, we report the result of a simulationstudy. In Section 2.5, we analyze two real data sets using the proposed and severalcompeting estimators. In Section 2.6, we discuss several main points raised by thediscussants of the original paper of this chapter. Finally, we conclude in Section 2.7.We also provide some additional numerical results and all the proofs in Appendix A.152.2. Global-robust estimation under THCM and ICM2.2 Global-robust estimation under THCM andICMOur approach for global–robust estimation under THCM and ICM is to first flagunivariate outlying cells in the data table and to replace them by NA’s. In thesecond step we then apply a procedure that is robust against casewise outliers. Two-step procedures like this were relatively unpopular in the robustness field becauseof the potential lack of desirable statistical properties for the final estimate (suchas consistency and efficiency) and also because it had not been convincingly shownthat the final estimate is robust under THCM and ICM. These two limitations areovercome in our procedure by the use of an adaptive filter (Gervini & Yohai, 2002)in the first step and a generalized S-estimator (GSE) (Danilov et al., 2012) in thesecond step.More precisely, our procedure has two major steps:Step I. Filtering large cellwise outliers. We flag cellwise outliers and replacethem by NA’s. This step prevents cellwise contaminated cases fromhaving large robust Mahalanobis distances in the second step. SeeSection 2.2.1 for further details.Step II. Dealing with casewise outliers. We apply GSE, to the filtered datacoming from Step I. Notice that GSE has been specifically designedto deal with incomplete multivariate data with casewise outliers. SeeSection 2.2.2 for further details.Full account of these steps is provided in the remaining of this section.2.2.1 Step I: Filtering cellwise outliersConsider a random sample of X = (X 1, . . . ,Xn)t, where X i follows a distributionfrom I in (2.2). The filtering we present consists of two parts: a part that aimsat detecting large cellwise outliers by looking at marginals, and another part that162.2. Global-robust estimation under THCM and ICMaims at detecting moderate cellwise outliers by incorporating information about thecorrelation structure of the data.Univariate filteringLet X1, . . . , Xn be a random (univariate) sample of observations. Consider a pairof initial location and dispersion estimators, T0n and S0n. Common choices for T0nand S0n that are also adopted in this chapter are the median and median absolutedeviation (MAD). Denote the standardized sample by Zi = (Xi − T0n)/S0n. Let Fbe a chosen reference distribution for Zi. An ideal choice for a reference distributionwould be F0, the actual distribution of (Xi − µ0)/σ0. Unfortunately, F0 is unknownin practice. Thus, we use the standard normal distribution, F = Φ, as an approxi-mation. A normalizing transformation could be applied if the marginal data do notseem normal from standard diagnostic tools such as normal quantile-quantile plots.Instead of a fixed cutoff value for Zi, we introduce an adaptive cutoff (Gervini& Yohai, 2002) which is asymptotically “correct”, meaning that for clean data thefraction of flagged outliers tends to zero as the sample size n tends to infinity. Theadaptive cutoff values are defined as follows. Let F+n be the empirical distributionfunction for the absolute standardized value, that is,F+n (t) =1nn∑i=1I(|Zi| ≤ t).The proportion of flagged outliers is defined bydn = supt≥η{F+(t)− F+n (t)}+= maxi>i0{F+(|Z|(i))− (i− 1)n}+,(2.3)where in general {a}+ represents the positive part of a and F+ is the distribution of|Z| when Z ∼ F . Here, |Z|(i) is the order statistics of |Zi|, i0 = max{i : |Z|(i) < η},and η = (F+)−1(α) is a large quantile of F+. We use α = 0.95 for univariate filtering172.2. Global-robust estimation under THCM and ICMas the aim is to detect large outliers, but other choices could be considered. Then,we flag bndnc observations with the largest standardized value as cellwise outliersand replace them by NA’s (here, bac is the largest integer less than or equal to a).Finally, the resulting adaptive cutoff value for Zi istn = min{t : F+n (t) ≥ 1− dn}, (2.4)that is, tn = Z(in) with in = n− bndnc. Equivalently, we flag Xi if |Zi| ≥ tn.The following proposition states that even when the actual distribution is un-known, asymptotically, the univariate filter will not flag outliers when the tail ofthe chosen reference distribution is heavier (or equal) than the tail of the actualdistribution. We call this property consistency throughout this thesis.Proposition 2.1. Consider a random variable X ∼ F0 with F0 continuous. Also,consider a pair of location and dispersion estimators, T0n and S0n, such that T0n →µ0 ∈ R and S0n → σ0 > 0 a.s. [F0]. Let F+0 (t) = PF0(|X−µ0σ0 | ≤ t). If the referencedistribution F+ satisfies the inequalitymaxt≥η{F+(t)− F+0 (t)} ≤ 0, (2.5)thenn0n→ 0 a.s.,wheren0 = bndnc.Proof: See Section A.2 in the Appendix.Bivariate filteringAs pointed out by Rousseeuw & Van den Bossche (2015), to filter the univariateoutliers based solely on their value may be too limiting as no correlation with othervariables is taken into account. A moderately contaminated cell may pass the filter182.2. Global-robust estimation under THCM and ICMwhen viewed marginally, but it may be flagged as an outlier when viewed togetherwith other components, especially for highly correlated data.Let (X 1, . . . ,Xn), with X i = (Xi1, Xi2)t, be a random sample of bivariate obser-vations. Consider also a pair of initial location and scatter estimators,T 0n =(T0n,1T0n,2)and C 0n =(C0n,11 C0n,12C0n,21 C0n,22).Similar to the univariate case we use the coordinate-wise median and the bivariateGnanadesikan-Kettenring estimator with MAD scale (Gnanadesikan & Kettenring,1972) for T 0n and C 0n, respectively. More precisely, the initial scatter estimators aredefined byC0n,jk =14(MAD({Xij +Xik})2 −MAD({Xij −Xik})2),where MAD({Yi}) denotes the MAD of Y1, . . . , Yn. Note that C0n,jj = MAD({Xj})2,which agrees with our choice of the coordinate-wise dispersion estimators. Now,denote the pairwise (squared) Mahalanobis distances by Di = (X i −T 0n)tC−10n (X i −T 0n). Let Gn be the empirical distribution for pairwise Mahalanobis distances,Gn(t) =1nn∑i=1I(Di ≤ t).Finally, we filter outlying points X i by comparing Gn(t) with G(t), where G is achosen reference distribution. In this thesis, we use the chi-squared distribution withtwo degrees of freedom, G = χ22. The proportion of flagged bivariate outliers isdefined bydn = supt≥η{G(t)−Gn(t)}+= maxi>i0{G(D(i))− (i− 1)n}+.(2.6)Here, η = G−1(α), and we use α = 0.85 for bivariate filtering since we now aim formoderate outliers, but other choices of α can be considered. Then, we flag bndnc192.2. Global-robust estimation under THCM and ICMobservations with the largest pairwise Mahalanobis distances as outlying bivariatepoints. The resulting adaptive cutoff value for the distances can be defined in thesame way as in (2.4). Finally, the following proposition states the consistency prop-erty of the bivariate filter.Proposition 2.2. Consider a random vector X = (X1, X2)t ∼ H0. Also, considera pair of bivariate location and scatter estimators, T 0n and C 0n, such that T 0n →µ0 ∈ R2 and C 0n → Σ0 ∈ PDS(2) a.s. [H0] (PDS(q) is the set of all positive definitesymmetric matrices of size q). Let G0(t) = PH0((X − µ0)tΣ−10 (X − µ0) ≤ t) andsuppose that G0 is continuous. If the reference distribution G satisfies:maxt≥η{G(t)−G0(t)} ≤ 0, (2.7)thenn0n→ 0 a.s.,wheren0 = bndnc.Proof: See Section A.2 in the Appendix.Combining the univariate and bivariate filteringWe first apply the univariate filter to each variable in X separately using the initial lo-cation and dispersion estimators, T 0n = (T0n,1, . . . , T0n,p) and S0n = (S0n,1, . . . , S0n,p).Let U be the resulting auxiliary matrix of zeros and ones with zeros indicating thefiltered entries in X. We next iterate over all pairs of variables in X to identifyoutlying bivariate points which helps filtering the moderately contaminated cells.Fix a pair of variables, (Xij, Xik) and set X(jk)i = (Xij, Xik). Let C(jk)0n be aninitial pairwise scatter matrix estimator for this pair of variables. We calculate thepairwise Mahalanobis distances D(jk)i = (X(jk)i − T (jk)0n )t(C (jk)0n )−1(X (jk)i − T (jk)0n ) andperform the bivariate filtering on the pairwise distances with no flagged componentsfrom the univariate filtering: {D(jk)i : Uij = 1, Uik = 1}. We apply this procedure to202.2. Global-robust estimation under THCM and ICMall pairs of variables 1 ≤ j < k ≤ p. LetJ ={(i, j, k) : D(jk)i is flagged as bivariate outlier},be the set of triplets which identify the pairs of cells flagged by the bivariate filterin rows i = 1, ..., n. It remains to determine which cells (i, j) in row i are to beflagged as cellwise outliers. For each cell (i, j) in the data table, i = 1, . . . , n andj = 1, . . . , p, we count the number of flagged pairs in the i-th row where cell (i, j) isinvolved:mij = # {k : (i, j, k) ∈ J} .Cells with large mij are likely to correspond to univariate outliers. Suppose thatobservation Xij is not contaminated by cellwise contamination. Then mij approxi-mately follows the binomial distribution, Bin(∑k 6=j Uik, δ), under ICM, where δ isthe overall proportion of cellwise outliers that were not detected by the univariatefilter. We flag observation Xij ifmij > cij,where cij is the 0.99-quantile of Bin(∑k 6=j Uik, δ). In practice we obtained goodresults (in both simulation and real data applications) using the conservative choiceδ = 0.10, which is adopted in this thesis.2.2.2 Step II: Dealing with casewise outliersThis second step introduces robustness against casewise outliers that went undetectedin Step I. Data that emerged from Step I have holes (i.e., NA’s) that correspond topotentially contaminated cells. To estimate the multivariate location and scattermatrix from that data, we use a recently developed estimator called GSE, brieflyreviewed below.Let X i = (Xi1, . . . , Xip)t, 1 ≤ i ≤ n be p-dimensional i.i.d. random vectors that212.2. Global-robust estimation under THCM and ICMfollow a distribution in an elliptical family E(µ0,Σ0) with densityfX (x,µ0,Σ0) =1|Σ0|f0(D(x,µ0,Σ0)) (2.8)where |A| is the determinant of A, f0 is non-increasing and strictly decreasing at 0,andD(x,m,C ) = (x −m)tC−1(x −m)is the squared Mahalanobis distance. We also use the normalized squared Maha-lanobis distancesD∗(x,m,C ) = D(x,m,C ∗),where C ∗ = C/|C |1/p, so |C ∗| = 1.Let U be the auxiliary matrix of zeros and ones, with zeros indicating the cor-responding missing entries. Let pi = p(U i) =∑pj=1 Uij be the actual dimensionof the observed part of X i. Given a p-dimensional vector of zeros and ones u, a p-dimensional vectorm and a p×p matrixA, we denote bym(u) andA(u) the sub-vectorof m and the sub-matrix of A, respectively, with columns and rows corresponding tothe positive entries in u.Let Ω0n be a p × p positive definite initial estimator for Σ0. Given the locationvector µ ∈ Rp and a p × p positive definite matrix Σ, we define the generalizedM-scale, sGS(µ,Σ,Ω0n,X,U), as the solution in s to the following equation:n∑i=1cp(U i)ρD∗(X(U i)i ,µ(U i),Σ(U i))s cp(U i)∣∣∣Ω(U i)0n ∣∣∣1/p(U i) = b n∑i=1cp(U i) (2.9)where ρ(t) is an even, non-decreasing in |t| and bounded loss function. The tuningconstants ck, 1 ≤ k ≤ p, are chosen such thatEΦ(ρ( ||X ||2ck))= b, X ∼ Nk(0, I ), (2.10)to ensure consistency under the multivariate normal. We consider the Tukey’s222.2. Global-robust estimation under THCM and ICMbisquare rho function, ρ(u) = min(1, 1 − (1 − u)3), and b = 0.5 throughout thischapter.The inclusion of Ω0n in (2.9) is needed to re-normalize the distances D∗ to achieverobustness. A heuristic argument for the inclusion of Ω0n is as follows. Suppose thatT n ≈ µ0 and Sn ≈ Ω0n ≈ Σ0. Then, given U = u,D∗(X (u),T (u)n ,S(u)n )cp(u)∣∣∣Ω(u)0n ∣∣∣1/p(u) ≈D∗(X (u),µ(u)0 ,Σ(u)0 )cp(u)∣∣∣Σ(u)0 ∣∣∣1/p(u) ∼||Y (u)||2cp(u)where Y (u) is a p(u) dimensional random vector with an elliptical distribution. Hence,||Y (u)||2/cp(u) has M-scale of 1 for the given ρ function if Y is normal, and largeMahalanobis distances can be down-weighted accordingly. Here, we use extendedminimum volume ellipsoid (EMVE) for Ω0n as suggested in Danilov et al. (2012).A generalized S-estimator is then defined by(T GS,CGS) = arg minµ,ΣsGS(µ,Σ,Ω0n,X,U) (2.11)subject to the constraintsGS(µ,Σ,Σ,X,U) = 1. (2.12)Under mild regularity assumptions, in the case of elliptical data with U i inde-pendent of X i (missing completely at random assumption) any solution to (2.11) isa consistent estimator for the shape of the scatter matrix. Moreover, in the case ofnormal data, any solution to (2.11) satisfying (2.12) is consistent in shape and sizefor the true covariance matrix. Proofs of these claims, as well as the formulas andthe derivations of the estimating equation for GSE, can be found in Danilov et al.(2012).Finally, our two-step location and scatter estimator is defined byT 2S = T GS(X,Un)C 2S = CGS(X,Un)(2.13)232.3. Consistency of GSE on filtered datawhere Un is an estimated matrix of zeros and ones with zeros indicated filtered entriesin the data table X.2.3 Consistency of GSE on filtered dataThe missing data created in Step I are not missing at random because the missingdata indicator, U, depends on the original data X (univariate outliers are declaredmissing). Therefore, the consistency of our two-step estimator cannot be directlyderived from Danilov et al. (2012). However, as shown in Theorem 2.1 below, ourprocedure is consistent at the central model provided the fraction of missing dataconverges to zero. We need the following assumptions:Assumption 2.1. The function ρ is (i) non-decreasing in |t|, (ii) strictly increasingat 0, (iii) continuous, and (iv) ρ(0) = 0 and (v) limv→∞ ρ(v) = 1 (e.g., Tukey’sbisquare rho function).Assumption 2.2. The random vector X follows a distribution, H0, in the ellipticalfamily defined by (2.8).Assumption 2.3. Let H0 be the distribution of X and denote σ(µ,Σ) the solutionin σ to the following equationEH0(ρ(D(X,µ,Σ)cpσ))= b,and consider the minimization problem,min|Σ|=1σ(µ,Σ). (2.14)We assume that (2.14) has a unique solution, (µ0,Σ00), where µ0 ∈ Rp and Σ00 ∈PDS(p). We also put σ0 = σ(µ0,Σ00).Assumption 2.4. The proportion of fully observed entries,qn = #{i, 1 ≤ i ≤ n : pi = p(U n,i) = p}/n,242.4. Simulationstends to one a.s. as n tends to infinity. Recall that U n,i is the indicator vector ofnon-filtered entries in X i.Remark 2.1. Davies (1987) showed that Assumption 2.2 implies Assumption 2.3with Σ00 = Σ0/|Σ0|.Remark 2.2. By Proposition 2.1 and 2.2, the procedure described in Step I satisfiesAssumption 2.4, provided that the marginal distributions for the distribution thatgenerated the data satisfy equation (2.5) and (2.7).Theorem 2.1. Let X 1, . . . ,Xn be a random sample from H0 and U n,1, . . . ,U n,n beas described in Section 2.2.2. Suppose Assumptions 2.1–2.4 hold. Let (T GS,CGS) bethe GSE defined by (2.11)–(2.13). Then,(i) T GS → µ0 a.s. and(ii) CGS → σ0Σ00 a.s..(iii) When X ∼ N(µ0,Σ0), we have σ0Σ00 = Σ0.Proof: See Section A.2 in the Appendix.2.4 SimulationsWe conduct a simulation study in R (R Core Team, 2015) to compare various esti-mators from different generations of robust estimators of multivariate location andscatter:(a) MCD, the fast minimum covariance determinant proposed by Rousseeuw &Van Driessen (1999) (see also Section 6.7.5 in Maronna et al., 2006). MCD isimplemented in the R package rrcov, function CovMcd;(b) MVE-S, the estimator proposed by Maronna et al. (2006, Section 6.7.5). Itis an S-estimator with bisquare ρ function that uses as initial value of theiterative algorithm, an MVE estimator. The MVE estimator is computed by252.4. Simulationssubsampling with concentration step. The number of subsample in MVE is 500.Once the estimator of location and covariance corresponding to one subsampleare computed, the concentration step consists in computing the sample meanand sample covariance of the [n/2] observations with smallest Mahalanobisdistance. MVE-S is implemented in the R package rrcov, function CovSest,option method="bisquare" (Todorov & Filzmoser, 2009);(c) Rocke-S (or shortened to Rocke), the estimator recently promoted by Maronna& Yohai (2015). It is an S-estimator with a non-monotonic weight function(Rocke, 1996). The only difference to the original proposal is that the esti-mator uses the KSD estimator (Pen˜a & Prieto, 2001) as initial value for theiterative algorithm. The KSD estimator is computed by finding directions thatmaximize or minimize the kurtosis of the respective projections, as well asrandom “specific” directions aimed at detecting casewise outliers. The KSDestimator is implemented in a MATLAB code kindly provided by the author. Theinitial estimate can then be used to calculate Rocke-S, which is implementedin the R package rrcov, function CovSest, option method="rocke".(d) HSD, Stahel–Donoho estimator with Huberized outlyingness proposed by Van Aelstet al. (2012). We use a MATLAB code kindly provided by the authors. The num-ber of subsamples used in HSD is 200p;(e) SnipEM (or shortened to Snip), the procedure proposed in Farcomeni (2014a).This method requires an initial specification of the position of the snipped cellsin the form of a binary data table. We compared (using simulation) severalpossible choices for this initial set including: (a) snipping the largest 10% ofthe absolute standardized values for each variable; (b) snipping the largest15% of the absolute standardized values for each variable; and (c) snippingthe standardized values that are more than 1.5 times the interquartile rangeless the first quartile or more than 1.5 times the interquartile range plus thethird quartile, for each variable. We only report the results from case (b) as ityields the best performances. SnipEM is implemented in the R package snipEM,262.4. Simulationsfunction snipEM, default option (Farcomeni & Leung, 2014);(f) DetMCDScore (or shortened to DMCDSc), the procedure proposed in the com-ment by Rousseeuw & Van den Bossche (2015). The DetMCDScore is calcu-lated by applying deterministic MCD (DetMCD) on the normal scores (copula)of the data. A similar approach was also proposed in O¨llerer & Croux (2015).The DetMCDScore is very computationally efficient and has been shown todeal with cellwise and casewise outliers adequately. DetMCD is implementedin the R package DetMCD, function DetMCD, default option (Kaveh, 2015);(g) UF-GSE and UBF-GSE, the proposed two-step approach. The first step ap-plies either the univariate filter only (shortened to UF) or the combination ofunivariate and bivariate filter (shortened to UBF) to the data. The secondstep then calculates GSE for the incomplete data, starting from the EMVEestimator. The EMVE estimator is computed by subsampling with concen-tration step. The number of subsamples used in EMVE is 500. The two-stepprocedure is available as the TSGS function, option alpha=c(0.95, 0) (UF)and option alpha=c(0.95, 0.85) (UBF), in the R package GSE (Leung et al.,2015).The tuning parameters for the high breakdown point estimators MVE-S, Rocke-S,and MCD are chosen to attain 0.5 breakdown point under THCM.We consider clean and contaminated samples from a Np(µ0 ,Σ0) distribution withdimension p = 5, 10, 15, 20 and sample size n = 10p. We have also considered n = 5p,but the results are generally similar and are provided in Section A.1 in the Appendix.The simulation mechanisms are described below.Since the contamination models and the estimators considered in our simulationstudy are location and scale equivariant, we can assume without loss of generalitythat the mean, µ0, is equal to 0 and the variances in diag(Σ0) are all equal to 1.That is, Σ0 is a correlation matrix.Since the cellwise contamination model and the estimators are not affine-equivariant,we consider the two different approaches to introduce correlation structures: randomcorrelation and first order autoregressive correlation (AR1).272.4. SimulationsRandom CorrelationFor each sample in our simulation, we create a different random correlation matrixwith condition number, which is defined as the largest eigenvalue of a correlationmatrix divided by the smallest eigenvalue, fixed at CN = 100. Correlation matriceswith high condition number are generally less favorable for non-affine equivariantestimators as extensively explored by Alqallaf (2003) and Danilov (2010). We usethe following procedure to obtain random correlations with a fixed condition numberCN :1. For a fixed condition number CN , we first obtain a diagonal matrix Λ =diag(λ1, . . . , λp), [λ1 < λ2 < · · · < λp] with smallest eigenvalue λ1 = 1 andlargest eigenvalue λp = CN. The remaining eigenvalues λ2, . . . , λp−1 are p − 2sorted independent random variables with a uniform distribution in the interval(1,CN).2. We first generate a random p × p matrix Y , which elements are indepen-dent standard normal random variables. Then, we form the symmetric matrixY tY = QVQt to obtain a random orthogonal matrixQ via eigendecomposition.3. Using the results of 1 and 2 above, we construct the random covariance matrixby Σ0 = QΛQt . Notice that the condition number of Σ0 is equal to the desiredCN .4. Convert the covariance matrix Σ0 into the correlation matrix R0 as follows:R0 = W−1/2Σ0W −1/2whereW = diag(σ1, . . . , σp)and σ1, . . . , σp are the standard deviations in the covariance matrix Σ0.5. After the conversion to correlation matrix in Step 4 above, the condition num-ber of R0 is no longer necessarily equal to CN . To remedy this problem, we282.4. Simulationsconsider the eigenvalue diagonalization of R0R0 = Q0Λ0Qt0 . (2.15)whereΛ0 = diag(λR01 , . . . , λR0p ), λR01 < λR02 < · · · < λR0p .is the diagonal matrix formed using the eigenvalues of R0. We now re-establishthe desired condition number CN by redefiningλR0p = CN× λR01and using the modified eigenvalues in (2.15).6. Repeat 4 and 5 until the condition number of R0 is within a tolerance level (oruntil we reach some maximum iterations). In our simulation study, we set thetolerance for the difference in CN at 10−5 and the maximum iterations to be100. However, convergence was reached after a few iteration in all the cases.First Order Autoregressive CorrelationThe random correlation structure generally has small correlations, especially withincreasing p. For example, for p = 10, the maximum correlation values have anaverage of 0.49, and for p = 50, the average maximum is 0.28. So, we consider alsoa different correlation structure with higher correlations, in which the correlationmatrix has entriesΣ0,jk = ρ|j−k|,with ρ = 0.9. This correlation is also known as the first order autoregressive corre-lation (AR1).Contamination ScenariosWe then consider the following scenarios:292.4. Simulations• Clean data: No further changes are done to the data.• Cellwise contamination: We randomly replace a of the cells in the data matrixby Xcontij ∼ N(k, 0.12), where k = 1, 2, . . . , 10.• Casewise contamination: We randomly replace a of the cases in the data ma-trix by X conti ∼ 0.5N(cv, 0.12I ) + 0.5N(−cv, 0.12I ), where c =√k(χ2)−1p (0.99)and k = 1, 2, . . . , 10 and v is the eigenvector corresponding to the smallesteigenvalue of Σ0 with length such that (v − µ0)tΣ−10 (v − µ0) = 1. Experi-ments show that the placement of outliers in this way is the least favorable forthe proposed estimator.We consider = 0.05, 0.10 for cellwise contamination, and = 0.10, 0.20 for casewisecontamination. The number of replicates in our simulation study is N = 500.The performance of a given scatter estimator Σn is measured by the Kullback–Leibler divergence between two Gaussian distribution with the same mean and co-variances Σ and Σ0:D(Σ,Σ0) = trace(ΣΣ−10 )− log(|ΣΣ−10 |)− p.This divergence also appears in the likelihood ratio test statistics for testing the nullhypothesis that a multivariate normal distribution has covariance matrix Σ = Σ0.We call this divergence measure the likelihood ratio test distance (LRT). Then, theperformance of an estimator Σn is summarized byD(Σn,Σ0) =1NN∑i=1D(Σˆn,i,Σ0)where Σˆn,i is the estimate at the i-th replication. Finally, the maximum average LRTdistances over all considered contamination values, k, is also calculated.Table 2.1 shows the maximum average LRT distances among the considered con-tamination sizes for the cellwise contamination setting for n = 10p. UBF-GSE andUF-GSE perform similarly when correlations are small because the bivariate filter302.4. SimulationsTable 2.1: Maximum average LRT distances under cellwise contamination. Thesample size is n = 10p.Corr. p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSERandom 5 0.05 1.2 0.6 0.8 1.7 10.0 1.7 0.8 1.00.10 2.8 6.6 19.1 8.5 57.8 6.8 3.8 3.710 0.05 2.6 11.1 3.6 11.4 7.6 3.7 4.7 4.60.10 99.1 150.0 260.6 61.5 11.0 15.2 16.3 16.015 0.05 21.6 65.2 46.3 31.8 12.0 6.7 8.5 8.40.10 168.2 198.2 202.8 155.0 15.4 21.3 21.0 21.120 0.05 53.6 99.7 190.4 58.5 15.5 9.6 11.1 11.30.10 216.3 240.0 737.5 253.1 18.6 25.8 24.3 24.4AR1(0.9) 5 0.05 1.1 0.6 0.8 0.6 7.7 1.1 0.7 0.90.10 4.0 9.7 21.2 1.8 33.9 3.9 2.1 1.610 0.05 2.3 13.8 3.8 2.8 7.2 3.4 2.1 1.20.10 166.9 205.9 629.0 20.6 14.8 14.1 11.0 2.715 0.05 60.8 104.4 111.3 12.3 9.7 7.8 5.1 1.80.10 328.7 381.2 412.8 103.0 14.3 26.0 21.6 6.620 0.05 140.1 208.3 690.4 31.4 14.4 12.9 9.3 2.70.10 479.3 526.9 1677.4 274.0 20.4 38.1 34.1 14.5Random AR1(0.9)51015202501020302 4 6 2 4 6kLRTUF−GSE UBF−GSE Snip DMCDScFigure 2.1: Average LRT distances for various contamination values, k, under 10%cellwise contamination. The dimension is p = 20 and sample size is n = 10p.312.4. SimulationsTable 2.2: Maximum average LRT distances under casewise contamination. Thesample size is n = 10p.Corr. p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSERandom 5 0.10 1.5 0.9 3.5 1.3 10.6 1.9 2.4 3.80.20 18.3 6.4 27.9 5.2 33.3 16.5 20.0 31.310 0.10 7.7 4.2 8.9 3.9 32.7 5.1 9.9 18.50.20 113.7 38.2 4.3 21.9 62.3 61.2 80.7 97.215 0.10 31.2 7.4 5.3 8.2 48.5 16.4 19.0 32.50.20 145.6 102.7 4.4 49.1 85.4 81.4 116.2 135.320 0.10 64.8 10.4 5.0 13.5 61.3 37.8 30.3 52.40.20 174.6 142.6 4.7 92.4 107.4 99.9 148.4 175.3AR1(0.9) 5 0.10 1.3 1.0 1.8 0.9 7.0 1.4 1.2 1.50.20 12.2 6.3 32.4 2.5 18.3 5.2 7.2 7.810 0.10 5.8 3.5 4.0 1.7 20.2 2.9 3.8 4.40.20 101.6 37.8 15.7 8.8 45.6 31.9 52.5 52.315 0.10 21.9 6.9 3.6 3.0 29.9 6.1 7.3 8.00.20 133.8 99.6 14.9 17.3 68.3 58.7 100.7 102.520 0.10 61.2 9.6 3.2 4.4 42.9 15.7 13.5 15.40.20 165.6 128.9 16.0 32.7 85.6 85.9 129.9 132.9is not sufficient enough to filter moderate cellwise outliers (e.g., k = 2). However,UBF-GSE outperforms UF-GSE when correlations are high because the bivariatefilter can filter moderate cellwise outliers. See, for example, Figure 2.1 that showsthe average LRT distance behaviors for different contamination sizes, k, for p = 20.Table 2.2 shows the maximum average LRT distances among the considered con-tamination sizes for the casewise contamination setting for n = 10p. UF-GSE andUBF-GSE have an acceptable performance for moderate dimensions (e.g., p ≤ 10),comparable with that of MVE-S, but neither UF-GSE, nor UBF-GSE, nor MVE-Sperform very well for higher dimensions (e.g., p ≥ 15) and unsatisfactorily for highercontamination level (see Section 2.6 for further discussion).Table 2.3 shows the finite sample relative efficiency under clean samples withAR1(0.9) correlation for the considered robust estimates, taking the MLE averageLRT distances as the baseline. The results for the random correlation are verysimilar and not shown here. UF-GSE and UBF-GSE, like MVE-S, shows increasing322.5. Real data example: geochemical data revisitTable 2.3: Finite sample efficiency for first order autoregressive correlations, AR1(ρ),with ρ = 0.9. The sample size is n = 10p.p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSE5 0.40 0.74 0.45 0.40 0.15 0.34 0.63 0.5210 0.51 0.90 0.50 0.70 0.26 0.43 0.82 0.7115 0.62 0.94 0.54 0.85 0.13 0.50 0.87 0.7920 0.66 0.96 0.56 0.90 0.14 0.55 0.91 0.85Table 2.4: Average “CPU time” – in seconds of a 2.8 GHz Intel Xeon – evaluatedusing the R command, system.time. The sample size is n = 10p.p UF- UBF-GSE GSE10 0.7 1.120 7.7 11.030 34.5 45.640 120.5 144.950 278.4 338.0efficiency when increasing p. Results for larger sample sizes, not reported here, showan identical pattern, except for MCD which efficiency increases with the sample size.The computing times for our estimator for various dimensions and n = 10p areaveraged over all replications and are shown in Table 2.4. Comparatively longercomputing times for the two-step procedures arise for higher dimensions becauseGSE becomes more computationally intensive for higher dimensions and for higherfractions of cases affected by filtered cells (see Section 2.6 for further discussion).2.5 Real data example: geochemical data revisitIn Chapter 1, we presented the geochemical data from Smith et al. (1984) and high-lighted the presence of cellwise and casewise outliers in these data. We showed therethat traditional robust procedures provide poor estimates and fail to identify realoutliers. In this section, we revisit this example. Our purpose here is twofold: first,to show that the two-step procedures can provide good estimates and identify outliers332.5. Real data example: geochemical data revisitl lll l l ll l l l l l lll l l l l l ll lll lll l l l l lll l ll l ll l l l l l l l ll l l1 2 10 16 1725 35 39llllllll l ll l l ll l l l l l l ll lll lll l l l l l l l l lll ll l l l l l l l l l l l121016 17253539lllllllllll l l lllll l l l llllll llll ll llll llll llll l llllll l121016172535 39llllllllllll llll ll l l l llllll llll ll llll lllllllllllllll l12 10 16 172535 39MLERocke−SUF−GSEUBF−GSE0501000501000501000501005 10 15 20 25 30 35 40 45 50SampleSquared Mahalanobis distancesFigure 2.2: Squared Mahalanobis distances of the samples in the geochemical databased on different estimates. Large distances are truncated for better visualization.Samples that contain one or more flagged components (large cellwise outliers) are ingreen.that were missed by traditional procedures; and second, to show that the bivariatefilter version (UBF-GSE) can also identify moderate cellwise outliers, and as such,yield further improved results.As mentioned in Chapter 1, the geochemical data give the content measure for10 chemical compounds in 53 samples of rocks. A log transformation was applied tothe data to make them more symmetric. From the quantile–quantile plots the dataappear fairly normal for the most part, but outliers are observed.We now compute the UF-GSE and the UBF-GSE estimates for these data, as well342.5. Real data example: geochemical data revisitllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllll lllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllll llllllllllllllllllllllllllllllll llllllllllllllll−5.0−2.50.02.5V2 V3 V6 V7 V8 V9 V10 V16 V17 V18varcontentFigure 2.3: Univariate scatterplots of the 10 components in the geochemical data.Each component is standardized by its median and MAD. Points are randomly jit-tered to avoid much overlapping. The points flagged by the univariate filter are inblue, and those additionally flagged by the bivariate filter are in orange.as the Rocke-S estimates for comparison. Figure 2.2 shows the squared Mahalanobisdistances (MD) of the 53 samples based on the different estimates. Samples thatcontain at least one component that lies more than three MAD away from the median(i.e., the large cellwise outliers) are shown in green in the figure. There are in total41.5% of samples flagged as such. Notice that all the cases with large cellwise outliersare also flagged MD outliers by UBF-GSE and UF-GSE. But this is not the case forRocke-S. This is so because UBF-GSE and UF-GSE makes a more efficient use of theclean part of cases affected by cellwise outliers. Notice that Rocke-S must assign afinal weight to each case by looking at the whole case, even in situations when thereis only a single outlying component. As a result, for these data, Rocke-S fails toprovide a good fit and produces unreliable Mahalanobis distances and final weights.In addition to the samples with large cellwise outliers, eight new cases (samples 1,2, 10, 16, 17, 25, 35, 39) are flagged as outliers by UBF-GSE, with estimated fullMahalanobis distances exceeding the 99.99% chi-square cutoff. UF-GSE also flagsmost of these cases, but misses samples 2 and 17.UF-GSE and UBF-GSE are both equally resistant against large cellwise outliersbut UF-GSE is less resistant against moderate cellwise outliers, which are presentin these data. Figure 2.3 depicts univariate scatterplots for each component (stan-352.6. Discussionlllllllllllll lllll lllllllll lllllllllllllllllllllll012345−4.5 −4.0 −3.5V2V8lllllllllllll lllll lllllllllllllllllllllllllll lllll024−4.5 −4.0 −3.5V2V3llll lllllllllllllllllllllll llllllll lllllll llllll−1012340 2 4V3V9Figure 2.4: Pairwise scatterplots of the geochemical data for V 2 versus V 8, V 2 versusV 3, and V 3 versus V 9. Points with components flagged by the univariate filter arein blue. Points with components additionally flagged by the bivariate filter are inorange.dardized and with random jitter) in the data. The points flagged by the univariatefilter are in blue, and those flagged by the bivariate filter are in orange. Addition-ally, in Figure 2.4 bivariate scatterplots are shown for V 2 versus V 8, V 2 versus V 3,and V 3 versus V 9, where some correlations are observed. From these figures, wesee that the bivariate filter has identified some additional cellwise outliers that arenot-so-large marginally but become more visible when viewed together with othercorrelated components (the orange points in Figure 2.3). These moderate cellwiseoutliers account for 5.1% of the cells in the dataset and propagate to 28.3% of thecases. The final median weight assigned to these cases by UF-GSE and UBF-GSEare 0.36 and 0.70, respectively. By filtering the mild outliers UBF-GSE is able tomake a more efficient use of the clean components in 28.3% of the cases.2.6 DiscussionIn the results section, we have found that the two-step approach performs overall thebest under ICM, but not so well under THCM for p > 10. The computing times ofthe two-step procedures for higher dimensions are rather long, making the proceduresless appealing for real time use. These points were also raised by the discussants in362.6. Discussionthe published paper connected with this chapter, along with a remark on handlinglarge and flat data sets. For the rest of the section, we discuss these main pointsindividually.2.6.1 Controlling the robustness and efficiency for large pMaronna (2015) made a thoughtful remark regarding the loss of robustness of UF-GSE–and in general, S-estimators with a fixed loss function ρ–when p is large. TheGaussian efficiency of UF-GSE, as well as UBF-GSE, systematically increases toone as p increases, but this gain in efficiency comes at the expense of a decrease inrobustness. Hence, we agree that for large p we need to modify the GSE step toavoid the lack of robustness of S-estimators with fixed loss function. A possibilitycould be to use a well-calibrated MM-estimator of multivariate location and scatter(Tatsuoka & Tyler, 2000) after adapting it for handling data with missing values. Theresulting generalized MM-estimator would then gain robustness for p large. Anotherpossibility could be to replace the bisquare rho function in GSE by a Rocke-type lossfunction, which changes with dimension in order to preserve the robustness of theestimator (Rocke, 1996). A comparison of MM-estimators and Rocke type estimatorswith S-estimators based on a bisquare rho-function for complete data and casewisecontamination can be found in Maronna & Yohai (2015). Further work on this topiccan be found in Chapter 3.2.6.2 Computational issueSeveral authors (Croux & O¨llerer, 2015; Maronna, 2015; Rousseeuw & Van den Boss-che, 2015; Van Aelst, 2015; Welsch, 2015) commented on the high computational costof the two-step procedures and the need for faster alternatives.The first step of the two-step procedure (filter) is fast, but the second step is slowdue to the computation of the generalized S-estimator (GSE) (Danilov et al., 2012).We notice that GSE first resamples the filtered data to compute an initial estimateand then iterates until convergence a sequence of robust imputation and estimationsteps. These iterations can be computationally intensive and time consuming when a372.6. Discussionlarge fraction of the data has been filtered. However, the main computational burdenin GSE comes from the computation of the initial robust estimator (extended min-imum volume ellipsoid, EMVE) which is needed to achieve high robustness againstcasewise outliers.EMVE introduced by Danilov et al. (2012) is a generalized version for incom-plete data of the MVE (Rousseeuw, 1985). The computation of EMVE consists ofa combination of subsampling and concentration steps. Once the estimators of loca-tion and scatter for a given subsample are obtained, the concentration step consistsof computing the Gaussian MLE via the classical EM algorithm on the half of theobservations with the smallest Mahalanobis distances. The concentration steps aretime consuming, especially when there is a large number of filtered cells. This prob-lem is aggravated by the required large number of subsamples, especially when p islarge. Therefore, there is a need for finding a fast and fully robust initial estimatefor GSE. Further investigation on choices of initial estimator are done in Chapter 3.2.6.3 Remarks on handling large and flat data setsCroux & O¨llerer (2015) gave an extensive and detailed discussion of the performanceof UF-GSE in the case of large and flat data sets (large p and relatively small n).Our numerical experiments confirm that the two-step procedure does not handlewell large and flat data sets (e.g., n < 5p). In fact, when n ≤ 2p, the generalizedS-estimator fails to exist, likewise S-estimator and all classical robust estimators withbreakdown point 1/2. When much data are filtered and the fraction of complete datais small, the iterations in GSE may fail to converge. In this case, GSE produces anearly singular covariance matrix. This situation is more likely to occur for datasetswith relatively small n compared to p. Danilov et al. (2012) provided a sufficientcondition for the existence of GSE: the proportion of complete observations in generalposition to be larger than 1/2 + (p+ 1)/n. Numerical results have shown that GSEperforms well for some smaller proportions of complete observations. However, notheoretical results are available for these cases.To overcome the lack of convergence of the two-step procedure for large and382.7. Conclusionsflat data sets, we may partially impute the filtered cells to ensure a fraction 1/2 +(p+ 1)/n of complete observations. More precisely, the procedure is to first fil-ter outliers, then randomly select observations and impute the filtered cells usingcoordinate-wise medians, and finally estimate the location and scatter using GSE.Although this procedure is rather ad hoc, our initial numerical experiments suggeststhat it may work for n ≥ 5p.2.7 ConclusionsAffine equivariance, a proven asset for achieving THCM robustness, becomes a hin-drance under ICM because of outliers propagation.We advocate the practical and theoretical importance of ICM and point to theperils and drawbacks of relying solely on the THCM paradigm. ICM promotes aless aggressive cellwise down-weighting of outliers and becomes an essential tool formodeling contamination in moderate and high dimensional data.We introduce a non-affine equivariant, two-step procedure to achieve robustnessunder ICM and THCM. The first step applies a filter to the data, aim to reduce theimpact of outliers propagation and to overcome the curse of dimensionality posedby ICM. The second step then applies the generalized S-estimator of Danilov et al.(2012) to the incomplete data from the first step, aiming to achieve robustness underTHCM. A univariate filtering and a combination of univariate and bivariate filteringare proposed in the first step, resulting in two versions of the two-step procedure:UF-GSE and UBF-GSE.The two-step procedures (UF-GSE and UBF-GSE) exhibits high robustness againstlarge cellwise outliers from ICM, but UF-GSE is not resistant against moderate cell-wise outliers. In this case, UBF-GSE exhibits higher robustness than UF-GSE whenthe correlations between the uncontaminated variables are high. However, this gainin robustness comes at the expense of a decrease in robustness under THCM. There-fore, we recommend UBF-GSE when the correlations are seemingly high, but UF-GSE otherwise. Overall, the two-step procedures exhibits satisfactory robustnessagainst casewise outliers from THCM for low to moderate dimensional data (e.g.,392.7. Conclusionsp ≤ 10), but starts losing robustness with increasing p (e.g., p > 10).Finally, UF-GSE and UBF-GSE both are not yet capable of handling high di-mensional data for the following reasons. (1) The high computational cost of thecurrent initial estimator (EMVE) required by the procedure, making it infeasible forclock-time computation. (2) The generalized S-estimators employed in the secondstep is incapable of dealing with casewise outliers when p is large. Further work onthese two topics are presented in the next chapter.40Chapter 3Robust Estimation of MultivariateLocation and Scatter in thePresence of Cellwise and CasewiseContamination in High Dimension3.1 IntroductionIn this chapter, we continue our work on the problem of robust estimation of multi-variate location and scatter matrix under cellwise and casewise contamination.Most traditional robust estimators assume that the majority of cases is totallyfree of contamination. Any case that deviates from the model distribution is fullyflagged as an outlier. In situations that only a small number of components of acase are contaminated, down-weighting the whole case may not be appropriate andcan cause a huge loss of information, especially when the dimension is large. Whendata contain both cellwise and casewise outliers, the problem becomes even moredifficult. For this, we proposed two-step procedures called UF-GSE and UBF-GSEin Chapter 2. The first step applies either a univariate filter (UF) or a combination ofunivariate and bivariate filter (UBF) to the data matrix X and sets the flagged cellsto missing values, NA’s. The second step then applies the generalized S-estimator(GSE) of Danilov et al. (2012) to this incomplete data set. The two-step proceduresare shown to be simultaneously robust against cellwise and casewise outliers for lowdimensional data (e.g., p ≤ 10). Unfortunately, the procedures do not scale well forhigher dimensions due to the high computational cost of its initial estimator EMVE.413.2. Generalized Rocke S-estimatorsFurthermore, the procedures loses robustness against casewise outliers for higherdimensional data (e.g., p > 10).One goal of this chapter is to improve the robustness of UF-GSE and UBF-GSEin high dimension. For that, we introduce a new robust estimator called General-ized Rocke S-estimator or GRE to replace GSE in the second step. The resultingprocedures are called UF-GRE and UBF-GRE. In his discussion of Agostinelli et al.(2015), Maronna (2015) made a thoughtful remark regarding the loss of robustnessof UF-GSE–and in general, S-estimators with a fixed loss function ρ–when p is large.S-estimators with a fixed ρ uncontrollably gain efficiency and lose their robustnessfor large p (Rocke, 1996). Such curse of dimensionality has also been confirmed forUF-GSE and UBF-GSE, which use a GSE with a fixed ρ in its second step.Another goal of this chapter is to reduce the high computational cost of the two-step approach in high dimension. The first step of filtering is generally fast, butthe second step is slow due to the computation of the extended minimum volumeellipsoid (EMVE), used as initial estimate by the generalized S-estimators. Subsam-pling is the standard way to compute EMVE, but it requires an impractically largenumber of subsamples, making the initial estimation extremely slow. To address thiscomputational issue, we introduce a new subsampling procedure based on clusteringfor computing EMVE. The new initial estimator is called EMVE-C.The rest of the chapter is organized as follows. In Section 3.2, we introduce theGRE. In Section 3.3, we describe the computational issues of the proposed estimatorsregarding the initial estimation and introduce EMVE-C to serve this capacity. InSection 3.4 and 3.5, we compare the two-step approaches equipped with GSE andGRE through an extended simulation study of that in Chapter 2, as well as througha real data example. Finally, we conclude in Section 3.6. We also provide additionalsimulation results and other supplementary material in Appendix B.3.2 Generalized Rocke S-estimatorsRocke (1996) showed that if the weight function W (x) = ρ′(x)/x in S-estimators isnon-increasing, the efficiency of the estimators tends to one when p→∞. However,423.2. Generalized Rocke S-estimatorsthis gain in efficiency is paid for by a decrease in robustness. Not surprisingly,the same phenomenon has been observed for generalized S-estimators in simulationstudies. Therefore, there is a need for new generalized S-estimators with controllableefficiency/robustness trade off.Rocke (1996) proposed that the ρ function used to compute S-estimators shouldchange with the dimension to prevent loss of robustness in higher dimensions. TheRocke-ρ function is constructed based on the fact that for large p the scaled squaredMahalanobis distances for normal dataD(X,µ,Σ)σ≈ Zpwith Z ∼ χ2p,and hence that D/σ are increasingly concentrated around one. So, to have a highenough, but not too high, efficiency, we should give a high weight to the values ofD/σ near one and down-weight the cases where D/σ is far from one.Letγ = min(χ2(1− α)p− 1, 1), (3.1)where χ2(β) is the β-quantile of χ2p. In this chapter, we use a conventional choiceof α = 0.05 that gives a satisfactory efficiency of the estimator, but we have alsoexplored smaller values of α (see Section B.1 in the Appendix). Maronna et al. (2006)proposed a modification of the Rocke-ρ function, namelyρ(u) =0 for 0 ≤ u ≤ 1− γ(u−14γ)[3−(u−1γ)2]+ 12for 1− γ < u < 1 + γ1 for u ≥ 1 + γ(3.2)which has as derivative the desired weight function that vanishes for u 6∈ [1−γ, 1+γ]W (u) =34γ[1−(u− 1γ)2]I(1− γ ≤ u ≤ 1 + γ).433.2. Generalized Rocke S-estimators0.000.250.500.751.000 50 100 150 200zWTukey−bisq. Rocke chi−sq. densityFigure 3.1: Weight functions of the Tukey bisquare and the Rocke for p = 40. Chi-square density functions are also plotted in blue for comparison. All the functionsare scaled so that their maximum is 1 to facilitate comparison.Figure 3.1 compares the Rocke-weight function, WRocke(z/cp), and the Tukey-bisquare weight function, WTukey(z/cp), for p = 40, where cp as defined in (2.10).The chi-square density function is also plotted in blue for comparison. When p islarge the tail of the Tukey-bisquare weight function greatly deviates from the tailof the chi-square density function and inappropriately assigns high weights to largedistances. On the other hand, the Rocke-weight function can resemble the shapeof the chi-square density function and is capable of assigning low weights to largedistances.Finally, we define the generalized Rocke S-estimators or GRE by (2.11) and (2.12)with the ρ-function in (2.9) replaced by the modified Rocke-ρ function in (3.2). Wecompared GRE with GSE via simulation and found that GRE has a substantialbetter performance in dealing with casewise outliers when p is large (e.g., p > 10).Results from this simulation study are provided in Section B.2 in the Appendix.443.3. Computing issues3.3 Computing issuesThe generalized S-estimators described above are computed via iterative reweightedmeans and covariances, starting from an initial estimate. We now discuss somecomputing issues associated with this iterative procedure.3.3.1 Initial estimatorFor the initial estimate, the extended minimum volume ellipsoid (EMVE) has beenused, as suggested by Danilov et al. (2012). The EMVE is computed with a largenumber of subsamples (> 500) to increase the chance that at least one clean sub-sample is obtained. Let ε be the proportion of contamination in the data and m bethe subsample size. The probability of having at least one clean subsample of sizem out of M subsamples isq = 1−[1−(n · (1− ε)m)/(nm)]M. (3.3)For large p, the number of subsamples M required for a large q, say q = 0.99, canbe impractically large, dramatically slowing down the computation. For example,suppose m = p, n = 10p, and ε = 0.50. If p = 10, then M = 7758; if p = 30, thenM = 2.48× 1010; and if p = 50, then M = 4.15× 1016. Therefore, there is a need fora faster and more reliable starting point for large p.Cluster-Based SubsamplingNext, we introduce a cluster-based algorithm for faster and more reliable subsam-pling for the computation of EMVE. The EMVE computed with the cluster-basedsubsampling is called called EMVE-C throughout the chapter.High-dimensional data have several interesting geometrical properties as describedin Hall et al. (2005). One such property that motivated the Rocke-ρ function, as wellas the following algorithm, is that for large p the p-variate standard normal distri-bution Np(0, I ) is concentrated “near” the spherical shell with radius√p. So, if453.3. Computing issuesoutliers have a slightly different covariance structure from clean data, they wouldappear geometrically different. Therefore, we could apply a clustering algorithm tofirst separate the outliers from the clean data. Subsampling from a big cluster, whichin principle is composed of mostly clean cases, should be more reliable and requirefewer number of subsamples.Given a data matrix X, let U be the auxiliary matrix of zeros and ones, with zerosindicating the missing entries in X. The following steps describe our clustering-basedsubsampling:1. Standardize the data X with some initial location and dispersion estimator T0jand S0j. Common choices for T0j and S0j that are also adopted in this chapterare the coordinate-wise median and median absolute deviance (MAD). Denotethe standardized data by Z = (Z 1, . . . ,Zn)t, where Z i = (Zi1, . . . , Zip)t andZij = (Xij − T0j)/S0j.2. Compute a simple robust correlation matrix estimate R = (Rjk). Here, we usethe Gnanadesikan-Kettenring estimator (Gnanadesikan & Kettenring, 1972),whereRij =14(S20jk+ − S20jk−),and where S0jk+ is the dispersion estimate for {Zij +Zik|Uij = 1, Uik = 1} andS0jk− the estimate for {Zij − Zik|Uij = 1, Uik = 1}. We use Qn (Rousseeuw &Croux, 1993) for the dispersion estimate.3. Compute the eigenvalues λ1 ≥ · · · ≥ λp and eigenvectors e1, . . . , ep of thecorrelation matrix estimateR = EΛE t,where Λ = diag(λ1, . . . , λp) and E = (e1, . . . , ep). Let p+ be the largest di-mension such that λj > 0 for j = 1, . . . , p+. Retain only the eigenvectorsE0 = (e1, . . . , ep+) with a positive eigenvalue.4. Complete the standardized data Z by replacing each missing entry, as indicatedby U, by zero. Then, project the data onto the basis eigenvectors Z˜ = ZE0,and then standardize the columns of Z˜ , or so called principal components,463.3. Computing issuesusing coordinate-wise median and MAD of Z˜ .5. Search for a “clean” cluster C in the standardized Z˜ using a hierarchical cluster-ing framework by doing the following. First, compute the dissimilarity matrixfor the principal components using the Euclidean metric. Then, apply classicalhierarchical clustering (with any linkage of choice). A common choice is theWard’s linkage, which is adopted in this chapter. Finally, define the “clean”cluster by the smallest sub-cluster C with a size at least n/2. This can beobtained by cutting the clustering tree at various heights from the top until allthe clusters have size less than n/2.6. Take a subsample of size n0 from C.With good clustering results, we can draw fewer subsamples, and equally im-portant, we can use a larger subsample size. The current default choices in GSEare M = 500 subsamples of size n0 = (p + 1)/(1 − αmis) as suggested in Danilovet al. (2012), where αmis is the fraction of missing data (αmis = number of missingentries /(np)). For the new clustering-based subsampling, we choose M = 50 andn0 = 2(p + 1)/(1 − αmis) in this chapter, but other choices of M and n0 can beconsidered. However, we found that choosing a too large subsample size could resultin contaminated subsamples with outliers that went undetected.In principle, this procedure could be time-consuming because the number of op-erations required by hierarchical clustering is of order n3. As an alternative, one maybypass the hierarchical clustering step and sample directly from the data points withthe smallest Euclidean distances to the origin calculated from Z˜ . This is because theEuclidean distances, in principle, should approximate the Mahalanobis distances tothe mean of the original data. However, our simulations show that the hierarchicalclustering step is essential for the excellent performance of the estimates, and thatthis step entails only a small increase in computational time, even for n = 1000. Formuch larger n, when computational time becomes a serious concern, we can alwaysperform the clustering procedure on a randomly chosen smaller fraction of the datato keep the computational speed, which should be sufficient for finding a reliableinitial estimate.473.4. Two-step estimation and extended simulations3.3.2 Another computing issueThere is no formal proof that the recursive algorithm decreases the objective functionat each iteration for the case of generalized S-estimators with a monotonic weightfunction (Danilov et al., 2012). This also the case for generalized S-estimators with anon-monotonic weight function. For Rocke estimators with complete data, Maronnaet al. (2006, see Section 9.6.3) described an algorithm that ensures attaining a localminimum. We have adapted this algorithm for the generalized counterparts. Al-though we cannot provide a formal proof, we have seen so far in our experimentsthat the descending property of the recursive algorithms always holds.3.4 Two-step estimation and extendedsimulationsThe proposed two-step approach for global–robust estimation under cellwise andcasewise contamination is to first flag outlying cells in the data table and to replacethem by NA’s using either univariate filtering only (shortened to UF) or univariateand bivariate filtering (shortened to UBF). In the second step, the generalized S-estimator is then applied to this incomplete data. Our new version of this is to replaceGSE in the second step by GRE-C (i.e., GRE starting from EMVE-C). We call thenew two-step procedure UF-GRE-C and UBF-GRE-C. The new procedures withGRE in the second step are available as the TSGS function, option method="rocke",in the R package GSE (Leung et al., 2015).We now conduct the same simulation study in Chapter 2 comparing UF-GRE-C and UBF-GRE-C with UF-GSE and UBF-GSE, as well as several other robustestimators that were shown to be as competitive under cellwise (Snip) or casewisecontamination (Rocke and HSD) or both (DMCDSc). In addition, we consider higherdimensions, p = 30, 40, 50. To show the influence of cellwise outliers propagationon casewise robust estimates in higher dimensions, we also consider three levels ofcellwise contamination, ε = 0.02, 0.05, 0.10. Finally, the number of replications inthis simulation study is N = 500.483.4. Two-step estimation and extended simulationsTable 3.1: Maximum average LRT distances under cellwise contamination. Thesample size is n = 10p.Corr. p Rocke HSD Snip DMCDSc UF- UBF- UF- UBF-GSE GSE GRE-C GRE-CRandom 10 0.02 1.2 2.3 6.9 1.6 1.2 1.4 1.3 1.40.05 3.6 11.2 7.5 3.2 4.5 4.4 2.2 2.50.10 256.0 59.9 10.8 14.7 16.2 15.9 11.8 11.520 0.02 2.7 10.6 13.9 2.6 4.0 4.4 2.9 3.00.05 187.2 57.1 15.5 9.3 11.0 11.1 8.0 8.20.10 725.8 251.5 18.4 24.7 24.0 24.1 21.5 21.630 0.02 23.1 22.6 18.5 4.4 5.8 6.3 5.4 5.90.05 380.5 123.1 20.8 13.7 14.2 14.8 12.3 13.40.10 938.5 503.0 23.6 33.2 30.3 30.4 28.2 28.540 0.02 121.3 38.9 22.6 6.0 7.3 8.0 9.4 10.90.05 584.1 212.4 25.8 17.9 16.6 17.4 18.4 19.90.10 1104.3 744.2 30.0 39.5 35.3 35.3 37.5 37.750 0.02 192.8 58.7 27.1 8.1 9.1 10.0 12.5 12.90.05 618.1 298.7 29.7 20.7 19.6 20.6 22.7 23.60.10 1251.8 1002.2 32.0 46.7 43.1 43.2 45.7 46.3AR1(0.9) 10 0.02 1.2 0.9 4.9 1.5 0.9 0.9 1.2 1.30.05 2.6 2.8 7.0 3.1 2.1 1.1 1.7 1.40.10 627.4 20.3 13.8 13.3 10.9 2.6 10.4 2.520 0.02 2.5 3.9 10.5 2.6 2.1 1.5 2.2 2.10.05 690.6 31.3 14.3 12.3 9.3 2.7 7.6 2.80.10 1679.3 273.5 20.4 36.4 34.1 14.4 32.1 8.830 0.02 71.1 10.7 13.9 5.4 4.0 2.3 3.9 3.40.05 1190.1 103.3 19.8 22.6 20.3 6.2 18.1 5.50.10 3440.3 916.9 29.0 63.4 59.7 40.5 56.0 27.740 0.02 222.1 22.7 16.2 8.9 6.7 3.5 6.5 5.70.05 1785.5 259.9 23.7 34.8 31.4 14.0 29.7 12.40.10 5712.8 1966.5 35.9 90.4 84.9 74.4 81.5 52.950 0.02 628.1 43.3 18.9 12.8 9.7 4.9 9.7 6.40.05 4271.7 534.5 28.9 46.5 42.8 22.6 40.8 20.40.10 4129.1 2998.6 44.7 119.5 111.8 104.6 106.3 75.3493.4. Two-step estimation and extended simulationsTable 3.2: Maximum average LRT distances under casewise contamination. Thesample size is n = 10p.Corr. p Rocke HSD Snip DMCDSc UF- UBF- UF- UBF-GSE GSE GRE-C GRE-CRandom 10 0.10 2.8 3.9 44.4 4.9 9.7 18.5 11.0 19.10.20 4.7 21.8 110.3 123.6 91.8 146.8 30.1 53.020 0.10 3.4 13.4 76.9 37.8 29.7 50.1 11.5 20.90.20 5.6 95.9 166.5 187.6 291.8 311.4 22.0 49.330 0.10 4.3 26.1 82.3 118.6 75.3 101.3 12.8 21.80.20 7.4 297.7 220.9 268.4 415.5 445.2 21.7 47.640 0.10 5.2 46.3 101.6 130.6 140.2 168.8 18.6 29.50.20 9.1 547.4 186.2 340.1 534.1 579.9 22.7 52.350 0.10 5.9 80.0 121.9 139.5 258.1 228.8 27.5 43.40.20 10.0 682.4 224.3 407.7 650.1 710.9 24.2 64.8AR1(0.9) 10 0.10 2.8 1.7 20.2 2.9 3.7 4.3 3.1 3.60.20 4.8 8.7 49.7 29.7 50.8 50.1 7.2 8.420 0.10 2.8 4.7 43.8 14.8 12.9 14.9 3.5 4.30.20 5.3 35.3 113.0 87.6 260.5 193.9 7.3 10.530 0.10 3.4 8.9 66.1 32.2 31.3 37.7 4.1 5.10.20 8.2 155.5 144.8 122.9 372.7 365.1 8.4 13.340 0.10 4.3 15.6 83.7 49.2 69.1 75.5 6.4 7.30.20 9.2 430.3 151.9 209.3 477.6 479.7 10.0 17.450 0.10 5.1 26.5 103.3 64.4 148.2 160.1 7.6 8.10.20 11.1 538.3 188.5 276.0 581.6 585.0 11.0 21.2We report the results for n = 10p only since the results for n = 5p are similar. Ta-ble 3.1 and Table 3.2 show the maximum average LRT distances under cellwise andcasewise contamination, respectively. In general, UF-GSE and UBF-GSE performsimilarly as UF-GRE-C and UBF-GRE-C, respectively, under cellwise contamina-tion. However, UF-GRE-C and UBF-GRE-C substantially outperforms UF-GSEand UBF-GSE under casewise contamination. The Rocke ρ function used in GRE inthe second step is capable of giving smaller weights to points that are at moderate-to-large distances from the main mass of points; see, for example, Figure 3.2 that showsthe average LRT distance behaviors of UBF-GSE and UBF-GRE-C for dimensionp = 30 and AR1(0.9) correlated data under 10% casewise contamination.Table 3.3 shows the finite sample relative efficiency under clean samples with503.4. Two-step estimation and extended simulations02550751002 4 6 8 10 12 14 16 18 20kLRTUBF−GSE UBF−GRE−CFigure 3.2: Average LRT distance behaviors of UBF-GSE and UBF-GRE-C forrandom correlations under 10% casewise contamination. The dimension is p = 30and the sample size is n = 10p.Table 3.3: Finite sample efficiency for random correlations. The sample size isn = 10p.p Rocke HSD Snip DMCDSc UF- UBF- UF- UBF-GSE GSE GRE-C GRE-C10 0.50 0.73 0.12 0.41 0.75 0.66 0.53 0.4820 0.57 0.92 0.09 0.56 0.83 0.73 0.59 0.5530 0.58 0.93 0.10 0.63 0.87 0.79 0.49 0.4440 0.60 0.94 0.10 0.68 0.89 0.83 0.39 0.3650 0.60 0.94 0.11 0.70 0.91 0.84 0.48 0.49random correlation for the considered robust estimates, taking the MLE averageLRT distances as the baseline. The results for the AR1(0.9) correlation are verysimilar and not shown here. As expected, UF-GSE and UBF-GSE show an increasingefficiency as p increases while UF-GRE-C and UBF-GRE-C have lower efficiency.Improvements can be achieved by using smaller α in the Rocke ρ function with sometrade-off in robustness. Results from this experiment are provided in Section B.1 inthe Appendix.Finally, we compare the computing times of the two-step procedures. Table513.5. Real data example: small-cap stock returns dataTable 3.4: Average “CPU time” – in seconds of a 2.8 GHz Intel Xeon – evaluatedusing the R command, system.time. The sample size is n = 10p.p UF- UBF- UF- UBF-GSE GSE GRE-C GRE-C10 0.7 1.1 0.1 0.220 7.7 11.0 1.2 1.730 34.5 45.6 5.4 6.340 120.5 144.9 14.5 16.950 278.4 338.0 33.0 37.03.4 shows the average computing times over all contamination settings for variousdimensions and for n = 10p. The computing times for the two-step procedure havebeen substantially improved with the implementation of the faster initial estimator,EMVE-C.3.5 Real data example: small-cap stock returnsdataIn this section, we consider the weekly returns from 01/08/2008 to 12/28/2010 (n =730 weeks) for a portfolio of 20 small-cap stocks (p = 20) from Martin (2013).The data set is different from the micro-cap stock returns data set in Chapter 1. Itcontains more correlated stock returns and contains relatively more moderate cellwiseoutliers as we will show next.The purpose of this example is fourfold: first, to show that the classical MLEand traditional robust procedures perform poorly on data affected by propagationof cellwise outliers; second, to show that the two-step procedures can provide betterestimates by filtering large outliers; third, that the bivariate-filter version of the two-step procedure provides even better estimates by flagging additional moderate cell-wise outliers; and fourth, that UBF-GRE-C can more effectively down-weight somehigh-dimensional casewise outliers than UBF-GSE, for this 20-dimensional dataset.Therefore, UBF-GRE-C provides the best results for the dataset.523.5. Real data example: small-cap stock returns datallllllllll lll lllllllll lllllllllllllllllllllllllllllllllll lllll llllllll lllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllll lllllllllll lllllllllllllllllllllllllllllllllllllll llllllll llllllllllll lllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll llllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllll llllllllllllllllllllllllllllllllll llllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllll ll llll llllllll lllllllllllllllllllllllllll ll ll l l lllll lll lll llllll llllllllllll lllllll ll lllllllllllllllllllllllllllllllllllllllllll llllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllll lllllllllllllllllllACO AIN BRY BWINB CBMCCC CHFC CNMD COKE GNCMAHGIC HTLD IMKTA LNN OFGPLAB PLXS SUR WSBC WTSstandard normalWeekly returnsFigure 3.3: Normal quantile–quantile plots of weekly returns. Weekly returns thatare three MAD away from the coordinatewise-median are shown in green.Figure 3.3 shows the normal QQ-plots of the 20 small-cap stocks returns in theportfolio. The bulk of the returns in all stocks seem roughly normal, but largeoutliers are clearly present for most of these stocks. Stocks with returns lying morethan three mads away from the coordinatewise-median (i.e., the large outliers) areshown in green in the figure. There is a total of 4.8% large cellwise outliers thatpropagate to 40.1% of the cases. Over 75% of these weeks correspond to the 2008financial crisis.First, we compute the MLE and the Rocke-S estimates, as well as the UF-GSEand the UBF-GSE estimates that were proposed in Chapter 2, for these data. Figure3.4 shows the squared Mahalanobis distances of the 157 weekly observations based onthe estimates. Weeks that contain large cellwise outliers (asset returns with valuesthree MAD away from the coordinatewise-median) are in green. From the figure,we see that the MLE and the Rocke-S estimates have failed to identify many of533.5. Real data example: small-cap stock returns datallllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllMLERockeUF−GSEUBF−GSE0501001500501001500501001500501001501/1/20087/8/20081/6/20097/7/20091/5/20107/6/2010WeeksSquared Mahalanobis distancesFigure 3.4: Squared Mahalanobis distances of the weekly observations in the small-cap asset returns data based on the MLE, the Rocke, the UF-GSE, and the UBF-GSEestimates. Weeks that contain one or more asset returns with values three MAD awayfrom the coordinatewise-median are in green.those weeks as MD outliers (i.e., failed to flag these weeks as having estimated fullMahalanobis distance exceeding the 99.99% quantile chi-squared distribution with20 degrees of freedom). The MLE misses all but seven of the 59 green cases. TheRocke-S estimate does slightly better but still misses one third of the green cases.This is because it is severely affected by the large cellwise outliers that propagate to40.1% of the cases. The UF-GSE estimate also does a relatively poor job. On theother hand, the UBF-GSE estimate successfully flaggs all but five of the 59 greencases.543.5. Real data example: small-cap stock returns datalllllllllllllllllllll lllllllllll llllll lllllllllllll llllllllllll llllllllllllllll lllll l llllllllllllllllllllllll lllllllllllll lll−0.10.00.10.2−0.1 0.0 0.1 0.2WTSHTLDllllllllllllllllllllllllllllll llllllllllllllll lllllll llllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllll−0.10.00.10.2−0.1 0.0 0.1 0.2HTLDWSBClllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllll lllll llllllllllllll llllllllllllll lllllll l−0.2−0.10.00.10.20.3−0.1 0.0 0.1 0.2WSBCSURFigure 3.5: Pairwise scatterplots of the asset returns data for WTS versus HTLD,HTLD versus WSBC, and WSBC versus SUR. Points with components flagged bythe univariate filter are in blue. Points with components additionally flagged by thebivariate filter are in orange.Figure 3.5 shows the pairwise scatterplots for WTS versus HTLD, HTLD versusWSBC, and WSBC versus SUR with the results from the univariate and the bivariatefilter. The points flagged by the univariate filter are in blue, and those flagged bythe bivariate filter are in orange. We see that the bivariate filter has identified someadditional cellwise outliers that are not-so-large marginally but become more visiblewhen viewed together with other correlated components. These moderate cellwiseoutliers account for 6.9% of the cells in the data and propagate to 56.7% of the cases.The final median weight assigned to these cases by UF-GSE and UBF-GSE are 0.50and 0.65, respectively. By filtering the moderate cellwise outliers, UBF-GSE makesa more effective use of the clean part of these partly contaminated data points (i.e.,the 56.7% of the cases).Figure 3.6 shows the squared Mahalanobis distances produced by UBF-GRE-Cand UBF-GSE, for comparison. Here, we see that UBF-GRE-C has missed only 3of the 59 green cases, while UBF-GSE has missed 6 of the 59. UBF-GRE-C has alsoclearly flagged weeks 36, 59, and 66 (with final weights 0.6, 0.0, and 0.0, respectively)as casewise outliers. In contrast, UBF-GSE gives final weights 0.8, 0.5, and 0.5 tothese cases. As shown in the simulations, UBF-GSE has difficulty down-weightingsome high dimensional outlying cases on datasets of high dimension.553.6. Conclusionslllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll36596667 96lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll36 59 666796UBF−GRE−CUBF−GSE1/1/20087/8/20081/6/20097/7/20091/5/20107/6/2010050100150050100150WeeksSquared Mahalanobis distancesFigure 3.6: Squared Mahalanobis distances of the weekly observations in the small-cap asset returns data based on the UBF-GSE and the UBF-GRE-C estimates.Weeks that contain one or more asset returns with values three MAD away fromthe coordinatewise-median are in green.In this example, UBF-GRE-C makes the most effective use of the clean part of thedata and has the best outlier detecting performance among the considered estimates.3.6 ConclusionsIn this chapter, we overcome two serious limitations of GSE and UF-/UBF-GSE inhigher dimensions (p ≥ 20). First, these estimators show an incontrollable increase inGaussian efficiency, which is paid off by a serious decrease in robustness, for larger p.Second, the initial estimator (extended minimum volume ellipsoids, EMVE) used byGSE and UF-/UBF-GSE does not scale well in higher dimensions because it requiresan impractically large number of subsamples to achieve a high breakdown point inlarger dimensions.To achieve a controllable efficiency/robustness trade off in higher dimensions, weequip GSE and UF-/UBF-GSE with a Rocke type loss function. To overcome the563.6. Conclusionshigh computational cost of the EMVE, we introduce a clustering-based subsamplingprocedure. We show via simulation studies that, in higher dimensions, estimatorsusing the proposed subsampling with only 50 subsamples can achieve equivalent oreven better performance than the usual uniform subsampling with 500 subsamples.57Chapter 4Robust Regression Estimation andInference in the Presence ofCellwise and CasewiseContamination4.1 IntroductionIn this chapter, we study another classic but fundamental problem, linear regressionestimation and inference, under the same contamination paradigm as in the previouschapters.The vast majority of procedures for robust linear regression are based on the clas-sical Tukey–Huber contamination model (THCM) in which a relatively small fractionof cases may be contaminated. High breakdown point affine equivariant estimatorssuch as least trimmed squares (Rousseeuw, 1984), S-regression (Rousseeuw & Yohai,1984) and MM-regression (Yohai, 1985) proceed by down-weighting outlying cases,which makes sense and works well in practice, under THCM. However, in some appli-cations, the contamination mechanism may be different in that random cells in a datatable (with rows as cases and columns as variables) are independently contaminated.In this paradigm, a small fraction of random cellwise outliers could propagate to arelatively large fraction of cases, breaking down classical high breakdown point affineequivariant estimators (see Alqallaf et al., 2009). Since cellwise and casewise outliersmay co-exist in some applications, our goal in this chapter is to develop a methodfor robust regression estimation and inference that can deal with both cellwise and584.1. Introductioncasewise outliers.There is a vast literature on robust regression for casewise outliers, but onlya scant literature for cellwise outliers and none for both types of outliers in theregression context. Recently, O¨llerer et al. (2015) combined the ideas of coordinatedescent algorithm (called the shooting algorithm in Fu, 1998) and simple S-regression(Rousseeuw & Yohai, 1984) to propose an estimator called the shooting S. Theshooting S-estimator assigns individual weight to each cell in the data table to handlecellwise outliers in the regression context. The shooting S-estimator is robust againstcellwise outliers and vertical response outliers.In this chapter, we propose a three-step regression estimator which combinesthe ideas of filtering cellwise outliers and robust regression via covariance matrixestimate (Maronna & Morgenthaler, 1986; Croux et al., 2003), namely 3S-regressionestimator. By filtering, here we mean detecting outliers and replacing them bymissing values as defined in Chapter 2. Our estimator proceeds as follows: first,it uses a univariate filter to detect and eliminate extreme cellwise outliers in orderto control the effect of outliers propagation; second, it applies a robust estimator ofmultivariate location and scatter to the filtered data to down-weight casewise outliers;third, it computes robust regression coefficients from the estimates obtained in thesecond step. With the choice of a filter that has simultaneous good sensitivity (iscapable of filtering outliers) and good specificity (can preserve all or most of the cleandata), the resulting estimator can be resilient to both cellwise and casewise outliers;furthermore, it attains consistency and asymptotic normality for clean data. Inthis regards, we propose a new filter that is consistent under some assumptions onthe tails of the covariates distributions. By consistent filter, we mean a filter thatasymptotically can preserve all the data when they are clean.The rest of the chapter is organized as follows. In Section 4.2, we introduce a newfamily of consistent filters. In Section 4.3, we introduce 3S-regression. In Section 4.4,we show some asymptotic properties of 3S-regression. In Section 4.5, we evaluatethe performance of 3S-regression in an extensive simulation study. In Section 4.6,we analyze two real data sets with cellwise and casewise outliers. In Section 4.7,we conclude with some remarks. Finally, we also provide all the proofs, additional594.2. Consistent filter for general continuous datasimulation results, and other related material in Appendix C.4.2 Consistent filter for general continuous dataFiltering is a method for preprocessing data in order to control the effect of potentialcellwise outliers. In this chapter, we pre-process the data by flagging outliers andreplacing them by missing values, NAs, similar to what was proposed in Chapter 2,as well as in other applications (see e.g., Danilov, 2010; Farcomeni, 2014b,c).Consistent filters are ones that do not filter good data points asymptotically.Gervini & Yohai (2002) introduced a consistent filter for normal residuals in re-gression estimation to achieve a fully-efficient robust regression estimator. Consis-tent filters are desirable because their good asymptotic properties are shared by thefollowing-up estimation procedure. In this chapter, we introduce a new family ofconsistent filters for univariate data that are sufficiently general in regression appli-cation.Consider a random variable X with a continuous distribution function G(x). Wedefine the scaled upper and lower tail distributions of G(x) as follows:F u(t) = PG(X − ηumed(X − ηu|X > ηu) ≤ t|X > ηu)andF l(t) = PG(ηl −Xmed(ηl −X|X < ηl) ≤ t|X < ηl).(4.1)Here, med stands for median, ηu = G−1(1 − α), ηl = G−1(α), and 0 < α < 0.5.We use α = 0.20, but other choices could be considered. To simplify the notation,we set su = med(X − ηu|X > ηu) and sl = med(ηl − X|X < ηl). Alternatively, acombined tails approach could be used for symmetric distributions as in Gervini &Yohai (2002).Let {X1, . . . , Xn} be a random sample from G, and let X(1) < X(2) < · · · < X(n)be the corresponding order statistics. Consistent estimators for (ηu, su, ηl, sl) are604.2. Consistent filter for general continuous datagiven byηun = G−1n (1− α), sun = med({Xi − ηun|Xi > ηun}),ηln = G−1n (α), sln = med({ηln −Xi|Xi < ηln}),whereG−1n (a) = X(dnae), 0 < a < 1, is the empirical quantile and med({Y1, . . . , Ym}) =Y(dm/2e) is the sample median (see Lemma C.1 in Section C.4 in the Appendix for aproof of the consistency for sun and sln). The empirical distribution functions for thescaled upper and lower tails in (4.1) are now given byF un (t) =∑ni=1 I(0 < (Xi − ηun)/sun ≤ t)∑ni=1 I(Xi > ηun)andF ln(t) =∑ni=1 I(0 < (ηln −Xi)/sln ≤ t)∑ni=1 I(Xi < ηln).Upper and lower tails outliers can be flagged by comparing the empirical distri-bution functions for the scaled tails with their expected distributions. We assumethat aside from contamination, F u and F l decay exponentially fast or faster. Let{a}+ = max(0, a) denote the positive part of a. Then, we define the proportions offlagged upper and lower tails outliers bydun = supt≥t0{F0(t)− F un (t)}+ and dln = supt≥t0{F0(t)− F ln(t)}+,where F0(t) = 1 − exp(− log(2)t) and t0 = 1/ log(2). When X − ηu|X > ηu isexponentially distributed with a rate of λu > 0, the standardized tail (X−ηu)/su|X >ηu have exponential distribution with a rate of log(2), leading to our choice of F0(t)and t0. Finally, we flag bnudunc of the most extreme points in the upper tail and flagbnldlnc of the most extreme points in the lower tail, where nu and nl are the number ofobservations in {Xi|Xi > ηun} and {Xi|Xi < ηln}, respectively. Equivalently, settingtun = min {t : F un (t) ≥ 1− dun} and tln = min{t : F ln(t) ≥ 1− dln},we filter Xi’s with Xi < ηln − slntln or Xi > ηun + suntun.We tried several heavy tail models for F0(t) including Pareto distributions with614.3. Three-step regressiondifferent tail indexes, and we found that the chosen exponential model strikes a goodbalance between the robustness and consistency of the filtering procedure.Theorem 4.1 below shows that our filter is consistent under the following assump-tion on the tails of G(x).Assumption 4.1. G(x) is continuous, and F u(t) and F l(t) satisfy the following:F0(t)− F u(t) ≤ 0, t ≥ t0 and F0(t)− F l(t) ≤ 0, t ≥ t0.Theorem 4.1. Suppose that Assumption 4.1 holds for G(x). Then, dun → 0 a.s. anddln → 0 a.s.Proof: See Section C.4 in the Appendix.In practice, the distributions F u(t) and F l(t) are unknown. To allow for someflexibility, Assumption 4.1 does not completely specify F u(t) and F l(t), but it onlyrequires that their upper tails are as heavy as or lighter than the upper tail of F0(t).4.3 Three-step regression4.3.1 The estimatorConsider the modelYi = α +Xtiβ + εi (4.2)for i = 1, . . . , n, where the error terms εi are i.i.d. and independent of the covariatesX i = (Xi1, . . . , Xip)t. The least squares (LS) estimator (αLS,βtLS) are defined as theminimizers of the sum squares of residuals,(αLS,βtLS) = arg min(α,βt)∈R(p+1)n∑i=1(Yi − α−X tiβ)2.624.3. Three-step regressionThe solution to this problem is explicit:βLS = S−1n,xxSn,xy,αLS = mn,y −mtn,xβLS.(4.3)Here, Sn,xx,Sn,xy,mn,y, and mn,x are the components of the empirical covariancematrix and mean:Sn =(Sn,xx Sn,xySn,yx Sn,yy)and mn =(mn,xmn,y)(4.4)for the joint data {Z 1, . . . ,Zn} with Z i = (X ti, Yi)t.Several authors (see Maronna & Morgenthaler, 1986; Croux et al., 2003) proposedto achieve robust regression and inference for casewise outliers by robustifying thecomponents in (4.3). Croux et al. (2003) replaced the empirical covariance matrix andmean by the multivariate S-estimator (Davies, 1987). We will refer to this approachas two-step regression (2S-regression). Croux et al. (2003) have shown that undermild assumptions (including symmetry of εi and independence of εi and X i) 2S-regression is Fisher consistent and asymptotically normal even if the S-estimatorsof multivariate location and scatter themselves are not consistent. Furthermore, 2S-regression is resilient to all kinds of outliers, that is, vertical outliers, bad leveragepoints, and good leverage points. Note that down-weighting good leverage pointscould lead to some efficiency loss, but it may also prevent the underestimation of thevariance of the estimator, which could be problematic for inferential purposes (seefor example, Ruppert & Simpson, 1990).To deal with casewise and cellwise outliers, we propose to use a generalized S-estimator that uses the consistent filter described in Section 4.2. The estimator issimilar to that in Agostinelli et al. (2015), but with the filter which is consistent fora broader range of distributions. This generality is needed in the regression setting.634.3. Three-step regressionOur proposed globally robust regression estimator, called 3S-regression, is given by:β3S = C−12S,xxC 2S,xyα3S = T2S,y − T t2S,mβ3S.(4.5)Here, (T 2S,C 2S) is a two-step generalized S-estimator computed as follows:Step 1. Filter extreme cellwise outliers to prevent cellwise contaminated cases fromhaving large robust Mahalanobis distances in Step 2, andStep 2. Down-weight casewise outliers by applying generalized S-estimator (GSE) formultivariate location and scatter (Danilov et al., 2012) to the filtered datafrom Step 1. The GSE is a generalization of the S-estimator for incompletedata that are missing completely at random (MCAR). Since the independentcontamination model (ICM) assumes that cells are outlying completely atrandom, the MCAR assumption is fulfilled if the ICM model holds.More precisely, consider a set of covariates {X 1, . . . ,Xn}. We perform univariatefiltering as described in Section 4.2 on each variable, {X1j, . . . , Xnj}, j = 1, . . . , p.Let {U n,1, . . . ,U n,n} be the resulting auxiliary vectors of zeros and ones with zerosindicating the filtered entry in X i. More precisely, U n,i = (Un,i1, . . . , Un,ip)t, whereUn,ij = I(ηlj,n − slj,ntlj,n ≤ Xi ≤ ηuj,n + suj,ntuj,n).The goal of the filter is to prevent propagation of cellwise outliers. If the fractionof cases with at least one flagged cell is very small (below 1%, say) then propagation ofcellwise outliers is not an issue and the filter can be safely turned off. The procedurethat turns the filter off when the fraction of affected cases is below a given smallthreshold, ξ, is considerably simpler to analyze from the asymptotic point of view.Moreover, it retains all the robustness properties derived from the filter. Let n0 =#{1 ≤ i ≤ n : U n,i = 1} be the number of complete observations after filtering. WesetU ∗n,i = 1I(n− n0n≤ ξ)+U n,iI(n− n0n> ξ), i = 1, . . . , n, (4.6)644.3. Three-step regressionwith ξ equal to some small threshold. In this chapter, we use ξ = 0.01.Finally, let Z = (Z 1, . . . ,Zn)t and Un = ((U ∗n,1, . . . ,U ∗n,n)t,1). The two-stepgeneralized S-estimator can now be defined asT 2S = T GS(Z,Un),C 2S = CGS(Z,Un),(4.7)where T GS and CGS are robust multivariate location and scatter generalized S-estimator for incomplete data, (Z,U), with Tukey’s bisquare rho function ρB(t) =min(1, 1 − (1 − t)3) and 50% breakdown point (see Chapter 2 for full definition).Note that when U = (1, . . . ,1) (i.e., when the input data is complete), the general-ized S-estimator reduces to S-estimator (Danilov et al., 2012).Alternatively, the second step can be replaced by GRE as introduced in Chapter3. In the chapter, GSE was shown to lose robustness against casewise outliers forhigher dimensional data. GRE has been proposed to remedy this problem. So, inthe case of large p in X, GRE may be a more appropriate choice for the secondstep than GSE. However, in this chapter, we generally focus on smaller to moderatedimensional data (e.g., p ≤ 15) and hence, GSE should be sufficient.4.3.2 Models with continuous and dummy covariatesFor models with continuous and dummy covariates, the direct computation of 3S-regression is likely to fail because the subsampling algorithm (needed to computethe generalized S-estimator) is likely to yield collinear subsamples. In this case, weendow 3S-regression with an iterative algorithm similar to that in Maronna & Yohai(2000) to deal with continuous and dummy covariates.Consider now the following model:Yi = α +Xtiβx +Dtiβd + εi (4.8)for i = 1, . . . , n where X i = (Xi1, . . . , Xipx)t is a px dimensional vector of continuouscovariates and Di = (Di1, . . . , Dipd)t is a pd dimensional vector of dummy covariates.654.3. Three-step regressionSet X = (X 1, . . . ,Xn)t, D = (D1, . . . ,Dn)t, and Y = (Y1, . . . , Yn)t. We assume thatthe columns in X and D are linearly independent.We modify the alternating M- and S-regression approach proposed by Maronna& Yohai (2000). Our algorithm uses 3S-regression to estimate the coefficients ofthe continuous covariates and regression M-estimators with Huber’s rho functionρH(t) = min(1, t2/2) (Huber & Ronchetti, 2009) to estimate the coefficients of thedummy covariates. More specifically, the algorithm works as follows:(αˆ(k), βˆ(k)x ) = g(X,Y − Dβˆ(k−1)d ),βˆ(k)d = M(D,Y − αˆ(k) − Xˆβˆ(k)x ), for k = 1, . . . , K,(4.9)where g(X,Y ) denotes the operation of 3S-regression for a response vector (Y ,X) asdefined in (4.5) and M(D,Y ) denotes the operation of regression M -estimator withno intercept for (Y ,D). We let Xˆ be the imputed X with the filtered entries imputedby the best linear predictor using Tˆ(k)and Cˆ(k), the generalized S-estimates at thek-th iteration as defined in (4.7). We use Xˆ instead of X to control the effect ofpropagation of cellwise outliers.As in Maronna & Yohai (2000), to calculate the initial estimates, (αˆ(0), βˆ(0)x , βˆ(0)d ),we first remove the effect of Di from the continuous covariates and the responsevariable. LetY = Y − Dt and X = X− DT,where t = M(D,Y ) and T is a pd × px-matrix with the j-th column as T j =M(D, (X1j, . . . , Xnj)t). Now, the initial estimates are defined by(αˆ(0), βˆ(0)tx ) = g(X,Y ),βˆ(0)d = M(D,Y − αˆ(0) − Xˆβˆ(0)x ).Finally, the procedure in (4.9) is iterated until convergence or until it reaches amaximum of K = 20 iterations. We choose K = 20 because our simulation has shownthat the procedure usually converges for K < 20, provided good initial estimates areused.664.4. Asymptotic properties of three-step regression4.4 Asymptotic properties of three-stepregressionTheorem 4.2 establishes the equivalence between 3S-regression and 2S-regression(Croux et al., 2003) for the case of continuous covariates. Let (α3S,βt3S) be the3S-regression estimator and (α2S,βt2S) be the 2S-regression estimator based on thesample {Z 1, . . . ,Zn}, where Z i = (X ti, Yi). Let G(x) and Gj(x) be the distributionfunctions for X i for Xij respectively.Theorem 4.2. Suppose that Assumption 4.1 holds for each Gj, j = 1, . . . , p. Then,with probability one, for sufficiently large n, α3S = α2S and β3S = β2S.Proof: See Section C.4 in the Appendix.Since 3S-regression becomes 2S-regression for sufficiently large n, 3S-regressioninherits the established asymptotic properties of 2S-regression. Corollary 4.3 statesthe strong consistency and asymptotic normality of 3S-regression. The corollaryrequires the following regularity assumptions that are needed for deriving the con-sistency and asymptotic normality of 2S-regression (see Croux et al., 2003).Assumption 4.2. Let Fε be the distribution of the error term εi in (4.2). Thedistribution Fε has a positive, symmetric and unimodal density fε.Assumption 4.3. For all v ∈ Rp and δ ∈ R, PG(X tiv = δ) < 1/2.Corollary 4.3. Suppose that Assumption 4.1 holds for each Gj, j = 1, . . . , p, andAssumption 4.2–4.3 hold. Denote θ3S = (α3S,βt3S)t and θ = (α,β t)t. Then,(a) θ3S → θ a.s..(b) Let H be the distribution of (X t, Y ) and let (T H ,CH) be the S-estimator func-tional (see Lopuhaa¨, 1989). We use the same partition outlined in (4.4) for(T H ,CH). Set X˜ = (1,Xt)t. Then,√n(θ3S − θ)→d N(0, ASV (H)),674.4. Asymptotic properties of three-step regressionwhereASV (H) = C(H)−1D(H)C(H)−1,and whereC(H) = EH{w(dH(Z ))X˜X˜t}+2σ2ε(H)EH{w′(dH(Z ))(Y − X˜ tθ)2X˜X˜ t},D(H) = EH{w2(dH(Z ))(Y − X˜ tθ)2X˜X˜ t},σε(H) =√CH,yy − β tCH,xxβ,dH(Z ) = (Z − T H)tC−1H (Z − T H),w(t) = ρ′B(t).Here, ρB(t) is the Tukey’s bisquare rho function.Remark 4.1. Croux et al. (2003) proved the Fisher consistency of 2S-regression, butthe strong consistency also follows from that and Theorem 3.2 in Lopuhaa¨ (1989).The asymptotic covariance matrix needed for inference can be estimated in thefollowing natural way. Let (mˆ, Sˆ) be the generalized S-estimate and (αˆ3S, βˆt3S) bethe 3S-regression estimate. Then, replace Z i = (Xti, Yi) by Zˆ i = (Xˆti, Yi) and X˜ i =(1,X ti)t bŷ˜X i = (1, Xˆti)t, where Xˆ i is the best linear prediction of X i (which ispossibly incomplete due to filter) using (Tˆ , Cˆ ). The identified cellwise outliers in X iare filtered and imputed in order to avoid the effect of propagation of outliers on theasymptotic covariance matrix estimation. Now,̂ASV (H) = Ĉ(H)−1D̂(H)Ĉ(H)−1,684.5. SimulationswhereĈ(H) =1nn∑i=1{w(dn(Zˆ i)) +2σˆ2ε,nw′(dn(Zˆ i))rˆ2i} ̂˜X î˜Xti,D̂(H) =1nn∑i=1w2(dn(Zˆ i))rˆ2î˜X î˜Xti,σˆε,n =√sˆyy − βˆt3SCˆxxβˆ3S,dn(Zˆ i) = (Zˆ i − Tˆ )tCˆ−1(Zˆ i − Tˆ ),rˆi = Yi − ̂˜X tiθˆ3S.Although the asymptotic covariance matrix formula is valid under clean data, weshall show in Section 4.5 that our proposed inference remains approximately valid inthe presence of a moderate fraction of cellwise and casewise outliers.In the case of continuous and dummy covariates, Maronna & Yohai (2000) derivedasymptotic results for the alternating regression M- and S-estimates. However, thereis no proof of asymptotic results when regression S-estimators are replaced by 2S-regression. The study of the asymptotic properties of the alternating M- and 2S-regression is worth of future research.4.5 SimulationsWe carried out extensive simulation studies in R (R Core Team, 2015) to investigatethe performance of 3S-regression by comparing it with least square (LS) and tworobust alternatives:(i) 2S-regression as in Croux et al. (2003). The location and scatter S-estimatorwith bisquare ρ function and 50% breakdown point is computed by an iterativealgorithm that uses an initial MVE estimator. The MVE estimator is computedby subsampling with a concentration step. This procedure is implemented inthe R package rrcov, function CovSest, option method="bisquare" (Todorov& Filzmoser, 2009); and(ii) Shooting S-estimator introduced in O¨llerer et al. (2015) with bisquare ρ func-694.5. Simulationstion and 20% breakdown point (for each simple regression) as suggested by theauthors to attain a good trade-off between robustness and efficiency. The R codeis available at http://feb.kuleuven.be/Viktoria.Oellerer/software.The generalized S-estimates needed by 3S-regression are computed using the R pack-age GSE, function GSE with default options (Leung et al., 2015). The regressionM-estimates needed by the alternating M- and 3S-regression are computed using theR package MASS, function rlm, option method="M" (Venables & Ripley, 2002). Fi-nally, the proposed procedures are implemented in the R package robreg3S, whichis freely available on CRAN (the Comprehensive R Archive Network, R Core Team,2015).4.5.1 Models with continuous covariatesWe consider the regression model in (4.2) with p = 15 and n = 150, 300, 500, 1000.The random covariates X i, i = 1, . . . , n, are generated from multivariate normaldistribution Np(µ,Σ). We set µ = 0 and Σjj = 1 for j = 1, . . . , p without lossof generality because GSE in the second step of 3S-regression is location and scaleequivariant. To address the fact that 3S-regression and the shooting S-estimatorare not affine-equivariant, we consider the random correlation structure for Σ asdescribed in Agostinelli et al. (2015). We fix the condition number of the randomcorrelation matrix at 100 to mimic the practical situation for data sets of similardimensions. Furthermore, to address the fact that the two estimators are not re-gression equivariant, we randomly generate β as β = Rb, where b has a uniformdistribution on the unit spherical surface and R is set to 10. We set α = 0 becauseGSE is location equivariant. The response variable Yi is given by Yi = Xtiβ + εi,where εi are independent (also independent of X i’s) identically normally distributedwith mean 0 and σ = 0.5. Finally, we consider the following scenarios:• Clean data: No further changes are done to the data;• Cellwise contamination: Randomly replace a fraction of the cells in the co-variates by outliers Xcontij = E(Xij) + k × SD(Xij) and proportion of the704.5. SimulationsTable 4.1: Maximum MSE in all the considered scenarios for models with continuouscovariates.Clean 1% Cellwise 5% Cellwise Casewisen = 150 300 150 300 150 300 150 3003S 0.012 0.005 0.039 0.020 0.902 0.797 0.223 0.143ShootS 0.034 0.017 0.134 0.080 1.129 0.912 1.570 1.4602S 0.010 0.004 0.025 0.014 3.364 3.041 0.109 0.122LS 0.009 0.004 2.723 2.440 4.812 4.732 8.286 8.182responses by outliers Y contij = E(Yij) + k × SD(εi), where k = 1, 2, . . . , 10;• Casewise contamination: Randomly replace a fraction of the cases by lever-age outliers (X contit, Y conti ), where Xconti = cv and Yconti = Xcontitβ + εconti withεconti ∼ N(k, σ2), where k = 1, 2, . . . , 15. Here, v is the eigenvector correspond-ing to the smallest eigenvalue of Σ with length such that (v−µ)tΣ−1(v−µ) = 1.Monte Carlo experiments show that the placement of outliers in this direction,v, is the least favorable for our estimator. We repeat the simulation study inAgostinelli et al. (2015) for dimension 16 and observe that c = 8 is the leastfavorable value for the performance of the scatter estimator.We consider = 0.01, 0.05 for cellwise contamination, and = 0.10 for casewisecontamination. The number of replicates for each setting is N = 1000.Coefficient estimation performanceWe examine the effect of cellwise and casewise outliers on the bias of the estimatedcoefficients. We evaluate the bias using the Monte Carlo mean squared error (MSE):MSE =1NN∑m=11pp∑j=1(βˆ(m)n,j − β(m)j )2where βˆ(m)n,j is the estimate for β(m)j at the m-th simulation run.Table 4.1 shows the MSE for clean data and the maximum MSE for all thecellwise and casewise contamination settings for n = 150, 300. Figure 4.1 shows the714.5. Simulations1% Cellwise 5% Cellwise Casewise0.00.10.20.30.40.00.51.01.50.00.51.01.52 4 6 8 10 2 4 6 8 10 2 4 6 8 10 12 14kMSEEstimators 3S ShootS 2S LSFigure 4.1: MSE for various cellwise and casewise contamination values, k, formodels with continuous covariates. The sample size is n = 300.curves of MSE for various cellwise and casewise contamination values for n = 300.The results for n = 150 are similar and the corresponding figure is shown in SectionC.1 in the Appendix.In the cellwise contamination setting, 3S-regression is highly robust against mod-erate and large cellwise outliers (k ≥ 3), but less robust against inliers (k ≤ 2).Notice that inliers also affect the performance of the shooting S-estimator but to alesser extent. Since the filter does not flag inliers, 3S-regression and 2S-regressionperform similarly in the presence of inliers (see the central panel of Figure 4.1).The shooting S-estimator is highly robust against large outliers, but less so againstmoderate cellwise outliers. As expected, 2S-regression breaks down in the case of = 0.05, when the propagation of large cellwise outliers is expected to affect morethan 50% of the cases.In the casewise contamination setting, 2S-regression has the best performance,as expected. 3S-regression also performs fairly well in this setting. The shootingS-estimator performs less satisfactorily in this case.We have also considered other simulation settings and observed similar results(not shown here). In particular, we considered p = 5 with n = 50, 100 and p = 25with n = 250, 500 under the same set of scenarios (clean data, cellwise contamination,and casewise contamination). Moreover, we studied the performance of 3S-regression724.5. Simulationsfor larger casewise contamination levels up to 20%. 3S-regression maintains its com-petitive performance, outperforming Shooting S and not falling too far behind 2S-regression, which is expected to win in these situations.Performance of confidence intervalsWe then assess the performance of confidence intervals for the regression coefficientsbased on the asymptotic covariance matrix as described in Section 4.4. Intervals thathave a coverage close to the nominal value, while being relatively short, are desirable.The 100(1− τ)% confidence interval (CI) of 3S-regression has the form:CI(βˆn,j) =[βˆn,j − Φ−1(1− τ/2)√ÂSV (βˆn,j)/n, βˆn,j + Φ−1(1− τ/2)√ÂSV (βˆn,j)/n],for j = 0, 1, . . . , p, where βˆn,0 = αˆn. We consider τ = 0.05 here. We evaluate theperformance of CI using the Monte Carlo mean coverage rate (CR):CR =1NN∑m=11pp∑j=1I(β(m)j ∈ CI(βˆ(m)n,j )),and the Monte Carlo mean CI lengths:CIL =1NN∑m=11pp∑j=12Φ−1(1− τ/2)√ÂSV (βˆn,j)/n.Figure 4.2 shows the CR in the case of clean data, 5% cellwise contamination(k = 5), and 10% casewise contamination (k = 3) simulation, with different samplesizes n = 150, 300, 500, 1000. The nominal value of 95% is indicated by the horizontalline in the figure.For clean data, the coverage rates of all the intervals reach the nominal level whenthe sample size grows, as expected. For data with casewise outliers, 2S-regressionyields the best coverage rate, which is closest to the nominal level. However, 3S-regression has an acceptable performance, comparable with that of 2S-regression.734.5. Simulationsnominal level nominal level nominal levelClean 5% Cellwise, k=5 10% Casewise, k=30.000.250.500.751.000.000.250.500.751.000.000.250.500.751.00250 500 750 1000 250 500 750 1000 250 500 750 1000nCREstimators a a a3S 2S LSFigure 4.2: CR for clean data and for cellwise and casewise contaminated data ofvarious sample size, n.Table 4.2: Average lengths of confidence intervals for clean data and for cellwise andcasewise contamination.Clean 1% Cell., k = 5 5% Cell., k = 5 10% Case., k = 3Size (n) 3S 2S 3S 2S 3S 2S 3S 2S150 0.341 0.352 0.355 0.402 0.450 1.519 0.329 0.355300 0.242 0.247 0.244 0.275 0.294 1.148 0.239 0.253500 0.187 0.189 0.190 0.212 0.222 0.912 0.189 0.1971000 0.133 0.133 0.134 0.150 0.155 0.662 0.137 0.140For data with cellwise outliers, 3S-regression yields intervals with a coverage raterelatively closer to the nominal value than LS and 2S-regression.Furthermore, the length of the intervals obtained from 3S regression is comparableto that LS for clean data and that of 2S-regression for clean data and data withcasewise outliers. For data with cellwise outliers, 3S-regression yields intervals withlengths relatively closer to the case of clean data. Table 4.2 shows the averagelengths of the confidence intervals obtained from 3S- and 2S-regression in the case ofclean data, 1% cellwise contamination (k = 5), 5% cellwise contamination (k = 5),and 10% casewise contamination (k = 3) simulation, with different sample sizesn = 150, 300, 500, 1000. The results of LS are not included here.In general, 3S-regression yields slightly shorter intervals than 2S-regression in744.5. Simulationsall scenarios because the asymptotic variance is calculated on the data with thefiltered cells imputed instead of the complete data. On the other hand, 2S-regressiontends to yield longer intervals in the cellwise contamination model, even when thepropagation of outliers is below the 0.5 breakdown point under THCM, for example,when ε = 0.01. This maybe because 2S-regression loses a significant amount of cleandata for estimation when it down-weights cases with outlying components.4.5.2 Models with continuous and dummy covariatesWe now conduct a simulation study to assess the performance of our procedurewhen the model includes continuous and dummy covariates. We consider the re-gression model in (4.8) with px = 12, pd = 3, and n = 150, 300. The randomcovariates (X i,Di), i = 1, . . . , n, are first generated from multivariate normal distri-bution Np(0,Σ), where Σ is the randomly generated correlation matrix with a fixedcondition number of 100. Then, we dichotomize Dij at Φ−1(pij) where pij = 14 ,13, 12for j = 1, 2, 3, respectively. Finally, the rest of data are generated in the same wayas described in Section 4.5.1.In the simulation study, we consider the following scenarios:• Clean data: No further changes are done to the data;• Cellwise contamination: Randomly replace a fraction of the cells in X byoutliers Xcontij = E(Xij) + k × SD(Xij) and proportion of the responses byoutliers Y contij = E(Yij) + k × SD(εi), where k = 1, 2, . . . , 10;• Casewise contamination: Let Σx be the sub-matrix of Σ with rows and columnscorresponding to the continuous covariates. Randomly replace a fractionof the cases in X by leverage outliers X conti = cv, where v is the eigen-vector corresponding to the smallest eigenvalue of Σx with length such that(v − µx)tΣ−1(v − µx) = 1. In this case, the number of continuous variablesis 13 (instead of 16) and the corresponding least favorable casewise contam-ination size is found to be c = 7 (instead of 8) using the same procedure754.5. SimulationsTable 4.3: Maximum MSE in all the considered scenarios for models with continuousand dummy covariates.Clean 1% Cellwise 5% Cellwise Casewisen = 150 300 150 300 150 300 150 3003S 0.010 0.004 0.018 0.008 0.636 0.507 0.090 0.071ShootS 0.012 0.005 0.026 0.015 0.746 0.468 0.450 0.3872S 0.008 0.003 0.014 0.007 1.894 1.341 0.060 0.054LS 0.007 0.003 2.785 2.532 5.162 4.981 1.332 1.322as in Section 4.5.1. Finally, we replace the corresponding response value byY conti = Xcontitβx +Dtiβd + εconti with εconti ∼ N(k, σ2), where k = 1, 2, . . . , 10.Again, we consider = 0.01, 0.05 for cellwise contamination, and = 0.10 for casewisecontamination. The number of replicates for each setting is N = 1000.1% Cellwise 5% Cellwise Casewise0.000.050.100.150.000.250.500.750.00.20.40.60.82 4 2 4 6 8 10 2 4 6 8 10kMSEEstimators 3S ShootS 2S LSFigure 4.3: MSE for various cellwise and casewise contamination values, k, formodels with continuous and dummy covariates. The sample size is n = 300.Table 4.3 shows the MSE for clean data and the maximum MSE for all thecellwise and casewise contamination settings for n = 150, 300. Figure 4.3 shows thecurves of MSE for various cellwise and casewise contamination values for n = 300.The results for n = 150 are similar and the corresponding figure is shown in SectionC.1 in the Appendix. Overall, 3S-regression remains competitive in the case ofcontinuous and dummy covariates.764.6. Real data example: Boston Housing dataWe also consider the case of non-normal covariates. The covariates are generatedfrom several asymmetric distributions, and the data are contaminated in a similarfashion. The performance of 3S-regression in the case of non-normal covariates issimilar to the performance in the case of normal covariates. Results are available inSection C.2 in the Appendix.4.6 Real data example: Boston Housing dataWe illustrate the effect of cellwise outlier propagation on classical robust estimatorsusing the Boston Housing data. The data, available at the UCI repository (Bache& Lichman, 2013), was collected from 506 census tracts in the Boston StandardStatistical Metropolitan Area in the 1970s on 14 different features. We consider thenine quantitative variables that were extensively studied (e.g., see in O¨llerer et al.,2015). The variables are listed and described in Table C.2 in the Appendix. There isno missing data. The original objective of the study in Harrison & Rubinfeld (1978)was to analyze the association between the median housing values (medv) in Bostonand the residents’ willingness to pay for clean air, as well as the association betweenmedv and those variables on the list.We fit the following model using 3S-regression, the shooting S-estimator, 2S-regression and the LS estimator:log(medv) = α + β1 log(crim) + β2 nox2 + β3 rm2 + βx,4 age+ β5 log(dis) + β6 tax+ β7 ptratio+ β8 black + β9 log(lstat) + ε.The regression coefficient estimates and their P-values are given in Table 4.4. Inparticular, we observe that the regression coefficients for the covariates age andblack are very different under 3S and 2S-regression. Moreover, age is significantunder 2S-regression but highly non-significant under 3S-regression. 2S-regression issomewhat inefficient because it throws away a substantial amount of clean data dueto the propagation of cellwise outliers. It fully down-weights 16.4% of the cases in thedataset (cases that receive a zero weight by the multivariate S-estimator). Slightly774.6. Real data example: Boston Housing dataTable 4.4: Estimates and p-values of the regression coefficients for the original BostonHousing data.Variable 3S ShootS 2S LSCoeff. P-Val. Coeff. P-Val. Coeff. P-Val. Coeff. P-Val.log(lstat) -0.243 <0.001 -0.266 - -0.153 <0.001 -0.395 <0.001rm2 0.015 <0.001 0.013 - 0.018 <0.001 0.007 <0.001tax -0.051 <0.001 -0.021 - -0.046 <0.001 -0.028 0.006log(dis) -0.125 <0.001 -0.157 - -0.126 <0.001 -0.139 <0.001ptratio -0.026 <0.001 -0.027 - -0.025 <0.001 -0.029 <0.001nox2 -0.578 0.013 -0.463 - -0.445 0.023 -0.451 <0.001age -0.023 0.645 -0.040 - -0.152 0.001 0.050 0.391black -0.726 0.398 0.787 - -0.007 0.993 0.500 <0.001log(crim) -0.006 0.513 0.004 - 0.005 0.527 -0.002 0.813Table 4.5: Pairwise squared norm distances between the estimates for the originalBoston housing data.3S ShootS 2S LS3S - 1.389 3.145 6.725ShootS - 4.312 4.6612S - 16.614LS -more than half of these cases (8.7%) are affected by the propagation of cellwiseoutliers mainly in the covariates nox2 and black (1.3% of the cells in the datasetare flagged by the consistent filter). After filtering, these cases have relatively smallpartial Mahalanobis distances, indicating they are close to the bulk of the data forthe remaining variables.We further compare the four estimators by computing their squared norm dis-tances, n×∑pj=1(β̂n,j,A− β̂n,j,B)2×MAD({X1j, . . . , Xnj})2 (see O¨llerer et al., 2015),where MAD is the median absolute deviation. Table 4.5 shows the squared normdistances for the considered estimators. Overall, the three robust estimators are verydifferent from LS. As expected, 3S-regression and shooting S are closer to each otherthan they are to 2S-regression. We next illustrate that the observed differences be-tween the three robust estimators are indeed mostly caused by the propagation ofcellwise outliers in the Boston housing data.784.6. Real data example: Boston Housing dataTable 4.6: Estimates and p-values of the regression coefficients for the imputedBoston Housing data.Variable 3S ShootS 2S LSCoeff. P-Val. Coeff. P-Val. Coeff. P-Val. Coeff. P-Val.log(lstat) -0.243 0.000 -0.264 - -0.227 <0.001 -0.385 <0.001rm2 0.015 0.000 0.013 - 0.014 <0.001 0.009 <0.001tax -0.051 0.000 -0.030 - -0.047 <0.001 -0.032 0.002log(dis) -0.125 0.000 -0.161 - -0.129 <0.001 -0.144 <0.001ptratio -0.026 0.000 -0.028 - -0.025 <0.001 -0.027 <0.001nox2 -0.578 0.013 -0.522 - -0.619 0.010 -0.479 <0.001age -0.023 0.645 -0.037 - -0.037 0.471 0.051 0.386black -0.726 0.398 0.371 - -0.882 0.376 -0.206 0.519log(crim) -0.006 0.513 -0.001 - -0.012 0.233 -0.012 0.213Table 4.7: Pairwise squared norm distances between the estimates for the imputedBoston housing data.3S ShootS 2S LS3S - 0.862 0.172 5.486ShootS - 1.158 3.9922S - 6.366LS -Recall that half of the cases fully down-weighted by 2S-regression have entriesflagged as cellwise outliers. We replace these flagged cells by their best linear pre-dictions (using the 3S-regression estimate) and then, refit the model with the fourconsidered estimators. The resulting coefficient estimates and their P-values aregiven in Table 4.6. Notice that the covariate age is no longer significant under 2S-regression. Moreover, Table 4.7 shows the norm distances between all the estimatescalculated from such imputed data. Now, 2S-regression is considerably closer to thecellwise robust estimators, and it no longer fully down-weights the cases formerlyaffected by cellwise outliers (the median weight of these cases is now 0.64, closerto the overall median weight, 0.69). The LS estimator remains different from therobust estimators, possibly due to the existence of casewise outliers in the data.MM-regression (Yohai, 1985) behaves similarly to 2S-regression in this example.794.7. Conclusions4.7 ConclusionsHigh breakdown point affine equivariant robust estimators are neither efficient norrobust in the independent cellwise contamination model (ICM). By efficiency herewe mean the ability to use the clean part of the data. In fact, classical robustestimators are inefficient under ICM because they may down-weight an entire rowwith a single component being contaminated. Therefore, they may lose some usefulinformation contained in the data. Furthermore, the classical high breakdown pointaffine equivariant robust estimators may break down under ICM. A small fraction ofcellwise outliers could propagate, affecting a large proportion of cases. For instance,the probability that at least one component of a case is contaminated is =1− (1− )p, where is the proportion of independent cellwise outliers. This impliesthat even if is small, could be large for large p, and could exceed the 0.5 breakdownpoint under THCM. For example, if = 0.1 and p = 10, then = 0.65; and if = 0.05and p = 20, then = 0.64.To overcome these deficiencies of the classical robust estimators, we introduce athree-step regression estimator that can deal with cellwise and casewise outliers. Thefirst step of our estimator is aimed at reducing the impact of outliers propagationposed by ICM. The second step is aimed at achieving robustness under THCM. Asa result, the robust regression estimate from the third step is shown to be efficient(in terms of data usage) and robust under ICM and THCM. We also prove that ourestimator is consistent and asymptotically normal at the central regression modeldistribution. Finally, we extend our estimator to models with continuous and dummycovariates and provide an algorithm to compute the regression coefficients.80Chapter 5ConclusionsIn this thesis, two important aspects of robust analysis are studied:• Multivariate location and scatter matrix under cellwise and casewise contami-nation;• Regression analysis under cellwise and casewise contamination.The following is an outline of the main results, some limitations that the proposedmethods have, and the directions we foresee for future work.In Chapter 2 and 3, a two-step procedure for estimating multivariate locationand scatter matrix is proposed. Four estimators are derived from the procedure:• UF-GSE for less correlated data in moderate dimensions (p ≤ 15);• UBF-GSE for more correlated data in moderate dimensions (p ≤ 15);• UF-GRE for less correlated data in high dimensions (p > 15); and• UBF-GRE for more correlated data in high dimensions (p > 15).Simulation results have shown that these estimators provide fairly high resistanceagainst cellwise and casewise outliers, when comparing with the best performingrobust estimators in their settings. However, the two-step procedure still has somelimitations. For sample sizes 2p < n ≤ 5p, the estimator may encounter convergenceproblems and result in close-to-singular estimates. The problems may be remediedby using graphical lasso (GLASSO; Friedman et al., 2008) to make close-to-singular(or even singular) estimates better conditioned. However, for sample sizes n ≤ 2p,the estimators fail to exist.81Chapter 5. ConclusionsThere is a recently proposed cellwise robust estimator for high dimensional datawith small n/p called the GGQ estimator by O¨llerer & Croux (2015), which in spiritcoincides with the work of Tarr et al. (2016). The GGQ estimator is defined by aprocedure of calculating correlations pairwise using normal scores (also known as theGaussian rank correlation), then applying GLASSO to the pairwise scatter matrix.Because the correlation estimation is done pairwise, it can handle data with n < p.In our preliminary study, we have found that GGQ exhibits fairly high robustnessagainst cellwise outliers, but is not so robust against casewise outliers. We believethat pairwise estimation itself is not sufficient to deal with casewise outliers and finelystructured high dimensional data. Hence, there is still a need for further research onhigh dimensional estimation in the presence of cellwise and casewise contaminationwhen n/p is small.In Chapter 4, a three-step procedure for robust regression with continuous co-variates is proposed. The procedure coincides with the classical two-step procedurefor robust regression of Croux et al. (2003), when the first step of filtering is re-moved. The method is extended to handle both continuous and dummy covariates.Simulation results and example have shown that the procedure handles both cell-wise outliers and casewise outliers similarly well. Asymptotic results are providedfor the case of continuous covariates, but no results are available for the case ofcontinuous and dummy covariates, which is a major limitation of the procedure. In-terestingly, there are also no asymptotic results for the classical procedure of Crouxet al. (2003), in this setting. Hence, we believe the study of the asymptotic propertiesof the extended procedure for regression with continuous and dummy covariates is aworthwhile project for future research.Due to the novelty of the topic of cellwise and casewise contamination, we hopethat in general further research will follow this thesis.82BibliographyAgostinelli, C., Leung, A., Yohai, V. J., & Zamar, R. H. (2015). Robust estima-tion of multivariate location and scatter in the presence of cellwise and casewisecontamination. TEST, 24 (3), 441–461. iii, 10, 11, 42, 63, 70, 71Alqallaf, F. (2003). A New Contamination Model for Robust Estimation with LargeHigh-Dimensional Data Sets. PhD thesis, University of British Columbia. 28Alqallaf, F., Van Aelst, S., Yohai, V. J., & Zamar, R. H. (2009). Propagation ofoutliers in multivariate data. Ann Statist, 37 (1), 311–331. 13, 14, 58Alqallaf, F. A., Konis, K. P., Martin, R. D., & Zamar, R. H. (2002). Scalable robustcovariance and correlation estimates for data mining. In Proceedings of the eighthACM SIGKDD international conference on Knowledge discovery and data mining,KDD ’02, (pp. 14–23). 14Bache, K. & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. 77Croux, C. & O¨llerer, V. (2015). Comments on: Robust estimation of multivariatelocation and scatter in the presence of cellwise and casewise contamination. TEST,24 (3), 462–466. 37, 38Croux, C., van Aelst, S., & Dehon, C. (2003). Bounded influence regression usinghigh breakdown scatter matrices. 55, 265–285. 59, 63, 67, 68, 69, 82Danilov, M. (2010). Robust Estimation of Multivariate Scatter under Non-AffineEquivarint Scenarios. PhD thesis, University of British Columbia. 11, 15, 28, 6083BibliographyDanilov, M., Yohai, V. J., & Zamar, R. H. (2012). Robust estimation of multivariatelocation and scatter in the presence of missing data. J Amer Statist Assoc, 107,1178–1186. iii, 16, 23, 24, 37, 38, 39, 41, 45, 47, 48, 64, 65, 97, 112Davies, P. (1987). Asymptotic behaviour of S-estimators of multivariate locationparameters and dispersion matrices. Ann Statist, 15, 1269–1292. 13, 25, 63Donoho, D. L. (1982). Breakdown Properties of Multivariate Location Estimators.PhD thesis, Harvard University. 13, 14Farcomeni, A. (2014a). Robust constrained clustering in presence of entry-wise out-liers. Technometrics, 56, 102–111. 15, 26Farcomeni, A. (2014b). Robust constrained clustering in presence of entry-wise out-liers. Technometrics, 56, 102–111. 60Farcomeni, A. (2014c). Snipping for robust K-means clustering under component-wise contamination. Stat Comp, 24, 909–917. 60Farcomeni, A. & Leung, A. (2014). snipEM: Snipping methods for robust estimationand clustering. R package version 1.0. 27Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estima-tion with the graphical lasso. Biostatistics, 9 (3), 432–441. 81Fu, W. (1998). Penalized regressions: The bridge versus the lasso. J Comput GraphStatist, 7 (3), 397–416. 59Gervini, D. & Yohai, V. J. (2002). A class of robust and fully efficient regressionestimators. Ann Statist, 30 (2), 583–616. 16, 17, 60Gnanadesikan, R. & Kettenring, J. R. (1972). Robust estimates, residuals, andoutlier detection with multiresponse data. Biometrics, 28, 81–124. 19, 46Hall, P., Marron, J., & Neeman, A. (2005). Geometric representation of high di-mension, low sample size data. J R Stat Soc Ser B Stat Methodol, 67, 427–444.4584BibliographyHarrison, D. & Rubinfeld, D. L. (1978). Hedonic prices and the demand for cleanair. J Environ Econ Manage, 5, 81–102. 77Huber, P. J. (1981). Robust Statistics. New York: John Wiley & Sons. 14Huber, P. J. & Ronchetti, E. M. (2009). Robust Statistics (2nd edition). New Jersey:John Wiley & Sons. 66Kaveh, V. (2015). DetMCD: DetMCD Algorithm (Robust and Deterministic Estima-tion of Location and Scatter). R package version 0.0.2. 27Leung, A., Danilov, M., Yohai, V., & Zamar, R. (2015). GSE: Robust Estimationin the Presence of Cellwise and Casewise Contamination and Missing Data. Rpackage version 3.2.3. iii, 11, 27, 48, 70Leung, A., Zhang, H., & Zamar, R. (2015). robreg3S: Three-Step Regression andInference for Cellwise and Casewise Contamination. R package version 0.4. iv, 11Leung, A., Zhang, H., & Zamar, R. (2016). Robust regression estimation and infer-ence in the presence of cellwise and casewise contamination. Comput Statist DataAnal, 99, 1–11. iv, 11Lopuhaa¨, H. P. (1989). On the relation between S-estimators and M-estimators ofmultivariate location and covariance. Ann Statist, 17, 1662–1683. 67, 68Maronna, R. A. (2015). Comments on: Robust estimation of multivariate locationand scatter in the presence of cellwise and casewise contamination. TEST, 24 (3),471–472. 37, 42Maronna, R. A., Martin, R. D., & Yohai, V. J. (2006). Robust Statistics: Theoryand Methods. Chichister: John Wiley & Sons. 25, 43, 48Maronna, R. A. & Morgenthaler, S. (1986). Robust regression through robust co-variance matrices. Comm Statist Theory Methods, 15, 1347–1365. 59, 6385BibliographyMaronna, R. A. & Yohai, V. J. (2000). Robust regression with both continuous andcategorical predictors. J Statist Plann Inference, 89, 197–214. 65, 66, 69Maronna, R. A. & Yohai, V. J. (2015). Robust and efficient estimation of highdimensional scatter and location. arXiv:1504.03389 [math.ST]. 5, 26, 37Martin, R. (2013). Robust covariances: Common risk versus specific risk outliers.Presented at the 2013 R-Finance Conference, Chicago, IL, www.rinfinance.com/agenda/2013/talk/DougMartin.pdf, visited 2016-08-24. 6, 52O¨llerer, V., Alfons, A., & Croux, C. (2015). The shooting S-estimator for robustregression. 59, 69, 77, 78O¨llerer, V. & Croux, C. (2015). Robust high-dimensional precision matrix estima-tion. In K. Nordhausen & S. Taskinen (Eds.), Modern Nonparametric, Robustand Multivariate Methods: Festschrift in Honour of Hannu Oja (pp. 325–350).Springer. 27, 82Pen˜a, D. & Prieto, F. J. (2001). Multivariate outlier detection and robust covariancematrix estimation. Technometrics, 43, 286–310. 26R Core Team (2015). R: A Language and Environment for Statistical Computing.Vienna, Austria: R Foundation for Statistical Computing. iii, 11, 25, 69, 70Rocke, D. M. (1996). Robustness properties of S-estimators of multivariate locationand shape in high dimension. Ann Statist, 24, 1327–1345. 26, 37, 42, 43, 102Rousseeuw, P. (1984). Least median of squares regression. J Amer Statist Assoc, 79,871–880. 58Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. InW. Grossmann, G. Pflug, I. Vincze, & W. Wertz (Eds.), Mathematical statisticsand applications, volume B (pp. 256–272). Dordrecht: Reidel Publishing Company.13, 3886BibliographyRousseeuw, P. J. & Croux, C. (1993). Alternatives to the median absolute deviation.J Amer Statist Assoc, 88, 1273–1283. 46Rousseeuw, P. J. & Van den Bossche, W. (2015). Comments on: Robust estima-tion of multivariate location and scatter in the presence of cellwise and casewisecontamination. TEST, 24 (3), 473–477. 18, 27, 37Rousseeuw, P. J. & Van Driessen, K. (1999). A fast algorithm for the minimumcovariance determinant estimator. Technometrics, 41, 212–223. 25Rousseeuw, P. J. & Yohai, V. J. (1984). Robust regression by means of S-estimators.In J. Franke, W. Ha¨rdle, & D. Martin (Eds.), Robust and Nonlinear Time Series,volume 26 of Lecture Notes in Statistics (pp. 256–272). New York, US: Springer.58, 59Ruppert, D. & Simpson, D. (1990). Unmasking multivariate outliers and leveragepoints: Comment. J Amer Statist Assoc, 85, 644–646. 63Smith, R. E., Campbell, N. A., & Lichfield, A. (1984). Multivariate statistical tech-niques applied to pisolitic laterite geochemistry at Golden Grove, Western Aus-tralia. J Geochem Explor, 22, 193–216. 3, 33Stahel, W. A. (1981). Breakdown of covariance estimators. Technical Report 31,Fachgruppe fu¨r Statistik, ETH Zu¨rich, Switzerland. 13, 14Tarr, G., Mu¨ller, S., & Weber, N. (2016). Robust estimation of precision matricesunder cellwise contamination. Comput Statist Data Anal, 93, 404–420. 82Tatsuoka, K. S. & Tyler, D. E. (2000). On the uniqueness of S-functionals andM-functionals under nonelliptical distributions. Ann Statist, 28, 1219–1243. 13,37Todorov, V. & Filzmoser, P. (2009). An object-oriented framework for robust mul-tivariate analysis. Journal of Statistical Software, 32 (3), 1–47. 26, 6987Van Aelst, S. (2015). Comments on: Robust estimation of multivariate locationand scatter in the presence of cellwise and casewise contamination. TEST, 24 (3),478–481. 37Van Aelst, S., Vandervieren, E., & Willems, G. (2012). A Stahel-Donoho estimatorbased on Huberized outlyingness. Comput Statist Data Anal, 56, 531–542. 14, 15,26Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S (Fourthed.). New York: Springer. ISBN 0-387-95457-0. 70Welsch, R. (2015). Comments on: Robust estimation of multivariate location andscatter in the presence of cellwise and casewise contamination. TEST, 24 (3),482–483. 37Yohai, V. J. (1985). High breakdown point and high efficiency robust estimates forregression. Technical Report 66, Department of Statistics, University of Washing-ton. available at http://www.stat.washington.edu/research/reports/1985/tr066.pdf, visited 2016-08-24. 58, 79, 9788Appendix ASupplementary material forChapter 2A.1 Additional tables from the simulation studyin Section 2.4Table A.1: Maximum average LRT distances under cellwise contamination. Thesample size is n = 5p.Corr. p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSERandom 5 0.05 4.1 1.5 4.1 2.3 52.8 3.4 3.5 5.80.10 14.3 12.8 63.9 9.0 154.2 8.7 7.9 10.310 0.05 9.7 13.4 32.0 11.7 18.7 5.8 5.6 5.80.10 142.9 134.4 156.3 56.9 66.4 16.4 16.5 16.115 0.05 41.2 60.9 62.9 30.8 13.0 9.0 9.2 9.60.10 198.5 198.4 202.6 134.7 26.9 21.3 21.5 21.520 0.05 72.9 94.8 90.7 55.9 15.4 11.2 12.2 12.40.10 240.9 240.7 242.1 243.4 19.8 26.1 25.1 25.1AR1(0.9) 5 0.05 3.8 1.5 3.7 1.5 39.6 2.3 3.2 5.90.10 21.3 15.6 44.5 2.6 117.6 5.3 4.6 9.210 0.05 15.4 19.8 44.6 3.7 11.9 5.1 3.6 2.70.10 219.1 187.4 220.6 19.8 90.7 15.1 11.8 5.315 0.05 96.3 99.9 134.2 12.2 12.5 9.7 6.8 3.30.10 367.3 369.6 387.6 83.9 55.4 26.8 21.8 8.820 0.05 174.4 197.3 235.9 28.5 19.1 14.9 10.9 4.40.10 518.5 526.5 557.8 260.8 34.3 39.8 33.7 15.389A.1. Additional tables from the simulation study in Section 2.4Table A.2: Maximum average LRT distances under casewise contamination. Thesample size is n = 5p.Corr. p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSERandom 5 0.10 4.1 1.5 3.0 1.7 8.2 3.4 4.7 9.70.20 32.3 12.1 30.5 7.2 29.9 26.3 35.9 54.310 0.10 11.1 3.9 4.7 5.1 27.8 9.0 15.9 28.90.20 128.2 71.0 16.6 28.1 57.2 62.9 86.6 102.715 0.10 28.7 7.6 4.9 8.8 41.3 26.3 33.1 47.90.20 146.5 109.3 20.5 64.1 81.5 86.9 122.1 143.220 0.10 76.6 17.8 6.7 16.0 57.6 49.2 59.7 67.50.20 167.7 141.9 19.0 111.0 103.4 110.1 154.4 183.3AR1(0.9) 5 0.10 3.9 1.4 1.9 1.6 8.9 2.5 2.2 3.80.20 18.7 8.0 28.4 3.7 22.6 6.9 11.2 13.310 0.10 9.4 3.9 3.3 2.7 19.1 4.8 4.9 5.90.20 122.7 59.2 29.1 11.6 44.8 31.0 69.0 57.215 0.10 19.3 7.1 4.3 3.9 29.8 8.5 10.1 13.10.20 139.8 98.0 32.6 22.8 64.0 69.5 100.2 103.420 0.10 72.3 16.3 5.6 6.3 48.7 16.6 34.9 37.30.20 161.5 127.2 23.6 48.5 87.5 108.5 129.3 133.8Table A.3: Finite sample efficiency for first order autoregressive correlations, AR1(ρ),with ρ = 0.9. The sample size is n = 5p.p MCD MVE-S Rocke HSD Snip DMCDSc UF- UBF-GSE GSE5 0.22 0.62 0.50 0.50 0.21 0.30 0.45 0.2910 0.32 0.86 0.55 0.71 0.08 0.39 0.74 0.6115 0.42 0.94 0.55 0.83 0.28 0.44 0.84 0.7120 0.48 0.96 0.56 0.88 0.17 0.48 0.87 0.7790A.2. Proofs of propositions and theoremA.2 Proofs of propositions and theoremProof of Proposition 2.1Proof. Without loss of generality, set µ0 = 0 and σ0 = 1. Let Z0i =Xi−µ0σ0= Xi andZi =Xi−T0nS0n. Denote the empirical distributions of Z01, . . . , Z0n and Z1, . . . , Zn byF+0n(t) =1nn∑i=1I (|Z0i| ≤ t) and F+n (t) =1nn∑i=1I (|Zi| ≤ t) .By assumption, with probability one, there exists n1 such that n ≥ n1 implies0 < 1− δ ≤ S0n ≤ 1 + δ and −δ ≤ T0n ≤ δ, and we haveF+n (t) =1nn∑i=1I (−t ≤ Zi ≤ t) = 1nn∑i=1I(−t ≤ Xi − T0nS0n≤ t)=1nn∑i=1I (−tS0n + T0n ≤ Xi ≤ tS0n + T0n)≥ 1nn∑i=1I (−t(1− δ) + T0n ≤ Xi ≤ t(1− δ) + T0n)≥ 1nn∑i=1I (−t(1− δ) + δ ≤ Xi ≤ t(1− δ)− δ)=1nn∑i=1I (|Xi| ≤ t(1− δ)− δ) = F+0n(t(1− δ)− δ).Now, by the Glivenko–Cantelli Theorem, with probability one there exists n2 suchthat n ≥ n2 implies that supt |F+0n(t)−F+0 (t)| ≤ ε/2. Also, by the uniform continuityof F+0 , given ε > 0, there exists δ > 0 such that |F+0 (t(1− δ)− δ)− F+0 (t)| ≤ ε/2.Finally, note thatF+n (t) ≥ F+0n(t(1− δ)− δ)=(F+0n(t(1− δ)− δ)− F+0 (t(1− δ)− δ))+ (F+0 (t(1− δ)− δ)− F+0 (t)) + (F+0 (t)− F+(t)) + F+(t).91A.2. Proofs of propositions and theoremLet n3 = max(n1, n2), then n ≥ n3 implysupt>η(F+(t)− F+n (t)) ≤ supt>η∣∣F+0 (t(1− δ)− δ)− F+0n(t(1− δ)− δ)∣∣+ supt>η∣∣F+0 (t)− F+0 (t(1− δ)− δ)∣∣+ supt>η(F+(t)− F+0 (t))≤ ε2+ε2+ 0 = ε.This implies that n0/n→ 0 a.s..Proof of Proposition 2.2We need the following lemma for the proof.Lemma A.1. Consider a sample of p-dimensional random vectorsX 1, . . . ,Xn. Also,consider a pair of multivariate location and scatter estimators T 0n and C 0n. Supposethat T 0n → µ0 and C 0n → Σ0 a.s.. Let Di = (X i − T 0n)tC−10n (X i − T 0n) andDi = (X i − µ0)tΣ−10 (X i − µ0). Given K < ∞. For all i = 1, . . . , n, if D0i ≤ K ,then:Di → D0i a.s..Proof of Lemma A.1. Note that|Di −D0i| = |(X i − T 0n)tC−10n (X i − T 0n)− (X i − µ0)tΣ−10 (X i − µ0)|= |((X i − µ0) + (µ0 − T 0n))t(Σ−10 + (C−10n −Σ−10 ))((X i − µ0) + (µ0 − T 0n))− (X i − µ0)tΣ−10 (X i − µ0)|≤ |(µ0 − T 0n)tΣ−10 (µ0 − T 0n)|+ |(µ0 − T 0n)t(C−10n −Σ−10 )(µ0 − T 0n)|+ |2(X i − µ0)tΣ−10 (µ0 − T 0n)|+ |2(X i − µ0)t(C−10n −Σ−10 )(µ0 − T 0n)|+ |(X i − µ0)t(C−10n −Σ−10 )(X i − µ0)|= An +Bn + Cn +Dn + En.By assumption, there exists n1 such that for n ≥ n1 implies An ≤ ε/5 andBn ≤ ε/5.92A.2. Proofs of propositions and theoremNext, note that|(X i − µ0)tΣ−1/20 y| = |y tΣ−1/20 (X i − µ0)|≤ ||y||||Σ−1/20 (X i − µ0)|| = ||y||√(X i − µ0)tΣ−10 (X i − µ0) ≤ ||y||√K.So, there exists n2 such that n ≥ n2 impliesCn = |2(X i − µ0)tΣ−10 (µ0 − T 0n)|= |2(X i − µ0)tΣ−1/20 Σ−1/20 (µ0 − T 0n)|≤ 2||Σ−1/20 (µ0 − T 0n)||√K≤ ε/5.Similarly, there exists n3 such that n ≥ n3 impliesDn = |2(X i − µ0)t(C−10n −Σ−10 )(µ0 − T 0n)|= |2(X i − µ0)tΣ−1/20 Σ1/20 (C−10n −Σ−10 )(µ0 − T 0n)|≤ 2||Σ1/20 (C−10n −Σ−10 )(µ0 − T 0n)||√K≤ ε/5.Also, there exists n4 such that n ≥ n4 impliesEn = |(X i − µ0)t(C−10n −Σ−10 )(X i − µ0)|= |(X i − µ0)tΣ−1/20 Σ1/20 (C−10n −Σ−10 )(X i − µ0)|≤ ||Σ1/20 (C−10n −Σ−10 )(X i − µ0)||√K≤ ||(C−10n −Σ−10 )|| ||Σ1/20 (X i − µ0)||√K≤ ||(C−10n −Σ−10 )||K≤ ε/5.93A.2. Proofs of propositions and theoremFinally, let n5 = max{n1, n2, n3, n4}, then for all i, n ≥ n5 implies|Di −D0i| ≤ ε/5 + ε/5 + ε/5 + ε/5 + ε/5 = ε.Proof of Proposition 2.2. LetD0i = (X i−µ0)tΣ−10 (X i−µ0) andDi = (X i−T 0n)tC−10n (X i−T 0n). Denote the empirical distributions of D01, . . . , D0n and D1, . . . , Dn byG0n(t) =1nn∑i=1I (D0i ≤ t) and Gn(t) = 1nn∑i=1I (Di ≤ t) .Note that|Gn(t)−G0n(t)| =∣∣∣∣∣ 1nn∑i=1I (Di ≤ t)− 1nn∑i=1I (D0i ≤ t)∣∣∣∣∣=∣∣∣∣∣ 1nn∑i=1I (Di ≤ t) I(D0i > K) + 1nn∑i=1I (Di ≤ t) I(D0i ≤ K)− 1nn∑i=1I (D0i ≤ t) I(D0i > K)− 1nn∑i=1I (D0i ≤ t) I(D0i ≤ K)∣∣∣∣∣≤∣∣∣∣∣ 1nn∑i=1I (Di ≤ t) I(D0i > K)− 1nn∑i=1I (D0i ≤ t) I(D0i > K)∣∣∣∣∣+∣∣∣∣∣ 1nn∑i=1I (Di ≤ t) I(D0i ≤ K)− 1nn∑i=1I (D0i ≤ t) I(D0i ≤ K)∣∣∣∣∣= |An|+ |Bn|.We will show that |An| → 0 and |Bn| → 0 a.s..Choose a large K such that PG0(D0 > K) ≤ ε/8. By law of large numbers, thereexists n1 such that for n ≥ n1 implies | 1n∑ni=1 I(D0i > K) − PG0(D0 > K)| ≤ ε/894A.2. Proofs of propositions and theoremand|An| =∣∣∣∣∣ 1nn∑i=1[I (Di ≤ t)− I (D0i ≤ t)]I(D0i > K)∣∣∣∣∣≤ 1nn∑i=1|I (Di ≤ t)− I (D0i ≤ t) |I(D0i > K)≤ 1nn∑i=1I(D0i > K)≤ PG0(D0 > K) + ε/8≤ ε/8 + ε/8 = ε/4.By assumption, we have from Lemma A.1 that Di → D0i a.s. for all i whereD0i ≤ K. Let Ei = Di − D0i. So, with probability 1, there exists n2 such thatn ≥ n2 implies that −δ ≤ Ei ≤ δ for all i. Then,Bn =1nn∑i=1[I (Di ≤ t)− I (D0i ≤ t)]I(D0i ≤ K)=1n∑i:D0i≤K[I (Di ≤ t)− I (D0i ≤ t)]=1n∑i:D0i≤K[I (D0i ≤ t− Ei)− I (D0i ≤ t)]≤ 1n∑i:D0i≤K[I (D0i ≤ t+ δ)− I (D0i ≤ t)]≤ 1nn∑i=1[I (D0i ≤ t+ δ)− I (D0i ≤ t)].95A.2. Proofs of propositions and theoremAlso,Bn =1n∑i:D0i≤K[I (D0i ≤ t− Ei)− I (D0i ≤ t)]≥ 1n∑i:D0i≤K[I (D0i ≤ t− δ)− I (D0i ≤ t)]≥ 1nn∑i=1[I (D0i ≤ t− δ)− I (D0i ≤ t)]Now, by the Gilvenko–Cantelli Theorem, with probability one there exists n3 suchthat n ≥ n3 implies that supt | 1n∑ni=1 I (D0i ≤ t+ δ)−G0(t+ δ)| ≤ ε/16,supt | 1n∑ni=1 I (D0i ≤ t− δ)−G0(t−δ)| ≤ ε/16, and supt | 1n∑ni=1 I (D0i ≤ t)−G0(t)| ≤ε/16. Also, by the uniform continuity of G0, there exists δ > 0 such that |G0(t +δ)−G0(t)| ≤ ε/8 and |G0(t− δ)−G0(t)| ≤ ε/8. Together,1nn∑i=1I (D0i ≤ t− δ)− I (D0i ≤ t) ≤ Bn ≤ 1nn∑i=1I (D0i ≤ t+ δ)− I (D0i ≤ t)G0(t− δ)− ε/16−G0(t)− ε/16 ≤ Bn ≤ G0(t+ δ) + ε/16−G0(t) + ε/16(G0(t− δ)−G(t))− ε/8 ≤ Bn ≤ (G0(t+ δ)−G0(t)) + ε/8−ε/8− ε/8 = −ε/4 ≤ Bn ≤ ε/8 + ε/8 = ε/4.Finally, note thatG(t)−Gn(t) = (G(t)−G0(t)) + (G0(t)−G0n(t)) + (G0n(t)−Gn(t)).Let n4 = max{n1, n2, n3}, then n ≥ n4 impliessupt>η(G(t)−Gn(t)) ≤ supt>η(G(t)−G0(t)) + supt>η(G0(t)−G0n(t)) + supt>η(G0n(t)−Gn(t))≤ (ε/4 + ε/4) + ε/16 + 0 ≤ ε.96A.2. Proofs of propositions and theoremProof of Theorem 2.1We need the following Lemma proved in Yohai (1985).Lemma A.2. Let {Z i} be i.i.d. random vectors taking values in Rk, with commondistribution Q. Let f : Rk × Rh → R be a continuous function and assume that forsome δ > 0 we have thatEQ[sup||λ−λ0||≤δ|f(Z, λ)|]<∞.Then, if λˆn → λ0 a.s., we have1nn∑1=1f(Z i, λˆn)→ EQ [f(Z, λ0)] a.s..Proof of Theorem 2.1. Define(µˆGS, Σ˜GS) = arg minµ,|Σ|=1sGS(µ,Σ, Ωˆ). (A.1)We drop out X and U in the argument to simplify the notation. Since sGS(µ, λΣ, Ωˆ) =sGS(µ,Σ, Ωˆ), to prove Theorem 2.1 it is enough to show(a)(µˆGs, Σ˜GS)→ (µ0,Σ00) a.s., and (A.2)(b)sGS(µˆGS, Σ˜GS, Σ˜GS)→ σ0 a.s.. (A.3)Note that since we haveEH0(ρ(d (X,µ0,Σ0)σ0cp))= b,then part (i) of Lemma 6 in the Supplemental Material of Danilov et al. (2012)97A.2. Proofs of propositions and theoremimplies that given ε > 0, there exists δ > 0 such thatlimn→∞inf(µ,Σ)∈CCε ,|Σ|=11nn∑i=1cpρ(d (X i,µ,Σ)σ0cp (1 + δ))> (b+ δ)cp, (A.4)where Cε is a neighborhood of (µ0,Σ00) of radius ε and if A is a set, then AC denotesits complement. In addition, by part (iii) of the same Lemma we have for any δ > 0,limn→∞1nn∑i=1cpρ(d (X i,µ0,Σ00)σ0cp (1 + δ))< b cp. (A.5)LetQi(µ,Σ) = cpρ(d (X i,µ,Σ)σ0cp(1 + δ))andQ(U )i (µ,Σ) = cp(U i)ρd∗(X(U i)i ,µ(U i),Σ(U i))S cp(U i)∣∣∣Ωˆ(U i)∣∣∣1/p(U i) ,Now, if |Σ| = 1 and S = σ0(1 + δ)/|Ωˆ|1/p, we have1nn∑i=1Q(U )i (µ,Σ) =1n∑pi=pQi(µ,Σ) +1n∑pi 6=pQ(U )i (µ,Σ). (A.6)We also have1n∑pi 6=pQ(U )i (µ,Σ) ≤ cp(1− tn) (A.7)and, therefore, by Assumption 2.4 we havelimn→∞supµ,|Σ|=11n∑pi 6=pQ(U )i (µ,Σ) = 0 a.s.. (A.8)98A.2. Proofs of propositions and theoremSimilarly, we can prove thatlimn→∞supµ,|Σ|=11n∑pi 6=pQi(µ,Σ) = 0 a.s. (A.9)andcp − 1nn∑i=1cp(U i) → 0, a.s.. (A.10)Then, from (A.4) and (A.6)–(A.10) we getlimn→∞inf(µ,Σ)∈CCε ,|Σ|=11nn∑i=1Q(U )i (µ,Σ) > (b+δ) limn→∞1nn∑i=1cp(U i) = (b+δ)cp a.s.. (A.11)Using similar arguments, from (A.5) we can provelimn→∞1nn∑i=1Q(U )i (µ0,Σ00) < b limn→∞1nn∑i=1cp(U i) = b cp a.s.. (A.12)Equations (A.11)–(A.12) imply thatlimn→∞inf(µ,Σ)∈CCε ,|Σ|=1sGS(µ,Σ, Ωˆ) > S a.s.andlimn→∞sGS(µ0,Σ00, Ωˆ) < S a.s..Therefore, with probability one there exists n0 such that for n > n0 we have(µˆGS, Σ˜GS) ∈ CCε . Then, (µˆGS, Σ˜GS)→ (µ0,Σ00) a.s. proving (a).LetPi(µ,Σ, s) = cpρ(d (X i,µ,Σ)cp s)andP(U )i (µ,Σ, s) = cp(U i)ρd(X(U i)i ,µ(U i),Σ(U i))cp(U i) s .99A.2. Proofs of propositions and theoremSince |Σ˜GS| = 1, we have that sGS(µˆGS, Σ˜GS, Σ˜GS) is the solution in s in the followingequation1nn∑i=1P(U )i (µˆGS, Σ˜GS, s) =bnn∑i=1cp(U i). (A.13)Then, to prove (A.3) it is enough to show that for all ε > 0limn→∞1nn∑i=1P(U )i (µˆGS, Σ˜GS, σ0 + ε) < b cp a.s. andlimn→∞1nn∑i=1P(U )i (µˆGS, Σ˜GS, σ0 − ε) > b cp a.s.(A.14)Using Assumption 2.4, to prove (A.14) it is enough to showlimn→∞1nn∑i=1Pi(µˆGS, Σ˜GS, σ0 + ε) < b cp a.s. andlimn→∞1nn∑i=1Pi(µˆGS, Σ˜GS, σ0 − ε) > b cp a.s.(A.15)It is immediate thatE(ρ(d (X,µ0,Σ0)cp (σ0 + ε)))< E(ρ(d (X,µ0,Σ0)cp σ0))= bandE(ρ(d (X,µ0,Σ0)cp (σ0 − ε)))> E(ρ(d (X,µ0,Σ0)cp σ0))= b.Then Equation (A.15) follows from Lemma A.2 and part (a). This proves (b).100Appendix BSupplementary material forChapter 3B.1 Efficiency of GRE and tuning parameter αin Rocke-ρ functionThe tuning parameter α in the Rocke-ρ function in γ in (3.1) is chosen small to controlthe efficiency. In this chapter, we used the conventional choice α = 0.05, as seen toachieve reasonable efficiency while achieving high robustness. Here, we explore theperformance of GRE-C with smaller values of α. We repeat the simulation study asin Section 3.4 for p = 10, 30, 50 and n = 10p. The number of replicates is N = 30.Table B.1 reports the finite sample efficiency and maximum average LRT distancesunder 20% casewise contamination. In general, higher efficiency can be achievedusing smaller values of α, but with the cost of some loss in robustness.Table B.1: Finite sample efficiency and maximum average LRT distances for GRE-Cwith various values of α. The sample size is n = 10p.p Efficiency, clean data Max LRT, 20% casewiseα = 0.05 α = 0.01 α = 0.001 α = 0.05 α = 0.01 α = 0.00110 0.54 0.67 0.67 33.1 32.1 32.130 0.58 0.85 0.95 16.0 20.2 28.750 0.55 0.58 0.93 27.1 28.1 47.7101B.2. Performance comparison between GSE and GRETable B.2: Maximum average LRT distances. The sample size is n = 10p.p EMVE GSE EMVE-C GRE-C10 0.10 8.7 4.6 17.3 10.90.20 81.4 84.8 43.4 36.120 0.10 20.8 24.1 9.2 8.10.20 123.0 156.8 13.1 14.930 0.10 31.2 54.8 13.4 9.40.20 299.1 223.2 24.3 16.040 0.10 77.5 80.7 21.9 12.20.20 511.8 287.9 43.2 17.150 0.10 172.5 125.1 29.4 16.50.20 644.3 349.8 60.2 26.3B.2 Performance comparison between GSE andGREWe conduct a simulation study to compare the standalone performances of the secondsteps (i.e. the estimation step) in the two-step S-estimators: GRE-C starting fromEMVE-C versus GSE starting from EMVE.We consider clean and casewise contaminated samples from a Np(µ0 ,Σ0) distri-bution with p = 10, 20, . . . , 50 and n = 10p. The simulation mechanisms are thesame as that of Chapter 2, but in addition, 5% of the cells in the generated samplesare randomly selected and assigned a missing value. The number of replicates isN = 500.Table B.2 shows the maximum average LRT distances from the true correlationmatrices among the considered contamination sizes and, for brevity, shows only thevalues for random correlations. EMVE is capable of dealing small fraction of outlierswith 500 subsamples, but breaks down when the fraction gets larger, and bringsdown the performance of GSE. EMVE-C with more refined subsampling procedureand larger subsample sizes shows better performance than EMVE, even for relativelylarger fraction of outliers. Overall, GRE performs better than GSE. The Rocke ρfunction used in GRE is capable of giving smaller weights to points that are moderate-to-large distances from the main mass of points (Rocke, 1996); see, for example,102B.2. Performance comparison between GSE and GRE10203040502 4 6 8 10kLRTGSE GRE−CFigure B.1: Average LRT distances for various contamination sizes, k, for randomcorrelations under 10% casewise contamination. The dimension is p = 30 and thesample size is n = 10p.Table B.3: Finite sample efficiency. The sample size is n = 10p.p EMVE GSE EMVE-C GRE-C10 0.24 0.89 0.26 0.5420 0.30 0.95 0.30 0.5930 0.34 0.98 0.33 0.5840 0.35 0.98 0.34 0.4750 0.37 0.99 0.35 0.48Figure B.1 that shows the average LRT distance behaviors for 10% contamination fordimension 30 and sample size 300 data. In the figure, we see that GRE outperformsGSE for moderate sizes contamination points, as expected.Table B.3 shows the finite sample relative efficiency under clean samples, takingthe classical EM estimator as the baseline. As expected, GSE shows an increasingefficiency as p increases. GRE, overall, has lower efficiency.103Appendix CSupplementary material forChapter 4C.1 Additional figures from the simulation studyin Section 4.51% Cellwise 5% Cellwise Casewise0.00.20.40.60.80.00.51.01.501232 4 6 8 10 2 4 6 8 10 2 4 6 8 10 12 14kMSEEstimators 3S ShootS 2S LSFigure C.1: MSE for various cellwise and casewise contamination values, k, formodels with p = 15 continuous covariates. The sample size is n = 150. For detailssee Section 4.5.1 in the chapter.104C.2. Investigation on the performance on non-normal covariates1% Cellwise 5% Cellwise Casewise0.050.100.150.200.000.250.500.751.000.00.20.40.62 4 2 4 6 8 10 2 4 6 8 10kMSEEstimators 3S ShootS 2S LSFigure C.2: MSE for various cellwise and casewise contamination values, k, formodels with px = 12 continuous and pd = 3 dummy covariates. The sample size isn = 150. For details see Section 4.5.2 in the chapter.C.2 Investigation on the performance onnon-normal covariatesHere, we conduct a modest simulation study to compare the performance of 3S-regression, the shooting S-estimator, 2S-regression and the LS estimator for datawith non-normal covariates.We consider the same regression model with p = 15 and n = 300 as in Section4.5, but the covariates are generated from a non-normal distribution as follows. Therandom covariates X i, i = 1, . . . , n, are first generated from multivariate normaldistribution Np(0,Σ), where Σ is the randomly generated correlation matrix with afix condition number of 100. Then, we transform the variables by doing the following:(Xi1, Xi2, . . . , Xip)← (G−11 (Φ(Xi1)), G−12 (Φ(Xi2)), . . . , G−1p (Φ(Xip))),where Φ(x) is the standard normal. We set Gj as N(0, 1) for j = 1, 2, 3, χ2(20) forj = 4, 5, 6, F (90, 10) for j = 7, 8, 9, χ2(1) for j = 10, 11, 12, and Pareto(1, 3) forj = 13, 14, 15.In the simulation study, we consider the following scenarios:105C.2. Investigation on the performance on non-normal covariates• Clean data: No further changes are done to the data;• Cellwise contamination: Randomly replace = 0.05 fraction of the cells in thecovariates by outliers Xcontij = k×G−1j (0.999) and proportion of the responsesby outliers Y contij = E(Yij)+k×SD(εi). We present the results for k = 1, 5, 10,but for larger values of k we obtain similar results.The number of replicates for each setting is N = 1000.The performance of the estimator in terms of MSE are summarized in TableC.1. The performance of 3S-regression is comparable to that of LS and 2S-regressionfor clean data and outperforms the shooting S, LS and 2S-regression for cellwise-contaminated data, even under some deviations from the assumptions on the taildistributions of the covariates.Table C.1: MSE for clean data and cellwise contaminated data.Estimators Clean Cellwisek = 2 k = 5 k = 103S 0.007 0.014 0.013 0.015ShootS 0.254 0.839 1.048 0.8822S 0.003 4.102 3.851 4.057LS 0.001 4.311 6.438 6.588106C.3. Supplementary material for the Boston housing data analysisC.3 Supplementary material for the Bostonhousing data analysisTable C.2: Description of the variables in the Boston Housing dataVariables Descriptionmedv (response) corrected median value of owner-occupied homes in USD 1000’scrim per capita crime rate by townnox nitric oxides concentration (parts per 10 million)rm average number of rooms per dwellingage proportion of owner-occupied units built prior to 1940dis weighted distances to five Boston employment centerstax full-value property-tax rate per USD 1,000,000ptratio pupil-teacher ratio by townblack (B − 0.63)2 where B is the proportion of blacks by townlstat percentage of lower status of the populationC.4 Proofs of lemmas and theoremsProof of Theorem 4.1We need to following lemma in the proof.Lemma C.1. Let X,X1, . . . , Xn be independent with a continuous distribution func-tion G(x). Given 0 < α < 1, let η = G−1(1− α) and s = med(X − η|X > η). Now,consider the following estimator: ηn = G−1n (1−α) and sn = med({Xi−ηn|Xi > ηn}).Then, sn → s a.s.Proof. Without loss of generality, assume that X1 < X2 < · · · < Xn. So, ηn =G−1n (1− α) = Xdn(1−α)e, and#{Xi|Xi > ηn} = n− dn(1− α)e = n− (n+ d−nαe) = bnαc.107C.4. Proofs of lemmas and theoremsThen, Xk = med({Xi|Xi > ηn}) wherek = dn(1− α)e+⌈bnαc2⌉= n− bnαc+⌈bnαc2⌉= n−⌊bnαc2⌋= n−⌊nα2⌋.In other words, med({Xi|Xi > ηn}) = Xdn(1−α/2)e = G−1n (1− α/2), andsn = G−1n (1− α/2)− ηn.Therefore, sn → s a.s., where s = G−1(1− α/2)− η.Proof of Theorem 4.1Proof. Without loss of generality, we consider only the upper tail. Also, to simplifythe notation, we drop out the G in the probability and the u that was used todistinguish between the notations for upper tail and lower tail.Define F (t) and Fn(t) byF (t) =P (0 < (X − η)/s ≤ t)P (X > η)and Fn(t) =1n∑ni=1 I(0 < (Xi − ηn)/sn ≤ t)1n∑ni=1 I(Xi > ηn).Let F0(t) = 1− e−t. It is sufficient to prove that for every > 0 there exists N suchthat for all n ≥ N ,supt≥t0{F0(t)− Fn(t)}+ < .108C.4. Proofs of lemmas and theoremsNote that|F0(t)− Fn(t)| ≤∣∣∣∣F0(t)− P (0 < (X − η)/s ≤ t)P (X ≥ η)∣∣∣∣+∣∣∣∣P (0 < (X − η)/s ≤ t)P (X > η) − P (0 < (X − ηn)/sn ≤ t)P (X > η)∣∣∣∣+∣∣∣∣P (0 < (X − ηn)/sn ≤ t)P (X > η) − P (0 < (X − ηn)/sn ≤ t)P (X > ηn)∣∣∣∣+∣∣∣∣P (0 < (X − ηn)/sn ≤ t)P (X > ηn) −1n∑ni=1 I(0 < (Xi − ηn)/sn ≤ t)P (X > ηn)∣∣∣∣+∣∣∣∣ 1n∑ni=1 I(0 < (Xi − ηn)/sn ≤ t)P (X > ηn) −1n∑ni=1 I(0 < (Xi − ηn)/sn ≤ t)1n∑ni=1 I(Xi > ηn)∣∣∣∣= A+B + C +D + E.By Assumption 4.1, A = 0.Note thatB =1α|P (0 < (X − η)/s ≤ t)− P (0 < (X − ηn)/sn ≤ t)|=1α|[G(st+ η)−G(snt+ ηn)]− [G(η)−G(ηn)]|.Next, we show that supt |G(st+η)−G(snt+ηn)| < εα/4 and |G(η)−G(ηn)| < εα/4.Given a small δ0 > 0 such that s − δ0 > c and η − δ0 > c for c > 0. Choose alarge K > 0 such that for Kδ0 = (s− δ0)K + η− δ0, G(Kδ0) > 1− εα4 . First, considert > K. Since δ0 > 0, we have st + η > (s − δ0)K + (η − δ0) = Kδ0 , and therefore,G(st + η) > G(Kδ0). Also, by Lemma C.1, sn → s a.s. and ηn → η a.s.. So, thereexists N0 such that |sn − s| < δ0 and |ηn − η| < δ0 for all n ≥ N0. So, we havesn > s − δ0 and ηn > η − δ0, which implies snt + ηn > (s − δ0)K + (η − δ0) = Kδ0and G(snt+ ηn) > G(Kδ0). Therefore,supt>K{G(st+ η)−G(snt+ ηn)} ≤ εα4.Now, consider t ≤ K. We have |(st + η) − (snt + ηn)| ≤ t|s − sn| + |η − ηn| <109C.4. Proofs of lemmas and theoremsKδ0 +δ0 < δ1. Now by the uniform continuity of G, given ε > 0, there exists N1 suchthat for n ≥ N1, |(st+η)− (snt+ηn)| < δ, and therefore, |G(st+η)−G(snt+ηn)| <εα4. Similarly, there exists N2 such that for n ≥ N2, |η − ηn| < δ, and therefore,|G(η)−G(ηn)| < εα4 .So, with probability one, take N = max{N0, N1, N2} such that for n ≥ N , itimplies that |(st+ η)− (snt+ ηn)| < δ and |η − ηn| < δ. Then, we haveB ≤ 1α[supt|G(st+ η)−G(snt+ ηn)|+ |G(η)−G(ηn)|]≤ 1α(εα4+εα4) =ε2.Next, we haveC ≤ |G(snt+ ηn)−G(ηn)|(1−G(η))(1−G(ηn)) |G(η)−G(ηn)|≤ 11−G(η) |G(η)−G(ηn)| ≤1αεα4=ε4.By the Gilvenko–Cantelli Theorem, with probability one, we can show that thereexists N3 such that for n ≥ N3, supt |P (0 < (X − ηn)/sn ≤ t) − 1n∑ni=1 I(0 <(X − ηn)/sn ≤ t)| < εα16 . Note that for large enough n, we have P (X > ηn) > α2 . So,D =∣∣∣∣P (0 < (X − ηn)/sn ≤ t)P (X > ηn) −1n∑ni=1 I(0 < (Xi − ηn)/sn ≤ t)P (X > ηn)∣∣∣∣ ≤ 2α εα16 = ε8 .Next, by the Gilvanko–Cantelli Theorem again, there exists N4 such that for n ≥N4, |P (X > ηn) − 1n∑ni=1 I(Xi > ηn)| < supt |P (X > t) − 1n∑ni=1 I(Xi > t)| < εα16 .110C.4. Proofs of lemmas and theoremsThen, we haveE ≤ ( 1nn∑i=1I(0 < (Xi − ηn)/sn ≤ t))∣∣P (X > ηn)− 1n∑ni=1 I(Xi > ηn)∣∣P (X > ηn) (1n∑ni=1 I(Xi > ηn))≤ ( 1nn∑i=1I(Xi > ηn))∣∣P (X > ηn)− 1n∑ni=1 I(Xi > ηn)∣∣P (X > ηn) (1n∑ni=1 I(Xi > ηn))≤ 2αεα16=ε8.Finally, take N = max{N0, N1, N2, N3, N4}, we havesupt{F (t)− Fˆn(t)} ≤ A+B + C +D + E ≤ ε2+ε4+ε8+ε8= ε.Proof of Theorem 4.2Proof. Let (U n,1, . . . ,U n,n)t be the matrix of zeros and ones with zero correspondingto a filtered component in (X 1, . . . ,Xn)t, and n0 be the number of complete observa-tions after the filter step. Now, let Cj = {i, 1 ≤ i ≤ n : Un,ij = 0} and C = ∪pj=1Cj.So, Cj is the set of indices of filtered values for variable j, and C is the set of indicesof incomplete observations. By Boole’s inequality,n− n0 = #C ≤p∑j=1#Cj.Let ξ > 0 be as described in Section 4.3. Now, for each variable {X1j, . . . , Xnj},j = 1, . . . , p, apply Theorem 4.1 to obtain Nj such that, with probability one,#Cj ≤ nξ/p, for n ≥ Nj.111C.4. Proofs of lemmas and theoremsSet N = max{N1, . . . , Np}. Hence, with probability one,n− n0 ≤p∑j=1#Cj ≤p∑j=1nξ/p = nξ,for n ≥ N , or equivalently,n0n≥ 1− ξ.Therefore, U ∗n,i = (1, . . . , 1)t according to (4.6), and Un = I, where I has every entryequal to 1. In other words, for n ≥ N , the GSE in Section 4.3 becomesT 2S = T GS(Z, I)C 2S = CGS(Z, I).Since GSE on complete data reduces to the regular S-estimator (Danilov et al., 2012),this implies that 3S-regression reduces to S-regression for n ≥ N .112
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Robust estimation and inference under cellwise and...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Robust estimation and inference under cellwise and casewise contamination Leung, Andy Chin Yin 2016
pdf
Page Metadata
Item Metadata
Title | Robust estimation and inference under cellwise and casewise contamination |
Creator |
Leung, Andy Chin Yin |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Cellwise outliers are likely to occur together with casewise outliers in datasets of relatively large dimension. Recent work has shown that traditional high breakdown point procedures may fail when applied to such datasets. In this thesis, we consider this problem when the goal is to (1) estimate multivariate location and scatter matrix and (2) estimate regression coefficients and confidence intervals for inference, which both are cornerstones in multivariate data analysis. To address the first problem, we propose a two-step procedure to deal with casewise and cellwise outliers, which generally proceeds as follows: first, it uses a filter to identify cellwise outliers and replace them by missing values; then, it applies a robust estimator to the incomplete data to down-weight casewise outliers. We show that the two-step procedure is consistent under the central model provided the filter is appropriately chosen. The proposed two-step procedure for estimating location and scatter matrix is then applied in regression for the case of continuous covariates by simply adding a third step, which computes robust regression coefficients from the estimated robust multivariate location and scatter matrix obtained in the second step. We show that the three-step estimator is consistent and asymptotically normal at the central model, for the case of continuous covariates. Finally, the estimator is extended to handle both continuous and dummy covariates. Extensive simulation results and real data examples show that the proposed methods can handle both cellwise and casewise outliers similarly well. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-01-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0340497 |
URI | http://hdl.handle.net/2429/60145 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2017-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2017_february_leung_andy.pdf [ 808.16kB ]
- Metadata
- JSON: 24-1.0340497.json
- JSON-LD: 24-1.0340497-ld.json
- RDF/XML (Pretty): 24-1.0340497-rdf.xml
- RDF/JSON: 24-1.0340497-rdf.json
- Turtle: 24-1.0340497-turtle.txt
- N-Triples: 24-1.0340497-rdf-ntriples.txt
- Original Record: 24-1.0340497-source.json
- Full Text
- 24-1.0340497-fulltext.txt
- Citation
- 24-1.0340497.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0340497/manifest