Robust and Sparse Regression in the Presence of Cellwise andCasewise Contamination With Application in Data Quality ModellingbyGlenn McGuinnessB.A.Sc., University of British Columbia, 2017A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Statistics)The University of British Columbia(Vancouver)April 2020c© Glenn McGuinness, 2020The following individuals certify that they have read, and recommend to the Faculty of Graduateand Postdoctoral Studies for acceptance, the thesis entitled:Robust and Sparse Regression in the Presence of Cellwise and Casewise Contamina-tion With Application in Data Quality Modellingsubmitted by Glenn McGuinness in partial fulfillment of the requirements for the degree of Mas-ter of Science in Statistics.Examining Committee:Ruben Zamar, StatisticsSupervisorMatı´as Salibia´n-Barrera, StatisticsAdditional ExamineriiAbstractThis thesis considers the problem of robust and sparse estimation of linear regression parametersin data with structural and independent contamination. Independent outliers can propagate in datawith relatively large numbers of dimensions, resulting in a high fraction of observations with atleast one outlying cells. Recent work has shown that traditional robust regression methods are nothighly robust to such outliers.We investigate the application of Robust Least Angle Regression (RLARS) to data with in-dependent contamination. We also propose two modified versions of RLARS to further improveits performance. The first method applies RLARS to data which has been filtered of independentoutliers. The second method performs RLARS with the Lasso modification for Least Angle Re-gression (LARS). Extensive simulations show that RLARS is resilient to structural and independentcontamination. Compared with RLARS, simulation results show that the first modified version hassignificantly improved robustness to independent contamination and the second modified versionhas improved robustness when there are a large number of predictors.We also consider the application of the proposed methods to data quality modelling in a casestudy for MineSense Technologies Ltd. (MineSense). MineSense develops sensor packages for usein the harsh conditions of an active mine. To maintain high system availability and performance,data must be monitored for a deterioration in sensor health or a change in the data generatingprocess, such as a change in ore body, which can manifest as outliers. We pose the problem of con-tamination detection, the identification of whether a dataset contains outliers, as a distinct problemfrom outlier detection, the identification of which cases or cells are outliers. We propose a con-tamination detection method based on the comparison of robust and non-robust linear regressionestimates. When outliers are present, the robust and non-robust estimates differ significantly, indi-cating the presence of contamination. Simulation results and analysis of real sensor data providedby MineSense suggest that our method can effectively detect the presence of contamination with alow false detection rate.iiiLay SummaryIn regression analysis, data quality problems are common. To perform analysis on data with qualityproblems, a statistical model of the issue is required, called a contamination model. Traditionally,a structural contamination model has been used. However, for high-dimensional data the fullyindependent contamination model is often more appropriate. Recent work has shown that classicalrobust regression methods may fail when applied to such data. Hence, in this thesis we propose theapplication of Robust Least Angle Regression (RLARS) to data with independent contamination.We also propose two modified version of RLARS to further improve performance. We apply theproposed methods to a data quality modelling case study for MineSense Technologies. We furtherpropose a method to detect the presence of structural and independent outliers in a dataset. A realdata study demonstrate this method effectively detects contamination with a low false detectionrate.ivPrefaceThis dissertation is original, unpublished work by the author, Glenn McGuinness, under the super-vision of Professor Ruben Zamar. The data for the case study contained in the thesis was providedby MineSense Technologies Ltd. All simulations and analyses were carried out by the author.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Robust and Sparse Linear Regression in the Presence of Structural and IndependentContamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Case Study: Data Quality Modelling Using Robust Linear Regression . . . . . . . 61.4 Organization of Subsequent Chapters . . . . . . . . . . . . . . . . . . . . . . . . . 72 Robust Least Angle Regression in the Presence of Structural and Independent Con-tamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9vi2.2 Review: Least Angle Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2.1 Incremental Forward Subset Selection . . . . . . . . . . . . . . . . . . . . 102.2.2 LARS and Stagewise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.2.3 Derivation of LARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.4 LARS Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3 Review: Robust Least Angle Regression . . . . . . . . . . . . . . . . . . . . . . . 192.3.1 Variable Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.2 Variable Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.4 RLARS in the Presence of Independent Contamination . . . . . . . . . . . . . . . 272.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 Robust Least Angle Regression Using Pre-Filtered Data . . . . . . . . . . . . . . . . 293.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Review: Detecting Deviating Data Cells . . . . . . . . . . . . . . . . . . . . . . . 303.3 RLARS With the DDC Imputed Data Matrix . . . . . . . . . . . . . . . . . . . . 333.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 Robust Least Angle Regression with Lasso Modification . . . . . . . . . . . . . . . . 354.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.2 Review: LARS Lasso Modification . . . . . . . . . . . . . . . . . . . . . . . . . . 354.3 RLARS-Lasso Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.4 Selecting the Penalty Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.2 Simulation Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.3 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496 Case Study: Data Quality Modelling Using Robust Linear Regression . . . . . . . . 576.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.2 Data Quality Modelling Using Robust Linear Regression . . . . . . . . . . . . . . 596.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62vii6.3.1 Simulation Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626.3.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636.3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636.4 MineSense Real Data Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 666.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77A Robust Estimators Used in Detecting Deviating Data Cells . . . . . . . . . . . . . . . 77B Data Quality Modelling Simulation Additional Figures . . . . . . . . . . . . . . . . 79viiiList of TablesTable 5.1 Expected fraction of contaminated cases as n −→ ∞ for p = 50,450,1000 fordifferent levels of independent contamination εi. . . . . . . . . . . . . . . . . . 44Table 5.2 Mean and standard error of MSPE, PR, and RC across replicates with no con-tamination for data generation model (a) for p = 50, 450, and 1000. The MSPEis shown plus/minus the standard error. The best method for each performancemeasure is highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Table 5.3 Mean and standard error of MSPE, PR, and RC across replicates with no con-tamination for data generation model (c) for p = 50, 450, and 1000. The MSPEis shown plus/minus the standard error. The best method for each performancemeasure is highlighted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47Table 5.4 Mean and standard error of MSPE, PR, and RC across replicates for simula-tion settings (i)-(iii) with shift normal casewise contamination with and withouthigh-leverage values. The MSPE is shown plus/minus the standard error. Thebest method for each performance measure is highlighted . . . . . . . . . . . . 50Table 5.5 Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) withsymmetric slash contamination with and without high leverage values. TheMSPE is shown plus/minus the standard error. The best method for each per-formance measure is highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . 51Table 5.6 Mean MSPE, RC, and PC across replicates for simulation settings (a)-(c) withclustered outliers for a range of magnitude parameters η . The MSPE is shownplus/minus the standard error. The best method for each performance measureis highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52ixTable 5.7 Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) witha fraction εi = 0.014 independent outliers for a range of magnitude parametersk. The MSPE is shown plus/minus the standard error. The best method for eachperformance measure is highlighted. . . . . . . . . . . . . . . . . . . . . . . . 53Table 5.8 Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) witha fraction εi = 0.054 independent outliers for a range of magnitude parametersk. The MSPE is shown plus/minus the standard error. The best method for eachperformance measure is highlighted. . . . . . . . . . . . . . . . . . . . . . . . 54Table 6.1 Definition of outcomes of contamination detection. A detection of contamina-tion is defined as a positive and no detection as a negative. . . . . . . . . . . . . 64Table 6.2 Mean and standard deviation of the specificity and recall across replicates fordata with clustered outliers for a range of fractions of contamination, ε , andmagnitude parameters η . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65Table 6.3 Mean and standard deviation of the specificity and recall across replicates fordata with independent outliers for a range of fractions of contamination, εi, andmagnitude parameters k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65Table 6.4 Recall over 100 random splits for the one variable contamination model detec-tion with a 95% bootstrap quantile cut-off. The specificity is 0.95. . . . . . . . . 68Table 6.5 Recall over 100 random splits for the three variable contamination model detec-tion with a 95% bootstrap quantile cut-off. The specificity is 0.92. . . . . . . . . 68Table 6.6 Recall over 100 random splits for the combined contamination detection modelwith a 95% bootstrap quantile cut-off. The specificity is 0.87. . . . . . . . . . . 69Table 6.7 Recall over 100 random splits for the one variable contamination model detec-tion with a 90% bootstrap quantile cut-off. The specificity is 0.88. . . . . . . . . 69Table 6.8 Recall over 100 random splits for the three variable contamination model detec-tion with a 90% bootstrap quantile cut-off. The specificity is 0.86. . . . . . . . . 69xList of FiguresFigure 1.1 Expected fraction of cases that contain at least one outlying cell, ε , for p from1 to 1000 when each cell has a probability εi = 0.01 of being contaminated. . . 3Figure 2.1 Example of univariate Winsorization with c = 2. The bivariate outliers in thisexample will not be shrunk at all by univariate Winsorization. . . . . . . . . . 22Figure 2.2 Example of bivariate Winsorization with c= 5.99. Bivariate outliers are shrunkto the border of the ellipse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23Figure 2.3 Example of adjusted Winsorization with c= 2. Bivariate outliers are shrunk tothe edge of the small box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24Figure 5.1 Mean MSPE for a range of independent contamination levels. The maximumstandard error of the MSPE in all cases shown is 0.64. . . . . . . . . . . . . . 55Figure 5.2 The mean rank of the MSPE across all replicates for a range of p with ε =0.014. The maximum standard error of the MSPE rank amongst cases shownis 1.05. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55Figure 5.3 The mean rank of the MSPE across all replicates for a range of p with ε =0.054. The maximum standard error of the MSPE rank amongst all casesshown is 1.27. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Figure 5.4 Binary decision tree describing which methods are recommended for differenttypes of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Figure B.1 Sensitivity across simulation replicates for detection of data contaminated withclustered outliers for a range of fractions of contaminated cases and magnitudeparameters η . These plots present the results for the same simulations as Table6.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79xiFigure B.2 Specificity across simulation replicates for detection of data contaminated withclustered outliers for a range of fractions of contaminated cases and magnitudeparameters η . These plots present the results for the same simulations as Table6.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80Figure B.3 Sensitivity across simulation replicates for detection of data contaminated withindependent outliers for a range of fractions of contaminated cells and magni-tude parameters k. These plots present the results for the same simulations asTable 6.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80Figure B.4 Specificity across simulation replicates for detection of data contaminated withindependent outliers for a range of fractions of contaminated cells and magni-tude parameters k. These plots present the results for the same simulations asTable 6.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81xiiGlossaryDDC Detecting Deviating Data CellsDDC-RLARS Robust Least Angle Regression using the Detecting Deviating Data Cellsimputed data matrixLARS Least Angle RegressionRLARS Robust Least Angle RegressionRLARS-LASSO Robust Least Angle Regression with Lasso ModificationSTAGEWISE Incremental Forward Subset SelectionSTEPWISE Forward Stepwise RegressionxiiiAcknowledgmentsI am deeply and sincerely grateful for the guidance, patience, and support of my supervisor, Profes-sor Ruben Zamar through my research. Many times when I was feeling disheartened by a problemin my research, he provided me with the motivation and statistical insight that I needed to pushthrough. I hope to one day emulate Ruben’s qualities as a supervisor and mentor. I would alsolike to thank Professor Matı´as Salibia´n-Barrera for being my second reader and providing me withhelp in many areas throughout my graduate studies. Whenever it was needed, he was there with aclear and concise answer. I would also like to offer my gratitude and thanks to Anthony-AlexanderChristidis. He was always available at any hour of the day with help, advice, and feedback.I would like to thank MineSense Technologies Ltd. for funding my research. My most sincerethanks go to Dr. David Turner, the research manager at MineSense and the industrial supervisorof my project. From his work organizing, formulating, and reviewing my research to his adviceon navigating graduate school, David has helped me at every stage of my studies and I am forevergrateful.xivDedicationTo Emma, my partner, who was with me every step of the way, offering support and encouragement.She always found a way to make me laugh through all of the long days and nights.To my parents, Janice and Bryan McGuinness. Their hard work and sacrifice made me who Iam.xvChapter 1Introduction1.1 MotivationConsider the linear regression dataset (X,y) where X ∈Rn×p is a matrix of predictor variables andy ∈ Rn is the vector of responses. We assume the data follow the linear regression model, y =Xβ +ε where β ∈Rp is a vector of coefficients and ε = (ε1, . . . ,εn)′ is a vector of error terms. Letεi be i.i.d. and independent of the predictors xi1, . . . ,xi,p for all i= 1, . . . ,n. Improvements in sensorand computer technology has led the number of variables, p, to exceed the number of observations,n. Such data are called high-dimensional. However, the number s of variables correlated withthe response is frequently very small relative to the number of observations. Sparse methods canimprove estimator variance when p/n is large and s/n is small by selecting a subset of variables toinclude in the model.In addition to being sparse, it is often important for a method to be robust to outliers. Classicalsparse linear regression methods can fit the majority of the data very poorly when even a smallnumbers of outliers are present. When good data quality can not be guaranteed, it is thus desirableto have a method that is both sparse and robust.In this thesis, we consider the problem of data quality and how it relates to robust and sparseestimation of linear regression models. As a first step, a statistical model for the data must beselected. The model will be used to evaluate the performance of the considered linear regressionmethods so an unrealistic or inappropriate choice may result in poor performance on real data.Classical models assume that the randomness in data follow some “well-behaved” distribution,such as a Gaussian or Poisson distribution. However, such models do not incorporate outliers andare thus not sufficient for robust methods.1In robust statistics, data are modelled using contamination models. A contamination modelassumes that some fraction of the data may contain atypical values, called outliers. Almost allclassical robust methods for linear regression have modelled data quality using the Tukey-HuberContamination Model (THCM) (Alqallaf, Van Aelst, et al., 2009). However, the THCM is oftennot appropriate for high-dimensional data.The THCM assumes that a minority of observations are randomly contaminated with outliers.As the THCM assumes that contamination occurs at the observation (case) level, such outliers arecalled structural or casewise contamination. Under the THCM, data follow the distributionFε = (1− ε)F + εG. (1.1)Observations are drawn from some uncontaminated distribution, F , with probability 1− ε andsome outlier distribution, G, with probability ε . Highly robust affine equivariant linear regressionmethods developed under the THCM, such as S-regression (Rousseeuw and Yohai, 1984), M-regression (Huber, 1973), or MM-regression (Yohai, 1987), proceed by down-weighting outlyingobservations. Using this approach, methods can be robust to at most 50% outliers (e.g. Ricardo AMaronna, Martin, et al., 2019). Down-weighting outlying observations is a reasonable approachunder the THCM because it is assumed that a majority of observations are not contaminated.However, the assumption that the majority of observations are uncontaminated has been crit-icized when applied to high-dimensional data. When p is large and individual cells have somesmall probability of being contaminated independent of other cells in the same observation thenthe fraction of outlying cases can be much greater than 0.5 (Alqallaf, Van Aelst, et al., 2009).To address the limitations of the Tukey-Huber contamination model for data with high p, Alqal-laf, Van Aelst, et al. (2009) proposed a different contamination model – the independent, or cell-wise, contamination model. Under this model, each cell in a data table has some probability ofbeing contaminated independently of other cells in the same observation.To provide an intuitive example, imagine a machine is collecting measurements of some sam-ples with several sensors. Structural contamination is analogous to all sensors having some chanceof failing at the same time. Independent contamination is analogous to each sensor having a chanceof failing individually.As each covariate is contaminated independently, a small fraction of contaminated cells canresult in a large fraction of contaminated observations. This is called outlier propagation. Outlierpropagation is a serious problem for high-dimensional data. The probability that an observation2Figure 1.1: Expected fraction of cases that contain at least one outlying cell, ε , for p from 1to 1000 when each cell has a probability εi = 0.01 of being contaminated.contains at least one outlying cell isε = 1− (1− εi)p (1.2)where εi is the probability that a cell is independently contaminated and ε is the probability that anobservation is contaminated. For low εi, even a moderately large p will result in a ε that exceeds0.5, breaking down regression estimators developed under the THCM. For example, Figure 1.1shows independent to structural error propagation for εi = 0.01 for a range of p. For only p = 100dimensions, the probability that a case is contaminated is 0.64.High-dimensional datasets often have over a thousand dimensions, so even tiny amounts of in-dependent contamination can cause robust linear regression estimators developed under the THCMto fail. Thus, a sparse linear regression method that is robust to both independent and structuralcontamination is needed.1.2 Robust and Sparse Linear Regression in the Presence ofStructural and Independent ContaminationThe fully independent contamination model was only recently proposed by Alqallaf, Van Aelst,et al. (2009), so few methods exist for robust regression for data with independent contamina-tion. O¨llerer et al. (2016) recently proposed the Shooting S-estimator, named after the shooting(coordinate descent) algorithm (Fu, 1998). In coordinate descent, gradient descent is iterativelyperformed along each variable until convergence. As gradient descent is performed coordinate-wise, the Shooting S-estimator can weight cells in the same observation differently. However, even3though coordinate descent is commonly used to fit regularized regression estimators, the ShootingS-estimator is not regularized and is not a sparse method.Leung et al. (2016) proposed the 3S-Estimator, so called for the three steps that compose theprocedure. The 3S-Estimator first uses a univariate filter to replace independent outliers with miss-ing values. A robust estimate of scatter and location is then used to down-weight casewise out-liers. Finally, robust location and scatter estimates are used to compute regression coefficients.3S-regression is also not sparse and, moreover, can not be computed when p> n.Classical approaches to sparse linear regression include Incremental Forward Stagewise Re-gression, the Least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), and Elas-tic Net (Zou and Hastie, 2005). Incremental Forward Stagewise (FS) regression is a boostingalgorithm that generates a coefficient profile by iteratively updating the regression coefficient mostcorrelated with the residuals by a small quantity (e.g. Freund et al., 2013). By selecting the numberand size of update, FS can produce a regularized solution. However, FS uses the sample correlationto determine the direction of the update, which is not robust to outliers. The resulting estimator istherefore sensitive to contamination.The least squares estimator with a Lasso or an Elastic Net penalty are the minimizer to a lossfunction of the form,L (β ,λ ) = Pr(y−Xβ )+λPs(β ) (1.3)where the function Pr : Rn −→ R is a residual scale penalty, λ is the sparsity parameter, and thefunction Pd(β ) : Rp −→ R is a regression coefficient penalty. Lasso and Elastic Net are sparsedue to their coefficient penalties. Lasso (Tibshirani, 1996) penalizes the L1 norm of the regressioncoefficients while Elastic Net (Zou and Hastie, 2005) penalizes a linear combination of the L1 normand the L2 norms. For linear regression, both Lasso and Elastic Net penalized linear regressionestimators use a sum of squares residual penalty. It is well known that the sum of squares residualpenalty are sensitive to outliers (e.g. Freue et al., 2017). As a result, the least squares estimatorwith either of the Lasso or Elastic Net penalty are not robust to contamination.Some methods for robust and sparse regression have been developed, but the field is still youngand existing methods were developed under the THCM. The first robust and sparse regression es-timator was Robust Least Angle Regression (RLARS) proposed by Khan et al. (2007). RLARSreplaces the non-robust sample means and sample correlations used to calculate Least Angle Re-gression (LARS) with robust alternatives. RLARS has some properties that make it attractive forapplication to data with independent contamination, which we will discuss later in this section.Later, Alfons et al. (2013) extended the Least Trimmed Squares estimator (Rousseeuw, 1984)4to include an L1 penalty of the regression coefficients (SparseLTS). This method has been demon-strated to have good robustness properties, but can only be tuned for either high robustness or highefficiency under normality, but not both. Moreover, SparseLTS down-weights atypical observationsand is thus not robust to independent contamination.Ricardo A Maronna (2011) proposed an MM-estimator for Ridge Regression (MMRidge) thatcan be both efficient and robust. This method replaces the sum of squares residual error of RidgeRegression with a robust penalty function. While MMRidge performs well on data with structuralcontamination, it is not sparse.Recently, Smucler and Yohai (2017) proposed MMLasso, an MM-estimator with an L1 penalty,and Freue et al. (2017) proposed PENSEM, an MM-estimator with an Elastic Net penalty. Thesemethod has been shown to perform better than existing regularized robust regression methods indata with structural contamination. However, MMLasso and PENSEM gain robustness by down-weighting atypical observations and are thus sensitive to independent outliers.As alluded to before, RLARS has properties that are desirable for regression under the inde-pendent contamination model. RLARS robustly sequences variables into a series of nested sub-models and then selects the optimal model from MM-regression estimators fit along the sequence.MM-regression estimators achieve robustness by down-weighting atypical observations. Thus, theoptimal model along the sequence will be vulnerable to outlier propagation in the active variables.However, in sparse data s/p is small, so outlier propagation is a much less severe problem whenonly considering the set A= { j : β j > 0} of the indices of variables that are related to the response,called the active set. As a result, RLARS is less sensitive to independent contamination than otherrobust and sparse linear regression methods.In summary, no existing sparse linear regression method is robust to both structural and inde-pendent contamination. Therefore, our goal is to adapt RLARS to be robust to independent andstructural contamination.In this study, we present several robust and sparse linear regression methods for applicationto data with structural and independent contamination. We consider the application of RLARSto data with structural and independent contamination. Further, we introduce two separate mod-ified versions of RLARS that improve predictive performance in different scenarios. First, wepropose pre-filtering the data matrix using the Detecting Deviating Data Cells (DDC) procedureproposed by Rousseeuw and Bossche (2018) before applying RLARS, called DDC-RLARS. TheDDC-RLARS estimator is recommended when diagnostic tools suggest the fraction of independentoutliers exceeds 5%, so εi > 0.05. Second, we propose a modification of RLARS with the LARSLasso modification, called RLARS-Lasso. RLARS-Lasso replaces the sample means and correla-5tions in the LARS Lasso modification with robust counterparts. The RLARS-Lasso estimator isrecommended for application to moderately high-dimensional data, approximately p/n> 10.1.3 Case Study: Data Quality Modelling Using Robust LinearRegressionIn this thesis, we also consider the problem of directly modelling data quality in a linear regressiondataset as a case study for MineSense Technologies Ltd. (MineSense). MineSense designs, builds,and deploys sensor-based systems that provide real-time grade telemetry to mines as ore is collectedand before it is transported to downstream processing facilities. We aim to detect the presence ofindependent and structural outliers in the high-dimensional data collected by MineSense sensors.In statistics, outliers are usually regarded as an obstacle to data analysis. However, outliers canprovide valuable insight into the status of the data generating process. For example, consider acommon thought experiment used as motivation in robust statistics. A sensor is collecting obser-vations with some chance of failure. When the sensors fails, it produces an outlier. In many texts,this example is only used to show how outliers may be generated and for the rest of the documentoutliers are only considered in terms of their adverse impact on data analysis.Instead, this example might be considered from the perspective of the operator of the sensor.In many applications, a properly operating sensor will not frequently fail while a damaged orotherwise improperly functioning sensor will. Hence, the presence of outliers in a dataset can beindicative of a sensor failure. In this context, information about the presence of outliers is valuableindependent of its impact on data analysis. For these reasons, MineSense aims to use informationabout data quality to improve sensor maintenance.Sensor maintenance is important to MineSense because their sensors are mounted on excavatorsoperating at the mining face. The environment in which excavators work is harsh so sensors areexpected to undergo significant wear and require regular maintenance. However, excavators are acritical step in the transport of rock so any disruptions to their operation is costly.Detecting a deterioration in data quality can provide an early warning that a disruption is oc-curring. Outliers can result from a change in the data generating process, such as a change in orebody or a deterioration in system health. Hence, a principled method to detect when a dataset iscontaminated with outliers could provide a scalable method for monitoring system performanceand health.First, we must distinguish between what we call contamination detection and the more commonproblem of outlier detection. Outlier detection aims to identify atypical observations or cells.6These observations are then removed from the data. In contrast, contamination detection aims todetect when outliers are present in a dataset, not which observations or cells are outliers. Ideally,contamination should be detected when there are a small number of minor outliers – before theproblem generating the outliers becomes more serious.Researchers have only recently begun considering the problem of detecting both structural andindependent contamination. Rousseeuw and Bossche (2018) proposed Detecting Deviating DataCells (DDC), the first and so far only method designed to detect both structural and independentoutliers. However, DDC is not intended for contamination detection. While it is possible to createa contamination detection method from an outlier detection method, it may be possible to achievebetter results using a purpose built procedure. For example, we could declare a dataset contami-nated if more than some number of independent and structural outliers were detected. However,this method would be ad-hoc and the false detection rate could not easily be tuned. Hence, aprincipled contamination detection method is desirable.Our approach to contamination detection compares robust and non-robust models. The com-parison of robust and non-robust models was first proposed in the S-plus User Manual (InsightfulCorporation, 2002) as a diagnostic tool to aid with manual examination of data, specifically in themanual for the S-plus robust library in the description of the fit.models function which waslater ported to R in the fit.models package available on the Comprehensive R Archive Net-work (CRAN). We consider a principled and automated extension of this concept for applicationto contamination detection.We propose a flexible contamination detection method for linear regression datasets with aneasily tuned false detection rate. This method measures the difference between regression coeffi-cients estimated by robust and non-robust methods. When this distance exceeds a cut-off deter-mined using a bootstrap estimate of an upper quantile, the data is considered contaminated. Bychanging the cut-off, the fraction of datasets without outliers that are incorrectly detected as con-taining outliers, called the false detection rate, can be easily tuned.1.4 Organization of Subsequent ChaptersThe subsequent chapters are organized as follows. In Chapter 2, we review RLARS and discussthe application of RLARS to data with independent contamination.In Chapter 3, we propose the application of RLARS using data filtered by DDC to removeindependent contamination. By using the DDC filtered data matrix, the robustness of RLARS toindependent outliers is further improved.7In Chapter 4, we propose the Lasso modification for RLARS. Here, we show that the Lassomodification for LARS can be used to estimate regression coefficients using only robust quantities.In Chapter 5, we present a simulation study comparing the performance of a variety robust andsparse linear regression methods, including existing methods and the two modifications of RLARSpresented in this study. The predictive and variable selection performance is compared on data withcellwise and casewise contamination.Chapter 6 is a case study where we present a method to detect the presence of independent andstructural contamination in high-dimensional linear regression data.In Chapter 7, we conclude by summarizing the main ideas and results of this thesis.8Chapter 2Robust Least Angle Regression in thePresence of Structural and IndependentContamination2.1 IntroductionIn this chapter, we consider the application of Robust Least Angle Regression (RLARS) (Khanet al., 2007) to data with structural and independent contamination. RLARS is a robust and sparselinear regression method based on Least Angle Regression (LARS). RLARS replaces the samplemeans, variances, and correlations used in LARS with robust counterparts.RLARS consists of a two-step model building procedure, sequencing and segmentation. First,a sequence of variables is generated, such that the best predictors are at the beginning. Then, robustlinear regression estimators are fit along the variable sequence and the optimal model is selected,which Khan et al. (2007) call segmentation.RLARS was originally developed having the Tukey-Huber Contamination Model (THCM) inmind, but also has some resistance to cellwise outliers. In this chapter, we discuss how RLARSvariable sequencing is robust to independent contamination and how outlier propagation is lesssevere in the variable subsets considered in segmentation than in the full set of variables.The rest of this chapter is organized as follows. In Section 2.2, the LARS algorithm, uponwhich RLARS is based, is reviewed. In Section 2.3, the two-step RLARS model-building proce-dure is reviewed. In Section 2.4, the application of RLARS to data with independent contamination9is discussed. Section 2.5 concludes, providing an overview of the major topics of the chapter.2.2 Review: Least Angle RegressionLeast Angle Regression (LARS) was proposed by Efron et al. (2004) as a less greedy alternativeto traditional forward selection procedures. LARS is closely related to the forward selection pro-cedure Incremental Forward Stagewise regression (Friedman et al., 2001). We will first reviewStagewise regression to provide motivation for LARS.2.2.1 Incremental Forward Subset SelectionIncremental Forward Stagewise regression (Stagewise) is a forward selection procedure for build-ing a linear regression model. Forward selection procedures operate by iteratively adding variables,partially or entirely, to a model. The simplest forward selection procedure is Forward Stepwise Se-lection (Stepwise) (e.g. Friedman et al., 2010, chapter 3).In Stepwise, variables are iteratively added to a model. At each step, the variable most corre-lated with the residuals, say xi, is added to the set of variables included in the model, called theactive set. The set A contains the indices of the variables in the active set. The response y is thenregressed on the set of variables (x j) j∈A with ordinary least squares (OLS). If a good predictor iscorrelated with xi, it may be excluded from the model at the next step. For this reason, Stepwise isan aggressive model building procedure that can be overly greedy.Stagewise was proposed as a more cautious version of Stepwise (Friedman et al., 2010). Ateach step, Stagewise updates the coefficient of the variable most correlated with the residuals by asmall value c> 0.We will now present the Stagewise algorithm, based on the excellent description in Freund etal. (2013). Let the responses y and the columns of the data matrix X be centred to have mean zero.Stagewise is initialized with residuals e0 = y and coefficients β 0 = 0. We then repeat the followingsteps, shown at step k.1. Find the variable most correlated with the residuals,jk = argmaxj∈{1,...,p}|(ek)′X j|.102. Update the coefficient vector and residuals in the direction of the selected variable,β k+1jk ←− β kjk + c sign(|(ek)′X jk |),ek+1←− ek− c sign(|(ek)′X jk |)X jk .where β kjk is the jkth element of the coefficient vector at step k.3. Steps 1 and 2 are then repeated until all covariates are uncorrelated with the residuals or thedesired level of sparsity is reached.Stagewise has the desirable sparsity properties that, at step k, ||β k||1 = kc and #{ j : β j 6= 0} ≤ k.By controlling the number of iterations, one can control the sparsity of the resulting estimate. Inproducing a regularized solution, Stagewise can obtain a favourable bias-variance trade-off withhigh-dimensional data.However, by performing a more cautious update Stagewise may require many more than thep steps required by Stepwise to reach the least squares solution. The computation time can bereduced by using a larger c, but this comes at the cost of making the algorithm more greedy.2.2.2 LARS and StagewiseThe original motivation for LARS was to create an accelerated version of Stagewise (Efron etal., 2004) based on the following behaviour of Stagewise. Let x1 be the first variable added tothe active set by Stagewise. If c is sufficiently small, then Stagewise will usually take at leastone step in the direction of x1 before adding a second variable, say x2. At this stage, x1 and x2have similar absolute correlations with the residuals. Now, if Stagewise steps in the direction ofx2, then the absolute correlation of x2 with the residuals will decrease relative to that of x1. Inpractice, this results in Stagewise alternating between the stepping in the direction of x1 and x2,such that both variables have approximately equal absolute correlations until a third variable isadded. Extrapolating this behaviour, Stepwise tends to move in a direction such that all variablesin the active set have equal absolute correlation with the residuals.LARS directly calculates the direction that has equal correlation (angle) with all members of theactive set. The fit is updated in this direction until another variable has an equal absolute correlationwith the last updated residuals to variables in the active set. The process is repeated until min(n, p)variables are included in the active set. The length of the update is easily computed, so the totalnumber of steps is also min(n, p).112.2.3 Derivation of LARSUsing the same notation as before, let (y,X) be a linear regression dataset with y ∈ Rn and X =(x1, . . . ,xp) ∈ Rnxp following the linear model,y = Xβ + εwhere β is the coefficient vector and ε the vector of i.i.d regression errors. The errors εi areindependent of the predictors and follow the distribution εi ∼ N(0,σ2) for i = 1, . . . ,n. We assumethat the data are normalized so that y and the columns of X have mean zero and standard deviationone.Derivation of First LARS StepLARS is initialized with a zero coefficient vector βˆ = 0 and a residual vector equal to the responsee = y− yˆ = y. As with Stagewise, the first variable to enter the active set is that with the greatestabsolute correlation with the residuals,j = argmaxj∈{1,...,p}|e′X j|.For simplicity, we assume that only one variable enters the active set at a time. Without loss ofgenerality, let the first variable to enter the active set be x1. LARS updates the predictions yˆ in thedirection with equal absolute correlation to the members of the active set.To simplify calculations, variables are not included in the active set directly. Instead, signedvariables are incorporated in the active set. A signed variable is a variable x j multiplied by the signof the variable x j with the residuals when it enters the active set. The first signed variable in theactive set is s1x1 where s1 = sign(e′x1). The correlation of the signed variable and the residuals arethus, by construction, positive. At this stage, only one variable is in the active set so the predictionis updated in the direction of s1x1,yˆ←− yˆ+ γs1x1. (2.1)The step size γ > 0 is selected such that the next variable to enter the active set has equal absolutecorrelation to the residuals as x1. This is to ensure that variables in the active set always havegreater absolute correlation with the residuals than those in the inactive set.Efron et al. (2004) observed that the absolute correlation of all covariates with the residualschanges linearly with γ . The linearity of the absolute correlation of x j for all j = 1, . . . , p with the12residual vector is simple to show. Let r j be the sample correlation of the (unsigned) variable j andthe residual vector. At the end of a step of length γ , the correlation r j is given by,r j =1n(x′j(y− yˆ))=1n(x′j(y− γs1x1))=1n(x′jy− γs1x′jx1)With this expression, we can find the absolute correlation of x1 and the residuals. Let r = |r1| bethe correlation of the active set and the residuals.r =1n|s1x′1(y− yˆ)|=1n|x′1ys1− γ|At the first step, e = y, so s1 = sign(x′1y) and thus x′1ys1 = |x′1y|. We choose 0 < γ < |x′1ys1|resulting in the expression,r =1n|x′1y|− γ. (2.2)To find γ , we calculate the step size required for each variable to have equal absolute correlationwith the residuals as x1, denoted by γ+j . We first find the step size for the positive signed vector+x j,r = r j1n(x′1y− γ+j x′1x1) =1n(x′jy− γ+j x′jx1)γ+j =x′1y−x′jyx′1x1−x′jx1γ+j =r− r j1−x′jx1/nSimilarly, for the negative signed vector −x j,γ−j =r+ r j1+x′jx1/n. (2.3)Recall that the active set is constructed such that r > |r j| for all j ∈ {2, . . . , p} and all variables are13standardized such that |x′jx1/n|< 1. Hence, γ+j ,γ−j > 0 for all j ∈ {2, . . . , p}.LARS performs the largest step γ such r ≥ |r j| for all j = 2, . . . , p. This is equivalent to settingγ to the smallest γ+j or γ−j for j ∈ {2, . . . , p},γ = minj∈{2,...,p}{min(γ+j ,γ−j )}. (2.4)Derivation of the Second and Later StepsIn this section, we derive step k for some k ∈ {2,3, . . . ,min(n, p)}. A quantity at step k is denotedusing a superscript (·)k. Let the set A contain the indices of all variables in the active set. To beginthe step we update Ak = Ak∪{ j} with the j corresponding to the γ j = min(γ+j ,γ−j ) selected as thestep size γ in the previous step.We define XA = (· · · ,s jx j, · · ·) for j ∈ A as the signed data matrix where s j is the sign of thecorrelation between the residuals and variable x j when it it enters the active set. As in step 1,updates are performed based on the absolute correlation of unsigned variables with the residuals,so it is simpler to use XA than the unsigned data matrix.LARS updates yˆ in the equiangular direction. The equiangular direction is defined as a linearcombination of the columns of XA with equal correlation to all signed variables in the active set.Let uA denote the equiangular vector. In the first step, the equiangular vector is parallel to the onlyvector in the active set, x1. For later steps, uA is derived in terms of the data from the followingthree constraints.(1) Linear combination of the columns of the signed data matrixuA = XAwA (2.5)where wA is a vector of weights.(2) Equal correlation with all members of the signed data matrix,1nX′AuA = aA1A (2.6)where aA is the absolute correlation of the members of the active set with the equiangularvector and 1A is a vector of ones with an equal number of elements to uA.14(3) Standardized to have a unit variance,1nu′AuA = 1. (2.7)The vector of predicted values at step k is updated in the direction of the equiangular vectorwhen performing a step,yˆk+1 = yˆk + γuA. (2.8)To calculate an update, the quantities uA, wA, and aA in Equations (2.5), (2.6), and (2.7) must bederived in terms of the data using the constraints. We can find wA in terms of aA by substituting(2.5) into (2.6),1nX ′AXAwA = aA1AwA = aA(1nX ′AXA)−11A (2.9)Substituting (2.5) into (2.7) and using the expression for wA we can get an expression for aA1 =1nw′AX′AXAwA1 = a2A1′A(1nX ′AXA)−11AaA =(1′A(1nX ′AXA)−11A)−1(2.10)Letting DA = diag{. . . ,s j, . . .} for j ∈ A and RA be the sample correlation matrix of the (unsigned)variables in the active set.aA = (1′A(DARADA)−11A)−1 (2.11)An expression for wA can be found by substituting Equation (2.11) into (2.9),wA = aA(DARADA)−11A (2.12)Finally, uA is given by substituting Equation (2.12) into (2.5),uA = aAXA(DARADA)−11A. (2.13)15Using uA, wA, and aA, the update can now be derived. First, we derive the update for the correlationsof the signed variables in the active set s jx j with the residuals e, denoted as r. Equation (2.8) canbe used to find the correlation update steprk+1 =1ns jx′j(y− yˆk+1)=1ns jx′j(y− yˆk− γuA)= rk− γ 1ns jx′juABy construction, 1n s jx′juA = aA for j ∈ A so the update of r is given by,rk+1 = rk− γaA (2.14)The update of the correlation of the (unsigned) variable x j for j ∈ Ac with e, r j, can be similarlyderived,rk+1j =1nx′j(y− yˆk+1)=1nx′j(y− yˆk− γuA)rk+1j = rkj− γ1nx′juAWe define a j = 1n x′juA for j ∈ Ac giving the update,rk+1j = rkj− γa j, j ∈ Ac. (2.15)Notice that a j does not depend on γ so rk+1j is linear in γ .The coefficient update can also be derived from Equation (2.8). Let u ∈ Rp be a vector withelements equal to equiangular vector for j ∈ A and with u j = 0 for j ∈ Ac and let βA be a vector16containing the elements of the coefficient vector corresponding to the active set,yˆk+1 = Xβˆk+1= Xβˆk+ γuXAβˆk+1A = XAβˆkA+ γwAuAβˆk+1A = βˆkA+ γ(s1wA,1, . . . ,snwA,n)′βˆ k+1j = βˆkj + γs jwA, j, j ∈ A (2.16)LARS takes the largest step such that r≥ |r j| for all j ∈ Ac. Thus to find γ , we find the smalleststep size such that r = |r j| for some j ∈ Ac. Let γ+j be the step size such that r = r j for j ∈ Ac, forthe positive signed vector +x j,rk+1j = rk+1rkj− γ+j a j = rk− γ+j aAγ+j =rk− rkjaA−a j .Similarly, for the negative signed vector x j,γ−j =rk + rkjaA+a jWe choose the minimum possible γ+ or γ− over all inactive covariates,γ = minj∈Ac{min(γ+j γ−j )}. (2.17)At the end of the step, one variable will have equal correlation with the residuals to the active setand will thus be added to the active set at the beginning of the next step.2.2.4 LARS SummaryWe will now summarize the steps of the LARS algorithm in terms of quantities derived from thecorrelation matrix of the data. Khan et al. (2007) first derived this form of the algorithm based onEfron et al. (2004). We consider the dataset (y,X) following the linear regression model,y = Xβ + ε (2.18)17where y∈Rn is the vector of responses, X= (x1,x2, . . . ,xp)∈Rnxp is the data matrix, β ∈Rp is thevector of regression coefficients, and ε is the vector of independent normally distributed regressionerrors with mean zero. We assume that y and all columns of X have been standardized to havemean of zero and standard deviation one.Let RX = 1n X′X be the sample correlation matrix, r j = 1n x′j(y− yˆ) be the sample correlationbetween covariate j and the current residuals, and r be the sample correlation between the membersof the active set and the residuals. By construction, all members of the active set have equalcorrelation with the residuals. We define A as the set of indices of covariates in the active set andsA as the sign of the correlation between the variables in the active set and the residuals when theyentered the active set. We now summarize the LARS procedure,(1) Initialize the current fitted values yˆ = 0, the active set A = /0, and the coefficient vector βˆ = 0.(2) Add to the active set the variable with the greatest correlation with the residuals. First thevariable is found,m = argmaxj∈{1,...,p}|r j|, sm = sign(rm), and r = smrm (2.19)and then added to the active set.A←− A∪{m}, sA←− sA∪{sm}. (2.20)(3) Calculate the direction of the update and the related quantities. Let XA = (· · ·s jx j · · ·) forj ∈ A denote the signed data matrix of variables in the active set, RA denote the sample corre-lation matrix of the (unsigned) variables in the active set, and DA = diag{sA}. The followingquantities are calculatedaA = {1A(DARADA)−11A}−1/2, wA = aA(DARADA)−11A. (2.21)and a j = (XAx j)′wA, j ∈ Ac. (2.22)For the first step, this reduces to aA = 1, wa = 1, and a j = r jm.(4) Calculate the largest step size such that no covariate outside of the active set has greater abso-18lute correlation with the residuals than those in the active set,γ = minj(min(γ+j ,γ−j )) = minj(min(r− r jaa−a j ,r+ r jaa+a j))(2.23)(5) Update the model using the above quantities. The prediction is updated in the direction withequal correlation to all signed variables in the active set.yˆ←− yˆ+ γua, r←− r− γaA, r j←− r j− γa j, j ∈ Ac. (2.24)The coefficient update can be optionally performed,βˆ j←− βˆ j + γs jwA j, j ∈ A (2.25)(6) Repeat steps (2) - (5) until min(n, p) elements have been added to the active set.2.3 Review: Robust Least Angle RegressionRobust Least Angle Regression (RLARS) is a robust linear regression model-building procedureproposed by Khan et al. (2007) consisting of two steps, variable sequencing and segmentation. Invariable sequencing, RLARS creates a list of variables in which the best predictors are first. Invariable segmentation, the optimal submodel is chosen along the sequence. In this section, we willreview the two step model-building process.2.3.1 Variable SequencingVariable sequencing is the first step in the two part RLARS model-building process. Variablesequencing generates an ordered list of variables, such that the best predictors are first in the list.Sequencing is performed using the RLARS algorithm. The order of the sequence is the order inwhich variables enter the active set. We will now discuss the RLARS algorithm in detail.Khan et al. (2007) observed that if the data are normalized, then the LARS algorithm canbe written entirely in terms of quantities calculated using the sample correlation matrix. Hence,LARS can be made robust by normalizing the data using robust location and scale estimators andcalculating the LARS update using a robust estimate of the correlation matrix of the data.The data are normalized by centering and scaling each covariate of the data matrix, X =(x1, . . . ,xp), and the vector of responses y with the median the median absolute deviation. For19a variable x j, the normalized vector z j is,z j =x j−medi∈{1,...,n} xi jmadi∈{1,...,n} xi j, j = 1, . . . , p (2.26)Khan et al. (2007) considered two versions of RLARS, Data Cleaning RLARS and Plug-inRLARS. Data Cleaning RLARS cleans the data of outliers and then performs LARS on the cleaneddata. Plug-in RLARS replaces the sample correlation matrix in LARS with a robust correlationmatrix.Data Cleaning RLARS achieves robustness by replacing each normalized observation zi fori = 1, . . . ,n with the Winsorized equivalent ui = min(√c/D(zi),1)zi, where Z = (zi j)1≤i≤n,1≤ j≤nis the normalized data matrix. Here, D(x) = z′iΣˆ−1zi is the Mahalanobis distance and Σˆ is a robustinitial estimate of the correlation matrix and c is a tuning constant. Data cleaning by Winsorizationachieves robustness by down-weighting outlying observations. As previously discussed in Section2.1, methods that down-weight outliers can fail with very low levels of independent contaminationfor moderately large p. Hence, Data Cleaning RLARS is not appropriate for data with independentcontamination, the primary focus of this thesis.Plug-in RLARS is conceptually very simple – the sample correlation matrix in the LARS al-gorithm is replaced with a with robust counterpart. Classical affine equivariant robust correlationestimators are prohibitively computationally expensive for large p. For this reason Khan et al.(2007) proposed using robust bivariate correlation coefficients to estimate the elements of the cor-relation matrix. If the resulting correlation matrix is not positive-definite, the method describedin Ricardo A Maronna and Zamar (2002) can be used to produce a positive definite scatter matrixbased on the spectral decomposition of the original.Two methods for robustly estimating bivariate correlations are considered in Khan et al. (2007),bivariate M-estimators of scatter (Ricardo Antonio Maronna, 1976) and bivariate Winsorization.The bivariate M-estimator of location and scatter is computationally efficient and affine equivariantand is thus a good choice. Bivariate Winsorization was a new method proposed by Khan et al.(2007) as a way to compute robust bivariate correlations even faster. The reason Khan et al. (2007)proposed a new method is that the number of bivariate correlations that must be computed growswith p2, so improving the speed of computation is paramount.Simulation results by Khan et al. (2007) show similar performance for RLARS using bothcorrelation estimators and thus recommends the faster method, bivariate Winsorization. We willtherefore restrict our focus to Plug-in RLARS using bivariate Winsorization, hereafter referred to20as RLARS.Bivariate WinsorizationBivariate Winsorization extends the idea of univariate Winsorization proposed by Huber (1981).For a variable with n observations, x = (x1, . . . ,xn)′, the univariate Winsorized values are givenby ui = ψc((xi−med(xi))/mad(xi)), i = 1, . . . ,n where ψc(t) = min(max(−c,x),c). Here, c is atuning constant used to define the region to which the Winsorized data are restricted.Huber (1973) was the first to suggest that correlation coefficients could be calculated usingthe Winsorized data to get a robust estimate of correlation. Later, Alqallaf, Konis, et al. (2002)suggested that a fast and robust estimate of the correlation matrix for data with high p could beobtained by estimating the elements using the correlation coefficients calculated on the Winsorizeddata.Univariate Winsorization is effective at detecting marginal outliers. However, observations canbe within the typical range of data in each variable, but lie outside the normal range of the data inpairs of covariates which is very important for estimating bivariate correlations. Figure 2.1 showsan example of how univariate Winsorization can be ineffective at dealing with bivariate outliers.Bivariate Winsorization incorporates the bivariate relationships between pairs of variables and isthus more robust to bivariate outliers.Bivariate Winsorization is performed by shrinking outliers to the border of an ellipse, usingthe transformation u = min(√c/D(x),1)x with x = (x1,x2)′ where D(x) = x′Σ̂−10 x′ is a Maha-lanobis distance calculated estimated using an initial estimate of the bivariate correlation matrixΣ̂0. The tuning constant recommended by Khan et al. (2007) is c = 5.99, the 95% quantile of theχ22 distribution. The final correlation coefficient can be calculated using the Winsorized values.The elements of the correlation matrix are estimated by calculating the correlation coefficienton the bivariate Winsorized data. However, the resulting correlation matrix is not necessarily pos-itive definite. When this is required, the correction described in Ricardo A Maronna and Zamar(2002) can be used to make the matrix positive definite.Figure 2.2 shows an example of how bivariate Winsorization can improve performance overunivariate Winsorization with bivariate data. Figure 2.1 shows the same dataset with univariateWinsorization. Bivariate Winsorization is much more effective at shrinking the outliers whichimproves the correlation estimate.21Figure 2.1: Example of univariate Winsorization with c = 2. The bivariate outliers in thisexample will not be shrunk at all by univariate Winsorization.Initial Correlation EstimateAdjusted Winsorization is a fast and robust method for getting an initial estimate of the bivari-ate correlation in bivariate Winsorization. It performs structured univariate Winsorization in eachquadrant. The quadrants are divided into major and minor quadrants, which use different tuningconstants. The major quadrants are the diagonally opposite quadrants containing the majority ofthe data and the minor quadrants are the remaining two quadrants.Bivariate Winsorization has one tuning parameter c1. This is the tuning constant of the uni-variate Winsorization for the major quadrants. Khan et al. (2007) recommends setting c1 to 2. Thetuning constant of the minor quadrants is set to c2 = c1√n2/n1 where n1 is the number of pointsin the major quadrants and n2 = n−n1 is the number of points in the minor quadrants.Figure 2.3 shows how adjusted Winsorization shrinks bivariate outliers much closer to themajority of the data than univariate Winsorization. Khan et al. (2007) shows that the adjustedWinsorized estimate is asymptotically consistent, with bounded influence, and that a smoothedversion of the estimator is asymptotically normal. Importantly, the calculation of a correlationcoefficient using adjusted Winsorization has the same computational complexity as calculating thecorrelation coefficient on the raw data and is thus very fast.22Figure 2.2: Example of bivariate Winsorization with c = 5.99. Bivariate outliers are shrunkto the border of the ellipse.2.3.2 Variable SegmentationVariable segmentation selects a submodel along the previously generated variable sequence accord-ing to a model selection criterion. The variable sequence is treated as a sequence of nested sub-models, with each subsequent submodel containing one additional variable. Robust linear modelsare fit to the submodels and the optimal model is selected using a robust model selection criterion.Khan et al. (2007) proposed that linear models be estimated using MM-regression. MM-regression (Yohai, 1987) is a highly efficient and robust linear regression method. A definitionbased on that in Khan (2006) is as follows,Definition: MM-regression estimate 2.3.1. Let χ0 :R−→R and χ1 :R−→R be two score functionsuch that χ0(u)< χ1(u), u ∈ R, and each χ satisfies the following set of regularity conditions:(R1) χ(−u) = χ(u),u ∈ R,(R2) χ non-decreasing on [0,∞),(R3) χ(0) = 0, and χ(∞) = 1,(R4) χ is continuously differentiable.23Figure 2.3: Example of adjusted Winsorization with c = 2. Bivariate outliers are shrunk tothe edge of the small box.Let β˜ be a high-breakdown point “initial” estimate for β and σˆ be the estimate of the scale of theresiduals satisfying,1nn∑i=1χ0(yi−x′iβ˜σˆ)= δ , (2.27)where δ ∈ (0,1] is the expectation of χ0(·) under the central model. Then, the regression MM-estimate βˆ is a local minimum of the loss function,L (β ) =1nn∑i=1χ1(yi−x′iβσˆ)(2.28)such that the following condition is satisfied,L (βˆ )≤L (β˜ ) (2.29)A good choice for the initial estimate β˜ in Definition 2.3.1 is the regression S-estimate (Rousseeuwand Yohai, 1984) due to its high breakdown point. The S-estimate is defined as follows by Khan(2006),24Definition: S-regression estimate 2.3.2. Let χ0 : R−→ R be the score function described in Defi-nition 2.3.1. The S-estimate β˜ is defined asβ˜ = argminβσˆ(β ) (2.30)where σˆ(β ) solves1nn∑i=1χ0(yi−x′iβσˆ(β ))= δ , (2.31)The corresponding S-estimate of scale, σˆ , is given byσˆ = infβσˆ(β ) = σˆ(β˜ ) (2.32)For both the χ1 and χ0 functions we use Tukey’s Bisquare loss, given byχ(t) = min{1,1− (1− (t/c)2)3}. (2.33)Theoretically, MM-regression estimators can be computed with up to n variables. However,simulation results in Chapter 5 suggest that the number of variables should not exceed n/2. Thus,only the first n/2 variables in the sequence are considered.A model selection method is used to select the best submodel. Non-robust model selectionmethods are susceptible to outliers in the data and are thus not appropriate to select a robust method.Instead, a robust model selection method is required.Khan et al. (2007) recommend using either robust Bootstrapping (RBoot) (Khan et al., 2007,based on Salibian-Barrera, 2000 and Salibian-Barrera and Zamar, 2002), or robust cross-validation(robust CV) (Ronchetti et al., 1997) to perform model selection. Both methods are robust andperform well compared to the other robust model selection procedures considered. Cross-validationis commonly used for model selection in sparse linear regression, so we choose to restrict our focusto robust CV.Robust Cross-ValidationIn classical k-fold cross-validation, the dataset (y,X) is randomly partitioned into k groups. Amethod is trained on the data excluding observations in one of the groups. The fitted model isthen used to predict the response of the group that was excluded from the training set, giving theprediction error at each observation in the test set. This is then repeated for every group. When25k = n, this is referred to as leave-one-out (LOO) cross-validation.Let (e1, . . . ,en) denote the prediction errors for each observation. The classical cross-validationmean squared prediction error (MSPE) isMSPE =1nn∑i=1e2i . (2.34)The model with the smallest MSPE is selected. However, the squared error is not robust to outliers.A single outlier can result in a very large MSPE, even if the method used to fit the model is robust.The trimmed MSPE (TMSPE) excludes the largest α fraction of the prediction errors from thecriterion (Khan et al., 2007). Let (e(1), . . . ,e(n)) denote the order statistics of the prediction error.The TMSPE is given byTMSPE =1bn(1−α)cbn(1−α)c∑i=1e2(i) (2.35)where bn(1−α)c is the largest integer smaller than n(1−α). The TMSPE is robust to up to dnαecasewise outliers.The TMSPE can still be susceptible to the propagation of independent outliers even if themethod estimating βˆ is robust. However, if βˆ is sparse, only independent outliers in the active vari-ables will affect the TMPSE. Under the assumption that s/p is small, independent error propagationis a less severe problem for the TMSPE for sparse methods.Robust Bayesian Information CriterionAs an alternative to the robust cross-validation criterion, Alfons et al. (2013) proposed performingmodel selection using a robust Bayesian Information Criterion (BIC). The generic form of the BIC(Schwarz, 1978) is given by,BIC(βˆ ,s) =−2L (βˆ )+ s · logn (2.36)where L (βˆ ) is the log-likelihood and s is the number of parameters included in the model. Min-imizing the BIC along a sequence of models is equivalent to maximizing the posterior probabilityof the selected model. Unlike the AIC, the BIC is an asymptotically consistent model selectioncriterion (Friedman et al., 2001).When comparing linear regression models assuming normally distributed error, minimizing the26BIC is equivalent to minimizing the following criterion,RBIC(βˆ ,s) = log(σˆ)+s · log(n)n(2.37)where σˆ is the sample standard deviation of the residuals. Alfons et al. (2013) proposed performingmodel selection amongst linear regression models by selecting the model that minimizes Equation2.37 with a robust estimate of residual scale for σˆ . For RLARS, the robust estimate of residualscale obtained while finding the S-estimate of regression can be used.2.4 RLARS in the Presence of Independent ContaminationRLARS was originally developed under the Tukey-Huber contamination model. However, unlikemany robust regression methods, RLARS does not down-weight atypical observations based on allp variables. As a result, the two-step RLARS model building process has some desirable robustnessproperties in the presence of independent contamination.In variable sequencing, the data are only directly used to calculate the correlation matrix. Theelements of the correlation matrix are estimated using robust bivariate correlation estimates. FromEquation 1.2, we can see outlier propagation is not a significant problem when only consideringtwo variables. If the breakdown point of a robust correlation coefficient estimator is ε for structuralcontamination, then the breakdown point for independent contamination εi isεi = 1−√1− ε (2.38)For a breakdown point of ε = 0.5 the breakdown point for independent contamination is εi = 0.29.The variable sequencing step is therefore robust to independent contamination.Segmentation fits a series of MM-regression estimates to subsets of variables. As discussed inSection 2.1, MM-regression estimates are robust by down-weighting outlying values and are thussusceptible to independent outliers (Alqallaf, Van Aelst, et al., 2009). However, the MM-estimatorsare only fit to a subset of variables in each model so, similar to robust bivariate correlation estimates,the propagation of independent outliers is a less severe problem than for robust penalized regressionmethods developed under the Tukey-Huber contamination model. The breakdown point of an MM-estimator fit on the true active variables isεi = 1− (1− ε)1/s (2.39)27where ε is generally set to 0.5. Under the assumption of sparsity, s/p is small so RLARS is robustto a higher fraction of independent outliers than methods that down-weight observations based onall p predictors.2.5 ConclusionIn this chapter, we reviewed RLARS and discussed it’s application to data with independent con-tamination. RLARS is a robust and sparse linear regression method composed of two steps, variablesequencing and segmentation. Variable sequencing generates a sequence of variables with the bestvariables appearing first in the list. Variable segmentation selects a submodel along the sequence.RLARS was developed under the Tukey-Huber contamination model, but has good robust-ness against independent contamination. Variable sequencing only uses univariate and bivariaterobust estimators and is thus not vulnerable to outlier propagation. Variable segmentation fits MM-regression estimators to sub-models along the sequence and, under the assumption of sparsity, theoptimal submodel has much fewer variables than p. As a result, the optimal submodel is consider-ably less susceptible to outlier propagation than the full model.An open question for future research is the use of the Shooting S-estimator (O¨llerer et al.,2016) in variable segmentation as it is robust to both independent and structural contamination, un-like the MM-regression estimator currently used by RLARS. We suggest the Shooting S-estimatorbecause, unlike the 3S-regression estimator, it calculates a residual scale estimate, which can beused to calculate the robust BIC criterion for model selection. Using the shooting S-estimatorcould thus potentially improve the robustness of RLARS to independent contamination. However,variable segmentation is already computationally costly when performed using the MM-regressionestimator, which is fast to compute relative to the Shooting S-estimator. Hence, the increasedcomputational cost of using the Shooting S-estimator must be carefully considered.Overall, RLARS has good robustness to both cellwise and casewise outliers in sparse data.This is confirmed by the simulation results in Chapter 5. RLARS has an intuitively appealingrelationship to the LARS estimator. It is a regularized estimator that is applicable to data withmore dimensions than observations, filling an important gap in the literature.28Chapter 3Robust Least Angle Regression UsingPre-Filtered Data3.1 IntroductionIn Chapter 2, we discussed how RLARS is robust to independent contamination in sparse data, butsusceptible to outlier propagation in the active set. However, outlier propagation in the active setcan still be significant for high levels of independent contamination. In this section we propose amodification of RLARS to further improve the robustness of RLARS to independent outliers.As previously discussed in Chapter 2, RLARS is vulnerable to independent contamination inthe variable segmentation step. Outlier propagation in the active set can break down the MM-estimators fit to the submodels along the RLARS variable sequence. If variable segmentationcan be made robust to independent outliers, then the robustness of RLARS can be significantlyimproved.To improve the robustness of variable segmentation, we might consider the 3S-regression es-timator for inspiration. The 3S-regression estimator is highly robust to independent outliers. Itachieves this robustness by filtering independent outliers from the data matrix before fitting a linearregression model (Leung et al., 2016). Outliers are replaced with missing values. The robustnessof RLARS to independent outliers might similarly be improved by filtering the data matrix of in-dependent outliers. However, RLARS can not handle missing data, so a different filtering methodthan that used in 3S regression is required.Rousseeuw and Bossche (2018) recently proposed the Detecting Deviating Data Cells (DDC)29method for detecting independent and structural outliers in approximately Gaussian data. DDCdetects independent contamination using the bivariate relationships between variables. Outlyingcells are replaced with predicted values to produce an “imputed” data matrix. Rousseeuw andBossche (2018) suggests that the imputed data matrix can be analyzed by robust methods that cannot use incomplete data. Hence, the imputed matrix could be used to create a “filtered” data matrixfor RLARS.In this chapter, we propose performing RLARS using the DDC imputed data matrix, which wecall DDC-RLARS. By filtering out independent outliers, DDC-RLARS is robust to high levels ofindependent contamination in the active set.The rest of this chapter is organized as follows. In Section 3.2, the DDC algorithm for comput-ing the imputed data matrix is calculated. In Section 3.3, DDC-RLARS is presented. Section 3.4concludes, reviewing the major topics of the chapter.3.2 Review: Detecting Deviating Data CellsDetecting Deviating Data Cells (DDC) was proposed by Rousseeuw and Bossche (2018) as amethod for detecting outlying rows and cells. DDC uses the bivariate relationships between vari-ables to more effectively detect cellwise outliers. As part of the outlier detection algorithm, apredicted data matrix, is computed using robust bivariate relationships between variables.The predicted data matrix is used to detect outlying cells. Cells that significantly differ fromthe predicted value are flagged as outliers. These cells are replaced with the predicted values tocreate the imputed data matrix. This section will review the parts of the DDC algorithm used tocompute the imputed data matrix.Consider a data matrix X∈Rn×p with a multivariate Gaussian distribution N(µ,Σ). Rousseeuwand Bossche (2018) recommend that it be manually confirmed that the data are approximatelynormal and that any variables that are very non-Gaussian are transformed to be approximatelyGaussian. The imputed data matrix Ximp is calculated as follows.(1) Standardize the columns of X using robust estimates of location and scale.m j = robLoci(xi j), s j = robScalei(xi j−m j), (3.1)where robLoc and robScale are robust estimators of univariate scale and location, respectively.All robust estimators used in this section are described in Appendix A. The standardized data30matrix Z is given byzi j = (xi j−m j)/s j, i = 1, . . . ,n, j = 1, . . . , p. (3.2)(2) Perform univariate outlier detection on Z. Outlying cells are replaced with missing values,represented with NA’s, to produce a univariate filtered data matrix U,ui j =zi j, |zi j| ≤ cNA, |zi j|> c i = 1, . . . ,n, j = 1, . . . , p. (3.3)The cutoff is c =√χ21 (p) where χ21 (p) is the pth quantile of the χ21 distribution. The value pis a tuning parameter. Under the assumption of normality, approximately 1% of non-outlierswill be flagged.(3) Determine bivariate correlations between columns of U. For all pairs of columns, ` 6= j for`, j ∈ {1, . . . , p}, calculater` j = robCorri(ui`,u j`) (3.4)where robCorr is a robust bivariate correlation estimator given in Appendix A, excluding pairswhere either value is NA. If the correlation exceed some cutoff,|ρ` j| ≥ ρlim (3.5)then the variables ` and j are considered connected. Connected variables have a non-negligiblerelation that is used to determine whether a cell is outlying. By default, ρlim = 0.5. For con-nected pairs (`, j), we computeb` j = robSlopei(ui`|ui j) (3.6)where robSlope is the slope of variable ` regressed upon variable j without an intercept, givenin Appendix A.(4) Predict cells using the relationships between the connected variables. For this step, we use therobust slopes calculated in the previous step to predict the value of each cell, zˆi j. If the numberof dimensions exceeds some value, by default 1000, then only the k most highly correlated areused to reduce the required memory and computation time.31Let L j be the set of all indices satisfying (3.5) for variable j, including j. Then the predictedvalue is given byzˆi j = G({b` jui` : h ∈ L j}, i = 1, . . . ,n. (3.7)The function G is a combination rule that omits NA’s and is zero when all variables in the roware missing. It is recommended that G be a weighted mean with w` j = |r` j|. By construction,ui j < c, so the impact of a single outlying cell on the prediction is limited.(5) Deshrinkage of the predicted values. To reduce the effect of the shrinkage due to the combina-tion rule, we regress the observed values zi′ j onto the predicted value zˆi′ j, using the same robustregression coefficient estimation from Step (3),a j = robSlopei′(zi′ j|zˆi′ j) j = 1, . . . , p. (3.8)The predicted values are then replaced by the corrected valueszˆi j←− a j zˆi j (3.9)for i = 1, . . . ,n and j = 1, . . . , p.(6) Create the standardized imputed matrix by replacing cellwise outliers with predicted values.First, we compute the standardized residuals,ri j =zi j− zˆi jrobScalei′(zi′ j− zˆi′ j) , i = 1, . . . ,n, j = 1, . . . , p (3.10)and flag cells where |ri j| > c, with c given in Step (2), as cellwise outlier. The imputed stan-dardized data matrix Zimp is defined with entries,zimpi j =zi j, |ri j| ≤ czˆi j, |ri j|> c i = 1, . . . ,n j = 1, . . . , p. (3.11)(7) Destandardize the standardized imputed data matrix Zimp to get the imputed data matrix Ximp.The entries of Ximp areximpi j = s jzimpi j +m j, i = 1, . . . ,n; j = 1, . . . , p (3.12)32for m j and s j computed in Step (1).The imputed data matrix Ximp is the data matrix X with outlying cells replaced with values pre-dicted by DDC. The full DDC algorithm contains an extra step between (6) and (7) that flagsstructural outliers. RLARS is robust to structural outliers, so we will not review this step.3.3 RLARS With the DDC Imputed Data MatrixTo further improve the robustness of RLARS to outliers, we propose performing RLARS on theDDC imputed data matrix. The DDC imputed data matrix has outlying cells replaced with valuesimputed using the non-outlying cells in the same observation. As no independent outliers arepresent in the data matrix during variable segmentation, DDC-RLARS is not susceptible to outlierpropagation in the active set. It is thus highly robust to independent outliers.To calculate the DDC-RLARS estimate, DDC is first used to calculate the imputed data matrixXimp. Next, the two-step RLARS procedure is performed using the dataset (y,Ximp). Duringvariable segmentation, MM-estimators are fit to submodels along the variable sequence. MM-estimators achieve robustness by down-weighting outlying observations and are thus not robust toindependent contamination on unfiltered data, as discussed in Chapter 1.In RLARS, variable segmentation is performed using the raw data matrix and is thus suscep-tible to outlier propagation in the active set. Under the assumption of sparsity, the number ofvariables in the optimal submodel is much less than p, so RLARS is resistant to independent out-liers. However, for high-levels of independent contamination, outlier propagation can result in thefraction of contaminated cases in the active set exceeding 0.5, exceeding the maximum breakdownpoint of the MM-estimators used by RLARS. In contrast, variable segmentation in DDC-RLARSis performed using data filtered of independent outliers so outlier propagation is not a problem.Thus, DDC-RLARS is not susceptible to independent outliers in the active set.However, the data matrix loses information by replacing flagged cells with imputed values. Asa result, DDC pre-filtering decreases the performance of RLARS when low levels of independentcontamination are present. Hence, DDC-RLARS should only be used when there there is reasonto believe that there are high levels of independent contamination.Alternatively, to improve the robustness of RLARS to independent outliers we could replacethe MM-estimator in the segmentation step with an estimator that is robust to both independentand structural contamination such as the Shooting S-estimator (O¨llerer et al., 2016), as discussed inSection 2.5. Such as estimator could be fit using the raw data and thus may have better performancethan DDC-RLARS. However, we were unable to explore this alternative due to time constraints.33Overall, DDC-RLARS is more robust to independent contamination than RLARS. Based onsimulation results in Chapter 5, we recommend that DDC-RLARS be used instead of RLARS ifthere is more than 5% independent contamination. The diagnostic tools provided by DDC in thepackage cellWise by Rousseeuw and Bossche (2018) can be used to manually determine theseverity and prevalence of independent contamination in the data matrix.3.4 ConclusionIn this chapter, we proposed DDC-RLARS, a method that further improves the robustness ofRLARS to independent contamination. For high levels of contamination, outlier propagation cancause the number of contaminated cases in the variables of the active set to exceed 0.5 – the max-imum breakdown point of RLARS. DDC-RLARS uses an initial filtering step to remove indepen-dent outliers and thus avoid this problem.DDC-RLARS applies RLARS to the DDC imputed data matrix. The DDC imputed data matrixhas outlying cells replaced with imputed values, leaving only the casewise outliers. RLARS isrobust to casewise outliers, so DDC-RLARS is very robust to both structural and independentcontamination.However, DDC-RLARS has reduced performance on data with low levels of independent con-tamination. Therefore, DDC-RLARS should only be preferred over RLARS when manual exam-ination of the data indicates that there are significant levels of independent outliers, such as thosein Rousseeuw and Bossche (2018). We recommend that DDC-RLARS be used in data with morethan 5% independent outliers, based on numerical results in Chapter 5.34Chapter 4Robust Least Angle Regression withLasso Modification4.1 IntroductionThe least squares estimator with a Least absolute shrinkage and selection operator (Lasso) penaltyis a robust and sparse regularized linear regression estimator. As previously discussed in Chapter1, Lasso solves a constrained linear regression problem that can be written as the solution of aclassical sum-of-squares residual loss function with an L1 coefficient penalty. The L1 coefficientpenalty constrains the effective degrees of freedom, so solutions can be obtained when n< p. TheL1 penalty also tends to shrink coefficients towards zero, making Lasso estimates sparse.Efron et al. (2004) observed that a simple modification to LARS implements the Lasso, calcu-lating all possible Lasso estimates for a given problem. RLARS improves the robustness of LARSwithout the Lasso modification by replacing non-robust scale, location, and correlation estimatorswith robust counterparts. The Lasso modification can be made robust is the same way. We proposeperforming RLARS with the Lasso modification (RLARS-Lasso). We show that the LARS Lassomodification can be made robust similar to RLARS.4.2 Review: LARS Lasso ModificationEfron et al. (2004) observed that the LARS solution path is very similar to the Lasso solution path.Exploring this further, the paper proved that a simple modification of the LARS algorithm can beused to calculate the full Lasso solution path.35The least squares estimator with the Lasso penalty is the solution of,βˆ = argminβ∈Rn∑i=1(yi−xiβ )2+λp∑j=1|β j|. (4.1)where λ > 0 is the coefficient penalty parameter. The L1 coefficient penalty of Lasso tends toshrink the regression coefficients to zero, a desirable sparsity property. As with LARS, Lasso canhave at most n non-zero coefficient estimators (e.g. Hastie et al., 2015, Chapter 2).While LARS and Lasso can have very similar solution paths in some problems, Lasso has thefollowing restriction not present in LARS. Let βˆ be a Lasso solution with non-negative coefficientsfor i ∈ A and r j be the correlation of covariate j with the residuals.sign(βˆ j) = sign(r j) (4.2)Efron et al. (2004) proved that if Constraint 4.2 is enforced in LARS through the Lasso modi-fication, then LARS produces the full Lasso solution path. To quote Theorem 1 in Efron et al.(2004),Under the Lasso modification, and assuming the “one at a time” condition discussed below,the LARS algorithm yields all Lasso solutions.The “one at a time” condition is that only one covariate enters or leaves the active set A on eachstep.Constraint 4.2 is always true when the variable first enters the active set. This can be seenfrom the coefficient update step in Step 4 of the LARS algorithm. However, this not necessarilytrue in later steps. As a result, the Lasso modification must sometimes remove variables from theactive set, unlike LARS, and thus may require more than min(n, p) steps to calculate the completesolution path.Constraint 4.2 can be enforced by ensuring that β j for j ∈ A will not change sign during Step(5) of the LARS algorithm. This can be performed by determining whether the smallest step sizesuch that a coefficient changes sign is smaller than the previously calculated LARS step size γ . Wethus calculate the step size γ˜ j such that the jth coefficient at step k+ 1 equals zero, βˆ k+1j = 0, for36all j ∈ A using the coefficient update in Equation (2.16),βˆ k+1j = 0 = βˆkj + γ˜ js jwA, jγ˜ j =−βˆ kjs jwA, j.The smallest such step size is then found,m = argminj∈A{γ˜ j}γ˜ = γ˜m.If γ˜ ≥ γ , then a modified is used to perform the LARS update,γ ←− γ˜,A←− A−{m},sA←− sA−{sm},The LARS Lasso modification is incorporated in the LARS summary in Section 2.2.4 by addingthe following step between steps 4 and 5.(4a) The Lasso modification is performed by calculating the step size for which each βˆ j for j ∈ Acrosses zero. This is easily derived from the coefficient update step,γ˜ j =βˆ js jwA j, j ∈ A. (4.3)If the step size required for the coefficient to change sign is less than the size of the next step,γ >minj∈Aγ˜ j, (4.4)then the corresponding variable is removed from the active set,m = argminj∈Aγ˜ j (4.5)γ j←− γ˜m, A←− A−{m}, and sA←− sA−{sA}. (4.6)37Continue onto Step 5 in the LARS algorithm, using the new step size.4.3 RLARS-Lasso ModificationWe propose adding the Lasso modification to RLARS, which we call RLARS-Lasso. All quantitiesin the Lasso modification can be calculated from the sample correlation matrix. As in RLARS, wereplace the sample correlation matrix with the robust correlation matrix estimated using bivariateWinsorization. As discussed in Section 2.4, the correlation matrix estimate in RLARS is robust tocellwise and casewise contamination. Hence, the RLARS-Lasso solution path is similarly robustto cellwise and casewise contamination.In RLARS, the variable sequencing step is followed by a variable segmentation step that selectsthe final model. The variable segmentation step selects an unpenalized MM-regression estimators,so the resulting regression estimate is not regularized. Instead of using a variable segmentationstep to get coefficient estimates, RLARS-Lasso computes a coefficient update directly. This hasthe benefit of being a one-step process. Further, the coefficient estimates are robust to independentcontamination as all quantities are calculated using univariate or bivariate robust estimators.RLARS-Lasso uses a fast, but low-efficiency correlation estimate to calculate the coefficientupdates. In contrast, the MM-estimators used by RLARS to estimate the regression coefficients arehighly efficient and robust under casewise contamination. As a result, RLARS has better predictiveperformance than RLARS-Lasso in some situations with structural contamination. However, MM-estimators perform poorly when the number of covariates included in the model s are close to n.Hence, it has been observed RLARS-Lasso outperforms RLARS when p/n is even moderatelylarge. The simulation study in Chapter 5 suggests RLARS-Lasso should be preferred over RLARSwhen p/n> 10.To get the final regression coefficient estimate, a model along the full RLARS-Lasso solutionpath must be chosen. This is equivalent to selecting the shrinkage parameter λ . As data may haveoutliers, a robust model selection method must be used.4.4 Selecting the Penalty ParameterThe penalty term λ is selected from grid of 200 logarithmically equispaced values. The modelwith the lowest Trimmed Mean Squared Prediction Error (TMSPE) is selected. The TMSPE isgiven in Equation 2.35. The TMSPE is robust to up to a fraction α outlying observations. Werecommend trimming α = 0.25 by default as a good balance between robustness and efficiency.Similar to other methods developed under the Tukey-Huber contamination model, the TMSPE38down-weights extreme observations. As a result, the TMSPE is vulnerable to outlier propagation.This is discussed in greater depth in Chapter 1. Thus, the TMPSE can break down for less than afraction of α of cellwise outliers due to outlier propagation in the active variables.However, an outlying cell will only influence a prediction, and thus the TMPSE, if the outlieris in a variable in the active set. Thus, RLARS-Lasso is only susceptible to outlier propagation inthe active set, similar to RLARS. Under the assumption of sparsity, the number of active variabless is much less than the total number of variables p so RLARS-Lasso is resistant to independentoutliers.4.5 ConclusionIn this chapter, we proposed performing RLARS with a robust LARS Lasso modification, RLARS-Lasso. The RLARS-Lasso modification is computed by replacing the sample correlation matrixwith the robust correlation matrix estimated by RLARS. The final model along the solution path isselected using the TMSPE.RLARS-Lasso computes the coefficient update directly and thus does not require a variablesegmentation step. As a result, the RLARS-Lasso coefficient estimates along the solution path arerobust to independent and structural contamination. However, the penalty parameter selection stepis not robust to independent contamination. Similar to RLARS, outlier propagation in the activeset can cause the number of contaminated cases to exceed 0.5, significantly influencing the penaltyparameter selection.An open question for future research is how the robustness of penalty parameter selection toindependent contamination could be improved. Potentially, the DDC imputed data matrix describedin Section 3.2 could be used to calculate the TMSPE. The imputed data matrix replaces cellwiseoutliers with imputed values, possibly reducing the influence of independent contamination on theTMPSE.RLARS uses the highly robust and efficient MM-regression estimators to estimate regressioncoefficients. As a result, RLARS has superior predictive performance than RLARS-Lasso for lowp/n. However, MM-estimators are less efficient when the number of parameters is close to n. Wetherefore recommend that RLARS-Lasso be preferred over RLARS when p/n is larger than 10based on simulation results in Chapter 5.39Chapter 5Simulation Study5.1 IntroductionThis chapter presents a simulation study performed in R to compare the prediction accuracy andvariable selection capabilities of RLARS, RLARS-DDC, and RLARS-Lasso with various robustand sparse estimators across a variety of settings. Where possible, estimators were tuned to have abreakdown point of 25%, except the MM-estimators used to fit submodels in RLARS and RLARS-DDC. MM-estimators already have very high efficiency and are thus instead tuned to have a break-down point of 50%, as in Alfons et al. (2013),.The following six competitors were considered in the simulations.1. Lasso penalized MM-estimator (Smucler and Yohai, 2017) computed using the pense pack-age, called MMLasso. This method was tuned to have a 25% breakdown point.2. Penalized elastic net MM-estimator with α = 3/4 computed using the pense package,called PENSEM-EN. This method was tuned to have a 25% breakdown point.3. Sparse least trimmed squares regression (Alfons et al., 2013) computed using the robustHDpackage, called SparseLTS. This method was tuned to have a 25% breakdown point.4. Robust variable sequencing using robust least angle regression with an MM-estimator fit tothe selected variables (Khan et al., 2007) computed using the robustHD package (Alfonset al., 2013), called RLARS.5. Robust filtering of data using Detecting Deviating Data Cells (DDC) (Rousseeuw and Boss-che, 2018) with RLARS applied to the filtered data, called RLARS-DDC. The DDC imputed40data matrix is computed using the package cellWise.6. RLARS with the Lasso modification computed using an R package implemented by theauthor, called RLARS-Lasso.Tuning parameters for each regression method are selected using the method recommendedby the original author. The RLARS-Lasso shrinkage parameter is selected using the TMSPE.For RLARS and DDC-RLARS, the penalty parameter is selected using the robust BIC (Alfonset al., 2013). The robust BIC is used because the preferred alternative, the TMSPE, has a highcomputational cost that would make the large number of simulations computationally infeasible.However, simulation results indicate that robust cross-validation performs much better in certaincases than the robust BIC. In particular, the simulations in settings (a) and (b) in the first column ofTable 5.6 were a pathological case that required that DDC-RLARS use the TMSPE instead of theRBIC.Preliminary results showed that PENSEM, and thus MMLasso, require significantly more com-putation time than the other methods considered, especially for large p. While the pense packagecontains a well-written and efficient implementation of PENSEM in C++, the computation of thePENSEM estimator requires solving a very difficult optimization problem and is thus unsurpris-ingly slow. We consider a large number of simulation settings, so including PENSEM in every onewould make the time required to complete the simulations unfeasible. While it is undesirable toexclude a good competitor, results show that RLARS outperforms PENSEM, especially when datacontains independent outliers. As PENSEM is not a close competitor with RLARS for data withindependent outliers and performs slightly worse than RLARS in data with structural outliers, it isonly included in simulations with the lowest dimensionality, p = 50.5.2 Simulation SettingsSimulated data are generated using several simulation settings. The simulation settings are asfollows.(a) The first simulation setting is a latent variable model from Khan et al. (2007), which covers thecase where p< n. This setting uses the linear latent variable model of the formy = L1+L2+ · · ·+L`+σε (5.1)with ` latent variables, where L1,L2, · · · ,L` and ε are independent standard normal random41variables. A value of σ =√`/3 is chosen to set the signal to noise ratio (SNR) to 3. A set ofp predictor variables are generated from the latent variables as follows,x j = L j + τε j, j = 1, . . . , `x`+ j = Lb j/2c+1+δε`+ j j = 1, . . . ,2`x3`+ j = ε3`+ j j = 1, . . . , p−3`(5.2)where all ε j are independent standard normal random variables. Values of δ = 5 and τ =0.3 are chosen so that corr(x j,xbk/2c+1) = 0.5 for j = 1, . . . , ` and k = 1, . . . ,2`. The first 3`predictors are active, with the first ` being low noise perturbations of the latent variables andthe next 2` predictors being high noise perturbations. The last p− 3` random variables areinactive. We set p = 50 and `= 6 to replicate the simulation settings in Khan et al. (2007).(b) The second simulation setting has p > n. For this setting, the latent variable model in setting(a) is used with p = 450 and `= 6. This dimensionality was chosen to so that p/n = 3.(c) This simulation setting is from Alfons et al. (2013). For this simulation setting we consider thelinear model of the form,y = Xβ +σε (5.3)where X ∈ Rn×p is the data matrix, y ∈ Rn is the vector of responses, β ∈ Rp is the regressioncoefficient vector, and ε ∈ Rn is a vector of independent regression errors. The regressionerrors have distribution εi ∼ N(0,σ2) for i = 1, . . . ,n. The value of σ is chosen to achieve anSNR of 3. The predictors follow a multivariate normal distribution given by N(0,Σ) whereΣ = (Σi j)1≤i, j≤p with Σi j = 0.5|i− j|. This is commonly referred to as an autoregressive (AR)correlation structure.To fully replicate the setting from Alfons et al. (2013), the coefficient vector is set to β =(βi)1≤i≤p with β1 = β7 = 1.5, β2 = 0.5, β4 = β11 = 1, and β j = 0 for j ∈ {1,2, . . . , p} \{1,2,4,5,11}.All simulations settings have n = 150 observations, which is chosen to match the number ofobservations in the simulations in Khan et al. (2007).Contamination Schemes: Resilience to outliers is assessed using a set of contaminationschemes. Methods are tested on data and with both structural and independent contamination.42Structural contamination is generated by replacing the first bεnc observations with outliers. Allstructural contamination settings use ε = 0.1. We first consider the five structural contaminationschemes used in Khan et al. (2007).(i) No contamination.(ii) Symmetric slash vertical contamination: ε ∼ (1− ε)N(0,1)+ ε N(0,1)/Uniform(0,1), re-ferred to as symmetric slash contamination.(iii) Shifted normal vertical contamination: ε ∼ (1− ε)N(0,1)+ ε N(20,1), referred to as shiftnormal contamination.(iv) High leverage symmetric slash contamination, as in (2), with xi j generated from independentN(50,1) distributions, referred to as symmetric slash leverage contamination.(v) High leverage shifted normal contamination, as in (3), with predictors as in (4), referred to asshift normal leverage contamination.An additional structural contamination scheme from Alfons et al. (2013) is included. This schemeproduces high-leverage, non-sparse, tightly clustered outliers.(vi) High leverage clustered contamination: High leverage predictors x˜i are generated from in-dependent N(10,0.1) distributions. Response variables are generated from y˜ = η x˜′γ whereγ = (−1/p, . . . ,−1/p). The parameter η controls the magnitude of the outliers. This isreferred to as clustered contamination.Alfons et al. (2013) considers η = 1,2, . . . ,25. However, due to the large number of contami-nation schemes in this simulation study, we only consider η = 1,13,25.An independent contamination scheme is adapted from Leung et al. (2016). Independent con-tamination is added to the data by randomly replacing bεinc cells independently in each columnwith outliers in both the predictors and responses. Simulations are performed with εi = 0.014 andεi = 0.054, corresponding to 2 and 8 outliers in each column, respectively. We refer to these valuesof εi as low and high, respectively. Table 5.1 shows the expected fraction of contaminated cases forthe values of εi and p considered.(vi) Independent Contamination: Outlying predictors x˜i are set to E(xi)+ kSD(xi) and outlyingresponses y˜ are set to E(y)+ kSD(y).43Table 5.1: Expected fraction of contaminated cases as n−→∞ for p= 50,450,1000 for differ-ent levels of independent contamination εi.p = 50 p = 450 p = 1000εi = 0.014 0.51 1.00 1.00εi = 0.054 0.94 1.00 1.00This setting is slightly different than in Leung et al. (2016) in which, the outlying responses y˜ areset to E(y)+ kSD(ε). Simulations are performed with k = 2,3,5,10.For every combination of simulation settings (a)-(c) and contamination settings (i)-(vi) 100replications are performed. In addition, to prevent the difference in correlation structure betweensettings (a)/(b) and (c) from confounding differences in p, simulations are performed with the latentvariable correlation structure of settings (a) and (b) and the AR correlation structure of setting (c)for the three dimensionalities considered in this study, p = 50,450 and 1000, with contaminationsetting (i), no contamination.Two additional simulation settings are also considered. These settings are chosen to demon-strate how the performance of the robust and sparse linear regression methods change of a rangeof p and εi. To keep the number of settings manageable, they are only used with contaminationsetting (vi).(d) Range of p: This setting has the same correlation structure and distribution of X as the setting(c) with βi = 1 for i ∈ {1, . . . ,bζ pc} and βi = 0 for i ∈ {bζ pc, . . . , p} where ζ is the sparsitylevel. As with the other simulation settings, the value of σ is set so that the signal-to-noiseratio is 3.Simulations are performed for a range of dimensions, p= 100,200, . . .1000. The sparsity levelis set to ζ = 0.1. We consider εi = 0.014 and 0.054, such that there are 2 and 8 outlying cellsin each column, respectively.(e) Range of εi: This setting uses the same structure as (d), but with ζ = 0.1 and p = 300. Inde-pendent contamination levels are varied from bεinc= 2,4, . . . ,16. This provides a range of lowto very high levels of independent contamination.Overall, a total of 67 settings are considered across a variety of p, correlation structures, andcontamination settings. Simulations are performed for data following a latent variable and an ARcorrelation structure. Contamination schemes include vertical and high-leverage structural outliers44and independent outliers. Finally, two additional settings are used to examine how the performanceof the different methods evolve over a grid of p and εi, respectively.5.3 Performance MeasuresFor each replicate, two datasets are generated using the same settings. Contamination is added tothe training dataset, if required by the contamination scheme. The test dataset is used to computethe performance measures. The predictive performance is measured using the MSPE divided bythe variance of noise σ2, called the normalized MSPE, such that the best result is 1. The meanand the standard error of the normalized MSPE is reported. The normalized MSPE is henceforthreferred to as the MSPE. The variable selection performance is measured using the precision (PR)and recall (RC), defined as follows,PR =TPTP+FP=#{ j : βˆ j 6= 0∧β j 6= 0}{ j : βˆ j 6= 0}, RC =TPTP+FN=#{ j : βˆ j 6= 0∧β j 6= 0}{ j : β j 6= 0}where TP is the number of true positives, TN is the number of true negatives, FN is the numberof false negatives, and FP is the number of false positives. A positive is defined as the inclusionof a variable in the active set. An identification is true if the variable was correctly included in theactive set.5.4 Simulation ResultsTables 5.2 and 5.3 show the average MSPE, PR, and RC of the methods with the latent variableand AR correlation structure of simulation settings (a) and (c), respectively, with no outliers. Ineach column, the best performing method is highlighted. If multiple methods have the same valuewithin one decimal place, they are all highlighted. The standard error of the MSPE, PR, and RC isgreater than 0.1 in almost all cases across all contamination settings, so only one decimal place inshown.Table 5.2 shows the results for data generated using the latent variable correlation structurein simulation settings (a) and (b) with no outliers. For simulations with p = 50, RLARS has thebest mean MSPE. However, RLARS-Lasso, DDC-RLARS, MMLasso, and PENSEM-EN are allclose competitors. For p = 450 and 1000, RLARS-Lasso has the best MSPE. PENSEM-EN andSparseLTS have the best recall for p = 50, while all methods have comparable precision for thelatent variable correlation structure. For all cases, DDC-RLARS has the best precision.45Table 5.2: Mean and standard error of MSPE, PR, and RC across replicates with no contam-ination for data generation model (a) for p = 50, 450, and 1000. The MSPE is shownplus/minus the standard error. The best method for each performance measure is high-lighted.No Contaminationp Method MSPE PR RC50 MMLasso 2.1±0.4 0.6 0.5PENSEM-EN 2.1±0.3 0.5 0.6SparseLTS 2.5±0.5 0.4 0.6RLARS 2.0±0.3 0.8 0.4RLARS-Lasso 2.2±0.4 0.6 0.5DDC-RLARS 2.1±0.4 0.9 0.4450 SparseLTS 4.5±1.4 0.1 0.4RLARS 2.9±0.7 0.3 0.4RLARS-Lasso 2.6±0.5 0.2 0.4DDC-RLARS 3.1±1.0 0.4 0.41000 SparseLTS 6.1±1.8 0.1 0.4RLARS 3.2±0.8 0.2 0.4RLARS-Lasso 2.8±0.6 0.2 0.4DDC-RLARS 3.7±1.1 0.3 0.4Table 5.3 shows the results for data with the AR correlation structure and no outliers. All meth-ods have comparable MSPE for p = 50, while RLARS-Lasso performs the best for p = 450,1000.All methods have comparable recall. DDC-RLARS has the best precision for p = 50. The preci-sion of RLARS-Lasso does not change significantly with p while the precision of other methodsdeteriorates. By p = 1000, RLARS-Lasso has the best precision. Overall, for data with no outliersRLARS narrowly has the best predictive performance for simulations with p = 50 while RLARS-Lasso has the best for p = 450,1000. DDC-RLARS has the best precision in most cases while allmethods have comparable recall.Tables 5.4 and 5.5 show the results for simulation settings (a)-(c) with shift normal and sym-metric slash casewise contamination with and without leverage points. For non-leverage points,RLARS has the best predictive performance in all cases, except with symmetric slash contami-nation with setting (c). For outliers with leverage points, DDC-RLARS has the best predictiveperformance, sometimes by a considerable margin. RLARS-Lasso is the worst performing methodwith shift normal contamination with and without leverage points. However, RLARS-Lasso’s per-46Table 5.3: Mean and standard error of MSPE, PR, and RC across replicates with no contam-ination for data generation model (c) for p = 50, 450, and 1000. The MSPE is shownplus/minus the standard error. The best method for each performance measure is high-lightedNo Contaminationp Method MSPE PR RC50 MMLasso 1.2±0.1 0.4 1.0PENSEM-EN 1.2±0.2 0.3 1.0SparseLTS 1.3±0.2 0.2 1.0RLARS 1.2±0.2 0.7 1.0RLARS-Lasso 1.2±0.2 0.4 1.0DDC-RLARS 1.2±0.3 0.8 1.0450 SparseLTS 1.8±0.4 0.1 0.9RLARS 1.5±0.4 0.3 1.0RLARS-Lasso 1.3±0.2 0.3 1.0DDC-RLARS 1.7±0.5 0.3 1.01000 SparseLTS 2.0±0.5 0.1 0.9RLARS 1.7±0.3 0.2 1.0RLARS-Lasso 1.4±0.2 0.3 1.0DDC-RLARS 1.9±0.4 0.1 1.0formance is comparable to other methods for data without leverage points for p = 1000. RLARSand DDC-RLARS have the best precision for data with low-leverage outliers, achieving similar val-ues. All methods had comparable recall, except RLARS and DDC-RLARS for setting (a) with shiftnormal contamination. For data with high leverage outliers, DDC-RLARS have the best predictiveperformance and precision while RLARS and SparseLTS have the best recall.Table 5.6 shows the results for simulations with the high-leverage, clustered outliers describedin contamination scheme (vi). DDC-RLARS has the best predictive performance in every case.RLARS has poor predictive performance for η = 1 and η = 13, but performs well for η = 25.RLARS-Lasso performs poorly for settings (a) and (b), but performs reasonably well for setting(c). This agrees with previous results indicating that RLARS-Lasso has better relative performancefor settings with high p. DDC-RLARS has the best precision in all cases, frequently by a largemargin, and DDC-RLARS and SparseLTS have the best recall in most cases. RLARS and DDC-RLARS have low precision, but similar recall to other methods. Overall, DDC-RLARS has the bestpredictive and variable selection performance for data with high-leverage clustered contamination.47Tables 5.8 and 5.7 show the results for simulations with independent contamination, with εi =0.014 and 0.054, respectively. First, we will consider results for simulations with εi = 0.014.RLARS has the best predictive performance for simulation settings (a) and (b) while RLARS-Lasso has the best performance for most cases with setting (c). DDC-RLARS has comparablepredictive performance to RLARS, except for settings (a) and (b) with k = 5 and 10. RLARS hadthe best precision in all cases. DDC-RLARS had a similar precision for all cases except for setting(c). RLARS-Lasso had low precision compared to the other RLARS based methods, but betterprecision than SparseLTS. SparseLTS, MMLasso, and PENSEM-EN had similarly high recall forsetting (a), outperforming the RLARS based methods. For simulation setting (b), all methods hadsimilar recall. For setting (c), the RLARS based methods had very high recall, outperformingSparseLTS.For simulations with high levels of independent contamination, εi = 0.054, DDC-RLARS hasthe best predictive performance for all cases with k greater than 2. For k = 2, RLARS has thebest predictive performance for settings (a) and (b) while RLARS-Lasso has the best predictiveperformance for setting (c). RLARS has the best precision for all cases and DDC-RLARS hasa similar precision for most cases when k is less than 10. For simulation setting (a), SparseLTSconsistently has the best recall. For setting (b) and (c), the recall of all methods are similar. Over-all, DDC-RLARS has the best predictive performance and precision for higher levels independentcontamination, εi = 0.054, and RLARS-Lasso has the best predictive performance for the higherdimensional simulation setting, (c), with low levels of independent contamination, εi = 0.014.We will now consider simulation settings (d) and (e), which test the performance of methodsover a grid of p and εi values, respectively. Recall that the sparsity level of these simulations isfixed at ζ = 0.1, so the number of active variables changes with p. Figure 5.1 shows the meanMSPE of different methods over a range of εi from 0.014 to 0.1. The predictive performance ofDDC-RLARS is stable as the fraction of independent contamination increases. In comparison, theperformance of other methods deteriorates significantly as εi increases. DDC-RLARS is the bestperforming methods for εi > 0.03, which we refer to as moderate to severe independent contami-nation, which agrees with other simulation results.Figures 5.2 and 5.3 show the mean rank of the MSPE for different methods for a range of pfrom 100 to 1000 with k = 5 and εi = 0.014 and 0.054, respectively. The mean MSPE rank isshown as the MSPE is not directly comparable for different p. With εi = 0.054, DDC-RLARS hasthe best mean MSPE rank for p less than 600. However, the average MSPE rank of DDC-RLARSdeteriorates as p increases. The relative performance of RLARS-Lasso improves with p, achievingthe best predictive performance for p≥ 750 with εi = 0.014 and 0.054.48To summarize, RLARS frequently has the best performance with data with no contaminationand low-leverage structural contamination. RLARS-Lasso has the best performance for the high-est dimensional setting, with p = 1000, for low leverage structural and independent contamina-tion. DDC-RLARS has the best performance for data with high-leverage structural contamination.DDC-RLARS also outperforms other methods for data with moderate to high levels of independentcontamination, εi ≥ 0.03. Figure 5.4, shows a decision tree containing the recommended methodto use in different situations based on the simulation results.5.5 ConclusionThis chapter contains the results of a simulation study, comparing RLARS, DDC-RLARS, andRLARS-Lasso with a variety of robust and sparse regression estimators on data with cellwise andcasewise contamination. RLARS is the best performing method with casewise and low-levels ofcellwise contamination. As was expected, simulation results show that the performance of RLARSdeteriorates as the level and magnitude of independent contamination increases. This is becauseRLARS is vulnerable to the propagation of independent contamination in active variables. Agree-ing with the results in Alfons et al. (2013), RLARS has low robustness to high-leverage outliers.DDC-RLARS is the best performing method for data with high-leverage casewise outliers. TheDDC imputed data matrix replaces cellwise outliers with values predicted using other correlatedvariables in the same observation. This method excludes extreme values in each variable whencomputing the imputed data matrix, returning zero if all values are extreme. As a result, the DDCimputed data matrix replaces high leverage outliers with low leverage outliers, which RLARS ishighly robust to.DDC filtering also improves the performance of RLARS for high-levels of independent con-tamination. As discussed in Chapter 3, DDC-RLARS is robust to outliers in the active set ofvariables, unlike RLARS. This is clearly demonstrated by the much greater stability of the MSPEfor DDC-RLARS over a range of cellwise contamination levels compared to other methods.RLARS-Lasso has better predictive performance than RLARS in the simulation settings withthe highest p, clearly outperforming RLARS for p = 1000. This is likely because the RLARS-Lasso coefficient estimates are regularized and thus have greater stability for high-dimensionaldata. The simulation results are summarized in Figure 5.4 in a decision tree containing the recom-mended method for different data types.49Table 5.4: Mean and standard error of MSPE, PR, and RC across replicates for simulation settings (i)-(iii) with shift normalcasewise contamination with and without high-leverage values. The MSPE is shown plus/minus the standard error.The best method for each performance measure is highlightedShift Normal Shift Normal Lev.Setting Method MSPE PR RC MSPE PR RC(a) MMLasso 2.1±0.3 0.6 0.5 2.5±0.9 0.6 0.5PENSEM-EN 2.2±0.3 0.5 0.6 2.7±1.1 0.6 0.6SparseLTS 2.3±0.4 0.4 0.6 2.3±0.4 0.5 0.6RLARS 1.9±0.2 0.9 0.3 2.7±1.4 0.7 0.7RLARS-Lasso 3.3±0.7 0.5 0.6 5.5±1.3 0.5 0.8DDC-RLARS 2.0±0.3 0.9 0.4 2.0±0.3 0.9 0.4(b) SparseLTS 3.6±0.9 0.1 0.4 3.7±0.9 0.1 0.4RLARS 2.1±0.5 0.7 0.3 4.7±1.2 0.4 0.7RLARS-Lasso 4.5±1.1 0.2 0.4 5.7±1.2 0.3 0.7DDC-RLARS 2.5±1.0 0.7 0.3 2.2±0.5 0.7 0.4(c) SparseLTS 1.8±0.4 0.1 0.9 1.8±0.4 0.1 0.9RLARS 1.3±0.3 0.6 0.9 1.7±0.3 0.2 0.9RLARS-Lasso 1.8±0.3 0.2 0.9 1.7±0.3 0.1 0.9DDC-RLARS 1.5±0.4 0.5 0.9 1.6±0.3 0.3 1.050Table 5.5: Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) with symmetric slash contaminationwith and without high leverage values. The MSPE is shown plus/minus the standard error. The best method for eachperformance measure is highlighted.Symm. Slash Symm. Slash Lev.Setting Method MSPE PR RC MSPE PR RC(a) MMLasso 2.1±0.3 0.6 0.5 2.5±0.5 0.8 0.4PENSEM-EN 2.1±0.3 0.5 0.5 2.6±0.6 0.7 0.5SparseLTS 2.5±0.5 0.4 0.6 2.5±0.4 0.4 0.7RLARS 2.0±0.3 0.9 0.4 2.3±0.7 0.5 0.4RLARS-Lasso 2.3±0.3 0.5 0.5 4.7±0.7 0.3 0.5DDC-RLARS 2.1±0.5 0.9 0.4 2.0±0.3 0.9 0.4(b) SparseLTS 4.4±1.2 0.1 0.4 3.7±0.9 0.1 0.5RLARS 2.5±0.7 0.5 0.4 3.8±1.1 0.2 0.3RLARS-Lasso 2.8±0.5 0.2 0.4 4.6±0.7 0.1 0.4DDC-RLARS 3.0±1.3 0.5 0.4 2.3±0.6 0.6 0.4(c) SparseLTS 2.1±0.6 0.1 0.9 1.8±0.4 0.1 1.0RLARS 1.7±0.5 0.3 1.0 2.0±0.4 0.2 0.8RLARS-Lasso 1.4±0.3 0.2 1.0 1.9±0.4 0.1 0.9DDC-RLARS 1.9±0.6 0.2 1.0 1.6±0.4 0.3 1.051Table 5.6: Mean MSPE, RC, and PC across replicates for simulation settings (a)-(c) with clustered outliers for a range ofmagnitude parameters η . The MSPE is shown plus/minus the standard error. The best method for each performancemeasure is highlighted.η = 1 η = 13 η = 25Setting Method MSPE PR RC MSPE PR RC MSPE PR RC(a) MMLasso 3.6±1.6 0.7 0.5 2.8±1.2 0.9 0.4 2.7±0.8 0.9 0.4PENSEM-EN 3.7±1.5 0.6 0.5 2.8±0.8 0.8 0.4 2.8±0.9 0.8 0.4SparseLTS 2.3±0.4 0.5 0.6 2.4±0.4 0.4 0.7 2.4±0.4 0.4 0.7RLARS 7.0±1.8 0.4 0.7 2.3±0.8 0.3 0.4 2.2±0.6 0.4 0.4RLARS-Lasso 7.3±1.2 0.4 0.7 5.8±1.1 0.3 0.5 5.6±0.9 0.3 0.5DDC-RLARS 2.1±0.4 0.9 0.8 1.9±0.2 0.9 0.4 1.9±0.3 0.9 0.4(b) SparseLTS 3.7±1.2 0.1 0.4 3.7±1.1 0.1 0.4 3.8±1.0 0.1 0.5RLARS 6.9±1.9 0.2 0.6 8.1±8.2 0.2 0.3 3.5±1.5 0.2 0.3RLARS-Lasso 7.7±2.0 0.2 0.6 5.9±1.1 0.1 0.3 5.2±1.0 0.2 0.4DDC-RLARS 2.3±1.2 0.9 0.7 2.1±0.4 0.7 0.4 2.2±0.5 0.7 0.3(c) SparseLTS 1.9±0.4 0.1 1.0 1.8±0.3 0.1 1.0 1.8±0.3 0.1 1.0RLARS 2.7±0.6 0.1 0.9 6.1±1.4 0.1 0.7 2.1±0.4 0.2 0.7RLARS-Lasso 2.4±0.5 0.1 0.7 2.3±0.4 0.1 0.7 2.1±0.3 0.1 0.8DDC-RLARS 1.6±0.4 0.3 1.0 1.6±0.3 0.3 1.0 1.6±0.3 0.3 1.052Table 5.7: Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) with a fraction εi = 0.014 independentoutliers for a range of magnitude parameters k. The MSPE is shown plus/minus the standard error. The best methodfor each performance measure is highlighted.k = 2 k = 3 k = 5 k = 10Setting Method MSPE PR RC MSPE PR RC MSPE PR RC MSPE PR RC(a) MMLasso 2.2±0.3 0.5 0.5 2.4±0.4 0.5 0.6 2.7±0.6 0.5 0.6 2.6±1.2 0.4 0.7PENSEM-EN 2.3±0.3 0.5 0.6 2.5±0.4 0.5 0.6 2.8±0.6 0.5 0.6 2.8±1.3 0.4 0.7SparseLTS 2.6±0.6 0.4 0.6 2.7±0.7 0.4 0.7 2.7±0.7 0.4 0.7 2.6±0.9 0.4 0.7RLARS 2.0±0.3 0.9 0.4 2.0±0.3 0.9 0.4 2.0±0.4 0.9 0.4 1.9±0.3 0.9 0.4RLARS-Lasso 2.5±0.5 0.6 0.5 2.5±0.5 0.5 0.5 2.5±0.5 0.6 0.5 2.4±0.5 0.6 0.5DDC-RLARS 2.1±0.6 0.9 0.4 2.1±0.4 0.9 0.4 2.0±0.3 0.9 0.4 2.1±0.3 0.9 0.4(b) SparseLTS 5.6±1.7 0.1 0.4 5.7±1.7 0.1 0.4 7.3±1.9 0.1 0.4 9.2±1.8 0.1 0.4RLARS 2.8±1.0 0.5 0.4 2.6±0.8 0.6 0.4 2.3±0.5 0.7 0.3 1.9±0.3 0.8 0.3RLARS-Lasso 3.2±0.8 0.3 0.4 3.1±0.7 0.3 0.4 3.1±0.7 0.3 0.4 3.0±0.6 0.3 0.4DDC-RLARS 3.3±1.5 0.5 0.4 2.8±0.8 0.5 0.4 3.0±1.1 0.5 0.4 3.3±1.2 0.4 0.4(c) SparseLTS 2.1±0.5 0.1 0.9 2.2±0.6 0.1 0.9 2.6±0.7 0.1 0.8 3.1±0.9 0.0 0.6RLARS 1.9±0.6 0.2 1.0 1.7±0.5 0.3 1.0 1.4±0.3 0.5 1.0 1.2±0.3 0.6 1.0RLARS-Lasso 1.5±0.3 0.2 1.0 1.5±0.3 0.2 1.0 1.4±0.2 0.3 1.0 1.4±0.2 0.3 1.0DDC-RLARS 2.0±0.6 0.2 1.0 1.9±0.6 0.2 1.0 1.9±0.5 0.2 1.0 1.9±0.4 0.2 1.053Table 5.8: Mean MSPE, PR, and RC across replicates for simulation settings (a)-(c) with a fraction εi = 0.054 independentoutliers for a range of magnitude parameters k. The MSPE is shown plus/minus the standard error. The best methodfor each performance measure is highlighted.k = 2 k = 3 k = 5 k = 10Setting Method MSPE PR RC MSPE PR RC MSPE PR RC MSPE PR RC(a) MMLasso 3.0±0.5 0.6 0.5 4.3±1.0 0.6 0.5 7.2±1.4 0.6 0.4 9.6±1.1 0.4 0.1PENSEM-EN 3.1±0.6 0.5 0.6 4.5±1.0 0.6 0.6 7.3±1.4 0.7 0.4 9.6±1.1 0.4 0.1SparseLTS 3.6±0.9 0.4 0.7 4.6±1.3 0.4 0.8 6.6±1.6 0.4 0.8 8.7±1.5 0.4 0.8RLARS 2.5±0.5 0.9 0.4 3.2±0.8 0.9 0.4 4.9±0.8 0.9 0.3 7.2±1.4 1.0 0.3RLARS-Lasso 3.6±0.8 0.6 0.6 3.8±0.9 0.6 0.5 4.3±1.2 0.8 0.4 7.1±2.0 0.9 0.3DDC-RLARS 2.6±0.6 0.9 0.4 2.2±0.4 0.9 0.4 2.1±0.5 0.9 0.4 2.0±0.4 0.9 0.4(b) SparseLTS 7.2±1.7 0.1 0.4 8.7±1.7 0.1 0.4 9.9±1.4 0.1 0.3 10.4±1.4 0.0 0.2RLARS 4.3±2.4 0.5 0.4 4.3±1.7 0.6 0.3 5.5±1.1 0.8 0.3 7.5±1.5 0.9 0.3RLARS-Lasso 5.2±1.5 0.2 0.4 5.3±1.3 0.3 0.4 5.3±1.2 0.5 0.4 7.9±2.0 0.7 0.2DDC-RLARS 5.1±2.7 0.4 0.4 3.0±1.9 0.7 0.4 2.5±0.8 0.7 0.4 2.6±0.9 0.6 0.3(c) SparseLTS 2.6±0.7 0.1 0.9 3.0±0.8 0.0 0.8 3.6±0.7 0.0 0.6 4.1±0.5 0.0 0.3RLARS 2.4±0.9 0.3 0.9 2.0±0.8 0.4 0.9 2.1±0.5 0.7 0.9 2.0±0.5 0.9 0.7RLARS-Lasso 1.9±0.4 0.2 0.9 1.9±0.4 0.3 0.9 2.0±0.4 0.4 0.9 2.3±0.6 0.7 0.8DDC-RLARS 2.7±1.0 0.2 0.9 1.7±0.8 0.4 0.9 1.7±0.5 0.4 0.9 1.8±0.5 0.3 1.054Figure 5.1: Mean MSPE for a range of independent contamination levels. The maximumstandard error of the MSPE in all cases shown is 0.64.Figure 5.2: The mean rank of the MSPE across all replicates for a range of p with ε = 0.014.The maximum standard error of the MSPE rank amongst cases shown is 1.05.55Figure 5.3: The mean rank of the MSPE across all replicates for a range of p with ε = 0.054.The maximum standard error of the MSPE rank amongst all cases shown is 1.27.Contamination Typeεi Leverage PointDDC-RLARS p/n DDC-RLARS RLARSRLARS-Lasso RLARSIndependentεi ≥ 0.05 εi < 0.05StructuralYes Nop/n≥ 10 p/n< 10Figure 5.4: Binary decision tree describing which methods are recommended for differenttypes of data.56Chapter 6Case Study: Data Quality ModellingUsing Robust Linear Regression6.1 IntroductionIn this chapter, we consider the problem of detecting when data contains independent and/or struc-tural outliers in sparse and high-dimensional linear regression data. In statistics, outliers are fre-quently only considered as an obstacle to data analysis. However, outliers can provide valuableinsight into the data generating mechanism.Consider the common motivating example in robust statistics in which data are collected by asensor with some chance of failure, manifesting as an outlier. In this context, outliers are usuallypresented as an obstacle to be removed. We might instead consider this from the perspective of thesensor operator. Intuitively, a properly operating sensor will not frequently fail while a damaged orotherwise improperly functioning sensor may more frequently malfunction. Hence, the presenceof many outliers in a dataset may be indicative of a hardware failure. Hence, we are interested inthe presence, or lack thereof, of outliers and how this may encode valuable information.This chapter is a case study carried out for MineSense Technologies Ltd. (MineSense). Mine-Sense designs, builds, and deploys sensor-based systems that provide real-time grade telemetry tomines as ore is collected at the mining face and before it is transported to downstream processingfacilities. The environment in which mining excavators work is harsh. Sensors deployed in thisenvironment are expected to undergo significant wear and require regular maintenance.However, excavators are a critical step in the transport of rock so any disruptions to their op-57eration can be costly to a mine. A deterioration in data quality can provide an early warning thatsuch a disruption is occurring. These problems can result from a change in the process generatingthe data, such as a change in ore body or a deterioration is system health. Therefore, it is criticalfor MineSense to be able to detect a degradation in system performance and predict failures of itssensing systems when in active use. A principled method to detect when a dataset is contaminatedwith outliers can provide scalable methods for monitoring system performance and health.Detecting whether a dataset is contaminated with outliers, or contamination detection, is dis-tinct from the more common problem of outlier detection. Outlier detection aims to identify in-dividual atypical observations or cells that lie outside of the regular variation of the data. Suchmethods usually work by determining, under some distributional assumptions, the variation of thedata and identifying values that exceed some cut-off. These observations are then generally re-moved from the analysis. Contamination detection aims to detect whether a entire dataset containsoutliers, not individual observations or cells.We are interested in detecting both independent and structural contamination. As previouslydiscussed in Chapter 1, researchers have only recently begun considering the problem of linearregression in data with independent contamination and structural contamination. The same is trueof the detection of structural and independent outliers. Rousseeuw and Bossche (2018) recentlyproposed Detecting Deviating Data Cells (DDC), the first outlier detection procedure designedto detect both structural and independent contamination. However, DDC is not a contaminationdetection algorithm. As far as the author is aware, there currently exist no published contaminationdetection algorithms.We therefore propose a flexible new contamination detection method with a principled methodfor tuning the false detection rate. Our method is based upon the idea that, by design, robustand non-robust methods produce similar estimates in data without outliers and, in many cases,significantly different estimates in data with outliers. Hence, if the estimates by robust and non-robust methods have an abnormally large difference then outliers are likely present in the data.Comparing robust and non-robust methods as a diagnostic tool was first proposed in the S-plusRobust Library User Guide (Insightful Corporation, 2002) to aid manual review of the data. Wepropose a principled and automated extension of this concept. We apply this paradigm to XRF datacollected by MineSense to provide a data quality measurement that can be used to flag ongoing dataquality problems.586.2 Data Quality Modelling Using Robust Linear RegressionIn this section, we propose a data quality modelling method based on the comparison of robustand non-robust estimates of linear regression models. We consider the regression dataset (X,y)following the linear regression model,y = Xβ + ε (6.1)where X ∈ Rn×p is a matrix of predictor variables, y ∈ Rn is a vector of responses, and ε ∼N(0,σ2I) is the vector of independent regression errors. We assume that the data may be high-dimensional and sparse to match the characterstics of the MineSense XRF data.Let (ytrain,Xtrain) be a training dataset without outliers and (ytest ,Xtest) be a dataset that we wishto test for contamination, called the test dataset. We assume the test data may be contaminated withstructural or independent outliers. Structural contamination is modelled using the Tukey-Hubercontamination model (e.g. Ricardo A Maronna, Martin, et al., 2019, chapter 2), described in detailin Chapter 1. Independent contamination is modelled using the fully independent contaminationmodel (Alqallaf, Van Aelst, et al., 2009). We consider a correct detection as correctly detecting thepresence of either independent or structural contamination.The workflow for our contamination detection tool is as follows. (1) Data are collected whenthe data are known to be in a good condition, such as when a sensor is new or is being operated ina controlled setting. The tool is trained on this data. (2) New data collected in active operation aretested for contamination using the trained tool.Sketch of the Contamination Detection MethodWe will first provide a brief sketch of the contamination detection method. The goal of the contam-ination detection procedure is to flag a dataset as being contaminated or uncontaminated. Contam-ination is detected when a distance between the linear regression coefficients estimated by robustand non-robust methods exceeds some cut-off. The cut-off is selected in a principled manner tocontrol false detection rate.The robust distance (Rousseeuw and Bossche, 2018) is used to measure the distance betweenthe regression coefficient estimates. Let βˆ r and βˆ n be regression coefficient estimated by robustand non-robust methods on the same dataset, respectively, and let ∆= βˆ r− βˆ n. The robust distanceis given by,RD(∆) =√(∆−T)′ S−1 (∆−T), (6.2)where T and S are the Minimum Covariance Determinant (MCD) estimates (Rousseeuw, 1985)59of the location and scatter of β r − β n, implemented in the robustbase package available onCRAN. Non-parametric bootstrap (e.g. Efron, 1982) is used to estimate of the pth upper quantileof the robust distance of ∆, denoted as Qp, and the location and scatter, T and S. If the robustdistance ∆test of a test set exceeds Qp, then the dataset is flagged as contaminated.The bootstrap upper quantile Qp is chosen as the cut-off because, under the assumption thatthe data contains no outliers, the probability that ∆test exceeds the bootstrap upper quantile Qp isapproximately p. When outliers are present in the data used to calculate ∆test but not the data usedto estimate T and S, then ∆test will be more likely to be far from T relative to the scatter of theuncontaminated data S. Hence, RD(∆test) will exceed Qp with some probability greater than p.Simulation results indicate that if βˆ r and βˆ n are estimated with all variables in a high-dimensionaldataset, the variability of ∆ is high and thus the detection rate is poor. To avoid this issue, a vari-able selection step is first performed to select a subset of variables A ⊂ {1, . . . p} to include in thesubsequent steps of the contamination detection procedure.A sparse and robust linear regression method is required to perform variable selection. Asdiscussed in Section 1, regularization is a useful way to achieve sparsity and a good way to reduceestimator variance in high-dimensional data, thus improving the stability of the model. DDC-RLARS is a robust and sparse regularized regression estimator and is thus used to perform variableselection. As an added benefit, it is simple to use DDC-RLARS to select some fixed number ofvariables m instead of using a model selection procedure, if desired. This feature is useful later inthis chapter.Additionally, it may be desirable to include manual variable selection. For real data applica-tions, there may exist a scientific reason for using specific variables. Manual variable selectioncan be used to either supplement or replace automatic variable selection. To supplement automaticvariable selection with manual variable selection, we recommend first performing manual variableselection and then performing automatic variable selection on the residuals of a linear model fit tothe manually selected variables.Detailed Description of Contamination Detection MethodIn this section, we present a detailed description of the contamination detection procedure. Theprocedure has two stages (1) training on an uncontaminated dataset and (2) testing a dataset forpossible contamination. Let (ytrain,Xtrain) be a training dataset and (ytest ,Xtest) be a test dataset.We assume that (ytrain,Xtrain) is uncontaminated. The training process is as follows,60Training Contamination Detection Procedure(1) Perform variable selection using DDC-RLARS on the training set (y,X) to select the set ofactive variables A. We define the active set asA = { j : βˆr, j 6= 0} (6.3)where βˆ r are the regression coefficients estimated by DDC-RLARS. Let XA be the subset ofthe data matrix corresponding to A.(2) Determine the distance cut-off using the bootstrap to estimate the upper quantile of the robustdistance. We first take nbs bootstrap samples of (y,XA) by sampling with replacement from thedata. Let (y`,X`A) for ` ∈ {1, . . . ,nbs} denote a bootstrap sample.For each bootstrap sample, a robust and non-robust linear regression method are used to esti-mate the regression coefficients βˆ`r and βˆ`n, respectively. The robust estimate of the regressioncoefficients βˆ`r is estimated by first computing the DDC imputed data matrix of the bootstrapsample X`A,Imp and then fitting an MM-estimator to (y`,X`A,Imp). The non-robust estimate ofthe regression coefficients, βˆ`n, is estimated using ordinary least-squares (OLS) on the rawbootstrap sample, (y`,X`A).For each bootstrap sample, the coefficient difference is calculated∆` = βˆ`r− βˆ`n, `= 1, . . . ,n (6.4)and used to construct the coefficient difference matrixD =(∆1)T(∆2)T...(∆nbs)T . (6.5)The robust distance for each coefficient difference is calculated using the robust distance d` =RD(∆`) for `= 1, . . . ,nbs, defined in Equation (6.2), where T and S are the MCD estimates oflocation and scatter of D, respectively. The cutoff Qp is the pth empirical quantile of the robustdistances (d1, . . . ,dnbs).61Testing for Contamination(3) Test for contamination on a test dataset. First, the coefficient difference ∆test is calculated withthe test data. The robust distance RD(∆test) is then calculated using T and S from Step (2). Thedataset is flagged as having contamination whenRD(∆test)> Qp (6.6)using Qp estimated in Step (2).6.3 Simulation StudyThis section contains a small simulation study used to test the performance of the contamination de-tection procedure proposed in this chapter. All MM-estimators are tuned to have a 50% breakdownpoint and 95% efficiency. We use 300 bootstrap samples of size equal to the dataset to estimate the95% bootstrap upper quantile used as the distance cut-off.6.3.1 Simulation SettingsSimulated data are generated using a linear model,yi = x′iβ +σεi. (6.7)where xi and β are p dimensional vectors, εi are standard normal random variables, and σ = 0.5.The covariates follow a multivariate normal distribution given by N(0,Σ) where Σ = (Σi j)1≤i, j≤pwith Σi j = 0.5|i− j|. The coefficient values are βi = 1 for i = 1, . . . ,ζ p and βi = 0 for i = ζ p, . . . , pwhere ζ is the sparsity level. For all replicates, we choose n = 100, p = 100, and ζ = 0.05 toapproximately match the dimensionality and sparsity of the MineSense XRF spectrometry data.For each simulation replicate, one training dataset and many test datasets are generated usingthese simulation settings. Outliers are then added to half of the test sets. The uncontaminated testsets are used to estimate the false detection rate of the contamination detection method and thecontaminated test sets are used to estimate the detection rate.To generate the outliers, we use one independent and one structural contamination scheme. Thestructural contamination scheme is based on the clustered contamination scheme in Alfons et al.(2013).(i) High leverage clustered contamination: High leverage predictors x˜i are independently drawn62from N(5,0.1). Response variables are generated from y˜=η x˜′γ where γ =(−1/p, . . . ,−1/p).The parameter η controls the magnitude of the outliers.To increase the difficulty of detecting the outliers, the leverage and severity of the outliers arereduced compared to those used in Alfons et al. (2013). The mean of the high-leverage predictorsis reduced from 10 to 5 and only the low end of the range of magnitude parameters used by Alfonset al. (2013) are considered, η = 1,2,3, and 5.The independent contamination scheme is very similar to that in Leung et al. (2016). Indepen-dent contamination is added to the data by randomly replacing bεinc cells independently in eachcolumn with outliers for εi from 0.01 to 0.05 by 0.01. The independent contamination scheme isgiven by,(ii) Independent Contamination: Outlying predictors x˜i are set to E(xi)+kSD(xi) and outlyingresponses y˜ are set to E(y)+ kSD(y).We consider independent outliers with a magnitudes k = 2,3,5,10.6.3.2 Performance MeasuresFor each replicate, one training dataset and one hundred test datasets are generated. Half of thetest sets are contaminated with outliers and half test sets are generated without outliers. For eachtest set, a single result is produced; a true or false based on whether contamination is detectedor not, respectively. Table 6.1 shows the possible outcomes of a detection based on whether thetest set actually contained outliers. We measure the performance using the recall (RC), also calledthe sensitivity/true positive rate/detection rate, and specificity (SPEC), also called the true negativerate. They are defined as follows.RC =TPTP+FNand SPEC =TNTN+FP. (6.8)Note that 1-RC is the false detection rate. The specificity should be approximately equal to thequantile chosen for the cutoff, which is set to 95% for all simulations.6.3.3 Simulation ResultsTable 6.2 and 6.3 show the mean and standard error of the recall and specificity of the contaminationdetection method. Appendix B contains histograms of the recall and specificity for each simulationsetting. The same model used to generate the data for all settings, so the specificity is directly63Contamination No ContaminationDetection (TP) True Positive (FP) False PositiveNo Detection (FN) False Negative (TN) True NegativeTable 6.1: Definition of outcomes of contamination detection. A detection of contaminationis defined as a positive and no detection as a negative.comparable. The mean specificity across all simulations is 0.94 with a standard error of 0.05. Thisis within the standard error of the bootstrap quantile p.In simulations with clustered contamination, the mean recall of all settings is at least 0.9, in-cluding the lowest level of contamination with only 2 outlying observations out of 100. The stan-dard error of the recall decreases as the magnitude and number of outliers increases, approaching100% recall. The contamination detection method is able to effectively detect very low levels ofclustered contamination, achieving the desired specificity, or false positive rate.In simulation with independent contamination, the mean recall is low for the lowest magnitudeoutliers, k = 2, but increases significantly for k = 3, reaching an average of 0.5 for εi = 0.02. Fork = 5 and 10, the mean recall exceeds 0.5 for even 1% contamination, or 1 outlier per column.Overall, the simulation results indicate that the contamination detection procedure can be ef-fectively tuned to achieve the desired specificity. Additionally, good recall is achieved with evenlow levels of either structural or independent contamination.64Table 6.2: Mean and standard deviation of the specificity and recall across replicates for data with clustered outliers for arange of fractions of contamination, ε , and magnitude parameters η .η = 1 η = 2 η = 3 η = 5ε RC SPEC RC SPEC RC SPEC RC SPEC0.02 0.9±0.3 0.9±0.1 1.0±0.2 0.9±0.0 0.9±0.2 0.9±0.1 1.0±0.2 0.9±0.00.04 0.9±0.2 0.9±0.0 0.9±0.2 0.9±0.1 1.0±0.1 0.9±0.0 1.0±0.1 0.9±0.00.06 1.0±0.1 0.9±0.1 0.9±0.2 0.9±0.0 1.0±0.1 0.9±0.1 1.0±0.2 0.9±0.10.08 1.0±0.2 0.9±0.1 1.0±0.1 0.9±0.0 1.0±0.1 0.9±0.1 1.0±0.2 0.9±0.10.1 1.0±0.0 0.9±0.0 1.0±0.2 0.9±0.0 1.0±0.1 0.9±0.0 1.0±0.1 0.9±0.1Table 6.3: Mean and standard deviation of the specificity and recall across replicates for data with independent outliers fora range of fractions of contamination, εi, and magnitude parameters k.k = 2 k = 3 k = 5 k = 10εi RC SPEC RC SPEC RC SPEC RC SPEC0.01 0.1±0.1 0.9±0.1 0.3±0.2 0.9±0.0 0.6±0.2 0.9±0.1 0.9±0.2 0.9±0.10.02 0.2±0.1 0.9±0.1 0.5±0.3 0.9±0.1 0.8±0.2 0.9±0.1 1.0±0.2 0.9±0.10.03 0.2±0.1 0.9±0.1 0.6±0.3 1.0±0.0 0.9±0.2 0.9±0.0 1.0±0.2 0.9±0.10.04 0.2±0.1 0.9±0.0 0.7±0.2 0.9±0.0 0.9±0.2 0.9±0.0 1.0±0.1 0.9±0.10.05 0.2±0.1 0.9±0.0 0.8±0.2 0.9±0.1 0.9±0.2 0.9±0.0 1.0±0.1 0.9±0.0656.4 MineSense Real Data Case StudyThis section applies our contamination detection method to X-ray Fluorescence Spectrometer(XRF) data collected by MineSense. The goal of this analysis is to detect the presence of out-liers in an XRF dataset caused by a degradation in sensor health or changes in the underlying datagenerating mechanism, such as a transition to a new ore body. Ideally, the outliers will be detectedwhen they compose a small fraction of the dataset. However the underlying changes producingthese problems can appear gradually, initially resulting in low magnitude outliers. Such outlierscan be difficult for outlier detection methods to identify. Thus we believe that our contaminationdetection method is better suited for this application.This dataset is composed of a set of XRF sensor measurements of copper (Cu) ore. The datawere collected over 7 contiguous months from sensors operating on a cable shovel excavator. Theconcentration of copper, referred to as the grade, is the response variable. The dataset contains ob-servations collected by sensors with a known hardware issue resulting from wear. The observationscollected by the faulty sensors are labelled. We aim to detect when these observations are presentin a dataset.While we aim to detect this hardware using contamination detection, not all observations withthe labelled hardware issue are outliers and thus detectable. The hardware issue appears to destroythe relationship between the predictors and the response rather than producing extreme values.Hence, some observations lie within the normal variation of the “good quality” data. However,as the primary aim of the case study is to detect hardware failure for this section we define atrue positive as the correct identification that a dataset contains observations collected by a faultysensor. As not all such observations are outliers this definition will result in a lower recall than thedefinition used in Section 6.3. However, it will more accurately show the ability of the method todetect real hardware failures.An observation, also referred to as a spectrum, is composed of 1024 variables. Variables areordered by the energy of the photons that they measure, from low to high. For our analysis, weexclude energies that are known to have no relationship with the response, which leaves variables200-500 and 600-800. After removing variables with very low variability, we are left with p= 101variables. As with previous sections, we assume data follow a linear model as XRF spectrometerdata is well-known to have an approximately linear relationship with the elemental composition ofa sample.MineSense provided two datasets, a problem dataset of data collected by a sensor with a la-belled hardware issue and a clean dataset collected by sensors with no known problem. The clean66dataset has 105 observations and the problem dataset has 201 observations. Several outliers wereremoved from the clean dataset using conventional outlier detection techniques.It is well known that elements fluoresce at well-defined energies. Thus, as suggested in Section6.2, we incorporate this knowledge by manually including the variable corresponding to the energywith the greatest signal into the active set which we refer to by it’s scientific name, CuKα .We apply several variations of the contamination detection method, based on the variablesincluded in the variable selection step. We refer to these variations as the one variable model, thethree variable model, and the combined method. For the one variable model, we only select theCuKα predictor in the variable selection step. In the three variable model, we include variablesselected by both manual and automatic variable selection. We manually include the CuKα variable.We then select two additional variables with DDC-RLARS using the method described in Section6.2. We do not consider cases with more variables because preliminary tests indicate that if a largenumber of variables are included in the model, the coefficient difference is highly variable whichresults in a low detection rate.Additionally, initial results indicated that the one and the three variable method were comple-mentary. By this, we mean that the one and three variable model could detect contamination indatasets that the other could not. So, we also consider the combined method, which flags a datasetas contaminated if either the one or three variable method detect contamination.The performance of the contamination detection methods are tested by randomly dividing theclean data into two datasets of equal size. One half is used to train the contamination detectionmethod. The other half is used to test the performance. First, the method is applied to the cleantest data to test the ability of the contamination detection method to identify true negatives. Then,the first bεnc of the observations in the test set are replaced with data randomly drawn from theproblem dataset for a range of ε from 0.02 to 0.1 by 0.02, resulting in between 1 to 5 observationsbeing replaced. The contaminated datasets are then used to test the ability of the contaminationdetection method to detect true positives. The results are used to calculate the recall and specificity.As mentioned before, not all observations in the problem dataset are outliers. Thus, it is not un-likely that no observations in the test dataset to be outlier when only one observation is drawn fromthe problem dataset. This appears to result in the performance measures having high variability.Tables 6.4 and 6.5 show the mean recall and specificity of the one variable and three variablemodels, respectively, with the cut-off set to the 95% bootstrap quantile. The one variable modelhas a specificity 0f 0.95, but low recall for low levels of contamination. The three variable modelhas a slightly lower specificity of 0.92. This is likely due to the high variability of upper bootstrapquantile estimation. The recall of the three variable model is higher for all ε , especially low ε .67Table 6.4: Recall over 100 random splits for the one variable contamination model detectionwith a 95% bootstrap quantile cut-off. The specificity is 0.95.ε 0.02 0.04 0.06 0.08 0.1RC 0.26 0.37 0.42 0.33 0.45Table 6.5: Recall over 100 random splits for the three variable contamination model detectionwith a 95% bootstrap quantile cut-off. The specificity is 0.92.ε 0.02 0.04 0.06 0.08 0.1RC 0.44 0.43 0.46 0.54 0.53Table 6.6 shows the recall and specificity for the combined model with the cut-off set to the95% bootstrap quantile. The sum of the false detection rate of the combined model is approximatelyequal to the sum of that of the component models, suggesting that the component models generallydo not have a significant overlap for false detections. The recall of this model is significantlyhigher than the one variable models for all levels of contamination. However, the specificity of thecombined method is 0.87, so this is not a fair comparison.Thus we also compare the combined method to the one and three variable models with a the90% bootstrap quantile cut-off, shown in Tables 6.7 and 6.8. Notice that the recall does not alwaysincrease with ε . This is likely due to the high variability of the estimate. We can see that thecombined method, with a specificity of 0.87, has better recall than both the one and three variablecontamination models, which have a specificity of 0.88 and 0.86, respectively. However, the vari-ability of the recall and specificity is high so it is inconclusive whether the recall achieved by thecombined method is significantly better than that of the three variable model with a comparablespecificity.6.5 ConclusionThis chapter is a case study in detecting the presence of outliers in linear regression data. This casestudy was performed for MineSense Technologies Ltd. (MineSense), a mining sensor technologycompany. MineSense sensors are used to measure the concentration of elements in ore. MineSensesensors operate in the very harsh environment of an active mine, so monitoring the health of sensorsis vital. For some sensor health problems, issues manifest in the data as outliers. Hence, we proposeusing contamination detection to aid in the detection of sensor health degradation.We present the problem of contamination detection; the detection of whether a dataset contains68Table 6.6: Recall over 100 random splits for the combined contamination detection modelwith a 95% bootstrap quantile cut-off. The specificity is 0.87.ε 0.02 0.04 0.06 0.08 0.1RC 0.56 0.68 0.71 0.73 0.76Table 6.7: Recall over 100 random splits for the one variable contamination model detectionwith a 90% bootstrap quantile cut-off. The specificity is 0.88.ε 0.02 0.04 0.06 0.08 0.1RC 0.33 0.48 0.52 0.41 0.51Table 6.8: Recall over 100 random splits for the three variable contamination model detectionwith a 90% bootstrap quantile cut-off. The specificity is 0.86.ε 0.02 0.04 0.06 0.08 0.1RC 0.58 0.61 0.57 0.70 0.64outliers. Unlike the more common problem of outlier detection, where individual observationsor cells are flagged as being outliers, contamination detection aims to identify whether an entiredataset contains any outliers. By jointly considering the entire dataset, contamination detection canachieve a high contamination detection rate.We propose a new contamination detection method, based on comparing robust and non-robustlinear regression estimates. DDC-RLARS is used to perform variable selection. Then, the robustdistance is used to measure the distance between the coefficients estimated by robust and non-robustmethods. The bootstrap upper quantile is used to determine a cut-off. If the distance measured ona dataset is beyond the cut-off the dataset is flagged as containing outliers.In a simulation study, the contamination detection method was shown to detect low levels ofclustered and independent contamination very accurately. The model was easily tuned to have alow false positive rate.The primary focus of the case study is the application of the contamination detection method toan example XRF spectrometer dataset provided by MineSense. The example dataset is composedof two subsets. One subset is collected with a sensor with no known hardware issues, the “clean”data, and the other was collected by sensors with a known hardware problem, the “problem” data.The goal of this analysis was to determine the performance of the contamination detection methodat correctly identifying the presence of data drawn from the problem dataset in clean data.Three variations of the contamination detection are considered – one which uses a single vari-69able, called CuKα , manually selected based on scientific theory, another which uses CuKα withtwo additional variables selected using DDC-RLARS, and a combination of both methods. Thecontamination detection method with the one and three variable models individually achieve goodrecall and high specificity individually. The combination of the two models, increases sensitivity atthe cost of a modest drop in specificity. However, when compared with the one and three variablemodels with a similar specificity it is inconclusive whether the combined method has significantlyimproved performance over the three variable model.70Chapter 7ConclusionIn this thesis, we considered the problem of robust and sparse regularized linear regression fordata with structural and independent contamination. Due to outlier propagation, a small fraction ofcellwise outliers can result in a large number of contaminated cases. As a result, high-breakdownaffine equivariant robust and sparse linear regression methods developed under the Tukey-Hubercontamination model (THCM) are not robust to independent contamination.Almost all existing robust and sparse linear regression methods achieve robustness by down-weighting outlying observations. Such estimators have a maximum breakdown point of a fractionε = 0.5 contaminated cases. However, due to outlier propagation the probability that a case has atleast one outlying variable grows with p. Sparse regression is frequently performed on data withhigh p so even a very small fraction of independent outliers can exceed the maximum breakdownpoint for traditional robust estimators.In this thesis, we considered the application of an existing linear regression method for high-dimensional data, Robust Least Angle Regression (RLARS), to data with independent contamina-tion. We further proposed two separate modified version of RLARS to further improve performancein two different scenarios. We also considered an application in a case study modelling data quality.We will now summarize the main topics studied in this thesis and the most significant results.Robust Least Angle Regression in the Presence of Structural andIndependent ContaminationWe reviewed RLARS and considered it’s robustness to data with independent contamination. It wasshown that RLARS variable sequencing is robust to independent contamination and that RLARS71variable segmentation is more resistant to independent outliers than existing robust and sparseregression estimators.For future work, we proposed using the Shooting S-estimator (O¨llerer et al., 2016) in the vari-able segmentation step to potentially improve the robustness of RLARS to independent contami-nation. The Shooting S-estimator robustly estimates the residual scale, which could potentially beused to perform robust segmentation in the presence of structural and independent contamination.However, variable segmentation is slow, so the computational cost of any new regression methodshould be carefully considered.Numerical results indicated that RLARS has better performance than other existing robust andsparse regression estimators for data with independent outliers and in many cases with casewiseoutliers.Robust Least Angle Regression Using Pre-Filtered DataWe proposed DDC-RLARS as a robust linear regression method that further improves the perfor-mance of RLARS for data with independent contamination. DDC-RLARS performs RLARS onthe imputed data matrix obtained using Detecting Deviating Data Cell (DDC) (Rousseeuw andBossche, 2018). The DDC imputed data matrix replaces independent outliers with imputed values.DDC-RLARS has improved robustness to independent contamination in variable selection com-pared to RLARS. However, DDC-RLARS has lower performance than RLARS on data with nooutliers or low leverage structural outliers.Numerical results indicate that DDC-RLARS has better performance than RLARS for data withmoderate to high levels of independent contamination. DDC-RLARS also has significantly betterperformance against high-leverage structural outliers than existing robust and sparse regressionestimators, including RLARS.We recommend DDC-RLARS over RLARS when analysis of the data using diagnostic toolsindicates that significant levels of independent outliers are present, with 5% independent outliersas the suggested cut-off. We recommend using the diagnostic tools described by (Rousseeuw andBossche, 2018) to estimate the level of independent contamination.Robust Least Angle Regression With Lasso ModificationWe proposed RLARS-Lasso as a new robust and sparse linear regression method. RLARS-Lassocalculates the LARS Lasso modification by replacing the sample correlation matrix with the robustcorrelation matrix used in RLARS. Unlike RLARS, RLARS-Lasso directly computes the coeffi-72cients during variable sequencing and thus does not have a variable selection step. RLARS-Lassois still susceptible to outlier propagation in the active set during penalty parameter selection likeRLARS.Numerical results indicate that RLARS-Lasso outperforms RLARS for data with independentoutlier for high p. We recommend RLARS-Lasso over RLARS for data with independent contam-ination when p≥ 1000.Case Study: Data Quality Modelling Using Robust Linear RegressionA case study was performed for MineSense Technologies (MineSense), a mining sensor technologycompany. The goal of was to detect deterioration in sensor health using data quality modelling. Forthis case study, we proposed the problem of contamination detection, the detection of whether adataset contains outliers. This is distinct from outlier detection, the detection of which observationsare outliers.We proposed a method to detect the presence of independent and structural outliers in linearregression data. DDC-RLARS is used to perform an initial variable selection step. Then, ourmethod compares estimates by robust and non-robust linear regression methods. A large differencein the robust and non-robust estimates indicates the presence of contamination. The cut-off usedto flag the presence of contamination is determined in a principled manner to control the falsedetection rate.Numerical results indicate that the contamination detection method can effectively identify thepresence of both structural and independent outliers in a dataset. Additionally, the false detectionrate achieves the desired value.Results from the application of the contamination detection algorithm to a real X-ray Fluores-cence Spectrometer (XRF) dataset provided by MineSense indicate that the presence of a smallfraction of observations collected by a faulty sensor can be detected. The best performance wasachieved by combining automatic variable selection using DDC-RLARS with manual variable se-lection based on scientific knowledge.73BibliographyAlfons, Andreas, Christophe Croux, and Sarah Gelper (2013). “Sparse Least Trimmed Squares Re-gression For Analyzing High-dimensional Large Data Sets”. In: The Annals of Applied Statis-tics 7.1, pp. 226–248.Alqallaf, Fatemah, Kjell P Konis, R Douglas Martin, and Ruben H Zamar (2002). “Scalable ro-bust covariance and correlation estimates for data mining”. In: Proceedings of the eighth ACMSIGKDD international conference on Knowledge discovery and data mining, pp. 14–23.Alqallaf, Fatemah, Stefan Van Aelst, Victor J Yohai, and Ruben H Zamar (2009). “Propagation ofoutliers in multivariate data”. In: The Annals of Statistics 37.1, pp. 311–331.Efron, Bradley (1982). The jackknife, the bootstrap, and other resampling plans. Vol. 38. Siam.Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (2004). “Least angle regres-sion”. In: The Annals of statistics 32.2, pp. 407–499.Freue, Gabriela V Cohen, David Kepplinger, Matıas Salibia´n-Barrera, and Ezequiel Smucler (2017).“PENSE: A Penalized Elastic Net S-Estimator”. In:Freund, Robert M, P Grigas, and R Mazumder (2013). “Incremental Forward Stagewise Regres-sion: Computational Complexity and Connections to LASSO”. In: International Workshopon advances in Regularization, Optimization, Kernel Methods and Support Vector Machines(ROKS).Friedman, Jerome, Trevor Hastie, and Tibshirani (2001). The Elements of Statistical Learning.Vol. 1. 10. Springer series in statistics New York.— (2010). “Regularization paths for generalized linear models via coordinate descent”. In: Journalof statistical software 33.1, p. 1.74Fu, Wenjiang J (1998). “Penalized regressions: the bridge versus the lasso”. In: Journal of compu-tational and graphical statistics 7.3, pp. 397–416.Gnanadesikan, Ramanathan and John R Kettenring (1972). “Robust estimates, residuals, and outlierdetection with multiresponse data”. In: Biometrics, pp. 81–124.Hastie, Trevor, Robert Tibshirani, and Martin Wainwright (2015). Statistical learning with sparsity:the lasso and generalizations. Chapman and Hall/CRC.Huber, Peter J. (1973). “Robust Regression: Asymptotics, Conjectures and Monte Carlo”. In: TheAnnals of Statistics 1.5, pp. 799–821. ISSN: 00905364.— (1981). Robust Statistics. English. New York: Wiley.Insightful Corporation (2002). S-PLUS 6 Robust Library User’s Guide. English. Insightful Corpo-ration. Seattle, WA. 188 pp.Khan, Jafar A (2006). “Robust linear model selection for high-dimensional datasets”. PhD thesis.University of British Columbia. DOI: http : / /dx .doi .org /10 .14288/1 .0100423. URL: https ://open.library.ubc.ca/collections/ubctheses/831/items/1.0100423.Khan, Jafar A, Stefan Van Aelst, and Ruben H Zamar (2007). “Robust Linear Model SelectionBased on Least Angle Regression”. In: Journal of the American Statistical Association 102.480,pp. 1289–1299.Leung, Andy, Hongyang Zhang, and Ruben Zamar (2016). “Robust regression estimation and in-ference in the presence of cellwise and casewise contamination”. In: Computational Statistics& Data Analysis 99, pp. 1–11.Maronna, Ricardo A (2011). “Robust ridge regression for high-dimensional data”. In: Technomet-rics 53.1, pp. 44–53.Maronna, Ricardo A, R Douglas Martin, Victor J Yohai, and Matıas Salibia´n-Barrera (2019). Ro-bust statistics: theory and methods (with R). John Wiley & Sons.Maronna, Ricardo A and Ruben H Zamar (2002). “Robust estimates of location and dispersion forhigh-dimensional datasets”. In: Technometrics 44.4, pp. 307–317.Maronna, Ricardo Antonio (1976). “Robust M-estimators of multivariate location and scatter”. In:The annals of statistics, pp. 51–67.75O¨llerer, Viktoria, Andreas Alfons, and Christophe Croux (2016). “The shooting S-estimator forrobust regression”. In: Computational Statistics 31.3, pp. 829–844.Ronchetti, Elvezio, Christopher Field, and Wade Blanchard (1997). “Robust linear model selectionby cross-validation”. In: Journal of the American Statistical Association 92.439, pp. 1017–1023.Rousseeuw, Peter (1984). “Least Median of Squares Regression”. In: Journal of the AmericanStatistical Association 79.388, pp. 871–880.— (1985). “Multivariate estimation with high breakdown point”. In: Mathematical statistics andapplications 8.283-297, p. 37.Rousseeuw, Peter and Wannes Van Den Bossche (2018). “Detecting deviating data cells”. In: Tech-nometrics 60.2, pp. 135–145.Rousseeuw, Peter and Victor J Yohai (1984). “Robust regression by means of S-estimators”. In:Robust and nonlinear time series analysis. Springer, pp. 256–272.Salibian-Barrera, Matias (2000). “Contributions to the theory of robust inference”. PhD thesis.University of British Columbia.Salibian-Barrera, Matias and Ruben H Zamar (2002). “Bootrapping robust estimates of regression”.In: The Annals of Statistics 30.2, pp. 556–582.Schwarz, Gideon (1978). “Estimating the dimension of a model”. In: The annals of statistics 6.2,pp. 461–464.Smucler, Ezequiel and Victor J Yohai (2017). “Robust and sparse estimators for linear regressionmodels”. In: Computational Statistics & Data Analysis 111, pp. 116–130.Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the Lasso”. In: Journal of theRoyal Statistical Society. Series B (Methodological) 58.1, pp. 267–288.Yohai, Victor J (1987). “High breakdown-point and high efficiency robust estimates for regression”.In: The Annals of Statistics 15.2, pp. 642–656.Zou, Hui and Trevor Hastie (2005). “Regularization and variable selection via the elastic net”. In:Journal of the royal statistical society: series B (statistical methodology) 67.2, pp. 301–320.76Appendix ARobust Estimators Used in DetectingDeviating Data CellsThis appendix details the robust estimators used by Detecting Deviating Data Cells (DDC) as com-ponents in the algorithm (Rousseeuw and Bossche, 2018). The robust methods were selected forthe low computational complexity and memory requirements – all methods have O(n) computa-tional complexity and memory requirements.Rousseeuw and Bossche (2018) defines a robust location estimator robLoc(x) as the first iter-ation used to calculate an M-estimate with a previously computed dispersion estimate, found onpages 40-41 of Ricardo A Maronna, Martin, et al. (2019). Initial estimates of location and scaleare calculated by m1 = medi xi and sq = madi xirobLoc(x) = ∑ni wiyi∑ni wi(A.1)with wi = ρ((yi−m1)/s1) where ρ is Tukey’s Bisquare function defined in Equation (2.33).Rousseeuw and Bossche (2018) also proposes an estimator for robust univariate scale. Thisestimator assumes data are centered. Let s2 be the median absolute deviation.robScale = s2√√√√ 1δnn∑i=1min((yis2)2, b2)(A.2)where b = 2.5 by default and δ = 0.85 to ensure consistency for normally distributed data.The relationships between variables are estimated using robust bivariate correlations. An initial77estimate of the robust bivariate correlation of columns j and h of the centred and normalized datamatrix Z based on the method proposed in Gnanadesikan and Kettenring (1972) is calculated,ρˆ jh = ((robScalei(zi j + zih))2− (robScalei(zi j− zih)2)). (A.3)The estimate ρˆ jh is used to define a tolerance ellipse, assuming joint normality, in which observa-tions lie with probability p where p the same as that used to define c in Equation (3.3). The robustbivariate correlation is calculated using only points that lie within the tolerance ellipse.The robust slopes used to produce the predicted value matrix are calculated as follows. Assumethe columns are centered but not necessarily normalized. An initial estimate of the slope is firstcalculatedbi j =nmedi=1(zi j/zih) (A.4)which is then used to calculate residuals,ri jh = zi j−b jhzih. (A.5)The ordinary least squares (OLS) slope, with no intercept, is then calculated with points where|ri jh|/c≤ robScale(ri jh) (A.6)where c2 = χ22 (p), where p is the same as was used to compute the robust scale. The robust slopethe OLS slope computed on the filtered data.78Appendix BData Quality Modelling SimulationAdditional FiguresFigure B.1: Sensitivity across simulation replicates for detection of data contaminated withclustered outliers for a range of fractions of contaminated cases and magnitude param-eters η . These plots present the results for the same simulations as Table 6.2.79Figure B.2: Specificity across simulation replicates for detection of data contaminated withclustered outliers for a range of fractions of contaminated cases and magnitude param-eters η . These plots present the results for the same simulations as Table 6.2.Figure B.3: Sensitivity across simulation replicates for detection of data contaminated withindependent outliers for a range of fractions of contaminated cells and magnitude pa-rameters k. These plots present the results for the same simulations as Table 6.3.80Figure B.4: Specificity across simulation replicates for detection of data contaminated withindependent outliers for a range of fractions of contaminated cells and magnitude pa-rameters k. These plots present the results for the same simulations as Table 6.3.81
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Robust and sparse regression in the presence of cellwise...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Robust and sparse regression in the presence of cellwise and casewise contamination with application… McGuinness, Glenn 2020
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Robust and sparse regression in the presence of cellwise and casewise contamination with application in data quality modelling |
Creator |
McGuinness, Glenn |
Publisher | University of British Columbia |
Date Issued | 2020 |
Description | This thesis considers the problem of robust and sparse estimation of linear regression parameters in data with structural and independent contamination. Independent outliers can propagate in data with relatively large numbers of dimensions, resulting in a high fraction of observations with at least one outlying cells. Recent work has shown that traditional robust regression methods are not highly robust to such outliers. We investigate the application of Robust Least Angle Regression (RLARS) to data with independent contamination. We also propose two modified versions of RLARS to further improve its performance. The first method applies RLARS to data which has been filtered of independent outliers. The second method performs RLARS with the Lasso modification for Least Angle Regression (LARS). Extensive simulations show that RLARS is resilient to structural and independent contamination. Compared with RLARS, simulation results show that the first modified version has significantly improved robustness to independent contamination and the second modified version has improved robustness when there are a large number of predictors. We also consider the application of the proposed methods to data quality modelling in a case study for MineSense Technologies Ltd. (MineSense). MineSense develops sensor packages for use in the harsh conditions of an active mine. To maintain high system availability and performance, data must be monitored for a deterioration in sensor health or a change in the data generating process, such as a change in ore body, which can manifest as outliers. We pose the problem of contamination detection, the identification of whether a dataset contains outliers, as a distinct problem from outlier detection, the identification of which cases or cells are outliers. We propose a contamination detection method based on the comparison of robust and non-robust linear regression estimates. When outliers are present, the robust and non-robust estimates differ significantly, indicating the presence of contamination. Simulation results and analysis of real sensor data provided by MineSense suggest that our method can effectively detect the presence of contamination with a low false detection rate. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2020-04-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0389791 |
URI | http://hdl.handle.net/2429/73971 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2020_may_mcguinness_glenn.pdf [ 1.33MB ]
- Metadata
- JSON: 24-1.0389791.json
- JSON-LD: 24-1.0389791-ld.json
- RDF/XML (Pretty): 24-1.0389791-rdf.xml
- RDF/JSON: 24-1.0389791-rdf.json
- Turtle: 24-1.0389791-turtle.txt
- N-Triples: 24-1.0389791-rdf-ntriples.txt
- Original Record: 24-1.0389791-source.json
- Full Text
- 24-1.0389791-fulltext.txt
- Citation
- 24-1.0389791.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0389791/manifest