Robust Estimation and Variable Selection in High-Dimensional Linear Regression Models

by

David Kepplinger

B.Sc., Vienna University of Technology, 2012
Dipl.-Ing., Vienna University of Technology, 2015

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2020

© David Kepplinger, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Robust Estimation and Variable Selection in High-Dimensional Linear Regression Models

submitted by David Kepplinger in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Gabriela V. Cohen Freue, Statistics
Supervisor

Matías Salibián-Barrera, Statistics
Supervisory Committee Member

Alexandre Bouchard-Côté, Statistics
University Examiner

Anne Condon, Bioinformatics
University Examiner

Additional Supervisory Committee Members:

Ruben H. Zamar, Statistics
Supervisory Committee Member

Abstract

Linear regression models are commonly used statistical models for predicting a response from a set of predictors. Technological advances allow for simultaneous collection of many predictors, but often only a small number of these are relevant for prediction. Identifying this set of predictors in high-dimensional linear regression models with emphasis on accurate prediction is thus a common goal of quantitative data analyses. While a large number of predictors promises to capture as much information as possible, it carries a risk of containing contaminated values.
If not handled properly, contamination can affect statistical analyses and lead to spurious scientific discoveries, jeopardizing the generalizability of findings.

In this dissertation I propose robust regularized estimators for sparse linear regression with reliable prediction and variable selection performance in the presence of contamination in the response and one or more predictors. I present theoretical and extensive empirical results underscoring that the penalized elastic net S-estimator is robust towards aberrant contamination and leads to better predictions for heavy-tailed error distributions than competing estimators. Especially in these more challenging scenarios, competing robust methods reliant on an auxiliary estimate of the residual scale are more affected by contamination due to the high finite-sample bias introduced by regularization.

For improved variable selection I propose the adaptive penalized elastic net S-estimator. I show this estimator identifies the truly irrelevant predictors with high probability as the sample size increases and estimates the parameters of the truly relevant predictors as accurately as if these relevant predictors were known in advance. For practical applications robustness of variable selection is essential. This is highlighted by a case study for identifying proteins to predict stenosis of heart vessels, a sign of complication after cardiac transplantation.

High robustness comes at the price of more taxing computations. I present optimized algorithms and heuristics for feasible computation of the estimates in a wide range of applications. With the software made publicly available, the proposed estimators are viable alternatives to non-robust methods, supporting discovery of generalizable scientific results.

Lay Summary

This dissertation presents new methods for identifying variables, such as protein levels extracted from blood samples, relevant for predicting an outcome of interest, for example severity of a disease.
The methods are specifically designed for applications where many variables are available and the observed data may contain some highly unusual values. Examples of such unusual values are aberrantly high levels of some proteins in a blood sample, or an unusually severe disease outcome. These values can lead to biased and misleading results.

The methods proposed in this dissertation are less affected by unusual values and hence increase reliability of results. Therefore, results from a small set of observations are more likely to be generalizable to the broader population. The software is made openly available and gives researchers a versatile tool to support reliable scientific discoveries.

Preface

This dissertation is the original work of David Kepplinger, prepared under the supervision of Prof. Gabriela V. Cohen Freue.

Parts of Chapters 3 and 6 are based on a paper coauthored with the supervisor and two collaborators [G. V. Cohen Freue, D. Kepplinger, M. Salibián-Barrera, and E. Smucler (2019). “Robust elastic net estimators for variable selection and identification of proteomic biomarkers”. In: Annals of Applied Statistics 13.4, pp. 2065–2090]. The original idea for the presented estimator and the procedure for initial estimates were developed by the supervisor and jointly refined by the author and supervisor. Development of algorithms, numerical experiments, and significant parts of manuscript writing were conducted by David Kepplinger. Other parts of the manuscript were jointly discussed by all coauthors, with significant contributions and developments by the author. The asymptotic properties presented herein are the original intellectual product of the author and pertain to conditions different from those in the paper.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Notation
Glossary
Acknowledgements
Dedication
1 Introduction
2 Background
    2.1 The Linear Regression Model
    2.2 Robust Estimation in the Linear Regression Model
    2.3 Estimation Under the Sparsity Assumption
    2.4 Robust Regularized Estimation
3 Elastic Net S-Estimators
    3.1 Method
    3.2 Initial Estimator
        3.2.1 Random Subsampling
        3.2.2 Elastic Net Peña-Yohai Procedure
        3.2.3 Empirical Comparisons
        3.2.4 Initial Estimates for a Set of Penalization Levels
    3.3 Theoretical Properties
    3.4 Robustness
    3.5 Hyper-Parameter Selection
        3.5.1 Restricting the Search Space
        3.5.2 Cross Validation
        3.5.3 Train/Test Split
    3.6 Numerical Experiments
        3.6.1 Estimators
        3.6.2 Scenarios
        3.6.3 Results
    3.7 Conclusions
4 Variable Selection Consistent S-Estimators
    4.1 Method
        4.1.1 Hyper-Parameter Selection
    4.2 Statistical Theory
    4.3 Robustness Properties
        4.3.1 Robustness of Variable Selection
    4.4 Numerical Experiments
        4.4.1 Preliminary Estimate for Adaptive PENSE
        4.4.2 Effects of Good Leverage Points
        4.4.3 Overall Effect of Contamination
    4.5 Biomarkers for Cardiac Allograft Vasculopathy
    4.6 Conclusions
5 Residual Scale Estimation
    5.1 The Problem in High Dimensions
    5.2 Data-Splitting Strategies
    5.3 Discussion
6 Software
    6.1 Algorithms for Weighted LS Adaptive EN
        6.1.1 Augmented Ridge
        6.1.2 Augmented LARS
        6.1.3 Alternating Direction Method of Multipliers (ADMM)
        6.1.4 Dual Augmented Lagrangian (DAL)
    6.2 Initial Estimates
    6.3 Computing Local Minima
    6.4 Computing Adaptive PENSE for Many Hyper-Parameters
    6.5 Summary
7 Conclusions
Bibliography
Appendices
A Simulation Settings
    A.1 Data-Generation Schemes
        A.1.1 Short-Hand Notation
    A.2 Comparison of Initial Estimates
    A.3 Numerical Experiments for PENSE and Adaptive PENSE
B Proofs
    B.1 Breakdown Point of PENSE
    B.2 Asymptotic Properties of Adaptive PENSE
        B.2.1 Preliminary Results Concerning the M-Scale Estimator
        B.2.2 Root-n Consistency
        B.2.3 Variable Selection Consistency
        B.2.4 Asymptotic Normal Distribution
C Additional Results from Numerical Experiments
    C.1 Elastic Net S-Estimators
        C.1.1 Prediction Performance
        C.1.2 Variable Selection Performance
        C.1.3 Estimation Accuracy
    C.2 Adaptive Elastic Net S-Estimators
        C.2.1 Prediction Performance
        C.2.2 Variable Selection Performance
        C.2.3 Estimation Accuracy

List of Tables

4.1 Proteins selected by adaptive PENSE in the CAV biomarker study.
6.1 Computational complexity of algorithms for weighted LS-adaEN.

List of Figures

3.1 PENSE objective function for a simple linear regression model.
3.2 Comparison of initial estimates.
3.3 PENSE objective function evaluated on different subsets of the data.
3.4 Comparison of strategies for hyper-parameter selection for PENSE.
3.5 Prediction performance of regularized estimators under various outlier positions.
3.6 Prediction performance of PENSE and competitors in numerical experiments.
3.7 Variable selection performance of PENSE and competitors in numerical experiments.
4.1 Variable selection performance of adaptive PENSE using different preliminary estimates.
4.2 Effect of high-leverage points on the variable selection performance of estimators using adaptive or non-adaptive penalties.
4.3 Prediction performance of adaptive PENSE and competitors in numerical experiments.
4.4 Variable selection performance of adaptive PENSE and competitors in numerical experiments.
4.5 Univariate regression estimates for two proteins in the CAV study.
4.6 Estimated prediction performance and fitted maximum percentage of diameter stenosis in the CAV study.
5.1 Effect of residual scale estimation on the PENSEM estimator.
5.2 Accuracy of residual scale estimates based on data-splitting strategies.
6.1 Computation time for augmented LARS using different storage schemes.
6.2 Convergence of iterative algorithms for weighted LS-adaEN.
6.3 Comparison of the average time to compute EN-PY initial estimates using varying number of threads.
6.4 Comparison of the average time to compute EN-PY initial estimates using different algorithms for the LS-adaEN subproblems.
6.5 Comparison of convergence of the MM algorithm using different tightening strategies.
6.6 Performance of the MM algorithm for computing local minima of the adaptive PENSE objective function using different tightening strategies.
6.7 Comparison of the average time to compute local minima using the MM algorithm with different algorithms for the weighted LS-adaEN subproblems.
6.8 Prediction performance estimated via cross-validation.
A.1 Short-hand notation for data generation schemes.
C.1 Prediction performance of PENSE and competitors in very sparse scenarios.
C.2 Prediction performance of PENSE and competitors in sparse scenarios.
C.3 Variable selection performance of PENSE and competitors in very sparse scenarios with no contamination.
C.4 Variable selection performance of PENSE and competitors in sparse scenarios.
C.5 Estimation accuracy of PENSE and competitors in very sparse scenarios.
C.6 Estimation accuracy of PENSE and competitors in sparse scenarios.
C.7 Prediction performance of adaptive PENSE based on different preliminary estimates.
C.8 Prediction performance of adaptive PENSE and competitors in very sparse scenarios.
C.9 Prediction performance of adaptive PENSE and competitors in sparse scenarios.
C.10 Variable selection performance of adaptive PENSE and competitors in very sparse scenarios with no contamination.
C.11 Variable selection performance of adaptive PENSE and competitors in sparse scenarios.
C.12 Estimation accuracy of adaptive PENSE and competitors in very sparse scenarios.
C.13 Estimation accuracy of adaptive PENSE and competitors in sparse scenarios.

Notation

Throughout this dissertation the following notation is consistently maintained.
Chapter-specific notation is omitted here and defined where required.

Boldface characters denote vectors or matrices, whereas non-boldface characters are scalars. Capital characters in calligraphy typeface are reserved for random variables and random vectors, whereas observed values of random variables are written in regular typeface. Sets are denoted by capital characters in script typeface, e.g., Q. The index variable $i$ is only used to index observations in a sample, while $j$ is reserved for indexing the set of predictors. Some commonly used symbols are

$\mathcal{Y}$    The random response variable in the linear regression model.
$\mathcal{X}$    The random vector of predictors in the linear regression model.
$\mathcal{U}$    The random error term in the linear regression model.
$y_i$    The $i$-th observed response value.
$\mathbf{x}_i$    The vector of observed predictor values for the $i$-th observation.
$x_{ij}$    The value of the $j$-th predictor observed for the $i$-th observation.
$\mathbf{X}$    A matrix of observed predictor values, $\mathbf{X} = (\mathbf{x}_1^\intercal, \dots, \mathbf{x}_n^\intercal)^\intercal$.
$\mathscr{Z}$    A sample, i.e., a set of observed values $\mathscr{Z} = \{(y_1, \mathbf{x}_1), \dots, (y_n, \mathbf{x}_n)\}$.

The boldface Greek letter $\boldsymbol{\beta}$ and the non-boldface Greek letter $\mu$ are reserved for the slope and intercept parameter, respectively, in the linear regression model. The boldface Greek letter $\boldsymbol{\theta}$ always denotes the concatenated vector of $\mu$ and $\boldsymbol{\beta}$, $\boldsymbol{\theta} = (\mu, \boldsymbol{\beta}^\intercal)^\intercal$. Accents, subscripts, and superscripts on $\boldsymbol{\theta}$ are propagated to $\mu$ and $\boldsymbol{\beta}$, e.g., $\hat{\boldsymbol{\theta}} = (\hat{\mu}, \hat{\boldsymbol{\beta}}^\intercal)^\intercal$. The total number of predictors in the linear regression model is denoted by $p$, i.e., the parameter vector $\boldsymbol{\beta}$ has $p$ elements, and the sample size is represented by $n$. Examples for such parameters are

$\boldsymbol{\beta}_0$    The true value of the slope parameter in the linear regression model.
$\mu_0$    The true value of the intercept in the linear regression model.
$\hat{\boldsymbol{\theta}}$    Estimate of the intercept and slope parameters in the linear regression model.
$\beta_j$    The $j$-th element of a vector of slope coefficients.

Additionally, the following miscellaneous symbols and functions are often encountered in this dissertation:

$\mathbf{I}_n$    The identity matrix with $n$ rows and $n$ columns.
$\mathbf{1}_n$    A vector of $n$ 1's.
$\mathbb{R}$    The set of real numbers.
$\mathbb{R}^n$    The set of real vectors of dimension $n$.
$\mathbb{R}^{n \times p}$    The set of real matrices of dimension $n \times p$.
$\mathbb{E}_F$    The expected value with respect to distribution $F$.
$\| \cdot \|$    A vector or operator norm (if applied to a vector or a matrix, respectively).
$\nabla_{u} f(u)|_{u = \tilde{u}}$    The subgradient of function $f$ with respect to $u$, evaluated at $\tilde{u}$.
$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$    A positive regression loss function taking values in $\mathbb{R}_+$, quantifying the difference between observed values $\mathbf{y} \in \mathbb{R}^n$ and fitted values $\hat{\mathbf{y}} \in \mathbb{R}^n$.
$\Phi(\boldsymbol{\beta})$    A penalty function $\mathbb{R}^p \to \mathbb{R}_+$, measuring the “size” of coefficients $\boldsymbol{\beta}$.
$\mathcal{O}(\boldsymbol{\theta})$    An objective function mapping regression coefficients in $\mathbb{R}^{p+1}$ to the set of positive real numbers.
$\hat{\boldsymbol{\theta}}_n \xrightarrow{a.s.} \boldsymbol{\theta}$    The random variable $\hat{\boldsymbol{\theta}}_n$ converges almost surely to $\boldsymbol{\theta}$ as the sample size $n$ increases, i.e., $\Pr(\lim_{n \to \infty} \hat{\boldsymbol{\theta}}_n = \boldsymbol{\theta}) = 1$.
$\hat{\boldsymbol{\theta}}_n \xrightarrow{p} \boldsymbol{\theta}$    The random variable $\hat{\boldsymbol{\theta}}_n$ converges in probability to $\boldsymbol{\theta}$ as the sample size $n$ increases, i.e., $\lim_{n \to \infty} \Pr(\| \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \| > \epsilon) = 0$ for any $\epsilon > 0$.

Glossary

The following acronyms are commonly used throughout.
Each acronym is defined at its first occurrence in the text.

adaEN    Adaptive elastic net
ADMM    Alternating direction method of multipliers
CAV    Cardiac allograft vasculopathy
CV    Cross-validation
DAL    Dual augmented Lagrangian
EN    Elastic net
FBP    Finite-sample breakdown point
LASSO    Least absolute shrinkage and selection operator
LARS    Least angle regression
LS    Least squares
LAD    Least absolute deviation
LOO    Leave-one-out
MAD    Median absolute deviation
MSE    Mean-square error
PENSE    Penalized elastic net S-estimator
PSC    Principal sensitivity component
PVE    Percentage of variance explained
PY    Peña-Yohai procedure
RCV    Refitted cross-validation
RMSPE    Root mean-square prediction error

Acknowledgements

My journey through the PhD program would not have been as successful without the support of truly remarkable people. I am grateful for the guidance by inspiring mentors, first and foremost by my supervisor Dr. Gabriela Cohen Freue. Her dedication and support have helped me to get where I am today, professionally and personally. Giving me autonomy while ensuring I would not lose sight of what's important has allowed me to grow as a scholar and person. Her advice at professional and personal crossroads has been a godsend and working together for the past five years has been nothing but inspiring. Of course, the members of my supervisory committee have also been instrumental to this dissertation and my development at UBC. Thanks to Dr. Matías Salibián-Barrera for his insightful input and the many stimulating discussions. I also thank Dr. Ruben Zamar for sharing his profound expertise and providing me with the opportunity to gain research experience in industry.

I count myself very lucky to have been part of the Department of Statistics which has provided me with a rich learning experience and highly supportive environment. I greatly appreciate Peggy Ng, Ali Hauschildt and Andrea Sollberger for lifting spirits even during stressful times and for their grokking of the UBC apparatus.
They were never too busy to ask about my well-being and lend an open ear. Department-organized seminars, department teas, and grad trips have provided opportunities to foster connections and friendships. I am grateful to many other members of the department whom I have been so privileged to meet and who have shared their vast experience on numerous occasions, such as Melissa Lee, Dr. Nancy Heckman, Dr. John Petkau, Dr. Paul Gustafson, and Dr. Will Welch, among others. Furthermore, volunteer positions made available by the department, such as graduate student representative and membership on search committees, have given me valuable leadership skills and insights into academic processes. My studies have only been made possible by generous funding from UBC through the Four-Year Fellowship and Faculty of Science Graduate Awards. Dr. Cohen Freue and Dr. Zamar have also graciously supported my studies through research assistantships. I could not have wished for a better environment to pursue my doctorate.

I would have never made it to this point without the incredible support from my family. My gratitude for my spouse Alexandra Patzak is immeasurable, for she has been a constant source of inspiration and energy. I cannot find the words to express how lucky I am to have her in my life and to have embarked on our PhD journeys together. I will also forever be thankful to my mother Heidi and sister Sara for inciting and nurturing my curiosity, teaching me the importance of education and candor, and for staying close to me (at least virtually) wherever life took me. So did my grandmother Elisabeth, in her joyful way, although I know she would have preferred me to stay closer to home. Her cryptic recipes with few instructions have kept me well fed over the years.
My uncle Dietmar I thank for his take on teaching and the side-projects which have been a welcome contrast to research. I thank all of them for their efforts to follow my research, despite my struggles explaining it in an accessible manner.

I will always cherish the memories I have made with the many uniquely wonderful people I have come to know during my time at UBC. I will fondly remember the subtle sense of humor of Eric Fu when he was updating me on everything that was going on in the department. I have also very much enjoyed my conversations with Creagh Briercliffe, who has never felt obliged to hide his sarcastic, refreshingly indecorous, wit. Daniel Hadley's incisive analyses of society were thought-provoking, yet he made sure that we (almost) always ended up laughing. One of my proudest moments at UBC, thanks to our captain Daniel Dinsdale, was winning the UBC departmental futsal division alongside fabulous teammates Jonathan Agyeman, Jonathan Steif, Joe Watson, and Dr. Matías Salibián-Barrera. Although Derek Cho left for a job in Japan before our victorious campaign, I also credit him for the win, and thank him for explaining peculiarities of Vancouver's culture and for our conversations about hockey. I thank Andy Leung for his inspiring integrity, for standing up for his friends and colleagues, and for introducing me to exciting new flavors of Japanese and Chinese cuisine. I have met numerous more people who have made my PhD studies unforgettable, such as Sonja Isberg and Vincenzo Coia, among many others.

All of these people whose paths overlapped with mine over the past five years and more made a unique impression on my dissertation and shaped me as a person. For this I am forever grateful. Thank you all.

Chapter 1

Introduction

The ability to predict a continuous response of interest using a set of predictors is central to many scientific and industrial applications.
Technological advances significantly pushed the frontiers in science and industry by enabling the collection of immense numbers of possibly relevant predictors. The scientific goal is two-fold: (a) predicting the response if only the predictors' values are available and (b) identifying which predictors are relevant, particularly for accurate prediction. The relationship at the heart of the problem may be highly complex, but a crude approximation by a linear relationship using a small set of the available predictors can nevertheless give valuable insights into the involved processes and allow for accurate prediction of the response. Approximations by a linear model are pertinent in applications where the sample size is small, particularly if many predictors are available. As an example, consider predicting yield of a crop based on numerous predictors such as a variety's genotype, nutrition content of the soil, and other environmental factors. Collecting a large sample is complicated by several obstacles, such as the time required to fully grow the crop, the costs of measuring all possible predictors, and the need for continued cooperation of growers who are willing to share their trade secrets.

The scientific goal translates to a statistical goal of estimating the parameters relating the values of the predictors with the response, with emphasis on identifying which coefficients are truly non-zero. Assuming that only some of the available predictors are relevant leads to the linear regression model being sparse in the sense that only these few relevant predictors have non-zero coefficients. A myriad of methods is available for estimating parameters and simultaneously identifying relevant predictors in the sparse linear regression model. These methods are predominantly founded on the assumption that all observations in the sample at hand are equally trustworthy. The danger of this assumption is that even a single contaminated observation, for instance an observation with aberrant response value and/or highly anomalous values in one or more predictors, can jeopardize the reliability of these methods and, in turn, the generalizability of the estimated predictive model. Contamination can take countless forms, but contaminated observations are generated by unknown processes different from the linear model underlying the majority of observations. The more predictors are available, the more questionable the assumption of no contamination in the sample at hand is, thereby exacerbating the risk of spurious discoveries.

Robust methods for linear regression, in contrast, are devised to cope with potential presence of contamination. Robust methodology for problems with only a small number of predictors is well established, but these solutions are challenged by characteristics inherent to high dimensional data and the notion of sparse models. Therefore, robust methods for simultaneous estimation and variable selection have not seen the same proliferation as non-robust methods. One of the biggest roadblocks to applying robust methods in high dimensional problems is computational complexity. Computation of robust estimates is difficult even in low dimensional settings, but as dimensions grow computational complexity can become insurmountable. Furthermore, robust methods are devised under the assumption that some of the observations may be contaminated. This inherent “mistrust” leads in general to less precise parameter estimates compared to non-robust methods in pristine settings without contamination. To compensate for this loss of efficiency, robust methods are often two-tiered: first computing a highly robust but potentially imprecise estimate and then refining this estimate to gain precision. The refinement step, however, is problematic in high dimensions and can, in worst case scenarios, lead to the loss of robustness and hence reliability.
Last but not least, the interplay of sparsity and possible contamination adds a layer of difficulty which has received little attention in the existing literature.

The first two contributions of this dissertation are the development and study of two robust estimators for high dimensional sparse linear regression. Both estimators are highly robust towards possibly large amounts of contamination in the data and perform reliably even in the most challenging situations where two-tiered robust estimators are at an elevated risk of being unduly affected by contamination. Understanding the interaction between sparsity and possible contamination provides important insights into its effect on estimators' ability to identify the relevant predictors. Particularly in very sparse problems, i.e., where only a small number of the available predictors are truly relevant for prediction, one of the proposed robust estimators protects against an inflation of the number of irrelevant predictors wrongly selected due to contamination.

This work also sheds light onto the difficulty of performing refinement steps to improve precision of robust estimators in high dimensional sparse problems. While justified theoretically for estimators without sparsity constraints, the theoretical foundation of the refinement step crumbles when sparsity is induced. Furthermore, robustness of the refinement step is contingent on an accurate estimate of the residual scale but obtaining this estimate in high dimensions under contamination is difficult. Therefore, applying the refinement step in high dimensions may jeopardize the reliability of the estimator.

The final main contribution is the adaptation and implementation of algorithms for computing the proposed robust estimators of linear regression. Robust estimation poses several computational challenges inherent to taming the influence of potentially contaminated observations.
These challenges are especially taxing in the high dimensional problems considered in this work. Analysis and rigorous optimization of the developed algorithms curtail computational complexity and ensure feasibility of robust estimation in a wide range of applications. These algorithms are made publicly available through an accessible software package, paving the way for robust estimation to gain a foothold in high dimensional data analysis.

Broadly speaking this dissertation is concerned with robust estimation in high dimensional linear regression models under the assumption that only some of the numerous available predictors are truly relevant for prediction, also known as the sparsity assumption. The specific focus is on simultaneous parameter estimation and variable selection through penalizing the size of non-zero coefficients, known as regularized estimation. Chapter 2 gives a comprehensive summary of the sparse linear regression model, the effects of contamination, and robust estimation in low dimensional settings. The chapter continues by outlining the benefits of the sparsity assumption in high dimensional linear regression and how regularization induces sparsity in estimates and concludes with an overview of avenues for fusing the sparsity assumption and robust estimation.

In Chapters 3 and 4, I present two robust regularized estimators for sparse linear regression. The estimator presented in Chapter 3, the penalized elastic net S-estimator (PENSE), combines robust estimation via the S-loss for linear regression (Rousseeuw and Yohai 1984) with the elastic net penalty (Zou and Hastie 2005) for variable selection. The chapter delineates an elaborate scheme germane to locating global optima of the non-convex PENSE objective function and establishes theoretical properties pertaining to robustness and asymptotic consistency of the estimator, highlighting its reliability even under challenging circumstances.
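To make the combination of an S-loss with an elastic net penalty more concrete, the following Python sketch evaluates an objective of this kind. It is an illustration under stated assumptions, not the estimator as defined in Chapter 3: the objective is assumed to be half the squared M-scale of the residuals plus the penalty $\lambda(\alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2)$, the M-scale is assumed to use Tukey's bisquare function with the conventional constants $c = 1.54764$ and $\delta = 0.5$, and all function names are hypothetical.

```python
import numpy as np

# Assumed constants: with DELTA = 0.5 and this tuning constant, the bisquare
# M-scale is consistent for the standard deviation at the normal model.
BISQUARE_C = 1.54764
DELTA = 0.5


def rho_bisquare(t, c=BISQUARE_C):
    """Tukey's bisquare rho function, scaled to take values in [0, 1]."""
    u = np.minimum(np.abs(t) / c, 1.0)
    return 1.0 - (1.0 - u**2) ** 3


def m_scale(resid, delta=DELTA, tol=1e-10, max_iter=500):
    """Solve mean(rho(resid / s)) = delta for s by fixed-point iteration."""
    resid = np.asarray(resid, dtype=float)
    s = np.median(np.abs(resid)) / 0.6745  # MAD-type starting value
    if s <= 0.0:
        # Crude guard: at least half of the residuals are exactly zero.
        return 0.0
    for _ in range(max_iter):
        s_new = s * np.sqrt(np.mean(rho_bisquare(resid / s)) / delta)
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


def en_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    beta = np.asarray(beta, dtype=float)
    return lam * (alpha * np.sum(np.abs(beta)) + 0.5 * (1.0 - alpha) * np.sum(beta**2))


def pense_like_objective(mu, beta, X, y, lam, alpha):
    """Assumed objective: half the squared M-scale of the residuals plus the EN penalty."""
    resid = y - mu - X @ beta
    return 0.5 * m_scale(resid) ** 2 + en_penalty(beta, lam, alpha)
```

Because the bisquare function is bounded, a single arbitrarily large residual contributes at most 1 to the sum defining the M-scale, which is what limits the influence of aberrant observations on such an objective. The price is that the objective is non-convex in the coefficients, in line with the need for a dedicated scheme to locate good optima.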
Theoretical results, however, offer no guidance for selecting the hyper-parameters introduced with the elastic net penalty, which govern sparsity and prediction performance of the ensuing estimate. I discuss strategies for selecting hyper-parameters in practical applications and ascertain favorable finite-sample performance of PENSE in an extensive simulation study. While empirical results underline the good prediction performance of PENSE, they also expose shortcomings in its ability to screen out irrelevant predictors.

To improve upon the high false positive rate of PENSE, Chapter 4 introduces the adaptive PENSE estimator, combining the robust S-loss with the adaptive elastic net penalty (Zou and Zhang 2009). Leveraging a preliminary PENSE estimate to penalize predictors differently, adaptive PENSE is shown to possess the oracle property even under adverse conditions. Asymptotically, the adaptive PENSE estimator correctly identifies all truly irrelevant predictors with high probability and estimates the non-zero coefficients for the truly relevant predictors as efficiently as if they were known in advance. Importantly, variable selection by adaptive PENSE is highly resilient against aberrant values in the truly irrelevant predictors, whereas PENSE and other robust regularized estimators would falsely identify the affected predictors as relevant. The improved robustness of variable selection is important for practical applications. This is demonstrated by applying adaptive PENSE to biomarker discovery for cardiac allograft vasculopathy, a common complication in heart transplant recipients.

A common strategy for obtaining more accurate robust estimators is to refine a highly robust, but possibly imprecise estimate. The strategy is successful in low dimensions but proves less reliable in higher dimensions. The refinement step hinges on a robust scale of the residuals from the initial, highly robust, fit.
As Chapter 5 outlines, robust estimation of the residual scale faces several challenges in high dimensions. While PENSE, adaptive PENSE, and other highly robust regularized estimators perform well for prediction, the empirical distribution of the residuals and robust estimates of the residual scale are severely biased in finite samples with many predictors. The inflated bias can hamstring the refinement step or, worse, make it susceptible to the influence of contamination. I present empirical results demonstrating that existing remedies developed for de-biasing non-robust residual scale estimates do not work well for robust estimates. This underlines the practical importance of robust regularized methods which do not depend on robust estimates of the residual scale, such as PENSE and adaptive PENSE.

The estimators proposed in this dissertation incur multiples of the computational costs of comparable non-robust estimators. The algorithms and heuristics detailed in Chapter 6 are therefore paramount for ensuring applicability of the estimators to high dimensional problems. I adapt several algorithms for minimizing convex functions such that they can be utilized to efficiently locate minima of non-convex robust objective functions. With this variety of algorithms the range of problems amenable to robust regularized estimators is expanded, enabling the use of (adaptive) PENSE in a wide range of problems. Especially in conjunction with the need to select appropriate hyper-parameters, computational complexity would balloon without the optimized algorithms developed for PENSE and adaptive PENSE.

Chapter 2
Background

In this chapter I formally introduce the linear regression model and outline several methods to estimate the parameters in this model. I expose how some estimators of linear regression are affected even by minor contaminations and in Section 2.2 I outline common strategies to derive estimators that are robust against these contaminations.
For applications where it can be assumed that many of the available predictors are truly unrelated to the response, the methods in Section 2.2 are suboptimal. In Section 2.3 I discuss methods to estimate the regression parameters while also identifying those “irrelevant” predictors and shed some light on possible improvements in the presence of contamination.

2.1 The Linear Regression Model

As outlined in Chapter 1, the linear regression model discussed in this work assumes that the value of a response variable $\mathcal{Y}$ (taking values in $\mathbb{R}$) relates to the values of a random vector of predictors $\mathcal{X}$ (taking values in $\mathbb{R}^p$) through a linear function of the form

$$\mathcal{Y} = \mu^0 + \mathcal{X}^\top \beta^0 + \mathcal{U} \tag{2.1}$$

where $\mu^0 \in \mathbb{R}$ and $\beta^0 \in \mathbb{R}^p$ are the true, unknown regression parameters, and $\mathcal{U}$ is a random error following some distribution $F_0$. To make the arguments in this work more concise, $\theta^0 \in \mathbb{R}^{p+1}$ denotes the concatenated parameter vector $(\mu^0, {\beta^0}^\top)^\top$.

I assume that the random predictor vector $\mathcal{X}$ is independent of $\mathcal{U}$ and follows distribution $H_0$. Therefore, the joint distribution $G_0$ of $(\mathcal{Y}, \mathcal{X})$ factorizes into the product

$$G_0(y, x) = H_0(x)\, F_0(y - \mu^0 - x^\top \beta^0). \tag{2.2}$$

It is important to highlight that so far, the only assumptions on the distributions are that $\mathcal{U}$ is centered at zero and that $\mathcal{X}$ and $\mathcal{U}$ are independent.

Without any additional assumptions, the linear regression model (2.1) can be used to relate the conditional expectation of the response to the predictors through a linear function. Assuming the expected value $E_{F_0}[\mathcal{U}] = 0$, independence of $F_0$ and $H_0$ leads to an expression of the conditional expectation of the response in the form of

$$E_{F_0}[\mathcal{Y} \mid \mathcal{X} = x] = \mu^0 + x^\top \beta^0. \tag{2.3}$$

If the parameters are known, this expression can be used to predict the value of the response which can be expected given only observed values of the predictors.

In practice the true parameters are of course unknown.
Using (2.3) for predicting the response based only on observed values of the predictors therefore requires estimates of the parameters. For estimating these parameters, it is assumed that a sample of $n > 0$ independent realizations of $(\mathcal{Y}, \mathcal{X})$ is available. The observed sample is written as the vector-matrix pair $(\mathbf{y}, \mathbf{X})$, where $\mathbf{y} = (y_1, \dots, y_n)^\top$ and $\mathbf{X} = (x_1, \dots, x_n)^\top$. The observed response values $y_i \in \mathbb{R}$ and associated observed predictor values $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, are used to compute estimates of the parameters according to some estimation method. The quality of these estimates and thus the prediction can be assessed by analyzing the statistical properties of the estimator, i.e., the random vector arising from applying the estimation method to the random sample $(\mathcal{Y}_i, \mathcal{X}_i)$, $i = 1, \dots, n$.

An important quality of an estimator is for the estimate to “be close” to the true parameter value. Ideally, an estimator should be unbiased, $E_{G_0}[\hat\theta] = \theta^0$, and have small variance, $E_{G_0}[\|\hat\theta - \theta^0\|_2^2]$. Tolerating a small bias in finite samples, however, can often lead to an estimator with smaller variance. More important than unbiasedness is that both bias and variance vanish as the sample size increases. This is the case if the estimator is consistent for the true parameter, $\lim_{n \to \infty} P(\|\hat\theta - \theta^0\| > \epsilon) = 0$ for every $\epsilon > 0$, or even strongly consistent, $P(\lim_{n \to \infty} \hat\theta = \theta^0) = 1$. A consistent estimator may be biased in finite samples but its bias and variance tend to 0 as the sample size increases.

Having a consistent estimator $\hat\theta$ and being able to derive the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta^0)$ enables statistical inference on the parameters and comparisons between estimators. Of particular interest are estimators for which $\sqrt{n}(\hat\theta - \theta^0)$ converges to a Normal distribution with mean 0 and covariance matrix $V(\hat\theta, \theta^0)$ which can be factorized into $\upsilon(\hat\theta, \theta^0)\,\tilde{V}(\theta^0)$, where $\upsilon(\hat\theta, \theta^0) \in \mathbb{R}$. In case two estimators $\hat\theta$ and $\tilde\theta$ converge to such a Normal distribution, they can be compared by the ratio of $\upsilon(\tilde\theta, \theta^0)$ to $\upsilon(\hat\theta, \theta^0)$, i.e., the asymptotic relative efficiency of $\hat\theta$. Usually, $\tilde\theta$ is taken to be an estimator with small variance in a particular setting, e.g., the maximum likelihood estimator (MLE), if it exists. Asymptotic relative efficiency is useful for quantifying the costs incurred by an estimator $\hat\theta$ which, for example, requires less stringent assumptions on the model than $\tilde\theta$.

Asymptotic properties facilitate comparison between estimators but give limited insight into the estimator’s qualities when the sample size $n$ is small. Finite-sample properties, on the other hand, are more useful assessments of the performance of an estimator in practical applications, but at the same time are difficult to derive theoretically, except in a few special cases. For many regression estimators and model distributions $G_0$, finite-sample performance measures are therefore calculated through extensive simulations. With prediction performance being of primary interest in this work, the mean squared prediction error (MSPE) is an important measure of performance in finite samples:

$$\mathrm{MSPE}(\hat\theta, G_0) := E_{G_0}\big[(\tilde{\mathcal{Y}} - (\hat\mu + \tilde{\mathcal{X}}^\top \hat\beta))^2\big] = \mathrm{Var}_{G_0}\big[\tilde{\mathcal{Y}} - \hat\mu - \tilde{\mathcal{X}}^\top \hat\beta\big] + E_{G_0}\big[\tilde{\mathcal{Y}} - \hat\mu - \tilde{\mathcal{X}}^\top \hat\beta\big]^2. \tag{2.4}$$

Here, the expectation is taken over the $n$ observations in the sample used to estimate $\hat\theta$ as well as the “new” observation $(\tilde{\mathcal{Y}}, \tilde{\mathcal{X}})$. The mean squared prediction error has the intuitive interpretation of the sum of the variance of the prediction error $\tilde{\mathcal{Y}} - \hat\mu - \tilde{\mathcal{X}}^\top \hat\beta$ and its squared bias. It can therefore be seen as an overall metric of prediction performance.

The MSPE can also be written as

$$\begin{aligned}
\mathrm{MSPE}(\hat\theta, G_0) &= E_{G_0}\big[(\tilde{\mathcal{Y}} - (\hat\mu + \tilde{\mathcal{X}}^\top \hat\beta))^2\big] \\
&= E_{G_0}\big[(\mu^0 + \tilde{\mathcal{X}}^\top \beta^0 + \tilde{\mathcal{U}} - \hat\mu - \tilde{\mathcal{X}}^\top \hat\beta)^2\big] \\
&= E_{G_0}\big[\tilde{\mathcal{U}}^2\big] + 2 E_{G_0}\big[\tilde{\mathcal{U}}\big]\, E_{G_0}\big[(\mu^0 - \hat\mu) + \tilde{\mathcal{X}}^\top(\beta^0 - \hat\beta)\big] + E_{G_0}\big[((\mu^0 - \hat\mu) + \tilde{\mathcal{X}}^\top(\beta^0 - \hat\beta))^2\big].
\end{aligned}$$

The first term in the last line is the variance of the errors and the middle term is 0 because the errors $\tilde{\mathcal{U}}$ are centered and independent of the predictors and of the sample used for estimation. The final term is the mean squared error (MSE) of the estimator, defined by

$$\mathrm{MSE}(\hat\theta, G_0) := E_{G_0}\big[(\hat\mu - \mu^0)^2\big] + 2 E_{G_0}\big[(\hat\mu - \mu^0)(\hat\beta - \beta^0)^\top\big] E_{H_0}[\tilde{\mathcal{X}}] + E_{G_0}\big[(\hat\beta - \beta^0)^\top E_{H_0}[\tilde{\mathcal{X}} \tilde{\mathcal{X}}^\top] (\hat\beta - \beta^0)\big]. \tag{2.5}$$

This definition of the MSE from a prediction-based perspective (Maronna et al. 2019) measures the overall estimation accuracy, taking into account the covariance among predictors and their multivariate location. Comparable to the asymptotic relative efficiency, the finite-sample efficiency of an estimator $\hat\theta$, defined as $\mathrm{MSE}(\tilde\theta, G_0) / \mathrm{MSE}(\hat\theta, G_0)$, facilitates comparison of estimation accuracy between different estimators. Again, the estimator $\tilde\theta$ is a “gold standard”, e.g., the maximum likelihood estimator as defined below, and a finite-sample efficiency close to or even larger than 1 is desirable.

Closely related to the MSE, the $\ell_2$ estimation error $E_{G_0}[\|\hat\theta - \theta^0\|_2^2]$ provides similar information about the finite-sample performance of an estimator. The $\ell_2$ estimation error, however, ignores the covariance among predictors, i.e., it omits $E_{H_0}[\tilde{\mathcal{X}} \tilde{\mathcal{X}}^\top]$ in (2.5). The MSE and the $\ell_2$ estimation error coincide if the predictors are centered and pairwise independent with unit variance. In cases where predictors are highly correlated, the MSE remains small even if the parameter estimates are slightly biased, as long as the combined effect of the correlated predictors (i.e., the sum of the scaled coefficient values) is close to the truth. As an example, consider a linear regression model with two centered predictors which are highly correlated (e.g., $\mathrm{Cor}_{H_0}(\mathcal{X}_1, \mathcal{X}_2) \approx 1$) and have variance $\sigma_1^2$ and $\sigma_2^2$, respectively. In this case, the MSE is small as long as $\hat\beta_1 \sigma_1 + \hat\beta_2 \sigma_2 \approx \beta^0_1 \sigma_1 + \beta^0_2 \sigma_2$. Considering that both $\mathcal{X}_1$ and $\mathcal{X}_2$ carry almost the same information, the actual value of the parameters is irrelevant for explaining the response well, as long as the sum of the scaled coefficients is close to the truth.
For the $\ell_2$ estimation error to be small, on the other hand, both $|\hat\beta_1 - \beta^0_1|$ and $|\hat\beta_2 - \beta^0_2|$ must be small.

Even when restricting attention to estimators that possess several of the above listed desired properties, there is a plethora of methods available to estimate the regression parameters in the linear regression model. Which method to use depends on the researcher’s emphasis as well as additional assumptions that can be imposed. For instance, if the distribution of the errors is (assumed to be) known to have density function $f_0$, the maximum likelihood estimator (MLE)

$$\hat\theta_{\mathrm{MLE}} = \arg\min_{\mu, \beta} \sum_{i=1}^n -\log f_0(y_i - \mu - x_i^\top \beta)$$

has very appealing properties as the sample size $n$ grows. Under mild regularity conditions on the distribution $F_0$ and as $n \to \infty$, the MLE is consistent (i.e., converges to the true parameters in probability) and asymptotically efficient (i.e., no consistent estimator can have lower asymptotic variance). However, these optimality properties heavily depend on the validity of the assumption on $G_0$.

A different approach to estimate the parameters is to fit the observed response values $\mathbf{y}$ well without regard to the actual distribution of the errors. Formally, the approach is to determine $\hat\theta$ such that

$$\hat\theta = \arg\min_{\mu, \beta} \mathcal{L}(\mathbf{y}, \mu + \mathbf{X}\beta), \tag{2.6}$$

where the regression loss function $\mathcal{L} : \mathbb{R}^n \times \mathbb{R}^n \to [0, \infty)$ measures the inaccuracy of the fitted values, i.e., how far the fitted values $\hat{\mathbf{y}} = \hat\mu + \mathbf{X}\hat\beta$ are from the observed response $\mathbf{y}$. Therefore, it makes sense to require that $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = 0$ if and only if $\hat{\mathbf{y}} = \mathbf{y}$. Paraphrasing Lehmann and Casella (2003), the desire is to have an accurate estimate, but since it is usually unknown what the estimate will be used for once it is made public, the choice of the measure of accuracy is arbitrary. However, the chosen loss function directly affects the properties of the estimator and therefore it should be chosen wisely.
The most prominent loss function for linear regression is the sum of squared residuals

$$\mathcal{L}_{\mathrm{LS}}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{2n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

which is mathematically convenient and leads to an accurate and theoretically sensible estimator in many settings. In the case where $F_0$ is assumed Gaussian, for instance, the least squares (LS) estimator, $\hat\theta_{\mathrm{LS}}$, coincides with the MLE and thus enjoys all the asymptotic properties of the MLE. Even more convincing, the Gauss-Markov theorem (and extensions of it) states that the LS-estimator has uniformly smallest variance among all unbiased linear estimators if (i) the variance of the error term is finite and (ii) the distribution of the predictors $H_0$ is unknown, or $G_0$ is multivariate Normal with unknown parameters (Lehmann and Casella 2003, p. 184f).

Despite these strong arguments for the LS-estimator, there are reasons why the LS-estimator might not be the best choice. The LS-estimator has smallest variance among all unbiased and linear estimators (i.e., estimators for which the fitted values are a linear combination of the observed response values). Consequently, unless $F_0$ is Gaussian, it may be possible to find an estimator that is not linear in the observed response values or biased (but still consistent) and has smaller variance than the LS-estimator.

Especially if it is likely that the error term takes on large values, i.e., $F_0$ has heavy tails, finite-sample performance of the LS-estimator suffers considerably. Even if the researcher is willing to assume the error term is Normally distributed, it is most often only a crude approximation to the truth and large errors may occur more often than expected. Because of the square function, unusually large residuals contribute substantially to the LS-loss and force the estimator $\hat\theta_{\mathrm{LS}}$ to adapt to these observations to shrink the discrepancy between the fitted value and the observed value.
If the sample at hand contains a small fraction of observations with unusually large residuals, they can dominate the LS-loss function and the estimate could be excessively affected by them.

A maybe even more worrisome property of the LS-loss is revealed when considering its gradient with respect to the regression parameters, which is given by

$$\nabla_\beta \mathcal{L}_{\mathrm{LS}}(\mathbf{y}, \mu + \mathbf{X}\beta)\big|_{\beta = \tilde\beta} = -\frac{1}{n} \sum_{i=1}^n (y_i - \mu - x_i^\top \tilde\beta)\, x_i.$$

At every minimum of the LS-loss, each element of the gradient needs to be 0. Therefore, the LS-estimator $\hat\theta_{\mathrm{LS}}$ must satisfy

$$0_p = \sum_{i=1}^n (y_i - \hat\mu_{\mathrm{LS}} - x_i^\top \hat\beta_{\mathrm{LS}})\, x_i,$$

where $0_p$ is the $p$-dimensional 0-vector. From this equation it can be clearly seen that if the value of any predictor of the $i$-th observation is unusually large, the corresponding response needs to be fitted very well to keep the residual small and counterbalance the influence of the predictor on the gradient. Observations with unusually large values in any of the predictors are called leverage points; unless the true residual of this observation is very small, the observation can have a devastating effect on the LS-estimator. Huber and Ronchetti (2009) argue leverage points are usually of no concern in designed experiments where the researcher has (at least some) control over the values of the predictors. Even with random predictors, as considered in this work, Huber and Ronchetti suggest that leverage points are interesting by themselves and should be identified in advance to be analyzed separately. This approach might work in some settings, but in Section 2.3 I argue why this is challenging or nearly impossible in the settings considered in this work.

Now that I have outlined some instances where the LS-loss might not be an appropriate choice, the question becomes whether there are alternatives with similarly appealing properties. Of course, one possibility is to assume a different distribution for the errors, one with heavier tails, and compute the MLE.
However, this approach might lose precision if a large majority of the observations are well explained by a regression model where $F_0$ has light tails (e.g., Gaussian) and only a few observations have gross errors. Additionally, the MLE does not address the problem of leverage points. In the next section I introduce a strategy from robust statistics.

2.2 Robust Estimation in the Linear Regression Model

In many practical applications of linear regression, precluding the presence of adverse observations is almost impossible. The approach taken by robust statistics is to not try to build a comprehensive model that accounts for these few adverse observations, but rather derive methods that are stable and give “reasonable” results as long as the number of adverse observations remains small. Importantly, it is assumed that the parametric model $G_0$ underlies the majority of the observations in the sample. However, to allow for a more realistic representation of the observed sample, a small proportion of the sample is allowed to come from an unspecified, possibly degenerate, model $\breve{G}$. In the Tukey-Huber contamination model for linear regression, this can be written as $\tilde{G}_0(y, x) = (1 - \epsilon)\, G_0(y, x) + \epsilon\, \breve{G}(y, x)$, with $G_0$ the parametric model defined in (2.2) and contamination proportion $\epsilon \in [0, 0.5)$. In this “casewise” contamination model, the observed sample is generated by a mixture of the data generating process of interest, $G_0$, and the contamination process $\breve{G}$. The goal is still to estimate the parameters in $G_0$, but it is more difficult because some of the observations are actually generated by $\breve{G}$, and it is not known which observations.
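As a minimal sketch of the casewise mixture just described, the following draws a sample in which each whole observation comes from the contamination process with probability eps and from the model of interest otherwise. The function name, the Gaussian choice for the clean model, and the point-mass contamination at (-50, 10) are all assumptions made only for illustration; the contamination model itself leaves the contaminating distribution unspecified.

```python
import random


def sample_casewise_contaminated(n, eps, mu0=1.0, beta0=2.0, seed=1):
    """Draw n pairs (y, x) from (1 - eps) * G0 + eps * G_breve.

    G0 (assumed for illustration): x ~ N(0, 1), u ~ N(0, 1), y = mu0 + beta0 * x + u.
    G_breve (assumed for illustration): point mass at (y, x) = (-50, 10).
    """
    rng = random.Random(seed)
    sample, from_contamination = [], []
    for _ in range(n):
        # The whole observation is swapped, not individual cells.
        contaminated = rng.random() < eps
        if contaminated:
            y, x = -50.0, 10.0
        else:
            x = rng.gauss(0.0, 1.0)
            y = mu0 + beta0 * x + rng.gauss(0.0, 1.0)
        sample.append((y, x))
        from_contamination.append(contaminated)
    return sample, from_contamination


sample, flags = sample_casewise_contaminated(n=1000, eps=0.1)
# Observed share of contaminated cases; in practice the flags are not observed.
print(sum(flags) / len(flags))
```

The flags are returned only so the sketch can be inspected; an estimator applied to such a sample would see the pairs alone.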
An observation is only useful for estimating the parameters if it is indeed generated by $G_0$, and robust procedures designed for the Tukey-Huber contamination model should filter out information from observations generated by $\breve{G}$. To ensure $G_0$ and hence the parameters in the model are identifiable, the contamination proportion $\epsilon$ should be less than 50%, i.e., the majority of the observed sample is generated from the process of interest.

The casewise contamination model can be compared to the more general independent contamination model (Alqallaf et al. 2009) where each individual value of the observation is independently either generated by the assumed model or by the unspecified contamination process. If thinking of the sample as an $n \times (p + 1)$ matrix, with the $i$-th observation being recorded in the $i$-th row and the $j$-th column corresponding to the value of the $j$-th predictor (or the response if $j = p + 1$), the independent contamination model can be thought of as “cellwise” contamination. In this framework, each cell is either generated by the true model, or by the unspecified contamination process. In the casewise contamination model, on the other hand, each observation with a single contaminated value is considered to be generated by $\breve{G}$. Having a few contaminated cells can lead to a large number of contaminated cases, especially if $p$ is large. This may be problematic in high-dimensional datasets as the proportion of contaminated cases could be propelled outside the sustainable 50%. The cellwise contamination model, however, poses great challenges for estimation procedures which go beyond the scope of this work. Henceforth contamination is always understood in the sense of the Tukey-Huber contamination model, i.e., an observation is either considered contaminated or not.

Despite the presence of a small proportion of contamination, the aim is still to estimate the true regression parameters in $G_0$, $\theta^0 = (\mu^0, {\beta^0}^\top)^\top$.
However, in addition to the desired properties for any estimator discussed in the previous section, robust estimators strive to limit the effect of adverse observations. Over time, different measures of robustness, and thereby properties related to these measures, have been developed. A concept that plays a central role in this work is the notion of the replacement finite-sample breakdown point (FBP) as defined in Donoho and Huber (1982). The FBP measures how many observations in any given sample must be replaced by arbitrary values to push the estimate to the boundary of the parameter space. In the context of regression, this is equivalent to forcing the norm of the estimated regression parameter to infinity. To define the breakdown point formally, I introduce the notation $\hat\theta = \hat\theta(\mathcal{Z})$ for an estimator of the regression parameters to explicitly show the dependence on the sample $\mathcal{Z} = (\mathbf{y}, \mathbf{X}) = \{(y_i, x_i) : i = 1, \dots, n\}$. With this notation, an estimator of the regression parameters has FBP $\epsilon^*(\hat\theta, \mathcal{Z})$ given by

$$\epsilon^*(\hat\theta, \mathcal{Z}) = \max_{m = 1, \dots, n} \left\{ \frac{m}{n} : \sup_{\tilde{\mathcal{Z}} \in \mathcal{Z}_m(\mathcal{Z})} \|\hat\theta(\tilde{\mathcal{Z}})\| < \infty \right\}. \tag{2.7}$$

The set $\mathcal{Z}_m(\mathcal{Z})$ denotes all possible samples obtained by replacing at most $m$ observations from the original sample $\mathcal{Z}$ with arbitrary values. Ideally, the FBP does not depend on the actual sample $\mathcal{Z}$, as long as the sample satisfies some estimator-dependent conditions.

The FBP can be considered as a measure of how much contamination can be tolerated without suffering the worst-possible consequences. It does not, however, imply that for less contamination the estimate is anywhere close to the true parameter $\theta^0$; in particular, the estimator does not have to be consistent for $\theta^0$ under any positive amount of contamination.
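To make the replacement idea behind the FBP concrete, the sketch below replaces a single observation (m = 1) of a small sample and lets its response grow. It uses the least-squares fit for one predictor as the estimator; the data, the function names, and the choice of which observation to replace are hypothetical and serve only as an illustration.

```python
def ls_fit(sample):
    """Least-squares (intercept, slope) for a list of (y, x) pairs."""
    n = len(sample)
    mean_x = sum(x for _, x in sample) / n
    mean_y = sum(y for y, _ in sample) / n
    sxx = sum((x - mean_x) ** 2 for _, x in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in sample)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical clean sample lying exactly on the line y = 1 + 2x.
clean = [(1.0 + 2.0 * x, float(x)) for x in range(10)]


def replace_last_response(sample, y_new):
    """Return a copy of the sample with one observation replaced (m = 1)."""
    _, x_old = sample[-1]
    return sample[:-1] + [(y_new, x_old)]


for y_new in (1e2, 1e4, 1e6):
    intercept, slope = ls_fit(replace_last_response(clean, y_new))
    print(y_new, round(intercept, 1), round(slope, 1))
```

The printed estimates move further from (1, 2) as the replaced response grows, so the supremum over samples with one replaced observation is not finite for this estimator. An estimator with a high FBP would keep its estimate bounded under the same replacement.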
A related concept is the (asymptotic) breakdown point which, instead of operating on the sample level, considers the worst effect on the parameter estimate if the actual distribution $\tilde{G}_0$ is within an $\epsilon$-neighborhood of the assumed distribution $G_0$ (Davies and Gather 2005). A driving factor in the development of many robust estimators is the desire to obtain a “high breakdown point” estimator, i.e., an estimator that achieves a breakdown point close to 0.5, the maximum for regression-equivariant estimators¹ (Davies and Gather 2005).

Instead of focusing on the worst-case scenarios, the sensitivity curve (SC) measures how much the parameter estimate changes when adding a single observation to the original sample (Maronna et al. 2019). The asymptotic version of the sensitivity curve, the influence function (IF), measures the effect on the estimate when adding infinitesimal point-mass at $(\tilde{y}, \tilde{x})$ to the assumed distribution $G_0$ (Hampel 1974). In general, it is desired that a robust estimator has a bounded sensitivity curve and influence function. However, even if an estimator has a breakdown point greater than 0, neither the sensitivity curve nor the influence function needs to be bounded.

A more balanced measure is the maximum asymptotic bias (MB) which measures by how far a consistent estimator misses the target value $\theta^0$ if the actual distribution $\tilde{G}_0$ is in an $\epsilon$-neighborhood of the assumed $G_0$ (Maronna et al. 2019). The maximum bias gives a more refined picture of how badly an estimator can be affected by a certain amount of contamination and from the definition of the breakdown point it is evident that the MB is finite for $\epsilon \le \epsilon^*(\hat\theta)$. A more complete discussion of measures of robustness can be found in Maronna et al. (2019) and Huber and Ronchetti (2009).
To summarize, in this work the main measure of robustness is the finite-sample breakdown point, while also keeping in mind the increase in the MSE or estimation error incurred by contamination, especially in finite samples.

As has been shown in numerous different settings and applications, the classical LS-estimator of regression possesses none of these desired robustness properties (Maronna et al. 2019). Its FBP is $1/n$ and consequently it has an asymptotic breakdown point of 0; a single aberrant observation can push the estimated regression parameters to infinity. Similarly, the sensitivity curve and influence function are unbounded, and the MB is infinite for any amount of contamination. Under these considerations, a substantial body of research is therefore devoted to finding alternatives to the LS-estimator in various settings.

¹ An estimator $\hat\theta$ is regression-equivariant if it satisfies $\hat\theta(a\mathbf{y} + b\mathbf{X}, C\mathbf{X}) = C^{-1}(a\hat\theta(\mathbf{y}, \mathbf{X}) + b)$ for all $a \in \mathbb{R}$, $b \in \mathbb{R}^{p+1}$, and all non-singular matrices $C \in \mathbb{R}^{(p+1) \times (p+1)}$.

One prominent approach is to view the LS-loss as the sample variance of the (uncentered) estimated residuals, $s^2(r) = \sum_{i=1}^n r_i^2 / n$. Therefore, the LS-estimator seeks to minimize the sample variance of the residuals. In this light, it seems sensible to replace the sample variance with a robust measure of variability. A measure that is used extensively in univariate scale estimation problems is the median absolute deviation (MAD). In the linear regression context, minimizing the MAD of the residuals is equivalent to minimizing the Least Median Squares (LMS) loss (Hampel 1975; Rousseeuw 1984) given by

$$\mathcal{L}_{\mathrm{LMS}}(\mathbf{y}, \hat{\mathbf{y}}) = \operatorname*{med}_{i = 1, \dots, n} (y_i - \hat{y}_i)^2.$$

The LMS-estimator is consistent for $\theta^0$ and can withstand large amounts of contamination as its finite-sample breakdown point is $\epsilon^*_{\mathrm{LMS}} = (\lfloor n/2 \rfloor - p + 2)/n$ (Rousseeuw 1984).
However, the convergence rate is only of order $n^{-1/3}$ (Kim and Pollard 1990), which implies that in the case of no contamination the LMS-estimator is considerably less efficient than the LS-estimator with a convergence rate of order $n^{-1/2}$. Maybe more problematic for practical considerations, however, is the non-smoothness of the loss function, which impedes the development of fast algorithms to compute the estimate.

The issues of the LMS-estimator can be avoided by using a continuous function to define the scale estimator, instead of the median of squared residuals. One such estimator for the residual scale is the M-scale estimator, $\hat\sigma_{\mathrm{M}}$ (Huber and Ronchetti 2009), defined as

$$\hat\sigma_{\mathrm{M}} : \mathbb{R}^n \to \mathbb{R}_+, \qquad r \mapsto \inf\left\{ s > 0 : \frac{1}{n} \sum_{i=1}^n \rho\left(\frac{r_i}{s}\right) \le \delta \right\}. \tag{2.8}$$

This mapping is continuous if the function $\rho : \mathbb{R} \to [0, \infty)$ satisfies the condition

[R1] $\rho(0) = 0$ and $\rho$ is continuous, even, i.e., $\rho(-t) = \rho(t)$, and nondecreasing, i.e., $0 \le t \le t'$ implies $\rho(t) \le \rho(t')$.

Using this M-scale estimate, the corresponding S-estimator (Rousseeuw and Yohai 1984) of linear regression is defined through the S-loss

$$\mathcal{L}_{\mathrm{S}}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{2} \hat\sigma_{\mathrm{M}}^2(\mathbf{y} - \hat{\mathbf{y}}). \tag{2.9}$$

As detailed later, the constant $\delta$ is essential for the robustness of the S-estimator and needs to satisfy $0 < \delta < \lim_{r \to \infty} \rho(r)$.

The definition of the M-scale estimator, and in turn of the S-estimator, may seem arbitrary, but becomes clearer when considering the equivalent implicit definition

$$\frac{1}{n} \sum_{i=1}^n \rho\left(\frac{r_i}{\hat\sigma_{\mathrm{M}}(r)}\right) = \delta,$$

which holds if $\rho$ satisfies condition [R1] and $\hat\sigma_{\mathrm{M}}(r) > 0$. From the implicit definition and considering the special case $\rho(x) = x^2$, it is evident that in this case $\hat\sigma_{\mathrm{M}}^2(r) = \frac{1}{\delta n} \|r\|_2^2$ and hence the S-estimator coincides with the LS-estimator.

To understand the robustness properties of the S-estimator, it is necessary to first understand them for the M-scale estimator. The M-scale estimator is resistant to grossly aberrant values only if the $\rho$ function is bounded.
A bounded $\rho$ function in this work is assumed to satisfy

[R2] $\rho(t) = 1$ for all $|t| > c$ with $c < \infty$ and $\rho$ is strictly increasing on $(0, c)$, i.e., $0 \le t < t' < c$ implies $\rho(t) < \rho(t')$.

With a bounded $\rho$ function, the constant $\delta$ is in $(0, 1)$ and directly affects the robustness of the M-scale estimator with FBP given by $\lfloor n \min(\delta, 1 - \delta) \rfloor / n$. More specifically, the M-scale estimator with bounded $\rho$ function can tolerate up to $\lfloor n\delta \rfloor$ gross outliers without exploding to infinity, and up to $\lfloor n(1 - \delta) \rfloor$ “inliers” without imploding to 0. Because robustness of the M-scale estimator hinges on the boundedness of the $\rho$ function, from now on it is implicitly assumed that the $\rho$ function used for M-scale estimation is bounded.

Independent of the exact choice of the $\rho$ function, as long as it satisfies conditions [R1], [R2] and is continuously differentiable with bounded derivative, the S-estimator is consistent for the true parameters under certain conditions on $G_0$ (Davies 1990; Smucler 2019). The last condition mentioned for consistency is formalized by

[R3] $\rho$ is continuously differentiable with the derivative $\rho'(t)$ and $t\rho'(t)$ both bounded.

As with the univariate M-scale estimator, the FBP of the S-estimator is determined by $\delta$ through $\epsilon^*_{\mathrm{S}} = (\lfloor n \min(\delta, 1 - \delta) \rfloor - p - 1)/n$. Hössjer (1992) shows that even with an optimal choice of the $\rho$ function, the efficiency of the S-estimator is inversely proportional to its resistance to outliers; in other words, it cannot be both highly efficient and highly robust.

A popular choice for the $\rho$ function that satisfies all of the above conditions is Tukey’s bisquare family of functions given by

$$\rho(t; c) = \min\left(1,\ 1 - \left(1 - \frac{t^2}{c^2}\right)^3\right) \quad \text{with } c > 0. \tag{2.10}$$

Tukey’s bisquare $\rho$ function is convenient to handle in computations and yields an S-estimator which is reasonably close to the S-estimator which uses an “optimal” $\rho$ function in terms of efficiency under the Normal model (Hössjer 1992).
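The definitions (2.8) and (2.10) can be combined into a short numerical sketch. The code below evaluates Tukey’s bisquare function and locates the M-scale by bisection on the scale. Bisection is used here only because it is simple to verify; it is an assumption of this sketch and not the algorithm used elsewhere in this work. The function names and the residual values are hypothetical.

```python
def rho_bisquare(t, c=1.0):
    """Tukey's bisquare rho from (2.10): min(1, 1 - (1 - t^2 / c^2)^3)."""
    u = (t / c) ** 2
    return 1.0 if u >= 1.0 else 1.0 - (1.0 - u) ** 3


def m_scale(residuals, delta=0.5, c=1.0, iterations=200):
    """M-scale from (2.8), located by bisection (an illustrative solver choice)."""
    n = len(residuals)
    abs_nonzero = [abs(r) for r in residuals if r != 0.0]
    if len(abs_nonzero) / n <= delta:
        return 0.0  # too many exact zeros: the infimum in (2.8) is 0

    def mean_rho(s):
        return sum(rho_bisquare(r / s, c) for r in residuals) / n

    # At lo every nonzero residual has rho = 1, so mean_rho(lo) > delta.
    lo = min(abs_nonzero) / (2.0 * c)
    hi = max(abs_nonzero) / c
    while mean_rho(hi) > delta:  # enlarge until the constraint in (2.8) holds
        hi *= 2.0
    for _ in range(iterations):  # mean_rho is nonincreasing in s
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


clean = [-1.2, 0.3, 0.8, -0.5, 1.1, -0.9, 0.4, -0.2, 0.7, 0.1]
few_outliers = [1e6] * 4 + clean[4:]   # 4 gross outliers, fewer than n * delta
many_outliers = [1e6] * 6 + clean[6:]  # 6 gross outliers, more than n * delta
for r in (clean, few_outliers, many_outliers):
    print(m_scale(r))
```

With four gross outliers the scale stays on the order of the clean residuals, whereas with six it is dragged to the order of the outliers, in line with the role of $\delta$ described above.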
It is easy to show that with Tukey’s bisquare $\rho$ function, but also many other popular $\rho$ functions, the constant $c$ is merely a scaling factor and does not affect the estimated regression parameters, i.e., $\mathcal{L}_{\mathrm{S}}(\mathbf{y}, \hat{\mathbf{y}}; c) = c^2 \mathcal{L}_{\mathrm{S}}(\mathbf{y}, \hat{\mathbf{y}}; 1)$. In practice, $c$ is usually chosen to yield a consistent estimate of the residual scale under the assumed model $G_0$, which amounts to $c = 1.5476$ in the case of Gaussian $G_0$ and a breakdown point of $\delta = 0.5$.

Computation of the S-estimator is challenging because of the non-convexity of the loss function induced by a non-convex $\rho$ function. Consistency and asymptotic Normality of the S-estimator (Rousseeuw and Yohai 1984; Davies 1990; Smucler 2019) only apply to the global minimum of the loss function $\mathcal{L}_{\mathrm{S}}$, which is in practice difficult to find. Optimization algorithms for non-convex problems only converge to a local minimum which depends on the given starting point. Mei et al. (2018) list several conditions on the $\rho$ function and $G_0$ under which the loss function has a unique local minimum in an $r$-ball around the true parameter, i.e., $\{\theta \in \mathbb{R}^{p+1} : \|\theta - \theta^0\|_2 < r\}$, with high probability if the sample size $n > C p \log(p)$. Additionally, this unique local minimum actually corresponds to a global minimum for which the statistical guarantees hold, and gradient descent algorithms converge to it if the starting point is within a $\frac{2r}{3}$-ball of the true parameter. It is therefore necessary to choose the starting points in a strategic way and/or try many different starting points. Although this increases the chance of finding a point within this neighborhood, it is no guarantee.

A key observation for finding good starting points (and for computing the S-estimator) is that the S-loss can be written as a weighted LS-loss,

$$\mathcal{L}_{\mathrm{S}}(\mathbf{y}, \hat{\mathbf{y}}) = \mathcal{L}_{\mathrm{S}}(\mathbf{y}, \mu + \mathbf{X}\beta) = \mathcal{L}_{\mathrm{LS}}\big(W_\theta \mathbf{y},\ W_\theta(1_n \mu + \mathbf{X}\beta)\big)$$

with a diagonal matrix $W_\theta \in \mathbb{R}^{n \times n}$ of weights that depend on where the loss is evaluated and the data itself. Therefore, the S-estimator also minimizes a weighted LS-loss, where suspicious observations are down-weighted.
Because the $\rho$ function is bounded, highly outlying observations can even get 0 weight, which means they are effectively removed from the equation as if they were not part of the sample.

Based on this observation, a strategy using random subsamples from the data is introduced in Rousseeuw and Leroy (1987) for the LMS- and Least Trimmed Squares (LTS) estimators and optimized for S-estimators in Salibián-Barrera and Yohai (2006) for samples with $n > (p + 1)/\delta$. It can be thought of as randomly generating the weight matrix $W_{\hat{\theta}_S}$ by assigning a weight of 1 to a random sample of $p + 1$ observations and a weight of 0 to the rest. In essence, the idea is based on the observation that there must be at least one subsample of size $p + 1$ which does not contain contaminated observations, and that the LS-estimator computed on this subsample is close to a global minimum of the S-estimator computed on the complete sample. The justification for this specific size of the subsample is that it must comprise at least $p + 1$ observations to ensure a unique solution for the LS-estimator. On the other hand, any subsample larger than $p + 1$ is more likely to include contaminated observations. To ensure a high probability of actually finding a subsample without contamination, many random subsamples need to be considered. As the size of the subsample grows with the dimensionality, the number of subsamples needs to increase exponentially with the number of predictors. This makes the strategy infeasible even when $p$ is only of moderate size.

A somewhat different strategy is given in Peña and Yohai (1999), who aim to identify possibly influential observations and subsequently compute the LS-estimator without these influential observations. 
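The exponential growth in the number of random subsamples can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that each observation in a subsample is contaminated independently with probability `eps` (a with-replacement approximation), so that a subsample of size $p + 1$ is clean with probability $(1 - \varepsilon)^{p+1}$. The function name and the chosen values are hypothetical.

```python
import math


def n_subsamples(p, eps, prob=0.99):
    """Number of random subsamples of size p + 1 needed so that, with
    probability `prob`, at least one contains no contaminated observation."""
    clean = (1.0 - eps) ** (p + 1)  # approx. P(a single subsample is clean)
    if clean >= 1.0:
        return 1
    if clean <= 0.0:
        return math.inf
    # Solve 1 - (1 - clean)^N >= prob for the smallest integer N.
    return math.ceil(math.log(1.0 - prob) / math.log1p(-clean))


# Required subsamples for 25% contamination and growing dimension.
counts = {p: n_subsamples(p, eps=0.25) for p in (5, 10, 20, 40)}
```

Under this approximation the required number of subsamples grows roughly like $(1 - \varepsilon)^{-(p+1)}$, which is the exponential dependence on the number of predictors noted in the text.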
The idea is again that the LS-estimator computed on a “clean” subsample is close to a global minimum of the S-estimator computed on the full sample. Because the strategy by Peña and Yohai uses a more guided scheme to find clean subsamples than random subsampling, the number of subsamples to explore is drastically reduced and only grows linearly with the number of predictors. Another advantage is that the strategy is deterministic and therefore always results in the same S-estimator.

It is important to note that the S-loss potentially has multiple global minima. In particular, the S-loss has a unique global minimum only if every $p$-dimensional subspace of $(\mathbf{y}, \mathbf{X})$ contains fewer than $\lfloor n(1 - \delta) \rfloor - 1$ observations (Rousseeuw and Yohai 1984; Yohai and Zamar 1988). In other words, if $\lfloor n(1 - \delta) \rfloor$ observations can be fit exactly with $\tilde{\theta}$, the S-loss has a global minimum at $\tilde{\theta}$. This is a direct consequence of the smaller effective sample size induced by the bounded $\rho$ function (up to $\lfloor n\delta \rfloor$ observations can have 0 weight). The S-estimator can therefore be sensibly computed only if $p < \lfloor n(1 - \delta) \rfloor - 1$, compared to $p < n - 1$ for the LS-estimator.

Regardless of the actual distribution $G_0$, the S-estimator is highly robust. This high robustness, however, comes at the price of low efficiency under the Normal model compared to the LS-estimator. Due to this deficiency, the S-estimator is in practice usually only the first step in the multi-tiered MM-estimator (Yohai 1987). The MM-estimator employs the M-loss function defined by
\[ L_M(\mathbf{y}, \hat{\mathbf{y}}; \hat{\sigma}_S) = \frac{1}{n} \sum_{i=1}^{n} \rho_M\left(\frac{y_i - \hat{y}_i}{\hat{\sigma}_S}\right), \]
which quantifies the size of the residuals through a $\rho_M$ function satisfying the same conditions as the $\rho$ function for the M-scale, in particular being bounded, and $\rho_M(t) \le \rho(t)$ for all $t$. 
The size of the residuals is taken relative to the scale of the residuals, such that the boundedness of the $\rho_M$ function affects only observations with residuals that are large relative to the scale of the residuals. This is where the MM-estimator relies on an S-estimate of regression. The scale of the residuals can be estimated from the residuals of the fitted S-estimate, $\hat{\sigma}_S = \hat{\sigma}_M(\mathbf{y} - \hat{\mu}_S - \mathbf{X}\hat{\beta}_S)$. The M-loss with bounded $\rho_M$ is non-convex, and computing MM-estimators in general therefore entails similar challenges as outlined for computing S-estimators. If $\rho_M$ and the $\rho$ function for the initial S-estimate of regression satisfy conditions [R1] and [R2], Yohai (1987) proves that the MM-estimator inherits the breakdown point of the initial scale estimator $\hat{\sigma}_S$, is consistent for $\theta^0$ under mild conditions on the error distribution, and has asymptotic efficiency governed by $\rho_M$. It is therefore possible to increase the efficiency of an MM-estimator without sacrificing robustness.

The bias inflicted by gross errors and the efficiency under $G_0$ depend on the shape of the $\rho_M$ function, but more importantly on the cutoff $c$ in condition [R2]. Intuitively, if $c$ is chosen very large, the loss is practically unbounded (and usually behaves like the LS-loss), and gross errors as well as leverage points can damage the estimate. On the other hand, if $c$ is chosen too small, the estimator is inefficient under $G_0$. Usually, the cutoff $c$ is therefore chosen to yield a certain asymptotic efficiency under $G_0$ while also limiting the maximum asymptotic bias under contamination. Yohai and Zamar (1997) propose an “optimal” $\rho_M$ function in the sense that it minimizes sensitivity towards contamination while simultaneously achieving a desired asymptotic efficiency.

The MM-estimator is particularly useful in practice because it yields a highly robust and efficient estimate without significantly increasing computational complexity. 
Even though the M-loss in the second step is non-convex, it is not necessary to find the global minimum of the objective function. Yohai (1987) shows that any local minimum of $L_M$ close to $\hat{\theta}_S$ has the same asymptotic properties as a global minimum. The practical challenge with MM-estimators, however, is that $\rho_M$ needs to be chosen in concordance with $\rho$ and $G_0$. The prescribed asymptotic efficiency is achieved by choosing the cutoff $c$ according to the limit of $\hat{\sigma}_S$ under $G_0$. For these results to be transferable to finite samples, the bias of the M-scale estimate of the residuals must not be too large.

MM-estimators are not the only strategy to compute M-estimators when the residual scale is unknown. Several other estimators augment the objective function to allow for joint estimation of the regression parameters and the residual scale. Options include the concomitant scale estimate (Huber and Ronchetti 2009) and constrained M-estimators (Mendes and Tyler 1996). Usually, these estimators are difficult to compute when using bounded $\rho$ functions because they require certain constraints on the scale to evade global minima at a residual scale of 0. The $\tau$-estimator (Yohai and Zamar 1988) uses a similar strategy through optimization of the $\tau$-loss
\[ L_\tau(\mathbf{y}, \hat{\mathbf{y}}) = \hat{\sigma}_M^2(\mathbf{y} - \hat{\mathbf{y}}) \, \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\left(\frac{y_i - \hat{y}_i}{\hat{\sigma}_M(\mathbf{y} - \hat{\mathbf{y}})}\right) \]
where $\rho_\tau$ is again a bounded $\rho$ function satisfying conditions [R1] to [R3] as well as $2\rho_\tau(r) - r\rho'_\tau(r) \ge 0$. The loss function is very similar to the concomitant scale estimate, but instead of jointly optimizing over the scale and the regression parameters, the scale is given by the M-scale of the residuals. Like the MM-estimator, the $\tau$-estimator can be tuned for high breakdown and asymptotic efficiency. Other robustness properties (e.g., the maximum bias) are also similar in practice. The main advantage of the MM-estimator over the $\tau$-estimator is that the MM-estimator is easier to compute. 
Although both the MM-estimator and the $\tau$-estimator can be tuned to have high efficiency, higher efficiency also leads to larger bias under contamination. To keep the bias under contamination reasonably small, a typical choice for the asymptotic efficiency of MM-estimators is 85% in the Normal model. Several one-step procedures to improve upon the asymptotic efficiency as well as the finite-sample efficiency of the MM-estimator are discussed in Maronna et al. (2019, Chapter 5.9).

Recently, attention has been directed at circumventing scale estimation for the M-estimator with Huber’s $\rho$ function, $\rho(t; c) = t^2/2$ for $|t| \le c$ and $\rho(t; c) = c(|t| - c/2)$ otherwise, by choosing the cutoff value $c$ adaptively. It should be noted that Huber’s $\rho$ function is robust towards observations with contamination in the response, but the convex loss function does not protect against influence from aberrant values in the predictors. Loh (2018) constructs a grid of candidate cutoff values $\{3\sigma_{\max} 2^{k}/2^{K} : k = 1, \dots, K\}$ and chooses the smallest cutoff value such that the difference between the estimate and the estimate at the next larger cutoff is below a certain threshold. Under certain conditions, this procedure leads to consistent parameter estimates and small bounds on the estimation error. To handle possible leverage points, Loh (2018) suggests a weighting function to down-weight observations with a large norm of the predictors. Under the assumption that the error distribution is heavy-tailed (but not contaminated), Sun et al. (2019) choose the cutoff value for the Huber loss as a multiple $k$ of a data-dependent quantity determined by $n$, $\log(n)$, and $\sum_{i=1}^{n}(y_i - \bar{y})^2$, and search for an appropriate multiplier $k$ via cross-validation. The influence of possible leverage points is reduced by univariate winsorizing, i.e., any predictor value larger than a predetermined threshold is replaced by this threshold value. 
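As a small illustration of the two ingredients just described, the sketch below evaluates Huber's $\rho$ function in its usual piecewise form (quadratic up to the cutoff, linear beyond it) and applies univariate winsorizing to a single predictor. Clipping symmetrically in absolute value is an assumption made here for illustration; the function names are hypothetical.

```python
def huber_rho(t, c):
    """Huber's rho: quadratic for |t| <= c, linear with slope c beyond."""
    a = abs(t)
    return 0.5 * t * t if a <= c else c * (a - 0.5 * c)


def winsorize(values, threshold):
    """Replace values beyond +/- threshold by the threshold itself."""
    return [max(-threshold, min(threshold, v)) for v in values]


# A large residual is penalized linearly instead of quadratically ...
small, large = huber_rho(0.5, c=1.0), huber_rho(3.0, c=1.0)
# ... and an extreme predictor value is pulled back to the threshold.
clipped = winsorize([-5, 0.3, 7], threshold=2)
```

Note that the winsorizing step treats each predictor separately, so an observation that is unusual only in the joint configuration of its predictors is left unchanged.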
Univariate winsorizing, however, does not take into account the multivariate structure of the data; leverage points are often not overly extreme in a single direction but are extreme when taking into account the overall structure of the predictors. The merit of these works is that they derive non-asymptotic bounds for the $L_1$ and $L_2$ estimation errors that hold with high probability under relatively mild conditions. Due to the handling of leverage points, however, neither of these adaptive procedures has a high breakdown point.

The finite-sample breakdown point of all robust estimators discussed so far has one key weakness: the breakdown point is lower the closer the number of predictors is to the number of observations. However, not only do the robustness properties suffer as the dimension increases, but the finite-sample and asymptotic efficiency also gets worse (e.g., Maronna and Yohai 2010). Albeit much less severely than for robust estimators, the LS-estimator also has higher variability in high-dimensional settings. In the following section, I discuss ways to simultaneously (i) reduce the variability of robust estimators by allowing for a larger finite-sample bias and (ii) make robust estimators applicable to settings where the sample size is less than the number of predictors.

2.3 Estimation Under the Sparsity Assumption

With the surge of data in the last decade, it is increasingly common that the number of potential predictors is in the hundreds or even tens of thousands. At the same time, the sample size is only slightly larger than or even smaller than the number of predictors. For example, proteomic technologies measure the expression of hundreds of proteins, but the number of patients in a study is often less than a hundred. In these cases, the estimators introduced in the previous two sections are not well defined. 
However, by imposing additional restrictions on the true parameter and translating these additional assumptions into constraints on the parameter estimates, the (uncountably) infinite set of global minima of the objective functions for estimators presented in the previous sections can possibly be reduced to a finite set. In many applications, for instance, it is reasonable to assume that only a few of the many available predictors are actually associated with the response, but it is not known which or exactly how many. In other applications, the number of predictors may not be extraordinarily large compared to the sample size, but the goals of the researcher include identifying the predictors that are actually associated with the response. In both of these scenarios, the assumption can be translated to the linear regression estimation problem by assuming that the set of truly relevant, or active, predictors, $\mathcal{A} = \{j : \beta^0_j \ne 0\}$, is much smaller than $p$. Usually, the size of the active set, $s = |\mathcal{A}|$, is not known. This assumption of sparsity, i.e., only $s \ll p$ predictors have a non-zero coefficient, is central to this section and the remainder of this work.

Before discussing ways to leverage the sparsity assumption to estimate the regression parameters in the linear regression model (2.2), I extend the list of desired properties when the sparsity assumption is imposed. Since it is assumed that $p - s$ predictors actually have a coefficient value of 0, it is natural to ask if the predictors with zero coefficient can be recovered with high probability, at least as the sample size increases. This leads to the notion of variable selection consistency. 
Whereas consistency of the estimator implies that the coefficient approaches its true value, variable selection consistency requires that the probability of all truly inactive coefficients being exactly zero approaches 1, i.e.,
\[ \lim_{n \to \infty} P\left(\hat{\beta}_{\mathcal{A}^c} = \mathbf{0}_{p-s}\right) = 1. \]
Here and henceforth, a vector $\xi$ indexed by a set $\mathcal{S}$ (e.g., $\hat{\beta}_{\mathcal{A}^c}$) denotes the vector of elements in $\xi$ with index in the set $\mathcal{S}$, i.e., $\xi_{\mathcal{S}} = (\xi_j)_{j \in \mathcal{S}} \in \mathbb{R}^{|\mathcal{S}|}$. If an estimator is variable selection consistent, one can further ask for the limiting distribution of the parameter estimates for the truly active predictors to be as good as if the true active set had been known in advance, i.e.,
\[ \sqrt{n}\left(\hat{\beta}_{\mathcal{A}} - \beta^0_{\mathcal{A}}\right) \xrightarrow{d} \mathcal{N}_s\left(\mathbf{0}_s, V(\beta^0_{\mathcal{A}})\right), \qquad (2.11) \]
where $p$, $s$, and $\mathcal{A}$ possibly grow with $n$. These two properties together are called the “oracle property” (Fan and Li 2001), as the estimator performs as well as an estimator that knows the true active set (an oracle). Although the oracle property is desired, it will turn out that it is not always easy to obtain an estimator that possesses it; usually this requires several strong conditions on the model and the sample.

The two properties discussed so far in this section are both asymptotic in nature. In the high-dimensional setting it can be desirable to not only consider what happens when the sample size $n$ increases, but also when the dimensionality $p_n$ is growing with the sample size. In the following, I distinguish between results under fixed dimensionality (i.e., $p_n = p$ remains the same for every sample size), and results under growing dimensionality (i.e., $p_n$ grows with $n$). Results for growing $p_n$ usually require a condition that $p_n$ does not grow too fast as $n$ tends to infinity, e.g., $\log(p)/n \to 0$ (Bühlmann and van de Geer 2011). 
Similarly, although not covered in this work, the size of the active set could be allowed to grow with the sample size.

An obvious question is how the LS-estimator performs under the sparsity assumption when the sample size is larger than the number of predictors, $n > p$, and, for example, $G_0$ is multivariate Normal. Although the LS-estimator is consistent for estimating the parameters, the estimated coefficients of the truly inactive predictors are non-zero with probability 1; only in the limit are they 0. Therefore the LS-estimator does not lead to any variable selection and hence does not possess the oracle property. The same is true for any of the robust estimators discussed before. It is therefore necessary to look for alternatives with positive probability of setting coefficients exactly to 0.

In an idealized world where the number of truly active predictors $s$ is known, a simple strategy for computing an estimator defined by a loss function $\mathcal{L}$ is to determine the subset of predictors of size $s$ which minimizes the loss, i.e.,
\[ \operatorname*{arg\,min}_{\mu \in \mathbb{R},\, \beta \,:\, \|\beta\|_0 = s} \mathcal{L}(\mathbf{y}, \mu + \mathbf{X}\beta). \qquad (2.12) \]
This is computationally challenging as the $L_0$ pseudo-norm $\|\cdot\|_0 : u \mapsto \sum_{j=1}^{p} |u_j|^0$ is non-convex and not continuous. A naïve way to find the best subset of size $s$ is to try every single set of $s$ active predictors, which is of course infeasible unless $p$ and $s$ are small. If $s$ is unknown, the problem becomes several times more difficult, as the minimization problem (2.12) needs to be solved for several (or all $p$) choices of $0 \le q = \|\beta\|_0 \le p$. Furthermore, the obtained solutions for the different choices of $q$ then must be compared using a validation metric to identify the overall best solution. The value of the loss function is an inappropriate metric for comparing the solutions, as by definition it decreases for increasing $q$. 
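The size of the search space for the naïve best-subset strategy is easy to quantify: there are $\binom{p}{s}$ candidate active sets of size $s$, and $2^p$ candidate sets in total when $s$ is unknown. The short sketch below, with hypothetical function names, evaluates these counts.

```python
import math


def n_models_of_size(p, s):
    """Number of candidate active sets containing exactly s of p predictors."""
    return math.comb(p, s)


def n_models_total(p):
    """Number of candidate active sets over all sizes q = 0, ..., p."""
    return sum(math.comb(p, q) for q in range(p + 1))


# Even modest problems lead to very large search spaces.
sets_20_choose_5 = n_models_of_size(20, 5)
sets_100_choose_10 = n_models_of_size(100, 10)
all_sets_20 = n_models_total(20)
```

Since every candidate set requires fitting the estimator once, exhaustive enumeration quickly becomes impractical, particularly when each fit is itself a non-convex optimization problem.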
Even with recent advances in mixed integer optimization in Bertsimas et al. (2016), which allow more efficient optimization of problem (2.12) over $\{\beta : \|\beta\|_0 \le q\}$, best subset selection remains feasible only for moderately sized problems. Greedy searches, on the other hand, can provide adequate approximations to the best subset regression. Examples include forward stepwise regression, where the search begins with the empty model, $q = 0$, and one predictor at a time is added such that the loss is minimized among all possible additions. Nevertheless, it is difficult to provide provable statistical guarantees for best subset regression or greedy approximations thereof. Furthermore, Hastie et al. (2017) demonstrate that best subset regression with the LS-loss (and the greedy approximation by forward stepwise regression) often leads to an estimator with small bias but large variance.

Continuous alternatives to the $L_0$ pseudo-norm are popular tools to improve computational efficiency and decrease the variability of the estimator, usually at the cost of increased bias. Theoretically, any “measure of the size of the coefficient vector”, $\Phi : \mathbb{R}^p \to [0, \infty)$, can be used to constrain the minimization problem
\[ \operatorname*{arg\,min}_{\mu \in \mathbb{R},\, \beta \,:\, \Phi(\beta) \le a} \mathcal{L}(\mathbf{y}, \mu + \mathbf{X}\beta) \]
to reduce the number of global minima to a finite set if $n > p$. However, a necessary and sufficient condition on $\Phi$ for the minimization problem to lead to sparse solutions is that it is nondifferentiable at $\beta_j = 0$, $j = 1, \dots, p$ (Fan and Li 2001). For convenience, I rephrase the constrained optimization problem in its dual form, which in the context of regression is often called regularized or penalized regression:
\[ \operatorname*{arg\,min}_{\mu \in \mathbb{R},\, \beta \in \mathbb{R}^p} \mathcal{L}(\mathbf{y}, \mu + \mathbf{X}\beta) + \lambda \Phi(\beta). \qquad (2.13) \]
The hyper-parameter $\lambda$ is inversely related to the constant $a$ in the constrained optimization problem. 
If $\lambda = 0$, this is the unregularized minimization problem and identical to (2.6), while $\lambda \to \infty$ necessarily leads to $\hat{\beta} = \mathbf{0}_p$ and thus the estimated active set is empty. Both the penalty function $\Phi$ and the hyper-parameter $\lambda$ are unrelated to the model; their choice cannot be inferred from the model itself but needs to be made based on external considerations.

Probably the most popular choice for the penalty function for sparse estimation in statistics and beyond is the $L_1$ norm, $\Phi_1(\beta) = \|\beta\|_1$. The $L_1$ norm is the convex envelope of the $L_0$ pseudo-norm over a small domain and as such yields the closest approximation to best subset regression by means of convex penalty functions (Jojic et al. 2011). When combined with the LS-loss, the $L_1$ penalty leads to the widely known least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), henceforth called LS-LASSO to emphasize the specific combination of the loss and penalty function. The LASSO penalty can be motivated from many different angles. Numerous results are available which present different conditions on the distribution $G_0$ and the sample under which the LS-LASSO is consistent, variable selection consistent, or possesses the oracle property with growing $p_n$. Typically, the conditions for the LS-LASSO to have these properties include that the amount of penalization, reflected in $\lambda$, vanishes as the sample size increases. While the rate depends on the other conditions imposed, it is usually required to be at most of order $O\left(\sqrt{\log p_n / n}\right)$. Vanishing regularization is required to remove the bias introduced by the regularization, at least asymptotically. 
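How the $L_1$ penalty sets coefficients exactly to zero, and how it introduces bias, is easiest to see in a textbook special case that is not specific to this work: for a single coefficient, minimizing $\frac{1}{2}(z - b)^2 + \lambda |b|$ over $b$ gives the soft-thresholding operator applied to the unpenalized solution $z$. The sketch below implements this operator; the function name is illustrative.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + lam * abs(b) over b, for lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Unpenalized solutions close to zero are set exactly to zero, while
# larger ones are shrunk towards zero by lam (the source of the bias).
shrunk = [soft_threshold(z, lam=1.0) for z in (3.0, 0.5, -3.0)]
```

In this special case, a small $\lambda$ suffices to zero out coefficients whose unpenalized estimates are already close to 0, while the shrinkage of the remaining coefficients vanishes as $\lambda \to 0$.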
From a practical perspective, this is a relatively mild requirement, as with enough data the LS-estimator will already estimate the coefficients for the truly inactive predictors close to 0 and only a slight nudge is required to make them exactly zero. For a comprehensive summary of conditions and the most important results see Bühlmann and van de Geer (2011).

The elastic net (EN) penalty, proposed by Zou and Hastie (2005), has similar variable selection properties to the LASSO, but is able to retain groups of highly correlated active predictors. The EN penalty is given by a linear combination of the $L_1$ and squared $L_2$ penalty,
\[ \Phi_{EN}(\beta; \alpha) = \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \quad \text{with } \alpha \in [0, 1]. \qquad (2.14) \]
The LASSO is a special case of the EN penalty with $\alpha = 1$, and as long as $\alpha > 0$, the EN penalty has singularities at the origin and therefore also leads to sparse estimates. The $L_2$ penalty is beneficial in the presence of highly correlated predictors, stabilizing variable selection (Zou and Hastie 2005).

The LS-LASSO and LS-EN estimators possess the oracle property only under very specific and impractical conditions, due to the bias introduced by the $L_1$ and the $L_2$ penalty. To solve this problem, a different penalty would need to be considered; one possibility is the family of folded-concave penalties introduced by Fan and Li (2001). Folded-concave penalties are singular at the origin (i.e., produce sparse results) and are bounded, i.e., predictors with coefficients larger than a certain threshold are all penalized equally, regardless of the actual size of the coefficient. The LS-loss combined with folded-concave penalties yields an estimator that possesses the oracle property under growing dimension, requiring less restrictive conditions than the LS-LASSO (Fan and Peng 2004; Zhang and Zhang 2012). Due to the boundedness of the folded-concave penalties, the objective function (2.13) is non-convex, even if combined with the LS-loss. 
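For reference, the EN penalty (2.14) introduced above is straightforward to evaluate. The following sketch, with an illustrative function name and made-up coefficient values, shows how the LASSO ($\alpha = 1$) and Ridge ($\alpha = 0$) penalties arise as the two extremes.

```python
def en_penalty(beta, alpha):
    """Elastic net penalty: alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared


beta = [1.0, -2.0, 0.0]                 # made-up coefficient vector
lasso_part = en_penalty(beta, alpha=1.0)  # pure L1 penalty
ridge_part = en_penalty(beta, alpha=0.0)  # half the squared L2 norm
mixed = en_penalty(beta, alpha=0.5)
```

For any $\alpha > 0$ the absolute-value term keeps the penalty non-differentiable at zero, which is what allows exact zeros in the estimate.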
This proves problematic because the oracle properties and other statistical guarantees are only valid for the global minimum. The local linear approximation (LLA) to folded-concave penalties in combination with the LS-loss is shown to yield an estimator that has the same properties as the “good” global minimum if $p < n$ (Zou and Li 2008) or if the smallest true coefficient value of the active predictors is large enough and $F_0$ is sub-Gaussian (Fan et al. 2014).

Fan et al. (2018) improve these results for convex loss functions by introducing a computational framework (I-LAMM) for computing general regularized estimators of the form (2.13), including folded-concave penalties combined with the LS-loss. They cast the estimation problem as an iterative algorithm which, after an infinite number of iterations, coincides with the good global minimum of the LS-loss combined with the folded-concave penalty. However, they also give a bound for the $L_2$ estimation error that depends on the number of iterations and the chosen numerical accuracy of the obtained solutions. From these bounds it can be seen that the $L_2$ estimation error approaches the oracle bound if the numerical accuracy is chosen small enough and the number of iterations increases.

Interestingly, the computational framework in Fan et al. (2018) also connects the folded-concave penalties with another important class of penalties: the adaptive LASSO and the adaptive EN. The adaptive EN penalty function (Zou 2006; Zou and Zhang 2009) penalizes the coefficients for each predictor differently depending on the corresponding element in a vector $\omega$ of strictly positive penalty loadings:
\[ \Phi_{AN}(\beta; \omega, \alpha, \zeta) = \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \sum_{j=1}^{p} \omega_j^{\zeta} |\beta_j| \quad \text{with } \zeta > 0. \qquad (2.15) \]
With the adaptive EN penalty, predictors with a large penalty loading $\omega_j$ are more heavily penalized than predictors with a small penalty loading. 
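The effect of the penalty loadings in (2.15) can be seen in a small numerical sketch. The loadings here are taken as reciprocals of the absolute values of a preliminary estimate, which is one common choice; the helper names and the numbers are hypothetical, and the preliminary estimate is assumed to have no exact zeros so that all loadings are finite.

```python
def adaptive_en_penalty(beta, loadings, alpha, zeta=1.0):
    """Adaptive EN penalty:
    (1 - alpha)/2 * ||beta||_2^2 + alpha * sum_j loadings_j**zeta * |beta_j|."""
    l2_squared = sum(b * b for b in beta)
    weighted_l1 = sum((w ** zeta) * abs(b) for w, b in zip(loadings, beta))
    return 0.5 * (1.0 - alpha) * l2_squared + alpha * weighted_l1


def reciprocal_loadings(preliminary):
    """Loadings 1 / |b_j| from a preliminary estimate without exact zeros."""
    return [1.0 / abs(b) for b in preliminary]


# A predictor with a small preliminary coefficient gets a large loading.
loadings = reciprocal_loadings([2.0, 0.1])
penalty = adaptive_en_penalty([1.0, 1.0], loadings, alpha=1.0)
```

With these numbers the second coefficient contributes twenty times as much to the penalty as the first, although both coefficients have the same size.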
The penalty loadings are commonly set to the reciprocal values of a preliminary estimate of the regression parameter. Intuitively, if the preliminary estimate is consistent for $\beta^0$, penalization for truly inactive predictors tends to infinity for increasing sample size. Therefore, if the hyper-parameter $\lambda$ scales appropriately with $n$, the bias introduced by the penalty becomes negligible and the oracle property can be obtained.

With a slight modification of the penalty loadings, the adaptive LS-LASSO estimator (i.e., $\alpha = 1$) can be obtained after two I-LAMM iterations (Fan et al. 2018). For this equivalence to hold, the weights $\omega_j$ must be truncated by $\max(\tau, \omega_j)$ using a reasonably large but finite $\tau$. From this it is easy to obtain bounds for the estimation error of the (modified) adaptive LS-LASSO.

Unsurprisingly, regularized estimators utilizing the LS-loss suffer from the same issues as the LS-estimator under contamination, albeit often less obviously. Recalling that the breakdown point of regression estimators involves the estimated coefficients exploding to infinity, it seems comforting to know that regularized estimators are by definition bounded away from the boundary of the parameter space. In the dual formulation of the regularized loss (2.13), however, the parameter estimate can still diverge to infinity for any fixed $\lambda < \infty$, as shown in Alfons et al. (2013). Furthermore, the intercept parameter $\mu$ is not regularized and can thus also explode under contamination. Even if the model does not include an intercept, the regularization parameter $\lambda$ poses problems; although $\lambda$ is not a model parameter and as such is not estimated, the selection of a good $\lambda$ value is affected by contamination. As shown in Cohen Freue et al. 
(2019), constraining the slope estimate $\hat{\beta}$ to the interior of the parameter space would require $\lambda$ to grow indefinitely.

As highlighted by Davies and Gather (2005), the notion of breakdown point is not a sensible measure of robustness for non-equivariant estimators; regularized estimators are by definition not (regression) equivariant. Nevertheless, the breakdown point can still give valuable insights about the robustness properties of an estimator. The maximum MSE under contamination, on the other hand, can be a useful metric for comparing regularized estimators; this is especially true in the presence of leverage points. In increasingly high dimensions it is more important to have estimators that are insensitive to leverage points. Although Huber and Ronchetti (2009) suggest identifying possible leverage points in advance and analyzing them separately, in high-dimensional problems this approach is impractical because leverage points are very difficult to identify. Even if it is possible to identify leverage points, under the sparsity assumption it is not sensible to set aside observations with potential leverage coming from the truly inactive predictors. However, because it is unknown which predictors are truly active and inactive, it is impossible to “screen out” leverage points for separate analysis prior to computing an estimate.

Just as without the sparsity assumption, it is therefore necessary to devise methods which can achieve low MSE but additionally identify important predictors even under arbitrary contamination. Because the maximum MSE under contamination is usually impossible to derive theoretically, it is also still desirable to achieve a high breakdown point, even though it is not the best measure of robustness for regularized estimators.

2.4 Robust Regularized Estimation

The main culprit in the erratic behavior of regularized estimators under contamination is still the LS-loss. Drawing from the insights gained in unregularized estimation, it therefore seems sensible to replace the LS-loss with a robust surrogate.

Due to its importance for quantile regression, the LAD-LASSO (Wang and Li 2007) is among the first regularized regression estimators with robustness towards gross errors. Numerous papers study the behavior of M-loss functions with convex $\rho$ functions (e.g., Huber’s $\rho$) combined with the $L_1$ penalty under different settings. Many of the properties of the LS-LASSO also hold for convex M-estimators under similar conditions (van de Geer and Müller 2012). Recently, several strategies to reduce the bias introduced by the convex M-loss as well as to avoid residual scale estimation have been proposed (Loh 2018; Fan et al. 2016; Fan et al. 2017; Fan et al. 2018; Sun et al. 2019; Yang 2017).

Robust regularized estimation is not the only strategy for robust estimation in the sparse linear regression model. Khan et al. (2007), for example, propose the Robust Least Angle Regression (RLARS) estimator to compute robust regression estimates in a step-wise manner. Following ideas of the LARS estimator (Efron et al. 2004), the steps are taken in the direction of the predictor with highest correlation with the residuals from the previous step. RLARS gains robustness towards arbitrary contamination by using robust measures of location, scale, and correlation for selecting and taking the steps. Empirical results suggest RLARS is reliable under gross contamination, but the finite-sample bias is often higher than that of other robust methods, and the algorithmic definition of RLARS hinders the establishment of theoretical guarantees.

Given the increased difficulties caused by leverage points, the bounded M-loss is an indispensable tool in higher dimensions. However, considerably less attention has been given to LASSO-type M-estimators with non-convex or bounded $\rho$ functions as well as S-estimators. 
Smucler and Yohai (2017) prove that the MM-LASSO, the estimator that minimizes a redescending M-loss combined with the LASSO penalty, is $\sqrt{n}$-consistent for $\theta^0$ when the dimension is fixed, under otherwise very mild conditions. Importantly, there are no moment conditions on $F_0$; the errors only need to have a density that is symmetric around 0, monotonically decreasing in $|u|$, and strictly decreasing in a neighborhood of 0. An additional condition is that the second moment of $H_0$ must be finite, and the covariance matrix of the predictors needs to be non-singular. Therefore, the MM-LASSO is consistent even under very heavy-tailed distributions $F_0$, such as the Cauchy distribution. Similar results, albeit under more restrictive assumptions, specifically a finite second moment of the error distribution $F_0$, are obtained in Arslan (2016) and Chang et al. (2018).

Loh (2017) studies the finite-sample bounds of the $L_1$ and $L_2$ estimation errors of redescending regularized M-estimators, including those with a LASSO penalty. She shows that any minimum of the objective function (not only a global minimum) which lies in an $r$-ball around the true parameter fulfills the oracle inequality for the estimation error. As can be expected of finite-sample results for complicated non-convex estimation problems, there are several technical conditions for this result to hold:

1. The regularized objective (2.13) is restricted to an $R$-box around the origin, $\{\beta : \|\beta\|_1 < R\}$, which needs to contain the true parameter, i.e., $\|\beta^0\|_1 < R$, and as such requires a rough idea of the size of the true parameter; the larger $R$, the weaker the bound on the estimation error.
2. The sample size needs to be large enough to guarantee that there is at least one minimum in an $r$-ball around the true parameter with high probability; the smaller $r$, the larger the sample size that is needed. 
With an even larger sample size, every minimum in the $R$-box around the origin falls within the $r$-ball around the true parameter with high probability.
3. The gradient of the M-loss evaluated at the true parameter needs to be bounded with high probability.
4. Most importantly, the M-loss needs to satisfy the restricted strong convexity (RSC) condition in an $r$-ball around the true parameter with high probability. This condition essentially bounds the “non-convexity” of the loss function $\mathcal{L}$ around the true parameter; the more non-convex the loss, the larger the bound on the estimation error.

Establishing these conditions is difficult in theory for a given $G_0$ and almost impossible in practice. To overcome these difficulties, Loh (2017) states different sufficient conditions on $G_0$ under which the above conditions hold with high probability. For example, the gradient is bounded with high probability if the distribution of the predictors, $H_0$, is sub-Gaussian, i.e., has tails no heavier than those of a multivariate Normal distribution. Furthermore, under sub-Gaussian predictors and a specific tail behavior of the errors $F_0$, the RSC condition also holds with high probability.

We recently proposed the first S-estimator with an elastic net penalty (Cohen Freue et al. 2019), called the Penalized Elastic Net S-Estimator (PENSE), which shares many of the properties of the MM-LASSO without the need for an auxiliary scale estimate. Chapter 3 gives a detailed exposition of the EN penalty, its advantages over the LASSO, and the theoretical properties and empirical results concerning PENSE. Importantly, PENSE has very good robustness properties and is root-n consistent for the true regression parameter under fixed dimension.

The only other regularized S-estimator proposed so far is the S-Ridge (Maronna 2011). The S-Ridge combines the S-loss with the Ridge penalty, i.e., the squared $L_2$ norm of the coefficients (also a special case of PENSE with $\alpha = 0$). 
The Ridge penalty does not induce sparsity, i.e., none of the estimated coefficients will be 0, but it helps in high-dimensional problems to reduce the variability of the estimate at the cost of increased bias. Smucler and Yohai (2017) prove that the S-Ridge is a consistent estimator for the true regression parameter and the residual scale. This allows them to use the S-Ridge estimator to obtain an auxiliary estimate of the residual scale for their MM-LASSO estimator. Despite the different penalties involved, the authors also use the S-Ridge estimate as the starting point for the optimization of the non-convex MM-LASSO objective function. Although there is no guarantee that this will yield a sensible estimate, the empirical performance of the MM-LASSO is very competitive.

There also exist results for M-estimators with different penalty functions. The theory in Loh (2017) for M-estimators covers folded-concave penalties, and the results establish the oracle property for a broad class of loss functions, given that the above-mentioned conditions (and a stronger RSC condition) hold with high probability. Fan et al. (2018) establish error bounds that hold with high probability for estimates computed by the I-LAMM procedure for the LS-loss and a sub-Gaussian $G_0$, but their theory also allows for different convex loss functions. It remains open, however, whether the conditions for their results can be shown to hold with high probability when using non-convex, redescending M-estimators under heavy-tailed errors and contamination in the predictors.

In Cohen Freue et al. (2019) we propose a refinement step to PENSE, called PENSEM. The idea of PENSEM is similar to MM-estimators for low-dimensional regression, improving efficiency by a subsequent M-step which relies on a scale estimate obtained from the residuals of the fitted PENSE estimate.
This refinement works well in many problems but, as detailed in Chapter 5, the residual scale estimate from PENSE and other robust estimators can be very biased in high-dimensional problems. In finite samples, this bias may impede gains in efficiency and make the M-step susceptible to contamination.

The adaptive MM-LASSO (Smucler and Yohai 2017; Chang et al. 2018) combines a bounded M-loss with the adaptive LASSO penalty, and Smucler and Yohai (2017) show that this estimator possesses the oracle property under the same mild conditions as for root-n consistency of the MM-LASSO. To avoid the necessity of an initial scale estimate, I introduce the adaptive PENSE in Chapter 3. The adaptive PENSE combines the S-loss with the adaptive EN penalty and uses PENSE as preliminary estimate. I show that the adaptive PENSE also possesses the oracle property under the same conditions as needed for PENSE to be root-n consistent.

Chapter 3

Elastic Net S-Estimators

This chapter introduces a novel estimator for the linear regression model under the sparsity assumption which can tolerate the presence of a large proportion of adverse contamination. The challenge of obtaining a robust estimate of the residual scale under the sparsity assumption, especially in high-dimensional problems, hampers the application of regularized M-estimators. In Cohen Freue et al. (2019), we therefore propose the penalized elastic net S-estimator (PENSE), which combines the robust S-loss function with an elastic net penalty.
PENSE circumvents the need for an auxiliary scale estimate.

3.1 Method

The PENSE estimator is defined by a regularized objective function which combines the classical S-loss (2.9) and the EN penalty (2.14):

$$\mathcal{O}_S(\mu, \beta; \lambda, \alpha) = \mathcal{L}_S(\mathbf{y}, \mu + \mathbf{X}\beta) + \lambda \Phi_{EN}(\beta; \alpha). \tag{3.1}$$

Minimizers of this objective function are denoted by $\tilde{\theta}_{(\lambda,\alpha)} = \arg\min_{\mu,\beta} \mathcal{O}_S(\mu, \beta; \lambda, \alpha)$, where the arguments $\lambda$ or $\alpha$ are omitted if irrelevant or obvious from the context.

Due to the non-convexity of the S-loss, the PENSE objective function is also non-convex. Without non-convexity, PENSE would not possess its robustness properties as detailed in Section 3.4, but non-convexity is also the source of computational challenges. The issue is not unique to PENSE but is shared among all S- and redescending M-estimators, with and without regularization. The robustness properties and statistical guarantees for the non-regularized S-estimator only pertain to a global minimum (Davies 1990; Smucler 2019) of the S-loss.

The asymptotic statistical properties of PENSE detailed in Section 3.3 also only pertain to the global minimum and are contingent on $\lambda$ decreasing fast enough. In other words, $\lambda$ cannot be too large for the global minimum to have good statistical properties. This is in line with conditions for asymptotic properties of LS-EN and LS-LASSO estimators, although their objective functions are convex and a large regularization parameter merely introduces too much bias to attain a minimum with provable properties. For PENSE, on the other hand, too large values of $\lambda$ not only introduce bias but also undermine the estimator's robustness. For large $\lambda$, local minima of the objective function that are close to the origin are more likely to also be global minima. Hence, if $\lambda$ is too large, global minima could very well be artifacts of contamination and not sensible estimates. One such instance is depicted in Figure 3.1 for a simple regression model without intercept and with a single predictor.
The true regression coefficient is $\beta_0 = 1$, but for $\lambda = 1$ the objective function exhibits a global minimum around $\beta = -0.5$ due to contamination in the sample. Only as $\lambda$ gets smaller does the “good” minimum around $\beta = 1$ become a global minimum. A sensible PENSE estimate can therefore be attained only if an appropriate strategy to select the regularization parameter $\lambda$ is used.

Although only global minima have provable statistical properties, for larger $\lambda$ values, local minima not caused by contamination can still be useful to predict the expected value of the response, given a set of predictor values. Prediction, alongside identifying the predictors important for making good predictions, is a main goal in many applications of regularized estimators. Therefore, it is important to check not only the global minima for their predictive capability, but also other local minima, even though they might not possess the same statistical properties as the global minima.

As noted in Chapter 2, the S-loss and therefore the PENSE objective function can be rewritten as a weighted LS-EN objective function

$$\mathcal{O}_S(\mu, \beta; \lambda, \alpha) = \frac{1}{2n} \sum_{i=1}^{n} w_i^2(\mathbf{r})\, r_i^2 + \lambda \Phi_{EN}(\beta; \alpha) =: \mathcal{O}_{EN}(\mu, \beta; \mathbf{w}(\mathbf{r}), \lambda, \alpha) \tag{3.2}$$

with residuals $r_i = y_i - \mu - \mathbf{x}_i^\intercal \beta$ and weights

$$w_i(\mathbf{r}) = \hat{\sigma}_M(\mathbf{r}) \sqrt{\frac{\rho'\big(r_i / \hat{\sigma}_M(\mathbf{r})\big) / r_i}{\frac{1}{n} \sum_{k=1}^{n} \rho'\big(r_k / \hat{\sigma}_M(\mathbf{r})\big)\, r_k}}. \tag{3.3}$$

This representation of the PENSE objective function allows for an intuitive interpretation of the estimator. The PENSE estimate corresponds to a properly weighted LS-EN estimate, where the weights are chosen to down-weight the contaminated observations and to give more weight to proper observations.

Figure 3.1: PENSE objective function (3.1) for a simple linear regression model of the form $y = x + u$, evaluated at different values of $\beta$ (horizontal axis) and $\lambda$ (curves for $\lambda = 1.5$ and $\lambda = 0.05$) on a data set with contamination. The marked dots depict the locations of the global minima for different $\lambda$.

The challenges in computing and applying the PENSE estimate are (i) to find global minima of the objective function and (ii) to choose a regularization parameter $\lambda$ such that the global minima enjoy good statistical properties. Additionally, it is advisable to retain other local minima and determine their predictive abilities. Numerical algorithms to find stationary points of (3.1) require a starting point as input and typically converge to a stationary point which depends on this starting point. To find global minima of the objective function, it is therefore necessary to have starting points that are close to global minima. Local minima are caused by contamination and unusually large error terms, and hence a sensible strategy to find starting points is to compute an LS-EN estimate on a subset of the data which does not contain observations exerting high leverage on the estimate.

This direct relationship between the data and the presence/location of local minima is a clear advantage over non-convexity caused by folded-concave penalties. With folded-concave penalties, local minima are due to the underlying regression parameters and no intuitive strategy is available which is known to give starting points close to the desired optimum. Under a restricted eigenvalue condition, sub-Gaussian errors and large enough true coefficient values, Fan et al. (2014) show that with high probability the LS-LASSO is a starting point for their algorithm which leads to the desired local minimum. Although the intuition behind good starting points for optimizing the PENSE objective function is simpler and holds without any restrictive conditions, it is nevertheless challenging to determine good subsets of the data.

3.2 Initial Estimator

This section discusses different strategies to obtain starting values for locating minima of the PENSE objective function. As outlined above, the landscape of the objective function is scattered with local minima and the goal is to find local and global minima that are not caused by contamination.

3.2.1 Random Subsampling

The most common strategy to determine initial estimates for unregularized S-estimators, as proposed in Rousseeuw and Yohai (1984) and Salibián-Barrera and Yohai (2006), is to randomly select subsets of the available observations and compute the classical LS-estimate using only the random subset. The motivation behind the strategy is to get a crude approximation to the weights (3.3) at a global minimum. In the unregularized case, to guarantee that the resulting S-estimator has a breakdown point of $\epsilon$ with probability at least $\nu$, the lower bound for the number of subsets $N$ is given by

$$N \geq \frac{\log(1 - \nu)}{\log\big(1 - (1 - \epsilon)^{p+1}\big)}$$

and thus grows exponentially with $p$ (Salibián-Barrera and Yohai 2006). With $N$ subsets, the probability that at least one of the subsamples of size $p + 1$ is “clean”, i.e., does not contain any contaminated observations, is $\nu$. However, even an initial estimator computed on such a “clean” subsample does not necessarily lead to a global optimum of the S-estimator. Hence, it is in general not enough to examine a single clean subset, increasing the required number of subsamples even further.
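As a rough numerical illustration of the bound above, the following Python sketch computes the smallest number of random subsamples for which at least one subsample is free of contamination with a given probability. The function name `min_subsamples` and its arguments are hypothetical and not taken from any published software; the sketch assumes that every observation in a subsample is contaminated independently with probability equal to the contamination proportion.

```python
import math


def min_subsamples(eps, size, prob):
    """Smallest number of random subsamples such that, with probability at
    least `prob`, at least one subsample contains no contaminated observation.

    eps  : assumed proportion of contaminated observations, 0 <= eps < 1
    size : number of observations per subsample (p + 1 in the unregularized case)
    prob : required probability of drawing at least one clean subsample
    """
    if not (0.0 <= eps < 1.0 and 0.0 < prob < 1.0 and size >= 1):
        raise ValueError("need 0 <= eps < 1, 0 < prob < 1 and size >= 1")
    # Probability that a single subsample is clean (independence assumption).
    p_clean = (1.0 - eps) ** size
    if p_clean >= 1.0:
        return 1  # no contamination: a single subsample suffices
    if p_clean == 0.0:
        raise OverflowError("probability of a clean subsample underflows to 0")
    # Solve 1 - (1 - p_clean)^N >= prob for the smallest integer N.
    return max(1, math.ceil(math.log1p(-prob) / math.log1p(-p_clean)))
```

For example, `min_subsamples(0.25, 17, 0.99)` corresponds to 25% contamination and subsamples of size 17; because the probability of a clean subsample shrinks geometrically with the subsample size, the returned count grows exponentially in that size.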
While Salibián-Barrera and Yohai (2006) propose several computational shortcuts to make random subsampling feasible for the unregularized S-estimator with up to a moderate number of predictors, in higher-dimensional settings the computational burden of finding a global minimum with high probability is insurmountable.

Random subsampling is similarly used for robust regularized estimation, where the penalty term could potentially reduce computational challenges, even in high dimensions. Due to regularization, the size of the subset can be much smaller than the number of predictors. Alfons et al. (2013), for example, use random subsets of size 3 to obtain initial estimates for their SparseLTS estimator. By decoupling the size of the subset from the number of predictors, the number of subsets required to get at least one clean subsample with high probability only increases exponentially with the chosen size of the subset. Although this implies that only a few subsets are required, subsamples of very small size (e.g., 3 as for SparseLTS) correspond to an approximation of the weights (3.3) at a global minimum of the PENSE objective function by a vector with only 3 non-zero entries, which is likely inaccurate considering that the vector of weights at a global minimum has at least $\lfloor (1 - \delta) n \rfloor$ non-zero entries. Therefore, to maintain a high likelihood of locating a global minimum (or a good local minimum) it is still necessary to consider a very large number of random subsamples; either because clean subsamples of small size are likely not a good initial estimate, or because a large subsample likely contains contamination.
This is a major obstacle for using random subsampling to initialize PENSE or other robust regularized estimators.

3.2.2 Elastic Net Peña-Yohai Procedure

The problem with random subsampling, i.e., the large number of subsets required to increase the chance of finding a good local optimum, stems from the fact that the subsets are chosen without considering the data itself. The following strategy, proposed by Peña and Yohai (1999) as an outlier detection method and standalone estimator for linear regression, on the other hand, aims at identifying and omitting contaminated observations. The Peña-Yohai (PY) procedure builds several subsets of the data, each of which omits observations with possibly large influence on the LS-estimate, computes the LS-estimate for each of these subsets, and finally chooses the estimate whose residuals have the smallest M-scale. The PY procedure mainly screens out observations with high leverage, while retaining observations with small leverage but large residuals. To remove the influence of these observations as well, the PY procedure is iterated several times by removing the observations with large residuals in the fit with the smallest M-scale of the residuals. Although Peña and Yohai (1999) propose their procedure for the unpenalized S-estimator, Maronna (2011) successfully adapts the PY procedure to find initial estimates for the S-Ridge estimator. In Cohen Freue et al. (2019), we adapt the PY procedure for general, non-linear, regularized estimators such as PENSE, by employing regularized LS-estimators throughout the procedure. The PY procedure adapted for regularized estimation using the EN penalty (EN-PY) is outlined in Algorithm 1 for fixed penalty parameters $\lambda$ and $\alpha$.

Algorithm 1 EN-PY Procedure
Input: Fixed penalty parameters $\lambda$ and $\alpha$; the proportion of observations in each clean subset, $\kappa$, with $0 < \kappa < 1$; a cutoff value $C > 0$ for “large” residuals; and the maximum number of PY iterations $I$.
 1: Initialize the set of indices with the full data set, $\mathcal{I}^{(0)} = \{1, \dots, n\}$.
 2: Set $\iota = 0$.
 3: repeat
 4:   Compute the LS-EN estimate for fixed $\lambda$, $\alpha$ with all observations in the current index set $\mathcal{I}^{(\iota)}$, $\tilde{\theta}^{(0)}$.
 5:   Obtain a set of possibly clean subsets of $\mathcal{I}^{(\iota)}$, $\{\mathcal{S}_1, \dots, \mathcal{S}_K\}$, each of size $\lfloor \kappa |\mathcal{I}^{(\iota)}| \rfloor$ and $\mathcal{S}_k \subset \mathcal{I}^{(\iota)}$.
 6:   for $k = 1, \dots, K$ do
 7:     Compute the LS-EN estimate for fixed $\lambda$, $\alpha$ on the subset $\mathcal{S}_k$, $\tilde{\theta}^{(k)}$.
 8:   end for
 9:   Choose the LS-EN estimate that results in the smallest M-scale of all $n$ residuals, $\hat{\theta}^{(\iota)} = \tilde{\theta}^{(k')}$ with $k' = \arg\min_{k = 0, \dots, K} \hat{\sigma}_M\big(\mathbf{y} - \tilde{\mu}^{(k)} - \mathbf{X}\tilde{\beta}^{(k)}\big)$.
10:   Update the index set to include only observations with small standardized residuals, $\mathcal{I}^{(\iota+1)} = \big\{ i = 1, \dots, n : \big| y_i - \hat{\mu}^{(\iota)} - \mathbf{x}_i^\intercal \hat{\beta}^{(\iota)} \big| < C\, \hat{\sigma}_M\big(\mathbf{y} - \hat{\mu}^{(\iota)} - \mathbf{X}\hat{\beta}^{(\iota)}\big) \big\}$.
11:   Increment $\iota$, $\iota = \iota + 1$.
12: until $\iota = I$ or the index set did not change, $\mathcal{I}^{(\iota)} = \mathcal{I}^{(\iota - 1)}$
13: return all $K + 1$ estimates $\{\tilde{\theta}^{(k)} : k = 0, \dots, K\}$ from the last EN-PY iteration, $\iota - 1$.

The central piece in the EN-PY procedure is the set of possibly clean subsets, line 5 in Algorithm 1. Peña and Yohai (1999) derive this set using the principal sensitivity components (PSCs), a set of directions in which points of high leverage should appear as large values. For EN-PY, the principal sensitivity components are obtained from the $n \times n$ matrix of leave-one-out (LOO) residuals, $\mathbf{R}$; the $k$-th column of $\mathbf{R}$ is the vector of differences between the observed $\mathbf{y}$ and the values fitted by an LS-EN estimate computed from all but the $k$-th observation (line 2 in Algorithm 2). The PSCs are defined as the projections of the matrix $\mathbf{R}$ on the eigenvectors of $\mathbf{R}^\intercal \mathbf{R}$ (line 5 in Algorithm 2). It can be shown (Peña and Yohai 1999) that observations with very high leverage have an extreme value (positive or negative) in at least one PSC.
From each PSC, three subsets of size $m$ are obtained from: (a) the $m$ observations with smallest values in this direction (i.e., filtering out extremely positive values), (b) the $m$ observations with largest values (i.e., filtering out extremely negative values), and (c) the $m$ observations with smallest absolute values (i.e., filtering out extremely positive or negative values). The detailed procedure to derive subsets from the PSCs for PENSE is given in Algorithm 2.

Algorithm 2 Subsets derived from the Principal Sensitivity Components
Input: Fixed penalty parameters $\lambda$ and $\alpha$, an index set $\mathcal{I}$ of cardinality $\tilde{n}$, and the desired proportion of indices in each subset, $\kappa < 1$.
 1: Define the desired size of the subsets as $m = \lfloor \kappa \tilde{n} \rfloor$.
 2: Compute the $\tilde{n} \times \tilde{n}$ sensitivity matrix $\mathbf{R}$. The entries of $\mathbf{R}$ are given by
      $R_{i,k} = y_i - \hat{\mu}^{(-k)} - \mathbf{x}_i^\intercal \hat{\beta}^{(-k)}, \quad i, k = 1, \dots, \tilde{n},$
    where $\hat{\theta}^{(-k)}$ is the LS-EN estimate computed for fixed $\lambda$, $\alpha$ from the observations in the index set $\mathcal{I}$ with the $k$-th entry omitted, i.e., the leave-one-out LS-EN estimate.
 3: Determine $Q$, the number of non-zero eigenvalues of the matrix $\mathbf{R}^\intercal \mathbf{R}$.
 4: for $q = 1, \dots, Q$ do
 5:   Compute the $q$-th PSC, $\mathbf{z}^{(q)} = \mathbf{R}\mathbf{v}^{(q)}$, where $\mathbf{v}^{(q)}$ is the $q$-th eigenvector of $\mathbf{R}^\intercal \mathbf{R}$.
 6:   Define the subset with the $m$ observations with smallest values in $\mathbf{z}^{(q)}$, i.e.,
        $\mathcal{S}_q = \{ i = 1, \dots, \tilde{n} : z_i^{(q)} < Z_s \}$ with $Z_s = \inf\big\{ Z : m \leq \sum_{i=1}^{\tilde{n}} \mathbb{1}\{ z_i^{(q)} < Z \} \big\}$.
 7:   Define the subset with the $m$ observations with largest values in $\mathbf{z}^{(q)}$, i.e.,
        $\mathcal{S}_{Q+q} = \{ i = 1, \dots, \tilde{n} : z_i^{(q)} > Z_l \}$ with $Z_l = \sup\big\{ Z : m \leq \sum_{i=1}^{\tilde{n}} \mathbb{1}\{ z_i^{(q)} > Z \} \big\}$.
 8:   Define the subset with the $m$ observations with smallest absolute values in $\mathbf{z}^{(q)}$, i.e.,
        $\mathcal{S}_{2Q+q} = \{ i = 1, \dots, \tilde{n} : |z_i^{(q)}| < Z_a \}$ with $Z_a = \inf\big\{ Z : m \leq \sum_{i=1}^{\tilde{n}} \mathbb{1}\{ |z_i^{(q)}| < Z \} \big\}$.
 9: end for
10: return the set $\{\mathcal{S}_1, \dots, \mathcal{S}_{3Q}\}$.

The Peña-Yohai procedure for regularized estimators as detailed in Algorithms 1 and 2 generates a total of $3Q + 1$ initial estimates for computing the PENSE estimate.
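The subset construction from the principal sensitivity components can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the implementation used for PENSE: the function name `psc_subsets` and the tolerance `tol` are hypothetical, the sensitivity matrix of leave-one-out residuals is assumed to be computed elsewhere and passed in, and each subset contains exactly the requested number of observations, with ties broken by observation order.

```python
import numpy as np


def psc_subsets(R, kappa, tol=1e-10):
    """Index subsets derived from the principal sensitivity components.

    R     : (n, n) sensitivity matrix of leave-one-out residuals (assumed given)
    kappa : proportion of observations kept in each subset, 0 < kappa < 1
    tol   : hypothetical relative cutoff below which an eigenvalue counts as zero
    """
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    m = int(np.floor(kappa * n))
    if R.shape != (n, n) or not 1 <= m < n:
        raise ValueError("R must be square and 1 <= floor(kappa * n) < n")

    # R^T R is symmetric, so eigh applies; eigenvalues are in ascending order.
    eigval, eigvec = np.linalg.eigh(R.T @ R)
    keep = eigval > tol * max(eigval[-1], 0.0)

    smallest, largest, smallest_abs = [], [], []
    for v in eigvec[:, keep].T:
        z = R @ v  # one principal sensitivity component
        order = np.argsort(z, kind="stable")
        smallest.append(np.sort(order[:m]))       # drops extremely positive values
        largest.append(np.sort(order[n - m:]))    # drops extremely negative values
        by_abs = np.argsort(np.abs(z), kind="stable")
        smallest_abs.append(np.sort(by_abs[:m]))  # drops both extremes
    # All "smallest" subsets first, then "largest", then "smallest absolute".
    return smallest + largest + smallest_abs
```

The sign of an eigenvector is arbitrary, so the "smallest" and "largest" subsets of a component may swap roles between runs or platforms; since both are always returned, the collection of candidate subsets is unaffected.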
The major benefit of EN-PY over random subsampling is that $Q \leq \max(p, n)$, and hence the number of initial estimates from the EN-PY procedure only grows linearly with the number of observations and the number of predictors, as opposed to the exponential growth required for random subsampling. Therefore, by choosing the subsets in a more guided fashion, the computation can be greatly reduced compared to naïve subsampling.

For the case of unpenalized regression, Peña and Yohai (1999) present several mathematical shortcuts to efficiently derive the PSCs. These shortcuts are based on the closed-form solution for LOO residuals in the case of linear estimators and thus cover the ordinary LS-estimator as well as the LS-Ridge estimator. Unfortunately, there is no counterpart of these closed-form solutions for regularized estimators with a non-smooth penalty function. Therefore, the bottleneck of the EN-PY procedure is the cumbersome computation of the LOO residuals. For a fixed value of $\lambda$ and $\alpha$, the EN-PY procedure requires the computation of at most $n(4 + 4I + I\kappa) + I + 1$ LS-EN estimates, with $I$ and $\kappa$ as in Algorithm 1. Nevertheless, the actual number of LS-EN estimates that need to be computed is usually much smaller than this upper bound since the residual-filtered index set most often remains constant after a few iterations. Hence, even without mathematical shortcuts to compute the PSCs, the EN-PY procedure is still significantly faster than random subsampling for obtaining initial estimates for PENSE.

In addition to the PY procedure, Peña and Yohai (1999) propose an estimate for linear regression based on a one-step re-weighting of the S-estimate obtained from the “best” estimate (in terms of minimal M-scale of the residuals of all observations) computed through the PY procedure. Their estimate is a weighted LS estimate, where hard-rejection weights (0/1) are derived from the residuals of this aforementioned S-estimate.
The weights, however, are derived from the “hat” matrix of their linear estimator and the idea is therefore not transferable to regularized estimates.

3.2.3 Empirical Comparisons

The main selling point of the EN-PY procedure is the decreased computational burden achieved by selecting the subsets in a way that excludes potentially contaminated observations. With the same number of initial estimates, the chance that the EN-PY procedure gives at least one good initial estimate should be higher than with random subsampling. Peña and Yohai (1999) show that high leverage points are detectable in at least one PSC direction. The authors claim that due to this property the PY procedure can efficiently clean the data of gross contamination. Although the theory presented in the paper does not cover moderate leverage points, the results of simulation studies further underline the benefits of the PY procedure.

To ascertain that the advantages of the PY procedure translate to similar properties of the EN-PY procedure for sparse linear regression, I compare EN-PY and random subsampling empirically. For this experiment, data sets with $n = 100$ observations and $p = 16$ predictors are randomly generated according to 42 scenarios following scheme VS1-LT* (see Appendix A.1.1). In this lower-dimensional problem the likelihood of uncovering at least one clean subset with a computationally feasible number of random subsamples is still high. The scenarios are divided into two groups: two scenarios where no contamination is introduced and 40 scenarios where 25% of the observations are contaminated. In scenarios with contamination, the placement of contaminated observations is controlled by the leverage of contaminated observations as well as the regression parameter in the linear model generating these contaminated observations.
The variance of the error term is chosen such that the percentage of variance explained (PVE) by the true regression model is either 25% or 50% (this amounts to a signal-to-noise ratio of 1/3 and 1, respectively, and follows the suggestions in Hastie et al. (2017)). Appendix A.2 gives the complete details of the scenarios considered in this numerical experiment.

For each generated data set, initial estimates are obtained at 10 different penalization levels using random subsampling and the EN-PY procedure. All initial estimates from different penalization levels are merged into two sets: $\mathcal{T}_{RS}$ comprising initial estimates from random subsampling and $\mathcal{T}_{EN\text{-}PY}$ for initial estimates from the EN-PY procedure. The PENSE estimate is then computed for 50 different values of the penalization level. At every penalization level, the PENSE estimate is computed once from initial estimates $\mathcal{T}_{RS}$ and once from initial estimates $\mathcal{T}_{EN\text{-}PY}$, recording the difference in the attained value of the objective function.

The main results of this experiment are depicted in Figure 3.2. The left plot (Figure 3.2(a)) shows the number of settings (i.e., combinations of data sets and penalization levels) where a difference between the EN-PY and random subsampling procedures is detected. For the vast majority of settings, both procedures lead to the same minimum being uncovered, but more severe leverage points lead to more differences between the two procedures. This can be expected because the PENSE objective function usually exhibits more local optima the more severe leverage points are present. Of those replications where EN-PY and random subsampling lead to different local optima, the local optimum uncovered by EN-PY is most often better than the local optimum found via random subsampling.
Interestingly, the differences are more plentiful when the variance of the error term is small (PVE of 50%).

For those replications where there is a difference between the two procedures, the right plot (Figure 3.2(b)) shows the magnitude of these differences, relative to the true variance of the error term. Although EN-PY does not always lead to better optima, if there is a difference, the local optimum uncovered by EN-PY initial estimates is sometimes substantially better than the local optimum obtained from random subsampling. Relative to the true variance of the error, EN-PY sometimes leads to a local optimum more than 20% better than the local optimum attained from starting at initial estimates obtained by random subsampling.

[Figure 3.2, two panels, each split by PVE (25%, 50%) with contamination leverage (No Cont., Low, Moderate, High, Extreme) on the vertical axis. Panel (a), "Number of estimates with different objective value": number of settings with different estimates (0% to 6%), separated into settings where random subsampling leads to the better estimate and settings where EN-PY leads to the better estimate. Panel (b), "Magnitude of differences": relative difference in objective function (−20% to 20%).]

Figure 3.2: Comparison of the PENSE objective function at the best minimum uncovered by the EN-PY initial estimates and random subsampling initial estimates. Plot (a) shows the number of settings (relative to the total number of combinations of data sets and penalization levels considered in each scenario) where either EN-PY or random subsampling resulted in a lower value of the objective function. Plot (b) shows the actual difference (relative to the true variance of the errors) between the local optima obtained through EN-PY initial estimates and random subsampling.
Positive values indicate the EN-PY initial estimate resulted in a smaller value of the PENSE objective function.

Overall, there seem to be only small differences between initial estimates obtained through random subsampling and EN-PY. These small differences, however, suggest favoring EN-PY for many configurations of the data. To compare the computational complexity of the two procedures in this experiment, the number of initial estimates obtained via random subsampling is set to the number of LS-EN estimates computed for EN-PY. While the number of LS-EN problems is the same, the similarity of LS-EN problems involved in the EN-PY procedure makes it on average 2.9 times as fast as random subsampling for computing the initial estimates alone. Even more importantly, EN-PY leads to a much smaller number of initial estimates that have to be considered when computing the PENSE estimate. Given that the computation of the PENSE estimate for each initial estimate is computationally challenging even when using optimizations, the savings in computation time when using EN-PY are substantial. In this experiment, it takes on average 8.7 times longer to compute PENSE estimates using initial estimates from random subsamples than when using initial estimates from EN-PY. The quality of the local minima uncovered by EN-PY is better than that of the minima uncovered by random subsampling, yet computing PENSE estimates using EN-PY is several times faster, suggesting EN-PY initial estimates are highly preferable.

3.2.4 Initial Estimates for a Set of Penalization Levels

Random subsampling and the EN-PY procedure produce initial estimates for a fixed penalization level. In practice, however, a good penalization level is unknown in advance and PENSE must be computed for an entire set of penalization levels. The number of selected variables and prediction performance of the estimate vary greatly among different penalization levels; hence a fine grid of many penalty levels is preferred.
Computing initial estimates for every value in this large set of penalty levels, $\mathcal{Q}$, is infeasible. The fine granularity of $\mathcal{Q}$, on the other hand, allows for an efficient strategy of “warm-starts” as devised in Cohen Freue et al. (2019).

Consider a grid $\mathcal{Q}$ containing $Q > 1$ penalization levels in descending order, i.e., $\mathcal{Q} = \{\lambda_1, \dots, \lambda_Q\}$ such that $\lambda_{q-1} > \lambda_q$ for $q = 2, \dots, Q$. Further, denote by $\hat{\theta}^{(q-1)}$ a local minimum of the PENSE objective function at $\lambda_{q-1}$. Since the grid is fine-grained, $\lambda_{q-1}$ and $\lambda_q$ are not too far apart, suggesting a local minimum of the objective function at $\lambda_q$ is likely close to $\hat{\theta}^{(q-1)}$. If more than one local minimum at $\lambda_{q-1}$ is uncovered, each of these minima can be used as initial estimate at $\lambda_q$. These warm-starts are repeated at each $\lambda \in \mathcal{Q}$, thereby “following” local minima over different penalization levels. As depicted in Figure 3.1, this strategy can greatly increase the chances of uncovering global minima, as a local minimum may turn into a global minimum as the level of penalization changes.

The warm-starts of course depend on local minima uncovered at the preceding penalization level. Therefore, at some point, a different approach for computing initial estimates is necessary. The simplest form is the “0-based” regularization path. For a large enough penalization level, the 0-vector, $\beta = \mathbf{0}_p$, is a local minimum of the PENSE objective function and thus can be traced throughout the penalization grid. This particular form of warm-starts is predominantly used in iterative algorithms for computing LS-EN estimates because it can drastically improve computation speed (e.g., Friedman et al. 2010). With the convex LS-EN objective function, the uncovered minima are actually global minima. In the context of robust estimators with non-convex objective function, the “0-based” regularization path is still usable but the uncovered minima, one per penalization level, are not necessarily global minima.
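The idea of following local minima along a descending grid of penalization levels can be illustrated with a one-dimensional toy sketch in Python. Everything here is an assumption made for illustration: `local_descent` is a crude derivative-free search standing in for an actual PENSE solver, `warm_start_path` and its arguments are hypothetical names, and the objective is any user-supplied function of a scalar coefficient and the penalization level that is bounded below.

```python
def local_descent(f, x0, step=0.25, tol=1e-8):
    """Derivative-free local search in one dimension (stand-in for a real solver)."""
    x = float(x0)
    fx = f(x)
    while step > tol:
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc  # accept the improving move, keep the step size
                break
        else:
            step /= 2.0  # no improvement in either direction: refine the step
    return x


def warm_start_path(objective, grid, cold_starts, keep=3):
    """Follow local minima of objective(beta, lam) over a descending grid.

    grid        : penalization levels in descending order
    cold_starts : dict mapping some grid levels to extra starting points
                  (e.g., initial estimates computed only at a few levels)
    keep        : number of most promising minima carried to the next level
    """
    path = {}
    carried = []  # minima uncovered at the preceding penalization level
    for lam in grid:
        def f(b, lam=lam):
            return objective(b, lam)

        starts = carried + list(cold_starts.get(lam, []))
        if not starts:
            starts = [0.0]  # fall back to the "0-based" path
        # Run the local solver from every start and drop duplicates (up to rounding).
        minima = sorted({round(local_descent(f, s), 6) for s in starts}, key=f)
        carried = minima[:keep]
        path[lam] = carried[0]  # best minimum uncovered at this level
    return path
```

Carrying several minima instead of only the best one is what allows a minimum that is merely local at a large penalization level to be recognized as the best minimum at a smaller level.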
It is therefore necessary to also consider other initial estimates along the grid, such as initial estimates from random subsampling or the EN-PY procedure.

In Cohen Freue et al. (2019) we combine initial estimates from the EN-PY procedure with the idea of warm-starts. We take a small number, say $Q_I \ll Q$, of penalization levels from the large set $\mathcal{Q}$, denoted by $\mathcal{Q}_I \subset \mathcal{Q}$. Initial estimates are computed with the EN-PY procedure only at these few levels of penalization. When traversing the fine grid to compute local minima of the PENSE objective function, warm-starts at $\lambda_q \in \mathcal{Q}$ are combined with initial estimates from the EN-PY procedure if $\lambda_q$ is also in $\mathcal{Q}_I$. To further increase the probability of uncovering global minima, the grid $\mathcal{Q}$ is traversed in both directions. In the second pass in reverse direction, local minima at $\lambda_q$ are used to initialize the PENSE estimate at $\lambda_{q-1}$. This combined strategy of bidirectional warm-starts and EN-PY effectively reduces computation while maintaining high quality of the uncovered minima.

Absent from the discussion so far, but critical for computing initial estimates, is the issue of translating a specific level of penalization of PENSE to comparable penalization of the initial estimates. Both procedures for computing initial estimates presented here use LS-EN estimates, computed on a subset of the data, to locate PENSE estimates nearby. For this to be successful, the amount of penalization induced by the penalty level $\lambda_I$ on an LS-EN estimate computed on a (small) subset of the data must approximately match the effect of the desired penalization level $\lambda_S$ on the PENSE estimate computed on the full data. Because of the differences in loss function and data used for computation, using the same penalization level does not work well in general.
In particular, the very different loss functions can lead to the empty model from the LS-EN estimate for any subset of the data for a certain $\lambda$, while a global optimum of the PENSE objective function at this $\lambda$ corresponds to all predictors having non-zero coefficient estimates.

For the S-Ridge estimator, Maronna (2011) matches the regularization parameters between LS-Ridge and S-Ridge via a multiplicative adjustment of $\lambda_I$ to get $\lambda_S$. The author derives these adjustment factors from the ratio of the squared M-scale estimate to the variance estimate of a Normal random variable in two extreme cases: (i) the mean of the Normal distribution is 0 and (ii) the variance of the Normal distribution is 0. For a given value of $\delta$ in the definition of the S-loss (2.8) these two ratios can be computed exactly, and the author takes the geometric mean of these two numbers as a crude approximation to the expected ratio of the S-loss to the LS-loss at their respective optima. The adjustment is easy to compute, but empirical observations suggest the quality of the match is suboptimal.

The combined strategy of warm-starts and EN-PY initial estimates in Cohen Freue et al. (2019) also suffers from an imperfect match of penalization levels. The effects, however, are less detrimental because local minima are followed across penalization levels. For computational reasons, however, not every local minimum is traced throughout the entire path, only the most promising minima. If the penalization introduced in the initial estimate is vastly different from the penalization of the PENSE estimate, this filter may drop minima prematurely. This problem can be avoided by merging all EN-PY initial estimates from each penalization level in $\mathcal{Q}_I$ into one large set of initial estimates, $\mathcal{T}$. Each of these initial estimates is used for computing PENSE at every $\lambda_S \in \mathcal{Q}$.
Instead of relying on an approximate matching between $\lambda_S$ for PENSE and the regularization parameter $\lambda_I$ used for the initial estimate, the idea is that for each $\lambda_S \in \mathcal{Q}$, there should be at least one $\lambda_I \in \mathcal{Q}_I$ which gives roughly the same penalization of the initial estimate as $\lambda_S$ provides for the PENSE estimate. Although the match will in general not be perfect, the chance that some of the initial estimates are close to a global optimum is much higher when several different regularization parameters are tried for the initial estimates. The chances can be increased even further by combining the set of initial estimates $\mathcal{T}$ with the idea of warm-starts. Empirically, this simplified scheme leads to slightly better optima than the bidirectional warm-starts proposed in Cohen Freue et al. (2019). The computational burden of using this excessively large number of initial estimates can be contained by fully iterating only “promising” initial estimates. Because the simplified scheme is more amenable to algorithmic optimizations, computational complexity is very similar to bidirectional warm-starts. Further details about these optimizations to improve computational performance are given in Chapter 6.

3.3 Theoretical Properties

None of the discussed strategies for initial estimates can guarantee that a global optimum of the PENSE objective function is attained, but the chances are good if using EN-PY and, if enough computing resources are available, can be increased by adding initial estimates obtained from a large number of random subsamples. The global optimum, however, is desirable due to its provable statistical properties. In the following, the PENSE estimator $\tilde\theta$ for $\theta_0 \in \mathbb{R}^{p+1}$ is defined as the global minimum of the PENSE objective function

$$\tilde\theta = \arg\min_{\mu, \beta} \mathcal{O}_S(\mu, \beta; \lambda_{S,n}, \alpha_S) \tag{3.4}$$

where $\alpha_S$ and $\lambda_{S,n}$ are independent of the given data, but $\lambda_{S,n}$ can depend on the number of observations $n$.

As detailed in the previous chapter, it is desired for the estimator to be consistent for the true regression parameters. To derive consistency of the PENSE estimator, several assumptions are imposed on the linear regression model (2.2):

[A1] $P(X^\top \theta = 0) < 1 - \delta$ for all non-zero $\theta \in \mathbb{R}^p$ and $\delta$ as defined in (2.8).

[A2] The distribution $F_0$ of the residuals $U$ has an even density $f_0(u)$ which is monotone decreasing in $|u|$ and strictly decreasing in a neighborhood of 0.

[A3] The second moment of $G_0$ is finite and $E_{G_0}[X X^\top]$ is non-singular.

Assumption [A1] ensures that the probability that observations are perfectly aligned on a hyperplane is not too large. It is noteworthy that the assumption on the residuals [A2] does not impose any moment conditions on the distribution, which makes the following results applicable to extremely heavy-tailed errors. Furthermore, unlike many results concerning regularized M-estimators, PENSE only requires a finite second moment of the predictors.

The proofs of the following properties also require the $\rho$ function to satisfy the following condition:

[R4] $t \rho'(t)$ is unimodal in $|t|$. In other words, there exists a $c'$ with $0 < c' < c$, where $c$ is the threshold defined in [R2], such that $t \rho'(t)$ is strictly increasing for $0 < t < c'$ and strictly decreasing for $c' < t < c$.

Although this assumption is a slight variation of more common assumptions on the mapping $t \mapsto t \rho'(t)$, it is nevertheless satisfied by most bounded $\rho$ functions used for robust estimation, including Tukey’s bisquare function.

The results in Smucler and Yohai (2017) about the consistency of the S-Ridge can be applied directly to the PENSE estimator.

Proposition 1. Let $(y_i, x_i^\top)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2).
Under assumptions [A1] and [A2] and if $\lambda_{S,n} \to 0$, the PENSE estimator $\tilde\theta$ as defined in (3.4) is a strongly consistent estimator of the true regression parameter $\theta_0$:

$$\tilde\theta \xrightarrow{\text{a.s.}} \theta_0.$$

Although the penalty functions used for the S-Ridge and PENSE are different, the growth condition on $\lambda_{S,n}$ has the same effect on PENSE as on the S-Ridge: it makes the penalty term negligible for large enough $n$. The proof of Proposition 1 is therefore identical to the proof of Proposition 1.i in Smucler and Yohai (2017).

The next step is to quantify the speed of convergence in Proposition 1. The following theorem states that the PENSE estimate converges to the true parameter at the $n^{1/2}$ rate.

Theorem 1. Let $(y_i, x_i^\top)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2). Under regularity conditions [A1]–[A3] and if $\lambda_{S,n} = O(1/\sqrt{n})$, the PENSE estimator $\tilde\theta$ as defined in (3.4) is a root-n consistent estimator of the true parameter vector $\theta_0$:

$$\|\tilde\theta - \theta_0\| = O_p(1/\sqrt{n}).$$

The proof of this theorem is given in Appendix B.2.2 for a more general penalty function, of which the EN penalty is a special case. The proof is based on first-order Taylor expansions of the objective function around the true parameter $\theta_0$ and the true residuals $u_i$.

Consistency and root-n consistency of PENSE both hold even under very heavy-tailed error distributions $F_0$ and only require a finite second moment of the predictors. Importantly, the estimator is consistent for the true parameters without any prior knowledge about $H_0$; it is irrelevant whether the M-scale of the residuals is tuned to be a consistent estimator of the true scale of the error or not. Although the main focus of regularized estimators is on applications with many predictors and a comparably small sample size, the asymptotic results in this section provide assurance that PENSE is sensible for estimating parameters in the linear regression model.
Furthermore, the asymptotic guarantees for PENSE are necessary for developing theoretical results in the following chapter which allow for informative comparisons with other methods. The theory presented so far, however, does not specify how arbitrary contamination may affect the estimator.

3.4 Robustness

An overarching goal of this work is to devise estimators which can tolerate a considerable amount of contamination without giving aberrant results. Despite its shortcomings when it comes to regularized estimators as mentioned in Section 2.2, the finite-sample breakdown point (FBP) is an important measure of robustness; it measures how much contamination can be introduced such that the maximum bias under contamination remains bounded.

An appealing property of the FBP is that it can usually be proven theoretically without resorting to numerical experiments. For PENSE, the breakdown point is close to $\delta$, as shown in the following theorem.

Theorem 2. For a sample $\mathcal{Z} = \{(y_i, x_i) : i = 1, \dots, n\}$ of size $n$, let $m(\delta) \in \mathbb{N}$ be the largest integer strictly smaller than $n \min(\delta, 1 - \delta)$, where $\delta$ is as defined in (2.8). Then, for a fixed $\lambda_{S,n} > 0$ and $\alpha \in [0, 1]$, the breakdown point (2.7) of the PENSE estimator $\tilde\theta$ as defined in (3.4), $\epsilon^*(\tilde\theta; \mathcal{Z})$, satisfies the following inequalities:

$$\frac{m(\delta)}{n} \le \epsilon^*(\tilde\theta; \mathcal{Z}) \le \delta.$$

The proof of this theorem can be found in Appendix B.1.

The finite-sample breakdown point does not reveal the actual magnitude of the bias, MSE, or prediction error under contamination; it only states that these measures are finite for a contamination proportion less than $\delta$. For applications, however, it is important to have a better understanding of an estimator’s behavior under contamination. Numerical experiments, detailed in Section 3.6, shed light on the behavior of the PENSE estimator under contamination.

3.5 Hyper-Parameter Selection

The asymptotic properties of PENSE depend on an appropriate choice of the hyper-parameter $\lambda_{S,n}$.
More specifically, the estimator is consistent only if $\lambda_{S,n} \to 0$. In practice this rate is difficult to ascertain. Furthermore, while the theoretical properties do not depend on a certain choice of $\alpha$, it nevertheless impacts the performance of the estimator. For the remainder of this section the subscripts of $\lambda_{S,n}$ are dropped because only PENSE for a fixed sample size $n$ is considered.

3.5.1 Restricting the Search Space

Before discussing strategies for choosing the hyper-parameters, the search space needs to be restricted, in particular the range of values considered for the penalization level $\lambda$. For a convex objective function, e.g., the LS-EN objective function, it is straightforward to determine the largest penalization level such that $\beta = 0_p$ is a global minimum. It is unnecessary to consider penalization levels beyond this largest level, as the global minimum will be the same for all of them.

This upper bound cannot easily be determined for the PENSE objective function due to the non-convexity of the problem. It is, however, possible to determine $\tilde\lambda_S$, the smallest penalization level such that $\beta = 0_p$ is a local minimum, using the generalized gradient as defined in Clarke (1990). First it is important to note that because the unpenalized S-loss is continuously differentiable and the EN penalty is locally Lipschitz, the PENSE objective function is also locally Lipschitz.
Therefore, the generalized gradient of the PENSE objective function is the subgradient of the EN penalty plus the derivative of the S-loss. The subgradient of a convex function $g : \mathbb{R}^p \to \mathbb{R}$ at $u_0$ is defined by Clarke (1990) as the set

$$\nabla_u g(u)\big|_{u = u_0} = \left\{ v : v^\top (\tilde{u} - u_0) \le g(\tilde{u}) - g(u_0) \;\; \forall \tilde{u} \in \mathbb{R}^p \right\}.$$

Since the generalized gradient evaluated at any local minimum must contain $0_{p+1}$, it is sufficient to determine the smallest penalty level such that the subgradient of the EN penalty, evaluated at $\beta = 0_p$, contains the gradient of the S-loss evaluated at $\beta = 0_p$, i.e.,

$$\tilde\lambda_S = \inf\left\{ \lambda > 0 : \nabla_\beta \mathcal{O}_S(y, \mu + X\beta)\big|_{\beta = 0_p} \in \lambda \nabla_\beta \Phi_{EN}(\beta; \alpha)\big|_{\beta = 0_p} \right\}.$$

The subgradient of the EN penalty and the gradient of the S-loss are given by

$$\nabla_\beta \Phi_{EN}(\beta; \alpha)\big|_{\beta = \tilde\beta} = \left( \begin{cases} (1 - \alpha) \tilde\beta_j + \alpha \operatorname{sgn}(\tilde\beta_j) & \tilde\beta_j \ne 0 \\ [-\alpha, \alpha] & \tilde\beta_j = 0 \end{cases} \right)_{j=1}^{p}$$

$$\nabla_\beta \mathcal{O}_S(y, \mu + X\beta)\big|_{\beta = \tilde\beta} = -\frac{1}{n} \sum_{i=1}^{n} w_i^2(y - \mu - X\tilde\beta) \, (y_i - \mu - x_i^\top \tilde\beta) \, x_i,$$

with weights $w_i(y - \mu - X\tilde\beta)$ as defined in (3.3). Evaluated at $\beta = 0_p$, the subgradient of the EN penalty is the set $\{ b : |b_j| \le \alpha, \; j = 1, \dots, p \}$. Combined with the gradient of the S-loss at $\beta = 0_p$, $\tilde\lambda_S$ is therefore

$$\tilde\lambda_S = \frac{1}{n \alpha} \max_{j = 1, \dots, p} \left| \sum_{i=1}^{n} w_i^2(y - \hat\mu_y) \, (y_i - \hat\mu_y) \, x_{ij} \right|, \tag{3.5}$$

where $\hat\mu_y$ is the estimated intercept in the empty model, $\hat\mu_y = \arg\min_\mu \hat\sigma_M(y - \mu)$. If $\tilde\lambda_S > 0$, the 0-vector is a local minimum of $\mathcal{O}_S(\mu, \beta; \lambda, \alpha)$ for all $\lambda > \tilde\lambda_S$. On the other hand, if $\tilde\lambda_S = 0$, the 0-vector is a local maximum for all $\lambda$ smaller than a certain value $a$ and a local minimum for $\lambda > a$. In this border case, no simple expression exists to determine $a$, and a trial-and-error search for $\tilde\lambda_S$ is the only other option.

With the approximate upper bound $\tilde\lambda_S$, the search for an optimal penalization level can be concentrated on the range $(0, \tilde\lambda_S)$. The prevalent strategy is to tune the hyper-parameters to optimize some performance metric of interest, such as metrics pertaining to the quality of the fit or the prediction performance.
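As a minimal sketch of how the bound (3.5) could be evaluated, the following Python function assumes that the intercept of the empty model, `mu_y`, and the weights $w_i$ from (3.3) have already been computed elsewhere. The function name and argument layout are hypothetical and are not taken from the software accompanying this work. The expression is only finite for $\alpha > 0$.

```python
def lambda_upper_bound(x, y, mu_y, weights, alpha):
    """Evaluate the bound (3.5) for given weights and intercept.

    x: list of n rows, each holding p predictor values
    y: list of n responses
    mu_y: intercept of the empty model (assumed to be computed elsewhere)
    weights: the w_i of (3.3) at the residuals y - mu_y (assumed given)
    alpha: EN mixing parameter, must be positive
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive for the bound to be finite")
    n, p = len(y), len(x[0])
    # j-th entry of the S-loss gradient at beta = 0, up to the factor -1/n:
    # sum_i w_i^2 (y_i - mu_y) x_ij
    grad = [
        sum(w * w * (yi - mu_y) * row[j] for w, yi, row in zip(weights, y, x))
        for j in range(p)
    ]
    return max(abs(g) for g in grad) / (n * alpha)


# Toy data with unit weights, for illustration only.
x = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
bound = lambda_upper_bound(x, y, mu_y=2.0, weights=[1.0, 1.0, 1.0], alpha=0.5)
```

In the toy example the gradient entries are 0 and 1, so the bound is $1 / (3 \cdot 0.5) = 2/3$.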
Robust fit-based metrics, for example robust versions of the popular information criteria AIC (Akaike 1974) or BIC (Schwarz 1978), rely on a robust estimate of the residual scale. In high-dimensional settings, however, estimating the residual scale is a difficult task by itself. Robust estimation is especially challenging, because robust scale estimates themselves require tuning parameters. Changing these tuning parameters effectively changes the information criterion itself; if the distribution of the error term is unknown, there is no general way to choose these tuning parameters.

More importantly, applications motivating this work demand estimators with strong prediction performance. For these applications, fit-based metrics are not useful because they only give limited insight into how well the fitted model generalizes beyond the sample at hand. Due to this shortcoming, prediction performance is usually evaluated using measures of the prediction error. The prediction error, however, cannot be sensibly estimated on the same data as used to fit the model, i.e., the data used in the computation of PENSE. Strategies to estimate the prediction error are most often based on withholding some of the available observations (i.e., the “test” set) and computing optima of the PENSE objective function on the remaining observations (i.e., the “training” set). The prediction error is then estimated as the error arising from predicting the responses of the withheld observations.

3.5.2 Cross Validation

Arguably the most prevalent strategy for estimating prediction performance is K-fold cross-validation (CV). In K-fold CV, the $n$ observations in the sample at hand are split into $K$ disjoint sets of roughly equal size, called folds. In cross-validation, every observation is used exactly once for prediction and $K - 1$ times for training, i.e., for computing a global optimum of the objective function.

To outline the procedure, the index set of a single fold is denoted by $\mathcal{S}_k \subset \{1, \dots, n\}$, $k = 1, \dots, K$.
These sets are disjoint, of roughly the same size, and satisfy $\bigcup_{k=1}^{K} \mathcal{S}_k = \{1, \dots, n\}$. For each $k \in \{1, \dots, K\}$, a global optimum of the objective function using the observations in $\bigcup_{k'=1, k' \ne k}^{K} \mathcal{S}_{k'}$ is computed and denoted by $\hat\theta_k^{(\lambda,\alpha)}$. These $K$ optima are used to predict the observed responses in the respective fold by

$$\hat{y}_i^{(\lambda,\alpha)} = x_i^\top \hat\beta_k^{(\lambda,\alpha)} + \hat\mu_k^{(\lambda,\alpha)} \quad \text{for all } i \in \mathcal{S}_k. \tag{3.6}$$

No observation affects the optimum used to predict its value and hence these $n$ predicted values can be used to adequately estimate the prediction error of the method with hyper-parameters $(\lambda, \alpha)$.

A popular metric for the prediction performance of an estimator $\hat\theta$ is its root mean squared prediction error (RMSPE), defined as

$$\mathrm{RMSPE}(\hat\theta) = \sqrt{E\left[ (Y - X^\top \hat\beta - \hat\mu)^2 \right]}. \tag{3.7}$$

Using cross-validation, the RMSPE can be estimated by

$$\widehat{\mathrm{RMSPE}}(\lambda, \alpha) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i^{(\lambda,\alpha)} - y_i \right)^2}.$$

If the error distribution is heavy-tailed, the RMSPE might not be well defined. More importantly, under the presence of contamination in the sample the estimated RMSPE is badly affected and does not adequately reflect the estimate’s prediction performance. Since the RMSPE is essentially a measure of the expected absolute size of the prediction error, it is more sensible to use a robust measure of scale for quantifying the prediction performance. A common choice to robustly measure the prediction performance of an estimator $\hat\theta$ is the uncentered τ-scale (Maronna and Zamar 2002) of the prediction errors, given by

$$\tau_P(\hat\theta) = \sqrt{E\left[ \max\left( c_\tau, \frac{|Y - X^\top \hat\beta - \hat\mu|}{\operatorname{Median} |Y - X^\top \hat\beta - \hat\mu|} \right)^2 \right]}, \tag{3.8}$$

which can be estimated via CV by

$$\hat\tau_P(\lambda, \alpha) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \max\left( c_\tau, \frac{|y_i - \hat{y}_i^{(\lambda,\alpha)}|}{\operatorname{Median}_{i'=1,\dots,n} |y_{i'} - \hat{y}_{i'}^{(\lambda,\alpha)}|} \right)^2}.$$

The parameter $c_\tau > 0$ controls the tradeoff between efficiency and robustness of the τ-size by defining what constitutes outlying values in terms of multiples of the median absolute deviation.
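For illustration, the cross-validated predictions (3.6) and the resulting estimate of the RMSPE might be computed as in the following sketch. An ordinary least-squares line with a single predictor stands in for the estimator being tuned; this is an assumption made to keep the example self-contained, and PENSE itself is not implemented here. All function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope; a stand-in for the tuned estimator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def cv_rmspe(xs, ys, n_folds, seed=1):
    """Estimate the RMSPE from out-of-fold predictions as in (3.6)."""
    n = len(ys)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Disjoint folds of roughly equal size whose union is {0, ..., n - 1}.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    predictions = [None] * n
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        mu, beta = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            # Each response is predicted by a fit that did not use it.
            predictions[i] = mu + beta * xs[i]
    mse = sum((p - obs) ** 2 for p, obs in zip(predictions, ys)) / n
    return math.sqrt(mse), predictions


# Noise-free toy data: every fold recovers the same line.
xs = [float(i) for i in range(10)]
ys = [2.0 + 3.0 * v for v in xs]
rmspe, predictions = cv_rmspe(xs, ys, n_folds=5)
```

Swapping the RMSPE for a robust scale of the prediction errors only changes the final aggregation step in `cv_rmspe`; the fold construction and the out-of-fold predictions stay the same.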
In this work, the τ-size is always reported for $c_\tau = 3$.

Once the set of hyper-parameters resulting in the best prediction performance is determined, a global optimum at these chosen hyper-parameters is computed using all $n$ observations. Cross-validation is shown to work very well for regularized estimators using convex objective functions (Hirose et al. 2013; Homrighausen and McDonald 2016; Homrighausen and McDonald 2018). Cross-validation performs well when a global optimum computed using all $n$ observations, $\hat\theta^{(\lambda,\alpha)}$, is “reasonably close” to global optima computed on the subsets of observations, $\hat\theta_k^{(\lambda,\alpha)}$, $k = 1, \dots, K$. This is usually the case if the amount of penalization induced by the hyper-parameters $(\lambda, \alpha)$ is comparable between the subsamples and the objective function only exhibits a single optimum. With non-convex objective functions, however, it is possible that a local optimum of the objective function evaluated on the full data is a global optimum when evaluated on a subset of the data. An example of this behavior is given in Figure 3.3 for data generated by a simple linear regression model with true parameter value $\beta_0 = 1$ and 30% of the observations contaminated. While the global minimum of the PENSE objective function evaluated on all observations is around 0.9, the global minimum of the objective function evaluated on three of the five subsets is close to −1. The subsets in this example satisfy the conditions for cross-validation and contamination never exceeds the desired breakdown point of 50%, but it is obvious that the predictions from three of the five estimates are likely far off. For this particular set of hyper-parameters the estimated prediction performance is therefore not representative of the prediction performance of the global minimum on the full data. Although this example shows an extreme scenario, it highlights that cross-validation may give very different estimates of the prediction performance for different splits of the data. This issue is not unique to PENSE; it affects any estimator defined via a non-convex objective function, because of the disconnect between the minima uncovered in the CV folds and the estimate from the full data.

3.5.3 Train/Test Split

The challenges of cross-validation exposed in the previous section can be traced back to two issues: (i) estimating the prediction performance by combining the prediction errors from different optima (computed on different subsets of the data) which may not be comparable and (ii) trusting that this estimated prediction performance is representative of the prediction performance of the optimum computed on the full data set for the selected hyper-parameters.

These challenges could be surmounted by gauging the prediction performance of every possible estimate directly. For train/test splitting, PENSE estimates are computed on a random subset of the data (i.e., the training set) and the estimates’ prediction performance is evaluated on the left-out observations (i.e., the test set). In contrast to cross-validation, the PENSE estimates are not computed on the full data set but only on the training set, avoiding the issues highlighted before.

Figure 3.3: PENSE objective function (3.1) for a simple linear regression model of the form y = x + u, evaluated at different values of β on the full data set with 100 observations (solid blue line) and subsets of size 80 (dashed light blue lines). The points on each curve mark the global minimum of the objective function evaluated on the particular subset. [Plot not reproduced. Horizontal axis: β; vertical axis: PENSE objective function; legend “Observations”: All, Subset.]

Simple train/test splitting, however, suffers from different issues, especially in the presence of contamination.
If there is a large number of contaminated observations in the test set, it is not possible to accurately estimate the prediction performance of the PENSE estimates. Estimates which are affected by contamination in the training set may appear to have good prediction performance. On the other hand, “good” PENSE estimates will not appear as such since contaminated observations in the test set will not be predicted well. A single train/test split is therefore not sufficient.

It is more appropriate to divide the observations equally into $K$ disjoint folds, similar to cross-validation. Each fold is used as test set exactly once, with the remaining $K - 1$ folds being used for training. This leads to $K$ PENSE estimates for every hyper-parameter configuration, with each estimate being evaluated on a different test set. If the total contamination in the data is $\epsilon n$, there is at least one test set with less than $\epsilon n K / (K - 1)$ contaminated observations. Nevertheless, an estimate affected by contamination can still appear to outperform the other estimates.

A more resilient procedure can be constructed by averaging comparable information from all $K$ folds. As outlined above, PENSE estimates computed with the same $\lambda$, but on different subsets of the data, might not be comparable. The effect of $\alpha$, on the other hand, is more stable across subsets of the data. The following two-stage procedure therefore leads to a more stable hyper-parameter selection than simple train/test splitting. For each $\alpha$ in a grid of values, $\mathcal{A} = \{\alpha_1, \dots, \alpha_A\}$, and for every fold $k = 1, \dots, K$, select the PENSE estimate with hyper-parameter $\lambda_k$ which minimizes the scale of the prediction error in the $k$-th fold. Thus, each of the $K$ folds yields $A$ PENSE estimates, one for every $\alpha \in \mathcal{A}$. Test sets with a large proportion of contamination can occasionally lead to a highly underestimated scale of the prediction error.
If the breakdown point of the estimator, however, is large enough, it is unlikely that this phenomenon occurs for every $\alpha$ in the grid. The prediction performance in each of the $K$ folds can be summarized by taking the median scale of the prediction error of all $A$ estimates in the $k$-th fold. The final PENSE estimate is then chosen as the estimate with minimum scale of the prediction error in the fold with smallest median scale of the prediction error.

The major drawback of train/test splitting is that some observations are forfeited for use as test set. While this can improve estimation of the prediction performance, it can directly lower the prediction performance of the PENSE estimate, because it does not have access to all observations. The numerical experiments conducted in the following section also expose this weakness of train/test splitting. Although CV is sometimes much more affected by contamination, in the majority of cases estimates computed by train/test splitting seem to be slightly worse.

3.6 Numerical Experiments

The theoretical properties in Section 3.3 give an indication about the qualities of the PENSE estimator, but it is difficult to translate these asymptotic properties into tangible metrics on finite samples. The growth condition on the penalty parameter $\lambda_{S,n}$, for example, requires a procedure independent of the data to select the penalty parameter; there are no theoretical guarantees regarding the data-driven hyper-parameter selection procedures outlined in Section 3.5. Similarly, the breakdown point of PENSE only guarantees that the parameter estimates remain bounded, but it is unknown how contamination affects the estimates. Numerical experiments are a useful tool to gauge the effectiveness of different hyper-parameter selection strategies and the practical performance and robustness of PENSE and competing estimators.

3.6.1 Estimators

In the following experiments, PENSE is computed with a breakdown point of 33%, i.e., $\delta$ in the S-loss (2.9) is set to 0.33. The grid of $\alpha$ values is $\mathcal{A} = \{0.5, 0.66, 0.83, 1\}$ and the grid for $\lambda$ comprises 50 values equidistant on the log-scale with the upper endpoint $\tilde\lambda_S$ (derived in Section 3.5.1) and the lower endpoint set to $0.001 \alpha \tilde\lambda_S$. Initial estimates for PENSE are computed according to the 0-based regularization path and the simplified scheme described in Section 3.2.4, for a total of 10 penalization levels. As justified by the results in the beginning of Section 3.6.3, the hyper-parameters $\alpha$ and $\lambda$ are selected by 5-fold cross-validation as discussed in Section 3.5. Prediction performance is measured by the τ-scale of the prediction errors. A detailed description of the algorithms used to compute the PENSE estimate is given in Chapter 6.

PENSE is compared to several other robust and non-robust estimators. The robust estimator most similar to PENSE is MMLASSO, with the initial S-Ridge estimate computed for 10 different penalization levels and the penalization level for MMLASSO selected by 5-fold CV. In low- to moderate-dimensional settings only ($p < (1 - \delta) n - 1$), the robust unregularized S- and MM-estimators (denoted by S and MM, respectively) are computed as provided in the R package RobStatTM (Yohai et al. 2019), with breakdown point set to 33%. For hypothetical comparisons, the oracle S- and MM-estimates are computed using only the truly active predictors. All robust estimates employ Tukey’s bisquare $\rho$ function, with cutoff set to 2.37, which yields a consistent scale estimate in case of Normal errors and $\delta = 0.33$.

The LS-EN estimate is computed using the glmnet (Simon et al.
2011) R package. Hyper-parameters are selected by 5-fold CV on the same grid of $\alpha$ values as used for PENSE and the penalty parameter $\lambda_{LS}$ is chosen from a set of 50 values generated by glmnet. Prediction performance for cross-validation is measured by the mean absolute prediction error.

3.6.2 Scenarios

Robust estimators should perform well under any conceivable contamination. While it is infeasible to cover every possible contamination, the objective function of PENSE suggests the kind of contamination with the most severe effect on the estimate. As for other S-estimators of linear regression (e.g., Maronna 2011), a strong linear relationship between the contaminated responses and predictors combined with high leverage potentially leads to a large bias in the PENSE estimate. The numerical experiments in this section therefore cover a range of contamination scenarios where the contaminated observations follow a linear relationship different from the majority of the data.

The majority of the $n$ observations follows the linear model

$$y_i = x_{i1} + \dots + x_{is} + u_i, \quad i = \lfloor \epsilon n \rfloor + 1, \dots, n,$$

where $x_i$ is the vector of $p$ predictors following a multivariate t-distribution with 4 degrees of freedom and $s < p$ is the number of predictors with non-zero coefficient. The error terms $u_i$ are i.i.d. following a stable distribution with varying tail parameter. The empirical scale of the error term, $\hat\sigma_u$, is chosen to control for the proportion of variance explained by the model (PVE), $\nu$:

$$\nu = 1 - \frac{\hat\sigma_u^2}{\hat\sigma_y^2}.$$

Following the argument in Hastie et al. (2017) on realistic values of explained variation, $\nu$ is fixed at 0.25.

Contaminated observations, on the other hand, follow the linear model

$$y_i = k_v \tilde{x}_i^\top \pi + u_i', \quad i = 1, \dots, \lfloor \epsilon n \rfloor,$$

with parameter $k_v$ controlling the “outlyingness” of the contaminated observations and perturbation $u_i'$ following a centered Normal distribution scaled such that the model explains 91% of the variation in the contaminated observations.
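The calibration of the error scale to a target PVE described above can be sketched as follows. The sketch assumes, for illustration only, that the empirical scales $\hat\sigma_u$ and $\hat\sigma_y$ are sample standard deviations; the data generation scheme in Appendix A.3 may use different scale estimates. The function name is hypothetical. Setting $\hat\sigma_u^2 / \hat\sigma_y^2 = 1 - \nu$ for the rescaled errors $c u_i$ gives a quadratic equation in $c$, of which the positive root is taken.

```python
import math
import random
import statistics


def error_scale_for_pve(signal, errors, pve):
    """Return c > 0 such that 1 - var(c * u) / var(signal + c * u) equals pve.

    Sample variances serve as empirical scales (an assumption of this sketch).
    """
    n = len(signal)
    mean_s, mean_u = statistics.fmean(signal), statistics.fmean(errors)
    var_s = statistics.variance(signal)
    var_u = statistics.variance(errors)
    cov = sum((s - mean_s) * (u - mean_u) for s, u in zip(signal, errors)) / (n - 1)
    # Positive root of: pve * var_u * c^2 - 2 (1 - pve) cov * c - (1 - pve) var_s = 0
    b = (1.0 - pve) * cov
    return (b + math.sqrt(b * b + pve * (1.0 - pve) * var_u * var_s)) / (pve * var_u)


# Toy example with Normal draws; the experiments use other distributions.
rng = random.Random(42)
signal = [rng.gauss(0.0, 2.0) for _ in range(200)]
errors = [rng.gauss(0.0, 1.0) for _ in range(200)]
c = error_scale_for_pve(signal, errors, pve=0.25)
response = [s + c * u for s, u in zip(signal, errors)]
achieved = 1.0 - statistics.variance([c * u for u in errors]) / statistics.variance(response)
```

Because the sample covariance between signal and errors enters the equation, the target PVE is matched on the sample at hand up to rounding, not only in expectation.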
On the one hand, large values of $|k_v|$ lead to farther outlying observations and hence have more potential of biasing estimates. On the other hand, robust estimators can better identify highly outlying observations as contaminated and assign low weights in the estimation to these observations. Additionally, the regularizing term steers the estimate towards the model favoring the non-contaminated observations if $|k_v|$ is very large. Therefore, it is difficult to predict which values of $k_v$ lead to higher bias of PENSE estimates. To get an overall assessment of the bias incurred by contamination, five different contamination parameters are considered, $k_v \in \{-2, -1, 0, 3, 7\}$.

The vector $\pi$ is randomly generated to have exactly $s$ 1’s and $p - s$ 0’s, determining which predictors are included in the linear relationship of the contaminated observations. Leverage of the contaminated observations is increased by scaling the values of the predictors included in the linear model for the contaminated observations. The magnitude of scaling is determined by the contamination parameter $k_l > 1$. Larger values of the scaling factor $k_l$ lead to higher leverage of the contaminated observations and thus to larger bias of estimates, but the effect on robust estimates levels off. Therefore the value is fixed for all contamination scenarios at $k_l = 8$. The detailed scaling mechanism is explained in Appendix A.3.

Scenarios without contamination (“no contamination”) are replicated 100 times, while contaminated scenarios are replicated 50 times. A detailed description of the simulation scenarios and data generation schemes is given in Appendix A.3.

3.6.3 Results

Before comparing PENSE to other regularized estimators, the strategy for hyper-parameter selection must be determined. Figure 3.4 shows the relative scale of the prediction error for PENSE estimates obtained from different hyper-parameter selection strategies in all considered scenarios.
To compare results across error distributions, sample sizes, and sparsity settings, the scale of the prediction error is standardized by the scale of the prediction error of the PENSE estimate obtained from hyper-parameters selected to minimize the prediction error on a large independent validation set. This validation set is unavailable in practice but is the gold standard against which to compare the different hyper-parameter selection strategies.

The strategies compared in Figure 3.4 are cross-validation (Section 3.5.2) and train/test splitting as outlined in Section 3.5.3. The strategy Train/Test (min) uses the estimate resulting in the smallest estimated prediction error, while Train/Test (avg) averages information from all $K$ folds, as detailed at the end of Section 3.5.3. Figure 3.4 highlights that CV is preferable to train/test splitting in the vast majority of cases. Especially for scenarios without contamination, the PENSE objective function is in general well-behaved and local minima do not cause problems, while train/test splitting clearly suffers from a reduced sample size. Under contamination, CV still tends to perform better than train/test splitting, although the difference is in general negligible. The numerical experiment also underlines that, in isolated cases, CV does suffer from the issues outlined in Section 3.5.2. Overall, however, the benefits of cross-validation and using the full sample to compute the estimates dominate train/test splitting strategies.

Figure 3.4: Prediction performance of PENSE estimates with hyper-parameters chosen according to 5-fold CV or different versions of 5-fold train/test split. The scale of the prediction error on the vertical axis is shown relative to the prediction error of the PENSE estimate with hyper-parameters obtained by using an independent validation set of 1000 observations. The boxplots include results from all considered scenarios. [Plot not reproduced. Panels: No Contamination, 25% Contamination; horizontal axis: number of predictors (16, 32, 64, 128); vertical axis: relative scale of prediction error; legend “Method”: CV, Train/Test (min), Train/Test (avg).]

Summarizing results under contamination

Scenarios with 25% contamination can be arranged in groups of five scenarios by ignoring the value of the contamination parameter, $k_v$. Within each group of five scenarios, the uncontaminated observations are identical and the same as in the corresponding scenario without contamination. Figure 3.5 shows the τ-size of the prediction error estimated on an independent validation set relative to the true scale of the residuals for LS-EN, MMLASSO, and PENSE, under the different outlier positions. In this plot, an outlier position of $k_v = 1$ corresponds to the scenario without contamination. For the non-robust LS-EN estimator, prediction performance decreases sharply with increasing severity of the outliers, i.e., $|k_v - 1|$, but the effects are much more pronounced in the very sparse scenario VS1-MH($k_v$, 8) shown in the left panel. The robust estimators MMLASSO and PENSE show similar performance across different outlier positions, but MMLASSO seems to be more affected by some of the outlier positions than PENSE. Both robust estimators are most affected by moderate severity of the outliers, i.e., contamination which is not easily detectable as such, but neither exhibits a severe loss of prediction performance.

Contamination in sparse scenarios (i.e., 24 predictors out of 64 are active) appears to be less problematic than in very sparse scenarios (6 predictors are active).
The reason for this phenomenon is that in sparse scenarios the true scale of the residuals is much greater than in very sparse scenarios as the proportion of variance explained is kept constant at $\nu = 0.25$. The increased true error scale results in the residuals of the contaminated observations (with regards to the true model) being much less extreme than in similar very sparse scenarios. Therefore, neither robust nor non-robust estimators are highly affected by the contamination in sparse scenarios. The trend of the LS-EN estimator, however, strongly indicates that for larger values of $k_v$ the estimator will lead to nonsensical predictions.

Figure 3.5: Prediction performance of regularized estimators under scenarios VS1-MH($k_v$, 8) (left) and MS1-MH($k_v$, 8) (right) with n = 100 and p = 64. The horizontal axis shows the different outlier positions, $k_v$, where $k_v = 1$ corresponds to the “no contamination” scenario. The scale of the prediction error on the vertical axis is shown relative to the true scale of the residuals. The error bars depict the range of the inner 50% (interquartile range) of relative prediction errors from 50 replications. [Plot not reproduced. Panels: Very sparse, Sparse; horizontal axis: parameter in contamination model, $k_v$ (−2, −1, 0, 1, 3, 7); vertical axis: relative scale of prediction error; legend “Method”: EN, MMLASSO, PENSE.]

For an overall assessment of performance of the estimators under contamination, the metrics reported below summarize the scenarios by ignoring the value of the contamination parameter $k_v$. In other words, the different outlier positions are treated equally when assessing performance. Scenarios with contamination are replicated 50 times, and hence the reported values summarize $5 \times 50 = 250$ values.
This leads to simpler comparison of different methods across different scenarios based on their “average performance under contamination”.

Prediction performance

Prediction performance is measured either by the root mean square prediction error (RMSPE) as defined in (3.7) or by the $\tau$-size of the prediction errors, defined in (3.8). The RMSPE is standardized by the empirical standard deviation of the true errors, $\hat\sigma_u$, and reported only for Normal errors. The $\tau$-size, standardized by the empirical $\tau$-scale of the true errors, $\hat\tau_u$, is reported for all other error distributions without finite variance. Both measures of prediction performance are estimated on an independent test set of 1000 observations without contamination. A relative scale of the prediction error of 1 says the prediction error is of the same magnitude as the random error and indicates good prediction performance, while larger values mean worse prediction performance.

Figure 3.6 shows boxplots of prediction performance for the LS-EN estimator, MMLASSO, and PENSE under Normal and Cauchy errors and increasing number of predictors $p$. In all of these scenarios, the number of observations is fixed at $n = 100$ and the true model explains 25% of the variance in the observed response.

With more predictors available, the problem becomes more challenging and the prediction performance decreases accordingly. Even for low-dimensional problems, the prediction performance of the non-robust LS-EN estimator deteriorates drastically under the presence of contamination or heavy-tailed errors. Of the two robust estimators shown, MMLASSO leads to better prediction performance than PENSE for Normal errors and without contamination present, regardless of the number of truly active predictors. This can be expected since the scale estimate used for MMLASSO is tuned for consistency under Normal errors, leading to improved efficiency.
For more heavy-tailed errors, however, the advantage of the M-step dissipates and the prediction performance of PENSE estimates is as good as or slightly better than MMLASSO estimates. While PENSE is outperformed by MMLASSO in some scenarios with Normal errors, MMLASSO seems more affected than PENSE by heavy-tailed errors and under the presence of grossly contaminated observations.

A comprehensive summary of the prediction performance in all scenarios, including additional error distributions and sample sizes, is given in Appendix C.1.1. It should be noted that in these visualizations LS-EN seems only slightly affected by contamination in sparse settings. As explained above, this is an artifact of the contamination being overshadowed by the large variability of the error term. Overall PENSE is the most stable of the considered estimators, leading to more robust estimates with highly competitive prediction performance.

Figure 3.6: Prediction performance of regression estimates in different scenarios with a sample size of $n = 100$. The horizontal axis in each panel shows the total number of predictors, while the vertical axis in each panel shows the root mean square prediction error (for Normal errors) or the $\tau$-scale of the prediction errors (for Cauchy errors). (a) Very sparse scenarios VS1-LT* and VS1-HT*. (b) Sparse scenarios MS1-LT* and MS1-HT*. (Columns: No Contamination, 25% Contamination; rows: Normal errors, Cauchy errors; horizontal axis: number of predictors (thereof active), 16 (4), 32 (5), 64 (6), 128 (7) in (a) and 16 (12), 32 (17), 64 (24), 128 (34) in (b); methods: EN, MMLASSO, PENSE.)

Variable selection performance

The stated goal of PENSE is to achieve good prediction performance while at the same time identifying relevant variables. For variable selection two measures are of interest: the relative number of correctly identified active predictors (sensitivity, SE) and the relative number of correctly identified inactive predictors (specificity, SP). These two measures are defined as

$$\mathrm{SE}(\hat{\boldsymbol\beta}) = \frac{\mathrm{TP}(\hat{\boldsymbol\beta})}{\mathrm{TP}(\hat{\boldsymbol\beta}) + \mathrm{FN}(\hat{\boldsymbol\beta})}, \qquad \mathrm{SP}(\hat{\boldsymbol\beta}) = \frac{\mathrm{TN}(\hat{\boldsymbol\beta})}{\mathrm{TN}(\hat{\boldsymbol\beta}) + \mathrm{FP}(\hat{\boldsymbol\beta})} \tag{3.9}$$

where

$$\mathrm{TP}(\hat{\boldsymbol\beta}) = \left|\{j : \hat\beta_j \neq 0 \wedge \beta^0_j \neq 0\}\right|, \qquad \mathrm{FP}(\hat{\boldsymbol\beta}) = \left|\{j : \hat\beta_j \neq 0 \wedge \beta^0_j = 0\}\right|,$$
$$\mathrm{TN}(\hat{\boldsymbol\beta}) = \left|\{j : \hat\beta_j = 0 \wedge \beta^0_j = 0\}\right|, \qquad \mathrm{FN}(\hat{\boldsymbol\beta}) = \left|\{j : \hat\beta_j = 0 \wedge \beta^0_j \neq 0\}\right|$$

are the number of true positives, false positives, true negatives, and false negatives, respectively. Perfect variable selection is achieved if both measures are 1, i.e., all active predictors have a non-zero coefficient and all inactive predictors have a coefficient value of 0.

Figure 3.7 shows the sensitivity and specificity under very sparse and sparse scenarios for a sample size of $n = 100$. As for prediction performance, variable selection is more challenging when more predictors are available and the more predictors are truly active. Variable selection of the non-robust LS-EN estimator is much more affected by heavy-tailed error distributions than by gross contamination. In particular, sensitivity drops to almost 0% for the LS-EN estimator if the errors are Cauchy distributed, even under no contamination; the LS-EN estimate almost always selects the empty model in these scenarios. Interestingly, contamination by leverage points appears to help LS-EN identify some relevant predictors even for Cauchy errors. The reason is that some of the truly active predictors are contaminated by high-leverage values, which are immediately selected by LS-EN, alongside the other contaminated predictors. This highlights the hypersensitivity of LS-EN estimates to leverage point contamination; any predictor with leverage points will be selected by LS-EN with near certainty.

The robust estimators, on the other hand, perform very similarly for light- and heavy-tailed errors as well as under contamination. Sensitivity of both PENSE and MMLASSO estimates is almost unaffected by gross contamination under Normal errors and decreases only slightly if errors are Cauchy distributed. Specificity decreases more under contamination and it seems even robust estimators tend to wrongly select inactive predictors if they are contaminated with leverage points.

In sparse scenarios (Figure 3.7(b)), variable selection is apparently more challenging than in very sparse scenarios. The greater flexibility of the EN penalty used by PENSE seems to be an advantage in these sparse scenarios. While MMLASSO has comparable sensitivity to PENSE in very sparse scenarios, the $\ell_1$ penalty can be too restrictive for scenarios where many predictors are truly active. In these scenarios PENSE has substantially higher sensitivity than MMLASSO.

Across all scenarios, PENSE has the highest sensitivity and selects more of the truly active predictors than the other estimators. Even under no contamination and Normal errors, PENSE is as good as LS-EN in detecting truly active predictors. Unsurprisingly, MMLASSO tends to have lower sensitivity than PENSE because of the restrictions imposed by the $\ell_1$ penalty. On the other hand, PENSE usually selects many more irrelevant variables than MMLASSO.
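The sensitivity and specificity in (3.9) can be computed directly from an estimated and a true coefficient vector. The following is a minimal sketch in plain Python; the function name `selection_metrics` and the example coefficient values are hypothetical and chosen only for illustration, they are not taken from the simulation study.

```python
def selection_metrics(beta_hat, beta_true):
    """Return (sensitivity, specificity) as defined in (3.9).

    beta_hat  -- estimated slope coefficients (hypothetical input)
    beta_true -- true slope coefficients, same length as beta_hat
    """
    if len(beta_hat) != len(beta_true):
        raise ValueError("coefficient vectors must have the same length")

    pairs = list(zip(beta_hat, beta_true))
    tp = sum(1 for est, true in pairs if est != 0 and true != 0)  # true positives
    fp = sum(1 for est, true in pairs if est != 0 and true == 0)  # false positives
    tn = sum(1 for est, true in pairs if est == 0 and true == 0)  # true negatives
    fn = sum(1 for est, true in pairs if est == 0 and true != 0)  # false negatives

    # The ratios are undefined if there are no truly active (or no truly
    # inactive) predictors; NaN is returned in that case.
    sensitivity = tp / (tp + fn) if tp + fn > 0 else float("nan")
    specificity = tn / (tn + fp) if tn + fp > 0 else float("nan")
    return sensitivity, specificity


# Hypothetical example: six predictors, the first three are truly active.
beta_true = [1.5, -2.0, 0.7, 0.0, 0.0, 0.0]
beta_hat = [1.2, -1.8, 0.0, 0.3, 0.0, 0.0]
se, sp = selection_metrics(beta_hat, beta_true)
```

In this made-up example the estimate misses one active predictor and wrongly selects one inactive predictor, so both measures fall below 1. An estimate that selects the empty model, as LS-EN often does under Cauchy errors, would have sensitivity 0 and specificity 1.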
Overall, PENSE has high sensitivity but only moderate specificity, a shortcoming addressed in the following Chapter 4.

These conclusions also extend to the other error distributions and sample sizes, as visualized in Appendix C.1.2. Variability of the variable selection performance of PENSE estimates decreases substantially with larger sample size, and sensitivity improves noticeably. Specificity, on the other hand, increases only moderately with a larger sample size. Importantly, variable selection properties of LS-EN deteriorate quickly even for a moderate-light-tailed error distribution.

It is important to note that none of the estimators shown here possess any theoretical guarantees of uncovering the true active set with high probability in the scenarios considered.

(Figure 3.7 panels: (a) very sparse scenarios VS1-LT* and VS1-HT*; (b) sparse scenarios MS1-LT* and MS1-HT*. Columns: No Contamination, 25% Contamination; rows: Normal errors, Cauchy errors; horizontal axis: number of predictors (thereof active); methods: EN, MMLASSO, PENSE.)

Figure 3.7: Variable selection performance of regularized regression estimates in different scenarios with a sample size of $n = 100$. The horizontal axis in each panel shows the total number of predictors. The vertical axis in each panel is split in two halves: sensitivity (i.e., the number of correctly identified active predictors) is shown on the top half, and specificity (i.e., the number of correctly identified inactive predictors) is shown downwards with perfect specificity (100%) on the bottom.
Solid vertical lines show the range of the inner 50%, while the dashed lines extend from the 5% to the 95% quantile.

Regularized robust estimators with better variable selection performance are discussed in the following Chapter 4.

3.7 Conclusions

The elastic net S-estimator, PENSE, proposed in Cohen Freue et al. (2019) and explained in detail in this chapter, is a highly robust method for linear regression problems with favorable prediction performance and good variable selection properties. Compared to competing methods, PENSE does not require an auxiliary scale estimate and theoretical guarantees do not depend on moment conditions on the error term. This makes PENSE a very versatile method applicable to problems with high noise and possible contamination in the response and the predictors.

PENSE gains its robustness towards contamination and heavy-tailed error distributions by regularizing the robust, non-convex S-loss with the EN penalty. Locating a good minimum of this non-convex objective function with limited computing resources requires carefully chosen initial estimates. Using ideas from the Peña-Yohai estimator (Peña and Yohai 1999), we devised the EN-PY procedure in Cohen Freue et al. (2019) for PENSE to compute initial regularized estimates based on subsets of the data which likely exclude observations with high leverage. In practice, the EN-PY procedure, outlined in Section 3.2.2, often leads to better local optima than other strategies to obtain initial estimates while being computationally much more efficient.

Despite the complications introduced by the non-convex objective function, I establish the root-n consistency of the PENSE estimator in Section 3.3 for a fixed number of predictors but otherwise very mild assumptions. These asymptotic results, however, require penalty parameters chosen independently of the available sample according to the necessary growth conditions. In practice, this is infeasible.
Section 3.5 therefore discusses different data-driven strategies to select hyper-parameters based on the prediction performance of the resulting estimate. All of these heuristics are prone to high variability in the estimated prediction performance due to the potential presence of contaminated observations combined with the non-convex objective function. The numerical experiments suggest that hyper-parameters selected via cross-validation lead to better estimates than hyper-parameters selected by other data-driven methods in the vast majority of cases. While in rare cases CV seems to be more affected by contamination than train/test splitting, the overall performance of CV justifies its use in practice. This underlines that hyper-parameter selection is challenging for estimators defined through non-convex objective functions and becomes more challenging the more severe the non-convexity caused by contaminated observations.

The numerical experiments also demonstrate that PENSE leads to better prediction performance than other estimators with provable theoretical guarantees in problems with high noise in the response and/or contaminated observations. Besides the numerical experiments conducted and explained herein, empirical results in Cohen Freue et al. (2019) for different data generation schemes underscore the versatility of PENSE, especially in problems where some predictors are highly correlated. From these empirical results and from the theoretical results presented and developed in this chapter, it can be concluded that PENSE has strong prediction performance and estimation accuracy even under very challenging circumstances. None of the competing methods is able to cope with high noise levels and contamination both in the response and the predictors as well as PENSE.

With respect to variable selection, the simulation study shows that PENSE has very high sensitivity in almost all scenarios.
This high sensitivity, however, comes at the price of a large number of falsely selected predictors. In many applications, a large number of false positives is undesirable. In biomarker discovery studies, for instance, too many potential biomarkers lead to prohibitively expensive follow-up validation studies or render the biomarkers infeasible for clinical use. It is therefore of practical importance to develop a robust estimator with better variable selection performance, particularly higher specificity, without sacrificing sensitivity or prediction performance.

Chapter 4
Variable Selection Consistent S-Estimators

The penalized elastic-net S-estimator (PENSE), as detailed in the previous chapter, achieves highly robust estimation and prediction performance. Theoretical results and numerical experiments demonstrate that PENSE estimates yield competitive prediction performance, outperforming other estimators in challenging problems with heavy-tailed errors and adverse contamination. Although PENSE uncovers most of the truly active predictors, the estimate often selects many truly inactive predictors. The issue arises from the elastic net (EN) regularization term in the PENSE objective function, which introduces non-negligible bias and hence cannot lead to a variable selection consistent estimator. Therefore, I propose to replace the elastic net penalty by the adaptive EN penalty which has been shown to lead to variable selection consistent estimators when combined with the LS-loss.

The adaptive EN, as defined in (2.15), combines the advantages of the adaptive LASSO penalty (Zou 2006) and the elastic net penalty (Zou and Zhang 2009). The adaptive LASSO leverages information from a preliminary regression estimate, $\tilde{\boldsymbol\beta}$, to penalize predictors with initially “small” coefficient values more heavily than predictors with initially “large” coefficients.
This has two major advantages over the non-adaptive EN penalty: (i) the bias for large coefficients is reduced and (ii) variable selection is improved by reducing the number of false positives. Compared to adaptive LASSO, the $\ell_2$ term in the adaptive EN improves stability of the estimator in presence of highly correlated predictors (Zou and Zhang 2009).

In this chapter I introduce adaptive PENSE by combining the robust S-loss and the adaptive EN penalty. I state its theoretical properties and show that the adaptive EN penalty leads to more reliable variable selection than what can be achieved by PENSE. Furthermore, numerical experiments showcase the improved variable selection performance over PENSE while retaining similar predictive power and demonstrate that adaptive PENSE performs better than other variable selection consistent estimators under contamination. The improved variable selection is an important feature for practical applications. I revisit a biomarker discovery study from Cohen Freue et al. (2019) to highlight the utility of the adaptive PENSE estimator.

4.1 Method

The adaptive PENSE estimator is defined by a regularized objective function which combines the robust S-loss and the adaptive EN penalty. The adaptive EN penalty (2.15) is similar to the EN penalty except that the $\ell_1$ penalty applied to parameter $\beta_j$ is scaled by penalty loading $\omega_j$, raised to the power of $\zeta > 0$. For adaptive PENSE, these loadings are set to the reciprocal absolute values of an initial PENSE slope estimate, $\tilde{\boldsymbol\beta}^{(\lambda_S,\alpha_S)}$. The objective function for adaptive PENSE is given by

$$\mathcal{O}_{AS}(\mu, \boldsymbol\beta; \lambda_{AS}, \alpha_{AS}, \zeta, \boldsymbol\omega) = \mathcal{L}_S(\mathbf{y}, \mu + \mathbf{X}\boldsymbol\beta) + \lambda_{AS}\,\Phi_{AN}(\boldsymbol\beta; \boldsymbol\omega, \alpha_{AS}, \zeta) \tag{4.1}$$

with $\omega_j = 1/|\tilde\beta^{(\lambda_S,\alpha_S)}_j|$, $j = 1, \dots, p$. Minimizers of the adaptive PENSE objective function are denoted by $\hat{\boldsymbol\theta}^{(\lambda_{AS},\alpha_{AS},\zeta,\boldsymbol\omega)} = \arg\min_{\mu,\boldsymbol\beta} \mathcal{O}_{AS}(\mu, \boldsymbol\beta; \lambda_{AS}, \alpha_{AS}, \zeta, \boldsymbol\omega)$.
The hyper-parameters are omitted if not pertinent to the argument or obvious from the context.

The interpretations of hyper-parameters $\lambda_{AS}$ and $\alpha_{AS}$ are identical to interpretations of hyper-parameters $\lambda_S$ and $\alpha_S$ for PENSE, i.e., they control the amount of penalization and the balance between the $\ell_1$/$\ell_2$ penalties, respectively. The exponent in the predictor-specific regularization, hyper-parameter $\zeta$, is less intuitive. In general, a larger $\zeta$ leads to more reliance on the initial estimate $\tilde{\boldsymbol\beta}^{(\lambda_S,\alpha_S)}$ for variable selection. A small preliminary coefficient estimate $|\tilde\beta_j|$ leads to a larger penalty loading $\omega_j$. With $\zeta$ large, this large penalty loading is further amplified, heavily penalizing predictor $j$ which is in turn likely omitted from the active set. Therefore, if $\zeta$ is large, only predictors with very large preliminary coefficient estimates are likely to be selected.

Predictors with a preliminary coefficient estimate of 0 remain inactive after adaptive PENSE. In the formulation of the adaptive EN penalty, these predictors have infinite penalization because $\alpha_{AS}\lambda_{AS} > 0$ is required. Therefore, these coefficients necessarily stay 0. When computing the adaptive PENSE estimate according to (4.1), only predictors in the preliminary active set $\mathcal{A}(\tilde{\boldsymbol\beta}^{(\lambda_S,\alpha_S)}) = \{j : \tilde\beta^{(\lambda_S,\alpha_S)}_j \neq 0\}$ are considered. While irrelevant for theoretical properties of variable selection performance of adaptive PENSE, the absorbing state at 0 can in practice deteriorate variable selection performance, but at the same time improve computational speed by reducing the complexity of the problem. As an alternative, Zou and Hastie (2005) suggest replacing zero coefficients with a very small value $\epsilon$ by adjusting the penalty loadings to $\omega_j = 1/\max(\epsilon, |\tilde\beta^{(\lambda_S,\alpha_S)}_j|)$. Another way of evading the absorbing state is to use a preliminary estimate with almost surely non-zero coefficients, for example the PENSE-Ridge (i.e., $\alpha_S = 0$).
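The effect of the exponent $\zeta$ and of the $\epsilon$ adjustment on the penalty loadings can be illustrated numerically. The sketch below only computes the exponentiated loadings $\omega_j^\zeta$ with $\omega_j = 1/\max(\epsilon, |\tilde\beta_j|)$; how these enter the adaptive EN penalty (2.15) is not reproduced here. The function name `penalty_loadings` and the preliminary coefficient values are hypothetical.

```python
import math


def penalty_loadings(beta_prelim, zeta, eps=0.0):
    """Return omega_j ** zeta with omega_j = 1 / max(eps, |beta_prelim_j|).

    With eps = 0 a preliminary coefficient of exactly 0 receives an infinite
    loading (the absorbing state at 0); with eps > 0 the loading stays finite.
    """
    if zeta <= 0:
        raise ValueError("zeta must be positive")
    if eps < 0:
        raise ValueError("eps must be non-negative")

    loadings = []
    for coef in beta_prelim:
        denom = max(eps, abs(coef))
        omega = math.inf if denom == 0 else 1.0 / denom
        loadings.append(omega ** zeta)
    return loadings


# Hypothetical preliminary slope estimates: large, small, and exactly zero.
beta_prelim = [2.0, 0.5, 0.0]
loadings_z1 = penalty_loadings(beta_prelim, zeta=1)             # [0.5, 2.0, inf]
loadings_z2 = penalty_loadings(beta_prelim, zeta=2)             # [0.25, 4.0, inf]
loadings_eps = penalty_loadings(beta_prelim, zeta=1, eps=1e-3)  # last entry finite
```

Going from $\zeta = 1$ to $\zeta = 2$ shrinks the loading of the large coefficient and inflates the loading of the small one, which is the amplification described above. The third predictor can only re-enter the model if $\epsilon > 0$ or if the preliminary estimate has no exact zeros.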
For adaptive PENSE, empirical results suggest that an initial PENSE-Ridge estimate leads to good results and has computational advantages over PENSE estimates with $\alpha_S > 0$.

Finding minima of adaptive PENSE’s non-convex objective function is as difficult as for PENSE. The challenge, however, is further elevated by the larger number of hyper-parameters needed for the adaptive PENSE.

4.1.1 Hyper-Parameter Selection

Computing an adaptive PENSE estimate for given values of the hyper-parameters involves two expensive non-convex optimizations: first compute the PENSE estimate $\tilde{\boldsymbol\theta}^{(\lambda_S,\alpha_S)}$, then the adaptive PENSE estimate $\hat{\boldsymbol\theta}^{(\lambda_{AS},\alpha_{AS},\zeta,\boldsymbol\omega)}$. An exhaustive hyper-parameter search for adaptive PENSE would in the first stage compute PENSE on a 2-dimensional grid of values for $\lambda_S$ and $\alpha_S$. In the second stage, adaptive PENSE is computed on a 3-dimensional grid of values for $\lambda_{AS}$, $\alpha_{AS}$, and $\zeta$, trying every PENSE estimate computed in the first stage. Performing an exhaustive search in this large space is obviously infeasible in practice.

There are several ways to restrict this extensive search. Instead of using every PENSE estimate from the first stage, the search space can be reduced by only considering the “best” PENSE estimate among all PENSE estimates with $\alpha_S = \alpha_{AS}$. A further simplification is to fix the preliminary estimate at the best overall PENSE estimate while still performing a full hyper-parameter search for the adaptive PENSE estimate. For the adaptive LS-EN estimator, Zou and Zhang (2009) propose an even more restricted search. The authors suggest first selecting hyper-parameters for the preliminary LS-EN estimate, denoted by $\alpha^*$ and $\lambda^*$. For the adaptive LS-EN estimate the authors then only search over the restricted set $\{(\alpha, \lambda) : \lambda\frac{1-\alpha}{2} = \lambda^*\frac{1-\alpha^*}{2}\}$, fixing the $\ell_2$ penalization in the adaptive EN penalty to the same level as selected for the preliminary LS-EN estimate. This could be translated to the adaptive PENSE estimator by fixing $\alpha_{AS}$ in the second stage to the same value as $\alpha_S$ in the best overall PENSE estimate.

A different approach to constrain the computational burden of the hyper-parameter search is to compute only the PENSE-Ridge (i.e., $\alpha_S = 0$) in the first stage. This has two advantages: (i) it reduces the risk of false negatives in the model selected by adaptive PENSE because the preliminary active set contains all predictors, and (ii) the PENSE-Ridge estimate is faster to compute than PENSE estimates with $\alpha_S > 0$. Although this decreases the computational burden of the first stage considerably, the search in the second stage cannot be restricted and a full 3-dimensional hyper-parameter search is necessary. Empirical results in Section 4.4.1 favor the use of PENSE-Ridge in most applications.

4.2 Statistical Theory

In this section I establish theoretical properties of the adaptive PENSE estimator $\hat{\boldsymbol\theta}$ for $\boldsymbol\theta^0 \in \mathbb{R}^{p+1}$, defined as the global minimum of the adaptive PENSE objective function

$$\hat{\boldsymbol\theta} = \arg\min_{\mu,\boldsymbol\beta} \mathcal{O}_{AS}(\mu, \boldsymbol\beta; \lambda_{AS,n}, \alpha_{AS}, \zeta, \boldsymbol\omega) \tag{4.2}$$

where $\omega_j = 1/|\tilde\beta^{(\lambda_S,\alpha_S)}_j|$, $j = 1, \dots, p$, is determined from an initial PENSE estimate. All hyper-parameters $\lambda_{AS,n}, \alpha_{AS}, \zeta, \alpha_S, \lambda_{S,n}$ are chosen independently of the sample, but $\lambda_{AS,n}$ and $\lambda_{S,n}$ need to decrease according to the number of observations $n$.

The following asymptotic properties hold under the same general conditions [A1]–[A3] as given for PENSE in Section 3.3. To ease notation and without loss of generality, I assume that the first $s$ components of $\boldsymbol\beta^0$ are non-zero (i.e., $\mathcal{A}(\boldsymbol\beta^0) = \{1, \dots, s\}$). The leading non-zero components of the true coefficient vector are denoted by $\boldsymbol\beta^0_{\mathrm{I}}$ while the trailing $p - s$ components are denoted by $\boldsymbol\beta^0_{\mathrm{II}}$ with $\boldsymbol\beta^0_{\mathrm{II}} = \mathbf{0}_{p-s}$.

Proposition 2. Let $(y_i, \mathbf{x}_i^\top)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2).
Under assumptions [A1] and [A2] and if $\lambda_{S,n} \to 0$ as well as $\lambda_{AS,n} \to 0$, the adaptive PENSE estimator $\hat{\boldsymbol\theta}$ as defined in (4.2) is a strongly consistent estimator of the true regression parameter $\boldsymbol\theta^0$: $\hat{\boldsymbol\theta} \xrightarrow{\text{a.s.}} \boldsymbol\theta^0$.

Noting that the level of $\ell_2$ penalization given by $\lambda_{AS,n}\frac{1-\alpha_{AS}}{2}$ converges deterministically to 0 due to the condition that $\lambda_{AS,n} \to 0$, the proof of strong consistency of adaptive PENSE is otherwise identical to the proof of strong consistency of adaptive MM-LASSO given in Smucler and Yohai (2017) and hence omitted. An important result for the following variable selection properties is the speed of convergence of the adaptive PENSE estimator, proven in Appendix B.2.2.

Theorem 3. Let $(y_i, \mathbf{x}_i^\top)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2). Under regularity conditions [A1]–[A3] and if $\lambda_{S,n} \to 0$ and $\lambda_{AS,n} = O(1/\sqrt{n})$, the adaptive PENSE estimator $\hat{\boldsymbol\theta}$ as defined in (4.2) is a root-n consistent estimator of the true parameter vector $\boldsymbol\theta^0$: $\|\hat{\boldsymbol\theta} - \boldsymbol\theta^0\| = O_p(1/\sqrt{n})$.

The results so far show that adaptive PENSE theoretically performs as well as PENSE. The adaptive penalty, however, gives rise to an important additional property of adaptive PENSE: variable selection consistency. The following theorem, which is proven in Appendix B.2.3, shows that under conditions [A1]–[A3], adaptive PENSE is able to recover the truly active predictors with high probability.

Theorem 4. Let $(y_i, \mathbf{x}_i^\top)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2). Under regularity conditions [A1]–[A3], and if (1) $\lambda_{S,n} = O(1/\sqrt{n})$, (2) $\lambda_{AS,n} = O(1/\sqrt{n})$, (3) $\alpha_{AS}\lambda_{AS,n} n^{\zeta/2} \to \infty$, the adaptive PENSE estimator, $\hat{\boldsymbol\theta} = (\hat\mu, \hat{\boldsymbol\beta})$ as defined in (4.2), is variable selection consistent:

$$P(\hat{\boldsymbol\beta}_{\mathrm{II}} = \mathbf{0}_{p-s}) \to 1 \quad \text{for } n \to \infty.$$

It should be noted that conditions (2) and (3) in the theorem imply that $\alpha_{AS}$ and $\zeta$ must be greater than 0. Furthermore, condition (3) is guaranteed to be satisfied for $\zeta > 1$.
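The interplay between conditions (2) and (3) of Theorem 4 can be checked for a simple polynomial rate. This is a worked illustration under an assumed parametric form of $\lambda_{AS,n}$; the constants $c$ and $\kappa$ are introduced here for illustration only and do not appear in the theorem.

```latex
% Assumed form (not from the theorem): a polynomially decaying penalty level.
\begin{align*}
  \lambda_{AS,n} &= c\, n^{-\kappa}, \qquad c > 0,\ \kappa \ge \tfrac{1}{2}
    && \text{so that } \lambda_{AS,n} = O(1/\sqrt{n}) \text{, i.e., condition (2) holds,} \\
  \alpha_{AS}\, \lambda_{AS,n}\, n^{\zeta/2} &= \alpha_{AS}\, c\, n^{\zeta/2 - \kappa}
    \;\longrightarrow\; \infty
    && \text{if and only if } \alpha_{AS} > 0 \text{ and } \zeta > 2\kappa .
\end{align*}
% At the boundary rate kappa = 1/2 the requirement reduces to zeta > 1.
```

Under this assumed form, a penalty level decaying at exactly the rate $1/\sqrt{n}$ satisfies condition (3) for any $\zeta > 1$, while a faster decay would call for a larger exponent.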
Using variable selection consistency of adaptive PENSE, it is possible to determine the asymptotic distribution of the estimator of the truly active parameters.

Theorem 5. Under the same conditions as for Theorem 4 as well as $\sqrt{n}\lambda_{AS,n} \to 0$, the asymptotic distribution of the truly active coefficients of the adaptive PENSE estimator, $\hat{\boldsymbol\beta}_{\mathrm{I}}$, is

$$\sqrt{n}\,(\hat{\boldsymbol\beta}_{\mathrm{I}} - \boldsymbol\beta^0_{\mathrm{I}}) \xrightarrow{d} N_s\!\left(\mathbf{0}_s,\ \sigma^2_M(U)\,\frac{a(\rho, F_0)}{b(\rho, F_0)^2}\,\boldsymbol\Sigma_{\mathrm{I}}^{-1}\right) \quad \text{for } n \to \infty.$$

Here, $\sigma_M(U)$ is the population M-scale of the true residuals,

$$\sigma_M(U) = \inf\left\{s > 0 : \mathbb{E}_{F_0}\left[\rho(U/s)\right] \le \delta\right\},$$

$a(\rho, F_0) = \mathbb{E}_{F_0}\left[\rho'(U/\sigma_M(U))^2\right]$, $b(\rho, F_0) = \mathbb{E}_{F_0}\left[\rho''(U/\sigma_M(U))\right]$, and $\boldsymbol\Sigma_{\mathrm{I}}$ is the covariance matrix of the truly active predictors, $\mathcal{X}_{\mathrm{I}}$.

Together, Theorems 4 and 5 imply that the adaptive PENSE estimator has the same asymptotic properties as if the true model were known in advance, under fairly mild conditions on the distribution of the predictors and the error term. By the asymptotic nature of these results, they are not immediately transferable to finite samples, especially if the number of predictors is large and the sample size comparatively small. These results are nevertheless useful because they underscore that a large number of irrelevant predictors does not have an undue effect on the accuracy of the estimates; the decisive factor is the number of truly relevant predictors. Even more important for practice, these asymptotic results allow for simple comparison of the properties of adaptive PENSE to other competing methods and help to understand under what circumstances adaptive PENSE may be preferable. For example, similar results as for adaptive PENSE are obtained for adaptive MM-LASSO in Smucler and Yohai (2017), but their results are contingent on a good estimate of the residual scale.
What distinguishes the results in Theorems 4 and 5 from previous work is that the oracle property for the adaptive PENSE estimate can be obtained without prior knowledge of the residual scale, even under very heavy-tailed errors.

The scaling factor $\sigma^2_M(U)\,\frac{a(\rho, F_0)}{b(\rho, F_0)^2}$ in the covariance matrix of the asymptotic Normal distribution of the adaptive PENSE estimator is evidence that the adaptive PENSE estimator cannot simultaneously achieve high robustness and high efficiency; the larger $\delta$ in the definition of the S-loss, the lower the asymptotic efficiency. The heavier the tail of the error distribution, however, the less severe the loss of efficiency compared to the adaptive MM-LASSO or the adaptive LS-EN. For central stable distributions (Mandelbrot 1960) with stability parameter less than 1.5, for example, the efficiency of adaptive PENSE with $\delta \le 1/3$, relative to adaptive MM-LASSO, is at least 88%. For adaptive MM-LASSO to achieve higher efficiency than adaptive PENSE the M-loss must be tuned for the specific error distribution. More importantly, however, for the tuning to improve efficiency in finite samples, the residual scale estimate must be close to $\sigma_M(U)$, which is very difficult to achieve in finite samples. Chapter 5 discusses these difficulties in more detail.

The growth rates of $\lambda_{S,n}$ and $\lambda_{AS,n}$ are important to achieve consistency in parameter estimation and variable selection. In practice, however, the hyper-parameters are usually chosen in a data-driven way and hence these growth conditions are almost impossible to enforce or check. The empirical results in Section 4.4 underline that perfect variable selection is very difficult to achieve in finite samples with data-driven hyper-parameter search. Nevertheless, adaptive PENSE shows better variable selection performance than other estimators in challenging problems.

4.3 Robustness Properties

Adaptive PENSE enjoys similar robustness properties as PENSE.
The finite-sample breakdown point (FBP) of adaptive PENSE is at least as large as the FBP of the preliminary PENSE estimate. Theorem 2 establishes that the breakdown point of the preliminary PENSE estimate is close to $\delta$, where $\delta$ is as defined in (2.8) for the S-loss of the preliminary PENSE estimate. If the same $\delta$ is used for the adaptive PENSE estimator, it also achieves a breakdown point close to $\delta$, as per the following theorem.

Theorem 6. For a sample $\mathcal{Z} = \{(y_i, \mathbf{x}_i) : i = 1, \dots, n\}$ of size $n$, let $m(\delta) \in \mathbb{N}$ be the largest integer strictly smaller than $n \min(\delta, 1 - \delta)$, where $\delta$ is as defined in (2.8) for the S-loss of the preliminary PENSE estimate and the S-loss of the adaptive PENSE estimator. Then, for fixed hyper-parameters $\lambda_S > 0$, $\lambda_{AS} > 0$ and $\alpha_S, \alpha_{AS} \in [0, 1]$, the breakdown point (2.7) of the adaptive PENSE estimator, $\epsilon^*(\hat{\boldsymbol\theta}; \mathcal{Z})$, satisfies the following inequalities:

$$\frac{m(\delta)}{n} \le \epsilon^*(\hat{\boldsymbol\theta}; \mathcal{Z}) \le \delta.$$

Noting that the preliminary estimate $\tilde{\boldsymbol\theta}$ remains bounded by Theorem 2 and hence every coefficient is penalized, the proof is identical to the proof of the FBP of PENSE which is given in Appendix B.1.

4.3.1 Robustness of Variable Selection

In the presence of certain contamination in the predictors, the adaptive EN penalty brings an important advantage over non-adaptive penalties. For PENSE, the smallest penalization level such that $\boldsymbol\beta = \mathbf{0}_p$ is a local optimum, as given in (3.5), reveals that a single very large value in a predictor, paired with a non-outlying residual, leads to the explosion of $\tilde\lambda_{AS}$. Consider the case where predictor $j$ is truly inactive and observation $i$ has an unusually large value for predictor $j$, i.e., $x_{ij}$ is contaminated. Since predictor $j$ is truly inactive, the response $y_i$ is unaffected by this contamination.
From the subgradient of the PENSE objective function at $\boldsymbol\beta = \mathbf{0}_p$,

$$\nabla_{\boldsymbol\beta}\, \mathcal{O}_S(\mu, \boldsymbol\beta; \alpha_S, \lambda_S)\Big|_{\boldsymbol\beta = \mathbf{0}_p} = -\frac{1}{n}\sum_{i=1}^{n} w_i^2(\mathbf{y} - \mu)\,(y_i - \mu)\,\mathbf{x}_i + \lambda_S\,[-\alpha_S; \alpha_S],$$

it can be seen that direction $j$ will dominate the gradient, as long as the response $y_i$ is not otherwise contaminated (or exactly fitted by the intercept-only model). Hence, this single aberrant value in irrelevant predictor $j$ leads to this predictor being the first to enter the model, wrongly suggesting that this predictor is likely relevant.

Standardizing the data beforehand to transform all predictors to the same scale does not mitigate the problem as robust scale estimates would be unaffected by this single contaminated value. A non-robust scale estimate would help to alleviate effects of this particular contamination but would make the regression estimate susceptible to most other forms of contamination. For this reason the classical LS-EN estimator is unaffected by these leverage points when standardizing the predictors by their sample standard deviation.

Inspecting the effects of these leverage points in inactive predictors on PENSE also highlights that the estimated coefficients remain small. Similar to non-regularized estimators, as long as the linear model holds, extremely large values in the predictors actually aid the estimation. These “good” leverage points are highly informative about the true model and force the coefficient value to be close to the true value. In the case where the predictor with these extreme values is truly inactive, the coefficient estimate is forced towards 0. In fact, as $x_{ij} \to \infty$ the estimated coefficient value approaches the true value, $\tilde\beta_j \to \beta^0_j = 0$, but it will never be exactly 0 because the predictor eludes the grips of the EN penalty.

Leveraging a preliminary PENSE estimate gives a distinct advantage to adaptive PENSE. Given that the coefficient estimate for the affected predictor is likely small, the penalty loading in adaptive PENSE is very large.
This leads to adaptive PENSE most probably screening out these spuriously included predictors, as also showcased in the numerical experiments in Section 4.4.2. Therefore, adaptive PENSE overall has not only theoretically better variable selection properties, but variable selection is also more robust.

4.4 Numerical Experiments

Adaptive PENSE enjoys many important theoretical properties as the sample size increases and hyper-parameter $\lambda_{AS}$ decreases accordingly. How these properties translate to finite samples and different contamination is not answered by the theory. As with PENSE, the effects of contamination are bounded by theoretical results, but the magnitude is unknown in practice. Continuing the experiments in Section 3.6, the numerical studies presented in this section showcase the benefits of adaptive PENSE in practice.

In addition to the estimators considered in Section 3.6, adaptive PENSE is compared to several other estimators possessing the oracle property in one or more scenarios considered. Under the same conditions as adaptive PENSE, adaptive MM-LASSO can recover the true model with high probability even in scenarios where the error distribution has infinite variance, making it a suitable method in the scenarios considered here. Adaptive PENSE and the preliminary PENSE estimate are both tuned to a breakdown point of 33%, while adaptive MM-LASSO chooses the breakdown point automatically between 25–50% based on the degrees of freedom estimated by the S-Ridge (Smucler and Yohai 2017). Hyper-parameters for adaptive PENSE and adaptive MM-LASSO are selected via cross-validation to minimize the estimated $\tau$-size of the prediction error as defined in (3.8). The hyper-parameter $\zeta$ for adaptive PENSE is chosen via CV from $\zeta \in \{1, 2\}$. The grid for $\alpha_{AS}$ and $\lambda_{AS}$ is chosen as in Section 3.6.

The highly robust adaptive PENSE and adaptive MM-LASSO are compared to two other estimators which possess the oracle property, at least for Normal errors. I-LAMM (Fan et al.
2018) with Huber's loss function is also designed for error distributions more heavy-tailed than the Normal, with strong theoretical guarantees even for finite samples, but it does require the variance to be finite. For the numerical experiments here, I-LAMM is computed with the methods available in the R package from https://github.com/XiaoouPan/ILAMM using the ℓ1 penalty and default settings. Hyper-parameters are selected via 5-fold cross-validation by the procedure cvNcvxHuberReg, with the modification of using the mean absolute prediction error (MAPE) as scale metric to improve performance under heavy-tailed error distributions. Adaptive LS-EN, using LS-Ridge as preliminary estimate, is computed by the glmnet package in R, with 5-fold CV to select the hyper-parameters minimizing the MAPE.

4.4.1 Preliminary Estimate for Adaptive PENSE

Adaptive PENSE relies on a preliminary PENSE estimate, but theoretical results do not provide guidance on which hyper-parameters are appropriate to compute the preliminary estimate. As outlined in Section 4.1.1, a comprehensive search over all five hyper-parameters is infeasible.

The main goal of adaptive PENSE is to improve variable selection over PENSE while retaining good prediction performance. Figure 4.1 compares two different preliminary PENSE estimates: (i) PENSE (Ridge) computed for αS = 0 with λS selected via 5-fold CV and (ii) PENSE (CV) with αS and λS selected via 5-fold CV (this is the PENSE estimate shown in Section 3.6). The plots show the change in sensitivity and specificity of adaptive PENSE compared to the PENSE estimate in percentage points, with the dots representing the median and the error bars extending from the 25% to the 75% quantile.

As expected, specificity of adaptive PENSE is higher than that of PENSE, regardless of the preliminary estimate.
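As a point of reference for the measures compared in Figure 4.1, the following Python sketch computes sensitivity and specificity of a selected model and their change in percentage points. The function names and the toy coefficient vectors are hypothetical illustrations and are not taken from the simulation study.

```python
def selection_metrics(beta_hat, beta_true):
    """Return (sensitivity, specificity) of the support of beta_hat."""
    selected = [b != 0 for b in beta_hat]
    active = [b != 0 for b in beta_true]
    true_pos = sum(s and a for s, a in zip(selected, active))
    true_neg = sum((not s) and (not a) for s, a in zip(selected, active))
    n_active = sum(active)
    n_inactive = len(active) - n_active
    sensitivity = true_pos / n_active if n_active else float("nan")
    specificity = true_neg / n_inactive if n_inactive else float("nan")
    return sensitivity, specificity


def change_in_pp(adaptive, preliminary):
    """Difference between two proportions, in percentage points."""
    return 100.0 * (adaptive - preliminary)


# Hypothetical coefficients: 3 truly active and 5 truly inactive predictors.
beta_true = [1.0, -2.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
beta_pense = [0.9, -1.8, 0.4, 0.1, 0.0, -0.2, 0.05, 0.0]
beta_adaptive = [1.0, -1.9, 0.0, 0.0, 0.0, -0.1, 0.0, 0.0]

sens_p, spec_p = selection_metrics(beta_pense, beta_true)
sens_a, spec_a = selection_metrics(beta_adaptive, beta_true)
print("change in sensitivity [pp]:", change_in_pp(sens_a, sens_p))
print("change in specificity [pp]:", change_in_pp(spec_a, spec_p))
```

In this toy example the adaptive estimate gains specificity at the cost of some sensitivity, which is the trade-off summarized by the dots and error bars in the figure.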
In particular, when using PENSE (CV) as preliminary estimate, specificity must be at least as high as for PENSE because any predictor excluded by PENSE is necessarily also excluded by adaptive PENSE. Therefore, leveraging the PENSE (CV) estimate leads to slightly higher specificity than using PENSE (Ridge). At the same time, adaptive PENSE derived from PENSE (CV) identifies fewer truly relevant predictors because it can only select from those predictors previously selected by PENSE. With PENSE (Ridge), on the other hand, all predictors are considered when computing adaptive PENSE; hence the drop in sensitivity from PENSE is more moderate, and in many scenarios sensitivity of adaptive PENSE is even higher than sensitivity of PENSE.

It appears as if the benefits of adaptive PENSE decrease as more predictors are available, but specificity of PENSE is already quite high in these settings, leaving less room for improvement. In these higher-dimensional problems, PENSE and other regularized estimators have more difficulty identifying the relevant predictors. While adaptive PENSE in general reduces sensitivity even further, leveraging PENSE (Ridge) often leads to an estimate with higher sensitivity in high-dimensional settings.

In terms of prediction performance, adaptive PENSE performs similarly to PENSE, albeit slightly worse. Basing adaptive PENSE on PENSE (CV) tends to decrease prediction performance in the majority of situations, as shown in Figure C.7 in the appendix. Prediction performance of adaptive PENSE with PENSE (Ridge) as the preliminary estimate, on the other hand, is not substantially different from PENSE.

Overall, adaptive PENSE based on the PENSE (Ridge) preliminary estimate improves specificity without sacrificing as much sensitivity as when using PENSE (CV). Leveraging PENSE (Ridge) can even be beneficial for sensitivity in high dimensions and does not impede prediction performance of the estimate.
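The difference between the two preliminary estimates can be illustrated with a small Python sketch. It assumes the commonly used form of the adaptive elastic net penalty, in which the penalty on coefficient j is weighted by the preliminary estimate raised to the power −ζ; the exact definition used for adaptive PENSE is given earlier in this chapter and is not restated in this section, so the formula, function names and numbers below are assumptions for illustration only.

```python
import math


def adaptive_loadings(beta_prelim, zeta=1.0):
    """Penalty loadings |beta_prelim_j|^(-zeta); infinite if the preliminary coefficient is 0."""
    return [abs(b) ** (-zeta) if b != 0 else math.inf for b in beta_prelim]


def adaptive_en_penalty(beta, beta_prelim, lam, alpha, zeta=1.0):
    """Assumed adaptive EN penalty: lam * sum_j w_j * ((1 - alpha) / 2 * b_j^2 + alpha * |b_j|)."""
    total = 0.0
    for b, w in zip(beta, adaptive_loadings(beta_prelim, zeta)):
        if b == 0:
            continue  # convention: a zero coefficient contributes nothing
        if math.isinf(w):
            return math.inf  # predictor excluded by the preliminary estimate
        total += w * (0.5 * (1.0 - alpha) * b * b + alpha * abs(b))
    return lam * total


# Hypothetical preliminary estimates for three predictors.
prelim_cv = [1.2, 0.0, 0.3]       # sparse preliminary estimate excludes predictor 2
prelim_ridge = [1.1, 0.05, 0.25]  # Ridge-type estimate keeps all predictors

candidate = [1.0, 0.4, 0.2]       # candidate model including predictor 2
print(adaptive_en_penalty(candidate, prelim_cv, lam=0.1, alpha=0.5))     # infinite
print(adaptive_en_penalty(candidate, prelim_ridge, lam=0.1, alpha=0.5))  # finite
```

Under this assumed form, a predictor excluded by a sparse preliminary estimate can never enter the adaptive model, whereas a Ridge-type preliminary estimate only penalizes it more heavily, in line with the sensitivity results discussed above.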
In applications where the costs associated with including irrelevant predictors are prohibitive, PENSE (CV) may be the more appropriate preliminary estimate for adaptive PENSE. In general, however, adaptive PENSE based on PENSE (Ridge) leads to an overall more substantial improvement of variable selection properties with prediction performance similar to PENSE. In subsequent numerical experiments, adaptive PENSE is therefore reported with PENSE (Ridge) as preliminary estimate.

[Figure 4.1, plot not reproduced. Panels: very sparse and sparse scenarios, each with no contamination and with 25% contamination. Horizontal axis: number of predictors (thereof active). Vertical axis: difference in variable selection in percentage points. Legend: measure (specificity, sensitivity) and preliminary estimate (PENSE (CV), PENSE (Ridge)).]
Figure 4.1: Comparison of variable selection performance of adaptive PENSE using different preliminary estimates. Data is simulated according to schemes VS1-* in panels on the left and MS1-* in panels on the right, with n = 100 and 25% variance explained by the true model. Results for "no contamination" (top) show the median and inter-quartile range over 100 replications, while results on the bottom summarize 50 replications for each of 6 scenarios with different contamination settings.

4.4.2 Effects of Good Leverage Points

Combining the robust S-loss with the adaptive EN penalty promises more robust variable selection in the presence of good leverage points, as detailed in Section 4.3.1. To support this statement with empirical results, data is generated according to scheme MS1-MH(–, k_l) with p = 32 predictors and n = 100 observations with an adapted contamination model. All
All100 response values are generated according to the true model, but in 10% of observations,some predictor values are contaminated byx˜i,15+i = xi,15+iklmaxi′=1,...,ny2iy215+ifor i = 1P O O O P 10 (4.3)with y2i the squared Mahalanobis distance of observation i, relative to the 10 contaminatedpredictors, as in (A.3). In other words, each of the first 10 observations has a single predictorwith unusually large value, with the severity of leverage controlled by parameter kl. Thefirst 17 predictors are truly active; hence this contamination model introduces leveragepoints in 2 truly active predictors and 8 truly inactive predictors.754.4. NUMERICAL EXPERIMENTSResults are shown in Figure 4.2, underlining that PENSE estimates are considerablyaffected by these “good” leverage points. Sensitivity and specificity are calculated separatelyfor contaminated (top) and uncontaminated predictors (bottom). All estimates select thetruly active predictors with contamination in the vast majority of replications, regardlessof the severity of leverage introduced. As predicted, PENSE almost always selects alltruly irrelevant predictors with contamination. Adaptive PENSE using PENSE with α =0 as preliminary estimate, on the other hand, shows highly consistent variable selectionperformance over all leverage parameters, kl. Adaptive PENSE is able to identify most trulyactive predictors (contaminated or not), while also screening out large parts of the trulyinactive predictors. Sensitivity of I-LAMM estimates drops drastically as the severity of theleverage points increases, with specificity increasing in tandem. Therefore, in the presence ofvery severe leverage points, I-LAMM selects only the contaminated truly active predictors,everything else is excluded from the model. 
Non-robust (adaptive) LS-EN show fairly good variable selection with high specificity for both contaminated and uncontaminated predictors, but sensitivity follows a similar trajectory as for I-LAMM, albeit with a more gradual decrease.

Good leverage points seem to be more helpful for non-robust estimates, up to the point where the leverage becomes too severe and overshadows the other truly active predictors. Adaptive PENSE maintains a high level of sensitivity and specificity for any severity of good leverage points and, in contrast to non-robust estimators, these variable selection properties also persist in the presence of other contamination.

[Figure 4.2, plot not reproduced. Panels: sensitivity and specificity (columns) for contaminated and uncontaminated predictors (rows). Horizontal axis: severity of leverage, k_l (2 to 64). Vertical axis: median variable selection performance (0% to 100%). Methods: LS-EN, Ada. LS-EN, I-LAMM, PENSE, Ada. PENSE.]
Figure 4.2: Effect of high-leverage points on the sensitivity and specificity of variable selection. Median values over 50 replications of these measures are reported separately for predictors containing contaminated values and predictors free from any contamination. Data is generated according to scheme MS1-MH* with p = 32, n = 100, 75% variance explained by the true model and 10% contamination introduced according to (4.3).

4.4.3 Overall Effect of Contamination

Adaptive PENSE performs reliably under the presence of good leverage points. To assess the impact of a greater variety of contamination, adaptive PENSE and other variable selection consistent estimators are computed in the same scenarios as considered in Section 3.6.

Figure 4.3 summarizes the prediction performance for scenarios with n = 100 observations. I-LAMM, with Huber's loss and LASSO penalty, is not robust towards high leverage points in the predictors but outperforms robust estimators for Normal errors and no contamination. When the error distribution is heavy-tailed or when gross contamination is introduced, predictions from I-LAMM estimates tend to give higher errors than predictions from PENSE or adaptive PENSE. Across all scenarios, adaptive PENSE estimates have very similar predictive power as PENSE estimates, as evident from the results reported in Appendix C.2.1. Adaptive MM-LASSO performs as well as adaptive PENSE in very sparse scenarios, but more active predictors are better handled by adaptive PENSE. Conclusions for the estimation accuracy reported in Appendix C.2.3 coincide with those for prediction performance.

Variable selection performance of adaptive PENSE, shown in Figure 4.4, underscores the conclusions from previous experiments. Adaptive PENSE performs similarly to I-LAMM in very sparse scenarios with no contamination, but adaptive PENSE is more robust towards heavy-tailed errors and leverage points. Compared to PENSE, adaptive PENSE estimates screen out more truly irrelevant predictors, at the cost of missing some truly relevant ones. Noting that in very sparse scenarios (Figure 4.3(a)), the introduced outliers are more extreme than in sparse scenarios (Figure 4.3(b)), adaptive PENSE has almost the same sensitivity as PENSE under the presence of severe leverage points combined with gross outliers, but adaptive PENSE excludes many more irrelevant predictors. Adaptive MM-LASSO, as shown in Appendix C.2.2, has substantially lower sensitivity than adaptive PENSE in the vast majority of scenarios. Compared to other variable selection consistent estimators, adaptive PENSE tends to retain more truly active predictors while still screening out most of the irrelevant predictors. Variable selection properties of adaptive PENSE are less affected by outliers and heavy-tailed errors than I-LAMM estimates or MM-LASSO estimates. Adaptive PENSE strikes a balance between high specificity achieved by LASSO-type estimators and high sensitivity of PENSE estimates. This is especially useful in applications where a small number of false negatives can be tolerated in exchange for substantially reducing the number of false positives.

[Figure 4.3, plot not reproduced. Panels: no contamination and 25% contamination (columns), Normal and Cauchy error distributions (rows). Horizontal axis: number of predictors (thereof active). Vertical axis: relative scale of prediction error. Methods: I-LAMM, Ada. PENSE, PENSE. (a) Very sparse scenarios VS1-LT* and VS1-HT*. (b) Sparse scenarios MS1-LT* and MS1-HT*.]
Figure 4.3: Prediction performance of regression estimates in different scenarios with a sample size of n = 100. The horizontal axis in each panel shows the total number of predictors, while the vertical axis in each panel shows the root mean square prediction error (for Normal errors) or the τ-scale of the prediction errors.

[Figure 4.4, plot not reproduced. Panels: no contamination and 25% contamination (columns), Normal and Cauchy error distributions (rows). Horizontal axis: number of predictors (thereof active). Vertical axis: specificity (downwards) and sensitivity (upwards). Methods: I-LAMM, Ada. PENSE, PENSE. (a) Very sparse scenarios VS1-LT* and VS1-HT*. (b) Sparse scenarios MS1-LT* and MS1-HT*.]
Figure 4.4: Sensitivity and specificity of regression estimates in different scenarios with a sample size of n = 100. The horizontal axis in each panel shows the total number of predictors. The vertical axis in each panel is split in two halves: sensitivity (i.e., the proportion of correctly identified active predictors) is shown on the top half, and specificity (i.e., the proportion of correctly identified inactive predictors) is shown downwards with perfect specificity (100%) on the bottom. Solid vertical lines show the range of the inner 50%, while the dashed lines extend from the 5% to the 95% quantile.

4.5 Biomarkers for Cardiac Allograft Vasculopathy

In Cohen Freue et al. (2019) we demonstrate the usefulness of PENSE in clinical biomarker discovery studies. In this application, the overarching goal is to identify a small set of proteins which help to detect whether a patient suffers from cardiac allograft vasculopathy (CAV). CAV is a common complication in patients who received a cardiac transplant. Almost 50% of recipients develop CAV in the years following transplantation (Cohen Freue et al. 2019), accounting for almost 15% of deaths in heart transplant recipients who survived the first year after transplantation (Lin et al. 2013). In clinical practice, transplant recipients are monitored at least annually for the onset of CAV. Diagnostics typically rely on coronary angiography, measuring the narrowing of arteries supplying oxygenated blood to the heart (Schmauss and Weis 2008). Coronary angiography is an invasive procedure prone to complications (Lin et al. 2013).
A simple blood test targeting specific proteins in the plasma could potentially reduce the risks to patients substantially and improve health outcomes of heart transplant recipients.

The data used here was first analyzed in Lin et al. (2013) and later in Cohen Freue et al. (2019), comprising information on 37 cardiac transplant recipients. All 37 patients were assessed for CAV by measuring the maximum percentage of diameter stenosis (Max %DS) in the left anterior descending (LAD) artery (Lin et al. 2013). The original proteomic data consists of measurements of hundreds of proteins detected in blood plasma samples from the 37 recipients. Following the analysis in Cohen Freue et al. (2019), I use only the 81 proteins reliably detected across all plasma samples.

The statistical goal is to predict the Max %DS in the LAD through a linear model of the measured protein levels such that only some of the proteins are included in the linear relationship. Limiting the number of relevant proteins is important for a viable blood test, as the costs of a test targeting many proteins would prohibit widespread use.

[Figure 4.5, plot not reproduced. Panels: protein ECM1 and protein LUM. Horizontal axis: protein level. Vertical axis: Max %DS in the LAD. Estimates: LS, MM.]
Figure 4.5: Univariate regression estimates for regressing the maximum percentage of diameter stenosis (Max %DS) in the LAD artery on the level of proteins ECM1 and LUM in the CAV case study.

Exploratory analysis of the data suggests that the measurements of Max %DS in the LAD, but also some protein levels, contain possibly contaminated values. Figure 4.5, for instance, shows the results of univariate regressions of the response variable on the measured levels of proteins ECM1 and LUM. The robust univariate MM-estimate detects a negative relationship between the protein levels and Max %DS in the LAD vessel in the sample at hand.
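The sensitivity of a univariate fit to a few unusual patients, as in Figure 4.5, can be mimicked on synthetic data. The Python sketch below uses the Theil–Sen slope (median of pairwise slopes) as a simple robust stand-in; it is not the MM-estimator used for the figure, and all data values are made up for illustration.

```python
from itertools import combinations
from statistics import mean, median


def ls_slope(x, y):
    """Least squares slope of the simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def theil_sen_slope(x, y):
    """Median of all pairwise slopes; a robust stand-in, not an MM-estimate."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)


# 20 synthetic "regular" observations following a negative trend with small,
# deterministic perturbations ...
x = [0.8 + 0.02 * i for i in range(20)]
y = [0.6 - 0.3 * xi + 0.005 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
# ... plus 3 observations with high predictor value and high response.
x += [1.50, 1.55, 1.60]
y += [0.60, 0.65, 0.70]

print("LS slope:       ", ls_slope(x, y))         # positive, driven by 3 points
print("Theil-Sen slope:", theil_sen_slope(x, y))  # negative, follows the majority
```

Three observations out of 23 are enough to reverse the sign of the least squares slope in this toy example, while the robust slope follows the bulk of the data.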
The classical least squares estimate (LS), on the other hand, estimates a positive relationship between ECM1 and the response variable and a substantially smaller effect of LUM. For both proteins, a few patients with unusually severe narrowing of the LAD combined with a comparatively high abundance of proteins ECM1 and LUM in their blood plasma excessively affect the LS estimate. Several similar instances of contamination in the sample cast doubt on the appropriateness of non-robust methods for identifying relevant proteins and quantifying their effect.

The prediction performance of several estimates in the CAV study is compared by nested cross-validation. Specifically, the sample of 37 observations is split into 7 CV folds (the "outer" folds). Within each outer fold, an "inner" 7-fold CV is used to select hyper-parameters individually for each estimator. To counter the inherent variability in cross-validation for robust estimators, the inner CV for all estimators is repeated 50 times (see also Chapter 6 for details on repeated CV for PENSE and adaptive PENSE). As in the numerical experiments, PENSE and adaptive PENSE choose hyper-parameters to minimize the τ-size of the prediction error, while other methods minimize the mean absolute prediction error. With these selected hyper-parameters, the left-out observations from the outer fold are predicted and the scale of the prediction error is recorded. The outer CV is replicated 100 times to assess overall prediction performance of the considered estimators in the CAV study.

Results of nested CV are shown in Figure 4.6(a). The difference in prediction performance between the estimates is not very pronounced, but nevertheless noticeable. This is in line with the prediction performances reported in Cohen Freue et al. (2019), although results reported here suggest slightly better performance for all estimators because repeating the inner CV leads to more stable hyper-parameter selection. Adaptive PENSE leads on average to better prediction performance than the other methods considered. Adaptive LS-EN performs poorly in the CAV study, much like in the numerical experiments under the presence of contamination. The initial LS-Ridge estimate is likely affected by contamination, and hence "leveraging" this estimate amplifies the effect of contamination. The number of relevant predictors selected varies between CV splits, but in general adaptive PENSE and I-LAMM select far fewer proteins than the other methods.

Each method is also applied to the full sample, again using repeated 7-fold CV to select hyper-parameters. For all but LS-EN, the hyper-parameters are not selected to achieve minimum scale of the prediction error, but rather to lead to the most parsimonious model with a scale of the prediction error not substantially worse than the minimum (within 1/2 of the standard error of the minimum). For LS-EN, this "half standard error rule" always leads to the empty model, a typical observation with LS-EN under high noise in the response variable. Prediction performance may be similar, but the proteins selected by the different estimates vary substantially. Non-robust LS-EN and adaptive LS-EN select 21 and 20 proteins, respectively, with adaptive LS-EN dropping only a single protein. Similarly, PENSE detects 20 relevant proteins, 13 overlapping with (adaptive) LS-EN. Adaptive PENSE and I-LAMM select the smallest number of proteins among the considered estimators, 14 and 12, respectively, but based on the prediction performance estimated before, the panel identified by adaptive PENSE, listed in Table 4.1, is likely more relevant for predicting CAV. Half of the proteins identified by adaptive PENSE overlap with proteins selected by non-robust methods, but adaptive PENSE detects several novel proteins.

In Cohen Freue et al. (2019), we improve upon the model fitted by PENSE via a subsequent M-step (PENSEM), selecting a total of 15 proteins. The proteins selected by adaptive
The proteins selected by adaptivePENSE, PENSEM, and Lin et al. (2013) are listed in Table 4.1. Three proteins are selectedby all three methods, while adaptive PENSE and PENSEM overlap in four additional pro-814.5. BIOMARKERS FOR CARDIAC ALLOGRAFT VASCULOPATHYteins. Interestingly, adaptive PENSE selects the extracellular matrix proteins ECM1 andLUM, which have been linked to coronary artery disease (Zhao et al. 2016) and formationof new blood vessels (Neve et al. 2014). Lumican (LUM) is also determined relevant inLin et al. (2013), but ECM1 is selected only by robust estimators, potentially because ofcontamination highlighted in Figure 4.5 transmogrifies the predominantly negative effectof ECM1 on the response into a positive effect. Adaptive PENSE also detects some novelproteins not previously associated with CAV, most notably Hemopexin (HPX). Hemopexinhas been targeted to improve cardiovascular function (Vinchi et al. 2013) and is associatedwith several inflammatory diseases (Mehta and Reddy 2015).We show that the PENSEM estimator can lead to improved prediction performance overPENSE and other robust estimators (Cohen Freue et al. 2019). The M-step is supposed toincrease efficiency of the initial S-estimator, similar to the idea of MM-LASSO and classicalMM-estimators. However, just like MM-LASSO, the M-step for PENSEM hinges on theaccuracy of the residual scale estimated by the initial S-estimator. Especially in higherdimensions or if the true error distribution is heavy tailed, however, the scale estimatederived from PENSE or S-Ridge may not be relied upon. This can lead to severe problemsfor PENSEM and MM-LASSO, as highlighted in the numerical experiments in this and theprevious chapter.Compared to the model fitted by PENSEM, adaptive PENSE detects a stronger signalfor fitting and predicting CAV using a smaller panel of proteins. With adaptive PENSE,the maximum percentage diameter stenosis in the LAD vessel can be fitted well as shown inFigure 4.6(b). 
Additionally, the robust nature of the estimate allows identification of several patients with unusual stenosis. Patients with residuals located in the shaded regions of Figure 4.6(b) are more than two standard errors (estimated by the M-scale of the residuals) away from the diagonal and can be considered outliers. The adaptive PENSE estimate suggests that in six patients the measured Max %DS is suspiciously different from what could be expected based on their proteomic profile. Most of these patients are also flagged by PENSEM as having unusual response values, but more severe and mild stenosis is fitted substantially better by adaptive PENSE than by PENSEM. A follow-up measurement using more accurate intravascular ultrasound revealed that three patients with initially no stenosis detected, B-584, B-527 and B-561 (initially measured in weeks 51 or 52 after transplant), had indeed developed mild stenosis of the LAD artery of about 16 Max %DS, very close to the values fitted by adaptive PENSE. Adaptive PENSE identifies a small set of proteins leading to superior prediction performance and a better fit to the data than other methods. With the demonstrated robustness of variable selection, adaptive PENSE is an important addition to the toolbox for biomarker discovery.

[Figure 4.6, plot not reproduced. (a) Estimated prediction performance: scale of prediction error [Max %DS in the LAD] for LS-EN, Ada. LS-EN, I-LAMM, PENSE and Ada. PENSE. (b) Observed vs. fitted Max %DS in the LAD from adaptive PENSE, with labeled patients B-381 (W51), B-506 (W105), B-527 (W51), B-561 (W52), B-584 (W52) and B-585 (W50).]
Figure 4.6: Results of the CAV study: (a) the scale of the prediction error of the maximum percentage of diameter stenosis (Max %DS) for several estimators in the CAV study and (b) observed values versus values fitted by adaptive PENSE. Prediction errors in (a) are estimated via nested 7-fold cross-validation, repeated 100 times. The shaded regions in (b) depict residuals farther than twice the standard error away from the fitted value, indicating unusually large residuals. Hyper-parameters for the adaptive PENSE fit in (b) are selected by 7-fold CV, repeated 100 times.

Table 4.1: Proteins identified by adaptive PENSE to predict Max %DS in the LAD artery, compared to proteins selected by other methods.

Gene symbol   Protein name                     Adaptive PENSE   PENSEM   Lin et al. (2013)
AMBP          Protein AMBP                     ✓                ✓        ✓
APOE          Apolipoprotein E                 ✓                ✓        ✓
C4B;C4A       Complement C4-B/C4-A             ✓                ✓        ✓
ECM1          Extracellular matrix protein 1   ✓                ✓
F2            Prothrombin (Fragment)           ✓                ✓
HBA2;HBA1     Hemoglobin alpha-2               ✓                ✓
HBD           Hemoglobin subunit delta         ✓                ✓
C7            Complement component C7          ✓                         ✓
LUM           Lumican                          ✓                         ✓
C1R           Complement C1r subcomponent      ✓
HABP2         Hyaluronan-binding protein 2     ✓
HPX           Hemopexin                        ✓
SERPINA3      Alpha-1-antichymotrypsin         ✓
SERPINC1      Antithrombin-III                 ✓

4.6 Conclusions

The elastic net S-estimator, PENSE, introduced in Chapter 3 has highly competitive prediction performance even under the presence of adverse contamination. Furthermore, PENSE is demonstrated to identify the vast majority of truly relevant predictors, but PENSE estimates often wrongly include a very high number of irrelevant predictors. The adaptive elastic net S-estimator, adaptive PENSE, is devised out of the need to control the excessive rate of false discoveries made by PENSE estimates.

Adaptive PENSE is shown to possess two important asymptotic properties missing from PENSE: variable selection consistency and the oracle property. In Section 4.2 it is proved that adaptive PENSE estimators are variable selection consistent even in settings where the error distribution does not have finite variance. Variable selection consistency is the key ingredient for showing that adaptive PENSE estimates of the coefficients of truly active predictors are as precise as if the truly active predictors were known in advance. Therefore, even in problems with many available predictors, coefficients of the active predictors are accurately estimated.

The adaptive elastic net penalty also improves robustness of variable selection, as outlined in Section 4.3.1 and demonstrated numerically in Section 4.4.2. Contamination of inactive predictors in observations which follow the true linear model causes PENSE estimates to wrongly include these predictors in the model. This leads to a breakdown of variable selection of PENSE, where contamination with leverage points severely degrades specificity. The robustness of the S-loss, however, ensures that the coefficient estimates for truly inactive predictors with excessive leverage remain small. By leveraging a robust PENSE estimate, adaptive PENSE is able to screen out many of these spuriously selected predictors. Empirical observations suggest PENSE with the Ridge penalty (αS = 0) is the appropriate preliminary estimate in most applications. With the Ridge penalty, PENSE can be computed more efficiently than with a non-smooth penalty (αS > 0) and hyper-parameter selection is substantially less demanding.
Furthermore, sensitivity decreases only moderately compared to the best PENSE estimate while specificity increases.

Adaptive PENSE's increased robustness towards leverage points is an important property for real-world applications. Section 4.5 revisits a biomarker discovery study with the goal of identifying proteins in the human blood plasma which help to predict cardiac allograft vasculopathy, a major complication after heart transplantations. In Cohen Freue et al. (2019) we use PENSE and a subsequent M-step to determine a panel of 15 possibly relevant proteins. The M-step proves to be challenging in this application due to the difficulty of estimating the scale of the residuals accurately. When applying adaptive PENSE to the same data set, a slightly different panel of 14 possibly relevant proteins is uncovered. The adaptive PENSE estimate leads to superior prediction performance in the study and at the same time fits the data better than competing robust and non-robust methods.

Theoretical results expose the major drawback of regularized S-estimators: substantially lower asymptotic efficiency than regularized M-estimators for light- and moderate-light-tailed error distributions. Empirical results are in line with this observation, although the differences between regularized M- and S-estimators in finite samples are far less pronounced than suggested by the theory. The following chapter discusses challenges of robustly estimating the residual scale and thereby sheds light on reasons why regularized M-estimators may not provide the gains in efficiency in finite samples as promised by the theory.

Chapter 5
Residual Scale Estimation

S-estimators of regression are highly robust to aberrant contamination in the data and heavy-tailed error distributions. In Chapters 3 and 4 I show that this also holds for PENSE and adaptive PENSE, even in high dimensions.
The apparent downside of S-estimators, already discussed in Section 2.2, is their low efficiency under the Normal model. An iconic idea in robust statistics is to follow the S-estimator by an additional M-step (Yohai 1987). The resulting MM-estimator of linear regression inherits the robustness properties from the initial estimator but can be tuned to achieve high efficiency, arbitrarily close to that of the LS-estimator. In their most basic form, MM-estimators are defined by the following sequence of steps:

Step 1 Compute a highly robust and strongly consistent estimate of regression, e.g., the PENSE estimate $\hat{\theta}_S$.

Step 2 Compute the M-scale of the residuals from the estimate fitted in step 1, $\hat{\sigma}_S = \hat{\sigma}_M(y - \hat{\mu}_S - X\hat{\beta}_S)$.

Step 3 Using the S-estimate $\hat{\theta}_S$ as initial estimate, find a local minimum $\hat{\theta}_{MM}$ of

$$L_M(y, X\beta + \mu; \hat{\sigma}_S) = \frac{1}{n} \sum_{i=1}^{n} \rho_M\left(\frac{y_i - \mu - x_i^\intercal \beta}{\hat{\sigma}_S}\right)$$

which improves upon the initial estimate, i.e., $L_M(y, X\hat{\beta}_{MM} + \hat{\mu}_{MM}) \le L_M(y, X\hat{\beta}_S + \hat{\mu}_S)$.

Here, $\rho_M$ is a bounded $\rho$ function according to [R1]–[R3] which is dominated by the $\rho$ function used to compute the M-scale (2.8), i.e., $\rho_M(t) \le \rho(t)$ for all t.

Evidently, the M-step is computationally cheap, given that only a single starting point needs to be considered and the objective function is separable over the observations. In Cohen Freue et al. (2019) we adopt this idea to improve upon the PENSE estimate by a subsequent M-step, called PENSEM. Smucler and Yohai (2017) base their MM-LASSO on the same principle, but the execution is slightly different from what is done for PENSEM. These differences highlight some of the challenges of translating the idea of MM-estimators to a high-dimensional setting using robust regularized estimators.

None of the three steps for computing MM-estimates can be applied to regularized estimators without modification. In step 1, the question is what hyper-parameters should be selected for a robust regularized estimate of regression.
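The three steps listed above can be sketched in Python for the simplest possible case, an unregularized intercept-only model. This is only an illustration under several assumptions: the sample median stands in for the S-estimate in step 1, Tukey's bisquare function is used for both the M-scale and the M-step, the constants 1.54764 (50% breakdown point, rather than the 33% used for PENSE) and 4.685 (95% efficiency at the Normal) are the usual textbook choices, and all function names are hypothetical.

```python
import random
from statistics import median


def rho_bisquare(t, c):
    """Tukey's bisquare rho function, scaled so that rho(t) = 1 for |t| >= c."""
    if abs(t) >= c:
        return 1.0
    return 1.0 - (1.0 - (t / c) ** 2) ** 3


def m_scale(resid, delta=0.5, c=1.54764, tol=1e-10, max_iter=500):
    """Step 2: solve mean(rho(r_i / s)) = delta for s by fixed-point iteration."""
    s = median(abs(r) for r in resid) / 0.6745
    for _ in range(max_iter):
        avg = sum(rho_bisquare(r / s, c) for r in resid) / len(resid)
        s_new = s * (avg / delta) ** 0.5
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


def m_loss(y, mu, scale, c=4.685):
    """M-loss of the intercept-only model for a fixed residual scale."""
    return sum(rho_bisquare((v - mu) / scale, c) for v in y) / len(y)


def m_step(y, mu_init, scale, c=4.685, n_iter=100):
    """Step 3: iteratively reweighted means, started at the initial estimate."""
    mu = mu_init
    for _ in range(n_iter):
        w = [max(0.0, 1.0 - ((v - mu) / (c * scale)) ** 2) ** 2 for v in y]
        if sum(w) == 0.0:
            break
        mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
    return mu


# Synthetic data: 50 observations around 0 and 10 gross outliers at 20.
rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(50)] + [20.0] * 10

mu_s = median(y)                              # step 1 (stand-in for an S-estimate)
sigma_s = m_scale([v - mu_s for v in y])      # step 2
mu_mm = m_step(y, mu_s, sigma_s)              # step 3
print(mu_s, sigma_s, mu_mm)
```

The scale from step 2 enters step 3 only as a fixed constant, which is why the quality of the M-step depends so strongly on the quality of the residual scale estimate.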
For MM-LASSO, the choice is to compute the S-Ridge estimate, i.e., using αS = 0, optimizing the penalization level for prediction performance. PENSEM, on the other hand, uses the PENSE estimate with both αS and λS optimized for prediction performance. Others propose to not use a regularized estimate in step 1 but an unregularized MM-estimate (Arslan 2016), which is only possible for low-dimensional problems.

Step 3 raises similar questions as step 1 about an appropriate choice of hyper-parameters and whether a local minimum close to the estimate from step 1 is a sensible choice. Due to vastly different scales of the loss functions in the two steps, the penalization level selected in step 1 is in general not a reasonable choice for the M-step. Both MM-LASSO and PENSEM carry out a separate hyper-parameter search for the M-step; MM-LASSO for the penalization level, PENSEM for α and λ. The theoretical results in Yohai (1987) justify using only the consistent and robust estimate from step 1 as starting point for computing the MM-estimate, as the local minimum uncovered has the same asymptotic properties as the global minimum. No such results are available for the regularized M-step, but MM-LASSO and PENSEM nevertheless follow the same principle and do not perform an exhaustive search for good initial estimates, to restrain the computational overhead. Despite this leap of faith, both MM-LASSO and PENSEM show improved efficiency over the initial S-estimate, but not in every setting. Most concerning is the observation that MM-LASSO and PENSEM seem to be much more affected by contamination in some situations.

The main reason why regularized M-steps do not always improve efficiency, and sometimes seemingly break down, is the difficulty posed by step 2. For the M-step to perform as expected under the assumed model, the $\rho_M$ function, more specifically its cutoff value, is chosen based on the probabilistic limit of $\hat{\sigma}_S$.
This of course requires the assumed model to hold for the majority of observations, but the greater challenge in practical applications is the bias of the estimate $\hat\sigma_S$. Even if $\hat\sigma_S$ converges almost surely to a fixed limit, the finite-sample bias may be arbitrarily large. If the bias in finite samples is too large, the chosen $\rho_M$ may not deliver the promised gains in efficiency. Especially in higher dimensions, the bias in the residual scale estimate can be unacceptably large.

5.1 The Problem in High Dimensions

Estimating the scale of the residuals in high dimensional linear models is known to be difficult and prone to bias. Mammen (1996) already paints a bleak picture, showing how fast the bias of the empirical distribution of the residuals in a $p$-dimensional linear regression model increases with $p$. The bias in the empirical residuals, if not corrected, translates to a biased scale estimator. The problem is even amplified by the use of regularized and/or robust estimators, but it has only recently been attracting attention (e.g., Fan et al. 2012; Dicker 2014; Chatterjee and Jafarov 2015; Reid et al. 2016; Chen et al. 2018; Tibshirani and Rosset 2019).

Regularized estimators have been studied extensively, but attention was mostly directed at prediction and variable selection performance of these estimators. Perhaps pushing the issue to the sidelines even more, theoretical properties of regularized estimators do not depend on an estimate of the residual scale. With the emergence of more literature on post-selection inference, however, residual scale estimation has become more closely investigated.

The review paper by Reid et al. (2016) and the recent proposals by Yu and Bien (2019) and Chen et al. (2018) highlight the numerous challenges in estimating the residual variance using regularized estimators. As already outlined in Fan et al. (2012), residual scale estimation with regularized estimators is impeded by the inherent bias of the estimates.
The main sources of the bias are the penalization of the coefficients and data-driven selection of the hyper-parameters with a goal of good prediction. Unfortunately, these two sources of bias work in tandem, accentuating their effect on the residual scale estimate. In high dimensions, predictors spuriously correlated with the response add to the problem. The larger the number of irrelevant predictors, the greater the chances of spurious correlation and hence overfitting the response.

The problem is not restricted to classical, least-squares-based estimators, but affects robust estimators even more. For non-regularized MM-estimators, Maronna and Yohai (2010) show how serious underestimation of the error scale by the M-scale from the residuals of the S-estimate can be. Even in a setting considered low dimensional in this work (n = M0 and p = 1M) the estimate of the error scale is below half of the true error scale almost 50% of the time. These results are for non-regularized MM-estimators and do not account for the impact of regularization.

The down-stream effect of a poor scale estimate on the M-step can be devastating. The M-loss function depends on the boundedness of the $\rho_M$ function to protect against gross outliers, while at the same time behaving similarly to the LS-loss for small residuals to ensure efficiency. Under a severe underestimation of the error scale, with $\hat\sigma_S$ far below the true error scale, the scaled residuals $(y_i - \mathbf{x}_i^\top\boldsymbol{\beta})/\hat\sigma_S$ are artificially inflated. This “pushes” many scaled residuals into the bounded region of the $\rho_M$ function, treating them as outlying. Therefore, the M-step does not improve efficiency because a large proportion of actually uncontaminated observations are incorrectly down-weighted. Severe overestimation of the error scale, on the other hand, shrinks the scaled residuals towards 0, neutralizing the boundedness of $\rho_M$. In this case, outlying observations are not detected as such and can grossly affect the M-estimate.
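The mechanism can be illustrated with a minimal sketch. It assumes Tukey's bisquare function as $\rho_M$ with cutoff 4.685 (a common choice, not taken from the text), and the residuals and scale values are made up for illustration. The weight of an observation in the M-step is $\psi(t)/t$ for the scaled residual $t$, which is zero beyond the cutoff.

```python
def bisquare_weight(residual, scale, cutoff=4.685):
    """Weight psi(t)/t of Tukey's bisquare function for t = residual / scale.

    The cutoff 4.685 is an assumed, commonly used value; it is not taken
    from the text above.
    """
    t = residual / scale
    if abs(t) >= cutoff:
        return 0.0  # bounded region of rho: observation is fully rejected
    return (1.0 - (t / cutoff) ** 2) ** 2


# Hypothetical residuals: five clean observations (true scale 1) and one outlier.
clean_residuals = [-1.2, -0.4, 0.3, 0.9, 1.5]
outlier_residual = 12.0

for scale in (0.25, 1.0, 4.0):  # under-estimated, correct, over-estimated scale
    rejected = sum(bisquare_weight(r, scale) == 0.0 for r in clean_residuals)
    outlier_weight = bisquare_weight(outlier_residual, scale)
    print(f"scale={scale}: clean observations rejected={rejected}, "
          f"outlier weight={outlier_weight:.3f}")
```

With the under-estimated scale, some clean observations receive zero weight; with the over-estimated scale, the outlier keeps a clearly positive weight and can influence the fit.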
An inaccurate estimation of the error scale can thus either lead to a decrease in efficiency compared to the initial S-estimator, or even jeopardize the robustness of the M-estimator.

Estimating the residual scale with PENSE suffers from the bias inflicted by the M-scale in addition to the bias introduced by the penalty function and data-driven hyper-parameter selection. As depicted in Figure 5.1 for simulated data, the effects of a poor scale estimate on the subsequent M-step, PENSEM, are worrisome. Firstly, the plots clearly show the prevalence of severe underestimation and overestimation. Underestimation is commonly observed even without contamination, but the scale is often severely overestimated in the presence of contamination, especially when combined with heavy-tailed errors. It is evident that gross under- or overestimation of the residual scale leads to a degradation of prediction performance of PENSEM, with distressing effects under the presence of contamination. The conclusions are the same for any regularized M-estimator relying on an initial scale estimate in high dimensions: in the majority of cases the M-step improves efficiency and leads to better estimation and prediction, but in an unsettlingly large number of instances the M-estimator is either less efficient than the initial S-estimator, or worse, seriously affected by contamination.

Without an improved residual scale estimate in high dimensional problems, results from PENSEM or other regularized redescending M-estimators may be unreliable. As an ad-hoc solution for unpenalized MM-estimators, Maronna and Yohai (2010) suggest a multiplicative correction to increase the residual scale estimated from the S-estimate. Smucler and Yohai (2017) use this correction for the MM-LASSO estimator, but empirical results presented here suggest the adjustment does not work well for regularized estimators. The residual scale estimate often overestimates the true scale, as suggested by the simulation results reported before; further inflating the scale estimate in these situations exacerbates the problem. For non-robust estimators, several strategies for correcting the bias in the scale estimate have been proposed in the literature. The majority of these proposals are based on the idea of splitting the data into non-overlapping chunks.

[Figure 5.1. Panels: no contamination (left) and 10% contamination (right); error distribution Normal (top) and Cauchy (bottom). Horizontal axis: relative residual scale estimate. Vertical axis: relative scale of the prediction error.]
Figure 5.1: Prediction performance of the PENSEM estimate as a function of the residual scale estimated by PENSE. Hyper-parameters are selected via 5-fold CV. The residual scale estimate on the horizontal axis and the scale of the prediction error are reported relative to the true M-scale of the residuals. Data is generated according to scheme VS1-LT* (top) and VS1-HT* (bottom) for $n = 100$ and $p \in \{50, 100\}$. The true model explains 83% of the variation and results under contamination (right) consider scenarios with 10 different vertical outlier positions and $k_l = 6$.

5.2 Data-Splitting Strategies

One of the driving forces behind the bias in regularized estimators is data-driven hyper-parameter selection. With penalization leading to an underestimation of the coefficients, cross-validation or similar strategies to give good prediction performance compensate for the biased coefficients by selecting a penalization level which is too small to screen out spuriously correlated predictors.
The regularized estimate computed on the entire data set with this small penalization level will typically include some of these irrelevant predictors. Just by chance of observing these spuriously correlated predictors, the fitted model explains more variation in the response than the true model, leading to an underestimation of the residual scale.

Fan et al. (2012) therefore propose refitted cross-validation (RCV), based on the assumption that if a data set is split into multiple chunks, the chance for the same predictor to be spuriously correlated with the response in each chunk is minuscule. Following the idea of cross-validation, variables are selected based on all but one part of the data (e.g., using a regularized regression estimator or any other model selection procedure), while the coefficients of the selected predictors are then re-fitted on the left-out part (e.g., using ordinary least squares or a regularized method). To ensure there are enough observations in each part for efficient re-estimation of the coefficients, the data is usually split only into two parts for RCV: $(\mathbf{y}^{(1)}, \mathbf{X}^{(1)})$ and $(\mathbf{y}^{(2)}, \mathbf{X}^{(2)})$. Each of the two parts is used once for model selection, yielding two estimated sets of active predictors, $\hat{\mathcal{A}}^{(1)}$ and $\hat{\mathcal{A}}^{(2)}$. The coefficients are then re-estimated in the other half of the data, restricted to the model selected in the first step. More specifically, the estimate $\hat{\boldsymbol{\theta}}^{(1)}$ is computed using the response vector $\mathbf{y}^{(1)}$ and the subset of the design matrix $\mathbf{X}^{(1)}_{\hat{\mathcal{A}}^{(2)}}$, while $\hat{\boldsymbol{\theta}}^{(2)}$ is computed from $\mathbf{X}^{(2)}_{\hat{\mathcal{A}}^{(1)}}$ and $\mathbf{y}^{(2)}$. The RCV estimate of the residual variance is then the pooled variance estimate
$$s^2_{RCV} = \frac{\left\|\mathbf{y}^{(1)} - \hat\mu^{(1)} - \mathbf{X}^{(1)}_{\hat{\mathcal{A}}^{(2)}}\hat{\boldsymbol{\beta}}^{(1)}\right\|_2^2 + \left\|\mathbf{y}^{(2)} - \hat\mu^{(2)} - \mathbf{X}^{(2)}_{\hat{\mathcal{A}}^{(1)}}\hat{\boldsymbol{\beta}}^{(2)}\right\|_2^2}{n - |\hat{\mathcal{A}}^{(1)}| - |\hat{\mathcal{A}}^{(2)}|}.$$

Refitted cross-validation remedies the negative effects of spurious correlation in high dimensions, but the estimation bias introduced by the penalty function is not removed.
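The RCV recipe can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure of Fan et al. (2012) or of any package: the selection rule (keep the single predictor with the largest absolute sample correlation), the even/odd split, and all function names are assumptions made for this example. Each half selects a model, the other half refits it by OLS, and the two residual sums of squares are pooled as in the formula above.

```python
import random


def column(X, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in X]


def mean(values):
    return sum(values) / len(values)


def select_predictor(y, X):
    """Toy selection rule (an assumption): the index of the predictor with
    the largest absolute sample correlation with the response."""
    y_bar = mean(y)
    yc = [yi - y_bar for yi in y]
    syy = sum(v * v for v in yc)
    best_j, best_cor = 0, -1.0
    for j in range(len(X[0])):
        x = column(X, j)
        x_bar = mean(x)
        sxy = sum((xi - x_bar) * v for xi, v in zip(x, yc))
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        cor = abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0
        if cor > best_cor:
            best_j, best_cor = j, cor
    return best_j


def refit_rss(y, x):
    """Residual sum of squares of the OLS fit of y on x with an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def rcv_variance(y, X):
    """Pooled RCV variance estimate with one selected predictor per half."""
    n = len(y)
    half1, half2 = range(0, n, 2), range(1, n, 2)
    y1, X1 = [y[i] for i in half1], [X[i] for i in half1]
    y2, X2 = [y[i] for i in half2], [X[i] for i in half2]
    a1 = select_predictor(y1, X1)        # model selected on half 1
    a2 = select_predictor(y2, X2)        # model selected on half 2
    rss1 = refit_rss(y1, column(X1, a2))  # refit half 1 with model from half 2
    rss2 = refit_rss(y2, column(X2, a1))  # refit half 2 with model from half 1
    return (rss1 + rss2) / (n - 1 - 1)    # n - |A1| - |A2|, both of size 1


# Simulated example: one relevant predictor, error variance 1.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]
y = [1.0 + 2.0 * row[0] + rng.gauss(0, 1) for row in X]
print(rcv_variance(y, X))
```

In this easy setting (a single strong signal) the estimate should land near the true error variance of 1; the difficulties discussed in the text arise when signals are weaker or less sparse.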
The effects of data-driven hyper-parameter selection are slightly reduced by decoupling variable selection from coefficient estimation, but they are still noticeable in RCV. Reid et al. (2016) compare RCV for LS-LASSO to other data-splitting methods as well as estimators with folded-concave penalty or other de-biased versions of the LS-LASSO, in a wide range of scenarios. They conclude that estimating the error variance as
$$s^2_{CV} = \frac{1}{n - |\hat{\mathcal{A}}|}\left\|\mathbf{y} - \hat\mu - \mathbf{X}\hat{\boldsymbol{\beta}}\right\|_2^2, \tag{5.1}$$
where $\hat{\boldsymbol{\theta}}$ is computed on the full data using a penalization level chosen via standard cross-validation, performs best overall. Theoretical results in Chatterjee and Jafarov (2015) support this conclusion, albeit with a non-adjusted scaling factor $1/n$, which tends to bias the scale estimate downwards. As pointed out in Yu and Bien (2019), the adjustment $\frac{1}{n - |\hat{\mathcal{A}}|}$ is also problematic as it hinges on an accurately recovered model to avoid overestimation of the residual scale.

Especially when the sparsity of the true signal decreases while the variance explained by the true model remains fixed, Reid et al. (2016) show that RCV and other corrective measures stop working reliably. With the variance explained by the true model fixed, a less sparse signal also entails decreasing magnitude of each coefficient. Theoretical results for the RCV estimator suggest the magnitude of each truly non-zero coefficient needs to be large enough for the estimator to be consistent and efficient, explaining the results seen by Reid et al. (2016). Surprisingly, Reid et al. (2016) also find that for less sparse models with larger signal strength per coefficient, the RCV estimator is substantially upwards biased in finite samples.
Therefore, it appears that correction strategies such as RCV only work well for very sparse problems where the true coefficient values are large enough, which may jeopardize their applicability in practice.

The question is whether these empirical results are transferable to PENSE or other robust regularized S-estimators, which bring additional biases. The data-splitting methods can be readily adapted for robust estimation by replacing the regression estimator with, for example, PENSE and the empirical standard deviation by the M-scale of the residuals. The cross-validation based estimator for the residual scale (5.1) may be defined using a PENSE estimate $\tilde{\boldsymbol{\theta}}$ with hyper-parameters selected via cross-validation, and the robust M-scale of the residuals:
$$\hat\sigma_{CV} = \hat\sigma_M(\mathbf{y} - \tilde\mu - \mathbf{X}\tilde{\boldsymbol{\beta}}).$$
The refitted cross-validation estimator using PENSE for model selection and re-estimation can similarly be defined as
$$\hat\sigma_{RCV} = \sqrt{\frac{1}{2}\left(\hat\sigma_M^2\left(\mathbf{y}^{(1)} - \tilde\mu^{(1)} - \mathbf{X}^{(1)}_{\hat{\mathcal{A}}^{(2)}}\tilde{\boldsymbol{\beta}}^{(1)}\right) + \hat\sigma_M^2\left(\mathbf{y}^{(2)} - \tilde\mu^{(2)} - \mathbf{X}^{(2)}_{\hat{\mathcal{A}}^{(1)}}\tilde{\boldsymbol{\beta}}^{(2)}\right)\right)}.$$
Determining an appropriate number of splits for RCV is difficult when using robust estimators. Bisecting the data set leaves enough observations for re-estimating the coefficients but cuts the attainable breakdown point in half. Due to the additional $K$-fold CV for hyper-parameter selection inside each RCV fold, the maximum attainable breakdown point is $\frac{n(K-1)}{2K}$.

Downward bias of the residual scale can also be exacerbated by overfitting the data. To avoid possible overfitting, the scale of the prediction error could serve as a surrogate estimator for the scale of the residuals:
$$\hat\sigma_{PR} = \hat\sigma_M(\mathbf{y} - \hat{\mathbf{y}}(\lambda_S, \alpha_S))$$
with $\hat{\mathbf{y}}(\lambda_S, \alpha_S)$ the predicted values in the CV folds as defined in (3.6) and $\lambda_S, \alpha_S$ selected by the same cross-validation. While the scale of the prediction error may reduce the problem of overfitting, in empirical studies it often overestimates the error scale.
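All of the estimators above rest on the M-scale of a residual vector. As a minimal sketch of how such a scale can be computed, the following uses the standard re-scaling fixed-point iteration for the equation $\frac{1}{n}\sum_i \rho(r_i/s) = \delta$. The choice of Tukey's bisquare $\rho$ with cutoff 1.54764 and $\delta = 0.5$ (50% breakdown point, consistent under Normal errors) is an assumption for illustration; it is not claimed to be the configuration or the algorithm used in the pense package.

```python
import statistics


def rho_bisquare(t, cutoff=1.54764):
    """Tukey's bisquare rho function, scaled to have maximum value 1."""
    if abs(t) >= cutoff:
        return 1.0
    return 1.0 - (1.0 - (t / cutoff) ** 2) ** 3


def m_scale(residuals, delta=0.5, cutoff=1.54764, tol=1e-10, max_iter=500):
    """M-scale s solving mean(rho(r_i / s)) = delta by fixed-point iteration.

    delta and cutoff are assumed illustrative values (50% breakdown point).
    """
    n = len(residuals)
    # Start from the normalized median absolute residual.
    scale = statistics.median([abs(r) for r in residuals]) / 0.6745
    if scale <= 0.0:
        return 0.0
    for _ in range(max_iter):
        avg_rho = sum(rho_bisquare(r / scale, cutoff) for r in residuals) / n
        new_scale = scale * (avg_rho / delta) ** 0.5
        if abs(new_scale - scale) <= tol * scale:
            return new_scale
        scale = new_scale
    return scale


# Hypothetical residuals with one gross outlier.
residuals = [0.1, -0.5, 0.8, -1.2, 0.3, 1.9, -0.7, 0.4, 25.0]
print(m_scale(residuals))
```

Because $\rho$ is bounded, making the outlier arbitrarily larger does not change the result, which is the property the M-step relies on.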
An ad hoc way of balancing the downward bias of $\hat\sigma_{CV}$ and the upward bias of $\hat\sigma_{PR}$ is averaging them:
$$\hat\sigma_{AVG} = \sqrt{\frac{\hat\sigma^2_{CV} + \hat\sigma^2_{PR}}{2}}.$$
Despite the lack of theoretical underpinnings, the empirical results presented below indicate that this average estimate performs better than the individual estimates.

It is important to note that here the M-scale estimate
$$\hat\sigma_M(\mathbf{r}) = \inf\left\{s : \frac{1}{n}\sum_{i=1}^{n}\rho(r_i/s) < \delta\right\}$$
is not corrected for the effective degrees of freedom of the estimated model, as is sometimes done to decrease the finite-sample bias of the estimator (e.g., in Maronna 2011). The correction effectively reduces the breakdown point of the M-scale estimate, without adjusting the breakdown point of the robust estimate of regression accordingly. Consider a robust estimate computed with 25% breakdown point, tolerating up to 25% of arbitrarily large residuals. Adjusting the breakdown point of the M-scale estimator, e.g., to 15%, opens the floodgates to some of these possibly extreme residuals affecting the scale estimate and in turn breaking the M-step. As seen before, overestimation of the scale can be even more detrimental to the reliability of the M-estimator than underestimation and should be avoided.

Figure 5.2 summarizes the results of a numerical study of data-splitting methods in conjunction with PENSE. The reported scale estimate is relative to the true scale of the residuals and it is evident that the CV-based estimate, $\hat\sigma_{CV}$, severely underestimates the error scale, especially as more predictors are available. Under contamination, on the other hand, the M-scale of the residuals estimated by PENSE is inflated and shows large variation. At the same time, the estimate based on the prediction error, $\hat\sigma_{PR}$, is badly biased upwards. Under no contamination, under- and overestimation of the CV-based and prediction-based estimates seem to cancel out reasonably well and the average estimate, $\hat\sigma_{AVG}$, performs much better than the individual estimates.
In the presence of contaminated observations, however, high variability and upward bias of both $\hat\sigma_{CV}$ and $\hat\sigma_{PR}$ carry over to the average estimate. Refitted cross-validation with PENSE is outperformed by the simple average estimate if no contamination is present but shows slightly better performance under contamination. With RCV, however, the maximum breakdown point is substantially reduced, which would lead to problems in situations with more than 15% contamination. Consistent with the findings in Reid et al. (2016), the RCV estimate tends to do worse if the signal strength is larger. The PENSE estimate is perhaps not efficient enough to give reliable estimates for a sample half the size of the original data. In particular due to the stark reduction in the possible breakdown point, RCV is not well suited to be combined with robust estimators. Using adaptive PENSE as the initial high-breakdown estimator instead of PENSE improves results only marginally in these empirical studies and does not warrant the slightly increased computational complexity.

[Figure 5.2. Panels: no contamination (left) and 15% contamination (right); PVE 50% (top) and PVE 83% (bottom). Horizontal axis: number of predictors (thereof active), 20 (4), 80 (6), 640 (9). Vertical axis: relative estimated scale. Estimators shown: $\hat\sigma_{CV}$, $\hat\sigma_{PR}$, $\hat\sigma_{AVG}$, $\hat\sigma_{RCV}$.]
Figure 5.2: Estimated residual scale using PENSE in conjunction with different data-splitting strategies. The reported residual scale estimates are relative to the true scale of the residuals. The $n = 100$ observations are generated according to scheme VS3-LT* with the true model explaining 50% (top) and 83% (bottom) of the variance. Results under contamination (right) consider scenarios with 10 different vertical outlier positions and moderate leverage, $k_l = 2$.
The results reported here suggest none of the considered data-splitting strategies works well across all considered scenarios.

5.3 Discussion

The problem of residual scale estimation in moderate- to high-dimensional linear regression models is an actively evolving area. In the context of non-robust regularized estimators, the increased demand for post-selection inference has recently shifted attention to the issue of error variance estimation. Many proposals focus on different data-splitting strategies to get an accurate estimate of the error variance. Others adapt the LS-LASSO for improved residual scale estimation (e.g., Yu and Bien 2019; Sun and Zhang 2012; Belloni et al. 2011), but they explicitly target Normal errors.

For robust estimators, the residual scale estimate has another important role: improving the efficiency of a highly robust but inefficient estimator via a subsequent M-step. This M-step requires an accurate and robust scale estimate to achieve the promised gain in efficiency. As demonstrated empirically in Section 5.1, finite-sample bias in the error scale estimate can render the M-step unreliable. Overestimation of the residual scale in particular exposes the M-estimate to the influence of outliers and hence risks a breakdown under contamination.

Methods for improved scale estimation in the non-robust realm are not transferable to robust regularized estimators due to the effects of possible contamination. Refitted cross-validation and other data-splitting methods for variance estimation suffer from the low efficiency of regularized S-estimators and lead to a severe reduction of the maximum breakdown point. Data-splitting methods for estimating the error variance suffer from the same issues as hyper-parameter search via cross-validation under contamination, discussed in Section 3.5.2.

An interesting direction is presented in Loh (2018) for the $\ell_1$ regularized Huber loss, a convex amalgam between the LS- and LAD loss.
Up to a fixed threshold, Huber’s loss is the square function, which transitions to the absolute value for values greater than the threshold. While not robust towards leverage points in the predictors, it protects against outliers in the response. Choosing the threshold involves the same complication as choosing the cutoff value for the M-step: it requires an estimate of the residual scale. Loh (2018) sidesteps scale estimation and instead proposes to use several candidate values for the scale and adaptively choose a good value based on Lepski’s method. The author proves that the resulting estimator performs as well as the estimator obtained by knowing the true error scale. Extensions of this method to regularized M-estimators with redescending $\rho_M$ function are of potential interest.

This chapter highlights that estimation of the residual scale in high dimensions by robust means is very difficult. Methods relying on the accuracy and robustness of a residual scale estimate are susceptible to severe damage from contamination. With data-driven hyper-parameter selection, consistency of the scale estimate is not guaranteed, and empirical results suggest the estimates are highly biased. While in pristine settings without any contamination an M-step can indeed improve efficiency and lead to better prediction performance than PENSE or adaptive PENSE, the M-estimator may not be reliable under contamination or heavy-tailed error distributions, overshadowing any potential gain in efficiency. As long as the issue of residual scale estimation in high dimensions is not adequately solved, PENSE and adaptive PENSE are the safer choices over regularized M-estimators.

Chapter 6
Software

As hinted several times in the previous chapters, computing PENSE estimates is a challenging endeavor.
For adaptive PENSE, the computational challenges are the same but in general more daunting as adaptive PENSE depends on more hyper-parameters.

To facilitate the application of PENSE (Chapter 3) and adaptive PENSE (Chapter 4), a software package for the language and environment for statistical computing R (R Core Team 2020) is made available at https://cran.r-project.org/package=pense. This chapter details the computational solutions developed for PENSE and adaptive PENSE as available in the pense R package. Computation is agnostic to the hyper-parameter $\zeta$, hence it is absorbed by the penalty loadings $\boldsymbol{\omega} = (\omega_1^\zeta, \dots, \omega_p^\zeta)^\top$ and dropped from the notation below. Computation of PENSE is a special case of adaptive PENSE with penalty loadings fixed at $\boldsymbol{\omega} = \mathbf{1}_p$. The following exposition therefore considers only the more general case of adaptive PENSE.

Computing solutions to weighted least-squares adaptive elastic net (LS-adaEN) problems is an essential component for computing adaptive PENSE estimates. As detailed in Chapters 3 and 4, finding a set of initial estimates for PENSE and adaptive PENSE involves a large number of weighted LS-EN and weighted LS-adaEN problems, respectively. Moreover, the adaptive PENSE objective function is equivalent to a weighted LS-adaEN objective function, with weights depending on where the objective function is evaluated. This equivalence is the foundation for computing local minima of the adaptive PENSE objective function. Computation of adaptive PENSE therefore relies heavily on efficient algorithms for solving weighted LS-adaEN problems.

6.1 Algorithms for Weighted LS Adaptive EN

Computational performance of finding local minima of the adaptive PENSE objective function and computing initial estimates depends on the performance of the algorithm for solving weighted LS-adaEN problems of the form
$$\mathcal{O}_{WLS}(\mu, \boldsymbol{\beta}, \mathbf{W}) = L_{LS}(\mathbf{W}\mathbf{y}, \mathbf{W}(\mathbf{X}\boldsymbol{\beta} - \mu)) + \lambda\Phi_{AN}(\boldsymbol{\beta}; \boldsymbol{\omega}, \alpha), \tag{6.1}$$
with diagonal weighting matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$.
Throughout this section the matrix $\tilde{\mathbf{W}} = \sqrt{1/\overline{w^2}}\,\mathbf{W}$ denotes the normalized weight matrix, where $\overline{w^2} = \frac{1}{n}\sum_{i=1}^{n} W_{ii}^2$ is the average squared weight. Furthermore, the squared matrices $\mathbf{W}^2$ and $\tilde{\mathbf{W}}^2$ denote the diagonal matrix of squared weights and squared normalized weights, respectively.

Many of the weighted LS-adaEN problems arising during the computation of adaptive PENSE estimates are “close”, in the sense that only the weight matrix $\mathbf{W}$ or the set of observations change marginally between subsequent minimizations. While these “proximal” problems are important for adaptive PENSE, computational optimizations for these special use-cases are missing from the literature. Most of the attention in the literature on computing weighted LS-adaEN estimates focuses on computational shortcuts when minimizing the objective function for a decreasing sequence of the penalty parameter (e.g., Friedman et al. 2010; Tibshirani et al. 2012).

In the following, special attention is therefore given to optimizing the weighted LS-adaEN objective function when only the weights or only the data change between subsequent minimizations. Ideally, algorithms for weighted LS-adaEN problems should incur little overhead when changing only weights, data, or the penalty level. The pense package implements several algorithms for optimizing the weighted LS-adaEN objective function (6.1), each with its own use-cases, advantages and disadvantages.

6.1.1 Augmented Ridge

The augmented ridge algorithm is specialized for weighted LS-Ridge problems (i.e., $\alpha = 0$ in (6.1)). The weighted LS-Ridge problem can be solved exactly by noting that the weighted LS-adaEN objective function in the case of $\alpha = 0$ and without intercept term simplifies to
$$\frac{1}{2n}\left\|\mathbf{W}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right\|_2^2 + \frac{1}{2n}n\lambda\|\boldsymbol{\beta}\|_2^2 = \frac{1}{2n}\left\|\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\boldsymbol{\beta}\right\|_2^2 \tag{6.2}$$
where
$$\tilde{\mathbf{y}} = \begin{pmatrix}\mathbf{W}\mathbf{y} \\ \mathbf{0}_p\end{pmatrix} \quad\text{and}\quad \tilde{\mathbf{X}} = \begin{pmatrix}\mathbf{W}\mathbf{X} \\ \sqrt{n\lambda}\,\mathbf{I}_{p \times p}\end{pmatrix}.$$
Due to the equivalence in (6.2), the closed-form solution for the Ridge estimate $\hat{\boldsymbol{\beta}}$ is
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top\mathbf{W}^2\mathbf{X} + n\lambda\mathbf{I}_{p \times p}\right)^{-1}\mathbf{X}^\top\mathbf{W}^2\mathbf{y}.$$

An intercept term can be accommodated by making the predictor matrix orthogonal to the centered response. More specifically, the weighted and centered response is $\mathbf{y}^* = \mathbf{W}\mathbf{y} - \frac{1}{n}\mathbf{1}_n^\top\mathbf{W}\mathbf{y}$. Similarly, the orthogonalized predictor matrix $\mathbf{X}^*$ is given by
$$\mathbf{X}^* = \tilde{\mathbf{X}} - \frac{1}{n}\mathbf{W}\mathbf{1}_{n \times n}\mathbf{W}\tilde{\mathbf{X}} \tag{6.3}$$
where $\tilde{\mathbf{X}} = \mathbf{W}(\mathbf{X} - \mathbf{1}_p\bar{\mathbf{x}})$ is the centered and weighted predictor matrix and $\bar{\mathbf{x}} = \frac{1}{n}\mathbf{X}^\top\mathbf{1}_n$ is the mean vector of all predictors. The slope parameter and intercept are then computed by
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{*\top}\mathbf{X}^* + n\lambda\mathbf{I}_{p \times p}\right)^{-1}\mathbf{X}^{*\top}\mathbf{y}^*, \qquad \hat\mu = \frac{1}{n}\mathbf{1}_n^\top\mathbf{W}^2(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \tag{6.4}$$

Computing the optimum for any penalty level incurs $\mathcal{O}(np + p^3)$ floating-point operations (flops) to solve the system of $p$ linear equations in (6.4). Changing the data or weights requires recomputing the orthogonalized predictor matrix $\mathbf{X}^*$ and its Gram matrix $\mathbf{X}^{*\top}\mathbf{X}^*$. These changes therefore incur an additional computational complexity of $\mathcal{O}(n^2p + np^2)$ flops. Solving the linear equations in (6.4) can be a computational bottleneck for very large $p$. However, if the $p \times p$ matrix fits into memory and $\lambda > 0$, the augmented Ridge algorithm is highly competitive as the solution can be computed to high precision in a single step without potential convergence issues. This stability argument often outweighs limited scalability, as computing local minima of the adaptive PENSE objective function involves a large number of weighted LS-adaEN problems and convergence issues in a single weighted LS-adaEN problem lead to more serious convergence issues down the road.

6.1.2 Augmented LARS

The Least Angle Regression (Efron et al. 2004) algorithm (LARS) can be used to compute solutions of the LS-LASSO objective function exactly. Starting from the empty model, i.e., all coefficients equal 0, the LARS algorithm translates a LS-LASSO problem into a sequence
of ordinary least-squares (OLS) problems, one for each penalty level where covariates “enter” or “leave” the model. For a fixed penalty level $\lambda$, the LARS algorithm solves $K \ge 0$ OLS problems at $\tilde\lambda_0 > \tilde\lambda_1 > \dots > \tilde\lambda_K$, where $\tilde\lambda_{K-1} > \lambda \ge \tilde\lambda_K$. The LS-LASSO solution at penalty level $\lambda$ can then be recovered exactly by linear interpolation between the coefficients computed at $\tilde\lambda_{K-1}$ and $\tilde\lambda_K$:
$$\hat{\boldsymbol{\beta}}(\lambda) = \frac{\tilde\lambda_K - \lambda}{\tilde\lambda_K - \tilde\lambda_{K-1}}\hat{\boldsymbol{\beta}}(\tilde\lambda_{K-1}) + \frac{\lambda - \tilde\lambda_{K-1}}{\tilde\lambda_K - \tilde\lambda_{K-1}}\hat{\boldsymbol{\beta}}(\tilde\lambda_K).$$

Penalty loadings $\boldsymbol{\omega}$ are incorporated into the LARS algorithm by scaling the predictor matrix with the inverse penalty loadings, $\mathbf{X}\boldsymbol{\Omega}^{-1}$, where $\boldsymbol{\Omega}^{-1} = \operatorname{diag}(1/\omega_1, 1/\omega_2, \dots, 1/\omega_p)$. The elastic net penalty can be accommodated by changing the penalty level for the LARS algorithm to $\alpha\lambda$ and using equivalence (6.2) to handle the $\ell_2$ penalization, with $\sqrt{n\lambda}\,\mathbf{I}_{p \times p}$ replaced by the matrix $\sqrt{n\frac{1-\alpha}{2}\lambda}\,\boldsymbol{\Omega}^{-1}$. The LARS algorithm therefore solves the weighted LS-adaEN problem by computing the LS-LASSO solution on the weighted, centered response and orthogonalized predictors given in (6.3), with $\mathbf{X}$ replaced by $\mathbf{X}\boldsymbol{\Omega}^{-1}$.

At every step $k$, $k = 0, \dots, K$, of the augmented LARS algorithm a system of linear equations must be solved. However, the sequence of the OLS problems allows for solving these systems of linear equations more efficiently by sequentially updating a “running” Cholesky decomposition (Efron et al. 2004; Watkins 2002). Consider the symmetric $p \times p$ matrix $\mathbf{A} = \mathbf{X}^{*\top}\mathbf{X}^* + \sqrt{n\frac{1-\alpha}{2}\lambda}\,\boldsymbol{\Omega}^{-1}$. In the following, $\mathbf{A}^{(k)}$ denotes the symmetric matrix comprising only the rows and columns of $\mathbf{A}$ for predictors included in the model at the $k$-th step. Instead of calculating $\mathbf{A}^{(k)}$ for every $k$, the augmented LARS algorithm only needs the (upper-triangular) Cholesky decomposition $\mathbf{U}^{(k)}$ of $\mathbf{A}^{(k)}$, $\mathbf{A}^{(k)} = \mathbf{U}^{(k)\top}\mathbf{U}^{(k)}$. This Cholesky decomposition can be computed efficiently from the Cholesky decomposition at the previous step, $\mathbf{U}^{(k-1)}$. Suppose predictor $j$ is added in the $k$-th step.
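The updated Cholesky decomposition is given by
$$\mathbf{U}^{(k)} = \begin{pmatrix}\mathbf{U}^{(k-1)} & \mathbf{u} \\ \mathbf{0}^\top & \sqrt{A_{jj} - \mathbf{u}^\top\mathbf{u}}\end{pmatrix}, \quad \mathbf{u} = \left(\mathbf{U}^{(k-1)\top}\right)^{-1}\mathbf{v}, \quad \mathbf{v} = (A_{jj'})_{j' \in \mathcal{A}^{(k-1)}}, \tag{6.5}$$
where $\mathcal{A}^{(k-1)}$ is the set of active predictors in the previous step. The system $\mathbf{U}^{(k-1)\top}\mathbf{u} = \mathbf{v}$ can be solved efficiently by substitution because $\mathbf{U}^{(k-1)}$ is a triangular matrix (Watkins 2002). This requires only $\mathcal{O}(\tilde{p}^2)$ operations, where $\tilde{p} \le p$ is the dimension of $\mathbf{U}^{(k-1)}$. Performing updates in this way leads to a different order of the predictors in the Cholesky decomposition than in $\mathbf{X}$. Therefore, it is necessary to keep track of the order in which the predictors are added to reconstruct the original order of the coefficients. The performance gain, however, outweighs the overhead of rearranging the coefficients only once.

Dropping a predictor is also a simple update of the Cholesky decomposition. Suppose predictor $j$ is dropped in the $k$-th step, and $\mathbf{v} = (\mathbf{v}_1^\top, v_2)^\top$ corresponds to the upper-diagonal elements of the column in $\mathbf{U}^{(k-1)}$ corresponding to the dropped predictor,
$$\mathbf{U}^{(k-1)} = \begin{pmatrix}\mathbf{U}^{(k-1)}_{11} & \mathbf{v}_1 & \mathbf{U}^{(k-1)}_{13} \\ \mathbf{0} & v_2 & \mathbf{U}^{(k-1)}_{23} \\ \mathbf{0} & \mathbf{0} & \mathbf{U}^{(k-1)}_{33}\end{pmatrix}.$$
The updated Cholesky decomposition $\mathbf{U}^{(k)}$ is then given by
$$\mathbf{U}^{(k)} = \begin{pmatrix}\mathbf{U}^{(k-1)}_{11} & \mathbf{U}^{(k-1)}_{13} \\ \mathbf{0} & \mathbf{U}^{(k)}_{33}\end{pmatrix}$$
where $\mathbf{U}^{(k)}_{33}$ is the Cholesky decomposition of the rank-one update $\mathbf{U}^{(k-1)\top}_{33}\mathbf{U}^{(k-1)}_{33} + \mathbf{U}^{(k-1)\top}_{23}\mathbf{U}^{(k-1)}_{23}$, which can be computed efficiently (Gill et al. 1974).

Updating the running Cholesky decomposition involves growing and shrinking of the decomposition at every single step. Conventionally, the $\tilde{p}^2$ elements of the decomposition $\mathbf{U} \in \mathbb{R}^{\tilde{p} \times \tilde{p}}$ are stored in a contiguous array (Anderson et al. 1999):
$$\begin{pmatrix}u_{11} & u_{12} & \cdots & u_{1\tilde{p}} \\ u_{21} & u_{22} & \cdots & u_{2\tilde{p}} \\ \vdots & \vdots & \ddots & \vdots \\ u_{\tilde{p}1} & u_{\tilde{p}2} & \cdots & u_{\tilde{p}\tilde{p}}\end{pmatrix} \xrightarrow{\text{stored as}} \big[\underbrace{u_{11}, u_{21}, \dots, u_{\tilde{p}1}}_{\text{column } 1}, \underbrace{u_{12}, u_{22}, \dots, u_{\tilde{p}2}}_{\text{column } 2}, \dots, \underbrace{u_{1\tilde{p}}, u_{2\tilde{p}}, \dots, u_{\tilde{p}\tilde{p}}}_{\text{column } \tilde{p}}\big].$$
This storage scheme is not ideal for the running Cholesky decomposition for two reasons. First, the decomposition is an upper-triangular matrix with all entries below the diagonal being 0 and never referenced.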
Therefore, the conventional storage scheme requires almost twice as much memory as necessary. Secondly, appending or removing a column and row to/from a conventionally stored matrix requires moving almost every element in memory, which is an expensive operation. Considering that any row appended to the Cholesky decomposition contains only 0’s, except for the diagonal entry, this is a superfluous and prodigal operation.

[Figure 6.1. Panels: p = 50, p = 100, p = 200, p = 400, p = 800. Horizontal axis: number of observations. Vertical axis: time [ms]. Storage schemes shown: conventional, column-packed.]
Figure 6.1: Comparison of computation time for the weighted LS-adaEN minimizer using the augmented LARS algorithm with the Cholesky decomposition stored in conventional scheme (dashed light-blue line) or column-packed scheme (solid blue line). The vertical axis is on the square-root scale. Timings are taken for simulated data sets (one per (n, p) combination) and averaged over 100 runs on a system with Intel® Xeon® E5-1650 v2 @ 3.50GHz processors.

To improve performance of the running Cholesky decomposition used for the augmented LARS algorithm, the implementation in the pense package stores the decomposition in a column-packed scheme (Anderson et al. 1999). Only the $(\tilde{p}^2 + \tilde{p})/2$ non-zero elements of the upper-triangular Cholesky decomposition $\mathbf{U} \in \mathbb{R}^{\tilde{p} \times \tilde{p}}$ are stored in memory as
$$\begin{pmatrix}u_{11} & u_{12} & \cdots & u_{1\tilde{p}} \\ 0 & u_{22} & \cdots & u_{2\tilde{p}} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{\tilde{p}\tilde{p}}\end{pmatrix} \xrightarrow{\text{stored as}} \big[\underbrace{u_{11}}_{\text{column } 1}, \underbrace{u_{12}, u_{22}}_{\text{column } 2}, \dots, \underbrace{u_{1\tilde{p}}, u_{2\tilde{p}}, \dots, u_{\tilde{p}\tilde{p}}}_{\text{column } \tilde{p}}\big].$$
Appending a row and column to the matrix $\mathbf{U}$ only requires appending $\tilde{p} + 1$ elements in memory, without moving any of the other elements. Removing a row and column from matrix $\mathbf{U}$ still requires moving elements in memory, but it is less expensive than for conventional storage as only non-zero elements must be moved.
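The append operation on column-packed storage can be sketched as follows. This is an illustrative toy, not the implementation in the pense package: the class and method names are hypothetical, and a plain Python list stands in for the contiguous array. Appending a predictor solves the triangular system $\mathbf{U}^\top\mathbf{u} = \mathbf{v}$ by substitution and then extends the array by the new column, leaving all existing elements in place.

```python
import math


class PackedCholesky:
    """Upper-triangular factor U with A = U^T U in column-packed storage.

    Hypothetical helper for illustration; element (i, j) with i <= j
    (0-based) is stored at position j * (j + 1) // 2 + i.
    """

    def __init__(self):
        self.size = 0
        self.data = []  # column 1, then column 2, ...

    def get(self, i, j):
        """Return U[i][j]; entries below the diagonal are implicit zeros."""
        if i > j:
            return 0.0
        return self.data[j * (j + 1) // 2 + i]

    def append(self, v, a_jj):
        """Append a predictor with cross-products v (length `size`) to the
        active predictors and diagonal element a_jj of A."""
        # Substitution for U^T u = v: row i of U^T is column i of U.
        u = []
        for i in range(self.size):
            s = v[i] - sum(self.get(l, i) * u[l] for l in range(i))
            u.append(s / self.get(i, i))
        d2 = a_jj - sum(x * x for x in u)
        if d2 <= 0.0:
            raise ValueError("matrix is not positive definite")
        # Only size + 1 elements are appended; nothing else is moved.
        self.data.extend(u)
        self.data.append(math.sqrt(d2))
        self.size += 1


# Build the factor of A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]] column by column.
chol = PackedCholesky()
chol.append([], 4.0)
chol.append([2.0], 5.0)
chol.append([2.0, 3.0], 6.0)
print(chol.data)
```

For a $3 \times 3$ factor the packed array holds $(3^2 + 3)/2 = 6$ elements instead of 9, and each append costs $\mathcal{O}(\tilde{p}^2)$ operations for the substitution.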
Considering that appending is a much more frequent operation than removing for the running Cholesky decomposition (Efron et al. 2004), the performance gains of using column-packed storage are substantial. This is evident in Figure 6.1, where the computation times for two implementations of the augmented LARS algorithm are compared: an implementation using the conventional storage scheme for the Cholesky decomposition (denoted by “conventional” in the graph) and an improved implementation using column-packed storage for the Cholesky decomposition (denoted by “column-packed”). For most problem sizes, the “column-packed” storage scheme leads to substantial improvements in computational speed.

Augmented LARS solves the optimization of the weighted LS-adaEN objective function exactly using a sequence of OLS problems and is therefore numerically very stable. Unless $\alpha = 1$, changing the penalty parameters requires recomputing the entire sequence of OLS problems. Each update of the running Cholesky decomposition requires $\mathcal{O}(\tilde{p}^2)$ flops, where $\tilde{p} \le p$ is the number of active predictors in the step. Therefore, computational complexity for solving the sequence of $K$ OLS problems is $\mathcal{O}(Kp^2)$, where $K$ is typically $\lesssim \max(n, p)$ unless predictors are highly correlated. Furthermore, if the penalty level is large and hence the solution has a small number of non-zero coefficients, the augmented LARS algorithm involves only a few low-dimensional OLS problems and is computationally very efficient. As for augmented Ridge, updating the weights or data requires recomputing the weighted, orthogonal predictor matrix $\mathbf{X}^*$, adding $\mathcal{O}(n^2p + np^2)$ flops. Quadratic computational complexity can also be seen in Figure 6.1. On the square-root scaling of the vertical axis in these plots the computation time increases linearly with the number of observations $n$, for any $p$.

Closed form solutions for the intermediate OLS problems avoid convergence issues for augmented LARS.
Accurate results, high stability and computational efficiency for sparse solutions (i.e., large penalty levels) are clear advantages of the augmented LARS algorithm. A main drawback, however, is the need to store a $p \times p$ matrix and, for small penalty levels, $O(p^2)$ flops per step. Furthermore, the algorithm cannot leverage solutions to "proximal" problems (e.g., after a small change to the penalty level) to speed up computation, a key advantage of iterative algorithms.

6.1.3 Alternating Direction Method of Multipliers (ADMM)

The Alternating Direction Method of Multipliers (ADMM) algorithm leverages the fact that the objective function of weighted LS-adaEN (6.1) is composed of the convex weighted LS loss and the non-smooth (but convex) adaptive EN penalty. For ADMM, the minimization problem is written in consensus form (Deng and Yin 2016)

$$
\operatorname*{arg\,min}_{\mu, \beta} \mathcal{O}_{WLS}(\mu, \beta) = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{p+1},\, \hat{y} \in \mathbb{R}^{n}} f(\hat{y}) + g(\theta) \quad \text{subject to } \hat{y} - \tilde{X}\theta = 0 \tag{6.6}
$$

with $f(\hat{y}) = \frac{1}{2}\|\tilde{W}(y - \hat{y})\|_2^2$ the (scaled) weighted LS loss function, $\tilde{X} = (\mathbf{1}_n, X)$ the predictor matrix with a column of 1's for the intercept term, and $g(\theta)$ the scaled adaptive EN penalty function

$$
g(\theta) = g\bigl((\mu, \beta^\top)^\top\bigr) = \frac{n}{\bar{w}} \lambda\, \Phi_{AN}(\beta; \omega, \alpha).
$$

The consensus form splits the optimization problem for $\hat{y}$ and $\theta$ in two independent parts and one equality constraint. The constrained optimization problem in (6.6) can be cast into an unconstrained augmented Lagrangian problem

$$
L_\tau(\theta, \hat{y}, z) = f(\hat{y}) + g(\theta) - z^\top(\hat{y} - \tilde{X}\theta) + \frac{\tau}{2}\|\hat{y} - \tilde{X}\theta\|_2^2
$$

with step size $\tau > 0$ and dual variable $z \in \mathbb{R}^n$ for the consensus constraint (Bertsekas 1982; Deng and Yin 2016).

In the augmented Lagrangian formulation of the minimization problem, parameters $\hat{y}$ and $\theta$ are separable up to a quadratic term.
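The augmented Lagrangian problem is solved iteratively by

$$
\begin{aligned}
\theta^{(k+1)} &= \operatorname*{arg\,min}_{\theta} L_\tau(\theta, \hat{y}^{(k)}, z^{(k)}) && (6.7) \\
\hat{y}^{(k+1)} &= \operatorname*{arg\,min}_{\hat{y}} L_\tau(\theta^{(k+1)}, \hat{y}, z^{(k)}) && (6.8) \\
z^{(k+1)} &= z^{(k)} - \tau\bigl(\hat{y}^{(k+1)} - \tilde{X}\theta^{(k+1)}\bigr) && (6.9)
\end{aligned}
$$

where $k \geq 0$ is the iteration counter.

The challenge in computing the first step (6.7) in the ADMM iterations stems from the product $\tilde{X}\theta$ in the quadratic penalty term. To simplify the step, the quadratic term $\frac{\tau}{2}\|\hat{y}^{(k)} - \tilde{X}\theta\|_2^2$ is linearized by a first-degree Taylor expansion around $\theta^{(k)}$:

$$
\frac{\tau}{2}\|\hat{y}^{(k)} - \tilde{X}\theta\|_2^2 \;\propto\; -\tau\bigl(\hat{y}^{(k)} - \tilde{X}\theta^{(k)}\bigr)^\top \tilde{X}\theta + \frac{\tau}{2}\|\tilde{X}(\theta - \theta^{(k)})\|_2^2 \;\leq\; \tau\Bigl(-\bigl(\hat{y}^{(k)} - \tilde{X}\theta^{(k)}\bigr)^\top \tilde{X}\theta + \frac{1}{2\tau'}\|\theta - \theta^{(k)}\|_2^2\Bigr)
$$

with $0 < \tau' < 1/\|\tilde{X}\|^2$ (He and Yuan 2015). Instead of (6.7), this "linearized" ADMM solves the minimization problem

$$
\begin{aligned}
\theta^{(k+1)} &= \operatorname*{arg\,min}_{\theta}\; g(\theta) + \tau\Bigl((\theta - \theta^{(k)})^\top \tilde{X}^\top\bigl(\tilde{X}\theta^{(k)} - \hat{y}^{(k)} + \tfrac{1}{\tau} z^{(k)}\bigr) + \frac{1}{2\tau'}\|\theta - \theta^{(k)}\|_2^2\Bigr) \\
&= \operatorname*{arg\,min}_{\mu, \beta}\; g\bigl((0, \beta^\top)^\top\bigr) + \tau\Bigl(\beta^\top X^\top\bigl(X\beta^{(k)} - \hat{y}^{(k)} + \tfrac{1}{\tau} z^{(k)}\bigr) + n\mu^{(k)}\beta^\top \bar{x} + \frac{1}{2\tau'}\|\beta - \beta^{(k)}\|_2^2\Bigr) \\
&\qquad + \mu\tau\Bigl(n\mu^{(k)} + n\bar{x}^\top\beta^{(k)} + \sum_{i=1}^{n}\bigl(\tfrac{1}{\tau} z_i^{(k)} - \hat{y}_i^{(k)}\bigr)\Bigr) + \frac{\tau}{2\tau'}\bigl(\mu - \mu^{(k)}\bigr)^2
\end{aligned}
$$

where $\bar{x} \in \mathbb{R}^p$ is the vector of column means of the predictor matrix $X$. This minimization problem can be solved separately for the intercept and slope. The updated intercept using the linear approximation is

$$
\mu^{(k+1)} = \mu^{(k)} - \tau'\Bigl(n\mu^{(k)} + n\bar{x}^\top\beta^{(k)} + \sum_{i=1}^{n}\bigl(\tfrac{1}{\tau} z_i^{(k)} - \hat{y}_i^{(k)}\bigr)\Bigr). \tag{6.10}
$$

The updated slope can be represented by the proximal operator of the adaptive EN penalty:

$$
\beta^{(k+1)} = \operatorname{prox}_{\frac{\tau' n \lambda}{\tau \bar{w}} \Phi_{AN}}\Bigl(\beta^{(k)} - \tau' X^\top\bigl(X\beta^{(k)} - \hat{y}^{(k)} + \tfrac{1}{\tau} z^{(k)}\bigr) - n\tau'\mu^{(k)}\bar{x}\Bigr). \tag{6.11}
$$

Following Parikh and Boyd (2014), the proximal operator $\operatorname{prox}_{\eta f}\colon \mathbb{R}^q \to \mathbb{R}^q$ of a closed proper convex function $f\colon \mathbb{R}^q \to \mathbb{R}$, scaled by positive scalar $\eta \in \mathbb{R}_+$, is defined as

$$
\operatorname{prox}_{\eta f}(u) = \operatorname*{arg\,min}_{v \in \mathbb{R}^q}\Bigl\{f(v) + \frac{1}{2\eta}\|u - v\|_2^2\Bigr\}.
$$

The proximal operator of the adaptive EN penalty is thus the scaled, coordinate-wise, soft-thresholding operator (Parikh and Boyd 2014):

$$
\operatorname{prox}_{\eta\Phi_{AN}}(u) = \left(\frac{\operatorname{sgn}(u_j)\max(0, |u_j| - \eta\alpha\omega_j)}{1 + \eta(1 - \alpha)}\right)_{j=1}^{p}. \tag{6.12}
$$

Once the first step in the ADMM iterations is computed, the second step (6.8) can be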
easily solved by

$$
\hat{y}^{(k+1)} = \operatorname*{arg\,min}_{\hat{y}} L_\tau(\theta^{(k+1)}, \hat{y}, z^{(k)}) = \Bigl(I_{n \times n} + \frac{1}{\tau}\tilde{W}^2\Bigr)^{-1}\Bigl(\tilde{X}\theta^{(k+1)} + \frac{1}{\tau}\bigl(\tilde{W}^2 y + z^{(k)}\bigr)\Bigr)
$$

and involves only the inverse of a diagonal matrix. The final step (6.9) is a simple vector update and does not incur substantial computations.

A single iteration for linearized ADMM can be computed very efficiently, requiring only $O(pn)$ flops. The convergence rate and hence the number of iterations of the linearized ADMM algorithm depend on the rank of the predictor matrix $X$ as well as the elastic net parameter $\alpha$. Deng and Yin (2016) show that if either $X$ has full column rank or $\alpha < 1$ (i.e., the EN penalty is strongly convex), $\theta^{(k)}$ converges "Q-linearly" to a global minimum $\theta^*$, meaning there exists a $c \in (0, 1)$ such that

$$
\frac{\|\theta^{(k+1)} - \theta^*\|_2}{\|\theta^{(k)} - \theta^*\|_2} \leq c.
$$

In the case where $\alpha = 1$ (i.e., adaptive LASSO) and $X$ does not have full column rank, the convergence rate of linearized ADMM is only sub-linear (Davis and Yin 2017), in the sense that the value of the objective function converges sub-linearly to the value of the objective function at a global minimum,

$$
\bigl(f(\hat{y}^{(k)}) + g(\theta^{(k)})\bigr) - \bigl(f(\tilde{X}\theta^*) + g(\theta^*)\bigr) = O(1/k).
$$

Theoretically, linearized ADMM converges for any choice of the step size parameter $\tau$. The actual speed of convergence of linearized ADMM, however, depends heavily on the value chosen for $\tau$. If $\tau$ is too small or too large, the algorithm may not converge within a reasonable number of iterations or even diverge due to numerical instability. The convergence rates in Deng and Yin (2016) can be used to determine an "optimal" step size if $\tilde{X}$ is of full column rank or $\alpha < 1$. In the case where both conditions are satisfied, the optimal step size is the product of the minimum and maximum weights, $\tau = \min_i \tilde{W}_{ii} \times \max_i \tilde{W}_{ii}$.
In case neither condition is satisfied, the step size is more difficult to tune, and no theoretical guidance is available.

Steps (6.10), (6.11), (6.8), and (6.9) are iterated until the gap between iterations is sufficiently small, i.e.,

$$
\|\hat{y}^{(k+1)} - \hat{y}^{(k)}\|_2^2 + \|z^{(k+1)} - z^{(k)}\|_2^2 < \epsilon
$$

for a small convergence threshold $\epsilon > 0$, or until the algorithm exceeds the prespecified maximum number of iterations.

Overall, linearized ADMM can be very efficient, but a change to the data requires recomputing the "linearization" step size $\tau'$, incurring an additional $O(p^2 n)$ flops. The main advantage of linearized ADMM is that a single iteration is very efficient and that it can leverage solutions to "proximal" problems. However, convergence can be very slow if the step size is not chosen properly.

6.1.4 Dual Augmented Lagrangian (DAL)

The DAL algorithm as proposed in Tomioka et al. (2011) is an iterative algorithm which can be adapted to compute the weighted LS-adaEN estimate. Using the same functions $f$ and $g$ as defined for the ADMM algorithm (6.6), DAL uses Fenchel's duality theorem (Rockafellar 1970, Theorem 31.1) to cast the weighted LS-adaEN objective

$$
\operatorname*{arg\,min}_{\mu, \beta} \mathcal{O}_{WLS}(\mu, \beta) = \operatorname*{arg\,min}_{\mu \in \mathbb{R},\, \beta \in \mathbb{R}^p} f(X\beta + \mu\mathbf{1}_n) + g(\beta)
$$

into its corresponding dual form

$$
\operatorname*{arg\,max}_{\boldsymbol{\alpha} \in \mathbb{R}^n,\, v \in \mathbb{R}^p} -f^*(-\boldsymbol{\alpha}) - g^*(v) \quad \text{subject to } v = X^\top\boldsymbol{\alpha} \text{ and } \mathbf{1}_n^\top\boldsymbol{\alpha} = 0
$$

where the second equality constraint encodes the intercept. The functions $f^*$ and $g^*$ are the convex conjugates of $f$ and $g$, respectively, defined as

$$
f^*(v) = \sup_{u \in \mathbb{R}^n}\bigl(v^\top u - f(u)\bigr), \qquad g^*(v) = \sup_{u \in \mathbb{R}^p}\bigl(v^\top u - g(u)\bigr).
$$

As the name suggests, Dual Augmented Lagrangian iteratively minimizes the augmented Lagrangian of this dual problem, given by

$$
L_\tau(\boldsymbol{\alpha}, v, \beta) = -f^*(-\boldsymbol{\alpha}) - g^*(v) + \beta^\top\bigl(v - X^\top\boldsymbol{\alpha} - \mathbf{1}_n^\top\boldsymbol{\alpha}\bigr) - \frac{\tau}{2}\|v - X^\top\boldsymbol{\alpha}\|_2^2.
$$

In Fenchel's dual formulation the Lagrangian multiplier, $\beta$, corresponds to the primal solution to the weighted LS-adaEN problem (for the slope), and the intercept can be easily recovered by $\mu = \tau\mathbf{1}_n^\top\boldsymbol{\alpha}$.
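For the weighted LS loss, the conjugate $f^*$ in the dual problem has a closed form. Assuming the weight matrix $\tilde{W}$ is diagonal with strictly positive entries $w_i$ (an assumption made for this sketch), setting the gradient of $v^\top u - f(u)$ to zero gives the maximizer $u_i = y_i + v_i / w_i^2$ and hence $f^*(v) = \sum_i \bigl(v_i y_i + v_i^2 / (2 w_i^2)\bigr)$. The Python sketch below checks this numerically; all function names are hypothetical and not part of the pense package.

```python
import random


def f(u, y, w):
    """Scaled weighted LS loss f(u) = 0.5 * sum_i (w_i * (y_i - u_i))^2."""
    return 0.5 * sum((wi * (yi - ui)) ** 2 for ui, yi, wi in zip(u, y, w))


def f_conj(v, y, w):
    """Closed-form conjugate f*(v) = sum_i (v_i * y_i + v_i^2 / (2 * w_i^2))."""
    return sum(vi * yi + 0.5 * (vi / wi) ** 2 for vi, yi, wi in zip(v, y, w))


def conj_maximizer(v, y, w):
    """The u attaining the supremum in f*(v): u_i = y_i + v_i / w_i^2."""
    return [yi + vi / wi ** 2 for vi, yi, wi in zip(v, y, w)]


def dual_value(v, u, y, w):
    """Value v'u - f(u), whose supremum over u defines f*(v)."""
    return sum(vi * ui for vi, ui in zip(v, u)) - f(u, y, w)


# Small random instance with strictly positive weights (illustrative only).
rng = random.Random(1)
n = 5
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [rng.uniform(0.5, 2.0) for _ in range(n)]
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
u_star = conj_maximizer(v, y, w)
```

Because the conjugate is a separable quadratic, evaluating $f^*$ and its derivatives is cheap; the expensive part of a DAL iteration lies elsewhere.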
Algorithm 3 Dual augmented Lagrangian algorithm for the weighted LS-adaEN problem.
Input: Initial step size multiplier $\eta > 0$, initial solution $\beta^{(0)}$, $\mu^{(0)}$.
 1: $\tau_1^{(0)} = \tau_2^{(0)} = \eta w^2 / (n\lambda)$
 2: $\boldsymbol{\alpha}^{(0)} = y - X\beta^{(0)}$
 3: repeat
 4:     $\beta^{(k+1)} = \operatorname{prox}_{\frac{n\tau_1^{(k)}}{\bar{w}}\lambda\Phi_{AN}}\bigl(\beta^{(k)} + \tau_1^{(k)} X^\top\boldsymbol{\alpha}^{(k)}\bigr)$
 5:     $\mu^{(k+1)} = \mu^{(k)} + \tau_2^{(k)}\mathbf{1}_n^\top\boldsymbol{\alpha}^{(k)}$
 6:     $\tau_1^{(k+1)} = 2\tau_1^{(k)}$
 7:     if $k > 1$ and $|\mathbf{1}_n^\top\boldsymbol{\alpha}^{(k+1)}| > \epsilon$ and $|\mathbf{1}_n^\top\boldsymbol{\alpha}^{(k+1)}| > |\mathbf{1}_n^\top\boldsymbol{\alpha}^{(k)}|/2$ then
 8:         $\tau_2^{(k+1)} = 10\tau_2^{(k)}$
 9:     else
10:         $\tau_2^{(k+1)} = 2\tau_2^{(k)}$
11:     end if
12:     $\boldsymbol{\alpha}^{(k+1)} = \operatorname*{arg\,min}_{\boldsymbol{\alpha} \in \mathbb{R}^n} \varphi_{k+1}(\boldsymbol{\alpha})$, where
        $\varphi_{k+1}(\boldsymbol{\alpha}) = f^*(-\boldsymbol{\alpha}) + \frac{1}{2\tau_1^{(k+1)}}\Bigl\|\operatorname{prox}_{\frac{n\tau_1^{(k+1)}}{\bar{w}}\lambda\Phi_{AN}}\bigl(\beta^{(k+1)} + \tau_1^{(k+1)} X^\top\boldsymbol{\alpha}\bigr)\Bigr\|_2^2 + \frac{1}{2\tau_2^{(k+1)}}\bigl(\mu^{(k+1)} + \tau_2^{(k+1)}\mathbf{1}_n^\top\boldsymbol{\alpha}\bigr)^2$
13:     $k = k + 1$
14: until $\mathrm{RDG}^{(k)} < \epsilon$ (as defined in (6.13))

Tomioka et al. (2011) propose to solve this dual augmented Lagrangian problem by the iterative procedure given in Algorithm 3. In the first step on lines 4 and 5, $\beta^{(k+1)}$ and $\mu^{(k+1)}$ are updated from the previous solution using the dual vector $\boldsymbol{\alpha}^{(k)}$. The slope $\beta^{(k+1)}$ is updated through the proximal operator of the adaptive EN penalty as given in (6.12) and together with the update to the intercept term can be done in $O(pn)$ flops. The second step updates the step sizes $\tau_1$ and $\tau_2$ for the slope and intercept, respectively. The last step, updating the dual vector $\boldsymbol{\alpha}^{(k+1)}$, is more involved; the strongly convex function $\varphi_{k+1}$ can only be minimized approximately using numerical methods. The DAL algorithm implemented in the pense package uses Newton's method with backtracking line search (Boyd et al. 2004, pp. 464ff) for computing an approximate solution $\boldsymbol{\alpha}^{(k+1)}$. Newton's method for minimizing $\varphi_{k+1}$ requires inverting the $n \times n$ Hessian of $\varphi_{k+1}$ and hence a total of $O(n^3 + n^2 p)$ flops. This can be somewhat improved by noting that the Hessian of $\varphi_{k+1}$ changes only marginally between iterations and the inversion can be accelerated by using the previous inverse as a pre-conditioner in the conjugate gradient method (Gentle 2007,
Algorithm 6.2).

To get exponential convergence of the DAL algorithm the step size needs to increase at every iteration. Furthermore, to alleviate convergence issues due to the unpenalized intercept, Algorithm 3 implements the suggestion in Tomioka et al. (2011) to use separate step sizes for the slope coefficients ($\tau_1^{(k)}$) and the intercept coefficient ($\tau_2^{(k)}$). If the intercept coefficient does not change substantially between iterations, the step size for the intercept is increased aggressively to speed up convergence.

The DAL algorithm is stopped when the relative duality gap, $\mathrm{RDG}^{(k)}$, is less than the prescribed numerical tolerance $\epsilon > 0$. The relative duality gap is defined as

$$
\mathrm{RDG}^{(k)} = \frac{f(X\beta^{(k)} + \mu^{(k)}\mathbf{1}_n) + g(\beta^{(k)}) - f^*(-\tilde{\boldsymbol{\alpha}}^{(k)}) - g^*(X^\top\tilde{\boldsymbol{\alpha}}^{(k)})}{f(X\beta^{(k)} + \mu^{(k)}\mathbf{1}_n) + g(\beta^{(k)})} \tag{6.13}
$$

with candidate dual vector $\tilde{\boldsymbol{\alpha}}^{(k)} = \boldsymbol{\alpha}^{(k)} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top\boldsymbol{\alpha}^{(k)}$.

Tomioka et al. (2011) establish strong convergence results for the DAL algorithm, even when solving for $\boldsymbol{\alpha}^{(k+1)}$ only approximately. The DAL algorithm converges super-linearly to a global optimum, $\theta^*$, of the weighted LS-adaEN objective, i.e.,

$$
\frac{\|\theta^{(k+1)} - \theta^*\|_2}{\|\theta^{(k)} - \theta^*\|_2} \leq \frac{1}{\sqrt{1 + 2c\tau_1^{(k)}}},
$$

for some constant $c > 0$. It can be seen that convergence is faster the larger the initial step size $\tau_1^{(0)}$, but a larger step size makes the optimization of $\varphi_{k+1}$ more difficult as the strong convexity constant of $\varphi_{k+1}$ is inversely related to $\tau_1^{(k+1)}$. The default setting in the pense package is to double the step size in each iteration, as shown in Algorithm 3. The initial step size is derived from the level of penalization and the scale of the loss function multiplied by parameter $\eta > 0$, using a conservative multiplier of $\eta = 0.01$ by default.

Compared to ADMM, DAL is designed to converge in far fewer iterations, but each iteration carries a substantially higher computational burden. The advantages of DAL are threefold: (i) DAL performs noticeably better for (severely) ill-conditioned problems than other iterative algorithms (Tomioka et al.
2011), (ii) DAL is well suited when the number of predictors $p$ is much larger than the number of observations $n$, and (iii) sparsity in the primal solution vector $\beta$ can be harnessed to substantially reduce the memory footprint and computational complexity.

The faster convergence of DAL is clearly visible in Figure 6.2 for two simulated data sets with randomly generated observation weights and penalty loadings. The exact global minima for these two data sets are computed using the augmented LARS algorithm up to floating-point precision. The hyper-parameters of the adaptive EN penalty are fixed at $\alpha = 0.5$ and $\lambda = \bar{\lambda}_{WLS}/2$, where $\bar{\lambda}_{WLS}$ is the smallest penalty level such that $\beta = \mathbf{0}_p$ minimizes the weighted LS-adaEN objective function. As summarized in Table 6.1, linearized ADMM exhibits linear convergence for $\alpha < 1$, which is supported by the linear trend under logarithmic scaling of the distance between the iterates $\theta^{(k)}$ and the true global minimum $\theta^*$. DAL, on the other hand, converges super-linearly and requires far fewer iterations than ADMM to get within a distance of $10^{-6}$ of the true global minimum. In terms of computational speed, however, DAL only outperforms ADMM if the number of observations is small and the number of predictors is very large.

[Plot omitted. Panels: ADMM, DAL; horizontal axis: number of iterations; vertical axis: $\|\theta^{(k)} - \theta^*\|_2$ (log scale); legend: n = 200, p = 50 and n = 50, p = 200.]
Figure 6.2: Distance between the true global minimum, $\theta^*$, and the solution in the $k$-th iteration, $\theta^{(k)}$, versus iteration counter $k$ for linearized ADMM and DAL for weighted LS-adaEN on two data sets simulated according to scheme MS1-MH(-2, 8). Observation weights, $w_i$ ($i = 1, \dots, n$), are random draws from a uniform distribution on $[0, 4]$ and the penalty loadings $\omega_j$ ($j = 1, \dots, p$) are from a uniform distribution on $[0, 1]$.

Table 6.1 summarizes computational complexity of the algorithms implemented in the pense package.
They are optimized to perform well in the use-cases required for computing adaptive PENSE estimates. Particular attention is devoted to reducing the overhead incurred by small changes to the data, for example changing weights between successive minimizations. These three algorithms for weighted LS-adaEN cover a wide range of problem sizes and ensure computing adaptive PENSE estimates is feasible in applications with large and demanding data sets.

                       | Augmented LARS              | Linearized ADMM           | DAL
Complexity             | $O(n^2 p + n p^2 + K p^2)$  | $O(K p n)$                | $O(K(n^3 + n^2 p))$
Data-change overhead   | –                           | $O(p^2 n)$                | –
# of iterations, $K$   | $\lesssim \max(n, p)$       | $O(e^{-k})$ or $O(1/k)$   | $o(e^{-k})$

Table 6.1: Comparison of computational complexity of algorithms to minimize the weighted LS-adaEN objective function (6.1) measured in floating-point operations. For augmented LARS, the number of steps required $K$ is usually the number of non-zero coefficient values in the result, but in the presence of highly correlated predictors the number of iterations may be slightly larger. Linearized ADMM converges linearly, in $O(e^{-k})$ iterations, if the penalty function is strictly convex (i.e., $\alpha < 1$) or if $X^\top X$ is positive definite.

6.2 Initial Estimates

The non-convex objective function of adaptive PENSE calls for an elaborate scheme to find good starting points. These starting points, or "initial estimates", are a crucial component of computing regularized S-estimates. Numerical methods for finding local minima of the non-convex objective function (4.1) converge to different local stationary points depending on the chosen starting point.
Different strategies are explored in Section 3.2, while the most reliable strategy for regularized S-estimates is the EN-PY procedure detailed in Algorithms 1 and 2.

The computational burden of EN-PY is substantial due to the computation of leave-one-out (LOO) residuals required to compute the sensitivity matrix $R$ (line 2 in Algorithm 2) and because LS-adaEN estimates need to be computed for each potentially clean subset of the data (line 7 in Algorithm 1). As detailed in Section 3.2.4, it is difficult to match the level of penalization desired for adaptive PENSE with an appropriate level of penalization for the EN-PY procedure. Therefore, EN-PY initial estimates are usually computed for a fixed $\alpha$ but a set of $Q$ penalty levels $\mathcal{Q}_I$.

In case of multiple penalty levels, line 4 of Algorithm 1 can be improved upon in the first iteration ($\iota = 0$) because the index set $I^{(0)}$ is the same for all penalty levels. Iterative algorithms for optimizing the LS-adaEN objective function, such as ADMM and DAL discussed in Section 6.1, at penalty level $\lambda_q$, $1 < q < Q$, converge faster if the minimum of the LS-adaEN objective function at penalty level $\lambda_{q-1}$ is leveraged. A similar improvement in the first iteration can be implemented for computing LOO LS-adaEN estimates needed for the sensitivity matrix $R$. For subsequent iterations such optimizations are not possible because the index set $I^{(\iota)}$ is most likely different for different penalty levels. However, the iterations can be done in parallel for different penalty levels, leveraging multiple cores with negligible overhead because these computations are completely independent.

[Plot omitted. Panels: p = 25, 50, 100; horizontal axis: number of threads; vertical axis: relative computation time; legend: n = 50, n = 100, n = 200.]
Figure 6.3: Comparison of the average time to compute the EN-PY initial estimates using 1 to 8 threads. Computation time is relative to the average computation time required using 1 thread. Timings are taken for data simulated according to scheme MS1-MH(-2, 8) and averaged over 100 runs on a system with Intel® Xeon® E3-12XX @ 2.70GHz processors (each CPU comprises 4 cores). Augmented LARS is used to compute LS-adaEN solutions and penalty parameters are fixed at $\alpha_{AS} = 0.5$, $\omega = \mathbf{1}_p$. The set $\mathcal{Q}_I = \{5 \times 10^{-4}\tilde{\lambda}_{AS}, \dots, \tilde{\lambda}_{AS}\}$ contains 12 penalty levels, equally spaced on the logarithmic scale, with $\tilde{\lambda}_{AS}$ given in (6.21).

Figure 6.3 shows the speed gains of using 1 to 8 CPU cores simultaneously via threads for computing the EN-PY initial estimates over a grid of 12 penalization levels, starting at the smallest penalty level such that $\mathbf{0}_p$ is a local optimum, as given in (6.21). For each combination of $n$ and $p$, a single data set is randomly generated according to data generation scheme MS1-MH(-2, 8) and computation is replicated 100 times. The system has 9 processors with 4 cores each, i.e., sharing data between 4 threads incurs little overhead, while moving beyond 4 threads involves increased memory management. This is also visible in Figure 6.3, where performance does not improve noticeably when using more than 4 threads, even for large problems. For all problem sizes, two threads can reduce computation time almost by half, while for small problems the overhead of more threads can consume the gains of parallelizing. In general, the more challenging the problem, the larger the gains from multithreading. If possible, using as many threads as cores per processor leads to the fastest computation without degrading performance.

Iterations of the EN-PY procedure must be done sequentially, but some steps within a single iteration allow for efficient parallelization to multiple cores. Computing the LS-adaEN estimates on the potentially clean subsets (line 7 in Algorithm 1) can be performed in parallel without the need to share data between cores.
Similarly, the LOO estimates used for the sensitivity matrix $R$ can be computed simultaneously on multiple cores.

In case of the Ridge penalty ($\alpha = 0$), EN-PY initial estimates can be computed much faster by exploiting the linearity of the LS-Ridge estimator. Instead of computing LOO residuals manually, the elements of the sensitivity matrix $R$ can be computed efficiently by

$$
R_{ij} = y^\top H_{i\cdot} - \frac{H_{ij} e_j}{1 - H_{jj}} \quad \text{where } H = X\bigl(X^\top X + (n - 1)\lambda I\bigr)^{-1} X^\top \text{ and } e = y - Hy.
$$

The closed-form solution for the sensitivity matrix considerably improves computational speed for EN-PY in case of the Ridge penalty. However, the Ridge penalty does not lead to any coefficient value being exactly 0. Therefore, all eigenvalues of $R^\top R$ are non-zero ($Q = \tilde{n}$, the number of observations in the EN-PY iteration), leading to a large number of potentially clean subsets and hence the need to compute many LS-adaEN estimates in line 7 of Algorithm 1.

The EN-PY procedure given in Algorithm 1 returns only the estimates from the last iteration. The risk of missing potentially good initial estimates can be reduced by tweaking the algorithm to additionally retain all estimates "close" to the best initial estimate, $\hat{\theta}^{(\iota)}$, from the final iteration (in terms of their M-scale of the residuals). The EN-PY procedure implemented in the pense package retains estimates from all previous iterations which have less than twice the M-scale of the residuals from the best initial estimate. The threshold can be changed to retain more or fewer estimates from previous iterations. Retaining estimates from previous iterations increases the computational burden but boosts the chances of finding global optima.

The main computational challenge for EN-PY is solving a large number of LS-adaEN subproblems. Furthermore, numerical instability or convergence issues of algorithms are difficult to correct automatically but can have a detrimental effect on EN-PY.
It is therefore important to employ efficient and stable numerical algorithms, chosen according to the dimension of the sample. Section 6.1 details the algorithms available in the pense package. Computation can be accelerated by leveraging "proximity" of LS-adaEN problems arising in the EN-PY procedure. When computing LOO estimates, for example, the estimates are unlikely to differ drastically from each other. Therefore, the computational burden can be substantially decreased by leveraging the LOO estimate $\hat{\theta}^{(-(i-1))}$ when computing the LOO estimate $\hat{\theta}^{(-i)}$, $i = 2, \dots, \tilde{n}$, in line 2 of Algorithm 2.

[Plot omitted. Panels: n = 50, 100, 200, 400; horizontal axis: number of predictors; vertical axis: computation time [s] (log scale); legend: DAL, ADMM, LARS.]
Figure 6.4: Comparison of the median time (on log-scale) for computing EN-PY initial estimates using different algorithms to solve the LS-adaEN subproblems. The shaded areas depict the inter-quartile range over 50 replications on a system running on Intel® Xeon® CPU E3-12XX @ 2.70GHz processors. Data is simulated according to scheme MS1-MH(-2, 8) with varying number of observations ($n$) and predictors ($p$). Penalty parameters are fixed at $\alpha_{AS} = 0.5$, $\omega = \mathbf{1}_p$, and the set $\mathcal{Q}_I(\alpha) = \{5 \times 10^{-4}\tilde{\lambda}_{AS}, \dots, \tilde{\lambda}_{AS}\}$ contains 12 penalty levels, equally spaced on the logarithmic scale, with $\tilde{\lambda}_{AS}$ given in (6.21).

The algorithm for solving the LS-adaEN subproblems needs to be chosen in accordance with the dimensions of the problem. Figure 6.4 shows computation time for EN-PY initial estimates using different algorithms to solve the LS-adaEN subproblems for several combinations of the number of observations, $n$, and number of predictors, $p$. As suggested by the computational complexity of the different algorithms in Table 6.1, the DAL algorithm outperforms others if the number of observations is reasonably small but the number of parameters is large.
The DAL algorithm leverages proximal solutions particularly well, often requiring only one or two iterations when computing LOO estimates, making it particularly well suited for the EN-PY procedure as long as $n$ is not too large. The LARS algorithm, on the other hand, does not benefit from proximal solutions, but given its efficient implementation it is usually the fastest option if the number of predictors is small to moderate. Computational complexity of linearized ADMM is linear in both $n$ and $p$, but because changing the data incurs additional $O(p^2 n)$ flops, ADMM is recommended for EN-PY only if both $n$ and $p$ are large.

For each $\lambda$ in the set of penalty levels, $\mathcal{Q}_I$, the EN-PY procedure yields a set of initial estimates $T(\lambda)$. Due to the difficulty of matching the penalty level between the EN-PY procedure and adaptive PENSE, the implementation in the pense package combines all initial estimates into one large set of initial estimates $T = \bigcup_{\lambda \in \mathcal{Q}_I} T(\lambda)$. Each of these initial estimates is subsequently used to find local minima of the adaptive PENSE objective function.

6.3 Computing Local Minima

Once a set of reliable starting points, $T$, is obtained, the task is to locate local minima of the adaptive PENSE objective function (4.1) close to these starting points. The adaptive PENSE objective function is not continuously differentiable everywhere, making gradient-based methods or Newton's method unusable (Parikh and Boyd 2014). Subgradient-based methods are a generalization of gradient-based methods for non-smooth functions (Shor 1985). Subgradient-based methods are conceptually simple, but convergence to local stationary points is generally slow and not guaranteed for the non-convex adaptive PENSE objective function (Bagirov et al. 2013). While some adaptations of subgradient-based methods improve convergence for non-convex problems (e.g., Bagirov et al. 2013), they are in practice unstable for large-scale problems.
For adaptive PENSE, the most stable and efficient numerical algorithms are based on the Minimization by Majorization (MM) principle.

MM algorithms are a broad class of algorithms with many applications. Lange (2016) provides an extensive overview of the theory and applications of MM algorithms. The general idea of MM algorithms is very versatile yet simple. For adaptive PENSE, for instance, the goal is to find a local minimum of the objective function $\mathcal{O}_{AS}(\theta)$ over $\theta \in \mathbb{R}^{p+1}$, starting from an initial guess $\theta^{(0)}$. Key to MM algorithms is finding a "surrogate" function which majorizes the true objective function at anchor point $\theta^*$. A function $g(\theta \mid \theta^*)$ is said to majorize the objective function $\mathcal{O}_{AS}(\theta)$ at $\theta^*$ if

$$
g(\theta^* \mid \theta^*) = \mathcal{O}_{AS}(\theta^*) \quad \text{and} \quad g(\theta \mid \theta^*) \geq \mathcal{O}_{AS}(\theta) \text{ for all } \theta \in \mathbb{R}^{p+1}. \tag{6.14}
$$

In other words, the majorizing surrogate function $g(\theta \mid \theta^*)$ equals the true objective function at $\theta^*$ and is at least as large as the true objective function everywhere else. An MM algorithm sequentially minimizes surrogate functions until a fixed point of the true objective function is reached. Starting from the initial guess $\theta^{(0)}$, the sequence of steps is given by

$$
\theta^{(k+1)} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{p+1}} g(\theta \mid \theta^{(k)}) \tag{6.15}
$$

for $k = 0, 1, \dots$ until $d(\theta^{(k+1)}, \theta^{(k)}) < \epsilon$, where $d\colon \mathbb{R}^{p+1} \times \mathbb{R}^{p+1} \to [0, \infty)$ is a distance metric and $\epsilon > 0$ a numerical tolerance level. Iterations of MM algorithms are guaranteed to produce a sequence of estimates with non-increasing value of the objective function:

$$
\mathcal{O}_{AS}(\theta^{(k+1)}) \leq g(\theta^{(k+1)} \mid \theta^{(k)}) \leq g(\theta^{(k)} \mid \theta^{(k)}) = \mathcal{O}_{AS}(\theta^{(k)}). \tag{6.16}
$$

The first inequality and last equality are due to $g$ being a majorizing function and the middle inequality holds because $\theta^{(k+1)}$ minimizes $g(\theta \mid \theta^{(k)})$. For a suitably chosen surrogate function, the iterates (6.15) converge at least sub-linearly to a stationary point of the true objective function close to the initial guess $\theta^{(0)}$ (Lange 2016).
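The descent property (6.16) can be illustrated on a one-dimensional toy problem that is deliberately much simpler than adaptive PENSE. The Python sketch below is a hypothetical example: it minimizes $\sum_i \sqrt{(y_i - \theta)^2 + \varepsilon}$, a smoothed absolute-deviation loss chosen only for illustration. Because the square root is concave, it lies below its tangent, which yields a quadratic majorizer anchored at the current iterate; each MM step is therefore a weighted least-squares problem with a closed-form solution. The function names and the smoothing constant `eps` are assumptions of this sketch.

```python
import math


def objective(theta, y, eps=1e-3):
    """Toy objective: sum_i sqrt((y_i - theta)^2 + eps)."""
    return sum(math.sqrt((yi - theta) ** 2 + eps) for yi in y)


def mm_step(theta, y, eps=1e-3):
    """Minimize the quadratic majorizer anchored at `theta`.

    sqrt(u) <= sqrt(u0) + (u - u0) / (2 sqrt(u0)) for u, u0 > 0, so the
    surrogate is a weighted LS problem with weights 1 / (2 sqrt(u0_i));
    its minimizer is the weighted mean of the observations.
    """
    w = [1.0 / (2.0 * math.sqrt((yi - theta) ** 2 + eps)) for yi in y]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def mm(y, theta0, tol=1e-10, max_iter=1000):
    """Iterate MM steps; return the final iterate and the objective path."""
    theta = theta0
    path = [objective(theta, y)]
    for _ in range(max_iter):
        new = mm_step(theta, y)
        path.append(objective(new, y))
        if abs(new - theta) < tol:
            return new, path
        theta = new
    return theta, path


theta_hat, path = mm([1.0, 2.0, 3.0, 4.0, 100.0], theta0=22.0)
```

The recorded objective values in `path` never increase, mirroring (6.16), and the iterates settle close to the sample median despite the outlying observation.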
This stationary point does not have to be a local minimum, but because the adaptive PENSE objective is optimized for a multitude of starting points, saddle points and local maxima are very likely screened out at the end.

The idea is that a difficult problem (i.e., finding local minima of the true objective function) is replaced by a sequence of simpler problems (i.e., finding minima of surrogate functions). This implies that the surrogate function $g(\theta \mid \theta^*)$ must be reasonably simple and easy to minimize for MM algorithms to be of use. For adaptive PENSE, it suffices to find a surrogate function for the S-loss, as the adaptive EN penalty is already convex. From (6.14) it is evident that combining a majorizer of the S-loss with the adaptive EN penalty majorizes the entire adaptive PENSE objective function.

The local representation of the objective function as a weighted LS-adaEN problem, introduced first in Section 3.1, proves important for deriving a surrogate function of the adaptive PENSE objective function. Let $\tilde{X} = (\mathbf{1}_n, X) \in \mathbb{R}^{n \times (p+1)}$ be the predictor matrix augmented by a column of 1's for the intercept term. For any anchor point $\theta^* \in \mathbb{R}^{p+1}$, consider the local surrogate function

$$
g_S(\theta \mid \theta^*) = \frac{1}{2n}\bigl\|W_{\theta^*}(y - \tilde{X}\theta)\bigr\|_2^2 + \lambda_{AS}\Phi_{AN}(\beta; \omega, \alpha_{AS}) = \mathcal{O}_{WLS}(\theta, W_{\theta^*}) \tag{6.17}
$$

with diagonal weight matrix $W_{\theta^*} \in \mathbb{R}^{n \times n}$ having diagonal elements

$$
w_i = \sqrt{\frac{\rho'(\tilde{r}_i)/\tilde{r}_i}{\frac{1}{n}\sum_{k=1}^{n}\rho'(\tilde{r}_k)\tilde{r}_k}} \quad \text{where } \tilde{r}_i = \frac{y_i - \tilde{x}_i^\top\theta^*}{\hat{\sigma}_M(\theta^*)}, \quad i = 1, \dots, n.
$$

It is easy to verify that $g_S(\theta \mid \theta^*)$ coincides with the adaptive PENSE objective function at $\theta^*$, but this surrogate function is not guaranteed to majorize the objective function everywhere. Following Fan et al. (2018), it is not necessary for the surrogate to majorize the true objective function everywhere for the MM algorithm to produce a converging sequence of iterates.
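A minimal numerical sketch of the weights in (6.17) follows. It assumes Tukey's bisquare $\rho$ with tuning constant 1.547 and an M-scale defined by $\frac{1}{n}\sum_i \rho(r_i/\sigma) = 0.5$; these are illustrative choices made for this sketch, and the $\rho$ function and constants actually used for adaptive PENSE are defined elsewhere in the dissertation. All function names are hypothetical. The sketch also checks that the weighted LS term of the surrogate, evaluated at the anchor point, equals $\hat{\sigma}_M^2/2$; whether that matches the S-loss exactly depends on how the S-loss is scaled, which is likewise an assumption here.

```python
import math
import statistics

C = 1.547    # bisquare tuning constant (assumed for this sketch)
DELTA = 0.5  # right-hand side of the M-scale equation (assumed)


def rho(t):
    """Tukey's bisquare rho, scaled so that rho(t) = 1 for |t| >= C."""
    u = min(abs(t) / C, 1.0)
    return 1.0 - (1.0 - u * u) ** 3


def rho_prime_over_t(t):
    """rho'(t) / t for the bisquare; well defined at t = 0."""
    u = t / C
    return 6.0 * (1.0 - u * u) ** 2 / C ** 2 if abs(u) < 1.0 else 0.0


def m_scale(res, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for the M-scale of the residuals `res`."""
    sigma = statistics.median(abs(r) for r in res) / 0.6745
    for _ in range(max_iter):
        mean_rho = sum(rho(r / sigma) for r in res) / len(res)
        new = sigma * math.sqrt(mean_rho / DELTA)
        if abs(new - sigma) <= tol * sigma:
            return new
        sigma = new
    return sigma


def surrogate_weights(res):
    """Weights w_i of (6.17) for residuals y_i - x_i' theta* at the anchor."""
    sigma = m_scale(res)
    scaled = [r / sigma for r in res]
    ratio = [rho_prime_over_t(t) for t in scaled]   # rho'(t_i) / t_i
    # (1/n) * sum_k rho'(t_k) * t_k, written via the ratio rho'(t)/t
    denom = sum(q * t * t for q, t in zip(ratio, scaled)) / len(res)
    return [math.sqrt(q / denom) for q in ratio], sigma


res = [0.3, -1.2, 0.8, -0.5, 1.1, -0.2, 0.6, 25.0]
w, sigma = surrogate_weights(res)
wls_term = sum((wi * ri) ** 2 for wi, ri in zip(w, res)) / (2 * len(res))
```

The gross outlier receives weight zero, which is how the reweighting step shields the weighted LS-adaEN subproblem from contaminated observations.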
The sequence converges as long as the surrogate majorizes the true objective function locally, i.e., satisfies the local property

$$
\mathcal{O}_{AS}(\theta^{(k+1)}) \leq g(\theta^{(k+1)} \mid \theta^{(k)}). \tag{6.18}
$$

The MM algorithm implemented in the pense package utilizes the weighted LS-adaEN surrogate function as defined in (6.17) despite the lack of proof that the local property (6.18) holds. If at any iteration property (6.18) is violated, the iteration can be repeated using a shifted and scaled weighted LS-adaEN surrogate function until the local property is satisfied. In practice an instance where the local property is violated by the surrogate (6.17) has yet to emerge, suggesting that the surrogate does indeed satisfy the local property. Local minima of the adaptive PENSE objective can therefore be computed efficiently by sequentially solving weighted LS-adaEN problems.

Numerical tolerance for solving LS-adaEN problems

These weighted LS-adaEN problems are simpler than the non-convex adaptive PENSE objective function, but they are not solvable exactly either. Many numerical algorithms to solve weighted LS-adaEN problems do so up to a prescribed numerical tolerance. From the non-increasing sequence in (6.16) it can be seen that the surrogate functions do not have to be minimized exactly as long as iterates $\theta^{(k+1)}$ reduce (or at least do not increase) the surrogate objective function.

This observation opens avenues for improving performance of MM algorithms. Considering a desired numerical tolerance for local optima of $\epsilon$ as defined below (6.15), only the last MM iteration must solve the surrogate problem with numerical tolerance less than $\epsilon$; preceding iterations can solve the surrogate problems with less accuracy. The idea is in the same spirit as the continuous analogue of the MM principle discussed in Lange (2016, p. 110), without requiring a strictly convex or smooth surrogate function. To improve numerical stability, the surrogate problem must be solved with higher accuracy than $\epsilon$ in the final iterations.
The implementation in the pense package solves the surrogate problems in the final iteration with a more stringent numerical tolerance of $\tilde{\epsilon} = \epsilon/10$. Using less accurate iterations generally increases the number of MM iterations required to find local optima, but at the same time decreases the computational burden of minimizing the surrogate function. The actual speed improvement depends on the strategy to choose the accuracy for MM iterates, on the computational complexity of initializing the algorithm for the surrogate problem with "reweighted" data, and on the savings from weaker demands on accuracy.

The pense package implements two "tightening" strategies to reduce computation time: exponential and adaptive. Exponential tightening sets the initial numerical tolerance level to $\epsilon^{(0)} = \sqrt{\tilde{\epsilon}}$. If the surrogate objective decreases in the $k$-th MM iteration, in other words, $g(\theta^{(k+1)} \mid \theta^{(k)}) < g(\theta^{(k)} \mid \theta^{(k)})$, the numerical tolerance is adjusted to

$$
\epsilon^{(k+1)} = \max\Bigl(\epsilon, \min\bigl(d(\theta^{(k+1)}, \theta^{(k)}),\ \epsilon^{(k)}\tilde{\epsilon}^{2/K}\bigr)\Bigr),
$$

where $K$ is the maximum number of MM iterations. If the surrogate objective function is not decreased, the iteration is repeated with a smaller numerical tolerance, i.e., $\epsilon^{(k)}$ is replaced by $\epsilon^{(k)}\tilde{\epsilon}^{1/10}$.

Adaptive tightening, on the other hand, decreases the numerical tolerance for the subproblems only if the parameter does not change meaningfully. As for exponential tightening, the initial numerical tolerance level is $\epsilon^{(0)} = \sqrt{\tilde{\epsilon}}$. The "aggressiveness" of adaptive tightening, and what is considered a meaningful change in the parameters, is controlled through the maximum number of adjustments $S$, with a default value of $S = 1$. If the surrogate objective decreases in the $k$-th MM iteration and the parameter values still change substantially, the numerical tolerance remains constant, i.e., $\epsilon^{(k+1)} = \epsilon^{(k)}$.
Adaptive tightening takes action if the surrogate objective decreases in the $k$-th MM iteration and the change in parameter values is small, $d(\theta^{(k+1)}, \theta^{(k)}) < \epsilon^{(k)}$, adjusting the tolerance to $\epsilon^{(k+1)} = \epsilon^{(k)} \tilde\epsilon^{1/S}$. In case the surrogate objective function does not decrease, the iteration is repeated with a tighter numerical tolerance, $\epsilon^{(k)} \leftarrow \epsilon^{(k)} \tilde\epsilon^{1/(2S)}$.

The effect of these different tightening strategies is shown in Figure 6.5 for a single simulated data set with desired convergence tolerance $\epsilon = 10^{-6}$. The plot on the left shows the relative difference in the value of the adaptive PENSE objective function between consecutive iterations as well as the convergence tolerance for the surrogate problem, $\epsilon^{(k)}$. Without a tightening strategy (solid black line), the convergence tolerance for the surrogate problem remains fixed at $\tilde\epsilon = 10^{-7}$, in which case the MM algorithm converges after 7 iterations. With adaptive tightening (dashed light-blue line), the number of MM iterations increases to 10, and for exponential tightening (dotted blue line) 26 MM iterations are required. While the tightening schemes lead to more MM iterations, the total number of iterations performed by ADMM is 1851, 617, and 583 for no tightening, adaptive tightening, and exponential tightening, respectively. The plot on the right highlights that tightening strategies reduce the number of ADMM iterations especially for the first few MM iterations. At these initial iterations, the MM iterates change considerably and it is not necessary to solve the surrogate function precisely. Once the MM iterations approach the local minimum of the adaptive PENSE objective function, however, more precise solutions are necessary to avoid “zigzagging” around the local minimum.

Figure 6.5: Convergence path of different tightening strategies (none, adaptive, exponential) for the MM algorithm for adaptive PENSE. Panel (a): relative difference to coefficients from previous iteration, by MM iteration. Panel (b): number of ADMM iterations, by MM iteration. The weighted LS-adaEN solutions are computed using linearized ADMM. Data is generated according to scheme MS1-MH(-2, 8) with 100 observations and 400 predictors. The gray lines in plot (a) depict the numerical tolerance to solve the surrogate problems for the different tightening strategies at each iteration. Penalty parameters are fixed at $\alpha_{AS} = 0.5$, $\omega = 1_p$, and $\lambda_{AS} = \tilde\lambda_{AS}/2$, with $\tilde\lambda_{AS}$ given in (6.21). The MM algorithm is started at $0_{p+1}$ and the convergence tolerance is $\epsilon = 10^{-6}$.

Figure 6.5(a) also shows the numerical tolerance level at each MM iteration, $\epsilon^{(k)}$, visualizing how tightening strategies work. As described above, adaptive tightening reduces the numerical tolerance of the surrogate problem once the relative change between iterates is smaller than $\epsilon^{(k)}$. After one adjustment, adaptive tightening uses the maximum accuracy of $\tilde\epsilon = 10^{-7}$. From the right plot it can further be seen that as soon as the numerical tolerance is lowered, ADMM requires substantially more iterations. With exponential tightening, on the other hand, the numerical tolerance changes more gradually and ADMM in general needs fewer iterations to converge in the individual MM iterations. At the very end the numerical tolerance is reduced to the desired accuracy of $\tilde\epsilon = 10^{-7}$, leading to slightly more ADMM iterations.

Figure 6.6: Performance of the MM algorithm (using the linearized ADMM algorithm to minimize the surrogate functions) for computing local minima of the adaptive PENSE objective function, using different tightening strategies (none, exponential, adaptive). Panel (a): number of ADMM iterations required to compute a local minimum of the adaptive PENSE objective function for different tightening strategies, relative to the number of ADMM iterations required with no tightening strategy. Panel (b): median runtime in seconds of the MM algorithm with different tightening strategies to compute a local minimum of the adaptive PENSE objective function, by number of predictors, for n = 100, n = 200, and n = 400. The vertical axis is on the log-scale. The shaded area around the median depicts the inter-quartile range from 50 replications measured on a system with Intel® Xeon® E3-12XX processors clocked at 2.70GHz. Data is generated according to scheme MS1-MH(-2, 8) and penalty parameters are fixed at $\alpha_{AS} = 0.5$, $\omega = 1_p$, and $\lambda_{AS} = \tilde\lambda_{AS}/2$, with $\tilde\lambda_{AS}$ given in (6.21). The MM algorithm is started at $0_{p+1}$ and the convergence tolerance is set to $10^{-6}$.

The smoother adjustment of the numerical tolerance for exponential tightening in general leads to a lower number of ADMM iterations. This trend is also visible in Figure 6.6(a), where the total number of ADMM iterations required per adaptive PENSE minimization (relative to the number of ADMM iterations required if no tightening strategy is used) is compared for exponential and adaptive tightening. Both tightening strategies lead to a substantial decrease in the total number of ADMM iterations required, with exponential tightening leading to a slightly greater reduction. This translates to decreased computation time as evident in Figure 6.6(b), where the time required to compute a minimum of the adaptive PENSE objective function is shown for different problem sizes. 
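The two tolerance updates described in this section can be sketched in a few lines of Python. This is only an illustration of the update rules as they read in the text, not the code used in the pense package: the function names and arguments are hypothetical, and the floor at $\tilde\epsilon$ in the adaptive rule is an assumption, motivated by the statement that adaptive tightening reaches the maximum accuracy $\tilde\epsilon$ after one adjustment when $S = 1$.

```python
import math


def exponential_tightening(tol_k, change, decreased, eps, eps_tilde, max_it):
    """Next inner tolerance under exponential tightening.

    tol_k: current tolerance, change: d(theta_new, theta_old),
    decreased: whether the surrogate objective decreased,
    max_it: maximum number of MM iterations (K in the text).
    """
    if not decreased:
        # No decrease: the MM iteration is repeated with a smaller tolerance.
        return tol_k * eps_tilde ** (1.0 / 10.0)
    return max(eps, min(change, tol_k * eps_tilde ** (2.0 / max_it)))


def adaptive_tightening(tol_k, change, decreased, eps_tilde, max_adjust=1):
    """Next inner tolerance under adaptive tightening (S = max_adjust)."""
    if not decreased:
        new_tol = tol_k * eps_tilde ** (1.0 / (2 * max_adjust))
    elif change < tol_k:
        # Parameters barely moved: tighten the tolerance.
        new_tol = tol_k * eps_tilde ** (1.0 / max_adjust)
    else:
        # Parameters still change substantially: keep the tolerance.
        return tol_k
    # Assumption: never ask for more accuracy than eps_tilde.
    return max(eps_tilde, new_tol)


eps = 1e-6
eps_tilde = eps / 10.0
tol0 = math.sqrt(eps_tilde)  # initial tolerance for both strategies
```

With these definitions, a call such as `adaptive_tightening(tol0, 1e-5, True, eps_tilde)` jumps directly to `eps_tilde`, whereas exponential tightening lowers the tolerance by a small factor in each successful iteration, which mirrors the gradual versus abrupt behavior discussed for Figure 6.5.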
Although the overall reduction in computation time is not as pronounced as the reduction in ADMM iterations, tightening saves computing resources, especially for large problems.

Tightening works well with linearized ADMM, but less so with DAL. Adaptive tightening slightly reduces the number of DAL iterations required, but exponential tightening increases the number of DAL iterations substantially, almost tripling the number of DAL iterations in the numerical experiments for Figure 6.6. The reason for this inflation of DAL iterations is that the convergence criterion employed by DAL (the relative duality gap) is not linearly related to the relative change in the coefficient values as used by the MM algorithm to determine convergence. Furthermore, if the weights change, the inner minimization carried out for DAL (step 12 of Algorithm 3) cannot re-use the Hessian from the previous iteration, leading to overhead in the computations which cannot be compensated by a moderate reduction in the number of DAL iterations through tightening.

Figure 6.7: Median time in seconds for computing local minima of the adaptive PENSE objective function (4.1) by the MM algorithm using different algorithms (DAL, ADMM, LARS) to solve the weighted LS-adaEN subproblems, by number of predictors, for n = 100, n = 200, and n = 400. The MM algorithm uses adaptive tightening for ADMM, while no tightening is used for the DAL and LARS algorithms. The vertical axis is on the log-scale and the shaded area around the median depicts the inter-quartile range from 50 replications measured on a system with Intel® Xeon® E3-12XX processors clocked at 2.70GHz. Data is generated according to scheme MS1-MH(-2, 8) and penalty parameters are fixed at $\alpha_{AS} = 0.5$, $\omega = 1_p$, and $\lambda_{AS} = \tilde\lambda_{AS}/2$, with $\tilde\lambda_{AS}$ given in (6.21). The MM algorithm is started at $0_{p+1}$ with convergence tolerance set to $10^{-6}$.

Performance of the MM algorithm with each of the three algorithms for weighted LS-adaEN described in Section 6.1 is shown in Figure 6.7. The augmented LARS algorithm noticeably outperforms the other algorithms for small p or large n. As expected, the DAL algorithm is competitive for a small number of observations and when the number of predictors is large. However, as already noted above, changing weights causes the DAL algorithm to recompute the Hessian required for the inner minimization from scratch. Therefore, DAL is better suited for use in the EN-PY procedure where changes to the data are more gradual than in the MM algorithm, except for scenarios with many predictors and few observations. Linearized ADMM, on the other hand, strikes a balance between augmented LARS and DAL and is suggested for situations where both n and p are moderate to large. An important property of the augmented LARS algorithm which is not visible in these plots is its accuracy. While augmented LARS is often outperformed by iterative algorithms, iterative algorithms are more prone to convergence issues, leading in turn to convergence problems for the MM algorithm.

The MM algorithm developed for adaptive PENSE delivers reliable and scalable performance. Allowing the use of any algorithm for solving weighted LS-adaEN subproblems, the MM algorithm is adaptable to many problems. Tightening strategies further reduce the computational complexity of solving a large number of subproblems with iterative algorithms.

These optimizations become even more important when the MM algorithm is run numerous times. The algorithm described in this chapter locates a local minimum for fixed hyper-parameters and a single starting point. 
In practice, a large set of different starting points needs to be explored to increase chances of finding a global optimum of the objective function. Furthermore, good values for the hyper-parameters are unknown in advance and need to be selected in a data-driven fashion, involving multitudinous minimizations. The solutions developed for the MM algorithm and weighted LS-adaEN algorithms are crucial to make large-scale explorations possible, but there is room for even more aggressive optimizations.

6.4 Computing Adaptive PENSE for Many Hyper-Parameters

As detailed in previous chapters, good values for the hyper-parameters of PENSE and adaptive PENSE are in practice unknown and need to be selected based on the available data. Sections 3.5 and 4.1.1 outline the benefits and shortcomings of using K-fold cross-validation for hyper-parameter selection. The computational burden makes K-fold CV challenging in larger problems. The pense package combines several heuristics, as outlined below, to make cross-validation a feasible strategy for hyper-parameter selection for adaptive PENSE.

Throughout this section it is assumed that the penalty loadings $\omega \in \mathbb{R}^p_+$ are fixed. For adaptive PENSE, this means that both the initial estimate $\tilde\beta$ and the exponent $\zeta$ are fixed. If $\zeta$ is to be chosen based on the available data as well, the steps detailed below can be repeated for different penalty loadings.

Hyper-parameter selection via K-fold CV relies on suitably standardized data to ensure comparability of penalization levels across CV folds. To simplify standardization within each individual CV fold, the entire data set $(y, X)$ is standardized as well. 
The goal of standardization is to make penalization levels more comparable between individual CV folds and the full data set, requiring the S-loss function, $L_S$, to be on a standardized scale. Every predictor is centered and scaled by its univariate location and scale estimated as

$$\hat\mu_j = \operatorname*{arg\,min}_{\mu} \hat\sigma_M(x_{\cdot j} - \mu) \quad\text{and}\quad \hat\sigma_j = \hat\sigma_M(x_{\cdot j} - \hat\mu_j) \quad\text{for } j = 1, \dots, p.$$

Similarly, the S-estimate of location of the observed responses, $\hat\mu_y = \operatorname*{arg\,min}_{\mu} \hat\sigma_M(y - \mu)$, is used to center the response. With these estimates of location and scale the data is standardized by

$$\tilde y = y - \hat\mu_y \quad\text{and}\quad \tilde X = \left(\frac{x_{\cdot 1} - \hat\mu_1}{\hat\sigma_1}, \dots, \frac{x_{\cdot p} - \hat\mu_p}{\hat\sigma_p}\right). \tag{6.19}$$

An estimate $\tilde\theta = (\tilde\mu, \tilde\beta)$ computed on the standardized data can be un-standardized by substituting (6.19) into the fitted model, which gives

$$\hat\beta = \operatorname{diag}(1/\hat\sigma_1, \dots, 1/\hat\sigma_p)\,\tilde\beta \quad\text{and}\quad \hat\mu = \tilde\mu + \hat\mu_y - (\hat\mu_1, \dots, \hat\mu_p)\,\hat\beta. \tag{6.20}$$

To avoid introducing distracting notation the subsequent steps assume that the data set $(y, X)$ is standardized.

For given penalty loadings $\omega$, the goal is to select a tuple $(\alpha^*, \lambda^*)$ of hyper-parameters leading to good prediction performance of the estimate. As long as $0 < \alpha < 1$, the effect of the $\alpha$ parameter on the estimate and hence the prediction performance is small compared to the effect of the penalization level $\lambda$. Furthermore, $\alpha$, the balance between the $L_1$ and $L_2$ penalties, can be more intuitively interpreted. Therefore, it is usually sufficient to consider only a small number of different values for $\alpha$. In the following, the set of values considered for the parameter $\alpha$ is denoted by $\mathcal{A}$, which typically consists of only a few values, e.g., $\mathcal{A} = \{1/3, 2/3, 1\}$. Since variable selection is of primary concern, $\mathcal{A}$ usually does not contain 0. While the adaptive PENSE objective function is smooth in $\alpha$, the coarse grid $\mathcal{A}$ does not yield any gains in computational performance from sharing information across values in $\mathcal{A}$. 
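The standardization and the back-transformation of the coefficients can be checked numerically with a small sketch. The code below makes simplifying assumptions that are not part of the text: the median and the median absolute deviation stand in for the M-scale based location and scale estimates, the coefficients on the standardized scale are arbitrary numbers rather than an adaptive PENSE fit, and all function names are hypothetical. The back-transformation follows from substituting the centered and scaled predictors into the fitted model, so that the intercept on the original scale is the standardized intercept plus the response location minus the sum of predictor locations times the rescaled slopes.

```python
import statistics


def mad(values):
    """Median absolute deviation (stand-in for the robust scale estimate)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def standardize(x_cols, y):
    """Center the response; center and scale every predictor column."""
    mu = [statistics.median(col) for col in x_cols]
    sigma = [mad(col) for col in x_cols]
    mu_y = statistics.median(y)
    x_std = [[(v - m) / s for v in col] for col, m, s in zip(x_cols, mu, sigma)]
    y_std = [v - mu_y for v in y]
    return x_std, y_std, mu, sigma, mu_y


def unstandardize(intercept_std, beta_std, mu, sigma, mu_y):
    """Map coefficients from the standardized scale to the original scale."""
    beta = [b / s for b, s in zip(beta_std, sigma)]
    intercept = intercept_std + mu_y - sum(m * b for m, b in zip(mu, beta))
    return intercept, beta


def predict(intercept, beta, x_cols):
    """Fitted values for predictors stored column-wise."""
    n = len(x_cols[0])
    return [intercept + sum(b * col[i] for b, col in zip(beta, x_cols))
            for i in range(n)]


x_cols = [[1.0, 2.0, 4.0, 7.0, 11.0], [3.0, 1.0, 4.0, 1.0, 5.0]]
y = [2.0, 0.0, 1.0, 5.0, 9.0]
x_std, y_std, mu, sigma, mu_y = standardize(x_cols, y)

# Arbitrary coefficients on the standardized scale (not an actual estimate).
intercept_std, beta_std = 0.3, [0.5, -2.0]
intercept, beta = unstandardize(intercept_std, beta_std, mu, sigma, mu_y)

fitted_std = predict(intercept_std, beta_std, x_std)
fitted = predict(intercept, beta, x_cols)
```

Fitted values on the original scale equal the fitted values on the standardized scale shifted by the response location, which is the property the back-transformation has to preserve.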
Therefore, prediction performance of adaptive PENSE at different hyper-parameter settings is estimated independently for each value of $\alpha$ in $\mathcal{A}$ according to the following steps.

Step 1 (defining a grid of penalization levels): The penalization level $\lambda$ has a much more pronounced yet subtle effect on the adaptive PENSE estimates than the hyper-parameter $\alpha$. It is therefore important to cover a wide range of penalization levels over a fine-grained grid. Going beyond a penalization level where all coefficient estimates are necessarily 0 is pointless, but determining this penalization level is difficult due to the non-convex objective function, as discussed in Section 3.5.1. The results in Section 3.5.1 can be extended to show that for given $\alpha$ and penalty loadings $\omega$, the smallest penalization level for which $0_p$ is a stationary point of the adaptive PENSE objective function is given by

$$\tilde\lambda_{AS} = \max_{j = 1, \dots, p} \frac{1}{n\, \omega_j\, \alpha} \left| \sum_{i=1}^{n} w_i^2(y - \hat\mu_y)\, (y_i - \hat\mu_y)\, x_{ij} \right| \tag{6.21}$$

with $\hat\mu_y = \operatorname*{arg\,min}_{\mu} \hat\sigma_M(y - \mu)$ and weights $w_i(y - \hat\mu_y)$ as defined in (3.3). For standardized data, $\hat\mu_y = 0$.

It is typically not necessary to consider penalization levels greater than $\tilde\lambda_{AS}$. The pense package spans a logarithmically-spaced grid of $Q$ penalization levels from $\tilde\lambda_{AS}$ to $10^{-3} \alpha \tilde\lambda_{AS}$, denoted by $\mathcal{Q} = \{\lambda_1, \dots, \lambda_Q\}$. It is important to note that the penalization levels are in decreasing order, i.e., $\lambda_q > \lambda_{q+1}$ for all $q = 1, \dots, Q - 1$.

Step 2 (defining CV folds): With $\alpha$ and $\mathcal{Q}$ fixed, the $n$ observations are randomly split into $K$ cross-validation folds. The $K$ CV folds are defined through randomly generated “folds”, i.e., disjoint index sets $\mathcal{S}^{(k)} \subset \{1, \dots, n\}$, $k = 1, \dots, K$, of roughly equal size which include all observations, i.e., $\bigcup_{k=1}^{K} \mathcal{S}^{(k)} = \{1, \dots, n\}$.

Step 3 (cross-validation): For every single fold $\mathcal{S}^{(k)}$, the training data is defined by

$$y^{(k)} = \left(y_i : i \notin \mathcal{S}^{(k)}\right), \qquad X^{(k)} = \left(x_i : i \notin \mathcal{S}^{(k)}\right)^{\intercal}$$

and contains $n - |\mathcal{S}^{(k)}|$ observations.

With the reduced number of observations in the training data, the robustness parameter $\delta$ needs to be adjusted. Given $\delta$ fixed beforehand, at most $\lfloor n\delta \rfloor$ observations may be contaminated. Since the training data is a random subset of the entire data set, all contaminated observations may be contained in this particular subset. To guard against this potentially increased proportion of contamination, the parameter needs to be adjusted to $\delta^{(k)} = \lfloor n\delta \rfloor / (n - |\mathcal{S}^{(k)}|)$. In other words, because $\delta^{(k)}$ cannot exceed 0.5 in any fold, cross-validation effectively decreases the maximum breakdown point attainable by robust estimators to $\delta \le 0.5\,(n - \max_{k=1,\dots,K} |\mathcal{S}^{(k)}|)/n$.

Step 3.1 (standardizing training data): The training data is standardized according to (6.19), with the location and scale estimates $\hat\sigma_j$, $\hat\mu_j$, and $\hat\mu_y$ estimated on the training data. The fixed penalization levels $\mathcal{Q}$ have approximately the same effect on the adaptive PENSE estimate computed on the standardized training data as if computed on the entire standardized data set.

Step 3.2 (computing the regularization path): The grid of penalization levels typically contains many different values and computing adaptive PENSE solutions for each of these levels is computationally the most demanding step. To ensure K-fold CV is feasible even for larger data sets, the pense package optimizes computing all estimates along this “regularization path”, i.e., for all $\lambda \in \mathcal{Q}$ where $\alpha$ and $\omega$ are fixed, as detailed in Algorithm 4.

Before the regularization path can be computed, initial estimates $\mathcal{T}$ are obtained according to Section 6.2. It is both unfeasible and unnecessary to compute initial estimates for every penalty level in $\mathcal{Q}$. 
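The bookkeeping in Steps 1 to 3 is simple enough to sketch directly. The snippet below is an illustration under stated assumptions rather than the pense implementation: the function names are hypothetical, the largest penalization level is passed in as a plain number instead of being computed from the data, and the folds are drawn with Python's standard random number generator.

```python
import math
import random


def penalty_grid(lambda_max, alpha, n_levels):
    """Log-spaced grid, decreasing from lambda_max to 1e-3 * alpha * lambda_max."""
    log_max = math.log(lambda_max)
    log_min = math.log(1e-3 * alpha * lambda_max)
    step = (log_min - log_max) / (n_levels - 1)
    return [math.exp(log_max + q * step) for q in range(n_levels)]


def cv_folds(n, n_folds, rng):
    """Randomly split the indices 0, ..., n - 1 into disjoint folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    # Taking every n_folds-th index gives fold sizes that differ by at most 1.
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


def adjusted_delta(n, delta, fold_size):
    """Robustness parameter for a training set of n - fold_size observations."""
    return math.floor(n * delta) / (n - fold_size)


grid = penalty_grid(lambda_max=10.0, alpha=0.5, n_levels=50)
folds = cv_folds(n=100, n_folds=5, rng=random.Random(2020))
delta_k = [adjusted_delta(100, 0.25, len(fold)) for fold in folds]
```

For $n = 100$, $\delta = 0.25$ and five folds of 20 observations each, the adjusted parameter is $25/80 = 0.3125$ in every fold, which shows how quickly cross-validation eats into the attainable breakdown point.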
By default, the pense package computes initial estimates for every fifth penalization level, $\mathcal{Q}_I = \{\lambda_1, \lambda_6, \lambda_{11}, \dots\}$. Many initial estimates do not lead to a good local optimum or lead to the same optimum found with a different starting point. To avoid squandering computational resources on initial estimates without merit, the pense package employs a two-stage strategy for computing the regularization path.

For every penalty level $\lambda_q$, the algorithm is separated into two stages: exploration and improvement. In the exploration stage, approximate solutions are computed by the MM algorithm with relaxed numerical tolerance ($\epsilon_{\text{exp}} = 0.1$ by default) and no tightening. To increase chances of finding good local optima, the MM algorithm in the exploration stage is started from every solution found for the previous penalty level $\lambda_{q-1}$ as well as all initial estimates in $\mathcal{T}$. Using a looser numerical tolerance in the exploration stage, the MM algorithm runs for only a few iterations, reducing the computational burden of exploring all possible starting points.

In the second stage, the MM algorithm is started from each of the $M$ best approximate solutions. In this improvement stage, the MM algorithm runs until convergence to the desired numerical tolerance (by default $10^{-6}$) and the best solution is retained for each $\lambda \in \mathcal{Q}$. In both the exploration and improvement stage, solutions are judged by their associated value of the adaptive PENSE objective function. This two-stage approach strikes a balance between vast exploration and feasible computation and is successfully applied for many other robust estimators as well (e.g., Salibián-Barrera and Yohai 2006; Rousseeuw and Van Driessen 2006; Alfons et al. 2013). Empirical results suggest that “good” solutions can be differentiated from “bad” solutions after only a few iterations of the MM algorithm.

The inner loops of Algorithm 4 (on lines 4 and 13) can be efficiently distributed among multiple cores, significantly accelerating computation. 
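The two-stage strategy can be illustrated on a toy problem. The sketch below is not the adaptive PENSE computation: the objective is a hypothetical one-dimensional non-convex function with an absolute-value penalty, `local_search` is a simple derivative-free descent that merely stands in for the MM algorithm, and all names and default values are assumptions made for this example. The structure of `regularization_path`, however, follows the exploration and improvement stages described above.

```python
def objective(theta, lam):
    """Toy non-convex penalized objective (stand-in for adaptive PENSE)."""
    return (theta * theta - 1.0) ** 2 + 0.3 * theta + lam * abs(theta)


def local_search(theta, lam, tol, step=0.5):
    """Toy descent method (stand-in for the MM algorithm).

    Moves by +/- step while this lowers the objective, halves the step
    otherwise, and stops once the step is no larger than `tol`.
    """
    while step > tol:
        current = objective(theta, lam)
        for candidate in (theta - step, theta + step):
            if objective(candidate, lam) < current:
                theta = candidate
                break
        else:
            step /= 2.0
    return theta


def regularization_path(lambdas, initial, n_improve=2,
                        tol_explore=0.1, tol=1e-6):
    """Explore many starts cheaply, then fully refine only the best few."""
    path = []
    previous = 0.0  # plays the role of the zero vector
    for lam in lambdas:  # sequential: each level reuses the previous solution
        # Exploration: coarse solutions from all starting points.
        starts = [previous] + list(initial)
        approx = [local_search(s, lam, tol_explore) for s in starts]
        approx.sort(key=lambda th: objective(th, lam))
        # Improvement: refine the n_improve best approximate solutions.
        best = 0.0
        for start in approx[:n_improve]:
            refined = local_search(start, lam, tol)
            if objective(refined, lam) < objective(best, lam):
                best = refined
        path.append(best)
        previous = best
    return path


path = regularization_path([5.0, 1.0, 0.1], initial=[0.9, -0.9])
```

In this toy setting the largest penalty level keeps the solution at zero, and the smaller levels settle in the lower of the two valleys. Both list comprehensions over starting points are independent across starts, which is the part that can be distributed over several cores.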
The outer loop, however, must be done sequentially as sharing information between subsequent penalization levels improves the likelihood of uncovering good local optima.

Step 3.3 (predicting values): Prediction performance of the coefficient estimates along the regularization path is estimated through the prediction error on the test set in the CV fold. The coefficient estimates must be un-standardized using (6.20) with location and scale estimates obtained for the training data in step 3.1. The prediction errors from un-standardized estimates $\{\hat\theta^{(1)}, \dots, \hat\theta^{(Q)}\}$ are then given by

$$z_{i,q} = y_i - \hat\mu^{(q)} - x_i^{\intercal} \hat\beta^{(q)} \quad\text{for all } i \in \mathcal{S}^{(k)},\; q = 1, \dots, Q.$$

Step 4 (computing estimate of prediction performance): After step 3, each observation $i = 1, \dots, n$ has $Q$ associated prediction errors, one for every considered penalty level. Prediction performance of adaptive PENSE estimates at each penalization level is estimated by the $\tau$-scale of the prediction errors

$$\hat\tau_{\alpha,\lambda_q} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \min\left(c_\tau,\; \frac{|z_{i,q}|}{\operatorname*{Median}_{i'=1,\dots,n} |z_{i',q}|}\right)^{2}}, \qquad \alpha \in \mathcal{A},\; q = 1, \dots, Q, \tag{6.22}$$

where the efficiency constant is $c_\tau = 3$ by default in pense, such that the contribution of unusually large prediction errors is bounded.

Step 5 (repeating CV with different splits): The non-convexity of the objective function leads to difficulties for cross-validation, as detailed in Section 3.5. This is underlined by empirical results showing that the CV curve of the prediction performance is typically very rough and unstable, varying whimsically between different cross-validation splits. This is clearly visible in the left panel of Figure 6.8, showing the cross-validated prediction performance of adaptive PENSE using two different CV splits on simulated data alongside prediction performance as estimated on an independent validation set. The individual CV curves roughly match the prediction performance from the validation set, but the curves are capricious. 
Considering only a single CV curve to determine good hyper-parameters is therefore suboptimal, as the location of the minimum most likely does not correspond to a level of penalization leading to the best prediction performance. When averaging the prediction performance estimated over several replications (i.e., cross-validation splits), the CV curve exhibits a smoother surface as shown in the right panel of Figure 6.8. Therefore, the implementation in the pense package repeats steps 2 to 4 $R$ times and averages the prediction performance at every $\lambda_q$ over these $R$ replications:

$$\bar\tau_{\alpha,\lambda_q} = \frac{1}{R} \sum_{r=1}^{R} \hat\tau^{(r)}_{\alpha,\lambda_q}, \qquad \alpha \in \mathcal{A},\; q = 1, \dots, Q.$$

Averaging multiple CV replications leads to a smoother CV curve and furthermore allows for accurate estimation of the variability of the estimated prediction performance at any considered penalty level. This enables a more sensible selection of the hyper-parameters for adaptive PENSE. For a fixed $\alpha$ a commonly employed strategy is to not choose the $\lambda_q$ at which the average prediction performance is minimized, but rather to choose a larger penalization level (i.e., a sparser solution) at which the average prediction performance is statistically “indistinguishable” from the smallest average prediction performance. The pense package implements this strategy by allowing the user to specify the multiple of the standard error of the smallest average prediction performance considered “indistinguishable”, i.e., a generalization of the “one-standard-error” rule (Hastie et al. 2009). In Figure 6.8(b), for example, the error bars depict one half standard error and the best average prediction performance is achieved with $\lambda \approx 8.8$ (21 non-zero coefficients). Using a sparser coefficient vector estimated at $\lambda \approx 13.2$ (15 non-zero coefficients) leads to very similar prediction performance with fewer selected predictors and a lower false-positive rate (the true model in this simulation has 16 non-zero coefficients).

Steps 1 to 5 are performed independently for every $\alpha \in \mathcal{A}$. 
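Averaging over replications and applying the standard-error rule can be sketched as follows. This is an illustration only: the function name, the layout of the input (one list of estimated prediction errors per replication, ordered like the penalty grid), and the use of the sample standard deviation across replications to obtain the standard error are assumptions for this example, and the numbers are made up.

```python
import math
import statistics


def select_penalty(lambdas, cv_errors, se_mult=1.0):
    """Pick penalization levels from replicated CV results.

    lambdas: penalization levels in decreasing order.
    cv_errors: cv_errors[r][q] is the estimated prediction error of
        replication r at lambdas[q].
    Returns the level with the smallest average error and the largest
    level whose average error is within se_mult standard errors of it.
    """
    n_rep = len(cv_errors)
    n_levels = len(lambdas)
    means = [statistics.mean([rep[q] for rep in cv_errors])
             for q in range(n_levels)]
    ses = [statistics.stdev([rep[q] for rep in cv_errors]) / math.sqrt(n_rep)
           for q in range(n_levels)]
    q_min = min(range(n_levels), key=lambda q: means[q])
    threshold = means[q_min] + se_mult * ses[q_min]
    # The grid is decreasing, so the smallest index is the sparsest fit.
    q_sparse = min(q for q in range(n_levels) if means[q] <= threshold)
    return lambdas[q_min], lambdas[q_sparse]


lambdas = [20.0, 10.0, 5.0, 2.5]
cv_errors = [
    [41.0, 36.2, 36.0, 38.0],
    [40.0, 36.4, 35.8, 39.0],
    [42.0, 36.0, 36.2, 37.0],
]
lam_min, lam_sparse = select_penalty(lambdas, cv_errors, se_mult=2.0)
```

In this made-up example the average error is smallest at the third level, but with a multiple of two standard errors the second, larger level is considered indistinguishable and would be selected, giving a sparser fit.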
With multiple replications of CV for each $\alpha$, selecting good hyper-parameters for PENSE and adaptive PENSE is computationally very taxing. While many steps can be efficiently parallelized onto multiple cores or compute nodes, the two-stage approach for computing the regularization path with Algorithm 4 is important to ensure scalability. Without the optimized algorithms described in this chapter, computation would not be feasible for realistic problem sizes.

Algorithm 4 Regularization path of adaptive PENSE
Input: Set of penalty levels $\mathcal{Q} = \{\lambda_1, \dots, \lambda_Q\}$ in decreasing order, set of initial estimates $\mathcal{T}$, maximum number of estimates to improve, $M > 0$, coarse convergence tolerance for exploration, $\epsilon_{\text{exp}} > 0$.
 1: Define $\hat\theta^{(0)} = 0_{p+1}$.
 2: for $q = 1, \dots, Q$ do
 3:     Initialize an empty set of approximate solutions $\mathcal{B}^{(q)} = \{\}$.
 4:     for $\tilde\theta \in \{\hat\theta^{(q-1)}\} \cup \mathcal{T}$ do
 5:         Starting the MM algorithm from $\tilde\theta$, compute an approximate solution $\hat\theta$ using a convergence tolerance of $\epsilon_{\text{exp}}$.
 6:         if the set of approximate solutions is not full, i.e., $|\mathcal{B}^{(q)}| < M$ then
 7:             Add $\hat\theta$ to the set of approximate solutions, $\mathcal{B}^{(q)}$.
 8:         else if $\mathcal{O}(\hat\theta; \lambda_q) < \max\{\mathcal{O}(\theta; \lambda_q) : \theta \in \mathcal{B}^{(q)}\}$ then
 9:             Replace the worst approximate solution in $\mathcal{B}^{(q)}$ by $\hat\theta$.
10:         end if
11:     end for
12:     Initialize best optimum as $\hat\theta^{(q)} = 0_{p+1}$.
13:     for $\tilde\theta \in \mathcal{B}^{(q)}$ do
14:         Starting the MM algorithm from $\tilde\theta$, compute a local minimum of the adaptive PENSE objective function, denoted by $\hat\theta$.
15:         if $\mathcal{O}(\hat\theta; \lambda_q) < \mathcal{O}(\hat\theta^{(q)}; \lambda_q)$ then
16:             Update the best optimum to $\hat\theta^{(q)} = \hat\theta$.
17:         end if
18:     end for
19: end for
20: Return the set of all solutions, $\{\hat\theta^{(1)}, \dots, \hat\theta^{(Q)}\}$.

6.5 Summary

Computation of adaptive PENSE estimates is challenging yet crucial for successful application. Easing the use of adaptive PENSE and making it available to a large audience, the R package pense is published on CRAN, the central system for packages extending R. The design goal of the pense package is to make adaptive PENSE a versatile tool and applicable to a wide range of problems.

Non-convexity of the objective function combined with necessary selection of hyper-parameters and possible contamination require several novel or adapted computational optimizations to make adaptive PENSE a method of choice. It turns out that all computations can be decomposed into a series of weighted least-squares adaptive elastic net problems. Each of these subproblems is convex and solvable efficiently. However, because of their sheer number, even these supposedly banausic subproblems require diligent optimizations using the specific characteristics of the sequence of problems. The pense package offers three algorithms for weighted LS-adaEN with optimizations to efficiently handle small changes in the data matrix or in the weights. Each of these three algorithms has certain features making it applicable to specific problem sizes and configurations, and together they cover a wide range of problems.

Numerically locating optima of the non-convex adaptive PENSE objective function necessitates a careful selection of starting points using the EN-PY procedure. Computing EN-PY initial estimates simultaneously for several penalty parameters allows for computational shortcuts. Once these starting points are computed, local optima of the adaptive PENSE objective function can be computed using a minimization-by-majorization (MM) algorithm. 
I show that the weighted LS-adaEN objective function with properly chosen weights is a useful surrogate function for the adaptive PENSE objective function. Solving a sequence of these weighted LS-adaEN problems leads to a local minimum of the adaptive PENSE objective function. Computing a large number of these local minima using different starting points improves the likelihood of finding a global minimum, or at least a local minimum close to the global minimum, unaffected by contamination.

Figure 6.8: Prediction performance of adaptive PENSE ($\alpha = 0.5$) estimated by 100 replications of 7-fold cross-validation on data simulated according to scheme MS1-MH(-5, 2) with $n = 100$ and $p = 32$. Panel (a): CV curves from individual CV splits (replications 1 and 2), showing prediction performance ($\hat\tau$) against $\lambda$. Panel (b): average prediction performance ($\bar\tau$) against $\lambda$, marking the minimum and the solution indistinguishable from the minimum. The black dashed line in both plots shows the prediction error as estimated on an independent validation set. The error bars in the right plot depict half the standard error.

A good choice of the hyper-parameters governing the penalization of the estimates is unknown in practice. Selecting these hyper-parameters therefore usually involves computing adaptive PENSE estimates for many different combinations of the hyper-parameters. As with initial estimates, several computational shortcuts are possible when computing adaptive PENSE for a sequence of hyper-parameters. These optimizations are essential to making computation of adaptive PENSE feasible for realistic problem sizes, especially because hyper-parameter selection for adaptive PENSE using cross-validation inherently leads to high variance of the estimated prediction performance and therefore requires several replications of CV, escalating the computational burden. 
The algorithms and methods implemented in the pense package incorporate many optimizations exploiting the characteristics of the adaptive PENSE objective function. These optimizations ensure that adaptive PENSE is computable using reasonable resources for many problems and is thus a feasible alternative in most applications.

Chapter 7
Conclusions

This dissertation highlights the inherent challenges arising when considering the possibility of contamination in a sample with many potential predictors but only a limited number of observations. These challenges motivate the development of novel estimators for high-dimensional, sparse linear regression models under the presence of contamination with the goal of accurate prediction of the response for a new set of observations and simultaneous identification of a small number of predictors relevant for prediction.

Combining ideas for robust estimation in low-dimensional linear regression models with regularization for variable selection, Chapter 3 proposes the penalized elastic net S-estimator. Because robustness of the estimator entails a non-convex objective function, considerable efforts are devoted to guiding exploration of the objective function in the quest to locate global minima. The EN-PY procedure is shown to outperform other methods both in terms of quality of the uncovered minima and computational costs. The asymptotic guarantees established for the estimator underline its appropriateness for challenging problems with heavy-tailed error distributions and potential contamination in the observed response or predictor values. Data-driven hyper-parameter search is vulnerable to high variance of the performance estimate, which is inflated by the presence of contamination and the non-convexity of the objective function. 
Nevertheless, cross-validation empirically leads to good prediction performance of PENSE, from idealized scenarios without contamination and with well-behaved error terms, to the most challenging situations with heavy-tailed errors and gross contamination.

The PENSE estimator reliably identifies relevant predictors from the large set of available predictors, but theoretical and empirical results expose one shortcoming of the PENSE estimator: insufficient filtering of truly irrelevant predictors. In Chapter 4 I therefore propose the adaptive PENSE estimator which leverages the PENSE estimator to substantially decrease the number of falsely selected predictors while at the same time retaining the predictive capabilities. Asymptotically, the adaptive PENSE estimator is proven to filter out all irrelevant predictors with high probability, while simultaneously estimating the parameters of the truly relevant predictors with the same efficiency as if the truly relevant predictors were known in advance. This oracle property of the adaptive PENSE estimator, combined with the empirically demonstrated performance even in very challenging scenarios, establishes the reliability and practical advantages of adaptive PENSE.

Analysis of the interplay between sparsity of the true model and contamination of the predictors accentuates the effects of two forms of contamination in the predictors not propagated to the response value: (i) extreme values in predictors with truly non-zero coefficients and (ii) extreme values in truly irrelevant predictors. Prediction performance and variable selection of PENSE are unscathed by contamination (i), while variable selection of non-robust estimators is erratic. Under contamination (ii), on the other hand, it is shown that the PENSE estimate is inherently unable to filter out the irrelevant predictors with contaminated values, whereas non-robust methods are more resilient to the effects of these “good” leverage points. 
Adaptive PENSE combines the best of both worlds, with prediction performance and variable selection unscathed by either form of contamination. Anecdotally, contamination (ii) is very common in practical applications, as the sheer number of irrelevant predictors creates more space for this form of contamination.

Adaptive PENSE’s robustness of variable selection and its good prediction performance are germane to meaningful and generalizable scientific results. The utility of adaptive PENSE is demonstrated in a biomarker discovery study with the goal of identifying proteins relevant for predicting cardiac allograft vasculopathy. Adaptive PENSE is estimated to give more accurate predictions using a smaller panel of proteins than other robust or non-robust estimators.

Chapter 5 outlines the problem of residual scale estimation in sparse high-dimensional linear regression models under the presence of contamination. Many proposals for robust regularized regression estimators depend on the availability of an accurate and robust estimate of the residual scale, both for efficient estimation and to retain robustness. Theoretical results in low-dimensional settings justifying computational shortcuts without sacrificing efficiency are not applicable to regularized M-estimators, entailing a substantial leap of faith when computing M-estimates on possibly contaminated finite samples. I highlight the prevalence of severe under- and overestimation of the residual scale in high-dimensional linear regression, leading to degraded performance of M-estimators. The bias in the scale estimate proves difficult to remove in finite samples, and strategies for de-biasing proposed for non-robust methods seem unfit for use with robust estimators. 
Despite the arguably better performance of regularized M-estimators in less challenging scenarios, the elevated risk of being subjected to the undue influence of contamination suggests that the more robust alternatives, PENSE and adaptive PENSE, are to be preferred in practice.

For PENSE and adaptive PENSE to be viable methods for high-dimensional data analysis, they need to be readily available in the form of software capable of computing the estimates in a wide range of scenarios. Chapter 6 details adaptations and optimizations of numerical algorithms for use as building blocks in the algorithm devised for computing local minima of the (adaptive) PENSE objective function. Together with an efficient implementation of the EN-PY procedure to guide the search for global minima, (adaptive) PENSE can be efficiently computed for a host of problem sizes. Repeated cross-validation can effectively reduce the high variability of the hyper-parameter search and further improve prediction performance, variable selection, and reliability of the (adaptive) PENSE estimate. With the optimizations developed in Chapter 6, computation of (adaptive) PENSE estimates remains feasible even in high-dimensional settings.

The methods developed in this dissertation gain robustness by down-weighting potentially contaminated observations. An observation is considered contaminated if either the residual or any of its predictor values is contaminated, following the “casewise” contamination model. With a large number of predictors available in high-dimensional datasets, this approach may lead to problems as even a small number of contaminated values can translate to a large proportion of contaminated observations. Robust methods for the “cellwise” contamination model (Alqallaf et al. 2009), on the other hand, aim at identifying individual values (i.e., cells in the data matrix) with potential contamination and gain robustness by reducing the influence of these cells on the estimation procedure. 
This strategy is better equipped for high-dimensional datasets, as contamination is not “propagated” from a single value to the entire observation. Methods for the cellwise contamination model, however, are computationally substantially more challenging than PENSE or adaptive PENSE. Importantly, the sparsity assumption imposed in this dissertation alleviates the propagation effect to a certain degree, as aberrant values in the many irrelevant predictors do not pose the same challenges as aberrant values in relevant predictors. In particular, adaptive PENSE shows very reliable prediction and variable selection properties in the presence of these forms of contamination, without the need to down-weight affected observations. It would nevertheless be interesting to investigate a possible combination of techniques used in the cellwise contamination model with adaptive PENSE in future research.

The statistical theory developed for PENSE and adaptive PENSE sheds light on their robustness and asymptotic properties under a general linear regression model. While the considered model covers a wide range of situations, some limitations cannot be ignored. The asymptotic properties of the estimators, for example, are derived under the assumption of i.i.d. errors, which in particular implies that the errors are independent of the predictors and homoscedastic (if $F_0$ has finite variance). This assumption is sometimes violated in practical applications. Consistency of unregularized S-estimators holds even if these assumptions are violated (Maronna et al. 2019), suggesting that similar extensions may be possible for PENSE and adaptive PENSE. Furthermore, the high breakdown point of the estimators requires a fixed set of hyper-parameters and does not account for any effects of choosing the hyper-parameters based on the potentially contaminated sample. To mitigate the effects of contamination, Chapters 3 and 4 stress the importance of using a robust measure of prediction performance.
While empirical results demonstrate that the proposed cross-validation scheme selects hyper-parameters which lead to reliable estimates, further analysis of the breakdown point under this scheme would give a more practical assessment of the procedures’ robustness towards contamination.

The many facets of contamination in high-dimensional data paired with variable selection and regularized estimation outlined in this dissertation point to several other challenges left for future research. Foremost, low efficiency of the proposed S-estimators in some scenarios suggests room for improvement. Regularized M-estimators are fettered by the high bias in robust estimates of the residual scale as currently available. Building upon the initial study of the problem in this work, understanding the sources of bias in finite samples is crucial to the eventual development of appropriate countermeasures and hence more reliable regularized M-estimators. Loh (2018), Fan et al. (2018), and others circumvent the problem of scale estimation altogether by choosing the scaling of the residuals for convex M-estimators from a grid of candidate values, but the theory currently does not adequately support robust estimation under the presence of contamination in the predictors. A potential avenue for future advances is combining the ideas of an adaptive search for appropriate scaling with highly robust regularized estimators. It is of particular interest whether an adaptive search is feasible and reliable under the presence of contaminated predictors. Similarly, other proposals for highly robust estimators for low-dimensional linear regression models can serve as blueprints for robust regularized estimators with higher efficiency than S-estimators.
As the distinct computational advantage of MM-estimators over other highly robust and efficient estimators vanishes in higher dimensions and in the presence of a penalty term, alternatives such as the τ estimator (Yohai and Zamar 1988) may be more practicable. It remains for future research to see whether these approaches can be adapted to the sparse linear regression model while retaining efficiency and robustness.

With the proliferation of data seen in recent history, sparse linear regression models are ubiquitous in many areas. The demonstrated reliability of the proposed estimators combined with an efficient implementation for the software environment R, available from https://cran.r-project.org/package=pense, will improve generalizability of predictive models and aid future scientific discoveries.

Bibliography

Akaike, H. (1974). “A new look at the statistical model identification”. In: IEEE Transactions on Automatic Control 19.6, pp. 716–723.
Alfons, A., C. Croux, and S. Gelper (2013). “Sparse least trimmed squares regression for analyzing high-dimensional large data sets”. In: The Annals of Applied Statistics 7.1, pp. 226–248.
Alqallaf, F., S. Van Aelst, V. J. Yohai, and R. H. Zamar (2009). “Propagation of outliers in multivariate data”. In: The Annals of Statistics 37.1, pp. 311–331.
Anderson, E. et al. (1999). LAPACK Users’ Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics. ISBN: 9780898719604.
Arslan, O. (2016). “Penalized MM regression estimation with $L_\gamma$ penalty: a robust version of bridge regression”. In: Statistics 50.6, pp. 1236–1260.
Bagirov, A. M., L. Jin, N. Karmitsa, A. Al Nuaimat, and N. Sultanova (2013). “Subgradient method for nonconvex nonsmooth optimization”. In: Journal of Optimization Theory and Applications 157.2, pp. 416–435.
Belloni, A., V. Chernozhukov, and L. Wang (2011). “Square-root lasso: pivotal recovery of sparse signals via conic programming”. In: Biometrika 98.4, pp. 791–806.
Bertsekas, D. P. (1982). 
Constrained Optimization and Lagrange Multiplier Methods. New York, NY: Academic Press. ISBN: 9780120934805.
Bertsimas, D., A. King, and R. Mazumder (2016). “Best subset selection via a modern optimization lens”. In: The Annals of Statistics 44.2, pp. 813–852.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge, UK: Cambridge University Press. ISBN: 9780521833783.
Bühlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Berlin Heidelberg: Springer.
Chang, L., S. Roberts, and A. Welsh (2018). “Robust lasso regression using Tukey’s biweight criterion”. In: Technometrics 60.1, pp. 36–47.
Chatterjee, S. and J. Jafarov (2015). “Prediction error of cross-validated lasso”. In: arXiv e-prints, arXiv:1502.06291.
Chen, Z., J. Fan, and R. Li (2018). “Error variance estimation in ultrahigh dimensional additive models”. In: Journal of the American Statistical Association 113.512, pp. 315–327.
Clarke, F. (1990). Optimization and Nonsmooth Analysis. Classics in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics. ISBN: 9781611971309.
Cohen Freue, G. V., D. Kepplinger, M. Salibián-Barrera, and E. Smucler (2019). “Robust elastic net estimators for variable selection and identification of proteomic biomarkers”. In: Annals of Applied Statistics 13.4, pp. 2065–2090.
Davies, L. (1990). “The asymptotics of S-estimators in the linear regression model”. In: The Annals of Statistics 18.4, pp. 1651–1675.
Davies, P. L. and U. Gather (2005). “Breakdown and groups”. In: The Annals of Statistics 33.3, pp. 977–1035.
Davis, D. and W. Yin (2017). “Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions”. In: Mathematics of Operations Research 42.3, pp. 783–805.
Deng, W. and W. Yin (2016). “On the global and linear convergence of the generalized alternating direction method of multipliers”. 
In: Journal of Scientific Computing 66.3, pp. 889–916.
Dicker, L. H. (2014). “Variance estimation in high-dimensional linear models”. In: Biometrika 101.2, pp. 269–284.
Donoho, D. L. and P. J. Huber (1982). “The notion of breakdown point”. In: A Festschrift For Erich L. Lehmann. Ed. by P. J. Bickel, K. Doksum, and J. Hodges. CRC Press, pp. 157–184.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). “Least angle regression”. In: The Annals of Statistics 32.2, pp. 407–499.
Fan, J., S. Guo, and N. Hao (2012). “Variance estimation using refitted cross-validation in ultrahigh dimensional regression”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74.1, pp. 37–65.
Fan, J., Q. Li, and Y. Wang (2017). “Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79.1, pp. 247–265.
Fan, J. and R. Li (2001). “Variable selection via nonconcave penalized likelihood and its oracle properties”. In: Journal of the American Statistical Association 96.456, pp. 1348–1360.
Fan, J., H. Liu, Q. Sun, and T. Zhang (2018). “I-LAMM for sparse learning: simultaneous control of algorithmic complexity and statistical error”. In: The Annals of Statistics 46.2, pp. 814–841.
Fan, J. and H. Peng (2004). “Nonconcave penalized likelihood with a diverging number of parameters”. In: The Annals of Statistics 32.3, pp. 928–961.
Fan, J., W. Wang, and Z. Zhu (2016). “A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery”. In: arXiv e-prints, arXiv:1603.08315.
Fan, J., L. Xue, and H. Zou (2014). “Strong oracle optimality of folded concave penalized estimation”. In: The Annals of Statistics 42.3, pp. 819–849.
Friedman, J., T. Hastie, and R. Tibshirani (2010). “Regularization paths for generalized linear models via coordinate descent”. In: Journal of Statistical Software, Articles 33.1, pp. 1–22.
Gentle, J. E. (2007). 
Matrix Algebra: Theory, Computations, and Applications in Statistics. 2nd edition. Springer Texts in Statistics. New York, NY: Springer. ISBN: 9780387708737.
Gill, P. E., G. H. Golub, W. Murray, and M. A. Saunders (1974). “Methods for modifying matrix factorizations”. In: Mathematics of Computation 28.126, pp. 505–535.
Hampel, F. R. (1975). “Beyond location parameters: robust concepts and methods”. In: Bulletin of the International Statistical Institute 46.1, pp. 375–382.
Hampel, F. R. (1974). “The influence curve and its role in robust estimation”. In: Journal of the American Statistical Association 69.346, pp. 383–393.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. 2nd edition. New York, NY: Springer.
Hastie, T., R. Tibshirani, and R. J. Tibshirani (2017). “Extended comparisons of best subset selection, forward stepwise selection, and the lasso”. In: arXiv e-prints, arXiv:1707.08692.
He, B. and X. Yuan (2015). “On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers”. In: Numerische Mathematik 130.3, pp. 567–577.
Hirose, K., S. Tateishi, and S. Konishi (2013). “Tuning parameter selection in sparse regression modeling”. In: Computational Statistics & Data Analysis 59, pp. 28–40.
Homrighausen, D. and D. J. McDonald (2016). “Risk-consistency of cross-validation with lasso-type procedures”. In: arXiv e-prints.
Homrighausen, D. and D. J. McDonald (2018). “A study on tuning parameter selection for the high-dimensional lasso”. In: Journal of Statistical Computation and Simulation 88.15, pp. 2865–2892.
Hössjer, O. (1992). “On the optimality of S-estimators”. In: Statistics & Probability Letters 14.5, pp. 413–419.
Huber, P. J. and E. M. Ronchetti (2009). Robust Statistics. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons, Inc.
Jojic, V., S. Saria, and D. Koller (2011). “Convex envelopes of complexity controlling penalties: the case against premature envelopment”. 
In: Proceedings of the Conference on Artificial Intelligence and Statistics 15, pp. 399–406.
Khan, J. A., S. Van Aelst, and R. H. Zamar (2007). “Robust linear model selection based on least angle regression”. In: Journal of the American Statistical Association 102.480, pp. 1289–1299.
Kim, J. and D. Pollard (1990). “Cube root asymptotics”. In: The Annals of Statistics 18.1, pp. 191–219.
Lange, K. (2016). MM Optimization Algorithms. Society for Industrial and Applied Mathematics. ISBN: 9781611974409.
Lehmann, E. and G. Casella (2003). Theory of Point Estimation. Springer Texts in Statistics. New York, NY: Springer. ISBN: 9780387985022.
Lin, D. et al. (2013). “Plasma protein biosignatures for detection of cardiac allograft vasculopathy”. In: The Journal of Heart and Lung Transplantation 32.7, pp. 723–733.
Loh, P.-L. (2017). “Statistical consistency and asymptotic normality for high-dimensional robust M-estimators”. In: The Annals of Statistics 45.2, pp. 866–896.
Loh, P.-L. (2018). “Scale calibration for high-dimensional robust regression”. In: arXiv e-prints.
Mammen, E. (1996). “Empirical process of residuals for high-dimensional linear models”. In: The Annals of Statistics 24.1, pp. 307–335.
Mandelbrot, B. (1960). “The Pareto-Lévy law and the distribution of income”. In: International Economic Review 1.2, pp. 79–106.
Maronna, R., D. Martin, V. Yohai, and M. Salibián-Barrera (2019). Robust Statistics: Theory and Methods (with R). Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons, Inc. ISBN: 9781119214670.
Maronna, R. and V. J. Yohai (2010). “Correcting MM estimates for “fat” data sets”. In: Computational Statistics & Data Analysis 54, pp. 3168–3173.
Maronna, R. A. and R. H. Zamar (2002). “Robust estimates of location and dispersion for high-dimensional datasets”. In: Technometrics 44.4, pp. 307–317.
Maronna, R. A. (2011). “Robust ridge regression for high-dimensional data”. In: Technometrics 53.1, pp. 44–53.
Mehta, N. U. and S. T. Reddy (2015). 
“Role of hemoglobin/heme scavenger protein hemopexin in atherosclerosis and inflammatory diseases.” In: Current Opinion in Lipidology 26.5, pp. 384–387.
Mei, S., Y. Bai, and A. Montanari (2018). “The landscape of empirical risk for nonconvex losses”. In: The Annals of Statistics 46.6A, pp. 2747–2774.
Mendes, B. and D. E. Tyler (1996). “Constrained M-estimation for regression”. In: Robust Statistics, Data Analysis, and Computer Intensive Methods: In Honor of Peter Huber’s 60th Birthday. Ed. by H. Rieder. New York, NY: Springer, pp. 299–320. ISBN: 9781461223801.
Neve, A., F. P. Cantatore, N. Maruotti, A. Corrado, and D. Ribatti (2014). “Extracellular matrix modulates angiogenesis in physiological and pathological conditions.” In: Biomed Research International 2014, p. 756078.
Parikh, N. and S. Boyd (2014). “Proximal algorithms”. In: Foundations and Trends® in Optimization 1.3, pp. 127–239.
Peña, D. and V. J. Yohai (1999). “A fast procedure for outlier diagnostics in large regression problems”. In: Journal of the American Statistical Association 94.446, pp. 434–445.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria.
Reid, S., R. Tibshirani, and J. Friedman (2016). “A study of error variance estimation in lasso regression”. In: Statistica Sinica 26, pp. 35–67.
Rockafellar, R. (1970). Convex Analysis. Princeton Landmarks in Mathematics and Physics. Ewing, NJ: Princeton University Press. ISBN: 9780691015866.
Rousseeuw, P. J. (1984). “Least median of squares regression”. In: Journal of the American Statistical Association 79.388, pp. 871–880.
Rousseeuw, P. J. and A. M. Leroy (1987). Robust Regression and Outlier Detection. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons, Inc. ISBN: 0471852333.
Rousseeuw, P. J. and K. Van Driessen (2006). “Computing LTS regression for large data sets”. In: Data Mining and Knowledge Discovery 12.1, pp. 29–45.
Rousseeuw, P. J. and V. J. 
Yohai (1984). “Robust regression by means of S-estimators”. In: Robust and Nonlinear Time Series Analysis. New York, NY: Springer, pp. 256–272. ISBN: 9781461578215.
Salibián-Barrera, M. and V. J. Yohai (2006). “A fast algorithm for S-regression estimates”. In: Journal of Computational and Graphical Statistics 15.2, pp. 414–427.
Schmauss, D. and M. Weis (2008). “Cardiac allograft vasculopathy”. In: Circulation 117.16, pp. 2131–2141.
Schwarz, G. (1978). “Estimating the dimension of a model”. In: The Annals of Statistics 6.2, pp. 461–464.
Shor, N. (1985). Minimization Methods for Non-Differentiable Functions. Springer Series in Computational Mathematics. Berlin, Heidelberg: Springer. ISBN: 9783540127635.
Simon, N., J. Friedman, T. Hastie, and R. Tibshirani (2011). “Regularization paths for Cox’s proportional hazards model via coordinate descent”. In: Journal of Statistical Software 39.5, pp. 1–13.
Smucler, E. (2019). “Asymptotics for redescending M-estimators in linear models with increasing dimension”. In: Statistica Sinica 29, pp. 1065–1081.
Smucler, E. and V. J. Yohai (2017). “Robust and sparse estimators for linear regression models”. In: Computational Statistics & Data Analysis 111.C, pp. 116–130.
Sun, Q., W.-X. Zhou, and J. Fan (2019). “Adaptive Huber regression”. In: Journal of the American Statistical Association 115.529, pp. 254–265.
Sun, T. and C.-H. Zhang (2012). “Scaled sparse linear regression”. In: Biometrika 99.4, pp. 879–898.
Tibshirani, R. (1996). “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 58.1, pp. 267–288.
Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2012). “Strong rules for discarding predictors in lasso-type problems”. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 74.2, pp. 245–266.
Tibshirani, R. J. and S. Rosset (2019). 
“Excess optimism: how biased is the apparent error of an estimator tuned by SURE?” In: Journal of the American Statistical Association 114.526, pp. 697–712.
Tomioka, R., T. Suzuki, and M. Sugiyama (2011). “Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation”. In: Journal of Machine Learning Research 12, pp. 1537–1586.
Van de Geer, S. and P. Müller (2012). “Quasi-likelihood and/or robust estimation in high dimensions”. In: Statistical Science 27.4, pp. 469–480.
Van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. New York, NY: Springer. ISBN: 9780387946405.
Vinchi, F., L. De Franceschi, A. Ghigo, T. Townes, J. Cimino, L. Silengo, E. Hirsch, F. Altruda, and E. Tolosano (2013). “Hemopexin therapy improves cardiovascular function by preventing heme-induced endothelial toxicity in mouse models of hemolytic diseases.” In: Circulation 127.12, pp. 1317–1329.
Wang, H. and R. Li (2007). “Tuning parameter selectors for the smoothly clipped absolute deviation method”. In: Biometrika 94.3, pp. 553–568.
Watkins, D. S. (2002). Fundamentals of Matrix Computations. 2nd edition. New York, NY: John Wiley & Sons, Inc.
Yang, T. (2017). “Adaptive robust methodology for parameter estimation and variable selection”. PhD thesis. Clemson, SC: Clemson University. ISBN: 9780355344769.
Yohai, V., R. J. Maronna, D. Martin, G. Brownson, K. Konis, and M. Salibián-Barrera (2019). RobStatTM: Robust Statistics: Theory and Methods. R package version 1.0.0.
Yohai, V. J. (1985). High breakdown point and high efficiency robust estimates for regression. Tech. rep. 66. University of Washington.
Yohai, V. J. (1987). “High breakdown-point and high efficiency robust estimates for regression”. In: The Annals of Statistics 15.2, pp. 642–656.
Yohai, V. J. and R. H. Zamar (1986). High breakdown-point estimates of regression by means of the minimization of an efficient scale. Tech. rep. 84. 
University of Washington.
Yohai, V. J. and R. H. Zamar (1988). “High breakdown-point estimates of regression by means of the minimization of an efficient scale”. In: Journal of the American Statistical Association 83.402, pp. 406–413.
Yohai, V. J. and R. H. Zamar (1997). “Optimal locally robust M-estimates of regression”. In: Journal of Statistical Planning and Inference 64.2, pp. 309–323.
Yu, G. and J. Bien (2019). “Estimating the error variance in a high-dimensional linear model”. In: Biometrika 106.3, pp. 533–546.
Zhang, C.-H. and T. Zhang (2012). “A general theory of concave regularization for high-dimensional sparse estimation problems”. In: Statistical Science 27.4, pp. 576–593.
Zhao, Y., J. Chen, J. M. Freudenberg, Q. Meng, null null, D. K. Rajpal, and X. Yang (2016). “Network-based identification and prioritization of key regulators of coronary artery disease loci”. In: Arteriosclerosis, Thrombosis, and Vascular Biology 36.5, pp. 928–941.
Zou, H. (2006). “The adaptive lasso and its oracle properties”. In: Journal of the American Statistical Association 101.476, pp. 1418–1429.
Zou, H. and T. Hastie (2005). “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67.2, pp. 301–320.
Zou, H. and R. Li (2008). “One-step sparse estimates in nonconcave penalized likelihood models”. In: The Annals of Statistics 36.4, pp. 1509–1533.
Zou, H. and H. H. Zhang (2009). “On the adaptive elastic-net with a diverging number of parameters”. In: The Annals of Statistics 37.4, pp. 1733–1751.

Appendix A
Simulation Settings

A.1 Data-Generation Schemes

The $p$-dimensional predictors, $x_i$, $i = 1, \dots, n$, are independent realizations of a $p$-dimensional random variable $X$ from a multivariate t distribution with 4 degrees of freedom. The correlation structure among the predictors can be one of the following.

Correlation structure 1 [AR(1)]: Exponential decay of the correlation between predictors according to their “distance”, $\operatorname{Cor}(X_j, X_{j'}) = \rho^{|j - j'|}$, for $j, j' = 1, \dots, p$. The parameter $0 \le \rho \le 1$ determines the general strength of the correlation.

Correlation structure 2 [equal correlation]: All predictors are equally correlated, $\operatorname{Cor}(X_j, X_{j'}) = \rho$ for all $j, j' = 1, \dots, p$, $j \ne j'$.

The response values $y_i$, $i = 1, \dots, n$, are generated by a linear combination of the first $s$ predictors:

$$y_i = u_i + \sum_{j=1}^{s} x_{ij}, \quad i = 1, \dots, n. \tag{A.1}$$

The residuals $u_i$ are scaled versions of raw residuals $\tilde{u}_i$. These unscaled $\tilde{u}_i$ are independent realizations of a random variable $U$ following a central stable distribution (Mandelbrot 1960) with varying stability parameter $\alpha$:

LT  light-tailed stable distribution with stability parameter $\alpha = 2$, i.e., a standard Normal distribution,
ML  moderate- to light-tailed stable distribution with stability parameter $\alpha = 1.66$,
MH  moderate- to heavy-tailed stable distribution with stability parameter $\alpha = 1.33$,
HT  heavy-tailed stable distribution with stability parameter $\alpha = 1$, i.e., a Cauchy distribution.

The raw residuals $\tilde{u}_i$ are scaled to attain a certain proportion of variance explained (PVE) by the true linear regression model (A.1):

$$u_i = \sqrt{\frac{1 - \nu}{\nu}} \, \tilde{u}_i \, \frac{\hat{\sigma}_0}{\hat{\tau}_{\tilde{u}}}$$

where

$$\hat{\tau}_{\tilde{u}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \max\left(3, \frac{|\tilde{u}_i|}{\operatorname{Median}_{i'=1,\dots,n} |\tilde{u}_{i'}|}\right)^{2}}, \qquad \hat{\sigma}_0 = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( \sum_{j=1}^{s} x_{ij} - \frac{1}{n} \sum_{i'=1}^{n} \sum_{j'=1}^{s} x_{i'j'} \right)^{2}}. \tag{A.2}$$

This definition of PVE uses a robust measure of spread of the error terms because, of the considered error distributions, only the light-tailed Normal distribution has finite variance. Unless otherwise specified, data is generated with $\nu = 0.25$, i.e., the true model explains about 25% of the observed variance in $y_i$.

Contamination is artificially introduced in $0 \le n_c < n$ observations. 
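As a rough illustration of the scheme described so far, the following Python sketch builds the two correlation structures and draws predictors and responses according to (A.1). The function names, the seed, and the values n = 100, p = 16, s = 4, and ρ = 0.5 are assumptions chosen for illustration only. The errors are plain standard Normal draws (the LT case) and are left unscaled, so neither the stable distributions nor the PVE scaling in (A.2) is reproduced here.

```python
import numpy as np

def correlation_matrix(p, rho, structure="ar1"):
    """Correlation structure 1 (AR(1)) or 2 (equal correlation)."""
    idx = np.arange(p)
    if structure == "ar1":
        # Cor(X_j, X_j') = rho^|j - j'|
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if structure == "equal":
        # Cor(X_j, X_j') = rho for j != j', 1 on the diagonal
        corr = np.full((p, p), float(rho))
        np.fill_diagonal(corr, 1.0)
        return corr
    raise ValueError("structure must be 'ar1' or 'equal'")

def generate_predictors(n, p, rho, structure, rng, df=4):
    """Draw n rows from a multivariate t distribution with df degrees of freedom.

    Uses the usual normal / sqrt(chi-square / df) construction; requires
    rho < 1 so that the Cholesky factor exists.
    """
    chol = np.linalg.cholesky(correlation_matrix(p, rho, structure))
    z = rng.standard_normal((n, p)) @ chol.T
    w = rng.chisquare(df, size=n) / df
    return z / np.sqrt(w)[:, None]

def generate_response(x, s, errors):
    """Response as in (A.1): error plus the sum of the first s predictors."""
    return errors + x[:, :s].sum(axis=1)

rng = np.random.default_rng(1)  # arbitrary seed
x = generate_predictors(n=100, p=16, rho=0.5, structure="ar1", rng=rng)
u = rng.standard_normal(100)    # LT errors, not rescaled to a target PVE
y = generate_response(x, s=4, errors=u)
```

The Cholesky-based construction is only one of several ways to draw correlated t-distributed predictors; it is used here because it needs nothing beyond NumPy.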
Contaminated observations are generated by a different linear model with strong signal and have high leverage by replacing some predictor values with more extreme values. Usually $n_c = \lfloor n/4 \rfloor$, i.e., 25% contamination, unless otherwise specified.

Leverage points are introduced by contaminating $q = \log_2(p)$ predictors. The indices of contaminated predictors are sampled non-uniformly without replacement from $\{1, \dots, p\}$ to increase the chances of active predictors being contaminated. This is done by first sampling $q_A$ from a discrete uniform distribution over $\{\max(0, q + s - p), \dots, \min(q, s)\}$. Then, $q_A$ indices are sampled uniformly without replacement from $\{1, \dots, s\}$ and $q - q_A$ are sampled uniformly without replacement from $\{s + 1, \dots, p\}$, denoting the sampled indices by $J_A$ and $J_{A^C}$, respectively. The values of these contaminated predictors are replaced by

$$x_{ij} = x_{ij} \sqrt{k_l \, \frac{\max_{i'=1,\dots,p} d_{i'}^2}{d_i^2}}, \quad i = 1, \dots, n_c, \; j \in J_A \cup J_{A^C}, \tag{A.3}$$

where $d_i^2$ is the squared Mahalanobis distance of the $i$-th observation, relative to the empirical covariance matrix of the predictors $J_A \cup J_{A^C}$, estimated over the uncontaminated observations. The placement of the leverage points and thus the severity of leverage is controlled by the parameter $k_l$, which can take values $k_l \in \{2, 4, 8, 16\}$, corresponding to low, moderate, high, and extreme leverage, respectively.

The response values of the $n_c$ contaminated observations are determined by the $q$ contaminated predictors

$$y_i = u_i + \sum_{j \in J_A \cup J_{A^C}} k_v x_{ij}, \quad i = 1, \dots, n_c, \tag{A.4}$$

where $k_v$ determines the magnitude of the residuals, relative to the true model, and takes values in $\{-2, -1, 0, 3, 7\}$. The larger the difference $|k_v - 1|$, the more extreme the contamination. In case of contamination, the scale estimates in (A.2) are computed only from the $n - n_c$ uncontaminated observations.

A.1.1 Short-Hand Notation

Data generation schemes are referenced throughout the text according to a short-hand notation as explained in Figure A.1. 
The short-hand notation consists of four parts. The first two letters specify the sparsity of the true model, i.e., the number of truly active predictors as a function of $p$, followed by a number identifying the correlation structure among the $p$ predictors. The third part consists of one to two characters denoting the error distribution in terms of the weight of tails. The last part specifies the parameters for contamination. If “(—)”, the generated data does not contain contaminated observations, while two numbers in parentheses specify $k_v$, the parameter for contaminating the model according to (A.4), and $k_l$, the parameter for contaminating the predictors according to (A.3), in that order. The last part can also be “*”, meaning that several combinations of contamination parameters are considered.

The short-hand notation does not specify the dimensions of the generated data, $n$ and $p$. If not specified otherwise, the data is generated such that the true model explains 25% of the observed variance in $y_i$, i.e., $\nu = 0.25$. If the last part of the notation is given, 25% of the observations are contaminated, unless stated otherwise in the text.

Figure A.1: Short-hand notation for data generation schemes.

A.2 Comparison of Initial Estimates

To compare the performance of initial estimates in Section 3.2.3, data sets of size $n = 100$ and $p = 16$ are generated according to scheme VS1-LT*. The proportion of variance explained is 25% or 50%. For contamination, all combinations of $k_l \in \{1, 2, 4, 8\}$ and $k_v \in \{-2, -1, 0, 3, 7\}$ are considered. Combined with the scenarios of no contamination, this leads to a total of 42 scenarios.

For each of the two scenarios without contamination, 250 data sets are generated, while for scenarios with contamination 50 data sets each are generated. On each of the 2500 data sets, the PENSE estimate is computed over a grid $Q$ comprising 50 log-spaced penalization levels. 
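The scenario and data-set counts quoted above follow from a simple enumeration, which the short Python sketch below reproduces. The variable names are hypothetical, and the endpoints of the penalization grid are placeholders, since the text only states that the grid has 50 log-spaced levels.

```python
import itertools
import numpy as np

leverage = [1, 2, 4, 8]        # candidate values of k_l
vertical = [-2, -1, 0, 3, 7]   # candidate values of k_v
pve = [0.25, 0.50]             # proportions of variance explained

# Every PVE level is crossed with every (k_l, k_v) pair, plus one
# uncontaminated scenario per PVE level.
contaminated = list(itertools.product(pve, leverage, vertical))
clean = [(v, None, None) for v in pve]

n_scenarios = len(contaminated) + len(clean)
n_datasets = 250 * len(clean) + 50 * len(contaminated)

# 50 log-spaced penalization levels; the range (1e-3, 1) is a placeholder.
lambda_grid = np.geomspace(1e-3, 1.0, num=50)

print(n_scenarios, n_datasets, len(lambda_grid))
```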
At 10 log-spaced penalization levels, $Q_I$ (spanning the same range as $Q$), the EN-PY estimates are computed. All of these estimates are used to initialize the PENSE algorithm for each of the 50 values in $Q$ to find the best local minimum.

In the process of computing the EN-PY estimates, a total of $K$ LS-EN estimates are computed. To make the computational demand for the EN-PY estimator and the random subsampling strategy comparable, a total of $\lceil K/10 \rceil$ random subsamples are taken for the random subsampling strategy. For each of these random subsamples, the LS-EN estimates are computed over the same grid $Q_I$ as used for EN-PY. All of the $K$ initial estimates are then used to initialize the PENSE algorithm, similar to the EN-PY initial estimates.

A.3 Numerical Experiments for PENSE and Adaptive PENSE

Numerical experiments comparing PENSE and adaptive PENSE to competing methods in Sections 3.6 and 4.4 consider a large number of scenarios following the data generation detailed in Section A.1. Specifically, data is generated according to data generation schemes VS1-* and MS1-*, with $\nu = 0.25$ and varying number of observations and predictors. For $n = 100$ observations, $p \in \{16, 32, 64, 128\}$, while for $n = 400$, the number of predictors is either $p = 32$ or $p = 64$. In scenarios with contamination, 25% of observations are affected, with leverage parameter $k_l$ fixed at 8 and vertical outlier positions $k_v \in \{-2, -1, 0, 3, 7\}$.

PENSE and adaptive PENSE estimates are computed using the pense R package available from CRAN and detailed in Chapter 6. MM-LASSO is computed using the code from https://github.com/esmucler/mmlasso, implementing the algorithm originally proposed in Smucler and Yohai (2017). Cross-validation for (adaptive) PENSE, in particular standardizing the data and adjusting the robustness parameter $\delta$ in the CV folds, is done according to Section 6.4. 
To ensure computational feasibility in this large-scale simulation study, cross-validation for hyper-parameter selection is performed only a single time for all considered estimates. The reported performance metrics are therefore likely underestimating the true performance of the estimators, albeit all methods should be equally affected.

Appendix B
Proofs

B.1 Breakdown Point of PENSE

Recall that the PENSE estimate, $\tilde{\theta}$, computed from the sample $\mathcal{Z} = (\mathbf{y}, \mathbf{X}) = \{(y_i, x_i) : i = 1, \dots, n\}$ is given by

$$\tilde{\theta} = \operatorname*{arg\,min}_{\mu, \beta} \mathcal{O}_S(\mu, \beta; \lambda, \alpha, \mathcal{Z}) = \operatorname*{arg\,min}_{\mu, \beta} \mathcal{L}_S(\mathbf{y}, \mu + \mathbf{X}\beta) + \lambda \Phi_{EN}(\beta; \alpha).$$

In the following, contaminated samples derived from $\mathcal{Z}$, where $m < n$ out of the $n$ observations are replaced by arbitrary values, are denoted by $\tilde{\mathcal{Z}}_m = (\tilde{\mathbf{y}}_m, \tilde{\mathbf{X}}_m)$.

To prove the finite-sample breakdown point of PENSE, the following lemma from Maronna et al. (2019, p. 184) is essential.

Lemma 1. Consider any sequence of samples $(\tilde{\mathcal{Z}}_m^{(k)})_{k \in \mathbb{N}}$ with individual observation pairs $(\tilde{y}_i^{(k)}, \tilde{x}_i^{(k)})$ and corresponding residuals $\tilde{r}_i^{(k)} = \tilde{y}_i^{(k)} - \mu^{(k)} - (\tilde{x}_i^{(k)})^\intercal \beta^{(k)}$ for any sequence of estimates $(\mu^{(k)}, \beta^{(k)})$.
(i) Let $C = \{i : |\tilde{r}_i^{(k)}| \to \infty\}$. If $\#(C) > n\delta$, then $\hat{\sigma}_M(\tilde{r}^{(k)}) \to \infty$ for $k \to \infty$.
(ii) Let $D = \{i : |\tilde{r}_i^{(k)}| \text{ is bounded}\}$. If $\#(D) > n - n\delta$, then $\hat{\sigma}_M(\tilde{r}^{(k)})$ is bounded.

With Lemma 1 in place, the lower and upper bounds in Theorem 2 are proved separately. The following proof of the FBP of PENSE first appeared in Cohen Freue et al. (2019) with slightly different notation.

Proof of Theorem 2, bounded from below. Consider an arbitrary sequence of contaminated samples $(\tilde{\mathcal{Z}}_m^{(k)})_{k \in \mathbb{N}}$ with $m \le m(\delta)$. The goal is to show that the corresponding sequence of PENSE estimates, $(\tilde{\theta}^{(k)})_{k \in \mathbb{N}}$, remains bounded. The residuals of these PENSE estimates are denoted by $\tilde{r}_i^{(k)} = \tilde{y}_i^{(k)} - \tilde{\mu}^{(k)} - (\tilde{x}_i^{(k)})^\intercal \tilde{\beta}^{(k)}$.

First, let $\theta^* = (\mu^*, \beta^*)$ be fixed for all $k$ such that $|\mu^*| < \infty$ and $\|\beta^*\|_1 = K_1 < \infty$, which also implies a finite $L_2$ norm of the slope, $\|\beta^*\|_2^2 = K_2 < \infty$. 
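For those uncontaminated observations $(y_i, x_i^\intercal)$ which are also in the contaminated sample $\tilde{\mathcal{Z}}_m^{(k)}$, the triangle inequality says that the residuals $r_i^{*(k)} = y_i - \mu^* - x_i^\intercal \beta^*$ are bounded, $|r_i^{*(k)}| < \infty$. Therefore, the number of bounded residuals $\#(D) \ge n - m \ge n - n\delta$ and hence part (ii) of Lemma 1 says that $\hat{\sigma}_M(r^{*(k)})$ is bounded:

$$\sup_{k \in \mathbb{N}} \hat{\sigma}_M(r^{*(k)}) < \infty. \tag{B.1}$$

Now suppose that the sequence of slope estimates from PENSE, $(\|\tilde{\beta}^{(k)}\|_1)_{k \in \mathbb{N}}$, is unbounded. It is important to note that the sequence of estimated intercepts may be bounded or unbounded. The boundedness of the M-scale estimate in (B.1) implies there exists a $k_0 \in \mathbb{N}$ such that $\|\tilde{\beta}^{(k_0)}\|_1 > K_1 + \frac{1}{\alpha\lambda} \sup_{k \in \mathbb{N}} \hat{\sigma}_M^2(r^{*(k)})$ and $\|\tilde{\beta}^{(k_0)}\|_2^2 > K_2$. Thus, for every $k' \ge k_0$,

$$\begin{aligned} \mathcal{O}_S(\tilde{\mu}^{(k')}, \tilde{\beta}^{(k')}; \lambda, \alpha, \tilde{\mathcal{Z}}_m^{(k)}) &> \hat{\sigma}_M^2(r^{*(k')}) + \lambda \left( \frac{1 - \alpha}{2} K_2 + \alpha K_1 \right) + \sup_{k \in \mathbb{N}} \hat{\sigma}_M^2(r^{*(k)}) \\ &\ge \mathcal{O}_S(\mu^*, \beta^*; \lambda, \alpha, \tilde{\mathcal{Z}}_m^{(k)}), \end{aligned} \tag{B.2}$$

contradicting the assumption that $\tilde{\theta}^{(k)}$ minimizes the PENSE objective function. This proves that $\tilde{\beta}^{(k)}$ is bounded for $m \le m(\delta)$ regardless of $\tilde{\mu}^{(k)}$ being bounded or not. It remains to show that the intercept is bounded as well.

Since $(\|\tilde{\beta}^{(k)}\|_1)_{k \in \mathbb{N}}$ is bounded, $|y_i - x_i^\intercal \tilde{\beta}^{(k)}|$ is bounded for the $n - m$ uncontaminated observations $(y_i, x_i)$ in the contaminated sample $\tilde{\mathcal{Z}}_m^{(k)}$. Assume now that $|\tilde{\mu}^{(k)}| \to \infty$. Then the residuals of the uncontaminated observations also tend to infinity and hence $\#(C) > n\delta$. According to part (i) of Lemma 1 this implies that $\hat{\sigma}_M(\tilde{r}^{(k)}) \to \infty$. Therefore, there exists an integer $k_1 \in \mathbb{N}$ such that $\hat{\sigma}_M^2(\tilde{r}^{(k_1)}) > \sup_{k \in \mathbb{N}} \hat{\sigma}_M^2(r^{*(k)}) + \lambda \left( \frac{1 - \alpha}{2} K_2 + \alpha K_1 \right)$. Similar to (B.2), this shows that for all $k' \ge k_1$,

$$\mathcal{O}_S(\tilde{\mu}^{(k')}, \tilde{\beta}^{(k')}; \lambda, \alpha, \tilde{\mathcal{Z}}_m^{(k)}) \ge \mathcal{O}_S(\mu^*, \beta^*; \lambda, \alpha, \tilde{\mathcal{Z}}_m^{(k)}),$$

and hence $\tilde{\theta}^{(k)}$ must be bounded for $m \le m(\delta)$. □

Proof of Theorem 2, bounded from above. Taking $m > n\delta$ it can be shown that the PENSE estimate breaks down. Without loss of generality, assume that the first $m$ observations in the contaminated samples $\tilde{\mathcal{Z}}_m^{(k)}$ are different from the original sample $\mathcal{Z}$. 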
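Choosing an arbitrary $\mathbf{x}_0$ with $\|\mathbf{x}_0\|_2 = 1$ and $0 < \nu \le 1$, it can be shown that for the sequence of contaminated samples $(\tilde{\mathcal{Z}}^{(k)}_m)_{k \in \mathbb{N}}$,
$$(\tilde y^{(k)}_i, \tilde{\mathbf{x}}^{(k)}_i) = \begin{cases} (k^{\nu+1}, k\mathbf{x}_0) & i \in C \\ (y_i, \mathbf{x}_i) & i \notin C, \end{cases}$$
the corresponding sequence of estimates $(\tilde\theta^{(k)})_{k \in \mathbb{N}}$ cannot be bounded.

Assume here that $\tilde\theta^{(k)}$ is bounded in norm. As in the proof above, the residuals of the uncontaminated observations satisfy $|\tilde r^{(k)}_i| < \infty$ for $i = m + 1, \dots, n$ and all $k \in \mathbb{N}$. Residuals for contaminated observations, on the other hand, are bounded below by
$$|\tilde r^{(k)}_i| \ge k \left| k^{\nu} - \|\mathbf{x}_0\|_1 \|\tilde\beta^{(k)}\|_1 \right| - |\tilde\mu^{(k)}|, \quad i = 1, \dots, m.$$
The norms of $\tilde\mu^{(k)}$ and $\tilde\beta^{(k)}$ are bounded, and hence the right-hand side goes to infinity, as do the residuals for $i \in C$. According to part (i) of Lemma 1, this implies the scale $\hat\sigma_M(\tilde{\mathbf{r}}^{(k)})$ tends to infinity as well. The M-estimation equation in the definition of the S-loss can be decomposed to
$$\sum_{i=1}^{m} \rho\left(\frac{\tilde r^{(k)}_i}{\hat\sigma_M(\tilde{\mathbf{r}}^{(k)})}\right) + \sum_{i=m+1}^{n} \rho\left(\frac{\tilde r^{(k)}_i}{\hat\sigma_M(\tilde{\mathbf{r}}^{(k)})}\right) = n\delta.$$
Taking the limit for $k \to \infty$, the argument in the $\rho$ function of the second sum tends to zero because the residuals of uncontaminated observations remain bounded, which in turn leads to the second sum converging to 0. The summands in the first term, on the other hand, are all identical and the limit must be
$$\lim_{k \to \infty} \rho\left(\frac{1 - \tilde\mu^{(k)}/k^{\nu+1} - \mathbf{x}_0^\intercal \tilde\beta^{(k)}/k^{\nu}}{\hat\sigma_M(\tilde{\mathbf{r}}^{(k)})/k^{\nu+1}}\right) = \frac{n\delta}{m}. \tag{B.3}$$
From assumptions [R1] and [R2] the function $\rho(t)$ is continuous and increasing for $t > 0$ such that $\rho(t) < 1 = \rho(\infty)$. Because $n\delta/m < 1 = \rho(\infty)$ there exists a unique value $\gamma$ such that
$$\rho\left(\frac{1}{\gamma}\right) = \frac{n\delta}{m}. \tag{B.4}$$
The numerator in the argument in (B.3) tends to 1 and due to (B.4) any converging subsequence of $\hat\sigma_M(\tilde{\mathbf{r}}^{(k)})/k^{\nu+1}$ must have limit $\gamma$. Therefore, the boundedness of $\tilde\theta^{(k)}$ implies
$$\lim_{k \to \infty} \frac{1}{k^{2\nu+2}} \mathcal{O}_S(\tilde\mu^{(k)}, \tilde\beta^{(k)}; \lambda, \alpha, \tilde{\mathcal{Z}}^{(k)}_m) = \gamma^2. \tag{B.5}$$

Next define an unbounded sequence of parameters as $\mu^{(k)} = 0$ and $\beta^{(k)} = \frac{k^{\nu}}{2} \mathbf{x}_0$. For this sequence of parameters the residuals are
$$r^{(k)}_i = \begin{cases} \frac{k^{\nu+1}}{2} & i = 1, \dots, m \\ y_i - \frac{k^{\nu}}{2} \mathbf{x}_0^\intercal \mathbf{x}_i & i = m + 1, \dots, n, \end{cases}$$
which all tend to infinity for $k \to \infty$, implying that $\hat\sigma_M(\mathbf{r}^{(k)}) \to \infty$.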
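The role of Lemma 1 in this argument can be illustrated numerically. The following is a minimal sketch, assuming Tukey's bisquare $\rho$ function with tuning constant $c = 1$ and $\delta = 0.33$; both are illustrative choices, and the function names `rho_bisquare`, `m_scale`, and `contaminate` are hypothetical rather than taken from the software accompanying this work. With $n = 20$ residuals, $n\delta = 6.6$, so replacing 7 residuals by an arbitrarily large value $k$ drives the M-scale to infinity, while replacing only 6 leaves it bounded.

```python
import math


def rho_bisquare(t, c=1.0):
    """Tukey's bisquare rho function, standardized so that rho(inf) = 1."""
    if abs(t) >= c:
        return 1.0
    u = (t / c) ** 2
    return 1.0 - (1.0 - u) ** 3


def m_scale(residuals, delta=0.33, c=1.0, tol=1e-10):
    """Solve (1/n) * sum_i rho(r_i / s) = delta for s > 0 by bisection."""
    n = len(residuals)
    nonzero = [abs(r) for r in residuals if r != 0.0]
    if len(nonzero) <= delta * n:
        return 0.0  # too many exact zeros: no positive root exists

    def mean_rho(s):
        return sum(rho_bisquare(r / s, c) for r in residuals) / n

    # At s = lo every nonzero residual has rho = 1, so mean_rho(lo) > delta.
    lo = min(nonzero) / c
    hi = max(nonzero)
    while mean_rho(hi) > delta:  # expand until the root is bracketed
        hi *= 2.0
    # mean_rho is non-increasing in s, so bisection finds the root.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def contaminate(residuals, m, k):
    """Replace the first m residuals by the value k."""
    return [float(k)] * m + list(residuals[m:])


# Illustrative "clean" residuals; n = 20, hence n * delta = 6.6.
clean = [(i - 9.5) / 5.0 for i in range(20)]

for m in (6, 7):
    scales = [m_scale(contaminate(clean, m, k)) for k in (1e3, 1e6)]
    print(f"m = {m}: M-scale for k = 1e3 and k = 1e6:", scales)
```

For $m = 6$ the two scales agree up to the bisection tolerance, whereas for $m = 7$ the scale grows proportionally to $k$, which is the behavior exploited in both parts of the proof.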
The decomposition of the M-estimation equation yields
$$\sum_{i=1}^{m} \rho\left(\frac{k^{\nu+1}/2}{\hat\sigma_M(\mathbf{r}^{(k)})}\right) + \sum_{i=m+1}^{n} \rho\left(\frac{y_i - \frac{k^{\nu}}{2} \mathbf{x}_0^\intercal \mathbf{x}_i}{\hat\sigma_M(\mathbf{r}^{(k)})}\right) = n\delta.$$
Taking the limit for $k \to \infty$ in all terms, the second sum tends to 0 and, following the same argument as before, the limit of the first sum yields
$$\lim_{k \to \infty} \frac{1}{k^{2\nu+2}} \mathcal{O}_S(\mu^{(k)}, \beta^{(k)}; \lambda, \alpha, \tilde{\mathcal{Z}}^{(k)}_m) = \frac{\gamma^2}{4} \tag{B.6}$$
because the $\ell_1$ norm of $\mathbf{x}_0$ is finite.

From the limits (B.5) and (B.6) it follows that there exists a $k_0$ such that for all $k > k_0$
$$\frac{1}{k^{2\nu+2}} \mathcal{O}_S(\mu^{(k)}, \beta^{(k)}; \lambda, \alpha, \tilde{\mathcal{Z}}^{(k)}_m) < \frac{1}{k^{2\nu+2}} \mathcal{O}_S(\tilde\mu^{(k)}, \tilde\beta^{(k)}; \lambda, \alpha, \tilde{\mathcal{Z}}^{(k)}_m),$$
showing that a bounded $\tilde\theta^{(k)}$ cannot be a global minimum of the PENSE objective function for the contaminated samples. □

B.2 Asymptotic Properties of Adaptive PENSE

Below are the proofs of asymptotic properties of adaptive PENSE as presented in Section 4.2. For notational simplicity, I drop the intercept term from the model, i.e., the linear model (2.1) is simplified to
$$Y = \mathcal{X}^\intercal \beta^0 + U$$
and the joint distribution $G_0$ of $(Y, \mathcal{X})$ is written in terms of the error
$$G_0(u, \mathbf{x}) := G_0(y, \mathbf{x}) = G_0(\mathbf{x}) F_0(y - \mathbf{x}^\intercal \beta^0).$$
All the proofs also hold for the model with an intercept term included. Another notational shortcut in the following proofs is to write the M-scale of the residuals in terms of the regression coefficients, i.e.,
$$\hat\sigma_M(\beta) = \hat\sigma_M(\mathbf{y} - \mathbf{X}\beta)$$
and accordingly the population version, $\sigma_M(\beta)$. For all proofs below, I define $\psi(t) = \rho'(t)$ to denote the first derivative of the $\rho$ function in the definition of the M-scale estimate and hence of the S-loss, as well as the mapping $\varphi\colon \mathbb{R} \to [0, \infty)$ as
$$\varphi(t) := \psi(t)\, t.$$

B.2.1 Preliminary Results Concerning the M-Scale Estimator

Before proving asymptotic properties of the adaptive PENSE estimator, several intermediate results concerning the M-scale estimator are required.

Lemma 2. Let $(y_i, \mathbf{x}_i^\intercal)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2) and $u_i = y_i - \mathbf{x}_i^\intercal \beta^0$.
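If $\mathbf{v} \in \mathbb{R}^p$ and $s \in (0, \infty)$, then the empirical processes $(P_n \eta_{\mathbf{v},s})_{\mathbf{v},s}$ with
$$\eta_{\mathbf{v},s}(u, \mathbf{x}) := \varphi\left(\frac{u - \mathbf{x}^\intercal \mathbf{v}}{s}\right)$$
converge uniformly almost surely:
$$\lim_{n \to \infty} \sup_{\substack{\mathbf{v} \in \mathbb{R}^p \\ s \in (0, \infty)}} \left| \frac{1}{n} \sum_{i=1}^{n} \eta_{\mathbf{v},s}(u_i, \mathbf{x}_i) - \mathbb{E}_{G_0}\left[\eta_{\mathbf{v},s}(U, \mathcal{X})\right] \right| = 0 \quad \text{a.s.} \tag{B.7}$$

Proof of Lemma 2. I will show step by step that the space $\mathcal{F} = \{\eta_{\mathbf{v},s} : \mathbf{v} \in \mathbb{R}^p, s \in (0, \infty)\}$ is a bounded Vapnik–Chervonenkis (VC) class of functions and hence Glivenko-Cantelli. The space $\mathcal{F}$ is bounded because $\varphi(t)$ is bounded by assumptions on $\rho$. Define the mapping
$$g_{\mathbf{v},s}\colon \mathbb{R}^{p+1} \to \mathbb{R}, \quad (u, \mathbf{x}) \mapsto (u - \mathbf{x}^\intercal \mathbf{v})\, s^{-1}.$$
The corresponding function space $\mathcal{G} = \{g_{\mathbf{v},s} : \mathbf{v} \in \mathbb{R}^p, s \in (0, \infty)\}$ is a subset of a finite-dimensional vector space with dimension $\dim(\mathcal{G}) = p + 1$. Therefore, $\mathcal{G}$ is VC with VC index $V(\mathcal{G}) \le p + 3$ according to Lemma 2.6.15 in van der Vaart and Wellner (1996). Due to the assumptions on $\rho$, the function $\varphi(t)$ can be decomposed into
$$\varphi(t) = \max\left\{\min\{\varphi_1(t), \varphi_2(t)\}, \min\{\varphi_1(-t), \varphi_2(-t)\}\right\}$$
with $\varphi_{1,2}$ monotone functions. Thus, $\Phi_{1,2} = \{\varphi_{1,2}(g(\cdot)) : g \in \mathcal{G}\}$ and $\Phi^{(-)}_{1,2} = \{\varphi_{1,2}(-g(\cdot)) : g \in \mathcal{G}\}$ are also VC due to Lemma 2.6.18 (iv) and (viii) in van der Vaart and Wellner (1996). Using Lemma 2.6.18 (i) in van der Vaart and Wellner (1996) then leads to $\Phi = \Phi_1 \wedge \Phi_2$ and $\Phi^{(-)} = \Phi^{(-)}_1 \wedge \Phi^{(-)}_2$ also being VC. Finally, $\mathcal{F} = \Phi \vee \Phi^{(-)}$ is VC because of Lemma 2.6.18 (ii). Since $\mathcal{F}$ is bounded, Theorem 2.4.3 in van der Vaart and Wellner (1996) concludes the proof. □

Lemma 3. Let $(y_i, \mathbf{x}_i^\intercal)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2) and $u_i = y_i - \mathbf{x}_i^\intercal \beta^0$. Under assumptions [A1], [A2] and if $\beta^*_n = \beta^0 + \mathbf{v}_n$ with $\lim_{n \to \infty} \|\mathbf{v}_n\| = 0$ a.s., then we have
(a) almost sure convergence of the estimated M-scale to the population M-scale of the error distribution,
$$\hat\sigma_M(\beta^*_n) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0) \quad \text{as } n \to \infty,$$
(b) and almost sure convergence of
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \varphi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)}\right) = \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \quad \text{a.s.}$$

Proof of Lemma 3. The first result (a) is a direct consequence of the conditions of the lemma ($u - \mathbf{x}^\intercal \mathbf{v}_n \to u$ a.s.)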
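and Theorem 3.1 in Yohai (1987).

For part (b), it is known from Lemma 2 that the empirical process converges uniformly almost surely. Since $\sigma_M(\beta^0) > 0$, the continuous mapping theorem gives $\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)} \to \frac{U}{\sigma_M(\beta^0)}$ almost surely. Finally, due to the continuity and boundedness of $\varphi$:
$$\mathbb{E}_{G_0}\left[\varphi\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)}\right)\right] \xrightarrow[n \to \infty]{\text{a.s.}} \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \tag{B.8}$$
which concludes the proof. □

Lemma 4. Let $(y_i, \mathbf{x}_i^\intercal)$, $i = 1, \dots, n$, be i.i.d. observations with distribution $G_0$ which satisfies (2.2) and $u_i = y_i - \mathbf{x}_i^\intercal \beta^0$. Under regularity conditions [A1]–[A3] and if $\mathbf{v} \in K \subset \mathbb{R}^p$ with $K$ compact and $\beta^*_n = \beta^0 + \mathbf{v}/\sqrt{n}$, then
(a) the M-scale estimate converges uniformly almost surely,
$$\sup_{\mathbf{v} \in K} \left|\hat\sigma_M(\beta^*_n) - \sigma_M(\beta^0)\right| \xrightarrow{\text{a.s.}} 0, \tag{B.9}$$
(b) for every $\epsilon > 0$ with $\epsilon < \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right]$ the uniform bound over $\mathbf{v} \in K$
$$\sup_{\mathbf{v} \in K} \left| \frac{\hat\sigma_M(\beta^*_n)}{\frac{1}{n} \sum_{i=1}^{n} \varphi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}/\sqrt{n}}{\hat\sigma_M(\beta^*_n)}\right)} \right| < \frac{\epsilon + \sigma_M(\beta^0)}{\mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] - \epsilon} \tag{B.10}$$
holds with arbitrarily high probability if $n$ is sufficiently large.

Proof of Lemma 4. The proof for (B.9) relies on Lemma 4.5 from Yohai and Zamar (1986) which states that under the same conditions as for this lemma, the following holds:
$$\sup_{\mathbf{v} \in K} \left|\hat\sigma_M(\beta^*_n) - \sigma_M(\beta^*_n)\right| \xrightarrow{\text{a.s.}} 0.$$
Therefore, the missing step is to show that $\sup_{\mathbf{v} \in K} |\sigma_M(\beta^*_n) - \sigma_M(\beta^0)| \to 0$ almost surely as $n \to \infty$. This is done by contradiction.

Assume there exists a subsequence $(n_k)_{k > 0}$ such that for all $k$, $\sup_{\mathbf{v} \in K} |\sigma_M(\beta^*_n) - \sigma_M(\beta^0)| > \epsilon > 0$. Since $\mathbf{v} \in K$ with $K$ a compact set, for every sequence $\mathbf{v}_n$ there exists a subsequence $(\mathbf{v}_{n_k})_k$ such that $|\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}) - \sigma_M(\beta^0)| > \epsilon$ for all $n_k > N_\epsilon$. Therefore, either one of the following holds: (i) $\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}) > \sigma_M(\beta^0) + \epsilon$ or (ii) $\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}) < \sigma_M(\beta^0) - \epsilon$. In the first case (i) it is known that
$$\rho\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_{n_k}/\sqrt{n_k}}{\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k})}\right) < \rho\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_{n_k}/\sqrt{n_k}}{\sigma_M(\beta^0) + \epsilon}\right) \to \rho\left(\frac{U}{\sigma_M(\beta^0) + \epsilon}\right).$$
Due to the boundedness of $\rho$, the dominated convergence theorem gives
$$\mathbb{E}_{G_0}\left[\rho\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_{n_k}/\sqrt{n_k}}{\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k})}\right)\right] < \mathbb{E}_{G_0}\left[\rho\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_{n_k}/\sqrt{n_k}}{\sigma_M(\beta^0) + \epsilon}\right)\right] \to \mathbb{E}_{G_0}\left[\rho\left(\frac{U}{\sigma_M(\beta^0) + \epsilon}\right)\right] < \delta$$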
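which contradicts the definition of $\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k})$. In case (ii) similar steps yield
$$\mathbb{E}_{G_0}\left[\rho\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}_{n_k}/\sqrt{n_k}}{\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k})}\right)\right] > \delta$$
for all $n_k > N$ with $N$ large enough. Therefore, the assumption $\sup_{\mathbf{v} \in K} |\sigma_M(\beta^*_n) - \sigma_M(\beta^0)| > \epsilon > 0$ cannot be valid and hence $\sup_{\mathbf{v} \in K} |\sigma_M(\beta^*_n) - \sigma_M(\beta^0)| \to 0$. This concludes the proof of (B.9).

Before proving (B.10), note that $\epsilon$ is well defined because $\mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] > 0$ as per Lemma 6 in Smucler (2019). To prove (B.10), I first bound the denominator uniformly over $\mathbf{v} \in K$. From Lemma 2 it is known that the empirical processes converge almost surely, uniformly over $\mathbf{v} \in K$ and $s > 0$. As a next step, I show the deterministic uniform convergence of
$$\sup_{\substack{\mathbf{v} \in K \\ s \in [\sigma_M(\beta^0) - \epsilon_1,\, \sigma_M(\beta^0) + \epsilon_1]}} \left| \mathbb{E}_{G_0}\left[f_n(U, \mathcal{X}, \mathbf{v}, s)\right] - \mathbb{E}_{G_0}\left[\varphi\left(\frac{U}{s}\right)\right] \right| \to 0, \tag{B.11}$$
where $f_n(U, \mathcal{X}, \mathbf{v}, s)$ is defined as
$$f_n(U, \mathcal{X}, \mathbf{v}, s) := \varphi\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}/\sqrt{n}}{s}\right).$$
The functions $f_n(U, \mathcal{X}, \mathbf{v}, s)$ are bounded and converge pointwise to $\varphi\left(\frac{U}{s}\right)$, entailing pointwise convergence of $\mathbb{E}_{G_0}\left[f_n(U, \mathcal{X}, \mathbf{v}, s)\right] \to \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{s}\right)\right]$ as $n \to \infty$ by the dominated convergence theorem. Because $\rho$ has a bounded second derivative, the derivative of $f_n(U, \mathcal{X}, \mathbf{v}, s)$ with respect to $\mathbf{v} \in K$ and $s \in [\sigma_M(\beta^0) - \epsilon_1, \sigma_M(\beta^0) + \epsilon_1]$ is also bounded, meaning $f_n(U, \mathcal{X}, \mathbf{v}, s)$ is equicontinuous on this domain. Pointwise convergence together with equicontinuity makes the Arzelà-Ascoli theorem applicable, from which (B.11) follows.

From (B.9) it follows that for any $\delta_2 > 0$ there is an $N_{\delta_2}$ such that for all $\mathbf{v} \in K$ and all $n > N_{\delta_2}$, $P\left(|\hat\sigma_M(\beta^*_n) - \sigma_M(\beta^0)| \le \epsilon_1\right) > 1 - \delta_2$. Combined with (B.11) this yields that for every $\delta_2 > 0$ and $\epsilon_2 > 0$ there is an $N_{\delta_2, \epsilon_2}$ such that for all $n > N_{\delta_2, \epsilon_2}$ and every $\mathbf{v} \in K$
$$\left| \mathbb{E}_{G_0}\left[f_n(U, \mathcal{X}, \mathbf{v}, \hat\sigma_M(\beta^*_n))\right] - \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^*_n)}\right)\right] \right| < \epsilon_2$$
with probability greater than $1 - \delta_2$. Since both expected values are positive this can also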
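be written as
$$\mathbb{E}_{G_0}\left[f_n(U, \mathcal{X}, \mathbf{v}, \hat\sigma_M(\beta^*_n))\right] > \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^*_n)}\right)\right] - \epsilon_2. \tag{B.12}$$
The final piece for the denominator to be bounded is to show that
$$\sup_{\mathbf{v} \in K} \left| \mathbb{E}_{G_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^*_n)}\right)\right] - \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \right| \xrightarrow[n \to \infty]{\text{a.s.}} 0. \tag{B.13}$$
Set $\Omega_1 = \{\omega : \hat\sigma_M(\beta^*_n; \omega) \to \sigma_M(\beta^0)\}$ which has $P(\Omega_1) = 1$ due to the first part of this lemma. Similarly, set $\Omega_2 = \{\omega : \text{equation (B.13) holds}\}$. Assume now that $P(\Omega_1 \cap \Omega_2^c) > 0$. This assumption entails that there exists an $\omega' \in \Omega_1 \cap \Omega_2^c$, an $\epsilon_3 > 0$ and a subsequence $(n_k)_{k > 0}$ such that
$$\lim_{k \to \infty} \left| \mathbb{E}_{G_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}; \omega')}\right)\right] - \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \right| > \epsilon_3. \tag{B.14}$$
However, since $\mathbf{v}_{n_k}$ is in the compact set $K$, the sequence $\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}$ converges to $\beta^0$ as $n \to \infty$. Additionally, $\varphi$ is bounded and together with the dominated convergence theorem this leads to
$$\lim_{k \to \infty} \mathbb{E}_{G_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}; \omega')}\right)\right] = \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right]$$
and in turn to
$$\lim_{k \to \infty} \left| \mathbb{E}_{G_0}\left[\varphi\left(\frac{U}{\hat\sigma_M(\beta^0 + \mathbf{v}_{n_k}/\sqrt{n_k}; \omega')}\right)\right] - \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \right| = 0,$$
contradicting the claim in (B.14). Therefore, $P(\Omega_1 \cap \Omega_2^c) = 0$, proving (B.13). Combining (B.12) and (B.13) leads to the conclusion that with arbitrarily high probability for large enough $n$
$$\left| \mathbb{E}_{G_0}\left[\varphi\left(\frac{U - \mathcal{X}^\intercal \mathbf{v}/\sqrt{n}}{\hat\sigma_M(\beta^*_n)}\right)\right] \right| > -\epsilon_4 + \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] \tag{B.15}$$
for every $\mathbf{v} \in K$.

From the first part of this lemma, $\hat\sigma_M(\beta^*_n) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0)$, and due to (B.15), for every $\delta > 0$ and every $0 < \epsilon < \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right]$ there exists an $N_{\delta,\epsilon}$ such that for all $\mathbf{v} \in K$ and $n \ge N_{\delta,\epsilon}$ equation (B.10) holds. □

B.2.2 Root-n Consistency

Proof of Theorem 3. To ease the notation for the proof, the hyper-parameters are dropped from the objective function $\mathcal{L}_{AS}$ and the adaptive elastic net penalty is simply denoted by $\Phi(\beta) = \Phi_{AN}(\beta; \lambda_{AS}, \alpha_{AS}, \zeta, \omega)$. Also, $\gamma(t) := \beta^0 + t(\hat\beta - \beta^0)$ denotes the convex combination of the true parameter $\beta^0$ and the adaptive PENSE estimator $\hat\beta$.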
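It is important to remember that the penalty loadings are derived from a preliminary PENSE estimator, $\tilde\beta$, as $\omega = (1/|\tilde\beta_1|, \dots, 1/|\tilde\beta_p|)^\intercal$.

The first step in the proof is a Taylor expansion of the objective function around the true parameter $\beta^0$:
$$\hat\sigma^2_M(\hat\beta) + \Phi(\hat\beta) = \hat\sigma^2_M(\beta^0) + \Phi(\beta^0) + \left(\Phi(\hat\beta) - \Phi(\beta^0)\right) - 2 \underbrace{\frac{1}{\frac{1}{n} \sum_{i=1}^{n} \varphi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)}\right)}}_{=: A_n} \underbrace{\frac{\hat\sigma_M(\beta^*_n)}{n} \sum_{i=1}^{n} \psi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i^\intercal \mathbf{v}_n}_{=: Z_n}$$
where $\mathbf{v}_n = \tau(\hat\beta - \beta^0)$ and $\beta^*_n = \beta^0 + \mathbf{v}_n$ for a $0 < \tau < 1$. Due to the strong consistency of $\hat\beta$ from Proposition 2, $\mathbf{v}_n \to \mathbf{0}$ a.s. and hence from Lemma 3 and the continuous mapping theorem it is known that $A_n \xrightarrow{\text{a.s.}} 1 / \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] =: A > 0$ as well as $\hat\sigma_M(\beta^*_n) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0)$.

The term $Z_n$ is handled by a Taylor expansion of $\psi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n}{\hat\sigma_M(\beta^*_n)}\right)$ around $u_i$ to get
$$\begin{aligned} Z_n &= \hat\sigma_M(\beta^*_n) \left( \frac{1}{n} \sum_{i=1}^{n} \psi\left(\frac{u_i}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i^\intercal \mathbf{v}_n - \frac{1}{\hat\sigma_M(\beta^*_n)\, n} \sum_{i=1}^{n} \psi'\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}^*_n}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i^\intercal \mathbf{v}_n\, \mathbf{x}_i^\intercal \mathbf{v}_n \right) \\ &= \frac{(\hat\beta - \beta^0)^\intercal}{\sqrt{n}} \left[ \tau \hat\sigma_M(\beta^*_n) \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{u_i}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i \right] - \tau^2 (\hat\beta - \beta^0)^\intercal \left[ \frac{1}{n} \sum_{i=1}^{n} \psi'\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}^*_n}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i \mathbf{x}_i^\intercal \right] (\hat\beta - \beta^0) \end{aligned}$$
for some $\mathbf{v}^*_n = \tau^* \mathbf{v}_n$ with $\tau^* \in (0, 1)$.

The rest of the proof follows closely the proof of Proposition 2 in Smucler and Yohai (2017). More specifically, noting that $\hat\sigma_M(\beta^*_n) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0)$, the results in Smucler and Yohai (2017) (which are derived from results in Yohai (1985)) state that
$$B_n := \|\xi_n\| = O_p(1) \quad \text{with} \quad \xi_n = \tau \hat\sigma_M(\beta^*_n) \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{u_i}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i$$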
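and hence with arbitrarily high probability for $n$ sufficiently large there is a $B$ such that
$$\frac{(\hat\beta - \beta^0)^\intercal}{\sqrt{n}} \xi_n \le \frac{1}{\sqrt{n}} \|\hat\beta - \beta^0\| \|\xi_n\| \le \frac{B}{\sqrt{n}} \|\hat\beta - \beta^0\|. \tag{B.16}$$
Similarly, the results in Smucler and Yohai (2017) can be used to show
$$C_n := \tau^2 (\hat\beta - \beta^0)^\intercal \left[ \frac{1}{n} \sum_{i=1}^{n} \psi'\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}^*_n}{\hat\sigma_M(\beta^*_n)}\right) \mathbf{x}_i \mathbf{x}_i^\intercal \right] (\hat\beta - \beta^0) \ge \tilde{C}_n \|\hat\beta - \beta^0\|^2 \tag{B.17}$$
with $\tilde{C}_n \xrightarrow{\text{a.s.}} C > 0$.

Next is the difference in the penalty terms $D_n := \Phi(\hat\beta) - \Phi(\beta^0)$, which can be reduced to the truly non-zero coefficients:
$$\begin{aligned} D_n &= \lambda_{AS,n} \sum_{j=1}^{p} \left( \frac{1 - \alpha}{2} \left((\hat\beta_j)^2 - (\beta^0_j)^2\right) + \alpha \frac{|\hat\beta_j| - |\beta^0_j|}{|\tilde\beta_j|^\zeta} \right) \\ &\ge \lambda_{AS,n} \sum_{j=1}^{s} \left( \frac{1 - \alpha}{2} \left((\hat\beta_j)^2 - (\beta^0_j)^2\right) + \alpha \frac{|\hat\beta_j| - |\beta^0_j|}{|\tilde\beta_j|^\zeta} \right). \end{aligned}$$
Observing that $\hat\beta$ is a strongly consistent estimator, $|\hat\beta_j - \beta^0_j| < \epsilon_j < |\beta^0_j|$ for all $j = 1, \dots, s$ and any $\epsilon_j \in (0, |\beta^0_j|)$ with arbitrarily high probability for sufficiently large $n$. This entails that, for all $0 \le t \le 1$ and $j = 1, \dots, s$, the sign of the convex combination satisfies $\operatorname{sgn}(\gamma_j(t)) = \operatorname{sgn}(\beta^0_j) \ne 0$ and thus $|\gamma_j(t)|$ is differentiable. This allows application of the mean value theorem on the quadratic and the absolute term in $D_n$ to yield
$$D_n \ge \lambda_{AS,n} \sum_{j=1}^{s} \left( \frac{1 - \alpha}{4} \gamma_j(\tau_j) + \alpha \frac{\operatorname{sgn}(\beta^0_j)}{|\tilde\beta_j|^\zeta} \right) (\hat\beta_j - \beta^0_j)$$
for some $\tau_j \in (0, 1)$, $j = 1, \dots, s$, with arbitrarily high probability for large enough $n$. Because both $\tilde\beta$ and $\hat\beta$ are strongly consistent for $\beta^0$ and $\lambda_{AS,n} = O(1/\sqrt{n})$, there exists a constant $D$ such that with arbitrarily high probability
$$D_n \ge -\frac{D}{\sqrt{n}} \|\hat\beta - \beta^0\| \tag{B.18}$$
for sufficiently large $n$.

Since $\hat\beta$ minimizes the adaptive PENSE objective function $\mathcal{L}_{AS}$,
$$0 \ge \mathcal{L}_{AS}(\hat\beta) - \mathcal{L}_{AS}(\beta^0) = \hat\sigma^2_M(\hat\beta) + \Phi(\hat\beta) - \hat\sigma^2_M(\beta^0) - \Phi(\beta^0) = D_n - 2 A_n Z_n.$$
With the bounds derived in (B.16), (B.17), and (B.18) this in turn yields
$$\begin{aligned} 0 &\ge D_n - 2 A_n Z_n = D_n - 2 A_n B_n + 2 A_n C_n \\ &\ge -\frac{D}{\sqrt{n}} \|\hat\beta - \beta^0\| - 2A \frac{B}{\sqrt{n}} \|\hat\beta - \beta^0\| + 2AC \|\hat\beta - \beta^0\|^2 \\ &= \frac{1}{\sqrt{n}} \|\hat\beta - \beta^0\| \left( -D - 2AB + 2AC \sqrt{n} \|\hat\beta - \beta^0\| \right) \end{aligned}$$
with arbitrarily high probability for large enough $n$. Rearranging the terms leads to the inequality
$$\sqrt{n} \|\hat\beta - \beta^0\| \le \frac{2AB + D}{2AC}.$$
□

B.2.3 Variable Selection Consistency

Proof of Theorem 4.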
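To ease notation in the following, I denote the coordinate-wise adaptive EN penalty function by
$$\phi(\beta; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega) = \lambda_{AS,n} \left( \frac{1 - \alpha_{AS}}{2} \beta^2 + \alpha_{AS} |\beta| |\omega|^\zeta \right)$$
such that $\lambda_{AS,n} \Phi_{AN}(\beta; \alpha_{AS}, \zeta, \omega) = \sum_{j=1}^{p} \phi(\beta_j; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega_j)$. I follow the proof in Smucler and Yohai (2017) and define the function
$$V_n(\mathbf{v}_1, \mathbf{v}_2) := \hat\sigma^2_M\left(\beta^0_{\mathrm{I}} + \mathbf{v}_1/\sqrt{n},\, \beta^0_{\mathrm{II}} + \mathbf{v}_2/\sqrt{n}\right) + \sum_{j=1}^{s} \phi\left(\beta^0_j + v_{1,j}/\sqrt{n}; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega_j\right) + \sum_{j=s+1}^{p} \phi\left(\beta^0_j + v_{2,j-s}/\sqrt{n}; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega_j\right).$$
From Theorem 3 it follows that, with arbitrarily high probability, $\|\hat\beta - \beta^0\| \le C/\sqrt{n}$ for sufficiently large $n$. Therefore, with arbitrarily high probability $V_n(\mathbf{v}_1, \mathbf{v}_2)$ attains its minimum on the compact set $\{(\mathbf{v}_1, \mathbf{v}_2) : \|\mathbf{v}_1\|^2 + \|\mathbf{v}_2\|^2 \le C^2\}$ at $\hat\beta$. The goal is to show that for any $\|\mathbf{v}_1\|^2 + \|\mathbf{v}_2\|^2 \le C^2$ with $\|\mathbf{v}_2\| > 0$ and with arbitrarily high probability, $V_n(\mathbf{v}_1, \mathbf{v}_2) - V_n(\mathbf{v}_1, \mathbf{0}_{p-s}) > 0$ for sufficiently large $n$.

Taking the difference while observing that $\beta^0_{\mathrm{II}} = \mathbf{0}_{p-s}$ gives
$$V_n(\mathbf{v}_1, \mathbf{v}_2) - V_n(\mathbf{v}_1, \mathbf{0}_{p-s}) = \left( \hat\sigma^2_M\left(\beta^0_{\mathrm{I}} + \mathbf{v}_1/\sqrt{n},\, \mathbf{v}_2/\sqrt{n}\right) - \hat\sigma^2_M\left(\beta^0_{\mathrm{I}} + \mathbf{v}_1/\sqrt{n},\, \mathbf{0}_{p-s}\right) \right) + \sum_{j=s+1}^{p} \phi\left(v_{2,j-s}/\sqrt{n}; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega_j\right).$$
The first term can be bounded by defining $\mathbf{v}_n(t) := (\mathbf{v}_1^\intercal, t \mathbf{v}_2^\intercal)^\intercal / \sqrt{n}$; applying the mean value theorem gives some $\tau \in (0, 1)$ such that
$$\begin{aligned} \hat\sigma^2_M(\beta^0 + \mathbf{v}_n(1)) - \hat\sigma^2_M(\beta^0 + \mathbf{v}_n(0)) &= \frac{2}{\sqrt{n}} \hat\sigma_M(\beta^0 + \mathbf{v}_n(\tau))\, (\mathbf{0}_s^\intercal, \mathbf{v}_2^\intercal)\, \nabla_\beta \hat\sigma_M(\beta)\big|_{\beta^0 + \mathbf{v}_n(\tau)} \\ &= -\frac{2}{\sqrt{n}} \underbrace{\frac{\hat\sigma_M(\beta^0 + \mathbf{v}_n(\tau))}{\frac{1}{n} \sum_{i=1}^{n} \varphi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n(\tau)}{\hat\sigma_M(\beta^0 + \mathbf{v}_n(\tau))}\right)}}_{=: A_n} \underbrace{(\mathbf{0}_s^\intercal, \mathbf{v}_2^\intercal) \frac{1}{n} \sum_{i=1}^{n} \psi\left(\frac{u_i - \mathbf{x}_i^\intercal \mathbf{v}_n(\tau)}{\hat\sigma_M(\beta^0 + \mathbf{v}_n(\tau))}\right) \mathbf{x}_i}_{=: B_n}. \end{aligned}$$
By Lemma 4 the term $A_n$ is uniformly bounded in probability, hence $|A_n| < A$ with arbitrarily high probability for large enough $n$. Furthermore, $|B_n| \le \|\psi\|_\infty \|\mathbf{v}_2\| \left\| \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \right\|$ and due to the law of large numbers there is a constant $B$ such that the upper bound for $|B_n|$ is
$$|B_n| \le \|\psi\|_\infty \|\mathbf{v}_2\| \left( \|\mathbb{E}_{H_0}[\mathcal{X}]\| + \epsilon \right) < \|\mathbf{v}_2\| B$$
with arbitrarily high probability for sufficiently large $n$. Together, the bounds for $A_n$ and $B_n$ give
$$\hat\sigma^2_M(\beta^0 + \mathbf{v}_n(1)) - \hat\sigma^2_M(\beta^0 + \mathbf{v}_n(0)) \ge -\frac{\|\mathbf{v}_2\|}{\sqrt{n}} 2AB. \tag{B.19}$$

The next step is to ensure that the penalty term grows large enough to make the difference $V_n(\mathbf{v}_1, \mathbf{v}_2) - V_n(\mathbf{v}_1, \mathbf{0}_{p-s})$ positive.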
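Indeed, the assumption $\alpha_{AS} > 0$ and using a PENSE estimator for the penalty loadings, $\omega_j = 1/|\tilde\beta_j|$, leads to
$$\sum_{j=s+1}^{p} \phi\left(v_{2,j-s}/\sqrt{n}; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega_j\right) \ge \alpha_{AS} \lambda_{AS,n} \sum_{j=s+1}^{p} \frac{|v_{2,j-s}|}{\sqrt{n}\, |\tilde\beta_j|^\zeta} = \alpha_{AS} \lambda_{AS,n} n^{(\zeta-1)/2} \sum_{j=s+1}^{p} \frac{|v_{2,j-s}|}{|\sqrt{n} \tilde\beta_j|^\zeta}.$$
The root-n consistency of $\tilde\beta$ established in Theorem 1 gives $|\sqrt{n} \tilde\beta_j| < M$ with arbitrarily high probability for large enough $n$. Therefore,
$$\begin{aligned} \alpha_{AS} \lambda_{AS,n} n^{(\zeta-1)/2} \sum_{j=s+1}^{p} \frac{|v_{2,j-s}|}{|\sqrt{n} \tilde\beta_j|^\zeta} &> \alpha_{AS} \lambda_{AS,n} n^{(\zeta-1)/2} \sum_{j=s+1}^{p} \frac{|v_{2,j-s}|}{M^\zeta} \\ &= \alpha_{AS} \lambda_{AS,n} n^{(\zeta-1)/2} \frac{\|\mathbf{v}_2\|_1}{M^\zeta} \\ &\ge \frac{\|\mathbf{v}_2\|}{\sqrt{n}\, M^\zeta} \alpha_{AS} \lambda_{AS,n} n^{\zeta/2}. \end{aligned} \tag{B.20}$$
Combining (B.19) and (B.20) yields
$$V_n(\mathbf{v}_1, \mathbf{v}_2) - V_n(\mathbf{v}_1, \mathbf{0}_{p-s}) > \frac{\|\mathbf{v}_2\|}{\sqrt{n}} \left( -2AB + \frac{\alpha_{AS} \lambda_{AS,n} n^{\zeta/2}}{M^\zeta} \right) \tag{B.21}$$
uniformly over $\mathbf{v}_1$ and $\mathbf{v}_2$ with arbitrarily high probability for sufficiently large $n$. By assumption $\alpha_{AS} \lambda_{AS,n} n^{\zeta/2} \to \infty$ and hence the right-hand side in (B.21) will eventually be positive, concluding the proof. □

B.2.4 Asymptotic Normal Distribution

Proof of Theorem 5. For this proof I denote the matrix of values of the active predictors and the values of the active predictors in the $i$-th observation by $\mathbf{X}_{\mathrm{I}}$ and $\mathbf{x}_{i,\mathrm{I}}$, respectively. Because $\hat\beta$ is strongly consistent for $\beta^0$, the coefficient values for the truly active predictors are almost surely bounded away from zero if $n$ is large enough. This entails that the partial derivatives of the penalty function exist for the truly active predictors and the gradient at the estimate $\hat\beta$ is
$$\mathbf{0}_s = \nabla_{\beta_{\mathrm{I}}} \mathcal{L}_{AS}(\hat\beta) = -2 \frac{\hat\sigma_M(\hat\beta)}{A_n} \frac{1}{n} \sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_i^\intercal \hat\beta}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} + \nabla_{\beta_{\mathrm{I}}} \Phi_{AN}(\hat\beta; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega) \tag{B.22}$$
with $A_n = \frac{1}{n} \sum_{i=1}^{n} \varphi\left(\frac{y_i - \mathbf{x}_i^\intercal \hat\beta}{\hat\sigma_M(\hat\beta)}\right)$. The truly active coefficients can be separated from the truly inactive coefficients by noting that $\psi\left(\frac{y_i - \mathbf{x}_i^\intercal \hat\beta}{\hat\sigma_M(\hat\beta)}\right) = \psi\left(\frac{y_i - \mathbf{x}_{i,\mathrm{I}}^\intercal \hat\beta_{\mathrm{I}}}{\hat\sigma_M(\hat\beta)}\right) + o_i$ for some $o_i$ which vanishes in probability, $P(o_i = 0) \to 1$, because of Theorem 4 and because $\psi$ is continuous.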
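Equation (B.22) can now be written as
$$\mathbf{0}_s = -2 \frac{\hat\sigma_M(\hat\beta)}{A_n} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_{i,\mathrm{I}}^\intercal \hat\beta_{\mathrm{I}}}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} - 2 \frac{\hat\sigma_M(\hat\beta)}{A_n} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} o_i \mathbf{x}_{i,\mathrm{I}} + \sqrt{n}\, \nabla_{\beta_{\mathrm{I}}} \Phi_{AN}(\hat\beta; \lambda_{AS,n}, \alpha_{AS}, \zeta, \omega)$$
and using the mean value theorem there are $\tau_i \in [0, 1]$ and hence a matrix
$$B_n = \frac{1}{n} \sum_{i=1}^{n} \psi'\left(\frac{u_i - \tau_i \mathbf{x}_{i,\mathrm{I}}^\intercal (\hat\beta_{\mathrm{I}} - \beta^0_{\mathrm{I}})}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} \mathbf{x}_{i,\mathrm{I}}^\intercal$$
such that the equation can be further rewritten to
$$\mathbf{0}_s = -2 \frac{\hat\sigma_M(\hat\beta)}{A_n} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_{i,\mathrm{I}}^\intercal \beta^0_{\mathrm{I}}}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} + 2 \frac{1}{A_n} B_n \sqrt{n} (\hat\beta_{\mathrm{I}} - \beta^0_{\mathrm{I}}) - 2 \frac{\hat\sigma_M(\hat\beta)}{A_n} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} o_i \mathbf{x}_{i,\mathrm{I}} + \sqrt{n} \lambda_{AS,n} \nabla_{\beta_{\mathrm{I}}} \Phi_{AN}(\hat\beta; \alpha_{AS}, \zeta, \omega).$$
Separating the term $\sqrt{n}(\hat\beta^*_{\mathrm{I}} - \beta^0_{\mathrm{I}})$ then gives
$$\begin{aligned} \sqrt{n}(\hat\beta^*_{\mathrm{I}} - \beta^0_{\mathrm{I}}) = {} & \hat\sigma_M(\hat\beta) B_n^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_{i,\mathrm{I}}^\intercal \beta^0_{\mathrm{I}}}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} \\ & + \hat\sigma_M(\hat\beta) B_n^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} o_i \mathbf{x}_{i,\mathrm{I}} \\ & + \sqrt{n} \lambda_{AS,n} \frac{\hat\sigma_M(\hat\beta)}{A_n} B_n^{-1} \nabla_{\beta_{\mathrm{I}}} \Phi_{AN}(\hat\beta; \alpha_{AS}, \zeta, \omega). \end{aligned} \tag{B.23}$$
The strong consistency of $\hat\beta$ for $\beta^0$ and Lemma 3 lead to $\hat\sigma_M(\hat\beta) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0)$ and $A_n \xrightarrow{\text{a.s.}} \mathbb{E}_{F_0}\left[\varphi\left(\frac{U}{\sigma_M(\beta^0)}\right)\right] < \infty$. Also, because of $\hat\sigma_M(\hat\beta) \xrightarrow{\text{a.s.}} \sigma_M(\beta^0)$, Lemma 4.2 in Yohai (1985), and the law of large numbers
$$B_n \xrightarrow{\text{a.s.}} b(\rho, F_0) \Sigma_{\mathrm{I}}.$$
Combined with the assumption that $\sqrt{n} \lambda_{AS,n} \to 0$ this leads to the last two lines in (B.23) converging to $\mathbf{0}_s$ in probability. Finally, by Lemma 5.1 in Yohai (1985) and the CLT
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_{i,\mathrm{I}}^\intercal \beta^0_{\mathrm{I}}}{\hat\sigma_M(\hat\beta)}\right) \mathbf{x}_{i,\mathrm{I}} \xrightarrow{d} N_s\left(\mathbf{0}_s, a(\rho, F_0) \Sigma_{\mathrm{I}}\right)$$
which, after applying Slutsky's Theorem, completes the proof. □

Appendix C

Additional Results from Numerical Experiments

C.1 Elastic Net S-Estimators

Below are complete results from the numerical experiments detailed in Section 3.6 including additional estimators, error distributions, and sample sizes. Unregularized MM- and S-estimates are computed only for scenarios where $p < \lfloor (1 - \delta) n \rfloor - 1$. The breakdown point of all robust estimators is set to $\delta = 0.33$. Oracle MM- and S-estimators are computed using only the truly active predictors.

C.1.1 Prediction Performance

Prediction performance is measured in terms of the relative scale of the prediction error, as detailed in Section 3.6.3.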
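The rule stated in Section C.1 for when unregularized MM- and S-estimates are computed, $p < \lfloor (1 - \delta) n \rfloor - 1$, is simple arithmetic and can be checked directly. The sketch below is only an illustration; the function name `unregularized_estimates_feasible` is hypothetical, and exact rational arithmetic is used so that the floor is not affected by floating-point rounding of $1 - \delta$.

```python
from fractions import Fraction
from math import floor


def unregularized_estimates_feasible(n, p, delta=Fraction(33, 100)):
    """Return True if p < floor((1 - delta) * n) - 1."""
    return p < floor((1 - delta) * n) - 1


# Sample sizes and dimensions as they appear in the figure panels below.
for n, dims in ((100, (16, 32, 64, 128)), (400, (32, 64))):
    for p in dims:
        print(f"n = {n}, p = {p}:", unregularized_estimates_feasible(n, p))
```

With $\delta = 0.33$ and $n = 100$ the bound is $\lfloor 67 \rfloor - 1 = 66$, so the condition holds for $p = 16, 32, 64$ but not for $p = 128$.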
Figures C.1 and C.2 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

C.1.2 Variable Selection Performance

Variable selection performance is summarized by the sensitivity (i.e., the proportion of truly active predictors detected as such) and specificity (i.e., the proportion of truly inactive predictors detected as such). The summary figures show sensitivity and specificity in a single plot for regularized estimators only. Sensitivity extends upwards, specificity extends downwards. Methods perform well in terms of variable selection if the two points are at the top and bottom ends of the plot. Figures C.3 and C.4 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

C.1.3 Estimation Accuracy

The focus of this work is on prediction performance and variable selection of estimators in the linear regression model. To underline consistency of the estimator, however, estimation accuracy is also of interest. Estimation accuracy is assessed by the $\ell_2$ estimation error,
$$\mathrm{RMSE}(\hat\beta) = \sqrt{\left\|\hat\beta - \beta^0\right\|_2^2 + (\hat\mu - \mu^0)^2}.$$
As detailed in Section 2.1, the $\ell_2$ estimation error is similar to the RMSPE, but possible dependence between predictors is ignored. The $\ell_2$ estimation error captures both the bias and variance of the estimator. The smaller the $\ell_2$ estimation error, the more accurate the estimation.
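The three summary measures defined in Sections C.1.2 and C.1.3 can be computed as in the following minimal sketch. The function names `selection_metrics` and `l2_estimation_error` and the example coefficient vectors are hypothetical and only illustrate the definitions; they are not the code used to produce the figures.

```python
import math


def selection_metrics(beta_hat, beta_true):
    """Return (sensitivity, specificity) of the selected support."""
    active = [b != 0 for b in beta_true]
    selected = [b != 0 for b in beta_hat]
    n_active = sum(active)
    n_inactive = len(active) - n_active
    true_pos = sum(a and s for a, s in zip(active, selected))
    true_neg = sum((not a) and (not s) for a, s in zip(active, selected))
    sensitivity = true_pos / n_active if n_active else float("nan")
    specificity = true_neg / n_inactive if n_inactive else float("nan")
    return sensitivity, specificity


def l2_estimation_error(beta_hat, mu_hat, beta_true, mu_true):
    """sqrt(||beta_hat - beta_true||_2^2 + (mu_hat - mu_true)^2)."""
    sq = sum((bh - bt) ** 2 for bh, bt in zip(beta_hat, beta_true))
    return math.sqrt(sq + (mu_hat - mu_true) ** 2)


# Hypothetical example: 2 truly active and 3 truly inactive predictors.
beta_true = [2.0, -1.0, 0.0, 0.0, 0.0]
beta_hat = [1.5, 0.0, 0.5, 0.0, 0.0]

sens, spec = selection_metrics(beta_hat, beta_true)
err = l2_estimation_error(beta_hat, 1.0, beta_true, 0.0)
print(sens, spec, err)
```

In this example one of the two active predictors is detected (sensitivity 1/2) and two of the three inactive predictors are correctly excluded (specificity 2/3).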
Figures C.5 and C.6 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

[Figure C.1, plot not reproduced. Panels: n = 100 with s/p = 4/16, 5/32, 6/64, 7/128 and n = 400 with s/p = 5/32, 6/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE, MM (Oracle), S (Oracle), MM, S); Relative scale of prediction error.]

Figure C.1: Prediction performance of estimates under data generation scheme VS1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions.
The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.2, plot not reproduced. Panels: n = 100 with s/p = 12/16, 17/32, 24/64, 34/128 and n = 400 with s/p = 17/32, 24/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE, MM (Oracle), S (Oracle), MM, S); Relative scale of prediction error.]

Figure C.2: Prediction performance of estimates under data generation scheme SP1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions.
The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.3, plot not reproduced. Panels: n = 100 with s/p = 4/16, 5/32, 6/64, 7/128 and n = 400 with s/p = 5/32, 6/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE); Specificity and Sensitivity, each from 0% to 100%.]

Figure C.3: Sensitivity (upwards) and specificity (downwards) of regularized estimates under data generation scheme VS1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions.
The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.4, plot not reproduced. Panels: n = 100 with s/p = 12/16, 17/32, 24/64, 34/128 and n = 400 with s/p = 17/32, 24/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE); Specificity and Sensitivity, each from 0% to 100%.]

Figure C.4: Sensitivity (upwards) and specificity (downwards) of regularized estimates under data generation scheme SP1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions.
The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.5, plot not reproduced. Panels: n = 100 with s/p = 4/16, 5/32, 6/64, 7/128 and n = 400 with s/p = 5/32, 6/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE, MM (Oracle), S (Oracle), MM, S); L2 estimation error.]

Figure C.5: Estimation accuracy in terms of the $\ell_2$ estimation error of several estimates under data generation scheme VS1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions.
The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.6, plot not reproduced. Panels: n = 100 with s/p = 12/16, 17/32, 24/64, 34/128 and n = 400 with s/p = 17/32, 24/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (LS-EN, MMLASSO, PENSE, MM (Oracle), S (Oracle), MM, S); L2 estimation error.]

Figure C.6: Estimation accuracy in terms of the $\ell_2$ estimation error of several estimates under data generation scheme SP1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions. The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

C.2 Adaptive Elastic Net S-Estimators

Extending the numerical experiments from Section 3.6, below are detailed results for estimators discussed in Section 4.4 with the addition of other variable selection consistent estimators.

C.2.1 Prediction Performance

Prediction performance is measured in terms of the relative scale of the prediction error, as detailed in Section 3.6.3.
Figures C.8 and C.9 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

C.2.2 Variable Selection Performance

Variable selection performance is summarized by the sensitivity (i.e., the proportion of truly active predictors detected as such) and specificity (i.e., the proportion of truly inactive predictors detected as such). The summary figures show sensitivity and specificity in a single plot for regularized estimators only. Sensitivity extends upwards, specificity extends downwards. Methods perform well in terms of variable selection if the two points are at the top and bottom ends of the plot. Figures C.10 and C.11 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

C.2.3 Estimation Accuracy

Estimation accuracy is assessed by the $\ell_2$ estimation error,
$$\mathrm{RMSE}(\hat\beta) = \sqrt{\left\|\hat\beta - \beta^0\right\|_2^2 + (\hat\mu - \mu^0)^2}.$$
The smaller the $\ell_2$ estimation error, the more accurate the estimation. Figures C.12 and C.13 show results for very sparse scenarios ($s = \log_2(p)$) and sparse scenarios ($s = 3\sqrt{p}$).

[Figure C.7, plot not reproduced. Panels: Very sparse and Sparse; No contamination and 25% contamination. Axes: Number of predictors (thereof active): 16 (4), 32 (5), 64 (6), 128 (7) and 16 (12), 32 (17), 64 (24), 128 (34); Relative scale of the prediction error. Legend (Preliminary Estimate): PENSE (CV), PENSE (Ridge).]

Figure C.7: Scale of the prediction error of adaptive PENSE, relative to the scale of the prediction error from PENSE. The preliminary estimates considered here are described in Section 4.4.1. Data is generated according to schemes VS1-* (on the left) and SP1-* (on the right) with $n = 100$ observations and 25% variance explained by the true model. In scenarios without contamination (top), plots show summaries of the metric over 400 values from 100 replications and 4 different error distributions. In scenarios introducing 25% contamination (bottom), plots show summaries of 1000 values from 50 replications of 5 different outlier positions and 4 different error distributions.
The dots mark the median value, while error bars span the range of the inner 50%.

[Figure C.8, plot not reproduced. Panels: n = 100 with s/p = 4/16, 5/32, 6/64, 7/128 and n = 400 with s/p = 5/32, 6/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (Ada. LS-EN, I-LAMM, Ada. MMLASSO, Ada. PENSE, PENSE); Relative scale of prediction error.]

Figure C.8: Prediction performance of estimates under data generation scheme VS1-*. In scenarios without contamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replications of 5 different outlier positions. The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.

[Figure C.9, plot not reproduced. Panels: n = 100 with s/p = 12/16, 17/32, 24/64, 34/128 and n = 400 with s/p = 17/32, 24/64, each without contamination and with 25% contamination; error distributions: Normal, Stable(1.66), Stable(1.33), Cauchy. Axes: Method (Ada. LS-EN, I-LAMM, Ada. MMLASSO, Ada. PENSE, PENSE); Relative scale of prediction error.]
PENSEPENSE1.01.11.21.31.41.01.11.21.31.41.01.11.21.31.41.51.01.21.41.6MethodRelative scale of prediction errorn = 100No contamination 25% contaminations/p = 17/32 s/p = 24/64 s/p = 17/32 s/p = 24/64NormalStable(1.66)Stable(1.33)CauchyAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE1.001.051.101.151.001.051.101.151.001.051.101.151.201.01.11.21.3n = 400No contamination 25% contaminationFigure C.9: Prediction performance of estimates under data generation scheme SP1-*. In scenarios without contamination (left), plots show summariesof the metric over 100 replications. In scenarios introducing 25% contamination (right), plots show summaries of 250 values from 50 replicationsof 5 different outlier positions. The dots show the median value, while solid lines show the range of the inner 50% and the dashed whiskers extendfrom the 5% to the 95% quantile.174s/p = 4/16 s/p = 5/32 s/p = 6/64 s/p = 7/128 s/p = 4/16 s/p = 5/32 s/p = 6/64 s/p = 7/128Ada. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda.LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%Method←Specificity | Sensitivity→n = 100No contamination 25% contaminations/p = 5/32 s/p = 6/64 s/p = 5/32 s/p = 6/64NormalStable(1.66)Stable(1.33)CauchyAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. 
PENSEPENSE100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%n = 400No contamination 25% contaminationFigure C.10: Sensitivity (upwards) and specificity (downwards) of regularized estimates under data generation scheme VS1-*. In scenarios withoutcontamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots showsummaries of 250 values from 50 replications of 5 different outlier positions. The dots show the median value, while solid lines show the range ofthe inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.175s/p = 12/16 s/p = 17/32 s/p = 24/64 s/p = 34/128 s/p = 12/16 s/p = 17/32 s/p = 24/64 s/p = 34/128Ada. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda.LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%Method←Specificity | Sensitivity→n = 100No contamination 25% contaminations/p = 17/32 s/p = 24/64 s/p = 17/32 s/p = 24/64NormalStable(1.66)Stable(1.33)CauchyAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%100%50%0%50%100%n = 400No contamination 25% contaminationFigure C.11: Sensitivity (upwards) and specificity (downwards) of regularized estimates under data generation scheme SP1-*. In scenarios withoutcontamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots showsummaries of 250 values from 50 replications of 5 different outlier positions. 
The dots show the median value, while solid lines show the range ofthe inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.176s/p = 4/16 s/p = 5/32 s/p = 6/64 s/p = 7/128 s/p = 4/16 s/p = 5/32 s/p = 6/64 s/p = 7/128Ada. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE369369123692.55.07.510.012.5MethodL2estimation errorn = 100No contamination 25% contaminations/p = 5/32 s/p = 6/64 s/p = 5/32 s/p = 6/64NormalStable(1.66)Stable(1.33)CauchyAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE2462462462468n = 400No contamination 25% contaminationFigure C.12: Estimation accuracy in terms of the a2 estimation error of several estimates under data generation scheme VS1-*. In scenarios withoutcontamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots showsummaries of 250 values from 50 replications of 5 different outlier positions. The dots show the median value, while solid lines show the range ofthe inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.177s/p = 12/16 s/p = 17/32 s/p = 24/64 s/p = 34/128 s/p = 12/16 s/p = 17/32 s/p = 24/64 s/p = 34/128Ada. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. 
PENSEPENSE0510152005101520051015202505101520MethodL2estimation errorn = 100No contamination 25% contaminations/p = 17/32 s/p = 24/64 s/p = 17/32 s/p = 24/64NormalStable(1.66)Stable(1.33)CauchyAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSEAda. LS-ENI-LAMMAda. MMLASSOAda. PENSEPENSE2.55.07.510.0510510510n = 400No contamination 25% contaminationFigure C.13: Estimation accuracy in terms of the a2 estimation error of several estimates under data generation scheme SP1-*. In scenarios withoutcontamination (left), plots show summaries of the metric over 100 replications. In scenarios introducing 25% contamination (right), plots showsummaries of 250 values from 50 replications of 5 different outlier positions. The dots show the median value, while solid lines show the range ofthe inner 50% and the dashed whiskers extend from the 5% to the 95% quantile.178
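The metrics defined in Sections C.2.2 and C.2.3 can be illustrated with a short sketch. The following Python code is not taken from the software accompanying this dissertation; the function names and the convention that a coefficient estimated as exactly zero counts as "not selected" are assumptions made here for illustration only.

```python
import math


def selection_metrics(beta_hat, beta_true):
    """Return (sensitivity, specificity) of the estimated support.

    Assumption: a predictor counts as selected if its estimated
    coefficient is not exactly zero, and as truly active if its
    true coefficient is not zero.
    """
    selected = [b != 0 for b in beta_hat]
    active = [b != 0 for b in beta_true]
    true_pos = sum(s and a for s, a in zip(selected, active))
    true_neg = sum((not s) and (not a) for s, a in zip(selected, active))
    n_active = sum(active)
    n_inactive = len(active) - n_active
    # The proportions are undefined if there are no active (or no inactive) predictors.
    sensitivity = true_pos / n_active if n_active else math.nan
    specificity = true_neg / n_inactive if n_inactive else math.nan
    return sensitivity, specificity


def l2_estimation_error(beta_hat, mu_hat, beta_true, mu_true):
    """Square root of the squared L2 distance between the estimated and true
    slope vectors plus the squared error of the intercept."""
    slope_part = sum((bh - bt) ** 2 for bh, bt in zip(beta_hat, beta_true))
    return math.sqrt(slope_part + (mu_hat - mu_true) ** 2)
```

For example, with true slopes (2, 0, -1, 0, 0) and estimated slopes (1.5, 0, 0, 0.3, 0), one of the two active predictors is detected and two of the three inactive predictors are correctly left out, so the sketch yields a sensitivity of 1/2 and a specificity of 2/3.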
Item Metadata
Title | Robust estimation and variable selection in high-dimensional linear regression models |
Creator | Kepplinger, David |
Publisher | University of British Columbia |
Date Issued | 2020 |
Description | Linear regression models are commonly used statistical models for predicting a response from a set of predictors. Technological advances allow for simultaneous collection of many predictors, but often only a small number of these is relevant for prediction. Identifying this set of predictors in high-dimensional linear regression models with emphasis on accurate prediction is thus a common goal of quantitative data analyses. While a large number of predictors promises to capture as much information as possible, it bears a risk of containing contaminated values. If not handled properly, contamination can affect statistical analyses and lead to spurious scientific discoveries, jeopardizing the generalizability of findings. In this dissertation I propose robust regularized estimators for sparse linear regression with reliable prediction and variable selection performance under the presence of contamination in the response and one or more predictors. I present theoretical and extensive empirical results underscoring that the penalized elastic net S-estimator is robust towards aberrant contamination and leads to better predictions for heavy tailed error distributions than competing estimators. Especially in these more challenging scenarios, competing robust methods reliant on an auxiliary estimate of the residual scale, are more affected by contamination due to the high finite-sample bias introduced by regularization. For improved variable selection I propose the adaptive penalized elastic net S-estimator. I show this estimator identifies the truly irrelevant predictors with high probability as sample size increases and estimates the parameters of the truly relevant predictors as accurately as if these relevant predictors were known in advance. For practical applications robustness of variable selection is essential. 
This is highlighted by a case study for identifying proteins to predict stenosis of heart vessels, a sign of complication after cardiac transplantation. High robustness comes at the price of more taxing computations. I present optimized algorithms and heuristics for feasible computation of the estimates in a wide range of applications. With the software made publicly available, the proposed estimators are viable alternatives to non-robust methods, supporting discovery of generalizable scientific results. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2020-08-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-ShareAlike 4.0 International |
DOI | 10.14288/1.0392915 |
URI | http://hdl.handle.net/2429/75637 |
Degree | Doctor of Philosophy - PhD |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-sa/4.0/ |
AggregatedSourceRepository | DSpace |