Median Loss Analysis and Its Application to Model Selection

by

Chi Wai Yu

B.Sc., Mathematics, The Hong Kong University of Science and Technology, 2002
M.Phil., Mathematics, The Hong Kong University of Science and Technology, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics)

The University of British Columbia (Vancouver)

April 2009

© Chi Wai Yu, 2009

Abstract

In this thesis, we propose a median-loss-based procedure for inference. The optimal estimators under this criterion often have desirable properties. For instance, they have good resistance to outliers and are resistant to the specific loss used to form them. In the Bayesian framework, we establish the asymptotics of median-loss-based Bayes estimators. It turns out that the median-based Bayes estimator has a root-n rate of convergence and is asymptotically normal. We also give a simple way to compute this Bayesian estimator. In regression problems, we compare the median-based Bayes estimator with two other estimators. One is the Frequentist version of our median-loss-minimizing estimator, which is exactly the least median of squares (LMS) estimator, and the other is the two-sided least trimmed squares (LTS) estimator. This comparison is natural because the LMS estimator is median-based but only has cube-root-n convergence, while the two-sided LTS is not median-based but has root-n convergence. We show that our median-based Bayes estimator is a good tradeoff between the LMS and two-sided LTS estimators. For model selection problems, we propose a median analog of the usual cross validation procedure. In the context of linear models, we present simulation results to compare the performance of cross validation (CV) and median cross validation (MCV).
Our results show that when the error terms come from a heavy-tailed distribution, or from the normal distribution with small values of the unknown parameters, MCV works better than CV does in terms of the probability of choosing the true model. By contrast, when the error terms come from the normal distribution and the values of the unknown parameters are large, CV outperforms MCV.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Statement of Co-Authorship

1 Introduction
  1.1 Motivation
  1.2 Utility theory
  1.3 Statistical decision theory
    1.3.1 Loss and risk functions
    1.3.2 Admissibility and minimaxity
    1.3.3 Bayes estimation
    1.3.4 Median as an appropriate measure for the overall evaluation of our choice
  1.4 Empirical aspects of optimizing medians
    1.4.1 Heavy-tailed data
    1.4.2 Data with outliers
  1.5 A brief introduction of the thesis
  1.6 References

2 Median Loss Analysis
  2.1 Introduction
  2.2 Quantile axiomatic foundation
    2.2.1 Classical expected-utility models
    2.2.2 Problems with expected utility
    2.2.3 Quantile utility
  2.3 Statistical decision theory under median loss
    2.3.1 Usual criteria for comparing estimates
    2.3.2 Median loss alternative
  2.4 Local properties of medloss estimation
    2.4.1 Estimation using the translation and scale classes
    2.4.2 Estimation under the class of median-unbiasedness
  2.5 Global properties of medloss estimation
    2.5.1 Median-admissibility and median-minimaxity
    2.5.2 Median-inadmissibility for linear models
    2.5.3 Robustness of the best medloss estimator to the choice of loss functions
  2.6 Properties for the posterior median loss estimation
    2.6.1 General procedure for computing the posterior medloss estimator
    2.6.2 Closed form of the posterior medloss estimator for unimodal and symmetric posterior densities
    2.6.3 Prediction
  2.7 Implications
    2.7.1 Frequentist medloss in regression problems
    2.7.2 Model selection based on median cross validation
  2.8 Discussion
  2.9 References

3 Asymptotics of Bayesian Median Loss Estimation
  3.1 Introduction
  3.2 Main results
  3.3 Asymptotics for two related estimators
  3.4 A comparison of posterior medloss estimator, LMS, and LTS estimators
  3.5 References

4 Asymptotics of the Least Quantile of Squares Estimator in Nonlinear Models
  4.1 Introduction
  4.2 Main result
  4.3 Detailed proof for the LQS estimator
    4.3.1 Manageability
    4.3.2 O_p(n^{-1/2}) rate of convergence of r_n
    4.3.3 Conditions for Kim and Pollard's Lemma 4.1 are satisfied in the nonlinear case
    4.3.4 Check the conditions of Kim and Pollard's main theorem
    4.3.5 Proof of asymptotic results for the LQS estimators in nonlinear models
  4.4 References

5 Median-Based Cross Validation for Model Selection
  5.1 Introduction
  5.2 Motivation and methodology: Median cross validation
  5.3 Two model cases
    5.3.1 Theoretical results
    5.3.2 Dependence of model selection on the size of parameters
  5.4 Three term case
    5.4.1 Theoretical results
    5.4.2 Simulation
  5.5 Discussion
  5.6 References

6 Summary and Future Plan
  6.1 Summary of the thesis
  6.2 Future plans
    6.2.1 Using the median in other model classes
    6.2.2 Model averaging for prediction
    6.2.3 Penalized median-loss methods
    6.2.4 Other possibilities
  6.3 References

Appendices

A Appendix to Chapter 2
  A.1 Classical axiomatic expected utility
  A.2 The Allais paradox and Ellsberg's paradox
  A.3 Rostek's axiomatic quantile utility
  A.4 References

B Appendix to Chapter 3
  B.1 Assumptions
    B.1.1 Assumptions TD
    B.1.2 Assumptions TH
    B.1.3 Assumptions TI
  B.2 Proofs of the limiting results of two-sided LTS estimators in nonlinear regression models
    B.2.1 Preliminary results
    B.2.2 Proof of Lemma 17 for the uniform law of large numbers
    B.2.3 Proof of Proposition 3 on asymptotic linearity
    B.2.4 Proofs of our main theorem
  B.3 References

C Appendix: Inequalities for Medloss and Risk
  C.1 Inequalities for medloss and risk
    C.1.1 Inequality for the median loss and expected loss for unimodal distributions
    C.1.2 Some results for symmetric and unimodal location families
  C.2 General results for exponential families
  C.3 References

D Appendix: Nonparametric Curve Estimation
  D.1 Nonparametric curve estimation with kernel smoothing approaches
    D.1.1 Selection of bandwidth h by CV
    D.1.2 Median kernel smoother for curve estimation

List of Tables

1.1 Data set of Y, X1 and X2.
1.2 Estimator analysis.
5.1 Comparison of MCV and CV under normal and heavy-tailed errors.
5.2 Dependence of MCV and CV on parameters.
5.3 Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from N(0,1) and covariates x from N(0,1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.
5.4 Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the t1 (Cauchy) distribution and covariates x from N(0,1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.
5.5 Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the t0.5 distribution and covariates x from N(0,1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.
5.6 Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the Lévy (α = 0.5) distribution and covariates x from N(0,1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.

List of Figures

1.1 Top: the quantile plot of the model of Y on X1. Bottom: the quantile plot of the model of Y on X2.
1.2 The normal quantile plot of the model of Y on X1 and X2.
2.1 n = 10 and 50. The solid curves show the medloss as a function of c; the dashed curves show the expected loss. The dotted vertical line indicates the point at which the expected loss attains its minimum, and the dotted horizontal line the corresponding expected loss at that point. The minimizers of the medloss and the expected loss get closer and closer as n increases.
2.2 The posterior risk and posterior medloss as functions of δ(x). The dashed curve represents the posterior risk for L2, while the solid one is the posterior medloss under L2. The central vertical line indicates the position of the posterior median. The left and right vertical lines indicate the positions of the posterior medloss estimate and the posterior mean, respectively.
2.3 The dark lines are for the best risk estimators and the light lines for the best medloss estimators. The solid lines represent the sample means of two sequences of the best risk and medloss estimators, while the dashed lines represent the sample medians of these two sequences. The top panel uses L1 loss and the bottom panel L2 loss. In these two panels neither method dominates the other, since the DG and DA models are the same.
2.4 Again, the top panel is for L1 loss and the bottom one for L2 loss. In these two panels, the DG has a heavier tail than the DA model. The solid and dashed dark lines are always above the solid and dashed light lines, in the sample-mean and sample-median cases, respectively.
2.5 The solid lines are the errors for the LMS estimator as a function of the boundary B. The dashed lines are for the LS estimator.
2.6 The solid circles represent the median-based CV criterion, and the squares represent the usual expectation-based CV criterion.
2.7 When the error distribution is N(0,1), the usual CV works better than our MCV. However, outside the normal error case, MCV outperforms CV.
5.1 Proportion of times each model is selected by the 5-fold MCV and CV criteria for samples with different error distributions. The black dot represents the MCV performance, while the open square represents the CV performance.
5.2 The zero and non-zero points on the x-axis correspond to the cases where Model 1 and Model 2 are true, respectively. The vertical dotted line represents the value ‖β2/σ‖ = 0.19.
5.3 Black dots mean MCV outperforms, i.e., the proportion of times MCV chooses the correct model is higher than for CV. Open circles mean the reverse. The dashed straight line indicates the line β2 = 0.19σ. All black dots are in the region 0 < β2 < 0.19σ.
5.4 LOO and 5-fold MCV and CV for normal errors.
5.5 LOO and 5-fold MCV and CV for Cauchy errors.
5.6 Comparison of MCV and CV for the 2-model case by β2 and the degrees of freedom v of the t distribution. Black dots mean MCV works better, in the sense of a higher proportion of times choosing the correct model; open circles mean CV outperforms.
5.7 Comparison of MCV and CV for the 2-model case by β2 and the exponent α of the Lévy stable distribution. Again, black dots mean MCV works better than CV; open circles mean CV outperforms.
5.8 3-term nested model: black dots mean MCV outperforms CV, in the sense that the proportion of times the correct model is chosen by MCV is higher than by CV; open circles mean CV does better. The dashed straight line represents the line β2 = 0.19σ. All black dots are in the region 0 < β2 < 0.19σ.
5.9 3-term nested model: comparison of MCV and CV by β2 and the degrees of freedom v of the t distribution. Black dots mean MCV works better; open circles mean CV outperforms MCV.
5.10 3-term nested model: comparison of MCV and CV by β2 and the exponent α of the Lévy stable distribution. Again, black dots mean MCV works better; open circles mean CV outperforms MCV.
5.11 Non-nested case for a class of tv error distributions.
5.12 Non-nested case for a class of Lévy α-stable error distributions.

Acknowledgements

For the completion of this thesis, I would first like to express my heartfelt gratitude to my supervisor, Dr. Bertrand Clarke, for his invaluable advice and guidance, patience, remarkable insights and encouragement. I have learned many things from him, particularly regarding academic research and analytical writing.
Secondly, I would like to express my sincere appreciation to my supervisory committee members, Dr. Paul Gustafson and Dr. Matías Salibián-Barrera, for the time they dedicated to discussing research with me, for reviewing my thesis, and for their precious advice. Next, I sincerely thank Peggy Ng, Viena Tran and Elaine for their help with administrative matters; Viena in particular helped me a great deal. The faculty, graduate students and staff in the Department of Statistics also provided a supportive and friendly environment and made my stay here an enriching experience. Last, but not least, I would like to dedicate this thesis to my parents. I also greatly appreciate my girlfriend for her steadfast support and encouragement throughout my Ph.D. study.

Chi Wai Yu
The University of British Columbia
April 2009

Dedication

To my parents and my girlfriend Margaret Chan

Statement of Co-Authorship

This thesis was finished under the supervision of my supervisor, Professor Bertrand Clarke. Chapter 2 is co-authored with Professor Bertrand Clarke. My main contributions are proposing and justifying a median-loss-based decision theory in statistics, giving a formal definition of the median-loss-based criterion for parameter estimation, and developing results based on our criterion parallel to the usual expected-loss-based criterion. Chapter 3 is co-authored with Professor Bertrand Clarke. My main contributions are establishing the asymptotic results for our proposed estimator in the Bayesian context, and extending the results for two other existing Frequentist estimators in regression problems for comparison with our proposed Bayesian estimator. Chapter 4 is co-authored with Professor Bertrand Clarke. My main contribution is the derivation of the asymptotic properties of one of the extended Frequentist estimators from Chapter 3. Chapter 5 is co-authored with Professor Bertrand Clarke.
Professor Bertrand Clarke and I developed a new median-based cross validation technique for model selection, and compared the performance of our new method and the usual cross-validatory approach in a simulation study.

Chapter 1

Introduction

1.1 Motivation

The motivation for this thesis is twofold: the theoretical motivation comes from decision theory, and the empirical motivation comes from the concern for heavy tails and robustness to outliers. Foundationally, we argue that using the median of the loss in place of the expected loss gives a viable alternative to classical decision theory. This median-based decision theory parallels mean-based decision theory but provides a different evaluation of preferences. It is clear that operators other than the mean and median can give further variations on classical decision theory; however, we do not consider these here. Our point is that median-based methods enjoy all the conceptual benefits that mean-based methods do, and consequently arguments against their widespread use must rest on something outside the usual desirable properties of inference procedures. In other words, it is very hard to find good reasons not to use median-based methods such as we develop here. Indeed, in many problems, we can systematically replace the mean by the median.

Empirically, it is well known that many estimators based on the moments of observations are sensitive to outliers. However, it is usually not known whether a candidate outlier is a correct observation or not. If the observation can be determined to be wrong, or grossly unrepresentative, for some reason, there is a good argument for simply deleting it. Otherwise, it must be accepted at face value, or some way must be found to reduce its influence. There are many ways to deal with overly influential observations; however, we prefer here to bypass such thorny and often controversial details.
We go directly to a procedure that is insensitive to candidate outliers but captures the typical location of the data: the median. To begin our study of the median as a general way to do inference, we review basic decision theory to see how it changes if one adopts a median perspective. Then we see how heavy-tailed errors can arise quite commonly in linear regression problems. In such problems, the median perspective may be much more reasonable, since moments need not be assumed and outliers do not matter very much.

1.2 Utility theory

Utility measures the relative satisfaction from, or desirability of, the consumption of various goods and services. Given this measure, one may speak meaningfully of increasing or decreasing utility, and thereby explain the choice behavior of individuals in terms of increases or decreases in utility. For choice behavior under uncertainty, Bernoulli (1738) first proposed an expected utility hypothesis and recommended that the choice with the highest expected utility be regarded as the most desirable. von Neumann and Morgenstern (1944) reinterpreted the expected utility model and developed its axiomatization. Their general result was that if the choice behavior of individuals satisfies two axioms, a given ordering on choices can be represented by the order on their expected utilities, and conversely. The two axioms von Neumann and Morgenstern (1944) identified were the following.

Independence Axiom: For all p, q, r ∈ P and 0 < λ ≤ 1, if p ≻ q, then λp + (1 − λ)r ≻ λq + (1 − λ)r, where P is a set of probability distributions and p ≻ q means that p is strictly preferred to q.

Continuity Axiom: For all p, q, r ∈ P, if p ≽ q ≽ r, then there exists a number λ ∈ [0, 1] such that λp + (1 − λ)r ∼ q, where p ≽ q means that p is weakly preferred to q, and p ∼ q is equivalent to p ≽ q and q ≽ p, which means that we are indifferent between p and q.
Parallel to von Neumann and Morgenstern's axiomatic expected utility model with objective probability, Savage (1954) established a Bayesian version of the Expected Utility Theorem for subjective probability settings. He used six different axioms to construct his subjective expected utility model. That is, he gave six axioms under which a preference ordering could be made equivalent to an ordering based on an expected utility that used a subjective probability. (See the Appendix to Chapter 2 for details on this.) Since expected utilities have many nice properties, the theories of von Neumann and Morgenstern (hereafter vNM) and of Savage became popular, indeed dominant. However, both theories have been criticized as models for rational human behavior. The main way these criticisms have been presented is by deriving counter-intuitive properties from the axioms under which the vNM or Savage representations hold. For instance, one of the well-known violations of vNM's Independence axiom is Allais' paradox, Allais (1953). From the Bayesian perspective, Ellsberg's paradox, Ellsberg (1961), shows that Savage's Sure-Thing principle for actions – the Bayesian counterpart of vNM's Independence axiom for probability distributions – also contradicts real-life decision making. These two paradoxes are explained in more detail in the Appendix to Chapter 2. Taken together, these findings led to the development of numerous non-expected utility models. One of the alternatives is advocated by Manski (1988). Unlike vNM's expected utility model, he suggested using quantiles of the distribution of the utility to model choices under uncertainty. In fact, quantiles have been widely used in many applied areas of statistics. For instance, one of the most popular measures of risk in finance is Value-at-Risk (VaR), which is often defined as a high quantile of the loss (the negative of the utility), say the 0.95 or 0.99 quantile.
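As a minimal numerical sketch (not part of the thesis), the quantile-based quantities just mentioned can be computed directly from a simulated loss sample; the Student-t distribution and sample size here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated losses from a heavy-tailed Student-t distribution (3 degrees of freedom).
losses = rng.standard_t(df=3, size=100_000)

# Value-at-Risk as a high quantile of the loss distribution,
# alongside the 0.5 quantile -- the median loss used throughout this thesis.
var95 = np.quantile(losses, 0.95)
var99 = np.quantile(losses, 0.99)
median_loss = np.quantile(losses, 0.5)

print(median_loss < var95 < var99)  # True: quantiles are monotone in the level
```

Since the quantile function is monotone in its level, the median loss always sits below these high-quantile risk measures, which is why it describes the "typical" rather than the extreme loss.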
However, Manski (1988) did not provide any axioms for his system. That is, he did not provide necessary and sufficient conditions for the ordering on preferences to be equivalent to their ordering under a quantile representation. This gap led Rostek (2007) to develop axioms under which a preference ordering on actions would be equivalent to the ordering of those actions under a quantile representation. Some of Rostek's axioms come from Machina and Schmeidler's work (1992). They gave an axiomatization for non-expected utility models in a Bayesian setting. However, as noted by Rostek, Machina and Schmeidler's axioms are not valid for quantile representations such as the median, thereby necessitating her treatment of the problem. Here we invoke Rostek's quantile utility model for statistical loss-based approaches frequently. It is so foundational to our work that we routinely use it without comment after Chapter 2. Specifically, because we focus on the central tendency of the loss (often defined as the negative of the utility), we consider the median (the 0.5 quantile) of the loss. We do this because we want to avoid overestimation as much as underestimation, and overprediction as much as underprediction. We remark that throughout this thesis we are not developing quantile utility theory but only using it to give axiomatic support for our median version of decision-theoretic methods, just as von Neumann and Morgenstern's axiomatization supports the expected utility model underlying classical statistical decision theory. To understand how the median or quantile criteria describe people's behavior under uncertainty, see Rostek (2007) for utility problems involving data, and Walsh (1969, 1970, 1971) and De Vries (1974) for the median and quantile versions, respectively, of two-person zero-sum game problems involving no data.
1.3 Statistical decision theory

As the name suggests, decision theory is a theory about how to make good choices in the absence of full information. It is well known that classical statistical decision theory is one of Abraham Wald's most important contributions to statistics. Roughly speaking, Wald's statistical decision models can be viewed as a statistical analog of vNM's and Savage's expected utility models coupled with an experiment involving observations and parameters. Purely to set the record straight, we observe that although von Neumann and Morgenstern (1944) influenced the final form of Wald's ideas on his statistical decision theory, Wald (1950), the core ideas of Wald's theory were already present in his paper Wald (1939), which predated vNM. Moreover, the specifically statistical elements of Wald's theory go far beyond vNM's and Savage's theories. Specifically, Wald (1939) developed a unified structure into which estimation and hypothesis testing can be readily embedded. He also introduced much of the mental landscape of contemporary decision theory, including loss functions, risk functions, admissible decision rules, a priori distributions, Bayes decision rules, and minimax decision rules. In what follows, we briefly recall these because they recur in a different form under our median-loss-based decision theory.

1.3.1 Loss and risk functions

The loss function L plays a central role in Wald's statistical decision theory. In statistics, the loss is defined to be a function mapping from the product of a decision space D and a parameter space Θ into the set of non-negative real numbers R_0^+ = {c ∈ R : c ≥ 0}. That is, L : D × Θ → R_0^+. The decision space D is a set of functions δ(X^n), where we often call the δ(X^n) decisions, rules, strategies, or estimators; they map from the sample space X^n into the set of real numbers R or real vectors R^d.
In many situations, it is reasonable to replace R or R^d by the parameter space Θ ⊂ R or Θ ⊂ R^d, respectively. In estimation, L(δ(X^n), θ) is a measure of the accuracy when we use δ(X^n) to estimate a parameter θ. Generally, the loss is defined to be a function of the difference between the estimated value and the true value of the parameter. The most commonly used loss functions are

1. the absolute error loss L1(δ, θ) = |δ − θ|,
2. the squared error loss L2(δ, θ) = |δ − θ|^2, and
3. the 0-1 error loss L01(δ, θ) = 1 if |δ − θ| > ε and 0 otherwise, where ε > 0.

The first, natural for probability densities, weights all differences equally. The second weights big differences much more than small ones; it is mostly used for mathematical convenience because Euclidean norms have an associated inner product. The third loss, sometimes called classification loss, detects when two quantities are the same or different but ignores the amount by which they differ.

Note that the value of the loss is random because of the randomness of X^n and of Θ in the Frequentist and Bayesian contexts, respectively. Conventionally, the way to remove the randomness in the 'random accuracy' L(δ(X^n), θ) is to take an average of the loss over X^n in the Frequentist framework. The expected loss with respect to X^n is called the risk and is defined by R_θ(δ(X^n)) = E_{X^n}[L(δ(X^n), θ)]. Converting the random variable to a real number puts the value of all estimators on the same scale for easy comparison. Now, in principle, the adequacy of any estimator is summarized by its risk, so we can find the optimal estimator by minimizing the risk. Unfortunately, an 'optimal' estimator minimizing the risk uniformly does not exist, in general. This is so because the risk R_θ is usually a nontrivial function of θ. To see what goes wrong, suppose we want to compare two estimators, δ1 and δ2.
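As an illustrative sketch (not from the thesis), the three losses and a Monte Carlo approximation of the risk can be written out directly; the choice of the sample mean as the estimator δ, the N(θ, 1) data model, and all numerical settings are arbitrary assumptions for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# The three losses above, vectorized over estimates delta for a fixed true theta.
def L1(delta, theta):
    return np.abs(delta - theta)

def L2(delta, theta):
    return np.abs(delta - theta) ** 2

def L01(delta, theta, eps=0.1):
    return (np.abs(delta - theta) > eps).astype(float)

# Monte Carlo approximation of the risk R_theta(delta(X^n)) = E[L(delta(X^n), theta)]
# for the sample mean as estimator, with X_i ~ N(theta, 1); the exact L2 risk is 1/n.
theta, n, reps = 2.0, 25, 20_000
deltas = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)  # delta(X^n) per replication

risk_L2 = L2(deltas, theta).mean()          # approx Var(sample mean) = 1/n = 0.04
risk_L01 = L01(deltas, theta).mean()        # approx P(|mean - theta| > 0.1)
medloss_L2 = np.median(L2(deltas, theta))   # the median loss, the thesis's alternative
print(risk_L2, risk_L01, medloss_L2)
```

Note that the median of the (skewed, non-negative) L2 loss falls well below its mean here, a small preview of the median-versus-mean contrast developed in Chapter 2.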
Then each δj generates a curve based on its risk, by regarding y = E_{X^n}[L(δj(X^n), θ)] as a curve in the (θ, y) plane. It is typical for the two risk curves of δ1 and δ2 to cross each other. This means that there are regions of the parameter space where δ1 has smaller risk than δ2 and regions where δ2 has smaller risk than δ1. So neither is universally preferred over the other in general. In those relatively rare cases where one estimator is always better than another, we say an estimator δ1 dominates an estimator δ2 if and only if R_θ(δ1) ≤ R_θ(δ2) for all θ and the inequality is strict for at least one θ. Overall, this means that additional criteria for finding an optimal estimator are required.

1.3.2 Admissibility and minimaxity

In classical (Frequentist) decision theory, we consider two weaker optimality criteria for the dominance of estimators over Θ. One such criterion is called admissibility. An estimator is admissible if and only if no other estimator dominates it; otherwise it is inadmissible. An admissible estimator should be preferred over an inadmissible one, since for any inadmissible estimator there is an admissible one that performs at least as well for all possible θ and better for some. The second criterion is minimaxity. An estimator δ*(X^n) is called minimax if its maximal risk is minimal among all estimators under consideration. That is,

sup_{θ∈Θ} R_θ(δ*(X^n)) = inf_{δ∈D} sup_{θ∈Θ} R_θ(δ(X^n)).

This means that the minimax estimator δ*(X^n) is the one that performs best in the worst possible case allowed.

1.3.3 Bayes estimation

Alternatively, we can use a Bayesian approach to look for an 'optimal' estimator in an overall sense. In conventional Bayesian decision theory, instead of taking the average of the loss over X^n as in the Frequentist approach, we remove the randomness of the loss by averaging over Θ, since x^n is already fixed.
The corresponding expected loss with respect to the posterior distribution Π(Θ|x^n) is called the posterior risk and is defined by r_Π(δ(x^n)) = E_Π[L(δ(x^n), Θ)]. An estimator minimizing the posterior risk is called a Bayes estimator, provided its posterior risk is finite. There is a well-known theorem that gives standard assumptions under which a Bayes estimator is equivalent to a minimax or admissible estimator; see Lehmann (1983, Chapter 4).

1.3.4 Median as an appropriate measure for the overall evaluation of our choice

In Section 1.2, we have already seen that a quantile representation for choice behavior is as justified theoretically as an expected utility representation is. In addition, since medians do not require moments, quantile representations are more generally applicable than the expected utility representations justified by vNM's or Savage's work. Separate from the existence of moments, we note that the loss function is a non-negative random variable and essentially always has a skewed distribution. Often, it is quite strongly skewed. It is well known that for skewed distributions, the median is more representative of the bulk of the distribution than the mean is. Thus, from the standpoint of representing the location of the most typical losses, the median is more appropriate than the mean. One of our goals in this thesis is to develop a median analog of risk, admissibility, minimaxity and Bayes estimation in Chapter 2. Moreover, the asymptotics of the median-based Bayes estimator is our main contribution in Chapter 3.

1.4 Empirical aspects of optimizing medians

From the empirical standpoint, the considerations up to now are only peripherally relevant. Of much more immediate importance are data sets that are i) from a heavy-tailed distribution or ii) contaminated by outliers. In a sense, these two issues are related: if the tails are heavy, then any seeming outliers may well be representative of the population.
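As a small numerical illustration of the point made in Section 1.3.4 that the median better represents a skewed loss distribution, consider the following sketch (Python with numpy; the chi-squared loss example and sample size are our own illustrative choices, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Squared-error losses are non-negative and typically right-skewed:
# simulate the loss (X - theta)^2 for X ~ N(theta, 1) with theta = 0.
# The resulting loss distribution is chi-squared(1), strongly skewed.
losses = rng.normal(0.0, 1.0, size=100_000) ** 2

mean_loss = losses.mean()        # pulled up by the long right tail (near 1.0)
median_loss = np.median(losses)  # near 0.455, the chi-squared(1) median

# The median sits well below the mean, closer to the bulk of typical losses.
print(mean_loss, median_loss)
```

The same qualitative gap appears for any strongly right-skewed loss: the mean is dragged toward the tail while the median stays with the typical values.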
More typically, even when heavy tails may be reasonable, often there are a small number of points that just do not seem to fit well with the rest. Sometimes this occurs because the small number of points lies in a region far from the others; sometimes this occurs because the small number of points has some pattern – a funnel shape in a residual plot, for instance – that does not fit well with the assumed randomness in, say, an error term.

For a variety of reasons, the normal distribution is central to inference in many settings. Sometimes this is justified by citing the Central Limit Theorem. Sometimes this is justified by maximum entropy. More typically, many practitioners just regard the normal error in a linear model, for instance, as the correct form of the error term on the grounds that it seems to work well in a wide range of settings and has a compelling inferential theory. Here, we note that the normal has light tails and all moments and so is ideal from the standpoint of conventional decision theory, based as it is on taking the mean of the loss.

The key problem with the widespread assumption of normality is that there is little evidence that validates it. That is, it is reasonable to assume normality only if the assumption is tested at the end of the analysis in some kind of predictive sense, whether cross-validatory or on new data. This can be a huge problem for regression techniques. Indeed, even in basic linear regression it is not hard to argue that the seeming success of normal inference is artificial. The preference for normality of errors is so strong that practitioners are often willing to use ill-justified tricks so that the normal assumption will appear reasonable. We suggest that the main reason analysts do this is for convenience: it permits them to avoid consideration of heavy-tailed distributions, which would complicate inference.
In part because of this bias toward normality, heavy-tailed distributions often go unrecognized in many applications. To defend our assertion that the overwhelming prevalence of normal error terms in modeling is artificial, we point out that in practice it is not hard to construct a normal error artificially in linear regression problems. This can be done as a consequence of the variable selection procedure. This follows because, if a model is wrong, then the unmodeled covariates can be shunted into the error term and become indistinguishable from it. If these covariates have the property that they 'correct' the heavy-tailed error to a light- or normal-tailed error, the model mis-specification will be very hard to detect. That is, because a response Y is expressed as a sum of Xβ and an error E, neither of which can be measured directly, the model is non-identifiable and so there are many possible solutions, most of which are inaccurate. We will see how this plays out in practice in simple simulations in the next subsection.

1.4.1 Heavy-tailed data

First, we observe that heavy-tailed data are already accepted as commonplace in many subject matter disciplines. For instance, in economics and finance, heavy-tailed distributions are used to model fluctuations of stock returns, excess bond returns, foreign exchange rates, commodity price returns and real estate returns, amongst other economic variables; see McCulloch (1996) and Rachev and Mittnik (2000). In the context of engineering, the noise in some communications and radar systems is also often heavy-tailed. The same problem occurs in image processing and signal processing. Indeed, in some fields like genomics, the normality assumption used with micro-array data is, even when it gives results that are reasonable, mostly the result of extensive processing of the raw measurements. In these cases, and many others, the normal assumption is often untenable.
Second, when the normal assumption is made incorrectly, most of the existing methods, e.g. approaches based on least squares reasoning, often do not work very well. Fundamentally, this is so because such methods rely heavily on the assumption of finite moments – a mathematical convenience that is problematic when the tails are heavy. This is apart from any problems with outliers or contamination, which only increase the extent to which normality is questionable. To see how poorly methods can perform when normality is incorrectly assumed, recall that, as mentioned before, it is easy for routine statistical analyses to construct a light-tailed error through model mis-specification. Indeed, if one sets out to be purposefully wicked, there are at least two ways to mis-specify a model so as to achieve normality. The first way is based on the non-identifiability of the error and the regression model; the second way is to cheat by using a transformation.

(i) Two examples of non-identifiability

Our first example models Y as a sum of X1 and X2, i.e.,

Y = β1 X1 + β2 X2,    (1.1)

where X1 is known to be N(0, 1) and X2 is known to be Cauchy. The true values of (β1, β2) are (1, 1). Now suppose measurements of the form (Yi, X1i) for i = 1, . . . , n are available. Then β2 X2 is the (Cauchy) error term. The least squares estimator of β1 can be found and the model Y = β1 X1 + E fitted. Then a quantile plot of the residuals against Cauchy quantiles is obtained. The top panel of Figure 1.1 shows that the fit is good. However, now suppose, more commonly, that measurements of the form (Yi, X2i) are available. Then the linear model becomes Y = β2 X2 + E and has a normal error term. The usual least squares estimate of β2 can be found and the model fitted. The quantile plot using normal quantiles can be obtained, as shown in the bottom panel of Figure 1.1. Again, it is seen that we get a good fit.
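The first non-identifiability example can be sketched numerically as follows (an illustrative Python/numpy simulation under model (1.1); the sample size, seed and tail diagnostic are our own choices, not the thesis's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Model (1.1): Y = X1 + X2 with X1 ~ N(0,1), X2 ~ Cauchy, (beta1, beta2) = (1, 1).
x1 = rng.standard_normal(n)
x2 = rng.standard_cauchy(n)
y = x1 + x2

# Case 1: only (Y, X1) observed, so beta2*X2 plays the role of a Cauchy error.
b1 = (x1 @ y) / (x1 @ x1)   # least squares slope (no intercept)
resid1 = y - b1 * x1        # residuals look Cauchy

# Case 2: only (Y, X2) observed, so beta1*X1 plays the role of a normal error.
b2 = (x2 @ y) / (x2 @ x2)
resid2 = y - b2 * x2        # residuals look normal

# Crude tail diagnostic: ratio of the extreme residual to a typical one.
r1 = np.max(np.abs(resid1)) / np.median(np.abs(resid1))  # very large (heavy tail)
r2 = np.max(np.abs(resid2)) / np.median(np.abs(resid2))  # modest (light tail)
print(r1, r2)
```

Both fitted models pass a casual residual check against their respective reference distributions, exactly as the quantile plots in Figure 1.1 suggest.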
This shows that if the response Y really is a stochastic function of other variables, then the omitted terms form the error. Clearly, it is possible to include only those terms that give a normal error, provided the other terms sum to a quantity that represents the rest of the variability in Y.

Figure 1.1: Top: the quantile plot (residuals against t_1 quantiles) of the model of Y on X1. Bottom: the quantile plot (residuals against normal quantiles) of the model of Y on X2.

Our second example shows the problem is even worse than the first model reveals. To see why, let us look at a regression model

Y = β1 X1 + β2 X2 + E,    (1.2)

where E is the error term and X1, X2 are measured explanatory variables. Assuming data of the usual form, i.e., (Yi, X1i, X2i) for i = 1, . . . , n with the conventional assumptions of IID errors, the usual procedure for (1.2) is to find β̂_LS = (β̂1, β̂2) by least squares and then test submodels to see if terms can be dropped. Here, there are two submodels formed by omitting one or the other of the explanatory variables. Whether either of these submodels is reasonable can be evaluated by testing H1: β1 = 0 and H2: β2 = 0. As before, it is common practice to produce a quantile plot to ascertain whether a good fit has been achieved. We do precisely this analysis and then see how, in practice, it may not be as reliable as we would like.

To set up our routine analysis, we generated 30 outcomes for X1 and for X2. We take it as known that X1 ∼ N(0, 1) but assume the dependence of the distribution of X2 on X1 is not known. We will see how the distribution of the error used to generate Y, unknown to the analyst, affects the inference procedures. In fact, to make our point, we only look at individual data sets of size 30.
This means that the randomness in the generation of (X1, X2) can be ignored. It is enough to regard the outcomes as deterministically chosen design points that can be perturbed by an error distribution to give the corresponding outcomes for Y. Table 1.1 shows the data set we generated.

Now consider trying to fit the model (1.2). The usual analysis gives the entries in Table 1.2. It is seen that the least squares parameter estimates are β̂1 ≈ 3.8 and β̂2 ≈ 1. The standard errors indicate that both variables are far from zero. Indeed, if a multiple comparisons procedure using the conservative Bonferroni correction is used to test (β1, β2) = (0, 0), again neither parameter can reasonably be taken as zero. Thus, regardless of any dependence structure linking X1 and X2, the linear model is expressing something genuine about the relationship between (X1, X2) and Y. This is reinforced by noting R² = 0.9371 and R²_adj = 0.9326, which indicate that the fit is pretty good.

Outcome        Y       X1        X2    Outcome        Y       X1        X2
      1   -5.3033  -0.4101   -3.2433        16   83.2148   0.3581   73.9643
      2   -3.6200  -2.0153    4.5951        17    1.1077   0.0698   -0.4236
      3   -5.5315  -0.3860    1.9783        18    0.3226   0.0524    0.5338
      4   -7.9936   0.9578  -18.4266        19   -0.0885   0.1625   -1.6023
      5    0.9535   1.2655   -5.5287        20    1.9883  -0.3533    6.2485
      6   -3.7250  -1.3486   -2.6474        21   -0.4950  -0.1338   -0.1414
      7   -0.2690  -0.0403   12.0591        22    1.7951  -0.4801   12.7869
      8   -9.0174  -1.3326   -0.0708        23   -1.3429   0.6400   -1.0665
      9   72.6267   0.1414   68.4117        24   -3.0678  -0.8689    0.4740
     10    1.7789   0.4701    7.1028        25   -2.6882  -1.1170   -8.1612
     11    5.5903   0.1628    3.8304        26   -2.7023  -1.3752   -3.8681
     12    6.0789   1.1990    5.8940        27    4.8959   1.4115    1.4447
     13    2.0914   0.2894    2.0439        28    0.0509  -0.9251   12.0864
     14    6.9873  -0.7600   13.8218        29   -2.1155  -0.3520    5.8767
     15   -1.8672  -0.7837   -1.3106        30   -0.8377   0.2684   -0.9703

Table 1.1: Data set of Y, X1 and X2
Parameter    Estimate    Standard Error    t-value    p-value
β1             3.7988        1.1524          3.297     0.0027
β2             1.0014        0.0497         20.145    < 2 × 10⁻¹⁶

Table 1.2: Estimator analysis

As a final check on the fit of the model to the data, the normal-based quantile plot is given in Figure 1.2. It is seen that the residuals track the diagonal line reasonably closely and that about half the points are above it and the other half are below it. There is no bunching of the points over any region either. All in all, this is a reasonably successful fit, and the natural way to make point predictions is from the estimated regression function

Y = β̂1 X1 + β̂2 X2,    (1.3)

and the prediction interval can be obtained from the usual techniques.

Figure 1.2: The normal quantile plot of the model of Y on X1 and X2.

The problem with this analysis is that the final model obtained above is wrong. In fact, the data generating model of Y is

Yi = 2 X1i + Ei, for i = 1, . . . , n,    (1.4)

where {Ei : i = 1, . . . , n} are IID from the standard Cauchy distribution. The covariate X1 was generated from the standard normal distribution. The variable X2 was artificially constructed by setting X2 = ê + normal noise, where ê is the residual from the least squares fit in (1.4) using only Y and X1. This example shows that the usual technique for checking the normality assumption can be misleading, even when the fit is very good. It actually shows even more: in practice, analysts with large numbers of explanatory variables may just be selecting those variables that fit the error rather than explaining the response. That is, model mis-specification can be the result of the determination to show that a normal error provides a good fit. At root, this is possible because the basic model is not identifiable.
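The construction behind this second example can be sketched as follows (a hypothetical Python/numpy reenactment of the steps described above; the seed, noise scale and freshly generated data of size n = 30 are our own illustrative choices, not the dataset of Table 1.1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# True data-generating model (1.4): Y = 2*X1 + Cauchy error.
x1 = rng.standard_normal(n)
y = 2.0 * x1 + rng.standard_cauchy(n)

# Step 1: fit Y on X1 alone (no intercept) and keep the residuals e_hat.
b = (x1 @ y) / (x1 @ x1)
e_hat = y - b * x1

# Step 2: construct the artificial covariate X2 = e_hat + small normal noise.
x2 = e_hat + 0.1 * rng.standard_normal(n)

# Step 3: fit the full model Y = beta1*X1 + beta2*X2 by least squares.
X = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# X2 soaks up the heavy-tailed error, so the fit looks excellent and the
# residuals look light-tailed, even though X2 explains nothing real.
print(beta, r2)
```

Because X2 is essentially the old residual, the second-stage residuals are dominated by the small normal noise, which is exactly why the quantile plot in Figure 1.2 looks so clean.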
Again: the regular appearance of normal errors in many linear regression problems may be spurious. Light tails may be nothing more than a construct of the fitting procedure, because unmodeled terms just end up contributing to the error, possibly making it smaller rather than larger. Overall, these two examples are a warning about the determined pursuit of normality: the rejection of heavy-tailed distributions may be forcing analysts to do bad modeling by improper use of variable selection or other techniques. In general, this will not be found out unless further data are gathered to validate the model – a practice which is relatively uncommon.

(ii) The problem with transformations

A second way to get normality dishonestly is via transformation of the variables. One way to do this is to use the Inverse Probability Integral Transform (IPIT) of the response Y. Suppose that Y ∼ FY, where FY is the continuous distribution function of Y, and denote by Φ the standard normal distribution function. It is well known that FY(Y) has a standard uniform distribution. So, by the IPIT, Y′ = Φ⁻¹(FY(Y)) has the same distribution function as Φ. In other words, (Φ⁻¹ ◦ FY)(·) transforms the original data into normally distributed data. On the face of it, this is just a mathematical fact. However, the problem is that if an analyst wants to use this to transform non-normal data to normality, the distribution function FY must be estimated. This requires data – a lot of data, because an entire function is being estimated rather than a single parameter. Therefore, few if any degrees of freedom remain for estimating the parameters in the normal data.
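As an illustration of the IPIT, the following sketch (Python with numpy and the standard library; the plotting-position estimate of FY is our own choice) forces a heavy-tailed Cauchy sample to look standard normal:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
nd = NormalDist()  # standard normal, gives Phi^{-1} via inv_cdf

# Heavy-tailed sample: standard Cauchy.
y = rng.standard_cauchy(500)

# Empirical CDF value of each observation; the -0.5 offset keeps values
# strictly inside (0, 1) so Phi^{-1} is finite.
ranks = np.argsort(np.argsort(y)) + 1
u = (ranks - 0.5) / len(y)

# IPIT: Y' = Phi^{-1}(F_hat(Y)) is forced to look standard normal.
y_prime = np.array([nd.inv_cdf(p) for p in u])

print(np.mean(y_prime), np.std(y_prime))
```

The transformed sample has mean near 0 and standard deviation near 1 by construction, but the whole sample was spent estimating FY, so this normality is bought, not found.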
More sophisticated methods for transforming data have the same drawback: unless the degrees of freedom associated with the selection or estimation of the transformation are accounted for in the downstream estimation of parameters, the estimates will, at a minimum, have much larger standard errors than would be obtained from n independent data points. That is, although normality can be achieved, the post-transformation data carry much less information. The fraud in this case is to pretend that transforming to get normality is cost free.

If these two ways to cheat are disallowed, then the occurrence of normality is much reduced. Since lighter tails like the normal's are then implausible, the consequence is that heavier-tailed distributions may occur more commonly than anyone cares to admit. Fortunately, to handle problems with heavy-tailed distributions, we do not need to use any bizarre or exotic techniques: just use the median in place of the mean.

1.4.2 Data with outliers

Another setting in which median-based methods perform well is the contamination class of distributions. Contamination classes are one common way to model outliers. For such data, standard practice for achieving robustness against outliers in regression problems devolves to Huber-type reasoning; see Huber (1964, 1973). Parallel to least squares (LS) approaches, this technique deals with outliers by choosing a criterion function ρ, and then estimating the regression parameters by minimizing

Σ_{i=1}^{n} ρ(yi − xi′β).    (1.5)

It is typical to assume the first derivative of ρ is bounded, to reduce the sensitivity to outliers. For example, take ρ(t) = |t|, so ρ′(t) = 1 if t > 0, −1 if t < 0 and 0 if t = 0. The method based on this ρ is called the least-absolute-deviation (LAD) approach or median regression, which is a special case of the quantile regression methods of Koenker and Bassett (1978). Note that LS results from choosing ρ(t) = t², for which ρ′(t) = 2t is unbounded.
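In the simplest location version of (1.5), the minimizers for these two choices of ρ are exactly the sample mean and the sample median, which a brute-force grid search illustrates (a Python sketch; the data vector is a made-up example with one gross outlier):

```python
import numpy as np

# Location version of (1.5): estimate m by minimizing sum_i rho(y_i - m).
y = np.array([1.1, 0.9, 1.0, 1.2, 0.8, 50.0])  # one gross outlier

grid = np.linspace(-5.0, 55.0, 60001)  # step 0.001

# rho(t) = t^2 (least squares): the minimizer is the sample mean.
ls = grid[np.argmin([np.sum((y - m) ** 2) for m in grid])]

# rho(t) = |t| (LAD / median regression): the minimizer is a sample median.
lad = grid[np.argmin([np.sum(np.abs(y - m)) for m in grid])]

print(ls, np.mean(y))     # both near 55/6 = 9.17: dragged toward the outlier
print(lad, np.median(y))  # lad lies in [1.0, 1.1], the flat LAD minimum;
                          # np.median(y) = 1.05 is in the same interval
```

The unbounded ρ′ of least squares lets the single outlier drag the estimate far from the bulk of the data, while the bounded ρ′ of the LAD criterion leaves the estimate with the majority of the points.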
So, in the presence of outliers, Huber-type methods often outperform methods based purely on sums of squares. The obvious drawback to the Huber-style approach in (1.5) is that we have to determine ρ. This is important because the robust estimates depend meaningfully on the choice of ρ in general; see Huber (1981).

Here, we despair of fixing the problems inherent in expectation-based methods (i.e. those whose corresponding operator is the expectation, a sum or a sample mean) and propose a modification that we argue works much better – without having to make any subjective choices or correct for data-driven quantities. Specifically, we advocate using the median in place of the mean. That is, in any optimality criterion that rests mathematically, or conceptually, on taking the expectation of a random variable, we study the corresponding expression using the median instead. It will be seen that, as can be anticipated, the use of the median routinely gives higher resistance to outliers than expectation-based techniques. Median-based estimators (i.e. those whose corresponding operator is the population median or sample median) are often the same as mean-based estimators for symmetric distributions with finite mean but, as will be seen, are typically different for asymmetric distributions and for symmetric distributions with infinite moments, and such distributions often appear in practice. Moreover, using the median has other benefits, including invariance (up to strictly increasing transformations) and improved model selection (because expectation-based techniques often give too much sparsity).

1.5 A brief introduction to the thesis

This thesis is manuscript-based: each chapter is self-contained, including references, but some of the more routine technical arguments in proofs are relegated to Appendices. The rest of this thesis is organized as follows.

1.
In Chapter 2, we propose a formal definition and detailed arguments for our median-loss-based approach to inference. Parallel to the usual risk-based techniques, we develop their median-loss-based versions and get parallel finite-sample results, such as median minimaxity, median admissibility and median Bayes estimators. Unlike the usual Bayes estimators, our median-based Bayes estimators are robust to the choice of loss function. We also provide a generic algorithm for finding the optimal median-based Bayes estimators.

2. In Chapter 3, the asymptotics of median-loss-based Bayes estimators are established. This yields the root-n convergence and asymptotic normality of the median-loss-based Bayes estimators. Furthermore, we study a two-sided least trimmed squares (LTS) estimator and the Frequentist version of our median-loss-based estimator, which is exactly Rousseeuw's least median of squares (LMS) estimator; see Rousseeuw (1984). Our results show that the median Bayes estimator is a good tradeoff between the LMS and 2-sided LTS estimators.

3. Chapter 4 provides the technical details and proofs of the asymptotics of the least αth-quantile of squares estimators in nonlinear models. The limiting result for LMS estimators in Chapter 3 is a special case of the results in this chapter when α = 0.5 is chosen.

4. In Chapter 5, we propose an application of our median-loss-based procedure to model selection. Specifically, we propose a median analog of standard cross validation (CV), which we call Median Cross Validation or MCV. In the context of linear models, we present simulation results to illustrate the performance of MCV and the usual CV in cases of normal and heavy-tailed error distributions. We find that MCV works better than CV does in the predictive sense when the error has heavy tails. For cases with large values of the true parameters, CV outperforms MCV in the predictive sense when the error is light-tailed, for instance normal.
However, we get the reverse results for small values of the true parameters. That is, MCV chooses the true model a higher proportion of the time than CV does when the error is normal and the true values of the parameters are small. This was unexpected because CV is based on least squares approaches, and expectation-based optimizations are usually better than other methods when the errors are normal.

5. Chapter 6 contains a summary of what this thesis has achieved, along with our future plans. Furthermore, Appendix C presents material that provides a comparison of median-based and expectation-based procedures from the standpoint of the expectation and median of the loss. In Appendix D, we discuss the application of MCV to nonparametric curve estimation with kernel-based approaches.

1.6 References

Bernoulli, D. (1954). Exposition of a New Theory on the Measurement of Risk (originally published in 1738; translated by Dr. Louise Sommer), Econometrica, 22(1), 22-36.

De Vries, H. (1974). Quantile Criteria for the Selection of Strategies in Game Theory, Int. Journal of Game Theory, 3(2), 105-114.

Huber, P.J. (1964). Robust Estimation of a Location Parameter, The Annals of Mathematical Statistics, 35, 73-101.

Huber, P.J. (1973). Robust Regression: Asymptotics, Conjectures and Monte Carlo, Ann. Statist., 1, 799-821.

Huber, P.J. (1981). Robust Statistics. Wiley, New York.

Koenker, R.W. and Bassett, G.W. (1978). Regression Quantiles, Econometrica, 46, 33-50.

McCulloch, J.H. (1996). Financial Applications of Stable Distributions, in G.S. Maddala and C.R. Rao (eds.), Handbook of Statistics, Vol. 14, Elsevier, 393-425.

Rachev, S.T. and Mittnik, S. (2000). Stable Paretian Models in Finance. Wiley, New York.

Rostek, M.J. (2007). Quantile Maximization in Decision Theory. Unpublished manuscript.

Rousseeuw, P.J. (1984). Least Median of Squares Regression, J. Amer. Statist. Assoc., 79, 871-880.

Von Neumann, J. and Morgenstern, O. (1947).
The Theory of Games and Economic Behaviour, 2nd ed. (1st ed. 1944). Princeton University Press, Princeton.

Wald, A. (1939). Contributions to the Theory of Statistical Estimation and Testing Hypotheses, The Annals of Mathematical Statistics, 10(4), 299-326.

Wald, A. (1950). Statistical Decision Functions. John Wiley, New York.

Walsh, J.E. (1969). Discrete Two Person Game Theory with Median Payoff Criterion, Opsearch, 6, 83-97; Errata, Opsearch, 6, 216.

Walsh, J.E. (1970). Generally Applicable Solutions for Two-Person Median Game Theory, Journal of the Operations Research Society of Japan, 13, 1-5.

Walsh, J.E. (1971). Description of Median Game Theory with Examples of Competitive and Median Competitive Games, Revue de Statistique-Tijdschrift voor Statistiek, 10(4), 2-13.

Chapter 2

Median Loss Analysis

2.1 Introduction

Classical statistical decision theory, first proposed by Wald (1939), can be viewed as the expected utility models of von Neumann and Morgenstern (1944, 1947) and of Savage (1954) in the Frequentist and Bayesian contexts, respectively. In the Frequentist expected utility model, von Neumann and Morgenstern (1944, 1947) identified two axioms, called the Independence axiom and the Continuity axiom, under which choices ranked by a preference relation could be represented by expected utility. This meant that the optimal action under uncertainty could be found by maximizing expected utility. In the Bayesian context, Savage (1954) used six axioms and again established an equivalence between ordering under a preference relation and expected utility, so that again optimal actions under uncertainty could be found by maximizing expected utility. Because expectation has so many appealing properties, maximizing expected utility became standard in many fields, from both the Bayesian and Frequentist standpoints. Despite this, expected utility as a criterion for decision making has been repeatedly criticized from several standpoints.
First, the Allais paradox, Allais (1953), calls the Independence axiom, and hence standard Frequentist decision theory, into question. Second, Ellsberg's paradox, Ellsberg (1961), shows that Savage's Sure-Thing Principle, the counterpart of the Independence axiom, also contradicts real-life decision making, thereby calling the standard Bayesian formulation into question. In addition, it is well known that, in both the Bayesian and Frequentist contexts, the decision rules depend delicately on the loss functions, which are generally unknown, leading to poor robustness properties. These findings, among others, have motivated the development of several utility models that are not based on expectation.

(A version of this chapter has been submitted for publication: Yu, C.W. and Clarke, B., Median Loss Analysis.)

One of the alternatives to expectation-based utility models for decision making under uncertainty is due to Manski (1988). He suggested using quantiles of the distribution of the utility as the main criterion in the Frequentist context. In fact, in many applications, quantile-based criteria are well established. For instance, one of the most popular measures of risk in finance is the Value-at-Risk, VaR, which is often taken to be a high quantile, say the 0.95-th or 0.99-th, of the negative of the utility, which can be regarded as the loss function here. Despite this, no axiomatic justification for Frequentist quantile optimization was offered by Manski (1988). This gap led Rostek (2007) to develop axioms under which a preference ordering over actions would be equivalent to the ordering of those actions under a quantile representation. Her work made use of Machina and Schmeidler (1992), which provided an axiomatization for non-expected utility models in a Bayesian setting. To the best of our knowledge, no one has provided an axiomatization in which quantile optimization is justified from a Frequentist perspective.
We conjecture this can be done, but do not do so here. We note, however, that our median loss minimizations provide a context in which the concept of least median of squares, LMS, a variant on quantile optimization, is natural. The LMS estimator was first introduced by Hampel (1975, page 380) and then developed by Rousseeuw (1984). It is known that the LMS estimator is robust, with a 50% breakdown point.

For a variety of reasons, it seems most reasonable to use quantile utility models which focus on the 0.5-th quantile, or median, instead of expected utility when interest focuses on the central tendency of the response. All of our results here are for this choice of quantile. This means that, in a predictive setting for instance, underprediction and overprediction can be avoided. Indeed, in estimation problems, the loss L(δ(X^n), θ) can be regarded as a measure of the adequacy of an action or estimator δ(X^n) for a parameter θ, where X^n = {Xi : i = 1, . . . , n} and each Xi is distributed according to Pθ. Moreover, in the Frequentist context, the loss L(δ(X^n), θ) is random for each θ and is typically (strongly) right-skewed. Accordingly, the median is a good measure of location for it, arguably more appropriate than its mean.

One of the benefits of optimizing quantiles rather than means is that moments no longer need to be assumed, so heavy-tailed distributions are not a problem. For instance, suppose that X follows the Cauchy distribution with density f(x) = [π(1 + (x − θ)²)]⁻¹ and we want to find an optimal estimator, using X, in the translation class of estimators C1 = {X + c : c ∈ R}, where R is the set of all real numbers. Consider the squared-error loss, i.e. L(X + c, θ) = (X + c − θ)². We can see that the conventional expected loss criterion does not permit a solution, because E(X + c − θ)² does not exist. The absolute error, L1, loss has the same problem. However, minimizing the median in these settings is feasible.
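The Cauchy translation example can be checked by simulation (a Python sketch of our own; under the stated model, a direct calculation gives the median of the loss (X + c − θ)² as 1 + c², so it is minimized at c = 0, whereas sample means of the loss never stabilize):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 0.0

# X ~ Cauchy(theta): the loss (X + c - theta)^2 has no finite expectation,
# so risk-based comparison of the estimators X + c in C1 breaks down.
x = rng.standard_cauchy(200_000) + theta

med_losses = []
for c in [0.0, 1.0, 3.0]:
    loss = (x + c - theta) ** 2
    med_losses.append(np.median(loss))
    # The sample mean is dominated by a few extreme draws and is unstable;
    # the sample median of the loss is stable, near 1 + c^2.
    print(c, loss.mean(), np.median(loss))

# The median loss is smallest at c = 0, recovering X itself as the
# medloss-optimal member of the translation class.
```

Rerunning with different seeds leaves the sample medians essentially unchanged while the sample means jump around wildly, which is the practical face of the nonexistent expectation.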
Moreover, even though the Cauchy itself may not be common in reality, problems with heavy tails show up long before moments cease to exist. For instance, optimality criteria based on expectations often exhibit a high sensitivity to outliers, like the least squares estimates in regression settings.

A separate issue from the sensitivity of the conventional criteria to outliers is their sensitivity to the loss. This is well known from routine problems comparing the squared-error loss to, for instance, a 0-1 or SCAD loss. A related problem is the difficulty in reliably specifying a loss function in practice. The consequence of this is that it is desirable to have the right degree of insensitivity to the loss function. In the Bayesian context, Rukhin (1978) shows that a reasonable degree of robustness holds for even loss functions when the posterior density is symmetric and unimodal, under the assumption that the posterior expected loss exists and is finite. He also proves that under some regularity conditions the Bayes estimators are independent of the loss functions, but this is a very special case. Furthermore, the conventional alternatives to the risk for assessing sensitivity to the loss, also called loss robustness, are the Pitman measure of closeness, PMC, Pitman (1937), universal domination (u.d.) and stochastic domination (s.d.); see Hwang (1985). However, each of these can be criticized as well: the PMC is not transitive, and u.d. and s.d. are hard to implement in practice.

The rest of this chapter is organized as follows. In Section 2 we introduce Rostek's axiomatic foundation of the quantile utility model. In contrast with the usual criteria for a comparison of estimates, we propose a median loss (medloss) approach in Section 3. Section 4 focuses on the local properties of medloss estimation. For instance, we find the optimal medloss estimator within the classes of translation, scale and median-unbiased estimators.
In Section 5, we consider the global properties by dropping the restriction to the classes of estimators in Section 4 and then using some weak optimality criteria like median-admissibility and median-minimaxity, so that all estimators can be taken into account. In Section 6, we discuss some properties of the medloss estimator with respect to the posterior distribution and provide computational results. Since changing the optimality criterion invites us to replace the usual expectation-based methods with median-based methods systematically, in Section 7 we define a median analogue of cross validation and give a simulation study for model selection to show that the medloss approach outperforms the expected-loss approach outside the normal error case. In Section 8, we summarize our work.

2.2 Quantile axiomatic foundation

In this section, we present the motivation for minimizing the median loss (hereafter medloss) from the standpoint of utility theory. We also argue that medloss analysis is a natural variation of the usual expected loss analysis, which has dominated statistics and other related areas. In utility theory, a preference relation for choice behavior can be represented by a utility function u(z). If we consider a set of outcomes Z, then a utility function is a function that assigns a numerical value to z ∈ Z preserving the rank ordering of the outcomes. More formally,

Definition 1. A function u : Z → R is a utility function representing preference relation ≽ if and only if for all x, y ∈ Z, x ≽ y ⟺ u(x) ≥ u(y), where x ≽ y means that x is preferred to or equivalent to y.

Similarly, we can also define the strict preference relation ≻, in which x ≻ y means that x is preferred over y. To model choice behavior under uncertainty, we consider probability distributions (also called lotteries) p ∈ P over the outcomes, where P is the set of all lotteries over Z.
Here we remark that although Z and P look like very different objects, we can actually think of Z as a subset of P. This is so because the lottery that puts probability 1 on a particular outcome z ∈ Z is indistinguishable from the outcome z itself. The consequence is that the preference ordering on outcomes, and its representation by a utility, extends to an ordering on the elements of P. For instance, suppose there are three outcomes {x, y, z}, and two lotteries defined by p1 = [1, 0, 0] and p2 = [0, 1, 0], where p1 = [1, 0, 0] means that we assign x probability 1 and y and z probability 0, and similarly for p2. Then x ≽ y, or u(x) ≥ u(y), can be interpreted to mean that p1 is preferred to p2. More generally, we want to use utility functions to compare lotteries like p3 = [0.1, 0.4, 0.5] and p4 = [0.2, 0.5, 0.3] over {x, y, z}. In fact, there are many ways to construct an agent's preferences on lotteries from his preferences on the outcomes. The best known and arguably most useful one is based on the theory of expected utility maximization of von Neumann and Morgenstern (vNM).

2.2.1 Classical expected-utility models

von Neumann and Morgenstern (1944, 1947) set a rational foundation for decision-making under uncertainty according to Expected Utility under the following two axioms.

Independence Axiom: For all p, q, r ∈ P and 0 < λ ≤ 1, if p ≻ q, then λp + (1 − λ)r ≻ λq + (1 − λ)r.

Continuity Axiom: For all p, q, r ∈ P, if p ≽ q ≽ r, then there exists a number λ ∈ [0, 1] such that λp + (1 − λ)r ∼ q.

Roughly speaking, vNM's Expected Utility theorem says that there exists a utility function such that an agent's preference over lotteries can be represented by the expectation of this utility with respect to the lotteries iff the preference relation satisfies both axioms. Since the lottery is an objective probability in vNM's Expected Utility theory, we call it Frequentist Expected Utility theory.
The vNM Expected Utility theorem can be found in Appendix A.1. By contrast, Savage developed an Expected Utility theory with subjective probability for the Bayesian context. He assumed that preferences over acts were based on subjective probability and utility, and he proposed a set of axioms, P1-P6, shown in Appendix A.1, to construct his Expected Utility model. One of these axioms is called the Sure-Thing Principle, the counterpart of vNM's Independence Axiom. Basically, Savage's Sure-Thing Principle (axiom P2 in Appendix A.1) states that in a comparison of two acts that lead to the same consequences in some set of states, the decision maker restricts his attention to those states in which the consequences are different (see Hens, 1992). A more detailed interpretation of these axioms can be found in Machina and Schmeidler (1992). Under this model, Savage established an Expected-Utility representation for choice behavior like vNM's Expected Utility theorem but using subjective probability; see Appendix A.1.

2.2.2 Problems with expected utility

In vNM's objective and Savage's subjective expected utility models, the Independence Axiom and the Sure-Thing Principle, respectively, are crucial for the representability of preferences over lotteries by an expected utility. However, there are several types of systematic violations of the expected utility model. The best-known violations of vNM's Independence Axiom and Savage's Sure-Thing Principle are the Allais Paradox, Allais (1953), and the Ellsberg Paradox, Ellsberg (1961), respectively, which suggest that the assumptions made in conventional expected utility theory contradict real-life decisions, as shown in Appendix A.2. In other words, these paradoxes tell us that the expected utility representation of our preferences is no longer valid.
These findings have led to the development of several non-expected utility, or non-linear utility, models as alternatives to expected utility. According to Manski (1988), any pattern of behavior consistent with the existence of a preference ordering on the space of probability measures should be considered 'rational' if actions are characterized by probability measures of outcomes. This means that expected utility maximization is not central to the study of rational behavior under uncertainty. He also proved that there exists an 'ordinal utility model' of rational behavior under uncertainty whose predictions are invariant under ordinal transformations of utility.

2.2.3 Quantile utility

Advocating an ordinal-utility approach to modeling choice under uncertainty, Manski (1988) examined an agent's preferences by maximizing a quantile of the distribution of utility. Unfortunately, he did not provide axioms giving necessary and sufficient conditions for preferences to admit a quantile representation. However, Machina and Schmeidler (1992) did provide a foundation for non-expected utility models by modifying Savage's axioms. Specifically, in the notation of Appendices A.1 and A.3, they removed the Sure-Thing Principle P2 and replaced P4 by P4^Q. However, this new axiomatic system still does not fit the quantile utility model because, as Rostek (2007) points out, Savage's axiom P6 is too restrictive for the quantile representation. Consequently, Rostek (2007) axiomatized quantile maximization for decision making in Savage's setting by combining the results of Manski (1988) and Machina and Schmeidler (1992). More precisely, Rostek provides an axiomatization for quantile utility models in a Savage-type setting by removing P2-P4, retaining P1 and P5, modifying P6 to P6^Q, and adding P3^Q and P4^Q, where the superscript "Q" indicates the axioms added for the quantile utility model.
The relationship between Rostek's axioms and Savage's axioms is discussed briefly in Appendix A.3; see also Machina and Schmeidler (1992). Now we can state Rostek's Quantile Utility Theorem as follows.

Theorem 1 (Rostek, 2007). Let S be the set of states of nature. Then a preference relation over the set of Savage's finite-outcome acts F = {f | f : S → Z and |f(S)| < ∞} satisfies axioms P1, P3^Q, P4^Q, P5 and P6^Q if and only if there exist a number τ ∈ [0, 1], a subjective probability measure π on S and a utility function u on the set of outcomes Z such that

f ≽ g ⇐⇒ Q_f(τ) ≥ Q_g(τ), ∀ f, g ∈ F, (2.1)

where Q_f(τ) = inf{t ∈ R | π[(u ◦ f)(s) ≤ t] ≥ τ}. In addition, u is unique up to strictly increasing transformations.

That is, taking preferences over acts as a primitive, Rostek (2007) finds conditions that are necessary and sufficient for those preferences to admit a quantile representation. Clearly, a quantile utility model is an ordinal framework. Rostek builds on her theorem by deriving probability measures representing subjective beliefs that are unique and independent of τ ∈ (0, 1). She also gives a uniqueness result for τ under hypotheses on the preferences over probabilities induced by the preference relation ≽. Since considerably weaker axioms are used in quantile utility models than in expected utility models, quantile utility models have properties such as robustness to the choice of utility function and ordinality. Quantile utility models do not require moment restrictions, so they are resistant to outliers, unlike conventional expected utility models. Observe that the τ-th quantile of the distribution of the utility u(z) is the action obtained from minimizing the risk under a linear loss.
That is, the τ-th quantile achieves the minimum of

(1 − τ) ∫_{−∞}^{u(z*)} [u(z*) − u(z)] dΠ(z) + τ ∫_{u(z*)}^{∞} [u(z) − u(z*)] dΠ(z), (2.2)

where Π is the probability associated with the density π which assigns mass to u ◦ f as a random variable on Z, as in the definition of Q_f. In the notation of (2.2), z = f(s) and z* = f(s*), and the minimization in (2.2) is over u(z*). Analogously, the expected utility can be derived as the optimal action under a risk criterion defined from the squared-error loss. Thinking of the criterion itself (quantile of loss or expectation of loss) as emerging from a higher-level decision theory problem with its own utility function is unusual, but central to Rostek's arguments. It is well known that quantiles are less affected by outliers than moments are. Consequently, since the τ-th quantile of the loss solves the minimization problem in (2.2), it is typically less sensitive to outliers than the mean, which would solve the corresponding minimization problem with quadratic loss. In particular, therefore, solutions to the subsequent minimization of the τ-th quantile will generally be less affected by outliers, heavy tails and asymmetry than solutions to conventional risk minimization problems. It is useful to think of the minimizer of (2.2) as an agent's prediction for the next outcome from the lottery. The lower τ is, the more the agent is concerned about underpredictions relative to overpredictions; hence, the more she cares about the lower-tail outcomes relative to the upper-tail outcomes. Similarly, higher τ corresponds to the reverse. The absolute-error loss is intermediate between these two extremes and corresponds to the median, where underprediction is as harmful as overprediction. The selection of an appropriate τ depends on the application. In finance and economics, τ = 0.95 or 0.99 is common as a way to protect an agent from rare but catastrophic events.
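As a quick numerical sanity check (ours, not part of the thesis), the following sketch verifies that minimizing an empirical version of criterion (2.2) over candidate actions recovers the τ-th quantile of the utility distribution; the Exp(1) lottery and the grid of actions are invented purely for illustration.

```python
import random
import statistics

random.seed(0)

# Invented lottery: utility values u(z) drawn from an Exp(1) distribution.
utilities = [random.expovariate(1.0) for _ in range(2000)]

def criterion(a, tau, us):
    """Empirical version of (2.2): (1-tau)*E[(a-U)+] + tau*E[(U-a)+]."""
    return sum((1 - tau) * (a - u) if u <= a else tau * (u - a) for u in us) / len(us)

tau = 0.5
grid = [i * 0.01 for i in range(0, 300)]   # candidate actions in [0, 3)
best = min(grid, key=lambda a: criterion(a, tau, utilities))

# For tau = 0.5 the minimizer should sit at the sample median of the utilities.
print(best, statistics.median(utilities))
```

With τ = 0.5 the two printed numbers agree up to the grid resolution; other values of τ pick out the corresponding sample quantile instead.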
However, in this paper we focus on central tendencies since our interest is predictive. This leads us to choose τ = 0.5 and use the median of the loss as our optimality criterion. This choice also gives the sort of robustness most important for generic prediction problems, and the median, as a recognized measure of central tendency, is the most suitable quantile to use in place of the mean. Note that we are not saying that median loss models are always better than expected loss models, or vice versa. Indeed, the relative usefulness of the two types of optimization depends on the application. We suggest that median loss approaches are better for most situations potentially involving outliers, asymmetry, heavy tails or the possibility that the data generator is more dispersed than the model.

2.3 Statistical decision theory under median loss

The most important use of expected utility maximization in statistics is to assess the goodness of estimators. An estimator is said to be good if its expected loss is small. In this section, we discuss some limitations of the conventional statistical approaches for the comparison of estimators. Then we introduce our medloss approach and verify that it does not have these limitations. Throughout this paper, we denote the set of real numbers by R and a sequence of n random variables by X^n = {X_1, ..., X_n}, with realization x^n = {x_1, ..., x_n}. If a random variable X has distribution F, we denote its median by med X, med(X), or med F. For the median of a function of X, say g(X), we write med_X g(X) or med_F g(X).

2.3.1 Usual criteria for comparing estimates

Since comparisons of estimates under the expected loss criterion are so sensitive to the choice of loss, it is important to study the criterion under a class of loss functions. To do this, three criteria have been used: Pitman's measure of closeness, universal domination and stochastic domination.
First we briefly review them and discuss their various problems; second, we propose a Frequentist median-loss (medloss) criterion for comparing estimators; and third, we give examples of estimators which achieve optimality in the medloss sense. We proceed by grouping the two domination criteria together.

1. Universal domination (u.d.) and stochastic domination (s.d.): Universal domination is a criterion introduced by Hwang (1985). It takes into account the set of all non-decreasing loss functions L of ‖δ(x^n) − θ‖_Q, where ‖z‖²_Q = zᵀQz is the generalized Euclidean norm with respect to a user-specified positive-definite matrix Q. An estimator δ1(X^n) universally dominates another estimator δ2(X^n) if, for every L and for every θ,

E[L(‖δ1(X^n) − θ‖_Q)] ≤ E[L(‖δ2(X^n) − θ‖_Q)]. (2.3)

Additionally, Hwang (1985) showed that u.d. is equivalent to another criterion called stochastic domination, which compares estimators by the stochastic ordering of their generalized Euclidean distances to the true parameter. An estimator δ1(X^n) stochastically dominates another estimator δ2(X^n) under the generalized Euclidean norm with respect to a user-specified positive-definite matrix Q if, for every c > 0 and for every θ,

P_θ[L(‖δ1(X^n) − θ‖_Q) ≤ c] ≥ P_θ[L(‖δ2(X^n) − θ‖_Q) ≤ c]. (2.4)

The main problem with these criteria is that they are extremely difficult to satisfy. This is so because the expectation operator is only invariant with respect to linear operators, whereas our medloss criterion is invariant with respect to strictly increasing transformations, a much larger class. The implication of these two forms of invariance is that the expectation is much less stable than the medloss and so is exquisitely sensitive to small changes in the sample or parameter spaces, or in the loss function.

2.
Pitman's measure of closeness (PMC): To compare two estimators δ1(X^n) and δ2(X^n) of θ, Pitman (1937) suggested examining the distribution of their closeness to θ. An estimator δ1(X^n) is Pitman-closer to θ than an estimator δ2(X^n) if, for every θ,

P_θ[‖δ1(X^n) − θ‖_Q ≤ ‖δ2(X^n) − θ‖_Q] > 0.5, or P_θ[L(‖δ1(X^n) − θ‖_Q) ≤ L(‖δ2(X^n) − θ‖_Q)] > 0.5. (2.5)

Although PMC overcomes two problems that the risk encounters, namely infinite moments and sensitivity to the loss function, flaws remain. The most severe criticism of PMC is its intransitivity; see Robert, Hwang and Strawderman (1993). Intransitivity makes PMC contradict the "rational behavior axioms" underlying utility theory. Therefore PMC is not logically consistent with decision theory. Casella and Wells (1993) argue that this is such an overwhelming deficiency that PMC should be eliminated from consideration. A separate criticism is that PMC depends on the joint distribution of the estimators being compared. Indeed, Robert, Hwang and Strawderman (1993) point out that this property contradicts Savage's principle (Savage, 1954) that no reasonable criterion should separate two estimators with the same marginal distribution; see Casella and Wells (1993). On the other hand, Keating, Mason and Sen (1993) provide an example based on the Cauchy distribution to argue that the PMC criterion is better than the risk criterion. Since the Cauchy distribution has no finite moments of order ≥ 1, the MSE criterion fails to compare the two estimators. By contrast, in the sense of PMC, δ2(X1, X2) = (X1 + X2)/2 is better than δ1(X1, X2) = X1, where X1 and X2 are independent and follow the same Cauchy distribution with density f(x) = [π(1 + (x − θ)²)]^{−1} and θ is an unknown location parameter. It may be that dependence on the joint distribution does not actually matter.
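The Keating, Mason and Sen (1993) comparison for the Cauchy can be checked by a short Monte Carlo sketch (ours, not the thesis'): it estimates the Pitman probability P_θ(|δ2 − θ| ≤ |δ1 − θ|) and finds it comfortably above 0.5.

```python
import math
import random

random.seed(1)
theta = 0.0          # true location; WLOG by translation invariance
n_trials = 50_000

wins = 0
for _ in range(n_trials):
    # Standard Cauchy draws shifted by theta, via the inverse-CDF transform.
    x1 = theta + math.tan(math.pi * (random.random() - 0.5))
    x2 = theta + math.tan(math.pi * (random.random() - 0.5))
    delta1, delta2 = x1, (x1 + x2) / 2
    if abs(delta2 - theta) <= abs(delta1 - theta):
        wins += 1

pitman_prob = wins / n_trials
print(pitman_prob)   # clearly above 0.5, so delta2 is Pitman-closer
```

Note that the estimated probability exceeds 0.5 even though, as discussed next, δ1 and δ2 have the same marginal distribution; the PMC verdict is driven entirely by their joint distribution.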
For instance, Blyth (1986) showed that δ1(X1, X2) and δ2(X1, X2) have the same marginal distribution, so in practice there is no point in saying one is better than the other. In aggregate, it appears that PMC is interesting but inconclusive. Indeed, under our median loss criterion, neither δ1(X1, X2) nor δ2(X1, X2) dominates the other. In the next subsections, we show how the median loss can replace these usual criteria, giving a more useful way to compare estimators.

2.3.2 Median loss alternative

To satisfy Savage's requirement that the evaluation criterion depend only on the marginal distribution of the estimator in the PMC case, to weaken the conditions for u.d. and s.d., and thereby to robustify the expectation operator in the risk, we define our proposed Frequentist medloss criterion as follows.

Definition 2. An estimator δ1(X^n) is better than an estimator δ2(X^n) in the sense of median loss (medloss) iff

P_θ( L(δ1(X^n), θ) ≤ med_{X^n} L(δ2(X^n), θ) ) ≥ 0.5, (2.6)

for all θ ∈ Θ, which is equivalent to

med_{X^n} L(δ1(X^n), θ) ≤ med_{X^n} L(δ2(X^n), θ), ∀ θ ∈ Θ, (2.7)

where Θ is the parameter space and med_{X^n} L(δ(X^n), θ) denotes the median of the loss L(δ(X^n), θ). In particular, for any strictly increasing loss function L of ‖δ(X^n) − θ‖_Q, δ1(X^n) is better than δ2(X^n) in the sense of medloss iff

med_{X^n} L(‖δ1(X^n) − θ‖_Q) ≤ med_{X^n} L(‖δ2(X^n) − θ‖_Q), ∀ θ ∈ Θ. (2.8)

We notice that the medloss criterion (2.8) replaces L(‖δ2(X^n) − θ‖_Q) and L(‖δ1(X^n) − θ‖_Q) in the PMC criterion (2.5) with their medians. This changes the joint-distribution problem into a marginal-distribution problem because the random quantity L(‖δ(X^n) − θ‖_Q) is replaced by its median. Moreover, the expectation operator in u.d. and in the risk is replaced by the median operator. Now the invariance of the median of the loss under strictly increasing transformations makes the medloss criterion easier to satisfy.
Replacing L(‖δ(X^n) − θ‖_Q) by its median also implies that c on the left in (2.4) is to be replaced by med_{X^n} L(‖δ1(X^n) − θ‖_Q); alternatively, on the right, c is to be replaced by med_{X^n} L(‖δ2(X^n) − θ‖_Q). Replacing c in this way on both sides gives a trivial relation. Similarly, we also have medloss criteria in the Bayesian context. These are two different principles: the first is with respect to the prior and the second with respect to the posterior.

Definition 3. Let π(·) be the prior density of θ. Define the Bayes medloss of an estimator δ with a specific prior π by

m_π(δ) = med_π [ med_{X^n} L(δ(X^n), θ) ],

where med_π [ med_{X^n} L(δ(X^n), θ) ] means the median of the medloss med_{X^n} L(δ(X^n), θ) with respect to the prior π. Thus, given a specific prior π, an estimator δ1 is better than an estimator δ2 in the sense of Bayes medloss iff

m_π(δ1) ≤ m_π(δ2). (2.9)

The estimate which minimizes the Bayes medloss is called the Bayes medloss estimator. The following is another Bayesian medloss criterion, based on the posterior distribution.

Definition 4. An estimator δ1(x^n) is better than an estimator δ2(x^n) in the sense of posterior medloss iff

med_{π(Θ|x^n)} L(Θ, δ1(x^n)) ≤ med_{π(Θ|x^n)} L(Θ, δ2(x^n)), (2.10)

where med_{π(Θ|x^n)} L(Θ, δ(x^n)) means the median of the loss L(Θ, δ(x^n)) with respect to the posterior distribution π(Θ|x^n) of Θ given x^n, also called the posterior medloss. The estimate which minimizes the posterior medloss is called the posterior medloss estimator; it can be interpreted as the mid-point of the smallest interval on which the posterior probability is 1/2. In particular, if the posterior density is symmetric, then the posterior medloss estimator is the mid-point of the 50% highest posterior density interval, i.e. the posterior median; a rigorous proof of this result is provided in Section 2.6.2.
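The interpretation of the posterior medloss estimate as the mid-point of the smallest interval carrying posterior probability 1/2 suggests a simple sample-based computation. The following sketch is our illustration only (the function name and the Gaussian "posterior" are invented): it approximates the estimate from posterior draws, and for a symmetric unimodal posterior it should land near the posterior median.

```python
import random

random.seed(2)

def posterior_medloss_estimate(samples, coverage=0.5):
    """Mid-point of the shortest interval containing `coverage` of the draws."""
    xs = sorted(samples)
    n = len(xs)
    k = int(coverage * n)                      # number of order-statistic gaps spanned
    # Slide a window over the order statistics and keep the shortest span.
    i = min(range(n - k), key=lambda j: xs[j + k] - xs[j])
    return (xs[i] + xs[i + k]) / 2

# Symmetric, unimodal 'posterior': N(1, 2^2); the estimate should be near 1.
draws = [random.gauss(1.0, 2.0) for _ in range(20_000)]
est = posterior_medloss_estimate(draws)
print(est)
```

This shortest-interval midpoint is closely related to the "shorth"-type location estimators, so its sampling fluctuations are larger than those of the plain sample median; the tolerance in any check should be generous.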
In the following, we use the Cauchy distribution to give a sense in which our medloss approach is more reasonable than the expected loss approach. We use a translation class of estimates with respect to the squared-error loss, a setting in which the expected-loss approach fails outright.

Example: Suppose that X1 and X2 are independent and both follow the same Cauchy distribution with parameter θ, with density f(x) = 1/[π(1 + (x − θ)²)]. Consider two estimators δ1(X1, X2) = X1 and δ2(X1, X2) = (X1 + X2)/2. We have shown that these estimators are equivalent under the medloss criterion. However, if we restrict the class of estimators, we can identify an optimal estimator. So, consider the translation class C1 = {X + c, ∀ c ∈ R} and the squared-error loss L2, i.e. L(X + c, θ) = (X + c − θ)². Finding an optimal estimator under the usual risk is impossible because E(X + c − θ)² does not exist. In general, the risk is ineffective for problems involving heavy-tailed distributions. By contrast, the medloss imposes no moment conditions and can cover many heavy-tailed distributions, like the Cauchy. To solve the problem for the Cauchy distribution, it is enough to find c̃ to minimize the median of the distribution of (X + c − θ)². Recall that W = X − θ follows the standard Cauchy distribution with density f(w) = 1/[π(1 + w²)], which is symmetric around its mode at 0. Let F_{(W+c)²}(·) denote the distribution function of (W + c)². We have, for all y ≥ 0,

F_{(W+c)²}(y) = P((W + c)² ≤ y) = P(−√y − c ≤ W ≤ √y − c).

Thus c̃ = arg max_c P(−√y − c ≤ W ≤ √y − c) for all y ≥ 0, i.e. c̃ = 0. In other words, for the Cauchy distribution, X is the uniformly optimal medloss estimator in the translation class under the squared-error loss. This remains true for any Lp loss with p > 0. The medloss criterion has other nice properties the PMC has, in addition to giving solutions without requiring moments.
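Before moving on, the conclusion c̃ = 0 in the Cauchy example above can also be seen by simulation. In this sketch (ours, not from the thesis), we approximate med (W + c)² from standard-Cauchy draws over a coarse grid of c values:

```python
import math
import random
import statistics

random.seed(3)
# Standard Cauchy errors W = X - theta via the inverse-CDF transform.
w = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(40_000)]

def medloss(c):
    """Monte Carlo estimate of the median of (W + c)^2."""
    return statistics.median((wi + c) ** 2 for wi in w)

grid = [i / 2 for i in range(-4, 5)]    # c in {-2.0, -1.5, ..., 2.0}
best_c = min(grid, key=medloss)
print(best_c)   # 0.0: the translation X + 0 = X minimizes the medloss
```

The coarse grid keeps the medloss differences between neighboring c values well above the Monte Carlo noise in the sample medians.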
For instance, Keating, Mason and Sen (1993) identify robustness against the loss as an important desideratum. Risk-based criteria usually do not have this robustness, but the PMC does, as does the medloss. In particular, Keating, Mason and Sen (1993) show that the PMC is invariant under powers of the L1 loss, and in Theorem 2 below we show this also holds for the medloss. This invariance dramatizes how the medloss focuses on the ranking of estimators rather than on evaluating how much one estimator is preferred to another.

2.4 Local properties of medloss estimation

As with the risk, we want to find an optimal estimator δ which minimizes the medloss, med_{X^n} L(δ(X^n), θ), at every value of θ. However, such an estimator does not exist in general. So, what we do is restrict the class of estimators to be considered by an impartiality requirement such as median-unbiasedness, just as mean-unbiasedness is used for the usual risk. In what follows, we first present four examples comparing the best medloss estimator to the best expected-loss estimator in the translation and scale classes of estimators. Then we discuss estimation problems under the medloss in the class of median-unbiased estimators.

2.4.1 Estimation using the translation and scale classes

The first example is for the normal mean using the translation and scale classes; the second considers the exponential mean using the scale class; the third looks at the uniform mean using the translation class. All three examples use the squared-error loss. For the fourth example, we use the 0-1 loss for a normal mean and the translation class. The best medloss estimators are unique in the first two examples but not in the last two. However, the sets of non-unique medloss estimators include the best expected-loss estimators as special cases. Suppose {Xi, i = 1, 2, ..., n} are IID samples from N(θ, 1).
Define the translation and scale classes {(1/n) Σ_{i=1}^n Xi + c, ∀ c ∈ R} and {(c/n) Σ_{i=1}^n Xi, ∀ c ∈ R \ {0}}, respectively. The problem can be reduced to a one-sample case by letting X̄ = (1/n) Σ_{i=1}^n Xi, which follows N(θ, 1/n). Since we are only interested in θ, and the constant variance for fixed n does not affect finding the best estimates, we let n = 1 in the following.

(ia) Translation class

Suppose X ∼ N(θ, 1). We find (A) the best medloss estimator and (B) the best expected-loss estimator for the unknown normal mean parameter θ in the translation class C1 = {δ_c^1(X) = X + c, ∀ c ∈ R} under L2. Beginning with (A), we have

med_X L2(X + c, θ) = med_X (θ − (X + c))², (2.11)

so the minimum over c ∈ R can be found as follows. Since (X + c) − θ ∼ N(c, 1), we see that

((X + c) − θ)² ∼ χ²_{1,α(c)}, (2.12)

where χ²_{1,α(c)} denotes the non-central χ² distribution with 1 degree of freedom and non-centrality parameter α = α(c) = |c|. Letting F_{1,α} be the distribution function of χ²_{1,α}, we have

α1 > α2 ⇒ F_{1,α1} ≤ F_{1,α2} ⇒ med F_{1,α1} ≥ med F_{1,α2}. (2.13)

In other words, the best medloss estimator for θ minimizes the non-centrality parameter α. So we have, ∀ c ∈ R and ∀ θ ∈ Θ,

med_X (θ − X)² ≤ med_X (θ − (X + c))², (2.14)

which gives that X is the (unique) best estimate for the normal mean θ in C1 under median loss with L2. The uniqueness follows because 0 is the unique minimizer of α(c) = |c| ≥ 0. Next, turning to (B), since

E L2(X + c, θ) = E(θ − (X + c))² = 1 + c², (2.15)

it suffices to find the best expected-loss estimator for θ by minimizing 1 + c² over c ∈ R. In other words, for all θ, c* = 0 = arg min_c E L2(θ, X + c). Thus, ∀ c ∈ R and ∀ θ ∈ Θ, E(θ − X)² ≤ E(θ − (X + c))², which gives that X is also the unique best estimate for θ in the translation class C1 based on the expected loss with L2.

(ib) Scale class

Here we do the same thing as before, but with C2 = {δ_c^2(X) = cX, ∀ c ∈ R \ {0}}.
Now, for part (A), the median loss is

med_X L2(cX, θ) = med_X (θ − cX)² (2.16)

and the minimum over c ∈ R \ {0} can be found as follows. We see that if X ∼ N(θ, 1), then

(cX − θ)²/c² ∼ χ²_{1,α(c)}, (2.17)

where α(c)² = (1 − 1/c)² θ². However, in C2 we cannot find a uniformly best estimate for θ. That is, there does not exist c1 such that med_X (θ − c1 X)² is less than the median loss for any other c over all θ. The reason is that the non-centrality parameter α(c) depends on the unknown parameter θ. However, we can still find the locally best medloss estimator for θ. It turns out that minimizing the expected loss E(θ − cX)² suffers the same problem. That is, we can only find the best estimate for θ locally because the expected loss E(θ − cX)² = c² + (1 − c)² θ² depends on the unknown parameter θ. Our results are summarized in the following.

Proposition 1. In the normal mean problem with the scale class, we have:
(i) For c² ≥ 1, ∀ θ ∈ Θ, med_X (θ − X)² ≤ med_X (θ − cX)².
(ii) For c ≥ 1 or c ≤ (θ² − 1)/(θ² + 1), ∀ θ ∈ Θ, E(θ − X)² ≤ E(θ − cX)².
(iii) For (θ² − 1)/(θ² + 1) < c < 1, E(θ − X)² > E(θ − cX)².

We remark that for the medloss with 0 < c² < 1, med_X (θ − X)² > med_X (θ − cX)² when θ = 0. However, if θ ≠ 0, then we cannot find a closed-form solution to settle the question of which of med_X (θ − X)² and med_X (θ − cX)² is bigger. If c² ≥ 1, then c ≥ 1 or c ≤ −1; both cases satisfy condition (ii) in Proposition 1, so for all θ, E(θ − X)² ≤ E(θ − cX)². On the other hand, if 0 < c² < 1, then −1 < c < 1, which is equivalent to condition (iii) in Proposition 1 when θ = 0. Thus, for θ = 0 and 0 < c² < 1, we have E(θ − X)² > E(θ − cX)². However, if θ ≠ 0, we cannot compare them, since (θ² − 1)/(θ² + 1) depends on the unknown parameter θ.
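Proposition 1(i) can be spot-checked by Monte Carlo. The sketch below is ours (the grid of c and θ values is arbitrary): it estimates med_X (θ − cX)² from simulated draws and confirms that c = 1 does at least as well whenever c² ≥ 1.

```python
import random
import statistics

random.seed(4)

def medloss(c, theta, n=40_000):
    """Monte Carlo estimate of med (theta - c X)^2 for X ~ N(theta, 1)."""
    return statistics.median(
        (theta - c * (theta + random.gauss(0, 1))) ** 2 for _ in range(n)
    )

# Proposition 1(i): for c^2 >= 1, X beats cX under the medloss, for all theta.
ok = True
for theta in (0.0, 1.0, 3.0):
    base = medloss(1.0, theta)
    for c in (-2.0, -1.5, 1.5, 2.0):
        ok = ok and base <= medloss(c, theta) + 0.05   # small MC tolerance
print(ok)
```

The small additive tolerance absorbs simulation noise in the sample medians; the inequality itself holds exactly at the population level.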
When the L2 loss is replaced by the L1 loss in this example, it can be verified that the minimum expected-loss and medloss procedures continue to give essentially the same results.

(iia) Results for the exponential distribution with median and expected losses

In the previous examples, the medloss and expected-loss criteria gave the same solutions. However, in the case of an exponential distribution for X, the estimators from the median loss and the expected loss criteria are very different. Indeed, when n = 1, we will see that the best expected-loss estimator for the exponential mean λ is 0.5X, while the best medloss estimator is approximately 0.85X. It is well known that the UMVUE for λ is X. Thus, the best medloss estimator 0.85X is a tradeoff between the efficient estimator and the best expected-loss estimator. Suppose that X ∼ Exp(λ) with pdf f(x|λ) = λ^{−1} e^{−x/λ}, where x ≥ 0 and λ > 0. Consider the situation with L2(δ(x), λ) = (λ − δ(x))² in the scale class {δ_c(X) = cX, ∀ c ∈ R+}. Here we fix the support of δ_c(X) to be the same as that of X, i.e. R+ ∪ {0}. The expected loss E L2(δ_c(X), λ) on this class is E(λ − cX)² = (c² + (c − 1)²)λ², a convex function of c with unique minimizer c* = 1/2 for all λ. Thus, X/2 is the unique best estimate for the exponential mean λ under L2 in the scale class. Now we consider med_X L2(δ_c(X), λ). Let Y_c = (λ − cX)² and let m_c be the median of Y_c, i.e. m_c = med_X (λ − cX)². Again we want to find c̃ such that m_c̃ < m_c for any c > 0 and all λ > 0. Letting W = cX ∼ Exp(cλ), we have

1/2 = P(Y_c ≤ m_c) = P(λ − √m_c ≤ W ≤ λ + √m_c),

which equals ∫_{λ−√m_c}^{λ+√m_c} (cλ)^{−1} e^{−w/(cλ)} dw for λ − √m_c > 0, and ∫_{0}^{λ+√m_c} (cλ)^{−1} e^{−w/(cλ)} dw for λ − √m_c ≤ 0. Evaluating the integrals gives

1/2 = e^{(√m_c − λ)/(cλ)} − e^{(−√m_c − λ)/(cλ)}, for m_c < λ², and 1/2 = 1 − e^{(−√m_c − λ)/(cλ)}, for m_c ≥ λ². (2.18)

The form of m_c can be found by examining two cases.
Case 1: For m_c < λ², let Δ = e^{√m_c/(cλ)} > 0, so that 0.5 e^{1/c} = Δ − Δ^{−1}; that is, we solve for Δ from Δ² − 0.5 e^{1/c} Δ − 1 = 0. By some algebra, we get

Δ = (e^{1/c} + √(e^{2/c} + 16))/4,

from which we can derive m_c = A(c)λ², where A(c) = c² {ln[(e^{1/c} + √(e^{2/c} + 16))/4]}².

Case 2: For m_c ≥ λ², we can derive m_c = B(c)λ², where B(c) = (c ln 2 − 1)², by a technique similar to Case 1.

From Case 1 we get 0 < A(c) < 1, and from Case 2 we get B(c) ≥ 1. It can be verified computationally that A(c) and B(c) cross each other at c0 = 2/ln 2 ≈ 2.8854 and that A(c0) = 1 = B(c0). Thus we have the following expression for m_c:

m_c = [A(c) I(0 < c ≤ c0) + B(c) I(c > c0)] λ². (2.19)

The minimizer of m_c in (2.19) is c̃ = arg min_c m_c. Approximating c̃ numerically gives c̃ ≈ 0.8498. Clearly c̃ ≠ c* = 1/2. To conclude, we observe that both A(c) and B(c) are independent of λ and hence so is c̃. Thus c̃X is uniformly optimal for λ under L2 in the scale class {δ_c(X) = cX, ∀ c > 0}. Therefore,

med_X (λ − c̃X)² ≤ med_X (λ − cX)², for any c > 0 and all λ > 0.

(iib) Extension to the n-data-point case for the exponential distribution

Consider the scale class C_n = {cX̄ : ∀ c ∈ R+}, where X̄ = (1/n) Σ_{i=1}^n Xi and {Xi, i = 1, ..., n} is a sequence of IID random variables from Exp(λ) with mean λ. Analogous to the one-sample case, the best expected-loss estimate is obtained by minimizing E(cX̄ − λ)² = [(c − 1)² + c²/n]λ² over c; it is c*X̄, where c* here is n/(n + 1). Similarly, in the n-data case for the medloss, we have Y_c = (cX̄ − λ)². Letting m_c be the median of Y_c and W_c = cX̄ ∼ Ga(n, cλ/n), we have

1/2 = P(Y_c ≤ m_c) = P(λ − √m_c ≤ W_c ≤ λ + √m_c) = P( 2n(λ − √m_c) ≤ cλ X_{2n} ≤ 2n(λ + √m_c) ),

where X_{2n} denotes a random variable with a χ² distribution with 2n degrees of freedom. Unlike the case n = 1, we cannot find an explicit form for m_c in terms of c.
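For the one-sample case, where A(c) and B(c) are explicit, the value c̃ ≈ 0.8498 quoted above can be reproduced by a direct grid search; a small sketch (ours):

```python
import math

def A(c):
    """Case 1 (m_c < lambda^2): normalized medloss, m_c = A(c) * lambda^2."""
    delta = (math.exp(1 / c) + math.sqrt(math.exp(2 / c) + 16)) / 4
    return (c * math.log(delta)) ** 2

def B(c):
    """Case 2 (m_c >= lambda^2): B(c) = (c ln 2 - 1)^2."""
    return (c * math.log(2) - 1) ** 2

c0 = 2 / math.log(2)                   # crossover point, approximately 2.8854

def m(c):
    # Normalized medloss from (2.19); lambda^2 factors out of the argmin.
    return A(c) if c <= c0 else B(c)

grid = [0.3 + 0.0001 * i for i in range(30_000)]    # c in (0.3, 3.3)
c_tilde = min(grid, key=m)
print(round(c_tilde, 4))               # approximately 0.8498
```

As a side check, at c0 = 2/ln 2 one has e^{1/c0} = √2, so Δ = √2 and A(c0) = 1 = B(c0) exactly, matching the crossover claimed in the text.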
However, if desired, the optimal c̃ minimizing m_c could be found using our generic algorithm in Section 2.6.1. Of greater importance here is Figure 2.1, which gives graphs for n = 10 and 50. The solid and dashed curves represent the medloss and the expected loss as functions of c, respectively. Note that the point c̃ at which the solid curve attains its minimum is less than 1 and greater than the point c* at which the dashed curve attains its minimum, which means that the best medloss estimator is closer to the efficient rule X̄ than the best risk estimator is. We also see that the minimum value of the medloss is always less than that of the expected loss. As the sample size n grows, the best medloss and expected-loss estimators get closer and closer.

(iii) Results for the Uniform(θ − 1/2, θ + 1/2) in the translation class under L2

In this example and the next, we find unique best expected-loss estimators but multiple best medloss estimators. However, the sets of multiple medloss estimates contain the unique best expected-loss estimates. Suppose X has a uniform distribution with pdf f(x) = 1 for x ∈ [θ − 1/2, θ + 1/2], so that EX = θ and EX² = θ² + 1/12. Consider the translation class C1 under L2. For the expected loss, E L2(θ, δ_c(x)) = E(X + c − θ)² = c² + 1/12, independent of θ. Thus c* = 0, where c* is the value minimizing E L2. In other words, X is the uniformly optimal estimator for θ in the translation class under E L2, and it is unique. For the medloss, let Y_c = [(X + c) − θ]² and let m_c be the median of Y_c. We get

m_c = c², for c > 1/4 or c < −1/4, and m_c = 1/16, for −1/4 ≤ c ≤ 1/4.

Let c̃ = arg min_c m_c. It turns out that c̃ can be any value in [−1/4, 1/4], which includes c* = 0. Thus we have a class of optimal estimators X + c̃ for θ, with c̃ ∈ [−1/4, 1/4], under med L2.
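A Monte Carlo sketch (ours) of the uniform example confirms the flat set of minimizers: writing W = X − θ ∼ Uniform(−1/2, 1/2), the estimated med (W + c)² stays at 1/16 throughout |c| ≤ 1/4 and grows like c² outside.

```python
import random
import statistics

random.seed(5)
w = [random.uniform(-0.5, 0.5) for _ in range(100_000)]

def medloss(c):
    """Monte Carlo estimate of m_c = med ((W + c)^2)."""
    return statistics.median((wi + c) ** 2 for wi in w)

inside = [medloss(c) for c in (-0.25, -0.1, 0.0, 0.2, 0.25)]
print([round(v, 3) for v in inside])   # all close to 1/16 = 0.0625
print(round(medloss(0.5), 3))          # close to 0.5^2 = 0.25
```

The flat region is what produces the non-uniqueness noted above: any c in [−1/4, 1/4] attains the same minimal medloss.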
Figure 2.1: n = 10 and 50. The solid curves are for the medloss as a function of c; the dashed curves are for the expected loss. The dotted vertical line indicates the point at which the expected loss attains its minimum, and the dotted horizontal line is the corresponding expected loss at that point. The minimizers of the medloss and the expected loss get closer and closer as n increases.

(iv) Results for the normal distribution in the translation class with 0-1 loss

Suppose that X ∼ N(θ, 1) and consider the 0-1 loss function

L01 = 0 if |θ − δ| ≤ ε, and L01 = 1 if |θ − δ| > ε, with ε ≥ 0.

Consider the translation class and let Y_c = (X + c) − θ ∼ N(c, 1). For the expected loss,

EL01 = P(|θ − δ_c(X)| > ε) = 1 − P(|Y_c| < ε) (independent of θ)
     = 1 − ∫_{−ε}^{ε} (1/√(2π)) e^{−(t−c)²/2} dt
     = 1 − [Φ(ε − c) − Φ(−ε − c)].

If we set c* = arg min_c EL01 = arg max_c [Φ(ε − c) − Φ(−ε − c)], then c* = 0. In other words, X is an optimal estimator of θ in the translation class under the expected loss with L01.

In the context of the medloss, let m_c = med L01, i.e.

m_c = 0 if P(|Y_c| ≤ ε) ≥ 1/2, and m_c = 1 if P(|Y_c| > ε) ≥ 1/2.

We want to find c̃ to minimize med L01, i.e. find c̃ such that P(|θ − (X + c̃)| ≤ ε) ≥ 1/2 (equivalently, EL01 ≤ 1/2). This means c̃ must satisfy

Φ(ε − c̃) − Φ(−ε − c̃) ≥ 1/2. (2.20)

To find the optimal values for c̃, let A and B satisfy Φ(A) − Φ(B) = 1/2, where A > B. If ε − c̃ ≥ A and −ε − c̃ ≤ B (i.e. −ε − B ≤ c̃ ≤ ε − A), then (2.20) holds. Clearly, there are many choices of A and B satisfying (2.20). For example, we can set A = Q(3), the third quartile (75th percentile) of the standard normal, B = Q(1) = −A, the first quartile, and ε = 1.5A. Then the optimal c̃ is any value in the interval [−ε − B, ε − A] = [−0.5A, 0.5A] = [0.5B, 0.5A].
We conclude with two observations from this example. First, both c* for the expected loss and c̃ for the median loss are independent of the unknown parameter θ; however, unlike c*, c̃ depends on ε. In general, c̃ ≠ c*. Second, c* is unique, but there is more than one c̃ = c̃(ε).

2.4.2 Estimation under the class of median-unbiasedness

The property of median-unbiasedness is discussed by Lehmann (1983). In Lehmann's interpretation, whereas mean-unbiasedness ensures that the amounts by which δ(X) over- and under-estimates g(θ) balance in the long run, so that the estimated value is correct on average, median-unbiasedness controls not the amounts but the relative frequencies of over- and underestimation.

The usual practice to simplify finding a good estimator is to restrict attention to a small class of candidates by imposing extra conditions. A familiar one is mean-unbiasedness. Within the class of mean-unbiased estimators of τ(θ),

U_E := {δ(Xⁿ) : Eδ(Xⁿ) = τ(θ)},

we wish to choose the estimator under L2 that minimizes the variance for all θ, usually called the UMVUE. The classical Cramér-Rao (CR) lower bound plays an important role for UMVUE's, because it provides a lower bound on the variance of mean-unbiased estimators. If there exists an estimator whose variance attains the CR lower bound, then it is the UMVUE, but the converse does not hold in general.

(i) Median-unbiased estimators

Consider the class of median-unbiased estimators of τ(θ) : Θ ⊂ R → R,

U_M := {δ(Xⁿ) : med_{Xⁿ} δ(Xⁿ) = τ(θ)}.

Instead of minimizing the variance or MSE as in the mean-unbiased case, Sung (1988, 1990) proposes minimizing a notion of diffusivity, where the diffusivity of a median-unbiased estimator δ(Xⁿ) of τ(θ) is defined by

1 / [2g_{δ(Xⁿ)}(τ(θ); θ)],

with g_{δ(Xⁿ)}(·) being the density of δ(Xⁿ).
According to Sung's interpretation, the diffusivity measures the vertical spread of a density, while the usual measures of dispersion, like the variance, measure the horizontal spread of a distribution. Sung (1988) also develops a CR analogue for the diffusivity. We call the optimal estimator in the median-unbiased class, the one minimizing the diffusivity for all θ, the uniformly minimum diffusivity median-unbiased estimator, or UMDMUE for short.

Recall that the classical CR lower bound for mean-unbiased estimators of τ(θ) is derived from the Fisher information

I₂(θ) := E[∂/∂θ log f(Xⁿ; θ)]²

and is given by

Var(δ(Xⁿ)) ≥ I₂⁻¹(θ)[τ′(θ)]². (2.21)

By contrast, in the context of median-unbiased estimators, set

I₁(θ) := E|∂/∂θ log f(Xⁿ; θ)|.

This L1 Fisher information can be used to develop the lower bound for Sung's diffusivity of median-unbiased estimators. The corresponding median version of the CR lower bound is, under some regularity conditions given in Theorem 3.2 of Sung (1988),

[2g_{δ(Xⁿ)}(τ(θ); θ)]⁻¹ ≥ I₁⁻¹(θ)|τ′(θ)|, (2.22)

where δ(Xⁿ) is a median-unbiased estimator for τ(θ) with a continuous density g_{δ(Xⁿ)}(·; θ), and τ(θ) is a differentiable real-valued function on Θ. The term on the left-hand side of (2.22) is Sung's diffusivity (Sung, 1988), and its square is comparable to the mean squared error. Compared with the usual CR lower bound, the square of the right-hand side of (2.22) is sharper because I₁²(θ) ≤ I₂(θ). Moreover, Sung's median CR inequality (2.22) gives an upper bound on the density of the estimator at the true value, and so restricts how peaked the sampling distribution of a median-unbiased δ can be, analogous to how the CR lower bound restricts the peakedness of the sampling distribution of a mean-unbiased δ. Consider two estimators δ₁(Xⁿ) and δ₂(Xⁿ) for τ(θ).
Then δ₁ is preferred to δ₂ in the sense of diffusivity if

1/[2g_{δ₁}(τ(θ); θ)] ≤ 1/[2g_{δ₂}(τ(θ); θ)],

or equivalently g_{δ₁}(τ(θ); θ) ≥ g_{δ₂}(τ(θ); θ), i.e. δ₁(Xⁿ) has a higher density at τ(θ) than δ₂(Xⁿ).

(ii) Relationship between the optimal medloss estimator and the UMDMUE

Here we discuss the relationship between the optimal medloss estimator and the UMDMUE, and also conjecture a lower bound for optimal medloss estimators, analogous to the CR lower bound for the UMVUE. Recall that the optimal medloss estimator for τ(θ) is, for every θ ∈ Θ,

δ̂^M(Xⁿ) = arg min_{δ∈D} med_{Xⁿ} L(δ(Xⁿ), τ(θ)), (2.23)

where D is the class of decision functions/estimators under consideration. In particular, if we consider the class of strictly increasing loss functions of |δ(Xⁿ) − τ(θ)|, then the best medloss estimator defined by (2.23) is

δ̂^M(Xⁿ) = arg min_{δ∈D} med_{Xⁿ} |δ(Xⁿ) − τ(θ)|. (2.24)

If we replace D by U_M, then δ̂^M(Xⁿ) is the estimator that minimizes m₁ = med_{Xⁿ} |δ(Xⁿ) − med_{Xⁿ} δ(Xⁿ)|, where med_{Xⁿ} δ(Xⁿ) = τ(θ). In this case, we have

1/2 = P(τ(θ) − m₁ ≤ δ(Xⁿ) ≤ τ(θ) + m₁)
    = ∫_{τ(θ)−m₁}^{τ(θ)+m₁} g_δ(y; θ) dy
    = 2m₁ g_δ((2λ − 1)m₁ + τ(θ); θ), for some λ ∈ (0, 1),

by the mean value theorem. In particular, if λ = 1/2, then

2m₁ = 1/[2g_δ(τ(θ); θ)].

Thus, (2.24) becomes

δ̂^M(Xⁿ) = arg min_{δ∈U_M} 1/[2g_δ(τ(θ); θ)].

In other words, the best medloss estimator coincides with the UMDMUE. Thus, we conjecture here that for any median-unbiased estimator δ(Xⁿ) of τ(θ) in U_M, there exists a constant K such that

med_{Xⁿ} |δ(Xⁿ) − τ(θ)| ≥ K|τ′(θ)|/I₁(θ), (2.25)

where K may be equal to 1/2.

2.5 Global properties of medloss estimation

Dropping the restriction to a subclass of estimators, we consider some optimum properties weaker than the uniformly minimum medloss, so that we can admit all estimators into competition.
To this end, we redefine admissibility and minimaxity with respect to the medloss criterion. Some lemmas parallel to the conventional risk-based results are given. The median-inadmissibility of the least squares estimator under a linear model will also be discussed.

2.5.1 Median-admissibility and median-minimaxity

We now give the definitions of median-admissibility and median-minimaxity, and then provide some preliminary results parallel to the usual ones.

Definition 5. An estimator δ(Xⁿ) is median-inadmissible, or m-inadmissible, if there exists an estimator δ*(Xⁿ) such that

med_{Xⁿ} L(δ*(Xⁿ), θ) ≤ med_{Xⁿ} L(δ(Xⁿ), θ) for all θ,

with strict inequality for some θ. By contrast, it is median-admissible if no such δ*(Xⁿ) exists.

Definition 6. An estimator δ**(Xⁿ) is a median-minimax, or m-minimax, estimator if it minimizes sup_θ med_{Xⁿ} L(δ(Xⁿ), θ) among all estimators δ(Xⁿ) in the decision space D, i.e. if

sup_θ med_{Xⁿ} L(δ**(Xⁿ), θ) = inf_δ sup_θ med_{Xⁿ} L(δ(Xⁿ), θ).

(i) Preliminary results between median-admissibility and median-minimaxity

We have the following lemmas parallel to the usual results for the risk.

Lemma 1. If δ*(Xⁿ) is a unique m-minimax rule, then it is also m-admissible.

Proof. Assume that δ*(Xⁿ) is not m-admissible, i.e. there exists δ(Xⁿ) ≠ δ*(Xⁿ) such that

med_{Xⁿ} L(δ(Xⁿ), θ) ≤ med_{Xⁿ} L(δ*(Xⁿ), θ) for all θ.

Thus, we have

sup_θ med_{Xⁿ} L(δ(Xⁿ), θ) ≤ sup_θ med_{Xⁿ} L(δ*(Xⁿ), θ) ≤ sup_θ med_{Xⁿ} L(δ̃(Xⁿ), θ)

for any δ̃ ∈ D. In other words, δ(Xⁿ) is also m-minimax, which contradicts the uniqueness of δ*(Xⁿ). So δ*(Xⁿ) is m-admissible.

Lemma 2. If δ*(Xⁿ) is an m-admissible rule with constant medloss, then δ*(Xⁿ) is also an m-minimax rule.

Proof. Assume that δ*(Xⁿ) is not m-minimax, i.e. there exists δ such that

sup_θ med_{Xⁿ} L(δ(Xⁿ), θ) < sup_θ med_{Xⁿ} L(δ*(Xⁿ), θ).
By the assumption that δ*(Xⁿ) has constant medloss, say m₀, we have that, for all θ,

med_{Xⁿ} L(δ(Xⁿ), θ) ≤ sup_θ med_{Xⁿ} L(δ(Xⁿ), θ) < sup_θ med_{Xⁿ} L(δ*(Xⁿ), θ) = m₀ = med_{Xⁿ} L(δ*(Xⁿ), θ),

which contradicts the admissibility of δ*(Xⁿ). Thus, δ*(Xⁿ) is m-minimax.

(ii) Preliminary results linking median-admissibility and median-minimaxity to the Bayes medloss estimator

In Section 2.3.2, we introduced the Bayes medloss estimator, which is defined through the prior. Here we give some results on the relationships among the m-admissible, m-minimax and Bayes medloss estimators.

Lemma 3. For any given prior π(θ), if the Bayes medloss estimator δ^π(Xⁿ) is unique, then it is m-admissible.

Proof. Assume that δ^π(Xⁿ) is not m-admissible. Then there exists δ(Xⁿ) ≠ δ^π(Xⁿ) such that

med_{Xⁿ} L(δ(Xⁿ), θ) ≤ med_{Xⁿ} L(δ^π(Xⁿ), θ) for all θ.

Then, comparing their medians with respect to π, we have

m_π(δ) = med_π[med_{Xⁿ} L(δ(Xⁿ), θ)] ≤ med_π[med_{Xⁿ} L(δ^π(Xⁿ), θ)] = m_π(δ^π),

so δ(Xⁿ) is also a Bayes medloss estimator, which contradicts the uniqueness of δ^π(Xⁿ).

Lemma 4. Suppose that δ^{π*}(Xⁿ) is the Bayes medloss estimator with respect to a specific prior π* ∈ Π. Then it is also m-minimax if it has constant medloss, i.e. med_{Xⁿ} L(δ^{π*}(Xⁿ), θ) = m(δ^{π*}, θ) = ρ* for all θ.

Proof. By the definition of δ^{π*}(Xⁿ) and the assumption on ρ*, we have

ρ* = m_{π*}(δ^{π*}) = inf_δ m_{π*}(δ) ≤ sup_{π∈Π} inf_δ m_π(δ). (2.26)

Moreover, for any δ we have m_π(δ) ≤ sup_{π∈Π} m_π(δ), so inf_δ m_π(δ) ≤ inf_δ sup_{π∈Π} m_π(δ). This implies that

sup_{π∈Π} inf_δ m_π(δ) ≤ inf_δ sup_{π∈Π} m_π(δ). (2.27)

Similarly, for any π ∈ Π, m_π(δ) = med_π m(δ, θ) ≤ sup_θ m(δ, θ). So sup_{π∈Π} m_π(δ) ≤ sup_θ m(δ, θ), which implies that

inf_δ sup_{π∈Π} m_π(δ) ≤ inf_δ sup_θ m(δ, θ) ≤ sup_θ m(δ^{π*}, θ) = ρ*. (2.28)

Combining (2.26), (2.27) and (2.28) with the assumption on ρ* for δ^{π*}(Xⁿ), we have

sup_θ m(δ^{π*}, θ) = ρ* = inf_δ sup_θ m(δ, θ), (2.29)

i.e. δ^{π*}(Xⁿ) is m-minimax.

Lemma 5.
Let {π_k : k ≥ 1} be a sequence of priors on Θ. Denote the corresponding sequences of Bayes medloss estimators and their Bayes medlosses by {δ_k(Xⁿ) : k ≥ 1} and {m_{π_k}(δ_k) : k ≥ 1}, respectively. Suppose that δ*(Xⁿ) is an estimator for θ whose medloss satisfies

sup_{θ∈Θ} m(δ*(Xⁿ), θ) ≤ lim_{k→∞} m_{π_k}(δ_k). (2.30)

Then δ*(Xⁿ) is m-minimax.

Proof. Assume that δ*(Xⁿ) is not m-minimax, i.e. there exists δ̃(Xⁿ) such that

sup_θ m(δ̃(Xⁿ), θ) < sup_θ m(δ*(Xⁿ), θ). (2.31)

By the definitions of δ_k and m_{π_k}(δ_k) for k ≥ 1, we have

m_{π_k}(δ_k) ≤ m_{π_k}(δ̃) = med_{π_k} m(δ̃(Xⁿ), θ) ≤ sup_θ m(δ̃(Xⁿ), θ).

Taking limits on both sides and applying (2.31), we get

lim_{k→∞} m_{π_k}(δ_k) ≤ sup_θ m(δ̃(Xⁿ), θ) < sup_θ m(δ*(Xⁿ), θ),

which contradicts condition (2.30). So δ*(Xⁿ) is m-minimax.

Corollary 1. Suppose δ*(Xⁿ) is an estimator for θ with constant medloss, say m(δ*(Xⁿ), θ) = ρ* for all θ, and there exists a sequence of priors {π_k} such that the corresponding Bayes medloss estimators δ_k(Xⁿ) have Bayes medlosses m_{π_k}(δ_k) satisfying

lim_{k→∞} m_{π_k}(δ_k) = ρ*. (2.32)

Then δ*(Xⁿ) is m-minimax.

2.5.2 Median-inadmissibility for linear models

As mentioned before, our proposed medloss criterion is weaker than the universal domination and the stochastic domination criteria of Hwang (1985). It follows that the results in Hwang (1985) for U-admissibility under those two criteria also hold for median-admissibility. In what follows, we first show some results for the case of a p-dimensional random vector X with median θ, and then the median-inadmissibility for linear models. Our result is a direct consequence of Theorem 3.4 in Hwang (1985), which we state here as the following lemma. Let δ_a(X) be the James-Stein positive-part estimator, i.e.

δ_a(X) = (1 − a/‖X‖²)₊ X, (2.33)

where y₊ = max{y, 0} and ‖z‖² = zᵀz is the squared Euclidean norm. So, we have Lemma 6.
Assume that X has a density f of the form f(‖x − θ‖²) and that f′(s)/f(s) is defined for all s ∈ [α₀, α₁], where, for some c > 0 and a > 0, α₀ = [(c − √a)₊]² and α₁ = c² + a. If

−[(p − 2)/(2c)] a^{−1/2} ln{[c + (c² + a)^{1/2}] a^{−1/2}} ≤ inf_{s∈(α₀,α₁)} f′(s)/f(s), (2.34)

then for every θ,

P_θ(‖θ − δ_a(X)‖ ≤ c) > P_θ(‖θ − X‖ ≤ c). (2.35)

Clearly, if (2.35) is satisfied for c = med_X ‖θ − X‖, then δ_a(X) is better than (or dominates) X under the medloss criterion with respect to the Euclidean error. Thus, we have the following corollary.

Corollary 2. With the assumptions and notation of Lemma 6 above, if there exists a such that (2.34) is satisfied for c = med_X ‖θ − X‖, then δ_a(X) dominates X under the medloss criterion.

In particular, Hwang (1985) considers the situation in which X − θ has a p-variate t distribution with N degrees of freedom, and shows that for every N and p ≥ 3, X is inadmissible under the Euclidean error in the sense of universal domination, and the James-Stein positive-part estimator δ_a(X) universally dominates X under the Euclidean error if a > 0 satisfies

[(p − 2)/((N + a)^{1/2} a^{1/2})] ln{[(N + a)^{1/2} + (N + 2a)^{1/2}] a^{−1/2}} ≥ (N + p)/N. (2.36)

This result holds for all c, which implies that δ_a(X) dominates X under the medloss criterion. So, we have

Corollary 3. Assume that X − θ has a p-variate t distribution with N degrees of freedom. If a > 0 satisfies (2.36), then under the Euclidean error, X is median-inadmissible for every N and p ≥ 3, and is dominated by δ_a(X) in the medloss sense.

Based on Corollary 3, we can show the median-inadmissibility of the least-squares estimator δ̂_LS for θ in a linear model. Consider the linear model Y = Xθ + e, where X is an m × p known design matrix with full rank p and e/σ has an m-variate t distribution with N degrees of freedom, where σ is known. For this model, we have Corollary 4.
If a satisfies (2.36), then for every N and p ≥ 3, the least-squares estimator δ̂_LS = (XᵀX)⁻¹XᵀY is median-inadmissible and is dominated by

δ(Y) = (1 − aσ²/‖(XᵀX)^{1/2} δ̂_LS‖²)₊ δ̂_LS (2.37)

in the medloss sense with the generalized Euclidean loss with respect to XᵀX, i.e. ‖θ − δ(Y)‖_{XᵀX}.

2.5.3 Robustness of the best medloss estimator to the choice of loss functions

In general, the usual risk-based estimator is invariant only up to positive affine transformations of the loss. In addition, under the assumptions of unimodality and symmetry of the posterior density, Rukhin (1978) showed that the posterior expected-loss estimator is invariant only up to the choice of even loss functions when the posterior risk is finite. Here we show that when the posterior density is continuous, the posterior medloss estimator is invariant up to any strictly increasing transformation of the absolute-error loss, which implies that the posterior medloss estimate has higher loss robustness than the usual posterior expected-loss estimate, and under weaker assumptions on the posterior. In fact, this loss robustness also holds for median-admissible, median-minimax, and Bayes medloss estimators. Because of the similarity, we provide the theorem with proof only for posterior medloss estimators.

Theorem 2. Define

δ₁(xⁿ) = arg min_d med_{π(Θ|xⁿ)} |Θ − d(xⁿ)|,

where π(Θ|xⁿ) is any continuous posterior density of Θ given xⁿ. Assume that the median of the loss L is unique. Then for any strictly increasing function L of |Θ − d(xⁿ)|, we have

δ₁(xⁿ) = arg min_d med_{π(Θ|xⁿ)} L(|Θ − d(xⁿ)|),

i.e. all med_{π(Θ|xⁿ)} L(|Θ − d(xⁿ)|) attain their minimum at the same point δ₁(xⁿ).

Proof. Let Π⁻¹(·|xⁿ) be the inverse of the continuous posterior distribution Π(·|xⁿ) of Θ given xⁿ, and consider the medloss for any strictly increasing loss function L of |Θ − d(xⁿ)|,

mL = med_{π(Θ|xⁿ)} L(|Θ − d(xⁿ)|).

Here mL = mL(d, xⁿ).
By the property of the median that med(h(X)) = h(med(X)) for strictly increasing functions h, we have

mL = L(med_{π(Θ|xⁿ)} |Θ − d(xⁿ)|) = L(m₁),

so the median of the absolute-error loss is m₁ = m₁(d, xⁿ) = L⁻¹(mL), where L⁻¹ is the inverse of L. By the definition of mL,

1/2 = Π(L(|Θ − d(xⁿ)|) ≤ mL | xⁿ) = Π(|Θ − d(xⁿ)| ≤ m₁ | xⁿ).

Since L is strictly increasing, minimizing mL over d is equivalent to minimizing m₁ over d, so the common minimizer is δ₁(xⁿ).

2.6 Properties for the posterior median loss estimation

Unlike for the other median-based estimators mentioned above, in what follows we provide some ways to obtain the posterior medloss estimator.

2.6.1 General procedure for computing the posterior medloss estimator

In this section, we provide a general procedure for computing posterior medloss estimators when the posterior density is continuous. Let

Ξ = {(a, b) ∈ R² : Π(a ≤ Θ ≤ b | xⁿ) = 1/2}.

Then for any (a, b) ∈ Ξ, let a = d(xⁿ) − m₁ and b = d(xⁿ) + m₁. After some algebra, we get

d(xⁿ) = (b + a)/2, m₁ = (b − a)/2.

These expressions suggest the following general procedure for finding posterior medloss estimates.

1. Find the posterior distribution function Π(·|xⁿ) and its inverse Π⁻¹(·|xⁿ).

2. Define a* = inf{a : Π(a) > 0} and b* = sup{b : Π(b) < 1}. Consider a sequence of aᵢ ∈ (−∞, m], where m is the posterior median, i.e. m = Π⁻¹(1/2 | xⁿ) ∈ (a*, b*). Thus, we can pick any number r ∈ (a*, m] and define aᵢ = r + si with step size s for i = 1, …, (m − r)/s.

3. Define bᵢ = Π⁻¹[1/2 + Π(aᵢ|xⁿ) | xⁿ].

4. Let dᵢ(xⁿ) = (bᵢ + aᵢ)/2. The corresponding value of the median loss is m₁(i) = (bᵢ − aᵢ)/2.

5. Find the minimum median loss m̃₁ in {m₁(i)}_{i=1,…,(m−r)/s}. The corresponding d̃(xⁿ) is the value of the posterior medloss estimate.

This procedure gives good performance in routine examples. Consider

X|λ ∼ Poisson(λ), λ ∼ Ga(α, β),

where the Poisson density is given by P(X = x|λ) = λˣe^{−λ}(x!)⁻¹, equipped with the conjugate Gamma prior for convenience.
Thus, the posterior distribution of λ given X = x is λ|X = x ∼ Ga(α + x, β/(1 + β)). To compare the risk-based and medloss-based estimators, we must first find them. For the risk under L2 = (λ − δ(X))², the optimal rule δ₂*(x) is the posterior mean, i.e. δ₂*(x) = E(λ|X = x) = β(x + α)/(β + 1). For the medloss, we do not have a closed form for

δ̃₂(x) = arg min_δ med_{π(λ|x)} |λ − δ(x)|²;

however, our generic algorithm can be used to generate the curves of the posterior risk and posterior medloss as the value of δ varies, shown in Figure 2.2. For the fixed value x = 2 and a Gamma(2, 3) prior, the posterior mean is δ₂*(x) = 3, the posterior medloss estimate is δ̃₂(x) ≈ 2.374675, and the posterior median is ≈ 2.754046.

2.6.2 Closed form of the posterior medloss estimator for unimodal and symmetric posterior densities

In Theorem 2, we showed that when the posterior density is continuous, the posterior medloss estimator is invariant up to any strictly increasing transformation of the absolute-error loss. If, additionally, the posterior density is unimodal and symmetric about its median, we can show that the posterior median is the posterior medloss estimator. This is parallel to Rukhin's result (Rukhin, 1978) that the posterior risk estimator is invariant up to any even loss function if the posterior density is symmetric and unimodal and the posterior risk of this estimator is finite.

Theorem 3. If the posterior density is continuous, symmetric and decreasing away from its median, then the posterior median is the posterior medloss estimator with respect to any strictly increasing function of the absolute-error loss L1.

Theorem 3 is a direct consequence of the following two lemmas for any continuous and symmetric density that decreases away from its median.

Lemma 7. If the density of Z is symmetric and decreases away from its median med(Z), then med(Z) maximizes P(|Z − a| ≤ k) over a, for any k ≥ 0.
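The search over Ξ in the procedure above amounts to finding the shortest interval carrying posterior mass 1/2 and taking its midpoint. When the posterior can be sampled, the same procedure can be run on posterior draws, with the empirical distribution function standing in for Π; a small sketch for the Poisson-Gamma example (the sample size and seed are arbitrary choices):

```python
import random
import statistics

def posterior_medloss_estimate(samples):
    """Midpoint of the shortest interval containing half the posterior mass.

    Mirrors steps 1-5 of Section 2.6.1 with the empirical cdf of posterior
    draws: each window spanning half the sorted sample is an interval
    (a_i, b_i) with posterior probability ~1/2, and the estimate is the
    midpoint of the window minimizing m1(i) = (b_i - a_i)/2."""
    s = sorted(samples)
    half = len(s) // 2
    best = min(range(len(s) - half), key=lambda i: s[i + half] - s[i])
    return 0.5 * (s[best] + s[best + half])

# lambda | X = 2 with a Gamma(2, 3) prior is Ga(4, 3/4): shape 4, scale 0.75.
rng = random.Random(0)
draws = [rng.gammavariate(4.0, 0.75) for _ in range(200000)]
d_medloss = posterior_medloss_estimate(draws)  # thesis value: ~2.3747
post_median = statistics.median(draws)         # thesis value: ~2.7540
```

By Theorem 2, the minimizer is the same under L2 as under L1, so this midpoint is also the posterior medloss estimate under squared error.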
In other words, med(Z) = arg sup_a P(|Z − a| ≤ k) for any k ≥ 0.

Figure 2.2: The posterior risk and posterior medloss as functions of δ(x), with the conjugate Gamma(2, 3) prior. The dashed curve represents the posterior risk for L2, while the solid one is the posterior medloss under L2. The central vertical line indicates the position of the posterior median. The left and right vertical lines indicate the positions of the posterior medloss estimate and the posterior mean, respectively.

Proof. Without loss of generality, assume that med(Z) = 0 and a > 0. Then

P(|Z − a| ≤ k) = P(a − k ≤ Z ≤ a + k)
= ∫_{−k}^{k} dF(z) + ∫_{k}^{k+a} dF(z) − ∫_{−k}^{a−k} dF(z)
= ∫_{−k}^{k} dF(z) + ∫_{k}^{k+a} dF(z) − ∫_{k−a}^{k} dF(z) (by symmetry)
≤ ∫_{−k}^{k} dF(z) = P(|Z − 0| ≤ k),

where the inequality holds because f(k + t) ≤ f(k − t) for every t ∈ [0, a] when the density decreases away from 0.

Lemma 8. If Z has a symmetric distribution with a unique median and a density that decreases away from it, then its median, med(Z), is the unique minimizer of the medloss for any strictly increasing function of |Z − m|, the L1 loss, over m ∈ R.

Proof. It suffices to show that med(Z) is the unique minimizer of the medloss with L1 loss. By Lemma 7, we have

P(|Z − med(Z)| ≤ k) > P(|Z − a| ≤ k), for any k ≥ 0 and any a ≠ med(Z).

Now let k = med(|Z − a|). Then the right-hand side of the above inequality is 1/2, so P(|Z − med(Z)| ≤ med(|Z − a|)) > 1/2, i.e.

med(|Z − med(Z)|) < med(|Z − a|) for all a ≠ med(Z).

Thus, med(Z) is the unique minimizer of the medloss with L1, or with any strictly increasing function of the L1 loss.

2.6.3 Prediction

We conducted a simulation study to show that the medloss performs well predictively. To verify the anticipated robustness properties of the medloss method, we generated data from one model, the data generator (DG), but analyzed it with a different model taken as true. The DG is chosen to have
slightly heavier tails than the data analysis (DA) model; this makes the prediction task somewhat harder. It will be seen that medloss estimators generally do better than risk-based estimators in this setting.

In the Bayes setting, to compare the optimal estimators based on the expectation and median criteria under L1 and L2 losses, our procedure has six steps. First, let b = {10, 11, …, 20} be a vector of values to be taken as true. Second, let the DG be Gamma(a₀, b), so we generate n observations of X ∼ Gamma(a₀, b), where a₀ is a fixed shape parameter. Third, assume that the prior distribution on the scale parameter b is inverse gamma, I.Ga(α, β), with mode [(α + 1)β]⁻¹ = 15 for fixed α and β, so that the true values of b are around 15. Fourth, let the DA model be X ∼ Gamma(a₁, b), so that if a₀ > a₁, the DG has a heavier tail than the DA model. Fifth, under these assumptions on the prior and data distributions, the posterior distribution is b|Xⁿ = xⁿ ∼ I.Ga(α*, β*), where α* = na₁ + α and β* = β[1 + β Σ_{i=1}^{n} xᵢ]⁻¹. Sixth, we find the optimal expected-loss estimator and the medloss estimator.

For each b, we repeat steps 2-6 N times, giving N optimal expected-loss and N optimal medloss estimators under L1 loss and under L2 loss. From these N values, we calculate the sample means b̂_E and the sample medians b̂_M of each of the two sequences of estimators. Finally, we find the relative errors |b̂_E − b|/b and |b̂_M − b|/b for each sequence of estimators and each value of b.

In our simulation, we chose α = 4, β = 1/75, n = 20, N = 50, and (a₀, a₁) = (3, 3) for the two panels of Figure 2.3 (top for L1 and bottom for L2), and (a₀, a₁) = (5, 3) for the two panels of Figure 2.4 (top for L1 and bottom for L2). Since a₀ > a₁ in the latter case, the DG has a heavier tail than the DA model.
It is seen in Figure 2.4 that the relative error based on the expected-loss estimators (solid dark line for the sample-mean case and dashed dark line for the sample-median case) is larger than that for the medloss estimators (solid light line for the sample-mean case and dashed light line for the sample-median case). When the DG and DA models are the same, neither method dominates the other, as shown in Figure 2.3. Note that whether we use the sample mean or the sample median to define the predictive error, the medloss approach has lower error in the case a₀ > a₁.

We do not investigate cases in which a₀ < a₁. This setting would correspond to data with tighter tails than its proposed model. Such cases are rare and unusual in practice, because real data typically exhibit much more variability than their proposed models can account for.

2.7 Implications

To develop the logical implications of switching to the medloss, it is enough to replace the means occurring in the usual criteria for inference with medians. For instance, in place of a sum of squared errors in mean-based cross validation, it is natural to use the median of the squared errors. Below we develop two implications. The first is that using a least median of squares (LMS) estimator is better than a least squares (LS) estimator outside the normal-error case. The second is a median version of cross validation, which likewise seems to give better results outside normal-error cases.

2.7.1 Frequentist medloss in regression problems

Except for the LMS estimator, all robust alternatives to the LS estimator in Frequentist regression problems seem to be expectation-based. Rousseeuw (1984) first introduced the LMS estimator for linear regression parameters. Later, Stromberg (1995) proved consistency of LMS estimators in nonlinear regression models, and Andrews et al.
(1972) and Kim and Pollard (1990) showed that LMS estimators have cube-root asymptotic behavior in the linear regression setting. Note that the LMS estimator corresponds to our Frequentist medloss estimator. In other words, here we have provided a decision-theoretic framework for LMS estimators through the medloss criterion. The usual robust estimators are only natural in the classical expectation-based decision-theoretic framework.

Figure 2.3: The dark lines are for the best risk estimators and the light lines for the best medloss estimators. The solid lines represent the sample means of the 2 sequences of best risk and medloss estimators, while the dashed lines represent the sample medians of these 2 sequences. The top panel uses L1 loss and the bottom one L2 loss. For these two panels, neither method dominates the other, since the DG and DA models are the same.

Figure 2.4: Again, the top panel is for L1 loss and the bottom one for L2 loss. In these two panels, the DG has a heavier tail than the DA model. The solid dark line and the dashed dark line are always above the solid light line and the dashed light line, in the sample-mean and sample-median cases, respectively.

Simulation: 0-1 loss with the least median of squares (LMS) estimate

To compare the accuracy and robustness of the medloss and expected-loss estimators in a regression context, consider the model

Y = β₀ + β₁X + e. (2.38)

For concreteness, we fix β = (β₀, β₁)ᵀ = (7, 3)ᵀ and take the sample size to be n = 80.
To conduct our simulation, we choose three distributions for e: i) the standard normal, N(0, 1); ii) the standard Cauchy, i.e. the t-distribution with 1 degree of freedom; and iii) the mixed normal, 0.9N(0, 1) + 0.1N(10, 0.5²). To ensure enough variability in the data, we generated the xᵢ's from the uniform distribution on [0, 1].

The next task is to define an assessment of the accuracy of the two estimators. Consider the 0-1 loss defined by

I{‖β̂ − β‖ > B} = 1 if ‖β̂ − β‖ > B, and 0 if ‖β̂ − β‖ ≤ B, (2.39)

where ‖·‖ is the Euclidean norm and B is the error boundary. Note that ‖β̂ − β‖ is the distance between the estimator and the true value. Using the expected-loss criterion in the linear regression problem leads to the LS estimator β̂_LS; using the medloss criterion leads to the LMS estimator β̂_LMS. The indicator function in (2.39) gives the error percentages of the estimators β̂_LS and β̂_LMS as functions of the boundary. We let B range from 0 to 5 with step size 0.1, i.e. B = {B₁, …, B₅₁ : Bᵢ = (0.1)(i − 1)}. The error percentages are now

ELerror[i] = (1/N) Σ_{j=1}^{N} I{‖β̂_LS^j − β‖ > Bᵢ} (2.40)

for the LS estimator, and

MLerror[i] = (1/N) Σ_{j=1}^{N} I{‖β̂_LMS^j − β‖ > Bᵢ} (2.41)

for the LMS estimator, where i = 1, …, 51, and β̂_LS^j and β̂_LMS^j denote the LS and LMS estimators of β at the j-th replication, respectively. The number of replications of the data-generating procedure is N, which we took to be 100.

The results are shown in Figure 2.5 for the normal error distribution (top panel), the Cauchy distribution (middle panel), and the mixed normal (bottom panel). In Figure 2.5, the solid and dashed lines represent the errors of the LMS and LS estimates, respectively. It is seen that the usual LS estimator is better only when the error distribution is normal.
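The comparison above can be reproduced in a few lines. The sketch below uses an elemental-set search (lines through random pairs of points, keeping the one with the smallest median squared residual) as a stand-in for exact LMS, in the spirit of Rousseeuw's PROGRESS algorithm, and a heavier contamination than the mixed normal above so the contrast is unmistakable; all tuning values are illustrative, not the thesis settings.

```python
import random
import statistics

def fit_ls(xs, ys):
    """Closed-form least squares for y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def fit_lms(xs, ys, trials=3000, seed=0):
    """Approximate least-median-of-squares fit via random elemental sets."""
    rng = random.Random(seed)
    n, best = len(xs), None
    for _ in range(trials):
        i, j = rng.sample(range(n), 2)
        if xs[i] == xs[j]:
            continue
        b1 = (ys[i] - ys[j]) / (xs[i] - xs[j])
        b0 = ys[i] - b1 * xs[i]
        m = statistics.median((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
        if best is None or m < best[0]:
            best = (m, b0, b1)
    return best[1], best[2]

# y = 7 + 3x with ~25% gross outliers; LS is pulled away, LMS is not.
rng = random.Random(3)
xs = [rng.random() for _ in range(80)]
ys = [7 + 3 * x + (rng.gauss(20, 1) if rng.random() < 0.25 else rng.gauss(0, 1))
      for x in xs]
dist = lambda b: ((b[0] - 7) ** 2 + (b[1] - 3) ** 2) ** 0.5  # ||bhat - beta||
```

The 0-1 loss (2.39) is then just a threshold on `dist(b)` against the boundary B.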
By contrast, when the error distribution is Cauchy or mixed normal, the LMS estimator outperforms the LS estimator, because the LMS estimator is less sensitive to outliers or heavy tails. Although the LMS estimator is not efficient, it outperforms the LS. We suggest this occurs because the non-normal error structure increases the model uncertainty, and the LS estimator is highly sensitive to this form of uncertainty. Note that the evaluation of the estimators was done under 0-1 loss, a very different sense from the L2 loss used to construct them. Indeed, since both (2.40) and (2.41) are means, we see that our median-based method outperforms risk even in risk-like terms.

Figure 2.5: The solid lines are the errors of the LMS estimator as a function of the boundary B; the dashed lines are for the LS estimator. The panels correspond to the standard normal, Cauchy, and mixed normal 0.9N(0, 1) + 0.1N(10, 0.25) error distributions.

2.7.2 Model selection based on median cross validation

Model selection is the task of selecting a statistical model from a set of potential models, given data. To do this, one needs a criterion that quantifies a model's performance. For example, consider a regression model

yᵢ = f(xᵢ; θ) + eᵢ, (2.42)

where the eᵢ are assumed independent and identically distributed with mean 0 and variance σ², i = 1, …, n, and n is the sample size. Suppose f̂_λ is an estimate of the function in (2.42) from the model space M := {f_λ, λ ∈ Λ}, where each f_λ is a candidate model and λ is the model index belonging to a set Λ, which may be finite, countable or uncountable.
(i) Median cross validation (MCV)

As with the usual mean-based cross validation (CV), one randomly selects from the complete set of observations a subsample, called the 'training' sample, to estimate the model, and then uses the fitted model to predict the criterion variable for the remaining 'validation' sample. More formally, suppose that one has decided on a measure of discrepancy for model evaluation, such as the predictive error. A $V$-fold median-based CV selects a model as follows.

1. Split the whole data set into $V$ disjoint subsamples $S_1, \ldots, S_V$.
2. For $v = 1, \ldots, V$, fit model $M_\lambda$ to the training sample $\bigcup_{i \ne v} S_i$, and compute the discrepancy $d_v(\lambda)$ using the validation sample $S_v$.
3. Find the optimal $\lambda$ as the minimizer of the overall discrepancy $d(\lambda) = \operatorname{med}_{1 \le v \le V} d_v(\lambda)$.

For example, consider the case $V = n$, i.e. delete-1 (leave-one-out) MCV. Let $z_i = (y_i, x_i^T)$ with all $x_i \in \mathbb{R}^p$, and let $z_{-i}$ be the $(n-1)$-vector with the $i$th observation, $z_i$, removed from the original observations $z$. Let $\hat f_\lambda^{-i}$ be the estimate of $f_\lambda \in \mathcal{M}$ based on the $n-1$ observations $z_{-i}$. Define the median-based CV error for $f_\lambda$ by

$$ CVMedloss(\lambda) = \operatorname{med}_{1 \le i \le n} \big(y_i - \hat f_\lambda^{-i}(x_i)\big)^2, \tag{2.43} $$

where $\hat f_\lambda^{-i}(x_i) = f(x_i; \hat\theta^{-i})$ and $\hat\theta^{-i}$ is based on the data $z_{-i}$. The usual expectation-based CV error for $f_\lambda$ is

$$ CVMSE(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat f_\lambda^{-i}(x_i)\big)^2. \tag{2.44} $$

Here we remark that $CVMSE(\lambda)$ can be regarded as the expected squared-error loss for the model $f_\lambda$ under the empirical distribution of the data. From our previous considerations, we expect $CVMedloss$ to be a more stable and reliable model selection criterion than $CVMSE$ outside the normal error case. In the next subsection, a simulation comparing $CVMSE$ with $CVMedloss$ for model selection confirms our expectations.
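As a concrete illustration of (2.43) and (2.44), the following sketch computes both leave-one-out criteria for a simple linear model fitted by LS; the pairing of MCV with an LMS fit, as in our simulations, is omitted for brevity, and all names are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def loo_sq_errors(x, y):
    """Leave-one-out squared prediction errors for an LS line fit."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        X = np.column_stack([np.ones(n - 1), x[mask]])
        b0, b1 = np.linalg.lstsq(X, y[mask], rcond=None)[0]
        errs[i] = (y[i] - (b0 + b1 * x[i])) ** 2
    return errs

x = rng.uniform(size=40)
y = 2 + 3 * x + rng.standard_cauchy(40)  # heavy-tailed errors
errs = loo_sq_errors(x, y)
cv_mse = errs.mean()                     # CVMSE, eq. (2.44)
cv_medloss = np.median(errs)             # CVMedloss, eq. (2.43)
```

Under heavy-tailed errors the mean in (2.44) is dominated by a few enormous squared errors, while the median in (2.43) is not; this is the source of MCV's stability.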
(ii) A comparison of cross validation and MCV for nested models

Consider the following three candidate models:

Model 1: $y = 2(1 + x_1 + x_2) + \text{error}$,
Model 2: $y = 2(1 + x_1 + x_2 + x_3 + x_4) + \text{error}$,
Model 3: $y = 2(1 + x_1 + x_2 + x_3 + x_4 + x_5 + x_6) + \text{error}$.

For the purposes of our simulations, Model 2 is set to be the true model, i.e. the data-generating model. Thus, Model 1 underfits and Model 3 overfits. We fix the sample size $n$ at 50, and generate $x_1, \ldots, x_6$ from the uniform distribution on $[0,1]$. To provide a reasonable range of comparisons, we use four error distributions: i) the $t$-distribution with 0.5 degrees of freedom, denoted $t_{(0.5)}$; ii) a contaminated $t$-distribution, $90\%\,t_{(0.5)} + 10\%\,N(0, 0.01^2)$; iii) the standard normal, $N(0,1)$; and iv) a mixture of normals, $70\%\,N(5,1) + 30\%\,N(15, 2^2)$.

Now, our simulation procedure is simple. For each error distribution, we generate 50 values of $y$ from the true model, i.e. Model 2. To compare the usual CV and our MCV methods, we first fix each of these models and then use the observations corresponding to it. Thus, we use the pairs $(y, \{x_1, \ldots, x_k\})$ for $k = 2, 4$ and 6 for Models 1, 2 and 3, respectively. Next we use $V$-fold CV and $V$-fold MCV. That is, at each stage, we use $n/V$ data points for validation and $n - n/V$ data points for training. We choose $V = 5$. If $V$ is too large, the predictions tend to be correlated and not representative of the true predictive error. If $V$ is too small, the variability in the parameter estimates gives a large MSE, inflating the cross validation error. Based on the training data, we estimate the regression parameters of the chosen model by the LS method, representing the usual expectation criterion, and by the LMS method, representing our median criterion. Next, we use these estimators to predict the $y$'s in the validation group.
After predicting the $y$'s in all $V$ validation groups, we calculate the following two measures to select the appropriate model.

1. Based on the usual mean criterion with the least squares estimators, we define

$$ CVMSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat Y_{LS,i})^2. \tag{2.45} $$

2. For the LMS estimator under the median criterion, we have

$$ CVMedloss = \operatorname{med}_{1 \le i \le n} (Y_i - \hat Y_{LMS,i})^2. \tag{2.46} $$

Finally, we calculate these measures for each model, and at each replication choose the model with the smaller $CVMSE$ or $CVMedloss$. To draw our conclusions, we repeat these steps $N$ times; here we set $N = 100$. We then record the proportion of times each model is chosen over the $N$ replications under the CV and MCV criteria for each error distribution. If a criterion chooses the true model, i.e. Model 2, more often, then it is better.

Figure 2.6: Proportions of times each model is chosen under 5-fold CV and MCV with $t_{(0.5)}$ errors (top) and contaminated $90\%\,t_{(0.5)} + 10\%\,N(0, 0.01^2)$ errors (bottom). The solid circles represent the median-based CV criterion, and the squares represent the usual expectation-based CV criterion.

Figure 2.7: Proportions of times each model is chosen under 5-fold CV and MCV with $N(0,1)$ errors (top) and $70\%\,N(5,1)+30\%\,N(15,4)$ errors (bottom). When the error distribution is $N(0,1)$, the usual CV works better than our MCV; however, outside the normal error case, MCV outperforms CV.

Figures 2.6 and 2.7 show that outside the normal error case we are always more likely to choose the true model under the median-based CV than under the usual CV, while in the normal error case the usual expectation-based CV is more likely to choose the correct model.
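The selection step above can be sketched as follows. For brevity, the sketch fits every candidate by LS and applies both criteria to the same squared prediction errors, whereas our simulations pair the median criterion with an LMS fit; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def vfold_criteria(X, y, V=5):
    """V-fold squared prediction errors of an LS fit;
    returns (CVMSE, CVMedloss) as in (2.45)-(2.46)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), V)
    sq_err = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta = np.linalg.lstsq(A, y[train], rcond=None)[0]
        pred = beta[0] + X[fold] @ beta[1:]
        sq_err[fold] = (y[fold] - pred) ** 2
    return sq_err.mean(), np.median(sq_err)

# nested designs: Model 1 uses x1,x2; the true Model 2 uses x1,...,x4
n = 50
X_full = rng.uniform(size=(n, 4))
y = 2 * (1 + X_full.sum(axis=1)) + rng.standard_t(1, size=n)  # heavy tails
mse1, med1 = vfold_criteria(X_full[:, :2], y)
mse2, med2 = vfold_criteria(X_full, y)
pick_cv = 1 if mse1 < mse2 else 2      # model chosen by CV
pick_mcv = 1 if med1 < med2 else 2     # model chosen by MCV
```

Repeating this over many replications and tallying how often each criterion picks Model 2 gives the proportions plotted in Figures 2.6 and 2.7.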
Our simulation results also show that outside the normal error case, the usual CV underfits.

2.8 Discussion

In this paper we have presented a median-based approach to decision theory. In the Bayes case we have given a generic algorithm for finding the optimal medloss estimators. In the Frequentist context, we have verified that median-based methods can outperform the usual expectation-based methods outside the normal error context for regression and model selection. In separate work, we have obtained the asymptotic behavior of posterior medloss estimators and argued that they are more generally appropriate than either the LMS or the least trimmed squares estimators. Combined with the work here showing that medloss procedures are effective outside the normal case, we are led to advocate median-based procedures widely. Indeed, in future work we intend to replace means with medians in a variety of contexts, and we expect improvements similar to those found here. Systematic replacement of means by medians may make numerous modeling strategies more effective for the complex data sets that are increasingly common.

2.9 References

Allais, M. (1953). Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'école américaine. Econometrica 21 503–546.
Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H., and Tukey, J.W. (1972). Robust Estimates of Location: Survey and Advances. Princeton University Press, Princeton, New Jersey.
Blyth, C.R. (1986). Convolutions of Cauchy distributions. Amer. Math. Monthly 93 645–647.
Casella, G. and Wells, M.T. (1993). Is Pitman closeness a reasonable criterion? Comment. J. Amer. Statist. Assoc. 88 70–71.
Ellsberg, D. (1961). Risk, ambiguity, and the Savage axioms. Quarterly Journal of Economics 75 643–669.
Hampel, F.R. (1975). Beyond location parameters: robust concepts and methods. Bulletin of the International Statistical Institute 46 375–382.
Hens, T. (1992).
A note on Savage's theorem with a finite number of states. Journal of Risk and Uncertainty 5 63–71.
Hwang, J.T. (1985). Universal domination and stochastic domination: estimation simultaneously under a broad class of loss functions. Ann. Statist. 13(1) 295–314.
Keating, J.P., Mason, R.L. and Sen, P.K. (1993). Pitman's Measure of Closeness: A Comparison of Statistical Estimators. Society for Industrial and Applied Mathematics, Philadelphia.
Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18(1) 191–219.
Lehmann, E.L. (1983). Theory of Point Estimation, 1st ed. Wiley, New York.
Machina, M. and Schmeidler, D. (1992). A more robust definition of subjective probability. Econometrica 60(4) 745–780.
Manski, C.F. (1988). Ordinal utility models of decision making under uncertainty. Theory and Decision 25(1) 79–104.
Pitman, E.J.G. (1937). The "closest" estimates of statistical parameters. Proc. Cambridge Philos. Soc. 33 212–222.
Robert, C.P., Hwang, J.T. and Strawderman, W.E. (1993). Is Pitman closeness a reasonable criterion? J. Amer. Statist. Assoc. 88 57–63.
Rostek, M.J. (2007). Quantile maximization in decision theory. Unpublished manuscript.
Rousseeuw, P.J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880.
Rukhin, A.L. (1978). Universal Bayes estimators. Ann. Statist. 6(6) 1345–1351.
Savage, L.J. (1954). The Foundations of Statistics. Wiley, New York.
Stromberg, A.J. (1995). Consistency of the least median of squares estimator in nonlinear regression. Comm. Statist. Theory Methods 24(8) 1971–1984.
Sung, N.K. (1988). A Cramér–Rao analogue for median-unbiased estimators. Preprint Series 88-29, Department of Statistics, Iowa State University, Ames, Iowa.
Sung, N.K. (1990). A generalized Cramér–Rao analogue for median-unbiased estimators. J. Multivariate Anal. 32 204–212.
Von Neumann, J. and Morgenstern, O. (1947). The Theory of Games and Economic Behaviour, 2nd ed. (1st ed. 1944).
Princeton University Press, Princeton.
Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics 10(4) 299–326.
Wald, A. (1950). Statistical Decision Functions. John Wiley, New York.

Chapter 3
Asymptotics of Bayesian Median Loss Estimation

(A version of this chapter has been submitted for publication. Yu, C.W. and Clarke, B. Asymptotics of Bayesian Median Loss Estimation.)

3.1 Introduction

Conventional statistical procedures, such as estimation and hypothesis testing, can be embedded in the expected-loss framework of Wald's statistical decision theory (Wald, 1939, 1950). However, Yu and Clarke (2008a) argue that taking the expectation of the loss is inappropriate in a predictive context, because the nonnegativity of the loss implies that the loss, as a random variable, will typically be strongly right-skewed. This means that the expected loss, i.e. the risk, will in general not be representative of the bulk of the distribution of the loss. Thus, the authors suggest using the median of the loss (hereafter medloss) because, in terms of prediction, it helps avoid overprediction and underprediction. They propose a median analog of the Bayes estimator, called the posterior medloss estimator, which minimizes the median of the loss with respect to the posterior. That is, the posterior medloss estimator is

$$ \delta(x^n) = \arg\min_{a \in D}\ \operatorname{med}_{\pi(\Theta|x^n)} L(a(x^n), \Theta), \tag{3.1} $$

where $x^n = \{x_i : i = 1, \ldots, n\}$ are the realizations of the $n$ $d$-dimensional random vectors $X^n = \{X_i \in \mathbb{R}^d : i = 1, \ldots, n\}$, $L(a, \theta)$ is a nonnegative loss function, $a(x^n)$ is the estimate of $\theta$, $D \subset \mathbb{R}^p$ is the decision space, and $\operatorname{med}_{\pi(\Theta|x^n)} L$ is the median of the loss $L$ under the posterior density $\pi(\cdot|x^n)$ of $\Theta \in \mathbb{R}^p$ given $x^n$. In particular, in regression problems,

$$ y_i = h(x_i, \beta_0) + u_i, \quad i = 1, \ldots, n, \tag{3.2} $$

where $y_i$, $x_i$ and $u_i$ are the realizations of random elements $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$ and $U_i \in \mathbb{R}$, respectively, and $h$ is a known function in a class of functions $H$. Assume that the true parameter $\beta_0$ is an interior point of the parameter space $B$, an open subset of $\mathbb{R}^p$, and that $\beta \in B$ is a random vector with prior density $\pi$. Furthermore, we suppose that the $(x_i, u_i)$ are independently sampled from a probability distribution $P$ on $\mathbb{R}^d \times \mathbb{R}$. Then, to find the corresponding posterior medloss estimator (3.1), we only have to derive the posterior density of $\beta$ given $y$ and $x$. In this paper, we establish the asymptotic properties of the posterior medloss estimator in finite dimensions.

In the Frequentist context, one of the most common methods for estimating the regression coefficients $\beta_0$ is the least squares (LS) approach, which minimizes the sum of the squared residuals. It is well known that the LS estimator has many nice properties, such as $\sqrt{n}$-consistency and asymptotic normality. However, it is highly sensitive to outliers. To overcome this problem, there are numerous alternative robust approaches. One of them is the least median of squares (LMS) estimator, first introduced by Hampel (1975, page 380) and then developed by Rousseeuw (1984). The LMS estimator has a 50% breakdown point; that is, 50% is the smallest fraction of the data that must be contaminated to force the LMS estimator to move by an arbitrarily large amount. The LMS estimator minimizes the median of the squared residuals, i.e.

$$ \hat\beta_n^{LMS} = \arg\min_{\beta}\ \operatorname{med}_{1 \le i \le n} [y_i - h(x_i, \beta)]^2. \tag{3.3} $$

Regarding the asymptotic behavior of the LMS estimator, Rousseeuw (1984) provides a heuristic proof of its $n^{1/3}$ rate of convergence in linear models, applying arguments similar to those of Andrews et al. (1972) for the shorth estimator; the rigorous proof is given by Kim and Pollard (1990).
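In practice the LMS criterion (3.3) has no closed form and is usually minimized by search; one common heuristic fits exact lines through random pairs of observations and keeps the candidate with the smallest median squared residual. A minimal sketch for a simple linear model (the pair-search is an illustrative heuristic, not the estimator analyzed below):

```python
import numpy as np

rng = np.random.default_rng(4)

def lms_line(x, y, n_trials=2000):
    """Approximate least-median-of-squares fit of y = b0 + b1*x by
    searching over exact fits to random pairs of observations."""
    n = len(y)
    best, best_med = (0.0, 0.0), np.inf
    for _ in range(n_trials):
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        med = np.median((y - b0 - b1 * x) ** 2)  # the criterion in (3.3)
        if med < best_med:
            best, best_med = (b0, b1), med
    return best

x = rng.uniform(size=60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=60)
y[:10] += 8.0                     # contaminate 10 of 60 observations
b0, b1 = lms_line(x, y)           # the fit survives the outliers
```

Because the criterion is a median, the ten contaminated points cannot drag the fitted line away from the bulk of the data, illustrating the 50% breakdown point.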
In nonlinear regression models, Stromberg (1995) proves the consistency of the LMS estimator. To improve on the slow rate of convergence of the LMS estimator, we can consider the least trimmed squares (LTS) estimator (see Rousseeuw and Leroy, 2003). The one-sided LTS estimator is defined by

$$ \hat\beta_{n,1}^{(LTS,\tau)} = \arg\min_{\beta} \sum_{i=1}^{\tau} r_{[i]}^2(\beta), \tag{3.4} $$

where $r_{[i]}^2(\beta)$ denotes the $i$th order statistic of the squared residuals $r_i^2(\beta) = \{y_i - h(x_i, \beta)\}^2$, and the trimming constant $\tau$ satisfies $\frac{n}{2} < \tau \le n$. Its consistency and asymptotic normality for nonlinear regression can be found in Čížek (2004, 2005). Comparing the Bayesian and Frequentist versions of the medloss estimators, Yu and Clarke (2008b, 2008c) extend Kim and Pollard's and Čížek's results to nonlinear models and to two-sided trimming, respectively. It turns out that the posterior medloss estimator not only retains the high rate of convergence and asymptotic normality, but also avoids problems with inliers. By contrast, the LMS is susceptible to problems from inliers, since it is a function of a small number of seemingly representative points and has a slow rate of convergence. In general, problems from overly influential points, which may or may not be inliers, become increasingly serious as the dimension of the data or parameter increases.

Medloss estimators in both the Bayesian and Frequentist contexts have good robustness properties (to the loss and, in some senses, to the data) as well as giving good prediction. Therefore, we suggest that medloss estimators are most appropriate when the underlying distribution, of an error term for instance, is heavy tailed or asymmetric. The rest of this chapter is organized as follows. In Section 2, we present our main result on the asymptotics of the posterior medloss estimator. The extension of the LMS and LTS results is given in Section 3.
Section 4 discusses the comparison of the posterior medloss estimator with the LMS and LTS estimators.

3.2 Main results

We establish the asymptotic behavior of the posterior medloss estimator $\delta_n$ in finite dimensions in four steps. First, we use the asymptotic normality of the maximum likelihood estimator (MLE) $\hat\theta_n$ to identify the limiting distribution. Second, the convergence of the posterior density to the normal in total variation is used to show the convergence of their spatial medians. Third, we prove that $\delta_n$ can be approximated by $\hat\theta_n$ up to an error of $o_p(n^{-1/2})$. Finally, Slutsky's theorem gives the result we want for $\delta_n$.

Since there are numerous results on the asymptotic normality of the finite-dimensional MLE, it is enough here to quote them without proof. For instance, the following lemma from Schervish (1997) gives conditions under which the MLE is asymptotically multivariate normal in general parametric families.

Lemma 9. Let $\Omega$ be a subset of $\mathbb{R}^p$, and let $\{X_i \in \mathbb{R}^d : i = 1, 2, \ldots\}$ be conditionally IID given $\Theta = \theta \in \mathbb{R}^p$, each with density $f_{X_1|\Theta}(\cdot|\theta)$. Let $\hat\theta_n$ be an MLE and assume that it converges to $\theta$ in $P_\theta$ for all $\theta$. Assume that $f_{X_1|\Theta}(\cdot|\theta)$ has continuous second partial derivatives with respect to $\theta$ and that differentiation can be passed under the integral sign. Suppose that there exist functions $M_{r_1}(x, \theta)$ such that, for each interior point $\theta_0$ of $\Omega$ and each $k, j$, we have

$$ \sup_{\|\theta - \theta_0\| \le r_1} \left| \frac{\partial^2}{\partial\theta_k \partial\theta_j} \log f_{X_1|\Theta}(x|\theta_0) - \frac{\partial^2}{\partial\theta_k \partial\theta_j} \log f_{X_1|\Theta}(x|\theta) \right| \le M_{r_1}(x, \theta_0), $$

with $\lim_{r_1 \to 0} E_{\theta_0} M_{r_1}(X, \theta_0) = 0$. We also assume that the Fisher information matrix $I_{X_1}(\theta_0)$ is finite and nonsingular. Then, under $P_{\theta_0}$,

$$ \sqrt{n}\,(\hat\theta_n - \theta_0) \overset{L}{\to} N\big(0, I_{X_1}^{-1}(\theta_0)\big). \tag{3.5} $$

Next we turn to the convergence of the posterior density to the normal in the finite-dimensional case.
As Lehmann (1983) mentions, it is not enough to impose the conditions of Lemma 9 on $\log f_{X_1|\Theta}(x|\theta)$ only in a neighborhood of $\theta_0$, as is typically done in asymptotic results. We also have to control the behavior of $\log f_{X_1|\Theta}(x|\theta)$ at a distance from $\theta_0$, because Bayesian approaches involve the whole range of $\theta$ values. Again, there are numerous results; see Borwanker et al. (1971), Lehmann (1983), Ghosh and Ramamoorthi (2003) and Prakasa Rao (1987) for univariate cases, and Schervish (1997) for multivariate cases. Here, we use the following.

Lemma 10. In addition to the assumptions of Lemma 9, suppose that for any $r_3 > 0$ there exists an $\epsilon > 0$ such that

$$ P_{\theta_0}\left\{ \sup_{\|\theta - \theta_0\| > r_3} \frac{1}{n}\big(L_n(\theta) - L_n(\theta_0)\big) \le -\epsilon \right\} \to 1, $$

where $L_n(\theta) = \sum_{i=1}^{n} \log f_{X_1|\Theta}(x_i|\theta)$. Assume also that the prior has a density $\pi(\theta)$ with respect to Lebesgue measure which is continuous and positive at $\theta_0$. Then, as $n \to \infty$,

$$ \int_{\mathbb{R}^p} \Big| \pi^*(t|x^n) - (2\pi)^{-p/2} |I_{X_1}(\theta_0)|^{1/2} \exp\{-t^T I_{X_1}(\theta_0) t/2\} \Big|\, dt \overset{P_{\theta_0}}{\longrightarrow} 0, \tag{3.6} $$

where $x^n = \{x_i : i = 1, \ldots, n\}$ and $\pi^*(\cdot|x^n)$ is the posterior density of $T = \sqrt{n}(\Theta - \hat\theta_n(x^n))$.

To prove our main result, we also need Lemma 11 below on convergence in quantile.

Lemma 11. For any distribution function $F(\cdot)$, the quantile function is $Q(t) \overset{\text{def}}{=} F^{-1}(t) = \inf\{x : F(x) \ge t\}$, for $0 < t < 1$. Denote by $Q_n$ the quantile function associated with the distribution function $F_n$, for each $n \ge 0$. If $Q_n(t) \to Q_0(t)$ at each continuity point $t$ of $Q_0$ in $(0,1)$, then $Q_n$ is said to converge in quantile to $Q_0$, denoted $Q_n \overset{Q}{\to} Q_0$. We have

$$ F_n \overset{L}{\to} F_0 \iff Q_n \overset{Q}{\to} Q_0. $$

Proof. See Proposition 3.1 of Chapter 7 in Shorack (2000).

Now we can establish our asymptotic results for the posterior medloss estimator in finite dimensions.

Theorem 4. Suppose that the assumptions of Lemma 10 hold with the convergence of the MLE $\hat\theta_n$ to $\theta_0$ a.s., i.e. $\hat\theta_n \to \theta_0$ a.s. $P_0 = P_{\theta_0}$.
Further, let $\delta_n = \delta_n(x^n)$ be the posterior medloss estimator of $\theta \in \mathbb{R}^p$ for all realizations $\{x_i : i = 1, \ldots, n\}$ of $\{X_i \in \mathbb{R}^d : i = 1, \ldots, n\}$ and all $n$, with respect to a nonnegative loss function $L(\theta, a)$ satisfying the following conditions:

(i) $L(\theta, a) = l(\theta - a) \ge 0$;
(ii) $l(t_1) \ge l(t_2)$ if $\|t_1\| \ge \|t_2\|$.

Moreover, suppose that there exist a nonnegative sequence $\{a_n\}$ and a continuous function $K : \mathbb{R}^p \to \mathbb{R}$ such that

(iii) for any real-valued vector $c_n$ depending on $n$,

$$ \lim_{n \to \infty} \Big| \operatorname{med}_{T|X^n}\big[a_n\, l\big((T + c_n)/n^{1/2}\big)\big] - \operatorname{med}_{T|X^n}\big[K(T + c_n)\big] \Big| = 0, $$

where $T = \sqrt{n}(\Theta - \hat\theta_n)$.

If $Z$ has the normal distribution $N(0, I_{X_1}^{-1}(\theta_0))$, i.e. the limiting distribution of the posterior density in Lemma 10, suppose further that

(iv) $1/2$ is a continuity point of the distribution of $K(Z)$, and
(v) $\operatorname{med}_Z K(Z + m)$ has a unique minimum at $m = 0$, where $\operatorname{med}_Z$ is the median with respect to $Z$.

Then we have

$$ \delta_n \to \theta_0 \ \text{a.s.}\ P_0 \quad \text{and} \quad n^{1/2}(\theta_0 - \delta_n) \overset{L}{\to} N\big(0, I_{X_1}^{-1}(\theta_0)\big). $$

Remark 1: It will be seen that the asymptotic normality of the MLE is central to the proof of Theorem 4. Indeed, any time we have sufficient conditions for the MLE to be asymptotically normal, we have a corresponding result for the posterior medloss estimator. For instance, in generalized linear models, if the first and second conditional moments of the response variable (given the explanatory variables) exist, then, as in Fahrmeir and Kaufmann (1985), we get asymptotic normality of the MLE and hence, by our Theorem 4, an asymptotic normality result for posterior medloss estimators in generalized linear models. Similarly, when asymptotic normality holds for nonlinear models, see Gallant (1987), we obtain, via Theorem 4, an asymptotic normality result for posterior medloss estimators in nonlinear models.
Finally, the results of Koenker and Bassett (1978) and Bassett and Koenker (1986) can be used to obtain asymptotic normality of the posterior medloss estimator, via Theorem 4, for quantile regression models.

Remark 2: The assumptions of Lemma 9 and Lemma 10 are used only to obtain the asymptotic normality of the MLE and of the posterior density. We do not use those assumptions again in the proof of the asymptotics of our posterior medloss estimator below.

Remark 3: Our results in Theorem 4 can easily be extended to the Markov chain setting by using arguments similar to those of Borwanker et al. (1971) for asymptotic Bayesian inference for Markov chains.

Proof. We prove Theorem 4 in three steps. The first shows that $n^{1/2}(\hat\theta_n - \delta_n)$ is finite a.s., and the second shows that it goes to 0 a.s. $P_0$. Then we complete the proof by using Slutsky's theorem and the asymptotic normality of $\hat\theta_n$. Denote by $M_n(a) = \operatorname{med}_{\pi(\Theta|X^n)} L(\theta, a)$ the posterior medloss with respect to $L(\theta, a)$.

1. First,

$$ \limsup_n a_n M_n(\delta_n) \le \limsup_n a_n M_n(\hat\theta_n) = \limsup_n \operatorname{med}_{T|X^n}\big[a_n l(T/n^{1/2})\big]. $$

Moreover,

$$ \Big| \operatorname{med}_{T|X^n}\big[a_n l(T/n^{1/2})\big] - \operatorname{med}_Z[K(Z)] \Big| \le \Big| \operatorname{med}_{T|X^n}\big[a_n l(T/n^{1/2})\big] - \operatorname{med}_{T|X^n}[K(T)] \Big| + \Big| \operatorname{med}_{T|X^n}[K(T)] - \operatorname{med}_Z[K(Z)] \Big| \to 0. $$

The first term goes to zero by condition (iii) on the loss function. By (3.6), $T$ converges in distribution to $Z$, which implies that $K(T)$ converges to $K(Z)$ in distribution by the continuous mapping theorem. So, using Lemma 11 with condition (iv), we have $\operatorname{med}_{T|X^n} K(T) \to \operatorname{med}_Z K(Z)$, and therefore the second term converges to zero. Thus,

$$ \limsup_n a_n M_n(\delta_n) \le \limsup_n a_n M_n(\hat\theta_n) \le \operatorname{med}_Z K(Z). \tag{3.7} $$

2. Let $W_n = n^{1/2}(\hat\theta_n - \delta_n)$. Now we show that $\limsup_n |W_n| < \infty$ a.s. First, suppose that the statement $\limsup_n |W_n| < \infty$ a.s.
is false. Then for every positive vector $M$, there exists a set $A_M$ with $P_\theta(A_M) > 0$ such that $W_n(x) > M$ or $W_n(x) < -M$ i.o. for $x \in A_M$. Without loss of generality, we may assume that $W_n(x) > M$ i.o. Then, for the subsequence $\{n_i\}$ along which the inequality holds, we have

$$ a_{n_i} M_{n_i}(\delta_{n_i}) = a_{n_i} \operatorname{med}_{\pi(\Theta|X^{n_i})} l(\Theta - \delta_{n_i}) = \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] \ge \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) I_{\{T + M \ge 0\}} \Big] \ge \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + M}{n_i^{1/2}}\Big) I_{\{T + M \ge 0\}} \Big] \to \operatorname{med}_Z\big[ K(Z + M) I_{\{Z + M \ge 0\}} \big]. $$

The first inequality holds because $l(X) I_{\{X \in B\}} \le l(X)$ for any nonnegative random vector $X$ and any indicator function $I$ with any set $B$. The second inequality holds by the assumption that $W_{n_i}(x) > M$ i.o. and condition (ii), with $T + W_{n_i} > T + M \ge 0$. The convergence follows from arguments similar to those used for the convergence of $\operatorname{med}_{T|X^n}[a_n l(T/n^{1/2})]$ to $\operatorname{med}_Z[K(Z)]$ in Step 1. By Tomkins' median version of the Lebesgue dominated convergence theorem (Tomkins, 1978) and condition (v) of Theorem 4, we have

$$ \lim_{M \to +\infty} \operatorname{med}_Z\big[ K(Z + M) I_{\{Z + M \ge 0\}} \big] = \operatorname{med}_Z\Big[ \lim_{M \to +\infty} K(Z + M) I_{\{Z + M \ge 0\}} \Big] = K(+\infty) > \operatorname{med}_Z K(Z). $$

Therefore, for large $M$, on a set of positive probability, $\limsup_n a_n M_n(\delta_n) > \operatorname{med}_Z K(Z) \ge \limsup_n a_n M_n(\hat\theta_n)$, which contradicts the definition of $\delta_n$. Thus $\limsup_n W_n < \infty$ a.s. $P_0$. Similarly, $\liminf_n W_n > -\infty$ a.s. $P_0$.

Next, for any $\epsilon > 0$, denote by $B_M$ a set such that, for $x \in B_M$, $-M \le W_n \le M$ for every $n$, and $P_\theta(B_M) > 1 - \epsilon$. For a fixed $x \in B_M$, $W_n(x)$ is a bounded sequence, so it has a limit point $m$. Assume that $m \ne 0$. Then, for the subsequence $\{n_i\}$ along which $W_{n_i}(x) \to m$, we have

$$ \liminf_{n_i} a_{n_i} M_{n_i}(\delta_{n_i}) = \liminf_{n_i} \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] \ge \operatorname{med}_Z K(Z + m) - \limsup_{n_i} \Big| \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] - \operatorname{med}_Z K(Z + m) \Big|. $$

Note that

$$ \Big| \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] - \operatorname{med}_Z K(Z + m) \Big| \le \Big| \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] - \operatorname{med}_{T|X^{n_i}} K(T + W_{n_i}) \Big| + \Big| \operatorname{med}_{T|X^{n_i}} K(T + W_{n_i}) - \operatorname{med}_Z K(Z + m) \Big|. $$

Then, using condition (iii) and arguments similar to those for the convergence of $\operatorname{med}_{T|X^n}[K(T)]$ to $\operatorname{med}_Z[K(Z)]$ in Step 1, we have

$$ \Big| \operatorname{med}_{T|X^{n_i}}\Big[ a_{n_i} l\Big(\tfrac{T + W_{n_i}}{n_i^{1/2}}\Big) \Big] - \operatorname{med}_Z K(Z + m) \Big| \le \epsilon. $$

Thus

$$ \liminf_{n_i} a_{n_i} M_{n_i}(\delta_{n_i}) \ge \operatorname{med}_Z K(Z + m) - \epsilon > \operatorname{med}_Z K(Z) - \epsilon \quad \text{(by condition (v))}. $$

Since $\epsilon$ is arbitrary, we get $\liminf_{n_i} a_{n_i} M_{n_i}(\delta_{n_i}) > \operatorname{med}_Z K(Z)$, which is impossible by (3.7). Therefore $m = 0$ and $n^{1/2}(\delta_n - \hat\theta_n) \to 0$ a.s. $P_0$.

3. Finally, the proof is completed by observing that

$$ n^{1/2}(\delta_n - \theta_0) = n^{1/2}(\delta_n - \hat\theta_n) + n^{1/2}(\hat\theta_n - \theta_0) \overset{L}{\to} N\big(0, I_{X_1}^{-1}(\theta_0)\big). $$

Note that conditions (i), (ii) and (iii) hold for $L_1$ loss with $a_n = n^{1/2}$ and $K(t) = \|t\|$. Further, $Z$ has a multivariate normal distribution with median 0, so conditions (iv) and (v) are satisfied. Therefore, we have the following result.

Corollary 5. Suppose the assumptions for the asymptotic normality of the MLE and of the posterior density hold. Consider any continuous posterior density of $\Theta$ given $X^n = x^n$ with $L_1$ loss, i.e. $L(\theta, a) = \|\theta - a\|$. Assume that the median of the $L_1$ loss is unique. Then

$$ \delta_n \to \theta_0 \ \text{a.s.}\ P_0 \quad \text{and} \quad n^{1/2}(\theta_0 - \delta_n) \overset{L}{\to} N\big(0, I_{X_1}^{-1}(\theta_0)\big). $$

More generally, the above result still holds if the $L_1$ loss is replaced by any strictly increasing function of $\|\Theta - d(x^n)\|$ whose median is unique.

3.3 Asymptotics for two related estimators

In this section we state asymptotic results for the LMS and LTS estimators in the regression context. First, in linear regression models, Kim and Pollard (1990) deduce a limiting Gaussian process for the LMS estimator. Yu and Clarke (2008b) extend this result to nonlinear cases, as shown in Theorem 5 below.
Consider the nonlinear regression model (3.2). We have the following result.

Theorem 5. Suppose

1. $\dim(H)$ is finite, where $H$ is the vector space of real-valued regression functions;
2. $X_i$ and $u_i$ are independent for $i = 1, \ldots, n$;
3. $h(x_i, \beta)$ is continuous in $\beta \in B$ and is once differentiable in $\beta$ on a neighborhood of $\beta_0$;
4. $Q_h = E_X[h'(X, \beta_0) h'(X, \beta_0)^T]$ is positive definite;
5. $u_i$ has a bounded, symmetric density $\gamma$ that decreases away from its mode at zero, and $\gamma$ has a strictly negative derivative at $r_0$, the unique median of $|u|$;
6. every $h \in H$ satisfies the Lipschitz condition $|h(X, \beta_1) - h(X, \beta_2)| \le L_X \|\beta_1 - \beta_2\|$, where $L_X > 0$ depends on $X$ and $E_X(L_X) < \infty$;
7. $E_X \|h'(X, \xi)\| < \infty$ for $\xi \in U(\beta_0, R)$, where $U(a, b)$ denotes the open ball with center $a$ and radius $b$, and $R$ is defined for the envelope $G_R$;
8. $E_X |h'(X, \beta_0)^T w| \ne 0$ for any $w \ne 0$.

Then $n^{1/3}(\hat\beta_n^{LMS} - \beta_0)$ converges in distribution, as $n \to \infty$, to the argmax over $\theta$ of the Gaussian process

$$ Z(\theta) = \gamma'(r_0)\, \theta^T Q_h \theta + W(\theta), $$

where $\theta = \beta - \beta_0$ and the Gaussian process $W$ has zero mean and continuous sample paths.

Proof. This theorem is the case $q = 1/2$ in Yu and Clarke (2008b), or in Chapter 4 of this thesis.

This result shows that in nonlinear regression models the LMS estimator also has a slow rate of convergence. Note that, since it is based on a median, it can also be viewed as a trimmed mean estimator with a trimming proportion approaching 50% on each side. Clearly, the more the trimming, the fewer data points contribute directly to the estimator. Consequently, the rate of convergence slows from root-$n$ to cube-root-$n$. Consistent with this intuition, relaxing the trimming proportion restores the $n^{1/2}$ rate of convergence and asymptotic normality.
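The rate intuition, that heavier trimming leaves fewer effective observations and so inflates variability, can be seen even in the simple location case. A small Monte Carlo sketch (normal data, symmetric two-sided trimming; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def trimmed_mean(x, prop):
    """Average after discarding a fraction `prop` of the ordered
    sample on each side (symmetric two-sided trimming)."""
    x = np.sort(x)
    k = int(round(len(x) * prop))
    return x[k:len(x) - k].mean()

# sampling standard deviation of the estimator vs. trimming proportion
n, reps = 100, 2000
sd = {}
for prop in (0.0, 0.25, 0.45):
    ests = [trimmed_mean(rng.standard_normal(n), prop) for _ in range(reps)]
    sd[prop] = float(np.std(ests))
```

For normal data the untrimmed mean is most efficient, and the sampling variability grows as the trimming proportion approaches 50% per side, mirroring the loss of efficiency (and, at the LMS extreme, of the root-$n$ rate) discussed above.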
The justification for trimming extreme values on both sides of the squared residuals is that we want to avoid over-dependence on data points that fit our model unrealistically well, a concern that is more important in flexible, complex models than in simple ones. This is not as perverse as it sounds, because we want to compare our medloss method to the LTS estimator with near-50% trimming on each side. The clearly silly alternative would be to use 99% trimming on one side: for any complex model there will certainly be such points, and they are essentially meaningless. At least with trimming on each side, we have the hope of reducing to points that are representative of the fit of the model to typical points. However, even in this case, the points with the greatest pretense of fitting the model will not typically give good estimates of the regression coefficients $\beta$, because of data sparsity. That is, the data points that are most representative of the fit are few and, while valid, will not in general be satisfactorily representative of the overall data-generating mechanism, forcing the estimates to be unstable. Otherwise put, rerunning the experiment may result in a very different small collection of data points being found representative of the fit, giving different estimates of $\beta$. This problem becomes more and more serious as the dimensions of the data and parameter increase. It is well known that data sparsity increases as dimension increases; this is a variation on the Curse of Dimensionality. As a result of this reasoning, Yu and Clarke (2008c) used the two-sided LTS estimator in nonlinear models and established its limiting behavior. The proofs of these results rely heavily on the asymptotics of the one-sided LTS estimators proved by Čížek (2004, 2005). Consider the nonlinear regression model (3.2).
The two-sided LTS estimator is defined by

$$ \hat\beta_n^{(LTS,\tau)} = \arg\min_{\beta} \sum_{i=n-\tau+1}^{\tau} r_{[i]}^2(\beta), \tag{3.8} $$

where $r_{[i]}^2(\beta)$ denotes the $i$th order statistic of the squared residuals $r_i^2(\beta) = \{y_i - h(x_i, \beta)\}^2$, and the trimming constant $\tau$ satisfies $\frac{n}{2} < \tau \le n$. Denote by $F$ and $G$ the distribution functions of $u_i$ and $u_i^2$, the corresponding pdfs by $f$ and $g$, and the quantile functions by $F^{-1}$ and $G^{-1}$, respectively. The choice of the trimming constant $\tau$ depends on the sample size $n$, so consider a sequence of trimming constants $\tau_n$. Since $\tau_n/n$ determines the fraction of the sample included in the LTS objective function, we choose a sequence with $\tau_n = [\lambda n]$, where $[z]$ denotes the integer part of $z$, so that $\tau_n/n \to \lambda$ for some $1/2 < \lambda \le 1$.

Now we state our two results on the consistency and asymptotic normality of $\hat\beta_n^{(LTS,\tau_n)}$. The required assumptions and the full proofs are given in Appendix B.

Theorem 6 (Consistency). Under assumptions TD, TH1, TH5 and TI in Appendix B, and for $\lambda \in (1/2, 1)$, the two-sided LTS estimator $\hat\beta_n^{(LTS,\tau_n)}$ minimizing (3.8) is weakly consistent, i.e.

$$ \hat\beta_n^{(LTS,\tau_n)} \overset{p}{\to} \beta_0, \quad \text{as } n \to \infty. $$

If, in addition, all conditions of Assumption TH are satisfied, then $\hat\beta_n^{(LTS,\tau_n)}$ is $\sqrt{n}$-consistent, i.e.

$$ \sqrt{n}\big(\hat\beta_n^{(LTS,\tau_n)} - \beta_0\big) = O_p(1), \quad \text{as } n \to \infty. $$

Theorem 7 (Asymptotic Normality). Suppose that assumptions TD, TH and TI are satisfied, $C_\lambda \ne 0$, and $\lambda \in (1/2, 1)$. Then

$$ \sqrt{n}\big(\hat\beta_n^{(LTS,\tau_n)} - \beta_0\big) \overset{L}{\to} N(0, V_{2\lambda}), $$

where $V_{2\lambda} = (C_\lambda)^{-2} \sigma_{2\lambda}^2 Q_h^{-1}$, $Q_h = E_X[h'(X, \beta_0) h'(X, \beta_0)^T]$, $C_\lambda = (2\lambda - 1) + \Big(\frac{\sqrt{q_\lambda} + \sqrt{q_{1-\lambda}}}{2}\Big)[H(\lambda) - H(1-\lambda)]$, $H(\lambda) = f(\sqrt{q_\lambda}) + f(-\sqrt{q_\lambda})$, $q_\lambda = G^{-1}(\lambda)$, and $\sigma_{2\lambda}^2 = E\, u_i^2 I_{[G^{-1}(1-\lambda),\, G^{-1}(\lambda)]}(u_i^2)$.

The proofs of these results are substantially the same as those in Čížek (2004, 2005) for one-sided LTS estimators, so we omit the details.
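Before turning to the proof strategy, note that the two-sided objective (3.8) is easy to write down explicitly. The sketch below evaluates it for a simple linear choice of $h$ and checks that the true parameters give a smaller trimmed sum than a grossly wrong choice (the linear model and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def two_sided_lts_objective(beta, x, y, tau):
    """Sum of the middle order statistics r^2_[n-tau+1],...,r^2_[tau]
    of the squared residuals, as in (3.8)."""
    n = len(y)
    r2 = np.sort((y - beta[0] - beta[1] * x) ** 2)
    return r2[n - tau:tau].sum()   # trims both tails of the residuals

x = rng.uniform(size=40)
y = 1.0 + 2.0 * x + rng.standard_normal(40)
val_true = two_sided_lts_objective((1.0, 2.0), x, y, tau=30)  # n/2 < tau <= n
val_off = two_sided_lts_objective((5.0, -3.0), x, y, tau=30)
```

Minimizing this objective over $\beta$ (by random restarts or a concentration-step algorithm) yields the two-sided LTS estimate.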
We only extend the required lemmas and propositions for Čížek's one-sided LTS estimator to our two-sided situation. Since the objective function giving the two-sided LTS estimator is not differentiable, we consider the behavior of the ordered residual statistics (Lemmas 18 and 19 in Appendix B). Given this, the proof of the asymptotic linearity of the corresponding LTS normal equations, as stated in Proposition 3, can be given. Then, combining these results with the uniform law of large numbers (Lemma 17) and stochastic equicontinuity for mixing processes, we can prove the consistency and rate of convergence of the two-sided LTS estimates (Theorem 6). Finally, using Proposition 4 in Appendix B, the proof of the asymptotic normality of the two-sided LTS estimate (Theorem 7) follows from the consistency and asymptotic linearity of the LTS normal equations. Here we remark that the √n-convergence and asymptotic normality are preserved for the two-sided LTS estimator, but it is inefficient: the asymptotic variance V_{2λ} increases as we trim more and more, i.e. as λ tends to 1/2.

3.4 A comparison of the posterior medloss, LMS, and LTS estimators

To conclude, we argue that the posterior medloss estimator is a better choice than either the LMS or the LTS. First, the posterior medloss estimator is better than the LMS estimator because we get the √n rate rather than the n^{1/3} rate of the LMS. Second, compared to the LTS, the medloss has two advantages, the second arguably following from the first. The first big advantage the posterior medloss has over the LTS is that all data enter the estimator in the same way for the posterior medloss. Thus, IID data are genuinely treated as interchangeable. This also means that the influence of a small amount of dubious data will tend to attenuate, unless the posterior depends on statistics like the mean that are inherently somewhat unstable.
(This happens in the normal likelihood and normal prior case, for example.) The second big advantage the posterior medloss has over the LTS is that the LTS zeroes in on a small subset of the data, effectively ignoring the rest. One can argue that this provides some measure of robustness to the LTS in that bad data are ignored. In small dimensions this argument may be valid; however, it ignores the problem of data sparsity, which becomes ever more serious as the data and parameter dimensions increase. More precisely, in high dimensional contexts, where using nonlinear models is most important, the effect of data sparsity is that data tend to form in clumps with vast regions empty of data between them. This means that even reasonable data may be spread so thinly over the space that small subsets of the data are grossly unrepresentative and lead to poor estimates. Otherwise put, the regions of space where we do not have data are so vast that the regions for which data are available make our conclusions dependent on a small number of incompletely representative points. The consequence of this is that an LTS estimator exacerbates data sparsity and nonrepresentativity, while the posterior medloss estimator does the best it can with all the data, making it preferable. Taken together, these considerations demonstrate that the posterior medloss estimator is a good tradeoff between the rate of the LMS and the potentially weak representativity of the LTS.

Remark 1: It may be more appropriate to compare the LMS estimator with our posterior medloss estimator δ_n under regression settings. Although we do not include that in this thesis, as we mentioned before, it is still possible to get the asymptotic results of δ_n in regression problems. For instance, we can use quasi-likelihood to obtain the results in problems for GLMs.
Under some regularity conditions, the quasi-maximum likelihood estimator (QMLE) also has a √n rate of convergence and asymptotic normality. Then, as in Theorem 4, δ_n is asymptotically equivalent to the QMLE. Thus, δ_n can still have the nice properties of a √n rate of convergence and asymptotic normality in the regression settings, which the LMS estimator does not have.

Remark 2: For the comparison of δ_n with the two Frequentist estimators, we can consider the issue of the breakdown point. We know that the LMS estimator has the highest breakdown point, i.e. 50%. For the LTS, the more the data are trimmed, the higher the breakdown point. In general, it is hard to determine the breakdown point of δ_n because we have no explicit form for δ_n, although it is still possible to find its breakdown point in some cases. For instance, for the case of the normal likelihood and the normal conjugate prior, the posterior density is still normal, so the posterior mean (i.e. the usual Bayes estimator under squared error loss) and the posterior median (i.e. the posterior medloss estimator in this case) are a convex combination of the prior mean and the sample mean of x^n. Therefore, δ_n has breakdown point 0 in this case. However, outside the case that δ_n coincides with the usual Bayes estimator, we think that δ_n may have a high breakdown point. Recall that the posterior medloss estimator is a minimizer of the median of the loss with respect to the posterior density, so an outlier should not have a big effect on that median through the posterior density, because the median operator can have high resistance to changes in the posterior density induced by changes in the data.

Acknowledgments

This research was supported by NSERC operating grant number RGPIN 138122.

3.5 References

D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, J.W.
Tukey, Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, New Jersey, 1972.
G.W. Bassett, R.W. Koenker, Strong Consistency of Regression Quantiles and Related Empirical Processes, Econometric Theory 2 (1986) 191-201.
J. Borwanker, G. Kallianpur, B.L.S. Prakasa Rao, The Bernstein-Von Mises Theorem for Markov Processes, Ann. Math. Stat. 42 (1971) 1241-1253.
P. Čížek, Asymptotics of least trimmed squares regression, CentER Discussion Paper 2004-72, Tilburg University, The Netherlands, 2004.
P. Čížek, Least trimmed squares in nonlinear regression under dependence, J. Statist. Plann. Inference 136 (2005) 3967-3988.
L. Fahrmeir, H. Kaufmann, Consistency and Asymptotic Normality of the Maximum Likelihood Estimator in Generalized Linear Models, Ann. Statist. 12 (1985) 342-368.
A.R. Gallant, Nonlinear Statistical Models, John Wiley and Sons, 1987.
J.K. Ghosh, R.V. Ramamoorthi, Bayesian Nonparametrics, Springer Series in Statistics, Springer, 2003.
F.R. Hampel, Beyond Location Parameters: Robust Concepts and Methods, Bulletin of the International Statistical Institute 46 (1975) 375-382.
J. Kim, D. Pollard, Cube Root Asymptotics, Ann. Statist. 18 (1990) 191-219.
R.W. Koenker, G.W. Bassett, Regression Quantiles, Econometrica 46 (1978) 33-50.
E.L. Lehmann, Theory of Point Estimation, John Wiley and Sons, 1983.
P. Milasevic, G.R. Ducharme, Uniqueness of the Spatial Median, Ann. Statist. 15 (1987) 1332-1333.
B.K.S. Prakasa Rao, Asymptotic Theory of Statistical Inference, John Wiley and Sons, 1987.
P.J. Rousseeuw, Least Median of Squares Regression, J. Amer. Statist. Assoc. 79 (1984) 871-880.
P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley and Sons, 2003.
M.J. Schervish, Theory of Statistics, Springer Series in Statistics, Springer, 1997.
G.R. Shorack, Probability for Statisticians, Springer Texts in Statistics, Springer-Verlag, New York, 2000.
A.J.
Stromberg, Consistency of the least median of squares estimator in nonlinear regression, Comm. Statist. Theory Methods 24 (1995) 1971-1984.
R.J. Tomkins, Convergence Properties of Conditional Medians, Canad. J. Statist. 6 (1978) 169-177.
A. Wald, Contributions to the Theory of Statistical Estimation and Testing Hypotheses, Ann. Math. Stat. 10 (1939) 299-326.
A. Wald, Statistical Decision Functions, John Wiley, New York, 1950.
C.W. Yu, B. Clarke, Median Loss Analysis, Submitted, 2008a.
C.W. Yu, B. Clarke, Cube root asymptotics of the Least Quantile of Squares Estimator in Nonlinear Regression Models, Submitted, 2008b.
C.W. Yu, B. Clarke, Asymptotics of Bayesian Median Loss Estimation, Technical Report #243, Department of Statistics, University of British Columbia, September 2008c.

Chapter 4

Asymptotics of the Least Quantile of Squares Estimator in Nonlinear Models

4.1 Introduction

Consider a nonlinear regression model,

y_i = h(x_i, β) + u_i, i = 1, . . . , n,   (4.1)

where y_i, x_i and u_i are the realizations of random elements Y_i ∈ R, X_i ∈ R^d and U_i ∈ R, respectively, β ∈ B ⊂ R^p is fixed and unknown, and h is a known function in a class of functions H. Assume that the parameter space B is compact, that the true parameter β₀ is an interior point, and that (x_i, u_i) are independently sampled from a probability distribution P on R^d × R. One of the most common methods to estimate β₀ is the least squares (LS) estimator, which minimizes the sum of squares of the residuals. Although it has many nice properties, like √n-consistency and asymptotic normality, it is well known that the LS estimator is highly sensitive to outliers. To overcome this problem, there are various alternative robust approaches. (A version of this chapter has been submitted for publication: Yu, C.W. and Clarke, B., Cube root asymptotics of the Least Quantile of Squares Estimator in Nonlinear Regression Models.)
One of them is the least median of squares (LMS) estimator, first introduced by Hampel (1975, page 380) and then proposed by Rousseeuw (1984) (see also Rousseeuw and Leroy, 2003). The LMS estimator has a 50% breakdown point. That is, 50% is the smallest portion of the data that must be contaminated to force the LMS estimator to move an arbitrary amount. Analogous to the LS estimator, the LMS estimator minimizes the median of the squares of the residuals, i.e.

β̂_n^{LMS} = arg min_β median_{1≤i≤n} [y_i − h(x_i, β)]².   (4.2)

Rousseeuw (1984) also gives a heuristic proof of an n^{1/3} rate of convergence of the LMS estimator in linear models by applying arguments similar to those of Andrews et al. (1972) for the asymptotics of the shorth estimator. The rigorous proof is provided by Kim and Pollard (1990) through empirical process theory. Stromberg (1995) shows the consistency of the LMS estimator in a nonlinear regression setting. In this paper, we extend Kim and Pollard's result for the linear context to nonlinear settings. Specifically, we establish the asymptotics of the least qth quantile of squares estimator

β̂_n^q = arg min_β Q_{(i,n,q)}[y_i − h(x_i, β)]²   (4.3)

with a nonlinear regression function h and q ∈ (0, 1), where Q_{(i,n,q)}(u_i) means the sample qth quantile of {u_i : i = 1, . . . , n}, i.e. the [nq]th smallest ordered value of {u_i : i = 1, . . . , n}, with [·] the greatest integer function. The rest of this chapter is organized as follows. In Section 4.2, we state our main result. The proof is given in Section 4.3.

4.2 Main result

The asymptotic results for our LQS estimator β̂_n^q in nonlinear regression models (4.1) rely heavily on the main theorem in Kim and Pollard (1990). So, we state this theorem before giving our main result. The notion of manageability used below is defined in Section 4.3.1.

Lemma 12 (Kim and Pollard, 1990).
Consider the empirical processes

E_n g(·, θ) = (1/n) Σ_{1≤i≤n} g(η_i, θ),

where {η_i = (x_i, u_i)} is a sequence of independent observations taken from a distribution P on R^d × R and G = {g(·, θ) : θ ∈ Θ} is a class of functions indexed by a subset Θ of R^p. Define the envelope G_R(·) as the supremum of |g(·, θ)| over the class G_R = {g(·, θ) : ‖θ − θ₀‖ ≤ R}, i.e.

G_R(x_i, u_i) = sup_{g∈G_R} |g(x_i, u_i, θ)|.

Also make the following assumptions:

1. Choose a sequence of estimators {θ̂_n} for which E_n g(·, θ̂_n) ≥ sup_{θ∈Θ} E_n g(·, θ) − o_p(n^{−2/3}).

2. The sequence {θ̂_n} converges in probability to the unique θ₀ that maximizes Eg(·, θ), the expectation of g(·, θ) with respect to the distribution P.

3. The true value θ₀ is an interior point of Θ.

Let the functions g(·, θ₀) be standardized so that g(·, θ₀) = 0 and suppose that the class G_R, for R near 0, is uniformly manageable for the envelopes G_R. Then we also require:

4. Eg(·, θ) is twice differentiable with second derivative matrix −V at θ₀.

5. H(s, t) ≡ lim_{α→∞} αEg(·, θ₀ + s/α)g(·, θ₀ + t/α) exists for each s, t in R^p and lim_{α→∞} αE[g(·, θ₀ + t/α)² I{|g(·, θ₀ + t/α)| > εα}] = 0 for each ε > 0 and t ∈ R^p.

6. EG_R² = O(R) as R → 0, and for each ε > 0 there is a constant K such that E[G_R² I{G_R > K}] < εR for R near 0.

7. E|g(·, θ₁) − g(·, θ₂)| = O(|θ₁ − θ₂|) near θ₀.

Now, under the above assumptions 1-7, we have that the process n^{2/3} E_n g(·, θ₀ + tn^{−1/3}) converges in distribution to a Gaussian process Z(t) with continuous sample paths, expected value −(1/2)tᵀVt and covariance kernel H, as n → ∞. Finally, if V is positive definite and if Z has nondegenerate increments, then as n → ∞, n^{1/3}(θ̂_n − θ₀) converges in distribution to the (almost surely unique) random vector that maximizes Z.

Now we can state our main result. For simplicity, we just consider univariate explanatory data.
It is clear that our main result still holds for multivariate X. Assume that X_i and u_i are independent for i = 1, . . . , n, and that h(x_i, β) is continuous in β ∈ B and once differentiable in β on a neighborhood of β₀. It will be seen that we can show the asymptotics of the least quantile of squares estimators in nonlinear models by verifying the assumptions of Lemma 12.

Theorem 8. Suppose

1. dim(H) is finite, where H is the vector space of real-valued regression functions.

2. Q_h = E_X[h′(X, β₀)h′(X, β₀)ᵀ] is positive definite.

3. u_i has a bounded, symmetric density γ that decreases away from its mode at zero, and it has a strictly negative derivative at r₀(q), the unique qth quantile of |u|, where q ∈ (0, 1).

4. For any h ∈ H, h satisfies the Lipschitz condition, i.e. |h(X, β₁) − h(X, β₂)| ≤ L_X‖β₁ − β₂‖, where L_X > 0 depends on X, and E_X(L_X) < ∞.

5. E_X‖h′(X, ξ)‖ < ∞ for ξ ∈ U(β₀, R), where U(a, b) means an open ball centered at a with radius b, and R is as in the definition of the envelope G_R.

6. E_X|h′(X, β₀)ᵀw| ≠ 0 for any w ≠ 0.

Then we have that n^{1/3}(β̂_n^q − β₀) converges in distribution to the arg max over θ of the Gaussian process

Z(θ) = γ′(r₀(q))θᵀQ_hθ + W(θ), as n → ∞,

where θ = β − β₀ and the Gaussian process W has zero mean, covariance kernel H and continuous sample paths.

Remark: We still have no information about the maximizer of the Gaussian process Z(θ). For instance, we do not know whether or not it has finite moments.

In the following, we just outline the proof of Theorem 8. The whole proof is presented in Section 4.3. First, we recast (4.3) as a problem of constrained optimization by reparametrizing β by β₀ + θ and taking a first-order Taylor expansion of h(x, β) at β₀. Thus,

y − h(x, β) = u − h′(x, ξ)ᵀθ,   (4.4)

where ξ ∈ (β₀, β) and ξ → β₀ as θ → 0. Then define
f_{h,x,u}(θ, r, ξ) = I{|u − h′(x, ξ)ᵀθ| ≤ r}(x, u),

and

r_n = inf{r : sup_θ E_n[f_{h,x,u}(θ, r, ξ)] ≥ q}.   (4.5)

Let θ̂_n^q = β̂_n^q − β₀ be a value at which sup_θ E_n f_{h,x,u}(θ, r_n, ξ) is achieved, where E_n corresponds to the empirical version of the expectation under P. Assume that the corresponding constrained maximization (4.5) for the expectation under P has a unique solution θ₀ and r₀(q). Without loss of generality, let θ₀ = 0 and r₀(q) = 1. Since f_{h,x,u}(θ, r, ξ) can be rewritten as

I{|y − h(x, θ + β₀)| ≤ r}(x, y) = I{h(x, θ + β₀) − y + r ≥ 0 and y + r − h(x, θ + β₀) ≥ 0}(x, y),   (4.6)

we let f_{h,x,y}(θ, r) = f_{h,x,u}(θ, r, ξ) and define

g_{h,x,u}(θ, δ, ξ) = f_{h,x,u}(θ, 1 + δ, ξ) − f_{h,x,u}(0, 1 + δ, ξ).

Our main result can be established by applying the main theorem in Kim and Pollard (1990), stated as Lemma 12 here. So, it suffices to check that all the required conditions of Kim and Pollard's theorem can be satisfied in the nonlinear case for the least qth quantile estimator. We give the verifications in the next section.

4.3 Detailed proof for the LQS estimator

Here we verify that the LQS estimator in nonlinear situations satisfies the conditions of Lemma 12. Before verifying these conditions in Section 4.3.4, we need results from Sections 4.3.1, 4.3.2 and 4.3.3. Finally, we prove the asymptotic results for the LQS estimators in Section 4.3.5.

4.3.1 Manageability

Manageability, proposed by Pollard (1990), is a notion used to establish an n^{−1/3} rate of convergence for the LMS estimators, and to verify the stochastic equicontinuity conditions for showing the limiting behavior of the LMS estimators in linear models in Kim and Pollard (1990).
As explained in Pollard (1990), the concept of manageability formalizes the idea that maximal inequalities for the maximum deviation of a sum of independent stochastic processes from its expected value can be derived from uniform bounds on the random packing numbers. Formally, let F_n^ω = {(f₁(ω, t), . . . , f_n(ω, t)) : t ∈ T}, and define the packing number D(ε, F) for a subset F of a metric space with metric d as the largest m for which there exist points t₁, . . . , t_m in F with d(t_i, t_j) > ε for i ≠ j. Also, for each vector α = (α₁, . . . , α_n) of nonnegative constants and each f = (f₁, . . . , f_n) ∈ R^n, the pointwise product α ⊙ f is the vector in R^n with ith coordinate α_i f_i, and α ⊙ F is the set of all vectors α ⊙ f with f ∈ F. Then, following Pollard (1990), a triangular array of random processes {f_{ni}(ω, t) : t ∈ T, 1 ≤ i ≤ k_n} is manageable, with respect to the envelopes F_n(ω), for n = 1, 2, . . ., if there exists a deterministic function λ for which

• ∫₀¹ √(ln λ(x)) dx < ∞, and

• the random packing number D(x|α ⊙ F_n(ω)|, α ⊙ F_n^ω) ≤ λ(x) for 0 < x ≤ 1, all ω, all vectors α of nonnegative weights, and all n.

A sequence of processes {f_i} is manageable if the array defined by f_{ni} = f_i for i ≤ n is manageable. The concept of manageability extends to a definition of uniform manageability based on the maximal inequality. Among those classes of functions which are manageable, those that are also uniformly manageable satisfy the extra condition that the bound in the maximal inequality is independent of the R used in the envelope G_R. See Kim and Pollard (1990) for details.

Manageability of the classes of functions f_{h,x,y}(θ, r) and g_{h,x,u}(θ, δ, ξ)

By the sufficient conditions for manageability in Kim and Pollard (1990), we can easily show that the classes of functions f_{h,x,y}(θ, r) and g_{h,x,u}(θ, δ, ξ) for nonlinear models are also manageable.

Lemma 13 (Dudley, 1979).
If G is a vector space of real functions on a set, then VC(C_g) = dim(G) + 1, where C_g = {x ∈ X : g(x) ≥ 0, g ∈ G} and VC(C_g) is the VC dimension of C_g.

To use this result, suppose G₁ and G₂ are the classes of functions g₁(θ, r) = h(x, θ + β₀) − y + r and g₂(θ, r) = y + r − h(x, θ + β₀) for any h ∈ H, respectively. Consider

C₁ = {(θ, r) ∈ R^{d+1} : 0 ≤ g₁(θ, r), g₁ ∈ G₁} and C₂ = {(θ, r) ∈ R^{d+1} : 0 ≤ g₂(θ, r), g₂ ∈ G₂}.

Therefore, by Dudley's lemma and our assumption 1 in Theorem 8, the VC dimensions of C₁ and C₂ are bounded above by dim(H) + 3 < ∞. So, C₁ and C₂ form VC classes, which implies that C₁ ∩ C₂ is also a VC class. Now, the class of functions f_{h,x,u}(θ, r, ξ), or f_{h,x,y}(θ, r), forms a VC subgraph, and hence is manageable. Recall that g_{h,x,u}(θ, δ, ξ) = f_{h,x,u}(θ, 1 + δ, ξ) − f_{h,x,u}(0, 1 + δ, ξ). Since the classes F and F₀ of f_{h,x,u}(θ, r, ξ) and f_{h,x,u}(0, r, ξ), respectively, are VC subgraphs, the class

G = {f₁ − f₀ : f₁ ∈ F and f₀ ∈ F₀}

is also a VC subgraph by Lemma 2.6.18 in van der Vaart and Wellner (1996). Thus, subclasses G_R of G as defined in Kim and Pollard (1990) are uniformly manageable with the envelope

G_R^h = sup_{G_R} |g_{h,x,u}(θ, r, ξ)|.

By the manageability of the class of f_{h,x,y}(θ, r) and Corollary 3.2 in Kim and Pollard (1990), we have

sup_{θ,r} |E_n f_{h,x,u}(θ, r, ξ) − Ef_{h,x,u}(θ, r, ξ)| = O_p(n^{−1/2}).   (4.7)

4.3.2 The O_p(n^{−1/2}) rate of convergence of r_n

Denote the distribution function of u by Γ. We have

Ef_{h,x,u}(θ, r, ξ) = E_x E_u^x[f_{h,x,u}(θ, r, ξ)] = E_x[Γ(h′(X, ξ)ᵀθ + r) − Γ(h′(X, ξ)ᵀθ − r)],   (4.8)

where E is the expectation with respect to the product probability measure P of (x, u), E_u^x means the conditional expectation with respect to u given X, and E_x is the unconditional expectation taken over X. Clearly, (4.8) is a continuous function of θ and r, which is maximized by θ = 0 for each fixed r because of the symmetry of u at 0.
In other words, we have

sup_θ Ef_{h,x,u}(θ, r, ξ) = Γ(r) − Γ(−r).

Thus, it follows that there exist positive constants k and λ for which

sup_θ Ef_{h,x,u}(θ, 1 − δ, ξ) < q − kδ   (4.9)

and

Ef_{h,x,u}(θ, 1 + δ, ξ) ≥ q + λδ,   (4.10)

for any δ > 0 small enough. Let P[A, B] and P_n[A, B] represent EI_[A,B] and E_n I_[A,B], which correspond to a probability measure P and its empirical counterpart P_n, respectively. By (4.7), we have

Δ_n ≝ sup_{θ,r} |P_n[h′(X, ξ)ᵀθ − r, h′(X, ξ)ᵀθ + r] − P[h′(X, ξ)ᵀθ − r, h′(X, ξ)ᵀθ + r]| = O_p(n^{−1/2}).

Putting r = 1 − Δ_n/k, we get

P_n[h′(X, ξ)ᵀθ − 1 + Δ_n/k, h′(X, ξ)ᵀθ + 1 − Δ_n/k] ≤ Δ_n + P[h′(X, ξ)ᵀθ − 1 + Δ_n/k, h′(X, ξ)ᵀθ + 1 − Δ_n/k].

Thus by (4.9), we have

sup_θ P_n[h′(X, ξ)ᵀθ − 1 + Δ_n/k, h′(X, ξ)ᵀθ + 1 − Δ_n/k] < Δ_n + q − k(Δ_n/k) = q,

which implies that

r_n ≥ 1 − Δ_n/k.   (4.11)

Similarly, by (4.10), there exists λ > 0 such that P[−1 − δ, 1 + δ] ≥ q + λδ for all δ > 0 small enough. Therefore,

P_n[−1 − Δ_n/λ, 1 + Δ_n/λ] ≥ −Δ_n + P[−1 − Δ_n/λ, 1 + Δ_n/λ] ≥ −Δ_n + q + λ(Δ_n/λ) = q,

which implies

r_n ≤ 1 + Δ_n/λ.   (4.12)

Combining the results in (4.11) and (4.12), we get r_n = 1 + O_p(n^{−1/2}).

4.3.3 Conditions for Kim and Pollard's Lemma 4.1 are satisfied in the nonlinear case

Lemma 4.1 in Kim and Pollard (1990) plays an important role in establishing the convergence of the LMS estimator in the linear context. Here we show that this lemma also holds for our LQS estimator in the nonlinear settings. Denote G_R^h at fixed x by G_R^h(x). Note that

|g_{h,x,u}(θ, r, ξ)| ≤ I*_{(−1−δ, h′(x,ξ)ᵀθ−1−δ)}(u) + I*_{(h′(x,ξ)ᵀθ+1+δ, 1+δ)}(u),

for fixed x. Here the asterisk on the indicator function means that the interval may be reversed, that is, I*_{(a,b)}(u) = I_{(min(a,b), max(a,b))}(u). By the boundedness of the density of u, let M < ∞ be the supremum of the density of u.
Thus,

E_u^x G_R^h(x) ≤ sup_{‖θ−θ₀‖≤R} {2M|h′(x, ξ)ᵀθ|}.   (4.13)

Recall that we set θ₀ = 0. By the Cauchy-Schwarz inequality, we get that E_u^x G_R^h(x) ≤ 2M‖h′(x, ξ)‖R, which implies that

EG_R^h = E_x E_u^x G_R^h(x) ≤ 2M E_x(‖h′(X, ξ)‖)R.

Therefore, by assumption 5 in our Theorem 8, it follows that EG_R^h = O(R), which is required for Lemma 4.1 in Kim and Pollard (1990) to establish the convergence of θ̂_n^q or β̂_n^q.

4.3.4 Check the conditions of Kim and Pollard's main theorem

In what follows, we verify that Kim and Pollard's main theorem holds for our LQS estimators in nonlinear models. So, it suffices to check the conditions of our Lemma 12.

(i) Conditions 2, 3 and 4 in our Lemma 12 are satisfied

First, we use Lemma 4.1 in Kim and Pollard (1990) for the pair (θ, δ) to show that there exists M_n = O_p(1) such that

|E_n g_{h,x,u}(θ, δ, ξ) − Eg_{h,x,u}(θ, δ, ξ)| ≤ ε[‖θ‖² + δ²] + n^{−2/3}M_n²,   (4.14)

for each ε > 0. To do this, consider Eg_{h,x,u}(θ, δ, ξ) = E_x[E_u^x[g_{h,x,u}(θ, δ, ξ)]], where E_u^x[g_{h,x,u}(θ, δ, ξ)] = E_u[g_{h,x,u}(θ, δ, ξ)|x] is a continuous function of θ and δ for fixed x. By Taylor's expansion of E_u^x[g_{h,x,u}(θ, δ, ξ)] about θ = 0 and δ = 0, we have

E_u^x[g_{h,x,u}(θ, δ, ξ)] = γ′(1)θᵀQ_h^xθ + o(‖θ‖²) + o(δ²)

and

Eg_{h,x,u}(θ, δ, ξ) = γ′(1)θᵀQ_hθ + o(‖θ‖²) + o(δ²),   (4.15)

where Q_h = E_x Q_h^x = E_x[h′(X, β₀)h′(X, β₀)ᵀ]. (4.15) is used to verify conditions 2 and 4; however, its derivation is long, so it is deferred to the end of this subsection. By (4.14) and (4.15), we have

E_n g_{h,x,u}(θ, δ, ξ) ≤ γ′(1)θᵀQ_hθ + o(1)‖θ‖² + o(1)δ² + ε[‖θ‖² + δ²] + O_p(n^{−2/3}).

Since θ̂_n^q maximizes E_n g_{h,x,u}(θ, r_n − 1, ξ), we have

0 = E_n g_{h,x,u}(0, r_n − 1, β₀) ≤ E_n g_{h,x,u}(θ̂_n^q, r_n − 1, ξ) ≤ γ′(1)(θ̂_n^q)ᵀQ_h(θ̂_n^q) + (ε + o(1))‖θ̂_n^q‖² + (ε + o(1))(r_n − 1)² + O_p(n^{−2/3}).
Since we proved that r_n = 1 + O_p(n^{−1/2}), we now obtain

0 ≤ γ′(1)(θ̂_n^q)ᵀQ_h(θ̂_n^q) + (ε + o(1))‖θ̂_n^q‖² + (ε + o(1))O_p(n^{−1}) + O_p(n^{−2/3}).   (4.16)

Note that Q_h is a symmetric matrix, so we have λ_p ≤ θᵀQ_hθ/θᵀθ ≤ λ₁, where λ₁ and λ_p are the largest and smallest eigenvalues of Q_h. In other words, we have θᵀQ_hθ ≥ λ_p‖θ‖², which implies that γ′(1)(θ̂_n^q)ᵀQ_h(θ̂_n^q) ≤ γ′(1)λ_p‖θ̂_n^q‖² (∵ γ′(1) < 0). Thus, by (4.16),

[−λ_pγ′(1) − (ε + o(1))]‖θ̂_n^q‖² ≤ (ε + o(1))O_p(n^{−1}) + O_p(n^{−2/3}).

Since Q_h is positive definite, λ_p > 0. Taking ε = (−γ′(1)/2)λ_p > 0, we have

[(−γ′(1)/2)λ_p − o(1)]‖θ̂_n^q‖² ≤ O_p(n^{−2/3}),

which implies ‖θ̂_n^q‖ = O_p(n^{−1/3}), or ‖β̂_n^q − β₀‖ = O_p(n^{−1/3}). So, condition 2 holds.

Second, condition 3 is satisfied by the assumption of our model (4.1) that β₀ is an interior point of the parameter space.

Third, to verify condition 4, observe that (4.15) implies that Eg_{h,x,u}(θ, δ, ξ) is twice differentiable in θ and the second derivative matrix with respect to θ at (0, 0, β₀) is

(∂²/∂θ∂θᵀ)Eg_{h,x,u}(0, 0, β₀) = E_X[(∂²/∂θ∂θᵀ)E_u^x[g_{h,x,u}(0, 0, β₀)]] = E_X[2γ′(1)Q_h^x] = 2γ′(1)Q_h.

Finally, we derive the expression (4.15). Recall that Eg_{h,x,u}(θ, δ, ξ) = E_x[E_u^x[g_{h,x,u}(θ, δ, ξ)]], where E_u^x[g_{h,x,u}(θ, δ, ξ)] = E_u[g_{h,x,u}(θ, δ, ξ)|x] is a continuous function of θ and δ for fixed x. In the following, we use g = g_{h,x,u} for simplicity. By Taylor's expansion of E_u^x[g(θ, δ, ξ)] about θ = 0 and δ = 0, we have

E_u^x[g(θ, δ, ξ)] = E_u^x[g(0, 0, β₀)] + θᵀ(∂/∂θ)E_u^x[g(0, 0, β₀)] + δ(∂/∂δ)E_u^x[g(0, 0, β₀)] + (1/2)[θᵀ(∂²/∂θ∂θᵀ)E_u^x[g(0, 0, β₀)]θ + δ²(∂²/∂δ²)E_u^x[g(0, 0, β₀)] + 2δθᵀ(∂²/∂δ∂θ)E_u^x[g(0, 0, β₀)]] + o(‖θ‖²) + o(δ²).

By the definition of g_{h,x,u}, we have E_u^x[g(0, 0, β₀)] = 0.
Moreover, since

(∂/∂θ)E_u^x[g(θ, δ, ξ)] = (∂/∂θ) ∫_{h′(x,ξ)ᵀθ−1−δ}^{h′(x,ξ)ᵀθ+1+δ} γ(u)du = h′(x, ξ)γ(h′(x, ξ)ᵀθ + 1 + δ) − h′(x, ξ)γ(h′(x, ξ)ᵀθ − 1 − δ),

we have (∂/∂θ)E_u^x[g(0, 0, β₀)] = h′(x, β₀)γ(1) − h′(x, β₀)γ(−1) = 0, because of the symmetry of γ about 0, i.e. γ(1) = γ(−1). Similarly, we have:

(i) (∂/∂δ)E_u^x[g(θ, δ, ξ)] = (∂/∂δ) ∫_{h′(x,ξ)ᵀθ−1−δ}^{h′(x,ξ)ᵀθ+1+δ} γ(u)du − (∂/∂δ) ∫_{−1−δ}^{1+δ} γ(u)du = [γ(h′(x, ξ)ᵀθ + 1 + δ) + γ(h′(x, ξ)ᵀθ − 1 − δ)] − [γ(1 + δ) + γ(−1 − δ)], so that (∂/∂δ)E_u^x[g(0, 0, β₀)] = 0.

(ii) (∂²/∂θ∂θᵀ)E_u^x[g(θ, δ, ξ)] = γ′(h′(x, ξ)ᵀθ + 1 + δ)h′(x, ξ)h′(x, ξ)ᵀ − γ′(h′(x, ξ)ᵀθ − 1 − δ)h′(x, ξ)h′(x, ξ)ᵀ, thus we have

(∂²/∂θ∂θᵀ)E_u^x[g(0, 0, β₀)] = [γ′(1)h′(x, β₀)h′(x, β₀)ᵀ] − [γ′(−1)h′(x, β₀)h′(x, β₀)ᵀ] = 2γ′(1)Q_h^x,

where Q_h^x = h′(x, β₀)h′(x, β₀)ᵀ for fixed x and γ′(1) = −γ′(−1).

(iii) (∂²/∂δ²)E_u^x[g(θ, δ, ξ)] = (∂/∂δ)[γ(h′(x, ξ)ᵀθ + 1 + δ) + γ(h′(x, ξ)ᵀθ − 1 − δ)] − (∂/∂δ)[γ(1 + δ) + γ(−1 − δ)], which implies that (∂²/∂δ²)E_u^x[g(0, 0, β₀)] = 0.

(iv) (∂²/∂δ∂θ)E_u^x[g(θ, δ, ξ)] = (∂/∂δ)[h′(x, ξ)γ(h′(x, ξ)ᵀθ + 1 + δ) − h′(x, ξ)γ(h′(x, ξ)ᵀθ − 1 − δ)]. Thus, we have (∂²/∂δ∂θ)E_u^x[g(0, 0, β₀)] = 0.

Thus, we have E_u^x[g_{h,x,u}(θ, δ, ξ)] = γ′(1)θᵀQ_h^xθ + o(‖θ‖²) + o(δ²), and the expression (4.15) holds.

(ii) Conditions 6 and 7 in Lemma 12 are satisfied

For condition 6, since u has a bounded density and E‖h′(X, ξ)‖ < ∞ by our assumption 5, it follows that E(G_R^h)² = O(R) by the same technique we used for showing EG_R^h = O(R) in Section 4.3.3. For condition 7, recall that g_{h,x,u} is the difference of the two indicator functions f_{h,x,u}(θ, 1 + δ, ξ) and f_{h,x,u}(0, 1 + δ, ξ), where f_{h,x,u}(θ, r, ξ) = I{|u − h′(x, ξ)ᵀθ| ≤ r}. So, for (θ₁, δ₁) and (θ₂, δ₂) near (0, 0),

|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| ≤ I*_{A₁}(u) + I*_{A₂}(u) + I*_{A₃}(u) + I*_{A₄}(u).
There are many combinations of intervals of the form A₁, A₂, A₃ and A₄. For example, A₁ = (−1 − δ₁, −1 − δ₂), A₂ = (1 + δ₂, 1 + δ₁), A₃ = (h′(x, ξ₂)ᵀθ₂ − 1 − δ₂, h′(x, ξ₁)ᵀθ₁ − 1 − δ₁) and A₄ = (h′(x, ξ₂)ᵀθ₂ + 1 + δ₂, h′(x, ξ₁)ᵀθ₁ + 1 + δ₁). In all cases the total length of the intervals A₁, A₂, A₃ and A₄ on the right is bounded by 2|h(x, β₁) − h(x, β₂)| + 4|δ₂ − δ₁| for fixed x, where β_i = β₀ + θ_i and ξ_i ∈ (β₀, β_i) for i = 1, 2. Moreover, ξ_i → β₀ as θ_i → 0. Thus,

E_u^x|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| ≤ M[2|h(x, β₁) − h(x, β₂)| + 4|δ₂ − δ₁|],

where M is defined in (4.13). By assumption 4 in our Theorem 8, we have |h(x, β₁) − h(x, β₂)| ≤ L_x‖β₁ − β₂‖ = L_x‖θ₁ − θ₂‖, where L_x > 0 depends on x. Therefore,

E_u^x|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| ≤ 2ML_x‖θ₁ − θ₂‖ + 4M|δ₂ − δ₁|,

and

E|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| = E_x E_u^x|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| ≤ 2M E_x(L_x)‖θ₁ − θ₂‖ + 4M|δ₂ − δ₁| ≤ [2M max{E_x(L_x), 2}][‖θ₁ − θ₂‖ + |δ₂ − δ₁|],

which implies that

E|g_{h,x,u}(θ₁, δ₁, ξ₁) − g_{h,x,u}(θ₂, δ₂, ξ₂)| = O(‖θ₁ − θ₂‖ + |δ₂ − δ₁|).   (4.17)

So Kim and Pollard's condition 7 is satisfied.

(iii) Condition 1 in Lemma 12 is satisfied

Now we show that θ̂_n^q comes close to maximizing E_n f_{h,x,u}(θ, 1, ξ), which is equivalent to saying that β̂_n^q nearly maximizes P_n(|y − h(x, β)| ≤ 1). Kim and Pollard's technique requires checking whether or not the two-parameter centered process

X_n(a, b) = n^{2/3} E_n g_{h,x,u}(an^{−1/3}, bn^{−1/3}, ξ(a)) − n^{2/3} Eg_{h,x,u}(an^{−1/3}, bn^{−1/3}, ξ(a))

satisfies the uniform tightness (i.e. stochastic equicontinuity) condition used for the weak convergence of the process. In their Lemma 4.6, Kim and Pollard (1990) show that the process X_n satisfies uniform tightness.
The main hypotheses of Lemma 4.6 are uniform manageability and conditions 6 and 7 of our Lemma 12. In Section 4.3.1 we have shown that the classes of f_{h,x,u} and g_{h,x,u} are manageable. Also, in Section 4.3.4(ii) we established conditions 6 and 7. Thus, X_n is uniformly tight. Given this, we must show that β̂_n^q comes close to maximizing P_n(|y − h(x, β)| ≤ 1). Then, using n^{1/3}(r_n − 1) = o_p(1), we have

X_n(n^{1/3}θ, n^{1/3}(r_n − 1)) − X_n(n^{1/3}θ, 0) = o_p(1)

uniformly over θ in an O_p(n^{−1/3}) neighborhood of zero. That is,

E_n g_{h,x,u}(θ, r_n − 1, ξ) = Eg_{h,x,u}(θ, r_n − 1, ξ) + E_n g_{h,x,u}(θ, 0, ξ) − Eg_{h,x,u}(θ, 0, ξ) + o_p(n^{−2/3}),

uniformly over an O_p(n^{−1/3}) neighborhood. Within such a neighborhood, by (4.15) we have

Eg_{h,x,u}(θ, r_n − 1, ξ) − Eg_{h,x,u}(θ, 0, ξ) = o((r_n − 1)²) = o_p(n^{−2/3}).

Then if m_n maximizes E_n g_{h,x,u}(θ, 0, ξ), just as θ̂_n^q maximizes E_n g_{h,x,u}(θ, r_n − 1, ξ), we have m_n = O_p(n^{−1/3}). Therefore,

E_n g_{h,x,u}(θ̂_n^q, 0, ξ) = E_n g_{h,x,u}(θ̂_n^q, r_n − 1, ξ) − o_p(n^{−2/3}) ≥ E_n g_{h,x,u}(m_n, r_n − 1, ξ) − o_p(n^{−2/3}) = E_n g_{h,x,u}(m_n, 0, ξ) − o_p(n^{−2/3}).

In other words, we have

E_n g_{h,x,u}(θ̂_n^q, 0, ξ) ≥ sup_θ E_n g_{h,x,u}(θ, 0, ξ) − o_p(n^{−2/3}),

which means that θ̂_n^q comes close to maximizing E_n f_{h,x,u}(θ, 1, ξ).

(iv) Condition 5 in Lemma 12 is satisfied

Consider the one-parameter class of functions {g_{h,x,u}(θ, 0, ξ) : θ ∈ R^d, ξ ∈ (β₀, β)} with θ = β − β₀. Using the same techniques as in the verification of conditions 6 and 7, we have, for fixed s and t,

E_u^x|g_{h,x,u}(s/α, 0, ξ_s) − g_{h,x,u}(t/α, 0, ξ_t)|² = |Γ(1 + h′(x, ξ_s)ᵀs/α) − Γ(1 + h′(x, ξ_t)ᵀt/α)| + |Γ(−1 + h′(x, ξ_s)ᵀs/α) − Γ(−1 + h′(x, ξ_t)ᵀt/α)|.
α α α α By Taylor’s expansion of the first two terms at 1 and the last two at −1 with γ(1) = γ(−1), we have ¯ ¯ s t s t Eux |gh,x,u ( , 0, ξs ) − gh,x,u ( , 0, ξt )|2 = 2¯γ(1)[h0 (x, ξs )T − h0 (x, ξt )T ] + o(1/α)¯, α α α α (4.18) where ξs ∈ (β0 , βs ) and ξt ∈ (β0 , βt ), βs = β0 + s α and βt = β0 + αt . In fact, kξs − β0 k ≤ ks/αk and kξt − β0 k ≤ kt/αk. As α → ∞, ξs and ξt will tend to β0 . Thus we have t s L(s − t) ≡ lim αE|gh,x,u ( , 0, ξs ) − gh,x,u ( , 0, ξt )|2 α→∞ α α ¯ ¯ 0 T 0 =2 lim Ex ¯γ(1)[h (x, ξs ) s − h (x, ξt )T t] + αo(1/α)¯ α→∞ =2γ(1)Ex |h0 (x, β0 )T (s − t)|. Similarly, we can also prove that L(s) ≡ lim αE|gh,x,u ( αs , 0, ξs )|2 = 2γ(1)Ex |h0 (x, β0 )T s| α→∞ and L(t) ≡ lim αE|gh,x,u ( αt , 0, ξt )|2 = 2γ(1)Ex |h0 (x, β0 )T t|. α→∞ Thus, the limiting covariance function is s t 1 H(s, t) ≡ lim αE[gh,x,u ( , 0, ξs )gh,x,u ( , 0, ξt )] = [L(s) + L(t) − L(s − t)], α→∞ α α 2 by the identity 2xy = x2 + y 2 − (x − y)2 . 116 Chapter 4. Asymptotics of the Least Quantile of Squares Estimator in Nonlinear Models 4.3.5 Proof of asymptotic results for the LQS estimators in nonlinear models Conditions 1-7 are satisfied, it is enough to complete the proof of Theorem 8 by verifying that the limiting Gaussian process has nondegenerate increments. Note that in Section 4.3.4(iv), since L(0) = 0, H(s, s) = L(s) and H(t, t) = L(t). Thus, by our assumption 6, we have H(s, s) − 2H(s, t) + H(t, t) = L(s − t) 6= 0, for any s 6= t. (4.19) Under (4.19), Lemma 2.6 in Kim and Pollard (1990) can be applied to give that the limiting Gaussian process has nondegenerate increments. Consequently, applying Kim and Pollard’s main theorem with our assumption 3 on the positive definiteness of Qh , we can identify the limit distribution of n1/3 θ̂nq , i.e. n1/3 (β̂nq − β0 ), with the arg max of the Gaussian process θ Z(θ) = γ 0 (r0 (q))θT Qh θ + W (θ), where r0 (q) = 1 is the unique qth quantile of |u| and W has zero means, covariance kernel H and continuous sample paths. 
Acknowledgments

This research was supported by NSERC operating grant number RGPIN 138122.

4.4 References

D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, J.W. Tukey, Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, New Jersey, 1972.

R.M. Dudley, Balls in R^k do not cut all subsets of k + 2 points, Adv. Math. 31 (1979) 306-308.

F.R. Hampel, Beyond Location Parameters: Robust Concepts and Methods, Bulletin of the International Statistical Institute 46 (1975) 375-382.

J. Kim, D. Pollard, Cube Root Asymptotics, Ann. Statist. 18 (1990) 191-219.

D. Pollard, Empirical Processes: Theory and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 2, Institute of Mathematical Statistics, Hayward, California, and American Statistical Association, Alexandria, Virginia, 1990.

P.J. Rousseeuw, Least Median of Squares Regression, J. Amer. Statist. Assoc. 79 (1984) 871-880.

P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley and Sons, 2003.

A.J. Stromberg, Consistency of the least median of squares estimator in nonlinear regression, Comm. Statist. Theory Methods 24 (1995) 1971-1984.

A.W. van der Vaart, J.A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer-Verlag, New York, 1996.

Chapter 5
Median-Based Cross Validation for Model Selection

5.1 Introduction

Model selection is the task of selecting a statistical model from a set of potential models, given data. To do this, one needs a criterion which quantifies a model's performance. For example, consider a regression model

Y_i = f_λ(X_i; β) + e_i, (5.1)

where {e_i : i = 1, . . . , n} is assumed to be an independent and identically distributed sample of size n with median 0.
Suppose f̂_λ(X) = f_λ(X, β̂) is an estimate of the regression function in (5.1) from the model list M = {f_λ, λ ∈ Λ}, where each f_λ(X; β) represents one candidate model mapping from R^d × R^p onto R. The first task is to choose λ. Here, usually, λ will be taken to vary over a finite set Λ, although there are many cases where Λ is countably infinite or uncountable. Once λ has been chosen, the remaining task is to estimate β. The choice of λ may be regarded as model selection and the estimation of β may be regarded as variable selection. However, in the linear models used here the two are so closely related that we do not distinguish between them.

A version of this chapter has been submitted for publication: Yu, C.W. and Clarke, B., Median-Based Cross-Validation for Model Selection.

There are numerous model selection techniques for choosing λ. Information-based methods penalize the log likelihood, as in BIC, AIC and Hannan and Quinn, for instance. Second, shrinkage methods penalize an empirical risk, as with the L1 penalty in LASSO, the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001), and elastic nets more generally. A third class of model selection techniques is cross-validatory, based on within-sample predictive accuracy. Two benefits of cross-validatory techniques over information and shrinkage techniques are that (i) unlike AIC, BIC and other information approaches, cross validation is non-parametric, so the full likelihood is not required, and unlike LASSO and other shrinkage methods, cross validation eliminates the need to choose a penalty term, and (ii) the cross-validatory technique evaluates a model's predictive performance in addition to its fit.
More formally, cross validation, hereafter CV, estimates the expected squared prediction error of a model by the average of the squared prediction errors,

(1/n) Σ_{i=1}^n (y_i − f_λ(x_i; β̂))^2. (5.2)

Given a list of candidate models, cross validation selects the model with the smallest estimated expected prediction error. The literature on CV in general is too vast to be adequately summarized here. For instance, for leave-one-out CV, see Stone (1974, 1977), Geisser (1975), Efron (1986) and Li (1987). For multi-fold CV, see Zhang (1993) and Shao (1993, 1997). For generalized CV, see Li (1987) and Hastie and Tibshirani (1990). For robust CV, see Ronchetti, Field and Blanchard (1997), Cantoni and Ronchetti (2001) and Leung (2005). For Bayesian CV, see Chakrabarti and Samanta (2008). In addition, stacking (Smyth and Wolpert, 1999) uses cross validation to find weights for models on a list for model averaging in the Frequentist context. The resulting average has been found extremely effective in a prediction sense. In contrast to Bayes model averaging, for instance, stacking is robust to deviations of the true model away from the models on the model list.

Since CV is based on the least-squares approach, one of its key problems is its sensitivity to outliers. One way to avoid the over-sensitivity of CV is to use robust CV. For linear models this was developed in Ronchetti, Field and Blanchard (1997) using the M-estimation approach of Huber (1964, 1973). That is, we find λ to minimize

(1/n) Σ_{i=1}^n ρ(y_i − f_λ(x_i; β̂)), (5.3)

where ρ is a function satisfying some regularity conditions; see Huber (1964, 1973). Note that (5.3) includes (5.2) by choosing ρ(t) = t^2. If we choose ρ so that ρ(t) increases more slowly than t^2 as |t| → ∞, then the corresponding minimum is less sensitive to outliers, i.e. to extreme values of the residuals, than the least-squares approach.
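Expression (5.3) is a family of criteria indexed by ρ. The following standard-library sketch, which is ours and not from the thesis, contrasts the squared loss with Huber's ρ, whose linear tails damp the influence of gross residuals; the cutoff c = 1.345 is a conventional default and an assumption here:

```python
# Two choices of rho for the robust criterion (5.3).

def rho_squared(t):
    """rho(t) = t^2 recovers the least-squares criterion (5.2)."""
    return t * t

def rho_huber(t, c=1.345):
    """Huber's rho: quadratic near zero, linear beyond the cutoff c."""
    return t * t / 2 if abs(t) <= c else c * (abs(t) - c / 2)

def criterion(residuals, rho):
    """(1/n) * sum_i rho(residual_i), as in (5.3)."""
    return sum(rho(r) for r in residuals) / len(residuals)

residuals = [0.1, -0.3, 0.2, 8.0]        # one gross outlier
sq = criterion(residuals, rho_squared)   # dominated by the outlier
hb = criterion(residuals, rho_huber)     # outlier contributes only linearly
```

Because Huber's ρ grows linearly in the tail, the single residual of 8 contributes about 9.9 to the Huber sum versus 64 to the squared-loss sum, so `hb` is far smaller than `sq`.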
However, this methodology still requires the choice of ρ, which is often subjective. In nonparametric regression, Leung (2005) shows that the minimum of (5.3) is asymptotically independent of the choice of ρ in large samples. However, for small or moderate sample sizes, the minimum of (5.3) may depend nontrivially on the choice of ρ.

To avoid choosing the function ρ in the conventional robust formulation, we propose an alternative robust cross validation technique. Our strategy is to use the sample median in place of the sample mean in (5.2) to estimate the median of the squared prediction error. That is, we optimize the objective function

med_{1≤i≤n} (y_i − f_λ(x_i; β̂))^2. (5.4)

One advantage of this strategy is that using the median automatically gives invariance of the estimators up to strictly increasing transformations of the absolute residual. We call the procedure of finding the best model f_λ under (5.4) Median Cross Validation, or MCV; it is a direct consequence of using the minimum median of the loss suggested by Yu and Clarke (2008a) and justified by Rostek (2006). Indeed, using the median of the loss may be more reasonable because the nonnegativity of the loss implies that the loss, as a random variable, will typically be strongly right-skewed, and it is well known that the median is more representative of the bulk of a skewed distribution than the expectation is.

Optimizing (5.4) is similar to computing the least median of squares (LMS) estimator (see Rousseeuw (1984) and Rousseeuw and Leroy (2003)), defined by

β̂_LMS = arg min_β median_{1≤i≤n} [y_i − f(x_i; β)]^2. (5.5)

The LMS estimator is an alternative to the least squares (LS) estimator and the other usual robust estimators obtained from (5.3). Analogous to the usual CV with the LS estimator for the regression coefficients in model selection, we suggest using MCV with the LMS estimator. We further justify using MCV in two ways.
First, when the error term in a regression model is heavy-tailed, LS methods tend to do poorly: the error term may be very large, giving severe problems with outliers. By contrast, the median is highly resistant to outliers. Second, the median always exists, even when the mean or variance fails to exist. Thus, MCV is expected to perform better than CV for heavy-tailed distributions. In practice, heavy-tailed distributions occur regularly in many domains of application. For instance, in economics and finance, they are used to model fluctuations of stock returns, excess bond returns, foreign exchange rates, commodity price returns and real estate returns (see McCulloch (1996) and Rachev and Mittnik (2000)); in engineering, noise in some communications and radar systems is often heavy-tailed. In such cases, the appropriateness of the commonly adopted normal assumption is highly questionable.

In a comparison of CV and MCV for linear models, we have revealed a dependence on the size of the parameter. That is, which of CV or MCV is more appropriate to a given setting seems to depend on the size of the parameter β in the linear model and on the heaviness of the tails of the error. We summarize this in Table 5.1.

             Normal error   Heavy-tailed error
Small ‖β‖    MCV            MCV
Large ‖β‖    CV             MCV

Table 5.1: Comparison of MCV and CV under normal and heavy-tailed errors.

That is, MCV seems to work better in a minimum predictive error sense than CV does, except when the tails of the error distribution are close to normal and the true value of ‖β‖ is large. In this summary, large ‖β‖ means large relative to σ in the normal case. Throughout this article, we say that MCV works better than CV if it has a higher probability of choosing the correct or true model (i.e., the data-generating model in our simulations) than CV does, i.e.
P_MT(MCV chooses M_T) > P_MT(CV chooses M_T), (5.6)

where M_T denotes the correct model, or vice versa. Note that (5.6) evaluates performance in a predictive error sense.

These findings suggest that our method may be particularly suitable for model selection when the correct model has several large leading terms that are easy to identify but also a large number of smaller terms that contribute. In making this statement it is assumed that all the variables have been standardized so that the relative sizes of their contributions can be determined. More formally, consider the model Y = X1 β1 + X2 β2 + E, where E is N(0, σ^2), X1 consists of a small number of variables which have large parameter values β1, and X2 consists of a large number of variables all having relatively small, but nonzero, coefficients β2. Now, X1 is easy to identify, so β1 is easy to estimate, by least squares for instance. Therefore, our work implies that once X1 is identified, MCV is going to be better than CV for variable selection within X2, at least under normal error.

The rest of this chapter is organized as follows. In the next section, we present a simple first example to motivate MCV and then state formally what our procedure is. In Section 3, we study the selection of one of two models by MCV and CV, theoretically and by simulations. In Section 4, we examine model selection by MCV and CV with three terms involved; this generalizes the results in Section 3. Overall, the results in Sections 3 and 4 justify Table 5.1. Section 5, the last, discusses several of the issues related to deciding between CV and MCV in practice. Throughout this article, we only consider the case that the true model is on the model list.

5.2 Motivation and methodology: Median cross validation

Here we provide a motivational example to illustrate the results we obtain more generally, and then propose our technique formally.
To demonstrate that MCV can be expected to outperform CV in many contexts, consider the following simple simulation. Suppose there are three nested models differing in the extra terms included:

Model 1: Y = 2(1 + X1) + E,
Model 2: Y = 2(1 + X1 + X2) + E,
Model 3: Y = 2(1 + X1 + X2 + X3) + E. (5.7)

In this first simulation, assume that Model 2 is true. So, Model 1 underfits and Model 3 overfits. We generate 1000 samples of size n = 50 with covariates generated from the standard uniform distribution. Next, consider three error distributions for E: i) standard normal, ii) standard Cauchy, and iii) a mixture of normals, 80%·N(0, 1) + 20%·N(15, 1). The first error distribution is the standard case; the second one is heavy-tailed and the third one corresponds to a case with outliers. Figure 5.1 shows that CV is more likely to choose the correct model in the commonly assumed case, but that outside this case MCV is more likely to choose the correct model. In addition, Figure 5.1 illustrates an important caveat of using CV: the typical form of CV strongly prefers smaller models, regardless of how good they are. This will be seen in our simulations below.

Next, it is important to state formally what our MCV methodology is. Basically, the technique for MCV is very similar to the usual CV: First, select a random sample from the observations to serve as the "training" set to estimate the parameters in the model. Then, using the estimated model, predict the response variables for the values of the explanatory variables for all the data points outside the training set. Compare these predictions with the actual measured observations in the "validation" sample and assess the discrepancy. The model with the smallest discrepancy is announced as "true". This can be expressed mathematically as follows. For V-fold MCV,

1. Split the whole data set into V disjoint subsamples S_1, . . . , S_V.

2. For v = 1, . . . , V, fit model f_λ to the training sample ∪_{i≠v} S_i, and compute the discrepancy, d_v(λ), using the validation sample S_v.

3. Then V-fold MCV finds the optimal λ as the minimizer of the overall discrepancy

d(λ) = median{d_1(λ), . . . , d_V(λ)},

where the usual CV uses the sample mean of {d_1(λ), . . . , d_V(λ)}.

[Figure 5.1 appears here: three panels ("5 fold: Standard normal N(0,1)", "5 fold: Standard Cauchy", "5 fold: 4/5 N(0,1) + 1/5 N(10,1)"), each plotting, for Models 1-3, the proportion of times the model is chosen.] Figure 5.1: Proportion of times each model is selected by 5-fold MCV and CV criteria for samples with different error distributions. The black dot represents the MCV performance, while the open square represents the CV performance.

In particular, when V = n, i.e. leave-one-out MCV, let z_i = (y_i, x_i) with x_i ∈ R^d, and let z_{−i} be the (n − 1)-dimensional vector with the i-th observation, z_i, removed from the original observations z. Let f̂_λ^{−i} be the estimate for f_λ ∈ M based on z_{−i}. Denote the corresponding prediction errors by

y_i − f̂_λ^{−i}(x_i). (5.8)

Then, the median of squared prediction errors using the LMS estimator is defined by

CVMedloss(λ) = med_{1≤i≤n} (y_i − f̂_{λ,LMS}^{−i}(x_i))^2, (5.9)

where f̂_{λ,LMS}^{−i}(x_i) = f_λ(x_i; β̂_{(i)}^{LMS}) and β̂_{(i)}^{LMS} is the LMS estimator for β based on z_{−i}. For the usual CV, denote the sum of squared prediction errors using the LS estimator by

CVMSE(λ) = Σ_{i=1}^n (y_i − f̂_{λ,LS}^{−i}(x_i))^2, (5.10)

where f̂_{λ,LS}^{−i}(x_i) = f_λ(x_i; β̂_{(i)}^{LS}) and β̂_{(i)}^{LS} is the LS estimator for β based on z_{−i}. Unless otherwise specified, we use CVMedloss and CVMSE for any-fold MCV and CV, respectively, throughout this article.
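The three V-fold steps above can be sketched directly. For brevity this sketch, which is ours, fits the candidate model by ordinary least squares in both criteria (the thesis pairs MCV with the LMS estimator), takes the per-fold discrepancy d_v(λ) to be the median squared prediction error within the fold, and uses a single one-covariate candidate model; selecting among several candidates repeats the same computation per λ.

```python
import random
import statistics

def ols_fit(xs, ys):
    """Least-squares fit of y = b0 + b1*x, done by hand to stay dependency-free."""
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def fold_discrepancies(xs, ys, V, seed=0):
    """Steps 1-2: split into V folds; d_v = median squared prediction error on fold v."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    ds = []
    for v in range(V):
        fold = idx[v::V]
        train = [i for i in idx if i not in fold]
        b0, b1 = ols_fit([xs[i] for i in train], [ys[i] for i in train])
        ds.append(statistics.median((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold))
    return ds

# Step 3: MCV aggregates the fold discrepancies by their median; CV by their mean.
xs = [0.1 * i for i in range(30)]
ys = [2.0 + 3.0 * x for x in xs]    # a noiseless line, for illustration only
ds = fold_discrepancies(xs, ys, V=5)
d_mcv = statistics.median(ds)
d_cv = statistics.mean(ds)
```

With noiseless linear data both aggregates are essentially zero; the two criteria only diverge when some folds contain unusually large prediction errors.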
To conclude this section, we describe the general form of the simulations we have done. To obtain good test cases, we randomly generated design points X from a range of distributions. Given these values, we generated error terms E from another range of distributions. Then we formed response variables Y from Xβ + E. To test MCV against CV, we only permitted the model selection procedures to use the design points X and the response values Y; the parameter values and the errors were only available to CV or MCV implicitly.

Given that n data points were generated, V-fold cross validation meant that at each stage we used n/V data points for validation and n − n/V data points for training. From the training data, we estimated the regression coefficients of each candidate model by LS for the CV criterion and by LMS for our MCV criterion. Then, we used the estimated models based on the LS and the LMS estimators to predict the y's in the validation group. After predicting the y's in all V validation groups, we calculated CVMedloss in (5.9) and CVMSE in (5.10) to select the appropriate model. Finally, we calculated these two measures for each candidate model, and chose the model with the smallest CVMSE or CVMedloss at each replication. We repeated these steps B times and recorded the fraction of times each model was chosen over the B replications for both CV and MCV. Whichever of CV and MCV had a higher empirical probability of choosing the correct model was declared better. Throughout all simulations conducted in this article, we used the statistical package R.

5.3 Two-model cases

In this section, we consider the two-model selection problem. Suppose we have

Model 1: Y = X1 β1 + E,
Model 2: Y = X1 β1 + X2 β2 + E, (5.11)

where Y = (y_1, . . . , y_n)′, E = (e_1, . . . , e_n)′, X1 and X2 are n × p and n × q matrices, and β1 and β2 are p × 1 and q × 1 vectors, respectively.
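As a concrete instance of this design applied to the two-model comparison (5.11) with p = q = 1, the following sketch estimates the empirical probabilities in (5.6) under Cauchy errors. It substitutes LS fits for the LMS estimator inside CVMedloss and uses leave-one-out splits with small n and B to keep the run short; these settings are our illustrative simplifications, not the thesis's.

```python
import math
import random
import statistics

rng = random.Random(1)

def ols2(rows, ys, use_x2):
    """No-intercept LS for y = b1*x1 (+ b2*x2), via the normal equations."""
    s11 = sum(r[0] * r[0] for r in rows)
    s1y = sum(r[0] * y for r, y in zip(rows, ys))
    if not use_x2:
        return s1y / s11, 0.0
    s22 = sum(r[1] * r[1] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s2y = sum(r[1] * y for r, y in zip(rows, ys))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def loo_scores(rows, ys, use_x2):
    """Leave-one-out squared errors -> (CVMSE-style sum, CVMedloss-style median)."""
    errs = []
    for i in range(len(ys)):
        b1, b2 = ols2([r for j, r in enumerate(rows) if j != i],
                      [y for j, y in enumerate(ys) if j != i], use_x2)
        errs.append((ys[i] - b1 * rows[i][0] - b2 * rows[i][1]) ** 2)
    return sum(errs), statistics.median(errs)

def trial(beta2, n=30):
    rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    # tan(pi*(U - 1/2)) is a standard Cauchy draw.
    ys = [2.0 * x1 + beta2 * x2 + math.tan(math.pi * (rng.random() - 0.5))
          for x1, x2 in rows]
    mse1, med1 = loo_scores(rows, ys, use_x2=False)  # candidate Model 1
    mse2, med2 = loo_scores(rows, ys, use_x2=True)   # candidate Model 2 (true here)
    return (2 if mse2 < mse1 else 1), (2 if med2 < med1 else 1)

B = 100
picks = [trial(beta2=4.0) for _ in range(B)]
p_cv = sum(c == 2 for c, _ in picks) / B    # empirical P(CV-style rule picks Model 2)
p_mcv = sum(m == 2 for _, m in picks) / B   # empirical P(MCV-style rule picks Model 2)
```

Comparing `p_cv` and `p_mcv` over a grid of `beta2` values and error distributions reproduces the shape of the study described above, though the exact numbers depend on the simplifications noted.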
As summarized in Table 5.1, once the parameters are large, MCV often outperforms CV for heavy-tailed error distributions, while CV does better for normal error distributions. When the regression parameters are small, the dominance of CV over MCV under normal error no longer holds. In the following, we first provide some theoretical results about the dependence of CV on the parameters. Our simulations then suggest that the dependence of MCV on the parameters is similar to that of CV, although we are unable to show this formally.

5.3.1 Theoretical results

When the larger model, Model 2 in (5.11), is true, if the candidate model has covariates which are in the true model, then the corresponding parameter coefficients have no effect on the prediction error. By contrast, if the smaller model, Model 1, is true, then each prediction error is independent of β1 and β2, regardless of the size of the candidate model. To show these, we use the following lemma (without proof) and proposition.

Lemma 14. Let A be a nonsingular matrix, and U and V be two column vectors. Then

(A + UV′)^{−1} = A^{−1} − (A^{−1}U)(V′A^{−1}) / (1 + V′A^{−1}U). (5.12)

Proposition 2. Denote by X_{(i)} the matrix X with the i-th row deleted and by β̂_{(i)} the corresponding LS estimate of β when the i-th observation (y_i, x_i′) is not used. Then we have:

1. (X_{(i)}′ X_{(i)})^{−1} = (X′X)^{−1} + (X′X)^{−1} x_i x_i′ (X′X)^{−1} / (1 − h_ii),

2. β̂_{(i)} = β̂ − (X′X)^{−1} x_i ê_i / (1 − h_ii), and

3. y_i − x_i′ β̂_{(i)} = ê_i / (1 − h_ii),

where h_ii = x_i′(X′X)^{−1} x_i, β̂ = (X′X)^{−1} X′Y, ê_i = y_i − ŷ_i and ŷ_i = x_i′ β̂.

Proof. The first part of the results is proved by using Lemma 14 with A = X′X, U = −x_i and V = x_i. For part two, note that X_{(i)}′ Y_{(i)} = X′Y − x_i y_i. Thus,

β̂_{(i)} = (X_{(i)}′ X_{(i)})^{−1} X_{(i)}′ Y_{(i)}
 = [(X′X)^{−1} + (X′X)^{−1} x_i x_i′ (X′X)^{−1} / (1 − h_ii)] [X′Y − x_i y_i]
 = β̂ + (X′X)^{−1} x_i x_i′ β̂ / (1 − h_ii) − (X′X)^{−1} x_i y_i − (X′X)^{−1} x_i h_ii y_i / (1 − h_ii)
 = β̂ + (X′X)^{−1} x_i (ŷ_i − y_i) / (1 − h_ii).

Finally, part 3 follows by looking at the result of part 2 coordinatewise.

Now we show our main results for CV.

Lemma 15. Let Y = (y_1, . . . , y_n)′, E = (e_1, . . . , e_n)′, X1 and X2 be n × p and n × q matrices, and β1 and β2 be p × 1 and q × 1 vectors, respectively. Consider

Model 1: Y = X1 β1 + E,
Model 2: Y = X1 β1 + X2 β2 + E.

1. Suppose that the larger model, Model 2, is true.

(a) The prediction error, using the LS estimators based on Model 1, is independent of β1, but depends on β2.

(b) Using Model 2, the prediction error is independent of both β1 and β2.

2. Suppose that the smaller model, Model 1, is true. Then the prediction errors for both Model 1 and Model 2 are free of both β1 and β2.

Otherwise put, if the candidate model has covariates which are included in the true model, the corresponding parameter coefficients have no effect on the prediction error.

Proof. We show 1(b) first. Note that Model 2 can be written in the same form as Model 1 by letting z_i′ = (x_{1i}′, x_{2i}′), Z = (X1, X2) and β = (β1′, β2′)′, i.e.

Model 2: Y = Zβ + E.

By result 3 of Proposition 2, we have

(1 − τ_ii)[y_i − z_i′ β̂_{(i)}] = ê_i = y_i − z_i′ β̂ = y_i − z_i′ (Z′Z)^{−1} Z′Y
 = (e_i + z_i′ β) − z_i′ (Z′Z)^{−1} Z′ [Zβ + E] = e_i − z_i′ (Z′Z)^{−1} Z′E, (5.13)

where τ_ii = z_i′(Z′Z)^{−1} z_i and β̂_{(i)} is the corresponding estimate of β when the i-th observation (y_i, z_i′) is not used.
For 1(a), when the larger model is true and the candidate model is the smaller one, the i-th prediction error is

y_i − x_{1i}′ β̂_{1(i)} = (1 − h_ii)^{−1} (y_i − x_{1i}′ β̂_1)
 = (1 − h_ii)^{−1} [y_i − x_{1i}′ (X1′X1)^{−1} X1′Y]
 = (1 − h_ii)^{−1} {(e_i + x_{1i}′ β1 + x_{2i}′ β2) − x_{1i}′ (X1′X1)^{−1} X1′ [X1 β1 + X2 β2 + E]}
 = (1 − h_ii)^{−1} {e_i + x_{2i}′ β2 − x_{1i}′ (X1′X1)^{−1} X1′ [X2 β2 + E]},

where h_ii = x_{1i}′(X1′X1)^{−1} x_{1i} and β̂_{1(i)} is the corresponding estimate of β1 when the i-th observation (y_i, x_{1i}′) is not used.

For part two, the smaller model is assumed true. So, the i-th prediction error with respect to the larger candidate model can be written as

y_i − z_i′ β̂_{(i)} = (1 − τ_ii)^{−1} (y_i − z_i′ β̂)
 = (1 − τ_ii)^{−1} {y_i − z_i′ [(Z′Z)^{−1} Z′Y]}
 = (1 − τ_ii)^{−1} {e_i + x_{1i}′ β1 − z_i′ [(Z′Z)^{−1} Z′ (X1 β1 + E)]}. (5.14)

Note that X1 = Z D_p, where D_p is the (p + q) × p matrix consisting of the p × p identity matrix I_p stacked on the q × p zero matrix O_{q×p}. Thus, expression (5.14) becomes

y_i − z_i′ β̂_{(i)} = (1 − τ_ii)^{−1} {e_i + x_{1i}′ β1 − z_i′ (Z′Z)^{−1} Z′Z D_p β1 − z_i′ (Z′Z)^{−1} Z′E}
 = (1 − τ_ii)^{−1} {e_i + x_{1i}′ β1 − z_i′ D_p β1 − z_i′ (Z′Z)^{−1} Z′E}
 = (1 − τ_ii)^{−1} {e_i − z_i′ (Z′Z)^{−1} Z′E}, (5.15)

where z_i′ D_p = x_{1i}′ I_p + x_{2i}′ O_{q×p} = x_{1i}′. For the smaller candidate model, set Z = X1, so that z_i′ = x_{1i}′ and D_p = I_p, and replace τ_ii in (5.15) by h_ii.

Remark: Note that expressions (5.13) and (5.15) are exactly the same. This means that the prediction errors under the larger candidate model are independent of the parameters, regardless of which model is true.

Theorem 9. Consider comparing two models:

Model 1: Y = X1 β1 + E,
Model 2: Y = X1 β1 + X2 β2 + E.

Denote by CV(j) the sum of squared prediction errors based on Model j with LS estimators of β1 and β2, where j = 1, 2. Then we have the following:

1. If Model j is true, then CV(j) does not depend on β1 or β2.

2. If Model 1 is true, the probability that CV(1) < CV(2) is free of the parameters.

3. If Model 2 is true, the probability that CV(2) < CV(1) depends only on β2.

Before proving Theorem 9, we remark that all of these results are for the usual CV. We have been unable to show the corresponding results for MCV mathematically; nevertheless, we conjecture that the analogs of the statements in Lemma 15 and Theorem 9 hold for MCV as well. Our conjecture rests on the computational results shown below, which reveal similar properties for MCV.

Proof. The first statement follows directly from 1(b) and 2 of Lemma 15. The second and the third statements follow from 1(a) and 2 of Lemma 15 and the first statement of this theorem.

5.3.2 Dependence of model selection on the size of parameters

Practitioners commonly presume that the usual CV method outperforms other model selection criteria when the error term is normal. However, we find that the range of models for which CV is better than MCV under normal error does not include models with "small" values of the parameters. As can be surmised from Theorem 9, these parameters can affect the prediction error of models. We do not have a formal argument that gives the threshold for the size of individual β_i's; however, we have found that 0.19σ is a useful heuristic, where σ^2 is the variance of the normal error. For example, under Model 2 in (5.11), if β2 has norm less than 0.19σ, then MCV does better than CV in the normal error case.

In the following, we show simulation results to see how the relative performance of MCV and CV is related to the values of the parameters, the number of folds for the MCV and CV procedures, and the shape of the error distribution. We look first at the normal error case, and then turn to the Cauchy error. Next we quantify how CV and MCV are related to the heaviness of the tails through two classes of error distributions.
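Before turning to the simulations, the leave-one-out identity of Proposition 2 (part 3), y_i − x_i′β̂_{(i)} = ê_i/(1 − h_ii), which makes these CV computations cheap, is easy to check numerically. The synthetic data and the hand-rolled two-covariate solver below are ours, for illustration only:

```python
import random

rng = random.Random(0)

# Synthetic design: n observations on p = 2 covariates (no intercept).
n = 12
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

def xtx(rows):
    """Entries of the 2x2 matrix X'X."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    c = sum(r[1] * r[1] for r in rows)
    return a, b, c

def ols(rows, yv):
    """Solve the normal equations (X'X) beta = X'y for p = 2 by hand."""
    a, b, c = xtx(rows)
    u = sum(r[0] * t for r, t in zip(rows, yv))
    v = sum(r[1] * t for r, t in zip(rows, yv))
    det = a * c - b * b
    return [(c * u - b * v) / det, (a * v - b * u) / det]

beta = ols(X, y)
a, b, c = xtx(X)
det = a * c - b * b
inv = [[c / det, -b / det], [-b / det, a / det]]  # (X'X)^{-1}

for i in range(n):
    xi = X[i]
    # Leverage h_ii = x_i' (X'X)^{-1} x_i and full-sample residual e_hat_i.
    hii = sum(xi[j] * inv[j][k] * xi[k] for j in range(2) for k in range(2))
    e_hat = y[i] - (xi[0] * beta[0] + xi[1] * beta[1])
    # Leave-one-out fit, computed the slow way for comparison.
    b_loo = ols([r for j, r in enumerate(X) if j != i],
                [t for j, t in enumerate(y) if j != i])
    loo_err = y[i] - (xi[0] * b_loo[0] + xi[1] * b_loo[1])
    assert abs(loo_err - e_hat / (1 - hii)) < 1e-8  # Proposition 2, part 3
```

Each leave-one-out prediction error agrees with the closed form ê_i/(1 − h_ii), so the n refits never have to be carried out explicitly.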
(i) The case of normal errors

To illustrate the parameter dependence in the setting of (5.11), we started with a simple simulation using one-dimensional parameters. From Theorem 9, we know that the performance of CV, and likely that of MCV, depends on the value of β2. So, we fixed β1 = 2 in our simulation. Additionally, 1000 independent samples of size 30 were used, the covariates were generated from the uniform distribution with zero mean and unit variance, and the normal error had zero mean and variance σ^2. Since the distributions of the covariates are symmetric, it is enough to look only at β2 ≥ 0.

From our computation, we generated a graph of the curves representing the proportions of times MCV and CV chose the correct model for different values of β2/σ. This is given in Figure 5.2. The zero point on the x-axis in Figure 5.2 corresponds to β2 = 0, which means that Model 1 is true. In other words, the zero point on the x-axis corresponds to the probabilities P_1(CV(1) < CV(2)) and P_1(MCV(1) < MCV(2)), in which the subscript 1 indicates that Model 1 is taken as true. The non-zero points on the x-axis, i.e. β2 > 0, represent the cases where Model 2 is true; these points correspond to the probabilities P_2(CV(2) < CV(1)) and P_2(MCV(2) < MCV(1)), in which the subscript 2 indicates that Model 2 is taken as true.

Figure 5.2 shows that when the small model, Model 1, is true (i.e. β2 = 0), or the large model, Model 2, is true with β2/σ > 0.19, the usual CV works better than MCV does. However, when Model 2 is true and 0 < β2/σ < 0.19, MCV does better than CV. This justifies the summary in Table 5.1. The vertical dotted line represents the cutoff β2/σ = 0.19. We suspect that the sample size only affects the value of the probability but not the position of the cutoff. We also note that as β2/σ increases, both CV and MCV perform better.

Figure 5.3 presents the performance of MCV and CV in another way.
Specifically, it makes sense to look at P_1(MCV(1) < MCV(2)) − P_1(CV(1) < CV(2)) or P_2(MCV(2) < MCV(1)) − P_2(CV(2) < CV(1)), the difference between the proportion of times that MCV chooses the correct model and the proportion of times that CV chooses the correct model, as a function of β2 and σ. In Figure 5.3, we use a black dot to indicate that MCV has a higher probability of choosing the correct model and an open circle to indicate that CV has a higher probability of choosing the correct model. The dashed straight line indicates our heuristic β2 = 0.19σ. Note that all black dots are in the region 0 < β2 < 0.19σ.

[Figure 5.2 appears here: "Results for the class of N(0, sigma^2) error distribution", plotting the percentage of times CV and MCV chose the true model against β2/σ.] Figure 5.2: The zero and non-zero points on the x-axis correspond to the cases where Model 1 and Model 2 are true, respectively. The vertical dotted line represents the value of β2/σ = 0.19.

[Figure 5.3 appears here: "Comparison of MCV and CV under the normal error distribution", plotting σ against β2.] Figure 5.3: Black dots mean MCV outperforms; that is, the proportion of times that MCV chooses the correct model is higher than that for CV. Open circles mean the reverse. The dashed straight line indicates the line β2 = 0.19σ. All black dots are in the region 0 < β2 < 0.19σ.

Next we compare the performance of LOO and 5-fold CV and MCV under the standard normal error distribution with some comparatively large values of β2 ∈ {0, 1, 2, 3, 4, 5, 6} and σ = 1. When β2 = 0, Model 1 is true, while Model 2 is true for nonzero β2. Figure 5.4 shows that for normal errors, the proportion of times that CV chooses the correct model is higher than that of MCV, regardless of whether Model 1 or 2 is true.
We also note that those results for LOO and 5 fold look very similar. Again, as the value of β2 increases, the percentages that choose the true (larger) model by both CV and MCV also increase. (ii) The case of Cauchy errors When the error distribution is heavy-tailed, the results seen in Figure 5.4 change dramatically to in Figure 5.5. Similar to the setting of Section 5.3.2(i), we continue to use large parameter values but change the error term to the Cauchy. Figure 5.5 shows the results with LOO and 5-fold CV and MCV. It is seen that when the larger model, Model 2, is true with the Cauchy error distribution, MCV chooses the correct model more than CV does. However, when the smaller model, Model 1, is true, CV does better than MCV even for the Cauchy distribution. Again, the probability that CV or MCV choose the correct model increases with β2 . (iii) Two classes of heavy tailed error distributions The results of the last subsections suggest there is a boundary based on the heaviness of the tails of the error so that on one side MCV is better and on the other CV is better. To locate this boundary we did simulations with two classes of error terms for which the tail behavior can be controlled by a single parameter. The first class is based on the t distribution and so contains both the Cauchy and the normal as special cases. Denote by tv the t-distribution with v degrees of freedom, where v ∈ (0, ∞). It is well-known that the thickness of tails of the tv distribution decreases as the degree of freedom v increases. When v = 1, the tv distribution becomes the Cauchy. When v → ∞, the tv distribution goes to normal, but in practice there is 137 Chapter 5. 
virtually no difference between the normal and tv distributions when v is 30.

[Figure 5.4 omitted: two panels plotting "% to choose the true model" against β2 for LOO and 5-fold CV and MCV with sample size 30 and normal error.] Figure 5.4: LOO and 5-fold MCV and CV for normal errors.

[Figure 5.5 omitted: two panels plotting "% to choose the true model" against β2 for LOO and 5-fold CV and MCV with sample size 30 and Cauchy error.] Figure 5.5: LOO and 5-fold MCV and CV for Cauchy errors.

The second class of heavy-tailed distributions is the Lévy skew α-stable distribution for α ∈ (0, 2]; see Wikipedia (2008). This distribution is commonly used in econometric applications for heavy-tailed data such as stock returns. This class of distributions is characterized by four parameters: i) the characteristic exponent (or index of stability) α ∈ (0, 2]; ii) the scale (or spread) parameter σ ≥ 0; iii) the skewness (or symmetry) parameter τ ∈ [−1, 1]; and iv) the shift (or location) parameter µ ∈ R. The parameter τ indicates the skewness of the distribution, where τ = 0 corresponds to the symmetric case. In the computations shown here, we have only varied α, setting τ = 0, µ = 0 and σ = 1. The characteristic exponent α determines the heaviness of the tails of the distribution. When α equals 1, the Lévy skew α-stable distribution reduces to the Cauchy distribution, and when α = 2, it gives the normal distribution.
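Both error classes are easy to simulate. The sketch below (illustrative, not the thesis code) draws t errors directly and symmetric (τ = 0) α-stable errors by the standard Chambers–Mallows–Stuck construction; note that under this parameterization α = 2 yields N(0, 2) rather than N(0, 1), so a rescaling by 1/√2 recovers the standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)

def t_errors(v, size):
    """Student-t errors: v = 1 is the Cauchy, large v is close to normal."""
    return rng.standard_t(v, size)

def stable_errors(alpha, size):
    """Symmetric (tau = 0) alpha-stable errors via Chambers-Mallows-Stuck.
    alpha = 1 reduces to the Cauchy; alpha = 2 gives N(0, 2)."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))
```

For α = 1 the second factor has exponent 0, and the expression collapses to tan(U), the familiar Cauchy construction.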
Using these two classes of error distributions, and the same simulation settings as in Section 5.3.2(i), we generated Figures 5.6 and 5.7, parallel to Figure 5.3. That is, for a range of values of β2 and v, or β2 and α, we determined which of CV and MCV has better performance in the sense of equation (5.6). Both Figures 5.6 and 5.7 show a convex region in the plane for which CV is better than MCV; outside this region MCV is better. In both cases, it is seen that when the tails are heavy enough, say v ≤ 1.5 or α ≤ 1.4, MCV wins. However, as the tails become lighter, i.e., as v or α increases, the range of β2's for which MCV wins shrinks. For v > 1.5 and α > 1.4 there is a central region of β's for which CV wins, and the width of this region increases with v or α. On the x-axis, where β2 = 0, i.e., the smaller model, Model 1, is true, CV always wins. This is consistent with the intuition that CV is good at finding small models but may break down if the simplest models are not true. The regions where CV does well as a function of the tail parameter and β2 look like parabolas, and this may lead to a useful heuristic for deciding which of CV and MCV to use. However, we have not satisfactorily identified one yet.

[Figure 5.6 omitted: comparison of MCV and CV under the t error distributions, plotted over β2 and the degrees of freedom.] Figure 5.6: Comparison of MCV and CV for the 2-model case by β2 and the degrees of freedom v of the t distribution. Black dots mean that MCV works better in the sense that it has a higher proportion of times that the correct model is chosen than CV, while the open circles mean that CV outperforms.

5.4 Three term case

Although comparing two models is a paradigm test case, it is relatively rare that an analyst is only entertaining two models. To see how the two-model case generalizes to more complex settings, we now consider the setting in which there are three terms.
Now there are two cases, the nested and the non-nested. It will be seen that the results of the three-term cases are broadly consistent with the two-model case. That is, the qualitative conclusions from the two-model case extend readily.

[Figure 5.7 omitted: comparison of MCV and CV under the Lévy stable error distributions, plotted over β2 and the exponent α.] Figure 5.7: Comparison of MCV and CV for the 2-model case by β2 and the exponent α of the Lévy stable distribution. Again, black dots mean that MCV works better than CV, while the open circles mean that CV outperforms.

Section 5.4.1 provides some theoretical results. In Section 5.4.2, we have simulation results, where Sections 5.4.2(i) and (ii) compare the performance of MCV and CV when the error is normal and Cauchy, respectively. Then in Section 5.4.2(iii), we again look at the cases of more general heavy-tailed distributions.

5.4.1 Theoretical results

First, we extend Theorem 9 to three nested models.

Theorem 10. Consider comparing the three models:

Model 1: Y = X1β1 + E,
Model 2: Y = X1β1 + X2β2 + E,                                        (5.16)
Model 3: Y = X1β1 + X2β2 + X3β3 + E.

Again, denote by CV(j) the sum of squared prediction errors based on Model j with LS estimators of parameters, where j = 1, 2 and 3. Then we have:
1. When Model 1 is true, the probability that CV(1) is minimum is free of the parameters;
2. When Model 2 is true, the probability that CV(2) is minimum depends on β2 only;
3. When Model 3 is true, the probability that CV(3) is minimum depends on β2 and β3.

Proof. This theorem is a direct result of Theorem 9.

         Model 1 as true   Model 2 as true   Model 3 as true
CV(1)    ×                 β2                β2 and β3
CV(2)    ×                 ×                 β3
CV(3)    ×                 ×                 ×

Table 5.2: Dependence of MCV and CV on parameters

The conclusion of Theorem 10 can be neatly summarized in Table 5.2.
The column on the left indicates which CV(j) for Model j, where j = 1, 2, 3, is being calculated. The top row indicates which Model i, for i = 1, 2, 3, is true. In the 9 cells of the table, the appearance of a βk for k = 2, 3 in cell (i, j) indicates that when Model i is true, the prediction error for Model j depends on βk, and hence so does CV(j). The appearance of a cross × means that CV(j) has no dependence on any βk. Thus, in the 3-nested-model case, when Model 1 is true, the minimum of CV(1), CV(2) and CV(3) is independent of all regression parameters. When Model 2 is true, the minimum of CV(1), CV(2) and CV(3) depends on β2 only. Finally, when Model 3 is true, the minimum of CV(1), CV(2) and CV(3) depends on both β2 and β3.

The same kind of theorem can be established for four or more nested models, and each can be summarized by tables of the same form. However, when the model list becomes more complicated, no reductions as in Table 5.2 are possible. That is, every CV measure for a candidate model depends on all the relevant parameters. So we only list the seven non-nested models below for our simulation study, without any analysis like that in Theorem 10. The list of the seven non-nested models is:

Model 1: Y = X1β1 + E,
Model 2: Y = X2β2 + E,
Model 3: Y = X3β3 + E,
Model 4: Y = X2β2 + X3β3 + E,
Model 5: Y = X1β1 + X3β3 + E,                                        (5.17)
Model 6: Y = X1β1 + X2β2 + E,
Model 7: Y = X1β1 + X2β2 + X3β3 + E.

5.4.2 Simulation

To situate our comparison of models based on 3 terms in the more general context of which it is representative, consider models with k covariates X1, . . . , Xk. In general, there are 2^k possible true models. However, we exclude the trivial model, that is, the model without any covariates. In our simulation, we have generated the Xk from the same distribution. That is, all explanatory variables enter the model in the same way.
Because of this, any two models with the same number of explanatory variables are equivalent. Thus, we use the model with covariate X1 to represent all models with one covariate, use the model with X1 and X2 to represent any two-covariate model, and so on. For instance, if k = 5, then we only consider five possible true models: Models (1), (1, 2), (1, 2, 3), (1, 2, 3, 4) and (1, 2, 3, 4, 5), where Model (1, 2) represents the true model with an intercept and covariates X1 and X2, and the others are similar. For simplicity, we use a single model to unify all possible candidates on the model list. Write the model

Y = α + γ1β1X1 + . . . + γkβkXk + E,                                        (5.18)

where γj = 1 or 0 according to whether Xj is or is not in the candidate model, for j = 1, . . . , k. Parallel to what we did in Sections 5.3.2(i), (ii) and (iii) for the two-term cases, we give simulation results for the three-term cases in Sections 5.4.2(i), (ii) and (iii).

(i) The case of normal errors for nested models

We begin by generating a figure to compare the behavior of MCV and CV for choosing among 3 nested models. This generalizes Figure 5.3 for comparing two models. Ideally, we want a three-dimensional plot showing σ, β2 and β3 on its axes, with each point in R³ indicating whether CV or MCV is better when the larger model, Model 3 in (5.16), is true. However, the figures generated to represent this are hard to depict conveniently on a page. So we only present the slice through R³ for β3 = 0, i.e., the case that Model 1 or 2 is true. Figure 5.8 shows the relative performance of MCV and CV for choosing between Models 1 and 2 when it is possible to choose Model 3 as well. As before, it is seen that the range of β's for which MCV does better than CV increases with σ. By Theorem 10, when Model 2 in (5.16) is true, the value of β1 has no effect but the value of β2 does. Also, when Model 1 is true, there is no effect from β1 either.
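Representation (5.18) makes the candidate list easy to enumerate in code. The sketch below is illustrative only: it lists all non-trivial indicator vectors γ and builds the corresponding design matrix with an intercept.

```python
from itertools import product

import numpy as np

def candidate_models(k):
    """All 2^k - 1 non-trivial indicator vectors gamma in model (5.18)."""
    return [g for g in product([0, 1], repeat=k) if any(g)]

def design_matrix(X, gamma):
    """Intercept column plus the covariates X_j with gamma_j = 1."""
    cols = [np.ones(X.shape[0])] + [X[:, j] for j, g in enumerate(gamma) if g]
    return np.column_stack(cols)

models = candidate_models(3)   # the seven models listed in (5.17)
```

For k = 3 this produces exactly the seven non-nested candidates of (5.17); for k = 5 it gives the 31 models from which the five representatives above are drawn.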
So, without loss of generality, we have set β1 = 2 with k = 2 and α = 1 in (5.18) for the true model with β2 ≥ 0. Also, we generate 1000 independent samples of size n = 30 with covariates generated from the standard normal distribution.

[Figure 5.8 omitted: comparison of MCV and CV under the normal error distribution, plotted over β2 and σ.] Figure 5.8: 3-term nested model: Black dots mean that MCV outperforms CV in the sense that the proportion of times choosing the correct model by MCV is higher than by CV, while open circles mean CV does better. The dashed straight line represents the line β2 = 0.19σ. All black dots are in the region 0 < β2 < 0.19σ.

Figure 5.8 shows that when β2 = 0, i.e., Model 1 is true, CV outperforms MCV no matter what σ is. If β2 is not zero, i.e., Model 2 is true, MCV outperforms CV when 0 < β2 < 0.19σ, the same as the condition we got for the two-model case in Section 5.3.2(i). Additionally, if β3 is also not zero, i.e., Model 3 is true, we expect to get a 3-dimensional generalization of the black-dot region in Figure 5.8.

(ii) Two classes of heavy-tailed error distributions for nested models

Parallel to Section 5.3.2(iii), we again determined how heavy the tails must be for MCV to outperform CV systematically, with the same simulation setting as for Figure 5.8 in Section 5.4.2(i). As before, we used the tv and Lévy α-stable classes of distributions for the error terms. Our results are summarized in Figures 5.9 and 5.10 which, like Figures 5.6 and 5.7, are for the nested case in Theorem 10. Thus, Figures 5.9 and 5.10 extend Figure 5.8 to heavier tails. It is seen that, roughly, when v ≤ 1 or α ≤ 1.1, MCV always wins if β2 > 0. Otherwise, there is a bound b such that CV wins when β2 ≥ b. This bound b decreases to zero as the tails become lighter, i.e., as v or α increases. That is, when the tails of the error are believed to be not much heavier than
normal, it is reasonable to default to CV routinely in these problems.

[Figure 5.9 omitted: comparison of MCV and CV under the t error distributions, plotted over β2 and the degrees of freedom.] Figure 5.9: 3-term nested model: Comparison of MCV and CV by β2 and the degrees of freedom v of the t distribution. Black dots mean that MCV works better in the sense that it has a higher proportion of times that the correct model is chosen than CV, while the open circles mean that CV outperforms MCV.

[Figure 5.10 omitted: comparison of MCV and CV under the Lévy stable error distributions, plotted over β2 and the exponent α.] Figure 5.10: 3-term nested model: Comparison of MCV and CV by β2 and the exponent α of the Lévy stable distribution. Again, black dots mean that MCV works better than CV, while the open circles mean that CV outperforms MCV.

(iii) Non-nested models

To compare MCV and CV in the context of a non-nested model list, set k = 3. Now, generate B = 1000 samples of size n = 30 and use five-fold CV and MCV to choose the correct model. To be specific, let the true values of (α, β1, β2, β3) in (5.18) be (2,8,0,0), (2,8,5,0) and (2,8,5,6) to represent a small, medium and large true model, respectively. Although the values 2, 8, 5 and 6 are large relative to the distributions of the covariates and the error term, we choose them to be in line with the examples in Shao (1993). Large values such as these will tend to favor CV when the error is normal and MCV when the error is heavy-tailed. The covariates {Xj : j = 1, . . . , k} are generated from three different distributions with zero mean and unit variance, namely (i) N(0, 1), (ii) U(−√3, √3) and (iii) t3/√3, where t3 is the t-distribution with 3 degrees of freedom. For simplicity, we first use N(0, 1) for the error term.
Our results are summarized in Table 5.3 for the normal error, Tables 5.4 and 5.5 for the t distributions with degrees of freedom v = 1 and 0.5, respectively, and Table 5.6 for the Lévy α-stable distribution with α = 0.5. Note that under the heading "Model", the integers in parentheses indicate which covariates are in the candidate models. The entries shown in boldface in the original tables, i.e., those in the true model's column, are the probabilities that CV or MCV chooses the correct model. It is seen that when the error term is normal, CV always outperforms MCV regardless of the covariates. As indicated by the crosses and checks in the "True beta" column, as the heaviness of the tails increases, MCV does better and better relative to CV regardless of the covariates.

Table 5.3: Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from N(0, 1) and covariates x from N(0, 1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.

Covariates   True beta      Method  (1)    (2)    (3)    (2,3)  (1,3)  (1,2)  (1,2,3)
N(0,1)       (2,8,0,0)  ×   cv      0.627  0.000  0.000  0.000  0.168  0.153  0.052
                            mcv     0.419  0.000  0.000  0.000  0.225  0.218  0.138
             (2,8,5,0)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.781  0.219
                            mcv     0.000  0.000  0.000  0.000  0.000  0.601  0.399
             (2,8,5,6)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.000  1.000
                            mcv     0.000  0.000  0.000  0.000  0.000  0.000  1.000
U(−√3,√3)    (2,8,0,0)  ×   cv      0.613  0.000  0.000  0.000  0.154  0.179  0.054
                            mcv     0.419  0.000  0.000  0.000  0.211  0.247  0.123
             (2,8,5,0)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.806  0.194
                            mcv     0.000  0.000  0.000  0.000  0.000  0.642  0.358
             (2,8,5,6)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.000  1.000
                            mcv     0.000  0.000  0.000  0.000  0.000  0.000  1.000
t3/√3        (2,8,0,0)  ×   cv      0.615  0.000  0.000  0.000  0.152  0.168  0.065
                            mcv     0.471  0.000  0.000  0.000  0.220  0.225  0.084
             (2,8,5,0)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.794  0.206
                            mcv     0.000  0.000  0.000  0.000  0.001  0.678  0.321
             (2,8,5,6)  ×   cv      0.000  0.000  0.000  0.000  0.000  0.000  1.000
                            mcv     0.000  0.000  0.000  0.000  0.005  0.000  0.995

Table 5.4: Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the t1 (or Cauchy) distribution and covariates x from N(0, 1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.

Covariates   True beta      Method  (1)    (2)    (3)    (2,3)  (1,3)  (1,2)  (1,2,3)
N(0,1)       (2,8,0,0)  ×   cv      0.658  0.037  0.030  0.003  0.122  0.123  0.027
                            mcv     0.498  0.000  0.000  0.000  0.221  0.184  0.097
             (2,8,5,0)  ✓   cv      0.105  0.053  0.024  0.004  0.013  0.659  0.142
                            mcv     0.000  0.000  0.000  0.000  0.002  0.694  0.304
             (2,8,5,6)  ✓   cv      0.051  0.033  0.030  0.018  0.062  0.051  0.755
                            mcv     0.000  0.000  0.000  0.000  0.007  0.003  0.990
U(−√3,√3)    (2,8,0,0)  ×   cv      0.621  0.050  0.053  0.001  0.117  0.125  0.033
                            mcv     0.454  0.000  0.000  0.000  0.215  0.238  0.093
             (2,8,5,0)  ✓   cv      0.113  0.067  0.034  0.007  0.020  0.619  0.140
                            mcv     0.002  0.000  0.000  0.000  0.000  0.688  0.310
             (2,8,5,6)  ✓   cv      0.054  0.050  0.046  0.017  0.076  0.034  0.723
                            mcv     0.001  0.000  0.000  0.000  0.002  0.000  0.997
t3/√3        (2,8,0,0)  ×   cv      0.658  0.047  0.053  0.003  0.110  0.101  0.028
                            mcv     0.482  0.000  0.000  0.000  0.211  0.209  0.098
             (2,8,5,0)  ✓   cv      0.128  0.065  0.043  0.013  0.014  0.619  0.118
                            mcv     0.010  0.003  0.000  0.000  0.003  0.687  0.297
             (2,8,5,6)  ✓   cv      0.072  0.033  0.047  0.035  0.083  0.066  0.664
                            mcv     0.000  0.000  0.001  0.002  0.027  0.005  0.965

Table 5.5: Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the t0.5 distribution and covariates x from N(0, 1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.

Covariates   True beta      Method  (1)    (2)    (3)    (2,3)  (1,3)  (1,2)  (1,2,3)
N(0,1)       (2,8,0,0)  ✓   cv      0.420  0.244  0.247  0.009  0.031  0.041  0.008
                            mcv     0.471  0.000  0.000  0.000  0.212  0.216  0.101
             (2,8,5,0)  ✓   cv      0.328  0.254  0.235  0.012  0.025  0.123  0.023
                            mcv     0.009  0.003  0.000  0.000  0.005  0.693  0.290
             (2,8,5,6)  ✓   cv      0.294  0.238  0.253  0.023  0.053  0.044  0.095
                            mcv     0.004  0.000  0.001  0.003  0.022  0.014  0.956
U(−√3,√3)    (2,8,0,0)  ✓   cv      0.389  0.239  0.265  0.015  0.040  0.041  0.011
                            mcv     0.452  0.000  0.000  0.000  0.229  0.234  0.085
             (2,8,5,0)  ✓   cv      0.304  0.257  0.243  0.019  0.028  0.128  0.021
                            mcv     0.004  0.000  0.000  0.001  0.005  0.715  0.275
             (2,8,5,6)  ✓   cv      0.275  0.233  0.257  0.034  0.057  0.050  0.094
                            mcv     0.002  0.000  0.000  0.001  0.011  0.006  0.980
t3/√3        (2,8,0,0)  ✓   cv      0.427  0.261  0.221  0.010  0.035  0.034  0.012
                            mcv     0.471  0.004  0.000  0.001  0.217  0.210  0.097
             (2,8,5,0)  ✓   cv      0.346  0.276  0.211  0.020  0.021  0.101  0.025
                            mcv     0.017  0.005  0.001  0.001  0.011  0.684  0.281
             (2,8,5,6)  ✓   cv      0.304  0.254  0.228  0.032  0.060  0.047  0.075
                            mcv     0.007  0.000  0.004  0.005  0.035  0.024  0.925

Table 5.6: Proportion of times each candidate model is selected by 5-fold CV and MCV with errors generated from the Lévy (α = 0.5) distribution and covariates x from N(0, 1), U(−√3, √3) and t3/√3. In the "True beta" column, × indicates that CV wins and ✓ indicates that MCV wins.

Covariates   True beta      Method  (1)    (2)    (3)    (2,3)  (1,3)  (1,2)  (1,2,3)
N(0,1)       (2,8,0,0)  ✓   cv      0.351  0.257  0.302  0.010  0.036  0.035  0.000
                            mcv     0.425  0.000  0.000  0.000  0.248  0.214  0.113
             (2,8,5,0)  ✓   cv      0.309  0.259  0.293  0.020  0.026  0.076  0.017
                            mcv     0.006  0.000  0.000  0.000  0.002  0.674  0.318
             (2,8,5,6)  ✓   cv      0.279  0.261  0.289  0.024  0.058  0.039  0.050
                            mcv     0.003  0.001  0.000  0.001  0.011  0.003  0.981
U(−√3,√3)    (2,8,0,0)  ✓   cv      0.378  0.273  0.249  0.020  0.038  0.031  0.011
                            mcv     0.443  0.000  0.000  0.001  0.210  0.232  0.114
             (2,8,5,0)  ✓   cv      0.317  0.287  0.242  0.020  0.030  0.085  0.019
                            mcv     0.002  0.000  0.001  0.001  0.003  0.694  0.299
             (2,8,5,6)  ✓   cv      0.296  0.274  0.252  0.032  0.046  0.044  0.056
                            mcv     0.004  0.000  0.001  0.001  0.005  0.003  0.986
t3/√3        (2,8,0,0)  ✓   cv      0.376  0.304  0.244  0.010  0.034  0.024  0.008
                            mcv     0.475  0.003  0.001  0.000  0.206  0.223  0.092
             (2,8,5,0)  ✓   cv      0.331  0.313  0.227  0.013  0.023  0.077  0.016
                            mcv     0.006  0.004  0.000  0.001  0.005  0.690  0.294
             (2,8,5,6)  ✓   cv      0.302  0.288  0.243  0.032  0.046  0.042  0.047
                            mcv     0.003  0.001  0.006  0.007  0.018  0.012  0.953

Finally, we considered a range of v's and α's, so as to characterize again how heavy the tails must be for CV to do better than MCV. To generate Figures 5.11 and 5.12 we used n = 50 and B = 300 with standard normal covariates for a range of choices of v and α. We set all the nonzero parameters equal to 2 so that the size of the regression parameters would be comparable to the size of the error term and both would be larger than the variability of the covariates. We believe this is a stringent test of the methods since the error term has twice the variability of any of the covariates. The curves in Figures 5.11 and 5.12 show the probability that each method chooses the correct model. The values plotted correspond to the boldface values in Tables 5.3, 5.4, 5.5 and 5.6. The true model in each case is indicated at the top of the figures. These figures show that for the tv class of error terms, CV outperforms MCV only when v is a little larger than 1, say around 1.3, and that for the Lévy α-stable class CV outperforms MCV only when α is a little larger than 1, say 1.2. It is clear that the location of the threshold value for the tv or Lévy α-stable distribution depends on the true model, but in all cases the thresholds fluctuated between 1 and 1.5. Qualitatively, Figures 5.11 and 5.12 show that MCV performs roughly consistently over a large class of error distributions while CV performs poorly for heavy tails and much better for light tails, for the models chosen. This means that MCV is relatively robust to the error distribution, unlike CV.
5.5 Discussion

In this article we have proposed a modification of cross-validation based on replacing the mean with the median, and we call it MCV. We have argued on abstract grounds that this procedure should have useful robustness properties, should work well for heavy-tailed error distributions, and is well motivated theoretically. Instead of evaluating its absolute performance, we have evaluated its performance relative to CV. We argue this is reasonable since CV is well understood and widely used. In addition to revealing that MCV routinely outperforms CV for heavy-tailed distributions, our simulations found that the models for which MCV outperformed CV even in the normal error case could be characterized informally. Specifically, when the parameter values are small, MCV in general also outperforms CV. We do not have a good explanation for this; however, it seems that medians are better able to discern the presence of a signal in noisy settings but perhaps less able to quantify it.

We remark that although using the median to define a cross-validation criterion for model selection is new, the first use of an MCV was in Zheng and Yang (1998). These authors used an MCV for choosing the optimal number of nearest neighbors in a regression context. Our contribution is the recognition that MCV is more general, and we have studied its properties in a linear model selection context. Indeed, Yu (2008) has proposed using an MCV to choose the bandwidth in a kernel regression smoother. It is easy to imagine other contexts in which MCV might be appropriate.

[Figure 5.11 omitted: non-nested case for a class of tv error distributions; three panels plot "% that cv/mcv chose the true model" against the degrees of freedom for the true models with β = (2,0,0), (2,2,0) and (2,2,2).] Figure 5.11: Non-nested case for a class of tv error distributions.

[Figure 5.12 omitted: non-nested case for a class of Lévy α-stable error distributions; three panels plot "% that cv/mcv chose the true model" against α for the true models with β = (2,0,0), (2,2,0) and (2,2,2).] Figure 5.12: Non-nested case for a class of Lévy α-stable error distributions.

There are numerous natural directions for the further development of median methods stemming from the MCV. First, although we have not done so here, one natural strategy for helping to decide whether to use CV or MCV is to look at the histogram of residuals from fitting a regression model. If the histogram looks normal, then CV is suggested. However, substantial departures from normality, in particular in tail behavior, suggest MCV might be reasonable, especially if the coefficients in the regression model are small. Second, our simulations here only involved linear models on the grounds that they form the most tractable class. However, MCV and CV are naturally suited to more sophisticated regression settings involving neural nets, recursive partitioning, and other more general classes of models.
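The residual-histogram rule of thumb just described can be automated crudely; the excess-kurtosis cutoff below is a hypothetical illustration for a sketch, not a calibrated recommendation from this work.

```python
import numpy as np

def excess_kurtosis(r):
    """Sample excess kurtosis: roughly 0 for normal residuals, large for heavy tails."""
    r = np.asarray(r, dtype=float) - np.mean(r)
    return np.mean(r ** 4) / np.mean(r ** 2) ** 2 - 3.0

def suggest_criterion(residuals, cutoff=1.0):
    """Suggest MCV when the residual tails look heavy, CV otherwise.
    The cutoff of 1.0 is an arbitrary placeholder, for illustration only."""
    return "MCV" if excess_kurtosis(residuals) > cutoff else "CV"
```

Any tail diagnostic (a normal Q-Q plot, a formal normality test) could replace the kurtosis check; the point is only that the choice between CV and MCV can be driven by the fitted residuals.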
For instance, we hope to investigate the effectiveness of MCV for generating a median-based version of stacking in the hope of getting even better robustness properties for predictive purposes. Also, we hope to investigate MCV for choosing the architecture of a neural net and the splits in forming a tree model. Third, we suggest that minimizing medians may be an appropriate optimality criterion more generally. An instance of this would be adding terms to a generalized additive model successively to reduce the median error, or minimizing the median of a penalized risk analogous to shrinkage methods such as LASSO. Finally, throughout this article we have assumed that the covariates are independent and that the true model is on the model list. It is therefore natural to extend our results to the cases where the covariates are dependent or the model list does not contain the true model.

5.6 References

Cantoni, E. and Ronchetti, E., 2001. Resistant selection of the smoothing parameter for smoothing splines. Statistics and Computing, 11, 141-146.
Chakrabarti, A. and Samanta, T., 2008. Asymptotic optimality of a cross-validatory predictive approach to linear model selection. Institute of Mathematical Statistics Collections, 3, 138-154.
Efron, B., 1986. How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81, 461-470.
Fan, J. and Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96, 1348-1360.
Geisser, S., 1975. The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70, 320-328.
Hastie, T.J. and Tibshirani, R.J., 1990. Generalized Additive Models. New York, Chapman and Hall.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Statist., 35, 73-101.
Huber, P.J., 1973. Robust regression. Ann. Statist., 1, 799-821.
Leung, H.Y., 2005.
Cross-validation in nonparametric regression with outliers. Ann. Statist., 33(5), 2291-2310.
Li, K.C., 1987. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15, 958-975.
McCulloch, J.H., 1996. Financial applications of stable distributions. In G.S. Maddala and C.R. Rao (eds.), Handbook of Statistics, Vol. 14, Elsevier, 393-425.
Rachev, S.T. and Mittnik, S., 2000. Stable Paretian Models in Finance. Wiley, New York, NY.
Ronchetti, E., Field, C. and Blanchard, W., 1997. Robust linear model selection by cross validation. J. Amer. Statist. Assoc., 92, 1017-1023.
Rostek, M.J., 2006. Quantile maximization in decision theory. Mimeo, Dept. of Economics, Yale University, New Haven, Connecticut.
Rousseeuw, P.J., 1984. Least median of squares regression. J. Amer. Statist. Assoc., 79, 871-880.
Rousseeuw, P.J. and Leroy, A.M., 2003. Robust Regression and Outlier Detection. John Wiley and Sons.
Shao, J., 1993. Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88, 486-495.
Shao, J., 1997. An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-264.
Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36, 111-147.
Stone, M., 1977. An asymptotic equivalence of choice of model by cross validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B, 39, 44-47.
Smyth, P. and Wolpert, D., 1999. Linearly combining density estimators via stacking. Machine Learning, 36, 59-83.
Wikipedia, 2008. Lévy skew alpha-stable distribution. http://en.wikipedia.org/wiki/Levy_skew_alpha-stable_distribution
Yu, C.W., 2008. Median kernel smoother. In preparation.
Yu, C.W. and Clarke, B., 2008. Median loss analysis. Submitted.
Zhang, P., 1993. Model selection via multifold cross validation. Ann. Statist., 21, 299-313.
Zheng, Z.G. and Yang, Y., 1998.
Cross-validation and median criterion. Statistica Sinica, 8, 907-921.

Chapter 6
Summary and Future Plan

6.1 Summary of the thesis

In this thesis, we have constructed a median analog of classical expectation-based decision theory. Our structure rests on Rostek's Quantile Utility Model in the same way as conventional decision theory rests on von Neumann and Morgenstern's Expected Utility model. We also defined a median version of minimaxity and admissibility to get results that parallel those from the usual risk-based minimaxity and admissibility. To establish median-based results in the estimation context of decision theory, we defined a median loss (hereafter medloss) criterion and then found the optimal estimators by minimizing the medloss in the Bayesian framework. The median Bayes formulation parallels the usual Bayes estimators that minimize a posterior expected loss. Having obtained a large class of new estimators, we proceeded to elucidate their properties.

First, we established consistency and asymptotic normality for the medloss-minimizing estimators in the Bayesian context. For comparison, we also examined two Frequentist estimators, the least median of squares (LMS) estimator and the least trimmed squares (LTS) estimator, in regression problems. The LMS estimator is indeed the Frequentist version of our median-loss-based estimator. However, for technical reasons, the LMS has cube-root asymptotics while the Bayesian medloss estimators have square-root asymptotics. In fact, it is more natural to compare our Bayesian medloss estimators to LTS estimators since both have the same rate of convergence. We have done this and argued that the Bayesian median-based method is a good tradeoff between the two Frequentist estimators.

To apply our median-based approach in a model selection context rather than a parameter estimation context, we have proposed a median analog to the usual cross validation (CV) technique.
We call our proposed method Median CV (MCV). Simulations are conducted to show that in some normal error cases for linear models, a good heuristic is that MCV tends to outperform CV when 0 < β ≤ 0.19σ. Otherwise, typically, CV outperforms MCV. We also provide simulations in which the error term is non-normal. The two classes we considered are the t-distributions with degrees of freedom v > 0 and the Lévy skew α-stable distributions with α ∈ (0, 2]. Simulations for a range of v and α indicate the levels of departure from normality for which MCV outperforms CV. Curiously, MCV seems to do better than CV for extracting small terms in a normal error setting.

6.2 Future plans

Having examined in this thesis the application of median-based methods to parameter estimation and model selection, we recognize that there are many other areas of statistics in which median methods may find natural application. First, we look at replacing means with medians in a variety of model classes. Then we extend MCV to model averaging techniques for prediction. In addition to a median version of Bayes model averaging, we propose median stacking. We also recognize that it is possible to develop a median-based theory for shrinkage estimation paralleling standard techniques such as LASSO. Finally, our median-based decision theory is also an important direction to pursue.

6.2.1 Using the median in other model classes

Median-based criteria can be applied to additive models as well as to non-additive models. The application of the median to the simplest additive model, linear regression, was developed in Chapter 5. More generally, MCV can be used in any nonparametric regression context.
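A generic MCV criterion for an arbitrary regression procedure is easy to write down. The sketch below is illustrative only: `mcv_score` accepts any fit/predict pair and returns the median absolute held-out error, and the running-mean smoother used to exercise it is a hypothetical stand-in for a real nonparametric fitter.

```python
import numpy as np

def mcv_score(fit, x, y, k=5, seed=0):
    """Median of absolute held-out prediction errors over k folds.
    `fit(xtr, ytr)` must return a callable that predicts at new x values."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errs = np.concatenate([
        np.abs(y[f] - fit(np.delete(x, f), np.delete(y, f))(x[f])) for f in folds])
    return float(np.median(errs))

def running_mean(h):
    """A toy local-averaging smoother with window half-width h (illustrative only)."""
    def fit(xtr, ytr):
        def predict(xs):
            out = np.empty(len(xs))
            for i, s in enumerate(xs):
                near = np.abs(xtr - s) <= h
                out[i] = ytr[near].mean() if near.any() else ytr.mean()
            return out
        return predict
    return fit

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 6, 120))
y = np.sin(x) + 0.3 * rng.standard_normal(120)
scores = {h: mcv_score(running_mean(h), x, y) for h in (0.2, 0.6, 3.0)}
best_h = min(scores, key=scores.get)   # tuning parameter chosen by MCV
```

The same scorer applies unchanged to a spline, a neural net, or a tree: only the `fit` callable changes, which is the sense in which MCV extends beyond linear models.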
This includes (i) smoothing splines – to choose the smoother, (ii) neural networks – to choose the network architecture, (iii) projection pursuit regression (PPR) – to choose the number of terms, (iv) recursive partitioning – to choose the splits, and (v) (generalized) additive models – for variable selection. Here, for brevity, we review the situation for smoothing splines, PPR, and trees.

For smoothing splines, the main task is to choose a smoothing parameter λ by minimizing the average prediction squared error. However, we can replace the sum of squares with a median and minimize

median_{1≤i≤n} |y_i − f(x_i)| + λ ∫ {f''(x)}² dx,   (6.1)

instead, where λ can be chosen by MCV.

If a model with more sparsity is desired, Friedman and Stuetzle (1981) proposed an additive model on projected variables of the form

Y = β_0 + Σ_{i=1}^p f_i(β_i^T X) + ε,

for vectors β_i = (β_{i1}, . . . , β_{ip})^T, where the dimension p is chosen by the user, possibly by comparing different models by CV. This method is called projection pursuit regression (PPR). Recall that in fitting a PPR model, the f_i's must be estimated by some auxiliary technique such as splines or, more generally, backfitting. Thus, parallel to the spline case, MCV could be used to choose the smoothing parameters as well. These two uses of MCV – in term selection and in parameter estimation – may make the resulting PPR model more robust.

Another well-known nonparametric model is given by recursive partitioning. The main advantages of this model class are high computational efficiency and a good tradeoff between comprehensibility and predictive accuracy. The conventional recursive partitioning procedure involves two phases: growing the tree and pruning it. In the growing phase, the input domain of x is recursively partitioned into cells. Each cell corresponds to a leaf of a large initial tree. However, this can lead to overfitting and therefore to suboptimal prediction.
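Before turning to pruning, criterion (6.1) can be made concrete. The sketch below evaluates the median fit term plus the roughness penalty for a candidate fit supplied on a grid; the trapezoid approximation of the integral and the function name are ours:

```python
import numpy as np

def medloss_spline_criterion(y, fhat, fpp, grid, lam):
    """Criterion (6.1): median absolute residual plus lam times the
    roughness penalty  integral of {f''(x)}^2 dx,
    approximated on `grid` by the trapezoid rule."""
    fit_term = float(np.median(np.abs(y - fhat)))
    rough = float(np.sum((fpp[:-1] ** 2 + fpp[1:] ** 2) / 2.0 * np.diff(grid)))
    return fit_term + lam * rough

# Sanity check: for f(x) = x^2 on [0, 1], f''(x) = 2, so the penalty is 4.
grid = np.linspace(0.0, 1.0, 1001)
y = grid ** 2                        # noiseless observations at the grid points
val = medloss_spline_criterion(y, grid ** 2, np.full_like(grid, 2.0), grid, lam=0.5)
# fit term 0, penalty 0.5 * 4 = 2
```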
To avoid overfitting, the initial tree is often pruned by imposing a cost-complexity measure. Let T denote the set consisting of the initial tree and all possible prunings of this tree. Cost-complexity selects the tree in T that minimizes

C_E(T) = (1/n) Σ_{i=1}^n (f̂_T(x_i) − y_i)² + α|T|,

where |T| is the cardinality of the tree, i.e. the number of leaf nodes, terminal nodes or partition cells, and α > 0 is a constant that controls the trade-off between fidelity to the training data and the complexity of the tree. The first term of C_E(T) is just the average of the squared residuals. So, it is natural to replace it with a different measure of location, namely the median. The median-based regression tree selects the tree in T that minimizes

C_M(T) = median_{1≤i≤n} |f̂_T(x_i) − y_i| + α|T|.

In addition, the value of α can be chosen by MCV.

6.2.2 Model averaging for prediction

Model averaging is an alternative to model selection better suited to prediction problems. There are two kinds of model averaging: Bayesian Model Averaging (BMA) and non-Bayesian model averaging. We discuss them below. Before proceeding, a few remarks on model selection vs model averaging are worthwhile.

First, model selection is where we choose one model from a list to use as if it were true. Thus, we make predictions from one model alone but hopefully include the variability in model selection in the standard error of the prediction. By contrast, in model averaging we weight each model on the list and use these weights to combine the predictions they give. So, the weighted sum is the overall prediction. Thus the key difference between model selection for prediction and model averaging for prediction is analogous to the difference between choosing a single X_i and taking a weighted mean Σ a_i X_i. However, the X_i's are predictions from models and the a_i are the degrees to which we think each model i is true. Usually, the a_i's are positive and sum to 1.
Second, there are many model selection procedures: AIC, BIC, CV, and so forth. Of these, only BIC is Bayes. Likewise, there are many model averaging procedures: BMA, stacking, and so forth. Of these, only BMA is Bayes. In fact, every model average weights the models, so choosing the model with the largest weight is a model selection procedure. On the other hand, every model selection procedure assigns a worth to each model, and taking a weighted average of predictions from the models using these worths leads to a model average. In this sense, model averaging and model selection can be made equivalent.

(i) Bayes median model averaging

In BMA, weights are assigned to models based on the posterior probabilities of the models. The central idea can be expressed as follows. Suppose a finite list of finite-dimensional parametric models, such as linear regression models involving different selections of variables, f_j(x) = f_j(x|θ_j), is to be "averaged". Equip each θ_j ∈ R^{p_j} with a prior density w(θ|M_j), where M_j indicates the jth model from an ensemble E of models, and let w(M_j) indicate the prior on E. Let S ⊂ E be a set of models. Given data D = {(X_i, Y_i) : i = 1, . . . , n}, the posterior probability of S is

W(S|D) = Σ_{M_j ∈ E} ∫ w(M_j, θ_j|D) I_{f_j, θ_j ∈ S} dθ_j
        = Σ_{M_j ∈ E} W(M_j|D) ∫ w(θ_j|D, M_j) I_{f_j, θ_j ∈ S} dθ_j.

When S is a single point M_j that is permitted to vary over E, these posterior weights lead to the model average

Ŷ_B(x) = Σ_{M_j ∈ E} W(M_j|D) f_j(x|θ̂_j^E)

for predicting the next value of Y at x, where θ̂_j^E = E(θ_j|D) and W(M_j|D) can be approximated by

exp{−BIC(j)/2} / Σ_{M_k ∈ E} exp{−BIC(k)/2},

with BIC(k) being the BIC value of model k, M_k. This procedure is optimal under squared error loss. The more plausible M_j is, the higher its posterior probability will be and thus the more weight it will get. The expression W(S|D) permits evaluation of the posterior probability of different model choices.
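The BIC approximation to W(M_j|D) is straightforward to compute; a small sketch follows. Shifting by the minimum BIC is a standard numerical-stability device of ours, not part of the thesis:

```python
import math

def bma_weights(bic):
    """Approximate posterior model probabilities W(M_j | D) from BIC values:
    exp{-BIC(j)/2} normalized over the ensemble."""
    b0 = min(bic)                                # shift for numerical stability
    unnorm = [math.exp(-(b - b0) / 2.0) for b in bic]
    total = sum(unnorm)
    return [u / total for u in unnorm]

w = bma_weights([100.0, 102.0, 110.0])           # model 1 has the smallest BIC
```

A BIC difference of 2 costs a factor of e in weight, so the most plausible model dominates quickly as the gaps grow.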
So, in principle, one can do hypothesis tests on sets of models or individual models. The posterior for θ_j permits calculation of E(θ_j|D), but other estimators may be used. Note that in BMA, the posterior is based on the likelihood function. This extra information, not used in stacking or median stacking (see below), often makes BMA more efficient at the cost of being somewhat non-robust to small departures of the true model from the models on the list. More generally, the likelihood is unknown, so instead of using the fully specified likelihood, a quasi-likelihood or empirical likelihood could be used, giving a different model average.

The median version of BMA is to use the weighted median instead of the weighted average, i.e.

Ỹ_B(x) = median_{M_j ∈ E} [W(M_j|D) f_j(x|θ̃_j^M)]

for predicting the new value of Y at x, with θ̃_j^M the best Bayesian medloss estimator.

(ii) Non-Bayesian median model averaging: median-based stacking

Stacking is a non-Bayesian form of model averaging because the weights are no longer posterior probabilities of models. The main idea is to combine f_1, . . . , f_m by a cross-validation-based technique; the models f_i are 'stacked' in layers with weights a_i. More formally, suppose we have a list of m distinct possible models f_1, . . . , f_m in which each model has one or more real-valued parameters that must be estimated. When we use plug-in estimators for the parameters in f_i, we write f̂_i(x) = f_i(x|θ̂_i) for the model we use to get predictions. We want to find good empirical weights â_i for the f̂_i's from the training data. The usual stacking prediction at a point x is

f̂_stack(x) = Σ_{i=1}^m â_i f̂_i(x),

in which the â_i's are obtained as follows. Let f̂_i^{(−j)}(x_j) be the prediction at x_j using model i, as estimated from the training data with the jth observation removed. Then the estimated weight vector â = (â_1, . . . , â_m) solves

â = arg min_a Σ_{j=1}^n [ y_j − Σ_{i=1}^m a_i f̂_i^{(−j)}(x_j) ]².

Parallel to the conventional stacking based on CV, we propose a median analog, called median stacking. The idea is just to replace the sum or expectation with the median. Now the median stacking prediction at a point x is

f̃_stack(x) = median_{1≤i≤m} ã_i f̃_i(x),

in which the ã_i's are obtained as follows. The estimated weight vector ã = (ã_1, . . . , ã_m) solves

ã = arg min_a median_{1≤j≤n} | y_j − median_{1≤i≤m} a_i f̂_i^{(−j)}(x_j) |,

where f̂_i^{(−j)}(x_j) is defined as before.

6.2.3 Penalized median-loss methods

The key principle followed in this section is that it is reasonable to replace occurrences of sums with medians and squared deviations with L1 distances. Doing this in a shrinkage context leads to the following possibilities.

(i) Median LASSO

Tibshirani (1996) proposed a method called the least absolute shrinkage and selection operator, LASSO. It selects variables and estimates parameters simultaneously: LASSO shrinks all the regression coefficients, some of them to exactly zero. The idea is to minimize the sum of squared residuals with a penalty term on the norm of the coefficients. That is, minimize

Σ_{i=1}^n e_i² + λ Σ_{j=1}^p |β_j|

over β, where λ is a tuning parameter that can be chosen by cross validation methods, e_i = y_i − f(x_i, β) and β^T = (β_1, . . . , β_p). So, the median analog of LASSO can be defined by minimizing

med_{1≤i≤n}(e_i)² + λ Σ_{j=1}^p |β_j|   (6.2)

over β, where λ here can be chosen by our median cross validation. Other variants minimize (6.2) with the penalty term med_{1≤j≤p} |β_j|.

(ii) Median Ridge

If we use the penalty term ‖β‖² = Σ_{j=1}^p β_j² instead of the Σ_{j=1}^p |β_j| used in LASSO, then we have the median version of the usual ridge approach. So, the median-based ridge method can be defined to minimize

med_{1≤i≤n}(e_i)² + λ‖β‖²   (6.3)

over β, or to minimize

med_{1≤i≤n}(e_i)² + λ med_{1≤j≤p} β_j²

over β. Again, λ can be chosen by our median cross validation.
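As an illustration of the penalized medloss objectives (6.2) and (6.3), the sketch below evaluates them for a linear model. The minimization itself would need a derivative-free optimizer, since the median makes the objective non-smooth; the function names are ours:

```python
import numpy as np

def median_lasso_obj(beta, X, y, lam):
    """Objective (6.2): median of the squared residuals plus an L1 penalty."""
    e = y - X @ beta
    return float(np.median(e ** 2) + lam * np.sum(np.abs(beta)))

def median_ridge_obj(beta, X, y, lam):
    """Objective (6.3): median of the squared residuals plus an L2 penalty."""
    e = y - X @ beta
    return float(np.median(e ** 2) + lam * np.sum(beta ** 2))

# Toy check: at the true beta, the median fit term ignores a gross outlier,
# so only the penalty contributes to either objective.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 50.0])        # last point is an outlier
beta = np.array([1.0])
v1 = median_lasso_obj(beta, X, y, lam=0.1)      # median residual^2 = 0
v2 = median_ridge_obj(beta, X, y, lam=0.1)
```

Note how the outlier leaves both objectives at the bare penalty value 0.1, whereas a sum-of-squares fit term would be inflated by (50 − 5)² = 2025.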
(iii) Median AIC

In addition to CV and shrinkage, information criteria are popular for model selection. They, too, admit median versions. Note that information criteria such as AIC and BIC are penalized log-likelihoods, as opposed to shrinkage criteria, which are penalized risks. Shrinkage and information criteria tend to be similar for normal errors. For instance, Akaike (1974) proposed a criterion based on information theory for model selection. It is a measure of the goodness of fit of an estimated statistical model. The Akaike Information Criterion, AIC, is defined by

− Σ_{i=1}^n ln f(x_i|θ̂) + d,   (6.4)

where f(·|θ) is the likelihood function of the observed values {x_i : i = 1, . . . , n}, θ̂ is an estimator for the unknown parameter θ, and d is the dimension of θ. We select the model from the model list with the smallest value of AIC. So, the median analog of the AIC is to find a model by minimizing

− med_{1≤i≤n} [ln f(x_i|θ̂)] + d.   (6.5)

(iv) Median BIC

Alternatively, Schwarz (1978) proposed a different model selection criterion in the Bayesian context. The Bayesian Information Criterion, BIC, is similar to AIC but has a different penalty term. The BIC is

− Σ_{i=1}^n ln f(x_i|θ̂) + (d/2) ln n,   (6.6)

where n is the sample size. So, the median analog of BIC is

− med_{1≤i≤n} [ln f(x_i|θ̂)] + (d/2) ln n.   (6.7)

Again, we choose the model with the smallest value.

(v) Others

Similar to Median LASSO, Median Ridge, Median AIC and Median BIC, we can also develop median variants of most of the commonly used model selection criteria, such as a Median elastic net, a Median SCAD criterion, other median-based information criterion methods, and so on.

6.2.4 Other possibilities

We conclude with a final list of interesting possibilities to be explored in a median-based statistical framework.
(i) Median-based decision-theoretic extension

First, from a theoretical standpoint, one of the most important desiderata is a complete class theorem based on the median of the loss. Essentially, the complete class theorem identifies classes of estimators, such as the usual Bayesian estimators, that are complete. In this context, complete means that, for any other estimator that is good, there is an estimator in the class that is close to it. In the median loss context, the estimators may have an analogous property. Moreover, the quest for a median version of the Cramér–Rao inequality remains. We suggest that Sung's diffusivity (Sung, 1988, 1990) is an appropriate analog of variance for the median case, and that the median absolute deviation from the median, MAD (i.e. the medloss with an L1 loss), of estimators is an appropriate replacement for the MSE. However, more work remains even to conjecture a bound. Of course, it is natural to develop a median version of the Rao–Blackwell theorem as well, or to discuss the median-loss-based estimator with sufficient statistics. In Chapter 2 of this thesis, we have proposed the median-inadmissibility of estimators and discussed some points about the James–Stein positive-part estimator in the linear regression setting. So, it is also sensible to compare the James–Stein estimator with our medloss estimator in multivariate estimation problems.

(ii) Median-based diagnostic analysis

First, we want to develop data-driven criteria for deciding whether CV or MCV is more appropriate. One natural test is to look at the histogram of residuals from a model. If it is unimodal and symmetric with light tails, normality is suggested and CV is appropriate. Otherwise, if it is skewed or has heavy tails, then MCV may be appropriate. We hope to develop more convenient diagnostics in the future. Second, in the problems of linear regression models, Rousseeuw and
Leroy (2003, p. 238) proposed a diagnostic with LMS estimators to identify all points that are either outliers in the y-direction or leverage points. So, other residual or diagnostic analyses based on LMS estimators for linear or nonlinear models are also a reasonable and interesting extension of the usual LS-type approaches. Third, more complicated diagnostics may be possible. That is, it is not enough to choose CV vs MCV; it is also important to choose the classes of regression functions to be compared by CV or MCV. We suggest that functions be chosen to satisfy both fit and a variety of residual behaviors. The importance of good fit is clear. However, ensuring a list that gives a wide range of residual behaviors may be important in the search for a model list that gives good generalization error as a consequence of the flexibility of the class. This means that selecting the model list before using an MCV model average such as MCV-stacking is a new class of problems our work makes natural for further investigation.

(iii) Median cross validation

We suggest that theoretical development of the MCV naturally follows two directions. First, we want to generalize theorems about the consistency of CV to the MCV case. In this regard, the work of Müller (2008) may be helpful: she establishes consistency for a general form of model selection principle that includes CV and GCV. The techniques of proof in her work may extend to MCV and related optimizations. Second, oracle inequalities for model selection principles and model averages are available; see Yang and Barron (1998). We suggest that MCV may satisfy such a property, especially if it is converted to a model average.

6.3 References

H. Akaike, 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

J. H. Friedman and W. Stuetzle, 1981. Projection Pursuit Regression. J. Amer. Statist.
Assoc., 76, 817–823.

M. Müller, 2008. Consistency properties of model selection criteria in multiple linear regression. Working Paper, Humboldt-Universität zu Berlin. http://econpapers.repec.org/paper/wophumbse/9207.htm

P. J. Rousseeuw and A. M. Leroy, 2003. Robust Regression and Outlier Detection. John Wiley and Sons.

G. Schwarz, 1978. Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

N. K. Sung, 1988. A Cramér–Rao Analogue for Median-Unbiased Estimators. Preprint Series 88-29, Department of Statistics, Iowa State University, Ames, Iowa.

N. K. Sung, 1990. A Generalized Cramér–Rao Analogue for Median-Unbiased Estimators. J. Multivariate Anal., 32, 204–212.

Y. Yang and A. Barron, 1998. An Asymptotic Property of Model Selection Criteria. IEEE Transactions on Information Theory, 44, 95–116.

Appendix A

Appendix to Chapter 2

In this appendix, we state von Neumann and Morgenstern's and Savage's expected utility theorems and Rostek's quantile utility theorem. Moreover, we give a detailed explanation of Savage's and Rostek's axioms. Allais' and Ellsberg's paradoxes for the axioms of the two expected utility models are also discussed here.

A.1 Classical axiomatic expected utility

Von Neumann–Morgenstern expected utility theorem: An agent's preferences over P satisfy the Independence Axiom and the Continuity Axiom if and only if there exists a utility function u : Z → R such that the agent ranks lotteries according to their expected utility U(p) = Σ_{z∈Z} p(z)u(z), i.e.,

p ≽ q ⟺ U(p) ≥ U(q).

In addition, the utility function u is unique up to positive affine transformations.

Next, we state Savage's axioms P1–P6 that characterize Subjective Utility Maximization. (See also Machina and Schmeidler (1992) and Appendix 1 in Rostek (2007).)

Axiom 1 (Ordering). The relation ≽ is complete, transitive and reflexive.

Axiom 2 (Sure-thing principle).
For all events E and (sub)acts f, f*, g, h,

[f(s) if s ∈ E; g(s) if s ∉ E] ≽ [f*(s) if s ∈ E; g(s) if s ∉ E]
⟹ [f(s) if s ∈ E; h(s) if s ∉ E] ≽ [f*(s) if s ∈ E; h(s) if s ∉ E].   (A.1)

Axiom 3 (Eventwise monotonicity). For all outcomes x and y, non-null events E and acts g,

[x if E; g if E^c] ≽ [y if E; g if E^c] ⟺ x ≽ y.   (A.2)

[Remark: An event E is said to be null if any pair of acts which differ only on E are indifferent.]

Axiom 4 (Weak comparative probability). For all events E, F and outcomes x* ≻ x and y* ≻ y,

[x* if E; x if E^c] ≽ [x* if F; x if F^c]
⟹ [y* if E; y if E^c] ≽ [y* if F; y if F^c].   (A.3)

Axiom 5 (Nondegeneracy). There exist outcomes x and y such that x ≻ y.

Axiom 6 (Small-event continuity). For any acts f ≻ g and outcome x, there exists a finite set of events {E_1, . . . , E_N} forming a partition of S such that

f ≻ [x if E_n; g if E_n^c]   and   [x if E_m; f if E_m^c] ≻ g   (A.4)

for all m, n = 1, . . . , N.

Savage's expected utility theorem: A preference relation ≽ over the set of Savage's finite-outcome acts F = {f | f : S → Z and |f(S)| < ∞} satisfies the axioms P1–P6 if and only if there exists a subjective probability measure µ on the states of the world S and a utility function u on the set of outcomes Z such that the individual evaluates acts according to expected utility with respect to µ and u, i.e.

f ≽ g ⟺ E_µ[u ∘ f] ≥ E_µ[u ∘ g], for all f, g ∈ F,

where E_µ[u ∘ f] = ∫_S [u ∘ f](s) dµ(s). In addition, the utility function u is unique up to positive affine transformations.

A.2 The Allais paradox and Ellsberg's paradox

Allais paradox: This paradox was presented by Maurice Allais, Allais (1953), to attack vNM's Independence Axiom and Savage's Sure-Thing Principle. The paradox is a choice problem showing an inconsistency between actual observed choices and the predictions of expected utility theory.
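The inconsistency can be checked mechanically. With the payoffs quoted in the table below (win probabilities 0.10, 0.89, 0.01), the gambles in the two experiments differ only in the common middle prize, so for any utility u the differences EU(1A) − EU(1B) and EU(2A) − EU(2B) coincide; preferring 1A to 1B while preferring 2B to 2A therefore contradicts expected utility maximization. A small sketch (the square-root utility is an arbitrary illustration):

```python
p = (0.10, 0.89, 0.01)                     # probabilities of outcomes a, b, c
gambles = {"1A": (10, 10, 10), "1B": (50, 10, 0),
           "2A": (10, 0, 10),  "2B": (50, 0, 0)}

def expected_utility(g, u):
    return sum(pi * u(z) for pi, z in zip(p, g))

u = lambda z: z ** 0.5                     # any increasing utility works here
d1 = expected_utility(gambles["1A"], u) - expected_utility(gambles["1B"], u)
d2 = expected_utility(gambles["2A"], u) - expected_utility(gambles["2B"], u)
# d1 == d2, so EU cannot rank 1A above 1B and simultaneously 2B above 2A.
```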
The problem arises when comparing participants' choices in two different experiments, each of which consists of a choice between two options, A and B. The payoffs for each gamble in each experiment are in the following table.

Experiment   a (p = 0.10)   b (p = 0.89)   c (p = 0.01)
1A           $10            $10            $10
1B           $50            $10            $0
2A           $10            $0             $10
2B           $50            $0             $0

Most people choose 1A over 1B but also 2B over 2A. This violates the Independence Axiom and the Sure-Thing Principle, because the common outcomes (column b) should not affect the choice between the alternatives.

Ellsberg's paradox: Ellsberg's paradox, presented by Daniel Ellsberg, Ellsberg (1961), can be illustrated as follows. Suppose we have an urn containing 30 red balls and 60 other balls that are either black or yellow in an unknown proportion. Assume that the balls are well mixed so that each individual ball is as likely to be drawn as any other. Now we are given a choice between two gambles:

Payoffs for drawing a ball of each color
Experiment   Red    Black   Yellow
A            $100   $0      $0
B            $0     $100    $0
C            $100   $0      $100
D            $0     $100    $100

Most people choose A over B and D over C. According to the Sure-Thing Principle, people who prefer A to B should also prefer C to D, since the only difference is the payoff for a yellow ball, which does not differentiate between A and B or between C and D.

A.3 Rostek's axiomatic quantile utility

Rostek's axioms for quantile maximization and her quantile utility theorem: Before we state the axioms, we have to define the sets which are assigned outcomes strictly more or strictly less preferred to x under f. These events are subsets of S:

E_{fx}^+ = {s ∈ S | f(s) ≻ x}   and   E_{fx}^− = {s ∈ S | f(s) ≺ x},

for a fixed act f ∈ F. Denote the indifference relation as usual: f ∼ g ⟺ f ≽ g and g ≽ f. Say that event E is null if, for any two actions f and g which differ only on E, we have f ∼ g.
Appendix to Chapter 2 Since the actions have bounded ranged, each action induces a natural partition of the state space which is the coarsest partition with respect to which it is measurable. Let the event E be an element of such a partition and let the function gx+ be any mapping from Ef + to Z with gx+ (s) % x, for all s ∈ Ef x+ and similarly, let gx− be any map gx− from Ef x− to Z with gx− (s) - x, for all s ∈ Ef x− . The three new axioms Rostek (2007) introduces are as follows. Axiom P 3Q (Pivotal monotonicity): For any act f ∈ F, there exists a non-null event Ef such that f −1 (x) = Ef for some x ∈ f (S), and for any outcome y, and subacts gx+ , gx− , gy+ and gy− : gx+ if Ef x+ gy+ if Ef x+ x % y ⇔ x % y. if E if E gx− if Ef x− gy− if Ef x− (A.5) Axiom P 4Q (Comparative probability): For all pairs of disjoint events E and F , outcomes x∗ Â x, and subacts g and h, x∗ if s ∈ E x∗ if s ∈ F x if s ∈ F if s ∈ E %x g if s ∈ / E∪F g if s ∈ / E∪F x∗ if s ∈ E x∗ if s ∈ F ⇒ if s ∈ F if s ∈ E x %x h if s ∈ / E∪F h if s ∈ / E∪F (A.6) (A.7) Let Âz denote the preference relation over certain outcomes, Z, obtained as a restriction of Â to constant actions. For an action f ∈ F, event E, such that f −1 (x) = E for some x ∈ f (S), and all subacts gx+ , gx− , gy+ and gy− 178 Appendix A. Appendix to Chapter 2 define gx+ if Ef x+ fE = if E x . gx− if Ef x− (A.8) It follows from P 3Q that for any action f ∈ F, there exists a non-null event Ef such that f −1 (x) = E for some x ∈ f (S) and for all subacts gx+ , gx− , gy+ and gy− , f ∼ fE . The last condition states that for a given action, there exist an event, which is called a pivotal event, such that changing outcomes outside of that event in a rank-preserving way does not affect preferences over actions. Before stating the final axiom, we introduce conditions that identify two important classes of preferences. Intuitively, these preferences will lead to τ = 0 and τ = 1, respectively. 
(L, ”lowest”): For any action f ∈ F, the pivotal event maps to an outcome from the least preferred equivalence class with respect to Âz in the outcome set {z ∈ Z|z ∈ f (S)}. (H, ”highest”): For any action f ∈ F, the pivotal event maps to an outcome from the most preferred equivalence class with respect to Âz in the outcome set {z ∈ Z|z ∈ f (S)}. Define a preference relation over acts F, Â, satisfying P 3Q , to be extreme if either L or H holds, and non-extreme if neither (L) nor (H) is satisfied. 179 Appendix A. Appendix to Chapter 2 Next we define two continuity properties used in the next axiom. (P 6Q∗ ) Fix a pair of events E, F ∈ E = 2S . If for any pair of outcomes such that x Â y, " # x if s ∈ /E y if s ∈ E " - x if s ∈ /F # , y if s ∈ F (A.9) then there exists a finite partition {G1 , . . . , GN } of S such that " # x if s ∈ /E y if s ∈ E " - x if s ∈ / F ∪ Gn # y if s ∈ F ∪ Gn (A.10) for all n = 1, . . . , N . ∗ (P 6Q ) Fix a pair of events E, F ∈ E = 2S . If for any pair of outcomes such that x Â y, " # x if s ∈ E y if s ∈ /E " - x if s ∈ F y if s ∈ /F # , (A.11) then there exists a finite partition {H1 , . . . , HM } of S such that " # x if s ∈ E y if s ∈ /E - " # x if s ∈ F ∪ Hm y if s ∈ / F ∪ Hm (A.12) for all m = 1, . . . , M . Axiom P 6Q (Event continuity): For non-extreme preferences, the relation Â satisfies P 6Q∗ for all events ∗ in 2S and P 6Q for any event E in 2S and ∅. If H holds, Â satisfies P 6Q∗ , ∗ while if L holds, Â satisfies P 6Q . 180 Appendix A. Appendix to Chapter 2 We want to derive a quantile-utility representation for subjective probability without P 2. Note that P 2 is related to axioms P 3 and P 4 which are used for the existence of the probability measures and the utility function. So, we have to remove all of them in the quantile-utility model. Separately, P 1 can be regarded as a regularity condition because it gives that the preference relation is transitive and all acts are comparable. 
This is a necessary condition for the preference relation to be represented by a utility function. P 5 rules out the possibility that the decision maker is indifferent among all acts, and it ensures the uniqueness of the probability measures. Moreover, adding P 3Q and P 4Q guarantees (i) the existence and uniqueness of the quantile τ , (ii) the additivity of the derived probability measure, and (iii) the value of the τ -th quantile of the utility function does not affect our preferences over actions. P 6Q is chosen to weaken Savage’s P 6 enough to ensure that the quantile in the representation is left-continuous. Finally, we can state Rostek’s theorem. Rostek’s quantile utility theorem: Consider a preference relation Â over the set of actions F. Then the following are equivalent: 1. Â satisfies P 1, P 3Q , P 4Q , P 5 and P 6Q . 2. There exist: • a unique number τ ∈ [0, 1]; • a probability measure π for τ ∈ (0, 1); • a utility function on the set of outcomes Z, u, which represents the preference relation over certain outcomes ,Âz , where u is unique up to strictly increasing transformations; such that relation Â over actions can be represented by the preference functional V(f ) : F → R given by V(f ) = Qτ (Πf ) if τ ∈ (0, 1), 181 Appendix A. Appendix to Chapter 2 where Πf denotes the induced cumulative probability distribution of utility for an action f , i.e. Πf (t) = π[s ∈ S|u(f (s)) ≤ t, t ∈ R], where S denotes a set of states of the world and Qτ (Πf ) is the corresponding τ -th quantile function defined by Qτ (Πf ) = inf{t ∈ R|π[u(f (s)) ≤ t] ≥ τ }. 182 Appendix A. Appendix to Chapter 2 A.4 References Allais, M. (1953). Le comportement de lhomme rationnel devant le risque: critique des postulats et axiomes de lecole Americaine. Econometrica 21 503–546. Ellsberg, D. (1961). Risk, Ambiguity, and the Savage Axioms. Quarterly Journal of Economics 75 643–669. Machina, M. and D. Schmeidler. (1992). A More Robust Definition of Subjective Probability. 
Econometrica, 60(4), 745–780.

Rostek, M. J. (2007). Quantile Maximization in Decision Theory. Unpublished manuscript.

Appendix B

Appendix to Chapter 3

In this appendix, we provide the complete proofs of the results in our Theorems 6 and 7.

B.1 Assumptions

Our proofs rely heavily on Čížek's assumptions for his asymptotic results for the one-sided LTS estimator (Čížek, 2004, 2005). These assumptions can be classified into three groups: Assumptions D, H and I, where Assumptions D give the distributional assumptions for the random variables, Assumptions H concern the regression function h, and Assumptions I concern the identification setting. For the two-sided case, we use Čížek's assumptions, except that we replace his Assumptions D3 and I2 by our T D3 and T I2. To indicate the modifications of Čížek's Assumptions D, H and I, we denote our assumptions by T D, T H and T I, respectively.

Definition 7. A sequence of explanatory variables {X_t}_{t∈N} is said to be absolutely regular (or α-mixing) if

α_m = sup_{t∈N} E [ sup_{A ∈ σ_{t+m}^f} |P(A|σ_t^p) − P(A)| ] → 0,

as m → ∞, where σ_t^p = σ(X_t, X_{t−1}, . . .) and σ_t^f = σ(X_t, X_{t+1}, . . .) are σ-algebras. The numbers α_m, m ∈ N, are called mixing coefficients.

B.1.1 Assumptions T D

Assumption T D1. The explanatory variables {x_i}_{i∈N} are α-mixing with finite second moments and mixing coefficients α_m satisfying

m^{r_α/(r_α−2)} (ln m)^{2(r_α−1)/(r_α−2)} α_m → 0,

as m → ∞, for some r_α > 2.

Assumption T D2. Let {u_i}_{i∈N} be a sequence of independent, symmetrically and identically distributed variables with finite second moments, and additionally, let u_i and x_i be mutually independent. Moreover, the distribution function F of u_i is absolutely continuous, and its probability density f is assumed to be positive, bounded from above by M_f > 0, and continuously differentiable in a neighborhood of −√(G^{−1}(λ)) and √(G^{−1}(λ)), where G is the distribution function of u_i² and G^{−1} is its inverse function.
Assumption T D3 Assume that for λ ∈ (0, 1), def mgg = inf inf gβ (G−1 β (λ) + z) > 0, β∈Bz∈(−δg ,δg ) for some δg > 0. Additionally, when 1/2 < λ ≤ 1, suppose that def def −1 ∗∗ m∗G = supG−1 β (1 − λ) > 0 and mG = inf Gβ (λ) > 0, β∈B β∈B and def ∗ = sup Mgg sup def ∗∗ = sup gβ (z) < ∞ and Mgg β∈Bz∈(−∞,m∗G ) sup gβ (z) < ∞, β∈Bz∈(m∗∗ G ,∞) where Gβ and gβ are the distribution function and probability density function of ri2 (β). 185 Appendix B. Appendix to Chapter 3 B.1.2 Assumptions T H Suppose that there are a positive constant δ > 0 and a neighborhood U (β0 , δ) such that the following assumptions hold. Assumption T H1 Let h(xi , β) be a continuous (uniformly over any compact subset of the support of x) in β ∈ B and twice differentiable function in β on U (β0 , δ) almost surely. The first derivative is continuous in β ∈ U (β0 , δ). Assumption T H2 Assume that the second derivatives h00βj βk (x, β) satisfy locally the Lipschitz property, that is, for any compact subset of the support of x, there exists a constant Lp > 0 such that for all β, β 0 ∈ U (β0 , δ), and j, k = 1, . . . , p, ¯ ¯ ¯ ¯ 00 ¯hβj βk (x, β) − h00βj βk (x, β 0 )¯ ≤ Lp kβ − β 0 k Assumption T H3 Let {h(xi , β)|β ∈ B} and {h0β (xi , β)|β ∈ U (β0 , δ)} form VC classes of functions such that their envelopes E1 (x) = sup|h(x, β)| and E2 (x) = β∈B sup |h0β (x, β)| β∈U (β0 ,δ) have finite rα -th moments. Assumption T H4 Let n −1/4 ¯ ¯ ¯ 0 ¯ max max ¯hβj (xi , β)¯ = Op (1) 1≤i≤n1≤j≤p and n −1/2 ¯ ¯ ¯ 00 ¯ max max ¯hβj ,βk (xi , β)¯ = Op (1) 1≤i≤n1≤j,k≤p 186 Appendix B. Appendix to Chapter 3 as n → ∞ uniformly over β ∈ U (β0 , δ). Assumption T H5 Apart from the existence of moments implied by Assumptions T H3, we further assume that 1. E[ri2 (β)]m and E[h(x, β)]m exist and are finite for m = 1, 2 and β ∈ B and 2. E[h00βj βk (xi , β)]m , E[h0βj (xi , β0 )h0βk (xi , β0 )]m and E[h0βl (xi , β0 )h00βj βk (xi , β0 )] exist and are finite for m = 1, 2, all j, k, l = 1, . . . , p and β ∈ U (β0 , δ). 
We also assume that E[h0β (xi , β0 )h0β (xi , β0 )T ] = Qh , where Qh is a nonsingular positive definite matrix. B.1.3 Assumptions T I Assumption T I1 The parameter space B is compact. Assumption T I2 For any ² > 0 and an open ball U (β0 , ²) such that B ∩ U c (β, ²) is compact, there exist α(²) > 0 such that it holds, for 1/2 < λ ≤ 1, that h i 2 min E ri (β)I{G−1 (1−λ)≤r2 (β)≤G−1 (λ)} i β β β:kβ−β0 k>² h i > E ri2 (β0 )I{G−1 (1−λ)≤r2 (β0 )≤G−1 (λ)} + α(²). β0 i β0 187 Appendix B. Appendix to Chapter 3 B.2 Proofs of the limiting results of two-sided LTS estimators in nonlinear regression models (LT S,h) Our main results on consistency and asymptotic normality of βn stated in Theorems 6 and 7 rely on numerous preliminary results. B.2.1 Preliminary results We start with Lemma 16. Let Sn (β) = Phn 2 n−hn +1 r[i] (β). Lemma 16. Under assumptions T D2 and T H1, Sn (β) is continuous on B, (LT S,hn ) twice differentiable at βn (LT S,hn ) if βn ∈ U (β0 , δ), and almost surely twice differentiable at any fixed point β ∈ U (β0 , δ). Denote I[r2 [n−hn +1] 2 2 (β),r[h (β)] (ri (β)) ] by h00ββ . Then we have Sn (β) = n n X by I2 (ri2 (β)), h0β (xi , β) by h0β and h00ββ (xi , β) ri2 (β)I2 (ri2 (β)), (B.1) i=1 Sn0 (β) = −2 n X ri2 (β)h0β I2 (ri2 (β)), i=1 n X 00 Sn (β) = 2 {h0β (h0β )T − ri (β)h00ββ }I2 (ri2 (β)) i=1 almost surely at any β ∈ B and β ∈ U (β0 , δ), respectively. Proof. The proof of this lemma is substantially the same as Čı́žek’s, so we omit it here. To investigate the behavior of the normal equations Sn0 (β) = 0 around 188 Appendix B. Appendix to Chapter 3 β0 as a function of β − β0 , consider the difference Dn1 (t) = Sn0 (β0 − n−1/2 t) − Sn0 (β0 ) n h X = −2 {yi − h(xi , β0 − n−1/2 t)}h0β (xi , β0 − n−1/2 t)I2 (ri2 (β0 − n−1/2 t)) i=1 i − {yi − h(xi , β0 )}h0β (xi , β0 )I2 (ri2 (β0 )) . Here, t ∈ TM = {t ∈ Rp |ktk ≤ M }, where 0 < M < ∞ is an arbitrary but fixed constant. Proposition 3 (Asymptotic Linearity). 
Under Assumptions TD, TH and TI, and for λ ∈ (1/2, 1] and M > 0, we have

$$\sup_{t\in T_M} n^{-1/2}\left\| \frac{D_{n1}(t)}{-2} - n^{1/2} Q_h C_\lambda t \right\| = o_p(1), \quad\text{as } n\to\infty,$$

where $Q_h = E_X[h'_\beta(X,\beta_0)h'_\beta(X,\beta_0)^T]$, $C_\lambda = (2\lambda-1) + \big(\frac{q_\lambda+q_{1-\lambda}}{2}\big)[H(\lambda)-H(1-\lambda)]$, $H(\lambda) = f(q_\lambda)+f(-q_\lambda)$ and $q_\lambda = \sqrt{G^{-1}(\lambda)}$.

Proof. The proof is put in Section B.2.3.

Lemma 17 (Uniform weak law of large numbers). Let Assumptions TD, TH and TI1 hold, and assume that t(x, u; β) is a real function continuous in β uniformly in x and u over any compact subset of the support of (x, u). Also, suppose that $E\sup_{\beta\in B}|t(x,u;\beta)|^{1+\delta} < \infty$ for some δ > 0. Then, letting $I_3(\beta;K_1,K_2) = I_{[G_\beta^{-1}(1-\lambda)-K_1,\, G_\beta^{-1}(\lambda)+K_2]}(r_i^2(\beta))$, we have

$$\sup_{\beta\in B,\,K_1,K_2\in\mathbb{R}}\Big|\frac{1}{n}\sum_{i=1}^n [t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)] - E[t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)]\Big| \to 0,$$

as n → ∞ in probability.

Proof. The proof is put in Section B.2.2.

Note that Čížek's Lemmas A.2–A.5 are valid for λ ∈ (0, 1) with our Assumption TD in place of his Assumption D, so we only state these lemmas here without proofs. Denote by $r^2_{[i]}(\beta)$ the i-th order statistic of the squared residuals $r_i^2(\beta) = (y_i - h(x_i,\beta))^2$ used to define the two-sided LTS estimator in (3.8). Thus, we have the following.

Lemma 18. For λ ∈ (0, 1) and $h_n = [\lambda n]$ for n ∈ ℕ, under Assumptions TD, TH1 and TI1, we have

$$\sup_{\beta\in B}\big|r^2_{[h_n]}(\beta) - G_\beta^{-1}(\lambda)\big| \to 0, \tag{B.2}$$

as n → ∞ in probability. Moreover,

$$EG_n = E\sup_{\beta\in B}\big|r^2_{[h_n]}(\beta) - G_\beta^{-1}(\lambda)\big| \to 0, \tag{B.3}$$

as n → ∞.

Lemma 19. For λ ∈ (0, 1) and $h_n = [\lambda n]$ for n ∈ ℕ, under Assumptions TD, TH1 and TI1, there exists some ε > 0 such that

$$\sqrt{n}\sup_{\beta\in U(\beta_0,\varepsilon)}\big|r^2_{[h_n]}(\beta) - G_\beta^{-1}(\lambda)\big| = O_p(1)$$

and

$$EL_n = E\Big\{\sqrt{n}\sup_{\beta\in U(\beta_0,\varepsilon)}\big|r^2_{[h_n]}(\beta) - G_\beta^{-1}(\lambda)\big|\Big\} = O(1),$$

as n → ∞.

Lemma 20. Let Assumptions TD, TH and TI1 hold, and suppose that λ ∈ (0, 1), τ ∈ (1/2, 1), and $h_n = [\lambda n]$ for n ∈ ℕ.
Then, we have

$$\big|r^2_{[h_n]}(\beta_0 - n^{-1/2}t) - r^2_{[h_n]}(\beta_0)\big| = O_p(n^{-\tau})$$

uniformly in $t \in T_M = \{t\in\mathbb{R}^p : \|t\|\le M\}$ as n → ∞.

Lemma 21. Under Assumptions TD, TH1 and TI1, we have that for any i ≤ n and λ ∈ (0, 1),

$$P_{G0} = P\Big(\sup_{\beta\in B}\big|I_{\{r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\big| \ne 0\Big) = o(1).$$

In addition, under Assumptions TD, TH and TI1, there exists ε > 0 such that

$$P_{L0} = P\Big(\sup_{\beta\in U(\beta_0,\varepsilon)}\big|I_{\{r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\big| \ne 0\Big) = O(n^{-1/2}),$$

as n → ∞.

By Lemma 21, we have the following result.

Corollary 6. Under Assumptions TD, TH1 and TI1, for λ ∈ (1/2, 1) and any i ≤ n, we have

$$P_G = P\Big(\sup_{\beta\in B}\big|I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\big| \ne 0\Big) = o(1).$$

In addition, under Assumptions TD, TH and TI1, there exists ε > 0 such that

$$P_L = P\Big(\sup_{\beta\in U(\beta_0,\varepsilon)}\big|I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\big| \ne 0\Big) = O(n^{-1/2}),$$

as n → ∞.

Proof of Corollary 6. Denote $A_1 = \{r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}$, $B_1 = \{r_i^2(\beta)\ge r^2_{[n-h_n+1]}(\beta)\}$, $A_2 = \{r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}$ and $B_2 = \{r_i^2(\beta)\ge G_\beta^{-1}(1-\lambda)\}$. Let

$$v_{in}(\beta) = I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}},$$

so that $v_{in}(\beta) = I_{A_1}I_{B_1} - I_{A_2}I_{B_2}$. We then have

$$0 \le \sup_{\beta\in B}|v_{in}(\beta)| = \sup_{\beta\in B}\big|I_{A_1}I_{B_1} - I_{A_2}I_{B_1} + I_{A_2}I_{B_1} - I_{A_2}I_{B_2}\big| \le \sup_{\beta\in B}\big|I_{A_1}-I_{A_2}\big|\big|I_{B_1}\big| + \sup_{\beta\in B}\big|I_{A_2}\big|\big|I_{B_1}-I_{B_2}\big| \le \sup_{\beta\in B}\big|I_{A_1}-I_{A_2}\big| + \sup_{\beta\in B}\big|I_{B_1}-I_{B_2}\big|.$$

Notice that $\sup_{\beta\in B}|v_{in}(\beta)| \ne 0$ implies that either $\sup_{\beta\in B}|I_{A_1}-I_{A_2}| \ne 0$ or $\sup_{\beta\in B}|I_{B_1}-I_{B_2}| \ne 0$. Thus, we have

$$0 \le P\big(\sup_{\beta\in B}|v_{in}(\beta)|\ne 0\big) \le P\big(\sup_{\beta\in B}|I_{A_1}-I_{A_2}|\ne 0\big) + P\big(\sup_{\beta\in B}|I_{B_1}-I_{B_2}|\ne 0\big) = o(1).$$

The last equality holds by the first result of Lemma 21.
Similarly, using the above arguments with the second result of Lemma 21, we can prove that there exists ε > 0 such that

$$P\Big(\sup_{\beta\in U(\beta_0,\varepsilon)}|v_{in}(\beta)| \ne 0\Big) = O(n^{-1/2}).$$

Using the same technique as in the proof of Corollary 6, we have the following, which is parallel to Čížek's Corollary A.6.

Proposition 4. Let Assumptions TD, TH1 and TI1 hold and assume that t(x, u; β) is a real-valued function continuous in β uniformly in x and u over any compact subset of the support of (x, u). Moreover, assume that $E\sup_{\beta\in B}|t(x,u;\beta)| < \infty$. Then we have that for λ ∈ (1/2, 1),

$$E\Big\{\sup_{\beta\in B}\Big|t(x_i,u_i;\beta)\Big[I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\Big]\Big|\Big\} = o(1).$$

In addition, under Assumptions TD, TH and TI1, there exists ε > 0 such that

$$E\Big\{\sup_{\beta\in U(\beta_0,\varepsilon)}\Big|t(x_i,u_i;\beta)\Big[I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\Big]\Big|\Big\} = O(n^{-1/2}),$$

as n → ∞.

Proposition 4 controls the upper bound arising from applying Chebyshev's inequality to a weighted sum of differences of indicator functions; this sum of differences expresses the distance between the residuals and their limiting quantiles. The result is stated in the following.

Proposition 5. Let Assumptions TD, TH1 and TI1 hold and assume that t(x, u; β) is a real-valued function continuous in β uniformly in x and u over any compact subset of the support of (x, u). Moreover, assume that $E\sup_{\beta\in B}|t(x,u;\beta)| < \infty$. Then we have that for λ ∈ (1/2, 1),

$$\sup_{\beta\in B}\Big|\frac{1}{n}\sum_{i=1}^n t(x_i,u_i;\beta)\Big[I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\Big]\Big| = o_p(1).$$

In addition, under Assumptions TD, TH and TI1, there exists ε > 0 such that

$$\sup_{\beta\in U(\beta_0,\varepsilon)}\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n t(x_i,u_i;\beta)\Big[I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}} - I_{\{G_\beta^{-1}(1-\lambda)\le r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}}\Big]\Big| = O_p(1),$$

as n → ∞.

Proof of Proposition 5.
Recall that $A_1 = \{r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}$, $B_1 = \{r_i^2(\beta)\ge r^2_{[n-h_n+1]}(\beta)\}$, $A_2 = \{r_i^2(\beta)\le G_\beta^{-1}(\lambda)\}$ and $B_2 = \{r_i^2(\beta)\ge G_\beta^{-1}(1-\lambda)\}$. By the first result of Proposition 4, for any ε* > 0, we have

$$P\Big(\sup_{\beta\in B}\Big|\frac1n\sum_{i=1}^n t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\Big| > \varepsilon^*\Big) \le \frac{1}{\varepsilon^*}\,E\Big\{\sup_{\beta\in B}\Big|\frac1n\sum_{i=1}^n t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\Big|\Big\} \le \frac{1}{\varepsilon^*}\,\frac1n\sum_{i=1}^n E\Big\{\sup_{\beta\in B}\big|t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\big|\Big\} = \frac{1}{\varepsilon^*}\,E\Big\{\sup_{\beta\in B}\big|t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\big|\Big\} \to 0.$$

Moreover, by the second result of Proposition 4, there exists ε > 0 such that

$$E\Big(\sup_{\beta\in U(\beta_0,\varepsilon)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\Big|\Big) = \sqrt n\, E\Big(\sup_{\beta\in U(\beta_0,\varepsilon)}\Big|\frac1n\sum_{i=1}^n t(x_i,u_i;\beta)\big[I_{A_1}I_{B_1}-I_{A_2}I_{B_2}\big]\Big|\Big) \le O(1).$$

Therefore, using Chebyshev's inequality again gives the second result.

In what follows, we study in more detail the differences of the probabilities that $I_{\{r^2_{[n-h_n+1]}(\beta)\le r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}}$ equals one at β = β₀ and at β_n, for sequences β_n converging to β₀ at √n-rate. Our next result gives bounds for how closely residuals at the true parameter β₀ approximate residuals at β in a neighborhood of β₀.

Lemma 22. Recall that $A_1 = \{r_i^2(\beta)\le r^2_{[h_n]}(\beta)\}$ and $B_1 = \{r_i^2(\beta)\ge r^2_{[n-h_n+1]}(\beta)\}$. Denote $A'_1 = \{r_i^2(\beta_0)\le r^2_{[h_n]}(\beta_0)\}$ and $B'_1 = \{r_i^2(\beta_0)\ge r^2_{[n-h_n+1]}(\beta_0)\}$. Let Assumptions TD and TH hold and β ∈ U(β₀, n^{-1/2}M) for some M > 0. Then, for λ ∈ (1/2, 1), we have, as n → ∞:

1. For the conditional probability,

(a) $P\big(I_{A'_1}I_{B'_1} \ne I_{A_1}I_{B_1}\,\big|\,x_i\big) \le \big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O_p(n^{-1/2}) = O_p(n^{-1/4})$, and

(b) $E\big\{\operatorname{sgn} r_i(\beta_0)\,\big(I_{A'_1}I_{B'_1} - I_{A_1}I_{B_1}\big)\,\big|\,x_i\big\} = (h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)[H(\lambda)-H(1-\lambda)] + O_p(n^{-1/2}).$

2. For the corresponding unconditional probability,

$$P\big(I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\big) \le E_X\big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O(n^{-1/2}) = O(n^{-1/2}).$$
3. For the conditional probability taken over all β ∈ U(β₀, n^{-1/2}M),

$$P\big(\exists\beta\in U(\beta_0,n^{-1/2}M) : I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\,\big|\,x_i\big) \le n^{-1/2}M\sum_{j=1}^p\big|h'_{\beta_j}(x_i,\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O_p(n^{-1/2}) = O_p(n^{-1/4}).$$

4. For the corresponding unconditional probability taken over all β ∈ U(β₀, n^{-1/2}M),

$$P\big(\exists\beta\in U(\beta_0,n^{-1/2}M) : I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\big) \le n^{-1/2}M\sum_{j=1}^p E_X\big|h'_{\beta_j}(x_i,\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O(n^{-1/2}) = O(n^{-1/2}),$$

where $H(\lambda) = f(q_\lambda)+f(-q_\lambda)$ and $q_\lambda = \sqrt{G^{-1}(\lambda)}$.

Proof of Lemma 22. Note that Čížek's Lemma A.8 holds for λ ∈ (0, 1). First, the result of 1(a) holds because

$$P(I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\,|\,x_i) \le P(I_{A'_1}\ne I_{A_1}\,|\,x_i) + P(I_{B'_1}\ne I_{B_1}\,|\,x_i) = \big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[f(q_\lambda)+f(-q_\lambda)] + O_p(n^{-1/2}) + \big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[f(q_{1-\lambda})+f(-q_{1-\lambda})] + O_p(n^{-1/2}) = \big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O_p(n^{-1/2}) = O_p(n^{-1/4}).$$

In addition, 1(b) can be obtained by using Čížek's result in his Lemma A.8 together with our 1(a), so we omit the proof here.

Second, for the corresponding unconditional probability,

$$P(I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}) = E_X\,P(I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\,|\,x_i) \le E_X\big|(h'_\beta(x_i,\beta_0))^T(\beta-\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O(n^{-1/2}).$$

Again, the proof of the result is completed by using Čížek's result in his Lemma A.8.

Third, for the conditional probability taken over all β ∈ U(β₀, n^{-1/2}M),

$$P\big(\exists\beta\in U(\beta_0,n^{-1/2}M): I_{A'_1}I_{B'_1}\ne I_{A_1}I_{B_1}\,|\,x_i\big) \le P\big(\exists\beta\in U(\beta_0,n^{-1/2}M): I_{A'_1}\ne I_{A_1}\,|\,x_i\big) + P\big(\exists\beta\in U(\beta_0,n^{-1/2}M): I_{B'_1}\ne I_{B_1}\,|\,x_i\big) \le n^{-1/2}M\sum_{j=1}^p\big|h'_{\beta_j}(x_i,\beta_0)\big|[H(\lambda)+H(1-\lambda)] + O_p(n^{-1/2}) = O_p(n^{-1/4}).$$

The fourth result can be obtained by using the same techniques as in our second and third results, so we omit the proof.

Čížek's Corollary A.9 controls the deviation of residuals in one tail from the Taylor approximation to h.
Here, both tails must be controlled, as in the following.

Lemma 23. Under the assumptions of our Lemma 22, suppose that there exists some β ∈ U(β₀, n^{-1/2}M) such that $I_{A'_1}\ne I_{A_1}$ and $I_{B'_1}\ne I_{B_1}$. Then

$$\max\big\{\big||r_i(\beta)|-q_\lambda\big|,\ \big||r_i(\beta)|-q_{1-\lambda}\big|\big\} \le \big|(h'_\beta(x_i,\xi))^T(\beta-\beta_0)\big| + O_p(n^{-1/2}) = O_p(n^{-1/4})$$

and

$$\max\Big\{E\big[\big||r_i(\beta)|-q_\lambda\big|\,\big|\,x_i\big],\ E\big[\big||r_i(\beta)|-q_{1-\lambda}\big|\,\big|\,x_i\big]\Big\} \le \big|(h'_\beta(x_i,\xi))^T(\beta-\beta_0)\big| + O_p(n^{-1/2}),$$

where ξ lies between β₀ and β. This lemma is a direct consequence of Čížek's Corollary A.9.

B.2.2 Proof of Lemma 17 for the uniform law of large numbers

Proof. We prove the uniform weak law of large numbers in Lemma 17 by verifying the four conditions of Andrews' Theorem 4 in Andrews (1992).

(i) The condition of total boundedness (BD) is ensured by Assumption TI1, the compactness of the parameter space B.

(ii) Since $E\sup_{\beta\in B}|t(x,u;\beta)|^{1+\delta} < \infty$ for some δ > 0, the terms

$$t(x_i,u_i;\beta)I_3(\beta;K_1,K_2) - E[t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)]$$

are identically distributed by Assumptions TD1 and TD2; they are also uniformly integrable. Thus, Andrews' domination condition (DM) is satisfied.

(iii) Additionally, the pointwise convergence

$$\frac1n\sum_{i=1}^n [t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)] - E[t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)] \stackrel{p}{\to} 0$$

at any β ∈ B and K₁, K₂ ∈ ℝ follows from the weak law of large numbers for mixingales in Andrews (1988).

(iv) The last condition, termwise stochastic equicontinuity (TSE) in Andrews' Theorem 4 (Andrews, 1992), namely that

$$\lim_{\rho\to0} P\Big(\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|t_I(x_i,u_i;\beta',K'_1,K'_2) - t_I(x_i,u_i;\beta,K_1,K_2)\big| > k\Big) = 0 \tag{B.4}$$

is satisfied for any k > 0, where $t_I(x_i,u_i;\beta,K_1,K_2) = t(x_i,u_i;\beta)I_3(\beta;K_1,K_2)$ and the suprema over β, K₁, K₂, β′, K′₁ and K′₂ are taken over the sets B, ℝ, ℝ, U(β, ρ), U(K₁, ρ) and U(K₂, ρ), respectively.
To see that (B.4) holds, first notice that for all β ∈ B and K₁, K₂ ∈ ℝ, we have

$$\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|t_I(x_i,u_i;\beta',K'_1,K'_2) - t_I(x_i,u_i;\beta,K_1,K_2)\big| \le \sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|t(x_i,u_i;\beta')\big[I_3(\beta';K'_1,K'_2) - I_3(\beta;K_1,K_2)\big]\big| \tag{B.5}$$
$$+ \sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|\big[t(x_i,u_i;\beta') - t(x_i,u_i;\beta)\big]I_3(\beta;K_1,K_2)\big|. \tag{B.6}$$

Now it is enough to show that, given ε > 0, we can find ρ₀ > 0 such that the probabilities of expressions (B.5) and (B.6) exceeding a given k > 0 are smaller than ε for all ρ < ρ₀.

1. Consider expression (B.5). First note that

$$\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|t(x_i,u_i;\beta')\big[I_3(\beta';K'_1,K'_2) - I_3(\beta;K_1,K_2)\big]\big| \le \sup_{\beta\in B}\big|t(x_i,u_i;\beta)\big|\ \sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|I_3(\beta';K'_1,K'_2) - I_3(\beta;K_1,K_2)\big|, \tag{B.7}$$

where $\sup_{\beta\in B}|t(x_i,u_i;\beta)|$ is a function independent of β with a finite expectation. In addition, $|I_3(\beta';K'_1,K'_2) - I_3(\beta;K_1,K_2)|$ is always less than or equal to 1, so (B.5) has an integrable upper bound independent of β. Thus, if we can show that

$$\lim_{\rho\to0} P\Big(\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|I_3(\beta';K'_1,K'_2) - I_3(\beta;K_1,K_2)\big| = 1\Big) = 0, \tag{B.8}$$

then (B.7), and hence (B.5), converges in probability to zero as ρ → 0 and n → ∞ as well. So it is enough to prove (B.8).

Our strategy for proving (B.8) has three steps. (1) We use Čížek's argument that $G_{\beta'}^{-1}(\lambda)$ converges to $G_\beta^{-1}(\lambda)$ uniformly on B for all λ, by the absolute continuity of $G_\beta$.
(2) By the uniform convergence of $G_\beta^{-1}$, we can find some ρ₁ > 0 such that for 1/2 < λ ≤ 1,

$$\big|(G_{\beta'}^{-1}(1-\lambda)+K'_1) - (G_\beta^{-1}(1-\lambda)+K_1)\big| < \varepsilon(16M_g^{*})^{-1}$$

and

$$\big|(G_{\beta'}^{-1}(\lambda)+K'_2) - (G_\beta^{-1}(\lambda)+K_2)\big| < \varepsilon(16M_g^{**})^{-1},$$

for any β ∈ B, β′ ∈ U(β, ρ₁) and $K'_j \in U(K_j,\rho_1)$ for j = 1, 2, where $M_g^{*}$ and $M_g^{**}$, defined in Assumption TD3, are the uniform upper bounds in the two tails for the probability density functions of $r_i^2(\beta)$. (3) Denote the product probability space of (x_i, u_i) by Ω, consider a compact subset Ω₁ ⊂ Ω such that P(Ω₁) > 1 − ε/2, and choose ρ₂ > 0 such that

$$\sup_{\beta\in B}\ \sup_{\beta'\in U(\beta,\rho_2)}\big|r_i^2(\beta',\omega) - r_i^2(\beta,\omega)\big| < \varepsilon\big(16\max\{M_g^{*},M_g^{**}\}\big)^{-1} \tag{B.9}$$

for all ω ∈ Ω₁ and ρ < ρ₂, by Assumption TH1.

Therefore, letting ρ₀ = min{ρ₁, ρ₂} and ρ < ρ₀, we can apply steps (1), (2) and (3) to get the following sequence of inequalities:

$$P\Big(\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|I_3(\beta';K'_1,K'_2)-I_3(\beta;K_1,K_2)\big|=1\Big) = P\big(\sup\ \sup|\cdot|=1,\ \Omega_1\big) + P\big(\sup\ \sup|\cdot|=1,\ \Omega\setminus\Omega_1\big) \le P\big(\sup\ \sup|\cdot|=1,\ \Omega_1\big) + P(\Omega\setminus\Omega_1) \le P\big(\sup\ \sup|\cdot|=1,\ \Omega_1\big) + \varepsilon/2$$
$$\le P\Big(\exists\beta\in B : r_i^2(\beta)\in\big[G_\beta^{-1}(\lambda)+K_2-\varepsilon(8M_g^{**})^{-1},\ G_\beta^{-1}(\lambda)+K_2+\varepsilon(8M_g^{**})^{-1}\big] \cup \big[G_\beta^{-1}(1-\lambda)-K_1-\varepsilon(8M_g^{*})^{-1},\ G_\beta^{-1}(1-\lambda)-K_1+\varepsilon(8M_g^{*})^{-1}\big]\Big) + \frac{\varepsilon}{2}$$
$$\le M_g^{**}\Big(\frac{\varepsilon}{4M_g^{**}}\Big) + M_g^{*}\Big(\frac{\varepsilon}{4M_g^{*}}\Big) + \frac{\varepsilon}{2} = \varepsilon.$$

Since ε > 0 was arbitrary, (B.8) is proved and, consequently, (B.5) converges to zero in probability as ρ → 0.

2. Now we turn to expression (B.6) and prove that, for any given k > 0,

$$\lim_{\rho\to0} P\Big(\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|\big[t(x_i,u_i;\beta')-t(x_i,u_i;\beta)\big]I_3(\beta;K_1,K_2)\big| > k\Big) = 0. \tag{B.10}$$
By Čížek's result that

$$E\Big\{\sup_{\beta,\beta'}\big|t(x_i,u_i;\beta') - t(x_i,u_i;\beta)\big|\Big\} \le k\varepsilon,$$

we have

$$P\Big(\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|\big[t(x_i,u_i;\beta')-t(x_i,u_i;\beta)\big]I_3(\beta;K_1,K_2)\big| > k\Big) \le \frac1k\, E\Big[\sup_{\beta,K_1,K_2}\ \sup_{\beta',K'_1,K'_2}\big|\big[t(x_i,u_i;\beta')-t(x_i,u_i;\beta)\big]I_3(\beta;K_1,K_2)\big|\Big] \le \frac{k\varepsilon}{k} = \varepsilon,$$

for any ρ < ρ₀. Thus, (B.10) is proved. Consequently, the TSE assumption in Andrews (1992) is valid, and the proof of this lemma is completed by applying the uniform weak law of large numbers.

B.2.3 Proof of Proposition 3 on asymptotic linearity

Now we can prove Proposition 3.

Proof. Recall that

$$D_{n1}(t) = S'_n(\beta_0-n^{-1/2}t) - S'_n(\beta_0) = -2\sum_{i=1}^n\Big[\{y_i-h(x_i,\beta_0-n^{-1/2}t)\}\,h'_\beta(x_i,\beta_0-n^{-1/2}t)\,I_2(\beta_0-n^{-1/2}t) - \{y_i-h(x_i,\beta_0)\}\,h'_\beta(x_i,\beta_0)\,I_2(\beta_0)\Big],$$

where $I_2(\beta) = I_2(r_i^2(\beta)) = I_{[r^2_{[n-h_n+1]}(\beta),\,r^2_{[h_n]}(\beta)]}(r_i^2(\beta))$ and $t\in T_M = \{t\in\mathbb{R}^p : \|t\|\le M\}$. For any M > 0, there is an n₀ ∈ ℕ such that $\beta_0 - n^{-1/2}t \in U(\beta_0,\delta)$ for all n ≥ n₀ and t ∈ T_M. Therefore, using Taylor's expansion for n > n₀ and t ∈ T_M, we have

$$h(x,\beta_0-n^{-1/2}t) = h(x,\beta_0) - h'_\beta(x,\xi)^T n^{-1/2}t$$

and

$$h'_\beta(x,\beta_0-n^{-1/2}t) = h'_\beta(x,\beta_0) - h''_{\beta\beta}(x,\xi')^T n^{-1/2}t,$$

where ξ and ξ′ are between β₀ and β₀ − n^{-1/2}t. Let $B_1(x) = h(x,\beta_0)$, $B_2(x) = h'_\beta(x,\xi)^T n^{-1/2}t$, $C_1(x) = h'_\beta(x,\beta_0)$ and $C_2(x) = h''_{\beta\beta}(x,\xi')^T n^{-1/2}t$.
Thus, $D_{n1}(t)/(-2)$ can be rewritten as

$$\frac{D_{n1}(t)}{-2} = \sum_{i=1}^n \{y_i-B_1(x_i)\}C_1(x_i)\big[I_2(\beta_0-n^{-1/2}t)-I_2(\beta_0)\big] \tag{B.11}$$
$$- \sum_{i=1}^n \{y_i-B_1(x_i)\}C_2(x_i)\,I_2(\beta_0) \tag{B.12}$$
$$- \sum_{i=1}^n \{y_i-B_1(x_i)\}C_2(x_i)\big[I_2(\beta_0-n^{-1/2}t)-I_2(\beta_0)\big] \tag{B.13}$$
$$+ \sum_{i=1}^n B_2(x_i)C_1(x_i)\,I_2(\beta_0) \tag{B.14}$$
$$+ \sum_{i=1}^n B_2(x_i)C_1(x_i)\big[I_2(\beta_0-n^{-1/2}t)-I_2(\beta_0)\big] \tag{B.15}$$
$$- \sum_{i=1}^n B_2(x_i)C_2(x_i)\,I_2(\beta_0-n^{-1/2}t). \tag{B.16}$$

Using techniques substantially like those in Čížek's proofs of his (42)–(47), we can show that the sums in (B.12), (B.13), (B.15) and (B.16) are $O_p(n^{1/4})$ or $o_p(n^{1/2})$, and are therefore asymptotically negligible in comparison with (B.11) and (B.14), which are $O_p(n^{1/2})$. Thus, we omit the proofs here, except for (B.11) and (B.14).

To deal with (B.11), let $v_i(n,t) = I_2(\beta_0-n^{-1/2}t) - I_2(\beta_0)$. So (B.11) can be rewritten as

$$\sum_{i=1}^n \{y_i-B_1(x_i)\}C_1(x_i)\,v_i(n,t) = \sum_{i=1}^n \{y_i-h(x_i,\beta_0)\}\,h'_\beta(x_i,\beta_0)\,v_i(n,t) = \sum_{i=1}^n r_i(\beta_0)\,h'_\beta(x_i,\beta_0)\,v_i(n,t)$$
$$= \frac12\sum_{i=1}^n\Big[\{r_i(\beta_0)-\operatorname{sgn} r_i(\beta_0)\,q_\lambda\} + \{r_i(\beta_0)-\operatorname{sgn} r_i(\beta_0)\,q_{1-\lambda}\} + \operatorname{sgn} r_i(\beta_0)\,(q_\lambda+q_{1-\lambda})\Big]h'_\beta(x_i,\beta_0)\,v_i(n,t)$$
$$= \frac12\Big[\sum_{i=1}^n\{r_i(\beta_0)-\operatorname{sgn} r_i(\beta_0)\,q_\lambda\}\,h'_\beta(x_i,\beta_0)\,v_i(n,t) \tag{B.17}$$
$$+ \sum_{i=1}^n\{r_i(\beta_0)-\operatorname{sgn} r_i(\beta_0)\,q_{1-\lambda}\}\,h'_\beta(x_i,\beta_0)\,v_i(n,t) \tag{B.18}$$
$$+ \sum_{i=1}^n \operatorname{sgn} r_i(\beta_0)\,(q_\lambda+q_{1-\lambda})\,h'_\beta(x_i,\beta_0)\,v_i(n,t)\Big]. \tag{B.19}$$

Again, using techniques substantially like those of Čížek with our Lemmas 22 and 23, (B.17) and (B.18) multiplied by $n^{-1/4}$ can be shown to be bounded in probability for λ ∈ (1/2, 1). Moreover, (B.19) can be rewritten as

$$\sum_{i=1}^n \operatorname{sgn} r_i(\beta_0)\,(q_\lambda+q_{1-\lambda})\,h'_\beta(x_i,\beta_0)\,v_i(n,t) = n^{1/2}(q_\lambda+q_{1-\lambda})[H(\lambda)-H(1-\lambda)]\,Q_h t + O(1) + o_p(n^{1/2}).$$
Therefore, we conclude that

$$\sup_{t\in T_M} n^{-1/2}\Big\|\sum_{i=1}^n r_i(\beta_0)\,h'_\beta(x_i,\beta_0)\,v_i(n,t) - \frac12 n^{1/2}(q_\lambda+q_{1-\lambda})[H(\lambda)-H(1-\lambda)]\,Q_h t\Big\| = o_p(1),$$

as n → ∞.

Finally, we split (B.14) into two parts:

$$\sum_{i=1}^n B_2(x_i)C_1(x_i)\,I_2(\beta_0) = \sum_{i=1}^n h'_\beta(x_i,\xi)^T n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0)$$
$$= \sum_{i=1}^n h'_\beta(x_i,\beta_0)^T n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0) \tag{B.20}$$
$$+ \sum_{i=1}^n n^{-1/2}t^T h''_{\beta\beta}(x_i,\xi'')\cdot n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0), \tag{B.21}$$

where ξ″ is between β₀ and β₀ − n^{-1/2}t. Note that the supremum of (B.21) over t ∈ T_M is $O_p(1)$. Since

$$\Big|\sum_{i=1}^n n^{-1/2}t^T h''_{\beta\beta}(x_i,\xi'')\cdot n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0)\Big| \le \sum_{i=1}^n\big\|n^{-1/2}t^T\cdot h''_{\beta\beta}(x_i,\xi'')\cdot n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\big\|,$$

by the law of large numbers for mixingales in Andrews (1988) and the uniform law of large numbers in Andrews (1992), applied to the right-hand side of the inequality over β″ ∈ U(β₀, δ), we have

$$\frac1n\sum_{i=1}^n\big|t^T h''_{\beta\beta}(x_i,\beta'')t\cdot h'_\beta(x_i,\beta_0)\big| \stackrel{p}{\to} E\big|t^T h''_{\beta\beta}(x_i,\beta'')t\cdot h'_\beta(x_i,\beta_0)\big|,$$

as n → ∞. Moreover, (B.21) is bounded in probability because the expectation is bounded uniformly over t ∈ T_M, by Assumption TH5 and ‖t‖ ≤ M.

Next we turn to (B.20). Similarly, split it into three parts:

$$\sum_{i=1}^n h'_\beta(x_i,\beta_0)^T n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0) = \sum_{i=1}^n h'_\beta(x_i,\beta_0)^T n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\big[I_2(\beta_0)-I_{2G}(\beta_0)\big] \tag{B.22}$$
$$+ \frac{1}{\sqrt n}\sum_{i=1}^n\Big\{h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T I_{2G}(\beta_0) - E\big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T I_{2G}(\beta_0)\big]\Big\}\,t \tag{B.23}$$
$$+ \frac{1}{\sqrt n}\sum_{i=1}^n E\big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T I_{2G}(\beta_0)\big]\,t, \tag{B.24}$$

where $I_{2G}(\beta_0) = I_{\{G^{-1}(1-\lambda)\le r_i^2(\beta_0)\le G^{-1}(\lambda)\}}$ with λ ∈ (1/2, 1). The supremum of (B.22) taken over t ∈ T_M is $O_p(n^{1/4})$ as n → ∞, and (B.23) is bounded in probability by the central limit theorem. Again, the proofs are omitted here because they are so similar to Čížek's.
Finally, since

$$E\big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T I_{2G}(\beta_0)\big] = E_{X_i}\Big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T\cdot E\big\{I_{\{G^{-1}(1-\lambda)\le r_i^2(\beta_0)\le G^{-1}(\lambda)\}}\,\big|\,X_i\big\}\Big] = (2\lambda-1)E_{X_i}\big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T\big] = (2\lambda-1)Q_h, \tag{B.25}$$

(B.24) can be rewritten as $n^{1/2}(2\lambda-1)Q_h t$, where λ ∈ (1/2, 1). Thus, we can conclude that

$$\sup_{t\in T_M}\Big\|\sum_{i=1}^n h'_\beta(x_i,\beta_0)^T n^{-1/2}t\cdot h'_\beta(x_i,\beta_0)\,I_2(\beta_0) - n^{1/2}(2\lambda-1)Q_h t\Big\| = O_p(1),$$

as n → ∞. The proof of Proposition 3 is completed by combining all of the above results.

B.2.4 Proofs of our main theorems

The proofs of our Theorems 6 and 7 can be obtained using Čížek's proofs of his Theorems 4.1, 4.2 and 4.3, but with our lemmas and propositions for the two-sided LTS estimators. For the sake of completeness, we prove only Theorem 7, using Theorem 6 and Proposition 3, because Theorem 7 is the most directly useful in practice.

Proof of Theorem 7. From Theorem 6, we have $t_n = \sqrt n(\beta_0 - \beta_n^{(LTS,h_n)}) = O_p(1)$ as n → ∞. Then, using Proposition 3, with probability approaching 1 we have

$$n^{-1/2}\Big(\frac{D_{n1}(t_n)}{-2} - n^{1/2}Q_h C_\lambda t_n\Big) = n^{-1/2}\Big(\frac{D_{n1}(\sqrt n(\beta_0-\beta_n^{(LTS,h_n)}))}{-2} + n^{1/2}Q_h C_\lambda\sqrt n(\beta_n^{(LTS,h_n)}-\beta_0)\Big) = o_p(1),$$

where $C_\lambda = (2\lambda-1) + \big(\frac{q_\lambda+q_{1-\lambda}}{2}\big)[H(\lambda)-H(1-\lambda)]$, $H(\lambda) = f(q_\lambda)+f(-q_\lambda)$ and $q_\lambda = \sqrt{G^{-1}(\lambda)}$.

Then, by simple algebra with the definition of $\beta_n^{(LTS,h_n)}$, we have

$$\sqrt n(\beta_n^{(LTS,h_n)}-\beta_0) = n^{-1/2}Q_h^{-1}C_\lambda^{-1}\sum_{i=1}^n r_i(\beta_0)\,h'_\beta(x_i,\beta_0)\,I_{2G}(\beta_0) + o_p(1) \tag{B.26}$$
$$+\; n^{-1/2}Q_h^{-1}C_\lambda^{-1}\sum_{i=1}^n r_i(\beta_0)\,h'_\beta(x_i,\beta_0)\big[I_2(\beta_0)-I_{2G}(\beta_0)\big]. \tag{B.27}$$

First, we show that (B.27) is negligible in probability. Recall that $r_i(\beta_0) \stackrel{\rm def}{=} u_i$. Thus, (B.27) can be rewritten as

$$n^{-1/2}Q_h^{-1}C_\lambda^{-1}\sum_{i=1}^n u_i\,h'_\beta(x_i,\beta_0)\Big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\Big].$$

Then our Proposition 4 and Assumption TD2 imply that, for k = 1 and 2,

$$E\Big|u_i\Big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\Big]\Big|^k = O(n^{-1/2}), \tag{B.28}$$

as n → ∞.
Therefore, the summands in (B.27) multiplied by $n^{1/4}$ have a finite expectation,

$$E\Big|n^{1/4}u_i\,h'_\beta(x_i,\beta_0)\Big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\Big]\Big| = o(1),$$

and variance

$$\operatorname{var}\Big\{n^{1/4}u_i\,h'_\beta(x_i,\beta_0)\big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\Big\}$$
$$\le n^{1/2}E_{X_i}\Big\{h'_\beta(x_i,\beta_0)\cdot\operatorname{var}\big(u_i\big|I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big|\ \big|\ X_i\big)\cdot h'_\beta(x_i,\beta_0)^T\Big\}$$
$$+ n^{1/2}\operatorname{var}_{X_i}\Big\{h'_\beta(x_i,\beta_0)\cdot E\big(u_i\big|I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big|\ \big|\ X_i\big)\Big\}$$
$$\le O(1)\Big\{E_{X_i}\{h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T\} + \operatorname{var}_{X_i}(h'_\beta(x_i,\beta_0))\Big\} = O(1),$$

by Assumption TH5 and the independence of x_i and u_i. Now, since all indicators depend only on the squared residual $u_i^2$ and the error terms u_i are symmetrically distributed by Assumption TD2, we have that, for any i = 1, 2, …, n and any n ∈ ℕ,

$$E\Big\{n^{1/4}u_i\,h'_\beta(x_i,\beta_0)\big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\Big\} = 0.$$

In the conditional case, we get

$$E\Big\{n^{1/4}u_i\,h'_\beta(x_i,\beta_0)\big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\ \Big|\ u_1,\ldots,u_{i-1},x_1,\ldots,x_{i-1}\Big\} = 0.$$

Therefore, similarly to Čížek's one-sided case,

$$n^{1/4}u_i\,h'_\beta(x_i,\beta_0)\big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]$$

forms a sequence of martingale differences with finite variances. Applying the law of large numbers for sums of martingale differences to (B.27), we have

$$n^{-1/2}Q_h^{-1}C_\lambda^{-1}\sum_{i=1}^n u_i\,h'_\beta(x_i,\beta_0)\big[I_{\{u^2_{[n-h_n+1]}\le u_i^2\le u^2_{[h_n]}\}} - I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big] \stackrel{p}{\to} 0, \tag{B.29}$$

as n → ∞. Thus, (B.27) is negligible in probability, that is, $o_p(1)$. Since (B.27) is negligible, (B.26) gives

$$\sqrt n(\beta_n^{(LTS,h_n)}-\beta_0) = n^{-1/2}Q_h^{-1}C_\lambda^{-1}\sum_{i=1}^n r_i(\beta_0)\,h'_\beta(x_i,\beta_0)\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}} + o_p(1). \tag{B.30}$$
Additionally, using the same arguments as for (B.27), the summands in (B.30) form a sequence of identically distributed martingale differences with finite second moments, by Assumptions TD2 and TH5. Then, by the law of large numbers for L₁-mixingales in Andrews (1988), we have

$$\frac1n\sum_{i=1}^n u_i^2\,h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T\, I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}} \stackrel{p}{\to} \operatorname{var}\big[u_i\,h'_\beta(x_i,\beta_0)\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big],$$

as n → ∞. Therefore, the proof of Theorem 7 for the asymptotic normality of the two-sided LTS estimator $\beta_n^{(LTS,h_n)}$ is completed by the central limit theorem for the martingale differences in (B.30), with the asymptotic variance

$$V_{2\lambda} = C_\lambda^{-2}\cdot Q_h^{-1}\cdot\operatorname{var}\big[u_i\,h'_\beta(x_i,\beta_0)\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\cdot Q_h^{-1}$$
$$= C_\lambda^{-2}\cdot Q_h^{-1}\cdot E\Big\{\big[h'_\beta(x_i,\beta_0)\,u_i\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\big[h'_\beta(x_i,\beta_0)\,u_i\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]^T\Big\}\cdot Q_h^{-1}$$
$$= C_\lambda^{-2}\cdot Q_h^{-1}\cdot E\big[h'_\beta(x_i,\beta_0)h'_\beta(x_i,\beta_0)^T\big]\cdot E\big[u_i^2\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]\cdot Q_h^{-1}$$
$$= C_\lambda^{-2}\cdot Q_h^{-1}\cdot Q_h\cdot\sigma^2_{2\lambda}\cdot Q_h^{-1} = C_\lambda^{-2}\cdot\sigma^2_{2\lambda}\cdot Q_h^{-1},$$

where $\sigma^2_{2\lambda} = E\big[u_i^2\,I_{\{G^{-1}(1-\lambda)\le u_i^2\le G^{-1}(\lambda)\}}\big]$ and λ ∈ (1/2, 1).

B.3 References

Andrews, D.W.K. (1988). Laws of large numbers for dependent non-identically distributed random variables. Econometric Theory, 4, 458-467.

Andrews, D.W.K. (1992). Generic uniform convergence. Econometric Theory, 8, 241-257.

Čížek, P. (2004). Asymptotics of least trimmed squares regression. CentER Discussion Paper 2004-72, Tilburg University, The Netherlands.

Čížek, P. (2005). Least trimmed squares in nonlinear regression under dependence. J. Statist. Plann. Inference, 136, 3967-3988.

Appendix C: Inequalities for Medloss and Risk

Here we have more comparisons of the medloss with the risk.
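As a quick numerical reference point for these comparisons, the following sketch (a hypothetical illustration, not part of the formal development; the t₃ error distribution and sample size are assumptions) contrasts the medloss and the risk of a heavy-tailed estimation error under squared error loss. The heavy tails inflate the risk far more than the medloss.

```python
import random
import statistics

# Hypothetical illustration: estimation error T ~ t_3 (heavy-tailed),
# squared error loss L = T^2. Compare medloss (median of L) with risk (mean of L).
random.seed(3)

def t3() -> float:
    # Student-t with 3 df, generated as N(0,1) / sqrt(chi^2_3 / 3)
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / (chi2 / 3) ** 0.5

losses = [t3() ** 2 for _ in range(100_000)]
medloss = statistics.median(losses)   # stable under heavy tails (~0.59 in theory)
risk = statistics.fmean(losses)       # dominated by the tails (E T^2 = 3)
assert medloss < risk
```

For this distribution the medloss sits near 0.59 while the risk is driven toward 3 by the tails, which is the skewness phenomenon that the inequalities below formalize.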
C.1 Inequalities for medloss and risk

Since we expect the medloss to behave better than the risk for heavier-tailed distributions, we note that these are the cases in which the distribution of L(δ, θ) is most skewed, so that we expect the medloss to be less than the risk. These are also the cases for which the median is more representative of the typical loss than the risk is, and hence the cases where the medloss should be preferred to the risk. Accordingly, the hypotheses of our results identify those cases where our method is appropriate. Recall that the loss function L(θ, δ(X)) is normally defined by non-negative increasing functions of |θ − δ(X)| ≥ 0.

First of all, we know that

$$m_Y \le E(Y) \iff P(Y\le E(Y)) \ge \frac12, \tag{C.1}$$

where $m_Y$ is the median of Y. To see this, note that for "⇒": since $m_Y \le E(Y)$, we have $P(Y\le E(Y)) \ge P(Y\le m_Y) \ge \frac12$. For "⇐": if $P(Y\le E(Y)) \ge \frac12$ and we set $I_\alpha = \{\alpha : F_Y(\alpha)\ge\frac12\}$, then $E(Y)\in I_\alpha$ and $m_Y = \inf I_\alpha \le E(Y)$. Therefore, we have the following lemma.

Lemma 24 (Transformation invariance). For any non-negative, non-decreasing, convex function g(t) on t ≥ 0:

(A) Frequentist result: Let X be a non-negative random variable. If $m_X \le E(X)$, then $m_{g(X)} \le E(g(X))$, where $m_X$ and E(X) represent the median and mean of X, respectively.

(B) Bayesian result: Suppose that Θ ≥ 0 and that the posterior median is less than or equal to the posterior mean, i.e. med(Θ|x) ≤ E(Θ|x). Then med(g(Θ)|x) ≤ E(g(Θ)|x).

Proof. Here we prove the result for the Frequentist approach only, because the proof for the Bayesian approach is similar. By Jensen's inequality, since g is convex, $g(E(X)) \le E(g(X))$. Thus, since g is non-decreasing and $m_X \le E(X)$, we have

$$P(g(X) > E g(X)) \le P(g(X) > g(E(X))) \le P(X > E(X)) \le P(X > m_X) \le \frac12,$$

so that $P(g(X)\le E g(X)) \ge \frac12$. Then we are done by (C.1).

Example. Suppose that W ∼ N(θ, 1), X = |W − θ| and g(X) = X². Since

$$EX = E|W-\theta| = \sqrt{2/\pi} \approx 0.7979 \quad\text{and}\quad m_X = \operatorname{med}|W-\theta| = \Phi^{-1}(0.75) \approx 0.6745,$$

we have $m_X \le EX$, and so, by Lemma 24(A), $\operatorname{med}(W-\theta)^2 \le E(W-\theta)^2$.
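The normal-distribution example above can be checked by simulation. The sketch below (illustrative only; the sample size is an arbitrary choice) recovers the two constants and confirms both the hypothesis and the conclusion of Lemma 24(A).

```python
import random
import statistics

# Monte Carlo check of Lemma 24(A) for W ~ N(theta, 1), X = |W - theta|:
# med X = Phi^{-1}(0.75) ~ 0.6745 and E X = sqrt(2/pi) ~ 0.7979, so
# med X <= E X, and hence med X^2 <= E X^2 by the lemma (with g(x) = x^2).
random.seed(0)
xs = [abs(random.gauss(0.0, 1.0)) for _ in range(100_000)]
med_x = statistics.median(xs)
mean_x = statistics.fmean(xs)
med_x2 = statistics.median(v * v for v in xs)
mean_x2 = statistics.fmean(v * v for v in xs)
assert med_x < mean_x        # hypothesis of Lemma 24(A)
assert med_x2 < mean_x2      # conclusion: med(W - theta)^2 <= E(W - theta)^2
```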
C.1.1 Inequality for the median loss and expected loss for unimodal distributions

According to Pukelsheim (1994), Gauss, in 1823, proved that for a unimodal random variable X with mode 0,

$$P(|X|\ge k) \le \frac{4EX^2}{9k^2}, \tag{C.2}$$

and Winckler, in 1866, extended the inequality (C.2) to higher moments: for r > 0 and X unimodal with mode 0,

$$P(|X|\ge k) \le \Big(\frac{r}{r+1}\Big)^r\frac{E|X|^r}{k^r}. \tag{C.3}$$

Here, unimodality about 0 reduces the Chebyshev bound by a factor of at least $\{r/(r+1)\}^r$. Some of the most recent papers on this inequality are Sellke (1996) and Sellke and Sellke (1997), which give upper bounds on P(|X| ≤ k) in terms of Eh(X), where h is an even function on ℝ that is nondecreasing on [0, ∞); Pukelsheim (1994), who gave three proofs of the Gauss inequality (C.2); and Csiszár, Móri and Székely (2005).

Based on the above inequalities, we have the following results. First we deal with L₁ loss.

Theorem 11. (A) Frequentist result: For a unimodal random variable X with mode θ,

$$\operatorname{med}(|X-\theta|) \le E(|X-\theta|). \tag{C.4}$$

(B) Bayesian result: If the posterior distribution of Θ given X = x is unimodal, then

$$\operatorname{med}(|\Theta-m_0|\,|\,x) \le E(|\Theta-m_0|\,|\,x), \tag{C.5}$$

where $m_0$ denotes the posterior mode of Θ given X = x.

Proof. Since X − θ is unimodal with mode 0, we put r = 1 and k = E|X − θ| in expression (C.3); then we have $P(|X-\theta|\ge E|X-\theta|) \le \frac12$, so the proof of (C.4) is completed by (C.1). The result (C.5) can be shown with the same technique, so we skip it here.

The L₁ loss result extends to a general class of loss functions.

Corollary 7. Let g be a non-negative, non-decreasing, convex function on [0, ∞) with g(0) = 0. If X is unimodal with mode θ, then

$$\operatorname{med}(g(|X-\theta|)) \le E(g(|X-\theta|)). \tag{C.6}$$

Proof. By Lemma 24(A).

Corollary 8. Let g be a non-negative, non-decreasing, convex function on [0, ∞) with g(0) = 0.
If the posterior distribution of Θ given X = x is unimodal, then

$$\operatorname{med}(g(|\Theta-m_0|)\,|\,x) \le E(g(|\Theta-m_0|)\,|\,x), \tag{C.7}$$

where $m_0$ denotes the posterior mode of Θ given X = x.

Proof. By Lemma 24(B).

C.1.2 Some results for symmetric and unimodal location families

First, suppose the distribution of X belongs to a location family with pdf f(x|θ) that is symmetric about the unknown location parameter. Denote the location parameter, standard deviation and 75th percentile of f(x|θ) by θ, σ and $Q_\theta(3)$, respectively; in other words, EX = θ. Since X ∼ f(x|θ) = f(x − θ|0), if we let Z = X − θ, then Z ∼ f(z|0), which is symmetric about 0. Moreover, EZ = E(X − θ) = 0, $\operatorname{Var}(Z) = EZ^2 = E(X-\theta)^2 = \sigma^2$ and $Q_0(3) = Q_\theta(3) - \theta$, where $Q_0(3)$ is the 75th percentile of Z. If we let m be the median of Z², i.e. m = med(Z²), then

$$\frac12 = P(Z^2\le m) = P(-\sqrt m\le Z\le\sqrt m) = 2\int_0^{\sqrt m}f(z|0)\,dz \;\Longrightarrow\; m = Q_0(3)^2.$$

Therefore, we have the following lemma.

Lemma 25. Suppose that the distribution of X is a member of a location family that is symmetric and unimodal about the location parameter θ. Then we have

$$\operatorname{med}(X-\theta)^2 < E(X-\theta)^2.$$

Proof. By the Gauss inequality (C.2) with $k = Q_0(3)$, we have $Q_0(3)^2 < \sigma^2$, where $Q_0(3)$ is the upper quartile of X − θ.

Furthermore, we notice that $\operatorname{med}|X-\theta| = Q_0(3)$. So we have the lemma below.

Lemma 26. Suppose that the distribution of X is a member of a location family that is symmetric and unimodal about the location parameter θ. Then we have

$$\operatorname{med}|X-\theta| \le E|X-\theta|.$$

Proof. By the inequality (C.3) with r = 1 and $k = Q_0(3)$, we get $Q_0(3) \le E|X-\theta|$.

Corollary 9. Let g(y) be a non-negative, non-decreasing, convex function on y ≥ 0. If the distribution of X is a member of a location family that is symmetric and unimodal about the unknown location parameter θ, then we have

$$\operatorname{med}(g(|X-\theta|)) \le E(g(|X-\theta|)).$$

Proof. By Lemma 24(A) and Lemma 26.
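Lemmas 25 and 26 can likewise be illustrated numerically. The sketch below (an assumed setup, not part of the formal argument) uses a Laplace(0, 1) error, a symmetric unimodal location family, generated as a difference of two independent exponentials, and checks both inequalities by Monte Carlo.

```python
import random
import statistics

# Monte Carlo illustration of Lemmas 25 and 26 (hypothetical setup):
# X = theta + Z with Z ~ Laplace(0, 1), symmetric and unimodal, sigma^2 = 2.
random.seed(2)
theta = 3.0
x = [theta + random.expovariate(1.0) - random.expovariate(1.0)
     for _ in range(100_000)]
abs_dev = [abs(xi - theta) for xi in x]   # |X - theta|
sq_dev = [d * d for d in abs_dev]         # (X - theta)^2

# Lemma 26: med|X - theta| <= E|X - theta|  (ln 2 ~ 0.693 versus 1)
assert statistics.median(abs_dev) < statistics.fmean(abs_dev)
# Lemma 25: med(X - theta)^2 < E(X - theta)^2  ((ln 2)^2 ~ 0.480 versus 2)
assert statistics.median(sq_dev) < statistics.fmean(sq_dev)
```

Here the gap between median and mean is large in both cases because the Laplace density, while unimodal, has heavier tails than the normal.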
Some extensions of symmetric and unimodal location families

In this part, we give two extensions of the result in the previous section. The first extension (A) considers the n-data-point case with an L-estimator (a function of the order statistics), while the second (B) uses convexity alone to get medloss ≤ risk.

(A) Suppose that {X_i, i = 1, …, n} is an i.i.d. sample from a symmetric and unimodal location distribution, symmetric about the location parameter θ and with variance σ². Denote $L(X^n) = \sum_{i=1}^n c_i X_{(i)}$, where $X_{(i)}$ is the i-th order statistic of the $X_i$, i = 1, …, n, and $\sum_{i=1}^n c_i = 1$.

Proposition 6. $EL(X^n) = \theta$ and $E(L(X^n)-\theta)^2 = \sigma^2\big(\sum_{i=1}^n c_i^2\big)$.

Proof. By rearrangement of the elements of $L(X^n)$, we have $L(X^n) = \sum_{j=1}^n d_j X_j$, where {d_j, j = 1, …, n} is a rearrangement of {c_j, j = 1, …, n}. Clearly, $\sum_{j=1}^n d_j = \sum_{i=1}^n c_i = 1$ and $\sum_{j=1}^n d_j^2 = \sum_{i=1}^n c_i^2$. Thus

$$EL(X^n) = E\sum_{i=1}^n c_i X_{(i)} = E\sum_{j=1}^n d_j X_j = \sum_{j=1}^n d_j\, EX_j = \theta.$$

Similarly,

$$(L(X^n))^2 = \Big(\sum_{i=1}^n d_iX_i\Big)\Big(\sum_{j=1}^n d_jX_j\Big) = \sum_{i=1}^n d_i^2X_i^2 + \sum_{i\ne j} d_id_jX_iX_j,$$

so that

$$E(L(X^n))^2 = \sum_{i=1}^n d_i^2(EX_1^2) + \sum_{i\ne j}d_id_j(EX_1)^2 = \sum_{i=1}^n d_i^2(EX_1^2) + \Big[\Big(\sum_i d_i\Big)\Big(\sum_j d_j\Big)-\sum_j d_j^2\Big](EX_1)^2 = \Big[\sum_j d_j^2\Big]\big[(EX_1^2)-(EX_1)^2\big] + (EX_1)^2 = \Big(\sum_j d_j^2\Big)\sigma^2 + \theta^2.$$

Thus,

$$E(L(X^n)-\theta)^2 = E(L(X^n))^2 - 2\theta E(L(X^n)) + \theta^2 = \sigma^2\Big(\sum_{j=1}^n d_j^2\Big) = \sigma^2\Big(\sum_{i=1}^n c_i^2\Big). \tag{C.8}$$

Analogous to the case of a single X in the last section, we let $T_i = X_i - \theta$, where {T_i, i = 1, …, n} is a sample from a distribution symmetric about 0, and $T_{(i)} = X_{(i)} - \theta$. Thus $L(X^n) - \theta = \sum_{j=1}^n c_j T_{(j)} = \sum_{i=1}^n d_i T_i$. Before we go on, we need the following theorem (Medgyessy, 1977, p. 24).

Theorem 12 (Medgyessy's theorem). Let F₁(x) and F₂(x) be the distribution functions of X₁ and X₂, respectively, with X₁ independent of X₂.
If F_1(x) is a unimodal d.f. symmetric about a_1, and F_2(x) is a unimodal d.f. symmetric about a_2, then their convolution F_1(x) ∗ F_2(x) is a unimodal d.f. symmetric about a_1 + a_2.

Since each d_i T_i is unimodal and symmetric about 0, by Medgyessy's theorem and induction, L(X^n) − θ is also unimodal and symmetric about 0. Then, using the result for the single-X case, we have the following lemma.

Lemma 27. (i) Suppose that {X_i, i = 1, ..., n} is an i.i.d. sample from a unimodal location distribution that is symmetric about the location parameter θ. Denote L(X^n) = Σ_{i=1}^n c_i X_(i), where X_(i) is the i-th order statistic of X_1, ..., X_n and Σ_{i=1}^n c_i = 1. Then we have

    med|L(X^n) − θ| < E|L(X^n) − θ|.                                 (C.9)

(ii) For any non-negative, non-decreasing, convex function g(t) on t ≥ 0, we have

    med[g(|L(X^n) − θ|)] < E[g(|L(X^n) − θ|)].                       (C.10)

For extension (B), we want to show that if L(δ(X), θ) = g(δ(X) − θ) ≥ 0, where g is convex and attains its minimum at 0 with g(0) = 0, then

    med[g(δ(X) − θ)] ≤ E[g(δ(X) − θ)].

However, in this part we only have the result for δ(X) = X.

Next we derive sufficient conditions for the convexity of g to make the medloss preferred to the risk. Consider a symmetric and unimodal location distribution with unknown location parameter θ, i.e. X ∼ f_X(x|θ), and let T = X − θ ∼ f_T(t|0), which is symmetric about 0. We have medT = 0 = ET, medX = θ = EX, ET² = σ² = E(X − θ)², medT² = Q_T(3)², and med|T| = Q_T(3), where Q_T(3) is the 75th percentile of T. In the following, we will prove that if g(t) ≥ 0 is convex in t and 0 is the unique solution of g(t) = 0, then P(g(T) < Eg(T)) > 1/2 under some conditions. First of all, we need the following result.

Proposition 7. If g(χ) is convex in χ and 0 is the unique solution of g(χ) = 0, then g(χ) < K ⟺ K_1 < χ < K_2, where g(K_1) = K = g(K_2).
In particular, if χ is non-negative, then g(χ) < K ⟺ 0 ≤ χ < K′, where g(K′) = K.

Proof. (⇐) Suppose K_1 < χ < K_2 with g(K_1) = K = g(K_2). Let f(λ) = (1 − λ)K_1 + λK_2, where λ ∈ [0, 1]. By the continuity of f(λ) in λ, there exists λ* ∈ [0, 1] such that f(λ*) = χ, i.e. χ = (1 − λ*)K_1 + λ*K_2. Therefore, g(χ) = g((1 − λ*)K_1 + λ*K_2) < (1 − λ*)g(K_1) + λ*g(K_2) = K.

(⇒) Here we show that if g(χ) is convex and 0 is the unique solution of g(χ) = 0, then g(∆) = K has at most two solutions. Assume there are three solutions ∆_1 < ∆_2 < ∆_3 with g(∆_1) = g(∆_2) = g(∆_3) = K. Then there exist λ_1, λ_2, λ_3 ∈ (0, 1) such that g((1 − λ_1)∆_1 + λ_1 ∆_2) < K, g((1 − λ_2)∆_2 + λ_2 ∆_3) < K, and g((1 − λ_3)∆_1 + λ_3 ∆_3) < K. Since we can find α_1 and α_2 with ∆_1 < α_1 < ∆_2 < α_2 < ∆_3, i.e. there exist λ_1*, λ_2* ∈ (0, 1) such that α_1 = (1 − λ_1*)∆_1 + λ_1* ∆_2 and α_2 = (1 − λ_2*)∆_2 + λ_2* ∆_3, we have g(α_1) < K and g(α_2) < K. Moreover, as α_1 < ∆_2 < α_2, ∆_2 can also be rewritten as (1 − λ′)α_1 + λ′α_2 for some λ′ ∈ (0, 1); thus K = g(∆_2) = g((1 − λ′)α_1 + λ′α_2) < (1 − λ′)g(α_1) + λ′g(α_2) < K, which contradicts the assumption that g(∆) = K has three solutions. Therefore, there are at most two solutions K_1 and K_2 (K_1 < K_2) with g(K_1) = K = g(K_2).

So if g(χ) < K, then we have three possible situations: K_1 < χ < K_2, K_1 < K_2 < χ, or χ < K_1 < K_2, with g(K_1) = K = g(K_2). In fact, the last two situations lead to a contradiction. If K_1 < K_2 < χ, then there exists λ_K ∈ (0, 1) such that K_2 = (1 − λ_K)K_1 + λ_K χ, and g(K_2) < (1 − λ_K)g(K_1) + λ_K g(χ), which implies K < g(χ). Similarly, if χ < K_1 < K_2, we again obtain K < g(χ). Thus, if g(χ) < K, then K_1 < χ < K_2, where g(K_1) = K = g(K_2).

In other words, it now suffices to find conditions under which

    P(K_1 < T < K_2) > 1/2,                                          (C.11)

with g(K_1) = Eg(T) = g(K_2).
We notice that since T is symmetric about 0, P(T ≤ 0) = P(T ≥ 0) = 1/2, and P(Q_T(1) < T < Q_T(3)) = 1/2, where Q_T(1) = −Q_T(3) < 0 and Q_T(3) > 0 are the 25th and 75th percentiles of T, respectively. In fact, K_1 and K_2 must have different signs; otherwise P(K_1 < T < K_2) < 1/2. Thus, we can focus on −∞ < K_1 < 0 < K_2 < +∞. For some special cases, if K_1 = 0, then K_2 must be +∞; otherwise (C.11) cannot hold. Similarly, if K_2 = 0, then K_1 must be −∞. Therefore, we have the following result.

Proposition 8. Let the distribution of T be symmetric about 0, let Q_T(1) = −Q_T(3) < 0 and Q_T(3) > 0 be the 25th and 75th percentiles of T respectively, and let K_1 and K_2 (K_1 < K_2) be the solutions of g(∆) = Eg(T), where g ≥ 0 is convex and 0 is the unique solution of g(∆) = 0. Then P(K_1 < T < K_2) > 1/2 iff one of the following three situations holds:

a) K_1 < Q_T(1) and K_2 > Q_T(3) (or min(K_2, |K_1|) > Q_T(3));

b) if K_1 > Q_T(1) and K_2 > Q_T(3), then ∫_{Q_T(1)}^{K_1} f_T(t|0) dt < ∫_{Q_T(3)}^{K_2} f_T(t|0) dt;

c) if K_1 < Q_T(1) and K_2 < Q_T(3), then ∫_{K_1}^{Q_T(1)} f_T(t|0) dt > ∫_{K_2}^{Q_T(3)} f_T(t|0) dt.

Suppose the convex function g(·) is symmetric about 0; then K_1 = −K_2. In this case, the result of Proposition 8 reduces to: P(K_1 < T < K_2) > 1/2 iff K_2 > Q_T(3), or max g^{−1}(Eg(T)) > Q_T(3).

Finally we have:

Corollary 10. Suppose the convex function g(·) is symmetric about 0, and adopt the same notation as in Proposition 8. Then

    med(g(T)) < E(g(T)) iff K_2 > Q_T(3), or max g^{−1}(Eg(T)) > Q_T(3).

This means that as long as a distribution is spread out enough that K_2, or max g^{−1}(Eg(T)), lies to the right of Q_T(3), the tails of the distribution are heavy enough that the median is more representative of the distribution of L(δ, θ) than the risk is.
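A concrete instance of Corollary 10 (our own worked example, with the assumed choices g(t) = t² and T ∼ N(0, σ²)): here Eg(T) = σ², so K_2 = max g^{−1}(Eg(T)) = σ, while Q_T(3) ≈ 0.6745 σ. Since K_2 > Q_T(3), the corollary gives med(T²) < E(T²).

```python
import math

# Check Corollary 10's condition for squared loss under a normal T.
# K_2 is the positive root of g(K) = Eg(T) = sigma^2, i.e. K_2 = sigma;
# Q_T(3) is the upper quartile of N(0, sigma^2).
sigma = 3.0
Q3 = 0.6744897501960817 * sigma      # standard normal upper quartile, scaled
K2 = math.sqrt(sigma ** 2)           # g(t) = t^2  =>  K_2 = sigma

print(K2, Q3)
assert K2 > Q3                       # condition of Corollary 10 holds
# ... and indeed med(T^2) = Q3^2 ~= 0.455 * sigma^2 < sigma^2 = E(T^2):
assert Q3 ** 2 < sigma ** 2
```

So for squared-error loss under normality the medloss is always strictly below the risk, consistent with Lemma 25.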
C.2 General results for exponential families

Rahman and Gupta (1993) proposed the transformed chi-square class of distributions, which relates the one-parameter exponential family

    f(x; θ) = e^{a(x)b(θ) + c(θ) + h(x)}                             (C.12)

to the Gamma distribution (see also Jozani, Nematollahi and Shafie, 2002). They have the following lemma.

Lemma 28 (RG's lemma). In a one-parameter exponential family of the form (C.12), the function −2a(X)b(θ) has a Gamma(k/2, 2) distribution iff

    2c′(θ)b(θ) / b′(θ) = k,                                          (C.13)

where k is positive and free of θ. In case k is an integer, −2a(X)b(θ) follows a central χ² distribution with k degrees of freedom.

They also defined a sub-family of the one-parameter exponential family, having a pdf of the form (C.12) and satisfying (C.13), to be a family of transformed chi-square distributions, given that k is a positive integer. Thus, connecting RG's lemma to our theorem in the last section, we have the following theorem.

Theorem 13. In a one-parameter exponential family of the form (C.12) satisfying condition (C.13), for k > 2 we have

    med(|−a(X)b(θ) − (k/2 − 1)|) ≤ E(|−a(X)b(θ) − (k/2 − 1)|).      (C.14)

Proof. We know that −2a(X)b(θ) has a Gamma(k/2, 2) distribution by RG's lemma, so −a(X)b(θ) follows a Gamma(k/2, 1) distribution. Moreover, Gamma(α, 1) is unimodal with mode α − 1 when α > 1. Thus, the proof is completed by Theorem 11(A).

Corollary 11. Under the same conditions as Theorem 13, suppose that g(t) is a non-negative, non-decreasing convex function on t ∈ [0, ∞) and 0 is the unique solution of g(t) = 0. Then

    med(g(|−a(X)b(θ) − (k/2 − 1)|)) ≤ E(g(|−a(X)b(θ) − (k/2 − 1)|)).   (C.15)

We now extend the above results. Let 𝒳 be the sample space of X and Θ ⊆ R^d be the parameter space of θ = (θ_1, ..., θ_d)^T.
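Before turning to the multi-parameter extension, Theorem 13 can be checked numerically (a sketch of ours, not from the thesis): by RG's lemma, −a(X)b(θ) ∼ Gamma(k/2, 1), whose mode is k/2 − 1 for k > 2; we take k = 6 as an assumed example.

```python
import random
import statistics

# Monte Carlo check of (C.14) for G = -a(X)b(theta) ~ Gamma(k/2, 1), k = 6:
# compare med|G - mode| and E|G - mode| with mode = k/2 - 1 = 2.
random.seed(2)
k = 6
mode = k / 2 - 1
draws = [random.gammavariate(k / 2, 1.0) for _ in range(100_000)]

dev = [abs(g - mode) for g in draws]
med_dev = statistics.median(dev)
mean_dev = statistics.fmean(dev)
print(med_dev, mean_dev)
assert med_dev < mean_dev   # (C.14) holds for this Gamma(3, 1) instance
```

For this instance the median deviation is near 1.0 while the expected deviation is near 1.4, so the inequality holds with a comfortable margin.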
Suppose that X belongs to a d-parameter exponential family, i.e. its density can be written in the form

    f(x; θ) = exp[a^T(x)b(θ) + c(θ) + h(x)],                         (C.16)

where c(θ): R^d → R, b(θ): R^d → R^d, a(X): 𝒳 → R^d and h(X): 𝒳 → R. For a(X) = (a_1(X), a_2(X), ..., a_d(X))^T, b(θ) = (b_1(θ), ..., b_d(θ))^T and k = (k_1, ..., k_d)^T, we define three conditions:

(A) −2a_i(X)b_i(θ) ∼ Ga(k_i/2, 2), i = 1, 2, ..., d.

(B) −2a^T(X)b(θ) ∼ Ga(Σ_{i=1}^d k_i / 2, 2).

(C) 2∇c(θ) = ∇[ln b^T(θ)] k, where ∇ = (∂/∂θ_1, ..., ∂/∂θ_d)^T.

Then we have the following theorem.

Theorem 14. (a) If X belongs to the d-parameter exponential family of (C.16), then

    (C) ⟹ (B);                                                       (C.17)

(b) if the components of a(X) are independent, then

    (i) (A) ⟹ (C)                                                    (C.18)

and X belongs to the d-parameter exponential family of the form (C.16). Moreover, the independence of the components of a(X) implies that

    (ii) (A) ⟹ (B).                                                  (C.19)

Proof. First, we prove (C.17). Denote c_j = ∂c(θ)/∂θ_j. From (C), we have

    c_j = (1/2) Σ_{i=1}^d k_i (∂/∂θ_j) ln b_i,   ∀j = 1, ..., d,     (C.20)

which implies

    c(θ) = (1/2) Σ_{i=1}^d k_i ln b_i(θ) + γ,                        (C.21)

where γ is an arbitrary constant. Thus,

    ∫_𝒳 exp{a^T(x)b(θ) + h(x)} dx = exp{−c(θ)} = exp[−(1/2) Σ_{i=1}^d k_i ln b_i(θ) − γ].   (C.22)

Let U = −2a^T(X)b(θ). The characteristic function of U is given by

    φ_U(t) = E{exp(itU)}
           = ∫_𝒳 exp{−2it a^T(x)b(θ)} exp{a^T(x)b(θ) + c(θ) + h(x)} dx
           = exp{c(θ)} ∫_𝒳 exp{(1 − 2it) a^T(x)b(θ) + h(x)} dx
           = exp{c(θ)} exp[−(1/2) Σ_{j=1}^d k_j ln{(1 − 2it) b_j(θ)} − γ]     (from (C.22))
           = exp{(1/2) Σ_{j=1}^d k_j ln(1/(1 − 2it))}
           = (1 − 2it)^{−(1/2) Σ_{j=1}^d k_j},

which is also the characteristic function of a Gamma distribution with parameters Σ_{j=1}^d k_j / 2 and 2. Since the characteristic function uniquely determines the distribution function, (B) follows and the proof of (C.17) is complete.
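The conclusion (B) can also be illustrated by simulation (a rough sketch of ours, with assumed values k = (2, 3, 5)): if the components U_i = −2a_i(X)b_i(θ) are independent Ga(k_i/2, 2) variables as in (A), their sum should follow Ga(Σ k_i / 2, 2), whose mean is Σ k_i and whose variance is 2 Σ k_i. Here we only compare these first two moments rather than the full distribution.

```python
import random
import statistics

# Sum of independent Gamma(k_i/2, scale 2) draws; under (B) the sum is
# Gamma(alpha, scale 2) with alpha = sum(k_i)/2, so mean = 2*alpha = 10
# and variance = 4*alpha = 20 for k = [2, 3, 5].
random.seed(3)
k = [2, 3, 5]                  # assumed k_i values for illustration
alpha = sum(k) / 2             # shape parameter of the claimed sum law

u = [sum(random.gammavariate(ki / 2, 2.0) for ki in k) for _ in range(100_000)]
m, v = statistics.fmean(u), statistics.pvariance(u)
print(m, v)
assert abs(m - 2 * alpha) < 0.1    # mean ~ 10
assert abs(v - 4 * alpha) < 1.0    # variance ~ 20
```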
Second, from (A) and the proof of Theorem 2.1 in Rahman and Gupta (1993), we know that for each a_j(X), its pdf can be written as

    exp[a_j(x)b_j(θ) + (k_j/2) ln{−b_j(θ)} + (k_j/2 − 1) ln{a_j(x)} − ln Γ(k_j/2)],   (C.23)

for all j = 1, ..., d. By the independence of all the components of a(X), the joint pdf of a(X) is

    Π_{j=1}^d exp[a_j(x)b_j(θ) + (k_j/2) ln{−b_j(θ)} + (k_j/2 − 1) ln{a_j(x)} − ln Γ(k_j/2)]
    = exp[Σ_{j=1}^d a_j(x)b_j(θ) + (1/2) Σ_{j=1}^d k_j ln{−b_j(θ)} + Σ_{j=1}^d (k_j/2 − 1) ln a_j(x) − Σ_{j=1}^d ln Γ(k_j/2)]
    = exp[a^T(x)b(θ) + c(θ) + h(x)],

where

    c(θ) = (1/2) Σ_{j=1}^d k_j ln{−b_j(θ)}.                          (C.24)

Thus, X belongs to the d-parameter exponential family of the form (C.16). Taking the partial derivatives of (C.24) with respect to the components of θ yields (C.20), which implies (C). The proof of (C.19) follows immediately from (C.17) and (C.18); it can also be found in elementary books on statistical inference.

C.3 References

Csiszár, V., Móri, T.F. and Székely, G.J. (2005). Chebyshev-type inequality for scale mixtures. Statistics and Probability Letters, 71, 323-335.

Jozani, M.J., Nematollahi, N. and Shafie, K. (2002). An admissible minimax estimator of a bounded scale-parameter in a subclass of the exponential family under scale-invariant squared-error loss. Statistics and Probability Letters, 60, 437-444.

Medgyessy, P. (1977). Decomposition of Superpositions of Density Functions and Discrete Distributions. John Wiley and Sons, New York: Halsted Press.

Pukelsheim, F. (1994). The three sigma rule. The American Statistician, 48(2), 88-91.

Rahman, M.S. and Gupta, R.P. (1993). Family of transformed chi-square distributions. Communications in Statistics: Theory and Methods, 22(1), 135-146.

Sellke, T.M. (1996). Generalized Gauss-Chebyshev inequalities for unimodal distributions. Metrika, 43, 107-121.

Sellke, S.H. and Sellke, T.M. (1997).
Chebyshev inequalities for unimodal distributions. The American Statistician, 51(1), 34-40.

Appendix D
Appendix: Nonparametric Curve Estimation

D.1 Nonparametric curve estimation with kernel smoothing approaches

An alternative smoothing method to smoothing splines for estimating the true curve m is kernel smoothing. The basic idea of a kernel smoother is to weight the neighbours of a point, say x_0, by some kernel function K, and then fit a curve around x_0 based on the locally weighted data. Widely used kernels K are:

1. K(u) = I{|u| ≤ c}   (uniform or "box" kernel)

2. K(u) = 1 − u² if |u| ≤ 1 and 0 otherwise   (Epanechnikov kernel)

3. K(u) = exp(−u²/(2σ²))   (normal kernel)

The simplest and most important kernel-smoother estimate at x_0 is the Nadaraya-Watson (N-W) kernel estimate, which minimizes the following criterion over m at x_0:

    Σ_{i=1}^n (y_i − m)² K((x_i − x_0)/h),                           (D.1)

where h is called the bandwidth; it plays the same role as the smoothing parameter in smoothing spline approaches, controlling the trade-off between smoothness and goodness of fit. The minimizer of (D.1) is

    m̂_h^E(x_0) = Σ_{i=1}^n K((x_i − x_0)/h) y_i / Σ_{i=1}^n K((x_i − x_0)/h).   (D.2)

In general, for any centre point x_j, we have

    m̂_h^E(x_j) = Σ_{i=1}^n K*_{ji} y_i,                             (D.3)

where K*_{ji} = K_{ji}/K_j, K_{ji} = K((x_i − x_j)/h) and K_j = Σ_{i=1}^n K_{ji}.

The generalization of (D.1) is the local polynomial regression approach. For a polynomial of degree p at x_j, we minimize

    Σ_{i=1}^n [y_i − β_0 − β_1(x_i − x_j) − ··· − β_p(x_i − x_j)^p]² K((x_i − x_j)/h).   (D.4)

Consequently, the corresponding local p-th degree polynomial estimates are m̂^(p)(x_j) = p! β̂_p. In particular, when p = 0, this reduces to the N-W kernel estimator (D.2), also called the local constant regression estimator. The estimators for p = 1 and p = 2 are called the local linear regression estimator and the local quadratic estimator, respectively.
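The Nadaraya-Watson estimate (D.2) can be sketched in a few lines; the function and variable names below are ours, and the normal kernel with σ = 1 is an assumed choice.

```python
import math

# Minimal sketch of the N-W (local constant) estimate (D.2) with the
# normal kernel K(u) = exp(-u^2 / 2).
def nw_estimate(x0, xs, ys, h):
    """Kernel-weighted average of the y_i around x0 with bandwidth h."""
    weights = [math.exp(-((xi - x0) / h) ** 2 / 2.0) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Noise-free points on y = x^2: the local constant fit at x0 = 1 should
# land near the true value 1.0 (with a small upward bias from curvature).
xs = [i / 10.0 for i in range(21)]    # grid on [0, 2]
ys = [x ** 2 for x in xs]
fit = nw_estimate(1.0, xs, ys, h=0.2)
print(fit)                            # ~ 1.04
```

Shrinking h pushes the fit toward interpolation; growing h pushes it toward the global average, which is the smoothness/fit trade-off that the cross-validated bandwidth choice below addresses.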
D.1.1 Selection of bandwidth h by CV

As with the selection of the smoothing parameter for smoothing splines, we can choose the bandwidth h by the cross validation technique. For simplicity, consider the case of local constant regression. Based on the criterion (D.1), we have the kernel estimator (D.3) at x_j. Analogously to the usual mean-based CV, we have

    KECV_E(h) = (1/n) Σ_{j=1}^n (y_j − m̂_h^{E(−j)}(x_j))²,          (D.5)

where

    m̂_h^{E(−j)}(x_j) = Σ_{i=1, i≠j}^n [K*_{ji} / (1 − K*_{jj})] y_i.

So we can rewrite (D.5) as

    KECV_E(h) = (1/n) Σ_{j=1}^n [(y_j − m̂_h^E(x_j)) / (1 − K*_{jj})]².   (D.6)

Then we find the h_E* that minimizes KECV_E(h).

D.1.2 Median kernel smoother for curve estimation

Following the rule of systematically replacing the expectation-based operator with the median-based operator, we propose the median-based local constant regression estimator at x_j, which minimizes

    median_{1≤i≤n} [|y_i − m| ♦ K*_{ji}]                             (D.7)

over m, where the notation median_{1≤i≤n} [x_i ♦ p_i] means the weighted median of the x_i with p_i = p(x_i), the probability of x_i. In other words, median_{1≤i≤n} [x_i ♦ p_i] = x_(k), where k = inf{j : Σ_{i=1}^j p_(i) ≥ 0.5}, p_(i) = p(x_(i)), and x_(i) is the i-th smallest observation of x. Thus, we have

    m̂_h^M(x_j) = arg min_m median_{1≤i≤n} [|y_i − m| ♦ K*_{ji}]     (D.8)

and

    m̂_h^{M(−j)}(x_j) = arg min_m median_{1≤i≤n, i≠j} [|y_i − m| ♦ K*_{ji} / (1 − K*_{jj})].   (D.9)

Therefore, the corresponding median-based CV estimate is

    KMCV_M(h) = med_{1≤j≤n} |y_j − m̂_h^{M(−j)}(x_j)|,               (D.10)

where med_{1≤i≤n} x_i simply means the standard sample median of the sample x. Then we find the h_M* that minimizes KMCV_M(h) and use (D.8) with h_M* to predict y_j.
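The weighted median used in (D.7)-(D.9) can be computed directly from its definition; the helper below is our own illustration (the name `weighted_median` is not from the thesis), and it normalizes arbitrary non-negative weights so they sum to one.

```python
# Weighted median as defined for (D.7): sort the observations, accumulate
# the normalized weights p_(i), and return the order statistic x_(k) with
# k = inf{j : sum_{i<=j} p_(i) >= 0.5}.
def weighted_median(xs, ws):
    total = sum(ws)
    pairs = sorted(zip(xs, ws))         # order statistics x_(1) <= ... <= x_(n)
    cum = 0.0
    for x, w in pairs:
        cum += w / total                # running sum of p_(i)
        if cum >= 0.5:
            return x
    return pairs[-1][0]                 # fallback for floating-point round-off

# Equal weights reduce to the (lower) sample median; a dominant weight
# pulls the weighted median toward its observation.
print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))  # -> 3
print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 6]))  # -> 5
```

Minimizing (D.8) over m can then be done by scanning candidate values of m (e.g. the y_i themselves), since the weighted median of |y_i − m| is piecewise linear in m.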
Item Metadata

Title: Median loss analysis and its application to model selection
Creator: Yu, Chi Wai
Publisher: University of British Columbia
Date Issued: 2009
Degree: Doctor of Philosophy - PhD
Program: Statistics
Degree Grantor: University of British Columbia
DOI: 10.14288/1.0067112
URI: http://hdl.handle.net/2429/7110
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)