Globally Robust Inference for Simple Linear Regression Models with Repeated Median Slope Estimator

by

Md Jafar Ahmed Khan
M.Sc., University of Dhaka, 1992

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics)

We accept this thesis as conforming to the required standard

The University of British Columbia
September 2002
© Md Jafar Ahmed Khan, 2002

Abstract

Globally robust inference takes into account the potential bias of the point estimates (Adrover, Salibian-Barrera and Zamar, 2002). To construct robust confidence intervals for the simple linear regression slope, the authors selected the generalized median of slopes (GMS) as their point estimate, considering its good bias behavior and asymptotic normality. However, GMS has a breakdown point of only 0.25, its asymptotic normality is established under very restrictive conditions, and its bias bound is known only for symmetric carrier distributions. In this study, we propose the repeated median slope (RMS) estimate as an alternative choice. RMS has a breakdown point of 0.50, its asymptotic normality holds under mild assumptions, and the bias bound for RMS is known for general carrier distributions. The proposed method achieves roughly the same observed coverage levels while constructing shorter intervals than the GMS approach.
Contents

Abstract
Table of Contents
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Terminology
  1.3 Simple Linear Regression: Robust estimation of parameters
    1.3.1 Median of Pairwise Slopes (MPS)
    1.3.2 Generalized Median of Slopes (GMS)
    1.3.3 Repeated Median of Slopes (RMS)
    1.3.4 S-estimates
    1.3.5 MM-estimates
    1.3.6 τ-estimates
  1.4 Purpose of this study
  1.5 Organization of subsequent chapters
2 Globally Robust Inference
  2.1 Limitations of the classical confidence intervals
  2.2 Naive intervals: consequences of ignoring bias
  2.3 Robust inference for the location model
    2.3.1 Confidence intervals
    2.3.2 One-sided confidence bound
    2.3.3 P-values
    2.3.4 Estimation of bias bound
  2.4 Robust inference for the simple linear regression model
    2.4.1 Estimation of the bias bound
3 The RMS Approach for Robust Inference on Slope
  3.1 Reasons for the selection of RMS
  3.2 Maxbias of the RMS estimate
  3.3 Asymptotic properties of the RMS estimate
4 Application
  4.1 Motorola vs Market
  4.2 Volume vs Height of trees
5 A Monte Carlo Study
  5.1 Design of the study
  5.2 Numerical results
  5.3 Discussion
6 Conclusion
  6.1 Summary
  6.2 Further study
Bibliography

Acknowledgements

I would like to acknowledge, warmly and thankfully, the guidance and support of my supervisor, Dr. Ruben H. Zamar, throughout this study. I would like to thank Dr. Lang Wu for his valuable suggestions as the second reader of this thesis. Special thanks are due to Dr. Matias Salibian-Barrera, who kindly gave me some related codes to start with, and to Dr. Harry Joe, for his timely and valuable suggestions during the computation.

JAFAR AHMED KHAN
The University of British Columbia
September 2002

Chapter 1  Introduction

1.1 Motivation

As a major source of uncertainty, the sampling variability of an estimator plays an important role in statistical inference, particularly when the sample size is small. As the sample size increases, sampling variability becomes less and less important, because standard errors are usually of order $O(1/\sqrt{n})$ and tend to zero as the sample size $n$ tends to infinity.

This is not the case with the other major source of uncertainty: the bias of an estimator, which is caused by data contamination (e.g., outliers, asymmetric errors, and other departures from the model assumptions), gross errors, etc. Since biases are of order $O(1)$, we cannot reduce the bias of an estimator by increasing the sample size. Therefore, for large data sets, the uncertainty due to bias clearly dominates the uncertainty due to sampling variability.

To demonstrate the idea, we perform a simulation. We choose $N(0,1)$ as our central model and $N(10,1)$ as the contaminating distribution. We select three different sample sizes: 25, 100 and 400, and three different levels of contamination: 0%, 10% and 20%. For each combination of sample size and contamination level, we generate 1000 samples and calculate the median of each sample. The average of the 1000 sample medians (minus the true median, which is zero in this case) gives us the bias, and we also calculate the standard error in each case. A sketch of this simulation is given below, and Table 1.1 summarizes the results.
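The experiment can be sketched in a few lines of R (S-Plus-style) code. This is our own sketch, not the code used in the thesis; in particular, whether the number of contaminated observations per sample is fixed or binomial is our assumption.

```r
## Sketch of the simulation behind Table 1.1: bias and standard error of the
## sample median when a fraction eps of the data comes from N(10, 1).
set.seed(2002)  # any fixed seed; the thesis does not report one

bias_se_median <- function(n, eps, nrep = 1000) {
  meds <- replicate(nrep, {
    n_bad <- rbinom(1, n, eps)  # contaminated count (assumed binomial)
    median(c(rnorm(n - n_bad, 0, 1), rnorm(n_bad, 10, 1)))
  })
  c(bias = mean(meds), se = sd(meds))  # true median is 0
}

for (eps in c(0, 0.10, 0.20))
  for (n in c(25, 100, 400))
    print(round(c(eps = eps, n = n, bias_se_median(n, eps)), 4))
```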
Table 1.1: Bias (standard error) for different sample sizes and contamination levels.

Contamination level    n = 25             n = 100            n = 400
0%                     -0.0028 (0.2415)   -0.0039 (0.1238)   0.0028 (0.0610)
10%                    0.1355 (0.2853)    0.1435 (0.1383)    0.1381 (0.0679)
20%                    0.3278 (0.3390)    0.3188 (0.1645)    0.3194 (0.0831)

As expected, the standard error of the sample median decreases as $n$ increases while the bias remains almost unchanged. Also, the bias increases with the level of contamination. This shows the relative importance of bias as a source of uncertainty about an estimator.

With the availability of high-speed computers and automated data collection techniques, there are many applications in which data quantity is not a problem, and we should concentrate on data quality. Accordingly, we should focus more on the bias behavior of an estimator than on its standard error.

Classical inference considers sampling variability to be the only source of uncertainty, and does not address the issue of bias caused by contamination. With one or two exceptions, the few papers that address robust inference also ignore the possible bias of the point estimates. The consequences of this will be explained in Chapter 2 in the context of location and simple linear regression models. In our study, we will consider the bias-uncertainty of an estimator in addition to its standard error.

1.2 Terminology

To introduce some terminology, let us consider the parametric family $F_\theta$ and the $\varepsilon$-contamination neighborhood (Tukey, 1960),

$$\mathcal{F}_\varepsilon(F_\theta) = \{F : F = (1-\varepsilon)F_\theta + \varepsilon F^*,\ F^* \text{ an arbitrary distribution}\}, \qquad 0 < \varepsilon < 1/2. \quad (1.1)$$

According to this model, $100(1-\varepsilon)\%$ of the data comes from the distribution $F_\theta$, but the remaining $100\varepsilon\%$ may come from an unknown arbitrary distribution $F^*$. Let us consider estimates $T_n$ of $\theta$ which, under some mild regularity conditions, converge almost surely to the asymptotic functional $T(F)$. This functional is well defined on a set of distributions which includes the empirical distribution functions and the family $\mathcal{F}_\varepsilon(F_\theta)$. Due to the presence of outliers and other departures from the central parametric model, $T(F)$ is not necessarily equal to $\theta$ for all $F$ in $\mathcal{F}_\varepsilon(F_\theta)$. For this reason, we must consider the asymptotic bias,

$$b_T(F) = d(T(F), \theta),$$

where $d$ measures the distance between the true value of the parameter and the asymptotic value of the estimate.

Maxbias function

The robustness of an estimate $T$ can be assessed in terms of the maxbias function,

$$B_T(\varepsilon) = \sup_{F \in \mathcal{F}_\varepsilon(F_\theta)} b_T(F), \quad (1.2)$$

which represents the maximum possible perturbation of $T(F)$ when $F$ varies over the entire neighborhood. Huber (1964, 1981) introduced the concept of maximum asymptotic bias in the location model setup. Martin and Zamar (1989) and Martin, Yohai and Zamar (1989) defined the maxbias curve, which is a plot of $B_T(\varepsilon)$ against different values of $\varepsilon$, and derived such curves for robust location, scale and regression estimates. Typical maxbias functions are continuous and increase from zero to infinity.

Contamination sensitivity

Hampel denoted the supremum of the influence function by $\gamma^*$, and called it the gross-error sensitivity. The contamination sensitivity, introduced by He and Simpson (1993), is closely related to the gross-error sensitivity and is defined as

$$\gamma(T) = \frac{\partial}{\partial \varepsilon} B_T(\varepsilon)\Big|_{\varepsilon = 0}. \quad (1.3)$$

The authors showed that in general $\gamma^* \le \gamma$.

Breakdown point

The breakdown point (BP) of the estimate $T$ is the supremum of the domain of the maxbias function, i.e., the point at which the maxbias function diverges. Mathematically,

$$\varepsilon^* = \sup\{\varepsilon : B_T(\varepsilon) < \infty\}. \quad (1.4)$$
The breakdown point is the maximum fraction of contamination the estimate can tolerate before its value is completely determined by the contaminating data.

The study of the limiting behavior of $B_T(\varepsilon)$ near the ends of its domain is very informative. Hampel (1974) investigated the behavior of $B_T(\varepsilon)$ when $\varepsilon$ is small, focusing on the rate at which $B_T(\varepsilon)$ tends to zero as $\varepsilon$ tends to zero. Berrendero and Zamar (2001) studied the explosion rate of the bias function when $\varepsilon$ approaches the breakdown point.

Figure 1.1 shows the maxbias curve, breakdown point and contamination sensitivity of the median. [Figure 1.1: Maxbias curve, breakdown point and contamination sensitivity of the median; horizontal axis: $\varepsilon$.] The dotted vertical line at the BP (0.5 in this case) shows how the maxbias curve diverges when $\varepsilon$ approaches the BP. The slope of the dashed line is the contamination sensitivity, $\gamma$, of the median. Notice that the linear approximation is good for small values of $\varepsilon$ ($\varepsilon \le 0.10$).

Locally and globally robust estimates

An estimate whose maxbias grows slowly near zero is called locally robust, and an estimate whose maxbias is relatively small for large fractions of contamination is called globally robust. The local and global features of the maxbias function $B_T(\varepsilon)$ are summarized by the contamination sensitivity $\gamma(T)$ and the breakdown point $\varepsilon^*$. The contamination sensitivity provides an approximation for $B_T(\varepsilon)$ near zero, and thus measures the local robustness of the estimate $T$. We say that $T$ is locally robust if $\gamma(T)$ is finite.

He and Simpson (1993) showed that $\varepsilon^* \le 0.5$ for all affine-equivariant regression estimates. (Equivariance properties of regression estimates will be discussed in the next section.) If the estimate $T$ attains the maximal breakdown point 0.5, we say that $T$ is globally robust.

Bias bound

The bias bound for the estimate $T$, introduced by Berrendero and Zamar (2001), highlights the practical potential of maxbias curves. To fix ideas, let us consider the location model and the median functional $M(F)$. Suppose that we want to determine an upper bound for the absolute difference $D_M(F) = |M(F) - \theta|$. Huber (1964) showed that

$$\sup_{F \in \mathcal{F}_\varepsilon} D_M(F) = \sigma_0 B_M(\varepsilon), \qquad B_M(\varepsilon) = F_0^{-1}\left(\frac{1}{2(1-\varepsilon)}\right),$$

where $\sigma_0$ is the scale of the core distribution. Therefore, $D_M(F)$ is bounded by $\sigma_0 B_M(\varepsilon)$. In practice $\sigma_0$ is not usually known and has to be estimated by a robust scale functional $S(F)$, for example the MAD. However, the quantity $S(F)B_M(\varepsilon)$ may not be an upper bound for $D_M(F)$, because $S(F)$ may underestimate $\sigma_0$. For example, for the contaminated distribution $F = 0.90\,N(0,1) + 0.10\,\delta_{0.15}$ we have

$$M(F) = B_M(0.10) = 0.1397,$$

whereas $\mathrm{MAD}(F)\,B_M(0.10) = 0.8818 \times 0.1397 = 0.1232 < 0.1397$. A quantity $K_M(\varepsilon)$ such that $S(F)K_M(\varepsilon)$ is a bound for $D_M(F)$ is called the bias bound for $M(F)$. We will discuss the relation between maxbias functions and bias bounds for the location and the regression estimates in Chapter 2.
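The numbers in this example are easy to verify numerically. The following R sketch is our own code, not the thesis'; it assumes the point-mass location 0.15 and the MAD normalized to be consistent at the normal model, which together reproduce the quoted figures.

```r
## Check of the bias-bound example: F = 0.9 N(0,1) + 0.1 delta_{0.15}.
eps   <- 0.10
B_med <- qnorm(1 / (2 * (1 - eps)))                       # maxbias of median: 0.1397
cdf_F <- function(x) 0.9 * pnorm(x) + 0.1 * (x >= 0.15)   # cdf of contaminated F
m <- uniroot(function(x) cdf_F(x) - 0.5, c(-5, 5))$root   # M(F) = 0.1397

## Normalized MAD of F: solve P_F(|X - m| <= t) = 1/2 for t, then standardize.
half_mass <- function(t) 0.9 * (pnorm(m + t) - pnorm(m - t)) +
  0.1 * (abs(0.15 - m) <= t) - 0.5
mad_F <- uniroot(half_mass, c(0.02, 5))$root / qnorm(0.75)  # 0.8818
mad_F * B_med    # 0.1232 < 0.1397, so MAD(F) * B(0.10) is not a valid bound
```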
A confidence interval is called globally robust if it is stable (in the sense of keeping coverage at or above the nominal level) and informative (in the sense of keeping a reasonable average length), not only at the central model but also over the entire contamination neighborhood. We will discuss globally robust inference formally in Chapter 2. To construct a globally robust confidence interval, we have to use the bias bound of the point estimate in addition to its standard error.

In this thesis, we will study globally robust inference on the simple linear regression slope. For this, we have to select an appropriate point estimate. We will discuss some robust regression estimates in the following section.

1.3 Simple Linear Regression: Robust estimation of parameters

Let us consider the simple linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad 1 \le i \le n,$$

where $x_i$ and $\varepsilon_i$ are independent, $\varepsilon_i \sim F_0$, $\mathrm{Median}_{F_0}(\varepsilon_i) = 0$, $x_i \sim G_0$, and the pairs $(x_i, y_i)$ are independent. The joint distribution of $(x_i, y_i)$ under this model will be denoted by $H_0$. To allow for outliers and other departures from this central parametric model, we will assume that the joint distribution $H$ of $(x_i, y_i)$ belongs to the $\varepsilon$-contamination neighborhood

$$\mathcal{F}_\varepsilon(H_0) = \{H = (1-\varepsilon)H_0 + \varepsilon H^* : H^* \text{ any distribution on } \mathbb{R}^2\}, \qquad 0 < \varepsilon < 1/2. \quad (1.5)$$

Let $T = (T_1, T_2)$ be the intercept and slope functionals, defined on a large class of distributions $H$ on $\mathbb{R}^2$ which includes $\mathcal{F}_\varepsilon(H_0)$ and all the empirical distribution functions $H_n$. A desirable feature of the regression functional $T$ is Fisher-consistency, i.e.,

$$T_1(H_0) = \beta_0, \qquad T_2(H_0) = \beta_1.$$

The regression functional $T$ is also expected to have the following equivariance properties. Let $\mathcal{G}(x,y)$ denote the joint distribution of $(x,y)$.

Regression equivariance: $T_1(\mathcal{G}(x, y + a + bx)) = T_1(\mathcal{G}(x,y)) + a$ and $T_2(\mathcal{G}(x, y + a + bx)) = T_2(\mathcal{G}(x,y)) + b$, for $a, b \in \mathbb{R}$.

Affine equivariance: $T_1(\mathcal{G}(cx, y)) = T_1(\mathcal{G}(x,y))$ and $T_2(\mathcal{G}(cx, y)) = T_2(\mathcal{G}(x,y))/c$, for $c > 0$.

Scale equivariance: $T_1(\mathcal{G}(x, sy)) = s\,T_1(\mathcal{G}(x,y))$ and $T_2(\mathcal{G}(x, sy)) = s\,T_2(\mathcal{G}(x,y))$, for $s > 0$.

Location equivariance: $T_1(\mathcal{G}(x+a, y)) = T_1(\mathcal{G}(x,y)) - a\,T_2(\mathcal{G}(x,y))$ and $T_2(\mathcal{G}(x+a, y)) = T_2(\mathcal{G}(x,y))$, for $a \in \mathbb{R}$.

It is easy to show that the least squares (LS) estimates of the slope and the intercept satisfy the above properties, but they have breakdown point zero. The following robust estimates also satisfy the above equivariance requirements.

1.3.1 Median of Pairwise Slopes (MPS)

This estimator was studied by Theil (1950) and Sen (1968). We first consider the set $I = \{(i,j) : x_i \ne x_j\}$ and calculate the slopes corresponding to all pairs of observations $(x_i, y_i)$ and $(x_j, y_j)$. That is,

$$r(i,j) = \frac{y_j - y_i}{x_j - x_i}, \quad (1.6)$$

with $r(i,j) = 0$ if $x_i = x_j$. Then, we define

$$\hat{\beta}_n^{MPS} = \mathrm{Med}_I\, r(i,j). \quad (1.7)$$

The corresponding functional form of this estimator is

$$\beta^{MPS}(H) = \mathrm{Med}_{Q_H}\left(\frac{y_2 - y_1}{x_2 - x_1}\right),$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are independent with common joint distribution $H$, and $Q_H$ stands for the conditional distribution of the ratio given that the denominator is different from zero.

Like MPS, many other estimators of the slope parameter $\beta_1$ are based on the pairwise slopes $r(i,j)$. Most interestingly, the classical least squares estimator $\hat{\beta}^{LS}$ may be written as a weighted average of the pairwise slopes,

$$\hat{\beta}^{LS} = \frac{\sum_{i<j} w_{ij}\, r(i,j)}{\sum_{i<j} w_{ij}}, \quad (1.8)$$

with weights $w_{ij} = (x_i - x_j)^2$. Boscovich (1757) considered a data set with $n = 5$ observations, and computed the unweighted average of the 10 pairwise slopes, as well as a 10% trimmed mean given by the average of 8 of these slopes. Stigler (1986) may be consulted for a more complete historical discussion. Frees (1991) gave a survey of these and related estimators.

In general, given a slope estimate $\hat{\beta}(H)$, a median-based intercept estimate is defined by

$$\hat{\alpha}(H) = \mathrm{Med}_H(y - \hat{\beta}(H)x). \quad (1.9)$$

Thus, the corresponding regression line splits the plane into two halves containing each half of the data. Since the intercept is a location parameter for the difference $y - \hat{\beta}(H)x$, estimating it by the median seems appropriate from a bias point of view. A sketch of the MPS fit follows.
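The MPS slope (1.7) and the median-based intercept (1.9) are straightforward to compute. A minimal R sketch (function and variable names are ours):

```r
## Median of pairwise slopes (MPS), with the median-based intercept (1.9).
mps_fit <- function(x, y) {
  ij <- combn(length(x), 2)                     # all pairs i < j
  dx <- x[ij[2, ]] - x[ij[1, ]]
  dy <- y[ij[2, ]] - y[ij[1, ]]
  slope <- median(dy[dx != 0] / dx[dx != 0])    # pairs with x_i = x_j excluded
  c(intercept = median(y - slope * x), slope = slope)
}
```

The O(n^2) set of pairwise slopes makes this naive version practical only for moderate n.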
1.3.2 Generalized Median of Slopes (GMS)

The GMS estimate was first proposed by Brown and Mood (1951) and recently studied by Adrover and Zamar (2000). The GMS is defined as the solution $(\hat{\alpha}_n, \hat{\beta}_n)$ to the equations

$$\frac{1}{n}\sum_{i=1}^n \mathrm{sign}\left(y_i - \hat{\alpha}_n - \hat{\beta}_n(x_i - \hat{\mu}_n)\right)\,\mathrm{sign}(x_i - \hat{\mu}_n) = 0, \quad (1.10)$$

$$\frac{1}{n}\sum_{i=1}^n \mathrm{sign}\left(y_i - \hat{\alpha}_n - \hat{\beta}_n(x_i - \hat{\mu}_n)\right) = 0, \quad (1.11)$$

where $\hat{\mu}_n = \mathrm{Med}\{x_i\}$. The GMS estimates are defined by the following geometrical property: the slope and intercept estimates are such that the corresponding regression fit and the vertical line $x = \mathrm{Med}\{x_i\}$ split the plane into four quarters containing the same number of points.

Adrover and Zamar (2000) showed that the slope estimate $\hat{\beta}_n^{GMS}$ satisfies the fixed point equation

$$\hat{\beta}_n^{GMS} = \mathrm{Med}\left\{\frac{y_i - \mathrm{Med}\{y_j - \hat{\beta}_n^{GMS}(x_j - \hat{\mu}_n)\}}{x_i - \hat{\mu}_n}\right\}. \quad (1.12)$$

Notice that this is not really a closed-form formula, because the right-hand side of (1.12) also contains $\hat{\beta}_n^{GMS}$. A solution to this equation can be found with the following iterative algorithm proposed by Adrover and Zamar (2000). Let $\hat{\beta}_n^{(0)}$ be some initial slope estimate. Then

$$\hat{\alpha}_n^{(k+1)} = \mathrm{Med}\{y_i - \hat{\beta}_n^{(k)}(x_i - \hat{\mu}_n)\},$$
$$\hat{\beta}_n^{(k+1)} = \mathrm{Med}\{(y_i - \hat{\alpha}_n^{(k+1)})/(x_i - \hat{\mu}_n)\}.$$

This algorithm usually converges after a few iterations. However, in some cases it runs into a closed loop. The algorithm then finds the solution with a bisection procedure, which takes place after the closed loop is detected.

The GMS estimates are a natural generalization of the minimax-bias estimate of regression through the origin. Martin and Zamar (1989) found that in the case of regression through the origin, $y_i = \beta x_i + \varepsilon_i$, the MS estimate $\beta(H) = \mathrm{Med}_{Q_H}(y/x)$ minimizes the maxbias among all equivariant estimates. It can be verified that $\beta(H)$ is a generalized M-estimate (GM) that satisfies the equation

$$E_H\, \mathrm{sign}(y - bx)\,\mathrm{sign}(x) = 0.$$

For the simple linear regression model, the estimates $(\alpha(H), \beta(H))$ can be defined implicitly by the equations

$$E_H\, \mathrm{sign}(y - a - bx)\,\mathrm{sign}(x - m) = 0, \quad (1.13)$$
$$E_H\, \mathrm{sign}(y - a - bx) = 0, \quad (1.14)$$

where $m = \mathrm{Med}_H(x)$. The finite-sample versions are obtained by setting $H = H_n$, the empirical distribution function of the data. Clearly, (1.13) and (1.14) are particular cases of GM-estimates. It can easily be seen that the GMS estimates satisfy (1.13) and (1.14) and, therefore, they are a very special type of GM-estimate.

1.3.3 Repeated Median of Slopes (RMS)

Siegel (1982) defined the first slope estimate with breakdown point 0.5, by performing a different median-based operation over the ratios (1.6). In this case, we define the slope estimator as

$$\hat{\beta}_n^{RMS} = \mathrm{Med}_{1 \le i \le n}\ \mathrm{Med}_{j \in J_i}\ r(i,j),$$

where $J_i = \{j : (i,j) \in I\}$, for $1 \le i \le n$. The corresponding functional form of this estimate is as follows. First, let us define

$$q_M(a, b, H) = \mathrm{Med}_{Q_H}\left(\frac{y_1 - b}{x_1 - a}\right), \quad (1.15)$$

for fixed numbers $a$ and $b$. Then, the RMS estimator is defined as

$$\beta^{RMS}(H) = \mathrm{Med}_{Q_H^*}\left(q_M(x_2, y_2, H)\right), \quad (1.16)$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are independent with common joint distribution $H$, $Q_H$ stands for the conditional distribution of the ratio given that the denominator is different from zero, and $Q_H^*$ denotes the distribution of $q_M(x_2, y_2, H)$ under $H$.
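A minimal R sketch of the finite-sample RMS estimate (our own code; a naive O(n^2) implementation, although faster algorithms exist in the literature):

```r
## Siegel's repeated median slope: an inner median of pairwise slopes for
## each observation i, then an outer median over i; intercept as in (1.9).
rms_fit <- function(x, y) {
  inner <- sapply(seq_along(x), function(i) {
    ok <- x != x[i]                        # J_i: indices j with x_j != x_i
    median((y[ok] - y[i]) / (x[ok] - x[i]))
  })
  slope <- median(inner)
  c(intercept = median(y - slope * x), slope = slope)
}
```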
1.3.4 S-estimates

The regression estimates defined so far are median-based. Before defining the regression S-, MM- and τ-estimates, we need to define the M-estimates of scale. For this we may consider the parametric scale model

$$F_\sigma(x) = F_0(x/\sigma),$$

where $F_0$ is the distribution function of a random variable $X$ with a density function symmetric about zero. The M-estimates of scale were defined by Huber (1964) as the solution $S_n$ of

$$\frac{1}{n}\sum_{i=1}^n \rho\left(\frac{x_i}{S_n}\right) = b, \quad (1.17)$$

where $b = E_{F_0}\rho(X)$ and the score function $\rho$ satisfies the following assumptions:

(a) $\rho : \mathbb{R} \to \mathbb{R}$ is symmetric and nondecreasing on $[0, \infty)$.
(b) $\rho(0) = 0$, and $\rho(\infty) = 1$.
(c) $\rho$ has at most a finite number of discontinuities.

The regression S-estimates were defined by Rousseeuw and Yohai (1984) as

$$\hat{\beta} = \arg\min_t S_\rho(t, H),$$

where $S_\rho(t, H)$ is the M-scale of the absolute residuals $r(t) = |y - t^T x|$, in which $(y, x)$ have joint distribution $H$. Mathematically,

$$S_\rho(t, H) = \inf\left\{s > 0 : E_H\, \rho\left(\frac{r(t)}{s}\right) \le b\right\}.$$

1.3.5 MM-estimates

The regression MM-estimates were defined by Yohai (1987) as

$$\hat{\beta}_{MM} = \arg\min_t E_H\, \rho_2\left(\frac{r(t)}{s_1}\right),$$

where $s_1 = s_1(H) = \min_t S_1(t, H)$ is the scale of the absolute residuals corresponding to a regression S-estimate with score function $\rho_1$ satisfying (a)-(c) above, and $\rho_2$ is another such score function, chosen to attain a higher efficiency.

1.3.6 τ-estimates

The regression τ-estimates were defined by Yohai and Zamar (1988) as

$$\hat{\beta}_\tau = \arg\min_t \tau(t, H),$$

where

$$\tau^2(t, H) = S^2(t, H)\; E_H\, \rho_2\left(\frac{r(t)}{S(t, H)}\right),$$

and $S(t, H)$ is the M-scale defined before with $\rho = \rho_1$. These estimates can attain a high breakdown point, controlled by $\rho_1$, and a high efficiency, controlled by $\rho_2$.

The bias behavior of different robust estimates will be discussed in Chapter 3, with a view to selecting a 'good' point estimate for robust inference.

1.4 Purpose of this study

Adrover, Salibian-Barrera and Zamar (2002) developed the idea of globally robust inference for the location and the simple linear regression models. After showing the consequences of ignoring the asymptotic bias of the point estimate, they incorporated the bias bound in the construction of confidence intervals. For robust inference on the simple linear regression slope, the authors selected GMS as their point estimate, considering its good bias behavior and asymptotic normality. However, GMS has a breakdown point of only 0.25, and its asymptotic normality is established under very restrictive conditions. Also, the bias bound for the GMS estimate is known only for symmetric carrier distributions, limiting the applications of the method. In this study, we will consider the RMS as an alternative choice of point estimate for robust inference on the slope. RMS has a breakdown point of 0.50, its asymptotic normality holds under very general conditions, and the bias bound for RMS is known for general carrier distributions.

1.5 Organization of subsequent chapters

In Chapter 2, we will explain the idea of globally robust inference for the location and the simple linear regression models. In Chapter 3, we will justify the selection of the RMS estimate as an alternative to the GMS estimate for robust inference on the slope. In Chapter 4, we will apply the methods to two different datasets and compare the results. We will present and discuss the results of a Monte Carlo simulation in Chapter 5. In the final chapter, we will conclude by summarizing our study and pointing to some topics for future research.

Chapter 2  Globally Robust Inference

The vast majority of the robustness literature focuses on point estimation. There are a few papers that address robust inference, but they do not take into account the possible bias of the point estimates. One exception is Adrover, Salibian-Barrera and Zamar (2002).
Our discussion of globally robust inference will be based mainly on this paper.

In order to highlight the main ideas, let us consider the following location-scale model:

$$y = \theta + \sigma e. \quad (2.1)$$

Here, $\theta$ is an unknown location parameter, $\sigma$ is a nuisance scale parameter, and $e$ has a specified distribution $F_0$. Correspondingly, the distribution of $y$ is $F_\theta(y) = F_0((y - \theta)/\sigma)$. To allow for outliers and other departures from this model, we assume that the actual distribution $F$ belongs to the $\varepsilon$-contamination neighborhood

$$\mathcal{F}_\varepsilon(F_\theta) = \{F = (1-\varepsilon)F_\theta + \varepsilon F^* : F^* \text{ any arbitrary distribution}\}, \qquad 0 < \varepsilon < 1/2. \quad (2.2)$$

According to this model, the majority of the data comes from a central parametric model, but a minority may come from an unknown arbitrary distribution.

According to Adrover et al. (2002), a robust confidence interval should be stable and informative. The robust confidence interval should be stable in the sense of keeping a high coverage level (at or above the nominal level), not only at the central model but also over the contamination neighborhood. The interval should also be informative in the sense of keeping a reasonable average length over the entire neighborhood. These two properties are stated more precisely in the following definition by the authors.

Definition 2.1. A confidence interval $(L_n, U_n)$ for $\theta$ is called globally robust of level $(1-\alpha)$ if it satisfies the following conditions:

1. (Stable interval) The minimum asymptotic coverage over the $\varepsilon$-contamination neighborhood is $(1-\alpha)$:
$$\lim_{n\to\infty}\ \inf_{F \in \mathcal{F}_\varepsilon(F_\theta)} P_F(L_n \le \theta \le U_n) \ge (1-\alpha);$$

2. (Informative interval) The maximum asymptotic length of the interval is bounded over the $\varepsilon$-contamination neighborhood:
$$\lim_{n\to\infty}\ \sup_{F \in \mathcal{F}_\varepsilon(F_\theta)} [U_n - L_n] < \infty.$$

2.1 Limitations of the classical confidence intervals

It can easily be shown that the classical Student-t confidence intervals,

$$\bar{X}_n \pm t_{(n-1)}(1-\alpha/2)\, S_n/\sqrt{n},$$

fail Part 1 and Part 2 of Definition 2.1. Let us consider first the contaminated distribution

$$F_{x_0} = (1-\varepsilon)F_0 + \varepsilon F^*,$$

where $F^*$ is a point mass distribution at $x_0 > 0$ (taking $\theta = 0$ without loss of generality). Then

$$L_n = \bar{X}_n - t_{(n-1)}(1-\alpha/2)\, S_n/\sqrt{n} \longrightarrow \varepsilon x_0$$

and

$$U_n = \bar{X}_n + t_{(n-1)}(1-\alpha/2)\, S_n/\sqrt{n} \longrightarrow \varepsilon x_0$$

as $n$ tends to $\infty$. Therefore,

$$\lim_{n\to\infty}\ \inf_{F \in \mathcal{F}_\varepsilon(F_\theta)} P_F(L_n \le \theta \le U_n) \le \lim_{n\to\infty} P_{F_{x_0}}(L_n \le \theta \le U_n) = 0.$$

Thus, the classical intervals fail Part 1 of Definition 2.1. Let us now take $F^*$ to be a point mass distribution at $\pm x_0$ (equally weighted). Then $S_n^2 \to (1-\varepsilon)\sigma^2 + \varepsilon x_0^2 \ge \varepsilon x_0^2$, so that for every $n$

$$\sup_{F \in \mathcal{F}_\varepsilon(F_\theta)} [U_n - L_n] \ge \lim_{x_0 \to \infty} \frac{2\,t_{(n-1)}(1-\alpha/2)}{\sqrt{n}}\, x_0\sqrt{\varepsilon} = \infty,$$

failing Part 2 of the definition.

2.2 Naive intervals: consequences of ignoring bias

To robustify Student's t confidence intervals, it seems natural to replace $\bar{X}_n$ by a robust, asymptotically normal point estimate $T_n$, and $S_n/\sqrt{n}$ by a robust estimate of the standard error of $T_n$. We will call this the naive procedure. It can be shown that the naive confidence intervals satisfy Part 2 but not Part 1 of Definition 2.1. Consequently, the asymptotic coverage proportion of naive confidence intervals of any nominal level will invariably tend to zero for all $\varepsilon > 0$.

To illustrate this point, Adrover et al. (2002) conducted a Monte Carlo simulation in which they generated 10,000 normal samples of different sizes, containing various fractions of contamination. The contaminating distribution is a point mass at $x_0 = 4$. For each sample, the authors calculated the location M-estimate with Huber $\psi$-function

$$\psi(y) = \max\{-c,\ \min\{c, y\}\},$$

with truncation constant $c = 1.345$, and the corresponding 95% confidence intervals based on the empirical asymptotic variance. The following table summarizes the observed coverage levels and the average lengths of these intervals.

Table 2.1: Percentage of coverage and average length of naive CI for the location parameter.

ε      Sample size   % of coverage   Average length
0.05   20            92              0.91
       50            92              0.60
       100           88              0.44
       200           82              0.31
0.10   20            91              1.05
       50            84              0.68
       100           67              0.49
       200           39              0.35
0.15   20            88              1.19
       50            72              0.76
       100           35              0.56
       200           5               0.40
0.20   20            82              1.41
       50            45              0.92
       100           8               0.66
       200           0               0.47

The poor coverage levels in the above table are due to the asymptotic bias of the point estimate. Let $\hat{\theta}_n$ be a robust estimate of the location parameter, and $\theta(F)$ its asymptotic value under an asymmetric distribution $F$ belonging to the contamination neighborhood. Usually, $\theta(F)$ is different from the actual value of the parameter $\theta$, and the asymptotic bias remains the same even if the sample size increases. However, the standard errors of $\hat{\theta}_n$ are very small for large sample sizes, and most of the probability mass in the distribution of $\hat{\theta}_n$ concentrates on an interval that excludes $\theta$. Ironically, if the data are of uneven quality, large sample sizes are bad for naive confidence intervals. A sketch of this experiment is given below.
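A hedged R sketch of the experiment: we use MASS::huber() for the location M-estimate ($c = 1.345$, MAD scale) and the usual sandwich formula for the empirical asymptotic variance. The thesis' own implementation details (scale estimate, number of replicates, fixed vs. random contamination count) are assumptions here.

```r
## Sketch of the naive-interval experiment of Table 2.1: coverage of the
## Wald interval around the Huber M-estimate under point-mass contamination.
library(MASS)  # huber(): Huber location M-estimate with MAD scale

naive_coverage <- function(n, eps, nrep = 2000, c0 = 1.345) {
  psi <- function(u) pmax(-c0, pmin(c0, u))
  mean(replicate(nrep, {
    n_bad <- rbinom(1, n, eps)
    y <- c(rnorm(n - n_bad), rep(4, n_bad))      # contamination at x0 = 4
    fit <- huber(y, k = c0)
    u <- (y - fit$mu) / fit$s
    se <- fit$s * sqrt(mean(psi(u)^2)) / (mean(abs(u) < c0) * sqrt(n))
    abs(fit$mu) <= qnorm(0.975) * se             # true location is 0
  }))
}
naive_coverage(100, 0.10)   # well below 0.95; compare with Table 2.1
```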
For each sample, the authors calculated the location M-estimate with Huber V>-function ip(y) = min{-c, max{c, y}},  17  with truncation constant c = 1.345, and the corresponding 95% confidence intervals based on the empirical asymptotic variance. The following table summarizes the observed coverage levels and the average lengths of these intervals. Table 2.1: Percentage of coverage and average length of naive CI for location parameter e  Sample Size  % of coverage  Average length  0.05  20  92  0.91  50  92  0.60  100  88  0.44  200  82  0.31  20  91  1.05  50  84  0.68  100  67  0.49  200  39  0.35  20  88  1.19  50  72  0.76  100  35  0.56  200  5  0.40  20  82  1.41  50  45  0.92  100  8  0.66  200  0  0.47  0.10  0.15  0.20  The poor coverage levels in the above table are due to the asymptotic bias of the point estimate. Let 9 be a robust estimate of the location parameter, and 9(F) be its n  asymptotic value under an asymmetric distribution F belonging to the contamination neighborhood. Usually, 9(F) is different from the actual value of the parameter 9, and the asymptotic bias remains the same even if the sample size increases. However, the 18  standard errors of 9 are very small for large sample sizes, and most of the probability n  mass in the distribution of 9 concentrates on an interval that excludes 9. Ironically, if n  the data are of uneven quality, large sample sizes are bad for naive confidence intervals. The problem is more severe in the case of simple linear regression. For the model Di = A) + 0\Xi + £i, the classical 100(1 — a)% confidence intervals for B are of the form x  Bx ±z /2 A  SE(/3i) where fii is the least squares estimate for B and SE(/3i) is the estimated x  standard error of B . For robustifying these confidence intervals, the naive approach is x  to replace B by a robust point estimate, Bf-, and x  SE(/?i) by a robust  estimate of the  standard error of B^. To show that the naive confidence intervals are not "stable" (do not satisfy Part 1 of Definition 1), Adrover et al. generated 600 samples (XJ, Ui) of sizes n = 20, 40, 60, 80 and 100 from contaminated normal distributions (l-e)TV(O,1)+eN (n, T I) 2  with pj =  (fjbxjfjLy),  r = 0.1, \x = 3 and fx = 1.5 (2.0) for e = 0.05 (0.10). The authors x  y  refer to this case as the "mild contamination case". In the "strong contamination case" they took \x = 5 and \x = 2.5 for e = 0.05 and e = 0.10. They calculated the high x  y  breakdown point regression MM-estimates (Yohai, 1987) and their asymptotic standard errors. The nominal confidence level in each case is 0.95.  The authors reported the  coverage for the slope parameter, which is reproduced in the following table. Clearly, the coverage levels shown in the table are much less than the nominal level, specially for larger sample sizes. As in the case of the location model, the poor coverage levels are due to the asymptotic bias of the point estimate B^ introduced by the contamination in the data. In general, the standard error of Bf- is of order 1 /^fn while its bias does not vanish as n goes to infinity. Therefore, for (globally) robust inference, we have to consider the asymptotic bias of the selected estimator, as well as its standard error. In the following two sections, we will discuss robust inference for the location and the simple linear regression models.  
2.3 Robust inference for the location model

First, we should know how to incorporate the asymptotic bias (bias bound) of the point estimate into the construction of confidence intervals, one-sided confidence bounds, and p-values. Then, we need to know how to estimate the bias bound of the estimate.

2.3.1 Confidence intervals

Suppose that we have a robust point estimate $\hat{\theta}_n$ for the parameter $\theta$ which satisfies

$$\sqrt{n}\,(\hat{\theta}_n - \theta(F)) \xrightarrow{d} N(0, v^2(F)), \quad (2.3)$$

where $|\theta(F) - \theta| \le \bar{\theta}$ and $\bar{\theta}$ is an upper bound for the bias of $\theta(F)$, as defined in Berrendero and Zamar (2001).

For any fixed $F \in \mathcal{F}_\varepsilon(F_\theta)$, an asymptotic confidence interval of level $(1-\alpha)$ is given by $(\hat{\theta}_n - l_n,\ \hat{\theta}_n + r_n)$, where $l_n = l_n(F)$ and $r_n = r_n(F)$ satisfy the equation

$$P_F(-r_n \le \hat{\theta}_n - \theta \le l_n) = 1 - \alpha. \quad (2.4)$$

We can write

$$P_F(-r_n \le \hat{\theta}_n - \theta(F) + \theta(F) - \theta \le l_n) = P_F\left(\frac{-r_n - b}{\hat{v}_n} \le \frac{\hat{\theta}_n - \theta(F)}{\hat{v}_n} \le \frac{l_n - b}{\hat{v}_n}\right) = 1 - \alpha, \quad (2.5)$$

where $\sqrt{n}\,\hat{v}_n$ is a consistent estimate of $v(F)$ for all $F \in \mathcal{F}_\varepsilon(F_\theta)$, and $b = b(F) = \theta(F) - \theta$.

From (2.3), the normal distribution can be used to approximate the middle probability in (2.5). Hence, we can obtain estimates $\hat{l}_n$ and $\hat{r}_n$ by solving the equation

$$\Phi\left(\frac{l_n - b}{\hat{v}_n}\right) + \Phi\left(\frac{r_n + b}{\hat{v}_n}\right) - 1 = 1 - \alpha, \quad (2.6)$$

where $\Phi$ is the standard normal cumulative distribution function. Since $F \in \mathcal{F}_\varepsilon(F_\theta)$ is unspecified, the bias $b = b(F)$ in (2.6) is unknown and cannot be estimated from the data. Therefore, the endpoints $\hat{l}_n$ and $\hat{r}_n$ may be found in such a way that (2.6) holds for every admissible value of $b$. The coverage of the confidence intervals obtained in this way will be $1-\alpha$ not only at the central model, but also over the entire neighborhood. The endpoint $\hat{l}_n$ may be expressed as a function of $\hat{r}_n$ by using (2.6):

$$\hat{l}_n = \hat{v}_n\,\Phi^{-1}\left(2 - \alpha - \Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)\right) + b,$$

and $\hat{r}_n$ may be chosen in order to minimize the resulting interval length:

$$\hat{l}_n + \hat{r}_n = \hat{v}_n\,\Phi^{-1}\left(2 - \alpha - \Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)\right) + b + \hat{r}_n. \quad (2.7)$$

Differentiating the right-hand side of (2.7) with respect to $\hat{r}_n$ and setting the derivative equal to zero, we have

$$-\frac{\varphi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)}{\varphi\left(\Phi^{-1}\left(2 - \alpha - \Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)\right)\right)} + 1 = 0,$$

where $\varphi$ is the standard normal density function. That is,

$$\varphi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right) = \varphi\left(\Phi^{-1}\left(2 - \alpha - \Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)\right)\right).$$

From this equation it follows that

$$\frac{\hat{r}_n + b}{\hat{v}_n} = \pm\,\Phi^{-1}\left(2 - \alpha - \Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right)\right).$$

The minus-sign case is discarded because it can only be satisfied with $\alpha = 1$. From the other case we obtain

$$\Phi\left(\frac{\hat{r}_n + b}{\hat{v}_n}\right) = 1 - \frac{\alpha}{2}, \quad (2.8)$$

which yields the solution

$$\hat{r}_n = \hat{v}_n\,\Phi^{-1}(1 - \alpha/2) - b = \hat{v}_n z_{\alpha/2} - b, \qquad \hat{l}_n = \hat{v}_n z_{\alpha/2} + b.$$

The corresponding confidence interval for $\theta$, of level greater than or equal to $(1-\alpha)$ for fixed $F \in \mathcal{F}_\varepsilon(F_\theta)$, is

$$I_n(F) = \left(\hat{\theta}_n - \hat{v}_n z_{\alpha/2} - b,\ \hat{\theta}_n + \hat{v}_n z_{\alpha/2} - b\right).$$

Since $I_n(F)$ still depends on the unknown $F$ through $b$, a robust interval of level $(1-\alpha)$ can be constructed as follows:

$$I_n = \bigcup_{F \in \mathcal{F}_\varepsilon(F_\theta)} I_n(F). \quad (2.9)$$
The robust confidence intervals have positive asymptotic lengths, which is the price we have to pay for robust coverage.  A n alternative approach  An alternative robust interval can be constructed following Fraiman, Yohai and Zamar (2001). The main idea is to find q such that n  P (\9 -9\<q ) F  n  = l-a.  n  Due to the asymptotic normality of 9 , we can estimate q by n  n  * ( ^ ) + * ( * d ± ) - l  = l- . a  (2.10)  Since q is a monotone function of the bias, and its largest value is obtained by replacing n  b by 9, we have  (hzl) *(k±l)-  9  +  \  1  v  n  J  =  \  J  v  n  which gives the robust confidence interval  -a,  1  9 ±q . n  n  The robust interval (2.9) is easier to compute, but slightly longer than that in (2.11).  23  2.3.2  One sided confidence b o u n d  In this section we will discuss robust lower bounds for the parameter 9. Robust upper bounds can be constructed in a similar way. For any fixed F G F (Fg), €  an asymptotic  lower bound of level (1 — a) for 9 is given by (0n-*n,Oo)  (2.12)  where l = l (F) satisfies the equation n  n  PF (e -6<  l) = l - a .  n  (2.13)  n  Similarly to (2.5), and recalling that b = b(F) = 9(F) — 9, we have that l must satisfy n  P  F  ( 6 n - 0 ( F )  <  -  k  1  \  l  _  a  and, as before, for large n, ln(F) = v <S>- (l-a)+b. 1  n  (2.15)  A robust lower bound of level 1 — a can now be defined as (L*oo)=  |J  (c? -/ (F),oo). n  n  (2.16)  F£T (F ) E  G  We have, (0„-Z (F),oo) = ( ^ - ^ ^ ( l - ^ - ^ o o ) . n  (2.17)  Therefore, (Ln,oo)  = (§  n  - vz n  a  - 6, +oo) .  Unlike in the confidence interval case, if we use a construction similar to the one in Fraiman, Yohai and Zamar (2001), we obtain the same lower robust confidence bounds.  24  2.3.3  P-values  Let us start by considering the following artificial hypothesis testing problem. Let us assume that the contaminating distribution H in F = (1 — e)Fg + eH is known, that is, that F belongs to the following "translated" family F {F , e  H) = {F = (1 - e)F + eH, fixed H}  e  (2.18)  e  Suppose that we are interested in testing H  :9<0  0  for F e F (F ,H). e  versus  O  H :9>9 x  (2.19)  0  Using the lower bounds in (2.17), let us define 9 (a, F) = 9  g  n  n  -  u $ ( l — a) — b(F), and a test for (2.19) is given by the rejection rule: _ 1  n  Reject H if  9 (a, F) > 9 .  0  n  0  This is a size a test since sup P „ 0<e {l  [9 (a,  e)Fg+eH  (1 - e)F + eH) > 9 ) '  n  e  v  o  =  P  { 1  -  0  £ ) F e o +  (dn(a, (1 - e)F  eH  9o  + eH) > 6>) = a. 0  In other words, for the fixed family (2.18) the p-value is given by p (F) n  = inf { « : § (a, F) > # j • n  We can calculate p (F) explicitly because 9 (a,F) n  satisfies 9 (p (F),F) n  n  (2.20)  0  is increasing in a and hence p (F)  n  n  = 9 , or equivalently, 0  Pn(F) = 1 - $ ^ " Similarly, for the case H  0  : 9 > 9 versus H 0  Y  ~ ) •  (2.21)  6  : 9 < 9 , we have p {F) = $ 0  n  Let us consider the 2-sided hypothesis H  0  : 9 = 9 versus H 0  x  25  : 9^ 0. O  ^ ~^°~ )n  6  We construct the p-value for this case based on q (a,F)  defined by the equation  n  The rejection rule would be of the form: reject H if \9 — 6 \ > q (a., F). The p-value is 0  n  0  n  p„(F) = inf{a : q {a, F) < \9 - 9 \} . n  n  0  Here, q (a, F) is a decreasing function of a. Therefore p (F) solves the equation n  n  q (p (F),F) n  n  = \9 -9 \. n  (2.23)  0  From (2.22) it is easy to see that the solution g(x) to the equation q (g(x), F) = x is n  given by  We obtain  ^ ) - - * ( ^ ) - * ( ^ ) . 
The p-values (2.21) and (2.25) were obtained for F G T (F ,H). €  8  We can then define a  robust p-value for all F G J- (Fg) as €  p*=  sup  p (F). n  (2.26)  It is easy to see that p (F) is a monotone function of b = b(F) and hence, if the bias n  bound 9 is sharp, then the robust p-values are as shown in Table 2.3. The rejection rules associated with the p-values in (2.26) have robust Type I Errors at the expense of lower power. In other words, to guarantee a level-a; test with uncertainty in our model we lose power, in particular for values of the parameter near the null hypothesis. Lower power is a reasonable price to pay to achieve a robust rejection rule.  26  Table 2.3: Robust p-values for location-scale problems. Hypothesis  Robust p-values  e>6  #1 :  0  H : 6<e x  Hi 2.3.4  0  :  $ ^L-eo+e)  9^9  0  E s t i m a t i o n o f bias b o u n d  Let T be any location estimate with asymptotic value T(F). The asymptotic bias is n  defined in the following invariant way: b(T,F)  = \T(F)-9\/o,  (2.27)  where o is the true error scale parameter. The maximum asymptotic bias was defined earlier as B(e) = sup b(T,F).  (2.28)  Clearly, \T(F) -6\<  for all F e F .  oB{e)  e  Unfortunately, o is unknown. Let a(F) be the limiting value of the scale estimate a . n  Then, a more useful bias bound would be B(e) = sup b(T,F),  (2.29)  with b(T F) -  l  T  {  F  )  -  6 1  This equation is more useful than (2.28) because we have \T(F) -0\<  a(F)B(e) 27  for all  F € F, t  and, for large n, if we replace d{F) by a(F ) the above relationship still holds approxin  mately. However, B(e) is unknown and its theoretical derivation appears to be difficult. A numerical approximation restricting the supremum to point mass contamination would be feasible. However, for the construction of confidence intervals we can choose the following simpler approach (Berrendero and Zamar, 2001). By replacing a by cr(F) we do not obtain an upper bound due to the possible underestimation of the scale. An estimated bias bound 9 is given by n  6 = ks B{e), n  (2.30)  n  with k = sup a(F)  s-(c)'  FeTe  s = shorth (?/;), n  and .-(<) = inf F<aT a e  The shorth is the standardized length of the shortest half of the sorted data. For the derivation of s~, Lemma 2.1 (at the end of this chapter) can be used.  2.4  Robust inference for the simple linear regression model  We will discuss the confidence intervals for the slope parameter in the simple linear regression model, Vi = A) + Pi(xi  -  where fi is a location parameter, Xj and  fi) +£i,  l < i < n ,  are independent, Zi ~ F , MedianF { I) — £  0  0  0, Xi ~ Go and (xi,yi) are independent. The joint distribution of (xi,yi) under this 28  r  model will be denoted by H . 0  To allow for outliers and other departures from this  central parametric model we will assume that the joint distribution H of (xi,yi) belongs to the e-contaminated neighborhood F (H ) e  0  = {H = (1 - e)H + eH* : H* any distribution on K },  0 < e < 1/2.  2  0  The construction of the robust confidence intervals for the slope follows along the lines of the construction of location confidence intervals discussed earlier. Let us assume that (3i is an asymptotically normal and robust point estimate for B  l}  V^{Pi- Pi{H)) ^  N{^v\H)),  and Pi has a sharp and known bias bound p. We need to find q such that n  P (\Pi-Pi(H)\<q ) H  = l-a.  n  Due to the asymptotic normality of Pi, we can estimate q by n  JhW^) (Mm±A). +i  = 1  1  . .  .  
a  (2 31)  Since q is a monotone function of the bias, and its largest value is obtained by replacing n  b by p, we have  which gives the robust confidence interval Pi ± q . n  2.4.1  Let Ti  t7l  E s t i m a t i o n o f t h e bias b o u n d  be estimate for the slope parameter with asymptotic value Ti(H).  The invariant  asymptotic bias is defined in the following way (Adrover and Zamar, 2000): b(Ti(H))  =  a \Ti(H)-Pi\/a , x  29  £  where o and o are the residual and explanatory variable scales under the central model £  x  HQ. These biases are invariant under affine transformations of the data. The maximum asymptotic bias is given by B (e)=  sup  1  b(Ti,H).  Finally, the bias bound Pi is defined by sup |Ti(tf) - A l < sup ^4TS  = Pi-  As before, following Berrendero and Zamar (2001) an estimate of the bias bound is given by  Pm = ^  k(e) Bi(e) = ^ 1 {{e), B  @x,n  where k(e) = s (e)/s~(e),  and we can use  +  a  £tJl  (2.33)  @x,n  = shorth^ - a - J3 Xi) and n  n  o  XyTl  = shorth^)  to estimate a (H) and o (H) (Rousseeuw and Leroy, 1987). The shorth is bias-minimax E  X  in the class of M-estimates of scale with general location (Martin and Zamar, 1993). We need to determine s (e) and s~(e). Martin and Zamar (1993) showed that +  l B t  _^_  =  HeK a (H) X  _L_  =  $  _i ^3-2 ^ £  s+(e)  For s~(e), we can use the following lemma (Adrover et ai, 2002). Lemma 2.1 If (x, y) ~ H € T , then, t  o  inf Her  t  where o (H) £  £  a  E  ( E ) _ ^ [ 0 ) _ $ - i(f)  = shorth (y — a(H) — P(H)x).  case by taking P{H) = 0.  30  A similar  1 s"(e)' result follows for the location  The proof of the above lemma is not included here. To obtain relatively short robust confidence intervals we need to use point estimates with small bias bounds. Adrover, Salibian-Barrera and Zamar (2002) used the GMS estimate for robust inference on the simple linear regression slope. As the simulation study results obtained by the authors suggest, the observed coverage levels of the robust confidence intervals are very satisfactory, and they constitute a major improvement when compared to those of the naive approach. However, the GMS estimate has a breakdown point of only 0.25, its asymptotic normality is established under very restrictive conditions, and its bias bound is known only for symmetric carrier distributions. Therefore, a better point estimate is needed. In the next chapter, we will look for the best possible point estimate for globally robust inference on the simple linear regression slope.  31  Chapter 3 The RMS Approach for Robust Inference on Slope In Chapter 2 we discussed globally robust inference for the location and the simple linear regression models. We explained how the bias bound of an estimate should be used in addition to its standard error to construct a robust confidence interval for the slope. The selection of an appropriate point estimate for the slope parameter is now an issue. In classical inference, sampling variability of an estimate is considered to be the only source of uncertainty and, therefore, an estimator with a smaller standard error is preferred so that we can construct relatively short confidence intervals of practical relevance.  Of course, the normality or asymptotic normality of an estimate is often  considered for theoretical reasons. 
In robust inference, on the other hand, we consider the bias of an estimator to be at least as important as its standard error (we have shown earlier that the bias of an estimator remains the same while its standard error vanishes as n goes to infinity). Therefore, to construct robust confidence intervals, we should prefer an estimate with a smaller bias 32  bound. We should also consider the asymptotic normality of the estimator because the theory on globally robust inference developed so far is based on this assumption.  3.1  Reasons for the selection of R M S  Most of the available maxbias functions for the slope estimates have been derived assuming that the intercept parameter is known. As an exception, Hennig (1995) derived the maxbiases of MM- and r-estimates of the intercept and the slope parameters. However, these estimators have very large bias bounds even in the case of known intercepts. The least median of squares (LMS) estimate (Rousseeuw, 1984) has the smallest maxbias in the class of residual admissible estimates with known intercepts. However, the class of residual admissible estimates has maxbiases larger than the maxbiases of the three median-based estimates MPS, GMS and RMS. In fact, for all residual admissible estimators, the maxbias function can be expressed as follows: B(e) ^ K ^ e +  O^e),  whereas, the maxbias function for those three median-based estimators can be expressed as B(e) ^K e 2  + 0(e),  where K\ and K are constants (Martin, Yohai & Zamar, 1989, and Berrendero & Zamar, 2  2001). Since 0 < e < 1, we have e < y/e, and the three median-based estimates have maxbiases considerably smaller. Therefore, for globally robust inference, it is reasonable to select one of these three estimates. Table 3.1 below (extracted from Adrover and Zamar, 2000) displays the maxbiases of these median-based regression estimates assuming normally distributed explanatory variable and regression error under the central model. The values in the second column 33  (labeled MS) are the lowest maximum bias attainable in the class of affine and regression equivariant estimates, when the intercept is equal to zero (Martin et a/., 1989). The last column (labeled LMS) gives the lowest maximum bias attainable in the class of residual admissible estimates when the intercept is equal to zero (Yohai and Zamar, 1993). Table 3.1: Maxbiases for several median-based estimates. MS  GMS  MPS  LMS  RMS  e  Slope  Intercept  Slope  Intercept  Slope  Intercept  Slope  Slope  0.010  0.014  0.013  0.016  0.013  0.032  0.013  0.019  0.220  0.025  0.039  0.032  0.041  0.032  0.082  0.032  0.046  0.357  0.050  0.081  0.066  0.088  0.067  0.171  0.066  0.096  0.528  0.100  0.174  0.143  0.201  0.150  0.386  0.142  0.198  0.826  0.150  0.282  0.237  0.361  0.251  0.689  0.235  0.339  1.140  0.200  0.411  0.378  0.639  0.502  1.219  0.357  0.505  1.515  0.240  0.538  0.667  1.299  0.993  2.227  0.489  0.668  1.898  0.250  0.574  oo  oo  1.259  2.747  0.563  0.730  1.999  0.300  0.792  oo  oo  oo  oo  0.817  1.042  2.739  0.350  1.120  oo  oo  oo  oo  1.367  1.564  3.960  The maxbiases of MPS are larger than those of GMS and RMS. While GMS has slightly smaller biases than RMS for e < 0.05, RMS has smaller biases than GMS for e > 0.10.  Adrover, Salibian-Barrera and Zamar (2002) selected GMS as their point  estimate for robust inference on slope for the following reasons:  • GMS shows a good bias performance. 
• The asymptotic normality of the GMS estimate can be proved under general conditions (allowing for most distributions in the contamination neighborhood)  34  We will now discuss some limitations of the GMS approach, and some advantages of RMS over GMS as a point estimate for robust inference.  Some l i m i t a t i o n s o f t h e G M S approach  The following are some problems that we identified in the GMS approach:  • To prove the consistency and the asymptotic normality of the GMS estimates, one of the regularity conditions used by Adrover, Salibian-Barrera and Zamar (2002) is that the carrier distribution Go is symmetric. In practice, this condition may not be satisfied. • The bias bound of the GMS estimate (Adrover and Zamar, 2000) is valid only when the carrier distribution Go is symmetric. • GMS has a breakdown point (e*) of 0.25, while other median-based estimates have larger breakdown points.  Advantages of t h e R M S approach  The RMS method may be preferred to the GMS approach for the following reasons:  • The regularity conditions for the asymptotic normality of the RMS estimate (to be discussed later) are more general than those of the GMS estimate. • The bias bound of the RMS estimate (Adrover and Zamar, 2000) is valid without the symmetry assumption of the carrier distribution Go• The breakdown point of RMS is 0.50, the maximum for all affine equivariant regression estimates. 35  • Though for small fractions of contamination (e < 0.05) GMS has smaller biases, RMS is a very close competitor. And, for large fractions of contamination RMS has smaller biases. In our opinion, the overall bias performance of RMS is better than that of GMS. • RMS is both locally and globally robust.  Based on the above considerations, we decided to use RMS for the construction of robust confidence intervals, and the calculation of robust p-values. We will now discuss the bias behavior and the asymptotic properties of RMS.  3.2  Maxbias of the R M S estimate  Adrover and Zamar (2000) derived the maxbias of RMS. We will start by showing (Huber, 1981) that the maximum bias of the median functional over the e-contamination neighborhood of a general distribution function G (not necessarily symmetric) is attained by 0  placing a point mass contamination at plus or minus infinity. Therefore, the maxbias of the median is given by m = m(e) = max  To derive the maxbias of the RMS-slope, we need some notation. q {a,b,H) M  and f3 (H) RMS  Let us take  as in the definition of RMS (Chapter 1). In addition, given a  general univariate distribution function F, let us define the quantiles and  (3.2)  and  (3.3)  Finally, let us consider  36  and Q (H)  = q (g (Q (x ,y ,H)))  L  L  Theorem 3.1  H  L  2  and  2  Qu(H) =  q  {G {Qu{x ,y ,H))).  u  H  2  (3.4)  2  (Maxbias of RMS-slope) Suppose that Fo is a symmetric distribution  with  unimodal density function fo. Then, the maxbias of RMS slope estimate is B« (e)  = max {\Q (H )\,  MS  L  \Qu(H )\}  Q  ,  Q  where Qi and Qu are given by (3.4)-  Proof: For all H in T , fixed number t and function g, we have £  (1 - e)P  (g(x  Ho  y ) < t) < P  u  x  H  In particular, by taking g(xi,y ) x  (1 - e)P  Ho  (^—-  <t)  <P„  (g( ,  )  Xl  <t) + e.  Vl  (3.5)  = (y\ — b)/[x\ — a) we get  < () < (1 - ) P  (^—-  £  —  H)  < qM(a, b, H) < Qu(a, b,  )  )  Xl  Ho  \Xi  a  (g( ,  < t) < (1 - e)P  Vl  \xi — a  Ho  (*zl< ) , t  (3.6)  +e  \Xi — a  J  )  and Qi(a,  6,  0  for all  H ), 0  in T and all a, b.  
H  t  Since the median is a monotone operator, we also have Med g* Q {x , y , H ) < Med Q* q {x , y , H) < Med Q* Qu{x , y , H  L  2  2  0  H  M  2  2  H  2  2  for all Moreover, using (3.7) and taking g(x ,y ) 2  QL(HO)  < Med g* Q {x ,y ,H ) H  L  2  2  0  2  = Qi{x ,y ,H) 2  2  2  0  in  T. E  (3.7)  in (3.5) we obtain  2  and Med Q* Qu{x ,y ,H ) H  H  H ),  < Qu(H )  0  0  for all H in  and, therefore, QL(HO)  <P  RMS  < Qu{H ) 0  for all  H  in  !F . £  The theorem follows now because the upper and lower bounds above are attained by taking limit over a sequence of contaminated distributions. 37  3.3  Asymptotic properties of the R M S estimate  Siegel (1982) showed that when all X; are distinct (an event with probability 1 if G is continuous), the RMS estimate (3 has afinite-samplebreakdown point e* = [n/2]/n, that n  is, if fewer than [n/2] vectors Zj are changed, the estimate remains bounded. This yields an asymptotic breakdown point of 0.5. Siegel also showed that (3 is a Fisher-consistent n  estimate of 8. Hossjer, Rousseeuw and Croux (1994) established the asymptotic normality of the RMS slope estimate. The authors assumed the following regularity conditions: (F) The error distribution F is absolutely continuous, F ( 0 . 5 ) = 0, and the density _1  / is bounded  (||/||oo  < co) and strictly positive.  (G) The distribution G of the carriers is continuous, G (0.5) = 0, and G has a _1  positive and continuous density g around 0 with g(0) > 0.  Theorem 3.2 Let us consider the simple linear regression model yi = a + f3xi + ei, i = 1,... ,n, with the error and carrier distributions spectively.  satisfying  conditions (F) and (G) re;  Then  Vn~0  n  1 " - B) = -= ^ Y  lF(x  u  ) + O (l)  Vi  N(0, o ) 2  p  asn-^oo,  (3.8)  i=l  where W(x v) = " ' 2f(0)E \X\ S i S n { x y  I  H  X  V  P  (3 9) ^  )  )  G  and  a  = 2  m  E  a  \ X [  38  (  3  1  0  )  The authors proved this theorem through a series of lemmas. The proof is extremely long, non-trivial and involved, and is not included here. Interestingly, the asymptotic variance for RMS (formula 3.10) is the same as the asymptotic variance for GMS derived by Adrover, Salibian-Barrera and Zamar (2002). The difference is that, the regularity conditions for the GMS variance are very restrictive while the RMS variance is derived under more general conditions. One problem with the asymptotic normality of the RMS estimate is that the convergence to the asymptotic behavior is extremely slow. In order to check whether the asymptotic variance of the RMS slope estimate provides a good approximation to the to its variance at finite samples, Hossjer et al. (1994) carried out a Monte Carlo experiment. They considered both G and F to be equal to the standard Gaussian distribution, for which we get the asymptotic variance 7r /4 as 2.47. For each n in Table 3.1 the authors 2  generated m = 10,000 samples of size n, computed the corresponding slope estimates i3 for k = 1, 2, . . . , m, and obtained the n-fold variance, k)  n  nVar (/3«), FC  which should converge to 2.47 as n tends to oo. For n < 40 the n-fold variances are decreasing with n. After that, for n up to about 1000, the n-fold variances stay around 1.65, after which they slowly increase. For n around 40,000, we get a value of 1.86, which is still much less than 2.47. Unfortunately, the approximation (to the finite sample variance) provided by the asymptotic variance of the RMS estimate is not very satisfactory. 
Therefore, in addition to this asymptotic variance formula, we will use the bootstrap distribution of RMS to estimate its finite sample variability in the Monte Carlo simulation in Chapter 5.  39  Table 3.2: Simulation results of RMS estimate n-fold variance  n  10  2.62  20  1.88  40  1.67  60  1.67  100  1.63  200  1.63  300  1.66  500  1.64  800  1.62  1000  1.67  2000  1.83  3000  1.80  5000  1.82  10000  1.75  20000  1.85  40000  1.86  oo  2.47  40  Chapter 4 Application In Chapter 3, we proposed the RMS method as an alternative to the GMS method for globally robust inference on simple linear regression slope.  We will now apply these  two methods, along with the classical (LS) approach and a naively robustified approach (using MM), to two real datasets.  4.1  Motorola vs Market  These data were published in Berndt (1994). The Motorola data include ten years of monthly returns of Motorola shares over the time period January 1978 to December 1987 (for 120 months).  The Market data include value-weighted composite monthly  market returns based on transactions of the New York Stock Exchange and the American Exchange over the same 10-year time span. The returns on 30-day US Treasury bills are also provided. Adrover, Salibian-Barrera and Zamar (2002) used these data as an example for  41  robust inference on simple linear regression slope with the GMS estimate. The response variable is the difference between the monthly Motorola returns and the returns on 30day US Treasury bills. The explanatory variable is the difference between the monthly Market returns and the returns on 30-day US Treasury bills. The financial economists fit a straight line to this type of data. The slope measures the riskiness of the stock, the larger the slope the riskier the stock. We used RMS to estimate the regression parameters. The following table gives RMS slope estimate along with GMS, MM and LS estimates. Table 4.1: Different slope estimates Method of Estimation  $  RMS  1.11  GMS  1.21  MM  1.34  LS  0.85  The estimates for the intercept parameter are equal to zero (up to the second decimal place). Figure 4.1 contains a scatter plot of the data and the four fitted lines. From afinancialpoint of view, the conclusions differ very widely for the four different regression methods. According to the LS method, Motorola's stocks are safer than the market. According to RMS, GMS and MM methods, Motorola's stocks are riskier than the market. The following hypotheses may be of possible interest in this case H  0  : 8>1  versus H  Y  : 8 < 1.  If the null hypothesis is rejected, an investor would like to invest on this stock. The conclusions from the analyses using LS, MM, GMS and RMS methods are given below. 42  Figure 4.1: A scatter plot and four different regression lines  The LS method  The estimated standard error of the LS slope estimate is 0.105. The LS approach rejects H at level a = 0.1 (the p-value is 0.074). The corresponding residual plot (not shown 0  here) does not indicate the presence of outliers.  The M M method  Adrover et al. used lmRobMM from Splus for this naively robust approach. The estimated standard error of the MM slope estimate is 0.274. Since "The bias is high", this approach is indecisive and the recommendation is not to perform inference based on the final estimate. If we ignore the warning and proceed to test our hypotheses, we get a p-value of 0.890 and the null hypothesis cannot be rejected. 
The GMS method

For this globally robust approach, we need to determine a plausible value for ε. Adrover et al. used the GMS residual plot, which shows one clear outlier among the 120 observations. They reasonably selected ε = 0.01. Since we have

    σ̂_GMS / ŝ_GMS = 1.41,

the bias bound is 0.0225. The standard error of the GMS slope estimate is estimated to be 0.169 by using the shorth of the bootstrap distribution of β̂_n. Therefore, the robust p-value is

    p_GMS = Φ[(1.21 − 1 + 0.0225)/0.169] = 0.914,

and we cannot reject the null hypothesis. The Motorola stocks are not a safe investment.

The RMS method

To estimate the standard error of the RMS slope estimate, we used the shorth of the bootstrap distribution of β̂_n, and obtained a value of 0.154. As in the GMS case, the RMS residual plot (Figure 4.2) shows one clear outlier among the 120 observations, and we can use ε = 0.01. We have

    σ̂_RMS / ŝ_RMS = 1.429,

and the bias bound is 0.0278. Therefore, the robust p-value is

    p_RMS = Φ[(1.11 − 1 + 0.0278)/0.154] = 0.815,

and we cannot reject the null hypothesis. Again, the Motorola stocks are not a safe investment.

Figure 4.2: RMS residuals against RMS fitted values.

Discussion

According to the LS method, the Motorola stocks are a safe investment. This method considers all the data points, and because of some outlying 'good' returns, it leads us to a wrong conclusion. The other three methods give very large p-values, leaving no room for rejecting the null hypothesis. However, the bias is high for the MM method, and inference based on this naive method is not recommended. Though the GMS method gives a larger p-value than the RMS method, both methods are decidedly conservative, and both conclude that Motorola's stocks are NOT safer than the market.

4.2 Volume vs Height of trees

This dataset is courtesy of Dr. Harry Joe. The data consist of the girth, volume and height of each of 31 trees, and are presented in Table 4.2. It is reasonable to assume that 'Height' is a good predictor of the response variable 'Volume'. Figure 4.3 contains a scatter plot of these two variables. A straight line seems to be a good fit for the data, except for a group of points.

Figure 4.3: A scatter plot of Volume vs Height.

As in the first example, we used four methods to estimate the regression parameters: LS, RMS, GMS and MM. Figure 4.4 shows the four fitted lines. The LS slope (1.54) is the largest of all, with the GMS slope (1.43) as the closest competitor. The RMS and MM slopes are much smaller (1.14 and 1.03, respectively). At first glance, this seems contrary to our expectation, because there is a seemingly evident linearity along the diagonal of the scatter plot, and the LS line is closer to it than the robust lines are. The RMS and MM methods seem to have missed this particular linearity!

Figure 4.4: The four regression lines (LS, MM, GMS, RMS).

To better understand what is going on, let us look at the plots of the residuals against the fitted values for each of the four methods. Figure 4.5 presents the residual plot for the LS method. The zero line and the 2SE line are shown; the −2SE line is beyond the plot. One point (tree number 31) lies outside the 2SE limit.
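Residual plots like these also drive the choice of ε in the globally robust analyses: points outside the ±2SE band are counted, and the flagged fraction suggests a plausible contamination level. A minimal sketch, assuming a MAD-based residual scale (the thesis does not state how its 2SE bands are computed):

```python
import numpy as np

def flag_residual_outliers(x, y, slope, intercept):
    """Flag points outside a +-2 SE band around the fitted line."""
    r = y - intercept - slope * x                     # residuals from the fit
    s = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD residual scale
    return np.abs(r) > 2 * s

# The flagged fraction suggests epsilon: one point in 120 (Motorola) gives
# roughly 0.008, rounded to eps = 0.01; one clear outlier among the 31 trees
# gives roughly 0.03, the values used in this chapter.
```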
Table 4.2: Girth, volume and height of trees.

    Tree    Girth    Height    Volume
      1      8.3      70        10.3
      2      8.6      65        10.3
      3      8.8      63        10.2
      4     10.5      72        16.4
      5     10.7      81        18.8
      6     10.8      83        19.7
      7     11.0      66        15.6
      8     11.0      75        18.2
      9     11.1      80        22.6
     10     11.2      75        19.9
     11     11.3      79        24.2
     12     11.4      76        21.0
     13     11.4      76        21.4
     14     11.7      69        21.3
     15     12.0      75        19.1
     16     12.9      74        22.2
     17     12.9      85        33.8
     18     13.3      86        27.4
     19     13.7      71        25.7
     20     13.8      64        24.9
     21     14.0      78        34.5
     22     14.2      80        31.7
     23     14.5      74        36.3
     24     16.0      72        38.3
     25     16.3      77        42.6
     26     17.3      81        55.4
     27     17.5      82        55.7
     28     17.9      80        58.3
     29     18.0      80        51.5
     30     18.0      80        51.0
     31     20.6      87        77.0

Figure 4.5: LS residuals against LS fitted values.

Figure 4.6: GMS residuals against GMS fitted values.

Figure 4.7: RMS residuals against RMS fitted values.

Figure 4.8: MM residuals against MM fitted values.

The residual plot for the GMS method is shown in Figure 4.6. The −2SE line is beyond the plot. This time, four points (tree numbers 28, 29, 30 and 31) lie outside the 2SE limit. Figures 4.7 and 4.8 are the residual plots for the RMS and MM methods, respectively. For the RMS method, the −2SE line is beyond the plot. The same six points (tree numbers 26, 27, 28, 29, 30 and 31) lie outside the 2SE limit for both the RMS and the MM methods.

To learn more about these six trees, we carefully scrutinized their girths, heights and volumes (Table 4.2). Surprisingly, they have the highest girth values, which means that these six trees are older than the others! It is reasonable to assume that the Height-Volume relationship for the old trees is different from that for the young trees, with a larger slope for the old ones.

Let us look again at the scatter plot (Figure 4.3), more carefully this time. The linearity exhibited along the diagonal does not represent the majority of the data. The six points corresponding to the six 'old' trees are at the end of the diagonal. If we ignore these points, a straight line with a much smaller slope seems appropriate. There is heterogeneity in the data, and the MM and RMS methods identify this heterogeneity almost perfectly.

Testing of hypotheses

Buyers might be interested in assessing the volume of wood in a group of trees, and they would not like to invest unless the amount of wood makes the investment safe. Moreover, since seasoned wood is more useful, the age of the trees should also be considered when making a decision. With this in mind, the buyers may want to test the hypotheses

    H₀: β ≤ 1    versus    H₁: β > 1.

If the null hypothesis is rejected, a buyer would like to invest, considering that there is a good amount of wood and that the wood is seasoned (the age of the trees is also reflected in the slope parameter). The conclusions from the analyses using the LS, MM, GMS and RMS methods are given below.

The LS method

The estimated standard error of the LS slope estimate is 0.3839. The LS approach rejects H₀ at level α = 0.1 (the p-value is 0.083).

The MM method

We used lmRob from the library 'robust' of Splus 6 for this naively robust approach. The estimated standard error of the MM slope estimate is 0.355. Since "the bias is high", this approach is indecisive, and the recommendation is not to perform inference based on the final estimate. If we ignore the warning and proceed to test our hypotheses, we get a p-value of 0.466, and we cannot reject the null hypothesis.
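The globally robust tests below combine the point estimate, its shorth-bootstrap standard error, and the bias bound into a one-sided p-value. Here is a minimal sketch consistent with the calculations of Section 4.1 (the helper name is ours, not the thesis'):

```python
from scipy.stats import norm

def robust_pvalue(beta_hat, beta0, se, bias_bound, alternative):
    """alternative='less' tests H0: beta >= beta0; 'greater' tests H0: beta <= beta0."""
    if alternative == "less":
        return norm.cdf((beta_hat - beta0 + bias_bound) / se)
    return 1.0 - norm.cdf((beta_hat - beta0 - bias_bound) / se)

# Check against Section 4.1 (RMS, Motorola):
print(round(robust_pvalue(1.11, 1.0, 0.154, 0.0278, "less"), 3))   # 0.815
```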
The GMS method

We used the GMS estimate in both the naively robust and the globally robust approaches:

• The standard error of the GMS slope estimate is estimated to be 0.351 by using the shorth of the bootstrap distribution of β̂_n. Without any adjustment for the bias (which is equivalent to using ε = 0), the p-value is 0.098, and H₀ is rejected at level α = 0.1.

• For the globally robust approach, we need to determine a plausible value for ε. According to the GMS residual plot, there is one clear outlier among the 31 observations. It seems reasonable to use ε = 0.03. Since σ̂_GMS/ŝ_GMS = 1.82, the bias bound is 0.0986. Therefore, the robust p-value is

    p_GMS = 1 − Φ[(1.45 − 1 − 0.0986)/0.351] = 0.158,

and we cannot reject the null hypothesis at the 10% level.

The RMS method

The RMS estimate is also used in both the naively robust and the globally robust approaches. The standard error of the RMS slope estimate is estimated to be 0.362 by using the shorth of the bootstrap distribution of β̂_n.

• Without any adjustment for the bias of the point estimate, the p-value is 0.349, and H₀ cannot be rejected at level α = 0.1.

• As in the GMS case, the RMS residual plot shows one clear outlier among the 31 observations. We can use ε = 0.03. In this case, σ̂_RMS/ŝ_RMS = 1.59, and the bias bound is 0.0957. Therefore, the robust p-value is

    p_RMS = 1 − Φ[(1.14 − 1 − 0.0957)/0.362] = 0.451,

and we cannot reject the null hypothesis.

Discussion

According to the LS and the naive GMS methods, the investment is safe. The LS method is seriously affected by a few 'old' trees: the slope estimate is larger than the 'true' value, and the investment appears safer than it really is. The naive GMS approach is misleading too, because it does not take into account the possible bias of the estimate. Since the estimates obtained by the MM and RMS methods are close to 1, we get large p-values even when we use these methods naively. However, the bias test is significant for the MM method, and inference based on this method is not recommended by Splus. As for the naive RMS approach, though the p-value obtained is very large, we do not recommend it, for the following reasons: in Chapter 2 we discussed theoretically the consequences of ignoring the asymptotic bias of the point estimate in robust inference, and in this particular example we have seen the consequences of ignoring the bias in the naive GMS approach. Comparing the bias-adjusted approaches based on the GMS and RMS estimates, we strongly recommend globally robust inference with the RMS estimate.
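For reference, the two bias-adjusted one-sided p-values used in this chapter can be collected in display form, with β̂ the point estimate, v_n its estimated standard error, and B̂ the estimated bias bound; the tree-data RMS test is shown as a worked instance.

```latex
% Bias-adjusted one-sided p-values (notation as in the text):
p \;=\; \Phi\!\left(\frac{\hat{\beta} - \beta_0 + \hat{B}}{v_n}\right)
  \quad \text{for } H_0\colon \beta \ge \beta_0,
\qquad
p \;=\; 1 - \Phi\!\left(\frac{\hat{\beta} - \beta_0 - \hat{B}}{v_n}\right)
  \quad \text{for } H_0\colon \beta \le \beta_0.

% Worked instance (tree data, RMS):
p_{\mathrm{RMS}} \;=\; 1 - \Phi\!\left(\frac{1.14 - 1 - 0.0957}{0.362}\right)
  \;=\; 1 - \Phi(0.122) \;\approx\; 0.451.
```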
Chapter 5

A Monte Carlo Study

In this chapter we present the results of a simulation study that we conducted to investigate the finite-sample coverage levels of globally robust confidence intervals for the slope of the simple linear regression model. Adrover, Salibian-Barrera and Zamar (2002) conducted a similar study using the GMS slope estimate. Based on the considerations of Chapter 3 of this thesis, we will use the RMS slope estimate and compare our results with those obtained by Adrover et al. Though Hossjer, Rousseeuw and Croux (1994) established the asymptotic normality of the RMS estimate (and determined its asymptotic variance) without any symmetry assumption on the carrier distribution, we will take the bivariate normal distribution as our central model. We have two reasons for this decision: (1) the maxbiases of the RMS estimate are readily available for normally distributed explanatory variable and regression error under the central model (Table 3.1, extracted from Adrover and Zamar, 2000), and (2) since this is the same model considered by Adrover et al., our study will be more directly comparable to theirs.

5.1 Design of the study

We consider four different choices of v_n, the estimated standard error of the RMS estimate.

Method 1 (Empirical Asymptotic Variance): The variability is estimated by formula (3.10). Assuming F₀ = N(0, σ_e²) and G₀ = N(0, σ_x²), we have

    σ² = 1 / (4 f(0)² (E_{G₀}|X|)²) = π² σ_e² / (4 σ_x²).

Therefore,

    v_n = π σ̂_e / (2 σ̂_x √n).   (5.1)

Method 2 (Classical Bootstrap): The variability is estimated by the standard deviation of the bootstrap distribution of β̂_n.

Method 3 ("Shorth Bootstrap"): The variability is estimated by the shorth of the bootstrap distribution of β̂_n.

Method 4 ("MAD Bootstrap"): The variability is estimated by the MAD of the bootstrap distribution of β̂_n.

In our Monte Carlo simulation we used m = 1000 replicates of each of the following sampling situations: sample sizes n = 20, 40, 60, 80, 100 and 200 from the contaminated bivariate normal distributions (1 − ε)N(0, I) + εN(μ, τ²I) with μ′ = (μ_x, μ_y), τ = 0.1, μ_x = 3, and μ_y = 1.5 (2.0) for ε = 0.05 (0.10). We refer to this case as the "mild contamination case". In the "medium contamination case" we took μ_x = 5 and μ_y = 2.5 for ε = 0.05 and ε = 0.10. We also considered a "strong contamination case" by taking μ_x = 5 and μ_y = 15 for ε = 0.05 and 0.10.

Thus, there are three contamination types (mild, medium and strong) with two percentages of contamination for each type, and six sample sizes for each of these six cases: 36 sampling situations in total. The nominal confidence level in each situation is 0.95. To estimate the variability of β̂_n by the three bootstrap approaches (Methods 2, 3 and 4), we used B = 1000 bootstrap samples from each of the 1000 replicates of each sampling situation.
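A minimal sketch of one cell of this design follows, reusing rms_slope and a standard-error routine from the earlier sketches. The interval form β̂_n ± (1.96 v_n + bias bound) follows the globally robust construction of Chapter 2 (consistent with the robust p-values of Chapter 4), and the bias bound b_eps would be read from Table 3.1 for the chosen ε; both are stated here as assumptions.

```python
import numpy as np

def draw_sample(n, eps, mu, tau=0.1, rng=None):
    """One sample from (1 - eps) N(0, I) + eps N(mu, tau^2 I), returned as (x, y)."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal((n, 2))                  # central bivariate normal
    bad = rng.random(n) < eps                        # contaminated cases
    z[bad] = np.asarray(mu) + tau * rng.standard_normal((bad.sum(), 2))
    return z[:, 0], z[:, 1]

def cell(n, eps, mu, b_eps, vn_fn, m=1000, rng=None):
    """Observed coverage and median length for one sampling situation."""
    if rng is None:
        rng = np.random.default_rng(0)
    hits, lengths = 0, np.empty(m)
    for k in range(m):
        x, y = draw_sample(n, eps, mu, rng=rng)
        beta = rms_slope(x, y)
        half = 1.96 * vn_fn(x, y) + b_eps            # half-length: SE part plus bias bound
        hits += beta - half <= 0.0 <= beta + half    # true slope is 0 under the central model
        lengths[k] = 2.0 * half
    return hits / m, np.median(lengths)

# e.g. cell(60, 0.05, (3.0, 1.5), b_eps, vn_fn)  # one mild-contamination cell
```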
5.2 Numerical results

The observed coverage levels and the median lengths of the robust confidence intervals obtained by the four methods are presented in Tables 5.1, 5.2, 5.3 and 5.4, respectively. First we compare the results across these four tables; then we compare our results with those obtained by Adrover et al.

Comparison of the four methods

The major observations are the following.

Method 1 (Table 5.1): Most of the observed coverage levels are above 95%. The median lengths reported in this table are the largest of all four methods.

Method 2 (Table 5.2): Most of the coverages are above 95%. The median lengths in this table are larger than the corresponding median lengths for Methods 3 and 4.

Method 3 (Table 5.3): Except for some of the medium contamination cases, most of the coverages are between 90% and 95%. The median lengths in this table are the smallest.

Method 4 (Table 5.4): Most of the coverages are between 90% and 95%. The median lengths for this method are smaller than those of Methods 1 and 2, but larger than those of Method 3.

Table 5.1: Coverage proportion (median length) of the robust CI for the slope by Method 1.

    Contamination    n      Mild           Medium         Strong
    5%               20     0.92 (1.28)    0.96 (1.28)    0.91 (1.37)
                     40     0.96 (0.99)    0.94 (1.02)    0.95 (1.03)
                     60     0.97 (0.86)    0.96 (0.86)    0.95 (0.89)
                     80     0.97 (0.77)    0.96 (0.79)    0.96 (0.80)
                     100    0.96 (0.71)    0.96 (0.72)    0.96 (0.74)
                     200    0.96 (0.57)    0.98 (0.58)    0.97 (0.58)
    10%              20     0.89 (1.39)    0.89 (1.36)    0.91 (1.68)
                     40     0.96 (1.16)    0.95 (1.18)    0.97 (1.34)
                     60     0.96 (1.09)    0.91 (1.07)    0.95 (1.19)
                     80     0.93 (1.04)    0.94 (1.01)    0.97 (1.11)
                     100    0.93 (0.97)    0.96 (0.98)    0.97 (1.05)
                     200    0.97 (0.85)    0.93 (0.84)    0.98 (0.89)

Table 5.1 shows some overcoverage, and the lengths of the confidence intervals are relatively large. An explanation may be found in Table 3.2, which shows that the asymptotic variance (formula 3.10) overestimates the variability of β̂_n exhibited in finite samples. Similarly, the overcoverage shown in Table 5.2 and the undercoverage in Table 5.3 are due to the overestimation and the underestimation, respectively, of the 'true' standard error of β̂_n. The observed coverage levels for Method 4 are mostly close to the nominal level; this method seems to estimate the standard error of β̂_n better than the other methods do.

Table 5.2: Coverage proportion (median length) of the robust CI for the slope by Method 2.

    Contamination    n      Mild           Medium         Strong
    5%               20     0.97 (1.51)    0.97 (1.49)    0.98 (1.91)
                     40     0.95 (0.95)    0.94 (0.97)    0.97 (1.10)
                     60     0.93 (0.80)    0.94 (0.80)    0.97 (0.89)
                     80     0.96 (0.71)    0.93 (0.72)    0.96 (0.78)
                     100    0.94 (0.65)    0.95 (0.67)    0.96 (0.71)
                     200    0.96 (0.53)    0.96 (0.54)    0.97 (0.55)
    10%              20     0.94 (1.63)    0.94 (1.55)    0.99 (2.51)
                     40     0.97 (1.21)    0.95 (1.14)    0.99 (1.57)
                     60     0.96 (1.09)    0.88 (1.02)    0.98 (1.30)
                     80     0.94 (1.02)    0.93 (0.96)    0.99 (1.16)
                     100    0.93 (0.96)    0.94 (0.93)    0.99 (1.08)
                     200    0.96 (0.84)    0.91 (0.81)    0.99 (0.90)
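The four choices of v_n can also be compared directly on a single sample; a minimal sketch, reusing rms_slope, bootstrap_slopes and shorth_scale from the earlier sketches. The MAD-based scales used for σ̂_e and σ̂_x in Method 1, and the median intercept, are assumptions, since the thesis does not say how these quantities are estimated.

```python
import numpy as np

def mad_scale(z):
    """Normal-consistent MAD."""
    return 1.4826 * np.median(np.abs(z - np.median(z)))

def vn_all_methods(x, y, B=1000):
    n = len(x)
    beta = rms_slope(x, y)
    resid = y - np.median(y - beta * x) - beta * x    # residuals from an RMS-type fit
    reps = bootstrap_slopes(x, y, rms_slope, B=B)     # bootstrap replicates of beta_n
    return {
        "Method 1 (asymptotic)": np.pi * mad_scale(resid) / (2 * mad_scale(x) * np.sqrt(n)),
        "Method 2 (classical)":  reps.std(ddof=1),
        "Method 3 (shorth)":     shorth_scale(reps),
        "Method 4 (MAD)":        mad_scale(reps),
    }
```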
Comparison of RMS and GMS

For the results of the GMS approach, we refer to the three tables provided by Adrover et al. (2002). The comparison of Method 1 for RMS with its GMS counterpart is interesting. If we focus on the median lengths, we notice that they are more or less equal. If we concentrate on the observed coverage levels, however, we see that for RMS they are mostly above 95% while for GMS they are mostly below 95%.

The explanation is as follows. We mentioned earlier that the asymptotic variance formulas for the two estimates are the same (though derived under two different sets of regularity conditions). For RMS, this formula overestimates the finite-sample variability (Table 3.2), while for GMS it underestimates it (Adrover et al., 2002). The maxbiases of the two estimates, on the other hand, are very close (slightly smaller for GMS at ε = 0.05, and slightly smaller for RMS at ε = 0.10). Therefore, for the same lengths, RMS shows overcoverage while GMS shows undercoverage.

One point is evident from the above comparison: the finite-sample variability of RMS is less than that of GMS, which is also reflected in the Method 2 and Method 3 results for the two estimates. For each of these methods, the median lengths obtained by RMS are smaller than those obtained by GMS, while the observed coverage levels are more or less equal. Method 4 is not comparable, since Adrover et al. did not consider this method for the GMS variability estimation.

Table 5.3: Coverage proportion (median length) of the robust CI for the slope by Method 3.

    Contamination    n      Mild           Medium         Strong
    5%               20     0.88 (1.19)    0.89 (1.15)    0.94 (1.41)
                     40     0.91 (0.86)    0.87 (0.88)    0.94 (0.98)
                     60     0.90 (0.76)    0.91 (0.76)    0.94 (0.82)
                     80     0.93 (0.68)    0.92 (0.70)    0.95 (0.73)
                     100    0.92 (0.62)    0.93 (0.65)    0.94 (0.67)
                     200    0.94 (0.52)    0.96 (0.53)    0.96 (0.54)
    10%              20     0.85 (1.36)    0.85 (1.25)    0.96 (1.84)
                     40     0.90 (1.13)    0.84 (1.03)    0.98 (1.37)
                     60     0.90 (1.05)    0.84 (0.97)    0.96 (1.21)
                     80     0.94 (1.00)    0.90 (0.93)    0.97 (1.12)
                     100    0.93 (0.94)    0.91 (0.91)    0.98 (1.04)
                     200    0.94 (0.82)    0.90 (0.80)    0.98 (0.88)

Table 5.4: Coverage proportion (median length) of the robust CI for the slope by Method 4.

    Contamination    n      Mild           Medium         Strong
    5%               20     0.91 (1.28)    0.93 (1.26)    0.96 (1.53)
                     40     0.92 (0.90)    0.91 (0.93)    0.95 (1.02)
                     60     0.92 (0.79)    0.93 (0.79)    0.95 (0.85)
                     80     0.94 (0.70)    0.93 (0.72)    0.96 (0.76)
                     100    0.93 (0.64)    0.95 (0.67)    0.95 (0.69)
                     200    0.95 (0.52)    0.96 (0.54)    0.97 (0.55)
    10%              20     0.92 (1.47)    0.88 (1.35)    0.97 (1.99)
                     40     0.93 (1.19)    0.90 (1.90)    0.99 (1.43)
                     60     0.94 (1.08)    0.88 (1.02)    0.97 (1.24)
                     80     0.94 (1.02)    0.93 (0.96)    0.98 (1.14)
                     100    0.94 (0.96)    0.95 (0.93)    0.99 (1.06)
                     200    0.94 (0.83)    0.91 (0.81)    0.98 (0.89)

5.3 Discussion

The numerical results in the four tables show that the coverages of the robust confidence intervals are in general fairly good (close to the nominal level) and constitute a major improvement over those achieved by the naive procedure (Table 2.2).

The asymptotic variance of RMS overestimates its finite-sample variability, and so does the standard deviation of the bootstrap distribution of β̂_n. On the other hand, the shorth of the bootstrap distribution underestimates the variability. The MAD bootstrap seems to perform best: its coverages are closer to the nominal level than those of the other methods, and we would recommend it for estimating the finite-sample variability of RMS.

The overall performance of RMS is better than that of GMS: either the observed coverages of the RMS intervals are larger at the same lengths, or the RMS lengths are smaller while the same coverages are maintained. Based on these considerations, we recommend RMS for robust inference on the slope of the simple linear regression model.
Chapter 6

Conclusion

In this concluding chapter we summarize our findings and identify some topics for further research.

6.1 Summary

The main results of this thesis may be summarized as follows:

1. In the construction of robust confidence intervals, if we ignore the uncertainty due to the bias of the point estimate, we will get an asymptotic coverage level of zero.

2. To incorporate the bias bound of an estimate, in addition to its standard error, in robust inference, the method proposed by Adrover, Salibian-Barrera and Zamar (2002) may be used.

3. A point estimate that has an asymptotically normal distribution and a relatively small bias bound should be preferred.

4. For robust inference on the simple linear regression slope, the problems with the GMS estimate proposed by Adrover et al. (2002) are that it has a breakdown point of only 0.25, and that its asymptotic normality is established under very restrictive conditions.

5. In this study, we proposed the RMS estimate, for which the breakdown point is 0.50 and the asymptotic normality holds under very general conditions.

We applied the RMS method to two real datasets. In the first example, the RMS method performed almost as well as the GMS approach; both concluded that the investment was risky, while the least squares approach indicated otherwise. The second example was a more challenging problem, and the RMS method performed much better than GMS. The outlying 'old' trees were identified by RMS almost perfectly, and while GMS was 'shaky' in making a decision (as compared to RMS), RMS was clearly conservative and considered the investment to be NOT good.

In the Monte Carlo study, the RMS method achieved, more or less, the same observed coverage levels as GMS while constructing intervals of smaller lengths. Regarding the four methods of estimating the standard error of RMS, the asymptotic variance formula and the classical bootstrap were overestimating, while the shorth bootstrap was underestimating; the MAD bootstrap may be preferred.

Based on our findings, we recommend RMS for globally robust inference on the simple linear regression slope.

6.2 Further study

The following points form some interesting areas for future research:

1. For prediction with simple linear regression models, we may be interested in linear combinations of the slope and intercept parameters. The asymptotic bivariate normality of these parameter estimates may be established for this purpose. We will also need an appropriate bias bound for the intercept parameter.

2. The bias bounds of the RMS slope and intercept estimates are available only under the assumption of normally distributed explanatory variable and regression error under the central model. These bounds may be obtained under more general conditions, for example, without the symmetry assumption on the carrier distribution.

3. RMS can also be used for estimating the slope parameters in multiple linear regression, using kernel functions with more than two arguments (Siegel, 1982). However, these estimators are not affine equivariant when the number of slope parameters is two or more. The asymptotic properties of RMS in higher dimensions constitute another interesting area of study.

Bibliography

Adrover, J. G., Salibian-Barrera, M., and Zamar, R. H. (2002). Globally robust inference for the location and simple linear regression models. J. Statist. Plann. Inference (accepted for publication).

Adrover, J. G. and Zamar, R. H. (2000). Bias robustness of three median-based regression estimates. Technical Report No. 194, Department of Statistics, University of British Columbia, Canada.
Berrendero, J. R. and Zamar, R. H. (2001). Maximum bias curves for robust regression with non-elliptical regressors. Ann. Statist., 29: 224-251.

Boscovich, R. J. (1757). De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis. Bononiensi Scientiarum et Artum Instituto atque Academia Commentarii, 4: 353-396.

Brown, G. W. and Mood, A. M. (1951). On median tests for linear hypotheses. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Univ. of California Press, Berkeley, pages 159-166.

Fraiman, R., Yohai, V. J., and Zamar, R. (2001). Optimal robust M-estimates of location. Ann. Statist., 29: 194-223.

Frees, E. (1991). Trimmed slope estimates for simple linear regression. J. Statist. Plann. Inference, 27: 203-221.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc., 69: 383-393.

He, X. and Simpson, D. G. (1993). Lower bounds for contamination bias: Globally minimax versus locally linear estimation. Ann. Statist., 21: 314-337.

Hennig, C. (1995). Efficient high-breakdown point estimators in robust regression: Which function to choose? Statistics & Decisions, 13: 221-241.

Hossjer, O., Rousseeuw, P. J., and Croux, C. (1994). Asymptotics of the repeated median slope estimator. Ann. Statist., 22: 1478-1501.

Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35: 73-101.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Martin, R. D., Yohai, V. J., and Zamar, R. H. (1989). Min-max bias robust regression. Ann. Statist., 17: 1608-1630.

Martin, R. D. and Zamar, R. H. (1989). Asymptotically min-max bias robust M-estimates of scale for positive random variables. J. Amer. Statist. Assoc., 84: 494-501.

Martin, R. D. and Zamar, R. H. (1993). Bias-robust estimates of scale. Ann. Statist., 21: 991-1017.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley, New York.

Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis (J. Franke, W. Hardle, and R. D. Martin, eds.), Lecture Notes in Statistics 26, Springer Verlag, New York: 256-272.

Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall's tau. J. Amer. Statist. Assoc., 63: 1379-1389.

Siegel, A. F. (1982). Robust regression using repeated medians. Biometrika, 69: 242-244.

Stigler, S. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard Univ. Press.

Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis, I, II and III. Koninklijke Nederlandse Akademie van Wetenschappen, Proceedings, 53: 386-392; 521-525; 1397-1412.

Tukey, J. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics (I. Olkin, ed.), Stanford University Press, Stanford.

Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. Ann. Statist., 15: 642-656.

Yohai, V. J. and Zamar, R. H. (1993). A minimax bias property of the least α-quantile estimates. Ann. Statist., 20: 1875-1888.
