- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Globally robust inference for simple linear regression...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Globally robust inference for simple linear regression models with repeated median slope estimator 2002
pdf
Page Metadata
Item Metadata
Title | Globally robust inference for simple linear regression models with repeated median slope estimator |
Creator |
Khan, Md Jafar Ahmed |
Date Created | 2009-09-11T23:12:04Z |
Date Issued | 2009-09-11T23:12:04Z |
Date | 2002 |
Description | Globally robust inference takes into account the potential bias of the point estimates (Adrover, Salibian-Barrera and Zamar, 2002). To construct robust confidence intervals for the simple linear regression slope, the authors selected the generalized median of slopes (GMS) as their point estimate, considering its good bias behavior and asymptotic normality. However, GMS has a breakdown point of only 0.25, its asymptotic normality is established under very restrictive conditions, and its bias bound is known only for symmetric carrier distributions. In this study, we propose the repeated median slope (RMS) estimate as an alternative choice. RMS has a breakdown point of 0.50, its asymptotic normality holds under mild assumptions, and the bias bound for RMS is known for general carrier distributions. The proposed method achieves, more or less, the same observed coverage levels while it constructs intervals of smaller lengths, as compared to the GMS approach. |
Extent | 2620742 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project [http://www.library.ubc.ca/archives/retro_theses/] |
Date Available | 2009-09-11T23:12:04Z |
DOI | 10.14288/1.0090413 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 2002-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/12713 |
Aggregated Source Repository | DSpace |
Digital Resource Original Record | https://open.library.ubc.ca/collections/831/items/1.0090413/source |
Download
- Media
- ubc_2002-0447.pdf [ 2.5MB ]
- Metadata
- JSON: 1.0090413.json
- JSON-LD: 1.0090413+ld.json
- RDF/XML (Pretty): 1.0090413.xml
- RDF/JSON: 1.0090413+rdf.json
- Turtle: 1.0090413+rdf-turtle.txt
- N-Triples: 1.0090413+rdf-ntriples.txt
- Citation
- 1.0090413.ris
Full Text
G l o b a l l y R o b u s t In fe rence fo r S i m p l e L i n e a r Regress ion M o d e l s w i t h R e p e a t e d M e d i a n Slope E s t i m a t o r by Md Jafar Ahmed Khan M.Sc, University of Dhaka, 1992 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF M a s t e r o f Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics) We accept this thesis as conforming to the required standard T h e U n i v e r s i t y o f B r i t i s h C o l u m b i a September 2002 © Md Jafar Ahmed Khan, 2002 In p resen t ing this thesis in partial fu l f i lment o f t h e requ i remen ts f o r an advanced d e g r e e at t h e Univers i ty o f Brit ish C o l u m b i a , I agree that t h e Library shall m a k e it f reely available f o r re ference and s tudy. I fu r ther agree that pe rmiss ion f o r ex tens ive c o p y i n g of this thesis f o r scholar ly pu rposes may be g ran ted by t h e head o f m y d e p a r t m e n t o r by his o r her representat ives. It is u n d e r s t o o d that c o p y i n g o r pub l i ca t i on o f this thesis fo r f inancial gain shall n o t b e a l l o w e d w i t h o u t m y w r i t t e n pe rmiss ion . D e p a r t m e n t The Univers i ty o f Brit ish C o l u m b i a Vancouver , Canada Date DE-6 (2/88) A b s t r a c t Globally robust inference takes into account the potential bias of the point estimates (Adrover, Salibian-Barrera and Zamar, 2002). To construct robust confidence intervals for the simple linear regression slope, the authors selected the generalized median of slopes (GMS) as their point estimate, considering its good bias behavior and asymptotic normality. However, GMS has a break- down point of only 0.25, its asymptotic normality is established under very restrictive conditions, and its bias bound is known only for symmetric carrier distributions. In this study, we propose the repeated median slope (RMS) estimate as an alternative choice. RMS has a breakdown point of 0.50, its asymptotic normality holds under mild assumptions, and the bias bound for RMS is known for general carrier distributions. The proposed method achieves, more or less, the same observed coverage levels while it constructs intervals of smaller lengths, as compared to the GMS approach. ii C o n t e n t s Abstract ii Table of contents iii Acknowledgements vi 1 Introduction 1 1.1 Motivation 1 1.2 Terminology 3 1.3 Simple Linear Regression: Robust estimation of parameters 7 1.3.1 Median of Pairwise Slopes (MPS) 8 1.3.2 Generalized Median of Slopes (GMS) 10 1.3.3 Repeated Median of Slopes (RMS) 11 1.3.4 S-estimates 12 iii V 1.3.5 MM-estimates 13 1.3.6 r-estimates 13 1.4 Purpose of this study 14 1.5 Organization of subsequent chapters 14 2 Globally Robust Inference 15 2.1 Limitations of the classical confidence intervals 16 2.2 Naive intervals: consequences of ignoring bias 17 2.3 Robust inference for the location model 20 2.3.1 Confidence Intervals 20 2.3.2 One sided confidence bound 24 2.3.3 P-values 25 2.3.4 Estimation of bias bound 27 2.4 Robust inference for the simple linear regression model 28 2.4.1 Estimation of the bias bound 29 3 The R M S Approach for Robust Inference on Slope 32 3.1 Reasons for the selection of RMS 33 3.2 Maxbias of the RMS estimate 36 iv 3.3 Asymptotic properties of the RMS estimate 38 4 Application 41 4.1 Motorola vs Market 41 4.2 Volume vs Height of trees 46 5 A Monte Carlo Study 55 5.1 Design of the study 56 5.2 Numerical results 57 5.3 Discussion 61 6 Conclusion 63 6.1 Summary 63 6.2 Further study 64 Bibliography 66 v A c k n o w l e d g e m e n t s I would like to acknowledge, warmly and thankfully, the guidance and support of my supervisor, Dr. Ruben H . Zamar, throughout this study. I would like to thank Dr. Lang Wu for his valuable suggestions as the second reader of this thesis. Special thanks are to Dr. Matias Salibian-Barrera, who kindly gave me some related codes to start with, and to Dr. Harry Joe, for his timely and valuable suggestions during the computation. JAFAR A H M E D K H A N The University of British Columbia September 2002 vi C h a p t e r 1 I n t r o d u c t i o n 1.1 Mot iva t ion As a major source of uncertainty, sampling variability of an estimator plays an important role in statistical inference, particularly when the sample size is small. As sample size increases, sampling variability becomes less and less important. This is because standard errors are usually of order 0(l/y/n), and tend to zero as sample size n tends to infinity. This is not the case with the other major source of uncertainty: the bias of an estimator, which is caused by data contamination (e.g., outliers, asymmetric errors, and other departures from the model assumption), gross errors, etc. Since biases are of order 0(1), we cannot reduce the bias of an estimator by increasing the sample size. Therefore, for large data sets, the uncertainty due to bias clearly dominates the uncertainty due to sampling variability. To demonstrate the idea, we perform a simulation. We choose Af(0,1) as our central model and N(10,1) as the contaminating distribution. We select three different sample 1 sizes: 25, 100 and 400, and three different levels of contamination: 0%, 10% and 20%. For each combination of sample size and contamination level, we generate 1000 samples, and calculate the median from each sample. The average of the 1000 sample medians (minus the true median which is zero in this case) gives us the bias, and we also calculate the standard error in each case. The following table summarizes the results. Table 1.1: Bias (standard error) for different sample sizes and contamination levels. Contamination n = 25 n = 100 n = 400 level 0% -0.0028 (0.2415) -0.0039 (0.1238) 0.0028 (0.0610) 10% 0.1355 (0.2853) 0.1435 (0.1383) 0.1381 (0.0679) 20% 0.3278 (0.3390) 0.3188 (0.1645) 0.3194 (0.0831) As expected, the standard error of the sample median decreases as n increases while the bias remains almost unchanged. Also, the bias increases with increased contam- ination. This shows the relative importance of bias as a source of uncertainty of an estimator. With the availability of high-speed computers and automated data collection techniques, there are many applications in which data quantity is not a problem, and we should concentrate on data quality. Accordingly, we should have more focus on the bias behavior of an estimator than on its standard error. Classical inference considers sampling variability to be the only source of uncer- tainty, and does not address the issue of bias caused by contamination. With one or two exceptions, the few papers that address robust inference also ignore the possible bias of the point estimates. The consequences of this will be explained in Chapter 2 in the context of location and simple linear regression models. In our study, we will consider the bias-uncertainty of an estimator in addition to its standard error. 2 1.2 Terminology To introduce some terminology, let us consider the parametric family Fg and the e- contamination neighborhood (Tukey, 1960), Fe{Fg) = {F:F=(1- e)Fg + eF*, F* an arbitrary distribution}, 0 < e < 1/2. (1.1) According to this model, 100(1 — e)% of the data comes from a distribution Fg but the remainder 100e% may come from an unknown arbitrary distribution F*. Let us consider estimates Tn of 9 which, under some mild regularity conditions, converge almost surely to the asymptotic functional T(F). This functional is well-defined on a set of distributions which includes the empirical distribution functions and the family ^(Fg). Due to the presence of outliers and other departures from the central parametric model, T(F) is not necessarily equal to 9 for all F in F€(Fg). For this reason, we must consider the asymptotic bias, bT(F) = d(T(F),9), where d measures the distance between the true value of the parameter and the asymp- totic value of the estimate. M a x b i a s f u n c t i o n The robustness of an estimate T can be assessed in terms of the maxbias function, BT{e) = sup bT(F) (1.2) which represents the maximum possible perturbation of T(F) when F varies over the entire neighborhood. Huber (1964 & 1981) introduced the concept of maximum asymp- totic bias in the location model setup. Martin and Zamar (1989) and Martin, Yohai and Zamar (1989) defined the maxbias curve which is a plot of BT{e) against different values 3 of e, and derived such curves for robust location, scale and regression estimates. It is known that the typical maxbias functions are continuous and increasing from zero to infinity. C o n t a m i n a t i o n s e n s i t i v i t y Hampel denoted the supremum of the influence function by 7* , and called it the gross- error sensitivity. The contamination sensitivity, introduced by He and Simpson (1993), is closely related to the gross-error sensitivity and is defined as 7 (T) = /^(e)| e = 0. (1.3) The authors showed that in general 7* < 7 . B r e a k d o w n p o i n t The breakdown point (BP) of the estimate T is the supremum of the maxbias function domain, i.e., the point at which the maxbias function diverges. Mathematically, e* = sup{e : BT(e) < 00}. (1.4) The breakdown point is the maximum fraction of contamination the estimate can tolerate before its value is completely determined by the contaminating data. The study of the limiting behavior of BT(e) near the ends of its domain is very informative. Hampel (1974) investigated the behavior of BT(e) when e is small, focusing on the rate at which Bx(e) tends to zero as e tends to zero. Berrendero and Zamar (2001) studied the explosion rate of the bias function when e approaches the breakdown point. Figure 1.1 shows the maxbias curve, breakdown point and contamination sensitivity of median. The dotted vertical line at BP (0.5 in this case) shows how the maxbias 4 Figure 1.1: Maxbias curve, breakdown point and contamination sensitivity of median. epsilon curve diverges when e approaches BP. The slope of the dashed line is the contamination sensitivity, 7 , of median. Notice that the linear approximation is good for small values of e (e < 0.10). L o c a l l y a n d g l o b a l l y r o b u s t es t imates An estimate whose maxbias grows slowly near zero is called locally robust and an estimate whose maxbias is relatively small for large fractions of contaminations is called globally robust. The local and global features of the maxbias function Br(e) are summarized by the contamination sensitivity 7(T) and the breakdown point e*. The contamination sensitivity provides an approximation for BT{C) near zero, and thus measures the local robustness of the estimate T. We say that T is lacally robust if j(T) is finite. 5 He and Simpson (1993) showed that e* < 0.5 for all affine-equivariant regression estimates. (Equivariance properties of regression estimates will be discussed in the next section.) If the estimate T attains the maximal breakdown point 0.5, we say that T is globally robust. Bias bound The bias bound for the estimate T, introduced by Berrendero and Zamar (2001), high- lights the practical potential of maxbias curves. To fix ideas, let us consider the location model and the median functional M(F). Suppose that we want to determine the upper bound for the absolute difference DM(F) = \M(F) - 9\. Huber (1964) showed that where a0 is the scale of the core distribution. Therefore, DM(F) is bounded by a0B(e). In practice a0 is not usually known and has to be estimated by a robust scale functional S(F), for example, the MAD. However, the quantity S(F)BM{() may not be an upper bound for D(F) because S(F) may underestimate cr 0. For example, let us consider the contaminated distribution F = 0.90 iV(0,1) + 0.10 50.i5, then whereas MAD(F)B(0A0) = 0.8818 x 0.1397 = 0.1232 < 0.1397. A quantity KM(e) such that S(F)KM{t) is a bound for DM{F) is called the bias bound for M(F). We will discuss the relation between maxbias functions and bias bounds for the location and the regression estimates in Chapter 2. A confidence interval is called globally robust if it is stable (in the sense of keeping coverage at or above the nominal level) and informative (in the sense of keeping a reason- able average length), not only at the central model, but also over the entire contamination 5M(0.10) M(F) - 0 1 = 0.1397, 6 neighborhood. We will discuss globally robust inference formally in Chapter 2. To con- struct a globally robust confidence interval, we have to use the bias bound of the point estimate in addition to its standard error. In this thesis, we will study globally robust inference on the simple linear regression slope. For this, we have to select an appropriate point estimate. We will discuss some robust regression estimates in the following section. 1.3 Simple Linear Regression: Robust estimation of parameters Let us consider the simple linear regression model, Hi = Po + PiXi + 1 <i<n, where re* and Ei are independent, ~ F0, Median^(ei) = 0, Xi ~ Go and (xi,yi) are independent. The joint distribution of (xi, yi) under this model will be denoted by H0. To allow for outliers and other departures from this central parametric model we will assume that the joint distribution H of (xi,yi) belongs to the e-contamination neighborhood Fe(H0) = {H=(1- e)H0 + eH* : H* any distribution on K2}, 0 < e < 1/2. (1.5) Let T — (Ti,T2) be the intercept and the slope functionals defined on a large class of distributions H on 5R2 which includes FfHo) and all the empirical distribution functions Hn. A desirable feature of the regression functional T is the property of Fisher- consistency, i.e., T1(H0) = p0, T2{H0) = B1. The regression functional T is also expected to have the following equivariance properties. Let Q(x,y) denote the joint distribution of (x,y). 7 Regression Equivariance: T^Gfay + a + bx)) T2{g(x,y + a + bx)) = T 1 ( ^ ( x , y ) ) + a, a,beM. = T2(G(x,y)) + a, a,be$. Affine Equivariance: Ti(g{cx,y)) = Tx(Q{x,y)), c > 0. T2(g(cx,y)) = T2(g(x,y))/c, c> 0. Scale Equivariance: Ti{g{x,sy)) = sT^gfay)), s > 0. T2{g{x,sy)) = sT2(g(x,y))/c, s > 0. Location Equivariance: Ti{${x + a,y)) = T^gfay)) - aT2{g(x,y)), a G K. r 2(S(x + a ) ? /)) = T 2(0(:r ,y)), a e £ It is easy to show that the least squares (LS) estimates of the slope and the intercept parameters satisfy the above properties, but they have breakdown point zero. Following are some robust estimates, which also satisfy the above equivariance requirements. 1.3.1 M e d i a n o f P a i r w i s e Slopes ( M P S ) This estimator was studied by Thei l (1950) and Sen (1968) . For this, we first consider the set I — : X{ ^ Xj} and calculate the slopes corresponding to al l pair of observations (xi,yi) and (xj,yj). That is, K»,J) = ̂ ^ , (1.6) Xj X i 8 with r(i,j) — 0 ii Xi — Xj. Then, we define ^PS(H) = MedIr(l,j). (1.7) The corresponding functional form of this estimator is ffps(H) = Med QH (^^2- \xi - x2, where (xi,yi) and (x2,y2) are independent with common joint distribution H and Qn stands for the conditional distribution of the ratio given the denominator is different from zero. Like MPS, many other estimators of the slope parameter 0 are based on the pairwise slopes r(i,j). Most interestingly, the classical least squares estimator BLS may be written as a weighted average of the pairwise slopes, As = ^ , (1.8) i<j with weights Wij = (xi — Xj)2. Boscovich (1757) considered a data set with n = 5 observations, and computed the unweighted average of the 10 pairwise slopes, as well as a 10% trimmed mean given by the average of 8 of these slopes. Stigler (1986) may be referred to for a more complete historical discussion. Frees (1991) gave a survey of these and related estimators. In general, given a slope estimate J3(H), a median-based intercept estimate is defined by a(H) = MedH(y - $(H)x). (1.9) Thus, the corresponding regression line splits the plane into two halves containing each half of the data. Since the intercept is a location parameter for the difference y — J3(H)x, estimating it by the median seems to be appropriate from a bias point of view. 9 1.3.2 G e n e r a l i z e d M e d i a n o f Slopes ( G M S ) The GMS estimate was first proposed by Brown and Mood (1951) and recently studied by Adrover and Zamar (2000). The GMS is defined as the solutions (an, J3n) to the equations 1 n - ^ sign (yt - a n - pn(x{ - /}„)) signfo - fin) = 0, (1.10) n i=i 1 n - ] P sign (yi -6en- J3n(xi - A n ) ) = 0, (1.11) where p,n = Med{xj}. The GMS estimates are defined by the following geometrical property: the slope and the intercept estimates are such that the corresponding regression fit and the vertical line x = Medjxj} split the plane into four quarters containing the same number of points. Adrover and Zamar (2000) showed that the slope estimate P£MS satisfies the fixed point equation ft"* = Med{y-"Medfa~5MS(Xl"An)}} • (1.12) Notice that this is not really a closed form formula because the right hand side of (1.12) also contains / 3 ^ M 5 . A solution to this equation can be found with the following iterative algorithm proposed by Adrover and Zamar (2000). Let be some initial slope estimate. Then a^+1) = Mediyi-pWixi-jin)} PlT+1) = Med{(yt-a^)/(Xl-(in)}. This algorithm usually converges after a few iterations. However, in some cases it runs into a closed loop. Then the algorithm automatically finds the solution by using a bisection procedure which takes place after the closed loop it detected. 10 The GMS estimates are a natural generalization of the minimax-bias estimate of re- gression through the origin. Martin and Zamar (1989) found that in the case of regression through the origin, = B%i + Si, the MS estimate P{H) = MedH (1) minimizes the maxbias among all equivariant estimates. It can be verified that /3(H) is a generalized M-estimate (GM) that satisfies the equation EH sign(y - bx) sign(x) = 0. For the simple linear regression model, the estimates (a(H), (3(H)) can be defined im- plicitly by the equations EH sign(y — a — bx) sign(a; — m) = 0 (1-13) EH sign(y - a - bx) = 0 (1.14) where m = Med#(:r). The finite sample versions are obtained by setting H = Hn, the empirical distribution function of the data. Clearly, (1.13) and (1.14) are particular cases of GM-estimates. It can be easily seen that the GMS estimates satisfy (1.13) and (1-14) and, therefore, they are a very special type of GM-estimates. 1.3.3 Repeated Median of Slopes (RMS) Siegel (1982) defined the first slope estimate with breakdown point 0.5, by performing a different median-based operation over the ratios (1.6). In this case, we define the slope estimator as 0™S = Med1<i<nMedj(kjir{i,j), where Jj = {j : (i,j) € /}, for 1 < i < n. The corresponding functional form of this estimate is as follows. First, let us define qM(a,b,H) = Med QH (^L^) , (1.15) 11 for fixed numbers a and b. Then, the RMS estimator is defined as oRMS Med Q*H (qM{x2,y2,H)), (1.16) where (xi,yi) and (x2,y2) are independent with common joint distribution H and QH stands for the conditional distribution of the ratio given the denominator is different from zero. Q*H denotes the distribution of qM(x2, y2, H) under H. 1.3.4 S-est imates The regression estimates defined so far are median-based. Before defining the regression S-, MM- and r-estimates, we need to define the M-estimates of scale. For this we may consider the parametric scale model where F0 is the distribution function of a random variable X with a density function symmetric about zero. The M-estimates of scale were defined by Huber (1964) as where b = Ep0 p{X) and the score function p satisfies the following assumptions: (a) p(x) : M —> E is symmetric and nondecreasing on [0, oo). (b) p(0) = 0, and p(oo) = 1. (c) p(x) has at most a finite number of discontinuities. The regression S-estimates were defined by Rousseeuw and Yohai (1984) as Fa{x) = F0(x/a) (1.17) 0S = argmin Sp (t, H), 12 where Sp (t, H) is the M-scale of the absolute residuals r(t) = |y — tT: have joint distribution H. Mathematically, x|, in which (y, x) Sp{t, H) =inf \s : E „ p 1.3.5 M M - e s t i m a t e s The regression MM-estimates were defined by Yohai (1987) as OMM = argmin EH p2 t where si = Si(H) = mint «Si(t, if) is the scale of the absolute residuals corresponding to a regression S-estimate with score function pi G , and p2 £ Cfc2 is a score function chosen to attain a higher efficiency. 1.3.6 r - e s t i m a t e s The regression r-estimates were defined by Yohai and Zamar (1988) as and 5(t, H) is the M-scale defined before with p = p\ € Cb2. These estimates can attain a high breakdown point, controlled by pi, and a high efficiency, controlled by p2. The bias behavior of different robust estimates will be discussed in Chapter 3, with a view to selecting a 'good' point estimate for robust inference. 0T = argmin r(t, H), t where 13 1.4 Purpose of this study Adrover, Salibian-Barrera and Zamar (2002) developed the idea of globally robust infer- ence for the location and the simple linear regression models. After showing the conse- quences of ignoring the asymptotic bias of the point estimate, they incorporated the bias bound in the construction of confidence intervals. For robust inference on simple linear regression slope, the authors selected GMS as their point estimate, considering its good bias behavior and asymptotic normality. However, GMS has a breakdown point of only 0.25, and its asymptotic normality is established under very restrictive conditions. Also, the bias bound for the GMS estimate is known only for symmetric carrier distributions, limiting the applications of the method. In this study, we will consider the RMS as an alternative choice of point estimate for robust inference on slope. RMS has a breakdown point of 0.50, its asymptotic normality holds under very general conditions, and the bias bound for RMS is known for general carrier distributions. 1.5 Organization of subsequent chapters In Chapter 2, we will explain the idea of globally robust inference for the location and the simple linear regression models. In Chapter 3, we will justify the selection of the RMS estimate as an alternative to the GMS estimate for robust inference on slope. In Chapter 4, we will apply the methods to two different datasets and compare the results. We will present and discuss the results of a Monte Carlo simulation in Chapter 5. In the final chapter, we will conclude by summarizing our study and pointing towards some topics for future research. 14 Chapter 2 Globally Robust Inference The vast majority of robustness literature focuses on point estimation. There are a few papers that address robust inference, but they do not take into account the possible bias of the point estimates. One exception is Adrover, Salibian-Barrera and Zamar (2002). Our discussion on globally robust inference will be based mainly on this paper. In order to highlight the main ideas, let us consider the following location-scale model: Here, 0 is an unknown location parameter, a is a nuisance scale parameter, and e has a specified distribution F0. Correspondingly, the distribution of y is Fg(y) = Fo((y — 0)/a). To allow for outliers and other departures from this model, we assume that the actual distribution F belongs to the e-contamination neighborhood , Fe(Fe) = {F=(1- e)Fe + eF* : F* any arbitrary distribution}, 0 < e < 1/2. (2.2) According to this model, the majority of the data comes from a central parametric model but a minority may come from an unknown arbitrary distribution. y = 0 + oe. (2.1) 15 According to Adrover et al. (2002), a robust confidence interval should be stable and informative. The robust confidence interval should be stable in the sense of keeping a high coverage level (at or above the nominal level) not only at the central model but also over the contamination neighborhood. The interval should also be informative in the sense of keeping a reasonable average length over the entire neighborhood. These two properties are more precisely stated in the following definition by the authors. Definition 2.1 A confidence interval (Ln, Un) for 9 is called globally robust of level {I —a) if it satisfies the following conditions: 1. (Stable interval) The minimum asymptotic coverage over the e-contamination neigh- borhood is (1 — a): lim inf PF(Ln < 9 < Un) > (1 - a); n->cx> FeFc{Fe) 2. (Informative interval) The maximum asymptotic length of the interval is bounded over the e-contamination neighborhood: lim sup [Un — Ln] < oo. 2.1 Limitations of the classical confidence intervals It can be easily shown that classical Student-t confidence intervals, Xn ± <(„_!)(! - a/2)5n/y /n, fail Part 1 and Part 2 of Definition 1. Let us consider first the contaminated distribution FX0 = (l-e)F0 + eF*, where F* is a point mass distribution at XQ > 0. Then, Ln = X n - t(„_i)(l - a/2)Sn/^/n ex0, 16 and Un = Xn + t(„_i)(1 - a/2)Sn/\/n ^ ex0 as n tends to oo. Therefore, lim inf PF{Ln <e <Un)< lim PF*o(Ln <6<Un)=0. Thus, the classical intervals fail Part 1 of Definition 1. Let us now take F* to be a point mass distribution at ± x 0 (equally weighted). Then, failing Part 2 of the definition. 2.2 Naive intervals: consequences of ignoring bias For robustifying Student's t confidence intervals, it seems natural to replace Xn by a robust asymptotically normal point estimate Tn, and Sn/^fn by a robust estimate of the standard error of Tn. We will call this the naive procedure. It can be shown that the naive confidence intervals satisfy Part 2 but not Part 1 of Definition 2.1. Consequently, the asymptotic coverage proportion of the naive confidence intervals of any nominal level will invariably tend to zero for all e > 0. To illustrate this point Adrover et al. (2002) conducted a Monte Carlo simulation in which they generated 10,000 normal samples of different sizes, containing various fractions of contamination. The contaminating distribution is a point mass distribution at XQ = 4. For each sample, the authors calculated the location M-estimate with Huber V>-function x0y/e = oo, ip(y) = min{-c, max{c, y}}, 17 with truncation constant c = 1.345, and the corresponding 95% confidence intervals based on the empirical asymptotic variance. The following table summarizes the observed coverage levels and the average lengths of these intervals. Table 2.1: Percentage of coverage and average length of naive CI for location parameter e Sample Size % of coverage Average length 0.05 20 92 0.91 50 92 0.60 100 88 0.44 200 82 0.31 0.10 20 91 1.05 50 84 0.68 100 67 0.49 200 39 0.35 0.15 20 88 1.19 50 72 0.76 100 35 0.56 200 5 0.40 0.20 20 82 1.41 50 45 0.92 100 8 0.66 200 0 0.47 The poor coverage levels in the above table are due to the asymptotic bias of the point estimate. Let 9n be a robust estimate of the location parameter, and 9(F) be its asymptotic value under an asymmetric distribution F belonging to the contamination neighborhood. Usually, 9(F) is different from the actual value of the parameter 9, and the asymptotic bias remains the same even if the sample size increases. However, the 18 standard errors of 9n are very small for large sample sizes, and most of the probability mass in the distribution of 9n concentrates on an interval that excludes 9. Ironically, if the data are of uneven quality, large sample sizes are bad for naive confidence intervals. The problem is more severe in the case of simple linear regression. For the model Di = A) + 0\Xi + £i, the classical 100(1 — a)% confidence intervals for Bx are of the form Bx ±zA/2 SE(/3i) where fii is the least squares estimate for Bx and SE(/3i) is the estimated standard error of Bx. For robustifying these confidence intervals, the naive approach is to replace Bx by a robust point estimate, Bf-, and SE(/?i) by a robust estimate of the standard error of B^. To show that the naive confidence intervals are not "stable" (do not satisfy Part 1 of Definition 1), Adrover et al. generated 600 samples (XJ, Ui) of sizes n = 20, 40, 60, 80 and 100 from contaminated normal distributions (l-e)TV(O,1)+eN(n, T2I) with pj = (fjbxjfjLy), r = 0.1, \xx = 3 and fxy = 1.5 (2.0) for e = 0.05 (0.10). The authors refer to this case as the "mild contamination case". In the "strong contamination case" they took \xx = 5 and \xy = 2.5 for e = 0.05 and e = 0.10. They calculated the high breakdown point regression MM-estimates (Yohai, 1987) and their asymptotic standard errors. The nominal confidence level in each case is 0.95. The authors reported the coverage for the slope parameter, which is reproduced in the following table. Clearly, the coverage levels shown in the table are much less than the nominal level, specially for larger sample sizes. As in the case of the location model, the poor coverage levels are due to the asymptotic bias of the point estimate B^ introduced by the contamination in the data. In general, the standard error of Bf- is of order 1 /̂ fn while its bias does not vanish as n goes to infinity. Therefore, for (globally) robust inference, we have to consider the asymptotic bias of the selected estimator, as well as its standard error. In the following two sections, we will discuss robust inference for the location and the simple linear regression models. 19 Table 2.2: Coverage proportion of naive CI for the slope parameter % of Contamination Sample Size Mild Contamination Strong Contamination 5 20 0.88 0.76 40 0.82 0.57 60 0.81 0.35 80 0.62 0.26 100 0.47 0.23 10 20 0.71 0.50 40 0.46 0.25 60 0.27 0.11 80 0.15 0.07 100 0.13 0.05 2.3 Robust inference for the location model First, we should know how to incorporate the asymptotic bias (bias bound) of the point estimate in the construction of confidence intervals, one sided confidence bounds, and p-values. Then, we need to know how to estimate the bias bound of the estimate. 2.3.1 C o n f i d e n c e I n t e r v a l s Suppose that we have a robust point estimate 9n for the parameter 9 which satisfies yfr(§n-6{F))-±rN(0,v2(F)), (2.3) where \9(F) — 9\ < 9 and 9 is an upper bound for the bias of 9(F) as defined in Berrendero and Zamar (2001). 20 For any fixed F G ̂ (Fg), an asymptotic confidence interval of level (1 — a) is given by where ln = ln(F) and rn = rn(F) satisfy the equation PF (~rn < Qn - 0 < ln) = 1 - a. (2.4) We can write PF ( - r n < 9 n - 9(F) + 9(F) -9<ln) = -rn-b 6n-6(F) l n - b \ ,rt — < — — < — = I - a , (2.5) where y/nvn is a consistent estimate for v(F) for all F G T£(Fg), and b = b(F) = 9(F) —0. From equation (2.3) the normal distribution can be used to approximate the left- hand side of equation (2.5). Hence, we can obtain estimates ln and f„ solving the following equation: * [ ^ U * ( ^ ) - l = l-«, (2.6) where $ is the standard normal cumulative distribution function. Since F G ^e(Fg) is unspecified, the bias b = b(F) in (2.6) is unknown and cannot be estimated from the data. Therefore, the endpoints ln and fn may be found in such a way that The coverage of the confidence intervals obtained in this way will be 1 — a not only at the central model, but also over the entire neighborhood. The endpoint ln may be expressed as a function of fn by using Equation (2.6): In = Vn$ 1 2 - a - $ 21 + b, and fn may be chosen in order to minimize the resulting interval length: 'fn + b' -1 2 - a - $ + b + fn. (2.7) Differentiating the right-hand side of Equation (2.7) with respect to fn and setting the derivative equal to zero, we have where (p is the standard normal density function. That is + 1 = 0, fn + b = tpl$-1 2 - a - <S> fn + b From this equation it follows that fn + b = ±<S>~1 2 - a - <E> fn + b The minus sign case is discarded because it can only be satisfied with a = 1. From the other case we obtain (2.8) which yields the solution fn = vn$ 1(l - a/2) - b = vnza/2 ~ b, In = VnZa/2 + b. The corresponding confidence interval for 0 of level greater than or equal to (1 — a) for fixed F e Fe{Fg) is: In(F) = (§n - vnza/2 -b, 9n + vnza/2 - &) . Since In(F) still depends on the unknown F through b, a robust interval of level (1 — a) can be constructed as follows: (2.9) 22 Since \b\ = \b(F)\ < 9 for all F G Te(F0) we have In Q (On ~ VnZa/2 ~9, 9n + VnZa/2 + (?) with equality if 9 is a sharp bias bound, that is if 9 = sup^^j 1̂ (̂ )1- The above robust confidence intervals can be constructed for any estimate 9n that is asymptotically normal, has a consistent estimate of its standard error, and a known bias bound 9. In order to obtain the shortest robust confidence interval we will consider estimates where such a sharp bound is available. The robust confidence intervals have positive asymptotic lengths, which is the price we have to pay for robust coverage. A n alternative approach An alternative robust interval can be constructed following Fraiman, Yohai and Zamar (2001). The main idea is to find qn such that PF(\9n-9\<qn) = l - a . Due to the asymptotic normality of 9n, we can estimate qn by * ( ^ ) + * ( * d ± ) - l = l-a. (2.10) Since qn is a monotone function of the bias, and its largest value is obtained by replacing b by 9, we have 9(hzl)+*(k±l)-1 = 1-a, \ vn J \ vn J which gives the robust confidence interval 9n±qn. The robust interval (2.9) is easier to compute, but slightly longer than that in (2.11). 23 2.3.2 One s ided conf idence b o u n d In this section we will discuss robust lower bounds for the parameter 9. Robust upper bounds can be constructed in a similar way. For any fixed F G F€(Fg), an asymptotic lower bound of level (1 — a) for 9 is given by (0n-*n,Oo) (2.12) where ln = ln(F) satisfies the equation PF (en-6< ln) = l - a . (2.13) Similarly to (2.5), and recalling that b = b(F) = 9(F) — 9, we have that ln must satisfy P F ( 6 n - 0 ( F ) < k - 1 \ l _ a and, as before, for large n, ln(F) = vn<S>-1(l-a)+b. (2.15) A robust lower bound of level 1 — a can now be defined as (L*oo)= | J (c? n-/ n(F),oo). (2.16) F£TE(FG) We have, (0„-Z n (F) ,oo ) = ( ^ - ^ ^ ( l - ^ - ^ o o ) . (2.17) Therefore, (Ln,oo) = (§n - vnza - 6, +oo) . Unlike in the confidence interval case, if we use a construction similar to the one in Fraiman, Yohai and Zamar (2001), we obtain the same lower robust confidence bounds. 24 2.3.3 P-values Let us start by considering the following artificial hypothesis testing problem. Let us assume that the contaminating distribution H in F = (1 — e)Fg + eH is known, that is, that F belongs to the following "translated" family Fe{Fe, H) = {F = (1 - e)Fe + eH, fixed H} (2.18) Suppose that we are interested in testing H0 : 9 < 0O versus Hx : 9 > 90 (2.19) for F e Fe(Fg,H). Using the lower bounds in (2.17), let us define 9n(a, F) = 9n - u n $ _ 1 ( l — a) — b(F), and a test for (2.19) is given by the rejection rule: Reject H0 if 9n(a, F) > 90. This is a size a test since sup P{l„e)Fg+eH [9n(a, (1 - e)Fe + eH) > 90) 0<eo v ' = P { 1 - £ ) F e o + e H (dn(a, (1 - e)F9o + eH) > 6>0) = a. In other words, for the fixed family (2.18) the p-value is given by pn(F) = inf { « : §n(a, F) > #0 j • (2.20) We can calculate pn(F) explicitly because 9n(a,F) is increasing in a and hence pn(F) satisfies 9n(pn(F),F) = 90, or equivalently, Pn(F) = 1 - $ ^ " ~ 6 ) • (2.21) Similarly, for the case H0 : 9 > 90 versus HY : 9 < 90, we have pn{F) = $ ^n~^°~6)- Let us consider the 2-sided hypothesis H0 : 9 = 90 versus Hx : 9 ^ 0O. 25 We construct the p-value for this case based on qn(a,F) defined by the equation The rejection rule would be of the form: reject H0 if \9n — 60\ > qn(a., F). The p-value is p„(F) = inf{a : qn{a, F) < \9n - 90\} . Here, qn(a, F) is a decreasing function of a. Therefore pn(F) solves the equation qn(pn(F),F) = \9n-90\. (2.23) From (2.22) it is easy to see that the solution g(x) to the equation qn(g(x), F) = x is given by We obtain ^ ) - - * ( ^ ) - * ( ^ ) . The p-values (2.21) and (2.25) were obtained for F G T€(F8,H). We can then define a robust p-value for all F G J-€(Fg) as p*= sup pn(F). (2.26) It is easy to see that pn(F) is a monotone function of b = b(F) and hence, if the bias bound 9 is sharp, then the robust p-values are as shown in Table 2.3. The rejection rules associated with the p-values in (2.26) have robust Type I Errors at the expense of lower power. In other words, to guarantee a level-a; test with uncertainty in our model we lose power, in particular for values of the parameter near the null hypothesis. Lower power is a reasonable price to pay to achieve a robust rejection rule. 26 Table 2.3: Robust p-values for location-scale problems. Hypothesis Robust p-values #1 : e > 60 Hx : 6<e0 Hi : 9^90 $ ^L-eo+e) 2.3.4 E s t i m a t i o n o f bias b o u n d Let Tn be any location estimate with asymptotic value T(F). The asymptotic bias is defined in the following invariant way: b(T,F) = \T(F)-9\/o, (2.27) where o is the true error scale parameter. The maximum asymptotic bias was defined earlier as B(e) = sup b(T,F). (2.28) Clearly, \T(F) -6\< oB{e) for all F e Fe. Unfortunately, o is unknown. Let a(F) be the limiting value of the scale estimate an. Then, a more useful bias bound would be B(e) = sup b(T,F), (2.29) with b(T F) - l T { F ) - 6 1 This equation is more useful than (2.28) because we have \T(F) -0\< a(F)B(e) for all F € Ft, 27 and, for large n, if we replace d{F) by a(Fn) the above relationship still holds approxi- mately. However, B(e) is unknown and its theoretical derivation appears to be difficult. A numerical approximation restricting the supremum to point mass contamination would be feasible. However, for the construction of confidence intervals we can choose the fol- lowing simpler approach (Berrendero and Zamar, 2001). By replacing a by cr(F) we do not obtain an upper bound due to the possible underestimation of the scale. An estimated bias bound 9n is given by 6n = ksnB{e), (2.30) with k = sup - FeTea(F) s-(c)' sn = shorth (?/;), and .-(<) = inf F<aTe a The shorth is the standardized length of the shortest half of the sorted data. For the derivation of s~, Lemma 2.1 (at the end of this chapter) can be used. 2.4 Robust inference for the simple linear regression model We will discuss the confidence intervals for the slope parameter in the simple linear regression model, Vi = A) + Pi(xi - fi) +£i, l < i < n , where fi is a location parameter, Xj and are independent, Zi ~ F0, MedianF0{£I) — 0, Xi ~ Go and (xi,yi) are independent. The joint distribution of (xi,yi) under this 28 r model will be denoted by H0. To allow for outliers and other departures from this central parametric model we will assume that the joint distribution H of (xi,yi) belongs to the e-contaminated neighborhood Fe(H0) = {H = (1 - e)H0 + eH* : H* any distribution on K2}, 0 < e < 1/2. The construction of the robust confidence intervals for the slope follows along the lines of the construction of location confidence intervals discussed earlier. Let us assume that (3i is an asymptotically normal and robust point estimate for Bl} V^{Pi- Pi{H)) ^ N{^v\H)), and Pi has a sharp and known bias bound p. We need to find qn such that PH(\Pi-Pi(H)\<qn) = l - a . Due to the asymptotic normality of Pi, we can estimate qn by JhW^)+i(Mm±A).1 = 1 . a . (2.31) Since qn is a monotone function of the bias, and its largest value is obtained by replacing b by p, we have which gives the robust confidence interval Pi ± qn. 2.4.1 E s t i m a t i o n o f t h e bias b o u n d Let Tit7l be estimate for the slope parameter with asymptotic value Ti(H). The invariant asymptotic bias is defined in the following way (Adrover and Zamar, 2000): b(Ti(H)) = ax\Ti(H)-Pi\/a£, 29 where o£ and ox are the residual and explanatory variable scales under the central model HQ. These biases are invariant under affine transformations of the data. The maximum asymptotic bias is given by B1(e)= sup b(Ti,H). Finally, the bias bound Pi is defined by sup |Ti(tf) - A l < sup 4̂TS = Pi- As before, following Berrendero and Zamar (2001) an estimate of the bias bound is given by Pm = ^ k(e) Bi(e) = ^ 1 B{{e), (2.33) @x,n @x,n where k(e) = s+(e)/s~(e), and we can use a£tJl = shorth^ - an - J3nXi) and oXyTl = shorth^) to estimate aE(H) and oX(H) (Rousseeuw and Leroy, 1987). The shorth is bias-minimax in the class of M-estimates of scale with general location (Martin and Zamar, 1993). We need to determine s+(e) and s~(e). Martin and Zamar (1993) showed that l B t _ ^ _ = = _ L _ HeK aX(H) $ _ i ^ 3 - 2 £ ^ s+(e) For s~(e), we can use the following lemma (Adrover et ai, 2002). Lemma 2.1 If (x, y) ~ H € Tt, then, o £ ( E ) _ ^ [ 0 ) _ 1 inf Hert aE $ - i ( f ) s " ( e ) ' where o£(H) = shorth (y — a(H) — P(H)x). A similar result follows for the location case by taking P{H) = 0. 30 The proof of the above lemma is not included here. To obtain relatively short robust confidence intervals we need to use point estimates with small bias bounds. Adrover, Salibian-Barrera and Zamar (2002) used the GMS estimate for robust inference on the simple linear regression slope. As the simulation study results obtained by the authors suggest, the observed coverage levels of the robust confidence intervals are very satisfactory, and they constitute a major improvement when compared to those of the naive approach. However, the GMS estimate has a breakdown point of only 0.25, its asymptotic normality is established under very restrictive conditions, and its bias bound is known only for symmetric carrier distributions. Therefore, a better point estimate is needed. In the next chapter, we will look for the best possible point estimate for globally robust inference on the simple linear regression slope. 31 Chapter 3 The RMS Approach for Robust Inference on Slope In Chapter 2 we discussed globally robust inference for the location and the simple linear regression models. We explained how the bias bound of an estimate should be used in addition to its standard error to construct a robust confidence interval for the slope. The selection of an appropriate point estimate for the slope parameter is now an issue. In classical inference, sampling variability of an estimate is considered to be the only source of uncertainty and, therefore, an estimator with a smaller standard error is preferred so that we can construct relatively short confidence intervals of practical relevance. Of course, the normality or asymptotic normality of an estimate is often considered for theoretical reasons. In robust inference, on the other hand, we consider the bias of an estimator to be at least as important as its standard error (we have shown earlier that the bias of an estima- tor remains the same while its standard error vanishes as n goes to infinity). Therefore, to construct robust confidence intervals, we should prefer an estimate with a smaller bias 32 bound. We should also consider the asymptotic normality of the estimator because the theory on globally robust inference developed so far is based on this assumption. 3.1 Reasons for the selection of RMS Most of the available maxbias functions for the slope estimates have been derived assum- ing that the intercept parameter is known. As an exception, Hennig (1995) derived the maxbiases of MM- and r-estimates of the intercept and the slope parameters. However, these estimators have very large bias bounds even in the case of known intercepts. The least median of squares (LMS) estimate (Rousseeuw, 1984) has the smallest maxbias in the class of residual admissible estimates with known intercepts. However, the class of residual admissible estimates has maxbiases larger than the maxbiases of the three median-based estimates MPS, GMS and RMS. In fact, for all residual admissible estimators, the maxbias function can be expressed as follows: B(e) ^ K ^ e + O^e), whereas, the maxbias function for those three median-based estimators can be expressed as B(e) ^K2e + 0(e), where K\ and K2 are constants (Martin, Yohai & Zamar, 1989, and Berrendero & Zamar, 2001). Since 0 < e < 1, we have e < y/e, and the three median-based estimates have maxbiases considerably smaller. Therefore, for globally robust inference, it is reasonable to select one of these three estimates. Table 3.1 below (extracted from Adrover and Zamar, 2000) displays the maxbiases of these median-based regression estimates assuming normally distributed explanatory variable and regression error under the central model. The values in the second column 33 (labeled MS) are the lowest maximum bias attainable in the class of affine and regression equivariant estimates, when the intercept is equal to zero (Martin et a/., 1989). The last column (labeled LMS) gives the lowest maximum bias attainable in the class of residual admissible estimates when the intercept is equal to zero (Yohai and Zamar, 1993). Table 3.1: Maxbiases for several median-based estimates. MS GMS MPS RMS LMS e Slope Intercept Slope Intercept Slope Intercept Slope Slope 0.010 0.014 0.013 0.016 0.013 0.032 0.013 0.019 0.220 0.025 0.039 0.032 0.041 0.032 0.082 0.032 0.046 0.357 0.050 0.081 0.066 0.088 0.067 0.171 0.066 0.096 0.528 0.100 0.174 0.143 0.201 0.150 0.386 0.142 0.198 0.826 0.150 0.282 0.237 0.361 0.251 0.689 0.235 0.339 1.140 0.200 0.411 0.378 0.639 0.502 1.219 0.357 0.505 1.515 0.240 0.538 0.667 1.299 0.993 2.227 0.489 0.668 1.898 0.250 0.574 oo oo 1.259 2.747 0.563 0.730 1.999 0.300 0.792 oo oo oo oo 0.817 1.042 2.739 0.350 1.120 oo oo oo oo 1.367 1.564 3.960 The maxbiases of MPS are larger than those of GMS and RMS. While GMS has slightly smaller biases than RMS for e < 0.05, RMS has smaller biases than GMS for e > 0.10. Adrover, Salibian-Barrera and Zamar (2002) selected GMS as their point estimate for robust inference on slope for the following reasons: • GMS shows a good bias performance. • The asymptotic normality of the GMS estimate can be proved under general con- ditions (allowing for most distributions in the contamination neighborhood) 34 We will now discuss some limitations of the GMS approach, and some advantages of RMS over GMS as a point estimate for robust inference. Some l i m i t a t i o n s o f t h e G M S a p p r o a c h The following are some problems that we identified in the GMS approach: • To prove the consistency and the asymptotic normality of the GMS estimates, one of the regularity conditions used by Adrover, Salibian-Barrera and Zamar (2002) is that the carrier distribution Go is symmetric. In practice, this condition may not be satisfied. • The bias bound of the GMS estimate (Adrover and Zamar, 2000) is valid only when the carrier distribution Go is symmetric. • GMS has a breakdown point (e*) of 0.25, while other median-based estimates have larger breakdown points. A d v a n t a g e s o f t h e R M S a p p r o a c h The RMS method may be preferred to the GMS approach for the following reasons: • The regularity conditions for the asymptotic normality of the RMS estimate (to be discussed later) are more general than those of the GMS estimate. • The bias bound of the RMS estimate (Adrover and Zamar, 2000) is valid without the symmetry assumption of the carrier distribution Go- • The breakdown point of RMS is 0.50, the maximum for all affine equivariant re- gression estimates. 35 • Though for small fractions of contamination (e < 0.05) GMS has smaller biases, RMS is a very close competitor. And, for large fractions of contamination RMS has smaller biases. In our opinion, the overall bias performance of RMS is better than that of GMS. • RMS is both locally and globally robust. Based on the above considerations, we decided to use RMS for the construction of robust confidence intervals, and the calculation of robust p-values. We will now discuss the bias behavior and the asymptotic properties of RMS. Adrover and Zamar (2000) derived the maxbias of RMS. We will start by showing (Huber, 1981) that the maximum bias of the median functional over the e-contamination neigh- borhood of a general distribution function G0 (not necessarily symmetric) is attained by placing a point mass contamination at plus or minus infinity. Therefore, the maxbias of the median is given by To derive the maxbias of the RMS-slope, we need some notation. Let us take qM{a,b,H) and f3RMS(H) as in the definition of RMS (Chapter 1). In addition, given a general univariate distribution function F, let us define the quantiles 3.2 Maxbias of the RMS estimate m = m(e) = max and (3.2) Finally, let us consider and (3.3) 36 and QL(H) = qL(gH(QL(x2,y2,H))) and Qu(H) = q u {GH {Qu{x2,y2,H))). (3.4) Theorem 3.1 (Maxbias of RMS-slope) Suppose that Fo is a symmetric distribution with unimodal density function fo. Then, the maxbias of RMS slope estimate is B«MS(e) = max {\QL(HQ)\, \Qu(HQ)\} , where Qi and Qu are given by (3.4)- Proof: For all H in T£, fixed number t and function g, we have (1 - e)PHo (g(xu yx) < t) < PH (g(Xl, Vl) < t) < (1 - e)PHo (g(Xl, Vl) <t) + e. (3.5) In particular, by taking g(xi,yx) = (y\ — b)/[x\ — a) we get (1 - e)PHo (^—- <t) <P„ (^—- < () < (1 - £ )P H o (*zl<t)+e, (3.6) \Xi — a ) \xi — a J \Xi — a ) and Qi(a, 6, H0) < qM(a, b, H) < Qu(a, b, H0), for all H in Tt and all a, b. Since the median is a monotone operator, we also have Med g*H QL{x2, y2, H0) < Med Q*H qM{x2, y2, H) < Med Q*H Qu{x2, y2, H0), for all H in TE. (3.7) Moreover, using (3.7) and taking g(x2,y2) = Qi{x2,y2,H) in (3.5) we obtain QL(HO) < Med g*HQL{x2,y2,H0) and Med Q*HQu{x2,y2,H0) < Qu(H0) for all H in and, therefore, QL(HO) < PRMS < Qu{H0) for all H in !F£. The theorem follows now because the upper and lower bounds above are attained by taking limit over a sequence of contaminated distributions. 37 3.3 Asymptotic properties of the RMS estimate Siegel (1982) showed that when all X ; are distinct (an event with probability 1 if G is continuous), the RMS estimate (3n has a finite-sample breakdown point e* = [n/2]/n, that is, if fewer than [n/2] vectors Zj are changed, the estimate remains bounded. This yields an asymptotic breakdown point of 0.5. Siegel also showed that (3n is a Fisher-consistent estimate of 8. Hossjer, Rousseeuw and Croux (1994) established the asymptotic normality of the RMS slope estimate. The authors assumed the following regularity conditions: (F) The error distribution F is absolutely continuous, F _ 1 (0 .5) = 0, and the density / is bounded ( | | / | |oo < co) and strictly positive. (G) The distribution G of the carriers is continuous, G _ 1(0.5) = 0, and G has a positive and continuous density g around 0 with g(0) > 0. Theorem 3.2 Let us consider the simple linear regression model yi = a + f3xi + ei, i = 1,... ,n, with the error and carrier distributions satisfying conditions (F) and (G) ; re- spectively. Then 1 " Vn~0n - B) = -= ^ lF(xu Vi) + Op(l) N(0, o2) asn-^oo, (3.8) Y i=l where W(x v) = S i S n { x y " P ) (3 9) I H X ' V ) 2f(0)EG\X\ ^ and a = 2 m E a \ X [ ( 3 1 0 ) 38 The authors proved this theorem through a series of lemmas. The proof is extremely long, non-trivial and involved, and is not included here. Interestingly, the asymptotic variance for RMS (formula 3.10) is the same as the asymptotic variance for GMS derived by Adrover, Salibian-Barrera and Zamar (2002). The difference is that, the regularity conditions for the GMS variance are very restrictive while the RMS variance is derived under more general conditions. One problem with the asymptotic normality of the RMS estimate is that the con- vergence to the asymptotic behavior is extremely slow. In order to check whether the asymptotic variance of the RMS slope estimate provides a good approximation to the to its variance at finite samples, Hossjer et al. (1994) carried out a Monte Carlo experiment. They considered both G and F to be equal to the standard Gaussian distribution, for which we get the asymptotic variance 7r 2/4 as 2.47. For each n in Table 3.1 the authors generated m = 10,000 samples of size n, computed the corresponding slope estimates i3nk) for k = 1, 2, . . . , m, and obtained the n-fold variance, nVar F C (/3«), which should converge to 2.47 as n tends to oo. For n < 40 the n-fold variances are decreasing with n. After that, for n up to about 1000, the n-fold variances stay around 1.65, after which they slowly increase. For n around 40,000, we get a value of 1.86, which is still much less than 2.47. Unfortunately, the approximation (to the finite sample variance) provided by the asymptotic variance of the RMS estimate is not very satisfactory. Therefore, in addition to this asymptotic variance formula, we will use the bootstrap distribution of RMS to estimate its finite sample variability in the Monte Carlo simulation in Chapter 5. 39 Table 3.2: Simulation results of RMS estimate n n-fold variance 10 2.62 20 1.88 40 1.67 60 1.67 100 1.63 200 1.63 300 1.66 500 1.64 800 1.62 1000 1.67 2000 1.83 3000 1.80 5000 1.82 10000 1.75 20000 1.85 40000 1.86 oo 2.47 40 Chapter 4 Application In Chapter 3, we proposed the RMS method as an alternative to the GMS method for globally robust inference on simple linear regression slope. We will now apply these two methods, along with the classical (LS) approach and a naively robustified approach (using MM), to two real datasets. 4.1 Motorola vs Market These data were published in Berndt (1994). The Motorola data include ten years of monthly returns of Motorola shares over the time period January 1978 to December 1987 (for 120 months). The Market data include value-weighted composite monthly market returns based on transactions of the New York Stock Exchange and the American Exchange over the same 10-year time span. The returns on 30-day US Treasury bills are also provided. Adrover, Salibian-Barrera and Zamar (2002) used these data as an example for 41 robust inference on simple linear regression slope with the GMS estimate. The response variable is the difference between the monthly Motorola returns and the returns on 30- day US Treasury bills. The explanatory variable is the difference between the monthly Market returns and the returns on 30-day US Treasury bills. The financial economists fit a straight line to this type of data. The slope measures the riskiness of the stock, the larger the slope the riskier the stock. We used RMS to estimate the regression parameters. The following table gives RMS slope estimate along with GMS, MM and LS estimates. Table 4.1: Different slope estimates Method of Estimation $ RMS 1.11 GMS 1.21 MM 1.34 LS 0.85 The estimates for the intercept parameter are equal to zero (up to the second decimal place). Figure 4.1 contains a scatter plot of the data and the four fitted lines. From a financial point of view, the conclusions differ very widely for the four different regression methods. According to the LS method, Motorola's stocks are safer than the market. According to RMS, GMS and MM methods, Motorola's stocks are riskier than the market. The following hypotheses may be of possible interest in this case H0 : 8>1 versus HY : 8 < 1. If the null hypothesis is rejected, an investor would like to invest on this stock. The conclusions from the analyses using LS, MM, GMS and RMS methods are given below. 42 Figure 4.1: A scatter plot and four different regression lines The LS method The estimated standard error of the LS slope estimate is 0.105. The LS approach rejects H0 at level a = 0.1 (the p-value is 0.074). The corresponding residual plot (not shown here) does not indicate the presence of outliers. The M M method Adrover et al. used lmRobMM from Splus for this naively robust approach. The esti- mated standard error of the MM slope estimate is 0.274. Since "The bias is high", this approach is indecisive and the recommendation is not to perform inference based on the final estimate. If we ignore the warning and proceed to test our hypotheses, we get a p-value of 0.890 and the null hypothesis cannot be rejected. 43 The G M S method For this globally robust approach, we need to determine a plausible value for e. Adrover et al. used the GMS residual plot, which shows one clear outlier out of the 120 observations. They reasonably selected e = 0.01. Since we have ^GMS = 1.41, ftGMS the bias bound is 0.0225. The standard error of the GMS slope estimate is estimated to be 0.169 by using the shorth of the bootstrap distribution of /3n. Therefore, the robust p-value is pGMS = 21 - 1 + 0.0225)/0.169] = 0.914, and we cannot reject the null hypothesis. The Motorola stocks are not a safe investment. The R M S method To estimate the standard error of the RMS slope estimate, we used the shorth of the bootstrap distribution of J3n, and obtained a value of 0.154. Like the GMS case, the RMS residual plot (Figure 5.2) shows one clear outlier out of the 120 observations, and we can use e = 0.01. We have frRMS fyRMS = i - 4 2 9 ' ux and the bias bound is 0.0278. Therefore, the robust p-value is pRMS = _ ! + o.0278)/0.154] = 0.815, and we cannot reject the null hypothesis. Again, the Motorola stocks are not a safe investment. 44 Figure 4.2: RMS residuals against RMS fitted values -0.2 -0.1 0.0 0.1 RMS Fitted Values Discuss ion According to the LS method, the Motorola stocks are a safe investment. This method considers all the data points, and because of some outlying 'good' returns, the LS method leads us to a wrong conclusion. The other three methods give very large p-values, leaving no room for rejecting the null hypothesis. However, the bias is high for the MM method, and the inference based on this naive method is not recommended. Though the GMS method gives a larger p-value as compared to the RMS method, both of these methods are very conservative, and conclude that Motorola's stocks are NOT safer than the market. 45 4.2 Volume vs Height of trees This dataset is courtesy of Dr. Harry Joe. The data consist of girth, volume and height of each of 31 trees, and are presented in Table 4.2. It is reasonable to assume that 'Height' is a good predictor for the response variable 'Volume'. Figure 4.3 contains a scatter plot of these two variables. A straight line seems to be a good fit for the data, except for a group of points. Figure 4.3: A scatter plot of Volume vs Height o CD <D E i § i i i i r 65 70 75 80 85 Height 46 As in the first example, we used four methods for the estimation of the regression parameters in this case: LS, RMS, GMS and MM. Figure 5.3 shows the four fitted lines. The LS slope (1-54) is the largest of all, with the GMS slope (1.43) as the closest competitor. The RMS and the MM slopes are much smaller (1.14 and 1.03, respectively). At a first glance, this seems to be contrary to our expectation, because there is a seemingly evident linearity along the diagonal of the scatter plot and the LS line is closer to this than the robust lines are. The RMS and the MM methods seem to have missed this particular linearity! Figure 4.4 The four regression lines LS • MM GMS RMS • • • i • - * 1"' • • j^--" • ^' I ' ' " m • i 1 1 1 r 65 70 75 80 85 Height To better understand what is going on, let us look at the plot of the residuals against the fitted values for each of the four methods. Figure 5.4 presents the residual plot for the LS method. The zero line and the 2SE line are shown, the —2SE line is beyond the plot. One point (tree number 31) is lying outside the 2SE limit. 47 Table 4.2: Girth, volume and height of trees Tree Number Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2 4 10.5 72 16.4 5 10.7 81 18.8 6 10.8 83 19.7 7 11.0 66 15.6 8 11.0 75 18.2 9 11.1 80 22.6 10 11.2 75 19.9 11 11.3 79 24.2 12 11.4 76 21.0 13 11.4 76 21.4 14 11.7 69 21.3 15 12.0 75 19.1 16 12.9 74 22.2 17 12.9 85 33.8 18 13.3 86 27.4 19 13.7 71 25.7 20 13.8 64 24.9 21 14.0 78 34.5 22 14.2 80 31.7 23 14.5 74 36.3 24 16.0 72 38.3 25 16.3 77 42.6 26 17.3 81 55.4 27 17.5 82 55.7 28 17.9 80 58.3 29 18.0 80 51.5 30 18.0 80 51.0 31 20.6 87 77.0 48 Figure 4.5: LS residuals against LS fitted values LS Fitted Values Figure 4.6: GMS residuals against GMS fitted values GMS Fitted Values 49 Figure 4.7: RMS residuals against RMS fitted values i r~ 15 20 RMS Fitted Values Figure 4.8: MM residuals against MM fitted values MM Fitted Values 50 The residual plot for the GMS method is shown in Figure 4.6. The —2SE line is beyond the plot. This time, four points (tree numbers 28, 29, 30 and 31) are lying outside the 2 SE limit. Figure 4.7 and Figure 4.8 are the residual plots for the RMS and the MM methods, respectively. For the RMS method, the —2SE line is beyond the plot. The same six points (tree numbers 26, 27, 28, 29, 30 and 31) are lying outside the 2SE limit for both the RMS and the MM methods. To learn more about these six trees, we carefully scrutinized their girths, heights and volumes (Table 4.2). Surprisingly, they have the highest girth values, which means that these six trees are older than the others! It is reasonable to assume that the Height- Volume relationship for the old trees is different from that for the young trees, with a larger slope for the old ones. Let us look again at the scatter plot (Figure 4.3), more carefully this time. The linearity exhibited along the diagonal does not represent the majority of the data. The six points corresponding to the six 'old' trees are at the end of the diagonal. If we ignore these points, a straight line with a much smaller slope seems appropriate. There is heterogeneity in the data, and the MM and the RMS methods identify this heterogeneity almost perfectly. T e s t i n g o f hypo theses Buyers might be interested in assessing the volume of wood in a group of trees, and they would not like to invest unless the amount of wood makes it a safe investment. Moreover, since seasoned wood is more useful, age of the trees should also be considered while making a decision. Considering these, the buyers may want to test the following 51 hypotheses H0 : B<1 versus Hx : B > 1. If the null hypothesis is rejected, a buyer would like to go for investment considering that there is a good amount of wood and the wood is seasoned (age of the trees is also reflected in the slope parameter). The conclusions from the analyses using LS, MM, GMS and RMS methods are given below. The LS method The estimated standard error of the LS slope estimate is 0.3839. The LS approach rejects HQ at level a = 0.1 (the p-value is 0.083). The M M method We used lmRob from the library 'robust' of Splus6 for this naively robust approach. The estimated standard error of the MM slope estimate is 0.355. Since "The bias is high", this approach is indecisive and the recommendation is not to perform inference based on the final estimate. If we ignore the warning and proceed to test our hypotheses, we get a p-value of 0.466 and we cannot reject the null hypothesis. The G M S method We used the GMS estimate both in the naively robust and in the globally robust ap- proaches: • The standard error of the GMS slope estimate is estimated to be 0.351 by using the shorth of the bootstrap distribution of Bn. Without any adjustment for the 52 bias (which is equivalent to using e — 0), the p-value is 0.098, and H0 is rejected at level a = 0.1. • For the globally robust approach, we need to determine a plausible value for e. According to the GMS residual plot, there is one clear outlier out of the 31 obser- vations. It seems reasonable to use e = 0.03. Since frGMS ^GMS = L 8 2 > x the bias bound is 0.0986. Therefore, the robust p-value is and we cannot reject the null hypothesis at the 10% level. The R M S method The RMS estimate is also used both in the naively robust and in the globally robust approaches. The standard error of the RMS slope estimate is estimated to be 0.362 by • Without any adjustment for the bias of the point estimate, the p-value is 0.349, and H0 cannot be rejected at level a = 0.1. • Like in the GMS case, the RMS residual plot shows one clear outlier out of the 31 observations. We can use e = 0.03. In this case, ±.RMS 1 - $[(1.45 - 1 - 0.0986)/0.351] = 0.158, using the shorth of the bootstrap distribution of ftRMS 1.59, X and the bias bound is 0.0957. Therefore, the robust p-value is -RMS 1 - $[(1.14 - 1 - 0.0957)/0.362] = 0.451, and we cannot reject the null hypothesis. 53 Discussion According to the LS and the naive GMS methods, investment is safe. The LS method is seriously affected by a few 'old' trees - the slope estimate is larger than the 'true' value, and the investment appears to be safer than it really is. The naive GMS approach is misleading too because it does not take into account the possible bias of the estimate. Since the estimates obtained by the MM and the RMS methods are close to 1, we get large p-values even when we use the methods naively. However, the bias test is significant for the MM method, and the inference based on this method is not recommended by Splus. About the naive RMS approach, though the p-value obtained is very large, we do not recommend this method for the following reasons. In chapter 2, we discussed theoretically the consequences of ignoring the asymptotic bias of the point estimate in robust inference. Moreover, in this particular example, we have seen the consequences of ignoring the bias in the case of naive GMS approach. Comparing the bias-adjusted approaches based on the GMS and the RMS estimates, we strongly recommend globally robust inference with the RMS estimate. 54 Chapter 5 A Monte Carlo Study In this chapter we will present the results of a simulation study that we conducted to investigate the finite sample coverage levels of globally robust confidence intervals for the slope of the simple linear regression model. Adrover, Salibian-Barrera and Zamar (2000) conducted a similar study using the GMS slope estimate. Based on the considerations of Chapter 3 of this thesis, we will use the RMS slope estimate and compare our results with those obtained by Adrover et al. Though Hossjer, Rousseeuw and Croux (1994) established the asymptotic normality of the RMS estimate (and determined its asymptotic variance) without the symmetry assumption of the carrier distribution, we will consider bivariate normal distribution as our central model. We have two reasons for this decision: (1) the maxbiases of the RMS estimate are readily available for normally distributed explanatory variable and regression error under the central model (Table 3.1 extracted from Adrover and Zamar, 2001) and (2) since this is the same model that was considered by Adrover et ai, our study will be more comparable to their study. 55 5.1 Design of the study We will consider four different choices of vn, the estimated standard error of the RMS estimate: Method 1 (Empirical Asymptotic Variance): The variability is estimated by formula 3.10. Assuming F 0 = N(0,oj) and G0 = N(0,ol), we have ^ 2nE2Go\X\ Anal Therefore, 7 7 °£>n tzi\ Vn = 7T1= ^H- (5-1) Method 2 (Classical Bootstrap): The variability is estimated by using the standard deviation of the bootstrap distribution of Bn. Method 3 ( "Shorth Bootstrap"): The variability is estimated by using the shorth of the bootstrap distribution of J3n. Method 4 ("MAD Bootstrap"): The variability is estimated by using the MAD of the bootstrap distribution of Bn. In our Monte Carlo simulation we used m = 1000 replicates of each of the following sampling situations: sample sizes n = 20, 40, 60, 80, 100 and 200 from contaminated normal distributions (1 — e)N(0,1) + eN(fi, T2I) with fi' = (fix, /j,y), r = 0.1, p,x = 3 and fj,y = 1.5 (2.0) for e = 0.05 (0.10). We will refer to this case as the "mild contamination case". In the "medium contamination case" we took JJLX = 5 and \xy = 2.5 for e = 0.05 and e = 0.10. We also considered a "strong contamination case" by taking \xx = 5 and fiy = 15 for e = 0.05 and 0.10. 56 Thus, there are three different contamination types (mild, medium and strong) with two different percentages of contamination for each type. Six different sample sizes are considered for each of these six cases. In total, there are 36 different sampling situations. The nominal confidence level in each situation is 0.95. To estimate the variability of J3n by the three bootstrap approaches (Method 2, Method 3 and Method 4), we used B = 1000 bootstrap samples from each of the 1000 replicates of each sampling situation. 5.2 Numerical results The observed coverage levels and the median lengths of the robust confidence intervals obtained by the four methods are presented in Tables 5.1, 5.2, 5.3 and 5.4, respectively. At first we will compare the different results in these four tables. Then we will compare our results with those obtained by Adrover et al. C o m p a r i s o n o f t h e f o u r m e t h o d s Following are the major observations: Method 1 (Table 5.1): Most of the observed coverage levels are above 95%. The median lengths reported in this table are the largest of all four methods. Method 2 (Table 5.2): Most of the coverages are above 95%. The median lengths of this table are larger than the corresponding median lengths for Method 3 and Method 4. Method 3 (Table 5.3): Except for some of the medium contamination cases, most of the coverages are between 90% and 95%. The median lengths in this table are the smallest. 57 Table 5.1: Coverage proportion (median length) of robust CI for the slope by Method 1. % of Contamination Sample Size Mild Contamination Medium Contamination Strong Contamination 5% 20 0.92 (1.28) 0.96 (1.28) 0.91 (1.37) 40 0.96 (0.99) 0.94 (1.02) 0.95 (1.03) 60 0.97 (0.86) 0.96 (0.86) 0.95 (0.89) 80 0.97 (0.77) 0.96 (0.79) 0.96 (0.80) 100 0.96 (0.71) 0.96 (0.72) 0.96 (0.74) 200 0.96 (0.57) 0.98 (0.58) 0.97 (0.58) 10% 20 0.89 (1.39) 0.89 (1.36) 0.91 (1.68) 40 0.96 (1.16) 0.95 (1.18) 0.97 (1.34) 60 0.96 (1.09) 0.91 (1.07) 0.95 (1.19) 80 0.93 (1.04) 0.94 (1.01) 0.97 (1.11) 100 0.93 (0.97) 0.96 (0.98) 0.97 (1.05) 200 0.97 (0.85) 0.93 (0.84) 0.98 (0.89) Method 4 (Table 5.4): Most of the coverages are between 90% and 95%. The median lengths for this method are smaller than those of Method 1 and Method 2, but larger than those of Method 3. Table 5.1 shows some overcoverages and the lengths of the confidence intervals are also relatively large. An explanation of this may be found in Table 3.1, which shows that the asymptotic variance (formula 3.10) overestimates the variablity of (3n exhibited in finite samples. Similarly, the overcoverages shown in Table 5.2 and the undercoverages in Table 5.3 are due to the overestimation and the underestimation, respectively, of the 'true' standard error of J3n. The observed coverage levels for Method 4 are mostly close to the nominal level. This method seems to estimate the standard error of Bn better than the other methods. 58 Table 5.2: Coverage proportion (median length) of robust CI for the slope by Method 2. % of Contamination Sample Size Mild Contamination Medium Contamination Strong Contamination 5% 20 0.97 (1.51) 0.97 (1.49) 0.98 (1.91) 40 0.95 (0.95) 0.94 (0.97) 0.97 (1.10) 60 0.93 (0.80) 0.94 (0.80) 0.97 (0.89) 80 0.96 (0.71) 0.93 (0.72) 0.96 (0.78) 100 0.94 (0.65) 0.95 (0.67) 0.96 (0.71) 200 0.96 (0.53) 0.96 (0.54) 0.97 (0.55) 10% 20 0.94 (1.63) 0.94 (1.55) 0.99 (2.51) 40 0.97 (1.21) 0.95 (1.14) 0.99 (1.57) 60 0.96 (1.09) 0.88 (1.02) 0.98 (1.30) 80 0.94 (1.02) 0.93 (0.96) 0.99 (1.16) 100 0.93 (0.96) 0.94 (0.93) 0.99 (1.08) 200 0.96 (0.84) 0.91 (0.81) 0.99 (0.90) C o m p a r i s o n o f R M S a n d G M S For the results of the GMS approach, the three tables provided by Adrover et al. (2002) are referred to. The comparison of Method 1 of RMS and that of GMS is interesting. If we focus on the median lengths, we notice that they are more or less equal. However, if we concentrate on the observed coverage levels, we see that for RMS they are mostly over 95% while for GMS they are mostly below 95%. The explanation is as follows. We mentioned earlier that the asymptotic variance formula for these two estimates are the same (derived under two different sets of regularity conditions, though). For RMS, this formula overestimates the variability (Table 3.1) while 59 Table 5.3: Coverage proportion (median length) of robust CI for the slope by Method 3. % of Contamination Sample Size Mild Contamination Medium Contamination Strong Contamination 5% 20 0.88 (1.19) 0.89 (1.15) 0.94 (1.41) 40 0.91 (0.86) 0.87 (0.88) 0.94 (0.98) 60 0.90 (0.76) 0.91 (0.76) 0.94 (0.82) 80 0.93 (0.68) 0.92 (0.70) 0.95 (0.73) 100 0.92 (0.62) 0.93 (0.65) 0.94 (0.67) 200 0.94 (0.52) 0.96 (0.53) 0.96 (0.54) 10% 20 0.85 (1.36) 0.85 (1.25) 0.96 (1.84) 40 0.90 (1.13) 0.84 (1.03) 0.98 (1.37) 60 0.90 (1.05) 0.84 (0.97) 0.96 (1.21) 80 0.94 (1.00) 0.90 (0.93) 0.97 (1.12) 100 0.93 (0.94) 0.91 (0.91) 0.98 (1.04) 200 0.94 (0.82) 0.90 (0.80) 0.98 (0.88) for GMS it underestimates the variability (Adrover et ai, 2002). The maxbiases of these two estimates, on the other hand, are very close (slightly smaller for GMS for e = 0.05, and slightly smaller for RMS for e = 0.10). Therefore, for the same lengths, RMS has overcoverage while GMS has undercoverage. One point is evident from the above comparison. The finite sample variability of RMS is less than that of GMS, which is also reflected in Method 2 and Method 3 results of the two estimates. For each of these methods, the median lengths obtained by RMS are smaller than those obtained by GMS, while the observed coverage levels are more or less equal. Method 4 is not comparable since Adrover et al. did not consider this method for the GMS variability estimation. 60 Table 5.4: Coverage proportion (median length) of robust CI for the slope by Method 4. % of Contamination Sample Size Mild Contamination Medium Contamination Strong Contamination 5% 20 0.91 (1.28) 0.93 (1.26) 0.96 (1.53) 40 0.92 (0.90) 0.91 (0.93) 0.95 (1.02) 60 0.92 (0.79) 0.93 (0.79) 0.95 (0.85) 80 0.94 (0.70) 0.93 (0.72) 0.96 (0.76) 100 0.93 (0.64) 0.95 (0.67) 0.95 (0.69) 200 0.95 (0.52) 0.96 (0.54) 0.97 (0.55) 10% 20 0.92 (1.47) 0.88 (1.35) 0.97 (1.99) 40 0.93 (1.19) 0.90 (1.90) 0.99 (1.43) 60 0.94 (1.08) 0.88 (1.02) 0.97 (1.24) 80 0.94 (1.02) 0.93 (0.96) 0.98 (1.14) 100 0.94 (0.96) 0.95 (0.93) 0.99 (1.06) 200 0.94 (0.83) 0.91 (0.81) 0.98 (0.89) 5.3 Discussion The numerical results in the four tables show that the coverage of the robust confidence intervals are in general fairly good (close to the nominal level) and constitute a major improvement when compared to those achieved by the naive procedure (Table 2.2). Asymptotic variance of RMS overestimates its finite sample variability, so does the standard deviation of the bootstrap distribution of (3n. On the other hand, shorth of the bootstrap distribution underestimates the variability. The performance of MAD bootstrap seems to be better than the others, the coverages for this method are closer to the nominal level than those for the other methods. We would recommend this method for the estimation of the finite sample variability of RMS. 61 The overall performance of RMS is better than that of GMS. Either the observed coverages of RMS intervals are larger with the same lengths, or the lengths of RMS are smaller while maintaining the same coverages. Based on these considerations, we recommend RMS for robust inference on the slope of simple linear regression model. 62 Chapter 6 Conclusion In this concluding chapter we will summarize our findings and identify some topics for further research. 6.1 Summary The main results of this thesis may be summarized as follows: 1. In the construction of robust confidence intervals, if we ignore the uncertainty due to the bias of the point estimate, we will get asymptotic coverage level zero. 2. To incorporate the bias bound of an estimate in addition to its standard error in robust inference, the method proposed by Adrover, Salibian-Barrera and Zamar (2002) may be used. 3. A point estimate that has an asymptotically normal distribution and a relatively small bias bound should be preferred. 63 4. For robust inference on the simple linear regression slope, the problems of the GMS estimate proposed by Adrover et al. (2002) are that it has a breakdown point of 0.25, and its asymptotic normality is established under very restrictive conditions. 5. In this study, we proposed the RMS estimate, for which the breakdown point is 0.50, and the asymptotic normality holds under very general conditions. We applied the RMS method to two real datasets. In the first example, the RMS method performed almost as well as the GMS approach, both of them concluded that the investment was risky while the least squares approach indicated otherwise. The second example was a more challenging problem, and the RMS method performed much better than the GMS. The outlying 'old' trees were identified by RMS almost perfectly, and while GMS was 'shaky' in making a decision (as compared to RMS), RMS was clearly conservative and considered the investment to be NOT good. In the Monte Carlo study, the RMS method achieved, more or less, the same ob- served coverage levels while it constructed intervals of smaller lengths, as compared to GMS. Regarding the four methods of estimation of the standard error of RMS, the asymp- totic variance formula and the classical bootstrap were overestimating while the shorth bootstarp was underestimating. The MAD bootstrap may be preferred. Based on our findings, we recommend RMS for globally robust inference on the simple linear regression slope. 6.2 Further study The following points form some interesting areas for future research: 1. For prediction with simple linear regression models, we may be interested in the 64 linear combination of the slope and the intercept parameters. The asymptotic bivariate normality of these parameters may be established for this purpose. We will also need the appropriate bias bound for the intercept parameter. 2. The bias bounds of the RMS slope and intercept parameters are available only under the assumptions of normally distributed explanatory variable and regres- sion error under the central model. These bounds may be obtained under more general conditions, for example, without the symmetry assumption of the carrier distribution. 3. RMS can also be used for estimating the slope parameters in multiple linear regres- sion, using kernel functions with more than two arguments (Siegel, 1982). However, these estimators are not affine equivariant when the number of slope parameters is two or more. The asymptotic properties of RMS in higher dimensions constitute another interesting area of study. 6 5 B i b l i o g r a p h y Adrover, J. G., Salibian-Barrera, M., and Zamar, R. H. (2002). Globally robust inference for the location and simple linear regression models. J. Statist. Plann. Inference: (accepted for publication). Adrover, J. G. and Zamar, R. H. (2000). Bias robustness of three median-based regression estimates. Technical Report No. 194, Department of Statistics, University of British Columbia, Canada. Berrendero, J. R. and Zamar, R. H. (2001). Maximum bias curves for robust regression with non-elliptical regressors. Ann. Statist., 29: 224-251. Boscovich, R. J. (1757). De litteraria expeditione per pontificiam ditionem, et synop- sis amplioris operis. Bononiensi Scientiarum et Artum Instituto atque Academia Commentarii, 4: 353-396. Brown, G. W. and Mood, A. M. (1951). On median tests for linear hypotheses, in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Univ. of California Press, Berkeley, pages 159-166. Fraiman, R., Yohai, V. J., and Zamar, R. (2001). Optimal robust M-estimates of location. Ann. Statist, 29: 194-223. Frees, E. (1991). Trimmed slope estimates for simple linear regression. J. Statist. Plann. Inference, 27: 203-221. 66 Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc., 69: 383-393. Hossjer, O., Rousseeuw, P. J., and Croux, C. (1994). Asymptotics of the repeated median slope estimator. Ann. Statist., 22: 1478-1501. He, X. and Simpson, D. G. (1993). Lower bounds for contamination bias: Globally minimax versus locally linear estimation. Ann. Statist., 21: 314-337. Hennig, C. (1995). Efficient high-breakdown point estimators in robust regression: Which function to choose? Statistics & Decisions, 13: 221-241. Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35: 73-101. Huber, P. J. (1981). Robust Statistics. Wiley, New York. Martin, R. D., Yohai, V. J., and Zamar, R. H. (1989). Min-max bias robust regression. Ann. Statist, 17: 1608-1630. Martin, R. D. and Zamar, R. H. (1989). Asymptotically min-max bias robust M-estimates of scale for positive random variables. J. Amer. Statist. Assoc., 84: 494-501. Martin, R. D. and Zamar, R. H. (1993). Bias-robust estimates of scale. Ann. Statist., 21: 991-1017. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outliers detection. Wiley, New York. Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regression by means of S-estimators. Robust and Nonlinear Time Series Analysis (J. Franke, W. Hardle, and R. D. Martin, eds.), Lecture Notes in Statistics 26, Springer Verlag, New York: 256-272. Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall's tau. J. Amer. Statist. Assoc., 63: 1379-1389. 67 Siegel, A. F. (1982). Robust regression using repeated medians. Biometrika, 6 9 : 242-244. Stigler, S. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard Univ. Press. Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis, I, II and III. Koninklijke Nederlandse Akademie van Wetenschappen, Proceedings, 53: 386-392; 521-525; 1397-1412. Tukey, J. (1960). A survey of sampling from contaminated distributions, in: Contributions to Probability and Statistics. I. Olkin, Ed., Stanford University Press, Stanford. Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. Ann. Statist., 15: 642-656. Yohai, V. J. and Zamar, R. H. (1993). A minimax bias property of the least a-quantile estimates. Ann. Statist, 20: 1875-1888. 68
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 9 | 10 |
Japan | 7 | 0 |
United States | 5 | 0 |
City | Views | Downloads |
---|---|---|
Beijing | 9 | 0 |
Tokyo | 7 | 0 |
Washington | 2 | 0 |
Ashburn | 2 | 0 |
Sunnyvale | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Share to: