UBC Theses and Dissertations


The detection and testing of multivariate outliers White, Richard Alan 1992


Full Text

THE DETECTION AND TESTING OF MULTIVARIATE OUTLIERS

by

RICHARD ALAN WHITE
B.Sc., The University of British Columbia, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

THE FACULTY OF GRADUATE STUDIES
THE DEPARTMENT OF STATISTICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July 1992
© Richard Alan White, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

The classical estimators of multivariate location and scatter for the normal model are the sample mean and sample covariance. However, if outliers are present in the data, the classical estimates can be very inaccurate and robust estimates should be used in their place. Most multivariate robust estimators are very difficult, if not impossible, to compute, thus limiting their use. I will present some simple approximations that make these estimators computable.

Robust estimation down-weights or completely ignores the outliers. This is not always best because the outliers can contain some very important information about the population. If they can be identified, the outliers can be further investigated and an appropriate action can be taken based on the results. To detect outliers, a sequential multivariate scale-ratio test is proposed.
It is based on a non-robust estimate and a robust estimate of scatter and is applied in a forward fashion, removing the most extreme point at each step, until the test fails to indicate the presence of outliers. We will show that this procedure has level α when applied to an uncontaminated sample, is unaffected by swamping or masking, and is accurate in detecting outliers. Finally, we will apply the scale-ratio test to several data sets and compare it to the sequential Wilk's outlier test as proposed by C. Caroni and P. Prescott in 1992.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
2 Concepts, Definitions and Notation
  2.1 Equivariance and Invariance
  2.2 Concepts of Robustness
3 Robust Estimation of Multivariate Location and Scatter
  3.1 M-Estimators
  3.2 S-Estimators
  3.3 τ-Estimators
  3.4 Stahel-Donoho Estimator
4 Detection and Testing of Outliers
  4.1 Tests Based on Wilk's Lambda
  4.2 The Scale-Ratio Test
5 Properties of the Scale-Ratio Test
6 Applications of the Scale-Ratio Test
7 Conclusions and Recommendations
References
Appendix: Computer Implementation

List of Tables

1 Notation
2 Constant k for Tukey's biweight function
3 Number of subsamples such that P(# good samples ≥ 1) = 99%
4 Cutoff Points for V0 using …
5 Cutoff Points for Vmax using …
6 Cutoff Points for V0 using …
7 Cutoff Points for Vmax using …
8 Coefficient of Variation for 50% biweight S-estimate
9 Coefficient of Variation for Stahel-Donoho Estimate
10 Transportation Cost Data: U.S. dollars per mile
11 Wilk's Outlier Test Applied to the Transportation Data
12 Scale-Ratio Test Applied to the Transportation Data
13 Wilk's Outlier Test Applied to the Lakes Anion Data
14 Scale-Ratio Test Applied to the Lakes Anion Data
15 Masked Outlier Data
16 Wilk's Outlier Test Applied to the Masked Outlier Data
17 Scale-Ratio Test Applied to the Masked Outlier Data
18 Wilk's Outlier Test Applied to the Hidden Outlier Data
19 Scale-Ratio Test Applied to the Hidden Outlier Data

List of Figures

1 Outliers in 3-dimensions that are Hidden in 2-dimensions
2 Distortion of the Eigenvectors in 2-dimensions
3 Normal Approximations to the Test Statistics of the Scale-Ratio Test
4 Power Curves for the Scale-Ratio Test
5 Scatterplots and Spin Plots for the Transportation Data
6 Scatterplots and Spin Plots for the Lakes Data
7 Scatterplots for a 4-dimensional Data Set with Hidden Outliers

Acknowledgements

I would like to thank Dr. Ruben Zamar and Dr. Harry Joe for their help, guidance and support during my Masters program at U.B.C. Both have played important roles in my approach to statistics. I would also like to thank Dr. Hyunshik Lee of Statistics Canada for supporting me during the start of this research.

1 Introduction

The outlier problem has been around for many years. Interest in the subject has been like a roller-coaster ride, ranging from periods of intense research to periods of no research. Intuitively, an outlier is an observation which deviates from the rest of the data to such an extent that it arouses suspicion about its underlying distribution. The non-outlying data will be referred to as the core data. It is assumed that the core data contains more than 50% of the observations in the sample. I will further classify an outlier into at least one of two possible categories, an extreme observation and a contaminant. An extreme observation is a point that lies on the convex hull of the data set.
There are two extreme observations in one dimension, the maximum and the minimum. In higher dimensions, the number of extreme observations is quite arbitrary. A contaminant is an observation that is generated from a different distribution than that of the core data. With this definition, it is possible for a contaminant to lie in the middle of the core data, but we will assume that all points that are more extreme than a contaminant, relative to some suitable standardization, are themselves contaminants. Our main goal is to detect only those outliers that are contaminants.

In lower dimensions, graphical techniques can be used to detect potential outliers. In univariate samples, the boxplot identifies an outlier as an observation that lies beyond some cutoff point based on the median and inter-quartile range. In two and three dimensions, outliers can be detected by scatterplots and spin plots respectively, but the degree of outlyingness is based on the judgement of the observer. Unfortunately, once the dimension is greater than three, no graphical tool exists to identify outliers.

Multivariate outliers appear in two forms, the gross outlier and the more subtle structural outlier. The gross outlier is an observation that appears to be outlying in one or more of the original variables. If boxplots are made of these variables then the gross outliers will be detected. The structural outlier may not appear to be an outlier in any of the original variables but is outlying relative to the covariance structure of the core data. It may not appear in any of the pairwise scatterplots or three-way spin plots either. In fact, it may only be noticeable if all dimensions of the data are considered simultaneously. This is illustrated for three dimensions in Figure 1.

By suitably rotating the data, a structural outlier can be converted into a gross outlier.
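The structural-outlier idea can be made concrete with a small numerical aside: measured against the covariance structure (equivalently, after the rotation just described), a point that is modest in every coordinate can still be by far the most outlying. The data and the planted point below are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical 2-D sample: nine strongly correlated core points plus one
# structural outlier at (1.5, -1.5) that is NOT extreme in either coordinate.
data = [(-2.5, -2.6), (-2.0, -2.1), (-1.0, -0.9), (-0.5, -0.45),
        (0.0, 0.1), (0.5, 0.4), (1.0, 1.05), (2.0, 1.9), (2.5, 2.4),
        (1.5, -1.5)]  # last point is the planted structural outlier

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance entries (denominator n - 1).
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Invert the 2x2 covariance matrix directly.
det = sxx * syy - sxy ** 2
ixx, iyy, ixy = syy / det, sxx / det, -sxy / det

# Squared Mahalanobis distance of each point from the sample mean.
d2 = [ixx * (x - mx) ** 2 + 2 * ixy * (x - mx) * (y - my)
      + iyy * (y - my) ** 2 for x, y in data]

worst = max(range(n), key=lambda i: d2[i])
print(worst)  # 9: the planted point, despite its modest coordinates
```

The planted point's squared Mahalanobis distance is roughly three times that of any core point, even though both of its coordinates lie well inside the marginal ranges.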
Therefore, a structural outlier in higher dimensions may be graphically detected if the data are rotated by an orthogonal transformation before the plotting is done. If the true underlying distribution is multivariate normal or elliptically contoured, then the eigenvectors of the true covariance structure for the core data will be the best transformation to use, but this assumes that the outliers are already known. Estimating the covariance structure may help if the outliers are not too damaging, but in certain situations the outliers can distort the classical estimates in such a fashion that the eigenvectors can actually disguise the outliers instead of revealing them. See Figure 2 for an example of this. Random orthogonal transformations can be used in the hope that at least one of them will reveal the outliers, but this approach is inefficient because the outliers may only appear in a small fraction of the transformations and, therefore, a large number would have to be generated to ensure a high probability of selecting an appropriate projection.

Given that outliers cannot always be detected through graphical means, one must resort to other methods. There are two general approaches to the problem, diagnostic tests and robust methods of analysis. They attack the problem from opposite points of view and, oddly enough, the advantages of one method tend to be the disadvantages of the other.

The main advantage of a diagnostic test is that it detects the outliers and allows the investigator to decide how the observation should be dealt with. This is a great advantage because not all outliers are spurious observations. Sometimes the most important information about a sample is contained in the outliers, and discarding or ignoring them does more harm than good. Unfortunately, the accuracy of diagnostic tests is very suspect and these inaccuracies can seriously affect their performance.
The inaccuracies occur because most diagnostic tests require advance knowledge of the number of outliers present in the data. This is mainly due to the masking and swamping effects that outliers can have on the testing procedures. Masking is caused when several outliers are close enough together that the removal of some, but not all, of them results in little improvement in the scatter estimate for the sample. Therefore, if the test is conducted for fewer than the true number of outliers present in the data, the test may not detect any outliers at all. When the test is conducted for more than the true number of outliers, masking cannot occur. However, under these conditions, the test may still be significant and therefore all the good points that are falsely included with the outliers will be incorrectly labelled as outliers. This is known as the swamping effect. Applying a diagnostic test in a sequential manner will not help either, because a forward procedure cannot avoid the masking problem and a backward procedure cannot avoid the swamping problem.

Robust procedures are designed for the case of several outliers. They can be applied when only an upper bound for the number of outliers is known and they are not affected by the masking effect. They usually result in good estimates for the core data when the sample is contaminated but usually lack efficiency when the sample is uncontaminated. The main problem with robust procedures is that they down-weight or completely ignore the outliers. This takes some of the control out of the investigator's hands because the robust procedures decide how the outliers will be dealt with. Furthermore, any information contained in the outliers is lost.

Since each method is strong where the other is weak, the two approaches to the outlier problem should be combined to produce a diagnostic test that is immune to the effects of masking.
This test could then be applied sequentially in a forward fashion to not only detect the outliers but to indicate the number present as well. Furthermore, the test should only have to be applied until it fails to detect the presence of an outlier, because it should not be fooled by masked outliers.

In this thesis, I will propose a robust diagnostic tool for detecting outliers in a multivariate context. This tool, which I call the multivariate scale-ratio test, is based on the relative size of two multivariate estimates of scatter, one sensitive to outliers, the other highly robust. The size of a matrix can be measured in several ways. The most common is the determinant, but others such as the largest eigenvalue will also be considered. Another problem facing this procedure is the computation of a robust multivariate estimate of scatter. Several proposed estimators exist, but all that possess a high breakdown point are computationally prohibitive. To deal with this problem, I propose approximations to these estimators that are relatively easy to compute and appear to cost very little in terms of performance. The properties of the multivariate scale-ratio test will be investigated through several Monte Carlo simulations. These will provide some evidence about the asymptotic distribution, power and other properties of the test. Finally, the multivariate scale-ratio test will be compared to its main competitor, the multivariate Wilk's outlier test. For comparative purposes, the Wilk's outlier test will be applied in the sequential fashion proposed by Caroni and Prescott [3] in 1992. The comparison will be made by applying the two tests to several real and synthetic data sets in order to evaluate their relative performances.

Figure 1: Outliers in 3-dimensions that are Hidden in 2-dimensions (panels: Variable 2 vs Variable 1, Variable 3 vs Variable 1, Variable 2 vs Variable 3, and a spin plot of the data)

Figure 2: Distortion of the Eigenvectors in 2-dimensions (panels: the original variables, the classically rotated variables and the correctly rotated variables, each with boxplots of the plotted variables)

2 Concepts, Definitions and Notation

In this section, we will fix the notation used in the thesis and define a few basic concepts that will be used later. The notation is summarized in Table 1 while the concepts are presented in formal definitions.

2.1 Equivariance and Invariance

We begin by defining several forms of equivariance and invariance.

Definition 1 Suppose we have $x_1, \ldots, x_n$. Let $a$ be a vector in $\mathbb{R}^p$, $c$ be a constant, $\Gamma$ be an orthogonal $p \times p$ matrix and $A$ be a nonsingular $p \times p$ matrix. Then a location estimator $T$ is said to be

1. location equivariant if $T(x_1 + a, \ldots, x_n + a) = T(x_1, \ldots, x_n) + a$;
2. scale equivariant if $T(cx_1, \ldots, cx_n) = cT(x_1, \ldots, x_n)$;
3. orthogonal equivariant if $T(\Gamma x_1 + a, \ldots, \Gamma x_n + a) = \Gamma T(x_1, \ldots, x_n) + a$;
4. affine equivariant if $T(Ax_1 + a, \ldots, Ax_n + a) = AT(x_1, \ldots, x_n) + a$;

and a scale estimator $S$ is said to be

1. location invariant if $S(x_1 + a, \ldots, x_n + a) = S(x_1, \ldots, x_n)$;
2. scale equivariant if $S(cx_1, \ldots, cx_n) = |c|\,S(x_1, \ldots, x_n)$;
3. orthogonal equivariant if $S(\Gamma x_1 + a, \ldots, \Gamma x_n + a) = \Gamma S(x_1, \ldots, x_n)\Gamma'$;
4. affine equivariant if $S(Ax_1 + a, \ldots, Ax_n + a) = AS(x_1, \ldots, x_n)A'$.

We want an estimator to possess these properties because the parameters that they estimate also possess them.
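Definition 1 can be checked numerically for the classical estimators. A minimal sketch: under $x \mapsto Ax + a$ the sample mean should map to $A\bar{x} + a$ and the sample covariance to $ASA'$. The matrix $A$, shift $a$ and data below are arbitrary illustrations.

```python
# Numerical check of affine equivariance (Definition 1, item 4) for the
# sample mean and sample covariance in two dimensions.

def mean2(pts):
    n = len(pts)
    return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

def cov2(pts):
    n, m = len(pts), mean2(pts)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in pts:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / (n - 1)
    return s

A = [[2.0, 1.0], [0.5, 3.0]]           # arbitrary nonsingular matrix
a = [1.0, -2.0]                        # arbitrary shift
pts = [(0.2, 1.1), (-1.3, 0.4), (2.0, -0.7), (0.9, 2.2), (-0.4, -1.5)]

txd = [(A[0][0] * x + A[0][1] * y + a[0],
        A[1][0] * x + A[1][1] * y + a[1]) for x, y in pts]

# Location: A * mean(pts) + a should equal mean of transformed data.
m0, m1 = mean2(pts), mean2(txd)
lhs = [A[0][0] * m0[0] + A[0][1] * m0[1] + a[0],
       A[1][0] * m0[0] + A[1][1] * m0[1] + a[1]]
mean_ok = all(abs(lhs[i] - m1[i]) < 1e-9 for i in range(2))

# Scatter: A S A' should equal the covariance of the transformed data.
S0, S1 = cov2(pts), cov2(txd)
AS = [[sum(A[i][k] * S0[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
ASAt = [[sum(AS[i][k] * A[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)]
cov_ok = all(abs(ASAt[i][j] - S1[i][j]) < 1e-9
             for i in range(2) for j in range(2))

print(mean_ok, cov_ok)  # True True
```

Both identities hold exactly in exact arithmetic; the tolerances only absorb floating-point rounding.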
However, as we will see later, sometimes it may be necessary to relax the affine equivariance condition in order to make the estimator computable.

Table 1: Notation

  $\mathbb{R}^p$ : p-dimensional Euclidean space.
  $V$ or $\Sigma$ : matrices or vector-valued random variables.
  $x$ or $\mu$ : p-dimensional vectors.
  $x'$ or $\Sigma'$ : the transpose of a vector or matrix.
  $V^{1/2}$ : a square root matrix such that $V^{1/2}(V^{1/2})' = V$.
  $|V|$ : the determinant of the matrix $V$.
  $\lfloor y \rfloor$ : the greatest integer less than or equal to $y$.
  $\Phi$ : the standard normal distribution function (univariate).
  $\phi$ : the standard normal density function (univariate).
  $\chi^2_p$ : a chi-square distribution with p degrees of freedom.
  $\beta_{n,m}$ : a beta distribution with parameters n and m.
  $F_\alpha$ : the upper $\alpha$ quantile from a specified distribution.
  $1_A(x)$ : an indicator function of the set A.
  $\sim$ : independent and identically distributed from.
  med($x_i$) : median of $x_1, \ldots, x_n$.
  mad($x_i$) : median absolute deviation from the median of $x_1, \ldots, x_n$.
  MVE : minimum volume ellipsoid.
  MCD : minimum covariance determinant.

2.2 Concepts of Robustness

Next, we turn our attention to robust statistics. We begin by defining what we mean by a contaminated sample. A contaminated sample is drawn from a mixture distribution $F_{\epsilon,H} = (1-\epsilon)F + \epsilon H$, where $F$ is a p-dimensional distribution with mean $\mu$ and covariance $\Sigma$, $H$ is any other p-dimensional distribution and $\epsilon < 1/2$ is a positive constant.

Definition 2 A sample $x_1, \ldots, x_n$ is said to be contaminated if $x_i$, $i = 1, \ldots, n$, are iid $F_{\epsilon,H}$ and both $H$ and $\epsilon$ are unknown.

If $H$ has moments to the second order, or $\epsilon = 0$, then $F_{\epsilon,H}$ has moments to the second order.
If the latter is true then $F_{\epsilon,H} = F$ and the moments are $\mu$ and $\Sigma$; otherwise, if $H$ has moments, then the moments of $F_{\epsilon,H}$ are given in Fact 1.

Fact 1 Suppose $F$ is a p-dimensional distribution with mean $\mu$ and covariance $\Sigma$, $H$ is any other p-dimensional distribution with mean $\theta$ and covariance $\Gamma$, $\epsilon \in [0, 1/2)$ is a constant and $X$ is distributed as $F_\epsilon = (1-\epsilon)F + \epsilon H$. Then

(1) $EX = (1-\epsilon)\mu + \epsilon\theta$;
(2) $\mathrm{Var}(X) = (1-\epsilon)\Sigma + \epsilon\Gamma + \epsilon(1-\epsilon)(\mu - \theta)(\mu - \theta)'$.

Proof
(1) $\int x\, dF_\epsilon(x) = \int x\, d[(1-\epsilon)F(x) + \epsilon H(x)] = (1-\epsilon)\int x\, dF(x) + \epsilon\int x\, dH(x) = (1-\epsilon)\mu + \epsilon\theta$.
(2) $\int xx'\, dF_\epsilon - (EX)(EX)' = (1-\epsilon)\int xx'\, dF(x) + \epsilon\int xx'\, dH(x) - ((1-\epsilon)\mu + \epsilon\theta)((1-\epsilon)\mu + \epsilon\theta)'$
$= (1-\epsilon)\left(\int xx'\, dF(x) - \mu\mu'\right) + \epsilon\left(\int xx'\, dH(x) - \theta\theta'\right) + (1-\epsilon)\mu\mu' + \epsilon\theta\theta' - (1-\epsilon)^2\mu\mu' - \epsilon^2\theta\theta' - \epsilon(1-\epsilon)(\mu\theta' + \theta\mu')$
$= (1-\epsilon)\Sigma + \epsilon\Gamma + \epsilon(1-\epsilon)\left(\mu\mu' + \theta\theta' - \mu\theta' - \theta\mu'\right)$
$= (1-\epsilon)\Sigma + \epsilon\Gamma + \epsilon(1-\epsilon)(\mu - \theta)(\mu - \theta)'.$

If $H$ does not have moments and $\epsilon > 0$ then $F_{\epsilon,H}$ will not have moments either. This implies that for any level of contamination $\epsilon > 0$, the moments of $F_{\epsilon,H}$ can be changed to any value, even infinite or undefined, by choosing an appropriate contaminating distribution $H$.

Using $F_{\epsilon,H}$, we can define several measures of robustness for a given estimator $T$.

Definition 3 The maximum bias of $T$ for a given level of contamination $\epsilon$ is defined by
$$B(\epsilon; T, F) = \sup_H \|T(F_{\epsilon,H}) - T(F)\|.$$

Plotting $B(\epsilon; T, F)$ versus $\epsilon$ produces the maximum bias curve [16]. This curve carries a lot of information about the robust properties of the estimator. Two important robustness quantities that can be derived from $B(\epsilon; T, F)$ are the breakdown point and the gross error sensitivity of an estimator.

Definition 4 The breakdown point of $T$ is defined by
$$\epsilon^* = \inf\{\epsilon : B(\epsilon; T, F) = \infty\}.$$
In other words, the breakdown point is the smallest level of contamination that can cause $T(F_{\epsilon,H})$ to be arbitrarily far from $T(F)$.

Definition 5 The gross error sensitivity is defined by
$$GES = \left.\frac{dB(\epsilon; T, F)}{d\epsilon}\right|_{\epsilon = 0}.$$
It measures the maximum effect that an infinitesimal amount of contamination can have on $T$.
It can also be used as a linear approximation to the maximum bias when $\epsilon$ is near zero.

To measure the sensitivity of $T$ to an infinitesimal amount of contamination from an arbitrary distribution $H$, we use the influence function.

Definition 6 The influence function of $T$ for a contaminating distribution $H$ is given by
$$IF(H; T, F) = \lim_{\epsilon \downarrow 0} \frac{T(F_{\epsilon,H}) - T(F)}{\epsilon}$$
for distributions $H$ for which the limit exists. It can be seen that
$$\sup_H \|IF(H; T, F)\| \le GES.$$
A larger value of $\|IF(H; T, F)\|$ means that the estimator $T$ is more greatly influenced by contamination with $H$.

3 Robust Estimation of Multivariate Location and Scatter

To estimate the location and scatter of a multivariate distribution $F$, one uses the sample mean vector $\bar{x}$ and sample covariance matrix $S$ defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad (1)$$
$$S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'. \qquad (2)$$
Unfortunately, if the sample contains some contaminating elements, then $\bar{x}$ and $S$ can perform very poorly. For example, consider the contaminated distribution $F_{\epsilon,H}$ as defined in Section 2. For this case $E(X) = (1-\epsilon)\mu + \epsilon\theta$ and $\mathrm{Var}(X) = (1-\epsilon)\Sigma + \epsilon\Gamma + \epsilon(1-\epsilon)(\mu - \theta)(\mu - \theta)'$. So, if one wishes to estimate the location and scatter of $F$ and one observes $x_1, \ldots, x_n \sim F_{\epsilon,H}$, then $\bar{x}$ and $S$ can be extremely inaccurate. Therefore, these estimators are unreliable and a different method is needed to ensure accurate estimates.

Suppose that $k$ of the $n$ points in the sample come from the distribution $H$ and the exact $k$ points are known. Then one simply removes these points and estimates the parameters of $F$ directly, using $\bar{x}$ and $S$ on the remaining points. However, the underlying distribution of each point as well as the exact number of contaminants is usually unknown and therefore diagnostic tools must be used to estimate these values. See Section 4 for details on how this is done in a multivariate context.

A second method of estimating the location and scatter of $F$ is to use robust techniques. The main advantage of this method over the previous is that only an upper bound on $k$ need be known for the techniques to work. The disadvantage is that some of the desirable properties of the estimate have to be sacrificed in order that others may be maintained. In the univariate context the trade-off occurs between robustness and
The disadvantage is that someof the desirable properties of the estimate have to be sacrificed in order that others maybe maintained. In the univariate context the trade-off occurs between robustness and12efficiency. In the multivariate situation, one must also consider computational efficiencyand the equivariance of the estimator. If a desirable property must be sacrificed, whichone should it be? There is no clear solution here and different people will argue fordifferent estimators.3.1 M-EstimatorsThe first robust estimators were called M-estimates because they were based on a generalized maximum likelihood score function. The univariate M-estimate of location wasproposed in 1964 by Peter Huber. M-estimates were later generalized to estimate scaleand location and scale simultaneously. To define the univariate M-estimate, one beginswith maximum likelihood estimation. Suppose we have x1,. . . , N(1t,2), then themaximum likelihood estimate of location when the scale is known is the root t of theequation=0,where /Ls(x) = x, the maximum likelihood estimate of scale when the location is knownis the root v to the equation_PLS() =b,where pLs(x) = x2 and b = 1, and the maximum likelihood estimate of location andscale is the vector (t, v) that satisfies—ZbLs() =0 and—ZPLS() =b (3)simultaneously. M-estimates are a generalization of the maximum likelihood estimatesto a class of functions (x) and p(x) which satisfy the following properties:1. (x) is odd with at most a finite number of discontinuities;132. p(x) is even, differentiable almost everywhere and p(O) = 0;3. p(x) is strictly increasing on [0, c) and constant on [c, oo);4. >0.M-estimates of location are unbiased at the uncontaminated model and have a breakdown point of 1/2 if sup Iib(x)I < , zero otherwise. They can be very efficient butthis usually carries a heavy price in terms of the maximum bias curve. M-estimates oflocation are asymptotically normally distributed with a rate of convergence. 
TheM-estimate of scale has a breakdown point of b/M against outliers and 1— b/M againstinliers where M = sup p(x). They also converge to a normal distribution at a /T rateand are consistent at the uncontaminated model, but they cannot obtain good efficiencywhile still maintaining a high breakdown point against outliers. M-estimates of locationand scale will have a positive breakdown point against outliers only if both &(x) andp(x) are bounded.As seen above and s2 are special cases of M-estimators. They are efficient buthave a breakdown point of zero. The estimators t = med(x) and v = mad(xi) =medx— tI/’(3/4) are the other extreme for M-estimators. These are obtained by?bmed(X) sgn(x), bmed(0) = 0 and0 ‘(1/4) < x <Pmad(X) = 1/2 x = 1/4 or x =1 otherwise.One can easily verify that b f pmad(x)q(x)dX = 1/2. The median and the mad arehighly robust in terms of the maximum bias curve and have a breakdown point of 1/2but are very inefficient at the normal model. To achieve an M-estimator that is robustand whose location estimator is relatively efficient, a compromise is needed between the14two extremes. This is done by defining & and p so that they behave like bLs and pLsfor small values of x but become constant like med and Pmad when x is large. Huber [9]combined the extremes to obtain the following class of functionsx if 121 c, x2 if 121 < k,?1(x)= and p(x) = (4)c sgn(x) otherwise, k2 otherwise,where 0 <c, k < oo. The balance between efficiency and robustness is controlled by theconstants c and k. Note that (t, v) —* (, s) as c, k —f : and t —* med(x) as c —÷ 0.Some people prefer to have b= p’ because this holds for the maximum likelihoodestimate. 
A set of functions that satisfy this criterion are Tukey’s biweight functionsdefined byT x(1 — 2(x/c) + (x/c)4) if lxi c,(x) and0 otherwise.T x2(3 — 3(x/k)2 + (x/k)4) if xl k,Pk(x) =otherwise.Like Huber’s function, Tukey’s function has the maximum likelihood estimator as aspecial case, but for any c < , the function i7bi redescends to zero.M-Estimates can be generalized to higher dimensions and, in 1976, Maronna [15]did just that. M-estimators of multivariate location and scatter are defined in a similarfashion as in the univariate context. They are the solution (, ) to the following systemof equationsu1 ({(xi— t)’V’(x — t)}’I2 (x — t) = 0, (5)u2 ((xi — t)’V’(x— t)) (x— t)(x— t)’ — V = 0, (6)where u1 and u2 are functions defined for s 0 satisfying, for V = SS’,— t)i)S1(x — t) = 0,15Eu2(jS’(x— t)I2)[S(x — t)][S(x — t)]’ = Ii,.In addition,1. u1(s) and u2(s) are nonnegative, nonincreasing, and continuous for s 0;2. /(s) = su(s) is bounded for i = 1,2;3. I’(s) is strictly increasing in the interval where 2(S) < K = sup>02(s) andconstant iii the regions where ‘2(S) = K;4. so such that &2(S) > p, and Ui(S) > 0 for s so;5. a> 0 such that for every hyperplane H, PF(H) 1 — p/K — a.Maronna shows, under certain conditions, that the estimates are unique, consistent,asymptotically normal, affine equivariant and relatively easy to compute. However, thebreakdown point is less than (p + i)_i. Therefore these estimates do not perform wellin higher dimensions if the level of contamination is large.3.2 S-EstimatorsS-estimates were originally developed, in the regression context, by Rousseeuw andYohai in 1984 [23}. To describe these estimators, it is best to consider the special caseof estimating univariate location and scale. Suppose we have x1,. . . , N(t, u2) thenthe S-estimates of location and scale is the vector (t, v) such that v is minimized subjectto the constraint,p(xi-t)E[p(x-/L)] (7)where p(x) and b are defined as in the case of M-estimates of scale. 
In fact, for a fixed location, an S-estimate is exactly an M-estimate of scale and, like the M-estimate, has the maximum likelihood estimate as a special case, obtained by defining $\rho_{LS}(x) = x^2$. The other extreme is obtained when $\rho_{mve}(x) = 1 - 1_{[\Phi^{-1}(0.25),\,\Phi^{-1}(0.75)]}(x)$ and $b = 1 - (\lfloor n/2 \rfloor + 1)/n$ are used. If the location is fixed to be the median, then the corresponding M-estimate of scale will be the mad. However, as an S-estimate, $\rho_{mve}$ defines the SHORTH, which is the most robust univariate estimator of location and scale with respect to minimizing the maximum bias among all M-estimators of scale [17]. The distribution of the scale estimate converges weakly to a normal distribution at a $\sqrt{n}$ rate, but the distribution of the location estimate converges at a rate of $n^{1/3}$ to a non-normal distribution. Both estimates are extremely inefficient at the normal model.

The properties of S-estimates are similar to the properties of M-estimates of scale. They are consistent at the uncontaminated model, possess a breakdown point of $b/M$ against outliers and $1 - b/M$ against inliers, where $M = \sup \rho(x)$, and they cannot achieve a high breakdown point and good efficiency at the same time. The other main problem with S-estimates is their asymptotic distribution. If $\rho$ is not smooth enough, the estimates can converge at a slower rate to a non-normal distribution, as demonstrated by the SHORTH. An asymptotically normal estimate is preferable, but only if the breakdown point of $\rho_{mve}$ can be maintained. The boundedness of $\rho$ is necessary and sufficient to maintain a positive breakdown point, and $\rho(x)$ being twice continuously differentiable is a sufficient condition for asymptotic normality at a $\sqrt{n}$ rate. Tukey's biweight function,
$$\rho_c(x) = \begin{cases} (x/c)^2\left(3 - 3(x/c)^2 + (x/c)^4\right) & |x| \le c, \\ 1 & \text{otherwise,} \end{cases} \qquad (8)$$
possesses all these properties.

Davies [5] generalized S-estimates to the multivariate context in 1987.
They are defined as the vector $\hat{t}$ and the positive definite matrix $\hat{V}$ that minimize $|V|$ subject to
$$\frac{1}{n}\sum_{i=1}^n \rho\!\left[\{(x_i - t)'V^{-1}(x_i - t)\}^{1/2}\right] = b, \qquad (9)$$
where $\rho : [0, \infty) \to [0, \infty)$ is strictly increasing on $[0, c)$ and constant on $[c, \infty)$ and $b = E\rho\left[\{(x - \mu)'\Sigma^{-1}(x - \mu)\}^{1/2}\right]$ [12].

Multivariate S-estimates are affine equivariant, consistent at the uncontaminated model, possess a breakdown point of $b/M$ against outliers and $1 - b/M$ against inliers, where $M = \sup \rho(x)$, but, like their univariate counterparts, they cannot achieve a high breakdown point and good efficiency at the same time, and they converge at a slower rate to a non-normal distribution if $\rho$ is not smooth enough. Again, one seeks a $\rho$ function which is bounded above as well as twice continuously differentiable. Tukey's biweight function, defined in (8), does the job. Unfortunately, the distribution of $y = (x - \mu)'\Sigma^{-1}(x - \mu)$ depends on the dimension of $x$, thus causing the tuning constant $c$ to change as well. This can cause problems if the distribution of $y$ is difficult to compute. Fortunately, the normal case is one in which the calculation can be done, because $y \sim \chi^2_p$ when $x \sim N(\mu, \Sigma)$. Substituting $y$ into $\rho$ and letting $k = c^2$, (8) becomes
$$\rho_k(y) = \begin{cases} (y/k)\left(3 - 3(y/k) + (y/k)^2\right) & y \le k, \\ 1 & \text{otherwise.} \end{cases} \qquad (10)$$
The expected value for a given $k$ is
$$E(\rho_k(y)) = \int_0^k \left(\frac{3z}{k} - \frac{3z^2}{k^2} + \frac{z^3}{k^3}\right) dF_{\chi^2_p}(z) + \int_k^\infty dF_{\chi^2_p}(z).$$
Setting $E(\rho_k(y)) = b$ and solving for $k$, we get $k$ such that
$$1 + \int_0^k \left(\frac{3z}{k} - \frac{3z^2}{k^2} + \frac{z^3}{k^3} - 1\right) dF_{\chi^2_p}(z) = b. \qquad (11)$$
Table 2 contains the tuning constant $k = c^2$ for certain values of $b$ and $p$.

Table 2: Constant k for Tukey's biweight function

          dimension p
  b      1      2     3     4     5     6     7     8     9     10
  .10  26.86  55.86  84.8 113.8 142.8 171.8 200.8 229.7 258.7 287.7
  .25   8.63  19.62  30.6  41.5  52.5  63.4  74.3  85.3  96.2 107.2
  .50   2.40   7.08  11.9  16.8  21.6  26.5  31.4  36.2  41.1  46.9

In 1989, Lopuhaä [12] showed that an S-estimate can also be defined as the solution $(\hat{t}, \hat{V})$ to the equations
$$\frac{1}{n}\sum_{i=1}^n u\!\left[\{(x_i - t)'V^{-1}(x_i - t)\}^{1/2}\right](x_i - t) = 0, \qquad (12)$$
$$\frac{1}{n}\sum_{i=1}^n \left\{ u\!\left[\{(x_i - t)'V^{-1}(x_i - t)\}^{1/2}\right](x_i - t)(x_i - t)' - v\!\left[\{(x_i - t)'V^{-1}(x_i - t)\}^{1/2}\right]V \right\} = 0, \qquad (13)$$
where $u(x) = \psi(x)/x$, $v(x) = x\psi(x) - \rho(x) + b$ and $\psi(x)$ is the derivative of $\rho(x)$.
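Equation (11) can be solved numerically for $k$, since $E(\rho_k(y))$ is decreasing in $k$. A minimal sketch, integrating the $\chi^2_p$ density with a midpoint rule and bisecting on $k$; the step counts and search interval are illustrative assumptions, and the result can be compared against Table 2.

```python
import math

def chi2_pdf(z, p):
    # chi-square density with p degrees of freedom
    if z <= 0:
        return 0.0
    return z ** (p / 2 - 1) * math.exp(-z / 2) / (2 ** (p / 2) * math.gamma(p / 2))

def expected_rho(k, p, steps=4000):
    # E[rho_k(y)], y ~ chi^2_p, with rho_k from equation (10):
    # rho_k(y) = (y/k)(3 - 3(y/k) + (y/k)^2) for y <= k, and 1 otherwise.
    upper = k + 30.0 * p          # integrate far enough into the tail
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h         # midpoint rule, avoids the endpoint y = 0
        if y <= k:
            u = y / k
            r = u * (3 - 3 * u + u * u)
        else:
            r = 1.0
        total += r * chi2_pdf(y, p) * h
    return total                  # mass beyond `upper` is negligible

def tuning_constant(b, p):
    # E[rho_k] decreases in k, so bisect until E[rho_k] = b (equation 11).
    lo, hi = 1e-6, 2000.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if expected_rho(mid, p) > b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(tuning_constant(0.50, 1), 2))  # close to Table 2's k = 2.40
```

For b = .50 and p = 1 the solver lands near Table 2's entry k = 2.40, i.e. the familiar biweight constant c = k^(1/2) of about 1.55 for a 50% breakdown point.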
These equations are very similar to (5) and (6), which means that there is a solution which has a breakdown point that is at most $1/(p+1)$. However, there is at least one other solution which has a breakdown point of $b/M$, where $M = \sup \rho$. The global minimum of (9) is one of the high breakdown solutions. Therefore, one should minimize (9) rather than solve (12) and (13), because the former guarantees the solution to be one of high breakdown if the global minimum is reached.

Unfortunately, S-estimates are computationally expensive because they involve the minimization of an implicitly defined, non-convex function. If the function is convex then the breakdown point of the S-estimate is zero. The possibility of multiple minima means that the minimum obtained need not be the best (although it will have a high breakdown point). A good starting point can increase the probability of reaching the global minimum, but this would require a robust estimate of location and scatter, which is exactly what we seek in the first place.

In 1983, Rousseeuw proposed the first S-estimate of multivariate location and scatter with a positive breakdown point, called the minimum volume ellipsoid estimator or MVE. The MVE is the multivariate version of the SHORTH and is based on an ellipsoid of minimal volume which covers at least $\lfloor n/2 \rfloor + 1$ points of the data. The centre of the ellipsoid is the location estimate, while the rescaled ellipsoid is the corresponding covariance estimate. The MVE is affine equivariant with a breakdown point of $(\lfloor n/2 \rfloor - p + 1)/n$. Unfortunately, the MVE is not computable for any reasonable sample size because of the combinatorial explosion. For example, a sample size of 50 in 2 dimensions has over $10^{14}$ ways to choose 27 distinct points from the sample. The MVE is computed by finding the one that yields the ellipsoid of minimal volume. A second way to compute the MVE is to minimize $\rho_{mve}(z) = 1 - 1_{[0,\,\chi^2_{p,\alpha}]}(z)$, where $z = (x - \mu)'\Sigma^{-1}(x - \mu)$ and $\chi^2_{p,\alpha} = \{z : F_{\chi^2_p}(z) = 1 - \alpha\}$.
Asymptotically b = 1/2, however for finite samplesb = 1 — [(n + p + 1)/2J /n to guarantee a solution. To find a local minimum, letalone the global minimum, is very difficult because the function to be minimizedisdiscontinuous and nonconvex. However, the MVE can be approximated through thefollowing resampling scheme:1. Select a subsample that contains exactly p + 1 distinct points indexed by J ={ii,...,i÷i};2. Compute= x and C =— )‘(x—iEJ PiEJ3. Set V = mjCj where rrij = med(x — j)C’(x— j)’ for i = 1,. . . ,4. Compute Vjl = m,ICj.The resampling is repeated for many J after which the one with the smallest determinantIVjI is retained; call this subsample I. The final estimates are j and kV1, whereIc = x,.5 makes the scatter estimate consistent at the multivariate normal model.20The total number of subsamples drawn is n-i, which is selected to ensure a highprobability that at least one subsample will not contain any contaminants. Since thesubsamples are drawn independently, the probability that there is at least one goodsubsample in in is given by 1 — (1 — j3)m, where— ((1 — E)n)!(n—p — 1)!is the probability of drawing a good sample from the data and n is the number of contaminants in the data. Observe that 3 = P(X = 0), where X follows a hypergeometricdistribution with parameters n, (1— )n,p+ 1. It is easily shown that 3 —* (1— )+1 asn —* oo. Furthermore, the breakdown point for the MVE implies that < 1/2, otherwisethe estimate will break. Therefore, the probability of getting a good sample when thelevel of contamination is unknown is approximately equal to 1— (1 — l/2P+1)m Fromsetting this probability to 99% and solving for rn, one obtains a lower bound on thenumber of subsamples needed. 
To verify that this is, in fact, a lower bound, one must consider the exact breakdown point ε* = (⌊n/2⌋ − p + 1)/n, calculate the true value of m, say m₀, and then show that m₀ ≤ m for all n < ∞ and lim_{n→∞} m₀ = m. Table 3 contains m₀ for several sample sizes and dimensions as well as the asymptotic approximation or lower bound.

Table 3: subsamples such that P(# good samples ≥ 1) = 99%

  sample                       dimension
  size         1    2    3    4    5    6     7     8     9     10
  25          13   21   41   54  100  112   203   192   341   270
  100         15   32   60  119  211  414   702  1376  2236  4377
  1000        16   35   71  143  283  566  1116  2232  4373  8731
  10000       16   35   72  145  292  585  1171  2343  4679  9356
  100000      17   35   72  146  293  587  1177  2355  4710  9422
  ∞           17   35   72  146  293  588  1176  2356  4714  9430

Like the SHORTH, the MVE does not behave well asymptotically. Its distribution converges weakly to a non-normal distribution at the slow rate of n^{-1/3}. To improve this rate, Rousseeuw also proposed the minimum covariance determinant estimator or MCD. The calculation of the MCD is identical to the MVE except that the final estimator is the mean vector and rescaled covariance matrix of the ⌊n/2⌋ + 1 points inside the MVE. The MCD has all the robust properties of the MVE and it converges to a normal distribution at the rate of n^{-1/2}.

The resampling scheme used to calculate the MVE can also be used to calculate other S-estimates. To see this, one simply writes the MVE in the form of an S-estimate, using the function ρ_mve as defined in Section 3.2. Furthermore, by letting Σ = s²C such that |C| = 1, the MVE becomes the vector t̂ and the positive definite symmetric matrix s²Ĉ that minimize s subject to

(1/n) Σ_{i=1}^n ρ_mve( √((x_i − t)'C^{-1}(x_i − t)) / s ) = 1 − ⌊(n + p + 1)/2⌋/n    (14)

In this form, the resampling approximation to the MVE becomes the mean and rescaled covariance of the subsample that minimizes s. Furthermore, since any S-estimate can be written in this form by replacing ρ_mve by the appropriate function ρ and 1 − ⌊(n + p + 1)/2⌋/n by the appropriate value b, the resampling scheme can be used to approximate any S-estimate.
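The lower bound described above follows from solving 1 − (1 − 1/2^{p+1})^m = 0.99 for m. A small sketch (the function name is mine) reproduces the last row of Table 3:

```python
import math

def subsamples_needed(p, eps=0.5, prob=0.99):
    """Solve 1 - (1 - (1 - eps)^(p+1))^m >= prob for m, using the
    asymptotic probability (1 - eps)^(p+1) that a subsample is clean."""
    beta = (1.0 - eps) ** (p + 1)          # P(a (p+1)-subsample is clean)
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - beta))

bounds = [subsamples_needed(p) for p in range(1, 11)]
# matches the infinite-sample row of Table 3: 17, 35, 72, ..., 9430
```

The count grows roughly like 2^{p+1} log(100), which is why the required number of subsamples explodes with the dimension.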
Computationally, this is not very attractive because a non-linear equation must be solved m times in order to compute the estimate. But this can be avoided by modifying the algorithm in a way similar to that proposed by Yohai and Zamar [29] in the case of regression estimates. Consider

h(s|t, C) = (1/n) Σ_{i=1}^n ρ( √((x_i − t)'C^{-1}(x_i − t)) / s )    (15)

It is easy to verify that h(s|t, C) is a decreasing function of s. Therefore, if h(s₀|t, C) > b then s will have to be increased. This implies that s need not be calculated for each subsample: if our current minimum value for s is s₀ and h(s₀|t, C) > b, then s₀ will not be updated because the new value of s would be greater than s₀. Using this fact, the following modification is proposed for the resampling algorithm.

1. Calculate an initial value s₀.
2. For J in 1, ..., m,
   (a) select subsample J containing exactly p + 1 distinct points,
   (b) compute
       t_J = (1/(p+1)) Σ_{i∈J} x_i   and   C_J = (1/p) Σ_{i∈J} (x_i − t_J)(x_i − t_J)',
   (c) rescale C_J so that |C_J| = 1,
   (d) if h(s₀|t_J, C_J) < b then:
       i. s₀ = {s : h(s|t_J, C_J) = b};
       ii. t̂ = t_J;
       iii. Ĉ = C_J.
3. Return t̂ and V̂ = s₀²Ĉ.

With this modification, the expected number of times that s₀ needs to be updated is of the order log(m). To show this, let

Y_j = 1 if h(s₀|t_j, C_j) < b, and 0 otherwise.

Observe that the number of times that s₀ needs to be updated is τ = Σ_{j=1}^m Y_j. Using the properties of expected values it is clear that E(τ) = Σ_j E(Y_j). Since each subsample is drawn independently and s₀ = min_{i≤j} {s : h(s|t_i, C_i) = b} for i = 1, ..., j, we have P(Y_j = 1) ≤ 1/j, with equality as n → ∞, which implies that E(Y_j) ≤ 1/j and, therefore, E(τ) ≤ Σ_{j=1}^m 1/j ≈ log(m).

The only possible difficulty with this algorithm is the updating of s₀. However, I claim that this will be a relatively easy task because it is a function of a single variable, the solution is bounded below by zero and above by s₀ and, if one imposes the sufficient condition for asymptotic normality, the function will be twice continuously differentiable. Therefore, a Newton-Raphson routine can be used to find the solution.
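The modified algorithm above can be sketched as follows, with bisection standing in for the Newton-Raphson step and a biweight-type ρ standing in for the ρ of the text (both substitutions are mine):

```python
import numpy as np

def rho(x, c=1.56):
    """A bounded biweight-type rho with sup rho = 1 (illustrative choice)."""
    u = np.minimum(np.abs(x) / c, 1.0)
    return 1.0 - (1.0 - u ** 2) ** 3

def h(s, t, C, X):
    """h(s|t,C) of (15); decreasing in s."""
    R = X - t
    d = np.sqrt(np.einsum('ij,jk,ik->i', R, np.linalg.inv(C), R))
    return rho(d / s).mean()

def solve_scale(t, C, X, b):
    """Solve h(s|t,C) = b for s by bisection (stand-in for Newton-Raphson)."""
    lo, hi = 1e-8, 1e8
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if h(mid, t, C, X) > b:
            lo = mid                        # h too large: increase s
        else:
            hi = mid
    return 0.5 * (lo + hi)

def s_estimate_resample(X, b=0.5, m=200, seed=0):
    """Lazy resampling: solve for s only when h(s0|tJ,CJ) < b, i.e. when
    the subsample can actually lower the current minimum s0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s0 = t_hat = C_hat = None
    for _ in range(m):
        J = rng.choice(n, size=p + 1, replace=False)
        t, C = X[J].mean(axis=0), np.cov(X[J], rowvar=False)
        det = np.linalg.det(C)
        if det <= 0:
            continue
        C = C / det ** (1.0 / p)            # rescale so |C| = 1
        if s0 is None or h(s0, t, C, X) < b:
            s0, t_hat, C_hat = solve_scale(t, C, X, b), t, C
    return t_hat, s0 ** 2 * C_hat

t_hat, V_hat = s_estimate_resample(np.random.default_rng(3).normal(size=(80, 2)))
```

Since h is decreasing in s, the cheap check h(s₀|t_J, C_J) < b filters out almost all subsamples, so the expensive root-finding step runs only O(log m) times on average.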
Furthermore, it will converge quite rapidly because the solution is unique (see Section 3.1) and bounded. The uniqueness is guaranteed because s(t, C) is an M-estimate of scale.

To use this method, a starting value is needed for s₀. The usual approach is to set s₀ = {s : h(s|t₁, C₁) = b}. I suggest using s₀ = {s : h(s|x̄, S/|S|^{1/p}) = b} because it forces the S-estimate to beat the classical estimate. If it does not, the location estimate becomes x̄ and the scatter becomes s₀²S/|S|^{1/p}. Furthermore, if the probability of beating the classical estimate is β, then the covariance of the location estimator becomes (1 − β)cov(x̄) + β cov(t̂). Since x̄ is the most efficient estimator for the uncontaminated model, the efficiency of the location estimator has been improved. The estimated scatter is also slightly improved because the estimated correlation structure receives the same benefits as the location estimator. Unfortunately, the scale estimate does not, because it is an M-estimate and, therefore, not very efficient unless a low breakdown point is used. This starting point should also work if the sample is contaminated because there is a high probability that at least one of the subsamples will contain all good points. This subsample alone should cause s₀ to be updated, thus changing the location, correlation and scale estimates.

3.3 τ-Estimators

A drawback of S-estimates is that they cannot achieve a high breakdown point and high efficiency at the same time. Yohai and Zamar [28] addressed this problem in 1988 with the introduction of τ-estimates. A τ-estimate is an adaptive multiple of an M-estimate of scale. In the simple case of univariate location and scale, let

τ²(t) = s²(t) (1/(n b₂)) Σ_{i=1}^n ρ₂( (x_i − t)/s(t) )    (16)

where s(t) is an M-estimate of scale defined, for a given t, as the solution to

(1/n) Σ_{i=1}^n ρ₁( (x_i − t)/s ) = b₁.

ρ₂ defines another M-estimate which is usually different from s(t), and b₂ is the normalizing constant for ρ₂. Instead of minimizing s(t) over t, τ(t) is minimized. The value that minimizes τ(t) is the location estimate and τ(t̂) is the scale estimate.
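The univariate τ-estimate of (16) can be sketched with a bisection M-scale and a grid search over t. The ρ functions and tuning constants below are illustrative stand-ins, not the thesis's choices; the uncalibrated b₂ does not affect the minimizing t:

```python
import numpy as np

def m_scale(r, b1=0.5, c1=1.56):
    """Univariate M-scale s: solve (1/n) sum rho1(r_i/s) = b1 by bisection."""
    def rho1(u):
        u = np.minimum(np.abs(u) / c1, 1.0)
        return 1.0 - (1.0 - u ** 2) ** 3
    lo, hi = 1e-8, 1e8
    for _ in range(60):
        s = 0.5 * (lo + hi)
        lo, hi = (s, hi) if rho1(r / s).mean() > b1 else (lo, s)
    return 0.5 * (lo + hi)

def tau_sq(t, x, b2=0.2, c2=3.0):
    """tau^2(t) of (16) with a wider biweight rho2; b2 is a placeholder
    normalizing constant (not calibrated)."""
    s = m_scale(x - t)
    u = np.minimum(np.abs((x - t) / s) / c2, 1.0)
    rho2 = 1.0 - (1.0 - u ** 2) ** 3
    return s ** 2 * rho2.mean() / b2

# the tau-location estimate minimizes tau^2(t); a grid search suffices here
x = np.concatenate([np.random.default_rng(1).normal(0, 1, 95),
                    np.full(5, 50.0)])                 # 5 gross outliers
grid = np.linspace(-2, 2, 81)
t_loc = grid[np.argmin([tau_sq(t, x) for t in grid])]
```

Because the inner M-scale has a 50% breakdown point, the 5 gross outliers barely move the minimizing t, while the wider ρ₂ keeps the scale efficient on the clean part of the sample.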
τ-estimates of regression can be defined in an analogous way.

In 1990, Lopuhaä generalized τ-estimates to the multivariate context. He defined them to be the vector t̃ and matrix

V̂ = (s̃²C̃/b₂) (1/n) Σ_{i=1}^n ρ₂[ {(x_i − t̃)'C̃^{-1}(x_i − t̃)}^{1/2} / s̃ ]    (17)

where t̃ and C̃ are the vector and positive definite symmetric matrix that minimize

s²(t, C) (1/n) Σ_{i=1}^n ρ₂[ {(x_i − t)'C^{-1}(x_i − t)}^{1/2} / s(t, C) ]    (18)

subject to

(1/n) Σ_{i=1}^n ρ₁[ {(x_i − t)'C^{-1}(x_i − t)}^{1/2} / s ] = b₁.    (19)

τ-estimates are a generalization of S-estimates. Indeed, if ρ₁ = ρ₂ and b₁ = b₂ then (17) reduces to a multivariate S-estimate as defined in Section 3.2. This implies that the classical mean and covariance as well as the MVE are special cases of τ-estimates. However, in practice ρ₁ and b₁ are taken to be quite different from ρ₂ and b₂ so that high breakdown and good efficiency can be combined. ρ₁ controls the breakdown properties of the τ-estimate, provided that

2ρ₂(x) − xρ₂'(x) > 0, for x > 0    (20)

and ρ₂ is bounded. In fact, it can be shown that in this case the breakdown point against outliers is given by b₁/M₁, where M₁ = sup ρ₁. (20) is needed to guarantee that (18) is strictly increasing in the magnitude of C. ρ₂ alone controls the efficiency of the estimate. To see this, consider the univariate case with ρ₁ defining any estimate with a 50% breakdown point and let ρ₂ = ρ as defined in (4); then

lim_{k→∞} τ²(t) = lim_{k→∞} s²(t) (1/n) Σ_{i=1}^n ( (x_i − t)/s(t) )² = (1/n) Σ_{i=1}^n (x_i − t)²

which defines the classical estimates.

The other properties of τ-estimates depend on ρ₁, b₁, ρ₂ and b₂ simultaneously. If b₁ and b₂ are chosen so that ρ₁ and ρ₂ each defines a consistent S-estimate then the resulting τ-estimate will also be consistent. As for the asymptotic distribution of the τ-estimate, one must consider the multivariate S-estimate for t̃ and C̃ as defined by (12) and (13), where the functions

u(x) = Aρ₁'(x) + Bρ₂'(x)   and   v(x) = Aρ₁(x) + Bρ₂(x) − 2b₂{ρ₁(x) − b₁}

depend on the data through

A = (1/n) Σ_{i=1}^n {2ρ₂(y_i) − y_i ρ₂'(y_i)}   and   B = (1/n) Σ_{i=1}^n y_i ρ₁'(y_i),

where y_i = {(x_i − t̃)'C̃^{-1}(x_i − t̃)}^{1/2}/s̃.
The asymptotic behaviour of the actual τ-estimators t̂ and V̂ can be obtained from that of t̃ and C̃ [12].

The τ-estimate is computationally more difficult than the S-estimate because it is the solution of a constrained minimization problem. However, a resampling scheme similar to the one derived for S-estimators also works for τ-estimators. Consider

h₁(s|t, C) = (1/n) Σ_{i=1}^n ρ₁( √((x_i − t)'C^{-1}(x_i − t)) / s )    (21)

and

h₂(s|t, C) = (1/n) Σ_{i=1}^n ρ₂( √((x_i − t)'C^{-1}(x_i − t)) / s )    (22)

where |C| = 1. With this notation the τ-estimator becomes the vector t̃ and matrix

V̂ = s̃²C̃ h₂(s̃|t̃, C̃)/b₂

where t̃ and C̃ are the vector and positive definite symmetric matrix that minimize s²h₂(s|t, C) subject to h₁(s|t, C) = b₁. Therefore, for each subsample of size p + 1, the mean t_J and rescaled covariance C_J such that |C_J| = 1 are used to compute k_J = s_J²h₂(s_J|t_J, C_J), where s_J solves h₁(s_J|t_J, C_J) = b₁. The subsample that produces the smallest k_J is used to approximate the τ-estimate.

As in the algorithm for the S-estimate, the equations need not be solved for each subsample, because of the monotonicity of h₁(s|t, C) and s²h₂(s|t, C); see (20). Suppose our current minimum is k₀ = s₀²h₂(s₀|t₀, C₀) and we have t and C from the next subsample drawn. If h₁(s₀|t, C) > b₁ and s₀²h₂(s₀|t, C) > k₀, then the subsample cannot produce the minimum: s must be increased to satisfy (19) because h₁(s|t, C) is a decreasing function of s, and this guarantees that s²h₂(s|t, C) > k₀ because the latter is an increasing function of s. Therefore the resampling algorithm for the τ-estimate works as follows.

1. Calculate an initial value k₀ = s₀²h₂(s₀|t₀, C₀).
2. For J in 1, ..., m,
   (a) select subsample J containing exactly p + 1 distinct points,
   (b) compute
       t_J = (1/(p+1)) Σ_{i∈J} x_i   and   C_J = (1/p) Σ_{i∈J} (x_i − t_J)(x_i − t_J)',
   (c) rescale C_J so that |C_J| = 1,
   (d) if h₁(s₀|t_J, C_J) < b₁ or s₀²h₂(s₀|t_J, C_J) < k₀ then:
       i. s_J = {s : h₁(s|t_J, C_J) = b₁};
       ii. t_* = t_J;
       iii. C_* = C_J;
       iv. if s_J²h₂(s_J|t_*, C_*) < k₀ then:
           A. s₀ = s_J;
           B. t₀ = t_*;
           C. C₀ = C_*;
           D. k₀ = s₀²h₂(s₀|t₀, C₀);
3. V₀ = C₀k₀/b₂,
4.
Return t₀ and V₀.

The expected number of times that (18) needs to be solved appears to be quite difficult to compute and is left as a subject of further research.

3.4 Stahel-Donoho Estimator

Another multivariate estimator of location and scatter with a high breakdown point was obtained in 1981 by Stahel [27] and independently by Donoho [7] in 1982. To calculate this estimator, called the "outlyingness-weighted mean and covariance", one begins by defining a measure of "outlyingness" for x_i ∈ R^p:

u_i = sup_{||v||=1} |v'x_i − med_j(v'x_j)| / mad_j(v'x_j).    (23)

Next, one defines a weight function w : [0, ∞) → [0, ∞) such that w(x) is decreasing and xw(x) is bounded for x > 0. Finally, one combines w_i = w(u_i) and x_i to calculate the following weighted mean vector and covariance matrix:

t̂ = Σ_{i=1}^n w_i x_i / Σ_{i=1}^n w_i    (24)

V̂ = Σ_{i=1}^n w_i²(x_i − t̂)(x_i − t̂)' / Σ_{i=1}^n w_i².    (25)

Donoho [7] shows that (24) is affine equivariant and has a breakdown point of (⌊(n + 1)/2⌋ − p)/n, which tends to 1/2 as n → ∞, provided that no more than p points of x₁, ..., x_n lie in a (p − 1) dimensional affine subspace. Similarly, one can show that the same properties hold for (25). The asymptotic behaviour of the estimators, such as rate of convergence, limiting distribution and consistency, has not been investigated yet.

The motivation for the definition of u_i is the classical Mahalanobis distance, defined by

m_i = sup_{||v||=1} |v'x_i − v'x̄| / SD(v'x) = √((x_i − x̄)'S^{-1}(x_i − x̄)).    (26)

Unfortunately, u_i cannot be expressed in a form that is easy to compute because, unlike the SD, there is no known multivariate covariance estimator whose univariate equivalent is the mad. Therefore, all directions in R^p must be searched in order to compute a single u_i. Obviously this cannot be done.

In 1990, Patak [20] proposed a modification to the Stahel-Donoho estimator that is easy to compute. The method is iterative, with the weights for the (k + 1)st iteration being updated by the principal components of S_k, as calculated in the kth iteration of the algorithm.
Suppose we have S_k and weights W^k; then the (k + 1)st step of the algorithm works as follows:

1. Set a_j, j = 1, ..., p equal to the eigenvectors of S_k;
2. Let w_ij = w(u_ij), where w : [0, ∞) → [0, ∞) is such that w(x) is decreasing, xw(x) is bounded for x > 0 and u_ij = |a_j'x_i − med(a_j'x)| / mad(a_j'x);
3. W_i^{k+1} = Π_{j=1}^p w_ij;
4. If W_i^{k+1} > W_i^k then W_i^{k+1} = W_i^k;
5. t_{k+1} = Σ_{i=1}^n W_i^{k+1} x_i / Σ_{i=1}^n W_i^{k+1};
6. S_{k+1} = Σ_{i=1}^n (W_i^{k+1})²(x_i − t_{k+1})(x_i − t_{k+1})' / Σ_{i=1}^n (W_i^{k+1})².

The weights, after convergence, are used to compute the final estimates t̂ and V̂ as defined in (24) and (25). The convergence of the algorithm is guaranteed because the weights are decreasing and bounded below by zero. There could be a problem if all the weights converge to zero, but Patak showed that at least p + 1 points that do not lie in a lower dimensional hyperplane will have weights greater than zero, so the estimators will not collapse or become singular. t̂ is consistent, orthogonal equivariant, and possesses a breakdown point of (⌊n/2⌋ − p)/n. V̂ is also orthogonal equivariant and possesses a breakdown point of (⌊n/2⌋ − p)/n, but it is only consistent to within a constant k that depends on the underlying distribution and the particular weight function used. k can be quite difficult to compute in general, so I recommend estimating it from the data. Let ρ define an M-estimate of scale; then k can be estimated by λ̂, the solution to

(1/n) Σ_{i=1}^n ρ( {(x_i − t̂)'V̂^{-1}(x_i − t̂)}^{1/2} / λ ) = b.    (27)

It follows from the properties of M-estimates that λ̂²V̂ is consistent.

The above algorithm describes how S_{k+1} is computed from S_k. However, one needs an initial estimate of covariance S₀, which can be calculated through the following algorithm.

1. Set W_i^0 = 1, i = 1, ..., n;
2. For k = 1 to m,
   (a) randomly generate a₁ ~ N(0, I) and normalize it;
   (b) calculate a₂, ..., a_p ∈ R^p such that a_i'a_j = 0 for 1 ≤ i < j ≤ p;
   (c) compute w_ij = w(u_ij);
   (d) let W_i^k = Π_{j=1}^p w_ij;
   (e) if W_i^k < W_i then W_i = W_i^k;
3. t₀ = Σ W_i x_i / Σ W_i;
4. S₀ = Σ W_i²(x_i − t₀)(x_i − t₀)' / Σ W_i².

The number of iterations m is quite arbitrary. It is used to control computational efficiency.
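The initial-weight search above, together with the weighted estimates (24) and (25), can be sketched as follows. The hard-rejection weight w(u) = 1{u ≤ cut} and the cutoff value are illustrative choices of mine, not Patak's:

```python
import numpy as np

def mad(v):
    """Median absolute deviation (no consistency factor)."""
    return np.median(np.abs(v - np.median(v)))

def sd_weights(X, m=100, cut=5.0, seed=0):
    """Approximate the outlyingness u_i of (23) over m random orthonormal
    bases, as in the initial step of the algorithm; the hard-rejection
    weight w(u) = 1{u <= cut} is an illustrative choice."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    u = np.zeros(n)
    for _ in range(m):
        A = np.linalg.qr(rng.normal(size=(p, p)))[0]  # random orthonormal basis
        for j in range(p):
            z = X @ A[:, j]
            u = np.maximum(u, np.abs(z - np.median(z)) / mad(z))
    return (u <= cut).astype(float)

def weighted_mean_cov(X, w):
    """Outlyingness-weighted mean (24) and covariance (25)."""
    t = (w[:, None] * X).sum(axis=0) / w.sum()
    R = X - t
    V = (w[:, None] ** 2 * R).T @ R / (w ** 2).sum()
    return t, V

X = np.vstack([np.random.default_rng(1).normal(size=(95, 2)),
               np.full((5, 2), 20.0)])                # 5 identical outliers
w = sd_weights(X)
t0, S0 = weighted_mean_cov(X, w)
```

Because any orthonormal basis in R² projects the outlier direction onto at least one of its axes with coefficient ≥ 1/√2, a moderate number of random bases suffices to expose the cluster at (20, 20) and zero out its weights.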
Small m means a very efficient estimate from a computational point of view, but the starting point can be extremely poor. If we let m get large, the starting point will be excellent. In fact,

lim_{m→∞} t₀ = t̂   and   lim_{m→∞} S₀ = V̂

as defined in (24) and (25) respectively, but these estimators are computationally prohibitive. Therefore m must be chosen to optimize the algorithm in terms of speed and quality of the initial estimates. In his thesis, Patak sets m = 10p. However, it might be more realistic to allow the data to determine the number of iterations. For example, if the weights do not change significantly after a predetermined number of iterations, then stop and use the current value of S₀.

Another modification to this procedure is to centre the data first by some location estimate t̂. This location estimator should be independent of the covariance estimator and possess all the properties of t̂. Thus, it must be consistent, orthogonal equivariant and have a breakdown point of 1/2 asymptotically. The L1 location estimator [12], defined by

argmin_T Σ_{i=1}^n ||x_i − T||,

has all the required properties. When the modified procedure is carried out, the data are assumed to be centred and therefore t_k = 0 and med(a'x) = 0 by assumption.

This estimate was originally proposed in the context of principal component analysis, and for that purpose orthogonal equivariance is enough. However, as an estimator of multivariate location and scatter, one would prefer an affine equivariant estimate. S_k is not affine equivariant because it depends on the principal components of S_{k−1}. S₀ is not affine equivariant either because it uses an orthogonal basis at each iteration. Even if the calculation of S₀ is modified to skip this step, it is still not affine equivariant because, in general,

|a'x_i − med(a'x_j)| / mad(a'x_j)  ≠  |a'Ax_i − med(a'Ax_j)| / mad(a'Ax_j)

for an arbitrary affine transformation A.
In order to obtain affine equivariance, the direction a must be transformed to a direction b such that

|a'x_i − med(a'x_j)| / mad(a'x_j) = |b'Ax_i − med(b'Ax_j)| / mad(b'Ax_j).

This implies that b = (A')^{-1}a. So, the algorithm can be made affine equivariant if the transformation A is known. This is usually not the case, so we attempt to estimate it. The covariance structure of the data is central to estimating a transformation A. If the structure is known before and after the transformation, then one can estimate A by S_y^{1/2}S_x^{-1/2} = AS_x^{1/2}S_x^{-1/2} = A, where y = Ax. When S_x is unknown, it can be assumed to be I and A becomes S_y^{1/2}.

This method works well if the data are uncontaminated, but certain types of contamination can destroy the estimated transformation in such a fashion that the direction of the outliers has a lower probability of being searched. Thus, there appears to be no easy way to make the algorithm affine equivariant and robust.

Example 1 The sample y₁, ..., y₄₀, with ȳ = (500, 0) and S_y = I, is added to the sample x₁, ..., x₆₀ with x̄ = (0, 0) and S_x = I. The final covariance structure is

S = [ 60000  0 ]      and      S^{-1/2} = [ 0.00408  0 ]
    [     0  1 ]                          [       0  1 ]

If a = (0.8, 0.6), then the outliers appear to be approximately 400 standard units away from the centre. However, S^{-1/2}a = (0.00327, 0.6) and the outliers now appear to be around 2.72 standard units from the centre.

4 Detection and Testing of Outliers

For the above estimators to work, one only needs a bound on the amount of contamination in the sample. If this bound is not exceeded, then the estimators will perform reasonably well. However, these estimators yield no information about the sample itself. They do not estimate the amount of contamination, nor do they indicate which points are the outliers, if any are present at all. If the sample is clean and the underlying distribution is normal, then the best estimators of location and scatter are the sample mean and covariance. Therefore it would be wise to check if the sample is contaminated and use the appropriate estimator based on the results.
The decision can be made through various techniques of detecting and testing for the presence of outliers.

In order to detect a multivariate outlier, we must have some notion of "extreme". One such notion is the statistical norm X², defined by X² = (x − μ)'Σ^{-1}(x − μ), where x comes from a distribution with mean μ and covariance Σ. If the statistical norm is too large, then the point is considered to be an outlier. Too large depends on the distribution of X². For example, if x is normally distributed then X² follows a χ²_p distribution and too large is defined as X² > χ²_{p,α}, where χ²_{p,α} = {x : F_{χ²_p}(x) = 1 − α}. Unfortunately, the statistical norm assumes the true location and scale are known.

When the true location and scale are unknown, the statistical norm can be estimated by replacing the parameters with accurate estimates. Suppose we have x₁, ..., x_n, y ~ F, with Σ x_i/n = x̄ and Σ (x_i − x̄)(x_i − x̄)'/(n − 1) = S. Then the sample Mahalanobis distance D² = (y − x̄)'S^{-1}(y − x̄) can be used to approximate X². If F is a normal distribution, then D² follows a scaled F_{p,n−p} distribution with scaling factor k = (n² − 1)p/((n − p)n). Therefore, to detect all the outliers in a sample z₁, ..., z_{n+1}, one can apply the above procedure to the n + 1 partitions of the data defined by removing the ith point from the sample, considering that observation to be y while the remaining n points play the role of x₁, ..., x_n. Mathematically, the ith partition is defined as y = z_i and

x_j = z_j for j = 1, ..., i − 1;    x_j = z_{j+1} for j = i, ..., n.    (28)

Next, one calculates D_i² for the ith partition and declares those points for which D_i²/k exceeds the upper α quantile of the F_{p,n−p} distribution to be outliers. The reason for excluding y from the calculation of x̄ and S is to ensure that the F distribution holds. If D̄² = (y − x̄)'S̄^{-1}(y − x̄) is used instead of D², where x̄ is the mean of the entire sample and S̄ is the covariance of the entire sample, then the distribution of D̄² is unknown and would have to be derived in order to establish an accurate bound for outlier detection.

Now assume that x₁, ..., x_n, y ~ F_ε, and we wish to check if y is an outlier.
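Before moving to the contaminated case, the classical leave-one-out procedure of (28) can be sketched as follows (scipy's F quantile replaces a table lookup; the function name is mine):

```python
import numpy as np
from scipy import stats

def detect_outliers_classical(Z, alpha=0.05):
    """Leave-one-out detection per (28): treat z_i as y, fit xbar and S to
    the other n points, and flag D_i^2 / k above the upper-alpha F_{p,n-p}
    quantile, with k = (n^2 - 1) p / ((n - p) n)."""
    npts, p = Z.shape
    n = npts - 1                            # points playing x_1, ..., x_n
    k = (n ** 2 - 1) * p / ((n - p) * n)
    cut = stats.f.ppf(1 - alpha, p, n - p)
    flags = np.zeros(npts, dtype=bool)
    for i in range(npts):
        Xi = np.delete(Z, i, axis=0)
        xbar = Xi.mean(axis=0)
        Sinv = np.linalg.inv(np.cov(Xi, rowvar=False))
        d2 = (Z[i] - xbar) @ Sinv @ (Z[i] - xbar)
        flags[i] = d2 / k > cut
    return flags

rng = np.random.default_rng(7)
Z = np.vstack([rng.normal(size=(40, 2)), [[8.0, 8.0]]])
flags = detect_outliers_classical(Z)
```

Note that each point is excluded from its own fit, which is exactly what makes the scaled F distribution valid; with a single gross outlier there is no masking and the point is flagged easily.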
D² does not work anymore because we have inaccurate estimates of μ and Σ. Using robust estimates, say t_x and V_x, to calculate V² = (y − t_x)'V_x^{-1}(y − t_x), one can solve this problem, except that the distribution of V² must be derived. Robust estimates work because they are, to a great degree, independent of the outliers in the sample. If one wishes to detect the outliers for an entire sample, then one can calculate V_i² for each partition as defined by (28) and declare the points for which V_i² is beyond a certain cutoff point to be outliers. However, the time needed to calculate n robust values of t_x and V_x can be quite substantial. Therefore, t and V should be calculated only once, using the entire sample, and D_r² = (y − t)'V^{-1}(y − t) used instead of V². This works because the distributions of the statistics need to be derived before either can be used, and removing an outlier from the sample before the robust estimates are calculated will not change them by much.

Outlier detection does not differentiate between extreme observations and contaminants. In fact, for any fixed cutoff point, we can guarantee the detection of at least one outlier in a clean sample by increasing the sample size. For example, in a clean sample of size 200 with α = .01, the probability of at least one observation being detected as an outlier is approximately 1 − 0.99^200 = 0.866. The advantage of testing a sample for outliers is that, regardless of sample size, a good observation will only be rejected with a specified level α. However, the disadvantage is that a contaminant is only rejected with a probability β, which depends on the particular type of contamination and test being used.

4.1 Tests based on Wilk's Lambda

The classical approach to testing and detecting multivariate outliers is based on the Wilk's Lambda statistic used in a MANOVA setup. Suppose x₁, ..., x_n ~iid F, such that k of the n points in the sample come from H. Suppose the sample is permuted so that these points are x₁, ..., x_k.
One wishes to test the null hypothesis

H₀ : x_i ~iid F for i = 1, ..., n,

against the alternative

H₁ : x_i ~ H for i = 1, ..., k and x_i ~ F for i = k + 1, ..., n.

Under H₀, the sufficient statistic is A = (n − 1)S, where S is the classical sample covariance matrix defined on page 12; however, if H₁ is true, then the sufficient statistic for F is A_(k) = Σ_{i=k+1}^n (x_i − x̄_(k))(x_i − x̄_(k))', where x̄_(k) = Σ_{i=k+1}^n x_i/(n − k). There are several possible test statistics, all of them based on the eigenvalues of A_(k)A^{-1}, where the ith eigenvalue of A_(k)A^{-1} is (1 + λ_i)^{-1} for i = 1, ..., p. Wilk's lambda, defined as Λ = Π_{i=1}^p (1 + λ_i)^{-1} = |A_(k)A^{-1}|, is the generalized likelihood ratio test. Λ can also be defined as |A_(k)|/|A|, and in this form it can be seen that Λ rejects H₀ when it is too small, because A exceeds A_(k) by the scatter of the k suspected outliers about the remaining points:

A = A_(k) + A_k + ((n − k)k/n)(x̄_(k) − x̄_k)(x̄_(k) − x̄_k)',

where A_k and x̄_k are the scatter matrix and mean of x₁, ..., x_k.

Λ is known to have good power when H is a N(μ*, Σ) distribution, and the null distribution is either known or can be closely approximated. The null distribution depends on the sample size and number of suspected outliers. Let Λ_{k,p,n} denote the value of Λ for a p dimensional sample of size n with a k-outlier configuration. The distributions of Λ_{1,p,n} and Λ_{2,p,n} are very straightforward: Λ_{1,p,n} follows a β_{(n−p−1)/2, p/2} distribution, while Λ_{2,p,n} is distributed like a β_{n−p−2, p} random variable. More generally, Λ_{k,p,n} can be expressed in terms of the product of (k + 1)/2 beta random variables [8].

Other possible test statistics, such as Roy's largest root, defined as λ_max = max(λ_i), and Hotelling's T² = Σ λ_i, tend to be less powerful than Λ and have unknown null distributions.

When the true outlier configuration is unknown but the exact number present is known, an outlier test can still be performed. This is done by calculating Λ for all M = n!/(k!(n − k)!) possible configurations and using the one that minimizes Λ to identify the possible outliers. Unfortunately, the distribution of min(Λ_k) is unknown and the p-value is therefore based on a conservative bound given by the Bonferroni inequality. Another disadvantage of this method is the substantial time needed to calculate Λ for all possible outlier configurations.
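For k = 1 the search over configurations reduces to a search over single-point deletions; a sketch (the function name is mine):

```python
import numpy as np

def wilks_lambda_min(X):
    """For k = 1, Lambda for the configuration deleting point i is
    |A_(i)| / |A|; return the minimizing index and the minimum."""
    xbar = X.mean(axis=0)
    R = X - xbar
    detA = np.linalg.det(R.T @ R)           # |A|, full-sample scatter
    best_i, best_lam = -1, np.inf
    for i in range(len(X)):
        Xi = np.delete(X, i, axis=0)
        Ri = Xi - Xi.mean(axis=0)
        lam = np.linalg.det(Ri.T @ Ri) / detA
        if lam < best_lam:
            best_i, best_lam = i, lam
    return best_i, best_lam

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
i_min, lam_min = wilks_lambda_min(X)
```

Removing a point can only shrink the scatter matrix, so 0 < Λ ≤ 1, and the point whose removal shrinks |A| the most is the single-outlier candidate.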
For example, if n = 50 and k = 10, the total number of outlier configurations possible is over ten billion, of which only one will contain all n − k good points in a single partition. In the univariate case, this problem can be avoided because the sample can be ordered, leaving only k + 1 of the M configurations as possible candidates for the true outlier configuration. Unfortunately, there is no accurate way to order a multivariate sample unless the true location and scatter are known.

The assumption that k is known is another serious weakness of this method because in practice it usually is unknown. An upper bound on k can always be assumed, because if k > ⌊n/2⌋ then F would be considered to be the contaminating distribution and H would be considered to be the true distribution. Now assume that one knows that k ≤ m; then an outlier test can be performed by considering all n!/(k!(n − k)!) possible outlier configurations for each candidate k. For each configuration, the p-value is calculated for Λ_{k,p,n}, and the most statistically significant configuration is used to identify the number of outliers and the particular points that are considered as possible outliers. Whether or not these are true outliers is again decided using the Bonferroni inequality.

In 1992, Caroni and Prescott [3] suggested that Wilk's outlier test be applied sequentially, assuming that there are at most m outliers in the data. Starting with a single outlier, Λ_{1,p,n} is computed for each point by removing that point from the sample. The statistic D₁ = min_i(Λ_{1,p,n}) is saved and the observation that corresponds to the minimum, x_(1), is removed from the sample. The procedure is then applied to the reduced sample x₁^(1), ..., x_{n−1}^(1). After m repetitions of the procedure, one produces a sequence of statistics D_m, D_{m−1}, ..., D₁, where D_i = min_j Λ_{1,p,n−i+1}, and the corresponding observations x_(m), ..., x_(1). These are compared against the appropriate critical values λ_m, ..., λ₁,
where λ_i is the α critical value of the β_{(n−i−p)/2, p/2} distribution; the number of outliers declared by the sequential procedure is the highest value r such that D_r < λ_r, or zero if none are significant. Caroni and Prescott show that the sequential Wilk's outlier test has a type I error which is only slightly conservative for the sought level α when testing for the (k + 1)st outlier when only k exist in the data. Furthermore, the test will be reasonably powerful when only a single outlier remains, because at that point the test becomes equivalent to the generalized likelihood ratio test. Applying the test in a forward fashion makes it immune to the effects of swamping. However, since it is based on classical methods, it will be very vulnerable to the effects of masking. This vulnerability exposes the main weakness of the sequential Wilk's outlier test. If the outliers are masking one another, the sequential method may detect several good data points and ignore the true outliers altogether. The error is amplified even further if the test actually declares some of these points to be true outliers. See Section 6 for an example of the effect of masking on this test.

4.2 The Scale-Ratio Test

The scale-ratio test was developed in the univariate setup by Le and Zamar [13] in 1991. The test statistic is a ratio of two scale estimates: σ̂, which is sensitive to outliers, and σ̃, which is highly robust. The test statistic is defined as v = σ̂/σ̃ and the null hypothesis is rejected when v is too large. Le and Zamar derive the asymptotic distribution of the scale-ratio test as well as conditions on the estimates σ̂ and σ̃ so that a Pitman efficient test is obtained.

There are several possible extensions of the scale-ratio test to the multivariate setup. One such generalization is based on the ratio of determinants. For example, suppose we have two estimates of scatter for the same sample: Σ̂, which is highly sensitive to outliers, and Σ̃, which is highly robust. Summarizing the size of each matrix by its
Summarizing the size of each matrix by itsdeterminant, we define the multivariate scale ratio test to beVo=’/fI (29)It should be noted that V0 reduces to v2 when p 1.Another measure of the relative size of a matrix are its eigenvalues. Supposehas eigenvalues.. > with corresponding eigenvectors é1,. . . , é and haseigenvalues i > ... > with corresponding eigenvectors ê1,. . .,ê, then a test statisticin this case can be defined as17*= rnax/ (30)However Vmax may lack power because ê, and ê, need not point in the same direction.We will illustrate this through an example.39Example 220 80 2001 01 04Suppose we have a two dimensional sample with covariance structure D. A single outlieris added to expand the scale of one of the co-ordinates by a factor offour. If the contamination occurs in the first co-ordinate, the covariance structure becomes and V =However if the contamination occurs in the second co-ordinate, the covariance structurebecomes 2 and V* = 2. Therefore V* can lack power under certain situations.A third possible measure of the relative size of two matrices isö2(a’x)Vmax=ItHaià2(a’x) (31)where & and ô- are univariate estimates of scale. If &2 and ô2 are the univariate versionsof and , respectively and these covariance estimators are affine equivariant thenreduces toT4nax maxaa= maxb’2” (32)IIaII1 a’a IIbII=i1/2 1/2.where b = a/I all. Therefore, “max is equa.1 to ) which is equal to the largesteigenvalue of = If at least one of these estimators is not affine equivariant then Vax Vmax only. It should also be noted that there are some univariateestimators, such as the mad, that do not have a multivariate equivalent. Therefore,Vax defines a larger class of test statistics than 4nax• However, in general Vax cannotbe computed directly if an equivalent Vmax does not exist because all directions of 1V’would have to be searched in order to compute it. 
Therefore, one should use V_max to approximate V*_max wherever possible.

The best statistic to use for the multivariate scale-ratio test will depend on the asymptotic distribution and local power versus a specific alternative hypothesis. Another problem is to choose the specific estimators Σ̂ and Σ̃. In the univariate case, Le and Zamar show that if σ̂ and σ̃ are M-estimates of scale such that ρ̃(x) − ρ̂(x) = θ(x⁴ − 6x²) for some θ, then the test is Pitman efficient.

Unfortunately, multivariate M-estimates of scale have a breakdown point of at most 1/(p + 1). Therefore a different estimator must be used. Any of the multivariate estimators proposed in Section 3 will work; which one is best is an area of further research.

The scale-ratio test tests the hypothesis

H₀ : x_i ~iid N(μ, Σ) for i = 1, ..., n   vs   H₁ : at least one x_i ~ H.    (33)

However, if the test rejects, there is no indication of how many or which points are contaminants. This problem is solved by applying the scale-ratio test in a forward sequential fashion. If the test rejects H₀, then the point with the largest robust Mahalanobis distance D_i = (x_i − t̃)'Σ̃^{-1}(x_i − t̃), where t̃ is the location estimate that corresponds to Σ̃, is removed and the test is re-applied to the remaining points. This procedure is repeated until the test does not reject H₀ or 50% of the points have been removed. If the latter happens, then one should question the assumption of normality in the null hypothesis.

If each test in the sequence is performed at level α, then the overall level of the sequential scale-ratio test is also α. This is obvious, since the probability of not rejecting the first point is 1 − α and no more tests will be conducted if this happens. Furthermore, the probability of rejecting at least one good point when there are k outliers in the data is less than or equal to α, assuming that max D_i corresponds to an outlier if one is still present. This assumption will be verified in Section 5 through simulation. A formal proof is the subject of further research.
However, under this assumption, the above statement holds because the probability of rejecting a good point after all the outliers have been removed is α, and the probability of removing all the outliers is less than or equal to one.

Another possible hypothesis that can be tested by the scale-ratio test is

H₀ : x_i ~iid N(μ, Σ)   vs   H₁ : x_i ~iid H,    (34)

where H is any non-normal distribution. The test works in this case because the robust covariance matrix is tuned and the cutoff points are set for a normal distribution. If the data are non-normal, then the robust estimate will tend to under-estimate the true covariance of a heavy tailed distribution and over-estimate the covariance of a short tailed distribution. The bias in the robust estimates is large enough to cause the test to reject if H is sufficiently different from a normal distribution. The test is applied only once for this alternative and, if the null hypothesis is rejected, the only conclusion is that the x_i's are not normally distributed.

The scale-ratio test can also be modified to work for other distributions besides the normal. To do this, one must calculate proper cutoff points for the test statistic. It may also help to re-tune the robust estimator of scale to fit the true distribution of the data.

Some information about the computer implementation of the scale-ratio test and the sequential Wilk's outlier test, as well as the approximation algorithms for the robust estimators, is available on page 70 in the appendix.

5 Properties of the Scale-Ratio Test

Before the scale-ratio test can be applied, the estimators Σ̂ and Σ̃ must be chosen. In all cases Σ̂ = S, the classical sample covariance matrix defined by (2). For Σ̃ we will consider two possible estimators: (a) an S-estimate with ρ(x) defined by (8) and a breakdown point of 50%, and (b) a Stahel-Donoho estimator, approximated by Patak's algorithm, with

w(x) = 1 if x ≤ χ²_{p, min(α, 0.99)}, and 0 otherwise,

subject to Σ_i w(x_i) < n, and with the consistency constant of (27) estimated using ρ defined by (8) with b = 1/2.
Furthermore, both V0 and Vmax, as given by (29) and (32) respectively, will be considered as possible test statistics.

The first order of business is to calculate the cutoff points for the scale-ratio test. This is done by computing the appropriate test statistic for 10000 N(0, I) samples of size n and estimating the cutoff points by the appropriate order statistic. The mean can be 0 and the covariance can be I without loss of generality because the estimators are equivariant to rotations and translations. The results are tabulated in Tables 4-7 for various sample sizes up to 300, p = 2, 3, 4 and α = .05, .025, .01. Other estimated cutoff points are presented with the specific application in Section 6. It should be noted that the theory of order statistics tells us that the estimated cutoff points {ĉ_i} are asymptotically normally distributed with parameters

μ_i = F^{-1}(1 − α_i),  σ_i² = α_i(1 − α_i) / (n f²[F^{-1}(1 − α_i)]),
σ_ij = α_j(1 − α_i) / (n f[F^{-1}(1 − α_i)] f[F^{-1}(1 − α_j)])  for α_i > α_j,

where F is the true distribution of the test statistic and f is the density of the test statistic.

Both Σ̂_S and Σ̂_SD are random variables for a given data set because both estimates contain a random element. The S-estimate uses a random subset of all possible subsamples of size p + 1, while the Stahel-Donoho estimate is based on random projections in R^p. The variability of the estimates is of interest because it can have a significant effect on their performance. If the coefficient of variation, cv = s/x̄, is too large, then the approximation will be too imprecise and possibly lack power because of this added variability. Simulations indicate that the cv for Σ̂_SD is quite a bit smaller than the cv for Σ̂_S. Both are acceptable when the sample is uncontaminated, but the cv for Σ̂_S appears to increase dramatically with the level and the magnitude of the contamination. Also, asymmetric contamination appears to be more damaging than symmetric. The cv for Σ̂_SD is only mildly affected by the presence of contamination.
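The order-statistic estimation of the cutoff points described above can be sketched as follows. The `stat` callable is an assumption: it stands in for any of the scale-ratio statistics, which are defined elsewhere in the thesis.

```python
import numpy as np

def estimate_cutoffs(stat, n, p, alphas=(0.05, 0.025, 0.01),
                     n_rep=10000, rng=None):
    """Estimate critical values of a test statistic by simulation.

    Draws n_rep samples of size n from N(0, I_p) -- sufficient by the
    equivariance argument in the text -- evaluates `stat` on each, and
    returns the empirical upper alpha-quantiles (order statistics).
    `stat` is an assumed callable mapping an (n, p) sample to a scalar.
    """
    rng = np.random.default_rng(rng)
    vals = np.sort([stat(rng.standard_normal((n, p))) for _ in range(n_rep)])
    # upper-alpha cutoff = order statistic of rank ceil((1 - alpha) * n_rep)
    return {a: float(vals[int(np.ceil((1 - a) * n_rep)) - 1]) for a in alphas}
```

The standard errors of these estimates follow the asymptotic normal law for sample quantiles quoted above, so n_rep controls their precision directly.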
The cv for each estimator is summarized in Tables 8 and 9 for p = 2, contamination levels 0, .05, .25, n = 40, 80, 120 and H = N(3, .01), N(6, .01), N(0, 3), N(0, 6).

Next, I consider the asymptotic distribution of the test statistics. This is done through a Monte Carlo simulation of 5000 replicates under the null hypothesis. For various sample sizes, a probability histogram is plotted with an appropriate normal curve overlaid. The parameters for the normal curve are estimated from the sample itself. The plots for a sample size of 1000 in 2 dimensions are shown in Figure 3 for the four test statistics considered in this section. All of them appear to follow the normal curve quite closely. Therefore, for larger sample sizes, the cutoff points need not be approximated. Instead, a p-value can be computed based on the appropriate normal curve. Computing the asymptotic mean and variance of the test statistic, the rate of convergence, as well as a formal proof of the asymptotic distribution are areas of further research. However, the parameters of the normal curve can be estimated more accurately with fewer replicates than the appropriate order statistics. An estimated p-value is another advantage of the normal approximation.

Finally, we consider the power of the scale-ratio test. The test is applied 1000 times to random samples of various sizes, dimensions, contamination types and contamination levels in order to estimate the power under these conditions. Keeping all but one factor fixed, the power as a function of the changing factor is produced. A plot of this function for each test statistic will help us decide which is best for a given situation. Plotting power as a function of dimension is not done because the contamination type also depends on the dimension. Therefore, the results presented in this section are for the 2-dimensional case only. However, similar results appear to hold in higher dimensions as well.
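The power estimation just described amounts to a Monte Carlo loop over contaminated samples. The sketch below assumes a `test` callable (returning True when H0 is rejected) and uses an asymmetric point-contamination scheme like the N(3, .01) scenario of the tables; both are stand-ins for the configurations studied here.

```python
import numpy as np

def estimate_power(test, n, p, eps, shift, n_rep=1000, rng=None):
    """Monte Carlo power of an outlier test under point contamination.

    Each replicate: a size-n sample from N(0, I_p) in which a fraction
    eps of the points is replaced by draws from N(shift, .01 I_p)
    (asymmetric point contamination).  `test` is an assumed callable
    returning True when H0 is rejected.
    """
    rng = np.random.default_rng(rng)
    k = int(round(eps * n))
    rejections = 0
    for _ in range(n_rep):
        x = rng.standard_normal((n, p))
        if k:
            x[:k] = shift + 0.1 * rng.standard_normal((k, p))
        rejections += bool(test(x))
    return rejections / n_rep
```

Calling this with eps = 0 estimates the size of the test, which should be close to the nominal level if the cutoffs are accurate.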
The power curves appear in Figure 4 and are for a significance level of 0.01.

Six situations are considered for power calculations. First, a single outlier is added to 99 2-dimensional normal observations to make a sample of size 100. The magnitude of the outlier is increased and the power, for a significance level of 0.01, is plotted in the top left corner of Figure 4. The top right plot corresponds to a sample of n independent and identically distributed observations drawn from a bivariate distribution with independent double-exponential margins. The power is plotted as a function of sample size. This situation is considered as an example of the sensitivity of the scale-ratio test to non-normal distributions. If a short-tailed distribution is substituted for the double exponential distribution, the scale-ratio test appears to retain its sensitivity. The middle two plots explore the power of the scale-ratio test when the sample contains some symmetric contamination. Power as a function of contamination level and of sample size is considered here. The bottom two plots are the same as the middle two except that the contamination is asymmetric.

The plots indicate that the test defined by Σ̂_S is more sensitive than the test defined by Σ̂_SD. However, the two test statistics V0 and Vmax perform similarly. It is not surprising that V0 is more powerful when facing symmetric contamination and less powerful for asymmetric, but V0 also appears to be more sensitive to the normal assumption.
Another interesting feature is that V0 performs slightly better than Vmax when there are only one or two outliers and their deviations are not too extreme.

Table 4: Cutoff Points for V0 using Σ̂_S

                         Dimension and level
Sample          2                     3                     4
 Size    .05   .025   .01      .05   .025   .01      .05   .025   .01
  25    1.709  1.972  2.316   1.627  1.842  2.104   1.541  1.756  2.073
  30    1.570  1.757  2.045   1.508  1.673  1.913   1.436  1.621  1.890
  35    1.484  1.638  1.873   1.453  1.603  1.807   1.382  1.522  1.698
  40    1.445  1.582  1.778   1.399  1.530  1.686   1.376  1.513  1.682
  45    1.405  1.526  1.696   1.366  1.468  1.643   1.316  1.419  1.568
  50    1.374  1.490  1.638   1.324  1.421  1.552   1.287  1.388  1.525
  55    1.340  1.445  1.574   1.292  1.382  1.509   1.260  1.342  1.439
  60    1.329  1.416  1.529   1.287  1.379  1.509   1.260  1.330  1.445
  80    1.256  1.329  1.411   1.231  1.294  1.368   1.205  1.266  1.341
 100    1.218  1.277  1.345   1.192  1.247  1.316   1.174  1.231  1.290
 120    1.199  1.248  1.316   1.174  1.221  1.286   1.150  1.197  1.259
 140    1.179  1.225  1.283   1.161  1.200  1.260   1.148  1.192  1.239
 160    1.166  1.206  1.255   1.143  1.184  1.230   1.131  1.167  1.209
 180    1.158  1.194  1.244   1.142  1.179  1.223   1.122  1.156  1.196
 200    1.147  1.181  1.229   1.129  1.165  1.208   1.118  1.147  1.186
 220    1.139  1.174  1.225   1.122  1.156  1.192   1.111  1.140  1.176
 240    1.134  1.165  1.204   1.117  1.148  1.184   1.109  1.134  1.166
 260    1.121  1.154  1.189   1.112  1.141  1.174   1.101  1.126  1.158
 280    1.120  1.151  1.187   1.107  1.133  1.166   1.097  1.121  1.149
 300    1.114  1.143  1.179   1.106  1.133  1.164   1.094  1.118  1.147

Table 5: Cutoff Points for Vmax using Σ̂_S

                         Dimension and level
Sample          2                     3                     4
 Size    .05   .025   .01      .05   .025   .01      .05   .025   .01
  25    2.410  2.806  3.469   2.998  3.497  4.400   3.445  4.081  5.085
  30    2.071  2.354  2.828   2.439  2.801  3.352   2.704  3.129  3.689
  35    1.887  2.098  2.399   2.119  2.382  2.755   2.347  2.626  3.019
  40    1.734  1.929  2.161   1.901  2.083  2.362   2.029  2.250  2.533
  45    1.645  1.810  2.034   1.792  1.972  2.183   1.878  2.072  2.287
  50    1.566  1.688  1.867   1.689  1.825  1.999   1.741  1.876  2.068
  55    1.512  1.630  1.803   1.603  1.722  1.904   1.646  1.761  1.907
  60    1.462  1.569  1.711   1.552  1.646  1.795   1.590  1.691  1.861
  80    1.336  1.403  1.498   1.375  1.454  1.532   1.397  1.469  1.563
 100    1.266  1.326  1.397   1.288  1.343  1.405   1.299  1.353  1.413
 120    1.233  1.277  1.332   1.248  1.293  1.354   1.249  1.291  1.340
 140    1.201  1.242  1.302   1.220  1.261  1.309   1.216  1.250  1.289
 160    1.187  1.224  1.267   1.193  1.222  1.266   1.197  1.229  1.270
 180    1.172  1.203  1.246   1.177  1.211  1.248   1.176  1.199  1.232
 200    1.160  1.188  1.227   1.161  1.187  1.229   1.162  1.186  1.215
 220    1.149  1.179  1.216   1.148  1.173  1.206   1.150  1.173  1.204
 240    1.141  1.167  1.199   1.142  1.164  1.194   1.142  1.161  1.189
 260    1.131  1.153  1.182   1.134  1.156  1.179   1.133  1.152  1.178
 280    1.128  1.149  1.181   1.127  1.146  1.173   1.122  1.141  1.162
 300    1.121  1.141  1.168   1.121  1.139  1.162   1.117  1.135  1.157

Table 6: Cutoff Points for V0 using Σ̂_SD

                         Dimension and level
Sample          2                     3                     4
 Size    .05   .025   .01      .05   .025   .01      .05   .025   .01
  25    1.521  1.726  2.061   1.266  1.403  1.598   1.152  1.237  1.363
  30    1.442  1.596  1.843   1.254  1.364  1.508   1.163  1.251  1.377
  35    1.379  1.519  1.699   1.234  1.320  1.449   1.170  1.253  1.358
  40    1.336  1.456  1.619   1.236  1.318  1.419   1.164  1.245  1.327
  45    1.311  1.413  1.550   1.224  1.309  1.419   1.175  1.255  1.355
  50    1.297  1.392  1.530   1.222  1.295  1.401   1.178  1.251  1.343
  55    1.281  1.362  1.480   1.212  1.289  1.391   1.159  1.220  1.306
  60    1.258  1.339  1.457   1.206  1.274  1.361   1.156  1.214  1.238
  80    1.220  1.280  1.361   1.185  1.237  1.309   1.155  1.208  1.272
 100    1.195  1.245  1.321   1.163  1.213  1.276   1.138  1.187  1.249
 120    1.178  1.226  1.288   1.150  1.195  1.247   1.130  1.165  1.208
 140    1.165  1.209  1.261   1.139  1.177  1.219   1.126  1.161  1.208
 160    1.154  1.192  1.238   1.132  1.168  1.217   1.118  1.151  1.193
 180    1.144  1.181  1.232   1.125  1.160  1.199   1.112  1.141  1.183
 200    1.139  1.175  1.215   1.118  1.147  1.187   1.106  1.132  1.169
 220    1.131  1.164  1.199   1.118  1.146  1.188   1.102  1.130  1.158
 240    1.122  1.152  1.190   1.110  1.141  1.173   1.098  1.124  1.155
 260    1.121  1.149  1.182   1.106  1.131  1.162   1.097  1.119  1.146
 280    1.114  1.143  1.181   1.106  1.133  1.160   1.093  1.116  1.146
 300    1.112  1.139  1.172   1.101  1.128  1.160   1.090  1.113  1.140

Table 7: Cutoff Points for Vmax using Σ̂_SD

                         Dimension and level
Sample          2                     3                     4
 Size    .05   .025   .01      .05   .025   .01      .05   .025   .01
  25    3.000  3.802  4.771   1.974  3.247  5.019   1.038  1.061  1.108
  30    2.470  2.995  3.731   1.100  1.385  3.005   1.039  1.060  1.091
  35    2.114  2.537  3.175   1.079  1.114  1.635   1.040  1.058  1.080
  40    1.926  2.287  2.887   1.074  1.102  1.153   1.041  1.057  1.079
  45    1.752  2.111  2.512   1.073  1.099  1.138   1.041  1.058  1.079
  50    1.630  1.966  2.399   1.071  1.093  1.130   1.042  1.057  1.077
  55    1.479  1.745  2.182   1.067  1.090  1.120   1.038  1.051  1.069
  60    1.423  1.750  2.142   1.061  1.079  1.105   1.038  1.050  1.068
  80    1.186  1.419  1.706   1.058  1.073  1.093   1.036  1.047  1.060
 100    1.123  1.225  1.486   1.053  1.067  1.085   1.034  1.044  1.056
 120    1.100  1.153  1.378   1.048  1.063  1.079   1.032  1.040  1.052
 140    1.092  1.125  1.265   1.045  1.057  1.073   1.031  1.040  1.049
 160    1.083  1.109  1.176   1.042  1.052  1.067   1.029  1.037  1.046
 180    1.076  1.101  1.159   1.041  1.052  1.064   1.026  1.034  1.043
 200    1.071  1.089  1.121   1.039  1.048  1.062   1.026  1.032  1.041
 220    1.064  1.082  1.107   1.038  1.046  1.056   1.025  1.031  1.037
 240    1.061  1.079  1.102   1.036  1.044  1.054   1.024  1.030  1.038
 260    1.059  1.073  1.094   1.034  1.042  1.053   1.023  1.029  1.035
 280    1.057  1.072  1.094   1.033  1.041  1.050   1.022  1.028  1.035
 300    1.055  1.068  1.090   1.032  1.040  1.050   1.022  1.027  1.033

Table 8: Coefficient of Variation for 50% Biweight S-estimate

Contamination           Sample Size
Type       Level     40      80     120
None         0     0.007   0.002   0.002
N(3,.01)   .05     0.019   0.020   0.016
N(6,.01)   .05     0.054   0.052   0.052
N(3,.01)   .25     0.042   0.034   0.025
N(6,.01)   .25     0.108   0.115   0.114
N(0,3)     .05     0.010   0.005   0.001
N(0,6)     .05     0.028   0.023   0.018
N(0,3)     .25     0.027   0.012   0.007
N(0,6)     .25     0.049   0.025   0.019

Table 9: Coefficient of Variation for Stahel-Donoho Estimate

Contamination           Sample Size
Type       Level     40      80     120
None         0     0.004   0.002   0.001
N(3,.01)   .05     0.002   0.002   0.001
N(6,.01)   .05     0.001   0.001   0.001
N(3,.01)   .25     0.007   0.009   0.002
N(6,.01)   .25     0.003   0.002   0.001
N(0,3)     .05     0.003   0.002   0.001
N(0,6)     .05     0.005   0.001   0.001
N(0,3)     .25     0.006   0.003   0.002
N(0,6)     .25     0.003   0.002   0.001

Figure 3: Probability histograms of the four test statistics (determinant- and maximum-eigenvalue-based, for each robust estimator) with fitted normal curves overlaid; n = 1000, p = 2.

Figure 4: Power Curves for the Scale-Ratio Test. Panels: power vs. outlier magnitude (n = 100, p = 2, 1 outlier); power vs. sample size (p = 2, double exponential); power vs. level of contamination and vs. sample size for 5% symmetric outliers (p = 2, mu = 0, Var = 3 in each co-ordinate); power vs. level of contamination and vs. sample size for 5% asymmetric outliers (p = 2, mu = 3, Var = .01 in each co-ordinate).

6 Applications of the Scale-Ratio Test

In this section, the scale-ratio test will be applied to several data sets for the purpose of outlier detection. This includes an estimate of the number of outliers present as well as the identity of the bad observations. I will also compare the scale-ratio test to the sequential Wilk's outlier test. For the applications considered, I will use the scale-ratio test defined by Σ̂_S and V0 as the robust estimator and test statistic, respectively. This combination is chosen because it appears to have reasonable power against most alternatives.

The comparison begins by applying the scale-ratio test to the transportation cost data given in Johnson and Wichern [11]. These data are considered because Caroni and Prescott [3] use them in their paper. The purpose of this example is to show that the scale-ratio test produces similar results to the Wilk's outlier test in situations where the Wilk's test works well. The data appear in Table 10, the pairwise plots along with a few spin plots are in Figure 5, the results for the sequential Wilk's outlier test appear in Table 11 and the results for the scale-ratio test appear in Table 12.

The results of the two tests are very similar. The extreme points are selected in the same order: observation 9 is the most extreme, followed by observation 21 and then observation 36.
Both tests agree that observation 9 is a definite contaminant and observation 36 is not. However, they disagree about observation 21. The Wilk's outlier test rejects this point at the .025 level but not at .01, while the scale-ratio test rejects at .10 but not at .05. The plots indicate that observation 21 is an outlier and therefore the scale-ratio test does not appear to be quite as powerful as the Wilk's test for this sample.

The next application for outlier detection comes from the Eastern Lake Survey [6]. The data are extracted for the state of Pennsylvania and the chloride, fluoride and sulfate concentrations are considered in the log scale. The nitrate concentration is not used because it contains too many missing values. The sample size for this subset is 106. A spin plot of the data indicates that observations 33 and 98 are probably outliers, with observations 69, 29, 61 and 77 as possible candidates as well. The Q-Q plots of the marginal variables indicate that the most serious departure from normality occurs in the chloride ion, which appears to be heavy tailed. The other two appear to follow the normal distribution quite closely.

Figure 5: Scatterplots and Spin Plots for the Transportation Data (pairwise plots of Fuel, Repair and Capital, with observations 9, 21 and 36 marked).

Table 10: Transportation Cost Data: U.S. dollars per mile

Obs   Fuel   Repair  Capital      Obs   Fuel   Repair  Capital
 1   16.44   12.43   11.23        19   12.34    7.73   11.68
 2    7.19    2.70    3.92        20    8.51   14.02   12.01
 3    9.92    1.35    9.75        21   26.16   17.44   16.89
 4    4.24    5.78    7.78        22   12.95    8.24    7.18
 5   11.20    5.05   10.67        23   16.93   13.37   17.59
 6   14.25    5.78    9.88        24   14.70   10.78   14.58
 7   13.50   10.98   10.60        25   10.32    5.16   17.00
 8   13.32   14.27    9.45        26    8.98    4.49    4.26
 9   29.11   15.09    3.28        27    9.70   11.59    6.83
10   12.68    7.61   10.23        28   12.72    8.63    5.59
11    7.51    5.80    8.13        29    9.49    2.16    6.23
12    9.90    3.63    9.13        30    8.22    7.95    6.72
13   10.25    5.07   10.17        31   13.70   11.22    4.91
14   11.11    6.15    7.61        32    8.21    9.85    8.17
15   12.17   14.26   14.39        33   15.86   11.42   13.06
16   10.24    2.59    6.09        34    9.18    9.18    9.49
17   10.18    6.05   12.14        35   12.49    4.57   11.49
18    8.88    2.70   12.23        36   17.32    6.86    4.44

Table 11: Wilk's Outlier Test Applied to the Transportation Data

Sample  Point     Wilk's           Critical Values
 size   selected  statistic    .01    .025    .05    .10
  36       9       0.481      0.558  0.592  0.619  0.648
  35      21       0.577      0.548  0.583  0.611  0.640
  34      36       0.706      0.539  0.574  0.602  0.632

Table 12: Scale-ratio Test Applied to the Transportation Data

Sample  Point     Scale-ratio  Approximate Critical Values
 size   selected  statistic    .01    .025    .05    .10
  36       9       2.167      1.787  1.578  1.435  1.296
  35      21       1.321      1.818  1.616  1.449  1.308
  34      36       0.968      1.917  1.638  1.466  1.319

Table 13: Wilk's Outlier Test Applied to the Lakes Anion Data

Sample  Point     Wilk's           Critical Values
 size   selected  statistic    .01    .025    .05    .10
 106      33       0.786      0.813  0.828  0.840  0.852
 105      98       0.814      0.811  0.827  0.839  0.851
 104      69       0.875      0.810  0.826  0.838  0.850
The scatterplots and a few spin plots appear in Figure 6, the results for the Wilk's outlier test are in Table 13 and the results for the scale-ratio test appear in Table 14.

Again, the two tests seem to be in close agreement. The only difference is that the Wilk's outlier test rejects only observations 33 and 98 as outliers, whereas the scale-ratio test also includes 69 in this category.

The next example is used to demonstrate the vulnerability of the Wilk's outlier test to the effects of masking. It also serves as a demonstration that the scale-ratio test does not have the same weakness. The data are fictitious with obvious but severely masked outliers. The data appear in Table 15, the results of the Wilk's outlier test are in Table 16 and the results for the scale-ratio test are in Table 17.

Figure 6: Scatterplots and Spin Plots for the Lakes Data (pairwise plots of the chloride, fluoride and sulphate ions, with the extreme observations marked).

Table 14: Scale-ratio Test Applied to the Lakes Anion Data

Sample  Point     Scale-ratio  Approximate Critical Values
 size   selected  statistic    .01    .025    .05    .10
 106      33       1.632      1.314  1.236  1.182  1.130
 105      98       1.387      1.318  1.243  1.188  1.135
 104      69       1.232      1.217  1.245  1.191  1.133
 103      29       1.185      1.314  1.247  1.192  1.136

Table 15: Masked Outlier Data

Obs     x       y       Obs     x       y       Obs      x         y
 1    0.393  -2.193     10    0.136  -0.287     19     0.827     0.516
 2   -1.721   1.086     11    1.160  -1.042     20    -0.953     0.436
 3    0.700   0.035     12    2.023  -2.296     21  1000.044  1000.079
 4    0.050  -0.028     13   -0.516   0.728     22   999.912   999.971
 5   -0.383  -0.696     14    0.693  -0.638     23  1000.121  1000.158
 6   -1.605   0.434     15   -0.304  -0.913     24   999.951  1000.079
 7   -1.292   0.451     16   -0.558   0.177     25  1000.158  1000.194
 8    0.586  -0.879     17    0.430   0.536
 9   -0.327   0.886     18   -1.012   0.363

Knowing that there are at most 5 outliers in the data, the Wilk's outlier test is applied 6 times to the data; the points detected as potential outliers are, in order of detection, x12, x1, x2, x11, x8 and x13. By looking at the data it is clear that the true outliers are x21, ..., x25. None of the detected points is significant, but this still demonstrates a severe lack of power when the outliers mask each other. If Wilk's outlier test is applied in a non-sequential fashion, the correct 5 points are detected as the potential outliers and Λ5 = 0.000000736 is highly significant. The problem with applying Wilk's outlier test directly comes from the swamping effect. See Caroni and Prescott [3] for an example of this.

Table 16: Wilk's Outlier Test Applied to the Masked Outlier Data

Sample  Point     Wilk's           Critical Values
 size   selected  statistic    .01    .025    .05    .10
  25      12       0.665      0.491  0.533  0.568  0.605
  24       1       0.788      0.477  0.520  0.555  0.593
  23       2       0.773      0.461  0.505  0.542  0.581
  22      11       0.748      0.445  0.490  0.527  0.567
  21       8       0.808      0.427  0.473  0.511  0.552
  20      13       0.771      0.409  0.455  0.494  0.536

When the scale-ratio test is applied to the same data, the outliers, in order of detection, are x25, x23, x21, x24 and x22. The test is highly significant for these 5 points and declares them all to be contaminants, but when the test is applied to the remaining 20 observations, x1 is detected as the extreme observation yet it is not rejected as a contaminant. For this extreme example, the scale-ratio test yields a correct result while the Wilk's outlier test is completely fooled.

The final example is designed to show the sensitivity of the Wilk's outlier test to
The data consist of 50 observations in 4 dimensions.Observations 46 through 50 are structural outliers that do not appear in any of the 3-61Table 17: Scale-ratio Test Applied to the Masked Outlier DataSample Point Scale-ratio Approximate Critical Valuessize selected statistic .01 .025 .05 .1025 25 342515.530 2.376 1.962 1.711 1.49224 23 382872.985 2.447 2.018 1.754 1.51223 21 399269.468 2.473 2.085 1.796 1.52922 24 319646.961 2.565 2.123 1.848 1.55321 22 254922.959 2.749 2.201 1.889 1.58920 1 1.042 2.724 2.245 1.908 1.600dimensional spin plots or 2-dimensional scatterplots. Figure 7 shows the 6 scatterplots.The results of the Wilk’s outlier test are in Table 18 while the results for the scale-ratioare in Table 19.The Wilk’s Outlier test is affected by masking in this situation but not completelyfooled by it. However, a single application of the Wilk’s outlier test indicates no outliers.Therefore, if the level of contamination is unknown, an investigator may apply the testonce then stop because it yields an insignificant result. The scale-ratio test, on theother hand, is unaffected by the masking and indicates an outlier problem with the firstapplication of the test. This may not seem very serious because the Wilk’s test is verysignificant after 4 applications hut there could be several layers of masked outliers inthe data resulting in a swing from insignificant to significant results as the test worksthrough each layer of contamination.The above examples demonstrate the performance of the scale-ratio test. 
It appears to work as well as the Wilk's outlier test when the outliers are not masked and dramatically outperforms the Wilk's outlier test when the outliers are masked.

Figure 7: Scatterplots for a 4-dimensional Data Set with Hidden Outliers (panels: Variable 2 vs 1, 3 vs 1, 4 vs 1, 3 vs 2, 4 vs 2, 4 vs 3).

Table 18: Wilk's Outlier Test Applied to the Hidden Outlier Data

Sample  Point     Wilk's           Critical Values
 size   selected  statistic    .01    .025    .05    .10
  50      47       0.727      0.619  0.647  0.669  0.692
  49      50       0.666      0.614  0.642  0.664  0.687
  48      48       0.620      0.607  0.636  0.658  0.682
  47      49       0.446      0.601  0.630  0.653  0.677
  46      46       0.008      0.594  0.624  0.647  0.671
  45      34       0.771      0.588  0.617  0.641  0.665

Table 19: Scale-ratio Test Applied to the Hidden Outlier Data

Sample  Point     Scale-ratio  Approximate Critical Values
 size   selected  statistic    .01    .025    .05    .10
  50      50      272.123     1.522  1.385  1.292  1.197
  49      48      246.926     1.510  1.376  1.295  1.199
  48      47      213.242     1.533  1.389  1.289  1.193
  47      49      158.282     1.532  1.401  1.298  1.185
  46      46       88.168     1.550  1.401  1.304  1.198
  45      34        0.920     1.546  1.414  1.309  1.202

7 Conclusions and Recommendations

The limitation of graphical techniques for detecting outliers illustrates the need for other methods of outlier detection in higher dimensions. This does not mean that outlier tests should be blindly applied in higher dimensions. In fact, exploratory data analysis should always be carried out before any tests are conducted. This will help in the detection of obvious outliers as well as revealing some basic symmetries in the data such as univariate normality or elliptical contours in the bivariate margins.

After the exploratory analysis is done, outlier tests can be conducted for the structural outliers.
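The sequential testing procedure used throughout the examples above (test, remove the point with the largest robust Mahalanobis distance, retest) can be sketched in a few lines of Python. The three callables are hypothetical stand-ins supplied by the caller, not routines from this thesis.

```python
import numpy as np

def sequential_scale_ratio(x, test_stat, cutoff, robust_fit):
    """Forward sequential scale-ratio outlier test (sketch).

    test_stat(x)  -> scale-ratio statistic for the sample x
    cutoff(n)     -> level-alpha critical value for sample size n
    robust_fit(x) -> robust location t and scatter Sigma

    All three callables are assumptions standing in for the
    estimators and critical values discussed in the text.
    """
    x = np.asarray(x, dtype=float)
    n_original = len(x)
    outliers = []
    while len(x) > n_original // 2:            # stop once 50% are removed
        if test_stat(x) <= cutoff(len(x)):     # H0 not rejected: finished
            return outliers, True
        t, sigma = robust_fit(x)
        d = x - t
        # robust Mahalanobis distances D_i = (x_i - t)' Sigma^{-1} (x_i - t)
        dist = np.einsum('ij,jk,ik->i', d, np.linalg.inv(sigma), d)
        i = int(np.argmax(dist))               # most extreme remaining point
        outliers.append(x[i].copy())
        x = np.delete(x, i, axis=0)            # remove it and retest
    return outliers, False
```

A returned flag of False means half the sample was removed without the test ever accepting, mirroring the advice above to question the normality assumption itself in that case.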
The assumption of multivariate normality for the Wilk's outlier test and the scale-ratio test is not very limiting because the univariate margins can be converted to near normality before either test is applied. This does not guarantee bivariate normality, let alone p-variate normality, but non-normal univariate margins do guarantee non-normal margins in higher dimensions.

Robust estimation plays an important role in the scale-ratio test because the robust estimator is responsible for the resistance of the test against the masking effect. I have never suggested that robust estimation should be used in place of classical estimation. In fact, I suggest that the robust estimates be used only as part of a diagnostic tool to detect the outliers. Once the outliers have been detected, investigated and dealt with properly, classical estimates can be used based on the new information. This should result in a highly efficient and robust method of dealing with outliers.

The sequential application of the scale-ratio test or the Wilk's outlier test is responsible for the resistance to the swamping effect. The resistance comes from the removal of the outliers before the next step is conducted. When no outliers remain in the data, the test should no longer be significant.

The specific test to use is not clear. I recommend using the scale-ratio test over
the Wilk's or sequential Wilk's outlier test because it is unaffected by swamping or masking. However, the best scale-ratio test is an area of further research. Of the four combinations I tried, I found V0 based on Σ̂ as defined in Section 5 to be the best because it appears to have greater power than the others in most situations and it appears to be asymptotically normally distributed.

The cutoff points need to be investigated further. They seem to decrease as the sample size and dimension increase. I believe a smooth curve will fit through the cutoff points for a given α level, allowing them to be estimated by a simple formula. The asymptotic normality also allows for the modelling of the mean and variance of the cutoff points as another way of estimating their values for larger sample sizes.

If the critical values are ignored, the scale-ratio test can be used as a method of ordering multivariate data in terms of extremeness from the centre. Once the data have been ordered, one can investigate a percentage of the most or least extreme points.

As for the approximations to the multivariate robust estimators, I think these appear to work quite well as they are, but further research could be used to improve them. For example, the Stahel-Donoho estimator can be used to increase the probability of selecting a subsample that contains all good points, for the calculation of an S-estimate or a τ-estimate, by sampling each point based on the weight assigned by the Stahel-Donoho estimator.

References

[1] Afifi, A. A. and Azen, A. P. (1972). Statistical Analysis: A Computer Oriented Approach, New York: Academic Press.

[2] Barnett, V. and Lewis, T. (1984). Outliers in Statistical Data (2nd ed.), Toronto: John Wiley & Sons.

[3] Caroni, C. and Prescott, P. (1992). "Sequential Applications of Wilk's Multivariate Outlier Test", Applied Statistics, 41, 355-364.

[4] Das, R. and Sinha, B. K. (1986). "Detection of Multivariate Outliers with Dispersion Slippage in Elliptically Symmetric Distributions", The Annals of Statistics, 14, 1619-1624.

[5] Davies, P. L. (1987). "Asymptotic Behaviour of S-estimates of Multivariate Location and Dispersion Matrices", The Annals of Statistics, 15, 1269-1292.

[6] Delempady, M. and Douglas, A. (1990). "Eastern Lake Survey - Phase 1. Documentation for the Data and the Derived Data Sets." Sims Technical Report 160, The University of British Columbia.

[7] Donoho, D. L. (1982). Breakdown Properties of Multivariate Location Estimators. Ph.D. Qualifying Paper, Harvard University.

[8] Hawkins, D. M. (1980). Identification of Outliers, New York: Chapman and Hall.

[9] Huber, P. J. (1964). "Robust Estimation of a Location Parameter", The Annals of Mathematical Statistics, 35, 73-101.

[10] Huber, P. J. (1981). Robust Statistics, New York: John Wiley & Sons.

[11] Johnson, R. A. and Wichern, D. W. (1988). Applied Multivariate Statistical Analysis, Englewood Cliffs, New Jersey: Prentice-Hall.

[12] Lopuhaä, R. (1990). Estimation of Location and Covariance with High Breakdown Point, Ph.D. Thesis, Delft University of Technology, Netherlands.

[13] Le, N. D. and Zamar, R. H. (1991). "A Global Test for Effects in a 2^k Factorial Design Without Replicates", Journal of Statistical Computing and Simulation, 41, 41-54.

[14] Lind, J. C. (1979). A Comparison of Some Robust Estimates of Correlations in the Presence of Asymmetry, Ph.D. Thesis, University of British Columbia.

[15] Maronna, R. A. (1976). "Robust M-estimates of Multivariate Location and Scatter", The Annals of Statistics, 4, 51-67.

[16] Martin, R. D., Yohai, V. J. and Zamar, R. H. (1989). "Min-max Bias Robust Regression", The Annals of Statistics, 17, 1608-1630.

[17] Martin, R. D. and Zamar, R. H. (1992). Bias Robust Estimation of Scale when Location is Unknown. Technical Report.

[18] Mood, A. M., Graybill, F. A. and Boes, D. C. (1974). Introduction to the Theory of Statistics (3rd ed.), Montreal: McGraw-Hill.

[19] Moussa, M. A. A. (1983). "Detection and Accommodation of Multivariate Statistical Outliers", Computer Programs in Biomedicine, 16, 217-230.

[20] Patak, Z. S. (1990). Robust Principal Component Analysis via Projection Pursuit, Master's Thesis, University of British Columbia.

[21] Press, W. H. et al. (1988). Numerical Recipes in C: The Art of Scientific Programming, Cambridge, England: Cambridge University Press.

[22] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection, Toronto: John Wiley & Sons.

[23] Rousseeuw, P. J. and Yohai, V. (1984). "Robust Regression by Means of S-Estimators", Robust and Nonlinear Time Series Analysis, Lecture Notes in Statistics 26, 256-272, New York: Springer-Verlag.

[24] Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics, Toronto: John Wiley & Sons.

[25] Sinha, B. K. (1984). "Detection of Multivariate Outliers in Elliptically Symmetric Distributions", The Annals of Statistics, 12, 1558-1565.

[26] Schwager, S. J. and Margolin, B. H. (1982). "Detection of Multivariate Normal Outliers", The Annals of Statistics, 10, 943-954.

[27] Stahel, W. A. (1981). Robust Estimation: Infinitesimal Optimality and Covariance Matrix Estimators. Ph.D. Thesis (in German), ETH, Zurich.

[28] Yohai, V. J. and Zamar, R. H. (1988). "High Breakdown-point Estimates of Regression by Means of the Minimization of an Efficient Scale", Journal of the American Statistical Association, 83, 406-413.

[29] Yohai, V. J. and Zamar, R. H. (1990). Discussion of Paper No. 90 SM 493-7 PWRS.

Appendix: Computer Implementation

Most of the software used in this thesis was written by the author in the C language and run on a Sun workstation. The model of workstation used was either a Sparc2, Sparc1, or SparcELC. The one exception to this was the Wilk's sequential outlier test routine, which was written in Splus 3.0.

The core routines manipulate matrices in one way or another. Since several procedures require the eigenvalues of a matrix, any procedure that could be calculated from the spectral decomposition of a matrix was calculated this way. The spectral decomposition routine comes from Numerical Recipes in C [21]; the specific routines are TRED2 and TQLI. The uniform deviates were generated by the RANDOM function built into the standard C library on a Sun computer, a Newton-Raphson routine was used to solve for the constants in Table 2 and for the scale in the S-estimate approximation algorithm, Romberg integration was used for the expected value of the biweight function for a given k, and the quicksort sorting algorithm was used to sort any data that needed sorting. The Newton-Raphson, Romberg and quicksort routines were written by the author.

The code is available upon request. However, it is undocumented at this time and it may not compile properly on any machine other than a Sun workstation.

