R O B U S T P R I N C I P A L C O M P O N E N T A N A L Y S I S V I A P R O J E C T I O N P U R S U I T B y Z D E N E K P A T A K B . S c , The University of Br i t t i sh Columbia, 1987 A T H E S I S S U B M I T T E D I N P A R T I A L F U L L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E M A S T E R O F S C I E N C E in T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Department of Statistics) We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A January 1990 © Zdenek Patak, 1990 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department The University of British Columbia Vancouver, Canada DE-6 (2/88) Abstract In principal component analysis (PCA), the principal components (PC) are linear combinations of the variables that minimize some objective function. In the classical setup the objective function is the variance of the PC's. The variance of the PC's can be easily upset by outlying observations; hence, Chen and Li (1985) proposed a robust alternative for the PC's obtained by replacing the variance with an M-estimate of scale. This approach cannot achieve a high breakdown point (BP) and efficiency at the same time. To obtain both high BP and efficiency, we propose to use M M - and r-estimates in place of the M-estimate. Although outliers may cause bias in both the direction and the size of the PC's, Chen and Li looked at the scale bias only, whereas we consider both. Al l proposed robust methods are based on the minimization of a non-convex objective function; hence, a good initial starting point is required. With this in mind, we propose an orthogonal version of the least median of squares (Rousseeuw and Leroy, 1987) and a new method that is orthogonal equivariant, robust and easy to compute. Extensive Monte Carlo study shows promising results for the proposed method. Orthogonal re-gression and detection of multivariate outliers are discussed as possible applications of PCA. n Contents Abstract ii Contents iii List of Tables v List of Figures vi Acknowledgements vii 1. Introduction 1 2. Basic Robustness Concepts 6 2.1 Influence Function 6 2.2 Maximum Bias Curve and Breakdown Point 6 2.3 Equivariance 8 2.3.1 Definitions 8 3. Robust Estimators - A n overview 9 3.1 M-estimates 9 3.2 S-estimates 10 3.3 r-estimates 11 4. Principal Component Analysis 13 4.1 Robust PCA 15 4.1.1 Robust PCA via the Robustification of Scatter Matrices 17 4.1.2 Robust PCA via Projection Pursuit 22 5. Direction Bias Computation 29 5.1 r-estimate 31 5.2 S-estimate 32 5.3 MM-estimate 33 6. Proposed Estimator 39 iii 6.1 Some Properies of the Proposed Estimate 40 7. Tables - Simulation Results 47 8. Applications of P C A 54 8.1 Orthogonal Regression 54 8.2 Outlier Detection Using PCA 55 9. References 57 iv List of Tables Table 1. Fire Claims in Belgium from 1976 to 1980 3 Table 2. Contamination in Y, N(3,0.25), 0=0, n=20, m=200 51 Table 3. Contamination in Y, N(3,0.25), /3=0, n=60, m=200 51 Table 4. Contamination in Y, N(5,0.25), 0=0, n=20, m=200 52 Table 5. Contamination in Y, N(5,0.25), 0=0, n=60, m=200 52 Table 6. Contamination in X, N(20,0.25), 0=5, n=20, m=200 53 Table 7. Contamination in X, N(20,0.25), 0=5, n=60, m=200 53 Table 8. Contamination in Y (dim=5), N(10,0.25), 0 = 0, n=40, m=100 54 Table 9. Contamination in X l (dim=5), N(10,0.25), 0= 5, n=40, m=100 54 Table 10. Empirical BP's of the estimates 55 v Lis t of Figures F igure 1. P lo t of fire Claims in Belgium from 1976 to 1980 5 F igure 2. P lo t of the First P C ' s 16 F igure 3a. P lo t of Direction Bias vs Fract ion of Contamination (^/Ai/Ao=10) . . . 38 F igure 3b. P lo t of Direction Bias vs Fraction of Contamination (^/A 1/A 0=2) . . . . 38 F igure 4. P lo t of Fraction of Contamination versus ^/XQ 39 Figure 5. P lo t of Classical versus Robust P C A 58 v i Acknowledgements Dr. Ruben Zamar first introduced me to the subject of robustness. I thank him for his support, help and guidance during the writing of my thesis. I also like to thank Dr. Mohan Delampady for his helpful comments. vii 1 Introduction This thesis discusses robust alternatives to principal component analysis (PCA) and orthogonal regression (OR). Classical methods and key robustness concepts are briefly discussed, existing robust procedures are described and new robust approaches are intro-duced. Several examples are included to illustrate the properties of the classical versus robust methods. Our goal is to extend the work of Chen and Li (1985) to M M - and r-estimates of PC's and to discuss the breakdown properties not only of the scale of the PC's (Chen and Li only considered the properties of the scale of the PC's) but also of their direction. The word robust is derived from the Latin word "robus" meaning strength. In different disciplines, it takes on different meanings. In statistics, robustness has usully been associated with methods and procedures that do not suffer greatly when a fraction of the data does not follow the model assumptions, i.e. outlying observations may be present. To expand on this, we say that estimator is robust whenever its value does not change appreciably after a number of aberrant observations has been introduced. This notion of robustness is very loose and can be made rigorous, yet it gives us a flavour of what is involved when we say that an estimator is robust. Statistical inference is only in part based upon observations. Equally important are the explicit and implicit assumptions one makes about the underlying situation. In regression, whether classical or orthogonal, one generally assumes that observations are independent and errors are normally distributed with mean zero and some common variance. The presence of outliers in the data certainly violates this assumption and inference based on classical methods that are sensitive to even minor departures from these assumptions would be suspect. Robust alternatives have been proposed to deal with the inherent susceptibility of classical methods to 'disruptions'. 1 In 1964 Huber, in his milestone paper on robust location estimation, laid the foun-dation for modern robust statistical analysis. Since then robust methods have been developed for a variety of statistical procedures including orthogonal regression and the estimation of principal components, both of which constitute the focus of this paper. Scatter matrices and their principal components are at the heart of multivariate data analysis. Unfortunately, the classical estimator - the sample covariance matrix and its eigenvalues and eigenvectors - are highly nonrobust. One outlying observation can distort or completely upset the classical estimator. An observation is considered an outlier if it does not follow the same model as the rest of the data. An observation is considered a leverage point if its relocation causes major changes in the parameters to be estimated. A major problem with the detection of such cases in higher dimensional space is that an observation may not be extreme with respect to any of the original variables, but it can still be an outlier because it does not conform with the correlation structure of the remainder of the data. This type of outlier, called a structural outlier, will most likely distort the direction, an eigenvector, whereas a gross error outlier will most likely distort the scale, an eigenvalue. It may be possible to identify gross error outliers as they will stand out from the rest of the data but it is very difficult to detect structural outliers by looking solely at the original variables one at a time, or even two at a time. Most classical diagnostic procedures can identify a single outlying or high leverage observation. However, they are rendered helpless in the presence of multiple contami-nants especially when these outliers are grouped and have a high leverage. Often the physical process that generates these outliers causes them to cluster in a particular lo-cation. This phenomenon creates a 'masking effect', i.e., the removal of one observation from this cluster will not have a discernible effect on the estimated parameters. The 2 Table 1: Fire Claims in Belgium from 1976 to 1980. Year Number of Fires x{ Vi 76 16694 77 12271 78 12904 79 14036 80 13874 remaining contaminants will 'mask' any change that would be detected if the cluster contained only one outlying observation. To deal with this problem, we employ robust procedures. They provide an objective analytic tool for identifying observations that do not conform with the structure of the rest of the data. To illustrate the problems connected with the use of classical methods when model assumptions have been violated, we consider a simple data set (Rousseeuw, Leroy 1987) comprising the number of reported claims by Belgian fire-insurance companies in the five years from 1976 to 1980. If one disregards the number of fire claims reported in 1976, it is clear that there is an annual upward trend. This is reflected in the estimates from the method of least median of absolute orthogonal deviations (ORLM), ft> = -28872.7 and ft = 534.3. ORLM is one of the many robust alternatives to orthogonal regression which will be discussed later. OR fits the data with a decreasing trend yielding ft = 244547.7 and ft — -2956.3. These results give one the false impression that the number of fire claims is going down 3 E o Figure 1: Plot of Fire Claims in Belgium from 1976 to 1980 while it is in fact going up. Although this example, with only one explanatory variable, is simple and the outlying observation can be easily identified because it is detached from the rest of the data, it illustrates how classical methods can lead to wrong conclusions in the presence of aberrant observations. Locating outliers in higher dimensions is usually more complicated and hence more reliable methods are needed. The rest of the thesis is organized as follows. In section 2 we discuss some basic robustness concepts that are used to classify the performance of estimators. In section 3 we describe some robust estimates in the simple case of location and scale estimation. We focus our attention only on those estimates that are referred to in later sections. In section 4 we discuss classical PCA and the robust alternatives. We mainly focus on extending the work of Chen and Li (1985) beyond S-estimates of the scale of the PC's. We introduce M M - and T-estimates of the direction and scale of the PC's. In section 5 we compute the lower bound for the maximum asymprotic direction bias and show that the BP of the scale is not inherited by the direction and that the BP of the direction 4 depends on the ratio of the adjacent eigenvalues. In section 6 we propose a robust estimate of the direction and the scale of the PC's and discuss its properties. We show that the proposed estimate is orthogonal equivariant, Fisher consistent and robust. In the last two sections we include Monte Carlo results to show the bias characterictics of different estimates and mention possible applications of robust PCA. 5 2 Basic Robustness Concepts Here, we introduce several concepts and definitions aimed at assessing the performance of estimators. Some stem from robustness while others are universally applicable. 2.1 Influence Function The influence function (IF) measures the sensivity of an estimator to infinitesimal per-turbations. The IF of an estimator T at a point x and a distribution F is given by J F(x ; r,f) = l i m r « 1 - e ' f + / ^ - r ( f ) (1) for those points x of the sample space where the limit exists. (Here, 8X is the point-mass distribution at x.). We define the gross error sensitivity (GES) as GES = sup \\IF(T,x)\\ . (2) X Observe that for e near zero, ||r((l - e)F + e8x) - T(F)\\ » e\\IF(x; T, F)\\ . This implies that sup ||T((1 - e)F + e6x) - T(F)\\ » e GES , X where sup r ||T((1 — e)F + eSx) — T(F)\\ is the maximum bias induced in an estimator by a fraction e of contamination. The restriction to point mass contamination implies no loss of generality (see Martin, Yohai and Zamar, 1990). 2.2 Maximum Bias Curve and Breakdown Point An important notion of robustness of an estimator is the breakdown point (BP). It measures the extent to which an estimator is able to cope with contamination. It is 6 often helpful in understanding the robustness properties of the estimator and can also be used to classify its performance. There are several definitions of the breakdown point of an estimator. For simplicity, only the finite sample version due to Donoho and Huber (1983) w i l l be introduced here. To define the breakdown point, let us suppose we have a data set X — { X x , X n } = { ( x n , X\2i •••) ^ l p ) 5 (xni, Xn2, Xnp)} . T ( X ) is the value of an estimator at the sample X . Consider a l l corrupted samples X' obtained by replacing any fraction e £ (0,1) of the original data points by arbitrary values. T(X') is the value of an estimator at the contaminated sample X'. First , we define the maximum bias as B(e;T,X) = sv?\\T(X')-T{X)\\ , (3) where the supremum is taken over a l l e-contaminated samples. P lo t t ing the function B versus the fraction of contamination t produces the maximum bias curve which is a carrier of both the local and global robustness properties of the estimate. The breakdown point is the value of e where the asymptote to the maximum bias curve crosses the x-axis. It is defined as e*(T, X) = inf {e : B(e; T, X) = co} , (4) i.e., e*(T,X) is the smallest fraction of contamination that can cause T(X') to take values arbitrarily far from T ( X ) . Asymptotic counterparts of e* and B have been defined (Hampel, 1986, for example). Under certain regularity conditions, the G E S is the value of the derivative of the maximum asymptotic bias curve at zero, that is, GES = B'(0) and therefore can be used to give a linear approximation for B(t) for t near zero. 7 2.3 Equi variance Equivariance is a concept that reaches beyond robustness as it pertains to a property that in one form or another is desired of all estimators. We shall distinguish between four types of equivariance: location, scale, orthogonal and affine. In the context of orthogonal regression and principal component analysis, the first three are a natural requirement for any estimator. 2.3.1 Definitions Suppose we have a collection of vectors x l 5 x n in Rp. An estimator T is said to be 1. location equivariant if T(xx + v , x n + v) = T ( x i , x n ) + v, where v is any vector in RP. 2. scale equivariant if T ( c x l 5 c x n ) = | c |T(x x , x n ) , where c £ R. 3. affine equivariant if T(Axj + v , A x „ + v) = A T ( x 1 , x n ) + v, where A is any nonsingular matrix and v is any vector in Rp. 4. orthogonal equivariant if T(Txi -f v, ...,I7xn + v) = rT (x 1 , x n ) + v, where r is an orthogonal matrix and v is any vector in Rp. For a scatter matrix C, affine equivariance is defined as C(Axi + v , A x „ + v) = A C ( x i , x n ) A r , where A and v are as above. This means that if a point cloud is rotated or rescaled, then any measurement of its orientation will rotate and any measurement of its size will scale correspondingly. Orthogonal equivariance of a scatter matrix is defined similarly with A replaced by I\ 8 3 Robust Estimators - An overview There exists several different families of robust estimators. However here we w i l l focus our attention only on those estimators that are referred to in later sections. They are the M - , S- and r - estimators. Each is briefly described for the simple case of location and scale estimation. 3.1 M - estimates Suppose we have a set {x,-},-=i,..,7i of independent identically distributed observations from Fgi<r. The maximum likelihood estimators for location and scale at the Gaussian model are defined as the solutions to where ty(x) — x, x(x) = x 2 a n d 6 = 1 . Bo th $ and x favour "large" observations. To reduce the influence of these observations on the estimated location parameter, we can use a function ^ satisfying C l . ^ is odd, bounded, wi th at most a finite number of discontinuities. Examples of such ^ 's are the Huber's function c sgn(x) if |x | > c, c £ (0, oo) if la;I < c or the Tukey's biweight function for Ice 1 < c for \x\ > c, c £ (0, oo) . 9 To robustify the scale estimate, we often choose a x that meets the following conditions: C 2 . x is symmetric, differentiable almost everywhere and x(0) — 0 • C 3 . x l s s tr ict ly increasing on [0,c) and constant on [c,oo) . A n example of a function that satisfies C2 and C3 is the Huber's ^-function defined as xf - T for \x\ < c 1 for \x\ > c, c G (0, co) . Using general \I> and x functions, we define the generalized max imum likelihood esti-mates (M-estimates) t and s of location and scale as the solutions to i t . ( 2 f - > o (., where b is usually taken to be the Ex{Z) and Z is the standard normal random variable. Huber (1964) defined M-estimates of location and described some of their asymptotic properties. These include y/n convergence rate to a normal distribution and a fairly high efficiency. 3.2 S - estimates Let i be a tentative location of the center of a set of numbers, {x^},•=!,..,„. Consider the residuals, r,-(t) = X{ — t. The corresponding M-estimate of scale, s(t), is implici t ly defined by where x a n d b are as above. The S-estimate of location is then defined as T = argmint s(t) . 10 Notice that s(T) = s is a robust estimate of scale. In fact, it can be shown that for x satisfying C2 and C3, the BP of s is min{^^j, 1 —^^y}. (see Huber, 1981, for instance). It is clear from their definition in (5), that M-estimates of location are sensitive to the choice of s. To obtain good robustness properties for the M-estimate, we need to use a measure of dispersion of the residuals r,(i) that has the most bias resistance, i.e. a BP of 1/2. This criterion is met by s. This approach produces a new type of estimator called the M M estimator. It combines the efficiency of M-estimates with the robustness of S-estimates. 3.3 r - estimates Introduced in 1988 by Yohai and Zamar, the r-estimates combine efficiency with good breakdown properties. They are defined by T = argmint r2(t) , where n i = i V 5 W J s(t) is implicitly defined by and x i a n d X2 satisfy conditions C2 and C3. The corresponding "tuning constants" (explained below) for Xi a n -d X2 a r e c i = 1-548 to yield a high BP for the scale s(t) and c2 = 6.08 to yield 95% efficiency for the location t. Suppose that is small and X2 is quadratic near zero. Then 2 Hence for non-contaminated samples, the T estimators of location and scale reduce to the sample mean and variance respectively. This property gives the r estimator its high 11 Xj — t s{t) efficiency. If on the other hand, the absolute residual the tuning constant, C 2 , its influence is diminished because , is large, i.e, greater than X2 Xi — t 1 This property gives the estimator its high B P . 12 4 Principal Component Analysis The objective of Pr inc ipa l Component Analysis ( P C A ) is to reduce the dimensionality of a data set containing a large number of correlated variables while retaining as much as possible of the variabili ty present in the data. This is done by transforming to a new set of variables, the principal components, which are uncorrelated. Suppose x is a vector of p random variables with mean u and covariance S. Unless p is small or the covariance structure is very simple, not much insight can be obtained from looking at the p variances and \p(p — 1) covariances. A n alternative approach is to look for a few "principal components" that retain most of the information contained in the variance-covariance structure. The first step is to look for a linear combination of the components of x , aTx, that has max imum variance; i.e, maxa var(ctTx) — aTHc(. It is clear that without a suitable constraint the max imum wil l not be achieved for finite a. The conventional constraint here is aTa — 1. The problem then becomes maximize aTHa subject to aTa = 1 To obtain a solution, we can use the method of Lagrange multipliers (see for example Jolliffe, 1986) and maximize J\(ct, A) = aTHa — A ( a T a — 1) , where A is a Lagrange multiplier. Differentiating Ji(a, A) wi th respect to cti and setting the derivative to zero yields d — J i ( a , A) = £ a - A a = 0 . da B y premultiplying both sides of the equation by aT and using the constraint ctTa — 1, we get aTEct = A . 13 Therefore, the solution to the constrained maximization problem is the eigenvector a i associated wi th the largest eigenvalue A x of S . The linear function a f x is the first principal component. Next, we look for a linear combination of the elements of x, ct Tx, that has maximum variance and satisfies the constraints (i) aTa = 1 and (ii) aT ctx = 0. The solution can be obtained, again by using the method of Lagrange multipliers, by maximizing J 2(o:, A, <f) = a T S c v — A ( a T a — 1) — cf>aTcti , where A and <f> are Lagrange multipliers. Differentiating J2(a,\,<j>) wi th respect to a and setting the derivative to zero yields d — J2(a, A, <j>) = aTT,a- Xa - <ba = 0 . (8) da Premul t ip ly ing both sides of the equation by a f wi l l result in aTHa — Xafa — cj>aTa.i = 0 . B y noticing that (i) a?cx\ = 1 and (ii) a f S = \a[ we have <f> = 0. Substituting for <f> in (8) and premultiplying by a results in a r S a = A . Therefore, the solution to the doubly constrained maximization problem is the eigen-vector a-2 associated wi th the second largest eigenvalue A 2 of S . The linear function a^x is the second principal component. This process is repeated unti l all principal components are computed. It can be shown that cti,ct2, ...,ap are the eigenvectors of S corresponding to A i , A 2 , A p , re-spectively, where the Aj's are in decreasing order. It can also be shown that var(ajx) = A,- for i = 1 , 2 , p . 14 Note that in the classical PCA, proceeding from either minimizing or maximizing the variance of a linear combination of the elements of x yields the same principal compo-nents. As we will show later, this is not the case in the the robust setting. 4.1 Robust P C A PCA is an important tool used in many fields where there is a need for reducing the dimensionality of a data set. It has been a popular technique with psychologists who routinely collect a multitude of information on their patients and then try to construct a few "indices" to explain their behavior. It is important that such indices should be as reliable as possible to prevent misdiagnoses; however, because the classical approach is unable to cope with aberrant observations, its reliability can be questioned. As we have seen, principal components are obtained via successive constrained maxi-mizations of the variance of a linear function of elements of x. One extreme observation may inflate the variance enough to upset the order of the principal components. To illustrate the weaknesses of the classical PCA, we have generated a hundred point, two variable sample. The first variable y is distributed as N(0,1), the second variable x is 90% distributed as N(0,0.1) and 10% as 1. N(0,0.1) 2. N(3,0.1) 3. N(4,0.1) We have computed the first PC for each of the sampling situations. It is clear from the accompanying plot that contamination can adversely affect the direction of a PC. In this example the size and location of the outliers are not severe yet the direction of the first PC is completely upset. In the two-dimensional case we can surely identify the 15 Figure 2: Plot of the First PC's outliers from looking at an x — y plot. In higher dimensions we may not be able to do so. To make the PCA contamination resistant, statisticians have adopted one of two possible approaches. One is to obtain a robust estimate of the scatter matrix and then proceed with an eigenvalue-eigenvector decomposition (Boente, 1987) to compute the principal components. The second approach is to replace the variance in the maxi-mization by some robust estimate of scale, s, and then proceed as above (Chen and Li , 1985). Note that, classically, we obtain the same principal components whether we begin by maximizing the variance or by minimizing it. This is generally not the case when var(aTx) is replaced by 3(a Tx). In the robust setup, the minimization approach is more attractive in view of the many minimization algorithms developed for classical and orthogonal regression that can be used as building blocks. 16 4.1.1 Robust PCA via the Robustification of Scatter Matrices A natural way to obtain robust estimates of location and scatter is to extend the definition of one-dimensional M-estimators of location and scale to the multivariate setup. Maronna (1976) proposed such M-estimators, defined as solutions of a system of equations of the form -_>i[{(x.- - tyv-^ x,- - 1 ) } 1 ' 2 ] ^ -1) = o n 1=1 -X>2[(xt- - t)'V-i X t. _ t)](x4- - t)(xt- - t)' = V, n i=i where n x and u 2 are functions satisfying a set of general assumptions (see Maronna, 1976). Boente (1987) studied the asymptotic distribution of the eigenvalues and eigen-vectors of the above M-estimator of scatter. She has shown that they are consistent and asymptotically normal at the usual rate, y/n. However, it can be shown that that M-estimates of scatter attain a B P of at most 1/p (Maronna, 1976), where p is the number of variables. This makes them of limited use when p is large. Also, it is not clear whether the B P of the covariance matrix estimate is inherited by the direction of the PC's . Some other techniques for computing robust scatter matrices include convex peeling (Barnett, 1976, Bebbington, 1978), ellipsoidal peeling (Titterington , 1978, Helbling, 1983), iterative trimming (Gnanadesikan and Kettenring, 1972, Devlin et al., 1975) and depth trimming based on the concept of depth (Tukey, 1974). Unfortunately, they all possess a breakdown point of at most l / (p+l) (Donoho, 1982). The first affine equivariant estimator with high B P was constructed independently by Stahel (1981) and Donoho (1982). It measures the "outlyingness" of a point x relative to some center location. For each observation x,-, one looks for the one-dimensional 17 projection leaving it most exposed: C i ~ MEDk\v?xk - MED,lvTXj)\ ' V G R ' where MED is the median. Now consider a weight function Wi — w(d), where w : [0,oo) —> [0,oo) is decreasing with sup j|C (^011 < 0 0 • A robust covariance estimator based on the tw,-'s can be computed then as EF=1 u;?(x,- - t)(x; - t ) r where the multivariate estimate of location, t, is defined to be t _ E?=i w i * i E"=i«;,-The estimators obtained this way combine high B P wi th affine equivariance (Donoho, 1982). However, for each random vector xt-, we have to solve a nontrivial maximization problem. This would be computationally prohibitive even for the computers aboard the Enterprise. Rousseeuw (1983, 1987) proposed another estimator, the min imum volume ellipsoid ( M V E ) , that combines the properties of affine equivariance and high B P . Define an ellipsoid Ectfi by Ec,» = {x: ( x - A i f C - ^ x - ^ l } and the set C by C = {(C,n): #(£C,M n data) > [n/2] + 1} . The M V E is ( C , / / ) = a rgmin |C | , where | C | is the determinant of C . In most cases it is not feasible to consider a l l "halves" of the data and to compute the volume of the smallest ellipsoid that surrounds them. Hence to compute the M V E , we use a method similar to the bootstrap. 18 Given a set of random vectors {x}n in Rp: we draw repeatedly a subsample of p + 1 different observations. The number of subsamples, m, drawn must be large enough so that the probability of a subsample containing only "good" data points is high. For large data sets with many variables, we limit the number of subsamples to whatever is computationally feasible (this is usually in the range of 100 to 3000 depending on p). We find the mean xk and the covariance matrix for the kth subsample. Denote by Ek = {x : (x - x,) TC^ 1(x - x,) < 1} the ellipsoid corresponding to Ck and x^ . It contains the observations x, that are within a Ck unit distance from x^ . The volume of this ellipsoid is related to |Cfc|, that is, vol {x : (x - x,) TC^(x - xfc) < 1} = kp\Ck\^2 , where and T(x) is the gamma function evaluated at x (see for example Johnson and Wichern, 1988). To envelop [n/2] + 1 points, the ellipsoid Ek has to be inflated or deflated by being multiplied by some correction factor, the median Mahalanobis distance (MMD) MMDk = mediani=i]...jn(xI- - xyt)/C^1(x1- - 5ck) . Observe that the resulting ellipsoid E'k = {x : (x - X f c f C ^ x - xfc) < MMDk) contains exactly 50% of the observations. The volume of E'k is vol(E'k) = vol {x : (x - x A) TC; 1(x - 5ck) < MMDk} = vol{x : (x - xk)T(CkMMDk)-l(x - x,) < 1} = kp\MMDkC\1/2 = kpMMDl/2\C\^2 . 19 Hence the volume of the ellipsoid E'k is proportional to MMDpk/2\Ck\^2 . (9) Suppose that MMDj^ 2 | C f c * | a / 2 minimizes (9) over a l l subsamples k = l , . . , m . Then the M V E covariance estimator is expressed as Ck.MMDk.{xlo.sV , where Xp;o.s * s the median of a chi distribution with p degrees of freedom; it is used as a correction factor to obtain consistency at the multivariate normal model. In the univariate case the M V E reduces to the S H O R T H . Given a set of numbers {a:;} t_i,.., n , it is the length of the shortest line segment that contains at least [n/2] + l such numbers. It can be shown (Mar t in and Zamar, 1989) that the S H O R T H is the most resistant measure of dispersion wi th respect to minimizing the maximum bias among al l M-estimates of scale. Davies (1989) showed that the M V E converges weakly at rate of to a l imit ing dis-tr ibut ion that is nonnormal. To improve the rate of convergence, Rousseeuw (1983) also proposed the m i n i m u m covariance determinant ( M C D ) . The M C D covariance matrix estimate is the sample variance of the observations contained in the min imum volume elipsoid, the M C D multivariate location estimate is their sample mean. But ler and Juhn (1988) showed that the M C D is asymptotically normal at the rate yfn. Estimators analogous to S-estimates, MM-estimates and r-estimates of location and scale in the univariate have been extended to the multivariate setup by Lopuhaa, (1990) who discusses their properties at length in his P h . D . thesis. The above estimators have one characteristic in common; they, are affine equivariant. To attain affine equivariance and a high B P , we must sacrifice computational efficiency. In the P C A , the principal components of a covariance matr ix define an orthogonal 20 basis in the factor space. Hence, we need only consider estimators that are orthogonal equivariant, i.e., preserve the basis. Weakening the assumption of affine equivariance allows us to develop a new PCA estimator that is robust, yet easy to compute. A brief description of the estimator follows while its properties are discussed in section 6. Suppose we have a sample {x}J=1 with some initial robust estimate of covariance So(x). We propose to use an iterative procedure for computing weighted estimates of multivariate location and scatter with weights based on the principal components of St*,, the estimate of the covariance matrix at the kth iteration. Note that Maronna's M-estimate is also a reweighted covariance matrix, but the weights are based on the Mahalanobis distance (this is the reason for the low BP of the Maronna's estimate). Let a.j be the eigenvector of S^ associated with the ] t h largest eigenvalue. Then the \ t h principal component of X; is PCij = ajx,-. Next, we consider a weight function, W{j = w(PCij), where w: [0,oo) —• [0,oo) is decreasing with sup ||PC,-j w(PCij)\\ < oo. The weighted estimators of multivariate location and scatter are _ Er=i v r ? * and s _ ££ g l(x ,--t f c + 1)(x ,--t f c + 1 ) r (w;*) 2 where Wjf is the product of w^s computed at step k. The Wjs satisfy the following condition f if Wj1 < W*~l I W*~x otherwise . The weights Wk are forced to decrease at each iteration. The lower bound for the weights will be zero by the assumptions on the weight generating function w(*). This ensures convergence of the method. We will show in section 6 that at least p + 1 weights will be larger than zero for observations that do not lie in a lower dimensional hyperplane. 21 This will prevent the scatter estimate from being singular. Aside from being easy to compute, the estimators are orthogonal equivariant, consistent and have a BP of 1/2 as will be shown in section 6. 4.1.2 Robust P C A via Projection Pursuit Classical P C A is a type of projection pursuit method. Consider a set of random vectors {x},=li...)n. In this method, one searches Rp for a direction in which the variance of a linear function, aTx, of the elements of x attains a critical value. The BP of variance, as classically defined, tends to zero with increasing sample size and even a tiny fraction of contamination may cause it explode (blow up to o o ) . To make the P C A more resistant, a robust scale estimate S of aTx is used in place of the variance. The unmodified problem is as follows: where x 1S a nondecreasing even function that limits the influence of outlying or influ-ential observations, b is usually taken to be E{x(Z)}, where Z is the standard normal random variable. To make the minimization feasible, one can employ a method first introduced by Chen and Li (1985). It consists of three steps: reparametrization, mini-mization and projection. First, we note that a (p+l)-dimensional unit vector a 0 can be reparametrized as min 5(a0) subject to 1 (l,-f3)T , where 8 £ RP . The purpose of the reparametrization is to eliminate the constraint ||a0|| = 1 and make the computations simpler by doing so. The problem thus becomes min S(/3) subject to (10) 22 where /3 E Rp. Denote by /3 € Rp the minimizer of S(f3). Next, we notice that the solution to min S(a) | |a| |=l ,a?ao=0 lies in the nullspace, Af, of a 0, where a 0 = ^ ^ r ^ ^ ' ~@fT• ^ e P r o J e c t our data onto this nullspace and proceed as above, i.e obtain a solution to an unconstrained problem of one less dimension by means of reparametrization. Let a.i, . . ,a p be the orthonormal basis of Af. Then any ax G Af can be expressed by p fc=l where a € Rp and aTa = 1. Let yj = ( a ^ X j , a j X j ) r be the projection of the (p + l)-dimensional vector of observations Xj onto Af. Observe that the y;'s have one less dimension than the x,-'s. Now we minimize the scale Sfa^ over all unit vectors in Af, that is, min 5(a x ) subject to (TTTV) = ~J2x (STY) = H • l|ai||=i n^r{ \S (a x ) / n^r{ \S(ct) j By reparameterizing a as a = , 1 (1, - B ) T , where R e RV~X , the above minimization problem becomes 1=1 By solving the above equation we obtain p » i = _ a.-aj , A = l where a = .., ap)T is given by 23 $ £ i ? p _ 1 is the minirnizer of S(/3). We continue this process until the dimension of the nullspace of the existing vectors is reduced to 0. Vectors obtained in this fashion will be the robust eigenvectors and the square of the scale estimates will be the robust eigenvalues. The robust scatter matrix can then be reconstructed as The beauty of this method lies in the fact that the three steps described above reduce a problem in (p+1) dimensions to the same problem in one less dimension. The robust scale estimate S used in equation (10) is a nonlinear function with several minima. To minimize it, one requires a good initial starting point. In the linear regression setup, one uses a particular S-estimate of regression, the least median of squares (LMS). Essentially, one computes the regression estimates by minimizing the median of the absolute residuals. Analytically, the solution to this problem can be written as min S(/?) subject to i £ X a ^ ~ ^ X ' J = 1/2, (11) where x is defined as 0 if | ar | < a Xa(x) - < 1 if \x\ > a , a £ R . X is referred to as the jump function. Minimizing (11) is computationally very difficult given the discontinuous nature of Xa-In practice, the computation of the LMS estimate involves a technique similar to the bootstrap. Given a set of observations {xt} : i=l,..,n in Rp+l, we begin by drawing subsamples of (p+1) points. Again, note that a large enough number of subsamples must be drawn to increase the probability that a particular subsample contains only "good" data. For very large data sets with multiple variables, the optimal number of 24 subsamples considered is limited to whatever is computationally feasible (usually 100 to 3000 depending on p). We apply the method of least squares (LS) to each subsample k to obtain the re-observations. The LMS estimate is the set of coefficients that minimizes the median of the squared residuals. The method of LMS is CPU intensive and its rate of convergence has been shown to be only -f/n. This is because for each sample k we find the median absolute residual, a quantity that is not uniquely defined for even number of observations. The median absolute residual is an M-estimate of scale based on a jump function x- The nature of the function causes slower than usual convergence rate. To improve the speed of convergence to the usual rate, we can use a smoothed version of the function x (Tukey's x-> f ° r example). The M-scale equation will now have a unique solution; however, the computation of the minimum M-scale remains nontrivial. To compute it, we can use the resampling scheme described above. However, this requires us to solve an equation of the robust residual scale similar to (7) m times. This can be extremely time consuming especially for large p. To improve computational efficiency, we make use of Yohai's suggestion (Yohai and Zamar, 1990). We solve (7) only when it becomes necessary. Observe that if s(f3) is overestimated. To initialize the procedure, we compute an M-estimate of scale for the first set of residuals to obtain Si(/?). Subsequently to minimize s(B), we only solve for s(8) if (12) is satisfied (the scale used in (12) is the minimum M-scale computed thus far). Implementing this suggestion will reduce the number of equations that we gression coefficients ft. Then we compute the LS residuals corresponding to ft for all (12) 25 have to solve from m , the number of subsamples drawn, to ln(m). In the context of projection pursuit, we propose an analog of L M S , the least me-dian of absolute orthogonal errors ( O R L M ) . This method differs from L M S only in the computation of errors. In L M S , an error is defined as the vertical distance between the observed response and the fitted response. In O R L M , it is defined as the Euclidean distance of the observed response from the fitted regression line. B o t h methods, L M S and O R L M , although bias robust, have a rate of convergence of tfn. This results in low efficiency even if the errors are really normally distributed. To improve the speed of convergence, one uses a smoothed-out version of the O R L M obtained by replacing the jump function X with a continuous x, for instance Huber's x2 Xc{x) = m i n { l , — } or Tukey's x) = m m { l , - ( : r - _ + _ ) } . c is called the tuning constant. The S-estimate of scale in a regression context based on either x function is asymptotically normal at the usual rate. Tukey's x is usually preferred because it has continuous first and second derivatives and hence equation (11) can be more easily minimized by Newton-Raphson type of methods. A disadvantage of S-estimates is their inabil i ty to achieve efficiency wi th high B P at the same time. There has always been a trade-off between the two. If an S-estimate is to achieve maximal B P , a low tuning constant c is used. For Tukey's x, c = 1-548. This wi l l cause some non-outlying observations to be viewed as outliers and penalized accordingly; any observation wi th the standardized robust residual larger than 1.548 wi l l be considered an outlier since from that point on all observations wi l l have the same value at the function x)- This wi l l in turn increase the asymptotic variance of the estimated parameters caused by the loss of information. 26 This results in a reduction in efficiency (it can be shown that efficiency is an increasing function of c). It turns out, from extensive Monte Carlo simulation runs, that minimizing the robust scale estimate of the orthogonal errors gives rise to regression estimates that are about six times less efficient than the MLE at the Gaussian model. Estimators, M M and r, that combine maximal BP with efficiency have been proposed -in the linear regression setup. Both can be made 95% efficient at the Gaussian model and their maximum asymptotic bias characteristics are comparable to those of the maximal BP S-estimates. In the orthogonal regression setup we propose estimators analogous to the M M -estimates and r-estimates of classical regression. In the computation" of the MM-estimates, we use a fixed S-estimate of scale, say S n , of the orthogonal residuals that has a BP of 1/2 and solve for 8, the regression parameter. This allows us to increase the tuning constant c to gain efficiency (c = 4.7 for 95% efficiency at the normal model) by drawing closer to the quadratic x °f the Gaussian model. The first estimator we define is the orthogonal regression M M estimator (ORMM). It is the solution of the minimization problem where x is Tukey's x with c = 4.7 and Sn is as above. The second estimator is the orthogonal regression r estimator (OR-r) which is de-fined as the solution to where S(8) is implicitly defined by 27 The tuning constant c\ = 1.548 for Xi 1S chosen so that the maximal BP is achieved, the tuning constant c2 = 6.08 for X2 1S chosen so that 95% efficiency is achieved at the Gaussian model. The r estimator is an adaptive combination of a high efficiency M-estimate with a maximal BP M-estimate. If data are contaminated with a large fraction of outliers, the robust M-estimate dominates. If there are no outliers in the data or only a small fraction, the efficient M-estimate dominates. Hence the r estimator combines both bias robustness and efficiency. We have seen that robust OR can be used as a building block in robust PCA. This approach was pioneered by Chen and Li (1985). In their paper they considered the bias properties of an S-estimate of the scale of the principal components. However, they ignored the breakdown properties of the direction in which this scale is minimized. We have extended Chen and Li's method to efficient robust M M and r estimators that attain maximal BP with respect to the size of the PC's. We have also considered the direction bias and we will show that the breakdown properties of the scale are not inherited by the direction. We further show that the S-estimate of the direction of the PC's cannot be made robust and efficient at the same time. On the other hand, the M M and r estimates can be made 95% efficient while retaining a high level of robustness which, as we will show in the next section, depends on the ratio of adjacent eigenvalues. The larger the ratio the higher the BP. As this ratio approaches oo the BP tends toward 1/2. 28 5 Direction Bias Computation The claims made in this section without formal proof are intuitively clear and can be rigorously established along the lines of Zamar (1989). Classical methods are usually derived under Gaussian assumptions and are optimal when the data follow these assumptions. Gross-error-models of the form F = (1 - e)F0 + eH , where F0 is a multivariate normal with mean a (we take a = 0, for simplicity and without loss of generality) and some covariance matrix £ , are used to describe situations in which a certain fraction e of the data do not follow the central Gaussian model. Let S 0 = ~£>{FQ) be the covariance matrix at FQ. We denote by A 0 < \ \ < .. < A p the eigenvalues of I] 0 and by a0, ax,.., a p the associated eigenvectors. The treatment of the direction bias is asymptotic. We denote by sa(F) the estimating direction functional at F. It can be shown that the direction functionals are Fisher consistent, that is, a,-(i*o) = a;- The bias of the direction a,- at F is then defined (Zamar ,1989) as Bi(F) = 1 - \aj(F)&i(F0)\ = 1 - |a?(F)a,| . Note that Bi(F) lies between zero, no bias, and one, complete breakdown of the direction a,-. The maximum direction biases I?,-(e) are then Bi(e) = sup Bi(F) . F The breakdown point of a-i(F) is then (Zamar, 1989) BPi = sup {e : B,(e) < 1} . £6(0,1) The breakdown point of B.{(F) is achieved when a-i(F) becomes orthogonal to a;. Given a pair of adjacent PC's, this occurs when an outlying observation inflates the scale of the smaller P C enough to make it larger than the scale of the larger P C . Intuitively, this can be done most easily when the variances of the adjacent P C ' s are similar in size. We can find a lower bound 6t-(e) for the B,(e). The lower bound 6,-(e) depends on the fraction of contamination e and on the ratio of two adjacent eigenvalues A t + 1 and A;. We denote the square root of the ratio A 1 + 1 / A , - by r,-. The larger the r,-, the smaller the lower bound 6,(e). It can be shown that the lower bound 60(e) is sharp, that is, 60(e) = B0(e) . We conjecture that 64(e) = -B,(e) for i = 1, as well; however, we are unable to prove it . To find the lower bound 6,(e), we should, in principle, consider a l l directions a.(F) that are generated by introducing outliers at different locations in the (p+l)-dimensional space. However, it can be shown, as in Zamar (1989), that the largest deviation from the true direction a,- = a t-(F0) is obtained by moving toward a I + 1 ( i r 0 ) which is the path of the least resistance. Hence it is enough to consider a biased direction a(7) of the form a( 7) = (1 - 7)at- + v / l - ( l - 7 ) 2 a,-+1 , where 7 denotes the lower bound for the asymptotic direction bias. We also define the direction a( 7) that is orthogonal to a(7) by a( 7) = ^ / l - ( l - 7 ) 2 a,- - (1 - 7 ) a i + 1 . It turns out that the point mass contamination at y = / c a . ( 7 ) , K —* 0 0 results in the most pessimistic scenario. The gross error model associated wi th a point mass at y, 6y, is denoted by F°° = (1 - e)F0 + eSy . 30 It is enough to consider JP°° alone because being the most pessimistic contamination model it induces the largest lower bound. 5.1 r - estimate The asymptotic version a(_F0) of the r-estimate is a(F) = arg m i n M a | | = 1 r 2 (F , a ) , where The estimate of scale s(F, a) is implic i t ly defined by G £ j ) = 1 / 2 ' We assume without loss of generality that x(°°) = 1- We also define the function g(t) as g(t) = Ex(ViZ) , where Z ~ N(0,1) . To compute 6,-(e), we consider two situations (to simplify notation we assume i = 0). Firs t we calculate the value of the r-scale when the contamination is ignored, that is, when a = a.o(Fo) = a0. In this case the distribution of ajx under F°° is (1 - e)N{0, A 0 ) + e^co . Second we compute the value of the r-scale when the contamination is fitted exactly, that is, when a = a0(i?) = a(7). In this case the distribution of ajf^x is (1 - e)JV(0, (1 - 7 ) 2 A 0 + A 2 A 0 + e80 , where A 2 = [1 - (1 - 7 ) 2 ] . We can show that T(F°°, a(7)) > T(F°°, a0) implies k(F°°) = a0. O n the other hand, T(F°°, a(7)) < T(F°°, a0) implies that a.(F°°) = a(7). The lower 31 bound for the max imum bias is thus obtained by equating r 2 ( F ~ a 0 ) = r 2 ( F - a(7)) • We accomodate the contamination as long as the resulting scale is smaller than the scale we obtain by ignoring the contamination. Consider the two modelling situations: 1. Contaminat ion is ignored (F°°,a.0). The defining equation for the r-scale can be written as T2(*~,ao) = * 2 ( F ~ a 0 ) [ ( l " *)EFoX (jT^T^) + e] > or in terms of g(t) r 2 ( ^ , a 0 ) = s2(F°°, a 0 ) [ ( l - e)g2 ( ^ / j ^ ) + 4 • (13) s(F°°,a0) satisfies the following equation ( 1 - « > * ( ? ( j £ ^ ) ) + ' = 1 / 2 ' hence 2 / P O O _ \ _ Substi tuting for S2(F°°,&Q) into (13) yields T 2 ( F ~ a0) = ( l - 6 ) Ao _! f0.5-e\ , A 0 *<* ( — ) ) + ' ^ y -- 1 / 0.5-eA 2. Contaminat ion is fitted exactly (F00,a(i))• The equation for the r-scale expressed in terms of g(t) is AF~,a(7)) = ,'(*"»,a(7))(l - efe , ( 1 4 ) 32 where s(F°°, a(7)) satisfies the following equation ( i - e)gi ( ( 1 - 7 ) 2 A 0 + A 2 A 1 a2 ( F ~ a(7)) = 1 / 2 . Solving for s(F°°, we obtain ( 1 - 7 ) 2 A 0 + A 2 A ! 9i W e then substitute for s(F°°,a(i)) into (14) to obtain r 2(^,a(7)) = ( l - e ) ( 1 - 7 ) 2 A 0 + A 2 A X Equat ing T2(F°°, a(7)) and T2(F°°, a(7)) and solving for 7, we obtain the equation for the lower bound for the maximum direction bias as a function of the fraction of contamination e and the eigenvalue ratio A J / A Q , that is, 5.2 S - estimate The derivation of the lower bound for the maximum asymptotic direction bias for the S-estimate is performed in a way similar to that of the r-estimate. We again consider two cases, first where the outlier is fitted exactly and second where the outlier is ignored. The lower bound is then obtained by equating the scale estimates of the smallest P C computed in the two cases. The defining equation for 5(i ? c o,a 0) is the following A i / A 0 - / ( 6 ) A 1 / A 0 - I (15) where (16) 33 Solving it for s(F°°,a.0) yields Ar ^ (fe) Following the same steps as above we find the expression for s(F°°, a(7)) s2(Fco, a(7)) , 2 ™ (1 - 7) 2A 0 + A 2 A X Equating s2(Foc', a(7)) and s 2( JP c o, a(7)) and solving for 7 that denotes the lower bound for the maximum asymptotic direction bias, we get bs(e) = 1 where V A 0 - /(e) Ai/A„ - 1 ' f ( £ ) ( & ) Notice that if the function <jr2 = <?i, then -Epx (^f j ) = 1/2 and minimizing r 2 (F, a) becomes equivalent to minimizing S2(JF, a) and the r-estimate of direction will be equal to the S-estimate of direction. Equation (16) reduces to equation (17) resulting in bT(e) = bs(e). 5.3 M M - estimate To compute the MM-estimate of direction, we minimize the criterion J(F,a;s(F)) = EFX T a x s(F)J ' where s(F) is an S-estimate of scale with the maximal BP, that is, s(F) = s(F,a.(F)). We assume that the direction bias 70 of this robust S-estimate is smaller than the direction bias 7 of the 95% efficient MM-estimate. Notice that for 7 > 70, 5 (F o o , a (7 ) )> 5 (F o o , a 0 ) 34 and hence the robust S-estimate of direction wi l l be equal to a 0 . Hence, for al l 7 > 70 , A g a i n consider the two modelling situations: 1. Contaminat ion is ignored (F°°,a.0). In terms of the function g, the criterion J(F°°, a 0 ; s) can be written as J ( F ~ ao;3) = ( l - e ) ^ ^ + £ . (18) Substi tuting for s in (18) we get J(F°°, a 0 ; s) = (1 - eMg? ( Y T r ) ) + e • 2. Contaminat ion is fitted exactly ( i* 1 0 0 , a(7)). The criterion J(F°°, a (7 ) ; s) can be writen as J(*-,.(7); *) = (!- «)* ((1-7)2t + A'A') • (19) We then substitute for J into (19) to obtain [(1 - 7 ) 2 A o + A2X1]g? ( ^ = 1 ) 1 • / ( i^ .a f r ) ; s).= (l-e)g2 Ao Again , the achievable bias occurs when fitting the outliers is better (from the criterion point of view) than ignoring them, that is, when J ( F ~ , a 0 ; i ) > J(JP°°,a(7); s) . The lower bound for the maximum asymptotic direction bias, 6 A / M ( C ) J is attained when we equate J(F°°, a 0 ; 5) to J(Fco, a( 7); 5) and solve for 7. Doing so we obtain W ( e ) ~ 1 " \] A . / A o - l ' ( 2 0 ) 35 where 9;1 [ M 1 fe) + ^ (21) In Figure 3, we plot the lower bound for the asymptotic direction bias as a function of the fraction of contamination e for the 95% efficient S-estimate, the maximum BP S-estimate, the MM-estimate and the r-estimate when the ratio r,- is equal to ten. The plot gives a global picture of the performance of the four estimates. First we notice that the S-estimate cannot achieve robustness and efficiency at the same time. The S-estimate of direction based on the 50% BP scale estimate gives the greatest bias protection but it is very inefficient (eff=28%) at the Gaussian model. On the other hand, the 95% efficient S-estimate is not bias robust and breaks down at about 11% of contamination when the ratio r,- = y A i / A 0 = 10 (see Figure 3a) and around 7% of contamination when r,- = 2 (see Figure 3b). Second we observe that the BP of the S-estimate of direction depends on the r,- and falls short of the BP of the corresponding scale estimate which is 50% for the robust and 12% for the efficient scales, respectively. The BP of the robust S-estimate of direction is about 40% when r,- = 10 and about 23% when r» = 2. The corresponding breakdown points of the efficient S-estimates are 11% and 8%, respectively. From Figure 4, we notice that as r,- —•* 0 0 , the BP of the direction approaches that of the corresponding scale estimate. Also notice that the biggest increase in the BP occurs when 1 < r < 5; the BP curves remain fairly flat (although increasing toward the scale BP) for r > 5. Finally, we notice (see Figure 4) that as r approaches one, the BP tends to zero. However, when r ~ 1, switching the order of adjacent principal components will not seriously harm the analysis of the data. Figure 3b is a plot of the lower bound of the direction bias as a function of the 36 fraction of contamination e for an r,- of two. This ratio has been used in the Monte Carlo study that was carried out to assess the finite sample performance of the four estimates. As we expected, the M M - and r-estimates of direction can combine efficiency and robustness. F rom the accompanying pictures we notice that the bias performance of the 95% efficient M M - and r-estimates is much better than that of the efficient S-estimate. Al though the bias behaviour of the M M - and r-estimates is in general worse than that of the robust S-estimate, the performance gap quickly narrows for increasing r-,-. From (15) and (20) we notice that the B P of the M M - and r-estimates occurs when /(e) (defined by (16) and (21)) is equal to A i / A o , that is, S P = r 1(A 1VA 0) . Hence, / - 1 ( A i / A 0 ) —> 1/2 as AX/AQ —• oo. Final ly , notice that the r-estimate has visibly better bias characteristics than the MM-est imate for fractions of contamination larger than 0.27. This is to be expected because the r-estimate, being of an adaptive nature, should accomodate better a large percentage of outliers than the fixed scale MM-estimate. For smaller e, the MM-est imate is marginally better than the r-estimate. 37 epsilon Figure 3a: Plot of Direction Bias vs Fraction of Contamination (y A i / A 0 = 10). o 0.30 epsilon Figure 3b: Plot of Direction Bias vs Fraction of Contamination (\JXI/\Q = 2) 38 d CO 6 c o 'w a . 01 Figure 4: P lo t of Fraction of Contamination versus Xi/XQ. 6 Proposed Estimator Although there are a number of robust scatter matrix estimators in existence, they are computed via the minimization of a nonconvex function with multiple minima. This is both time consuming and difficult to achieve. Also, to increase the probability of convergence to a global minimum, a "good" ini t ia l starting value for the function must be provided. The method proposed here is quick and easy to compute. A Monte Carlo study wi l l show that it compares favourably wi th other robust methods. It can also serve as an in i t ia l estimate for other, more complicated procedures. A m o n g the properties of the proposed estimator are high breakdown point and or-thogonal equivariance. Usually, one requires an estimator to have a somewhat stronger type of equivariance, affine equivariance. However, P C A is based upon successive mini-mizations of the variance of a linear combination of the elements of x. This combination is given by a particular eigenvector of the scatter matrix. The set of such eigenvectors forms an orthogonal basis in the factor space that is to be preserved under transforma-i 1 1 : 1 1 1— 0 20 40 60 80 100 S/N ratio 39 tion. An orthogonal equivariant estimator satisfies such a requirement. Here we describe in detail the procedure used to compute the proposed estimate. The algorithm has three steps: 1. Center the data. First we robustly center the data. In order to achieve orthogonal equivariance and a high BP for the estimate of scatter, we want to center the data by a mul-tivariate location estimate that is orthogonal equivariant, robust, independent of the estimate of the scatter matrix and easy to compute. These requirements are satisfied by the Li estimate of location defined (see for example Lopuhaa, 1990) as the solution to the minimization problem The asymptotic BP of this location estimate is 1 /2. Once T„ is computed, it is sub-tracted from the data so that we obtain a new centered data set {x1?.., x n), X ; = X i — T n . For simplicity, the tilde will be dropped from the notation. 2. Compute the initial estimate of scatter, S 0 . To initialize the procedure, we compute a robust weighted estimate of scatter. The weights used are generated by the function u(r) that is deacreasing on [0,oo) and such that u(r) is equal to zero when its argument exceeds some previously specified constant. Let s be the robust scale imlicitly defined by Xa is defined as before and a = $(3/4). The above maximal BP M-estimate of scale is known as the median absolute deviation from the median, MAD. The n m i n X J x i - T . ± n t=i n I 40 initial estimate of scatter So is SQ(XI, .., x n) V s j 3. Compute Sk+\ given Sfc Let a1;..,ap be the eigenvalues of , the current estimate. Then the [k + l)th estimate of the scatter matrix is of the form 1 2 S * + 1 ( X i , . . , X n ) = J2XiXJWik ^i=l vyik j'=l The WikS are defined as the product of weights that are based on the standardized absolute principal components, that is, The weight generating function w(r) is decreasing on [0,oo) and such that w(r) is zero whenever its argument exceeds a previously specified constant. The scales Sj , j=l,..,p are the MAD, that is, I E * (^0=1/2. Termination criterion: We repeat the third step until the maximum difference between the current weights and the weights obtained at the previous step is less than some S (in the Monte Carlo study 8 was taken to be 0.001) or until the maximum number of iterations (user specified) is reached. 1 Some Properties of Proposed Estimate n this section, we will discuss the properties of the proposed estimates of the direction d size of the principal components (PC). These estimates are based on a reweighted 41 scatter matrix which, at convergence, is of the form S(x l 5 . . ,x n ) = ^ - l ^ f ^ t x i ) (22) where W{ is the weight assigned to observation X j . The weight generating function, w, is even, decreasing on [0,oo) and such that |w(r)r| is bounded. We will show that S(xi , . . ,x n ) is orthogonal equivariant and Fisher consistent. This will in turn imply that the estimates of the directions of the PC's are orthogonal equivariant and Fisher consistent and that the estimates of the size of the PC's are Fisher consistent. We will also show that the BP of S(xi, . . ,x n ) tends to 1/2 as the sample size approaches co. The BP is inherited by the estimate of the size of the PC's; however, the BP of the direction may be smaller as this was indeed the case for the S-, M M - and r-estimates. However, as we will see in the next section, simulation results suggest that the BP of the direction of the PC's based on the proposed method is higher than that of the robust S-scale. Further study is required to investigate finite sample properties of the proposed estimate. Theorem 1 Let {xx,..,x n} be a p- dimensional sample of size n. Suppose the weight functions u(t) and w(t) are even, decreasing on [0,oo) and such that \w(t)t\ and \u(t)t\ are bounded. Then the estimate Sn given by (22) is orthogonal_ equivariant, that is, S n (QX) = Q S n ( X ) Q r , where Q is any orthogonal matrix. Proof: First of all we notice that the data can be robustly centered by using, for example, the L\ estimate of multivariate location (Lopuhaa, 1990) which is clearly orthogonal equivariant and has breakdown point 1/2. Hence we can assume without loss of generality that the location estimate T(x!, . . ,x n ) is equal to zero. Let y, = QXj- (i=l,..,n). Recall that the initial weights u(||x,-||) are based on the Euclidean norms. Since ||x,-|| = ||Qx,|| = ||y,-||, the initial estimate of scatter S 1 (x 1 , . . ,x n ) is orthogonal equivariant, that is, l l y i ' " ' y n j " E?=1^(||yt||) £2= i« 2(IWI) = Q S 1 ( x 1 , . . , x n ) Q r . The proof is now completed by induction. Suppose that S m (x i , . . , x n ) is orthogonal equivariant. Let Xi < A2 < ... < Xp and ai, a 2 , a p be the eigenvalues and eigenvectors of S m (x i , . . , x„). Let hi, b 2, . . ,b p be the eigenvectors of S m ( y i , . . ,y n ) . Then ST O(x 1,.., x n )a = Aa . Premultiplying the above by Q, we get QS m (x i , . . , x n ) a = AQa . By using the fact Q r Q = I, we have [QSm(x 1 , . . ,x n)Q T][Qa] = [Sm( y i,..,y n)][Qa] = A[Qa] . Hence Qa is an eigenvector of Sm(yi,.., y n) corresponding to A. Therefore b, = Qa,. Let sxj = Median,-(|ajx,|) = Median^ |bjy,-|) = syj (23) and ^ = lW^) = lW^ W»< • (24) j=l \ AXJ / i = l \ &yj J 43 Then, by (23) and (24), S m + i ( x i , . . , x n ) is orthogonal equivariant, that is, E i = i y«y« Sm +l(yi, Yn) — E ? = 1 ( Q x O ( Q x O r ^ i t j T Q S m + 1 ( x 1 , . . , x n ) Q - / . This concludes the proof. Theorem 2 LetFsfe) be an elliptical distribution with location parameter u and scatter parameter's. The corresponding density is /s(x) = |X|-1/2<7(||(x — / i)S~ 1 / , 2 | | ) . Suppose the weight functions u(t) and w(t) satisfy the assumptions of Theorem 1. Then the estimating functional S(.FE) is Fisher consistent, that is, S( FE) = S . Proof: Let Ax < A2 < ... < A p be the eigenvalues of £ and ai ,a 2 , . . . ,a p be the corre-sponding eigenvectors. By Theorem 1, we can assume without loss of generality that {a 0, a 1 ; a p } = {e0, e 1 ? e p } , where et = (Oi, 0 2,.., 0 t-_ l 51,0,+ 1,.., 0) T. Therefore, £ = diag(\\,Ap) . and the corresponding density function is Asymptotically, convergence occurs in one step; however, the proof is done in two steps to allow the reader to follow the natural course of our argument. First we show that 44 the initial estimating functional So(i7s) (using the weight function u(||x||)) is diagonal. To see that S0(-Fs) is diagonal, notice that by symmetry we obtain /co /-co ' i •••/ XiXkfs(^)u(Jxl -\ x2p)dxi • • • dxp = 0 for i^k . -oo J—oo Recall that the weight function used after the initial step is based on the principal components. Hence the weights we use in the second step are based on the natural coordinates of x, that is, the projections e^ x (j=l,..p). Again we have to show that the cross-product vanishes to ensure that S(Fj^) remains diagonal. To see that this is the case, notice that by symmetry E[xiXk w (^ij] = f ••• I XiXkfe(x) JJ w ( ^ - j dxx • • • dxp = 0 for i^k . j=i \sj J J-°° J-°° j=i \sj J Next we show that the diagonal elements of Si(i*s), A i , . . , A p , satisfy \i = K(Fi)\i i=l,..,p, where K(F\) is a known constant. Notice that in such a case S(Fj^) = S^JFE), that is, convergence occurs in one step. To show that A,- = XiK(Fj), we have to evaluate „ r co roo P / X I A , - = / • • • / x]Mx)T[w \-)dx1---dxp, (25) By noticing that ij (the median absolute deviation, MAD, was used) is Fisher consistent, i.e. Sj = \jx~j, we can write (25) as A,. = - r i = r.... r xUiC^+• • • $ fi w dXl... dxp. Finally, we make a change of variables x,- = x\Jy/\l to obtain ^ roo roo _P A, = A,- / • • • / x2Ji(x\ + • • • x2) w(xj)dx1 •••dxp = XiK(Fi) . J — C O J—OO • i This results in Fisher consistency for the direction of the principal components. To obtain Fisher consistency for the size of the principal components, the scale estimate A,-must be divided by the known constant K(F\). 45 Theorem 3 At any p-dimensional sample { x i , . . , x n } in general position, that is, there are no more than p points in any (p — l)-dimensional hyperplane, the breakdown point of the proposed estimate equals which converges to 1/2 as n —> oo. Proof: Notice that the weight function w is positive for some constant c1 greater than one and zero for some finite constant c2 that is larger than c\. Consider any sample { y i , . . , y n } obtained by replacing at most [n/2] — p o i n t s in { x i , . . , x „ } by arbitrary values. For any unit vector a in the p-dimensional space we have s ( a T y 1 ; a T y 2 , ...,a r y n ) < s ( | | y i | | , | | y 2 | | , | | y „ | | ) < | | y | | ( [ n / 2 ] + P ) = d < oo . Notice that we can write any observation y,- as a linear combination of orthonormal unit vectors, { a l 5 a 2 , a p } , that span the p-dimensional space, that is, p y ^ ^ y czji&j , where the a.j's are non-negative. Hence | | y , | | 2 = £ j = i ct2-. Suppose that a large outlier, say y t , has a norm | | y A - | | 2 > p(c2d)2. Then at least one ctjh > c2d, say a l f c . The weight wlk for the observation y ^ is We define the weight Wk assigned to the k th observation as the product of the individ-ual weights, Wjk, obtained for the projections a j y ^ for i=l,..p associated with the k th observation. Hence, Wk — I lj = i wjk = 0; an observation that is large with respect to the "good" data is assigned a weight of zero. This will bound the largest scale away from oo. 46 To show that the smallest scale does collapse to zero, consider the most pessimistic direction, say a0, in which all outliers have a null projection and p observations lie in a (p — 1)-dimensional hyperplane. Hence, we will have a sequence of absolute ordered projections (0(1), 0(2), ...,0([n/2]_(p+l)),0([n/2]-p), ...,0([n/2]-l),^([n/2]),-.,^([n/2]+p), •••,&(«)} , where A;/ > 0 for l=([n/2]),..,(n). The corresponding scale of is the ([n/2]+p)t/l absolute projection, that is, s(a0) = fc([n/2]+p) • Hence there will be at least p+1 "good" points of {yi, ..,y n} that will have standardized absolute projections less than or equal to one. From the definition of the weight function, these points will have nonzero weights. Since these points are in a general position, they determine a convex hull with nonzero volume and the corresponding matrix is nonsingular. 47 7 Tables - Simulation Results The core model (contamination-free) used in the Monte Carlo study carried out to assess the finite sample performance of the estimates was the following yi = /3Xi + Vi x,- = Xj + ut- . Vi and u t are normal random errors. The errors are assumed to be uncorrelated with mean zero and variance cr2 and a^I. We considered two situations, p = 1 (x is one-dimensional) and p = 4 (x is four-dimensional). In the one-dimensional case we considered sample sizes of twenty and sixty observa-tions. The parameter 3 was set to 0, 1, 5 and 10. The X^s were distributed as N(0,1), Vi ~N(0,0.25) and ut- ~N(0,0.25) (\f\\/\o — 2). The fraction of contamination e was set to 0 (Gaussian model), 0.05, 0.10, 0.15, 0.20 and 0.25 with contamination-generating distributions N(3,0.25), N(5,0.25), N(10,0.25), N(15,0.25), N(20,0.25) and N(25,0.25). The outliers were generated in either y or x (more complicated contamination models were not considered due to the high cost of simulation and time constraint). Each sam-pling situation was replicated two hundred times. The following was used to assess the performance of the estimates 200 where Bi is the direction bias of the estimate at the iih replicate. The estimate of the direction a is defined as BT was computed for each sampling situation. Notice that Bj ranges from 0 (no direction bias has been induced) to 200 (maximum lower bound for the direction bias was reached at every replicate). The lower the value of Br the better the bias behaviour of the estimate. 48 In the four-dimensional case we considered sample sizes of forty and seventy five. The parameter B is now a vector of parameters and in the Monte Carlo study it was taken to be one of 0,1, 5 and 10. The vector Xj was distributed as N(0,1I). The error structure was as above with U j being N(0, 0.251). The fraction of contamination was as above with contamination-generating distributions N(5,0.25), N(10,0.25), N(15,0.25), N(20,0.25) and N(30,0.25). The outliers were generated in either y or x\. Each sampling situation was replicated a hundred times. Bj was again used to compare the bias behaviour of the estimates. The random number generator used in the Monte carlo study was adapted from an article by Schrage (1979). Eigenvalue-eigenvector decomposition was done using the QR algorithm. The estimates considered in the Monte Carlo were the following: 1. Orthogonal regression. 2. Orthogonal regression analog of LMS based on Tukey's Xc (c=1.548). 3. 95% efficient S-estimate using Tukey's Xc (c=4.70). 4. S-estimate (BP=l/2) using Tukey's Xc (c=1.548). 5. MM-estimate using Tukey's Xc (c=4.70). 6. r-estimate using Tukey's Xc (c=6.08). 7. Proposed estimate with u(r) equal to one for r less than 2.50. The weight function w(r) is defined as follows w(r) 1 if \r\ < 1 pr if 1 < Irl < 2.5 | r | I I — 0 otherwise 49 To improve the performance of the proposed estimate, we considered a slightly different initial estimate S 0 . To compute it, we have adopted an approach that is conceptually closely related to the Donoho-Stahel estimate. However, instead of looking at all one-dimensional projections that leave an observation x,- most exposed, we limit our search to a set of randomly generated orthonormal bases (including the canonical basis). The X; 's are projected onto the basis vectors. We define a projection index 7; as _ |vTx,- — medianj(v TXj)| 7' vi^basis medianfc|vTxfc - median j(v rXj)| The weights ^(7,) are assigned to each observation, X j , according to the weight function w(») defined as above. The resulting multivariate estimates of location and scatter are ; _ E"=i Wi^i E,-=i wi A _ Eti(x,--to)(x,--t. 0) Tu; 2 This estimate is relatively easy to compute and has a breakdown point of 1/2. 8. One-step reweighted estimate with zero-one type of weights based on the proposed estimate. An observation is assigned a weight of zero when the corresponding proposed method weight is zero, otherwise the weight is equal to one. The following tables summarize simulation results for the seven robust estimates con-sidered, ORLM, 95% efficient S-estimate, S-estimate with BP of 1/2, MM-estimate, r-estimate, one-step reweighted OR with weights based on the proposed method (WOR) and the proposed method estimate (MPP) together with classical orthogonal regression for comparison. We only include tables that show clearly the merits of using the robust estimates in place of classical estimates when the data are contaminated by outlying observations. 50 Table 2: Contamination in Y, N(3,0.25), 8=0, n=20, m=200. e OR ORLM S(95%) S(BP=0.5) M M r WOR MPP 0.00 1.729 6.849 2.893 7.065 3.058 3.027 2.160 7.997 0.05 16.128 7.343 3.850 7.368 3.568 5.844 5.125 5.780 0.10 56.041 8.518 29.662 9.530 10.789 12.254 13.938 8.836 0.15 93.977 18.839 86.486 17.807 26.621 33.595 24.401 9.535 0.20 113.490 20.039 119.962 18.549 58.561 55.115 39.843 15.937 0.25 137.259 31.845 146.006 35.817 110.012 99.240 48.995 18.023 Table 3: Contamination in Y, N(3,0.25), 8= 0, n=60, m=200. 6 OR ORLM S(95%) S(BP=0.5) M M r WOR MPP 0.00 0.520 1.985 0.540 1.960 0.546 0.571 0.520 1.418 0.05 5.489 2.142 0.916 1.757 0.863 1.203 1.130 1.486 0.10 43.451 1.916 7.440 2.137 1.191 2.462 1.588 1.315 0.15 105.231 3.451 90.313 2.659 6.754 11.586 4.707 1.791 0.20 138.850 2.585 145.655 2.960 65.327 51.431 10.920 1.567 0.25 157.160 5.416 167.097 5.958 155.097 131.435 23.000 3.232 51 Table 4: Contamination in Y, N(5,0.25), 3=0, n=20, m=200. e OR ORLM S(95%) S(BP=0.5) M M r WOR MPP 0.00 1.729 6.849 2.893 7.065 3.058 3.027 2.160 7.997 0.05 99.114 7.070 6.737 8.468 5.548 7.403 5.488 9.534 0.10 156.263 11.718 63.213 11.867 9.269 10.800 7.134 9.049 0.15 171.012 15.320 178.990 17.221 14.268 19.381 9.104 10.145 0.20 175.600 17.266 182.988 20.762 32.522 35.111 5.428 6.867 0.25 176.412 29.579 182.560 37.020 94.562 92.341 10.439 11.378 Table 5: Contamination in Y, N(5,0.25), 8= 0, n=60, m=200. e OR ORLM S(95%) S(BP=0.5) M M r WOR MPP 0.00 0.520 1.985 0.540 1.960 0.546 0.571 0.520 1.418 0.05 106.260 1.879 0.500 1.838 0.523 0.843 0.483 1.291 0.10 174.419 1.616 49.046 1.672 0.673 1.180 0.641 1.384 0.15 181.852 1.570 187.940 1.817 0.845 2.285 0.729 1.273 0.20 184.547 1.356 189.764 2.831 11.795 16.112 0.778 1.501 0.25 186.338 3.379 192.827 4.844 81.398 79.776 1.554 1.786 52 Table 6: Contamination in X, N(20,0.25), 8=5, n=20, m=200. e OR O R L M S(95%) S(BP=0.5) M M T W O R MPP 0.00 0.0648 0.2405 0.0669 0.2208 0.0724 0.1150 0.0648 0.2764 0.05 35.9934 0.2109 0.0837 0.2317 0.0985 0.1550 0.0718 0.2356 0.10 105.4132 0.2694 0.0753 0.2834 0.0821 0.1848 0.0967 0.2536 0.15 130.9318 0.1580 111.8115 0.2381 0.0934 0.1866 0.0694 0.1480 0.20 143.1987 0.1580 131.8617 0.2966 0.1183 0.2603 0.0840 0.3040 0.25 144.1752 0.1986 141.6511 0.4552 0.1949 0.4162 0.1064 0.2136 Table 7: Contamination in X, N(20,0.25), 8=5, n=60, m=200. 6 OR O R L M S(95%) S(BP=0.5) M M r W O R MPP 0.00 0.0153 0.0653 0.0164 0.0627 0.0168 0.0169 0.0153 0.0595 0.05 18.8693 0.0610 0.0193 0.0565 0.0201 0.0284 0.0185 0.0608 0.10 105.5620 0.0607 0.0224 0.0615 0.0238 0.0415 0.0202 0.0587 0.15 131.1035 0.0523 131.7244 0.0573 0.0247 0.0449 0.0213 0.0411 0.20 144.2054 0.0445 147.7639 0.1046 0.0368 0.0873 0.0223 0.0489 0.25 143.9072 0.0528 150.3505 0.1888 0.0682 0.1701 0.0261 0.0409 53 Table 8: Contamination in Y (dim=5), N(10,0.25), 3=0, n=40, m=100. e OR ORLM S(95%) S(BP=0.5) M M T WOR MPP 0.00 1.794 8.972 1.895 8.183 2.181 2.425 1.800 7.076 0.05 92.399 10.130 4.316 9.940 4.713 7.021 3.280 7.191 0.10 94.399 16.350 16.796 17.030 13.129 15.472 6.825 11.372 0.15 96.311 18.532 97.659 22.511 16.425 25.028 1.918 8.824 0.20 96.393 37.183 97.190 45.574 45.832 51.641 6.040 11.085 0.25 96.985 56.955 97.529 73.062 89.264 91.446 6.806 12.250 Table 9: Contamination in XI (dim=5), N(10,0.25), 3=5, n =40, m= =100. e OR ORLM S(95%) S(BP=0.5) M M r WOR MPP 0.00 1.393 5.328 1.422 4.944 1.636 2.461 1.446 9.401 0.05 12.663 9.130 5.544 8.795 4.724 7.204 3.167 6.726 0.10 15.156 11.331 16.162 11.099 8.948 11.027 3.144 7.648 0.15 15.607 15.550 15.781 15.953 14.629 15.708 4.772 9.468 0.20 15.591 17.365 16.977 19.181 17.749 18.091 5.070 6.022 0.25 15.278 19.201 16.556 20.420 16.711 17.622 6.016 8.205 54 An extensive simulation was done to assess the breakdown properties of different robust orthogonal regression estimates. Given that the upper bound for the maximum direction bias is one, we assume here that an estimate has broken down if the 95 th quantile of the direction bias is larger than 0.90. Using this criterion, the empirical BP's for the direction are Table 10: Empirical BP's of the estimates. OR ORLM S(95%) S(BP=0.5) M M T WOR MPP e* 0.00 > 0.25 « 0.10 > 0.25 « 0.20 « 0.20 > 0.25 > 0.25 The empirical BP's are slightly higher than what is expected theoretically (note that the ratio \Jx~iJ\~o used in the simulations was two). This is probably because the simulations do not reflect the most damaging type of contamination. The Monte Carlo study only confirms what has been theorized. The efficient S-estimate performs well at the Gaussian model and for small fractions of contamination. However, it breaks down early. The maximal BP S-estimate does not break down until about 25% of contamination but its performance at the Gaussian model is unsatisfactory. The same applies to the ORLM. The MM- and r- estimates perform well at the Gaussian model and, for larger fractions of contamination, attain nearly the same level of bias robustness as the maximal BP S-estimate. The proposed estimate appears to have better bias characteristics than the robust S-estimate throughout the e-range; however, it is quite inefficient at the Gaussian model. To improve efficiency while retaining a high BP, we considered a one-step reweighted estimate with weights based on the proposed estimate. It is evident from the tables above that this approach yields superior results. The one-step estimate is more efficient than the M M - and r-estimates at the Gaussian model and its bias performance over the e-range is exemplary. Although the proposed estimate and the one-step estimate give 55 good results, further study of their properties is required. 56 8 Applications of P C A 8.1 Orthogonal Regression Orthogonal regression (OR) is the maximum likelihood procedure at the Gaussian error-in-variables (EV) model. In classical regression, the response variable consists of a deterministic part (assumed known) / ? T X and a random part e where e is assumed to be normally distributed with mean zero and covariance equal to some multiple of the identity matrix. The X,'s can also be random but they are assumed to be observed without error. In OR, the X,-'s are either random independent identically distributed vectors ( structural E V model) or they are non-random but unknown (functional EV model). Let Vi = ct + 3TXi + Vi (26) Xj = Xj' + l l j , where a is the intercept, 8 is the vector of regression parameters and u and v are errors with zero mean and some variance, possibly different, u and v are uncorrelated. Then the OR estimates are the solution of the following minimization problem . » (yj-a-8Txt\ mm > x / In the classical setup x(x) = x2- To make the OR method robust, the function X 1 S chosen to reduce the influence of outlying observations. The above orthogonal regression problem can be restated as follows. Let xz = X,- + u,-be a set of random vectors satisfying the condition a0X,- = 60, where a^ ao = 1 and b0 is some constant. Then the vector a and the number b are found by minimizing i t ( a T X i - 6 ) 2 . 7 1 i=i 57 Figure 5: Plot of Classical versus Robust PCA (aTXj- — b)2 is the square of the orthogonal distance from x; to the hyperplane H(a, b) = {x : a Tx = b} (Zamar, 1989). It can be shown that a is the principal component of the sample covariance matrix corresponding to the smallest eigenvalue. Thus orthogonal regression reduces to finding the the direction of the smallest principal component (PC). To obtain robust orthogonal regression estimates we find the direction of the smallest robust PC. 8.2 Outlier Detection using P C A Robust PCA can also be used for detecting outliers in higher dimensional spaces. This can be done by identifying observations that give rise to unusually large principal components. To emphasize why classical PCA is not well suited to this purpose, let us consider the artificial data set in Zamar (1989). It consists of twenty three-dimensional vectors four of which have been made outlying, observations 1, 2, 19, 20 (marked with a $ sign). We have plotted all pairwise combinations of the PC's for the classical and 58 robust PCA. In the classical PCA, we do not have a clear indication that some of the observations may be outliers. It is possible that an experienced data analyst could identify the outliers from the PC1-PC3 plot (not shown). However, his conclusions would be subjective, based on his or her experience. To make outlier detection objective, we use robust methods. In this example we have used the proposed estimator. From the robust PC2-PC3 plot it appears that observations 1, 2, 19 and 20 are unusually large in the third PC. This indicates that they may be possible candidates for outliers as these observations do not conform with the structure of the rest of the data. These points have been purposely designed to be outliers with the intention of upsetting the classical estimates. The classical PCA failed to identify these observations because they inflated the scale of the third PC so that outliers would not be detectable; in fact, the classical scale was almost ten times larger than its robust counterpart. On the other hand, the proposed estimator clearly distinguished between "good" data and "bad" data. The example illustrates the dangers of relying on classical methods for outlier detec-tion and how robust methods can provide a better picture of the situation by identifying aberrant observations and remaining stable in their presence. 59 9 References B A R N E T T , V . (1976). The ordering of multivariate data. Journal of the Royal Statistical Society A 138 318-344. B E B B I N G T O N , A . C . (1978). A method of bivariate t r imming for estimation of the correlation coefficient. Applied Statistics 38 221-226. B O E N T E , G . (1987). Asymptot ic theory for robust principal components. Journal of Multivariate Analysis 21 67-78. B U T L E R , R . W . and J U H N , M . (1987). Asymptot ic for t r immed multivariate data. Revised Preprint November 1987, University of Michigan and University of Flor ida . C H E N , Z . and L I , G . (1985). Projection-pursuit approach to robust dispersion matrices and principal components: P r imary theory and Monte Carlo. JASA 80 759-766. D A V I E S , P . L . (1989). Improving 5-estimators by means of k-step M-estimators. Tech-nical report, G H S - Essen. D E V L I N , S.J . et. al. (1975). Robust estimation and outlier detection with correlation coefficients. Biometrika 62 531-545 D O N O H O , D . L . (1982). Breakdown properties of multivariate location estimators. P h . D . qualifying paper. Harvard University. D O N O H O , D . L . and H U B E R , P . J . (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P.J .Bickel , K . A . D o k s u m , J.L.Hodges Jr. , eds.) 157-184. Wadsworth, Belmont, California. G N A N A D E S I K A N , R. and K E T T E N R I N G . J .R . (1972). Robust estimates, residuals and outlier detection with multiresponse data. Biometrics 28 81-124. H A M P E L , F . R . , R O N C H E T T I , E . M . , R O U S S E E U W , P . J . and S T A H E L , W . A . (1986). Robust statistics : The approach based on influence functions. Wiley, New York. 60 H E L B L I N G , J . M . (1983). Ellipsoides minimaux de couverture en statistique mult i -variee, P h . D . thesis, Ecole Polytechnique Federale de Lausanne, Switzerland. H U B E R , P . J . (1964). Robust estimation of a location paramater. Annals of Statistics 35 73-101. H U B E R , P . J . (1981). Robust statistics. Wiley, New York. J O H N S O N , R . A . and W I C H E R N , D . W . (1988). Applied multivariate statistical analysis. Prentice-Hall , 2 n c ^ ed. J O L L I F F E , I.T. (1986). Principal component analysis. Springer -Verlag, New York. L O P U H A A , R . (1990). Estimation of location and covariance with high breakdown point. P h . D . thesis. Delft University of Technology, Netherlands. M A R O N N A , R . A . (1976). Robust M-estimates of multivariate location and scatter. Annals of Statistics 4 51-67. M A R T I N , R . D . , Y O H A I , V . J . and Z A M A R , R . H . (1989). Min-max bias robust regres-sion. Annals of Statistics 17 1608-1630. M A R T I N , R . D . and Z A M A R , R . H . (1989). Bias robust estimation of scale when loca-t ion is unknown. Technical report. R O U S S E E U W , P . J . and L E R O Y , A . M . (1987). Robust regression and outlier detection. Wiley, New York. S C H R A G E , L . (1979). A more portable Fortran random number generator. ACM Trans. Math. Softw. 5 132-138. S T A H E L , W . A . (1981). Robust Estimation: Infinitesimal Opt imal i ty and Covariance M a t r i x Estimators. P h . D . thesis (in German), E T H , Zurich. T I T T E R I N G T O N , D . M . (1978). Estimation of correlation coefficients by ellipsoidal t r imming. Applied Statistics 27 227-234. T U K E Y , J . W . (1974). T6 : Order statistics, in mimeographed notes for Statistics 411, 61 Princeton University. Y O H A I , V . J . and Z A M A R , R . (1988). High breakdown-point of estimates of regression by means of the minimizat ion of an efficient scale. JASA 83 406-413. Y O H A I , V . J . and Z A M A R , R . H . (1990). Discussion of Paper No. 90 S M 493-7 P W R S . Z A M A R , R . H . (1989). Robust estimation in the errors-in-variables model. Biometrika 76 149-160. 62
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Robust principal component analysis via projection...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Robust principal component analysis via projection pursuit Patak, Zdenek 1990
pdf
Page Metadata
Item Metadata
Title | Robust principal component analysis via projection pursuit |
Creator |
Patak, Zdenek |
Publisher | University of British Columbia |
Date Issued | 1990 |
Description | In principal component analysis (PCA), the principal components (PC) are linear combinations of the variables that minimize some objective function. In the classical setup the objective function is the variance of the PC's. The variance of the PC's can be easily upset by outlying observations; hence, Chen and Li (1985) proposed a robust alternative for the PC's obtained by replacing the variance with an M-estimate of scale. This approach cannot achieve a high breakdown point (BP) and efficiency at the same time. To obtain both high BP and efficiency, we propose to use MM- and τ-estimates in place of the M-estimate. Although outliers may cause bias in both the direction and the size of the PC's, Chen and Li looked at the scale bias only, whereas we consider both. All proposed robust methods are based on the minimization of a non-convex objective function; hence, a good initial starting point is required. With this in mind, we propose an orthogonal version of the least median of squares (Rousseeuw and Leroy, 1987) and a new method that is orthogonal equivariant, robust and easy to compute. Extensive Monte Carlo study shows promising results for the proposed method. Orthogonal regression and detection of multivariate outliers are discussed as possible applications of PCA. |
Subject |
Principal components analysis Robust statistics |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-11-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0098431 |
URI | http://hdl.handle.net/2429/29737 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1990_A6_7 P38.pdf [ 3.52MB ]
- Metadata
- JSON: 831-1.0098431.json
- JSON-LD: 831-1.0098431-ld.json
- RDF/XML (Pretty): 831-1.0098431-rdf.xml
- RDF/JSON: 831-1.0098431-rdf.json
- Turtle: 831-1.0098431-turtle.txt
- N-Triples: 831-1.0098431-rdf-ntriples.txt
- Original Record: 831-1.0098431-source.json
- Full Text
- 831-1.0098431-fulltext.txt
- Citation
- 831-1.0098431.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0098431/manifest