A New Contamination Model for Robust Estimation with Large High-Dimensional Data Sets by Fatemah A l i Alqal laf B . A . (Mathematics) Kuwait University, 1994 M.Sc . (Statistics) University of Br i t i sh Columbia , 1999 A T H E S I S S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Doctor of Philosophy in T H E F A C U L T Y O F G R A D U A T E S T U D I E S Department of Mathematics (Institute of Applied Mathematics) we accept this thesis as conforming to the required standard The University of British Columbia A p r i l 2003 © Fatemah A l i Alqallaf, 2003 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Mathematics The University of British Columbia 121 - 1984 Mathematics Road Vancouver, B C Canada, V6T 1Z2 Abstract Data sets can be very large, highly multidimensional and of mixed quality. This thesis provides feasible and robust methods for estimating multivariate location and scatter matrix for such data. Our estimates scale well to very large sample sizes and dimensions and are resistant to the presence of multivariate outliers. Statisticians use contamination or mixture models to study the performance of robust alternatives to classical statistical procedures. Most multivariate contamination models for numeric data proposed to date (see Hampel et al., 1986) assume that the majority of the observations comes from a nominal distribution such as a multivariate normal distri-bution, while the remainder comes from another multivariate distribution that generates outliers. We stress that such outliers could be "bad" data due to recording errors of all kinds, or they could be a highly informative subset of the data that leads to the discov-ery of unexpected knowledge in areas such as business operations, credit card fraud, and even the analysis of performance statistics of professional athletes. Unfortunately, the previously available models do not adequately represent reality for many multivariate data sets that arise in practice. It may often happen that outliers occur in each of the variables independently of the other variables or in special dependency patterns. We introduce a new contamination model that overcomes the main drawbacks of the current models by taking into account different sources of variability in the data, and allowing greater flexibility. Moreover, our model permits for situations where extreme values of one or more variables (not necessarily outliers) may increase the likelihood of outliers or gross errors in other variables. There is a large statistical literature on robust covariance and correlation matrix i i estimates, with an emphasis on affine equivariant estimates that possess high breakdown points and small worst case biases. A l l such estimates have unacceptable exponential complexity 2P in the number of variables p. And one of the more attractive of these estimates, the Stahel-Donoho estimate, has an unacceptable quadratic complexity n2 in the number of observations n. These estimates may be applied in large data applications with large p and n only by the use of adhoc sampling methods that render the robustness properties of the estimates unclear. In this thesis we focus on pairwise robust scatter matrix estimates and coordinate-wise location estimates. The pairwise scatter estimates are based on coordinate-wise robust transformations (the quadrant correlation estimate, and the coordinate-wise Huberized estimates). We show that such estimates are computationally simple, and have attractive robustness properties under the existing and the newly proposed contamination models. i i i Contents Abstract i i Contents iv List of Tables viii List of Figures xi Acknowledgements xiv Dedication xvi 1 Introduction 1 1.1 Robust Estimates 1 1.1.1 Applications and Uses of Robust Estimates 2 1.1.2 Robust Proposals of Scatter Estimates 6 1.2 Problem, Motivation and Approach 7 1.3 Contributions and Outline of the Thesis 9 2 Background and Related Work 13 2.1 Univariate Robust Statistics 15 2.1.1 Influence Function 16 2.1.2 Robustness Measures 17 2.1.3 M-Estimates for Location 19 2.2 Robust Estimation in the Multivariate Setting 24 iv 2.3 M-Estimates 30 2.4 S-Estimates 34 2.5 M C D Estimate 38 2.6 Conclusions 43 3 Multivariate Contamination Models 45 3.1 Classical Contamination Model 45 3.2 Real Data Examples 47 3.3 New Contamination Model 57 3.4 Dependence Structures in the Contamination Indicator Matrix 62 3.5 Dependence Structures in the Contamination Model 65 3.5.1 Independent-Contamination Model 65 3.5.2 Contamination Vector as a Function of Contamination Indicator Matrix 66 3.5.3 Contamination Vector as a Function of Contamination Indicator matrix and Uncontaminated Vector 67 3.6 Robust Estimation of Multivariate Location and Scatter: Problems and Motivation 70 3.7 Chapter Appendix 72 4 Robust Estimation of Multivariate Scatter 74 4.1 Classical Scatter Estimate 74 4.2 Simple Class of Pairwise Robust Scatter Estimate 76 4.3 Performance of Pairwise Huberized Scatter Estimate 77 4.4 Asymptotic Properties of Huberized Correlation Coefficients 86 4.4.1 Consistency of Huberized Correlation Coefficients 86 4.4.2 Asymptotic Normality of Huberized Correlation Coefficients . . . 87 4.5 Bias in Quadrant and Huberized Correlation Coefficients 94 v 4.6 Positive Definite Pairwise Robust Scatter Estimates 98 4.6.1 Preliminary Estimation of Scatter Matrix 100 4.6.2 Computational Complexity and Computing Times 102 4.7 Maximum Bias of Quadrant and Huberized Correlation Coefficients . . . 110 4.7.1 Maxbias of Quadrant and Huberized Correlation Coefficients with Known Locations and Scales 110 4.7.2 Maxbias of Quadrant Correlation Coefficient with Unknown Loca-tions and Scales I l l 4.7.3 Maxbias of Huberized Correlation Coefficients with Unknown Lo-cations and Scales 115 4.7.4 Maxbias Comparison of the Correlation Coefficients for Stahel-Donoho and F M C D to Huber 124 4.8 Application Examples to Real Data 136 4.8.1 Glass Data 136 4.8.2 KDD-CUP-98 P V A Donations Data 139 4.8.3 Daily Pressure in Northern Hemisphere Data 143 4.9 Chapter Appendix . . . '. 147 4.9.1 Proof of Theorem 4.1 147 4.9.2 Proof of Theorem 4.2 153 4.9.3 Proof of Theorem 4.3 160 4.9.4 Proof of Theorem 4.4 164 5 Robust Estimation of Multivariate Location 168 5.1 Classical Multivariate Location Estimate 168 5.2 Robust Multivariate Location Estimates 170 5.3 Coordinate-wise Location Estimates 171 5.4 Bias-Robustness Properties of Coordinate-wise Median 172 vi 5.5 Chapter Appendix 175 5.5.1 Proof of Theorem 5.1 175 5.5.2 Proof of Theorem 5.2 181 6 Conclusion 184 Bibliography 187 vii List of Tables 3.1 A Numerical Example of Increased Percentage of Contamination in a Lin-ear Combination of Variables 61 3.2 Bivariate Distribution of B\ and B2 62 3.3 Bivariate Distribution of B1 and B2, Independent Case 63 3.4 Bivariate Distribution of B\ and B2, Perfect Correlation 63 3.5 Bivariate Distribution of B\ and B2, Perfect Rejection 63 3.6 Percentage of Rows with at Least One Contaminated Entry when each Variable Independently 5% Contaminated (e = .05) in the Independent-Contamination Model 71 4.1 Performance of Pairwise Huberized (c = 0) and Fast M C D Covariance Estimates for Data Sets with p = 10 84 4.2 Performance of Pairwise Huberized (c = 0) and Fast M C D Covariance Estimates for Data Sets with p = 20 84 4.3 Performance of Pairwise Huberized (c = 0) and Fast M C D Covariance Estimates for Data Sets with p = 30 85 4.4 Evaluation of the Asymptotic Standard Errors of the Huberized Correla-tion Coefficient Estimates with c = 1.00 92 4.5 Evaluation of the Asymptotic Standard Errors of the Huberized Correla-tion Coefficient Estimates with c = 1.25 93 4.6 Evaluation of the Asymptotic Standard Errors of the Huberized Correla-tion Coefficient Estimates with c = 1.50 93 viii 4.7 Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.1 128 4.8 Maximum Bias of Corrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.1 129 4.9 Maximum Bias of Uncorrected Quadrant and Huberized Correlation Co-efficient Estimates, p = 0.1 129 4.10 Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.5 130 4.11 Maximum Bias of Corrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.5 131 4.12 Maximum Bias of Uncorrected Quadrant and Huberized Correlation Co-efficient Estimates, p = 0.5 131 4.13 Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.9 132 4.14 Maximum Bias of Corrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.9 133 4.15 Maximum Bias of Uncorrected Quadrant and Huberized Correlation Co-efficient Estimates, p = 0.9 133 4.16 Maximum and Minimum Values of Fast M C D Correlation Coefficient Es-timates 134 4.17 Maximum Bias of Fast M C D Correlation Coefficient Estimates for Differ-ent Correlation Coefficients, p 134 4.18 Maximum and Minimum Values of Stahel-Donoho Correlation Coefficient Estimates 135 4.19 Maximum Bias of Stahel-Donoho Correlation Coefficient Estimates for Dif-ferent Correlation Coefficients, p 135 ix 4.20 First 20 Proportions of Variation for the Classical and the Pairwise Hu-berized (with c = 1) Covariance Estimates x List of Figures 1.1 Woodmod 5-D Data with Outliers 3 1.2 Classical and Robust Correlations and Mahalanobis Distances for Wood-mod Data 4 3.1 Hertzsprung-Russell Diagram of the Star Cluster C Y G O B I 48 3.2 Body and Brain Weight Data Set 48 3.3 Gesell Adaptive Score versus Age at First Word 49 3.4 Advertising Yield versus Spending 50 3.5 Scatterplot of Cigarette Consumption versus Lung Cancer 51 3.6 Scatterplot of Cigarette Consumption versus Leukemia Death Rates. . . 52 3.7 Air Pollution and Mortality Data Set 53 3.8 Salinity Data Set 54 3.9 Wages and Hours Data Set 55 4.1 Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination pQ = 100 and p — 10. . 80 4.2 Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination p0 = 100 and p = 30. . 81 4.3 Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination p0 = 100 and p = 50. . 82 4.4 Performance of Fast. M C D Covariance Estimates using Eigenvalue Metric, for Data Sets with Sizes of Contamination p0 = 5,10,100 and p — 10, 20,30. 83 4.5 Intrinsic Bias of Huberized Correlation Coefficient Estimates 97 xi 4.6 Intrinsic Bias of Huberized Correlation Coefficient Estimates 98 4.7 Scalability of Dimensions for the Covariance Estimates obtained using QC and G K for n = 5p 103 4.8 Scalability of Sample Sizes for the Covariance Estimates obtained using QC and G K for p = 50 ^ . . 104 4.9 C P U Time of the Covariance Estimates obtained using F M C D , QC and G K for Clean Data 106 4.10 C P U Time of the Covariance Estimates obtained using F M C D , QC and G K for Clean Data with Larger Sample Sizes 107 4.11 C P U Time of the Covariance Estimates obtained using F M C D , QC and G K for 20% Contaminated Data 108 4.12 C P U Time of the Covariance Estimates obtained using F M C D , QC and G K for 20% Contaminated Data with Larger Sample Sizes 109 4.13 Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p = 0.10 120 4.14 Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p — 0.50 121 4.15 Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p = 0.90 122 4.16 Maxbias of Corrected Quadrant and Huberized Correlation Coefficient Es-timates, e = 0.20 123 4.17 Maxbias Comparison of Uncorrected Huberized Correlation Coefficient Es-timates (c = 1) with SD, F M C D Correlation Coefficient Estimates. . . . 127 4.18 Histogram of the Robust Mahalanobis Distances for the Glass Data. . . . 137 4.19 Pairwise Plots for the Data Set Glass 138 4.20 Histogram of the Robust Mahalanobis Distances using a Threshold of 200 for the Glass Data 139 xii 4.21 Histogram of the Robust Mahalanobis Distances for the P V A Data. . . . 140 4.22 Differences between Classical and Robust Correlation Coefficients for the P V A Data 141 4.23 Histogram of the Robust Mahalanobis Distances for the Modified P V A Data with Outliers 142 4.24 Differences between Classical and Robust Correlation Coefficients for the Modified P V A Data 142 4.25 Proportions of Variation for the Classical Covariance Estimates 146 4.26 Proportions of Variation for the Pairwise Huberized (with c = 1) Covari-ance Estimates 146 xiii Acknowledgements Praise God the Almighty for being merciful unto me and blessing me with his grace. Many people helped me during the years I spent at U B C . First of all, my gratitude goes to my supervisor, Dr. Ruben Zamar, for directing the course of my studies. The work on this thesis was carried out under his expert direction and supervision. Without his guidance and support, this thesis could not have been completed. I have also been fortunate to have Dr. Paul Gustafson, Dr. Harry Joe, Dr. Ray-mond Ng and Dr. Bruno Zumbo on my PhD committee. Their input has added strength to this thesis. Thank you also to my university examiners (Dr. Laks Lakshmanan and Dr. Bertrand Clarke), my external examiner (Dr. Christopher Field) and the chair of my oral defence (Dr. Keith Head) for their comments and analysis. Dr. Edwin Perkins de-serves special acknowledgement, since on many occasions he provided me with invaluable suggestions and advice. I would like to acknowledge Kuwait university for providing me with the financial assistance needed to pursue my postgraduate studies in the form of a postgraduate schol-arship. I feel a deep sense of gratitude for my father A l i Alqallaf and my mother Parween Taqi who formed part of my vision and taught me the good things that really matter in life. I am grateful for all my family back home who always believed in us and unconditionally supported us all these years. Special thanks to my brothers and sisters (Mohsen, Mo-hammad, Hani, Yousef, Hussein, Liala, Najlaa, Amna and Zainab). Of course, I would xiv like to thank my two brothers in law AbdulReda and Khaled. Other people who have helped me in different ways are Lindsey Turner for being such a good friend and colleague, Isabella Ghement who was always ready for a chat, Christine Graham and Lee Tran (what would we do without them?) and last but not least the entire Statistics Department for making me feel so at home. I thank the Institute of Applied Mathematics for providing the necessary resources and to those people within the department for their friendship and interesting discussions. Thank you to the director Dr. Bernie Shizgal and to Dr. Roman Baranowski, the former research/IT manager. I owe much more than what I can express here to my friends; Rosalia Aguirre-Hernandez and Alberto Molina-Escobar. Finally, no words can express my gratitude to my husband Mohammad Bahzad whose patient love enabled me to complete this work. FATEMAH ALI ALQALLAF The University of British Columbia April 2003 xv I dedicate this thesis to my parents A l i Alqallaf and Parween Taqi, and to my husband Mohammad Bahzad—an outlier. xvi Chapter 1 Introduction It is desirable to develop methods for extracting reliable and useful information from large high-dimensional data sets of mixed quality. Our thesis is that the existing contamination models used to represent high-dimensional data of mixed quality are not completely satisfactory. Therefore we propose a new contamination model and robust estimation procedures which are feasible/scalable to higher dimensions and appear to work relatively well regardless of the size, dimension and quality of the data set. The focus of our work is on the robust estimation of multivariate location and scat-ter matrices (i.e. covariance and correlation matrices). These quantities are of great importance as they form the underpinnings of linear estimation theory. We begin by discussing the role of robust estimations in statistics, and by providing some motivation for our work. Then, we provide a problem statement, a list of thesis contributions, and an outline of this thesis. 1.1 Robust Estimates Statistics are extracted from data sets to infer properties of their underlying source dis-tribution. Usually, these statistics are estimates of the parameters of the distribution. In the derivation of an estimate, modeling assumptions are made about the source dis-tribution, e.g., i.i.d. (independent and identically distributed) data points or restriction to a particular parametric family of distributions. Those estimates which offer better performance usually make strict assumptions on the data; however, when these assump-1 tions are invalid, the quality of the estimates can be quite poor. One aspect of robust statistics is to address the scenario where most, but not all, of the data points are drawn i.i.d. from a particular distribution. We wish to characterize this distribution. For ex-ample, consider the sample mean and median as estimates of the mean of a Gaussian distribution. The sample mean is the minimum variance estimate in this case, but it is not robust as it can be made arbitrarily bad by corrupting a single data point. The median, on the other hand, is very robust as at least 50% of the data points have to be corrupted to make the estimate arbitrarily bad; however, this robustness comes at the price of a significantly higher variance on the estimate. This example illustrates a fundamental trade-off, resistance versus efficiency. This leads to the primary objective in robust statistics the search for estimates which are not only resistant to model de-viations but also perform well under the correct model. It should be emphasized that one should not infer that the ultimate goal of robust statistics is to ignore outlying data points. Such a naive use of robust statistics could waste possible information contained in the outliers themselves. Robust estimates should merely reflect the bulk of the data points. Nonetheless, possessing a robust estimate often makes it easier to detect outliers which tend to be hidden in non-robust statistics. These outliers can then be separately analyzed for their own structure and information. In the following section, we illustrate the dramatic effects that outliers can have on non-robust estimates. 1.1.1 Applications and Uses of Robust Estimates Covariance and correlation matrices estimated from data sets are used for a variety of purposes. For example, pairwise sample correlation coefficients are often examined in an exploratory data analysis (EDA) stage to determine which variables are highly correlated with one another. Estimated covariance matrices are used as the basis for computing principal components for both general principal components analysis (PCA), 2 Figure 1.1: Woodmod 5-D Data with Outliers. and for manual or automatic dimensionality reduction and variable selection. Estimated covariance matrices are also the basis for detecting multidimensional outliers through computation of the so-called Mahalanobis distances of the cases (rows) of a data set. Unfortunately, the classical sample covariance and correlation matrix estimates, mo-tivated by either Gaussian maximum likelihood or simple method of moments principles, are very sensitive to the presence of multidimensional outliers. Even a small fraction of outliers can distort these classical estimates to the extent that the estimates are very misleading, and virtually useless in any of the above applications. To cope with the prob-lem of outliers, statisticians have invented robust methods that are not much influenced by outliers for a wide range of problems, including estimation of covariance and correla-tion matrices. We illustrate the extent to which outliers can distort classical correlation matrix estimates and the value of having a robust correlation matrix estimate with the small five-dimensional data set example illustrated in Figures 1.1 - 1.2 Figure 1.1 shows all pairwise scatter plots of the 5-dimensional data set called "Wood-mod" . This data set clearly has at least several multidimensional outliers that show up as s s (a) Classical and Robust (b) Classical and Robust Maha-Correlations. lanobis Distances with Square-Root 95% Chi-Squared Thresh-old. Figure 1.2: Classical and Robust Correlations and Mahalanobis Distances for Woodmod Data. a cluster in several of the scatterplots. Note that while these outliers are clearly outliers in two-dimensional space, they are not univariate outliers, i.e., they do not show up as well-detached outliers in any of the variables. Figure 1.2(a) shows the result of computing all pairwise classical correlations by both the classical method (sample correlation coeffi-cients) and a particular robust method known as the Fast M C D (FMCD) . The lower left triangle of values shows both the classical and robust correlation coefficient estimates, while the upper right triangle of ellipses visually represent the contours of a bivariate Gaussian density with zero means, unity variances, and correlation coefficients given by the classical and robust correlation coefficient estimates. A nearly circular ellipse indi-cates an estimated correlation coefficient of nearly zero. A narrow ellipse with its major axis oriented along the +45 degree (-45 degree) direction indicates a large positive (neg-ative) estimated correlation coefficient. From the visual representation, we immediately see differences between the classical and robust correlations, sometimes very substantial differences, including changes of sign. For example, the classical correlation between V4 4 and V5 is -.24 whereas the robust correlation is +.65. The latter is quite consistent with what we might expect if we deleted the small cluster of outliers occurring in the scatterplot of V4 versus V5 in Figure 1.1. A common way of detecting multidimensional outliers is to use the classical Maha-lanobis distance: d(x i) = ( x i - A ) ' C - 1 ( x i - A ) - (1-1) In the above expression X j is the i-th data vector of dimension p (the transpose of the i-th row of the data set), p, is the vector of sample means of the variables (columns) of the data set, and C is the usual sample covariance matrix estimate. Under the assumption that the data is multivariate normal and that we use known values /J, and C in place of the above estimates, the d(xj) would have chi-squared distribution with p degrees of freedom. With reasonably large sample sizes, the sample mean vector and sample covariance matrix will be close to their true values, and it is common practice to use the square-root of chi-squared (with p degrees of freedom) percent point such as .95 or .99 as a threshold to compare the square-root of d(xj) with, and declare X ; to be an outlier if it exceeds this threshold. If we follow this classical approach for the Woodmod data of Figure 1.1, we get the results in the right-hand panel of Figure 1.2(b). The horizontal dashed line is the square-root of the 95% point of a chi-squared distribution with 5 degrees of freedom. Clearly, no points are declared outliers by the classical Mahalanobis distance approach. This is because the outliers have distorted the classical C so much that it does not produce reliable Mahalanobis distances. On the other hand, the left-hand panel of Figure 1.2(b), based on a robust C (and a robust p) result in detection of not only the cluster of four very large outliers evident on the scatterplots of Figure 1.1, but also three additional moderate-sized outliers. The above example serves to vividly illustrate the inadequacy of classical correla-tion and covariance matrices in the presence of outliers and the valuable role of robust alternatives. 5 1 . 1 . 2 Robust Proposals of Scatter Estimates Statistical literature contains a substantial number of papers proposing and studying the properties of robust scatter matrix estimation. A n important early approach was that of M-estimates, first suggested by Hampel (1973), and studied by Maronna (1976) and Huber (1977, 1981). These estimates are positive definite, affine equivariant and relatively easy to compute, but have as a substantial limitation the fact that their breakdown point (BP) - i.e., the maximum proportion of outliers that the estimate can safely tolerate -is at most 1/p where p is the dimension of the data. This is not satisfactory, because it means that the breakdown -point becomes smaller with increasing dimension, where there are more opportunities for outliers to occur. Subsequently, there has been considerable emphasis on obtaining positive definite, affine equivariant estimates with a high breakdown point, namely a breakdown point of one-half. The best known is probably the minimum volume ellipsoid (MVE) estimate introduced by Rousseeuw (1984) and discussed by Rousseeuw and Leroy (1987) and Rousseeuw and Van Zomeren (1990). It consists of taking as location estimate the center of the smallest regular ellipsoid containing half the points of the data set. The scatter estimate is then defined by the shape matrix of that ellipsoid. However, Davies (1992) showed that the M V E estimate is not y/n consistent, making it less attractive for effi-ciency reasons. The M V E estimate has also been generalized to multivariate S-estimates (Davies, 1987; Lopuhaa, 1989; Lopuhaa and Rousseeuw, 1991). Rousseeuw (1985) intro-duced the minimum covariance determinant (MCD) estimate which has the normal rate of convergence. The M C D location and scatter estimates are the average and covariance matrix computed on that half of the data which attain the smallest determinant of their covariance matrix. Croux and Haesbroeck (1999) showed that M C D is more efficient than M V E in high-dimensions, and therefore recommend the use of the M C D . Another important class of affine equivariant high breakdown point estimates are those based on projections: the Stahel-Donoho (SD) estimate proposed by Stahel (1981) 6 and Donoho (1982) and studied by Maronna and Yohai (1995); P-estimates (Maronna, Stahel and Yohai, 1992); and a recent proposal by Pena and Prieto (2001). 1.2 Problem, Motivation and Approach Exact computation of robust estimates is feasible only for small data sets. An alternate remedy for large data sets is the approximate computation which is usually based on subsampling. The algorithm of subsampling is composed of taking a number Ns of subsamples, generally of size p + 1, to obtain an initial set of solutions, which are the starting point for the search for a (hopefully global) extremum. Ruppert (1992) developed a heuristic procedure for S-estimates. Even though the subsampling algorithms tend to lessen the computational burden for robust estimation, the numerical complexity of subsampling algorithms becomes critical for high-dimensional data sets. In order to ensure a given breakdown point, the value of Ns must increase exponentially with p. A high enough value of Ns is also necessary to ensure stability of the result. In general, the subsampling methods are feasible for moderate p, but computing them for large p in a reasonable time requires using values of Ns which imply giving up a high breakdown point. Woodruff and Rocke (1993, 1994) proposed procedures to deal with this problem. Rousseeuw and Van Driesen (1999) proposed the Fast MCD (FMCD), a procedure much more effective than naive subsampling for minimizing the objective function of the MCD, which seems capable of yielding "good" solutions without requiring huge values of Ns. But FMCD still requires substantial running time for large p. Recently Pena and Prieto (2001) proposed a fast algorithm based on the Kurtosis of projections, which does not require subsampling. However, the main drawback remains the lack of feasible methods to compute the estimates for large high-dimensional data sets. Much faster estimates with high breakdown points can be computed if one is willing to drop the requirements of positive definiteness and affine equivariance. Early proposals of robust procedures are of this type, see Bickel (1964) and Sen and Puri (1971). A 7 straightforward approach for multivariate location is to simply calculate a robust location estimate for each individual variable. In the case of multivariate scatter, one can similarly apply a robust covariance or correlation coefficient estimate to each pair of variables. Estimates of this type are called coordinate-wise and pairwise. There are many proposals for robust univariate location estimates (see for example Hampel et al., 1986). Many researchers obtain several multivariate versions of typi-cally univariate notions such as medians, L-estimates and R-estimates. The multivariate medians are known as the spatial median (also called mediancenter or Li-median), the Tukey or halfspace median, the Oja median and the Liu or simplicial median; proposed respectively by Haldane (1948), Tukey (1975), Oja (1983) and Liu (1990). There are also several proposals for the robust estimation of covariance or correlation of a pair of variables. The simplest methods are based on: (i) classical ranks, such as the Spearman's p and Kendall's r (see Abdullah, 1990); (ii) classical correlations applied after coordinate-wise outlier-insensitive transformations, such as the quadrant correlation (QC) and 1-D "Huberized" data (Huber, 1981, page 204); and (iii) bivariate outlier resistant methods such as the method proposed by Gnanadesikan and Kettenring (1972) and studied by Devlin, Gnanadesikan and Kettenring (1981). Unfortunately, the resulting multivariate location and scatter matrix estimates are not affine equivariant and the scatter matrix is not guaranteed to be positive definite. Rousseeuw and Molenberghs (1993) proposed several methods to deal with the problem of negative eigenvalues. Note that, although the scatter matrices obtained by approaches (i) and (ii) are positive definite, they require a correction to make them consistent for normal data, and the correction destroys their positive definiteness. In recognition of this opportunity, Maronna and Zamar (2002) recently proposed a new method based on a modification of approach (iii) that preserves positive definite-ness and has an "almost affine equivariant" property. However, the particular pairwise estimate they used is not nearly as fast as one might like. 8 In this thesis, we follow in a similar spirt of Maronna and Zamar (2002). We consider the use of the quadrant correlation and Huberized estimates of approach (ii) above, which are very transparent in the way they work, and enable fast scalable computation for large data applications, with complexity 0(n) -0(p2) for the resultingpxp covariance or correlation matrix. 1.3 Contributions and Outline of the Thesis Most multivariate contamination models for numeric data proposed to date (see for example, Hampel et al. 1986) assume that the majority of the observations comes from a nominal distribution such as a multivariate normal distribution, while the remainder comes from another multivariate distribution that generates outliers. The inadequacy of the exiting contamination models for large multivariate data sets has been pointed out by many researchers including Rey (2001) and Zamar and Alqallaf (2001). The contributions of this thesis are as follows: • We introduce a flexible multivariate contamination model which allows for different types of contamination in the data. We show several real data examples from the literature suggesting the need for a more general and flexible model. We argue that the proposed model is more appropriate than the existing ones because it is more flexible and better describes different types of possible correlation structures that can occur in practice. We incorporate this feature in our model by allowing its different components to be correlated. Affine equivariant multivariate location and scatter estimates do not scale well for large sample sizes and highly dimensional data sets. In addition, the new contamina-tion model reveals the possible lack of robustness of these estimates, and suggests that the coordinate-wise and pairwise approaches may be useful to overcome some of the robustness problems. 9 We study pairwise robust estimates of scatter matrices based on coordinate-wise robust transformations (the quadrant correlation and the coordinate-wise Huber-ized estimates). We assess the performance of the proposed pairwise estimates and compare them with the Fast MCD using Monte Carlo simulations and the new contamination model. We study the asymptotic properties (consistency and asymptotic normality) of the Huberized correlation coefficient estimates and obtain the mathematical expression for the asymptotic variance of these estimates. Using this expression, we construct an estimate for the variance of the estimated Huberized correlation coefficient. We also verify, using extensive Monte Carlo simulations, that estimated variances approximate well the finite sample variances of the correlation coefficient estimates. We distinguish between two kinds of bias in the quadrant and the Huberized cor-relation coefficient estimates due to the fraction of contamination and because of the structure of the estimates. We then show how to correct for the bias caused by the structure of the estimates. This correction has the drawback that the corrected correlation matrix may no longer be positive definite. We show that an improved scalability of the positive definite scatter matrix es-timate, proposed by Maronna and Zamar (2002), can be obtained by using the quadrant correlation coefficient estimates instead of the bivariate outlier resistant method (proposed by Gnanadesikan and Kettenring, 1972). We extend Huber's (1981) asymptotic maximum bias (maxbias) derivations of the Huberized correlation coefficient estimates to more general cases, where lo-cations and scales are unknown. In particular, we analytically derive the maxbias of the quadrant correlation coefficient and implement numerical computation of the maxbias of the Huberized correlation coefficients. 10 • We provide the minimaxity properties of the coordinate-wise median in the context of the new contamination model. • We give numerical evidence suggesting that affine equivariant estimates break down for high-dimensional data under the new contamination model. The rest of the thesis is organized as follows. Chapter 2 provides the background ma-terial on robust estimation in the univariate setting. We present the robust estimation of multivariate location and covariance, in which we generalize familiar concepts in the uni-variate case and discuss the difficulties that occur in the transition to higher dimensions. We describe three different types of multivariate location and covariance estimates; the M-estimates, the S-estimates and the MCD-estimate. In chapter 3, we introduce the new class of multivariate contamination model and show that it is more appropriate than the existing contamination models. In particular, we give some real data examples from the literature to illustrate that. We study the different correlation structures that the contamination indicators of the variables may have in the contamination model. We also illustrate the dependency situations among the components of the contamination model. In chapter 4, we present the pairwise robust estimates of scatter matrices . We report the results of Monte Carlo studies that assess the performance of the pairwise estimates and compare with the Fast MCD. We study the asymptotic properties (consistency and asymptotic normality) of the Huberized correlation coefficient estimates. We present the asymptotic maximum bias of the Huberized correlation coefficient estimates. Finally, we illustrate the implementation of the quadrant and Huberized correlation coefficient estimates on three real data sets. We show that the proposed methods are capable of computing robust location and covariance estimates and detecting multidimensional outliers on arbitrarily large data sets. Chapter 5 discusses the coordinate-wise robust multivariate location estimates. We 11 study the coordinate-wise median estimate, in which we show its minimaxity properties under the new contamination model. Chapter 6 contains a brief list of the results obtained in this thesis, the challenges that remain to be solved and the directions we foresee for future work. To facilitate the reading of the thesis, some of the proofs for the results presented will be relegated to the chapter appendix. 12 Chapter 2 Background and Related Work In this chapter, we review robust estimation techniques specifically tailored to estimating multivariate location and scatter. Particular attention will be paid to three methods: M-estimates, S-estimates and the minimum covariance determinant (MCD) estimate. The intuition behind these estimates is to find the location and covariance estimate by trying to simultaneously identify and down-weight outliers in the estimates; although, they do so in different ways. These methods and their attributes will be discussed in this chapter. To briefly introduce these estimates, let (x\,..., xn) € W denote a collection of data points. The majority of them are assumed to be i.i.d. from a distribution whose mean fj, and covariance £ we wish to estimate, but some of the data points are drawn from another unknown and arbitrary distribution. The estimates of /x and S will be denoted as t and C , respectively. The particular estimate to which it corresponds will be made clear from the context. An M-estimate (t, C) of (fi, S) is obtained as the solution to the system of equations 1 " -y2vi(di)(xi-t) = 0; i=i 1 n -YJV2{d2)(xi-t){xl-t)T = C, where d2 — d(xj, t; C)2 = (xi — t)'C~1(xi — t) and Vi(-) and v2{-) are weighting functions that control the influence of points that are distant (with respect to C _ 1 ) from t. If vi(-) = v2(-) = 1, then t and C are the sample mean and covariance. By taking v\(-) 13 and i>2(-) to be decreasing functions, we can reduce the effects of outliers based upon how different (in terms of second order statistics) they are from the rest of the data. A n S-estimate (t, C) of (//, S ) is obtained as the solution to the optimization problem over the set G = W x PDS(p), where PDS(p) is the set of all positive definite symmetric p x p matrices, min {|C|}; (t,C)£]RPxPDS(p) such that 1 n ~y2p(di) = bo, i= i where p(-) is a monotonically increasing function and bo is an appropriately defined con-stant. In the case of p(d) = d2, the optimization constraint says that the sum of the squared Mahalanobis distances is constant, i.e. the likelihood, is held constant under the assumption of Gaussian data. Not too surprisingly, this is equivalent to least squares esti-mation and results in the sample mean and covariance as its estimates. Now, in choosing a p(-) which rises slower than quadratically, we can relax the weighting associated with points distant from the center of the ellipsoid, lowering the influence of outliers. The M C D estimate is very intuitive. It does not involve solving a system of nonlinear equations nor a nonlinear optimization problem as above; but instead, it finds the subset of h(n/2 < h < n) points that are most tightly clustered and bases its estimate on that subset. In particular, out of all subsets of size h, it finds the one whose sample covariance determinant is minimal, and takes the M C D estimate (t, C) of (pi, S ) as the sample mean and covariance of this subset. The remainder of this chapter is structured as follows. Section 2.1 provides back-ground material on robust estimation in the univariate setting. Section 2.2 provides the introduction to robust estimation of multivariate location and scatter generalizing famil-iar concepts in the univariate case and discussing difficulties in the transition to higher 14 dimensions. Section 2.3 discusses the M-estimates in the multivariate context. Sec-tion 2.4 describes the S-estimates. Section 2.5 discusses the MCD estimate. The chapter concludes with Section 2.6 which consists of a summary of the material and a closing comment on the direction of robust estimation of multivariate location and covariance. 2.1 Univariate Robust Statistics In this section, we provide a brief introduction to robust statistics in the univariate setting. Naturally, such a vast field cannot be summarized in a section of a chapter, so we only cover the concepts with which we will be directly interested in the multivariate case. For further details see Hampel et al. (1986) and Huber (1981). To start, we consider to what kind of model deviations we wish to be robust. For this, we adopt the popular e-replacement model, which is also commonly called the Gross Error Model (GEM). Under this model, i.i.d. points (xi,x2,...) € K are drawn from the distribution Ge(x) = {l-e)Fe(x)+eH(x), (2.1) where Fg(x) is a strict parametric model parameterized by 9 and H(x) is an arbitrary distribution. The intuition behind this model is that (1 — e) of the sample arise from the parametric model, but e of the sample have been replaced by points from an arbitrary distribution. The derived statistic should estimate the parameter 9. A statistic based on a sam-ple set of size n will be denoted by Tn = Tn(xi,... ,xn), Tn : W1 \-> E. We will only consider statistics which generalize to statistical Junctionals which are mappings T(G) : domain(T) ^ R with the property that T(Gn) = Tn T(G), where Gn is the empirical distribution of the data. A desirable property of statistical functions is Fisher consistency, i.e. T(Fg) = 9 so that under the exact parametric distribution, the estimate returns the correct parameter value. Note that this definition of Fisher consistency of 15 functionals encompasses the more common requirement in estimation theory that Tn —> 6 for X i , . . . , xn ~ i.i.d. FQ. 2 . 1 . 1 Influence Function We follow the statistics literature in using the term efficiency to relate to having a small variance. Note however that it does not have to do with achieving the Cramer-Rao lower bound. A fundamental concept in measuring robustness and efficiency of a statistical functional T is its influence function (IF). The IF of T at a distribution G is given by I F ( l ; T , G ) = l i m I « l ^ ± ^ I M , ( 2. 2 ) for those x where the limit exists, where represents point measure of 1 at x. The IF can be interpreted as a directional derivative of T in the space of distributions. So intuitively, an estimate T with a "nice" smooth W(x\ T, G) should be robust as slight changes to the underlying distribution result in slight changes to the estimate. By applying a von Mises expansion, which resembles a Taylor expansion for functionals, of T at a Gn "near" G, we get Tn = T(Gn) = T(G) + J IF(x; T, G) d{Gn - G)(x) + remainder Tn = T(G) + jIF(x;T, G) dGn(x) + remainder 1 n y/n(Tn-T(G)) = —j= ]jP IF(XJ; T, G) + remainder, v n i=l where the second line follows because j lF(x; T, G) d(G) = 0 as a property of von Mises functionals. Now, if the "remainder" is negligible, which it is for many statistics, then by the Central Limit Theorem, the error asymptotically becomes a zero mean Gaussian with the variance determined by the IF, i.e. v^(T„ - T(G)) A N(0, AV(T, G)), where AV(T,G) = jTF(x;T,G)2 dG(x) (2.3) is the asymptotic variance. Of course, the above derivation is not precise, but it is meant to help provide insight to the IF and a rigorous derivation can be found in Fernholz (1983). 16 2 . 1 . 2 Robustness Measures Equation (2.3) illustrates the importance of the IF in characterizing the efficiency of an estimate, however it has equally important role in characterizing robustness. Recall the IF's interpretation as a directional derivative in the space of distributions. Thus it char-acterizes the sensitivity of the statistic to slight deviations from the model distribution. This motivates the definition the gross error sensitivity (GES) for T at Fg as 7*=sup|IF(x;T,F 0)|. (2.4) X The GES thus measures the worst influence an infinitesimally small fraction of contam-ination can have on the estimate, i.e. it is an upper bound for the standardized bias induced by the contaminated distribution (what we are calling bias here is not a bias in the standard sense of the word, but a bias attributed to a change in the underlying distribution). For this reason, we say that an estimate T at Fg is B-robust if 7* is finite, where the B stands for bias. An estimate is said to be most B-robust if it achieves the minimal GES over all Fisher consistent estimates. As is frequently the case in nature, ro-bustness and efficiency are conflicting goals. Thus, robust statistics will frequently search for optimally B-robust estimates which minimize asymptotic variance given a bound on the GES. From its nature as a derivative, the IF (and thus GES) is only a local characterization of robustness in a small neighborhood of the model distribution Fg. To provide a global measure of robustness, the concept of the breakdown point was developed. The finite-sample replacement breakdown point e* of an estimate Tn at the sample (xi,... ,xn) is defined as m e; = min < — max sup \Tn(zi,... ,zn) - Tn(xi,... ,xn)\ = 00 } , (2.5) 11 '•••'lm y\,—,ym where (zi,..., zn) is obtained by replacing the m data points Xu,..., xim with arbitrary values yi,...,ym. What this definition says is that the breakdown points is the smallest 17 possible fraction of points which must be corrupted to take the estimate across all bounds. Thus, a non-robust estimate, in the sense of breakdown point, has e* = 1/n. Note that there is another prevalent definition of the finite-sample breakdown point in the literature, particularly in the univariate setting for which The difference between the two is minor in that e* is the smallest fraction of replace-ments which can cause the estimate to become unbounded; e* is the largest fraction of replacements that the estimate can tolerate while still guaranteed to be bounded. Thus, their relation is e* = e* + 1/n. For the remainder of this chapter, we will only consider the initial definition given in equation (2.5); however, in reading other papers on this subject, it is important to differentiate between the two. Even though the definition in equation (2.5) uses (x±,..., xn), the breakdown point almost never depends on the points for interesting estimates. There also exists a definition of breakdown point based on dis-tributions instead of points, but it is considerably more involved and conveys the same idea as equation (2.5), see Huber (1981). This distribution breakdown point is denoted as e* and for reasonable estimates e* —> e*. The concept of gross error sensitivity measures the maximum effect that an infinitesi-mal amount of point-mass contamination can have on a functional. A stronger robustness concept is to measure the maximum effect or bias that any type of contamination can have on a functional. Define the contamination neighborhood of F for a given fraction of contamination e (0 < e < 1). The maximum contamination bias function is defined to be e* = max < — max sup |T„(zi Ft = {G : G = (1 - e)F + eH; H any distribution}, > ; T , F ) = sup \T(G)-T(F)\. (2.6) Ge^e 18 The maximum bias function is related to the breakdown point, which is a measure of global robustness, as well as to the contamination sensitivity, which is a measure of local robustness. The breakdown point of T at F over contamination neighborhood is defined to be Under certain regularity conditions, the contamination sensitivity and the gross error sensitivity are equal, see Hampel et al. (1986) for further discussion. In general, though, it readily follows that There are other common measures of robustness such as qualitative robustness, con-tinuity of a statistical functional, local-shift sensitivity, and rejection point. These will not be discussed further, but the interested reader should see Hampel et al. (1986). 2.1.3 M-Estimates for Location Having introduced some basic concepts in robustness theory, we now illustrate them with the well established M-estimate for univariate location. In addition to clarifying the concepts introduced above, it may help establish intuition for its extension to the multivariate setting in Section 2.3. The popular maximum likelihood (ML) estimate chooses the statistic as the parameter value 6 which maximizes the likelihood of the sample data, i.e. e*(T,F) = inf{e > 0\B(e,T,F) — oo}, (2.7) and the contamination sensitivity is defined to be 7(T,F) = limsupB(e;T,F)/ e. (2.8) 7 ( T , F ) > sup{limsup\T(G(e,x)) -T(F)\/e} = j*(T,F). (2.9) Tn = arg max f§(xi) > = arg max e 19 ) where fg represents a member of a family of pdf's parameterized by 0. Unfortunately, M L estimation is frequently non-robust. In order to robustify that approach, Huber (1964) considered generalizing the objective function to a function p(x, 9) with derivative ip(x, 6) = dp{£e^ and proposed to calculate T„ = arg max 1^/9(^,^1 (2.10) as an estimate. Any solution of this will then solve n J2^Tn) = 0. (2.11) i= i Because for reasonable choices of p, a solution of equation (2.1.1) will solve equation (2.10), a solution of either equation (2.10) or equation (2.11) is called an M-estimate, where the M comes from "generalized Maximum likelihood". Because it is frequently simpler to work directly with tp, little use will be made of p and we will associate ip with the M-estimate it defines. Extending equation (2.11) to a statistical functional, we get that an M-estimate is a solution of the equation / ij)(x,T(G)) dG{x) = 0. (2.12) Applying this to the contaminated distribution Gt,x = (1 — t)F + tAx = F + t(Ax — F), differentiating with respect to t, and taking the limit t —> 0, we get 0 - JiP(y,T(F + t[Ax-F})) d(F + t[Ax - F])(y) 0 = J^(y, T(F)) d(Ax - F)(y) + ^ [ T ( G „ x)}t=0 J ^ ( y , B)]T{F) dF(y) 0 = j i>(y,T(F)) dAx(y) + W(x;T,F) J ^[VM)]r(F) dF(y) IF(x; A F) = W(x; T, F) = d f (*' J ( F ) ) (2.13) 20 where the denominator is assumed nonzero. Thus, for an M-estimate, one can straight-forwardly calculate the IF and its derived quantities such as asymptotic variance and gross-error sensitivity from tb. Hampel ( 1 9 6 8 ) derives an optimal M-estimate. He considers that ip which minimizes the asymptotic variance given a constraint on the GES. His result essentially states that under certain regularity conditions on the set of allowable distributions, the minimum variance M-estimate, subject to a bound on the GES, is that for which the ip function is taken to be a vertical shift and "clipping" of the maximum likelihood score function s(x,9*) = •^g[\n(fg(x))]gt,. More precisely, for 0 * £ 0 a convex set T H E O R E M 2 . 1 Assume that • s(x, #*) exists for all x; • J s(x,9*) dFgt = 0 (a regularity condition); • the Fisher information J(Fgt) = J s(x,9*)2 dFgt satisfies 0 < J{Fet) < oo. Then for any b > 0, 3 a G K such that ibr(x) = [S(x,0*)-a}b_b ( 2 . 1 4 ) satisfies f tp dFgt 0 and d = Ji>(y)s(y,6*) dFgt(y) > 0. This vb minimizes the asymptotic variance J>2(y) dFeSy) U^(y)s(y,9*) dFeMf among all ip satisfying 2 1 and ip{x) sup < c = 2' \Sl>(v)s(y,8.)F,.(y)\ where [-]b_b — max{—b, min{-, b}} denotes clipping the function to the range [—b,b]. , Thus, one can control the degrees of B-robustness (7* = b/d) by varying the clipping level b. The shift a is necessary to maintain Fisher consistency, which for M-estimates simplifies to / V M . ) dFe,(y) = 0, V 0 , . Note that in the case of placing no bound on the GES, we get the M L estimate as expected. In the case of location estimation, we model Fe(x) = F(x — 9). Under this model, it is natural to only consider ip of the form ip(x,9) = ip(x — 9) with J^ipix) dF(x) = 0 for Fisher consistency. From equation (2.13), the IF can be written as mx-iPG) = ^ - ^ D 1-r \X, Wi G) _ f n y _ T { G ) ) d G { y ) which at the model distribution F becomes W(x- ib F) = ^ where the denominator is again assumed to be nonzero. Thus, the IF for M-estimates of location are proportional to ip, so up to a scaling factor, we can specify the IF through the definition of ip. For a symmetric model distribution F, it is natural to choose a ip function which is skew-symmetric, i.e. ip(—x) = —ip(x). If ip is also monotonic, then Hampel (1971) obtains the following results: 1. If -0 is bounded then the resulting estimate is B-robust and has a breakdown point of 1/2; 22 2. If ip is not bounded then the resulting estimate is not B-robust and has a breakdown The B-robustness can be seen from equation (2.4) and the fact that ip is proportional to the IF. The breakdown result can be seen by the following reasoning. If ip is bounded and monotonic then it must eventually approach a constant. Thus when a minority of the data samples grow arbitrarily large, the effect of each corrupted point on the sum in equation (2.11) is limited and is essentially the same as a much lower magnitude value, thus limiting the effect it can have. This point can be more visually appreciated with the example of an optimal M-estimate for location presented next. For an example, we apply the optimality result in Theorem 2.1 to location estimation. If F is the standard normal distribution, then the score function becomes s(x — 6) = x — 9 and because F is symmetric a — 0. This results in the optimal ip function known as Huber's ip function. In the case of 6 —^ co, we have ipH —> % and the estimate becomes However, when we clip ip to \—b,b], we limit the effect of a sample distant from Tn to be the same that of a sample at a distance b from Tn, i.e. Huber's ip function effectively draws in points that are further than b from Tn. Thus, arbitrarily large points can have only a limited effect. This "drawing in" of points is why bounded monotonic M-estimates have a breakdown point of 1/2. It should be pointed out that Huber's ip function does not correspond to an a-Winsorized mean or an a-trimmed mean both of which are not in the class of M-estimates (these fall into a class called L-estimates which will not be discussed here). point of 0. iPH(x-0,b) = [x-9}b_b (2.15) n n 0 = 4>H{xi - Tn, oo) = ^(xi - Tn) 23 Huber (1964) derives an optimal location M-estimate according to a different criterion than Hampel, but arises at the same answer. Huber's analysis was strictly for the case of univariate location estimation. Difficulties are encountered in trying to generalize it. Huber's approach does not utilize the IF as he believes that its interpretation of dealing with infinitesimally small contaminations is not consistent with the spirit of robust statistics where model deviations are more than just infinitesimally small. Instead, he considers a minimax condition on the variance over a neighborhood around the model distribution. Using a symmetrized G E M (Gross Error Model) neighborhood T = {G | G = (1 - e)F + eH, H symmetric}, he finds the saddle point (ipo,Go) that satisfies V(Vo, G) < V(ipQ, G0) < Vty, Go), V $ and G <= T. Under some regularity condition on the distribution and set of allowable distributions he shows that ip0 = [s(x — 0)]b_b, where b is determined from e and G0. Thus, the solution to Huber's minimax problem has the same form as Hampel's constrained minimization problem. L i and Zamar (1991) extended Huber's (1964) minimax result to the case when the scale parameter is unknown and must be estimated along with the location parameter. 2.2 Robust Estimation in the Multivariate Setting Having presented an introduction covering location estimation in the univariate case, we now proceed to the primary focus of this thesis which is the investigation of robust estimates of multivariate location and scatter. Most of the discussion in this section follows Lopuhaa, and Rousseeuw (1991), with a few exceptions which will be noted. For a set of points Xn = (xi,..., xn) with i j e ff, we wish to find robust estimates tn(Xn) € W of the mean and Cn(Xn) £ PDS(p) (set of positive definite symmetric pxp 24 matrices) of the covariance which describe the bulk of the data. The majority of the points are modelled as i.i.d. from an elliptical distribution F M > s with density / M > s s (x ) = I S I - ^ V ((* - f i f E ^ f * - fi)) (2.16) where <p : R+ —>• E+ is scaled so that a valid density is produced. We will consider some estimates which satisfy the property of affine equivariance which is defined as follows. DEFINITION 2 . 1 - Affine Equivariance - For any invertible p x p matrix A,v G W, and data set Xn G Wxn • A location estimate tn is affine equivariant iftn(AXn + v) — Atn(Xn) + v; • A covariance estimate Cn is affine equivariant if Cn(AXn + v) = ACn(Xn)AT, where AXn + v = (Ax1 +«,..., Axn + v), and AT is the transpose of A. We will also at times mention translation and coordinate-wise scale equivariant estimates of location, which are estimates that satisfy the weaker conditions, tn(Xn + v) = tn(Xn) +v; tn(DXn) = Dtn{Xn), for all diagonal p x p matrix D. The definition of breakdown point in equation (2.5) is primarily suited for univariate location estimation. Extending it to our broader context, the location breakdown point is defined as e*n{tn,Xn) = min { — 7~b max sup | | t n (^n) - tn(Zn)\\ = oo } (2-17) I1>-'Im yi,-,ym where as before Zn = ( z i , . . . , zn) is obtained by replacing the m data points Xn,..., xim with arbitrary values y1,... ,ym, and the interpretation is the same as for the scalar 25 case. The breakdown point for covariance is quite similar except that we have to account for the undesirable possibility of a singular covariance estimate (it is assumed that the covariance of the uncontaminated source distribution is full rank). Thus, protecting against eigenvalues approaching oo and 0, the breakdown point for covariance estimates is defined as where D(A,B) = max{\Xmax(A) -Xmax(B)\, \Xmin(A) 1 - \ m in(B) l\} with Xmax{A) and ^min(A) denoting the largest and smallest eigenvalues of A. Thus, covariance estimates are considered to be broken if they produce eigenvalues which are arbitrarily large or close to 0. In the univariate setting, the median and bounded M-estimates are examples of max-imally robust with respect to the breakdown point affine equivariant estimates with e* = [(n + l)/2]/n. A natural question is what happens to the breakdown point when the data dimensionality is increased. Lopuhaa and Rousseeuw (1991) addressed this is-sue. Let [•] be the greatest integer function, they show that if tn is translation equivariant, then Because affine equivariant estimates are a subclass of translation equivariant esti-mates, the above inequality holds for them as well. This result fits with ones intuition because any estimate which fits the majority of the data must fail if more than half of the data points can be arbitrarily corrupted. Unfortunately, the maximal breakdown point for an affine equivariant covariance estimate is slightly lower than that for a location estimate. Davies (1987) proves that for any affine equivariant covariance estimate Cn, [(n-p + l ) / 2 ] n 26 Thus, covariance estimate have a slightly smaller maximal breakdown point than location estimates. Lopuhaa and Rousseeuw (1991) suggested that a possible reason for this difference is that location estimate can only breakdown if the estimates can be made arbitrarily large, while covariance estimates can breakdown for eigenvalues tending to both 0 and co. In addition to the breakdown point, the other essential quantity that needs to be extended to the multivariate scenario is the IF. First note that the parameter of the distribution in our new setting is now the vector 9 = (fi, S ) G 0 = W x PDS(p). The extension of the IF is then straightforward and given by Lopuhaa (1989). D E F I N I T I O N 2.2 -Influence Function - consider a statistical functional T(-) mapping a set of distributions into the parameter space 0 and G G dom(T). The IF ofT at G is defined as I F ( X , T , G) = Hm T((l-t)G + tAx)-T(G) ^ if the limit exists for all x G W. For affine equivariant estimates and elliptical distributions, it is only necessary to deter-mine the IF under the spherically symmetric distribution F0j as the IF for any other set of valid parameters can be obtained by IF(a:; t, F M , S ) - A lF(A-\x - y)-t, F0J) (2.19) IF(x; C, F M i E ) = A IF (A~\x - t*);C, F0J) AT (2.20) where AAT = S . The formula for the asymptotic variance under some regularity conditions will gener-alize to AV(T; G) = J IF(x; T, G)lF{x; T, G)TdG{x). 27 We consider the multivariate location model to illustrate the maximum bias of mul-tivariate location. Let X = (X\,... ,XP) be a random vector with distribution Fp(x) = F0(x — fx), where F0 is symmetric around 0. To study the robustness property of the multivariate location we will consider a contamination neighborhood of the target dis-tribution. Given a fraction of contamination e > 0, the corresponding contamination neighborhood of F^ is denned by V e (F M ) = {F = (1- e)Flx + eF* : F* any distribution on W}. It is natural to require that an estimating functional T have the Fisher consistency property T(F,J) = pu. In general, given F E V e ( i ^ ) we will have T(F) ^ fj,. Then, we define the asymptotic bias of T in F by b(T,F,„) = ((T(F) - /x)'S^(T(F) - M ) ) 1 / 2 , (2.21) where YIF0 is a n affine equivariant scatter functional. The maximum asymptotic bias of an estimating functional T for fraction of contamina-tion e is defined by B(T,e,Fll)= sup b(T,F,fi). (2.22) For the univariate case (p = 1), the maximum bias of a location estimate T at an arbitrary distribution Go reduces to B{T, e, Go) = sup where cr(-) is a dispersion functional. If the functional T is equivariant, the maximum bias does not depend on fi, and can therefore be denoted by J5(T,e, F0). He and Simpson (1992) introduced the contamination sensitivity of an estimate T as e=0 28 T(G)-T(G0) a(G0) Observe that 7(T, F M ) = 7 ( T , F0) because of the invariance of the bias. For small e, the maximum bias can be approximated by B ( T , e , F M ) « e 7 ( T , ^ ) . (2.23) The contamination sensitivity 7 ( T , F M ) is closely related to Hampel's (1971) gross error sensitivity 7*(T, F M ) . In fact it is easy to show that always 7 ( T , F M ) > 7 * ( T , F „ ) , where l i m T ( ( l - e ) F M + e ^ c ) - T ( ^ ) and 5C stands for a point mass contamination. Under very general regularity conditions 7 * ( T , ^ ) = 7 ( T , F M ) . Huber (1964) proved that if F0 is a univariate symmetric distribution with unimodal density / 0 and F^ = F0(x — p), then the maximum bias of the median estimating func-tional TM is minimax among the affine equivariant estimates, i.e., if T is another affine equivariant estimating functional, then B(T,e,F,) > B(TM,e,F,) = F0-1 (^—^j = dl(e,F0), (2.24) where d\ stands for the maximum bias of the median i.e., the percentile-0.5 under con-tamination at infinity. He and Simpson (1993) obtained a lower bound for the maximum bias of equivariant estimates. Using this result Adrover and Yohai (2002) prove that d±(e,H0) (provided by Theorem 2.1 of He and Simpson, 1993) is a lower bound for any equivariant multivariate location estimate when the central model is elliptical. Croux et al. (1997) derived a similar result when the covariance is known and the central model is multivariate normal. 7 * ( T , F M ) = sup ceRp 29 2.3 M-Estimates In Section 2.1.3, we discussed the M-estimate for univariate location. Here we describe M-estimates for multivariate location and covariance as presented by Maronna (1976). The definition of the M-estimate 6 = (t, C) of multivariate location and covariance 9 = (//, E) based on a set of points (xi,...,xn) ~ i.i.d. is the solution to the equations 1 " - Y > i {d(xi,t;C))(xi-t) = 0 1 = 1 (2.25) - "2 (d(xi, t- Cf) (Xi - t) (Xi -tf = C i=x where V\ and v2 are weighting functions to be specified later, and, for ease of notation, we define d(x,t; C) — \{x — t)TC~1(x — t)}1/2. It directly follows that this is an affine equivariant estimate. To make the relation to the univariate definition in equation (2.11) clear, we can write , ui (d(x,t\C))(x-t) y2 {d{x, t; C)) [x - t)(x - t)T - C. Then, equation (2.25) become (2.26) ± f > (i, C)) = 0. (2.27) i=i To facilitate understanding and comparisons with the univariate case, we define ipi{s) = svi(s), for i = 1, 2. We consider an example to help establish intuition as to what the M-estimate is doing. First, if we let vi(s) = ^ ( s 2 ) = 1, then no down-weighting is performed and we obtain the sample mean and covariance as our estimates. Now, if we take tpi(s) = ip2{s) = ibii(s, K)> i.e. vi(s) = ipH(s,K)/s 30 and v2(s2) = ^H(s2,K2)/s2, then this behaves like a multivariate extension of the optimal univariate location estimate in the sense that at the solution (tn,Cn), points further than K from tn according to Mahalanobis distance d(x,tn;Cn) are "pulled in" to behave as if they were at distance K. For all the results shown by Maronna (1976), the following four conditions are as-sumed: 1. v\(s) and v2(s) are nonnegative, nonincreasing, and continuous functions for s > 0; 2. tpi(s) and ip2(s) are bounded with Kt = sup s>0{ipi(s)}; 3. ip2{s2) is nondecreasing and is strictly increasing on the interval where ip2 < K2\ 4. There exists s0 such that ^(so) > P thus K2 > p and that Vi(s) > 0 for s < s0. From this point forward, and generally in the literature, use of the term M-estimate implies adherence to these four hypotheses. As an example, Huber's Multivariate Pro-posal (1964) satisfies the above conditions. Huber's proposal is to take and Us2) = Ms2,k2)//3, where f3 = JEFoI[ipH(\\x\\2, k\)\, and thus Kx = kx and K2 - kl/fi. Properties With the above conditions on the M-estimate, Maronna proves several important prop-erties about the M-estimate he defines. 31 • For both a continuous distribution and an empirical distribution based on enough data points, the existence of a solution is guaranteed. • For a unimodal and symmetric distribution around some point there is a unique solution, and under additional restrictions one can include empirical distributions, but only in the univariate case. Maronna conjectures that it holds for p > 1, but is not able to prove it. We are unaware of any result which proves his conjecture. The problem of not having a uniqueness result for empirical distributions is somewhat mitigated by a convergence result for M-estimates. If the distribution producing the points (a?!,..., xn) satisfies a probability measure hypothesis and the M-estimate satis-fies the above four conditions, then that distribution has a unique M-estimate (t-*,C*). Furthermore, an M-estimate ( t n , Cn) based on (xi,..., xn) exists for each n sufficiently large. Maronna shows convergence and asymptotic normality of these M-estimates under general regularity conditions. Thus, even though we may not have unique solutions for finite sample sizes, the estimates do converge to a unique solution. Maronna also calculates the IF for the M-estimate, but only provides it for the location estimate t, omitting the IF for the covariance due to its difficult expression. He shows that the IF for location under a spherically symmetric distribution is IF(aj ; t ,F 0 l i ) = « ; 1 ( | |x | | )x (2.28) for a specified constant c which Maronna (1976) derives. This is reminiscent of the univariate case where the IF is proportional to the ip function. Looking at equation (2.26), one could make the definition vi (d{x,t;C)){x -t) v2 {d{x, t; C)) (x - t)(x - t)T - C Thus defining vector mappings \ & i and \&2 instead of the scalar mappings ipi and ip2 that Maronna uses. From this definition, we see that the IF for location is proportional to * ( * , ( t , C ) ) = * i ( x , ( t , C ) ) *2(x,(t,C))_ 32 ^i(x) = vi (d(x,t; C)) x. Thus, we immediately can infer that the GES of the location estimate is directly controlled by the bounding level applied to Unfortunately, noth-ing is said about the form of the IF of the covariance estimate except that its derivation is similar too, but more laborious, than the one for location. In addition to presenting the IF for location, Maronna also gives an upper bound on the breakdown point of the M-estimate under the following model G^s = (1 — e)-F)*,£ + eAy with ||y|| —>• oo. Note that this is not as general as the GEM in equation (2.1), as Maronna models contaminations as point mass distributions at a point near infinity in comparison to the arbitrary distribution in the GEM. The upper bound Maronna gives for the asymptotic value of the breakdown point (Maronna refers the reader to his thesis (1974) for the derivation) is e* = min{e*(t,FM>s),e*(C,FM,s)} < min {^-, 1 - | -} . (2.29) Thus, we see that not only does the clipping level K2 on ip2 affect the breakdown point (recall that for univariate location estimation we have e* = 1/2 if ip is bounded and e* = 0 otherwise), but having either too large or too small a K2 will lead to low robustness with respect to breakdown point. Note that if K2 > p, the bound on the breakdown point can be written as e* < (2.30) p + l Thus, M-estimates necessarily have a low breakdown point in high-dimensional spaces. Maronna presents several important theorems which supports practical use of his multivariate M-estimate; however, nothing is said regarding Fisher consistency. He stated that we will converge to a unique solution (£*, C*) so long as the conditions have been met, but this solution is not necessarily the parameter (fx, £ ) of the underlying distribution. Naturally, not all choices of vi and v2 produce Fisher consistent estimates, but Maronna does not present conditions that guarantee Fisher consistency. 33 A second criticism is the lack of an optimal M-estimate. He presented the IF for the location estimate and elsewhere derives the expression for the covariance estimate. A natural question is to wonder if Hampel's optimality Theorem 2.1 extends to the multivariate M-estimates that Maronna has presented. This combined with a lack of Fisher consistency prevents us from choosing the weighting functions v\ and v2 in a principled manner. The M-estimate studied in the previous section is the natural generalization of the uni-variate M-estimate to multivariate location and covariance. Several nice properties were shown, but a significant deficiency is its low breakdown point in high dimensions. This issued a search for multivariate estimates which possess a high breakdown that is inde-pendent of the dimension. This and the following section discuss two of the more popular estimates with this property that have emerged. This section focuses on the S-estimate as described by Lopuhaa (1989) and Davies (1987). The S-estimate was originally introduced by Rousseeuw and Yohai (1984) in the con-text of linear regression. They proposed the S-estimate as the solution of the optimization problem where b0 satisfies 0 < b0 < a0 = sup{p}. In setting p(s) — s2, a least squares regression is obtained. By bounding p, we limit the maximal effect that any point can have, thus robustifying the least squares technique. Davies (1987) and Lopuhaa (1989) extend this regression estimate to the S-estimate for multivariate location and covariance, although they do so in slightly different ways. 2.4 S-Estimates nin {o(ot)} such that 34 We use Lopuhaa's definition as it is a little simpler. Lopuhaa defines the S-estimate 0 = (t, C) as the solution to the optimization problem min {\C\} (2.31) (t,C)g(RP, PDS(p)) such that -f2p(d(Xi,t-C))=b0 (2.32) where b0 satisfies 0 < b0 < a0 = sup{p}. The constant b0 will effect the scaling of the covariance estimate and should thus be chosen in accordance with the underlying model distribution. In particular, if (a?i,... ,xn) are from an elliptical distribution i^.s, then it is natural to choose b0 — IE>E[p(d(ajj,//;£))] which is the limit of the average in equations (2.31)-(2.32) if t and C are Fisher consistent. It is straightforward to show that S-estimates are affine equivariant. Thus b0 can be selected without knowing (/x, S) (which of course we do not know) by choosing b0 = JEiFoj[p (\\x\\)] utilizing only the normalized parametric distribution. Just as choosing p(s) = s2 yields the least square solution for the regression problem, it also produces the least square solution for the location-covariance problem. Choosing b0 — p for appropriate scaling of the covariance matrix, the S-estimate produces the sample mean and covariance as the unique solution (Griibel, 1988). Properties Lopuhaa and Davies prove many of the same results such as existence, convergence, con-sistency, and asymptotic normality but under different conditions. Davies applies weaker constraints on the p function, but only considers elliptical distributions. Lopuhaa's con-straints on p are slightly more restrictive, but some of his results apply to more general distributions. Because in most conceivable circumstances, one would apply these tech-niques to a sample set which they believe to have arisen from an elliptical distribution, we will present the properties from Davies. 35 • Davies first demonstrates the existence of a unique solution of the S-functional at the true distribution and that this solution is the correct value, i.e. it is Fisher consistent. For an elliptical distribution with some conditions the S-functional has the unique solution (t*,C*) = (fi,"E). Although this should be expected of a good estimate, recall that M-estimates were not shown to be Fisher consistent, but instead, only that a unique solution existed for the M-functional. • Davies then shows that for large enough sample sizes, equations (2.31)-(2.32) have a solution. • Furthermore, any sequence of these solutions will converge to the true parameter values of the source distribution. • Davies derives the asymptotic for the S-estimate and shows that it also has a limiting normal distribution. Lopuhaa derives the IF for the S-estimate for distributions more general than ellipti-where vi(s) = ip(s)/s, v3(s) = i/>(s)/s - p(s) + bQ, and AF(/Lt, E) = E F[tf (x, (/i, £ ) ) ] , A F has a nonsingular derivative A at (t(F), C(F)). Then the IF exists and is Now if F = FQJ is the spherical distribution and p satisfies some conditions then, the IF for the associated location S-estimate is cal. For *(Mt,c)) = vi {d(x,t;C)) (x - t) jwi (d{x, t; C)) (x -t)(x- tf - v3 (d{x, t; C)) C IF(x;S,F) = -A'1* (x,[t(F),C(F)}). (2.33) IF(x;t,fb,i) = c (^||a:||) x cvi(x\\)x (2.34) 36 for a constant c given by Lopuhaa. As expected, this expression is parallel that of equation (2.28) for M-estimates. Thus far, we have one significant advantage of S-estimates over M-estimates, i.e. con-sistency. However, Davies demonstrates an even more significant attribute of S-estimates, which is the ability to achieve a high breakdown point independent of dimension defined as, e*n = mm{e:(t,F^),e*(C,F^)} = ^ M ± l . (2.35) Two simple, but important corollaries follow here. C O R O L L A R Y 2.1 e* = lim < = b0/a0. (2.36) C O R O L L A R Y 2.2 Setting b0/a0 = 1/2 - (p + l)/2n yields [n-p+l] < = (2-37) Th which is the upper-bound for the breakdown point of affine equivariant estimates, which is proved in Davies (1987). The S-estimate is a significant improvement on the M-estimate in two ways. First, with an appropriate choice of p, it can achieve the maximal breakdown point which is asymptotically 1/2 regardless of dimension. This contrasts with the M-estimates increas-ing susceptibility to outliers as dimensionality increases. Second, Fisher consistency is proven for S-estimates along with a stronger convergence proof. For elliptical distribu-tions, M-estimates are shown to converge in probability to a unique solution though not necessarily the underlying parameters, whereas S-estimates are shown to converge almost surely to the underlying parameters. 37 A significant drawback of the S-estimate is that there is no optimality theorem on how to choose the function p. Lopuhaa uses as an example Tukey's Biweight function but there is no justification for its selection, other than it resembles a smooth clipped parabola, thus approximating robust least-squares estimation. Using Tukey's Biweight function in the S-estimate and Huber's proposed M-estimate, Lopuhaa performs sim-ulations comparison of the two. The general trend is that at the model distribution, the M-estimate achieves a lower error variance, but when the model distribution does not match the actual distribution, the S-estimate performs better. These simulations must be cautiously interpreted however because there may be better choices of weighting functions which could result in a different conclusion. In this section, we will discuss the third of the three classes of estimates surveyed, the M C D estimate. We mostly follow Butler et al. (1993). The previous two estimates are rather abstract in that the M-estimate is the solution of a system of nonlinear equations, and the S-estimate is the solution of a nonlinear optimization problem. The M C D tech-nique presented here is much more intuitive. Given 0.5 < a < 1, the M C D estimate can be described as follows. Consider all subsets of {a j i , . . . , xn} of size h = [an], where [x] denotes the greatest integer smaller than or equal to x. For each of these sets, compute their sample mean and covariance. Then, the M C D estimate (tn, Cn) is the sample mean and covariance from the set whose sample covariance has the minimum determinant. As intuition would lead to believe, the M C D corresponds to finding the ellipsoid which covers the most dense cluster of h points and then taking a weighted sample mean and covariance where the weighting is the indicator function (divided by h for normalization) 2.5 M C D Estimate 38 of the ellipsoid. Inherent in this is an assumption of unimodality due to the clustering. For the rest of the material presented in this section, we assume that the underlying distribution F M > S is unimodal and elliptical. Thus, we can assume it has a density of the form f^{x) = \ n - l l 2 v ( ( x - » ) T V - l { x - » ) ) (2.38) with <f : E+ (->• K+ being a decreasing function appropriate scaled so that the density is valid. Furthermore, because the MCD is affine equivariant (which follows from its use of the sample mean and covariance), we can assume the parameters of the underlying distribution are fi = 0 and £ = I. Properties Although the MCD has an intuitively simple interpretation, there are not many results on its properties. The existence of a solution for sample sets is obvious as it takes a minimal covariance from a finite set of solutions. This solution can be seen to be unique (with probability 1) but only if the underlying distribution is continuous. Butler et al. show the existence and uniqueness of a solution under the model distribution. As one would expect, the solution is the sphere centered at the origin in accordance with the underlying distribution (/J,, £ ) = (0,1). For further details see Butler et al. (1993). Butler et al. also studied the convergence of the MCD estimates. He found that the MCD estimate of location converges to the true value, but the covariance estimate does not. Like the M-estimate and S-estimate, the MCD location estimate is asymptotically normal with asymptotic covariance matrix c £ for some constant c. The constant c is given by Butler et al. (1993). The breakdown point of the MCD estimate has not been mentioned in Butler's article, but due to the straightforwardness of its definition, it is readily apparent that if n > p +1 39 then <(t) = ^ ^ - > l - a (2.39) • t/-\ . (n-h+1 h-p\ . , , en(C) = mm < , > —> min-i 1 — a, OJ}. ( . n n ) (2.40) The breakdown point of the location estimate follows from that if h of the original points remain, then the location estimate will still be finite, but if fewer than h points remain, then the estimate can become unbounded. The same reasoning applies for keeping the maximal covariance eigenvalue bounded. However, we must also consider the other di-rection of breakdown where the minimum eigenvalue approach zero. Here, we only need to replace h — p points to force an eigenvalue to zero. In particular, we find the p densest sample of points whose convex hull contains no other data points. Then we replace any h — p of the other points with points lying at the center of this convex hull, which is contained within a hyperplane by definition. Now, the most dense cluster of h points is the cluster of points in this convex hull and the covariance estimate has a zero eigenvalue. Note that, one can choose a so that they can get an asymptotic breakdown point of e* = 1/2. However, it is clear that by choosing a smaller a yielding a larger e*, one is utilizing fewer points in the sample statistics and thus will have a higher error variance on the estimates. Choosing h as small as possible, i.e. h — [(n + l)/2], yields the maximal location breakdown point, Similarly, choosing h = [(n+p+l)/2] yields the maximal covariance breakdown point, < ( C ) = l ( " - p n + 1 ) / 2 1 . (2.42) Thus, the MCD can achieve the maximal location and covariance breakdown points, but not at the same time. This is not really a hindrance since the difference is minuscule and both breakdown points asymptotically approach 1/2. 40 The computational complexity is a major issue regarding the MCD. To find an exact solution requires searching the entire space of all possible groupings of h out of n data samples. Thus, the computational burden grows combinatorially with the sample size. However, there are fast approximations to the MCD, the most prevalent of which is proposed by Rousseeuw and Van Driessen (1999) and briefly described in the next section. The algorithm is an approximation to the MCD and produces only a locally optimal solution. An ellipse £ is considered locally optimal if the determinant of its sample covariance cannot be decreased by switching one point in £ with a point not in £. The development and availability of fast algorithms Hawkins (1994); Rousseeuw and Van Driessen (1999) for computing the minimum covariance determinant (MCD) has brought renewed interest to this estimate. Asymptotic properties were given in Bulter et al. (1993), but the asymptotic variance of the MCD scatter part remained unknown. In the particular case of one dimension the influence function of the MCD scale was computed by Croux and Rousseeuw (1992a). Croux and Haesbroeck (1999) worked out the influence function of the MCD scatter matrix estimate in arbitrary dimensions, and used it to evaluate the asymptotic efficiency of this estimate. It follows that the MCD scale estimate has a bounded influence function, which is re-descending to zero for the off-diagonal elements, but not for the on-diagonal elements. It is not sufficient to consider only breakdown point and efficiency of robust estimates, but maxbias curves should be computed. This has been done for the MCD estimate in the univariate case by Croux and Haesbroeck (1999), but the multivariate case seems to be rather hard to handle. The MCD is straightforward to compute and can be approximated with a fairly fast algorithm as described in the next section. F A S T - M C D Algorithm We now give a brief overview of the FAST-MCD algorithm proposed by Rousseeuw and Van Driessen (1999). As in the previous section, we will let h denote the size of the subsets to examine. The basis for their algorithm is the following theorem. 41 T H E O R E M 2.2 Let Hi be a subset of {1,... ,n} of size h with associated sample statistics tln = ± $ > (2.43) C n = {/^(xi-tDi^-tlf. (2.44) lf\Cln \ >0 then define the distances di(i) = d{xi, t\; C*). Now, define the set H2 to be those points with the h smallest distances di(i), i.e. {d\(i) \ i € H2} = {(^i)i, (^1)2, • • • , (di)h} where (di)i < (0^1)2 < • • • < {di)n are the ordered distances. Now, define t2n and C\ as in equations (2.43)-(2.44) but with H2 in place of Hi. Then, \C2n\ < \C\\. (2.45) The construction of H2 from Hi is called the C-step where C stands for covariance. Thus, recursively defining sets H by repeatedly taking the points which minimize the Mahalanobis distance based on the previous iteration leads to a local minimum (the determinant of Cln is a local minimum if it cannot be decreased by switching one point in Hi with a point not in Hi) of the covariance determinant. Using this intuitive principle, the FAST-MCD algorithm can be described as follows: 1. Initialize H0 by randomly selecting p+ 1 points. Compute the sample mean tQn and covariance C°n for H0. Construct the set Hi as in Theorem 2.2, i.e. choose the h points which minimize the Mahalanobis distance with respect to and C° . 2. Perform k C-steps (Rousseeuw and Van Driessen recommend k — 2 C-step). 3. Perform steps 1 and 2 many times and take the I best solutions H[ which have the smallest covariance determinant (Rousseeuw and Van Driessen recommend I = 10). For each of these "survivors" repeatedly perform the C-step until convergence (which is guaranteed by Theorem 2.2 and boundedness of the determinant). 4. Take as your solution, the H1^ which has the minimum covariance determinant. 42 For larger sample sets, Rousseeuw and Van Driessen recommend partitioning the data set into several groups and running the FAST-MCD on each of them, then taking the ra best estimates from each group and running the FAST-MCD on the entire data set initialized with each of the m solutions from each group and taking the minimum result as the solution. 2.6 Conclusions The focus of this chapter is primarily on three techniques: the M-estimate, S-estimate and MCD. The first estimate discussed is the M-estimate, which is a generalization of the well known M-estimate for univariate location. The M-estimate is defined by two weighting functions v\ and v2 which control the influence of outliers on the location and covariance estimates. The M-estimate is then the solution of a system of equations of the weighted sample moments. Maronna (1976) shows several important properties of the M-estimate under certain conditions on the weighting functions and distribution. In particular, existence, uniqueness and convergence are shown; although, the convergence is not necessarily to the true underlying parameters as conditions for Fisher consistency are not shown. Another notable characteristic of the M-estimate is its low breakdown point which decreases with increasing dimensionality of the data. Furthermore, the conditions on the weighting functions seem to imply an inherent assumption of unimodality on the underlying distribution. The second estimate surveyed is the S-estimate, which originated in the context of lin-ear regression, but is extended to multivariate location and covariance estimation. This estimate obtains its solution via an optimization problem which minimizes the determi-nant of the covariance estimate subject to a cost constraint which can be interpreted as the sum of weighted Mahalanobis distances of the samples under the covariance and loca-tion estimates. Robustness is endowed to this estimate by limiting the maximal cost that a single data sample can contribute and thus limiting its influence on the estimate. Of 43 the three estimates, this perhaps has the most properties shown about it. These include existence, uniqueness, Fisher consistency and convergence of estimates to the true pa-rameter values. Furthermore, S-estimates are shown to be maximally robust with respect to the breakdown point. S-estimates also satisfy the form of M-estimates, but violate the constraints on the weighting functions and are thus not considered M-estimates. The M C D estimate is based on the intuition that for unimodal elliptical symmetric distributions, the most reliable points on which to base the estimate are those which are closely clustered. The M C D estimate finds the subset of data points which has the small-est sample covariance and takes as its estimates, the sample mean and covariance from this set. Unlike the M-estimate and S-estimate, there are no weighting/cost functions to choose here, only the cluster size parameter a. A l l three of these estimates asymptotically converge to limiting values with a Gaussian distribution. Furthermore, the variance on the "error" goes down as 1/n. Note that there are other well known estimates which actually have a slower decay on the variance, e.g. the asymptotic variance of the minimum volume ellipsoid (MVE) estimate goes down as n~2lz and does not converge in distribution to a Gaussian. 44 Chapter 3 Multivariate Contamination Models 3.1 Classical Contamination Model Statisticians use contamination or mixture models to study the performance of robust al-ternatives to classical statistical procedures when these procedures are applied to messy data sets that contain outliers. Most studies on robustness in statistics are centered around a concept of contamination introduced by Tukey (1960). The best known and most broadly used contamination model is the so called e-contamination neighborhood in-troduced by Tukey (1962) and extended by Huber (1964). These models can be thought of as "testing grounds" where statistical procedures are tested and continuously im-proved. The contamination model was originally introduced to handle one-dimensional data. Assume that, given a sample X\,..., Xn, the majority of the data follows the nom-inal distribution HQ while a small fraction e follows an arbitrary distribution H. This contamination model, which we will call classical contamination model, can be written as H(x) = (1 - e)H0(x) + eH(x), 0 < e < \ . (3.1) Li For example H0 may be a normal distribution with mean \i and standard deviation cr, i.e., H0 — N((j,,cr), and H an arbitrary distribution. Robustness in the sense of the Princeton's Group (Tukey, Huber, Hampel et al., etc.) addresses the estimation of the dominant component H0 and clearly, for this component to be dominant, the amount of contamination e must be less than one half. 45 The classical contamination model (3.1) has been adopted for multivariate data sets as well (see for example Rocke and Woodruff, 1996), although it is not necessarily ap-propriate in that context. Given a sample X i , . . . , Xn, where X ; G W, i = 1,..., n, the majority of the data follows the nominal distribution H0 while a small fraction e follows an arbitrary distribution H. Then, the classical multivariate contamination model can be written as For example H0 may be a multivariate normal distribution with mean fx and scatter matrix S , i.e., HQ = /V(/^,£), and H an arbitrary distribution. Under this model a fraction (1 — e) of the cases on average are distributed according to H0 and are therefore the majority or "core" data, while a fraction e of the cases are from H and generate outliers that deviate from the core behavior of the data. Hence, in this model each data point is either "100 % perfect" coming from H0 or "100 % spoiled" coming from H. An alternative representation of the classical multivariate contamination model (3.2) can be written as (3.2) X = (1 - 6)Y + bZ, (3.3) where Y, Z, b are independent with Y ~ H0 Z H b ~ BINOMIAL(l,e). This can be shown as follows. P(X < x) P((l b)Y + bZ <x) P(Z < x)P(b = 1) + P{Y < x)P(b = 0) H{x)e + HQ{x)(l-e)=H{x). 46 The classical multivariate contamination model (3.2) is a model of "independent mix-ture of two independent populations", for example a "normal population" and an "ab-normal outliers generating population". Unfortunately this model does not adequately represent reality for many multivariate data sets that arise in practice. It may often happen in applications that outliers occur in each of the variables independently of the other variables or in special dependency patterns other than the complete dependency pattern which, we will see, the classical multivariate contamination model enforces. In addition, the classical multivariate contamination model (3.3) does not allow the possi-bility of dependency among the uncontaminated vector, Y, the contamination indicator, b, and the contamination vector, Z. Now we present some real data examples from the literature that illustrate the need for a more general and flexible multivariate contamination model. 3.2 Real Data Examples E X A M P L E 3.1 Hertzsprung-Russell Diagram of the Star Cluster C Y G OBI Consider the Hertzsprung-Russell data set (see Rousseeuw and Leroy, 1987). This two-dimensional data set consists of 47 stars in the direction of Cygnus. The first variable is the effective temperature at the surface of the star and the second variable is its light intensity. The scatterplot of the logarithm of the light intensity versus the logarithm of the temperature is shown in Figure 3.1. We can see from the plot that the data have two groups of points, the majority which seems to follow a steep band and the four stars in the upper left corner. These groups are well known in astronomy. The majority of the points are said to lie on the main sequence and astronomers explain the four points with indices 11, 20, 30 and 34 as giant stars. E X A M P L E 3.2 Body and Brain Weights Consider the brain and body weight of 28 animals as published in Rousseeuw and 47 4.0 4.2 Log temperature Figure 3.1: Hertzsprung-Russell Diagram of the Star Cluster CYG OBI. Body Weight (g) - Log scale Figure 3.2: Body and Brain Weight Data Set. Leroy (1987, page 57). This sample was taken from larger data set in Weisberg (1985). Figure 3.2 contains a scatter plot of the logarithm of the brain weight versus the logarithm 48 10 20 30 40 Age at first word (months) Figure 3.3: Gesell Adaptive Score versus Age at First Word. of the body weight. The scatter plot shows that the majority of the data follow a clear pattern, except the three observations in the lower right region. These three observations correspond to dinosaurs, each of which possessed a small brain with a heavy body. E X A M P L E 3.3 Gesell Adaptive Score This data set was first reported by Mickey et al. (1967) and widely cited in the statistical literature. We obtained the data set from Rousseeuw and Leroy (1987). The study was conducted on 21 children, giving their age (in months) at first spoken word and a score which is a measure of the development of the child. The Gesell adaptive assessment is a standard procedure for direct observation of a child's growth and development. The Gesell assessment is conducted by a trained examiner who makes discriminating obser-vations of a child's behavior and then evaluates these observations by comparison with normal behavior patterns. A normal behavior pattern is a criterion of maturity which has been defined by systematic studies of the average healthy course of child development. The scatterplot of Gesell adaptive score versus age at first word is shown in Figure 3.3. 49 Figure 3.4: Advertising Yield versus Spending. Case 19 does not follow the general pattern of the remaining data points. Mickey et al. (1967) also decided that this observation is an outlier, by mean of a sequential ap-proach to detect outliers via stepwise regression. Since the value of the score is subjective, this outlier could be explained due to a possible error in the observed Gesell adaptive score given to the child. E X A M P L E 3.4 T V A d Yields Consider the TV Ad Yields data of 21 advertisements published in the Wall Street Jour-nal, March 1, 1984 (available at http://lib.stat.cmu.edu/DASL/Datafiles/tvadsdat.html). The advertisements were selected by an annual survey conducted by Video Board Tests, Inc., a New York ad-testing company, based on interviews with 20,000 adults who were asked to name the most outstanding TV commercial they had seen, noticed and liked. The retained impressions were based on a survey of 4,000 adults in which regular prod-uct users were asked to cite a commercial they had seen for that product category in the past week. Figure 3.4 contains a scatterplot of the retention versus the expenditure, 50 25 30 Per capita cigarette sales Figure 3.5: Scatterplot of Cigarette Consumption versus Lung Cancer. which is the TV advertising budget, 1983 ($ Millions). The scatterplot shows that the majority of the data follows an increasing pattern, except the three observations with indices 7, 10 and 13 (McDonald's, Ford and ATT/BELL). These outliers are likely due to an unsuccessful advertising campaign, and therefore the retention figures are lower than expected for the large amount of expenditure. Specifically, the point with index 10 appears to have a low retention for an extremely large value of expenditure. E X A M P L E 3.5 Smoking and Cancer Researchers wanted to examine the effect of smoking on cancer development. Data for 43 states and the District of Columbia were collected on per capita numbers of cigarettes smoked (sold) in 1960 together with death rates per thousand population from various forms of cancer, see Fraumeni (1968). Scatterplots of the cigarette consumption versus the lung cancer and the leukemia death rates are shown in Figures 3.5 - 3.6, respectively. From the scatterplots we can see that Nevada (NE) and the District of Columbia are outliers in the distribution of cigarette consumption (sale) per capita. The ready expla-51 s CO • * • (•» • •• •DC • • t NE * to • • • • LO • 15 20 25 30 35 40 Per capita cigarette sales Figure 3.6: Scatterplot of Cigarette Consumption versus Leukemia Death Rates. nation for the outliers is that cigarette sales are swelled by tourism (Nevada) and tourism and commuting workers (District of Columbia). E X A M P L E 3.6 Air Pollution and Mortality Researchers at General Motors collected data on 60 United States Standard Metropoli-tan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortal-ity. The variable of main interest is age adjusted mortality and is labelled "Mortality". The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics and variables recording the pollution potential of three different air pollutants. These properties were collected from a variety of sources and they are available at http://lib.stat.cmu.edu/DASL/Datafiles/SMSA.html. Figure 3.7 shows all pairwise scatterplots of the variables; mortality, median education, population density, percentage of non-whites, annual rainfall (inches) and logarithm of the Nitrous Oxide (NOx). Clearly these data have several multidimensional outliers that show up as a cluster in several of the scatterplots. 52 9 10 11 12 0 10 20 30 40 0 1 2 3 4 5 Mortality • • • \ i J - :': •«.:*'• • .. : -Kr:;- . • . '•:';>: Education • * i **•• • • t / •:. •• . . -. .i . • . Pop density " * * .f* : • *i* *| % • * I • / • • * ***••• %Non white • • •:!:•>»-•••• ' • vo,/.- • . •.<•: ~ * ••*•• »•»! Rain • . i -AS . Log NOx •s 800 900 1000 1100 2000 4000 6000 8000 10000 10 20 30 40 50 60 Figure 3.7: Air Pollution and Mortality Data Set. E X A M P L E 3.7 Salinity Consider the salinity data set (see Ruppert and Carroll, 1980). These data consist of measurements of water salinity (i.e. its salt concentration) and river discharge taken from North Carolina's Pamlico Sound. Figure 3.8 shows all pairwise scatterplots of the variables; water salinity, salinity lagged by two weeks, the trend which is the number of biweekly periods elapsed since the beginning of the spring season and the volume of river discharge into the Sound. Carroll and Ruppert (1985) described the physical background of the data. The scatterplots of the salinity lagged, the trend and water salinity versus 53 0 1 2 3 4 5 4 6 8 10 12 14 Lagged Salinity 1 1 t • • • • •• • • • • • • • • • • • -• • • • <<*-co-CVJ-o -. . . Trend • • • • • • i . • • : ' . : Discharge • • • • ^ . CNJ. o . 0 0 -(O-T f -• . : • • • • • • » : • • • • • • • • • • Salinity i 1 1 1 1 1 n 1 1 1 1 r 1 1 1 I 1 1 r~ 4 6 8 10 12 14 22 24 26 28 30 32 Figure 3.8: Salinity Data Set. the volume of river discharge each show two points that do not follow the pattern of the remaining data. Carroll and Ruppert (1985) indicate that these outliers are cases 5 and 16 which correspond to periods of very heavy discharge. E X A M P L E 3.8 Wages and Hours The data (available at http://lib.stat.cmu.edu/DASL/Datafiles/wagesdat.html) are from a national sample of 6000 households with a male head and earnings of less than $15,000 annually in 1966. Thirty-nine demographic subgroups were formed for analysis of the relation between average hours worked during the year and average hourly wages ($) and 54 2.0 2.5 3.0 3.5 30 40 50 HRS RATE ASSET AGE 2050 2100 2150 2200 2250 "1 r- 2000 4000 6000 8000 ' 12000 Figure 3.9: Wages and Hours Data Set. other variables. The study was undertaken in the context of proposals for a guaranteed annual wage (negative income tax). At issue was the response of labor supply (hours worked) to increasing income and effective hourly wages. Figure 3.9 shows all pairwise scatterplots of the average hours, average wages (rate), average family asset holdings and average age of the respondent. Clearly these data have several multidimensional outliers. In particular, the scatterplots of the average hours, the average wages and the average asset versus the age each show outliers in the left and right of the plots. It appears that the age variable has a certain range and any age outside this range shows as an outlier. 55 Discussion The Tukey-Huber contamination model (3.2) seems appropriate for the data in Exam-ples 3.1 and 3.2. In the case of Example 3.1 we can assume that there are two subpopu-lations of stars (the main sequence and the giant stars) and that the given measurements correspond to either one of these two subpopulations with probabilities (1 — e) and e, where e represents the proportion of giant stars. Hence, we can consider these data as coming from an independent mixture of two independent populations, as required by the classical multivariate contamination model (3.2). In the case of Example 3.2 there are also two subpopulations (regular animals and dinosaurs) and the given measurements correspond to each one of these two subpopulations with probabilities e and (1 — e). Since we do not know the criterion used to include animals in the data set, the interpretation of e is less clear, in this case. It seems difficult, however, to justify the use of the independent mixture model (3.2) for the remaining examples. In Example 3.3, it would.be hard to imagine the existence of a subpopulation of children from which the outlying Case 19 has been drawn. A more likely scenario is that the Gessel adaptive score for this child has been erroneously assigned or recorded. A point to notice here is that one of the two variables (namely, Gessel adaptive score) appears to have a larger probability of gross errors and unusual values than the other variable (namely, age at first spoken word). If the contaminating distribution were to retain the value for the variable age at first spoken word and only contaminate the variable Gessel adaptive score, then the assumption of independence would be violated. A serious limitation of the classical multivariate contamination model (3.2) is the requirement that each data point is either 100% perfect or 100% spoiled. Model (3.2) is also restrictive in that it fails to allow for cases where the probability of occurrence of discordant values and gross errors in some of the variables depend on the values of other variables. For instance, in Example 3.4, the generally increasing pattern 56 observed for the majority of the data ceases to apply for cases 7, 10 and 13 (McDonald's, Ford and ATT/BELL). Case 10 has a suspiciously low retention value and may be a gross error. The main point in this example is that the outliers were likely to be produced by "problems" affecting the variable retention when the variable expenditure assumes extreme values. Example 3.7 and Example 3.8 are other examples that represent this situation (extreme values of one of the variables may cause the occurrence of outliers or gross errors in other variables). In Example 3.7, it is known that the outliers, cases 5 and 16, correspond to periods of very high discharge. These extreme values of the variable discharge may have affected the values of the other variables. For instance, the variable salinity for cases 5 and 16 has high values compared to the general decreasing pattern as shown in the plot in Figure 3.8. In Example 3.8, each plot in Figure 3.9 involving the variable age has two unusually extreme values. These values may have caused outliers in some of the other variables, due to the fact that the general pattern does not apply when age is too high or low. The assumption of independent mixture of two independent populations would also be inappropriate in Example 3.5 where the variables were collected from independent sources (agencies) and the outliers are likely to be due to special circumstances regarding cigarette sales in the District of Columbia and Nevada. A similar situation (measurements from independent sources) arises in Example 3.6. 3.3 New Contamination Model The given examples highlight the need for a more flexible contamination model. Very often, the p-dimensional observation vector collects measurements from different sources, each being inclined to have its own statistical errors. There are also situations where there is a strong dependency between the contaminated and uncontaminated entries and between the uncontaminated entries and their contamination indicator. For example, extreme values of one or more of the variables may increase the likelihood of outliers or 57 gross errors in other variables. Suppose that Y G W has an elliptical distribution H0 with center [i and scatter matrix £ , for instance HQ = AT(/z, £ ) . We consider situations when sometimes Y cannot be perfectly measured and the actual observations can be represented by the contamination model: X = (J - B)Y + BZ, (3.4) where I is a p x p identity matrix, Z is an arbitrary random vector and B = Bx 0 0 B2 \ Di&giBr, B2,..., Bp) (3.5) \ 0 0 ... Bp J is such that each Bj (j = 1,... ,p) is a Bernoulli random variable with P ( £ , = !) = ! - P{Bj = 0) = e„ j = 1,... ,p. (3.6) That is, 6j represents the probability that the j-th entry of X is contaminated. No-tice that in this model the contamination indicator matrix, B, is allowed to depend on the uncontaminated vector, Y, therefore, e,- = JE{P(Bj = Moreover, the con-tamination vector, Z, can depend on the contamination indicator matrix, B, and the uncontaminated vector, Y. Thus, the contamination model (3.4) can be expressed in the following hierarchical way. Y, B\Y, Z\B,Y. Notice that different values of tj and different dependence structures of the diagonal elements of the contamination indicator matrix, B, generate different contamination patterns. This will be further discussed in Section 3.4 of this chapter. To gain some insight into the contamination model (3.4) we will consider some special situations. 58 1. Tukey-Huber contamination model. If Y,B and Z are independent, and the diagonal matrix B of Bernoulli random variables has the special completely dependent structure P(B1 = B2 = ... = BP) = 1, (3.7) then the contamination model generating X reduces to the Tukey-Huber multi-variate mixture distribution (3.2). In this case the contamination model represents independent mixture of two independent populations, the normal population and the abnormal population. The Hertzsprung-Russell data set and the body and brain weights data set are examples of this situation. 2. Independent-contamination model. In this case Y,B and Z are independent. That is the probability of contamination for the different entries of X does not depend on the uncontaminated vector, Y. Also, the contamination vector, Z , does not depend on which entries are being contaminated and their values. This situation can be expressed as, Y, B, Z 3. B and Z are independent of Y. In this case the probability of contamination for the different entries of X and the contamination vector, Z, are independent of the uncontaminated vector, Y. But, the contamination vector, Z , depends on which entries are being contaminated. This situation can be expressed as, Y, B, Z\B. 4. B and Y are independent. In this case the probability of contamination for the different entries of X is independent of the uncontaminated vector, Y. But, the contamination vector, Z, depends on which entries are being contaminated and their values. This situation can be expressed as, Y, B, Z\B,Y. 59 5. Y and Z are independent. In this case the contamination vector, Z, is inde-pendent of the uncontaminated vector, Y. But, the probability of contamination for the different entries of X depends on the uncontaminated vector, Y. Also, the contamination vector, Z, depends on which entries are being contaminated. This situation can be expressed as, Y, B\Y, Z\B. Equivariance Considerations The classical contamination model (3.2) is translation-scale equivariant and affine equiv-ariant. On the other hand, the proposed contamination model (3.4) is translation-scale equivariant but not affine equivariant. To show that the contamination model is not affine equivariant, suppose that the random vector X follows the contamination model X = (I - B)Y + BZ, and A is an invertible (p x p) matrix. Let V = AX = A(I - B)Y + ABZ ^ (I - B)AY + BAZ, unless AB — BA (e.g. A is diagonal). The lack of affine equivariance of the contamination model (3.4) is a bit surprising given that the uncontaminated vector, Y , exhibits this property. However, this lack of affine equivariance is consistent with the fact that some affine transformations of the observable data, X , may considerably worsen the overall quality of the resulting data. For instance, if B\,..., Bp are independent with P(Bj = 1) = 6j, then (1 — ej)100% of the measurements Xij,..., Xnj (j = 1,... ,p) in each coordinate (variable) are expected to be "good", but only [(1 — ei)(l — e 2 ) . . . (1 — ep)]100% of linear combinations a\Xn + 0 2 ^ 2 + . . .+apXip (i = 1,..., n), of all the coordinates, can be expected to be "good". We 60 x2 x3 X\ + X2 + X$ 0.891 -0.482 10.000 10.409 0.902 -0.769 -1.722 -1.589 1.246 -0.110 1.667 2.803 0.025 -1.198 -0.117 -1.290 -0.861 10.000 0.167 9.306 -0.215 -1.464 -1.265 -2.945 -1.157 -0.294 -0.527 -1.979 -0.149 -1.598 0.910 -0.838 10.000 -1.091 -0.740 8.169 -0.671 -0.007 0.884 0.206 Table 3.1: A Numerical Example of Increased Percentage of Contamination in a Linear Combination of Variables. illustrate this phenomenon for a small data set with dimension p = 3 in Table 3.1. The table exhibits the values of each coordinate and the linear combination of the coordinates. The coordinates are separately and independently 10% contaminated. However, we can see that the linear combination of the coordinates are 30% contaminated. In particular notice that in the contamination model (3.4), where ei = e2 = . . . = ep = e, (1 — e) no longer represents the fraction of "good" data vectors (cases). Most of the observations, Xi,..., X n , Xi e f f (i = 1,..., n) can be contaminated and this is often the case when the dimension, p, is large. Assuming that B\,...,BP from the contamination model (3.4) are independent with constant ei = . . . = ep = e, we generate another simple model of some practical interest. This model represents situations when a certain proportion of outliers occurs indepen-dently on each variable. For example, this could be the case if several measurements are performed on the same individuals or items by several laboratories (calibration model). 61 In the next section, we present the different correlation structures that the components of the contamination indicator matrix B (3.5) may have in the contamination model (3.4). Dependence structures that can be covered are general dependence, with both positive and negative dependence. 3.4 Dependence Structures in the Contamination Indicator Matrix To study the different dependency patterns of the components of B, we consider the bivariate Bernoulli distribution of B = Diag(Bi, B2) where P(BX — 1) = P(B2 = 1) = e as shown in Table 3.2. From the table we have that the expected value of Bj (j = 1, 2) Bi 0 1 0 1 - 2e + 8 e-8 1 - € 1 e-8 8 e 1-e e 1 Table 3.2: Bivariate Distribution of Bx and B2. is e with variance e(l — e). The expected value of B\B2 is 8 with covariance 8 — e2 and 5-e2 6(1-6 ) ' correlation , in which -e 8 - e2 < < 1, as 0 < 8 < e. 1 - e - e(l - e) We consider three special cases of the dependence structures of the diagonal elements of B. Firstly, the independent case described in Table 3.3. The joint distribution of B\ and B2 in this case is given by P{BX = 1, B2 = 1) = P(BX = 1)P{B2 = 1) = e2. Secondly, the perfect correlation case described in Table 3.4, which is the classical contamination model where P(BX = B2) — 1. The joint distribution of Bx and B2 in this case is given by P(BX = 1,B2 = 1) = P(BX = 1) = e. Lastly, the perfect rejection case presented in 62 Bi 0 1 0 ( l - 6 ) ( l - e ) e(l - c) 1 - e 1 e(l - e) e2 e 1 - e e 1 Table 3.3: Bivariate Distribution of B\ and JB 2, Independent Case. Bi 0 1 0 1 - e 0 1 - e 1 0 e e 1 - e e 1 Table 3.4: Bivariate Distribution of B\ and B2, Perfect Correlation. Bx 0 1 0 1 - 2e e 1 - e 1—> e 0 e 1 - e € 1 Table 3.5: Bivariate Distribution of Bi and B2, Perfect Rejection. Table 3.5 is an example of negative dependence. The conditional distribution is given by P(B2 = l\Bi = 1) = 0, which implies that the joint distribution of B\ and B2 in this case is given by P(B\ = 1, B2 = 1) = 0 and the conditional distribution P(B2 — \\BX = 0) = j^. The three cases presented differ mainly in 6, the expected value of B\B2 (see Table 3.2). In the case of independence 8 = e2, in the perfect correlation case 8 = e and in the perfect rejection case 8 = 0. 63 Notice that we only consider bivariate cases in Tables 3.3 - 3.5 and situations where P(Bi — 1) = P(B2 = 1) = e. On the other hand, the more general multivariate distri-bution of a Bernoulli vector (Bi,..., Bp) with P(Bi = b\,..., Bp = bp) = P(b\,..., bp), bj = 1 or 0 for j = has too many parameters, namely (2P — 1) parameters. Joe (1997) has an effective proposal for drastically reducing the number of parameters and still attaining a wide range of correlations structures. Joe (1997) proposed the ex-changeable mixture model as a reasonable choice for some applications. The exchangeable mixture model is constructed as follows. Conditional on a random parameter 9 the vari-ables B\,...,BP are i.i.d. Bernoulli(9). Therefore, the unconditional joint density for Bi,...,Bp is given by P(b1,...,bp)=P{B1 = b1,...,Bp = bp) = [ 9k(l-9)p-kdG(9), (3.8) Jo where k = Y^j=ibi-> a n d G{9) is the some specified distribution for 9 with support on [0,1]. Joe (1997) indicates that the exchangeable mixture model only includes non-negative dependence structures. For some cases, the functional form but not the mixture repre-sentation of the exchangeable mixture model can be extended to include negative de-pendence. This feature will help to deal with the negative dependence of the perfect rejection structure in the contamination model (3.4). Note, for the rest of the thesis we mainly consider situations in which Bi,..., Bp are i.i.d. Binomial(l, e). The family of distribution functions H generated by the contamination model (3.4) will be denoted by ri. Notice that ri constitutes a contamination neighborhood for the central elliptical distribution H0. To gain further insight into the family H, in the next section we consider some dependence situations among the contamination vector, Z , the contamination indicator matrix, B, and the uncontaminated vector, Y. 64 3.5 Dependence Structures in the Contamination Model In this section we consider different kinds of dependence structures, other than the de-pendence of the components of the contamination indicator matrix B (3.5). We illustrate the three dependence situations 2, 3 and 4 discussed in Section 3.3. 3.5.1 Independent-Contamination Model We consider the independent-contamination model where the probability of contamina-tion for the different entries of X does not depend on the uncontaminated vector, Y. Also, the contamination vector, Z , does not depend on which entries are being contami-nated and their values. Let Y, Z and B be independent. Then the model can be written as follows. X = (I-B)Y + BZ, (3.9) where for example Z = fx + Af(/i, E) and fx G W. Suppose that F and G are the distribution functions of Y and Z, respectively. Then, for the case p = 2 the distribution function H of X can be written as, H(xl,x2) = £ooF(xi,x2) + ei0Gi{xl)F2{x2) + €QIFI(XI)G2{X2) + enG(xi,x2), where e^j = P(Bn = k, B& = j) for k, j = 0,1. More generally, we write H(xi, ...,xp) = P{Bn = 0,Bi2 = 0,..., 2?i(p_i) = 0, Bip = 0)F(xi,x2, • • •, xp-i,xp) +P(Ba = l,Bi2 = 0,.. •, Bi{p_1)=0, Bip = 0)G(x1)F(x2,.. .,xp-i,xp) +P(Bn = l,Bi2 = l,..., 5j(p_i) = 1, Bip = 0)G(xl,x2,..., xp_i)F(a;p) +P(Bn = l,Bi2 = 1,.. .,Bi(p-i) = l,Bip = l)G(x1,x2,.. .,xp-i,xp). For simplicity we have omitted the subscripts indicating the corresponding marginal distributions of F and G. For instance, G(xi) stands for GJ(XJ), G(xi,Xj) stands for 65 Gij(xi,Xj), etc. Note that this model will be mainly used throughout the rest of the thesis. 3.5.2 Contamination Vector as a Function of Contamination Indicator Matrix We consider the dependence situation where the contamination vector, Z, depends on which entries are being contaminated. The probability of contamination for the different entries of -X" and the contamination vector, Z, are independent of the uncontaminated vector, Y. Let Y and B be independent and given £?, let Y and Z be independent. Then the model can be written as follows. X = (I - B)Y + BZ(B), where for example Y ~ N(0,1) and Z ~ N(fi(B), £ ) . For simplicity, let p — 2 and B be a bivariate Bernoulli random variable with P(BX = 1) = P(B2 = !) = £, as shown in with 1 < r < 1. For different values of B\ and B2 we define y.(B) as follows. where k\,k2,ki, k2 are arbitrary constants. Therefore, when B\ = 1, B2 = 0, X can be expressed as When Bi = 0, B2 = 1, X can be expressed as 66 When B\ = B2 = 1, X can be expressed as ' - ( : ) - " ( ( ! ' • • 3.5.3 Contamination Vector as a Function of Contamination Indicator matrix and Uncontaminated Vector We consider the dependence situation where the contamination vector, Z, depends on which entries are being contaminated and their values. The probability of contamination for the different entries of X is independent of the uncontaminated vector, Y. Let W ~ AT(0,1), Y ~ N(0,I) and B be independent. Then the model can be written as follows. X = (J - B)Y + Bg(Y, B, W). For simplicity, let p — 2 and B b e a bivariate Bernoulli random variable with P(B\ = 1) = P(B2 = 1) = e, as shown in Table 3.2 . For different values of Bi and B2 we define Z = g(Y, B, W) as follows. For B\ = 1 and B2 = 0, Z can be written as This implies that the distribution of Z is bivariate normal with mean K — 1 Yo. I \ h , and k2 J 1 r \ covariance T, = | , where k\ and k2 G R, and — 1 < r < 1. Notice that r is the r 1 / correlation coefficient between the contaminated coordinates and the uncontaminated coordinates in the effective cases. For B\ = 0 and B2 = 1, Z can be written as Z i \ ( Yx Z2 \ rYx + s/T^W 67 This implies that the distribution of Z is bivariate normal with mean K • E covariance ZJ = \ r 1 For £?i = B2 = 1, Z can be written as z=i zA = ( * w k z2 ) \ TYx + VT^w This implies that the distribution of Z is bivariate normal with mean K 1 T covariance £ = | | , where k\ and k2 G R. r 1 When BX = 1, B2 = 0, X can be expressed as which implies that X has a bivariate normal distribution with mean K v ( 1 r covariance 2J = I When 5 i = 0, B2 = 1, X can be expressed as * ) - ( • ? - ' ) * ( • ) • ^2 / \ r Y i + VT^W ) \k2 ) which implies that X has a bivariate normal distribution with mean K covariance 1J = i T 1 68 When Bi = B2 — 1, X can be expressed as r F j + x / T ^ r 2 " W / \ jfc2 ~ / j f c which implies that X has a bivariate normal distribution with mean K=\ 1 | and V h covariance 2J = \ r 1 A more concise way to express the bivariate distribution of X is in terms of densities. Suppose that k = k\ = k2 and k = kx = k2. Let fx(xux2)= ^2 h(xux2\hJ)p(hJ)-t,i=o,i For different values of B\ and B 2 , we have the following density functions: r / | . . f Xi — k — TX2 \ 1 , . . . x 2 | l , 0) = 0 I ——j===— 1 -^===0(a;2) = <pr(x\ — k,x2); h(xux2\0,l) = (j) ^ X 2 ^ A _ ^ ^ -j=L=4>(xi) - 4>T{xi,x2 - k); y / 1 _ r 2 ) ^ _ T 2 ^ x ^ = <f>T(xux2y, ii I-, -i \ . ( x2 — 2k — rx\ \ 1 -. ~ ~ / i (x i , a ;2 | l , l ) = 01 — ^ _ ^ 2 — I -^===<f>{xi - k) = <f>T(xi - k,x2 - k), where 0(-) is the standard normal density function, and 0T(-, •) denotes the joint density function of the bivariate normal with means equal to zero, variances equal to one and correlation equal to r. 69 3.6 Robust Estimation of Multivariate Location and Scatter: Problems and Motivation Regarding the processing of large data sets, "traditional" affine equivariant high break-down point robust multivariate location and scatter estimates have two main shortcom-ings: • Computational complexity. • Possible lack of robustness under the contamination model (3.4). All known affine equivariant high breakdown point estimates are solutions to a highly non-convex optimization problem and as such pose a serious computational challenge. The main challenge is to find good initial estimates from which one searches for a nearest optimum in hopes that it produces a global optimum. The initial estimates are invari-ably obtained by using some form of repeated random sub-sampling of Ns cases of the original data set, with the number of samples Ns determined in order to achieve a high breakdown point with high probability, e.g., with probability .99 or .999 (see for example Rousseeuw and Leroy, 1987). It happens that achieving this latter condition results in computational algorithms that have exponential complexity of order TP in terms of the dimension p of the data set. This rules out the use of such estimates for many data mining applications where one has in excess of 200 - 300 variables. In addition, the robust covariance matrix based on projections has a computational complexity n 2 in the number of observations if implemented in a naive manner. Empirical evidence indicates that a clever implementation can reduce this to approximately n * log(n). Since many data mining applications involve hundreds of thousands if not millions of rows (cases), the current projection estimates are not feasible for large data sets. In order to deal with such severe scalability limitations, Rousseeuw and Van Driesen (1999) proposed a "Fast MCD" (FMCD) method that is much more effective than naive subsampling for minimizing the objective function of the MCD. The FMCD seems capable 70 Number of Variables 2 5 15 20 25 50 100 % Rows Spoiled (all variables) 10 23 54 64 72 92 99 % Rows Spoiled (pairwise variables) 10 13 16 17 18 19 21 Table 3.6: Percentage of Rows with at Least One Contaminated Entry when each Variable Independently 5% Contaminated (e = .05) in the Independent-Contamination Model. of yielding "good" solutions without requiring huge values of Ns. But FMCD still requires substantial running times for large p, and it no longer retains a high breakdown point with high probability when n is large. We consider the possible lack of robustness of the traditional robust estimates under the contamination model (3.4) to be a more serious problem than their computational complexity. Traditional robust estimates have a high breakdown point for all p under the classical contamination model (3.2); in fact, if conveniently tuned, these estimates may attain the maximum breakdown point (BP = 1/2) for affine equivariant estimates (Davies, 1987). However, we will show that the traditional affine equivariant robust estimates may not satisfy the robust properties of high breakdown point under the con-tamination model (3.4). The possible lack of robustness of affine equivariant estimates can be hinted from the simple probability calculations shown in Table 3.6. For small fraction of contamination in each variable e = .05, consider the independent-contamination model (3.9) where Bi,B2,. • • ,BP are i.i.d. Binomial(l,e). Given the dimension of the data set p, the second row of Table 3.6 exhibits the percentage of cases (rows) with at least one contaminated entry considering all variables (columns). The third row of this table exhibits the percentage of cases (rows) with at least one contaminated entry considering all pairs of variables one pair at the time. This probability has been numerically calculated with the details given in the chapter appendix. 71 We can see from Table 3.6 that when we consider all the variables the percentage of contaminated cases increases dramatically for large number of variables. Traditional affine equivariant robust estimates were not designed to cope with these situations, since they work globally with all the variables and require that the majority of the data (more than half) be uncontaminated. On the other hand, if we consider two variables at the time the percentage of contaminated cases is not high even for large number of variables. In the following two chapters we offer a solution to the problems described above. We focus on considering the fewest possible dimensions of the vector we observe. We propose a coordinate-wise location estimate and a pairwise scatter estimate. Thus the proposed schemes do not involve all the coordinates of the vector, but rather use one dimension at the time for the location estimate and two dimensions at the time for the scatter estimate. In this way, we lessen the computational burden and attain the high breakdown point property under the contamination model (3.4). Using the smallest possible dimensions minimizes the fraction of contaminated cases which are used at each step in the computation of the estimates as suggested by the probability calculations in Table 3.6. The proposed estimates are not affine equivariant, but this often is an unnecessary property in large data sets such as data mining applications. Also, this is not a disad-vantage of the estimates since there are many practical situations where there exists a natural representation for the data (e.g. the form in which they have been measured) except perhaps for a shift and/or some unit changes and in which affine equivariance may not be necessarily a desirable property. 3.7 Chapter Appendix We wish to calculate the maximum proportion of contaminated rows that may occur un-der the independent-contamination model (3.9) when we consider all the pairs of 72 columns in a p-dimensional data set. Let Bkj (k = 1,..., n, j = 1,... ,p) be independent Bernoulli random variables with P (Bkj = 1) = e. Let n Sjj = y^jmax{Bki,Bkj}, i, j = 1,... ,p. k=l Then we are interested in studying pn = E { - max Sij \ . [n i<j ) (3.10) To investigate the behavior of pn, we generated 1000 random matrices of the form Ak2p \ k = 1,..., 1000, \ Aknl Akn2 ' ' ' Aknp / where the Akij's are independent Bernoulli random variables with P (A^j = 1) = e, for all k = 1,..., 1000, I = 1,..., n and j = 1,... ,p. The value of pn (3.10) is then estimated by ^ 1000 / i n N job" E r^f - E m a x M « « Aw) fc=i \ (=i / 1000 Our results for the case c = 0.05, n = 1000 and p = 2,5,15,20, 25, 50,100 are pre-sented in Table 3.6. Similar results were obtained for other values of n. 73 Chapter 4 Robust Estimation of Multivariate Scatter 4.1 Classical Scatter Estimate Suppose that Xx,..., Xn, where Xi G W, i — 1,..., n, are independent and identically distributed according to a multivariate distribution with mean /z and covariance matrix S. The classical and best known estimate of the covariance matrix £ is the method of moment estimate (MME) which is defined as follows. where p, = ^ 2~Zi=i ^i- Note that estimation of the correlation matrix R can always be derived from the relation R = DUD, where D = Diag (l/y/d~xl,..., l/y/a ) and • • •) °vv a r e t n e diagonal elements of the covariance matrix £ . The breakdown point is an important feature of the reliability of an estimate, as it indicates, roughly speaking, the smallest proportions of arbitrary values (outliers) that bring the estimate out of the boundaries of the parameter space. The definition of the breakdown point for the covariance matrix estimates is given in Section 2.2 of Chapter 2. Unfortunately, the breakdown point of the method of moment estimate (4.1) is 1/n, which indicates very poor resistance to outliers. In the last three decades, many attempts to overcome the poor resistance properties of the classical sample dispersion matrix (i.e. covariance and correlation matrices) have been (4.1) i=i 74 made. The robust proposals can be classified in two main categories: robust pairwise estimation and robust global estimation of the dispersion matrix. The first one has the advantage of being able to deal with missing values in the data set, but is not affine equivariant and often does not provide a positive definite matrix directly. The second category usually ensures affine equivariance and positive definiteness, but is less appropriate to deal with missing data. In addition, the main drawback remains the computational feasibility of such methods for high dimensional data sets. At this point, it seems appropriate to revisit old proposals for estimating scatter based on using only two variables at a time. The discussion above motivates the construction of robust dispersion matrices by using pairwise robust correlation (or covariance) coefficients as basic building blocks. Several such methods have been around for many years, but they have been mostly ignored because: (a) of the lack of affine equivariance, and (b) that the resulting dis-persion matrix built up from the pairwise estimates lacks positive definiteness. We are motivated to re-examine the pairwise approach because: (1) the lack of affine equivari-ance is not necessarily important for large data sets such as in data mining applications, and (2) there exist good methods for obtaining positive definiteness, such as Maronna and Zamar (2002), and Rousseeuw and Molenberghs (1993) who proposed three meth-ods; respectively, the shrinking method, the eigenvalue method and the scaling method. When the covariance itself is the quantity of interest, one should transform it to a positive definite matrix using one of these methods; while if some particular entries in the matrix are the values of interest, then the estimated values should provide a good estimation of the real values. The simplest pairwise methods are based on pairwise robust correlation or covariance estimates such as: (i) classical rank based methods, such as the Spearman's p and Kendall's r (see for example Abdullah, 1990); (ii) classical correlations applied af-ter coordinate-wise outlier insensitive transformations such as the quadrant correlation and 1-D "Huberized" data (see Huber, 1981, page 204); and (iii) bivariate outlier resis-tant methods such the method proposed by Gnanadesikan and Kettenring (1972) and 75 studied by Devlin, Gnanadesikan and Kettenring (1981). The pairwise approach is ap-pealing in that one can achieve high breakdown point on a pairwise basis that results in a high breakdown point for the overall covariance or correlation matrix, and at the same time reduces the computational complexity in the data dimension p from exponential to quadratic (from 2P to p2). This greatly increases the range of large data sets prob-lems to which robust covariance and correlation estimates can be applied, e.g., 200 - 300 variables becomes quite feasible. In this chapter, we will concentrate on estimates in the class (ii) of pairwise robust estimates originally introduced by Huber (1981). 4.2 Simple Class of Pairwise Robust Scatter Estimate The method we are proposing to estimate the scatter matrix draws on work done by Huber (1981). The work remains largely uninvestigated due to the fact that it is not an affine equivariant estimate. We focus on the estimation of correlation matrices, since estimation of covariance matrices can be derived in the same way. Huber defines robust correlation coefficient estimates as follows. Suppose Xi,..., Xn is a multivariate sample where Xi 6 W, i = l,...,n. Let Sj (j — 1,... ,p) be some robust scale estimates and let tj (j = 1,... ,p) be location M-estimates defined by the following equation: where ip(x) is an appropriate score function. The following two cases are of primary interest: • Huber Function ipc(x) = min {max {—c, x} , c} , where c e R+ is a user-chosen constant. 76 • Sign Function ib(x) = SGN(or), where SGN(x) has the values +1 for x > 0, -1 for x < 0, and 0 for x = 0. The robust correlation coefficient estimate fjk is now denned as the Pearson correla-tion coefficient computed on the transformed data Y^ - = ip ((Xij — tj) JSJ) , j = 1,..., k; j = l , . . . ,p , rjk ll y 2 1 V^ n y ¥ A/ n 2^ i=l -1 ij n 2—ii=\ 1 ik Notice that Y3- = Yk — 0 by definition of and tk. To save computing time and still gain robustness, we can use another robust location estimate tj = median{Xjj} and therefore the robust correlation coefficient estimate has the following form: f.k — n Hi=l (Yjj — Yj) (Yik — Yfc) ^ When )^ is the Huber function, we call this the Huberized correlation coefficient, and when ip is the sign function the estimate is the so-called quadrant correlation (QC) coefficient, which is the Huberized correlation coefficient with tuning constant c = 0, since limipJx) = SGN(rr). c->0 In the case of n observations of a p-dimensional random vector, we use the estimate fjk to estimate every correlation between Xj and Xk (j,k = 1,... ,p) to get the (j, k) entry of the correlation matrix R. The pairwise Huberized correlation matrix estimate can, therefore, be defined as R = (fjk)j,k=i,...,p-4.3 Performance of Pairwise Huberized Scatter Estimate In this section we report the results of a Monte Carlo study on the performance of the pairwise Huberized covariance matrix estimates in the contamination model (3.4). We 77 considered sample sizes n = lOp where p is the number of variables taking values 10,30 and 50. The data followed the independent-contamination model: where Y, B and Z are independent and the diagonal elements of B, Bi,..., Bp are i.i.d. Binomial(l, e). In view of the lack of equivariance of the pairwise Huberized covariance matrix es-timates their behavior may depend on the covariance structure; hence we generated correlated data as follows: Generate Y a s p-variate normals Np(0, S ) for i = 1,..., n, and Zi as Np(fj,Q, SI) for some /z0 = (no,..., po)p', where 5 = 0.1. Generate Bi,..., Bp as i.i.d. Binomial(l, e). The covariance matrix S generated in our simulation study was obtained as follows: 1. Using the condition number, which is defined as the square root of the largest eigen-values divided by the smallest eigenvalues C N = W T 1 , we generated multivariate normal data with mean vector 0 and covariance matrix, S A = Diag(Ai, . . . , Ap) as • Set the largest eigenvalue Ai = 1 and the smallest eigenvalue Xp — with equally spaced eigenvalues in between. • Generate Xi as p-variate normals Np(0, £>) for i = 1,..., TV = 100,000. 2. Using random orthogonal matrix, the correlation structure of the data set X, where X = Xi,..., XN, is decided at random by rotating the data set using a method proposed by Fang and Zhang (1990), which we describe briefly as follows: • Consider the n x p matrix X; • Generate a random matrix Y as Npxp(0, J p x p ) ; • Let U = Y(y'Y)~ll2. Thus U has a uniform distribution over the Stiefel manifold, the set consisting of all orthogonal random matrices; X = (J - B)Y + BZ, follows: 78 • Rotate the matrix X, Xr = XU which has a different correlation matrix. 3. The covariance matrix £ = Cov(X r ) . The pairwise Huberized covariance matrix estimate £ was obtained from the Huber-ized correlation coefficient estimates using the median as the location estimate and the median absolute deviation (MAD) as the scale estimate. We compared the estimated covariance £ with the true covariance £ using the following two metrics: 1. Euclidean distance or the straight line distance between the coordinates of £ and £ , which is given by where = 1,... ,p) are elements of £ and £ , respectively. 2. Another choice for the metric is to use the determinant of the covariance matrix £ , which is the generalized variance. The generalized variance converts the information on all the variances and covariances into a single number. Generalized variance also has interpretations in the p-space scatterplot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the mean vector. The metric of the determinants of £ and £ called eigenvalue distance is defined as follows. where A; and A; (i = 1,... ,p) are eigenvalues of £ and £ , respectively. 79 CN = 1 CN = 10 b CO d d o d 0.02 0.04 0.06 epsilon 0.08 0.10 0) o c ra to b co d o d 0.02 0.04 0.06 0.08 0.10 epsilon CN = 100 c = 0.0 ** c= 1.0 c = 1.5 0.0 0.02 0.04 0.06 0.08 0.10 epsilon Figure 4.1: Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination p 0 = 100 and p = 10. We ran 1,000 Monte Carlo simulations from the above distributions with C N = 1,10, 20, 50 and 100 and size of contamination p,0 — 5,10 and 100. The samples were the same for all estimates and for each combinations n, p, e and c, where c is the tuning con-stant of the Huber score function. We considered e = 0, .01, .02, . . . , .10 with c = 0,1,1.25 and 1.5. For all values of c, the results for the pairwise Huberized covariance estimates indicated (tables not shown here) that the eigenvalue distances increased with increasing values of the condition number; however, the Euclidean distances were not affected. The size of the contamination did not affect the results either. Figures 4.1 - 4.3 illustrate the varying degrees of change in the eigenvalue distances for various values of the condition number, C N = 1, 10 and 100 when the size of the contamination is large, p0 = 100. 80 CN = 1 CN = 10 CD O c CO co ci •tf d o d 0.02 0.04 0.06 epsilon 0.08 0.10 0.02 0.04 0.06 0.08 epsilon 0.10 CN = 100 CD O c 00 co d d 0.0 0.02 0.04 0.06 0.08 0.10 epsilon Figure 4.2: Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination fi0 = 100 and p = 30. For p = 10,30 and 50 respectively, each figure plots the eigenvalue distance against the fraction of contamination e, for c = 0, l a n d 1.5. From the plots, we can see that for C N = 1,10 and 100, the eigenvalue distance between the estimated covariance and the true covariance decreases as the dimension of the data increases. In addition, for all data dimensions the effect of the different values of the tuning constant c is minimal as the fraction of contamination e increases. Moreover, when C N = 1 the performance of the pairwise Huberized covariance estimates were not affected with the value of the tuning constant c. The performance of the Fast M C D (FMCD) covariance estimates were also monitored. 81 CN = 1 CN = 10 0.02 0.04 0.06 0.08 0.10 0.02 0.04 0.06 0.08 0.10 epsilon epsilon CN = 100 0.0 0.02 0.04 0.06 0.08 0.10 epsilon Figure 4.3: Performance of Pairwise Huberized Covariance Estimates using Eigenvalue Metric, for Data Sets with Size of Contamination JJJQ = 100 and p = 50. We used the same sampling situations with CN = 1 since the Fast MCD covariance estimates are affine equivariant. The performance of the Fast MCD covariance estimates for the size of contamination, II0 — 5,10 and 100 is shown in Figure 4.4. The figure displays plots of the eigenvalue distance versus the fraction of contamination e, for p = 10, 20 and 30. From the plots, we can see that in general, the Fast MCD estimates perform poorly for large contamination sizes and that their performance worsen considerably as the dimension p increases. We also considered comparing the performance of the pairwise Huberized covariance estimates with the performance of the Fast MCD covariance estimates using eigenvalue 82 P = zo 0.06 O.OS epsi lon M u - 1 0 Mu>5 M u - 1 O M u - 5 O.OZ 0.04 0.06 0.08 0.10 0.12 epsilon Figure 4.4: Performance of Fast MCD Covariance Estimates using Eigenvalue Metric, for Data Sets with Sizes of Contamination p,0 = 5,10,100 and p — 10,20,30. and the Euclidean distances. We generated 1,000 data sets from the above distributions with CN = 1, fi0 = 5,10 and 100, and e = .05, .10, .15, .20, .25 and .30. We used c = 0 for the Huber score function, which is the quadrant correlation (QC) coefficient. The performance results of the pairwise Huberized covariance estimates and the Fast MCD covariance estimates are displayed in Tables 4.1, 4.2 and 4.3 for p = 10, 20 and 30, respectively. 83 Small Medium Large Pairwise FMCD Pairwise FMCD Pairwise FMCD Eps dl d2 dl d2 dl d2 dl d2 dl d2 dl d2 0.05 0.073 0.022 0.192 0.023 0.072 0.022 0.207 0.030 0.063 0.021 0.276 2.036 0.10 0.200 0.034 0.268 0.103 0.192 0.031 0.641 0.368 0.194 0.035 2.274 35.684 0.15 0.379 0.056 0.639 0.168 0.375 0.054 1.311 0.633 0.366 0.057 3.787 73.045 0.20 0.588 0.092 0.908 0.254 0.599 0.099 1.775 1.147 0.581 0.099 4.887 116.669 0.25 0.856 0.159 1.128 0.337 0.841 0.151 2.118 1.204 0.848 0.156 5.681 137.926 0.30 1.194 0.295 1.287 0.386 1.162 0.252 2.323 1.523 1.166 0.221 6.075 155.076 dl: Eigenvalue Distance ; d2: Euclidean Distance Small: //Q = 5; Medium: = 10; Large: = 100 Table 4.1: Performance of Pairwise Huberized (c = 0) and Fast MCD Covariance Esti-mates for Data Sets with p = 10. Small Medium Large Pairwise FMCD Pairwise FMCD Pairwise FMCD Eps dl d2 dl d2 dl d2 dl d2 dl d2 dl d2 0.05 0.070 0.008 0.132 0.021 0.065 0.009 0.463 0.081 0.057 0.008 1.812 10.475 0.10 0.215 0.016 0.610 0.067 0.211 0.015 1.359 0.286 0.205 0.015 35.573 oo 0.15 0.392 0.026 0.947 0.108 0.389 0.028 1.914 0.487 0.379 0.027 166.804 oo 0.20 0.598 0.044 1.179 0.150 0.606 0.046 2.264 0.590 0.594 0.044 oo oo 0.25 0.856 0.072 1.355 0.177 0.853 0.073 2.511 0.791 0.857 0.083 oo oo 0.30 1.170 0.123 1.486 0.219 1.177 0.121 2.695 0.838 1.184 0.093 oo oo dl: Eigenvalue Distance ; d2: Euclidean Distance Small: po = 5; Medium: /io = 10; Large: fio = 100 Table 4.2: Performance of Pairwise Huberized (c = 0) and Fast MCD Covariance Esti-mates for Data Sets with p = 20. 84 Small Medium Large Pairwise F M C D Pairwise F M C D Pairwise F M C D Eps d l d2 d l d2 d l d2 d l d2 d l d2 d l d2 0.05 0.063 0.005 0.267 0.018 0.065 0.006 0.834 0.106 0.068 0.005 10.926 00 0.10 0.212 0.010 0.730 0.049 0.211 0.010 1.347 0.186 0.211 0.018 88.570 oo 0.15 0.384 0.018 1.042 0.077 0.388 0.020 1.640 0.273 0.390 0.019 251.030 00 0.20 0.589 0.032 1.263 0.105 0.595 0.031 2.150 0.501 0.600 0.029 00 oo 0.25 0.849 0.047 1.425 0.127 0.855 0.052 2.935 1.259 0.857 0.049 00 oo 0.30 1.166 0.082 1.544 0.148 1.178 0.085 4.067 4.330 1.176 0.082 00 oo d l : Eigenvalue Distance ; d2: Euclidean Distance Small: /in = 5; Medium: fx0 = 10; Large: /in = 100 Table 4.3: Performance of Pairwise Huberized (c = 0) and Fast M C D Covariance Esti-mates for Data Sets with p = 30. We can see that, in general, for both metrics the Fast M C D covariance estimates perform poorly for large contamination sizes and that their performance worsen considerably as the dimension p increases. However, the performance of the pairwise Huberized covari-ance estimates were not affected as the dimension p and the size of contamination p,o increase. The performance of the two estimates becomes dramatically different for large p as the fraction of contamination e increases. 85 4.4 Asymptotic Properties of Huberized Correlation Coefficients The objective of this section is to show that under certain regularity conditions the Huberized correlation coefficient estimates are consistent and asymptotically normal. 4.4.1 Consistency of Huberized Correlation Coefficients The next theorem shows that under certain regularity conditions, if the location and the scale are consistent estimates, then the Huberized correlation coefficient estimates are also consistent. T H E O R E M 4.1 - Consistency of Huberized Correlation Coefficients -Let (Xi, F i ) , . . . , (Xn, Yn) be a random sample from a bivariate distribution. Let fix and fiy be location estimates, and bx and by be scale estimates. Let ip : R — > • R satisfy the following: P.l ip(-u) = -ip(u), u > 0; P.2 ip(u) is non-decreasing and lim ip(u) > 0; P.3 ip is continuously differentiable; P.4 ip, ip' and ip'(u)u are bounded. Then if, px - fix (a.s.) fly -->• fiy (a.s.) ox --> °~X (a.s.) by -->• Oy (a.s.) as n —)• oo ; then f —>• r almost surely as n —>• oo where 86 it ( » - i p (*#*)) (»(*»*) - i (^)) (* (^) - i £ * (^))' i E (* ( ^ ) - i £ * (V))' and g{ ( » ( ^ ) - ^ ( ^ ) ) ( » ( ^ ) - ( ^ ) ) } ( ^ ) - ( ^ ) } * { * ( ^ ) " ^ ( ^ ) I ' The relation between r and p = Corr(X, V) will be studied in Section 4.5. 4.4.2 Asymptotic Normality of Huberized Correlation Coefficients Having shown the consistency of the Huberized correlation coefficient estimates, we turn our attention to their asymptotic distribution. We will focus on the MM-location esti-mates, an important special case of robust location estimates, which is defined below. We need some definitions and assumptions that will be used in the statement of Theorem 4.2 and its proof. Assume that ip : R —¥ R satisfies P . l - P.4 from Theorem 4.1. Moreover, we will assume that the real function x '• K. —>• R+ satisfies the following: A . l x(0) = 0, x{-u) = x(u), u > 0 and supx(«) = 1; ueR A.2 xiu) is non-decreasing in u > 0; A.3 x IS continuously differentiable; A.4 X) x'•> x'(u)u a r e bounded. We now define the S-scale family of estimates (Rousseeuw and Yohai, 1984). 87 DEFINITION 4 . 1 - S-scale estimates - Let X\,... ,Xn be a random sample and 0 < b < 1 / 2 . The S-scale frn is defined as Naturally associated with this family are the S-location estimates. DEFINITION 4 . 2 - S-location estimates - Let X\,...,Xn be a random sample, and for each t € R let sn(t) be as in (4-3). The S-location estimate fin is fin = arginf sn(t). In analogy with Yohai (1987) we will refer to the M-location estimates calculated with an S-scale as MM-estimates. DEFINITION 4 . 3 - MM-location estimates - Let X\,...,Xn be a random sample and on be an S-scale estimate. The solution fin of is called the MM-location estimate of X\,... ,Xn. The following theorem states the asymptotic normality of the Huberized correlation coefficient estimates under certain regularity conditions. T H E O R E M 4 . 2 - Asymptotic Normality of Huberized Correlation Coefficients -Let (Xi, Yi),..., (Xn, Yn) be a random sample of independent and identically distributed random vectors with elliptically symmetric distribution. We consider the Huberized cor-relation coefficient estimate defined as follows. <jn = inf sn(t), where sn(t) is given by ( 4 . 3 ) r ( 4 . 4 ) 88 where fix and fiy are MM-location estimates, dx and by are S-scale estimates. Then, v ^ ( f - r ) N(0,AV), as n —>• oo ; where The variance of the limiting distribution of y/n[f — r) can be expressed as Cn „ , , Co? u2 , , u2 , u . . u . u2 4 V = — + ( 1 / 4 ) - ^ — + ( 1 / 4 ) - | c 1 2 ( — ) - a 1 3 ( - 2 - ) + (l/2)o-23 „ , , n ' \-"7 -v 9 o y ~iov 9 y 1 \^ -j~to 9 9: where and cru = Varl ip I ———) V f ^ ^ Cy / V °x c 1 2 = C ^ I ^ W ^ ^ V ^ " ^ {^( oy ) \ ox J \ Cy cy / V °x J \ crx c 2 2 = Var { ip'z ( ——— cy c 2 3 = Cov {ip' , ip 0-33 - Var oy J \ ox X - n x ox oy J \ ox 2 fy ~ HY' w = E<ip E{ip oy 2 (X ~ Vx ox 89 To simplify the notation, define where px, HY o,re locations and ox, O~Y a r e scales of X and Y, respectively. Then the asymptotic variance can be written as follows. AV = ^ ^ + ( i / 4 ) C Q 4 ; c ° 2 c * n C02C20 c02 C02C20 + ( 1 / 4 ) C 4 ° ; C ' ° C ? 1 - ( c 1 3 - c n c 0 2 ) ^ -c20 C02C20 C02C20 c 2 - ( c 3 i - c n C 2 o ) - 2 ^ - + ( l /2)(c 2 2 - c 02C2o)- 2- IT-- (4-5) C20C02 c02 c20 The proofs of Theorems 4.1 and 4.2 are relatively straightforward and given in Sec-tions 4.9.1 and 4.9.2 of the chapter appendix. Estimating the Variance of the Huberized Correlation Coefficients To estimate the asymptotic variance (4.5) of the Huberized correlation coefficient esti-mate, replace C y by c\j which is defined as follows. where fix and fly are MM-location estimates and &x and by are S-scale estimates of X and Y, respectively. We provide a C source called within Splus. The program computes the Huberized correlation coefficient estimate r (4.4) and its standard error SE(f). The latter is cal-culated from the asymptotic variance of the estimate (4.5). Below we give the skeleton of the program for computing the robust estimate. The input is a data set containing n 2-D points of the form (Xi, Yj) and the tuning constant of the Huber score function c. 90 1. For each variable X and Y, do: (a) Compute the MM-location estimates and S-scale estimates to get fix, P-Y, ^ X and by. (b) For i = 1 to n, do: ibc [*fj*) and ibc ( ^ ) . 2. Compute the Huberized correlation coefficient estimate r = — \ / ( a 2 0 - C f 0 ) ( c 0 2 - C ^ ) 3. Compute the standard error SE(f) = \f^~-4. Return the Huberized correlation coefficient estimate and its standard error: (f, SE(f)). The score function used to define the Huberized correlation coefficient estimate is not continuously differentiable. Since Theorem 4.2 requires this property, it is uncertain that the asymptotic variance formula (4.5) can be used to estimate the standard error of the Huberized correlation coefficient estimate. However we will show here, by means of the following numerical experiment, that the formula (4.5) can still be used. . To evaluate the accuracy of the estimated standard error of the Huberized correlation coefficient estimates, SE(f), we conducted a Monte Carlo experiment which consists of the following: Generate 5000 Monte Carlo samples of size n from a bivariate normal distribution with mean vector 0 and covariance matrix S — j ^ ^ J • Compute fi and \ P 1 / SE(fj) for each sample (i = 1,..., 5000), using the above computer program. Compute the empirical approximation to the standard deviation; SD (r i , . . . ,r5ooo ) - SD(r), and the mean of the standard errors; mean(SE(ri),..., SE(f50oo)) = SE. To give some mea-sure of variability of the standard errors, compute the standard deviation of the standard errors; SD(SE(ri) , . . . , SE(f5 0oo)). The experiment was carried out for different sample sizes, n — 20, 30, 50 and 100 and with different correlation coefficients, p = 0, .1, .25, .50, .75, .90 and .99. The results are 91 n = 20 n = 30 n = 50 n = 100 p SD(f) SE SD(f) SE SD(f) SE SD(f) SE 0.00 0.10 0.25 0.50 0.75 0.90 0.99 0.228 0.211 (0.021) 0.224 0.210 (0.023) 0.220 0.202 (0.028) 0.194 0.174 (0.040) 0.140 0.115 (0.044) 0.076 0.056 (0.031) 0.010 0.007 (0.005) 0.187 0.176 (0.013) 0.185 0.175 (0.014) 0.177 0.168 (0.018) 0.155 0.143 (0.027) 0.106 0.094 (0.029) 0.057 0.046 (0.020) 0.007 0.005 (0.003) 0.141 0.139 (0.007) 0.143 0.138 (0.008) 0.139 0.132 (0.011) 0.121 0.113 (0.016) 0.083 0.073 (0.018) 0.042 0.035 (0.012) 0.005 0.004 (0.002) 0.099 0.099 (0.003) 0.100 0.098 (0.004) 0.097 0.095 (0.005) 0.083 0.080 (0.008) 0.057 0.052 (0.009) 0.029 0.025 (0.006) 0.003 0.003 (0.001) Table 4.4: Evaluation of the Asymptotic Standard Errors of the Huberized Correlation Coefficient Estimates with c = 1.00. displayed in Tables 4.4 - 4.6 for the tuning constant of the Huber score function c = 1,1.25 and 1.50, respectively. For each sample size, n, the first column contains the Monte Carlo standard deviation of the Huberized correlation coefficient estimates. The second column contains the mean of the standard errors of the Huberized correlation coefficient estimates and the corresponding Monte Carlo standard deviation within parentheses. From the tables, we can see that the mean standard errors, SE, approximate the empirical standard error of the estimates, SD(f), closely. In particular, for large sample sizes and high correlations the difference between them is small and they have small standard deviation. Tables for different values of c are fairly similar, indicating that there is not much loss of efficiency by using relatively small value of c such as c = 1. 92 n = 20 n = 30 n = 50 n = 100 p SD(f) SE SD(f) SE SD(f) SE SD(f) SE 0.00 0.10 0.25 0.50 0.75 0.90 0.99 0.228 0.209 (0.025) 0.230 0.208 (0.027) 0.219 0.199 (0.033) 0.191 0.167 (0.042) 0.126 0.106 (0.042) 0.064 0.050 (0.026) 0.009 0.006 (0.004) 0.184 0.175 (0.016) 0.183 0.174 (0.016) 0.178 0.166 (0.021) 0.153 0.139 (0.029) 0.102 0.087 (0.027) 0.049 0.041 (0.017) 0.006 0.005 (0.003) 0.144 0.138 (0.009) 0.143 0.137 (0.010) 0.136 0.131 (0.012) 0.117 0.109 (0.017) 0.074 0.068 (0.017) 0.036 0.031 (0.010) 0.004 0.003 (0.001) 0.101 0.099 (0.004) 0.097 0.098 (0.005) 0.095 0.094 (0.006) 0.080 0.078 (0.009) 0.053 0.048 (0.009) 0.025 0.022 (0.005) 0.003 0.002 (0.001) Table 4.5: Evaluation of the Asymptotic Standard Errors of the Huberized Correlation Coefficient Estimates with c = 1.25. n = 20 n = 30 n = 50 n = 100 p SD(f) SE SD(f) SE SD(f) SE SD(f) SE 0.00 0.10 0.25 0.50 0.75 0.90 0.99 0.229 0.207 (0.029) 0.228 0.206 (0.030) 0.218 0.196 (0.035) 0.185 0.162 (0.043) 0.123 0.101 (0.040) 0.059 0.046 (0.024) 0.007 0.005 (0.003) 0.186 0.174 (0.019) 0.182 0.173 (0.019) 0.178 0.164 (0.023) 0.146 0.135 (0.029) 0.096 0.084 (0.028) 0.045 0.038 (0.016) 0.006 0.004 (0.002) 0.143 0.137 (0.011) 0.141 0.136 (0.011) 0.134 0.130 (0.014) 0.113 0.106 (0.018) 0.071 0.065 (0.016) 0.034 0.029 (0.009) 0.004 0.003 (0.001) 0.102 0.099 (0.005) 0.101 0.098 (0.006) 0.097 0.093 (0.007) 0.079 0.076 (0.009) 0.049 0.046 (0.008) 0.023 0.021 (0.005) 0.003 0.002 (0.001) Table 4.6: Evaluation of the Asymptotic Standard Errors of the Huberized Correlation Coefficient Estimates with c = 1.50. 93 4 . 5 Bias in Quadrant and Huberized Correlation Coefficients In this section, we discuss the bias that the quadrant and the Huberized correlation coefficient estimates may have due to the fraction of contamination in the data and because of the structure of the estimates. Therefore, we need to distinguish between two kinds of bias in the quadrant and the Huberized correlation coefficient estimates. For simplicity and without loss of generality, we will restrict our attention to the case p = 2. Since in this case there are only two variables involved there is not much loss in using the classical contamination neighborhood, H£(p) = {H:H=(l- e) H0 + etf} , (4.6) where p is the true correlation coefficient of the two variables under the nominal dis-tribution HQ, p(HQ). For example H0 = N (0, £ ) , where S = j 1 P J and H is an V p i / arbitrary and unspecified distribution. Let Xi,... ,Xn and Yi,...,Yn be i.i.d. H and H0, respectively, where Xi and Yi G K 2 , i — 1,..., n. Using the consistency result in Theorem 4.1, then f(XuX2,...,Xn) -> r(H) a.s. f(YuY2,...,Yn) -+ r(H0) a.s. as n —> oo. Because of the fraction of contamination included in H, r(H) will typically be asymptot-ically biased. The asymptotic bias of r(H) with H G rle(p), where rle(p) is the family distribution of H generated by (4.6), can be written as b(r,H) = \r(H)-r(H0)\, and the corresponding maximum asymptotic bias over ri€(p) can be expressed as Br(e)= sup \r(H)-r{H0)\. 94 The other bias is the intrinsic bias that occurs at the nominal model H0 because of the structure of the estimate, which requires transforming the data. This transformed data has a slightly different correlation than the correlation of the original data, so r(H0) ^ p{H0). Thus, we define the maximum "overall bias" (OB) of the correlation coefficient r(H) as follows: 0 B = sup \r(H) - p(H0)\. To make this idea clear, we re-write the maximum overall bias as follows: 0 B = sup \r(H) - r(HQ) + r(H0) - p(H0)\. (4.7) Hence, we can see that the overall bias (4.7) is composed of two different biases: • Asymptotic bias, \r(H) — r(H0)\, occurs due to the fraction of contamination in the data set. - .. • Intrinsic bias, \r(H0) — p(H0)\, happens because of the data transformation, "Hu-berizing" the data. It is well known that because of the data transformation, the nature of the data will change. Specifically, when X and Y are jointly normal with correlation p, the limiting value r of f satisfies \r\ < |p|, with strict inequality, except in the trivial cases |p| = 0,1. The next theorem states this result. T H E O R E M 4.3 Let X,Y be jointly normal, with marginals N(0,1), with E{XY} = p. Suppose that f and g are measurable functions such that IE{f(X)2} and IE{g(Y)2} are finite. Then, the correlation of f(X) and g(Y) is less than or equal to p in absolute value. The folklore traces this result back to Kolmogorov, although we could not find published proof. Therefore, we give the proof of this theorem in Section 4.9.3 of the chapter appendix. 95 To verify Theorem 4.3 via empirical evidence, we conducted a Monte Carlo simulations using Huber score function with different values of the tuning constant c. A random sample of size n = 100,000 was generated from a bivariate normal distribution with mean vector 0 and covariance matrix £ = [ ^ ^ j. Using the Huber score function W i / for a specific value of the tuning constant c, the data were transformed (Huberized data). Then, the correlation coefficient of the transformed data is calculated from the following formula: VE{^(x)}E{^(r)}' We compared the correlation coefficient of the transformed data r with the correlation coefficient p of the generated data for different values of the tuning constant c. We show the differences between the two correlations in Figures 4.5 and 4.6, which display plots of the correlation coefficient of the transformed data r versus the correlation coefficient p for different values of the tuning constant c. We see that the intrinsic bias decreases for larger values of the tuning constant c and becomes considerably smaller for c > 1. We also see that for moderate positive correlation coefficients (.5 < p < .7) the magnitude of under estimation is larger. On the other hand, for moderate negative correlation coefficients (—.7 < p < —.5) the magnitude of over estimation is larger. To correct the intrinsic bias under the assumption of a Gaussian model we can use an appropriate non-decreasing transformation function, f = gc(r). (4.9) Specifically, for the quadrant correlation (QC) coefficient Huber (1981) suggested the following transformation: <7QC(r) = sin ((7r/2)r) . (4.10) For general cases such as Huber score function with tuning constant c > 0, we suggest that gc(r) can be obtained by numerical means. Using the Monte Carlo simulations, we 96 c = 0 .00 c = 0 .25 -1.0 -0.5 0.0 0.5 1.0 -1.0 -0.5 0.0 0.5 1.0 rho . rho Figure 4.5: Intrinsic Bias of Huberized Correlation Coefficient Estimates. calculated r using formula (4.8) and denoted by r = r(p,c). Therefore, numerical tables can be obtained for different values of c and correlation coefficients, p. Each table allows us to read the value of p, denoted by p = g c(r), given r and c. However, if the value of r is not in the numerical tables then we can use interpolation to get gc(r). The pairwise Huberized correlation matrix estimate R = (fjk)jk=l p is a positive definite matrix. This is because it is constructed from the Pearson correlation coefficient estimates fjk (4.2) of the outlier-free transformed data. However, when the correlation coefficient estimates are corrected for the intrinsic bias, the corrected correlation matrix estimate R = (fjk)-k=zl p is not positive definite. Fortunately, we have seen that for large values of the tuning constant c, e.g., c = 1.00 or so, the intrinsic bias is very small (less than .05) and, therefore, fjk « fjk. In such cases we recommend using fjk to preserve positive definiteness. On the other hand, when the bias correction is needed, 97 c=1.25 c = 1.50 •1 .0 -0 .5 0 .0 0 .5 1.0 - 1 .0 -0 .5 0 . 0 0 .5 1.0 rho rho Figure 4.6: Intrinsic Bias of Huberized Correlation Coefficient Estimates. there are intuitively appealing methods for adjusting the positive definiteness of the resulting scatter matrix. One such method is introduced by Maronna and Zamar (2002) which we will discuss in the next section. 4.6 Positive Definite Pairwise Robust Scatter Estimates In this section, we describe a general method to obtain positive definite robust scatter estimates. The method was introduced by Maronna and Zamar (2002) for any pairwise robust scatter estimate. They applied their method to the bivariate outlier resistant estimate obtained by Gnanadesikan and Kettenring (1972) and Devlin, Gnanadesikan and Kettenring (1981). However, we will show that when applying this method to the quadrant correlation coefficient estimate we will obtain a pairwise robust scatter estimate 98 which is computationally more feasible. We now briefly describe Maronna and Zamar's (2002) method for correction of positive definiteness. Recall that if £ is the covariance matrix of the p-dimensional random vector X, then a2(a'X) = a'£a, (4.11) for all a e l p and a denotes the standard deviation. Let ^ p 3=1 where Ai < A 2 < . . . < Xp are eigenvalues of £ and a, (j — 1,..., p) are the corresponding eigenvectors. One notices that, when £ is the sample covariance, the Ay's are the variances of the projected data on the direction of the corresponding eigenvectors. To solve the negative eigenvalues problem they proposed to replace the eigenvalues by the square of a robust scale estimate of the corresponding principle components in formula (4.11). Xi = a2(a.'jXl, a;X2,..., a'jXn), j = l,2,...,p. Furthermore, Maronna and Zamar (2002) show that the estimate can still be improved upon by means of a re-weighting step. Hence, the final output is weighted mean and covariance matrix with weights based on the Mahalanobis distances: di = d{Xi) = {Xi - p.)'^~\Xi - A). Let W be a weight function and define p,w, £ T O as the weighted mean and covariance matrix, where each Xj, i = 1,..., n has weight Wi = W(di), that is, n _ EIU WjXj ~ _ E r = i ^ ( * * - A J ( ^ - A J ' ^ EIU Wi ' w LIU Wi They used the simplest W which is the "hard rejection", with W(d) = I(d < d0) and _ Xp ( /3)med(c?i , . . . , r i n ) d°- W> ' where xl(P) i s t r i e /3-quantile of the chi-square distribution with p degrees of freedom. 99 4.6.1 Preliminary Estimation of Scatter Matrix We have already seen that the general method can be applied to any robust estimate of the scatter matrix. Since we are interested in comparing the computational perfor-mance of various estimates of the scatter matrix, we will discuss two such estimates the Gnanadesikan and Kettenring (GK) and the quadrant correlation (QC). The Gnanadesikan and Kettenring estimate is based on the following identity: Cov(X, Y) = \ [a2(X + Y) — a2(X — Y)] , where X and Y are random variables and a is the standard deviation. Gnanadesikan and Kettenring (1972) proposed to define a robust covariance matrix by using a robust scale as cr; they used a trimmed standard deviation. The resulting matrix is symmetric, but not necessarily positive definite and is not affine equivariant. Genton and Ma (1999) calculated its influence function and asymptotic efficiency. Maronna and Zamar (2002) suggested the r-scale introduced by Yohai and Zamar (1988), which is a truncated standard deviation, and a weighted mean to be the robust scale and location respectively. Define the functions: Wc(x) = ( l - (^)2) I(\x\ < c); Pc(x) = mm{x2,c2). Let X = X \ , . . . , Xn be a univariate sample; and put oo = MAD(X) = med (\X - med(X)|); Wi - W C l ^ - ™ e d M j j where /(•) is the indicator function and "med" denotes the median. Now, the weighted mean and the r-scale estimates are defined as follows: K ] " EILiwi ' a { X ) \ ° » / " To combine robustness and efficiency, Maronna and Zamar (2002) set c\ = 4.5 and c 2 = 3. Ma and Genton (2001) advocated the use of the scale estimate Qn proposed 100 by Croux and Rousseeuw (1992b) and Rousseeuw and Croux (1993), but Maronna and Zamar prefer r-scale estimate for reasons of speed. It is well known that robust estimates suffer from lack of computational efficiency. Thus, we propose that better computational speed can be attained by using the quadrant correlation estimate in place of GK in Maronna and Zamar (2002) method for estimating scatter matrix. The quadrant correlation is the sample correlation coefficient of the signs of the differ-ences between the data and their respective medians. Let X i , . . . , Xn be a multivariate sample where Xi 6 W,i = l,...,n with medians Mj (j = l,...,p). The quadrant correlation coefficient estimate is defined as follows: Zti SGN(X a - MQSGN(X i f c - Mk) Pik — > n where I, k — 1,... ,p. The quadrant correlation coefficient estimate can be corrected for the intrinsic bias as follows: pik = sin (rjPik) • Let MAD(Xu) and MAD(Xik) be the scale estimates of Xit and Xik, respectively. Then the covariance estimate can be defined as: alk = MAD(Xu)MAD{Xik)plk. From the &\k the initial robust covariance matrix estimate is formed: So = {&lk\l,k=l,...,p-The final positive definite robust covariance matrix is formed by applying Maronna and Zamar's (2002) method for correction of positive definiteness followed by the final re-weighting step. 101 4.6.2 Computational Complexity and Computing Times The resulting robust dispersion matrix has different computation complexity for different robust methods. The pairwise approach is appealing in that it reduces the computational complexity in the data dimension p from exponential to quadratic (from 2P to p2). The computational complexity of some robust methods are presented below. • Stahel-Donoho (SD) estimate with naive implementation has computation complex-ity 0(2P) -0(n2); whereas, with better implementation the Stahel-Donoho estimate has computation complexity 0(2P) • 0(n- log(n)). • MCD estimate has computation complexity 0(2P) • 0(n); whereas, the FAST MCD estimate has the same computation complexity with a much better constant for 0{n). • The robust pairwise scatter estimate has computation complexity 0(p2) • 0(n). We compared the computing times of the pairwise covariance matrix estimates ob-tained using the quadrant correlation (QC) estimate and the GK estimate (with r-scale robust estimate). The Maronna and Zamar (2002) method is applied to correct for pos-itive definiteness of the resulting covariance matrix estimates with the final re-weighting step. Both programs for the two estimates were written in C and called within Splus for Unix. Moreover, we made the programs available as a built-in Splus command in the recently released Splus, namely Splus 6 for Windows and Splus 6.1 for Unix. The command lines for the estimates (using robust library) are the following: • The Gnanadesikan-Kettenring estimate (with r-scale robust estimate); covRob(stack.dat, estim = "pairwisegk"); • The quadrant correlation (QC) estimate; covRob(stack.dat, estim = "pairwiseqc"). We carried out some timing experiments for a range of sample size n and dimension p. We ran the experiments on Splus (version 6.0, SunOS 5.6). To make the calculation 102 0 100 200 300 400 500 800 Dimensions Figure 4.7: Scalability of Dimensions for the Covariance Estimates obtained using QC and GK for n = hp. of the median run faster, we did not use the built-in Splus command "median", which employs sorting; but a selection algorithm (the procedure "select" in Section 8.5 of Press et al., 1992), which is linear in n. We generated standard normal random samples with different values of n and p. The computing times of the covariance estimates obtained using QC and GK for different data dimensions with sample sizes n = hp is shown in Figure 4.7. The figure displays plots of the computing time (time in seconds) versus the data dimension, p. We can see that QC requires less computing time than GK. Also, the computing times of GK and QC tend to increase quadratically as p increases. However, QC has a smaller constant for the quadratic polynomial in p. The computing times of the covariance estimates obtained using QC and GK for different sample sizes with data dimension p = 50 is shown in Figure 4.8. The figure displays plots of the computing time (time in seconds) versus the sample size, n. We can see that QC is much faster than GK. Also, the computing times of GK and QC tend to increase linearly as n increases. However, QC has a smaller slope. Hence, from Figures 4.7 and 4.8 we can conclude that the computing time growth of the pairwise scatter estimates is linear in n and quadratic in p, which confirms the above complexity claim. 103 0 20000 40000 60000 80000 100000 Sample Sizes Figure 4.8: Scalability of Sample Sizes for the Covariance Estimates obtained using QC and GK for p = 50. We compared the computing times of the pairwise covariance estimates obtained using QC and GK (with r-scale robust estimate) with the Fast MCD (FMCD) covariance estimates. The computing times of the covariance estimates obtained using QC, GK and FMCD for different sample sizes with dimensions p = 10, 30 and 50 are shown in Figure 4.9, which displays plots of the computing time (time in seconds) versus the sample size, n. We can see that for higher dimensions, FMCD requires a much larger amount of computing times than QC and GK. The computing times of the covariance estimates obtained using QC, GK and FMCD for larger sample sizes are shown in Figure 4.10, which displays plots of the computing time (time in seconds) versus the sample size, n. We can see that for larger sample sizes, QC still requires less computing times than GK and FMCD. On the other hand, for higher dimensions, GK requires a much larger amount of computing times than QC and FMCD. We also notice that, for larger sample sizes, FMCD requires less amount of computing times. The reason that FMCD requires less computing times for a larger sample size is when n is larger than a certain n0 (the default is 6000) the FMCD algorithm applies an ingenious splitting procedure to reduce the number of evaluations. 104 We also ran the timing experiments for contaminated data. The contaminated sam-ples were p-variate normal e-contaminated distribution, with p taking the values 10, 30 and 50. Generate Xi asp-variate normals Np(0, /) for i = 1,..., n—m, where m = [ne]; [•] denotes the integer part, and as N p(^ 0, S2I) for i > n—m. We chose n0 = (100,..., 100)p, S = 0.1 and e = 0.20. The computing times of the covariance estimates obtained using QC, GK and FMCD for different sample sizes with dimensions p = 10, 30 and 50 are shown in Figure 4.11. The computing times of the covariance estimates obtained using QC, GK and FMCD for larger sample sizes are shown in Figure 4.12. The figures display plots of the computing time (time in seconds) versus the sample size, n. From the plots, we can see that the three estimates have similar timing behavior compared to the clean data timing situations. 105 p = 10 p = 5 0 p = 3 0 o o 0 1000 2000 3000 4000 5000 N Figure 4.9: CPU Time of the Covariance Estimates obtained using FMCD, QC and GK for Clean Data. 1 0 6 p = 10 p = 5 0 0 10000 30000 N 50000 10000 20000 30000 40000 50000 N CD E o o in o o co o o p = 3 0 FMCD QC GK 0 10000 30000 N 50000 Figure 4.10: CPU Time of the Covariance Estimates obtained using FMCD, QC and GK for Clean Data with Larger Sample Sizes. 107 p = 10 p = 5 0 o o CD _ E O ._ CD O CM O O CD _ E O ._ CD O CM 1000 2000 3000 4000 5000 N 1000 2000 3000 4000 5000 N p = 3 0 o o E O ._ CD O CM — FMCD QC - GK 0 1000 2000 3000 4000 5000 N Figure 4.11: CPU Time of the Covariance Estimates obtained using FMCD, QC and GK for 20% Contaminated Data. 108 p = 10 p = 5 0 10000 30000 N 50000 0 10000 20000 30000 40000 50000 N p = 3 0 CO E o o CO o o FMCD QC GK 10000 30000 N 50000 Figure 4.12: CPU Time of the Covariance Estimates obtained using FMCD, QC and GK for 20% Contaminated Data with Larger Sample Sizes. 109 4.7 Maximum Bias of Quadrant and Huberized Correlation Coefficients In this section, we study the maximum asymptotic bias (maxbias) of the quadrant and Huberized correlation coefficient estimates in the contamination neighborhoods. We are also interested in comparing their maxbias with the maxbias of the affine equivariant estimates such as the Fast MCD and the Stahel-Donoho. For simplicity and without loss of generality, we assume that the number of variables p = 2. Hence, it is appropriate to use the classical contamination neighborhood (4.6) as a good approximation. The rest of this section is organized as follows. Section 4.7.1 studies the maxbias of the quadrant and Huberized correlation coefficient estimates when the location and scale parameters are known. Section 4.7.2 considers the maxbias of the quadrant correlation coefficient estimate with unknown location and scale parameters. Section 4.7.3 deals with numerical computation of the maxbias of the Huberized correlation coefficient estimates when the location and scale parameters are unknown. Finally, Section 4.7.4 compares the maxbias of the Huberized correlation coefficient estimates with the maxbias of the Fast MCD and the Stahel-Donoho correlation coefficient estimates. 4.7.1 Maxbias of Quadrant and Huberized Correlation Coefficients with Known Locations and Scales The following theorem shows the maxbias of the quadrant and Huberized correlation coefficient estimates when the location and scale parameters are known. T H E O R E M 4.4 - Worst Case Bias - The maximum asymptotic bias B (e) under the classical contamination neighborhood (4-6) of size e, sup \r(H) — p\ is given by: Hene(P) max r(H0)-B , 1 + P , r(H0) + B (4.12) 110 where j3 = [e/(l - e)] [ip2 (oo) /E{ip2 (Z)}] with Z ~ N (0,1), and fei &(•) defined as in (4.9). Sketch of the Proof: Assume without loss of generality that ip ( ° ° ) = 1- Let A = E # 0 {ip(X)ip(Y)}, B = l E H o W{X)}, a = JEjj {iP2(X)}, b = TEjj {iP2(Y)}, and ( H ) = VH {iP(X)iP(Y)} ^ H {iP2{X)}^H {iP2{Y)}' By the Cauchy-Schwarz inequality (1 - e)A + eVab r(H)< - e)B + eay/(l - e)B + tb Differentiating the right hand side with respect to a and using the Cauchy-Schwarz inequality again we can verify that this derivative is non-negative for all a < b. Therefore, letting (e/ (1 — e)) /b — and noticing that r(H0) — A/B we can write r [ H ) - (l-e)B + eb~ (l-e)B + e~ 1 + ' ( 4 > L 3 j The second inequality follows because [(1 - e)A + eb] j [(1 - e)B + eb] is increasing in b. A n analogous reasoning gives r(H) > ^ f ^ . (4.14) Huber (1981) states inequalities (4.13) and (4.14) without providing a proof. The result follows now because the function p = gc (r) is non-decreasing. We give a detailed proof of Theorem 4.4 in Section 4.9.4 of the chapter appendix. 4.7.2 Maxbias of Quadrant Correlation Coefficient with Unknown Locations and Scales Before we derive the maxbias of the quadrant correlation coefficient estimates, when the location and scale parameters are unknown, we need to state the following lemma. I l l L E M M A 4 .1 Suppose that the random vector (X,Y) has a bivariate normal distribution 1 P HQ with mean vector 0 and covariance matrix £ = \ P For any a and b in the interval [—XO,XQ] define: P ^ and let x0 = $ 1 ( 2 ( i - e ) ) • g(a, b) = EHo {SGN(X - a)SGN(Y - b)} . Then the following are satisfied: (a) g(a,b) = g(b,a); (b) max g(a, b) = ^(-Zo,-Soc-ial,|6|<xo (c) min g(a,b) = g(-x0,+x0). \a\,\b\<x0 Proof: Assume without loss of generality that p > 0. Part (a) follows because (X, Y) ~ H0 are exchangeable: g(a,b) = E t f 0 { S G N ( X - a ) S G N ( F - & ) } = E F o { S G N ( y - a ) S G N ( X - 6 ) } = g(b,a). Because of (a), to show (b) and (c) we only need to consider the upper triangle {(a, b) : x0 > b > a > —x0}. Let g{a,b) = lEHo{SGN(X-a)SGN(Y-b)} = E H o { l E H o { S G N ( y - 6 ) | X } S G N ( X - a ) } , where the conditional distribution of Y given X = x is normal with mean px and variance 1 - p2. Now Etf 0 {SGN(y - b)\X = x} = P(Y > b\X = x) - P (Y < b\X = x) = 1 - 2$ . , 112 b — px Then l(a, b) = f ' J a 1 - 2 $ b — px 4>(x)dx — j J-ex Fixing b and differentiating with respect to a we get 1 - 2$ b — px 7 ^ 7 (j)(x)dx. da g(a,b) = 2 2$ b — pa 1 The sign of this expression depends on the sign of b — pa. To investigate the sign of b — pa, we break the domain of b into three intervals as follows: Case I: — x0 < b < —pxa. For any a G [—xo,x0] we have b < —pxo < pa < px0 and so b — pa < 0. So g(a, b) decreases with a in the given interval. Therefore, max g(a,b) = g(-x0,b). Case II: — px0 < b < px0. For any a e [—xQ,x0], if — x0 < a < b/p, then — px0 < pa < b and so b — pa > 0. So g(a,b) increases with a G [—x0,b/p\. And if b/p < a < x0, then b < pa < px0 and so b — pa < 0. So g(a, b) decreases with a G (b/p, xo]. Therefore, max g(a,b) = g(b/p,b). —xo<.a<xo Case III: px0 < b < x0. For any a G [—x0,x0], if 0 < a < xQ, then pa < a and so b — pa > b — a > 0. And if — x0 < a < 0, then —pa > 0 and so b — pa > 0. So <?(a, 6) increases with a in the given interval. Therefore, max g(a,b) = g(x0,b). —xo<a<xo For all —x0 <b< —px0 we have g(—x0, b) = g(b, —x0), then by Case I g(b, -x0) < g(-x0, -x0). For all px0 < b < x0 we have g(x0, b) = a;0), then by Case III g(b, x0) < g(x0, x0) = g(-x0, -x0). 113 For all — px0 < b < px0 then —x0<b/p<x0. We consider three cases: Case 1 : b/p < —px0. In this case 9(P/p, b) 9(b,b/p) < g(-x0, b/p) = (b/p, -x0) < g(-x0, -x0). Case 2: — px0 < b/p < px0. In this case g(b/p,b)=g(b,b/p) < g(b/p, b/p) < < 9{b/p\ b/p) 9(b/p,b/p2) g(b/p\b/P2) Let k be such that — px0 <b/pk 1 < px0, but not for b/ph. Eventually, b/pk < —pxo or b/pk > px0. Then g(b, b/p) < g(-x0, -x0) or g(b, b/p) < g(x0, x0). Case 3: b/p > px0. In this case g(b/p, b) = g(b, b/p) < g(x0, b/p) = g(b/p,x0) < g{x0,x0). • Now in the following theorem we derive the maxbias of the quadrant correlation coefficient estimate. T H E O R E M 4 . 5 - Maxbias of the Quadrant Correlation Coefficient - If (X, Y) is distributed according to the classical contamination model H (4-6) where HQ = N(pi, S ) . Then the quadrant correlation coefficient rQc(H) has the following properties: (a) sup rQc(H) = ( 1 - e)g(-x0, -x0) + e; HEHe(p) (b) inf rQC{H) = ( 1 - e)g{-x0, +x0) - e. 1 1 4 Proof: We assume without loss of generality that fix = = 0, ax — &Y = 1 and OXY — P-Therefore the bivariate random variable (X, Y) is distributed according to model (4.6) with Let Mx and My be the medians of X and Y under H, and let rQC{H) = E H {SGN(X - M x ) S G N ( y - My)} be the quadrant correlation coefficient between X and Y under the contaminated distri-bution H. Since \MX\ < x0 and |My| < x0, where x0 is as in Lemma 4.1. Then rQc(H) < (1 - e) max g(Mx, My) + e. -X0<MX,MY<XQ By Lemma 4.1 (b) max g(Mx, My) = g(-x0, -x0), and therefore the right hand -XQ<MXMY<XO side of (a) is upper bound for TQC(H). Now let H = (1 — e)H0 + e5(_OQ . ^ j , and notice that MX{H) = MY(H) — —x0 and rQ C(#) = (1 - ^ W - ^ o , -a:o) + e> proving (a). The proof of part (b) follows along the same lines with H replaced by H_ = (1 — e)H0 + er5( + 0 0 ] + 0 0 ) , and noticing that MY(H_) — and MX{H) = — x 0 ) and rQc(K) = (1 - e)c/tf0(-x0, +^o) - £• • 4.7.3 Maxbias of Huberized Correlation Coefficients with Unknown Locations and Scales The derivation of the maxbias of the Huberized correlation coefficient estimates when the location and scale parameters are unknown is not tractable. Therefore, we decided to derive the maxbias using numerical computations which we describe below. 115 Let X and Y be two random variables jointly distributed under the point mass con-tamination model: H(xo,yo) = (1 - e ) H 0 + e<Wo)> (4-15) where S(XOiyo) is a point mass distribution at (x0,yo) and * • " ( ( : ) • ( : : Let the median, M, be the robust location estimate and the median absolute devia-tion (MAD), S, be the robust scale estimate. To obtain the median and the MAD of X, consider the univariate point mass contamination model which can be expressed as follows: FX0(x) = (1 - e)${x) + e5X0(x), (4.16) where $(•) denotes the standard normal cumulative distribution function and I 0 x < x0 [ l X>XQ. Then the distribution function of X can be written as [(1 - e)$(x) + e x > x0. Let c = ( 2 ( 1 - 6 ) ) ' ^ e n t n e m e ( i ian of X, Mx, can be expressed as M x = L \x0\<c [cSGN(xo) |x 0 |>c. To see this, first notice that to cover the case of discontinuous distributions like (4.16), the definition of median is extended as follows: Mx = sup j i : FX0(t) < l-116 We will restrict attention to the case x0 > 0. The case of negative x0 can be dealt with similarly. Let 0 < x0 < c for all x < x0, FXo(x) = (1 - e)$(x) < (1 - e)$(xQ) < ( l - e )$ (c ) = | . On the other hand, for all x > x0, FXo(x) > FXo(x0) = ( l - e ) $ ( x 0 ) + e > ( l -e )$(0) + e = | + f > \. Therefore, Mx = x0. Finally, for x0 > c since ( l -e )$(c) = \ and i^ 0 (c) = (1 — e)0(c) > 0, the median is equal to c. Now to compute M A D of X, Sx = med\X - M x | / $ _ 1 ( 3 / 4 ) . We have to solve for u the following equation: PX0(\X-Mx\<u) = l/2, which can be expressed as FX0(MX + u)- FX0(MX -u) = 1/2. (4.17) For the case |x 0 | < c, where Mx = x0. Substitute Mx in equation (4.17), we get FX0(x0 + u) - FX0(x0 - u) = 1/2, where FXo (x0 + u) - (1 - e)$(x0 + u) + e, x0 + u>x0 Fxo(x0-u) = (1 - e)$(z 0 - u), x0-u<x0 which implies that $ (x 0 + u)- $(x0 -u) = (4.18) Using Newton-Raphson method to solve for u in equation (4.18), we get Sx = u /$" 1 (3/4) . For the case \x0\ > c, where Mx — c SGN(a:o). Let b = c SGN(:ro) and substitute Mx in equation (4.17), we get FX0(b + u)-FX0(b-u) = 1/2, (4.19) 117 where FX0(b + u) (1 - e)$(6 + u) b + u<x0 (1 - e)$(b + u) + e b + u>x0. F x o { b _ u ) = ^ - ^ - u ) b-u<x0 (I - e)$(b - u) + e b-u>x0. Using Newton-Raphson method to solve for u in equation (4.19), we get Sx = ,u/$~ 1(3/4). Similarly, we obtain median of Y, MY, and M A D of Y, Sy-The Huberized correlation coefficient under the point mass contamination model (4.15) is defined as follows: E {f * N )^}-E {i> { } E (^)} r(H(x0,yo)) where the numerator can be written as >X / i \TT-I / / X-Mx\ fxo-Mx (1 - e)JEHoip ( s ) + eib' >x (1 - e ) E * 0 ^ F - M y yo-MY The first factor of the denominator can be written as (1 - e ) E F o ^ ( — ) + eib S x 118 and the second factor of the denominator can be written as (1 - e)T£„ 0V 2 Y Ho — My SY To obtain the maxbias of the Huberized correlation coefficient estimates, we con-structed a grid of point mass distributions located at (x0, yo) with x0 and y0 between -10 and 10 with increments of 0.01. We used numerical integration, namely the trapezoidal method to, compute the Huberized correlation coefficient r(i7(X o ,,,„)), at each point of the grid. For different correlation coefficients p = 0.1, 0.5 and 0.9, the maximum and the mini-mum values of the Huberized correlation coefficients in the grid for each tuning constant c = 0, .25, .50,1.00,1.25,1.50, 2.0 and fraction of contamination e = 0.01, 0.05, 0.10, 0.15, 0.20 are shown in Tables 4.7, 4.10 and 4.13. Labels C and U in the tables stand for "Corrected" values, which are corrected for the intrinsic bias using formula (4.10) for the quadrant correlation coefficients (c = 0) and the numerical tables for c > 0, and "Uncorrected" values which are not corrected for the intrinsic bias. Then, the maxbias is defined as follows. For each p, Tables 4.8, 4.11 and 4.14 exhibit the maxbiases of the corrected quadrant and Huberized correlation coefficients, given e and c. Tables 4.9, 4.12 and 4.15 display the maxbiases of the uncorrected quadrant and Huberized correlation coefficients for each e and c. Figures 4.13, 4.14 and 4.15 display part of the results in graphical form. These pictures show at a glance that some of the maxbiases of the corrected quadrant and Huberized correlation coefficients are larger than the maxbiases of the uncorrected quadrant and Huberized correlation coefficients. An explanation for this is that the worst 119 Q C Huber, c = 1.00 Huber, c = 1.25 Figure 4.13: Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p — 0.10. contamination bias causes the estimate to become negative and correction for the intrin-sic bias makes the estimate even more negative. Therefore, when p > 0 but the estimate of p is negative the maxbias of the corrected quadrant and Huberized correlation coeffi-cients is larger than the maxbias of the uncorrected quadrant and Huberized correlation coefficients. We notice that this phenomenon is more obvious for p — 0.1 than for p = 0.5 and 0.9, since for low positive correlations it is more likely that the estimated correlation coefficients will be negative. The results show that for a fixed value of e the maxbiases of the corrected Huberized correlation coefficients increase as the values of c increase, and therefore the corrected 120 QC Huber, c = 1.00 CO ra m d o d lO d 0.05 C o r r e c t e d U n c o r r e c t e d 0.10 0.15 epsilon 0.20 CO co m lo d s 2 o d 0.05 0.10 0.15 epsilon 0.20 Huber, c = 1.25 X CO m d o d m d 0.05 0.10 0.15 epsilon 0.20 Figure 4.14: Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p = 0.50. quadrant correlation coefficient has the least maxbias. This implies that the contami-nation bias is an increasing function of c, since the maxbias of the corrected Huberized correlation coefficient will contain only the contamination bias. For different correlation coefficients p = .1, .5 and .9 and e = 0.20, Figure 4.16 displays plots of the maxbiases of the corrected Huberized correlation coefficients versus the tuning constant c. The results also show that when e = 0.01 and 0.05 the maxbiases of the uncorrected quadrant correlation coefficients are larger than the maxbiases of the uncorrected Hu-berized correlation coefficients with c = 1. An explanation for this occurrence is that for small fractions of contamination the overriding bias in the Huberized correlation coeffi-121 QC Huber, c = 1.00 0.05 0.10 0.15 0.20 0.05 0.10 0.15 0.20 epsilon epsilon Huber, c = 1.25 "i < 1 d 0.05 0.10 0.15 0.20 epsilon Figure 4.15: Maxbias Comparison of Corrected and Uncorrected Quadrant and Huber-ized Correlation Coefficient Estimates, p = 0.90. cients is the intrinsic bias which is larger for the quadrant correlation coefficients than for the Huberized correlation coefficients with c = 1. Figures 4.5 and 4.6 show that the intrinsic bias is a decreasing function of c. Now we turn our attention to the choice of the tuning constant c in the Huberized correlation coefficients. One important consideration to guide the choice of the tuning constant c is the maxbias over the contamination neighborhoods, which we would like to make as small as possible. It is important to notice that the intrinsic bias decreases as c increases, whereas the contamination bias increases as c increases. Therefore, in practice it is essential to 122 rho = 0.1 rho = 0.5 2 L • . J 0.0 0.5 1.0 1.5 2.0 C Figure 4.16: Maxbias of Corrected Quadrant and Huberized Correlation Coefficient Es-timates, e = 0.20. " choose the tuning constant c to achieve a trade-off between the intrinsic bias and the contamination bias. The results above suggest that the corrected quadrant correlation coefficient (c = 0) has the least maxbias. This implies that we should choose c = 0 with correction for the intrinsic bias and thus, as mentioned in Section 4.5, that we should correct the resulting pairwise Huberized scatter matrix for positive definiteness. On the other hand, the results show that the uncorrected Huberized correlation coefficient with c = 1 has less maxbias than the uncorrected quadrant correlation coefficient when e < .05. We consider the fraction of contamination e = .05 in each variable. Although this value of e might seem 123 small, since each variable is contaminated at this rate, in fact, a large number of the cases will be contaminated in the same way. Therefore, the value of c = 1 is a good choice since in this case the correction for the intrinsic bias is not needed and thus the positive definiteness of the resulting pairwise Huberized scatter matrix will be preserved. Figures 4.13, 4.14 and 4.15 show that the maxbiases of the corrected and the uncorrected Huberized correlation coefficients with c = 1 are fairly close. When faced with the question of whether c = 0 or c — 1 is to be used, the user may have to balance the following issues. Namely, the corrected quadrant correlation coefficient will correct for the intrinsic bias but will yield a scatter matrix which is not necessarily positive definite, while the uncorrected Huberized correlation coefficient with c = 1 will provide no correction for the intrinsic bias but will lead to a positive definite scatter matrix. From a computational point of view, the uncorrected Huberized correla-tion coefficient with c = 1 is also to be preferred, because it will be less computationally intense. Also, as seen in Section 4.4.2, that for the choice of c = 1 we do not lose much efficiency compared to larger values of c. Since computational feasibility of the estimate is our important concern, we will focus only on the uncorrected Huberized correlation coefficient (with c = 1) in the following section. 4.7.4 M a x b i a s Compar i son of the Cor re l a t ion Coefficients for Stahel-Donoho and F M C D to H u b e r In this section, we compare the maxbias of the uncorrected Huberized correlation co-efficient estimates (with c — I) with the maxbias of the Fast MCD (FMCD) and the Stahel-Donoho (SD) correlation coefficient estimates. We now briefly describe the Stahel-Donoho estimate (see Maronna and Yohai, 1995). Essentially, it is an "outlyingness-weighted" mean and variance, which downweights any point that is many robust standard deviations away from the sample in some univariate 124 projection. The outlyingness measure, r, is based on the idea that if a point is a multi-variate outlier then there must be some one-dimensional projection of the data for which the point is a univariate outlier. Suppose X = Xi,..., Xn is a multivariate sample where Xi £ W, i = 1,..., n. The outlyingness r of each point Xi is computed by finding the direction a; £ A, where A = {a E W\ \\a\\ = 1} such that; r(Xi) = sup \X\a - med{X;.a}y=1 a e A MAD{X;.a}?= 1 ' The Stahel-Donoho estimate of location and scatter is defined as tWiXi V = — . i=l and f:Wi(Xi-fi)(Xi-fj,y ZWi i=l with Wi = W{r(Xi)). We used the Splus built-in command in the robust library covRob(stack.dat, estim = "donostah"), in which the weight Wi is computed using the following function of outlyingness; W{r-c)={ 0 S>1 ai + a2(l)2 + a3(l)A + a4(lY 0.8 < g < 1 1 M < 0.8 where ax = -19.71879, a2 = 82.30453, a 3 = -105.45267 and a 4 = 42.86694. The tuning constant c is set to be the square-root of the 0.95 quantile of a chi-squared distribution with p degrees of freedom. 125 To compute the Fast M C D estimate we used the Splus built-in command in the robust library covRob(stack.dat, estim = "MCD") . This implementation uses the Fast M C D algorithm of Rousseeuw and Van Driessen (1999) to approximate the minimum covariance determinant estimate. This algorithm relies on a method called the "C-step" with which, given any approximation to the M C D , it is possible to compute another approximation with a smaller determinant. The Fast M C D algorithm is discussed in Chapter 2. To obtain the maxbias of the SD and the F M C D correlation coefficient estimates, we constructed a grid of point mass distributions located at (xo,yo) with x0 and y0 between -2 and 2 with increments of 0.01. At each point of the grid, we generated a bivariate data set from the point mass contamination model (4.15) of sample size n = 50, 000 for the F M C D estimates and of sample size n = 5000 for the SD estimates. The sample size is smaller for the SD estimates due to their computational burden. For different correlation coefficients p = .1, .5 and .9, the maximum and the minimum values of the F M C D and the SD correlation coefficient estimates, f, in the grid for e = 0.01,0.05, 0.10,0.15 and 0.20 are shown in Tables 4.16 and 4.18. Then, the maxbias is defined as. Tables 4.17 and 4.19 exhibit the maxbiases of the F M C D and the SD correlation coeffi-cient estimates given e and p. We can now compare the F M C D and the SD maxbiases with the maxbias results of the uncorrected Huberized correlation coefficient estimates (with c = 1) from Section 4.7.3. Figure 4.17 shows the maxbiases for each of the estimates plotted versus the fraction of contamination e for three different correlation coefficients, p = .10, .50 and .90. From the plots in Figure 4.17, it is evident that for p = .10 the Huberized approach has the smallest maxbiases for all values of e. The SD approach is better than the F M C D . For p = .50 the maxbiases of the Huberized approach are almost equal to those of the max 126 rho = 0 . 1 0 rho = 0 . 5 0 tn cd m in d m d 0.05 0.10 0.15 epsilon rho = 0 . 9 0 0.20 0.05 0.10 0.15 epsilon 0.20 CO eg la 3 in in d in d Huber, c = - - SD FMCD 1.00 0.05 0.10 0.15 epsilon 0.20 Figure 4.17: Maxbias Comparison of Uncorrected Huberized Correlation Coefficient Es-timates (c = 1) with SD, FMCD Correlation Coefficient Estimates. SD approach up to e = .05. For larger values of e, the Huberized approach is better. On the other hand, for p = .90 the SD approach performs the best, indicating that for structured data the SD approach identifies the outliers very easily. The Huberized approach, however, performs better than the FMCD. Though the SD approach has the smallest maxbiases for p = .90, the Huberized approach is a very close competitor. Moreover, the Huberized approach can be very easily coded for computation and is much faster to implement. 127 c \ 0.01 U C 0.05 U C 0.10 U C 0.15 U C 0.20 U C o.oo m a x min 0.0734 0.11 0.0529 0.08 0.1127 0.18 0.0067 0.01 0.1662 0.26 -0.0550 -0.09 0.2288 0.35 -0.1260 -0.20 0.2974 0.45 -0.2037 -0.31 0.25 m a X min 0.0821 0.11 0.0588 0.08 0.1282 0.18 0.0067 0.01 0.1914 0.26 -0.0653 -0.09 0.2611 0.36 -0.1463 -0.20 0.3395 0.46 -0.2375 -0.33 0.50 m a X min 0.0915 0.12 0.0642 0.08 0.1453 0.18 0.0032 0.01 0.2182 0.27 -0.0807 -0.10 0.2981 0.37 -0.1740 -0.22 0.3871 0.47 -0.2778 -0.35 1.00 m a x min 0.1090 0.12 0.0700 0.08 0.1834 0.20 -0.0170 -0.02 0.2814 0.31 -0.1325 -0.15 0.3837 0.42 -0.2551 -0.28 0.4904 0.53 -0.3828 -0.42 1.25 m a X min 0.1170 0.13 0.0696 0.08 0.2052 0.22 -0.0352 -0.04 0.3180 0.34 -0.1699 -0.18 0.4313 0.45 -0.3076 -0.33 0.5443 0.57 -0.4444 -0.47 1.50 m a x min 0.1245 0.13 0.0667 0.07 0.2291 0.24 -0.0592 -0.06 0.3578 0.37 -0.2146 -0.22 0.4811 0.49 -0.3656 -0.38 0.5975 0.61 -0.5076 -0.52 2.00 m 3 X min 0.1395 0.14 0.0536 0.05 0.2843 0.29 -0.1231 -0.12 0.4438 0.45 -0.3174 -0.32 0.5793 0.58 -0.4844 -0.49 0.6931 0.70 -0.6239 -0.63 Table 4.7: Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.1. 128 e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.02 0.09 0.19 0.30 0.41 c = 0.25 0.02 0.09 0.19 0.30 0.43 c = 0.50 0.02 0.09 0.20 0.32 0.45 c = 1.00 0.02 0.12 0.25 0.38 0.52 c = 1.25 0.03 0.14 0.28 0.43 0.57 c = 1.50 0.03 0.16 0.32 0.48 0.62 c = 2.00 0.04 0.22 0.42 0.59 0.73 Table 4.8: Maximum Bias of Corrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.1. e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.05 0.09 0.16 0.23 0.30 c = 0.25 0.04 0.09 0.17 0.25 0.34 c = 0.50 0.04 0.10 0.18 0.27 0.38 c = 1.00 0.03 .0.12 0.23 0.36 0.48 c = 1.25 0.03 0.14 0.27 0.41 0.54 c = 1.50 0.03 0.16 0.31 0.47 0.61 c = 2.00 0.05 0.22 0.42 0.58 0.72 Table 4.9: Maximum Bias of Uncorrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.1. 129 c \ 0.01 U C 0.05 U C 0.10 U C 0.15 U C 0.20 U C o.oo m a x m i n 0.3413 0.51 0.3213 0.48 0.3697 0.55 0.2633 0.40 0.4070 0.60 0.1827 0.28 0.4505 0.65 0.0898 0.14 0.4974 0.70 -0.0171 -0.03 0.25 m 3 X m i n 0.3813 0.51 0.3580 0.48 0.4138 0.55 0.2917 0.40 0.4570 0.60 0.1977 0.27 0.5067 0.65 0.0917 0.13 0.5622 0.71 -0.0290 -0.04 0.50 m a X m i n 0.4188 0.51 0.3915 0.48 0.4555 0.55 0.3129 0.39 0.5045 0.60 0.2029 0.25 0.5591 0.66 0.0804 0.10 0.6190 0.72 -0.0571 -0.07 1.00 m a x m i n 0.4719 0.51 0.4329 0.47 0.5183 0.56 0.3177 0.35 0.5786 0.62 0.1635 0.18 0.6419 0.68 0.0005 0.01 0.7070 0.74 -0.1704 -0.19 1.25 m a X m i n 0.4886 0.51 0.4412 0.46 0.5415 0.57 0.3010 0.32 0.6084 0.63 0.1195 0.13 0.6759 0.70 -0.0646 -0.07 0.7426 0.76 -0.2485 -0.26 1.50 m a X m i n 0.5007 0.51 0.4428 0.46 0.5616 0.58 0.2731 0.28 0.6360 0.65 0.0628 0.07 0.7075 0.72 -0.1398 -0.14 0.7748 0.79 -0.3317 -0.34 2.00 m a X m i n 0.5169 0.52 0.4310 0.44 0.5988 0.60 0.1913 0.19 0.6884 0.69 -0.0737 -0.07 0.7650 0.77 -0.2989 -0.30 0.8295 0.83 -0.4886 -0.49 Table 4.10: Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.5. 130 e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.02 0.10 0.22 0.36 0.53 c = 0.25 0.02 0.10 0.23 0.37 0.54 c = 0.50 0.02 0.11 0.25 0.40 0.57 c = 1.00 0.03 0.15 0.32 0.49 0.69 c = 1.25 0.04 0.18 0.37 0.57 0.76 c = 1.50 0.04 0.22 0.43 0.64 0.84 c = 2.00 0.06 0.31 0.57 0.80 0.99 Table 4.11: Maximum Bias of Corrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.5. e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.18 0.24 0.32 0.41 0.52 c = 0.25 0.12 0.21 0.30 0.41 0.53 c = 0.50 0.11 0.19 0.30 0.42 0.56 c = 1.00 0.07 0.18 0.34 0.50 0.67 c = 1.25 0.06 0.20 0.38 0.56 0.75 c = 1.50 0.06 0.23 0.44 0.64 0.83 c = 2.00 0.07 0.31 0.57 0.80 0.99 Table 4.12: Maximum Bias of Uncorrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.5. 131 c \ 0.01 U C 0.05 U C 0.10 U C 0.15 U C 0.20 U C o.oo m a x min 0.7161 0.90 0.6957 0.89 0.7290 0.91 0.6166 0.82 0.7444 0.92 0.4950 0.70 0.7626 0.93 0.3484 0.52 0.7823 0.94 0.1793 0.28 0.25 m a x min 0.7933 0.90 0.7698 0.89 0.8067 0.91 0.6790 0.82 0.8243 0.92 0.5424 0.69 0.8437 0.93 0.3817 0.51 0.8642 0.94 0.1966 0.27 0.50 m a X min 0.8406 0.90 0.8131 0.88 0.8532 0.91 0.7068 0.80 0.8693 0.92 0.5530 0.65 0.8866 0.93 0.3790 0.46 0.9043 0.94 0.1843 0.23 i.oo m a x min 0.8811 0.90 0.8420 0.87 0.8928 0.91 0.6909 0.73 0.9077 0.92 0.4881 0.53 0.9228 0.94 0.2744 0.30 0.9374 0.95 0.0516 0.06 1.25 m a X min 0.8900 0.90 0.8427 0.86 0.9024 0.91 0.6612 0.68 0.9176 0.93 0.4267 0.45 0.9325 0.94 0.1893 0.20 0.9467 0.95 -0.0471 -.05 1.50 m a X min 0.8958 0.90 0.8380 0.85 0.9092 0.91 0.6206 0.63 0.9252 0.93 0.3516 0.36 0.9404 0.94 0.0922 0.10 0.9541 0.96 -0.1524 -0.16 2.00 m a X min 0.9021 0.90 0.8164 0.82 0.9190 0.92 0.5121 0.52 0.9373 0.94 0.1766 0.18 0.9529 0.95 -0.1103 -0.11 0.9656 0.97 -0.3508 -0.35 Table 4.13: Maximum and Minimum Values of Corrected and Uncorrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.9. 132 e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.01 0.08 0.20 0.38 0.62 c = 0.25 0.01 0.08 0.21 0.39 0.63 c = 0.50 0.02 0.10 0.25 0.44 0.67 c = 1.00 0.03 0.17 0.37 0.60 0.84 c = 1.25 0.04 0.22 0.45 0.70 0.95 c = 1.50 0.05 0.27 0.54 0.80 1.06 c = 2.00 0.08 0.38 0.72 1.01 1.25 Table 4.14: Maximum Bias of Corrected Quadrant and Huberized Correlation Coefficient Estimates, p = 0.9. e 0.01 0.05 0.10 0.15 0.20 c = 0.00 0.20 0.28 0.41 0.55 0.72 c = 0.25 0.13 0.22 0.36 0.52 0.70 c = 0.50 0.09 0.19 0.35 0.52 0.72 c = 1.00 0.06 0.21 0.41 0.63 0.85 c = 1.25 0.06 0.24 0.47 0.71 0.95 c = 1.50 0.06 0.28 0.55 0.81 1.05 c = 2.00 0.08 0.39 0.72 1.01 1.25 Table 4.15: Maximum Bias of Uncorrected Quadrant and Huberized Correlation Coeffi-cient Estimates, p = 0.9. 133 0.01 0.05 0.10 0.15 0.20 o.io m a x min 0.2592 -0.0543 0.6343 -0.5019 0.8475 -0.7758 0.9345 -0.9112 0.9624 -0.9538 0.50 m a x min 0.6169 0.3626 0.8345 -0.0980 0.9351 -0.5181 0.9688 -0.7786 0.9792 -0.9104 0.90 m a X min 0.9278 0.8653 0.9733 0.6554 0.9889 0.2923 0.9936 -0.1038 0.9958 -0.4955 Table 4.16: Maximum and Minimum Values of Fast MCD Correlation Coefficient Esti-mates. e 0.01 0.05 0.10 0.15 0.20 ,9 = 0.10 0.16 0.60 0.68 1.01 1.05 p = 0.50 0.24 0.60 1.02 1.28 1.41 p = 0.90 0.03 0.24 0.61 1.00 1.40 Table 4.17: Maximum Bias of Fast MCD Correlation Coefficient Estimates for Different Correlation Coefficients, p. 134 0.01 0.05 0.10 0.15 0.20 o.io m a x m i n 0.1778 0.0182 0.3374 -0.1239 0.5066 -0.3304 0.6663 -0.5417 0.7588 -0.7132 0.50 m a X m i n 0.5551 0.4408 0.6512 0.3018 0.7428 0.0862 0.8185 -0.1633 0.8679 -0.4134 0.90 m a X m i n 0.9176 0.8827 0.9343 0.8403 0.9527 0.7711 0.9642 0.6372 0.9750 0.4509 Table 4.18: Maximum and Minimum Values of Stahel-Donoho Correlation Coefficient Estimates. e 0.01 0.05 0.10 0.15 0.20 p = 0.10 0.08 0.24 0.43 0.64 0.81 p = 0.50 0.06 0.20 0.41 0.66 0.91 p = 0.90 0.02 0.06 0.13 0.27 0.45 Table 4.19: Maximum Bias of Stahel-Donoho Correlation Coefficient Estimates for Dif-ferent Correlation Coefficients, p. 135 4.8 Application Examples to Real Data The goal of this section is to illustrate the implementation of the quadrant and Huberized correlation coefficient estimates on three real data sets. For the first two data sets, we implemented the quadrant correlation (QC) coefficient estimates version of the Maronna and Zamar (2002), discussed in Section 4.6. These estimates are used to obtain the robust covariance matrix estimate and outlier detection via robust Mahalanobis distances. The robust Mahalanobis distance, is computed using the robust covariance matrix estimate, S, along with the coordinate-wise median, p,, as the robust location estimate. In the above expression X j is the i-th data vector of dimension p (the transpose of the i-th row of the data). To decide whether or not a matrix row is an outlier, we used the 99-th percentile of the distribution of the maximum of n independent chi-squared random variables with p degrees of freedom. That is, we compare each dj(xj) with the value d = x29g(max) given by the equation: where Xi,...,Xn are i.i.d. X2{p)-4.8.1 Glass Data The data set glass is a small 214 x 10 matrix consisting of 9 numeric variables and one categorical variable. The 9 numeric variables are the percentages of various chemical constituents of the glass. We obtained these data from the new Insightful Miner (I-Miner) data mining product. We computed the QC based robust covariance matrix and robust Mahalanobis distances for the sub-matrix consisting of the first five columns of the above matrix. Upon running the outlier detection computations with threshold point x29g(max) = 27.43, we found that approximately 34% of the data points are outliers while the remain-d(xi) = (XJ - /*)'£ :(XJ - jj.) P max Xi < d = .99 136 Outlier Distance Figure 4.18: Histogram of the Robust Mahalanobis Distances for the Glass Data. ing 66% of the data represents a central core. The histogram of the robust distances in Figure 4.18 clearly shows a cluster of large distances around 200 to 350 which are much larger than the x2 9 9(max) with 5 degrees of freedom threshold of 27.43 used above. A visualization of the data by means of all pairwise scatter plots in Figure 4.19 reveals an interesting aspect of the multivariate structure that is reasonably consistent with these observations. One sees that the data appears to have a central core that is roughly elliptical in the pairwise views, along with broadly scattered outliers and the distinctive rod-like structure. The latter is due to the fact that 41 of the observations of the Mg variable have value zero. This was evidently because the data values were not recorded or were misplaced, and zero values were substituted for the missing values. Inspection of the scatterplots in Figure 4.19 reveals that there are roughly an ad-ditional 31 diffuse outliers well separated from the elliptical core. So what the outlier detection algorithm with a x299(max) threshold of 27.43 does is identify the diffuse out-liers as well as the extreme outlying rod caused by the zero Mg's. In other words, the 137 Figure 4.19: Pairwise Plots for the Data Set Glass. outlier detection algorithm is behaving quite as anticipated. What the histogram is identifying with its bimodal character is the separation of the pure rod outlier as the most extreme set of distances, distances that are well beyond those of the diffuse outliers closer to central bulk of the data. If we use a threshold of 200 to set aside outliers we will set aside the pure rod as shown in Figure 4.20, and this is not an unreasonable first step. In a second step we will find the remainder of the diffuse outliers. These observations suggest that one might well use robust covariance matrix based 138 Outlier Distance 0 2 7 4 3 50 100 150 200 Bin Range Figure 4.20: Histogram of the Robust Mahalanobis Distances using a Threshold of 200 for the Glass Data. robust distances to iteratively cluster multivariate data by iterative removal of outlier groups, monitored by histograms or density estimates of the robust distances, and sub-sequent iteration on the sub-clusters. This possibility bears further investigation. 4.8.2 K D D - C U P - 9 8 P V A Donations Data This data set was used for the second International Knowledge Discovery and Data Mining tools competition, which was held in conjunction with KDD-98 the fourth Inter-national Conference on Knowledge Discovery and Data Mining. The competition task was a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits. This data set, which we will refer to as the "PVA" data, represents a much more substantial data mining challenge. The original KDD-CUP-98 PVA data set consists of 95,412 records (rows) and 481 variables (columns). For purposes of this example, we 139 Outlier Distance 14 3 8 o 50 100 150 200 250 Bin Range Figure 4.21: Histogram of the Robust Mahalanobis Distances for the P V A Data. have used 16 of the numeric variables. We computed the QC based robust covariance matrix and robust Mahalanobis distances for the sub-matrix. Upon running the outlier detection computations with threshold point x 2 9 g ( m a x ) — 64.1, we found that 17,903 outliers rows, 44,284 non-outlier rows and 33,225 N A rows (missing data). The histogram of the robust distances is shown in Figure 4.21 (in which we have filtered out a few very extreme outlier distances for purposes of a more detailed display). In Figure 4.22 we show the plot of ordered absolute differences between the classical and robust correlation coefficients obtained from the robust correlation matrix. Figure 4.22 shows that while the vast majority of the absolute differences between the classical and robust correlation coefficients are less than .05, a few differences are fairly large (three are larger than .2 and ten are larger than .1). In order to more fully test the capabilities of the robust outlier detection method, we modified a subset of the P V A data as follows. We took a subset of 10,000 records from the P V A data set. Then we added 1,000 rows - each identical to the second row except 140 A B S O L U T E D I F F E R E N C E S B E T W E E N R O B U S T A N D C L A S S I C A L C O R R E L A T I O N S _ l I I 1 I I I L 0.3 H I—, ! , , , , , pi 0 20 40 60 80 100 120 140 Figure 4.22: Differences between Classical and Robust Correlation Coefficients for the PVA Data. that the value for the variable "minramnt" was changed to 1 and the value of the variable "avggift" was changed to 50. While this does not result in very extreme outliers, it does result in outliers that are well detached from the bulk of the data. The results for this modified data set, upon running the outlier detection computations, are 3618 outlier rows and 7382 non-outlier rows. The histogram of the robust distances is shown in Figure 4.23 (in which we have filtered out a few very extreme outlier distances for purposes of a more detailed display). Figure 4.24 shows the plot of ordered absolute differences between the classical and robust correlation coefficients obtained from the robust correlation matrix. In this case the outliers show up as a clear bump in the histogram bar located near 225. This suggests further investigation of the data by deleting all outliers with robust distances greater than 175-200. The overall shape in Figure 4.24 is similar to that of Figure 4.22, except now the largest difference is .5 rather than .35, and several more in the .2-.3 range, reflecting the impact of the added outliers. 141 Outlier Distance -i 1 1 1 1 1— 0 50 100 150 200 250 Bin Range Figure 4.23: Histogram of the Robust Mahalanobis Distances for the Modified PVA Data with Outliers. ABSOLUTE DIFFERENCES BETWEEN ROBUST AND CLASSICAL CORRELATIONS . I I I I I I I L 0.5 " I 0.4 H ~1 I I I I I I r 0 20 40 60 80 100 120 140 Figure 4.24: Differences between Classical and Robust Correlation Coefficients for the Modified PVA Data. 142 4 . 8 . 3 Daily Pressure in Northern Hemisphere Data In this example, two sets of data were considered for the analysis. We obtained these data from Dr. Pandolfo (Department of Oceanography, University of British Columbia). The first data set has the daily data of sea-level pressure (SLP) in boreal winter (December, January and February) for 41 years (from December 1958 to February 1999) in Northern Hemisphere from 20°N to 85°N partitioned with 2.5° x 2.5° grid lines. We have 27 (= 8 5 °~ 2 0 ° + 1) latitudinal grid lines and 145 (= f^- +1) longitudinal grid lines. At each of the 3915 (= 145 x 27) grid points, the SLP values are available for each of the 3690 (= 90 x 41) days. February 29 is not considered for the leap year. Thus, the number of observations is n = 3690 and the number of variables is p = 3915 for the first data set. The second set of the data has the daily pressure values at 500 hPa geopotential heights in boreal winter for 40 years (from December 1958 to February 1998) for the same grid points as above. The number of days in this case is 3600 (= 90 x 40). Thus, the number of observation is n = 3600 and the number of variables is p = 3915 for the second data set. Though 0° longitude and 360° longitude are the same line, they are considered different to present the world on a flat page. This is why the number of longitudinal grid lines is 145 (instead of 144). The results we obtained are almost similar for the two different data sets (because the data sets are similar in nature). Therefore, we discuss only the results for the sea-level pressure data set. To compare the classical covariance estimate with the robust pairwise Huberized covariance estimate, we implemented principal component analysis using the classical covariance estimate and the pairwise Huberized covariance estimate with c = 1. We calculated the proportion of total variance due to each principal component using both estimates. The first 20 of these proportions are presented in the first and second columns of Table 4.20. The values in the two columns are close which indicates that the data do not contain outliers. This also indicates that the pairwise Huberized covariance estimate 143 Clean Data 1% 3SD 1% 6SD 2% 3SD 2% 6SD Classical Robust Classical Robust Classical Robust Classical Robust Classical Robust 10.440 10.045 10.045 10.011 9.613 10.011 9.691 9.978 8.925 9.978 8.227 8.369 7.920 8.340 7.584 8.340 7.641 8.308 7.047 8.308 7.329 7.289 7.047 7.263 6.746 7.263 6.790 7.232 6.253 7.232 6.957 7.060 6.704 7.036 6.422 7.036 6.454 7.010 5.946 7.010 5.654 5.720 5.446 5.697 5.216 5.697 5.256 5.677 4.844 5.677 5.312 5.319 5.109 5.299 4.892 5.299 4.931 5.280 4.547 5.280 4.506 4.358 4.341 4.342 4.160 4.342 4.179 4.324 3.853 4.324 3.591 3.555 3.463 3.544 3.318 3.544 3.339 3.531 3.080 3.531 3.484 3.363 3.353 3.351 3.211 3.351 3.234 3.340 2.982 3.340 2.6487 2.605 2.551 2.596 2.444 2.596 2.462 2.587 2.270 2.587 2.449 2.377 2.357 2.367 2.259 2.367 2.274 2.359 2.098 2.359 2.390 2.329 2.301 2.321 2.204 2.321 2.224 2.313 2.053 2.313 2.018 1.984 1.943 1.978 1.862 1.978 1.874 1.827 1.728 1.827 1.488 1.840 1.782 1.834 1.708 1.834 1.722 1.827 1.590 1.827 1.737 1.663 1.674 1.657 1.604 1.657 1.614 1.651 1.489 1.651 1.637 1.590 1.577 1.585 1.511 1.585 1.524 1.580 1.407 1.580 1.532 1.424 1.478 1.419 1.418 1.419 1.427 1.414 1.318 1.414 1.430 1.394 1.379 1.391 1.321 1.391 1.335 1.386 1.234 1.386 1.403 1.356 1.355 1.352 1.300 1.352 1.305 1.348 1.205 1.348 1.231 1.173 1.186 1.170 1.136 1.170 1.147 1.166 1.060 1.166 Table 4.20: First 20 Proportions of Variation for the Classical and the Pairwise Huberized (with c = 1) Covariance Estimates. works as well as the classical estimate for clean data (data without outliers). To investigate how the classical and the pairwise Huberized covariance estimates be-have in a large data set with outliers, we decided to contaminate 10% of the variables in the data set. We used four different levels of contamination. For each of these 10% variables, first 1% and then 2% of the observations are selected randomly for contami-nation, and they are replaced by randomly generated values from each of the following 144 two distributions: • iV(max + 3SD, 0.5SD) • JV(max + 6SD, 0.5SD) where max is the maximum observation and SD is the standard deviation of the variables being contaminated. Thus, the four levels of contamination can be described as follows: • 1%, 3SD • 1%, 6SD • 2%, 3SD • 2%, 6SD Both the classical and the pairwise Huberized covariance estimates are used for each of these contaminations and the results are presented in Table 4.20. From the table we see that the proportions based on the classical covariance estimate change with increased contamination. However, the proportions based on the pairwise Huberized covariance estimate are hardly affected by contamination. To make the comparison more visible the first six proportions for both clean and contaminated data are plotted in Figure 4.25 for the classical covariance estimates and in Figure 4.26 for the pairwise Huberized covariance estimates. The plots clearly indicate that the classical covariance estimate is very much affected by outliers while the pairwise Huberized covariance estimate is more resistant to the outliers. In the classical approach, if we increase the percentage of contamination we get less favorable results. Also, if we use 6SD contamination instead of 3SD the results further deteriorate. For the pairwise Huberized covariance estimates, if we increase the percentage of contamination, the results get a little bit worse. However, if we use 6SD contamination instead of 3SD we get the same results because of the definition of the Huber function. 145 Classical Covariance Figure 4.25: Proportions of Variation for the Classical Covariance Estimates. Huberized Covariance, c = 1 Clean data 1%, 3sd/6sd 2%, 3sd/6sd Figure 4.26: Proportions of Variation for the Pairwise Huberized (with c = 1) Covariance Estimates. 146 For large data sets it is difficult or sometimes impossible to identify any outliers. In such cases we can apply both the classical and the Huberized approach. If the results are the same, we will understand that there are no outliers, and if the results are different we should rely on the results of the Huberized approach because of the observations presented above. 4 . 9 Chapter Appendix 4.9.1 P r o o f of Theorem 4.1 In this section, we prove Theorem 4.1 that shows under certain regularity conditions the Huberized correlation coefficient estimates are consistent. S T E P I By applying the Taylor expansion for the function f(x,y) = ^ Y^=i V> (^7^) about the point (px,ox) we get o-x n z=i ox (Px - Px) <^2^, {Xi - fix i=l OX (4.20) (dx - <JX) ^ > ^, (Xi — px i-l OX Xj - p,x dx for some (px,ox) between (px,ox) and (px,ox)-S T E P II We show that the second and the third terms on the right hand side of equa-tion (4.20) tend to zero as n —> oo. Since ip' is bounded \ip'\ < M, for some M > 0. So (Px - Px) , f Xi- px < M (Px - Px) ox nox 'r-i \ ox i= i and the right hand side tends to zero, since (px — Px) —> 0 as n —¥ oo. And for the third term, since ip'(X)X is bounded \ip'(X)X\ < M, for some M > 0. So (dx - ax) ^2^, f Xi - p x \ (Xi- px i = i ox ox < M (dx - ox) ox 147 and the right hand side tends to zero, since (bx — o~x) —> 0 as n —>• oo. S T E P III Since the variables ip (x^x), i = 1,... ,n, are i.i.d., then by the strong law of large numbers 'X-iix 1=1 x Xi - fix ox JE{ip ox a.s. as n —> oo. From the preceding steps i=l OX ]E\ip ox a.s. as n —>• oo. S T E P I V Let us assume that Cj = ± 1 . Then by writing the Taylor expansion for f(x,y) = i E"=i (^ V^ ) a b o u t t h e P ° i n t (Px,ox) we get n ^ e \ \ a x ) \ ox (px ~ Px) nox , (Xi- fix i=l Ox (bx - ox) , (Xi - flx\ (Xi - f~ix nox i=l Ox ox for some (fix,bx) between (fj,x,ox) and (fix,bx)- Now apply the same method as in Step II to show that the two terms on the right hand side of the above equality tend to zero. So i=\ N x Xi - fix n *r-f \ \ ox as n -> oo. From here it is clear that ^ r ^ ^ l l — > 0 a.s. ox i " 1 E n ^ i=i iP Xi - fix bx ip Xi - fix ox 0 a.s. as n —> oo. 148 S T E P V Set Then n ' i = i Xj - fix bx i= i v Xj - fix bx - Z i = i v Xj - nx ox i= i ^ Xi - fix ox 1> Xj - nx ox Xi - iix -z) U Xi - fix Xi - iix n ~ [ \ V °x J J \ \ ox J \ ox Now we show that the second term on the right hand side in the above equality tends to zero. In fact ib is bounded, so \tp\ < M then -Y,{^(Xi:^-^(Xi~^x j= i ox ox ~ n z—1 i = i Xi - fix ox Xj - iiX ox and the right hand side tends to zero by Step IV. So £ £ (^T~) ~^ (^ S )^)2 tends to zero. On the other hand, each of the terms tb ( X i ^ x ) ^s bounded by M so their average is bounded by M that is \Z\ < M. So i=i Xi-iiX\ _ z \ (^(Xi-fix ox ox \ ox < —E n i= i ^[X^hc\_^(Xl-iix ox ox 149 and by Step IV the right hand side tends to zero. So 2 ± M x - ^ ) - z ) U ( x - ^ ) - * f x ' - » * n t=T V V °x J / V V °~x J \ o-x tends to zero. But we already know that i=l as n —> oo. Then, a.s. ox J J \ \ o-x C J. as n —>• oo. S T E P V I Similarly, we show that, i=l as n —y oo, where \ Oy / J \ \ Oy a.s. n , \ Oy ) 1=1 v ' S T E P V I I We have n ~l \ \ Ox / V Oy i=l x Xi-fJLX\ , , (Yi- fiy OX \ Oy + ^ p _ £ £ . -(Z + W) 2 2 n " V V ox J \ °y J \ ox i=i N x X l - ^ + i > ( ^ * ) - ( z + w) OX J \ Oy 1 > [ X \ * X ) + * ( ? ^ ) - 1 > ( X i - * X ) - 1 > f Y i - ' M Y ' Ox \ Oy V Ox I \ Oy 150 Now we show that the second term on the right hand side of the above equality tends to zero 1=1 Ox oy ox oy < — F £ .1=1 n V ox 1> Xj - fJLX ox i = i oy Oy and the right hand side tends to zero. Now we show that the third term on the right hand side of the above equality tends to zero. n X i ~ ^ + 1 , ( * ^ J * ) - ( z + w) ox oy ri>[Xi7*x)+*(Yr*Y)-i>(Xi-tXX)-i>fYi-,iY ox Oy Ox oy , i = i n IF* Xi - ftx \ . (Xi- px ox n t=l oy oy and the right hand side tends to zero. Now we already know that km Xi - PX \ , . (Yi-fJLy Ox Oy (Z + W) ox oy a.s. 151 as n —> oo. Then, ox Gy OX oy a.s. as n ^ oo. S T E P V I I I Now we can write i=i X i ~ Vx\ ^(jfVLzfr^w ox oy 1 2n i=i ox oy - E * - ^ i ~ Px dx 2 n E * Yj - py dy W i = l N N * ' ' i=l and from the preceding steps, the right hand side of the above equality tends almost surely to as n —> oo. -Var I ip ox -Var ip Y-jxY oy S T E P I X Now the Huberized correlation coefficient estimate r — i £ ( * ( ^ ) - z ) -w) tends almost surely to V a r ( v ( ^ ) ) V a r ( v . ( ^ ) ) ' as n —y oo. 152 4.9.2 Proof of Theorem 4.2 Here we provide the proof of Theorem 4.2 that gives the asymptotic normality of the Huberized correlation coefficient estimates. It can be shown that Px --> Px a.s. fly — -> HY a.s. bx — ->• Ox a.s. by --> Oy a.s. as n —> oo. For i = 1,..., n, note that the estimates fix, PY, PX, PY, bx and by satisfy: - ± * ( ^ ) = 0; \ ox J i=i N ' l ± J ^ r ) - 0; n f-' V °x ) i=i x ' where fix and fiy are initial S-location estimates. Using Taylor expansion about the points (\ix,o~x) a n d (PY,O~Y), we define the following: \ °x J \ crx J ex \ crx J 1 , (Xi - fix\ (Xi - Px\,~ x - — $ 3 3 (<JX - t T X ) , °x \ °x J \ °x J 153 for some (fix, &x) between (px,ox) and (px,ox), and Oy I \ Gy I Gy \ Oy Oy V °Y ) \ °Y , for some (fry,Oy) between (/j,y,oy) and (fiy,ay). Therefore, £ £ " = 1 ip $ (fi^) c a n b e written as 1 " n / o-x \ ox (by - Gy) , (Ax - Px) 0"X V °x ) \ ox X Gy Gy \ Gy Oy \ Gy ) \ Gy ) Using Serfiing's lemma (Serfling, 1980, page 253), it is easy to show that ' X i - t i X \ . , ( Y i - p Y \ ( Y i - py - ± ± 4 - ox iP' Gy Gy -> A a.s. nox ^ \ ffy / V °x ) V ax N°Y ~{ \ °x ) V Yi - pY OY ->• 0 a.s. as n —y oo, where nt7x ^ \ CTy / \ Ox ,4 = E ^ | ^ £ V y - M A ' " m / cry Oy 154 and oy J \ ox J \ ox Moreover, all the other cross products are oiljy/n). For example oxoy ' ( X i ~ M (Ax - D ( Y ^ ) (Ay - KY) noxoy fr{ \ ox J \ oY ) X [s/n (fly - fly)] (flX ~ Hx) -> 0 oy as n —» oo. Therefore, n X i - f l X \ ( Y i - fly ox oy n n 2=1 Xi - a x \ (Yi- UY ox ib oy i=l ox oy Oy 1 n n ^-^ i=i * i - A^A ,/ (Xi - f i x \ (Xi- f i x Oy 0"x (<7X - Ox) , where "=" means "asymptotically equivalent". Hence, \ bx J \ oY is asymptotically equivalent to i=i x X, - n x \ ^ ( Y i - j i y ] _ A { d Y _ _ b { & x _ a x ) ox oy (4.21) 155 In addition, because Analogously, \ ox ) 1 n X i = i Therefore, Xj - nx ox - b -1 n n i=i , (Xi- n x \ (Xi- nx ox. Ox {bx-ox)+o Gy — Oy = E r = i * ( ^ ) - n 6 spn , ( YJ-HY \ ( Yj-u.Y\ ' L-ii=\ A ^ C y J y C Y J and ox - ox EIU X - nb Z-ii=\ A ^ crx ) \ ax J Replacing (4.22) and (4.23) in (4.21) we get (4.22) (4.23) 156 157 and n i=l where Xj - fix dx n i=i 0* i=l Xi - px ox E { X ' ( ^ ) ( ^ ) } (*g*)} From (4.24), (4.25) and (4.26) it follows that b , (4.26) £ E ^ 2 ( ^ ) V ^E?=iV>2(^) V Nl 0 v o y o n a 12 c r 1 3 0"2i 022 023 \ 031 032 033 / J as n —y oo, where o n = 012 = 013 = 0-22 = Var { ip Y - H Y \ , fX - px Var t ip I ^ ^ 0 y / V o-x C o v ^ l ^ ^ W ^ ^ U 2 C o v ^ l ^ ^ W ^ ^ W ^ - ^ OX Oy 2 I Y ~ PY OX ox oy 023 Oy ox J (4.27) 158 and 0-33 = Var < ip' 2 , X - fj,x ox To simplify the notation set i = i * i - fly \ , f Xi- fix 4> i y - .2 (Yi-P-Y n ^ i = i (Jy , U = IE { V Ox I I \ CTy 2 , y - ^ Xy ox I U = IE cry 2 f X - fix Ox From (4.27) and using the t5-method (see Billingsley, 1986) we obtain / \ / n (f — r) = \fn \^YUV(^)] [ i E ^ 2 E { ^ ( ^ ) } E { ^ ( ^ ) } as n —> 00, where y/VnWn y/vw, (<7 (^n, Vn, Wn) - g (u, v, w)) ^ d N {0, V ; S V 9 } , ^ On O12 (T13 ^ C21 0"22 ^23 V 031 CT32 CT33 / g(u,v,w) = ( , vw ) 159 and Vg = Vg(U,V,W) = (d/du) g(u,v,w) (d/dv) g(u,v,w) \ 1/' y/vW -0.5 (1/V) (u/y/vw) \ \ (d/dw) g (u,v,w) ) \ — 0.5 (l/ui) (u/y/vw) j 4.9.3 Proof of Theorem 4.3 A famous result of Ito (1951), see 0ksendal (1998) page 38, gives the following formula for n times iterated Ito integrals: n: /-</< 0 < « ! < . . . < U n < t dBUl)dBU2)... dBUn = Uhn (4.28) where B is a given Brownian motion and hn is the Hermite polynomial of degree n, defined by hn(x) = (-1)"exp ( %- j ^ e x p n = 0,1,2, Thus the first Hermite polynomials are h0(x) = 1, hi(x) = x, h-zix) = x2 — 1, h3(x) = x3 — 3x, h4(x) = x* - 6x2 + 3, h5(x) =xb - lOx* + 15x,.. Then the normalized Hermite polynomials will be Hn(x) = ^hn(x). 160 Let X(t) and Y(t) be two Brownian motions such that < X, Y >t= pt, where < X, Y >t is the quadratic variation of X and Y. In principle, we can construct X(i) and Y(t) from a couple of i.i.d. standard Brownian motions. Let B\, B2 and B be i.i.d. standard Brownian motions and define X(t) = y/T^BtW + y/pBit), Y(t) = y/T^~pB2{t) + yfpB{t). Then M{X(t)Y(t)} = pt. (4.29) Note that J * 1 d <X,Y >s= pds = p = TE{X(1)Y(1)} by (4.29). Taking successive integrals, we construct the sequence {Xn(t)} by Xx{t) = X(t) X2(t) = fx^dXis) Jo Xn+l(t) = f Xn{s)dX{s). Jo Similarly we construct the sequence {Ym(t)} by Yi(t) = Y(t) Y2(t) = ftYl(s)dY(s) Jo Ym+1(t) = fYm(s)dY{s). Jo 161 Now applying (4.28) with t = 1 for X(t), we obtain Xx(l) = f1 dX(s) = X(l)-X(0) = X(l) = h1(X(l))=H1(X(l)) Jo X2(l) = £ X1(u1)dX(u1) = £ dX(s)^dX(Ul) ^h2(X(l)) = H2(X(1)) X3(l) = f' X2{ux)dX{ux) Jo = J (jH X 1 (u 2 )dX(u 2 )^ dX(Ul) 'jH Q T d X ( U 3 ) ) dX(u2)^j dX(ux) = |/i3(^(l)) = # 3 (* (1 ) ) X n ( l ) = r X ^ x ^ O d X ^ ) = ^hn(X(l)) = Hn(X(l)) = Similarly, F m ( l ) = Hm(Y(l)) = Hm(Y). Hence the sequences of Xn(l) and Ym(l) Hermite polynomials. Now we show that for every n, we have pntn TE{Xn(t)Yn(t)} = ni Using the definition of stochastic integral, we obtain for every n and m, JE{Xn(t)Ym(t)} = IE Xn_i(s)dX(s) Ym-i(s)dY(s) = E < f Xn^dXis), f Ym_l(s)dY{s) Jo Jo = B f X n _ 1 ( s ) F m _ 1 ( S ) 0 ! < X , F > s Jo = [ m{Xn^(s)Ym^(s)}pds Jo = p f JEiX^^Y^s^ds. Jo 162 By induction, we see that if m = n, then we obtain JE{Xn(t)Yn(t)} = ^ , 71/* while if m ^ n, then / E{Y,(S)}G>S = 0, Jo f JEiXjis^ds = 0. Jo Now Xi,X2, •. • and Yx, Y2,... are orthogonal in L2 space. Thus TEiXiWXjit)} = 0, n x m = t. Let X = X(l) and Y = Y(l), then we showed that the families Hn(X) and Hm(Y) are orthonormal with TE{Hn(X)Hm{Y)} where _ I 1 n = m [0 n ^ m. Since the Hermite polynomials are complete, i.e. the functions f(X) and g(Y) with IE{ / 2 (X)} < oo and lE{g2(Y)} < oo, can be approximated by Hermite polynomials, thus oo f(X) = Y,anHn(X), n=0 and 00 f(Y) = Y,*mHm(Y). m=0 . 163 Without loss of generality, we assume that f(X) and g(Y) have mean zero and variance one. Consequently, ao = 0, bo = 0 and then, I, n = l m = l oo oo = ^2^2anbm\E{Hn(X)Hm(Y)} n=l m = l oo = ^anbnTE{Hn(X)Hn(Y)} (by orthogonality) n-l oo = ^2 anKpn 71=1 O O = p^anbnp11'1 71 = 1 O O < H X ) K I I 6 » I ( I P I < I ) oo (by Cauchy-Schwarz) 71 = 1 y n=l \ m = l = \P\E{f\x)}1<2n<r'(Y)} 1' 2 = \p\-Since E { / 2 ( X ) } 1 / 2 = 1 and E {g2{Y)}1/2 = 1. • 4 . 9 . 4 Proof of Theorem 4 . 4 Let HQ and i f be elliptically symmetric distributions in E 2 and assume that (X, Y) is distributed according to the following model: H = (1 - e)H0 + eH. Assume without loss of generality that the location-and scale parameters of X and Y are known, such that p,x = PY = 0 and ox = oy = 1. We will show that the correlation coefficient r(H) of ip(X) and ip(Y) satisfies: (1 - n)r(H0) - 77 < r(#) < (1 - rj)r(H0) + 77, 164 where ^ = ^ • ^ J ; ^ . The Huberized correlation coefficient of X and y is defined as follows: r i H ) = ^{xmY) y/JEH^(X)]EH^(Y) (1 - e)JEHoib(X)iP(Y) + €E6rP(X)^(Y) 7(1 - e ) E f f 0 ^ 2 ( X ) + e E ^ 2 ( X ) 7 ( 1 - e ) I E ^ 2 ( y ) + eIE^2(r)' By the Cauchy-Schwarz inequality < (1 - e)WH^{X)ib{Y) + t^^{X)M^HY) " 7(1 - e)E*0V2(X) + € E ^ 2 ( X ) 7 ( 1 - e ) E ^ 2 ( y ) + e E ^ 2 ( y ) ' By symmetry, the worst H must have the same marginals < (1 - e)IEH oV>(X)^(y) + d E g o > 2 ( X ) - (l-e)TEHo1>*(X) + eTEHooi!>*(X) ' Because EH o^(x)^(y)| < BHJ2{X) Set < ( l -e)E g 0 V>(X)V>(y) + eV>2(oo) ( l - e ) E H 0 ^ 2 W + ^ 2 (oo) (1 - O r W ) + C l E ^ ^ (l-e)+e V>2(oo) V>2(oo) _ | _ 6 ^ 2 ( 0 0 ) ^ 2(oo) 77 l - e E f f 0 ^ 2 ( I ) 77 = (1 — rj)A = A-rjA A 7 1 = T T X A 165 Then r{H0) + A r(H) < 1 + A V r(H0) + Analogously Therefore = (1 - 7])r{H0) + 77. r(Ho) ~ A < r(H0)+A 1+A ~ y } ~ 1+A ' Now the contamination bias: Therefore So and In summary r(H0)-A r(H0) - A - r(H0) - Ar(H0) 1 + A - T { H o ) = iTA r{H0)+A A - T T A — r { H o ) = T + - A { 1 - R { H O ) ) -^ ( 1 + r(H0)) < r(H) - r{H0) < ^ ( 1 - r(H0)). \r(H) - r(H0)\ < ^ ( 1 + r{H0)) V r(H0) > 0, \r(H) - r(H0)\ < ^ ( 1 - r ( f f 0 ) ) V r ( f f 0 ) > 0. I r ^ - r ^ l ^ ^ a + l r^o) ! ) . 166 Note that 1. The worst case bias corresponds to |r(ifo)| — i : \r(H)-r(H0)\<2T^J. 2. The smallest value of A is A = 1 if ip(X) = SGN(X), in other words the quadrant correlation is asymptotically minimax with respect to bias: \r(H) - r(H0)\ < (1 + W M ^ ^ j - = (1 + \r(H0)\)e. 167 Chapter 5 Robust Estimation of Multivariate Location The preceding chapter dealt with the estimation of the scatter matrix. However, the same idea of using fewer coordinates can also be applied to the estimation of the multivariate location. In this chapter, we discuss coordinate-wise estimate of a multivariate location which involves one dimensional data at a time. Particularly, we study the coordinate-wise median. Some notions of robustness are considered, such as, the minimaxity properties of the coordinate-wise median under the new contamination model (3.4). 5.1 Classical Multivariate Location Estimate In this section, we consider the estimation of the "center" or location of a distribution in W. Suppose that we have a multivariate random sample X = Xi,..., Xn = (Xn,Xi2, • • •, Xip),..., (Xni, Xn2,.. • , Xnp), so that the sample consists of n data points (rows) each of p dimensions (columns). A multivariate location estimate can be described as a K.p-valued function, Tn, defined for each sample size n, mapping the set of data points into some point T n ( X i , . . . , Xn) = Tn(X), which is an approximation of the location of the distribution. A location estimate, 1 6 8 Tn, is said to be translation and coordinate-wise scale equivariant if T n ( X + b) = T„ (X) + b, for all constant vectors b G R p , where X + b = { J i + b , . . . , Xn + b}, and Tn(XA) = Tn(X)A, for all diagonal p x p matrices A = Diag(ai,... ,ap), where XA = {XXA,...,XnA}. The most well-known estimate of multivariate location is the arithmetic mean which is basically the least squares estimate because it minimizes X^=i 11-^ *— ^ 1I2> w n e r e || • || is the Euclidean norm. However, it is not necessary that such an estimate is useful in all situations. It is well known that the arithmetic mean is not robust, because a single "bad" outlier in the sample can take X arbitrarily far away. Using the definition of the multivariate location breakdown point (2.17) in Section 2.2 of Chapter 2, we can show that the multivariate arithmetic mean possesses a breakdown point of 1/n. We often consider the limiting breakdown point as n —> oo. Therefore we can say that the multivariate mean has 0 breakdown point, see Rousseeuw and Leroy (1987). It is obvious that no translation location equivariant estimate can have a breakdown point larger than .50, because one could build a configuration of outliers which is just a translation image of the "good" data points, making it impossible for the estimate to choose. In univariate situations, this upper bound of .50 breakdown point can be attained, for example, by the sample median. Therefore, several multivariate generalizations of the median have been constructed, as well as some other proposals to achieve a certain amount of robustness. Tn(X) = X 169 5.2 Robust Multivariate Location Estimates Robust alternatives to the arithmetic mean for estimating locations have a history going back at least to Laplace (see Stigler 1986, page 54). Fisher (1922) drew attention to the inefficiency of the arithmetic mean as an estimate of location for some distributions be-longing to the family of Pearson curves near the normal. Using his normal contamination models, Tukey (1960) dramatically demonstrated how inefficient the mean can become when contamination increases. The same paper also shows how alternative location es-timates such as the median or trimmed means can achieve higher asymptotic efficiency than the mean. We distinguish between two classes of robust location estimates: those that are affine equivariant and those that are not. We say that, Tn is affine equivariant if and only if Tn(XA + b) =Tn(X)A + b, for all nonsingular pxp matrices A and b 6 W, where XA + b = {XXA + b,..., XnA + b} . Sometimes we do not consider equivariance with respect to all affine transformations, but only for those that preserve Euclidean distances. An estimate is said to be orthogonal equivariant if Tn(XA + b) = Tn{X)A + b, for all orthogonal p x p matrices A (i.e. A' = A'1) and b € W, where XA + b = {XiA + b,..., XnA + b} . For instance, the Li-location estimate defined as n T n = argrnin^||Xi-T||, i=\ 170 is orthogonal equivariant because it only depends on Euclidean distances. The L\-estimate, also known as the spatial median or the mediancenter, is a generalization of the univariate median and its breakdown point is .50. Even though it is not affine equiv-ariant, the Li-estimate is clearly translation equivariant. Also, the foregoing minimum is known to be unique, except for degenerate cases. For some background and properties, refer to Small (1990); more recent results and references can also be found in Chaudhuri (1996) or Chakraborty and Chaudhuri (1999). Although the affine equivariance property seems a natural requirement for an esti-mate, there are many practical situations in which it is not necessarily desirable from our point of view. For instance, requiring affine equivariance may be too restrictive when the data set is large and high-dimensional (e.g. data mining applications) or when there is a natural representation of the data up to a shift and/or change of units (arising from the form of measurement). We have seen that traditional affine equivariant estimates are prohibitively expensive for large multivariate data sets. We also have shown numerically that under the assumption of the new contamination model, affine equivariant estimates break down. 5.3 Coordinate-wise Location Estimates The simplest and the straightforward approach is to consider each variable separately and simply calculate the robust location estimate for each of the individual variable. Indeed, for each variable the points X\3,X2j, • • •, Xn]- can be considered as a one dimensional data set with n points (for j — 1,... ,p). Estimates of this type are called coordinate-wise, in which one applies the one-dimensional robust location estimate to each coordinate and combine the results into a p-dimensional estimate. This procedure inherits the breakdown point of the original estimate; however, although it is not affine equivariant it is translation-scale equivariant. Note that, the multivariate arithmetic mean is an affine equivariant and can be com-171 puted coordinate-wise. However, this is an exception. Indeed, Donoho (1982, Proposi-tion 4.6) shows that the only measurable location estimate that is both affine equivariant and computable as a vector of one-dimensional location estimates is the arithmetic mean. A simple way to obtain coordinate-wise location estimates that are translation-scale equivariant with high breakdown point is to use a one-dimensional M-estimate to con-struct its multivariate analogue coordinate-wise. Given the p-dimensional distribution H, let Hi be the corresponding i-th marginal distribution. The coordinate-wise location M-estimate is defined as T(H) ( T(ffi) ^ T{H2) (5.1) V T(HP) J where T(Hi) is the corresponding M-estimate for the i-th marginal distribution Hi. To save computing time and still attain a high breakdown point, we consider the coordinate-wise median which is defined as M E D ( H ) / med(#i) ^ \ med(Hp) J (5.2) with med(Hi) = median(iJj). 5.4 Bias-Robustness Properties of Coordinate-wise Median In this section, we focus on the bias-robustness of the coordinate-wise location estimates in the context of the contamination model. We will consider contamination model of the form: X = (I-B)Y + BZ, (5.3) 172 where Y, B and Z are independent, Y is multivariate normal with mean fj, and co-variance matrix S , Z is an arbitrary random vector and the diagonal elements of B, Bi,..., Bp are i.i.d. Binomial(l,e). We will use the letters H, H0 and H to denote the joint distribution functions of X, Y and Z. Also, we will denote by % t the set of all distribution functions for vectors (5.3) which is called independent-contamination neighborhood of size e. Let Xi,X2, • • • ,Xn (Xi G W) be i.i.d. H. We will consider estimates that satisfy the following property: Tn(X1,X2, • • • ,Xn) -»• T(H) a.s. as n ^ oo. In particular, let YuY2,...,Yn (Yt e W) be i.i.d. H0. Then, Tn(Yi, Y2,..., Yn) -> T(H0) a.s. as n -> oo. Because of the contamination included in H, T(H) will typically be asymptotically biased. The asymptotic bias of T(H) with H G 7fe is defined as follows. b(T,H) = | | D i a g ( a - 1 / ^ a 2 - 2 1 / ^ . . . , c T ; V 2 ) ( T ( i / ) - T ^ where ||-|| is an arbitrary norm in W and if || • || is the Euclidean norm then, v b(T,H) = ^ [ T ^ - T ^ t f o ) ] 2 / ^ , (5-4) \ i=i where a | is the diagonal i-th element of the covariance matrix S . Notice that the asymptotic bias (5.4) is translation and coordinate-wise scale invari-ant. The corresponding maximum asymptotic bias (maxbias) over the contamination neighborhood % e is defined as follows. BT(e) = sup b(T,H). (5.5) Hene 173 Since we will only consider translation-scale equivariant estimates, we can assume without loss of generality that p\ — pi — ... = pp = 0 and an = a 22 = • • • = opp = 1. Therefore, the asymptotic bias of T(if) satisfies: The following results highlight the good bias-robustness properties of the coordinate-wise multivariate median estimate. The next theorem show that when Y in (5.3) has multi-variate normal distribution with mean pi and covariance matrix S = Diag(a21, <j|2,..., cr2p), then the coordinate-wise median is the minimax-bias among translation-scale equivariant multivariate location estimates. T H E O R E M 5.1 - Independent Normal Distributions - The coordinate-wise median M E D (if) (see (5.2)) minimizes the maximum asymptotic bias (5.5) at the independent-contamination neighborhood 7i€ (0 < e < 1/2) centered at the multivariate normal distri-bution with mean pi and covariance matrix £ = Diag(au, of 2,..., a2p) (in (5.3)) among translation-scale equivariant multivariate location estimates. The proof of Theorem 5.1 is given in Section 5.5.1 of the chapter appendix. Notice that Theorem 5.1 holds regardless of the lRp-norm used to define the asymptotic bias. This generality is of practical importance because the choice of an appropriate norm could depend on the given problem. We also notice that the maximum bias depends on the correlation structure of the "core" data. Therefore we should consider different correlation structures, in addition to the independent case addressed by Theorem 5.1. Unfortunately we could not extend Theorem 5.1 for correlated normal distributions in its complete generality. However, we obtained the following result for the class of "marginally-consistent?' estimates. p and the maxbias can be written as BT(e)= sup ||T(if)||. H & U e 174 Suppose that H is a p-dimensional distribution with the following property: for each i = 1, 2,... ,p the i-th one-dimensional marginal distribution of the H is symmetric about Iii. Then T(H) = Pi Pi \ Pp / T H E O R E M 5.2-General Normal Distributions-The coordinate-wise median MED(H) (see (5.2)) minimizes the maximum asymptotic bias (5.5) at the independent-contamination neighborhoodH£ (0 < e < 1/2) centered at the multivariate normal distribution with mean H and covariance matrix £ (in (5.3)) among translation-scale equivariant, marginally-consistent multivariate location estimates. The proof of Theorem 5.2 is given in Section 5.5.2 of the chapter appendix. 5.5 Chapter Appendix 5.5.1 Proof of Theorem 5.1 We will use an approach similar to Huber's (1964) proof of the minimaxity of the univari-ate median. For simplicity of notation, we consider the case p = 2. A similar approach can be followed for the cases p > 2. Because of the translation-scale invariance of the maximum asymptotic bias (5.5) and the location-scale equivariance of the coordinate-wise estimate, we can assume without loss of generality that /ii = ^ 2 = 0 and on = 022 = 1-Consider the independent-contamination neighborhood He (5.3) for H0, with P{Bi = 1) = e, i = 1,2. Let / be the joint density function of a distribution F € Tie. Then / 175 must be of the form: f(xux2) = f(x1,x2\B1=0,B2 = 0)P(Bl = 0,B2 = 0) +f(xl,x2\B1 = 0,B2 = l)P{Bl = 0, B2 = 1) +f(x1,x2\B1 = 1,B2 = 0)P(B1 = 1,B2 = 0) +f(xl,x2\B1 = 1,B2 = \)P(Bl = 1,B2 = 1) = (1 - e)2 <p{xx)ip{x2) + e (1 - e) ^(x^h^Xi) where <p is the standard normal density and hi(xi), h2(x2) and h3(xi,x2) are arbitrary densities. (see Huber, 1964) and will play a central role in our derivations below. Huber considered the case of a single e-contaminated normal density and the corresponding contamination neighborhood is denoted by Te. Huber constructed two e-contaminated normal distributions F+ and F_ , which are symmetric about x0 and —x0, respectively and which are translations of each other; that is F.(x) = F+{x + 2x0). The densities for F+ and F_ are defined respectively as follows. +e (1 - e) hi(xi)(p(x2) + e2h3(x1, x2), (5.6) corresponds to the maxbias of the univariate median f(x) = (l-e)<p(x)+eh(x), (5.7) (5.8) 176 Therefore, for any translation equivariant location estimate, T, we have T(F+)-T(F_) = 2x0, which implies that there is not any translation equivariant functional that can have an absolute bias smaller than x0 at F+ and F_ simultaneously. In fact, for any translation invariant functional \T(F+)-T(F.)\ < \T(F+)\ + \T(F.)\ 2x0 < \T{F+)\ + \T(F.)\, and so either | T ( F + ) | > x0 or |T(F_) | > x0. Then T has larger maxbias than that of the median. Therefore, sup \T(F)\ > x0. Huber's result then follows because /+ and /_ belong to the e-contamination neigh-borhood of the standard normal density. To see that let /+ (x) = (1 - e)(p(x) + eh+(x), /_ (x) = (1 - e)cp(x) + eh-(x). With ^ J . M - l i - ' M ^ ft_M = / - W - ( i - ^ W . The claim follows then provided that: 1. h+(x) > 0, h_(x) > 0; 2- I-oo h+(x)dx = h-{x)dx = 1. To see Part 1, notice that r ( \ (i \ t \ ' 0 x <x0 i e(l - e)[<p(x - 2x0) - ip(x)] x > x0, 177 e(l - e)[<p(x + 2xQ) - <p{x)] x < -x0 0 x > —x0. This follows because of the inequalities: (p(x - 2x0) - (p(x) > 0 (x - 2x0)2 < x2 x0(x0 - x) < 0 XQ < X, and ip(x + 2xQ) — <p(x) > 0 (x + 2x 0 ) 2 < x2 x0(x0 + x) < 0 %o < — To see Part 2 holds, notice that f°° 1 - e f°° (1/e) / [/ +(z) - (1 - e)<p(x)\ dx = / [<p(x - 2x0) - <p{x)] dx J-oo e J XQ ' [ i - $ ( - x 0 ) - i + $ M ] [2$(x0) - 1] 1 - e e 1 - e e 1 - e e 1 - e 2$ 2 (1 -e ) ( 1 - 0 = 1. A similar calculation shows that /oo [/_(ar)-(l-€Mx)]da; = l. •oo 178 Because F+ and F_ are translations of each other and T is translation equivariant, then we have F_(ar) = F+{x + 2x0) T(F.{x)) = T(F+(x + 2x0)) T(F.(x)) = T(F+(x))-2x0 T(F+)-T(F.) = 2x0. Now we turn our attention to the case p = 2. We define g+(xi,x2) = f+(xi)f+(x2), and g-(xux2) = f-(xi)f-(x2). With / + and /_ given by (5.7) and (5.8). We now proceed as follows: 1. Use the fact that f+(x) = (1 - e)<p(x) + eh+(x) and f-(x) = (1 - e)(p(x) + eh-(x) to show that g+(x\,x2) and g_(xi,x2) are of the form (5.6) and therefore the corresponding distribution functions G+ and G- belong to the independent-contamination neighborhood 7ie. 2. Show that g+(xi,x2) and g_(xi:x2) are translations of each other. For Part 1, we write g+(xi,x2) = f+(xi)f+(x2) = [(1 - e)<p(xi) + eh+(Xl)} [(1 - e)<p(x2) + eh+(x2)] = (1 - e)2<p(xi)<p(x2) + (1 - e)etp(xi)h+(x2) +e(l - e)h+(xi)<p(x2) + e2h+(xi)h+(x2), 179 and g-(xux2) = f-(xi)f-{x2) = [(1 - e)<p(x!) + eh.fa)] [(1 - e)<p(x2) + eh_(x2)} = (1 - e)2ip(xl)(p(x2) + (1 - e)e<p(xx)h-(x2) +e(l - e)h-(xi)(p(x2) + t2h_(xx)h-(x2). Hence g+(xi,x2) and g_(xi,x2) belong to the independent-contamination model (5.6) For Part 2, notice that since f-(xi) = f+(xi + 2x0), f.(x2) = U(x2 + 2x0). Then g-(xx,x2) = g+(xi + 2x0,x2 + 2x0). Now, similar to Huber (1964) for all translation equivariant estimate T we have T(G+) - T(G_) = 2 and so ||T(G+) - T(G_)| | = y/(2x0)2 + (2x0)2 = V2(2x0). Moreover, V2(2x0) = | | T ( G + ) - T ( G _ ) | | < | |T(G + ) | | + ||T(G_)||, and so V2(2x0) < 2 sup ||T(#)|| = 2BT(e). Hene This yields, BT(e) > V2x0. (5. 180 On the other hand, sup | | M E D ( i f ) | | = sup Jmedl(H) + me4%(H) Hene HeHe < sup med 2(ifi) + sup med {H2), where M E D (if) = ( m e d ( ^ ) \med{H2), is the coordinate-wise median. Since i i i and H2 are the distribution functions for (1 - Bl)X1 + BiXx and (1 - B2)X2 + B2X2 with Xx ~ JV(0,1) and X2 ~ iV(0, l) , by Huber (1964) sup | | M E D ( # ) | | < Jx2 + x2 = V2x0. (5.10) He-He v The result now follows from (5.9) and (5.10). • 5.5.2 Proof of Theorem 5.2 For simplicity of notation, we consider the case p = 2. A similar approach can be followed for the case p > 2. We can assume without loss of generality that fj. = 0 and (5.11) Let X = I * ' = (l~Bi)Yi + BlZl X2 j \ (1 - B2) Y2 j \ B2Z2 where Yx and Y2 are jointly normal with mean 0 and covariance matrix £ (5.11). In addition, B = Diag(f?i, JB2) , Y — (Yi,Y2)' a n d Z = (Zi,Z2)' are independent and the variables Z\ and Z2 are independent with common density j 0 x < x0 h+{x) = 1 - 6 6 [ip(x - 2x0) - <p(x)] x > xQ, 181 where as before, x0 = $ 1 ( 2 ( i - e ) ) • We notice that for i = 1, 2 X, = {l-B^Yi + BiZi, has density J o x < x0 {(1 - e) tp(x - 2x0) x>x0, which is symmetric at x0. Therefore T(X) = ( X o On the other hand, by similar arguments x = -x=( ~Xl \ = ( (1" B l ) {~Yl) ] + ( B l \ - X 2 J \ (1 - B2) (-Y2) j \ B 2 {-Z2) {l-B1)Y1 \ f BXZX (1-B2)Y2) \B2Z2 where Yx and Y2 are jointly normal with mean 0 and covariance matrix £ (5.11). Since B = Diag(Bi , f i 2 ) , Y = (YX,Y7J and Z = {Z\,Z2^ are independent, the joint distri-bution of X belongs to 7i€. Moreover, the variables Zx and Z2 are independent with common density ( l-[(p(x + 2xQ) - <p(x)] x < -xQ x > —x0, and so, for i = 1,2 X = (l-Bi)Yi + BiZi, 182 has density I (1 - e) tp(x - 2x0) x < -xQ 9-{x) = < ^0 x > —x0, which is symmetric at — x0. Therefore T ( X ) = ( ~ X ° \ -xo. Hence, for all marginally consistent and translation-scale equivariant estimate T we have T{G+) - T ( G L ) = 2 I X ° \ x0 where G+ and G- are the distribution functions for G+ and G-, respectively. So as in the proof of Theorem 5.1, we obtain BT(e) > V2x0. (5.12) The theorem now follows from (5.12) and (5.10). • 183 Chapter 6 Conclusion Our study may be divided into three parts. In Part I we introduced a new contamination model that is more suitable than the existing contamination models for the multivariate setting. Part II dealt with the estimation of multivariate scatter where we revisited Huber's (1981) proposal and extended some of his results. Finally, in Part III, we studied the estimation of the multivariate location, in which we considered the coordinate-wise estimation. The following is an outline of the main results obtained in the thesis, the problems that we encountered, and the directions we foresee for future work. • A New Contamination Model: T We introduced a new multivariate contamination model that adequately repre-sents reality for many multivariate data sets that arise in practice. This model resolves the deficiency of the current contamination models by allowing more flexibility and certain forms of dependency that the existing contamination models do not address. • We gave some arguments and numerical evidence which indicate that the breakdown point of affine equivariant estimates tends to zero when the di-mension p tends to infinity under the new contamination model. • Our study concentrated on the multivariate location and scatter matrix es-timates. It is desirable to investigate the performance of robust regression 184 analysis and some other related models (e.g., error in variables) using the new contamination model. It is also of interest to revisit the problem of outliers detection in time series in the context of the new contamination model. This will be based on an embedding of the time series which allows us to regard the time series as a multivariate sample with identically distributed but non independent observations. Thus, multivariate outlier identifiers can be trans-ferred into the context of time series. This gives interesting insights in some features of outliers in time dependent data, which are not recognizable by other methods, see Gather et al. (2003). Simple Robust Pairwise Scatter Estimates: • The criterion for the selection of a good robust estimate includes small maxbias, high breakdown point and computational feasibility. Based on this crite-rion, we singled out a particular robust pairwise scatter estimate, namely, the Huberized correlation coefficient (with c = 1) based scatter matrix esti-mates. We found that this estimate is computationally simpler than the Fast MCD and other fast scatter estimates recently proposed (see Maronna and Zamar, 2002). We also showed that the Huberized estimate is more stable than the Fast MCD estimate when the data are contaminated according to the independent-contamination model. • We studied the consistency and asymptotic normality of the Huberized cor-relation coefficient estimates. It remains to establish the asymptotic distribu-tion of the Huberized correlation coefficient estimates using other than MM-location and scale estimates. T We added scalability to Maronna and Zamar (2002) pairwise robust scatter estimate by replacing the robustified Gnanadesikan and Kettenring (1972) 185 robust scale-based scatter estimate by the quadrant correlation estimate. We showed its scalability to high dimensions and large sample sizes. • We studied the maxbias and the intrinsic bias of the Huberized correlation coefficient estimates. We then extended Huber's (1981) maxbias formulas and derived the analytical form of the maxbias for the quadrant correlation (QC) coefficient when the locations and scales are unknown. It is desirable to show that the QC is also minimax in this more general context. Coordinate-wise Robust Location Estimates: T In this part we used the same criterion as in the case of the multivariate scatter matrix to select a good robust estimate. The coordinate-wise medians appeared to fulfil our requirements. T We studied the minimaxity properties of the coordinate-wise median for two special cases. The independent situation and the correlated situation; for the latest case we restricted attention to the class of marginally-consistent estimates. We have not been able to show that the coordinate-wise median is minimax in general. However, we conjecture that this is true and, therefore, deserves further study. T Our proposed Huberized covariance matrix estimates and the coordinate-wise medians can be used to define a robustified Mahalanobis distance. It is of interest to study its approximate distribution properties. 186 Bibliography Abdullah, M. B. (1990). On a robust correlation coefficient. The Statistician, 39:455-460. Adrover, J. and Yohai, V. J. (2002). Projection estimates of multivariate location. Tentatively accepted in the Annals of Statistics. Bickel, P. J. (1964). On some alternative estimates for shift in the p-variate one sample problem. Ann. Math. Statist., 35:1079-1090. Billingsley, P. (1986). Probability and Measure. John Wiley and Sons, 2 edition. Butler, R., Davies, P. L., and Jhun, M. (1993). Asymptotics for the minimum covariance determinant estimator. Annals of Statistics, 21:1385-1400. Carroll, R. J. and Ruppert, D. (1985). Transformation in regression: A robust analysis. Technometrics, 27:1-12. Chakraborty, B. and Chaudhuri, P. (1999). A note on the robustness of multivariate medians. Statist. Probab. Letters, 45:269-276. Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91:862-872. Croux, C. and Haesbroeck, G. (1999). Influence function and efficiency of the mini-mum covariance determinant scatter matrix estimator. Journal of Multivariate Analysis, 71:161-190. Croux, C , Haesbroeck, G., and Rousseeuw, P. J. (1997). Location adjustment for the minimum volume ellipsoid estimator. Technical report, University of Brussels (ULB). Available at http://homepages.ulb.ac.be/~ccroux/public.htm. Croux, C. and Rousseeuw, P. J. (1992a). A class of high-breakdown scale estimators based on subranges. Communications in Statistics, Theory Methods, 21:1935-1951. Croux, C. and Rousseeuw, P. J. (1992b). Time-efficient algorithms for two highly robust estimators of scale. Computational Statistics, 2:411-428. Davies, P. L. (1987). Asymptotic behavior of S-estimates of multivariate location pa-rameters and dispersion matrices. Annals of Statistics, 15:1269-1292. 187 Davies, P. L. (1992). The asymptotic of Rousseeuw's minimum volume ellipsoid. Annals of Statistics, 20:1828-1843. Devlin, S. J., Gnanadesikan, R., and Kettenring, J. R. (1981). Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76:354-362. Donoho, D. L. (1982). Breakdown Properties of Multivariate Location Estimators. PhD thesis, Dept. of Statistics, Harvard University. Fang, K. T. and Zhang, Y. (1990). Generalized Multivariate Analysis. Springer and Science Press, Berlin and Beijing. Fernholz, L. (1983). Von Mises Calculus for Statistical Functionals. Springers, New York. Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Astr. Soc London Ser. A, 222:309-368. Fraumeni, J. F. (1968). Cigarette smoking and cancers of the united track: Geographic variations in the united states. Journal of the National Cancer Institute, 41:1205-1211. Gather, U., Bauer, M., and Fried, R. (2003). The identification of multiple outliers in online monitoring data. In process. Genton, M. G. and Ma, Y. (1999). Robustness properties of dispersion estimators. Statistics and Probability Letters, 44:343-350. Gnanadesikan, R. and Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28:81-124. Griibel, R. (1988). A minimal characterization of the covariance matrix. Metrika, 35:49-52. Haldane, J. B. S. (1948). Note on the median of multivariate distribution. Biometrika, 35:414-415. Hampel, F. (1968). Contributions to the Theory of Robust Estimation. PhD thesis, Univ. Calif., Berkeley. Hampel, F. (1971). A general qualitative definition of robustness. Annals of Mathemat-ical Statistics, 42(6):1887-1896. Hampel, F. (1973). Robust estimation: A condensed partial survey. Zeitschrift fur Wahrscheinlichkeitstheorie and Verwandte Gebiete, 27:87-104. Hampel, F., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley and Sons, Inc. New York. 188 Hawkins, D. (1994). The feasible solution algorithm for the minimum covariance deter-minant estimator in multivariate data. Comput. Statist. Data. Anal., 17:197-210. He, X. and Simpson, D. (1992). Robust direction estimation. Annals of Statistics, 20:351-369. He, X. and Simpson, D. (1993). Lower bounds for contamination bias: Globally minimax versus locally linear estimation. Annals of Statistics, 21:314-337. Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35:73-101. Huber, P. J. (1977). Robust covariances. In Gupta, S. S. and Moore, D. S., editors, Statistical Decision Theory and Related Topics 2, pages 165-191. Academic Press. Huber, P. J. (1981). Robust Statistics. John Wiley and Sons. Ito, K. (1951). Multiple Wiener'integral. Journal of Mathematical Society, 3:157-169. Japan. Joe, H. (1997). Multivariate Models and Dependence Concepts. Chapman and Hall, 1 edition. Li, B. and Zamar, R. H. (1991). Min-max asymptotic variance of M-estimates of location when scale is unknown. Statistics and Probability Letters, 11:139-145. Liu, R. Y. (1990). On a notion of data depth based on random simplices. Annals of Statistics, 18:405-414. Lopuhaa, H. (1989). On the relation between S-estimators and M-estimators of multi-variate location and covariance. Annals of Statistics, 17:1662-1683. Lopuhaa, H. and Rousseeuw, P. J. (1991). Breakdown point of affine equivariant estima-tors of multivariate location and covariance matrices. Annals of Statistics, 19:229-248. Ma, Y. and Genton, M. G. (2001). Highly robust estimation of dispersion matrices. Journal of Multivariate Analysis, 78(1):11—36. Maronna, R. A. (1974). Estimation Robusta De Locacion Y Dispersion Multivariadas. PhD thesis, Universidad de Buenos Aires. Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4:51-67. Maronna, R. A., Stahel, W. A., and Yohai, V. J. (1992). Bias-robust estimation of multivariate scatter based on projections. Journal of Multivariate Analysis, 42:141-161. 189 Maronna, R. A. and Yohai, V. J. (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90:330-341. Maronna, R. A. and Zamar, R. H. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4):307-317. Mickey, M. R., Dunn, O. J., and Clark, V. (1967). Note on the use of stepwise regression in detecting outliers. Comput. Biomed. Res., 1:105-111. Oja, H. (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Letters, 1:327-332. Oksendal, B. K. (1998). Stochastic Differential Equations: An Introduction With Ap-plications. Springer-Verlag, Berlin Heidelberg, 5 edition. Pefta, D. and Prieto, F. J. (2001). Multivariate outlier detection and robust covariance matrix estimation. Technometrics, 43:286-301. Press, W., Teukolsky, S. A., Vetterlling, W. T., and Flannery, B. P. (1992). Numerical Recipes in Fortran. Cambridge University Press. Rey, W. J. J. (2001). On 100% multivariate breakdown point. In International Confer-ence on Robust Statistics, Vorau, Austria. Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data. Journal of the American Statistical Association, 91:1047-1061. Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79:871-880. Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Gross-mann, W., Pflig, G., Vincze, I., and Wertz, W., editors, Mathematical Statistics and Applications, pages 283-297. Reidel Publishing, Dodrecht. Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88:1273-1283. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley and Sons. Rousseeuw, P. J. and Molenberghs, G. (1993). Transformation of non positive semidef-inite correlation matrices. Communications in Statistics: Theory and Methods, 22:965-984. Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212-223. 190 Rousseeuw, R J. and Van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85:633-339. Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regression by means of S-estimates. In Robust and Nonlinear Time Series Analysis, volume 26, pages 256-272. Springer. Lecture notes in statistics. Ruppert, D. (1992). Computing S-estimates for regression and multivariate loca-tion/dispersion. Journal of Computational and Graphical Statistics, 1:253-270. Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75:828-838. Sen, P. K. and Puri, M. L. (1971). Nonparametric Methods in Multivariate Analysis. John Wiley and Sons, New York. Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York. Small, C. G. (1990). A survey of multidimensional medians. Int. Statist. Rev., 58:263-277. Stahel, W. A. (1981). Breakdown of covariance estimators. Technical Report 31, Fach-gruppe fur Statistik, ETH, Zurich. Stigler, S. (1986). The History of Statistics. The Belknap Press of Harvard University Press, Cambridge, Ma. Tukey, J. W. (1960). A survey of sampling from contaminated distribution. In Olkin, I., editor, Contributions to Probability and Statistics, pages 448-485. Stanford University Press, Stanford, Calif. Tukey, J. W. (1962). The future of data analysis. Ann. Math. Statist., 33:1-67. Tukey, J. W. (1975). Mathematics and picturing data. In International Congress of Mathematics, volume 2, pages 523-531. Weisberg, S. (1985). Applied Linear Regression. John Wiley and Sons, New York. Woodruff, D. L. and Rocke, D. M. (1993). Heuristic search algorithms for the minimum volume ellipsoid. Journal of Computational and Graphical Statistics, 2:69-95. Woodruff, D. L. and Rocke, D. M. (1994). Computable robust estimation of multivari-ate location and shape in high dimension using compound estimators. Journal of the American Statistical Association, 89:888-896. Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15:642-656. 191 Yohai, V. J. and Zamar, R. H. (1988). High breakdown point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association, 86:403-413. Zamar, R. H. and Alqallaf, F. (2001). New contamination model for high dimensional data sets. In Joint Statistical Meeting Conference, Atlanta. 192
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A new contamination model for robust estimation with...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A new contamination model for robust estimation with large high-dimensional data sets Alqallaf, Fatemah Ali 2003
pdf
Page Metadata
Item Metadata
Title | A new contamination model for robust estimation with large high-dimensional data sets |
Creator |
Alqallaf, Fatemah Ali |
Date Issued | 2003 |
Description | Data sets can be very large, highly multidimensional and of mixed quality. This thesis provides feasible and robust methods for estimating multivariate location and scatter matrix for such data. Our estimates scale well to very large sample sizes and dimensions and are resistant to the presence of multivariate outliers. Statisticians use contamination or mixture models to study the performance of robust alternatives to classical statistical procedures. Most multivariate contamination models for numeric data proposed to date (see Hampel et al., 1986) assume that the majority of the observations comes from a nominal distribution such as a multivariate normal distribution, while the remainder comes from another multivariate distribution that generates outliers. We stress that such outliers could be "bad" data due to recording errors of all kinds, or they could be a highly informative subset of the data that leads to the discovery of unexpected knowledge in areas such as business operations, credit card fraud, and even the analysis of performance statistics of professional athletes. Unfortunately, the previously available models do not adequately represent reality for many multivariate data sets that arise in practice. It may often happen that outliers occur in each of the variables independently of the other variables or in special dependency patterns. We introduce a new contamination model that overcomes the main drawbacks of the current models by taking into account different sources of variability in the data, and allowing greater flexibility. Moreover, our model permits for situations where extreme values of one or more variables (not necessarily outliers) may increase the likelihood of outliers or gross errors in other variables. There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine equivariant estimates that possess high breakdown points and small worst case biases. All such estimates have unacceptable exponential complexity 2P in the number of variables p. And one of the more attractive of these estimates, the Stahel-Donoho estimate, has an unacceptable quadratic complexity n2 in the number of observations n. These estimates may be applied in large data applications with large p and n only by the use of adhoc sampling methods that render the robustness properties of the estimates unclear. In this thesis we focus on pairwise robust scatter matrix estimates and coordinate-wise location estimates. The pairwise scatter estimates are based on coordinate-wise robust transformations (the quadrant correlation estimate, and the coordinate-wise Huberized estimates). We show that such estimates are computationally simple, and have attractive robustness properties under the existing and the newly proposed contamination models. |
Extent | 7642409 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0080075 |
URI | http://hdl.handle.net/2429/14790 |
Degree |
Doctor of Philosophy - PhD |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2003-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2003-854221.pdf [ 7.29MB ]
- Metadata
- JSON: 831-1.0080075.json
- JSON-LD: 831-1.0080075-ld.json
- RDF/XML (Pretty): 831-1.0080075-rdf.xml
- RDF/JSON: 831-1.0080075-rdf.json
- Turtle: 831-1.0080075-turtle.txt
- N-Triples: 831-1.0080075-rdf-ntriples.txt
- Original Record: 831-1.0080075-source.json
- Full Text
- 831-1.0080075-fulltext.txt
- Citation
- 831-1.0080075.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0080075/manifest