A new contamination model for robust estimation with large high-dimensional data sets

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

A new contamination model for robust estimation with large high-dimensional data sets Alqallaf, Fatemah Ali

Abstract

Data sets can be very large, highly multidimensional and of mixed quality. This thesis provides feasible and robust methods for estimating multivariate location and scatter matrix for such data. Our estimates scale well to very large sample sizes and dimensions and are resistant to the presence of multivariate outliers. Statisticians use contamination or mixture models to study the performance of robust alternatives to classical statistical procedures. Most multivariate contamination models for numeric data proposed to date (see Hampel et al., 1986) assume that the majority of the observations comes from a nominal distribution such as a multivariate normal distribution, while the remainder comes from another multivariate distribution that generates outliers. We stress that such outliers could be "bad" data due to recording errors of all kinds, or they could be a highly informative subset of the data that leads to the discovery of unexpected knowledge in areas such as business operations, credit card fraud, and even the analysis of performance statistics of professional athletes. Unfortunately, the previously available models do not adequately represent reality for many multivariate data sets that arise in practice. It may often happen that outliers occur in each of the variables independently of the other variables or in special dependency patterns. We introduce a new contamination model that overcomes the main drawbacks of the current models by taking into account different sources of variability in the data, and allowing greater flexibility. Moreover, our model permits for situations where extreme values of one or more variables (not necessarily outliers) may increase the likelihood of outliers or gross errors in other variables. There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine equivariant estimates that possess high breakdown points and small worst case biases. All such estimates have unacceptable exponential complexity 2P in the number of variables p. And one of the more attractive of these estimates, the Stahel-Donoho estimate, has an unacceptable quadratic complexity n2 in the number of observations n. These estimates may be applied in large data applications with large p and n only by the use of adhoc sampling methods that render the robustness properties of the estimates unclear. In this thesis we focus on pairwise robust scatter matrix estimates and coordinate-wise location estimates. The pairwise scatter estimates are based on coordinate-wise robust transformations (the quadrant correlation estimate, and the coordinate-wise Huberized estimates). We show that such estimates are computationally simple, and have attractive robustness properties under the existing and the newly proposed contamination models.

Item Metadata

Title	A new contamination model for robust estimation with large high-dimensional data sets
Creator	Alqallaf, Fatemah Ali
Publisher	University of British Columbia
Date Issued	2003
Description	Data sets can be very large, highly multidimensional and of mixed quality. This thesis provides feasible and robust methods for estimating multivariate location and scatter matrix for such data. Our estimates scale well to very large sample sizes and dimensions and are resistant to the presence of multivariate outliers. Statisticians use contamination or mixture models to study the performance of robust alternatives to classical statistical procedures. Most multivariate contamination models for numeric data proposed to date (see Hampel et al., 1986) assume that the majority of the observations comes from a nominal distribution such as a multivariate normal distribution, while the remainder comes from another multivariate distribution that generates outliers. We stress that such outliers could be "bad" data due to recording errors of all kinds, or they could be a highly informative subset of the data that leads to the discovery of unexpected knowledge in areas such as business operations, credit card fraud, and even the analysis of performance statistics of professional athletes. Unfortunately, the previously available models do not adequately represent reality for many multivariate data sets that arise in practice. It may often happen that outliers occur in each of the variables independently of the other variables or in special dependency patterns. We introduce a new contamination model that overcomes the main drawbacks of the current models by taking into account different sources of variability in the data, and allowing greater flexibility. Moreover, our model permits for situations where extreme values of one or more variables (not necessarily outliers) may increase the likelihood of outliers or gross errors in other variables. There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine equivariant estimates that possess high breakdown points and small worst case biases. All such estimates have unacceptable exponential complexity 2P in the number of variables p. And one of the more attractive of these estimates, the Stahel-Donoho estimate, has an unacceptable quadratic complexity n2 in the number of observations n. These estimates may be applied in large data applications with large p and n only by the use of adhoc sampling methods that render the robustness properties of the estimates unclear. In this thesis we focus on pairwise robust scatter matrix estimates and coordinate-wise location estimates. The pairwise scatter estimates are based on coordinate-wise robust transformations (the quadrant correlation estimate, and the coordinate-wise Huberized estimates). We show that such estimates are computationally simple, and have attractive robustness properties under the existing and the newly proposed contamination models.
Extent	7642409 bytes
Genre	Thesis/Dissertation
Type	Text
File Format	application/pdf
Language	eng
Date Available	2009-11-11
Provider	Vancouver : University of British Columbia Library
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI	10.14288/1.0080075
URI	http://hdl.handle.net/2429/14790
Degree	Doctor of Philosophy - PhD
Program	Mathematics
Affiliation	Science, Faculty of; Mathematics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2003-05
Campus	UBCV
Scholarly Level	Graduate
Aggregated Source Repository	DSpace

Item Media

ubc_2003-854221.pdf -- 7.29MB

Item Citations and Data

Rights

For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.

Open Collections

UBC Theses and Dissertations

A new contamination model for robust estimation with large high-dimensional data sets Alqallaf, Fatemah Ali

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights