UBC Theses and Dissertations
Outliers and data mining: finding exceptions in data
Knorr, Edwin M.
Our thesis is that we can efficiently identify meaningful outliers in large, multidimensional datasets. In particular, we introduce and study the notion and utility of distance-based outliers (DB-outliers).

First, we focus on outlier identification, presenting nested-loop, index-based, and cell-based algorithms. For datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most 3 passes over a dataset. We provide experimental results showing that these cell-based algorithms are by far the best for 4 or fewer dimensions.

Second, we focus on the computation of intensional knowledge, that is, we provide a description or an explanation of why an identified outlier is exceptional. We provide an algorithm to compute the "strongest" outliers in a dataset (i.e., outliers that are dominant for certain attributes or dimensions). Our notion of strongest outliers allows significant pruning to take place in the search tree. We also define the notion of "weak" outliers. For the computation of intensional knowledge, we develop naive, semi-naive, and I/O-optimized algorithms; the latter class intelligently schedules I/Os and achieves optimization via page sharing.

Third, we focus on robust space transformations to achieve meaningful results when mining k-D datasets for DB-outliers. Robust estimation is used to: (a) account for differences among attributes in scale, variability, and correlation; (b) account for the effects of outliers in the data; and (c) prevent undesirable masking and flooding during the search for outliers. We propose using a robust space transformation called the Donoho-Stahel estimator (DSE), and we show key properties of the DSE.
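To make the nested-loop approach concrete, the following is a minimal sketch of distance-based outlier detection under the usual DB(p, D) formulation, in which a point is an outlier when at least a fraction p of the dataset lies farther than distance D from it. All names, the signature, and the parameter choices below are illustrative, not the thesis code.

```python
import math

def db_outliers(points, p, D):
    """Flag DB(p, D)-outliers with a simple nested-loop scan.

    A point is a distance-based outlier when at least a fraction p of
    the n points lies farther than distance D from it; equivalently,
    it has at most floor((1 - p) * n) neighbors within distance D.
    """
    n = len(points)
    max_neighbors = math.floor((1.0 - p) * n)
    outliers = []
    for i, a in enumerate(points):
        neighbors = 0
        is_outlier = True
        for j, b in enumerate(points):
            if i != j and math.dist(a, b) <= D:
                neighbors += 1
                if neighbors > max_neighbors:
                    is_outlier = False
                    break  # early exit: the neighbor budget is exhausted
        if is_outlier:
            outliers.append(a)
    return outliers
```

The early exit from the inner loop is what keeps the quadratic scan tolerable in practice: a densely surrounded point is dismissed after only a handful of distance computations.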
Of particular importance to data mining applications involving large or dynamic datasets is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. We develop randomized algorithms and evaluate their empirical performance. Our novel Hybrid-Random algorithm can significantly outperform the other DSE algorithms for moderate to high levels of recall. Experimental results using real-world data are included to show the utility and efficiency of DB-outliers.