Outliers and data mining : finding exceptions in data
Knorr, Edwin M.
Abstract
Our thesis is that we can efficiently identify meaningful outliers in large, multidimensional datasets. In particular, we introduce and study the notion and utility of distance-based outliers (DB-outliers).

First, we focus on outlier identification, and present nested-loop, index-based, and cell-based algorithms. For datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most 3 passes over a dataset. We provide experimental results showing that these cell-based algorithms are by far the best for 4 or fewer dimensions.

Second, we focus on the computation of intensional knowledge, that is, we provide a description or an explanation of why an identified outlier is exceptional. We provide an algorithm to compute the "strongest" outliers in a dataset (i.e., outliers that are dominant for certain attributes or dimensions). Our notion of strongest outliers allows significant pruning to take place in the search tree. We also define the notion of "weak" outliers. With respect to the computation of intensional knowledge, we develop naive, semi-naive, and I/O optimized algorithms. The latter class of algorithms intelligently schedules I/O's, and achieves optimization via page sharing.

Third, we focus on robust space transformations to achieve meaningful results when mining k-D datasets for DB-outliers. Robust estimation is used to: (a) account for differences among attributes in scale, variability, and correlation, (b) account for the effects of outliers in the data, and (c) prevent undesirable masking and flooding during the search for outliers. We propose using a robust space transformation called the Donoho-Stahel estimator (DSE), and we show key properties of the DSE. Of particular importance to data mining applications involving large or dynamic datasets is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. We develop randomized algorithms and evaluate how well they perform empirically. The novel algorithm that we develop is called the Hybrid-Random algorithm, which can significantly outperform the other DSE algorithms for moderate-high levels of recall.

Experimental results using real-world data are included to show the utility and efficiency of DB-outliers.
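As a concrete illustration of the distance-based outlier notion the abstract refers to, the sketch below implements the simple nested-loop check: a point is flagged when at least a fraction p of the other points lie more than distance D away, following the usual DB(p, D) formulation. This is a minimal Python sketch under that assumption; the function and parameter names are illustrative, and the thesis's index-based and cell-based algorithms, which avoid this quadratic scan, are not shown.

```python
# Minimal sketch of distance-based (DB) outlier detection via the naive
# nested-loop approach. Assumption: a point is a DB(p, D)-outlier when at
# least a fraction p of the other points lie more than distance D from it.
# Names (db_outliers, p, D) are illustrative, not the thesis's actual code.
import math

def db_outliers(points, p, D):
    """Return the points that qualify as DB(p, D)-outliers under Euclidean distance."""
    n = len(points)
    # A point stops being an outlier candidate once more than n*(1-p) other
    # points ("neighbours") fall within distance D of it.
    max_neighbours = n * (1.0 - p)
    outliers = []
    for i, x in enumerate(points):
        neighbours = 0
        for j, y in enumerate(points):
            if i == j:
                continue
            if math.dist(x, y) <= D:
                neighbours += 1
                if neighbours > max_neighbours:
                    break  # early exit: x has too many close neighbours
        else:
            outliers.append(x)  # inner loop finished without the early exit
    return outliers

# Example: a tight cluster near the origin plus one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(db_outliers(data, p=0.75, D=1.0))  # -> [(5.0, 5.0)]
```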
Item Metadata
Title | Outliers and data mining : finding exceptions in data
Creator | Knorr, Edwin M.
Publisher | University of British Columbia
Date Issued | 2002
Extent | 16353044 bytes
Genre |
Type |
File Format | application/pdf
Language | eng
Date Available | 2009-09-15
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0051497
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2002-05
Campus |
Scholarly Level | Graduate
Aggregated Source Repository | DSpace