UBC Theses and Dissertations
Outliers and data mining: finding exceptions in data
Knorr, Edwin M.
Our thesis is that we can efficiently identify meaningful outliers in large, multidimensional datasets. In particular, we introduce and study the notion and utility of distance-based outliers (DB-outliers).

First, we focus on outlier identification, presenting nested-loop, index-based, and cell-based algorithms. For datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most 3 passes over a dataset. We provide experimental results showing that these cell-based algorithms are by far the best for 4 or fewer dimensions.

Second, we focus on the computation of intensional knowledge, that is, we provide a description or an explanation of why an identified outlier is exceptional. We provide an algorithm to compute the "strongest" outliers in a dataset (i.e., outliers that are dominant for certain attributes or dimensions). Our notion of strongest outliers allows significant pruning to take place in the search tree. We also define the notion of "weak" outliers. For the computation of intensional knowledge, we develop naive, semi-naive, and I/O-optimized algorithms; the latter class intelligently schedules I/Os and achieves optimization via page sharing.

Third, we focus on robust space transformations to achieve meaningful results when mining k-D datasets for DB-outliers. Robust estimation is used to: (a) account for differences among attributes in scale, variability, and correlation; (b) account for the effects of outliers in the data; and (c) prevent undesirable masking and flooding during the search for outliers. We propose using a robust space transformation called the Donoho-Stahel estimator (DSE), and we show key properties of the DSE.
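To make the nested-loop approach concrete, the following is a minimal sketch of distance-based outlier detection under the usual DB(p, D) formulation, in which a point is an outlier when at least a fraction p of the dataset lies farther than distance D from it. All names, the signature, and the parameter choices below are illustrative, not the thesis code.

```python
import math

def db_outliers(points, p, D):
    """Flag DB(p, D)-outliers with a simple nested-loop scan.

    A point is a distance-based outlier when at least a fraction p of
    the n points lies farther than distance D from it; equivalently,
    it has at most floor((1 - p) * n) neighbors within distance D.
    """
    n = len(points)
    max_neighbors = math.floor((1.0 - p) * n)
    outliers = []
    for i, a in enumerate(points):
        neighbors = 0
        is_outlier = True
        for j, b in enumerate(points):
            if i != j and math.dist(a, b) <= D:
                neighbors += 1
                if neighbors > max_neighbors:
                    is_outlier = False
                    break  # early exit: the neighbor budget is exhausted
        if is_outlier:
            outliers.append(a)
    return outliers
```

The early exit from the inner loop is what keeps the quadratic scan tolerable in practice: a densely surrounded point is dismissed after only a handful of distance computations.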
Of particular importance to data mining applications involving large or dynamic datasets is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. We develop randomized algorithms and evaluate their empirical performance. Our novel Hybrid-Random algorithm can significantly outperform the other DSE algorithms for moderate to high levels of recall. Experimental results using real-world data are included to show the utility and efficiency of DB-outliers.