UBC Theses and Dissertations
Comparison of data classification procedures in applied geochemistry using Monte Carlo simulation
Stanley, Clifford R.
In geochemical applications, data classification commonly involves 'mapping' continuous variables into discrete descriptive categories. This is often achieved by using thresholds to define specific ranges of data as separate groups, which can then be compared with other categorical variables. This study compares several classification methods used in applied geochemistry to select thresholds and to discriminate between populations or recognize anomalous observations. The comparisons were made using Monte Carlo simulation to evaluate how well different techniques perform on data sets of differing structure. Maximum likelihood parameter estimates for a mixture of normal distributions, obtained from class interval frequencies and from raw data, were compared to assess the quality of the corresponding results. The more time-consuming raw data approach produces optimal parameter estimates, while the more rapid class interval approach is the one in common use. Results show that, provided there are more than 50 observations per distribution and (on average) 10 observations per class interval, the maximum likelihood parameter estimates from the two methods are practically indistinguishable. Univariate classification techniques evaluated in this study include the 'mean plus 2 standard deviations', the '95th percentile', the gap statistic and probability plots. Results show that the 'mean plus 2 standard deviations' and '95th percentile' approaches are inappropriate for most geochemical data sets. The probability plot technique classifies mixtures of normal distributions better than the gap statistic; however, the gap statistic may be used as a discordancy test to reveal the presence of outliers. Multivariate classification using the background characterization approach was simulated using several different functions to describe the variation in the background distribution.
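The two simplest univariate rules named above can be illustrated with a small Monte Carlo sketch. It is not the thesis's simulation design: the mixture parameters (a background population near 10 and an anomalous population near 25, mixed 80:20), the sample sizes, the seed and all function names are assumptions chosen only for illustration. It shows one plausible reason such rules can misclassify a mixture: the '95th percentile' flags roughly 5% of the data whatever the true anomalous proportion is, and the 'mean plus 2 standard deviations' threshold is itself inflated by the anomalous values.

```python
import random
import statistics


def simulate_mixture(n_background, n_anomalous, seed=0):
    """Draw a two-population mixture of normals (hypothetical parameters)."""
    rng = random.Random(seed)
    background = [rng.gauss(10.0, 2.0) for _ in range(n_background)]
    anomalous = [rng.gauss(25.0, 3.0) for _ in range(n_anomalous)]
    return background, anomalous


def mean_plus_2sd(values):
    """Threshold at the sample mean plus two sample standard deviations."""
    return statistics.mean(values) + 2.0 * statistics.stdev(values)


def percentile_95(values):
    """Threshold at the 95th percentile of the sample."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(values, n=100)[94]


def flagged_fraction(values, threshold):
    """Fraction of values classified as anomalous (above the threshold)."""
    return sum(v > threshold for v in values) / len(values)


background, anomalous = simulate_mixture(800, 200)
pooled = background + anomalous  # what the analyst actually observes

t_sd = mean_plus_2sd(pooled)
t_p95 = percentile_95(pooled)

for name, t in [("mean + 2 SD", t_sd), ("95th percentile", t_p95)]:
    print(
        f"{name}: threshold={t:.2f}, "
        f"pooled flagged={flagged_fraction(pooled, t):.3f}, "
        f"anomalous flagged={flagged_fraction(anomalous, t):.3f}, "
        f"background flagged={flagged_fraction(background, t):.3f}"
    )
```

In this assumed setting 20% of the observations are truly anomalous, yet both rules flag far fewer, so most of the anomalous population is assigned to background.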
Comparisons of principal components, ordinary least squares regression and reduced major axis regression indicate that reduced major axis regression and principal components are not only consistent with assumptions about geochemical data, but are less sensitive to varying degrees of data set truncation than is ordinary least squares regression. Furthermore, correcting the descriptive statistics of a truncated data set and calculating the background functions using these statistics produces residuals and scores which are predictable and thus can be distinguished easily from residuals and scores calculated for data from another distribution.
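For readers unfamiliar with the two regression approaches compared above, the following sketch shows how the ordinary least squares and reduced major axis slopes are computed and how a truncation experiment could be set up. It makes no attempt to reproduce the thesis's simulations or its finding. The data model (both variables measured with error), the cutoff on `x`, the seed and the function names are all assumptions. The only relationship relied on is the standard one: the OLS slope of y on x equals the correlation coefficient times the ratio of standard deviations, while the RMA slope is that ratio carrying the sign of the correlation.

```python
import math
import random


def ols_and_rma_slopes(x, y):
    """Return (OLS slope of y on x, reduced major axis slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ols = sxy / sxx
    # RMA slope: ratio of standard deviations, signed like the covariance.
    rma = math.copysign(math.sqrt(syy / sxx), sxy)
    return ols, rma


def simulate_pairs(n, seed=1):
    """Hypothetical background data with measurement error in both variables."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        t = rng.gauss(50.0, 10.0)                      # unobserved true value
        xs.append(t + rng.gauss(0.0, 3.0))             # error in x
        ys.append(5.0 + 2.0 * t + rng.gauss(0.0, 6.0)) # error in y
    return xs, ys


def truncate_below(x, y, cutoff):
    """Keep only the pairs whose x value exceeds an assumed lower cutoff."""
    kept = [(a, b) for a, b in zip(x, y) if a > cutoff]
    return [a for a, _ in kept], [b for _, b in kept]


x_full, y_full = simulate_pairs(500)
x_trunc, y_trunc = truncate_below(x_full, y_full, cutoff=45.0)

ols_full, rma_full = ols_and_rma_slopes(x_full, y_full)
ols_trunc, rma_trunc = ols_and_rma_slopes(x_trunc, y_trunc)

print(f"full data      (n={len(x_full)}): OLS={ols_full:.3f}, RMA={rma_full:.3f}")
print(f"truncated data (n={len(x_trunc)}): OLS={ols_trunc:.3f}, RMA={rma_trunc:.3f}")
```

How each slope responds depends on which variable is truncated and on the error structure, so conclusions about relative sensitivity should come from the thesis's own simulations rather than from this single assumed case.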