Comparison of data classification procedures in applied geochemistry using Monte Carlo simulation

UBC Theses and Dissertations

Stanley, Clifford R.

Abstract

In geochemical applications, data classification commonly involves 'mapping' continuous variables into discrete descriptive categories. This is often achieved by using thresholds to define specific ranges of data as separate groups, which can then be compared with other categorical variables. This study compares several classification methods used in applied geochemistry to select thresholds and discriminate between populations or to recognize anomalous observations. The comparisons were made using Monte Carlo simulation to evaluate how well different techniques perform on different data set structures. Maximum likelihood parameter estimates of a mixture of normal distributions obtained from class interval frequencies were compared with those obtained from raw data to assess the quality of the corresponding results. The more time-consuming raw data approach produces optimal parameter estimates, while the faster class interval approach is the one in common use. Results show that, provided there are more than 50 observations per distribution and (on average) 10 observations per class interval, the maximum likelihood parameter estimates from the two methods are practically indistinguishable. Univariate classification techniques evaluated in this study include the 'mean plus 2 standard deviations', the '95th percentile', the gap statistic and probability plots. Results show that the 'mean plus 2 standard deviations' and '95th percentile' approaches are inappropriate for most geochemical data sets. The probability plot technique classifies mixtures of normal distributions better than the gap statistic; however, the gap statistic may be used as a discordancy test to reveal the presence of outliers.

Multivariate classification using the background characterization approach was simulated using several different functions to describe the variation in the background distribution. Comparisons of principal components, ordinary least squares regression and reduced major axis regression indicate that reduced major axis regression and principal components are not only consistent with assumptions about geochemical data, but are also less sensitive to varying degrees of data set truncation than ordinary least squares regression. Furthermore, correcting the descriptive statistics of a truncated data set and calculating the background functions from these corrected statistics produces residuals and scores that are predictable and can therefore be distinguished easily from residuals and scores calculated for data from another distribution.
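
To make the kind of comparison described in the abstract concrete, the following is a minimal Python sketch of a Monte Carlo experiment that scores the 'mean plus 2 standard deviations' and '95th percentile' thresholds on a simulated mixture of two normal populations. It is not taken from the thesis: the population means (10 and 20), the common standard deviation (2), the 20% anomalous proportion, the number of trials and all function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_mixture(n_background, n_anomalous, rng):
    """Draw one sample from two normal populations (illustrative parameters).

    Returns (values, labels), where a label of True marks an anomalous draw.
    """
    values = [rng.gauss(10.0, 2.0) for _ in range(n_background)]  # assumed background
    labels = [False] * n_background
    values += [rng.gauss(20.0, 2.0) for _ in range(n_anomalous)]  # assumed anomalous
    labels += [True] * n_anomalous
    return values, labels


def threshold_mean_2sd(values):
    """'Mean plus 2 standard deviations' threshold of the pooled sample."""
    return statistics.mean(values) + 2.0 * statistics.stdev(values)


def threshold_95th_percentile(values):
    """'95th percentile' threshold: the last of the 19 cut points for n=20."""
    return statistics.quantiles(values, n=20)[-1]


def misclassification_rate(values, labels, threshold):
    """Fraction of observations whose class is wrong when value > threshold means anomalous."""
    wrong = sum((v > threshold) != is_anomalous for v, is_anomalous in zip(values, labels))
    return wrong / len(values)


def monte_carlo(n_trials=200, n_background=160, n_anomalous=40, seed=0):
    """Average misclassification rate of each threshold rule over repeated simulations."""
    rng = random.Random(seed)
    rules = {
        "mean + 2 SD": threshold_mean_2sd,
        "95th percentile": threshold_95th_percentile,
    }
    rates = {name: [] for name in rules}
    for _ in range(n_trials):
        values, labels = simulate_mixture(n_background, n_anomalous, rng)
        for name, rule in rules.items():
            rates[name].append(misclassification_rate(values, labels, rule(values)))
    return {name: statistics.mean(r) for name, r in rates.items()}


if __name__ == "__main__":
    for name, rate in monte_carlo().items():
        print(f"{name}: mean misclassification rate {rate:.3f}")
```

Under these assumed parameters the '95th percentile' rule can flag only the top 5% of each sample, so when 20% of the observations are anomalous it must miss most of them regardless of how well separated the populations are. This illustrates one way a fixed-percentage threshold can be unsuitable for mixed data; it does not reproduce the thesis's own simulations or results.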

Item Metadata

Title	Comparison of data classification procedures in applied geochemistry using Monte Carlo simulation
Creator	Stanley, Clifford R.
Publisher	University of British Columbia
Date Issued	1988
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2010-10-21
Provider	Vancouver : University of British Columbia Library
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI	10.14288/1.0052350
URI	http://hdl.handle.net/2429/29430
Degree	Doctor of Philosophy - PhD
Program	Geological Sciences
Affiliation	Science, Faculty of; Earth, Ocean and Atmospheric Sciences, Department of
Degree Grantor	University of British Columbia
Campus	UBCV
Scholarly Level	Graduate
Aggregated Source Repository	DSpace

Item Media

UBC_1988_A1 S73.pdf -- 15.97MB
