Machine learning for spectroscopic data analysis : challenges of limited labelled data

UBC Theses and Dissertations

Dirks, Matthew

Abstract

Extracting meaningful information from spectra, such as sample composition, is challenging. Building prediction models with supervised learning requires labelled data, which is often limited. To overcome this limitation, this thesis explores strategies ranging from models reliant on domain knowledge to those that are primarily data-driven. Leveraging domain knowledge, an approximate, fully differentiable X-ray fluorescence (XRF) simulator is developed and used in two models. In the first model, the simulator is fit to an observed spectrum, and the resulting parameter values are mapped to element concentrations by regression modelling. In the second model, the simulator is embedded in an auto-encoder (AE) neural network. The AE learns the inverse function of the simulator while also adapting to the data when instrument or environment parameters are unavailable. An experiment comparing the AE to standard regression models found improved predictions for 11 elements. Another AE model is developed that uses more general domain knowledge about spectra and applies to any type of spectrum containing peak-shaped structures. With this model, 10×10-fold cross-validation found a statistically significant decrease in prediction error compared to state-of-the-art models for tin concentration (p < 0.00001), and the model was tied for best on 11 out of 32 elements. A benefit of both AE models is that they can use unlabelled data in semi-supervised learning, lowering the requirement for ground-truth data. Neural networks require extensive hyperparameter optimization (HPO), which depends on validation data to estimate performance accurately. HPO works poorly when the validation-set score is noisy, as is typical of small datasets. Ensembling is used to lower the variance, resulting in a neural network configuration that performs as well as an expertly chosen configuration.
A final prediction model combines information from multiple spectrometers, which is particularly challenging for small datasets. Several sensor fusion methods are compared, including a parallel-input convolutional neural network (CNN). In 10-fold cross-validation, high-level partial least squares (PLS)-based methods performed best, though neural network models were competitive.
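
As a rough illustration of what "fitting a simulator to an observed spectrum" can look like, the sketch below fits a toy forward model by gradient descent. It is not the thesis's XRF simulator: the sum-of-Gaussians model, the function names (`gaussian`, `simulate`, `fit_amplitudes`), the peak positions, the width, and the learning rate are all assumptions chosen for illustration, and the "observed" spectrum is synthetic and noise-free.

```python
import math

def gaussian(x, center, width):
    """Unit-height Gaussian peak evaluated at channel x."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def simulate(amplitudes, centers, width, channels):
    """Toy forward model (an assumption): a spectrum as a sum of Gaussian peaks."""
    return [
        sum(a * gaussian(x, c, width) for a, c in zip(amplitudes, centers))
        for x in channels
    ]

def fit_amplitudes(observed, centers, width, channels, lr=0.1, steps=200):
    """Fit peak amplitudes by gradient descent on the sum of squared errors."""
    amplitudes = [0.0] * len(centers)
    for _ in range(steps):
        predicted = simulate(amplitudes, centers, width, channels)
        residuals = [p - o for p, o in zip(predicted, observed)]
        # d(loss)/d(a_k) = 2 * sum_x residual(x) * gaussian(x, c_k, width)
        grads = [
            2.0 * sum(r * gaussian(x, c, width) for r, x in zip(residuals, channels))
            for c in centers
        ]
        amplitudes = [a - lr * g for a, g in zip(amplitudes, grads)]
    return amplitudes

channels = list(range(100))
centers = [30.0, 60.0]        # hypothetical peak positions
true_amplitudes = [5.0, 2.0]  # hypothetical parameters behind the toy spectrum
observed = simulate(true_amplitudes, centers, 2.0, channels)
fitted = fit_amplitudes(observed, centers, 2.0, channels)
```

Because the toy model is differentiable in its parameters, the fit needs only the gradient of the reconstruction error. In the first model described above, fitted parameter values of this kind would then be passed to a separate regression step to predict element concentrations.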

Item Metadata

Title	Machine learning for spectroscopic data analysis : challenges of limited labelled data
Creator	Dirks, Matthew
Supervisor	Poole, David L. (David Lynton), 1958-
Publisher	University of British Columbia
Date Issued	2023
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2024-01-11
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0438638
URI	http://hdl.handle.net/2429/87209
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2024-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
