UBC Theses and Dissertations

Machine learning for spectroscopic data analysis: challenges of limited labelled data

Dirks, Matthew

Abstract

Extracting meaningful information from spectra, such as sample composition, is challenging. Building prediction models with supervised learning requires labelled data, which is often limited. To overcome this limitation, this thesis explores strategies ranging from models reliant on domain knowledge to those that are primarily data-driven. Leveraging domain knowledge, an approximate, fully-differentiable X-ray fluorescence (XRF) simulator is developed and used in two models. In the first model, the simulator is fit to an observed spectrum, and the resulting parameter values are mapped to element concentrations by regression modelling. In the second model, the simulator is embedded in an auto-encoder (AE) neural network. The AE learns the inverse function of the simulator while also adapting to the data when instrument or environment parameters are unavailable. An experiment comparing the AE to standard regression models found improved predictions for 11 elements. Another AE model is developed that uses more general domain knowledge about spectra, applicable to any type of spectrum containing peak-shaped structures. In 10×10-fold cross-validation, this model showed a statistically significant decrease in prediction error relative to state-of-the-art models for tin concentration (p < 0.00001), and it was tied for best on 11 of 32 elements. A benefit of both AE models is that they can use unlabelled data in semi-supervised learning, lowering the requirement for ground truth data. Neural networks require extensive hyperparameter optimization (HPO), which depends on validation data to estimate performance accurately. HPO works poorly when the validation set score is noisy, as is typical of small datasets. Ensembling is used to lower the variance, resulting in a neural network configuration that performs as well as an expertly-chosen configuration.
A final prediction model combines information from multiple spectrometers, which is particularly challenging for small datasets. Several sensor fusion methods are compared, including a parallel-input convolutional neural network (CNN). In 10-fold cross-validation, high-level methods based on partial least squares (PLS) performed best, though neural network models were competitive.
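
As a rough illustration of the first model described in the abstract (fitting a differentiable simulator to an observed spectrum), the sketch below fits a toy peak model by gradient descent. It is not the thesis's XRF simulator: the Gaussian peak shape, the peak centres and width, the learning rate, and the function names `simulate` and `fit_amplitudes` are all assumptions chosen for illustration, and the later step of regressing fitted parameters onto element concentrations is omitted.

```python
import math

def simulate(amplitudes, centers, width, n_channels):
    """Toy stand-in for a simulator: a sum of Gaussian peaks (assumed shape)."""
    return [
        sum(a * math.exp(-0.5 * ((ch - c) / width) ** 2)
            for a, c in zip(amplitudes, centers))
        for ch in range(n_channels)
    ]

def fit_amplitudes(observed, centers, width, lr=5.0, steps=300):
    """Fit peak amplitudes to an observed spectrum by gradient descent on the MSE."""
    n = len(observed)
    amps = [0.0] * len(centers)
    # Unit-amplitude peak shapes; each is also d(spectrum)/d(amplitude).
    basis = [simulate([1.0], [c], width, n) for c in centers]
    for _ in range(steps):
        predicted = [sum(a * b[ch] for a, b in zip(amps, basis)) for ch in range(n)]
        residual = [p - o for p, o in zip(predicted, observed)]
        # Gradient of the mean squared error with respect to each amplitude.
        grads = [2.0 / n * sum(r * b[ch] for ch, r in enumerate(residual))
                 for b in basis]
        amps = [a - lr * g for a, g in zip(amps, grads)]
    return amps

# Hypothetical values: two well-separated peaks on a 100-channel spectrum.
CENTERS = [30.0, 60.0]
WIDTH = 3.0
TRUE_AMPS = [2.0, 0.5]

observed = simulate(TRUE_AMPS, CENTERS, WIDTH, n_channels=100)
fitted = fit_amplitudes(observed, CENTERS, WIDTH)
print(fitted)
```

Because this toy model is linear in the amplitudes and the synthetic spectrum is noise-free, plain gradient descent recovers the generating amplitudes. A physical simulator would have non-linear parameters and noisy measurements, so the fit would generally be harder.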

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International