Open Collections
UBC Theses and Dissertations
Machine learning for spectroscopic data analysis : challenges of limited labelled data
Dirks, Matthew
Abstract
Extracting meaningful information from spectra, such as sample composition, proves to be challenging. Building prediction models with supervised learning requires labelled data which is often limited. To overcome the challenge of limited data, this thesis explores various strategies spanning the gamut from models reliant on domain knowledge to those primarily data-driven.
Leveraging domain knowledge, an approximate, fully-differentiable X-ray fluorescence (XRF) simulator is developed and used in two models. In the first model, the simulator is fit to an observed spectrum. The resulting parameter values are mapped to element concentrations by regression modelling. In the second model, the simulator is embedded in an auto-encoder (AE) neural network. The AE learns the inverse function of the simulator while also adapting to the data when instrument or environment parameters are unavailable. An experiment comparing the AE to standard regression models found improved predictions for 11 elements. Another AE model is developed that uses more general domain knowledge about spectra, which applies to any type of spectrum containing peak-shaped structures. With this model, a statistically significant decrease in prediction error compared to state-of-the-art models was found for predicting tin concentration (with p < 0.00001) in the results of 10×10-fold cross-validation, and it was tied for best on 11 out of 32 elements. A benefit of both AE models is that they can utilize unlabelled data in semi-supervised learning to lower the requirements for ground truth data.
Neural networks require extensive hyperparameter optimization (HPO) which depends on validation data to estimate performance accurately. HPO works poorly when the validation set score is noisy; noisy validation scores are typical of small datasets. Ensembling is used to lower the variance, resulting in a neural network configuration that performs as well as an expertly-chosen configuration.
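The idea of fitting a differentiable simulator to an observed spectrum can be illustrated with a minimal sketch. This is not the thesis's XRF simulator: it assumes a toy model in which a spectrum is a sum of Gaussian peaks with known centres and width, and only the peak amplitudes are fitted. The names `simulate` and `fit_amplitudes`, the peak positions, and the learning rate are all hypothetical choices for illustration, and the gradient is written out by hand where a real implementation would more likely rely on automatic differentiation.

```python
import math


def simulate(amplitudes, centers, width, n_channels):
    """Toy spectrum (assumption): a sum of Gaussian peaks, one per emission line."""
    return [
        sum(a * math.exp(-((ch - c) ** 2) / (2 * width ** 2))
            for a, c in zip(amplitudes, centers))
        for ch in range(n_channels)
    ]


def fit_amplitudes(observed, centers, width, lr=5.0, steps=200):
    """Fit peak amplitudes by gradient descent on the mean squared error."""
    n = len(observed)
    # Peak shapes do not depend on the amplitudes, so compute them once.
    shapes = [
        [math.exp(-((ch - c) ** 2) / (2 * width ** 2)) for ch in range(n)]
        for c in centers
    ]
    amps = [0.0] * len(centers)
    for _ in range(steps):
        predicted = [sum(a * s[ch] for a, s in zip(amps, shapes)) for ch in range(n)]
        residual = [p - o for p, o in zip(predicted, observed)]
        # Analytic gradient of the MSE with respect to each amplitude.
        grads = [2.0 / n * sum(r * v for r, v in zip(residual, s)) for s in shapes]
        amps = [a - lr * g for a, g in zip(amps, grads)]
    return amps


# Hypothetical example: two well-separated peaks, noise-free observation.
centers = [30.0, 60.0]
true_amps = [2.0, 0.5]
observed = simulate(true_amps, centers, width=3.0, n_channels=100)
fitted = fit_amplitudes(observed, centers, width=3.0)
```

In this toy setting the fitted amplitudes play the role of the simulator parameters that a separate regression model would then map to element concentrations.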
A final prediction model combines information from multiple spectrometers, which is particularly challenging for small datasets. Several sensor fusion methods are compared, including a parallel-input convolutional neural network (CNN). Results of 10-fold cross-validation found that high-level PLS-based methods were best, though neural network models were competitive.
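High-level fusion combines the outputs of per-sensor models instead of merging their raw spectra. The sketch below shows one simple variant under stated assumptions: each spectrometer already has its own regression model (PLS would be one option), and the predictions are combined by a weighted average with weights proportional to the inverse of each model's mean squared error. The weighting scheme, the function names, and the numbers are hypothetical and are not taken from the thesis.

```python
def mse(pred, truth):
    """Mean squared error between predictions and reference values."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def fusion_weights(sensor_preds, truth):
    """Weights proportional to 1 / MSE, normalised to sum to 1.

    Assumes every per-sensor MSE is non-zero. In practice the weights
    would be estimated on held-out validation data, not on the test set.
    """
    inverse = [1.0 / mse(p, truth) for p in sensor_preds]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuse(sensor_preds, weights):
    """Weighted average of the per-sensor predictions for each sample."""
    n_samples = len(sensor_preds[0])
    return [
        sum(w * p[i] for w, p in zip(weights, sensor_preds))
        for i in range(n_samples)
    ]


# Hypothetical predictions from two spectrometers for four samples.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.2, 1.8, 3.2, 3.8]   # more accurate sensor model
pred_b = [0.6, 2.4, 2.6, 4.4]   # less accurate sensor model

weights = fusion_weights([pred_a, pred_b], y_true)
fused = fuse([pred_a, pred_b], weights)
```

Here the more accurate model receives the larger weight, and because the two models' errors partly cancel in this made-up data, the fused prediction has a lower error than either model alone.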
Item Metadata
Title | Machine learning for spectroscopic data analysis : challenges of limited labelled data |
Creator | |
Supervisor | |
Publisher | University of British Columbia |
Date Issued | 2023 |
Genre | |
Type | |
Language | eng |
Date Available | 2024-01-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0438638 |
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor | University of British Columbia |
Graduation Date | 2024-05 |
Campus | |
Scholarly Level | Graduate |
Rights URI | |
Aggregated Source Repository | DSpace |