UBC Theses and Dissertations

Improved sampling strategy for representative set construction
Sarkar, Debangsha Kusum

Abstract

Active learning addresses machine learning problems where acquiring labels for the data is costly. A representative subset supports active learning by identifying the most useful examples in a complete data set so that only their labels need to be queried. Various sampling strategies exist for constructing this type of representative subset. However, the state-of-the-art approaches to subset sampling aim to construct a representative subset on which the cost of optimization is comparable to that on the complete data set. They do not attempt to capture the underlying distribution of the data, which we believe is the most important part of representative sampling. This thesis proposes an adaptation of the sigma point sampling (SPS) technique from the unscented transformation (UT) for subset sampling. The unscented transformation has been shown to be very effective at modeling non-linear transformations in object tracking and robotics. We show in this thesis that, when combined with a Gaussian mixture model, sigma points can estimate the true statistics of an unknown data distribution with very few samples. Because sigma point sampling is parameterized, it gives better control over the sampling process than non-parameterized sampling techniques. We use the representative subset for active learning on an existing data set. As a practical application of this novel sampling technique, we use it for Active Transfer Learning (ATL) on autoclave processing data to optimize the manufacturing of high-quality aerospace composite parts. The sampling technique is tested on both classification and regression problems in a pool-based active learning scenario. Comparing our approach to state-of-the-art sampling techniques, we show that sigma point sampling either outperforms or matches their performance. We also test the sensitivity of the sigma point sampling parameters. The success of this sampling technique is significant for future development of representative sampling on big data sets. The application of our representative subset is not limited to active learning and transfer learning; there is further potential to extend this approach to parallelizing computation on big data sets across multiple clusters.
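For reference, the sketch below shows the standard scaled sigma-point construction from the unscented transform, which underlies the sampling idea described above. The parameter values (alpha, beta, kappa), variable names, and the toy 2-D Gaussian are illustrative assumptions; the exact variant used in the thesis, and its combination with a Gaussian mixture model, may differ.

    import numpy as np

    def sigma_points(mean, cov, alpha=1e-3, beta=2.0, kappa=0.0):
        """Scaled sigma points and weights for a Gaussian N(mean, cov).

        Returns 2n+1 points whose weighted sample mean and covariance
        reproduce `mean` and `cov` exactly (standard unscented transform).
        """
        n = mean.shape[0]
        lam = alpha ** 2 * (n + kappa) - n
        # Matrix square root of (n + lambda) * cov via Cholesky factorization.
        root = np.linalg.cholesky((n + lam) * cov)

        # Central point plus symmetric points along each column of the root.
        points = np.vstack([mean, mean + root.T, mean - root.T])

        # Weights for the mean and covariance estimates.
        w_mean = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
        w_cov = w_mean.copy()
        w_mean[0] = lam / (n + lam)
        w_cov[0] = lam / (n + lam) + (1.0 - alpha ** 2 + beta)
        return points, w_mean, w_cov

    # Quick check on a toy 2-D Gaussian: the weighted statistics of the
    # sigma points recover the original mean and covariance.
    mu = np.array([1.0, -2.0])
    P = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
    pts, wm, wc = sigma_points(mu, P)
    est_mean = wm @ pts
    diff = pts - est_mean
    est_cov = (wc[:, None] * diff).T @ diff
    print(np.allclose(est_mean, mu), np.allclose(est_cov, P))  # True True

In this standard formulation, 2n+1 points suffice to match the first two moments of an n-dimensional Gaussian, which is why a sigma-point-based subset can remain very small while still reflecting the distribution it was drawn from.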

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International