UBC Theses and Dissertations
Improved sampling strategy for representative set construction
Sarkar, Debangsha Kusum
Active learning addresses machine learning problems in which acquiring labels for the data is costly. A representative subset aids active learning by selecting the most useful examples from a complete data set and querying their labels. Various sampling strategies exist for constructing this type of representative subset. However, the state-of-the-art approaches to subset sampling aim to construct a subset on which the cost of optimization is comparable to that on the complete data set. They do not attempt to capture the underlying distribution of the data, which we believe is the most important aspect of representative sampling. This thesis proposes an adaptation of the sigma point sampling (SPS) technique from the unscented transformation (UT) for subset sampling. The UT has been shown to be very effective at modeling non-linear transformations in object tracking and robotics. We show in this thesis that, when combined with a Gaussian mixture model, sigma points can estimate the true statistics of an unknown data distribution with very few samples. Because sigma point sampling is parameterized, it offers finer control over the sampling process than non-parameterized sampling techniques. We use the representative subset for active learning on an existing data set. As a practical application of this novel sampling technique, we use it for Active Transfer Learning (ATL) on autoclave processing data to optimize the manufacturing of high-quality aerospace composite parts. The sampling technique is tested on both classification and regression problems in a pool-based active learning scenario. Comparing our approach to the state-of-the-art sampling techniques, we show that sigma point sampling either outperforms or matches their performance. We also test the sensitivity of the parameters in sigma point sampling.
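To make the core idea concrete, the sketch below shows the standard scaled-sigma-point construction the UT is built on: for a d-dimensional Gaussian, 2d + 1 weighted points are placed at the mean and along the columns of a matrix square root of the covariance, so that the weighted sample moments reproduce the distribution's mean and covariance exactly. This is a minimal illustration of generic sigma point generation, not the thesis's implementation; the function name and the parameter defaults (`alpha`, `beta`, `kappa`) are illustrative assumptions.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1.0, beta=2.0, kappa=1.0):
    """Generate 2d+1 sigma points and weights for a d-dimensional Gaussian.

    Illustrative sketch of the scaled unscented transformation; the
    parameter defaults here are assumptions, not the thesis's settings.
    """
    d = mean.shape[0]
    lam = alpha ** 2 * (d + kappa) - d        # spread parameter lambda
    L = np.linalg.cholesky((d + lam) * cov)   # matrix square root of scaled covariance
    # Point 0 is the mean; points 1..2d sit at mean +/- each column of L.
    pts = np.vstack([mean, mean + L.T, mean - L.T])
    w = np.full(2 * d + 1, 1.0 / (2 * (d + lam)))
    w[0] = lam / (d + lam)                    # weights sum to 1
    return pts, w
```

With only 2d + 1 points, the weighted mean `pts.T @ w` and the weighted scatter of the points recover the original mean and covariance, which is the property the thesis exploits when fitting each Gaussian mixture component with very few samples.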
The success of this sampling technique is significant for future developments in representative sampling on big data sets. The application of our representative subset is not limited to active learning and transfer learning; there is further potential to extend this approach to parallelizing computation on big data sets across multiple clusters.
Attribution-NonCommercial-NoDerivatives 4.0 International