UBC Theses and Dissertations

Similarity maximization and shrinkage approach in kernel metric learning for clustering mixed-type data Ghashti, Jesse

Abstract

This thesis introduces a new kernel-based shrinkage approach to distance metric learning for mixed-type datasets, i.e., data comprising continuous, nominal, and ordinal variables. Mixed-type data are common across many research fields and domains, and arise extensively in machine learning tasks that require comparisons between data points, such as distance-based clustering and classification. However, traditional methods for handling mixed-type data often rely on distance metrics that inadequately measure and weigh the different variable types, and therefore fail to capture the complex structure of the data. This can reduce accuracy and precision in clustering and classification tasks and yield suboptimal outcomes. To improve metric calculations, a distance metric learning approach is proposed that uses kernel functions as similarity measures, together with a new optimal bandwidth selection methodology called Maximum Similarity Cross-Validation (MSCV). We demonstrate that MSCV smooths out irrelevant variables and emphasizes variables relevant for calculating distance within mixed-type datasets. Additionally, the kernel distance is positioned as a shrinkage methodology that balances the similarity calculation between maximum and uniform similarity. Further, the approach makes no assumptions about the shape of the metric, mitigating user-specification bias. Analysis of simulated data demonstrates that the kernel metric captures intricate and complex data relationships where other methodologies are limited, including scenarios in which some variables are irrelevant for distinguishing meaningful groupings within the data. It is also shown that as the number of observations increases, the precision of the kernel metric stabilizes, as do the kernel bandwidths. Applications to real data confirm the metric's advantage in improving accuracy across various data structures.
The proposed metric consistently outperforms competing distance metrics across various clustering methodologies, showing that this flexible kernel metric provides a new and effective alternative for similarity-based learning algorithms.
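The abstract does not spell out which kernels are used, but a common construction in mixed-type kernel smoothing combines a per-variable kernel for each data type into a product-kernel similarity: a Gaussian kernel for continuous variables, an Aitchison–Aitken-style kernel for nominal variables, and a Wang–van Ryzin-style kernel for ordinal variables. The sketch below is illustrative only (the kernel choices, the `mixed_kernel_similarity` function, and the fixed bandwidths are assumptions; in the thesis's approach the bandwidths would be selected by MSCV rather than fixed by hand):

```python
import numpy as np

def mixed_kernel_similarity(x, y, bw_cont, bw_nom, bw_ord, n_levels):
    """Illustrative product-kernel similarity between two mixed-type points.

    x, y      : dicts with arrays under 'cont', 'nom', 'ord'
    bw_cont   : continuous bandwidths (one per continuous variable)
    bw_nom    : nominal smoothing parameters, each in [0, (c-1)/c]
    bw_ord    : ordinal smoothing parameters, each in [0, 1)
    n_levels  : number of categories c for each nominal variable
    """
    # Continuous variables: unnormalized Gaussian kernel per variable
    s = np.prod(np.exp(-0.5 * ((x['cont'] - y['cont']) / bw_cont) ** 2))
    # Nominal variables: Aitchison-Aitken kernel
    for xv, yv, lam, c in zip(x['nom'], y['nom'], bw_nom, n_levels):
        s *= (1.0 - lam) if xv == yv else lam / (c - 1)
    # Ordinal variables: Wang-van Ryzin kernel (geometric decay in the
    # absolute level difference)
    for xv, yv, lam in zip(x['ord'], y['ord'], bw_ord):
        d = abs(int(xv) - int(yv))
        s *= (1.0 - lam) if d == 0 else 0.5 * (1.0 - lam) * lam ** d
    return s
```

Under this construction, a smoothing parameter pushed toward its upper limit flattens that variable's kernel toward a constant (uniform similarity), effectively removing the variable from the distance, while a parameter near zero makes the kernel sharply discriminate (maximum similarity); this is the sense in which bandwidth selection can act as shrinkage between the two extremes.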

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International