UBC Theses and Dissertations

Similarity maximization and shrinkage approach in kernel metric learning for clustering mixed-type data Ghashti, Jesse

Abstract

This thesis introduces a new kernel-based shrinkage approach to distance metric learning for mixed-type datasets, i.e., data comprising continuous, nominal, and ordinal variables. Mixed-type data are common across many research fields and domains, and arise extensively in machine learning tasks that require comparisons between data points, such as distance-based clustering and classification. However, traditional methods for handling mixed-type data often rely on distance metrics that inadequately measure and weigh the different variable types, and therefore fail to capture the complex structure of the data. This can reduce accuracy and precision in clustering and classification tasks and yield suboptimal outcomes. To improve metric calculations, a distance metric learning approach is proposed that uses kernel functions as similarity measures, together with a new optimal bandwidth selection methodology called Maximum Similarity Cross-Validation (MSCV). We demonstrate that MSCV smooths out irrelevant variables and emphasizes variables relevant for calculating distance within mixed-type datasets. Additionally, the kernel distance is positioned as a shrinkage methodology that balances the similarity calculation between maximum and uniform similarity. Further, the approach makes no assumptions about the shape of the metric, mitigating user-specification bias. Analysis of simulated data demonstrates that the kernel metric captures intricate and complex data relationships where other methodologies are limited, including scenarios in which some variables are irrelevant for distinguishing meaningful groupings within the data. It is also shown that as the number of observations increases, the precision of the kernel metric stabilizes, as do the kernel bandwidths. Applications to real data confirm the metric's advantage in improving accuracy across various data structures.
The proposed metric consistently outperforms competing distance metrics across various clustering methodologies, showing that this flexible kernel metric provides a new and effective alternative for similarity-based learning algorithms.
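The abstract does not spell out which kernels are used, but a common construction in mixed-type kernel smoothing combines a per-variable kernel for each data type into a product-kernel similarity: a Gaussian kernel for continuous variables, an Aitchison–Aitken-style kernel for nominal variables, and a Wang–van Ryzin-style kernel for ordinal variables. The sketch below is illustrative only (the kernel choices, the `mixed_kernel_similarity` function, and the fixed bandwidths are assumptions; in the thesis's approach the bandwidths would be selected by MSCV rather than fixed by hand):

```python
import numpy as np

def mixed_kernel_similarity(x, y, bw_cont, bw_nom, bw_ord, n_levels):
    """Illustrative product-kernel similarity between two mixed-type points.

    x, y      : dicts with arrays under 'cont', 'nom', 'ord'
    bw_cont   : continuous bandwidths (one per continuous variable)
    bw_nom    : nominal smoothing parameters, each in [0, (c-1)/c]
    bw_ord    : ordinal smoothing parameters, each in [0, 1)
    n_levels  : number of categories c for each nominal variable
    """
    # Continuous variables: unnormalized Gaussian kernel per variable
    s = np.prod(np.exp(-0.5 * ((x['cont'] - y['cont']) / bw_cont) ** 2))
    # Nominal variables: Aitchison-Aitken kernel
    for xv, yv, lam, c in zip(x['nom'], y['nom'], bw_nom, n_levels):
        s *= (1.0 - lam) if xv == yv else lam / (c - 1)
    # Ordinal variables: Wang-van Ryzin kernel (geometric decay in the
    # absolute level difference)
    for xv, yv, lam in zip(x['ord'], y['ord'], bw_ord):
        d = abs(int(xv) - int(yv))
        s *= (1.0 - lam) if d == 0 else 0.5 * (1.0 - lam) * lam ** d
    return s
```

Under this construction, a smoothing parameter pushed toward its upper limit flattens that variable's kernel toward a constant (uniform similarity), effectively removing the variable from the distance, while a parameter near zero makes the kernel sharply discriminate (maximum similarity); this is the sense in which bandwidth selection can act as shrinkage between the two extremes.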

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International