Open Collections
UBC Theses and Dissertations
Similarity maximization and shrinkage approach in kernel metric learning for clustering mixed-type data
Ghashti, Jesse
Abstract
This thesis introduces a new kernel-based shrinkage approach to distance metric learning for mixed-type datasets, i.e., data comprising continuous, nominal, and ordinal variables. Mixed-type data is common across many research fields and domains, and is used extensively in machine learning tasks that require comparisons between data points, such as clustering and classification. However, traditional methods for handling mixed-type data often rely on distance metrics that inadequately measure and weight different variable types, and therefore fail to capture the complex nature of the data. This can lead to a loss of accuracy and precision in clustering and classification tasks and yield suboptimal outcomes.

To improve metric calculations, a distance metric learning approach is proposed that uses kernel functions as similarity measures, together with a new optimal bandwidth selection methodology called Maximum Similarity Cross-Validation (MSCV). We demonstrate that MSCV smooths out irrelevant variables and emphasizes variables relevant for calculating distance within mixed-type datasets. Additionally, the kernel distance is positioned as a shrinkage methodology that balances the similarity calculation between maximum and uniform similarity. Further, this approach makes no assumptions about the shape of the metric, mitigating user-specification bias.

Analysis of simulated data demonstrates that the kernel metric captures complex data relationships where other methodologies are limited, including scenarios in which some variables are irrelevant for distinguishing meaningful groupings within the data. As the number of observations increases, the precision of the kernel metric stabilizes, as do the kernel bandwidths. Applications to real data confirm that the metric improves accuracy across various data structures. The proposed metric consistently outperforms competing distance metrics across various clustering methodologies, showing that this flexible kernel metric provides a new and effective alternative for similarity-based learning algorithms.
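The abstract describes the construction only at a high level. The following is a minimal, hypothetical Python sketch of a product-kernel similarity and its induced distance for mixed-type data, with a leave-one-out bandwidth objective in the spirit of MSCV. The kernel choices (Gaussian for continuous, Aitchison-Aitken for nominal, and Wang-van Ryzin for ordinal variables) are standard in mixed-type kernel smoothing but are assumptions here, as are all function names; the thesis defines its own kernels and the exact MSCV criterion. Note how the nominal bandwidth lam makes the shrinkage reading concrete: lam = 0 yields similarity only on an exact match (maximum similarity), while lam = (c - 1)/c yields the uniform kernel 1/c.

```python
import numpy as np

# Per-variable kernels. These specific choices (Gaussian, Aitchison-Aitken,
# Wang-van Ryzin) are illustrative assumptions, not necessarily those used
# in the thesis.

def k_cont(x, y, h):
    # Gaussian kernel, normalized by h so that oversmoothing (large h)
    # is penalized in the leave-one-out objective below.
    return np.exp(-0.5 * ((x - y) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def k_nom(x, y, lam, c):
    # Aitchison-Aitken kernel for a nominal variable with c categories.
    # lam = 0 -> indicator kernel (similar only on exact match);
    # lam = (c - 1) / c -> uniform kernel 1/c (all categories equally similar).
    return np.where(x == y, 1.0 - lam, lam / (c - 1))

def k_ord(x, y, lam):
    # Wang-van Ryzin kernel for an ordinal variable coded as integers.
    d = np.abs(x - y)
    return np.where(d == 0, 1.0 - lam, 0.5 * (1.0 - lam) * lam ** d)

def similarity(i, j, Xc, Xn, Xo, h, lam_n, lam_o, c):
    # Product kernel over all variables = similarity between rows i and j.
    return (np.prod(k_cont(Xc[i], Xc[j], h))
            * np.prod(k_nom(Xn[i], Xn[j], lam_n, c))
            * np.prod(k_ord(Xo[i], Xo[j], lam_o)))

def kernel_distance(i, j, Xc, Xn, Xo, h, lam_n, lam_o, c):
    # One common kernel-induced metric: d^2(x, y) = K(x,x) + K(y,y) - 2 K(x,y).
    kxx = similarity(i, i, Xc, Xn, Xo, h, lam_n, lam_o, c)
    kyy = similarity(j, j, Xc, Xn, Xo, h, lam_n, lam_o, c)
    kxy = similarity(i, j, Xc, Xn, Xo, h, lam_n, lam_o, c)
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

def loo_similarity_objective(Xc, Xn, Xo, h, lam_n, lam_o, c):
    # Average leave-one-out similarity of each point to the remaining points,
    # in the spirit of MSCV: bandwidths maximizing this keep discriminating
    # variables sharp and smooth irrelevant ones toward uniformity.
    n = Xc.shape[0]
    total = sum(similarity(i, j, Xc, Xn, Xo, h, lam_n, lam_o, c)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Toy usage: 2 continuous, 1 nominal (3 categories), 1 ordinal (5 levels).
rng = np.random.default_rng(0)
Xc = rng.normal(size=(20, 2))
Xn = rng.integers(0, 3, size=(20, 1))
Xo = rng.integers(0, 5, size=(20, 1))
h, lam_n, lam_o, c = np.array([0.5, 0.5]), np.array([0.3]), np.array([0.4]), np.array([3])

print(loo_similarity_objective(Xc, Xn, Xo, h, lam_n, lam_o, c))
print(kernel_distance(0, 1, Xc, Xn, Xo, h, lam_n, lam_o, c))
```

In practice the bandwidths h, lam_n, and lam_o would be chosen by numerically maximizing the leave-one-out objective (e.g., by grid search or a general-purpose optimizer), and the resulting matrix of pairwise kernel_distance values can be passed to any distance-based clustering routine.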
Item Metadata
Title | Similarity maximization and shrinkage approach in kernel metric learning for clustering mixed-type data
Creator | Ghashti, Jesse
Supervisor |
Publisher | University of British Columbia
Date Issued | 2024
Genre |
Type |
Language | eng
Date Available | 2024-06-17
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0443975
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2024-09
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace