- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Undergraduate Research /
- Empirical Analysis of PCA-based Distance Comparison...
Open Collections
UBC Undergraduate Research
Empirical Analysis of PCA-based Distance Comparison Operations Geerts, Noah
Abstract
Vector databases are increasingly used in applications such as information retrieval, recommendation systems, and natural language processing due to their ability to perform efficient similarity searches. These searches rely on k-nearest neighbor (KNN) algorithms, but exact KNN methods become computationally expensive as dataset size and dimensionality grow. Approximate nearest neighbor (ANN) algorithms address this issue by trading slight accuracy losses for substantial speed gains. However, even state-of-the-art ANN methods remain bottlenecked by distance comparison operations (DCOs), which dominate computational costs. To accelerate similarity searches further, we investigate replacing exact DCOs with approximate ones. We first examine ADSampling, an existing approximate DCO that reduces computation by randomly subsampling dimensions. We propose a novel DCO, APCA, which performs comparisons in Principal Component Analysis (PCA) space, prioritizing dimensions with high variance to maximize information retention. We benchmark APCA against ADSampling and standard full-distance scanning within the Hierarchical Navigable Small World and Inverted File algorithms, analyzing its impact on search speed and accuracy.
Item Metadata
Title |
Empirical Analysis of PCA-based Distance Comparison Operations
|
Creator | |
Date Issued |
2025-04
|
Description |
Vector databases are increasingly used in applications such as information retrieval,
recommendation systems, and natural language processing due to their ability to perform
efficient similarity searches. These searches rely on k-nearest neighbor (KNN)
algorithms, but exact KNN methods become computationally expensive as dataset size
and dimensionality grow. Approximate nearest neighbor (ANN) algorithms address
this issue by trading slight accuracy losses for substantial speed gains. However, even
state-of-the-art ANN methods remain bottlenecked by distance comparison operations
(DCOs), which dominate computational costs. To accelerate similarity searches further,
we investigate replacing exact DCOs with approximate ones. We first examine ADSampling,
an existing approximate DCO that reduces computation by randomly subsampling
dimensions. We propose a novel DCO, APCA, which performs comparisons in Principal
Component Analysis (PCA) space, prioritizing dimensions with high variance to maximize
information retention. We benchmark APCA against ADSampling and standard
full-distance scanning within the Hierarchical Navigable Small World and Inverted File
algorithms, analyzing its impact on search speed and accuracy.
|
Genre | |
Type | |
Language |
eng
|
Series | |
Date Available |
2025-05-12
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0448871
|
URI | |
Affiliation | |
Peer Review Status |
Unreviewed
|
Scholarly Level |
Undergraduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|
Item Media
Item Citations and Data
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International