Empirical Analysis of PCA-based Distance Comparison Operations

UBC Undergraduate Research

Empirical Analysis of PCA-based Distance Comparison Operations Geerts, Noah

Abstract

Vector databases are increasingly used in applications such as information retrieval, recommendation systems, and natural language processing due to their ability to perform efficient similarity searches. These searches rely on k-nearest neighbor (KNN) algorithms, but exact KNN methods become computationally expensive as dataset size and dimensionality grow. Approximate nearest neighbor (ANN) algorithms address this issue by trading slight accuracy losses for substantial speed gains. However, even state-of-the-art ANN methods remain bottlenecked by distance comparison operations (DCOs), which dominate computational costs. To accelerate similarity searches further, we investigate replacing exact DCOs with approximate ones. We first examine ADSampling, an existing approximate DCO that reduces computation by randomly subsampling dimensions. We propose a novel DCO, APCA, which performs comparisons in Principal Component Analysis (PCA) space, prioritizing dimensions with high variance to maximize information retention. We benchmark APCA against ADSampling and standard full-distance scanning within the Hierarchical Navigable Small World and Inverted File algorithms, analyzing its impact on search speed and accuracy.

Item Metadata

Title	Empirical Analysis of PCA-based Distance Comparison Operations
Creator	Geerts, Noah
Date Issued	2025-04
Description	Vector databases are increasingly used in applications such as information retrieval, recommendation systems, and natural language processing due to their ability to perform efficient similarity searches. These searches rely on k-nearest neighbor (KNN) algorithms, but exact KNN methods become computationally expensive as dataset size and dimensionality grow. Approximate nearest neighbor (ANN) algorithms address this issue by trading slight accuracy losses for substantial speed gains. However, even state-of-the-art ANN methods remain bottlenecked by distance comparison operations (DCOs), which dominate computational costs. To accelerate similarity searches further, we investigate replacing exact DCOs with approximate ones. We first examine ADSampling, an existing approximate DCO that reduces computation by randomly subsampling dimensions. We propose a novel DCO, APCA, which performs comparisons in Principal Component Analysis (PCA) space, prioritizing dimensions with high variance to maximize information retention. We benchmark APCA against ADSampling and standard full-distance scanning within the Hierarchical Navigable Small World and Inverted File algorithms, analyzing its impact on search speed and accuracy.
Genre	Graduating Project
Type	Text
Language	eng
Series	University of British Columbia. COSC_O 449
Date Available	2025-05-12
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0448871
URI	http://hdl.handle.net/2429/91116
Affiliation	Science, Irving K. Barber Faculty of (Okanagan); Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan)
Peer Review Status	Unreviewed
Scholarly Level	Undergraduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Undergraduate Research