UBC Undergraduate Research

Empirical Analysis of PCA-based Distance Comparison Operations Geerts, Noah

Abstract

Vector databases are increasingly used in applications such as information retrieval, recommendation systems, and natural language processing due to their ability to perform efficient similarity searches. These searches rely on k-nearest neighbor (KNN) algorithms, but exact KNN methods become computationally expensive as dataset size and dimensionality grow. Approximate nearest neighbor (ANN) algorithms address this issue by trading slight accuracy losses for substantial speed gains. However, even state-of-the-art ANN methods remain bottlenecked by distance comparison operations (DCOs), which dominate computational costs. To accelerate similarity searches further, we investigate replacing exact DCOs with approximate ones. We first examine ADSampling, an existing approximate DCO that reduces computation by randomly subsampling dimensions. We propose a novel DCO, APCA, which performs comparisons in Principal Component Analysis (PCA) space, prioritizing dimensions with high variance to maximize information retention. We benchmark APCA against ADSampling and standard full-distance scanning within the Hierarchical Navigable Small World and Inverted File algorithms, analyzing its impact on search speed and accuracy.

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International