UBC Theses and Dissertations

CubistMerge: spatial-preserving token merging for diverse ViT backbones

Gong, Wenyi

Abstract

Many modern Vision Transformer (ViT) backbones adopt spatial architectural designs, such as the window attention of Swin and ViTDet and 2D positional embeddings, including the decomposed relative positional embeddings in SAM and RoPE in DINOv3. Such architectures pose new challenges for token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this thesis, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (1) exploiting the uneven information distribution across the spatial layout, and (2) preserving the spatial structure post-merging. To enable effective token reduction for spatial architectures, we propose three key innovations:

- a 2D reduction strategy to enforce structured token layouts,
- a spatial-aware merging algorithm that maintains relative token positions, and
- a novel max-magnitude-per-dimension token representation that preserves salient features.

Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7% mIoU drop on COCO off-the-shelf, and a 1.15× speedup on DeiT-B with no top-1 accuracy drop on ImageNet after just one epoch of fine-tuning.
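As a rough illustration of the max-magnitude-per-dimension idea named in the abstract, the sketch below merges a group of tokens by keeping, for each feature dimension, the signed value with the largest absolute magnitude. This is one plausible reading of the phrase, not the thesis's implementation: the function name `max_magnitude_merge`, the tie-breaking rule, and the toy inputs are assumptions made for this example.

```python
import numpy as np

def max_magnitude_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge a group of tokens of shape (n, d) into one token of shape (d,).

    Assumed interpretation: for every feature dimension, keep the signed
    value whose absolute magnitude is largest within the group. Ties go to
    the first token, which is simply how argmax behaves.
    """
    tokens = np.asarray(tokens, dtype=float)
    # Row index of the largest-magnitude entry in each column.
    winner = np.abs(tokens).argmax(axis=0)
    # Pick that entry for every column, preserving its sign.
    return tokens[winner, np.arange(tokens.shape[1])]

# Two hypothetical 3-dimensional tokens selected for merging.
group = np.array([[0.5, -2.0, 0.1],
                  [-1.5, 1.0, 0.3]])

merged = max_magnitude_merge(group)   # [-1.5, -2.0, 0.3]
averaged = group.mean(axis=0)         # [-0.5, -0.5, 0.2]
```

Compared with averaging, which shrinks the large activations in both the first and second dimensions here, the max-magnitude rule keeps the strongest response in each dimension. That matches the abstract's stated motivation of preserving salient features.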

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International