CubistMerge : spatial-preserving token merging for diverse ViT backbones

UBC Theses and Dissertations

CubistMerge : spatial-preserving token merging for diverse ViT backbones Gong, Wenyi

Abstract

Many modern Vision Transformer (ViT) backbones adopt spatial architectural designs, such as window attention in Swin and ViTDet, and 2D positional embeddings, including the decomposed relative positional embeddings in SAM and RoPE in DINOv3. Such architectures pose new challenges for token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this thesis, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (1) exploiting the uneven information distribution across the spatial layout, and (2) preserving the spatial structure post-merging. To enable effective token reduction for spatial architectures, we propose three key innovations:

- a 2D reduction strategy to enforce structured token layouts,
- a spatial-aware merging algorithm that maintains relative token positions, and
- a novel max-magnitude-per-dimension token representation that preserves salient features.

Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7% mIoU drop evaluated on COCO off-the-shelf, and a 1.15× speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
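The abstract names a max-magnitude-per-dimension token representation without spelling it out. One plausible reading, offered here only as an illustrative sketch and not as the thesis's implementation, is that a merged token keeps, for each feature dimension, the signed value with the largest absolute magnitude among the tokens being merged. The function names `max_magnitude_merge` and `mean_merge` are hypothetical, and the averaging baseline is included only for contrast.

```python
from typing import List, Sequence


def max_magnitude_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Assumed reading: per dimension, keep the signed value with the largest |value|."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same dimension")
    # max(..., key=abs) returns the original signed value; ties keep the first token's value.
    return [max((t[d] for t in tokens), key=abs) for d in range(dim)]


def mean_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Plain averaging baseline, shown only for comparison."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same dimension")
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(dim)]


# Two toy 3-dimensional tokens (made-up values, not data from the thesis).
a = [0.25, -2.0, 0.5]
b = [-0.75, 1.0, 0.5]

merged_max = max_magnitude_merge([a, b])
merged_mean = mean_merge([a, b])
```

In this toy case the strong activation of -2.0 in the second dimension survives the merge unchanged under the max-magnitude reading, whereas averaging dilutes it to -0.5, which is the intuition behind "preserves salient features".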

Item Metadata

Title	CubistMerge : spatial-preserving token merging for diverse ViT backbones
Creator	Gong, Wenyi
Supervisor	Lis, Mieszko
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-10-16
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0450473
URI	http://hdl.handle.net/2429/92643
Degree (Theses)	Master of Applied Science - MASc
Program (Theses)	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2025-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
