 
Open Collections
UBC Theses and Dissertations
CubistMerge : spatial-preserving token merging for diverse ViT backbones
Gong, Wenyi
Abstract
Many modern Vision Transformer (ViT) backbones adopt spatial architectural designs, such as window attention in Swin and ViTDet, and 2D positional embeddings, which include the decomposed relative positional embeddings in SAM and RoPE in DINOv3. Such architectures impose new challenges for token reduction, because the vast majority of existing methods fail to preserve the spatial structure these architectures depend on.
In this thesis, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (1) exploiting the uneven information distribution across the spatial layout, and (2) preserving the spatial structure post-merging.
To enable effective token reduction for spatial architectures, we propose three key innovations:
- a 2D reduction strategy to enforce structured token layouts,
- a spatial-aware merging algorithm that maintains relative token positions, and
- a novel max-magnitude-per-dimension token representation that preserves salient features.
Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7% mIoU drop evaluated on COCO off-the-shelf, and a 1.15× speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
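The max-magnitude-per-dimension representation described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the thesis's implementation: it assumes tokens are plain embedding vectors and that merging a group keeps, for each dimension independently, the value with the largest absolute magnitude (the function name `max_magnitude_merge` is hypothetical).

```python
import numpy as np

def max_magnitude_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge a group of tokens into one representative token.

    For each embedding dimension, keep the (signed) value whose
    absolute magnitude is largest among the tokens being merged,
    so salient per-dimension features survive the merge.

    tokens: (k, d) array of k token embeddings to merge.
    Returns a single (d,) merged token.
    """
    # Index of the winning token for each dimension.
    idx = np.abs(tokens).argmax(axis=0)
    return tokens[idx, np.arange(tokens.shape[1])]

# Example: merging two 4-dimensional tokens.
a = np.array([0.9, -0.1, 0.2, -3.0])
b = np.array([-1.5, 0.05, 0.4, 2.0])
merged = max_magnitude_merge(np.stack([a, b]))
# merged = [-1.5, -0.1, 0.4, -3.0]
```

Unlike averaging, this keeps the sign of each winning entry, which is why a large negative activation such as `-3.0` is preserved rather than diluted.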
                                    
                                                                    
Item Metadata
| Title | CubistMerge : spatial-preserving token merging for diverse ViT backbones |
| Creator | Gong, Wenyi |
| Supervisor | |
| Publisher | University of British Columbia |
| Date Issued | 2025 |
| Genre | |
| Type | |
| Language | eng |
| Date Available | 2025-10-16 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0450473 |
| URI | |
| Degree (Theses) | |
| Program (Theses) | |
| Affiliation | |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2025-11 |
| Campus | |
| Scholarly Level | Graduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |