BIRS Workshop Lecture Videos
Multi-layer cross-attention is optimal for multi-modal in-context learning
Sur, Pragya
Description
Despite rapid recent progress in in-context learning (ICL), our understanding of multi-modal ICL remains nascent. In this talk, we introduce a mathematically tractable framework for multi-modal learning and explore when transformer-like architectures recover Bayes-optimal performance in-context. In the setting of latent variable models, we first discuss a negative result on expressivity: single-layer, linear self-attention fails to recover the Bayes-optimal predictor. We then study a novel linearized cross-attention mechanism and show that it achieves Bayes optimality when trained via gradient flow, in the limit of a large number of cross-attention layers and context length. Our results underscore the benefits of depth for ICL and establish the provable utility of cross-attention for multi-modal data. This is based on joint work with Nicholas Barnfield and Subhabrata Sen.
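The record is a talk abstract, so no implementation accompanies it. As a purely illustrative aid, the following is a minimal, hypothetical NumPy sketch of what a stack of linearized (softmax-free) cross-attention layers might look like, with made-up dimensions and random, untrained weights. It only illustrates the mechanism named in the abstract; it is not the talk's construction, its gradient-flow training, or its Bayes-optimality analysis.

```python
import numpy as np

def linear_cross_attention(queries, keys, values, W_q, W_k, W_v):
    """One linearized (softmax-free) cross-attention layer:
    attention scores are raw dot products, so the output is a
    bilinear function of the two input streams."""
    Q = queries @ W_q                 # (n_query, d)
    K = keys @ W_k                    # (n_ctx, d)
    V = values @ W_v                  # (n_ctx, d_out)
    scores = Q @ K.T / K.shape[0]     # no softmax: linear in the keys
    return scores @ V

rng = np.random.default_rng(0)
d_x, d_z, n_ctx = 4, 3, 32

# Hypothetical toy data: two modalities generated from a shared latent,
# loosely mimicking a latent variable model for multi-modal data.
latent = rng.normal(size=(n_ctx, 2))
X = latent @ rng.normal(size=(2, d_x)) + 0.1 * rng.normal(size=(n_ctx, d_x))
Z = latent @ rng.normal(size=(2, d_z)) + 0.1 * rng.normal(size=(n_ctx, d_z))

# Stack several cross-attention layers: the X stream attends to the Z stream.
# The talk's result concerns the limit of many such layers and long contexts.
h = X.copy()
for _ in range(3):
    W_q = rng.normal(size=(h.shape[1], d_z)) / np.sqrt(h.shape[1])
    W_k = np.eye(d_z)
    W_v = rng.normal(size=(d_z, h.shape[1])) / np.sqrt(d_z)
    h = h + linear_cross_attention(h, Z, Z, W_q, W_k, W_v)  # residual update

print(h.shape)  # (32, 4): X-stream representations enriched with Z information
```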
Item Metadata
| Field | Value |
| --- | --- |
| Title | Multi-layer cross-attention is optimal for multi-modal in-context learning |
| Creator | Sur, Pragya |
| Publisher | Banff International Research Station for Mathematical Innovation and Discovery |
| Date Issued | 2026-02-05 |
| Extent | 43.0 minutes |
| File Format | video/mp4 |
| Language | eng |
| Notes | Author affiliation: Harvard University |
| Date Available | 2026-02-09 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451478 |
| Peer Review Status | Unreviewed |
| Scholarly Level | Researcher |
| Aggregated Source Repository | DSpace |