BIRS Workshop Lecture Videos

Multi-layer cross-attention is optimal for multi-modal in-context learning

Speaker: Pragya Sur

Description

Despite rapid recent progress in in-context learning (ICL), our understanding of multi-modal ICL remains nascent. In this talk, we introduce a mathematically tractable framework for multi-modal learning and explore when transformer-like architectures recover Bayes-optimal performance in-context. In the setting of latent variable models, we first discuss a negative result on expressivity: single-layer, linear self-attention fails to recover the Bayes-optimal predictor. We then study a novel linearized cross-attention mechanism and show that it achieves Bayes optimality when trained via gradient flow, in the limit of a large number of cross-attention layers and context length. Our results underscore the benefits of depth for ICL and establish the provable utility of cross-attention for multi-modal data. This is based on joint work with Nicholas Barnfield and Subhabrata Sen.
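As a rough illustration of the kind of mechanism the abstract refers to, the sketch below implements a softmax-free ("linearized") cross-attention layer in which query-modality tokens attend to in-context tokens from a second modality, stacked over several layers. This is only an assumed, minimal reading of the term: the projection matrices, the residual update, and the normalization by context length are illustrative choices and are not taken from the paper.

```python
import numpy as np

def linear_cross_attention(X_query, X_context, W_q, W_k, W_v):
    """One linearized (softmax-free) cross-attention layer -- illustrative sketch.

    X_query   : (n_q, d)  tokens from one modality
    X_context : (n_c, d)  in-context tokens from the other modality
    W_q, W_k, W_v : (d, d) learned projection matrices

    Dropping the softmax makes the attention scores enter linearly,
    which is what makes this style of mechanism analytically tractable.
    """
    Q = X_query @ W_q                        # queries from the first modality
    K = X_context @ W_k                      # keys from the context modality
    V = X_context @ W_v                      # values from the context modality
    scores = Q @ K.T / X_context.shape[0]    # linear scores, averaged over context length
    return X_query + scores @ V              # residual update of the query tokens

def multilayer(X_query, X_context, layers):
    """Stack several cross-attention layers, mirroring the deep (many-layer) regime."""
    for W_q, W_k, W_v in layers:
        X_query = linear_cross_attention(X_query, X_context, W_q, W_k, W_v)
    return X_query

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_q, n_c, depth = 8, 4, 32, 3
    layers = [tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3))
              for _ in range(depth)]
    X_q = rng.standard_normal((n_q, d))
    X_c = rng.standard_normal((n_c, d))
    print(multilayer(X_q, X_c, layers).shape)  # -> (4, 8)
```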

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International