BIRS Workshop Lecture Videos

Multi-layer cross-attention is optimal for multi-modal in-context learning

Speaker: Pragya Sur

Description

Despite rapid recent progress in in-context learning (ICL), our understanding of multi-modal ICL remains nascent. In this talk, we introduce a mathematically tractable framework for multi-modal learning and explore when transformer-like architectures recover Bayes-optimal performance in-context. In the setting of latent variable models, we first discuss a negative result on expressivity: single-layer, linear self-attention fails to recover the Bayes-optimal predictor. We then study a novel linearized cross-attention mechanism and show that it achieves Bayes optimality when trained via gradient flow, in the limit of a large number of cross-attention layers and context length. Our results underscore the benefits of depth for ICL and establish the provable utility of cross-attention for multi-modal data. This is based on joint work with Nicholas Barnfield and Subhabrata Sen.
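As a rough illustration of the kind of mechanism the abstract refers to, the sketch below implements a softmax-free ("linearized") cross-attention layer in which query-modality tokens attend to in-context tokens from a second modality, stacked over several layers. This is only an assumed, minimal reading of the term: the projection matrices, the residual update, and the normalization by context length are illustrative choices and are not taken from the paper.

```python
import numpy as np

def linear_cross_attention(X_query, X_context, W_q, W_k, W_v):
    """One linearized (softmax-free) cross-attention layer -- illustrative sketch.

    X_query   : (n_q, d)  tokens from one modality
    X_context : (n_c, d)  in-context tokens from the other modality
    W_q, W_k, W_v : (d, d) learned projection matrices

    Dropping the softmax makes the attention scores enter linearly,
    which is what makes this style of mechanism analytically tractable.
    """
    Q = X_query @ W_q                        # queries from the first modality
    K = X_context @ W_k                      # keys from the context modality
    V = X_context @ W_v                      # values from the context modality
    scores = Q @ K.T / X_context.shape[0]    # linear scores, averaged over context length
    return X_query + scores @ V              # residual update of the query tokens

def multilayer(X_query, X_context, layers):
    """Stack several cross-attention layers, mirroring the deep (many-layer) regime."""
    for W_q, W_k, W_v in layers:
        X_query = linear_cross_attention(X_query, X_context, W_q, W_k, W_v)
    return X_query

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_q, n_c, depth = 8, 4, 32, 3
    layers = [tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3))
              for _ in range(depth)]
    X_q = rng.standard_normal((n_q, d))
    X_c = rng.standard_normal((n_c, d))
    print(multilayer(X_q, X_c, layers).shape)  # -> (4, 8)
```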

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International