UBC Theses and Dissertations

On the optimization and generalization of self-attention models: a stability and implicit bias perspective
Deora, Puneesh

Abstract

Deep learning has increasingly shifted towards Transformers across various applications, including computer vision and natural language processing, with large language models leading this success. However, the optimization and generalization dynamics of Transformer models remain poorly understood. In this thesis, we take steps towards establishing a foundational theory. Specifically, we focus on self-attention models, the core mechanism of the Transformer. We first study multi-head attention models in a binary classification setting, developing optimization and generalization bounds that elucidate the finite-time behavior of these models when trained with gradient descent (GD). The non-linear and non-convex nature of the softmax function makes this analysis particularly challenging. To establish bounds in this complex regime, our theory leverages the smoothness of the softmax function and pertinent second-order information through the Hessian; the latter endows the loss landscape with a weakly convex structure under sufficient overparameterization. We further leverage the algorithmic stability framework, specifically on-average model stability, to derive generalization bounds. Our bounds have the potential for extension to more complex architectures and draw connections to the existing theory of overparameterized multilayer perceptrons.

While finite-time optimization and generalization guarantees address part of the puzzle, a complete understanding of training dynamics requires studying the time evolution of the training parameters. One approach is to study where the parameters converge as training time grows without bound, that is, whether GD training exhibits any implicit preference among the possible minimizers of the training loss. This preference is known as the implicit bias of the optimizer, and we explore it for self-attention models. We demonstrate that, in non-trivial data settings, the combined key-query matrix, when trained with GD, converges globally (regardless of its initial direction) to the solution of a hard-margin SVM problem. Further, our theory characterizes the complete training dynamics in this setup, explaining the time evolution of the loss function, the growth of the parameters, and the evolution of the attention map. Finally, we explore the benefits of adaptive step-size rules for Transformer optimization. Our findings reinforce the implicit bias perspective of self-attention and strengthen its connections to implicit bias in linear logistic regression, despite the intricate non-convex nature of the former.
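To make the hard-margin SVM characterization above concrete, one illustrative formulation, written here with assumed notation rather than the thesis's exact statement, is the following. Suppose each training example i has input tokens x_{i,t}, a query token z_i, and a relevant (selected) token index α_i; the combined key-query matrix W then solves a max-margin problem of the form

    minimize_W  ||W||_F    subject to    (x_{i,α_i} − x_{i,t})^T W z_i ≥ 1    for all i and all t ≠ α_i.

Under this reading, the GD iterates grow in norm while aligning in direction with the minimizer of this problem, so the attention scores of the selected tokens are separated from those of the remaining tokens with maximal margin.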

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International