UBC Theses and Dissertations

On the optimization and generalization of self-attention models: a stability and implicit bias perspective
Deora, Puneesh

Abstract

Deep learning has increasingly shifted towards Transformers across various applications, including computer vision and natural language processing, with large language models leading this success. However, the optimization and generalization dynamics of Transformer models remain poorly understood. In this thesis, we take steps towards establishing a foundational theory. Specifically, we focus on self-attention models, the core mechanism of the Transformer. We first study multi-head attention models in a binary classification setting, developing optimization and generalization bounds that elucidate the finite-time behavior of these models when trained with gradient descent (GD). The non-linear and non-convex nature of the softmax function makes this analysis particularly challenging. To establish bounds in this complex regime, our theory leverages the smoothness of the softmax function and pertinent second-order information through the Hessian; the latter endows the loss landscape with a weakly convex structure under sufficient overparameterization. We further leverage the algorithmic stability framework, specifically on-average model stability, to derive generalization bounds. Our bounds have the potential for extension to more complex architectures and draw connections to the existing theory of overparameterized multilayer perceptrons.

While finite-time optimization and generalization guarantees address part of the puzzle, a complete understanding of training dynamics requires studying the time evolution of the training parameters. One approach is to study where the parameters converge as training time grows without bound, that is, whether GD training exhibits any implicit preference among the possible minimizers of the training loss. This preference is known as the implicit bias of the optimizer, and we explore it for self-attention models. We demonstrate that, in non-trivial data settings, the combined key-query matrix, when trained with GD, converges globally (regardless of its initial direction) to the solution of a hard-margin SVM problem. Further, our theory characterizes the complete training dynamics in this setup, explaining the time evolution of the loss function, the growth of the parameters, and the evolution of the attention map. Finally, we explore the benefits of adaptive step-size rules for Transformer optimization. Our findings reinforce the implicit bias perspective of self-attention and strengthen its connections to implicit bias in linear logistic regression, despite the intricate non-convex nature of the former.
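To make the hard-margin SVM characterization above concrete, one illustrative formulation, written here with assumed notation rather than the thesis's exact statement, is the following. Suppose each training example i has input tokens x_{i,t}, a query token z_i, and a relevant (selected) token index α_i; the combined key-query matrix W then solves a max-margin problem of the form

    minimize_W  ||W||_F    subject to    (x_{i,α_i} − x_{i,t})^T W z_i ≥ 1    for all i and all t ≠ α_i.

Under this reading, the GD iterates grow in norm while aligning in direction with the minimizer of this problem, so the attention scores of the selected tokens are separated from those of the remaining tokens with maximal margin.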

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International