UBC Theses and Dissertations
Towards understanding self-attention and batch normalization: optimization and margin dynamics in neural networks
Ghaderi, Rouzbeh
Abstract
Deep neural networks have achieved transformative success in numerous fields such as computer vision, natural language processing, and robotics, largely due to their ability to capture complex patterns from large datasets. Among the innovations driving this success, the transformer architecture has played a pivotal role by overcoming the limitations of earlier models such as recurrent neural networks. Despite this empirical success, our theoretical understanding of transformers and their core self-attention mechanism remains incomplete, especially with regard to their optimization and generalization behavior.

Previous research has predominantly focused on single-head attention, limiting insight into the impact of overparameterization in more complex models. Motivated by the demonstrated benefits of overparameterization in training fully connected networks, we explore the potential optimization and generalization advantages of employing multiple attention heads. Specifically, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model under a suitable realizability condition on the data. Additionally, we identify initialization conditions that ensure the realizability assumption is met and verify that these conditions hold in a simple but practical data model. Our findings offer insights that could be extended to other data models and architectural settings. Building on these insights, we also examine the influence of other architectural components on training dynamics, focusing specifically on normalization. To this end, we study batch normalization, a widely used technique that has been shown to accelerate training and improve generalization in deep neural networks. Despite its popularity, our understanding of its underlying mechanisms remains limited, and uncovering its impact could help extend our results to other normalization variants across different neural network architectures. We investigate the effect of batch normalization on a simple two-layer neural network and show that, when trained with gradient descent on a simple data model, the network parameters converge to a solution with uniform margin directions. Furthermore, we derive finite-time rates for the evolution of both the loss and the weights during training.
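To make the first setting concrete, here is a minimal sketch, in JAX, of a single-layer multi-head self-attention classifier trained with full-batch gradient descent on the logistic loss. The head count, weight shapes, score scaling, pooling into a scalar prediction, and synthetic data are all illustrative assumptions; they are not taken from the thesis, which analyzes its own model and data distribution under a realizability condition.

```python
import jax
import jax.numpy as jnp

T, d, H = 5, 8, 4          # tokens per sequence, token dimension, number of heads (assumed)

def init_params(key):
    kW, kU = jax.random.split(key)
    return {
        "W": 0.1 * jax.random.normal(kW, (H, d, d)),  # per-head attention weight matrices
        "U": 0.1 * jax.random.normal(kU, (H, d)),     # per-head linear prediction vectors
    }

def model(params, X):
    # X: (T, d) matrix of token embeddings; returns a scalar score.
    def one_head(W_h, u_h):
        scores = X @ W_h @ X.T / jnp.sqrt(d)   # (T, T) attention scores
        A = jax.nn.softmax(scores, axis=-1)    # row-wise attention weights
        Z = A @ X                              # attended token representations
        return jnp.mean(Z @ u_h)               # pool tokens to a scalar
    return jnp.sum(jax.vmap(one_head)(params["W"], params["U"]))  # sum over heads

def loss(params, Xs, ys):
    scores = jax.vmap(lambda X: model(params, X))(Xs)
    return jnp.mean(jnp.log1p(jnp.exp(-ys * scores)))  # logistic loss

# Synthetic data standing in for the (unspecified) data model studied in the thesis.
kX, ky, kP = jax.random.split(jax.random.PRNGKey(0), 3)
Xs = jax.random.normal(kX, (32, T, d))
ys = jnp.where(jax.random.normal(ky, (32,)) > 0, 1.0, -1.0)

params = init_params(kP)
grad_fn = jax.jit(jax.grad(loss))
eta = 0.5                                      # step size (assumed)
for _ in range(200):                           # full-batch gradient descent
    grads = grad_fn(params, Xs, ys)
    params = jax.tree_util.tree_map(lambda p, g: p - eta * g, params, grads)
```

The second study concerns batch normalization. The sketch below, under the same caveats, trains a two-layer network whose hidden pre-activations are normalized over the batch and then reports per-example margins, which is the quantity the thesis's uniform-margin result concerns. The hidden width, toy data model, step size, and the absence of learned scale and shift parameters are assumptions made only to keep the example short.

```python
import jax
import jax.numpy as jnp

n, d, m = 64, 10, 16        # batch size, input dimension, hidden width (assumed)

def init(key):
    k1, k2 = jax.random.split(key)
    return {
        "W1": 0.1 * jax.random.normal(k1, (d, m)),  # first-layer weights
        "w2": 0.1 * jax.random.normal(k2, (m,)),    # second-layer weights
    }

def batch_norm(Z, eps=1e-5):
    # Normalize each hidden unit across the batch (no learned scale/shift in this sketch).
    return (Z - Z.mean(axis=0, keepdims=True)) / jnp.sqrt(Z.var(axis=0, keepdims=True) + eps)

def forward(params, X):
    hidden = jax.nn.relu(batch_norm(X @ params["W1"]))  # batch-normalized hidden layer
    return hidden @ params["w2"]                        # scalar output per example

def loss(params, X, y):
    return jnp.mean(jnp.log1p(jnp.exp(-y * forward(params, X))))  # logistic loss

# Toy linearly separable data standing in for the thesis's data model.
kx, kp = jax.random.split(jax.random.PRNGKey(1))
X = jax.random.normal(kx, (n, d))
y = jnp.where(X[:, 0] > 0, 1.0, -1.0)                 # labels from the first coordinate

params = init(kp)
grad_fn = jax.jit(jax.grad(loss))
for _ in range(2000):                                 # full-batch gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g,
                                    params, grad_fn(params, X, y))

margins = y * forward(params, X)                      # per-example margins on the training batch
print(float(margins.min()), float(margins.max()))     # near-equal values are the kind of uniformity the thesis analyzes
```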
Item Metadata
Title | Towards understanding self-attention and batch normalization: optimization and margin dynamics in neural networks
Creator | Ghaderi, Rouzbeh
Supervisor |
Publisher | University of British Columbia
Date Issued | 2024
Genre |
Type |
Language | eng
Date Available | 2024-10-24
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0447083
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace