UBC Theses and Dissertations
Towards understanding self-attention and batch normalization: optimization and margin dynamics in neural networks
Ghaderi, Rouzbeh
Abstract
Deep neural networks have achieved transformative success in numerous fields such as computer vision, natural language processing, and robotics, largely due to their ability to capture complex patterns from large datasets. Among the innovations driving this success, the transformer architecture has played a pivotal role by overcoming the limitations of earlier models such as recurrent neural networks. Despite this empirical success, our theoretical understanding of transformers and their core self-attention mechanism remains incomplete, especially with regard to their optimization and generalization behavior.

Previous research has predominantly focused on single-head attention, limiting insight into the impact of overparameterization in more complex models. Motivated by the demonstrated benefits of overparameterization in training fully connected networks, we explore the potential optimization and generalization advantages of employing multiple attention heads. Specifically, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model under a suitable realizability condition on the data. Additionally, we identify initialization conditions that ensure the realizability assumption is met and verify that these conditions hold in a simple but practical data model. Our findings offer insights that could be extended to other data models and architectural settings. Building on these insights, we also examine the influence of other architectural components on training dynamics, focusing specifically on normalization. To this end, we study batch normalization, a widely used technique that has been shown to accelerate training and improve generalization in deep neural networks. Despite its popularity, our understanding of its underlying mechanisms remains limited, and uncovering its impact could help extend our results to other normalization variants across different neural network architectures. We investigate the effect of batch normalization on a simple two-layer neural network and show that, when trained with gradient descent on a simple data model, the network parameters converge to a solution with uniform margin directions. Furthermore, we derive finite-time rates for the evolution of both the loss and the weights during training.
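To make the first setting concrete, here is a minimal sketch, in JAX, of a single-layer multi-head self-attention classifier trained with full-batch gradient descent on the logistic loss. The head count, weight shapes, score scaling, pooling into a scalar prediction, and synthetic data are all illustrative assumptions; they are not taken from the thesis, which analyzes its own model and data distribution under a realizability condition.

```python
import jax
import jax.numpy as jnp

T, d, H = 5, 8, 4          # tokens per sequence, token dimension, number of heads (assumed)

def init_params(key):
    kW, kU = jax.random.split(key)
    return {
        "W": 0.1 * jax.random.normal(kW, (H, d, d)),  # per-head attention weight matrices
        "U": 0.1 * jax.random.normal(kU, (H, d)),     # per-head linear prediction vectors
    }

def model(params, X):
    # X: (T, d) matrix of token embeddings; returns a scalar score.
    def one_head(W_h, u_h):
        scores = X @ W_h @ X.T / jnp.sqrt(d)   # (T, T) attention scores
        A = jax.nn.softmax(scores, axis=-1)    # row-wise attention weights
        Z = A @ X                              # attended token representations
        return jnp.mean(Z @ u_h)               # pool tokens to a scalar
    return jnp.sum(jax.vmap(one_head)(params["W"], params["U"]))  # sum over heads

def loss(params, Xs, ys):
    scores = jax.vmap(lambda X: model(params, X))(Xs)
    return jnp.mean(jnp.log1p(jnp.exp(-ys * scores)))  # logistic loss

# Synthetic data standing in for the (unspecified) data model studied in the thesis.
kX, ky, kP = jax.random.split(jax.random.PRNGKey(0), 3)
Xs = jax.random.normal(kX, (32, T, d))
ys = jnp.where(jax.random.normal(ky, (32,)) > 0, 1.0, -1.0)

params = init_params(kP)
grad_fn = jax.jit(jax.grad(loss))
eta = 0.5                                      # step size (assumed)
for _ in range(200):                           # full-batch gradient descent
    grads = grad_fn(params, Xs, ys)
    params = jax.tree_util.tree_map(lambda p, g: p - eta * g, params, grads)
```

The second study concerns batch normalization. The sketch below, under the same caveats, trains a two-layer network whose hidden pre-activations are normalized over the batch and then reports per-example margins, which is the quantity the thesis's uniform-margin result concerns. The hidden width, toy data model, step size, and the absence of learned scale and shift parameters are assumptions made only to keep the example short.

```python
import jax
import jax.numpy as jnp

n, d, m = 64, 10, 16        # batch size, input dimension, hidden width (assumed)

def init(key):
    k1, k2 = jax.random.split(key)
    return {
        "W1": 0.1 * jax.random.normal(k1, (d, m)),  # first-layer weights
        "w2": 0.1 * jax.random.normal(k2, (m,)),    # second-layer weights
    }

def batch_norm(Z, eps=1e-5):
    # Normalize each hidden unit across the batch (no learned scale/shift in this sketch).
    return (Z - Z.mean(axis=0, keepdims=True)) / jnp.sqrt(Z.var(axis=0, keepdims=True) + eps)

def forward(params, X):
    hidden = jax.nn.relu(batch_norm(X @ params["W1"]))  # batch-normalized hidden layer
    return hidden @ params["w2"]                        # scalar output per example

def loss(params, X, y):
    return jnp.mean(jnp.log1p(jnp.exp(-y * forward(params, X))))  # logistic loss

# Toy linearly separable data standing in for the thesis's data model.
kx, kp = jax.random.split(jax.random.PRNGKey(1))
X = jax.random.normal(kx, (n, d))
y = jnp.where(X[:, 0] > 0, 1.0, -1.0)                 # labels from the first coordinate

params = init(kp)
grad_fn = jax.jit(jax.grad(loss))
for _ in range(2000):                                 # full-batch gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g,
                                    params, grad_fn(params, X, y))

margins = y * forward(params, X)                      # per-example margins on the training batch
print(float(margins.min()), float(margins.max()))     # near-equal values are the kind of uniformity the thesis analyzes
```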
Item Metadata
Title | Towards understanding self-attention and batch normalization: optimization and margin dynamics in neural networks
Creator | Ghaderi, Rouzbeh
Supervisor |
Publisher | University of British Columbia
Date Issued | 2024
Genre |
Type |
Language | eng
Date Available | 2024-10-24
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0447083
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace