BIRS Workshop Lecture Videos
Understanding the mechanisms of fast hyperparameter transfer
Wu, Denny
Description
The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such a transfer strategy, we develop a conceptual framework for reasoning about HP transfer across scales. In synthetic settings, we present quantitative examples where transfer either offers a provable computational advantage or fails even under μP. To explain the fast transfer observed in practice, we conjecture that the optimization trajectory decomposes into two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but only weakly perturbs the HP optimum. We present empirical evidence for this hypothesis in large language model pretraining.
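To make the "scale-aware hyperparameters" idea concrete, the sketch below illustrates one common μP convention for learning-rate transfer: a base learning rate is tuned by grid search at a cheap proxy width, and the per-layer rates for hidden and readout matrices are scaled by 1/width when reused at the target size. This is a minimal, self-contained illustration under assumed conventions, not the speaker's code; all names (`mup_learning_rates`, `proxy_width`, `target_width`) are hypothetical.

```python
# Minimal sketch (assumed muP convention, not the talk's code) of learning-rate
# transfer across widths: hidden and readout learning rates scale as 1/width,
# while embedding/input learning rates stay width-independent.

def mup_learning_rates(base_lr: float, base_width: int, width: int) -> dict:
    """Per-layer-group learning rates when transferring from base_width to width."""
    scale = base_width / width
    return {
        "embedding": base_lr,        # input/embedding parameters: width-independent
        "hidden": base_lr * scale,   # hidden weight matrices: lr ~ 1/width
        "readout": base_lr * scale,  # output/readout layer: lr ~ 1/width
    }

if __name__ == "__main__":
    # Suppose a grid search at a cheap proxy width (256) selected base_lr = 3e-3;
    # the same base_lr is then reused at a 4096-wide model via the scaling rule.
    proxy_width, target_width, base_lr = 256, 4096, 3e-3
    print("proxy lrs: ", mup_learning_rates(base_lr, proxy_width, proxy_width))
    print("target lrs:", mup_learning_rates(base_lr, proxy_width, target_width))
```

Under this convention the grid search is run only at the proxy width, which is the computational advantage the abstract refers to; the talk's conjecture concerns when and why the optimum selected at the proxy width remains near-optimal at larger widths.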
Item Metadata

| Field | Value |
| --- | --- |
| Title | Understanding the mechanisms of fast hyperparameter transfer |
| Creator | Wu, Denny |
| Publisher | Banff International Research Station for Mathematical Innovation and Discovery |
| Date Issued | 2026-02-05 |
| Extent | 55.0 minutes |
| Subject | |
| Type | |
| File Format | video/mp4 |
| Language | eng |
| Notes | Author affiliation: New York University & Flatiron Institute |
| Series | |
| Date Available | 2026-02-09 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451477 |
| URI | |
| Affiliation | |
| Peer Review Status | Unreviewed |
| Scholarly Level | Postdoctoral |
| Rights URI | |
| Aggregated Source Repository | DSpace |