Why do machine learning optimizers that work, work?
Kunstner, Frederik
Abstract
The impressive recent applications of machine learning have coincided with an increase in the costs of developing new methods. Beyond the obvious computational cost due to large datasets, the more insidious cost is complexity. The development of machine learning systems is not yet predictable and instead relies on rules of thumb and expensive trial and error. This thesis focuses on the fundamental methods used to build machine learning models, which “learn” by solving a numerical optimization problem to fit a model to data. Many successful optimization heuristics have emerged in recent years, but we have little understanding of why those heuristics outperform classical optimization schemes. The goal of this thesis is to build a better understanding of why methods that are widely used in machine learning work, presenting theoretical or empirical contributions in three areas.

Algorithms are often designed to tackle specific problems with a known structure, like the expectation-maximization (EM) algorithm for statistical models with missing data. Our current optimization theory ignores this structure and relies on assumptions that are not satisfied, even by textbook applications. We derive the convergence rate of EM for the most common case of exponential families, without additional assumptions.

Many heuristics have been proposed to build “adaptive” methods that automatically adjust per-coordinate step-sizes, but these heuristics are often brittle, as there is no formal definition of “adaptivity”. We formalize the problem as finding per-coordinate step-sizes that are competitive with the optimal ones for a given problem, and develop an algorithm that provably finds competitive step-sizes.

For recently developed language models, the Adam algorithm outperforms gradient descent by such a large margin that it is now the default option, but the reason for this improvement is not clear. We empirically evaluate hypotheses proposed to explain this performance gap, show that the gap is not due to noise, and isolate a feature of language problems that leads to optimization difficulties. We show that heavy-tailed class imbalance, where many rare classes have a large impact on the objective function, leads to a performance gap and to correlated gradients and Hessians, a property hypothesized to benefit Adam.
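As context for the first contribution, the textbook EM update for an exponential-family model can be written as alternating between expected sufficient statistics and moment matching. The display below is a sketch of that standard form only; the specific structural assumptions and convergence rates studied in the thesis are not reproduced here.

```latex
% Complete-data log-likelihood of an exponential family with unobserved z:
%   \log p(x, z \mid \theta) = \langle \theta, s(x, z) \rangle - A(\theta).
\begin{align*}
\bar{s}_t &= \mathbb{E}_{z \sim p(z \mid x,\, \theta_t)}\big[\, s(x, z) \,\big]
  && \text{(E-step: expected sufficient statistics)} \\
\theta_{t+1} &= \arg\max_{\theta}\; \langle \theta, \bar{s}_t \rangle - A(\theta)
  \;\;\Longleftrightarrow\;\; \nabla A(\theta_{t+1}) = \bar{s}_t
  && \text{(M-step: moment matching)}
\end{align*}
```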
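To make “per-coordinate step-sizes” concrete, the sketch below runs gradient descent where each coordinate has its own step-size, on a badly scaled quadratic. It only illustrates why a diagonal (per-coordinate) scaling can help; it is not the provably competitive algorithm developed in the thesis, and the function and step-size values are hypothetical.

```python
import numpy as np

def per_coordinate_gd(grad, x0, step_sizes, n_steps=100):
    """Gradient descent where coordinate i uses its own step-size step_sizes[i]."""
    x = x0.copy()
    for _ in range(n_steps):
        x -= step_sizes * grad(x)  # elementwise product: one step-size per coordinate
    return x

# Badly scaled quadratic f(x) = 0.5 * (100 * x[0]**2 + x[1]**2)  (illustrative only).
grad = lambda x: np.array([100.0 * x[0], 1.0 * x[1]])
x0 = np.array([1.0, 1.0])

# A single global step-size must stay below 2/100 to avoid divergence, which makes
# the second coordinate slow; per-coordinate step-sizes handle both scales at once.
print(per_coordinate_gd(grad, x0, step_sizes=np.array([1 / 100, 1.0])))
```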
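To illustrate what heavy-tailed class imbalance looks like in language data, the sketch below samples token classes from a Zipf-like distribution and checks how much of the data falls in rare classes: most classes are rare, yet together they account for a sizable share of the samples, and hence of the objective. The distribution and thresholds are illustrative assumptions, not data or results from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf-like class frequencies over a large vocabulary, a common model of
# token frequencies in language data (assumed here for illustration).
n_classes = 10_000
freqs = 1.0 / np.arange(1, n_classes + 1)
freqs /= freqs.sum()

samples = rng.choice(n_classes, size=1_000_000, p=freqs)
counts = np.bincount(samples, minlength=n_classes)

# Call a class "rare" if it appears at most 100 times in a million samples.
rare = counts <= 100
print(f"{rare.mean():.1%} of classes are rare, "
      f"but they hold {counts[rare].sum() / counts.sum():.1%} of the data.")
```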
Item Metadata
Title | Why do machine learning optimizers that work, work?
Creator | Kunstner, Frederik
Publisher | University of British Columbia
Date Issued | 2024
Language | eng
Date Available | 2024-09-25
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-ShareAlike 4.0 International
DOI | 10.14288/1.0445444
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Scholarly Level | Graduate
Aggregated Source Repository | DSpace