Why do machine learning optimizers that work, work?
Kunstner, Frederik
Abstract
The impressive recent applications of machine learning have coincided with an increase in the costs of developing new methods. Beyond the obvious computational cost due to large datasets, the more insidious cost is complexity. The development of machine learning systems is not yet predictable and instead relies on rules of thumb and expensive trial and error. This thesis focuses on the fundamental methods used to build machine learning models, which “learn” by solving a numerical optimization problem to fit a model to data. Many successful optimization heuristics have emerged in recent years, but we have little understanding of why those heuristics outperform classical optimization schemes. The goal of this thesis is to build a better understanding of why methods that are widely used in machine learning work, presenting theoretical or empirical contributions in three areas.

Algorithms are often designed to tackle specific problems with a known structure, like the expectation-maximization (EM) algorithm for statistical models with missing data. Our current optimization theory ignores this structure and relies on assumptions that are not satisfied, even by textbook applications. We derive the convergence rate of EM for the most common case of exponential families, without additional assumptions.

Many heuristics have been proposed to build “adaptive” methods that automatically adjust per-coordinate step-sizes, but these heuristics are often brittle, as there is no formal definition of “adaptivity”. We formalize the problem as finding per-coordinate step-sizes that are competitive with the optimal ones for a given problem, and develop an algorithm that provably finds competitive step-sizes.

For recently developed language models, the Adam algorithm outperforms gradient descent by such a large margin that it is now the default option, but the reason for this improvement is not clear. We empirically evaluate hypotheses proposed to explain this performance gap, show that the gap is not due to noise, and isolate a feature of language problems that leads to optimization difficulties. We show that heavy-tailed class imbalance, where many rare classes have a large impact on the objective function, leads to a performance gap and to correlated gradients and Hessians, a property hypothesized to benefit Adam.
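As context for the first contribution, the textbook EM update for an exponential-family model can be written as alternating between expected sufficient statistics and moment matching. The display below is a sketch of that standard form only; the specific structural assumptions and convergence rates studied in the thesis are not reproduced here.

```latex
% Complete-data log-likelihood of an exponential family with unobserved z:
%   \log p(x, z \mid \theta) = \langle \theta, s(x, z) \rangle - A(\theta).
\begin{align*}
\bar{s}_t &= \mathbb{E}_{z \sim p(z \mid x,\, \theta_t)}\big[\, s(x, z) \,\big]
  && \text{(E-step: expected sufficient statistics)} \\
\theta_{t+1} &= \arg\max_{\theta}\; \langle \theta, \bar{s}_t \rangle - A(\theta)
  \;\;\Longleftrightarrow\;\; \nabla A(\theta_{t+1}) = \bar{s}_t
  && \text{(M-step: moment matching)}
\end{align*}
```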
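To make “per-coordinate step-sizes” concrete, the sketch below runs gradient descent where each coordinate has its own step-size, on a badly scaled quadratic. It only illustrates why a diagonal (per-coordinate) scaling can help; it is not the provably competitive algorithm developed in the thesis, and the function and step-size values are hypothetical.

```python
import numpy as np

def per_coordinate_gd(grad, x0, step_sizes, n_steps=100):
    """Gradient descent where coordinate i uses its own step-size step_sizes[i]."""
    x = x0.copy()
    for _ in range(n_steps):
        x -= step_sizes * grad(x)  # elementwise product: one step-size per coordinate
    return x

# Badly scaled quadratic f(x) = 0.5 * (100 * x[0]**2 + x[1]**2)  (illustrative only).
grad = lambda x: np.array([100.0 * x[0], 1.0 * x[1]])
x0 = np.array([1.0, 1.0])

# A single global step-size must stay below 2/100 to avoid divergence, which makes
# the second coordinate slow; per-coordinate step-sizes handle both scales at once.
print(per_coordinate_gd(grad, x0, step_sizes=np.array([1 / 100, 1.0])))
```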
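To illustrate what heavy-tailed class imbalance looks like in language data, the sketch below samples token classes from a Zipf-like distribution and checks how much of the data falls in rare classes: most classes are rare, yet together they account for a sizable share of the samples, and hence of the objective. The distribution and thresholds are illustrative assumptions, not data or results from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf-like class frequencies over a large vocabulary, a common model of
# token frequencies in language data (assumed here for illustration).
n_classes = 10_000
freqs = 1.0 / np.arange(1, n_classes + 1)
freqs /= freqs.sum()

samples = rng.choice(n_classes, size=1_000_000, p=freqs)
counts = np.bincount(samples, minlength=n_classes)

# Call a class "rare" if it appears at most 100 times in a million samples.
rare = counts <= 100
print(f"{rare.mean():.1%} of classes are rare, "
      f"but they hold {counts[rare].sum() / counts.sum():.1%} of the data.")
```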
Item Metadata
Title | Why do machine learning optimizers that work, work?
Creator | Kunstner, Frederik
Publisher | University of British Columbia
Date Issued | 2024
Language | eng
Date Available | 2024-09-25
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-ShareAlike 4.0 International
DOI | 10.14288/1.0445444
Degree Grantor | University of British Columbia
Graduation Date | 2024-11
Scholarly Level | Graduate
Aggregated Source Repository | DSpace