Open Collections

UBC Theses and Dissertations

Understanding the role of averaging in non-smooth stochastic gradient descent

Randhawa, Sikander

Abstract

Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after T steps of stochastic gradient descent (SGD), the error of the final iterate is O(log T / T) with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is Ω(log T / T). This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). We prove analogous results for functions which are Lipschitz and convex, but not necessarily strongly convex or differentiable. After T steps of stochastic gradient descent, the error of the final iterate is O(log T / √T) with high probability, and there exists a function for which the error of the final iterate of deterministic gradient descent is Ω(log T / √T). In the strongly convex setting, several forms of SGD, including suffix averaging, are known to achieve the optimal O(1/T) convergence rate in expectation. An intermediate step of our high-probability analysis for the error of the final iterate proves that the suffix averaging method achieves error O(1/T) with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin et al. (2012) and Hazan and Kale (2014), both of which achieved error O(1/T), but only in expectation, and achieved a high-probability error bound of O(log log T / T), which is suboptimal. This is the first known high-probability result which attains the optimal O(1/T) rate. We also consider a simple, non-uniform averaging strategy of Lacoste-Julien et al. (2012) and prove that it too achieves the optimal O(1/T) convergence rate with high probability. This provides a second algorithm which achieves the optimal O(1/T) convergence rate with high probability. Our high-probability results are proven using a generalization of Freedman's Inequality which we develop.
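
The three output strategies named in the abstract (final iterate, suffix average, non-uniform average) can be illustrated with a minimal Python sketch. It is not taken from the thesis. The toy objective f(x) = |x| + (μ/2)x² on an interval, the bounded noise model, the step size 1/(μt), the choice of the last half of the iterates as the suffix, and weights proportional to t are all assumptions made for illustration. The exact step sizes and weights analysed in the thesis and in Lacoste-Julien et al. (2012) may differ.

```python
import random

MU = 1.0       # strong convexity parameter of the toy objective (assumed)
RADIUS = 1.0   # feasible set is the interval [-RADIUS, RADIUS] (assumed)


def f(x):
    """Toy objective: Lipschitz on the interval, strongly convex, kink at 0."""
    return abs(x) + 0.5 * MU * x * x


def noisy_subgradient(x, rng):
    """A subgradient of f at x plus bounded zero-mean noise (assumed noise model)."""
    sign = (x > 0) - (x < 0)  # chooses the subgradient 0 of |x| at the kink
    return sign + MU * x + rng.uniform(-1.0, 1.0)


def sgd_outputs(x0, T, seed=0):
    """Run T steps of projected SGD and return three candidate outputs:
    the final iterate x_T, the suffix average, and a linearly weighted average."""
    rng = random.Random(seed)
    x = x0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x)                    # store x_t before updating
        g = noisy_subgradient(x, rng)
        x = x - g / (MU * t)                  # step size 1/(MU * t)
        x = max(-RADIUS, min(RADIUS, x))      # project back onto the interval

    final = iterates[-1]

    suffix = iterates[T // 2:]                # last half of the iterates
    suffix_avg = sum(suffix) / len(suffix)

    weights = range(1, T + 1)                 # weight on x_t proportional to t
    weighted_avg = sum(w * xt for w, xt in zip(weights, iterates)) / sum(weights)

    return final, suffix_avg, weighted_avg


final, suffix_avg, weighted_avg = sgd_outputs(x0=1.0, T=1000)
print("final iterate error:   ", f(final))
print("suffix average error:  ", f(suffix_avg))
print("weighted average error:", f(weighted_avg))
```

The minimizer of this toy objective is x = 0 with f(0) = 0, so the printed values are the errors of each output. A single run on a one-dimensional example only shows how the three outputs are computed from the same sequence of iterates. It does not exhibit the worst-case rates stated in the abstract.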

Item Metadata

Title	Understanding the role of averaging in non-smooth stochastic gradient descent
Creator	Randhawa, Sikander
Publisher	University of British Columbia
Date Issued	2020
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2020-08-24
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0392916
URI	http://hdl.handle.net/2429/75626
Degree (Theses)	Master of Science - MSc
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2020-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
