Open Collections

UBC Theses and Dissertations
Interpolation, growth conditions, and stochastic gradient descent Mishkin, Aaron 2020



Full Text

Interpolation, Growth Conditions, and Stochastic Gradient Descent

by

Aaron Mishkin

B.Sc., University of British Columbia, 2018

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)

September 2020

© Aaron Mishkin, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "Interpolation, Growth Conditions, and Stochastic Gradient Descent", submitted by Aaron Mishkin in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Mark Schmidt, Computer Science (Supervisor)
Nicholas J. A. Harvey, Computer Science (Supervisory Committee Member)

Abstract

Current machine learning practice requires solving huge-scale empirical risk minimization problems quickly and robustly. These problems are often highly under-determined and admit multiple solutions which exactly fit, or interpolate, the training data. In such cases, stochastic gradient descent has been shown to converge without decreasing step-sizes or averaging, and can achieve the fast convergence of deterministic gradient methods. Recent work has further shown that stochastic acceleration and line-search methods for step-size selection are possible in this restricted setting. Although pioneering, existing analyses for first-order methods under interpolation have two major weaknesses: they are restricted to the finite-sum setting, and, secondly, they are not tight with the best deterministic rates. To address these issues, we extend the notion of interpolation to stochastic optimization problems with general, first-order oracles. We use the proposed framework to analyze stochastic gradient descent with a fixed step-size and with an Armijo-type line-search, as well as Nesterov's accelerated gradient method with stochastic gradients.
In nearly all settings, we obtain faster convergence with a wider range of parameters. The improvement for stochastic Nesterov acceleration is comparable to dividing by the square-root of the condition number and addresses criticism that existing rates were not truly "accelerated". In the case of convex functions, our convergence rates for stochastic gradient descent — both with and without the stochastic Armijo line-search — recover the best-known rates in the deterministic setting. We also provide a simple extension to ℓ2-regularized minimization, which opens the path to proximal-gradient methods and non-smooth optimization under interpolation.

Lay Summary

A major trend in machine learning is the use of flexible models which can exactly fit large quantities of data. For example, deep learning approaches can "memorize" datasets, meaning they achieve nearly perfect predictions on the samples used to fit the model. In this case, we say that the model interpolates the dataset. Interpolating models are particularly interesting from an optimization perspective because they can be fit very quickly using stochastic gradient methods. This contrasts with general models, where stochastic gradient methods are notoriously slow. In this thesis, we develop a rigorous definition of interpolation and study the speed of stochastic gradient methods for interpolating models. Our approach is more general than existing analyses and covers standard model-fitting using a dataset as a special case. For many model classes, we show stochastic gradient methods permit a wider range of parameters and are faster than previously known when interpolation is satisfied.

Preface

This thesis was conceived and written solely by the author, Aaron Mishkin. The basis for the theoretical work was developed in collaboration with Sharan Vaswani and Frederik Kunstner over the last two years.
The work here is original and unpublished with the exception of the stochastic Armijo line-search, which was published in Vaswani et al. (2019b). A. Mishkin is a coauthor of this paper.

The breakdown of contributions for results included from Vaswani et al. (2019b) is as follows: the idea to investigate stochastic line-search techniques was proposed by S. Vaswani. S. Vaswani also conceived of the stochastic Armijo line-search and proved the original form of Lemma 8. A. Mishkin and S. Vaswani later revised the lemma statement to the version stated in Chapter 4. The proof technique for Theorem 11 (Chapter 4) was developed and suggested by S. Vaswani, while the specific result was proved by A. Mishkin.

Many of the unpublished results in this thesis have also benefited from collaboration. The extension of interpolation to general stochastic oracles in Chapter 2 was conducted by the author alone, while the connection between interpolation and weak/strong growth was developed in collaboration with S. Vaswani and F. Kunstner. The improved convergence theorems in Chapters 3 and 5 build on previous work by Vaswani et al. (2019a) and were proved solely by the author.

Theorem 9 in Chapter 4, which improves the dependency on the strong-convexity parameter from µ̄ to µ, was suggested by S. Vaswani and F. Kunstner with reference to a result for the stochastic Polyak step-size by Loizou et al. (2020). A. Mishkin proved the theorem alone. The analysis of stochastic gradient descent for ℓ2-regularized functions satisfying interpolation in Chapter 6 was inspired by a conversation with Eduard Gorbunov. All figures are the original product of the author, A. Mishkin.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Preliminaries and Assumptions
  1.2 Related Work
2 Interpolation and Growth Conditions
  2.1 Stochastic Oracles
  2.2 Interpolation
  2.3 Growth Conditions
  2.4 Conclusions
3 Stochastic Gradient Descent
  3.1 Convergence for Strongly-Convex Functions
    3.1.1 General Oracles
    3.1.2 Individually Smooth and Convex Oracles
  3.2 Convergence for Convex Functions
  3.3 Almost Sure Convergence
  3.4 Conclusions
4 Line Search
  4.1 Background
  4.2 Convergence for Strongly-Convex Functions
  4.3 Convergence for Convex Functions
  4.4 Convergence for Non-Convex Functions
    4.4.1 Challenges in the Analysis
  4.5 Conclusions
5 Acceleration
  5.1 Background
  5.2 Estimating Sequences
  5.3 Convergence for Strongly-Convex Functions
  5.4 Convergence for Convex Functions
  5.5 Acceleration under Weak Growth
  5.6 Conclusions
6 Beyond Interpolation
  6.1 Convergence for L2-Regularized Convex Functions
  6.2 Conclusions
7 Conclusion
Bibliography
A Interpolation and Growth Conditions: Proofs
  A.1 Stochastic Oracles
  A.2 Interpolation
  A.3 Growth Conditions
B Stochastic Gradient Descent: Proofs
  B.1 Convergence for Strongly Convex Functions
  B.2 Convergence for Convex Functions
  B.3 Almost Sure Convergence
  B.4 Additional Lemmas
C Line Search: Proofs
  C.1 Convergence for Strongly-Convex Functions
  C.2 Convergence for Convex Functions
  C.3 Convergence for Non-Convex Functions
D Acceleration: Proofs
  D.1 Estimating Sequences
  D.2 Convergence for Strongly-Convex Functions
  D.3 Convergence for Convex Functions
E Beyond Interpolation: Proofs
  E.1 Convergence for L2-Regularized Convex Functions
F Useful Lemmas

List of Tables

Table 2.1 Relationship between minimizer interpolation and parameters of the weak and strong growth conditions.
Table 3.1 Comparison of iteration complexities of fixed step-size stochastic gradient descent for strongly-convex functions under strong growth.
Table 3.2 Comparison of iteration complexities of fixed step-size stochastic gradient descent for convex functions under weak growth.
Table 4.1 Comparison of iteration complexities for stochastic gradient descent with the stochastic Armijo line-search.
Table 5.1 Comparison of iteration complexities for stochastic acceleration schemes under strong growth.

List of Figures

Figure 2.1 Illustration of oracle samples satisfying different models of interpolation.
Figure 3.1 Procedural definition of stochastic gradient descent with a fixed step-size.
Figure 4.1 Progress made by stochastic gradient descent when using the stochastic Armijo line-search with and without minimizer interpolation.
Figure 4.2 Procedural definition of stochastic gradient descent with the stochastic Armijo line-search.
Figure 4.3 Step-size intervals accepted and rejected by the Armijo line-search for a Lipschitz-smooth function.
Figure 5.1 Classic procedural definition of Nesterov's accelerated gradient descent algorithm.
Figure 5.2 Reformulation of Nesterov's accelerated gradient descent as an alternating descent procedure.

Glossary

AGD: accelerated gradient descent
PL: Polyak-Łojasiewicz
R-SAGD: reformulated stochastic accelerated gradient descent
SAGD: stochastic accelerated gradient descent
SFO: stochastic first-order oracle
SGD: stochastic gradient descent

Acknowledgments

I am extremely grateful to my supervisor, Mark Schmidt, for fostering an intellectually curious and collaborative environment. His group at the University of British Columbia was an ε-optimal setting for this research. Sharan Vaswani convinced me to join the world of stochastic optimization in 2018 and none of the work herein would have come to be without his support. I am also greatly indebted to my friend and colleague Frederik Kunstner, with whom I have shared innumerable conversations on interpolation and stochastic gradient descent. Finally, Si Yi (Cathy) Meng, Betty Shea, and Jonathan Lavington provided many excellent comments and suggestions throughout the writing of this thesis.

Chapter 1

Introduction

Stochastic first-order methods are the most popular optimization algorithms in modern machine learning.
In particular, stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its adaptive variants (Duchi et al., 2011; Kingma and Ba, 2015; Tieleman and Hinton, 2012; Zeiler, 2012) are widely used in large-scale supervised learning, where they are frequently referred to as fundamental "workhorse" algorithms (Assran et al., 2019; Grosse and Salakhutdinov, 2015; Qian et al., 2019). The main advantage of these methods for machine learning is that they use only the gradient of a single training example, or of a small sub-sample of examples, to update the model parameters at each iteration. The computational cost of SGD (and variants) is thus independent of the training set size and scales to very large datasets and models. This property is also why stochastic first-order methods are the dominant approach to training highly expressive models, such as deep neural networks (Bengio, 2012; Zhang et al., 2017) and non-parametric kernels (Belkin et al., 2019b; Liang and Rakhlin, 2018).

Despite their successes, stochastic first-order methods suffer from two well-known problems: the step-size, momentum term, and other algorithmic hyper-parameters must be carefully tuned to obtain good performance (Bengio, 2012; Choi et al., 2019; Li and Orabona, 2019; Schaul et al., 2013); and they converge slowly compared to deterministic (Nesterov, 2004) or variance-reduced algorithms (Defazio et al., 2014; Johnson and Zhang, 2013; Le Roux et al., 2012) even when well-tuned. Hyper-parameter tuning for SGD is the focus of intense research, with approaches ranging from adaptive step-sizes inspired by online learning (Li and Orabona, 2019; Luo et al., 2019; Orabona and Tommasi, 2017) to meta-learning (Almeida et al., 1998; Baydin et al., 2018; Plagianakos et al., 2001; Schraudolph, 1999; Shao and Yip, 2000; Sutton, 1992) and heuristics for online estimation (Rolinek and Martius, 2018; Schaul et al., 2013; Tan et al., 2016).
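To make the per-iteration cost argument concrete, here is a minimal sketch of constant step-size SGD on a hypothetical least-squares problem (all names, sizes, and constants are illustrative, not from the thesis): each update touches only a mini-batch of rows, so its cost is independent of the dataset size n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 10_000, 5, 32           # dataset size, dimension, mini-batch size
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)    # realizable targets, so interpolation holds

def minibatch_grad(w, idx):
    # Gradient of the mini-batch loss: only len(idx) rows of X are touched,
    # so the per-iteration cost is O(b * d), independent of n.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
for _ in range(500):
    idx = rng.integers(0, n, size=b)      # sample a mini-batch
    w = w - 0.1 * minibatch_grad(w, idx)  # constant step-size update
```

Because the targets here are realizable, this constant step-size run drives the training loss toward zero; with label noise, the same iteration would instead stall at a noise-dominated level.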
In contrast, the slow convergence of first-order methods in the general stochastic setting cannot be improved, with tight lower-bounds showing Θ(ε⁻⁴) iteration complexity for convergence to an ε-approximate stationary point (Arjevani et al., 2019; Drori and Shamir, 2019). This compares poorly to deterministic methods, which are Θ(ε⁻²) for the same problem class (Carmon et al., 2019).

Increasing experimental and theoretical evidence shows that the slow optimization speed of stochastic first-order methods is mitigated when training over-parameterized models (Arora et al., 2018; Ma et al., 2018; Zou and Gu, 2019). For example, variance-reduced algorithms typically underperform SGD in this setting despite using increased memory or computation (Defazio and Bottou, 2019; Ma et al., 2018). A key property of over-parameterized problems is that they satisfy interpolation, meaning the model can exactly fit all of the available training data (Belkin et al., 2019b). While this may seem strong, interpolation has been observed in practice for popular methods such as boosting (Schapire et al., 1997), kernel learning (Belkin et al., 2019b), and over-parameterized neural networks (Belkin et al., 2019a; Zhang and Yin, 2013). Under additional assumptions, interpolation is a sufficient condition for the strong (Schmidt and Le Roux, 2013; Solodov, 1998; Tseng, 1998) or weak (Bassily et al., 2018; Vaswani et al., 2019a) growth conditions, which imply the second moment of the stochastic gradients is bounded by a linear function of either the gradient norm or the optimality gap, respectively. A form of automatic reduction in gradient noise occurs (Liu and Belkin, 2020) when strong or weak growth is satisfied, explaining why variance reduction may be unnecessary.

The connection between over-parameterization and fast stochastic optimization has led to a wave of interest in analyzing first-order methods under interpolation and weak/strong growth.
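As a concrete illustration of the two growth conditions just described, the following sketch (illustrative code, not from the thesis) estimates both bounds at a random point of an interpolating least-squares problem: it compares the second moment of the per-example gradients against the squared full-gradient norm (strong growth) and against the optimality gap (weak growth).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)     # realizable labels: interpolation holds, f(w*) = 0

def f(w):    return 0.5 * np.mean((X @ w - y) ** 2)
def grad(w): return X.T @ (X @ w - y) / n
def grad_i(w, i): return X[i] * (X[i] @ w - y[i])   # per-example gradient

w = rng.standard_normal(d)
second_moment = np.mean([np.sum(grad_i(w, i) ** 2) for i in range(n)])

# Strong growth: E‖∇f_i(w)‖² ≤ ρ ‖∇f(w)‖²  ->  this ratio lower-bounds ρ.
rho_strong = second_moment / np.sum(grad(w) ** 2)
# Weak growth: E‖∇f_i(w)‖² ≤ 2ρL (f(w) − f(w*)), with f(w*) = 0 here.
rho_weak_times_L = second_moment / (2 * f(w))
```

Both ratios stay bounded as w varies precisely because every per-example residual vanishes at the shared minimizer; with inconsistent labels, the strong-growth ratio would blow up near w∗.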
A number of works have shown that SGD obtains the fast convergence rate of deterministic gradient methods for interpolating models (Bassily et al., 2018; Cevher and Vu, 2019; Jain et al., 2018; Schmidt and Le Roux, 2013; Vaswani et al., 2019a), while related research has established accelerated rates under an additional requirement for convexity (Jain et al., 2018; Liu and Belkin, 2020; Vaswani et al., 2019a). Sub-sampled second-order methods have also been explored and proved to have local quadratic convergence for specific function classes (Meng et al., 2020). These positive results show the interpolation setting is restrictive enough to break the Ω(ε⁻⁴) barrier for stochastic first-order methods and attain the optimal Θ(ε⁻¹) complexity (up to problem-specific constants) for finding stationary points of smooth, convex functions (Arjevani and Shamir, 2016; Nemirovsky and Nesterov, 1985).

Despite these rapid advances, theoretical rates and practical performance for SGD under interpolation still rely on carefully selected hyper-parameters. A number of approaches from the general stochastic setting are rapidly being adapted to tackle this problem. For instance, several works have considered a stochastic version of the Polyak step-size, which does not require knowledge of the smoothness or convexity parameters (Berrada et al., 2019; Loizou et al., 2020). Stochastic line-searches using the classic Armijo condition (Armijo, 1966) have also been proposed and shown to obtain fast convergence under interpolation (Vaswani et al., 2019b). Very recently, adaptive variants of SGD have been analyzed both with and without the Armijo line-search (Vaswani et al., 2020).

This thesis analyzes a core group of first-order methods in stochastic optimization under interpolation. We consider constant step-size SGD, SGD with a stochastic Armijo line-search, and a version of Nesterov's accelerated gradient descent with stochastic gradients (Nesterov, 2004).
In nearly all cases, we show that existing analyses can be tightened to yield faster convergence rates with a larger range of step-sizes. In the case of acceleration, the improvement is comparable to dividing by the square-root of the condition number and addresses criticisms that previous analyses yield inferior rates to those of SGD in some circumstances (Liu and Belkin, 2020).

The thesis is organized as follows. In Chapter 2, we formalize interpolation and the strong/weak growth conditions in the context of general stochastic oracles. Connections between interpolation, smoothness properties of the stochastic oracle, and growth of the stochastic gradients are derived. Chapter 3 analyzes the complexity of SGD with a fixed step-size, drawing comparisons with previous work as well as convergence rates in the deterministic setting. Asymptotic convergence of SGD with a constant step-size to (i) stationary points of general non-convex functions, and (ii) minimizers of convex functions is shown under the strong and weak growth conditions. Chapters 4 and 5 then address the convergence of SGD with a stochastic Armijo line-search and stochastic accelerated gradient descent, respectively. Finally, Chapter 6 leaves the basic interpolation setting behind and considers structural minimization with interpolating functions as components. Convergence to an explicit neighborhood is derived for L2-regularized problems as a special case.

1.1 Preliminaries and Assumptions

This work considers the problem of minimizing a continuous function f : ℝᵈ → ℝ under interpolation conditions. We assume that f is bounded below with at least one finite minimizer. That is, there exists at least one w∗ ∈ ℝᵈ such that

    f(w) ≥ f(w∗) ∀ w ∈ ℝᵈ.

For functions with multiple minimizers, we write X∗ = arg min_w f(w) and denote the projection of a point onto the optimal set as Π_X∗(w).
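For intuition about optimal sets with multiple minimizers, consider an under-determined least-squares problem (a hypothetical example, not from the thesis): X∗ is then an affine subspace, and the projection Π_X∗(w) has a closed form via the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 6                        # fewer equations than unknowns
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)     # consistent system, so min f = 0 is attained

def f(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def proj_opt(w):
    # X* = {w : A w = b}. Projecting w onto this affine set subtracts the
    # minimum-norm correction of the residual (A has full row rank here).
    return w - np.linalg.pinv(A) @ (A @ w - b)

w = rng.standard_normal(d)
w_proj = proj_opt(w)
assert np.isclose(f(w_proj), 0.0)             # the projection is a minimizer
assert np.allclose(proj_opt(w_proj), w_proj)  # projecting twice changes nothing
```

Every point of X∗ attains the same objective value, which is exactly the multiple-minimizer situation created by interpolation.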
Additionally, f is required to be differentiable and L-smooth, meaning the mapping w ↦ ∇f(w) is an L-Lipschitz function,

    ‖∇f(w) − ∇f(v)‖ ≤ L ‖w − v‖ ∀ w, v ∈ ℝᵈ.

This is equivalent to the existence of the following quadratic bound on f:

    (L/2) ‖v − w‖² ≥ |f(v) − f(w) − ⟨∇f(w), v − w⟩| ∀ w, v ∈ ℝᵈ.   (L-Smoothness)

Often, only the upper bound given by L-smoothness is used,

    f(v) ≤ f(w) + ⟨∇f(w), v − w⟩ + (L/2) ‖v − w‖² ∀ w, v ∈ ℝᵈ,

which is sometimes called one-sided Lipschitz-smoothness. At times, we will further assume that f is convex or µ-strongly-convex:

    f(v) ≥ f(w) + ⟨∇f(w), v − w⟩ ∀ w, v ∈ ℝᵈ,   (Convexity)

    f(v) ≥ f(w) + ⟨∇f(w), v − w⟩ + (µ/2) ‖v − w‖² ∀ w, v ∈ ℝᵈ.   (µ-Strong-Convexity)

Convexity holds for many problems in machine learning, including linear and logistic regression.

Convexity can be relaxed to a more limited property called invexity. Formally, we say a differentiable function f is invex if all stationary points are also global minimizers of f (Ben-Israel and Mond, 1986), meaning

    ∇f(w) = 0 ⟹ f(w) ≤ f(v) ∀ v ∈ ℝᵈ.

Invexity is formally weaker than convexity and follows immediately from the first-order conditions for convexity given above. The analogue of strong convexity for invex functions is the Polyak-Łojasiewicz (PL) condition (Karimi et al., 2016). A function is said to be µ-PL if

    (1 / 2µ) ‖∇f(w)‖² ≥ f(w) − f(w∗)

holds for all w ∈ ℝᵈ. Again, the PL condition is weaker than strong convexity; a µ-strongly-convex function is µ-PL, but the converse does not hold.

In several cases, it will be useful to interpret our results in the context of finite-sum functions. The function f is said to be a finite-sum if it can be written as

    f(w) = (1/n) Σ_{i=1}^{n} f_i(w),

where the individual (or sub-) functions f_i : ℝᵈ → ℝ may be strongly-convex, convex, or non-convex depending on the problem. Such objective functions arise naturally in empirical risk minimization, where f_i typically corresponds to a single training example (x_i, y_i) in a training set.
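The inequalities above can be sanity-checked numerically. The sketch below (an illustrative quadratic, not an example from the thesis) reads L and µ off the Hessian spectrum and verifies L-smoothness, µ-strong-convexity, and the implied µ-PL condition at random points.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
M = rng.standard_normal((d, d))
H = M @ M.T + 0.1 * np.eye(d)     # positive-definite Hessian
w_star = rng.standard_normal(d)   # unique minimizer

def f(w):    return 0.5 * (w - w_star) @ H @ (w - w_star)
def grad(w): return H @ (w - w_star)

eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()    # smoothness and strong-convexity constants

for _ in range(5):
    w, v = rng.standard_normal(d), rng.standard_normal(d)
    gap = f(v) - f(w) - grad(w) @ (v - w)
    dist = np.sum((v - w) ** 2)
    # strong convexity and smoothness sandwich the Bregman gap:
    assert mu / 2 * dist - 1e-9 <= gap <= L / 2 * dist + 1e-9
    # mu-strong-convexity implies the mu-PL inequality:
    assert np.sum(grad(w) ** 2) / (2 * mu) >= f(w) - f(w_star) - 1e-9
```

For a quadratic, the Bregman gap is exactly (1/2)(v − w)ᵀH(v − w), so both bounds are tight at the extreme eigenvectors of H.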
For example, the classic least-squares regression objective can be written as

    f(w) = (1/n) Σ_{i=1}^{n} (⟨w, x_i⟩ − y_i)²,

which is finite-sum with sub-functions f_i(w) = (⟨w, x_i⟩ − y_i)².

1.2 Related Work

Interpolation: Existing work focuses on interpolation in the context of finite-sum objectives. In this setting, Bassily et al. (2018) define interpolation in terms of sequences converging to global minima of f. They say interpolation holds if, for every sequence (w_k) satisfying lim_{k→∞} f(w_k) = f(w∗), we also have

    ∀ i ∈ [n], lim_{k→∞} f_i(w_k) = f(w∗).

Berrada et al. (2019) propose ε-interpolation, which requires the sub-functions to be lower-bounded and ε sub-optimal for all w∗ ∈ arg min f:

    ∀ i ∈ [n], inf_w f_i(w) ≥ B and f_i(w∗) − B ≤ ε.

A larger body of work considers interpolation to hold when the optimal points w∗ are also stationary points or global minimizers of each sub-function f_i (Loizou et al., 2020; Meng et al., 2020; Vaswani et al., 2019a,b, 2020):

    w∗ ∈ arg min_w f(w) ⟹ w∗ ∈ arg min_w f_i(w) ∀ i ∈ [n],

or

    w∗ ∈ arg min_w f(w) ⟹ ∇f_i(w∗) = 0 ∀ i ∈ [n].

We will focus on interpolation conditions of this form, which we extend to general stochastic optimization problems.

Growth Conditions: The strong growth condition was first proposed in the context of incremental gradient methods by Solodov (1998) and Tseng (1998) as a bound on the squared norm of the stochastic gradients. Suppose f is finite-sum and P_i is an arbitrary distribution used to sub-sample the finite sum. Then strong growth with parameter ρ > 0 may be written as

    Maximal Strong Growth: ‖∇f_i(w)‖² ≤ ρ ‖∇f(w)‖²,

where the inequality holds almost surely with respect to P_i. This definition was later used by Schmidt and Le Roux (2013) to derive linear convergence rates for SGD with a constant step-size. Vaswani et al. (2019a) propose a modified version of strong growth which holds in expectation,

    Strong Growth: E_{P_i}[‖∇f_i(w)‖²] ≤ ρ ‖∇f(w)‖².

We call this latter condition strong growth and refer to the earlier definition as maximal strong growth. Vaswani et al.
(2019a) also propose the following weak growth condition as a natural relaxation of strong growth:

    Weak Growth: E_{P_i}[‖∇f_i(w)‖²] ≤ 2ρL (f(w) − f(w∗)).

Cevher and Vu (2019) refer to strong growth simply as "the growth condition" and suggest strong growth with an additive noise parameter as an alternative relaxation,

    Strong Growth + Noise: E_{P_i}[‖∇f_i(w)‖²] ≤ ρ ‖∇f(w)‖² + σ²,

which they also call weak growth. For clarity, we refer to this assumption as strong growth with additive noise. For unbiased P_i, strong growth with noise is slightly weaker than assuming a bound on the variance of the stochastic gradients (Ghadimi and Lan, 2012; Khaled and Richtárik, 2020). Cevher and Vu (2019) study the convergence of proximal-gradient methods under strong growth with additive noise and also prove that strong growth is both sufficient and necessary for SGD to converge linearly.

Stochastic Acceleration: Many authors have considered accelerating stochastic gradient methods. Schmidt et al. (2011) provide sufficient conditions on gradient noise for acceleration of proximal-gradient methods. D'Aspremont (2008) and Devolder et al. (2014) investigate accelerated gradient methods under the assumption of approximate oracles with deterministic, bounded errors and derive rates for convergence and error accumulation in this setting. Honorio (2012) considers accelerated proximal-gradient methods under biased stochastic oracles and shows that accumulated errors prevent a high-probability convergence guarantee. In contrast, Cohen et al. (2018) assume access to an unbiased oracle with additive gradient noise and propose a noise-resistant acceleration scheme. Very recently, Chen and Kolar (2020) analyze accelerated methods for strongly-convex functions under strong growth with additive noise.

An alternative approach to stochastic acceleration splits the convergence rate into stochastic and deterministic parts.
Using this framework, multiple authors have shown that acceleration schemes speed up convergence for the deterministic component and achieve nearly optimal rates in the stochastic approximation setting (Ghadimi and Lan, 2012, 2013; Hu et al., 2009). For finite-sum functions, variance reduction techniques can be combined with acceleration to improve convergence on the stochastic component (Allen-Zhu, 2017, 2018; Defazio, 2016; Kovalev et al., 2020; Shang et al., 2018; Tang et al., 2018). Such methods have iteration complexity of O((n + √(nκ)) log(1/ε)), where n is the number of sub-functions and κ is the condition number. This is as fast as deterministic acceleration up to the additional factor of √n. Accelerated primal-dual methods have been proposed in the same setting under an additional requirement for each f_i to be Lipschitz-smooth (Lan and Zhou, 2018; Zhang and Lin, 2015). Alternative approaches also leveraging finite-sum structure are based on the proximal-point algorithm and include accelerated SDCA (Shalev-Shwartz and Zhang, 2014), Catalyst (Lin et al., 2015), and accelerated APPA (Frostig et al., 2015).

Several works have recently considered acceleration under interpolation. The most similar to the investigation here is that of Vaswani et al. (2019a), who analyze a slightly altered version of Nesterov's accelerated gradient descent under the strong growth condition. Liu and Belkin (2020) propose a different modification, called MaSS, and analyze its convergence for convex quadratics as well as strongly-convex functions with additional assumptions. These assumptions imply strong growth, but yield hard-to-compare rates. Jain et al. (2018) prove accelerated rates for squared losses under interpolation using a tail-averaging scheme.
Finally, Assran and Rabbat (2020) study the stochastic approximation setting and prove that accelerated gradient descent may fail to accelerate even when interpolation is satisfied.

Line Search: Line-search is a classic technique for setting the step-size in deterministic optimization (see Nocedal and Wright (1999)), but extensions to stochastic optimization are non-obvious. Mahsereci and Hennig (2017) use a Gaussian process model to define probabilistic Wolfe conditions (Wolfe, 1969, 1971); however, the convergence of SGD with this procedure is not known. Fridovich-Keil and Recht (2019) propose line-search conditions based on golden-section search (Avriel and Wilde, 1968), but again only provide empirical evidence for SGD. Paquette and Scheinberg (2020), Krejic and Krklec (2013), and Ogaltsov et al. (2019) prove convergence of SGD with the Armijo condition in the general stochastic setting with several caveats: Paquette and Scheinberg (2020) require adaptive batch-sizes, Krejic and Krklec (2013) assume the stochastic gradients are bounded, and Ogaltsov et al. (2019) use explicit knowledge of an upper-bound on the variance. Other authors consider similar approaches based on multiple function and/or gradient evaluations at each iteration (Byrd et al., 2012; De et al., 2016; Friedlander and Schmidt, 2012). Such approaches also use growing batch-sizes to ensure convergence. The work in this thesis follows Vaswani et al. (2019b), who investigated stochastic versions of the Armijo line-search under interpolation; we provide tighter analyses in a more general setting.

Asymptotic Convergence: The original paper by Robbins and Monro (1951) establishes asymptotic, almost-sure convergence for SGD to the zero of a convex function if the stochastic gradients are bounded and a decreasing step-size is used.
Highly general analyses have since shown almost-sure convergence for non-convex functions when strong growth with additive noise is satisfied (Bertsekas and Tsitsiklis, 2000; Bottou, 1991). Alternative work directly derives these conditions from properties of strongly-convex or non-convex functions, also for the purpose of proving almost-sure convergence (Lei et al., 2019; Nguyen et al., 2018). Recently, asymptotic convergence was shown with Adagrad-style step-sizes instead of a fixed, decreasing schedule (Li and Orabona, 2019).

Chapter 2

Interpolation and Growth Conditions

Interpolation is a technique from numerical analysis where a function is approximated by exactly fitting a sample of its evaluations using a tractable class of approximators. This class is often piece-wise linear, piece-wise cubic, or polynomial, but can also be non-parametric. Even over-parameterized neural networks can be used, assuming it is possible to solve the resulting non-convex optimization problem. This is because the defining aspect of interpolation methods is not the choice of approximating function, but that the approximation attains the true function value at all points in the sample.

This basic notion of interpolation can be directly carried over to supervised machine learning. The regression case is nearly identical — intuitively, a model interpolates a set of training examples if it predicts exactly the true target for each example. The analogy to numerical analysis is complete if we assume that the data are generated by an underlying deterministic mapping. For classification problems, interpolation can be defined in two obvious ways. We might say a model satisfies interpolation if it predicts the correct class for each training example, or alternatively, if it attains zero loss for all training examples. The latter interpretation is dependent on the choice of objective function and starts to hint at the role of interpolation in optimization.
Note that in the regression example, the two notions of interpolation coincide under the natural choice of squared loss on all examples. These intuitive definitions of interpolation mean approximately the same thing: "to fit the data exactly". While this agrees with the spirit of the methodology and the broader scientific usage of the term, a rigorous definition is necessary to analyze the complexity of first-order optimizers under interpolation. This chapter formalizes interpolation in the context of stochastic optimization. In particular, we do the following:

1. Define the concept of a stochastic first-order oracle (SFO) O and propose a new Lipschitz-smoothness property for SFOs, which we call individual smoothness.

2. Give three specific definitions of interpolation in the context of a stochastic first-order oracle and objective function pair (f, O). These are the first definitions applicable to the general stochastic setting.

3. Characterize the relationships between the three models of interpolation as well as connections to strong and weak growth. In particular, we show that interpolation implies weak growth if O is individually-smooth; this relaxes existing sufficient conditions, which further assume f is convex and finite-sum (Vaswani et al., 2019a).

We begin with a discussion of stochastic oracles.

2.1 Stochastic Oracles

The defining feature of stochastic optimization algorithms is that they cannot directly access the value or gradient of the objective function f. Instead, they obtain noisy function and gradient evaluations through an SFO O. At each iteration k, O outputs a stochastic function f(w, zk) and gradient ∇f(w, zk) at the query point w, where zk is a random variable with distribution µk supported on Zk ⊆ Rm, m > 0.[1] We assume that f(w, ·) is a deterministic Borel function, meaning the stochasticity in f(w, zk) stems only from the random variable zk.
Similarly, ∇f(w, ·) is taken to be a deterministic Borel function related to f(w, ·) through the standard differentiation operator,

∇f(w, ·) = [∂f(w, ·)/∂w1, …, ∂f(w, ·)/∂wd]ᵀ.

Unless stated otherwise, the oracle outputs are always taken to be unbiased, implying that

E_{zk}[f(w, zk)] = f(w) and E_{zk}[∇f(w, zk)] = ∇f(w),

for all k.

[1] It will be sufficient to treat the measure µk as a probability density function pk for nearly all of our purposes.

The definition of O is non-stationary in the sense that it allows the distributions µk and their supports Zk to change across iterations. As such, it will be necessary to refer to the support of the entire stochastic process (zk),

Z = ⋃_{k∈N} Zk.

For simplicity, Z is called the support of O. Note that a statement which holds point-wise over Z holds almost-surely for all zk. These two requirements are equivalent, since (by definition) every z ∈ Z is in the support of some µk. We will make use of this less cumbersome notation.

Some results will further require that the stochastic function f(·, zk) is Lipschitz-smooth for all outcomes in Z. In this case, we say O satisfies individual smoothness with parameter Lmax.

Definition 1 (Individual Smoothness). An SFO O is called Lmax individually-smooth if the stochastic gradient mapping w ↦ ∇f(w, z) is Lz-Lipschitz with Lz ≤ Lmax for all z ∈ Z.

Individual smoothness implies that the quadratic upper bound

f(v, z) ≤ f(w, z) + ⟨∇f(w, z), v − w⟩ + (Lmax/2)‖v − w‖²  ∀ v, w ∈ Rd,

holds point-wise over Z and thus almost-surely for each zk. Note that the existence of an Lmax individually-smooth, unbiased SFO for f implies that f is L-smooth with L ≤ Lmax as an immediate corollary.

Corollary 1. Let f : Rd → R be a convex, differentiable function.
If there exists an unbiased, Lmax individually-smooth SFO O for f, then f is L-smooth with L ≤ Lmax. Alternatively, if O is individually-convex and biased with finite support Z and finite partial derivatives ∂f(w, z)/∂wj for each z ∈ Z, then E_{zk}[f(·, zk)] is Lmax-smooth for all k.

See Section A.1 for proof. In several cases, we will also make use of individual strong-convexity.

Definition 2 (Individual Strong-Convexity). An SFO O is called µmax individually-strongly-convex if the function f(·, z) is µz-strongly-convex with µz ≤ µmax for all z ∈ Z. If µmax = 0, then we say O is individually-convex.

Individually-smooth oracles are widespread throughout machine learning in the form of finite-sum functions. Recall that finite-sum functions are a particular class of structured optimization problem where the objective f is the sum of n sub-functions. The following example demonstrates individual smoothness in the context of least-squares regression.

Example 1 (Least-Squares Regression: Individual Smoothness). The classic least-squares regression problem

w∗ = arg min (1/n) Σ_{i=1}^n (⟨w, xi⟩ − yi)²,

has a finite-sum structure with fi(w) = (⟨w, xi⟩ − yi)². The mini-batch SFO uniformly sub-samples b ≤ n examples to obtain the following function and gradient estimators:

f(w, z) = (1/b) Σ_{i∈z} (⟨w, xi⟩ − yi)²  and  ∇f(w, z) = (1/b) Σ_{i∈z} 2(⟨w, xi⟩ − yi) xi,

where z ∈ Z = {A ⊆ {1, …, n} : |A| = b} is the set of sampled indices. Notice that f(·, z) is convex and Lz-smooth with Lz = (2/b)‖Σ_{i∈z} xi xiᵀ‖op, where ‖·‖op is the operator norm. This implies O is individually-convex and Lmax individually-smooth with Lmax ≤ 2 max_{i∈[n]} ‖xi‖².

Example 1 illustrates the mini-batch oracle: unbiased, simple to compute, and by far the most common SFO in machine learning. If the sub-functions fi are each individually Li-smooth, then the mini-batch SFO satisfies individual smoothness with Lmax = max_{i∈[n]} Li.
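As a concrete sketch of the mini-batch oracle in Example 1 (a hypothetical NumPy implementation on synthetic data; the variable names and setup are illustrative, not from the thesis), we can check unbiasedness and the individual-smoothness bound numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 8, 3, 2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def oracle(w, z):
    """Mini-batch SFO: stochastic function value and gradient for indices z."""
    r = X[z] @ w - y[z]
    return np.mean(r ** 2), (2.0 / len(z)) * X[z].T @ r

w = rng.standard_normal(d)

# Unbiasedness: averaging the n singleton-batch gradients recovers the true
# gradient of f(w) = (1/n) * sum_i (<w, x_i> - y_i)^2.
mean_grad = np.mean([oracle(w, [i])[1] for i in range(n)], axis=0)
full_grad = (2.0 / n) * X.T @ (X @ w - y)
assert np.allclose(mean_grad, full_grad)

# Individual smoothness: L_z = (2/b) * || sum_{i in z} x_i x_i^T ||_op
# is bounded by L_max <= 2 * max_i ||x_i||^2.
z = [0, 1]
L_z = (2.0 / b) * np.linalg.norm(X[z].T @ X[z], ord=2)
L_max = 2.0 * max(np.linalg.norm(X[i]) ** 2 for i in range(n))
assert L_z <= L_max + 1e-12
```

Here `ord=2` gives the operator (spectral) norm of the batch Gram matrix, matching the expression for Lz in Example 1.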
We call general SFOs individually-smooth specifically to invite comparison with this natural property of the mini-batch oracle.

2.2 Interpolation

Now we formalize interpolation in the context of stochastic optimization problems. Unlike previous work, we consider general SFOs and do not require f to be finite-sum or O to be the mini-batch oracle (cf. Vaswani et al. (2019a) or Bassily et al. (2018)). Instead, three different notions of interpolation are developed as a joint property of the objective and oracle. As with individual smoothness, interpolation is shown to have a simple realization for finite-sum functions that satisfies the intuition previously developed for machine learning problems.

The basic requirement of interpolation is that the oracle outputs f(·, z) and ∇f(·, z) resemble the true function at key target points. The axes of variation in the definitions to follow are (a) the specific target points and (b) the type of "matching" required. Unlike classic interpolation, matching here refers to mutual optimality or mutual stationarity, rather than a requirement for equal function values; enforcing f(w∗) = f(w∗, z) may not be useful from an optimization perspective, since ∇f(w∗, z) could be non-zero. The three definitions are as follows:

Definition 3 (Interpolation: Minimizers). A function-oracle pair (f, O) satisfies minimizer interpolation if for all z ∈ Z,

f(w′) ≤ f(w) ∀ w ∈ Rd  ⟹  f(w′, z) ≤ f(w, z) ∀ w ∈ Rd.

Definition 4 (Interpolation: Stationary Points). A function-oracle pair (f, O) satisfies stationary-point interpolation if for all z ∈ Z,

∇f(w′) = 0  ⟹  ∇f(w′, z) = 0.

Definition 5 (Interpolation: Mixed). A function-oracle pair (f, O) satisfies mixed interpolation if for all z ∈ Z,

f(w′) ≤ f(w) ∀ w ∈ Rd  ⟹  ∇f(w′, z) = 0.

In words, Definition 3 states that global minimizers of the objective f must be global minimizers of the stochastic functions given by the SFO at every iteration k.
In contrast, Definition 4 puts the same requirement on stationary points of f, while Definition 5 merely demands that minimizers of f are stationary points of f(·, z) for all outcomes z.

For general functions and SFOs, the relationship between these models of interpolation is limited to the following: minimizer interpolation and stationary-point interpolation are stronger than mixed interpolation. However, all three definitions are equivalent when f and f(·, z) are invex for all z ∈ Z. This is stated in the following lemma.

Figure 2.1: Oracle samples satisfying the three different models of interpolation (minimizer, stationary-point, and mixed) for the same non-convex function. The dashed red line is the graph of f(w, z) for a single, fixed z ∈ Z. Stationary points are denoted as w′, while the global minimizer is marked as w∗. Note that f(w, z) is permitted to have many "extra" stationary points when O satisfies stationary-point interpolation. Similarly, mixed interpolation does not prevent f(w, z) from being unbounded below.

Lemma 1. Let (f, O) be an arbitrary function-SFO pair. Then only the following relationships hold between the models of interpolation:

Minimizer Interpolation (Definition 3) ⟹ Mixed Interpolation (Definition 5)

and

Stationary-Point Interpolation (Definition 4) ⟹ Mixed Interpolation (Definition 5).

However, if f is invex and O is such that f(·, z) is invex for all z ∈ Z, then minimizer, stationary-point, and mixed interpolation are equivalent.

This result follows immediately from first-order optimality conditions and the equivalence of stationary points and global minimizers for invex functions. For completeness, a short proof with counter-examples for the implications which do not hold is given in Appendix A.

Let us return to the finite-sum setting and mini-batch oracle to better understand the three definitions of interpolation.
In particular, let f and O be such that

f(w) = (1/n) Σ_{i=1}^n fi(w),  f(w, zk) = f_{zk}(w),  ∇f(w, zk) = ∇f_{zk}(w),

where µk is a uniform distribution over the support set Z = {1, …, n}. The function-oracle pair (f, O) satisfies minimizer interpolation if the individual functions fi are globally minimized at every global minimum of f. The differences between minimizer and stationary-point interpolation are readily apparent for finite sums of non-convex functions, as shown in Figure 2.1. The following example further narrows this to least-squares regression, where interpolation implies the entire training set can be fit exactly.

Example 2 (Least-Squares Regression: Interpolation). Consider the least-squares problem

w∗ = arg min (1/n) Σ_{i=1}^n (⟨w, xi⟩ − yi)²,

with individually-smooth and convex mini-batch oracle

f(w, z) = (1/|z|) Σ_{i∈z} (⟨w, xi⟩ − yi)²  and  ∇f(w, z) = (1/|z|) Σ_{i∈z} 2(⟨w, xi⟩ − yi) xi.

The stochastic functions f(·, z) are invex, as is f, meaning minimizer, stationary-point, and mixed interpolation are equivalent. If (f, O) satisfies minimizer interpolation, then first-order optimality implies

2(⟨w∗, xi⟩ − yi) xi = 0  ∀ i ∈ [n].

Further requiring xi ≠ 0 for all i ∈ [n] guarantees ⟨w∗, xi⟩ = yi, and our abstract definitions of interpolation recover the original meaning from numerical analysis.

There are several natural ways to establish interpolation. Least-squares regression satisfies interpolation when y ∈ span({xi}_{i=1}^n). This occurs, for example, when the data matrix is full-rank and n ≤ d (Hastie et al., 2009). Similarly, interpolation is satisfied for linear classifiers on separable datasets using the squared-hinge loss (Vaswani et al., 2019a). In the general setting of SFOs, the following lemma establishes simple conditions for (f, O) to satisfy minimizer interpolation.

Lemma 2. Let (f, O) be a function-oracle pair.
If O is unbiased and

f(w, z) ≥ f(w∗)  ∀ w ∈ Rd, ∀ z ∈ Z,

holds, then (f, O) satisfies minimizer interpolation.

See Section A.2 for proof.

Lemma 2 gives a convenient mechanism for checking whether interpolation holds. It is most useful for finite-sum functions, where the sufficient conditions enforce fi(w∗) = f(w∗) for each i. It is also closely related to the ε-interpolation concept proposed by Berrada et al. (2019), which requires |f(w∗, z) − f(w∗)| ≤ ε for all z ∈ Z, where ε > 0. This alternative model of interpolation is not sufficient for exact convergence (Berrada et al., 2019) and we do not explore ε-interpolation any further in this work.

2.3 Growth Conditions

Minimizer, stationary-point, and mixed interpolation all constrain the stochastic oracle to resemble the true objective at a set of target points. Now we show that for individually-smooth oracles, interpolation further implies that the stochastic gradients must be well-behaved globally. This is made concrete by the notion of growth conditions, which constrain the stochasticity of ∇f(w, zk) in terms of ‖∇f(w)‖. In particular, we prove that the weak (Vaswani et al., 2019a) and strong (Schmidt et al., 2011) growth conditions hold with specific and simple constants when O is individually-smooth and satisfies minimizer interpolation. But first, we give some brief background on regularity conditions for first-order methods.

Regularity conditions on the stochastic gradients have a long history in stochastic optimization. The first analysis of SGD by Robbins and Monro (1951) required a uniform bound on the norm of the stochastic gradients,

‖∇f(w, zk)‖ ≤ C,

in order to prove convergence. This "bounded gradients" assumption is rarely satisfied by objective functions in machine learning. For example, taking ‖w‖ → ∞ in Example 1 shows that it does not hold even for simple least-squares problems. A more realistic alternative is bounded variance, which requires

E_{zk}[‖∇f(w, zk) − ∇f(w)‖²] ≤ σ²,

at each iteration k.
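As a quick numerical sketch of why these uniform bounds fail for least squares (assuming the singleton-batch version of the oracle from Example 1; the setup is illustrative, not from the thesis), the variance of the stochastic gradients grows without bound as ‖w‖ increases:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def variance(w):
    """E_z ||grad f(w, z) - grad f(w)||^2 for the singleton-batch oracle."""
    full = (2.0 / n) * X.T @ (X @ w - y)
    grads = [2.0 * (X[i] @ w - y[i]) * X[i] for i in range(n)]
    return float(np.mean([np.linalg.norm(g - full) ** 2 for g in grads]))

w = rng.standard_normal(d)
# Scaling w inflates the residuals quadratically, so no uniform bound C on
# the gradient norms, nor sigma^2 on the variance, can exist globally.
assert variance(10.0 * w) > variance(w)
assert variance(100.0 * w) > variance(10.0 * w)
```

The variance is a quadratic function of w here, which is exactly why conditions relating the stochastic gradients to ‖∇f(w)‖, rather than to a constant, are needed.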
Bounded variance, unbiasedness of O, and independence of the zk variables (i.e., zk ⊥⊥ zj for k ≠ j) collectively define the stochastic approximation setting, which has been widely studied; see Kushner and Yin (1997) for more details. Yet it is also simple to show that bounded variance fails for least squares with a mini-batch oracle.[2]

A far more realistic model is obtained by relaxing bounded variance to

E_{zk}[‖∇f(w, zk)‖²] ≤ ρ‖∇f(w)‖² + σ²,

where ρ, σ ≥ 0. We call this condition "strong growth with additive noise" for reasons which will soon be clear. Note that it is strictly weaker than bounded variance, which is recovered when ρ = 1. Strong growth with additive noise was used as early as the classical analysis of stochastic optimization algorithms by Polyak and Tsypkin (1973), and continues to appear in current work (Bertsekas and Tsitsiklis, 2000; Khaled and Richtárik, 2020; Nguyen et al., 2018).

The first growth condition we discuss was proposed by Tseng (1998) and Solodov (1998), who used a version of strong growth with noise where σ² = 0 is fixed. Their condition is further strengthened to require the norm of the stochastic gradient to be almost-surely bounded by that of the true gradient. Following Khaled and Richtárik (2020), we call this condition maximal strong growth.

Definition 6 (Maximal Strong Growth). A function-oracle pair (f, O) satisfies maximal strong growth with parameter ρ if

‖∇f(w, z)‖² ≤ ρ‖∇f(w)‖²,

holds for all w ∈ Rd and z ∈ Z.

The "maximal" moniker is justified by the fact that Tseng and Solodov's condition is equivalent to

max_E ∫_E ‖∇f(w, z)‖² dµk(z) ≤ ρ‖∇f(w)‖²,

where E is any event with non-zero probability under the measure µk. It is important to note that maximal strong growth is much more restrictive than strong growth with additive noise.
The former condition immediately implies that O satisfies stationary-point interpolation, while σ² > 0 in the latter allows interpolation to be violated to an arbitrary degree. Maximal strong growth was originally suggested in the context of incremental gradient methods and is impractical in the general setting due to the almost-sure requirement. The following, far more practical variation was proposed by Vaswani et al. (2019a) and is simply called strong growth.

[2] Take ‖w‖ → ∞ in a problem instance where ‖xi‖ ≠ ‖xj‖ for at least two examples xi, xj to see that there can be no bound on the variance.

Definition 7 (Strong Growth). A function-oracle pair (f, O) satisfies strong growth with parameter ρ if

E_{zk}[‖∇f(w, zk)‖²] ≤ ρ‖∇f(w)‖²,

holds for all k ≥ 0 and w ∈ Rd.

It is straightforward to show that strong growth is indeed a weaker condition than maximal strong growth, which we do in the following lemma.

Lemma 3 (Formulations of Strong Growth). Let (f, O) be a function-oracle pair. If (f, O) satisfies maximal strong growth, then it also satisfies strong growth. However, strong growth does not imply maximal strong growth for general O.

The proof of the lemma is given in Section A.3 and illustrates the impracticality of maximal strong growth, which is not satisfied even for multiplicative Gaussian noise. However, a sufficient condition for the equivalence of maximal strong growth and strong growth is finite Z.

Lemma 4. Let (f, O) be a function-SFO pair satisfying the strong growth condition with constant ρ. Moreover, assume that the support Z of O is finite and each zk admits a probability mass function pk. Then (f, O) also satisfies maximal strong growth.

See Section A.3 for proof.

An important consequence of Lemma 4 is the equivalence of the maximal strong growth and strong growth conditions for finite-sum optimization. This nicely reflects the original use of maximal strong growth for incremental gradient methods.
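To make Definition 7 and Lemma 3 concrete, a small numerical sketch (hypothetical NumPy code on a realizable least-squares problem with the singleton-batch oracle, so that minimizer interpolation holds; not code from the thesis) compares the empirical strong growth ratio with its maximal counterpart:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 8
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)  # realizable targets: interpolation holds

def growth_ratios(w):
    """(mean, max) of ||grad f(w, z)||^2 / ||grad f(w)||^2 over batches z."""
    full_sq = np.linalg.norm((2.0 / n) * X.T @ (X @ w - y)) ** 2
    sq = [np.linalg.norm(2.0 * (X[i] @ w - y[i]) * X[i]) ** 2
          for i in range(n)]
    return np.mean(sq) / full_sq, max(sq) / full_sq

ratios = [growth_ratios(rng.standard_normal(d)) for _ in range(200)]
rho_strong = max(r[0] for r in ratios)   # empirical strong growth constant
rho_maximal = max(r[1] for r in ratios)  # empirical maximal strong growth constant

# Strong growth is the weaker requirement: since a mean never exceeds a
# maximum, its constant is never larger than the maximal one.
assert rho_strong <= rho_maximal
assert np.isfinite(rho_strong)
```

Both empirical constants are finite here, consistent with interpolation forcing the stochastic gradients to vanish wherever the true gradient does.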
Finally, we complete our statement of growth conditions with weak growth, which was recently proposed by Vaswani et al. (2019a) as a further relaxation of strong growth.

Definition 8 (Weak Growth Condition). Let f be an L-smooth function and O an SFO. The pair (f, O) satisfies weak growth with parameter α if

E_{zk}[‖∇f(w, zk)‖²] ≤ αL(f(w) − f(w∗)),

holds for all k ≥ 0 and w ∈ Rd.

If f is µ-PL, then the weak growth condition implies that strong growth holds with parameter ρ ≤ αL/µ (Vaswani et al., 2019a, Proposition 1). In fact, a converse relation holds if f is Lipschitz-smooth, as we now show.

Lemma 5. Let (f, O) be a function-oracle pair satisfying the strong growth condition with parameter ρ. If f is L-smooth, then (f, O) satisfies weak growth with parameter α ≤ 2ρ.

See Section A.3 for proof.

We now give the exact relationships between the strong/weak growth conditions and interpolation. As mentioned above, maximal strong growth immediately implies stationary-point interpolation. Strong growth with w = w∗ gives

E_k[‖∇f(w∗, zk)‖²] = 0,

for all k, after which non-negativity of the Euclidean norm guarantees ∇f(w∗, zk) = 0 almost-surely and stationary-point interpolation holds. Replicating this last argument under weak growth shows that the weak growth condition implies mixed interpolation. The reverse implications are more involved; we establish them formally in the following lemmas.

Lemma 6 (Interpolation and Weak Growth). Let f be an L-smooth function and O an Lmax individually-smooth SFO. If (f, O) satisfies minimizer interpolation, then the pair also satisfies the weak growth condition with parameter α ≤ Lmax/L.

See Section A.3 for proof.

Lemma 6 is similar to the sufficient conditions for weak growth derived by Vaswani et al. (2019a), who additionally require f to be convex and finite-sum. Thus, our result applies to a much larger class of functions and SFOs than previous work.
We can derive similarly relaxed sufficient conditions for strong growth using the relationship between the strong and weak growth parameters. The following lemma does exactly this.

Lemma 7 (Interpolation and Strong Growth). Let f be an L-smooth, µ-PL function and O an Lmax individually-smooth SFO. If (f, O) satisfies minimizer interpolation, then the pair also satisfies the strong growth condition with parameter ρ ≤ Lmax/µ.

See Section A.3 for proof.

Table 2.1: Relationship between minimizer interpolation and the parameters of the weak and strong growth conditions. Weak growth is guaranteed to hold when (f, O) satisfies minimizer interpolation and O is Lmax individually-smooth. Strong growth holds if f additionally satisfies the PL condition, which is implied by strong-convexity.

Assumptions            | Weak Growth  | Strong Growth
Ind. Smooth            | α ≤ Lmax/L   | —
µ-PL + Ind. Smooth     | α ≤ Lmax/L   | ρ ≤ Lmax/µ
Strong Growth          | α ≤ 2ρ       | ρ = ρ

Lemmas 6 and 7 show that minimizer interpolation implies global regularity of the stochastic gradients when O is individually-smooth. As a by-product, we obtain worst-case bounds on the strong and weak growth parameters and demonstrate the "automatic variance reduction" property (Liu and Belkin, 2020), which guarantees E_k[‖∇f(wk, zk)‖²] → 0 if wk → w∗ in the interpolation setting. Table 2.1 summarizes the relationships between minimizer interpolation and the weak/strong growth conditions.

2.4 Conclusions

We have now completed our goal of formalizing interpolation for general stochastic optimization problems. To do this, we leveraged the concept of an SFO, which provides a mechanism for optimization algorithms to query (unbiased) stochastic function and gradient values at each iteration. Moreover, we also showed that a class of SFOs satisfying a natural Lipschitz-smoothness property, called individual smoothness, satisfies the weak/strong growth conditions when minimizer interpolation holds (Lemmas 6 and 7).
These oracles are well-behaved globally despite the inherently local nature of minimizer interpolation. As we shall see in the coming chapters, such global regularity is sufficient for the convergence of constant step-size SGD, as well as for effective line-search techniques and acceleration.

Chapter 3

Stochastic Gradient Descent

The discussion in the previous chapter formalized interpolation for general stochastic optimization problems and derived connections between interpolation and the weak/strong growth conditions. Now we turn to the main interest of this work: the complexity of iterative algorithms for (f, O) when interpolation is satisfied. This chapter analyzes the convergence of SGD for strongly-convex and convex functions, while the next two chapters tackle SGD with the Armijo line-search (Chapter 4) and stochastic acceleration (Chapter 5). In particular, this chapter establishes the following non-asymptotic results for SGD with a fixed step-size:

1. Linear convergence in expectation for strongly-convex f when (f, O) satisfies strong growth; this rate is tight with the best-known deterministic rates when ρ = 1.

2. Almost-sure linear convergence for strongly-convex f and individually-smooth and strongly-convex SFO O.

3. Sub-linear convergence for convex f when (f, O) satisfies weak growth; our proof is simpler than existing analyses and permits a larger step-size.

4. Faster sub-linear convergence for convex f when (f, O) satisfies weak growth and O is individually-smooth; this rate is tight with the deterministic case when α = 1.

Section 3.3 at the end of this chapter leaves the finite-time regime and considers asymptotic, almost-sure convergence of SGD with a fixed step-size under strong and weak growth, respectively. The following is proved:

Stochastic Gradient Descent

0. Choose an initial point w0 ∈ Rd.
1. For each iteration k ≥ 0:
   (a) Query O for ∇f(wk, zk).
   (b) Update the input as wk+1 = wk − η∇f(wk, zk).

Figure 3.1: Stochastic gradient descent with a fixed step-size η.
Note that only one query to the stochastic oracle is needed per iteration.

1. Almost-sure convergence to a stationary point when f is a general non-convex function and (f, O) satisfies strong growth.

2. Almost-sure convergence to a global minimum when f is convex and (f, O) satisfies weak growth.

This last result is particularly interesting because it concerns convergence of the last iterate generated by SGD; we shall see that such results are not straightforward in the non-asymptotic regime, where convergence is instead shown for an averaged iterate.

Now let us briefly introduce SGD with a fixed step-size before diving into the analysis. The basic procedure is given in Figure 3.1; the key components of the algorithm are (i) the use of a stochastic gradient ∇f(wk, zk) queried from the oracle at every iteration, (ii) the fixed step-size η > 0, and (iii) the sequence of inputs (wk) generated by the algorithm, which are called the iterates. The non-asymptotic rates in this chapter will be derived by analyzing the sequence of distances to a minimizer, (‖wk − w∗‖²). The minimizer w∗ is unique for strongly-convex functions and we show wk → w∗ in the iteration limit. In contrast, we only establish f(w̄K) → f(w∗) for convex functions, where w̄K = (1/K) Σ_{k=0}^{K−1} wk. Chapter 4 discusses these proof techniques in greater detail, with specific reference to the case where η is itself a random variable that depends on ∇f(wk, zk). We will see that this introduces significant challenges compared to SGD with a fixed and deterministic step-size.

3.1 Convergence for Strongly-Convex Functions

The analysis of fixed step-size SGD for strongly-convex functions is divided into two sub-sections based on the properties of O. First, we consider general SFOs satisfying strong growth. Then, the setting is restricted to individually-smooth and convex SFOs, where we show that existing convergence results can be slightly improved.
Finally, we analyze the case when O is both individually-smooth and individually-strongly-convex. This setting degenerates to a simple deterministic optimization problem if ∇f(·, z) is directly accessible to the optimizer, which is true for mini-batch oracles.

3.1.1 General Oracles

We first establish the convergence rate of SGD for strongly-convex f when (f, O) satisfies the strong growth condition. Recall from Lemma 7 that this is more general than assuming O is Lmax individually-smooth and (f, O) satisfies minimizer interpolation. Furthermore, in the case that individual smoothness and minimizer interpolation do hold, we are guaranteed ρ ≤ Lmax/µ. This value should be kept in mind, as it informs the worst-case rate that can be obtained in the following theorem.

Theorem 1. Let f be a µ-strongly-convex, L-smooth function and O an SFO such that (f, O) satisfies the strong growth condition with parameter ρ. Then stochastic gradient descent with fixed step-size η ≤ 2/(ρ(µ + L)) converges as

E[‖wK − w∗‖²] ≤ (1 − 2ηµL/(µ + L))^K ‖w0 − w∗‖².

See Section B.1 for proof.

We compare this convergence rate to the original result given by Schmidt and Le Roux (2013, Section 6) in Table 3.1. Our analysis allows for a larger step-size and establishes asymptotically faster convergence. The improvement is most significant for ill-conditioned problems, where µ ≪ L implies 2L/(µ + L) ≈ 2. Finally, this result is tight in the sense that when ρ = 1, which holds in the deterministic setting, it recovers the best known convergence rate for gradient descent on strongly-convex functions (Bubeck, 2015, Theorem 3.12).

When (f, O) satisfies minimizer interpolation and O is individually-smooth, the complexity given by Theorem 1 can be worse than that achieved under the weak growth condition. To see this, consider when the worst-case values ρ = Lmax/µ and α = Lmax/L are attained in the interpolation setting. If η = 2/(ρ(µ + L)) = 2µ/(Lmax(µ + L)), then Theorem 1 guarantees

E[‖wK − w∗‖²] ≤ (1 − 4µ²L/(Lmax(µ + L)²))^K ‖w0 − w∗‖².

In contrast, Vaswani et al.
(2019a, Theorem 5) show

E[‖wK − w∗‖²] ≤ (1 − µ/Lmax)^K ‖w0 − w∗‖²,

with a bigger step-size of η = 1/Lmax. Noting that 4µL/(µ + L)² ≤ 1, with equality when µ = L, establishes that the rate given in this work is slower.

The discrepancy in convergence rates for smooth, interpolating oracles emerges from the use of smoothness and strong-convexity in the proof of Theorem 1. The worst-case value for ρ is proved using Lmax-smoothness of f(·, zk) and µ-strong-convexity of f (see Lemma 7). Thus, bounding E_{zk}[‖∇f(wk, zk)‖²] using strong growth and then following the typical deterministic proof strategy equates to using smoothness and strong-convexity twice, and leads to an unnecessarily loose bound.

The above conclusion only holds when (f, O) satisfies interpolation and O is individually-smooth. The two convergence speeds cannot be directly compared when O is more general because of the cyclic relationship between the strong/weak growth parameters. Recall from Lemma 5 that strong growth implies weak growth with constant α ≤ 2ρ, while weak growth implies strong growth with constant ρ ≤ αL/µ. As a result, no order can be established on α and ρ, and it is not clear which rate is superior.

3.1.2 Individually Smooth and Convex Oracles

It is possible to improve upon the result from Vaswani et al. (2019a) when O further satisfies individual convexity. In particular, we are able to obtain the same convergence speed with a looser bound on the step-size. A case analysis in Section B.4 shows that the step-size permitted by Vaswani et al. (2019a, Theorem 5) is η ≤ (µ + L)/(αL²), which is (µ + L)/(L Lmax) in the worst case. If µ < L, then (µ + L)/(L Lmax) < 2/Lmax, and the following theorem improves upon the original result.

Table 3.1: Comparison of iteration complexities of fixed step-size SGD for µ-strongly-convex f under the strong growth condition. For each result, we report the complexity obtained with the optimal step-size according to that analysis. Recall from Theorem 3 that δmin = min_{z∈Z} µzLz/(µz + Lz). The first row shows that our results for general O are tighter than SLR (Schmidt and Le Roux, 2013) because they allow for larger step-sizes.

Assumptions           | Max. Step-Size (Ours) | Max. Step-Size (SLR) | Best Rate (Ours)                   | Best Rate (SLR)
—                     | η ≤ 2/(ρ(µ + L))      | η ≤ 2/(ρL)           | O((ρ(µ + L)²/4µL) log(1/ε))        | O((ρL/µ) log(1/ε))
Ind. Smooth & Convex  | η < 2/Lmax            | N/A                  | O((Lmax/µ) log(1/ε))               | N/A
Ind. Smooth & µz-SC   | η ≤ 2/(µmax + Lmax)   | N/A                  | O(((µmax + Lmax)/4δmin) log(1/ε))  | N/A

Theorem 2. Let f be a µ-strongly-convex, L-smooth function and O an Lmax individually-smooth and convex SFO such that (f, O) satisfies minimizer interpolation. Then stochastic gradient descent with fixed step-size η < 2/Lmax converges as

E[‖wK − w∗‖²] ≤ (1 − µη(2 − ηLmax))^K ‖w0 − w∗‖².

See Section B.1 for proof.

A key feature of Theorem 2 is that it requires only the full function f to be strongly-convex; the stochastic functions f(·, z) may be merely convex, as is typically the case in the finite-sum setting. To illustrate this, consider the linear regression problem

min_{w∈Rd} Σ_{i=0}^n (⟨w, xi⟩ − yi)²,

which satisfies interpolation if y ∈ span({xi}_{i=0}^n). The objective f is µ-strongly-convex if n ≥ d and the data matrix is full-rank, while the individual functions fi(w) = (⟨w, xi⟩ − yi)² are not strongly-convex unless d = 1.[3] Learning problems of this form occur in non-parametric kernel regression, such as when using radial basis functions with a small bandwidth (Hastie et al., 2009). However, in the unlikely case that O is also individually-strongly-convex, we can show that SGD will converge almost surely, as established in the following theorem.

Theorem 3.
Let f be a µ-strongly-convex, L-smooth function and O an Lmax individually-smooth and µmax individually-strongly-convex SFO such that (f, O) satisfies minimizer interpolation. Then stochastic gradient descent with fixed step-size η ≤ 2/(µmax + Lmax) converges almost surely at the rate

‖wK − w∗‖² ≤ (1 − 2η δmin)^K ‖w0 − w∗‖²,

where δmin = min_{z∈Z} µzLz/(µz + Lz).

See Section B.1 for proof.

Theorem 3 is not surprising; strongly-convex functions have only one minimizer, meaning that gradient descent on a single stochastic function f(·, zk) is sufficient to recover the global solution of the optimization problem. Choosing the best-conditioned sub-function yields convergence to w∗ as O(exp{−2η δmax K}), where δmax = max_{z∈Z} µzLz/(µz + Lz) (Bubeck, 2015). Notice that we have exchanged worst-case performance for best-case (δmin vs. δmax) by optimizing f(·, z) directly. SGD is clearly sub-optimal when O is individually-strongly-convex and the optimization procedure has direct access to f(·, z) for each z ∈ Z, such as in the finite-sum setting. This illustrates the dangers of assuming individual strong-convexity and interpolation hold simultaneously.

[3] Consider w and w′ = w + v, where v is orthogonal to xi, to see that the fi are not strongly-convex when d > 1.

3.2 Convergence for Convex Functions

Convex functions are significantly more interesting than strongly-convex functions when interpolation is satisfied. Minimizer interpolation for convex functions implies X∗ ⊆ X∗z: the optimal set for f is a subset of the optimal set for f(·, z). This condition intuitively feels milder than the requirement for a single shared optimal point w∗. Moreover, assuming individual convexity does not lead to degenerate optimization problems such as those seen in the previous section.
Convex functions also require a major shift in analysis; now we shall show concentration of the optimality gap f(w) − f(w∗), rather than shrinking distance to a specific minimizer, ‖w − w∗‖².

We first establish the convergence rate of SGD for convex functions under the weak growth condition. As before, it is strictly more general to analyze the complexity of SGD under weak growth than in the setting where O is individually-smooth and minimizer interpolation holds. Remember that Lemma 6 guarantees α ≤ Lmax/L. The following result improves on that given by Vaswani et al. (2019a) by constant factors and allows for a larger step-size (see Table 3.2). Moreover, the proof in Section B.2 is simpler and far shorter.

Theorem 4. Let f be a convex, L-smooth function and O an SFO such that (f, O) satisfies the weak growth condition with parameter α. Then stochastic gradient descent with fixed step-size η < 1/(αL) converges as

E[f(w̄K)] − f(w∗) ≤ ‖w0 − w∗‖² / (2η(1 − ηαL)K),

where w̄K = (1/K) Σ_{k=0}^{K−1} wk and w∗ = Π_{X∗}(w0).

In fact, an even larger step-size and faster convergence rate can be obtained when O is individually-smooth. We show this now.

Theorem 5. Let f be a convex, L-smooth function and O an SFO such that (f, O) satisfies the weak growth condition with parameter α. Moreover, suppose O is Lmax individually-smooth. Then stochastic gradient descent with fixed step-size η < 1/(αL) + 1/Lmax converges as

E[f(w̄K)] − f(w∗) ≤ ‖w0 − w∗‖² / (2η δ K),

where w̄K = (1/K) Σ_{k=0}^{K−1} wk, w∗ = Π_{X∗}(w0), and δ = min{1, 1 + αL(1/Lmax − η)}.

See Section B.2 for proof.

Theorem 5 is tight with the deterministic case in the following sense: if f(·, z) = f for each z ∈ Z, then α = 1, Lmax = L, and the rate given is comparable to the best known results in the deterministic setting (cf. Bubeck (2015, Theorem 3.3)). This result also further illustrates the benefits of directly assuming the weak growth condition, since the maximum step-size satisfies 1/(αL) + 1/Lmax ≥ 2/Lmax, where 2/Lmax is the condition obtained when deriving weak growth directly from individual smoothness and minimizer interpolation.

Table 3.2: Comparison of iteration complexities of fixed step-size SGD for convex f under the weak growth condition. For each result, we report the complexity obtained with the optimal step-size according to that analysis. Our results are tighter than VBS (Vaswani et al., 2019a) and allow for larger step-sizes.

Assumptions           | Max. Step-Size (Ours)  | Max. Step-Size (VBS) | Best Rate (Ours) | Best Rate (VBS)
—                     | η < 1/(αL)             | η < 1/(2αL)          | O(αL/ε)          | O(4(1 + α)L/ε)
Convex + Ind. Smooth  | η ≤ 1/(αL) + 1/Lmax    | N/A                  | O(Lmax/2ε)       | N/A

3.3 Almost Sure Convergence

Now we briefly change paradigms and consider the asymptotic behavior of SGD with a fixed step-size. Our goal in this section is to show the almost-sure (with probability 1) convergence of SGD to a minimizer or stationary point of f when (f, O) satisfies weak or strong growth, respectively. In the latter scenario, this means we want to show that the random variable lim_{k→∞} ‖∇f(wk)‖² exists and is almost-surely zero. We will need some tools from measure-theoretic probability to accomplish this.

To start, observe that each iterate wK can be written as a deterministic Borel function of the stochastic gradients {∇f(wk, zk)}_{k=0}^K by unrolling the SGD update. Formally, we also assume a probability space (Ω, F, P) in the background and say the sequence (wk) is adapted to the filtration generated by the stochastic gradients,

F_K = σ(⋃_{k=0}^{K−1} σ(∇f(wk, zk))).

The sequences of function and gradient values are functions of (wk) and so are also adapted to (Fk). See Çınlar (2011) for additional details on filtrations.

Our main tool for showing convergence of sequences of random variables will be supermartingale theory (Çınlar, 2011). Supermartingales are one of two classic tools used to analyze the convergence of SGD, the other being Lyapunov functions (Bertsekas and Tsitsiklis, 2000). In particular, recent authors have made use of the convergence of discrete-time, positive supermartingales (Bertsekas, 2011; Nguyen et al., 2018).
This theorem is due to Neveu (1975) and will be the cornerstone of our analyses; we reproduce it here for convenience.

Theorem 6 (Convergence of Positive Supermartingales). Let (Y_k), (X_k), and (A_k) be discrete, non-negative random processes indexed by k ∈ N and adapted to the filtration (F_k). Suppose that, almost surely,

∀k ∈ N:  E[Y_{k+1} | F_k] ≤ Y_k − X_k + A_k,  and  ∑_{k=0}^{∞} A_k < ∞.

Then the sequence (Y_k) converges almost surely to a non-negative random variable Y_∞, and ∑_{k=0}^{∞} X_k < ∞ almost surely.

With the necessary measure-theoretic background complete, we are now ready to study the almost-sure convergence of stochastic gradient descent.

Theorem 7. Let f be a convex, L-smooth function with at least one finite minimizer and O an L_max individually-smooth SFO such that (f, O) satisfies the weak growth condition with parameter α. Then the sequence (f(w_k)) generated by stochastic gradient descent with fixed step-size η < 1/(αL) + 1/L_max converges to the optimal function value f(w*) almost surely.

The proof is given in Section B.3 and follows a straightforward argument. First, we establish that the sequence of distances to an arbitrary finite minimizer, (‖w_k − w*‖²), satisfies the conditions of Theorem 6. As a by-product, we establish that ∑_{k=0}^{∞} (f(w_k) − f(w*)) is convergent (although w_k → w* does not necessarily hold almost surely) and then deduce

lim_{k→∞} f(w_k) = f(w*)  almost surely.

It is straightforward to prove an alternative version of Theorem 7 which does not require individual smoothness. In this case, convergence is established for η < 1/(αL) using the same progress condition as in Theorem 4, namely Equation B.1.

Theorem 7 holds for the final iterate generated by the SGD procedure. This should be contrasted with Theorems 4 and 5, which apply only to the averaged iterate w̄_K. While the existence of an asymptotic result suggests that a non-asymptotic, final-iterate convergence guarantee for constant step-size SGD under the weak growth condition is possible, we do not establish such a result in this work.
Convergence (or non-convergence) of the final iterate remains an interesting, and surprisingly simple to state, open problem in optimization under interpolation.

The final result of this chapter shows almost-sure convergence to a stationary point for general non-convex functions f such that (f, O) satisfies the strong growth condition. The proof is presented in Section B.3 and follows a similar structure to the proof of Theorem 7.

Theorem 8. Let f be an L-smooth function with at least one finite minimizer w* and O a SFO such that (f, O) satisfies the strong growth condition with parameter ρ. Then the sequence of gradient norms (‖∇f(w_k)‖²) generated by stochastic gradient descent with fixed step-size η < 2/(ρL) converges to 0 almost surely.

3.4 Conclusions

Theorem 8 ends our study of fixed step-size SGD for strongly-convex and convex minimization under the strong/weak growth conditions. Building on work by Schmidt and Le Roux (2013), we established a faster linear convergence rate for µ-strongly-convex functions when using a larger step-size that requires knowledge of µ; this result attains the deterministic rate when ρ = 1 (Bubeck, 2015). Unfortunately, the subsequent discussion showed that Theorem 1 can still be slower than the corresponding result of Vaswani et al. (2019a) when minimizer interpolation holds and the worst-case values of α and ρ are attained. Inspired by this discrepancy, we then leveraged additional properties of O, such as individual smoothness and convexity, to derive convergence rates that match those of Vaswani et al. (2019a) but permit a larger range of step-sizes (Theorem 2). We also proved that SGD converges almost-surely if O is individually-strongly-convex (Theorem 3). In this case, the optimization problem is degenerate, meaning it can be solved by directly minimizing any f(·, z).
Moreover, the conditioning of the optimization problem can be improved from worst-case to best-case if we have direct access to f(·, z) for all z.

The remainder of the chapter established improved convergence rates for convex functions (Theorems 4 and 5). These theorems improve over existing rates by constant factors. Individual smoothness of O again allowed us to show that SGD converges with a slightly wider range of larger step-sizes. Finally, we concluded with asymptotic convergence results for SGD applied to convex and non-convex functions (Theorems 7 and 8).

Chapter 4

Line Search

A major weakness of constant step-size SGD is that the problem constants must be known a priori in order to select a step-size for which convergence is guaranteed. Consider Theorem 1 as an example. A fast linear rate of convergence is guaranteed only when η ≤ 2/(ρ(µ + L)), but optimization speed is sub-optimal for η ≪ 2/(ρ(µ + L)). Meeting this requirement demands knowledge of the strong-convexity and smoothness constants, µ and L, as well as the strong growth parameter ρ. In practice, we are unlikely to know these coefficients or to possess straightforward means of estimating them. This leaves a dilemma: over-estimate the optimal step-size and risk divergence, or use a small step-size that yields slow convergence.

This chapter tackles the problem of step-size selection by augmenting SGD with a stochastic Armijo line-search that automatically finds a suitable step-size at each iteration. Unlike meta-optimization schemes such as grid search, random search, or Bayesian optimization, line-search techniques do not require multiple, costly executions of the optimization procedure (Snoek et al., 2012). Accordingly, practical line-search implementations are typically very fast in comparison to meta-optimization approaches (Nocedal and Wright, 1999).
Moreover, as we will show in the following sections, the convergence speed of SGD with a stochastic line-search is tight with that of SGD with the best fixed step-size when minimizer interpolation is satisfied.

The work in this chapter was originally conducted by the author as part of Vaswani et al. (2019b). The proposal to augment SGD with the stochastic Armijo line-search, the algorithmic procedure in Figure 4.2, Lemma 8, and Theorem 11 are reproduced from this work. In contrast, Theorems 9 and 10 are new, unpublished improvements over the originally published theorems.

4.1 Background

Line-search algorithms are a classic solution to the problem of step-size selection in the deterministic setting. The basic notion is to search along the gradient direction (or, more generally, a descent direction) for a step-size that satisfies a desirable property. A common rule is the Armijo condition,

f(w_k + η_k v_k) ≤ f(w_k) + c · η_k 〈∇f(w_k), v_k〉,    (4.1)

which demands sufficient decrease in the function value when taking a step along the search direction v_k. The vector v_k is required to satisfy 〈∇f(w_k), v_k〉 < 0, in which case it is called a descent direction. The (locally) steepest descent direction is the negative gradient v_k = −∇f(w_k) (Nesterov, 2004). In comparison, the negative stochastic gradient direction −∇f(w_k, z_k) used by SGD is only guaranteed to be a descent direction in expectation.

A simple mechanism for enforcing the Armijo condition is to iteratively decrease the step-size, or backtrack, from a maximal step-size η_max until the condition is satisfied. Lipschitz-smoothness of f guarantees that for every w there exists a sufficiently small η_k such that Equation 4.1 holds (Nocedal and Wright, 1999); we will establish a specific case of this for SGD in the stochastic setting (Lemma 8). Practical implementations of backtracking reduce the step-size as η_k ← β · η_k when the Armijo condition does not hold, where β ∈ (0, 1) is a tunable parameter.
Alternatively, cubic or other interpolation schemes can be used to model and approximately minimize the one-dimensional function g(η_k) = f(w_k + η_k v_k); see Nocedal and Wright (1999) for more details. In this work we assume that backtracking can be accomplished exactly, meaning that η_k is the largest step-size less than η_max for which Equation 4.1 is satisfied.

The Armijo line-search as given above is not suitable for stochastic optimization problems. Remember that the optimizer does not have access to exact function and gradient evaluations — it can only access the noisy estimates f(w, z_k) and ∇f(w, z_k). Even when exact evaluations are available, as in the finite-sum setting, the cost of computing function and gradient values is much larger than the cost of querying O. An alternative to the full Armijo line-search is the following stochastic version of the Armijo condition:

f(w_k + η_k v_k, z_k) ≤ f(w_k, z_k) + c · η_k 〈∇f(w_k, z_k), v_k〉.    (4.2)

This condition is identical to Equation 4.1, but uses stochastic function and gradient evaluations queried from the SFO. Intuitively, the stochastic Armijo condition requires sufficient decrease on f(·, z_k), rather than f.

Figure 4.1: Progress when using the stochastic Armijo line-search with and without minimizer interpolation. The function f_{v_k} is the restriction of f to the descent direction v_k = −∇f(w_k), while ℓ_{v_k} is the Armijo condition as a function of the step-size, η. The dashed red line shows the restriction of the stochastic function f(w, z) to v_k. The next iterate w_{k+1} is obtained by choosing the largest step-size satisfying the line-search. When interpolation is not satisfied, we "overshoot" significantly and f(w_{k+1}) > f(w_k) as a result. In contrast, minimizer interpolation ensures that the iteration makes progress towards w*. Note that f(w_k, z) = f(w_k) for convenience only; this need not hold in general.
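Using the convention v_k = −∇f(w_k, z_k), the stochastic condition (4.2) can be combined with the β-backtracking heuristic described above. The sketch below is illustrative only: the interpolating least-squares data and the constants c, β, and η_max are arbitrary demo choices, not values prescribed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 8
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
targets = X @ w_star               # exact fit exists: minimizer interpolation holds

def f_z(w, i):                     # stochastic function value f(w, z_i)
    return 0.5 * (X[i] @ w - targets[i]) ** 2

def grad_z(w, i):                  # stochastic gradient grad f(w, z_i)
    return X[i] * (X[i] @ w - targets[i])

def sgd_armijo(w0, c=0.5, eta_max=10.0, beta=0.7, iters=3000):
    w = w0.copy()
    for _ in range(iters):
        i = rng.integers(n)        # query the oracle
        g = grad_z(w, i)
        g_norm2 = g @ g
        if g_norm2 == 0.0:         # already optimal on this sample
            continue
        eta = eta_max
        # backtrack until the stochastic Armijo condition (4.2) holds;
        # f(., z_i) is re-evaluated once per backtracking step
        while f_z(w - eta * g, i) > f_z(w, i) - c * eta * g_norm2:
            eta *= beta
        w -= eta * g
    return w

w = sgd_armijo(np.zeros(d))
```

Because each iteration reuses the same sample z_i for every trial step-size, the accepted η_k is correlated with the stochastic gradient; minimizer interpolation is what makes this per-sample progress informative about f itself.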
Note that practical backtracking on Equation 4.2 will require f(·, z_k) to be evaluated multiple times in the same iteration. This is straightforward in the finite-sum setting, but may be impossible when z_k reflects an inherently noisy measurement process.

The main issue with stochastic line-search conditions is that the oracle queries may not be representative of the true function. That is, progress as measured by f(·, z_k) may be uninformative or even lead to ascent in f. Figure 4.1 illustrates such a situation, as well as why minimizer interpolation may preclude this problem. Intuitively, progress on the stochastic functions f(·, z_k) should be sufficient for progress on f when X* ⊆ ∩_{z∈Z} X*_z — i.e. when (f, O) satisfies minimizer interpolation.

SGD with Armijo Line-Search
0. Choose an initial point w_0 ∈ R^d.
1. For each iteration k ≥ 0:
   (a) Query O for f(w_k, z_k) and ∇f(w_k, z_k).
   (b) Set η_k = η_max and backtrack until
       f(w_k − η_k ∇f(w_k, z_k), z_k) ≤ f(w_k, z_k) − c · η_k ‖∇f(w_k, z_k)‖².
   (c) Update the iterate as
       w_{k+1} = w_k − η_k ∇f(w_k, z_k).

Figure 4.2: Stochastic gradient descent with the stochastic Armijo line-search, as proposed by Vaswani et al. (2019b). We assume that the backtracking procedure can be evaluated exactly, meaning the maximal step-size η_k satisfying the Armijo condition is selected. Note that O must be queried for additional function values f(w_{k+1}, z_k) during backtracking.

The remainder of this chapter formalizes this intuition and develops non-asymptotic convergence rates for SGD with the stochastic Armijo line-search. In particular, we establish the following results under minimizer interpolation:

1. Linear convergence for strongly-convex f and L_max individually-smooth and convex O; moreover, this rate is tight with fixed step-size SGD.
2. Sub-linear convergence for convex f and L_max individually-smooth and convex O; again, this rate is tight with the fixed step-size case.
3. Sub-linear convergence to a stationary point for non-convex f when (f, O) satisfies strong growth.
Note that this last result is reproduced from Vaswani et al. (2019b).

Now, let us introduce a formal definition of SGD with the Armijo line-search before proceeding with our analysis. Figure 4.2 presents the basic procedure of the algorithm.

Figure 4.3: Graphical depiction of Lemma 8. The function f_{v_k} is the restriction of f to the descent direction v_k = −∇f(w_k), while ℓ_{v_k} is the Armijo condition as a function of the step-size, η. The dashed red line shows the upper bound on f_{v_k} due to Lipschitz-smoothness. We can see that a large range of step-sizes makes sufficient progress (yellow region) and will be accepted by the Armijo condition; step-sizes for which the graph of ℓ_{v_k} lies below f_{v_k} (red region) do not make enough progress and will be rejected. Smoothness guarantees that ℓ_{v_k} is above f_{v_k} for a non-empty interval of step-sizes (green region), which will always be accepted.

There are several key differences from fixed step-size SGD, which are as follows: (i) the step-size η_k is defined per-iteration and initialized at η_max; (ii) η_k is then chosen by exact backtracking on Equation 4.2 evaluated at v_k = −∇f(w_k, z_k); and (iii) the backtracking procedure implicitly requires additional queries to O for f(w_k − η_k ∇f(w_k, z_k), z_k). While Figure 4.2 uses an idealized backtracking procedure, in practice the hyper-parameter β may need to be tuned to avoid excessive oracle queries.

4.2 Convergence for Strongly-Convex Functions

The first step of our analysis is to control the step-size η_k using smoothness and the line-search condition. Lower-bounding the step-size relies on smoothness of f(·, z_k), which is why the results in this chapter require O to be individually-smooth. This fact also motivates our transition from assuming the strong/weak growth conditions to directly reasoning with individually-smooth oracles where (f, O) satisfies minimizer interpolation. The following lemma is reproduced from Vaswani et al.
(2019b); the proof in Appendix C is modified from the original for greater clarity.

Lemma 8. Let f be an L-smooth function and O an L_max individually-smooth SFO such that (f, O) satisfies minimizer interpolation. Then the maximum possible step-size returned by the stochastic Armijo line-search constrained to lie in the (0, η_max] range satisfies the following inequalities:

min{ 2(1 − c)/L_max, η_max } ≤ η_k ≤ ( f(w_k, z_k) − f(w*, z_k) ) / ( c ‖∇f(w_k, z_k)‖² ).

Lemma 8 is the main tool needed to analyze SGD with the Armijo line-search in the convex setting. Specifically, it allows us to establish the following progress condition, which will be at the core of our proofs for convex and strongly-convex functions.

Lemma 9. Let f be a convex, L-smooth function and O an L_max individually-smooth and convex SFO such that (f, O) satisfies minimizer interpolation. Then stochastic gradient descent using the Armijo line-search with c ≥ 1/2 satisfies the following inequality:

f(w_k) − f(w*) ≤ (1/2) max{ L_max/(2(1 − c)), 1/η_max } ( ‖w_k − w*‖² − E_{z_k}[ ‖w_{k+1} − w*‖² ] ),

where w* ∈ X* is arbitrary. See Appendix C for proof.

Lemma 9 is the ideal inequality for (a) applying µ-strong-convexity to obtain a recursion for ‖w_k − w*‖² and a linear convergence rate, or (b) summing over iterations to obtain sub-linear convergence of the averaged iterate w̄_K. Indeed, these are exactly the strategies used to prove Theorems 9 and 10. We start with the strongly-convex case, with the proof of the following theorem given in Section C.1.

Theorem 9. Let f be a µ-strongly-convex, L-smooth function and O an L_max individually-smooth and convex SFO such that (f, O) satisfies minimizer interpolation. Then stochastic gradient descent using the Armijo line-search with c ≥ 1/2 converges as

E[ ‖w_K − w*‖² ] ≤ ( 1 − µ min{ 2(1 − c)/L_max, η_max } )^K ‖w_0 − w*‖².

Setting c = 1/2 and η_max = ∞ in Theorem 9 gives

E[ ‖w_K − w*‖² ] ≤ ( 1 − µ/L_max )^K ‖w_0 − w*‖²,

which is exactly the rate obtained for fixed step-size SGD in the same setting (cf. Theorem 2).
The convergence speed is also the same as the worst case under the weak growth condition when interpolation and individual smoothness hold (Vaswani et al., 2019a, Theorem 5). While this latter result does not use individual convexity, we (intuitively) need this additional assumption when using a line-search in order for the step-size to yield reliable progress towards the global minimizer. Indeed, a general non-convex f(·, z_k) can satisfy minimizer interpolation and still have stochastic gradients ∇f(·, z_k) which point away from the minimizer w*. This problem is presented formally in Section 4.4.1, which discusses the challenges in establishing convergence for non-convex functions.

The c ≥ 1/2 requirement in Theorem 9 is notable because it is the opposite of the constraint on c required for Newton or quasi-Newton methods (cf. Nocedal and Wright (1999, Theorem 3.6)). The key issue here is that the Armijo line-search with c > 1/2 can exclude unit step-lengths (particularly for quadratic functions), which must be accepted in some neighborhood of w* for Newton and quasi-Newton methods to obtain local super-linear convergence. Unit step-sizes are not required for SGD, where the stochastic Armijo line-search is mainly used to obtain sufficient progress at each iteration. Moreover, if O is individually-smooth and c ∈ (0, 1), then Lemma 8 guarantees SGD will satisfy Equation 4.2 for some η_k > 0. As such, larger c values — which demand greater decrease at each iteration — arise naturally in our analysis.

Table 4.1 contrasts Theorem 9 with the other known result for SGD with the Armijo line-search on strongly-convex functions (Vaswani et al., 2019b, Theorem 1). In particular, note that the convergence rate presented here depends on the strong-convexity constant of the overall function, µ, instead of the expected constant µ̄ = E_{z_k}[µ_{z_k}]. This is a nice theoretical improvement for the following reason: our analysis permits all stochastic functions to be merely convex, while that of Vaswani et al.
(2019b) requires f(·, z) to be strongly-convex for at least one z ∈ Z. Such a condition is unlikely to hold in practice and can lead to degenerate problems (see the discussion in Section 3.1.2). Theorem 9 is motivated by the work of Loizou et al. (2020), who previously obtained a similar improvement in the context of the stochastic Polyak step-size.

4.3 Convergence for Convex Functions

Now we derive a sub-linear convergence rate for SGD with the stochastic Armijo line-search when f is convex, O is individually-smooth, and minimizer interpolation is satisfied. Convergence is established for the averaged iterate w̄_K, similarly to the case of constant step-size SGD; non-asymptotic, final-iterate rates are also an open problem when η_k is chosen by line-search. The proof of the following theorem follows almost immediately from Lemma 9, as briefly described in the previous section; it can be found in Section C.2.

Theorem 10. Let f be a convex, L-smooth function and O an L_max individually-smooth and convex SFO such that (f, O) satisfies minimizer interpolation. Then stochastic gradient descent using the Armijo line-search with c ≥ 1/2 converges as

E[f(w̄_K)] − f(w*) ≤ (1/(2K)) max{ L_max/(2(1 − c)), 1/η_max } ‖w_0 − w*‖²,

where w̄_K = (1/K) ∑_{k=0}^{K−1} w_k and w* = Π_{X*}(w_0).

This result is tight in the following sense: when c = 1/2 and η_max = ∞, Theorem 10 yields

E[f(w̄_K)] − f(w*) ≤ (L_max/(2K)) ‖w_0 − w*‖².

This rate is identical to that for constant step-size SGD with η = 1/L_max if we assume the worst-case value for the weak-growth parameter α in Theorem 4. If f(·, z) = f for each z ∈ Z, then L_max = L and this rate is comparable to the best known convergence results for gradient descent on convex functions (Bubeck, 2015).

Table 4.1 compares Theorem 10 with the original result from Vaswani et al. (2019b, Theorem 2). The sub-linear rate established here is faster by constant factors and has a simpler proof. Furthermore, our result holds for all c ≥ 1/2, while c > 1/2 must hold strictly for their theorem.
The unnecessary strictness of the original result prevents the natural case where c = 1/2 and v_k = −∇f(w_k, z_k), in which Equation 4.2 becomes equivalent to the quadratic upper bound implied by Lipschitz-smoothness.

  Assumptions   Minimum c (Ours / VML+)   Convergence rate (Ours / VML+)
  µ-SC          c ≥ 1/2  /  c ≥ 1/2       O((L_max/µ) log(1/ε))  /  O((L_max/µ̄) log(1/ε))
  Convex        c ≥ 1/2  /  c > 1/2       O(L_max/(2ε))  /  O(3L_max/ε)

Table 4.1: Comparison of iteration complexities for SGD with the stochastic Armijo line-search. We omit results for non-convex functions, which are identical between the two works. Results are shown for η_max = ∞ and c = 1/2, excepting the convex case of VML+ (Vaswani et al., 2019b), where c = 2/3 is used as suggested by the authors. The constant µ̄ = min_k { E_{z_k}[µ_{z_k}] } requires that at least one stochastic function in the support of z_k be strongly-convex for each iteration k; in contrast, our strongly-convex proof relies only on µ, the parameter of the true function.

4.4 Convergence for Non-Convex Functions

The complexity of SGD with the stochastic Armijo line-search for general non-convex functions is particularly challenging to determine. As discussed below, proofs which analyze the sequence of distances to a minimizer, (‖w_k − w*‖²), fail without convexity. This includes the technique used to prove Theorems 9 and 10. Instead, convergence to a stationary point is typically established through the quadratic majorant provided by L-smoothness of f. The following theorem shows that such convergence does indeed hold, but mandates a severe upper bound on the line-search. A major weakness of our result is that the setting of η_max requires explicit knowledge of ρ and L_max.

Theorem 11. Let f be an L-smooth function and O an L_max individually-smooth SFO such that (f, O) satisfies the strong growth condition with parameter ρ.
Then stochastic gradient descent using the Armijo line-search with c > 1 − L_max/(ρL) and η_max < 2/(ρL) converges as

min_{k∈[K]} ‖∇f(w_k)‖² ≤ ( f(w_0) − f(w*) ) / (δK),

where δ = ( η_max + 2(1 − c)/L_max ) − ρ( η_max − 2(1 − c)/L_max + L η_max² ).

See Section C.3 for proof.

The step-size constraint η_k ≤ η_max < 2/(ρL) forces the Armijo line-search to behave like constant step-size SGD. Indeed, Lemma 14 shows that η < 2/(ρL) is exactly the condition required for constant step-size SGD to make guaranteed progress on an L-smooth function such that (f, O) satisfies strong growth. It is not clear if the upper bound on η_max is a fundamental property of SGD for non-convex functions, or an artifact of the proof of Theorem 11. In either case, it is worthwhile to see how the requirement on η_max emerges.

4.4.1 Challenges in the Analysis

A main object in our proofs so far has been the inner product η_k 〈∇f(w_k, z_k), w_k − w*〉. When η_k is independent of z_k, linearity of expectation gives the following:

E_{z_k}[ η_k 〈∇f(w_k, z_k), w_k − w*〉 ] = η_k 〈∇f(w_k), w_k − w*〉.

It is now straightforward to use convexity of f to control this term, which is how the proofs in Chapter 3 proceed. However, the stochastic Armijo line-search yields step-sizes which are correlated with ∇f(w_k, z_k) and thus are not independent of z_k. A straightforward solution is to use individual convexity and minimizer interpolation to obtain

η_k 〈∇f(w_k, z_k), w_k − w*〉 ≥ η_k ( f(w_k, z_k) − f(w*, z_k) ) ≥ 0.

Lemma 8 then provides the necessary tool to lower-bound η_k, which is how the proofs in this chapter proceed. The major challenge with non-convex functions is now apparent: the inner product is not guaranteed to be non-negative⁴ and the proof cannot proceed by bounding η_k. Intuitively, disentangling η_k and ∇f(w_k, z_k) becomes the key to establishing convergence in the non-convex setting.

The successful convergence proof for non-convex functions starts from L-smoothness of f, rather than "going through iterates" as do the proofs of Theorems 9 and 10.
Again, a correlated inner product η_k 〈∇f(w_k), ∇f(w_k, z_k)〉 is encountered; the main innovation is to expand it as

−2η_k 〈∇f(w_k), ∇f(w_k, z_k)〉 = η_k ( ‖∇f(w_k, z_k) − ∇f(w_k)‖² − ‖∇f(w_k)‖² − ‖∇f(w_k, z_k)‖² )
    ≤ η_max ‖∇f(w_k, z_k) − ∇f(w_k)‖² − η_min ( ‖∇f(w_k)‖² + ‖∇f(w_k, z_k)‖² ),

where the inequality stems from separately bounding η_k on the positive and negative terms. This bound is worst-case, but it separates the step-size and the stochastic gradient, allowing the proof to proceed. As a side-effect, we require a tight upper bound on the maximum step-size: η_max < 2/(ρL).

⁴ Non-negativity of 〈∇f(w), w − w*〉 is sometimes called monotonicity of the gradient and is equivalent to convexity if it holds for all w (Bubeck, 2015).

4.5 Conclusions

The main focus of this chapter was augmenting SGD with an automatic mechanism for selecting η, the step-size parameter. In particular, we sought to develop a tuning-free approach that obtains convergence speed comparable to SGD with the best fixed step-size (as established in Chapter 3) without restarts or a meta-optimization procedure. This was achieved by combining SGD with the stochastic Armijo line-search, as proposed by Vaswani et al. (2019b). We then developed new convergence results for strongly-convex (Theorem 9) and convex (Theorem 10) functions that improve over those given by Vaswani et al. (2019b) by constant factors. These rates show that SGD with the stochastic Armijo line-search obtains a convergence rate for (strongly-) convex functions that is tight with fixed step-size SGD, despite requiring no knowledge of the problem constants. Finally, we considered general non-convex functions and showed that SGD with the stochastic Armijo line-search converges to a stationary point if an aggressive upper bound on the step-size is enforced (Theorem 11).

Chapter 5

Acceleration

The focus in Chapters 3 and 4 was on relatively simple first-order methods: SGD with a fixed step-size and with a stochastic line-search, respectively.
Now, we move on to Nesterov's famous accelerated gradient descent (AGD) algorithm (Nesterov, 2004), which guarantees faster convergence in the deterministic setting. First, a small amount of background on accelerated methods is developed, with a particular focus on why stochastic acceleration is an interesting question for oracles satisfying interpolation. Then, the following convergence results are established for stochastic accelerated gradient descent (SAGD):

1. O( √(ρL/µ) log(1/ε) ) iteration complexity for strongly-convex f when (f, O) satisfies strong growth; this rate is tight with the deterministic analysis when ρ = 1.
2. O( √(ρL/ε) ) iteration complexity for convex f when (f, O) satisfies strong growth; again, this rate is tight with AGD when ρ = 1.

These convergence rates are tighter than existing work by a factor of √ρ (Vaswani et al., 2019a, Theorems 1-2), which we show is comparable to the speed-up obtained by AGD over gradient descent in the deterministic setting. The chapter ends with a brief discussion of acceleration under the weak growth condition, which is an open question.

5.1 Background

Recall that Theorems 1 and 4 establish fast linear and sub-linear convergence rates for SGD that are comparable with analyses for deterministic problems (e.g. gradient descent). Unfortunately, the classic work by Nemirovsky and Yudin (1983) shows that the convergence of gradient descent is not tight with lower bounds for convex, Lipschitz-smooth functions with deterministic oracles. Subsequent developments by Nesterov (Nemirovsky and Nesterov, 1985; Nesterov, 1983, 1988) defined a series of first-order methods (terminating with AGD) achieving the optimal O(1/K²) convergence in the setting analyzed by Nemirovsky and Yudin.
Algorithms achieving this rate are called accelerated, and subsequent research has generated a large number of accelerated algorithms for deterministic problems. The acceleration literature firmly places gradient descent as a sub-optimal algorithm and also casts the analyses in Chapter 3 in a new light. In particular, the obvious parallels between SGD under interpolation and deterministic gradient descent hint that SAGD may be a faster algorithm than SGD in restricted settings.

In fact, restricting the SFO is necessary for any hope of proving an accelerated rate for SAGD. Black-box accelerated methods which do not leverage structural information about f rapidly accumulate errors when used with general stochastic oracles O (Devolder et al., 2014; Schmidt et al., 2011). Such error accumulation prevents acceleration, as shown by tight Ω(ε⁻²) and Ω(ε⁻¹) lower bounds for the iteration complexity of convex and strongly-convex minimization with general SFOs, respectively (Agarwal et al., 2012; Nemirovsky and Yudin, 1983; Raginsky and Rakhlin, 2011). The key question is whether or not the interpolation setting is restrictive enough to exclude these lower bounds and permit acceleration; the goal of this chapter is to extend the estimating sequences framework (Nesterov, 2004) to show that SAGD does in fact achieve an accelerated convergence rate when the strong growth condition holds.

We now introduce AGD in greater detail before discussing its analysis using estimating sequences.
Figure 5.1 presents the classic procedural definition of AGD with stochastic gradients. The major differences from SGD are: (i) AGD introduces a secondary sequence of points (y_k) that are calculated by extrapolating from the primary sequence (w_k); (ii) the primary sequence is updated using stochastic gradient steps from the extrapolation points y_k, rather than the preceding iterate w_k; and (iii) the extrapolation step-size is computed from a quadratic equation that depends on µ and L, where µ = 0 for convex functions.

Stochastic AGD (SAGD)
0. Choose an initial point w_0 = y_0 ∈ R^d and α_0 ∈ (0, 1).
1. For each iteration k ≥ 0:
   (a) Query O for ∇f(y_k, z_k).
   (b) Update the iterate as w_{k+1} = y_k − η ∇f(y_k, z_k).
   (c) Compute α_{k+1} from α²_{k+1} = (1 − α_{k+1}) α²_k + (µ/L) α_{k+1}.
   (d) Set β_k = α_k(1 − α_k)/(α²_k + α_{k+1}) and extrapolate as
       y_{k+1} = w_{k+1} + β_k (w_{k+1} − w_k).

Figure 5.1: Classical form of AGD with stochastic gradients. This algorithm is equivalent to that proposed and analyzed by Vaswani et al. (2019a).

The procedure for selecting the β_k step-sizes is particularly unintuitive and has led some authors to describe the fast convergence of AGD as an "algebraic trick" (Allen-Zhu and Orecchia, 2017). Partly for this reason, we shall re-express AGD as alternating steps of gradient descent on f and a new function φ : R^d → R.

5.2 Estimating Sequences

Let (w_k) be the sequence of iterates produced by SGD. Lemma 14 shows that the following descent condition holds if f is L-smooth and (f, O) satisfies strong growth:

E_{z_k}[f(w_{k+1})] ≤ f(w_k) − η( 1 − ρLη/2 ) ‖∇f(w_k)‖².

Choosing η = 1/(ρL) minimizes the right-hand side, in which case SGD can be viewed as iterative minimization of the quadratic upper bound on f given by smoothness and strong growth. This procedure decreases f at each iteration by assuming worst-case curvature and noise at every iterate w_k. Intuitively, SGD is sub-optimal exactly because it employs global worst-case bounds regardless of past knowledge — e.g.
the (f(w_k, z_k)) and (∇f(w_k, z_k)) sequences.

We can instead consider an algorithm which builds local approximations to f based on all past information accumulated by the procedure. Sufficiently accurate approximations can replace or augment the worst-case curvature assumption and allow for faster convergence. This intuition is formalized by the notion of estimating sequences (Nesterov, 2004).

Definition 9 (Estimating Sequences). The two sequences (λ_k)_{k=0}^{∞} and (φ_k)_{k=0}^{∞} are called estimating sequences for f : R^d → R if (i) λ_k ≥ 0 for all k ∈ N; (ii) lim_{k→∞} λ_k = 0; and (iii) for all k ∈ N and w ∈ R^d, the functions φ_k : R^d → R satisfy

φ_k(w) ≤ (1 − λ_k) f(w) + λ_k φ_0(w).

If φ_0 is chosen so that φ_0(w) ≈ f(w) in a neighborhood of w_0, then (φ_k)_{k=0}^{∞} matches our intuition of a sequence of improving, local approximations to f. This is particularly true since the conditions on λ_k guarantee lim_{k→∞} φ_k ≤ f point-wise — i.e. (φ_k) eventually becomes a global lower bound on f. Finally, and most importantly, if f(w_k) ≤ inf_x φ_k(x) for all k, then we obtain

f(w_k) ≤ φ_k(w*) ≤ (1 − λ_k) f(w*) + λ_k φ_0(w*)
⟹ f(w_k) − f(w*) ≤ λ_k ( φ_0(w*) − f(w*) ),

and the convergence rate of λ_k controls the convergence of w_k to w*. Establishing this last property will be the core of our analysis. Let us now fix a choice of estimating sequences following Nesterov (2004, Lemma 2.2.2).

Lemma 10. Suppose f is a µ-strongly-convex function (with µ = 0 in the convex case). Let (α_k)_{k=0}^{∞} be a sequence of real numbers such that α_k ∈ (0, 1) and ∑_{k=0}^{∞} α_k = ∞. Let (y_k)_{k=0}^{∞} be an arbitrary sequence of points in R^d. Then the following pair of sequences are estimating sequences for f:

λ_{k+1} = (1 − α_k) λ_k,
φ_{k+1}(w) = (1 − α_k) φ_k(w) + α_k ( f(y_k) + 〈∇f(y_k), w − y_k〉 + (µ/2) ‖w − y_k‖² ),

where λ_0 = 1 and φ_0(w) = φ*_0 + (γ_0/2) ‖w − v_0‖², with arbitrary γ_0 ≥ 0, v_0 ∈ R^d, and φ*_0 ∈ R.

Lemma 10 provides a recipe for generating local approximations of f by updating φ_k with a global minorant of f "centered" at y_k.
This minorant is quadratic when f is strongly-convex and linear when f is convex (since µ = 0). An important property of the update is that it preserves the canonical form

φ_k(w) = φ*_k + (γ_k/2) ‖w − v_k‖^2, (5.1)

where

γ_{k+1} = (1 − α_k) γ_k + α_k µ,
v_{k+1} = (1/γ_{k+1}) ((1 − α_k) γ_k v_k + α_k µ y_k − α_k ∇f(y_k)), and
φ*_{k+1} = (1 − α_k) φ*_k + α_k ( f(y_k) − (α_k/(2γ_{k+1})) ‖∇f(y_k)‖^2 + ((1 − α_k) γ_k/γ_{k+1}) ( (µ/2) ‖y_k − v_k‖^2 + 〈∇f(y_k), v_k − y_k〉 ) ).

See Nesterov (2004, Lemma 2.2.3) for proof. While cumbersome, the canonical form for φ_k allows AGD to be reformulated as alternating descent steps on φ_k and f. We do this now.

Figure 5.2 re-expresses SAGD in the notation of estimating sequences. We call the resulting algorithm R-SAGD: reformulated stochastic accelerated gradient descent. Note that the order of the (y_k) and (w_k) sequences has been reversed for convenience; this does not affect the algorithm. Formulating SAGD as an estimating sequence procedure is particularly nice for two reasons: (1) the extrapolation step

y_{k+1} = w_{k+1} + β_k (w_{k+1} − w_k)

takes on an intuitive interpretation as gradient descent on φ_{k+1}, and (2) if

E[inf_x φ_k(x)] ≥ E[f(w_k)]

holds for all k, then convergence of SAGD is determined entirely by convergence of (λ_k).

Reformulated SAGD (R-SAGD)

0. Choose an initial point w_0 = y_0 ∈ R^d and γ_0 ≥ 0.
1. For each iteration k ≥ 0:
   (a) Compute α_k from α_k^2 = γ_{k+1}/(ρL) = ((1 − α_k) γ_k + α_k µ)/(ρL).
   (b) Extrapolate as
       y_k = w_k − (α_k/(γ_k + α_k µ)) ∇φ_k(w_k).
   (c) Query O for ∇f(y_k, z_k).
   (d) Update input as
       w_{k+1} = y_k − (1/(ρL)) ∇f(y_k, z_k).

Figure 5.2: Reformulation of AGD as alternating steps of gradient descent on φ and SGD on f.

We will show this second property shortly, but first we formally state the equivalence of R-SAGD and SAGD, with proof given in Section D.1.

Lemma 11. Let (λ_k) and (φ_k) be as defined in Lemma 10. Then the R-SAGD and SAGD algorithms are equivalent.

Now we establish the main result of this section; convergence for convex and µ-strongly-convex functions will follow almost immediately in Sections 5.3 and 5.4.
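As a sanity check on the canonical-form recursion, the sketch below (an illustrative verification, not part of the thesis; the quadratic objective and all parameter values are assumptions) builds φ_{k+1} both directly from the update in Lemma 10 and from the recursions for (γ_{k+1}, v_{k+1}, φ*_{k+1}), then confirms the two definitions agree at random points.

```python
import numpy as np

# Numerical check that the canonical form (5.1) is preserved by one step of the
# estimating-sequence update of Lemma 10 for a strongly-convex quadratic f.
rng = np.random.default_rng(1)
d, mu = 3, 0.7
H = mu * np.eye(d) + 0.1 * np.ones((d, d))     # Hessian with H >= mu * I
f = lambda w: 0.5 * w @ H @ w
grad_f = lambda w: H @ w

gamma, v, phi_star = 1.3, rng.standard_normal(d), 0.4
alpha = 0.5
y = rng.standard_normal(d)

# phi_k in canonical form and the strongly-convex minorant of f centered at y
phi_k = lambda w: phi_star + 0.5 * gamma * np.sum((w - v) ** 2)
minorant = lambda w: f(y) + grad_f(y) @ (w - y) + 0.5 * mu * np.sum((w - y) ** 2)
phi_next_direct = lambda w: (1 - alpha) * phi_k(w) + alpha * minorant(w)

# Canonical-form parameters after one update (Nesterov, 2004, Lemma 2.2.3)
gamma_next = (1 - alpha) * gamma + alpha * mu
v_next = ((1 - alpha) * gamma * v + alpha * mu * y - alpha * grad_f(y)) / gamma_next
phi_star_next = ((1 - alpha) * phi_star
                 + alpha * (f(y)
                            - alpha / (2 * gamma_next) * np.sum(grad_f(y) ** 2)
                            + (1 - alpha) * gamma / gamma_next
                              * (0.5 * mu * np.sum((y - v) ** 2)
                                 + grad_f(y) @ (v - y))))
phi_next_canon = lambda w: phi_star_next + 0.5 * gamma_next * np.sum((w - v_next) ** 2)

for _ in range(5):
    w = rng.standard_normal(d)
    assert abs(phi_next_direct(w) - phi_next_canon(w)) < 1e-8
```

Both quadratics share the Hessian γ_{k+1} I, so checking agreement at a handful of points effectively verifies the closed forms for v_{k+1} and φ*_{k+1}.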
The proof relies on the expected decrease condition given by Lemma 14, which holds for L-smooth f when (f, O) satisfies strong growth. Intuitively, this condition shows that SGD obtains similar per-iteration improvement (in expectation) to deterministic gradient descent if strong growth holds. Finally, note that the expectations in the lemma below are taken with respect to the entire sequence of random variables (z_t)_{t=0}^k.

Lemma 12. Let f be a µ-strongly-convex, L-smooth function (with µ = 0 in the convex case) and O a SFO such that (f, O) satisfies strong growth with parameter ρ. If φ*_0 = f(w_0) and γ_0 ≥ 0 is independent of the random process (z_k), then for all k ∈ N, R-SAGD satisfies

E[inf_w φ_k(w)] = E[φ*_k] ≥ E[f(w_k)].

See Section D.1 for proof.

5.3 Convergence for Strongly-Convex Functions

It is now straightforward to prove R-SAGD converges at an accelerated rate. Earlier, we showed that Lemma 12 and the fact that (λ_k) and (φ_k) are estimating sequences for f together imply

f(w_k) − f(w*) ≤ λ_k (φ_0(w*) − f(w*)) = λ_k (f(w_0) − f(w*) + (γ_0/2) ‖w_0 − w*‖^2).

This is almost an explicit convergence rate. The last step of the analysis is to select an appropriate value for γ_0 and analyze convergence of the λ_k sequence, which is done in the following theorem.

Theorem 12. Let f be a µ-strongly-convex, L-smooth function with µ > 0 and O a SFO such that (f, O) satisfies strong growth with parameter ρ. Moreover, choose γ_0 = µ, v_0 = w_0, and φ*_0 = f(w_0). Then R-SAGD converges as

E[f(w_K)] − f(w*) ≤ (1 − √(µ/(ρL)))^K (f(w_0) − f(w*) + (µ/2) ‖w* − w_0‖^2).

See Section D.2 for proof.

Accelerated convergence of SAGD is an immediate corollary of the equivalence of R-SAGD and SAGD (Lemma 11 and Theorem 12). The only catch here is that α_0 must be selected to correspond to γ_0 = µ. It is easy to see that choosing α_0 = √(µ/(ρL)) for SAGD is identical to γ_0 = µ in R-SAGD since α_0^2 = ((1 − α_0) γ_0 + α_0 µ)/(ρL).

Corollary 2. Let f be a µ-strongly-convex, L-smooth function with µ > 0 and O a SFO such that (f, O) satisfies strong growth with parameter ρ.
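Lemma 12's invariant can be checked numerically in the deterministic special case ρ = 1, where R-SAGD reduces to classical AGD. The sketch below is an illustrative check with an assumed diagonal quadratic objective; it runs the updates of Figure 5.2 together with the canonical-form recursion and verifies f(w_k) ≤ φ*_k along the whole trajectory.

```python
import numpy as np

# Deterministic sanity check of R-SAGD with rho = 1 (exact gradients):
# the invariant f(w_k) <= phi*_k of Lemma 12 should hold at every iteration.
rng = np.random.default_rng(3)
evals = np.array([1.0, 2.0, 5.0, 10.0])
H = np.diag(evals)                        # f(w) = 0.5 w^T H w, mu = 1, L = 10
mu, L = evals.min(), evals.max()
f = lambda w: 0.5 * w @ H @ w
grad_f = lambda w: H @ w

w = rng.standard_normal(4)
gamma, v, phi_star = mu, w.copy(), f(w)   # gamma_0 = mu, v_0 = w_0, phi*_0 = f(w_0)
for _ in range(100):
    # (a) alpha_k solves L alpha^2 = (1 - alpha) gamma + alpha mu
    alpha = (-(gamma - mu) + np.sqrt((gamma - mu) ** 2 + 4 * L * gamma)) / (2 * L)
    # (b) extrapolate via a gradient step on phi_k, grad phi_k(w) = gamma (w - v)
    y = w - alpha / (gamma + alpha * mu) * gamma * (w - v)
    g = grad_f(y)
    # (d) gradient step on f from y_k (rho = 1, so step-size 1/L)
    w = y - g / L
    # update the estimating sequence in canonical form
    gamma_next = (1 - alpha) * gamma + alpha * mu
    v_next = ((1 - alpha) * gamma * v + alpha * mu * y - alpha * g) / gamma_next
    phi_star = ((1 - alpha) * phi_star
                + alpha * (f(y) - alpha / (2 * gamma_next) * g @ g
                           + (1 - alpha) * gamma / gamma_next
                             * (0.5 * mu * np.sum((y - v) ** 2) + g @ (v - y))))
    gamma, v = gamma_next, v_next
    assert f(w) <= phi_star + 1e-8        # Lemma 12 invariant

assert f(w) < 1e-10                       # accelerated convergence to w* = 0
```

With γ_0 = µ, the recursion gives γ_k = µ and α_k = √(µ/L) for every k, so this run also illustrates the constant momentum schedule behind Theorem 12.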
Moreover, choose α_0 = √(µ/(ρL)). Then SAGD converges as

E[f(w_K)] − f(w*) ≤ (1 − √(µ/(ρL)))^K (f(w_0) − f(w*) + (µ/2) ‖w* − w_0‖^2).

Corollary 2 looks like a potential positive answer to the problem of whether or not acceleration is possible in the interpolation setting. SAGD's O(exp{−√(µ/(ρL)) K}) convergence certainly improves on that of constant step-size SGD. Moreover, the form of improvement — taking the square-root of the condition number — is identical to that obtained by AGD over deterministic gradient descent (Nesterov, 2004). In this sense, SAGD is a true accelerated algorithm.

However, SAGD is, in general, not optimal for L-smooth, strongly-convex functions. For example, if O is L_max individually-smooth, then ρ ≤ L_max/µ and the worst-case complexity is O(exp{−√(µ^2/(L_max L)) K}). This is roughly equivalent to gradient descent, implying SAGD is not accelerated. Overall, we conclude SAGD is an accelerated method only relative to the complexity of other stochastic algorithms under interpolation.

The analysis of SAGD for strongly-convex functions presented here is strictly tighter than prior work by Vaswani et al. (2019a, Theorem 1), who also studied convergence under strong growth. Table 5.1 compares the two complexities; the major difference is that ρ is squared in the rate obtained by Vaswani et al. Recalling ρ = L_max/µ in the worst case implies that our result is faster by a multiple of the condition number for individually-smooth O satisfying minimizer interpolation, which is comparable to the improvement obtained by AGD over deterministic gradient descent. This tighter rate addresses the criticism of Liu and Belkin (2020), who showed that the convergence speed shown by Vaswani et al. could be slower than SGD in some circumstances.

5.4 Convergence for Convex Functions

Our final result for SAGD addresses convergence for convex functions under the strong growth condition. The following theorem shows that R-SAGD obtains an accelerated O(ρL/K^2) complexity.
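A quick way to see the practical gap between the two geometric rates discussed above is to count the iterations each needs to reach a target accuracy; the values of µ, L, ρ, and the tolerance below are arbitrary illustrative choices, not taken from the thesis.

```python
import math

# Iterations needed to reach accuracy eps under a geometric contraction
# (1 - rate)^K: SGD contracts with rate mu/(rho L), SAGD with sqrt(mu/(rho L)).
def iters(rate, eps=1e-6):
    return math.ceil(math.log(eps) / math.log(1 - rate))

mu, L, rho = 1.0, 100.0, 10.0                 # illustrative constants
kappa = rho * L / mu                          # effective condition number = 1000
k_sgd = iters(mu / (rho * L))                 # roughly kappa * log(1/eps)
k_sagd = iters(math.sqrt(mu / (rho * L)))     # roughly sqrt(kappa) * log(1/eps)

assert k_sagd < k_sgd                         # square-root dependence on kappa
assert k_sgd > 10 * k_sagd                    # the gap is large when kappa is
```

For these constants, SGD needs on the order of ten thousand iterations while SAGD needs a few hundred, mirroring the κ versus √κ separation between gradient descent and AGD in the deterministic setting.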
Invoking the correspondence between R-SAGD and SAGD leads to an identical result for the latter algorithm as a corollary. This improves on prior analyses by a factor of √ρ (Vaswani et al., 2019a, Theorem 2). See Table 5.1 for additional details.

Assumptions | Ours | VBS
µ-SC | O(√(ρL/µ) log(1/ε)) | O(√(ρ^2 L/µ) log(1/ε))
Convex | O(√(ρL/ε)) | O(√(ρ^2 L/ε))

Table 5.1: Comparison of iteration complexities for stochastic acceleration schemes under the strong growth condition. For each result, we report the complexity obtained with the optimal step-size and momentum parameter according to the analysis. Our results are tighter than VBS (Vaswani et al., 2019a) by a factor of √ρ. Liu and Belkin (2020) also consider stochastic acceleration under interpolation, but make substantially different assumptions about f and thus obtain difficult-to-compare rates. Accordingly, we omit their algorithm, MaSS, from this table.

Theorem 13. Let f be a convex, L-smooth function and O a SFO such that (f, O) satisfies strong growth with parameter ρ. Moreover, choose γ_0 = 2ρL, v_0 = w_0, and φ*_0 = f(w_0). Then R-SAGD converges as

E[f(w_K)] − f(w*) ≤ (2/(K + 1)^2) (f(w_0) − f(w*) + ρL ‖w_0 − w*‖^2).

The proof is given in Section D.3. Again, we have the following corollary from the equivalence of R-SAGD and SAGD. Note that the choice of α_0 = √2 − 1 for SAGD is identical to the choice of γ_0 = 2ρL in R-SAGD, since α_0^2 = (1 − α_0) γ_0/(ρL).

Corollary 3. Let f be a convex, L-smooth function and O a SFO such that (f, O) satisfies strong growth with parameter ρ. Moreover, choose α_0 = √2 − 1. Then SAGD converges as

E[f(w_K)] − f(w*) ≤ (2/(K + 1)^2) (f(w_0) − f(w*) + ρL ‖w_0 − w*‖^2).

5.5 Acceleration under Weak Growth

A disappointing aspect of Theorem 13 is that it uses the strong growth condition, rather than weak growth. Recall that sufficient conditions for strong growth are minimizer interpolation, individual-smoothness, and the PL condition (Lemma 7).
Thus, requiring strong growth in this case has the flavour of requiring f to be strongly-convex, but only leads to sub-linear convergence. This is quite disappointing, but it is not obvious that the assumption can be relaxed.

The natural condition for convex functions used throughout this work is the weak growth condition. The main issue using weak growth in Theorem 13 is that we lose access to the progress condition

E_{z_k}[f(w_{k+1})] ≤ f(w_k) − η (1 − ρLη/2) ‖∇f(w_k)‖^2

from Lemma 14, which was a critical piece of our analysis for SAGD. The direct proof using estimating sequences collapses without obvious recourse.

5.6 Conclusions

This chapter has tackled the challenging question of stochastic Nesterov acceleration under interpolation-type conditions. Chapters 3 and 4 showed that SGD converges nearly as quickly as deterministic gradient descent when minimizer interpolation is satisfied. These positive results suggested that stochastic acceleration might also be possible in a similarly restricted setting. Following Vaswani et al. (2019a), we investigated a straightforward modification of AGD to use stochastic gradients. However, unlike Vaswani et al. (2019a), we develop an argument based on estimating sequences (Nesterov, 2004). This allowed us to cast SAGD as an alternating minimization procedure on f and a sequence of curvature estimates, φ_k. Careful analysis of this procedure showed that SAGD obtains accelerated rates of convergence in both the strongly-convex and convex settings (Theorems 12 and 13) when strong growth holds. Our theorems improve over Vaswani et al. (2019a) by a factor of √ρ, which is equal to √(L_max/µ) in the worst case.

Chapter 6

Beyond Interpolation

So far we have investigated the performance of stochastic gradient methods for unconstrained minimization of an L-smooth function f given access to a SFO O such that (f, O) satisfies interpolation or a growth condition.
This chapter extends these analyses to L2-regularized minimization of f,

min_{w ∈ R^d} F(w) := f(w) + (λ/2) ‖w‖^2, (6.1)

where λ > 0 is a tuning parameter which controls the degree of regularization. To maintain consistency with common notation, w* will refer to the global minimizer of this regularized problem, while minimizers of f will now be denoted as w+.

The canonical SFO for L2-regularized minimization evaluates the regularization term exactly, giving the following stochastic function and gradient evaluations:

F(w, z_k) = f(w, z_k) + (λ/2) ‖w‖^2,    ∇F(w, z_k) = ∇f(w, z_k) + λw.

We call this the L2-regularization of O and denote it as O2. Except in degenerate circumstances (for example, when f is minimized at w* = 0), the pair (F, O2) will not satisfy interpolation or weak/strong growth even when (f, O) does. This means the analyses from previous chapters cannot be applied directly to regularized problems.

There is still significant reason to believe the L2-regularization of f can be minimized efficiently when (f, O) satisfies the weak growth condition. For example, f(w*) − f(w+) will be small when λ ≪ µ and so we might expect (F, O2) to approximately satisfy weak growth. On the other hand, O2 will become nearly deterministic as λ → ∞ and (λ/2) ‖w‖^2 dominates f. Intuition indicates that it is only when λ is moderate that minimizing F will be challenging. The next lemma formalizes these observations into a version of weak growth which holds for (F, O2).

Lemma 13. Let f be an L-smooth function with at least one finite minimizer w+ and O a SFO such that (f, O) satisfies the weak growth condition with parameter ρ. Then the L2-regularized problem (F, O2) satisfies the following inequality for all w ∈ R^d and k ≥ 0:

E_{z_k}[‖∇F(w, z_k)‖^2] ≤ 4 max{ρL, λ} (F(w) − F(w*) − ((λ − L)/2) ‖w* − w+‖^2 + (λ/2) ‖w+‖^2),

where w* minimizes the regularized function F.
Moreover, this can be improved to

E_{z_k}[‖∇F(w, z_k)‖^2] ≤ 4 max{ρL, λ} (F(w) − F(w*) − ((λ + µ)/2) ‖w* − w+‖^2 + (λ/2) ‖w+‖^2)

if f is µ-strongly-convex (with µ = 0 giving the convex case). See Appendix E for proof.

6.1 Convergence for L2-Regularized Convex Functions

Let us use Lemma 13 to show that SGD with a constant step-size converges linearly to a neighborhood of w* when f is convex or strongly-convex. Approximate convergence to a region around w* is the best that can be obtained without an averaging scheme or decreasing step-size due to the irreducible (λ/2) ‖w+‖^2 + ((λ + µ)/2) ‖w* − w+‖^2 term in the noise-bound. The following analysis assumes µ-strong-convexity, but permits µ = 0 in order to capture convex functions. In the case that µ = 0, we are still able to obtain linear convergence because the regularized function F is (µ + λ)-strongly-convex and (L + λ)-smooth. Proof of the following theorem is given in Section E.1.

Theorem 14. Let f be a µ-strongly-convex (with µ = 0 in the convex case), L-smooth function with at least one finite minimizer w+ and O a SFO such that the pair (f, O) satisfies the weak growth condition with constant ρ. Then stochastic gradient descent with constant step-size

η ≤ (µ + λ) / (max{ρL, λ} ((µ + λ) + (L + λ)))

obtains the following convergence rate for the L2-regularized problem (F, O2):

E[‖w_K − w*‖^2] ≤ (1 − 2η (µ + λ)(L + λ) / ((µ + λ) + (L + λ)))^K ‖w_0 − w*‖^2 + γλ ‖w+‖^2 − γ (λ + µ) ‖w* − w+‖^2,

where γ = η max{ρL, λ} [(µ + λ) + (L + λ)] / ((µ + λ)(L + λ)).

Theorem 14 shows that the size of the neighborhood to which SGD converges is determined by the degree of regularization, which fits with the intuition from the previous section.
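The behavior described by Theorem 14, linear convergence to a λ-dependent neighborhood of w*, is easy to observe empirically. The sketch below is an assumed toy least-squares instance (not from the thesis): f satisfies interpolation, the O2 oracle adds the exact regularization gradient, and constant step-size SGD approaches, without exactly reaching, the regularized minimizer.

```python
import numpy as np

# Constant step-size SGD on F = f + (lam/2) ||w||^2, where f is an
# interpolating least-squares problem queried through the O2 oracle.
rng = np.random.default_rng(2)
n, d, lam = 50, 10, 0.1
A = rng.standard_normal((n, d))
w_plus = rng.standard_normal(d)          # minimizer of f (labels have no noise)
b = A @ w_plus
# w* minimizes F(w) = (1/2n) ||A w - b||^2 + (lam/2) ||w||^2
w_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)

w = np.zeros(d)
eta = 0.01
for _ in range(20000):
    i = rng.integers(n)
    grad = (A[i] @ w - b[i]) * A[i] + lam * w    # = grad F(w, z_k) from O2
    w -= eta * grad

# The iterate lands in a neighborhood of w* rather than exactly at it,
# because (F, O2) no longer satisfies interpolation once lam > 0.
dist = np.linalg.norm(w - w_star)
assert dist < np.linalg.norm(w_star)     # much closer to w* than w_0 = 0
assert dist < 1.0                        # inside a small noise-dominated ball
```

Shrinking λ toward 0 (or growing it toward ∞) shrinks the final distance, matching the observations on the region of convergence made below.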
We make the following observations about the effect of λ on the region of convergence: (i) the minimizer w* depends on λ, with w* = w+ when λ = 0 — it is easy to see that the neighborhood terms collapse to 0 and convergence mirrors Theorem 1 when λ = 0 holds; (ii) as λ → ∞, w* → 0 and the neighborhood again collapses to 0; and (iii) when f is convex and η attains its maximal value, the convergence rate of SGD has the special case

E[‖w_K − w*‖^2] ≤ (1 − 2λ^2 (L + λ) / (max{ρL, λ} (L + 2λ)^2))^K ‖w_0 − w*‖^2 + (λ/(L + λ)) ‖w+‖^2 − (λ/(L + λ)) ‖w* − w+‖^2.

Crudely assuming λ/(λ + L) ≈ 1 shows that the volume of the neighborhood largely depends on the convergence of w* to 0 as λ grows.

6.2 Conclusions

This chapter characterized the complexity of SGD for L2-regularized problems under interpolation. Theorem 14 is thus a small but promising step towards the analysis of general structural minimization problems. For example, consider generalizing Equation 6.1 to

F(w) = f(w) + g(w),

where f is an L-smooth function with SFO O such that (f, O) satisfies interpolation and g is a potentially non-smooth regularizer or indicator function. This problem class includes L1-regularized minimization, as well as minimization with (convex) constraints. An interesting avenue of research is extending the analysis above to this setting through the proximal-gradient method. Although Cevher and Vu (2019) considered such problems, they analyzed convergence under strong growth plus noise only and did not derive specific expressions for the neighbourhood size.

Chapter 7

Conclusion

This thesis develops the convergence theory of stochastic first-order methods under interpolation conditions. Unlike existing work which is confined to the finite-sum setting, we propose a highly general model for interpolation that applies to black-box objective functions and stochastic first-order oracles (SFOs). The subsequent discussion relates interpolation to the strong and weak growth conditions.
Specifically, we provide upper-bounds on the weak and strong growth parameters for interpolating oracles that satisfy an intuitive Lipschitz-smoothness property. Informally, this means that global regularity of the stochastic gradients is guaranteed from a local correspondence between the oracle and objective as long as the stochastic gradient mapping is smooth in the model parameters. Such a property is satisfied for smooth supervised learning problems.

The theoretical analysis of first-order methods focuses on stochastic gradient descent (SGD) and a stochastic version of Nesterov's accelerated gradient descent. For SGD, we analyze the iteration complexity with a fixed step-size as well as with a stochastic Armijo line-search. In both cases, our results are tighter than existing convergence rates and allow for a wider range of parameters. For convex functions, we recover the best-known convergence rates in the deterministic setting both with and without the Armijo line-search. In all cases, specific care is taken to understand the impact of additional assumptions, such as smoothness or convexity of the oracle.

Our study of stochastic accelerated gradient descent (SAGD) uses the estimating sequences framework developed by Nesterov (2004). Estimating sequences allow us to cast SAGD as an alternating descent procedure on the objective and a sequence of improving local approximations, avoiding the “analytical tricks” used in many accelerated convergence proofs. Moreover, we are able to derive tighter convergence rates than existing work using this framework. In the case of smooth, interpolating oracles, the improvement in iteration complexity is comparable to dividing by the square-root of the condition number (Vaswani et al., 2019a).

Although this thesis attempts to be a comprehensive analysis of stochastic gradient methods under interpolation, the scope is often too narrow and several important methods are not studied.
We briefly discuss the role of growth conditions in structural optimization in Chapter 6. The emphasis falls on L2-regularized problems where the regularized function satisfies weak growth. In this limited setting, we derive linear convergence to a neighbourhood of the optimal solution and precisely characterize the volume in terms of the regularization parameter. The far more general and interesting problem of composite smooth/non-smooth optimization where the smooth component satisfies interpolation remains completely unaddressed. Deriving exact convergence neighborhoods for proximal-gradient methods in this setting is an interesting open problem.

Open problems in optimization under interpolation are not limited to the largely unexplored area of structural optimization. Fundamental questions not answered in this thesis or in the literature include the following:

1. Is there a finite-time convergence rate for the final iterate generated by constant step-size SGD when the objective is convex and weak growth is satisfied?
2. If so, can this rate be extended to SGD with the stochastic Armijo line-search?
3. Does SAGD obtain an accelerated convergence rate for convex functions under the weak growth condition?
4. Can analyses for the stochastic Armijo line-searches be extended to SAGD?
5. Can tighter convergence rates be established for SAGD when additional assumptions are made on the oracle (e.g. smoothness or convexity) as they can for SGD?

Many of these questions build on the problems tackled in this thesis, but require new tools or insights to answer. The analyses contributed here are merely a first step towards a larger understanding of optimization under interpolation.

Bibliography

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory, 58(5):3235–3249, 2012.

Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach.
Learn. Res., 18:221:1–221:51, 2017. URL http://jmlr.org/papers/v18/16-410.html.

Z. Allen-Zhu. Katyusha X: Practical momentum method for stochastic sum-of-nonconvex optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 179–185. PMLR, 2018.

Z. Allen Zhu and L. Orecchia. Linear Coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, volume 67 of LIPIcs, pages 3:1–3:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.

L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov. Parameter adaptation in stochastic optimization. On-Line Learning in Neural Networks, Publications of the Newton Institute, pages 111–134, 1998.

Y. Arjevani and O. Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 908–916, 2016.

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 244–253. PMLR, 2018.

M. Assran and M. Rabbat. On the convergence of Nesterov's accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414, 2020.

M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic gradient push for distributed deep learning.
In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 344–353. PMLR, 2019.

M. Avriel and D. J. Wilde. Golden block search for the maximum of unimodal functions. Management Science, 14(5):307–319, 1968.

R. Bassily, M. Belkin, and S. Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.

A. G. Baydin, R. Cornish, D. Martínez-Rubio, M. Schmidt, and F. Wood. Online learning rate adaptation with hypergradient descent. In 6th International Conference on Learning Representations, ICLR 2018. OpenReview.net, 2018.

M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019a.

M. Belkin, A. Rakhlin, and A. B. Tsybakov. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pages 1611–1619. PMLR, 2019b.

A. Ben-Israel and B. Mond. What is invexity? The ANZIAM Journal, 28(1):1–9, 1986.

Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade - Second Edition, volume 7700 of Lecture Notes in Computer Science, pages 437–478. Springer, 2012.

L. Berrada, A. Zisserman, and M. P. Kumar. Training neural networks for and by interpolation. arXiv preprint arXiv:1906.05661, 2019.

D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.

L. Bottou.
Une approche théorique de l'apprentissage connexionniste et applications à la reconnaissance de la parole. PhD thesis, 1991.

S. Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231–357, 2015.

R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Program., 134(1):127–155, 2012.

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pages 1–50, 2019.

V. Cevher and B. C. Vu. On the linear convergence of the stochastic gradient method with constant step-size. Optim. Lett., 13(5):1177–1187, 2019.

Y.-L. Chen and M. Kolar. Understanding accelerated stochastic gradient descent via the growth condition. arXiv preprint arXiv:2006.06782, 2020.

D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.

E. Çınlar. Probability and stochastics, volume 261. Springer Science & Business Media, 2011.

M. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 1018–1027. PMLR, 2018.

A. d'Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183, 2008.

S. De, A. Yadav, D. Jacobs, and T. Goldstein. Big batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.

A. Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems 29: NeurIPS 2016, pages 676–684, 2016.

A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems 32: NeurIPS 2019, pages 1753–1763, 2019.

A. Defazio, F. R. Bach, and S. Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: NeurIPS 2014, pages 1646–1654, 2014.

O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845, 2019.

J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2011.

S. Fridovich-Keil and B. Recht. Choosing the step size: Intuitive line search algorithms with efficient convergence. In The 11th Workshop on Optimization for Machine Learning (OPT 2019), 2019.

M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3), 2012.

R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2540–2548, 2015.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM J. Optim., 22(4):1469–1492, 2012.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM J. Optim., 23(4):2061–2089, 2013.

R. B. Grosse and R. Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2304–2313, 2015.

T. Hastie, R. Tibshirani, and J.
H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics. Springer, 2009.

J. Honorio. Convergence rates of biased stochastic optimization for learning sparse Ising models. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012. icml.cc / Omnipress, 2012.

C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22: NeurIPS 2009, pages 781–789. Curran Associates, Inc., 2009.

P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, COLT 2018, volume 75 of Proceedings of Machine Learning Research, pages 545–604. PMLR, 2018.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: NeurIPS 2013, pages 315–323, 2013.

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, volume 9851 of Lecture Notes in Computer Science, pages 795–811. Springer, 2016.

A. Khaled and P. Richtárik. Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.

D. Kovalev, S. Horváth, and P. Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic Learning Theory, ALT 2020, volume 117 of Proceedings of Machine Learning Research, pages 451–467. PMLR, 2020.

N. Krejic and N. Krklec. Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl.
Math., 245:213–231, 2013.

H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications, volume 35 of Applications of Mathematics. Springer, 1997.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Math. Program., 171(1-2):167–215, 2018.

N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25: NeurIPS 2012, pages 2672–2680, 2012.

Y. Lei, T. Hu, G. Li, and K. Tang. Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Transactions on Neural Networks and Learning Systems, 2019.

X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pages 983–992. PMLR, 2019.

T. Liang and A. Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: NeurIPS 2015, pages 3384–3392, 2015.

C. Liu and M. Belkin. Accelerating SGD with momentum for over-parameterized learning. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.

N. Loizou, S. Vaswani, I. Laradji, and S. Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint arXiv:2002.10542, 2020.

L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019.

S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning.
In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 3331–3340. PMLR, 2018.

M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res., 18:119:1–119:59, 2017.

S. Y. Meng, S. Vaswani, I. H. Laradji, M. Schmidt, and S. Lacoste-Julien. Fast and furious convergence: Stochastic second order methods under interpolation. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, volume 108 of Proceedings of Machine Learning Research, pages 1375–1386. PMLR, 2020.

A. S. Nemirovsky and Y. E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.

A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics, 1983.

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady an USSR, volume 269, pages 543–547, 1983.

Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24(3):509–517, 1988.

Y. E. Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004.

J. Neveu. Discrete-parameter martingales, volume 10. Elsevier, 1975.

L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 3747–3755. PMLR, 2018.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

A. Ogaltsov, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and V. Spokoiny. Adaptive gradient descent for convex and non-convex stochastic optimization.
arXiv preprint arXiv:1911.08380, 2019.

F. Orabona and T. Tommasi. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems 30: NeurIPS 2017, pages 2160–2170, 2017.

C. Paquette and K. Scheinberg. A stochastic line search method with expected complexity analysis. SIAM J. Optim., 30(1):349–376, 2020.

V. Plagianakos, G. Magoulas, and M. Vrahatis. Learning rate adaptation in stochastic gradient descent. In Advances in Convex Analysis and Global Optimization, pages 433–444. Springer, 2001.

B. Polyak and Y. Z. Tsypkin. Pseudogradient adaptation and training algorithms. Automation and Remote Control, 34:45–67, 1973.

X. Qian, P. Richtárik, R. M. Gower, A. Sailanbayev, N. Loizou, and E. Shulgin. SGD with arbitrary sampling: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 5200–5209. PMLR, 2019.

M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory, 57(10):7036–7056, 2011.

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.

M. Rolinek and G. Martius. L4: Practical loss-based step-size adaptation for deep learning. In Advances in Neural Information Processing Systems 31: NeurIPS 2018, pages 6434–6444, 2018.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), pages 322–330. Morgan Kaufmann, 1997.

T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 343–351, 2013.

M. Schmidt and N. Le Roux.
Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

M. Schmidt, N. Le Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24: NeurIPS 2011, pages 1458–1466, 2011.

N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. 1999.

S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 64–72, 2014.

F. Shang, L. Jiao, K. Zhou, J. Cheng, Y. Ren, and Y. Jin. ASVRG: Accelerated proximal SVRG. In Proceedings of the 10th Asian Conference on Machine Learning, ACML 2018, volume 95 of Proceedings of Machine Learning Research, pages 815–830. PMLR, 2018.

S. Shao and P. P. Yip. Rates of convergence of adaptive step-size of stochastic approximation algorithms. Journal of Mathematical Analysis and Applications, 244(2):333–347, 2000.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25: NeurIPS 2012, pages 2960–2968, 2012.

M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Comp. Opt. and Appl., 11(1):23–35, 1998.

R. S. Sutton. Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems, volume 161168, 1992.

C. Tan, S. Ma, Y. Dai, and Y. Qian. Barzilai-Borwein step size for stochastic gradient descent. In Advances in Neural Information Processing Systems 29: NeurIPS 2016, pages 685–693, 2016.

J.
Tang, M. Golbabaee, F. Bach, and M. E. Davies. Rest-Katyusha: Exploiting the solution's structure via scheduled restart schemes. In Advances in Neural Information Processing Systems 31: NeurIPS 2018, pages 427–438, 2018.

T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 2012.

P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.

S. Vaswani, F. Bach, and M. W. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pages 1195–1204. PMLR, 2019a.

S. Vaswani, A. Mishkin, I. H. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems 32: NeurIPS 2019, pages 3727–3740, 2019b.

S. Vaswani, F. Kunstner, I. Laradji, S. Y. Meng, M. Schmidt, and S. Lacoste-Julien. Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835, 2020.

P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 13(2):185–188, 1971.

M. D. Zeiler. AdaDelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017.

H. Zhang and W. Yin. Gradient methods for convex minimization: Better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.

Y. Zhang and X. Lin.
Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 353–361, 2015.

D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems 32: NeurIPS 2019, pages 2053–2062, 2019.

Appendix A

Interpolation and Growth Conditions: Proofs

A.1 Stochastic Oracles

Corollary 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex, differentiable function. If there exists an unbiased, $L_{\max}$ individually-smooth SFO $\mathcal{O}$ for $f$, then $f$ is $L$-smooth with $L \leq L_{\max}$. Alternatively, if $\mathcal{O}$ is individually-convex and biased with finite support $\mathcal{Z}$ and finite partial derivatives $\frac{\partial}{\partial w_j} f(w, z)$ for each $z \in \mathcal{Z}$, then $\mathbb{E}_{z_k}[f(\cdot, z_k)]$ is $L_{\max}$-smooth for all $k$.

Proof. We first consider when $\mathcal{O}$ is unbiased. In this case, $L_{\max}$ individual smoothness of $\mathcal{O}$ implies
\[ f(v, z_k) \leq f(w, z_k) + \langle \nabla f(w, z_k), v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \implies \mathbb{E}_{z_k}[f(v, z_k)] \leq \mathbb{E}_{z_k}[f(w, z_k)] + \langle \mathbb{E}_{z_k}[\nabla f(w, z_k)], v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \implies f(v) \leq f(w) + \langle \nabla f(w), v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2. \]
Smoothness also gives the lower bound
\[ f(v, z_k) \geq f(w, z_k) + \langle \nabla f(w, z_k), v - w \rangle - \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \implies \mathbb{E}_{z_k}[f(v, z_k)] \geq \mathbb{E}_{z_k}[f(w, z_k)] + \langle \mathbb{E}_{z_k}[\nabla f(w, z_k)], v - w \rangle - \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \implies f(v) \geq f(w) + \langle \nabla f(w), v - w \rangle - \frac{L_{\max}}{2} \|v - w\|^2. \]
We conclude that
\[ \frac{L_{\max}}{2} \|v - w\|^2 \geq \left| f(v) - f(w) - \langle \nabla f(w), v - w \rangle \right|, \]
which, along with convexity, is a necessary and sufficient condition for $f$ to be $L_{\max}$-smooth (Nesterov, 2004, Theorem 2.1.5).

Now, assume $\mathcal{O}$ is biased, meaning $\mathbb{E}_{z_k}[\nabla f(w_k, z_k)] \neq \nabla f(w_k)$, but $\mathcal{Z}$ is finite. In this case, $\frac{\partial}{\partial w_j} f(w, z_k) \leq \max_{z \in \mathcal{Z}} |\frac{\partial}{\partial w_j} f(w, z)|$ almost surely for each dimension $j \in [d]$. The maximum of each partial derivative over $\mathcal{Z}$ is finite by assumption and thus $\nabla f(w, z_k)$ is dominated by a constant vector.
Starting again from individual smoothness,
\[ f(v, z_k) \leq f(w, z_k) + \langle \nabla f(w, z_k), v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \implies \mathbb{E}_{z_k}[f(v, z_k)] \leq \mathbb{E}_{z_k}[f(w, z_k)] + \langle \mathbb{E}_{z_k}[\nabla f(w, z_k)], v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2 \]
\[ \leq \mathbb{E}_{z_k}[f(w, z_k)] + \langle \nabla \mathbb{E}_{z_k}[f(w, z_k)], v - w \rangle + \frac{L_{\max}}{2} \|v - w\|^2, \]
where the order of the gradient and expectation operators was exchanged by appealing to the dominated convergence theorem (Çınlar, 2011, Theorem 4.16). An identical derivation using the lower bound from smoothness yields
\[ \mathbb{E}_{z_k}[f(v, z_k)] \geq \mathbb{E}_{z_k}[f(w, z_k)] + \langle \nabla \mathbb{E}_{z_k}[f(w, z_k)], v - w \rangle - \frac{L_{\max}}{2} \|v - w\|^2, \]
which implies
\[ \frac{L_{\max}}{2} \|v - w\|^2 \geq \left| \mathbb{E}_{z_k}[f(v, z_k)] - \mathbb{E}_{z_k}[f(w, z_k)] - \langle \nabla \mathbb{E}_{z_k}[f(w, z_k)], v - w \rangle \right|. \]
Convexity of $f(\cdot, z_k)$ point-wise over $\mathcal{Z}$ implies that $\mathbb{E}_{z_k}[f(\cdot, z_k)]$ is also convex. Thus, $\mathbb{E}_{z_k}[f(w, z_k)]$ is $L_{\max}$-smooth in $w$ by Lemma 2.1.5 of Nesterov (2004).

A.2 Interpolation

Definition 3 (Interpolation: Minimizers). A function-oracle pair $(f, \mathcal{O})$ satisfies minimizer interpolation if for all $z \in \mathcal{Z}$,
\[ f(w') \leq f(w) \; \forall w \in \mathbb{R}^d \implies f(w', z) \leq f(w, z) \; \forall w \in \mathbb{R}^d. \]

Definition 4 (Interpolation: Stationary Points). A function-oracle pair $(f, \mathcal{O})$ satisfies stationary-point interpolation if for all $z \in \mathcal{Z}$,
\[ \nabla f(w') = 0 \implies \nabla f(w', z) = 0. \]

Definition 5 (Interpolation: Mixed). A function-oracle pair $(f, \mathcal{O})$ satisfies mixed interpolation if for all $z \in \mathcal{Z}$,
\[ f(w') \leq f(w) \; \forall w \in \mathbb{R}^d \implies \nabla f(w', z) = 0. \]

Lemma 1. Let $(f, \mathcal{O})$ be an arbitrary function-SFO pair. Then only the following relationships hold between models of interpolation:
\[ \text{Minimizer Interpolation (Definition 3)} \implies \text{Mixed Interpolation (Definition 5)} \]
and
\[ \text{Stationary-Point Interpolation (Definition 4)} \implies \text{Mixed Interpolation (Definition 5)}. \]
However, if $f$ is invex and $\mathcal{O}$ is such that $f(\cdot, z)$ is invex for all $z \in \mathcal{Z}$, then minimizer, stationary-point, and mixed interpolation are equivalent.

Proof.
(i) Minimizer Interpolation $\implies$ Mixed Interpolation:
Let $w' \in \arg\min_w f(w)$. The stochastic functions $f(w, z)$ are minimized at $w'$ for all $z \in \mathcal{Z}$ by minimizer interpolation.
First-order optimality conditions thus imply ∇f(w′, z) = 0, which impliesmixed interpolation holds.(ii) Stationary-Point Interpolation =⇒ Mixed Interpolation:Let w′ ∈ arg minw f(w). First-order optimality conditions ensure w′ is a stationary point f and bystationary-point interpolation, w′ is also a stationary point of f(·, z) for all z ∈ Z. Once again,mixed interpolation holds.(iii) Mixed Interpolation, Stationary-Point Interpolation 6=⇒ Minimizer Interpolation:We construct a simple counter example. Define the finite-sum functionf(w) =12(f1(w) + f2(w)) =32w2 − 12w2,and consider the oracle O which returnsf(w, zk) = fzk(w) ∇f(wk, zk) = ∇fzk(w),where zk is supported on {1, 2}. The stochastic functions f1 and f2 are only stationary at theglobal minimum f(0) = 0, but f2 is maximized at this point. So, both stationary-point and mixedinterpolation hold, but minimizer interpolation does not.(iv) Minimizer Interpolation 6=⇒ Stationary-Point Interpolation:This also follows from a simple finite-sum function. Considerf(w) =12(f1(w) + f2(w)) = −12cos(w)− 12cos(w2),and consider the sub-sampling oracle O as above. The global minimizers of f areX0 ={(−1)t (4pi)t : t ∈ {0, 1, . . . }} .70The stochastic functions f1(w) and f2(w) are also globally minimized at every point in X0, so mini-mizer interpolation holds. However, stationary-point interpolation does not hold as f has infinitelymany stationary points which are not stationary points of f1 or f2.Finally, suppose that f is invex and O is such that f(·, z) is invex for z ∈ Z. The equivalenceof all three definitions follows immediately, since all stationary points of invex functions are globalminima.Lemma 2. Let (f,O) be a function-oracle pair. If O is unbiased andf(w, z) ≥ f(w∗) ∀w ∈ Rd, ∀z ∈ Z,holds, then (f,O) satisfies minimizer interpolation.Proof. Since f(w, z) ≥ f(w∗), the optimality gap f(w, z) − f(w∗) must be non-negative for all wand z. 
Using the fact that O is unbiased,Ezk [f(w∗, zk)− f(w∗)] = f(w∗)− f(w∗) = 0,and thus f(w∗, zk) = f(w∗) almost surely for all k. This is equivalent to f(w∗, ·) = f(w∗) point-wiseover Z. So, f(w, z) ≥ f(w∗) = f(w∗, z) for all w ∈ Rd and w∗ is a global minimizer of f(·, z). Weconclude (f,O) satisfies minimizer interpolation.A.3 Growth ConditionsLemma 3 (Formulations of Strong Growth). Let (f,O) be a function-oracle pair. If (f,O) satisfiesmaximal strong growth, then it also satisfies strong growth. However, strong growth does not implymaximal strong growth for general O.Proof. Suppose that (f,O) satisfy maximal strong growth with constant ρmax. Then,‖∇f(w, zk)‖2 ≤ ρmax‖∇f(w)‖2=⇒ Ezk[‖∇f(w, zk)‖2] ≤ Ezk [ρmax‖∇f(w)‖2]= ρmax‖∇f(w)‖2,which completes the forward direction.Now, let us show there are function-oracle pairs which satisfy strong growth, but not maximalstrong growth. Let f(w) = 12w2 and consider O such that zk ∼ N (1, 1), andf(w, zk) =zk2w2, ∇f(w, zk) = zk · w,71for all k. A simple calculation shows that this oracle is unbiased,Ezk [f(w, zk)] = Ezk[zk2w2]=12w2 = f(w),Ezk [∇f(w, zk)] = Ezk [zk · w] = w = ∇f(w).It is also trivial to verify that strong growth holds with ρ = 2,Ezk[‖∇f(w, zk)‖2] = Ezk [z2kw2] = 2w2.For maximal strong growth to hold, we require c ∈ R such thatz2k · w2 ≤ c · w2 =⇒ z2k ≤ c,almost surely. But, z2k > c with non-zero probability for any c ∈ R, so maximal strong growth doesnot hold for (f,O).Lemma 4. Let (f,O) be a function-SFO pair satisfying the strong growth condition with constantρ. Moreover, assume that the support Z of O is finite and each zk admits probability mass functionpk. Then (f,O) also satisfies maximal strong growth.Proof. We haveρ‖∇f(w)‖2 ≥ Ezk[‖∇f(w, zk)‖2]=∑z∈Z‖∇f(w, z)‖2 pk(z)≥ ‖∇f(w, z)‖2 pk(z),for all z ∈ Z. For z˜ ∈ Z such that pk(z˜) > 0, we haveρpk(z˜)‖∇f(w)‖2 ≥ ‖∇f(w, z˜)‖2=⇒ ρp∗k‖∇f(w)‖2 ≥ max{‖∇f(w, z˜)‖2 : p(z˜) > 0} ,where p∗k = min {pk(z˜) : pk(z) > 0}. 
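Before concluding, the $\rho / p^*_k$ bound in this argument is easy to sanity-check numerically. The sketch below, which is illustrative and not part of the thesis, uses a uniformly sub-sampled quadratic finite sum (so $p_k(z) = 1/n$ for every $z$), estimates the strong growth ratio empirically, and verifies that every stochastic gradient is dominated as Lemma 4 claims:

```python
import numpy as np

# Uniformly sub-sampled finite sum: f(w) = (1/n) * sum_i 0.5 * a_i * w^2,
# so the oracle support is finite and p_k(z) = 1/n for every z.
a = np.array([1.0, 2.0, 4.0])  # illustrative per-example curvatures
n = len(a)
p_min = 1.0 / n

def grad_full(w):
    # Gradient of the averaged function f.
    return np.mean(a) * w

def grads_each(w):
    # One stochastic gradient per support point z.
    return a * w

# Empirical strong growth ratio E||grad f(w,z)||^2 / ||grad f(w)||^2;
# for these quadratics it is independent of w, so evaluate at w = 1.
rho_at_1 = np.mean(grads_each(1.0) ** 2) / grad_full(1.0) ** 2

# Lemma 4: each stochastic gradient obeys the (rho / p*_k) * ||grad f||^2 bound.
for w in [0.3, -1.7, 2.5]:
    g2 = grad_full(w) ** 2
    assert np.max(grads_each(w) ** 2) <= (rho_at_1 / p_min) * g2 + 1e-12
```

For these curvatures the empirical ratio is $\rho = 9/7$, so the maximal bound $\rho / p^*_k = 27/7$ comfortably dominates the largest individual ratio $144/49 \approx 2.94$.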
We conclude that maximal strong growth holds with $\rho' = \max_k \frac{\rho}{p^*_k}$.

Lemma 5. Let $(f, \mathcal{O})$ be a function-oracle pair satisfying the strong growth condition with parameter $\rho$. If $f$ is $L$-smooth, then $(f, \mathcal{O})$ satisfies weak growth with parameter $\alpha \leq \rho$.

Proof. Lemma 18 implies
\[ \|\nabla f(w)\|^2 \leq 2L \left( f(w) - f(w^*) \right). \]
Combining this with the definition of strong growth,
\[ \mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|^2\right] \leq \rho \|\nabla f(w)\|^2 \leq 2L\rho \left( f(w) - f(w^*) \right), \]
which shows that the weak growth condition holds with $\alpha = \rho$.

Lemma 6 (Interpolation and Weak Growth). Let $f$ be an $L$-smooth function and $\mathcal{O}$ an $L_{\max}$ individually-smooth SFO. If $(f, \mathcal{O})$ satisfies minimizer interpolation, then the pair also satisfies the weak growth condition with parameter $\alpha \leq \frac{L_{\max}}{L}$.

Proof. Starting from $L_{\max}$ individual-smoothness of $\mathcal{O}$,
\[ f(u, z_k) \leq f(w, z_k) + \langle \nabla f(w, z_k), u - w \rangle + \frac{L_{\max}}{2} \|u - w\|^2. \]
Choosing $u = w - \frac{1}{L_{\max}} \nabla f(w, z_k)$,
\[ f(u, z_k) \leq f(w, z_k) - \frac{1}{L_{\max}} \langle \nabla f(w, z_k), \nabla f(w, z_k) \rangle + \frac{L_{\max}}{2 L_{\max}^2} \|\nabla f(w, z_k)\|^2 = f(w, z_k) - \frac{1}{2 L_{\max}} \|\nabla f(w, z_k)\|^2. \]
Noting that $f(u, z_k) \geq f(w^*, z_k)$ by minimizer interpolation and taking expectations with respect to $z_k$ gives the following:
\[ f(w^*, z_k) \leq f(w, z_k) - \frac{1}{2 L_{\max}} \|\nabla f(w, z_k)\|^2 \]
\[ \implies \mathbb{E}_{z_k}[f(w^*, z_k)] \leq f(w) - \frac{1}{2 L_{\max}} \mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|^2\right] \]
\[ \implies f(w^*) \leq f(w) - \frac{1}{2 L_{\max}} \mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|^2\right] \]
\[ \implies \mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|^2\right] \leq 2 L_{\max} \left( f(w) - f(w^*) \right) = 2 \left( \frac{L_{\max}}{L} \right) L \left( f(w) - f(w^*) \right). \]
We conclude that weak growth holds with $\alpha \leq \frac{L_{\max}}{L}$.

Lemma 7 (Interpolation and Strong Growth). Let $f$ be an $L$-smooth, $\mu$-Polyak-Łojasiewicz (PL) function and $\mathcal{O}$ an $L_{\max}$ individually-smooth SFO. If $(f, \mathcal{O})$ satisfies minimizer interpolation, then the pair also satisfies the strong growth condition with parameter $\rho \leq \frac{L_{\max}}{\mu}$.

Proof. Lemma 6 implies that $f$ satisfies the weak growth condition with parameter
\[ \alpha \leq \frac{L_{\max}}{L}. \]
Vaswani et al. (2019a, Proposition 1) now implies that $f$ satisfies strong growth with parameter
\[ \rho \leq \frac{\alpha L}{\mu} \leq \frac{L_{\max}}{\mu}. \]
This concludes the proof.

Appendix B

Stochastic Gradient Descent: Proofs

B.1 Convergence for Strongly Convex Functions

Lemma 14. Let $f$ be an $L$-smooth function satisfying the strong growth condition with parameter $\rho$.
Then stochastic gradient descent with fixed step-size η satisfies the following expected decreasecondition:Ezk [f(wk+1)] ≤ f(wk)− η(1− ρLη2)‖∇f(wk)‖2.Proof. Starting from L-smoothness of f ,f(wk+1) ≤ f(wk) + 〈∇f(wk), wk+1 − wk〉+ L2‖wk+1 − wk‖2= f(wk)− η 〈∇f(wk),∇f(wk, zk)〉+ Lη22‖∇f(wk, zk)‖2Taking expectations with respect to zk:=⇒ Ezk [f(wk+1)] ≤ f(wk)− η 〈Ezk [∇f(wk, zk)] ,∇f(wk)〉+Lη22Ezk[‖∇f(wk, zk)‖2]= f(wk)− η‖∇f(wk)‖2 + Lη22Ezk[‖∇f(wk, zk)‖2] .Using the strong growth condition,=⇒ Ezk [f(wk+1)] ≤ f(wk)− η‖∇f(wk)‖2 +ρLη22‖∇f(wk)‖2≤ f(wk)− η(1− ρLη2)‖∇f(wk)‖2.75Theorem 1. Let f be a µ-strongly-convex, L-smooth function and O a SFO such that (f,O)satisfies the strong growth condition with parameter ρ. Then stochastic gradient descent with fixedstep-size η ≤ 2ρ(µ+L) converges asE[‖wK − w∗‖2] ≤ (1− 2ηµLµ+ L)K‖w0 − w∗‖2.Proof.‖wk+1 − w∗‖2 = ‖wk − η∇f(wk, zk)− w∗‖2= η2‖∇f(wk, zk)‖2 − 2η 〈∇f(wk, zk), wk − w∗〉+ ‖wk − w∗‖2.Taking expectations with respect to zk,=⇒ Ezk[‖wk+1 − w∗‖2] = η2Ezk [‖∇f(wk, zk)‖2]− 2ηEzk [〈∇f(wk, zk), wk − w∗〉] + ‖wk − w∗‖2= η2Ezk[‖∇f(wk, zk)‖2]− 2η 〈∇f(wk), wk − w∗〉+ ‖wk − w∗‖2.Now we use the strong growth condition to control Ezk[‖∇f(wk, zk)‖2], which yieldsEzk[‖wk+1 − w∗‖2] ≤ η2ρ‖∇f(wk)‖2 − 2η 〈∇f(wk), wk − w∗〉+ ‖wk − w∗‖2.Coercivity of the gradient (Lemma 21) impliesEzk[‖wk+1 − w∗‖2] ≤ η2ρ‖∇f(wk)‖2 − 2η( µLµ+ L‖wk − w∗‖2 + 1µ+ L‖∇f(wk)‖2)+ ‖wk − w∗‖2= η(ηρ− 2µ+ L)‖∇f(wk)‖2 +(1− 2ηµLµ+ L)‖wk − w∗‖2.If η ≤ 2ρ(µ+L) then ηρ− 2µ+L ≤ 0 and we obtainEzk[‖wk+1 − w∗‖2] ≤ (1− 2ηµLµ+ L)‖wk − w∗‖2.Taking expectations and recursing on this inequality,=⇒ E [‖wk+1 − w∗‖2] ≤ (1− 2ηµLµ+ L)k+1‖w0 − w∗‖2.76Theorem 2. Let f be a µ-strongly-convex, L-smooth function and O an Lmax individually-smoothand convex SFO such that (f,O) satisfies minimizer interpolation. 
Then stochastic gradient descent with fixed step-size $\eta < \frac{2}{L_{\max}}$ converges as
\[ \mathbb{E}\left[\|w_K - w^*\|^2\right] \leq \left(1 - \mu \eta \left(2 - \eta L_{\max}\right)\right)^K \|w_0 - w^*\|^2. \]

Proof.
\[ \|w_{k+1} - w^*\|^2 = \|w_k - \eta \nabla f(w_k, z_k) - w^*\|^2 = \eta^2 \|\nabla f(w_k, z_k)\|^2 - 2\eta \langle \nabla f(w_k, z_k), w_k - w^* \rangle + \|w_k - w^*\|^2. \]
Recall $f(\cdot, z_k)$ is convex, $L_{\max}$-smooth, and $\nabla f(w^*, z_k) = 0$ by interpolation. By Lemma 20, we have
\[ \|\nabla f(w_k, z_k)\|^2 \leq L_{\max} \langle \nabla f(w_k, z_k), w_k - w^* \rangle. \]
Applying this to the above yields
\[ \|w_{k+1} - w^*\|^2 \leq \eta^2 L_{\max} \langle \nabla f(w_k, z_k), w_k - w^* \rangle - 2\eta \langle \nabla f(w_k, z_k), w_k - w^* \rangle + \|w_k - w^*\|^2 = -\eta \left(2 - \eta L_{\max}\right) \langle \nabla f(w_k, z_k), w_k - w^* \rangle + \|w_k - w^*\|^2. \]
Taking expectations with respect to $z_k$:
\[ \implies \mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \leq -\eta \left(2 - \eta L_{\max}\right) \mathbb{E}_{z_k}\left[\langle \nabla f(w_k, z_k), w_k - w^* \rangle\right] + \|w_k - w^*\|^2 = -\eta \left(2 - \eta L_{\max}\right) \langle \nabla f(w_k), w_k - w^* \rangle + \|w_k - w^*\|^2. \]
If $\eta \leq \frac{2}{L_{\max}}$, then $2 - \eta L_{\max} \geq 0$. Combining this with $\langle \nabla f(w_k), w_k - w^* \rangle \geq \mu \|w_k - w^*\|^2$ by Lemma 22 allows us to obtain the following:
\[ \mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \leq -\mu \eta \left(2 - \eta L_{\max}\right) \|w_k - w^*\|^2 + \|w_k - w^*\|^2 = \left(1 - \mu \eta \left(2 - \eta L_{\max}\right)\right) \|w_k - w^*\|^2. \]
Taking expectations and recursing on this inequality yields the final result,
\[ \implies \mathbb{E}\left[\|w_{k+1} - w^*\|^2\right] \leq \left(1 - \mu \eta \left(2 - \eta L_{\max}\right)\right)^{k+1} \|w_0 - w^*\|^2. \]

Theorem 3. Let $f$ be a $\mu$-strongly-convex, $L$-smooth function and $\mathcal{O}$ an $L_{\max}$ individually-smooth and $\mu_{\max}$-strongly-convex SFO such that $(f, \mathcal{O})$ satisfies minimizer interpolation. Then stochastic gradient descent with fixed step-size $\eta \leq \frac{2}{\mu_{\max} + L_{\max}}$ converges almost surely at the rate
\[ \|w_K - w^*\|^2 \leq \left(1 - 2\eta \delta_{\min}\right)^K \|w_0 - w^*\|^2, \]
where $\delta_{\min} = \min_{z \in \mathcal{Z}} \frac{\mu_z L_z}{\mu_z + L_z}$.

Proof.
\[ \|w_{k+1} - w^*\|^2 = \|w_k - \eta \nabla f(w_k, z_k) - w^*\|^2 = \eta^2 \|\nabla f(w_k, z_k)\|^2 - 2\eta \langle \nabla f(w_k, z_k), w_k - w^* \rangle + \|w_k - w^*\|^2. \]
Coercivity of the stochastic gradient (Lemma 21) implies
\[ \|w_{k+1} - w^*\|^2 \leq \eta^2 \|\nabla f(w_k, z_k)\|^2 - 2\eta \left( \frac{\mu_z L_z}{\mu_z + L_z} \|w_k - w^*\|^2 + \frac{1}{\mu_z + L_z} \|\nabla f(w_k, z_k)\|^2 \right) + \|w_k - w^*\|^2. \]
Let $\delta_{\min} = \min_{z \in \mathcal{Z}} \frac{\mu_z L_z}{\mu_z + L_z}$.
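The linear contraction established in Theorem 2 is easy to observe numerically on a consistent least-squares problem, where minimizer interpolation holds by construction. The sketch below is illustrative only; the problem sizes, seed, and iteration count are arbitrary choices rather than anything prescribed by the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent least squares: f_i(w) = 0.5 * (a_i^T w - b_i)^2 with b = A w_true,
# so every f_i is minimized at w_true and minimizer interpolation holds.
n, d = 50, 5
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
b = A @ w_true

L_max = np.max(np.sum(A ** 2, axis=1))  # individual smoothness constants ||a_i||^2
eta = 1.0 / L_max                       # admissible fixed step-size: eta < 2 / L_max

w = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)                 # sample one example uniformly
    w = w - eta * (A[i] @ w - b[i]) * A[i]

dist = float(np.linalg.norm(w - w_true))
```

Under interpolation no step-size decay or iterate averaging is needed: `dist` is driven to (near) zero at a linear rate, in contrast to the sublinear behaviour of SGD under persistent gradient noise.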
Then we have‖wk+1 − w∗‖2 ≤ η2‖∇f(wk, zk)‖2 − 2η(δmin ‖wk − w∗‖2 + 1µmax + Lmax‖∇f(wk, zk)‖2)+ ‖wk − w∗‖2= η(η − 2µmax + Lmax)‖∇f(wk, zk)‖2 + (1− 2η δmin) ‖wk − w∗‖2.If η ≤ 2(µmax+Lmax) then η − 2µmax+Lmax ≤ 0 and we obtain‖wk+1 − w∗‖2 ≤ (1− 2η δmin) ‖wk − w∗‖2.Recursing on this inequality,=⇒ ‖wk+1 − w∗‖2 ≤ (1− 2η δmin)k+1 ‖w0 − w∗‖2.78B.2 Convergence for Convex FunctionsLemma 15. Let f be a convex, L-smooth function and O an Lmax individually-smooth SFO suchthat (f,O) satisfies minimizer interpolation. Then stochastic gradient descent with step-size η sat-isfies the following inequality:f(wk)− f(w∗) ≤ 12η δ(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) ,where δ = min{1, 1 + αL(1Lmax− η)}and w∗ = ΠX ∗(w0). Furthermore, if η ≤ 1Lmax , thenf(wk)− f(w∗) ≤ 12η(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) .Proof. Let w∗ = ΠX ∗(w0). We have‖wk+1 − w∗‖2 = ‖wk − ηk∇f(wk, zk) +−w∗‖2= η2‖∇f(wk, zk)‖2 − 2η 〈∇f(wk, zk), wk − w∗〉+ ‖wk − w∗‖2.The weak growth condition implies ∇f(w∗, z) = 0 for all z ∈ Z. Lemma 19 at wk and w∗ statesf(wk, zk)− f(w∗, zk) ≤ 〈∇f(wk, zk), wk − w∗〉 − 12Lmax‖∇f(wk, zk)‖2.Substituting this into the above gives‖wk+1 − w∗‖2 ≤ η2‖∇f(wk, zk)‖2 − 2η(f(wk, zk)− f(w∗, zk) + 12Lmax‖∇f(wk, zk)‖2)+ ‖wk − w∗‖2≤(η2 − ηLmax)‖∇f(wk, zk)‖2 − 2η (f(wk, zk)− f(w∗, zk)) + ‖wk − w∗‖2.Taking expectations with respect to zk:Ezk[‖wk+1 − w∗‖2] ≤ (η2 − ηLmax)Ezk[‖∇f(wk, zk)‖2]− 2ηEzk [f(wk, zk)− f(w∗, zk)]+ ‖wk − w∗‖2,≤(η2 − ηLmax)Ezk[‖∇f(wk, zk)‖2]− 2η (f(wk)− f(w∗)) + ‖wk − w∗‖2.Case 1: If η ≤ 1Lmax then(η2 − ηLmax)(f(wk, zk)− f(w∗, zk)) ≤ 0 by interpolation and we obtainEzk[‖wk+1 − w∗‖2] ≤ −2η (f(wk)− f(w∗)) + ‖wk − w∗‖2=⇒ f(wk)− f(w∗) ≤ 12η[‖wk − w∗‖2 − Ezk‖wk+1 − w∗‖2] .79Case 2: If η > 1Lmax then η2 − ηLmax > 0 and the weak growth condition impliesEzk[‖wk+1 − w∗‖2] ≤ 2ηαL(η − 1Lmax)(f(wk)− f(w∗))− 2η (f(wk)− f(w∗)) + ‖wk − w∗‖2= −2η(1 + αL(1Lmax− η))(f(wk)− f(w∗)) + ‖wk − w∗‖2.If η < 1αL +1Lmax, then 1 + αL(1Lmax− η)> 0,=⇒ f(wk)− f(w∗) ≤ 12η(1 + αL(1Lmax− η)) (‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) 
.Let us combine the cases by taking the worst-case bound. Let δ = min{1, 1 + αL(1Lmax− η)}toobtain:f(wk)− f(w∗) ≤ 12η δ(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) .Note that this bound is tight with Case 1 since 1 + αL(1Lmax− η)≥ 1 when η ≤ 1Lmax .Theorem 4. Let f be a convex, L-smooth function and O a SFO such that (f,O) satisfies the weakgrowth condition with parameter α. Then stochastic gradient descent with fixed step-size η < 1αLconverges asE [f(w¯K)]− f(w∗) ≤ 12η (1− ηαL) K ‖w0 − w∗‖2,where w¯K =1K∑K−1k=0 wk and w∗ = ΠX ∗(w0).Proof.‖wk+1 − w∗‖2 = ‖wk − η∇f(wk, zk)− w∗‖2= η2‖∇f(wk, zk)‖2 − 2η 〈∇f(wk, zk), wk − w∗〉+ ‖wk − w∗‖2.Taking expectations with respect to zk,Ezk[‖wk+1 − w∗‖2] = η2Ezk [‖∇f(wk, zk)‖2]− 2ηEzk [〈∇f(wk, zk), wk − w∗〉] + ‖wk − w∗‖2= η2Ezk[‖∇f(wk, zk)‖2]− 2η 〈∇f(wk), wk − w∗〉+ ‖wk − w∗‖2.By convexity of f and the weak growth condition,Ezk[‖wk+1 − w∗‖2] ≤ η2Ezk [‖∇f(wk, zk)‖2]− 2η (f(wk)− f(w∗)) + ‖wk − w∗‖2≤ 2η2αL (f(wk)− f(w∗))− 2η (f(wk)− f(w∗)) + ‖wk − w∗‖2= −2η (1− ηαL) (f(wk)− f(w∗)) + ‖wk − w∗‖2. (B.1)80Rearranging the expression to put the optimality gap on the left-hand side,2η (1− ηαL) (f(wk)− f(w∗)) ≤ ‖wk − w∗‖2 − Ezk[‖wk+1 − w∗‖2] .If η < 1αL then 1− ηαL > 0,=⇒ f(wk)− f(w∗) ≤ 12η (1− ηαL)(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) .Taking expectations and summing from k = 0 to K − 1 now gives1KK−1∑k=0E [f(wk)]− f(w∗) ≤ 12η (1− ηαL) KK−1∑k=0(E[‖wk − w∗‖2]− E [‖wk+1 − w∗‖2])=12η (1− ηαL) K(‖w0 − w∗‖2 − E [‖wK − w∗‖2])≤ 12η (1− ηαL) K ‖w0 − w∗‖2.Noting 1K∑K−1k=0 f(wk) ≥ f(w¯K) by convexity leads to the final result,E [f(w¯K)]− f(w∗) ≤ 12η (1− ηαL) K ‖w0 − w∗‖2.Theorem 5. Let f be a convex, L-smooth function and O a SFO such that (f,O) satisfies theweak growth condition with parameter α. Moreover, suppose O is Lmax individually-smooth. Thenstochastic gradient descent with fixed step-size η < 1αL +1Lmaxconverges asE [f(w¯K)]− f(w∗) ≤ 12η δ K‖w0 − w∗‖2,where w¯K =1K∑K−1k=0 wk, w∗ = ΠX ∗(w0), and δ = min{1, 1 + αL(1Lmax− η)}.Proof. 
Starting from Lemma 15,
\[ f(w_k) - f(w^*) \leq \frac{1}{2\eta\delta} \left( \|w_k - w^*\|^2 - \mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \right). \]
Taking expectations and summing from $k = 0$ to $K - 1$ now gives
\[ \implies \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[f(w_k) - f(w^*)\right] \leq \frac{1}{2\eta\delta K} \sum_{k=0}^{K-1} \left( \mathbb{E}\left[\|w_k - w^*\|^2\right] - \mathbb{E}\left[\|w_{k+1} - w^*\|^2\right] \right) = \frac{1}{2\eta\delta K} \left( \|w_0 - w^*\|^2 - \mathbb{E}\left[\|w_K - w^*\|^2\right] \right) \leq \frac{1}{2\eta\delta K} \|w_0 - w^*\|^2. \]
Noting $\frac{1}{K} \sum_{k=0}^{K-1} f(w_k) \geq f(\bar{w}_K)$ by convexity leads to the final result,
\[ \implies \mathbb{E}\left[f(\bar{w}_K)\right] - f(w^*) \leq \frac{1}{2\eta\delta K} \|w_0 - w^*\|^2. \]

B.3 Almost Sure Convergence

Theorem 7. Let $f$ be a convex, $L$-smooth function with at least one finite minimizer and $\mathcal{O}$ an $L_{\max}$ individually-smooth SFO such that $(f, \mathcal{O})$ satisfies the weak growth condition with parameter $\alpha$. Then the sequence $(f(w_k))$ generated by stochastic gradient descent with fixed step-size $\eta < \frac{1}{\alpha L} + \frac{1}{L_{\max}}$ converges to the optimal function value $f(w^*)$ almost surely.

Proof. Lemma 15 gives the decrease condition
\[ \mathbb{E}\left[\|w_{k+1} - w^*\|^2 \mid \mathcal{F}_k\right] \leq \|w_k - w^*\|^2 - 2\eta\delta \left( f(w_k) - f(w^*) \right), \]
where $\delta > 0$ since $\eta < \frac{1}{\alpha L} + \frac{1}{L_{\max}}$. The conditions of Theorem 6 are satisfied with $A_k = 0$ for all $k$ and the sequence $\left(\|w_k - w^*\|^2\right)$ converges to a non-negative random variable $\lim_{k \to \infty} \|w_k - w^*\|^2$ almost surely. The theorem also guarantees
\[ \sum_{k=0}^{\infty} 2\eta\delta \left( f(w_k) - f(w^*) \right) < \infty \implies \sum_{k=0}^{\infty} \left( f(w_k) - f(w^*) \right) < \infty \tag{B.2} \]
\[ \implies \lim_{k \to \infty} f(w_k) \stackrel{a.s.}{=} f(w^*). \]
That is, stochastic gradient descent converges to the optimal function value.

Theorem 8. Let $f$ be an $L$-smooth function with at least one finite minimizer $w^*$ and $\mathcal{O}$ a SFO such that $(f, \mathcal{O})$ satisfies the strong growth condition with parameter $\rho$. Then the sequence of gradient norms $\left(\|\nabla f(w_k)\|^2\right)$ generated by stochastic gradient descent with fixed step-size $\eta < \frac{2}{\rho L}$ converges to 0 almost surely.

Proof. Lemma 14 gives the decrease condition
\[ \mathbb{E}\left[f(w_{k+1}) - f(w^*) \mid \mathcal{F}_k\right] \leq \left( f(w_k) - f(w^*) \right) - \eta \left( 1 - \frac{\rho L \eta}{2} \right) \|\nabla f(w_k)\|^2. \]
Since $\eta < \frac{2}{\rho L}$,
\[ \eta \left( 1 - \frac{\eta \rho L}{2} \right) \|\nabla f(w_k)\|^2 \geq 0, \]
and the conditions of Theorem 6 are satisfied with $A_k = 0$ for all $k$. The sequence $\left( f(w_k) - f(w^*) \right)$ converges to a non-negative random variable. Of more interest is that
\[ \sum_{k=0}^{\infty} \eta \left( 1 - \frac{\eta \rho L}{2} \right) \|\nabla f(w_k)\|^2 < \infty \implies \sum_{k=0}^{\infty} \|\nabla f(w_k)\|^2 < \infty, \tag{B.3} \]
Accordingly, the sequence of gradient norms satisfieslimk→∞‖∇f(wk)‖2 a.s.= 0,and we conclude that the sequence of gradients converges to a stationary point.B.4 Additional LemmasLemma 16. Consider the setting of Theorem 5 from Vaswani et al. (2019a): f is µ-strongly-convex,L-smooth, and (f,O) satisfies weak growth with parameter α. Then Theorem 5 can be modified toallow for a maximum step-size of η ≤ µ+LαL2and the following convergence rate:E[‖wK+1 − w∗‖2] ≤ (1− µ η + max {ηµ (ηαL− 1) , ηL (ηαL− 1)})K ‖w0 − w∗‖2.Moreover, this is tight with the original result when η = 1αL .Proof. Starting from the seventh line of the proof of Theorem 5 from Vaswani et al. (2019a),Ezk[‖wk+1 − w∗‖2] ≤ 2η (ηαL− 1) (f(wk)− f(w∗)) + (1− µ η) ‖wk − w∗‖2.Case 1: η ≤ 1αL . Then 2η (ηαL− 1) (f(wk)− f(w∗)) ≤ 0 and Lemma 18 impliesEzk[‖wk+1 − w∗‖2] ≤ ηµ (ηαL− 1) ‖wk − w∗‖2 + (1− µ η) ‖wk − w∗‖2.Case 2: µ+LαL2> η > 1αL . Then the other direction of Lemma 18 impliesEzk[‖wk+1 − w∗‖2] ≤ ηL (ηαL− 1) ‖wk − w∗‖2 + (1− µ η) ‖wk − w∗‖2.Note that the upper bound on η is required to make progress in this case. We combine cases bytaking the worst-case bound as follows:=⇒ Ezk[‖wk+1 − w∗‖2] ≤ (1− µ η) ‖wk − w∗‖2 + max {ηµ (ηαL− 1) , ηL (ηαL− 1)} ‖wk − w∗‖2= (1− µ η + max {ηµ (ηαL− 1) , ηL (ηαL− 1)}) ‖wk − w∗‖2,83Taking expectations and recursing on the inequality gives the final results,=⇒ E [‖wk+1 − w∗‖2] ≤ (1− µ η + max {ηµ (ηαL− 1) , ηL (ηαL− 1)})k+1 ‖w0 − w∗‖2.Tightness with the original result from Vaswani et al. (2019a) is immediate when η = 1αL in thelast expression above.84Appendix CLine Search: ProofsLemma 8. Let f be an L-smooth function and O an Lmax individually-smooth SFO such that (f,O)satisfies minimizer interpolation. Then the maximum possible step-size returned by the stochasticArmijo line-search constrained to lie in the (0, ηmax ] range satisfies the following inequalities:min{2(1− c)Lmax, ηmax}≤ ηk ≤ f(wk, zk)− f(w∗, zk)c‖∇f(wk, zk)‖2 .Proof. 
Let η˜k be the step-size returned by the exact Armijo line-search. The back-tracking con-straint ηk ∈ (0, ηmax ] is equivalent to choosing ηk = min {η˜k, ηmax}. Starting from individualsmoothness of O,f(wk+1, zk) ≤ f(wk, zk)− 〈∇f(wk, zk), wk+1 − wk〉+ Lmax2‖wk+1 − wk‖2= f(wk, zk)− η˜k‖∇f(wk, zk)‖2 + η˜2kLmax2‖∇f(wk, zk)‖2= f(wk, zk)− η˜k(1− Lmaxη˜k2)‖∇f(wk, zk)‖2.This implies that the stochastic Armijo condition,f(wk+1, zk) ≤ f(wk, zk)− c · η˜k‖∇f(wk, zk)‖2,is guaranteed to hold wheneverc η˜k ≤ η˜k(1− Lmaxη˜k2)=⇒ η˜k ≤ 2(1− c)Lmax.Recalling η˜k is the maximal step-size satisfying the Armijo condition leads to the lower-boundηk ≥ min{2(1−c)Lmax, ηmax}.Let us now upper-bound ηk. The argument above established that the Armijo condition is satisfied85for any step-size η ≤ 2(1−c)Lmax . Thus, it must be satisfied for ηk = min {η˜k, ηmax} and we obtainf(wk+1, zk) ≤ f(wk, zk)− c · ηk‖∇f(wk, zk)‖2=⇒ ηk ≤ f(wk, zk)− f(wk+1, zk)c‖∇f(wk, zk)‖2 .Minimizer interpolation implies f(wk+1, zk) ≥ f(w∗, zk) and thusηk ≤ f(wk, zk)− f(w∗, zk)c‖∇f(wk, zk)‖2 .This completes the proof.Lemma 9. Let f be a convex, L-smooth function and O an Lmax individually-smooth and convexSFO such that (f,O) satisfy minimizer interpolation. Then stochastic gradient descent using theArmijo line-search with c ≥ 12 satisfies the following inequality:f(wk)− f(w∗) ≤ 12max{Lmax2(1− c) ,1ηmax}(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) ,where w∗ ∈ X ∗ is arbitrary.Proof. Let w∗ ∈ X ∗ be arbitrary. Then,‖wk+1 − w∗‖2 = ‖wk − ηk∇f(wk, zk)− w∗‖2= η2k‖∇f(wk, zk)‖2 − 2ηk 〈∇f(wk, zk), wk − w∗〉+ ‖wk − w∗‖2.Minimizer interpolation implies ∇f(w∗, z) = 0 for all z ∈ Z. We may thus use Lemma 19 at wkand w∗ to obtain‖wk+1 − w∗‖2 ≤ η2k‖∇f(wk, zk)‖2 − 2ηk(f(wk, zk)− f(w∗, zk) + 12Lmax‖∇f(wk, zk)‖2)+ ‖wk − w∗‖2= ηk(ηk − 1Lmax)‖∇f(wk, zk)‖2 − 2ηk (f(wk, zk)− f(w∗, zk))+ ‖wk − w∗‖2.Our analysis now proceeds in cases on ηk.Case 1: ηk ≤ 1Lmax . 
Then ηk(ηk − 1Lmax)‖∇f(wk, zk)‖2 ≤ 0,=⇒ ‖wk+1 − wk‖2 ≤ −2ηk (f(wk, zk)− f(w∗, zk)) + ‖wk − w∗‖2≤ −2 min{2(1− c)Lmax, ηmax}(f(wk)− f(w∗)) + ‖wk − w∗‖2. (Lemma 8)86Case 2: ηk >1Lmax. Then ηk(ηk − 1Lmax)‖∇f(wk, zk)‖2 ≥ 0 and we may apply Lemma 8 to obtain‖wk+1 − w∗‖2 ≤(ηkc− 1cLmax)(f(wk, zk)− f(w∗, zk))− 2ηk (f(wk, zk)− f(w∗, zk))+ ‖wk − w∗‖2= ηk(1c− 2)(f(wk, zk)− f(w∗, zk))− 1cLmax(f(wk, zk)− f(w∗, zk))+ ‖wk − w∗‖2.If c ≥ 12 then ηk(1c − 2) (f(wk, zk)− f(w∗, zk)) ≤ 0 and‖wk+1 − w∗‖2 ≤ − 1cLmax(f(wk, zk)− f(w∗, zk)) + ‖wk − w∗‖2.We may combine cases by taking the loosest bound as follows:=⇒ ‖wk+1 − w∗‖2 ≤ −min{4(1− c)Lmax, 2ηmax,1cLmax}(f(wk, zk)− f(w∗, zk)) + ‖wk − w∗‖2.Noting 1cLmax ≥4(1−c)Lmaxfor c ∈ [0.5, 1] (with equality holding when c = 12) implies‖wk+1 − w∗‖2 ≤ −2 min{2(1− c)Lmax, ηmax}(f(wk, zk)− f(w∗, zk)) + ‖wk − w∗‖2.Taking expectations with respect to zk:=⇒ Ezk[‖wk+1 − w∗‖2] ≤ −2 min{2(1− c)Lmax, ηmax}Ezk [f(wk, zk)− f(w∗, zk)] + ‖wk − w∗‖2= −2 min{2(1− c)Lmax, ηmax}(f(wk)− f(w∗)) + ‖wk − w∗‖2.Rearranging the expression so the optimality gap is on the left-hand side completes the proof,=⇒ f(wk)− f(w∗) ≤ 12max{Lmax2(1− c) ,1ηmax}(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) .87C.1 Convergence for Strongly-Convex FunctionsTheorem 9. Let f be a µ-strongly-convex, L-smooth function and O an Lmax individually-smoothand convex SFO such that (f,O) satisfies minimizer interpolation. Then stochastic gradient descentusing the Armijo line-search with c ≥ 12 converges asE[‖wK − w∗‖2] ≤ (1− µmin{2(1− c)Lmax, ηmax})K‖w0 − w∗‖2.Proof. Starting from Lemma 9,f(wk)− f(w∗) ≤ 12max{Lmax2(1− c) ,1ηmax}(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2])=⇒ Ezk[‖wk+1 − w∗‖2] ≤ ‖wk − w∗‖2 − 2 min{2(1− c)Lmax, ηmax}(f(wk)− f(w∗))Strong-convexity implies f(wk)− f(w∗) ≥ µ2‖wk − w∗‖2 (Lemma 17). 
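In practice, the step-size guarantees of Lemma 8 are realized with a backtracking implementation of the stochastic Armijo line-search rather than an exact maximal search. The sketch below is illustrative and not part of the thesis; halving with factor $\beta = 1/2$ weakens the lower bound of Lemma 8 to $\min\{\beta \cdot 2(1-c)/L_{\max}, \eta_{\max}\}$, which the code checks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Backtracking stochastic Armijo search for f_i(w) = 0.5 * (a_i^T w - b_i)^2.
n, d = 20, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)          # consistent system: interpolation holds

def f_i(w, i):
    return 0.5 * (A[i] @ w - b[i]) ** 2

def grad_i(w, i):
    return (A[i] @ w - b[i]) * A[i]

def armijo_step(w, i, c=0.5, eta_max=10.0, beta=0.5):
    """Halve the step-size until the stochastic Armijo condition holds."""
    eta = eta_max
    g = grad_i(w, i)
    gg = float(g @ g)
    while f_i(w - eta * g, i) > f_i(w, i) - c * eta * gg:
        eta *= beta
    return eta

L_max = float(np.max(np.sum(A ** 2, axis=1)))
w = rng.standard_normal(d)
i = int(rng.integers(n))
eta = armijo_step(w, i)

# Backtracking version of Lemma 8's lower bound:
# eta >= min{beta * 2(1 - c) / L_max, eta_max}.
assert eta >= min(0.5 * 2 * (1 - 0.5) / L_max, 10.0) - 1e-12
```

The upper bound of Lemma 8 still applies unchanged: large steps are accepted only when the sampled function value permits, so the step-size adapts per sample without any knowledge of $L_{\max}$.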
Using this inequality yieldsEzk[‖wk+1 − w∗‖2] ≤ ‖wk − w∗‖2 − µmin{2(1− c)Lmax, ηmax}‖wk − w∗‖2=(1− µmin{2(1− c)Lmax, ηmax})‖wk − w∗‖2.Taking expectations and recursing on the inequality,=⇒ E [‖wk+1 − w∗‖2] ≤ (1− µmin{2(1− c)Lmax, ηmax})k+1‖w0 − w∗‖2.C.2 Convergence for Convex FunctionsTheorem 10. Let f be a convex, L-smooth function and O an Lmax individually-smooth andconvex SFO such that (f,O) satisfies minimizer interpolation. Then stochastic gradient descentusing the Armijo line-search with c ≥ 12 converges asE [f(w¯K)]− f(w∗) ≤ 12Kmax{Lmax2(1− c) ,1ηmax}‖w0 − w∗‖2,where w¯K =∑K−1k=0 wk and w∗ = ΠX ∗(w0).Proof. Starting from Lemma 9 with w∗ = ΠX ∗(w0),f(wk)− f(w∗) ≤ 12max{Lmax2(1− c) ,1ηmax}(‖wk − w∗‖2 − Ezk [‖wk+1 − w∗‖2]) .88Taking expectations and summing from k = 0 to K − 1,=⇒ 1KK−1∑k=0E [f(wk)]− f(w∗) ≤ 12Kmax{Lmax2(1− c) ,1ηmax}K−1∑k=0E[‖wk − w∗‖2 − ‖wk+1 − w∗‖2]=12Kmax{Lmax2(1− c) ,1ηmax}(‖w0 − w∗‖2 − E [‖wK − w∗‖2])≤ 12Kmax{Lmax2(1− c) ,1ηmax}‖w0 − w∗‖2.Noting 1K∑K−1k=0 f(wk) ≥ f(w¯K) by convexity leads to the final result,=⇒ E [f(w¯K)]− f(w∗) ≤ 12Kmax{Lmax2(1− c) ,1ηmax}‖w0 − w∗‖2.C.3 Convergence for Non-Convex FunctionsTheorem 11. Let f be an L-smooth function and O an Lmax individually-smooth SFO such that(f,O) satisfies the strong growth condition with parameter ρ. Then stochastic gradient descent usingthe Armijo line-search with c > 1− LmaxρL and ηmax < 2ρL converges asmink∈[K]‖∇f(wk)‖2 ≤ 1δ K(f(w0)− f(w∗)) ,where δ =(ηmax +2(1−c)Lmax)− ρ(ηmax − 2(1−c)Lmax + Lη2max).Proof. Firstly, note that for any vectors a, b ∈ Rd,‖a− b‖2 = ‖a‖2 + ‖b‖2 − 2 〈a, b〉=⇒ −〈a, b〉 = 12(‖a− b‖2 − ‖a‖2 − ‖b‖2) . (C.1)Let ∆k = f(wk+1)− f(wk). 
Starting from $L$-smoothness of $f$:
\[
\Delta_k \le \langle\nabla f(w_k), w_{k+1} - w_k\rangle + \frac{L}{2}\|w_{k+1} - w_k\|^2 = -\eta_k\langle\nabla f(w_k), \nabla f(w_k, z_k)\rangle + \frac{L\eta_k^2}{2}\|\nabla f(w_k, z_k)\|^2.
\]
Using Equation C.1 on $-\langle\nabla f(w_k), \nabla f(w_k, z_k)\rangle$,
\[
\Delta_k \le \frac{\eta_k}{2}\left(\|\nabla f(w_k) - \nabla f(w_k, z_k)\|^2 - \|\nabla f(w_k)\|^2 - \|\nabla f(w_k, z_k)\|^2\right) + \frac{L\eta_k^2}{2}\|\nabla f(w_k, z_k)\|^2
\]
\[
\implies 2\Delta_k \le \eta_k\|\nabla f(w_k) - \nabla f(w_k, z_k)\|^2 - \eta_k\left(\|\nabla f(w_k)\|^2 + \|\nabla f(w_k, z_k)\|^2\right) + L\eta_k^2\|\nabla f(w_k, z_k)\|^2.
\]
Using $\min\left\{\frac{2(1-c)}{L_{\max}}, \eta_{\max}\right\} = \eta_{\min} \le \eta_k \le \eta_{\max}$ from Lemma 8 and taking expectations with respect to $z_k$:
\[
2\Delta_k \le \eta_{\max}\|\nabla f(w_k) - \nabla f(w_k, z_k)\|^2 - \eta_{\min}\left(\|\nabla f(w_k)\|^2 + \|\nabla f(w_k, z_k)\|^2\right) + L\eta_{\max}^2\|\nabla f(w_k, z_k)\|^2
\]
\[
\implies 2\mathbb{E}_{z_k}[\Delta_k] \le \eta_{\max}\mathbb{E}_{z_k}\left[\|\nabla f(w_k) - \nabla f(w_k, z_k)\|^2\right] - \eta_{\min}\mathbb{E}_{z_k}\left[\|\nabla f(w_k)\|^2 + \|\nabla f(w_k, z_k)\|^2\right] + L\eta_{\max}^2\mathbb{E}_{z_k}\left[\|\nabla f(w_k, z_k)\|^2\right]
\]
\[
= \eta_{\max}\mathbb{E}_{z_k}\left[\|\nabla f(w_k, z_k)\|^2\right] - \eta_{\max}\|\nabla f(w_k)\|^2 - \eta_{\min}\mathbb{E}_{z_k}\left[\|\nabla f(w_k)\|^2 + \|\nabla f(w_k, z_k)\|^2\right] + L\eta_{\max}^2\mathbb{E}_{z_k}\left[\|\nabla f(w_k, z_k)\|^2\right],
\]
where the equality uses $\mathbb{E}_{z_k}[\nabla f(w_k, z_k)] = \nabla f(w_k)$. Collecting terms and applying the strong growth condition,
\[
2\mathbb{E}_{z_k}[\Delta_k] \le \left(\eta_{\max} - \eta_{\min} + L\eta_{\max}^2\right)\mathbb{E}_{z_k}\left[\|\nabla f(w_k, z_k)\|^2\right] - \left(\eta_{\max} + \eta_{\min}\right)\|\nabla f(w_k)\|^2
\]
\[
\le \rho\left(\eta_{\max} - \eta_{\min} + L\eta_{\max}^2\right)\|\nabla f(w_k)\|^2 - \left(\eta_{\max} + \eta_{\min}\right)\|\nabla f(w_k)\|^2 = -\left(\left(\eta_{\max} + \eta_{\min}\right) - \rho\left(\eta_{\max} - \eta_{\min} + L\eta_{\max}^2\right)\right)\|\nabla f(w_k)\|^2.
\]
Assuming $\delta = \left(\eta_{\max} + \eta_{\min}\right) - \rho\left(\eta_{\max} - \eta_{\min} + L\eta_{\max}^2\right) > 0$,
\[
\implies \|\nabla f(w_k)\|^2 \le \frac{2}{\delta}\mathbb{E}_{z_k}[-\Delta_k].
\]
Taking expectations and summing from $k = 0$ to $K-1$,
\[
\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|\nabla f(w_k)\|^2\right] \le \frac{2}{\delta K}\sum_{k=0}^{K-1}\mathbb{E}[-\Delta_k] = \frac{2}{\delta K}\sum_{k=0}^{K-1}\mathbb{E}\left[f(w_k) - f(w_{k+1})\right] = \frac{2}{\delta K}\mathbb{E}\left[f(w_0) - f(w_K)\right] \le \frac{2}{\delta K}\left(f(w_0) - f(w^*)\right)
\]
\[
\implies \min_{k \in [K-1]}\mathbb{E}\left[\|\nabla f(w_k)\|^2\right] \le \frac{2}{\delta K}\left(f(w_0) - f(w^*)\right).
\]
It remains to show that $\delta > 0$ holds. Our analysis proceeds in cases.

Case 1: $\eta_{\max} \le \frac{2(1-c)}{L_{\max}}$. Then $\eta_{\min} = \eta_{\max}$ and
\[
\delta = \left(\eta_{\max} + \eta_{\max}\right) - \rho\left(\eta_{\max} - \eta_{\max} + L\eta_{\max}^2\right) = 2\eta_{\max} - \rho L\eta_{\max}^2,
\]
which is strictly positive exactly when $\eta_{\max} < \frac{2}{\rho L}$.

Case 2: $\eta_{\max} > \frac{2(1-c)}{L_{\max}}$. Then $\eta_{\min} = \frac{2(1-c)}{L_{\max}}$ and
\[
\delta = \left(\eta_{\max} + \frac{2(1-c)}{L_{\max}}\right) - \rho\left(\eta_{\max} - \frac{2(1-c)}{L_{\max}} + L\eta_{\max}^2\right).
\]
This is a concave quadratic in $\eta_{\max}$ and is strictly positive when
\[
\eta_{\max} \in \left(0, \frac{(1-\rho) + \sqrt{(\rho-1)^2 + 8\rho(1+\rho)L(1-c)/L_{\max}}}{2L\rho}\right).
\]
To avoid contradiction with the case assumption $\frac{2(1-c)}{L_{\max}} < \eta_{\max}$, we require
\[
\frac{(1-\rho) + \sqrt{(\rho-1)^2 + 8\rho(1+\rho)L(1-c)/L_{\max}}}{2L\rho} > \frac{2(1-c)}{L_{\max}}
\]
\[
\implies \frac{8\rho(1+\rho)L(1-c)}{L_{\max}} > \left(\frac{4L\rho(1-c)}{L_{\max}} + (\rho - 1)\right)^2 - (\rho-1)^2 = \frac{16L^2\rho^2(1-c)^2}{L_{\max}^2} + \frac{8L\rho(\rho-1)(1-c)}{L_{\max}}
\]
\[
\implies \frac{L_{\max}}{\rho L} > 1 - c \implies c > 1 - \frac{L_{\max}}{\rho L}.
\]
The line-search requires $c \in (0, 1)$.
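As a concrete aside, the stochastic Armijo line-search analyzed in this appendix is straightforward to simulate. The sketch below is illustrative only: the function names and the interpolating least-squares instance are our own choices, not part of the thesis. Each iteration backtracks the step-size until the sampled function satisfies the sufficient-decrease condition $f(w_k - \eta_k \nabla f(w_k, z_k), z_k) \le f(w_k, z_k) - c\,\eta_k \|\nabla f(w_k, z_k)\|^2$.

```python
import numpy as np

def sgd_armijo(f_i, grad_i, w0, n, c=0.5, eta_max=1.0, beta=0.8, iters=1000, seed=0):
    """SGD with a stochastic Armijo backtracking line-search (illustrative sketch).

    Each iteration samples one index i and backtracks eta from eta_max until the
    sufficient-decrease condition holds on the sampled function:
        f_i(w - eta * g) <= f_i(w) - c * eta * ||g||^2.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(w, i)
        eta = eta_max
        # Backtrack until the Armijo condition is satisfied (guard avoids a
        # non-terminating loop when g is numerically zero).
        while f_i(w - eta * g, i) > f_i(w, i) - c * eta * (g @ g) and eta > 1e-12:
            eta *= beta
        w = w - eta * g
    return w

# Interpolating least-squares problem: b = A @ w_true, so every f_i is
# minimized simultaneously at w_true and minimizer interpolation holds.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
w_true = rng.standard_normal(5)
b = A @ w_true
f_i = lambda w, i: 0.5 * (A[i] @ w - b[i]) ** 2
grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
w = sgd_armijo(f_i, grad_i, np.zeros(5), n=20)
```

With $c = \frac{1}{2}$, the accepted step-sizes stay within the bounds guaranteed by Lemma 8, and on this strongly-convex instance the iterates converge linearly without any decreasing step-size schedule, as in Theorem 9.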
Noting that $\rho \ge 1$ by definition, we have $\frac{L_{\max}}{\rho L} > 0$ as long as $L, L_{\max} > 0$. The Lipschitz constants are strictly positive when $f$ is bounded below and non-constant. We obtain the non-empty constraint set
\[
c \in \left(1 - \frac{L_{\max}}{\rho L}, 1\right).
\]
Substituting the maximum value for $c$ into the upper bound on $\eta_{\max}$ yields a similar requirement,
\[
\eta_{\max} \in \left(0, \frac{2}{\rho L}\right).
\]
This completes the second case.

Putting the two cases together gives the final constraints on $c$ and $\eta_{\max}$ as
\[
c > 1 - \frac{L_{\max}}{\rho L}, \qquad \eta_{\max} < \frac{2}{\rho L}.
\]
We note that the upper and lower bounds on $\eta_k$ are consistent since
\[
\eta_{\min} = \min\left\{\frac{2(1-c)}{L_{\max}}, \eta_{\max}\right\} < \min\left\{\frac{2}{L_{\max}}\cdot\frac{L_{\max}}{\rho L}, \eta_{\max}\right\} = \min\left\{\frac{2}{\rho L}, \eta_{\max}\right\} \le \frac{2}{\rho L},
\]
where the strict inequality follows from the bound on $c$. In particular, taking $c \to 1$ and $\eta_{\max} \to \frac{2}{\rho L}$ yields an adaptive step-size $\eta_k \in (0, \frac{2}{\rho L})$.

Appendix D

Acceleration: Proofs

D.1 Estimating Sequences

Lemma 11. Let $(\lambda_k)$ and $(\phi_k)$ be as defined in Lemma 10. Then the reformulated stochastic accelerated gradient descent (R-SAGD) and SAGD algorithms are equivalent.

Proof. We expand the update for $y_k$ using the canonical form of $\phi_k$. Since $\phi_k(w) = \phi_k^* + \frac{\gamma_k}{2}\|w - v_k\|^2$, we have $\nabla\phi_k(w_k) = \gamma_k(w_k - v_k)$ and thus
\[
y_k = w_k - \frac{\alpha_k}{\gamma_k + \alpha_k\mu}\nabla\phi_k(w_k) = w_k - \frac{\alpha_k\gamma_k}{\gamma_k + \alpha_k\mu}(w_k - v_k) = \frac{(1-\alpha_k)\gamma_k + \alpha_k\mu}{\gamma_k + \alpha_k\mu}w_k + \frac{\alpha_k\gamma_k}{\gamma_k + \alpha_k\mu}v_k.
\]
Recalling $\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu$ gives
\[
y_k = \frac{\gamma_{k+1}}{\gamma_k + \alpha_k\mu}w_k + \frac{\alpha_k\gamma_k}{\gamma_k + \alpha_k\mu}v_k = \frac{1}{\gamma_k + \alpha_k\mu}\left(\gamma_{k+1}w_k + \alpha_k\gamma_k v_k\right).
\]
This is identical to update Step (b) of Constant Step-Size Scheme I given by Nesterov (2004, Eq. 2.2.19). The equivalence of Constant Step-Size Scheme I and accelerated gradient descent (AGD), called Constant Step-Size Scheme II, is established by Nesterov (2004, Page 92), albeit with a different step-size for the gradient step, which completes the proof.

Lemma 12. Let $f$ be a $\mu$-strongly-convex, $L$-smooth function (with $\mu = 0$ in the convex case) and $\mathcal{O}$ a SFO such that $(f, \mathcal{O})$ satisfies strong growth with parameter $\rho$. If $\phi_0^* = f(w_0)$ and $\gamma_0 \ge 0$ is independent of the random process $(z_k)$, then for all $k \in \mathbb{N}$ R-SAGD satisfies
\[
\mathbb{E}\left[\inf_w \phi_k(w)\right] = \mathbb{E}[\phi_k^*] \ge \mathbb{E}[f(w_k)].
\]
Proof. First we establish a few preliminaries.
The condition $\gamma_0 \perp\!\!\!\perp (z_k)$ implies that the sequences $(\gamma_k)_{k=0}^\infty$, $(\alpha_k)_{k=0}^\infty$, and $(\lambda_k)_{k=0}^\infty$ evolve independently of the stochastic processes $(y_k)_{k=0}^\infty$, $(w_k)_{k=0}^\infty$, and $(v_k)_{k=0}^\infty$. This will be necessary later in the proof. Next, invoking Lemma 14 with $\eta = \frac{1}{\rho L}$ yields
\[
\mathbb{E}_{z_k}[f(w_{k+1})] \le f(w_k) - \frac{1}{2\rho L}\|\nabla f(w_k)\|^2.
\]
Taking expectations with respect to $z_0, \ldots, z_{k-1}$,
\[
\implies \mathbb{E}[f(w_{k+1})] \le \mathbb{E}[f(w_k)] - \frac{1}{2\rho L}\mathbb{E}\left[\|\nabla f(w_k)\|^2\right]. \quad \text{(D.1)}
\]
Now we move on to the main argument, which proceeds by induction. The choice of $\phi_0^* = f(w_0)$ ensures $\inf \phi_0(w) = f(w_0)$ deterministically, which is the base case for induction. The inductive assumption is $\mathbb{E}[\inf \phi_k(w)] \ge \mathbb{E}[f(w_k)]$; let us use this to show
\[
\mathbb{E}\left[\inf_w \phi_{k+1}(w)\right] \ge \mathbb{E}[f(w_{k+1})].
\]
Recall the explicit form of the minimizer $\inf \phi_{k+1}(w) = \phi_{k+1}^*$:
\[
\phi_{k+1}^* = (1-\alpha_k)\phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k)\|^2 + \frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right).
\]
Taking expectations with respect to $z_0, \ldots, z_k$ and using $\gamma_0 \perp\!\!\!\perp (z_k)$ plus linearity of expectation:
\[
\mathbb{E}[\phi_{k+1}^*] = (1-\alpha_k)\mathbb{E}[\phi_k^*] + \mathbb{E}\left[\alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k)\|^2\right] + \mathbb{E}\left[\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right]
\]
\[
\ge (1-\alpha_k)\mathbb{E}[f(w_k)] + \mathbb{E}\left[\alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k)\|^2\right] + \mathbb{E}\left[\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right],
\]
where the inequality follows from the inductive assumption. Convexity of $f$ implies $f(w_k) \ge f(y_k) + \langle\nabla f(y_k), w_k - y_k\rangle$. Recalling $\frac{\alpha_k^2}{\gamma_{k+1}} = \frac{1}{\rho L}$ (see Figure 5.2) allows us to obtain
\[
\mathbb{E}[\phi_{k+1}^*] \ge (1-\alpha_k)\mathbb{E}\left[f(y_k) + \langle\nabla f(y_k), w_k - y_k\rangle\right] + \mathbb{E}\left[\alpha_k f(y_k) - \frac{1}{2\rho L}\|\nabla f(y_k)\|^2\right] + \mathbb{E}\left[\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right].
\]
Combining the $(1-\alpha_k)f(y_k)$ and $\alpha_k f(y_k)$ terms,
\[
\mathbb{E}[\phi_{k+1}^*] \ge \mathbb{E}\left[f(y_k) - \frac{1}{2\rho L}\|\nabla f(y_k)\|^2\right] + (1-\alpha_k)\mathbb{E}\left[\langle\nabla f(y_k), w_k - y_k\rangle + \frac{\alpha_k\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right].
\]
Equation D.1 now implies
\[
\mathbb{E}[\phi_{k+1}^*] \ge \mathbb{E}[f(w_{k+1})] + (1-\alpha_k)\mathbb{E}\left[\langle\nabla f(y_k), w_k - y_k\rangle + \frac{\alpha_k\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right].
\]
The remainder of the proof is largely unchanged from the deterministic case.
The definition of R-SAGD gives $w_k - y_k = \frac{\alpha_k}{\gamma_k + \alpha_k\mu}\nabla\phi_k(w_k)$, which we use to obtain
\[
\mathbb{E}[\phi_{k+1}^*] \ge \mathbb{E}[f(w_{k+1})] + (1-\alpha_k)\mathbb{E}\left[\frac{\alpha_k}{\gamma_k + \alpha_k\mu}\langle\nabla f(y_k), \nabla\phi_k(w_k)\rangle + \frac{\alpha_k\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right)\right].
\]
Noting that $\nabla\phi_k(w_k) = \gamma_k(w_k - v_k)$ by Equation 5.1 and $v_k - y_k = \frac{\gamma_{k+1}}{\gamma_k + \alpha_k\mu}(v_k - w_k)$ gives
\[
\mathbb{E}[\phi_{k+1}^*] \ge \mathbb{E}[f(w_{k+1})] + (1-\alpha_k)\mathbb{E}\left[\frac{\alpha_k\gamma_k}{\gamma_k + \alpha_k\mu}\langle\nabla f(y_k), w_k - v_k\rangle + \frac{\alpha_k\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \frac{\gamma_{k+1}}{\gamma_k + \alpha_k\mu}\langle\nabla f(y_k), v_k - w_k\rangle\right)\right]
\]
\[
= \mathbb{E}[f(w_{k+1})] + \frac{\mu\alpha_k(1-\alpha_k)\gamma_k}{2\gamma_{k+1}}\mathbb{E}\left[\|y_k - v_k\|^2\right] \ge \mathbb{E}[f(w_{k+1})],
\]
since $\frac{\mu\alpha_k(1-\alpha_k)\gamma_k}{2\gamma_{k+1}} \ge 0$. We conclude that $\mathbb{E}[\inf \phi_k(w)] \ge \mathbb{E}[f(w_k)]$ holds for all $k \in \mathbb{N}$ by induction.

D.2 Convergence for Strongly-Convex Functions

Theorem 12. Let $f$ be a $\mu$-strongly-convex, $L$-smooth function with $\mu > 0$ and $\mathcal{O}$ a SFO such that $(f, \mathcal{O})$ satisfies strong growth with parameter $\rho$. Moreover, choose $\gamma_0 = \mu$, $v_0 = w_0$, and $\phi_0^* = f(w_0)$. Then R-SAGD converges as
\[
\mathbb{E}[f(w_K)] - f(w^*) \le \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^K\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w^* - w_0\|^2\right).
\]
Proof. We obtain the following inequality immediately:
\[
\mathbb{E}[f(w_k)] \le \mathbb{E}\left[\inf_x \phi_k(x)\right] \quad \text{(Lemma 12)}
\]
\[
\le \mathbb{E}[\phi_k(w^*)] \le (1-\lambda_k)f(w^*) + \lambda_k\phi_0(w^*) \quad \text{(Lemma 10)}
\]
\[
\implies \mathbb{E}[f(w_k)] - f(w^*) \le \lambda_k\left(\phi_0(w^*) - f(w^*)\right) = \lambda_k\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w_0 - w^*\|^2\right). \quad \text{(D.2)}
\]
The convergence rate of R-SAGD is determined by the speed at which $\lambda_k$ converges to 0, which we shall now derive. The canonical form of $\phi_k$ (Equation 5.1) gives the following expression for $\gamma_k$:
\[
\gamma_k = (1-\alpha_{k-1})\gamma_{k-1} + \alpha_{k-1}\mu.
\]
The choice $\gamma_0 = \mu$ now yields
\[
\gamma_k = (1-\alpha_{k-1})\mu + \alpha_{k-1}\mu = \mu
\]
by induction on $k$. The definition of R-SAGD (Figure 5.2) gives
\[
\alpha_k^2 = \frac{\gamma_{k+1}}{\rho L} = \frac{\mu}{\rho L} \implies \alpha_k = \sqrt{\frac{\mu}{\rho L}}.
\]
Finally, the choice of estimating sequences in Lemma 10 gives $\lambda_0 = 1$ and
\[
\lambda_k = (1-\alpha_k)\lambda_{k-1} = \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)\lambda_{k-1}
\]
for every $k > 0$. Recursing on this equality gives
\[
\lambda_k = \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^k\lambda_0 = \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^k.
\]
Substituting this expression into Equation D.2 yields the final result:
\[
\mathbb{E}[f(w_k)] - f(w^*) \le \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^k\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w^* - w_0\|^2\right).
\]

D.3 Convergence for Convex Functions

Theorem 13.
Let $f$ be a convex, $L$-smooth function and $\mathcal{O}$ a SFO such that $(f, \mathcal{O})$ satisfies strong growth with parameter $\rho$. Moreover, choose $\gamma_0 = 2\rho L$, $v_0 = w_0$, and $\phi_0^* = f(w_0)$. Then R-SAGD converges as
\[
\mathbb{E}[f(w_K)] - f(w^*) \le \frac{2}{(K+1)^2}\left(f(w_0) - f(w^*) + \rho L\|w_0 - w^*\|^2\right).
\]
Proof. We obtain the following inequality immediately:
\[
\mathbb{E}[f(w_k)] \le \mathbb{E}\left[\inf_x \phi_k(x)\right] \quad \text{(Lemma 12)}
\]
\[
\le \mathbb{E}[\phi_k(w^*)] \le (1-\lambda_k)f(w^*) + \lambda_k\phi_0(w^*) \quad \text{(Definition 9)}
\]
\[
\implies \mathbb{E}[f(w_k)] - f(w^*) \le \lambda_k\left(\phi_0(w^*) - f(w^*)\right) = \lambda_k\left(f(w_0) - f(w^*) + \rho L\|w_0 - w^*\|^2\right). \quad \text{(D.3)}
\]
We can see that the convergence rate of R-SAGD is controlled by the sequence $(\lambda_k)$; let us analyze its rate of convergence to 0. The canonical form of $\phi_k$ (Equation 5.1) gives the following expression for $\gamma_k$:
\[
\gamma_k = (1-\alpha_{k-1})\gamma_{k-1} + \alpha_{k-1}\mu = (1-\alpha_{k-1})\gamma_{k-1},
\]
since $\mu = 0$. The definition of R-SAGD (Figure 5.2) and our choice of estimating sequences (Lemma 10) give
\[
\alpha_k^2 = \frac{\gamma_{k+1}}{\rho L}, \qquad \lambda_k = (1-\alpha_k)\lambda_{k-1}.
\]
Now we must upper-bound $\lambda_k$. Let $\tilde{L} = \rho L$. Invoking Lemma 2.2.4 of Nesterov (2004) with $\tilde{L}$ implies
\[
\lambda_k \le \frac{4\tilde{L}}{\gamma_0(k+1)^2} = \frac{2}{(k+1)^2}.
\]
Substituting this into Equation D.3 yields the final result:
\[
\mathbb{E}[f(w_k)] - f(w^*) \le \frac{2}{(k+1)^2}\left(f(w_0) - f(w^*) + \rho L\|w_0 - w^*\|^2\right).
\]

Appendix E

Beyond Interpolation: Proofs

Lemma 13. Let $f$ be an $L$-smooth function with at least one finite minimizer $w^+$ and $\mathcal{O}$ a SFO such that $(f, \mathcal{O})$ satisfies the weak growth condition with parameter $\rho$. Then the L2-regularized problem $(F, \mathcal{O}_2)$ satisfies the following inequality for all $w \in \mathbb{R}^d$ and $k \ge 0$:
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\max\{\rho L, \lambda\}\left(F(w) - F(w^*) - \frac{\lambda - L}{2}\|w^* - w^+\|^2 + \frac{\lambda}{2}\|w^+\|^2\right),
\]
where $w^*$ minimizes the regularized function $F$. Moreover, this can be improved to
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\max\{\rho L, \lambda\}\left(F(w) - F(w^*) - \frac{\lambda + \mu}{2}\|w^* - w^+\|^2 + \frac{\lambda}{2}\|w^+\|^2\right)
\]
if $f$ is $\mu$-strongly-convex (with $\mu = 0$ giving the convex case).

Proof.
By definition of the L2-regularized oracle $\mathcal{O}_2$,
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] = \mathbb{E}_{z_k}\left[\|\nabla f(w, z_k) + \lambda w\|^2\right].
\]
Young's inequality for products and the weak growth condition imply
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le \mathbb{E}_{z_k}\left[2\|\nabla f(w, z_k)\|^2 + 2\lambda^2\|w\|^2\right] \le 4\rho L\left(f(w) - f(w^+)\right) + 2\lambda^2\|w\|^2.
\]
We now proceed in cases depending on the degree of regularization. If $\lambda \le \rho L$,
\[
\implies \mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\rho L\left(f(w) - f(w^+) + \frac{\lambda^2}{2\rho L}\|w\|^2\right) \le 4\rho L\left(f(w) - f(w^+) + \frac{\lambda}{2}\|w\|^2\right) = 4\rho L\left(F(w) - f(w^+)\right).
\]
On the other hand, if $\lambda > \rho L$,
\[
\implies \mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\lambda\left(f(w) - f(w^+) + \frac{\lambda}{2}\|w\|^2\right) = 4\lambda\left(F(w) - f(w^+)\right).
\]
Combining the two cases yields
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\max\{\rho L, \lambda\}\left(F(w) - f(w^+)\right).
\]
Let $\delta = \mu$ if $f$ is $\mu$-strongly-convex and set $\delta = -L$ otherwise. Strong convexity (or smoothness if $\delta = -L$) implies $f(w^+) \ge f(w^*) + \langle\nabla f(w^*), w^+ - w^*\rangle + \frac{\delta}{2}\|w^+ - w^*\|^2$. Substituting this into the bound gives
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\max\{\rho L, \lambda\}\left(F(w) - f(w^*) - \langle\nabla f(w^*), w^+ - w^*\rangle - \frac{\delta}{2}\|w^+ - w^*\|^2\right).
\]
Recall that $w^*$ minimizes the regularized function $F$. First-order optimality conditions imply $\nabla f(w^*) = -\lambda w^*$ and we obtain
\[
\mathbb{E}_{z_k}\left[\|\nabla F(w, z_k)\|^2\right] \le 4\max\{\rho L, \lambda\}\left(F(w) - f(w^*) + \lambda\langle w^*, w^+ - w^*\rangle - \frac{\delta}{2}\|w^+ - w^*\|^2\right)
\]
\[
= 4\max\{\rho L, \lambda\}\left(F(w) - f(w^*) - \lambda\|w^*\|^2 + \lambda\langle w^*, w^+\rangle - \frac{\delta}{2}\|w^+ - w^*\|^2\right)
\]
\[
= 4\max\{\rho L, \lambda\}\left(F(w) - F(w^*) - \frac{\lambda}{2}\|w^*\|^2 + \lambda\langle w^*, w^+\rangle - \frac{\delta}{2}\|w^+ - w^*\|^2\right)
\]
\[
= 4\max\{\rho L, \lambda\}\left(F(w) - F(w^*) - \frac{\lambda + \delta}{2}\|w^* - w^+\|^2 + \frac{\lambda}{2}\|w^+\|^2\right).
\]
Substituting in the appropriate value for $\delta$ completes the proof.

E.1 Convergence for L2-Regularized Convex Functions

Theorem 14. Let $f$ be a $\mu$-strongly-convex (with $\mu = 0$ in the convex case), $L$-smooth function with at least one finite minimizer $w^+$ and $\mathcal{O}$ a SFO such that the pair $(f, \mathcal{O})$ satisfies the weak growth condition with constant $\rho$.
Then stochastic gradient descent with constant step-size $\eta \le \frac{\mu + \lambda}{\max\{\rho L, \lambda\}\left((\mu+\lambda) + (L+\lambda)\right)}$ obtains the following convergence rate for the L2-regularized problem $(F, \mathcal{O}_2)$:
\[
\mathbb{E}\left[\|w_K - w^*\|^2\right] \le \left(1 - \frac{2\eta(\mu+\lambda)(L+\lambda)}{(\mu+\lambda) + (L+\lambda)}\right)^K\|w_0 - w^*\|^2 + \gamma\lambda\|w^+\|^2 - \gamma(\lambda+\mu)\|w^* - w^+\|^2,
\]
where $\gamma = \frac{\eta\max\{\rho L, \lambda\}\left[(\mu+\lambda) + (L+\lambda)\right]}{(\mu+\lambda)(L+\lambda)}$.

Proof. Expanding the squared distance,
\[
\|w_{k+1} - w^*\|^2 = \|w_k - \eta\nabla F(w_k, z_k) - w^*\|^2 = \eta^2\|\nabla F(w_k, z_k)\|^2 - 2\eta\langle\nabla F(w_k, z_k), w_k - w^*\rangle + \|w_k - w^*\|^2.
\]
Taking expectations with respect to $z_k$,
\[
\implies \mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] = \eta^2\mathbb{E}_{z_k}\left[\|\nabla F(w_k, z_k)\|^2\right] - 2\eta\langle\nabla F(w_k), w_k - w^*\rangle + \|w_k - w^*\|^2.
\]
Let $C = 2\eta^2\max\{\rho L, \lambda\}\left(\lambda\|w^+\|^2 - (\lambda+\mu)\|w^* - w^+\|^2\right)$. Then Lemma 13 implies
\[
\mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \le 4\eta^2\max\{\rho L, \lambda\}\left(F(w_k) - F(w^*)\right) - 2\eta\langle\nabla F(w_k), w_k - w^*\rangle + \|w_k - w^*\|^2 + C.
\]
Using $(\mu+\lambda)$-strong-convexity of $F$,
\[
\implies \mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \le \frac{2\eta^2\max\{\rho L, \lambda\}}{\mu+\lambda}\|\nabla F(w_k)\|^2 - 2\eta\langle\nabla F(w_k), w_k - w^*\rangle + \|w_k - w^*\|^2 + C.
\]
Coercivity of the gradient (Lemma 21) now implies
\[
\mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \le \frac{2\eta^2\max\{\rho L, \lambda\}}{\mu+\lambda}\|\nabla F(w_k)\|^2 - 2\eta\left(\frac{(\mu+\lambda)(L+\lambda)}{(\mu+\lambda)+(L+\lambda)}\|w_k - w^*\|^2 + \frac{1}{(\mu+\lambda)+(L+\lambda)}\|\nabla F(w_k)\|^2\right) + \|w_k - w^*\|^2 + C
\]
\[
= 2\eta\left(\frac{\eta\max\{\rho L, \lambda\}}{\mu+\lambda} - \frac{1}{(\mu+\lambda)+(L+\lambda)}\right)\|\nabla F(w_k)\|^2 + \left(1 - \frac{2\eta(\mu+\lambda)(L+\lambda)}{(\mu+\lambda)+(L+\lambda)}\right)\|w_k - w^*\|^2 + C.
\]
If $\eta \le \frac{\mu+\lambda}{\max\{\rho L, \lambda\}\left((\mu+\lambda)+(L+\lambda)\right)}$, then we obtain
\[
\mathbb{E}_{z_k}\left[\|w_{k+1} - w^*\|^2\right] \le \left(1 - \frac{2\eta(\mu+\lambda)(L+\lambda)}{(\mu+\lambda)+(L+\lambda)}\right)\|w_k - w^*\|^2 + C.
\]
Let $\delta = \frac{2\eta(\mu+\lambda)(L+\lambda)}{(\mu+\lambda)+(L+\lambda)}$. Taking expectations and recursing on this inequality,
\[
\implies \mathbb{E}\left[\|w_{k+1} - w^*\|^2\right] \le (1-\delta)^{k+1}\|w_0 - w^*\|^2 + C\sum_{l=0}^k(1-\delta)^l = (1-\delta)^{k+1}\|w_0 - w^*\|^2 + C\left(\frac{1 - (1-\delta)^{k+1}}{\delta}\right) \le (1-\delta)^{k+1}\|w_0 - w^*\|^2 + \frac{C}{\delta}.
\]
Letting $\gamma = \frac{\eta\max\{\rho L, \lambda\}\left[(\mu+\lambda)+(L+\lambda)\right]}{(\mu+\lambda)(L+\lambda)}$ and substituting in the value of $\delta$ gives the final result:
\[
\implies \mathbb{E}\left[\|w_{k+1} - w^*\|^2\right] \le \left(1 - \frac{2\eta(\mu+\lambda)(L+\lambda)}{(\mu+\lambda)+(L+\lambda)}\right)^{k+1}\|w_0 - w^*\|^2 + \gamma\lambda\|w^+\|^2 - \gamma(\lambda+\mu)\|w^* - w^+\|^2.
\]

Appendix F

Useful Lemmas

Lemma 17. Let $f$ be a $\mu$-strongly-convex, $L$-smooth function.
Then, for all $w \in \mathbb{R}^d$, the optimality gap and distance to the minimizer can be related as follows:
\[
\frac{\mu}{2}\|w - w^*\|^2 \le f(w) - f(w^*) \le \frac{L}{2}\|w - w^*\|^2.
\]
The proof of Lemma 17 follows immediately from the definitions of Lipschitz-smoothness and strong convexity evaluated at $w$ and $w^*$.

Lemma 18. Let $f$ be an $L$-smooth function. Then $f$ satisfies the following inequality for all $w \in \mathbb{R}^d$:
\[
\frac{1}{2L}\|\nabla f(w)\|^2 \le f(w) - f(w^*).
\]
Similarly, if $f$ is $\mu$-strongly-convex, then $f$ satisfies the Polyak-Łojasiewicz condition,
\[
f(w) - f(w^*) \le \frac{1}{2\mu}\|\nabla f(w)\|^2 \quad \forall\, w \in \mathbb{R}^d.
\]
See Karimi et al. (2016) for proof.

Lemma 19. Let $f$ be a convex, $L$-smooth function. Then $f$ satisfies the following inequality for all $w, v \in \mathbb{R}^d$:
\[
f(w) - f(v) \le \langle\nabla f(w), w - v\rangle - \frac{1}{2L}\|\nabla f(w) - \nabla f(v)\|^2.
\]
See Bubeck (2015, Lemma 3.5) for proof.

Lemma 20. Let $f$ be a convex, $L$-smooth function. Then $f$ satisfies the following inequality:
\[
\langle\nabla f(w) - \nabla f(v), w - v\rangle \ge \frac{1}{L}\|\nabla f(w) - \nabla f(v)\|^2.
\]
See Bubeck (2015, Eq. 3.6) for proof.

Lemma 21 (Coercivity of the Gradient). Let $f$ be a $\mu$-strongly-convex, $L$-smooth function. Then $f$ satisfies the following inequality:
\[
\langle\nabla f(w) - \nabla f(v), w - v\rangle \ge \frac{\mu L}{\mu + L}\|w - v\|^2 + \frac{1}{\mu + L}\|\nabla f(w) - \nabla f(v)\|^2.
\]
See Bubeck (2015, Lemma 3.11) for proof.

Lemma 22. Let $f$ be a $\mu$-strongly-convex, $L$-smooth function. Then $f$ satisfies the following inequality:
\[
\langle\nabla f(w) - \nabla f(v), w - v\rangle \ge \mu\|w - v\|^2.
\]
See Nesterov (2004, Theorem 2.1.9) for proof.
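To make the accelerated rates concrete, the following sketch runs Nesterov's constant step-size scheme with stochastic gradients and the step-size $\frac{1}{\rho L}$ appearing in Theorem 12. It is an illustration under our own assumptions, not code from the thesis: the problem is an interpolating least-squares instance, and the strong growth constant is estimated via the bound $\rho = L_{\max}/\mu$, which holds for this problem under interpolation.

```python
import numpy as np

# Interpolating least-squares: f(w) = (1 / 2n) * ||A w - b||^2 with b = A w_true,
# so the stochastic gradients vanish at w_true and strong growth holds.
rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
b = A @ w_true

eigs = np.linalg.eigvalsh(A.T @ A) / n  # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]               # strong convexity / smoothness of f
L_max = np.max(np.sum(A ** 2, axis=1))  # largest individual smoothness constant
rho = L_max / mu                        # strong growth bound under interpolation

# Nesterov's constant step-size scheme with L replaced by rho * L; Theorem 12
# predicts the accelerated rate (1 - sqrt(mu / (rho * L)))^K in this setting.
eta = 1.0 / (rho * L)
beta = (np.sqrt(rho * L) - np.sqrt(mu)) / (np.sqrt(rho * L) + np.sqrt(mu))
w = y = np.zeros(d)
for _ in range(3000):
    i = rng.integers(n)                 # sample a stochastic gradient at y
    g = (A[i] @ y - b[i]) * A[i]
    w_next = y - eta * g                # gradient step
    y = w_next + beta * (w_next - w)    # momentum step
    w = w_next
```

Setting $\rho = 1$ recovers the deterministic scheme with exact gradients; larger estimates of $\rho$ shrink the step-size and slow the rate, but convergence is retained as long as strong growth holds with that constant.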
