Towards a Tight Asymptotic Bound For One-Dimensional Stochastic Gradient Descent

William Lu, Nick Harvey

April 27th, 2021

Contents

1 Abstract
2 Background
3 Literature Review
  3.1 Shamir and Zhang (2013)
  3.2 Emmanuel Sales (2020)
  3.3 Koren and Segal (2020)
4 Main Results
  4.1 Stochastic processes - Martingales
  4.2 Biased random walks - Generating functions
  4.3 Biased random walks - Combinatorics
5 Future Work
References

1 Abstract

Supervised machine learning algorithms work by optimizing model parameters to minimize a loss (objective) function. For many classes of models, such as robust regression and logistic regression, there is no closed-form solution for the minimizers. Therefore, iterative optimization algorithms play a key role in training machine learning models. The stochastic gradient descent (SGD) algorithm requires very few assumptions on the objective function and is therefore one of the most frequently used optimizers in practice.

When training on large datasets, it is useful to know how fast an iterative optimizer converges to the solution. A common quantification of convergence rate is the expected suboptimality at the final iterate, denoted $\mathbb{E}[f(\vec{x}_T)] - f^*$.
Prior work has established an upper bound of $O(\frac{\ln T}{\sqrt{T}})$ and a lower bound of $\Omega(\frac{1}{\sqrt{T}})$ on the expected suboptimality of SGD.

In this thesis, we conjecture that the upper bound on the expected suboptimality of SGD can be improved to $O(\frac{1}{\sqrt{T}})$ when the objective function is one-dimensional. We prove this conjecture under a restricted setting. Our analysis frames this problem through the lens of stochastic processes and random walks. To investigate the limiting behaviour of biased random walks, we utilize techniques from generating functions and combinatorics. We make standard assumptions of convexity and Lipschitz continuity on the objective function. For simplicity, we assume a constant step size $\eta_t = \Theta(\frac{1}{\sqrt{T}})$.

2 Background

In this thesis, we define $[\![a, b]\!] = \{x \in \mathbb{Z} : a \le x \le b\}$ as a shorthand for integer intervals.

Supervised machine learning algorithms take a labelled dataset as input and return a model that fits the input dataset. This model can then make predictions on new data that comes from the same distribution as the labelled data. Formally, given a training set consisting of features $A$ and ground-truth labels $\vec{b}$, the algorithm optimizes the model parameters $\vec{x}$ to minimize a loss function $f(\vec{x})$ quantifying how poor the model's performance is.

In the paradigm of maximum likelihood estimation, we aim to maximize the probability of seeing the training set given the model parameters. We assume the training examples are independently generated. Thus, the probability of seeing the entire training set is the product of the probabilities of seeing the individual training examples.
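As a numerical aside (our own sketch, not part of the original analysis): multiplying many per-example probabilities underflows floating point very quickly, which is one practical reason to work with logarithms, as the next step does. The per-example probability of 0.01 below is a hypothetical value chosen only for illustration:

```python
import math

# Hypothetical dataset of 1000 examples, each assigned probability 0.01 by some model.
probs = [0.01] * 1000

# The direct product underflows to exactly 0.0 in double precision...
direct = 1.0
for p in probs:
    direct *= p

# ...while the equivalent sum of logarithms stays perfectly representable.
log_prob = sum(math.log(p) for p in probs)

print(direct)    # 0.0 (underflow: the true value is 10^-2000)
print(log_prob)  # 1000 * ln(0.01) = -4605.17...
```

The sum of logs is also the quantity that is easiest to differentiate term by term, which is the point made formally below.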
Since the natural logarithm is monotonically increasing, maximizing the likelihood is equivalent to minimizing the negative log-probability, which converts the product into a more easily differentiated sum:

$$f(\vec{x}) = -\ln \Pr[\vec{b} \mid A, \vec{x}] \quad \text{(definition of maximum likelihood estimation)}$$
$$= -\ln \prod_i \Pr[b_i \mid \vec{a}_i, \vec{x}] \quad \text{(independence of training examples)}$$
$$= -\sum_i \ln \Pr[b_i \mid \vec{a}_i, \vec{x}] \quad \text{(logarithm turns product into sum)}$$

In the paradigm of maximum a posteriori estimation, we aim to maximize the probability of the model parameters given the training set. This is a Bayesian approach which incorporates the prior probability of the model parameters. The practice of penalizing more complex models by assigning lower prior probabilities to them is called regularization, and is integral to preventing overfitting to noise in the training set. This paradigm gives rise to the loss-regularizer framework, where the loss function is the sum of the negative log-likelihood and the negative log-prior:

$$\Pr[\vec{x} \mid A, \vec{b}] \propto \Pr[\vec{b} \mid A, \vec{x}]\Pr[\vec{x}] \quad \text{(Bayes' Theorem)}$$
$$f(\vec{x}) = -\ln\big\{\Pr[\vec{b} \mid A, \vec{x}]\Pr[\vec{x}]\big\} \quad \text{(definition of maximum a posteriori estimation)}$$
$$= -\ln\Big\{\prod_i \Pr[b_i \mid \vec{a}_i, \vec{x}] \cdot \Pr[\vec{x}]\Big\} \quad \text{(independence of training examples)}$$
$$= \underbrace{-\sum_i \ln \Pr[b_i \mid \vec{a}_i, \vec{x}]}_{\text{negative log-likelihood}} \;\underbrace{-\; \ln \Pr[\vec{x}]}_{\text{negative log-prior}} \quad \text{(logarithm turns product into sum)}$$

For most classical machine learning models, the corresponding loss function is convex. Intuitively, a convex function is shaped like a bowl, so every local minimum is a global minimum. An alternative characterization of a convex function is that the area above the function is a convex set. A set is convex iff the line segment between any two points in the set stays in the set. The formal definitions of convexity are:

Definition 2.1 (Convex set). A set $S$ is convex if $\forall \vec{x}, \vec{y} \in S, \forall t \in [0, 1], t\vec{x} + (1-t)\vec{y} \in S$.

Definition 2.2 (Convex function).
A function $f : S \to \mathbb{R}$ is convex if $\forall \vec{x}, \vec{y} \in S, \forall t \in [0, 1], f(t\vec{x} + (1-t)\vec{y}) \le tf(\vec{x}) + (1-t)f(\vec{y})$.

Because computer memory is finite, the domain of any optimization problem run on a computer must be bounded. A set is bounded iff its diameter is finite:

Definition 2.3 (Bounded set). A set $S$ is bounded if $\exists D \ge 0, \forall \vec{x}, \vec{y} \in S, \|\vec{x} - \vec{y}\| \le D$.

Definition 2.4 (Diameter). The diameter of a set $S$ is $\operatorname{diam}(S) = \sup_{\vec{x}, \vec{y} \in S} \|\vec{x} - \vec{y}\|$.

For some machine learning models, the corresponding loss function is not everywhere differentiable. Subgradients are the generalization of gradients to convex non-differentiable functions. A vector $\vec{g}$ is a subgradient of a function at a point $\vec{a}$ if the function always stays above the tangent plane associated with $\vec{g}$ and $\vec{a}$. If the function is differentiable at $\vec{a}$, the gradient is the only subgradient at $\vec{a}$. The set of all subgradients at $\vec{a}$ is called the subdifferential at $\vec{a}$:

Definition 2.5 (Subgradient). Let $f : S \to \mathbb{R}$ be a convex function. A vector $\vec{g}$ is a subgradient of $f$ at a point $\vec{a} \in S$ if $\forall \vec{x} \in S, f(\vec{x}) \ge f(\vec{a}) + \vec{g} \cdot (\vec{x} - \vec{a})$. We write $\vec{g} \in \partial f(\vec{a})$.

Definition 2.6 (Subdifferential). Let $f : S \to \mathbb{R}$ be a convex function. The subdifferential of $f$ at a point $\vec{a} \in S$ is the set of all subgradients of $f$ at $\vec{a}$. We write $\partial f(\vec{a})$.

Corollary 2.7. Let $f : S \to \mathbb{R}$ be a convex function. Then for any point $\vec{a}$ where $f$ is differentiable, $\partial f(\vec{a}) = \{\vec{\nabla} f(\vec{a})\}$.

Lemma 2.8 (Monotonicity of subgradient). Let $f : S \to \mathbb{R}$ be a one-dimensional convex function. Then $\forall x, y \in S, x < y \Rightarrow \max \partial f(x) \le \min \partial f(y)$.

Some analyses of optimization algorithms assume the objective function is Lipschitz continuous. The most useful property of Lipschitz continuity is that all subgradients are bounded in size, and therefore all steps taken by the optimizer are bounded in size.

Definition 2.9 (Lipschitz continuity). A function $f : S \to \mathbb{R}$ is $L$-Lipschitz for some $L > 0$ if $\forall \vec{x}, \vec{y} \in S, |f(\vec{x}) - f(\vec{y})| \le L\|\vec{x} - \vec{y}\|$.

Corollary 2.10. Let $f : S \to \mathbb{R}$ be an $L$-Lipschitz function.
Then $\forall \vec{x} \in S, \forall \vec{g} \in \partial f(\vec{x}), \|\vec{g}\| \le L$.

In the context of discrete-time stochastic processes and random walks, we use $X_t$ to denote the position (iterate) at time $t$, with $X_0$ denoting the initialization. We use $Y_t$ to denote the update (increment) at time $t$, such that $X_{t+1} = X_t + Y_t$. In the context of iterative optimizers, we sometimes use $\vec{x}_t$ and $\vec{y}_t$ (for multi-dimensional objectives) and $x_t$ and $y_t$ (for one-dimensional objectives) instead of $X_t$ and $Y_t$, respectively. The current state of a random walk comprises two variables, the current time $t$ and the current position $n$, and is written as $X_t = n$.

A filtration is the set of all variables whose value has been learned "so far" in a stochastic process. When describing the randomness generated at the current time step in a stochastic process, we use expectations conditioned on the current filtration to remove the effect of the randomness generated at the previous time steps. At any given time in a run of SGD, we know the values of all iterates up to and including the current iterate, and all increments up to but not including the step that will be taken at the current time:

Definition 2.11 (Filtration). The filtration at time $t$, denoted $\mathcal{F}_t$, is the set of all variables whose value is known at time $t$. For SGD, we have $\mathcal{F}_t = \{X_k\}_{k=0}^{t} \cup \{Y_k\}_{k=0}^{t-1}$.

A first-order optimizer only makes use of the values and subgradients of the objective function. The simplest algorithm, gradient descent, takes steps (scaled by a step size $\eta_t$) in the direction of a negative subgradient. At points where the objective function is differentiable, the negative subgradient is the steepest descent direction.

Definition 2.12 (Gradient descent). Gradient descent is the first-order optimizer with update rule $\vec{x}_{t+1} = \vec{x}_t - \eta_t \vec{\nabla} f(\vec{x}_t)$.

SGD is a variation of gradient descent that is used in practice to minimize loss functions that average (or sum) over all training examples, such as in the maximum likelihood estimation paradigm.
On each step, the subgradient of a random training example is used instead of the full subgradient, thus allowing the time complexity of a step to be constant with respect to the number of examples in the training set. This stochastic noise also lets SGD escape local minima when optimizing non-convex loss functions such as those of neural networks, although we do not dwell on this ability of SGD in our thesis. In fact, most theoretical works do not interpret SGD as choosing random training examples, but rather as adding unbiased noise to the true subgradient.

Definition 2.13 (Stochastic gradient descent). Stochastic gradient descent is the first-order optimizer with update rule $\vec{x}_{t+1} = \vec{x}_t - \eta_t \hat{\nabla} f(\vec{x}_t)$, where $\mathbb{E}[\hat{\nabla} f(\vec{x}_t) \mid \mathcal{F}_t] = \vec{\nabla} f(\vec{x}_t)$.

3 Literature Review

3.1 Shamir and Zhang (2013)

In their 2013 paper Stochastic Gradient Descent For Non-Smooth Optimization: Convergence Results and Optimal Averaging Schemes [1], Ohad Shamir and Tong Zhang proved the following upper bound on the expected suboptimality of SGD:

$$\mathbb{E}[f(\vec{x}_T)] - f^* < \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{2 + \ln T}{\sqrt{T}}$$

The authors make the following assumptions:

• The domain $S$ is convex, arbitrary-dimensional, and bounded with diameter $D$
• The objective $f : S \to \mathbb{R}$ is convex, not necessarily differentiable, and not necessarily Lipschitz continuous
• The random subgradient is bounded in expectation: $\mathbb{E}\big[\|\hat{\nabla} f\|^2\big] \le G^2$
• SGD is run for $T$ iterations, using a decreasing step size $\eta_t = \frac{\alpha}{\sqrt{t+1}}$ (with $\alpha > 0$)

Shamir and Zhang presented a direct proof that does not use any sophisticated mathematical machinery. The proof naturally holds for arbitrarily high-dimensional objective functions. The proof uses the insight of suffix averaging: let $S_k$ be the average suboptimality of the last $k+1$ iterates. Formally, $S_k = \frac{1}{k+1}\sum_{t=T-k}^{T} f(\vec{x}_t) - f^*$.
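Suffix averaging is straightforward to state in code; the following minimal sketch (the function name and interface are our own, not Shamir and Zhang's) computes $S_k$ from a recorded trajectory of objective values:

```python
def suffix_suboptimality(f_values, f_star, k):
    """S_k: average suboptimality of the last k + 1 iterates f(x_{T-k}), ..., f(x_T)."""
    tail = f_values[-(k + 1):]          # the last k + 1 objective values
    return sum(tail) / (k + 1) - f_star

# Toy trajectory of objective values f(x_0), ..., f(x_T) with f* = 1.
fs = [5.0, 3.0, 2.0, 1.5, 1.0]
s0 = suffix_suboptimality(fs, f_star=1.0, k=0)  # suboptimality of the final iterate
s2 = suffix_suboptimality(fs, f_star=1.0, k=2)  # average over the last 3 iterates
```

Note that $S_0$ is exactly the final-iterate suboptimality $f(\vec{x}_T) - f^*$, which is why the recurrence below is unrolled down to $\mathbb{E}[S_0]$.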
Shamir and Zhang prove that $\mathbb{E}[S_k]$ satisfies the following recurrence and base case:

$$\mathbb{E}[S_{k-1}] < \mathbb{E}[S_k] + \frac{1}{k}\left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}} \quad \text{(recurrence)}$$
$$\mathbb{E}[S_{T-1}] < \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}} \quad \text{(base case)}$$

This is helpful because our ultimate quantity of interest is $\mathbb{E}[f(\vec{x}_T)] - f^* = \mathbb{E}[S_0]$. Unrolling the recurrence until we reach the base case gives us:

$$\mathbb{E}[S_0] < \mathbb{E}[S_{T-1}] + \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}}\sum_{k=1}^{T-1}\frac{1}{k} \quad \text{(unrolling the recurrence)}$$
$$< \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}} + \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}}\sum_{k=1}^{T-1}\frac{1}{k} \quad \text{(substituting in the base case)}$$

The harmonic sum can be upper-bounded as $\sum_{k=1}^{T-1}\frac{1}{k} \le 1 + \ln T$, which concludes the proof:

$$\mathbb{E}[S_0] < \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1}{\sqrt{T}} + \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{1 + \ln T}{\sqrt{T}} \quad \text{(upper-bounding the harmonic sum)}$$
$$= \left(\frac{D^2}{2\alpha} + G^2\alpha\right)\frac{2 + \ln T}{\sqrt{T}} \quad \text{(this is Shamir and Zhang's main result)}$$

To summarize, the suffix averaging technique incurs a harmonic sum and thus a factor of $\ln T$, making the bound too loose for our purposes. Unfortunately, there is no obvious way of adapting Shamir and Zhang's argument to the special case of one-dimensional objectives that removes the harmonic sum.

3.2 Emmanuel Sales (2020)

In his 2020 thesis Restricted-Dimension Subgradient Descent: Asymptotic Bounds on Error [2], Emmanuel Sales proved that the suboptimality at the final iterate of deterministic gradient descent is $O(\frac{1}{\sqrt{T}})$ for one-dimensional objectives. He makes the following assumptions:

• The domain is convex, one-dimensional, and bounded with diameter 1
• The objective function is convex, not necessarily differentiable, and 1-Lipschitz
• Gradient descent is run for $T$ iterations, using a decreasing step size $\eta_t = \frac{1}{\sqrt{t+1}}$

Emmanuel's proof uses induction on $T$ and case analysis. There are three non-trivial cases depending on the sign of the gradient at the current iterate, formally $\operatorname{sgn}\max\partial f(x_t)$, and the sign of the gradient at the previous iterate, formally $\operatorname{sgn}\max\partial f(x_{t-1})$. If the signs are different, then the minimizer is trapped between the previous and current iterates. If both signs are negative, the minimizer is further right.
If both signs are positive, the minimizer is further left.

This approach is not well suited to our problem due to its use of induction on $T$. With Emmanuel's decreasing step size $\eta_t = \frac{1}{\sqrt{t+1}}$, running SGD for $T+1$ iterations is equivalent to running SGD for $T$ iterations, followed by one step of size $\frac{1}{\sqrt{T+1}}$. Therefore, the induction hypothesis can be invoked on the first $T$ iterations, leaving only the final step for the inductive argument to consider. However, with our constant step size $\eta_t = \Theta(\frac{1}{\sqrt{T}})$, running SGD for $T+1$ iterations is fundamentally different from running SGD for $T$ iterations since every step will have a different size. Thus, there is no obvious substructure that the induction hypothesis can be invoked on.

Furthermore, the case analysis in Emmanuel's proof conditions on the signs of gradients at exact iterates. There is no obvious way to incorporate expectations into this case analysis to generalize it to SGD.

3.3 Koren and Segal (2020)

In their COLT 2020 paper Open Problem: Tight Convergence of SGD in Constant Dimension [3], Tomer Koren and Shahar Segal conjecture that the expected suboptimality of SGD is $\Theta(\frac{1}{\sqrt{T}})$ for one-dimensional objectives. Our thesis is a direct continuation of Koren and Segal's line of research, and we prove this conjecture under a restricted setting.

In their preliminary analysis, Koren and Segal investigate the behaviour of SGD on the scaled absolute value function $f(x) = \varepsilon|x|$ (with $0 \le \varepsilon \le 1$). Intuitively, this is the "greatest" $\varepsilon$-Lipschitz function possible: the size of the gradient is maximal everywhere, so the function cannot "curve up" more to become greater. Since the size of the gradient is uniform, each increment has the same size, allowing SGD runs to be modelled as random walks on $\mathbb{Z}$. The authors use the technique of generating functions to derive limiting properties of these random walks.
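To make the random-walk picture concrete, here is a hedged simulation sketch (our own code, not Koren and Segal's) of SGD on $f(x) = \varepsilon|x|$: the stochastic subgradient is modelled as $\pm 1$ noise that is unbiased around $\varepsilon\,\operatorname{sgn}(x)$, so the iterates live on the grid $\eta\mathbb{Z}$ and drift back toward the origin:

```python
import random

def sgd_on_eps_abs(eps, T, seed=0):
    """SGD on f(x) = eps*|x| with constant step size eta = 1/sqrt(T).

    The stochastic subgradient is +1 or -1 with mean eps*sign(x), so each
    iterate moves by exactly +/-eta and the run is a walk on the grid eta*Z.
    """
    rng = random.Random(seed)
    eta = T ** -0.5
    x = 0.0
    for _ in range(T):
        sign = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
        # ghat = +1 with probability (1 + eps*sign)/2, else -1: unbiased noise
        # around the true subgradient eps*sign(x), and |ghat| <= 1 always.
        ghat = 1.0 if rng.random() < (1.0 + eps * sign) / 2.0 else -1.0
        x -= eta * ghat
    return x

eps = 0.5
finals = [abs(sgd_on_eps_abs(eps, T=10000, seed=s)) for s in range(200)]
avg_dev = sum(finals) / len(finals)  # stays on the order of eta, not sqrt(T)*eta
```

With $T = 10000$ the average final deviation stays close to a small multiple of $\eta = 0.01$, which is the behaviour the bias-toward-the-origin analysis in Section 4.2 formalizes.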
In Section 4.2, we adapt the authors' insights to a more general parameterization.

4 Main Results

4.1 Stochastic processes - Martingales

In this section, we present a promising argument for improving the asymptotic upper bound on the expected suboptimality of SGD. Our analysis makes the following assumptions:

• The domain $S$ is convex, one-dimensional, and bounded with diameter $D$
• The objective $f : S \to \mathbb{R}$ is convex, not necessarily differentiable, and $L$-Lipschitz
• The stochastic gradients are bounded: $|\hat{\nabla} f(x)| \le L$
• SGD is run for $T$ iterations, using a constant step size $\eta = \frac{1}{\sqrt{T}}$

Let $\Phi$ denote the set of objective functions satisfying these assumptions. We aim to prove the following conjecture:

Conjecture 4.1. For every function $f \in \Phi$, $\mathbb{E}[f(x_T)] - f^* = O(\frac{1}{\sqrt{T}})$. By definition of big-O, in predicate logic quantifiers this is:

$$\exists c > 0, \exists T^* \in \mathbb{Z}^+, \forall f \in \Phi, \forall T \ge T^*, \quad \mathbb{E}[f(x_T)] - f^* \le \frac{c(D + L^2 + L)}{\sqrt{T}}$$

Crucially, the quantifiers for $c$ and $T^*$ appear before the quantifier for $f$. In their 2009 paper Information-Theoretic Lower Bounds on the Oracle Complexity of Convex Optimization [4], Alekh Agarwal et al. prove that $\mathbb{E}[f(x_T)] - f^* = \Omega(\frac{1}{\sqrt{T}})$. This implies our conjectured upper bound is tight, i.e. $\mathbb{E}[f(x_T)] - f^* = \Theta(\frac{1}{\sqrt{T}})$. We make use of Azuma's inequality, stated below, as a concentration bound in our proof:

Definition 4.2 (Martingale). A stochastic process $\{X_t\}_{t=0}^{T}$ is a martingale if $\forall t \in [\![1, T]\!], \mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = X_{t-1}$.

Theorem 4.3 (Azuma's inequality). Let $\{X_t\}_{t=0}^{T}$ be a martingale stochastic process. Let $\delta > 0$ be a real number such that $|Y_t| \le \delta$ for all $t \in [\![0, T-1]\!]$. Let $\epsilon > 0$ be a real number. Then $\Pr[X_T - \mathbb{E}[X_T] > \epsilon \mid X_0] \le \exp\{-\frac{\epsilon^2}{2T\delta^2}\}$.

Because $f$ may not have a unique minimizer $x^*$, the quantity $\|x - x^*\|$ (the distance from the minimizer) is poorly defined. For mathematical rigour, we work with the concept of near minimizers, defined below:

Definition 4.4.
The set of near minimizers, denoted $N$, contains all points $x \in S$ where the function value is close to the minimum or the size of the gradient is small:

$$N = \left\{x \in S : f(x) - f^* \le \frac{D}{\sqrt{T}} \;\vee\; |\nabla f(x)| \le \frac{1}{\sqrt{T}}\right\}$$

Corollary 4.5. If $f$ is convex and one-dimensional, $N$ is a non-empty interval. We write $N = [n_L, n_R]$ with $n_L < n_R$. There exists some (potentially not unique) point $x^* \in N$ such that $f(x^*) = f^*$.

Lemma 4.6. For any $x \in N$, $f(x) - f^* \le \frac{D}{\sqrt{T}}$.

Proof. By definition of $N$, it suffices to prove that $\forall x \in S, |\nabla f(x)| \le \frac{1}{\sqrt{T}} \Rightarrow f(x) - f^* \le \frac{D}{\sqrt{T}}$. Let $x^* \in S$ be a point at which $f(x^*) = f^*$. Consider an unspecified point $x \in S$ at which $|\nabla f(x)| \le \frac{1}{\sqrt{T}}$. If $x < x^*$, then:

$$f(x) - f^* = \int_{x^*}^{x} \nabla f(x)\,dx = -\int_{x}^{x^*} \nabla f(x)\,dx \quad \text{(fundamental theorem of calculus)}$$
$$\le \int_{x}^{x^*} \frac{1}{\sqrt{T}}\,dx = \frac{x^* - x}{\sqrt{T}} \le \frac{D}{\sqrt{T}} \quad \text{(by Lemma 2.8, } -\tfrac{1}{\sqrt{T}} \le \nabla f(x) \le 0 \text{ in the interval of integration)}$$

If $x > x^*$, then:

$$f(x) - f^* = \int_{x^*}^{x} \nabla f(x)\,dx \quad \text{(fundamental theorem of calculus)}$$
$$\le \int_{x^*}^{x} \frac{1}{\sqrt{T}}\,dx = \frac{x - x^*}{\sqrt{T}} \le \frac{D}{\sqrt{T}} \quad \text{(by Lemma 2.8, } 0 \le \nabla f(x) \le \tfrac{1}{\sqrt{T}} \text{ in the interval of integration)}$$

Definition 4.7. The distance from a point $x \in S$ to the set of near minimizers is:

$$\|x - N\| = \begin{cases} n_L - x & \text{if } x < n_L \\ 0 & \text{if } n_L \le x \le n_R \\ x - n_R & \text{if } x > n_R \end{cases}$$

We have not yet discovered a complete proof of Conjecture 4.1 that works for all $f \in \Phi$. Below, we provide a proof that works if we make three temporary assumptions:

• SGD starts from inside the set of near minimizers, i.e. $X_0 \in N$.
• There exists some $g > 0$ such that $|\nabla f(x)| \ge g$ everywhere.
• There exists some $b \in (0, 1)$ such that for all $t \in [\![2, T]\!]$, $2\exp\{-\frac{g^2}{8L^2}t\} \le \left(\frac{1-b}{1+Lt}\right)b^{t-1}$. Note that $b$ may depend on $T$, but not on $t$.

Proof. Choose $c = 1$ and $T^* = 1$. It suffices to prove that:

$$\exists b \in (0, 1), \forall t \in [\![1, T]\!], \quad \max_{X_0 \in N} \mathbb{E}[\|X_t - N\| \mid X_0] \le \frac{L + 1 - b^t}{\sqrt{T}}$$

This statement is sufficient to imply Conjecture 4.1 because:

$$\mathbb{E}[f(X_T)] - f^* \le \max_{x \in N}\{f(x) - f^*\} + L \cdot \mathbb{E}[\|X_T - N\|] \quad \text{(by Lipschitz continuity)}$$
$$\le \frac{D}{\sqrt{T}} + L \cdot \mathbb{E}[\|X_T - N\|] \quad \text{(by Lemma 4.6)}$$
$$\le \frac{D}{\sqrt{T}} + L \cdot \max_{X_0 \in N} \mathbb{E}[\|X_T - N\| \mid X_0]$$
$$\le \frac{D}{\sqrt{T}} + L\left(\frac{L + 1 - b^T}{\sqrt{T}}\right) \quad \text{(by the sufficient statement)}$$
$$< \frac{D}{\sqrt{T}} + L\left(\frac{L + 1}{\sqrt{T}}\right) = \frac{D + L^2 + L}{\sqrt{T}}$$

To prove this statement, we will use induction on $t$.
Base case: $t = 1$.

$$\max_{X_0 \in N} \mathbb{E}[\|X_1 - N\| \mid X_0] \le \max_{X_0 \in N} \mathbb{E}[\|X_0 - N\| + |X_1 - X_0| \mid X_0] \quad \text{(triangle inequality)}$$
$$\le \max_{X_0 \in N} \mathbb{E}\left[\frac{1}{\sqrt{T}}\big|\hat{\nabla} f(X_0)\big| \,\Big|\, X_0\right] \le \frac{L}{\sqrt{T}} \quad \text{(}\|X_0 - N\| = 0 \text{ since } X_0 \in N\text{; SGD update rule)}$$
$$< \frac{L + 1 - b}{\sqrt{T}} \quad \text{(since } b < 1\text{)}$$

Inductive step: $t \in [\![2, T]\!]$. Inductive hypothesis: assume the statement holds for all $t' \in [\![1, t-1]\!]$. We want to prove the statement holds for $t$. Observe that $\mathbb{E}[\|X_t - N\| \mid X_0]$ satisfies the following recurrence, with $\tau$ denoting the first positive time at which SGD goes back into the near minimizers. Formally, $\tau = \min\{k \in [\![1, t]\!] : X_k \in N\}$ and $\tau = \infty$ if $\forall k \in [\![1, t]\!], X_k \notin N$:

$$\max_{X_0 \in N} \mathbb{E}[\|X_t - N\| \mid X_0] = \max_{X_0 \in N} \sum_{k=1}^{t} \Pr[\tau = k \mid X_0]\,\underbrace{\mathbb{E}[\|X_t - N\| \mid \tau = k \wedge X_0]}_{\text{Term One}} + \Pr[\tau = \infty \mid X_0]\,\underbrace{\mathbb{E}[\|X_t - N\| \mid \tau = \infty \wedge X_0]}_{\text{Term Two}}$$

We upper bound Term One by re-indexing:

$$\mathbb{E}[\|X_t - N\| \mid \tau = k \wedge X_0] = \mathbb{E}[\|X_t - N\| \mid X_\tau] \quad (X_0 \text{ is irrelevant knowing } X_\tau)$$
$$\le \max_{X_\tau \in N} \mathbb{E}[\|X_t - N\| \mid X_\tau] = \max_{X_0 \in N} \mathbb{E}[\|X_{t-\tau} - N\| \mid X_0] \quad \text{(re-indexing since the walk parameters don't depend on } t\text{)}$$
$$\le \frac{L + 1 - b^{t-\tau}}{\sqrt{T}} \le \frac{L + 1 - b^{t-1}}{\sqrt{T}} \quad \text{(inductive hypothesis)}$$

We upper bound Term Two by using the diameter bound on the domain:

$$\mathbb{E}[\|X_t - N\| \mid \tau = \infty \wedge X_0] \le \max\{\|X_t - N\| \mid \tau = \infty \wedge X_0\} \quad \text{(upper bounding expectation by maximum)}$$
$$\le \max \|X_t - N\| \le \max\left\{\|X_0 - N\| + \sum_{k=0}^{t-1} |Y_k|\right\} \quad \text{(triangle inequality)}$$
$$\le \frac{Lt}{\sqrt{T}} \quad \text{(SGD update rule and boundedness of stochastic gradients)}$$

Substituting the bounds for the two terms into the recurrence:

$$\max_{X_0 \in N} \mathbb{E}[\|X_t - N\| \mid X_0] \le \max_{X_0 \in N}\left\{\sum_{k=1}^{t} \Pr[\tau = k \mid X_0]\left(\frac{L + 1 - b^{t-1}}{\sqrt{T}}\right) + \Pr[\tau = \infty \mid X_0]\left(\frac{Lt}{\sqrt{T}}\right)\right\}$$
$$= \max_{X_0 \in N}\left\{\big(1 - \Pr[\tau = \infty \mid X_0]\big)\left(\frac{L + 1 - b^{t-1}}{\sqrt{T}}\right) + \Pr[\tau = \infty \mid X_0]\left(\frac{Lt}{\sqrt{T}}\right)\right\}$$

Our quantity of interest is thus $\Pr[\tau = \infty \mid X_0]$ for an unspecified $X_0 \in N$:

$$\Pr[\tau = \infty \mid X_0] = \Pr\left[\bigwedge_{k=1}^{t} X_k > n_R \,\Big|\, X_0\right] + \Pr\left[\bigwedge_{k=1}^{t} X_k < n_L \,\Big|\, X_0\right]$$
$$\le \exp\left\{-\frac{g^2}{8L^2}t\right\} + \Pr\left[\bigwedge_{k=1}^{t} X_k < n_L \,\Big|\, X_0\right] \quad \text{(Lemma 4.8 holds for walks that go above } N\text{)}$$
$$\le \exp\left\{-\frac{g^2}{8L^2}t\right\} + \exp\left\{-\frac{g^2}{8L^2}t\right\} \quad \text{(by symmetry, Lemma 4.8 also holds for walks that go below } N\text{)}$$
$$= 2\exp\left\{-\frac{g^2}{8L^2}t\right\}$$

To simplify the notation below, let $p = \Pr[\tau = \infty \mid X_0]$.
Then for some $b \in (0, 1)$:

$$p \le \left(\frac{1-b}{1+Lt}\right)b^{t-1} \quad \text{(by the third temporary assumption)}$$
$$\Leftrightarrow\; p\left(\frac{1 + Lt}{\sqrt{T}}\right) \le \frac{b^{t-1} - b^t}{\sqrt{T}}$$
$$\Rightarrow\; p\left(\frac{b^{t-1} + Lt}{\sqrt{T}}\right) < \frac{b^{t-1} - b^t}{\sqrt{T}} \quad \text{(since } b^{t-1} < 1\text{)}$$
$$\Leftrightarrow\; -\frac{b^{t-1}}{\sqrt{T}} + p\left(\frac{b^{t-1} + Lt}{\sqrt{T}}\right) < -\frac{b^t}{\sqrt{T}}$$
$$\Leftrightarrow\; -(1-p)\left(\frac{b^{t-1}}{\sqrt{T}}\right) + p\left(\frac{Lt}{\sqrt{T}}\right) < -\frac{b^t}{\sqrt{T}}$$
$$\Rightarrow\; (1-p)\left(\frac{L + 1 - b^{t-1}}{\sqrt{T}}\right) + p\left(\frac{Lt}{\sqrt{T}}\right) < \frac{L + 1 - b^t}{\sqrt{T}} \quad \text{(since } (1-p)(L+1) < L+1\text{)}$$

Since $X_0 \in N$ was unspecified, we have $\max_{X_0 \in N} \mathbb{E}[\|X_t - N\| \mid X_0] < \frac{L + 1 - b^t}{\sqrt{T}}$. This is what we wanted to prove.

Below, we state and prove some lemmas that are used in the proof of Conjecture 4.1.

Lemma 4.8. For any $X_0 \in N$, $\Pr\big[\bigwedge_{k=1}^{t} X_k > n_R \,\big|\, X_0\big] \le \exp\{-\frac{g^2}{8L^2}t\}$.

Proof. Given an SGD run with iterates $\{X_k\}_{k=0}^{t}$ and increments $\{Y_k\}_{k=0}^{t-1}$, define another stochastic process with iterates $\{\tilde{X}_k\}_{k=0}^{t}$ and increments $\{\tilde{Y}_k\}_{k=0}^{t-1}$ as follows. Note that $\{\tilde{X}_k\}_{k=0}^{t}$ and $\{\tilde{Y}_k\}_{k=0}^{t-1}$ are deterministic given $\{X_k\}_{k=0}^{t}$ and $\{Y_k\}_{k=0}^{t-1}$. The stochasticity of $\{\tilde{X}_k\}_{k=0}^{t}$ and $\{\tilde{Y}_k\}_{k=0}^{t-1}$ comes only from the stochasticity of $\{X_k\}_{k=0}^{t}$ and $\{Y_k\}_{k=0}^{t-1}$.

$$\tilde{X}_k = X_0 + \sum_{i=0}^{k-1} \tilde{Y}_i$$
$$\tilde{Y}_k = \begin{cases} Y_k & \text{if } \forall i \in [\![1, k]\!],\; X_i > n_R \\ -\eta g & \text{if } \exists i \in [\![1, k]\!],\; X_i \le n_R \end{cases}$$

The quantity of interest can be written in terms of $\tilde{X}$:

$$\Pr\left[\bigwedge_{k=1}^{t} X_k > n_R \,\Big|\, X_0\right] = \Pr[\tilde{X}_t > n_R \mid X_0] \quad \text{(by Lemma 4.9)}$$
$$= \Pr\big[\tilde{X}_t - \mathbb{E}[\tilde{X}_t] > n_R - \mathbb{E}[\tilde{X}_t] \,\big|\, X_0\big]$$

Since $\{\tilde{X}_k\}_{k=0}^{t}$ is not a martingale, we cannot directly apply Azuma's inequality.
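As an aside, Azuma's inequality itself is easy to sanity-check numerically on a genuine martingale. The sketch below (our own code, not part of the proof) compares the empirical tail of a sum of symmetric $\pm\delta$ increments against the bound of Theorem 4.3:

```python
import math
import random

T, delta, eps, trials = 100, 1.0, 30.0, 20000
# Azuma tail bound: Pr[X_T - E[X_T] > eps] <= exp(-eps^2 / (2*T*delta^2))
bound = math.exp(-eps**2 / (2 * T * delta**2))

rng = random.Random(1)
exceed = 0
for _ in range(trials):
    # A martingale with increments bounded by delta: i.i.d. +/-delta steps.
    x_T = sum(rng.choice((-delta, delta)) for _ in range(T))
    if x_T > eps:  # E[X_T] = 0 here
        exceed += 1

empirical = exceed / trials  # empirical tail; should sit below `bound`
```

The empirical tail is typically an order of magnitude below the bound, which is expected: Azuma trades tightness for generality over all bounded-increment martingales.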
Define a de-biased version of this stochastic process with iterates $\{\bar{X}_k\}_{k=0}^{t}$ and increments $\{\bar{Y}_k\}_{k=0}^{t-1}$ as follows:

$$\bar{X}_k = X_0 + \sum_{i=0}^{k-1} \bar{Y}_i$$
$$\bar{Y}_k = \tilde{Y}_k - \mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k]$$

$\tilde{X}_k$ and $\tilde{Y}_k$ can be written in terms of $\bar{X}_k$ and $\bar{Y}_k$:

$$\tilde{X}_k = \bar{X}_k + \sum_{i=0}^{k-1} \mathbb{E}[\tilde{Y}_i \mid \mathcal{F}_i]$$
$$\tilde{Y}_k = \bar{Y}_k + \mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k]$$

We verify that $\{\bar{X}_k\}_{k=0}^{t}$ is a martingale:

$$\mathbb{E}[\bar{X}_k \mid \mathcal{F}_{k-1}] = \bar{X}_{k-1} + \mathbb{E}[\bar{Y}_{k-1} \mid \mathcal{F}_{k-1}] \quad \text{(by definition of } \bar{X}\text{)}$$
$$= \bar{X}_{k-1} + \mathbb{E}[\tilde{Y}_{k-1} \mid \mathcal{F}_{k-1}] - \mathbb{E}[\tilde{Y}_{k-1} \mid \mathcal{F}_{k-1}] \quad \text{(by definition of } \bar{Y}\text{)}$$
$$= \bar{X}_{k-1}$$

We verify the upper bound $|\bar{Y}_k| \le \frac{2L}{\sqrt{T}}$:

$$|\bar{Y}_k| = \big|\tilde{Y}_k - \mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k]\big| \quad \text{(by definition of } \bar{Y}\text{)}$$
$$\le |\tilde{Y}_k| + \big|\mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k]\big| \quad \text{(by the triangle inequality)}$$
$$\le L\eta + L\eta = \frac{2L}{\sqrt{T}} \quad \text{(by Lipschitz continuity)}$$

Thus, we can apply Azuma's inequality to $\{\bar{X}_k\}_{k=0}^{t}$:

$$\Pr\big[\tilde{X}_t - \mathbb{E}[\tilde{X}_t] > \epsilon \,\big|\, X_0\big] = \Pr\left[\bar{X}_t + \sum_{k=0}^{t-1}\mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k] - \mathbb{E}[\bar{X}_t] - \sum_{k=0}^{t-1}\mathbb{E}[\tilde{Y}_k \mid \mathcal{F}_k] > \epsilon \,\Big|\, X_0\right] \quad \text{(writing } \tilde{X} \text{ in terms of } \bar{X}\text{)}$$
$$= \Pr\big[\bar{X}_t - \mathbb{E}[\bar{X}_t] > \epsilon \,\big|\, X_0\big]$$
$$\le \exp\left\{-\frac{\epsilon^2}{2t\left(\frac{2L}{\sqrt{T}}\right)^2}\right\} = \exp\left\{-\frac{\epsilon^2 T}{8L^2 t}\right\} \quad \text{(applying Azuma's inequality)}$$

Substituting $\epsilon = n_R - \mathbb{E}[\tilde{X}_t]$ completes the proof:

$$\Pr\left[\bigwedge_{k=1}^{t} X_k > n_R \,\Big|\, X_0\right] \le \exp\left\{-\frac{\big(n_R - \mathbb{E}[\tilde{X}_t]\big)^2 T}{8L^2 t}\right\} \le \exp\left\{-\left(\frac{t^2 g^2}{T}\right)\frac{T}{8L^2 t}\right\} = \exp\left\{-\frac{g^2}{8L^2}t\right\} \quad \text{(by Lemma 4.10)}$$

Lemma 4.9. $\tilde{X}_t > n_R$ if and only if $\forall k \in [\![1, t]\!], X_k > n_R$. Therefore, $\Pr[\tilde{X}_t > n_R] = \Pr\big[\bigwedge_{k=1}^{t} X_k > n_R\big]$.

Proof. First, assume $\forall k \in [\![1, t]\!], X_k > n_R$. By definition of $\tilde{Y}$, $\tilde{Y}_k = Y_k$ for all $k \in [\![0, t-1]\!]$. Therefore, $\tilde{X}_t = X_t > n_R$.

Next, assume $\exists k^* \in [\![1, t]\!], X_{k^*} \le n_R$. By definition of $\tilde{Y}$, $\tilde{Y}_k = Y_k$ for all $k \in [\![0, k^*-1]\!]$ and $\tilde{Y}_k = -\eta g$ for all $k \in [\![k^*, t-1]\!]$. Therefore, $\tilde{X}_t = X_{k^*} - \sum_{k=k^*}^{t-1} \eta g \le X_{k^*} \le n_R$.

Lemma 4.10. $\big(n_R - \mathbb{E}[\tilde{X}_t]\big)^2 \ge \frac{t^2 g^2}{T}$.

Proof. To simplify the notation below, let $p_k = \Pr\big[\bigwedge_{i=1}^{k} X_i > n_R\big]$.
Observe that:

$$\mathbb{E}[\tilde{Y}_k] = p_k \cdot \mathbb{E}[Y_k] + (1 - p_k)(-\eta g) \quad \text{(by definition of } \tilde{Y}\text{)}$$
$$\le p_k(-\eta g) + (1 - p_k)(-\eta g) \quad \text{(by the second temporary assumption)}$$
$$= -\eta g$$

Therefore:

$$n_R - \mathbb{E}[\tilde{X}_t] = n_R - \left(X_0 + \sum_{k=0}^{t-1}\mathbb{E}[\tilde{Y}_k]\right) \quad \text{(by definition of } \tilde{X}\text{)}$$
$$= (n_R - X_0) - \sum_{k=0}^{t-1}\mathbb{E}[\tilde{Y}_k] \ge -\sum_{k=0}^{t-1}\mathbb{E}[\tilde{Y}_k] \quad (X_0 \in N \text{ by assumption, so } X_0 \le n_R \text{ by Corollary 4.5)}$$
$$\ge \sum_{k=0}^{t-1} \eta g = \frac{tg}{\sqrt{T}}$$

Since $\frac{tg}{\sqrt{T}} > 0$, we can square both sides to get $\big(n_R - \mathbb{E}[\tilde{X}_t]\big)^2 \ge \frac{t^2 g^2}{T}$.

4.2 Biased random walks - Generating functions

SGD can be conceptualized as a random walk on $S$ where the increments $Y_t$ are biased in expectation towards $N$. If $S$ is one-dimensional, then $\mathbb{E}[Y_t] > 0$ when $X_t < n_L$ and $\mathbb{E}[Y_t] < 0$ when $X_t > n_R$. A classical result [5] states that the simple (unbiased) random walk has expected absolute deviation $\mathbb{E}[|X_T|] = \Theta(\sqrt{T})$. In this section, we derive an improved bound on the expected absolute deviation of a biased random walk as an approximate model of SGD's convergence dynamics.

Consider a random walk on $\mathbb{Z}$ with initial state $X_0 = 0$ and transition probabilities listed below and depicted in Figure 1. This random walk is well-defined (all transition probabilities are valid probability values) if $0 < \gamma < 1$. We assume $\gamma < \frac{1}{2}$ so the random walk is biased towards the origin:

Figure 1: State transition diagram of the random walk we analyze in Section 4.2. [diagram omitted]

Figure 2: State transition diagram of the simplified random walk we analyze in Section 4.2. [diagram omitted]

• If $X_t > 0$, then $Y_t = \begin{cases} 1 & \text{with probability } \gamma \\ -1 & \text{with probability } 1-\gamma \end{cases}$

• If $X_t = 0$, then $Y_t = \begin{cases} 1 & \text{with probability } \frac{\gamma}{2} \\ 0 & \text{with probability } 1-\gamma \\ -1 & \text{with probability } \frac{\gamma}{2} \end{cases}$

• If $X_t < 0$, then $Y_t = \begin{cases} 1 & \text{with probability } 1-\gamma \\ -1 & \text{with probability } \gamma \end{cases}$

Our ultimate quantity of interest is $\mathbb{E}[|X_T|]$. To simplify the analysis, observe that this random walk is symmetric about the origin. Formally, for any $X \in \mathbb{Z}$, the left and right transition probabilities at $X$ are respectively equal to the right and left transition probabilities at $-X$.
Thus, it suffices to consider the simpler random walk on $\mathbb{Z}_{\ge 0}$ with initial state $\tilde{X}_0 = 0$ and transition probabilities listed below and depicted in Figure 2. Specifically, $\mathbb{E}[|X_T|] = \mathbb{E}[\tilde{X}_T]$:

• If $\tilde{X}_t > 0$, then $\tilde{Y}_t = \begin{cases} 1 & \text{with probability } \gamma \\ -1 & \text{with probability } 1-\gamma \end{cases}$

• If $\tilde{X}_t = 0$, then $\tilde{Y}_t = \begin{cases} 1 & \text{with probability } \gamma \\ 0 & \text{with probability } 1-\gamma \end{cases}$

Definition 4.11. In this section, $\Pr[\tilde{X}_s = n \rightsquigarrow \tilde{X}_t = n]$ (with $s < t$) denotes the probability that, given the random walk is at position $n$ at time $s$, it returns to position $n$ at time $t$ without having walked "left of" position $n$. Formally:

• If $n = 0$, this means the walk did not take the self-loop at position 0, i.e. $\tilde{Y}_k \ne 0$ for all $k \in [\![s, t-1]\!]$.
• If $n > 0$, this means $\tilde{X}_k \ge n$ for all $k \in [\![s+1, t-1]\!]$.

Corollary 4.12. For all non-negative integers $s, t, n$, $\Pr[\tilde{X}_s = n \rightsquigarrow \tilde{X}_t = n] = \Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_{t-s} = 0]$.

Now, we state and prove the main result of this section:

Theorem 4.13. For all positive integers $T$, $\mathbb{E}[|X_T|] = \mathbb{E}[\tilde{X}_T] = \Theta(1)$.

Proof. The position at time $T$ is the sum of the steps up to but not including $T$:

$$\mathbb{E}[\tilde{X}_T] = \mathbb{E}\left[\sum_{t=0}^{T-1} \tilde{Y}_t\right]$$

Note that the steps are not independent because the transition probabilities of a step $\tilde{Y}_t$ depend on the position $\tilde{X}_t$, which depends on the previous steps $\{\tilde{Y}_k\}_{k=0}^{t-1}$. Nonetheless, linearity of expectation still holds for dependent random variables:

$$\mathbb{E}[\tilde{X}_T] = \sum_{t=0}^{T-1} \mathbb{E}[\tilde{Y}_t]$$

By the transition probabilities defined in Figure 2:

$$\mathbb{E}[\tilde{Y}_t \mid \tilde{X}_t > 0] = \gamma(1) + (1-\gamma)(-1) = -(1-2\gamma)$$
$$\mathbb{E}[\tilde{Y}_t \mid \tilde{X}_t = 0] = \gamma(1) + (1-\gamma)(0) = \gamma$$

Combining these two conditional expectations:

$$\mathbb{E}[\tilde{Y}_t] = \mathbb{E}[\tilde{Y}_t \mid \tilde{X}_t = 0]\Pr[\tilde{X}_t = 0] + \mathbb{E}[\tilde{Y}_t \mid \tilde{X}_t > 0]\Pr[\tilde{X}_t > 0]$$
$$= \gamma\Pr[\tilde{X}_t = 0] - (1-2\gamma)\big(1 - \Pr[\tilde{X}_t = 0]\big)$$
$$= (1-\gamma)\left(\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\right)$$

By Lemma 4.16, we have $\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma} \ge 0$ for all $t \ge 0$. Combining the results above:

$$\mathbb{E}[\tilde{X}_T] = (1-\gamma)\sum_{t=0}^{T-1}\left\{\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\right\} \le (1-\gamma)\sum_{t=0}^{\infty}\left\{\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\right\} \quad \text{(by Lemma 4.16)}$$

To evaluate this infinite sum, we will use generating functions.
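As an independent sanity check of the theorem's claim (our own Monte Carlo sketch, not part of the proof), one can simulate the walk of Figure 1 and observe that the estimate of $\mathbb{E}[|X_T|]$ does not grow with $T$:

```python
import random

def expected_abs_deviation(gamma, T, trials, seed=0):
    """Monte Carlo estimate of E[|X_T|] for the biased walk of Figure 1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = 0
        for _ in range(T):
            r = rng.random()
            if x > 0:
                x += 1 if r < gamma else -1              # drift back toward 0
            elif x < 0:
                x += 1 if r < 1 - gamma else -1          # drift back toward 0
            else:
                # at the origin: +1 w.p. gamma/2, -1 w.p. gamma/2, stay w.p. 1-gamma
                x += 1 if r < gamma / 2 else (-1 if r < gamma else 0)
        total += abs(x)
    return total / trials

gamma = 0.25
est_T100 = expected_abs_deviation(gamma, T=100, trials=4000)
est_T400 = expected_abs_deviation(gamma, T=400, trials=4000, seed=1)
# The two estimates hover around the same constant, rather than scaling with sqrt(T).
```

Quadrupling $T$ leaves the estimate essentially unchanged, consistent with $\mathbb{E}[|X_T|] = \Theta(1)$ for $\gamma < \frac{1}{2}$.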
First, note that $\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_t = 0]$ satisfies the following recurrence, with $k$ denoting the first positive time at which the position of the random walk is 0:

$$\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_t = 0] = \Pr[\tilde{X}_1 = 1 \mid \tilde{X}_0 = 0]\sum_{k=2}^{t}\Pr[\tilde{X}_1 = 1 \rightsquigarrow \tilde{X}_{k-1} = 1]\Pr[\tilde{X}_k = 0 \mid \tilde{X}_{k-1} = 1]\Pr[\tilde{X}_k = 0 \rightsquigarrow \tilde{X}_t = 0]$$
$$= \gamma\sum_{k=2}^{t}\Pr[\tilde{X}_1 = 1 \rightsquigarrow \tilde{X}_{k-1} = 1](1-\gamma)\Pr[\tilde{X}_k = 0 \rightsquigarrow \tilde{X}_t = 0] \quad \text{(by the transition probabilities)}$$
$$= \gamma\sum_{k=2}^{t}\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_{k-2} = 0](1-\gamma)\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_{t-k} = 0] \quad \text{(by Corollary 4.12)}$$
$$= \gamma(1-\gamma)\sum_{k=0}^{t-2}\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_k = 0]\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_{t-2-k} = 0]$$

The base cases of this scalar recurrence are $\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_0 = 0] = 1$ and $\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_1 = 0] = 0$. Define a generating function $\chi(z) = \sum_{t=0}^{\infty}\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_t = 0]z^t$. By the scalar recurrence above, $\chi(z)$ satisfies the following function recurrence:

$$\chi(z) = 1 + \gamma(1-\gamma)z^2 \cdot \chi(z) \cdot \chi(z)$$
$$\Leftrightarrow\; \gamma(1-\gamma)z^2 \cdot \chi(z)^2 - \chi(z) + 1 = 0 \quad \text{(rewriting as a quadratic equation)}$$
$$\Rightarrow\; \chi(z) = \frac{1 \pm \sqrt{1 - 4\gamma(1-\gamma)z^2}}{2\gamma(1-\gamma)z^2} \quad \text{(by the quadratic formula)}$$

Now, note that $\Pr[\tilde{X}_t = 0]$ satisfies the following recurrence, with $k$ denoting the first positive time at which the position of the random walk is 0:

$$\Pr[\tilde{X}_t = 0] = \Pr[\tilde{X}_1 = 0]\Pr[\tilde{X}_t = 0 \mid \tilde{X}_1 = 0] + \Pr[\tilde{X}_1 = 1]\sum_{k=2}^{t}\Pr[\tilde{X}_1 = 1 \rightsquigarrow \tilde{X}_{k-1} = 1]\Pr[\tilde{X}_k = 0 \mid \tilde{X}_{k-1} = 1]\Pr[\tilde{X}_t = 0 \mid \tilde{X}_k = 0]$$
$$= (1-\gamma)\Pr[\tilde{X}_t = 0 \mid \tilde{X}_1 = 0] + \gamma\sum_{k=2}^{t}\Pr[\tilde{X}_1 = 1 \rightsquigarrow \tilde{X}_{k-1} = 1](1-\gamma)\Pr[\tilde{X}_t = 0 \mid \tilde{X}_k = 0] \quad \text{(by the transition probabilities)}$$
$$= (1-\gamma)\Pr[\tilde{X}_t = 0 \mid \tilde{X}_1 = 0] + \gamma\sum_{k=2}^{t}\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_{k-2} = 0](1-\gamma)\Pr[\tilde{X}_t = 0 \mid \tilde{X}_k = 0] \quad \text{(by Corollary 4.12)}$$
$$= (1-\gamma)\Pr[\tilde{X}_{t-1} = 0] + \gamma(1-\gamma)\sum_{k=0}^{t-2}\Pr[\tilde{X}_0 = 0 \rightsquigarrow \tilde{X}_k = 0]\Pr[\tilde{X}_{t-2-k} = 0]$$

The base case of this scalar recurrence is $\Pr[\tilde{X}_0 = 0] = 1$. Define a generating function $\phi(z) = \sum_{t=0}^{\infty}\Pr[\tilde{X}_t = 0]z^t$.
By the scalar recurrence above, $\phi(z)$ satisfies the following function recurrence:

$$\phi(z) = 1 + (1-\gamma)z \cdot \phi(z) + \gamma(1-\gamma)z^2 \cdot \chi(z) \cdot \phi(z)$$
$$\Leftrightarrow\; \phi(z) = \frac{1}{1 - (1-\gamma)z - \gamma(1-\gamma)z^2 \cdot \chi(z)} \quad \text{(solving for } \phi(z)\text{)}$$
$$= \frac{1}{1 - (1-\gamma)z - \gamma(1-\gamma)z^2 \cdot \frac{1 \pm \sqrt{1 - 4\gamma(1-\gamma)z^2}}{2\gamma(1-\gamma)z^2}} \quad \text{(substituting in } \chi(z)\text{)}$$
$$= \frac{1}{1 - (1-\gamma)z - \frac{1}{2}\left(1 \pm \sqrt{1 - 4\gamma(1-\gamma)z^2}\right)}$$
$$= \frac{2}{1 - 2(1-\gamma)z \pm \sqrt{1 - 4\gamma(1-\gamma)z^2}}$$

Since $\Pr[\tilde{X}_t = 0] \ge 0$ for all $t$, we have $\phi(z) = \sum_{t=0}^{\infty}\Pr[\tilde{X}_t = 0]z^t \ge 0$ for all $z \ge 0$. Thus, we replace $\pm$ with $+$ in the above equation to guarantee the denominator, and thus $\phi(z)$, is positive:

$$\phi(z) = \frac{2}{1 - 2(1-\gamma)z + \sqrt{1 - 4\gamma(1-\gamma)z^2}}$$
$$= \frac{2\left[1 - 2(1-\gamma)z - \sqrt{1 - 4\gamma(1-\gamma)z^2}\right]}{[1 - 2(1-\gamma)z]^2 - [1 - 4\gamma(1-\gamma)z^2]} \quad \text{(rationalizing the denominator)}$$
$$= \frac{1 - 2(1-\gamma)z - \sqrt{1 - 4\gamma(1-\gamma)z^2}}{2(1-\gamma)z(z-1)}$$

Recall that our quantity of interest was $\sum_{t=0}^{\infty}\{\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\}$. We can write this infinite sum in terms of $\phi(z)$ by using the fact that $\lim_{z \to 1^-}\phi(z) = \sum_{t=0}^{\infty}\Pr[\tilde{X}_t = 0]$:

$$\sum_{t=0}^{\infty}\left\{\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\right\} = \lim_{z \to 1^-}\left\{\sum_{t=0}^{\infty}\Pr[\tilde{X}_t = 0]z^t - \sum_{t=0}^{\infty}\frac{1-2\gamma}{1-\gamma}z^t\right\}$$
$$= \lim_{z \to 1^-}\left\{\phi(z) - \frac{1-2\gamma}{1-\gamma}\cdot\frac{1}{1-z}\right\} \quad \text{(by the geometric series formula)}$$
$$= \lim_{z \to 1^-}\left\{\frac{1 - 2(1-\gamma)z - \sqrt{1 - 4\gamma(1-\gamma)z^2}}{2(1-\gamma)z(z-1)} - \frac{1-2\gamma}{(1-\gamma)(1-z)}\right\} \quad \text{(substituting in } \phi(z)\text{)}$$
$$= \lim_{z \to 1^-}\left\{\frac{1 - 2\gamma z - \sqrt{1 - 4\gamma(1-\gamma)z^2}}{2(1-\gamma)z(z-1)}\right\}$$

At $z = 1$, this limit is a $\frac{0}{0}$ indeterminate form:

$$\text{Numerator} = 1 - 2\gamma - \sqrt{1 - 4\gamma(1-\gamma)} = (1-2\gamma) - |1-2\gamma| = 0 \quad \text{(by assumption that } \gamma < \tfrac{1}{2}\text{)}$$
$$\text{Denominator} = 2(1-\gamma)(1)(1-1) = 0$$

Therefore, we evaluate the limit by applying L'Hôpital's Rule:

$$\sum_{t=0}^{\infty}\left\{\Pr[\tilde{X}_t = 0] - \frac{1-2\gamma}{1-\gamma}\right\} = \lim_{z \to 1^-}\frac{-2\gamma - \frac{(-4\gamma)(1-\gamma)(2z)}{2\sqrt{1 - 4\gamma(1-\gamma)z^2}}}{4(1-\gamma)z - 2(1-\gamma)}$$
$$= \frac{-2\gamma - \frac{(-4\gamma)(1-\gamma)(2)}{2\sqrt{1 - 4\gamma(1-\gamma)}}}{4(1-\gamma) - 2(1-\gamma)} = \frac{-\gamma}{1-\gamma} + \frac{2\gamma}{1-2\gamma} \quad \text{(substituting in } z = 1\text{)}$$

Altogether, we have $\mathbb{E}[\tilde{X}_T] \le -\gamma + \frac{2\gamma(1-\gamma)}{1-2\gamma}$. This quantity is well-defined since $\gamma \ne 1$ and $\gamma \ne \frac{1}{2}$. Because the quantity does not depend on $T$, we conclude that $\mathbb{E}[\tilde{X}_T] = \Theta(1)$.

Below, we state and prove some lemmas that are used in the proof of Theorem 4.13.

Lemma 4.14. For all integers $t \ge 0$ and $n \ge 1$, $\Pr[\tilde{X}_t = n] \le \left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_t = n-1]$.

Proof. We use induction on $t$. Base case: $t = 0$.
The initial position of the walk is $\tilde{X}_0 = 0$, so it follows that for all $n \ge 1$, $\Pr[\tilde{X}_0 = n] = 0 \le \left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_0 = n-1]$.

Inductive step: $t \ge 1$. By the inductive hypothesis, assume this lemma holds for all $t' \in [\![0, t-1]\!]$. We want to prove this lemma holds for $t$. If $n = 1$:

$$\Pr[\tilde{X}_t = 1] = \gamma\Pr[\tilde{X}_{t-1} = 0] + (1-\gamma)\Pr[\tilde{X}_{t-1} = 2] \quad \text{(contributions from left and right)}$$
$$\le \gamma\Pr[\tilde{X}_{t-1} = 0] + (1-\gamma)\left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_{t-1} = 1] \quad \text{(by the inductive hypothesis)}$$
$$= \left(\frac{\gamma}{1-\gamma}\right)\Big((1-\gamma)\Pr[\tilde{X}_{t-1} = 0] + (1-\gamma)\Pr[\tilde{X}_{t-1} = 1]\Big) \quad \text{(contributions from left and right)}$$
$$= \left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_t = 0]$$

If $n > 1$:

$$\Pr[\tilde{X}_t = n] = \gamma\Pr[\tilde{X}_{t-1} = n-1] + (1-\gamma)\Pr[\tilde{X}_{t-1} = n+1] \quad \text{(contributions from left and right)}$$
$$\le \gamma\left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_{t-1} = n-2] + (1-\gamma)\left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_{t-1} = n] \quad \text{(by the inductive hypothesis)}$$
$$= \left(\frac{\gamma}{1-\gamma}\right)\Big(\gamma\Pr[\tilde{X}_{t-1} = n-2] + (1-\gamma)\Pr[\tilde{X}_{t-1} = n]\Big) \quad \text{(contributions from left and right)}$$
$$= \left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_t = n-1]$$

Corollary 4.15. For all integers $t \ge 0$ and $n \ge 0$, $\Pr[\tilde{X}_t = n] \le \left(\frac{\gamma}{1-\gamma}\right)^n\Pr[\tilde{X}_t = 0]$.

Proof. Repeatedly applying Lemma 4.14 proves this corollary:

$$\Pr[\tilde{X}_t = n] \le \left(\frac{\gamma}{1-\gamma}\right)\Pr[\tilde{X}_t = n-1] \le \left(\frac{\gamma}{1-\gamma}\right)^2\Pr[\tilde{X}_t = n-2] \le \cdots \le \left(\frac{\gamma}{1-\gamma}\right)^n\Pr[\tilde{X}_t = 0]$$

Lemma 4.16. For all integers $t \ge 0$, $\Pr[\tilde{X}_t = 0] \ge \frac{1-2\gamma}{1-\gamma}$.

Proof.

$$\Pr[\tilde{X}_t = 0] = 1 - \sum_{n=1}^{\infty}\Pr[\tilde{X}_t = n] \quad \text{(by the complement rule of probability)}$$
$$\ge 1 - \sum_{n=1}^{\infty}\left(\frac{\gamma}{1-\gamma}\right)^n\Pr[\tilde{X}_t = 0] \quad \text{(applying Corollary 4.15)}$$
$$= 1 - \left(\frac{\gamma}{1-2\gamma}\right)\Pr[\tilde{X}_t = 0] \quad \left(\text{by the geometric series formula; } \left|\tfrac{\gamma}{1-\gamma}\right| < 1 \text{ since } 0 < \gamma < \tfrac{1}{2}\right)$$

Solving for $\Pr[\tilde{X}_t = 0]$ completes the proof.

4.3 Biased random walks - Combinatorics

In this section, we analyze another biased random walk on $\mathbb{Z}$ that has slightly different behaviour at position 0. The initial state is $X_0 = 0$ and the transition probabilities are listed below and depicted in Figure 3. As is the case in the previous section, this random walk is well-defined if $0 < \gamma < 1$, and we assume $\gamma < \frac{1}{2}$.
Figure 4 traces the evolution of the reachable states in this random walk, with edge weights denoting state transition probabilities:

• If X_t > 0, then Y_t = +1 with probability γ and −1 with probability 1 − γ.
• If X_t = 0, then Y_t = +1 with probability 1/2 and −1 with probability 1/2.
• If X_t < 0, then Y_t = +1 with probability 1 − γ and −1 with probability γ.

Before we state the main conjecture of this section, we present some combinatorics background that is relevant to our proof.

Definition 4.17 (Young diagram). A Young diagram is a collection of cells arranged in left-justified rows, with the row lengths in non-increasing order from top to bottom. The shape of a Young diagram is denoted λ and is the list of row lengths from top to bottom.

Figure 3: State transition diagram of the random walk we analyze in Section 4.3. From each positive state the walk steps right with probability γ and left with probability 1 − γ; from each negative state, right with probability 1 − γ and left with probability γ; from state 0, right or left with probability 1/2 each.

Figure 4: The states that are reachable in the random walk depicted in Figure 3, arranged by time step, with edge weights denoting the probabilities of transitioning between adjacent states. The colour-coded material shows how this diagram can be used to calculate Pr[X_0 = 0 ⇝ X_{2j} = 0]; in this example, j = 3. WLOG, assume X_1 through X_{2j−1} are positive. Then Y_0 = 1 and Y_{2j−1} = −1, which are drawn in red. The walk from X_1 to X_{2j−1} must stay in the region drawn in blue; the number of such walks can be counted with a Young diagram.

Figure 5: A Young diagram with shape λ = (10, 7, 4, 3, 1). Cell c is labelled.
The hook H_λ(c) is shaded red, and its hook length is h_λ(c) = 8.

Figure 6: One possible Young tableau corresponding to the Young diagram in Figure 5.

Definition 4.18 (Hook). The hook of a cell c in a Young diagram with shape λ is denoted H_λ(c). The hook is the set containing c, all cells to the right of c, and all cells below c. The hook length is h_λ(c) = |H_λ(c)|.

Definition 4.19 (Young tableau). A Young tableau corresponding to a Young diagram containing n cells is obtained by filling in the cells with the integers in J1, nK such that each integer is used exactly once and the integers along each row (read from left to right) and each column (read from top to bottom) are strictly increasing.

Figure 5 and Figure 6 give examples of a Young diagram and a corresponding Young tableau, respectively.

Corollary 4.20. Let s be a string of length n containing k open brackets and n − k close brackets (with k ≥ n − k) such that every prefix of s contains at least as many open brackets as close brackets. The number of such strings s is equal to the number of Young tableaux corresponding to the Young diagram with shape λ = (k, n − k).

Theorem 4.21 (Hook-length formula). Consider a Young diagram with n cells and shape λ, and let C be the set of cells in this Young diagram. Then the number of Young tableaux corresponding to this Young diagram is n! / ∏_{c∈C} h_λ(c).

We conjecture that the expected absolute deviation from the origin is constant with respect to the number of steps T in the random walk:

Conjecture 4.22. For all positive integers T, E[|X_T|] = Θ(1).

In this section, we discuss the intuition behind a promising proof strategy for this conjecture, although we have not yet completed a formal proof. First, observe that this random walk, like the one in Section 4.2, is symmetric about the origin. Therefore, Pr[X_t = n] = Pr[X_t = −n] for all integers t ≥ 0 and n.
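Both Corollary 4.20 and the hook-length formula are easy to sanity-check computationally. The sketch below (shapes and parameters chosen arbitrarily for illustration) counts the bracket strings of Corollary 4.20 by brute force and compares the result against the hook-length formula applied to λ = (k, n − k).

```python
from math import factorial

def hook_count(shape):
    """Number of Young tableaux of a given shape, via the hook-length
    formula (Theorem 4.21): n! divided by the product of all hook lengths."""
    n = sum(shape)
    prod = 1
    for i, row_len in enumerate(shape):
        for j in range(row_len):
            arm = row_len - j - 1                          # cells right of c
            leg = sum(1 for r in shape[i + 1:] if r > j)   # cells below c
            prod *= arm + leg + 1                          # hook length of c
    return factorial(n) // prod

def ballot_count(n, k):
    """Brute force over all length-n bracket strings with k open brackets:
    count those where every prefix has at least as many opens as closes."""
    count = 0
    for mask in range(1 << n):
        bits = [(mask >> i) & 1 for i in range(n)]   # 1 = open bracket
        if sum(bits) != k:
            continue
        balance = 0
        for b in bits:
            balance += 1 if b else -1
            if balance < 0:
                break
        else:
            count += 1
    return count

# Corollary 4.20: ballot strings <-> tableaux of shape (k, n - k).
for n in range(2, 11):
    for k in range((n + 1) // 2, n + 1):
        assert ballot_count(n, k) == hook_count((k, n - k))
```

For instance, hook_count((4, 3)) and ballot_count(7, 4) both equal 14, the number of length-7 bracket strings with 4 opens whose every prefix is balanced or open-heavy.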
We then have:

    E[|X_T|] = Σ_{n=−T}^{T} |n|·Pr[X_T = n]      (definition of expectation)
             = Σ_{n=1}^{T} 2n·Pr[X_T = n]        (by symmetry of the random walk)

Definition 4.23. In this section, Pr[X_s = m ⇝ X_t = n] (with s < t) denotes the probability that, given the random walk is at position m at time s, it reaches position n at time t without visiting position 0 in between. Formally:

    Pr[X_s = m ⇝ X_t = n] = Pr[(∧_{k=s+1}^{t−1} X_k ≠ 0) ∧ X_t = n | X_s = m]

Usually, we will apply this definition with s = 0, m = 0, and n ≥ 0.

Corollary 4.24. For any integers T > 0 and n ∈ J1, T K with the same parity, Pr[X_T = n] satisfies the following recurrences. In both recurrences, 2i is the first positive time at which the position of the random walk is 0.

• If T and n are even, i.e. T = 2J and n = 2k for some integers J > 0 and k ∈ J1, JK, then:

    Pr[X_{2J} = 2k] = Σ_{i=1}^{J−k} Pr[X_0 = 0 ⇝ X_{2i} = 0]·Pr[X_{2(J−i)} = 2k] + Pr[X_0 = 0 ⇝ X_{2J} = 2k]

• If T and n are odd, i.e. T = 2J − 1 and n = 2k − 1 for some integers J > 0 and k ∈ J1, JK, then:

    Pr[X_{2J−1} = 2k − 1] = Σ_{i=1}^{J−k} Pr[X_0 = 0 ⇝ X_{2i} = 0]·Pr[X_{2(J−i)−1} = 2k − 1] + Pr[X_0 = 0 ⇝ X_{2J−1} = 2k − 1]

This corollary uses the fact that the random walk's transition probabilities are time-invariant, so Pr[X_t = n | X_s = 0] = Pr[X_{t−s} = n | X_0 = 0]. Below, we derive closed-form formulas for Pr[X_0 = 0 ⇝ X_t = n]. There are three cases to consider:

• n = 0 and t even (Lemma 4.25)
• n > 0 with t and n both even (Lemma 4.26)
• n > 0 with t and n both odd (Lemma 4.27)

Lemma 4.25. Let j be a positive integer. Then Pr[X_0 = 0 ⇝ X_{2j} = 0] = γ^{j−1}(1 − γ)^{j}·(2j − 2)!/(j!(j − 1)!).

Proof. In the context of a walk from X_0 = 0 to X_{2j} = 0, call {X_k}_{k=1}^{2j−1} the set of internal positions. Any walk from X_0 = 0 to X_{2j} = 0 with all nonzero internal positions has either all positive internal positions or all negative internal positions. Since the transition probabilities depicted in Figure 3 are symmetric about the origin, every walk with all positive internal positions has a corresponding "mirror" walk, with the same probability, with all negative internal positions. Therefore, when computing Pr[X_0 = 0 ⇝ X_{2j} = 0], we can consider only the walks with all positive internal positions and then multiply by 2. Every such walk can be traced out as follows:

• The first step is Y_0 = 1, with probability 1/2.

• Next, we walk from X_1 = 1 to X_{2j−1} = 1 without visiting position 0. In doing so, we take j − 1 steps right and j − 1 steps left. After taking these steps, the time is 1 + (j − 1) + (j − 1) = 2j − 1 and the position is 1 + (j − 1) − (j − 1) = 1, as desired. Each right step has probability γ and each left step has probability 1 − γ, so these steps have combined probability γ^{j−1}(1 − γ)^{j−1}. By Corollary 4.20, the number of such walks is the number of Young tableaux corresponding to the Young diagram with shape λ = (j − 1, j − 1), whose hook lengths are j, j − 1, …, 3, 2 in the first row and j − 1, j − 2, …, 2, 1 in the second row. By Theorem 4.21, there are (2j − 2)!/(j!(j − 1)!) such walks.

• The last step is Y_{2j−1} = −1; since X_{2j−1} = 1 > 0, this step has probability 1 − γ.

Multiplying everything together completes the proof:

    Pr[X_0 = 0 ⇝ X_{2j} = 0] = (1/2)·(γ^{j−1}(1 − γ)^{j−1}·(2j − 2)!/(j!(j − 1)!))·(1 − γ)·(2)
                                 (bullet one × bullet two × bullet three × symmetry)
                             = γ^{j−1}(1 − γ)^{j}·(2j − 2)!/(j!(j − 1)!)

Lemma 4.26. Let j and k be integers such that j > 0 and k ∈ J1, jK. Then:

    Pr[X_0 = 0 ⇝ X_{2j} = 2k] = (1/2)·γ^{j+k−1}(1 − γ)^{j−k}·(2j − 1)!(2k)/((j + k)!(j − k)!)

Proof. Since 2k > 0, any walk from X_0 = 0 to X_{2j} = 2k with all nonzero internal positions has all positive internal positions. Every such walk can be traced out as follows:

• The first step is Y_0 = 1, with probability 1/2.

• Next, we walk from X_1 = 1 to X_{2j} = 2k without visiting position 0. In doing so, we take j + k − 1 steps right and j − k steps left. After taking these steps, the time is 1 + (j + k − 1) + (j − k) = 2j and the position is 1 + (j + k − 1) − (j − k) = 2k, as desired. Each right step has probability γ and each left step has probability 1 − γ, so these steps have combined probability γ^{j+k−1}(1 − γ)^{j−k}. By Corollary 4.20, the number of such walks is the number of Young tableaux corresponding to the Young diagram with shape λ = (j + k − 1, j − k), whose hook lengths are j + k, j + k − 1, …, 2k + 2, 2k + 1, 2k − 1, 2k − 2, …, 2, 1 in the first row and j − k, j − k − 1, …, 2, 1 in the second row. By Theorem 4.21, there are (2j − 1)!(2k)/((j + k)!(j − k)!) such walks.

Multiplying everything together completes the proof:

    Pr[X_0 = 0 ⇝ X_{2j} = 2k] = (1/2)·(γ^{j+k−1}(1 − γ)^{j−k}·(2j − 1)!(2k)/((j + k)!(j − k)!))   (bullet one × bullet two)

Lemma 4.27. Let j and k be integers such that j > 0 and k ∈ J1, jK. Then:

    Pr[X_0 = 0 ⇝ X_{2j−1} = 2k − 1] = (1/2)·γ^{j+k−2}(1 − γ)^{j−k}·(2j − 2)!(2k − 1)/((j + k − 1)!(j − k)!)

Proof. Since 2k − 1 > 0, any walk from X_0 = 0 to X_{2j−1} = 2k − 1 with all nonzero internal positions has all positive internal positions. Every such walk can be traced out as follows:

• The first step is Y_0 = 1, with probability 1/2.

• Next, we walk from X_1 = 1 to X_{2j−1} = 2k − 1 without visiting position 0. In doing so, we take j + k − 2 steps right and j − k steps left. After taking these steps, the time is 1 + (j + k − 2) + (j − k) = 2j − 1 and the position is 1 + (j + k − 2) − (j − k) = 2k − 1, as desired. Each right step has probability γ and each left step has probability 1 − γ, so these steps have combined probability γ^{j+k−2}(1 − γ)^{j−k}. By Corollary 4.20, the number of such walks is the number of Young tableaux corresponding to the Young diagram with shape λ = (j + k − 2, j − k), whose hook lengths are j + k − 1, j + k − 2, …, 2k + 1, 2k, 2k − 2, 2k − 3, …, 2, 1 in the first row and j − k, j − k − 1, …, 2, 1 in the second row. By Theorem 4.21, there are (2j − 2)!(2k − 1)/((j + k − 1)!(j − k)!) such walks.

Multiplying everything together completes the proof:

    Pr[X_0 = 0 ⇝ X_{2j−1} = 2k − 1] = (1/2)·(γ^{j+k−2}(1 − γ)^{j−k}·(2j − 2)!(2k − 1)/((j + k − 1)!(j − k)!))   (bullet one × bullet two)

5 Future Work

To complete a formal proof of Conjecture 4.1, we seek to remove the three temporary assumptions in Section 4.1.
Here we examine how removing the second assumption breaks the current proof. Without the second assumption, by Definition 4.4 we have |∇f(x)| ≥ 1/√T = g for all x ∉ N. For L = 1 and t = 2, we have:

    Pr[τ = ∞ | X_0] ≤ 2·exp{−g²t/(8L²)}     (from earlier work in Section 4.1)
                    = 2·exp{−g²/4}           (substituting in L = 1 and t = 2)
                    = 2·exp{−1/(4T)}         (substituting in g = 1/√T)

Then, for T > 1/(4 ln 24), there does not exist a b ∈ (0, 1) such that 2·exp{−g²t/(8L²)} ≤ [(1 − b)/(1 + Lt)]·b^{t−1}:

    2·exp{−g²t/(8L²)} = 2·exp{−1/(4T)} > 1/12 = max_{b∈(0,1)} {(1 − b)b/3} = max_{b∈(0,1)} {[(1 − b)/(1 + Lt)]·b^{t−1}}

Now, we investigate what happens if the third assumption is removed. Here we provide an example where the second assumption does not imply the third assumption. Let L = 1, t = 2, and g = 2√(ln(1/0.49)). Then:

    2·exp{−g²t/(8L²)} ≤ [(1 − b)/(1 + Lt)]·b^{t−1}
      ⇔ 2·exp{−g²/4} ≤ (1 − b)b/3            (substituting in L = 1 and t = 2)
      ⇔ 0.98 ≤ (1 − b)b/3                     (substituting in g = 2√(ln(1/0.49)))

The final statement is impossible since max_{b∈(0,1)} {(1 − b)b/3} = 1/12 < 0.98.

In Section 4.3, we derived a recurrence for Pr[X_T = n] and thus an algorithm for computing E[|X_T|]. However, there is no obvious closed-form solution for the recurrence, so it is difficult to analyze how quickly Pr[X_T = n] grows as a function of T. A natural extension for future work is to use Stirling's approximation to convert the factorial terms in Section 4.3 into exponential terms, which are easier to bound.

References

[1] Shamir, O. & Zhang, T. (2013). Stochastic Gradient Descent for Non-Smooth Optimization: Convergence Results and Optimal Averaging Schemes. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):71-79.

[2] Sales, E. (2020). Restricted-Dimension Subgradient Descent: Asymptotic Bounds on Error. https://emsal.me/thesis.pdf

[3] Koren, T. & Segal, S. (2020). Open Problem: Tight Convergence of SGD in Constant Dimension. Proceedings of the 33rd Conference on Learning Theory, PMLR 125:3847-3851.

[4] Agarwal, A., Wainwright, M. J., Bartlett, P., & Ravikumar, P. (2009).
Information-Theoretic Lower Bounds on the Oracle Complexity of Convex Optimization. Advances in Neural Information Processing Systems, 22:1-9.

[5] Weisstein, E. (2021, April 13). Random Walk–1-Dimensional. MathWorld - A Wolfram Web Resource. https://mathworld.wolfram.com/RandomWalk1-Dimensional.html
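As a supplementary numerical check of Section 4.3, the sketch below (with γ = 0.3 chosen arbitrarily) computes the exact distribution of the walk by dynamic programming and verifies that Pr[X_T = n], assembled from the recurrence of Corollary 4.24 together with the first-passage formulas of Lemmas 4.26 and 4.27, matches it. The first-return probability used here is Pr[X_0 = 0 ⇝ X_{2j} = 0] = γ^{j−1}(1 − γ)^j·(2j − 2)!/(j!(j − 1)!), in which the final step from position 1 down to 0 carries probability 1 − γ.

```python
from math import factorial

def walk_distribution(gamma, T):
    """Exact distribution of the Section 4.3 walk after T steps, by DP.
    From 0: +1/-1 with probability 1/2 each; from n > 0: +1 w.p. gamma,
    -1 w.p. 1 - gamma; from n < 0: mirrored."""
    p = {0: 1.0}
    for _ in range(T):
        q = {}
        for n, pr in p.items():
            if n == 0:
                up, down = 0.5, 0.5
            elif n > 0:
                up, down = gamma, 1 - gamma
            else:
                up, down = 1 - gamma, gamma
            q[n + 1] = q.get(n + 1, 0.0) + up * pr
            q[n - 1] = q.get(n - 1, 0.0) + down * pr
        p = q
    return p

def catalan(m):
    return factorial(2 * m) // (factorial(m + 1) * factorial(m))

def first_return(gamma, j):
    # Pr[X_0 = 0 ~> X_{2j} = 0]: first step 1/2, Catalan-many interior paths,
    # final step from 1 down to 0 w.p. 1 - gamma, times 2 for mirror walks.
    return gamma ** (j - 1) * (1 - gamma) ** j * catalan(j - 1)

def passage(gamma, T, n):
    # Pr[X_0 = 0 ~> X_T = n] for n >= 1 (Lemmas 4.26 and 4.27).
    if T % 2 == 0:
        j, k = T // 2, n // 2
        return 0.5 * gamma ** (j + k - 1) * (1 - gamma) ** (j - k) \
            * factorial(2 * j - 1) * (2 * k) / (factorial(j + k) * factorial(j - k))
    j, k = (T + 1) // 2, (n + 1) // 2
    return 0.5 * gamma ** (j + k - 2) * (1 - gamma) ** (j - k) \
        * factorial(2 * j - 2) * (2 * k - 1) / (factorial(j + k - 1) * factorial(j - k))

def prob(gamma, T, n):
    # Pr[X_T = n] for n >= 1 via the Corollary 4.24 recurrence:
    # condition on the first return to the origin at time 2i.
    J = T // 2 if T % 2 == 0 else (T + 1) // 2
    k = n // 2 if T % 2 == 0 else (n + 1) // 2
    total = passage(gamma, T, n)
    for i in range(1, J - k + 1):
        total += first_return(gamma, i) * prob(gamma, T - 2 * i, n)
    return total

gamma = 0.3
for T in (4, 5, 8, 9):
    dist = walk_distribution(gamma, T)
    for n in range(2 - T % 2, T + 1, 2):
        assert abs(prob(gamma, T, n) - dist[n]) < 1e-9

for T in (10, 100, 400):
    dist = walk_distribution(gamma, T)
    # E[|X_T|] appears bounded as T grows, in line with Conjecture 4.22
    print(T, sum(abs(n) * pr for n, pr in dist.items()))
```

The DP serves as ground truth: the recurrence and the closed forms agree with it exactly (up to floating-point error) for every reachable position, and the computed E[|X_T|] values remain bounded as T increases.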