Restricted-dimension subgradient descent: asymptotic bounds on error

Emmanuel Sales
University of British Columbia
April 2020

Abstract

Convex optimization, the study of minimizing convex functions over convex sets, is host to a multitude of methods with far-reaching applications in machine learning. For methods in convex optimization, it is often of interest to analyze the asymptotic bounds of the error of the method, or in other words, how close the result of the method gets to the minimum value after some set period of time.

Gradient descent, an iterative procedure that involves taking the gradients of convex functions at a sequence of points, is one of the most important algorithms in the field of convex optimization. Subgradient descent refers to gradient descent that can be applied to functions that need not be smooth. The primary focus of this text is this particular type of gradient descent.

This text explores asymptotic bounds on the final iterate error of subgradient descent; that is, the error between the result of the subgradient descent procedure after a set number of steps $T$ and the true minimum value of the function. Prior work has established tight asymptotic error bounds that are independent of the dimension of the underlying set; this work explores the possibility of there existing tighter bounds when the dimension of the underlying set is restricted to a finite constant $d$ that is independent of $T$. In this work we have proven that in the case of $d = 1$, the final iterate error has an upper bound of $O(1/\sqrt{T})$.

1 Introduction: Convex optimization

1.1 Convex sets and functions

The domain of convex optimization is concerned with problems of the following form: given a convex function $f$ on a convex set $S$, devise an algorithm for finding an approximate minimum of $f(x)$ on $S$. Definitions of the relevant concepts are as below:

Definition 1.1 (Convex set). A set $S \subseteq \mathbb{R}^n$ is convex if $\forall x, y \in S, \forall \theta \in [0, 1]$, $\theta x + (1 - \theta) y \in S$.

Definition 1.2 (Convex function). A function $f : S \to \mathbb{R}$ is convex if $\forall x, y \in S, \forall \theta \in [0, 1]$, $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$.

1.2 Convex optimization problems

In general, convex optimization methods are concerned with functions about whose global landscape and optima we have limited or no prior information. Instead, different methods are concerned with scenarios in which, given a point $x$ in the set $S$ of interest, we can query the value of the function $f(x)$ and possibly its first- or second-order derivatives. For functions that are not smooth, first-order methods can make use of elements of the subdifferential of the function at a point.

Definition 1.3 (Subdifferential). Let $S \subset \mathbb{R}^d$ be convex. The subdifferential of a function $f : \mathbb{R}^d \to \mathbb{R}$ at the point $x$, denoted $\partial f(x)$, is the set of directions $g \in \mathbb{R}^d$ such that, for all $y \in S$, $f(y) \ge f(x) + g^T(y - x)$.
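To make Definition 1.3 concrete: for $f(x) = |x|$, the subdifferential at $0$ is the whole interval $[-1, 1]$, and any element of it is a valid subgradient. The following minimal Python sketch (an illustration added here, not part of the original development) returns one such element:

    def subgradient_abs(x: float) -> float:
        """Return one element of the subdifferential of f(x) = |x|.

        For x != 0 the function is differentiable, so the subdifferential
        is the singleton {sign(x)}; at x = 0 it is the interval [-1, 1],
        and any value in that interval (here 0.0) is a valid choice.
        """
        if x > 0:
            return 1.0
        if x < 0:
            return -1.0
        return 0.0  # any g in [-1, 1] satisfies |y| >= |0| + g * (y - 0)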
1.2.1 Convex optimization in machine learning

In machine learning, convex optimization problems often take the form of finding

    $\min_{w \in \mathbb{R}^n} \sum_{i=1}^{m} f_i(w) + \lambda \mathcal{R}(w)$

where the functions $f_1, \dots, f_m, \mathcal{R}$ are convex and $\lambda \ge 0$ is a known parameter (Bubeck [1]). Convex optimization is particularly useful in these cases because local minima of convex functions are guaranteed to be global minima, thus simplifying the criteria needed to find global minima.

In machine learning these functions $f_i$ are determined by a data set $(x_i, y_i) \in \mathbb{R}^n \times \mathcal{Y}$ for $i = 1, \dots, m$. The $f_i$ generally represent a loss function that quantifies how closely a function $g_w(x_i)$, parameterized by $w$, approximates the target $y_i$ for each $x_i$: in general, the lower each $f_i$ is, the closer $g_w(x_i)$ comes to $y_i$. The function $\mathcal{R}(w)$ acts as a regularizer, which imposes additional constraints on $w$ in the form of a penalty.

In the problem of linear regression, for example, the $g_w(x_i)$ we are interested in is $w^T x_i$. Setting $f_i(w) = (w^T x_i - y_i)^2$ and $\mathcal{R}(w) = 0$, we obtain the least squares minimization problem, which can be reframed in matrix notation as $\min \|Xw - y\|^2$, where $X \in \mathbb{R}^{m \times n}$ is the matrix whose $i$th row is $x_i^T$, and $y \in \mathbb{R}^m$ is the vector of $y_i$ values. The least-squares problem has been thoroughly explored theoretically, and it has the analytical closed-form solution $w = (X^T X)^{-1} X^T y$ when the matrix $X$ has full rank.

Many other problems also fall into this framework but do not reap the benefits of a closed-form solution. For example, setting the $f_i$ as in the case of linear regression but adding $\mathcal{R}(w) = \|w\|_1$ produces the lasso problem ([2]), which requires iterative methods to solve.
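As a quick illustration of the closed-form solution above, the following NumPy sketch (with synthetic data; the sizes and names are arbitrary assumptions) solves the normal equations and checks the result against a library least-squares solver. No analogous one-line formula exists for the lasso, which is why iterative methods are needed there:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.standard_normal((m, n))               # design matrix, rows x_i^T
    w_true = rng.standard_normal(n)
    y = X @ w_true + 0.1 * rng.standard_normal(m) # noisy targets

    # Closed-form least squares: w = (X^T X)^{-1} X^T y (X assumed full rank).
    w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, numerically preferable library solver.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(w_normal_eq, w_lstsq))      # True: same minimizer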
1.3 Convex optimization methods and gradient descent

First-order methods are convex optimization methods that involve the first derivative or gradient of the function. These are useful when the global landscape of the function (and thus, its optimum points) cannot be determined analytically, but it is computationally easy to compute the value of the function at any given point $x$.

Central to the analysis of first-order methods is the notion of oracle complexity, which Bubeck ([1]) defines in terms of queries to an "oracle": given an input $x$, the oracle produces either the value of $f(x)$ (a zeroth-order oracle) or a subgradient of $f$ at $x$ (a first-order oracle; more in Section 2), and the oracle complexity of a method is the number of such queries it requires.

Gradient descent is an iteration on the equation

    $x_{t+1} = x_t - \eta_t \nabla f(x_t)$    (1)

Results have been shown for lower bounds that apply to any gradient descent optimization method, mostly by proving the existence of functions for which any black-box procedure would only get within a certain distance of the minimum value of the function within $t$ iterations (or oracle queries). In this context, a gradient descent method is defined as an iterative procedure $\{x_i\}$ and $\{g_i\}$ where $g_i = \nabla f(x_i)$, the initial $x_1 = 0$, and $x_{t+1} \in \mathrm{Span}(g_1, \dots, g_t)$.

These theorems make use of the important additional concept of strongly convex functions, noted as being significant because functions of this class significantly speed up the performance of first-order methods.

Definition 1.4 (Strongly convex function). A function $f : S \to \mathbb{R}$ is $\alpha$-strongly convex if $\forall x, y \in S$, $f$ satisfies the inequality $f(x) - f(y) \le \nabla f(x)^T (x - y) - \frac{\alpha}{2} \|x - y\|^2$. If $f$ is not smooth then $\nabla f(x)$ can be replaced by $g_x \in \partial f(x)$.

These are Theorems 3.13, 3.14, and 3.15 in Bubeck.

Theorem 1.1. Let $t \le n$ and $L, R > 0$. There exists a convex and $L$-Lipschitz function $f$ such that for any gradient descent black-box procedure,

    $\min_{1 \le s \le t} f(x_s) - \min_{\|x\|_2 \le R} f(x) \ge \frac{RL}{2(1 + \sqrt{t})}$    (2)

Theorem 1.2. Let $t \le (n - 1)/2$ and $\beta > 0$. There exists a continuously differentiable function $f$ for which $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$ for all $x$ and $y$, such that for any gradient descent procedure,

    $\min_{1 \le s \le t} f(x_s) - f(x^*) \ge \frac{3\beta}{32} \frac{\|x_1 - x^*\|^2}{(t + 1)^2}$    (3)

Theorem 1.3. Let $\kappa > 1$. There exists an $\alpha$-strongly convex continuously differentiable function $f$ for which $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$ for all $x$ and $y$, with $\kappa = \beta/\alpha$, such that for any gradient descent procedure, we have

    $f(x_t) - f(x^*) \ge \frac{\alpha}{2} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2(t-1)} \|x_1 - x^*\|^2$    (4)

2 Subgradient descent

We examine subgradient descent on convex and bounded sets.

2.1 Definition of subgradient descent

Subgradient descent is a type of gradient descent that works on functions that are not necessarily smooth. One important usage of subgradient descent is in the case of L1 minimization and regularization; loss functions of the form $\sum_{i=1}^{n} |g(x_i) - y_i|$ are nonsmooth and require the use of subgradients, as a gradient does not exist at every point. We have

    $x_{t+1} = x_t - \eta_t g_t$    (5)

where $g_t \in \partial f(x_t)$, not requiring the gradient $\nabla f(x_t)$ to exist as in (1).

The algorithm can be written as follows:

    Algorithm 1: SubgradientDescent
      $x_1 \leftarrow$ some initial guess in the set $S$
      for $t$ in $1 \dots T - 1$ do
        $\eta_t \leftarrow 1/\sqrt{t}$
        $g_t \leftarrow$ an element of $\partial f(x_t)$
        $x_{t+1} \leftarrow \Pi_S(x_t - \eta_t g_t)$
      end

where $\Pi_S$, the projection operator, is defined by $\Pi_S(z) = \arg\min_{x \in S} \|x - z\|_2$.
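A direct Python transcription of Algorithm 1 for the one-dimensional setting studied in Section 3 follows. This is a sketch: the subgradient oracle and the projection $\Pi_S$ are supplied by the caller as functions, which is an interface assumption rather than something the algorithm prescribes.

    import math
    from typing import Callable

    def subgradient_descent(
        x1: float,
        subgrad: Callable[[float], float],  # oracle returning some g in df(x)
        project: Callable[[float], float],  # projection onto the set S
        T: int,
    ) -> float:
        """Return the final iterate x_T after the T - 1 updates of Algorithm 1."""
        x = x1
        for t in range(1, T):
            eta = 1.0 / math.sqrt(t)        # nonsummable diminishing step size
            g = subgrad(x)
            x = project(x - eta * g)        # x_{t+1} = Pi_S(x_t - eta_t g_t)
        return x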
2.1.1 Choices of step size

There are many possible choices of $\eta_t$. Boyd, Xiao, and Mutapcic ([3]) list four:

• Constant step size: $\eta_t = h$, independent of $t$.
• Constant step length: $\eta_t = h/\|g_t\|_2$, such that $\|x_{t+1} - x_t\|$ is the same for all $t$.
• Square summable but not summable: choices of $\eta_t$ such that $\sum_{t=1}^{\infty} \eta_t^2 < \infty$ and $\sum_{t=1}^{\infty} \eta_t = \infty$.
• Nonsummable diminishing: choices of $\eta_t$ such that $\lim_{t \to \infty} \eta_t = 0$ and $\sum_{t=1}^{\infty} \eta_t = \infty$.

In Section 3, where we provide a proof related to final iterate error for the case where $\dim(S) = 1$, we examine the algorithm with $\eta_t = 1/\sqrt{t}$, which is an example of a nonsummable diminishing step size rule.

2.2 Convergence results

Definition 2.1. The final iterate error of a run of subgradient descent is the value of $f(x_T) - f(x^*)$.

Boyd ([3]) states that for the diminishing step size rule, $\lim_{t \to \infty} f(x_t) = f(x^*)$, i.e. the algorithm is guaranteed to converge to the optimal value.

Prior work has established that for a run of subgradient descent with $T$ iterations on a $T$-dimensional space, the final iterate error is $\Omega(\log(T)/\sqrt{T})$ [4].

Shamir and Zhang [5] established an upper bound of $O(\log(T)/\sqrt{T})$ on the expected final iterate error of stochastic gradient descent on nonsmooth convex objective functions, without smoothness assumptions.

2.3 Goals of this work

This work's primary aim is to examine the properties of subgradient descent on convex functions in arbitrary finite-dimensional cases. We aim to examine the possibility that tighter bounds than existing results can be achieved if we restrict the dimension of the space we are analyzing to a finite constant $d$.

In Section 3, we establish a tight upper bound of $O(1/\sqrt{T})$ on the final iterate error of the algorithm in the case where $d = 1$.

3 Upper bound for final iterate error where d = 1

The primary result is the following theorem on the final iterate error of subgradient descent in the restricted one-dimensional case. This is a tighter result than the general upper bound of $O(\log(T)/\sqrt{T})$ achieved by Shamir and Zhang ([5]).

Theorem 3.1. Let $S \subset \mathbb{R}$ be convex and bounded (without loss of generality, let $\mathrm{diam}(S) = 1$). Let $f : S \to \mathbb{R}$ be convex and 1-Lipschitz. Then the final iterate error $|f(x_T) - f(x^*)|$ of a run of subgradient descent is in $O(1/\sqrt{T})$.

Figure 1 (three panels (a), (b), (c); plots omitted in this text version): Example of three cases where Theorem 3.1 applies. Figures 1a and 1c are examples of convex functions that are monotonically decreasing on their underlying set, and hence have the minimizer $x^*$ lying at one of the endpoints. Figure 1b denotes the case where the minimizer $x^*$ resides in the middle.
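Before turning to the proof, a quick numerical illustration (an added sketch built on the subgradient_descent transcription from Section 2, with the test function $f(x) = |x - 0.3|$ on $S = [0, 1]$ chosen arbitrarily): the scaled error $\sqrt{T}\,(f(x_T) - f(x^*))$ stays bounded as $T$ grows, consistent with the claimed $O(1/\sqrt{T})$ rate.

    import math

    f = lambda x: abs(x - 0.3)              # convex, 1-Lipschitz; x* = 0.3
    sg = lambda x: 1.0 if x > 0.3 else (-1.0 if x < 0.3 else 0.0)
    proj = lambda z: min(max(z, 0.0), 1.0)  # projection onto S = [0, 1]

    for T in (10, 100, 1000, 10000):
        xT = subgradient_descent(x1=1.0, subgrad=sg, project=proj, T=T)
        err = f(xT)                          # f(x*) = 0 here
        print(T, err * math.sqrt(T))         # remains bounded, per Theorem 3.1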
Proof. We prove this by induction on the number of iterations.

The base case is the first iteration, $T = 1$, i.e. the initial value. Set $c = \max(\sqrt{2}, f(x_1) - f(x^*))$. Any initial starting point will have error that is less than or equal to $c/\sqrt{1}$.

The inductive step: suppose the iteration error satisfies $f(x_{T-1}) - f(x^*) \le \frac{c}{\sqrt{T-1}}$. We will show that this implies

    $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T}}$    (6)

We have

    $f(x_T) - f(x^*) = f(x_T) - f(x_{T-1}) + f(x_{T-1}) - f(x^*)$    (7)

which is at most $(f(x_T) - f(x_{T-1})) + \frac{c}{\sqrt{T-1}}$ by the inductive hypothesis.

We then examine the quantity $f(x_T) - f(x_{T-1})$. By the subgradient descent algorithm we have

    $x_T = \Pi_S(x_{T-1} - \eta_{T-1} g_{T-1})$    (8)

with $\eta_{T-1} = \frac{1}{\sqrt{T-1}}$ and $g_{T-1}$ in the subdifferential of $f$ at $x_{T-1}$.

Let $\dot{x}_T = x_{T-1} - \eta_{T-1} g_{T-1}$, so that $x_T = \Pi_S(\dot{x}_T)$.

Because $f$ is 1-Lipschitz, we know that $|g_i| \le 1$ for all $i$.

An important property used in showing the inequality is the monotonicity of the subgradient of convex functions:

Claim 3.1.1 (Monotonicity of the subgradient). For convex $f$, for all $x, y \in S$, $g_x \in \partial f(x)$, $g_y \in \partial f(y)$, we have $(g_x - g_y)(x - y) \ge 0$.

Proof. By the subgradient inequality we have both $f(y) \ge f(x) + g_x(y - x)$ and $f(x) \ge f(y) + g_y(x - y)$. Combining these two inequalities we have $0 \ge (g_x - g_y)(y - x)$, or, by changing signs, our intended inequality of $(g_x - g_y)(x - y) \ge 0$.

To prove the induction we condition on three cases depending on the presence of $0$ in $\partial f(x_T)$ and $\partial f(x_{T-1})$.

Case 1: If $0 \in \partial f(x_T)$ then we have $f(x_T) = f(x^*)$, which makes $f(x_T) - f(x^*) = 0$, which is less than $\frac{c}{\sqrt{T}}$ for all $T$.

Case 2: If $0 \notin \partial f(x_T)$ but $0 \in \partial f(x_{T-1})$, we have $x_{T-1} = x^*$ (the optimum) and thus $|x_T - x_{T-1}| = \eta_{T-1} |g_{T-1}|$. Because $f$ is 1-Lipschitz we have $|x_T - x_{T-1}| \le \eta_{T-1} = \frac{1}{\sqrt{T-1}}$; again by the Lipschitz property we conclude $f(x_T) - f(x^*) = f(x_T) - f(x_{T-1}) \le \frac{1}{\sqrt{T-1}} \le \frac{c}{\sqrt{T}}$ (the last step using $c \ge \sqrt{2}$ and $T \ge 2$).

Case 3: $0 \notin \partial f(x_{T-1})$ and $0 \notin \partial f(x_T)$. For this case we show that (6) holds in both possible subcases: $\mathrm{sign}(\max \partial f(x_T)) = \mathrm{sign}(\max \partial f(x_{T-1}))$, and $\mathrm{sign}(\max \partial f(x_T)) \ne \mathrm{sign}(\max \partial f(x_{T-1}))$.

Case 3.1: $\mathrm{sign}(\max \partial f(x_T)) = \mathrm{sign}(\max \partial f(x_{T-1}))$.

There is a special case for when $x_T$ is on the boundary of the set $S$. As such, we condition on whether or not $x_T$ is on the boundary of $S$.

Case 3.1a: $x_T$ is on the boundary of $S$. In that case, we are able to show the following:

Claim 3.1.2. If $x_T$ is on the boundary of $S$ and $\mathrm{sign}(\max \partial f(x_T)) = \mathrm{sign}(\max \partial f(x_{T-1}))$, then $x_T$ must be a global minimizer for $f$ on $S$.

Proof. Because $x_T$ is on the boundary of a one-dimensional set, either $x_T \ge x$ for all $x \in S$ or $x_T \le x$ for all $x \in S$. Because $x_T = x_{T-1} - \eta_{T-1} g_{T-1}$, and $\eta_{T-1} > 0$, we are able to conclude by algebra that $\mathrm{sign}(g_T) = \mathrm{sign}(g_{T-1}) = -\mathrm{sign}(x_T - x_{T-1}) = \mathrm{sign}(x_{T-1} - x_T)$.

By the subgradient inequality we have $f(x) \ge f(x_T) + g_T(x - x_T)$ for all $x \in S$. Because $\mathrm{sign}(g_T) = \mathrm{sign}(x_{T-1} - x_T) = \mathrm{sign}(x - x_T)$ for all $x \in S$, we have $f(x_T) + g_T(x - x_T) \ge f(x_T)$ and therefore $f(x) \ge f(x_T)$ for all $x \in S$.

Thus $f(x_T) = f(x^*)$ and we have $f(x_T) - f(x^*) = 0 \le \frac{c}{\sqrt{T}}$.

Case 3.1b: $x_T$ is in the interior of $S$. For this case we use the following claim:

Claim 3.1.3. If $\mathrm{sign}(\max \partial f(x_T)) = \mathrm{sign}(\max \partial f(x_{T-1}))$, then $f(x_T) \le f(x_{T-1})$ and $|g_T| \le |g_{T-1}|$.

Proof. Applying Claim 3.1.1 to $x_T$ and $x_{T-1}$, we have

    $(g_T - g_{T-1})(x_T - x_{T-1}) = (g_T - g_{T-1})(-\eta_{T-1} g_{T-1}) \ge 0$    (9)

We thus have $-g_T g_{T-1} + g_{T-1}^2 \ge 0$. Because we know that $g_T$ and $g_{T-1}$ have the same sign, we know that $g_T g_{T-1} \ge 0$. Thus we know that $g_{T-1}^2 \ge |g_T g_{T-1}|$ and thus $|g_T| \le |g_{T-1}|$.

Using this fact, and the fact that $g_T$ and $g_{T-1}$ have the same sign, we know that either $g_{T-1} \le g_T < 0$ or $0 < g_T \le g_{T-1}$. We remark that $0$ is in the subdifferential at a point $z$ if and only if $z$ is a global minimizer (a point $x^*$) for $f$, as $f$ is convex. Thus by monotonicity of the subgradient we have either $x_{T-1} < x_T \le x^*$ or $x^* \le x_T < x_{T-1}$, associated with each case respectively. We are thus able to conclude

    $\mathrm{sign}(g_T) = \mathrm{sign}(x_{T-1} - x_T)$

We know by the subgradient inequality that $f(x_{T-1}) \ge f(x_T) + g_T(x_{T-1} - x_T)$. Because $g_T$ and $(x_{T-1} - x_T)$ have the same sign, the quantity $g_T(x_{T-1} - x_T)$ is nonnegative; thus we have

    $f(x_{T-1}) \ge f(x_T)$

Corollary 3.1.1. $\mathrm{sign}(g_T) = \mathrm{sign}(g_{T-1})$, because $0 \notin \partial f(x_T)$ and $0 \notin \partial f(x_{T-1})$.

To show that $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T}}$, it is sufficient to show that

    $f(x_{T-1}) - f(x_T) \ge \frac{c}{\sqrt{T-1}} - \frac{c}{\sqrt{T}}$    (10)

We examine the cases where $|g_T| \le \frac{\alpha}{\sqrt{T}}$ and where $|g_T| > \frac{\alpha}{\sqrt{T}}$, where $\alpha$ is a number chosen in the interval $[\sqrt{c}, c]$, which is a nonempty interval because $c \ge 1$.

If $|g_T| \le \frac{\alpha}{\sqrt{T}}$: we have

    $f(x_T) - f(x^*) \le |g_T| |x_T - x^*|$  (by convexity/monotonicity of the subgradient)
    $\le \frac{\alpha}{\sqrt{T}} \cdot 1$  ($|x_T - x^*| \le 1$ by the diameter of the domain)
    $\le \frac{c}{\sqrt{T}}$  (as $\alpha \le c$)

If $|g_T| > \frac{\alpha}{\sqrt{T}}$: by convexity we know that

    $|g_T| \le \frac{f(x_{T-1}) - f(x_T)}{|x_{T-1} - x_T|}$  (by Corollary 3.1.1)
    $= \frac{f(x_{T-1}) - f(x_T)}{|\eta_{T-1} g_{T-1}|} \le \frac{f(x_{T-1}) - f(x_T)}{|\eta_{T-1} g_T|}$  (by Claim 3.1.3)

and hence

    $\eta_{T-1} |g_T|^2 \le f(x_{T-1}) - f(x_T) = [f(x_{T-1}) - f(x^*)] - [f(x_T) - f(x^*)] \le \frac{c}{\sqrt{T-1}} - [f(x_T) - f(x^*)]$

By rearranging we deduce

    $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T-1}} - \eta_{T-1} |g_T|^2$    (11)

By assumption we have $|g_T| \ge \frac{\alpha}{\sqrt{T}}$. We thus have

    $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T-1}} - \frac{1}{\sqrt{T-1}} \cdot \frac{\alpha^2}{T} \le \frac{c}{\sqrt{T-1}} - \frac{\alpha^2/2}{(T-1)^{1.5}}$    (12)

The last step holds because for $T \ge 2$, $\frac{1}{T} \ge \frac{1}{2(T-1)}$.

By applying Claim A.1 to (12) we thus have

    $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T}}$    (13)

as intended.

Case 3.2: $\mathrm{sign}(\max \partial f(x_T)) \ne \mathrm{sign}(\max \partial f(x_{T-1}))$.

We know that if $0$ is in the subdifferential at a point $x \in S$, then $x$ is a global minimizer for $f$. Combining this optimality condition with the monotonicity of the subgradient (Claim 3.1.1), we are able to conclude that a global minimizer $x^*$ for $f$ lies in the open interval with endpoints $x_T$ and $x_{T-1}$; this can be either $(x_T, x_{T-1})$ or $(x_{T-1}, x_T)$. Thus we know that

    $|x_T - x^*| \le |x_T - x_{T-1}| = |\eta_{T-1} g_{T-1}| \le \eta_{T-1}$  ($|g_{T-1}| \le 1$ because $f$ is 1-Lipschitz)

    $\implies f(x_T) - f(x^*) \le |x_T - x^*| \le \eta_{T-1} = \frac{1}{\sqrt{T-1}} \le \frac{c}{\sqrt{T}}$  (as $c \ge \sqrt{2}$)    (14)

Thus in all cases we have $f(x_T) - f(x^*) \le \frac{c}{\sqrt{T}}$, which completes the induction. We conclude that $f(x_T) - f(x^*) \le O(1)/\sqrt{T}$.

4 Future directions

The natural extension of the established work is to attempt further theoretical groundwork for higher-dimensional cases and possibly a generalization to any arbitrary finite dimension $d$. Lower bounds have been established for arbitrary-dimensional cases where $d \ge T$, namely that the error satisfies $f(x_T) - f(x^*) = \Omega(\log(T)/\sqrt{T})$, as shown in [4]. Following that, it is natural to speculate that the bound is tight, such that the error is $O(\log \min(d, T)/\sqrt{T})$.

It is also of interest to study lower bounds. As shown in the theorems from [1] listed in Section 1.3, lower bounds can be established by proving the existence of a function $f : S \subset \mathbb{R}^d \to \mathbb{R}$ such that running subgradient descent on $f$ produces a final iterate error after $T$ iterations that is bounded below by some function of $d$ and $T$.
References

[1] S. Bubeck, "Convex optimization: Algorithms and complexity," Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.

[2] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, pp. 267–288, 1996.

[3] S. Boyd, L. Xiao, and A. Mutapcic, "Subgradient methods," 2003.

[4] N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa, "Tight analyses for non-smooth stochastic gradient descent," CoRR, vol. abs/1812.05217, 2018.

[5] O. Shamir and T. Zhang, "Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes," in Proceedings of the 30th International Conference on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, pp. 71–79, PMLR, 2013.

A Scalar inequalities

Claim A.1. If $T \ge 2$, then $\frac{\alpha^2/2}{(T-1)^{1.5}} \ge \frac{c}{\sqrt{T-1}} - \frac{c}{\sqrt{T}}$.

Proof. We show this by the following:

    $c\left(\frac{1}{\sqrt{T-1}} - \frac{1}{\sqrt{T}}\right) = c \cdot \frac{\sqrt{T} - \sqrt{T-1}}{\sqrt{T(T-1)}} = \frac{c}{\sqrt{T(T-1)}\,(\sqrt{T} + \sqrt{T-1})}$  (using $a - b = \frac{a^2 - b^2}{a + b}$)
    $\le \frac{c}{(T-1)(2\sqrt{T-1})}$  (replacing every $T$ with $T - 1$ in the denominator)
    $= \frac{c/2}{(T-1)^{1.5}} \le \frac{\alpha^2/2}{(T-1)^{1.5}}$  (as $\alpha^2 \ge c$)
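As a numerical sanity check of Claim A.1, the following sketch (with $c = \sqrt{2}$, the smallest value allowed in the proof of Theorem 3.1, and both endpoints of $[\sqrt{c}, c]$ for $\alpha$, where $\alpha = \sqrt{c}$ is the tightest case) verifies the inequality over a range of $T$:

    import math

    c = math.sqrt(2)
    for alpha in (math.sqrt(c), c):          # endpoints of [sqrt(c), c]
        for T in range(2, 10**6, 997):
            lhs = (alpha**2 / 2) / (T - 1)**1.5
            rhs = c / math.sqrt(T - 1) - c / math.sqrt(T)
            assert lhs >= rhs, (alpha, T)
    print("Claim A.1 holds on all tested values")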
