Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A derivative-free approximate gradient sampling algorithm for finite minimax problems Nutini, Julie Ann 2012

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata

Download

Media
ubc_2012_spring_nutini_julie.pdf [ 722.2kB ]
Metadata
JSON: 1.0072771.json
JSON-LD: 1.0072771+ld.json
RDF/XML (Pretty): 1.0072771.xml
RDF/JSON: 1.0072771+rdf.json
Turtle: 1.0072771+rdf-turtle.txt
N-Triples: 1.0072771+rdf-ntriples.txt
Original Record: 1.0072771 +original-record.json
Full Text
1.0072771.txt
Citation
1.0072771.ris

Full Text

A Derivative-Free Approximate Gradient Sampling Algorithm for Finite Minimax Problems by Julie Ann Nutini  B.Sc. Hons., The University of British Columbia, 2010  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The College of Graduate Studies (Mathematics)  THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan) April 2012 c Julie Ann Nutini 2012  Abstract Mathematical optimization is the process of minimizing (or maximizing) a function. An algorithm is used to optimize a function when the minimum cannot be found by hand, or finding the minimum by hand is inefficient. The minimum of a function is a critical point and corresponds to a gradient (derivative) of 0. Thus, optimization algorithms commonly require gradient calculations. When gradient information of the objective function is unavailable, unreliable or ‘expensive’ in terms of computation time, a derivative-free optimization algorithm is ideal. As the name suggests, derivative-free optimization algorithms do not require gradient calculations. In this thesis, we present a derivative-free optimization algorithm for finite minimax problems. Structurally, a finite minimax problem minimizes the maximum taken over a finite set of functions. We focus on the finite minimax problem due to its frequent appearance in real-world applications. We present convergence results for a regular and a robust version of our algorithm, showing in both cases that either the function is unbounded below (the minimum is −∞) or we have found a critical point. Theoretical results are explored for stopping conditions. Additionally, theoretical and numerical results are presented for three examples of approximate gradients that can be used in our algorithm: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithm, the three approximate gradients, and the regular and robust stopping conditions. Finally, an application in seismic retrofitting is discussed.  ii  Table of Contents Abstract  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ii  Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . .  iii  List of Tables  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  v  List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  vi  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Dedication  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii  1 Introduction . . . . . . . . . . . . 1.1 Notations . . . . . . . . . . . . 1.2 The Problem . . . . . . . . . . 1.3 Derivative-Free Optimization . 1.4 DFO and Finite Max Functions 1.5 Method of Steepest Descent . 1.6 Basic Definitions and Results .  . . . .  . . . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  1 2 2 3 3 6 8  2 Approximate Gradient Sampling Algorithm . . . . . . . . . 2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . .  10 10 14  3 Robust Approximate Gradient 3.1 Algorithm . . . . . . . . . . 3.2 Convergence . . . . . . . . . 3.2.1 Descent Direction: dY 3.3 Robust Stopping Conditions  . . . . .  23 25 26 29 30  4 Approximate Gradients . . . . . . . . . . . . . . . . . . . . . . 4.1 Simplex Gradient . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . .  34 34 34  Sampling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . .  . . . . .  . . . . . . .  . . . . .  . . . . . . .  . . . . .  iii  Table of Contents . . . . . .  . . . . . .  . . . . . .  35 36 36 37 37 38  . . . .  . . . .  . . . .  38 38 39 44  . . . . . . .  . . . . . . .  . . . . . . .  45 45 46 47 47 49 53  6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . .  55 55 56  Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  57  Appendix A Tables  61  4.2  4.3  4.1.2 Convergence . . . . . . 4.1.3 Algorithm . . . . . . . Centered Simplex Gradient . . 4.2.1 Definitions . . . . . . . 4.2.2 Convergence . . . . . . 4.2.3 Algorithm . . . . . . . Gupal Estimate of the Gradient Function . . . . . . . . . . . . 4.3.1 Definitions . . . . . . . 4.3.2 Convergence . . . . . . 4.3.3 Algorithm . . . . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . of the Steklov Averaged . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5 Numerical Results . . . . . . . . . . . . 5.1 Versions of the AGS Algorithm . . . . 5.2 Test Sets and Software . . . . . . . . 5.3 Initialization and Stopping Conditions 5.4 Computing a Descent Direction . . . 5.5 Results . . . . . . . . . . . . . . . . . 5.6 An Application in Seismic Retrofitting  . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . .  iv  List of Tables 2.1  Glossary of Algorithm Notation . . . . . . . . . . . . . . . . .  10  5.1  Damper Coefficient Selection Results . . . . . . . . . . . . . .  54  6.1 6.2 6.3 6.4  Summary of Test Set Problems . . . . . . . . . Numerical Results: Simplex Gradient . . . . . . Numerical Results: Centered Simplex Gradient Numerical Results: Gupal Estimate . . . . . . .  61 62 63 64  . . . .  . . . .  . . . .  . . . .  . . . .  . . . .  . . . .  . . . .  v  List of Figures 1.1 1.2  A nonsmooth function and its gradients . . . . . . . . . . . . Method of Steepest Descent . . . . . . . . . . . . . . . . . . .  1 6  2.1  An Armijo-like line search . . . . . . . . . . . . . . . . . . . .  13  3.1 3.2  Surface plot of Problem 2.1 CB2 . . . . . . . . . . . . . . . . Iterations for Problem 2.1 CB2 . . . . . . . . . . . . . . . . .  24 24  5.1  Performance profiles: 12 versions of AGS/RAGS algorithm . .  51  vi  Acknowledgements First, and foremost, I would like thank my supervisor, Dr. Warren Hare, whose guidance, patience and encouragement have been unfailing. I am incredibly grateful for the numerous occasions when his confidence in my ability far exceeded my own, and feel truly fortunate to have worked under his supervision. Without him, this thesis would not have come to its fruition, my passport would still be without stamps and the idea of ‘proof by fairy dust’ would still be a logical option. I would like to thank my committee members, Dr. Heinz Bauschke, Dr. Shawn Wang and Dr. Erik Rosolowsky, for their time and helpful suggestions throughout this challenging, yet rewarding journey. I would like to express my sincere gratitude to Dr. Claudia Sagastiz´abal for our conversations in Chile, especially with respect to the Goldstein approximate subdifferential. I would like to thank Kasra Bigdeli and Dr. Solomon Tesfamariam for taking an interest in my algorithm and for furthering my passion for algorithm design. I would like to thank Dr. Rebecca Tyson and Dr. Blair Spearman for encouraging me to pursue studies in optimization, and Dr. Yves Lucet for showing an invested interest in my research through his helpful, thought provoking questions. I would like to thank UBC Okanagan and the many donors who have provided me with incredible financial support over the years. Last, but not least, I would like to thank my family and friends for their understanding, faith and endless love throughout my studies.  Many of the results in this thesis are currently in submission to Computational Optimization and Applications (see [HN12]).  vii  To my dad, who shares my love of mathematics, and to my mom, who is my constant reminder...va piano.  viii  Chapter 1  Introduction Mathematical optimization is the process of minimizing (or maximizing) a function. We use an optimization algorithm when either the minimum of a function cannot be found by hand, or finding the minimum by hand is inefficient. In the first year calculus classroom, optimization algorithms are simple: take the derivative of a function, set the derivative equal to 0 and solve for x. For the functions explored in first year calculus, this process generally works fine. However, many real-world problems result in nonsmooth functions. For example, we consider the following nonsmooth finite max function: f (x, y) = 10|x| + y 2 = max{10x + y 2 , −10x + y 2 }.  (a) Graph of f (x, y).  (b) Selected gradients of f (x, y).  Figure 1.1: Graph of f (x, y) and its gradients. We can see in Figure 1.1(a) that f (x, y) has a ridge along the line x = 0 where the function is not differentiable. Figure 1.1(b) shows the pattern of gradients for f (x, y) as if looking down onto the bottom ridge. Notice 1  1.1. Notations that for any point far from the ridge, the negative of the gradient points roughly towards the minimum. However, for any point near the ridge, the negative of the gradient points roughly perpendicular to the ridge, which is not necessarily the direction of the minimum. In this thesis we develop a novel optimization algorithm specifically designed for functions with this finite max structure.  1.1  Notations  In this thesis, we use the following notations: 1. | · | denotes the Euclidean norm and matrix norm.  ·  denotes the corresponding  2. As defined in [Cla90], the function f : Rn → R is locally Lipschitz at a point x ∈ Rn if there exists a scalar L and δ > 0 such that |f (y1 ) − f (y2 )| ≤ L|y1 − y2 | for all y1 and y2 in the open ball of radius δ about x. 3. If f1 and f2 are continuous, then max{f1 , f2 } is continuous. If f1 and f2 are Lipschitz, then max{f1 , f2 } is Lipschitz [RW98, Prop 9.10]. However, we note that even if f1 and f2 are differentiable, this does not imply that max{f1 , f2 } is differentiable.  1.2  The Problem  We consider the finite minimax problem min f (x) where f (x) = max{fi (x) : i = 1, . . . , N }, x  where each individual fi is continuously differentiable. Finite minimax problems occur in numerous applications, such as portfolio optimization [CTYZ00], control system design [IOKK04], engineering design [Pol87], and determining the cosine measure of a positive spanning set [CSV09, Def 2.7]. In a finite max function, although each individual fi may be smooth, taking the maximum forms a nonsmooth function with ‘nondifferentiable ridges’. For this reason, most algorithms designed to solve finite minimax problems employ some form of smoothing technique;  2  1.3. Derivative-Free Optimization [PGL93], [PRW03], [Pol88], and [Xu01] (among many others). In general, these smoothing techniques require gradient calculations. However, in many applications gradient information is not available or can be difficult to compute accurately (see [BJF+ 98], [DV04], [Har10], [MFT08] and [CSV09, Chpt 1] for some examples of such situations). Thus, for the purpose of this thesis, we further restrict ourselves to the field of derivative-free optimization, where we are only permitted to compute function values, i.e., we cannot compute gradient values ∇fi directly.  1.3  Derivative-Free Optimization  The research area of derivative-free optimization (DFO) has blossomed in recent years. As previously stated, DFO algorithms are useful in situations when gradient information is not available, difficult to compute accurately or ‘expensive’ to compute in relation to computation time. For example, if a function is given by a simulation, then gradient information may not be available. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [CSV09]. As there are no gradient calculations required in DFO algorithms, it is assumed that function evaluations are the most ‘expensive’ computations in terms of CPU time. Thus, the performance of a DFO algorithm is based on the number of function evaluations required to solve the problem.  1.4  DFO and Finite Max Functions  In relation to our problem, research on optimizing finite max functions without calculating derivatives can be seen as early as 1975 [Mad75], while more recently we have seen a resurface in this area, [LLS06] and [HM11]. In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for finite minimax problems [LLS06]. This method is shown to globally converge towards a standard stationary point of the original finite minimax problem. We want to take a step away from the prevalent smoothing techniques used to solve finite minimax problems. Instead of altering the function with a smoothing technique, we look to solve the finite minimax problem with a DFO method that exploits its smooth substructure. To understand how we do this, we first need to define the following terms.  3  1.4. DFO and Finite Max Functions Definition 1.1. Let f : Rn → R be locally Lipschitz at a point x ¯. Then by Rademacher’s Theorem ([RW98, Thm 9.60]), f is differentiable almost everywhere on Rn . Let Ωf be the set of points at which f fails to be differentiable. Then the Clarke subdifferential, as defined in [Cla90], is given by the set ∂f (¯ x) = conv(lim ∇f (y j ) : y j → x ¯, y j ∈ / Ωf ). j  (1.1)  Basically, the subdifferential of a function at x ¯ is the set of all possible gradients near x ¯. As an example, we consider the absolute value function, f (x) = |x| in R. Clearly, f (x) is not differentiable at 0. The subdifferential of f at the point 0 is the set ∂f (0) = conv(1, −1), as f (x) =  1 −1  if x > 0 . if x < 0  Definition 1.2. A descent direction of a continuously differentiable function f at a given point x ¯ ∈ dom(f ) as defined in [CSV09] is any vector d such that d, v < 0 for all v ∈ ∂f (¯ x). To explain how subdifferentials can be used to find a descent direction, we first define the projection. Definition 1.3 (Definition 3.7, [BC11]). Let C ⊆ Rn be a nonempty closed convex set and let x ∈ Rn . Let p ∈ C. Then p is the unique projection of x onto C, denoted by Proj(x|C), if p ∈ arg min{|y − x| : y ∈ C}. y  A point x ¯ is a (Clarke) stationary point of a function f when 0 ∈ ∂f (¯ x). Thus, we define the direction of steepest descent as follows. Definition 1.4. The direction of steepest descent as defined in [CSV09] is given by d = − Proj(0|∂f (¯ x)).  4  1.4. DFO and Finite Max Functions In 2011, Hare and Macklem presented a derivative-free method that exploits the smooth substructure of the finite minimax problem. It combines the frameworks of a directional direct search method [CSV09, Chpt 7] and the gradient sampling algorithm (GS algorithm) presented in [BLO02] and [BLO05]. Loosely speaking, the GS algorithm uses a collection of local gradients to build a ‘robust subdifferential’ of the objective function and uses this to determine a ‘robust descent direction’. In [HM11], these ideas are used to develop several methods to find an approximate descent direction that moves close to parallel to an ‘active manifold’. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdifferential, which is then used to determine a likely descent direction. Ideas from the GS algorithm have appeared in two other recent DFO methodologies: [BKS08] and [Kiw10]. In 2008, Bagirov, Karas¨ozen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [BKS08]. Described as a derivative-free version of the bundle method presented in [Wol75], the method uses discrete gradients to approximate subgradients of the function and build an approximate subdifferential. The analysis of this method provides proof of convergence to a Clarke stationary point for an extensive class of nonsmooth problems. In this thesis, we focus on the finite minimax problem. This allows us to require few (other) assumptions on our function while maintaining strong convergence analysis. It is worth noting that we use the same set of test problems as in [BKS08]. Specifically, we use the [LV00] test set and exclude one problem as its subfunctions are complex-valued. (The numerics in [BKS08] exclude the same problem, and several others, without explanation.) Using approximate gradient calculations instead of gradient calculations, the GS algorithm is made derivative free by Kiwiel in [Kiw10]. Specifically, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [Gup77] or Section 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same convergence results as the GS algorithm – it either drives the f values to −∞ or each cluster point is found to be Clarke stationary [Kiw10, Theorem 3.8]. No numerical results are presented for Kiwiel’s derivative-free algorithm.  5  1.5. Method of Steepest Descent  1.5  Method of Steepest Descent  To obtain a general understanding of the framework of the algorithm presented in this thesis, we recall the classical method of steepest descent. 1. Search Direction: Set dk = − Proj(0|∂f (xk )). 2. Step Length: Find tk by solving min{f (xk + tk dk )}. tk >0  3. Update: Set xk+1 = xk + tk dk . Increase k = k + 1. Loop.  Figure 1.2: An example of the method of steepest descent. In each iteration, the method of steepest descent calculates the direction of steepest descent, i.e., − Proj(0|∂f (xk )), does a line search in this direction and then updates the iterate. This allows the algorithm to move perpendicular to the contours of the function, towards the minimum. The method of steepest descent works well for smooth functions. However, for nonsmooth functions, the optimal solution often occurs at a point of nondifferentiability, which results in a very slow convergence rate for the method of steepest descent. In general, our algorithm takes the method of steepest descent and makes it derivative free; instead of calculating the subdifferential, we calculate an approximate subdifferential; instead of finding the direction of steepest descent, we project 0 onto our approximate subdifferential and find an approximate direction of steepest descent. With these alterations, our algorithm is able to navigate along nondifferentiable ridges, more rapidly minimizing the function. Specifically, we use the GS algorithm framework to form a derivativefree approximate gradient sampling algorithm. As we are dealing with finite max functions, instead of calculating an approximate gradient of f at each 6  1.5. Method of Steepest Descent of the sampled points, we calculate an approximate gradient of each of the active functions at the current iterate. We say a function fi is active if its index is an element of the active set. Definition 1.5. We define the active set of f at x ¯ to be the set of indices A(¯ x) = {i : f (¯ x) = fi (¯ x)}.  (1.2)  We denote the set of active gradients of f at x ¯ by {∇fi (¯ x)}i∈A(¯x) . Expanding the active set to include ‘almost’ active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of nondifferentiability, the size and shape of our approximate subdifferential will reflect the presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdifferential, the descent direction will direct minimization parallel to a ‘nondifferentiable ridge’, rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly influence the performance of our algorithm. The algorithm presented in this thesis differs from those mentioned above in a few key manners. Unlike in [LLS06] we do not employ a smoothing technique. Unlike in [HM11], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent framework. Using this framework, we are able to analyze convergence directly and develop stopping analysis based on our stopping conditions for the algorithm. Unlike in [BKS08] and [Kiw10], where convergence is proven for a specific approximate gradient, we prove convergence for any approximate gradient that satisfies a simple error bound dependent on the sampling radius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. (As a side-note, Section 4.3 also provides, to the best our knowledge, a novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.) Furthermore, unlike in [Kiw10], where convergence to a stationary point is proved with probability 1, we prove convergence to a critical point for every cluster point of a sequence generated by our algorithm. Focusing on the finite minimax problem provides us with an advantage over the methods of [BKS08] and [Kiw10]. In particular, we only require order n function calls per iteration (where n is the dimension of the problem), 7  1.6. Basic Definitions and Results while both [BKS08] and [Kiw10] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approximate subdifferential). (The original GS algorithm suggests that m ≈ 2n provides a good value for m.)  1.6  Basic Definitions and Results  We denote by C 1 the class of differentiable functions whose gradient mapping ∇ is continuous. As stated previously, we assume that each of the individual fi in the finite max function is continuously differentiable, i.e., fi ∈ C 1 . We denote by C 1+ the class of continuously differentiable functions whose gradient mapping ∇ is locally Lipschitz. We say f ∈ C 1+ with constant L if its gradient mapping ∇ has a Lipschitz constant L. We denote by C 2+ the class of twice continuously differentiable functions whose gradient mapping ∇2 is locally Lipschitz. We say f ∈ C 2+ with constant L if its gradient mapping ∇2 has a Lipschitz constant L. The following result reveals that the subdifferential for a finite max function at a point x ¯ is equal to the convex hull of the active gradients of f at x ¯. Proposition 1.6. Let f = max{fi : i = 1, . . . , N }. If fi ∈ C 1 for each i ∈ A(¯ x), then ∂f (¯ x) = conv(∇fi (¯ x))i∈A(¯x) . Proof. See Proposition 2.3.12, [Cla90]. It is important to note that the subdifferential as defined in Proposition 1.6 is a compact set. We formally state and prove this result in the following corollary. Corollary 1.7. Let f = max{fi : i = 1, . . . , N } where fi ∈ C 1 for each i ∈ A(¯ x). The subdifferential of f at x ¯ is a compact set. Proof. By Proposition 1.6, ∂f (¯ x) = conv(∇fi (¯ x))i∈A(¯x) . As there are a finite number of functions, A(¯ x) is a finite set, which implies {∇fi (¯ x)} is a finite set of points. The convex hull of a finite set of points is closed and bounded. Hence, ∂f (¯ x) is a compact set. For our proof of convergence in Section 2.2, we require that ∂f is outer semicontinuous. First, we define the limit superior for a mapping. 8  1.6. Basic Definitions and Results Definition 1.8. For a mapping S : Rn → Rn , we define the limit superior as lim sup S(x) = {y : there exists xk → x ¯ and y k → y with y k ∈ S(xk )}. x→¯ x  Definition 1.9. As defined in [RW98, Def 5.4], S is outer semicontinuous (osc) at x ¯ if lim sup S(x) = S(¯ x). x→¯ x  To see that the subdifferential is outer semicontinuous, we have the following proposition from [RW98]. Proposition 1.10. For a continuous function f : Rn → R, the mapping ∂f is outer semicontinuous everywhere. Proof. See Proposition 8.7, [RW98].  9  Chapter 2  Approximate Gradient Sampling Algorithm In Chapter 2, we state the approximate gradient sampling algorithm (AGS algorithm), present the details as to when and why the AGS algorithm necessarily finds function value decrease using an arbitrary approximate gradient of each of the active fi , denoted by ∇A fi , and then show that the AGS algorithm converges to a local minimum. We note that no specific approximate gradient is discussed in this section. Furthermore, we emphasize that the error bound requirement used in the convergence results (Section 2.2) for the arbitrary approximate gradient is achievable. Examples of approximate gradients, their structures and their error bounds are discussed in Section 4.  2.1  Algorithm  We first provide a partial glossary of the notation used in the statement of the AGS algorithm. Table 2.1: Glossary of notation used in AGS algorithm. Glossary of Notation k: Iteration counter µk : Accuracy measure m: Sample size y j : Sampling points η: Armijo-like parameter tk : Step length ∇A fi : Approximate gradient of fi Gk : Approximate subdifferential  xk : Current iterate ∆k : Sampling radius θ: Sampling radius reduction factor Y : Sampled set of points dk : Search direction tmin : Minimum step length A(xk ): Active set at xk εtol : Stopping tolerance  10  2.1. Algorithm We state the theoretical AGS algorithm and then provide a detailed description of the algorithm. Conceptual Algorithm: [AGS Algorithm] 0. Initialize: Set k = 0 and input x0 - starting point µ0 > 0 - accuracy measure ∆0 > 0 - initial search radius θ ∈ (0, 1] - search radius reduction factor 0 < η < 1 - Armijo-like parameter tmin - minimum step length εtol > 0 - stopping tolerance 1. Generate Approximate Subdifferential Gk : Generate a set Y = [xk , y 1 , . . . y m ] around the current iterate xk such that max |y j − xk | ≤ ∆k . j=1,...,m  Use Y to calculate the approximate gradient of fi , denoted ∇A fi , at xk for each i ∈ A(xk ). Set Gk = conv(∇A fi (xk ))i∈A(xk ) . 2. Generate Search Direction: Let dk = − Proj(0|Gk ). Check if ∆k ≤ µk |dk |.  (2.1)  If (2.1) does not hold, then set xk+1 = xk , set ∆k+1 =  θµk |dk | θ∆k  if |dk | = 0 , if |dk | = 0  (2.2)  and go to Step 4. If (2.1) holds and |dk | < εtol , then STOP. Else, continue to the line search.  11  2.1. Algorithm 3. Line Search: Attempt to find tk > 0 such that f (xk + tk dk ) < f (xk ) − ηtk |dk |2 . Line Search Failure: µk k+1 Set µk = ,x = xk and go to Step 4. 2 Line Search Success: Let xk+1 be any point such that f (xk+1 ) ≤ f (xk + tk dk ). 4. Update and Loop: Set ∆k+1 = max |y j − xk |, k = k + 1 and return to Step 1. j=1,...,m  In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point x0 , and initialize the parameter values. In Step 1, we create the approximate subdifferential. First, we select a set of points around xk within a search radius of ∆k . In implementation, the points are randomly and uniformly sampled from the interior of an n-dimensional hypersphere of radius ∆k centered at the origin (using the MATLAB randsphere.m function [Sta05]). Using this set Y , we then calculate an approximate gradient of each of the active functions at xk and set the approximate subdifferential Gk equal to the convex hull of these active approximate gradients. In Step 2, we generate a search direction by solving the projection of 0 onto the approximate subdifferential: Proj(0|Gk ) ∈ arg ming∈Gk |g|2 . The search direction dk is set equal to the negative of the solution, i.e., dk = − Proj(0|Gk ). After finding a search direction, we check the condition ∆k ≤ µk |dk | (equation (2.1)). This condition determines if the current search radius is sufficiently small relative to the distance from 0 to the approximate subdifferential. If equation (2.1) holds and |dk | < εtol , then we terminate the algorithm, as 0 is within εtol of the approximate subdifferential and the search radius is small enough to reason that the approximate subdifferential is accurate. If equation (2.1) does not hold, then the approximate subdifferential is not sufficiently accurate to warrant a line search, so we decrease 12  2.1. Algorithm the search radius according to equation (2.2) and loop (Step 4). If equation (2.1) holds, but |dk | ≥ εtol , then we proceed to a line search. In Step 3, we carry out a line search. We attempt to find a step length tk > 0 such that the Armijo-like condition holds f (xk + tk dk ) < f (xk ) − ηtk |dk |2 .  (2.3)  The following figures illustrate the Armijo(-like) line search for the curve f (x) = 41 x2 . Figure 2.1(a) is representative of a gradient based method (Armijo line search), while Figure 2.1(b) is representative of a DFO method (Armijo-like line search).  (a) A gradient based line search.  (b) A derivative-free line search.  Figure 2.1: A graphical depiction of the Armijo and Armijo-like condition. To satisfy the condition in equation (2.3), the algorithm must find a point on the curve that lies on or below the Armijo descent line for both methods. Any point found below the Armijo descent line will result in a successful line search. Thus, the Armijo(-like) condition prevents the algorithm from taking multiple steps with minimal function value decrease. In other words, it ensures sufficient function value decrease is found when a line search success is declared. Comparing Figures 2.1(a) and 2.1(b), we can see that the DFO Armijo descent line does not allow the algorithm to accept a point x ≤ 0.5. This prevents the algorithm from jumping back and forth across the y-axis, thus, resulting in a faster convergence rate. In implementation, we use a back-tracking line search (described in [NW99]) with an initial step-length of tini = 1. Basically, we test to see if equation (2.3) holds for tk = tini . If it holds, then we declare a line search 13  2.2. Convergence success. If it does not hold, we reduce tk by a factor of 0.5 and we test equation (2.3) again. If we have tk < tmin without declaring a line search success, then we declare a line search failure. If we find a tk such that equation (2.3) holds, then we declare a line search success. If a line search success occurs, then we let xk+1 be any point such that f (xk+1 ) ≤ f (xk + tk dk ),  (2.4)  thus, forming a non-increasing sequence of function values {f (xk )}k=0 . In implementation, we do this by searching through the function values used in the calculation of our approximate gradients ({f (y i )}yi ∈Y ). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of finding further function value decrease without having to carry out additional function evaluations. We find the minimum function value in our set of evaluations and if equation (2.4) holds for this minimum value, then we set xk+1 equal to the corresponding input point. Otherwise, we set xk+1 = xk + tk dk . If a line search failure occurs, then we reduce the accuracy measure µk by a factor of 0.5 and set xk+1 = xk . (Notice that as µk is decreased each time a line search failure occurs, {µk }k=0 is a non-increasing sequence.) Finally, in Step 4, we set ∆k+1 = maxj=1,...,m |y j −xk |, update the iterate counter and loop to Step 1 to resample.  2.2  Convergence  In order to prove convergence for the AGS algorithm, we must show that the direction dk is a descent direction, that the algorithm is well-defined (eventually finds function value decrease) and that the stopping conditions are sufficient, i.e., when the algorithm satisfies the stopping conditions, the distance to a critical point is controlled by εtol . Remark 2.1. For the following results, we denote the approximate subdifferential of f at x ¯ as G(¯ x) = conv(∇A fi (¯ x))i∈A(¯x) , where ∇A fi (¯ x) is the approximate gradient of fi at x ¯. Our first result shows that the approximate subdifferential generated by the AGS algorithm is a good approximate of the exact subdifferential. We use part 1 of the following lemma to establish that our stopping conditions are sufficient. We use part 2 of the following lemma to establish that dk is a descent direction. 14  2.2. Convergence Lemma 2.2. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Suppose there exists an ε > 0 such that |∇A fi (¯ x) − ∇fi (¯ x)| ≤ ε for all i. Then 1. for all w ∈ G(¯ x), there exists a v ∈ ∂f (¯ x) such that |w − v| ≤ ε, and 2. for all v ∈ ∂f (¯ x), there exists a w ∈ G(¯ x) such that |w − v| ≤ ε. Proof. 1. By definition, for all w ∈ G(¯ x) there exists a set of αi such that αi ∇A fi (¯ x),  w=  where αi ≥ 0,  i∈A(¯ x)  αi = 1. i∈A(¯ x)  By Proposition 1.6, as each fi ∈ C 1 we have ∂f (¯ x) = conv(∇fi (¯ x))i∈A(¯x) . Using the same αi as above, we see that αi ∇fi (¯ x) ∈ ∂f (¯ x).  v= i∈A(¯ x)  Then |w − v|  =  αi ∇fi (¯ x)|  αi ∇A fi (¯ x) −  |  i∈A(¯ x)  i∈A(¯ x)  αi |∇A fi (¯ x) − ∇fi (¯ x)|  ≤ i∈A(¯ x)  ≤  αi ε  (by assumption)  i∈A(¯ x)  =  ε  αi = 1).  (as i∈A(¯ x)  Hence, for all w ∈ G(¯ x), there exists a v ∈ ∂f (¯ x) such that |w − v| ≤ ε.  (2.5)  2. We have ∂f (¯ x) = conv(∇fi (¯ x))i∈A(¯x) . So for all v ∈ ∂f (¯ x), there exist αi such that αi ∇fi (¯ x)  v=  where αi ≥ 0,  i∈A(¯ x)  αi = 1. i∈A(¯ x)  Using the same αi as above, we see that αi ∇A fi (¯ x) ∈ G(¯ x).  w= i∈A(¯ x)  Using the same argument as in part 1, we can conclude that for all v ∈ ∂f (¯ x), there exists a w ∈ G(¯ x) such that |w − v| ≤ ε. 15  2.2. Convergence Our next goal (in Theorem 2.6) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this, we first state the Projection Theorem. Theorem 2.3. (The Projection Theorem) Let C ⊆ Rn be a nonempty closed convex set. Then for every x and p in Rn p = Proj(x|C) ⇐⇒ p ∈ C and x − p, y − p ≤ 0 for all y ∈ C. Proof. See Theorem 3.14, [BC11]. We use the Projection Theorem to prove d = − Proj(0|G(¯ x)) is a descent direction. Lemma 2.4. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Suppose there exists an ε > 0 such that |∇A fi (¯ x) − ∇fi (¯ x)| ≤ ε for all i. Define d = − Proj(0|G(¯ x)) and suppose |d| = 0. Let β ∈ (0, 1). If ε < (1 − β)|d|, then for all v ∈ ∂f (¯ x) we have d, v < −β|d|2 . Proof. Notice that, by Theorem 2.3, d = − Proj(0|G(¯ x)) implies that 0 − (−d), w − (−d) ≤ 0 for all w ∈ G(¯ x). Hence, d, w + d ≤ 0 for all w ∈ G(¯ x).  (2.6)  So we have for all v ∈ ∂f (¯ x) d, v  =  d, v − w + w − d + d  for all w ∈ G(¯ x)  =  d, v − w + d, w + d + d, −d  for all w ∈ G(¯ x)  ≤  d, v − w − |d|2  for all w ∈ G(¯ x)  ≤ |d||v − w| − |d|2  for all w ∈ G(¯ x),  (by (2.6))  which follows by using Cauchy Schwarz. For any v ∈ ∂f (¯ x), using w as constructed in Lemma 2.2(2), we see that d, v  ≤  |d|ε − |d|2  <  |d|2 (1 − β) − |d|2  =  −β|d|2 .  (as ε < (1 − β)|d|)  16  2.2. Convergence  Using Lemma 2.4, we can easily show that d is a descent direction for f at x ¯ by setting β = 0 and bounding ε above by |d|. We notice that this condition on ε requires the algorithm to reach a point where the error between the exact gradient and the approximate gradient is smaller than the distance from 0 to our approximate subdifferential. Corollary 2.5. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Suppose there exists an ε > 0 such that |∇A fi (¯ x) − ∇fi (¯ x)| ≤ ε for all i. Define d = − Proj(0|G(¯ x)) and suppose |d| = 0. If ε < |d|, then d, v < 0 for all v ∈ ∂f (¯ x). To guarantee convergence, we must show that, except in the case of 0 ∈ ∂f (xk ), the algorithm will always be able to find a search radius that satisfies the requirements in Step 2. In Section 4, we show that (for three different approximate gradients) the value ε (in Lemma 2.4) is linked to the search radius ∆. As unsuccessful line searches will drive ∆ to 0, this implies that eventually the requirements of Lemma 2.4 will be satisfied. We formalize this in the next two theorems. Theorem 2.6. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Sup¯ > 0 such pose 0 ∈ ∂f (xk ) for each iteration k. Suppose there exists K that given any set of points generated in Step 1 of the AGS algorithm, the ¯ k for all i. Let approximate gradient satisfies |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ k k ¯ > 0 such that, d = − Proj(0|G(x )). Then for any µ > 0, there exists ∆ ¯ ∆ ≤ µ|dk | + Kµ(∆ k − ∆)  ¯ for all 0 < ∆ < ∆,  ¯ then the following inequality holds Moreover, if ∆k < ∆, ∆k ≤ µ|dk |. Proof. Let v¯ = Proj(0|∂f (xk )) (by assumption, v¯ = 0). Given µ > 0, let 1 ¯ = ∆ |¯ v |, ¯ K + µ1  (2.7)  ¯ Now create G(xk ) and dk = − Proj(0|G(xk )). As and consider 0 < ∆ < ∆. −dk ∈ G(xk ), by Lemma 2.2(1), there exists a v k ∈ ∂f (xk ) such that ¯ k. | − dk − v k | ≤ K∆ 17  2.2. Convergence Then ¯ k ≥ | − dk − v k | K∆ ⇒  ¯ k ≥ |v k | − |dk | K∆  ⇒  ¯ k ≥ |¯ K∆ v | − |dk |  (as |v| ≥ |¯ v | for all v ∈ ∂f (xk ) ).  ¯ we apply equation (2.7) to |¯ Thus, for 0 < ∆ < ∆, v | in the above inequality to get ¯ k ≥ (K ¯ + 1 )∆ − |dk |, K∆ µ which rearranges to ¯ ∆ ≤ µ(|dk | + Kµ(∆ k − ∆). ¯ ¯ Hence, ∆ ≤ µ|dk | + Kµ(∆ k − ∆) for all 0 < ∆ < ∆. ¯ Finally, if ∆k < ∆, then ∆k ≤ µ|dk |.  Remark 2.7. In Theorem 2.6, it is important to note that eventually the con¯ will hold. Examine ∆ ¯ as constructed above: K ¯ is a constant dition ∆k < ∆ and v¯ is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless ¯ the AGS the condition ∆k ≤ µk |dk | is satisfied. As a result, if ∆k ≥ ∆, ¯ ¯ algorithm will reduce ∆k , with ∆ remaining constant, until ∆k < ∆. Recall in Step 3 of the AGS algorithm, for a given η ∈ (0, 1), we attempt to find a step length tk > 0 such that f (xk + tk dk ) < f (xk ) − ηtk |dk |2 The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall Corollary 1.7, which states that the exact subdifferential for a finite max function is a compact set. Thus, we know that in the following theorem v˜ is well-defined. Theorem 2.8. Fix 0 < η < 1. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let v˜ ∈ arg max{ d, v : v ∈ ∂f (¯ x)}. Suppose there exists an ε > 0 such that |∇A fi (¯ x) − ∇fi (¯ x)| ≤ ε for all i. Define d = − Proj(0|G(¯ x)) and 2η ¯ suppose |d| = 0. Let β = 1+η . If ε < (1 − β)|d|, then there exists t > 0 such that f (¯ x + td) − f (¯ x) < −ηt|d|2 for all 0 < t < t¯. 18  2.2. Convergence Proof. Note that β ∈ (0, 1). Recall, from Lemma 2.4, we have for all v ∈ ∂f (¯ x) d, v < −β|d|2 . (2.8) Using β =  2η 1+η ,  equation (2.8) becomes 2η d, v < − 1+η |d|2  for all v ∈ ∂f (¯ x).  (2.9)  From equation (2.8) we can conclude that for all v ∈ ∂f (¯ x) d, v < 0. Using Theorem 8.30 in [RW98], we have f (¯ x + τ d) − f (¯ x) = max{ d, v : v ∈ ∂f (¯ x)} = d, v˜ < 0. 0 τ  lim  τ  Therefore, as  η+1 2  < 1 (since η < 1) there exists t¯ > 0 such that  f (¯ x + td) − f (¯ x) η+1 < d, v˜ t 2 For such a t, we have f (¯ x + td) − f (¯ x)  <  η+1 t d, v˜ 2  <  −  <  −ηt|d|2 .  for all 0 < t < t¯.  η + 1 2η t|d|2 2 η+1  (by (2.9))  Hence, f (¯ x + td) − f (¯ x) < −ηt|d|2  for all 0 < t < t¯.  Combining the previous results, we show that the AGS algorithm is guaranteed to find function value decrease (provided 0 ∈ / ∂f (xk )). We summarize with the following corollary. Corollary 2.9. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Sup¯ > 0 such pose 0 ∈ / ∂f (xk ) for each iteration k. Suppose there exists a K that given any set of points generated in Step 1 of the AGS algorithm, the ¯ k for all i. Then approximate gradient satisfies |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ after a finite number of iterations, the algorithm will find a new iterate with a lower function value. 19  2.2. Convergence Proof. Consider xk , where 0 ∈ / ∂f (xk ). To find function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the stopping condition below is satisfied, i.e., ∆k ≤ µk |dk |.  (2.10)  ¯ such that In Theorem 2.6, we showed that for any µk > 0, there exists a ∆ ¯ ¯ if ∆k < ∆, then equation (2.10) is satisfied. If ∆k > ∆ in Step 2, i.e., equation (2.10) is not satisfied, then ∆k is updated according to equation (2.2). Whether |dk | = 0 or |dk | = 0, we can see that ∆k+1 ≤ θ∆k , so ¯ i.e., equation (2.10) will be satisfied and the AGS eventually ∆k < ∆. algorithm will carry out a line search. Now, in order to have a line search success, we must be able to find a step length tk such that the Armijo-like condition holds, f (xk + tk dk ) < f (xk ) − ηtk |dk |2 . In Theorem 2.8, we showed that there exists t¯ > 0 such that f (xk + tk dk ) − f (xk ) < −ηtk |dk |2  for all 0 < tk < t¯,  provided that for β ∈ (0, 1), ε < (1 − β)|dk |.  (2.11)  ¯ k . If equation (2.11) does not hold, then a line search failure will Set ε = K∆ occur, resulting in µk+1 = 0.5µk . Thus, eventually we will have µk < (1−β) ¯ K and (1 − β) k ∆k ≤ µk |dk | < ¯ |d |, K which means equation (2.11) will hold. Thus, after a finite number of iterations, the AGS algorithm will declare a line search success and find a new iterate with a lower function value. Now we are ready to prove convergence. In the following result, assuming that the step length tk is bounded away from 0 means that there exists a t¯ > 0 such that tk > t¯. Remark 2.10. The following theorem does not ensure the existence of a cluster point. In order to ensure the existence of a cluster point, an additional assumption would have to be made, such as compact level sets.  20  2.2. Convergence Theorem 2.11. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let {xk }∞ k=0 be an infinite sequence generated by the AGS algorithm. Suppose ¯ > 0 such that given any set of points generated in Step 1 there exists a K of the AGS algorithm, the approximate gradient satisfies the error bound ¯ k for all i. Suppose tk is bounded away from 0. |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ Then either 1. f (xk ) ↓ −∞, or 2. |dk | → 0, ∆k ↓ 0 and every cluster point x ¯ of the sequence {xk }∞ k=0 satisfies 0 ∈ ∂f (¯ x). Proof. If f (xk ) ↓ −∞, then we are done. Conversely, if f (xk ) is bounded below, then f (xk ) is non-increasing and bounded below, therefore f (xk ) converges. We consider two cases. Case 1: An infinite number of line search successes occur. k Let x ¯ be a cluster point of {xk }∞ k=0 . Notice that x only changes for line search successes, so there exists a subsequence {xkj }∞ j=0 of line search suck j ¯. Then for each corresponding step length tkj and cesses such that x → x direction dkj , the following condition holds f (xkj +1 ) ≤ f (xkj + tkj dkj ) < f (xkj ) − ηtkj |dkj |2 . Note that 0 ≤ ηtkj |dkj |2 < f (xkj ) − f (xkj +1 ). Since f (xk ) converges we know that f (xkj ) − f (xkj +1 ) → 0. Since tkj is bounded away from 0, we see that lim |dkj | = 0.  j→∞  Recall from the AGS algorithm, we check the condition ∆kj ≤ µkj |dkj |. As ∆kj > 0, µkj ≤ µ0 , and |dkj | → 0, we can conclude that ∆kj ↓ 0. Finally, from Lemma 2.2(1), as −dkj ∈ G(xkj ), there exists a v kj ∈ ∂f (xkj ) such that ¯ k | − v kj − dkj | ≤ K∆ j ¯ k ⇒ | − v kj | − |dkj | ≤ K∆ j  ⇒  ¯ k + |dkj |, |v kj | ≤ K∆ j 21  2.2. Convergence which implies that ¯ k + |dkj | → 0. 0 ≤ |v kj | ≤ K∆ j So, lim |v kj | = 0,  j→∞  where |v kj | ≥ dist(0|∂f (xkj )) ≥ 0, which implies dist(0|∂f (xkj )) → 0. We have xkj → x ¯. As f is a finite max function, it is continuous and therefore, by Proposition 1.10, ∂f is outer semicontinuous. Hence, every cluster point x ¯ of a convergent subsequence of {xk }∞ x). k=0 satisfies 0 ∈ ∂f (¯ Case 2: A finite number of line search successes occur. ¯ ¯ However, This means there exists a k¯ such that xk = xk = x ¯ for all k ≥ k. by Corollary 2.9, if 0 ∈ / ∂f (¯ x), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have 0 ∈ ∂f (¯ x). Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdifferential is controlled by εtol . Theorem 2.12. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . ¯ > 0 such that for each iteration k, the approximate Suppose there exists a K ¯ k for all i. Suppose the AGS gradient satisfies |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ ¯ algorithm terminates at some iteration k in Step 2 for εtol > 0. Then ¯ ¯ 0 )εtol . dist(0|∂f (xk )) < (1 + Kµ ¯  ¯  Proof. Let w ¯ = Proj(0|G(xk )). We use v¯ ∈ ∂f (xk ) as constructed in Lemma 2.2(1) to see that ¯  dist(0|∂f (xk ))  ≤  dist(0|¯ v)  ≤  dist(0|w) ¯ + dist(w|¯ ¯ v)  =  |dk | + |w ¯ − v¯|  ≤  |dk | + ε  (by Lemma 2.2)  <  εtol + ε  (as |dk | < εtol ).  ¯  ¯  ¯  (as w ¯ ∈ G(xk ), v¯ ∈ ∂f (xk )) ¯  ¯  (as |dk | = | Proj(0|G(xk ))|)  ¯  ¯  ¯  The final statement now follows by the test ∆k¯ ≤ µk¯ |dk | in Step 2 and the fact that µk¯ ≤ µ0 as {µk }k=0 is a non-increasing sequence.  22  Chapter 3  Robust Approximate Gradient Sampling Algorithm The AGS algorithm depends on the active set of functions at each iterate, A(xk ). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically, such behaviour means that the current iterate is close to a ‘nondifferentiable ridge’ formed by the function. In [BLO02] and [BLO05], it is suggested that allowing an algorithm to take into account these ‘almost active’ functions will provide a better idea of what is happening at and around the current iterate, thus, making the algorithm more robust, i.e., performs for unusual or strict types of functions. In this section, we present the robust approximate gradient sampling algorithm (RAGS algorithm). Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that are active at xk or active at one (or more) of the points in the set Y = [xk y 1 , . . . , y m ]. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius ∆k . Thus, the points in Y are not far from the current iterate. We define the robust active set next. Definition 3.1. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let y 0 = xk be the current iterate and Y = [y 0 , y 1 , y 2 , . . . , y m ] be a set of randomly sampled points from a ball centered at xk with radius ∆k . The robust active set of f on Y is A(y j ),  A(Y ) =  (3.1)  y j ∈Y  where each set A(y j ) is defined as in equation (1.2). To shed some light on the motivation for our robust algorithm, we consider the following function (originally from [CC78], Problem 2.1 CB2 in the test set [LV00]): 23  Chapter 3. Robust Approximate Gradient Sampling Algorithm f = max{f1 , f2 , f3 }, f1 (x, y) = x2 + y 4 , f2 (x, y) = (2 − x)2 + (2 − y)2 , f3 (x, y) = 2 exp(y − x).  where  Figure 3.1: Surface plot for Problem 2.1 CB2.  2  2  1.8  1.8  1.6  1.6  1.4  1.4  1.2  1.2  1  1  0.8  0.8  0.6  0.6  0.4  0.4  0.2  0.2  0  0 0  0.2  0.4  0.6  0.8  1  1.2  1.4  1.6  (a) AGS algorithm  1.8  2  0  0.2  0.4  0.6  0.8  1  1.2  1.4  1.6  1.8  2  (b) RAGS algorithm  Figure 3.2: Iterations taken using the simplex gradient for Problem 2.1 CB2. These figures illustrate example iterations taken by the AGS algorithm and by the RAGS algorithm (presented in this section) for Problem 2.1 CB2 using the simplex gradient as an approximate gradient (see Section 4.1 for details used in numerical results). In Figure 3.2(a), we can see that the iterations move perpendicular to the contours of the functions towards the nondifferentiable ridge. However, eventually the iterations stall at the nondifferentiable ridge. These results 24  3.1. Algorithm show that the AGS algorithm is performing similar to how we would expect the method of steepest descent to perform on this problem. In Figure 3.2(b), we can see that the iterations proceed similar to the AGS algorithm until they reach the nondifferentiable ridge. Here, instead of stalling at the nondifferentiable ridge, the algorithm turns and minimizes along the nondifferentiable ridge, moving towards the optimal solution. Thus, by incorporating the ‘almost active’ functions into our active set, our algorithm is able to further minimize the function f (x, y).  3.1  Algorithm  For the RAGS algorithm, we incorporate the robust active set by replacing Steps 1 and 2 of the AGS algorithm from Section 2.1 with the following. 1. Generate Approximate Subdifferential GkY (Robust): Generate a set Y = [xk , y 1 , . . . , y m ] around the current iterate xk such that max |y j − xk | ≤ ∆k . j=1,...,m  Use Y to calculate the approximate gradient of fi , denoted ∇A fi , at xk for each i ∈ A(Y ). Then set Gk = conv(∇A fi (xk ))i∈A(xk ) and GkY = conv(∇A fi (xk ))i∈A(Y ) . 2. Generate Search Direction: Let dk = − Proj(0|Gk ). Let dkY = − Proj(0|GkY ). Check if ∆k ≤ µk |dk |.  (3.2)  If (3.2) does not hold, then set xk+1 = xk , set ∆k+1 =  θµk |dk | θ∆k  if |dk | = 0 , if |dk | = 0  and go to Step 4. If (3.2) holds and |dk | < εtol , then STOP. Else, continue to the line search, using dkY as a search direction.  25  3.2. Convergence Notice that in Step 2, we still use the stopping conditions from Section 2. Although this modification requires the calculation of two projections, it should be noted that neither of these projections are particularly difficult to calculate and that no additional function evaluations are required for this modification. In Section 3.3, we use the Goldstein approximate subdifferential to adapt Theorem 2.12 to work for stopping conditions based on dkY , but we still do not have theoretical results for the exact subdifferential. In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. For the RAGS algorithm, this alteration shows convincing results that the robust stopping conditions not only guarantee convergence, but significantly decrease the number of function evaluations required for the RAGS algorithm to converge.  3.2  Convergence  To show that the RAGS algorithm is well-defined we require that when ∆ is small enough, the robust active set is in fact equal to the original active set. Lemma 3.2. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let Y = [¯ x, y 1 , . . . , y m ] be a randomly sampled set from a ball centered at x ¯ with radius ∆. Then there exists an ε˜ > 0 such that if Y ⊆ Bε˜(¯ x), then A(¯ x) = A(Y ). Proof. Clearly, if i ∈ A(¯ x), then i ∈ A(Y ), as x ¯∈Y. Consider i ∈ A(¯ x). Then by the definition of f , we have that fi (¯ x) < f (¯ x). We know that f is continuous (see Section 1.1). Therefore, there exists an ε˜i > 0 such that for all z ∈ Bε˜i (¯ x) fi (z) < f (z). If ∆ < ε˜i , then we have |y j − x ¯| < ε˜i for all j = 1, . . . , m. Therefore, fi (y j ) < f (y j )  for all j = 1, . . . , m,  (3.3)  so i ∈ A(Y ). Setting ε˜ = min ε˜i completes the proof. i∈A(¯ x)  26  3.2. Convergence Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when using the robust active set. Corollary 3.3. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Suppose ¯ > 0 such that given 0 ∈ ∂f (xk ) for each iteration k. Suppose there exists a K any set of points generated in Step 1 of the RAGS algorithm, the approximate ¯ k for all i. Then after a finite gradient satisfies |∇A fi (xk ) − ∇f (xk )| ≤ K∆ number of iterations, the RAGS algorithm will find function value decrease. Proof. Consider xk , where 0 ∈ / ∂f (xk ). For eventual contradiction, suppose we do not find function value decrease. In the RAGS algorithm, this corresponds to an infinite number of line search failures. If we have an infinite number of line search failures, then ∆k → 0 ¯ and xk = xk for all k¯ ≥ k. In Lemma 3.2, ε˜ depends only on xk . Hence, we can conclude that eventually ∆k < ε˜ and therefore Y k ⊆ Bε˜(xk ). Thus, eventually A(xk ) = A(Y k ). Once the two active sets are equal, the results of Section 2.2 will hold. Thus, by Corollary 2.9, the algorithm will find a new iterate with a lower function value, a contradiction. To examine convergence of the RAGS algorithm we use the result that eventually the robust active set at the current iterate will be a subset of the regular active set at any cluster point of the algorithm. Lemma 3.4. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let Y k = [xk , y 1 , . . . , y m ] be a randomly sampled set from a ball centered at xk with radius ∆k . Let xk → x ¯. Then there exists an ε˜ > 0 such that if k k Y ⊆ Bε˜(¯ x), then A(Y ) ⊆ A(¯ x). Proof. Let i ∈ / A(¯ x). By definition of f , we have that fi (¯ x) < f (¯ x). Since f is continuous, there exists an ε˜i > 0 such that for all z ∈ Bε˜i (¯ x) fi (z) < f (z). If Y k ⊆ Bε˜i (¯ x), then we have |xk − x ¯| < ε˜i and |y j − x ¯| < ε˜i for all j = 1, . . . , m. Therefore fi (xk ) < f (xk ) and fi (y j ) < f (y j )  for all j = 1, . . . , m.  Therefore, i ∈ A(Y k ). Letting ε˜ = min ε˜i completes the proof. i∈A(¯ x)  27  3.2. Convergence Now we examine the convergence of the RAGS algorithm. Theorem 3.5. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let {xk }∞ k=0 be an infinite sequence generated by the RAGS algorithm. Suppose ¯ > 0 such that given any set of points generated in Step 1 there exists a K of the RAGS algorithm, the approximate gradient satisfies the error bound ¯ k for all i. Suppose tk is bounded away from 0. |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ Then either 1. f (xk ) ↓ −∞, or 2. |dk | → 0, ∆k ↓ 0 and every cluster point x ¯ of the sequence {xk }∞ k=0 satisfies 0 ∈ ∂f (¯ x). Proof. If f (xk ) ↓ −∞, then we are done. Conversely, if f (xk ) is bounded below, then f (xk ) is non-increasing and bounded below, therefore f (xk ) converges. We consider two cases. Case 1: An infinite number of line search successes occur. k Let x ¯ be a cluster point of {xk }∞ k=0 . Notice that x only changes for line k search successes, so there exists a subsequence {x j }∞ k=0 of line search suck j cesses such that x → x ¯. Following the arguments of Theorem 2.11, we have x), |dkj | → 0 and ∆kj ↓ 0. Notice that if ∆kj ↓ 0, then eventually Y kj ⊆ Bε˜(¯ k where x → x ¯ and ε˜ is defined as in Lemma 3.4. Thus, by Lemma 3.4, we have that A(Y kj ) ⊆ A(¯ x). This means GY kj (xkj ) is formed from a subset of the active gradients that make up ∂f (¯ x). Thus, by Lemma 2.2(1), as x) from the same set of −dkj ∈ GY kj (xkj ), we can construct a v kj ∈ ∂f (¯ k j active gradients that make up GY kj (x ) such that | − dkj − v kj | αi ∇A fi (xkj ) −  = i∈A(Y  kj  )  αi ∇fi (¯ x) i∈A(Y  kj  )  kj  ≤  αi |∇A fi (x ) − ∇fi (¯ x)| i∈A(Y  kj )  =  |∇A fi (xkj ) − ∇fi (¯ x)|  ≤  |∇A fi (xkj ) − ∇fi (xkj )| + |∇fi (xkj ) − ∇fi (¯ x)|  ≤  ¯ k + |∇fi (xkj ) − ∇fi (¯ K∆ x)| j  (as  i∈A(Y kj )  αi = 1)  (by error bound).  28  3.2. Convergence Using the Triangle Inequality, we have | − dkj − v kj | = | − v kj − dkj | ≥ |v kj | − |dkj |, which implies that ¯ k + |∇fi (xkj ) − ∇fi (¯ |v kj | − |dkj | ≤ K∆ x)|. j We already showed that |dkj | → 0 and ∆kj ↓ 0. Furthermore, since ∇fi ∈ C and xkj → x ¯, we have |∇fi (xkj ) − ∇fi (¯ x)| → 0. So, lim |v kj | = 0.  j→∞  Using the same arguments as in Theorem 2.11, the result follows. Case 2: A finite number of line search successes occur. ¯ ¯ However, This means there exists a k¯ such that xk = xk = x ¯ for all k ≥ k. by Corollary 3.3, if 0 ∈ / ∂f (¯ x), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have 0 ∈ ∂f (¯ x). Remark 3.6. Using dk to check our stopping conditions allows the result of Theorem 2.12 to still hold.  3.2.1  Descent Direction: dY  It is worth noting that we can prove Lemma 2.2(2) for G(Y ) and thus, prove dY is a descent direction without the requirement that A(Y ) ⊆ A(¯ x). We show this in Lemmas 3.7 and 3.8. Lemma 3.7. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let Y = [y 0 , y 1 , y 2 , . . . , y m ] be a randomly sampled set from a ball centered at y 0 = x ¯ ¯ > 0 such that the approximate with radius ∆. Suppose there exists a K ¯ for all i. Then for all v ∈ ∂f (¯ gradient satisfies |∇A fi (¯ x)−∇fi (¯ x)| ≤ K∆ x), there exists a w ∈ G(Y ) such that ¯ |w − v| ≤ K∆. Proof. By definition, for all v ∈ ∂f (¯ x) there exists a set of αi such that αi ∇fi (¯ x)  v= i∈A(¯ x)  where αi ≥ 0,  αi = 1. i∈A(¯ x)  29  3.3. Robust Stopping Conditions We know that A(¯ x) ⊆ A(Y ), as x ¯ ∈ Y . Thus, using the same αi as above, we see that αi ∇A fi (¯ x) ∈ G(Y ). w= i∈A(¯ x)  Using the same argument as shown in Lemma 2.2(2), we can conclude that for all v ∈ ∂f (¯ x), there exists a w ∈ G(Y ) such that ¯ |w − v| ≤ K∆.  Using this result, we can show that dY is a descent direction. Lemma 3.8. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . Let Y = [y 0 , y 1 , y 2 , . . . , y m ] be a randomly sampled set from a ball centered at ¯ > 0 such that the apy0 = x ¯ with radius ∆. Suppose there exists a K ¯ for all i. Define proximate gradient satisfies |∇A fi (¯ x) − ∇fi (¯ x)| ≤ K∆ ¯ dY = − Proj(0|G(Y )) and suppose |dY | = 0. If K∆ < |dY |, then dY , v < 0 for all v ∈ ∂f (¯ x). Proof. Using Lemma 3.7 instead of Lemma 2.2(2) in the proof of Corollary 2.5, we can prove the result.  3.3  Robust Stopping Conditions  We want to provide some insight as to how Theorem 2.12 can work for stopping conditions based on dkY , that is, replacing the stopping conditions ∆k ≤ µk |dk | and |dk | < εtol in Step 2 with the robust stopping conditions ∆k ≤ µk |dkY | and |dkY | < εtol .  (3.4)  The following proposition does not theoretically justify why the robust stopping conditions in (3.4) are sufficient, but does help explain their logic. Theoretically, since we do not know what x ¯ is, we cannot tell when A(Y k ) ⊆ A(¯ x). Proposition 3.9. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1 . ¯ > 0 such that |∇A fi (x) − ∇fi (x)| ≤ K∆ ¯ k for all i Suppose there exists a K k and for all x ∈ B∆k (x ). Suppose the RAGS algorithm terminates at some iteration k¯ in Step 2 using the robust stopping conditions given in (3.4). ¯ ¯ Furthermore, suppose there exists x ¯ ∈ B∆k¯ (xk ) such that A(Y k ) ⊆ A(¯ x). Then ¯ 0 )εtol . Proj(0|∂f (¯ x)) < (1 + Kµ 30  3.3. Robust Stopping Conditions ¯  x), then the proofs of Lemma 2.2(1) and Theorem 2.12 Proof. If A(Y k ) ⊆ A(¯ still hold. Additionally, in the following results, we approach the theory for robust stopping conditions using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2, then it is shown that the distance between 0 and the Goldstein approximate subdifferential is controlled by εtol . Again, this does not prove the robust stopping conditions are sufficient for the exact subdifferential. Definition 3.10. The Goldstein approximate subdifferential, as defined in [Gol77], is given by the set G ∂∆ f (¯ x) = conv(∂f (z) : z ∈ B∆ (¯ x)),  (3.5)  where B∆ (¯ x) is the closed ball centered at x ¯ with radius ∆. We now show that the Goldstein approximate subdifferential contains all of the gradients of the active functions in the robust active set. Lemma 3.11. Let f = max{fi : i = 1, . . . , N }. Let Y = [y 0 , y 1 , y 2 , . . . , y m ] be a randomly sampled set from a ball centered at y 0 = x ¯ with radius ∆. If fi ∈ C 1 for each i, then G ∂∆ f (¯ x) ⊇ conv(∇fi (y j ) : y j ∈ Y, i ∈ A(y j )).  Proof. If fi ∈ C 1 for each i ∈ A(Y ), then by Proposition 1.6, for each y j ∈ Y we have ∂f (y j ) = conv(∇fi (y j ))i∈A(yj ) = conv(∇fi (y j ) : i ∈ A(y j )). Using this in our definition of the Goldstein approximate subdifferential in (3.5) and knowing B∆ (¯ x) ⊇ Y , we have G ∂∆ f (¯ x) ⊇ conv(conv(∇fi (y j ) : i ∈ A(y j )) : y j ∈ Y ),  which simplifies to G ∂∆ f (¯ x) ⊇ conv(∇fi (y j ) : y j ∈ Y, i ∈ A(y j )).  (3.6)  Now we have a result similar to Lemma 2.2(1) for dkY with respect to the Goldstein approximate subdifferential. 31  3.3. Robust Stopping Conditions Remark 3.12. For the following two results, we assume each of the fi ∈ C 1+ with Lipschitz constant L. Note that this implies the Lipschitz constant L is independent of i. If each fi ∈ C 1+ has Lipschitz constant Li for ∇fi , then L is easily obtained by L = max{Li : i = 1, . . . , N }. Lemma 3.13. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1+ with Lipschitz constant L. Let Y = [y 0 , y 1 , y 2 , . . . , y m ] be a randomly sampled set ¯ >0 from a ball centered at y 0 = x ¯ with radius ∆. Suppose there exists a K ¯ such that the approximate gradient satisfies |∇A fi (¯ x) − ∇fi (¯ x)| ≤ K∆ for G all i. Then for all w ∈ G(Y ), there exists a g ∈ ∂∆ f (¯ x) such that ¯ + L)∆. |w − g| ≤ (K Proof. By definition, for all w ∈ G(Y ) there exists a set of αi such that αi ∇A fi (¯ x),  w=  where αi ≥ 0,  i∈A(Y )  αi = 1. i∈A(Y )  By our assumption that each fi ∈ C 1+ , Lemma 3.11 holds. It is clear that for each i ∈ A(Y ), i ∈ A(y j ) for some y j ∈ Y . Let ji be an index corresponding to this active index; i.e., i ∈ A(y ji ). Thus, for each i ∈ A(Y ), there is a corresponding active gradient G f (¯ x). ∇fi (y ji ) ∈ conv(∇fi (y ji ) : y ji ∈ Y, i ∈ A(y ji )) ⊆ ∂∆  Using the same αi as above, we can construct G f (¯ x). αi ∇fi (y ji ) ∈ conv(∇fi (y ji ) : y ji ∈ Y, i ∈ A(y ji )) ⊆ ∂∆  g= i∈A(Y )  Then |w − g| =  |  αi ∇fi (y ji )|  αi ∇A fi (¯ x) − i∈A(Y )  i∈A(Y )  αi |∇A fi (¯ x) − ∇fi (y ji )|  ≤ i∈A(Y )  αi |∇A fi (¯ x) − ∇fi (¯ x) + ∇fi (¯ x) − ∇fi (y ji )|  = i∈A(Y )  αi |∇A fi (¯ x) − ∇fi (¯ x)| + |∇fi (¯ x) − ∇fi (y ji )| .  ≤ i∈A(Y )  32  3.3. Robust Stopping Conditions Applying our error bound assumption and the Lipschitz condition, we have |w − g|  ¯ + L max |¯ αi K∆ x − y ji |  ≤  ji  i∈A(Y )  ≤  ¯ + L∆ K∆  (as  αi = 1) i∈A(Y )  =  ¯ + L)∆. (K  G Hence, for all w ∈ G(¯ x), there exists a g ∈ ∂∆ f (¯ x) such that  |w − g| ≤ ε + L∆.  (3.7)  Thus, using Lemma 3.13, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled by εtol . Proposition 3.14. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1+ ¯ > 0 such that for each with Lipschitz constant L. Suppose there exists a K ¯ k iteration k, the approximate gradient satisfies |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ for all i. Suppose the RAGS algorithm terminates at some iteration k¯ in Step 2 using the robust stopping conditions given in (3.4). Then ¯ G ¯ + L)]εtol . Proj(0|∂∆ f (xk )) < [1 + µ0 (K ¯ k ¯  ¯  G Proof. Let w ¯ = Proj(0|G(Y k )). We use g¯ ∈ ∂∆ f (xk ) as constructed in ¯ k Lemma 3.13 to see that G dist(0|∂∆  ¯ ¯ )f (xk ) k  ≤ dist(0|¯ g) ¯  ¯  G ≤ dist(0|w) ¯ + dist(w|¯ ¯ g ) (as w ¯ ∈ G(Y k ), g¯ ∈ ∂∆ f (xk )) ¯ k  ¯  = |dkY | + |w ¯ − g¯| ¯ k ¯ + L)∆¯ ≤ |d | + (K Y  ¯  ¯  (as |dkY | = | Proj(0|G(Y k ))|) (by Lemma 3.13)  k  ¯  ¯ + L)∆¯ < εtol + (K k  (as |dkY | < εtol ). ¯  The statement now follows by the test ∆k¯ ≤ µk¯ |dkY | in Step 2 and the fact that µk¯ ≤ µ0 as {µk }k=0 is a non-increasing sequence.  33  Chapter 4  Approximate Gradients As seen in the previous two sections, in order for convergence to be guaranteed in the AGS and RAGS algorithms, the approximate gradient used must satisfy an error bound for each of the active fi . Specifically, for a randomly sampled set of points Y = [xk , y 1 , . . . , y m ] centered around xk , ¯ > 0 such that there must exist a K ¯ k, |∇A fi (xk ) − ∇fi (xk )| ≤ K∆ where ∆k = maxj |y j − xk |. In this section, we present three specific approximate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function.  4.1  Simplex Gradient  The simplex gradient is a commonly used approximate gradient. In recent years, several derivative-free algorithms have been proposed that use the simplex gradient, ([BK98], [Kel99a], [CV07], [CJV08], and [HM11]) among others. It is geometrically defined as the gradient of the linear interpolation of f over a set of n + 1 points in Rn . In the following section, we mathematically define the simplex gradient.  4.1.1  Definitions  First we present several basic definitions of terms used in the definition of the simplex gradient. Definition 4.1. A set S is an affine set if given any two points x1 ∈ S and x2 ∈ S with x1 = x2 , the line formed by x1 and x2 is a subset of S, i.e., x1 x2 ⊆ S. Definition 4.2. The affine hull of a set S ⊆ Rn is the smallest affine set containing S. 34  4.1. Simplex Gradient Definition 4.3. A set of m + 1 points Y = [y 0 , y 1 , . . . , y m ] is said to be affinely independent if its affine hull aff(y 0 , y 1 , . . . , y m ) has dimension m. Equivalently, Y is affinely independent if the set [y 1 − y 0 , . . . , y m − y 0 ] is linearly independent, [CSV09]. Definition 4.4. Let Y = [y 0 , y 1 , . . . , y n ] be a set of affinely independent points in Rn . We say that Y forms the simplex, S, where S = conv(Y ). Thus, S is a simplex if it can be written as the convex hull of an affinely independent set of n + 1 points in Rn . With the previous terms defined, we are now ready to mathematically define the simplex gradient. Definition 4.5. Let Y = [y 0 , y 1 , . . . , y n ] be a list of affinely independent points in Rn . The simplex gradient of a function f over the set Y is given by ∇s f (Y ) = L−1 δf (Y ), where L = y1 − y0 · · · yn − y0 and   f (y 1 ) − f (y 0 )   .. δf (Y ) =  . . n 0 f (y ) − f (y )   ˆ −1 , where The condition number of a simplex is given by L ˆ = 1 [y 1 − y 0 y 2 − y 0 . . . y n − y 0 ] and ∆ = max |y j − y 0 |. L j=1,...n ∆  4.1.2  Convergence  The following result (from Kelley) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to ∇fi . Theorem 4.6. Consider fi ∈ C 1+ with Lipschitz constant Ki for ∇fi . Let Y = [y 0 , y 1 , . . . , y n ] form a simplex. Let ˆ = 1 [y 1 − y 0 y 2 − y 0 . . . y n − y 0 ] , L ∆  where ∆ = max |y j − y 0 |. j=1,...n  35  4.2. Centered Simplex Gradient Then the simplex gradient satisfies the error bound ¯ |∇s fi (Y ) − ∇fi (y 0 )| ≤ K∆, √ ¯ = 1 Ki n L ˆ −1 . where K 2 Proof. See [Kel99b, Lemma 6.2.1]. With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.7. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1+ with Lipschitz constant Ki for ∇fi . If the approximate gradient used in the AGS ˆ −1 is bounded above, or RAGS algorithm is the simplex gradient and L then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold.  4.1.3  Algorithm  Algorithm - Simplex Gradient In order to calculate a simplex gradient in Step 1, we generate a set Y = [xk , y 1 , . . . , y n ] of points in Rn and then check to see if Y forms a ˆ −1 . A bounded well-poised simplex by calculating its condition number, L −1 ˆ condition number ( L < n) ensures a ‘good’ error bound between the approximate gradient and the exact gradient. ˆ −1 < n), then we calculate the If Y forms a well-poised simplex ( L simplex gradient of fi over Y for each i ∈ A(xk ) (or each i ∈ A(Y k )) and then set the approximate subdifferential equal to the convex hull of the active simplex gradients. If the set Y does not form a well-poised simplex ˆ −1 ≥ n), then we resample. We note that the probability of generating ( L a random matrix with a condition number greater than n is asymptotically constant, [Wsc04]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not affect the number of function evaluations required by the algorithm.  4.2  Centered Simplex Gradient  The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it contains an advantage that the error bound satisfied by the centered simplex gradient is in terms of ∆2 , rather than ∆. 36  4.2. Centered Simplex Gradient  4.2.1  Definitions  Definition 4.8. Let Y = [y 0 , y 1 , . . . , y n ] form a simplex. We define the sets Y + = [x, x + y˜1 , . . . , x + y˜n ] and Y − = [x, x − y˜1 , . . . , x − y˜n ], where x = y 0 and y˜i = y i − y 0 for i = 1, . . . n. The centered simplex gradient is the average of the two simplex gradients over the sets Y + and Y − , i.e., 1 ∇CS f (Y + ) = (∇S f (Y + ) + ∇S f (Y − )). 2  4.2.2  Convergence  To show that the AGS and RAGS algorithms are well-defined when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again from Kelley). Theorem 4.9. Consider fi ∈ C 2+ with Lipschitz constant Ki for ∇2 fi . Let Y = [y 0 , y 1 , . . . , y n ] form a simplex. Let ˆ = 1 [y 1 − y 0 , . . . , y n − y 0 ] L ∆  where ∆ = max |y j − y 0 |. j=1,...n  Then the centered simplex gradient satisfies the error bound ¯ 2, |∇CS fi (Y ) − ∇fi (y 0 )| ≤ K∆ ¯ = Ki √n L ˆ −1 . where K Proof. See [Kel99b, Lemma 6.2.5]. Notice that Theorem 4.9 requires fi ∈ C 2+ . If fi ∈ C 1+ , then the error bound is in terms of ∆, not ∆2 . With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.10. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 2+ with Lipschitz constant Ki for ∇2 fi . If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient and ∆0 ≤ 1, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. Proof. Since ∆0 ≤ 1 and non-increasing, (∆k )2 ≤ ∆k and ergo, Theorems 2.6 and 2.11 hold. 37  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction  4.2.3  Algorithm  To adapt the AGS and RAGS algorithms to use the centered simplex gradient, in Step 1 we sample our set Y in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets Y + and Y − and proceed as expected.  4.3  Gupal Estimate of the Gradient of the Steklov Averaged Function  The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [Kiw10] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.16 that an appropriate error bound exists for this approximate gradient. Surprisingly, unlike the error bounds for the simplex and centered simplex gradients, the error bound in Theorem 4.16 does not include a condition number term. Following the notation used by Kiwiel in [Kiw10], we define the Gupal estimate of the gradient of the Steklov averaged function as follows.  4.3.1  Definitions  Definition 4.11. For α > 0, the Steklov averaged function fα is defined by f (x − y)ψα (y)dy,  fα (x) = Rn  where ψα : Rn → R+ is the Steklov mollifier defined by ψα (y) =  1/αn if y ∈ [−α/2, α/2]n , 0 otherwise.  We can equivalently define the Steklov averaged function by fα (x) =  1 αn  x1 +α/2  xn +α/2  ··· x1 −α/2  f (y)dy1 . . . dyn .  (4.1)  xn −α/2  The partial derivatives of fα are given by ∂fα (x) = ∂xi  γi (x, α, ζ)dζ  (4.2)  B∞  38  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction for i = 1, . . . , n, where B∞ = [−1/2, 1/2]n is the unit cube centred at 0 and γi (x, α, ζ) = 1 1 f (x1 + αζ1 , . . . , xi−1 + αζi−1 , xi + α, xi+1 + αζi+1 , . . . , xn + αζn ) α 2 1 −f (x1 + αζ1 , . . . , xi−1 + αζi−1 , xi − α, xi+1 + αζi+1 , . . . , xn + αζn ) . 2 (4.3) Definition 4.12. Given α > 0 and z = (ζ 1 , . . . , ζ n ) ∈ estimate of ∇fα (x) over z is given by  n i=1 B∞ ,  γ(x, α, z) = (γ1 (x, α, ζ 1 ), . . . , γn (x, α, ζ n )).  the Gupal  (4.4)  Remark 4.13. Although we define γ(x, α, z) as the Gupal estimate of ∇fα (x), in Section 4.3.2, we will show that γ(x, α, z) provides a good approximate of the exact gradient, ∇fi (x). Remark 4.14. For the following results, we note that the α used in the above definitions is equivalent to our search radius ∆. Thus, we will be replacing α with ∆ in the convergence results in Section 4.3.2.  4.3.2  Convergence  Convergence As before, in order to show that the AGS and RAGS algorithms are well-defined when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximate of our exact gradient. To do this, we need the following result. Lemma 4.15. Let f ∈ C 1+ with Lipschitz constant K for ∇f . Let y 0 ∈ Rn . Then for any y ∈ Rn 1 |f (y) − f (y 0 ) − ∇f (y 0 ) (y − y 0 )| ≤ K|y − y 0 |2 . 2 Proof. For ease of notation, let δ = y − y 0 . By the Fundamental Theorem of Calculus, for any y ∈ Rn we have 1  f (y) − f (y 0 ) =  ∇f (y 0 + τ (δ)) δ dτ.  (4.5)  0  39  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Considering ∇f (y 0 ) δ, notice that 1  ∇f (y 0 ) δ =  ∇f (y 0 ) δ dτ.  (4.6)  0  Subtracting (4.6) from (4.5), we have 1  (∇f (y 0 + τ δ) − ∇f (y 0 )) δ dτ.  f (y) − f (y 0 ) − ∇f (y 0 ) δ = 0  Taking the absolute value, we have f (y) − f (y 0 ) − ∇f (y 0 ) δ 1  (∇f (y 0 + τ δ) − ∇f (y 0 )) δ dτ  = 0 1  ≤  (∇f (y 0 + τ δ) − ∇f (y 0 )) δ dτ  (as ∇f is cont.)  (∇f (y 0 + τ δ) − ∇f (y 0 )) |δ| dτ  (by Cauchy Schwarz)  0 1  ≤ 0 1  K|y 0 + τ δ − y 0 ||δ|dτ  ≤  (as ∇f is Lip.)  0 1  Kτ |δ|2 dτ  =  (as τ ∈ (0, 1))  0  =  1 K|δ|2 . 2  Therefore, with δ = y − y 0 , we have 1 |f (y) − f (y 0 ) − ∇f (y 0 ) (y − y 0 )| ≤ K|y − y 0 |2 . 2  Using Lemma 4.15, we establish an error bound between the Gupal estimate and the exact gradient of fi .  40  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Theorem 4.16. Consider fi ∈ C 1+ with Lipschitz constant Ki for ∇fi . Let ε > 0. Then for ∆ > 0, z = (ζ 1 , . . . , ζ n ) ∈ Z = ni=1 B∞ and any point x ∈ Rn , the Gupal estimate of ∇fi,α (x) satisfies the error bound |γ(x, ∆, z) − ∇fi (x)| ≤  √ 1 √ n Ki ∆( n + 3). 2  Proof. For ∆ > 0, let 1 y j− = [x1 + ∆ζ1 , . . . , xj−1 + ∆ζj−1 , xj − ∆, xj+1 + ∆ζj+1 , . . . , xn + ∆ζn ] . 2 By Lemma 4.15, for 1 y j+ = [x1 + ∆ζ1 , . . . , xj−1 + ∆ζj−1 , xj + ∆, xj+1 + ∆ζj+1 , . . . , xn + ∆ζn ] , 2 we have 1 |fi (y j+ ) − fi (y j− ) − ∇fi (y j− ) (y j+ − y j− )| ≤ Ki |y j+ − y j− |2 . 2  (4.7)  From equation (4.3) (with α = ∆), we can see that fi (y j+ ) − fi (y j− ) = ∆ γj (x, ∆, ζ j ). Hence, equation (4.7) becomes 1 |∆ γj (x, ∆, ζ j ) − ∇fi (y j− ) (y j+ − y j− )| ≤ Ki |y j+ − y j− |2 . 2  (4.8)  From our definitions of y j− and y j+ , we can see that y j+ − y j− = [0, . . . , 0, ∆, 0, . . . , 0] . The inner product in equation (4.8) simplifies to ∇fi (y j− ) (y j+ − y j− ) = ∆  ∂fi j− (y ). ∂xj  Thus, we have ∆ γj (x, ∆, ζ j ) − ∆  ∂fi j− 1 (y ) ≤ Ki ∆2 . ∂xj 2  Dividing out ∆ gives γj (x, ∆, ζ j ) −  ∂fi j− 1 (y ) ≤ Ki ∆. ∂xj 2  (4.9) 41  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction We notice that |y j− − x|  1 j j , . . . , ∆ζnj ) (∆ζ1j , . . . , ∆ζj−1 , − ∆, ∆ζj+1 2  =  1 j j ∆ (ζ1j , . . . , ζj−1 , . . . , ζnj ) . , − , ζj+1 2  =  Using the standard basis vector ej , we have |y j− − x|  =  1 ∆ ζ j − ζj e j − e j 2  ≤  ∆ |ζ j | + |ζj | +  ≤  ∆  =  1 √ ∆( n + 2). 2  1 2  1√ 1 1 n+ + 2 2 2  (as ζ j ∈ [−1/2, 1/2]n )  Thus, since fi ∈ C 1+ , we have 1 √ |∇fi (y j− ) − ∇fi (x)| ≤ Ki ∆( n + 2). 2  (4.10)  Now, we notice that ∂fi j− ∂fi (y ) − (x) ∂xj ∂xj =  ∇fi (y j− ) ej − ∇fi (x) ej  ≤  |∇fi (y j− ) − ∇fi (x)||ej |  =  |∇fi (y j− ) − ∇fi (x)|.  Therefore, 1 √ ∂fi j− ∂fi (y ) − (x) ≤ Ki ∆( n + 2). ∂xj ∂xj 2  (4.11)  42  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Using equations (4.9) and (4.11), we have γj (x, ∆, ζ j ) −  ∂fi (x) ∂xj  =  γj (x, ∆, ζ j ) −  ∂fi j− ∂fi ∂fi j− (y ) + (y ) − (x) ∂xj ∂xj ∂xj  ≤  γj (x, ∆, ζ j ) −  ∂fi j− ∂fi ∂fi j− (y ) + (y ) − (x) ∂xj ∂xj ∂xj  ≤  1 √ 1 Ki ∆ + Ki ∆( n + 2) 2 2  =  √ 1 Ki ∆( n + 3). 2  Hence, |γ(x, ∆, z) − ∇fi (x)| (x, ∆, ζ 1 )  ∂fi − (x) ∂x1  =  γ1  ≤  √ 1 Ki ∆( n + 3) 2  =  =  n  √ 1 Ki ∆( n + 3) 2  2  + · · · + γn  2  + ··· +  (x, ∆, ζ n )  √ 1 Ki ∆( n + 3) 2  ∂fi − (x) ∂xn  2  2  2  √ 1 √ n Ki ∆( n + 3). 2  We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of f as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.17. Let f = max{fi : i = 1, . . . , N } where each fi ∈ C 1+ with Lipschitz constant Ki for ∇fi . If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. 43  4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction  4.3.3  Algorithm  Algorithm To use the Gupal estimate of the gradient of the Steklov averaged function in the AGS and RAGS algorithms, in Step 1, we sample independently n×n , where m is the number and uniformly {z kl }m l=1 from the unit cube in R of active functions. Then proceed as expected.  44  Chapter 5  Numerical Results 5.1  Versions of the AGS Algorithm  We implemented the AGS and RAGS algorithms using the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as approximate gradients. Additionally, we used the robust descent direction to create robust stopping conditions. That is, the AGS and RAGS algorithms terminate when ∆k ≤ µk |dkY | and |dkY | < εtol ,  (5.1)  where dkY is the projection of 0 onto the approximate subdifferential generated using the robust active set. (See Lemmas 3.11 and 3.13, and Proposition 3.14 for results linking the robust stopping conditions with the Goldstein approximate subdifferential.) The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available upon request. Let dk denote the regular descent direction and let dkY denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions: 1. |dk | = |dkY |; 2. |dk | > |dkY |, but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or 3. |dk | > |dkY |, but checking the stopping conditions leads to a different result. In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In Scenario 3, we have two cases: 1. ∆k ≤ µk |dkY | ≤ µk |dk |, but |dk | ≥ εtol and |dkY | < εtol or 2. ∆k ≤ µk |dk | holds, but ∆k > µk |dkY |.  45  5.2. Test Sets and Software Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: to terminate early, providing a solution that has a smaller quality measure, but requires less function evaluations to find, or to reduce its search radius instead of carrying out a line search, reducing the number of function evaluations carried out during that iteration and calculating a more accurate approximate subdifferential at the next iteration. Our goal in this testing is to determine if there are any notable numerical differences in the quality of the three approximate gradients (simplex, centered simplex, and Gupal estimate), the two search directions (non-robust and robust), and the two stopping conditions (non-robust and robust). This results in the following 12 versions: AGS Simplex (1. non-robust /2. robust stopping) RAGS Simplex (3. non-robust /4. robust stopping) AGS Centered Simplex (5. non-robust /6. robust stopping) RAGS Centered Simplex (7. non-robust /8. robust stopping) AGS Gupal (9. non-robust /10. robust stopping) RAGS Gupal (11. non-robust/12. robust stopping)  5.2  Test Sets and Software  Testing was performed on a 2.0 GHz Intel Core i7 Macbook Pro and a 2.2 GHz Intel Core 2 Duo Macbook Pro. We used the test set from LukˇsanVlˇcek, [LV00]. The first 25 problems presented are of the desired form min F (x) where F (x) = max{fi (x) : i = 1, 2, . . . , N }. x  Of these 25 problems, we omit problem 2.17 because the sub-functions are complex-valued. Thus, our test set is a total of 24 finite minimax problems with dimensions from 2 to 20. There are several problems featuring functions fi that have the form fi = |fi |, where fi is a smooth function. We rewrote these functions as fi = max{fi , −fi }. The resulting test problems have from 2 to 130 sub-functions. A summary of the test problems appears in Table 1 in Appendix A.  46  5.3. Initialization and Stopping Conditions  5.3  Initialization and Stopping Conditions  We first describe our choices for the initialization parameters used in the AGS and RAGS algorithms. The initial starting points are given for each problem in [LV00]. We set the initial accuracy measure to 0.5 with a reduction factor of 0.5. We set the initial search radius to 0.1 with a reduction factor of 0.5. The Armijolike parameter η was chosen to be 0.1 to ensure that a line search success resulted in a significant function value decrease. We set the minimum step length to 10−10 . Next, we discuss the stopping tolerances used to ensure finite termination of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The first reason corresponds to our theoretical stopping conditions, while the remaining three reasons are to ensure numerical stability of the algorithm. 1. Stopping conditions met: As stated in the theoretical algorithm, the algorithm terminates for this reason when ∆k ≤ µk |dk | and |dk | < εtol , where dk is either the regular or the robust descent direction. 2. Hessian matrix has NaN / Inf entries: For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates. 3. ∆k , µk , and |dk | are small: This stopping criterion bi-passes the test ∆k ≤ µk |dk | (in Step 2) and stops if ∆k < ∆tol , µk < µtol and |dk | < εtol . Examining Theorem 2.12 along with Theorems 4.6, 4.9 and 4.16, it is clear that this is also a valid stopping criterion. We used a bound of 10−6 in our implementation for both ∆k and µk . 4. Max number of function evaluations reached - As a final failsafe, we added an upper bound of 106 on the number of function evaluations allowed. (This stopping condition only occurs once in our results.)  5.4  Computing a Descent Direction  In each iteration of the AGS algorithm, we must compute the projection of 0 onto the approximate subdifferential G(xk ). Additionally, for the RAGS algorithm, we must compute the projection of 0 onto the robust approximate subdifferential G(Y k ). 47  5.4. Computing a Descent Direction To explain how we solve this problem in MATLAB, we first define the projection of 0 onto the convex hull of a set of points X = [x1 , x2 , . . . , xn ]. By definition, n  conv(X) = {z : z =  n  αi xi , αi ≥ 0, i=1  αi = 1, xi ∈ X}, i=1  where x ∈ Rn and α = [α1 , . . . , αn ] ∈ Rn . By Definition 1.3, we know that the projection p of a point x ∈ Rn onto a closed convex set C is in arg min{|y − x|2 : y ∈ C}. Hence, we have n  Proj(0| conv(X)) ∈ arg min{|z|2 : z = (z,α)  n  αi xi , αi ≥ 0, i=1  αi = 1, xi ∈ X}. i=1  Next we show that this minimization problem is a quadratic program that can be solved using a quadratic program solver. For the sake of notation, let x = xk . We are given G(x) = conv(∇A fi (x))i∈A(x) , i.e., G(x) = {g : g =  λi ∇A fi (x), i∈A(x)  λi = 1, λi ≥ 0}. i∈A(x)  Without loss of generality, assume the first m functions in our finite set of fi functions are active (m ≤ n). Let λ = [λ1 λ2 . . . λm ] and Λ = [∇A f1 (x) ∇A f2 (x) . . . ∇A fm (x)]. Then for all g ∈ G(x), g = Λλ  for some λ ≥ 0,  λi = 1. i∈A(¯ x)  Define A = [1 1 . . . 1]. Then Aλ = 1. Now g  2  = (Λλ) (Λλ) = λ Λ Λλ.  48  5.5. Results Let H =Λ Λ       =     .   ∇A f1 (x)2 ∇A f2 (x) ∇A f1 (x) .. .  ∇A f1 (x) ∇A f2 (x) · · · ∇A f1 (x) ∇A fm (x) ∇A f2 (x)2 · · · ∇A f2 (x) ∇A fm (x) .. .. .. . . . ∇A fm (x) ∇A f1 (x) ∇A fm (x) ∇A f2 (x) · · · ∇A fm (x)2  Then we have the problem min{λ Hλ : λ ≥ 0, Aλ = 1}. λ  (5.2)  After solving this problem, we have d = − Proj(0|G(x)) = −Λλ. Lemma 5.1. The matrix H defined above is positive semi-definite and therefore, the minimization problem in (5.2) is a convex quadratic program. Proof. Clearly, H ∈ R(m+1)×(m+1) is symmetric as ∇A fi (x) ∇A fj (x) = ∇A fj (x) ∇A fi (x) for all i = j, i, j = 1, . . . , m. Furthermore, we have for all y ∈ Rn , y Hy = y Λ Λy = (Λy) (Λy) = |Λy|2 ≥ 0. Thus, H is positive semi-definite. In implementation, we used the quadprog.m function in MATLAB to solve for λ. The default algorithm for the quadprog.m function is the interior point algorithm, which finds the projection in a limiting sense. As the quadratic programs solved in our algorithm are of small size, we used the active-set algorithm instead, thus, finding exact solutions.  5.5  Results  Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality was measured  49  5.5. Results by the improvement in the number of digits of accuracy, which is calculated using the formula |Fmin − F ∗ | , − log |F0 − F ∗ | where Fmin is the function value at the final (best) iterate, F ∗ is the true minimum value (optimal value) of the problem (as given in [LV00]) and F0 is the function value at the initial iterate. Results on function evaluations and solution qualities appear in Tables 6.2, 6.3 and 6.4 of the appendix. To visually compare algorithmic versions, we use performance profiles. A performance profile is the (cumulative) distribution function for a performance metric [DM02]. For the AGS and RAGS algorithm, the performance metric is the ratio of the number of function evaluations taken by the current version to successfully solve each test problem versus the least number of function evaluations taken by any of the versions to successfully solve each test problem. Performance profiles eliminate the need to discard failures in numerical results and provide a visual representation of the performance difference between several solvers. For each performance profile, we have a set S of ns solvers and a set P of np problems. Our performance profiles define tp,s to be the number of function evaluations required to solve problem p by solver s. The baseline for comparison used is the following performance ratio rp,s =  tp,s . min{tp,s : s ∈ S}  As an overall assessment of the performance of the solver, we define ρs (τ ) =  1 size{p ∈ P : rp,s ≤ τ }. np  So, “ρs (τ ) is the probability for solver s ∈ S that a performance ratio rp,s is within a factor τ ∈ R of the best possible ratio”, [DM02]. In Figures 5.1(a) and 5.1(b) we include a performance profile showing all 12 versions our algorithm tested, declaring a success for 1 digit and 3 digits of improvement, respectively. We used PProfile.m in MATLAB, a function courtesy of Dr. Warren Hare and Valentine Koch, to generate the performance profiles.  50  5.5. Results Performance Profile on AGS and RAGS Algorithms (successful improvement of 1 digit) 1  0.9  0.8  0.7  ρ(τ)  0.6  0.5  0.4  0.3  0.2  0.1  0 0 10  1  2  10  3  10  10  τ  (a) Accuracy improvement of 1 digit.                    Performance Profile for AGS and RAGS Algorithms (successful improvement of 3 digits) 0.6  0.5  ρ(τ)  0.4  0.3  0.2  0.1  0 0 10  1  2  10  10  3  10  τ  (b) Accuracy improvement of 3 digits.  Figure 5.1: Performance profiles for 12 versions of AGS/RAGS algorithm.  51  5.5. Results In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It only produces 1 or more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no significant difference between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients. Looking at the results in Tables 6.2, 6.3 and 6.4, and our performance profiles, we can make the following two observations: i) the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and ii) the RAGS algorithm using the robust stopping conditions terminates faster and with lower (but still significant) accuracy. Robust active set: From our results, it is clear that expanding the active set to include ‘almost active’ functions in the RAGS algorithm greatly improves performance for the simplex and centered simplex algorithm. This robust active set brings more local information into the approximate subdifferential and thereby allows for descent directions that are more parallel to any nondifferentiable ridges formed by the function. Robust stopping conditions: We notice from the performance profiles that in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results correspond with our previously discussed hypothesis. Furthermore, upon studying the reasons for termination, it appears that the non-robust stopping conditions cause the AGS and RAGS algorithms to terminate mainly due to ∆k and µk becoming too small. For the robust stopping conditions, the RAGS algorithm terminated often because the stopping conditions were satisfied. As our theory in Section 3.3 is not complete, we cannot make any theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.11). However, from our results, we conjecture that the alteration is beneficial for decreasing function evaluations. In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS algorithm in average accuracy obtained over 25 trials using the simplex and centered simplex gradients. Knowing this, we conclude that the improvement of the accuracy is due to the choice of a robust search direction.  52  5.6. An Application in Seismic Retrofitting  5.6  An Application in Seismic Retrofitting  This application deals with the potentially destructive inter-story drift that occurs during an earthquake. In high density metropolitan areas, the close proximity of buildings leaves little room for side-to-side movement. As a means of lessening the destructive effects of an earthquake on high rise buildings, dampers are placed between closely adjacent buildings. When the buildings sway, the dampers absorb the horizontal movement. For this application, there are two optimization problems that need to be solved. The first is determining the optimal positions of the dampers. The second is determining the optimal damping coefficients of damper connectors. Details on the discrete optimization problem of optimal damper configurations can be found in [BHT12]. The problem of determining optimal damping coefficients is a continuous optimization problem. You can think of a damping coefficient like a friction coefficient; it represents the effect the damper has on the velocity of the ith floor of the building. In 2011, Bigdeli, Hare and Tesfamariam presented the first research on the application of mathematical optimization to the problem of optimal damping coefficients [BHT11]. As presented, a set of damping coefficients is run through a simulation that generates the ‘inter-story drift’ of each floor. The objective is to minimize the maximum inter-story drift. As interstory drift is computed via simulation, no derivative information is available and DFO methods are indispensable. Applying the Nelder-Mead method, a well-known DFO algorithm, to this problem, it is shown in [BHT11] that non-uniform damping coefficients can greatly improve system performance. This problem is a finite minimax problem. The details of the problem can be found in [BHT11]; simply put, we are trying to minimize the maximum inter-story drift between buildings. In this section, we explore the performance of two commercial solvers and the RAGS algorithm (using the simplex gradient and the robust stopping conditions) on this problem. The two commercial solvers we look at are the DFO pattern search and genetic algorithms defined in MATLAB. The pattern search algorithm is categorized as a directional direct-search method. As in any optimization algorithm, we desire to find a new point xk+1 such that f (xk+1 ) < f (xk ). There are two phases to this algorithm as described in [CSV09]: the search step and the poll step. In the search step, the algorithm searches a finite set of points for function value decrease. If found, then the iterate is updated and the algorithm loops. Else, the algorithm carries out a poll step. A poll step is a localized search around 53  5.6. An Application in Seismic Retrofitting the current iterate using a given directional matrix. The genetic algorithm follows the framework of initialization, selection and reproduction. In the initialization stage, the algorithm randomly generates an initial population, i.e., generates a random set of points. In the selection stage, the algorithm chooses the ‘best’ individuals in the population to reproduce, i.e., the set of points that result in the lowest function values. In the reproduction stage, the algorithm creates a new population from the ‘best’ individuals. The genetic algorithm requires a large number of function evaluations per iteration, so is best suited for problems with fast function call speed or when the dimension of a problem is high. In Table 5.6, we present the summary results of our comparison of the genetic, pattern search and RAGS algorithms for 135 damper coefficient selection test problems. Table 5.1: Summary results for 135 damping coefficient selection test problems.  GA PS RAGS  Best Time  Total Computation Time (seconds)  Best Fopt  46 22 67  972357.5 4689590.1 864632.6  5 71 59  We can see that the RAGS algorithm has the shortest computation time for 67 of the 135 test problems and the shortest total computation time for all of the 135 test problems. If for optimal function values, we declare a tie between algorithms when |Fopt,GA/P S − Fopt,RAGS | < 10−2 , Fopt,GA/P S then the genetic algorithm tied all 5 of its best solutions with the RAGS algorithm, and the pattern search algorithm tied 40 of its 71 best solutions with the RAGS algorithm. In summary, for 104 of the 135 test problems, the RAGS algorithm had the best or tied for the best optimal solution. Thus, the RAGS algorithm is certainly comparable, if not superior, to the genetic and pattern search algorithms in terms of finding the optimal function value for these problems.  54  Chapter 6  Conclusion and Future Work 6.1  Conclusion  We have presented a new derivative-free algorithm for finite minimax problems that exploits the smooth substructure of the problem. Convergence results are given for any arbitrary approximate gradient that satisfies an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algorithm is presented and shown to have the same convergence results as the regular version. Of the theory presented, the most influential result is Lemma 2.2, which says that our approximate subdifferential is a good approximate of our exact subdifferential. Lemma 2.2 (1) is essential in proving that our stopping conditions are sufficient and Lemma 2.2 (2) is essential in proving that our search direction (approximate direction of steepest descent) is a descent direction. Part 1 can be adapted to the Goldstein approximate subdifferential (see Lemma 3.7) and Part 2 can be adapted to the robust approximate subdifferential. The proof is elegant, relying only on the definitions of subdifferential and approximate subdifferential sets. In Chapter 4, we see that the AGS and RAGS algorithms are flexible as to the approximate gradient used. The condition that an error bound in terms of ∆ is satisfied is a reasonable assumption, as the error bounds from Kelley for the simplex and centered simplex gradients and the error bound we provided for the Gupal estimate of the gradient of the Steklov average function depend on ∆. Through numerical testing, we found that the RAGS algorithm outperforms the AGS algorithm with respect to the accuracy of the solution obtained. The general framework of the RAGS algorithm is not so different from the method of steepest descent. However, by including the ‘almost active’ functions in the robust active set, the RAGS algorithm is shown to avoid the downfall of the method of steepest descent caused by nondifferentiable ridges when applied to nonsmooth functions. Additionally, we tested robust stopping conditions and found that they generally required less func55  6.2. Future Work tion evaluations before termination for the RAGS algorithm. Although the results presented in Section 3.3 do not provide a complete theory for the robust stopping conditions, they give insight into the superior performance of the robust stopping conditions in our numerical results. Of the visuals presented, Figures 3.2(a) and 3.2(b) capture the true essence of the RAGS algorithm, clearly showing that the robust stopping conditions paired with the robust version of the algorithm performed best.  6.2  Future Work  Considerable future work is available in this research direction. Most obvious is a further exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement bounding the step length away from 0 (see Theorems 2.11 and 3.5). In gradient based methods, one common way to avoid this requirement is with the use of Wolfe-like conditions. We are unaware of any derivativefree variant on the Wolfe conditions. The AGS and RAGS algorithms are hybrid algorithms; they combine the elements of several previously proposed algorithm frameworks to form new optimization algorithms. Combining the strengths of algorithms to create novel hybrid algorithms is an unbounded field of future work with great potential. Finally, there is considerable future work in the seismic retrofitting problem, as well as numerous other real-world applications.  56  Bibliography [BC11] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York, NY, 2011. → pages 4, 16 [BHT11] K. Bigdeli, W. Hare, and S. Tesfamariam. Determining optimal non-uniform damping coefficients for adjacent buildings via a nelder-mead method. Proceeding of the 23rd Canadian Congress of Applied Mechanics, 2011. 4 pages. → pages 53 [BHT12] K. Bigdeli, W. Hare, and S. Tesfamariam. Configuration optimization of dampers for adjacent buildings under seismic excitations. To appear in Eng. Opt., 2012. 19 pages. → pages 53 [BJF+ 98] A. J. Booker, J. E. Dennis Jr., P. D. Frank, D. B. Serafini, and V. Torczon. Optimization using surrogate objectives on a helicopter test example. In Computational methods for optimal design and control (Arlington, VA, 1997), volume 24 of Progr. Systems Control Theory, pages 49–58. Birkh¨auser Boston, Boston, MA, 1998. → pages 3 [BK98] D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems, volume 24 of Computational methods for optimal design and control. Birkh¨auser Boston, Boston, MA, 1998. → pages 34 [BKS08] A. M. Bagirov, B. Karas¨ozen, and M. Sezer. Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317–334, 2008. → pages 5, 7, 8 [BLO02] J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Math. Oper. Res., 27(3):567–584, 2002. → pages 5, 23  57  Bibliography  [BLO05] J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim., 15(3):751–779, 2005. → pages 5, 23 [CC78] C. Charalambous and A. R. Conn. An efficient method to solve the minimax problem directly. SIAM J. Numer. Anal., 15(1):162–187, 1978. → pages 23 [CJV08] A. L. Cust´ odio, J. E. Dennis Jr., and L. N. Vicente. Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal., 28(4):770–784, 2008. → pages 34 [Cla90] F. H. Clarke. Optimization and Nonsmooth Analysis. Classics Appl. Math. 5. SIAM, Philadelphia, second edition, 1990. → pages 2, 4, 8 [CSV09] A. Conn, K. Scheinberg, and L. Vicente. Introduction to derivative-free optimization, volume 8 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2009. → pages 2, 3, 4, 5, 35, 53 [CTYZ00] X. Cai, K. Teo, X. Yang, and X. Zhou. Portfolio optimization under a minimax rule. Manage. Sci., 46:957–972, July 2000. → pages 2 [CV07] A. L. Cust´ odio and L. N. Vicente. Using sampling and simplex derivatives in pattern search methods. SIAM J. Optim., 18(2):537–555, 2007. → pages 34 [DM02] E. D. Dolan and J. J. Mor´e. Benchmarking optimization software with performance profiles. Math. Program., 91(2, Ser. A):201– 213, 2002. → pages 50 [DV04] R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Struct. and Multidiscip. Optim., 28:195– 205, 2004. → pages 3 [Gol77] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program., 13(1):14–22, 1977. → pages 31 [Gup77] A. M. Gupal. A method for the minimization of almost differentiable functions. Kibernetika, pages 114–116, 1977. (in Russian). → pages 5 58  Bibliography  [Har10] W. Hare. Using derivative free optimization for constrained parameter selection in a home and community care forecasting model. In Int. Perspectives on Oper. Res. and Health Care, Proc. of the 34th Meeting of the EURO Working Group on Oper. Res. Applied to Health Sciences, 2010. → pages 3 [HM11] W. Hare and M. Macklem. Derivative-free optimization methods for finite minimax problems. To appear in Optim. Methods and Softw., 2011. 16 pages. → pages 3, 5, 7, 34 [HN12] W. Hare and J. Nutini. A derivative-free approximate gradient sampling algorithm for finite minimax problems. Submitted to Comput. Optim. and Appl., February 2012. 33 pages. → pages vii [IOKK04] J. Imae, N. Ohtsuki, Y. Kikuchi, and T. Kobayashi. Minimax control design for nonlinear systems based on genetic programming: Jung’s collective unconscious approach. Intern. J. Syst. Sci., 35:775–785, October 2004. → pages 2 [Kel99a] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a sufficient decrease condition. SIAM J. Optim., 10(1):43–55, 1999. → pages 34 [Kel99b] C. T. Kelley. Iterative Methods for Optimization, volume 18 of Frontiers in Applied Mathematics. SIAM, Philadelphia, 1999. → pages 36, 37 [Kiw10] K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim., 20(4):1983–1994, 2010. → pages 5, 7, 8, 38 [LLS06] G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algorithm for linearly constrained finite minimax problems. SIAM J. Optim., 16(4):1054–1075, 2006. → pages 3, 7 [LV00] L. Lukˇsan and J. Vlˇcek. Test problems for nonsmooth unconstrained and linearly constrained optimization. Technical Report V-798, ICS AS CR, February 2000. → pages 5, 23, 46, 47, 50 [Mad75] K. Madsen. Minimax solution of non-linear equations without calculating derivatives. Math. Programming Stud., pages 110– 126, 1975. → pages 3 59  Bibliography  [MFT08] A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computational framework for derivative-free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Engrg., 197(2124):1890–1905, 2008. → pages 3 [NW99] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York, 1999. → pages 13 [PGL93] G. Di Pillo, L. Grippo, and S. Lucidi. A smooth method for the finite minimax problem. Math. Program., 60(2, Ser. A):187–214, 1993. → pages 3 [Pol87] E. Polak. On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev., 29(1):21–89, 1987. → pages 2 [Pol88] R. A. Polyak. Smooth optimization methods for minimax problems. SIAM J. Control Optim., 26(6):1274–1286, 1988. → pages 3 [PRW03] E. Polak, J. O. Royset, and R. S. Womersley. Algorithms with adaptive smoothing for finite minimax problems. J. Optim. Theory Appl., 119(3):459–484, 2003. → pages 3 [RW98] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. SpringerVerlag, Berlin, 1998. → pages 2, 4, 9, 19 [Sta05] R. Stafford. Random points in an n-dimensional hypersphere, 2005. → pages 12 [Wol75] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. Math. Programming Stud., pages 145–173, 1975. → pages 5 [Wsc04] M. Wschebor. Smoothed analysis of κ(A). 20(1):97–107, 2004. → pages 36  J. Complexity,  [Xu01] S. Xu. Smoothing method for minimax problems. Comput. Optim. Appl., 20(3):267–279, 2001. → pages 3  60  Appendix A  Tables Table 6.1: Test Set Summary: problem name and number, problem dimension (N ), and number of sub-functions (M ); * denotes an absolute value operation (doubled number of sub-functions). Prob. # 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 2.18 2.19 2.20 2.21 2.22 2.23 2.24 2.25  Name  N  M  CB2 WF SPIRAL EVD52 Rosen-Suzuki Polak 6 PCB3 Bard Kow.-Osborne Davidon 2 OET 5 OET 6 GAMMA EXP PBC1 EVD61 Filter Wong 1 Wong 2 Wong 3 Polak 2 Polak 3 Watson Osborne 2  2 2 2 3 4 4 3 3 4 4 4 4 4 5 5 6 9 7 10 20 10 11 20 11  3 3 2 6 4 4 42* 30* 22* 40* 42* 42* 122* 21 60* 102* 82* 5 9 18 2 10 62* 130*  61  Appendix A. Tables  Table 6.2: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the simplex gradient. AGS  RAGS  Regular Stop  Robust Stop  Regular Stop  Robust Stop  Prob.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  2.1  3018  2.082  2855  2.120  2580  9.470  202  6.759  2.2  3136  4.565  3112  4.987  4179  13.211  418  6.343  2.3  3085  0.002  3087  0.002  3090  0.002  3096  0.002  2.4  3254  2.189  3265  2.238  2986  11.559  367  7.570  2.5  3391  1.379  3138  1.351  3576  1.471  539  1.471  2.6  3260  1.236  3341  1.228  4258  1.338  859  1.338  2.7  2949  1.408  2757  1.367  4155  9.939  4190  7.230  2.8  4959  0.879  4492  0.913  3634  9.941  3435  7.655  2.9  2806  0.732  3303  0.581  16000  8.049  13681  3.975  2.10  2978  3.343  2993  3.342  3567  3.459  1924  3.459  2.11  3303  2.554  3453  2.559  35367  6.099  11725  5.063  2.12  2721  1.866  3117  1.871  15052  2.882  8818  2.660  2.13  2580  1.073  2706  0.874  43618  1.952  141  1.679  2.14  3254  1.585  3289  1.086  7713  2.696  4221  1.476  2.15  3917  0.262  5554  0.259  31030  0.286  12796  0.277  2.16  3711  2.182  4500  2.077  20331  3.242  11254  2.178  2.18  10468  0.000  10338  0.000  76355  17.717  30972  17.138  2.19  3397  0.376  3327  0.351  5403  7.105  1767  7.169  2.20  4535  1.624  4271  1.624  8757  8.435  7160  6.073  2.21  8624  2.031  8380  2.157  15225  1.334  11752  1.393  2.22  1563  0.958  1408  1.042  64116  3.049  1256  2.978  2.23  7054  2.557  10392  2.744  6092  6.117  970  6.178  2.24  4570  0.301  7857  0.298  93032  0.447  21204  0.328  2.25  3427  0.339  4197  0.340  98505  0.342  343  0.342  62  Appendix A. Tables  Table 6.3: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the centered simplex gradient. AGS Regular Stop  RAGS Robust Stop  Regular Stop  Robust Stop  Prob.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  2.1  3769  2.054  3573  2.051  2351  9.469  221  7.125  2.2  3705  6.888  1284  5.154  4151  9.589  330  5.594  2.3  5410  0.003  5352  0.003  5332  0.003  5353  0.003  2.4  4059  2.520  4154  2.456  4347  11.578  296  6.834  2.5  3949  1.422  3813  1.437  4112  1.471  452  1.471  2.6  3756  1.302  3880  1.309  4815  1.338  879  1.338  2.7  4227  1.435  4187  1.373  5285  9.950  7164  6.372  2.8  6928  0.988  6933  1.003  4116  9.939  3754  7.775  2.9  3301  0.933  3743  0.949  17944  8.072  13014  2.436  2.10  3447  3.343  3424  3.342  4744  3.459  427  3.459  2.11  3593  2.768  4082  2.785  47362  6.344  11886  5.115  2.12  3321  1.892  3406  1.876  15550  2.843  10726  2.651  2.13  3067  1.355  3508  1.216  36969  1.873  519  1.643  2.14  3967  1.771  6110  1.152  9757  2.692  7284  1.510  2.15  4646  0.272  6014  0.273  23947  0.280  15692  0.277  2.16  4518  2.223  6911  2.074  22225  2.628  17001  2.215  2.18  30492 16.931  14671  16.634  125859 17.804 20815  17.293  2.19  4473  0.551  4484  0.591  8561  7.113  1697  5.851  2.20  5462  1.615  5503  1.599  8908  9.011  7846  6.042  2.21  11629  1.887  11724  1.661  18957  1.304  17067  1.339  2.22  1877  1.166  1604  1.160  1453  3.139  2066  3.644  2.23  3807  2.150  7850  3.586  15625  6.117  1020  6.230  2.24!  7198  0.302  12745  0.301  115787  0.436  61652  0.329  2.25  4749  0.339  4896  0.341  256508  0.342  568  0.342  63  Appendix A. Tables  Table 6.4: Average accuracy for 25 trials obtained by the AGS and RAGS algorithm for the Gupal estimate of the gradient of the Steklov averaged function. AGS  RAGS  Regular Stop  Robust Stop  Regular Stop  Robust Stop  Prob.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  f-evals  Acc.  2.1  2775  2.448  2542  2.124  13126  3.896  89  2.708  2.2  3729  3.267  2221  2.813  5029  15.904  1776  7.228  2.3  2243  0.000  2262  0.000  2276  0.000  2255  0.000  2.4  2985  2.771  2841  2.892  3475  3.449  2362  3.738  2.5  3493  1.213  3529  1.196  3447  1.211  338  1.200  2.6  3144  0.187  3245  0.188  3018  0.162  3059  0.162  2.7  2631  1.368  3129  1.248  2476  1.048  2208  1.047  2.8  2711  1.125  3898  0.893  2231  0.514  5846  0.515  2.9  3102  0.727  3011  0.600  2955  0.937  3248  0.863  2.10  3075  3.241  2927  3.272  3100  0.000  3050  0.000  2.11  2947  1.527  3307  1.528  3003  1.560  2905  1.560  2.12  3095  1.099  7179  0.000  2670  0.788  7803  0.000  2.13  2755  0.710  1485  0.715  2517  0.231  6871  0.227  2.14  2965  0.574  3070  0.427  2860  0.708  4571  0.668  2.15  2658  0.010  2386  0.017  3355  0.050  3210  0.031  2.16  3431  0.457  3256  0.459  2861  0.199  2620  0.119  2.18  3936  16.345  5814  0.000  3950  16.451  6598  4.542  2.19  3337  0.014  3270  0.011  3488  0.970  3376  0.957  2.20  4604  0.835  4434  0.808  9459  1.360  10560  1.359  2.21  5468  0.000  5418  0.000  6632  0.641  6159  0.635  2.22  21  0.000  21  0.000  21  0.000  21  0.000  2.23  5436  1.814  5176  1.877  1.00E+06 2.354  954  2.415  2.24  7426  0.280  171  0.017  7927  0.043  6389  0.283  2.25  4519  0.286  4814  0.300  3760  0.017  3209  0.023  64  

Cite

Citation Scheme:

    

Usage Statistics

Country Views Downloads
United States 23 12
China 8 0
France 1 0
City Views Downloads
Ashburn 17 0
Shanghai 4 0
Beijing 3 0
Mountain View 2 0
Hanford 2 0
Unknown 1 3
University Park 1 0
Seattle 1 0
Hangzhou 1 0

{[{ mDataHeader[type] }]} {[{ month[type] }]} {[{ tData[type] }]}
Download Stats

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0072771/manifest

Comment

Related Items