- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A derivative-free approximate gradient sampling algorithm...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A derivative-free approximate gradient sampling algorithm for finite minimax problems 2012
pdf
Page Metadata
Item Metadata
Title | A derivative-free approximate gradient sampling algorithm for finite minimax problems |
Creator |
Nutini, Julie Ann |
Publisher | University of British Columbia |
Date Created | 2012-04-23 |
Date Issued | 2012-04-23 |
Date | 2012 |
Description | Mathematical optimization is the process of minimizing (or maximizing) a function. An algorithm is used to optimize a function when the minimum cannot be found by hand, or finding the minimum by hand is inefficient. The minimum of a function is a critical point and corresponds to a gradient (derivative) of 0. Thus, optimization algorithms commonly require gradient calculations. When gradient information of the objective function is unavailable, unreliable or ‘expensive’ in terms of computation time, a derivative-free optimization algorithm is ideal. As the name suggests, derivative-free optimization algorithms do not require gradient calculations. In this thesis, we present a derivative-free optimization algorithm for finite minimax problems. Structurally, a finite minimax problem minimizes the maximum taken over a finite set of functions. We focus on the finite minimax problem due to its frequent appearance in real-world applications. We present convergence results for a regular and a robust version of our algorithm, showing in both cases that either the function is unbounded below (the minimum is −∞) or we have found a critical point. Theoretical results are explored for stopping conditions. Additionally, theoretical and numerical results are presented for three examples of approximate gradients that can be used in our algorithm: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithm, the three approximate gradients, and the regular and robust stopping conditions. Finally, an application in seismic retrofitting is discussed. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | Eng |
Collection |
Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2012-04-23 |
DOI | 10.14288/1.0072771 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-05 |
Campus |
UBCO |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/42200 |
Aggregated Source Repository | DSpace |
Digital Resource Original Record | https://open.library.ubc.ca/collections/24/items/1.0072771/source |
Download
- Media
- ubc_2012_spring_nutini_julie.pdf
- ubc_2012_spring_nutini_julie.pdf [ 722.2kB ]
- Metadata
- JSON: 1.0072771.json
- JSON-LD: 1.0072771+ld.json
- RDF/XML (Pretty): 1.0072771.xml
- RDF/JSON: 1.0072771+rdf.json
- Turtle: 1.0072771+rdf-turtle.txt
- N-Triples: 1.0072771+rdf-ntriples.txt
- Citation
- 1.0072771.ris
Full Text
A Derivative-Free Approximate Gradient Sampling Algorithm for Finite Minimax Problems by Julie Ann Nutini B.Sc. Hons., The University of British Columbia, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The College of Graduate Studies (Mathematics) THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan) April 2012 c© Julie Ann Nutini 2012 Abstract Mathematical optimization is the process of minimizing (or maximizing) a function. An algorithm is used to optimize a function when the minimum cannot be found by hand, or finding the minimum by hand is inefficient. The minimum of a function is a critical point and corresponds to a gradient (derivative) of 0. Thus, optimization algorithms commonly require gradient calculations. When gradient information of the objective function is unavail- able, unreliable or ‘expensive’ in terms of computation time, a derivative-free optimization algorithm is ideal. As the name suggests, derivative-free op- timization algorithms do not require gradient calculations. In this thesis, we present a derivative-free optimization algorithm for finite minimax prob- lems. Structurally, a finite minimax problem minimizes the maximum taken over a finite set of functions. We focus on the finite minimax problem due to its frequent appearance in real-world applications. We present convergence results for a regular and a robust version of our algorithm, showing in both cases that either the function is unbounded below (the minimum is −∞) or we have found a critical point. Theoretical results are explored for stopping conditions. Additionally, theoretical and numerical results are presented for three examples of approximate gradients that can be used in our algorithm: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithm, the three approximate gradients, and the regular and robust stopping conditions. Finally, an ap- plication in seismic retrofitting is discussed. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Derivative-Free Optimization . . . . . . . . . . . . . . . . . . 3 1.4 DFO and Finite Max Functions . . . . . . . . . . . . . . . . 3 1.5 Method of Steepest Descent . . . . . . . . . . . . . . . . . . 6 1.6 Basic Definitions and Results . . . . . . . . . . . . . . . . . . 8 2 Approximate Gradient Sampling Algorithm . . . . . . . . . 10 2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Robust Approximate Gradient Sampling Algorithm . . . . 23 3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 Descent Direction: dY . . . . . . . . . . . . . . . . . . 29 3.3 Robust Stopping Conditions . . . . . . . . . . . . . . . . . . 30 4 Approximate Gradients . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Simplex Gradient . . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 34 iii Table of Contents 4.1.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 35 4.1.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Centered Simplex Gradient . . . . . . . . . . . . . . . . . . . 36 4.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Gupal Estimate of the Gradient of the Steklov Averaged Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 39 4.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 44 5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1 Versions of the AGS Algorithm . . . . . . . . . . . . . . . . . 45 5.2 Test Sets and Software . . . . . . . . . . . . . . . . . . . . . 46 5.3 Initialization and Stopping Conditions . . . . . . . . . . . . . 47 5.4 Computing a Descent Direction . . . . . . . . . . . . . . . . 47 5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.6 An Application in Seismic Retrofitting . . . . . . . . . . . . 53 6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 55 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Appendix A Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 61 iv List of Tables 2.1 Glossary of Algorithm Notation . . . . . . . . . . . . . . . . . 10 5.1 Damper Coefficient Selection Results . . . . . . . . . . . . . . 54 6.1 Summary of Test Set Problems . . . . . . . . . . . . . . . . . 61 6.2 Numerical Results: Simplex Gradient . . . . . . . . . . . . . . 62 6.3 Numerical Results: Centered Simplex Gradient . . . . . . . . 63 6.4 Numerical Results: Gupal Estimate . . . . . . . . . . . . . . . 64 v List of Figures 1.1 A nonsmooth function and its gradients . . . . . . . . . . . . 1 1.2 Method of Steepest Descent . . . . . . . . . . . . . . . . . . . 6 2.1 An Armijo-like line search . . . . . . . . . . . . . . . . . . . . 13 3.1 Surface plot of Problem 2.1 CB2 . . . . . . . . . . . . . . . . 24 3.2 Iterations for Problem 2.1 CB2 . . . . . . . . . . . . . . . . . 24 5.1 Performance profiles: 12 versions of AGS/RAGS algorithm . . 51 vi Acknowledgements First, and foremost, I would like thank my supervisor, Dr. Warren Hare, whose guidance, patience and encouragement have been unfailing. I am incredibly grateful for the numerous occasions when his confidence in my ability far exceeded my own, and feel truly fortunate to have worked under his supervision. Without him, this thesis would not have come to its fruition, my passport would still be without stamps and the idea of ‘proof by fairy dust’ would still be a logical option. I would like to thank my committee members, Dr. Heinz Bauschke, Dr. Shawn Wang and Dr. Erik Rosolowsky, for their time and helpful suggestions throughout this challenging, yet rewarding journey. I would like to express my sincere gratitude to Dr. Claudia Sagastizábal for our conversations in Chile, especially with respect to the Goldstein ap- proximate subdifferential. I would like to thank Kasra Bigdeli and Dr. Solomon Tesfamariam for taking an interest in my algorithm and for furthering my passion for algo- rithm design. I would like to thank Dr. Rebecca Tyson and Dr. Blair Spearman for encouraging me to pursue studies in optimization, and Dr. Yves Lucet for showing an invested interest in my research through his helpful, thought provoking questions. I would like to thank UBC Okanagan and the many donors who have provided me with incredible financial support over the years. Last, but not least, I would like to thank my family and friends for their understanding, faith and endless love throughout my studies. Many of the results in this thesis are currently in submission to Compu- tational Optimization and Applications (see [HN12]). vii To my dad, who shares my love of mathematics, and to my mom, who is my constant reminder...va piano. viii Chapter 1 Introduction Mathematical optimization is the process of minimizing (or maximizing) a function. We use an optimization algorithm when either the minimum of a function cannot be found by hand, or finding the minimum by hand is inefficient. In the first year calculus classroom, optimization algorithms are simple: take the derivative of a function, set the derivative equal to 0 and solve for x. For the functions explored in first year calculus, this process generally works fine. However, many real-world problems result in nonsmooth functions. For example, we consider the following nonsmooth finite max function: f(x, y) = 10|x|+ y2 = max{10x+ y2,−10x+ y2}. (a) Graph of f(x, y). (b) Selected gradients of f(x, y). Figure 1.1: Graph of f(x, y) and its gradients. We can see in Figure 1.1(a) that f(x, y) has a ridge along the line x = 0 where the function is not differentiable. Figure 1.1(b) shows the pattern of gradients for f(x, y) as if looking down onto the bottom ridge. Notice 1 1.1. Notations that for any point far from the ridge, the negative of the gradient points roughly towards the minimum. However, for any point near the ridge, the negative of the gradient points roughly perpendicular to the ridge, which is not necessarily the direction of the minimum. In this thesis we develop a novel optimization algorithm specifically de- signed for functions with this finite max structure. 1.1 Notations In this thesis, we use the following notations: 1. | · | denotes the Euclidean norm and ‖ · ‖ denotes the corresponding matrix norm. 2. As defined in [Cla90], the function f : Rn → R is locally Lipschitz at a point x ∈ Rn if there exists a scalar L and δ > 0 such that |f(y1)− f(y2)| ≤ L|y1 − y2| for all y1 and y2 in the open ball of radius δ about x. 3. If f1 and f2 are continuous, then max{f1, f2} is continuous. If f1 and f2 are Lipschitz, then max{f1, f2} is Lipschitz [RW98, Prop 9.10]. However, we note that even if f1 and f2 are differentiable, this does not imply that max{f1, f2} is differentiable. 1.2 The Problem We consider the finite minimax problem min x f(x) where f(x) = max{fi(x) : i = 1, . . . , N}, where each individual fi is continuously differentiable. Finite minimax problems occur in numerous applications, such as port- folio optimization [CTYZ00], control system design [IOKK04], engineering design [Pol87], and determining the cosine measure of a positive spanning set [CSV09, Def 2.7]. In a finite max function, although each individ- ual fi may be smooth, taking the maximum forms a nonsmooth function with ‘nondifferentiable ridges’. For this reason, most algorithms designed to solve finite minimax problems employ some form of smoothing technique; 2 1.3. Derivative-Free Optimization [PGL93], [PRW03], [Pol88], and [Xu01] (among many others). In general, these smoothing techniques require gradient calculations. However, in many applications gradient information is not available or can be difficult to compute accurately (see [BJF+98], [DV04], [Har10], [MFT08] and [CSV09, Chpt 1] for some examples of such situations). Thus, for the purpose of this thesis, we further restrict ourselves to the field of derivative-free optimization, where we are only permitted to compute func- tion values, i.e., we cannot compute gradient values ∇fi directly. 1.3 Derivative-Free Optimization The research area of derivative-free optimization (DFO) has blossomed in recent years. As previously stated, DFO algorithms are useful in situations when gradient information is not available, difficult to compute accurately or ‘expensive’ to compute in relation to computation time. For example, if a function is given by a simulation, then gradient information may not be available. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [CSV09]. As there are no gradient calculations required in DFO algorithms, it is assumed that function evaluations are the most ‘expensive’ computations in terms of CPU time. Thus, the performance of a DFO algorithm is based on the number of function evaluations required to solve the problem. 1.4 DFO and Finite Max Functions In relation to our problem, research on optimizing finite max functions without calculating derivatives can be seen as early as 1975 [Mad75], while more recently we have seen a resurface in this area, [LLS06] and [HM11]. In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for finite minimax problems [LLS06]. This method is shown to globally converge towards a standard stationary point of the original finite minimax problem. We want to take a step away from the prevalent smoothing techniques used to solve finite minimax problems. Instead of altering the function with a smoothing technique, we look to solve the finite minimax problem with a DFO method that exploits its smooth substructure. To understand how we do this, we first need to define the following terms. 3 1.4. DFO and Finite Max Functions Definition 1.1. Let f : Rn → R be locally Lipschitz at a point x̄. Then by Rademacher’s Theorem ([RW98, Thm 9.60]), f is differentiable almost everywhere on Rn. Let Ωf be the set of points at which f fails to be differentiable. Then the Clarke subdifferential, as defined in [Cla90], is given by the set ∂f(x̄) = conv(lim j ∇f(yj) : yj → x̄, yj /∈ Ωf ). (1.1) Basically, the subdifferential of a function at x̄ is the set of all possible gradients near x̄. As an example, we consider the absolute value function, f(x) = |x| in R. Clearly, f(x) is not differentiable at 0. The subdifferential of f at the point 0 is the set ∂f(0) = conv(1,−1), as f ′(x) = { 1 if x > 0 −1 if x < 0 . Definition 1.2. A descent direction of a continuously differentiable func- tion f at a given point x̄ ∈ dom(f) as defined in [CSV09] is any vector d such that 〈d, v〉 < 0 for all v ∈ ∂f(x̄). To explain how subdifferentials can be used to find a descent direction, we first define the projection. Definition 1.3 (Definition 3.7, [BC11]). Let C ⊆ Rn be a nonempty closed convex set and let x ∈ Rn. Let p ∈ C. Then p is the unique projection of x onto C, denoted by Proj(x|C), if p ∈ arg min y {|y − x| : y ∈ C}. A point x̄ is a (Clarke) stationary point of a function f when 0 ∈ ∂f(x̄). Thus, we define the direction of steepest descent as follows. Definition 1.4. The direction of steepest descent as defined in [CSV09] is given by d = −Proj(0|∂f(x̄)). 4 1.4. DFO and Finite Max Functions In 2011, Hare and Macklem presented a derivative-free method that ex- ploits the smooth substructure of the finite minimax problem. It combines the frameworks of a directional direct search method [CSV09, Chpt 7] and the gradient sampling algorithm (GS algorithm) presented in [BLO02] and [BLO05]. Loosely speaking, the GS algorithm uses a collection of local gra- dients to build a ‘robust subdifferential’ of the objective function and uses this to determine a ‘robust descent direction’. In [HM11], these ideas are used to develop several methods to find an approximate descent direction that moves close to parallel to an ‘active manifold’. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdif- ferential, which is then used to determine a likely descent direction. Ideas from the GS algorithm have appeared in two other recent DFO methodologies: [BKS08] and [Kiw10]. In 2008, Bagirov, Karasözen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [BKS08]. Described as a derivative-free version of the bundle method pre- sented in [Wol75], the method uses discrete gradients to approximate sub- gradients of the function and build an approximate subdifferential. The analysis of this method provides proof of convergence to a Clarke station- ary point for an extensive class of nonsmooth problems. In this thesis, we focus on the finite minimax problem. This allows us to require few (other) assumptions on our function while maintaining strong convergence analysis. It is worth noting that we use the same set of test problems as in [BKS08]. Specifically, we use the [LV00] test set and exclude one problem as its sub- functions are complex-valued. (The numerics in [BKS08] exclude the same problem, and several others, without explanation.) Using approximate gradient calculations instead of gradient calculations, the GS algorithm is made derivative free by Kiwiel in [Kiw10]. Specifically, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [Gup77] or Section 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same convergence results as the GS algorithm – it either drives the f - values to −∞ or each cluster point is found to be Clarke stationary [Kiw10, Theorem 3.8]. No numerical results are presented for Kiwiel’s derivative-free algorithm. 5 1.5. Method of Steepest Descent 1.5 Method of Steepest Descent To obtain a general understanding of the framework of the algorithm presented in this thesis, we recall the classical method of steepest descent. 1. Search Direction: Set dk = −Proj(0|∂f(xk)). 2. Step Length: Find tk by solving min tk>0 {f(xk + tkdk)}. 3. Update: Set xk+1 = xk + tkd k. Increase k = k + 1. Loop. Figure 1.2: An example of the method of steepest descent. In each iteration, the method of steepest descent calculates the direction of steepest descent, i.e., −Proj(0|∂f(xk)), does a line search in this direction and then updates the iterate. This allows the algorithm to move perpendic- ular to the contours of the function, towards the minimum. The method of steepest descent works well for smooth functions. However, for nonsmooth functions, the optimal solution often occurs at a point of nondifferentiabil- ity, which results in a very slow convergence rate for the method of steepest descent. In general, our algorithm takes the method of steepest descent and makes it derivative free; instead of calculating the subdifferential, we calculate an approximate subdifferential; instead of finding the direction of steepest de- scent, we project 0 onto our approximate subdifferential and find an approx- imate direction of steepest descent. With these alterations, our algorithm is able to navigate along nondifferentiable ridges, more rapidly minimizing the function. Specifically, we use the GS algorithm framework to form a derivative- free approximate gradient sampling algorithm. As we are dealing with finite max functions, instead of calculating an approximate gradient of f at each 6 1.5. Method of Steepest Descent of the sampled points, we calculate an approximate gradient of each of the active functions at the current iterate. We say a function fi is active if its index is an element of the active set. Definition 1.5. We define the active set of f at x̄ to be the set of indices A(x̄) = {i : f(x̄) = fi(x̄)}. (1.2) We denote the set of active gradients of f at x̄ by {∇fi(x̄)}i∈A(x̄). Expanding the active set to include ‘almost’ active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of non- differentiability, the size and shape of our approximate subdifferential will reflect the presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdifferential, the descent direction will direct min- imization parallel to a ‘nondifferentiable ridge’, rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly influence the performance of our algorithm. The algorithm presented in this thesis differs from those mentioned above in a few key manners. Unlike in [LLS06] we do not employ a smoothing tech- nique. Unlike in [HM11], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent frame- work. Using this framework, we are able to analyze convergence directly and develop stopping analysis based on our stopping conditions for the al- gorithm. Unlike in [BKS08] and [Kiw10], where convergence is proven for a specific approximate gradient, we prove convergence for any approximate gradient that satisfies a simple error bound dependent on the sampling ra- dius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged func- tion. (As a side-note, Section 4.3 also provides, to the best our knowledge, a novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.) Furthermore, unlike in [Kiw10], where convergence to a stationary point is proved with probability 1, we prove convergence to a critical point for every cluster point of a sequence generated by our algo- rithm. Focusing on the finite minimax problem provides us with an advantage over the methods of [BKS08] and [Kiw10]. In particular, we only require or- der n function calls per iteration (where n is the dimension of the problem), 7 1.6. Basic Definitions and Results while both [BKS08] and [Kiw10] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approx- imate subdifferential). (The original GS algorithm suggests that m ≈ 2n provides a good value for m.) 1.6 Basic Definitions and Results We denote by C1 the class of differentiable functions whose gradient mapping ∇ is continuous. As stated previously, we assume that each of the individual fi in the finite max function is continuously differentiable, i.e., fi ∈ C1. We denote by C1+ the class of continuously differentiable functions whose gradient mapping ∇ is locally Lipschitz. We say f ∈ C1+ with constant L if its gradient mapping ∇ has a Lipschitz constant L. We denote by C2+ the class of twice continuously differentiable functions whose gradient mapping ∇2 is locally Lipschitz. We say f ∈ C2+ with constant L if its gradient mapping ∇2 has a Lipschitz constant L. The following result reveals that the subdifferential for a finite max func- tion at a point x̄ is equal to the convex hull of the active gradients of f at x̄. Proposition 1.6. Let f = max{fi : i = 1, . . . , N}. If fi ∈ C1 for each i ∈ A(x̄), then ∂f(x̄) = conv(∇fi(x̄))i∈A(x̄). Proof. See Proposition 2.3.12, [Cla90]. It is important to note that the subdifferential as defined in Proposition 1.6 is a compact set. We formally state and prove this result in the following corollary. Corollary 1.7. Let f = max{fi : i = 1, . . . , N} where fi ∈ C1 for each i ∈ A(x̄). The subdifferential of f at x̄ is a compact set. Proof. By Proposition 1.6, ∂f(x̄) = conv(∇fi(x̄))i∈A(x̄). As there are a finite number of functions, A(x̄) is a finite set, which implies {∇fi(x̄)} is a finite set of points. The convex hull of a finite set of points is closed and bounded. Hence, ∂f(x̄) is a compact set. For our proof of convergence in Section 2.2, we require that ∂f is outer semicontinuous. First, we define the limit superior for a mapping. 8 1.6. Basic Definitions and Results Definition 1.8. For a mapping S : Rn → Rn, we define the limit superior as lim sup x→x̄ S(x) = {y : there exists xk → x̄ and yk → y with yk ∈ S(xk)}. Definition 1.9. As defined in [RW98, Def 5.4], S is outer semicontinuous (osc) at x̄ if lim sup x→x̄ S(x) = S(x̄). To see that the subdifferential is outer semicontinuous, we have the fol- lowing proposition from [RW98]. Proposition 1.10. For a continuous function f : Rn → R, the mapping ∂f is outer semicontinuous everywhere. Proof. See Proposition 8.7, [RW98]. 9 Chapter 2 Approximate Gradient Sampling Algorithm In Chapter 2, we state the approximate gradient sampling algorithm (AGS algorithm), present the details as to when and why the AGS algo- rithm necessarily finds function value decrease using an arbitrary approx- imate gradient of each of the active fi, denoted by ∇Afi, and then show that the AGS algorithm converges to a local minimum. We note that no specific approximate gradient is discussed in this section. Furthermore, we emphasize that the error bound requirement used in the convergence results (Section 2.2) for the arbitrary approximate gradient is achievable. Exam- ples of approximate gradients, their structures and their error bounds are discussed in Section 4. 2.1 Algorithm We first provide a partial glossary of the notation used in the statement of the AGS algorithm. Table 2.1: Glossary of notation used in AGS algorithm. Glossary of Notation k: Iteration counter xk : Current iterate µk: Accuracy measure ∆k: Sampling radius m: Sample size θ: Sampling radius reduction factor yj : Sampling points Y : Sampled set of points η: Armijo-like parameter dk: Search direction tk: Step length tmin: Minimum step length ∇Afi: Approximate gradient of fi A(xk): Active set at xk Gk: Approximate subdifferential εtol: Stopping tolerance 10 2.1. Algorithm We state the theoretical AGS algorithm and then provide a detailed description of the algorithm. Conceptual Algorithm: [AGS Algorithm] 0. Initialize: Set k = 0 and input x0 - starting point µ0 > 0 - accuracy measure ∆0 > 0 - initial search radius θ ∈ (0, 1] - search radius reduction factor 0 < η < 1 - Armijo-like parameter tmin - minimum step length εtol > 0 - stopping tolerance 1. Generate Approximate Subdifferential Gk: Generate a set Y = [xk, y1, . . . ym] around the current iterate xk such that max j=1,...,m |yj − xk| ≤ ∆k. Use Y to calculate the approximate gradient of fi, denoted ∇Afi, at xk for each i ∈ A(xk). Set Gk = conv(∇Afi(xk))i∈A(xk). 2. Generate Search Direction: Let dk = −Proj(0|Gk). Check if ∆k ≤ µk|dk|. (2.1) If (2.1) does not hold, then set xk+1 = xk, set ∆k+1 = { θµk|dk| if |dk| 6= 0 θ∆k if |dk| = 0 , (2.2) and go to Step 4. If (2.1) holds and |dk| < εtol, then STOP. Else, continue to the line search. 11 2.1. Algorithm 3. Line Search: Attempt to find tk > 0 such that f(xk + tkd k) < f(xk)− ηtk|dk|2. Line Search Failure: Set µk = µk 2 , xk+1 = xk and go to Step 4. Line Search Success: Let xk+1 be any point such that f(xk+1) ≤ f(xk + tkdk). 4. Update and Loop: Set ∆k+1 = max j=1,...,m |yj − xk|, k = k + 1 and return to Step 1. In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point x0, and initialize the parameter values. In Step 1, we create the approximate subdifferential. First, we select a set of points around xk within a search radius of ∆k. In implementa- tion, the points are randomly and uniformly sampled from the interior of an n-dimensional hypersphere of radius ∆k centered at the origin (using the MATLAB randsphere.m function [Sta05]). Using this set Y , we then cal- culate an approximate gradient of each of the active functions at xk and set the approximate subdifferential Gk equal to the convex hull of these active approximate gradients. In Step 2, we generate a search direction by solving the projection of 0 onto the approximate subdifferential: Proj(0|Gk) ∈ arg ming∈Gk |g|2. The search direction dk is set equal to the negative of the solution, i.e., dk = −Proj(0|Gk). After finding a search direction, we check the condition ∆k ≤ µk|dk| (equation (2.1)). This condition determines if the current search radius is sufficiently small relative to the distance from 0 to the approximate subd- ifferential. If equation (2.1) holds and |dk| < εtol, then we terminate the algorithm, as 0 is within εtol of the approximate subdifferential and the search radius is small enough to reason that the approximate subdifferential is accurate. If equation (2.1) does not hold, then the approximate subdif- ferential is not sufficiently accurate to warrant a line search, so we decrease 12 2.1. Algorithm the search radius according to equation (2.2) and loop (Step 4). If equation (2.1) holds, but |dk| ≥ εtol, then we proceed to a line search. In Step 3, we carry out a line search. We attempt to find a step length tk > 0 such that the Armijo-like condition holds f(xk + tkd k) < f(xk)− ηtk|dk|2. (2.3) The following figures illustrate the Armijo(-like) line search for the curve f(x) = 14x 2. Figure 2.1(a) is representative of a gradient based method (Armijo line search), while Figure 2.1(b) is representative of a DFO method (Armijo-like line search). (a) A gradient based line search. (b) A derivative-free line search. Figure 2.1: A graphical depiction of the Armijo and Armijo-like condition. To satisfy the condition in equation (2.3), the algorithm must find a point on the curve that lies on or below the Armijo descent line for both methods. Any point found below the Armijo descent line will result in a successful line search. Thus, the Armijo(-like) condition prevents the algorithm from taking multiple steps with minimal function value decrease. In other words, it ensures sufficient function value decrease is found when a line search success is declared. Comparing Figures 2.1(a) and 2.1(b), we can see that the DFO Armijo descent line does not allow the algorithm to accept a point x ≤ 0.5. This prevents the algorithm from jumping back and forth across the y-axis, thus, resulting in a faster convergence rate. In implementation, we use a back-tracking line search (described in [NW99]) with an initial step-length of tini = 1. Basically, we test to see if equation (2.3) holds for tk = tini. If it holds, then we declare a line search 13 2.2. Convergence success. If it does not hold, we reduce tk by a factor of 0.5 and we test equa- tion (2.3) again. If we have tk < tmin without declaring a line search success, then we declare a line search failure. If we find a tk such that equation (2.3) holds, then we declare a line search success. If a line search success occurs, then we let xk+1 be any point such that f(xk+1) ≤ f(xk + tkdk), (2.4) thus, forming a non-increasing sequence of function values {f(xk)}k=0. In implementation, we do this by searching through the function values used in the calculation of our approximate gradients ({f(yi)}yi∈Y ). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of finding further function value decrease without having to carry out additional function evaluations. We find the minimum function value in our set of evaluations and if equation (2.4) holds for this minimum value, then we set xk+1 equal to the corresponding input point. Otherwise, we set xk+1 = xk + tkd k. If a line search failure occurs, then we reduce the accuracy measure µk by a factor of 0.5 and set xk+1 = xk. (Notice that as µk is decreased each time a line search failure occurs, {µk}k=0 is a non-increasing sequence.) Finally, in Step 4, we set ∆k+1 = maxj=1,...,m |yj−xk|, update the iterate counter and loop to Step 1 to resample. 2.2 Convergence In order to prove convergence for the AGS algorithm, we must show that the direction dk is a descent direction, that the algorithm is well-defined (eventually finds function value decrease) and that the stopping conditions are sufficient, i.e., when the algorithm satisfies the stopping conditions, the distance to a critical point is controlled by εtol. Remark 2.1. For the following results, we denote the approximate subdif- ferential of f at x̄ as G(x̄) = conv(∇Afi(x̄))i∈A(x̄), where ∇Afi(x̄) is the approximate gradient of fi at x̄. Our first result shows that the approximate subdifferential generated by the AGS algorithm is a good approximate of the exact subdifferential. We use part 1 of the following lemma to establish that our stopping conditions are sufficient. We use part 2 of the following lemma to establish that dk is a descent direction. 14 2.2. Convergence Lemma 2.2. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose there exists an ε > 0 such that |∇Afi(x̄)−∇fi(x̄)| ≤ ε for all i. Then 1. for all w ∈ G(x̄), there exists a v ∈ ∂f(x̄) such that |w − v| ≤ ε, and 2. for all v ∈ ∂f(x̄), there exists a w ∈ G(x̄) such that |w − v| ≤ ε. Proof. 1. By definition, for all w ∈ G(x̄) there exists a set of αi such that w = ∑ i∈A(x̄) αi∇Afi(x̄), where αi ≥ 0, ∑ i∈A(x̄) αi = 1. By Proposition 1.6, as each fi ∈ C1 we have ∂f(x̄) = conv(∇fi(x̄))i∈A(x̄). Using the same αi as above, we see that v = ∑ i∈A(x̄) αi∇fi(x̄) ∈ ∂f(x̄). Then |w − v| = | ∑ i∈A(x̄) αi∇Afi(x̄)− ∑ i∈A(x̄) αi∇fi(x̄)| ≤ ∑ i∈A(x̄) αi|∇Afi(x̄)−∇fi(x̄)| ≤ ∑ i∈A(x̄) αiε (by assumption) = ε (as ∑ i∈A(x̄) αi = 1). Hence, for all w ∈ G(x̄), there exists a v ∈ ∂f(x̄) such that |w − v| ≤ ε. (2.5) 2. We have ∂f(x̄) = conv(∇fi(x̄))i∈A(x̄). So for all v ∈ ∂f(x̄), there exist αi such that v = ∑ i∈A(x̄) αi∇fi(x̄) where αi ≥ 0, ∑ i∈A(x̄) αi = 1. Using the same αi as above, we see that w = ∑ i∈A(x̄) αi∇Afi(x̄) ∈ G(x̄). Using the same argument as in part 1, we can conclude that for all v ∈ ∂f(x̄), there exists a w ∈ G(x̄) such that |w − v| ≤ ε. 15 2.2. Convergence Our next goal (in Theorem 2.6) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this, we first state the Projection Theorem. Theorem 2.3. (The Projection Theorem) Let C ⊆ Rn be a nonempty closed convex set. Then for every x and p in Rn p = Proj(x|C) ⇐⇒ p ∈ C and 〈x− p, y − p〉 ≤ 0 for all y ∈ C. Proof. See Theorem 3.14, [BC11]. We use the Projection Theorem to prove d = −Proj(0|G(x̄)) is a descent direction. Lemma 2.4. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose there exists an ε > 0 such that |∇Afi(x̄) − ∇fi(x̄)| ≤ ε for all i. Define d = −Proj(0|G(x̄)) and suppose |d| 6= 0. Let β ∈ (0, 1). If ε < (1 − β)|d|, then for all v ∈ ∂f(x̄) we have 〈d, v〉 < −β|d|2. Proof. Notice that, by Theorem 2.3, d = −Proj(0|G(x̄)) implies that 〈0− (−d), w − (−d)〉 ≤ 0 for all w ∈ G(x̄). Hence, 〈d,w + d〉 ≤ 0 for all w ∈ G(x̄). (2.6) So we have for all v ∈ ∂f(x̄) 〈d, v〉 = 〈d, v − w + w − d+ d〉 for all w ∈ G(x̄) = 〈d, v − w〉+ 〈d,w + d〉+ 〈d,−d〉 for all w ∈ G(x̄) ≤ 〈d, v − w〉 − |d|2 for all w ∈ G(x̄) (by (2.6)) ≤ |d||v − w| − |d|2 for all w ∈ G(x̄), which follows by using Cauchy Schwarz. For any v ∈ ∂f(x̄), using w as constructed in Lemma 2.2(2), we see that 〈d, v〉 ≤ |d|ε− |d|2 < |d|2(1− β)− |d|2 (as ε < (1− β)|d|) = −β|d|2. 16 2.2. Convergence Using Lemma 2.4, we can easily show that d is a descent direction for f at x̄ by setting β = 0 and bounding ε above by |d|. We notice that this condition on ε requires the algorithm to reach a point where the error between the exact gradient and the approximate gradient is smaller than the distance from 0 to our approximate subdifferential. Corollary 2.5. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose there exists an ε > 0 such that |∇Afi(x̄) − ∇fi(x̄)| ≤ ε for all i. Define d = −Proj(0|G(x̄)) and suppose |d| 6= 0. If ε < |d|, then 〈d, v〉 < 0 for all v ∈ ∂f(x̄). To guarantee convergence, we must show that, except in the case of 0 ∈ ∂f(xk), the algorithm will always be able to find a search radius that satisfies the requirements in Step 2. In Section 4, we show that (for three different approximate gradients) the value ε (in Lemma 2.4) is linked to the search radius ∆. As unsuccessful line searches will drive ∆ to 0, this implies that eventually the requirements of Lemma 2.4 will be satisfied. We formalize this in the next two theorems. Theorem 2.6. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Sup- pose 0 6∈ ∂f(xk) for each iteration k. Suppose there exists K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies |∇Afi(xk) − ∇fi(xk)| ≤ K̄∆k for all i. Let dk = −Proj(0|G(xk)). Then for any µ > 0, there exists ∆̄ > 0 such that, ∆ ≤ µ|dk|+ K̄µ(∆k −∆) for all 0 < ∆ < ∆̄, Moreover, if ∆k < ∆̄, then the following inequality holds ∆k ≤ µ|dk|. Proof. Let v̄ = Proj(0|∂f(xk)) (by assumption, v̄ 6= 0). Given µ > 0, let ∆̄ = 1 K̄ + 1µ |v̄|, (2.7) and consider 0 < ∆ < ∆̄. Now create G(xk) and dk = −Proj(0|G(xk)). As −dk ∈ G(xk), by Lemma 2.2(1), there exists a vk ∈ ∂f(xk) such that | − dk − vk| ≤ K̄∆k. 17 2.2. Convergence Then K̄∆k ≥ | − dk − vk| ⇒ K̄∆k ≥ |vk| − |dk| ⇒ K̄∆k ≥ |v̄| − |dk| (as |v| ≥ |v̄| for all v ∈ ∂f(xk) ). Thus, for 0 < ∆ < ∆̄, we apply equation (2.7) to |v̄| in the above inequality to get K̄∆k ≥ (K̄ + 1 µ )∆− |dk|, which rearranges to ∆ ≤ µ(|dk|+ K̄µ(∆k −∆). Hence, ∆ ≤ µ|dk|+ K̄µ(∆k −∆) for all 0 < ∆ < ∆̄. Finally, if ∆k < ∆̄, then ∆k ≤ µ|dk|. Remark 2.7. In Theorem 2.6, it is important to note that eventually the con- dition ∆k < ∆̄ will hold. Examine ∆̄ as constructed above: K̄ is a constant and v̄ is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless the condition ∆k ≤ µk|dk| is satisfied. As a result, if ∆k ≥ ∆̄, the AGS algorithm will reduce ∆k, with ∆̄ remaining constant, until ∆k < ∆̄. Recall in Step 3 of the AGS algorithm, for a given η ∈ (0, 1), we attempt to find a step length tk > 0 such that f(xk + tkd k) < f(xk)− ηtk|dk|2 The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall Corollary 1.7, which states that the exact subdifferential for a finite max function is a compact set. Thus, we know that in the following theorem ṽ is well-defined. Theorem 2.8. Fix 0 < η < 1. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let ṽ ∈ arg max{〈d, v〉 : v ∈ ∂f(x̄)}. Suppose there exists an ε > 0 such that |∇Afi(x̄) −∇fi(x̄)| ≤ ε for all i. Define d = −Proj(0|G(x̄)) and suppose |d| 6= 0. Let β = 2η1+η . If ε < (1− β)|d|, then there exists t̄ > 0 such that f(x̄+ td)− f(x̄) < −ηt|d|2 for all 0 < t < t̄. 18 2.2. Convergence Proof. Note that β ∈ (0, 1). Recall, from Lemma 2.4, we have for all v ∈ ∂f(x̄) 〈d, v〉 < −β|d|2. (2.8) Using β = 2η1+η , equation (2.8) becomes 〈d, v〉 < − 2η1+η |d|2 for all v ∈ ∂f(x̄). (2.9) From equation (2.8) we can conclude that for all v ∈ ∂f(x̄) 〈d, v〉 < 0. Using Theorem 8.30 in [RW98], we have lim τ↘0 f(x̄+ τd)− f(x̄) τ = max{〈d, v〉 : v ∈ ∂f(x̄)} = 〈d, ṽ〉 < 0. Therefore, as η+12 < 1 (since η < 1) there exists t̄ > 0 such that f(x̄+ td)− f(x̄) t < η + 1 2 〈d, ṽ〉 for all 0 < t < t̄. For such a t, we have f(x̄+ td)− f(x̄) < η + 1 2 t〈d, ṽ〉 < −η + 1 2 2η η + 1 t|d|2 (by (2.9)) < −ηt|d|2. Hence, f(x̄+ td)− f(x̄) < −ηt|d|2 for all 0 < t < t̄. Combining the previous results, we show that the AGS algorithm is guar- anteed to find function value decrease (provided 0 /∈ ∂f(xk)). We summarize with the following corollary. Corollary 2.9. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Sup- pose 0 /∈ ∂f(xk) for each iteration k. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies |∇Afi(xk)−∇fi(xk)| ≤ K̄∆k for all i. Then after a finite number of iterations, the algorithm will find a new iterate with a lower function value. 19 2.2. Convergence Proof. Consider xk, where 0 /∈ ∂f(xk). To find function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the stopping condition below is satisfied, i.e., ∆k ≤ µk|dk|. (2.10) In Theorem 2.6, we showed that for any µk > 0, there exists a ∆̄ such that if ∆k < ∆̄, then equation (2.10) is satisfied. If ∆k > ∆̄ in Step 2, i.e., equation (2.10) is not satisfied, then ∆k is updated according to equation (2.2). Whether |dk| 6= 0 or |dk| = 0, we can see that ∆k+1 ≤ θ∆k, so eventually ∆k < ∆̄. i.e., equation (2.10) will be satisfied and the AGS algorithm will carry out a line search. Now, in order to have a line search success, we must be able to find a step length tk such that the Armijo-like condition holds, f(xk + tkd k) < f(xk)− ηtk|dk|2. In Theorem 2.8, we showed that there exists t̄ > 0 such that f(xk + tkd k)− f(xk) < −ηtk|dk|2 for all 0 < tk < t̄, provided that for β ∈ (0, 1), ε < (1− β)|dk|. (2.11) Set ε = K̄∆k. If equation (2.11) does not hold, then a line search failure will occur, resulting in µk+1 = 0.5µk. Thus, eventually we will have µk < (1−β) K̄ and ∆k ≤ µk|dk| < (1− β) K̄ |dk|, which means equation (2.11) will hold. Thus, after a finite number of iter- ations, the AGS algorithm will declare a line search success and find a new iterate with a lower function value. Now we are ready to prove convergence. In the following result, assuming that the step length tk is bounded away from 0 means that there exists a t̄ > 0 such that tk > t̄. Remark 2.10. The following theorem does not ensure the existence of a clus- ter point. In order to ensure the existence of a cluster point, an additional assumption would have to be made, such as compact level sets. 20 2.2. Convergence Theorem 2.11. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let {xk}∞k=0 be an infinite sequence generated by the AGS algorithm. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies the error bound |∇Afi(xk)−∇fi(xk)| ≤ K̄∆k for all i. Suppose tk is bounded away from 0. Then either 1. f(xk) ↓ −∞, or 2. |dk| → 0, ∆k ↓ 0 and every cluster point x̄ of the sequence {xk}∞k=0 satisfies 0 ∈ ∂f(x̄). Proof. If f(xk) ↓ −∞, then we are done. Conversely, if f(xk) is bounded below, then f(xk) is non-increasing and bounded below, therefore f(xk) converges. We consider two cases. Case 1: An infinite number of line search successes occur. Let x̄ be a cluster point of {xk}∞k=0. Notice that xk only changes for line search successes, so there exists a subsequence {xkj}∞j=0 of line search suc- cesses such that xkj → x̄. Then for each corresponding step length tkj and direction dkj , the following condition holds f(xkj+1) ≤ f(xkj + tkjdkj ) < f(xkj )− ηtkj |dkj |2. Note that 0 ≤ ηtkj |dkj |2 < f(xkj )− f(xkj+1). Since f(xk) converges we know that f(xkj ) − f(xkj+1) → 0. Since tkj is bounded away from 0, we see that lim j→∞ |dkj | = 0. Recall from the AGS algorithm, we check the condition ∆kj ≤ µkj |dkj |. As ∆kj > 0, µkj ≤ µ0, and |dkj | → 0, we can conclude that ∆kj ↓ 0. Finally, from Lemma 2.2(1), as −dkj ∈ G(xkj ), there exists a vkj ∈ ∂f(xkj ) such that | − vkj − dkj | ≤ K̄∆kj ⇒ | − vkj | − |dkj | ≤ K̄∆kj ⇒ |vkj | ≤ K̄∆kj + |dkj |, 21 2.2. Convergence which implies that 0 ≤ |vkj | ≤ K̄∆kj + |dkj | → 0. So, lim j→∞ |vkj | = 0, where |vkj | ≥ dist(0|∂f(xkj )) ≥ 0, which implies dist(0|∂f(xkj )) → 0. We have xkj → x̄. As f is a finite max function, it is continuous and therefore, by Proposition 1.10, ∂f is outer semicontinuous. Hence, every cluster point x̄ of a convergent subsequence of {xk}∞k=0 satisfies 0 ∈ ∂f(x̄). Case 2: A finite number of line search successes occur. This means there exists a k̄ such that xk = xk̄ = x̄ for all k ≥ k̄. However, by Corollary 2.9, if 0 /∈ ∂f(x̄), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have 0 ∈ ∂f(x̄). Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdifferential is controlled by εtol. Theorem 2.12. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose there exists a K̄ > 0 such that for each iteration k, the approximate gradient satisfies |∇Afi(xk) −∇fi(xk)| ≤ K̄∆k for all i. Suppose the AGS algorithm terminates at some iteration k̄ in Step 2 for εtol > 0. Then dist(0|∂f(xk̄)) < (1 + K̄µ0)εtol. Proof. Let w̄ = Proj(0|G(xk̄)). We use v̄ ∈ ∂f(xk̄) as constructed in Lemma 2.2(1) to see that dist(0|∂f(xk̄)) ≤ dist(0|v̄) ≤ dist(0|w̄) + dist(w̄|v̄) (as w̄ ∈ G(xk̄), v̄ ∈ ∂f(xk̄)) = |dk̄|+ |w̄ − v̄| (as |dk̄| = |Proj(0|G(xk̄))|) ≤ |dk̄|+ ε (by Lemma 2.2) < εtol + ε (as |dk̄| < εtol). The final statement now follows by the test ∆k̄ ≤ µk̄|dk̄| in Step 2 and the fact that µk̄ ≤ µ0 as {µk}k=0 is a non-increasing sequence. 22 Chapter 3 Robust Approximate Gradient Sampling Algorithm The AGS algorithm depends on the active set of functions at each iterate, A(xk). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically, such behaviour means that the current iterate is close to a ‘nondifferentiable ridge’ formed by the function. In [BLO02] and [BLO05], it is suggested that allowing an algorithm to take into account these ‘almost active’ functions will provide a better idea of what is happening at and around the current iterate, thus, making the algorithm more robust, i.e., performs for unusual or strict types of functions. In this section, we present the robust approximate gradient sampling algorithm (RAGS algorithm). Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that are active at xk or active at one (or more) of the points in the set Y = [xky1, . . . , ym]. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius ∆k. Thus, the points in Y are not far from the current iterate. We define the robust active set next. Definition 3.1. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let y0 = xk be the current iterate and Y = [y0, y1, y2, . . . , ym] be a set of randomly sampled points from a ball centered at xk with radius ∆k. The robust active set of f on Y is A(Y ) = ⋃ yj∈Y A(yj), (3.1) where each set A(yj) is defined as in equation (1.2). To shed some light on the motivation for our robust algorithm, we con- sider the following function (originally from [CC78], Problem 2.1 CB2 in the test set [LV00]): 23 Chapter 3. Robust Approximate Gradient Sampling Algorithm f = max{f1, f2, f3}, where f1(x, y) = x 2 + y4, f2(x, y) = (2− x)2 + (2− y)2, f3(x, y) = 2 exp(y − x). Figure 3.1: Surface plot for Problem 2.1 CB2. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 (a) AGS algorithm 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 (b) RAGS algorithm Figure 3.2: Iterations taken using the simplex gradient for Problem 2.1 CB2. These figures illustrate example iterations taken by the AGS algorithm and by the RAGS algorithm (presented in this section) for Problem 2.1 CB2 using the simplex gradient as an approximate gradient (see Section 4.1 for details used in numerical results). In Figure 3.2(a), we can see that the iterations move perpendicular to the contours of the functions towards the nondifferentiable ridge. However, eventually the iterations stall at the nondifferentiable ridge. These results 24 3.1. Algorithm show that the AGS algorithm is performing similar to how we would expect the method of steepest descent to perform on this problem. In Figure 3.2(b), we can see that the iterations proceed similar to the AGS algorithm until they reach the nondifferentiable ridge. Here, instead of stalling at the nondifferentiable ridge, the algorithm turns and mini- mizes along the nondifferentiable ridge, moving towards the optimal solu- tion. Thus, by incorporating the ‘almost active’ functions into our active set, our algorithm is able to further minimize the function f(x, y). 3.1 Algorithm For the RAGS algorithm, we incorporate the robust active set by replac- ing Steps 1 and 2 of the AGS algorithm from Section 2.1 with the following. 1. Generate Approximate Subdifferential GkY (Robust): Generate a set Y = [xk, y1, . . . , ym] around the current iterate xk such that max j=1,...,m |yj − xk| ≤ ∆k. Use Y to calculate the approximate gradient of fi, denoted ∇Afi, at xk for each i ∈ A(Y ). Then set Gk = conv(∇Afi(xk))i∈A(xk) and GkY = conv(∇Afi(xk))i∈A(Y ). 2. Generate Search Direction: Let dk = −Proj(0|Gk). Let dkY = −Proj(0|GkY ). Check if ∆k ≤ µk|dk|. (3.2) If (3.2) does not hold, then set xk+1 = xk, set ∆k+1 = { θµk|dk| if |dk| 6= 0 θ∆k if |dk| = 0 , and go to Step 4. If (3.2) holds and |dk| < εtol, then STOP. Else, continue to the line search, using dkY as a search direction. 25 3.2. Convergence Notice that in Step 2, we still use the stopping conditions from Section 2. Although this modification requires the calculation of two projections, it should be noted that neither of these projections are particularly difficult to calculate and that no additional function evaluations are required for this modification. In Section 3.3, we use the Goldstein approximate subdiffer- ential to adapt Theorem 2.12 to work for stopping conditions based on dkY , but we still do not have theoretical results for the exact subdifferential. In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. For the RAGS algorithm, this alteration shows convincing results that the robust stopping conditions not only guarantee convergence, but significantly decrease the number of function evaluations required for the RAGS algorithm to con- verge. 3.2 Convergence To show that the RAGS algorithm is well-defined we require that when ∆ is small enough, the robust active set is in fact equal to the original active set. Lemma 3.2. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let Y = [x̄, y1, . . . , ym] be a randomly sampled set from a ball centered at x̄ with radius ∆. Then there exists an ε̃ > 0 such that if Y ⊆ Bε̃(x̄), then A(x̄) = A(Y ). Proof. Clearly, if i ∈ A(x̄), then i ∈ A(Y ), as x̄ ∈ Y . Consider i 6∈ A(x̄). Then by the definition of f , we have that fi(x̄) < f(x̄). We know that f is continuous (see Section 1.1). Therefore, there exists an ε̃i > 0 such that for all z ∈ Bε̃i(x̄) fi(z) < f(z). If ∆ < ε̃i, then we have |yj − x̄| < ε̃i for all j = 1, . . . ,m. Therefore, fi(y j) < f(yj) for all j = 1, . . . ,m, (3.3) so i 6∈ A(Y ). Setting ε̃ = min i 6∈A(x̄) ε̃i completes the proof. 26 3.2. Convergence Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when using the robust active set. Corollary 3.3. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose 0 6∈ ∂f(xk) for each iteration k. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies |∇Afi(xk)−∇f(xk)| ≤ K̄∆k for all i. Then after a finite number of iterations, the RAGS algorithm will find function value decrease. Proof. Consider xk, where 0 /∈ ∂f(xk). For eventual contradiction, suppose we do not find function value decrease. In the RAGS algorithm, this corresponds to an infinite number of line search failures. If we have an infinite number of line search failures, then ∆k → 0 and xk̄ = xk for all k̄ ≥ k. In Lemma 3.2, ε̃ depends only on xk. Hence, we can conclude that eventually ∆k < ε̃ and therefore Y k ⊆ Bε̃(xk). Thus, eventually A(xk) = A(Y k). Once the two active sets are equal, the results of Section 2.2 will hold. Thus, by Corollary 2.9, the algorithm will find a new iterate with a lower function value, a contradiction. To examine convergence of the RAGS algorithm we use the result that eventually the robust active set at the current iterate will be a subset of the regular active set at any cluster point of the algorithm. Lemma 3.4. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let Y k = [xk, y1, . . . , ym] be a randomly sampled set from a ball centered at xk with radius ∆k. Let x k → x̄. Then there exists an ε̃ > 0 such that if Y k ⊆ Bε̃(x̄), then A(Y k) ⊆ A(x̄). Proof. Let i /∈ A(x̄). By definition of f , we have that fi(x̄) < f(x̄). Since f is continuous, there exists an ε̃i > 0 such that for all z ∈ Bε̃i(x̄) fi(z) < f(z). If Y k ⊆ Bε̃i(x̄), then we have |xk − x̄| < ε̃i and |yj − x̄| < ε̃i for all j = 1, . . . ,m. Therefore fi(x k) < f(xk) and fi(y j) < f(yj) for all j = 1, . . . ,m. Therefore, i 6∈ A(Y k). Letting ε̃ = min i 6∈A(x̄) ε̃i completes the proof. 27 3.2. Convergence Now we examine the convergence of the RAGS algorithm. Theorem 3.5. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let {xk}∞k=0 be an infinite sequence generated by the RAGS algorithm. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies the error bound |∇Afi(xk)−∇fi(xk)| ≤ K̄∆k for all i. Suppose tk is bounded away from 0. Then either 1. f(xk) ↓ −∞, or 2. |dk| → 0, ∆k ↓ 0 and every cluster point x̄ of the sequence {xk}∞k=0 satisfies 0 ∈ ∂f(x̄). Proof. If f(xk) ↓ −∞, then we are done. Conversely, if f(xk) is bounded below, then f(xk) is non-increasing and bounded below, therefore f(xk) converges. We consider two cases. Case 1: An infinite number of line search successes occur. Let x̄ be a cluster point of {xk}∞k=0. Notice that xk only changes for line search successes, so there exists a subsequence {xkj}∞k=0 of line search suc- cesses such that xkj → x̄. Following the arguments of Theorem 2.11, we have |dkj | → 0 and ∆kj ↓ 0. Notice that if ∆kj ↓ 0, then eventually Y kj ⊆ Bε̃(x̄), where xk → x̄ and ε̃ is defined as in Lemma 3.4. Thus, by Lemma 3.4, we have that A(Y kj ) ⊆ A(x̄). This means G Y kj (xkj ) is formed from a subset of the active gradients that make up ∂f(x̄). Thus, by Lemma 2.2(1), as −dkj ∈ G Y kj (xkj ), we can construct a vkj ∈ ∂f(x̄) from the same set of active gradients that make up G Y kj (xkj ) such that | − dkj − vkj | = ∣∣∣∣ ∑ i∈A(Y kj ) αi∇Afi(xkj )− ∑ i∈A(Y kj ) αi∇fi(x̄) ∣∣∣∣ ≤ ∑ i∈A(Y kj ) αi|∇Afi(xkj )−∇fi(x̄)| = |∇Afi(xkj )−∇fi(x̄)| (as ∑ i∈A(Y kj ) αi = 1) ≤ |∇Afi(xkj )−∇fi(xkj )|+ |∇fi(xkj )−∇fi(x̄)| ≤ K̄∆kj + |∇fi(xkj )−∇fi(x̄)| (by error bound). 28 3.2. Convergence Using the Triangle Inequality, we have | − dkj − vkj | = | − vkj − dkj | ≥ |vkj | − |dkj |, which implies that |vkj | − |dkj | ≤ K̄∆kj + |∇fi(xkj )−∇fi(x̄)|. We already showed that |dkj | → 0 and ∆kj ↓ 0. Furthermore, since ∇fi ∈ C and xkj → x̄, we have |∇fi(xkj )−∇fi(x̄)| → 0. So, lim j→∞ |vkj | = 0. Using the same arguments as in Theorem 2.11, the result follows. Case 2: A finite number of line search successes occur. This means there exists a k̄ such that xk = xk̄ = x̄ for all k ≥ k̄. However, by Corollary 3.3, if 0 /∈ ∂f(x̄), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have 0 ∈ ∂f(x̄). Remark 3.6. Using dk to check our stopping conditions allows the result of Theorem 2.12 to still hold. 3.2.1 Descent Direction: dY It is worth noting that we can prove Lemma 2.2(2) for G(Y ) and thus, prove dY is a descent direction without the requirement that A(Y ) ⊆ A(x̄). We show this in Lemmas 3.7 and 3.8. Lemma 3.7. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let Y = [y0, y1, y2, . . . , ym] be a randomly sampled set from a ball centered at y0 = x̄ with radius ∆. Suppose there exists a K̄ > 0 such that the approximate gradient satisfies |∇Afi(x̄)−∇fi(x̄)| ≤ K̄∆ for all i. Then for all v ∈ ∂f(x̄), there exists a w ∈ G(Y ) such that |w − v| ≤ K̄∆. Proof. By definition, for all v ∈ ∂f(x̄) there exists a set of αi such that v = ∑ i∈A(x̄) αi∇fi(x̄) where αi ≥ 0, ∑ i∈A(x̄) αi = 1. 29 3.3. Robust Stopping Conditions We know that A(x̄) ⊆ A(Y ), as x̄ ∈ Y . Thus, using the same αi as above, we see that w = ∑ i∈A(x̄) αi∇Afi(x̄) ∈ G(Y ). Using the same argument as shown in Lemma 2.2(2), we can conclude that for all v ∈ ∂f(x̄), there exists a w ∈ G(Y ) such that |w − v| ≤ K̄∆. Using this result, we can show that dY is a descent direction. Lemma 3.8. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Let Y = [y0, y1, y2, . . . , ym] be a randomly sampled set from a ball centered at y0 = x̄ with radius ∆. Suppose there exists a K̄ > 0 such that the ap- proximate gradient satisfies |∇Afi(x̄) − ∇fi(x̄)| ≤ K̄∆ for all i. Define dY = −Proj(0|G(Y )) and suppose |dY | 6= 0. If K̄∆ < |dY |, then 〈dY , v〉 < 0 for all v ∈ ∂f(x̄). Proof. Using Lemma 3.7 instead of Lemma 2.2(2) in the proof of Corol- lary 2.5, we can prove the result. 3.3 Robust Stopping Conditions We want to provide some insight as to how Theorem 2.12 can work for stopping conditions based on dkY , that is, replacing the stopping conditions ∆k ≤ µk|dk| and |dk| < εtol in Step 2 with the robust stopping conditions ∆k ≤ µk|dkY | and |dkY | < εtol. (3.4) The following proposition does not theoretically justify why the robust stop- ping conditions in (3.4) are sufficient, but does help explain their logic. Theo- retically, since we do not know what x̄ is, we cannot tell when A(Y k) ⊆ A(x̄). Proposition 3.9. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1. Suppose there exists a K̄ > 0 such that |∇Afi(x)−∇fi(x)| ≤ K̄∆k for all i and for all x ∈ B∆k(xk). Suppose the RAGS algorithm terminates at some iteration k̄ in Step 2 using the robust stopping conditions given in (3.4). Furthermore, suppose there exists x̄ ∈ B∆k̄(xk̄) such that A(Y k̄) ⊆ A(x̄). Then Proj(0|∂f(x̄)) < (1 + K̄µ0)εtol. 30 3.3. Robust Stopping Conditions Proof. If A(Y k̄) ⊆ A(x̄), then the proofs of Lemma 2.2(1) and Theorem 2.12 still hold. Additionally, in the following results, we approach the theory for robust stopping conditions using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2, then it is shown that the distance between 0 and the Goldstein approximate subdifferential is controlled by εtol. Again, this does not prove the robust stopping conditions are sufficient for the exact subdifferential. Definition 3.10. The Goldstein approximate subdifferential, as defined in [Gol77], is given by the set ∂G∆f(x̄) = conv(∂f(z) : z ∈ B∆(x̄)), (3.5) where B∆(x̄) is the closed ball centered at x̄ with radius ∆. We now show that the Goldstein approximate subdifferential contains all of the gradients of the active functions in the robust active set. Lemma 3.11. Let f = max{fi : i = 1, . . . , N}. Let Y = [y0, y1, y2, . . . , ym] be a randomly sampled set from a ball centered at y0 = x̄ with radius ∆. If fi ∈ C1 for each i, then ∂G∆f(x̄) ⊇ conv(∇fi(yj) : yj ∈ Y, i ∈ A(yj)). Proof. If fi ∈ C1 for each i ∈ A(Y ), then by Proposition 1.6, for each yj ∈ Y we have ∂f(yj) = conv(∇fi(yj))i∈A(yj) = conv(∇fi(yj) : i ∈ A(yj)). Using this in our definition of the Goldstein approximate subdifferential in (3.5) and knowing B∆(x̄) ⊇ Y , we have ∂G∆f(x̄) ⊇ conv(conv(∇fi(yj) : i ∈ A(yj)) : yj ∈ Y ), which simplifies to ∂G∆f(x̄) ⊇ conv(∇fi(yj) : yj ∈ Y, i ∈ A(yj)). (3.6) Now we have a result similar to Lemma 2.2(1) for dkY with respect to the Goldstein approximate subdifferential. 31 3.3. Robust Stopping Conditions Remark 3.12. For the following two results, we assume each of the fi ∈ C1+ with Lipschitz constant L. Note that this implies the Lipschitz constant L is independent of i. If each fi ∈ C1+ has Lipschitz constant Li for ∇fi, then L is easily obtained by L = max{Li : i = 1, . . . , N}. Lemma 3.13. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1+ with Lipschitz constant L. Let Y = [y0, y1, y2, . . . , ym] be a randomly sampled set from a ball centered at y0 = x̄ with radius ∆. Suppose there exists a K̄ > 0 such that the approximate gradient satisfies |∇Afi(x̄) − ∇fi(x̄)| ≤ K̄∆ for all i. Then for all w ∈ G(Y ), there exists a g ∈ ∂G∆f(x̄) such that |w − g| ≤ (K̄ + L)∆. Proof. By definition, for all w ∈ G(Y ) there exists a set of αi such that w = ∑ i∈A(Y ) αi∇Afi(x̄), where αi ≥ 0, ∑ i∈A(Y ) αi = 1. By our assumption that each fi ∈ C1+, Lemma 3.11 holds. It is clear that for each i ∈ A(Y ), i ∈ A(yj) for some yj ∈ Y . Let ji be an index corresponding to this active index; i.e., i ∈ A(yji). Thus, for each i ∈ A(Y ), there is a corresponding active gradient ∇fi(yji) ∈ conv(∇fi(yji) : yji ∈ Y, i ∈ A(yji)) ⊆ ∂G∆f(x̄). Using the same αi as above, we can construct g = ∑ i∈A(Y ) αi∇fi(yji) ∈ conv(∇fi(yji) : yji ∈ Y, i ∈ A(yji)) ⊆ ∂G∆f(x̄). Then |w − g| = | ∑ i∈A(Y ) αi∇Afi(x̄)− ∑ i∈A(Y ) αi∇fi(yji)| ≤ ∑ i∈A(Y ) αi|∇Afi(x̄)−∇fi(yji)| = ∑ i∈A(Y ) αi|∇Afi(x̄)−∇fi(x̄) +∇fi(x̄)−∇fi(yji)| ≤ ∑ i∈A(Y ) αi ( |∇Afi(x̄)−∇fi(x̄)|+ |∇fi(x̄)−∇fi(yji)| ) . 32 3.3. Robust Stopping Conditions Applying our error bound assumption and the Lipschitz condition, we have |w − g| ≤ ∑ i∈A(Y ) αi ( K̄∆ + Lmax ji |x̄− yji | ) ≤ K̄∆ + L∆ (as ∑ i∈A(Y ) αi = 1) = (K̄ + L)∆. Hence, for all w ∈ G(x̄), there exists a g ∈ ∂G∆f(x̄) such that |w − g| ≤ ε+ L∆. (3.7) Thus, using Lemma 3.13, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled by εtol. Proposition 3.14. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1+ with Lipschitz constant L. Suppose there exists a K̄ > 0 such that for each iteration k, the approximate gradient satisfies |∇Afi(xk)−∇fi(xk)| ≤ K̄∆k for all i. Suppose the RAGS algorithm terminates at some iteration k̄ in Step 2 using the robust stopping conditions given in (3.4). Then Proj(0|∂G∆k̄f(x k̄)) < [1 + µ0(K̄ + L)]εtol. Proof. Let w̄ = Proj(0|G(Y k̄)). We use ḡ ∈ ∂G∆k̄f(x k̄) as constructed in Lemma 3.13 to see that dist(0|∂G ∆k̄)f(x k̄) ≤ dist(0|ḡ) ≤ dist(0|w̄) + dist(w̄|ḡ) (as w̄ ∈ G(Y k̄), ḡ ∈ ∂G∆k̄f(x k̄)) = |dk̄Y |+ |w̄ − ḡ| (as |dk̄Y | = |Proj(0|G(Y k̄))|) ≤ |dk̄Y |+ (K̄ + L)∆k̄ (by Lemma 3.13) < εtol + (K̄ + L)∆k̄ (as |dk̄Y | < εtol). The statement now follows by the test ∆k̄ ≤ µk̄|dk̄Y | in Step 2 and the fact that µk̄ ≤ µ0 as {µk}k=0 is a non-increasing sequence. 33 Chapter 4 Approximate Gradients As seen in the previous two sections, in order for convergence to be guaranteed in the AGS and RAGS algorithms, the approximate gradient used must satisfy an error bound for each of the active fi. Specifically, for a randomly sampled set of points Y = [xk, y1, . . . , ym] centered around xk, there must exist a K̄ > 0 such that |∇Afi(xk)−∇fi(xk)| ≤ K̄∆k, where ∆k = maxj |yj−xk|. In this section, we present three specific approx- imate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. 4.1 Simplex Gradient The simplex gradient is a commonly used approximate gradient. In re- cent years, several derivative-free algorithms have been proposed that use the simplex gradient, ([BK98], [Kel99a], [CV07], [CJV08], and [HM11]) among others. It is geometrically defined as the gradient of the linear interpola- tion of f over a set of n + 1 points in Rn. In the following section, we mathematically define the simplex gradient. 4.1.1 Definitions First we present several basic definitions of terms used in the definition of the simplex gradient. Definition 4.1. A set S is an affine set if given any two points x1 ∈ S and x2 ∈ S with x1 6= x2, the line formed by x1 and x2 is a subset of S, i.e., x1x2 ⊆ S. Definition 4.2. The affine hull of a set S ⊆ Rn is the smallest affine set containing S. 34 4.1. Simplex Gradient Definition 4.3. A set of m + 1 points Y = [y0, y1, . . . , ym] is said to be affinely independent if its affine hull aff(y0, y1, . . . , ym) has dimension m. Equivalently, Y is affinely independent if the set [y1− y0, . . . , ym− y0] is linearly independent, [CSV09]. Definition 4.4. Let Y = [y0, y1, . . . , yn] be a set of affinely independent points in Rn. We say that Y forms the simplex, S, where S = conv(Y ). Thus, S is a simplex if it can be written as the convex hull of an affinely independent set of n+ 1 points in Rn. With the previous terms defined, we are now ready to mathematically define the simplex gradient. Definition 4.5. Let Y = [y0, y1, . . . , yn] be a list of affinely independent points in Rn. The simplex gradient of a function f over the set Y is given by ∇sf(Y ) = L−1δf(Y ), where L = [ y1 − y0 · · · yn − y0 ]> and δf(Y ) = f(y 1)− f(y0) ... f(yn)− f(y0) . The condition number of a simplex is given by ‖L̂−1‖, where L̂ = 1 ∆ [y1 − y0 y2 − y0 . . . yn − y0]> and ∆ = max j=1,...n |yj − y0|. 4.1.2 Convergence The following result (from Kelley) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to ∇fi. Theorem 4.6. Consider fi ∈ C1+ with Lipschitz constant Ki for ∇fi. Let Y = [y0, y1, . . . , yn] form a simplex. Let L̂ = 1 ∆ [y1 − y0 y2 − y0 . . . yn − y0]>, where ∆ = max j=1,...n |yj − y0|. 35 4.2. Centered Simplex Gradient Then the simplex gradient satisfies the error bound |∇sfi(Y )−∇fi(y0)| ≤ K̄∆, where K̄ = 1 2 Ki √ n‖L̂−1‖. Proof. See [Kel99b, Lemma 6.2.1]. With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.7. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1+ with Lipschitz constant Ki for ∇fi. If the approximate gradient used in the AGS or RAGS algorithm is the simplex gradient and ‖L̂−1‖ is bounded above, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. 4.1.3 Algorithm Algorithm - Simplex Gradient In order to calculate a simplex gradient in Step 1, we generate a set Y = [xk, y1, . . . , yn] of points in Rn and then check to see if Y forms a well-poised simplex by calculating its condition number, ‖L̂−1‖. A bounded condition number (‖L̂−1‖ < n) ensures a ‘good’ error bound between the approximate gradient and the exact gradient. If Y forms a well-poised simplex (‖L̂−1‖ < n), then we calculate the simplex gradient of fi over Y for each i ∈ A(xk) (or each i ∈ A(Y k)) and then set the approximate subdifferential equal to the convex hull of the active simplex gradients. If the set Y does not form a well-poised simplex (‖L̂−1‖ ≥ n), then we resample. We note that the probability of generating a random matrix with a condition number greater than n is asymptotically constant, [Wsc04]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not affect the number of function evaluations required by the algorithm. 4.2 Centered Simplex Gradient The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it contains an advantage that the error bound satisfied by the centered simplex gradient is in terms of ∆2, rather than ∆. 36 4.2. Centered Simplex Gradient 4.2.1 Definitions Definition 4.8. Let Y = [y0, y1, . . . , yn] form a simplex. We define the sets Y + = [x, x+ ỹ1, . . . , x+ ỹn] and Y − = [x, x− ỹ1, . . . , x− ỹn], where x = y0 and ỹi = yi − y0 for i = 1, . . . n. The centered simplex gradient is the average of the two simplex gradients over the sets Y + and Y −, i.e., ∇CSf(Y +) = 1 2 (∇Sf(Y +) +∇Sf(Y −)). 4.2.2 Convergence To show that the AGS and RAGS algorithms are well-defined when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again from Kelley). Theorem 4.9. Consider fi ∈ C2+ with Lipschitz constant Ki for ∇2fi. Let Y = [y0, y1, . . . , yn] form a simplex. Let L̂ = 1 ∆ [y1 − y0, . . . , yn − y0] where ∆ = max j=1,...n |yj − y0|. Then the centered simplex gradient satisfies the error bound |∇CSfi(Y )−∇fi(y0)| ≤ K̄∆2, where K̄ = Ki √ n‖L̂−1‖. Proof. See [Kel99b, Lemma 6.2.5]. Notice that Theorem 4.9 requires fi ∈ C2+. If fi ∈ C1+, then the error bound is in terms of ∆, not ∆2. With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.10. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C2+ with Lipschitz constant Ki for ∇2fi. If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient and ∆0 ≤ 1, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. Proof. Since ∆0 ≤ 1 and non-increasing, (∆k)2 ≤ ∆k and ergo, Theorems 2.6 and 2.11 hold. 37 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction 4.2.3 Algorithm To adapt the AGS and RAGS algorithms to use the centered simplex gradient, in Step 1 we sample our set Y in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets Y + and Y − and proceed as expected. 4.3 Gupal Estimate of the Gradient of the Steklov Averaged Function The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [Kiw10] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.16 that an appropriate error bound exists for this approximate gradient. Surpris- ingly, unlike the error bounds for the simplex and centered simplex gradi- ents, the error bound in Theorem 4.16 does not include a condition number term. Following the notation used by Kiwiel in [Kiw10], we define the Gupal estimate of the gradient of the Steklov averaged function as follows. 4.3.1 Definitions Definition 4.11. For α > 0, the Steklov averaged function fα is defined by fα(x) = ∫ Rn f(x− y)ψα(y)dy, where ψα : Rn → R+ is the Steklov mollifier defined by ψα(y) = { 1/αn if y ∈ [−α/2, α/2]n, 0 otherwise. We can equivalently define the Steklov averaged function by fα(x) = 1 αn ∫ x1+α/2 x1−α/2 · · · ∫ xn+α/2 xn−α/2 f(y)dy1 . . . dyn. (4.1) The partial derivatives of fα are given by ∂fα ∂xi (x) = ∫ B∞ γi(x, α, ζ)dζ (4.2) 38 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction for i = 1, . . . , n, where B∞ = [−1/2, 1/2]n is the unit cube centred at 0 and γi(x, α, ζ) = 1 α [ f(x1 + αζ1, . . . , xi−1 + αζi−1, xi + 1 2 α, xi+1 + αζi+1, . . . , xn + αζn) −f(x1 + αζ1, . . . , xi−1 + αζi−1, xi − 1 2 α, xi+1 + αζi+1, . . . , xn + αζn) ] . (4.3) Definition 4.12. Given α > 0 and z = (ζ1, . . . , ζn) ∈∏ni=1 B∞, the Gupal estimate of ∇fα(x) over z is given by γ(x, α, z) = (γ1(x, α, ζ 1), . . . , γn(x, α, ζ n)). (4.4) Remark 4.13. Although we define γ(x, α, z) as the Gupal estimate of∇fα(x), in Section 4.3.2, we will show that γ(x, α, z) provides a good approximate of the exact gradient, ∇fi(x). Remark 4.14. For the following results, we note that the α used in the above definitions is equivalent to our search radius ∆. Thus, we will be replacing α with ∆ in the convergence results in Section 4.3.2. 4.3.2 Convergence Convergence As before, in order to show that the AGS and RAGS algorithms are well-defined when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximate of our exact gradient. To do this, we need the following result. Lemma 4.15. Let f ∈ C1+ with Lipschitz constant K for ∇f . Let y0 ∈ Rn. Then for any y ∈ Rn |f(y)− f(y0)−∇f(y0)>(y − y0)| ≤ 1 2 K|y − y0|2. Proof. For ease of notation, let δ = y − y0. By the Fundamental Theorem of Calculus, for any y ∈ Rn we have f(y)− f(y0) = ∫ 1 0 ∇f(y0 + τ(δ))>δ dτ. (4.5) 39 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Considering ∇f(y0)>δ, notice that ∇f(y0)>δ = ∫ 1 0 ∇f(y0)>δ dτ. (4.6) Subtracting (4.6) from (4.5), we have f(y)− f(y0)−∇f(y0)>δ = ∫ 1 0 (∇f(y0 + τδ)−∇f(y0))>δ dτ. Taking the absolute value, we have∣∣∣∣f(y)− f(y0)−∇f(y0)>δ∣∣∣∣ = ∣∣∣∣ ∫ 1 0 (∇f(y0 + τδ)−∇f(y0))>δ dτ ∣∣∣∣ ≤ ∫ 1 0 ∣∣∣∣(∇f(y0 + τδ)−∇f(y0))>δ∣∣∣∣dτ (as ∇f is cont.) ≤ ∫ 1 0 ∣∣∣∣(∇f(y0 + τδ)−∇f(y0))∣∣∣∣|δ| dτ (by Cauchy Schwarz) ≤ ∫ 1 0 K|y0 + τδ − y0||δ|dτ (as ∇f is Lip.) = ∫ 1 0 Kτ |δ|2dτ (as τ ∈ (0, 1)) = 1 2 K|δ|2. Therefore, with δ = y − y0, we have |f(y)− f(y0)−∇f(y0)>(y − y0)| ≤ 1 2 K|y − y0|2. Using Lemma 4.15, we establish an error bound between the Gupal es- timate and the exact gradient of fi. 40 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Theorem 4.16. Consider fi ∈ C1+ with Lipschitz constant Ki for ∇fi. Let ε > 0. Then for ∆ > 0, z = (ζ1, . . . , ζn) ∈ Z = ∏ni=1 B∞ and any point x ∈ Rn, the Gupal estimate of ∇fi,α(x) satisfies the error bound |γ(x,∆, z)−∇fi(x)| ≤ √ n 1 2 Ki∆( √ n+ 3). Proof. For ∆ > 0, let yj− = [x1 + ∆ζ1, . . . , xj−1 + ∆ζj−1, xj − 1 2 ∆, xj+1 + ∆ζj+1, . . . , xn + ∆ζn] >. By Lemma 4.15, for yj+ = [x1 + ∆ζ1, . . . , xj−1 + ∆ζj−1, xj + 1 2 ∆, xj+1 + ∆ζj+1, . . . , xn + ∆ζn] >, we have |fi(yj+)− fi(yj−)−∇fi(yj−)>(yj+ − yj−)| ≤ 1 2 Ki|yj+ − yj−|2. (4.7) From equation (4.3) (with α = ∆), we can see that fi(y j+)− fi(yj−) = ∆ γj(x,∆, ζj). Hence, equation (4.7) becomes |∆ γj(x,∆, ζj)−∇fi(yj−)>(yj+ − yj−)| ≤ 1 2 Ki|yj+ − yj−|2. (4.8) From our definitions of yj− and yj+, we can see that yj+ − yj− = [0, . . . , 0,∆, 0, . . . , 0]>. The inner product in equation (4.8) simplifies to ∇fi(yj−)>(yj+ − yj−) = ∆ ∂fi ∂xj (yj−). Thus, we have ∣∣∣∣∆ γj(x,∆, ζj)−∆ ∂fi∂xj (yj−) ∣∣∣∣ ≤ 12Ki∆2. Dividing out ∆ gives∣∣∣∣γj(x,∆, ζj)− ∂fi∂xj (yj−) ∣∣∣∣ ≤ 12Ki∆. (4.9) 41 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction We notice that |yj− − x| = ∣∣∣∣(∆ζj1 , . . . ,∆ζjj−1,−12∆,∆ζjj+1, . . . ,∆ζjn) ∣∣∣∣ = ∆ ∣∣∣∣(ζj1 , . . . , ζjj−1,−12 , ζjj+1, . . . , ζjn) ∣∣∣∣. Using the standard basis vector ej , we have |yj− − x| = ∆ ∣∣∣∣ζj − ζjej − 12ej ∣∣∣∣ ≤ ∆ ( |ζj |+ |ζj |+ 1 2 ) ≤ ∆ ( 1 2 √ n+ 1 2 + 1 2 ) (as ζj ∈ [−1/2, 1/2]n) = 1 2 ∆( √ n+ 2). Thus, since fi ∈ C1+, we have |∇fi(yj−)−∇fi(x)| ≤ Ki 1 2 ∆( √ n+ 2). (4.10) Now, we notice that ∣∣∣∣ ∂fi∂xj (yj−)− ∂fi∂xj (x) ∣∣∣∣ = ∣∣∣∣∇fi(yj−)>ej −∇fi(x)>ej∣∣∣∣ ≤ |∇fi(yj−)−∇fi(x)||ej | = |∇fi(yj−)−∇fi(x)|. Therefore, ∣∣∣∣ ∂fi∂xj (yj−)− ∂fi∂xj (x) ∣∣∣∣ ≤ Ki 12∆(√n+ 2). (4.11) 42 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Using equations (4.9) and (4.11), we have∣∣∣∣γj(x,∆, ζj)− ∂fi∂xj (x) ∣∣∣∣ = ∣∣∣∣γj(x,∆, ζj)− ∂fi∂xj (yj−) + ∂fi∂xj (yj−)− ∂fi∂xj (x) ∣∣∣∣ ≤ ∣∣∣∣γj(x,∆, ζj)− ∂fi∂xj (yj−) ∣∣∣∣+ ∣∣∣∣ ∂fi∂xj (yj−)− ∂fi∂xj (x) ∣∣∣∣ ≤ 1 2 Ki∆ +Ki 1 2 ∆( √ n+ 2) = 1 2 Ki∆( √ n+ 3). Hence, |γ(x,∆, z)−∇fi(x)| = √( γ1(x,∆, ζ1)− ∂fi ∂x1 (x) )2 + · · ·+ ( γn(x,∆, ζn)− ∂fi ∂xn (x) )2 ≤ √( 1 2 Ki∆( √ n+ 3) )2 + · · ·+ ( 1 2 Ki∆( √ n+ 3) )2 = √ n ( 1 2 Ki∆( √ n+ 3) )2 = √ n 1 2 Ki∆( √ n+ 3). We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of f as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.17. Let f = max{fi : i = 1, . . . , N} where each fi ∈ C1+ with Lipschitz constant Ki for ∇fi. If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. 43 4.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction 4.3.3 Algorithm Algorithm To use the Gupal estimate of the gradient of the Steklov averaged func- tion in the AGS and RAGS algorithms, in Step 1, we sample independently and uniformly {zkl}ml=1 from the unit cube in Rn×n, where m is the number of active functions. Then proceed as expected. 44 Chapter 5 Numerical Results 5.1 Versions of the AGS Algorithm We implemented the AGS and RAGS algorithms using the simplex gra- dient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as approximate gradients. Additionally, we used the robust descent direction to create robust stop- ping conditions. That is, the AGS and RAGS algorithms terminate when ∆k ≤ µk|dkY | and |dkY | < εtol, (5.1) where dkY is the projection of 0 onto the approximate subdifferential gener- ated using the robust active set. (See Lemmas 3.11 and 3.13, and Proposition 3.14 for results linking the robust stopping conditions with the Goldstein ap- proximate subdifferential.) The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available upon request. Let dk denote the regular descent direction and let dkY denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions: 1. |dk| = |dkY |; 2. |dk| > |dkY |, but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or 3. |dk| > |dkY |, but checking the stopping conditions leads to a different result. In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In Scenario 3, we have two cases: 1. ∆k ≤ µk|dkY | ≤ µk|dk|, but |dk| ≥ εtol and |dkY | < εtol or 2. ∆k ≤ µk|dk| holds, but ∆k > µk|dkY |. 45 5.2. Test Sets and Software Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: to terminate early, providing a solution that has a smaller quality measure, but requires less function evaluations to find, or to reduce its search radius instead of carrying out a line search, reducing the number of function evaluations carried out during that iteration and calculating a more accurate approximate subdif- ferential at the next iteration. Our goal in this testing is to determine if there are any notable numerical differences in the quality of the three approximate gradients (simplex, cen- tered simplex, and Gupal estimate), the two search directions (non-robust and robust), and the two stopping conditions (non-robust and robust). This results in the following 12 versions: AGS Simplex (1. non-robust /2. robust stopping) RAGS Simplex (3. non-robust /4. robust stopping) AGS Centered Simplex (5. non-robust /6. robust stopping) RAGS Centered Simplex (7. non-robust /8. robust stopping) AGS Gupal (9. non-robust /10. robust stopping) RAGS Gupal (11. non-robust/12. robust stopping) 5.2 Test Sets and Software Testing was performed on a 2.0 GHz Intel Core i7 Macbook Pro and a 2.2 GHz Intel Core 2 Duo Macbook Pro. We used the test set from Lukšan- Vlček, [LV00]. The first 25 problems presented are of the desired form min x F (x) where F (x) = max{fi(x) : i = 1, 2, . . . , N}. Of these 25 problems, we omit problem 2.17 because the sub-functions are complex-valued. Thus, our test set is a total of 24 finite minimax problems with dimensions from 2 to 20. There are several problems featuring functions fi that have the form fi = |fi|, where fi is a smooth function. We rewrote these functions as fi = max{fi,−fi}. The resulting test problems have from 2 to 130 sub-functions. A summary of the test problems appears in Table 1 in Appendix A. 46 5.3. Initialization and Stopping Conditions 5.3 Initialization and Stopping Conditions We first describe our choices for the initialization parameters used in the AGS and RAGS algorithms. The initial starting points are given for each problem in [LV00]. We set the initial accuracy measure to 0.5 with a reduction factor of 0.5. We set the initial search radius to 0.1 with a reduction factor of 0.5. The Armijo- like parameter η was chosen to be 0.1 to ensure that a line search success resulted in a significant function value decrease. We set the minimum step length to 10−10. Next, we discuss the stopping tolerances used to ensure finite termina- tion of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The first reason corresponds to our theo- retical stopping conditions, while the remaining three reasons are to ensure numerical stability of the algorithm. 1. Stopping conditions met: As stated in the theoretical algorithm, the algorithm terminates for this reason when ∆k ≤ µk|dk| and |dk| < εtol, where dk is either the regular or the robust descent direction. 2. Hessian matrix has NaN / Inf entries: For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates. 3. ∆k, µk, and |dk| are small: This stopping criterion bi-passes the test ∆k ≤ µk|dk| (in Step 2) and stops if ∆k < ∆tol, µk < µtol and |dk| < εtol. Examining Theorem 2.12 along with Theorems 4.6, 4.9 and 4.16, it is clear that this is also a valid stopping criterion. We used a bound of 10−6 in our implementation for both ∆k and µk. 4. Max number of function evaluations reached - As a final failsafe, we added an upper bound of 106 on the number of function evaluations allowed. (This stopping condition only occurs once in our results.) 5.4 Computing a Descent Direction In each iteration of the AGS algorithm, we must compute the projection of 0 onto the approximate subdifferential G(xk). Additionally, for the RAGS algorithm, we must compute the projection of 0 onto the robust approximate subdifferential G(Y k). 47 5.4. Computing a Descent Direction To explain how we solve this problem in MATLAB, we first define the projection of 0 onto the convex hull of a set of points X = [x1, x2, . . . , xn]. By definition, conv(X) = {z : z = n∑ i=1 αixi, αi ≥ 0, n∑ i=1 αi = 1, xi ∈ X}, where x ∈ Rn and α = [α1, . . . , αn] ∈ Rn. By Definition 1.3, we know that the projection p of a point x ∈ Rn onto a closed convex set C is in arg min{|y − x|2 : y ∈ C}. Hence, we have Proj(0| conv(X))∈ arg min (z,α) {|z|2 : z = n∑ i=1 αixi, αi ≥ 0, n∑ i=1 αi = 1, xi ∈ X}. Next we show that this minimization problem is a quadratic program that can be solved using a quadratic program solver. For the sake of notation, let x = xk. We are given G(x) = conv(∇Afi(x))i∈A(x), i.e., G(x) = {g : g = ∑ i∈A(x) λi∇Afi(x), ∑ i∈A(x) λi = 1, λi ≥ 0}. Without loss of generality, assume the first m functions in our finite set of fi functions are active (m ≤ n). Let λ = [λ1 λ2 . . . λm] > and Λ = [∇Af1(x) ∇Af2(x) . . . ∇Afm(x)]. Then for all g ∈ G(x), g = Λλ for some λ ≥ 0, ∑ i∈A(x̄) λi = 1. Define A = [1 1 . . . 1]. Then Aλ = 1. Now ‖g‖2 = (Λλ)>(Λλ) = λ>Λ>Λλ. 48 5.5. Results Let H = Λ>Λ = ∇Af1(x)2 ∇Af1(x)>∇Af2(x) · · · ∇Af1(x)>∇Afm(x) ∇Af2(x)>∇Af1(x) ∇Af2(x)2 · · · ∇Af2(x)>∇Afm(x) ... ... . . . ... ∇Afm(x)>∇Af1(x) ∇Afm(x)>∇Af2(x) · · · ∇Afm(x)2 . Then we have the problem min λ {λ>Hλ : λ ≥ 0, Aλ = 1}. (5.2) After solving this problem, we have d = −Proj(0|G(x)) = −Λλ. Lemma 5.1. The matrix H defined above is positive semi-definite and there- fore, the minimization problem in (5.2) is a convex quadratic program. Proof. Clearly, H ∈ R(m+1)×(m+1) is symmetric as ∇Afi(x)>∇Afj(x) = ∇Afj(x)>∇Afi(x) for all i 6= j, i, j = 1, . . . ,m. Furthermore, we have for all y ∈ Rn, y>Hy = y>Λ>Λy = (Λy)>(Λy) = |Λy|2 ≥ 0. Thus, H is positive semi-definite. In implementation, we used the quadprog.m function in MATLAB to solve for λ. The default algorithm for the quadprog.m function is the interior point algorithm, which finds the projection in a limiting sense. As the quadratic programs solved in our algorithm are of small size, we used the active-set algorithm instead, thus, finding exact solutions. 5.5 Results Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality was measured 49 5.5. Results by the improvement in the number of digits of accuracy, which is calculated using the formula − log ( |Fmin − F ∗| |F0 − F ∗| ) , where Fmin is the function value at the final (best) iterate, F ∗ is the true minimum value (optimal value) of the problem (as given in [LV00]) and F0 is the function value at the initial iterate. Results on function evaluations and solution qualities appear in Tables 6.2, 6.3 and 6.4 of the appendix. To visually compare algorithmic versions, we use performance profiles. A performance profile is the (cumulative) distribution function for a perfor- mance metric [DM02]. For the AGS and RAGS algorithm, the performance metric is the ratio of the number of function evaluations taken by the current version to successfully solve each test problem versus the least number of function evaluations taken by any of the versions to successfully solve each test problem. Performance profiles eliminate the need to discard failures in numerical results and provide a visual representation of the performance difference between several solvers. For each performance profile, we have a set S of ns solvers and a set P of np problems. Our performance profiles define tp,s to be the number of function evaluations required to solve problem p by solver s. The baseline for comparison used is the following performance ratio rp,s = tp,s min{tp,s : s ∈ S} . As an overall assessment of the performance of the solver, we define ρs(τ) = 1 np size{p ∈ P : rp,s ≤ τ}. So, “ρs(τ) is the probability for solver s ∈ S that a performance ratio rp,s is within a factor τ ∈ R of the best possible ratio”, [DM02]. In Figures 5.1(a) and 5.1(b) we include a performance profile showing all 12 versions our algorithm tested, declaring a success for 1 digit and 3 digits of improvement, respectively. We used PProfile.m in MATLAB, a function courtesy of Dr. Warren Hare and Valentine Koch, to generate the performance profiles. 50 5.5. Results 100 101 102 103 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 τ ρ(τ ) Performance Profile on AGS and RAGS Algorithms (successful improvement of 1 digit) (a) Accuracy improvement of 1 digit. 100 101 102 103 0 0.1 0.2 0.3 0.4 0.5 0.6 τ ρ(τ ) Performance Profile for AGS and RAGS Algorithms (successful improvement of 3 digits) (b) Accuracy improvement of 3 digits. Figure 5.1: Performance profiles for 12 versions of AGS/RAGS algorithm. 51 5.5. Results In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It only produces 1 or more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no significant difference between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients. Looking at the results in Tables 6.2, 6.3 and 6.4, and our performance profiles, we can make the following two observations: i) the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and ii) the RAGS algorithm using the robust stopping conditions termi- nates faster and with lower (but still significant) accuracy. Robust active set: From our results, it is clear that expanding the active set to include ‘almost active’ functions in the RAGS algorithm greatly im- proves performance for the simplex and centered simplex algorithm. This robust active set brings more local information into the approximate sub- differential and thereby allows for descent directions that are more parallel to any nondifferentiable ridges formed by the function. Robust stopping conditions: We notice from the performance profiles that in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results correspond with our pre- viously discussed hypothesis. Furthermore, upon studying the reasons for termination, it appears that the non-robust stopping conditions cause the AGS and RAGS algorithms to terminate mainly due to ∆k and µk becoming too small. For the robust stopping conditions, the RAGS algorithm termi- nated often because the stopping conditions were satisfied. As our theory in Section 3.3 is not complete, we cannot make any theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.11). However, from our results, we conjecture that the alteration is beneficial for decreasing function evaluations. In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS al- gorithm in average accuracy obtained over 25 trials using the simplex and centered simplex gradients. Knowing this, we conclude that the improve- ment of the accuracy is due to the choice of a robust search direction. 52 5.6. An Application in Seismic Retrofitting 5.6 An Application in Seismic Retrofitting This application deals with the potentially destructive inter-story drift that occurs during an earthquake. In high density metropolitan areas, the close proximity of buildings leaves little room for side-to-side movement. As a means of lessening the destructive effects of an earthquake on high rise buildings, dampers are placed between closely adjacent buildings. When the buildings sway, the dampers absorb the horizontal movement. For this application, there are two optimization problems that need to be solved. The first is determining the optimal positions of the dampers. The second is determining the optimal damping coefficients of damper con- nectors. Details on the discrete optimization problem of optimal damper configurations can be found in [BHT12]. The problem of determining optimal damping coefficients is a continuous optimization problem. You can think of a damping coefficient like a friction coefficient; it represents the effect the damper has on the velocity of the ith floor of the building. In 2011, Bigdeli, Hare and Tesfamariam presented the first research on the application of mathematical optimization to the problem of optimal damping coefficients [BHT11]. As presented, a set of damping coefficients is run through a simulation that generates the ‘inter-story drift’ of each floor. The objective is to minimize the maximum inter-story drift. As inter- story drift is computed via simulation, no derivative information is available and DFO methods are indispensable. Applying the Nelder-Mead method, a well-known DFO algorithm, to this problem, it is shown in [BHT11] that non-uniform damping coefficients can greatly improve system performance. This problem is a finite minimax problem. The details of the problem can be found in [BHT11]; simply put, we are trying to minimize the max- imum inter-story drift between buildings. In this section, we explore the performance of two commercial solvers and the RAGS algorithm (using the simplex gradient and the robust stopping conditions) on this problem. The two commercial solvers we look at are the DFO pattern search and genetic algorithms defined in MATLAB. The pattern search algorithm is categorized as a directional direct-search method. As in any optimization algorithm, we desire to find a new point xk+1 such that f(xk+1) < f(xk). There are two phases to this algorithm as described in [CSV09]: the search step and the poll step. In the search step, the algorithm searches a finite set of points for function value decrease. If found, then the iterate is updated and the algorithm loops. Else, the algorithm carries out a poll step. A poll step is a localized search around 53 5.6. An Application in Seismic Retrofitting the current iterate using a given directional matrix. The genetic algorithm follows the framework of initialization, selection and reproduction. In the initialization stage, the algorithm randomly gen- erates an initial population, i.e., generates a random set of points. In the selection stage, the algorithm chooses the ‘best’ individuals in the popula- tion to reproduce, i.e., the set of points that result in the lowest function values. In the reproduction stage, the algorithm creates a new population from the ‘best’ individuals. The genetic algorithm requires a large number of function evaluations per iteration, so is best suited for problems with fast function call speed or when the dimension of a problem is high. In Table 5.6, we present the summary results of our comparison of the genetic, pattern search and RAGS algorithms for 135 damper coefficient selection test problems. Table 5.1: Summary results for 135 damping coefficient selection test prob- lems. Best Time Total Computation Best Fopt Time (seconds) GA 46 972357.5 5 PS 22 4689590.1 71 RAGS 67 864632.6 59 We can see that the RAGS algorithm has the shortest computation time for 67 of the 135 test problems and the shortest total computation time for all of the 135 test problems. If for optimal function values, we declare a tie between algorithms when |Fopt,GA/PS − Fopt,RAGS | Fopt,GA/PS < 10−2, then the genetic algorithm tied all 5 of its best solutions with the RAGS algorithm, and the pattern search algorithm tied 40 of its 71 best solutions with the RAGS algorithm. In summary, for 104 of the 135 test problems, the RAGS algorithm had the best or tied for the best optimal solution. Thus, the RAGS algorithm is certainly comparable, if not superior, to the genetic and pattern search algorithms in terms of finding the optimal function value for these problems. 54 Chapter 6 Conclusion and Future Work 6.1 Conclusion We have presented a new derivative-free algorithm for finite minimax problems that exploits the smooth substructure of the problem. Conver- gence results are given for any arbitrary approximate gradient that satisfies an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algo- rithm is presented and shown to have the same convergence results as the regular version. Of the theory presented, the most influential result is Lemma 2.2, which says that our approximate subdifferential is a good approximate of our ex- act subdifferential. Lemma 2.2 (1) is essential in proving that our stopping conditions are sufficient and Lemma 2.2 (2) is essential in proving that our search direction (approximate direction of steepest descent) is a descent di- rection. Part 1 can be adapted to the Goldstein approximate subdifferential (see Lemma 3.7) and Part 2 can be adapted to the robust approximate subdifferential. The proof is elegant, relying only on the definitions of sub- differential and approximate subdifferential sets. In Chapter 4, we see that the AGS and RAGS algorithms are flexible as to the approximate gradient used. The condition that an error bound in terms of ∆ is satisfied is a reasonable assumption, as the error bounds from Kelley for the simplex and centered simplex gradients and the error bound we provided for the Gupal estimate of the gradient of the Steklov average function depend on ∆. Through numerical testing, we found that the RAGS algorithm out- performs the AGS algorithm with respect to the accuracy of the solution obtained. The general framework of the RAGS algorithm is not so different from the method of steepest descent. However, by including the ‘almost active’ functions in the robust active set, the RAGS algorithm is shown to avoid the downfall of the method of steepest descent caused by nondifferen- tiable ridges when applied to nonsmooth functions. Additionally, we tested robust stopping conditions and found that they generally required less func- 55 6.2. Future Work tion evaluations before termination for the RAGS algorithm. Although the results presented in Section 3.3 do not provide a complete theory for the robust stopping conditions, they give insight into the superior performance of the robust stopping conditions in our numerical results. Of the visuals presented, Figures 3.2(a) and 3.2(b) capture the true essence of the RAGS algorithm, clearly showing that the robust stopping conditions paired with the robust version of the algorithm performed best. 6.2 Future Work Considerable future work is available in this research direction. Most obvious is a further exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement bounding the step length away from 0 (see Theorems 2.11 and 3.5). In gradient based methods, one common way to avoid this requirement is with the use of Wolfe-like conditions. We are unaware of any derivative- free variant on the Wolfe conditions. The AGS and RAGS algorithms are hybrid algorithms; they combine the elements of several previously proposed algorithm frameworks to form new optimization algorithms. Combining the strengths of algorithms to create novel hybrid algorithms is an unbounded field of future work with great potential. Finally, there is considerable future work in the seismic retrofitting prob- lem, as well as numerous other real-world applications. 56 Bibliography [BC11] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York, NY, 2011. → pages 4, 16 [BHT11] K. Bigdeli, W. Hare, and S. Tesfamariam. Determining opti- mal non-uniform damping coefficients for adjacent buildings via a nelder-mead method. Proceeding of the 23rd Canadian Congress of Applied Mechanics, 2011. 4 pages. → pages 53 [BHT12] K. Bigdeli, W. Hare, and S. Tesfamariam. Configuration opti- mization of dampers for adjacent buildings under seismic excita- tions. To appear in Eng. Opt., 2012. 19 pages. → pages 53 [BJF+98] A. J. Booker, J. E. Dennis Jr., P. D. Frank, D. B. Serafini, and V. Torczon. Optimization using surrogate objectives on a heli- copter test example. In Computational methods for optimal de- sign and control (Arlington, VA, 1997), volume 24 of Progr. Sys- tems Control Theory, pages 49–58. Birkhäuser Boston, Boston, MA, 1998. → pages 3 [BK98] D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems, volume 24 of Computational methods for optimal design and control. Birkhäuser Boston, Boston, MA, 1998. → pages 34 [BKS08] A. M. Bagirov, B. Karasözen, and M. Sezer. Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317–334, 2008. → pages 5, 7, 8 [BLO02] J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Math. Oper. Res., 27(3):567–584, 2002. → pages 5, 23 57 Bibliography [BLO05] J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradi- ent sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim., 15(3):751–779, 2005. → pages 5, 23 [CC78] C. Charalambous and A. R. Conn. An efficient method to solve the minimax problem directly. SIAM J. Numer. Anal., 15(1):162–187, 1978. → pages 23 [CJV08] A. L. Custódio, J. E. Dennis Jr., and L. N. Vicente. Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal., 28(4):770–784, 2008. → pages 34 [Cla90] F. H. Clarke. Optimization and Nonsmooth Analysis. Classics Appl. Math. 5. SIAM, Philadelphia, second edition, 1990. → pages 2, 4, 8 [CSV09] A. Conn, K. Scheinberg, and L. Vicente. Introduction to derivative-free optimization, volume 8 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2009. → pages 2, 3, 4, 5, 35, 53 [CTYZ00] X. Cai, K. Teo, X. Yang, and X. Zhou. Portfolio optimization under a minimax rule. Manage. Sci., 46:957–972, July 2000. → pages 2 [CV07] A. L. Custódio and L. N. Vicente. Using sampling and sim- plex derivatives in pattern search methods. SIAM J. Optim., 18(2):537–555, 2007. → pages 34 [DM02] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Math. Program., 91(2, Ser. A):201– 213, 2002. → pages 50 [DV04] R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Struct. and Multidiscip. Optim., 28:195– 205, 2004. → pages 3 [Gol77] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program., 13(1):14–22, 1977. → pages 31 [Gup77] A. M. Gupal. A method for the minimization of almost differen- tiable functions. Kibernetika, pages 114–116, 1977. (in Russian). → pages 5 58 Bibliography [Har10] W. Hare. Using derivative free optimization for constrained parameter selection in a home and community care forecasting model. In Int. Perspectives on Oper. Res. and Health Care, Proc. of the 34th Meeting of the EURO Working Group on Oper. Res. Applied to Health Sciences, 2010. → pages 3 [HM11] W. Hare and M. Macklem. Derivative-free optimization methods for finite minimax problems. To appear in Optim. Methods and Softw., 2011. 16 pages. → pages 3, 5, 7, 34 [HN12] W. Hare and J. Nutini. A derivative-free approximate gradient sampling algorithm for finite minimax problems. Submitted to Comput. Optim. and Appl., February 2012. 33 pages. → pages vii [IOKK04] J. Imae, N. Ohtsuki, Y. Kikuchi, and T. Kobayashi. Minimax control design for nonlinear systems based on genetic program- ming: Jung’s collective unconscious approach. Intern. J. Syst. Sci., 35:775–785, October 2004. → pages 2 [Kel99a] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a sufficient decrease condition. SIAM J. Optim., 10(1):43–55, 1999. → pages 34 [Kel99b] C. T. Kelley. Iterative Methods for Optimization, volume 18 of Frontiers in Applied Mathematics. SIAM, Philadelphia, 1999. → pages 36, 37 [Kiw10] K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Op- tim., 20(4):1983–1994, 2010. → pages 5, 7, 8, 38 [LLS06] G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algo- rithm for linearly constrained finite minimax problems. SIAM J. Optim., 16(4):1054–1075, 2006. → pages 3, 7 [LV00] L. Lukšan and J. Vlček. Test problems for nonsmooth uncon- strained and linearly constrained optimization. Technical Report V-798, ICS AS CR, February 2000. → pages 5, 23, 46, 47, 50 [Mad75] K. Madsen. Minimax solution of non-linear equations without calculating derivatives. Math. Programming Stud., pages 110– 126, 1975. → pages 3 59 Bibliography [MFT08] A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computa- tional framework for derivative-free optimization of cardiovascu- lar geometries. Comput. Methods Appl. Mech. Engrg., 197(21- 24):1890–1905, 2008. → pages 3 [NW99] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York, 1999. → pages 13 [PGL93] G. Di Pillo, L. Grippo, and S. Lucidi. A smooth method for the finite minimax problem. Math. Program., 60(2, Ser. A):187–214, 1993. → pages 3 [Pol87] E. Polak. On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev., 29(1):21–89, 1987. → pages 2 [Pol88] R. A. Polyak. Smooth optimization methods for minimax prob- lems. SIAM J. Control Optim., 26(6):1274–1286, 1988. → pages 3 [PRW03] E. Polak, J. O. Royset, and R. S. Womersley. Algorithms with adaptive smoothing for finite minimax problems. J. Optim. The- ory Appl., 119(3):459–484, 2003. → pages 3 [RW98] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, vol- ume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer- Verlag, Berlin, 1998. → pages 2, 4, 9, 19 [Sta05] R. Stafford. Random points in an n-dimensional hypersphere, 2005. → pages 12 [Wol75] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. Math. Programming Stud., pages 145–173, 1975. → pages 5 [Wsc04] M. Wschebor. Smoothed analysis of κ(A). J. Complexity, 20(1):97–107, 2004. → pages 36 [Xu01] S. Xu. Smoothing method for minimax problems. Comput. Op- tim. Appl., 20(3):267–279, 2001. → pages 3 60 Appendix A Tables Table 6.1: Test Set Summary: problem name and number, problem dimen- sion (N), and number of sub-functions (M); * denotes an absolute value operation (doubled number of sub-functions). Prob. # Name N M 2.1 CB2 2 3 2.2 WF 2 3 2.3 SPIRAL 2 2 2.4 EVD52 3 6 2.5 Rosen-Suzuki 4 4 2.6 Polak 6 4 4 2.7 PCB3 3 42* 2.8 Bard 3 30* 2.9 Kow.-Osborne 4 22* 2.10 Davidon 2 4 40* 2.11 OET 5 4 42* 2.12 OET 6 4 42* 2.13 GAMMA 4 122* 2.14 EXP 5 21 2.15 PBC1 5 60* 2.16 EVD61 6 102* 2.18 Filter 9 82* 2.19 Wong 1 7 5 2.20 Wong 2 10 9 2.21 Wong 3 20 18 2.22 Polak 2 10 2 2.23 Polak 3 11 10 2.24 Watson 20 62* 2.25 Osborne 2 11 130* 61 Appendix A. Tables Table 6.2: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the simplex gradient. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 3018 2.082 2855 2.120 2580 9.470 202 6.759 2.2 3136 4.565 3112 4.987 4179 13.211 418 6.343 2.3 3085 0.002 3087 0.002 3090 0.002 3096 0.002 2.4 3254 2.189 3265 2.238 2986 11.559 367 7.570 2.5 3391 1.379 3138 1.351 3576 1.471 539 1.471 2.6 3260 1.236 3341 1.228 4258 1.338 859 1.338 2.7 2949 1.408 2757 1.367 4155 9.939 4190 7.230 2.8 4959 0.879 4492 0.913 3634 9.941 3435 7.655 2.9 2806 0.732 3303 0.581 16000 8.049 13681 3.975 2.10 2978 3.343 2993 3.342 3567 3.459 1924 3.459 2.11 3303 2.554 3453 2.559 35367 6.099 11725 5.063 2.12 2721 1.866 3117 1.871 15052 2.882 8818 2.660 2.13 2580 1.073 2706 0.874 43618 1.952 141 1.679 2.14 3254 1.585 3289 1.086 7713 2.696 4221 1.476 2.15 3917 0.262 5554 0.259 31030 0.286 12796 0.277 2.16 3711 2.182 4500 2.077 20331 3.242 11254 2.178 2.18 10468 0.000 10338 0.000 76355 17.717 30972 17.138 2.19 3397 0.376 3327 0.351 5403 7.105 1767 7.169 2.20 4535 1.624 4271 1.624 8757 8.435 7160 6.073 2.21 8624 2.031 8380 2.157 15225 1.334 11752 1.393 2.22 1563 0.958 1408 1.042 64116 3.049 1256 2.978 2.23 7054 2.557 10392 2.744 6092 6.117 970 6.178 2.24 4570 0.301 7857 0.298 93032 0.447 21204 0.328 2.25 3427 0.339 4197 0.340 98505 0.342 343 0.342 62 Appendix A. Tables Table 6.3: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the centered simplex gradient. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 3769 2.054 3573 2.051 2351 9.469 221 7.125 2.2 3705 6.888 1284 5.154 4151 9.589 330 5.594 2.3 5410 0.003 5352 0.003 5332 0.003 5353 0.003 2.4 4059 2.520 4154 2.456 4347 11.578 296 6.834 2.5 3949 1.422 3813 1.437 4112 1.471 452 1.471 2.6 3756 1.302 3880 1.309 4815 1.338 879 1.338 2.7 4227 1.435 4187 1.373 5285 9.950 7164 6.372 2.8 6928 0.988 6933 1.003 4116 9.939 3754 7.775 2.9 3301 0.933 3743 0.949 17944 8.072 13014 2.436 2.10 3447 3.343 3424 3.342 4744 3.459 427 3.459 2.11 3593 2.768 4082 2.785 47362 6.344 11886 5.115 2.12 3321 1.892 3406 1.876 15550 2.843 10726 2.651 2.13 3067 1.355 3508 1.216 36969 1.873 519 1.643 2.14 3967 1.771 6110 1.152 9757 2.692 7284 1.510 2.15 4646 0.272 6014 0.273 23947 0.280 15692 0.277 2.16 4518 2.223 6911 2.074 22225 2.628 17001 2.215 2.18 30492 16.931 14671 16.634 125859 17.804 20815 17.293 2.19 4473 0.551 4484 0.591 8561 7.113 1697 5.851 2.20 5462 1.615 5503 1.599 8908 9.011 7846 6.042 2.21 11629 1.887 11724 1.661 18957 1.304 17067 1.339 2.22 1877 1.166 1604 1.160 1453 3.139 2066 3.644 2.23 3807 2.150 7850 3.586 15625 6.117 1020 6.230 2.24! 7198 0.302 12745 0.301 115787 0.436 61652 0.329 2.25 4749 0.339 4896 0.341 256508 0.342 568 0.342 63 Appendix A. Tables Table 6.4: Average accuracy for 25 trials obtained by the AGS and RAGS algorithm for the Gupal estimate of the gradient of the Steklov averaged function. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 2775 2.448 2542 2.124 13126 3.896 89 2.708 2.2 3729 3.267 2221 2.813 5029 15.904 1776 7.228 2.3 2243 0.000 2262 0.000 2276 0.000 2255 0.000 2.4 2985 2.771 2841 2.892 3475 3.449 2362 3.738 2.5 3493 1.213 3529 1.196 3447 1.211 338 1.200 2.6 3144 0.187 3245 0.188 3018 0.162 3059 0.162 2.7 2631 1.368 3129 1.248 2476 1.048 2208 1.047 2.8 2711 1.125 3898 0.893 2231 0.514 5846 0.515 2.9 3102 0.727 3011 0.600 2955 0.937 3248 0.863 2.10 3075 3.241 2927 3.272 3100 0.000 3050 0.000 2.11 2947 1.527 3307 1.528 3003 1.560 2905 1.560 2.12 3095 1.099 7179 0.000 2670 0.788 7803 0.000 2.13 2755 0.710 1485 0.715 2517 0.231 6871 0.227 2.14 2965 0.574 3070 0.427 2860 0.708 4571 0.668 2.15 2658 0.010 2386 0.017 3355 0.050 3210 0.031 2.16 3431 0.457 3256 0.459 2861 0.199 2620 0.119 2.18 3936 16.345 5814 0.000 3950 16.451 6598 4.542 2.19 3337 0.014 3270 0.011 3488 0.970 3376 0.957 2.20 4604 0.835 4434 0.808 9459 1.360 10560 1.359 2.21 5468 0.000 5418 0.000 6632 0.641 6159 0.635 2.22 21 0.000 21 0.000 21 0.000 21 0.000 2.23 5436 1.814 5176 1.877 1.00E+06 2.354 954 2.415 2.24 7426 0.280 171 0.017 7927 0.043 6389 0.283 2.25 4519 0.286 4814 0.300 3760 0.017 3209 0.023 64
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 16 | 8 |
China | 3 | 0 |
France | 1 | 0 |
City | Views | Downloads |
---|---|---|
Ashburn | 16 | 0 |
Beijing | 3 | 0 |
Unknown | 1 | 3 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Share to: