A Derivative-Free Approximate Gradient Sampling Algorithm for Finite Minimax Problems by Julie Ann Nutini B.Sc. Hons., The University of British Columbia, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The College of Graduate Studies (Mathematics) THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan) April 2012 c Julie Ann Nutini 2012Abstract Mathematical optimization is the process of minimizing (or maximizing) a function. An algorithm is used to optimize a function when the minimum cannot be found by hand, or nding the minimum by hand is ine cient. The minimum of a function is a critical point and corresponds to a gradient (derivative) of 0. Thus, optimization algorithms commonly require gradient calculations. When gradient information of the objective function is unavail- able, unreliable or ‘expensive’ in terms of computation time, a derivative-free optimization algorithm is ideal. As the name suggests, derivative-free op- timization algorithms do not require gradient calculations. In this thesis, we present a derivative-free optimization algorithm for nite minimax prob- lems. Structurally, a nite minimax problem minimizes the maximum taken over a nite set of functions. We focus on the nite minimax problem due to its frequent appearance in real-world applications. We present convergence results for a regular and a robust version of our algorithm, showing in both cases that either the function is unbounded below (the minimum is 1) or we have found a critical point. Theoretical results are explored for stopping conditions. Additionally, theoretical and numerical results are presented for three examples of approximate gradients that can be used in our algorithm: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithm, the three approximate gradients, and the regular and robust stopping conditions. Finally, an ap- plication in seismic retro tting is discussed. iiTable of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Derivative-Free Optimization . . . . . . . . . . . . . . . . . . 3 1.4 DFO and Finite Max Functions . . . . . . . . . . . . . . . . 3 1.5 Method of Steepest Descent . . . . . . . . . . . . . . . . . . 6 1.6 Basic De nitions and Results . . . . . . . . . . . . . . . . . . 8 2 Approximate Gradient Sampling Algorithm . . . . . . . . . 10 2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Robust Approximate Gradient Sampling Algorithm . . . . 23 3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 Descent Direction: dY . . . . . . . . . . . . . . . . . . 29 3.3 Robust Stopping Conditions . . . . . . . . . . . . . . . . . . 30 4 Approximate Gradients . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Simplex Gradient . . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.1 De nitions . . . . . . . . . . . . . . . . . . . . . . . . 34 iiiTable of Contents 4.1.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 35 4.1.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Centered Simplex Gradient . . . . . . . . . . . . . . . . . . . 36 4.2.1 De nitions . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Gupal Estimate of the Gradient of the Steklov Averaged Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.1 De nitions . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . 39 4.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 44 5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1 Versions of the AGS Algorithm . . . . . . . . . . . . . . . . . 45 5.2 Test Sets and Software . . . . . . . . . . . . . . . . . . . . . 46 5.3 Initialization and Stopping Conditions . . . . . . . . . . . . . 47 5.4 Computing a Descent Direction . . . . . . . . . . . . . . . . 47 5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.6 An Application in Seismic Retro tting . . . . . . . . . . . . 53 6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 55 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Appendix A Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 61 ivList of Tables 2.1 Glossary of Algorithm Notation . . . . . . . . . . . . . . . . . 10 5.1 Damper Coe cient Selection Results . . . . . . . . . . . . . . 54 6.1 Summary of Test Set Problems . . . . . . . . . . . . . . . . . 61 6.2 Numerical Results: Simplex Gradient . . . . . . . . . . . . . . 62 6.3 Numerical Results: Centered Simplex Gradient . . . . . . . . 63 6.4 Numerical Results: Gupal Estimate . . . . . . . . . . . . . . . 64 vList of Figures 1.1 A nonsmooth function and its gradients . . . . . . . . . . . . 1 1.2 Method of Steepest Descent . . . . . . . . . . . . . . . . . . . 6 2.1 An Armijo-like line search . . . . . . . . . . . . . . . . . . . . 13 3.1 Surface plot of Problem 2.1 CB2 . . . . . . . . . . . . . . . . 24 3.2 Iterations for Problem 2.1 CB2 . . . . . . . . . . . . . . . . . 24 5.1 Performance pro les: 12 versions of AGS/RAGS algorithm . . 51 viAcknowledgements First, and foremost, I would like thank my supervisor, Dr. Warren Hare, whose guidance, patience and encouragement have been unfailing. I am incredibly grateful for the numerous occasions when his con dence in my ability far exceeded my own, and feel truly fortunate to have worked under his supervision. Without him, this thesis would not have come to its fruition, my passport would still be without stamps and the idea of ‘proof by fairy dust’ would still be a logical option. I would like to thank my committee members, Dr. Heinz Bauschke, Dr. Shawn Wang and Dr. Erik Rosolowsky, for their time and helpful suggestions throughout this challenging, yet rewarding journey. I would like to express my sincere gratitude to Dr. Claudia Sagastiz abal for our conversations in Chile, especially with respect to the Goldstein ap- proximate subdi erential. I would like to thank Kasra Bigdeli and Dr. Solomon Tesfamariam for taking an interest in my algorithm and for furthering my passion for algo- rithm design. I would like to thank Dr. Rebecca Tyson and Dr. Blair Spearman for encouraging me to pursue studies in optimization, and Dr. Yves Lucet for showing an invested interest in my research through his helpful, thought provoking questions. I would like to thank UBC Okanagan and the many donors who have provided me with incredible nancial support over the years. Last, but not least, I would like to thank my family and friends for their understanding, faith and endless love throughout my studies. Many of the results in this thesis are currently in submission to Compu- tational Optimization and Applications (see [HN12]). viiTo my dad, who shares my love of mathematics, and to my mom, who is my constant reminder...va piano. viiiChapter 1 Introduction Mathematical optimization is the process of minimizing (or maximizing) a function. We use an optimization algorithm when either the minimum of a function cannot be found by hand, or nding the minimum by hand is ine cient. In the rst year calculus classroom, optimization algorithms are simple: take the derivative of a function, set the derivative equal to 0 and solve for x. For the functions explored in rst year calculus, this process generally works ne. However, many real-world problems result in nonsmooth functions. For example, we consider the following nonsmooth nite max function: f(x; y) = 10jxj+ y2 = maxf10x+ y2; 10x+ y2g: (a) Graph of f(x; y). (b) Selected gradients of f(x; y). Figure 1.1: Graph of f(x; y) and its gradients. We can see in Figure 1.1(a) that f(x; y) has a ridge along the line x = 0 where the function is not di erentiable. Figure 1.1(b) shows the pattern of gradients for f(x; y) as if looking down onto the bottom ridge. Notice 11.1. Notations that for any point far from the ridge, the negative of the gradient points roughly towards the minimum. However, for any point near the ridge, the negative of the gradient points roughly perpendicular to the ridge, which is not necessarily the direction of the minimum. In this thesis we develop a novel optimization algorithm speci cally de- signed for functions with this nite max structure. 1.1 Notations In this thesis, we use the following notations: 1. j j denotes the Euclidean norm and k k denotes the corresponding matrix norm. 2. As de ned in [Cla90], the function f : Rn ! R is locally Lipschitz at a point x 2 Rn if there exists a scalar L and > 0 such that jf(y1) f(y2)j Ljy1 y2j for all y1 and y2 in the open ball of radius about x. 3. If f1 and f2 are continuous, then maxff1; f2g is continuous. If f1 and f2 are Lipschitz, then maxff1; f2g is Lipschitz [RW98, Prop 9.10]. However, we note that even if f1 and f2 are di erentiable, this does not imply that maxff1; f2g is di erentiable. 1.2 The Problem We consider the nite minimax problem min x f(x) where f(x) = maxffi(x) : i = 1; : : : ; Ng; where each individual fi is continuously di erentiable. Finite minimax problems occur in numerous applications, such as port- folio optimization [CTYZ00], control system design [IOKK04], engineering design [Pol87], and determining the cosine measure of a positive spanning set [CSV09, Def 2.7]. In a nite max function, although each individ- ual fi may be smooth, taking the maximum forms a nonsmooth function with ‘nondi erentiable ridges’. For this reason, most algorithms designed to solve nite minimax problems employ some form of smoothing technique; 21.3. Derivative-Free Optimization [PGL93], [PRW03], [Pol88], and [Xu01] (among many others). In general, these smoothing techniques require gradient calculations. However, in many applications gradient information is not available or can be di cult to compute accurately (see [BJF+98], [DV04], [Har10], [MFT08] and [CSV09, Chpt 1] for some examples of such situations). Thus, for the purpose of this thesis, we further restrict ourselves to the eld of derivative-free optimization, where we are only permitted to compute func- tion values, i.e., we cannot compute gradient values rfi directly. 1.3 Derivative-Free Optimization The research area of derivative-free optimization (DFO) has blossomed in recent years. As previously stated, DFO algorithms are useful in situations when gradient information is not available, di cult to compute accurately or ‘expensive’ to compute in relation to computation time. For example, if a function is given by a simulation, then gradient information may not be available. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [CSV09]. As there are no gradient calculations required in DFO algorithms, it is assumed that function evaluations are the most ‘expensive’ computations in terms of CPU time. Thus, the performance of a DFO algorithm is based on the number of function evaluations required to solve the problem. 1.4 DFO and Finite Max Functions In relation to our problem, research on optimizing nite max functions without calculating derivatives can be seen as early as 1975 [Mad75], while more recently we have seen a resurface in this area, [LLS06] and [HM11]. In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for nite minimax problems [LLS06]. This method is shown to globally converge towards a standard stationary point of the original nite minimax problem. We want to take a step away from the prevalent smoothing techniques used to solve nite minimax problems. Instead of altering the function with a smoothing technique, we look to solve the nite minimax problem with a DFO method that exploits its smooth substructure. To understand how we do this, we rst need to de ne the following terms. 31.4. DFO and Finite Max Functions De nition 1.1. Let f : Rn ! R be locally Lipschitz at a point x. Then by Rademacher’s Theorem ([RW98, Thm 9.60]), f is di erentiable almost everywhere on Rn. Let f be the set of points at which f fails to be di erentiable. Then the Clarke subdi erential, as de ned in [Cla90], is given by the set @f( x) = conv(lim j rf(yj) : yj ! x; yj =2 f ): (1.1) Basically, the subdi erential of a function at x is the set of all possible gradients near x. As an example, we consider the absolute value function, f(x) = jxj in R. Clearly, f(x) is not di erentiable at 0. The subdi erential of f at the point 0 is the set @f(0) = conv(1; 1); as f 0(x) = 1 if x > 0 1 if x < 0 : De nition 1.2. A descent direction of a continuously di erentiable func- tion f at a given point x 2 dom(f) as de ned in [CSV09] is any vector d such that hd; vi < 0 for all v 2 @f( x). To explain how subdi erentials can be used to nd a descent direction, we rst de ne the projection. De nition 1.3 (De nition 3.7, [BC11]). Let C Rn be a nonempty closed convex set and let x 2 Rn. Let p 2 C. Then p is the unique projection of x onto C, denoted by Proj(xjC), if p 2 arg min y fjy xj : y 2 Cg: A point x is a (Clarke) stationary point of a function f when 0 2 @f( x). Thus, we de ne the direction of steepest descent as follows. De nition 1.4. The direction of steepest descent as de ned in [CSV09] is given by d = Proj(0j@f( x)): 41.4. DFO and Finite Max Functions In 2011, Hare and Macklem presented a derivative-free method that ex- ploits the smooth substructure of the nite minimax problem. It combines the frameworks of a directional direct search method [CSV09, Chpt 7] and the gradient sampling algorithm (GS algorithm) presented in [BLO02] and [BLO05]. Loosely speaking, the GS algorithm uses a collection of local gra- dients to build a ‘robust subdi erential’ of the objective function and uses this to determine a ‘robust descent direction’. In [HM11], these ideas are used to develop several methods to nd an approximate descent direction that moves close to parallel to an ‘active manifold’. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdif- ferential, which is then used to determine a likely descent direction. Ideas from the GS algorithm have appeared in two other recent DFO methodologies: [BKS08] and [Kiw10]. In 2008, Bagirov, Karas ozen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [BKS08]. Described as a derivative-free version of the bundle method pre- sented in [Wol75], the method uses discrete gradients to approximate sub- gradients of the function and build an approximate subdi erential. The analysis of this method provides proof of convergence to a Clarke station- ary point for an extensive class of nonsmooth problems. In this thesis, we focus on the nite minimax problem. This allows us to require few (other) assumptions on our function while maintaining strong convergence analysis. It is worth noting that we use the same set of test problems as in [BKS08]. Speci cally, we use the [LV00] test set and exclude one problem as its sub- functions are complex-valued. (The numerics in [BKS08] exclude the same problem, and several others, without explanation.) Using approximate gradient calculations instead of gradient calculations, the GS algorithm is made derivative free by Kiwiel in [Kiw10]. Speci cally, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [Gup77] or Section 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satis es the same convergence results as the GS algorithm { it either drives the f - values to 1 or each cluster point is found to be Clarke stationary [Kiw10, Theorem 3.8]. No numerical results are presented for Kiwiel’s derivative-free algorithm. 51.5. Method of Steepest Descent 1.5 Method of Steepest Descent To obtain a general understanding of the framework of the algorithm presented in this thesis, we recall the classical method of steepest descent. 1. Search Direction: Set dk = Proj(0j@f(xk)). 2. Step Length: Find tk by solving min tk>0 ff(xk + tkd k)g. 3. Update: Set xk+1 = xk + tkdk. Increase k = k + 1. Loop. Figure 1.2: An example of the method of steepest descent. In each iteration, the method of steepest descent calculates the direction of steepest descent, i.e., Proj(0j@f(xk)), does a line search in this direction and then updates the iterate. This allows the algorithm to move perpendic- ular to the contours of the function, towards the minimum. The method of steepest descent works well for smooth functions. However, for nonsmooth functions, the optimal solution often occurs at a point of nondi erentiabil- ity, which results in a very slow convergence rate for the method of steepest descent. In general, our algorithm takes the method of steepest descent and makes it derivative free; instead of calculating the subdi erential, we calculate an approximate subdi erential; instead of nding the direction of steepest de- scent, we project 0 onto our approximate subdi erential and nd an approx- imate direction of steepest descent. With these alterations, our algorithm is able to navigate along nondi erentiable ridges, more rapidly minimizing the function. Speci cally, we use the GS algorithm framework to form a derivative- free approximate gradient sampling algorithm. As we are dealing with nite max functions, instead of calculating an approximate gradient of f at each 61.5. Method of Steepest Descent of the sampled points, we calculate an approximate gradient of each of the active functions at the current iterate. We say a function fi is active if its index is an element of the active set. De nition 1.5. We de ne the active set of f at x to be the set of indices A( x) = fi : f( x) = fi( x)g: (1.2) We denote the set of active gradients of f at x by frfi( x)gi2A( x): Expanding the active set to include ‘almost’ active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of non- di erentiability, the size and shape of our approximate subdi erential will re ect the presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdi erential, the descent direction will direct min- imization parallel to a ‘nondi erentiable ridge’, rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly in uence the performance of our algorithm. The algorithm presented in this thesis di ers from those mentioned above in a few key manners. Unlike in [LLS06] we do not employ a smoothing tech- nique. Unlike in [HM11], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent frame- work. Using this framework, we are able to analyze convergence directly and develop stopping analysis based on our stopping conditions for the al- gorithm. Unlike in [BKS08] and [Kiw10], where convergence is proven for a speci c approximate gradient, we prove convergence for any approximate gradient that satis es a simple error bound dependent on the sampling ra- dius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged func- tion. (As a side-note, Section 4.3 also provides, to the best our knowledge, a novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.) Furthermore, unlike in [Kiw10], where convergence to a stationary point is proved with probability 1, we prove convergence to a critical point for every cluster point of a sequence generated by our algo- rithm. Focusing on the nite minimax problem provides us with an advantage over the methods of [BKS08] and [Kiw10]. In particular, we only require or- der n function calls per iteration (where n is the dimension of the problem), 71.6. Basic De nitions and Results while both [BKS08] and [Kiw10] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approx- imate subdi erential). (The original GS algorithm suggests that m 2n provides a good value for m.) 1.6 Basic De nitions and Results We denote by C1 the class of di erentiable functions whose gradient mapping r is continuous. As stated previously, we assume that each of the individual fi in the nite max function is continuously di erentiable, i.e., fi 2 C1. We denote by C1+ the class of continuously di erentiable functions whose gradient mapping r is locally Lipschitz. We say f 2 C1+ with constant L if its gradient mapping r has a Lipschitz constant L. We denote by C2+ the class of twice continuously di erentiable functions whose gradient mapping r2 is locally Lipschitz. We say f 2 C2+ with constant L if its gradient mapping r2 has a Lipschitz constant L. The following result reveals that the subdi erential for a nite max func- tion at a point x is equal to the convex hull of the active gradients of f at x. Proposition 1.6. Let f = maxffi : i = 1; : : : ; Ng. If fi 2 C1 for each i 2 A( x), then @f( x) = conv(rfi( x))i2A( x). Proof. See Proposition 2.3.12, [Cla90]. It is important to note that the subdi erential as de ned in Proposition 1.6 is a compact set. We formally state and prove this result in the following corollary. Corollary 1.7. Let f = maxffi : i = 1; : : : ; Ng where fi 2 C1 for each i 2 A( x). The subdi erential of f at x is a compact set. Proof. By Proposition 1.6, @f( x) = conv(rfi( x))i2A( x): As there are a nite number of functions, A( x) is a nite set, which implies frfi( x)g is a nite set of points. The convex hull of a nite set of points is closed and bounded. Hence, @f( x) is a compact set. For our proof of convergence in Section 2.2, we require that @f is outer semicontinuous. First, we de ne the limit superior for a mapping. 81.6. Basic De nitions and Results De nition 1.8. For a mapping S : Rn ! Rn, we de ne the limit superior as lim sup x! x S(x) = fy : there exists xk ! x and yk ! y with yk 2 S(xk)g: De nition 1.9. As de ned in [RW98, Def 5.4], S is outer semicontinuous (osc) at x if lim sup x! x S(x) = S( x): To see that the subdi erential is outer semicontinuous, we have the fol- lowing proposition from [RW98]. Proposition 1.10. For a continuous function f : Rn ! R, the mapping @f is outer semicontinuous everywhere. Proof. See Proposition 8.7, [RW98]. 9Chapter 2 Approximate Gradient Sampling Algorithm In Chapter 2, we state the approximate gradient sampling algorithm (AGS algorithm), present the details as to when and why the AGS algo- rithm necessarily nds function value decrease using an arbitrary approx- imate gradient of each of the active fi, denoted by rAfi, and then show that the AGS algorithm converges to a local minimum. We note that no speci c approximate gradient is discussed in this section. Furthermore, we emphasize that the error bound requirement used in the convergence results (Section 2.2) for the arbitrary approximate gradient is achievable. Exam- ples of approximate gradients, their structures and their error bounds are discussed in Section 4. 2.1 Algorithm We rst provide a partial glossary of the notation used in the statement of the AGS algorithm. Table 2.1: Glossary of notation used in AGS algorithm. Glossary of Notation k: Iteration counter xk : Current iterate k: Accuracy measure k: Sampling radius m: Sample size : Sampling radius reduction factor yj : Sampling points Y : Sampled set of points : Armijo-like parameter dk: Search direction tk: Step length tmin: Minimum step length rAfi: Approximate gradient of fi A(xk): Active set at xk Gk: Approximate subdi erential "tol: Stopping tolerance 102.1. Algorithm We state the theoretical AGS algorithm and then provide a detailed description of the algorithm. Conceptual Algorithm: [AGS Algorithm] 0. Initialize: Set k = 0 and input x0 - starting point 0 > 0 - accuracy measure 0 > 0 - initial search radius 2 (0; 1] - search radius reduction factor 0 < < 1 - Armijo-like parameter tmin - minimum step length "tol > 0 - stopping tolerance 1. Generate Approximate Subdifferential Gk: Generate a set Y = [xk; y1; : : : ym] around the current iterate xk such that max j=1;:::;m jyj xkj k: Use Y to calculate the approximate gradient of fi, denoted rAfi, at xk for each i 2 A(xk). Set Gk = conv(rAfi(x k))i2A(xk): 2. Generate Search Direction: Let dk = Proj(0jGk): Check if k kjd kj: (2.1) If (2.1) does not hold, then set xk+1 = xk, set k+1 = ( kjdkj if jdkj 6= 0 k if jdkj = 0 ; (2.2) and go to Step 4. If (2.1) holds and jdkj < "tol, then STOP. Else, continue to the line search. 112.1. Algorithm 3. Line Search: Attempt to nd tk > 0 such that f(xk + tkd k) < f(xk) tkjd kj2: Line Search Failure: Set k = k 2 , xk+1 = xk and go to Step 4. Line Search Success: Let xk+1 be any point such that f(xk+1) f(xk + tkd k): 4. Update and Loop: Set k+1 = max j=1;:::;m jyj xkj, k = k + 1 and return to Step 1. In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point x0, and initialize the parameter values. In Step 1, we create the approximate subdi erential. First, we select a set of points around xk within a search radius of k. In implementa- tion, the points are randomly and uniformly sampled from the interior of an n-dimensional hypersphere of radius k centered at the origin (using the MATLAB randsphere.m function [Sta05]). Using this set Y , we then cal- culate an approximate gradient of each of the active functions at xk and set the approximate subdi erential Gk equal to the convex hull of these active approximate gradients. In Step 2, we generate a search direction by solving the projection of 0 onto the approximate subdi erential: Proj(0jGk) 2 arg ming2Gk jgj 2. The search direction dk is set equal to the negative of the solution, i.e., dk = Proj(0jGk). After nding a search direction, we check the condition k kjdkj (equation (2.1)). This condition determines if the current search radius is su ciently small relative to the distance from 0 to the approximate subd- i erential. If equation (2.1) holds and jdkj < "tol, then we terminate the algorithm, as 0 is within "tol of the approximate subdi erential and the search radius is small enough to reason that the approximate subdi erential is accurate. If equation (2.1) does not hold, then the approximate subdif- ferential is not su ciently accurate to warrant a line search, so we decrease 122.1. Algorithm the search radius according to equation (2.2) and loop (Step 4). If equation (2.1) holds, but jdkj "tol, then we proceed to a line search. In Step 3, we carry out a line search. We attempt to nd a step length tk > 0 such that the Armijo-like condition holds f(xk + tkd k) < f(xk) tkjd kj2: (2.3) The following gures illustrate the Armijo(-like) line search for the curve f(x) = 14x 2. Figure 2.1(a) is representative of a gradient based method (Armijo line search), while Figure 2.1(b) is representative of a DFO method (Armijo-like line search). (a) A gradient based line search. (b) A derivative-free line search. Figure 2.1: A graphical depiction of the Armijo and Armijo-like condition. To satisfy the condition in equation (2.3), the algorithm must nd a point on the curve that lies on or below the Armijo descent line for both methods. Any point found below the Armijo descent line will result in a successful line search. Thus, the Armijo(-like) condition prevents the algorithm from taking multiple steps with minimal function value decrease. In other words, it ensures su cient function value decrease is found when a line search success is declared. Comparing Figures 2.1(a) and 2.1(b), we can see that the DFO Armijo descent line does not allow the algorithm to accept a point x 0:5. This prevents the algorithm from jumping back and forth across the y-axis, thus, resulting in a faster convergence rate. In implementation, we use a back-tracking line search (described in [NW99]) with an initial step-length of tini = 1. Basically, we test to see if equation (2.3) holds for tk = tini. If it holds, then we declare a line search 132.2. Convergence success. If it does not hold, we reduce tk by a factor of 0:5 and we test equa- tion (2.3) again. If we have tk < tmin without declaring a line search success, then we declare a line search failure. If we nd a tk such that equation (2.3) holds, then we declare a line search success. If a line search success occurs, then we let xk+1 be any point such that f(xk+1) f(xk + tkd k); (2.4) thus, forming a non-increasing sequence of function values ff(xk)gk=0. In implementation, we do this by searching through the function values used in the calculation of our approximate gradients (ff(yi)gyi2Y ). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of nding further function value decrease without having to carry out additional function evaluations. We nd the minimum function value in our set of evaluations and if equation (2.4) holds for this minimum value, then we set xk+1 equal to the corresponding input point. Otherwise, we set xk+1 = xk + tkdk. If a line search failure occurs, then we reduce the accuracy measure k by a factor of 0:5 and set xk+1 = xk. (Notice that as k is decreased each time a line search failure occurs, f kgk=0 is a non-increasing sequence.) Finally, in Step 4, we set k+1 = maxj=1;:::;m jyj xkj, update the iterate counter and loop to Step 1 to resample. 2.2 Convergence In order to prove convergence for the AGS algorithm, we must show that the direction dk is a descent direction, that the algorithm is well-de ned (eventually nds function value decrease) and that the stopping conditions are su cient, i.e., when the algorithm satis es the stopping conditions, the distance to a critical point is controlled by "tol. Remark 2.1. For the following results, we denote the approximate subdif- ferential of f at x as G( x) = conv(rAfi( x))i2A( x); where rAfi( x) is the approximate gradient of fi at x. Our rst result shows that the approximate subdi erential generated by the AGS algorithm is a good approximate of the exact subdi erential. We use part 1 of the following lemma to establish that our stopping conditions are su cient. We use part 2 of the following lemma to establish that dk is a descent direction. 142.2. Convergence Lemma 2.2. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose there exists an " > 0 such that jrAfi( x) rfi( x)j " for all i. Then 1. for all w 2 G( x), there exists a v 2 @f( x) such that jw vj ", and 2. for all v 2 @f( x), there exists a w 2 G( x) such that jw vj ". Proof. 1. By de nition, for all w 2 G( x) there exists a set of i such that w = X i2A( x) irAfi( x); where i 0; X i2A( x) i = 1: By Proposition 1.6, as each fi 2 C1 we have @f( x) = conv(rfi( x))i2A( x). Using the same i as above, we see that v = X i2A( x) irfi( x) 2 @f( x): Then jw vj = j P i2A( x) irAfi( x) P i2A( x) irfi( x)j P i2A( x) ijrAfi( x) rfi( x)j P i2A( x) i" (by assumption) = " (as P i2A( x) i = 1). Hence, for all w 2 G( x), there exists a v 2 @f( x) such that jw vj ": (2.5) 2. We have @f( x) = conv(rfi( x))i2A( x). So for all v 2 @f( x), there exist i such that v = X i2A( x) irfi( x) where i 0; X i2A( x) i = 1: Using the same i as above, we see that w = X i2A( x) irAfi( x) 2 G( x): Using the same argument as in part 1, we can conclude that for all v 2 @f( x), there exists a w 2 G( x) such that jw vj ". 152.2. Convergence Our next goal (in Theorem 2.6) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this, we rst state the Projection Theorem. Theorem 2.3. (The Projection Theorem) Let C Rn be a nonempty closed convex set. Then for every x and p in Rn p = Proj(xjC) () p 2 C and hx p; y pi 0 for all y 2 C: Proof. See Theorem 3.14, [BC11]. We use the Projection Theorem to prove d = Proj(0jG( x)) is a descent direction. Lemma 2.4. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose there exists an " > 0 such that jrAfi( x) rfi( x)j " for all i. De ne d = Proj(0jG( x)) and suppose jdj 6= 0. Let 2 (0; 1). If " < (1 )jdj, then for all v 2 @f( x) we have hd; vi < jdj2: Proof. Notice that, by Theorem 2.3, d = Proj(0jG( x)) implies that h0 ( d); w ( d)i 0 for all w 2 G( x): Hence, hd;w + di 0 for all w 2 G( x): (2.6) So we have for all v 2 @f( x) hd; vi = hd; v w + w d+ di for all w 2 G( x) = hd; v wi+ hd;w + di+ hd; di for all w 2 G( x) hd; v wi jdj2 for all w 2 G( x) (by (2.6)) jdjjv wj jdj2 for all w 2 G( x), which follows by using Cauchy Schwarz. For any v 2 @f( x), using w as constructed in Lemma 2.2(2), we see that hd; vi jdj" jdj2 < jdj2(1 ) jdj2 (as " < (1 )jdj) = jdj2: 162.2. Convergence Using Lemma 2.4, we can easily show that d is a descent direction for f at x by setting = 0 and bounding " above by jdj. We notice that this condition on " requires the algorithm to reach a point where the error between the exact gradient and the approximate gradient is smaller than the distance from 0 to our approximate subdi erential. Corollary 2.5. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose there exists an " > 0 such that jrAfi( x) rfi( x)j " for all i. De ne d = Proj(0jG( x)) and suppose jdj 6= 0. If " < jdj, then hd; vi < 0 for all v 2 @f( x). To guarantee convergence, we must show that, except in the case of 0 2 @f(xk), the algorithm will always be able to nd a search radius that satis es the requirements in Step 2. In Section 4, we show that (for three di erent approximate gradients) the value " (in Lemma 2.4) is linked to the search radius . As unsuccessful line searches will drive to 0, this implies that eventually the requirements of Lemma 2.4 will be satis ed. We formalize this in the next two theorems. Theorem 2.6. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Sup- pose 0 62 @f(xk) for each iteration k. Suppose there exists K > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satis es jrAfi(xk) rfi(xk)j K k for all i. Let dk = Proj(0jG(xk)). Then for any > 0, there exists > 0 such that, jdkj+ K ( k ) for all 0 < < ; Moreover, if k < , then the following inequality holds k jd kj: Proof. Let v = Proj(0j@f(xk)) (by assumption, v 6= 0). Given > 0, let = 1 K + 1 j vj; (2.7) and consider 0 < < . Now create G(xk) and dk = Proj(0jG(xk)). As dk 2 G(xk), by Lemma 2.2(1), there exists a vk 2 @f(xk) such that j dk vkj K k: 172.2. Convergence Then K k j dk vkj ) K k jvkj jdkj ) K k j vj jdkj (as jvj j vj for all v 2 @f(xk) ). Thus, for 0 < < , we apply equation (2.7) to j vj in the above inequality to get K k ( K + 1 ) jdkj; which rearranges to (jdkj+ K ( k ): Hence, jdkj+ K ( k ) for all 0 < < . Finally, if k < , then k jd kj: Remark 2.7. In Theorem 2.6, it is important to note that eventually the con- dition k < will hold. Examine as constructed above: K is a constant and v is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless the condition k kjdkj is satis ed. As a result, if k , the AGS algorithm will reduce k, with remaining constant, until k < . Recall in Step 3 of the AGS algorithm, for a given 2 (0; 1), we attempt to nd a step length tk > 0 such that f(xk + tkd k) < f(xk) tkjd kj2 The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall Corollary 1.7, which states that the exact subdi erential for a nite max function is a compact set. Thus, we know that in the following theorem ~v is well-de ned. Theorem 2.8. Fix 0 < < 1. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let ~v 2 arg maxfhd; vi : v 2 @f( x)g. Suppose there exists an " > 0 such that jrAfi( x) rfi( x)j " for all i. De ne d = Proj(0jG( x)) and suppose jdj 6= 0. Let = 2 1+ . If " < (1 )jdj, then there exists t > 0 such that f( x+ td) f( x) < tjdj2 for all 0 < t < t: 182.2. Convergence Proof. Note that 2 (0; 1). Recall, from Lemma 2.4, we have for all v 2 @f( x) hd; vi < jdj2: (2.8) Using = 2 1+ , equation (2.8) becomes hd; vi < 2 1+ jdj 2 for all v 2 @f( x). (2.9) From equation (2.8) we can conclude that for all v 2 @f( x) hd; vi < 0: Using Theorem 8.30 in [RW98], we have lim &0 f( x+ d) f( x) = maxfhd; vi : v 2 @f( x)g = hd; ~vi < 0: Therefore, as +12 < 1 (since < 1) there exists t > 0 such that f( x+ td) f( x) t < + 1 2 hd; ~vi for all 0 < t < t: For such a t, we have f( x+ td) f( x) < + 1 2 thd; ~vi < + 1 2 2 + 1 tjdj2 (by (2.9)) < tjdj2: Hence, f( x+ td) f( x) < tjdj2 for all 0 < t < t: Combining the previous results, we show that the AGS algorithm is guar- anteed to nd function value decrease (provided 0 =2 @f(xk)). We summarize with the following corollary. Corollary 2.9. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Sup- pose 0 =2 @f(xk) for each iteration k. Suppose there exists a K > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satis es jrAfi(xk) rfi(xk)j K k for all i. Then after a nite number of iterations, the algorithm will nd a new iterate with a lower function value. 192.2. Convergence Proof. Consider xk, where 0 =2 @f(xk). To nd function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the stopping condition below is satis ed, i.e., k kjd kj: (2.10) In Theorem 2.6, we showed that for any k > 0, there exists a such that if k < , then equation (2.10) is satis ed. If k > in Step 2, i.e., equation (2.10) is not satis ed, then k is updated according to equation (2.2). Whether jdkj 6= 0 or jdkj = 0, we can see that k+1 k, so eventually k < . i.e., equation (2.10) will be satis ed and the AGS algorithm will carry out a line search. Now, in order to have a line search success, we must be able to nd a step length tk such that the Armijo-like condition holds, f(xk + tkd k) < f(xk) tkjd kj2: In Theorem 2.8, we showed that there exists t > 0 such that f(xk + tkd k) f(xk) < tkjd kj2 for all 0 < tk < t; provided that for 2 (0; 1), " < (1 )jdkj: (2.11) Set " = K k. If equation (2.11) does not hold, then a line search failure will occur, resulting in k+1 = 0:5 k. Thus, eventually we will have k < (1 ) K and k kjd kj < (1 ) K jdkj; which means equation (2.11) will hold. Thus, after a nite number of iter- ations, the AGS algorithm will declare a line search success and nd a new iterate with a lower function value. Now we are ready to prove convergence. In the following result, assuming that the step length tk is bounded away from 0 means that there exists a t > 0 such that tk > t. Remark 2.10. The following theorem does not ensure the existence of a clus- ter point. In order to ensure the existence of a cluster point, an additional assumption would have to be made, such as compact level sets. 202.2. Convergence Theorem 2.11. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let fxkg1k=0 be an in nite sequence generated by the AGS algorithm. Suppose there exists a K > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satis es the error bound jrAfi(xk) rfi(xk)j K k for all i. Suppose tk is bounded away from 0. Then either 1. f(xk) # 1, or 2. jdkj ! 0, k # 0 and every cluster point x of the sequence fxkg1k=0 satis es 0 2 @f( x). Proof. If f(xk) # 1, then we are done. Conversely, if f(xk) is bounded below, then f(xk) is non-increasing and bounded below, therefore f(xk) converges. We consider two cases. Case 1: An in nite number of line search successes occur. Let x be a cluster point of fxkg1k=0. Notice that x k only changes for line search successes, so there exists a subsequence fxkjg1j=0 of line search suc- cesses such that xkj ! x. Then for each corresponding step length tkj and direction dkj , the following condition holds f(xkj+1) f(xkj + tkjd kj ) < f(xkj ) tkj jd kj j2: Note that 0 tkj jd kj j2 < f(xkj ) f(xkj+1): Since f(xk) converges we know that f(xkj ) f(xkj+1) ! 0. Since tkj is bounded away from 0, we see that lim j!1 jdkj j = 0: Recall from the AGS algorithm, we check the condition kj kj jd kj j: As kj > 0, kj 0, and jd kj j ! 0, we can conclude that kj # 0. Finally, from Lemma 2.2(1), as dkj 2 G(xkj ), there exists a vkj 2 @f(xkj ) such that j vkj dkj j K kj ) j vkj j jdkj j K kj ) jvkj j K kj + jd kj j, 212.2. Convergence which implies that 0 jvkj j K kj + jd kj j ! 0: So, lim j!1 jvkj j = 0; where jvkj j dist(0j@f(xkj )) 0, which implies dist(0j@f(xkj )) ! 0. We have xkj ! x. As f is a nite max function, it is continuous and therefore, by Proposition 1.10, @f is outer semicontinuous. Hence, every cluster point x of a convergent subsequence of fxkg1k=0 satis es 0 2 @f( x). Case 2: A nite number of line search successes occur. This means there exists a k such that xk = x k = x for all k k. However, by Corollary 2.9, if 0 =2 @f( x), then after a nite number of iterations, the algorithm will nd function value decrease (line search success). Hence, we have 0 2 @f( x). Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdi erential is controlled by "tol. Theorem 2.12. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose there exists a K > 0 such that for each iteration k, the approximate gradient satis es jrAfi(xk) rfi(xk)j K k for all i. Suppose the AGS algorithm terminates at some iteration k in Step 2 for "tol > 0. Then dist(0j@f(x k)) < (1 + K 0)"tol: Proof. Let w = Proj(0jG(x k)). We use v 2 @f(x k) as constructed in Lemma 2.2(1) to see that dist(0j@f(x k)) dist(0j v) dist(0j w) + dist( wj v) (as w 2 G(x k), v 2 @f(x k)) = jd kj+ j w vj (as jd kj = j Proj(0jG(x k))j) jd kj+ " (by Lemma 2.2) < "tol + " (as jd kj < "tol). The nal statement now follows by the test k kjd kj in Step 2 and the fact that k 0 as f kgk=0 is a non-increasing sequence. 22Chapter 3 Robust Approximate Gradient Sampling Algorithm The AGS algorithm depends on the active set of functions at each iterate, A(xk). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically, such behaviour means that the current iterate is close to a ‘nondi erentiable ridge’ formed by the function. In [BLO02] and [BLO05], it is suggested that allowing an algorithm to take into account these ‘almost active’ functions will provide a better idea of what is happening at and around the current iterate, thus, making the algorithm more robust, i.e., performs for unusual or strict types of functions. In this section, we present the robust approximate gradient sampling algorithm (RAGS algorithm). Speci cally, we adapt the AGS algorithm by expanding our active set to include all functions that are active at xk or active at one (or more) of the points in the set Y = [xky1; : : : ; ym]. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius k. Thus, the points in Y are not far from the current iterate. We de ne the robust active set next. De nition 3.1. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let y0 = xk be the current iterate and Y = [y0; y1; y2; : : : ; ym] be a set of randomly sampled points from a ball centered at xk with radius k. The robust active set of f on Y is A(Y ) = [ yj2Y A(yj); (3.1) where each set A(yj) is de ned as in equation (1.2). To shed some light on the motivation for our robust algorithm, we con- sider the following function (originally from [CC78], Problem 2.1 CB2 in the test set [LV00]): 23Chapter 3. Robust Approximate Gradient Sampling Algorithm f = maxff1; f2; f3g, where f1(x; y) = x2 + y4, f2(x; y) = (2 x)2 + (2 y)2, f3(x; y) = 2 exp(y x). Figure 3.1: Surface plot for Problem 2.1 CB2. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 (a) AGS algorithm 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 (b) RAGS algorithm Figure 3.2: Iterations taken using the simplex gradient for Problem 2.1 CB2. These gures illustrate example iterations taken by the AGS algorithm and by the RAGS algorithm (presented in this section) for Problem 2.1 CB2 using the simplex gradient as an approximate gradient (see Section 4.1 for details used in numerical results). In Figure 3.2(a), we can see that the iterations move perpendicular to the contours of the functions towards the nondi erentiable ridge. However, eventually the iterations stall at the nondi erentiable ridge. These results 243.1. Algorithm show that the AGS algorithm is performing similar to how we would expect the method of steepest descent to perform on this problem. In Figure 3.2(b), we can see that the iterations proceed similar to the AGS algorithm until they reach the nondi erentiable ridge. Here, instead of stalling at the nondi erentiable ridge, the algorithm turns and mini- mizes along the nondi erentiable ridge, moving towards the optimal solu- tion. Thus, by incorporating the ‘almost active’ functions into our active set, our algorithm is able to further minimize the function f(x; y). 3.1 Algorithm For the RAGS algorithm, we incorporate the robust active set by replac- ing Steps 1 and 2 of the AGS algorithm from Section 2.1 with the following. 1. Generate Approximate Subdifferential GkY (Robust): Generate a set Y = [xk; y1; : : : ; ym] around the current iterate xk such that max j=1;:::;m jyj xkj k: Use Y to calculate the approximate gradient of fi, denoted rAfi, at xk for each i 2 A(Y ). Then set Gk = conv(rAfi(xk))i2A(xk) and GkY = conv(rAfi(x k))i2A(Y ). 2. Generate Search Direction: Let dk = Proj(0jGk): Let dkY = Proj(0jG k Y ): Check if k kjd kj: (3.2) If (3.2) does not hold, then set xk+1 = xk, set k+1 = ( kjdkj if jdkj 6= 0 k if jdkj = 0 ; and go to Step 4. If (3.2) holds and jdkj < "tol, then STOP. Else, continue to the line search, using dkY as a search direction. 253.2. Convergence Notice that in Step 2, we still use the stopping conditions from Section 2. Although this modi cation requires the calculation of two projections, it should be noted that neither of these projections are particularly di cult to calculate and that no additional function evaluations are required for this modi cation. In Section 3.3, we use the Goldstein approximate subdi er- ential to adapt Theorem 2.12 to work for stopping conditions based on dkY , but we still do not have theoretical results for the exact subdi erential. In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. For the RAGS algorithm, this alteration shows convincing results that the robust stopping conditions not only guarantee convergence, but signi cantly decrease the number of function evaluations required for the RAGS algorithm to con- verge. 3.2 Convergence To show that the RAGS algorithm is well-de ned we require that when is small enough, the robust active set is in fact equal to the original active set. Lemma 3.2. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let Y = [ x; y1; : : : ; ym] be a randomly sampled set from a ball centered at x with radius . Then there exists an ~" > 0 such that if Y B~"( x), then A( x) = A(Y ). Proof. Clearly, if i 2 A( x), then i 2 A(Y ), as x 2 Y . Consider i 62 A( x). Then by the de nition of f , we have that fi( x) < f( x): We know that f is continuous (see Section 1.1). Therefore, there exists an ~"i > 0 such that for all z 2 B~"i( x) fi(z) < f(z): If < ~"i, then we have jyj xj < ~"i for all j = 1; : : : ;m. Therefore, fi(y j) < f(yj) for all j = 1; : : : ;m; (3.3) so i 62 A(Y ). Setting ~" = min i 62A( x) ~"i completes the proof. 263.2. Convergence Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-de ned when using the robust active set. Corollary 3.3. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose 0 62 @f(xk) for each iteration k. Suppose there exists a K > 0 such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satis es jrAfi(xk) rf(xk)j K k for all i. Then after a nite number of iterations, the RAGS algorithm will nd function value decrease. Proof. Consider xk, where 0 =2 @f(xk). For eventual contradiction, suppose we do not nd function value decrease. In the RAGS algorithm, this corresponds to an in nite number of line search failures. If we have an in nite number of line search failures, then k ! 0 and x k = xk for all k k. In Lemma 3.2, ~" depends only on xk. Hence, we can conclude that eventually k < ~" and therefore Y k B~"(xk). Thus, eventually A(xk) = A(Y k). Once the two active sets are equal, the results of Section 2.2 will hold. Thus, by Corollary 2.9, the algorithm will nd a new iterate with a lower function value, a contradiction. To examine convergence of the RAGS algorithm we use the result that eventually the robust active set at the current iterate will be a subset of the regular active set at any cluster point of the algorithm. Lemma 3.4. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let Y k = [xk; y1; : : : ; ym] be a randomly sampled set from a ball centered at xk with radius k. Let xk ! x. Then there exists an ~" > 0 such that if Y k B~"( x), then A(Y k) A( x). Proof. Let i =2 A( x). By de nition of f , we have that fi( x) < f( x): Since f is continuous, there exists an ~"i > 0 such that for all z 2 B~"i( x) fi(z) < f(z): If Y k B~"i( x), then we have jx k xj < ~"i and jyj xj < ~"i for all j = 1; : : : ;m. Therefore fi(x k) < f(xk) and fi(y j) < f(yj) for all j = 1; : : : ;m: Therefore, i 62 A(Y k). Letting ~" = min i 62A( x) ~"i completes the proof. 273.2. Convergence Now we examine the convergence of the RAGS algorithm. Theorem 3.5. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let fxkg1k=0 be an in nite sequence generated by the RAGS algorithm. Suppose there exists a K > 0 such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satis es the error bound jrAfi(xk) rfi(xk)j K k for all i. Suppose tk is bounded away from 0. Then either 1. f(xk) # 1, or 2. jdkj ! 0, k # 0 and every cluster point x of the sequence fxkg1k=0 satis es 0 2 @f( x). Proof. If f(xk) # 1, then we are done. Conversely, if f(xk) is bounded below, then f(xk) is non-increasing and bounded below, therefore f(xk) converges. We consider two cases. Case 1: An in nite number of line search successes occur. Let x be a cluster point of fxkg1k=0. Notice that x k only changes for line search successes, so there exists a subsequence fxkjg1k=0 of line search suc- cesses such that xkj ! x. Following the arguments of Theorem 2.11, we have jdkj j ! 0 and kj # 0. Notice that if kj # 0, then eventually Y kj B~"( x), where xk ! x and ~" is de ned as in Lemma 3.4. Thus, by Lemma 3.4, we have that A(Y kj ) A( x). This means GY kj (x kj ) is formed from a subset of the active gradients that make up @f( x). Thus, by Lemma 2.2(1), as dkj 2 GY kj (x kj ), we can construct a vkj 2 @f( x) from the same set of active gradients that make up GY kj (x kj ) such that j dkj vkj j = X i2A(Y kj ) irAfi(x kj ) X i2A(Y kj ) irfi( x) X i2A(Y kj ) ijrAfi(x kj ) rfi( x)j = jrAfi(xkj ) rfi( x)j (as P i2A(Y kj ) i = 1) jrAfi(xkj ) rfi(xkj )j+ jrfi(xkj ) rfi( x)j K kj + jrfi(x kj ) rfi( x)j (by error bound). 283.2. Convergence Using the Triangle Inequality, we have j dkj vkj j = j vkj dkj j jvkj j jdkj j; which implies that jvkj j jdkj j K kj + jrfi(x kj ) rfi( x)j: We already showed that jdkj j ! 0 and kj # 0. Furthermore, since rfi 2 C and xkj ! x, we have jrfi(xkj ) rfi( x)j ! 0. So, lim j!1 jvkj j = 0: Using the same arguments as in Theorem 2.11, the result follows. Case 2: A nite number of line search successes occur. This means there exists a k such that xk = x k = x for all k k. However, by Corollary 3.3, if 0 =2 @f( x), then after a nite number of iterations, the algorithm will nd function value decrease (line search success). Hence, we have 0 2 @f( x). Remark 3.6. Using dk to check our stopping conditions allows the result of Theorem 2.12 to still hold. 3.2.1 Descent Direction: dY It is worth noting that we can prove Lemma 2.2(2) for G(Y ) and thus, prove dY is a descent direction without the requirement that A(Y ) A( x). We show this in Lemmas 3.7 and 3.8. Lemma 3.7. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let Y = [y0; y1; y2; : : : ; ym] be a randomly sampled set from a ball centered at y0 = x with radius . Suppose there exists a K > 0 such that the approximate gradient satis es jrAfi( x) rfi( x)j K for all i. Then for all v 2 @f( x), there exists a w 2 G(Y ) such that jw vj K : Proof. By de nition, for all v 2 @f( x) there exists a set of i such that v = X i2A( x) irfi( x) where i 0; X i2A( x) i = 1: 293.3. Robust Stopping Conditions We know that A( x) A(Y ), as x 2 Y . Thus, using the same i as above, we see that w = X i2A( x) irAfi( x) 2 G(Y ): Using the same argument as shown in Lemma 2.2(2), we can conclude that for all v 2 @f( x), there exists a w 2 G(Y ) such that jw vj K : Using this result, we can show that dY is a descent direction. Lemma 3.8. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Let Y = [y0; y1; y2; : : : ; ym] be a randomly sampled set from a ball centered at y0 = x with radius . Suppose there exists a K > 0 such that the ap- proximate gradient satis es jrAfi( x) rfi( x)j K for all i. De ne dY = Proj(0jG(Y )) and suppose jdY j 6= 0. If K < jdY j, then hdY ; vi < 0 for all v 2 @f( x). Proof. Using Lemma 3.7 instead of Lemma 2.2(2) in the proof of Corol- lary 2.5, we can prove the result. 3.3 Robust Stopping Conditions We want to provide some insight as to how Theorem 2.12 can work for stopping conditions based on dkY , that is, replacing the stopping conditions k kjdkj and jdkj < "tol in Step 2 with the robust stopping conditions k kjd k Y j and jd k Y j < "tol: (3.4) The following proposition does not theoretically justify why the robust stop- ping conditions in (3.4) are su cient, but does help explain their logic. Theo- retically, since we do not know what x is, we cannot tell when A(Y k) A( x). Proposition 3.9. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1. Suppose there exists a K > 0 such that jrAfi(x) rfi(x)j K k for all i and for all x 2 B k(x k). Suppose the RAGS algorithm terminates at some iteration k in Step 2 using the robust stopping conditions given in (3.4). Furthermore, suppose there exists x 2 B k(x k) such that A(Y k) A( x). Then Proj(0j@f( x)) < (1 + K 0)"tol: 303.3. Robust Stopping Conditions Proof. If A(Y k) A( x), then the proofs of Lemma 2.2(1) and Theorem 2.12 still hold. Additionally, in the following results, we approach the theory for robust stopping conditions using the Goldstein approximate subdi erential. If the RAGS algorithm terminates in Step 2, then it is shown that the distance between 0 and the Goldstein approximate subdi erential is controlled by "tol. Again, this does not prove the robust stopping conditions are su cient for the exact subdi erential. De nition 3.10. The Goldstein approximate subdi erential, as de ned in [Gol77], is given by the set @G f( x) = conv(@f(z) : z 2 B ( x)); (3.5) where B ( x) is the closed ball centered at x with radius . We now show that the Goldstein approximate subdi erential contains all of the gradients of the active functions in the robust active set. Lemma 3.11. Let f = maxffi : i = 1; : : : ; Ng. Let Y = [y0; y1; y2; : : : ; ym] be a randomly sampled set from a ball centered at y0 = x with radius . If fi 2 C1 for each i, then @G f( x) conv(rfi(y j) : yj 2 Y; i 2 A(yj)): Proof. If fi 2 C1 for each i 2 A(Y ), then by Proposition 1.6, for each yj 2 Y we have @f(yj) = conv(rfi(y j))i2A(yj) = conv(rfi(y j) : i 2 A(yj)): Using this in our de nition of the Goldstein approximate subdi erential in (3.5) and knowing B ( x) Y , we have @G f( x) conv(conv(rfi(y j) : i 2 A(yj)) : yj 2 Y ); which simpli es to @G f( x) conv(rfi(y j) : yj 2 Y; i 2 A(yj)): (3.6) Now we have a result similar to Lemma 2.2(1) for dkY with respect to the Goldstein approximate subdi erential. 313.3. Robust Stopping Conditions Remark 3.12. For the following two results, we assume each of the fi 2 C1+ with Lipschitz constant L. Note that this implies the Lipschitz constant L is independent of i. If each fi 2 C1+ has Lipschitz constant Li for rfi, then L is easily obtained by L = maxfLi : i = 1; : : : ; Ng. Lemma 3.13. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1+ with Lipschitz constant L. Let Y = [y0; y1; y2; : : : ; ym] be a randomly sampled set from a ball centered at y0 = x with radius . Suppose there exists a K > 0 such that the approximate gradient satis es jrAfi( x) rfi( x)j K for all i. Then for all w 2 G(Y ), there exists a g 2 @G f( x) such that jw gj ( K + L) : Proof. By de nition, for all w 2 G(Y ) there exists a set of i such that w = X i2A(Y ) irAfi( x); where i 0; X i2A(Y ) i = 1: By our assumption that each fi 2 C1+, Lemma 3.11 holds. It is clear that for each i 2 A(Y ), i 2 A(yj) for some yj 2 Y . Let ji be an index corresponding to this active index; i.e., i 2 A(yji). Thus, for each i 2 A(Y ), there is a corresponding active gradient rfi(y ji) 2 conv(rfi(y ji) : yji 2 Y; i 2 A(yji)) @G f( x): Using the same i as above, we can construct g = X i2A(Y ) irfi(y ji) 2 conv(rfi(y ji) : yji 2 Y; i 2 A(yji)) @G f( x): Then jw gj = j P i2A(Y ) irAfi( x) P i2A(Y ) irfi(yji)j P i2A(Y ) ijrAfi( x) rfi(yji)j = P i2A(Y ) ijrAfi( x) rfi( x) +rfi( x) rfi(yji)j P i2A(Y ) i jrAfi( x) rfi( x)j+ jrfi( x) rfi(yji)j . 323.3. Robust Stopping Conditions Applying our error bound assumption and the Lipschitz condition, we have jw gj P i2A(Y ) i K + Lmax ji j x yji j K + L (as P i2A(Y ) i = 1) = ( K + L) . Hence, for all w 2 G( x), there exists a g 2 @G f( x) such that jw gj "+ L : (3.7) Thus, using Lemma 3.13, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdi erential is controlled by "tol. Proposition 3.14. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1+ with Lipschitz constant L. Suppose there exists a K > 0 such that for each iteration k, the approximate gradient satis es jrAfi(xk) rfi(xk)j K k for all i. Suppose the RAGS algorithm terminates at some iteration k in Step 2 using the robust stopping conditions given in (3.4). Then Proj(0j@G kf(x k)) < [1 + 0( K + L)]"tol: Proof. Let w = Proj(0jG(Y k)). We use g 2 @G kf(x k) as constructed in Lemma 3.13 to see that dist(0j@G k)f(x k) dist(0j g) dist(0j w) + dist( wj g) (as w 2 G(Y k), g 2 @G kf(x k)) = jd kY j+ j w gj (as jd k Y j = j Proj(0jG(Y k))j) jd kY j+ ( K + L) k (by Lemma 3.13) < "tol + ( K + L) k (as jd k Y j < "tol). The statement now follows by the test k kjd k Y j in Step 2 and the fact that k 0 as f kgk=0 is a non-increasing sequence. 33Chapter 4 Approximate Gradients As seen in the previous two sections, in order for convergence to be guaranteed in the AGS and RAGS algorithms, the approximate gradient used must satisfy an error bound for each of the active fi. Speci cally, for a randomly sampled set of points Y = [xk; y1; : : : ; ym] centered around xk, there must exist a K > 0 such that jrAfi(x k) rfi(x k)j K k; where k = maxj jyj xkj. In this section, we present three speci c approx- imate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. 4.1 Simplex Gradient The simplex gradient is a commonly used approximate gradient. In re- cent years, several derivative-free algorithms have been proposed that use the simplex gradient, ([BK98], [Kel99a], [CV07], [CJV08], and [HM11]) among others. It is geometrically de ned as the gradient of the linear interpola- tion of f over a set of n + 1 points in Rn. In the following section, we mathematically de ne the simplex gradient. 4.1.1 De nitions First we present several basic de nitions of terms used in the de nition of the simplex gradient. De nition 4.1. A set S is an a ne set if given any two points x1 2 S and x2 2 S with x1 6= x2, the line formed by x1 and x2 is a subset of S, i.e., x1x2 S. De nition 4.2. The a ne hull of a set S Rn is the smallest a ne set containing S. 344.1. Simplex Gradient De nition 4.3. A set of m + 1 points Y = [y0; y1; : : : ; ym] is said to be a nely independent if its a ne hull a (y0; y1; : : : ; ym) has dimension m. Equivalently, Y is a nely independent if the set [y1 y0; : : : ; ym y0] is linearly independent, [CSV09]. De nition 4.4. Let Y = [y0; y1; : : : ; yn] be a set of a nely independent points in Rn. We say that Y forms the simplex, S, where S = conv(Y ). Thus, S is a simplex if it can be written as the convex hull of an a nely independent set of n+ 1 points in Rn. With the previous terms de ned, we are now ready to mathematically de ne the simplex gradient. De nition 4.5. Let Y = [y0; y1; : : : ; yn] be a list of a nely independent points in Rn. The simplex gradient of a function f over the set Y is given by rsf(Y ) = L 1 f(Y ); where L = y1 y0 yn y0 > and f(Y ) = 2 6 4 f(y1) f(y0) ... f(yn) f(y0) 3 7 5 : The condition number of a simplex is given by kL^ 1k, where L^ = 1 [y1 y0 y2 y0 : : : yn y0]> and = max j=1;:::n jyj y0j: 4.1.2 Convergence The following result (from Kelley) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to rfi. Theorem 4.6. Consider fi 2 C1+ with Lipschitz constant Ki for rfi. Let Y = [y0; y1; : : : ; yn] form a simplex. Let L^ = 1 [y1 y0 y2 y0 : : : yn y0]>; where = max j=1;:::n jyj y0j: 354.2. Centered Simplex Gradient Then the simplex gradient satis es the error bound jrsfi(Y ) rfi(y 0)j K ; where K = 1 2 Ki p nkL^ 1k. Proof. See [Kel99b, Lemma 6.2.1]. With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.7. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1+ with Lipschitz constant Ki for rfi. If the approximate gradient used in the AGS or RAGS algorithm is the simplex gradient and kL^ 1k is bounded above, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. 4.1.3 Algorithm Algorithm - Simplex Gradient In order to calculate a simplex gradient in Step 1, we generate a set Y = [xk; y1; : : : ; yn] of points in Rn and then check to see if Y forms a well-poised simplex by calculating its condition number, kL^ 1k. A bounded condition number (kL^ 1k < n) ensures a ‘good’ error bound between the approximate gradient and the exact gradient. If Y forms a well-poised simplex (kL^ 1k < n), then we calculate the simplex gradient of fi over Y for each i 2 A(xk) (or each i 2 A(Y k)) and then set the approximate subdi erential equal to the convex hull of the active simplex gradients. If the set Y does not form a well-poised simplex (kL^ 1k n), then we resample. We note that the probability of generating a random matrix with a condition number greater than n is asymptotically constant, [Wsc04]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not a ect the number of function evaluations required by the algorithm. 4.2 Centered Simplex Gradient The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it contains an advantage that the error bound satis ed by the centered simplex gradient is in terms of 2, rather than . 364.2. Centered Simplex Gradient 4.2.1 De nitions De nition 4.8. Let Y = [y0; y1; : : : ; yn] form a simplex. We de ne the sets Y + = [x; x+ ~y1; : : : ; x+ ~yn] and Y = [x; x ~y1; : : : ; x ~yn]; where x = y0 and ~yi = yi y0 for i = 1; : : : n. The centered simplex gradient is the average of the two simplex gradients over the sets Y + and Y , i.e., rCSf(Y +) = 1 2 (rSf(Y +) +rSf(Y )): 4.2.2 Convergence To show that the AGS and RAGS algorithms are well-de ned when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again from Kelley). Theorem 4.9. Consider fi 2 C2+ with Lipschitz constant Ki for r2fi. Let Y = [y0; y1; : : : ; yn] form a simplex. Let L^ = 1 [y1 y0; : : : ; yn y0] where = max j=1;:::n jyj y0j: Then the centered simplex gradient satis es the error bound jrCSfi(Y ) rfi(y 0)j K 2; where K = Ki p nkL^ 1k. Proof. See [Kel99b, Lemma 6.2.5]. Notice that Theorem 4.9 requires fi 2 C2+. If fi 2 C1+, then the error bound is in terms of , not 2. With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.10. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C2+ with Lipschitz constant Ki for r2fi. If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient and 0 1, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. Proof. Since 0 1 and non-increasing, ( k)2 k and ergo, Theorems 2.6 and 2.11 hold. 374.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction 4.2.3 Algorithm To adapt the AGS and RAGS algorithms to use the centered simplex gradient, in Step 1 we sample our set Y in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets Y + and Y and proceed as expected. 4.3 Gupal Estimate of the Gradient of the Steklov Averaged Function The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [Kiw10] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.16 that an appropriate error bound exists for this approximate gradient. Surpris- ingly, unlike the error bounds for the simplex and centered simplex gradi- ents, the error bound in Theorem 4.16 does not include a condition number term. Following the notation used by Kiwiel in [Kiw10], we de ne the Gupal estimate of the gradient of the Steklov averaged function as follows. 4.3.1 De nitions De nition 4.11. For > 0, the Steklov averaged function f is de ned by f (x) = Z Rn f(x y) (y)dy; where : Rn ! R+ is the Steklov molli er de ned by (y) = 1= n if y 2 [ =2; =2]n; 0 otherwise: We can equivalently de ne the Steklov averaged function by f (x) = 1 n Z x1+ =2 x1 =2 Z xn+ =2 xn =2 f(y)dy1 : : : dyn: (4.1) The partial derivatives of f are given by @f @xi (x) = Z B1 i(x; ; )d (4.2) 384.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction for i = 1; : : : ; n, where B1 = [ 1=2; 1=2]n is the unit cube centred at 0 and i(x; ; ) = 1 f(x1 + 1; : : : ; xi 1 + i 1; xi + 1 2 ; xi+1 + i+1; : : : ; xn + n) f(x1 + 1; : : : ; xi 1 + i 1; xi 1 2 ; xi+1 + i+1; : : : ; xn + n) : (4.3) De nition 4.12. Given > 0 and z = ( 1; : : : ; n) 2 Qn i=1 B1, the Gupal estimate of rf (x) over z is given by (x; ; z) = ( 1(x; ; 1); : : : ; n(x; ; n)): (4.4) Remark 4.13. Although we de ne (x; ; z) as the Gupal estimate ofrf (x), in Section 4.3.2, we will show that (x; ; z) provides a good approximate of the exact gradient, rfi(x). Remark 4.14. For the following results, we note that the used in the above de nitions is equivalent to our search radius . Thus, we will be replacing with in the convergence results in Section 4.3.2. 4.3.2 Convergence Convergence As before, in order to show that the AGS and RAGS algorithms are well-de ned when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximate of our exact gradient. To do this, we need the following result. Lemma 4.15. Let f 2 C1+ with Lipschitz constant K for rf . Let y0 2 Rn. Then for any y 2 Rn jf(y) f(y0) rf(y0)>(y y0)j 1 2 Kjy y0j2: Proof. For ease of notation, let = y y0. By the Fundamental Theorem of Calculus, for any y 2 Rn we have f(y) f(y0) = Z 1 0 rf(y0 + ( ))> d : (4.5) 394.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Considering rf(y0)> , notice that rf(y0)> = Z 1 0 rf(y0)> d : (4.6) Subtracting (4.6) from (4.5), we have f(y) f(y0) rf(y0)> = Z 1 0 (rf(y0 + ) rf(y0))> d : Taking the absolute value, we have f(y) f(y 0) rf(y0)> = Z 1 0 (rf(y0 + ) rf(y0))> d Z 1 0 (rf(y 0 + ) rf(y0))> d (as rf is cont.) Z 1 0 (rf(y 0 + ) rf(y0)) j j d (by Cauchy Schwarz) Z 1 0 Kjy0 + y0jj jd (as rf is Lip.) = Z 1 0 K j j2d (as 2 (0; 1)) = 1 2 Kj j2. Therefore, with = y y0, we have jf(y) f(y0) rf(y0)>(y y0)j 1 2 Kjy y0j2: Using Lemma 4.15, we establish an error bound between the Gupal es- timate and the exact gradient of fi. 404.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Theorem 4.16. Consider fi 2 C1+ with Lipschitz constant Ki for rfi. Let " > 0. Then for > 0, z = ( 1; : : : ; n) 2 Z = Qn i=1 B1 and any point x 2 Rn, the Gupal estimate of rfi; (x) satis es the error bound j (x; ; z) rfi(x)j p n 1 2 Ki ( p n+ 3): Proof. For > 0, let yj = [x1 + 1; : : : ; xj 1 + j 1; xj 1 2 ; xj+1 + j+1; : : : ; xn + n] >: By Lemma 4.15, for yj+ = [x1 + 1; : : : ; xj 1 + j 1; xj + 1 2 ; xj+1 + j+1; : : : ; xn + n] >; we have jfi(y j+) fi(y j ) rfi(y j )>(yj+ yj )j 1 2 Kijy j+ yj j2: (4.7) From equation (4.3) (with = ), we can see that fi(y j+) fi(y j ) = j(x; ; j): Hence, equation (4.7) becomes j j(x; ; j) rfi(y j )>(yj+ yj )j 1 2 Kijy j+ yj j2: (4.8) From our de nitions of yj and yj+, we can see that yj+ yj = [0; : : : ; 0; ; 0; : : : ; 0]>: The inner product in equation (4.8) simpli es to rfi(y j )>(yj+ yj ) = @fi @xj (yj ): Thus, we have j(x; ; j) @fi @xj (yj ) 1 2 Ki 2: Dividing out gives j(x; ; j) @fi @xj (yj ) 1 2 Ki : (4.9) 414.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction We notice that jyj xj = ( j 1 ; : : : ; j j 1; 1 2 ; jj+1; : : : ; j n) = ( j 1 ; : : : ; j j 1; 1 2 ; jj+1; : : : ; j n) . Using the standard basis vector ej , we have jyj xj = j jej 1 2 ej j j j+ j j j+ 1 2 1 2 p n+ 1 2 + 1 2 (as j 2 [ 1=2; 1=2]n) = 1 2 ( p n+ 2). Thus, since fi 2 C1+, we have jrfi(y j ) rfi(x)j Ki 1 2 ( p n+ 2): (4.10) Now, we notice that @fi @xj (yj ) @fi @xj (x) = rfi(y j ) > ej rfi(x) >ej jrfi(y j ) rfi(x)jje j j = jrfi(y j ) rfi(x)j: Therefore, @fi @xj (yj ) @fi @xj (x) Ki 1 2 ( p n+ 2): (4.11) 424.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction Using equations (4.9) and (4.11), we have j(x; ; j) @fi @xj (x) = j(x; ; j) @fi @xj (yj ) + @fi @xj (yj ) @fi @xj (x) j(x; ; j) @fi @xj (yj ) + @fi @xj (yj ) @fi @xj (x) 1 2 Ki +Ki 1 2 ( p n+ 2) = 1 2 Ki ( p n+ 3). Hence, j (x; ; z) rfi(x)j = s 1(x; ; 1) @fi @x1 (x) 2 + + n(x; ; n) @fi @xn (x) 2 s 1 2 Ki ( p n+ 3) 2 + + 1 2 Ki ( p n+ 3) 2 = s n 1 2 Ki ( p n+ 3) 2 = p n 1 2 Ki ( p n+ 3). We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of f as an approximate gradient in the AGS and RAGS algorithms. Corollary 4.17. Let f = maxffi : i = 1; : : : ; Ng where each fi 2 C1+ with Lipschitz constant Ki for rfi. If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.6, 2.8, 2.11 and 2.12 hold. 434.3. Gupal Estimate of the Gradient of the Steklov AveragedFunction 4.3.3 Algorithm Algorithm To use the Gupal estimate of the gradient of the Steklov averaged func- tion in the AGS and RAGS algorithms, in Step 1, we sample independently and uniformly fzklgml=1 from the unit cube in R n n, where m is the number of active functions. Then proceed as expected. 44Chapter 5 Numerical Results 5.1 Versions of the AGS Algorithm We implemented the AGS and RAGS algorithms using the simplex gra- dient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as approximate gradients. Additionally, we used the robust descent direction to create robust stop- ping conditions. That is, the AGS and RAGS algorithms terminate when k kjd k Y j and jd k Y j < "tol; (5.1) where dkY is the projection of 0 onto the approximate subdi erential gener- ated using the robust active set. (See Lemmas 3.11 and 3.13, and Proposition 3.14 for results linking the robust stopping conditions with the Goldstein ap- proximate subdi erential.) The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available upon request. Let dk denote the regular descent direction and let dkY denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions: 1. jdkj = jdkY j; 2. jdkj > jdkY j, but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or 3. jdkj > jdkY j, but checking the stopping conditions leads to a di erent result. In Scenarios 1 and 2, the robust stopping conditions have no in uence on the algorithm. In Scenario 3, we have two cases: 1. k kjdkY j kjd kj, but jdkj "tol and jdkY j < "tol or 2. k kjdkj holds, but k > kjdkY j. 455.2. Test Sets and Software Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: to terminate early, providing a solution that has a smaller quality measure, but requires less function evaluations to nd, or to reduce its search radius instead of carrying out a line search, reducing the number of function evaluations carried out during that iteration and calculating a more accurate approximate subdif- ferential at the next iteration. Our goal in this testing is to determine if there are any notable numerical di erences in the quality of the three approximate gradients (simplex, cen- tered simplex, and Gupal estimate), the two search directions (non-robust and robust), and the two stopping conditions (non-robust and robust). This results in the following 12 versions: AGS Simplex (1. non-robust /2. robust stopping) RAGS Simplex (3. non-robust /4. robust stopping) AGS Centered Simplex (5. non-robust /6. robust stopping) RAGS Centered Simplex (7. non-robust /8. robust stopping) AGS Gupal (9. non-robust /10. robust stopping) RAGS Gupal (11. non-robust/12. robust stopping) 5.2 Test Sets and Software Testing was performed on a 2.0 GHz Intel Core i7 Macbook Pro and a 2.2 GHz Intel Core 2 Duo Macbook Pro. We used the test set from Luk san- Vl cek, [LV00]. The rst 25 problems presented are of the desired form min x F (x) where F (x) = maxffi(x) : i = 1; 2; : : : ; Ng: Of these 25 problems, we omit problem 2.17 because the sub-functions are complex-valued. Thus, our test set is a total of 24 nite minimax problems with dimensions from 2 to 20. There are several problems featuring functions fi that have the form fi = jfij, where fi is a smooth function. We rewrote these functions as fi = maxffi; fig. The resulting test problems have from 2 to 130 sub-functions. A summary of the test problems appears in Table 1 in Appendix A. 465.3. Initialization and Stopping Conditions 5.3 Initialization and Stopping Conditions We rst describe our choices for the initialization parameters used in the AGS and RAGS algorithms. The initial starting points are given for each problem in [LV00]. We set the initial accuracy measure to 0:5 with a reduction factor of 0:5. We set the initial search radius to 0:1 with a reduction factor of 0:5. The Armijo- like parameter was chosen to be 0:1 to ensure that a line search success resulted in a signi cant function value decrease. We set the minimum step length to 10 10. Next, we discuss the stopping tolerances used to ensure nite termina- tion of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The rst reason corresponds to our theo- retical stopping conditions, while the remaining three reasons are to ensure numerical stability of the algorithm. 1. Stopping conditions met: As stated in the theoretical algorithm, the algorithm terminates for this reason when k kjdkj and jdkj < "tol, where dk is either the regular or the robust descent direction. 2. Hessian matrix has NaN / Inf entries: For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates. 3. k, k, and jdkj are small: This stopping criterion bi-passes the test k kjdkj (in Step 2) and stops if k < tol, k < tol and jdkj < "tol. Examining Theorem 2.12 along with Theorems 4.6, 4.9 and 4.16, it is clear that this is also a valid stopping criterion. We used a bound of 10 6 in our implementation for both k and k. 4. Max number of function evaluations reached - As a nal failsafe, we added an upper bound of 106 on the number of function evaluations allowed. (This stopping condition only occurs once in our results.) 5.4 Computing a Descent Direction In each iteration of the AGS algorithm, we must compute the projection of 0 onto the approximate subdi erential G(xk). Additionally, for the RAGS algorithm, we must compute the projection of 0 onto the robust approximate subdi erential G(Y k). 475.4. Computing a Descent Direction To explain how we solve this problem in MATLAB, we rst de ne the projection of 0 onto the convex hull of a set of points X = [x1; x2; : : : ; xn]: By de nition, conv(X) = fz : z = nX i=1 ixi; i 0; nX i=1 i = 1; xi 2 Xg; where x 2 Rn and = [ 1; : : : ; n] 2 Rn. By De nition 1.3, we know that the projection p of a point x 2 Rn onto a closed convex set C is in arg minfjy xj2 : y 2 Cg. Hence, we have Proj(0j conv(X))2 arg min (z; ) fjzj2 : z = nX i=1 ixi; i 0; nX i=1 i = 1; xi 2 Xg: Next we show that this minimization problem is a quadratic program that can be solved using a quadratic program solver. For the sake of notation, let x = xk. We are given G(x) = conv(rAfi(x))i2A(x), i.e., G(x) = fg : g = X i2A(x) irAfi(x); X i2A(x) i = 1; i 0g: Without loss of generality, assume the rst m functions in our nite set of fi functions are active (m n). Let = [ 1 2 : : : m] > and = [rAf1(x) rAf2(x) : : : rAfm(x)]: Then for all g 2 G(x), g = for some 0; X i2A( x) i = 1: De ne A = [1 1 : : : 1]. Then A = 1. Now kgk2 = ( )>( ) = > > : 485.5. Results Let H = > = 2 6 6 6 4 rAf1(x)2 rAf1(x)>rAf2(x) rAf1(x)>rAfm(x) rAf2(x)>rAf1(x) rAf2(x)2 rAf2(x)>rAfm(x) ... ... . . . ... rAfm(x)>rAf1(x) rAfm(x)>rAf2(x) rAfm(x)2 3 7 7 7 5 . Then we have the problem min f >H : 0; A = 1g: (5.2) After solving this problem, we have d = Proj(0jG(x)) = : Lemma 5.1. The matrix H de ned above is positive semi-de nite and there- fore, the minimization problem in (5.2) is a convex quadratic program. Proof. Clearly, H 2 R(m+1) (m+1) is symmetric as rAfi(x)>rAfj(x) = rAfj(x)>rAfi(x) for all i 6= j, i; j = 1; : : : ;m. Furthermore, we have for all y 2 Rn, y>Hy = y> > y = ( y)>( y) = j yj2 0: Thus, H is positive semi-de nite. In implementation, we used the quadprog.m function in MATLAB to solve for . The default algorithm for the quadprog.m function is the interior point algorithm, which nds the projection in a limiting sense. As the quadratic programs solved in our algorithm are of small size, we used the active-set algorithm instead, thus, nding exact solutions. 5.5 Results Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality was measured 495.5. Results by the improvement in the number of digits of accuracy, which is calculated using the formula log jFmin F j jF0 F j ; where Fmin is the function value at the nal (best) iterate, F is the true minimum value (optimal value) of the problem (as given in [LV00]) and F0 is the function value at the initial iterate. Results on function evaluations and solution qualities appear in Tables 6.2, 6.3 and 6.4 of the appendix. To visually compare algorithmic versions, we use performance pro les. A performance pro le is the (cumulative) distribution function for a perfor- mance metric [DM02]. For the AGS and RAGS algorithm, the performance metric is the ratio of the number of function evaluations taken by the current version to successfully solve each test problem versus the least number of function evaluations taken by any of the versions to successfully solve each test problem. Performance pro les eliminate the need to discard failures in numerical results and provide a visual representation of the performance di erence between several solvers. For each performance pro le, we have a set S of ns solvers and a set P of np problems. Our performance pro les de ne tp;s to be the number of function evaluations required to solve problem p by solver s. The baseline for comparison used is the following performance ratio rp;s = tp;s minftp;s : s 2 Sg : As an overall assessment of the performance of the solver, we de ne s( ) = 1 np sizefp 2 P : rp;s g: So, \ s( ) is the probability for solver s 2 S that a performance ratio rp;s is within a factor 2 R of the best possible ratio", [DM02]. In Figures 5.1(a) and 5.1(b) we include a performance pro le showing all 12 versions our algorithm tested, declaring a success for 1 digit and 3 digits of improvement, respectively. We used PProfile.m in MATLAB, a function courtesy of Dr. Warren Hare and Valentine Koch, to generate the performance pro les. 505.5. Results 100 101 102 103 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 τ ρ(τ ) Performance Profile on AGS and RAGS Algorithms (successful improvement of 1 digit) (a) Accuracy improvement of 1 digit. 100 101 102 103 0 0.1 0.2 0.3 0.4 0.5 0.6 τ ρ(τ ) Performance Profile for AGS and RAGS Algorithms (successful improvement of 3 digits) (b) Accuracy improvement of 3 digits. Figure 5.1: Performance pro les for 12 versions of AGS/RAGS algorithm. 515.5. Results In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It only produces 1 or more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no signi cant di erence between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients. Looking at the results in Tables 6.2, 6.3 and 6.4, and our performance pro les, we can make the following two observations: i) the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and ii) the RAGS algorithm using the robust stopping conditions termi- nates faster and with lower (but still signi cant) accuracy. Robust active set: From our results, it is clear that expanding the active set to include ‘almost active’ functions in the RAGS algorithm greatly im- proves performance for the simplex and centered simplex algorithm. This robust active set brings more local information into the approximate sub- di erential and thereby allows for descent directions that are more parallel to any nondi erentiable ridges formed by the function. Robust stopping conditions: We notice from the performance pro les that in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results correspond with our pre- viously discussed hypothesis. Furthermore, upon studying the reasons for termination, it appears that the non-robust stopping conditions cause the AGS and RAGS algorithms to terminate mainly due to k and k becoming too small. For the robust stopping conditions, the RAGS algorithm termi- nated often because the stopping conditions were satis ed. As our theory in Section 3.3 is not complete, we cannot make any theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.11). However, from our results, we conjecture that the alteration is bene cial for decreasing function evaluations. In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS al- gorithm in average accuracy obtained over 25 trials using the simplex and centered simplex gradients. Knowing this, we conclude that the improve- ment of the accuracy is due to the choice of a robust search direction. 525.6. An Application in Seismic Retro tting 5.6 An Application in Seismic Retro tting This application deals with the potentially destructive inter-story drift that occurs during an earthquake. In high density metropolitan areas, the close proximity of buildings leaves little room for side-to-side movement. As a means of lessening the destructive e ects of an earthquake on high rise buildings, dampers are placed between closely adjacent buildings. When the buildings sway, the dampers absorb the horizontal movement. For this application, there are two optimization problems that need to be solved. The rst is determining the optimal positions of the dampers. The second is determining the optimal damping coe cients of damper con- nectors. Details on the discrete optimization problem of optimal damper con gurations can be found in [BHT12]. The problem of determining optimal damping coe cients is a continuous optimization problem. You can think of a damping coe cient like a friction coe cient; it represents the e ect the damper has on the velocity of the ith oor of the building. In 2011, Bigdeli, Hare and Tesfamariam presented the rst research on the application of mathematical optimization to the problem of optimal damping coe cients [BHT11]. As presented, a set of damping coe cients is run through a simulation that generates the ‘inter-story drift’ of each oor. The objective is to minimize the maximum inter-story drift. As inter- story drift is computed via simulation, no derivative information is available and DFO methods are indispensable. Applying the Nelder-Mead method, a well-known DFO algorithm, to this problem, it is shown in [BHT11] that non-uniform damping coe cients can greatly improve system performance. This problem is a nite minimax problem. The details of the problem can be found in [BHT11]; simply put, we are trying to minimize the max- imum inter-story drift between buildings. In this section, we explore the performance of two commercial solvers and the RAGS algorithm (using the simplex gradient and the robust stopping conditions) on this problem. The two commercial solvers we look at are the DFO pattern search and genetic algorithms de ned in MATLAB. The pattern search algorithm is categorized as a directional direct-search method. As in any optimization algorithm, we desire to nd a new point xk+1 such that f(xk+1) < f(xk). There are two phases to this algorithm as described in [CSV09]: the search step and the poll step. In the search step, the algorithm searches a nite set of points for function value decrease. If found, then the iterate is updated and the algorithm loops. Else, the algorithm carries out a poll step. A poll step is a localized search around 535.6. An Application in Seismic Retro tting the current iterate using a given directional matrix. The genetic algorithm follows the framework of initialization, selection and reproduction. In the initialization stage, the algorithm randomly gen- erates an initial population, i.e., generates a random set of points. In the selection stage, the algorithm chooses the ‘best’ individuals in the popula- tion to reproduce, i.e., the set of points that result in the lowest function values. In the reproduction stage, the algorithm creates a new population from the ‘best’ individuals. The genetic algorithm requires a large number of function evaluations per iteration, so is best suited for problems with fast function call speed or when the dimension of a problem is high. In Table 5.6, we present the summary results of our comparison of the genetic, pattern search and RAGS algorithms for 135 damper coe cient selection test problems. Table 5.1: Summary results for 135 damping coe cient selection test prob- lems. Best Time Total Computation Best Fopt Time (seconds) GA 46 972357.5 5 PS 22 4689590.1 71 RAGS 67 864632.6 59 We can see that the RAGS algorithm has the shortest computation time for 67 of the 135 test problems and the shortest total computation time for all of the 135 test problems. If for optimal function values, we declare a tie between algorithms when jFopt;GA=PS Fopt;RAGS j Fopt;GA=PS < 10 2; then the genetic algorithm tied all 5 of its best solutions with the RAGS algorithm, and the pattern search algorithm tied 40 of its 71 best solutions with the RAGS algorithm. In summary, for 104 of the 135 test problems, the RAGS algorithm had the best or tied for the best optimal solution. Thus, the RAGS algorithm is certainly comparable, if not superior, to the genetic and pattern search algorithms in terms of nding the optimal function value for these problems. 54Chapter 6 Conclusion and Future Work 6.1 Conclusion We have presented a new derivative-free algorithm for nite minimax problems that exploits the smooth substructure of the problem. Conver- gence results are given for any arbitrary approximate gradient that satis es an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algo- rithm is presented and shown to have the same convergence results as the regular version. Of the theory presented, the most in uential result is Lemma 2.2, which says that our approximate subdi erential is a good approximate of our ex- act subdi erential. Lemma 2.2 (1) is essential in proving that our stopping conditions are su cient and Lemma 2.2 (2) is essential in proving that our search direction (approximate direction of steepest descent) is a descent di- rection. Part 1 can be adapted to the Goldstein approximate subdi erential (see Lemma 3.7) and Part 2 can be adapted to the robust approximate subdi erential. The proof is elegant, relying only on the de nitions of sub- di erential and approximate subdi erential sets. In Chapter 4, we see that the AGS and RAGS algorithms are exible as to the approximate gradient used. The condition that an error bound in terms of is satis ed is a reasonable assumption, as the error bounds from Kelley for the simplex and centered simplex gradients and the error bound we provided for the Gupal estimate of the gradient of the Steklov average function depend on . Through numerical testing, we found that the RAGS algorithm out- performs the AGS algorithm with respect to the accuracy of the solution obtained. The general framework of the RAGS algorithm is not so di erent from the method of steepest descent. However, by including the ‘almost active’ functions in the robust active set, the RAGS algorithm is shown to avoid the downfall of the method of steepest descent caused by nondi eren- tiable ridges when applied to nonsmooth functions. Additionally, we tested robust stopping conditions and found that they generally required less func- 556.2. Future Work tion evaluations before termination for the RAGS algorithm. Although the results presented in Section 3.3 do not provide a complete theory for the robust stopping conditions, they give insight into the superior performance of the robust stopping conditions in our numerical results. Of the visuals presented, Figures 3.2(a) and 3.2(b) capture the true essence of the RAGS algorithm, clearly showing that the robust stopping conditions paired with the robust version of the algorithm performed best. 6.2 Future Work Considerable future work is available in this research direction. Most obvious is a further exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement bounding the step length away from 0 (see Theorems 2.11 and 3.5). In gradient based methods, one common way to avoid this requirement is with the use of Wolfe-like conditions. We are unaware of any derivative- free variant on the Wolfe conditions. The AGS and RAGS algorithms are hybrid algorithms; they combine the elements of several previously proposed algorithm frameworks to form new optimization algorithms. Combining the strengths of algorithms to create novel hybrid algorithms is an unbounded eld of future work with great potential. Finally, there is considerable future work in the seismic retro tting prob- lem, as well as numerous other real-world applications. 56Bibliography [BC11] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York, NY, 2011. ! pages 4, 16 [BHT11] K. Bigdeli, W. Hare, and S. Tesfamariam. Determining opti- mal non-uniform damping coe cients for adjacent buildings via a nelder-mead method. Proceeding of the 23rd Canadian Congress of Applied Mechanics, 2011. 4 pages. ! pages 53 [BHT12] K. Bigdeli, W. Hare, and S. Tesfamariam. Con guration opti- mization of dampers for adjacent buildings under seismic excita- tions. To appear in Eng. Opt., 2012. 19 pages. ! pages 53 [BJF+98] A. J. Booker, J. E. Dennis Jr., P. D. Frank, D. B. Sera ni, and V. Torczon. Optimization using surrogate objectives on a heli- copter test example. In Computational methods for optimal de- sign and control (Arlington, VA, 1997), volume 24 of Progr. Sys- tems Control Theory, pages 49{58. Birkh auser Boston, Boston, MA, 1998. ! pages 3 [BK98] D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems, volume 24 of Computational methods for optimal design and control. Birkh auser Boston, Boston, MA, 1998. ! pages 34 [BKS08] A. M. Bagirov, B. Karas ozen, and M. Sezer. Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317{334, 2008. ! pages 5, 7, 8 [BLO02] J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdi erentials by random sampling of gradients. Math. Oper. Res., 27(3):567{584, 2002. ! pages 5, 23 57Bibliography [BLO05] J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradi- ent sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim., 15(3):751{779, 2005. ! pages 5, 23 [CC78] C. Charalambous and A. R. Conn. An e cient method to solve the minimax problem directly. SIAM J. Numer. Anal., 15(1):162{187, 1978. ! pages 23 [CJV08] A. L. Cust odio, J. E. Dennis Jr., and L. N. Vicente. Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal., 28(4):770{784, 2008. ! pages 34 [Cla90] F. H. Clarke. Optimization and Nonsmooth Analysis. Classics Appl. Math. 5. SIAM, Philadelphia, second edition, 1990. ! pages 2, 4, 8 [CSV09] A. Conn, K. Scheinberg, and L. Vicente. Introduction to derivative-free optimization, volume 8 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2009. ! pages 2, 3, 4, 5, 35, 53 [CTYZ00] X. Cai, K. Teo, X. Yang, and X. Zhou. Portfolio optimization under a minimax rule. Manage. Sci., 46:957{972, July 2000. ! pages 2 [CV07] A. L. Cust odio and L. N. Vicente. Using sampling and sim- plex derivatives in pattern search methods. SIAM J. Optim., 18(2):537{555, 2007. ! pages 34 [DM02] E. D. Dolan and J. J. Mor e. Benchmarking optimization software with performance pro les. Math. Program., 91(2, Ser. A):201{ 213, 2002. ! pages 50 [DV04] R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Struct. and Multidiscip. Optim., 28:195{ 205, 2004. ! pages 3 [Gol77] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program., 13(1):14{22, 1977. ! pages 31 [Gup77] A. M. Gupal. A method for the minimization of almost di eren- tiable functions. Kibernetika, pages 114{116, 1977. (in Russian). ! pages 5 58Bibliography [Har10] W. Hare. Using derivative free optimization for constrained parameter selection in a home and community care forecasting model. In Int. Perspectives on Oper. Res. and Health Care, Proc. of the 34th Meeting of the EURO Working Group on Oper. Res. Applied to Health Sciences, 2010. ! pages 3 [HM11] W. Hare and M. Macklem. Derivative-free optimization methods for nite minimax problems. To appear in Optim. Methods and Softw., 2011. 16 pages. ! pages 3, 5, 7, 34 [HN12] W. Hare and J. Nutini. A derivative-free approximate gradient sampling algorithm for nite minimax problems. Submitted to Comput. Optim. and Appl., February 2012. 33 pages. ! pages vii [IOKK04] J. Imae, N. Ohtsuki, Y. Kikuchi, and T. Kobayashi. Minimax control design for nonlinear systems based on genetic program- ming: Jung’s collective unconscious approach. Intern. J. Syst. Sci., 35:775{785, October 2004. ! pages 2 [Kel99a] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a su cient decrease condition. SIAM J. Optim., 10(1):43{55, 1999. ! pages 34 [Kel99b] C. T. Kelley. Iterative Methods for Optimization, volume 18 of Frontiers in Applied Mathematics. SIAM, Philadelphia, 1999. ! pages 36, 37 [Kiw10] K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Op- tim., 20(4):1983{1994, 2010. ! pages 5, 7, 8, 38 [LLS06] G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algo- rithm for linearly constrained nite minimax problems. SIAM J. Optim., 16(4):1054{1075, 2006. ! pages 3, 7 [LV00] L. Luk san and J. Vl cek. Test problems for nonsmooth uncon- strained and linearly constrained optimization. Technical Report V-798, ICS AS CR, February 2000. ! pages 5, 23, 46, 47, 50 [Mad75] K. Madsen. Minimax solution of non-linear equations without calculating derivatives. Math. Programming Stud., pages 110{ 126, 1975. ! pages 3 59Bibliography [MFT08] A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computa- tional framework for derivative-free optimization of cardiovascu- lar geometries. Comput. Methods Appl. Mech. Engrg., 197(21- 24):1890{1905, 2008. ! pages 3 [NW99] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York, 1999. ! pages 13 [PGL93] G. Di Pillo, L. Grippo, and S. Lucidi. A smooth method for the nite minimax problem. Math. Program., 60(2, Ser. A):187{214, 1993. ! pages 3 [Pol87] E. Polak. On the mathematical foundations of nondi erentiable optimization in engineering design. SIAM Rev., 29(1):21{89, 1987. ! pages 2 [Pol88] R. A. Polyak. Smooth optimization methods for minimax prob- lems. SIAM J. Control Optim., 26(6):1274{1286, 1988. ! pages 3 [PRW03] E. Polak, J. O. Royset, and R. S. Womersley. Algorithms with adaptive smoothing for nite minimax problems. J. Optim. The- ory Appl., 119(3):459{484, 2003. ! pages 3 [RW98] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, vol- ume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer- Verlag, Berlin, 1998. ! pages 2, 4, 9, 19 [Sta05] R. Sta ord. Random points in an n-dimensional hypersphere, 2005. ! pages 12 [Wol75] P. Wolfe. A method of conjugate subgradients for minimizing nondi erentiable functions. Math. Programming Stud., pages 145{173, 1975. ! pages 5 [Wsc04] M. Wschebor. Smoothed analysis of (A). J. Complexity, 20(1):97{107, 2004. ! pages 36 [Xu01] S. Xu. Smoothing method for minimax problems. Comput. Op- tim. Appl., 20(3):267{279, 2001. ! pages 3 60Appendix A Tables Table 6.1: Test Set Summary: problem name and number, problem dimen- sion (N), and number of sub-functions (M); * denotes an absolute value operation (doubled number of sub-functions). Prob. # Name N M 2.1 CB2 2 3 2.2 WF 2 3 2.3 SPIRAL 2 2 2.4 EVD52 3 6 2.5 Rosen-Suzuki 4 4 2.6 Polak 6 4 4 2.7 PCB3 3 42* 2.8 Bard 3 30* 2.9 Kow.-Osborne 4 22* 2.10 Davidon 2 4 40* 2.11 OET 5 4 42* 2.12 OET 6 4 42* 2.13 GAMMA 4 122* 2.14 EXP 5 21 2.15 PBC1 5 60* 2.16 EVD61 6 102* 2.18 Filter 9 82* 2.19 Wong 1 7 5 2.20 Wong 2 10 9 2.21 Wong 3 20 18 2.22 Polak 2 10 2 2.23 Polak 3 11 10 2.24 Watson 20 62* 2.25 Osborne 2 11 130* 61Appendix A. Tables Table 6.2: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the simplex gradient. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 3018 2.082 2855 2.120 2580 9.470 202 6.759 2.2 3136 4.565 3112 4.987 4179 13.211 418 6.343 2.3 3085 0.002 3087 0.002 3090 0.002 3096 0.002 2.4 3254 2.189 3265 2.238 2986 11.559 367 7.570 2.5 3391 1.379 3138 1.351 3576 1.471 539 1.471 2.6 3260 1.236 3341 1.228 4258 1.338 859 1.338 2.7 2949 1.408 2757 1.367 4155 9.939 4190 7.230 2.8 4959 0.879 4492 0.913 3634 9.941 3435 7.655 2.9 2806 0.732 3303 0.581 16000 8.049 13681 3.975 2.10 2978 3.343 2993 3.342 3567 3.459 1924 3.459 2.11 3303 2.554 3453 2.559 35367 6.099 11725 5.063 2.12 2721 1.866 3117 1.871 15052 2.882 8818 2.660 2.13 2580 1.073 2706 0.874 43618 1.952 141 1.679 2.14 3254 1.585 3289 1.086 7713 2.696 4221 1.476 2.15 3917 0.262 5554 0.259 31030 0.286 12796 0.277 2.16 3711 2.182 4500 2.077 20331 3.242 11254 2.178 2.18 10468 0.000 10338 0.000 76355 17.717 30972 17.138 2.19 3397 0.376 3327 0.351 5403 7.105 1767 7.169 2.20 4535 1.624 4271 1.624 8757 8.435 7160 6.073 2.21 8624 2.031 8380 2.157 15225 1.334 11752 1.393 2.22 1563 0.958 1408 1.042 64116 3.049 1256 2.978 2.23 7054 2.557 10392 2.744 6092 6.117 970 6.178 2.24 4570 0.301 7857 0.298 93032 0.447 21204 0.328 2.25 3427 0.339 4197 0.340 98505 0.342 343 0.342 62Appendix A. Tables Table 6.3: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the centered simplex gradient. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 3769 2.054 3573 2.051 2351 9.469 221 7.125 2.2 3705 6.888 1284 5.154 4151 9.589 330 5.594 2.3 5410 0.003 5352 0.003 5332 0.003 5353 0.003 2.4 4059 2.520 4154 2.456 4347 11.578 296 6.834 2.5 3949 1.422 3813 1.437 4112 1.471 452 1.471 2.6 3756 1.302 3880 1.309 4815 1.338 879 1.338 2.7 4227 1.435 4187 1.373 5285 9.950 7164 6.372 2.8 6928 0.988 6933 1.003 4116 9.939 3754 7.775 2.9 3301 0.933 3743 0.949 17944 8.072 13014 2.436 2.10 3447 3.343 3424 3.342 4744 3.459 427 3.459 2.11 3593 2.768 4082 2.785 47362 6.344 11886 5.115 2.12 3321 1.892 3406 1.876 15550 2.843 10726 2.651 2.13 3067 1.355 3508 1.216 36969 1.873 519 1.643 2.14 3967 1.771 6110 1.152 9757 2.692 7284 1.510 2.15 4646 0.272 6014 0.273 23947 0.280 15692 0.277 2.16 4518 2.223 6911 2.074 22225 2.628 17001 2.215 2.18 30492 16.931 14671 16.634 125859 17.804 20815 17.293 2.19 4473 0.551 4484 0.591 8561 7.113 1697 5.851 2.20 5462 1.615 5503 1.599 8908 9.011 7846 6.042 2.21 11629 1.887 11724 1.661 18957 1.304 17067 1.339 2.22 1877 1.166 1604 1.160 1453 3.139 2066 3.644 2.23 3807 2.150 7850 3.586 15625 6.117 1020 6.230 2.24! 7198 0.302 12745 0.301 115787 0.436 61652 0.329 2.25 4749 0.339 4896 0.341 256508 0.342 568 0.342 63Appendix A. Tables Table 6.4: Average accuracy for 25 trials obtained by the AGS and RAGS algorithm for the Gupal estimate of the gradient of the Steklov averaged function. AGS RAGS Regular Stop Robust Stop Regular Stop Robust Stop Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc. 2.1 2775 2.448 2542 2.124 13126 3.896 89 2.708 2.2 3729 3.267 2221 2.813 5029 15.904 1776 7.228 2.3 2243 0.000 2262 0.000 2276 0.000 2255 0.000 2.4 2985 2.771 2841 2.892 3475 3.449 2362 3.738 2.5 3493 1.213 3529 1.196 3447 1.211 338 1.200 2.6 3144 0.187 3245 0.188 3018 0.162 3059 0.162 2.7 2631 1.368 3129 1.248 2476 1.048 2208 1.047 2.8 2711 1.125 3898 0.893 2231 0.514 5846 0.515 2.9 3102 0.727 3011 0.600 2955 0.937 3248 0.863 2.10 3075 3.241 2927 3.272 3100 0.000 3050 0.000 2.11 2947 1.527 3307 1.528 3003 1.560 2905 1.560 2.12 3095 1.099 7179 0.000 2670 0.788 7803 0.000 2.13 2755 0.710 1485 0.715 2517 0.231 6871 0.227 2.14 2965 0.574 3070 0.427 2860 0.708 4571 0.668 2.15 2658 0.010 2386 0.017 3355 0.050 3210 0.031 2.16 3431 0.457 3256 0.459 2861 0.199 2620 0.119 2.18 3936 16.345 5814 0.000 3950 16.451 6598 4.542 2.19 3337 0.014 3270 0.011 3488 0.970 3376 0.957 2.20 4604 0.835 4434 0.808 9459 1.360 10560 1.359 2.21 5468 0.000 5418 0.000 6632 0.641 6159 0.635 2.22 21 0.000 21 0.000 21 0.000 21 0.000 2.23 5436 1.814 5176 1.877 1.00E+06 2.354 954 2.415 2.24 7426 0.280 171 0.017 7927 0.043 6389 0.283 2.25 4519 0.286 4814 0.300 3760 0.017 3209 0.023 64
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A derivative-free approximate gradient sampling algorithm...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A derivative-free approximate gradient sampling algorithm for finite minimax problems Nutini, Julie Ann 2012-04-23
pdf
Page Metadata
Item Metadata
Title | A derivative-free approximate gradient sampling algorithm for finite minimax problems |
Creator |
Nutini, Julie Ann |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | Mathematical optimization is the process of minimizing (or maximizing) a function. An algorithm is used to optimize a function when the minimum cannot be found by hand, or finding the minimum by hand is inefficient. The minimum of a function is a critical point and corresponds to a gradient (derivative) of 0. Thus, optimization algorithms commonly require gradient calculations. When gradient information of the objective function is unavailable, unreliable or ‘expensive’ in terms of computation time, a derivative-free optimization algorithm is ideal. As the name suggests, derivative-free optimization algorithms do not require gradient calculations. In this thesis, we present a derivative-free optimization algorithm for finite minimax problems. Structurally, a finite minimax problem minimizes the maximum taken over a finite set of functions. We focus on the finite minimax problem due to its frequent appearance in real-world applications. We present convergence results for a regular and a robust version of our algorithm, showing in both cases that either the function is unbounded below (the minimum is −∞) or we have found a critical point. Theoretical results are explored for stopping conditions. Additionally, theoretical and numerical results are presented for three examples of approximate gradients that can be used in our algorithm: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithm, the three approximate gradients, and the regular and robust stopping conditions. Finally, an application in seismic retrofitting is discussed. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-04-23 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0072771 |
URI | http://hdl.handle.net/2429/42200 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Irving K. Barber School of Arts and Sciences (Okanagan) |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-05 |
Campus |
UBCO |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2012_spring_nutini_julie.pdf [ 722.2kB ]
- Metadata
- JSON: 24-1.0072771.json
- JSON-LD: 24-1.0072771-ld.json
- RDF/XML (Pretty): 24-1.0072771-rdf.xml
- RDF/JSON: 24-1.0072771-rdf.json
- Turtle: 24-1.0072771-turtle.txt
- N-Triples: 24-1.0072771-rdf-ntriples.txt
- Original Record: 24-1.0072771-source.json
- Full Text
- 24-1.0072771-fulltext.txt
- Citation
- 24-1.0072771.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0072771/manifest