Theory and Algorithmic Applications of the Proximal Mapping and Moreau Envelope

by

Chayne Daniel Planiden

B.Sc. Hons., Universidad Autónoma de Baja California, 2009
M.Sc., University of British Columbia, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The College of Graduate Studies
(Mathematics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Okanagan)

February 2018

© Chayne Daniel Planiden, 2018

The following individuals certify that they have read, and recommend to the College of Graduate Studies for acceptance, a thesis/dissertation entitled:

THEORY AND ALGORITHMIC APPLICATIONS OF THE PROXIMAL MAPPING AND MOREAU ENVELOPE

submitted by CHAYNE DANIEL PLANIDEN in partial fulfilment of the requirements of the degree of Doctor of Philosophy

Warren Hare, Irving K. Barber School of Arts and Sciences
Supervisor

Shawn Wang, Irving K. Barber School of Arts and Sciences
Supervisory Committee Member

Heinz Bauschke, Irving K. Barber School of Arts and Sciences
Supervisory Committee Member

Chen Feng, School of Engineering
University Examiner

Paulo da Silva, University of Campinas
External Examiner

Abstract

The Moreau envelope and the proximal mapping have been objects of great interest for optimizers since their conception more than half a century ago [162, 163]. They provide us with many desirable properties; for instance, the Moreau envelope of a convex function is smooth (differentiable) while the function may not be, and the envelope maintains the same minimum value and the same set of minimizers as the function [196, 215]. This is a great advantage to have when the objective is to minimize the function, because standard Calculus methods can then be applied to minimize the smooth envelope. From a computational standpoint, the proximal mapping has given rise to many efficient minimization algorithms, such as the proximal point method [151] and proximal bundle methods [99].

Derivative-free optimization methods [57] continue to grow in importance and popularity.
The term ‘derivative-free’ refers to the fact that for the function to be minimized, (sub)gradient information is either unavailable or inconvenient to use, thus necessitating an algorithm that does not require subgradients. Such algorithms rely on constructs such as the simplex gradient to obtain good-quality approximations of subgradients and use the approximations in derivative-free versions of the proximal routines.

The present work is divided into three major parts. Part I provides a history of the Moreau envelope, with the goal of illustrating its usefulness and some of its successes over the past few decades. Part II contains new theoretical results that involve the Moreau envelope and the proximal mapping on many topics, including prox-boundedness, convex functions with unique and/or strong minimizers, Baire category and epiconvergence. Part III is the algorithmic section, where a proximal bundle method is converted to derivative-free format. Using this result, a particular minimization algorithm for convex finite-max functions called the VU-algorithm [159] is presented and also converted to derivative-free. The new method is proved convergent and numerical results are included.

Lay Summary

This thesis advances theory and application of Optimization of functions, i.e. finding the minimum or maximum value. The Moreau envelope is the main tool used for this objective; it is a manner of approximating the function that needs optimizing by another function that is more well-behaved. Then the optimal value of the Moreau envelope function is found, which corresponds to that of the original function.

The theoretical work develops many useful equivalencies to the Moreau envelope that were previously unknown, as well as counts and classifies many particular types of functions for which it is used. Several examples with graphical illustrations are provided.

The algorithmic work uses Moreau envelope theory to implement a particular minimization algorithm.
The major contribution is the derivative-free format of the algorithm: until now the functioning of the algorithm relied on the availability of derivatives of the function in question; now it does not.

Preface

This thesis has been adapted from the following manuscripts: [94–96, 178–180]. All manuscripts were coauthored by Dr. Warren Hare or Dr. Xianfu Wang of University of British Columbia, Okanagan campus in Kelowna, Canada. The manuscript [96] was also coauthored by Dr. Claudia Sagastizábal of Instituto de Matemática Pura e Aplicada in Rio de Janeiro, Brazil. All manuscripts are either published in, accepted to, or submitted to Optimization journals.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

I Introduction and History

Chapter 1: Introduction
  1.1 The goals of this thesis
  1.2 The presentation of this thesis

Chapter 2: Preliminaries
  2.1 Notation
  2.2 Definitions
  2.3 Facts, lemmas and propositions

Chapter 3: History
  3.1 Overview
  3.2 The origin of the Moreau envelope
  3.3 An example by Jean-Jacques Moreau
  3.4 Smoothing
    3.4.1 The universal density functional
    3.4.2 High-dimensional posterior distributions
    3.4.3 The Stefan problem
    3.4.4 The Cahn-Hilliard/Navier-Stokes system
    3.4.5 The time crisis problem
  3.5 The proximal point algorithm and its variants
    3.5.1 The proximal point and proximal bundle algorithms
    3.5.2 An inexact proximal bundle algorithm
    3.5.3 The accelerated gradient method
    3.5.4 The Douglas–Rachford algorithm
  3.6 Other regularizations
    3.6.1 Yosida regularization
    3.6.2 Tikhonov regularization
    3.6.3 Bregman regularization

II Theory

Chapter 4: Thresholds of Prox-boundedness of PLQ Functions
  4.1 Overview
  4.2 The threshold of prox-boundedness
  4.3 Full-domain quadratic functions
  4.4 PLQ functions
    4.4.1 Quadratic functions with conic domain
    4.4.2 Quadratic functions with polyhedral domain
    4.4.3 PLQ functions
  4.5 Examples

Chapter 5: Most Convex Functions have Unique Minimizers
  5.1 Overview
  5.2 Functions with unique minimizers
  5.3 Chapter-specific definitions and facts
  5.4 The complete metric space of subdifferentials
  5.5 Genericity of the set of convex functions with unique minimizers
    5.5.1 Super-regularity
    5.5.2 Denseness and Baire category

Chapter 6: Strongly Convex Functions and Functions with Strong Minimizers
  6.1 Overview
  6.2 Strong convexity and strong minimizers
  6.3 Chapter-specific definitions and facts
    6.3.1 Strong convexity and coercivity
    6.3.2 Baire category
    6.3.3 Monotonicity
    6.3.4 Differentiability, conjugation and epiconvergence
  6.4 The Moreau envelope of coercive functions and strongly convex functions
  6.5 A complete metric space using Moreau envelopes
  6.6 Baire category results
    6.6.1 Characterizations of the strong minimizer
    6.6.2 The set of strongly convex functions is dense, but first category
    6.6.3 The set of convex functions with strong minimizers is second category

Chapter 7: Generalized Linear-quadratic Functions
  7.1 Overview
  7.2 Generalized linear-quadratic functions
  7.3 Chapter-specific definitions and facts
  7.4 Epigraphical limits of quadratic functions on $\mathbb{R}$
  7.5 Generalized linear-quadratic functions on $\mathbb{R}^n$
    7.5.1 Linear relations and generalized linear-quadratic functions
    7.5.2 Properties and calculus of $q_A$
    7.5.3 The Fenchel conjugate of $q_A$
    7.5.4 Relating the set-valued inverse and the Moore–Penrose inverse
    7.5.5 Characterizations of Moreau envelopes
    7.5.6 A characterization of generalized linear-quadratic functions
  7.6 Applications
    7.6.1 A seminorm with infinite values
    7.6.2 The least squares problem
    7.6.3 Epiconvergence and algebra rules

III Applications: DFO Proximal Bundle Algorithms

Chapter 8: Computing Proximal Points of Convex Functions with Inexact Subgradients
  8.1 Overview
  8.2 The DFO proximal bundle method
  8.3 The proximal point
  8.4 Replacing exactness with approximation
    8.4.1 The approximate model function and approximate subgradient
    8.4.2 The algorithm
    8.4.3 Relation to other inexact gradient proximal-style subroutines
  8.5 Convergence
  8.6 Numerical tests
    8.6.1 Bundle variants
    8.6.2 Max-of-quadratics tests
    8.6.3 DFO tests
    8.6.4 Simplex gradient vs. randsphere tests
  8.7 Summary

Chapter 9: A Derivative-free VU-algorithm
  9.1 Overview
  9.2 Background
    9.2.1 Assumptions
  9.3 VU-theory
    9.3.1 Primal-dual tracks
    9.3.2 The VU-algorithm
  9.4 Defining inexact subgradients and related approximations
  9.5 Convergence
  9.6 Numerical Results
    9.6.1 Test functions and benchmark rules
    9.6.2 Comparing the solvers’ accuracy on $f$ and $\dim V$
    9.6.3 Performance Profiles
    9.6.4 CPU time, function evaluations and failures
  9.7 Summary

Chapter 10: Conclusion and Future Work
  10.1 Main results
  10.2 Future work

Bibliography

List of Tables

Table 4.1 Results of Example 4.5.4
Table 8.1 Average values among the four bundle variants.
Table 8.2 Set of test problems using simplex gradients.
Table 8.3 Average values among the four bundle variants.
Table 8.4 Average values for two methods of approximating subgradients.
Table 9.1 Results for MAXQUAD test function, $\dim V(\bar{x}) = 3$.
Table 9.2 Average accuracy RA for 602 runs.
Table 9.3 The V-dimension prediction comparison between the inexact solvers.

List of Figures

Figure 3.1 An illustration of the Bregman distance.
Figure 4.1 The Moreau envelopes of PLQ functions with the same threshold may have different domains.
Figure 4.2 The partitioning of $\mathbb{R}^2$ for $f(x, y)$.
Figure 8.1 Low-dimension performance profile.
Figure 8.2 High-dimension performance profile.
Figure 8.3 Low-dimension performance profile – randsphere vs. simplex gradient.
Figure 9.1 Performance Profile: (reciprocal of) accuracy, all solvers.
Figure 9.2 Performance Profile: (reciprocal of) accuracy, solvers INEXBUN and DFO-VU.

Acknowledgements

I wish to thank my supervisors Drs. Warren Hare and Shawn Wang for their invaluable insights and tireless assistance on all of our published manuscripts. You were always at hand to help me out when I needed it, and otherwise you were content to leave me to my work. This resulted in a very productive relationship, one which I will be sorry to have to leave behind when I move on. Because of you, I now feel well-prepared to venture out on my own. Thank you very much.

This work was financially supported by UBC, and by NSERC grant number CGSD2-489327-2016.

Dedication

To my wife.

You sacrificed a lot to enable me to get through this. You left your country, culture, family and friends, you put your career on hold, you put up with snow. Four years is a long time to bear such difficulties, but you did it without complaint and I appreciate it.
Now it’s done and the rest of our lives will be better for it. Thank you for your love and support throughout this process and everything else that we have faced together. I love you.

Part I
Introduction and History

Chapter 1
Introduction

The principal task of an optimizer is to find the maximum or minimum value, and the set of points (maximizers/minimizers) at which the sought-after value occurs, of a given function of interest called the objective function. There are countless applications of Optimization in industry; the owner of any business big or small needs to be concerned with important issues such as maximizing production, efficiency, revenue, profit, etc. and minimizing risk, waste, overhead costs, delivery time, etc. Since any problem of maximization can be transformed easily into a problem of minimization of equal difficulty and cost, we typically restrict ourselves to the discussion of minimization problems.

A useful tool available to us in the pursuit of this goal is the Moreau envelope, also known as the Moreau-Yosida regularization, which made its debut in the 1960s. In the setting of convex objective functions, the Moreau envelope is a regularizing (smoothing) function [162, 163, 215], meaning that it converts a nondifferentiable function into a differentiable one, and it has the same minimum and minimizers as the objective function [196, 215]. There is an abundant library on the theoretical development and the practical application of the Moreau envelope; see [24, 30, 47, 51, 52, 55, 86, 92, 93, 97, 98, 113, 115, 117, 144, 145, 182, 183, 196, 212]. Recent mathematical history is full of instances of minimization problems that were unsolvable until the objective function was regularized by the Moreau envelope (see [6, 29, 37, 107, 131]); several of these are discussed in Chapter 3.

Closely associated with the Moreau envelope is the proximal mapping.
This operator is the foundation of many minimization algorithms, such as proximal point algorithms [45, 81, 83, 135, 151, 174, 191] and proximal bundle methods [99, 106, 125]. Several variants of these algorithms have arisen, such as the accelerated gradient method [3, 72, 73, 79, 122], the Douglas–Rachford method [22, 25, 26, 39, 61] and others. The Moreau envelope and the proximal mapping form the basis of this thesis.

1.1 The goals of this thesis

The first goal of this work is to convince the reader of the importance of the Moreau envelope and the proximal mapping. To that end, Chapter 3 showcases several examples of practical application, where these tools have been used to solve real-world problems. The next goal is to further the theoretical results that pertain to the Moreau envelope. There are many topics covered in this area, including prox-boundedness, functions with strong and/or unique minimizers, epiconvergence, and generalized linear-quadratic functions. We provide several new results under each of these headings. The third goal is to present new algorithmic results: a derivative-free minimization method for convex nonsmooth objective functions that makes use of the proximal mapping. The algorithm of interest in this thesis is called the VU-algorithm. It is a two-step iterative algorithm that employs a proximal bundle method in one of the steps. Our contributions were the rewrite of the bundle method to include a novel tilt-correct step in the model construction, in order to convert it to derivative-free, and the change of format of the VU-algorithm to a derivative-free framework.

1.2 The presentation of this thesis

This section outlines the organization of the remaining chapters found in this work. Chapter 2 states the notation used throughout and contains a number of definitions and useful facts.
Chapter 3 gives an account of some of the major contributions of the Moreau envelope and the proximal mapping from its inception to the present day. We begin with an application by Jean-Jacques Moreau, the envelope’s namesake. This is followed by five instances of the Moreau envelope applied to a physical problem in order to regularize the problem and make it solvable. Then several proximal point algorithms are highlighted and discussed, as well as other comparable types of regularization.

Chapter 4 commences the theoretical portion of the thesis and presents one of the earliest new contributions to this work: how to identify and categorize the threshold of prox-boundedness of a piecewise linear-quadratic function. Chapters 5 and 6 explore functions that have unique minimizers and strong minimizers. We use Baire category and epiconvergence tools to prove that the set of functions with unique minimizers is large (Baire category two), as is the set of functions with strong minimizers, but that the set of strongly convex functions is relatively small (Baire category one). In Chapter 7 we continue with the theme of epiconvergence, this time from the point of view of linear relations and generalized quadratic functions. We explore the question of when a quadratic function is a Moreau envelope of a generalized linear-quadratic function; characterizations involving nonexpansiveness and Lipschitz continuity are established.

In Chapter 8, we switch our focus to the algorithmic segment of our goals. We present a new bundle method for finding the proximal point of a convex function, by using inexact subgradients rather than exact ones. This transforms the algorithm into derivative-free format, in preparation for use in the following chapter. Chapter 9 contains the culmination of the computational work of this thesis: the derivative-free VU-algorithm.
Using the derivative-free proximal bundle method as a subroutine, we alter the remainder of the algorithm to run without need of exact subgradients as well. Chapter 10 concludes this document, summarizing all that we have recorded and discussing several avenues of future research that should be done based on the results presented here.

Chapter 2
Preliminaries

In this chapter, we present notation used throughout the thesis and list a number of definitions, facts, lemmas and propositions that are needed in subsequent chapters. Wherever not specified, notation follows the format of [196].

2.1 Notation

We work in Euclidean space $\mathbb{R}^n$, with inner product $\langle x, y\rangle = \sum_{i=1}^n x_i y_i$ and induced norm $\|x\| = \sqrt{\langle x, x\rangle}$. The extended real line is denoted by $\mathbb{R}\cup\{\infty\}$. The symbol $\exists$ means ‘there exists’, and the symbol $\forall$ means ‘for all’. We use $\Gamma_0(\mathbb{R}^n)$ to represent the set of proper, convex, lower semicontinuous (lsc) functions on $\mathbb{R}^n$ (the terms proper, convex and lower semicontinuous are defined below). The identity operator is denoted by $\operatorname{Id}$. We use $B_\delta(x)$ to represent the open ball centred at $x$ of radius $\delta$. The relative interior of a set $S$ is denoted by $\operatorname{ri} S$ and is its interior within its affine hull. We use $\to$ for single-valued maps (functions) and $\rightrightarrows$ for multivalued maps. We use $\xrightarrow{p}$ for pointwise convergence, $\xrightarrow{e}$ for epiconvergence, $\xrightarrow{g}$ for graphical convergence and $\xrightarrow{u}$ for uniform convergence. The gradient of a differentiable function $f$ is denoted by $\nabla f$ and the Hessian is denoted by $\nabla^2 f$.

2.2 Definitions

Where convenient, we use
\[ q(x) = \tfrac{1}{2}\|x\|^2. \]
The distance from a point $x$ to a set $C$ is defined by
\[ d_C(x) = \inf_{y\in C}\|y - x\|, \]
and the projection of a point $x$ onto a closed set $C$ is defined by
\[ P_C(x) = \{y \in C : \|y - x\| = d_C(x)\}. \]

Definition 2.2.1. A function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is proper if
(i) there does not exist $x \in \mathbb{R}^n$ such that $f(x) = -\infty$ and
(ii) there exists $x \in \mathbb{R}^n$ such that $f(x) < \infty$.

Definition 2.2.2.
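A function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is called $K$-Lipschitz continuous if there exists $K > 0$ such that
\[ |f(y) - f(x)| \le K\|y - x\| \quad \text{for all } x, y \in \mathbb{R}^n. \]

Definition 2.2.3. For $k \in \mathbb{N}\cup\{0\}$, a function $f : \mathbb{R}^n \to \mathbb{R}$ is of differentiability class $C^k$ ($C^{k+}$) if all partial derivatives of $f$ up to and including order $k$ exist and are (Lipschitz) continuous.

Definition 2.2.4. The limit inferior of a function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ at a point $a$ is defined by
\[ \liminf_{x\to a} f(x) = \sup_{a \in U \text{ open}} \Big\{ \inf_{x \in U\setminus\{a\}} f(x) \Big\}. \]
Similarly, the limit superior of $f$ is defined by
\[ \limsup_{x\to a} f(x) = \inf_{a \in U \text{ open}} \Big\{ \sup_{x \in U\setminus\{a\}} f(x) \Big\}. \]

Definition 2.2.5. Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$.
(i) The function $f$ is lower semicontinuous (lsc) at $\bar{x}$ if the following hold.
  (a) If $f(\bar{x}) < \infty$, then for every $\varepsilon > 0$ there exists $\delta > 0$ such that
  \[ f(x) \ge f(\bar{x}) - \varepsilon \quad \text{whenever } \|x - \bar{x}\| < \delta. \]
  (b) If $f(\bar{x}) = \infty$, then $f(x) \to \infty$ as $x \to \bar{x}$.
Equivalently, $f$ is lower semicontinuous at $\bar{x}$ if $\liminf_{x\to\bar{x}} f(x) \ge f(\bar{x})$.
(ii) The function $f$ is upper semicontinuous (usc) at $\bar{x}$ if the following hold.
  (a) If $f(\bar{x}) > -\infty$, then for every $\varepsilon > 0$ there exists $\delta > 0$ such that
  \[ f(x) \le f(\bar{x}) + \varepsilon \quad \text{whenever } \|x - \bar{x}\| < \delta. \]
  (b) If $f(\bar{x}) = -\infty$, then $f(x) \to -\infty$ as $x \to \bar{x}$.
Equivalently, $f$ is upper semicontinuous at $\bar{x}$ if $\limsup_{x\to\bar{x}} f(x) \le f(\bar{x})$.
(iii) The function $f$ is continuous at $\bar{x}$ if $f$ is both lsc and usc at $\bar{x}$.

Definition 2.2.6. A function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is convex if for all $x, y \in \operatorname{dom} f$ and for all $\lambda \in [0, 1]$,
\[ (1 - \lambda) f(x) + \lambda f(y) \ge f((1 - \lambda)x + \lambda y). \]

Definition 2.2.7. A function $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called piecewise linear-quadratic (PLQ) if $\operatorname{dom} f$ can be represented as the union of finitely many polyhedral sets, relative to each of which $f(x)$ is given by an expression of the form $\tfrac{1}{2}\langle x, Ax\rangle + \langle b, x\rangle + c$ for some scalar $c \in \mathbb{R}$, vector $b \in \mathbb{R}^n$ and symmetric matrix $A \in S^n$.

Definition 2.2.8. The graph of an operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is defined by
\[ \operatorname{gra} A = \{(x, x^*) : x^* \in Ax\}. \]
The inverse $A^{-1}$ is defined by the graph
\[ \operatorname{gra} A^{-1} = \{(x^*, x) : x^* \in Ax\}. \]

Definition 2.2.9.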
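For a sequence of mappings $S_k : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$, the pointwise outer limit and the pointwise inner limit are the mappings $\operatorname{plimsup}_k S_k$ and $\operatorname{pliminf}_k S_k$, respectively, defined at each point $x$ by
\[ (\operatorname{plimsup}_k S_k)(x) = \limsup_k S_k(x), \qquad (\operatorname{pliminf}_k S_k)(x) = \liminf_k S_k(x). \]
When the pointwise outer and inner limits agree, we say that the pointwise limit $\operatorname{plim}_k S_k$ exists. Hence,
\[ S = \operatorname{plim}_k S_k \iff \operatorname{plimsup}_k S_k \subset S \text{ and } S \subset \operatorname{pliminf}_k S_k. \]
In this case, the notation $S_k \xrightarrow{p} S$ is used and the mappings $S_k$ are said to converge pointwise to $S$. Thus,
\[ S_k \xrightarrow{p} S \iff S_k(x) \to S(x) \text{ for all } x. \]

Definition 2.2.10. For a sequence of mappings $S_k : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$, the graphical outer limit $\operatorname{glimsup}_k S_k$ is the mapping having as its graph the set $\limsup_k(\operatorname{gra} S_k)$:
\[ \operatorname{gra}(\operatorname{glimsup}_k S_k) = \limsup_k(\operatorname{gra} S_k). \]
The graphical inner limit $\operatorname{gliminf}_k S_k$ is the mapping having as its graph the set $\liminf_k(\operatorname{gra} S_k)$:
\[ \operatorname{gra}(\operatorname{gliminf}_k S_k) = \liminf_k(\operatorname{gra} S_k). \]
If these outer and inner limits agree, the graphical limit $\operatorname{glim}_k S_k$ exists; thus, $S = \operatorname{glim}_k S_k$ if and only if $\operatorname{glimsup}_k S_k \subset S$ and $S \subset \operatorname{gliminf}_k S_k$. In this case, the notation $S_k \xrightarrow{g} S$ is used and the mappings $S_k$ are said to converge graphically to $S$. Thus,
\[ S_k \xrightarrow{g} S \iff \operatorname{gra} S_k \to \operatorname{gra} S. \]

Definition 2.2.11. For any sequence $\{f_k\}$ of functions on $\mathbb{R}^n$, the lower epilimit $\operatorname{eliminf}_k f_k$ is the function having as its epigraph the outer limit of the sequence of sets $\operatorname{epi} f_k$:
\[ \operatorname{epi}(\operatorname{eliminf}_k f_k) = \limsup_k(\operatorname{epi} f_k). \]
The upper epilimit $\operatorname{elimsup}_k f_k$ is the function having as its epigraph the inner limit of the sets $\operatorname{epi} f_k$:
\[ \operatorname{epi}(\operatorname{elimsup}_k f_k) = \liminf_k(\operatorname{epi} f_k). \]
When these two functions coincide, the epilimit $\operatorname{elim}_k f_k$ is said to exist:
\[ \operatorname{elim}_k f_k = \operatorname{eliminf}_k f_k = \operatorname{elimsup}_k f_k. \]
In this event, the functions are said to epiconverge to $f$, symbolized by $f_k \xrightarrow{e} f$. Thus,
\[ f_k \xrightarrow{e} f \iff \operatorname{epi} f_k \xrightarrow{g} \operatorname{epi} f. \]

Definition 2.2.12. A sequence of functions $\{f_k\}$, $k \in \mathbb{N}$, is said to be uniformly convergent to $f$ for a set $S$ of vectors $x$ if, for each $\varepsilon > 0$, an integer $K$ exists such that
\[ |f_k(x) - f(x)| < \varepsilon \quad \text{for all } k \ge K, \text{ for all } x \in S. \]

Definition 2.2.13.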
A function $f : \mathbb{R}^n \to \mathbb{R}$ is little-o of function $g : \mathbb{R}^n \to \mathbb{R}$, written $f \in o(g)$, if
\[ \lim_{\|x\|\to\infty} \frac{f(x)}{g(x)} = 0. \]

Definition 2.2.14. Consider a proper function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ and a point $\bar{x}$ such that $f(\bar{x}) < \infty$. A vector $v \in \mathbb{R}^n$ is a
(i) regular subgradient of $f$ at $\bar{x}$, written $v \in \hat{\partial} f(\bar{x})$, if for all $x \in \mathbb{R}^n$,
\[ f(x) \ge f(\bar{x}) + \langle v, x - \bar{x}\rangle + o(\|x - \bar{x}\|); \]
(ii) subgradient of $f$ at $\bar{x}$, written $v \in \partial f(\bar{x})$, if there exist sequences $x_k \xrightarrow{f} \bar{x}$ and $v_k \in \hat{\partial} f(x_k)$ with $v_k \to v$.
The sets $\hat{\partial} f(\bar{x})$ and $\partial f(\bar{x})$ are called, respectively, the regular subdifferential and the subdifferential of $f$ at $\bar{x}$.

Definition 2.2.15. The Moreau envelope with proximal parameter $r > 0$ of a proper, lsc function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is defined by
\[ e_r f(x) = \inf_{y\in\mathbb{R}^n} \Big\{ f(y) + \frac{r}{2}\|y - x\|^2 \Big\}. \]
The proximal mapping is the (possibly empty) set of points at which this infimum is achieved and is denoted by $P_r f$:
\[ P_r f(x) = \operatorname*{argmin}_{y\in\mathbb{R}^n} \Big\{ f(y) + \frac{r}{2}\|y - x\|^2 \Big\}. \]

Note 2.2.16. Many articles in the literature use the notation ${}^{\lambda}f$ for the Moreau envelope of $f$ with parameter $\lambda$ and have $\lambda$ in the denominator of the infimand:
\[ {}^{\lambda}f(x) = \inf_{y\in\mathbb{R}^n} \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \Big\}. \]
Both definitions have their conveniences and their drawbacks; the notation that is most useful in this thesis is the $e_r f$ notation, but we use the ${}^{\lambda}f$ notation when the occasion calls for it.

Definition 2.2.17. A proper, lsc function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is prox-bounded if there exists $r > 0$ such that $e_r f(\bar{x}) > -\infty$ for some $\bar{x} \in \mathbb{R}^n$. The infimum of all such $r$ is called the threshold of prox-boundedness.

Definition 2.2.18. The infimal convolution of functions $f, g : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is defined by
\[ (f \mathbin{\square} g)(x) = \inf_{y\in\mathbb{R}^n} \{ f(y) + g(x - y) \}. \]
Infimal convolution gives us an alternate expression for the Moreau envelope:
\[ e_r f = f \mathbin{\square} \frac{r}{2}\|\cdot\|^2. \tag{2.2.1} \]

Definition 2.2.19. The Fenchel conjugate of a function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ is defined by
\[ f^*(v) = \sup_{x\in\mathbb{R}^n} \{ \langle v, x\rangle - f(x) \}. \]

Definition 2.2.20. An operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is monotone if
\[ \langle x - y, x^* - y^* \rangle \ge 0 \]
for all $(x, x^*), (y, y^*) \in \operatorname{gra} A$.
The monotone operator $A$ is maximally monotone if there does not exist a proper extension of $A$ that is monotone.

Definition 2.2.21. The resolvent of an operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is $J_A = (\operatorname{Id} + A)^{-1}$.

Definition 2.2.22. A set $S \subseteq \mathbb{R}^n$ is dense in $\mathbb{R}^n$ if every element of $\mathbb{R}^n$ is either in $S$, or a limit point of $S$. A set is nowhere dense in $\mathbb{R}^n$ if the interior of its closure in $\mathbb{R}^n$ is empty.

Definition 2.2.23. A set $S \subseteq \mathbb{R}^n$ is of first category (meagre) if $S$ is a union of countably many nowhere dense sets. A set $S \subseteq \mathbb{R}^n$ is of second category (generic) if $\mathbb{R}^n \setminus S$ is of first category.

Definition 2.2.24. A topological space $S \subseteq \mathbb{R}^n$ is called a Baire space if every intersection of countably many dense, open sets in $S$ is dense in $S$.

Definition 2.2.25. [36, Chapter 1] The Moore-Penrose inverse $A^\dagger$ of a matrix $A$ is defined as the unique matrix that satisfies
(i) $AA^\dagger A = A$,
(ii) $A^\dagger A A^\dagger = A^\dagger$,
(iii) $(AA^\dagger)^* = AA^\dagger$,
(iv) $(A^\dagger A)^* = A^\dagger A$.

2.3 Facts, lemmas and propositions

Lemma 2.3.1. For any proper function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$,
\[ e_r f(x) = \frac{r}{2}\|x\|^2 - g^*(rx), \tag{2.3.1} \]
where $g(x) = f(x) + \frac{r}{2}\|x\|^2$.

Proof. We have
\begin{align*}
e_r f(x) &= \inf_y \Big\{ f(y) + \frac{r}{2}\|y - x\|^2 \Big\} \\
&= -\sup_y \Big\{ -f(y) - \frac{r}{2}\big(\|y\|^2 - 2\langle x, y\rangle + \|x\|^2\big) \Big\} \\
&= \frac{r}{2}\|x\|^2 - \sup_y \Big\{ \langle rx, y\rangle - \Big( f(y) + \frac{r}{2}\|y\|^2 \Big) \Big\} \\
&= \frac{r}{2}\|x\|^2 - g^*(rx).
\end{align*}

Fact 2.3.2. [21, Example 23.3] For a convex function $f$, an alternate representation of the proximal mapping makes use of the resolvent of $\partial f$, which also provides a conversion to the proximal mapping with proximal parameter 1:
\[ P_r f = \Big( \operatorname{Id} + \frac{1}{r}\partial f \Big)^{-1} = P_1\Big( \frac{1}{r} f \Big). \]

Fact 2.3.3. [21, Theorem 16.2] For any proper, lsc function $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ and any $r > 0$, we have
\[ p \in P_r f(x) \;\Rightarrow\; 0 \in \partial f(p) + r(p - x) \;\Leftrightarrow\; 0 \in \frac{1}{r}\partial f(p) + p - x. \]
If, in addition, $f$ is convex, then the first implication above becomes bidirectional:
\[ p \in P_r f(x) \;\Leftrightarrow\; 0 \in \partial f(p) + r(p - x). \]

Proposition 2.3.4 (Properties of the Moreau envelope).
For any f : Rn → R∪{∞},r > 0, v ∈ Rn and c ∈ R, the following hold:(i) er(f + c) = erf + c;(ii) erf = re1(f/r);(iii) er(f(· − c)) = (erf)(· − c);(iv) e1f = q − (f + q)∗;(v) e1(f + 〈·, v〉) = e1f(· − v) + 〈·, v〉 − q(v);(vi) (erf)∗ = f∗ + q/r.Proof. (i) This is seen directly as a property of the infimum: for any function g andany c ∈ R, inf{g(x) + c} = inf g(x) + c.(ii) See [21, Proposition 12.22].(iii) Let z = y − c. Thener(f(· − c))(x) = infy∈Rn{f(y − c) + r2‖y − x‖2}= infz∈Rn{f(z) +r2‖z − (x− c)‖2}= (erf)(x− c).112.3. Facts, lemmas and propositions(iv) This is Lemma 2.3.1 with r = 1.(v) Consider the left-hand side of statement (v) first. Applying statement (iv) tof + 〈·, v〉, we havee1(f + 〈·, v〉) = q − (f + 〈·, v〉+ q)∗.Applying [21, Proposition 13.20(iii)] to the function f + q with y = 0 and α = 0,we havee1(f + 〈·, v〉) = q − [f(· − v) + q(· − v)]∗. (2.3.2)Now consider the right-hand side of statement (v). Applying statement (iv) tof(· − v), we havee1(f(· − v)) = q(· − v)− [f(· − v) + q(· − v)]∗,= q − [f(· − v) + q(· − v)]∗ − 〈·, v〉+ q(v),e1(f(· − v)) + 〈·, v〉 − q(v) = q − [f(· − v) + q(· − v)]∗,which is the same as (2.3.2).(vi) By [21, Proposition 13.21(iii)] with g = rq, we have (f @ rq)∗ = f∗ + (rq)∗.By (2.2.1), we have (erf)∗ = (f @ rq)∗, and the fact that (rq)∗ = q/r follows by[21, Example 13.4].Proposition 2.3.5. Let f ∈ Γ0(Rn). Then f is prox-bounded with threshold 0, Prfis single-valued and continuous, and erf is convex and continuously differentiable.Moreover, the following properties hold.(i) erf(x) + e 1rf∗(rx) =r2‖x‖2;(ii) ∇erf(x) = r[x− Prf(x)];(iii) ∇erf∗(x) = P 1rf(rx);(iv) Prf(x) = ∇g(x), where g(x) = 1r[e 1rf∗(rx)];(v) Prf∗(x) = x− 1rP 1rf(rx).Proof. The proof that f has threshold 0, Prf is single-valued and continuous, anderf is convex and continuously differentiable is found in [196, Theorem 2.26].(i) See [196, Example 11.26].(ii) See [196, Theorem 2.26].122.3. 
(iii) Replacing $f$ with $f^*$ in part (i) and using the fact that $f^{**} = f$, we have
\[ e_r f^*(x) + e_{\frac{1}{r}} f(rx) = \frac{r}{2}\|x\|^2. \]
Differentiating both sides and rearranging yields
\[ \nabla e_r f^*(x) = rx - \nabla e_{\frac{1}{r}} f(rx). \]
We substitute $z = rx$, then use part (ii) and the chain rule to get
\begin{align*}
\nabla e_r f^*(x) &= rx - \nabla_x e_{\frac{1}{r}} f(z) \\
&= rx - \nabla_z e_{\frac{1}{r}} f(z) \nabla_x z \\
&= rx - \frac{1}{r}\left[ z - P_{\frac{1}{r}} f(z) \right] r \\
&= P_{\frac{1}{r}} f(rx).
\end{align*}

(iv) See [196, Exercise 11.27].

(v) Replacing $f$ with $f^*$ in part (iv), we have
\[ P_r f^*(x) = \nabla g(x), \quad \text{where } g(x) = \frac{1}{r}\left[ e_{\frac{1}{r}} f(rx) \right]. \]
Substituting $z = rx$, then applying part (ii) and the chain rule yields
\begin{align*}
\nabla g(x) &= \frac{1}{r} \nabla_z \left( e_{\frac{1}{r}} f(z) \right) \nabla_x z \\
&= \frac{1}{r}\left[ \frac{1}{r}\left( z - P_{\frac{1}{r}} f(z) \right) \right] r \\
&= \frac{1}{r}\left( rx - P_{\frac{1}{r}} f(rx) \right) \\
&= x - \frac{1}{r} P_{\frac{1}{r}} f(rx).
\end{align*}

This concludes the preliminary portion of the thesis. We now move on to the historical section.

Chapter 3
History

3.1 Overview

This chapter provides a sampling of the successful uses of the Moreau envelope and the proximal mapping since the 1960s. There are no new results in this chapter; it is meant to give the reader background information on why these objects are so important in optimization and how they have been applied. We discuss the following three topics.

• Smoothing: the Moreau envelope of a convex function is not only convex, but also $C^1$, and an explicit expression for its gradient is known. Five instances of real-world problems solved by smoothing are presented.

• The proximal point algorithm: iteratively calculating the proximal point of a convex function $f$ converges to a minimizer of $f$. Many variants of this method exist; four of them are highlighted here.

• Other regularizations: Moreau-Yosida regularization is not the only type of regularization in use. Three other regularization methods are outlined in this chapter and their relationships to the Moreau envelope are shown.

3.2 The origin of the Moreau envelope

The Moreau envelope first came to light in the 1960s thanks to Jean-Jacques Moreau [162, 163]. The infimand of the Moreau envelope was presented as the function $\Phi$:
\[ \Phi(y) = f(y) + \frac{1}{2}\|y - x\|^2, \qquad e_1 f(x) = \inf_{y \in \mathbb{R}^n} \Phi(y), \]
without the proximal parameter $r$.
Under convexity, the proximal mapping is a singleton, the point at which $\Phi$ attains its minimum. This is one of many useful results found in [163]; others involve strict convexity, dual function theory and Lipschitz continuity. Over the years, extensions such as the insertion of the proximal parameter [191] and applications to nonconvex functions [118] have been made. This allowed for further growth of the theoretical results, such as the fact that $e_r f \to f$ as $r \to \infty$ [196, Theorem 1.25] and many others. We cannot hope to present a complete history here, but a selection of problems and their solutions is given in this section in order to showcase the positive impact that the Moreau envelope and the proximal mapping have had on the world.

3.3 An example by Jean-Jacques Moreau

Jean-Jacques Moreau was a mechanician as well as a mathematician, with an interest in nonsmooth mechanics as a professor in the Laboratoire de Mécanique et Génie Civil [194]. One of the first examples of the appearance of the proximal mapping in a physical problem is found in Moreau's 1966 paper Quadratic programming in mechanics: dynamics of one-sided constraints [164]. A frictionless mechanical system at a point $q = (q_1, \ldots, q_n)$ is considered over time $t$, with one-sided constraints $f_\alpha(q, t) \geq 0$, $\alpha \in I$, where $I$ is a finite set of indices. The Lagrange multipliers of the system are denoted by $\lambda_\alpha$, so that $\lambda_\alpha f_\alpha(q, t) = 0$ for all $\alpha \in I$. The kinetic energy of the system is
\[ T(q, \dot{q}, t) = \frac{1}{2}\sum_{i,k} a_{ik}(q, t)\dot{q}_i \dot{q}_k + \sum_i b_i(q, t)\dot{q}_i + c(q, t), \]
where $\dot{q}$ represents the derivative of $q$. Using the notation $u_{i\alpha} = \partial f_\alpha / \partial q_i$, the functions $q_i(t)$ and $\lambda_\alpha(t)$ are determined by
\[ \frac{d}{dt}\left( \frac{\partial T}{\partial \dot{q}_i} \right) - \frac{\partial T}{\partial q_i} = Q_i + \sum_{\alpha \in I} \lambda_\alpha u_{i\alpha}, \]
where $Q_i$ is the covariant component of the set of active forces. The problem is to determine the state of acceleration after $t_0$: the right-limits $\ddot{q}_i(t_0 + \varepsilon)$.
The solution is proved via Kuhn-Tucker theory to be the point where the function
\[ G = \frac{1}{2}\sum_{i,k} a_{ik}\ddot{q}_i \ddot{q}_k - \sum_i z_i \ddot{q}_i \]
attains its minimum, where $z_i$ are known quantities. With $z = \sum_i z_i e_i \in E$, where $\{e_i\}$ is a basis of the $n$-dimensional linear space $E$, the Lagrange equations are written
\[ x - \sum_\alpha \lambda_\alpha u_\alpha = z, \tag{3.3.1} \]
with inequality constraints
\[ \langle u_\alpha, x\rangle - s_\alpha \geq 0, \]
where $s_\alpha$ is known. The set of inequality constraints defines a closed, convex, polyhedral region $C$. Finally, the duality-decomposition theorem on quadratic programming [161, 163] is invoked. Defining $f$ as the indicator function of $C$, the dual function $g(y) = \sup_x \{\langle x, y\rangle - f(x)\}$ is the support function of $C$. The duality-decomposition theorem states that if $f$ and $g$ are dual functions, then every $z$ equals the sum of $x = P_1 f(z)$ and $y = P_1 g(z)$, with $x$ and $y$ unique conjugate points. This allows the rewriting of (3.3.1) as a characterization of the reaction exerted by the system against its one-sided constraints, i.e.
\[ -\sum_\alpha \lambda_\alpha u_\alpha \in E. \]
This term equals the proximal point $P_1 g(z)$.

3.4 Smoothing

With the objective of convex minimization in mind, one of the immediate advantages of the Moreau envelope is that it transforms any function $f \in \Gamma_0(\mathbb{R}^n)$ into a differentiable function [196, Theorem 2.26] that preserves the minimum and minimizers of $f$ [196, Proposition 13.37]. In addition, for all $x \in \operatorname{dom} f$ we have that $e_r f(x) \leq f(x)$, and that $e_r f(x) \nearrow f(x)$ as $r \to \infty$ [196, Theorem 1.25]. Thus, any $f \in \Gamma_0(\mathbb{R}^n)$ can be approximated from below by a smooth function as accurately as one wishes by increasing the proximal parameter $r$. Since we have basic calculus techniques for minimization of a differentiable function, it is often preferable to work with the Moreau envelope of $f$, rather than with $f$ itself.
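As a small illustration of the smoothing property, the following sketch evaluates the Moreau envelope of the nonsmooth function $f(y) = |y|$ on $\mathbb{R}$. It assumes the well-known closed form of the proximal point of the absolute value (soft thresholding by $1/r$) and uses the gradient formula $\nabla e_r f(x) = r[x - P_r f(x)]$ of Proposition 2.3.5(ii). The function names are illustrative only and do not come from the thesis.

```python
import math


def prox_abs(x, r):
    """Proximal point of f(y) = |y| with parameter r (soft thresholding by 1/r)."""
    return math.copysign(max(abs(x) - 1.0 / r, 0.0), x)


def moreau_env_abs(x, r):
    """e_r f(x) = min_y { |y| + (r/2)(y - x)^2 }, evaluated at the proximal point."""
    p = prox_abs(x, r)
    return abs(p) + 0.5 * r * (p - x) ** 2


def grad_moreau_env_abs(x, r):
    """Gradient of the envelope via r * (x - P_r f(x))."""
    return r * (x - prox_abs(x, r))


if __name__ == "__main__":
    x = 2.0
    for r in (1.0, 2.0, 10.0, 100.0):
        # The envelope stays below f(x) = |x| and increases towards it as r grows.
        print(r, moreau_env_abs(x, r), abs(x))
```

In this example the envelope is quadratic near the origin and affine away from it, so the kink of $|y|$ at $0$ is replaced by a differentiable function that has the same minimizer.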
We present a number of examples where this smoothing technique has been implemented to solve a problem that is otherwise unmanageable.

3.4.1 The universal density functional

In [111, 131, 140, 175, 177], the universal density functional $F$ is discussed. Density-functional theory, also known as Kohn-Sham theory [129], is currently of research interest in quantum chemistry and quantum physics, since $F$ is known to form a conjugate pair with the ground-state energy $E$. However, it is problematic because while $F$ is convex, it is an erratic function that is everywhere discontinuous and nowhere differentiable [131]. The functional $F(\rho)$ represents the electronic energy of a quantum system consistent with a given density $\rho$. Until 2014 one had to rely upon a very complicated model of $F$, together with assumptions that are probably untrue [131], in order to approximate $F$ accurately enough to be useful. The contribution of [131, 176] was to regularize $F$ and $E$ using the Moreau envelope (therein the notation ${}^{\varepsilon}F$ and ${}^{\varepsilon}E$ is used to denote the envelopes), thus enabling differentiation and facilitating the solving of the system. Moreover, they showed that no information is lost in the regularization process; both $F$ and $E$ are completely recoverable from ${}^{\varepsilon}F$ and ${}^{\varepsilon}E$, respectively, over their entire domains [131, (81)].

The equations that follow are extracted from [131]. The ground-state energy corresponding to the potential $v$ can be written in the form of a Hohenberg-Kohn variation principle:
\[ E(v) = \inf_{\rho \in X} \{F(\rho) + \langle v, \rho\rangle\}, \quad v \in X^*, \]
where $X$ is a Hilbert space.
This establishes $E$ and $F$ as conjugate functions, since reciprocally
\[ F(\rho) = \sup_{v \in X^*} \{E(v) - \langle v, \rho\rangle\}. \]
Since $F$ is a discontinuous function, the Moreau envelope is applied:
\[ {}^{\varepsilon}F(\rho) = \inf_{\rho' \in X} \left\{ F(\rho') + \frac{1}{2\varepsilon}\|\rho - \rho'\|^2 \right\}. \]
This is the conjugate function of
\[ {}^{\varepsilon}E(v) = E(v) - \frac{\varepsilon}{2}\|v\|^2. \]
Though $F$ is finite only on the set $I_N$ of $N$-representable densities ($\rho \in I_N$ if and only if there exists a normalized $N$-electron wave function with finite kinetic energy and density $\rho$ [131, (5)]), ${}^{\varepsilon}F$ is finite on the whole space. Even for non-physical densities, one can consider ${}^{\varepsilon}F(\rho + c)$ where $\rho$ is a physical density, and find that
\[ {}^{\varepsilon}F(\rho + c) = {}^{\varepsilon}F(\rho) + \frac{1}{2\varepsilon}c^2. \]
As a function of $c$, the minimizer is $c = 0$, which recovers a density functional that makes physical sense. Since the proximal point $\rho_\varepsilon$ is unique and
\[ \rho_\varepsilon = P_\varepsilon f(x) \;\Leftrightarrow\; \frac{1}{\varepsilon}(x - \rho_\varepsilon) \in \partial f(\rho_\varepsilon), \]
the Moreau envelope can be written as
\[ {}^{\varepsilon}F(\rho) = F(\rho_\varepsilon) + \frac{1}{2\varepsilon}\|\rho - \rho_\varepsilon\|^2. \]
This is a differentiable function and $\rho_\varepsilon$ is a physical ground-state density. Then, for every given density $\rho$ and corresponding ground-state density $\rho_\varepsilon$, the associated potential with density $\rho_\varepsilon$ is
\[ v_\varepsilon = \frac{1}{\varepsilon}(\rho_\varepsilon - \rho), \]
and all of the pertinent information of the original problem is recovered.

3.4.2 High-dimensional posterior distributions

In statistical modelling, the Bayesian analysis [6, 77, 80, 112, 176, 199] of data results in posterior distributions that are very difficult to manage with traditional (Markov Chain Monte Carlo) methods, especially in high dimensions. Modern theory relies on an approximation of the distribution, typically a Laplace or a variational Bayes approximation, but these methods are focused on the low-dimensional case, and even then they can fail to accurately represent all aspects of the distribution [6]. The proposals in [6, 176, 199] are instead to use the Moreau envelope to obtain a smooth approximation of the distribution function, specifically the negative log-density function.
The posterior distribution is challenging to simulate because the prior distribution from which it originates has a mixed discrete-continuous form. This means that transdimensional methods are needed to sample from the distribution, methods that jump from one subspace to another. As the dimension grows, these jump methods become increasingly inefficient and lose their appeal as solving tools. Applying the Moreau envelope has the effect of decoupling the variables [6], so that each variable in the distribution has its own continuous density. This makes the approximate distribution solvable by standard (not transdimensional) sampling methods. Sampling still becomes more difficult as the proximal parameter is decreased to give a more accurate approximation of the distribution function, but a careful choice of parameter strikes a good balance between sampling efficiency and approximation error.

The equations that follow are extracted from [6]. For variables $(\delta, \phi, \theta)$ in $\Delta \times \Phi \times \Theta$, a prior distribution $\Pi$ is defined by
\[ \Pi(\delta, d\phi, d\theta) = \pi_\delta G(d\phi) \Pi(d\theta \mid \delta, \phi), \]
where $\{\pi_\delta\}$ is a discrete distribution on $\Delta$, $G$ is a prior probability measure on $\Phi$, and $\Pi(\cdot \mid \delta, \phi)$ is a sparsity-inducing prior on $\Theta$. The resulting posterior distribution is denoted by $\hat{\Pi}$,
\[ \hat{\Pi}(d\theta, d\phi \mid x_n) = \frac{q_{\theta,\phi}(x_n)\Pi(d\theta, d\phi)}{\int q_{\theta,\phi}(x_n)\Pi(d\theta, d\phi)} \]
for some function $q$ such that $0 < \int_{\Theta \times \Phi} q_{\theta,\phi}(x)\Pi(d\theta, d\phi) < \infty$ for all $x$. This can be expressed using the log function for each $x_n$:
\[ \hat{\Pi}_n(\delta, d\phi, d\theta \mid x_n) = K \pi_\delta q_{\theta,\phi}(x_n) \exp\left( \sum_{j=1}^{d} \delta_j \log p(\theta_j \mid \phi) \right) G(d\phi)\mu_{d,\delta}(d\theta), \]
where $K$ is constant, $p(\cdot \mid \phi)$ is a positive density, and $\mu_{d,\delta}$ is a product measure on $\mathbb{R}^d$. Due to the discrete-continuous mixture form of the prior distribution $\Pi$, any two posterior distributions $\hat{\Pi}_n(\cdot \mid x_n, \delta)$ and $\hat{\Pi}_n(\cdot \mid x_n, \delta')$ are mutually singular. This is why the transdimensional sampling methods are necessary for any attempt at simulation.
To find the Moreau envelope of $\hat{\Pi}$, the following functions are used:
\begin{align*}
\ell_n(\theta \mid \phi) &= -\log q_{\theta,\phi}(x_n), \\
P_n(\theta \mid \delta, \phi) &= -\sum_{j=1}^{d} \delta_j \log p(\theta_j \mid \phi) + \iota_{\Theta_\delta}(\theta), \\
h_n(\theta \mid \delta, \phi) &= \ell_n(\theta \mid \phi) + P_n(\theta \mid \delta, \phi),
\end{align*}
where $\iota$ is the indicator function. The function $h_n$ is the objective function, but the mutual singularity sends $h_n$ to infinity for any $\theta \notin \Theta_\delta$, and $\Theta_\delta \cap \Theta_{\delta'} = \{0\}$ for $\delta \neq \delta'$. Therefore, $\ell_n$ in $h_n$ is replaced by its linear approximation, and the Moreau envelope is calculated:
\[ h_{n,\gamma}(\theta \mid \delta, \phi) = \min_{u \in \Theta} \left\{ \ell_n(\theta \mid \phi) + \langle \nabla \ell_n(\theta \mid \phi), u - \theta\rangle + P_n(u \mid \delta, \phi) + \frac{1}{2\gamma}\|u - \theta\|^2 \right\}. \]

Remark 3.4.1. This is a case of the proximal parameter being used in the denominator (see Note 2.2.16). Since this is how the author of [6] presents the Moreau envelope, we restate it in the same notation here.

So $h_{n,\gamma} < \infty$ for all $\theta \in \Theta$, and can be used to approximate $\hat{\Pi}_n$:
\[ \hat{\Pi}_{n,\gamma}(\delta, d\phi, d\theta \mid x_n) = K \pi_\delta (2\pi\gamma)^{\|\delta\|_1 / 2} e^{-h_{n,\gamma}(\theta \mid \delta, \phi)} G(d\phi)\mu_d(d\theta). \]
As $\gamma \to 0$, we have that $\hat{\Pi}_{n,\gamma} \to \hat{\Pi}_n$. Furthermore, we have the advantage that $\hat{\Pi}_{n,\gamma}$ is smooth and can thus be simulated by standard means rather than transdimensional means. The proximal parameter $\gamma$ must be chosen with care, since decreasing $\gamma$ increases both the accuracy of the approximation and the complexity of the simulation. Guidelines for choosing $\gamma$ are outlined in [6].

3.4.3 The Stefan problem

The Stefan problem [37, 48, 69, 85] models the change in phase of materials from liquid to solid and is used in applications such as crystal growth and metal casting. The heat equation governs the distribution of temperature along the solidification front (the border between solid and liquid states), and since a material has different heat properties dependent on its state, the heat equation coefficients are discontinuous. This jump in temperature gradient on the solidification front is known as the Stefan condition. Optimal control of the Stefan model is advantageous, because specific convex shapes of the solidification front result in better quality solid formation.
This is a state-constrained optimal control problem, and the state constraint can be removed via the Moreau envelope [37]. A quadratic penalty term is added to the cost functional, which makes it differentiable and converts the problem into one of standard optimal control. Then the penalty parameter (proximal parameter) is driven to infinity, and the original Stefan problem with solution is recovered.

The equations that follow are extracted from [37]. The state-constrained optimal control problem is
\[ \min_{y \in Y,\, u \in U} J(y, u), \quad \text{subject to } e(y, u) = 0, \; g(y) \leq 0 \text{ in } \Omega_0, \tag{3.4.1} \]
where $e(y, u)$ is the state equation that relates the state $y$ to the control $u$, and $g(y) \leq 0$ is the state constraint. The Lagrange multipliers that pertain to the state constraints, however, have only low regularity, which creates difficulties. These problems can be avoided by doing away with the state constraint inequality. A penalty term is added to the cost functional $J$, with the now penalized cost functional defined by
\[ J_\gamma(y, u) = J(y, u) + \frac{\gamma}{2}\int_{\Omega_0} \max[0, g(y)]^2 \, dx. \]
This is now a standard (not state-constrained) optimal control problem. The function $J_\gamma$ is differentiable, and the parameter $\gamma$ is sent to infinity in order to enforce the state constraint. The resulting optimal control problem analogous to (3.4.1) is the Moreau-Yosida regularized
\[ \min_{y \in Y,\, u \in U} J_\gamma(y, u) \quad \text{subject to } e(y, u) = 0. \]

3.4.4 The Cahn-Hilliard/Navier-Stokes system

The Cahn-Hilliard/Navier-Stokes modelling system [1, 17, 105, 107, 110] describes the hydrodynamics of a two-phase flow, that is, the flow of a liquid and a gas through a pipe at the same time. The Cahn-Hilliard equation governs the phase separation, and the Navier-Stokes system describes the motion of the fluid. The type of potential function involved can be chosen from a known list; the double-obstacle type is a popular choice [107].
It is interesting to note that with the double-obstacle potential, the solution to this problem converges to the solution of the classical Stefan problem by sending one of the parameters to zero. In [105, 107], the coupled Cahn-Hilliard/Navier-Stokes system is decoupled and the minimization of the Cahn-Hilliard variational inequality is replaced by its Moreau envelope, which can be solved by Newton iteration. The existence of a unique minimizer in this setting is proven, as is local superlinear convergence in the original function space, provided that the initial iteration point is chosen sufficiently close to the solution point. Numerically, however, convergence is obtained no matter the choice of starting point.

The equations that follow are extracted from [107]. We present some notation first: $\Omega \subset \mathbb{R}^n$ is the bounded convex polygonal flow domain with boundary $\partial\Omega$, and $K$, $Pe$, $Re$ are constants. The space $L^2(\Omega)$ is that of measurable functions whose squares are Lebesgue integrable, $L^2_{(0)}(\Omega) \subset L^2(\Omega)$ is the subspace of functions with vanishing mean value, and $H^m(\Omega)$ is the Hilbert space of functions in $L^2(\Omega)$ with distributional derivatives of order less than or equal to $m$ contained in $L^2(\Omega)$. The following sets are defined:
\begin{align*}
H^1_0(\Omega) &= \{v \in H^1(\Omega) : (v, 1) = 0 \text{ on } \partial\Omega\}, \\
\mathcal{K} &= \{v \in H^1(\Omega) : |v| \leq 1 \text{ a.e. in } \Omega\}.
\end{align*}
The problem is to find $(c(t, x), w(t, x), u(t, x), p(t, x))$ such that
\begin{align*}
\partial_t u - \frac{1}{Re}\Delta u + u \cdot \nabla u + \nabla p + K c \nabla w &= 0 && \text{in } \Omega_T = \Omega \times (0, T), \\
\operatorname{div} u &= 0 && \text{in } \Omega_T, \\
\partial_t c - \frac{1}{Pe}\nabla \cdot (b(c)\nabla w) + u \nabla c &= 0 && \text{in } \Omega_T, \\
w &= \Phi'(c) - \gamma^2 \Delta c && \text{in } \Omega_T, \\
c(x, 0) = c_0(x), \quad u(x, 0) &= u_0(x) && \forall x \in \Omega, \\
\partial_\nu c = \partial_\nu w = 0, \quad u &= g && \text{on } \partial\Omega \times (0, T).
\end{align*}
The system is solved by discretizing with respect to time, so with time step size $\tau$, Hamiltonian $H$, and space of Lebesgue integrable functions $L$, the time-discrete problem becomes finding $(c, w, u, p)$ such that
\begin{align*}
\tau(p, \operatorname{div} u) - \tau K(c\nabla w, v) &= (u - u_0, v) + \frac{\tau}{Re}(\nabla u : \nabla v) + \tau B(u_0, u_0, v) && \forall v, \tag{3.4.2} \\
(-\operatorname{div} u, v) &= 0 && \forall v \in L^2_{(0)}(\Omega), \tag{3.4.3} \\
(c, v) + \frac{\tau}{Pe}(\nabla w, \nabla v) &= \tau(c_0 u_0, \nabla v) + (c_0, v) && \forall v \in H^1(\Omega), \tag{3.4.4} \\
(c_0, v - c) &\leq \gamma^2(\nabla c, \nabla(v - c)) - (w, v - c) && \forall v \in \mathcal{K}. \tag{3.4.5}
\end{align*}
The Navier-Stokes problem is described by (3.4.2) and (3.4.3), the Cahn-Hilliard problem by (3.4.4) and (3.4.5). One of the key points here is that (3.4.4) and (3.4.5) are related to the first-order optimality condition of the problem
\begin{align*}
&\text{minimize } \frac{\gamma^2}{2}\|\nabla c\|^2 + \frac{\tau}{2}\|\nabla w\|^2 - (c_0, c) \text{ over } (c, w) \in \mathcal{K} \times V, \tag{3.4.6} \\
&\text{subject to } (c - c_0, v) + \frac{\tau}{Pe}(\nabla w, \nabla v) = \tau(c_0 u_0, \nabla v) \quad \forall v \in H^1(\Omega). \tag{3.4.7}
\end{align*}
There is a unique solution $(c, w)$ to (3.4.6) and a Lagrange multiplier $q$ associated with (3.4.7) that satisfies $w = q - (q, 1)$. Then the quantity
\[ \lambda_s(c) = s(\max(0, c - 1) + \min(0, c + 1)), \quad s > 0 \]
is defined and proved to be Lipschitz continuous, monotone and Newton differentiable. Now $\lambda_s$ is used to replace (3.4.6) and (3.4.7) by their Moreau-Yosida regularized version. That is, the constraint $c \in \mathcal{K}$ is replaced by the Moreau envelope of the indicator function $\iota_{\mathcal{K}}(c)$:
\begin{align*}
&\text{minimize } \frac{\gamma^2}{2}\|\nabla c\|^2 + \frac{\tau}{2}\|\nabla w\|^2 - (c_0, c) + \frac{s}{2}M(c) \text{ over } (c, w) \in H^1(\Omega) \times V, \\
&\text{subject to } (c - c_0, v) + \frac{\tau}{Pe}(\nabla w, \nabla v) = \tau(c_0 u_0, \nabla v) \quad \forall v \in H^1(\Omega),
\end{align*}
where $M(c) = \|\max(0, c - 1)\|^2 + \|\min(0, c + 1)\|^2$. This problem is proved in [107] to be differentiable and to have a unique solution that coincides with that of the original Cahn-Hilliard problem. The unique solution is characterized by the first-order optimality condition:
\begin{align*}
\langle F^{(1)}(c_s, w_s), v\rangle &= (c_s - c_0, v) + \frac{\tau}{Pe}(\nabla w_s, \nabla v) - \tau(c_0 u_0, \nabla v) = 0, \\
\langle F^{(2)}(c_s, w_s), v\rangle &= \gamma^2(\nabla c_s, \nabla v) - (w_s + c_0, v) + (\lambda_s(c_s), v) = 0
\end{align*}
for all $v \in H^1(\Omega)$.
The new, Moreau-Yosida regularized system that compares to (3.4.2) through (3.4.5) is
\begin{align*}
(u - u_0, v) + \frac{\tau}{Re}(\nabla u : \nabla v) + \tau B(u_0, u_0, v) + \tau K(c_s \nabla w_s, \nabla v) - \tau(p, \operatorname{div} v) &= 0 && \forall v, \\
(-\operatorname{div} u, v) &= 0 && \forall v \in L^2_{(0)}(\Omega), \\
\langle F^{(1)}(c_s, w_s), v\rangle &= 0 && \forall v \in H^1(\Omega), \\
\langle F^{(2)}(c_s, w_s), v\rangle &= 0 && \forall v \in H^1(\Omega).
\end{align*}
The authors then prove that there is a unique solution to this system, and that superlinear convergence is achieved when $(c_0, w_0)$ is sufficiently close to $(c_s, w_s)$.

3.4.5 The time crisis problem

The time crisis problem [29, 68] is also a state-constrained optimal control problem:
\begin{align*}
\frac{dx}{dt} &= f(x, u), \\
x(0) &= x_0, \\
x(t) &\in K,
\end{align*}
where $x$ is the state, $u$ is the control, $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is the dynamics, and $K \subseteq \mathbb{R}^n$ is non-empty. Such problems are abundant in economics, finance and social sciences [68]. Viability theory [11–13] is typically used to solve the system, but $K$ is not always viable. The time of crisis refers to the time spent by a trajectory solution of the control system outside the set $K$; the objective is to find a control $u$ for which the time of crisis is minimized. The indicator function of $K$ is used to express the time crisis functional, which is discontinuous along the boundary of $K$. The standard Maximum Principle is normally employed in solving these types of partial differential equation systems, but the inherent discontinuity here prevents this method from functioning. By using the Moreau envelope of the characteristic function of $K$, the regularized function is smooth and the Maximum Principle can be applied. This method is implemented in [29], where it is proven that optimal solutions converge to an optimal solution of the original problem.

The equations that follow are extracted from [29]. The indicator function of the complement $K^c$ of a set $K$ is defined by
\[ \mathbb{1}_{K^c}(x) = \begin{cases} 0, & x \in K, \\ 1, & x \notin K. \end{cases} \]
The minimal time crisis problem over $[0, T]$ is
\[ \inf_{u \in U} J_T(u), \quad \text{with } J_T(u) = \int_0^T \mathbb{1}_{K^c}(x_u(t)) \, dt, \]
where $x_u$ solves $\frac{dx}{dt} = f(x, u)$.
But $\mathbb{1}_{K^c}$ is not a continuous function, so the traditional Pontryagin Maximum Principle method of solving the problem fails here. Therefore, regularization is necessary. The characteristic function of the set $K$ is defined by
\[ \chi_K(x) = \begin{cases} 0, & x \in K, \\ \infty, & x \notin K. \end{cases} \]
For $K$ closed and convex, $\chi_K$ is convex and lsc. Then the Moreau envelope of $\chi_K$,
\[ e_\varepsilon(x) = \frac{1}{2\varepsilon}\operatorname{dist}(x, K)^2, \]
is a convex $C^{1+}$ function with $\nabla e_\varepsilon(x) = \frac{1}{\varepsilon}(x - P_K(x))$, and
\[ \lim_{\varepsilon \to 0} e_\varepsilon(x) = \chi_K(x). \]
Now define the concave function $\gamma(v) = 1 - e^{-v}$, so that for any $x \in \mathbb{R}^n$,
\[ \mathbb{1}_{K^c}(x) = \gamma(\chi_K(x)). \]
The regularized optimal control problem is
\[ \inf_{u \in U} J_T^\varepsilon(u), \quad J_T^\varepsilon(u) = \int_0^T \gamma(e_\varepsilon(x_u(t))) \, dt. \]
It is proved in [29] that $J_T^\varepsilon \to J_T$ as $\varepsilon \to 0$, and that the solution trajectory of this new problem is a minimizing solution to the original optimal control problem.

3.5 The proximal point algorithm and its variants

Moreau envelope theory has given rise to algorithms that are used in computer code implementations of minimization routines. These are known as proximal point algorithms; there are several variants in use today and we showcase a few of them here. We discuss the proximal point, proximal bundle, accelerated gradient and Douglas–Rachford algorithms.

3.5.1 The proximal point and proximal bundle algorithms

The most basic method of this family is the proximal point method. It is an iterative algorithm that can be described in one line; for $f \in \Gamma_0(\mathbb{R}^n)$ and a starting point $x_0$, we calculate
\[ x_{k+1} = P_r f(x_k). \]
The convergence of this method to a minimizer of $f$ was proved by Martinet [151] and variations of it are used as subroutines in many minimization algorithms. This method has a long and rich history of expansion, adaptation and analysis (see [45, 81, 83, 135, 174, 191]), and continues in strong use to this day. Proximal points are not always easy to identify, so routines for finding proximal points have emerged as well.
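The one-line proximal point iteration can be sketched as follows for the one-dimensional example $f(y) = |y|$, whose proximal point is assumed to be given by the standard soft-thresholding formula. The helper names, the stopping tolerance and the iteration cap are illustrative choices, not part of the method as stated in the thesis.

```python
import math


def prox_abs(x, r):
    """Proximal point of f(y) = |y| with parameter r (soft thresholding by 1/r)."""
    return math.copysign(max(abs(x) - 1.0 / r, 0.0), x)


def proximal_point(prox, x0, r, max_iter=100, tol=1e-12):
    """Iterate x_{k+1} = P_r f(x_k) until successive iterates differ by at most tol.

    `prox` is any callable (x, r) -> proximal point; the list of iterates is returned.
    """
    x = x0
    iterates = [x]
    for _ in range(max_iter):
        x_next = prox(x, r)
        iterates.append(x_next)
        if abs(x_next - x) <= tol:
            break
        x = x_next
    return iterates


if __name__ == "__main__":
    # Each step moves the iterate towards the minimizer 0 of |y| by at most 1/r.
    print(proximal_point(prox_abs, 3.0, 2.0))
```

For this particular $f$ the iterates reach the minimizer after finitely many steps; in general only convergence in the limit is guaranteed.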
One such routine is the proximal bundle method [99, 106, 125], where for a nonsmooth function $f \in \Gamma_0(\mathbb{R}^n)$ and proximal centre $z$, a piecewise-linear model function $\varphi_k$ is built at each iteration and the routine calculates the proximal point of $\varphi_k$, which has been proved to converge to the proximal point of $f$. Finding the proximal point is not a challenge for a piecewise-linear convex function, so if the model is accurate enough, the proximal point of $f$ at $z$ will be found, within some fixed accuracy tolerance.

3.5.2 An inexact proximal bundle algorithm

The bundle method mentioned in the previous section functions when exact function values and subgradients are available. In the case that one or both are not available or inconvenient to use, an inexact bundle method is used. In Chapter 8, we present one recent inexact method, called the tilt-correct method [95]. The premise is that exact function values are available, but exact subgradients are not. Since proximal bundle methods are commonly used as subroutines in minimization algorithms, the tilt-correct routine can be inserted into any such algorithm, thus converting it into a derivative-free method. This method is presented in detail in Chapter 8.

3.5.3 The accelerated gradient method

The accelerated gradient method is another important application of the proximal mapping [3, 72, 73, 79, 122, 139, 170, 207]. To solve the problem
\[ \min_{x \in \mathbb{R}^n} g(x) + h(x), \]
where $g$ is convex and differentiable and $h$ is convex, a generalized gradient descent method can be used. Given an initial point $x_0$, repeat the following:
\[ x_k = P_{r_k} h\left( x_{k-1} - \frac{1}{r_k}\nabla g(x_{k-1}) \right). \]
This method has a convergence rate of $O(1/k)$ if $\nabla g$ is Lipschitz continuous [167, §1.2.3]. However, a rate of $O(1/k^2)$ can be achieved by applying the following accelerated method [207].
Given an initial point $x_0 = x_{-1}$, repeat:
\begin{align*}
y &= x_{k-1} + \frac{k - 2}{k + 1}(x_{k-1} - x_{k-2}), \\
x_k &= P_{r_k} h\left( y - \frac{1}{r_k}\nabla g(y) \right).
\end{align*}
The above is called the accelerated generalized gradient method, and by setting $h = 0$ it reverts to the accelerated gradient method. If the proximal parameter $r$ is fixed, this method can be shown to satisfy
\[ f(x_k) - f(x^*) \leq \frac{2r\|x_0 - x^*\|^2}{(k + 1)^2}, \]
where $x^*$ is the minimizer of $f$, and to achieve the $O(1/k^2)$ convergence rate [166].

3.5.4 The Douglas–Rachford algorithm

The Douglas–Rachford algorithm [22, 25, 26, 39, 61, 70, 104] is another useful method that considers the problem
\[ \min_{x \in \mathbb{R}^n} f(x) + g(x). \]
Unlike the gradient descent methods in the last section, where one function is convex and the other is convex differentiable, here we require that both $f$ and $g$ be closed, convex, lsc functions. The algorithm is the following: setting $k = 1$ and given any $y_0$, iteratively calculate
\begin{align*}
x_k &= P_r f(y_{k-1}), \\
y_k &= y_{k-1} + P_r g(2x_k - y_{k-1}) - x_k.
\end{align*}
If a solution to the problem exists, then the sequence $\{x_k\}$ converges to a solution of $0 \in \partial f(x) + \partial g(x)$ [70]. The Douglas–Rachford algorithm is commonly used in applications such as sparse inverse covariance selection [54], Spingarn's method of partial inverses [205], decomposition of separable problems [169] and many others.

3.6 Other regularizations

The Moreau envelope is not the only type of regularization in use. There are many others, three of which we discuss in this section. Remarkably, all three of them are related in some significant way to the Moreau envelope, even though all were developed by different people in different countries, at a time (50 years ago) when communication and information sharing were not nearly as readily available as they are today. We highlight the Yosida regularization (Japan, 1965), Tikhonov regularization (Russia, 1963) and Bregman regularization (Russia, 1966).

3.6.1 Yosida regularization

The Yosida regularization [215] was developed by Kôsaku Yosida in the 1960s, about the same time as the Moreau envelope.
For any mapping $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ and any $r > 0$, the Yosida regularization of $T$ is defined by
\[ y_r T = \left( \frac{1}{r}\operatorname{Id} + T^{-1} \right)^{-1}. \]
As $r \to \infty$, $y_r T$ converges graphically (see Definition 2.2.10) to $\operatorname{cl} T$, and in the case of $T$ maximally monotone, $y_r T \xrightarrow{g} T$ [196, Example 12.13]. The Yosida regularization can be found in monotone operator theory, as it has a close relationship with the resolvent of $T$, known as the inverse-resolvent identity [196, Lemma 12.14]:
\[ \left( \frac{1}{r}\operatorname{Id} + T^{-1} \right)^{-1} = r\left[ \operatorname{Id} - \left( \operatorname{Id} + \frac{1}{r}T \right)^{-1} \right]. \]
The Yosida regularization and the Moreau envelope may seem to be quite different concepts, but in fact there is a close relationship between them.

Theorem 3.6.1. [196, Exercise 12.23] For any $f \in \Gamma_0(\mathbb{R}^n)$, the Yosida regularization of the subgradient mapping $\partial f$ is the gradient mapping of the Moreau envelope. That is,
\[ \nabla e_r f = \left( \frac{1}{r}\operatorname{Id} + (\partial f)^{-1} \right)^{-1}. \]

Proof. To see that this is true, we make use of the following well-known facts about Moreau envelopes and resolvents.

Fact 3.6.2. [196, Theorem 2.26][21, Example 23.3] For any $f \in \Gamma_0(\mathbb{R}^n)$,
(i) $\nabla e_r f = r(\operatorname{Id} - P_r f)$,
(ii) $P_r f = \left( \operatorname{Id} + \frac{1}{r}\partial f \right)^{-1}$ and
(iii) $P_r f$ is a singleton.

Using the above facts and the inverse-resolvent identity, the proof of Theorem 3.6.1 is as follows:
\begin{align*}
\nabla e_r f(x) &= r(x - P_r f(x)) \\
&= r\left[ \operatorname{Id} - \left( \operatorname{Id} + \frac{1}{r}\partial f \right)^{-1} \right](x) \\
&= \left[ \frac{1}{r}\operatorname{Id} + (\partial f)^{-1} \right]^{-1}(x) \\
&= y_r \partial f(x).
\end{align*}

3.6.2 Tikhonov regularization

In the 1960s, Andrey Tikhonov developed what is known today as Tikhonov regularization [210]. For a linear system $Ax = b$, the least-squares solution is the minimizer in $\min_x \|Ax - b\|^2$. If the system is not well-posed (i.e., such an $x$ does not exist or is not unique), the Tikhonov regularization is defined using a Tikhonov matrix $\Gamma$:
\[ t = \min_x \left\{ \|Ax - b\|^2 + \|\Gamma x\|^2 \right\}. \]
The Tikhonov matrix is most often a multiple of the identity matrix, so that
\[ t_\lambda = \min_x \left\{ \|Ax - b\|^2 + \|\lambda x\|^2 \right\}. \]
Tikhonov regularization is also known as ridge regression and has many applications in statistics.
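To make the definition of $t_\lambda$ concrete, the sketch below treats the simplest possible case, in which the unknown $x$ is a scalar and $A$ is a single column $a$. This restriction, along with the function names, is an assumption made here to keep the example free of linear-algebra libraries. Setting the derivative of $\sum_i (a_i x - b_i)^2 + \lambda^2 x^2$ to zero gives the minimizer $x = \sum_i a_i b_i / (\sum_i a_i^2 + \lambda^2)$.

```python
def objective(a, b, lam, x):
    """Tikhonov objective sum_i (a_i x - b_i)^2 + (lam * x)^2 for scalar x."""
    return sum((ai * x - bi) ** 2 for ai, bi in zip(a, b)) + (lam * x) ** 2


def ridge_scalar(a, b, lam):
    """Minimizer of the Tikhonov objective for scalar x (closed form).

    Requires sum_i a_i^2 + lam^2 > 0; with lam = 0 this is plain least squares.
    """
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = sum(ai * ai for ai in a) + lam ** 2
    return num / den


def tikhonov_value(a, b, lam):
    """The optimal value t_lambda for the scalar problem."""
    return objective(a, b, lam, ridge_scalar(a, b, lam))


if __name__ == "__main__":
    a, b = [1.0, 1.0], [2.0, 4.0]
    for lam in (0.0, 1.0, 2.0):
        # Larger lam shrinks the solution towards zero.
        print(lam, ridge_scalar(a, b, lam), tikhonov_value(a, b, lam))
```

When $a = 0$ the unregularized problem has no unique solution, whereas any $\lambda > 0$ makes the denominator positive and selects $x = 0$.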
In particular, the regular least-squares method has the undesirable effect of amplifying noise, since the solution is $x = (A^\top A)^{-1}A^\top b$ and small singular values of $A$ give rise to large eigenvalues of $(A^\top A)^{-1}$. Thus, ridge regression is preferable, because it mitigates this effect of noise amplification.

Just as in the case of Yosida regularization, Tikhonov regularization has a very close relationship to the Moreau envelope. In fact, it is a particular Moreau envelope at $\bar{x} = 0$:
\begin{align*}
t_\lambda &= \min_x \left\{ \|Ax - b\|^2 + \lambda^2\|x\|^2 \right\} \\
&= \min_x \left\{ \|Ax - b\|^2 + \frac{2\lambda^2}{2}\|x - 0\|^2 \right\} \\
&= e_{2\lambda^2} f(0),
\end{align*}
where $f(x) = \|Ax - b\|^2$.

3.6.3 Bregman regularization

Bregman regularization [42] is a development from the 1960s by Lev Bregman, one that makes use of the Bregman distance.

Definition 3.6.3. For $f$ strictly convex and $C^2$ on a closed, convex set $S$, the Bregman distance between any two points $p, q \in S$ is defined by
\[ D_f(p, q) = f(p) - f(q) - \langle \nabla f(q), p - q\rangle. \]

Figure 3.1: An illustration of the Bregman distance.

The Bregman distance is called a semi-distance, since it is not symmetric, nor does it satisfy the triangle inequality. Bregman regularization arose as a method of minimizing the constrained $L_1$ norm; rather than solving
\[ \min_x \{\|x\|_1 : Ax = b\} \]
or the equivalent unconstrained problem
\[ \min_x \left\{ \|x\|_1 + \frac{1}{2\lambda}\|Ax - b\|^2 \right\}, \]
we consider the iterative technique
\[ x_{k+1} = \operatorname*{argmin}_x \left\{ D_f(x, x_k) + \frac{1}{2}\|Ax - b\|^2 \right\}, \]
where $f$ is a strictly convex function. This is Bregman regularization for the $L_1$ norm. Since then, this method has been generalized, with $D_f$ replacing the constraints, or with $D_f$ replacing the strictly convex $f$ and with the constraint in the form of a convex, differentiable function $H$. In other words, instead of considering
\[ \min_x \{f(x) + H(x)\}, \]
we iteratively solve
\[ x_{k+1} = \operatorname*{argmin}_x \left\{ D_f(x, x_k) + \frac{1}{2}\|x - (x_k - \nabla H(x_k))\|^2 \right\}. \tag{3.6.1} \]
Bregman regularization has applications in denoising [78], image deblurring [46, 142, 186], compressed sensing [173, 181, 214] and medical imaging [101, 148, 149], among others.
This minimization of the Bregman distance is a minimization of the unwanted information in a noisy function, so these are natural areas of application in which to use it. Collecting the constants in (3.6.1), we see that Bregman regularization is also closely linked to the Moreau envelope:
\begin{align*}
x_{k+1} &= \operatorname*{argmin}_x \left\{ D_f(x, x_k) + \frac{1}{2}\|x - \bar{x}\|^2 \right\} \\
&= P_1\big( D_f(\cdot, x_k) \big)(\bar{x}),
\end{align*}
where $\bar{x} = x_k - \nabla H(x_k)$.

This concludes the historical section of the thesis. Now we move on to new material, beginning with theoretical results.

Part II
Theory

Chapter 4
Thresholds of Prox-boundedness of PLQ Functions

4.1 Overview

In this chapter, we discuss prox-bounded functions. Given a function $f$, we explore the threshold of prox-boundedness and the interesting properties that $e_r f$ has when the threshold is used as the proximal parameter. The class of piecewise linear-quadratic functions is investigated, with the goal of identifying the threshold of prox-boundedness and the domain of the Moreau envelope at the threshold. The main results of this chapter are the following. For a PLQ function $f : \mathbb{R}^n \to \mathbb{R}$ defined by
\[ f(x) = \begin{cases} f_1(x), & \text{if } x \in S_1, \\ f_2(x), & \text{if } x \in S_2, \\ \quad\vdots \\ f_m(x), & \text{if } x \in S_m \end{cases} \]
with $f_i$ linear-quadratic and $S_i$ polyhedral,

• it is shown that the threshold of prox-boundedness $\bar{r}$ of $f$ is $\max_i\{\bar{r}_i\}$, where $\bar{r}_i$ is the threshold of $f_i$;

• it is shown that the domain of $e_{\bar{r}} f$ is $\bigcap_{i \in A} \operatorname{dom} e_{\bar{r}} f_i$, where $A$ is the active set of $f_i$: $A = \{i : \bar{r}_i = \bar{r}\}$;

• a computationally feasible method for computing the proximal threshold is developed.

This chapter is based on results found in [94], which is published in Journal of Convex Analysis.

4.2 The threshold of prox-boundedness

An important aspect in applying the Moreau envelope to a function $f$ is determining if $f$ is prox-bounded, that is, if there exists a point $x$ and a parameter $r$ such that $e_r f(x)$ is finite. If so, then the infimum of all such $r$ is called the threshold of prox-boundedness of $f$. In this chapter, we seek to understand the thresholds of PLQ functions.
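A minimal sketch of the finiteness test behind this definition, restricted to a one-dimensional quadratic $f(y) = \frac{1}{2}ay^2 + by + c$ (an assumption made here for illustration; the function name is hypothetical). Expanding $f(y) + \frac{r}{2}(y - x)^2$ gives a quadratic in $y$ with leading coefficient $\frac{1}{2}(a + r)$ and linear coefficient $b - rx$, so the infimum over $y$ is finite exactly when $a + r > 0$, or when $a + r = 0$ and $b - rx = 0$.

```python
def envelope_is_finite(a, b, r, x):
    """Decide whether e_r f(x) > -infinity for f(y) = 0.5*a*y**2 + b*y + c.

    The constant c does not affect finiteness, so it is not an argument.
    Exact comparisons with zero are used; inputs are assumed to be exactly
    representable (no rounding tolerance is applied).
    """
    quad = a + r        # twice the coefficient of y**2 in the infimand
    lin = b - r * x     # coefficient of y in the infimand
    if quad > 0:
        return True     # strictly convex infimand: minimum attained
    if quad == 0:
        return lin == 0  # affine infimand: bounded below only if constant
    return False        # concave infimand: unbounded below


if __name__ == "__main__":
    # f(y) = -y**2, i.e. a = -2, b = 0: finiteness changes at r = 2.
    for r in (1.0, 2.0, 3.0):
        print(r, [envelope_is_finite(-2.0, 0.0, r, x) for x in (0.0, 1.0)])
```

For $f(y) = -y^2$ the test fails for every $x$ when $r < 2$ and succeeds for every $x$ when $r > 2$, so the infimum of admissible parameters is $2$; at $r = 2$ itself the envelope is finite only at $x = 0$.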
The threshold $\bar{r}$ of $f$ is of interest, as any $r > \bar{r}$ yields $e_r f(x) \in \mathbb{R}$ for all $x \in \mathbb{R}^n$ [196, Theorem 1.25] and (if $\bar{r} > 0$) any $r$ such that $0 \leq r < \bar{r}$ yields $e_r f(x) = -\infty$ for all $x \in \mathbb{R}^n$. At the threshold itself, the domain of the Moreau envelope may be $\mathbb{R}^n$, $\emptyset$ or some proper subset of $\mathbb{R}^n$, depending on the characteristics of $f$.

Thresholds are also of interest due to their importance when dealing with certain programmable tasks in optimization. A prime example is the proximal point method that was discussed in Chapter 3. The algorithm starts at an arbitrary point $x_0 \in \operatorname{dom} f$ and iteratively calculates the proximal mapping
\[ x_{i+1} = \operatorname*{argmin}_y \left\{ f(y) + \frac{r_i}{2}\|y - x_i\|^2 \right\}. \]
This method is known to converge to a minimizer for convex functions [82], and for certain nonconvex functions as well [98, 118, 209], provided that $r_i$ is greater than the proximal threshold. There is a question of how to choose the sequence $r_i$, and it is likely that an ideal starting choice is to use the threshold $\bar{r}$ [190]. So for this algorithm and others that use variants of the proximal point method, it is desirable to be able to calculate the threshold of the function in question.

A PLQ function is a function whose domain is a finite union of polyhedral sets, and that is linear or quadratic on each of those sets [196, Definition 10.20]. This is a logical family of functions on which to focus, as they are commonly used in applications and computational optimization [64, 76, 146, 187, 192]. They are easily programmable, but complex enough to allow us to illustrate the many cases that emerge at the threshold.

4.3 Full-domain quadratic functions

We begin by considering quadratic functions on $\mathbb{R}$. Depending on the structure of the function $f$, a variety of situations for the domain of $e_r f$ arise at the threshold. As we see in the examples below, we can have $\operatorname{dom} e_{\bar{r}} f = \mathbb{R}^n$, $\operatorname{dom} e_{\bar{r}} f = \emptyset$, or $\emptyset \subsetneq \operatorname{dom} e_{\bar{r}} f \subsetneq \mathbb{R}^n$.

Lemma 4.3.1. Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ be proper and lsc. Then $f$ is bounded below if and only if $\bar{r} = 0$ and $\operatorname{dom} e_{\bar{r}} f = \mathbb{R}^n$.
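Proof. We have
\[
\begin{aligned}
f \text{ is bounded below} &\iff \inf_{y\in\mathbb{R}^n} f(y) > -\infty,\\
&\iff \inf_{y\in\mathbb{R}^n}\left\{f(y) + \frac02\|y - \bar x\|^2\right\} > -\infty \quad \forall \bar x \in \mathbb{R}^n,\\
&\iff \bar r = 0 \text{ and } \operatorname{dom} e_{\bar r} f = \mathbb{R}^n.
\end{aligned}
\]

The examples that follow are presented without proof, as the proofs are covered by the upcoming Proposition 4.3.5. Example 4.3.3 also demonstrates the importance of the "$\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$" component of Lemma 4.3.1.

Example 4.3.2. Let $f(x) = x^2$, $x \in \mathbb{R}$. Then $\bar r = 0$ and $\operatorname{dom} e_{\bar r} f = \mathbb{R}$.

Example 4.3.3. Let $f(x) = x$, $x \in \mathbb{R}$. Then $\bar r = 0$ and $\operatorname{dom} e_{\bar r} f = \emptyset$.

Example 4.3.4. Let $f(x) = -x^2$, $x \in \mathbb{R}$. Then $\bar r = 2$ and $\operatorname{dom} e_{\bar r} f = \{0\}$.

Now let us consider a full-domain quadratic function on $\mathbb{R}$. The following result is extended to $\mathbb{R}^n$ later in this section.

Proposition 4.3.5. Let $f : \mathbb{R} \to \mathbb{R}$, $f(x) = \frac12 a x^2 + bx + c$ be full-domain. Then the threshold of $f$ is
\[
\bar r = \max\{0, -a\},
\]
and $\operatorname{dom} e_{\bar r} f$ depends on $a$ and $b$ in the following manner.
(i) If $a > 0$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}$.
(ii) If $a < 0$, then $\operatorname{dom} e_{\bar r} f = \left\{-\frac ba\right\}$.
(iii) If $a = 0$ and $b \ne 0$, then $\operatorname{dom} e_{\bar r} f = \emptyset$.
(iv) If $a = b = 0$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}$.

Proof. (i) If $a > 0$, then $f$ is bounded below. Hence, $\bar r = 0$ and $\operatorname{dom} e_{\bar r} f = \mathbb{R}$ by Lemma 4.3.1.
(ii) If $a < 0$, then for $r \ne -a$ we find the vertex of $\frac12 a y^2 + by + c + \frac r2 (y - x)^2$ by setting the derivative with respect to $y$ equal to $0$. This gives the critical point $y = \frac{rx - b}{a + r}$. The second derivative is $a + r$, so the critical point gives a minimum for all $r > -a$, and a maximum for all $r < -a$. Indeed, $r < -a$ results in $\frac12 a y^2 + by + c + \frac r2 (y - x)^2$ being unbounded below. Hence, $\bar r = -a$. Then we evaluate the Moreau envelope at the threshold:
\[
\begin{aligned}
e_{-a} f(\bar x) &= \inf_{y\in\mathbb{R}}\left\{\frac12 a y^2 + by + c + \frac{-a}{2}(y - \bar x)^2\right\}\\
&= \inf_{y\in\mathbb{R}}\left\{(a\bar x + b)y + c - \frac12 a \bar x^2\right\}\\
&= \begin{cases}
c - \frac{b^2}{2a}, & \bar x = -\frac ba,\\
-\infty, & \bar x \ne -\frac ba.
\end{cases}
\end{aligned}
\]
Hence, $\operatorname{dom} e_{\bar r} f = \left\{-\frac ba\right\}$.
(iii) If $a = 0$ and $b \ne 0$, then for any $r > 0$ we have $e_r f(\bar x) > -\infty$ for all $\bar x \in \mathbb{R}$. This tells us that $\bar r = 0$. Then
\[
e_{\bar r} f(\bar x) = \inf_{y\in\mathbb{R}}\{by + c\} = -\infty \quad \forall \bar x \in \mathbb{R}.
\]
Therefore, $\operatorname{dom} e_{\bar r} f = \emptyset$.
(iv) If $a = 0$ and $b = 0$, then $f$ is constant and bounded below.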
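Lemma 4.3.1 applies, and we have $\operatorname{dom} e_{\bar r} f = \mathbb{R}$.

Proposition 4.3.5 can be extended to the finite-dimensional case, as we see in Proposition 4.3.6 and Theorem 4.3.8. First, we consider a quadratic function on $\mathbb{R}^n$ with full domain, whose quadratic coefficient is a diagonal matrix. We use $\mathcal{D}^n$, $\mathcal{D}^n_+$, and $\mathcal{D}^n_{++}$ to denote the sets of $n$-dimensional diagonal, diagonal positive-semidefinite, and diagonal positive definite matrices, respectively.

Proposition 4.3.6. Let $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \frac12\langle x, Ax\rangle + \langle b, x\rangle + c$ be full-domain, $A \in \mathcal{D}^n$, $b^\top = [b_1, b_2, \ldots, b_n] \in \mathbb{R}^n$, $c \in \mathbb{R}$. Suppose that (without loss of generality) for $i \in \{1, 2, \ldots, n\}$ the eigenvalues $\lambda_i$ of $A$ are in non-increasing order. Then the threshold of $f$ is
\[
\bar r = \max\{0, -\lambda_n\},
\]
and $\operatorname{dom} e_{\bar r} f$ depends on $A$ and $b$ in the following manner.
(i) If $A \in \mathcal{D}^n_{++}$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.
(ii) If $A \in \mathcal{D}^n \setminus \mathcal{D}^n_+$, then $\operatorname{dom} e_{\bar r} f = \left\{\bar x : \bar x_i = -\frac{b_i}{\lambda_i} \ \forall i \text{ such that } \lambda_i = \lambda_n\right\}$.
(iii) If $A \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$ and there exists $i$ such that $\lambda_i = 0$ and $b_i \ne 0$, then $\operatorname{dom} e_{\bar r} f = \emptyset$.
(iv) If $A \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$ and $b_i = 0$ for all $i$ such that $\lambda_i = 0$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.

Proof. We have
\[
\begin{aligned}
f(x) &= \frac12 [x_1, \ldots, x_n]
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0\\
0 & \lambda_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \lambda_n
\end{bmatrix}
\begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix}
+ [b_1, \ldots, b_n]\begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} + c\\
&= \frac12(\lambda_1 x_1^2 + \cdots + \lambda_n x_n^2) + (b_1 x_1 + \cdots + b_n x_n) + c\\
&= \left(\frac{\lambda_1}{2}x_1^2 + b_1 x_1\right) + \left(\frac{\lambda_2}{2}x_2^2 + b_2 x_2\right) + \cdots + \left(\frac{\lambda_n}{2}x_n^2 + b_n x_n\right) + c.
\end{aligned}
\tag{4.3.1}
\]
(i) If $A \in \mathcal{D}^n_{++}$, then $\lambda_i > 0$ for all $i$, hence, $f$ is bounded below. Therefore, $\bar r = 0$ and $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$ by Lemma 4.3.1.
(ii) If $A \in \mathcal{D}^n \setminus \mathcal{D}^n_+$, then $\lambda_n$ is the negative eigenvalue of largest magnitude, since the eigenvalues are ordered. Fix $\bar x \in \mathbb{R}^n$, $r < -\lambda_n$, and denote $[0, \ldots, 0, x_n]^\top$ by $\bar y$. Consider the following limit:
\[
\begin{aligned}
\lim_{x_n\to\infty}\left[f(\bar y) + \frac r2\|\bar y - \bar x\|^2\right]
&= \lim_{x_n\to\infty}\left[\frac{\lambda_n}{2}x_n^2 + b_n x_n + c + \frac r2\|\bar y - \bar x\|^2\right]\\
&= \lim_{x_n\to\infty}\left[\frac{\lambda_n + r}{2}x_n^2 + (b_n - r\bar x_n)x_n\right] + c + \frac r2\sum_{i=1}^n \bar x_i^2\\
&= -\infty.
\end{aligned}
\]
Thus, the threshold of $f$ is at least $-\lambda_n$. Now fix $r > -\lambda_n$. Then
\[
\begin{aligned}
f(x) + \frac r2\|x - \bar x\|^2 &= \frac12\langle x, Ax\rangle + \langle b, x\rangle + c + \frac12\langle x - \bar x, r\operatorname{Id}(x - \bar x)\rangle\\
&= \frac12\langle x, (A + r\operatorname{Id})x\rangle + \langle b, x\rangle - r\langle \bar x, x\rangle + c + \frac r2\langle \bar x, \bar x\rangle.
\end{aligned}
\]
Since $r > -\lambda_n$, we have $(A + r\operatorname{Id}) \in \mathcal{D}^n_{++}$.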
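So $f(x) + \frac r2\|x - \bar x\|^2$ is strictly convex quadratic, and is therefore bounded below. Hence, $\bar r = -\lambda_n$. Now we consider the Moreau envelope at the threshold:
\[
\begin{aligned}
e_{\bar r} f(\bar x) &= \inf_{y\in\mathbb{R}^n}\left\{f(y) - \frac{\lambda_n}{2}\|y - \bar x\|^2\right\}\\
&= \inf_{y\in\mathbb{R}^n}\left\{\frac12\langle y, Ay\rangle + \langle b, y\rangle + c - \frac{\lambda_n}{2}\|y - \bar x\|^2\right\}\\
&= \inf_{y\in\mathbb{R}^n}\left\{\sum_{i=1}^n\left[\frac{\lambda_i - \lambda_n}{2}y_i^2 + (b_i + \lambda_n\bar x_i)y_i - \frac{\lambda_n}{2}\bar x_i^2\right] + c\right\}.
\end{aligned}
\tag{4.3.2}
\]
Notice that $\lambda_i - \lambda_n \ge 0$ for all $i$, so that the infimand above consists of a sum of $n$ single-variable functions, one function of each $y_i$, that are either strictly convex quadratic (when $\lambda_i > \lambda_n$) or linear (when $\lambda_i = \lambda_n$). In particular, the $n$th such function is linear. Suppose the first $k$ functions are quadratic, and the last $n - k$ functions are linear. Then to find the infimum, we must choose $y_1$ through $y_k$ to be those numbers that give us the vertices of the parabolas
\[
\frac{\lambda_i - \lambda_n}{2}y_i^2 + (b_i + \lambda_n\bar x_i)y_i - \frac{\lambda_n}{2}\bar x_i^2, \quad i \in \{1, 2, \ldots, k\}.
\]
That gives us the minimum values for the first $k$ components of the sum in (4.3.2). For the remaining $n - k$ components, we must choose the $y_i$ that give the infima of $(b_i + \lambda_i\bar x_i)y_i$. This means that we will have a finite infimum when $\bar x_i = -\frac{b_i}{\lambda_i}$ for each $i = k+1, k+2, \ldots, n$, but an infimum of $-\infty$ otherwise. Therefore,
\[
\operatorname{dom} e_{\bar r} f = \left\{\bar x : \bar x_i = -\frac{b_i}{\lambda_i},\ \lambda_i = \lambda_n\right\}. \tag{4.3.3}
\]
(iii) Suppose $A \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$, and let $k$ be such that $\lambda_k = 0$ and $b_k \ne 0$. Fix $\bar x \in \mathbb{R}^n$ and consider the Moreau envelope:
\[
\inf_{y\in\mathbb{R}^n}\left\{f(y) + \frac r2\|y - \bar x\|^2\right\}.
\]
For any $r > 0$ the infimand is strictly convex quadratic, so the infimum is a real number. Hence, $\bar r = 0$. Now we consider
\[
e_{\bar r} f(\bar x) = \inf_{y\in\mathbb{R}^n} f(y) = -\infty \quad \forall \bar x \in \mathbb{R}^n,
\]
since $f$ is linear and non-constant in the direction of the $k$th coordinate. Therefore, $\operatorname{dom} e_{\bar r} f = \emptyset$.
(iv) Suppose $A \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$, and $b_i = 0$ for all $i$ such that $\lambda_i = 0$. Again we have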
a finite sum of strictly convex quadratic functions and linear functions, but since $b_i = 0$ for every corresponding $\lambda_i = 0$, the linear functions are in fact constant. Hence, the function is bounded below, and we apply Lemma 4.3.1 to conclude that $\bar r = 0$ and $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.

In order to generalize Proposition 4.3.6 to include all real symmetric matrices, we use spectral decomposition. Recall that a square matrix $A$ is orthogonally diagonalizable if and only if there exist an orthogonal matrix $Q$ and a diagonal matrix $D$ such that $A = Q^\top D Q$.

Fact 4.3.7 (Fact 8.1.1 [43]). A square matrix $A$ is orthogonally diagonalizable if and only if $A$ is symmetric. Moreover, $D$ is the matrix generated by diagonalizing the vector of eigenvalues of $A$. This is referred to as the spectral decomposition of $A$.

So if we have a quadratic function $f(x) = \frac12\langle x, Ax\rangle + \langle b, x\rangle + c$ (where $A$ is symmetric by definition), we are always able to diagonalize $A$ and the eigenvalues of the resulting diagonal matrix are the same as those of $A$. The consequence of this is that with a change of variable we will be able to apply Proposition 4.3.6 to any quadratic, full-domain function. With this tool at our disposal, we present the general form of Proposition 4.3.6 in Theorem 4.3.8.

Theorem 4.3.8. Let $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \frac12\langle x, Ax\rangle + \langle b, x\rangle + c$ be full-domain, $A \in \mathcal{S}^n$, $b \in \mathbb{R}^n$, $c \in \mathbb{R}$. Let $Q^\top D Q$ be the spectral decomposition of $A$, and suppose (without loss of generality) that for $i \in \{1, 2, \ldots, n\}$ the eigenvalues $\lambda_i$ of $D$ are in non-increasing order. Then the threshold of $f$ is
\[
\bar r = \max\{0, -\lambda_n\},
\]
and $\operatorname{dom} e_{\bar r} f$ depends on $D$, $Q$ and $b$ in the following manner.
(i) If $D \in \mathcal{D}^n_{++}$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.
(ii) If $D \in \mathcal{D}^n \setminus \mathcal{D}^n_+$, then
\[
\operatorname{dom} e_{\bar r} f = \left\{\bar x : \sum_{j=1}^n q_{ij}\bar x_j = -\frac{1}{\lambda_i}\sum_{j=1}^n q_{ij}b_j \ \ \forall i \text{ with } \lambda_i = \lambda_n\right\}. \tag{4.3.4}
\]
(iii) If $D \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$ and there exists $i$ such that $\lambda_i = 0$ and $\sum_{j=1}^n q_{ij}b_j \ne 0$, then $\operatorname{dom} e_{\bar r} f = \emptyset$.
(iv) If $D \in \mathcal{D}^n_+ \setminus \mathcal{D}^n_{++}$ and $\sum_{j=1}^n q_{ij}b_j = 0$ for every $i$ such that $\lambda_i = 0$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.

Proof.
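We implement the variable changes $y = Qx$ and $\bar y = Q\bar x$. These changes do not affect the threshold, as $Q$ is invertible and, by orthogonality, $Q^{-1} = Q^\top$. Thus,
\[
\begin{aligned}
\inf_{x\in\mathbb{R}^n}\left\{f(x) + \frac r2\|x - \bar x\|^2\right\}
&= \inf_{x\in\mathbb{R}^n}\left\{f(Q^\top y) + \frac r2\left\|Q^\top y - Q^\top\bar y\right\|^2 : y = Qx\right\}\\
&= \inf_{y\in\mathbb{R}^n}\left\{f(Q^\top y) + \frac r2\left\|Q^\top y - Q^\top\bar y\right\|^2\right\}.
\end{aligned}
\]
Further,
\[
\begin{aligned}
f(Q^\top y) &= \frac12\langle Q^\top y, AQ^\top y\rangle + \langle b, Q^\top y\rangle + c\\
&= \frac12\langle y, QAQ^\top y\rangle + \langle Qb, y\rangle + c\\
&= \frac12\langle y, Dy\rangle + \langle Qb, y\rangle + c.
\end{aligned}
\]
Now we consider the Moreau envelope,
\[
\begin{aligned}
e_r f(Q^\top\bar y) &= \inf_{y\in\mathbb{R}^n}\left\{\frac12\langle y, Dy\rangle + \langle Qb, y\rangle + c + \frac r2\left\|Q^\top(y - \bar y)\right\|^2\right\}\\
&= \inf_{y\in\mathbb{R}^n}\left\{\frac12\langle y, Dy\rangle + \langle Qb, y\rangle + c + \frac r2\left[\langle y - \bar y, QQ^\top(y - \bar y)\rangle\right]\right\}\\
&= \inf_{y\in\mathbb{R}^n}\left\{\frac12\langle y, Dy\rangle + \langle Qb, y\rangle + c + \frac r2\|y - \bar y\|^2\right\}.
\end{aligned}
\]
Since $D$ is diagonal, we have the form of Proposition 4.3.6, with $b$ replaced by $Qb$. The rest of the proof is analogous to that of Proposition 4.3.6.

Remark 4.3.9. Example 4.5.1 is an application of Theorem 4.3.8.

4.4 PLQ functions

The goal of this section is to generalize the results we have so far to PLQ functions. We begin by stating some results about the domain of the Moreau envelope. In our first result, we see that the more we restrict the domain of a function, the larger the domain of the Moreau envelope can be.

Lemma 4.4.1. Let $f : \operatorname{dom} f \to \mathbb{R}$. Suppose $\tilde f : \operatorname{dom}\tilde f \to \mathbb{R}$ is such that $\operatorname{dom}\tilde f \subseteq \operatorname{dom} f$ and $f(x) = \tilde f(x)$ for all $x \in \operatorname{dom}\tilde f$. Then $\operatorname{dom} e_r f \subseteq \operatorname{dom} e_r\tilde f$.

Proof. We have
\[
\begin{aligned}
&\inf_{y\in\operatorname{dom} f}\left\{f(y) + \frac r2\|y - \bar x\|^2\right\} > -\infty \quad \forall \bar x \in \operatorname{dom} e_r f,\\
\Rightarrow\ &\inf_{y\in\operatorname{dom}\tilde f}\left\{\tilde f(y) + \frac r2\|y - \bar x\|^2\right\} > -\infty \quad \forall \bar x \in \operatorname{dom} e_r f,
\end{aligned}
\]
since $\operatorname{dom}\tilde f \subseteq \operatorname{dom} f$. Therefore, $\operatorname{dom} e_r f \subseteq \operatorname{dom} e_r\tilde f$.

Combining Theorem 4.3.8 with Lemma 4.4.1, we have the following corollary.

Corollary 4.4.2. Let $f : \operatorname{dom} f \subseteq \mathbb{R}^n \to \mathbb{R}$, $f(x) = \frac12\langle x, Ax\rangle + \langle b, x\rangle + c$ ($A \in \mathcal{S}^n$, $b \in \mathbb{R}^n$, $c \in \mathbb{R}$) have threshold $\bar r > 0$. For $S \subseteq \operatorname{dom} f$, let
\[
\tilde f(x) = \begin{cases}
f(x), & x \in S,\\
\infty, & x \notin S.
\end{cases}
\]
Then
\[
\frac{1}{\bar r}b \in \operatorname{dom} e_{\bar r} f \subseteq \operatorname{dom} e_{\bar r}\tilde f.
\]

Proof. Using (4.3.4) with $\lambda_i = \lambda_n = -\bar r$, we see that substituting $\bar x_j = b_j/\bar r$ satisfies the condition, which gives us that $\frac{1}{\bar r}b \in \operatorname{dom} e_{\bar r} f$.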
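Lemma 4.4.1 completes the proof.

So for any quadratic function $f$ with $\operatorname{dom} e_{\bar r} f \ne \emptyset$, Corollary 4.4.2 gives us a point in $\operatorname{dom} e_{\bar r} f$.

4.4.1 Quadratic functions with conic domain

Now we move on to the case of a quadratic function $f$ where $\operatorname{dom} f$ is a closed, unbounded conic region. We change to generalized spherical coordinates, also known as $n$-spherical coordinates, since angles and vector lengths are more convenient for defining vectors than is the Cartesian system when the domain is a cone. The variable change is as follows:
\[
x \in \mathbb{R}^n \ \leftrightarrow\
\left\{\begin{array}{l}
x_1 = \rho\cos\phi_1\\
x_2 = \rho\sin\phi_1\cos\phi_2\\
\quad\vdots\\
x_{n-1} = \rho\sin\phi_1\cdots\sin\phi_{n-2}\cos\phi_{n-1}\\
x_n = \rho\sin\phi_1\cdots\sin\phi_{n-2}\sin\phi_{n-1}
\end{array}\right\}
\ \leftrightarrow\
\begin{array}{l}
(\rho, \phi) \in \mathbb{R}\times\mathbb{R}^{n-1}\\
\rho \ge 0\\
\phi_1 \in [0, 2\pi),\\
\phi_2, \phi_3, \ldots, \phi_{n-1} \in [0, \pi].
\end{array}
\]
For ease of notation, we introduce the capital sine-$k$ function $\operatorname{Sin}_k\phi$.

Definition 4.4.3. Let $\phi = (\phi_1, \phi_2, \ldots, \phi_{n-1})$. The $\operatorname{Sin}_k$ function is defined by
\[
\operatorname{Sin}_k\phi = \prod_{i=1}^k \sin\phi_i.
\]
We adopt the conventions $\operatorname{Sin}_0\phi = 1$ and $\phi_n = 0$, so that we may write
\[
x_i = \rho\operatorname{Sin}_{i-1}\phi\cos\phi_i \quad \text{for all } i \in \{1, 2, \ldots, n\}.
\]
For a quadratic function $f(x) = \frac12\langle x, Ax\rangle + \langle b, x\rangle + c$, the change to $n$-spherical coordinates of the infimand of the Moreau envelope results in
\[
\begin{aligned}
&\frac12\langle x, Ax\rangle + \langle b, x\rangle + c + \frac r2\|x - \bar x\|^2\\
&= \frac12\sum_{j=1}^n\sum_{i=1}^n a_{ij}x_i x_j + \sum_{i=1}^n b_i x_i + c + \frac r2\sum_{i=1}^n (x_i - \bar x_i)^2\\
&= \frac12\sum_{j=1}^n\sum_{i=1}^n a_{ij}\,\rho\operatorname{Sin}_{i-1}\phi\cos\phi_i\,\rho\operatorname{Sin}_{j-1}\phi\cos\phi_j + \sum_{i=1}^n b_i\,\rho\operatorname{Sin}_{i-1}\phi\cos\phi_i + c\\
&\quad + \frac r2\sum_{i=1}^n\left(\rho\operatorname{Sin}_{i-1}\phi\cos\phi_i - \bar\rho\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i\right)^2\\
&= \rho^2\left(\frac r2 + \frac12\sum_{j=1}^n\sum_{i=1}^n a_{ij}\operatorname{Sin}_{i-1}\phi\cos\phi_i\operatorname{Sin}_{j-1}\phi\cos\phi_j\right)\\
&\quad + \rho\left(\sum_{i=1}^n\left(b_i - \bar\rho r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i\right)\operatorname{Sin}_{i-1}\phi\cos\phi_i\right) + c + \frac{\bar\rho^2 r}{2}\sum_{i=1}^n\operatorname{Sin}^2_{i-1}\bar\phi\cos^2\bar\phi_i.
\end{aligned}
\]
Define
\[
G(\phi) = \sum_{j=1}^n\sum_{i=1}^n a_{ij}\operatorname{Sin}_{i-1}\phi\cos\phi_i\operatorname{Sin}_{j-1}\phi\cos\phi_j, \tag{4.4.1}
\]
\[
H_r(\bar\rho, \bar\phi; \phi) = \sum_{i=1}^n\left(b_i - \bar\rho r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i\right)\operatorname{Sin}_{i-1}\phi\cos\phi_i, \tag{4.4.2}
\]
\[
K_r(\bar\rho, \bar\phi) = c + \frac{\bar\rho^2 r}{2}\sum_{i=1}^n\operatorname{Sin}^2_{i-1}\bar\phi\cos^2\bar\phi_i. \tag{4.4.3}
\]
Then we have
\[
e_r f(\bar\rho, \bar\phi) = \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi) + r}{2} + \rho H_r(\bar\rho, \bar\phi; \phi) + K_r(\bar\rho, \bar\phi)\right\}, \tag{4.4.4}
\]
where $W(x) = (\rho, \phi)$ by the change of variables. Now suppose that $S$ is an unbounded, closed, convex cone. The case $S = \mathbb{R}^n$ has been covered, so we suppose that $S \ne \mathbb{R}^n$.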
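Note that $\{\phi : (1, \phi) \in W(S)\}$ is a compact set and that $H_r$ is bounded. Since the infimand of (4.4.4) is quadratic in $\rho$, it is bounded below if $G(\phi) + r > 0$ for all $(\rho, \phi) \in W(S)$, and unbounded below if there exists $(\rho, \phi) \in W(S)$ such that $G(\phi) + r < 0$. Since $G(\phi)$ is a sum and product of sines and cosines, it is continuous on the compact set $\{\phi : (1, \phi) \in W(S)\}$, and as such it attains a minimum there. Thus, defining
\[
\underline{G} = \min_{(1,\phi)\in W(S)}\{G(\phi)\}, \tag{4.4.5}
\]
we have
\[
\begin{aligned}
\inf\{r : G(\phi) + r > 0 \ \forall (1, \phi) \in W(S)\}
&= \inf\{r : r > -G(\phi) \ \forall (1, \phi) \in W(S)\}\\
&= \inf\{r : r > -\underline{G}\}\\
&= -\underline{G}.
\end{aligned}
\]
If $\underline{G} > 0$, then the threshold is $0$, since it cannot be negative. Hence,
\[
\bar r = \max\{0, -\underline{G}\}.
\]
Now setting $\bar r = \max\{0, -\underline{G}\}$, we define the following:
\[
\Phi = \{\phi : (1, \phi) \in W(S) \text{ and } G(\phi) = \underline{G}\}, \tag{4.4.6}
\]
\[
H^{+}_{\bar r}(\bar\rho, \bar\phi) = \{\phi : \phi \in \Phi \text{ and } H_{\bar r}(\bar\rho, \bar\phi; \phi) \ge 0\}, \tag{4.4.7}
\]
\[
H^{++}_{\bar r}(\bar\rho, \bar\phi) = \{\phi : \phi \in \Phi \text{ and } H_{\bar r}(\bar\rho, \bar\phi; \phi) > 0\}. \tag{4.4.8}
\]
For the following, recall that a set is said to be polyhedral if it can be expressed as the intersection of a finite number of closed half-spaces [196, Ex 2.10].

Theorem 4.4.4. On $\mathbb{R}^n$, let $f$ be a quadratic function with $S = \operatorname{dom} f$ a closed, unbounded polyhedral cone. Define $G(\phi)$, $H_r(\bar\rho, \bar\phi; \phi)$, $\underline{G}$, $\Phi$, $H^{+}_{\bar r}(\bar\rho, \bar\phi)$, and $H^{++}_{\bar r}(\bar\rho, \bar\phi)$ as in (4.4.1), (4.4.2), (4.4.5), (4.4.6), (4.4.7), and (4.4.8). Then, using $W(\bar x) = (\bar\rho, \bar\phi)$, the threshold of $f$ is
\[
\bar r = \max\{0, -\underline{G}\},
\]
and $\operatorname{dom} e_{\bar r} f$ depends on $\underline{G}$ and $H_{\bar r}(\bar\rho, \bar\phi; \phi)$ in the following manner.
(i) If $\underline{G} > 0$, then $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$.
(ii) If $\underline{G} \le 0$, and $\Phi = H^{++}_{\bar r}(\bar\rho, \bar\phi)$, then $\bar x \in \operatorname{dom} e_{\bar r} f$.
(iii) If $\underline{G} \le 0$, and $\Phi \ne H^{+}_{\bar r}(\bar\rho, \bar\phi)$, then $\bar x \notin \operatorname{dom} e_{\bar r} f$.

Proof. (i) If $\underline{G} > 0$, then $\bar r = 0$ and we have
\[
e_{\bar r} f(\bar\rho, \bar\phi) = \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi)}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\} > -\infty,
\]
since $G(\phi) \ge \underline{G} > 0$ for all $(\rho, \phi) \in W(S)$. Hence, the infimand above is a strictly convex (bounded below) function. Therefore, $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n$ by Lemma 4.3.1.
(ii) If $\underline{G} \le 0$, then $\bar r = -\underline{G}$, which gives us that $\rho^2(G(\phi) + \bar r) \ge 0$ for all $\phi$ with $(1, \phi) \in W(S)$. In fact, $G(\phi) + \bar r = 0$ for all $\phi \in \Phi$, and $G(\phi) + \bar r > 0$ for all $\phi \notin \Phi$.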
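Suppose $(\bar\rho, \bar\phi)$ is such that $\Phi = H^{++}_{\bar r}(\bar\rho, \bar\phi)$. Consider
\[
\begin{aligned}
e_{\bar r} f(\bar\rho, \bar\phi)
&= \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}\\
&= \min\Bigg[\inf_{\rho\ge 0,\ \phi\in\Phi}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\},\\
&\qquad\quad\ \inf_{\rho\ge 0,\ \phi\notin\Phi}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}\Bigg].
\end{aligned}
\]
Considering the first infimum above, we note that $\phi \in \Phi \Rightarrow G(\phi) + \bar r = 0$, so
\[
\inf_{\rho\ge 0,\ \phi\in\Phi}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}
= \inf_{\rho\ge 0,\ \phi\in\Phi}\left\{\rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}.
\]
As $\Phi = H^{++}_{\bar r}(\bar\rho, \bar\phi)$, we have $H_{\bar r}(\bar\rho, \bar\phi; \phi) > 0$, so the minimum occurs at $\rho = 0$. That is,
\[
\inf_{\rho\ge 0,\ \phi\in\Phi}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\} = K_{\bar r}(\bar\rho, \bar\phi).
\]
Turning our attention to the second infimum, given any $\phi \notin \Phi$, the infimand is strictly convex quadratic. Thus, using basic calculus we have
\[
\inf_{\rho\ge 0}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}
= \begin{cases}
-\dfrac{(H_{\bar r}(\bar\rho, \bar\phi; \phi))^2}{2(G(\phi) + \bar r)} + K_{\bar r}(\bar\rho, \bar\phi), & \text{if } H_{\bar r}(\bar\rho, \bar\phi; \phi) \le 0,\\[2ex]
K_{\bar r}(\bar\rho, \bar\phi), & \text{if } H_{\bar r}(\bar\rho, \bar\phi; \phi) > 0.
\end{cases}
\]
Returning to the Moreau envelope calculation, we have that
\[
e_{\bar r} f(\bar\rho, \bar\phi) = \min\left[K_{\bar r}(\bar\rho, \bar\phi),\ \inf_{\substack{\phi\notin\Phi\\ H_{\bar r}(\bar\rho, \bar\phi; \phi)\le 0}}\left\{-\frac{(H_{\bar r}(\bar\rho, \bar\phi; \phi))^2}{2(G(\phi) + \bar r)} + K_{\bar r}(\bar\rho, \bar\phi)\right\}\right]. \tag{4.4.9}
\]
Since $\Phi = H^{++}_{\bar r}(\bar\rho, \bar\phi)$, this simplifies to
\[
e_{\bar r} f(\bar\rho, \bar\phi) = \min\left[K_{\bar r}(\bar\rho, \bar\phi),\ \inf_{\substack{(1,\phi)\in W(S)\\ H_{\bar r}(\bar\rho, \bar\phi; \phi)\le 0}}\left\{-\frac{(H_{\bar r}(\bar\rho, \bar\phi; \phi))^2}{2(G(\phi) + \bar r)} + K_{\bar r}(\bar\rho, \bar\phi)\right\}\right].
\]
Finally, noting that $H_{\bar r}(\bar\rho, \bar\phi; \phi)$ and $G(\phi)$ are continuous functions in $\phi$ and that $\phi$ is bounded, we note that the infimum is over a compact set. Hence, it is attained:
\[
e_{\bar r} f(\bar\rho, \bar\phi) = \min\left[K_{\bar r}(\bar\rho, \bar\phi),\ \min_{\substack{(1,\phi)\in W(S)\\ H_{\bar r}(\bar\rho, \bar\phi; \phi)\le 0}}\left\{-\frac{(H_{\bar r}(\bar\rho, \bar\phi; \phi))^2}{2(G(\phi) + \bar r)} + K_{\bar r}(\bar\rho, \bar\phi)\right\}\right] > -\infty.
\]
Therefore, $\bar x \in \operatorname{dom} e_{\bar r} f$.
(iii) If $(\bar\rho, \bar\phi)$ is such that $\Phi \ne H^{+}_{\bar r}(\bar\rho, \bar\phi)$, then there exists $\hat\phi$ such that $G(\hat\phi) + \bar r = 0$ and $H_{\bar r}(\bar\rho, \bar\phi; \hat\phi) < 0$. Using this, we see that
\[
\begin{aligned}
e_{\bar r} f(\bar\rho, \bar\phi)
&= \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}\\
&\le \inf_{\rho\ge 0}\left\{\rho^2\,\frac{G(\hat\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \hat\phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}\\
&= \inf_{\rho\ge 0}\left\{\rho H_{\bar r}(\bar\rho, \bar\phi; \hat\phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\} = -\infty.
\end{aligned}
\]

Remark 4.4.5. The domain of $e_{\bar r} f$ cannot be identified in every circumstance.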
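In particular, the boundary case $\underline{G} \le 0$ and $\Phi = H^{+}_{\bar r}(\bar\rho, \bar\phi) \setminus H^{++}_{\bar r}(\bar\rho, \bar\phi)$ is not covered by Theorem 4.4.4. See Example 4.5.2 for an illustration of the problems that arise in this situation.

Before moving to general polyhedral domains, we make one final remark on the domain of the Moreau envelope.

Corollary 4.4.6. On $\mathbb{R}^n$, let $f$ be a quadratic function with $S = \operatorname{dom} f$ a closed, unbounded polyhedral cone. Define $G(\phi)$, $H_r(\bar\rho, \bar\phi; \phi)$, $\underline{G}$, $\Phi$, $H^{+}_{\bar r}(\bar\rho, \bar\phi)$, and $H^{++}_{\bar r}(\bar\rho, \bar\phi)$ as in (4.4.1), (4.4.2), (4.4.5), (4.4.6), (4.4.7), and (4.4.8). If $\underline{G} < 0$, then
\[
\operatorname{dom} e_{\bar r} f \ne \emptyset \quad \text{and} \quad \operatorname{dom} e_{\bar r} f \ne \mathbb{R}^n.
\]

Proof. Consider
\[
H_{\bar r}(\bar\rho, \bar\phi; \phi) = \sum_{i=1}^n\left(b_i - \bar\rho\bar r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i\right)\operatorname{Sin}_{i-1}\phi\cos\phi_i.
\]
We first note that if
\[
b_i - \bar\rho\bar r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i = 0 \quad \forall i, \tag{4.4.10}
\]
then $H_{\bar r}(\bar\rho, \bar\phi; \phi) = 0$ for all $\phi$. In this case
\[
\begin{aligned}
e_{\bar r} f(\bar\rho, \bar\phi)
&= \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + \rho H_{\bar r}(\bar\rho, \bar\phi; \phi) + K_{\bar r}(\bar\rho, \bar\phi)\right\}\\
&= \inf_{(\rho,\phi)\in W(S)}\left\{\rho^2\,\frac{G(\phi) + \bar r}{2} + K_{\bar r}(\bar\rho, \bar\phi)\right\}\\
&\ge \inf_{(\rho,\phi)\in W(S)}\left\{K_{\bar r}(\bar\rho, \bar\phi)\right\} = K_{\bar r}(\bar\rho, \bar\phi),
\end{aligned}
\]
as $\rho^2(G(\phi) + \bar r) \ge 0$ for all $(\rho, \phi) \in W(S)$. Thus, any point $(\bar\rho, \bar\phi)$ such that
\[
b_i - \bar\rho\bar r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i = 0 \quad \forall i
\]
is in $\operatorname{dom} e_{\bar r} f$. Returning (4.4.10) to Cartesian coordinates yields $b_i - \bar r\bar x_i = 0$, or $\bar x = b/\bar r$. Next, we show that there exists $(\bar\rho, \bar\phi)$ such that $H_{\bar r}(\bar\rho, \bar\phi; \phi) < 0$ for some $\phi \in \Phi$. This means that $(\bar\rho, \bar\phi)$ meets the conditions of Theorem 4.4.4(iii), hence $\operatorname{dom} e_{\bar r} f \ne \mathbb{R}^n$. To see this, select any $\phi \in \Phi$. Consider the summation
\[
\sum_{i=1}^n\left(b_i - \bar\rho\bar r\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i\right)\operatorname{Sin}_{i-1}\phi\cos\phi_i.
\]
Notice that not all of the factors $\operatorname{Sin}_{i-1}\phi\cos\phi_i$ can be zero. We see this by writing out these terms,
\[
\begin{aligned}
\operatorname{Sin}_0\phi\cos\phi_1 &= \cos\phi_1,\\
\operatorname{Sin}_1\phi\cos\phi_2 &= \sin\phi_1\cos\phi_2,\\
\operatorname{Sin}_2\phi\cos\phi_3 &= \sin\phi_1\sin\phi_2\cos\phi_3,\\
&\ \ \vdots\\
\operatorname{Sin}_{n-2}\phi\cos\phi_{n-1} &= \sin\phi_1\cdots\sin\phi_{n-2}\cos\phi_{n-1},\\
\operatorname{Sin}_{n-1}\phi\cos\phi_n &= \sin\phi_1\cdots\sin\phi_{n-1},
\end{aligned}
\]
and observing that for the first term to be zero, $\phi_1$ must be either $\pi/2$ or $3\pi/2$. Then, since $\sin\phi_1 = \pm 1$, we must have $\phi_2 = \pi/2$ in order for the second term to be zero. Continuing in this manner, we find that $\phi_i = \pi/2$ for all $i \ne 1$, which leaves the last term equal to $\pm 1$.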
Hence, the summation is never identically zero due to $\phi$. Suppose, then, that the $k$th term $\operatorname{Sin}_{k-1}\phi\cos\phi_k \ne 0$. Setting $\bar\phi_i = \pi/2$ for $i = 1, 2, \ldots, k-1$, and $\bar\phi_i = 0$ for $i = k, k+1, \ldots, n-1$, yields
\[
\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i = \begin{cases}
0, & \text{if } i \ne k,\\
1, & \text{if } i = k.
\end{cases}
\]
Hence, $b_k - \bar\rho\bar r\operatorname{Sin}_{k-1}\bar\phi\cos\bar\phi_k$ can be driven to $-\infty$, while the other terms remain constant, by making $\bar\rho$ sufficiently large. Conversely, setting $\bar\phi_1 = 3\pi/2$, $\bar\phi_i = \pi/2$ for $i = 2, 3, \ldots, k-1$, and $\bar\phi_i = 0$ for $i = k, k+1, \ldots, n-1$, yields
\[
\operatorname{Sin}_{i-1}\bar\phi\cos\bar\phi_i = \begin{cases}
0, & \text{if } i \ne k,\\
-1, & \text{if } i = k.
\end{cases}
\]
Hence, $b_k - \bar\rho\bar r\operatorname{Sin}_{k-1}\bar\phi\cos\bar\phi_k$ can be driven to $\infty$, while the other terms remain constant, by making $\bar\rho$ sufficiently large. Therefore, it is always possible to select $(\bar\rho, \bar\phi)$ with $H_{\bar r}(\bar\rho, \bar\phi; \phi) < 0$.

4.4.2 Quadratic functions with polyhedral domain

Theorem 4.4.4 covers the case where $\operatorname{dom} f$ is an unbounded polyhedral cone. We now generalize to include all unbounded polyhedral domains. For this, we will need the recession cone, defined as follows.

Definition 4.4.7. [196, Definition 6.33] For any point $\bar x \in S \subset \mathbb{R}^n$, $S \ne \emptyset$, the recession cone $R(\bar x)$ is the cone defined as
\[
R(\bar x) = \{x : \bar x + \tau x \in S \ \forall \tau \ge 0\}.
\]
If $S$ is polyhedral, then $R(\bar x)$ is the same independent of the choice of $\bar x$ [196, Exercise 6.34], and we denote the recession cone by $R$. If $S$ is bounded, then $R = \{0\}$. If $S$ is unbounded, then $R$ represents all unbounded directions of $S$. We will see that in order to understand the threshold, it suffices to focus solely on what happens on $R$. We first prove that the thresholds themselves are the same on $R$ as on $S$, in Theorem 4.4.8 below.

Theorem 4.4.8. Let $f : S \to \mathbb{R}$ be a quadratic function with $S$ polyhedral. For any $\hat x \in S$, let $R = R(\hat x) + \hat x$. Define
\[
\tilde f(x) = \begin{cases}
f(x), & \text{if } x \in R,\\
\infty, & \text{if } x \notin R.
\end{cases}
\]
Let $\bar r_f$ and $\bar r_{\tilde f}$ be the thresholds of $f$ and $\tilde f$, respectively. Then $\bar r_{\tilde f} = \bar r_f$.

Proof. Let $r > \bar r_f$. Then $\operatorname{dom} e_r f = \mathbb{R}^n$, so $\operatorname{dom} e_r\tilde f = \mathbb{R}^n$ by Lemma 4.4.1. This gives us an upper bound on the threshold of $\tilde f$: $\bar r_{\tilde f} \le \bar r_f$. Now let $r > \bar r_{\tilde f}$.
Now let r > r¯f˜ .It suffices to show that dom erf = Rn, since this implies that r ≥ r¯f . Let G˜(φ),H˜r(ρ¯, φ¯;φ), K˜r(ρ¯, φ¯), and G˜ be defined as in (4.4.1), (4.4.2), (4.4.3), and (4.4.5).To see that dom erf = Rn, suppose that x¯ 6∈ dom erf. Since r > r¯f˜ , we knowthat dom erf˜ = Rn, so there exists {xn} ⊆ S \ R (where (ρn, φn) = W (xn))such thatlimn→∞{ρ2nG˜(φn) + r2+ ρnH˜r(ρ¯, φ¯;φn) + K˜r(ρ¯, φ¯)}= −∞. (4.4.11)Since r > r¯f˜ , we have G˜(φ) + r/2 > 0 for all φ with (1, φ) ∈ W (R). SinceG˜(φ), H˜r(ρ¯, φ¯;φ) and K˜(ρ¯, φ¯) are bounded, necessarily ρn → ∞. By definitionof the recession cone, dropping to a subsequence if necesary, we may assume that474.4. PLQ functionsφn → φˆ such that (1, φˆ) ∈W (R). Since G˜(φˆ) + r/2 > 0 and G˜(φ) is continuous,there exists N ∈ N such that G˜(φn) + r > G˜(φˆ)+r2 for all n ≥ N. This means thatlimn→∞{ρ2nG˜(φn) + r2+ ρnH˜r(ρ¯, φ¯;φn) + K˜r(ρ¯, φ¯)}≥ limn→∞{ρ2nG˜(φˆ) + r4+ ρnH˜r(ρ¯, φ¯;φn) + K˜r(ρ¯, φ¯)}.Since H˜r(ρ¯, φ¯;φn) is bounded, say |H˜r(ρ¯, φ¯;φn)| ≤ L, we have thatlimn→∞{ρ2nG˜(φˆ) + r4+ ρnH˜r(ρ¯, φ¯;φn) + K˜r(ρ¯, φ¯)}≥ limn→∞{ρ2nG˜(φˆ)4− ρnL+ K˜r(ρ¯, φ¯)}=∞.This is a contradiction to (4.4.11). Therefore, dom er¯f = Rn.We henceforth drop the subscripts on the threshold and set r¯f = r¯f˜ = r¯. We turnour attention to the domain of the Moreau envelope of a constrained function withpolyhedral domain.Theorem 4.4.9. Let f(x) = 12〈x,Ax〉 + 〈b, x〉 + c, A ∈ Sn, b ∈ Rn, c ∈ Rbe a quadratic function on S ⊆ Rn with S polyhedral. For any xˆ ∈ S, defineR = R(xˆ) + xˆ. Definef˜(x) ={f(x), x ∈ R,∞, else.Let r¯ be the threshold of prox-boundedness of f˜ . For f˜ , define G˜(φ), H˜r(ρ¯, φ¯;φ),G˜, Φ˜, H˜+r¯ (ρ¯, φ¯), and H˜++r¯ (ρ¯, φ¯) as in (4.4.1), (4.4.2), (4.4.5), (4.4.6), (4.4.7), and(4.4.8). Then the following hold.(i) If G˜ > 0, then dom er¯f˜ = dom er¯f = Rn.(ii) If G˜ ≤ 0, and φ ∈ Φ˜⇒ (1, φ) ∈ intR, then dom er¯f˜ = dom er¯f.(iii) If G˜ ≤ 0 and Φ˜ 6= H˜+r¯ (ρ¯, φ¯), then x¯ 6∈ dom er¯f˜ and x¯ 6∈ dom er¯f.Proof. 
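Notice that the functions $\tilde G(\phi)$ and $\tilde H_{\bar r}(\bar\rho, \bar\phi; \phi)$ are the same for $f$ as for $\tilde f$, with possibly different domains.
(i) If $\underline{\tilde G} > 0$, then $\bar r = 0$ and by the same argument as in the proof of Theorem 4.4.4(i) we have $\operatorname{dom} e_{\bar r}\tilde f = \mathbb{R}^n$. Suppose $\operatorname{dom} e_{\bar r} f \ne \mathbb{R}^n$. Then there exists $(\bar\rho, \bar\phi)$ such that $e_{\bar r} f(\bar\rho, \bar\phi) = -\infty$. That is,
\[
\inf_{(\rho,\phi)\in W(S\setminus R)}\left\{\rho^2\,\frac{\tilde G(\phi)}{2} + \rho\tilde H_{\bar r}(\bar\rho, \bar\phi; \phi) + \tilde K_{\bar r}(\bar\rho, \bar\phi)\right\} = -\infty. \tag{4.4.12}
\]
In order for (4.4.12) to be true, we must have a sequence
\[
\{(\rho_n, \phi_n)\}_{n=1}^\infty \subseteq W(S \setminus R)
\]
such that
\[
\lim_{n\to\infty}\left\{\rho_n^2\,\frac{\tilde G(\phi_n)}{2} + \rho_n\tilde H_{\bar r}(\bar\rho, \bar\phi; \phi_n) + \tilde K_{\bar r}(\bar\rho, \bar\phi)\right\} = -\infty. \tag{4.4.13}
\]
As $\tilde G(\phi) > 0$ for all $\phi$ with $(1, \phi) \in W(R)$, by the same argument as in the proof of Theorem 4.4.8 we get a contradiction to (4.4.12). We conclude that the domain of $e_{\bar r} f$ is $\mathbb{R}^n$.
(ii) By Lemma 4.4.1, we have $\operatorname{dom} e_{\bar r} f \subseteq \operatorname{dom} e_{\bar r}\tilde f$. Suppose there exists a point $\bar x \in \operatorname{dom} e_{\bar r}\tilde f \setminus \operatorname{dom} e_{\bar r} f$. This implies that we have a sequence $\{(\rho_n, \phi_n)\}_{n=1}^\infty$ in $W(S \setminus R)$ such that
\[
\lim_{n\to\infty}\left\{\rho_n^2\,\frac{\tilde G(\phi_n) + \bar r}{2} + \rho_n\tilde H_{\bar r}(\bar\rho, \bar\phi; \phi_n) + \tilde K_{\bar r}(\bar\rho, \bar\phi)\right\} = -\infty. \tag{4.4.14}
\]
Dropping to a subsequence if necessary, we assume $\rho_n \to \infty$ and $\phi_n \to \hat\phi$ such that $(1, \hat\phi) \in W(R)$. Note that $(1, \hat\phi)$ is on the boundary of $W(R)$. Since the pair $(1, \hat\phi) \in W(R)$, we have $\tilde G(\hat\phi) \ge \underline{\tilde G}$. In fact, $\tilde G(\hat\phi) > \underline{\tilde G}$, since
\[
\phi \in \tilde\Phi \Rightarrow (1, \phi) \in \operatorname{int} R.
\]
Hence, $\tilde G(\hat\phi) + \bar r > 0$. The proof now follows from the same arguments as in Theorem 4.4.8.
(iii) If $\underline{\tilde G} \le 0$ and $\tilde\Phi \ne \tilde H^{+}_{\bar r}(\bar\rho, \bar\phi)$, then $\bar x \notin \operatorname{dom} e_{\bar r}\tilde f$ by Theorem 4.4.4(iii). By Lemma 4.4.1, we have $\operatorname{dom} e_{\bar r} f \subseteq \operatorname{dom} e_{\bar r}\tilde f$. Thus, $\bar x \notin \operatorname{dom} e_{\bar r} f$.

Remark 4.4.10. As we saw in Theorem 4.4.4, the domain of the Moreau envelope is not identifiable in all situations. For a quadratic function $f$ with general polyhedral domain, we are certain of the domain of $e_{\bar r} f$ only in the three situations described in the statement of Theorem 4.4.9. See Example 4.5.2 for an illustration of how polyhedral domains that are not conic can make it difficult to identify $\operatorname{dom} e_{\bar r} f$.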
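As a rough numerical illustration of how (4.4.1) and (4.4.5) determine the threshold on a cone in $\mathbb{R}^2$, the sketch below samples $G(\phi) = u^\top A u$ with $u = (\cos\phi, \sin\phi)$ over an angular interval and reports the smallest sampled value together with $\max\{0, -\min G\}$. The function name `cone_threshold_2d`, the uniform grid search, and the test function $f(x, y) = xy$ restricted to the second quadrant are assumptions made purely for illustration; a grid only approximates the true minimum, and the sketch says nothing about the domain of the envelope at the threshold.

```python
import math


def cone_threshold_2d(A, phi_lo, phi_hi, samples=20001):
    """Approximate min G(phi) over phi in [phi_lo, phi_hi] and the
    resulting threshold max{0, -min G}, for a symmetric 2x2 matrix A.

    In polar coordinates x = rho*(cos(phi), sin(phi)), the quantity in
    (4.4.1) reduces to G(phi) = u^T A u with u = (cos(phi), sin(phi)).
    """
    a11, a12, a22 = A[0][0], A[0][1], A[1][1]
    g_min = math.inf
    for k in range(samples):
        # Uniform grid on the angular interval (illustrative choice).
        phi = phi_lo + (phi_hi - phi_lo) * k / (samples - 1)
        c, s = math.cos(phi), math.sin(phi)
        g = a11 * c * c + 2.0 * a12 * c * s + a22 * s * s
        g_min = min(g_min, g)
    return g_min, max(0.0, -g_min)


if __name__ == "__main__":
    # f(x, y) = x*y = 0.5 * [x y] A [x y]^T with A = [[0, 1], [1, 0]],
    # restricted to the second quadrant: phi in [pi/2, pi].
    A = [[0.0, 1.0], [1.0, 0.0]]
    print(cone_threshold_2d(A, math.pi / 2, math.pi))
```

For this hypothetical input, $G(\phi) = \sin 2\phi$, so the sampled minimum should be close to $-1$ (near $\phi = 3\pi/4$) and the reported threshold close to $1$. This agrees with the elementary bound $xy \ge -\frac12(x^2 + y^2)$, which shows that $xy + \frac r2(x^2 + y^2)$ is bounded below on the quadrant exactly when $r \ge 1$.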
4.4.3 PLQ functions

For a quadratic function $f$ whose domain is a single, closed, unbounded polyhedral region, we use Theorems 4.4.8 and 4.4.9 to identify the threshold $\bar r$ and $\operatorname{dom} e_{\bar r} f$. We will now use this as a basis for doing the same with a PLQ function. Since a PLQ function is continuous [196, Proposition 10.21], every piece is bounded below except possibly those whose domains are unbounded sets. Theorem 4.4.11 explicitly identifies the thresholds, and the domains of the Moreau envelopes at the thresholds where possible, of PLQ functions.

Theorem 4.4.11. For $i \in \{1, 2, \ldots, m\}$, let $f_i : \mathbb{R}^n \to \mathbb{R}$ be quadratic functions on closed, polyhedral domains $S_i = \operatorname{dom} f_i$, such that $S_i \cap \operatorname{int} S_j = \emptyset$ for every $i \ne j$ and $f_i(x) = f_j(x)$ for all $x \in S_i \cap S_j$. Let $\bar r_i$ be the threshold of $f_i$ for each $i$ (find $\bar r_i$ and $\operatorname{dom} e_{\bar r_i} f_i$ by applying Theorem 4.4.9 to each $f_i$). Define the function
\[
f(x) = \begin{cases}
f_1(x), & x \in S_1,\\
f_2(x), & x \in S_2,\\
\quad\vdots\\
f_m(x), & x \in S_m.
\end{cases}
\]
Then the threshold of $f$ is
\[
\bar r = \max_i\{\bar r_i\}.
\]
Moreover, if we define the active set $A = \{i : \bar r_i = \bar r\}$, then
\[
\operatorname{dom} e_{\bar r} f = \bigcap_{i\in A}\operatorname{dom} e_{\bar r} f_i.
\]

Proof. We will make use of the following equation in the proof:
\[
\begin{aligned}
e_r f(\bar x) &= \inf_{y\in\operatorname{dom} f}\left\{f(y) + \frac r2\|y - \bar x\|^2\right\}\\
&= \min\left[\inf_{y\in S_1}\left\{f_1(y) + \frac r2\|y - \bar x\|^2\right\}, \ldots, \inf_{y\in S_m}\left\{f_m(y) + \frac r2\|y - \bar x\|^2\right\}\right].
\end{aligned}
\tag{4.4.15}
\]
Let $r > \max_i\{\bar r_i\}$. Then by [196, Theorem 1.25], we have $e_r f_i(\bar x) > -\infty$ for all $\bar x \in \mathbb{R}^n$, for all $i$. Then (4.4.15) gives us that $e_r f(\bar x) > -\infty$ for all $\bar x \in \mathbb{R}^n$, hence, $\bar r \le \max_i\{\bar r_i\}$. Now let $r < \max_i\{\bar r_i\}$. Then for any $k$ such that $\bar r_k = \max_i\{\bar r_i\}$, we have $e_r f_k(\bar x) = -\infty$ for all $\bar x \in \mathbb{R}^n$. Then (4.4.15) gives us that $e_r f(\bar x) = -\infty$ for all $\bar x \in \mathbb{R}^n$, hence, $\bar r \ge \max_i\{\bar r_i\}$. Therefore, $\bar r = \max_i\{\bar r_i\}$.

If $\bar r = 0$ and $\operatorname{dom} e_{\bar r_i} f_i = \mathbb{R}^n$ for all $i \in A$, then by Lemma 4.3.1 $f_i$ is bounded below for each $i \in A$. Since $\max_i\{\bar r_i\} = 0$, we know that $A = \{1, 2, \ldots, m\}$, so in fact $f_i$ is bounded below for all $i$. Hence, $f$ is bounded below as well. By Lemma 4.3.1, $\operatorname{dom} e_{\bar r} f = \mathbb{R}^n = \bigcap_{i\in A}\operatorname{dom} e_{\bar r} f_i$.

If we do not have $\bar r = 0$ with $\operatorname{dom} e_{\bar r_i} f_i = \mathbb{R}^n$ for all $i$, then consider any $\bar x$.
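Notice that if $i \notin A$, then $\bar r > \bar r_i$, so $\operatorname{dom} e_{\bar r} f_i = \mathbb{R}^n$. If $i \in A$, then $e_{\bar r} f_i(\bar x) > -\infty$ if and only if $\bar x \in \operatorname{dom} e_{\bar r} f_i$. Hence, we have
\[
\operatorname{dom} e_{\bar r} f = \bigcap_{i\in A}\operatorname{dom} e_{\bar r} f_i.
\]

Remark 4.4.12. Two applications of Theorem 4.4.11 are given in Examples 4.5.3 and 4.5.4.

4.5 Examples

In this section, we provide a few examples that illustrate some of the nuances of the results and highlight the procedures. The first example illustrates the basic techniques for a full-domain quadratic function.

Example 4.5.1. Define $f : \mathbb{R}^2 \to \mathbb{R}$,
\[
f(x) = \frac12 x^\top\begin{bmatrix} 1 & 2\\ 2 & -2 \end{bmatrix}x + \begin{bmatrix} 1 & 1 \end{bmatrix}x + 1.
\]
Then the threshold is $\bar r = 3$, and $\frac{1}{\bar r}\begin{bmatrix} 1\\ 1 \end{bmatrix} \in \operatorname{dom} e_{\bar r} f$.

Proof. Let $A = \begin{bmatrix} 1 & 2\\ 2 & -2 \end{bmatrix}$ and $b = \begin{bmatrix} 1\\ 1 \end{bmatrix}$. Spectral decomposition of $A$ yields $A = Q^\top D Q$ with $Q = \frac{\sqrt5}{5}\begin{bmatrix} 2 & 1\\ 1 & -2 \end{bmatrix}$ and $D = \begin{bmatrix} 2 & 0\\ 0 & -3 \end{bmatrix}$. From $D$ we see that $\lambda_1 = 2$ and $\lambda_2 = -3$, hence $\bar r = 3$. As per Theorem 4.3.8, we use the substitutions $x = Q^\top y$ and $\bar x = Q^\top\bar y$, and calculate the Moreau envelope of $f$ at the threshold:
\[
\begin{aligned}
e_{\bar r} f(Q^\top\bar y)
&= \inf_y\left\{\frac12 y^\top\begin{bmatrix} 2 & 0\\ 0 & -3 \end{bmatrix}y + \left(\frac{\sqrt5}{5}\begin{bmatrix} 2 & 1\\ 1 & -2 \end{bmatrix}\begin{bmatrix} 1\\ 1 \end{bmatrix}\right)^{\!\top} y + 1 + \frac32\|y - \bar y\|^2\right\}\\
&= \inf_y\left\{\frac52 y_1^2 + \left(\frac{3\sqrt5}{5} - 3\bar y_1\right)y_1 + \left(-\frac{\sqrt5}{5} - 3\bar y_2\right)y_2 + 1 + \frac32(\bar y_1^2 + \bar y_2^2)\right\}\\
&= \begin{cases}
\frac35\bar y_1^2 + \frac{9\sqrt5}{25}\bar y_1 + \frac{64}{75}, & \text{if } \bar y_2 = -\frac{\sqrt5}{15},\\
-\infty, & \text{if } \bar y_2 \ne -\frac{\sqrt5}{15}.
\end{cases}
\end{aligned}
\]
Now we use $\bar x = Q^\top\bar y$ to find that
\[
e_{\bar r} f(\bar x) = \begin{cases}
\frac34\bar x_1^2 + \bar x_1 + \frac{11}{12}, & \text{if } \bar x_2 = \frac12\bar x_1 + \frac16,\\
-\infty, & \text{if } \bar x_2 \ne \frac12\bar x_1 + \frac16.
\end{cases}
\]
Hence, we have
\[
\operatorname{dom} e_{\bar r} f = \left\{\bar x : \bar x_2 = \frac12\bar x_1 + \frac16\right\}.
\]
Finally, in accordance with Corollary 4.4.2, we observe that $\frac{1}{\bar r}b \in \operatorname{dom} e_{\bar r} f$.

Our next example shows the difficulty in computing $\operatorname{dom} e_{\bar r} f$ when nonconic sets are involved.

Example 4.5.2. Define $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = xy$. Let
\[
S_1 = \{(x, 0)\}, \qquad S_2 = \{(x, y) : -1 \le y \le 1\},
\]
and define
\[
f_1(x, y) = \begin{cases}
f(x, y), & \text{if } (x, y) \in S_1,\\
\infty, & \text{if } (x, y) \notin S_1,
\end{cases}
\qquad
f_2(x, y) = \begin{cases}
f(x, y), & \text{if } (x, y) \in S_2,\\
\infty, & \text{if } (x, y) \notin S_2.
\end{cases}
\]
Then $f_1$ and $f_2$ have $\underline{G} = 0$ and $\Phi = H^{+}_{\bar r}(\bar\rho, \bar\phi) \setminus H^{++}_{\bar r}(\bar\rho, \bar\phi)$, but $\operatorname{dom} e_{\bar r} f_1 = \mathbb{R}^2$, whereas $\operatorname{dom} e_{\bar r} f_2 = \emptyset$.

Proof. On $S_1$, the function $f_1$ is identically zero. This makes it trivial to find that $\underline{G} = 0$ and $H_{\bar r}(\bar\rho, \bar\phi; \phi) = 0$ for all $x \in \operatorname{dom} f_1$, for all $\bar x \in \mathbb{R}^2$. Hence, $\Phi = H^{+}_{\bar r}(\bar\rho, \bar\phi) \setminus H^{++}_{\bar r}(\bar\rho, \bar\phi)$. Since $f_1$ is bounded below, by Lemma 4.3.1 we have that $\operatorname{dom} e_{\bar r} f_1 = \mathbb{R}^2$.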
The recession cone of $S_2$ is $S_1$. It is left to the reader to verify that $G(\phi) = \sin 2\phi$, $\underline{G} = \bar r = 0$, $\Phi = \{0, \pi\}$, and $H_{\bar r}(\bar\rho, \bar\phi; \phi) = 0$, so that $\Phi = H^{+}_{\bar r}(\bar\rho, \bar\phi) \setminus H^{++}_{\bar r}(\bar\rho, \bar\phi)$. Then
\[
e_{\bar r} f_2(\bar x, \bar y) = \inf_{-1\le y\le 1} xy = -\infty \quad \forall (\bar x, \bar y) \in \mathbb{R}^2.
\]
Therefore, $\operatorname{dom} e_{\bar r} f_2 = \emptyset$.

Next, we have a simple example that shows it is possible to construct PLQ functions with equal, positive thresholds, whose Moreau envelope domains are different.

Example 4.5.3. Define two regions on $\mathbb{R}$: $S_1 = \{x : x \le 0\}$, $S_2 = \{x : x \ge 0\}$. Define
\[
\begin{aligned}
f_1(x) &= -x^2, & f_2(x) &= -x^2,\\
g_1(x) &= -(x + 1)^2, & g_2(x) &= -(x - 1)^2.
\end{aligned}
\]
Then the PLQ functions
\[
F(x) = \begin{cases}
f_1(x), & \text{if } x \in S_1,\\
f_2(x), & \text{if } x \in S_2,
\end{cases}
\qquad
G(x) = \begin{cases}
g_1(x), & \text{if } x \in S_1,\\
g_2(x), & \text{if } x \in S_2,
\end{cases}
\]
both have threshold $\bar r_F = \bar r_G = 2$, but $\operatorname{dom} e_2 F = \{0\}$, whereas $\operatorname{dom} e_2 G = \emptyset$.

[Figure: two panels plotting $F(x)$ and $G(x)$ against $x$ on $[-3, 3]$.]
Figure 4.1: The Moreau envelopes of PLQ functions with the same threshold may have different domains.

Proof. Figure 4.1 makes it plain to see that for $F(x)$, the common real value of the Moreau envelopes is $e_2 f_1(0) = e_2 f_2(0) = 0$. Hence, $e_2 F(0) = 0$ and $e_2 F(x) = -\infty$ for all $x \ne 0$, which gives $\operatorname{dom} e_2 F = \{0\}$. For $G(x)$, we see that $e_2 g_1(-1) = e_2 g_2(1) = 0$ and the real values of the Moreau envelopes are not at the same point, which gives $e_2 G(x) = -\infty$ everywhere. Hence, $\operatorname{dom} e_2 G = \emptyset$.

Finally, we have an example of a six-piece PLQ function on $\mathbb{R}^2$. We identify the threshold of each piece and that of the PLQ function. We also draw some conclusions with respect to the domain of the Moreau envelope for each piece and for that of the PLQ function.

Example 4.5.4.
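Define six regions on $\mathbb{R}^2$ that overlap on their boundaries:
\[
\begin{aligned}
S_1 &= \{(x, y) : y \ge 0,\ x \le -2\},\\
S_2 &= \{(x, y) : x \ge -2,\ y \ge x + 2,\ x \le 0\},\\
S_3 &= \{(x, y) : y \ge 0,\ y \le x + 2,\ x \le 0\},\\
S_4 &= \{(x, y) : x \ge 0,\ y \ge x\},\\
S_5 &= \{(x, y) : y \ge 0,\ y \le x\}, \text{ and}\\
S_6 &= \{(x, y) : y \le 0\}.
\end{aligned}
\]
Define the quadratic functions
\[
\begin{aligned}
f_1(x, y) &= \frac12\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 0 & -4\\ -4 & 0 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} 1 & -3 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix},\\
f_2(x, y) &= \frac12\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 6 & -3\\ -3 & 0 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} 7 & -1 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix},\\
f_3(x, y) &= \begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix},\\
f_4(x, y) &= \frac12\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 12 & -7\\ -7 & 0 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} 6 & -1 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix},\\
f_5(x, y) &= \frac12\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 0 & 5\\ 5 & -12 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} 1 & 4 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}, \text{ and}\\
f_6(x, y) &= \frac12\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 0 & 2\\ 2 & -2 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} 1 & 1 \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix},
\end{aligned}
\]
and the PLQ function
\[
f(x, y) = \begin{cases}
f_1(x, y), & \text{if } (x, y) \in S_1,\\
f_2(x, y), & \text{if } (x, y) \in S_2,\\
f_3(x, y), & \text{if } (x, y) \in S_3,\\
f_4(x, y), & \text{if } (x, y) \in S_4,\\
f_5(x, y), & \text{if } (x, y) \in S_5,\\
f_6(x, y), & \text{if } (x, y) \in S_6.
\end{cases}
\]
Then $f$ has threshold $\bar r = \frac12 + \frac12\sqrt5 \approx 1.618$, with
\[
\Phi = \{\hat\phi\} = \left\{\pi - \arctan\left(\frac12 + \frac12\sqrt5\right)\right\}.
\]
Moreover, $\operatorname{dom} e_{\bar r} f \ne \mathbb{R}^2$, $\operatorname{dom} e_{\bar r} f \ne \emptyset$.

Figure 4.2: The partitioning of $\mathbb{R}^2$ for $f(x, y)$.

Proof. Figure 4.2 shows the six regions of the domain of $f$. It is left to the reader to verify that $f$ is indeed a PLQ function, that is, it is continuous at all boundary points.

$S_1$: This region is not a cone, so we identify the recession cone $R_1$ and use
\[
W(R_1) = \left\{(\rho, \phi) : \rho \ge 0,\ \phi \in \left[\frac\pi2, \pi\right]\right\}.
\]
We consider the restricted function $\tilde f_1 = f_1$ with $\operatorname{dom}\tilde f_1 = R_1 + (-2, 0)$. In polar coordinate form, the function becomes
\[
\tilde f_1(\rho, \phi) = -4\rho^2\cos\phi\sin\phi + \rho(\cos\phi - 3\sin\phi).
\]
Then the Moreau envelope at $W((\bar x, \bar y)) = (\bar\rho, \bar\phi)$ is
\[
\inf_{(\rho,\phi)\in W(R_1)}\left\{\rho^2\left(-2\sin 2\phi + \frac r2\right) + \rho\left[\cos\phi - 3\sin\phi - r\bar\rho\cos(\phi - \bar\phi)\right] + \frac r2\bar\rho^2\right\}.
\]
Using (4.4.1) and (4.4.2), we have
\[
G(\phi) = -2\sin 2\phi \quad \text{and} \quad H_r(\bar\rho, \bar\phi; \phi) = \cos\phi - 3\sin\phi - r\bar\rho\cos(\phi - \bar\phi).
\]
Notice that
\[
\underline{G} = \min_{\phi\in[\frac\pi2, \pi]} G(\phi) = 0 \quad \text{and} \quad \Phi = \operatorname*{argmin}_{\phi\in[\frac\pi2, \pi]} G(\phi) = \left\{\frac\pi2, \pi\right\}.
\]
This gives $\bar r_1 = 0$, $H_{\bar r_1}\left(\bar\rho, \bar\phi; \frac\pi2\right) = -3$, and $H_{\bar r_1}(\bar\rho, \bar\phi; \pi) = -1$, independent of the choice of $(\bar\rho, \bar\phi)$.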
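So we have $\underline{G} \le 0$ and $\Phi \ne H^{+}_{\bar r_1}(\bar\rho, \bar\phi)$ for all $\bar x \in \mathbb{R}^2$. Therefore, by Theorem 4.4.9, $\operatorname{dom} e_{\bar r_1} f_1 = \emptyset$.

$S_2$: This region is not a cone, so we identify the recession cone $R_2$ and use
\[
W(R_2) = \left\{(\rho, \phi) : \rho \ge 0,\ \phi = \frac\pi2\right\}.
\]
We consider the restricted function $\tilde f_2 = f_2$ with $\operatorname{dom}\tilde f_2 = R_2 + (-2, 0)$. The function in polar coordinates is
\[
\tilde f_2(\rho, \phi) = 3\rho^2\cos^2\phi - 3\rho^2\cos\phi\sin\phi + 7\rho\cos\phi - \rho\sin\phi,
\]
and the Moreau envelope at $(\bar\rho, \bar\phi)$ is the infimum over $(\rho, \phi) \in W(R_2)$ of the set
\[
\left\{\rho^2\left(3\cos^2\phi - \frac32\sin 2\phi + \frac r2\right) + \rho\left[7\cos\phi - \sin\phi - r\bar\rho\cos(\phi - \bar\phi)\right] + \frac r2\bar\rho^2\right\}.
\]
Since we have only one angle in $W(R_2)$, $\phi = \frac\pi2$, we get $\underline{G} = 0$ and $\bar r_2 = 0$. Then $H_{\bar r_2}\left(\bar\rho, \bar\phi; \frac\pi2\right) = -1$. So we have $\underline{G} \le 0$ and $\Phi \ne H^{+}_{\bar r_2}(\bar\rho, \bar\phi)$ for all $\bar x \in \mathbb{R}^2$. Therefore, by Theorem 4.4.9, $\operatorname{dom} e_{\bar r_2} f_2 = \emptyset$.

$S_3$: This region is bounded, so $f_3$ has threshold $\bar r_3 = 0$, and $\operatorname{dom} e_{\bar r_3} f_3 = \mathbb{R}^2$.

$S_4$: This region is a closed, unbounded polyhedral cone. The function in polar coordinates is
\[
f_4(\rho, \phi) = 6\rho^2\cos^2\phi - \frac72\rho^2\sin 2\phi + 6\rho\cos\phi - \rho\sin\phi,
\]
with domain $W(S_4) = \left\{(\rho, \phi) : \rho \ge 0,\ \phi \in \left[\frac\pi4, \frac\pi2\right]\right\}$. Its Moreau envelope at $(\bar\rho, \bar\phi)$ is the infimum over $(\rho, \phi) \in W(S_4)$ of the set
\[
\left\{\rho^2\left(6\cos^2\phi - \frac72\sin 2\phi + \frac r2\right) + \rho\left[6\cos\phi - \sin\phi - r\bar\rho\cos(\phi - \bar\phi)\right] + \frac r2\bar\rho^2\right\}.
\]
This yields $G(\phi) = 6\cos^2\phi - \frac72\sin 2\phi$ and $\underline{G} = 6\cos^2\hat\phi - \frac72\sin 2\hat\phi$, where $\hat\phi = \arctan\frac{6 + \sqrt{85}}{7}$ is the unique minimizer, hence $\underline{G} \approx -1.610$ and $\bar r_4 \approx 1.610$. Since $\underline{G} < 0$, by Corollary 4.4.6 (noting that $S_4$ is conic) we have $\operatorname{dom} e_{\bar r_4} f_4 \ne \mathbb{R}^2$ and $\operatorname{dom} e_{\bar r_4} f_4 \ne \emptyset$.

$S_5$: This region is also a closed, unbounded polyhedral cone. The function in polar coordinates is
\[
f_5(\rho, \phi) = \rho^2(5\cos\phi\sin\phi - 6\sin^2\phi) + \rho(\cos\phi + 4\sin\phi),
\]
with domain $W(S_5) = \left\{(\rho, \phi) : \rho \ge 0,\ \phi \in \left[0, \frac\pi4\right]\right\}$. Its Moreau envelope at $(\bar\rho, \bar\phi)$ is the infimum over $(\rho, \phi) \in W(S_5)$ of the set
\[
\left\{\rho^2\left(5\cos\phi\sin\phi - 6\sin^2\phi + \frac r2\right) + \rho\left[\cos\phi + 4\sin\phi - r\bar\rho\cos(\phi - \bar\phi)\right] + \frac r2\bar\rho^2\right\}.
\]
We find that $G(\phi)$ is minimized uniquely at $\frac\pi4$, $\underline{G} = -\frac12$ and $\bar r_5 = \frac12$. Since $\underline{G} < 0$, by Corollary 4.4.6 we have $\operatorname{dom} e_{\bar r_5} f_5 \ne \mathbb{R}^2$ and $\operatorname{dom} e_{\bar r_5} f_5 \ne \emptyset$.

$S_6$: This region is also a closed, unbounded polyhedral cone.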
The function in polar coordinates is
\[
f_6(\rho,\phi) = \rho^2(2\cos\phi\sin\phi - \sin^2\phi) + \rho(\cos\phi + \sin\phi),
\]
with domain $W(S_6) = \{(\rho,\phi) : \rho \ge 0,\ \phi \in [\pi, 2\pi]\}$. Its Moreau envelope at $(\bar\rho,\bar\phi)$ is the infimum over $(\rho,\phi) \in W(S_6)$ of the set
\[
\left\{ \rho^2\left(2\cos\phi\sin\phi - \sin^2\phi + \tfrac{r}{2}\right) + \rho\left[\cos\phi + \sin\phi - r\bar\rho\cos(\phi - \bar\phi)\right] + \tfrac{r}{2}\bar\rho^2 \right\}.
\]
We find that $G(\phi)$ is minimized uniquely at
\[
\hat\phi = \pi - \arctan\frac{10\left[-\frac{1}{200}(50 - 10\sqrt5)^{3/2} + \frac{3}{10}\sqrt{50 - 10\sqrt5}\right]}{\sqrt{50 - 10\sqrt5}} = \pi - \arctan\left(\tfrac12 + \tfrac12\sqrt5\right).
\]
This gives $G = -\tfrac12 - \tfrac12\sqrt5$ and $\bar r_6 = \tfrac12 + \tfrac12\sqrt5 \approx 1.618$. Since $G < 0$, by Corollary 4.4.6 we have $\operatorname{dom} e_{\bar r_6} f_6 \neq \mathbb{R}^2$ and $\operatorname{dom} e_{\bar r_6} f_6 \neq \emptyset$.

We summarize these results below. For
\[
\hat\phi = \arctan\frac{6 + \sqrt{85}}{7},
\]
we have the following table.

$i$   $\bar r_i$                                          $\bar r_i$ rounded to $10^{-3}$   $\operatorname{dom} e_{\bar r_i} f_i$
1     $0$                                                 0.000                             $\emptyset$
2     $0$                                                 0.000                             $\emptyset$
3     $0$                                                 0.000                             $\mathbb{R}^2$
4     $\tfrac72\sin 2\hat\phi - 6\cos^2\hat\phi$          1.610                             $\neq \mathbb{R}^2$, $\neq \emptyset$
5     $\tfrac12$                                          0.500                             $\neq \mathbb{R}^2$, $\neq \emptyset$
6     $\tfrac12 + \tfrac12\sqrt5$                         1.618                             $\neq \mathbb{R}^2$, $\neq \emptyset$

Table 4.1: Results of Example 4.5.4

By Table 4.1 and Theorem 4.4.11, $\bar r = \bar r_6$ and $\operatorname{dom} e_{\bar r} f = \operatorname{dom} e_{\bar r} f_6$.

Chapter 5
Most Convex Functions have Unique Minimizers

5.1 Overview

In this chapter, we focus on convex functions that have unique minimizers. The question we aim to answer here is, "How many convex functions have unique minimizers?" Baire category theory is used to find the answer: "Most of them." The word 'most' refers to a generic (Baire category two) set, which is defined in Section 5.3. The main results of this chapter are the following.

• The set of proximal mappings $\{T : \operatorname{Fix} T \text{ is a singleton}\}$ is generic, where $\operatorname{Fix} T$ is the set of fixed points of $T$.
• The set of subdifferentials of convex functions $\{\partial f : \partial f \text{ has a unique zero}\}$ is generic.
• The set of equivalence classes of convex functions $\{\mathcal{F}_f : f \text{ has a unique minimizer}\}$ is generic ($\mathcal{F}_f$ is defined in Section 5.4).

This chapter is based on results found in [178], which is published in Journal of Convex Analysis.

5.2 Functions with unique minimizers

Finding the minimum and the minimizers of convex functions has been of primary concern in convex analysis since its conception. It is well-known that if a convex function has a minimum, then that minimum is global. The minimizers, however, may not be unique. There are certain subclasses, such as strictly convex functions, that do have unique minimizers when the minimum exists, but other subclasses, such as constant functions, that do not. This chapter addresses the question of how many convex functions have unique minimizers. We show, using Baire category theory, that the set of proximal mappings of convex functions that have a unique fixed point is generic. Consequently, the set of classes of convex functions that have unique minimizers is generic.

This chapter builds on the work done in [211], where it was shown that most monotone operators have unique zeros. We are concerned with a similar question here: do most convex functions have unique minimizers, or equivalently do most subdifferential mappings of proper, convex, lsc functions have unique zeros? In fact they do; this is the main result presented in this work. By 'most', we are referring to the idea of a generic set in the sense of Baire category. In terms of proximal mappings, our result means that the set of proximal mappings that do not have unique fixed points is a small, negligible set. To prove this, we construct a complete metric space using the proximal mapping as a component of the distance function, and we use an argument based on super-regular mappings and the density of contraction mappings in the established metric space. Super-regular is a term coined by Reich and Zaslavski [188, page 2]. For a comprehensive study of the generic properties of nonexpansive mappings, see [188]. From the point of view of convex functions, we work with the set of equivalence classes of functions.

5.3 Chapter-specific definitions and facts

The set of fixed points of the operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is denoted by $\operatorname{Fix} T$. For a sequence
\[
x_{k+1} = \underbrace{T \circ T \circ \cdots \circ T}_{k+1 \text{ times}}\, x_0,
\]
where $x_0 \in \mathbb{R}^n$, we write $x_{k+1} = T^{k+1}(x_0)$, and $g_{k+1} = T^{k+1}$.

Definition 5.3.1.
A function $f \in \Gamma_0(\mathbb{R}^n)$ is strongly convex if there exists $\sigma > 0$ such that $f - \tfrac{\sigma}{2}\|\cdot\|^2$ is convex.

Definition 5.3.2. An operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is strongly monotone if there exists $\sigma > 0$ such that
\[
\langle y - x, Ay - Ax \rangle \ge \sigma \|y - x\|^2 \quad \forall x, y \in \operatorname{dom} A.
\]

Definition 5.3.3. An operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is $k$-cyclically monotone if
\[
\left.\begin{array}{l}
(a_i, a_i^*) \in \operatorname{gra} A,\ i \in \{1, 2, \ldots, k\}\\
a_{k+1} = a_1
\end{array}\right\}
\Rightarrow \sum_{i=1}^{k} \langle a_{i+1} - a_i, a_i^* \rangle \le 0.
\]
When this holds for all $k \in \mathbb{N}$, we say that $A$ is cyclically monotone. We call $A$ maximally cyclically monotone if there does not exist a proper extension of $A$ that is cyclically monotone. It is clear that monotone and 2-cyclically monotone are equivalent.

Definition 5.3.4. Let $F \subseteq C(\mathbb{R}^n, Y)$ be nonempty. For a given $x \in \mathbb{R}^n$, the set $F$ is equicontinuous at $x$ if to each $\varepsilon > 0$ there corresponds a neighborhood $U$ of $x$ such that
\[
\rho(f(t), f(x)) < \varepsilon
\]
whenever $t \in U$ and $f \in F$, where $\rho$ is the metric on $Y$. We say that $F$ is equicontinuous on $\mathbb{R}^n$ if $F$ is equicontinuous at each $x \in \mathbb{R}^n$.

Definition 5.3.5. Let $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a mapping, and $g_k(x) = T^k x$. Then $T$ is called super-regular if there exists a unique $x_T \in \mathbb{R}^n$ such that for each $s > 0$, when $k \to \infty$, the sequence $\{g_k\}_{k=1}^\infty$ converges to the constant function $x_T$ uniformly on $B_s(0)$.

Fact 5.3.6. [196, Corollary 3.37] If $f_1, f_2 \in \Gamma_0(\mathbb{R}^n)$ with $P_r f_1 = P_r f_2$ for some $r > 0$, then $f_1 = f_2 + c$, where $c \in \mathbb{R}$ is a constant.

Fact 5.3.7. [196, Proposition 12.19] If $f \in \Gamma_0(\mathbb{R}^n)$, then $P_r f$ is maximally monotone and nonexpansive for all $r > 0$. Hence, $P_r f$ is single-valued.

Fact 5.3.8. [196, Theorem 12.17] A proper, lsc function $f$ is convex (that is, $f \in \Gamma_0(\mathbb{R}^n)$) if and only if $\partial f$ is monotone, in which case $\partial f$ is maximally monotone.

Fact 5.3.9. [196, Theorem 12.25] An operator $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is the subdifferential of some $f \in \Gamma_0(\mathbb{R}^n)$ if and only if $T$ is maximally cyclically monotone. Then $f$ is uniquely determined by $T$, up to a constant.

Fact 5.3.10. [18, Theorem 6.6] Let $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$.
Then $T$ is the resolvent of the maximally cyclically monotone operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ if and only if $T$ has full domain, $T$ is firmly nonexpansive, and for every set of points $\{x_1, x_2, \ldots, x_j\}$ where $j > 1$ and $x_{j+1} = x_1$, one has
\[
\sum_{i=1}^{j} \langle x_i - Tx_i, Tx_i - Tx_{i+1} \rangle \ge 0.
\]

Fact 5.3.11. [21, Proposition 4.2, Corollary 23.8] Let $C \subseteq \mathbb{R}^n$ and $T : C \rightrightarrows \mathbb{R}^n$. Then the following are equivalent:
(i) $T$ is firmly nonexpansive;
(ii) $\|Tx - Ty\|^2 \le \langle x - y, Tx - Ty \rangle$ for all $x, y \in C$;
(iii) $T$ is the resolvent of some maximally monotone operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$.

Fact 5.3.12. [211, Proposition 1.5] Let $M(\mathbb{R}^n)$ be the set of maximally monotone operators on $\mathbb{R}^n$. The following are equivalent:
(i) a sequence of maximally monotone operators $\{A_k\}_{k=1}^\infty \subseteq M(\mathbb{R}^n)$ converges graphically to $A$;
(ii) the sequence $\{J_{A_k}\}_{k=1}^\infty$ converges pointwise to $J_A$ on $\mathbb{R}^n$.

Fact 5.3.13. [188, Theorem 3.1] Let $K$ be a bounded, closed, convex subset of $\mathbb{R}^n$ and $\mathcal{A}$ be the set of all operators $A : K \to K$ such that
\[
\|Ax - Ay\| \le \|x - y\| \quad \text{for all } x, y \in K.
\]
Assume that $B \in \mathcal{A}$ is a contractive mapping, i.e., there exists $\sigma \in [0, 1)$ such that for all $x, y \in \operatorname{dom} B$, $\|By - Bx\| \le \sigma\|y - x\|$. Then there exists $x_B \in K$ such that $B^k x \to x_B$ as $k \to \infty$, uniformly on $K$.

Fact 5.3.14. [208, Theorem 3.143] Suppose that the metric space $Y$ is complete and that $\{f_k\}_{k=1}^\infty$ is an equicontinuous sequence in $C(\mathbb{R}^n, Y)$ that converges at each point of a dense subset $D$ of $\mathbb{R}^n$. Then there is a function $f \in C(\mathbb{R}^n, Y)$ such that $f_k \to f$ uniformly on each compact subset of $\mathbb{R}^n$.

Fact 5.3.15. [200, Theorem 10.11.4] Every complete metric space is a Baire space.

5.4 The complete metric space of subdifferentials

In this section, we establish a complete metric space whose distance function makes use of the proximal mapping. In order to prove completeness, we first state the following lemma.

Lemma 5.4.1. Define $a : [0, \infty) \to \mathbb{R}$, $a(t) = \frac{t}{1+t}$. Then
(i) $a$ is an increasing function and
(ii) $t_1, t_2 \ge 0 \Rightarrow a(t_1 + t_2) \le a(t_1) + a(t_2)$.

Proof. (i) We have $a'(t) = (1 + t)^{-2} > 0$ for all $t \in [0, \infty)$.
Hence, $a$ is increasing everywhere.
(ii) We have
\[
a(t_1 + t_2) = \frac{t_1 + t_2}{1 + (t_1 + t_2)} = \frac{t_1}{1 + (t_1 + t_2)} + \frac{t_2}{1 + (t_1 + t_2)} \le \frac{t_1}{1 + t_1} + \frac{t_2}{1 + t_2} = a(t_1) + a(t_2).
\]

Let $(J, d)$ be a space defined by
\[
J = \{\partial f : f \in \Gamma_0(\mathbb{R}^n)\}, \quad\text{and}\quad
d(\partial f, \partial g) = \sum_{i=1}^{\infty} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}.
\]

Note 5.4.2. There is no ambiguity in using the normally set-valued $P_1 f$ and $P_1 g$ due to Fact 5.3.7.

Note 5.4.3. Without loss of generality, henceforth in this chapter we set the proximal parameter $r = 1$. The results are expandable to any $r > 0$ and are affected by at most a factor of $r$ or $1/r$.

Proposition 5.4.4. The space $(J, d)$ is a complete metric space, hence, a Baire space.

Proof. Items M1-M4 show that $(J, d)$ is a metric space and item C shows that it is complete.

M1: We have $\sum_{i=1}^\infty \frac{1}{2^i} = 1$, and for every $i$,
\[
0 \le \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|} < 1,
\]
so each term of the series defining $d$ is bounded above by $\frac{1}{2^i}$. This gives $0 \le d(\partial f, \partial g) < 1$ for all $\partial f, \partial g \in J$. Hence, $d$ is real-valued, finite, and non-negative.

M2:
\[
\begin{aligned}
d(\partial f, \partial g) = 0
&\Leftrightarrow \sum_{i=1}^{\infty} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|} = 0\\
&\Leftrightarrow \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\| = 0 \quad \forall i\\
&\Leftrightarrow P_1 f(x) - P_1 g(x) = 0 \quad \forall x\\
&\Leftrightarrow f = g + c \quad \text{(Fact 5.3.6)}\\
&\Leftrightarrow \partial f = \partial g.
\end{aligned}
\]
Hence $d(\partial f, \partial g) = 0$ if and only if $\partial f = \partial g$.

M3: $d(\partial f, \partial g) = d(\partial g, \partial f)$ is trivial.

M4: Let $\partial f, \partial g, \partial h \in J$. By the triangle inequality,
\[
\|P_1 f(x) - P_1 g(x)\| \le \|P_1 f(x) - P_1 h(x)\| + \|P_1 h(x) - P_1 g(x)\| \quad \text{for all } x \in \mathbb{R}^n.
\]
This gives
\[
\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\| \le \sup_{\|x\| \le i} \|P_1 f(x) - P_1 h(x)\| + \sup_{\|x\| \le i} \|P_1 h(x) - P_1 g(x)\|.
\]
Set
\[
t_1 = \sup_{\|x\| \le i} \|P_1 f(x) - P_1 h(x)\| \quad\text{and}\quad t_2 = \sup_{\|x\| \le i} \|P_1 h(x) - P_1 g(x)\|.
\]
By Lemma 5.4.1(i), we have
\[
\frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|} \le \frac{t_1 + t_2}{1 + t_1 + t_2}.
\]
Then by Lemma 5.4.1(ii), we have
\[
\frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}
\le \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 h(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 h(x)\|} + \frac{\sup_{\|x\| \le i} \|P_1 h(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 h(x) - P_1 g(x)\|}.
\]
Multiplying by $1/2^i$ and taking the infinite summation over $i$ of both sides, we obtain $d(\partial f, \partial g) \le d(\partial f, \partial h) + d(\partial h, \partial g)$ for all $\partial f, \partial g, \partial h \in J$.

Combining M1-M4, we have that $(J, d)$ is a metric space.

C: Let $\{\partial f_k\}$ be a Cauchy sequence in $(J, d)$. Then for all $\varepsilon > 0$, there exists $N_\varepsilon \in \mathbb{N}$ such that $d(\partial f_j, \partial f_k) < \varepsilon$ for all $j, k \ge N_\varepsilon$. Fix $\varepsilon > 0$. Then there exists $N \in \mathbb{N}$ such that
\[
\sum_{i=1}^{\infty} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f_j(x) - P_1 f_k(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f_j(x) - P_1 f_k(x)\|} < \varepsilon \quad \forall j, k \ge N.
\]
Then for $i$ fixed, we have
\[
\frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f_j(x) - P_1 f_k(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f_j(x) - P_1 f_k(x)\|} < \varepsilon,
\]
so that, for $\varepsilon < 2^{-i}$,
\[
\sup_{\|x\| \le i} \|P_1 f_j(x) - P_1 f_k(x)\| < \frac{2^i \varepsilon}{1 - 2^i \varepsilon}.
\]
Since $\varepsilon > 0$ is arbitrary, we have that $\{P_1 f_k(x)\}_{k=1}^\infty$ is a Cauchy sequence on $\|x\| \le i$, so that $P_1 f_k \xrightarrow{p} h$ for some $h$. By Fact 5.3.8, $\partial f_k$ is maximally monotone for all $k$. Since $P_1 f_k$ is the resolvent of the maximally cyclically monotone operator $\partial f_k$, the domain of $P_1 f_k$ is $\mathbb{R}^n$. By Fact 5.3.11 we have
\[
\|P_1 f_k(x) - P_1 f_k(y)\|^2 \le \langle x - y, P_1 f_k(x) - P_1 f_k(y) \rangle \quad \text{for all } x, y \in \mathbb{R}^n. \tag{5.4.1}
\]
Then letting $k \to \infty$, we have
\[
\|h(x) - h(y)\|^2 \le \langle x - y, h(x) - h(y) \rangle \quad \text{for all } x, y \in \mathbb{R}^n,
\]
so by Fact 5.3.11, $h = J_A$ for a maximally monotone operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$. It remains to be shown that $h = J_{\partial f}$ for some convex function $f \in \Gamma_0(\mathbb{R}^n)$. By Fact 5.3.10 we have
\[
\sum_{i=1}^{j} \langle x_i - P_1 f_k(x_i), P_1 f_k(x_i) - P_1 f_k(x_{i+1}) \rangle \ge 0 \tag{5.4.2}
\]
for all $\{x_1, x_2, \ldots, x_j\}$ with $x_{j+1} = x_1$, for all $j > 1$. Then, letting $k \to \infty$ in (5.4.1) and (5.4.2), we have
\[
\|h(x) - h(y)\|^2 \le \langle x - y, h(x) - h(y) \rangle \tag{5.4.3}
\]
for all $x, y \in \mathbb{R}^n$, and
\[
\sum_{i=1}^{j} \langle x_i - h(x_i), h(x_i) - h(x_{i+1}) \rangle \ge 0 \tag{5.4.4}
\]
for all $\{x_1, x_2, \ldots, x_j\}$ with $x_{j+1} = x_1$, for all $j > 1$. Hence, $h$ is a cyclical resolvent. So by (5.4.4) and Fact 5.3.10, we see that $h = J_A$ is the resolvent of a maximally cyclically monotone operator.
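Thus, $A$ is the subdifferential of a proper, lsc, convex function by Fact 5.3.9. Therefore, $(J, d)$ is closed and is a complete metric space. By Fact 5.3.15, $(J, d)$ is a Baire space.

Remark 5.4.5. By Fact 5.3.12, in $(J, d)$ the convergence is graphical convergence. Thus, the topological space $(J, \Gamma)$, where $\Gamma$ denotes graphical convergence, is metrizable.

We proceed to introduce two closely related metric spaces. For all $f \in \Gamma_0(\mathbb{R}^n)$, define the equivalence classes $\mathcal{F}_f$:
\[
\mathcal{F}_f = \{g \in \Gamma_0(\mathbb{R}^n) : f - g = c, \text{ where } c \in \mathbb{R} \text{ is a constant}\}.
\]
We denote by $\mathcal{F}$ the set of all such equivalence classes:
\[
\mathcal{F} = \{\mathcal{F}_f : f \in \Gamma_0(\mathbb{R}^n)\}.
\]
This forms a partition of $\Gamma_0(\mathbb{R}^n)$. That is, the intersection of any two distinct elements of $\mathcal{F}$ is empty, and $\Gamma_0(\mathbb{R}^n) = \bigcup_{\mathcal{F}_f \in \mathcal{F}} \mathcal{F}_f$. Now considering the equivalence classes $\mathcal{F}_f$ and $\mathcal{F}_g$, define the metric $\tilde d$:
\[
\tilde d(\mathcal{F}_f, \mathcal{F}_g) = \sum_{i=1}^{\infty} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|},
\]
where $f$ and $g$ are arbitrary elements of $\mathcal{F}_f, \mathcal{F}_g \in \mathcal{F}$ respectively. Then $\tilde d$ is a metric on $\mathcal{F}$ by Fact 5.3.6, and
\[
\tilde d(\mathcal{F}_f, \mathcal{F}_g) = d(\partial f, \partial g).
\]
Thus, the following corollary holds.

Corollary 5.4.6. The space $(\mathcal{F}, \tilde d)$ is a complete metric space.

Define $\mathcal{P} = \{P_1 f : f \in \Gamma_0(\mathbb{R}^n)\}$, and
\[
\rho(T_1, T_2) = \sum_{i=1}^{\infty} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|T_1 x - T_2 x\|}{1 + \sup_{\|x\| \le i} \|T_1 x - T_2 x\|},
\]
where $T_1, T_2 \in \mathcal{P}$.

Corollary 5.4.7. The space $(\mathcal{P}, \rho)$ is a complete metric space.

Although $(J, d)$, $(\mathcal{F}, \tilde d)$, $(\mathcal{P}, \rho)$ look different, they are in fact isometric.

Proposition 5.4.8. The three complete metric spaces $(J, d)$, $(\mathcal{F}, \tilde d)$ and $(\mathcal{P}, \rho)$ are isometric.

Proof. Define $\phi : (J, d) \to (\mathcal{P}, \rho)$ by $\phi(\partial f) = P_1 f = (\partial f + \operatorname{Id})^{-1}$. Then $\rho(\phi(\partial f), \phi(\partial g)) = d(\partial f, \partial g)$ for all $\partial f, \partial g \in J$, and $\phi$ is bijective. Therefore, $(J, d)$ and $(\mathcal{P}, \rho)$ are isometric.
Define $\psi : (\mathcal{F}, \tilde d) \to (J, d)$ by $\psi(\mathcal{F}_f) = \partial f$. Then $d(\psi(\mathcal{F}_f), \psi(\mathcal{F}_g)) = \tilde d(\mathcal{F}_f, \mathcal{F}_g)$ for all $\mathcal{F}_f, \mathcal{F}_g \in \mathcal{F}$, and $\psi$ is bijective by Fact 5.3.9.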
Therefore, $(\mathcal{F}, \tilde d)$ and $(J, d)$ are isometric.

5.5 Genericity of the set of convex functions with unique minimizers

In this section, we establish the main result of the chapter: the set of proximal mappings that have a unique fixed point is a generic set. Equivalently, the set of equivalence classes of convex functions that have unique minimizers is a generic set. For all that follows, we use $f$ to represent any function in $\mathcal{F}_f$, as the results are the same for any function in the equivalence class of $f$.

5.5.1 Super-regularity

To start, we need the following results, which give the conditions needed for super-regularity of the proximal mapping and show that the set of contractive proximal mappings is dense in $\mathcal{P}$.

Proposition 5.5.1. Let $f \in \Gamma_0(\mathbb{R}^n)$ be strictly convex and $\operatorname{argmin} f \neq \emptyset$. Then $P_1 f$ is super-regular.

Proof. Since $f$ is strictly convex and $\operatorname{argmin} f \neq \emptyset$, the minimizer is unique: $\operatorname{argmin} f = \{x_T\}$. By the proximal point algorithm [191, Theorem 4], we know that by starting at any arbitrary point $x$ and iteratively calculating $x_k = (P_1 f)^k(x) = g_k(x)$, this generates a sequence of functions $\{g_k(x)\}$ that converges to $x_T$, where $g_k(x) = (P_1 f)^k(x)$. Since $\{g_k\}$ is a collection of nonexpansive mappings, they are 1-Lipschitz and hence equicontinuous on $\mathbb{R}^n$ [196, Theorem 12.32]. By Fact 5.3.14, we have that $(P_1 f)^k(x) \xrightarrow{u} x_T$ on each compact subset of $\mathbb{R}^n$. In particular, $(P_1 f)^k \xrightarrow{u} x_T$ on $B_s(0)$ for all $s > 0$. Hence, $P_1 f$ is super-regular.

Proposition 5.5.1 is a particular case (strictly convex functions), but it offers us a hint at the direction in which to go to obtain a necessary and sufficient condition for super-regular proximal mappings. That condition is found in Theorem 5.5.2 below.

Theorem 5.5.2. Let $f \in \Gamma_0(\mathbb{R}^n)$. Then $P_1 f$ is super-regular if and only if $\operatorname{argmin} f$ is a singleton.

Proof. ($\Rightarrow$) Suppose $P_1 f$ is super-regular. Define $\{g_k(x)\}$ as the iterative sequence $g_k(x) = (P_1 f)^k(x)$ as in Proposition 5.5.1.
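To make the iterates $g_k = (P_1 f)^k$ concrete, here is a minimal sketch under the assumption $f(x) = |x|$ on $\mathbb{R}$ (an illustrative choice, not part of the proof), whose proximal mapping with $r = 1$ is soft-thresholding at level 1. Since $\operatorname{argmin} f = \{0\}$, the code estimates $\sup_{|x| \le s} |g_k(x) - 0|$ on a grid. The function names are hypothetical.

```python
def prox_abs(x):
    """P_1 f for the assumed example f(x) = |x|: shrink x toward 0 by 1."""
    if x > 1.0:
        return x - 1.0
    if x < -1.0:
        return x + 1.0
    return 0.0


def iterate(x, k):
    """g_k(x) = (P_1 f)^k (x): apply the proximal mapping k times."""
    for _ in range(k):
        x = prox_abs(x)
    return x


def sup_error(s, k, n=1000):
    """Grid estimate of sup over |x| <= s of |g_k(x) - x_T|, with x_T = 0."""
    grid = [-s + 2.0 * s * i / n for i in range(n + 1)]
    return max(abs(iterate(x, k)) for x in grid)


if __name__ == "__main__":
    # The sup over the ball of radius 3.5 shrinks by 1 per iteration until it hits 0.
    for k in range(6):
        print(k, sup_error(3.5, k))
```

For this particular $f$, each application of the proximal mapping moves every point one unit closer to the minimizer, so the error on the ball of radius $s$ is $\max\{s - k, 0\}$: the convergence is uniform on each ball, which is the behaviour that super-regularity asks for.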
Then there exists a unique $x_T \in \mathbb{R}^n$ such that $g_k(x) \to x_T$ uniformly on $B_s(0)$ for any $s > 0$. By the proximal point algorithm, we know that $\{g_k(x)\}$ converges to the minimizer of $f$. Therefore, $\operatorname{argmin} f = \{x_T\}$, a singleton.
($\Leftarrow$) Suppose $\operatorname{argmin} f = \{x_T\}$ is a singleton. By the proximal point algorithm, $g_k(x) \to x_T$. In the proof of Proposition 5.5.1, we saw via equicontinuity that this convergence is uniform on each compact subset of $\mathbb{R}^n$, and in particular on $B_s(0)$ for all $s > 0$. Therefore, $P_1 f$ is super-regular.

Remark 5.5.3. The set of super-regular proximal mappings is strictly larger than the set of proximal mappings of strictly convex functions with unique minimizers. For example, the function $f(x) = \|x\|$ has a super-regular proximal mapping, yet $f$ is not strictly convex.

5.5.2 Denseness and Baire category

With super-regularity established, we turn our focus to proving denseness of strongly convex functions and genericity of functions with unique minimizers.

Lemma 5.5.4. In $(J, d)$, the set of strongly monotone mappings is dense. Equivalently, in $(\mathcal{F}, \tilde d)$, the set of strongly convex equivalence classes is dense; in $(\mathcal{P}, \rho)$, the set of contraction mappings is dense.

Proof. We need to show that for every $\varepsilon > 0$ and $T \in \mathcal{P}$, there exists a contraction $\bar T$ such that $\rho(\bar T, T) < \varepsilon$. As $T \in \mathcal{P}$, $T = P_1 f$ for some $f \in \Gamma_0(\mathbb{R}^n)$. Define $\bar T = (1 - \sigma) P_1 f$ for some $\sigma \in (0, 1)$. Then $\bar T$ is a contraction, since $P_1 f$ is nonexpansive. Our first goal is to find a function $g \in \Gamma_0(\mathbb{R}^n)$ such that $\bar T = P_1 g$. We do this by equating $\bar T$ to the resolvent of $\partial g$, and solving for $\partial g$. This follows from:
\[
\begin{aligned}
P_1 g &= (1 - \sigma) P_1 f\\
(\operatorname{Id} + \partial g)^{-1} &= (1 - \sigma)(\operatorname{Id} + \partial f)^{-1}\\
(\operatorname{Id} + \partial g)^{-1} &= \left[(\operatorname{Id} + \partial f) \circ \left(\tfrac{1}{1 - \sigma} \operatorname{Id}\right)\right]^{-1}\\
\operatorname{Id} + \partial g &= (\operatorname{Id} + \partial f) \circ \left(\tfrac{1}{1 - \sigma} \operatorname{Id}\right)\\
\partial g &= \tfrac{\sigma}{1 - \sigma} \operatorname{Id} + \partial f \circ \left(\tfrac{1}{1 - \sigma} \operatorname{Id}\right).
\end{aligned}
\]
From here we see that $\partial g$ is strongly monotone, so that $g \in \Gamma_0(\mathbb{R}^n)$ is strongly convex.
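As a hedged numerical sanity check of the identity $P_1 g = (1 - \sigma) P_1 f$ derived above, the sketch below takes the illustrative choice $f(x) = |x|$ on $\mathbb{R}$ (an assumption made only for this example), for which $P_1 f$ is soft-thresholding. Integrating the last identity for $\partial g$ gives the candidate $g(x) = \frac{\sigma}{1 - \sigma}\frac{x^2}{2} + (1 - \sigma) f\left(\frac{x}{1 - \sigma}\right)$; the code minimizes $y \mapsto g(y) + \frac12 (y - x)^2$ by ternary search and compares the result with $(1 - \sigma) P_1 f(x)$. The helper names `prox_abs`, `g_from_f`, `prox_numeric` and `max_gap` are hypothetical.

```python
import math


def prox_abs(x):
    """P_1 f for the assumed example f(x) = |x| (soft-thresholding at level 1)."""
    return math.copysign(max(abs(x) - 1.0, 0.0), x)


def g_from_f(y, sigma, f=abs):
    """Candidate g(y) = sigma/(1-sigma) * y^2/2 + (1-sigma) * f(y/(1-sigma))."""
    s = 1.0 - sigma
    return sigma / s * y * y / 2.0 + s * f(y / s)


def prox_numeric(func, x, lo=-50.0, hi=50.0, iters=300):
    """Approximate argmin_y func(y) + 0.5*(y - x)^2 by ternary search.

    The objective is strictly convex whenever func is convex, so it is
    unimodal and ternary search on [lo, hi] is applicable.
    """
    def obj(y):
        return func(y) + 0.5 * (y - x) ** 2

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def max_gap(sigma, xs):
    """Largest |P_1 g(x) - (1 - sigma) * P_1 f(x)| over the sample points xs."""
    return max(
        abs(prox_numeric(lambda y: g_from_f(y, sigma), x)
            - (1.0 - sigma) * prox_abs(x))
        for x in xs
    )


if __name__ == "__main__":
    samples = [-5.0, -1.5, -0.3, 0.0, 0.7, 2.0, 4.2]
    for sigma in (0.1, 0.5, 0.9):
        print(sigma, max_gap(sigma, samples))
```

The gaps should be at the level of the search tolerance. This only checks one function on a handful of points, so it illustrates the identity rather than proving it.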
Thus, $\bar T$ is the proximal mapping of the proper, lsc, strongly convex function $g$, where
\[
g(x) = \frac{\sigma}{1 - \sigma}\, \frac{\|x\|^2}{2} + (1 - \sigma) f\left(\frac{x}{1 - \sigma}\right).
\]
For $\varepsilon > 0$, choose $N$ such that $\sum_{i \ge N} \frac{1}{2^i} < \varepsilon/2$. Consider
\[
d(\partial f, \partial g) = \rho(P_1 f, P_1 g) \le \sum_{i=1}^{N} \frac{1}{2^i}\, \frac{\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|}{1 + \sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\|} + \frac{\varepsilon}{2}.
\]
Notice that
\[
\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\| = \sup_{\|x\| \le i} \|(1 - \sigma) P_1 f(x) - P_1 f(x)\| = \sigma \sup_{\|x\| \le i} \|P_1 f(x)\|.
\]
As $P_1 f : \mathbb{R}^n \to \mathbb{R}^n$ is nonexpansive, there exists $M > 0$ such that, for all $i \in \{1, \ldots, N\}$,
\[
\sup_{\|x\| \le i} \|P_1 f(x)\| < M.
\]
Hence,
\[
\sup_{\|x\| \le i} \|P_1 f(x) - P_1 g(x)\| < \sigma M.
\]
Then $\sigma < \varepsilon/(2M)$ yields
\[
\rho(P_1 f, P_1 g) \le \sum_{i=1}^{N} \frac{1}{2^i}\, \frac{\sigma M}{1 + \sigma M} + \frac{\varepsilon}{2}
= \frac{\sigma M}{1 + \sigma M} \sum_{i=1}^{N} \frac{1}{2^i} + \frac{\varepsilon}{2}
\le \frac{\sigma M}{1 + \sigma M} + \frac{\varepsilon}{2}
\le \sigma M + \frac{\varepsilon}{2} < \varepsilon.
\]
Thus, for any $\varepsilon > 0$ one can always choose $\sigma$ small enough so that $d(\partial f, \partial g) < \varepsilon$. Therefore, the set of strongly monotone subdifferential mappings is dense in $(J, d)$. Since $(J, d)$, $(\mathcal{F}, \tilde d)$, and $(\mathcal{P}, \rho)$ are isometric, we have the equivalent conclusions:
(i) in $(\mathcal{F}, \tilde d)$ the set of strongly convex function classes is dense, and
(ii) in $(\mathcal{P}, \rho)$ the set of contraction mappings is dense.

Lemma 5.5.5. In $(\mathcal{P}, \rho)$, let $T \in \mathcal{P}$ be super-regular, $\operatorname{Fix}(T) = \{x_T\}$. Then for every $\varepsilon > 0$ and $s > 0$, there exist $\delta > 0$ and $k_0 > 1$ such that whenever $\tilde T \in \mathcal{P}$ satisfies $\rho(\tilde T, T) < \delta$ and $k \ge k_0$, we have
\[
\|\tilde T^k x - x_T\| < \varepsilon \quad \text{for every } x \in B_s(0).
\]

Proof. Apply [211, Proposition 2.12] with $F = \mathcal{P}$ and $d = \rho$.

Next, we have a lemma that proves that the set of super-regular proximity operators is a generic set in $\mathcal{P}$.

Lemma 5.5.6. In $(\mathcal{P}, \rho)$, there exists a set $G \subset \mathcal{P}$ that is a countable intersection of open, everywhere dense sets in $\mathcal{P}$ such that each $T \in G$ is super-regular. Hence, $G$ is a generic subset of $\mathcal{P}$.

Proof. The proof is similar to that of [211, Proposition 2.14]. Let $C$ be the set of contractive proximal mappings; recall that $C$ is dense in $\mathcal{P}$ by Lemma 5.5.4. By Fact 5.3.13, each $T \in C$ is super-regular. By Lemma 5.5.5, for each $T \in C$ and each $i \in \mathbb{N}$ there exist an open neighborhood $U(T, i)$ of $T$ in $(\mathcal{P}, \rho)$ and a natural number $k(T, i)$ such that
\[
\|\tilde T^k x - x_T\| < \frac{1}{i} \tag{5.5.1}
\]
whenever $\tilde T \in U(T, i)$, $k \ge k(T, i)$, and $x \in B_i(0)$. Define
\[
O_q = \bigcup \{U(T, i) : T \in C,\ i \ge q\}.
\]
Then $C \subset O_q$. Now we define $G = \bigcap_{q=1}^{\infty} O_q$. Since each $U(T, i)$ is open and $C \subset O_q$, each $O_q$ is open and dense in $\mathcal{P}$; as $(\mathcal{P}, \rho)$ is a Baire space, we have that $G$ is dense in $\mathcal{P}$. It remains to show that every $T \in G$ is super-regular. Let $T \in G$ be arbitrary. Then there exist sequences $\{T_q\}_{q=1}^\infty \subseteq C$ and $\{i_q\}_{q=1}^\infty$ with $i_q \ge q$ such that $T \in U(T_q, i_q)$ for each $q \in \mathbb{N}$. By using (5.5.1), we have
\[
\|T^k x - x_{T_q}\| < \frac{1}{i_q} \tag{5.5.2}
\]
whenever $k \ge k(T_q, i_q)$ and $x \in B_{i_q}(0)$. Hence, letting
\[
K = \max\{k(T_q, i_q), k(T_p, i_p)\} \quad\text{and}\quad M = \min\{i_q, i_p\},
\]
we know that
\[
\|x_{T_q} - x_{T_p}\| \le \|x_{T_q} - T^k x\| + \|T^k x - x_{T_p}\| < \frac{1}{i_q} + \frac{1}{i_p}
\]
whenever $k \ge K$ and $\|x\| \le M$. In other words, we have a Cauchy sequence $\{x_{T_q}\}_{q=1}^\infty$ such that $x_{T_q} \to x_T$. With the correct choice of $q$ and $i_q$, we are sure that $B_s(0) \subset B_{i_q}(0)$, and that
\[
\frac{1}{i_q} + \|x_{T_q} - x_T\| < \varepsilon.
\]
Now using that fact together with (5.5.2), we have
\[
\|T^k x - x_T\| \le \|T^k x - x_{T_q}\| + \|x_{T_q} - x_T\| < \frac{1}{i_q} + \|x_{T_q} - x_T\| < \varepsilon
\]
when $k \ge K$ and $\|x\| \le s$. Hence, $T$ is super-regular, and since $T$ is an arbitrary element of $G$, every element of $G$ is super-regular. Therefore, the set of super-regular proximity mappings in $\mathcal{P}$ is a generic set.

With that, we are ready to present the main results.

Theorem 5.5.7. In $(\mathcal{P}, \rho)$, define the set of proximal mappings
\[
S = \{T \in \mathcal{P} : \operatorname{Fix} T \text{ is a singleton}\}.
\]
Then $S$ is generic in $\mathcal{P}$.

Proof. Every super-regular mapping $T$ has $\operatorname{Fix}(T)$ a singleton. Since the set $G$ given in Lemma 5.5.6 satisfies $G \subseteq S$, we have that $S$ is generic in $\mathcal{P}$.

Theorem 5.5.8. On $(J, d)$, define the set of subdifferentials
\[
S = \{\partial f \in J : f \in \Gamma_0(\mathbb{R}^n),\ \partial f \text{ has a unique zero}\}.
\]
Then $S$ is generic in $J$.

Proof. Since every element of $S$ has a unique zero, by Theorem 5.5.2 we have that $P_1 f$ is super-regular for every corresponding $\partial f \in S$. Then by Theorem 5.5.7, the set $\{P_1 f : \partial f \in S\}$ is generic in $\mathcal{P}$.
Since $(J, d)$ and $(\mathcal{P}, \rho)$ are isometric, $S$ is generic in $J$.

Also because $(J, d)$ and $(\mathcal{F}, \tilde d)$ are isometric, we obtain the following theorem.

Theorem 5.5.9. In $(\mathcal{F}, \tilde d)$, define the set of equivalence classes of convex functions
\[
S = \{\mathcal{F}_f : f \in \Gamma_0(\mathbb{R}^n),\ f \text{ has a unique minimizer}\}.
\]
Then $S$ is generic in $\mathcal{F}$.

Chapter 6
Strongly Convex Functions and Functions with Strong Minimizers

6.1 Overview

In this chapter, we study the sets of strongly convex functions and convex functions with strong minima. As in the previous chapter, we wish to answer the question of 'how many' such functions there are, using Baire category methods. Some nice preservation properties and characterizations of the Moreau envelopes of these types of functions are established. The results of this chapter are the following.

• Strong convexity, coercivity and strong minimizers of $f$ are preserved in $e_r f$.
• The set of strongly convex functions, while dense, is meagre.
• The set of convex functions with strong minima is generic.

This chapter is based on results found in [179], which is published in SIAM Journal on Optimization.

6.2 Strong convexity and strong minimizers

For most applications of convex minimization, the assertions that can be made about a class of convex functions are of greater value than those concerning a particular problem. This theoretical analysis is valuable for the insights gained on the behaviour of the entire class of functions. Our main results in this chapter state that the set of all proper lsc convex functions that have strong minimizers is of second category, and that although every strongly convex function has a strong minimizer, the set of strongly convex functions is of first category.

Studying strong minima is important, because numerical methods usually produce asymptotically minimizing sequences and we can assert the convergence of asymptotically minimizing sequences when the function has a strong minimizer. The proximal mapping of a strongly convex function is a contraction, and the proximal point method converges at a linear rate [191]. Strong convexity can also significantly increase the rate of convergence of first-order methods such as projected subgradient descent [132], or more generally the forward-backward algorithm [21, Example 27.12].

As a proper lsc convex function allows the value infinity, we propose to relate the function to its Moreau envelope. Using Moreau envelopes, we define a complete metric for the set of proper lower semicontinuous convex functions in a finite-dimensional space, as we did in the previous chapter using proximal mappings. In this setting, there are many nice properties of the set of Moreau envelopes of proper, lsc, convex functions. This set is proved to be closed and convex. Moreover, as a mapping from the set of proper lsc convex functions to the set of Moreau envelopes of convex functions, the Moreau envelope mapping is bijective. We provide a detailed analysis of functions with strong minima, strongly convex functions and their Moreau envelopes.

A comparison to literature is in order. In [211], Baire category theory was used to show that most (i.e. a generic set) maximally monotone operators have a unique zero. In [178] (Chapter 5 of the present work) a similar track was taken, but it uses the perspective of proximal mappings in particular, ultimately proving that most classes of convex functions have a unique minimizer. The technique of this chapter differs in that it is based on functions. We use Moreau envelopes of convex functions, strong minimizers and strongly convex functions rather than subdifferentials. While Beer and Lucchetti obtained a similar result on generic well-posedness of convex optimization, their approach relies on epigraphs of convex functions [31, 33, 34]. We hope that our Moreau envelope approach is more accessible and natural to practical optimizers.
In [67], the definition of generic Tikhonov well-posedness of convex problems is given. There, it is assumed that either the convex functions are finite-valued and the set of convex functions is equipped with uniform convergence on bounded sets, or the convex functions are all continuous on the whole space. See also [206] for the generic nature of constrained optimization problems, and [31, 67, 143, 189] for well-posedness in optimization. For comprehensive generic results on fixed points of nonexpansive mappings and firmly nonexpansive mappings, we refer the reader to [188].

6.3 Chapter-specific definitions and facts

In this chapter as in the previous one, without loss of generality we use proximal parameter $r = 1$. The theory developed here is equally applicable with any other choice of $r > 0$. The symbol $G_\delta$ is used to indicate a countable intersection of open sets.

6.3.1 Strong convexity and coercivity

Definition 6.3.1. A function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is said to attain a strong minimum at $\bar x \in \mathbb{R}^n$ if
(i) $f(\bar x) \le f(x)$ for all $x \in \operatorname{dom} f$ and
(ii) $f(x_k) \to f(\bar x)$ implies $x_k \to \bar x$.

In existing literature, a function $f$ having a strong minimizer is also known as $(\mathbb{R}^n, f)$ being well-posed in the sense of Tikhonov, as defined by Dontchev and Zolezzi in [67, Chapter I] and used in [31, 49, 66, 123]. [67, Theorem 12] shows that $f$ has a strong minimizer $\bar x$ if and only if there exists a forcing function $c$ such that
\[
f(x) \ge f(\bar x) + c(\operatorname{dist}(x, \bar x)),
\]
where $c : D \to [0, \infty)$ with $0 \in D \subseteq [0, \infty)$, $c(0) = 0$, and $c(a_n) \to 0 \Rightarrow a_n \to 0$. When $f$ is convex, the forcing function $c$ can be chosen convex [143, Proposition 10.1.9]. Chapter III of [67] also contains several results on generic well-posedness. For further information on strong minimizers, we refer readers to [41, 65, 143].

Definition 6.3.2. Following Rockafellar and Wets (see [196, p. 90]), we will call a function $f \in \Gamma_0(\mathbb{R}^n)$ level-coercive if
\[
\liminf_{\|x\| \to \infty} \frac{f(x)}{\|x\|} > 0,
\]
and coercive if
\[
\liminf_{\|x\| \to \infty} \frac{f(x)}{\|x\|} = \infty.
\]

Definition 6.3.3.
A function $f \in \Gamma_0(\mathbb{R}^n)$ is $\sigma$-strongly convex if there exists a constant $\sigma > 0$ such that $f - \tfrac{\sigma}{2}\|\cdot\|^2$ is convex. Equivalently, $f$ is $\sigma$-strongly convex if there exists $\sigma > 0$ such that for all $\lambda \in (0, 1)$ and for all $x, y \in \mathbb{R}^n$,
\[
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) - \tfrac{\sigma}{2} \lambda (1 - \lambda) \|x - y\|^2.
\]

Fact 6.3.4. [196, Theorem 11.8] Let $f \in \Gamma_0(\mathbb{R}^n)$. Then
(i) $f$ is level-coercive if and only if $0 \in \operatorname{int} \operatorname{dom} f^*$, and
(ii) $f$ is coercive if and only if $\operatorname{dom} f^* = \mathbb{R}^n$.

6.3.2 Baire category

Fact 6.3.5. ([217, Theorem 1.47] or [21, Corollary 1.44]) Let $(X, d)$ be a complete metric space. Then any countable intersection of dense open subsets of $X$ is dense.

Fact 6.3.6. [130, Example 1.3-7] The finite-dimensional space $\mathbb{R}^n$ is separable. That is, $\mathbb{R}^n$ has a countable subset that is dense in $\mathbb{R}^n$.

6.3.3 Monotonicity

Fact 6.3.7. [9, Theorem 2.51] For any maximally monotone $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$, the set $\operatorname{dom} T$ is almost convex. That is, there exists a convex set $C \subseteq \mathbb{R}^n$ such that $C \subseteq \operatorname{dom} T \subseteq \overline{C}$. The same applies to the set $\operatorname{ran} T$.

6.3.4 Differentiability, conjugation and epiconvergence

Fact 6.3.8. [193, Corollary 23.5.1] If $f \in \Gamma_0(\mathbb{R}^n)$, then $\partial f^*$ is the inverse of $\partial f$ in the sense of multivalued mappings, i.e. $x \in \partial f^*(x^*)$ if and only if $x^* \in \partial f(x)$.

Fact 6.3.9. [196, Theorem 2.26] Let $f \in \Gamma_0(\mathbb{R}^n)$. Then $e_1 f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable on $\mathbb{R}^n$, and its gradient
\[
\nabla e_1 f = \operatorname{Id} - P_1 f
\]
is 1-Lipschitz continuous, i.e., nonexpansive.

Fact 6.3.10. [196, Theorem 7.37] Let $\{f_\gamma\}_{\gamma=1}^\infty \subseteq \Gamma_0(\mathbb{R}^n)$, $f \in \Gamma_0(\mathbb{R}^n)$. Then $f_\gamma \xrightarrow{e} f$ if and only if $e_1 f_\gamma \xrightarrow{p} e_1 f$. Moreover, the pointwise convergence of $e_1 f_\gamma$ to $e_1 f$ is uniform on all bounded subsets of $\mathbb{R}^n$, hence yields epiconvergence to $e_1 f$ as well.

Remark 6.3.11. (1) Fact 6.3.10 would not be valid as it stands in infinite dimensions. A uniform limit on bounded sets of Moreau envelopes will remain a Moreau envelope, but this can fail to be true when considering pointwise convergence of Moreau envelopes, see, e.g., [7, Remark 2.71].
(2) When $X$ is a Banach space, we refer to Beer [31, p.
263]: one has that for $\{f_\gamma\}_{\gamma=1}^\infty \subseteq \Gamma_0(X)$, $f_\gamma \xrightarrow{aw} f_0$ if and only if $e_1 f_\gamma \to e_1 f_0$ uniformly on bounded subsets of $X$. When $X$ is infinite dimensional, Attouch-Wets convergence is a stronger concept than epiconvergence, see, e.g., [31, p. 235] or [40, Theorem 6.2.14].

Lemma 6.3.12. [193, Theorem 31.5] Let $f \in \Gamma_0(\mathbb{R}^n)$. Then for all $x \in \mathbb{R}^n$,
\[
e_1 f(x) + e_1 f^*(x) = \tfrac12 \|x\|^2.
\]

6.4 The Moreau envelope of coercive functions and strongly convex functions

In this section, we present characterizations of the Moreau envelope involving strong convexity and coercivity. Counterexamples of functions that do not comply with the necessary conditions are provided for illustration.

Lemma 6.4.1. The function $f \in \Gamma_0(\mathbb{R}^n)$ is $\sigma$-strongly convex if and only if $e_1 f$ is $\frac{\sigma}{\sigma+1}$-strongly convex.

Proof. By [196, Proposition 12.60], $f$ is $\sigma$-strongly convex if and only if $\nabla f^*$ is $\frac{1}{\sigma}$-Lipschitz for some $\sigma > 0$. Now
\[
(e_1 f)^* = f^* + \tfrac12 \|\cdot\|^2, \quad\text{and}\quad \nabla (e_1 f)^* = \nabla f^* + \operatorname{Id}.
\]
Suppose that $f$ is $\sigma$-strongly convex. Since $\nabla f^*$ is $\frac{1}{\sigma}$-Lipschitz, we have that $\nabla f^* + \operatorname{Id}$ is $(1 + \frac{1}{\sigma})$-Lipschitz. Hence, $\nabla (e_1 f)^*$ is $\frac{\sigma+1}{\sigma}$-Lipschitz. Then $e_1 f$ is $\frac{\sigma}{\sigma+1}$-strongly convex, and we have proved one direction of the lemma. Working backwards with the same argument, the other direction is proved as well.

Lemma 6.4.2. Let $f \in \Gamma_0(\mathbb{R}^n)$. Then
(i) $f$ is level-coercive if and only if $e_1 f$ is level-coercive, and
(ii) $f$ is coercive if and only if $e_1 f$ is coercive.

Proof. Since $(e_1 f)^* = f^* + \tfrac12 \|\cdot\|^2$, we have $\operatorname{dom} (e_1 f)^* = \operatorname{dom} f^*$. The result follows from Fact 6.3.4.

Lemma 6.4.3. Let $f \in \Gamma_0(\mathbb{R}^n)$ be strongly convex. Then $f$ is coercive.

Proof. Since $f$ is strongly convex, $f = g + \tfrac{\sigma}{2}\|\cdot\|^2$ for some $g \in \Gamma_0(\mathbb{R}^n)$ and $\sigma > 0$. Since $g$ is convex, $g$ is bounded below by an affine function. That is, there exist $\tilde x \in \mathbb{R}^n$ and $r \in \mathbb{R}$ such that
\[
g(x) \ge \langle \tilde x, x \rangle + r \quad \text{for all } x \in \mathbb{R}^n.
\]
Hence,
\[
f(x) \ge \langle \tilde x, x \rangle + r + \tfrac{\sigma}{2}\|x\|^2 \quad \text{for all } x \in \mathbb{R}^n.
\]
This gives us that
\[
\liminf_{\|x\| \to \infty} \frac{f(x)}{\|x\|} = \infty.
\]

Lemma 6.4.4.
Let $f \in \Gamma_0(\mathbb{R}^n)$ be strongly convex. Then the (unique) minimizer of $f$ is a strong minimizer.

Proof. Let $f(x_k) \to \inf_x f(x)$. Since $f$ is coercive by Lemma 6.4.3, $\{x_k\}_{k=1}^\infty$ is bounded. By the Bolzano-Weierstrass Theorem, $\{x_k\}_{k=1}^\infty$ has a convergent subsequence $x_{k_j} \to \bar x$. Since $f$ is lsc, we have that $\liminf_{j \to \infty} f(x_{k_j}) \ge f(\bar x)$. Hence,
\[
\inf_x f(x) \le f(\bar x) \le \inf_x f(x).
\]
Therefore, $f(\bar x) = \inf_x f(x)$. Since strong convexity implies strict convexity, we have that $\operatorname{argmin} f = \{\bar x\}$ is unique. As every convergent subsequence of the bounded sequence $\{x_k\}_{k=1}^\infty$ converges to the same limit $\bar x$, we conclude that $x_k \to \bar x$.

Remark 6.4.5. See [217, Proposition 3.5.8] for a stronger and more general result regarding Lemmas 6.4.3 and 6.4.4. In view of [217, Corollary 3.5.11(iii)], when $f$ is strongly convex with strong minimizer $\bar x$, taking $(\bar x, 0) \in \operatorname{gra} \partial f$ one has the existence of $c > 0$ such that
\[
f(x) \ge f(\bar x) + \frac{c}{2}\|x - \bar x\|^2.
\]
From this, we have
\[
\|x_k - \bar x\| \le \sqrt{\frac{2}{c}\left[f(x_k) - f(\bar x)\right]} \quad \text{for all } k \in \mathbb{N}.
\]
This ensures a bound on the rate of convergence of the minimizing sequence to the minimizer in terms of function values.

Note that a convex function can be coercive, but fail to be strongly convex. Consider the following examples.

Example 6.4.6. Let $f : \mathbb{R} \to \mathbb{R}$, $f(x) = x^4$. The function $f$ is coercive and attains a strong minimum at $\bar x = 0$, but is not strongly convex.

Proof. It is clear that $f$ is coercive. By definition, $f$ is strongly convex if and only if there exists $\sigma > 0$ such that $g(x) = x^4 - \tfrac{\sigma}{2}x^2$ is convex. Since $g$ is a twice differentiable, univariate function, we know it is convex if and only if its second derivative is nonnegative for all $x \in \mathbb{R}$. Since $g''(x) = 12x^2 - \sigma$ is negative for $|x| < \sqrt{\sigma/12}$, whatever the fixed value $\sigma > 0$, we have that $g$ is not convex. Therefore, $f$ is not strongly convex. Clearly, zero is the minimum and minimizer of $f$. Let $\{x_k\}_{k=1}^\infty \subseteq \mathbb{R}$ be such that $f(x_k) \to f(0) = 0$. Then $\lim_{k \to \infty} x_k^4 = 0$ implies $\lim_{k \to \infty} x_k = 0$. Therefore, $f$ attains a strong minimum.

Example 6.4.7.
Let $f : \mathbb{R} \to \mathbb{R}$, $f(x) = |x|^p$, where $p > 1$. Then $f$ is coercive for all such $p$, but $f$ is strongly convex if and only if $p = 2$.

Proof. It is elementary to show that $f$ is coercive. Define $g(x) = f(x) - \frac{\sigma}{2}x^2$ for some $\sigma > 0$. Then $g''(x) = p(p-1)|x|^{p-2} - \sigma$ for $x \neq 0$. In order to conclude that $f$ is strongly convex, we must have
\[ g''(x) \geq 0 \quad\text{for all } x \in \mathbb{R}. \tag{6.4.1} \]
However, if $p > 2$ then (6.4.1) fails for $|x|$ small, specifically for
\[ |x| < \left( \frac{\sigma}{p(p-1)} \right)^{\frac{1}{p-2}}, \]
and if $1 < p < 2$ then (6.4.1) fails for $|x|$ large. Only for $p = 2$ can one choose $\sigma$ such that (6.4.1) is true for all $x$.

6.5 A complete metric space using Moreau envelopes

The principal tool we use is Baire category theory. To this end, we need a Baire space. In this section, we establish a complete metric space whose distance function makes use of the Moreau envelope. This metric has been used by Attouch and Wets in [9, page 38]. The distances used in the next section refer to the metric established here. The proof that it is a complete metric space (Proposition 6.5.4) is omitted, as it is essentially the proof of Proposition 5.4.4 in the previous chapter.

We begin with some properties of the Moreau envelope. Set
\[ e_1(\Gamma_0(\mathbb{R}^n)) = \{ e_1 f : f \in \Gamma_0(\mathbb{R}^n) \}. \]

Theorem 6.5.1. The set $e_1(\Gamma_0(\mathbb{R}^n))$ is a convex set in $\Gamma_0(\mathbb{R}^n)$.

Proof. Let $f_1, f_2 \in \Gamma_0(\mathbb{R}^n)$, $\lambda \in [0, 1]$. Then $e_1 f_1, e_1 f_2 \in e_1(\Gamma_0(\mathbb{R}^n))$. We need to show that $\lambda e_1 f_1 + (1 - \lambda) e_1 f_2 \in e_1(\Gamma_0(\mathbb{R}^n))$. By [23, Theorem 6.2] with $\mu = 1$ and $n = 2$, we have that $\lambda e_1 f_1 + (1 - \lambda) e_1 f_2$ is the Moreau envelope of the proximal average function $P_1(f_1, f_2, \lambda)$. By [23, Corollary 5.2], we have that $P_1(f_1, f_2, \lambda) \in \Gamma_0(\mathbb{R}^n)$. Hence, $e_1 P_1(f_1, f_2, \lambda) \in e_1(\Gamma_0(\mathbb{R}^n))$, and we conclude that $e_1(\Gamma_0(\mathbb{R}^n))$ is a convex set.

On $e_1(\Gamma_0(\mathbb{R}^n))$, define a metric by
\[ \tilde{d}(\tilde{f}, \tilde{g}) = \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\|\tilde{f} - \tilde{g}\|_i}{1 + \|\tilde{f} - \tilde{g}\|_i}, \tag{6.5.1} \]
where $\|\tilde{f} - \tilde{g}\|_i = \sup_{\|x\| \leq i} |\tilde{f}(x) - \tilde{g}(x)|$ and $\tilde{f}, \tilde{g} \in e_1(\Gamma_0(\mathbb{R}^n))$.
Note that a sequence of functions in $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ converges if and only if the sequence converges uniformly on bounded sets, if and only if the sequence converges pointwise on $\mathbb{R}^n$. The latter fails in infinite-dimensional space.

Theorem 6.5.2. The metric space $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ is complete.

Proof. For $f_k \in \Gamma_0(\mathbb{R}^n)$, $k \in \mathbb{N}$, consider a Cauchy sequence
\[ \{e_1 f_k\}_{k \in \mathbb{N}} \subseteq (e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d}). \]
Since $f_k \in \Gamma_0(\mathbb{R}^n)$, by Fact 6.3.9, $e_1 f_k$ is continuous and differentiable on $\mathbb{R}^n$. Then $e_1 f_k \xrightarrow{p} g$ where $g : \mathbb{R}^n \to \mathbb{R}$ is continuous and convex. Our objective is to prove that $g$ is in fact the Moreau envelope of a proper, lsc, convex function. Since $(e_1 f_k)_{k \in \mathbb{N}}$ and $g$ are convex and full-domain, by [196, Theorem 7.17] we have that $e_1 f_k \xrightarrow{e} g$, and $e_1 f_k \xrightarrow{u} g$ on bounded sets of $\mathbb{R}^n$. By [196, Theorem 11.34], we have that $(e_1 f_k)^* \xrightarrow{e} g^*$, that is, $f_k^* + \tfrac{1}{2}\|\cdot\|^2 \xrightarrow{e} g^*$ and $g^*$ is proper, lsc, and convex. Hence, $f_k^* \xrightarrow{e} g^* - \tfrac{1}{2}\|\cdot\|^2$ by [196, Exercise 7.8(a)]. As $\{f_k^*\}_{k \in \mathbb{N}}$ is a sequence of convex functions, by [196, Theorem 7.17] we have that $g^* - \tfrac{1}{2}\|\cdot\|^2$ is convex. Since $g^* - \tfrac{1}{2}\|\cdot\|^2$ is proper, lsc, and convex, there exists $h \in \Gamma_0(\mathbb{R}^n)$ such that $g^* - \tfrac{1}{2}\|\cdot\|^2 = h^*$. Applying [196, Theorem 11.34] again, we obtain $f_k \xrightarrow{e} h$. Finally, using Fact 6.3.10 or [196, Theorem 7.37] we see that $e_1 f_k \xrightarrow{p} e_1 h$, and $e_1 f_k \xrightarrow{u} e_1 h$ on bounded subsets of $\mathbb{R}^n$ as well. Therefore, $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ is complete.$^1$

In view of Fact 6.3.10, we give the definition of the Attouch-Wets metric on $\Gamma_0(\mathbb{R}^n)$ as follows.

Definition 6.5.3 (Attouch-Wets metric). For $f, g \in \Gamma_0(\mathbb{R}^n)$, define the distance function $d$:
\[ d(f, g) = \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\|e_1 f - e_1 g\|_i}{1 + \|e_1 f - e_1 g\|_i}. \]
In $\Gamma_0(\mathbb{R}^n)$, there are other metrics that induce the same topology as the Attouch-Wets metric.

Proposition 6.5.4.
The space $(\Gamma_0(\mathbb{R}^n), d)$, where $d$ is the metric defined in Definition 6.5.3, is a complete metric space.

$^1$ A referee points out that this can also be obtained by proving that the $\sigma$-strong convexity property is preserved in the limit, using the characterization of strong convexity in terms of the subdifferentials, and the corresponding graphical convergence of the subdifferentials.

Proof. See the proof of Proposition 5.4.4.

Remark 6.5.5. (1). This result can also be reached via [8, Theorem 2.1], using the $\rho$-Hausdorff distance for epigraphs; this result is also mentioned as an exercise in [31, p. 241]. We refer the reader to [31, Chapter 7] for more details on the Attouch-Wets topology for convex functions.
(2). When $X$ is infinite dimensional, on the set of proper lower semicontinuous convex functions $\Gamma_0(X)$, a variety of different topologies emerge, such as Kuratowski-Painlevé convergence, Attouch-Wets convergence, Mosco convergence, Choquet-Wijsman convergence, etc. See Beer [31], Lucchetti [143], Borwein and Vanderwerff [40], Attouch [7]. However, when $X$ is finite dimensional, all these convergences coincide [40, Theorem 6.2.13].

On the set of Fenchel conjugates
\[ (\Gamma_0(\mathbb{R}^n))^* = \{ f^* : f \in \Gamma_0(\mathbb{R}^n) \} \]
define a metric by $\hat{d}(f^*, g^*) = d(f^*, g^*)$ for $f^*, g^* \in (\Gamma_0(\mathbb{R}^n))^*$. Observe that $\Gamma_0(\mathbb{R}^n) = (\Gamma_0(\mathbb{R}^n))^*$.

Corollary 6.5.6. Consider the two metric spaces $(\Gamma_0(\mathbb{R}^n), d)$ and $((\Gamma_0(\mathbb{R}^n))^*, \hat{d})$. Define
\[ T : (\Gamma_0(\mathbb{R}^n), d) \to ((\Gamma_0(\mathbb{R}^n))^*, \hat{d}) : f \mapsto f^*. \]
Then $T$ is a bijective isometry. Consequently, $(\Gamma_0(\mathbb{R}^n), d)$ and $((\Gamma_0(\mathbb{R}^n))^*, \hat{d})$ are isometric.

Proof. Clearly $T$ is onto. Also, $T$ is injective because of the Fenchel-Moreau Theorem [21, Theorem 13.32] or [193, Corollary 12.2.1]. To see this, let $Tf = Tg$. Then $f^* = g^*$, so $f = (f^*)^* = (g^*)^* = g$. It remains to show that $T$ is an isometry, that is, for all $f, g \in \Gamma_0(\mathbb{R}^n)$, $d(f, g) = d(f^*, g^*) = \hat{d}(Tf, Tg)$.
To see this, using Lemma 6.3.12 we have
\[
\begin{aligned}
d(f^*, g^*) &= \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\sup_{\|x\| \leq i} |e_1 f^*(x) - e_1 g^*(x)|}{1 + \sup_{\|x\| \leq i} |e_1 f^*(x) - e_1 g^*(x)|} \\
&= \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\sup_{\|x\| \leq i} \left| \tfrac{1}{2}\|x\|^2 - e_1 f(x) - \tfrac{1}{2}\|x\|^2 + e_1 g(x) \right|}{1 + \sup_{\|x\| \leq i} \left| \tfrac{1}{2}\|x\|^2 - e_1 f(x) - \tfrac{1}{2}\|x\|^2 + e_1 g(x) \right|} \\
&= \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\sup_{\|x\| \leq i} |e_1 g(x) - e_1 f(x)|}{1 + \sup_{\|x\| \leq i} |e_1 g(x) - e_1 f(x)|} = d(f, g).
\end{aligned}
\]

By Theorem 6.5.2, $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ is a complete metric space.

Corollary 6.5.7. Consider the two metric spaces $(\Gamma_0(\mathbb{R}^n), d)$ and $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$. Define
\[ T : \Gamma_0(\mathbb{R}^n) \to e_1(\Gamma_0(\mathbb{R}^n)) : f \mapsto e_1 f. \]
Then $T$ is a bijective isometry, so $(\Gamma_0(\mathbb{R}^n), d)$ and $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ are isometric.

6.6 Baire category results

This section is devoted to the main work of the chapter. Ultimately, we show that the set of strongly convex functions is a meagre (Baire category one) set, while the set of convex functions that attain a strong minimum is a generic (Baire category two) set.

6.6.1 Characterizations of the strong minimizer

The first proposition describes the relation between a function and its Moreau envelope, pertaining to the strong minimum. Several more results regarding strong minima follow.

Proposition 6.6.1. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$. Then $f$ attains a strong minimum at $\bar{x}$ if and only if $e_1 f$ attains a strong minimum at $\bar{x}$.

Proof. ($\Rightarrow$) Assume that $f$ attains a strong minimum at $\bar{x}$. Then
\[ \min_x f(x) = \min_x e_1 f(x) = f(\bar{x}) = e_1 f(\bar{x}). \]
Let $\{x_k\}$ be such that $e_1 f(x_k) \to e_1 f(\bar{x})$. We need to show that $x_k \to \bar{x}$. Since
\[ e_1 f(x_k) = f(v_k) + \tfrac{1}{2}\|v_k - x_k\|^2 \]
for some $v_k$, and $f(v_k) \geq f(\bar{x})$, we have
\[ 0 \leq \tfrac{1}{2}\|x_k - v_k\|^2 + f(v_k) - f(\bar{x}) = e_1 f(x_k) - e_1 f(\bar{x}) \to 0. \tag{6.6.1} \]
Since both $\tfrac{1}{2}\|x_k - v_k\|^2 \geq 0$ and $f(v_k) - f(\bar{x}) \geq 0$, (6.6.1) tells us that $x_k - v_k \to 0$ and $f(v_k) \to f(\bar{x})$. Since $\bar{x}$ is the strong minimizer of $f$, we have $v_k \to \bar{x}$. Therefore, $x_k \to \bar{x}$, and $e_1 f$ attains a strong minimum at $\bar{x}$.

($\Leftarrow$) Assume that $e_1 f$ attains a strong minimum at $\bar{x}$, $e_1 f(\bar{x}) = \min e_1 f$. Then $e_1 f(x_k) \to e_1 f(\bar{x})$ implies that $x_k \to \bar{x}$. Let $f(x_k) \to f(\bar{x})$.
We have
\[ f(\bar{x}) \leq e_1 f(\bar{x}) \leq e_1 f(x_k) \leq f(x_k). \]
Since $f(x_k) \to f(\bar{x})$, we obtain
\[ e_1 f(x_k) \to f(\bar{x}) = e_1 f(\bar{x}). \]
Therefore, $x_k \to \bar{x}$, and $f$ attains a strong minimum at $\bar{x}$.

Proposition 6.6.2. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ have a strong minimizer $\bar{x}$. Then for all $m \in \mathbb{N}$,
\[ \inf_{\|x - \bar{x}\| \geq \frac{1}{m}} f(x) > f(\bar{x}). \]

Proof. This is clear by the definition of strong minimizer. Indeed, suppose that there exists $m \in \mathbb{N}$ such that $\inf_{\|x - \bar{x}\| \geq \frac{1}{m}} f(x) = f(\bar{x})$. Then there exists a sequence $\{x_k\}_{k=1}^{\infty}$ with $\|x_k - \bar{x}\| \geq \frac{1}{m}$ and $\lim_{k \to \infty} f(x_k) = f(\bar{x})$. Since $\bar{x}$ is the strong minimizer of $f$, we have $x_k \to \bar{x}$, a contradiction.

Corollary 6.6.3. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ have a strong minimizer $\bar{x}$. Then for all $m \in \mathbb{N}$,
\[ \inf_{\|x - \bar{x}\| \geq \frac{1}{m}} e_1 f(x) > e_1 f(\bar{x}). \]

Proof. This follows directly from Propositions 6.6.1 and 6.6.2.

Theorem 6.6.4. Let $f \in \Gamma_0(\mathbb{R}^n)$. Then $f$ has a strong minimizer if and only if $f$ has a unique minimizer.

Proof. ($\Rightarrow$) By definition, if $f$ has a strong minimizer, then that minimizer is unique.
($\Leftarrow$) Suppose $f$ has a unique minimizer $\bar{x}$. Because $f \in \Gamma_0(\mathbb{R}^n)$, by [193, Theorem 8.7], all sublevel sets $\{x : f(x) \leq \alpha\}$, for any $\alpha \geq f(\bar{x})$, have the same recession cone. Since the recession cone of $\{x : f(x) \leq f(\bar{x})\} = \{\bar{x}\}$ is $\{0\}$, each sublevel set of $f$ is bounded. Since $f$ has a unique minimizer $\bar{x}$ and all sublevel sets of $f$ are bounded, we have that $\bar{x}$ is in fact a strong minimizer. Indeed, this follows by applying [40, Fact 4.4.8], [40, Theorem 4.4.10], and [40, Theorem 5.23(e)(c)] in $\mathbb{R}^n$ because $\partial f^*(0) = \operatorname{argmin} f = \{\bar{x}\}$.

Remark 6.6.5. See also [40, Exercise 5.2.1 p. 234].

Example 6.6.6. Theorem 6.6.4 can fail when the function is nonconvex. Consider the continuous but nonconvex function $f : \mathbb{R} \to \mathbb{R}$, $f(x) = \frac{x^2}{x^4 + 1}$. The function has a unique minimizer $\bar{x} = 0$, but the minimizer is not strong, as any sequence $\{x_k\}$ that tends to $\infty$ or $-\infty$ gives a sequence of function values that tends to $f(\bar{x})$.

Example 6.6.7. Theorem 6.6.4 can also fail in infinite-dimensional space.
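Consider the continuous convex function
\[ f : \ell^2 \to \mathbb{R} \cup \{\infty\} : x \mapsto f(x) = \sum_{k=1}^{\infty} \frac{1}{k} x_k^2. \]
This function has a unique minimizer $\bar{x} = 0$, but $\bar{x}$ is not a strong minimizer because $f(e_k) = \frac{1}{k} \to 0$ and $e_k \not\to 0$. Here, $e_k = (0, \ldots, 0, 1, 0, \ldots)$, where the 1 is in the $k$th position. See also [31, p. 268, Exercise 2].

Using Proposition 6.6.2 and Corollary 6.6.3, we can now single out two sets in $\Gamma_0(\mathbb{R}^n)$ that are very important for our later proofs.

Definition 6.6.8. For any $m \in \mathbb{N}$, define the sets $U_m$ and $E_m$ as follows:
\[
\begin{aligned}
U_m &= \left\{ f \in \Gamma_0(\mathbb{R}^n) : \exists z \in \mathbb{R}^n \text{ such that } \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0 \right\}, \\
E_m &= \left\{ f \in \Gamma_0(\mathbb{R}^n) : \exists z \in \mathbb{R}^n \text{ such that } \inf_{\|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) > 0 \right\}.
\end{aligned}
\]

Proposition 6.6.9. Let $f \in \bigcap_{m \in \mathbb{N}} U_m$. Then $f$ attains a strong minimum on $\mathbb{R}^n$.

Proof. The proof follows the method of [65, Theorem II.1]. Since $f \in \bigcap_{m \in \mathbb{N}} U_m$, we have that for each $m \in \mathbb{N}$ there exists $z_m \in \mathbb{R}^n$ such that
\[ f(z_m) < \inf_{\|x - z_m\| \geq \frac{1}{m}} f(x). \]
Suppose that $\|z_p - z_m\| \geq \frac{1}{m}$ for some $p > m$. By the definition of $z_m$, we have
\[ f(z_p) > f(z_m). \tag{6.6.2} \]
Since $\|z_m - z_p\| \geq \frac{1}{m} > \frac{1}{p}$, we have
\[ f(z_m) > f(z_p) \]
by the definition of $z_p$. This contradicts (6.6.2). Thus, $\|z_p - z_m\| < \frac{1}{m}$ for each $p > m$. This gives us that $\{z_m\}_{m=1}^{\infty}$ is a Cauchy sequence that converges to some $\bar{x} \in \mathbb{R}^n$. It remains to be shown that $\bar{x}$ is the strong minimizer of $f$. Since $f$ is lsc, we have
\[
\begin{aligned}
f(\bar{x}) &\leq \liminf_{m \to \infty} f(z_m) \\
&\leq \liminf_{m \to \infty} \left( \inf_{\|x - z_m\| \geq \frac{1}{m}} f(x) \right) \\
&\leq \inf_{x \in \mathbb{R}^n \setminus \{\bar{x}\}} f(x).
\end{aligned}
\]
Let $\{y_k\}_{k=1}^{\infty} \subseteq \mathbb{R}^n$ be such that $f(y_k) \to f(\bar{x})$, and suppose that $y_k \not\to \bar{x}$. Dropping to a subsequence if necessary, there exists $\varepsilon > 0$ such that $\|y_k - \bar{x}\| \geq \varepsilon$ for all $k$. Thus, there exists $p \in \mathbb{N}$ such that $\|y_k - z_p\| \geq \frac{1}{p}$ for all $k \in \mathbb{N}$. Hence,
\[ f(\bar{x}) \leq f(z_p) < \inf_{\|x - z_p\| \geq \frac{1}{p}} f(x) \leq f(y_k) \]
for all $k \in \mathbb{N}$, a contradiction to the fact that $f(y_k) \to f(\bar{x})$. Therefore, $\bar{x}$ is the strong minimizer of $f$.

Theorem 6.6.10. Let $f \in \bigcap_{m \in \mathbb{N}} E_m$. Then $e_1 f$ attains a strong minimum on $\mathbb{R}^n$, so $f$ attains a strong minimum on $\mathbb{R}^n$.

Proof. Applying Proposition 6.6.9, for each $f \in \bigcap_{m \in \mathbb{N}} E_m$, $e_1 f$ has a strong minimizer on $\mathbb{R}^n$.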
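By Proposition 6.6.1, $f$ has the same corresponding strong minimizer.

6.6.2 The set of strongly convex functions is dense, but first category

Next, we turn our attention to the set of strongly convex functions. The objectives here are to show that the set is contained in both $U_m$ and $E_m$, dense in $(\Gamma_0(\mathbb{R}^n), d)$ and meagre in $(\Gamma_0(\mathbb{R}^n), d)$.

Theorem 6.6.11. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be strongly convex. Then $f \in U_m$ and $f \in E_m$ for all $m \in \mathbb{N}$.

Proof. Since $f$ is strongly convex, $f$ has a unique minimizer $z$. By Lemma 6.4.4, $z$ is a strong minimizer, so that for any sequence $\{x_k\}$ such that $f(x_k) \to f(z)$, we must have $x_k \to z$. We want to show that
\[ \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0. \tag{6.6.3} \]
For any $m \in \mathbb{N}$, (6.6.3) is true by Proposition 6.6.2. Thus, $f \in U_m$ for all $m \in \mathbb{N}$. By Lemma 6.4.1, $e_1 f$ is strongly convex, so the same argument applied to $e_1 f$ gives $f \in E_m$ for all $m \in \mathbb{N}$.

We will need the following characterizations of strongly convex functions in later proofs. Note that the proof of (i)$\Rightarrow$(iii) has been done by Rockafellar [191].

Lemma 6.6.12. Let $f \in \Gamma_0(\mathbb{R}^n)$. The following are equivalent:
(i) $f$ is strongly convex.
(ii) $P_1 f = k P_1 g$ for some $0 \leq k < 1$ and $g \in \Gamma_0(\mathbb{R}^n)$.
(iii) $P_1 f = k A$ for some $0 \leq k < 1$ and $A : \mathbb{R}^n \to \mathbb{R}^n$ nonexpansive.

Proof. (i)$\Rightarrow$(ii): Assume that $f$ is strongly convex. Then $f = g + \sigma q$ where $g \in \Gamma_0(\mathbb{R}^n)$, $q = \tfrac{1}{2}\|\cdot\|^2$, and $\sigma > 0$. We have
\[
\begin{aligned}
P_1 f = ((1 + \sigma)\operatorname{Id} + \partial g)^{-1} &= \left( (1 + \sigma)\left( \operatorname{Id} + \frac{\partial g}{1 + \sigma} \right) \right)^{-1} \\
&= \left( \operatorname{Id} + \frac{\partial g}{1 + \sigma} \right)^{-1} \circ \left( \frac{\operatorname{Id}}{1 + \sigma} \right).
\end{aligned}
\]
Define $\tilde{g}(x) = (1 + \sigma)\, g(x/(1 + \sigma))$. Then $\tilde{g} \in \Gamma_0(\mathbb{R}^n)$ and $\partial \tilde{g} = \partial g \circ \left( \frac{\operatorname{Id}}{1 + \sigma} \right)$, so
\[
\begin{aligned}
P_1 \tilde{g} = \left( \operatorname{Id} + \partial g \circ \left( \frac{\operatorname{Id}}{1 + \sigma} \right) \right)^{-1} &= \left( (1 + \sigma)\left( \operatorname{Id} + \frac{\partial g}{1 + \sigma} \right) \circ \left( \frac{\operatorname{Id}}{1 + \sigma} \right) \right)^{-1} \\
&= (1 + \sigma)\left( \operatorname{Id} + \frac{\partial g}{1 + \sigma} \right)^{-1} \circ \left( \frac{\operatorname{Id}}{1 + \sigma} \right) \\
&= (1 + \sigma) P_1 f.
\end{aligned}
\]
Hence $P_1 f = \frac{1}{1 + \sigma} P_1 \tilde{g}$, and (ii) holds with $k = \frac{1}{1 + \sigma} \in (0, 1)$.

(ii)$\Rightarrow$(i): Assume $P_1 f = k P_1 g$ for some $0 \leq k < 1$ and $g \in \Gamma_0(\mathbb{R}^n)$. If $k = 0$, then $f = \iota_{\{0\}} + c$ for some arbitrary $c \in \mathbb{R}$, and $f$ is obviously strongly convex. Let us assume $0 < k < 1$.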
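The assumption $(\operatorname{Id} + \partial f)^{-1} = k(\operatorname{Id} + \partial g)^{-1}$ gives
\[ \operatorname{Id} + \partial f = (\operatorname{Id} + \partial g) \circ (\operatorname{Id}/k) = \operatorname{Id}/k + \partial g \circ (\operatorname{Id}/k), \quad\text{so}\quad \partial f = (1/k - 1)\operatorname{Id} + \partial g \circ (\operatorname{Id}/k). \]
Since $1/k > 1$ and $\partial g \circ (\operatorname{Id}/k)$ is monotone, we have that $\partial f$ is strongly monotone, which implies that $f$ is strongly convex.

(ii)$\Rightarrow$(iii): This is clear because $P_1 g$ is nonexpansive, see, e.g., [21, Proposition 12.27].

(iii)$\Rightarrow$(ii): Assume $P_1 f = kA$ where $0 \leq k < 1$ and $A$ is nonexpansive. If $k = 0$, then $P_1 f = 0 = 0 \cdot 0$, so (ii) holds because $P_1 \iota_{\{0\}} = 0$. If $0 < k < 1$, then $A = \frac{1}{k} P_1 f$. As
\[ P_1 f = (\operatorname{Id} + \partial f)^{-1} = \nabla (q + f)^* = \nabla e_1(f^*), \]
we have $A = \nabla \left( e_1(f^*)/k \right)$. This means that $A$ is nonexpansive and is the gradient of a differentiable convex function. By the Baillon-Haddad theorem [16, Corollary 10], $A = P_1 g$ for some $g \in \Gamma_0(\mathbb{R}^n)$. Therefore, $P_1 f = k P_1 g$, i.e., (ii) holds true.

Theorem 6.6.13. The set of strongly convex functions is dense in $(\Gamma_0(\mathbb{R}^n), d)$. Equivalently, the set of strongly convex functions is dense in $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$.

Proof. Let $0 < \varepsilon < 1$ and $f \in \Gamma_0(\mathbb{R}^n)$. It will suffice to find $h \in \Gamma_0(\mathbb{R}^n)$ such that $h$ is strongly convex and $d(h, f) < \varepsilon$. For $0 < \sigma < 1$, define $g \in \Gamma_0(\mathbb{R}^n)$ by way of the proximal mapping:
\[ P_1 g = (1 - \sigma) P_1 f = (1 - \sigma) P_1 f + \sigma P_1 \iota_{\{0\}}. \]
Such a $g \in \Gamma_0(\mathbb{R}^n)$ does exist because $g$ is the proximal average of $f$ and $\iota_{\{0\}}$ by [23], and $g$ is strongly convex because of Lemma 6.6.12. Define $h \in \Gamma_0(\mathbb{R}^n)$ by
\[ h = g - e_1 g(0) + e_1 f(0). \]
Indeed, some calculations give
\[ h = (1 - \sigma) f\left( \frac{\cdot}{1 - \sigma} \right) + \frac{\sigma}{1 - \sigma} q + \sigma e_1 f(0). \]
Then $e_1 h = e_1 g - e_1 g(0) + e_1 f(0)$, so that
\[ e_1 h(0) = e_1 f(0), \tag{6.6.4} \]
and $P_1 h = P_1 g$. Fix $N$ large enough that $\sum_{i=N}^{\infty} \frac{1}{2^i} < \frac{\varepsilon}{2}$. Then
\[ \sum_{i=N}^{\infty} \frac{1}{2^i} \, \frac{\|e_1 f - e_1 h\|_i}{1 + \|e_1 f - e_1 h\|_i} \leq \sum_{i=N}^{\infty} \frac{1}{2^i} < \frac{\varepsilon}{2}. \tag{6.6.5} \]
Choose $\sigma$ such that
\[ 0 < \sigma < \frac{\varepsilon}{2 - \varepsilon} \cdot \frac{1}{N(N + \|P_1 f(0)\|)}. \tag{6.6.6} \]
This gives us that
\[ \frac{\sigma N(N + \|P_1 f(0)\|)}{1 + \sigma N(N + \|P_1 f(0)\|)} < \frac{\varepsilon}{2}. \]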
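(6.6.7)

By (6.6.4) and the Mean Value Theorem, for some $c \in [x, 0]$ we have
\[
\begin{aligned}
e_1 h(x) - e_1 f(x) &= e_1 h(x) - e_1 f(x) - (e_1 h(0) - e_1 f(0)) \\
&= \langle \nabla e_1 h(c) - \nabla e_1 f(c), x - 0 \rangle \\
&= \langle (\operatorname{Id} - P_1 h)(c) - (\operatorname{Id} - P_1 f)(c), x - 0 \rangle \\
&= \langle -P_1 h(c) + P_1 f(c), x - 0 \rangle \\
&= \langle -(1 - \sigma) P_1 f(c) + P_1 f(c), x \rangle \\
&= \langle \sigma P_1 f(c), x \rangle.
\end{aligned}
\]
Using the triangle inequality, the Cauchy-Schwarz inequality, and the fact that $P_1 f$ is nonexpansive, we obtain
\[
\begin{aligned}
|e_1 h(x) - e_1 f(x)| &\leq \sigma \|P_1 f(c)\| \|x\| \\
&= \sigma \|P_1 f(c) - P_1 f(0) + P_1 f(0)\| \|x\| \\
&\leq \sigma (\|P_1 f(c) - P_1 f(0)\| + \|P_1 f(0)\|) \|x\| \\
&\leq \sigma (\|c\| + \|P_1 f(0)\|) \|x\| \\
&\leq \sigma (\|x\| + \|P_1 f(0)\|) \|x\| \\
&\leq \sigma N(N + \|P_1 f(0)\|),
\end{aligned}
\]
when $\|x\| \leq N$. Therefore, $\|e_1 h - e_1 f\|_N \leq \sigma N(N + \|P_1 f(0)\|)$. Applying (6.6.7), this implies that
\[ \frac{\|e_1 f - e_1 h\|_N}{1 + \|e_1 f - e_1 h\|_N} \leq \frac{\sigma N(N + \|P_1 f(0)\|)}{1 + \sigma N(N + \|P_1 f(0)\|)} < \frac{\varepsilon}{2}. \tag{6.6.8} \]
Now considering the first $N - 1$ terms of our $d$ function, we have
\[
\begin{aligned}
\sum_{i=1}^{N-1} \frac{1}{2^i} \, \frac{\|e_1 f - e_1 h\|_i}{1 + \|e_1 f - e_1 h\|_i} &\leq \sum_{i=1}^{N-1} \frac{1}{2^i} \, \frac{\|e_1 f - e_1 h\|_N}{1 + \|e_1 f - e_1 h\|_N} \\
&= \frac{\|e_1 f - e_1 h\|_N}{1 + \|e_1 f - e_1 h\|_N} \sum_{i=1}^{N-1} \frac{1}{2^i} \\
&< \frac{\|e_1 f - e_1 h\|_N}{1 + \|e_1 f - e_1 h\|_N}.
\end{aligned}
\tag{6.6.9}
\]
When (6.6.6) holds, combining (6.6.5), (6.6.8), and (6.6.9) yields $d(h, f) < \varepsilon$. Hence, for any arbitrary $f \in \Gamma_0(\mathbb{R}^n)$ and $0 < \varepsilon < 1$, there exists a strongly convex function $h \in \Gamma_0(\mathbb{R}^n)$ such that $d(h, f) < \varepsilon$. That is, the set of strongly convex functions is dense in $(\Gamma_0(\mathbb{R}^n), d)$. Because $(\Gamma_0(\mathbb{R}^n), d)$ and $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ are isometric by Corollary 6.5.7, it suffices to apply Lemma 6.4.1.

Remark 6.6.14. A shorter proof can be provided by approximating $f \in \Gamma_0(\mathbb{R}^n)$ with $f + \varepsilon\|\cdot\|^2$. One can use either [31, Theorem 7.4.5], or the fact that $f + \varepsilon\|\cdot\|^2$ converges uniformly to $f$ on bounded subsets of $\operatorname{dom} f$. When $\varepsilon \downarrow 0$, clearly $f + \varepsilon\|\cdot\|^2$ converges epigraphically to $f$, which gives the Attouch-Wets convergence because we are in a finite-dimensional setting. To quantify the epigraphical convergence in terms of the distance $d$ based on Moreau envelopes (Definition 6.5.3), one has
\[ e_1(f + \varepsilon\|\cdot\|^2) = (2\varepsilon + 1)\, e_1\!\left( \frac{f}{2\varepsilon + 1} \right)\!\left( \frac{\cdot}{2\varepsilon + 1} \right) + \frac{\varepsilon}{2\varepsilon + 1}\|\cdot\|^2, \]
which indeed converges to $e_1 f$ uniformly on bounded sets of $\mathbb{R}^n$.

Theorem 6.6.15.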
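The set of strongly convex Moreau envelopes is a meagre set in $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ where $\tilde{d}$ is given by (6.5.1). Equivalently, in $(\Gamma_0(\mathbb{R}^n), d)$ the set of strongly convex functions is meagre.

Proof. Denote the set of strongly convex functions in $e_1(\Gamma_0(\mathbb{R}^n))$ by $S$. Define
\[ F_m = \left\{ g \in e_1(\Gamma_0(\mathbb{R}^n)) : g - \frac{1}{2m}\|\cdot\|^2 \text{ is convex on } \mathbb{R}^n \right\}. \]
We show that
(i) $S = \bigcup_{m \in \mathbb{N}} F_m$,
(ii) for each $m \in \mathbb{N}$, the set $F_m$ is closed in $e_1(\Gamma_0(\mathbb{R}^n))$, and
(iii) for each $m \in \mathbb{N}$, the set $F_m$ has empty interior.
Then $S$ will have been shown to be a countable union of closed, nowhere dense sets, hence first category.

(i) ($\Rightarrow$) Let $f \in S$. Then there exists $\sigma > 0$ such that $f - \frac{\sigma}{2}\|\cdot\|^2$ is convex. Note that this means $f - \frac{\tilde{\sigma}}{2}\|\cdot\|^2$ is convex for all $\tilde{\sigma} \in (0, \sigma)$. Since $\sigma > 0$, there exists $m \in \mathbb{N}$ such that $0 < 1/m < \sigma$. Hence, $f - \frac{1}{2m}\|\cdot\|^2$ is convex, and $f \in F_m$. Therefore, $S \subseteq \bigcup_{m \in \mathbb{N}} F_m$.
($\Leftarrow$) Let $f \in F_m$ for some $m \in \mathbb{N}$. Then $f - \frac{1}{2m}\|\cdot\|^2$ is convex. Thus, with $\sigma = \frac{1}{m}$, we have that there exists $\sigma > 0$ such that $f - \frac{\sigma}{2}\|\cdot\|^2$ is convex, which is the definition of strong convexity of $f$. Therefore, $F_m \subseteq S$, and since this is true for every $m \in \mathbb{N}$, we have $\bigcup_{m \in \mathbb{N}} F_m \subseteq S$.

(ii) Let $g \notin F_m$. Then $g - \frac{1}{2m}\|\cdot\|^2$ is not convex. Equivalently, there exist $\lambda \in (0, 1)$ and $x, y \in \mathbb{R}^n$ such that
\[ \frac{g(\lambda x + (1 - \lambda)y) - \lambda g(x) - (1 - \lambda) g(y)}{\lambda(1 - \lambda)} > -\frac{\|x - y\|^2}{2m}. \tag{6.6.10} \]
Let $N > \max\{\|x\|, \|y\|\}$. Choose $\varepsilon > 0$ such that when $f \in e_1(\Gamma_0(\mathbb{R}^n))$ and $\tilde{d}(f, g) < \varepsilon$, we have $\|f - g\|_N < \tilde{\varepsilon}$ for some $\tilde{\varepsilon} > 0$. In particular,
\[
\begin{aligned}
&\frac{f(\lambda x + (1 - \lambda)y) - \lambda f(x) - (1 - \lambda) f(y)}{\lambda(1 - \lambda)} \\
&\quad= \frac{g(\lambda x + (1 - \lambda)y) - \lambda g(x) - (1 - \lambda) g(y)}{\lambda(1 - \lambda)} \\
&\qquad+ \frac{(f - g)(\lambda x + (1 - \lambda)y) - \lambda (f - g)(x) - (1 - \lambda)(f - g)(y)}{\lambda(1 - \lambda)} \\
&\quad> \frac{g(\lambda x + (1 - \lambda)y) - \lambda g(x) - (1 - \lambda) g(y)}{\lambda(1 - \lambda)} - \frac{4\tilde{\varepsilon}}{\lambda(1 - \lambda)}.
\end{aligned}
\]
Hence, when $\tilde{\varepsilon}$ is sufficiently small, which can be achieved by making $\varepsilon$ sufficiently small, we have
\[ \frac{f(\lambda x + (1 - \lambda)y) - \lambda f(x) - (1 - \lambda) f(y)}{\lambda(1 - \lambda)} > -\frac{\|x - y\|^2}{2m}. \]
This gives us, by (6.6.10), that $f - \frac{1}{2m}\|\cdot\|^2$ is not convex. Thus, $f \notin F_m$, so $e_1(\Gamma_0(\mathbb{R}^n)) \setminus F_m$ is open, and therefore $F_m$ is closed.

(iii) That $\operatorname{int} F_m = \emptyset$ is equivalent to saying that $e_1(\Gamma_0(\mathbb{R}^n)) \setminus F_m$ is dense.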
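Thus, it suffices to show that for every $\varepsilon > 0$ and every $g \in e_1(\Gamma_0(\mathbb{R}^n))$, the open ball $B_\varepsilon(g)$ contains an element of $e_1(\Gamma_0(\mathbb{R}^n)) \setminus F_m$.
If $g \in e_1(\Gamma_0(\mathbb{R}^n)) \setminus F_m$, then there is nothing to prove. Assume that $g \in F_m$. Then $g$ is $\frac{1}{2m}$-strongly convex, and has a strong minimizer $\bar{x}$ by Lemma 6.4.4. As $g \in e_1(\Gamma_0(\mathbb{R}^n))$, $g = e_1 f$ for some $f \in \Gamma_0(\mathbb{R}^n)$. We consider two cases.

Case 1: Suppose that for every $k \in \mathbb{N}$ there exists $x_k \neq \bar{x}$ such that $f(x_k) < f(\bar{x}) + \frac{1}{k}$. Define $h_k = \max\{f, f(\bar{x}) + \frac{1}{k}\}$. Then
\[ \min h_k = f(\bar{x}) + \frac{1}{k}, \qquad f \leq h_k \leq f + \frac{1}{k}, \]
so that $e_1 f \leq e_1 h_k \leq e_1 f + \frac{1}{k}$. We have that $g_k = e_1 h_k \in e_1(\Gamma_0(\mathbb{R}^n))$, and that $\|g_k - g\|_i \leq \frac{1}{k}$ for all $i \in \mathbb{N}$. Choosing $k$ sufficiently large guarantees that $\tilde{d}(g_k, g) < \varepsilon$. We see that $h_k$ does not have a strong minimizer by noting that for every $k$,
\[ f(\bar{x}) < f(\bar{x}) + \frac{1}{k}, \quad f(x_k) < f(\bar{x}) + \frac{1}{k} \quad\text{and}\quad h_k(\bar{x}) = h_k(x_k) = f(\bar{x}) + \frac{1}{k}. \]
Thus, $h_k$ does not have a strong minimizer, which implies that $g_k = e_1 h_k$ does not either, by Proposition 6.6.1. Therefore, $g_k \notin F_m$.

Case 2: If Case 1 is not true, then there exists $k$ such that $f(x) \geq f(\bar{x}) + \frac{1}{k}$ for every $x \neq \bar{x}$. Then we claim that $f(x) = \infty$ for all $x \neq \bar{x}$. Suppose for the purpose of contradiction that there exists $x \neq \bar{x}$ such that $f(x) < \infty$. As $f \in \Gamma_0(\mathbb{R}^n)$, the function $\varphi : [0, 1] \to \mathbb{R}$ defined by $\varphi(t) = f(tx + (1 - t)\bar{x})$ is continuous by [217, Proposition 2.1.6]. Hence $\varphi(t) \to f(\bar{x})$ as $t \downarrow 0$, which contradicts the assumption that $f \geq f(\bar{x}) + \frac{1}{k}$ away from $\bar{x}$. Therefore,
\[ f(x) = \iota_{\{\bar{x}\}}(x) + f(\bar{x}). \]
Consequently,
\[ g(x) = e_1 f(x) = f(\bar{x}) + \tfrac{1}{2}\|x - \bar{x}\|^2. \]
Now for every $j \in \mathbb{N}$, define $f_j : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$,
\[ f_j(x) = \begin{cases} f(\bar{x}), & \|x - \bar{x}\| \leq \frac{1}{j}, \\ \infty, & \text{otherwise.} \end{cases} \]
We have $f_j \in \Gamma_0(\mathbb{R}^n)$, and
\[ g_j(x) = e_1 f_j(x) = \begin{cases} f(\bar{x}), & \|x - \bar{x}\| \leq \frac{1}{j}, \\ f(\bar{x}) + \tfrac{1}{2}\left( \|x - \bar{x}\| - \frac{1}{j} \right)^2, & \|x - \bar{x}\| > \frac{1}{j}. \end{cases} \]
Then $\{g_j\}_{j \in \mathbb{N}}$ converges pointwise to $e_1 f = g$, by [196, Theorem 7.37]. Thus, for sufficiently large $j$, $\tilde{d}(g_j, g) < \varepsilon$. Since $g_j$ is constant on $B_{\frac{1}{j}}(\bar{x})$, $g_j$ is not strongly convex, so $g_j \notin F_m$.

Properties (i), (ii) and (iii) all together show that the set of strongly convex functions is meagre in $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$.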
Note that $(e_1(\Gamma_0(\mathbb{R}^n)), \tilde{d})$ and $(\Gamma_0(\mathbb{R}^n), d)$ are isometric by Corollary 6.5.7. The proof is completed by applying Lemma 6.4.1.

Remark 6.6.16. Theorem 6.6.15 shows that the set $S$ of strongly convex functions is of first category, but what about the larger set $U$ of uniformly convex functions to which $S$ belongs? (See [21, 216, 217] for information on uniform convexity.) It is clear that $U$ is bigger than $S$; for example, [21, Definition 10.5] shows that $f \in S \Rightarrow f \in U$, and [21, Exercise 10.7] states that $f : \mathbb{R} \to \mathbb{R}$, $f(x) = |x|^4$ is uniformly convex but not strongly convex. So is $U$ generic or meagre? This is an open question. We refer the reader in particular to [217, Proposition 3.5.8, Theorem 3.5.10] for properties and characterizations of uniformly convex functions.

Corollary 6.6.17. In $(\Gamma_0(\mathbb{R}^n), d)$ the set
\[ D = \{ f : f \text{ is differentiable and } \nabla f \text{ is } c\text{-Lipschitz for some } c > 0 \} \]
is of first category.

Proof. We know that $(\Gamma_0(\mathbb{R}^n), d) \to ((\Gamma_0(\mathbb{R}^n))^*, \hat{d})$ is an isometry, and that $f$ is $c$-strongly convex for some $c > 0$ if and only if $f^*$ is differentiable on $\mathbb{R}^n$ with $\nabla f^*$ being $\frac{1}{c}$-Lipschitz [217, Corollary 3.5.11(i)$\Leftrightarrow$(vi)]. Since the set of strongly convex functions is first category in $((\Gamma_0(\mathbb{R}^n))^*, \hat{d})$, the set of differentiable convex functions with $\nabla f$ being $c$-Lipschitz for some $c > 0$ is a first category set in $(\Gamma_0(\mathbb{R}^n), d)$.

6.6.3 The set of convex functions with strong minimizers is second category

In this section, we present properties of the sets $U_m$ and $E_m$, and show that the set of convex functions that attain a strong minimum is a generic set in $(\Gamma_0(\mathbb{R}^n), d)$.

Lemma 6.6.18. The sets $U_m$ and $E_m$ are dense in $(\Gamma_0(\mathbb{R}^n), d)$.

Proof. This is immediate by combining Theorems 6.6.11 and 6.6.13.

To continue, we need the following result, which holds in $\Gamma_0(X)$ where $X$ is any Banach space.

Lemma 6.6.19. Let $f \in \Gamma_0(\mathbb{R}^n)$, $m \in \mathbb{N}$, and fix $z \in \operatorname{dom} f$. Then
\[ \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0 \quad\text{if and only if}\quad \inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0. \]

Proof. ($\Rightarrow$) Suppose that for $z$ fixed, $\inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0$.
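Since
\[ \inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) \geq \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z), \]
we have $\inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0$.
($\Leftarrow$) Let $\inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0$, and suppose that
\[ \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) \leq 0. \]
Then for each $k \in \mathbb{N}$, there exists $y_k$ such that $\|y_k - z\| \geq \frac{1}{m}$ and $f(y_k) \leq f(z) + \frac{1}{k}$. Take $z_k \in [y_k, z] \cap \{x \in \mathbb{R}^n : m \geq \|x - z\| \geq \frac{1}{m}\} \neq \emptyset$. Then
\[ z_k = \lambda_k y_k + (1 - \lambda_k) z \]
for some $\lambda_k \in [0, 1]$. By convexity of $f$, we have
\[
\begin{aligned}
f(z_k) = f(\lambda_k y_k + (1 - \lambda_k) z) &\leq \lambda_k f(y_k) + (1 - \lambda_k) f(z) \\
&\leq \lambda_k f(z) + (1 - \lambda_k) f(z) + \frac{\lambda_k}{k} \\
&= f(z) + \frac{\lambda_k}{k} \leq f(z) + \frac{1}{k}.
\end{aligned}
\]
Now $\inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) \leq f(z_k) \leq f(z) + \frac{1}{k}$, so when $k \to \infty$ we obtain
\[ \inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) \leq 0. \]
This contradicts the fact that $\inf_{m \geq \|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0$. Therefore,
\[ \inf_{\|x - z\| \geq \frac{1}{m}} f(x) - f(z) > 0. \]

Lemma 6.6.20. The set $E_m$ is an open set in $(\Gamma_0(\mathbb{R}^n), d)$.

Proof. Fix $m \in \mathbb{N}$, and let $f \in E_m$. Then there exists $z \in \mathbb{R}^n$ such that
\[ \inf_{\|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) > 0. \]
Hence, by Lemma 6.6.19,
\[ \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) > 0. \]
Choose $j$ large enough that $B_m[z] \subseteq B_j(0)$. For $g \in \Gamma_0(\mathbb{R}^n)$, let $d(f, g) < \varepsilon$, where
\[ 0 < \varepsilon < \frac{\inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z)}{2^j \left( 2 + \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) \right)} < \frac{1}{2^j}. \tag{6.6.11} \]
The reason for this bound on $\varepsilon$ will become apparent at the end of the proof. Then
\[ \sum_{i=1}^{\infty} \frac{1}{2^i} \, \frac{\|e_1 f - e_1 g\|_i}{1 + \|e_1 f - e_1 g\|_i} < \varepsilon. \]
In particular for our choice of $j$, we have that $2^j \varepsilon < 1$ by (6.6.11), and that
\[
\begin{aligned}
\frac{1}{2^j} \, \frac{\|e_1 f - e_1 g\|_j}{1 + \|e_1 f - e_1 g\|_j} &< \varepsilon, \\
\|e_1 f - e_1 g\|_j &< 2^j \varepsilon \, (1 + \|e_1 f - e_1 g\|_j), \\
\sup_{\|x\| \leq j} |e_1 f(x) - e_1 g(x)| \, (1 - 2^j \varepsilon) &< 2^j \varepsilon, \\
\sup_{\|x\| \leq j} |e_1 f(x) - e_1 g(x)| &< \frac{2^j \varepsilon}{1 - 2^j \varepsilon}.
\end{aligned}
\]
Define $\alpha = \frac{2^j \varepsilon}{1 - 2^j \varepsilon}$. Then
\[ \sup_{\|x\| \leq j} |e_1 f(x) - e_1 g(x)| < \alpha. \]
Hence,
\[ |e_1 f(x) - e_1 g(x)| < \alpha \quad\text{for all } x \text{ with } \|x\| \leq j. \]
In other words,
\[ e_1 f(x) - \alpha < e_1 g(x) < e_1 f(x) + \alpha \quad\text{for all } x \text{ with } \|x\| \leq j. \]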
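Since $B_m[z] \subseteq B_j(0)$, we can take the infimum over $m \geq \|x - z\| \geq \frac{1}{m}$ to obtain
\[ \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - \alpha \leq \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 g(x) \leq \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) + \alpha. \]
Using the above together with the fact that $|e_1 g(z) - e_1 f(z)| < \alpha$ yields
\[
\begin{aligned}
\inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 g(x) - e_1 g(z) &\geq \left( \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - \alpha \right) - (e_1 f(z) + \alpha) \\
&= \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) - 2\alpha.
\end{aligned}
\]
Hence, if
\[ \alpha < \frac{\inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z)}{2}, \tag{6.6.12} \]
we have
\[ \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 g(x) - e_1 g(z) > 0. \tag{6.6.13} \]
Recalling that $\alpha = \frac{2^j \varepsilon}{1 - 2^j \varepsilon}$, we solve (6.6.12) for $\varepsilon$ to obtain
\[ \varepsilon < \frac{\inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z)}{2^j \left( 2 + \inf_{m \geq \|x - z\| \geq \frac{1}{m}} e_1 f(x) - e_1 f(z) \right)}. \]
Thus, (6.6.13) is true whenever $d(f, g) < \varepsilon$ for any $\varepsilon$ that respects (6.6.11). Applying Lemma 6.6.19 to (6.6.13), we conclude that
\[ \inf_{\|x - z\| \geq \frac{1}{m}} e_1 g(x) - e_1 g(z) > 0. \]
Hence, if $g \in \Gamma_0(\mathbb{R}^n)$ is such that $d(f, g) < \varepsilon$, then $g \in E_m$. Therefore, $E_m$ is open.

Theorem 6.6.21. In $X = (\Gamma_0(\mathbb{R}^n), d)$, the set
\[ S = \{ f \in \Gamma_0(\mathbb{R}^n) : f \text{ attains a strong minimum} \} \]
is generic.

Proof. By Lemmas 6.6.18 and 6.6.20, we have that $E_m$ is open and dense in $X$. Hence, $G = \bigcap_{m \in \mathbb{N}} E_m$ is a countable intersection of open, dense sets in $X$, and as such $G$ is generic in $X$. Let $f \in G$. By Theorem 6.6.10, $f$ attains a strong minimum on $\mathbb{R}^n$. Thus, every element of $G$ attains a strong minimum on $\mathbb{R}^n$. Since $G$ is generic in $X$ and $G \subseteq S$, we conclude that $S$ is generic in $X$.

Remark 6.6.22. This result is stated as Exercise 7.5.10 in [31, p. 269]. However, the approach taken there uses the Attouch-Wets topology defined by uniform convergence on bounded subsets of the distance function, associated with epigraphs of convex functions.

Theorem 6.6.23. In $X = (\Gamma_0(\mathbb{R}^n), d)$, the set
\[ S = \{ f \in \Gamma_0(\mathbb{R}^n) : f \text{ is coercive} \} \]
is generic.

Proof. Define the set $\Gamma_1(\mathbb{R}^n) = \Gamma_0(\mathbb{R}^n) + x^*$, in the sense that for any $f \in \Gamma_0(\mathbb{R}^n)$, the function $f + \langle x^*, \cdot \rangle \in \Gamma_1(\mathbb{R}^n)$. Since any such $f + \langle x^*, \cdot \rangle$ is proper, lsc, and convex, we have $\Gamma_1(\mathbb{R}^n) \subseteq \Gamma_0(\mathbb{R}^n)$. Since for any $f \in \Gamma_0(\mathbb{R}^n)$ we have that $f - x^* \in \Gamma_0(\mathbb{R}^n)$, this gives us that $f \in \Gamma_0(\mathbb{R}^n) + x^* = \Gamma_1(\mathbb{R}^n)$. Therefore, $\Gamma_1(\mathbb{R}^n) = \Gamma_0(\mathbb{R}^n)$.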
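By Theorem 6.6.21, there exists a generic set $G \subseteq \Gamma_0(\mathbb{R}^n)$ such that for every $f \in G$, $f$ attains a strong minimum at some point $x$, and hence $0 \in \partial f(x)$. Then, given any $x^*$ fixed, there exists a generic set $G_{x^*}$ that contains a dense $G_\delta$ set, such that $0 \in \partial (f + x^*)(x)$. Thus, for each $f \in G_{x^*}$ there exists $x \in \mathbb{R}^n$ such that $-x^* \in \partial f(x)$. By Fact 6.3.6, it is possible to construct the set $D = \{-x_i^*\}_{i=1}^{\infty}$ such that $\overline{D} = \mathbb{R}^n$. Then each set $G_{x_i^*}$, $i \in \mathbb{N}$, contains a dense $G_\delta$ set. Therefore, the set $G = \bigcap_{i=1}^{\infty} G_{x_i^*}$ contains a dense $G_\delta$ set. Let $f \in G$. Then for each $i \in \mathbb{N}$, $-x_i^* \in \partial f(x)$ for some $x \in \mathbb{R}^n$. That is, $-x_i^* \in \operatorname{ran} \partial f$. So $D = \bigcup_{i=1}^{\infty} \{-x_i^*\} \subseteq \operatorname{ran} \partial f$, and $\overline{D} \subseteq \overline{\operatorname{ran} \partial f}$. Since $\overline{D} = \mathbb{R}^n$, we have $\mathbb{R}^n = \overline{\operatorname{ran} \partial f}$. By Facts 5.3.8 and 6.3.7, $\operatorname{ran} \partial f$ is almost convex; there exists a convex set $C$ such that $C \subseteq \operatorname{ran} \partial f \subseteq \overline{C}$. Then $\overline{C} = \mathbb{R}^n$. As $C$ is convex, by [193, Theorem 6.3] we have the relative interior $\operatorname{ri} \overline{C} = \operatorname{ri} C$, so $\operatorname{ri} C = \mathbb{R}^n$. Thus, $\mathbb{R}^n = \operatorname{ri} C \subseteq C$, which gives us that $C = \mathbb{R}^n$. Therefore, $\operatorname{ran} \partial f = \mathbb{R}^n$. By Fact 6.3.8, $\operatorname{ran} \partial f \subseteq \operatorname{dom}(f^*)$. Hence, $\operatorname{dom} f^* = \mathbb{R}^n$. By Fact 6.3.4, we have that
\[ \lim_{\|x\| \to \infty} \frac{f(x)}{\|x\|} = \infty. \]
Therefore, $f$ is coercive for all $f \in G$. Since $G$ is generic in $X$ and $G \subseteq S$, we conclude that $S$ is generic in $X$.

Theorem 6.6.24. In $(\Gamma_0(\mathbb{R}^n), d)$, the set $S = \{ f \in \Gamma_0(\mathbb{R}^n) : \operatorname{dom} f = \mathbb{R}^n \}$ is generic.

Proof. Note that $(\Gamma_0(\mathbb{R}^n))^* = \Gamma_0(\mathbb{R}^n)$. In $((\Gamma_0(\mathbb{R}^n))^*, d)$, by Theorem 6.6.23, the set
\[ \{ f^* \in (\Gamma_0(\mathbb{R}^n))^* : f^* \text{ is coercive} \} \]
is generic. Since $f^*$ is coercive if and only if $f$ has $\operatorname{dom} f = \mathbb{R}^n$ by Fact 6.3.4, the proof is done.

Combining Theorems 6.6.21, 6.6.23 and 6.6.24, we obtain the following.

Corollary 6.6.25. In $(\Gamma_0(\mathbb{R}^n), d)$, the set
\[ S = \{ f \in \Gamma_0(\mathbb{R}^n) : \operatorname{dom} f = \mathbb{R}^n,\ \operatorname{dom} f^* = \mathbb{R}^n,\ f \text{ has a strong minimizer} \} \]
is generic.

Chapter 7
Generalized Linear-quadratic Functions

7.1 Overview

This chapter develops the theory of the class of generalized linear-quadratic functions (see Definition 7.3.3), constructed using maximally monotone symmetric linear relations.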
Calculus rules and properties of the Moreau envelope for this class of functions are presented. On a metric space defined by Moreau envelopes, we consider the epigraphical limit of a sequence of quadratic functions and categorize the results. We explore the question of when a quadratic function is a Moreau envelope of a generalized linear-quadratic function; characterizations involving nonexpansiveness and Lipschitz continuity are established. Generalizations of results by Hiriart-Urruty and by Rockafellar and Wets are included. The main results of this chapter are the following.
• A convex quadratic function $f$ satisfies $f = e_r g$ for some generalized linear-quadratic function $g$ if and only if the quadratic coefficient matrix of $f$ is nonexpansive.
• A function is convex generalized linear-quadratic if and only if its Moreau envelope is convex linear-quadratic.
• The epigraphical limit of a sequence of generalized linear-quadratic functions is a generalized linear-quadratic function.
This chapter is based on results in [180], which is submitted to Journal of Optimization Theory and Applications.

7.2 Generalized linear-quadratic functions

In this chapter, we continue the investigation into Moreau envelopes in finite dimensions, from the perspective of the generalized linear-quadratic objective function. We focus on generalized linear-quadratic functions because they form a class of functions that has enough structure to secure solid results that do not require overly restrictive conditions, but allows us to obtain results that are useful for a wide range of functions. We use the metric space defined in the previous chapter, with the intention of exploring the epiconvergence of a sequence of generalized linear-quadratic functions. Epiconvergence plays a fundamental role in optimization and variational analysis, see [7, 10, 31, 40, 44, 195, 196]. Several classes of functions can arise at the limit; these results are classified and illustrated.
Then we approach the relationship between Moreau envelopes and quadratics from the opposite direction, determining the conditions under which a given quadratic function $g$ is a Moreau envelope of another function $f$ and whether $f$ can be determined explicitly.

The linear relation is also a useful tool in functional analysis, notably documented and developed in [59], with more recent expansion in [19, 27, 28, 213]. This chapter continues to develop the theory of monotone linear relations, in particular for the class of generalized linear-quadratic functions. Such functions arise, for example, in the determination of the existence of a Hessian for the Moreau envelope [137, 182]. In [182, Theorem 3.9], Rockafellar and Poliquin showed that a function does not have to be finite in order for its Moreau envelope to have a Hessian; it suffices that the second-order epiderivative of the function be a generalized linear-quadratic function. The existence of a Hessian is of interest since it is needed in order to do a second-order expansion of the Moreau envelope function, which leads to a second-order approximation of its objective function. Several properties and characterizations for the class of generalized linear-quadratic functions are provided in this work and we demonstrate that it is useful and convenient to work in the setting of generalized linear-quadratic functions when considering matters of epiconvergence.

7.3 Chapter-specific definitions and facts

Definition 7.3.1. A sequence of functions $\{f_k\}$ on $\mathbb{R}^n$ is eventually prox-bounded if there exists $r \geq 0$ such that $\liminf_{k \to \infty} e_r f_k(x) > -\infty$ for some $x$. The infimum of all such $r$ is the threshold of eventual prox-boundedness of the sequence.

Definition 7.3.2. An operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is a linear relation if the graph of $A$ is a linear subspace of $\mathbb{R}^n \times \mathbb{R}^n$.

Definition 7.3.3. A generalized linear-quadratic function $p : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is defined by
\[ p(x) = \tfrac{1}{2}\langle x - a, A(x - a) \rangle + \langle b, x \rangle + c \quad \forall x \in \mathbb{R}^n, \]
where $A$ is a linear relation, $a, b \in \mathbb{R}^n$, $c \in \mathbb{R}$.
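To make Definition 7.3.3 concrete, the following is a minimal numerical sketch, not part of the original development. It assumes that the graph of the linear relation $A$ is given as the span of finitely many pairs $(x_i, y_i)$, stored as the columns of two hypothetical matrices `X` and `Y`, and that $A$ is symmetric, so that $\langle u, v \rangle$ takes the same value for every $v \in Au$. The function name `eval_glq` and the tolerance `tol` are illustrative choices; points with $x - a \notin \operatorname{dom} A$ are assigned the value $\infty$.

```python
import numpy as np


def eval_glq(X, Y, a, b, c, x, tol=1e-10):
    """Evaluate p(x) = 0.5*<x-a, A(x-a)> + <b, x> + c for a linear relation A.

    Assumption (illustrative): gra A = span{(X[:, i], Y[:, i])}, and A is
    symmetric, so <u, v> does not depend on which v in A(u) is selected.
    Returns np.inf when x - a lies outside dom A = range(X).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.asarray(x, dtype=float)

    u = x - a
    # Find coefficients with X @ coef = u (least squares, minimal norm).
    coef, *_ = np.linalg.lstsq(X, u, rcond=None)
    if np.linalg.norm(X @ coef - u) > tol:
        return np.inf  # x - a is not in dom A
    v = Y @ coef  # one element of A(x - a)
    return 0.5 * float(u @ v) + float(b @ x) + c


if __name__ == "__main__":
    # A given by an ordinary symmetric matrix M: gra A = {(x, Mx)}.
    M = np.array([[2.0, 0.0], [0.0, 4.0]])
    print(eval_glq(np.eye(2), M, [0, 0], [0, 0], 0.0, [1, 1]))

    # A = normal cone operator of {0} on R: gra A = {0} x R.
    print(eval_glq([[0.0]], [[1.0]], [1.0], [2.0], 0.5, [1.0]))
    print(eval_glq([[0.0]], [[1.0]], [1.0], [2.0], 0.5, [0.0]))
```

In the second case the relation has domain $\{0\}$, so under these assumptions the sketch returns a finite value only at $x = a$; this mirrors how a generalized linear-quadratic function can encode an indicator-type constraint.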
The function $p$ may not be well-defined, but in Section 7.5 we will present the condition on $A$ that makes $p$ well-defined. Note that the term $\langle x - a, A(x - a) \rangle$ cannot in general be expanded, as the following example demonstrates.

Example 7.3.4. Consider the example on $\mathbb{R}$ of $A = N_{\{0\}}$:
\[ N_{\{0\}}(1 - 1) = \mathbb{R} \neq N_{\{0\}}(1) + N_{\{0\}}(-1) = \emptyset + \emptyset. \]

Definition 7.3.5. The adjoint $A^*$ of a linear relation $A$ is defined in terms of its graph:
\[
\begin{aligned}
\operatorname{gra} A^* &= \{ (x^{**}, x^*) \in \mathbb{R}^n \times \mathbb{R}^n : (x^*, -x^{**}) \in (\operatorname{gra} A)^{\perp} \} \\
&= \{ (x^{**}, x^*) \in \mathbb{R}^n \times \mathbb{R}^n : \langle a, x^* \rangle = \langle a^*, x^{**} \rangle \ \forall (a, a^*) \in \operatorname{gra} A \}.
\end{aligned}
\]

Definition 7.3.6. An operator $A$ is symmetric if $\operatorname{gra} A \subseteq \operatorname{gra} A^*$. Equivalently, $A$ is symmetric if $\langle x, y^* \rangle = \langle y, x^* \rangle$ for all $(x, x^*), (y, y^*) \in \operatorname{gra} A$.

Definition 7.3.7. For a linear relation $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$, we define
(i) $q_A(x) = \begin{cases} \tfrac{1}{2}\langle x, Ax \rangle, & \text{if } x \in \operatorname{dom} A, \\ \infty, & \text{if } x \notin \operatorname{dom} A, \end{cases}$
(ii) $A_+ = \tfrac{1}{2}(A + A^*)$.

Fact 7.3.8. [196, Theorem 7.37] For proper, lsc functions $f_k$ and $f$, the following are equivalent:
(i) the sequence $\{f_k\}$ is eventually prox-bounded and $f_k \xrightarrow{e} f$;
(ii) there exists $\varepsilon > 0$ such that $f$ is prox-bounded and $e_r f_k \xrightarrow{p} e_r f$ for all $r \in (\varepsilon, \infty)$.
Then the pointwise convergence of $e_r f_k$ to $e_r f$ for $r > 0$ sufficiently large is uniform on all bounded subsets of $\mathbb{R}^n$, hence yields continuous convergence and epiconvergence as well. If $f_k$ and $f$ are convex, then the threshold is $\bar{r} = 0$ and condition (ii) can be replaced by
(iib) $e_r f_k \xrightarrow{p} e_r f$ for all $r > 0$.

Fact 7.3.9. [193, Theorem 25.7] Let $C$ be an open convex set and $f$ be a convex function that is finite and differentiable on $C$. Let $\{f_k\}_{k \in \mathbb{N}}$ be a sequence of convex functions finite and differentiable on $C$ such that $\lim_{k \to \infty} f_k(x) = f(x)$ for every $x \in C$. Then
\[ \lim_{k \to \infty} \nabla f_k(x) = \nabla f(x) \quad \forall x \in C. \]
In fact, the mappings $\nabla f_k$ converge to $\nabla f$ uniformly on every closed bounded subset of $C$.

Fact 7.3.10. [196, Lemma 12.14] Every mapping $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ obeys the identity
\[ \operatorname{Id} - (\operatorname{Id} + A)^{-1} = (\operatorname{Id} + A^{-1})^{-1}. \]

Fact 7.3.11. [196, Lemma 12.12] Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be monotone and $\lambda > 0$. Then $(\operatorname{Id} + \lambda A)^{-1}$ is monotone and nonexpansive.
Moreover, $A$ is maximally monotone if and only if $\operatorname{dom}(\operatorname{Id} + \lambda A)^{-1} = \mathbb{R}^n$. In that case, $(\operatorname{Id} + \lambda A)^{-1}$ is maximally monotone as well, and it is a single-valued mapping from all of $\mathbb{R}^n$ into itself.

Fact 7.3.12. [21, Proposition 23.7] Let $D \subseteq \mathbb{R}^n$ be nonempty, $T : D \to \mathbb{R}^n$, $A = T^{-1} - \operatorname{Id}$. Then $T$ is firmly nonexpansive if and only if $A$ is monotone.

Fact 7.3.13. [18, Theorem 6.6] Let $T : \mathbb{R}^n \to \mathbb{R}^n$. Then $T$ is the resolvent of a maximally cyclically monotone operator $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ if and only if $T$ has full domain, $T$ is firmly nonexpansive, and for every set of points $\{x_1, \ldots, x_m\}$ where the integer $m \geq 2$ and $x_{m+1} = x_1$, one has
\[
\sum_{i=1}^{m} \langle x_i - Tx_i, Tx_i - Tx_{i+1}\rangle \geq 0.
\]

Fact 7.3.14. [21, Theorem 22.14] Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$. Then $A$ is maximally cyclically monotone if and only if there exists $f \in \Gamma_0(\mathbb{R}^n)$ such that $A = \partial f$.

Fact 7.3.15. (Baillon–Haddad Theorem) [16, Corollary 10] Let $\varphi$ be a convex $C^1$ function on $\mathbb{R}^n$. Let $A = \nabla\varphi$. If $A$ is $L$-Lipschitz, then
\[
\langle Au - Av, u - v\rangle \geq \frac{1}{L}\|Au - Av\|^2 \quad \forall u, v \in \mathbb{R}^n.
\]
Hence, $\frac{A}{L} = \nabla\left(\frac{\varphi}{L}\right)$ is firmly nonexpansive and 1-Lipschitz. Consequently, $\frac{A}{L}$ is a proximal mapping:
\[
\frac{A}{L} = P_1 g \quad \text{for some } g \in \Gamma_0(\mathbb{R}^n).
\]

Fact 7.3.16. (Toland–Singer Duality) [203, Corollary 1] Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ and $h \in \Gamma_0(\mathbb{R}^n)$. Then
\[
\inf_{x\in\mathbb{R}^n}\{f(x) - h(x)\} = \inf_{x\in\mathbb{R}^n}\{h^*(x) - f^*(x)\}
\quad\text{and}\quad
\sup_{x\in\mathbb{R}^n}\{h(x) - f(x)\} = \sup_{x\in\mathbb{R}^n}\{f^*(x) - h^*(x)\}.
\]

7.4 Epigraphical limits of quadratic functions on R

One of the main objectives of this chapter is to present epiconvergence properties of generalized linear-quadratic functions and their Moreau envelopes. For the first set of results, we focus on quadratic functions on $\mathbb{R}$. This serves to show the variety of situations that can arise at the epigraphical limit of a sequence of quadratic functions.

Theorem 7.4.1. For all $k \in \mathbb{N}$, let $a_k, b_k, c_k \in \mathbb{R}$ with $a_k \geq 0$, so that
\[
F = \{f_k(x) = a_k x^2 + b_k x + c_k\}_{k=1}^{\infty} \subseteq \Gamma_0(\mathbb{R}).
\]
Then for $r > 0$, we have
\[
e_r f_k(x) = \frac{a_k r}{2a_k + r}x^2 + \frac{b_k r}{2a_k + r}x + c_k - \frac{b_k^2}{2(2a_k + r)}.
\]
(7.4.1)

Moreover, if $k \to \infty$, $f_k \xrightarrow{e} f$ and $f$ is proper, then $e_r f$ is of the form $arx^2 + bx + c$ where
\[
a = \lim_{k\to\infty}\frac{a_k}{2a_k + r}, \quad b = \lim_{k\to\infty}\frac{b_k r}{2a_k + r} \quad\text{and}\quad c = \lim_{k\to\infty}\left[c_k - \frac{b_k^2}{2(2a_k + r)}\right].
\]
This is true even in the case where $a_k \to \infty$ and $f(x) = \iota_{\{b\}}(x) + c$.

Proof. We first compute the Moreau envelope of each function in the sequence:
\[
\begin{aligned}
e_r f_k(x) &= \inf_{y\in\mathbb{R}}\left\{f_k(y) + \frac{r}{2}(y - x)^2\right\}\\
&= \inf_{y\in\mathbb{R}}\left\{\left(a_k + \frac{r}{2}\right)y^2 + (b_k - rx)y + c_k + \frac{r}{2}x^2\right\}.
\end{aligned}
\]
The infimand above is a strictly convex quadratic function, so its minimum can be found by setting the derivative equal to zero and finding critical points. This yields the minimizer $y = \frac{rx - b_k}{2a_k + r}$, which gives
\[
\begin{aligned}
e_r f_k(x) &= \left(a_k + \frac{r}{2}\right)\frac{(rx - b_k)^2}{(2a_k + r)^2} + (b_k - rx)\frac{rx - b_k}{2a_k + r} + c_k + \frac{r}{2}x^2\\
&= \frac{a_k r}{2a_k + r}x^2 + \frac{b_k r}{2a_k + r}x + c_k - \frac{b_k^2}{2(2a_k + r)}.
\end{aligned}
\]
We have $e_r f_k \in \Gamma_0(\mathbb{R})$ for all $k$, since the quadratic coefficient is nonnegative. Now consider the sequence $f_k \xrightarrow{e} f$. By Fact 7.3.8, we need only consider the pointwise convergence of the sequence $\{e_r f_k\}_{k\in\mathbb{N}}$. Since $e_r f(x)$ is finite for all $x$, evaluating (7.4.1) at $x = 0$ and taking the limit as $k \to \infty$ gives us that the constant coefficient $c_k - b_k^2/[2(2a_k + r)]$ converges to some $c \in \mathbb{R}$. We know that $e_r f_k$ is differentiable for all $k$ by Proposition 2.3.5, so $\nabla e_r f_k \to \nabla e_r f$ by Fact 7.3.9. Thus, differentiating (7.4.1) and evaluating at $x = 0$, we take the limit to find that the linear coefficient $b_k r/(2a_k + r)$ also converges, to some $b \in \mathbb{R}$. Finally, evaluating the same derivative at $x = 1$ and taking the limit, we have
\[
\lim_{k\to\infty}\left(\frac{2a_k r}{2a_k + r} + \frac{b_k r}{2a_k + r}\right) = 2r\lim_{k\to\infty}\frac{a_k}{2a_k + r} + b,
\]
so that the coefficient $a_k r/(2a_k + r)$ (which is nonnegative for all $k$) converges to $ar$ for some $a \geq 0$.

There are three possible epigraphical limits for the sequence defined in Theorem 7.4.1. The first is $\operatorname{epi}(bx + c)$, the case where $a_k \to 0$. The second is $\operatorname{epi}(ax^2 + bx + c)$, the case where $a_k \to a > 0$. The third is $\operatorname{epi}(\iota_{\{b\}}(x) + c)$, the case where $a_k \to \infty$. We present three examples here, to illustrate the three possibilities. In all three examples, we set $r = 1$.

Example 7.4.2.
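Define $f_k(x) = \left(1 + \frac{1}{k}\right)x^2 + \left(2 + \frac{1}{k}\right)x + \left(1 + \frac{1}{k}\right)$. Then
\[
e_1 f_k(x) = \frac{k+1}{3k+2}x^2 + \frac{2k+1}{3k+2}x + \frac{2k^2 + 6k + 3}{k(6k+4)}.
\]
Letting $k \to \infty$, we have $f_k \xrightarrow{e} f$ with
\[
f(x) = (x+1)^2, \quad\text{and}\quad e_1 f(x) = \frac{1}{3}(x+1)^2.
\]

Example 7.4.3. Define $g_k(x) = \frac{1}{k}x^2 + \left(1 + \frac{1}{k}\right)x + \frac{1}{k}$. Then
\[
e_1 g_k(x) = \frac{1}{k+2}x^2 + \frac{k+1}{k+2}x + \frac{-k^2 + 3}{2k(k+2)}.
\]
Letting $k \to \infty$, we have $g_k \xrightarrow{e} g$ with
\[
g(x) = x, \quad\text{and}\quad e_1 g(x) = x - \frac{1}{2}.
\]

Example 7.4.4. Define $h_k(x) = kx^2 + \frac{1}{k}x + \frac{1}{k}$. Then
\[
e_1 h_k(x) = \frac{k}{2k+1}x^2 + \frac{1}{k(2k+1)}x + \frac{4k^2 + 2k - 1}{2k^2(2k+1)}.
\]
Letting $k \to \infty$, we have $h_k \xrightarrow{e} h$ with
\[
h(x) = \iota_{\{0\}}(x), \quad\text{and}\quad e_1 h(x) = \frac{1}{2}x^2.
\]

Theorem 7.4.1 leads one to ask which convex functions have quadratic functions as their Moreau envelopes. This question is answered by Proposition 7.4.5 below.

Proposition 7.4.5. On $\mathbb{R}$, every convex quadratic function $f(x) = \alpha x^2 + \beta x + \gamma$, $\alpha \geq 0$, is the Moreau envelope of a function $g$ that is either a convex quadratic function or an indicator function. Specifically, given any proximal parameter $r > 0$ the following hold.

(i) If $0 \leq \alpha < r/2$, then $g(x) = ax^2 + bx + c$, where
\[
a = \frac{\alpha r}{r - 2\alpha}, \quad b = \frac{\beta r}{r - 2\alpha}, \quad c = \gamma + \frac{\beta^2}{2(r - 2\alpha)}.
\]

(ii) If $\alpha = r/2$, then $g(x) = \iota_{\{b\}}(x) + c$, where
\[
b = -\frac{\beta}{r}, \quad c = \gamma - \frac{\beta^2}{2r}.
\]

(iii) If $\alpha > r/2$, then there does not exist $g \in \Gamma_0(\mathbb{R})$ such that $f = e_r g$.

Proof. We need to show the form of $g$ such that $f(x) = e_r g(x)$ for all $x \in \mathbb{R}$ for any choice of $\alpha \geq 0$, $\beta, \gamma \in \mathbb{R}$. By Theorem 7.4.1, we have that
\[
e_r g(x) = \frac{ar}{2a + r}x^2 + \frac{br}{2a + r}x + c - \frac{b^2}{2(2a + r)}.
\]
We equate the coefficients of $f$ accordingly:
\[
\alpha = \frac{ar}{2a + r}, \quad \beta = \frac{br}{2a + r}, \quad \gamma = c - \frac{b^2}{2(2a + r)}. \tag{7.4.2}
\]
Solving $\alpha = ar/(2a + r)$ for $a$ yields $a = \alpha r/(r - 2\alpha)$, $\alpha \neq r/2$.

(i) If $\alpha \in [0, r/2)$, then there is a one-to-one correspondence with $a \in [0, \infty)$. Solving the equations in (7.4.2) gives the expressions for $b$ and $c$.

(ii) If $\alpha = r/2$, then $g(x) = \iota_{\{b\}}(x) + c$:
\[
g(x) = \begin{cases} c, & x = b,\\ \infty, & x \neq b, \end{cases}
\]
\[
\begin{aligned}
e_r g(x) &= \inf_{y}\left\{g(y) + \frac{r}{2}(y - x)^2\right\}\\
&= g(b) + \frac{r}{2}(b - x)^2\\
&= \frac{r}{2}x^2 - brx + \frac{r}{2}b^2 + c.
\end{aligned}
\]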
Generalized linear-quadratic functions on RnEquating β = −br and γ = rb2/2+c,we find that b = −β/r and c = γ−β2/(2r).Then f(x) = erg(x) where g(x) = ι{b}(x) + c.(iii) Suppose for eventual contradiction that α > r/2. Suppose that there existsg ∈ Γ0(R) such that f = erg. By Proposition 2.3.5 and Fact 2.3.2, we have∇erg(x) = r(Id−J∂g/r).Since (Id−J∂g/r) = J(∂g/r)−1 is nonexpansive,∇erg is r-Lipschitz. On the otherhand, we have∇erg(x) = ∇f(x) = 2αx+ β,which is L-Lipschitz only if L ≥ 2α. Hence, r ≥ 2α, which contradicts thecondition that α > r/2. Therefore, there does not exist g ∈ Γ0(R) such thatf = erg.7.5 Generalized linear-quadratic functions on RnNow we move on to finite-dimensional space. One natural goal that arises isthat of unifying f(x) = 12〈x,Ax〉+ 〈b, x〉+ c and f(x) = ι{b}(x) + c in the moregeneral setting of Rn . To do so, we first need to establish several properties ofmonotone linear relations and generalized linear-quadratic functions.7.5.1 Linear relations and generalized linear-quadratic functionsOur first question to consider in this section is: when is a generalized linear-quadratic function well-defined? The following illustrates a case to the contrary,showing that when a linear relation A is not monotone, then 〈x,Ax〉 may not besingle-valued.Example 7.5.1. Define the linear relationA(x1, x2) = {t(1, 1) : t ∈ R} ⊆ R2, ∀(x1, x2) ∈ R2 .Then A is not monotone, and 〈x,Ax〉 is not single-valued.Proof. Let x1 + x2 6= 0. Then〈(x1, x2), A(x1, x2)〉 = {〈(x1, x2), t(1, 1)〉 : t ∈ R}= {t(x1 + x2) : t ∈ R} = R .1047.5. Generalized linear-quadratic functions on RnTherefore, 〈x,Ax〉 is not single-valued. Observe that A is not monotone. Indeed,set t > 0, and choose x1, x2 such that x1 + x2 < 0 and t(1, 1) ∈ A(x1, x2). Notethat (0, 0) ∈ A(0, 0). 
Then〈(x1, x2)− (0, 0), A[(x1, x2)− (0, 0)]〉 = 〈(x1, x2), t(1, 1)〉= t(x1 + x2) < 0.The following fact says that when A is a monotone linear relation, the generalizedlinear-quadratic function that it forms is well-defined.Fact 7.5.2. [213, Proposition 3.2.1] Let A : Rn ⇒ Rn be a linear relation. ThenA is monotone if and only if 〈x,Ax〉 ≥ 0 and 〈x,Ax〉 is single-valued for allx ∈ domA.Fact 7.5.3. [213, Proposition 3.1.3] The operator A is a linear relation if for allα, β ∈ R and for all x, y ∈ Rn we haveA(αx+ βy) = αAx+ βAy +A0.Proposition 7.5.4. Let A be a monotone linear relation. If x, a ∈ domA, then〈x− a,A(x− a)〉 = 〈x,Ax〉 − 〈x,Aa〉 − 〈a,Ax〉+ 〈a,Aa〉.Proof. If A is a monotone linear relation, then A0 ⊂ (domA)⊥ [213, Proposition3.2.1]. When x, a ∈ domA, we have that 〈x,Ax〉, 〈a,Aa〉, 〈x,Aa〉, 〈a,Ax〉 aresingle-valued and that 〈x,A0〉 = 0, 〈a,A0〉 = 0. Then, using Fact 7.5.3, we have〈x− a,A(x− a)〉 = 〈x− a,Ax−Aa+A0〉= 〈x,Ax〉 − 〈a,Ax〉 − 〈x,Aa〉+ 〈a,Aa〉+ 〈x,A0〉 − 〈a,A0〉= 〈x,Ax〉 − 〈a,Ax〉 − 〈x,Aa〉+ 〈a,Aa〉.Example 7.5.5. The following are maximally monotone symmetric linear rela-tions.(i) A symmetric positive definite matrix A : Rn → Rn, and its inverse A−1.This follows from(A−1)∗ = (A∗)−1 = A−1.(ii) The normal cone operator NL : Rn ⇒ Rn, where L ⊂ Rn is a subspace.This is becausegraNL = L× L⊥, gra(NL)∗ = L× L⊥.1057.5. Generalized linear-quadratic functions on Rn7.5.2 Properties and calculus of qAThe generalized linear-quadratic function qA is instrumental in establishing ourmain results. In this section, we collect a number of properties of qA under condi-tions such as maximal monotonicity and symmetry.Lemma 7.5.6. Let A : Rn ⇒ Rn be symmetric. Then A−1 is symmetric.Proof. By definition, A is symmetric if and only if〈x,Ay〉 = 〈Ax, y〉 ∀x, y ∈ domA. (7.5.1)Let u ∈ Ay, v ∈ Ax. Then u, v ∈ ranA = domA−1, and x ∈ A−1v, y ∈ A−1u.Substituting into (7.5.1), we have〈A−1v, u〉 = 〈v,A−1u〉 ∀u, v ∈ domA−1,which is the definition of symmetry of A−1.Lemma 7.5.7. 
Let A1, A2 : Rn ⇒ Rn be maximally monotone linear relations.Then A1 +A2 is a maximally monotone linear relation. If, in addition, A1 and A2are symmetric, then A1 +A2 is symmetric.Proof. Since domA1 and domA2 are linear subspaces of Rn, domA1 − domA2is a closed subspace. By [213, Theorem 7.2.2], A1 + A2 is maximally monotone.Since graA1 and graA2 are linear subspaces, gra(A1 + A2) is a linear subspace.Hence, A1 +A2 is a linear relation. It remains to prove that A1 +A2 is symmetric.Let (x, x∗), (y, y∗) ∈ gra(A1 +A2). Since dom(A1 +A2) = domA1 ∩ domA2,we have x, y ∈ domA1 and x, y ∈ domA2. Then there exist x∗1, y∗1 ∈ ranA1 andx∗2, y∗2 ∈ ranA2 such that(i) (x, x∗1), (y, y∗1) ∈ graA1 and (x, x∗2), (y, y∗2) ∈ graA2, and(ii) x∗1 + x∗2 = x∗ and y∗1 + y∗2 = y∗.This gives us that(x, x∗1+x∗2) = (x, x∗) ∈ gra(A1+A2) and (y, y∗1 +y∗2) = (y, y∗) ∈ gra(A1+A2).Now consider 〈x, y∗〉 − 〈y, x∗〉 :〈x, y∗〉 − 〈y, x∗〉 = 〈x, y∗1〉+ 〈x, y∗2〉 − 〈y, x∗1〉 − 〈y, x∗2〉= (〈x, y∗1〉 − 〈y, x∗1〉) + (〈x, y∗2〉 − 〈y, x∗2〉)= (〈x, y∗1〉 − 〈x, y∗1〉) + (〈x, y∗2〉 − 〈x, y∗2〉)(A1 is symmetric) (A2 is symmetric)= 0.1067.5. Generalized linear-quadratic functions on RnThus, for any (x, x∗), (y, y∗) ∈ gra(A1 + A2) we have that 〈x, y∗〉 = 〈y, x∗〉.Therefore, A1 +A2 is symmetric.Proposition 7.5.8. Let A1, A2 be maximally monotone symmetric linear relationson Rn . Then A∗1 +A∗2 = (A1 +A2)∗.Proof. This proposition can be deduced from [213, Fact 7.1.6], but we provide aproof for the sake of completeness.(⇒) By definition of symmetry, we havegraA1 ⊆ graA∗1. (7.5.2)Since A1 is maximally monotone, A∗1 is also maximally monotone by [20, Corol-lary 5.11]. 
Then (7.5.2) is actually an equality and we haveA1 = A∗1, and similarly A2 = A∗2.Then, by the above and the definition of adjoint, we havegra(A1 +A2)∗ = {(x, x∗) ∈ Rn×n : (x∗,−x) ∈ (gra(A1 +A2))⊥}= {(x, x∗) ∈ Rn×n : (x∗,−x) ∈ (gra(A∗1 +A∗2))⊥}= gra(A∗1 +A∗2)∗.Once more by definition of symmetry, we have gra(A∗1 + A∗2) ⊆ gra(A∗1 + A∗2)∗.Therefore, gra(A∗1 +A∗2) ⊆ gra(A1 +A2)∗.(⇐) We have gra(A1 + A2)∗ ⊆ gra(A∗1 + A∗2)∗ from above, and by symmetrygra(A∗1 +A∗2)∗ ⊆ gra(A∗1 +A∗2)∗∗. Since we are in Rn, gra(A∗1 +A∗2)∗∗ is closed.Thus gra(A∗1 +A∗2)∗∗ = gra(A∗1 +A∗2). The conclusion follows.2Proposition 7.5.9. Let A be a maximally monotone linear relation. Then(i) qA is well-defined, i.e., qA : Rn → R∪{∞};(ii) qA is convex;(iii) qA = qA+ ;(iv) ∂qA = A+.2Thank you to Dr. Walaa Moursi for contributing to this proof.1077.5. Generalized linear-quadratic functions on RnProof. (i) This is direct from Fact 7.5.2.(ii) See [27, Proposition 2.3].(iii) Since A is maximally monotone, A∗ is also maximally monotone [20, Corol-lary 5.11]. By definition of A∗, we have 〈x,Ax〉 = 〈A∗x, x〉 = 〈x,A∗x〉. ThenqA(x) =12〈x,Ax〉 = 12(〈x,Ax〉+ 〈x,A∗x〉2)=12〈x,Ax+A∗x2〉=12〈x,A+x〉 = qA+(x).(iv) This can be deduced as a finite-dimensional case of [213, Corollary 9.1.11],but we provide a short proof for completeness. Since A is maximally monotone,A∗ is as well, hence A+ is as well by Lemma 7.5.7. Then∂qA = ∂qA+ = A+ =12(A+A∗).Lemma 7.5.10. Let A be a maximally monotone symmetric linear relation. Then∂qA = A.Proof. This is a specific case of [213, Theorem 9.2.6] in Rn, with f = qA. Theself-contained proof is the following. Since A is symmetric, we have that A = A∗.The result follows from Proposition 7.5.9(iv).Corollary 7.5.11. Let A1, A2 be maximally monotone symmetric linear relationssuch that qA1 = qA2 . Then A1 = A2.Proof. This follows from ∂qA1 = A1, ∂qA2 = A2.Remark 7.5.12. The maximal monotonicity condition of Corollary 7.5.11 is nec-essary. 
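As a counterexample, consider a monotone selection $S$ of $A$ and set
\[
A_1 = S, \quad A_2 = S + A0.
\]
Then $q_{A_1} = q_{A_2}$, but $A_1 \neq A_2$ unless $A0 = \{0\}$.

Proposition 7.5.13. Let $A_1, A_2$ be monotone linear relations. Then
\[
q_{A_1} + q_{A_2} = q_{A_1 + A_2}.
\]
In addition, if $\operatorname{dom} A_1 \subseteq \operatorname{dom} A_2$ and $A_1 - A_2$ is monotone, then
\[
q_{A_1} - q_{A_2} = q_{A_1 - A_2}.
\]

Proof. By definition, we have
\[
q_{A_1}(x) = \begin{cases} \tfrac{1}{2}\langle x, A_1 x\rangle, & \text{if } x \in \operatorname{dom} A_1,\\ \infty, & \text{otherwise.} \end{cases}
\]
Similarly,
\[
q_{A_2}(x) = \begin{cases} \tfrac{1}{2}\langle x, A_2 x\rangle, & \text{if } x \in \operatorname{dom} A_2,\\ \infty, & \text{otherwise.} \end{cases}
\]
Thus,
\[
(q_{A_1} + q_{A_2})(x) = \begin{cases} \tfrac{1}{2}\langle x, (A_1 + A_2)x\rangle, & \text{if } x \in \operatorname{dom} A_1 \cap \operatorname{dom} A_2,\\ \infty, & \text{otherwise} \end{cases}
\;=\; q_{A_1 + A_2}(x).
\]
Now suppose that $\operatorname{dom} A_1 \subseteq \operatorname{dom} A_2$ and $A_1 - A_2$ is monotone. For $x \in \operatorname{dom} A_1 \subseteq \operatorname{dom} A_2$, we have that $q_{A_1} - q_{A_2}$ is single-valued, so that
\[
q_{A_1}(x) - q_{A_2}(x) = q_{A_1 - A_2}(x).
\]
When $x \notin \operatorname{dom} A_1$, we have
\[
q_{A_1}(x) - q_{A_2}(x) = \infty - q_{A_2}(x) = \infty.
\]
Now
\[
\operatorname{dom} q_{A_1 - A_2} = \operatorname{dom}(A_1 - A_2) = \operatorname{dom} A_1 \cap \operatorname{dom} A_2 = \operatorname{dom} A_1,
\]
so that
\[
q_{A_1 - A_2}(x) = \infty \quad\text{when } x \notin \operatorname{dom} A_1.
\]
Therefore,
\[
q_{A_1} - q_{A_2} = q_{A_1 - A_2}.
\]

The condition $\operatorname{dom} A_1 \subseteq \operatorname{dom} A_2$ is necessary for $q_{A_1} - q_{A_2} = q_{A_1 - A_2}$. The following example shows that Proposition 7.5.13 can fail if $\operatorname{dom} A_1 \not\subseteq \operatorname{dom} A_2$.

Example 7.5.14. Let $A_1, A_2 : \mathbb{R}^2 \rightrightarrows \mathbb{R}^2$ be maximally monotone linear relations given by
\[
A_1 = \operatorname{Id}, \quad A_2 = N_{\mathbb{R}\times\{0\}},
\]
where
\[
N_{\mathbb{R}\times\{0\}}(x, y) = \begin{cases} \{0\}\times\mathbb{R}, & \text{if } y = 0,\\ \varnothing, & \text{if } y \neq 0. \end{cases}
\]
Then
\[
(A_1 - A_2)(x, y) = \begin{cases} (x, 0) - \{0\}\times\mathbb{R}, & \text{if } y = 0,\\ \varnothing, & \text{if } y \neq 0 \end{cases}
\]
is a maximally monotone linear relation. We have
\[
q_{A_1}(0, 1) - q_{A_2}(0, 1) = 1/2 - \infty = -\infty,
\]
but $q_{A_1 - A_2}(0, 1) = \infty$ because $(0, 1) \notin \operatorname{dom}(A_1 - A_2)$.

Proposition 7.5.15. Let $A$ be a maximally monotone symmetric linear relation. Then the following are equivalent:

(i) $q_A(x) = 0$;

(ii) $x \in \operatorname{argmin} q_A$;

(iii) $0 \in \partial q_A(x)$;

(iv) $0 \in Ax$;

(v) $x \in A^{-1}0$.

Proof. (i)⇒(ii) Let $q_A(x) = 0$. Since $A$ is monotone, by Fact 7.5.2 we have $q_A(y) \geq 0$ for all $y \in \mathbb{R}^n$. Hence,
\[
\min_{y\in\mathbb{R}^n} q_A(y) = 0 = q_A(x) \;\Rightarrow\; x \in \operatorname{argmin} q_A.
\]
(ii)⇒(iii) This is direct from Fermat's Theorem.

(iii)⇒(iv) Let $0 \in \partial q_A(x)$. Since $A$ is symmetric and maximally monotone, by Lemma 7.5.10 we have $\partial q_A(x) = A(x)$. Therefore, $0 \in Ax$.

(iv)⇒(i) Let $0 \in Ax$.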
Then, since qA(x) is single-valued by Fact 7.5.2, we haveqA(x) =12〈x,Ax〉 = 12〈x, 0〉 = 0.(iv)⇔(v) We have 0 ∈ Ax⇔ x ∈ A−10.7.5.3 The Fenchel conjugate of qAConjugacy plays a vital role in convex analysis [109, Chapter X]. One oftenfinds it beneficial to work temporarily in a dual space in order to solve a problem,then return the answer to the primal space. In this section, we explore the Fenchelconjugate of qA. We show that the inverse A−1 is convenient for computing q∗A.1107.5. Generalized linear-quadratic functions on RnProposition 7.5.16. LetA : Rn ⇒ Rn be a maximally monotone symmetric linearrelation. Then q∗A = qA−1 , that is,q∗A(y) ={12〈y,A−1y〉, if y ∈ ranA,∞, if y 6∈ ranA.Consequently,q∗∗A (x) ={12〈x,Ax〉, if x ∈ ranA−1,∞, if x 6∈ ranA−1.Thus, q∗∗A = qA.Proof. By the definition of q∗A, we haveq∗A(y) = supx{〈y, x〉 − qA(x)} = supx∈domA{〈y, x〉 − 12〈x,Ax〉}. (7.5.3)Case 1: Let y ∈ ranA. Note that qA is convex by Proposition 7.5.9. Then thesolution to the supremum in (7.5.3) is x¯ such that 0 ∈ ∂ [〈y, x¯〉 − 12〈x¯, Ax¯〉] . Thisgives y ∈ Ax¯, hence, x¯ ∈ A−1y. Thenq∗A(y) = 〈y,A−1y〉 −12〈A−1y, y〉 = 12〈y,A−1y〉.Case 2: Let y 6∈ ranA. Note that since ranA is closed and convex, by [193,Corollary 11.4.2] there exist z ∈ Rn and r ∈ R such that〈z, y〉 > r ≥ supx∈ranA〈x, z〉.Since ranA is a subspace, we have 0 ∈ ranA and r ≥ 0. Also since ranAis a subspace, we have kx ∈ ranA for all x ∈ ranA, for all k ∈ R . Hence,r ≥ k〈x, z〉 for all x ∈ ranA and for all k ∈ R . Thus, 〈x, z〉 = 0 for all x ∈ ranA.Then supx∈ranA〈x, z〉 = 0, hence 〈z, y〉 > 0. Now noting thatsupk>0{〈y, kz〉 − 12〈kz,A(kz)〉}= supk>0〈y, kz〉 =∞, (7.5.4)and that the supremum of (7.5.3) is greater than or equal to that of (7.5.4), we haveq∗A(y) =∞.Corollary 7.5.17. Let A be symmetric, positive definite and nonsingular. Thenq∗A = qA−1 , where A−1 is the classical inverse.1117.5. Generalized linear-quadratic functions on RnCorollary 7.5.18. Let A be symmetric and positive semidefinite. 
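Then $q_A^* = q_{A^{-1}}$ where
\[
q_{A^{-1}}(x) = \begin{cases} \tfrac{1}{2}\langle x, A^{-1}x\rangle, & \text{if } x \in \operatorname{ran} A,\\ \infty, & \text{if } x \notin \operatorname{ran} A, \end{cases}
\]
and $A^{-1}$ is the set-valued inverse.

Proposition 7.5.19. Let $A$ be a maximally monotone symmetric linear relation. Then $q_A^* = q_A$ if and only if $A = A^{-1}$, if and only if $A = \operatorname{Id}$.

Proof. The proof that $A = A^{-1}$ if and only if $A = \operatorname{Id}$ is found in [19, Proposition 2.8]. In the sequel, we prove that $q_A^* = q_A$ if and only if $A = A^{-1}$.

(⇐) Suppose that $A = A^{-1}$. Then we see immediately by Corollary 7.5.18 that $q_A^* = q_A$.

(⇒) Suppose that $q_A^* = q_A$. Then by Lemma 7.5.10, we have $\partial q_A = A = \partial q_A^*$. By Corollary 7.5.18, we find $\partial q_A^* = A^{-1}$. Therefore, $A = A^{-1}$.

Proposition 7.5.20. Let $A_i : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a maximally monotone symmetric linear relation for each $i \in \{1, \ldots, m\}$. Then $f = q_{A_1} \mathbin{\Box} \cdots \mathbin{\Box} q_{A_m}$ is a generalized linear-quadratic function and
\[
\partial f = \left(\sum_{i=1}^{m} A_i^{-1}\right)^{-1},
\]
which is the parallel sum of the $A_i$.

Proof. By Lemmas 7.5.6 and 7.5.7, we have that $A_1^{-1} + \cdots + A_m^{-1}$ is a maximally monotone symmetric linear relation. By Corollary 7.5.18, $q_{A_i}^* = q_{A_i^{-1}}$. Since $0 \in \bigcap_{i=1}^{m} \operatorname{ri}\operatorname{ran} A_i$, [193, Theorem 16.4] gives
\[
\begin{aligned}
q_{(A_1^{-1} + \cdots + A_m^{-1})^{-1}} &= q^*_{A_1^{-1} + \cdots + A_m^{-1}}\\
&= \left(q_{A_1^{-1}} + \cdots + q_{A_m^{-1}}\right)^*\\
&= \left(q_{A_1}^* + \cdots + q_{A_m}^*\right)^*\\
&= q_{A_1} \mathbin{\Box} \cdots \mathbin{\Box} q_{A_m} = f.
\end{aligned}
\]
Therefore, by Lemma 7.5.10, we have the statement of the proposition.

Proposition 7.5.21. Let $A_1 : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a maximally monotone symmetric linear relation and $A_2 : \mathbb{R}^n \to \mathbb{R}^n$ be symmetric positive definite. Define the function $h : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ by
\[
q_{A_2} \mathbin{\Box} h = q_{A_1}. \tag{7.5.5}
\]
Then for every $x \in \mathbb{R}^n$,
\[
h(x) = \left(q_{A_1}^* - q_{A_2}^*\right)^*(x) = \sup_{y}\left\{q_{A_1}(x + y) - q_{A_2}(y)\right\}. \tag{7.5.6}
\]
Consequently, when $A_1^{-1} - A_2^{-1}$ is monotone, one has
\[
\partial h = \left(A_1^{-1} - A_2^{-1}\right)^{-1}, \tag{7.5.7}
\]
which is the star-difference of $A_1$ and $A_2$.

Proof. Taking the Fenchel conjugate of (7.5.5) yields $h^* = q_{A_1}^* - q_{A_2}^*$. Then by Fact 7.3.16, we have (7.5.6). Observe that $A_1^{-1} - A_2^{-1}$ is maximally monotone because of the following. We have
\[
\operatorname{dom}(A_1^{-1} - A_2^{-1}) = \operatorname{ran} A_1 \cap \operatorname{ran} A_2 = \operatorname{ran} A_1 = \operatorname{dom} A_1^{-1}, \quad\text{and}\quad (A_1^{-1} - A_2^{-1})(0) = A_1^{-1}(0).
\]
Because $A_1^{-1}$ is maximally monotone, $(\operatorname{dom} A_1^{-1})^\perp = A_1^{-1}(0)$.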
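Then by [28, Fact 2.4(v)], $A_1^{-1} - A_2^{-1}$ is maximally monotone. Since $A_1^{-1} - A_2^{-1}$ is a maximally monotone symmetric linear relation and $q_{A_i}^* = q_{A_i^{-1}}$, $q_{A_1^{-1}} - q_{A_2^{-1}} = q_{A_1^{-1} - A_2^{-1}}$, we have (7.5.7).

Remark 7.5.22. This result generalizes that of Hiriart-Urruty [108, Example 2.7], because $A_1^{-1} - A_2^{-1}$ need not be positive definite.

7.5.4 Relating the set-valued inverse and the Moore–Penrose inverse

The set-valued inverse $A^{-1}$ of a linear mapping and the Moore–Penrose inverse $A^\dagger$ both have their uses. For properties of $A^\dagger$, see [153, p. 423–428]. In this section, we show how the two inverses are closely related. We also include a description of the Moore–Penrose inverse for a particular mapping, the orthogonal projector.

Proposition 7.5.23. The following hold.

(i) Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a linear mapping. Then
\[
A^{-1}x = \begin{cases} A^\dagger x + A^{-1}0, & \text{if } x \in \operatorname{ran} A,\\ \varnothing, & \text{if } x \notin \operatorname{ran} A. \end{cases}
\]

(ii) Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be maximally monotone. Then
\[
A^{-1}x = A^\dagger x + N_{\operatorname{dom} A^{-1}}(x) = \begin{cases} A^\dagger x + (\operatorname{ran} A)^\perp, & \text{if } x \in \operatorname{ran} A,\\ \varnothing, & \text{if } x \notin \operatorname{ran} A. \end{cases}
\]

(iii) Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be monotone, symmetric and linear. Then
\[
A^{-1} = P_{\operatorname{ran} A} A^\dagger P_{\operatorname{ran} A} + N_{\operatorname{dom} A^{-1}}.
\]

Proof. (i) Since $AA^\dagger$ is the orthogonal projector onto $\operatorname{ran} A$, it holds that for all $x \in \operatorname{ran} A$,
\[
AA^\dagger x = P_{\operatorname{ran} A}x = x \;\Rightarrow\; A^\dagger x \in A^{-1}x.
\]
Since $A^{-1}x = x^* + A^{-1}0$ for every $x^* \in A^{-1}x$, we have
\[
A^{-1} = A^\dagger + A^{-1}0 \quad\text{on } \operatorname{ran} A.
\]

(ii) Since $A$ is maximally monotone, $A^{-1}$ is as well, and
\[
(\operatorname{dom} A^{-1})^\perp = A^{-1}0.
\]
Applying part (i) completes the proof.

(iii) If $A$ is maximally monotone and linear, then
\[
\operatorname{ran} A^\dagger = \operatorname{ran} A^\top = \operatorname{ran} A^* = \operatorname{ran} A, \quad\text{and}\quad N_{\operatorname{dom} A^{-1}}(x) = (\operatorname{dom} A^{-1})^\perp = A^{-1}0.
\]
This implies that on $\operatorname{ran} A = \operatorname{dom} A^{-1}$,
\[
P_{\operatorname{ran} A} A^\dagger P_{\operatorname{ran} A} = A^\dagger.
\]
Now we apply part (ii). Let $x \in \mathbb{R}^n$, $u = P_{\operatorname{ran} A}x$. Denote $P_{\operatorname{ran} A} A^\dagger P_{\operatorname{ran} A}$ by $L$. We have that $L$ is symmetric, because
\[
(A^\dagger)^* = (A^*)^\dagger = A^\dagger.
\]
Using $AA^\dagger A = A$, $L$ is monotone because
\[
\langle x, Lx\rangle = \langle P_{\operatorname{ran} A}x, A^\dagger P_{\operatorname{ran} A}x\rangle = \langle Au, A^\dagger Au\rangle = \langle u, AA^\dagger Au\rangle = \langle u, Au\rangle \geq 0.
\]
Therefore, using the symmetry and linearity of $L$,
\[
\begin{aligned}
\langle Ly - Lx, y - x\rangle &= \langle y, Ly\rangle + \langle x, Lx\rangle - \langle x, Ly\rangle - \langle y, Lx\rangle\\
&= \langle y, Ly\rangle + \langle x, Lx\rangle - 2\langle x, Ly\rangle\\
&= \langle y - x, L(y - x)\rangle \geq 0.
\end{aligned}
\]
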
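The decomposition in Proposition 7.5.23(i) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small singular symmetric positive semidefinite matrix chosen purely for illustration; the helper name `set_valued_inverse` is hypothetical and is not part of the thesis. It represents $A^{-1}x$ as a particular solution $A^\dagger x$ plus the span of a basis of $A^{-1}0 = \ker A$, and returns `None` when $x \notin \operatorname{ran} A$.

```python
import numpy as np


def set_valued_inverse(A, x, tol=1e-10):
    """Describe A^{-1}x as (particular, kernel_basis), meaning
    A^{-1}x = particular + span(columns of kernel_basis).
    Returns None when x is not in ran A, i.e. A^{-1}x is empty."""
    particular = np.linalg.pinv(A) @ x  # A^dagger x
    # A A^dagger is the orthogonal projector onto ran A,
    # so x lies in ran A exactly when A A^dagger x = x.
    if np.linalg.norm(A @ particular - x) > tol * max(1.0, np.linalg.norm(x)):
        return None
    # Kernel of A: right singular vectors belonging to zero singular values.
    _, singular_values, vt = np.linalg.svd(A)
    rank = int(np.sum(singular_values > tol))
    kernel_basis = vt[rank:].T
    return particular, kernel_basis


# Illustrative singular, symmetric, positive semidefinite matrix (an assumption).
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
x_in = np.array([4.0, 3.0, 0.0])   # lies in ran A
x_out = np.array([0.0, 0.0, 1.0])  # does not lie in ran A

result_in = set_valued_inverse(A, x_in)
result_out = set_valued_inverse(A, x_out)
```

For this symmetric matrix, $\ker A = (\operatorname{ran} A)^\perp$, so the same output also illustrates the form of part (ii): every element of `particular + span(kernel_basis)` is mapped by `A` back to `x_in`.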
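In [21, Exercise 3.13], for a linear mapping $A$, one has
\[
A^\dagger = P_{\operatorname{ran} A^*} A^{-1} P_{\operatorname{ran} A}. \tag{7.5.8}
\]
For a set $S \subset \mathbb{R}^n$, define the indicator mapping $\iota_S : \mathbb{R}^n \to \mathbb{R}^n$ of $S$ relative to $\mathbb{R}^n$ by
\[
\iota_S(x) = \begin{cases} 0 \in \mathbb{R}^n, & \text{if } x \in S,\\ \varnothing, & \text{if } x \notin S. \end{cases}
\]
(See, e.g., [160].) Combining (7.5.8) and Proposition 7.5.23, we obtain a complete relationship between $A^{-1}$ and $A^\dagger$.

Corollary 7.5.24. (i) When $A$ is a linear mapping on $\mathbb{R}^n$,
\[
A^{-1} = A^\dagger + \iota_{\operatorname{dom} A^{-1}} + A^{-1}0, \quad\text{and}\quad A^\dagger = P_{\operatorname{ran} A^*} A^{-1} P_{\operatorname{ran} A}.
\]
(ii) If, in addition, $A$ is maximally monotone, then
\[
A^{-1} = A^\dagger + N_{\operatorname{dom} A^{-1}}, \quad\text{and}\quad A^\dagger = P_{\operatorname{ran} A} A^{-1} P_{\operatorname{ran} A}.
\]

Corollary 7.5.24(i) is a consequence of Proposition 7.5.23.

Corollary 7.5.25. Let $A$ be a maximally monotone symmetric linear relation. Then
\[
(q_A)^*(x) = \begin{cases} q_{A^\dagger}(x), & \text{if } x \in \operatorname{ran} A,\\ \infty, & \text{if } x \notin \operatorname{ran} A. \end{cases}
\]

In the sequel, we present the Moore–Penrose inverse of the projector mapping.

Proposition 7.5.26. Let $P_L$ be the orthogonal projector onto a subspace $L \subseteq \mathbb{R}^n$. Then the Moore–Penrose generalized inverse satisfies $P_L^\dagger = P_L$.

Proof. By [197, Theorem 10.5], $P_L$ is idempotent ($P_L^2 = P_L$) and Hermitian ($P_L^* = P_L$). Since $P_L^\dagger$ is the unique operator that satisfies the four Moore–Penrose equations, we verify that each of them is satisfied by $P_L^\dagger = P_L$:

(i) $AA^\dagger A = A$: $P_L P_L P_L = P_L$,

(ii) $A^\dagger AA^\dagger = A^\dagger$: $P_L P_L P_L = P_L$,

(iii) $(AA^\dagger)^* = AA^\dagger$: $(P_L P_L)^* = P_L^* = P_L = P_L P_L$,

(iv) $(A^\dagger A)^* = A^\dagger A$: $(P_L P_L)^* = P_L^* = P_L = P_L P_L$.

Therefore, $P_L^\dagger = P_L$.

Corollary 7.5.27. Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\} : x \mapsto \tfrac{1}{2}\langle x, P_L x\rangle$, where $L$ is a subspace. Then
\[
f^*(x^*) = \begin{cases} \tfrac{1}{2}\langle x^*, P_L x^*\rangle, & \text{if } x^* \in L,\\ \infty, & \text{if } x^* \notin L \end{cases}
\;=\; \begin{cases} \tfrac{1}{2}\langle x^*, x^*\rangle, & \text{if } x^* \in L,\\ \infty, & \text{if } x^* \notin L. \end{cases}
\]

Proof. The proof combines Proposition 7.5.26 and Corollary 7.5.25.

We end this section with an extension of a result of Rockafellar and Wets [196, Example 11.10].

Example 7.5.28. Let $A : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ be a symmetric monotone linear relation, $b \in \mathbb{R}^n$, $c \in \mathbb{R}$. Suppose $f(x) = q_A(x) + \langle b, x\rangle + c$. Then for every $y \in \mathbb{R}^n$,
\[
f^*(y) = q_{A^{-1}}(y - b) - c = \begin{cases} \tfrac{1}{2}\langle y - b, A^\dagger(y - b)\rangle - c, & \text{if } y - b \in \operatorname{ran} A,\\ \infty, & \text{if } y - b \notin \operatorname{ran} A. \end{cases}
\]

Proof.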
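Applying Proposition 7.5.16, we have for all $y \in \mathbb{R}^n$,
\[
f^*(y) = (q_A)^*(y - b) - c = q_{A^{-1}}(y - b) - c.
\]
By Proposition 7.5.23(ii), $A^{-1} = A^\dagger + N_{\operatorname{ran} A}$. This gives
\[
f^*(y) = \begin{cases} \tfrac{1}{2}\langle y - b, A^\dagger(y - b)\rangle - c, & \text{if } y - b \in \operatorname{ran} A,\\ \infty, & \text{if } y - b \notin \operatorname{ran} A. \end{cases}
\]

Remark 7.5.29. In [196, Example 11.10], the authors assume that $A \in \mathbb{R}^{n\times n}$, i.e., $A$ is a linear operator. In Example 7.5.28, $A$ is a linear relation.

7.5.5 Characterizations of Moreau envelopes

In this section, we present several useful properties of Moreau envelopes of convex functions. We identify the form of the Moreau envelope for quadratic functions, and provide a characterization of Moreau envelopes that involves Lipschitz continuity. This leads to a sum rule for Moreau envelopes of convex functions. Then we follow up with one of the main results of this chapter: Theorem 7.5.39 is a characterization relating generalized linear-quadratic functions to nonexpansive mappings.

Proposition 7.5.30. Let $f \in \Gamma_0(\mathbb{R}^n)$. Then $f = e_r g$ for some $g \in \Gamma_0(\mathbb{R}^n)$ if and only if $\nabla f$ is $r$-Lipschitz.

Proof. (⇒) Let $f = e_r g$ for some $g \in \Gamma_0(\mathbb{R}^n)$. Then by Proposition 2.3.5(ii) we have $\nabla f = r(\operatorname{Id} - P_r g)$. Let $x, y \in \mathbb{R}^n$. Then
\[
\begin{aligned}
\|\nabla f(x) - \nabla f(y)\| &= r\|x - P_r g(x) - y + P_r g(y)\|\\
&= r\left\|\frac{1}{r}P_{\frac{1}{r}}g^*(rx) - \frac{1}{r}P_{\frac{1}{r}}g^*(ry)\right\| && \text{(Proposition 2.3.5(v))}\\
&= \left\|P_{\frac{1}{r}}g^*(rx) - P_{\frac{1}{r}}g^*(ry)\right\|\\
&= \|J_{r\partial g^*}(rx) - J_{r\partial g^*}(ry)\| && \text{(Fact 2.3.2)}\\
&\leq \|rx - ry\| = r\|x - y\|.
\end{aligned}
\]
Therefore, $\nabla f$ is $r$-Lipschitz.

(⇐) Let $\nabla f$ be $r$-Lipschitz. Then by Fact 7.3.15, $\frac{1}{r}\nabla f$ is firmly nonexpansive. By Fact 7.3.12, $\frac{1}{r}\nabla f = (\operatorname{Id} + A)^{-1}$ for some $A$ monotone. Since $f \in \Gamma_0(\mathbb{R}^n)$, $A$ is in fact maximally cyclically monotone by Fact 7.3.13. Thus, $A = \partial\tilde{g}$ for some $\tilde{g} \in \Gamma_0(\mathbb{R}^n)$ by Fact 7.3.14. Hence,
\[
\nabla f = r(\operatorname{Id} + \partial\tilde{g})^{-1} = \left[(\operatorname{Id} + \partial\tilde{g}) \circ \left(\frac{\operatorname{Id}}{r}\right)\right]^{-1}.
\]
Then we have
\[
\partial f^* = (\nabla f)^{-1} = (\operatorname{Id} + \partial\tilde{g})\left(\frac{\operatorname{Id}}{r}\right) = \frac{\operatorname{Id}}{r} + \partial\tilde{g}\left(\frac{\operatorname{Id}}{r}\right),
\]
so that
\[
f^* = \frac{q}{r} + r\tilde{g}\left(\frac{\cdot}{r}\right) + c, \quad c \in \mathbb{R}.
\]
Taking the conjugate of both sides yields
\[
\begin{aligned}
f &= \left[\frac{q}{r} + r\tilde{g}\left(\frac{\cdot}{r}\right) + c\right]^*
= \left(\frac{q}{r}\right)^* \mathbin{\Box} \left[r\tilde{g}\left(\frac{\cdot}{r}\right) + c\right]^*\\
&= (rq) \mathbin{\Box} (r\tilde{g}^* - c) = e_r(r\tilde{g}^* - c),
\end{aligned}
\]
where $\tilde{g}^* \in \Gamma_0(\mathbb{R}^n)$. Therefore, with $g = r\tilde{g}^* - c \in \Gamma_0(\mathbb{R}^n)$, we have $f = e_r g$.

Corollary 7.5.31.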
Let r1, r2 > 0, g, h ∈ Γ0(Rn). Thener1g + er2h = er1+r2f (7.5.9)for some f ∈ Γ0(Rn). Specifically,f(x) = supv∈Rn{[er1g(x+ v)− r1q(v)] + [er2h(x+ v)− r2q(v)]} .1177.5. Generalized linear-quadratic functions on RnProof. Denote er1g, er2h by g¯, h¯, respectively. Then by Proposition 7.5.30, ∇g¯ isr1-Lipschitz and∇h¯ is r2-Lipschitz. Hence,∇f¯ is (r1 + r2)-Lipschitz, wheref¯ = g¯ + h¯ = er1g + er2h.Applying Proposition 7.5.30 again, for some f ∈ Γ0(Rn) we have f¯ = er1+r2f .Now to find f, we apply the Fenchel conjugate to (7.5.9):f∗ +qr1 + r2= (er1g + er2h)∗f =[(er1g + er2h)∗ − qr1 + r2]∗.By Toland-Singer duality for the conjugate of a difference [21, Corollary 14.19],we obtain that for every x ∈ Rn,f(x) = supv∈Rn{(er1g + er2h)∗∗(x+ v)−(qr1 + r2)∗(v)}= supv∈Rn{(er1g + er2h)(x+ v)− (r1 + r2)q(v)}= supv∈Rn{[er1g(x+ v)− r1q(v)] + [er2h(x+ v)− r2q(v)]} .Remark 7.5.32. Corollary 7.5.31 gives us that for r > 0 and f ∈ Γ0(Rn),f =[(erf)∗ − qr]∗.Therefore, by Fact 7.3.16, for every x ∈ Rn we havef(x) = supv∈Rn{erf(x+ v)− rq(v)}.This is the Hiriart-Urruty deconvolution [108].Proposition 7.5.33. Let A ∈ Sn+. Then the following are equivalent:(i) A is nonexpansive, i.e. ‖Ax−Ay‖ ≤ ‖x− y‖ for all x, y ∈ Rn;(ii) A is firmly nonexpansive, i.e. ‖Ax − Ay‖2 ≤ 〈x − y,Ax − Ay〉 for allx, y ∈ Rn;(iii) A = (P + Id)−1 for some maximally monotone linear relation P.1187.5. Generalized linear-quadratic functions on RnProof. Denote the eigenvalues of A by λ1, λ2, . . . , λn. Since A ∈ Sn+, all eigen-values are real and nonnegative (see [153] Section 7.6).(i)⇔(ii) Suppose that statement (i) is true. Then, letting z = x − y and squaringboth sides, we have‖Az‖2 ≤ ‖z‖2⇔ 〈z,A>Az〉 ≤ 〈z, z〉⇔ 〈z,A2z〉 ≤ 〈z, z〉⇔ 〈z, (Id−A2)z〉 ≥ 0 for all z ∈ Rn .The inequality above is equivalent to the statement Id−A2 ∈ Sn+, so 1 − λ2i ≥ 0for all i ∈ {1, 2, . . . , n}. SinceA ∈ Sn+, we have λi ≥ 0 for all i. Hence, statement(i) is equivalent to the following:0 ≤ λi ≤ 1 for all i ∈ {1, 2, . . . , n}. 
(7.5.10)

Now suppose that statement (ii) is true. This gives
\[
\begin{aligned}
\langle z, A^\top Az\rangle \leq \langle z, Az\rangle
&\iff \langle z, A^2 z\rangle \leq \langle z, Az\rangle\\
&\iff \langle z, (A - A^2)z\rangle \geq 0 \quad\text{for all } z \in \mathbb{R}^n.
\end{aligned}
\]
Then $\lambda_i - \lambda_i^2 \geq 0$, that is, $\lambda_i(1 - \lambda_i) \geq 0$ for all $i \in \{1, 2, \ldots, n\}$, so that $0 \leq \lambda_i \leq 1$. Hence, statement (ii) is equivalent to (7.5.10).

(ii)⇔(iii) Suppose that statement (ii) is true. Then Fact 7.3.12 gives us that $A = (\operatorname{Id} + P)^{-1}$ for some maximally monotone relation $P$. Since $A$ is a matrix, we have that $A$ is linear, so that $A^{-1}$ is a linear relation. Note that the matrix inverse of $A$ may not exist; here we are referring to the general set-valued inverse of $A$. Then we have $\operatorname{Id} + P = A^{-1} \Rightarrow P = A^{-1} - \operatorname{Id}$, so that $P$ is a linear relation. Thus statement (ii) implies statement (iii). Conversely, supposing that statement (iii) is true and applying Fact 7.3.12, statement (ii) is immediately implied.

Existence of a Moreau envelope is closely tied to nonexpansiveness, as the following proposition and example demonstrate, and as Theorem 7.5.39 ultimately concludes.

Proposition 7.5.34. If $f(x) = \frac{r}{2}\langle x, Ax\rangle + \langle b, x\rangle + c \in \Gamma_0(\mathbb{R}^n)$ is the Moreau envelope with proximal parameter $r$ of a proper lsc convex function, then $A \in S^n_+$ is nonexpansive.

Proof. Suppose that $f$ is the Moreau envelope with proximal parameter $r$ of some $g \in \Gamma_0(\mathbb{R}^n)$. Then by Proposition 7.5.30, $\nabla f$ is $r$-Lipschitz. So for all $x, y \in \mathbb{R}^n$, we have
\[
\begin{aligned}
\|\nabla f(x) - \nabla f(y)\| &\leq r\|x - y\|\\
\|(rAx + b) - (rAy + b)\| &\leq r\|x - y\|\\
\|Ax - Ay\| &\leq \|x - y\|.
\end{aligned}
\]
Therefore, $A$ is nonexpansive.

Example 7.5.35. Let $A = \begin{bmatrix} 3 & 0\\ 0 & 3 \end{bmatrix}$, $f(x) = \tfrac{1}{2}\langle x, Ax\rangle$. Then there does not exist $g \in \Gamma_0(\mathbb{R}^2)$ such that $f(x) = e_1 g(x)$. However, $f(x) = e_3 g(x)$, where $g(x) = \tfrac{1}{2}\langle x, x\rangle$.

Proof. Using proximal parameter $r = 1$, we know that there cannot exist $g$ with $f = e_1 g$ as a direct consequence of Proposition 7.5.34, since $A$ is not nonexpansive.
However, rearranging the expression as f(x) = 32〈x, Idx〉 gives a largerproximal parameter r˜ = 3 and a nonexpansive matrix Id, so there does existg ∈ Γ0(R2) such that f(x) = e3g(x).7.5.6 A characterization of generalized linear-quadratic functionsIn this section, we present a characterization of convex generalized linear-quadratic functions. We begin with a theorem that explicitly determines the Moreauenvelope of a generalized linear-quadratic function.Theorem 7.5.36. Let A : Rn ⇒ Rn be a maximally monotone symmetric linearrelation. Let a, b ∈ Rn, c ∈ R, r > 0. Define the generalized linear-quadraticfunctionf(x) =r2〈x− a,A(x− a)〉+ 〈b, x〉+ c.Then for every x ∈ Rn,erf(x) = rq(Id +A−1)−1(x− a− br)+ 〈b, x〉 − 1rq(b) + c.Proof. By Proposition 2.3.4, we haveerf = re1(f/r) = re1(qA(· − a) + 〈·, b/r〉+ c/r)= r[e1(qA(· − a))(· − b/r) + 〈·, b/r〉 − q(b/r) + c/r]= r[q(Id +A−1)−1(· − b/r − a) + 〈·, b/r〉 − q(b/r) + c/r]= rq(Id +A−1)−1(· − a− b/r) + 〈·, b〉 −1rq(b) + c.1207.5. Generalized linear-quadratic functions on RnCorollary 7.5.37. Let A : Rn ⇒ Rn be a maximally monotone symmetric linearrelation. Then(i) e1(qA) = q(Id +A−1)−1 ;(ii) qA = e1g for some g ∈ Γ0(Rn) if and only if A is nonexpansive.Proof. (i) This follows from Theorem 7.5.36 with a = b = c = 0 and r = 1.(ii) This follows from part (i) above and Proposition 7.5.30 with r = 1.Proposition 7.5.38. Let f(x) ∈ Γ0(Rn) be quadratic, i.e. f(x) = r2〈x,Ax〉 +〈b, x〉 + c, A ∈ Sn+, b ∈ Rn, c ∈ R, r > 0. Then erf ∈ Γ0(Rn) and erf isquadratic. Specifically,erf(x) =r2〈x, [Id−(Id +A)−1]x〉+ 〈b, (Id +A)−1x〉 − 12r〈b, (Id +A)−1b〉+ c,where Id−(Id +A)−1 ∈ Sn+.Proof. 
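Applying Theorem 7.5.36 with $a = 0$ and denoting $(\operatorname{Id} + A)^{-1}$ as $B$, we have
\[
\begin{aligned}
e_r f(x) &= rq_{(\operatorname{Id} + A^{-1})^{-1}}\left(x - \frac{b}{r}\right) + \langle b, x\rangle - \frac{1}{r}q(b) + c\\
&= \frac{r}{2}\left\langle x - \frac{b}{r}, (\operatorname{Id} + A^{-1})^{-1}\left(x - \frac{b}{r}\right)\right\rangle + \langle b, x\rangle - \frac{1}{2r}\langle b, b\rangle + c\\
&= \frac{r}{2}\left\langle x - \frac{b}{r}, [\operatorname{Id} - (\operatorname{Id} + A)^{-1}]\left(x - \frac{b}{r}\right)\right\rangle + \langle b, x\rangle - \frac{1}{2r}\langle b, b\rangle + c\\
&= \frac{r}{2}\langle x, x\rangle - \langle x, b\rangle + \frac{1}{2r}\langle b, b\rangle - \frac{r}{2}\langle x, Bx\rangle + \langle x, Bb\rangle - \frac{1}{2r}\langle b, Bb\rangle + \langle b, x\rangle - \frac{1}{2r}\langle b, b\rangle + c\\
&= \frac{r}{2}\langle x, (\operatorname{Id} - B)x\rangle + \langle x, Bb\rangle - \frac{1}{2r}\langle b, Bb\rangle + c\\
&= \frac{r}{2}\langle x, [\operatorname{Id} - (\operatorname{Id} + A)^{-1}]x\rangle + \langle b, (\operatorname{Id} + A)^{-1}x\rangle - \frac{1}{2r}\langle b, (\operatorname{Id} + A)^{-1}b\rangle + c.
\end{aligned}
\]
Since $\operatorname{Id} - (\operatorname{Id} + A)^{-1} = (\operatorname{Id} + A^{-1})^{-1}$ is monotone and symmetric, it follows that $\operatorname{Id} - (\operatorname{Id} + A)^{-1} \in S^n_+$ and the proof is complete.

Theorem 7.5.39. Let $f$ be a convex quadratic function:
\[
f(x) = \frac{r}{2}\langle x, Ax\rangle + \langle b, x\rangle + c, \quad A \in S^n_+,\ b \in \mathbb{R}^n,\ c \in \mathbb{R},\ r > 0.
\]
Then $A$ is nonexpansive if and only if $f = e_r g$ where $g$ is a generalized linear-quadratic function:
\[
g(x) = \begin{cases} \frac{r}{2}\langle x, P^{-1}x\rangle + \langle t, x\rangle + s, & \text{if } x \in \operatorname{dom} P^{-1},\\ \infty, & \text{if } x \notin \operatorname{dom} P^{-1}, \end{cases}
\]
with $P^{-1}$ a monotone linear relation. This includes the case $g(x) = \iota_{\{t\}}(x) + s$. Specifically, $g$ is as follows.

(i) If $A = \operatorname{Id}$, then $g(x) = \iota_{\{t\}}(x) + s = q_{N_{\{0\}}}(x - t) + s$, where
\[
t = -\frac{b}{r} \quad\text{and}\quad s = c - \frac{1}{2r}\langle b, b\rangle. \tag{7.5.11}
\]

(ii) The matrix $A \in S^n_+ \setminus \{\operatorname{Id}\}$ is nonexpansive if and only if
\[
g(x) = \frac{r}{2}\langle x, P^{-1}x\rangle + \langle t, x\rangle + s,
\]
where
\[
P^{-1} = (\operatorname{Id} - A)^{-1} - \operatorname{Id}, \quad t \in (\operatorname{Id} - A)^{-1}b, \quad s = c + \frac{1}{2r}\langle b, (\operatorname{Id} - A)^{-1}b\rangle. \tag{7.5.12}
\]

Proof. (i) Let $g(x) = \iota_{\{t\}}(x) + s$. Then
\[
\begin{aligned}
e_r g(x) &= \inf_{y}\left\{g(y) + \frac{r}{2}\|y - x\|^2\right\}\\
&= g(t) + \frac{r}{2}\|t - x\|^2\\
&= s + \frac{r}{2}\langle t - x, t - x\rangle\\
&= s + \frac{r}{2}\left(\langle t, t\rangle - 2\langle t, x\rangle + \langle x, x\rangle\right)\\
&= \frac{r}{2}\langle x, \operatorname{Id}x\rangle - r\langle t, x\rangle + \frac{r}{2}\langle t, t\rangle + s.
\end{aligned}
\]
Equating
\[
A = \operatorname{Id}, \quad b = -rt, \quad\text{and}\quad c = \frac{r}{2}\langle t, t\rangle + s, \tag{7.5.13}
\]
we have that for any choice of $b \in \mathbb{R}^n$ and $c \in \mathbb{R}$, there exists $g = \iota_{\{t\}}(\cdot) + s$ such that
\[
f(x) = \frac{r}{2}\langle x, x\rangle + \langle b, x\rangle + c = e_r g(x) \quad\text{for all } x \in \mathbb{R}^n.
\]
The equations in (7.5.11) are obtained by solving the equations in (7.5.13) for $t$ and $s$.

(ii) By Proposition 7.5.34, if there exists $g \in \Gamma_0(\mathbb{R}^n)$ such that $f(x) = e_r g(x)$, then $A$ is nonexpansive. Then by Fact 7.3.15, $A = (\operatorname{Id} + P)^{-1}$ for some maximally monotone operator $P$. Since $A \in S^n_+$, $P$ is a symmetric linear relation by Proposition 7.5.33.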
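Now using the general set-valued inverse $P^{-1}$, we set
\[
g(x) = \begin{cases} \frac{r}{2}\langle x, P^{-1}x\rangle + \langle t, x\rangle + s, & \text{if } x \in \operatorname{dom} P^{-1},\\ \infty, & \text{if } x \notin \operatorname{dom} P^{-1}. \end{cases}
\]
This function $g$ is well-defined due to Fact 7.5.2. Since $P$ is a monotone linear relation, the function
\[
h(x) = \begin{cases} \frac{1}{2}\langle x, P^{-1}x\rangle, & \text{if } x \in \operatorname{dom} P,\\ \infty, & \text{if } x \notin \operatorname{dom} P \end{cases}
\]
is single-valued. Then by Proposition 7.5.38 and Fact 7.3.10, we have
\begin{align*}
e_r g(x) &= \frac{r}{2}\langle x, [\operatorname{Id} - (\operatorname{Id} + P^{-1})^{-1}]x\rangle + \langle t, (\operatorname{Id} + P^{-1})^{-1}x\rangle - \frac{1}{2r}\langle t, (\operatorname{Id} + P^{-1})^{-1}t\rangle + s\\
&= \frac{r}{2}\langle x, (\operatorname{Id} + P)^{-1}x\rangle + \langle t, (\operatorname{Id} + P^{-1})^{-1}x\rangle - \frac{1}{2r}\langle t, (\operatorname{Id} + P^{-1})^{-1}t\rangle + s.
\end{align*}
Equating
\[
A = (\operatorname{Id} + P)^{-1}, \quad b = (\operatorname{Id} + P^{-1})^{-1}t, \quad c = s - \frac{1}{2r}\langle t, (\operatorname{Id} + P^{-1})^{-1}t\rangle, \tag{7.5.14}
\]
we have that the function $f = \frac{r}{2}\langle \cdot, A\cdot\rangle + \langle b, \cdot\rangle + c$ is the Moreau envelope of the function $g = \frac{r}{2}\langle \cdot, P^{-1}\cdot\rangle + \langle t, \cdot\rangle + s$. The equations in (7.5.12) are obtained by solving the equations in (7.5.14) for $P^{-1}$, $t$, and $s$.

Theorem 7.5.40. The function $f \in \Gamma_0(\mathbb{R}^n)$ is a generalized linear-quadratic function if and only if $e_r f \in \Gamma_0(\mathbb{R}^n)$ is a quadratic function. Specifically,
\[
e_r f(x) = \frac{1}{2}\langle x, Ax\rangle + \langle b, x\rangle + c \quad \forall x \in \mathbb{R}^n
\]
with $A \in \mathbb{S}^n_+$, if and only if
\[
f(x) = q_{(A^{-1} - \operatorname{Id}/r)^{-1}}\Big(x + \frac{b}{r}\Big) + \langle b, x\rangle + c + \frac{1}{2r}\|b\|^2 \quad \forall x \in \mathbb{R}^n.
\]

Proof. ($\Rightarrow$) This is the statement of Theorem 7.5.36.
($\Leftarrow$) Let $e_r f(x) = \frac{1}{2}\langle x, Ax\rangle + \langle b, x\rangle + c$, with $A$ symmetric, linear and monotone, $b \in \mathbb{R}^n$, $c \in \mathbb{R}$. Then
\[
(e_r f)^* = f^* + \frac{1}{r} q \quad\text{and}\quad (q_A + \langle \cdot, b\rangle + c)^* = q_{A^{-1}}(\cdot - b) - c.
\]
This gives us that
\[
f^* = q_{A^{-1} - \operatorname{Id}/r}(\cdot - b) - \langle \cdot, b/r\rangle - c + q(b)/r.
\]
It follows that
\begin{align*}
f &= \big(q_{A^{-1} - \operatorname{Id}/r}(\cdot - b)\big)^*(\cdot + b/r) + c - q(b)/r\\
&= q_{(A^{-1} - \operatorname{Id}/r)^{-1}}(\cdot + b/r) + \langle \cdot + b/r, b\rangle + c - q(b)/r\\
&= q_{(A^{-1} - \operatorname{Id}/r)^{-1}}(\cdot + b/r) + \langle \cdot, b\rangle + c + q(b)/r.
\end{align*}
Thus, $f \in \Gamma_0(\mathbb{R}^n)$ is a generalized linear-quadratic function.

7.6 Applications

This final section presents a few applications of the theory seen thus far. We build on the idea of extended norms, give an application to the least squares problem and explore the limit of a sequence of generalized linear-quadratic functions.

7.6.1 A seminorm with infinite values

In [32, 35], Beer and Vanderwerff present the idea of norms that are permitted to take on infinite values.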
These so-called extended norms are functions on linear spaces that satisfy the properties of a norm when they are finite-valued, but can be infinite-valued as well. The authors extend many properties of norms to the setting of an extended norm space $(X, \|\cdot\|)$, where $X$ is a vector space and $\|\cdot\|$ is an extended norm. In that spirit, we present here an extended seminorm.

Definition 7.6.1. A function $k : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a gauge if $k$ is a nonnegative, positively homogeneous, convex function such that $k(0) = 0$. Thus, a gauge is a function $k$ such that
\[
k(x) = \inf\{\mu \geq 0 : x \in \mu C\}
\]
for some nonempty convex set $C$.

Definition 7.6.2. The polar of a gauge $k$ is the function $k^o$ defined by
\[
k^o(x^*) = \inf\{\mu^* \geq 0 : \langle x, x^*\rangle \leq \mu^* k(x)\ \ \forall x \in \mathbb{R}^n\}.
\]
If $k$ is finite everywhere and positive except at the origin, the polar of $k$ can be written as
\[
k^o(x^*) = \sup\Big\{\frac{\langle x, x^*\rangle}{k(x)} : x \neq 0\Big\}.
\]

Definition 7.6.3. A function $k : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is an extended seminorm if
(i) $k(x) \geq 0$ for all $x \in \mathbb{R}^n$;
(ii) $k(\alpha x) = |\alpha| k(x)$ for all $x \in \mathbb{R}^n$, for all $\alpha \in \mathbb{R}$;
(iii) $k(x + y) \leq k(x) + k(y)$ for all $x, y \in \mathbb{R}^n$;
(iv) $k(x) = \infty$ if $x \notin \operatorname{dom} k$.

Theorem 7.6.4. Let $A$ be a maximally monotone symmetric linear relation. Then the following hold.
(i) The function
\[
k = (2 q_A)^{1/2}
\]
is an extended seminorm. Moreover,
\[
k^{-1}(0) = A^{-1}0. \tag{7.6.1}
\]
(ii) For all $x \in \operatorname{dom} A$ and for all $x^* \in \operatorname{ran} A$, we have
\[
\langle x, x^*\rangle \leq \sqrt{\langle x, Ax\rangle}\,\sqrt{\langle x^*, A^{-1}x^*\rangle}.
\]
(iii) The closed convex sets
\[
C = \{x : \langle x, Ax\rangle \leq 1\}, \quad C^* = \{x^* : \langle x^*, A^{-1}x^*\rangle \leq 1\}
\]
are polar to each other. (These are the unit sublevel sets of $k$ and $k^o$, that is, $q_A(x) \leq \frac{1}{2}$ and $q_{A^{-1}}(x^*) \leq \frac{1}{2}$.)

Proof. (i) Applying [193, Corollary 15.3.1] with $f = q_A$ and $p = 2$, we have that $k$ is a gauge function. Thus, $k$ is an extended seminorm. To see (7.6.1), we have that $k(x) = 0 \Leftrightarrow q_A(x) = 0$, so it suffices to apply Proposition 7.5.15.
(ii) By Proposition 7.5.16, $q_A^* = q_{A^{-1}}$.
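By [193, Corollary 15.3.1], we have that $k^o(x^*) = (2 q_A^*(x^*))^{1/2}$, and that for all $x \in \operatorname{dom} A$, for all $x^* \in \operatorname{ran} A$,
\begin{align*}
\langle x, x^*\rangle &\leq k(x)\, k^o(x^*)\\
&= 2\,(q_A(x))^{1/2}(q_{A^{-1}}(x^*))^{1/2}\\
&= \sqrt{\langle x, Ax\rangle}\,\sqrt{\langle x^*, A^{-1}x^*\rangle}.
\end{align*}
(iii) By [193, Corollary 15.3.2], we have that the closed, convex sets
\[
C = \{x : \langle x, Ax\rangle \leq 1\}, \quad C^* = \{x^* : \langle x^*, A^{-1}x^*\rangle \leq 1\}
\]
are polar to each other.

Remark 7.6.5. The above result generalizes Rockafellar's result on [193, p. 136] with $Q = A$ from a positive definite matrix to a maximally monotone symmetric linear relation.

7.6.2 The least squares problem

In this section, we show that generalized quadratic functions can be used to study the least squares problem. Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. The general least squares problem is to find a vector $x \in \mathbb{R}^n$ that minimizes
\[
\ell : \mathbb{R}^n \to \mathbb{R} : x \mapsto \frac{1}{2}\|Ax - b\|^2 = q_{A^\top A}(x) - \langle x, A^\top b\rangle + q_{\operatorname{Id}}(b). \tag{7.6.2}
\]

Theorem 7.6.6. For the function $\ell$ given by (7.6.2), we have
(i) $\ell^*(y) = q_{(A^\top A)^{-1}}(y + A^\top b) - q_{\operatorname{Id}}(b)\ \ \forall y \in \mathbb{R}^n$;
(ii)
\[
\partial \ell^*(y) = (A^\top A)^{-1}(y + A^\top b) \quad \forall y \in \mathbb{R}^n, \tag{7.6.3}
\]
and
\[
\operatorname{dom} \ell^* = \operatorname{ran} A^\top. \tag{7.6.4}
\]

Proof. (i) Apply Example 7.5.28.
(ii) Apply Proposition 7.5.9 to obtain (7.6.3). To see (7.6.4), using the facts that $\operatorname{ran} A^\top A = \operatorname{ran} A^\top$ (cf. [153, page 212]) and that $\operatorname{ran} A^\top$ is a subspace, we have
\begin{align*}
\operatorname{dom} \ell^* = \operatorname{dom}[(A^\top A)^{-1} - A^\top b] &= \operatorname{ran}(A^\top A - A^\top b)\\
&= \operatorname{ran}(A^\top - A^\top b) = \operatorname{ran} A^\top.
\end{align*}

7.6.3 Epiconvergence and algebra rules

We end this chapter with an application for sequences of $q_{A_k}$ functions with $A_k$ linear relations and the development of a set of algebra rules for the generalized linear-quadratic functions.

Proposition 7.6.7 (epiconvergence).
(i) For all $k \in \mathbb{N}$, let
\[
f_k = q_{A_k}(\cdot - a_k) + \langle b_k, \cdot\rangle + c_k, \tag{7.6.5}
\]
where $A_k$ is a maximally monotone symmetric linear relation, $a_k, b_k \in \mathbb{R}^n$, $c_k \in \mathbb{R}$. Suppose that $f_k \xrightarrow{e} f$ and that $f$ is proper. Then $f$ is a generalized linear-quadratic function:
\[
f = q_A(\cdot - a) + \langle b, \cdot\rangle + c, \tag{7.6.6}
\]
where $A$ is a maximally monotone symmetric linear relation, $a, b \in \mathbb{R}^n$, $c \in \mathbb{R}$.
(ii) For all $k \in \mathbb{N}$, let
\[
f_k = q_{A_k} + c_k, \tag{7.6.7}
\]
where $A_k$ is a maximally monotone symmetric linear relation, $c_k \in \mathbb{R}$.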
Suppose that $f_k \xrightarrow{e} f$ and that $f$ is proper. Then $f$ is a generalized linear-quadratic function:
\[
f = q_A + c, \tag{7.6.8}
\]
where $A$ is a maximally monotone symmetric linear relation, $c \in \mathbb{R}$.

Proof. (i) As $f_k \xrightarrow{e} f$, we have $\partial f_k \xrightarrow{g} \partial f$. Differentiating (7.6.5), we find that $\partial f_k = A_k(\cdot - a_k) + b_k$, so that $\operatorname{gra} \partial f_k = \operatorname{gra} A_k + (a_k, b_k)$ is maximally monotone and affine. Thus, $\operatorname{gra} \partial f$ is maximally monotone and affine. By [28, Theorem 4.3], $\operatorname{gra} \partial f = \operatorname{gra} A + (a, b)$ for some maximally monotone symmetric linear relation $A$. Then by Proposition 7.5.9, we have $A = \partial q_A$, so that (7.6.6) is true.
(ii) Differentiating (7.6.7) we find that $\partial f_k = A_k$, so that $\operatorname{gra} \partial f_k = \operatorname{gra} A_k$ is a linear subspace. Thus, $\operatorname{gra} \partial f$ is a linear subspace, $\operatorname{gra} \partial f = \operatorname{gra} A$ for some maximally monotone symmetric linear relation $A$, and by Proposition 7.5.9 we have that $A = \partial q_A$, so that (7.6.8) is true.

As a result of Proposition 7.6.7, we are now able to define algebra rules for generalized linear-quadratic functions. We do so in the form of Theorems 7.6.9 and 7.6.10, for which we define the following sets.

Definition 7.6.8. Denote by $\mathcal{A}$ the set of maximally monotone symmetric linear relations on $\mathbb{R}^n$. We define $S$ as the set of convex generalized linear-quadratic functions:
\[
S = \{f = q_A(\cdot - a) + \langle b, \cdot\rangle + c : A \in \mathcal{A},\ a, b \in \mathbb{R}^n,\ c \in \mathbb{R},\ f \in \Gamma_0(\mathbb{R}^n)\}.
\]
We define $T$ as the subset of $S$ obtained by setting $a = 0$:
\[
T = \{f = q_A + \langle b, \cdot\rangle + c : A \in \mathcal{A},\ b \in \mathbb{R}^n,\ c \in \mathbb{R},\ f \in \Gamma_0(\mathbb{R}^n)\}.
\]
We define $U$ as the subset of $S$ obtained by setting $a = b = 0$:
\[
U = \{f = q_A + c : A \in \mathcal{A},\ c \in \mathbb{R},\ f \in \Gamma_0(\mathbb{R}^n)\}.
\]

We begin with algebra rules for the simpler case, the set $U$.

Theorem 7.6.9. Let $d$ be the Attouch-Wets metric (see (6.5.1)). The following hold.
(i) The metric space $(U, d)$ is complete.
(ii) If $f \in U$, then $f^* \in U$.
(iii) If $f \in U$ and $\lambda > 0$, then $\lambda f \in U$.
(iv) If $f_1, f_2 \in U$, then $f_1 + f_2 \in U$.
(v) If $f_1, f_2 \in U$, then the infimal convolution $f_1 \mathbin{\Box} f_2 \in U$.

Proof. (i) This follows from Proposition 7.6.7.
(ii) Let $f \in U$, $f = q_A + c$.
By Proposition 7.5.16, we have
\[
f^* = (q_A + c)^* = q_A^* - c = q_{A^{-1}} - c,
\]
which is a convex generalized linear-quadratic function of the required form. Therefore, $f^* \in U$.
(iii) It is clear that $f = q_A + c \in U$ and $\lambda > 0$ yields $\lambda f = q_{\lambda A} + \lambda c \in U$.
(iv) Let $f_1, f_2 \in U$, $f_1 = q_{A_1} + c_1$, $f_2 = q_{A_2} + c_2$. By Proposition 7.5.13, we have that
\[
f_1 + f_2 = q_{A_1 + A_2} + c_1 + c_2
\]
is a convex generalized linear-quadratic function of the form found in $U$. Therefore, $f_1 + f_2 \in U$.
(v) Let $f_1, f_2 \in U$, $f_1 = q_{A_1} + c_1$, $f_2 = q_{A_2} + c_2$. By Propositions 7.5.20 and 7.5.16, we have
\[
f_1 \mathbin{\Box} f_2 = q^*_{A_1^{-1} + A_2^{-1}} + c_1 + c_2 = q_{(A_1^{-1} + A_2^{-1})^{-1}} + c_1 + c_2 \in U.
\]

For the more general setting of the sets $S$ and $T$, the algebra rules are not so straightforward. More stringent conditions are necessary; the following theorem provides the obtainable results.

Theorem 7.6.10. Let $d$ be the Attouch-Wets metric (see (6.5.1)). The following hold.
(i) The metric space $(S, d)$ is complete.
(ii) If $f \in S$, then $f^* \in S$.
(iii) If $f \in S$ with $b = 0$, then $f^* \in T$.
(iv) If $f \in S$ ($f \in T$) and $\lambda > 0$, then $\lambda f \in S$ ($\lambda f \in T$).
(v) If $f_1, f_2 \in T$, then $f_1 + f_2 \in T$.

Proof. (i) This follows from Proposition 7.6.7.
(ii) Let $f \in S$, $f = q_A(\cdot - a) + \langle b, \cdot\rangle + c$. Combining Proposition 7.5.16 and [21, Proposition 13.20], we have
\begin{align*}
f^* &= (q_A(\cdot - a) + \langle b, \cdot\rangle + c)^*\\
&= (q_A(\cdot - a))^*(\cdot - b) - c\\
&= (q_{A^{-1}} + \langle a, \cdot\rangle)(\cdot - b) - c\\
&= q_{A^{-1}}(\cdot - b) + \langle a, \cdot - b\rangle - c,
\end{align*}
which is a convex generalized linear-quadratic function. Therefore, $f^* \in S$.
(iii) Let $f \in S$, $f = q_A(\cdot - a) + c$. By the same procedure as in the proof of part (ii), we have
\[
f^* = q_{A^{-1}} + \langle a, \cdot\rangle - c \in T.
\]
(iv) It is clear that with $f = q_A(\cdot - a) + \langle b, \cdot\rangle + c \in S$ and $\lambda > 0$, we have $\lambda f = q_{\lambda A}(\cdot - a) + \lambda\langle b, \cdot\rangle + \lambda c \in S$, and that $a = 0$ yields $\lambda f \in T$.
(v) Let $f_1, f_2 \in T$, $f_1 = q_{A_1} + \langle b_1, \cdot\rangle + c_1$, $f_2 = q_{A_2} + \langle b_2, \cdot\rangle + c_2$. By Proposition 7.5.13, we have that
\[
f_1 + f_2 = q_{A_1 + A_2} + \langle b_1 + b_2, \cdot\rangle + c_1 + c_2
\]
is a convex generalized linear-quadratic function.
Therefore, setting $f = f_1 + f_2$, $A = A_1 + A_2$, $b = b_1 + b_2$ and $c = c_1 + c_2$, we have that $f = q_A + \langle b, \cdot\rangle + c \in T$.

This is the final chapter of theoretical results. Now we begin the algorithmic portion of the thesis, the conversion of the VU-algorithm to derivative-free format.

Part III
Applications: DFO Proximal Bundle Algorithms

Chapter 8
Computing Proximal Points of Convex Functions with Inexact Subgradients

8.1 Overview

In this chapter, we present an algorithm for finding the proximal point of a convex function at a given point. The method is derivative-free, that is, it does not require first-order information to implement. It is an iterative proximal bundle method that uses a bundle of information collected at every iteration to build a piecewise-linear model function and find the proximal point of the model, which converges to the proximal point of the objective function. The model construction phase includes a novel tilt-correct step that helps to build more accurate models. The proof of convergence and the results of numerical testing are included.

This chapter is based on results found in [95], which is accepted to Set-Valued and Variational Analysis.

8.2 The DFO proximal bundle method

Due to its broad applicability, particularly for simulation, DFO has seen many successful applications in the past decade [14, 91] (and references therein). DFO algorithms can be broadly split into two categories, direct search methods and model-based methods. Model-based methods approximate the objective function with a model function, and then use the model to guide the optimization algorithm [57].

While model-based methods were designed for optimization of smooth black-box functions (see, for example, [56, 60, 88, 184]), recent research has moved away from assuming smoothness of the objective function [89, 90, 133]. In [89, 90], the objective function takes the form $F = \max\{f_i : i = 1, 2, \ldots, n_f\}$, where $n_f$ is the number of functions and each $f_i$ is given by a black box. In [133], it is assumed
that the objective function takes the form $F = \sum_{i=1}^{m} |f_i|$, where each $f_i$ is given by a black box. In each case, it is shown that the black-box information allows for the creation of a convergent model-based DFO algorithm for nonsmooth optimization.

Among the model-based methods are the proximal bundle methods. Bundle methods proceed by collecting information (function values and subgradient vectors), then using that information to build a model of the objective function and seek a new incumbent solution [38, 172]. Bundle methods have been widely established as the most robust and effective technique for nonsmooth optimization [4, 38, 63, 100, 120, 127, 168]. They are also well known for their ability to work with the structure of a given problem. Specialized bundle methods have been developed considering eigenvalue optimization [102, 103], sum functions [53, 74], chance-constrained problems [2], composite functions [141, 198], and difference-convex (DC) functions [75, 114, 171].

For a function $f \in \Gamma_0(\mathbb{R}^n)$ and initial point $x_0$, the basic proximal point algorithm sets
\[
x_{k+1} = P_r f(x_k)
\]
and was shown to converge to a minimizer of $f$ by Martinet [150]. Proximal bundle methods advance this idea by using a bundle of information to create a convex piecewise-linear approximation of the objective function, then applying the proximal point operator to the model function to determine the next iterate [38, 99, 124, 138]. Among the most complex methods, the VU-algorithm alternates between a proximal-point step and a "U-Newton" step to achieve superlinear convergence in the minimization of nonsmooth, convex functions [159]. This method is the subject of the next chapter and we use the results of this chapter in its implementation. The subject of this chapter is a DFO proximal bundle method.

Works in this vein by other authors present results that have demonstrated the ability to approximate a subgradient using only function evaluations [15, 84, 90, 128, 134].
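As a rough illustration of recovering first-order information from function values alone, the following Python sketch approximates a gradient by central differences. It is not one of the techniques of [15, 84, 90, 128, 134]; the function names, the smooth convex test function and the step size `h` are assumptions made for illustration, and for a nonsmooth function such a scheme is only meaningful at points of differentiability.

```python
import math


def central_difference_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x using only function evaluations.

    f takes a list of floats and returns a float; h is an assumed step size.
    """
    gradient = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        gradient.append((f(forward) - f(backward)) / (2.0 * h))
    return gradient


def smooth_convex(x):
    """Hypothetical test function: sum of exp(x_i) + x_i^2 / 2."""
    return sum(math.exp(xi) + 0.5 * xi * xi for xi in x)


def smooth_convex_gradient(x):
    """Exact gradient of smooth_convex, used only to measure the error."""
    return [math.exp(xi) + xi for xi in x]
```

Comparing `central_difference_gradient(smooth_convex, x)` with `smooth_convex_gradient(x)` at a test point gives the size of the approximation error, which plays the role of the subgradient error bound used later in the chapter.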
This provides opportunity for the development of proximal-based methods for nonsmooth convex functions. Such methods require the use of a proximal subroutine that relies on inexact subgradient information. We develop such a method, prove its convergence, provide stopping criterion analysis and include numerical testing. This method can be used as a foundation in proximal-based methods for nonsmooth convex functions where the oracle returns an exact function value and an inexact subgradient vector. We present the method in terms of an arbitrary approximate subgradient, noting that any of the methods from [15, 84, 90, 128] or any future method of similar style can provide the approximate subgradient.

Remark 8.2.1. It should be noted that several methods that use proximal-style subroutines and inexact subgradient vectors have already been developed [62, 100, 106, 126, 201, 202, 204, 209]. However, in each case the subroutine is embedded within the developed method and only analyzed in light of the developed method. In this work, we develop a subroutine that uses exact function values and inexact subgradient vectors to determine the proximal point for a nonsmooth convex function. As a stand-alone method, the algorithm can be used as a subroutine in any proximal-style algorithm. Some more technical differences between the algorithm in this work and the inexact gradient proximal-style subroutines in other works appear in Section 8.4.3.

The algorithm we present is for finite-valued convex objective functions and is based on standard cutting-plane methods (see [38, 98]). We assume that for a given point $x$ and positive scalar $\varepsilon$, the exact function value $f(x)$ and an approximate subgradient $g^\varepsilon$ such that $\operatorname{dist}(g^\varepsilon, \partial f(x)) < \varepsilon$ are available, where $\partial f(x)$ is the convex subdifferential as defined in [196, §8C].
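To make the assumption concrete, here is a small worked example (added for illustration and not taken from the cited works) with the assumed test function $f(x) = |x|$ on $\mathbb{R}$. It also indicates why a vector within $\varepsilon$ of the subdifferential need not be an approximate subgradient in the cutting-plane sense, a distinction taken up in Section 8.4.3.

```latex
% Illustrative example; f(x) = |x| is an assumed test function.
\[
\partial f(1) = \{1\}, \qquad \partial f(0) = [-1, 1],
\]
% so the oracle may return any g with |g - 1| < eps at x = 1,
% and any g in (-1 - eps, 1 + eps) at x = 0.
\[
g = 1 + \tfrac{\varepsilon}{2}: \qquad
f(1) + g\,(y - 1) - f(y) = \tfrac{\varepsilon}{2}\,(y - 1) \quad \text{for all } y \geq 0.
\]
% The right-hand side is unbounded as y grows, so the linearization built from g
% exceeds f by an arbitrarily large amount: there is no finite delta with
% f(y) >= f(1) + g (y - 1) - delta for all y.
```

In other words, the admissible vector $g = 1 + \varepsilon/2$ satisfies the distance condition at $x = 1$, yet the line it generates is not a lower estimate of $f$ up to any fixed tolerance.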
Using this information, a piecewise-linear approximation of $f$ is constructed, and a quadratic problem is used to determine the proximal point of the model function, which we call an approximal point. Unlike methods that use exact subgradients, the algorithm includes a subgradient correction term that is required to ensure convergence.

The proximal parameter $r$ is fixed in this algorithm. Given a stopping tolerance $s_{\mathrm{tol}}$, we prove that if a stopping condition is met, then a solution has been found that is within $s_{\mathrm{tol}} + \varepsilon/r$ of the true proximal point (see Corollary 8.5.5). We also prove that any accumulation point of the algorithm lies within $s_{\mathrm{tol}} + \varepsilon/r$ of the true proximal point. In the numerics section, we discuss practical implementation of the developed algorithm. The algorithm is implemented in MATLAB version 8.4.0.150421 (R2014b) and several variants are numerically tested on a collection of randomly generated functions. Tests show that the algorithm is effective, even when $\varepsilon$ is quite large.

8.3 The proximal point

Throughout this chapter, we assume that the objective function $f \in \Gamma_0(\mathbb{R}^n)$. As $f$ is convex, for any $z \in \operatorname{dom} f$ the proximal point $P_r f(z)$ exists and is unique [196, Thm 2.26]. If $f$ is locally $K$-Lipschitz continuous at $z$ with radius $\sigma > 0$, then [98, Lemma 2] implies
\[
\|P_r f(z) - z\| < \frac{2K}{r} \quad\text{whenever}\quad \frac{2K}{r} < \sigma. \tag{8.3.1}
\]
The proximal point can be numerically computed via an iterative method. Given an exact oracle, one method for numerically computing a proximal point of $f$ is as follows. Let $z \in \mathbb{R}^n$ be the proximal centre. We create an information bundle, $D_k = \{(x_i, f_i, g_i) : i \in B_k\}$, where $x_i$ is a point at which the oracle has been called, $f_i = f(x_i)$ and $g_i = g(x_i) \in \partial f(x_i)$ are the function value and subgradient returned by the oracle, respectively, and $B_k$ is the bundle index set.
At iteration $k$, the piecewise-linear function $\varphi_k$ is defined by
\[
\varphi_k(x) = \max_{i \in B_k}\{f_i + \langle g_i, x - x_i\rangle\}.
\]
The proximal point of $\varphi_k$ (the approximal point) is calculated, $x_{k+1} = P_r \varphi_k(z)$, and the oracle is called at $x_{k+1}$ to obtain $f_{k+1}$ and $g_{k+1}$. If
\[
\frac{f_{k+1} - \varphi_k(x_{k+1})}{r} < s_{\mathrm{tol}}^2,
\]
where $s_{\mathrm{tol}}$ is the stopping tolerance, then the algorithm stops and returns $x_{k+1}$. Otherwise, the element $(x_{k+1}, f_{k+1}, g_{k+1})$ is inserted into the bundle and the process repeats. Further information on this approach can be found in [109, Chapter XI] and in [124, 126].

Computing the approximal point is a convex quadratic program, and can therefore be solved efficiently as long as the dimension and the bundle size remain reasonable [50]. In order to keep the bundle size reasonable, various techniques such as bundle cleaning [119] and aggregate gradient cutting planes [152] have been advanced. As a result, we have a computationally tractable algorithm that, under mild assumptions, can be proved to converge to the true proximal point.

In this work, we are interested in how this method must be adapted if, instead of returning $g_k \in \partial f(x_k)$, the oracle returns
\[
\tilde{g}^\varepsilon_k \in \partial f(x_k) + B_\varepsilon(0). \tag{8.3.2}
\]
We address this issue in the next section.

8.4 Replacing exactness with approximation

8.4.1 The approximate model function and approximate subgradient

We denote the maximum subgradient error by $\varepsilon$, and use $\tilde{g}^\varepsilon_i$ to represent the inexact subgradient returned by the oracle at point $x_i$. We use this information to define a new bundle element to update the model function, but first we want to ensure that our model function will not lie above the objective function at the proximal centre. This is a necessary component of our convergence proof. So if the linear function defined by the new bundle element lies above $f$ at $z$, we make a correction to $\tilde{g}^\varepsilon_k$. We set $g^\varepsilon_k = \tilde{g}^\varepsilon_k - c_k$, where the correction term $c_k$ is nonzero if correction is needed, zero otherwise.
Then, denoting the bundle index set by $B_k$, the piecewise-linear model function $\varphi^\varepsilon_k$ is defined by
\[
\varphi^\varepsilon_k(x) = \max_{i \in B_k}\{f_i + \langle g^\varepsilon_i, x - x_i\rangle\}. \tag{8.4.1}
\]
We use $D_k$ to denote the set of bundle elements. At initialization ($k = 0$), we have $B_0 = \{0\}$ and $D_0 = \{(z, f_0, g^\varepsilon_0)\}$. For each $k \geq 1$, we will have at least three bundle elements: $B_k \supseteq \{-1, 0, k\}$ and
\[
D_k \supseteq \{(x_k, \varphi^\varepsilon_{k-1}(x_k), r(z - x_k)),\ (z, f_0, g^\varepsilon_0),\ (x_k, f_k, g^\varepsilon_k)\}.
\]
In bundle and cutting-plane methods, the bundle component $r(z - x_k)$ is known as the aggregate subgradient [62, 126, 201] and is an element of $\partial \varphi^\varepsilon_{k-1}(x_k)$. We adopt the convention of using the index $-1$ as the label for the aggregate bundle element $(x_k, \varphi^\varepsilon_{k-1}(x_k), r(z - x_k))$. We may have up to $k + 2$ elements, $B_k = \{-1, 0, 1, 2, \ldots, k\}$; however, elements $-1$, $0$ and $k$ are sufficient to guarantee convergence.

Now we consider the correction term $c_k$. Suppose that
\[
E_k = f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle - f(z) > 0,
\]
thus necessitating a correction. We seek the minimal correction term, hence, we need to find
\[
c_k \in \operatorname{argmin}\{\|c\| : \langle c, z - x_k\rangle - E_k = 0\}.
\]
This gives
\[
c_k = P_G(0), \quad\text{where}\quad G = \Big\{c : \Big\langle c, \frac{z - x_k}{\|z - x_k\|}\Big\rangle = \frac{E_k}{\|z - x_k\|}\Big\}. \tag{8.4.2}
\]
That is, $c_k$ is the projection of $0$ onto the hyperplane generated by the normal vector $z - x_k$ and shift constant $E_k$. This yields
\[
c_k = \frac{E_k}{\|z - x_k\|}\,\frac{z - x_k}{\|z - x_k\|} = E_k\,\frac{z - x_k}{\|z - x_k\|^2}. \tag{8.4.3}
\]
Now we define the approximate subgradient that we use in the algorithm:
\[
g^\varepsilon_k = \begin{cases} \tilde{g}^\varepsilon_k - c_k, & \text{if } f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle > f(z),\\ \tilde{g}^\varepsilon_k, & \text{if } f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle \leq f(z). \end{cases}
\]
Since $\tilde{g}^\varepsilon_k$ is the approximate subgradient returned by the oracle but $g^\varepsilon_k$ is the one we want to use in construction of the model function, we must first prove that $g^\varepsilon_k$ also respects (8.3.2).

Lemma 8.4.1. Let $f$ be convex. Then at any iteration $k$, $\operatorname{dist}(g^\varepsilon_k, \partial f(x_k)) < \varepsilon$.

Proof. If $g^\varepsilon_k = \tilde{g}^\varepsilon_k$, then the result holds by assumption (8.3.2).
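Now suppose that $f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle > f(z)$, and define
\begin{align*}
H &= \{g : f(z) = f(x_k) + \langle g, z - x_k\rangle\},\\
J &= \{g : f(z) \geq f(x_k) + \langle g, z - x_k\rangle\}.
\end{align*}
By (8.4.2), we have that $g^\varepsilon_k = P_H(\tilde{g}^\varepsilon_k)$. Since $f(z) < f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle$, we also know that $P_J(\tilde{g}^\varepsilon_k) = P_H(\tilde{g}^\varepsilon_k) = g^\varepsilon_k$. By (8.3.2), there exists $\bar{g} \in \partial f(x_k)$ such that $\|\tilde{g}^\varepsilon_k - \bar{g}\| < \varepsilon$. Since $f$ is convex, we have
\[
f(z) \geq f(x_k) + \langle \bar{g}, z - x_k\rangle,
\]
hence $\bar{g} \in J$ and $P_J(\bar{g}) = \bar{g}$. Using the fact that the projection is firmly nonexpansive [21, Proposition 4.8], we have
\[
\|g^\varepsilon_k - \bar{g}\| = \|P_J(\tilde{g}^\varepsilon_k) - P_J(\bar{g})\| \leq \|\tilde{g}^\varepsilon_k - \bar{g}\| < \varepsilon,
\]
which is the desired result.

In the case of exact subgradients, the resulting linear functions form cutting planes of $f$, so that the model function is an underestimator of the objective function. That is, for exact subgradient $g_i$,
\[
f(x) \geq f_i + \langle g_i, x - x_i\rangle \quad\text{for all } i \in B_k. \tag{8.4.4}
\]
In the approximate subgradient case we do not have this luxury, but all is not lost. Using (8.4.4) and the fact that $\|g^\varepsilon_i - g_i\| < \varepsilon$, we have that for all $i \in B_k$ and for all $x \in \mathbb{R}^n$,
\begin{align*}
f(x) &\geq f_i + \langle g_i - g^\varepsilon_i + g^\varepsilon_i, x - x_i\rangle\\
&= f_i + \langle g^\varepsilon_i, x - x_i\rangle + \langle g_i - g^\varepsilon_i, x - x_i\rangle\\
&\geq f_i + \langle g^\varepsilon_i, x - x_i\rangle - \|g_i - g^\varepsilon_i\|\|x - x_i\|\\
&\geq f_i + \langle g^\varepsilon_i, x - x_i\rangle - \varepsilon\|x - x_i\|.
\end{align*}
Hence,
\[
f(x) + \varepsilon\|x - x_i\| \geq \varphi^\varepsilon_k(x) \quad\text{for all } x \in \mathbb{R}^n, \text{ for all } i \in B_k, \text{ for all } k \in \mathbb{N}. \tag{8.4.5}
\]

8.4.2 The algorithm

Now we present the algorithm that uses approximate subgradients. In the numerics section, we implement four variants of this algorithm, comparing four different ways of updating the bundle in Step 5.

Algorithm:

Step 0: Initialization. Given a proximal centre $z \in \mathbb{R}^n$, choose a stopping tolerance $s_{\mathrm{tol}} \geq 0$ and a proximal parameter $r > 0$. Set $k = 0$ and $x_0 = z$. Set $B_0 = \{0\}$. Use the oracle to find $(f_0, \tilde{g}^\varepsilon_0)$.

Step 1: Linearization. Compute $E_k = f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle - f(z)$, and define, in agreement with (8.4.3),
\[
g^\varepsilon_k = \tilde{g}^\varepsilon_k - \max\{0, E_k\}\,\frac{z - x_k}{\|z - x_k\|^2}.
\]

Step 2: Model. Define
\[
\varphi^\varepsilon_k(x) = \max_{i \in B_k}\{f_i + \langle g^\varepsilon_i, x - x_i\rangle\}.
\]

Step 3: Proximal Point. Calculate the point $x_{k+1} = P_r \varphi^\varepsilon_k(z)$, and use the oracle to find $(f_{k+1}, \tilde{g}^\varepsilon_{k+1})$.

Step 4: Stopping Test.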
If $(f_{k+1} - \varphi^\varepsilon_k(x_{k+1}))/r \leq s_{\mathrm{tol}}^2$, output the approximal point $x_{k+1}$ and stop.

Step 5: Update and Loop. Create the bundle element $(x_k, \varphi^\varepsilon_{k-1}(x_k), r(z - x_k))$. Create $B_{k+1}$ such that $\{-1, 0, k\} \subseteq B_{k+1} \subseteq \{-1, 0, 1, 2, \ldots, k\}$. Increment $k$ and go to Step 1.

End algorithm.

8.4.3 Relation to other inexact gradient proximal-style subroutines

Now that the algorithm is presented, we provide some insight on how it relates to previously developed inexact subgradient proximal point computations. First and foremost, the presented algorithm is a stand-alone method that is not presented as a subroutine of another algorithm.

In 1995, [126] presented a method of computing a proximal point using inexact subgradients as a subroutine in a minimization algorithm for a nonsmooth, convex objective function. However, the algorithm assumes that the inexactness in the subgradient takes the form of an $\varepsilon$-subgradient. We remind the reader that an $\varepsilon$-subgradient $v^\varepsilon$ of $f$ at $x_k$ is an approximate subgradient that satisfies
\[
f(x) \geq f(x_k) + \langle v^\varepsilon, x - x_k\rangle - \varepsilon \quad\text{for all } x.
\]
Thus, the method in [126] relies on the approximate subgradient forming an $\varepsilon$-cutting plane at each iteration.

While $\varepsilon$-subgradients do appear in some real-world applications [71, 116], in other situations the ability to determine $\varepsilon$-subgradients is an unreasonable expectation. For example, if the objective function is a black box that only returns function values, then subgradients could be approximated numerically using the techniques developed in [15, 84, 90, 128]. These techniques will return approximate subgradients that satisfy assumption (8.3.2), but are not necessarily $\varepsilon$-subgradients. The method of the present work replaces the requirement of $\varepsilon$-subgradients with the requirement of approximate subgradients that satisfy assumption (8.3.2).

Shortly before [126] was published, a similar technique was presented in [58]. However, it too requires the approximate subgradients to be $\varepsilon$-subgradients.
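For readers who want to experiment with the algorithm of Section 8.4.2, its steps can be sketched in one dimension as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB implementation tested in this chapter: it assumes $n = 1$, keeps every bundle element (the variant discussed in Remark 8.5.2, so no aggregate element is needed), and replaces the quadratic program by a direct comparison of candidate points. All function names, including the perturbed oracle, are hypothetical.

```python
import itertools


def prox_of_model(planes, z, r):
    """Proximal point at z of phi(x) = max_i (a_i + s_i * x), for scalar x.

    planes is a list of (intercept, slope) pairs. The minimizer of
    phi(x) + (r/2)(x - z)^2 is either a stationary point of one piece plus
    the quadratic, or a kink where two pieces cross, so it suffices to
    compare the objective over those candidates.
    """
    def objective(x):
        return max(a + s * x for a, s in planes) + 0.5 * r * (x - z) ** 2

    candidates = [z - s / r for _, s in planes]
    for (a1, s1), (a2, s2) in itertools.combinations(planes, 2):
        if s1 != s2:
            candidates.append((a2 - a1) / (s1 - s2))
    return min(candidates, key=objective)


def approximate_prox(f, inexact_subgradient, z, r=1.0, stol=1e-6, max_iter=100):
    """Sketch of Steps 0-5 for a convex function f on the real line."""
    fz = f(z)
    planes = []   # bundle, stored as (intercept, slope) of each linearization
    x = z         # Step 0: x_0 = z
    for _ in range(max_iter):
        fx = f(x)
        g = inexact_subgradient(x)
        # Step 1: tilt-correct so the new linearization does not exceed f at z.
        if x != z:
            excess = fx + g * (z - x) - fz   # E_k
            if excess > 0.0:
                g -= excess / (z - x)        # E_k (z - x_k) / |z - x_k|^2 in 1-D
        # Step 2: add the linearization f_k + g (x - x_k) to the model.
        planes.append((fx - g * x, g))
        # Step 3: approximal point of the current model.
        x_next = prox_of_model(planes, z, r)
        model_value = max(a + s * x_next for a, s in planes)
        # Step 4: stopping test.
        if (f(x_next) - model_value) / r <= stol ** 2:
            return x_next
        # Step 5: keep all bundle elements and loop.
        x = x_next
    return x


def noisy_abs_subgradient(x):
    """Subgradient of |x| perturbed by a fixed error of 0.1 (an assumption)."""
    return (1.0 if x >= 0 else -1.0) + 0.1
```

For $f(x) = |x|$, $z = 2$ and $r = 1$ the exact proximal point is $1$, so the value returned by `approximate_prox(abs, noisy_abs_subgradient, 2.0)` can be compared against the bound $s_{\mathrm{tol}} + \varepsilon/r$ of Corollary 8.5.5 for any $\varepsilon$ slightly larger than $0.1$.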
A few years later [202] and [209] presented similar solvers, also as subroutines within minimization algorithms. Again, the convergence results rest upon $\varepsilon$-subgradients and model functions that are constructed using supporting hyperplanes of the objective function.

The algorithmic pattern in [62] is much more general in nature; it is applicable to many types of bundle methods and oracles. The authors go into detail about the variety of oracles in use (upper, dumb lower, controllable lower, asymptotically exact and others), and the resulting particular bundle methods that they inspire. The oracles themselves are more generalized as well, in that they deliver approximate function values instead of exact ones. The approximate subgradient is then determined based on the approximate function value, and thus is dependent on two parameters of inexactness rather than one. The algorithm iteratively calculates proximal points as does ours, but does not include the subgradient correction step.

Both [201] and [100] address the issue of non-convexity. The algorithm in [201] splits the proximal parameter in two: a local convexification parameter and a new model proximal parameter. It calls an oracle that delivers an approximate function value and an approximate subgradient, which are used to construct a piecewise-linear model function. That function is then shifted down to ensure that it is a cutting-planes model. In [100] the same parameter-splitting technique is employed to deal with nonconvex functions, and the oracle returns both inexact function values and inexact subgradients. The notable difference here is that the approximate subgradient is not an $\varepsilon$-subgradient; it is a vector that is within $\varepsilon$ of the subdifferential of the model function at the current point. This is the same assumption that we employ in our version. However, non-convexity forces the algorithms to involve proximal parameter corrections that obscure any proximal point subroutine.
(Indeed, it is unclear if a proximal point is actually computed as a subroutine in these methods, or if the methods only use proximal directions to seek descent.)

In all of the above methods except for the last, the model function is built using lower-estimating cutting planes. In this work, the goal is to avoid this requirement and extend the class of inexact subgradients that can be used in these types of algorithms. The tilt-correct step in our method ensures that the model function and the objective function coincide at the proximal centre, which we show is sufficient to prove convergence in the next section. Although the last method mentioned above is for the nonconvex case and uses approximate function values, it is the most similar to the one in the present work, as it does not rely on $\varepsilon$-subgradients. The differentiating aspect in that method, as in all of the aforementioned methods, is that it does not make any slope-improving correction to the subgradient.

8.5 Convergence

To prove convergence of this routine, we need several lemmas that are proved in the sequel. Ultimately, we prove that the algorithm converges to the true proximal point of $f$ with a maximum error of $\varepsilon/r$. Throughout this section, we denote $P_r f(z)$ by $x^*$. To begin, we establish some properties of $\varphi^\varepsilon_k$.

Lemma 8.5.1. Let $\varphi^\varepsilon_k(x) = \max_{i \in B_k}\{f_i + \langle g^\varepsilon_i, x - x_i\rangle\}$. Then for all $k$,
(i) $\varphi^\varepsilon_k$ is a convex function,
(ii) $\varphi^\varepsilon_k(z) = f(z)$,
(iii) $\varphi^\varepsilon_{k+1}(x) \geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x - x_{k+1}\rangle\ \ \forall x \in \mathbb{R}^n$,
(iv) $\varphi^\varepsilon_k(x) \geq f(x_k) + \langle g^\varepsilon_k, x - x_k\rangle\ \ \forall x \in \mathbb{R}^n$, and
(v) if $f$ is $K$-Lipschitz, then $\varphi^\varepsilon_k$ is $(K + \varepsilon)$-Lipschitz.

Proof. (i) Since $\varphi^\varepsilon_k$ is the maximum of a finite number of convex functions, $\varphi^\varepsilon_k$ is convex by [21, Proposition 8.14].
(ii) We have that $\varphi^\varepsilon_0(z) = f(z)$ by definition. Then for any $k > 0$, Step 1 of the algorithm guarantees that
\[
f(x_k) + \langle g^\varepsilon_k, z - x_k\rangle \leq f(z),
\]
so by (8.4.1) we have $\varphi^\varepsilon_k(z) \leq f(z)$ for all $k$. Thus, we need only concern ourselves with the new linear function at iteration $k$.
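If $f_k + \langle \tilde{g}^\varepsilon_k, z - x_k\rangle > f(z)$, we make the correction to $\tilde{g}^\varepsilon_k$ so that $f_k + \langle g^\varepsilon_k, z - x_k\rangle = f(z)$. As for the aggregate subgradient bundle element, we have that $r(z - x_k) \in \partial \varphi^\varepsilon_{k-1}(x_k)$ and $\varphi^\varepsilon_{k-1}$ is convex, so that $\varphi^\varepsilon_{k-1}(x_k) + r\langle z - x_k, x - x_k\rangle \leq \varphi^\varepsilon_{k-1}(x)$ for all $x \in \mathbb{R}^n$. In particular, taking $x = z$ and arguing inductively, $\varphi^\varepsilon_{k-1}(x_k) + r\langle z - x_k, z - x_k\rangle \leq \varphi^\varepsilon_{k-1}(z) = f(z)$. Therefore,
\[
f(z) = \varphi^\varepsilon_0(z) \leq \varphi^\varepsilon_k(z) = \max_{i \in B_k}\{f_i + \langle g^\varepsilon_i, z - x_i\rangle\} \leq f(z).
\]
(iii) Since $(x_{k+1}, \varphi^\varepsilon_k(x_{k+1}), r(z - x_{k+1})) \in D_{k+1}$, we have
\begin{align*}
\varphi^\varepsilon_{k+1}(x) &= \max_{i \in B_{k+1}}\{f_i + \langle g^\varepsilon_i, x - x_i\rangle\}\\
&\geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x - x_{k+1}\rangle.
\end{align*}
(iv) This is true by definition of $\varphi^\varepsilon_k$.
(v) We know that $f$ is $K$-Lipschitz, and by Lemma 8.4.1, for each $k$ we have that $g^\varepsilon_k$ is within distance $\varepsilon$ of $\partial f(x_k)$. Therefore, $\varphi^\varepsilon_k$ is $(K + \varepsilon)$-Lipschitz.

Remark 8.5.2. Lemma 8.5.1 makes strong use of the aggregate bundle element to prove part (iii). It is possible to avoid the aggregate bundle element by setting $B_{k+1} = B_k \cup \{k\}$ at every iteration. To see this, note that the tilt-correct step will only ever alter $g^\varepsilon_k$ at iteration $k$. As $\varphi^\varepsilon_k(x) \geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x - x_{k+1}\rangle$, if $B_{k+1} = B_k \cup \{k\}$, then $\varphi^\varepsilon_{k+1}(x) \geq \varphi^\varepsilon_k(x)$ provides the necessary information to ensure that Lemma 8.5.1(iii) still holds.

Next, we show that at every iteration the distance between the approximal point and the true proximal point is bounded by a function of the distance between $f(x_{k+1})$ and $\varphi^\varepsilon_k(x_{k+1})$. This immediately leads to an understanding of the stopping criterion.

Lemma 8.5.3. At every iteration $k$ of the algorithm, the distance between the proximal point of the piecewise-linear function $x_{k+1} = P_r \varphi^\varepsilon_k(z)$ and the proximal point of the objective function $x^* = P_r f(z)$ satisfies
\[
\|x_{k+1} - x^*\| \leq \sqrt{\frac{f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) + \frac{\varepsilon^2}{4r}}{r}} + \frac{\varepsilon}{2r}. \tag{8.5.1}
\]

Proof. Since $x_{k+1} = P_r \varphi^\varepsilon_k(z)$, we have that
\[
\varphi^\varepsilon_k(x_{k+1}) \leq \varphi^\varepsilon_k(x) + \frac{r}{2}\|z - x\|^2 \quad \forall x \in \mathbb{R}^n.
\]
By (8.4.5) with $i = k + 1 \in B_{k+1}$, we have
\[
f(x) + \varepsilon\|x - x_{k+1}\| \geq \varphi^\varepsilon_{k+1}(x) \quad \forall x \in \mathbb{R}^n,
\]
which by Lemma 8.5.1(iii) results in
\[
f(x) + \varepsilon\|x - x_{k+1}\| \geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x - x_{k+1}\rangle \quad \forall x \in \mathbb{R}^n.
\]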
(8.5.2)

In particular, we have
\[
f(x^*) + \varepsilon\|x^* - x_{k+1}\| \geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x^* - x_{k+1}\rangle. \tag{8.5.3}
\]
Since $x^* = P_r f(z)$, we have $r(z - x^*) \in \partial f(x^*)$. Then
\[
f(x) \geq f(x^*) + r\langle z - x^*, x - x^*\rangle \quad \forall x \in \mathbb{R}^n, \tag{8.5.4}
\]
thus, we have
\[
f(x_{k+1}) \geq f(x^*) + r\langle x^* - z, x^* - x_{k+1}\rangle. \tag{8.5.5}
\]
Adding (8.5.3) and (8.5.5) yields
\begin{align*}
f(x_{k+1}) + \varepsilon\|x^* - x_{k+1}\| &\geq \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1} - z + x^*, x^* - x_{k+1}\rangle,\\
f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) &\geq r\|x^* - x_{k+1}\|^2 - \varepsilon\|x^* - x_{k+1}\|\\
&= r\Big(\|x^* - x_{k+1}\|^2 - \frac{\varepsilon}{r}\|x^* - x_{k+1}\| + \frac{\varepsilon^2}{4r^2} - \frac{\varepsilon^2}{4r^2}\Big)\\
&= r\Big(\|x^* - x_{k+1}\| - \frac{\varepsilon}{2r}\Big)^2 - \frac{\varepsilon^2}{4r}.
\end{align*}
Isolating the squared binomial term above and taking the square root of both sides, we have
\[
\sqrt{\frac{f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) + \frac{\varepsilon^2}{4r}}{r}} \geq \Big|\,\|x^* - x_{k+1}\| - \frac{\varepsilon}{2r}\Big| \geq \|x^* - x_{k+1}\| - \frac{\varepsilon}{2r},
\]
so that
\[
\|x_{k+1} - x^*\| \leq \sqrt{\frac{f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) + \frac{\varepsilon^2}{4r}}{r}} + \frac{\varepsilon}{2r}.
\]

Remark 8.5.4. Lemma 8.5.3 not only sets up our analysis of the stopping criterion, but also provides the necessary insight to understand the algorithm's output if an early termination is invoked. In particular, if the algorithm is used as a subroutine inside a larger method and the larger method stops the subroutine (perhaps because desirable decrease is detected), then (8.5.1) still applies. As such, the optimizer can still compute an error bound on the distance of the output to the true proximal point.

Corollary 8.5.5. If the stopping criterion is satisfied, then $\|x_{k+1} - x^*\| \leq s_{\mathrm{tol}} + \frac{\varepsilon}{r}$.

Proof. Substituting the stopping criterion into (8.5.1) yields
\[
\|x_{k+1} - x^*\| \leq \sqrt{s_{\mathrm{tol}}^2 + \frac{\varepsilon^2}{4r^2}} + \frac{\varepsilon}{2r} \leq s_{\mathrm{tol}} + \frac{\varepsilon}{r}.
\]

Corollary 8.5.5 is our first convergence result, showing that if $\varphi^\varepsilon_k(x_{k+1})$ comes close enough to $f(x_{k+1})$ to trigger the stopping condition of the algorithm, then $x_{k+1}$ is within a fixed distance of the true proximal point. Now we aim to prove that the stopping condition will always be activated at some point and the algorithm will not run indefinitely. We begin with Lemma 8.5.6, which shows that if at any iteration the new point is equal to the previous one, the stopping condition is triggered and the approximal point is within $\varepsilon/r$ of $x^*$.

Lemma 8.5.6.
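If $x_{k+2} = x_{k+1}$, then the algorithm stops and $\|x_{k+2} - x^*\| \le \frac{\varepsilon}{r}$.

Proof. We have $\varphi^\varepsilon_{k+1}(x) = \max_{i \in B_{k+1}}\{f_i + \langle g^\varepsilon_i, x - x_i\rangle\}$, so in particular
\[ \varphi^\varepsilon_{k+1}(x_{k+1}) = \max_{i \in B_{k+1}}\{f_i + \langle g^\varepsilon_i, x_{k+1} - x_i\rangle\}. \]
As $k + 1 \in B_{k+1}$ and $f(x_{k+1}) + \langle g^\varepsilon_{k+1}, x_{k+1} - x_{k+1}\rangle = f(x_{k+1})$, it follows that
\[ \varphi^\varepsilon_{k+1}(x_{k+1}) \ge f(x_{k+1}), \]
which is equivalent to $\varphi^\varepsilon_{k+1}(x_{k+2}) \ge f(x_{k+2})$ if $x_{k+1} = x_{k+2}$. So the stopping criterion is satisfied and by Lemma 8.5.3 we have
\[ \|x_{k+2} - x^*\| \le \sqrt{\frac{f(x_{k+2}) - \varphi^\varepsilon_{k+1}(x_{k+2}) + \frac{\varepsilon^2}{4r}}{r}} + \frac{\varepsilon}{2r} \le \sqrt{\frac{\varepsilon^2}{4r^2}} + \frac{\varepsilon}{2r} = \frac{\varepsilon}{r}. \]

Next, we prove convergence within $\varepsilon/r$ in the case that the stopping condition is not present and the algorithm does not stop. We show that this is true by establishing Lemmas 8.5.7 through 8.5.9, which lead to Theorem 8.5.10, the main convergence result.

Lemma 8.5.7. Suppose the algorithm is run without the stopping criterion. Then the function
\[ \Phi(k) = \varphi^\varepsilon_k(x_{k+1}) + \frac{r}{2}\|z - x_{k+1}\|^2 \]
is strictly increasing and bounded above, with
\[ \Phi(k + 1) \ge \Phi(k) + \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 > \Phi(k). \tag{8.5.6} \]

Proof. Recall that $\varphi^\varepsilon_k(z) = f(z)$ by Lemma 8.5.1(ii). Since $x_{k+1}$ is the proximal point of $\varphi^\varepsilon_k$ at $z$, we have
\[ \varphi^\varepsilon_k(x_{k+1}) + \frac{r}{2}\|z - x_{k+1}\|^2 \le \varphi^\varepsilon_k(z) + \frac{r}{2}\|z - z\|^2 = f(z). \]
Therefore, $\Phi(k)$ is bounded above by $f(z)$ for all $k$. Define
\[ L_k(x) = \varphi^\varepsilon_k(x_{k+1}) + \frac{r}{2}\|z - x_{k+1}\|^2 + \frac{r}{2}\|x - x_{k+1}\|^2. \]
Since $x_{k+1} = P_r\varphi^\varepsilon_k(z)$, we have
\[ L_k(x_{k+1}) = \varphi^\varepsilon_k(x_{k+1}) + \frac{r}{2}\|z - x_{k+1}\|^2 \le \varphi^\varepsilon_k(z) = f(z). \]
By Lemma 8.5.1(iii) with $x = x_{k+2}$, we have
\[ \varphi^\varepsilon_{k+1}(x_{k+2}) \ge \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle. \tag{8.5.7} \]
Using (8.5.7), we have
\begin{align*}
L_{k+1}(x_{k+2}) &= \varphi^\varepsilon_{k+1}(x_{k+2}) + \frac{r}{2}\|z - x_{k+2}\|^2 \\
&\ge \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle + \frac{r}{2}\|z - x_{k+2}\|^2 \\
&= L_k(x_{k+1}) + r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle + \frac{r}{2}\|z - x_{k+2}\|^2 - \frac{r}{2}\|z - x_{k+1}\|^2.
\end{align*}
Expanding the norms above, we have
\begin{align*}
\frac{r}{2}\|z - x_{k+2}\|^2 - \frac{r}{2}\|z - x_{k+1}\|^2
&= \frac{r}{2}\left(\|x_{k+2}\|^2 - 2\langle x_{k+2}, x_{k+1}\rangle + \|x_{k+1}\|^2\right) \\
&\quad + r\left(-\langle z, x_{k+2} - x_{k+1}\rangle + \langle x_{k+2}, x_{k+1}\rangle - \|x_{k+1}\|^2\right) \\
&= \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 - r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle.
\end{align*}
This gives us that
\begin{align*}
L_{k+1}(x_{k+2}) &\ge L_k(x_{k+1}) + r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle + \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 - r\langle z - x_{k+1}, x_{k+2} - x_{k+1}\rangle \\
&= L_k(x_{k+1}) + \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2.
\end{align*}
Since $x_{k+2} \ne x_{k+1}$ for all $k$ by Lemma 8.5.6, the inequality above becomes
\[ \varphi^\varepsilon_{k+1}(x_{k+2}) + \frac{r}{2}\|z - x_{k+2}\|^2 \ge \varphi^\varepsilon_k(x_{k+1}) + \frac{r}{2}\|z - x_{k+1}\|^2 + \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2, \]
which by the definition of $\Phi$ yields
\[ \Phi(k + 1) \ge \Phi(k) + \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 > \Phi(k). \]
Therefore, $\Phi(k)$ is a strictly increasing function.

Corollary 8.5.8. Suppose the algorithm is run without the stopping criterion. Then $\lim_{k\to\infty}\|x_{k+1} - x_k\| = 0$.

Proof. By (8.5.6), we have
\[ 0 \le \frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 \le \Phi(k + 1) - \Phi(k). \]
By Lemma 8.5.7, both terms on the right-hand side above converge, and they converge to the same limit. Then
\[ 0 \le \lim_{k\to\infty}\frac{r}{2}\|x_{k+2} - x_{k+1}\|^2 \le 0, \]
and since $r \ne 0$, we have that $\|x_{k+2} - x_{k+1}\| \to 0$.

We point out here that if $f$ is locally $K$-Lipschitz, then the sequence $\{x_k\}$ has a convergent subsequence. This is because, by (8.3.1) and Lemma 8.5.1(v), the iterates are contained in a compact set and the Bolzano-Weierstrass Theorem applies. We use this fact to prove the results that follow.

Lemma 8.5.9. Let $f$ be locally $K$-Lipschitz. Suppose the algorithm is run without the stopping criterion. Then for any accumulation point $p$ of $\{x_k\}$, we have $\lim_{j\to\infty}\varphi^\varepsilon_{k_j}(x_{k_j+1}) = f(p)$, where $\{x_{k_j}\}$ is any subsequence converging to $p$.

Proof. We have $f_{k+1} \ge \varphi^\varepsilon_k(x_{k+1})$ (otherwise, the stopping criterion is satisfied and the algorithm stops). By Lemma 8.5.1(v), $\varphi^\varepsilon_k$ is $(K + \varepsilon)$-Lipschitz:
\[ \|\varphi^\varepsilon_k(x_{k+1}) - \varphi^\varepsilon_k(x_k)\| \le (K + \varepsilon)\|x_{k+1} - x_k\|, \]
and by Corollary 8.5.8 we have that $\lim_{k\to\infty}\|\varphi^\varepsilon_k(x_{k+1}) - \varphi^\varepsilon_k(x_k)\| = 0$.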
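We also have
\begin{align*}
f(x_{k+1}) &\ge \varphi^\varepsilon_k(x_{k+1}) - \varphi^\varepsilon_k(x_k) + \varphi^\varepsilon_k(x_k) \\
&\ge \varphi^\varepsilon_k(x_k) - (K + \varepsilon)\|x_{k+1} - x_k\|.
\end{align*}
By Lemma 8.5.1(iv) with $x = x_k$,
\[ f(x_{k+1}) \ge \varphi^\varepsilon_k(x_k) - (K + \varepsilon)\|x_{k+1} - x_k\| \ge f(x_k) - (K + \varepsilon)\|x_{k+1} - x_k\|. \tag{8.5.8} \]
Select any subsequence $\{x_{k_j}\}$ such that $\lim_{j\to\infty} x_{k_j} = p$. By Corollary 8.5.8, $\lim_{j\to\infty}\|x_{k_j+1} - x_{k_j}\| = 0$. So we have that $\lim_{j\to\infty} x_{k_j+1} = p$ as well. Hence, taking the limit of (8.5.8) as $j \to \infty$, and employing Corollary 8.5.8, we have
\[ f(p) \ge \lim_{j\to\infty}\varphi^\varepsilon_{k_j}(x_{k_j}) \ge f(p). \]
Thus, $\lim_{j\to\infty}\varphi^\varepsilon_{k_j}(x_{k_j}) = f(p)$. Since $\lim_{j\to\infty}\|\varphi^\varepsilon_{k_j}(x_{k_j+1}) - \varphi^\varepsilon_{k_j}(x_{k_j})\| = 0$, we have that
\[ \lim_{j\to\infty}\varphi^\varepsilon_{k_j}(x_{k_j+1}) = f(p). \]

Now the stage is set for the following theorem, which proves that the algorithm converges to a vector that is within a fixed distance of $P_r f(z)$.

Theorem 8.5.10. Let $f$ be locally $K$-Lipschitz. Suppose the algorithm is run without the stopping criterion. Then for any accumulation point $p$ of $\{x_k\}$, we have $\|x^* - p\| \le \varepsilon/r$.

Proof. By (8.5.2),
\[ f(x) + \varepsilon\|x - x_{k+1}\| \ge \varphi^\varepsilon_k(x_{k+1}) + r\langle z - x_{k+1}, x - x_{k+1}\rangle \quad \forall x \in \mathbb{R}^n. \]
Select any subsequence $\{x_{k_j}\}$ such that $\lim_{j\to\infty} x_{k_j} = p$. Taking the limit as $j \to \infty$ and using Corollary 8.5.8 and Lemma 8.5.9, we have
\begin{align*}
f(x) + \varepsilon\|x - p\| &= \lim_{j\to\infty}\left[f(x) + \varepsilon\|x - x_{k_j+1}\|\right] \\
&\ge \lim_{j\to\infty}\left[\varphi^\varepsilon_{k_j}(x_{k_j+1}) + r\langle z - x_{k_j+1}, x - x_{k_j+1}\rangle\right] \\
&= f(p) + r\langle z - p, x - p\rangle \quad \forall x \in \mathbb{R}^n. \tag{8.5.9}
\end{align*}
By (8.5.4), we have
\[ f(p) \ge f(x^*) + r\langle x^* - z, x^* - p\rangle. \tag{8.5.10} \]
By (8.5.9), in particular we have
\[ f(x^*) + \varepsilon\|x^* - p\| \ge f(p) + r\langle z - p, x^* - p\rangle. \tag{8.5.11} \]
Adding (8.5.10) and (8.5.11) yields
\[ \varepsilon\|x^* - p\| \ge r\langle x^* - z + z - p, x^* - p\rangle = r\|x^* - p\|^2. \]
Therefore, $\|x^* - p\| \le \varepsilon/r$.

We have established convergence of the algorithm when the stopping criterion is not present. Now, we wish to show that when the stopping criterion is inserted, the algorithm will always terminate in finite time and output a good approximation to the proximal point. This is an important detail, as the intention is to use the algorithm as a subroutine of minimization algorithms such as the DFO VU-algorithm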
of Chapter 9, so in order for the minimization algorithms to be well-posed, the subroutine will have to terminate. With the proof of Theorem 8.5.11 below, we will have proved that the algorithm does not run indefinitely, and that when it stops the output is within tolerance of the point we seek.

Theorem 8.5.11. Let $f$ be locally $K$-Lipschitz. Then the algorithm stops in finite time, at iteration $k + 1$ where
\[ \|x_{k+1} - x_k\| < \frac{r\,s_{\mathrm{tol}}^2}{K + \varepsilon}, \tag{8.5.12} \]
and outputs $x_{k+1}$ such that $\|x_{k+1} - x^*\| \le s_{\mathrm{tol}} + \varepsilon/r$.

Proof. By Lemma 8.5.1(iv) with $x = x_{k+1}$, we have
\begin{align*}
\varphi^\varepsilon_k(x_{k+1}) &\ge f(x_k) + \langle g^\varepsilon_k, x_{k+1} - x_k\rangle, \\
f(x_k) - \varphi^\varepsilon_k(x_{k+1}) &\le -\langle g^\varepsilon_k, x_{k+1} - x_k\rangle \\
&\le \|g^\varepsilon_k\|\,\|x_{k+1} - x_k\|.
\end{align*}
By Lemma 8.5.1(v), $\varphi^\varepsilon_k$ is $(K + \varepsilon)$-Lipschitz. Hence, $\|g^\varepsilon_k\| \le K + \varepsilon$, and
\[ f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) \le (K + \varepsilon)\|x_{k+1} - x_k\|. \]
We have that $\lim_{k\to\infty}\|x_{k+1} - x_k\| = 0$ by Corollary 8.5.8, so (8.5.12) will eventually be true. In that case,
\begin{align*}
f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1}) &< (K + \varepsilon)\,\frac{r\,s_{\mathrm{tol}}^2}{K + \varepsilon}, \\
\frac{f(x_{k+1}) - \varphi^\varepsilon_k(x_{k+1})}{r} &< s_{\mathrm{tol}}^2.
\end{align*}
This is the stopping condition of the algorithm. So the algorithm stops in finite time, and by Corollary 8.5.5, $x_{k+1}$ is such that $\|x_{k+1} - x^*\| \le s_{\mathrm{tol}} + \varepsilon/r$.

8.6 Numerical tests

In this section, we present the results of a number of numerical tests performed using this algorithm. The tests were run using MATLAB version 8.4.0.150421 (R2014b), on a 2.8 GHz Intel Core 2 Duo processor with a 64-bit operating system.

8.6.1 Bundle variants

We set $r = 1$ and compare four bundle variants: 3-bundle, $(k + 2)$-bundle, active bundle, and almost active bundle. In the 3-bundle variant, each iteration uses the three bundle elements indexed by $B_k = \{-1, 0, k\}$. In the $(k + 2)$-bundle variant, we keep all the bundle elements from each previous iteration (replacing the old aggregate with the new one), and add the $k$th element. So the bundle index set is $B_k = \{-1, 0, 1, 2, 3, \ldots, k\}$, for a total of $k + 2$ elements (see footnote 3). In the active bundle variant, we keep the indices $-1, 0, k$, and add in any indices $i$ that satisfy
\[ \varphi^\varepsilon_k(x_{k+1}) = \varphi^\varepsilon_k(x_i) + \langle g^\varepsilon_i, x_{k+1} - x_i\rangle. \]
These are the linear functions that are active at iteration $k$. Finally, the almost active bundle keeps the indices $-1, 0, k$, and adds in any indices that satisfy
\[ \varphi^\varepsilon_k(x_{k+1}) < \varphi^\varepsilon_k(x_i) + \langle g^\varepsilon_i, x_{k+1} - x_i\rangle + 10^{-6}. \]
These are the linear functions that are 'almost' active at iteration $k$, which allows for software rounding errors to be discounted.

Footnote 3: The index $-1$ is not necessary for convergence when indices 0 through $k$ are in the bundle. However, since our convergence analysis focused on the aggregate subgradient, we keep index $-1$ in every bundle variant.

8.6.2 Max-of-quadratics tests

For our first tests, we use a max-of-quadratics generator to create problems. Each problem to be solved is a randomly generated function
\[ f(x) = \max\{q_1(x), q_2(x), \ldots, q_m(x)\}, \]
where $q_i$ is convex quadratic for all $i$. There are four inputs to the generator function: $n$, $nf$, $nf_{x^*}$, and $nf_z$. The number $n$ is the dimension of $x$, $nf$ is the number of quadratic functions used, $nf_{x^*}$ is the number of quadratic functions that are active at the true proximal point, and $nf_z$ is the number of quadratic functions that are active at the proximal centre $z$. These features can all be controlled, as seen in [98]. The approximate subgradient is constructed by finding the gradient of the first active quadratic function and giving it some random error of magnitude less than $\varepsilon$ by using the randsphere routine (see footnote 1).

Footnote 1: Randsphere is a MATLAB function that outputs a uniformly distributed set of random points in the interior of an $n$-dimensional hypersphere of radius $r$ with centre at the origin. The code is found at http://www.mathworks.com/matlabcentral/fileexchange/9443-random-points-in-an-n-dimensional-hypersphere/content/randsphere.m.

That is,
\[ \tilde g^\varepsilon_k = \nabla f_i(x) + \varepsilon\,\mathrm{randsphere}(1, n, 1), \]
where $i$ is the first active index.
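The perturbation described above can be illustrated in a few lines of Python. This is only a sketch under stated assumptions, not the MATLAB randsphere code cited in the footnote: the helper names `random_in_ball` and `perturbed_gradient`, the sample gradient, and the choice of a normalized Gaussian direction with a $u^{1/n}$ radius scaling are all hypothetical choices made for this example.

```python
import math
import random

def random_in_ball(n, radius, rng):
    """Draw a point uniformly from the n-dimensional ball of the given radius."""
    # A normalized Gaussian vector is uniformly distributed on the unit sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling the radius by u**(1/n), with u uniform on [0, 1), makes the
    # point uniform with respect to volume instead of clustered at the centre.
    length = radius * rng.random() ** (1.0 / n)
    return [length * d / norm for d in direction]

def perturbed_gradient(grad, eps, rng):
    """Return grad plus a random error vector of norm at most eps."""
    noise = random_in_ball(len(grad), eps, rng)
    return [g + e for g, e in zip(grad, noise)]

rng = random.Random(0)          # seeded, so the run is reproducible
true_grad = [1.0, -2.0, 0.5]    # hypothetical gradient of the first active quadratic
eps = 1e-3
approx_grad = perturbed_gradient(true_grad, eps, rng)
error = math.sqrt(sum((a - g) ** 2 for a, g in zip(approx_grad, true_grad)))
print(error <= eps)
```

Seeding the generator plays the same role here as in the experiments: repeated runs produce the same perturbed gradient.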
Though we use random error, the random function is seeded, so that the results are reproducible. The primal quadratic program is solved using MATLAB's quadprog solver.

Two sets of problems were generated: low-dimension trials and high-dimension trials. For the low-dimension trials the Hessians of the $q_i$ functions were dense and we attempted to solve ten problems at each possible state of
\[ n \in \{4, 10, 25\}, \quad nf \in \left\{1, \left\lceil\tfrac{n}{3}\right\rceil, \left\lceil\tfrac{2n}{3}\right\rceil, n\right\}, \quad nf_{x^*} \in \left\{1, \left\lceil\tfrac{n}{3}\right\rceil, \left\lceil\tfrac{2n}{3}\right\rceil, n\right\}, \quad nf_z \in \left\{1, \left\lceil\tfrac{n}{3}\right\rceil, \left\lceil\tfrac{2n}{3}\right\rceil, n\right\}, \]
with three subgradient error levels, $\varepsilon \in \{0, s_{\mathrm{tol}}, 10\,s_{\mathrm{tol}}\}$, and four variants of the algorithm. This amounts to a total of 2700 problems attempted by each of the four variants, 10,800 runs altogether (see footnote 2). For the high-dimension trials the Hessians of the $q_i$ functions were sparse, with 95% density of zeros. In high dimensions, we attempted two problems at each possible state of $n \in \{100, 200\}$ and the same conditions as above on the rest of the variables, for another 360 problems attempted by each variant, 1440 runs altogether.

The performance of the variants is presented in two performance profile graphs and a table of averages. The table provides average CPU times, average number of iterations and average number of tilt-corrections for each of the four bundle variants. One performance profile graph is for low dimension $n = 4, 10, 25$, and the other is for high dimension $n = 100, 200$. A performance profile is a standard method of comparing algorithms, using the best-performing algorithm for each category of test as the basis for comparison. Here, we compare the four variants based on CPU time and number of iterations used to solve each problem. We set $s_{\mathrm{tol}} = 10^{-3}$ for all tests and declare a problem solved if the stopping criterion is triggered within $100n$ iterations for the low-dimension set, $20n$ for the high-dimension set. The x-axis, on a natural logarithmic scale, is the factor $\tau$ of the best possible ratio.
On the y-axis is the percentage of problems solved within a factor of $\tau$ of the best solve time for a given function.

Footnote 2: These totals take into account that $nf_z$ and $nf_{x^*}$ cannot exceed $nf$ at any stage. When $n = 4$, we have fewer possibilities due to the low values of $\lceil n/3\rceil$ and $\lceil 2n/3\rceil$.

In the low-dimension case, we see from Figure 8.1 that the $(k + 2)$-bundle is the most efficient and the almost active bundle follows close behind. The 3-bundle and active bundle coincide almost exactly; their curves overlap. Comparing the results for each error level in terms of CPU time vs. number of iterations (each pair of side-by-side graphs), there is no notable difference. With the 3-bundle and active bundle, about 22% of the problems timed out, meaning that the upper bound of $100n$ iterations was reached before the stopping condition was triggered. With the almost active bundle, that figure drops to about 10%, and the $(k + 2)$-bundle solved all of the problems. However, it is interesting to note that about 95% of those timed-out problems still ended with $\mathrm{dist}(x_k, x^*) \le s_{\mathrm{tol}} + \varepsilon/r$. While this does not contradict the theory within this paper, it suggests that the stopping test may be more difficult to satisfy than is desired. Future research should explore alternative stopping tests.

In the high-dimension case, Figure 8.2 tells us that the 3-bundle and active bundle perform well in terms of CPU time for the problems they solved; however, they were only able to solve about two thirds of the problems within the allotted time limit. The $(k + 2)$-bundle took more time, and the almost active bundle much more still, but the former solved all the problems and the latter solved 97% of them. In terms of number of iterations, the same general pattern as was found in the low-dimension case appears.
The only difference is that the almost active bundle uses noticeably more iterations to complete the job, but the curve still lies below that of the $(k + 2)$-bundle and above the other two.

The average CPU time, number of iterations and number of tilt-corrections for both sets of problems are displayed in Table 8.1. It is interesting to note that for this type of problem the tilt-correction was used sparingly in low dimension, and in high dimension it was not needed at all. Our feeling is that this is due to the way we chose to approximate subgradients; this will be made clearer in the next sets of numerical tests below. Also, it needs to be mentioned that our software applied quadprog to solve the quadratic subproblems. While quadprog is readily available, it is not recognized as among the top-quality quadratic program solvers. Testing the algorithm using different solvers might produce different results. However, it is also possible that the inexactness of the subgradients might override any improvement in solver quality.

[Figure 8.1: Low-dimension performance profile, $n = 4, 10, 25$. Six panels (Error = 0, Error = stol and Error = 10 stol, each for CPU time and for iterations) plot the fraction of problems solved, from 0 to 1, against $\log\tau$ for the 3-bundle, $(k+2)$-bundle, active bundle and almost active bundle.]
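For readers unfamiliar with performance profiles, the computation behind such a graph can be sketched as follows. This is a minimal illustration of the standard ratio-to-best construction, not the script used to produce the figures; the function name `performance_profile`, the solver labels and the cost data are hypothetical, and a failed run is encoded as an infinite cost.

```python
import math

def performance_profile(costs, taus):
    """Fraction of problems each solver solves within a factor tau of the best.

    costs[s][p] is the (positive) cost of solver s on problem p, for example
    CPU time or iteration count, with math.inf marking a failed run.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    # Best cost achieved on each problem by any solver.
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        ratios = [costs[s][p] / best[p] for p in range(n_problems)]
        profile[s] = [sum(r <= tau for r in ratios) / n_problems for tau in taus]
    return profile

# Hypothetical data: two solvers, four problems; solver "A" fails on the third.
costs = {
    "A": [1.0, 2.0, math.inf, 4.0],
    "B": [2.0, 1.0, 3.0, 4.0],
}
taus = [1.0, 2.0, 4.0]
profile = performance_profile(costs, taus)
print(profile)
```

Plotting each list in `profile` against the logarithm of `taus` gives curves of the kind shown in the figures; a curve that never reaches 1 indicates problems that the variant did not solve within the iteration limit.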
[Figure 8.2: High-dimension performance profile, $n = 100, 200$. Six panels (Error = 0, Error = stol and Error = 10 stol, each for CPU time and for iterations) plot the fraction of problems solved, from 0 to 1, against $\log\tau$ for the 3-bundle, $(k+2)$-bundle, active bundle and almost active bundle.]

Table 8.1: Average values among the four bundle variants.

Bundle           Dimension   Average CPU Time   Average Iterations   Average Tilt-corrections
3-bundle         n low       1.36 s             191                  0.0122
                 n high      19.90 s            2034                 0
(k + 2)-bundle   n low       0.61 s             45                   0.0015
                 n high      66.31 s            250                  0
Active bundle    n low       1.39 s             191                  0.0111
                 n high      19.90 s            2035                 0
Almost active    n low       1.96 s             109                  0.003
                 n high      531.06 s           1438                 0

8.6.3 DFO tests

To see how the algorithm performs in the setting of derivative-free optimization, we selected a test set of ten functions and ran the algorithm using the simplex gradient method developed in the robust approximate gradient sampling algorithm [90]. The algorithm approximates a subgradient of the objective function by using convex combinations of linear interpolation approximate subgradients (details can be found in [90]). Most of the problems are taken from [147], some of which were adjusted slightly to make them convex. The adjustments made and other details on these test functions appear in [95].
A brief description of the functions is found in Table 8.2; more details are available from [147].

Table 8.2: Set of test problems using simplex gradients.

Function             Dimension   Description                            Reference
P alpha              2           max of 2 quadratics
DEM                  2           max of 3 quadratics                    [147, Problem 3.5]
Wong 3 (adjusted)    20          max of 18 quadratics                   [147, Problem 2.21]
CB 2                 2           max of exponential, quartic            [147, Problem 2.1]
Mifflin 2            2           absolute value + quadratic             [147, Problem 3.9]
EVD 52 (adjusted)    3           max of quartic, quadratics, linears    [147, Problem 2.4]
OET 6 (adjusted)     2           max of exponentials                    [147, Problem 2.12]
MaxExp               12          max of exponentials
MaxLog               30          max of negative logarithms
Max10                10          max of 10th-degree polynomials

Each bundle variant was run 100 times on each test problem. In all cases, the algorithm located the correct proximal point. The average CPU time, average number of iterations and average number of tilt-correct steps used appear in Table 8.3.

Table 8.3: Average values among the four bundle variants.

Bundle           Average CPU Time   Average Iterations   Average Tilt-corrections
3-bundle         0.85 s             147                  1.01
(k + 2)-bundle   1.17 s             45                   1.18
Active bundle    0.84 s             146                  0.93
Almost active    1.18 s             59                   0.79

As in the previous test set, in terms of iterations, the $(k + 2)$-bundle and the almost active variants greatly outperform the 3-bundle and active bundle variants. In terms of CPU time, all methods used approximately one second per problem. However, unlike the first test set, this test set required an average of one tilt-correct step per problem. Since these latest averages were taken over varying types of functions instead of just max-of-quadratics, it could be that the tilt-correct step is more often necessary for an objective function that is not a max-of-quadratics. Or it could be that using simplex gradients, instead of finding a true subgradient and giving it some random error via the randsphere function, requires heavier use of the tilt-correct step. This issue inspired the next set of tests below.

8.6.4 Simplex gradient vs. randsphere tests

The next set of data takes the first 2400 trials of the low-dimension case and solves the same problems using the aforementioned simplex gradient method of [90]. We compare these results with the previously obtained randsphere method by way of the performance profile in Figure 8.3 and the average values of Table 8.4.

[Figure 8.3: Low-dimension performance profile, randsphere vs. simplex gradient, $n = 4, 10, 25$. Four panels (Error = stol and Error = 10 stol, each for CPU time and for iterations) plot the fraction of problems solved, from 0 to 1, against $\log\tau$ for the randsphere and simplex methods.]

Table 8.4: Average values for two methods of approximating subgradients.

Method       Average CPU Time   Average Iterations   Average Tilt-corrections
randsphere   0.6606             49.9050              0.0040
Simplex      0.8034             44.6333              12.2750

There is not a very noticeable difference in the performance profile graph, except that when using the randsphere method all the problems are solved, whereas about 10% of problems timed out when the simplex gradient method was used. As in Section 8.6.2, a problem times out if more than $100n$ iterations are required to trigger the stopping condition. The two curves start out in the same place and increase at about the same rate. The table reflects that; both the average CPU time and the average number of iterations have negligible differences between the two methods. However, we do notice a large difference in the average number of tilt-corrections used. As mentioned in the discussion of the previous data set, the tilt-correct step is almost never implemented when using randsphere to give a true subgradient some error. However, when the simplex gradient method is used there is an average of 12 tilt-corrections performed per problem solved.
This suggests that in a true DFO setting, the tilt-correct step will be utilized much more often.

8.7 Summary

We have presented an approximate subgradient method for numerically finding the proximal point of a convex function at a given point. The method assumes the availability of an oracle that delivers an exact function value and a subgradient that is within distance $\varepsilon$ of the true subdifferential, but does not depend on the approximate subgradient being an $\varepsilon$-subgradient, nor on the model function being a cutting-planes function (one that bounds the objective function from below). The method is proved to converge to the true proximal point within $s_{\mathrm{tol}} + \varepsilon/r$, where $s_{\mathrm{tol}}$ is the stopping tolerance of the algorithm, $\varepsilon$ is the bound on the approximate subgradient error, and $r$ is the proximal parameter.

From a theoretical standpoint, two questions immediately present themselves. First, could the method be extended to work for nonconvex functions? Second, could the method be extended to work in situations where the function value is inexact? Some of the techniques in this paper were inspired by [98, 100], which suggests that the answer to both questions could be positive. However, the extensions are not as straightforward as they may first appear. The key difficulty in extending this algorithm in either of these directions is that, when multiple potential sources of error are present, it is difficult to determine the best course of action. For example, suppose $f_k + \langle \tilde g^\varepsilon_k, z - x_k\rangle - f(z) > 0$. Then, by convexity, we know the inexactness of the subgradient is at fault and we perform a tilt-correct step. However, if the function is nonconvex, then the above inequality could occur due to nonconvexity or due to the inexactness of the subgradient. If the error is due to inexactness, then a tilt-correct step is still the right thing to do. If, on the other hand, the error is due to nonconvexity, then it might be better to redistribute the proximal parameter as suggested in [98].
These issues are equally complex if inexact function values are allowed, and even more complex if both nonconvex functions and inexact function values are permitted.

Another obvious theoretical question is, what happens if $\varepsilon_k$ asymptotically tends to zero? Would the algorithm converge to the exact proximal point? It is likely that, because past information is preserved in the aggregate subgradient, allowing $\varepsilon_k \to 0$ inside this routine will not result in an asymptotically exact algorithm. However, this is not a concern, as the purpose of this algorithm is to be used as a subroutine inside a minimization algorithm, where $\varepsilon_k$ can be made to tend to zero outside the proximal routine. This has the effect of resetting the routine with ever smaller values of $\varepsilon$, which yields asymptotic exactness.

One may also wonder what effect changing the proximal parameter as the algorithm runs would have on the results and the speed of convergence. In [190], the authors outline a method for dynamically choosing the optimal proximal parameter at each iteration by solving an additional minimization problem and incurring negligible extra cost, with encouraging numerical results. In future work, that method could be incorporated into this algorithm to see if the runtime improves.

From a numerical standpoint, we found that when subgradient errors are present, it is best to keep all the bundle information and use it at every iteration. The other bundle variants also solve the problems, but clearly not as robustly as the biggest bundle does.

There are many more numerical questions that could be investigated. For example, error bounds/accuracy on the results could be analyzed, and as mentioned above, the effect of a dynamic proximal parameter could be investigated. The more immediate goal is to examine this new method in light of the VU-algorithm [159], which alternates between a proximal step and a quasi-Newton step to minimize nonsmooth convex functions.
The results of this paper theoretically confirm that the proximal step will work in the setting of inexact subgradient evaluations. Combined with [87], which theoretically examines the quasi-Newton step, the development of a VU-algorithm for inexact subgradients appears achievable.

Chapter 9

A Derivative-free VU-algorithm

9.1 Overview

The VU-algorithm is a superlinearly convergent method for minimizing nonsmooth convex functions [159]. At each iteration, the algorithm separates $\mathbb{R}^n$ into the V-space and the U-space, such that the nonsmoothness of the objective function is concentrated on its projection onto the V-space, and on the U-space the projection is smooth. This structure allows for an alternation between a Newton-like step where the function is smooth, and a proximal-point step that is used to seek iterates with promising VU-decompositions. In this chapter, we establish a derivative-free variant of the VU-algorithm for convex finite-max objective functions. We show convergence and provide numerical results from a proof-of-concept implementation that demonstrate the feasibility and practical value of the approach.

This chapter is based on results found in [96], which is submitted to Mathematical Programming.

9.2 Background

We consider the finite-dimensional, unconstrained minimization problem
\[ \min_{x \in \mathbb{R}^n} f(x) \tag{9.2.1} \]
with $f$ a nonsmooth, proper, lsc, convex finite-max function,
\[ f(x) = \max_{i = 1, \ldots, m} f_i(x), \]
where for $i \in \{1, 2, \ldots, m\}$ the functions $f_i : \mathbb{R}^n \to \mathbb{R}$ are of class $C^{2+}$. We work under the assumption that there is an oracle delivering only function values. We refer to the oracle as a "grey box" because, even though the subfunctions cannot be examined analytically, for a given $x \in \mathbb{R}^n$ the oracle reports not only $f(x)$ but also which of the subfunctions $f_i$ yield the maximum. Our goal is to exploit the grey-box information in a derivative-free optimization setting.
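To make the grey-box idea concrete, the following Python sketch returns the function value together with the indices of the subfunctions that attain the maximum. It is an illustration only: the name `grey_box`, the tolerance `tol` used to decide which subfunctions count as attaining the maximum (the text uses exact equality), the zero-based indexing, and the three sample subfunctions are assumptions made for this example.

```python
def grey_box(subfunctions, x, tol=1e-12):
    """Return f(x) = max_i f_i(x) and the (zero-based) indices attaining the max."""
    values = [fi(x) for fi in subfunctions]
    fx = max(values)
    # A small tolerance guards against rounding when two values tie.
    active = [i for i, v in enumerate(values) if fx - v <= tol]
    return fx, active

# Hypothetical finite-max function built from three smooth subfunctions.
subfunctions = [
    lambda x: x[0] ** 2 + x[1] ** 2,
    lambda x: x[0] + x[1],
    lambda x: -x[0],
]

value_a, active_a = grey_box(subfunctions, [1.0, 1.0])   # first two tie at 2.0
value_b, active_b = grey_box(subfunctions, [2.0, 0.0])   # only the first is active
print(value_a, active_a)
print(value_b, active_b)
```

The caller never sees the subfunctions themselves, only the pair returned by `grey_box`, which is the information the derivative-free method is allowed to use.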
The VU-algorithm alternates between a proximal point step and a "U-Newton" step to achieve superlinear convergence in the minimization of nonsmooth convex functions [159]. The VU-algorithm has proven effective in dealing with the challenges that arise in that setting [87, 154, 156, 158, 159]. It continues to be a method of interest in the Optimization community, having been expanded to use on convex functions with primal-dual gradient structure [155–157] and even some nonconvex functions [158]. The basic tenet is to separate the space into two orthogonal subspaces, called the V-space and the U-space, such that near the current iteration point the nonsmoothness of $f$ is captured in the V-space, and the smoothness of $f$ is captured in the U-space. This procedure is known as a VU-decomposition. Once this is accomplished, one can take a proximal point step (V-step) parallel to the V-space, in order to find incumbent solutions with favourable VU-decompositions, then a Newton-like step (U-step) parallel to the U-space.

In order to apply the VU-algorithm, at each iteration it is necessary to do the VU-decomposition, compute the proximal point to apply the V-step, then compute the U-gradient and U-Hessian to apply the U-step (each of these computations is formally defined in Section 9.3). In our grey-box optimization setting, none of these objects is directly available. However, in [87] it was shown that the VU-decomposition, U-gradient, and U-Hessian can be approximated numerically with controlled precision for finite-max functions. Moreover, in Chapter 8 of the present thesis we have a derivative-free algorithm for computing proximal points of convex functions that only requires approximate subgradients. Finally, in [90] it was shown how to approximate subgradients for finite-max functions using only function values. Hence, we have a sufficient foundation to develop a derivative-free VU-algorithm suitable for our grey-box optimization setting.
We show that at each iteration, one can approximate subgradients of the objective function as closely as one wishes and use the inexact first-order information to obtain approximations of all the necessary components of the algorithm. We prove that the results of global convergence in [159] can be extended to the framework of inexact gradients and Hessians.

9.2.1 Assumptions

Throughout this chapter, we assume the following for Problem (9.2.1).

Assumption 9.2.1. The objective function $f : \mathbb{R}^n \to \mathbb{R}$ is convex and defined through the maximum of a finite number of functions,
\[ f(x) = \max_{i = 1, \ldots, m} f_i(x), \]
where each $f_i \in C^{2+}$. Furthermore, at each given point $\bar x$ the grey box returns the indices of active subfunctions, that is, the indices in the set
\[ A(\bar x) = \{i : f_i(\bar x) = f(\bar x)\}. \]

Assumption 9.2.2. The objective function $f$ has compact lower level sets, that is, the set
\[ S_\beta = \{x \in \mathbb{R}^n : f(x) \le \beta\} \]
is compact for any choice of $\beta \in \mathbb{R}$.

Assumption 9.2.3. For any fixed $\bar x \in \mathbb{R}^n$, the set of active gradients,
\[ \{\nabla f_i(\bar x) : i \in A(\bar x)\}, \]
is affinely independent. That is, the only scalars $\lambda_i$ that satisfy
\[ \sum_{i \in A(\bar x)} \lambda_i \nabla f_i(\bar x) = 0, \quad \sum_{i \in A(\bar x)} \lambda_i = 0 \]
are $\lambda_i = 0$ for all $i \in A(\bar x)$.

9.3 VU-theory

As the objective function $f$ is convex and finite-valued, the subdifferential of $f$ at a point $\bar x$ is well-defined and never empty. The $\varepsilon$-subdifferential of $f$ at $x$ is denoted $\partial_\varepsilon f(x)$ (with $g \in \partial_\varepsilon f(x)$ called an $\varepsilon$-subgradient) and is defined by
\[ \partial_\varepsilon f(\bar x) = \{g \in \mathbb{R}^n : f(x) \ge f(\bar x) + g^\top(x - \bar x) - \varepsilon \text{ for all } x \in \mathbb{R}^n\}. \]
Given a finite-max function, the active indices provide an alternate manner of constructing the subdifferential:
\[ \partial f(\bar x) = \mathrm{conv}\{\nabla f_i(\bar x) : i \in A(\bar x)\}. \tag{9.3.1} \]

At any point $x \in \mathbb{R}^n$, the space can be split into two orthogonal subspaces called the U-space and the V-space, such that the nonsmoothness of $f$ is captured entirely in the V-space, while on the U-space $f$ behaves smoothly. This is accomplished through the VU-decomposition.

Definition 9.3.1. Fix $\bar x \in \mathbb{R}^n$ and let $g \in \mathrm{ri}\,\partial f(\bar x)$.
The VU-decomposition of $\mathbb{R}^n$ for $f$ at $\bar x$ is the separation of $\mathbb{R}^n$ into the following two orthogonal subspaces:
\[ \mathcal{V}(\bar x) = \mathrm{span}\{\partial f(\bar x) - g\} \quad \text{and} \quad \mathcal{U}(\bar x) = [\mathcal{V}(\bar x)]^\perp. \]

This decomposition is independent of the choice of $g \in \mathrm{ri}\,\partial f(\bar x)$ [136, Proposition 2.2]. With $V$ a basis matrix for the V-space and $U$ a semiorthonormal basis matrix for the U-space, every $x \in \mathbb{R}^n$ can be decomposed into components $x_U \in \mathbb{R}^{\mathrm{rank}\,U}$ and $x_V \in \mathbb{R}^{\mathrm{rank}\,V}$ [136, Section 2]. Defining
\[ x_U = U^\top x \quad \text{and} \quad x_V = \left(V^\top V\right)^{-1} V^\top x, \]
we write
\[ x = x_U \oplus x_V = U x_U + V x_V, \]
where the symbol $\oplus$ represents the VU-decomposition of $x$ (rather than the traditional Minkowski sum). Henceforth, we use the notation $\mathbb{R}^U$ and $\mathbb{R}^V$ to represent $\mathbb{R}^{\mathrm{rank}\,U}$ and $\mathbb{R}^{\mathrm{rank}\,V}$, respectively.

Given a subgradient $g \in \partial f(\bar x)$ with V-component $g_V$, the U-Lagrangian of $f$, $L_U(u; g_V) : \mathbb{R}^U \to \mathbb{R}$, is defined [159, Section 2.1] as follows:
\[ L_U(u; g_V) = \min_{v \in \mathbb{R}^V}\left\{f(\bar x + U u + V v) - g_V^\top v\right\}. \]
The associated set of V-space minimizers is
\[ W(u; g_V) = \{V v : L_U(u; g_V) = f(\bar x + U u + V v) - g_V^\top v\}. \]
The U-gradient $\nabla L_U(0; g_V)$ and the U-Hessian $\nabla^2 L_U(0; g_V)$ are then defined as the gradient and Hessian, respectively, of the U-Lagrangian. For $f$ convex, each U-Lagrangian is a convex function that is differentiable at $u = 0$, with
\[ \nabla L_U(0; g_V) = g_U = U^\top g = U^\top \tilde g \quad \text{for all } \tilde g \in \partial f(x) \quad \text{[159, Section 2.1]}. \]
If $L_U(u; g_V)$ has a Hessian at $u = 0$, then the second-order expansion of $L_U$ also provides a second-order expansion of $f$ in the U-space. General conditions for existence of the U-Hessian are found in [159]. However, for the purpose of this paper, we note that Assumptions 9.2.1 and 9.2.3, combined with $g \in \mathrm{ri}\,\partial f(x)$, imply the existence of a U-Hessian at the origin [87, Lemma 2.6, Lemma 3.5].

When $\bar x$ minimizes $f$, we have $0 \in \partial f(\bar x)$ [159, Section 2.1]. This gives
\[ \nabla L_U(0; g_V) = 0 \quad \text{for all } g \in \partial f(\bar x), \]
and $L_U$ is minimized at $u = 0$, which subsequently yields $L_U(0; 0) = f(\bar x)$.
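For a finite-max function, (9.3.1) implies that the V-space is spanned by differences of active gradients, so the decomposition can be illustrated with a short Gram-Schmidt computation. The Python sketch below is a hedged illustration and not the procedure of [87]: the names `orthonormalize` and `vu_decomposition` are hypothetical, exact active gradients are assumed to be available, basis vectors are stored as rows, the rank tolerance `tol` is an arbitrary choice, and both bases are made orthonormal (the text only requires this of $U$).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, basis=(), tol=1e-10):
    """Gram-Schmidt: return orthonormal vectors extending `basis` by `vectors`."""
    new = []
    for v in vectors:
        w = list(v)
        # Remove the components along every direction found so far.
        for b in list(basis) + new:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:                      # skip (numerically) dependent vectors
            new.append([wi / norm for wi in w])
    return new

def vu_decomposition(active_gradients):
    """Return orthonormal bases (as lists of rows) for the V-space and U-space."""
    n = len(active_gradients[0])
    g0 = active_gradients[0]
    # V is spanned by differences of active gradients.
    diffs = [[gi - g0i for gi, g0i in zip(g, g0)] for g in active_gradients[1:]]
    V = orthonormalize(diffs)
    # U is the orthogonal complement: sweep the standard basis against V.
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = orthonormalize(identity, basis=V)
    return V, U

# Hypothetical example: f(x) = max(x1 + x2^2, -x1 + x2^2) = |x1| + x2^2 at the
# origin, where both subfunctions are active with gradients (1, 0) and (-1, 0).
V, U = vu_decomposition([[1.0, 0.0], [-1.0, 0.0]])
print(V, U)
```

In this example the kink of $|x_1|$ lies along the first coordinate, which becomes the V-space, while the smooth direction $x_2$ becomes the U-space.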
VU-theory9.3.1 Primal-dual tracksThe second-order expansion of f in the U-space allows the VU-algorithm totake U-Newton steps, which in turn allows for rapid convergence. However, inorder to be effective, the algorithm must seek out iterates where the U-space at theiterate lines up with the U-space at the minimizer. This is accomplished through theproximal point operation. When f is convex, the proximal mapping is a singleton,called the proximal point. When computed close to a minimizer x¯, the proximalpoint has a very important relationship with a smooth manifold called the primaltrack [158, §1], defined as follows.Definition 9.3.2. The ordered pair (χ(u), γ(u)) is called a primal-dual track lead-ing to (x¯, 0) (a minimizer of f and zero-subgradient ordered pair), if for all u ∈ RUsmall enoughthe primal track χ(u) = x¯+ (u⊕ v(u)) andthe dual track γ(u) = argmin {‖g‖ : g ∈ ∂f(χ(u))}satisfy the following:(i) v : RU → RV is C2 such that V v(u) ∈WU (u; gV) for all g ∈ ri ∂f(x¯),(ii) the Jacobian Jχ(u) is a basis matrix for V(χ(u))⊥, and(iii) the particular U-Lagrangian LU (u; 0) is a C2 function.There are two special-definition cases: rank U = 0 and rank U = n. In theformer, the primal-dual track is defined as the point (x¯, 0). In the latter, we define(χ(u), γ(u)) = (x¯+u,∇f(x¯+u)) for all u ∈ Bδ ⊆ Rn .When Definition 9.3.2(i)holds, we have v(0) = 0, Jv(0) = 0 and v(u) = O(‖u‖2) [136, Corollary 3.5], sothat χ(u) is a trajectory that is tangent to U at x¯. Definition 9.3.2(ii) gives us thatχ(u) is tangent to U along the entire primal track.When the Hessian of LU (u; 0) exists at u = 0 (see [136, Definition 3.8] andthe preamble), the second-order expansion of LU is possible [159, Section 2.2].Lemma 3 of [159] shows that in that case, the dual track is tangent to the primaltrajectory and gives a C1 U-gradient. The primal-dual track is unknown in generalbut has a connection with the proximal point that can be exploited. 
For $r > 0$, in [191] we find the following properties:
(i) $g_r(x) = r[x - P_r f(x)] \in \partial f(P_r f(x))$;
(ii) if $\bar{x}$ is a minimizer of $f$, then $\bar{x}$ is a fixed point of $P_r f$ and
\[ \|P_r f(x) - \bar{x}\|^2 \le \|x - \bar{x}\|^2 - \|x - P_r f(x)\|^2. \]
This leads to the following very useful equivalence between the proximal point and the primal track.

Theorem 9.3.3. [159, Theorem 4] Let $\chi(u)$ be a primal track leading to a minimizer $\bar{x} \in \mathbb{R}^n$, so that $0 \in \partial f(\bar{x})$. Suppose that for all $x$ close enough to $\bar{x}$ we have $r = r(x) > 0$ with $r(x)\|x - \bar{x}\| \to 0$ as $x \to \bar{x}$. Define $u_r(x) = (P_r f(x) - \bar{x})_U$. Then for all $x$ close enough to $\bar{x}$,
\[ P_r f(x) = \chi(u_r(x)) = \bar{x} + u_r(x) \oplus v(u_r(x)) \]
and $u_r(x) \to 0$ as $x \to \bar{x}$.

The advantage of Theorem 9.3.3 is that we can concentrate on finding the proximal point instead of being concerned about how to find the primal track, since close to $\bar{x}$ they are one and the same. Furthermore, a routine for finding the proximal point of a convex function at a given point already exists [124]. We give a brief summary of the method here. Given a convex function $f$ and an initial point $y_0$, at iteration $j$ of the bundle routine we choose any subgradient $g_j \in \partial f(y_j)$ and define the linearization error:
\[ E_j = E(x, y_j) = f(x) - f(y_j) - g_j^\top (x - y_j). \]
Since $f$ is convex, $E_j \ge 0$ for all $j$. Using the fact that $f(z) \ge f(y_j) + g_j^\top (z - y_j)$ for all $z \in \mathbb{R}^n$, we have that
\[ f(z) \ge f(x) + g_j^\top (z - x) - E_j \quad\text{for all } z \in \mathbb{R}^n. \]
In other words, $g_j \in \partial_{E_j} f(x)$. The bundle $\{(E_j, g_j)\}_{j \in B}$, where $B$ is a set that indexes information from previous iterations, is used to construct a convex piecewise-linear function $\varphi_j$ that approximates and minorizes $f$. Then the new iteration point $y_{j+1} = P_r \varphi_j(y_0)$ is found, and the process repeats. This method is proved in [58] to converge to $P_r f(y_0)$.
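To make the bundle routine just summarized concrete, the following Python sketch computes an approximate proximal point of a one-dimensional convex function from exact subgradients. It is an illustration only, not the implementation of [124]: the function names `prox_of_max_affine` and `bundle_prox`, the tolerance `tol` and the choice to keep every cutting plane (no bundle compression) are assumptions made here, and the prox-centre is held fixed at `y0`, as in Step 1.3 of the DFO VU-algorithm.

```python
def prox_of_max_affine(pieces, center, r):
    """Exact proximal point of phi(z) = max_i (a_i + g_i * z) in one dimension.

    pieces is a list of (a_i, g_i) pairs. The strongly convex objective
    phi(z) + (r / 2) * (z - center)**2 is minimized either where a single
    piece is active (z = center - g_i / r) or at a kink between two pieces,
    so comparing the objective over those candidates finds the minimizer.
    """
    def objective(z):
        return max(a + g * z for a, g in pieces) + 0.5 * r * (z - center) ** 2

    candidates = [center - g / r for _, g in pieces]
    for i, (a_i, g_i) in enumerate(pieces):
        for a_k, g_k in pieces[i + 1:]:
            if g_i != g_k:  # parallel pieces never intersect
                candidates.append((a_k - a_i) / (g_i - g_k))
    return min(candidates, key=objective)


def bundle_prox(f, subgrad, y0, r, tol=1e-10, max_iter=100):
    """Approximate P_r f(y0) for a convex f: R -> R using exact subgradients."""
    pieces = []
    y = y0
    for _ in range(max_iter):
        g = subgrad(y)
        # Store the cutting plane f(y) + g * (z - y) as (intercept, slope).
        pieces.append((f(y) - g * y, g))
        # Proximal point of the piecewise-linear model at the fixed centre y0.
        y = prox_of_max_affine(pieces, y0, r)
        model_value = max(a + s * y for a, s in pieces)
        # The model minorizes f; if it matches f at y, then y = P_r f(y0).
        if f(y) - model_value <= tol:
            break
    return y
```

For $f(x) = |x|$ and $r = 1$, whose proximal mapping is soft-thresholding with threshold $1/r$, the call `bundle_prox(abs, lambda y: 1.0 if y >= 0 else -1.0, 3.0, 1.0)` stops after one cut at `2.0`, and the same call with `y0 = 0.5` needs two cuts and returns `0.0`.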
The DFO version of this routine that was presented in Chapter 8 is proved in [95] to converge to $P_r f(y_0)$ as well, within a user-defined tolerance level.

9.3.2 The VU-algorithm

When a primal track exists, the VU-algorithm takes a step approximately following the primal track by way of a V-step (proximal step) followed by a U-step (quasi-Newton step). The V-step outputs a potential primal track point, which is then checked and either accepted or rejected, depending on whether sufficient descent is achieved. We now state an abbreviated version of the conceptual VU-algorithm presented in [159].

Conceptual VU-algorithm
Step 0: Initialize the starting point $x_0$, proximal parameter $r > 0$, iteration counter $k = 0$ and other parameters.
Step 1: Given $g \in \partial f(x_k)$, compute the VU-decomposition with subspace bases $V$ and $U$.
Step 2: Compute an approximate proximal point $x_{k+1} \approx P_r f(x_k)$. Increment $k \mapsto k + 1$.
Step 3: If $x_k$ does not show sufficient descent, then declare a null step and repeat Step 2 to higher precision. If $x_k$ does show sufficient descent, then check stopping conditions and either stop or continue to Step 4.
Step 4: Compute the U-gradient $\nabla L_U(0; g_V)$ and a U-Hessian $\nabla^2 L_U(0; g_V)$. Take a U-Newton step by solving
\[ \nabla^2 L_U(0; g_V)\, \Delta u = -\nabla L_U(0; g_V) \]
for $\Delta u$ and setting
\[ x_{k+1} = x_k + U \Delta u. \]
Increment $k \mapsto k + 1$ and go to Step 1.
End algorithm.

9.4 Defining inexact subgradients and related approximations

We now consider how to adapt the conceptual VU-algorithm to a derivative-free setting as provided by Assumptions 9.2.1 through 9.2.3. In order to prove convergence, we make use of the results of [87] and [95]. We use the techniques in [87] to approximate a subgradient, the VU-decomposition, the U-gradient and the U-Hessian for the function $f$ at a point $\bar{x}$.

To define an inexact subgradient for $f$, we make use of the simplex gradients of each $f_i$.
The simplex gradient is defined as the gradient of the approximation resulting from a linear interpolation of $f$ over a set of $n + 1$ points in $\mathbb{R}^n$ [57, 121].

Definition 9.4.1. Let $Y = [y_0, y_1, \ldots, y_n]$ be a set of affinely independent points in $\mathbb{R}^n$. Then it is said that $Y$ forms a simplex, and the simplex gradient of a function $f_i$ over $Y$ is given by $\nabla_\varepsilon f_i(Y) = M^{-1} \delta f_i(Y)$, where
\[ M = [y_1 - y_0 \ \cdots \ y_n - y_0]^\top \quad\text{and}\quad \delta f_i(Y) = \begin{bmatrix} f_i(y_1) - f_i(y_0) \\ \vdots \\ f_i(y_n) - f_i(y_0) \end{bmatrix}. \]
The condition number of $Y$ is given by $\|\hat{M}^{-1}\|$, where
\[ \hat{M} = \frac{1}{\varepsilon} [y_1 - y_0 \ \ y_2 - y_0 \ \cdots \ y_n - y_0]^\top \quad\text{with}\quad \varepsilon = \max_{j = 1, \ldots, n} \|y_j - y_0\|. \]
An important aspect of the condition number is that it is always possible to keep it bounded while simultaneously making $\varepsilon$ arbitrarily close to zero (see Remark 9.4.4). The following result provides an error bound for the distance between the simplex gradient and the exact gradient for a smooth function.

Theorem 9.4.2. [121, Lemma 6.2.1] Consider $f_i \in C^2$. Let $Y = [y_0, y_1, \ldots, y_n]$ form a simplex. Then there exists $\mu = \mu(y_0) > 0$ such that
\[ \|\nabla_\varepsilon f_i(Y) - \nabla f_i(y_0)\| \le \varepsilon \mu \|\hat{M}^{-1}\|. \]
We set $y_0 = \bar{x}$, and $y_1$ through $y_n$ are set to $\bar{x} + \varepsilon e_i$, where $e_i$ is the $i$th canonical vector. If desired, a rotation matrix can be used to prevent the $y_i$ vectors from being oriented in the coordinate directions every time. Now we define Subroutine 9.4.3, which we use to find an approximate subgradient $g^\varepsilon$, approximations of the subspace bases $V$ and $U$ and the approximate U-gradient $\nabla_\varepsilon L_U(0; g^\varepsilon_V)$. Henceforth, we use $|A(\bar{x})|$ to denote the cardinality of $A(\bar{x})$.

Subroutine 9.4.3 (First-order approximations).
Step 0: Input $\bar{x}$ and $\varepsilon$.
Step 1: Set $Y = [\bar{x} \ \ \bar{x} + \varepsilon e_1 \ \ \bar{x} + \varepsilon e_2 \ \cdots \ \bar{x} + \varepsilon e_n]$.
Step 2: Find $A(\bar{x})$ and calculate $\nabla_\varepsilon f_i(Y)$ for each $i \in A(\bar{x})$.
Step 3: Set
(i) $g^\varepsilon = \frac{1}{|A(\bar{x})|} \sum_{i \in A(\bar{x})} \nabla_\varepsilon f_i(Y)$;
(ii) $V$ as the matrix of column vectors $\nabla_\varepsilon f_i(Y) - \nabla_\varepsilon f_I(Y)$, $i \in A(\bar{x}) \setminus \{I\}$, where $I$ is the first element of $A(\bar{x})$;
(iii) $U = \operatorname{null} V / \|\operatorname{null} V\|$;
(iv) $\nabla_\varepsilon L_U(0; g^\varepsilon_V) = U^\top g^\varepsilon$.
End subroutine.
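The first steps of Subroutine 9.4.3 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the thesis implementation: it covers only Steps 1 to 3(i), it exploits the fact that for the coordinate simplex of Step 1 the matrix $M$ equals $\varepsilon\,\mathrm{Id}$ so that the simplex gradient reduces to forward differences, and the names `simplex_gradient`, `approximate_subgradient` and the tolerance `active_tol` are hypothetical.

```python
def simplex_gradient(fi, x, eps):
    """Simplex gradient of fi over Y = [x, x + eps*e_1, ..., x + eps*e_n].

    For this coordinate simplex M = eps * I, so M^{-1} * delta_fi(Y)
    is the vector of forward differences (fi(x + eps*e_i) - fi(x)) / eps.
    """
    base = fi(x)
    grad = []
    for i in range(len(x)):
        y = list(x)
        y[i] += eps
        grad.append((fi(y) - base) / eps)
    return grad


def approximate_subgradient(subfunctions, x, eps, active_tol=1e-12):
    """Steps 1 to 3(i) of Subroutine 9.4.3 for f = max_i f_i.

    active_tol is an assumed tolerance for deciding which f_i attain the max.
    Returns the active index list A(x) and the averaged simplex gradient g_eps.
    """
    values = [fi(x) for fi in subfunctions]
    f_x = max(values)
    active = [i for i, v in enumerate(values) if f_x - v <= active_tol]
    grads = [simplex_gradient(subfunctions[i], x, eps) for i in active]
    # Average the active simplex gradients componentwise.
    g_eps = [sum(column) / len(active) for column in zip(*grads)]
    return active, g_eps
```

With $f_1(x) = x_1 + \tfrac12 x_2^2$ and $f_2(x) = -x_1 + \tfrac12 x_2^2$ at $\bar{x} = (0, 1)$, both subfunctions are active and the averaged simplex gradient differs from the exact subgradient $(0, 1)$ by $\varepsilon/2$ in the second component, in line with the $O(\varepsilon)$ bound of Theorem 9.4.2.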
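Remark 9.4.4. Using $Y$ from Step 1, we have
\[ \hat{M} = \frac{1}{\varepsilon} [\bar{x} + \varepsilon e_1 - \bar{x} \ \ \bar{x} + \varepsilon e_2 - \bar{x} \ \cdots \ \bar{x} + \varepsilon e_n - \bar{x}] = \mathrm{Id}, \]
so that $\|\hat{M}^{-1}\| = 1$ while $\varepsilon$ can be arbitrarily small.

The following theorem shows that the outputs $g^\varepsilon$ and $\nabla_\varepsilon L_U(0; g^\varepsilon_V)$ from Subroutine 9.4.3 are good approximations.

Theorem 9.4.5. Let $f : \mathbb{R}^n \to \mathbb{R}$ satisfy Assumptions 9.2.1 and 9.2.3. Fix a point $\bar{x} \in \operatorname{dom} f$. Then there exist $\mu = \mu(\bar{x}) > 0$ and $g \in \operatorname{ri} \partial f(\bar{x})$ such that for $\varepsilon > 0$ sufficiently small, one can obtain
(i) an approximate subgradient $g^\varepsilon$ such that
\[ \|g^\varepsilon - g\| \le \varepsilon\mu, \quad \|g^\varepsilon_U - g_U\| \le \varepsilon\mu \quad\text{and}\quad \|g^\varepsilon_V - g_V\| \le \varepsilon\mu; \]
(ii) the approximate U-gradient $\nabla_\varepsilon L_U(0; g^\varepsilon_V)$ such that
\[ \|\nabla_\varepsilon L_U(0; g^\varepsilon_V) - \nabla L_U(0; g_V)\| \le \varepsilon\mu. \]
Proof. By Theorem 9.4.2 with $\|\hat{M}^{-1}\| = 1$ as per Remark 9.4.4, there exists a number $\mu_1 = \mu_1(\bar{x}) > 0$ such that $\|\nabla_\varepsilon f_i(Y) - \nabla f_i(\bar{x})\| \le \varepsilon\mu_1$. Then by [87, Lemma 2.6], there exist $\lambda_i \ge 0$ with $\sum_{i \in A(\bar{x})} \lambda_i = 1$ such that
\[ g = \sum_{i \in A(\bar{x})} \lambda_i \nabla f_i(\bar{x}) \in \operatorname{ri} \partial f(\bar{x}) \quad\text{and}\quad g^\varepsilon = \sum_{i \in A(\bar{x})} \lambda_i \nabla_\varepsilon f_i(\bar{x}) \in \operatorname{ri} \partial_\varepsilon f(\bar{x}) \]
are such that $\|g^\varepsilon - g\| \le \varepsilon\mu_1$. By [87, Lemma 4.3, Theorem 5.3], we have the existence of $\mu_2, \mu_3 > 0$ such that $\|g^\varepsilon_U - g_U\| \le \varepsilon\mu_2$, $\|g^\varepsilon_V - g_V\| \le \varepsilon\mu_2$ and $\|\nabla_\varepsilon L_U(0; g^\varepsilon_V) - \nabla L_U(0; g_V)\| \le \varepsilon\mu_3$. Setting $\mu = \max\{\mu_1, \mu_2, \mu_3\}$ completes the proof.

Next, we find the approximate U-Hessian $\nabla^2_\varepsilon L_U(0; g^\varepsilon_V)$, as outlined in [87]. To do so, we need the Frobenius norm.

Definition 9.4.6. The Frobenius norm $\|M\|_F$ of a matrix $M \in \mathbb{R}^{p \times q}$ with elements $a_{ij}$ is defined by
\[ \|M\|_F = \sqrt{\sum_{i=1}^{p} \sum_{j=1}^{q} a_{ij}^2}. \]
We define the matrix $Z \in \mathbb{R}^{n \times (2n+1)}$:
\[ Z = [\bar{x} \ \ \bar{x} + \varepsilon e_1 \ \ \bar{x} - \varepsilon e_1 \ \cdots \ \bar{x} + \varepsilon e_n \ \ \bar{x} - \varepsilon e_n]. \]
To build an approximate Hessian $\nabla^2_F$ of $f_i(\bar{x})$ for each $i \in A(\bar{x})$, we solve the minimum Frobenius norm problem:
\[ \nabla^2_F f_i(Z) = \operatorname{argmin} \|H_i\|_F \quad\text{such that}\quad \tfrac12 Z_j^\top H_i Z_j + B_i^\top Z_j + C_i = f_i(Z_j), \quad j = 1, 2, \ldots, 2n + 1, \]
where $Z_j$ is the $j$th column of $Z$. The variables of the above problem are $H_i$, $B_i$ and $C_i$, and the solution is obtained by solving a quadratic program.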
Then, denoting
\[ H = \frac{1}{|A(\bar{x})|} \sum_{i \in A(\bar{x})} \nabla^2_F f_i(Z), \]
we define the approximate U-Hessian of $f(\bar{x})$: $\nabla^2_\varepsilon L_U(0; g^\varepsilon_V) = U^\top H U$. The following result provides the error bound for the approximate U-Hessian.

Theorem 9.4.7. [87, Theorem 6.1] Let $\bar{x}$ be fixed. Suppose that Assumption 9.2.3 holds and that for any $\varepsilon > 0$ there exists $\mu = \mu(\bar{x})$ such that the approximate U-gradient and U-Hessian satisfy the bounds $\|\nabla_\varepsilon f_i(\bar{x}) - \nabla f(\bar{x})\| < \varepsilon\mu$ and $\|\nabla^2_\varepsilon f_i(\bar{x}) - \nabla^2 f(\bar{x})\| < \varepsilon\mu$. Then
\[ \|\nabla^2 L_U(0; g_V) - \nabla^2_\varepsilon L_U(0; g^\varepsilon_V)\| \le \varepsilon \left[ 2\sqrt{2} \sqrt{|A(\bar{x})| - 1}\, \|V^\dagger\| \|H\| (2\mu + \mu^2 \varepsilon) + \mu \right]. \]
Thus, $\lim_{\varepsilon \searrow 0} \nabla^2_\varepsilon L_U(0; g^\varepsilon_V) = \nabla^2 L_U(0; g_V)$.

Now we state Subroutine 9.4.8, which is used to find the approximate U-Hessian.

Subroutine 9.4.8 (Second-order approximation).
Step 0: Input $\bar{x}$, $\varepsilon$, $A(\bar{x})$ and $U$.
Step 1: Set $Z = [\bar{x} \ \ \bar{x} + \varepsilon e_1 \ \ \bar{x} - \varepsilon e_1 \ \cdots \ \bar{x} + \varepsilon e_n \ \ \bar{x} - \varepsilon e_n]$.
Step 2: Calculate $\nabla^2_F f_i(Z)$ for each $i \in A(\bar{x})$.
Step 3: Set $\nabla^2_\varepsilon L_U(0; g^\varepsilon_V) = U^\top \left( \frac{1}{|A(\bar{x})|} \sum_{i \in A(\bar{x})} \nabla^2_F f_i(Z) \right) U$.
End subroutine.

Theorems 9.4.5 and 9.4.7 provide us with the tools needed to perform the approximate U-step in the derivative-free VU-algorithm. In order to perform the approximate V-step, we need to be able to approximate a proximal point in a derivative-free setting. This is where the tilt-correct DFO proximal bundle routine of Chapter 8 makes its appearance; details are reproduced in Step 2 of the DFO VU-algorithm below. At iteration $j$ of said routine, a subgradient is approximated by modelling $f$ with a piecewise-linear function $\varphi_j$ and then finding the proximal point of $\varphi_j$. This method is proved in [95] to converge to the desired proximal point within a preset tolerance. Theorem 9.4.5(i) provides the approximate subgradients required for this step.

The tilt-correct DFO proximal bundle method involves a possible correction to the approximate subgradient found at each iteration (Step 1.1 of the upcoming DFO VU-algorithm), which ensures that the model function does not lie above the objective function at the current iteration point $x_k$.
This is not a concern when exact subgradients are available, because then the model function naturally bounds the (convex) objective function from below, but when using approximate subgradients it is possible for the model function to lie partially above the objective function. In that case, tilting the linear piece down until $\varphi^\varepsilon_j(x_k) = f(x_k)$ makes the model no worse [95, Lemma 3.1]. Once $g^\varepsilon$ is found, it can be replaced by the approximate subgradient defined by (9.4.1), which complies with all of our requirements, including the condition $\varphi^\varepsilon_j(x_k) = f(x_k)$.

In the following algorithm, we use $k$ for the outer counter and $j$ for the inner (V-step subroutine) counter.

DFO VU-Algorithm:
Step 0: Initialization: Choose a stopping tolerance $\delta > 0$ for the aggregate subgradients, an accuracy tolerance $\varepsilon_{\min}$ for the subgradient errors, a descent-check parameter $m \in (0, 1)$ and a proximal parameter $r > 0$. Choose an initial point $x_0 \in \operatorname{dom} f$ and an initial subgradient accuracy $\varepsilon_0 > 0$. Set $k = 0$.
Step 1: V-step:
Step 1.0: Initialization. Set $j = 0$, $z_0 = x_k$ and $B_0 = \{0\}$.
Step 1.1: Linearization. Call Subroutine 9.4.3 with input $(z_j, \varepsilon_k)$ to find $\tilde{g}^{\varepsilon_k}_j$. Compute $E_j = f(z_j) + (\tilde{g}^{\varepsilon_k}_j)^\top (z_0 - z_j) - f(z_0)$ and set
\[ g^{\varepsilon_k}_j = \tilde{g}^{\varepsilon_k}_j - \max(0, E_j) \frac{z_0 - z_j}{\|z_0 - z_j\|^2}. \tag{9.4.1} \]
Step 1.2: Model. Define
\[ \varphi^{\varepsilon_k}_j(z) = \max_{i \in B_j} \{ f(z_i) + (g^{\varepsilon_k}_i)^\top (z - z_i) \}. \]
Step 1.3: Proximal Point. Calculate $z_{j+1} = P_r \varphi^{\varepsilon_k}_j(z_0)$.
Step 1.4: Stopping Test. If $f(z_{j+1}) - \varphi^{\varepsilon_k}_j(z_{j+1}) \le \varepsilon_k^2 / r$, set $x_{k+1} = z_{j+1}$, calculate the aggregate subgradient $s_{k+1} = r(z_0 - z_{j+1})$, and go to Step 2.
Step 1.5: Update and Loop. Create the aggregate bundle element $(z_{j+1}, \varphi^{\varepsilon_k}_j, r(z_0 - z_{j+1}))$.
Create $B_{j+1}$ such that
\[ \{-1, 0, j + 1\} \subseteq B_{j+1} \subseteq \{-1, 0, 1, 2, \ldots, j + 1\}. \]
Increment $j \mapsto j + 1$ and go to Step 1.1.
Step 2: Stopping Test: If $\|s_{k+1}\|^2 \le \delta$ and $\varepsilon_k \le \varepsilon_{\min}$, output $x_{k+1}$ and stop.
Step 3: Update and Loop:
(i) If $f(x_k) - f(x_{k+1}) \ge \frac{m}{2r} \|s_{k+1}\|^2$ and $\|s_{k+1}\|^2 \le \delta$ and $\varepsilon_k > \varepsilon_{\min}$, then declare a SERIOUS STEP and set $\varepsilon_{k+1} = \varepsilon_k / 2$.
(ii) If $f(x_k) - f(x_{k+1}) \ge \frac{m}{2r} \|s_{k+1}\|^2$ and $\|s_{k+1}\|^2 > \delta$, then declare a SERIOUS STEP and set $\varepsilon_{k+1} = \varepsilon_k$.
(iii) If $f(x_k) - f(x_{k+1}) < \frac{m}{2r} \|s_{k+1}\|^2$, then declare a NULL STEP and set $\varepsilon_{k+1} = \varepsilon_k / 2$.
Increment $k \mapsto k + 1$. If a SERIOUS STEP has been declared, go to Step 4. If a NULL STEP has been declared, go to Step 1.
Step 4: U-step: Call Subroutine 9.4.3 with input $(x_k, \varepsilon_k)$ to find $A(x_k)$, $g^{\varepsilon_k}_k$, $U_k$ and $\nabla_\varepsilon L_U(0; (g^{\varepsilon_k}_k)_V)$. Call Subroutine 9.4.8 with input $(x_k, \varepsilon_k, A(x_k), U_k)$ to find $\nabla^2_\varepsilon L_U(0; (g^{\varepsilon_k}_k)_V)$. Compute an approximate U-Newton step by solving the linear system
\[ \nabla^2_\varepsilon L_U(0; (g^{\varepsilon_k}_k)_V)\, \Delta u_k = -\nabla_\varepsilon L_U(0; (g^{\varepsilon_k}_k)_V) \]
for $\Delta u_k$. Set $x_{k+1} = x_k + U_k \Delta u_k$ and $\varepsilon_{k+1} = \varepsilon_k$. Increment $k \mapsto k + 1$ and go to Step 1.
End algorithm.

Note 9.4.9. In Step 1.1, the call to Subroutine 9.4.3 yields the active set, approximate U-basis and approximate U-gradient in addition to $\tilde{g}^{\varepsilon_k}_j$. However, $\tilde{g}^{\varepsilon_k}_j$ is the only information we use from Subroutine 9.4.3 in the V-step, so we do not mention the other outputs in the statement of the algorithm.

9.5 Convergence

In this section, we examine the convergence of the DFO VU-algorithm, starting with the V-step. By [95, Corollary 4.6], if the V-step is run without the stopping criterion, then
\[ \lim_{j \to \infty} \|z_{j+1} - z_j\| = 0. \]
Then [95, Theorem 4.9] states that if $f$ is locally $K$-Lipschitz (which a finite-max function is), then
\[ \|z_{j+1} - z_j\| \le \frac{\varepsilon_k^2}{r(K + 2\varepsilon_k)} \ \Rightarrow\ f(z_{j+1}) - \varphi^{\varepsilon_k}_j(z_{j+1}) \le \frac{\varepsilon_k^2}{r} \tag{9.5.1} \]
and the routine terminates. The properties of $\varphi^{\varepsilon_k}_j$ established in [95, Lemma 4.1] show that if the V-step with input $z_0$ stops at iteration $j$ and outputs $z_{j+1}$, then
\[ \|P_r f(z_0) - z_{j+1}\| \le \frac{(\mu + 1)\varepsilon_k}{r}, \]
where $\mu$ is the constant of Theorem 9.4.5.
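Because the outer-loop analysis depends on how Steps 2 and 3 of the DFO VU-algorithm update $\varepsilon_k$, it may help to see that bookkeeping in isolation. The Python sketch below encodes only the decision logic of those two steps; the function name `stop_or_classify` and the string labels it returns are hypothetical, and the V-step and U-step themselves are not modelled.

```python
def stop_or_classify(f_old, f_new, s_norm_sq, eps_k, m, r, delta, eps_min):
    """Decision logic of Steps 2 and 3 of the DFO VU-algorithm.

    f_old = f(x_k), f_new = f(x_{k+1}), s_norm_sq = ||s_{k+1}||^2.
    Returns (decision, next_eps) with decision in {"stop", "serious", "null"}.
    """
    # Step 2: stop when both the aggregate subgradient and the accuracy are small.
    if s_norm_sq <= delta and eps_k <= eps_min:
        return "stop", eps_k
    # Sufficient-descent test shared by Steps 3(i) to 3(iii).
    sufficient = f_old - f_new >= m / (2.0 * r) * s_norm_sq
    if sufficient and s_norm_sq <= delta:
        # Step 3(i): eps_k > eps_min holds here because Step 2 did not stop.
        return "serious", eps_k / 2.0
    if sufficient:
        # Step 3(ii): serious step, accuracy unchanged.
        return "serious", eps_k
    # Step 3(iii): null step, tighten the subgradient accuracy.
    return "null", eps_k / 2.0
```

A serious step sends the algorithm on to the U-step (Step 4), whereas a null step returns it to the V-step (Step 1) with the halved accuracy parameter.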
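Now in order to prove the convergence of the main algorithm, we need to show that either the algorithm terminates in a finite number of steps, or $\varepsilon_k \to 0$ and $\liminf_{k \to \infty} \|s_k\| = 0$, and that in either case we arrive at a good approximation of the minimizer of $f$. To accomplish that goal, we need the following definitions.

Definition 9.5.1. Let $\varepsilon \ge 0$. The $\varepsilon$-directional derivative of $f$ at $x$ in direction $d$ is defined by
\[ f'_\varepsilon(x; d) = \inf_{t > 0} \frac{f(x + td) - f(x) + \varepsilon}{t} = \max_{s \in \partial_\varepsilon f(x)} \{ s^\top d \}. \]
Definition 9.5.2. Let $\varepsilon, \eta \ge 0$. A vector $v$ is an $(\varepsilon, \eta)$-subgradient of $f$ at $\bar{x}$, denoted $v \in \partial^\eta_\varepsilon f(\bar{x})$, if for all $x$,
\[ f(x) \ge f(\bar{x}) + v^\top (x - \bar{x}) - \eta \|x - \bar{x}\| - \varepsilon. \]
Notice that by setting $\eta = 0$ we recover the definition of the $\varepsilon$-subgradient, and by setting $\varepsilon = \eta = 0$ we have the convex analysis subgradient. The next lemma characterizes the $(\varepsilon, \eta)$-subgradient in the general case.

Lemma 9.5.3. Let $\varepsilon, \eta \ge 0$. Then
\[ g \in \partial^\eta_\varepsilon f(\bar{x}) \iff g \in \partial_\varepsilon f(\bar{x}) + B_\eta. \tag{9.5.2} \]
Proof. ($\Rightarrow$) Suppose $g \in \partial^\eta_\varepsilon f(\bar{x})$. Since $\partial_\varepsilon f$ is closed and convex [109, Theorem 1.1.4], we define
\[ \bar{g} = P_{\partial_\varepsilon f(\bar{x})}(g) \]
and we have $\bar{g} \in \partial_\varepsilon f(\bar{x})$. Set $v = g - \bar{g}$, so that $g = \bar{g} + v$, and for $t > 0$ we use $x = \bar{x} + tv$ in the definition of the $(\varepsilon, \eta)$-subgradient:
\[ f(\bar{x} + tv) \ge f(\bar{x}) + (\bar{g} + v)^\top tv - \eta \|tv\| - \varepsilon, \]
\[ \frac{f(\bar{x} + tv) - f(\bar{x}) + \varepsilon}{t} - v^\top \bar{g} \ge \|v\|^2 - \eta \|v\|, \]
\[ \inf_{t > 0} \frac{f(\bar{x} + tv) - f(\bar{x}) + \varepsilon}{t} - v^\top \bar{g} \ge \|v\|^2 - \eta \|v\|, \]
\[ f'_\varepsilon(\bar{x}; v) - v^\top \bar{g} \ge \|v\|^2 - \eta \|v\|. \tag{9.5.3} \]
By the Projection Theorem, we have
\[ p = P_{\partial_\varepsilon f(\bar{x})}(y) \iff (y - p)^\top (z - p) \le 0 \quad\text{for all } z \in \partial_\varepsilon f(\bar{x}). \]
So for all $\tilde{g} \in \partial_\varepsilon f(\bar{x})$ we have
\[ (g - \bar{g})^\top (\tilde{g} - \bar{g}) \le 0, \qquad v^\top \tilde{g} \le v^\top \bar{g}. \]
Hence,
\[ v^\top \bar{g} = \max_{\tilde{g} \in \partial_\varepsilon f(\bar{x})} \{ v^\top \tilde{g} \}. \]
Using this together with Definition 9.5.1, (9.5.3) becomes
\[ \|v\|^2 - \eta \|v\| \le 0, \qquad \|v\| \le \eta. \]
Therefore, $v \in B_\eta$, and we have $g = \bar{g} + v \in \partial_\varepsilon f(\bar{x}) + B_\eta$.
($\Leftarrow$) Suppose that $g \in \partial_\varepsilon f(\bar{x}) + B_\eta$. Set $g = \bar{g} + v$ where $\bar{g} \in \partial_\varepsilon f(\bar{x})$ and $v \in B_\eta$. Then by the definition of $\varepsilon$-subgradient and the Cauchy-Schwarz inequality, we have
\[ f(x) - f(\bar{x}) - g^\top (x - \bar{x}) = f(x) - f(\bar{x}) - \bar{g}^\top (x - \bar{x}) - v^\top (x - \bar{x}) \]
\[ \ge -\varepsilon - v^\top (x - \bar{x}) \]
\[ \ge -\varepsilon - \|v\| \|x - \bar{x}\| \]
\[ \ge -\varepsilon - \eta \|x - \bar{x}\|. \]
Therefore, $g \in \partial^\eta_\varepsilon f(\bar{x})$.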
Now we are ready to show that the inexact aggregate subgradient at any step is a good approximation of a true subgradient.

Lemma 9.5.4. There exists $\mu \ge 0$ such that, given any iteration $k$ with output $(x_{k+1}, s_{k+1})$,
\[ s_{k+1} \in \partial^{\varepsilon_k \mu}_{\varepsilon_k^2 / r} f(x_{k+1}). \]
Proof. By Theorem 9.4.2, we have that $\mu = K \|\hat{M}^{-1}\|$, which is constant because of Assumptions 9.2.1 and 9.2.2. Thus,
\[ f(x) + \varepsilon_k \mu \|x - x_{k+1}\| \ge \varphi^{\varepsilon_k}_j(x_{k+1}) + s_{k+1}^\top (x - x_{k+1}). \]
From (9.5.1), we have
\[ \varphi^{\varepsilon_k}_j(x_{k+1}) - f(x_{k+1}) \ge -\frac{\varepsilon_k^2}{r}. \]
Then
\[ f(x) \ge \varphi^{\varepsilon_k}_j(x_{k+1}) - f(x_{k+1}) + f(x_{k+1}) + s_{k+1}^\top (x - x_{k+1}) - \varepsilon_k \mu \|x - x_{k+1}\| \]
\[ \ge -\frac{\varepsilon_k^2}{r} + f(x_{k+1}) + s_{k+1}^\top (x - x_{k+1}) - \varepsilon_k \mu \|x - x_{k+1}\|. \]
Thus, $s_{k+1} \in \partial^{\varepsilon_k \mu}_{\varepsilon_k^2 / r} f(x_{k+1})$ by Definition 9.5.2.

There are two special cases of Lemma 9.5.4 that are of interest; we consider what happens when the aggregate subgradient is zero and when the maximum subgradient error is also zero.

Corollary 9.5.5. At any iteration $k$ with output $(x_{k+1}, s_{k+1})$, the following hold.
(i) If $s_{k+1} = 0$, then $0 \in \partial_{\varepsilon_k^2 / r} f(x_{k+1}) + B_{\mu \varepsilon_k}$, and by Lemma 9.5.4 we have that for all $x \in \mathbb{R}^n$,
\[ f(x) \ge f(x_{k+1}) - \varepsilon_k \mu \|x - x_{k+1}\| - \frac{\varepsilon_k^2}{r}. \]
(ii) If $\varepsilon_k = s_{k+1} = 0$, then $0 \in \partial f(x_{k+1})$ and $x_{k+1}$ is a minimizer of $f$.

Now we need to consider the possibility that the algorithm does not terminate and what the effect would be. We begin with the scenario where an infinite number of serious steps is taken.

Theorem 9.5.6. Suppose there is an infinite number of serious steps taken in Step 3. Then $\varepsilon_k \to 0$ and $\liminf_{k \to \infty} \|s_k\| = 0$.

Proof. Note that $f$ is bounded below, due to Assumption 9.2.2. Suppose that out of the infinite number of serious steps, $\|s_{k+1}\|^2 > \delta$ an infinite number of times. Then we have
\[ f(x_k) - f(x_{k+1}) \ge \frac{m}{2r} \|s_{k+1}\|^2 > \frac{m\delta}{2r} \]
an infinite number of times. Since $\frac{m\delta}{2r}$ is constant, $\lim_{k \to \infty} [f(x_0) - f(x_k)] = \infty$, which contradicts the fact that $f$ is bounded below. Hence, $\|s_{k+1}\|^2 \le \delta$ eventually. Since we are supposing that the algorithm does not stop, we must have $\varepsilon_k > \varepsilon_{\min}$ and we set $\varepsilon_{k+1} = \varepsilon_k / 2$. This happens an infinite number of times, which gives $\varepsilon_k \to 0$.
Since $\|s_{k+1}\|^2 \le \delta$ for any $\delta > 0$, we have $\liminf_{k \to \infty} \|s_k\| = 0$.

Next comes the scenario where a finite number of serious steps is taken, yet the algorithm does not terminate.

Lemma 9.5.7. Suppose the algorithm does not terminate and there is a finite number of serious steps taken. Then for all $k$ sufficiently large,
\[ \varepsilon_k > \left( 1 - \frac{m}{2} \right)^{1/2} \|s_{k+1}\|. \tag{9.5.4} \]
Proof. Let $\bar{k}$ be the final iteration where a serious step occurs, so that a null step occurs at every $k > \bar{k}$. Since $s_{k+1} = r(x_k - x_{k+1})$ is the aggregate subgradient of the model function $\varphi^{\varepsilon_k}_j$ at $z_{j+1} = x_{k+1}$, we have $s_{k+1} \in \partial \varphi^{\varepsilon_k}_j(x_{k+1})$ [95, Lemma 4.1(c)]. Thus,
\[ \varphi^{\varepsilon_k}_j(x) \ge \varphi^{\varepsilon_k}_j(x_{k+1}) + s_{k+1}^\top (x - x_{k+1}) \quad\text{for all } x. \]
By the tilt-correction ($E_j$ in Step 1.1), we have that at $x_k$,
\[ f(x_k) \ge \varphi^{\varepsilon_k}_j(x_{k+1}) + s_{k+1}^\top (x_k - x_{k+1}) \quad\text{[95, Lemma 4.1(b)]} \]
\[ = \varphi^{\varepsilon_k}_j(x_{k+1}) + \frac{1}{r} s_{k+1}^\top [r(x_k - x_{k+1})] \]
\[ = \varphi^{\varepsilon_k}_j(x_{k+1}) + \frac{1}{r} \|s_{k+1}\|^2 \]
\[ = \varphi^{\varepsilon_k}_j(x_{k+1}) - f(x_{k+1}) + f(x_{k+1}) + \frac{1}{r} \|s_{k+1}\|^2, \]
\[ f(x_k) - f(x_{k+1}) \ge \varphi^{\varepsilon_k}_j(x_{k+1}) - f(x_{k+1}) + \frac{1}{r} \|s_{k+1}\|^2. \tag{9.5.5} \]
By the stopping test in Step 1.4, we have
\[ \varphi^{\varepsilon_k}_j(x_{k+1}) - f(x_{k+1}) \ge -\frac{\varepsilon_k^2}{r}. \tag{9.5.6} \]
Combining (9.5.5) and (9.5.6) yields
\[ f(x_k) - f(x_{k+1}) \ge \frac{1}{r} \|s_{k+1}\|^2 - \frac{\varepsilon_k^2}{r}. \tag{9.5.7} \]
Then for all $k > \bar{k}$, by Step 3(iii) and (9.5.7) we have
\[ \frac{m}{2r} \|s_{k+1}\|^2 > f(x_k) - f(x_{k+1}), \]
\[ \frac{m}{2r} \|s_{k+1}\|^2 > \frac{1}{r} \|s_{k+1}\|^2 - \frac{\varepsilon_k^2}{r}, \]
\[ \varepsilon_k^2 > \left( 1 - \frac{m}{2} \right) \|s_{k+1}\|^2, \]
and (9.5.4) is proved.

Corollary 9.5.8. Suppose the algorithm does not terminate and there is a finite number of serious steps. Then $\varepsilon_k \to 0$ and $\|s_k\| \to 0$.

Proof. Since there is a finite number of serious steps and the algorithm does not terminate, there is an infinite number of null steps. By Step 3(iii), $\varepsilon_k \to 0$. By (9.5.4), $\|s_k\| \to 0$.

It remains to discuss what happens when the stopping tolerances $\delta$ and $\varepsilon_{\min}$ are set to zero. If so, by Step 2 of the algorithm we see that the only way it will terminate is if $s_{k+1} = 0$ and $\varepsilon_k = 0$, in which case $x_{k+1}$ is a minimizer of $f$ by Corollary 9.5.5. Theorem 9.5.9 below unites the convergence results of this section.

Theorem 9.5.9.
For the objective function $f$ in Problem (9.2.1), suppose that Assumptions 9.2.1 through 9.2.3 are met. Suppose the DFO VU-algorithm is run on $f$ and generates the sequence $\{x_k\}$. Then either the algorithm terminates at some iteration $\bar{k}$ with $\|s_{\bar{k}+1}\| \le \sqrt{\delta}$ and $\varepsilon_{\bar{k}} \le \varepsilon_{\min}$, or $\liminf_{k \to \infty} \|s_k\| = 0$ and $\varepsilon_k \to 0$. In the latter case, any cluster point $\bar{x}$ of a subsequence $\{x_{k_j}\}$ such that $s_{k_j} \to 0$ satisfies $0 \in \partial f(\bar{x})$.

Proof. If the algorithm terminates at iteration $\bar{k}$, by Step 2 we have $\|s_{\bar{k}+1}\| \le \sqrt{\delta}$ and $\varepsilon_{\bar{k}} \le \varepsilon_{\min}$. Suppose the algorithm does not terminate. If there is an infinite number of serious steps taken, then $\liminf_{k \to \infty} \|s_k\| = 0$ and $\varepsilon_k \to 0$ by Theorem 9.5.6. If there is a finite number of serious steps taken, then $s_k \to 0$ and $\varepsilon_k \to 0$ by Corollary 9.5.8. In either case, consider any subsequence $\{x_{k_j}\}$ with cluster point $\bar{x}$ such that $s_{k_j} \to 0$. Since $\varepsilon_k \to 0$, we have $\varepsilon_{k_j} \to 0$. By [5, Proposition 5], we have that the $(\varepsilon_{k_j}^2 / r)$-subdifferential of $f$ is continuous jointly as a function of $(x, \varepsilon_{k_j})$ on $(\operatorname{ri} \operatorname{dom} f) \times (0, \infty)$. Since $B_{\mu \varepsilon_{k_j}}$ is continuous as well, by [165, Proposition 3.2.7-3] and (9.5.2) we have that $\partial^{\varepsilon_{k_j} \mu}_{\varepsilon_{k_j}^2 / r}$ is continuous. Therefore, since $s_{k_j + 1} \in \partial^{\varepsilon_{k_j} \mu}_{\varepsilon_{k_j}^2 / r} f(x_{k_j + 1})$ by Lemma 9.5.4, $s_{k_j + 1} \to 0$, $\varepsilon_{k_j} \to 0$ and $x_{k_j + 1} \to \bar{x}$, we conclude that $0 \in \partial f(\bar{x})$.

9.6 Numerical Results

In this section, we present some numerical tests with the DFO VU-algorithm. The tests were run on a 2.8 GHz Intel Core 2 Duo processor with a 64-bit operating system, using MATLAB version 8.5.0.197613 (R2015a). To determine the impact of the combination of a DFO setting with a VU-decomposition, we compare the performance of the following four bundle algorithms:
1. DFO-VU, the DFO VU-algorithm using the default of $2n + 1$ function calls per Hessian approximation;
2. INEXBUN, an inexact bundle method along the lines of [172], with access to the grey box available to DFO-VU: the value function is exact, but the subgradient is approximated by means of the DFO approach in Subroutine 9.4.3;
3.
EXBUN, a classical bundle method in proximal form [38, Part II];
4. COMPBUN, the Composite Bundle method from [198].
The last two variants in the list use exact subgradient information. As such, we expect those solvers to outperform both DFO-VU and INEXBUN. These inexact variants, by contrast, are on equal ground and we expect to see a positive impact of the VU-decomposition in terms of accuracy.

9.6.1 Test functions and benchmark rules

We considered 301 max-of-quadratics functions. The first one is the classical MAXQUAD function in nonsmooth optimization [38, Part II], for which the dimension is $n = 10$, the optimal value is $\bar{f} = -0.84140833459641814$, and $\bar{v} = \dim \mathcal{V}$ at a solution is equal to 3. For $n \in \{10, 20, 30, 40, 50\}$, the remaining 300 functions were generated randomly with minimizer $\bar{x} = 0 \in \mathbb{R}^n$, $\bar{f} = 0$ and various final V-dimensions $\bar{v} \in \{0.25n, 0.5n, 0.75n\}$. Accordingly, given $m \ge |A(\bar{x})| = \bar{v} + 1$,
\[ f(x) = \max_{j \in \{1, 2, \ldots, m\}} \left\{ \tfrac12 x^\top A_j x + b_j^\top x \right\}, \]
for random $A_j \in S^n_+$ and $b_j \in \mathbb{R}^n$. The symmetric positive semidefinite matrices $A_j$ have condition number equal to $\operatorname{rank} A^2 = \bar{v}^2$, and the set of difference vectors $\{b_2 - b_1, \ldots, b_{\bar{v}+1} - b_1\}$ is linearly independent. The above setting guarantees that all the assumptions in Section 9.2.1 hold for the considered instances.

We must acknowledge and accept that some of the inner workings of each solver make it difficult to compare the results fairly. First, the COMPBUN and EXBUN algorithms make bb-calls that yield exact values for the function and a subgradient, while INEXBUN and DFO-VU call a grey box that yields exact function values and approximate subgradients. Second, to avoid machine error due to a near-singular matrix in the second-order approximation, DFO-VU stops when in Step 4 the parameter $\varepsilon_k$ becomes smaller than $10^{-5}$. Third, INEXBUN stops when there are more than 18 consecutive noise-attenuation steps; we refer the reader to [172] for details.
Barring the above, the parameters for COMPBUN, EXBUN, and INEXBUN are those chosen for the Composite Bundle solver in [198]. In an effort to make the comparisons as fair as possible, we adopted the following rules.
1. All solvers use the same quadratic programming built-in MATLAB solver, quadprog.
2. For all solvers, the stopping tolerance was set to $10^{-2}$, which for DFO-VU means that in Step 2, $\delta = \varepsilon_{\min} = 10^{-2}$.
3. The maximum number of bb-calls, denoted by maxS, was set to maxS $= 800 \min(n, 20)$. This corresponds to function and subgradient evaluations for the exact variants and to function evaluations for the inexact variants.
4. For all solvers, a run was declared a failure when maxS was reached or when there was an error in the QP solver.
5. The methods use the same starting points, with components randomly drawn in $[-1, 1]$. We ran all the instances with two starting points, for a total of 602 runs.

For those readers interested in implementing DFO-VU, we mention the following additional numerical tweaks that had a positive impact on the algorithm's performance.
1. In the U-step, finding the active index set $A(x_k)$ in Subroutine 9.4.3 is tricky. We note that using an absolute criterion was worse than the following soft-thresholding test:
\[ i \in A(x_k) \quad\text{when}\quad f(x_k) - f_i(x_k) \le 0.001 |f(x_k)|. \]
2. In Step 1.3, it is often preferable to calculate the proximal point $z_{j+1}$ by solving the dual of the quadratic programming problem defining $P_r \varphi^{\varepsilon_k}_j(z_0)$.
3. The tilting of gradients in (9.4.1) is done only when $E_j$ is larger than $10^{-8}$. Otherwise, we set $g^{\varepsilon_k}_j = \tilde{g}^{\varepsilon_k}_j$.
4. As long as the proximal parameter remains uniformly bounded, it can vary along iterations without impairing the convergence results. We define
\[ t_k = \begin{cases} \dfrac{0.5 |g(x_k)|^2}{1 + |f(x_k)|}, & \text{if } |f(x_k)| > 10^{-10}, \\ 2, & \text{otherwise,} \end{cases} \]
and let
\[ r_k = \max\left\{ 1, \min\left\{ \frac{1}{t_k}, 100 r_{k-1}, 10^6 \right\} \right\}. \]
5. In Step 1.5, the new bundle $B_{j+1}$ keeps almost active indices.
We accept as active the subfunctions that are close to active at each iteration point, so as not to dismiss those that are active but do not quite appear to be so because of numerical error.

9.6.2 Comparing the solvers' accuracy on $f$ and $\dim \mathcal{V}$

This section describes the indicators defined to compare the solvers. The number of iterations is not a meaningful measure for comparison, because each solver involves a very different computational effort per iteration. This depends not only on the solver, but also on how many evaluations are done per iteration. Moreover, since the exact variants do not spend calls to make the DFO subgradient approximation, neither the total solving time nor the number of bb-calls are meaningful measures. As the optimal values are known for the considered instances, we compare the accuracy reached by each solver. Denoting the best function value of the analyzed case by $f_{\text{found}}$,
\[ RA = \max\left( 0, -\log_{10} \max\left( 10^{-16}, \frac{f_{\text{found}} - \bar{f}}{1 + |\bar{f}|} \right) \right) \]
measures the number of digits of accuracy achieved by the solver. We also analyze the ability of each solver in capturing the (known) exact V-dimension, by looking at the cardinality of $A(x_{\text{found}})$, for $x_{\text{found}}$ the final point found by each solver, and computing $v_{\text{found}} = |A(x_{\text{found}})| - 1$.

Since MAXQUAD is a well-known test function for bundle methods, in Table 9.1 we report separately the measures obtained for this function, running the four solvers with two starting points.

                       COMPBUN   EXBUN   INEXBUN   DFO-VU
First x0    RA            5        2        1        3
            v_found       3        1        1        3
Second x0   RA            5        2        1        3
            v_found       3        0        1        2

Table 9.1: Results for MAXQUAD test function, $\dim \mathcal{V}(\bar{x}) = 3$.

We observe a very good performance of DFO-VU, both in terms of accuracy and V-dimension, which is underestimated in the second run. Such underestimation means that DFO-VU is taking U-steps in a larger subspace.
Of course, the price to be paid (especially with our rudimentary implementation) is computing time, which passes from a few seconds with COMPBUN-INEXBUN, to 2 minutes with DFO-VU.

The solver performance for the remaining 600 runs was similar. For each problem and the two random starting points, we organized the output into five groups, corresponding to increasing percentages of the V-dimension at $\bar{x}$ with respect to $n$. Each row in Table 9.2 reports for each solver the mean value of the relative accuracy, averaged for each group. The bottom line in Table 9.2 contains the total number of instances considered for the test and the total average values for RA.

# of runs   dim V(x̄)          COMPBUN   EXBUN   INEXBUN   DFO-VU
   96       ∈ (0, 15)%n         3.99     0.78     0.58      1.44
  182       ∈ [15, 30)%n        4.79     1.12     0.89      1.63
  134       ∈ [30, 45)%n        3.93     0.91     0.61      1.05
  106       ∈ [45, 60)%n        4.21     0.96     0.62      1.16
   84       ∈ [60, 100)%n       5.75     1.36     1.07      2.15
  602       MEAN                4.50     1.02     0.76      1.46

Table 9.2: Average accuracy RA for 602 runs.

As conjectured, in terms of accuracy on the optimal value, COMPBUN is far superior to all the other variants. The inexact bundle method INEXBUN performs reasonably well, but is systematically outperformed by DFO-VU. An interesting feature is that, in spite of using only approximate subgradient information, DFO-VU achieves better function accuracy than the exact classical bundle method, EXBUN. This fact confirms the interest of exploiting available structure in the bundle method, even if the information is inexact.

Table 9.3 below gives another indication of the performance of DFO-VU and INEXBUN in predicting the dimension of the V-space.
Out of the 602 runs, we list the number of times that each algorithm returned the exact V-dimension, the number of times $v_{\text{found}}$ was within 1, 2 or 5, and the number of times it was more than 5 away from the correct V-dimension.

            # Exact      # ±1         # ±2         # ±5         # > 5
INEXBUN     161 (27%)    351 (58%)    441 (73%)    528 (88%)    74 (12%)
DFO-VU      408 (68%)    486 (81%)    513 (85%)    551 (92%)    51 (8%)

Table 9.3: The V-dimension prediction comparison between the inexact solvers.

In almost 70% of the runs, DFO-VU correctly predicted the V-dimension, more than double what INEXBUN was able to do. This is a strong indicator that DFO-VU is able to do what it is meant to do in that respect; INEXBUN is not meant to make this prediction, so we expect to see the results that we have. The calculation of the V-dimension was done by counting the number of almost active subfunctions at the best point found by each algorithm.

9.6.3 Performance Profiles

Figure 9.1 below contains the performance profile over all 602 instances. In the graph, each curve represents the cumulative probability distribution $\phi_s(\theta)$ of the resource "$f$-accuracy", measured in terms of the reciprocal of RA. The use of 1/RA as an indicator stems from the fact that usually smaller values of the abscissa $\theta$ mean better performance of the resource. As in our case higher accuracy is preferred, we invert the relation to plot the profile. In this manner, in all the profiles that follow, the solvers with the highest curves are the best ones for the given indicator of performance.

Figure 9.1: Performance Profile: (reciprocal of) accuracy, all solvers. (Horizontal axis: $\theta$ from 1 to 5; vertical axis: $\phi(\theta)$ from 0 to 1; curves: CompBun, ExBun, InexBun, DFOVU.)

For an abscissa $\theta = \theta_{\min}$ in the graph, the probability $\phi_s(\theta_{\min})$ of a particular solver is the probability that the solver will win over all the others. Looking at the highest value for the leftmost abscissa in Figure 9.1, we conclude that the most precise solver is COMPBUN in all of the runs, as expected.
The DFO-VU solver is the second-best, followed by EXBUN.

In general, for a particular solver $s$, the ordinate $\phi_s(\theta)$ gives information on the percentage of problems that the considered method will solve if given $\theta$ times the resource employed by the best one. Looking at the value of $\theta = 3$, we see that DFO-VU solves more than half of the 602 problems ($\phi(3) > 0.5$) with a third ($= 1/\theta$) of the accuracy obtained by COMPBUN, while INEXBUN solves less than 30% ($\phi(3) < 0.3$).

Since comparing with exact variants is not fair, we repeat the profile, this time comparing only INEXBUN and DFO-VU. The values at $\theta = 1$ in Figure 9.2 show that INEXBUN was more accurate than DFO-VU in 30% of the runs ($\phi(1) = 0.3$).

Figure 9.2: Performance Profile: (reciprocal of) accuracy, solvers INEXBUN and DFO-VU. (Horizontal axis: $\theta$ from 1 to 3; vertical axis: $\phi(\theta)$ from 0 to 1; curves: InexBun, DFOVU.)

9.6.4 CPU time, function evaluations and failures

Naturally, the gain in accuracy of DFO-VU comes at the price of CPU time. As expected, the fastest solver in all of the runs is COMPBUN, followed by EXBUN, INEXBUN, and DFO-VU. The average CPU time in seconds was 0.47 for COMPBUN, 0.28 for EXBUN, 0.40 for INEXBUN, and 61 for DFO-VU. The time increase for DFO-VU is better understood when examining the respective average number of calls to the oracle, equal to 8 for COMPBUN, 26 for EXBUN, 504 for INEXBUN, and 52330 for DFO-VU. There is a factor of close to 20 when passing from EXBUN to INEXBUN, whose only difference is in the use of the inexact (simplex) gradients. The factor of 100 between the oracle calls required by INEXBUN and those required by DFO-VU is explained by the fact that DFO-VU approximates not only the gradient, but also the U-Hessian. Such an increase is not a surprise, as our implementation of DFO-VU is not optimized, and the computational burden required by DFO-VU is much higher than that required by INEXBUN.
We comment on possible numerical enhancements in this respect in Section 9.7.

Regarding failures, there were none for COMPBUN, EXBUN and INEXBUN, whose respective stopping tests were triggered in all 602 runs. DFO-VU failed 104 times by reaching the maximum number of allowed evaluations (maxS), and twice when the parameter ε_k became unduly small. This figure represents 17.5% of all the runs. Most of the failures of DFO-VU by maxS remained even after increasing maxS by a factor of 10. It is our understanding that the method reached its limit of accuracy in those instances, which were likely worse conditioned and too difficult to solve with our inexact method. By contrast, a close observation of failures of DFO-VU in previous runs that were due to a small ε_k gave us some hints for improvement of the algorithm's performance. We noticed that when ε_k becomes too small, the stopping test in Step 2.4 becomes hard to attain, and the V-step gets dismayingly slow. It is important to tune the manner in which ε_k decreases, so that the reduction is not done too fast. For our experiments, taking in Steps 4(i) and (iii) of DFO-VU

    ε_{k+1} = 0.9 ε_k

appeared a reasonable setting for the considered 602 instances.

We finish by noting that EXBUN, the classical bundle method, is extremely reliable, but neither as accurate nor as fast as COMPBUN, which fully exploits the structure of composite functions and uses exact gradient information. Out of the four solvers, if the gradient evaluations can be done exactly, COMPBUN is to be preferred. Otherwise, DFO-VU seems a good option for cases when accuracy of the solution is a more important concern than solving time.

9.7 Summary

We have presented a complete and fully-functional DFO VU-algorithm for convex finite-max objective functions on R^n under reasonable assumptions. This extends the original algorithm of [159] into the derivative-free setting, where exact function values are available but approximations of subgradients are sufficient for convergence. Numerical testing suggests that, at the expense of increased CPU time and number of function calls, the DFO VU-algorithm provides an improvement on final function value accuracy when compared to other inexact methods, and even compared to the EXBUN method that uses exact first-order information. There is much more work that can be done to improve this new algorithm; we refer the reader to Chapter 10 for recommendations on future work.

Chapter 10
Conclusion and Future Work

This thesis is based on the Moreau envelope and the proximal mapping. Part I shows the rich history of previous research done in this area, starting in the mid-1960s, that inspired the present work. Part II develops many new theoretical results on convex Moreau envelopes, under the headings of thresholds of prox-boundedness, functions with unique minimizers, strongly convex functions and generalized linear-quadratic functions. Part III presents algorithmic results that pertain to the proximal bundle method and the VU-algorithm. All results stem from the published or accepted manuscripts [94, 95, 178, 179] and the submitted manuscripts [96, 180].

10.1 Main results

We study PLQ functions in Chapter 4, presenting a method of identifying the threshold of prox-boundedness and the domain of the Moreau envelope when the threshold is used as the proximal parameter. We show that the threshold is the maximum of the thresholds of the subfunctions that define the PLQ function, and that the domain of the Moreau envelope is the intersection of domains of the Moreau envelopes of the active subfunctions. The results of this chapter do not depend on convexity.

In Chapter 5, we determine that the set of convex functions that have a unique minimizer is a generic set.
We demonstrate that this is equivalent to the genericity of the set of proximal mappings whose fixed-point set is a singleton, and of the set of convex subdifferentials that contain a unique zero. We introduce a new distance function based on the proximal mapping, which yields a complete metric space and is used in the proofs of the results of this chapter.

Chapter 6 continues the theme of Baire category. We show that the set of strongly convex functions is a meagre set, while the set of convex functions with strong minima, the set of convex coercive functions, the set of convex full-domain functions and the set of convex functions with strong minimizers are all generic sets. A similar distance function to that of Chapter 5, this time based on the Moreau envelope, is presented and used in the proofs. Characterizations involving functions with strong minima/minimizers, functions with unique minima/minimizers and Moreau envelopes with strong minima are included.

Chapter 7 is dedicated to the study of generalized linear-quadratic functions and their Moreau envelopes, from the point of view of epiconvergence. We establish calculus rules and equivalency statements, study the behaviour of the Fenchel conjugates, and develop characterizations of the Moreau envelopes of convex generalized linear-quadratic functions. Applications to extended seminorms and least squares problems are given.

In Chapter 8, we present the tilt-correct DFO proximal bundle algorithm, an inexact model-based method of finding the proximal point of a convex function. The tilt correction prevents the model function from lying above the objective function at the prox-centre, which is a necessary component of the convergence proof. We prove that the resulting inexact piecewise-linear model function is sufficient for the algorithm to converge to within a factor of the prox-parameter and the maximum subgradient error.
This work was done with the goal of producing a stand-alone proximal bundle routine that can be inserted into any optimization algorithm that uses such a routine in its inner loop, of which there are many. We include numerical testing on max-of-quadratics functions, in both high and low dimensions, that compares four variants of constructing the bundle.

Chapter 9 contains our construction of a DFO VU-algorithm for convex finite-max functions. The U-step is adapted to DFO by using the work done in [87], and the inexact V-step is obtained by incorporating the tilt-correct DFO proximal bundle routine of Chapter 8. Convergence of the aggregate subgradient norm to within a preset stopping tolerance is proved, and numerical tests that compare the new method to three other minimization algorithms are presented. The solvers are compared on the accuracy of the final function value and of the V-dimension prediction, and we determine that our algorithm performs well, even when compared to an exact subgradient method.

10.2 Future work

Chapter 4. The work on proximal thresholds found here is limited to PLQ functions. A potential area of future research would be to extend the result to include other classes of functions. A calculus of prox-thresholds would be useful as well; can one develop rules that give thresholds of the sum, product and/or composition of functions? Our work is also restricted to finite-dimensional space and to subfunctions that form a polyhedral partition of the space. Perhaps similar results could be obtained for some nonpolyhedral partitions, and for infinite-dimensional space.

Chapters 5, 6, and 7. These chapters are interrelated and set in R^n. It should be possible to expand many of the results contained here to infinite dimensions. Further development of the same ideas to other function classes should be possible as well.
For example, the question of genericity of the set of uniformly convex functions, of which the meagre set of strongly convex functions is a subset, could be explored.

Chapter 8. The algorithm in this chapter relies on convexity of the objective function, and on the black box delivering exact function values. It would be interesting to find out if adjustments can be made to handle some nicely-structured nonconvex functions, or to work with a more relaxed black box that yields inexact function values as well as inexact subgradients. However, both of these proposals likely involve a greater level of difficulty than one might suspect. They would each introduce an additional source of inexactness, which makes it challenging to be sure where inexactness is coming from at any given moment.

Chapter 9. This is a proof-of-concept implementation of the derivative-free VU-algorithm, and there is much room for improvement of its performance. We did some hand-tuning of the parameters to get better performance, but other adjustments to the code that were not made would help as well. For instance, a hard reset happens at every iteration, which means that nearby function values already calculated are not reused in the construction of the next model function. Retaining a cache of function calls and referencing it before making new evaluations would reduce the total number of grey box calls. In addition, in the construction of the simplex gradient we used the coordinate directions. A method such as the Householder transformation [185] could be used to rotate the coordinates so that the first canonical vector points in the previous descent direction. We expect these adjustments to reduce the number of function calls by a factor between n and n^2, so it is encouraging to know that future work on this project should result in quite a significant enhancement of the algorithm.

Bibliography

[1] H.
ABELS, Diffuse interface models for two-phase flows of viscous incom-pressible fluids, Lecture Notes, Max Planck Institute for Mathematics in theSciences, (2007). → pages 21[2] W. V. ACKOOIJ AND W. D. OLIVEIRA, Level bundle methods for con-strained convex optimization with various oracles, Comput. Optim. Appl.,57 (2014), pp. 555–597. → pages 132[3] Z. ALLEN-ZHU AND L. ORECCHIA, Linear coupling: An ultimate unifica-tion of gradient and mirror descent, preprint arXiv:1407.1537, (2014). →pages 2, 25[4] P. APKARIAN, D. NOLL, AND L. RAVANBOD, Nonsmooth bundle trust-region algorithm with applications to robust stability, Set-Valued Var. Anal.,24 (2016), pp. 115–148. → pages 132[5] E. ASPLUND AND R. ROCKAFELLAR, Gradients of convex functions,Trans. Amer. Math. Soc., 139 (1969), pp. 443–467. → pages 174[6] Y. ATCHADÉ, A Moreau-Yosida approximation scheme for high-dimensional posterior and quasi-posterior distributions, preprintarXiv:1505.07072, (2015). → pages 2, 18, 19[7] H. ATTOUCH, Variational Convergence for Functions and Operators, Ap-plicable Mathematics Series, Pitman, Boston, MA, 1984. → pages 75, 80,98[8] H. ATTOUCH, R. LUCCHETTI, AND R. WETS, The topology of the ρ-Hausdorff distance, Ann. Mat. Pura Appl. (4), 160 (1991), pp. 303–320.→ pages 80[9] H. ATTOUCH AND R. WETS, Isometries for the Legendre-Fenchel trans-form, Trans. Amer. Math. Soc., 296 (1986), pp. 33–60. → pages 75, 78[10] , Quantitative stability of variational systems: I. The epigraphical dis-tance, Trans. Amer. Math. Soc., 328 (1991), pp. 695–729. → pages 98185Bibliography[11] J.-P. AUBIN, Viability Theory, Springer Science & Business Media, 2009.→ pages 23[12] J.-P. AUBIN, A. BAYEN, AND P. SAINT-PIERRE, Viability Theory: NewDirections, Springer Science & Business Media, 2011. → pages[13] J.-P. AUBIN AND H. FRANKOWSKA, Set-valued Analysis, Springer Science& Business Media, 2009. → pages 23[14] C. 
AUDET, A Survey on Direct Search Methods for Blackbox Optimizationand their Applications, Springer, New York, 2014, pp. 31–56. → pages 131[15] A. BAGIROV, B. KARASÖZEN, AND M. SEZER, Discrete gradient method:derivative-free method for nonsmooth optimization, J. Optim. Theory Appl.,137 (2008), pp. 317–334. → pages 132, 138[16] J.-B. BAILLON AND G. HADDAD, Quelques propriétés des opérateursangle-bornés et n-cycliquement monotones, Israel J. Math., 26 (1977). →pages 86, 100[17] J. BARRETT, J. BLOWEY, AND H. GARCKE, Finite element approxima-tion of the Cahn–Hilliard equation with degenerate mobility, SIAM J. Num.Anal., 37 (1999), pp. 286–318. → pages 21[18] S. BARTZ, H. BAUSCHKE, J. BORWEIN, S. REICH, AND X. WANG,Fitzpatrick functions, cyclic monotonicity and Rockafellar’s antiderivative,Nonlinear Anal., 66 (2007), pp. 1198–1223. → pages 60, 100[19] S. BARTZ, H. BAUSCHKE, S. MOFFAT, AND X. WANG, The resolventaverage of monotone operators: dominant and recessive properties, SIAMJ. Optim., 26 (2016), pp. 602–634. → pages 98, 112[20] H. BAUSCHKE, J. BORWEIN, X. WANG, AND L. YAO, The Brezis-Browdertheorem in a general Banach space, J. Funct. Anal., 262 (2012), pp. 4948–4971. → pages 107, 108[21] H. BAUSCHKE AND P. COMBETTES, Convex Analysis and Monotone Op-erator Theory in Hilbert Spaces, CMS Books, Springer, New York, 2011.→ pages 11, 12, 27, 60, 73, 75, 80, 86, 91, 100, 115, 118, 129, 136, 139[22] H. BAUSCHKE, M. DAO, AND W. MOURSI, The Douglas–Rachford algo-rithm in the affine-convex case, preprint arXiv:1505.06408, (2015).→ pages2, 26186Bibliography[23] H. BAUSCHKE, R. GOEBEL, Y. LUCET, AND X. WANG, The proximalaverage: basic theory, SIAM J. Optim., 19 (2008), pp. 766–785. → pages78, 86[24] H. BAUSCHKE, Y. LUCET, AND M. TRIENIS, How to transform one convexfunction continuously into another, SIAM Rev., 50 (2008), pp. 115–132. →pages 2[25] H. BAUSCHKE AND W. MOURSI, On the order of the operators in theDouglas–Rachford algorithm, Optim. Lett., (2015), pp. 1–9. 
→ pages 2,26[26] H. BAUSCHKE AND D. NOLL, On the local convergence of the Douglas–Rachford algorithm, Archiv der Math., 102 (2014), pp. 589–600. → pages2, 26[27] H. BAUSCHKE, X. WANG, AND L. YAO, Monotone linear relations: max-imality and Fitzpatrick functions, J. Convex Anal., 16 (2009), pp. 673–686.→ pages 98, 108[28] , On Borwein–Wiersma decompositions of monotone linear relations,SIAM J. Optim., 20 (2010), pp. 2636–2652. → pages 98, 113, 127[29] T. BAYEN AND A. RAPAPORT, About Moreau–Yosida regularization of theminimal time crisis problem, J. Convex Anal., 23 (2016), pp. 263–290. →pages 2, 23, 24[30] A. BECK AND M. TEBOULLE, Smoothing and first order methods: a unifiedframework, SIAM J. Optim., 22 (2012), pp. 557–580. → pages 2[31] G. BEER, Topologies on Closed and Closed Convex Sets, Mathematics andits Applications, Kluwer Academic Publishers Group, Dordrecht, 1993. →pages 73, 74, 75, 80, 83, 88, 95, 98[32] , Norms with infinite values, J. Convex Anal, 22 (2015), pp. 35–58. →pages 124[33] G. BEER AND R. LUCCHETTI, Convex optimization and the epi-distancetopology, Trans. Amer. Math. Soc., 327 (1991), pp. 795–813. → pages 73[34] , The epi-distance topology: Continuity and stability results with ap-plications to convex optimization problems, Math. Oper. Res., 17 (1992),pp. 715–726. → pages 73187Bibliography[35] G. BEER AND J. VANDERWERFF, Structural properties of extended normedspaces, Set-Valued Var. Anal., 23 (2015), pp. 613–630. → pages 124[36] A. BEN-ISRAEL AND T. GREVILLE, Generalized Inverses: Theory andApplications, vol. 15, Springer Science & Business Media, 2003. → pages10[37] M. BERNAUER, Motion planning for the two-phase Stefan problem in levelset formulation, Ph. D. thesis, (2010). → pages 2, 20[38] J. BONNANS, J. GILBERT, C. LEMARÉCHAL, AND C. SAGASTIZÁBAL,Numerical optimization: Theoretical and Practical Aspects, Universitext,Springer-Verlag, Berlin, second ed., 2006. → pages 132, 133, 174[39] J. BORWEIN AND B. 
SIMS, The Douglas–Rachford algorithm in the ab-sence of convexity, in Fixed-Point Algorithms for Inverse Problems in Sci-ence and Engineering, Springer, 2011, pp. 93–109. → pages 2, 26[40] J. BORWEIN AND J. VANDERWERFF, Convex Functions: Constructions,Characterizations and Counterexamples, Encyclopedia of Mathematics andits Applications, Cambridge University Press, Cambridge, 2010. → pages75, 80, 82, 98[41] J. BORWEIN AND Q. ZHU, Techniques of Variational Analysis, CMS Booksin Mathematics, Springer-Verlag, New York, 2005. → pages 74[42] L. BREGMAN, A relaxation method of finding a common point of convexsets and its application to the solution of problems in convex programming,Z˘. Vycˇisl. Mat. i Mat. Fiz., 7 (1967), pp. 620–631. → pages 29[43] O. BRETSCHER, Linear Algebra with Application, Prentice-Hall, UpperSaddle River, NJ, 1995. → pages 38[44] J. BURKE AND T. HOHEISEL, Epi-convergent smoothing with applicationsto convex composite functions, SIAM J. Optim., 23 (2013), pp. 1457–1479.→ pages 98[45] J. BURKE AND M. QIAN, A variable metric proximal point algorithm formonotone operators, SIAM J. Optim., 37 (1999), pp. 353–375. → pages 2,25[46] J.-F. CAI, S. OSHER, AND Z. SHEN, Linearized Bregman iterations forframe-based image deblurring, SIAM J. Imaging Sci., 2 (2009), pp. 226–252. → pages 30188Bibliography[47] J. CHEN AND S. PAN, A proximal-like algorithm for a class of nonconvexprogramming, Pac. J. Optim., 4 (2008), pp. 319–333. → pages 2[48] J. CHESSA, P. SMOLINSKI, AND T. BELYTSCHKO, The extended finite el-ement method (xfem) for solidification problems, Int. J. Numer. Meth. Eng.,53 (2002), pp. 1959–1977. → pages 20[49] M. CˇOBAN, P. KENDEROV, AND J. REVALSKI, Generic well-posednessof optimization problems in topological spaces, Mathematika, 36 (1989),pp. 301–324 (1990). → pages 74[50] T. COLEMAN AND L. HULBERT, A globally and superlinearly convergentalgorithm for convex quadratic programs with simple bounds, tech. rep.,Cornell University, 1990. 
→ pages 134[51] P. COMBETTES, D. DU˜NG, AND B. VU˜, Proximity for sums of compositefunctions, J. Math. Anal. Appl., 380 (2011), pp. 680–688. → pages 2[52] P. COMBETTES AND J. PESQUET, Proximal thresholding algorithmfor minimization over orthonormal bases, SIAM J. Optim., 18 (2007),pp. 1351–1376. → pages 2[53] , Proximal splitting methods in signal processing, in Fixed-point algo-rithms for inverse problems in science and engineering, vol. 49 of SpringerOptim. Appl., Springer, New York, 2011, pp. 185–212. → pages 132[54] P. COMBETTES AND J.-C. PESQUET, A Douglas–Rachford splitting ap-proach to nonsmooth convex variational signal recovery, IEEE J. Sel. Top.Signa., 1 (2007), pp. 564–574. → pages 26[55] P. COMBETTES AND V. WAJS, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 4 (2005), pp. 1168–1200. →pages 2[56] A. CONN, K. SCHEINBERG, AND P. TOINT, On the convergence ofderivative-free methods for unconstrained optimization, in Approximationtheory and optimization, Cambridge Univ. Press, Cambridge, 1997, pp. 83–108. → pages 131[57] A. CONN, K. SCHEINBERG, AND L. VICENTE, Introduction to Derivative-free Optimization, vol. 8, Siam, 2009. → pages iii, 131, 163[58] R. CORREA AND C. LEMARÉCHAL, Convergence of some algorithms forconvex minimization, Math. Program., 62 (1993), pp. 261–275. → pages138, 162189Bibliography[59] R. CROSS, Multivalued Linear Operators, vol. 213 of Monographs andTextbooks in Pure and Applied Mathematics, Marcel Dekker, Inc., NewYork, 1998. → pages 98[60] A. CUSTÓDIO AND L. VICENTE, Using sampling and simplex derivativesin pattern search methods, SIAM J. Optim., 18 (2007), pp. 537–555. →pages 131[61] D. DAVIS, Convergence rate analysis of the forward-Douglas–Rachfordsplitting scheme, SIAM J. Optim., 25 (2015), pp. 1760–1786. → pages2, 26[62] W. DE OLIVEIRA, C. SAGASTIZÁBAL, AND C. LEMARÉCHAL, Convexproximal bundle methods in depth: a unified analysis for inexact oracles,Math. Program., 148 (2014), pp. 241–277. 
→ pages 132, 135, 138[63] W. DE OLIVEIRA AND M. SOLODOV, A doubly stabilized bundle methodfor nonsmooth convex optimization, Math. Program., 156 (2016), pp. 125–159. → pages 132[64] R. DEMBO AND R. ANDERSON, An efficient linesearch for convexpiecewise-linear/quadratic functions, in Advances in numerical partial dif-ferential equations and optimization (Mérida, 1989), SIAM, Philadelphia,PA, 1991, pp. 1–8. → pages 33[65] R. DEVILLE, G. GODEFROY, AND V. ZIZLER, A smooth variational prin-ciple with applications to Hamilton-Jacobi equations in infinite dimensions,J. Funct. Anal., 111 (1993), pp. 197–212. → pages 74, 84[66] R. DEVILLE AND J. REVALSKI, Porosity of ill-posed problems, Proc. Amer.Math. Soc., 128 (2000), pp. 1117–1124. → pages 74[67] A. DONTCHEV AND T. ZOLEZZI, Well-posed Optimization Problems, Lect.Notes Math., Springer-Verlag, Berlin, 1993. → pages 73, 74[68] L. DOYEN AND P. SAINT-PIERRE, Scale of viability and minimal time ofcrisis, Set-Valued Anal., 5 (1997), pp. 227–246. → pages 23[69] C. ECK, H. GARCKE, AND P. KNABNER, Mathematische Modellierung,Springer-Verlag, 2008. → pages 20[70] J. ECKSTEIN AND D. BERTSEKAS, On the Douglas–Rachford splittingmethod and the proximal point algorithm for maximal monotone operators,Math. Program., 55 (1992), pp. 293–318. → pages 26190Bibliography[71] M. FISHER, An applications oriented guide to Lagrangian relaxation, In-terfaces, 15 (1985), pp. 10–21. → pages 138[72] N. FLAMMARION AND F. BACH, From averaging to acceleration, there isonly a step-size, preprint arXiv:1504.01577, (2015). → pages 2, 25[73] A. FORSYTHE AND G. FORSYTHE, Punched-card experiments with accel-erated gradient methods for linear equations, National Bureau of Standards,Appl. Math. Ser, 39 (1954), pp. 55–69. → pages 2, 25[74] A. FRANGIONI AND E. GORGONE, Bundle methods for sum-functions with“easy” components: applications to multicommodity network design, Math.Program., 145 (2014), pp. 133–161. → pages 132[75] A. FUDULI, M. GAUDIOSO, AND G. 
GIALLOMBARDO, A DC piecewiseaffine model and a bundling technique in nonconvex nonsmooth minimiza-tion, Optim. Methods Softw., 19 (2004), pp. 89–102. → pages 132[76] B. GARDINER AND Y. LUCET, Convex hull algorithms for piecewiselinear-quadratic functions in computational convex analysis, Set-ValuedVar. Anal., 18 (2010), pp. 467–482. → pages 33[77] A. GELMAN, J. CARLIN, H. STERN, AND D. RUBIN, Bayesian Data Anal-ysis, vol. 2, Taylor & Francis, 2014. → pages 18[78] P. GETREUER, Rudin-Osher-Fatemi total variation denoising using splitBregman, Image Processing On Line, 2 (2012), pp. 74–95. → pages 30[79] S. GHADIMI AND G. LAN, Accelerated gradient methods for nonconvexnonlinear and stochastic programming, Math. Program., (2015), pp. 1–41.→ pages 2, 25[80] J. GILL, Bayesian Methods: a Social and Behavioral Sciences Approach,vol. 20, CRC press, 2014. → pages 18[81] A. GOLDSTEIN AND I. RUSSAK, How good are the proximal point algo-rithms?, Numer. Func. Anal. Opt., 9 (1987), pp. 709–724. → pages 2, 25[82] O. GÜLER, On the convergence of the proximal point algorithm for convexminimization, SIAM J. Control Optim., 29 (1991), pp. 403–419. → pages33[83] , New proximal point algorithms for convex minimization, SIAM J.Optim., 2 (1992), pp. 649–664. → pages 2, 25191Bibliography[84] A. GUPAL, A method for the minimization of almost-differentiable func-tions, Cybernet. Syst. Anal., 13 (1977), pp. 115–117. → pages 132, 138[85] S. GUPTA, The Classical Stefan Problem: Basic Concepts, Modelling andAnalysis, Elsevier, 2003. → pages 20[86] W. HARE, A proximal average for nonconvex functions: a proximal stabilityperspective, SIAM J. Optim., 20 (2009), pp. 650–666. → pages 2[87] , Numerical analysis of VU-decomposition, U-gradient, and U-Hessian approximations, SIAM J. Optim., 24 (2014), pp. 1890–1913. →pages 156, 158, 160, 163, 165, 166, 183[88] W. HARE AND Y. LUCET, Derivative-free optimization via proximal pointmethods, J. Optim. Theory Appl., 160 (2014), pp. 204–220. → pages 131[89] W. 
HARE AND M. MACKLEM, Derivative-free optimization methods forfinite minimax problems, Optim. Methods Softw., 28 (2013), pp. 300–312.→ pages 131[90] W. HARE AND J. NUTINI, A derivative-free approximate gradient samplingalgorithm for finite minimax problems, Comput. Optim. Appl., 56 (2013),pp. 1–38. → pages 131, 132, 138, 152, 153, 158[91] W. HARE, J. NUTINI, AND S. TESFAMARIAM, A survey of non-gradientoptimization methods in structural engineering, Adv. Eng. Softw., 59(2013), pp. 19–28. → pages 131[92] W. HARE AND C. PLANIDEN, The NC-proximal average for multiple func-tions, Optim. Lett., 8 (2014), pp. 849–860. → pages 2[93] , Parametrically prox-regular functions, J. Convex Anal., 21 (2014),pp. 901–923. → pages 2[94] , Thresholds of prox-boundedness of PLQ functions, J. Convex Anal.,23 (2016), pp. 691–718. → pages v, 32, 182[95] , Computing proximal points of convex functions with inexact subgra-dients, (accepted) Set-valued Var. Anal., (2017), pp. 1–24.→ pages 25, 131,152, 162, 163, 167, 169, 172, 182[96] W. HARE, C. PLANIDEN, AND C. SAGASTIZÁBAL, A derivative-free VU-algorithm, (submitted) Math. Program., (2018). → pages v, 157, 182192Bibliography[97] W. HARE AND R. POLIQUIN, Prox-regularity and stability of the proximalmapping, J. Convex Anal., 14 (2007), pp. 589–606. → pages 2[98] W. HARE AND C. SAGASTIZÁBAL, Computing proximal points of noncon-vex functions, Math. Program., 116 (2009), pp. 221–258. → pages 2, 33,133, 147, 155, 156[99] , A redistributed proximal bundle method for nonconvex optimization,SIAM J. Optim., 20 (2010), pp. 2442–2473. → pages iii, 2, 25, 132[100] W. HARE, C. SAGASTIZÁBAL, AND M. SOLODOV, A proximal bundlemethod for nonsmooth nonconvex functions with inexact information, Com-put. Optim. Appl., 63 (2016), pp. 1–28. → pages 132, 138, 155[101] L. HE, T.-C. CHANG, AND S. OSHER, MR image reconstruction fromsparse radial samples by using iterative refinement procedures, in Proceed-ings of the 13th Annual Meeting of ISMRM, vol. 696, 2006. 
→ pages 30[102] C. HELMBERG, M. OVERTON, AND F. RENDL, The spectral bundlemethod with second-order information, Optim. Methods Softw., 29 (2014),pp. 855–876. → pages 132[103] C. HELMBERG AND F. RENDL, A spectral bundle method for semidefiniteprogramming, SIAM J. Optim., 10 (2000), pp. 673–696. → pages 132[104] R. HESSE, D. LUKE, AND P. NEUMANN, Alternating projections andDouglas–Rachford for sparse affine feasibility, IEEE T. Signal Proces., 62(2014), pp. 4868–4881. → pages 26[105] M. HINTERMÜELLER, M. HINZE, AND C. KAHLE, An adaptive finite el-ement Moreau–Yosida-based solver for a coupled Cahn–Hilliard/Navier–Stokes system, J. Comp. Phys., 235 (2013), pp. 810–827. → pages 21[106] M. HINTERMÜLLER, A proximal bundle method based on approximate sub-gradients, Comput. Optim. Appl., 20 (2001), pp. 245–266. → pages 2, 25,133[107] M. HINTERMÜLLER, M. HINZE, AND M. TBER, An adaptive finite-element Moreau–Yosida-based solver for a non-smooth Cahn–Hilliardproblem, Optim. Methods Softw., 26 (2011), pp. 777–811. → pages 2, 21,22[108] J.-B. HIRIART-URRUTY, The deconvolution operation in convex analysis:An introduction, Cybern. Syst. Anal., 30 (1994), pp. 555–560.→ pages 113,118193Bibliography[109] J.-B. HIRIART-URRUTY AND C. LEMARÉCHAL, Convex Analysis andMinimization Algorithms. II, vol. 306 of Grundlehren der MathematischenWissenschaften, Springer-Verlag, Berlin, 1993. Advanced theory and bun-dle methods. → pages 110, 134, 170[110] P. HOHENBERG AND B. HALPERIN, Theory of dynamic critical phenom-ena, Rev. Mod. Phys., 49 (1977), p. 435. → pages 21[111] P. HOHENBERG AND W. KOHN, Inhomogeneous electron gas, Phys. Rev.,136 (1964), p. B864. → pages 16[112] S. JACKMAN, Bayesian Analysis for the Social Sciences, vol. 846, JohnWiley & Sons, 2009. → pages 18[113] J. JOHNSTONE, V. KOCH, AND Y. LUCET, Convexity of the proximal aver-age, J. Optim. Theory Appl., 148 (2011), pp. 107–124. → pages 2[114] K. JOKI, A. BAGIROV, N. KARMITSA, AND M. 
MÄKELÄ, A proximalbundle method for nonsmooth DC optimization utilizing nonconvex cuttingplanes, J. Global Optim., 68 (2017), pp. 501–535. → pages 132[115] A. JOURANI, L. THIBAULT, AND D. ZAGRODNY, Differential propertiesof the Moreau envelope, J. Funct. Anal., 266 (2014), pp. 1185–1237. →pages 2[116] E. KALVELAGEN, Benders decomposition with GAMS,web.stanford.edu/class/msande348/papers/bendersingams. pdf, 7 (2002).→ pages 138[117] C. KAN AND W. SONG, The Moreau envelope function and proximal map-ping in the sense of the Bregman distance, Nonlinear Anal., 75 (2012),pp. 1385–1399. → pages 2[118] A. KAPLAN AND R. TICHATSCHKE, Proximal point methods and noncon-vex optimization, J. Global Optim., 13 (1998), pp. 389–406. Workshop onGlobal Optimization (Trier, 1997). → pages 14, 33[119] E. KARAS, A. RIBEIRO, C. SAGASTIZÁBAL, AND M. SOLODOV, Abundle-filter method for nonsmooth convex constrained optimization, Math.Program., 116 (2009), pp. 297–320. → pages 134[120] N. KARMITSA, A. BAGIROV, AND S. TAHERI, New diagonal bundlemethod for clustering problems in large data sets, European J. Oper. Res.,263 (2017), pp. 367–379. → pages 132194Bibliography[121] C. KELLEY, Iterative Methods for Optimization, vol. 18, Siam, 1999. →pages 163, 164[122] H. KELLEY, W. DENHAM, I. JOHNSON, AND P. WHEATLEY, An accel-erated gradient method for parameter optimization with non-linear con-straints, J. Astronaut. Sci., 13 (1966), p. 166. → pages 2, 25[123] P. KENDEROV AND J. REVALSKI, The Banach-Mazur game and genericexistence of solutions to optimization problems, Proc. Amer. Math. Soc.,118 (1993), pp. 911–917. → pages 74[124] K. KIWIEL, Proximity control in bundle methods for convex nondifferen-tiable minimization, Math. Program., 46 (1990), pp. 105–122. → pages132, 134, 162[125] , A tilted cutting plane proximal bundle method for convex nondiffer-entiable optimization, Oper. Res. Lett., 10 (1991), pp. 75–81. 
→ pages 2, 25
[126] K. KIWIEL, Approximations in proximal bundle methods and decomposition of convex programs, J. Optim. Theory Appl., 84 (1995), pp. 529–548. → pages 133, 134, 135, 137, 138
[127] K. KIWIEL, A proximal bundle method with approximate subgradient linearizations, SIAM J. Optim., 16 (2006), pp. 1007–1023. → pages 132
[128] K. KIWIEL, A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization, SIAM J. Optim., 20 (2010), pp. 1983–1994. → pages 132, 138
[129] W. KOHN AND L. SHAM, Self-consistent equations including exchange and correlation effects, Phys. Rev., 140 (1965), pp. A1133–A1138. → pages 16
[130] E. KREYSZIG, Introductory Functional Analysis with Applications, Wiley Classics Library, John Wiley & Sons, Inc., New York, 1989. → pages 75
[131] S. KVAAL, U. EKSTRÖM, A. TEALE, AND T. HELGAKER, Differentiable but exact formulation of density-functional theory, J. Chem. Phys., 140 (2014), p. 18A518. → pages 2, 16, 17
[132] S. LACOSTE-JULIEN, M. SCHMIDT, AND F. BACH, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv:1212.2002, (2012), pp. 1–8. → pages 73
[133] J. LARSON, M. MENICKELLY, AND S. WILD, Manifold sampling for ℓ1 nonconvex optimization, SIAM J. Optim., 26 (2016), pp. 2540–2563. → pages 131
[134] J. LARSON AND S. WILD, A batch, derivative-free algorithm for finding multiple local minima, Optim. Eng., 17 (2016), pp. 205–228. → pages 132
[135] O. LEFEBVRE AND C. MICHELOT, About the finite convergence of the proximal point algorithm, in Trends in Mathematical Optimization, Springer, 1988, pp. 153–161. → pages 2, 25
[136] C. LEMARÉCHAL, F. OUSTRY, AND C. SAGASTIZÁBAL, The U-Lagrangian of a convex function, Trans. Amer. Math. Soc., 352 (2000), pp. 711–729. → pages 160, 161
[137] C. LEMARÉCHAL AND C. SAGASTIZÁBAL, Practical aspects of the Moreau–Yosida regularization: Theoretical preliminaries, SIAM J. Optim., 7 (1997), pp. 367–385. → pages 98
[138] C. LEMARÉCHAL AND C. SAGASTIZÁBAL, Variable metric bundle methods: from conceptual to implementable forms, Math. Program., 76 (1997), pp. 393–410. → pages 132
[139] L. LESSARD, B. RECHT, AND A. PACKARD, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., 26 (2016), pp. 57–95. → pages 25
[140] M. LEVY AND J. PERDEW, Hellmann-Feynman, virial, and scaling requisites for the exact universal density functionals. Shape of the correlation potential and diamagnetic susceptibility for atoms, Phys. Rev., 32 (1985), p. 2010. → pages 16
[141] A. LEWIS AND S. WRIGHT, A proximal method for composite minimization, Math. Program., 158 (2016), pp. 501–546. → pages 132
[142] X. LIU AND L. HUANG, Split Bregman iteration algorithm for total bounded variation regularization based image deblurring, J. Math. Anal. Appl., 372 (2010), pp. 486–495. → pages 30
[143] R. LUCCHETTI, Convexity and Well-posed Problems, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 22, Springer, New York, 2006. → pages 73, 74, 80
[144] Y. LUCET, Fast Moreau envelope computation. I. Numerical algorithms, Numer. Algorithms, 43 (2006), pp. 235–249 (2007). → pages 2
[145] Y. LUCET, What shape is your conjugate? A survey of computational convex analysis and its applications, SIAM Rev., 52 (2010), pp. 505–542. → pages 2
[146] Y. LUCET, H. BAUSCHKE, AND M. TRIENIS, The piecewise linear-quadratic model for computational convex analysis, Comput. Optim. Appl., 43 (2009), pp. 95–118. → pages 33
[147] L. LUKŠAN AND J. VLČEK, Test problems for nonsmooth unconstrained and linearly constrained optimization, Technická zpráva, 798 (2000). → pages 152
[148] M. LUSTIG, D. DONOHO, AND J. PAULY, The application of compressed sensing for rapid MR imaging, Mag. Res. Med., 58 (2007), pp. 1182–95. → pages 30
[149] M. LUSTIG, J. LEE, D. DONOHO, AND J. PAULY, Faster imaging with randomly perturbed, under-sampled spirals and L1 reconstruction, in Proceedings of the 13th annual meeting of ISMRM, Miami Beach, 2005, p. 685. → pages 30
[150] B. MARTINET, Régularisation d’inéquations variationnelles par approximations successives, Rev. Française Informat. Recherche Opérationnelle, 4 (1970), pp. 154–158. → pages 132
[151] B. MARTINET, Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox, C. R. Acad. Sci. Paris Sér. A-B, 274 (1972), pp. A163–A165. → pages iii, 2, 25
[152] J. MAYER, Stochastic Linear Programming Algorithms: A Comparison Based on a Model Management System, vol. 1, CRC Press, 1998. → pages 134
[153] C. MEYER, Matrix Analysis and Applied Linear Algebra, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000. → pages 113, 119, 126
[154] R. MIFFLIN AND C. SAGASTIZÁBAL, VU-decomposition derivatives for convex max-functions, in Ill-posed variational problems and regularization techniques (Trier, 1998), vol. 477 of Lecture Notes in Econom. and Math. Systems, Springer, Berlin, 1999, pp. 167–186. → pages 158
[155] R. MIFFLIN AND C. SAGASTIZÁBAL, Functions with primal-dual gradient structure and U-Hessians, in Nonlinear optimization and related topics (Erice, 1998), vol. 36 of Appl. Optim., Kluwer Acad. Publ., Dordrecht, 2000, pp. 219–233. → pages 158
[156] R. MIFFLIN AND C. SAGASTIZÁBAL, On VU-theory for functions with primal-dual gradient structure, SIAM J. Optim., 11 (2000), pp. 547–571 (electronic). → pages 158
[157] R. MIFFLIN AND C. SAGASTIZÁBAL, Primal-dual gradient structured functions: second-order results; links to epi-derivatives and partly smooth functions, SIAM J. Optim., 13 (2003), pp. 1174–1194. → pages 158
[158] R. MIFFLIN AND C. SAGASTIZÁBAL, VU-smoothness and proximal point results for some nonconvex functions, Optim. Methods Softw., 19 (2004), pp. 463–478. → pages 158, 161
[159] R. MIFFLIN AND C. SAGASTIZÁBAL, A VU-algorithm for convex minimization, Math. Program., 104 (2005), pp. 583–608. → pages iii, 132, 156, 157, 158, 160, 161, 162, 181
[160] B. MORDUKHOVICH, Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330, Springer, 2006. → pages 115
[161] J.-J. MOREAU, Fonctions convexes duales et points proximaux dans un espace Hilbertien, C. R. Acad. Sci. Paris, 255 (1962), pp. 2897–2899. → pages 16
[162] J.-J. MOREAU, Propriétés des applications “prox”, C. R. Acad. Sci. Paris, 256 (1963), pp. 1069–1071. → pages iii, 2, 14
[163] J.-J. MOREAU, Proximité et dualité dans un espace Hilbertien, Bull. Soc. Math. France, 93 (1965), pp. 273–299. → pages iii, 2, 14, 16
[164] J.-J. MOREAU, Quadratic programming in mechanics: dynamics of one-sided constraints, SIAM J. Control, 4 (1966), pp. 153–158. → pages 15
[165] L. NEL, Continuity Theory, Springer, 2016. → pages 174
[166] Y. NESTEROV, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547. → pages 26
[167] Y. NESTEROV, Introductory Lectures on Convex Programming, Volume I: Basic Course, Kluwer, 2001. → pages 26
[168] D. NOLL, O. PROT, AND A. RONDEPIERRE, A proximity control algorithm to minimize nonsmooth and nonconvex functions, Pac. J. Optim., 4 (2008), pp. 571–604. → pages 132
[169] D. O’CONNOR AND L. VANDENBERGHE, Primal-dual decomposition by operator splitting and applications to image deblurring, SIAM J. Imaging Sci., 7 (2014), pp. 1724–1754. → pages 26
[170] B. O’DONOGHUE AND E. CANDES, Adaptive restart for accelerated gradient schemes, Found. Comput. Math., 15 (2015), pp. 715–732. → pages 25
[171] W. D. OLIVEIRA, Proximal bundle methods for nonsmooth DC programming, preprint, (2017). http://www.oliveira.mat.br/publications. → pages 132
[172] W. D. OLIVEIRA, C. SAGASTIZÁBAL, AND C. LEMARÉCHAL, Convex proximal bundle methods in depth: a unified analysis for inexact oracles, Math. Program., 148 (2014), pp. 241–277. → pages 132, 174, 175
[173] S. OSHER, Y. MAO, B. DONG, AND W. YIN, Fast linearized Bregman iteration for compressive sensing and sparse denoising, Commun. Math. Sci., 8 (2010), pp. 93–111. → pages 30
[174] L. PARENTE, P. LOTITO, AND M. SOLODOV, A class of inexact variable metric proximal point algorithms, SIAM J. Optim., 19 (2008), pp. 240–260. → pages 2, 25
[175] R. PARR, Density Functional Theory of Atoms and Molecules, Springer, 1980. → pages 16
[176] M. PEREYRA, Proximal Markov chain Monte Carlo algorithms, Stat. Comput., (2015), pp. 1–16. → pages 17, 18
[177] R. PEVERATI AND D. TRUHLAR, Quest for a universal density functional: the accuracy of density functionals across a broad spectrum of databases in chemistry and physics, Philos. Trans. A Math. Phys. Eng. Sci., 372 (2014), p. 20120476. → pages 16
[178] C. PLANIDEN AND X. WANG, Most convex functions have unique minimizers, J. Convex Anal., 23(3) (2016), pp. 877–892. → pages v, 58, 73, 182
[179] C. PLANIDEN AND X. WANG, Strongly convex functions, Moreau envelopes, and the generic nature of convex functions with strong minimizers, SIAM J. Optim., 26 (2016), pp. 1341–1364. → pages 72, 182
[180] C. PLANIDEN AND X. WANG, Epiconvergence, the Moreau envelope and generalized linear-quadratic functions, (submitted) J. Optim. Theory Appl., (2018). → pages v, 97, 182
[181] G. PLONKA AND J. MA, Curvelet-wavelet regularized split Bregman iteration for compressed sensing, Int. J. Wavelets Multi., 9 (2011), pp. 79–110. → pages 30
[182] R. POLIQUIN AND R. ROCKAFELLAR, Generalized Hessian properties of regularized nonsmooth functions, SIAM J. Optim., 6 (1996), pp. 1121–1137. → pages 2, 98
[183] R. POLIQUIN AND R. ROCKAFELLAR, Prox-regular functions in variational analysis, Trans. Amer. Math. Soc., 348 (1996), pp. 1805–1838. → pages 2
[184] M. POWELL, Developments of NEWUOA for minimization without derivatives, IMA J. Numer. Anal., 28 (2008), pp. 649–664. → pages 131
[185] W. PRESS, S. TEUKOLSKY, W. VETTERLING, AND B. FLANNERY, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, third ed., 2007. → pages 184
[186] P. PURKAIT AND B. CHANDA, Super resolution image reconstruction through Bregman iteration using morphologic regularization, IEEE T. Image Proces., 21 (2012), pp. 4029–4039. → pages 30
[187] A. RANTZER AND M. JOHANSSON, Piecewise linear quadratic optimal control, IEEE T. Automat. Control, 45 (2000), pp. 629–637. → pages 33
[188] S. REICH AND A. ZASLAVSKI, Genericity in Nonlinear Analysis, vol. 34 of Developments in Mathematics, Springer, New York, 2014. → pages 59, 61, 73
[189] J. REVALSKI AND N. ZHIVKOV, Well-posed constrained optimization problems in metric spaces, J. Optim. Theory Appl., 76 (1993), pp. 145–163. → pages 73
[190] P. REY AND C. SAGASTIZÁBAL, Dynamical adjustment of the prox-parameter in bundle methods, Optimization, 51 (2002), pp. 423–447. → pages 33, 156
[191] R. ROCKAFELLAR, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898. → pages 2, 14, 25, 67, 73, 85, 161
[192] R. ROCKAFELLAR, On the essential boundedness of solutions to problems in piecewise linear-quadratic optimal control, in Analyse mathématique et applications, Gauthier-Villars, Montrouge, 1988, pp. 437–443. → pages 33
[193] R. ROCKAFELLAR, Convex Analysis, Princeton Landmarks in Mathematics, Princeton University Press, Princeton, NJ, 1997. → pages 75, 80, 82, 95, 99, 111, 112, 125
[194] R. ROCKAFELLAR, A Short Biography of Jean-Jacques Moreau. https://bipop.inrialpes.fr/people/acary/Research/Moreau-bio.pdf, 2016. → pages 15
[195] R. ROCKAFELLAR AND J. ROYSET, Random variables, monotone relations, and convex analysis, Math. Program., Ser. B, 148 (2014), pp. 297–331. → pages 98
[196] R. ROCKAFELLAR AND R. WETS, Variational Analysis, Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin, 1998. → pages iii, 2, 5, 12, 13, 15, 16, 27, 33, 43, 47, 50, 60, 67, 74, 75, 76, 79, 91, 98, 99, 100, 116, 133
[197] S. ROMAN, Advanced Linear Algebra, vol. 3, Springer, 2005. → pages 115
[198] C. SAGASTIZÁBAL, Composite proximal bundle method, Math. Program., 140 (2013), pp. 189–233. → pages 132, 174, 175
[199] A. SCHRECK, G. FORT, S. LE CORFF, AND E. MOULINES, A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, IEEE J. Select. Topics in Signal Process., 10 (2016), pp. 366–375. → pages 18
[200] M. SEARCÓID, Metric Spaces, Springer Undergraduate Mathematics Series, Springer-Verlag London, Ltd., London, 2007. → pages 61
[201] J. SHEN, X.-Q. LIU, F.-F. GUO, AND S.-X. WANG, An approximate redistributed proximal bundle method with inexact data for minimizing nonsmooth nonconvex functions, Mathematical Problems in Engineering, 2015 (2015). → pages 133, 135, 138
[202] J. SHEN, Z.-Q. XIA, AND L.-P. PANG, A proximal bundle method with inexact data for convex nondifferentiable minimization, Nonlinear Anal., 66 (2007), pp. 2016–2027. → pages 133, 138
[203] I. SINGER, A Fenchel-Rockafellar type duality theorem for maximization, Bull. Australian Math. Soc., 20 (1979), pp. 193–198. → pages 100
[204] M. SOLODOV, A bundle method for a class of bilevel nonsmooth convex minimization problems, SIAM J. Optim., 18 (2007), pp. 242–259. → pages 133
[205] J. SPINGARN, Applications of the method of partial inverses to convex programming: decomposition, Math. Program., 32 (1985), pp. 199–223. → pages 26
[206] J. SPINGARN AND R. ROCKAFELLAR, The generic nature of optimality conditions in nonlinear programming, Math. Oper. Res., 4 (1979), pp. 425–430. → pages 73
[207] P. STANIMIROVIĆ AND M. MILADINOVIĆ, Accelerated gradient descent methods with line search, Numer. Algorithms, 54 (2010), pp. 503–520. → pages 25, 26
[208] K. STROMBERG, An Introduction to Classical Real Analysis, AMS Chelsea Publishing, Providence, RI, 1981. → pages 61
[209] W.-Y. SUN, R. SAMPAIO, AND M. CANDIDO, Proximal point algorithm for minimization of DC function, J. Comput. Math., 21 (2003), pp. 451–462. → pages 33, 133, 138
[210] A. TIKHONOV, On the solution of incorrectly put problems and the regularisation method, Outlines Joint Sympos. Partial Differential Equations (Novosibirsk, 1963), (1963), pp. 261–265. → pages 28
[211] X. WANG, Most maximally monotone operators have a unique zero and a super-regular resolvent, Nonlinear Anal., 87 (2013), pp. 69–82. → pages 59, 61, 69, 73
[212] I. YAMADA, M. YUKAWA, AND M. YAMAGISHI, Minimizing the Moreau envelope of nonsmooth convex functions over the fixed point set of certain quasi-nonexpansive mappings, in Fixed-point algorithms for inverse problems in science and engineering, vol. 49 of Springer Optim. Appl., Springer, New York, 2011, pp. 345–390. → pages 2
[213] L. YAO, On monotone linear relations and the sum problem in Banach spaces, UBC Ph. D. thesis, (2011). → pages 98, 105, 106, 107, 108
[214] W. YIN, S. OSHER, J. DARBON, AND D. GOLDFARB, Bregman iterative algorithms for compressed sensing and related problems, SIAM J. Imaging Sci., 1 (2008), pp. 143–168. → pages 30
[215] K. YOSIDA, Functional Analysis, Die Grundlehren der Mathematischen Wissenschaften, Band 123, Academic Press, Inc., New York; Springer-Verlag, Berlin, 1965. → pages iii, 2, 27
[216] C. ZĂLINESCU, On uniformly convex functions, J. Math. Anal. Appl., 95 (1983), pp. 344–374. → pages 91
[217] C. ZĂLINESCU, Convex Analysis in General Vector Spaces, World Scientific Publishing Co., Inc., River Edge, NJ, 2002. → pages 75, 77, 90, 91
Item Metadata
Title | Theory and algorithmic applications of the proximal mapping and Moreau envelope |
Creator | Planiden, Chayne Daniel |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | The Moreau envelope and the proximal mapping have been objects of great interest for optimizers since their conception more than half a century ago. They provide us with many desirable properties; for instance, the Moreau envelope of a convex function is smooth (differentiable) while the function may not be, and the envelope maintains the same minimum value and the same set of minimizers as the function. This is a great advantage to have when the objective is to minimize the function, because standard Calculus methods can then be applied to minimize the smooth envelope. From a computational standpoint, the proximal mapping has given rise to many efficient minimization algorithms, such as the proximal point method and proximal bundle methods. Derivative-free optimization methods continue to grow in importance and popularity. The term ‘derivative-free’ refers to the fact that for the function to be minimized, (sub)gradient information is either unavailable or inconvenient to use, thus necessitating an algorithm that does not require subgradients. Such algorithms rely on constructs such as the simplex gradient to obtain good-quality approximations of subgradients and use the approximations in derivative-free versions of the proximal routines. The present work is divided into three major parts. Part I provides a history of the Moreau envelope, with the goal of illustrating its usefulness and some of its successes over the past few decades. Part II contains new theoretical results that involve the Moreau envelope and the proximal mapping on many topics, including prox-boundedness, convex functions with unique and/or strong minimizers, Baire category and epiconvergence. Part III is the algorithmic section, where a proximal bundle method is converted to derivative-free format. Using this result, a particular minimization algorithm for convex finite-max functions called the VU-algorithm is presented and also converted to derivative-free. 
The new method is proved convergent and numerical results are included. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2018-02-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0363957 |
URI | http://hdl.handle.net/2429/64685 |
Degree | Doctor of Philosophy - PhD |
Program | Mathematics |
Affiliation | Irving K. Barber School of Arts and Sciences (Okanagan); Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan) |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-05 |
Campus | UBCO |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |