Intersections and sums of sets for the regularization of inverse problems
Bas Peters, 2019


Intersections and sums of sets for the regularization of inverse problems

by Bas Peters

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate and Postdoctoral Studies (Geophysics)

The University of British Columbia (Vancouver)

May 2019

© Bas Peters, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "Intersections and sums of sets for the regularization of inverse problems", submitted by Bas Peters in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Geophysics.

Examining Committee:
Felix J. Herrmann, Earth, Ocean and Atmospheric Sciences (Supervisor)
Michael Bostock, Earth, Ocean and Atmospheric Sciences (Supervisor)
Chen Greif, Computer Science (Supervisory Committee Member)
Robert Rohling, Electrical and Computer Engineering (University Examiner)
Purang Abolmaesumi, Electrical and Computer Engineering (University Examiner)
Laurent Demanet, Department of Mathematics, Massachusetts Institute of Technology (External Examiner)

Abstract

Inverse problems in the imaging sciences encompass a variety of applications. The primary problem of interest is the identification of physical parameters from observed data that come from experiments governed by partial differential equations. The secondary type of imaging problem attempts to reconstruct images and video that are corrupted by, for example, noise, subsampling, blur, or saturation.

The quality of the solution of an inverse problem is sensitive to issues such as noise and missing entries in the data. The non-convex seismic full-waveform inversion problem suffers from parasitic local minima that lead to wrong solutions that may look realistic even for noiseless data. To meet some of these challenges, I propose solution strategies that constrain the model parameters at every iteration to help guide the inversion.

To arrive at this goal, I present new practical workflows, algorithms, and software that avoid manual tuning parameters and that allow us to incorporate multiple pieces of prior knowledge. As opposed to penalty methods, I avoid balancing the influence of multiple pieces of prior knowledge by working with intersections of constraint sets. I explore and present advantages of constraints for imaging. Because the resulting problems are often non-trivial to solve, especially on large 3D grids, I introduce faster algorithms dedicated to computing projections onto intersections of multiple sets.

To connect prior knowledge more directly to problem formulations, I also combine ideas from additive models, such as cartoon-texture decomposition and robust principal component analysis, with intersections of multiple constraint sets for the regularization of inverse problems. The result is an extension of the concept of a Minkowski set.

Examples from non-unique physical parameter estimation problems show that constraints in combination with projection methods provide control over the model properties at every iteration. This can lead to improved results when the constraints are carefully relaxed.

Lay Summary

When we send electromagnetic or seismic signals through an unknown medium (the Earth, humans) and measure the output using multiple sensors, we know the input and output but not what the medium looks like inside. The goal of inverse problems is to use the input and output to compute the materials the signals passed through.
While we can numerically solve such problems, there are often many answers that satisfy the measurements—i.e., non-uniqueness. The quality of the computed solutions also decreases when there is noise in the observations, or when data is missing. I propose methods to mitigate these issues by merging measured data with prior knowledge about the material. I construct new formulations that better translate expert intuition, as well as inferences from other types of observations, into a mathematical problem. The new and faster computational methods that I derive can include more pieces of prior information than existing techniques.

Preface

All presented content in this thesis is the result of research in the Seismic Laboratory for Imaging and Modeling at the University of British Columbia (Vancouver), supervised by Professor Felix J. Herrmann. All main chapters are currently published, in review, or submitted for publication. I am the primary researcher and author of all chapters. My supervisor reviewed and suggested improvements to all documents. I formulated and developed the research questions, algorithms, software, and numerical experiments.

Chapter 2 was published as Peters, Bas, and Felix J. Herrmann. "Constraints versus penalties for edge-preserving full-waveform inversion." The Leading Edge 36, no. 1 (2017): 94-100. Chapter 3 was published as Peters, Bas, Brendan R. Smithyman, and Felix J. Herrmann. "Projection methods and applications for seismic nonlinear inverse problems with multiple constraints." Geophysics (2019). Chapters 4 and 5 will be submitted for review. The software packages corresponding to chapters 4 and 5 were written by me and are available at https://github.com/slimgroup/SetIntersectionProjection.jl.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
  1.1 From prior knowledge to problem formulation
  1.2 Visual introduction
  1.3 Motives and objectives
  1.4 Thesis outline
  1.5 Contributions

2 Constraints versus penalties for edge-preserving full-waveform inversion
  2.1 Introduction
  2.2 Velocity blocking with total-variation norm constraints
  2.3 FWI with total-variation like penalties
  2.4 FWI with total-variation norm constraints
  2.5 Why constraints?
  2.6 Objective and gradients for two waveform inversion methods
  2.7 Results
  2.8 Discussion and summary

3 Projection methods and applications for seismic nonlinear inverse problems with multiple constraints
  3.1 Introduction
    3.1.1 Outline
    3.1.2 Notation
    3.1.3 Related work
  3.2 Limitations of unconstrained regularization methods
    3.2.1 Tikhonov and quadratic regularization
    3.2.2 Gradient filtering
    3.2.3 Change of variables / subspaces
    3.2.4 Modified Gauss-Newton
  3.3 Including prior information via constraints
    3.3.1 Constrained formulation
    3.3.2 Convex sets
  3.4 Computing projections onto intersections of convex sets
  3.5 Nonlinear optimization with projections
    3.5.1 Projected gradient descent
    3.5.2 Spectral projected gradient
    3.5.3 Spectral projected gradient with multiple constraints
  3.6 Numerical example
    3.6.1 Comparison with a quadratic penalty method
  3.7 Discussion
  3.8 Conclusions

4 Algorithms and software for projections onto intersections of convex and non-convex sets with applications to inverse problems
  4.1 Introduction
    4.1.1 Contributions
  4.2 Notation, assumptions, and definitions
  4.3 PARSDMM: Exploiting similarity between constraint sets
    4.3.1 Multilevel PARSDMM
  4.4 Software and numerical examples
    4.4.1 Parallel Dykstra versus PARSDMM
    4.4.2 Timings for 2D and 3D projections
    4.4.3 Geophysical parameter estimation with constraints
    4.4.4 Learning a parametrized intersection from a few training examples
  4.5 Discussion and future research directions
  4.6 Conclusions

5 Minkowski sets for the regularization of inverse problems
  5.1 Introduction
    5.1.1 Related work
    5.1.2 Contributions and outline
  5.2 Generalized Minkowski set
  5.3 Projection onto the generalized Minkowski set
  5.4 Formulation of inverse problems with generalized Minkowski constraints
    5.4.1 Inverse problems with computationally expensive data-misfit evaluations
    5.4.2 Linear inverse problems with computationally cheap forward operators
  5.5 Numerical examples
    5.5.1 Seismic full-waveform inversion 1
    5.5.2 Seismic full-waveform inversion 2
    5.5.3 Video processing
  5.6 Discussion
  5.7 Conclusions

6 Discussion

7 Conclusions

Bibliography

A Alternating Direction Method of Multipliers (ADMM) for the projection problem
B Transform-domain bounds / slope constraints
C Black-box alternating projection methods

List of Tables

Table 2.1: Objectives and gradients for full-waveform inversion (FWI) and wavefield reconstruction inversion (WRI). Source term: q, discrete Helmholtz system: A(m), complex-conjugate transpose (*), matrix P selects the values of the predicted wavefields u and ū at the receiver locations. The scalar λ balances the data misfit versus the wavefield residual.

Table 3.1: Notation used in this chapter.

Table 4.1: Overview of constraint sets that the software currently supports. A new constraint requires the projector onto the set (without linear operator) and a linear operator or equivalent matrix-vector product together with its adjoint. Vector entries are indexed as m[i].

List of Figures

Figure 1.1: Two-parameter representation of bound constraints, smoothness constraints via bounds on the gradient, monotonicity via positivity/negativity of the gradient, and an annulus constraint on the data fit.

Figure 1.2: (left) The yellow highlighted patch is the intersection of the other sets that describe prior knowledge. (right) The yellow highlighted patch shows the intersection of the data constraint and all other sets. Red dots are projections of random points onto the intersection.

Figure 2.1: Result of projecting the true Marmousi model onto the set of bounds and limited TV-norms. Shown as a function of a fraction of the TV-norm of the true model, TV(m*).

Figure 2.2: FWI results using the smoothed total-variation (TV) as a penalty. Shows the results for various combinations of ε and α.

Figure 2.3: Constrained optimization workflow. At every FWI iteration, the user code provides the data misfit and the gradient w.r.t. the data misfit only. The projected gradient algorithm uses this to propose an updated model (mk − ∇m f(mk)) and sends this to Dykstra's algorithm. This algorithm projects it onto the intersection of all constraints. To do this, it needs to project vectors onto each set separately once per Dykstra iteration. These individual projections are either closed-form solutions or computed by the ADMM algorithm.

Figure 2.4: Results for constrained FWI for various total-variation budgets (τ).

Figure 2.5: True and initial models for FWI and WRI, based on the BP 2004 model.

Figure 2.6: Estimated models for FWI and WRI for 25% data noise, based on the BP 2004 model. Estimated models are shown for box constraints only (a and c) and for box constraints combined with total-variation constraints (b and d).
Figure 2.7: Estimated models for FWI and WRI for 50% data noise, based on the BP 2004 model. Estimated models are shown for box constraints only and for box constraints combined with total-variation constraints.

Figure 2.8: Frequency panels of the lowest frequency data for the example based on the BP 2004 model with 25% noise. All examples use noisy data, but the figure also displays data without noise for reference.

Figure 3.1: The trajectory of Dykstra's algorithm for a toy example with two constraints: a maximum 2-norm constraint (disk) and bound constraints. The feasible set is the intersection of a halfspace and a disk. The circle and horizontal lines are the boundaries of the sets. The difference between the two figures is the ordering of the two sets. The algorithms in (a) start with the projection onto the disk; in (b) they start with the projection onto the halfspace. The projection onto convex sets (POCS) algorithm converges to different points, depending onto which set we project first. In both cases, the points found by POCS are not the projection onto the intersection. Dykstra's algorithm converges to the projection of the initial point onto the intersection in both cases, as expected.

Figure 3.2: The Marmousi model (a), the projection onto an intersection of bound constraints and total-variation constraints found with Dykstra's algorithm (b), and two feasible models found by the POCS algorithm (c) and (d). We observe that one of the POCS results (c) is very similar to the projection (b), but the other result (d) is very different. The different model (d) has a total-variation much smaller than requested. This situation is analogous to Figure 3.1.

Figure 3.3: FWI with an incorrect source function with projections (with Dykstra's algorithm) and FWI with two feasible points (with POCS) for various TV-balls (as a percentage of the TV of the true model) and bound constraints. Also shows differences (rightmost two columns) between results. The results show that using POCS inside a projected gradient algorithm instead of the projection leads to different results that also depend on the order in which we provide the sets to POCS. This example illustrates the differences between the methods; it is not the intention to obtain excellent FWI results.

Figure 3.4: Example of the iteration trajectory when (a) using gradient descent to minimize a non-convex function and (b) projected gradient descent to minimize a non-convex function subject to a constraint. The constraint requires the model estimate to lie inside the elliptical area in (b). The semi-transparent area outside the ellipse is not accessible by projected gradient descent. There are two important observations: 1) the constrained minimization converges to a different (local) minimizer; 2) the intermediate projected gradient parameter estimates can be in the interior of the set or on the boundary. Black represents low values of the function.

Figure 3.5: The 3-level nested constrained optimization workflow.

Figure 3.6: True (a) and initial (b) velocity models for the example.

Figure 3.7: Model estimate obtained by FWI with bound constraints only.

Figure 3.8: (a) Model estimate obtained by FWI from 3-4 Hz data with bound constraints, a vertical slope constraint, and a constraint on the velocity variation per meter in the horizontal direction.
(b) Model estimate by FWI from 3-15 Hz data with bound constraints and using the result from (a) as the starting model.

Figure 3.9: (a) Model estimate obtained by FWI from 3-4 Hz data with bound constraints and total-variation constraints. (b) Model estimate by FWI from 3-15 Hz data with bound constraints and using the result from (a) as the starting model.

Figure 3.10: Comparison of reverse time migration (RTM) results based on the FWI velocity models (right halves) and the true reflectivity (left halves). Figures (a) and (d) show RTM based on the velocity model from FWI with bounds only (Figure 3.7). Figures (b) and (e) show RTM results based on the velocity model from FWI with bounds, horizontal and vertical slope constraints (Figure 3.8b). Figures (c) and (f) show RTM results based on the velocity model from FWI with bounds and total-variation constraints (Figure 3.9b). RTM results based on FWI with bound constraints, (a) and (d), miss a number of reflectors that are clearly present in the other RTM results.

Figure 3.11: Results from FWI with regularization by a quadratic penalty method to promote horizontal and vertical smoothness. As for the constrained FWI example, the first FWI cycle uses 3-4 Hz data and is with regularization (left column); the second cycle uses 3-15 Hz data and does not use regularization (right column). Figure (a) uses regularization parameter α1 = α2 = 1e5, (c) uses α1 = α2 = 1e6, and (e) uses α1 = α2 = 1e7.

Figure 3.12: Results from FWI with regularization by a quadratic penalty method to promote horizontal and vertical smoothness. As for the constrained FWI example, the first FWI cycle uses 3-4 Hz data and is with regularization (left column); the second cycle uses 3-15 Hz data and does not use regularization (right column). Figure (a) uses regularization parameter α1 = 1e6, α2 = 1e5, (c) uses α1 = 1e5, α2 = 1e6, (e) uses α1 = 1e7, α2 = 1e6, and (g) uses α1 = 1e6, α2 = 1e7.

Figure 4.1: Relative transform-domain set feasibility (equation 4.24) as a function of the number of conjugate-gradient iterations and projections onto the ℓ1 ball. This figure also shows the relative change per iteration in the solution x.

Figure 4.2: Relative transform-domain set feasibility (equation 4.24) as a function of the number of conjugate-gradient iterations and projections onto the set of matrices with limited rank via the SVD. This figure also shows the relative change per iteration in the solution x.

Figure 4.3: Timings for a 2D and 3D example where we project a geological model onto the intersection of bounds, lateral smoothness, and vertical monotonicity constraints.

Figure 4.4: Timings for a 3D example where we project a geological model onto the intersection of bound constraints and an ℓ1-norm constraint on the vertical derivative of the image. Parallel computation of all yi and vi does not help in this case, because the ℓ1-norm projection is much more time consuming than the projection onto the bound constraints. The time savings for other computations in parallel are then canceled out by the additional communication time.

Figure 4.5: True, initial, and estimated models with various constraint combinations for the full-waveform inversion example.
Crosses and circles represent sources and receivers, respectively. All projections inside the spectral projected gradient algorithm are computed using single-level PARSDMM.

Figure 4.6: Estimated models with various constraint combinations for the full-waveform inversion example. Crosses and circles represent sources and receivers, respectively. All projections inside the spectral projected gradient algorithm are computed using coarse-to-fine multilevel PARSDMM with three levels and a coarsening of a factor two per level.

Figure 4.7: A sample of 8 out of 35 training images.

Figure 4.8: Reconstruction results from 80% missing pixels of an image with motion blur (25 pixels) and zero-mean random noise in the interval [−10, 10]. Results that are the projection onto an intersection of 12 learned constraint sets with PARSDMM are visually better than BPDN-wavelet results.

Figure 4.9: A sample of 8 out of 16 training images.

Figure 4.10: Reconstruction results from recovery from saturated images as the projection onto the intersection of 12 constraint sets.

Figure 5.1: The true model for the data generation for the full-waveform inversion 1 example, the initial guess for parameter estimation, and the model estimates with various constraints. Crosses and circles indicate receivers and sources, respectively.

Figure 5.2: The true and initial models corresponding to the full-waveform inversion 2 example. The figure shows parameter estimation results with various intersections of sets, as well as the result using a generalized Minkowski constraint set. Only the result obtained with the generalized Minkowski set does not show an incorrect low-velocity anomaly.

Figure 5.3: Results of the generalized Minkowski decomposition applied to the escalator video. The figure shows four frames. The most pronounced artifacts are in the time stamp. This example illustrates that the constrained approach is suitable to observe and apply constraint properties obtained from a few frames of background-only video.

Figure 6.1: The true image (left), and the observed data (right) that consists of vertical bands of the true image, increasingly sparsely sampled from left to right.

Figure 6.2: Three samples from the prior information set, which is the intersection of bounds, lateral smoothness, and parameter values that are limited to decrease slowly in the downward direction. Samples are the result of projecting random images onto the intersection.

Figure 6.3: Samples from the intersection of sets that describe prior knowledge and data observations. The bottom row shows the difference between the sample from the top row and the true model from Figure 6.1.

Figure 6.4: Pointwise maximum and minimum values, as well as the difference of the three samples from Figure 6.3.

Figure B.1: The figure shows the effect of different slope constraints when we project a velocity model (a). Figure (b) shows the effect of allowing arbitrary velocity increase with depth, but only slow velocity decrease with depth. Lateral smoothness (c) is obtained by bounding the upper and lower limit on the velocity change per distance interval in the lateral direction.
Acknowledgments

I would like to use this opportunity to thank my supervisor, Professor Felix J. Herrmann, for giving me the opportunity to study and take classes in various topics, and for always encouraging me to be critical of my own papers, presentations, and software. I am also grateful for the support for me to work on multiple topics and the freedom to explore and develop a research path. All this enabled me to acquire a broad set of skills.

I was also lucky to have many nice and helpful colleagues: students, postdocs, and other support at the Seismic Laboratory of Imaging and Modeling at the University of British Columbia. Special thanks to Henryk for always having ready some good advice for any software, hardware, or programming related questions.

I am also very grateful for much love and support at home from my wife Tuğçe, and I am looking forward to post-graduation activities now that we are both done with school. Furthermore, I am also thankful for the support from far away, in the form of my parents and sister visiting me regularly.

Chapter 1: Introduction

Every day we witness all sorts of physical phenomena around us: heat dissipation, fluid flow, and sound wave propagation, to name a few. We know how to simulate physics numerically, given the source activation function, initial states, boundary conditions, and physical model parameters such as density, acoustic velocity, and electrical conductivity. This process is called forward modeling in the context of an inverse problem. The inverse problem for physical parameter estimation uses the data acquired in the real world, called the observed data, together with computational methods, to infer the model parameters that resulted in the observed data. A prominent example in this thesis is the estimation of acoustic velocity in the subsurface of the Earth from seismic pressure signals measured near the surface. This problem is known as full-waveform inversion (FWI) [Tarantola, 1986, Pratt et al., 1998, Virieux and Operto, 2009] in the geophysical literature. More challenging parameter estimation inverse problems also estimate the source terms [Pratt, 1999, Aravkin and van Leeuwen, 2012].

Other inverse problems that feature prominently in this thesis are image and video processing tasks such as deblurring, inpainting missing pixels, noise removal, and segmentation/classification. These problems appear different from physical parameter estimation, but there are many similarities in terms of mathematical structure, algorithms, and software.

If we define the forward modeling operator on a grid with N grid points, acting on the vectorized grid, as G(·) : R^N → R^M, the observed data as d ∈ R^M, and the model parameters as m ∈ R^N, the most basic inverse problem is straightforward data fitting, where the misfit between simulated and observed data is minimized. Mathematically, this corresponds to the goal of finding model parameters that result in the observed data when used for forward modeling, i.e.,

\min_m f(G(m) - d). \qquad (1.1)

Unlike forward modeling, most inverse problems do not have a solution, or have a solution that is not unique and does not depend continuously on the data. These are, stated informally, the Hadamard conditions that define an ill-posed problem. The data-misfit function f(·) : R^M → R quantifies the difference between the predicted data, G(m), and the observed data d. A canonical example is the (non-)linear least-squares misfit 1/2‖·‖₂². The choice of f depends on the statistical distribution of the data-fit errors that we expect, but computational arguments such as differentiability or separability are also important. Separability means that the data-misfit objective can be written as a sum.
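For instance, with the least-squares misfit, data gathered in n_s independent source experiments gives a separable objective (the per-source split into G_i and d_i is our illustrative notation, not the thesis's):

f(G(m) - d) = \sum_{i=1}^{n_s} \frac{1}{2} \| G_i(m) - d_i \|_2^2,

so each term can be evaluated independently, which also makes it possible to work with a subset of the source experiments per iteration, as mentioned below.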
When we discuss solving an inverse problem, we define this as the result of the procedure that we use to obtain model parameters that minimize (1.1). Results can be a local or global minimum of (1.1), or any other point that prevents the optimization algorithm from further decreasing f significantly. In another scenario, f decreases, but there is no significant change in m. For problems where computing G(m) requires time-consuming numerical simulations, a limited number of evaluations of G(m) is typically the stopping criterion, and the 'solution' of the inverse problem is the m we obtained when there is no more computational time left. The model parameters that provide us with the solution, as defined in this paragraph, are also named the model estimate.

So far, we discussed inverse problems in the context of data fitting. However, even if an inverse problem is easy to solve numerically, it is often challenging to obtain a model estimate that is close to the true parameters. Data fitting alone is usually not enough because it leaves the solution sensitive to

• forward operators that do not contain information that helps the reconstruction. For problems like image inpainting, audio declipping, and image desaturation, the observed and corrupted data satisfies G(d) = d for operators that map from image to image.

• data problems such as noise, a lack of well-sampled data, aliasing, and data gaps.

• widely varying model estimates based on small changes in the initial guess. Inverse problems such as inversion of seismic data to estimate rock properties have many possible solutions that may look geologically realistic, but most of them are far from the truth.

• subsampling artifacts in the model parameters that are the result of using a (randomly changing) subset of the data in iterative reconstruction algorithms to reduce the computational demand [Krebs et al., 2009, Dai et al., 2011, Herrmann and Li, van Leeuwen et al., 2011, Li et al., 2012a, Peters et al., 2016, Xue et al., 2016].

When the data and forward modeling operator do not contain sufficient information, or are corrupted, we need more input to obtain good model parameters. Additional information may come in the form of prior knowledge: things we know about the model parameters before we even look at the data or start solving the inverse problem. Prior information comes from many different sources. For geophysical imaging these include expert (geologist) knowledge, physical measurements in wells [Asnaashari et al., 2013], models obtained using other types of geophysical data [Lines et al., 1988, Gallardo and Meju, 2007, Hu et al., 2009, Haber and Holtzman Gazit, 2013], and models derived using the same type of data at an earlier time (time-lapse) [Asnaashari et al., Karaoulis et al., 2011, Oghenekohwo et al., 2015].
Merging data and prior knowledge ‘fixes’ the problemslisted so far, and improves the model estimates, provided a) sufficient priorinformation is available and b) formulating the inverse problem such thatall prior knowledge is actually included in the solution. Developing methodsand algorithms to include as much prior information as possible is the maintopic of this thesis.Adding prior information to the inverse problem formulation in the formof penalty functions or constraints regularizes the problem. The regulariza-tion described so far applies to the model parameters. Some of the issueslisted above, data noise and missing data, can also be tackled using dataprocessing. For instance, noise filtering, data completion, and bandwidthextension [Li and Demanet, 2016] techniques all act on the data. However,issues related to the non-uniqueness of inverse problem solutions also occurin case of ‘perfect’ data. In this thesis, I focus on model-based regulariza-tion exclusively. Before motivating this choice, it is important to note thatwe can apply both model and data-based regularizations to solve an inverseproblem, which may be necessary in order to obtain the best results possible.My choice is motivated by• the intuition that inverse problem practitioners have about the modelparameters. A geologist knows what the earth looks like in the sub-surface, but not what characteristics gravitational, electromagnetic, orseismic data supposed to possess.• the invariance of a model to the data. A physical model or image isindependent of the type of data, sensors, and source/receiver acquisi-tion arrays. These factors typically change for every experiment, whichmakes it challenging to develop general regularization techniques thatwork for multiple types of data and varying experimental settings.• the invariance of various physical properties of the same model. Con-sider different geophysical models of the same target. For example pa-rameters computed from gravitational data in terms of density, and a4model based on seismic data in terms of acoustic velocity. While thesemodels describe different physical parameters with varying scales, theirstructure is often similar. This means that these models share someproperties, such as matrix rank or cardinality (number of non-zerosin a vector) of the number of discontinuities, because these propertiesare scale-invariant.• the possibly much higher dimensionality of the data compared to themodel. An image or video is always 2D or 3D. The observed data to ob-tain such a model may be higher dimensional and contain many moredata points than grid points. In exploration seismology, we can workwith source x-y-z locations, receiver x-y-z locations and a time/fre-quency coordinate. The seismic data is therefore a 7D tensor (or 5Dwith fixed z-coordinates) [Trad, 2008, Kreimer et al., 2013, Silva andHerrmann, 2015, Kumar et al., 2015], which makes it more difficult towork with in a computational sense than with a 3D model.Sometimes, model and data based regularization go hand in hand. Wecan apply the techniques I develop in this thesis to data as well. This isnot the primary application, but if we have data organized in a matrix or3D tensor, we may use all developed algorithms directly. If data is higherdimensional, we can flatten the tensor [Kreimer et al., 2013, Kumar et al.,2015], i.e., reshaping to lower dimensional tensors (3D array or matrix).1.1 From prior knowledge to problemformulationSo far, we discussed what prior knowledge is and why it is important forimaging. 
One of the most challenging parts of solving an inverse problem is translating prior information into a mathematical formulation. There are several ways to do this. Which methods are preferable depends on the prior knowledge, the application, and the available algorithms. In each of the chapters, I motivate in detail why I prefer a specific formulation over the others. I will limit the following informal discussion to the basic concepts and philosophy behind the different regularization techniques.

Perhaps the most classical concept is to penalize properties that we do not want to see in the model estimate. This is known as Tikhonov/quadratic regularization. In a more general form, we add p regularization terms R_i(m) : R^N → R that each assign large values to models that have unwanted properties. The corresponding minimization problem is

\min_m f(m) + \sum_{i=1}^{p} \alpha_i R_i(m). \qquad (1.2)

The regularization functions R_i(m) may be non-convex and non-differentiable. The scalars α_i balance the influence of each regularizer with respect to each other and the data misfit. Penalty methods are the most widely used regularization technique; see, e.g., [Farquharson and Oldenburg, 1998, Becker et al., 2015, Lin and Huang, 2015, Xue and Zhu, 2015, Qiu et al., 2016] for examples in geophysics.

Another formulation casts the penalty term into the objective that is minimized given a constraint on the data misfit—i.e.,

\min_m \sum_{i=1}^{p} \alpha_i R_i(m) \quad \text{s.t.} \quad f(m) \leq \sigma. \qquad (1.3)

This formulation has the advantage that if we know something about the data noise level, we can determine a good choice of σ > 0. If we use only a single R_i(m), there are no other scalar tuning parameters, which makes it a more practical formulation. However, when multiple pieces of prior knowledge are available, this advantage no longer holds. The multiple regularization terms require multiple α_i for balancing the influence of each objective, so there is still one trade-off parameter per model property. There are examples of this approach in the geophysical literature [Constable et al., 1987, Greenhalgh et al., 2006], but it is rare for authors to work with more than one regularization function because choosing the trade-off parameters is challenging [Ellis and Oldenburg, 1994].
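A minimal Julia sketch of this tuning burden (f, R1, and R2 are hypothetical placeholders for a data misfit and two regularizers, not functions from the thesis software):

```julia
# Penalty objective of formulation (1.2) with two regularizers: every pair of
# trade-off parameters (a1, a2) defines a different optimization problem.
penalty_objective(m, f, R1, R2, a1, a2) = f(m) + a1 * R1(m) + a2 * R2(m)

# Even a coarse 3-point scan per trade-off parameter already requires
# 3^2 = 9 full inversions to compare the candidate pairs.
alphas = [10.0^p for p in 5:7]
n_inversions = length(alphas)^2
```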
To avoid choosing these trade-off parameters, this thesis revolves around the following constrained formulation:

\min_m f(m) \quad \text{s.t.} \quad m \in \bigcap_{i=1}^{p} V_i, \qquad (1.4)

where the data misfit appears as the objective and the prior information as constraints. The above formulation requires the model parameters m to be an element of the p sets V_i; that is, the model estimate m is an element of the intersection ⋂_{i=1}^p V_i. In each chapter, I explain which properties make problem (1.4) the cornerstone of this thesis. The absence of penalty parameters is an advantage of the constrained formulation for situations where we have multiple pieces of prior knowledge. The definition of each constraint is independent of all other constraint sets and requires no balancing. Moreover, any solution of problem (1.4) will satisfy all constraints.

For geophysical problems we often want to, or need to, work with many constraints or penalties. Consider seismic imaging in sedimentary geological settings. In this situation, we quickly reach four pieces of prior information. Usually there is knowledge of upper and lower limits on parameter values (bound constraints) and some information about variation with depth (often in the form of promoting blockiness across the sedimentary layers), as well as two different smoothness-related regularization terms for the two lateral directions (along the sedimentary layers). Yet, many inversion results are 'obviously' not good in the eyes of the geologist/geophysicist. This implies there is more prior knowledge available that is not yet used. Several geophysical works successfully use formulation (1.4) [Zeev et al., 2006, Bello and Raydan, 2007, Lelivre and Oldenburg, 2009, Baumstein, 2013, Smithyman et al., 2015, Esser et al., 2015a, 2016b, Esser et al., 2016]. These references are limited to one or two constraint sets, and some of them present algorithms for specific constraints. In this thesis, I extend workflows and algorithms to more than two constraint sets, present algorithms to compute projections onto intersections that are not tied to specific sets, introduce practical software implementations with a reduced number of tuning parameters, and work with constraints not previously used in the geophysical literature. In chapter 5, I also introduce a new problem formulation that is more general than an intersection of sets.

1.2 Visual introduction

While all inverse problems that appear in this thesis are defined on small 2D or large 3D grids, looking at some sets and intersections in R² provides some visual intuition about the techniques that underpin this work. To make visualization simple by avoiding a mixture of function level-sets and constraint sets, let me first make a small modification to problem (1.4) by changing the minimization of a data misfit into a constraint on the data fit. The new problem formulation reads

\text{find} \quad m \in \bigcap_{i=1}^{p} V_i \cap V_{\text{data}}. \qquad (1.5)

This means we find a vector m ∈ R^N that is in the intersection of p constraints on the model properties, ⋂_{i=1}^p V_i, and also in a data-constraint set V_data. An example of a constraint on the data fit is V_data = {m | σ1 ≤ ‖G(m) − d‖ ≤ σ2} with σ1 ≤ σ2. Although many authors state optimization problems of the form min_m ‖G(m) − d‖, they sometimes intend to use the constraint {m | σ1 ≤ ‖G(m) − d‖ ≤ σ2}. This happens when researchers stop their iterative algorithm when the data misfit drops below the noise level: ‖G(m) − d‖ < σ1. The upper bound is also effectively present because there is often a rough idea about how closely we should be able to match the observed data.

We consider sets that are inspired by a geophysical inverse problem in a sedimentary geological setting. Prior knowledge, in this case, is often available about the upper and lower bounds on parameter values and some smoothness in the lateral direction, and the acoustic velocity or density generally increases monotonically with depth in the Earth. For each element in the model vector, prior information is given by the intersection of

1. {m | l ≤ m ≤ u}: bounds on parameter values;
2. {m | −ε ≤ (Iz ⊗ Dx)m ≤ +ε}: parameter values change slowly in the lateral direction;
3. {m | 0 ≤ (Dz ⊗ Ix)m ≤ +∞}: parameter values increase with depth;

where ⊗ is the Kronecker product, Dx and Dz are finite-difference matrices, and Ix and Iz are identity matrices with sizes that correspond to the grid extent in the x- or z-direction. In Figure 1.1, we show two-parameter representations of these sets, as well as an annulus constraint on the data fit: V_data = {m | σ1 ≤ ‖G(m) − d‖ ≤ σ2}.

Figure 1.1: Two-parameter representation of bound constraints, smoothness constraints via bounds on the gradient, monotonicity via positivity/negativity of the gradient, and an annulus constraint on the data fit.
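The following minimal Julia sketch (our own illustration, not the SetIntersectionProjection package from the thesis) builds these three sets as membership tests on a small grid. Note that with Julia's column-major vec() and depth as the first array index, the Kronecker factors appear in the opposite order from the (Iz ⊗ Dx) and (Dz ⊗ Ix) notation above.

```julia
using LinearAlgebra, SparseArrays

nz, nx, h = 4, 5, 10.0   # tiny grid: nz depth points, nx lateral points, spacing h
# 1D forward-difference matrix of size (n-1) x n, scaled by the grid spacing
diff1d(n) = spdiagm(0 => -ones(n), 1 => ones(n - 1))[1:n-1, :] ./ h

Dlat  = kron(diff1d(nx), sparse(I, nz, nz))   # lateral parameter change per meter
Dvert = kron(sparse(I, nx, nx), diff1d(nz))   # vertical parameter change per meter

# toy velocity model: laterally constant, increasing with depth
m = vec(1500.0 .+ 50.0 .* (0:nz-1) * ones(1, nx))

l, u, eps_lat = 1400.0, 2000.0, 1.0
in_bounds   = all(l .<= m) && all(m .<= u)     # set 1: bound constraints
in_lateral  = all(abs.(Dlat * m) .<= eps_lat)  # set 2: slow lateral variation
in_monotone = all(Dvert * m .>= 0)             # set 3: increasing with depth
@show in_bounds && in_lateral && in_monotone   # true: m is in the intersection
```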
Figure 1.2 displays the intersection of all sets that describe prior knowledge. That figure also shows the intersection of the data-fit constraint set with the sets of prior knowledge. The projections of a few random points that lie outside the intersection of all sets are examples of feasible points that satisfy all constraints. Any feasible point satisfies all pieces of prior information and also has the desired level of data fit; these points are examples of solutions of inverse problem (1.5). The 'full' solution of problem (1.5) is set-valued, i.e., any point in the set. This type of projection appears extensively in the following chapters.

Figure 1.2: (left) The yellow highlighted patch is the intersection of the other sets that describe prior knowledge. (right) The yellow highlighted patch shows the intersection of the data constraint and all other sets. Red dots are projections of random points onto the intersection.

1.3 Motives and objectives

Motivated by physical parameter estimation problems using seismic data (seismic full-waveform inversion), I highlight fundamental challenges.

• The estimation of the model parameters of a wave equation from recorded wavefield data at a small part of the boundary of the computational domain is a notoriously non-convex problem. PDE-constrained optimization attempts to match observed oscillatory data to simulated data that is also oscillatory. Small changes in the initial guess typically lead to large changes in the final model estimate obtained using an iterative optimization algorithm. While many of the recovered models are 'obviously' incorrect, we also find many models that are realistic, but far from the true model parameters. These problems typically occur when there are no low frequencies recorded (about ≤ 3 Hertz in ocean-based data acquisition) and the initial guess is far from the true model.

• Generating and recording low-frequency data is challenging for physical reasons, so assuming those low-frequency data are or will be available is not an option.

• Creating an accurate initial model from seismic data is extremely time-consuming, difficult, and requires much manual work by, e.g., first-arrival analysis. Methods and algorithms with relatively low sensitivity to the initial model are preferable to invert seismic data.

By merging data fitting with prior knowledge on the model parameters, we can partially mitigate the above list of challenges. Well-chosen and accurate prior information has a similar effect as augmenting the missing data and can also help 'guide' iterative inversion algorithms from an inaccurate initial model to a good estimate.

The two primary objectives of this thesis are tightly linked. I want to include more prior knowledge than most research on inverse problems, which uses only one or two pieces of prior information, usually in the form of penalties. At the same time, I also want to make several aspects of solving inverse problems easier. More specifically, the objectives of this thesis are

• developing problem formulations, workflows, and algorithms that can include multiple pieces of prior information about model parameters and solve the resulting problems on large 3D grids;

• reducing the number of parameters that need hand-tuning or algorithmic tuning at a high computational cost.
These include step-length limits that need function/linear-operator properties which are not readily available, stopping criteria, augmented-Lagrangian penalty parameters, and over/under-relaxation parameters;

• applying the developed algorithms for the constrained problem formulation to non-convex seismic full-waveform inversion in various geological settings where standard formulations with quadratic penalty methods do not succeed.

1.4 Thesis outline

There are four main chapters in this thesis that follow a natural progression from relatively simple to more advanced and faster algorithms.

The intended audience for chapter 2 is a broad range of exploration geoscientists, and it uses a minimal amount of mathematics to explain the concepts. I discuss some advantages of constrained formulations of inverse problems compared to penalty forms for seismic full-waveform inversion (FWI). I show that FWI, a nonlinear and non-convex problem, behaves unpredictably as a function of the scaling of multiple penalty parameters. As a solution that makes it possible to work with multiple regularizers, I present a workflow that combines three simple algorithms. This workflow is a first step that includes an arbitrary number of constraints, including ones for which we do not know the projection in closed form. The constraints in this chapter apply to geological settings that contain salt structures, i.e., large contrasts in parameter values. To verify that the regularization strategy was not just one 'lucky' success for a specific problem, I also apply the same constraints to a different non-convex formulation of FWI.

In chapter 3, I present an extended version of the basic framework presented in chapter 2, aimed at a general geophysical audience. Contrary to chapter 2, there are more mathematical details and faster algorithms that are only slightly more involved than the ones in chapter 2. This time, I consider a sedimentary geological setting, which means the models are mostly layered, but include a challenging high-low-high acoustic velocity variation with depth, which refracts the waves such that there is little energy recorded that corresponds to waves that probed the deeper parts. To deal with this challenge, I introduce slope constraints to geophysical problems. These constraints occur in applications like computational design and road planning in mountainous terrain. The examples show that slope constraints can enforce smoothness or monotonicity of the parameter values. The constraints lead to better model estimates compared to penalty methods, while they allow for straightforward inclusion of physical units.

While the algorithms in the framework presented in chapter 3 are faster than the ones in chapter 2, there are still opportunities to reduce computation times, which is important for 3D problems. Chapters 2 and 3 outline nested algorithms, i.e., one algorithm solves sub-problems of another algorithm; specifically, the alternating direction method of multipliers (ADMM) solves sub-problems of (parallel) Dykstra's algorithm. While it may be possible to obtain limited speedups by enhancing both methods, there are two reasons why I dedicate chapter 4 to developing a single and new algorithm to compute projections onto the intersection of multiple sets. The first argument is the nuisance of having to deal with stopping criteria for both ADMM and Dykstra's algorithm. Besides requiring additional parameters, nesting is usually inefficient.
Not solving the sub-problems with sufficient accuracy will cause the framework to fail to converge, while solving the sub-problems more accurately than required amounts to wasted computational time. The second reason to develop a new algorithm is the specific target problem of multiple sets. More sets mean that there is likely some similarity between the constraints. This is an opportunity that I exploit using a few simple, yet effective, problem reformulation steps. The algorithmic development focuses on speed and practicality. To reduce the computation times, I include multilevel continuation from coarse to fine grids, hybrid coarse- and fine-grained parallelism, multi-threaded matrix-vector products for banded matrices, and recently introduced automatic selection of acceleration parameters. The practical relevance of this chapter is ensured by making all algorithms available as open-source software written in Julia, by stopping conditions that are more intuitive and tailored to projections, by formulating the problems such that no manual tuning parameters are required to ensure convergence, and by using various heuristics to enhance performance in the case of non-convex sets. I demonstrate the capabilities on seismic full-waveform inversion and on two image processing tasks where I use a simple learning method to obtain 12 pieces of prior knowledge from a few training examples.

Chapters 2, 3, and 4 all use the same problem formulation: estimated model parameters need to be an element of the intersection of multiple constraint sets. This approach captures a wide range of models and images, but there are still situations where it is difficult to describe prior knowledge using an intersection of multiple sets. A simple example is an image that is partially smooth and partially blocky, or a smooth image with a small-scale blocky pattern superimposed. In the field of image processing, such models are more conveniently described by an additive structure. Methods that add different types of image components include cartoon-texture decomposition, morphological component analysis, multi-scale analysis, and robust/sparse principal component analysis. All of these concepts use, almost exclusively, penalty methods to regularize each component. In chapter 5, I present a problem formulation, as well as algorithms, to use additive model descriptions in a constrained framework. The constrained additive formulation leads to a Minkowski set. I show that these sets are not suitable for physical parameter estimation, and therefore I introduce a generalization of the Minkowski set that allows each component to be an intersection of sets, while the full model can still be an element of another intersection of sets. This concept merges and extends the problem formulations of chapters 2, 3, and 4. Using examples of seismic waveform inversion and video background-foreground segmentation, I show why a constrained version of sums of model components enables the inclusion of more pieces of prior information.

1.5 Contributions

My primary contributions to the topics introduced so far are summarized as follows:

• I provide a comprehensive investigation of how, why, and when constrained formulations for non-convex seismic parameter estimation problems are easier to use and lead to better results than penalty formulations. The presented projection-based workflow to include multiple constraints guarantees that all constraints are satisfied at each iteration, which prevents the model estimates from becoming physically unrealistic.
I designed the combination of constrained problem formulation and optimization framework to avoid manual tuning parameters as much as possible, and I include heuristics for defining some of the constraint sets.

• To be able to compute projections of large 3D models onto intersections of multiple convex and non-convex sets, I developed specialized algorithms and software. Different from existing algorithms, I exploit computational similarity between the sets, specialize stopping conditions and sub-problem computations, and include multilevel acceleration, while keeping the number of tuning parameters to a minimum. All presented material is available as a software package written in Julia, and this is the first package that combines all the ingredients listed above.

• I formulated a generalization of the Minkowski set. Minkowski sets combine the strengths of constraint sets and additive model descriptions (e.g., cartoon-texture decomposition, morphological component analysis, multi-scale analysis, variants of robust principal component analysis). The proposed generalization can describe more pieces of detailed prior knowledge, because each of the Minkowski set components is an intersection of sets, while the sum is also required to be an element of another intersection of sets. I also develop computational methods for computing projections onto generalized Minkowski sets and show applications to regularizing inverse problems.

Chapter 2: Constraints versus penalties for edge-preserving full-waveform inversion

(A version of this chapter has been published in The Leading Edge (Society of Exploration Geophysicists), 2017.)

2.1 Introduction

While full-waveform inversion (FWI) is becoming increasingly part of the seismic toolchain, prior information on the subsurface model is rarely included. In that sense, FWI differs significantly from other inversion modalities such as electromagnetic and gravity inversion, which without prior information generate untenable results. Especially in situations where the inverse problem is severely ill-posed, including certain regularization terms—which, for instance, limit the values of the inverted medium parameter to predefined ranges or impose a certain degree of smoothness or blockiness—is known to improve inversion results significantly.

With relatively few exceptions, people have shied away from including regularization in FWI, especially when this concerns edge-preserving regularization. Because of its size and sensitivity to the medium parameters, FWI differs in many respects from the above-mentioned inversions, which partly explains the somewhat limited success of incorporating prior information via quadratic penalty terms (Tikhonov regularization) or gradient filtering. This lack of success is further exemplified by the challenges of tuning these regularizations and by the fact that they do not lend themselves naturally to handling more than one type of prior information. Also, adding prior information in the form of penalties may add undesired contributions to the gradients (and Hessians).

To prevail over these challenges, we replace regularized (via additive penalty terms) inversions by inversions with 'hard' constraints.
In words, instead of using regularization with penalty terms to

    find amongst all possible velocity models, models that jointly fit observed data and minimize model-dependent penalty terms,

we employ constrained inversions, which aim to

    find amongst all possible velocity models, models that fit observed data subject to models that meet one or more constraints on the model.

While superficially these two "inversion mission statements" look rather similar, they are syntactically very different and lead to fundamentally different (mathematical) formulations, which in turn can yield significantly different inversion results and tuning-parameter sensitivities. Without going into mathematical technicalities, we define penalty approaches as methods that add terms to a data-misfit function. Contrary to penalty formulations, constraints do not rely on local derivative information of the modified objective. Instead, constraints 'carve out' an accessible area from the data-misfit function and rely on gradient information of the data-misfit function only, in combination with projections of updated models to make sure these satisfy the constraints. As a result, constrained inversions do not require differentiability of the constraints; are practically parameter free; allow for mixing and matching of multiple constraints; and, most importantly, by virtue of the projections, the intermediate inversion results are guaranteed to remain within the constraint set, an extremely important feature that is more difficult, if not impossible, to achieve with regularization via penalty terms.

To illustrate the difference between penalties and constraints, we consider FWI where the values and spatial variations of the inverted velocities are jointly controlled via bounds and the total-variation (TV) norm. The TV-norm is widely used in edge-preserving image processing [Rudin et al., 1992] and corresponds to the sum of the lengths of the gradient vectors at each spatial coordinate position.

After briefly demonstrating the effect of combining bound and TV-norm constraints on the Marmousi model, we explain in some detail the challenges of incorporating this type of prior information into FWI. We demonstrate that it is nearly impossible to properly tune the total-variation norm when included as a modified penalty term, an observation that may very well be responsible for the unpopularity of TV-norm minimization in FWI. By imposing the TV-norm as a constraint instead, we demonstrate that these difficulties can mostly be overcome, which allows FWI to significantly improve the delineation of high-velocity, high-contrast salt bodies.

2.2 Velocity blocking with total-variation norm constraints

Edge-preserving prior information derives from the premise that the Earth contains sharp edge-like unconformable strata, faults, and salt or basalt inclusions. Several researchers have worked on ways to promote these edge-like features by including prior information in the form of TV-norms. If performing according to their specification, minimizing the TV-norm of the velocity model m on a regular grid with gridpoint spacing h,

TV(m) = \frac{1}{h} \sum_{ij} \sqrt{(m_{i+1,j} - m_{i,j})^2 + (m_{i,j+1} - m_{i,j})^2}, \qquad (2.1)

acts as applying a multidimensional "velocity blocker". To make sure that the resulting models remain physically feasible, TV-norm minimization is combined with so-called Box constraints, which make sure that each gridpoint of the resulting velocity model remains within a user-specified interval—i.e., m ∈ Box means that l ≤ m_{i,j} ≤ u, with l and u the lower and upper bound respectively.
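A direct transcription of Equation 2.1 in Julia (a minimal sketch; the one-sided differences that stop at the last row and column are our assumption, since the text does not specify the boundary handling):

```julia
# Isotropic total-variation norm of a 2D model m on a regular grid with
# spacing h, following Equation 2.1.
function tv_norm(m::AbstractMatrix, h::Real)
    nz, nx = size(m)
    tv = 0.0
    for j in 1:nx-1, i in 1:nz-1
        dz = m[i+1, j] - m[i, j]   # difference along the first grid direction
        dx = m[i, j+1] - m[i, j]   # difference along the second grid direction
        tv += sqrt(dz^2 + dx^2)    # length of the local gradient vector
    end
    return tv / h
end

tv_norm([1.0 2.0; 1.0 2.0], 1.0)   # one unit jump across one cell: returns 1.0
```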
It can be shown that, for a given τ,

\min_m \|m - m^*\|_2 \quad \text{subject to} \quad m \in \text{Box and } TV(m) \leq \tau,   (2.2)

finds a unique blocked velocity model that is close to the original velocity model (m∗) and whose blockiness depends on the size of the TV-norm ball τ. As the size of this ball increases, the resulting blocked velocity model is less constrained, less blocky, and closer to the original model—juxtapose the TV-norm constrained velocity models in Figure 2.1 for τ = (0.15, 0.25, 0.5, 0.75, 1) × τ_true with τ_true = TV(m∗). The solution of Equation 2.2 is the projection of the original model onto the intersection of the box- and TV-norm constraint sets. In other words, the solution is the closest model to the input model that is within the bounds and has sufficiently small total variation.

2.3 FWI with total-variation-like penalties

Edge-preserving regularizations have been attempted by several researchers in crustal-scale FWI. Typically, these attempts derive from minimizing the least-squares misfit between observed (d_obs) and simulated data (d_sim(m)), computed from the current model iterate. Without regularization, the least-squares objective for this problem reads

f(m) = \|d_\text{obs} - d_\text{sim}(m)\|_2.   (2.3)

Now, if we follow the textbooks on geophysical inversion, the most straightforward way to regularize the above nonlinear least-squares problem would be to add the following penalty term: α‖Lm‖₂², where L represents the identity or a sharpening operator. The parameter α controls the trade-off between the data fit and the prior information residing in the additional penalty term.

Unfortunately, this type of regularization does not fit our purpose because it smoothes the model and does not preserve edges. TV-norms (as defined in Equation 2.1), on the other hand, do preserve edges but are non-differentiable and lack curvature. Both wreak havoc because FWI relies on first- (gradient descent) and second-order (either implicitly or explicitly) derivative information.

As other researchers, including Vogel [2002b], Epanomeritakis et al. [2008], Anagaw and Sacchi [2011] and Xiang and Zhang [2016], have done before us, we can seemingly circumvent the issue of non-smoothness altogether by adding a small parameter ε² to the definition of the TV-norm in Equation 2.1. The expression for this TV-like norm now becomes

TV(m) = \frac{1}{h} \sum_{ij} \sqrt{(m_{i+1,j} - m_{i,j})^2 + (m_{i,j+1} - m_{i,j})^2 + \epsilon^2}   (2.4)

and corresponds to "sand papering" the original functional form of the TV-norm at the origin so it becomes differentiable. By virtue of this mathematical property, this modified term can be added to the objective defined in Equation 2.3—i.e., we have

\min_m f(m) + \alpha TV(m).   (2.5)

For relatively simple linear inverse problems this approach has been applied with success (see e.g. Vogel [2002b]). However, as we demonstrate in the example below, this behavior unfortunately does not carry over to FWI, where the inclusion of this extra tuning parameter ε becomes problematic.
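For intuition about the role of ε, the hedged sketch below implements the smoothed TV-like norm of Equation 2.4 next to Equation 2.1; names are ours for illustration only.

    # Sketch of the "sand-papered" TV-norm of Equation 2.4; eps2 is the ε²
    # that makes the square root differentiable at the origin.
    function total_variation_eps(m::AbstractMatrix, h::Real, eps2::Real)
        nz, nx = size(m)
        tv = 0.0
        for j in 1:nx-1, i in 1:nz-1
            tv += sqrt((m[i+1, j] - m[i, j])^2 + (m[i, j+1] - m[i, j])^2 + eps2)
        end
        return tv / h
    end
    # Side effect of the smoothing: a perfectly flat model no longer has zero
    # "TV"; every gridpoint now contributes sqrt(eps2)/h to the sum.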
To illustrate this problem, we revisit the subset of the Marmousi model in Figure 2.1 and invert noisy data generated from this model with a 10 Hz Ricker wavelet and with zero-mean Gaussian noise added such that ‖noise‖₂/‖signal‖₂ = 0.25. Sources and receivers are equally spaced at 50 m and we start from an accurate smoothed model. Results of multiple warm-started inversions from 3 to 10 Hz are shown in Figure 2.2. Warm-started means we invert the data in 1 Hz batches and use the final result of a frequency batch as the initial model for the next batch. In an attempt to mimic the relaxation of the constraint as in Figure 2.1, we decrease the trade-off parameter α ∈ (10^7, 10^6, 10^5) (rows of Figure 2.2) and increase ε ∈ (10^{-4}, 10^{-3}, 10^{-2}) (columns of Figure 2.2). The latter experiments are designed to illustrate the effects of approximating the ideal TV-norm (TV(m) for ε → 0).

Even though the inversion results reflect to some degree the expected behavior, namely more blocky for larger α and smaller ε, the reader would agree that there is no distinctive progression from "blocked" to less blocky as was clearly observed in Figure 2.1. For instance, the regularized inversion results are no longer edge preserving when the "sandpaper" parameter ε becomes too large. Unfortunately, this type of unpredictable behavior of regularization is common and is exacerbated by more complex nonlinear inversion problems. It is difficult, if not impossible, to predict the inversion's behavior as a function of the multiple tuning parameters. While underreported, this lack of predictability of penalty-based regularization has frustrated practitioners of this type of total-variation-like regularization and explains its limited use so far.

[Figure 2.2: FWI results using the smoothed total-variation (TV) as a penalty, for the nine combinations of α ∈ (10^7, 10^6, 10^5) and ε ∈ (10^{-4}, 10^{-3}, 10^{-2}).]

2.4 FWI with total-variation norm constraints

Following developments in modern-day optimization [Esser et al., 2016a, Esser et al., 2016], we replace the smoothed penalty term in Equation 2.5 by the intersection of box and TV-norm constraints (cf. Equation 2.1), yielding

\min_m f(m) \quad \text{subject to} \quad m \in \text{Box and } TV(m) \leq \tau,   (2.6)

which corresponds to a generalized version of Equation 2.2. Contrary to regularization with smooth penalty terms, minimization problems of the above type do not require smoothness of the constraints. Depending on the objective (the data-misfit function in our case), these formulations permit different solution strategies. Since the objective of FWI is highly nonlinear and computationally expensive to evaluate, we call for an algorithm design that meets the following design criteria:

• each model update depends only on the current model and gradient and does not require additional expensive gradient and objective calculations;

• the updated models satisfy all constraints after each iteration;

• an arbitrary number of constraints can be handled as long as their intersection is non-empty;

• manual tuning of parameters is limited to a bare minimum.

While there are several candidate algorithms that meet these criteria, we consider a projected-gradient method where, at the kth iteration, the model is first updated by the gradient, to bring the data residual down, followed by a projection onto the constraint set C. The projection onto the set C is denoted by P_C. The main iteration of the projected gradient algorithm is therefore given by

m_{k+1} = P_C(m_k - \nabla_m f(m_k)).   (2.7)

After this projection, each model is guaranteed to lie within the intersection of the box and TV-norm constraints—i.e., C = {m : m ∈ Box and TV(m) ≤ τ}.
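The sketch below shows the shape of iteration 2.7 in Julia. Here, grad_f and project stand in for the FWI data-misfit gradient and the projection onto C (computed by Dykstra's algorithm, discussed next), and a fixed step length γ stands in for the line search used in practice; all names are ours.

    using LinearAlgebra

    # Generic projected-gradient loop of Equation 2.7: gradient step, then
    # projection, so every iterate lies in the constraint set.
    function projected_gradient(m0, grad_f, project; γ=1.0, maxit=20)
        m = copy(m0)
        for k in 1:maxit
            m = project(m - γ * grad_f(m))
        end
        return m
    end

    # toy usage: minimize ½‖m − b‖² over the unit ℓ2-ball; solution is b/‖b‖
    b = [3.0, 4.0]
    m̂ = projected_gradient(zeros(2), m -> m - b,
                            m -> norm(m) <= 1 ? m : m / norm(m); γ=0.5)
    # m̂ ≈ [0.6, 0.8]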
The projection defined in Equation 2.2 yields a model (m_{k+1}) that is unique, while it also stays as close as possible to the model after it has been updated by the gradient.

While conceptually easily stated, uniquely projecting models onto multiple constraints can be challenging, especially if the individual projections do not permit closed-form solutions, as is the case with the TV-norm. For our specific problem, we use Dykstra's algorithm [Boyle and Dykstra, 1986] by alternating between projections onto the box constraint and onto the TV-norm constraint. Projecting onto an intersection of constraint sets is equivalent to running Dykstra's algorithm: P_C(m_k − ∇_m f(m_k)) ⇔ DYKSTRA(m_k − ∇_m f(m_k)).

The projection onto the box constraint is available in closed form: we take the elementwise median of the lower bound, the model value, and the upper bound, which clips each gridpoint to the interval [l, u]. The projection onto the set of models with sufficiently small TV is computed via the Alternating Direction Method of Multipliers (ADMM, Boyd et al. [2011]). Dykstra's algorithm and ADMM are both free of tuning parameters in practice. The three steps above can be put into one nested-optimization workflow, displayed in Figure 2.3.

[Figure 2.3: Constrained optimization workflow. At every FWI iteration, the user code provides the data-misfit and the gradient w.r.t. the data-misfit only. The projected gradient algorithm uses this to propose an updated model (m_k − ∇_m f(m_k)) and sends this to Dykstra's algorithm, which projects it onto the intersection of all constraints. To do this, it needs to project vectors onto each set separately once per Dykstra iteration. These individual projections are either closed-form solutions or computed by the ADMM algorithm.]
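The closed-form box projection mentioned above is a one-liner; a hedged Julia sketch (the thesis software may organize this differently):

    # Elementwise median of (l, mᵢ, u), i.e., clipping each value to [l, u].
    project_box(m, l, u) = clamp.(m, l, u)

    # Example with velocity bounds of 1475–5000 m/s:
    project_box([1200.0, 3000.0, 6000.0], 1475.0, 5000.0)
    # → [1475.0, 3000.0, 5000.0]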
Dykstra's algorithm for the projection onto the intersection of constraints was first proposed in the context of FWI by Smithyman et al. [2015] and can be seen as an alternative to the method proposed by the late Ernie Esser, which has resulted in major breakthroughs in automatic salt flooding with TV-norm and hinge-loss constraints [Esser et al., 2016a, Esser et al., 2016].

2.5 Why constraints?

Before presenting a more elaborate example of constrained FWI on salt plays, let us first discuss why constrained optimization approaches with projections onto intersections of constraint sets are arguably simpler to use than some other well-known regularization techniques.

[Figure 2.4: Results for constrained FWI for various total-variation budgets (τ): 0.15 TV(m∗), 0.25 TV(m∗), 0.5 TV(m∗), and 0.75 TV(m∗).]

• Constraints translate prior information and assumptions about the geology more directly than penalties. Although the constrained formulation does not require the selection of a penalty parameter, the user still needs to specify parameters for each constraint. For Equation 2.6, this is the size of the TV ball τ. However, compared to the trade-off parameter α, τ is directly measurable from a starting model or any other model that serves as a proxy.

• Absence of user-specified weights. Where regularization via penalty terms relies on the user to provide weights for each penalty term, unique projections onto multiple constraints can be computed with Dykstra's algorithm as long as the intersection is not empty. Moreover, the inclusion of the constraints does not alter the objective (data misfit); rather, it controls the region of f(m) that our nonlinear data-fitting procedure is allowed to explore. This is especially important when there are many (≥ 2) constraints. For standard regularization, it would be difficult to select the weights because the different added penalties all compete to bring down the total objective.

• Constraints are only activated when necessary. Before starting the inversion, it is typically unknown how 'much' regularization is required, as this depends on the noise level, the type of noise, the number of sources and receivers, as well as the medium itself. The advantage of projection methods for constrained optimization is that they only activate the constraints when required. If a proposed model, m_k − ∇_m f(m_k), satisfies all constraints, the projection step does not do anything. The data fitting and constraint handling are uncoupled in that sense. Penalty methods, on the other hand, modify the objective function and will for this reason always have an effect on the inversion.

• Constraints are satisfied at each iteration. We obtain this important property by construction of our projected-gradient algorithm. Penalty methods, on the other hand, do not necessarily satisfy the constraints at each iteration, and this can make them prone to local minima.

2.6 Objective and gradients for two waveform inversion methods

To illustrate the fact that the constrained approach to waveform inversion does not depend on the specifics of a particular waveform inversion method (we only need a differentiable f(m) and the corresponding gradient ∇_m f(m)), we briefly describe in Table 2.1 the objective and gradient for full-waveform inversion (FWI) and Wavefield Reconstruction Inversion (WRI, van Leeuwen and Herrmann [2013]). These two methods will be used in the results section. We would like to emphasize that we do not need gradients of the constraints or anything related to the constraints; only the projection onto the constraint set is necessary. For derivations of these gradients, see, e.g., Plessix [2006] for FWI and van Leeuwen and Herrmann [2013] for WRI.

Table 2.1: Objectives and gradients for full-waveform inversion (FWI) and wavefield reconstruction inversion (WRI). Source term: q; discrete Helmholtz system: A(m); complex-conjugate transpose: ∗; the matrix P selects the values of the predicted wavefields u and ū at the receiver locations. The scalar λ balances the data-misfit versus the wavefield residual.

    Objective FWI:            f(m) = \frac{1}{2}\|Pu - d_\text{obs}\|_2^2
    Objective WRI:            f(m) = \frac{1}{2}\|P\bar{u} - d_\text{obs}\|_2^2 + \frac{\lambda^2}{2}\|A(m)\bar{u} - q\|_2^2
    Field FWI:                u = A(m)^{-1} q
    Field WRI:                \bar{u} = (\lambda^2 A(m)^* A(m) + P^* P)^{-1} (\lambda^2 A(m)^* q + P^* d_\text{obs})
    Adjoint FWI:              v = -A(m)^{-*} P^* (Pu - d_\text{obs})
    Adjoint WRI:              none
    Gradient FWI:             \nabla_m f(m) = G(m, u)^* v
    Gradient WRI:             \nabla_m f(m) = \lambda^2 G(m, \bar{u})^* (A(m)\bar{u} - q)
    Partial derivative FWI:   G(m, u) = \partial A(m) u / \partial m
    Partial derivative WRI:   G(m, \bar{u}) = \partial A(m) \bar{u} / \partial m
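To show how the FWI rows of Table 2.1 translate into code, here is a self-contained toy sketch in Julia on a 1D grid. The Helmholtz discretization (squared slowness, Dirichlet boundaries, no absorbing layers) and all names are our simplifying assumptions, not the solver used in this thesis.

    using LinearAlgebra, SparseArrays

    # Toy 1D Helmholtz system A(m) = ω² diag(m) + Laplacian; m is squared slowness.
    function helmholtz(m::Vector{Float64}, ω::Real, h::Real)
        N = length(m)
        L = spdiagm(-1 => ones(N - 1), 0 => -2.0 * ones(N), 1 => ones(N - 1)) / h^2
        return ω^2 * spdiagm(0 => complex.(m)) + L
    end

    # FWI rows of Table 2.1: field, adjoint field, objective, and gradient,
    # where G(m, u) = ∂(A(m)u)/∂m = ω² diag(u) for this parameterization.
    function fwi_objective_and_gradient(m, q, P, d_obs, ω, h)
        A = helmholtz(m, ω, h)
        u = A \ q                          # field:    u = A(m)⁻¹ q
        r = P * u - d_obs                  # data residual Pu − d_obs
        v = -(A' \ (P' * r))               # adjoint:  v = −A(m)⁻∗ P∗ (Pu − d_obs)
        f = 0.5 * real(dot(r, r))          # objective f(m) = ½‖Pu − d_obs‖²
        g = real.(ω^2 .* conj.(u) .* v)    # gradient: G(m, u)∗ v
        return f, g
    end

    # tiny usage: 51 points, 20 m spacing, 5 Hz, source in the middle, one receiver
    N, h, ω = 51, 20.0, 2π * 5.0
    m = fill(1 / 2000.0^2, N); q = zeros(ComplexF64, N); q[26] = 1.0
    P = sparse([1], [5], [1.0], 1, N)
    f, g = fwi_objective_and_gradient(m, q, P, ComplexF64[0.0], ω, h)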
2.7 Results

To evaluate the performance of our constrained waveform-inversion methodology, we present results for the western part of the BP 2004 velocity model [Billette and Brandsberg-Dahl, 2005], Figure 2.5. The inversion strategy uses simultaneous sources and noisy data. We present results for two different waveform inversion methods and two different noise levels. As we can clearly see from Figures 2.6 and 2.7, FWI with bound constraints only (l = 1475 m/s and u = 5000 m/s) is insufficient to steer FWI in the correct direction, despite the fact that we used a reasonably accurate starting model (Figure 2.5b), obtained by smoothing the true velocity model (Figure 2.5a). WRI with bound constraints only does better, but the results are still unsatisfactory. The results obtained by including TV-norm constraints, on the other hand, lead to a significant improvement and a sharpening of the salt.

We arrived at this result via a practical workflow where we select τ = TV(m_0), such that the initial model (m_0) satisfies the constraints. We run our inversions with the well-established multiscale frequency-continuation strategy, keeping the value of τ fixed. Next, we rerun the inversion with the same multiscale technique, but this time with a slightly larger τ, such that more details can enter into the solution. We select τ = 1.25 × TV(m_1), where m_1 is the inversion result from the first inversion. This is repeated one more time, so we run the inversions three times, each run using a different constraint. For comparison (juxtapose Figures 2.6a and 2.6b), we do the same for the inversions with the box constraints, except that in that case we do not impose the TV-norm constraint and keep the box constraints fixed.
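In pseudocode form, this continuation workflow is only a few lines. The sketch below is ours, with invert standing in for one multiscale constrained inversion and total_variation for Equation 2.1; neither is part of a real package.

    # Hedged sketch of the τ-relaxation workflow: start with τ = TV(m0) and
    # let 25% more total variation enter the solution after each cycle.
    function tv_continuation(m0, invert, total_variation; cycles=3)
        m = m0
        τ = total_variation(m0)        # initial model satisfies TV(m) ≤ τ
        for cycle in 1:cycles
            m = invert(m, τ)           # multiscale inversion with fixed τ
            τ = 1.25 * total_variation(m)
        end
        return m
    end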
As before, our inversions are carried out over multiple frequency batches with a time-harmonic solver for the Helmholtz equation and for data generated with a 15 Hz Ricker wavelet. The inversions start at 3 Hz and run up to 9 Hz. The data contain noise, such that, measured over all frequencies, the noise-to-signal ratio is ‖noise‖₂/‖signal‖₂ = 0.25 for the first example and ‖noise‖₂/‖signal‖₂ = 0.50 for the second example. This means that the 3 Hz data are noisier than frequencies closer to the peak frequency. Frequency-domain amplitude data are shown in Figure 2.8 for the starting frequency. The starting model is kinematically correct because it is a smoothed version of the true model (cf. Figures 2.5a and 2.5b). The model size is about 3 km by 12 km, discretized on a regular grid with a gridpoint spacing of 20 meters.

[Figure 2.5: True (a) and initial (b) velocity models for FWI and WRI, based on the BP 2004 model.]

The main goal of this experiment is to delineate the top and bottom of the salt body, while working with noisy data and only 8 (out of 132 sequential sources) simultaneous sources, redrawn independently after each gradient calculation. The simultaneous sources activate every source at once, with a Gaussian weight. The distance between sources is 80 meters, while the receivers are spaced 40 meters apart.

As we can see, limiting the total-variation norm serves two purposes. (i) We keep the model physically realistic by projecting out highly oscillatory and patchy components that appear in the inversion result when the TV-norm is not constrained. These artifacts are caused by noise, source crosstalk, and by missing low frequencies and long offsets that lead to a non-trivial null space easily inhabited by incoherent velocity structures that hit the bounds. (ii) We prevent ringing artifacts that otherwise appear just below and just above the transition into the salt. These are typical artifacts caused by the inability of regular FWI to handle large velocity contrasts. Because the artifacts increase the total variation by a large amount, limiting the total-variation norm mitigates this well-known problem to a reasonable degree.

The noisy data, together with the use of 8 simultaneous sources, effectively create "noisy" gradients because of the source crosstalk. Therefore, our projected gradient algorithm can be interpreted as "denoising", where we map at each iteration incoherent energy onto coherent velocity structure. For this reason, the results with TV-norm constraints are drastically improved compared to the inversions carried out with bound constraints only. The inability of bound-constrained FWI to produce reasonable results for FWI with source encoding was also observed by Esser et al. [2015b] (see his Figure 19). While removing the bounds could possibly prevent some of these artifacts from building up, it would lead to physically unfeasible low and high velocities, which is something we need to avoid at all times. Again, when the TV-norm and box constraints are applied in tandem, the results are very different. Artifacts related to velocity clipping no longer occur because they are removed by the TV-norm constraint, while the inclusion of this constraint also allows us to improve the delineation of the top/bottom salt and the salt flanks. The results also show that WRI, by virtue of including the wavefields as unknowns, is more resilient to noise and local minima compared to FWI, and that WRI obtains a better delineation of the top and bottom of the salt structure.

[Figure 2.6: Estimated models for FWI and WRI for 25% data noise, based on the BP 2004 model, after the 3rd cycle. Estimated models are shown for box constraints only (a: FWI, c: WRI) and for box constraints combined with total-variation constraints (b: FWI, d: WRI).]

[Figure 2.7: Estimated models for FWI and WRI for 50% data noise, based on the BP 2004 model, after the 3rd cycle. Estimated models are shown for box constraints only (a: FWI, c: WRI) and for box constraints combined with total-variation constraints (b: FWI, d: WRI).]

2.8 Discussion and summary

Our purpose was to demonstrate the advantages of including (non-smooth) constraints over adding penalties in full-waveform inversion (FWI). While this text is certainly not intended to extensively discuss subtle technical details on how to incorporate non-smooth edge-preserving constraints in full-waveform inversion, we explained the somewhat limited success of including total-variation (TV) norms into FWI. By means of stylized examples, we
revealed an undesired lack of predictability of the inversion results as a function of the trade-off and smoothing parameters when we include TV-norm regularization as an added penalty term. We also made the point that many of the issues of including multiple pieces of prior information can be overcome when they are included as intersections of constraints rather than as the sum of several weighted penalties. In this way, we were able to incorporate the edge-preserving TV-norm and box constraints, controlling the spatial variations as well as the permissible range of inverted velocities with one parameter aside from the lower and upper bounds for the seismic velocity. As the stylized examples illustrate, this TV-norm parameter predictably controls the degree of blockiness of the inverted velocity models, making it suitable for FWI on complex models with sharp boundaries.

[Figure 2.8: Frequency panels (amplitude versus source and receiver index) of the lowest-frequency (3 Hz) data for the example based on the BP 2004 model with 25% noise. All examples use noisy data, but the figure also displays data without noise for reference.]

Even though the salt-body example we presented is synthetic and inverted acoustically with the "inversion crime", it clearly illustrates the important role properly chosen constraints can play when combined with search extensions such as Wavefield Reconstruction Inversion (WRI). Without TV-norm constraints, artifacts stemming from source crosstalk, noise, and undesired fluctuations when moving in and out of the salt overwhelm FWI because the inverted velocities hit the upper and lower bounds too often. If we include the TV-norm, this effect is removed and we end up with a significantly improved inversion result with clearly delineated salt. This example also illustrates that the constrained optimization approach applies to any waveform inversion method. Results are presented for FWI and WRI, where the WRI results delineate the salt structure better and exhibit more robustness to noise.

The proposed workflow and algorithms are explained in more detail, and replaced with faster variants, in the following chapter.

Chapter 3

Projection methods and applications for seismic nonlinear inverse problems with multiple constraints

3.1 Introduction

We propose an optimization framework to include prior knowledge in the form of constraints into nonlinear inverse problems that are typically hampered by the presence of parasitic local minima. We favor this approach over the more commonly known regularization via (quadratic) penalties because including constraints does not alter the objective, and therefore leaves first- and second-order derivative information intact. Moreover, constraints do not need to be differentiable and, most importantly, they offer guarantees that the updated models meet the constraints at each iteration of the inversion.
While we focus on seismic full-waveform inversion (FWI), our approach is more general and applies in principle to any linear or nonlinear geophysical inverse problem, as long as its objective is differentiable so that it can be minimized with local derivative information to calculate descent directions that reduce the objective.

A version of this chapter has been published in Geophysics, Society of Exploration Geophysicists, 2018.

In addition to the above important features, working with constraints offers several additional advantages. For instance, because models always remain within the constraint set, inversion with constraints mitigates the adverse effects of local minima, which we encounter in situations where the starting model is not accurate enough or where low-frequency and long-offset data are missing or too noisy. In these situations, derivative-based methods are likely to end up in a local minimum, mainly because of the oscillatory nature of the data and the non-convexity of the objective. Moreover, the costs of data acquisition and limitations on available computational resources also often force us to work with only small subsets of data. As a result, the inversions may suffer from artifacts. Finally, noise in the data and modeling errors can also give rise to artifacts. We will demonstrate that by adding constraints, which prevent these artifacts from occurring in the estimated models, our inversion results can be greatly improved and make more geophysical and geological sense.

To deal with each of the challenging situations described above, geophysicists traditionally often rely on Tikhonov regularization, which corresponds to adding differentiable quadratic penalties that are connected to Gaussian Bayesian statistics on the prior. While these penalty methods are responsible for substantial progress in working with geophysical ill-posed and ill-conditioned problems, quadratic penalties face some significant shortcomings. Chief amongst these is the need to select a penalty parameter, which weights the trade-off between the data misfit and the prior information on the model. While there exists an extensive body of literature on how to choose this parameter in the case of a single penalty term [e.g., Vogel, 2002a, Zhdanov, 2002, Sen and Roy, 2003, Farquharson and Oldenburg, 2004, Mueller and Siltanen, 2012], these approaches do not easily translate to situations where we want to add more than one type of prior information. There is also no simple prior distribution to bound pointwise values on the model without making assumptions on the underlying and often unknown statistical distribution [see Backus, 1988, Scales and Snieder, 1997, Stark, 2015]. By working with constraints, we avoid making these types of assumptions.

3.1.1 Outline

Our primary goal is to develop a comprehensive optimization framework that allows us to directly incorporate multiple pieces of prior information in the form of multiple constraints. The main task of the optimization is to ensure that the inverted models meet all constraints during each iteration. To avoid certain ambiguities, we will do this with projections, so that the updated models are unique, lie in the intersection of all constraints, and remain as close as possible to the model updates provided by FWI without constraints.

There is an emerging literature on working with constrained optimization; see Lelièvre and Oldenburg [2009]; Zeev et al. [2006]; Bello and Raydan [2007]; Baumstein [2013]; Smithyman et al. [2015]; Esser et al. [2015a]; Esser et al. [2016b]; Esser et al. [2018], and Chapter 2 of this thesis.
Because this is relatively new to the geophysical community, we first start with a discussion of related work and of the limitations of unconstrained regularization methods. Next, we discuss how to include (multiple pieces of) prior information with constraints. This discussion includes projections onto convex sets and how to project onto intersections of convex sets. After describing these important concepts, we combine them with nonlinear optimization and describe concrete algorithmic instances based on spectral projected gradients and Dykstra's algorithm. We conclude by demonstrating our approach on an FWI problem.

3.1.2 Notation

Before we discuss the advantages of constrained optimization for FWI, let us first establish some mathematical notation. Our discretized unknown models live on regular grids with N grid points, represented by the model vector m ∈ R^N, which is the result of vectorizing the 2D or 3D models. In Table 3.1 we list a few other definitions we will use.

Table 3.1: Notation used in this chapter.

    description                                   symbol
    data-misfit                                   f(m)
    gradient w.r.t. medium parameters             ∇_m f(m)
    set (convex or non-convex)                    C
    intersection of sets                          ⋂_{i=1}^p C_i
    any transform-domain operator                 A ∈ C^{M×N}
    discrete derivative matrix in 1D              D_z or D_x
    cardinality (# of nonzeros) or ℓ0 'norm'      card(·) ⇔ ‖·‖₀
    ℓ1 norm (one-norm)                            ‖·‖₁

3.1.3 Related work

A number of authors use constraints to include prior knowledge in nonlinear geophysical inverse problems. Most of these works focus on only one or at most two constraints. For instance, Zeev et al. [2006], Bello and Raydan [2007], and Métivier and Brossier [2016] consider nonlinear geophysical problems with only bound constraints, which they solve with projection methods. Because projections implement these bounds exactly, these methods avoid complications that may arise if we attempt to approximate bound constraints by differentiable penalty functions. While standard differentiable optimization can minimize the resulting objective with quadratic penalties, there is no guarantee that the inverted parameters remain within the specified range at every grid point during each iteration of the inversion. Moreover, there is also no consistent and easy way to add multiple constraints reflecting complementary aspects (e.g., bounds and smoothness) of the underlying geology. Bound constraints in a transformed domain are discussed by Lelièvre and Oldenburg [2009].

Close in spirit to the approach we propose is recent work by Becker et al. [2015], who introduce a quasi-Newton method with projections and proximal operators [see, e.g., Parikh and Boyd, 2014] to add a single ℓ1-norm constraint or penalty on the model in FWI. These authors include this non-differentiable norm to induce sparsity on the model by constraining the ℓ1-norm in some transformed domain or on the gradient, as in total-variation minimization. While their method uses the fact that it is relatively easy to project onto the ℓ1-ball, they have to work on the coefficients rather than on the physical model parameters themselves, and this makes it difficult to combine this transform-domain sparsity with, say, bound constraints that live in another transform-domain. As we will demonstrate, we overcome this problem by allowing for multiple constraints in multiple transform-domains simultaneously.

Several authors present algorithms that can incorporate multiple constraints simultaneously. The implementation of multiple constraints for inverse problems entails some subtle but important algorithmic details.
We will discuss these in this chapter. For instance, the work by Baumstein [2013] employs the well-known projection-onto-convex-sets (POCS) algorithm, which can be shown to converge to the projection of a point only in special cases; see, e.g., the work by Escalante and Raydan [2011] and Bauschke and Combettes [2011]. Projecting the updated model parameters onto the intersection of multiple constraints solves this problem and offers guarantees that each model iterate (the model after each iteration) remains, after projection, the closest in Euclidean distance to the unconstrained model while at the same time satisfying all the constraints. Different methods exist to ensure that the model estimate at every iteration remains within the non-empty intersection of multiple constraint sets. Most notably, we would like to mention the work of the late Ernie Esser [Esser et al., 2018], who developed a scaled gradient projection method for this purpose involving box constraints, total-variation constraints, and hinge-loss constraints. Esser et al. [2018] arrived at this result by using a primal-dual hybrid gradient (PDHG) method, which derives from Lagrangians associated with total-variation and hinge-loss minimization. To allow for more flexibility in the number and type of constraints, we propose the use of Dykstra's algorithm [Dykstra, 1983, Boyle and Dykstra, 1986] instead. We refer to Smithyman et al. [2015] and Chapter 2 for examples of successful geophysical applications of multiple constraints to FWI and their distinct advantages over adding constraints as weighted penalties.

3.2 Limitations of unconstrained regularization methods

In the introduction, we stated our requirements on a regularization framework for nonlinear inverse problems. While there is a large number of successful regularization approaches, such as Tikhonov regularization, change of variables, gradient filtering, and modified Gauss-Newton, these methods miss one or more of our desired properties listed in the introduction. Below we will show why the above methods do not generalize to multiple constraints, or do so at the cost of introducing additional manual tuning parameters.

3.2.1 Tikhonov and quadratic regularization

Perhaps the most well-known and widely used regularization technique in geophysics is the addition of quadratic penalties to a data-misfit function. Let us denote the model vector with medium parameters by m ∈ R^N (for example velocity), where the number of grid points is N. The total objective with quadratic regularization, φ(m): R^N → R, is given by

\phi(m) = f(m) + \frac{\alpha_1}{2}\|R_1 m\|_2^2 + \cdots + \frac{\alpha_p}{2}\|R_p m\|_2^2.   (3.1)

In this expression, the data-misfit function f(m): R^N → R measures the difference between predicted and observed data. A common choice for the data-misfit is

f(m) = \frac{1}{2}\|d_\text{pred}(m) - d_\text{obs}\|_2^2,   (3.2)

where d_obs and d_pred(m) are the observed and predicted (from the current model m) data, respectively. The predicted data may depend on the model parameters in a nonlinear way.

There are p regularization terms in equation 3.1, all of which describe different pieces of prior information in the form of differentiable quadratic penalties weighted by scalar penalty parameters α_1, α_2, ..., α_p. The operators R_i ∈ C^{M_i×N} are selected to penalize unwanted properties in m—i.e., we select each R_i such that the penalty terms become large if the model estimate does not lie in the desired class of models. For example, we will promote smoothness of the model estimate m if we add horizontal or vertical discrete derivatives as R_1 and R_2.
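As a concrete, hedged sketch of this smoothing choice, the snippet below builds first-difference operators R₁ = D_z and R₂ = D_x for a vectorized nz-by-nx model and evaluates the penalized objective of equation 3.1, together with the gradient the text derives next (equation 3.3); the construction is ours for illustration.

    using LinearAlgebra, SparseArrays

    diff1d(n) = spdiagm(n - 1, n, 0 => -ones(n - 1), 1 => ones(n - 1))  # 1D first difference

    nz, nx = 60, 100
    R1 = kron(sparse(I, nx, nx), diff1d(nz))   # vertical derivative D_z on vec(m)
    R2 = kron(diff1d(nx), sparse(I, nz, nz))   # horizontal derivative D_x on vec(m)

    # equation 3.1 with p = 2, and its gradient (equation 3.3); f and ∇f are
    # the data-misfit and its gradient, passed in as functions
    φ(m, f, α1, α2)   = f(m) + α1 / 2 * norm(R1 * m)^2 + α2 / 2 * norm(R2 * m)^2
    ∇φ(m, ∇f, α1, α2) = ∇f(m) + α1 * (R1' * (R1 * m)) + α2 * (R2' * (R2 * m))

    # usage with a dummy misfit gradient ∇f(m) = m:
    g = ∇φ(randn(nz * nx), identity, 1e-2, 1e-2)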
Aside from promoting certain properties of the model, adding penalty terms also changes the gradient and the Hessian—i.e., we have

\nabla_m \phi(m) = \nabla_m f(m) + \alpha_1 R_1^* R_1 m + \alpha_2 R_2^* R_2 m   (3.3)

and

\nabla_m^2 \phi(m) = \nabla_m^2 f(m) + \alpha_1 R_1^* R_1 + \alpha_2 R_2^* R_2.   (3.4)

Both expressions, where the symbol ∗ refers to the complex-conjugate transpose, contain contributions from the penalty terms designed to add certain features to the gradient and to improve the spectral properties of the Hessian by applying a shift to the eigenvalues of ∇²_m φ(m).

While regularization of the above type has been applied successfully, it has two important disadvantages. First, it is not straightforward to encode one's confidence in a starting model other than by including a reference model (m_ref) in the quadratic penalty term—i.e., α/2 ‖m_ref − m‖₂² (see, e.g., Farquharson and Oldenburg [2004] and Asnaashari et al. [2013]). Unfortunately, this type of penalty tends to spread deviations with respect to this reference model evenly, so we do not have easy control over its local values (cf. box constraints) unless we provide detailed prior information on the covariance. Secondly, quadratic penalties are antagonistic to models that exhibit sparse structure—i.e., models that can be well approximated by models with a small total variation or by transform-domain coefficient (e.g., Fourier, wavelet, or curvelet) vectors with a small ℓ1-norm or cardinality (‖·‖₀ "norm"). Regrettably, these sparsifying norms are non-differentiable, which often leads to problems when they are added to the objective after smoothing or reweighting the norms. In either case, this can lead to slower convergence, to unpredictable behavior in nonlinear inverse problems [Anagaw, 2014, page 110; Lin and Huang, 2015, and Chapter 2], or to a worsening of the conditioning of the Hessian [Akcelik et al., 2002]. Even without smoothing non-differentiable penalties, there are still penalty parameters to select [Farquharson and Oldenburg, 1998, Becker et al., 2015, Lin and Huang, 2015, Xue and Zhu, 2015, Qiu et al., 2016]. Finally, these issues with quadratic penalties are not purely theoretical. For instance, when working with a land dataset, Smithyman et al. [2015] found that the above limitations of penalty terms hold in practice and that constrained optimization overcomes these limitations, an observation motivating this work.

3.2.2 Gradient filtering

Aside from adding penalties to the data-misfit, we can also remove undesired model artifacts by filtering the gradients of f(m). When we minimize the data objective (cf. equation 3.1) with standard gradient descent, this amounts to applying a filter to the gradient when we update the model—i.e., we have

m_{k+1} = m_k - \gamma\, s(\nabla_m f(m_k)),   (3.5)

where γ is the step length and s(·) is a nonlinear or linear filter.
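To make equation 3.5 concrete, here is a minimal sketch in which s(·) is a three-point moving average; the filter choice and names are ours, purely for illustration.

    # s(·): a simple 1D low-pass (moving-average) filter applied to the gradient
    smooth(g) = [(g[max(i - 1, 1)] + g[i] + g[min(i + 1, end)]) / 3 for i in eachindex(g)]

    # the filtered update of equation 3.5: m_{k+1} = m_k − γ s(∇f(m_k))
    filtered_update(m, g, γ) = m - γ * smooth(g)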
For instance, Brenders and Pratt [2007] apply a 2D spatial low-pass filter to prevent unwanted high-wavenumber updates to the model when inverting low-frequency seismic data. The idea behind this approach is that noise-free low-frequency data should give rise to smooth model updates. While these filters can remove unwanted high-frequency components of the gradient, this method has some serious drawbacks.

First, the gradient is no longer necessarily a gradient of the objective function (equation 3.1) after applying the filter. Although the filtered gradient may, under certain technical conditions, remain a descent direction, optimization algorithms such as the spectral projected gradient (SPG) method [Birgin et al., 1999] or quasi-Newton methods [Nocedal and Wright, 2000] expect true gradients when minimizing (constrained) objectives. Therefore, gradient filtering generally cannot be used in combination with these optimization algorithms without giving up their expected behavior. Second, it is not straightforward to enforce more than one property on the model in this way. Consider, for instance, a two-filter case where s₁(·) is a smoother and s₂(·) enforces upper and lower bounds on the model. In this case, we face the unfortunate ambiguity s₂(s₁(∇_m f(m))) ≠ s₁(s₂(∇_m f(m))). Moreover, this gradient will have non-smooth clipping artifacts if we smooth first and then apply the bounds. Anagaw and Sacchi [2017] present a method that filters the updated model instead of the gradient, but it is also not clear how to extend this filtering technique to more than one model property.

3.2.3 Change of variables / subspaces

Another commonly employed method to regularize nonlinear inverse problems involves certain (possibly orthogonal) transformations of the original model vector. While somewhat reminiscent of gradient filtering, this approach entails a change of variables; see, e.g., Jervis et al. [1996]; Shen et al. [2005]; Shen and Symes [2008] for examples in migration velocity analysis, and Kleinman and van den Berg [1992]; Guitton et al. [2012]; Guitton and Díaz [2012]; Li et al. [2013] for examples in the context of waveform inversion. This approach is also known as a subspace method [Kennett and Williamson, 1988, Oldenburg et al., 1993]. We can invoke this change of variables by transforming the model into p = Tm, where T is a (not necessarily invertible) linear operator. This changes the unconstrained optimization problem min_m f(m) into another unconstrained problem min_p f(p). To see why this might be helpful, we observe that the gradient becomes ∇_p f(p) = T^∗ ∇_m f(m), which shows how T can be designed to 'filter' the gradient. The matrix T can also represent a subspace (a limited number of basis vectors, such as splines or wavelets). Just as with gradient filtering, a change of variables does not easily lend itself to multiple transforms aimed at incorporating complementary pieces of prior information. However, subspace information fits directly into the constrained optimization approach if we constrain our models to be elements of the subspace. The constrained approach has the advantage that we can straightforwardly combine it with other constraints in multiple transform-domains; all constraints in the proposed framework act on the variables m in the physical space, since we do not minimize over subspace/transform-domain coefficients p.
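As a small illustration of the subspace idea (our own construction, assuming a piecewise-linear coarse basis), the sketch below builds a matrix T that interpolates ns coarse coefficients onto n grid points and maps a fine-grid gradient into the subspace via the chain rule:

    using LinearAlgebra

    # piecewise-linear interpolation basis from ns knots onto n grid points
    function subspace_basis(n, ns)
        knots = range(1, n; length=ns)
        T = zeros(n, ns)
        for i in 1:n, j in 1:ns-1
            a, b = knots[j], knots[j+1]
            if a <= i <= b
                T[i, j]     = (b - i) / (b - a)
                T[i, j + 1] = (i - a) / (b - a)
            end
        end
        return T
    end

    T = subspace_basis(101, 6)     # 101 grid points, 6 spline-like coefficients
    grad_m = randn(101)            # stand-in for ∇_m f(m)
    grad_p = T' * grad_m           # subspace gradient ∇_p f(p) = T*∇_m f(m)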
3.2.4 Modified Gauss-Newton

A more recent successful attempt to improve model estimation for certain nonlinear inverse problems concerns imposing curvelet-domain ℓ1-norm-based sparsity constraints on the model updates [Herrmann et al., 2011, Li et al., 2012b, 2016]. This approach converges to local minimizers of f(m) (and hopefully a global one) because sparsity-constrained updates provably remain descent directions (Burke [1990], chapter 2; Herrmann et al. [2011]). However, there are no guarantees that the curvelet coefficients of the model itself will remain sparse unless the support (the locations of the non-zero coefficients) is more or less the same for each Gauss-Newton update [Li et al., 2016]. Zhu et al. [2017] use a similar approach, but they update the transform (also known as a dictionary when the transform is learned or updated) at every FWI iteration.

In summary, while regularizing the gradients or model updates leads to encouraging results for some applications, the constrained optimization approach proposed in this work enforces constraints on the model estimate itself, without modifying the gradient. More importantly, while imposing constraints via projections may superficially look similar to the above methods, our proposed approach differs fundamentally in two main respects. Firstly, it projects uniquely onto the intersection of arbitrarily many constraint sets—effectively removing the ambiguity of the order in which constraints are applied. Secondly, it does not alter the gradients, because it imposes the projections on the proposed model updates, i.e., we project m_{k+1} = m_k − ∇_m f(m_k) onto the constraint set.

3.3 Including prior information via constraints

Before we introduce constrained formulations of nonlinear inverse problems with multiple convex and non-convex constraint sets, we first discuss some important core properties of convex sets, of projections onto convex sets, and of projections onto intersections of convex sets. These properties provide guarantees that our approach generalizes to arbitrarily many constraint sets, i.e., one constraint set is mathematically the same as many constraint sets. The presented convex-set properties also show that there is no need to worry about the order in which we use the sets to avoid ambiguity, as was the case for gradient filtering and for naive implementations of constrained optimization. The constrained formulation also stays away from penalty parameters, yet still offers guarantees that all constraints are satisfied at every iteration of the inversion.

3.3.1 Constrained formulation

To circumvent problems related to incorporating multiple sources of possibly non-differentiable prior information, we propose techniques from constrained optimization [Boyd and Vandenberghe, 2004, Boyd et al., 2011, Parikh and Boyd, 2014, Beck, 2015, Bertsekas, 2015]. The key idea of this approach is to minimize the data-misfit objective while at the same time making sure that the estimated model parameters satisfy constraints. These constraints are mathematical descriptors of prior information on certain physical (e.g., maximal and minimal values for the wavespeed) and geological properties (e.g., velocity models with unconformities that lead to discontinuities in the wavespeed) of the model. We build our formulation on earlier work on constrained optimization with up to three constraint sets, as presented by Lelièvre and Oldenburg [2009]; Smithyman et al. [2015]; Esser et al. [2015a]; Esser et al. [2016b]; Esser et al. [2018], and Chapter 2.

Given an arbitrary but finite number of constraint sets (p), we formulate our constrained optimization problem as follows:

\min_m f(m) \quad \text{subject to} \quad m \in \bigcap_{i=1}^p C_i.   (3.6)

As before, f(m): R^N → R is the data-misfit objective, which we minimize over the discretized medium parameters represented by the vector m ∈ R^N. Prior knowledge on this model vector resides in the indexed constraint sets C_i, for i = 1, ..., p, each of which captures a known aspect of the Earth's subsurface.
These constraints may include bounds on permissible parameter values, desired smoothness or complexity, limits on the number of layers in sedimentary environments, and many others.

In cases where more than one piece of prior information is available, we want the model vector to satisfy these constraints simultaneously, such that we keep control over the model properties, as is required for strategies that relax constraints gradually; see Esser et al. [2016b] and Chapter 2. Because it is difficult to think of a nontrivial example where the intersection of these sets is empty, it is safe to assume that there is at least one model that satisfies all constraints simultaneously. For instance, a homogeneous medium will satisfy many constraints, because its total variation is zero, it has a rank of 1, and its parameter values lie between the minimum and maximum values. We denote the mathematical requirement that the estimated model vector satisfies p constraints simultaneously by m ∈ ⋂_{i=1}^p C_i. The symbol ⋂_{i=1}^p indicates the intersection of p items. Before we discuss how to solve constrained nonlinear geophysical inverse problems, let us first discuss projections and examples of projections onto convex and non-convex sets.

3.3.2 Convex sets

A projection of m onto a set C corresponds to solving

P_C(m) = \arg\min_x \frac{1}{2}\|x - m\|_2^2 \quad \text{subject to} \quad x \in C.   (3.7)

Amongst all possible model vectors x, the above optimization problem finds the vector x that is closest in Euclidean distance to the input vector m while it lies in the constraint set. For a given model vector m, the solution of this optimization problem depends on the constraint set C and its properties. For instance, the above projection is unique for a convex C.

To better understand how to incorporate prior information in the form of one or more constraint sets, let us first list some important properties of constraint sets and their intersection. These properties allow us to use relatively simple algorithms to solve Problem 3.6 by using projections of the above type. First of all, most optimization algorithms require the constraint sets to be convex. Intuitively, a set is convex if any point on the line segment connecting any pair of points in the set is also in the set—i.e., for all x ∈ C and y ∈ C, the following relation holds:

cx + (1 - c)y \in C \quad \text{for} \quad 0 \leq c \leq 1.   (3.8)

There are a number of advantages when working with convex sets, namely:

i. The intersection of convex sets is also a convex set. This property implies that the properties of a convex set also hold for the intersection of arbitrarily many convex sets. Practically, if an optimization algorithm is defined for a single convex set, the algorithm also works in the case of arbitrarily many convex sets, as the intersection is still a single convex set.

ii. The Euclidean projection onto a convex set (equation 3.7) is unique (Boyd and Vandenberghe [2004], section 8.1). When combined with property (i), this implies that the projection onto the intersection of multiple convex sets is also unique. In this context, a unique projection means that, given any point outside a convex set, there exists one point in the set that is closer (in the Euclidean sense) to the given point than any other point in the set.

iii. Projections onto convex sets are non-expansive (Bauschke and Combettes [2011], sections 4.1–4.2, or Dattorro [2010], E.9.3).
If we define the projection operator as P_C(x) and take any pair of points x and y, the non-expansive property is stated as: ‖P_C(x) − P_C(y)‖ ≤ ‖x − y‖. This property guarantees that projections of estimated models onto a convex set are 'stable'. In this context, stability implies that any pair of models moves closer together, or remains equally distant, after projection. This prevents increased separation of pairs of models after projection.

While these properties make convex sets favorites amongst practitioners of (convex) optimization, restricting ourselves to convexity is sometimes too limiting for our application. In the following sections, we may use non-convex sets in the same way as convex sets, but in that case the above properties generally do not hold. The performance of the algorithms then needs empirical verification.

The actual projections onto a single set are either available in closed form (e.g., for bounds and certain norms) or are computed iteratively (with the alternating direction method of multipliers, ADMM; see, e.g., Boyd et al. [2011] and Appendix A) when closed-form expressions for the projections are not available.
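Property (iii) is easy to verify numerically; the hedged snippet below checks non-expansiveness for the closed-form projection onto an ℓ2-ball (points and radius chosen arbitrarily):

    using LinearAlgebra

    # projection onto the ℓ2-ball of radius τ (closed form)
    P(x, τ) = norm(x) <= τ ? x : τ * x / norm(x)

    x, y = randn(5), randn(5)
    norm(P(x, 0.8) - P(y, 0.8)) <= norm(x - y)   # true: ‖P_C(x) − P_C(y)‖ ≤ ‖x − y‖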
3.4 Computing projections onto intersections of convex sets

Our problem formulation, equation 3.6, concerns multiple constraints, so we need to be able to work with multiple constraint sets simultaneously to make sure the model iterates satisfy all prior knowledge. To avoid intermediate model iterates becoming physically and geologically unfeasible, we want our model iterates to satisfy a predetermined set of constraints at every iteration of the inversion process. Because of property (i) (listed above), we can treat the projection onto the intersection of multiple constraints as the projection onto a single set. This implies that we can use relatively standard (convex) optimization algorithms to solve Problem 3.6, as long as the intersection of the different convex sets is not empty. We define the projection onto the intersection of multiple sets as

P_C(m) = \arg\min_x \|x - m\|_2^2 \quad \text{s.t.} \quad x \in \bigcap_{i=1}^p C_i.   (3.9)

The projection of m onto the intersection of the sets, ⋂_{i=1}^p C_i, means that we find the unique vector x, in the intersection of all sets, that is closest to m in the Euclidean sense. To find this vector, we compute the projection onto this intersection via Dykstra's alternating projection algorithm [Dykstra, 1983, Boyle and Dykstra, 1986, Bauschke and Koch, 2015]. We made this choice because this algorithm is relatively simple to implement (we only need projections onto each set individually) and contains no manual tuning parameters. By virtue of property (ii), by projecting onto each set separately and cyclically, Dykstra's algorithm finds the unique projection onto the intersection, as long as all sets are convex [Boyle and Dykstra, 1986, Theorem 2].

To illustrate how Dykstra's algorithm works, let us consider the following toy example, where we project the point (2.5, 3.0) onto the intersection of two constraint sets, namely a halfspace (y ≤ 2, which corresponds to bound constraints in two dimensions) and a disk (x² + y² ≤ 3², which corresponds to a ‖·‖₂-norm ball of radius 3); see Figure 3.1. If we are just interested in finding a feasible point in the set that is not necessarily the closest, we can use the projection onto convex sets (POCS) algorithm (also known as von Neumann's alternating projection algorithm), whose steps are depicted by the solid black line in Figure 3.1. The POCS algorithm iterates P_{C_2}(\cdots(P_{C_1}(P_{C_2}(P_{C_1}(m))))), so depending on whether we first project onto the halfspace or the disk, POCS finds two different feasible points. Like POCS, Dykstra's algorithm projects onto each set in an alternating fashion, but unlike POCS, the solution path that is denoted by the red dashed line provably ends up at the single unique feasible point that is closest to the starting point. The solution found by Dykstra's algorithm is independent of the order in which the constraints are imposed. POCS does not project onto the intersection of the two convex sets; it just solves the convex feasibility problem

\text{find} \quad x \in \bigcap_{i=1}^p C_i   (3.10)

instead. POCS finds a model that satisfies all constraints but which is non-unique (the solution is either (1.92, 2.0) or (2.34, 1.87), situated at Euclidean distances 1.16 and 1.14), and not the projection at (\sqrt{3^2 - 2^2}, 2.0) ≈ (2.24, 2.0) at a minimum distance of 1.03. This lack of uniqueness and of vicinity to the true solution of the projection problem leads to solutions that satisfy the constraints, but that may be too far away from the initial point, and this may adversely affect the inversion. See also Escalante and Raydan [2011, Example 5.1], Dattorro [2010, Figures 177 and 182], and Bauschke and Combettes [2011, Figure 29.1] for further details on this important point.

The geophysical implication of this difference between Dykstra's algorithm and POCS is that the latter may end up solving a problem with unnecessarily tight constraints, moving the model too far away from the descent direction informed by the data-misfit objective. We observe this phenomenon of being too constrained in Figure 3.1, where the two solutions from POCS are not on the boundary of both sets, but instead relatively 'deep' inside one of them. Aside from this potential "over-constraining", the results from POCS may also differ depending on which of the individual constraints is activated first, leading to undesirable side effects. The issue of "over-constraining" does not just occur in geometrical two-dimensional examples, and it is not specific to the constraints from the previous example. Figure 3.2 shows what happens if we project a velocity model (with Dykstra's algorithm) or find two feasible models with POCS, just as in Figure 3.1. The constraint is the intersection of bounds ({m | l_i ≤ m_i ≤ u_i}) and total variation ({m | ‖Am‖₁ ≤ σ} with scalar σ > 0 and A = (D_x^T D_z^T)^T). While one of the POCS results is similar to the projection, the other POCS result has a much smaller total variation than the constraint enforces; i.e., the result of POCS is not the projection but a feasible point in the interior of the intersection. To avoid these issues, Dykstra's algorithm is our method of choice for including two or more constraints into nonlinear inverse problems. Algorithm 1 summarizes the main steps of Dykstra's approach, which, aside from stopping conditions, is parameter free. In Figure 3.3 we show what happens if we replace the projection (with Dykstra's algorithm) in projected gradient descent with POCS. Projected gradient descent solves an FWI problem with bounds and total-variation constraints while using a small number of sources and receivers and an incorrectly estimated source function. The results of Dykstra's algorithm and POCS are different, and the results using POCS depend on the ordering of the sets. Dykstra's algorithm always finds the Euclidean projection onto the intersection of convex sets, which is a unique point. Therefore, it does not matter in what order we project onto each set as part of Dykstra's algorithm.
[Figure 3.1: The trajectory of Dykstra's algorithm for a toy example with two constraints: a maximum 2-norm constraint (disk) and bound constraints. The feasible set is the intersection of a halfspace and a disk; the circle and horizontal lines are the boundaries of the sets. The difference between panels (a) and (b) is the ordering of the two sets: the algorithms in (a) start with the projection onto the disk, and in (b) they start with the projection onto the halfspace. The projection onto convex sets (POCS) algorithm converges to different points depending on which set we project onto first. In both cases, the points found by POCS are not the projection onto the intersection. Dykstra's algorithm converges to the projection of the initial point onto the intersection in both cases, as expected.]

[Figure 3.2: The Marmousi model (a), the projection onto an intersection of bound constraints and total-variation constraints found with Dykstra's algorithm (b), and two feasible models found by the POCS algorithm (c and d). One of the POCS results (c) is very similar to the projection (b), but the other result (d) is very different: model (d) has a total variation much smaller than requested. This situation is analogous to Figure 3.1.]

[Figure 3.3: FWI with an incorrect source function using projections (with Dykstra's algorithm) and FWI with two feasible points (with POCS) for various TV-balls (as a percentage of the TV of the true model) and bound constraints; the rightmost two columns show the differences between the results. The results show that using POCS inside a projected gradient algorithm instead of the projection leads to different results that also depend on the order in which we provide the sets to POCS. This example illustrates the differences between the methods; it is not the intention to obtain excellent FWI results.]
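The toy example of Figure 3.1 is easy to reproduce. The hedged sketch below implements POCS with closed-form projections onto the disk and the halfspace and shows its order dependence; Dykstra's algorithm (implemented after Algorithm 1 below) returns the projection (√5, 2) ≈ (2.24, 2.0) for either ordering.

    using LinearAlgebra

    P_disk(x) = norm(x) <= 3 ? x : 3x / norm(x)   # disk x² + y² ≤ 3²
    P_half(x) = [x[1], min(x[2], 2.0)]            # halfspace y ≤ 2

    # POCS / von Neumann alternating projections: a feasible point, not the projection
    function pocs(x0, projs; maxit=1000)
        x = x0
        for _ in 1:maxit, P in projs
            x = P(x)
        end
        return x
    end

    pocs([2.5, 3.0], [P_disk, P_half])   # ≈ [1.92, 2.00] (disk first)
    pocs([2.5, 3.0], [P_half, P_disk])   # ≈ [2.34, 1.87] (halfspace first)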
3.5 Nonlinear optimization with projections

So far, we discussed a method to project models onto the intersection of multiple constraint sets. Now we propose and discuss a method to combine projections onto an intersection with nonlinear data-fitting. Aside from our design criteria (multiple constraints instead of competing penalties; guarantees that model iterates remain in the constraint set), we need a clean separation of misfit/gradient calculations and projections so that we avoid additional computationally costly PDE solves at all times. This separation also allows us to use different code bases for each task (objective/gradient calculations versus projections). We first describe the basic projected gradient descent method, which serves as an introduction to our method of choice: the spectral projected gradient method.

3.5.1 Projected gradient descent

The simplest first-order algorithm that minimizes a differentiable objective function subject to constraints is the projected gradient method [e.g., Beck, 2014, section 9.4]. This algorithm is a straightforward extension of the well-known gradient-descent method [e.g., Bertsekas, 2015, section 2.1], involving the following updates of the model:

$$m_{k+1} = P_{\mathcal{C}}\big(m_k - \gamma \nabla_m f(m_k)\big). \qquad (3.11)$$

A line search determines the scalar step length $\gamma > 0$. This algorithm first takes a gradient-descent step, involving a gradient calculation, followed by the projection of the updated model back onto the intersection of the constraint sets. By construction, the computationally expensive gradient computations (and data-misfit evaluations for the line search) are separate from the often much cheaper projections onto constraints. The projection step itself guarantees that the model estimate $m_k$ satisfies all constraints at every iteration $k$.

Figure 3.4 illustrates the difference between gradient descent to minimize a two-variable non-convex objective, $\min_m f(m)$, and projected gradient descent to minimize $\min_m f(m) \;\text{s.t.}\; m \in \mathcal{C}$. If we compare the solution paths for gradient and projected gradient descent, we see that the latter explores the boundary as well as the interior of the constraint set $\mathcal{C} = \{m \mid \|m\|_2 \le \sigma\}$ to find a minimizer. This toy example highlights how constraints pose upper limits (the set boundary) on certain model properties but do not force solutions to stay on the constraint-set boundary. Because one of the local minima lies outside the constraint set, this example also shows that adding constraints may guide the solution to a different (correct) local minimizer. This is exactly what we want to accomplish with constraints for FWI: prevent the model estimate $m_k$ from converging to local minimizers that represent unrealistic models.

Figure 3.4: Example of the iteration trajectory when (a) using gradient descent to minimize a non-convex function and (b) using projected gradient descent to minimize a non-convex function subject to a constraint. The constraint requires the model estimate to lie inside the elliptical area in (b). The semi-transparent area outside the ellipse is not accessible to projected gradient descent. There are two important observations: 1) the constrained minimization converges to a different (local) minimizer; 2) the intermediate projected gradient parameter estimates can be in the interior of the set or on the boundary. Black represents low values of the function.
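As a minimal illustration of iteration (3.11), the sketch below applies projected gradient descent with a fixed step length to a toy quadratic objective with bound constraints; the objective, step length, and iteration count are placeholders, and a practical implementation would add the line search and stopping conditions discussed above.

```julia
# Projected gradient descent, iteration (3.11), with a fixed step length γ.
function projected_gradient(m0, grad_f, P; γ=0.5, maxit=50)
    m = copy(m0)
    for _ in 1:maxit
        m = P(m .- γ .* grad_f(m))   # gradient step, then projection
    end
    return m
end

# Toy quadratic objective f(m) = 0.5||m - c||₂² whose minimizer lies outside the box:
c = [3.0, -1.0]
m = projected_gradient([0.0, 0.0], m -> m .- c, m -> clamp.(m, -2.0, 2.0))
# m ≈ [2.0, -1.0]: the minimizer of f over the box [-2, 2]².
```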
3.5.2 Spectral projected gradient

Standard projected gradient descent has two important drawbacks. First, we need to project onto the constraint set after each line-search step. To be more specific, we need to recalculate the step-length parameter $\gamma \in (0, 1]$ if the objective at the projected model iterate is larger than at the current model iterate, i.e., if $f(P_{\mathcal{C}}(m_k - \gamma \nabla_m f(m_k))) > f(m_k)$. In that case, we need to reduce $\gamma$ and test again whether the data-misfit is reduced. For every reduction of $\gamma$, we need to recompute the projection and evaluate the objective, which is too expensive. Second, first-order methods do not use curvature information, which involves the Hessian of $f(m)$ or access to previous gradient and model iterates. Projected gradient algorithms are therefore often slower than Newton, Gauss-Newton, or quasi-Newton algorithms for FWI without constraints.

To avoid these two drawbacks and possible complications arising from the interplay of imposing constraints and correcting for Hessians, we use the spectral projected gradient method (SPG; Birgin et al. [1999], Birgin et al. [2003]), an extension of the standard projected gradient algorithm (equation 3.11) that adds a simple scalar scaling (related to the eigenvalues of the Hessian, see Birgin et al. [1999] and Dai and Liao [2002]). At model iterate $k$, the SPG iterations involve the step

$$m_{k+1} = m_k + \gamma p_k, \qquad (3.12)$$

with update direction

$$p_k = P_{\mathcal{C}}\big(m_k - \alpha \nabla_m f(m_k)\big) - m_k. \qquad (3.13)$$

These two equations define the core of SPG, which differs from standard projected gradient descent in three ways:

i. The spectral step size $\alpha$ [Barzilai and Borwein, 1988, Raydan, 1993, Dai and Liao, 2002] is calculated from the secant equation [Nocedal and Wright, 2000, section 6.1] to approximate the Hessian, leading to accelerated convergence. An interpretation of the secant equation is to mimic the action of the Hessian by the scalar $\alpha$, using finite-difference approximations of the second derivative of $f(m)$. This approach is closely related to the idea behind quasi-Newton methods. We compute $\alpha$ as the solution of

$$D_k = \arg\min_{D = \alpha I} \|D s_k - y_k\|_2, \qquad (3.14)$$

where $y_k = \nabla_m f(m_{k+1}) - \nabla_m f(m_k)$, $s_k = m_{k+1} - m_k$, and $I$ is the identity matrix. This results in the scaling $\alpha = (s_k^* s_k)/(s_k^* y_k)$, derived from gradient and model iterates of the current and previous SPG iterations (see the code sketch after this list). Clearly, this is computationally cheap because $\alpha$ is not computed by a separate line search. We may also need a safeguard against excessively large values of $\alpha$, defined as $\alpha := \min(\alpha, \alpha_{\max})$. Because we work with geophysical inverse problems, we can require a value of $\alpha$ such that $\alpha$ times the gradient has a 'reasonable' physical scaling, i.e., we do not want $\alpha_{\max}$ times the gradient to have a norm larger than the current model parameters. Very large values of $\alpha$ could lead to unphysical models (before projection) and to an unnecessarily large number of line-search steps to determine $\gamma$. We thus require $\alpha_{\max}\|\nabla_m f(m_k)\|_2 \le \|m\|_2$.
ii. Spectral projected gradient employs non-monotone [Grippo and Sciandrone, 2002] inexact line searches to calculate the $\gamma$ in equation 3.12. In Algorithm 2, step 4c enforces a non-monotone Armijo line-search condition (also shown in the code sketch below). As for all FWI problems, $f(m)$ is not convex, so we cannot use an exact line search. Non-monotone means that the objective function value is allowed to increase temporarily, which often results in faster convergence and fewer line-search steps; see, e.g., Birgin et al. [1999] for numerical experiments. Our intuition behind this is as follows: gradient-descent iterations often exhibit a 'zig-zag' pattern when the objective function behaves like a 'long valley' in a certain direction. When the line searches are non-monotone, the objective does not always have to decrease, so we can take relatively larger steps along the valley in the direction of the minimizer that are slightly 'uphill', increasing the objective temporarily.

iii. Each SPG iteration requires only one projection onto the intersection of constraint sets to compute the update direction (equation 3.13) and does not need additional projections for line-search steps. This is a significant computational advantage over standard projected gradient descent, which computes one projection per line-search step, see equation 3.11. From equations 3.12 and 3.13, we observe that $p_k$ lies on the line between the previous model estimate ($m_k$) and the proposed update projected back onto the feasible set, i.e., $P_{\mathcal{C}}(m_k - \alpha \nabla_m f(m_k))$. Therefore, $m_{k+1}$ is on the line segment between these two points in a convex set, and the new model satisfies all constraints simultaneously at every iteration (see equation 3.8). For this reason, any line-search step that reduces $\gamma$ also yields an element of the convex set. Works by Zeev et al. [2006] and Bello and Raydan [2007] confirm that SPG with non-monotone line searches can lead to significant acceleration on FWI and seismic reflection tomography problems with bound constraints, compared to projected gradient descent.
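The two ingredients referred to in items (i) and (ii) are only a few lines of code. The sketch below shows the spectral step length with the physical-scaling safeguard and the non-monotone Armijo test; the sufficient-descent parameter ε and the use of a fixed-length objective history are illustrative choices.

```julia
using LinearAlgebra

# Spectral (Barzilai-Borwein) step length from equation (3.14), with the
# safeguard α_max ||∇f(m_k)||₂ ≤ ||m||₂ discussed in item (i).
function spectral_step(s, y, m, g)
    α    = dot(s, s) / dot(s, y)     # α = (s_k* s_k) / (s_k* y_k)
    αmax = norm(m) / norm(g)         # keep α times the gradient on a physical scale
    return min(α, αmax)
end

# Non-monotone Armijo condition from step 4c of Algorithm 2: accept γ if
# f(m + γp) < max(f_ref) + ε γ ∇f(m)ᵀp, where f_ref stores the objective
# values of the last M iterations.
armijo_ok(f_trial, f_ref, γ, g, p; ε=1e-4) =
    f_trial < maximum(f_ref) + ε * γ * dot(g, p)
```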
In summary, each SPG iteration in Algorithm 2 requires, at the kth iteration, a single evaluation of the objective $f(m_k)$ and gradient $\nabla_m f(m_k)$. In effect, SPG combines data-misfit minimization (our objective) with imposing constraints, while keeping the data-misfit/gradient and projection computations separate. When we impose the constraints, the objective and gradient do not change. Aside from computational advantages, this separation allows us to use different code bases for the objective $f(m)$ and its gradient $\nabla_m f(m)$ on the one hand, and the imposition of the constraints on the other. This separation of responsibilities also leads to a modular software design that applies to different inverse problems requiring (costly) objective and gradient calculations.

3.5.3 Spectral projected gradient with multiple constraints

We now arrive at our main contribution, where we combine projections onto multiple constraints with nonlinear optimization with costly objective and gradient calculations using a spectral projected gradient (SPG) method. Recall from the previous section that the projection onto the intersection of convex sets in SPG is equivalent to running Dykstra's algorithm (Algorithm 1), i.e., we have

$$P_{\mathcal{C}}\big(m_k - \alpha \nabla_m f(m_k)\big) = P_{\mathcal{C}_1 \cap \mathcal{C}_2 \cap \cdots \cap \mathcal{C}_p}\big(m_k - \alpha \nabla_m f(m_k)\big) \;\Leftrightarrow\; \text{DYKSTRA}\big(m_k - \alpha \nabla_m f(m_k), P_{\mathcal{C}_1}, \ldots, P_{\mathcal{C}_p}\big). \qquad (3.15)$$

With this equivalence established, we arrive at our version of SPG presented in Algorithm 2, which has appeared in some form in the non-geophysical literature in Birgin et al. [2003] and Schmidt and Murphy [2010].

Algorithm 2 $\min_m f(m) \;\text{s.t.}\; m \in \bigcap_{i=1}^{p} \mathcal{C}_i$ with spectral projected gradient, non-monotone line searches, and Dykstra's algorithm.

    input: one projector per constraint set, P_{C_1}, P_{C_2}, ...; starting model m_0
    initialize:
    0. M = integer                     //history length for f(m_k)
    0. select η ∈ (0, 1), select initial α
    0. k = 1, select sufficient-descent parameter ε
    WHILE stopping conditions not satisfied DO
        1. compute f(m_k), ∇_m f(m_k)                          //objective & gradient
        2. r_k = DYKSTRA(m_k - α ∇_m f(m_k), P_{C_1}, P_{C_2}, ...)  //project onto intersection
        3. p_k = r_k - m_k                                     //update direction
        4a. f_ref = {f_k, f_{k-1}, ..., f_{k-M}}               //save previous M objective values
        4b. γ = 1
        4c. IF f(m_k + γ p_k) < max(f_ref) + ε γ ∇_m f(m_k)* p_k
                m_{k+1} = m_k + γ p_k                          //update model iterate
                y_k = ∇_m f(m_{k+1}) - ∇_m f(m_k)
                s_k = m_{k+1} - m_k
                α = (s_k* s_k) / (s_k* y_k)                    //spectral step length
                k = k + 1
            ELSE
                γ = η γ                                        //step-size reduction; go back to 4c
            END
    END
    output: m_k

The proposed optimization algorithm for nonlinear inverse problems with multiple constraints (equation 3.6) has the following three-level nested structure:

1. At the top level, we have a possibly non-convex optimization problem with a differentiable objective and multiple constraints,
$$\min_m f(m) \quad \text{subject to} \quad m \in \bigcap_{i=1}^{p} \mathcal{C}_i,$$
which we solve with the spectral projected gradient method;

2. At the next level, we project onto the intersection of multiple (convex) sets,
$$P_{\mathcal{C}}(m) = \arg\min_x \|x - m\|_2 \quad \text{subject to} \quad x \in \bigcap_{i=1}^{p} \mathcal{C}_i,$$
implemented via Algorithm 1 (Dykstra's algorithm);

3. At the lowest level, we project onto the individual sets,
$$P_{\mathcal{C}_i}(m) = \arg\min_x \|x - m\|_2 \quad \text{subject to} \quad x \in \mathcal{C}_i,$$
for which we use ADMM (see Appendix B) if no closed-form solution is available; see the sketch after this list.

While there are many choices for the algorithms at each level, we base our selection of any particular algorithm on its ability to solve each level without relying on additional manual tuning parameters. We summarize our choices in Figure 3.5, which illustrates the three-level nested optimization structure.

Figure 3.5: The 3-level nested constrained optimization workflow.
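For level 3, the following sketch outlines an ADMM projection onto a set of the form {m | ‖Am‖₁ ≤ σ}, for which no closed-form projection exists. This is a minimal illustration under our own choices (the ℓ₁-ball projection of Duchi et al. [2008], a direct factorization for the quadratic subproblem, a fixed penalty ρ, and a fixed iteration count), not the exact implementation of Appendix B.

```julia
using LinearAlgebra, SparseArrays

# Euclidean projection of v onto the l1 ball {y : ||y||₁ <= σ}.
function proj_l1ball(v, σ)
    norm(v, 1) <= σ && return copy(v)
    u = sort(abs.(v); rev=true)
    c = cumsum(u)
    K = findlast(k -> u[k] > (c[k] - σ) / k, eachindex(u))
    θ = (c[K] - σ) / K
    return sign.(v) .* max.(abs.(v) .- θ, 0)
end

# ADMM for P_C(m) = argmin_x 0.5||x - m||² s.t. ||A x||₁ <= σ, via the split
# min 0.5||x - m||² + ι_{l1 ball}(z) s.t. A x = z.
function admm_project(m, A, σ; ρ=1.0, maxit=200)
    x = copy(m); z = A * x; u = zero(z)
    F = cholesky(Symmetric(ρ * (A' * A) + I))   # system matrix of the x-update
    for _ in 1:maxit
        x = F \ (m + ρ * (A' * (z - u)))        # quadratic subproblem
        z = proj_l1ball(A * x + u, σ)           # closed-form projection
        u = u + A * x - z                       # multiplier (dual) update
    end
    return x
end
# admm_project(m0, A, σ) projects m0 onto {x : ||A x||₁ <= σ} for a sparse A.
```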
3.6 Numerical example

As we mentioned earlier, full-waveform inversion (FWI) faces problems with parasitic local minima when the starting model is not sufficiently accurate and the data are cycle-skipped. FWI also suffers when no reliable data are available at the low end of the spectrum (typically below 3 Hz) or at offsets larger than about two times the depth of the model. Amongst the myriad of recent, sometimes somewhat ad hoc, proposals to reduce the adverse effects of these local minima, we show how the proposed constrained optimization framework allows us to include prior knowledge on the unknown model parameters with guarantees that our inverted models indeed meet these constraints for each updated model.

Let us consider the situation where we may not have precise prior knowledge of the actual model parameters themselves, but where we may still be in a position to mathematically describe some characteristics of a good starting model. By a good starting model, we mean a model that leads to significant progress towards the true model during nonlinear inversion. Our strategy is therefore to first improve the starting model, by constraining the inversion such that the model satisfies our expectation of what a starting model looks like, followed by a second cycle of regular FWI. We relax the constraints for the second cycle to allow for model updates that further improve the data fit. We present two different inversion strategies with up to three different types of constraints.

Figure 3.6 shows the true and initial velocity models for this 2D FWI experiment. For this purpose, we take a 2D slice from the BG Compass velocity and density model. We choose this model because it contains realistic velocity "kick back", which is known to challenge FWI. The original model is sampled at 6 m, and we generate "observed data" by running a time-domain simulation code [Louboutin et al., 2017] with the velocity and density models given in Figure 3.6. The sources and receivers (56 each) are located near the surface, with 100 m spacing. A coarse source and receiver spacing of 100 m amounts to about one spatial wavelength at the highest frequency in the water, well below the spatial Nyquist sampling rate.

Figure 3.6: True (a) and initial (b) velocity models for the example.

To mimic realistic situations where the forward modeling for the inversion misses important aspects of the wave physics, we invert for velocity only while fixing the density to be equal to one everywhere. While there are better approximations to the density model than the one we use, we intentionally use a rough approximation of the physics to show that constraints are also beneficial in that situation. To add another layer of complexity, we solve the inverse problem in the frequency domain [Da Silva and Herrmann, 2017], following the well-known multiscale frequency-continuation strategy of Bunks [1995]. To deal with the situation where ocean-bottom marine data are often severely contaminated with noise at the low end of the spectrum, we start the inversion on the frequency interval 3-4 Hz. We define this interval as a frequency batch. We subsequently use the result of the inversion with this first frequency batch as the starting model for the next frequency batch, inverting data from the 4-5 Hz interval. We repeat this process up to frequencies on the interval 14-15 Hz. As stopping conditions for SPG, we use a maximum of 30 data-misfit evaluations for the first frequency batch and ten for every subsequent frequency batch. SPG also terminates, and we proceed to the next frequency batch, if the data-misfit change, gradient, or update direction is numerically insignificant. We also estimate the unknown frequency spectrum of the source on the fly during each iteration, using the variable projection method [Pratt, 1999; Aravkin and van Leeuwen, 2012]. To avoid additional complications, we assume the sources and receivers to be omnidirectional with a flat spatial frequency spectrum.

While frequency continuation and on-the-fly source estimation are both well-established techniques by now, the combination of velocity-only inversion and a poor starting model remains challenging because we (i) ignore density variations in the inversion, which means we can never hope to fit the observed data fully; (ii) miss the velocity kick back at roughly 300-500 m depth in the starting model; and (iii) invert on an up to roughly 10x coarser grid compared to the fine 6 m grid on which the "observed" time-domain data were generated. Because of these challenges, battle-tested multiscale workflows for FWI, where we start at the low frequencies and gradually work our way up to higher frequencies, fail even if we impose bound constraints (a minimum of 1425 m/s and a maximum of 5000 m/s for the estimated velocities) on the model. See Figure 3.7.
Only the top 700 m of the velocity model is inverted reasonably well. The bottom part, on the other hand, is far from the true model almost everywhere. The main discontinuity into the $\ge 4000$ m/s rock is not at the correct depth and does not have the right shape.

Figure 3.7: Model estimate obtained by FWI with bound constraints only.

To illustrate the potential of adding more constraints on the velocity model, we follow a heuristic that combines multiple warm-started multiscale FWI cycles with a relaxation of the constraints. This approach was successfully employed in earlier work by Esser et al. [2016b] and Esser et al. [2018], and in Chapter 2. We present two different strategies with different constraints that both lead to improved results, which shows that there is more than one way to use multiple constraints to arrive at the desired result. Since we are dealing with a relatively undistorted sedimentary basin (see Figure 3.6), we impose constraints that limit lateral variations and force the inverted velocities to increase monotonically with depth during the first inversion cycle. In the second cycle, we relax this condition. We accomplish this by combining box constraints with slope constraints in the vertical direction (described in detail in Appendix B). To enforce continuity in the lateral direction, we work with tighter slope constraints in that direction. Specifically, we limit the variation of the velocity per meter in the depth direction ($z$-coordinate) of the discretized model $m[i, j] = m(i\Delta z, j\Delta x)$. Mathematically, we enforce

$$0 \le \frac{m[i+1, j] - m[i, j]}{\Delta z} \le +\infty \quad \text{for } i = 1, \ldots, n_z - 1,\; j = 1, \ldots, n_x,$$

where $n_z$ and $n_x$ are the numbers of grid points in the vertical and lateral directions and $\Delta z$ is the grid size in depth. With this slope constraint, the inverted velocities are only allowed to increase monotonically with depth, but there is no limit on how fast the velocity can increase in that direction. We impose lateral continuity by selecting the lateral slope constraint as

$$-\varepsilon \le \frac{m[i, j+1] - m[i, j]}{\Delta x} \le +\varepsilon \quad \text{for all } i = 1, \ldots, n_z,\; j = 1, \ldots, n_x - 1.$$

The scalar $\varepsilon$ is a small number set in physical units of velocity change (meters/second) per meter, and $\Delta x$ is the grid size in the lateral direction. We select $\varepsilon = 1.0$ for this example.

Compared to other methods that enforce continuity, e.g., via a sharpening operator in a quadratic penalty term, these slope constraints have several advantages. First, they have a naturally interpretable physical parameter $\varepsilon$ with units of velocity change (meters/second) per meter. Second, they are met at each point in the model, i.e., they are applied and enforced pointwise; and, most importantly, these slope constraints do not impose more structure than needed. For instance, the vertical slope constraint only enforces monotonic increase and nothing else. We do not claim that other methods, such as Tikhonov regularization, cannot accomplish these features. We claim that we do this without nebulous parameter tuning and with guarantees that our constraints are satisfied at each iteration of FWI.
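As a concrete illustration of these pointwise definitions, the sketch below checks whether a gridded model satisfies both slope constraints; the grid spacings and ε follow the example, while the model itself is a placeholder.

```julia
# Pointwise slope-constraint checks for a model m of size nz × nx.
# Vertical: 0 <= (m[i+1,j] - m[i,j]) / Δz        (monotone increase with depth)
# Lateral:  |m[i,j+1] - m[i,j]| / Δx <= ε        (limited lateral variation)
function slope_feasible(m, Δz, Δx, ε)
    nz, nx = size(m)
    vert_ok = all((m[i+1, j] - m[i, j]) / Δz >= 0 for i in 1:nz-1, j in 1:nx)
    lat_ok  = all(abs(m[i, j+1] - m[i, j]) / Δx <= ε for i in 1:nz, j in 1:nx-1)
    return vert_ok && lat_ok
end

# Placeholder model (nz = 200, nx = 500): increases with depth, laterally constant.
m_test = repeat(range(1500.0, 4500.0; length=200), 1, 500)
slope_feasible(m_test, 10.0, 10.0, 1.0)   # returns true
```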
The FWI results with slope constraints for 3-4 Hz data are shown in Figure 3.8a. This result from the first FWI cycle improves the starting model significantly without introducing geologically unrealistic artifacts. This partially inverted model can now serve as input for the second FWI cycle, where we invert data over a broader frequency range between 3-15 Hz (cf. Figure 3.8b) using box constraints only. Apparently, adding slope constraints during the first cycle is enough to prevent the velocity model from moving in the wrong direction, while still allowing enough freedom to get closer to the true model, which underlies the success of the second cycle without slope constraints. This example demonstrates that keeping the recovered velocity model after the first FWI cycle in check, via constraints that are not too tight, can be a successful strategy, even though the final velocity model does not lie in the constraint set imposed during the first FWI cycle, where velocity kick back was not allowed. We kept the computational overhead of this multi-cycle FWI method to a minimum by working with low-frequency data only during the first cycle, which reduces the size of the computational grid by a factor of about fourteen.

Figure 3.8: (a) Model estimate obtained by FWI from 3-4 Hz data with bound constraints, a vertical slope constraint, and a constraint on the velocity variation per meter in the horizontal direction. (b) Model estimate obtained by FWI from 3-15 Hz data with bound constraints, using the result from (a) as the starting model.

The second strategy is similar to the total-variation constraint continuation strategies proposed by Esser et al. [2016b] and Esser et al. [2018], and in Chapter 2, to deal with salt structures. We show that this strategy can also be beneficial for sedimentary geology. The experimental setting is the same as before. This time we use two different constraints instead of three: bounds and TV constraints, as in Chapter 2. The (anisotropic) TV constraint is defined as $\{m \mid \|Am\|_1 \le \sigma\}$, where the matrix $A$ stacks the discretized horizontal and vertical derivative matrices. We select $\sigma = 1.0\,\|Am_0\|_1$ for the first cycle that uses 3-4 Hz data only, i.e., the TV constraint is set to the TV of the initial model $m_0$, see Figure 3.6b. The second cycle works with 3-15 Hz data, as before, and uses bound constraints only.
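Setting the TV ball is a one-line computation once the derivative matrices are assembled. The sketch below builds the anisotropic TV operator from 1D difference matrices via Kronecker products and evaluates σ = ‖Am₀‖₁; the grid, spacings, and starting model are placeholders, and we assume column-major vectorization (depth index fastest, as in Julia's `vec`).

```julia
using LinearAlgebra, SparseArrays

# 1D forward-difference matrix ((n-1) × n) with grid spacing h.
diff1d(n, h) = sparse([1:n-1; 1:n-1], [1:n-1; 2:n],
                      [fill(-1.0/h, n-1); fill(1.0/h, n-1)], n - 1, n)

# Anisotropic TV operator A = [Dx; Dz] for a model vectorized column-major
# (z fastest), so the vertical derivative is I_x ⊗ D_z.
function tv_operator(nz, nx, hz, hx)
    Dz = kron(sparse(I, nx, nx), diff1d(nz, hz))   # vertical differences
    Dx = kron(diff1d(nx, hx), sparse(I, nz, nz))   # lateral differences
    return [Dx; Dz]
end

nzg, nxg = 200, 500                                # placeholder grid
A  = tv_operator(nzg, nxg, 6.0, 6.0)
m0 = vec(repeat(range(1500.0, 4500.0; length=nzg), 1, nxg))
σ  = norm(A * m0, 1)                               # σ = 1.0 * ||A m₀||₁
```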
The results in Figure 3.9 show that the first cycle with a tight TV constraint improves on the laterally invariant starting model (Figure 3.6b), but also displays an incorrect low-velocity zone in the high-velocity rock near the bottom of the model. The result of the second constrained FWI cycle, Figure 3.9b, shows that the first cycle improved the starting model sufficiently, such that the second cycle, using data at all frequencies, can estimate a model similar to the true model.

Figure 3.9: (a) Model estimate obtained by FWI from 3-4 Hz data with bound constraints and total-variation constraints. (b) Model estimate obtained by FWI from 3-15 Hz data with bound constraints, using the result from (a) as the starting model.

Both FWI results with multiple constraints appear to be much closer to the true model than the FWI result that uses bound constraints only. We gain more insight into the quality of the models by looking at reverse-time migrations (RTM) for each of the three FWI results. We show the RTM results and the true reflectivity of the velocity model in Figure 3.10. The results based on FWI with bound constraints show the least similarity with the true reflectivity (Figures 3.10a and 3.10d) because a number of strong reflectors are missing. The RTM results based on FWI with bound constraints only also do not show coherent layers below a depth of 1500 m. The other RTM images, based on FWI with multiple constraints (Figures 3.10b, 3.10c, 3.10e, and 3.10f), are similar to each other and closer to the true reflectivity.

Figure 3.10: Comparison of reverse-time migration (RTM) results based on the FWI velocity models (right halves) and the true reflectivity (left halves). Figures (a) and (d) show RTM based on the velocity model from FWI with bounds only (Figure 3.7). Figures (b) and (e) show RTM results based on the velocity model from FWI with bounds and horizontal and vertical slope constraints (Figure 3.8b). Figures (c) and (f) show RTM results based on the velocity model from FWI with bounds and total-variation constraints (Figure 3.9b). The RTM results based on FWI with bound constraints, (a) and (d), miss a number of reflectors that are clearly present in the other RTM results.

This example was designed to illustrate how our framework for constrained FWI can be of great practical use for FWI problems where good starting models are missing or where low frequencies and long offsets are absent. Our proposed method is not tied to a specific constraint. For different geological settings, we can use the same approach, but with different constraints. We presented two different strategies; the preferable strategy depends on the available prior knowledge. Computationally, both strategies work with constraints for which we can compute the projections as in Appendix A.

3.6.1 Comparison with a quadratic penalty method

We repeat the FWI experiment from the previous section, but this time we regularize using one of the most widely used regularization techniques in the geophysical literature: the quadratic penalty method. This comparison illustrates the benefits of the constrained formulation described in the earlier sections.

To apply a quadratic penalty method as in equation 3.1, we need to come up with penalty functions that represent our prior information, and we also need to find one scalar penalty parameter per penalty function, such that the final model satisfies all prior information. The first piece of prior information is that a starting model is smooth in the lateral direction. The penalty function $R_1(m) = \alpha_1/2\,\|D_x m\|_2^2$ promotes smoothness in the lateral direction, using the lateral finite-difference matrix $D_x$. The second piece of prior information is that a starting model has an almost monotonically increasing velocity with depth. We use $R_2(m) = \alpha_2/2\,\|D_z m\|_2^2$ to promote vertical smoothness.
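For comparison with the projection workflow, the sketch below shows how these quadratic penalties enter the objective and its gradient; `f` and `grad_f` stand in for the data-misfit and its gradient, and the derivative matrices can be built with the `diff1d`/Kronecker construction from the earlier sketch.

```julia
# Quadratic-penalty objective and gradient:
#   F(m)  = f(m) + α₁/2 ||Dx m||₂² + α₂/2 ||Dz m||₂²
#   ∇F(m) = ∇f(m) + α₁ Dx'Dx m + α₂ Dz'Dz m
penalized_obj(m, f, Dx, Dz, α1, α2) =
    f(m) + 0.5α1 * sum(abs2, Dx * m) + 0.5α2 * sum(abs2, Dz * m)

penalized_grad(m, grad_f, Dx, Dz, α1, α2) =
    grad_f(m) + α1 * (Dx' * (Dx * m)) + α2 * (Dz' * (Dz * m))
# Note that α₁ and α₂ carry no direct physical meaning: their effect depends
# on the scaling of f and on each other, unlike the constraint parameters
# ε and σ used above.
```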
We see two disadvantages of quadratic penalties compared to the constrained formulation. First, the quadratic penalty function $R_2$ does not generally lead to monotonicity. In order to promote monotonicity with a penalty function, we would need to work with non-differentiable functions. Alternatively, we could smooth the function, but this introduces another smoothing parameter and leads to unpredictable behavior of FWI as a function of parameter choices, as discussed in Lin and Huang [2015] and in Chapter 2. The second disadvantage of the penalty approach is the selection of the penalty parameters $\alpha_1$ and $\alpha_2$. Whereas the constrained formulation allows us to select the maximum variation of the velocity per meter, the penalty approach requires two parameters without clear physical meaning. These two parameters have no direct relation to the prior information. The effect of a penalty parameter depends on the data-misfit, as well as on all other penalty parameters. We simplify the regularization task for the quadratic penalty method by not using a penalty function to enforce bounds on the velocities. We use projection onto the bounds instead, so that we can focus on the effect of the two penalties.

We show FWI results in Figures 3.11 and 3.12, based on various combinations of penalty parameters, to illustrate the well-known effect that it is easy to over- or underestimate a parameter, leading to a result that does not have the desired properties. We selected the penalty parameters by manual fine-tuning. Some of the results in Figures 3.11 and 3.12 look similar to the true model but contain some critical artifacts. Most noticeable is the peak of the high-velocity (+4000 m/s) rock at the bottom of the model, which should be located close to x = 4000 m. The results from quadratic penalty regularization put the peak at the wrong location and often show a flat top rather than a peak. The estimated velocities of the high-velocity rock at the bottom of the model are also lower than in the model obtained with slope constraints, Figure 3.8.

Another observation about the penalty-method FWI results in Figures 3.11 and 3.12 is that larger penalty parameters lead to more smoothness, but it is neither clear nor intuitive by how much the penalty parameters should be increased to obtain the desired level of smoothness. In contrast, constraints provide a way to set limits on smoothness that will be satisfied at every FWI iteration by construction of the projection method. For example, if we want to increase the smoothness by a factor of two, we need to constrain the velocity variation per meter to half the previous limit; see Chapter 2 for FWI examples that illustrate this point.

Figure 3.11: Results from FWI with regularization by a quadratic penalty method to promote horizontal and vertical smoothness. As for the constrained FWI example, the first FWI cycle uses 3-4 Hz data and is with regularization (left column); the second cycle uses 3-15 Hz data and does not use regularization (right column). Figure (a) uses regularization parameters α₁ = α₂ = 1e5, (c) uses α₁ = α₂ = 1e6, and (e) uses α₁ = α₂ = 1e7.

Figure 3.12: Results from FWI with regularization by a quadratic penalty method to promote horizontal and vertical smoothness. As for the constrained FWI example, the first FWI cycle uses 3-4 Hz data and is with regularization (left column); the second cycle uses 3-15 Hz data and does not use regularization (right column). Figure (a) uses regularization parameters α₁ = 1e6, α₂ = 1e5, (c) uses α₁ = 1e5, α₂ = 1e6, (e) uses α₁ = 1e7, α₂ = 1e6, and (g) uses α₁ = 1e6, α₂ = 1e7.

3.7 Discussion

Our main contribution in solving optimization problems with multiple constraints is that we employ a hierarchical divide-and-conquer approach to handle problems where objective and gradient evaluations require PDE solves. We arrive at this result by splitting each problem into simpler, and therefore computationally more manageable, subproblems.
We start from the top with spectral projected gradient (SPG), which splits the constrained optimization problem into an optimality problem (decreasing the objective) and a feasibility problem (satisfying all constraints), and continue downwards by satisfying the individual constraints using Dykstra's algorithm. Even at the lowest level, we employ this strategy when there is no closed-form projection available for a constraint; we use the alternating direction method of multipliers (ADMM) for the examples. As a result, we end up with an algorithm that remains computationally feasible for large-scale problems where the evaluation of objectives and gradients is computationally costly.

So far, the minimization of our optimality problem relied on first-order derivative information only, and on what is essentially a scalar approximation of the Hessian via SPG. Theoretically, we can also incorporate Dykstra's algorithm into projected quasi-Newton [Schmidt et al., 2009] or (Gauss-)Newton methods [Schmidt et al., 2012, Lee et al., 2014]. However, unlike SPG, these approaches usually require more than one projection computation per FWI iteration to solve quadratic sub-problems with constraints. A more careful evaluation would be required to see whether second-order methods in this case indeed provide advantages compared to projected first-order methods such as SPG.

We also note that there exist parallel versions of Dykstra's algorithm and similar algorithms [Censor, 2006, Combettes and Pesquet, 2011, Bauschke and Koch, 2015]. These algorithms compute all projections in parallel, so each Dykstra iteration takes as much time as the slowest projection computation. As a result, the time per Dykstra iteration does not necessarily increase if there are more constraint sets.

While the primary application and motivation for our work is full-waveform inversion, the developed framework also applies to other geophysical inverse problems; specifically, problems where the data-misfit and gradient evaluations require the solution of many partial-differential equations.

3.8 Conclusions

Because of its computational complexity and notorious local minima, full-waveform inversion easily ranks amongst the most challenging nonlinear inverse problems.
To meet this challenge, we introduced a versatile optimization framework for (non)linear inverse problems with the following key features: (i) it invokes prior information via projections onto the intersection of multiple (convex) constraint sets and thereby avoids reliance on cumbersome trade-off parameters; (ii) it allows for imposing arbitrarily many constraints simultaneously, as long as their intersection is non-empty; (iii) it projects the updated models uniquely onto the intersection at every iteration and as such stays away from ambiguities related to the order in which the constraints are invoked; (iv) it guarantees that model updates satisfy all constraints simultaneously at each iteration; and (v) it is built on top of existing code bases that only need to compute data-misfit objective values and gradients. These features, in combination with our ability to relax and add other constraints that have appeared in the geophysical literature, offer a powerful optimization framework to mitigate some of the adverse effects of local minima.

Aside from promoting certain to-be-expected model properties, our examples also confirmed that invoking multiple constraints as part of a multi-cycle inversion heuristic can lead to better results. We observe improvements during the first full-waveform inversion cycle(s) if the constraint sets are tight enough to prevent unrealistic geological features from entering the model estimate. Provided the inversions make some progress towards the solution, later inversion cycles benefit if the tight constraints are subsequently relaxed, either by dropping them or by increasing the size of the constraint set. This strategy follows the heuristic of first estimating a better starting model, or an otherwise simple model, followed by introducing more details. Constraints provide us with precise control of the maximum model complexity at each FWI iteration. Our examples confirm this important aspect and clearly demonstrate the advantages of working with constraints that are satisfied at each iteration of the inversion.

Compared to many other regularization methods, our approach is easily extendable to other convex or non-convex constraints. However, for non-convex constraints, we can no longer offer certain guarantees, except that all sub-problems in the alternating direction method of multipliers remain solvable without the need to tune trade-off parameters manually. We can do this because we work with projections onto the intersection of multiple sets and we split the computations into multiple pieces that have closed-form solutions.

Chapter 4

Algorithms and software for projections onto intersections of convex and non-convex sets with applications to inverse problems

4.1 Introduction

We consider problems of the form

$$P_{\mathcal{V}}(m) \in \arg\min_x \frac{1}{2}\|x - m\|_2^2 \quad \text{subject to} \quad x \in \bigcap_{i=1}^{p} \mathcal{V}_i, \qquad (4.1)$$

which is the projection of a vector $m \in \mathbb{R}^N$ onto the intersection of $p$ convex and possibly non-convex sets $\mathcal{V}_i$. The projection in equation (4.1) is unique if all sets are closed and convex. The projection operation is a common tool for solving constrained optimization problems of the form

$$\min_m f(m) \quad \text{subject to} \quad m \in \bigcap_{i=1}^{p} \mathcal{V}_i. \qquad (4.2)$$

Examples of algorithms that use projections include spectral projected gradient descent [SPG, Birgin et al., 1999], projected quasi-Newton [Schmidt et al., 2009], and projected Newton-type methods [Bertsekas, 1982, Schmidt et al., 2012]. In the above optimization problem, the function $f(m): \mathbb{R}^N \rightarrow \mathbb{R}$ is at least twice differentiable and may also be non-convex.
Alternatively, proximal algorithms solve

$$\min_m f(m) + \iota_{\mathcal{V}}(m), \qquad (4.3)$$

which is equivalent to (4.2), where $\iota_{\mathcal{V}}(m)$ is the indicator function of the set $\mathcal{V} \equiv \bigcap_{i=1}^{p} \mathcal{V}_i$, which returns zero when we are in the set and infinity otherwise. Because applications may benefit from using non-convex sets $\mathcal{V}_i$, we also consider such sets in the numerical examples. While we do not provide convergence guarantees for this case, we work with some useful and practical heuristics.

The main applications of interest in this work are inverse problems for the estimation of physical (model) parameters ($m \in \mathbb{R}^N$) from observed data ($d_{\text{obs}} \in \mathbb{C}^s$). Notable examples are geophysical imaging problems with seismic waves [full-waveform inversion, see, e.g., Tarantola, 1986, Pratt et al., 1998, Virieux and Operto, 2009] for acoustic velocity estimation, and direct-current resistivity problems [DC-resistivity, see, e.g., Haber, 2014] to obtain electrical conductivity information. These problems all have 'expensive' forward operators, i.e., evaluating the objective $f(m)$ requires the solution of many partial-differential equations (PDEs) if the PDE constraints are implicit in $f(m)$, which corresponds to a reduced data-misfit [Haber et al., 2000]. In our context, each set $\mathcal{V}_i$ describes a different type of prior information on the model $m$. Examples of prior knowledge as convex sets are bounds on parameter values, smoothness, matrix properties such as the nuclear norm, and whether or not the model is blocky with sharp edges (total-variation-like constraints via the $\ell_1$ norm). Non-convex sets that we use in the numerical examples include the annulus (minimum and maximum $\ell_2$ norm), limited matrix rank, and vector cardinality.

Aside from the constrained minimization in problem (4.2), we consider feasibility (also known as set-theoretic estimation) problem formulations [e.g., Youla and Webb, 1982, Trussell and Civanlar, 1984, Combettes, 1993, 1996]. Feasibility-only formulations accept any point in the intersection of sets $\mathcal{V}_i$ that describe constraints on model-parameter properties, together with a data-fit constraint $\mathcal{V}_p^{\text{data}}$ that ties the unknown model vector $x$ to the observed data $d_{\text{obs}} \in \mathbb{R}^M$ via a forward operator $F \in \mathbb{R}^{M \times N}$. Examples of data-constraint sets are $\mathcal{V}^{\text{data}} = \{x \mid l \le (Fx - d_{\text{obs}}) \le u\}$ and $\mathcal{V}^{\text{data}} = \{x \mid \|Fx - d_{\text{obs}}\|_2 \le \sigma\}$. The upper and lower bounds are vectors $l$ and $u$, and $\sigma > 0$ is a scalar that depends on the noise level. The forward operators are linear and often computationally 'cheap' to apply; examples include masks and blurring kernels. In case a good initial guess is available, we can choose to solve a projection rather than a feasibility problem by adding the squared $\ell_2$ distance term as follows:

$$\min_x \frac{1}{2}\|x - m\|_2^2 \quad \text{s.t.} \quad \begin{cases} x \in \mathcal{V}_p^{\text{data}} \\ x \in \bigcap_{i=1}^{p-1} \mathcal{V}_i \end{cases}. \qquad (4.4)$$

To demonstrate the benefits of this constrained formulation, we recast joint denoising-deblurring-inpainting and image desaturation problems as projections onto intersections of sets. Especially when we have a few training examples from which we can learn constraint-set parameters, the feasibility and projection approaches conveniently add many pieces of prior knowledge in the form of multiple constraint sets, but without any penalty or trade-off parameters. For instance, Combettes and Pesquet [2004] show that we can observe 'good' choices of the parameters that define the constraint sets, such as the average of the total variation of a few training images.
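As a small illustration of observing set parameters from examples, the sketch below estimates a TV-ball size and elementwise bounds from a handful of training images; the training set and the TV operator (assembled as in the earlier sketch) are placeholders, and taking the mean follows the suggestion of Combettes and Pesquet [2004].

```julia
using Statistics, LinearAlgebra

# Observe constraint-set parameters from training examples:
#   σ: average total variation; l, u: elementwise bounds.
function learn_set_parameters(training_models, A)
    σ = mean(norm(A * vec(m), 1) for m in training_models)  # TV-ball size
    l = minimum(minimum.(training_models))                   # lower bound
    u = maximum(maximum.(training_models))                   # upper bound
    return σ, l, u
end
# The resulting (σ, l, u) define the sets {x : ||Ax||₁ <= σ} and
# {x : l <= x_i <= u} without any trade-off parameters.
```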
We address the increasing computational demand that comes with additional constraint sets with a reformulation of problem (4.4), such that we take similarity between sets into account and split the problem into simple parallel computations where possible.

Projected gradient and similar algorithms naturally split problem (4.2) into a projection part and a data-fitting part. In this setting, software for computing projections onto the intersection of sets can work together with codes for physical simulations that compute $f(m)$ and $\nabla_m f(m)$, as we show in one of the numerical examples. See dolfin-adjoint [Farrell et al., 2013] and Devito [Kukreja et al., 2016, Louboutin et al., 2018] in Python, and WAVEFORM [Da Silva and Herrmann, 2017], jInv [Ruthotto et al., 2017], and JUDI [Witte et al., 2018] in Julia, for examples of recent packages for physical simulations that also compute $\nabla_m f(m)$.

Compared to regularization via penalty functions (that are not indicator functions), the constrained problem formulations (4.2) and (4.4) have several advantages when solving physical parameter estimation problems. Penalty methods,

$$\min_m f(m) + \sum_{i=1}^{p} \alpha_i R_i(m), \qquad (4.5)$$

add prior knowledge through $p \ge 1$ penalty functions $R_i(m): \mathbb{R}^N \rightarrow \mathbb{R}$ with scalar weights $\alpha_i > 0$ added to the data-misfit term $f(m)$. Alternatively, we can add penalties to the objective and work with a data constraint instead, i.e., we have

$$\min_m \sum_{i=1}^{p} \alpha_i R_i(m) \quad \text{s.t.} \quad f(m) \le \sigma, \qquad (4.6)$$

generally referred to as basis pursuit denoise [Mallat and Zhang, 1992, Chen et al., 2001, van den Berg and Friedlander, 2009, Aravkin et al., 2014], Morozov/residual regularization [Ivanov et al., 2013], or Occam's inversion [Constable et al., 1987]. The scalar $\sigma$ relates to the noise level in the data. For convex constraints/objectives/penalties, the constrained, penalty, and data-constrained problems are equivalent under certain conditions and for specific $\alpha$-$\sigma$ pairs [Vasin, 1970, Gander, 1980, Golub and von Matt, 1991, van den Berg and Friedlander, 2009, Aravkin et al., 2016, Tibshirani, 2017], but they differ in algorithmic implementation and in their ability to handle multiple pieces of prior information ($p > 1$). In that case, the simplicity of adding penalties is negated by the challenge of selecting multiple trade-off parameters ($\alpha_i$). For this, and for the reasons we list below, we prefer constrained formulations that involve projections onto the intersection of constraint sets (problem 4.1). Constrained formulations

• satisfy prior information at every iteration. PDE-based inverse problems require the solutions of PDEs to evaluate the objective function $f(m)$ and its gradient. The model parameters need to be in an interval for which the mesh (PDE discretization) is suitable, i.e., we have to use bound constraints. Optimization algorithms that satisfy all constraints at every iteration also give the user precise control of the model properties when solving problem (4.2) with a projection-based algorithm. This allows us to start solving a non-convex inverse problem with certain constraints, followed by a solution stage with 'looser' constraints. Smithyman et al. [2015], Esser et al. [2016b], and Esser et al. [2016], as well as examples in Chapter 2, apply this strategy to seismic full-waveform inversion to avoid local minimizers that correspond to geologically unrealistic models.

• require a minimum number of manual tuning parameters for multiple constraints. We want to avoid the time-consuming and possibly computationally costly procedure of manually tuning numerous nuisance parameters.
While we need to define the constraint sets, we avoid the scalar weights that penalty functions use. Constraint sets have the advantage that their definitions are independent of all other constraint definitions. For penalty functions, the effect of the weight $\alpha_i$ associated with each $R_i$ on the solution of an inverse problem depends on all other $\alpha_i$ and $R_i$. For this reason, selecting multiple scalar weights to balance multiple penalty functions becomes increasingly difficult as we increase the number of penalties.

• make direct use of prior knowledge. We can observe model properties from training examples and use this information directly as constraints [Combettes and Pesquet, 2004; see also the numerical examples in this work]. Penalty and basis-pursuit-type methods first need to translate this information into penalty functions and scalar weights.

Most classical and recently proposed methods to project onto an intersection of multiple (convex) sets, such as Dykstra's algorithm and its variants [Dykstra, 1983, Boyle and Dykstra, 1986, Censor, 2006, Bauschke and Koch, 2015, López and Raydan, 2016, Aragón Artacho and Campoy, 2018] (see also Appendix C), use the projections onto each set separately, $P_{\mathcal{V}_i}(\cdot)$, as the main computational component. Each projection is a black box, and this may create difficulties if the projection onto one or more sets has no known closed-form solution. We then need another iterative algorithm to solve these sub-problems. This nesting of algorithms may lead to problems with the selection of appropriate stopping criteria for the algorithm that solves the sub-problems. In that case, we need two sets of stopping criteria: one for Dykstra's algorithm itself and one for the iterative algorithm that computes the individual projections. For this reason, it may become challenging to select stopping criteria for the algorithm that computes a single projection. For example, projections need to be sufficiently accurate for Dykstra's algorithm to converge; at the same time, we do not want to waste computational resources by solving sub-problems more accurately than necessary. A second characteristic of black-box projection algorithms is that they treat every set individually and do not attempt to exploit similarities between the sets. If we work with multiple constraint sets, some of the set definitions may include the same or similar linear operators in terms of sparsity (non-zero) patterns.

Besides algorithms designed to solve a specific projection problem onto the intersection of multiple sets, there exist software packages capable of solving a range of generic optimization problems. However, many of the current software packages are not designed to compute projections onto intersections of multiple constraint sets where we usually do not know the projection onto each set in closed form. This happens, for instance, when the set definitions include linear operators $A$ that satisfy $AA^\top \ne \alpha I$ for $\alpha > 0$. A package such as Convex for Julia [Udell et al., 2014], an example of disciplined convex programming (DCP), does not handle non-convex sets and requires a lot of memory even for 2D problems. The high memory demands are a result of the packages that Convex can call as the back end, for example SCS [O'Donoghue et al., 2016] or ECOS [Domahidi et al., 2013]. These solvers work with matrices that possess a structure similar to

$$\begin{pmatrix} \star & A^\top \\ A & \star \end{pmatrix}, \qquad (4.7)$$

where the matrix $A$ vertically stacks all linear operators that are part of the equality constraints.
Both the block-structured system (4.7) and $A$ become prohibitively large when we work with multiple constraint sets that include a linear operator in their definitions. The software that comes closest to our implementation is Epsilon [Wytock et al., 2015], which is written in Python. Like our proposed algorithms, Epsilon also employs the alternating direction method of multipliers (ADMM), but it reformulates optimization problems by emphasizing generalized proximal mappings as in equation (4.12, see below). Linear equality constraints then appear as indicator functions, which leads to different linear operators ending up in different sub-problems. In contrast, we work with a single ADMM sub-problem that includes all linear operators. The ProxImaL software [Heide et al., 2016] for Python is designed for linear inverse problems in imaging, using ADMM with a similar problem reformulation. However, ProxImaL differs fundamentally since it applies regularization with a relatively small number of penalty functions. While in principle it should be possible to adapt that package to constrained problem formulations by replacing penalties with indicator functions, ProxImaL is in its current form not set up for that purpose. Finally, there is StructuredOptimization [Antonello et al., 2018] in Julia. This package also targets inverse problems via smooth+non-smooth function formulations. Different from the goal of this work, StructuredOptimization focuses on problems with easy-to-compute generalized proximal mappings (4.12), i.e., penalty functions or constraints composed with linear operators that satisfy $AA^\top = \alpha I$. In contrast, we focus on the situation where we have many constraints with operators ($AA^\top \ne \alpha I$) that make the generalized proximal mappings (4.12) difficult to compute. Below, we list additional benefits of our approach compared to existing packages that can solve intersection projection problems.

4.1.1 Contributions

Our aim is to design and implement parallel computational optimization algorithms for solving projection problems onto intersections of multiple constraint sets in the context of inverse problems. To arrive at this optimization framework, SetIntersectionProjection, we propose

• an implementation that avoids nesting of algorithms and exploits similarities between constraint sets, unlike black-box alternating projection methods such as Dykstra's algorithm. Taking similarities between sets into account allows us to work with many sets at a relatively small increase in computational cost.

• algorithms that are based on a relaxed variant of the simultaneous direction method of multipliers [SDMM, Afonso et al., 2011, Combettes and Pesquet, 2011, Kitic et al., 2016]. By merging SDMM with recently developed schemes for automatically adapting the augmented-Lagrangian penalty and relaxation parameters [Xu et al., 2017b,a], we achieve speedups when solving problem (4.1) compared to a straightforward application of operator splitting, such as the alternating direction method of multipliers (ADMM) with fixed parameters or older updating schemes.

• a software design specifically for set-intersection projection problems. Our specializations enhance computational performance and include (i) a relatively simple multilevel strategy for ADMM-based algorithms that does part of the computations on significantly coarser grids; and (ii) solutions of banded linear systems in compressed diagonal storage (CDS) format with multi-threaded matrix-vector products (MVPs).
These MVPs are faster than with general-purpose storage formats such as compressed sparse column (CSC) storage. Unlike linear-system solves by Fourier diagonalization, they support linear operators with spatially varying (blurring) kernels and various boundary conditions; see the discussion by, e.g., Almeida and Figueiredo [2013] and O'Connor and Vandenberghe [2017]. They also include (iii) more intuitive stopping criteria based on set feasibility.

• to make our work available as a software package in Julia [Bezanson et al., 2017]. Besides the algorithms, we also provide scripts for setting up the constraints, projectors, and linear operators, as well as various examples. All presented timings, comparisons, and examples are reproducible.

• an implementation that is suitable for small matrices (2D) up to larger tensors (3D models, at least $m \in \mathbb{R}^{300 \times 300 \times 300}$). Because we solve simple-to-compute sub-problems in closed form and independently in parallel, the proposed algorithms work with large models and many constraints. We achieve this because there is only a single inexact linear-system solve, which does not become much more computationally expensive as we add more constraint sets. To improve performance even further, we also provide a multilevel accelerated version.

To demonstrate the capabilities of our optimization framework and implementation, we provide examples of how projections onto an intersection of multiple constraint sets can be used to solve linear image-processing tasks such as denoising and deconvolution, as well as more complicated inverse problems, including nonlinear parameter estimation problems with PDEs.

4.2 Notation, assumptions, and definitions

Our goal is to estimate the model vector (e.g., discretized medium parameters such as the acoustic wave speed) $m \in \mathbb{R}^N$, which in 2D corresponds to a vectorized (lexicographically ordered) matrix of size $n_z \times n_x$, with $z$ the vertical coordinate and $x$ the horizontal direction. There are $N = n_x \times n_z$ elements in a 2D model. Our work applies to 2D and 3D models, but to keep the derivations simpler we limit ourselves to 2D models discretized on a regular grid. We use the following discretization for the vertical derivative in our constraints:

$$D_z = \frac{1}{h_z}\begin{pmatrix} -1 & 1 & & \\ & -1 & 1 & \\ & & \ddots & \ddots \\ & & -1 & 1 \end{pmatrix}, \qquad (4.8)$$

where $h_z$ is the vertical grid size. We define the discretized vertical derivative for the 2D model as the Kronecker product of $D_z$ and the identity matrix corresponding to the x-dimension: $D_z \otimes I_x$.
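A sketch of this construction in Julia, assuming the lexicographic ordering of the thesis in which the Kronecker product D_z ⊗ I_x acts as the vertical derivative; the grid sizes are placeholders, and the difference-matrix helper mirrors the one used in the earlier TV sketch.

```julia
using SparseArrays, LinearAlgebra

# Bidiagonal first-difference matrix from equation (4.8): (n-1) × n, scaled by 1/h.
first_difference(n, h) =
    sparse([1:n-1; 1:n-1], [1:n-1; 2:n],
           [fill(-1.0/h, n-1); fill(1.0/h, n-1)], n - 1, n)

nz, nx, hz = 300, 200, 10.0
Dz = first_difference(nz, hz)
# Vertical derivative of the vectorized 2D model, following the D_z ⊗ I_x ordering:
Dz2D = kron(Dz, sparse(I, nx, nx))
# Dz2D has size (nz-1)*nx × nz*nx and applies D_z along depth for every x-position.
```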
The indicator function of a convex or non-convex set $\mathcal{C}$ is defined as

$$\iota_{\mathcal{C}}(m) = \begin{cases} 0 & \text{if } m \in \mathcal{C}, \\ +\infty & \text{if } m \notin \mathcal{C}. \end{cases} \qquad (4.9)$$

We define the Euclidean projection onto a convex or non-convex set $\mathcal{C}$ as

$$P_{\mathcal{C}}(m) = \arg\min_x \|x - m\|_2^2 \quad \text{s.t.} \quad x \in \mathcal{C}. \qquad (4.10)$$

This projection is unique if $\mathcal{C}$ is a closed and convex set. If $\mathcal{C}$ is a non-convex set, the projection may not be unique, so the result is any vector in the set of minimizers of the projection problem. The proximal map of a function $g(m): \mathbb{R}^N \rightarrow \mathbb{R} \cup \{+\infty\}$ is defined as

$$\text{prox}_{\gamma, g}(m) = \arg\min_x g(x) + \frac{\gamma}{2}\|x - m\|_2^2, \qquad (4.11)$$

so $\text{prox}_{\gamma, g}(m): \mathbb{R}^N \rightarrow \mathbb{R}^N$, where $\gamma > 0$ is a scalar. The case where $g(x)$ includes a linear operator $A \in \mathbb{R}^{M \times N}$ is of particular interest to us, and we make it explicit with the definition

$$\text{prox}_{\gamma, g \circ A}(m) = \arg\min_x g(Ax) + \frac{\gamma}{2}\|x - m\|_2^2. \qquad (4.12)$$

Even though $\text{prox}_{\gamma, g}(m)$ is often available in closed form, or cheap to compute [Combettes and Pesquet, 2011, Parikh and Boyd, 2014, Beck, 2017, Chapters 6 & 7], $\text{prox}_{\gamma, g \circ A}(m)$ is usually not available in closed form if $AA^\top \ne \alpha I$, $\alpha > 0$, and is more expensive to compute. Here, the symbol $\top$ refers to the (Hermitian) transpose. The proximal map of the indicator function is the projection,

$$\text{prox}_{\gamma, \iota_{\mathcal{C}}}(m) = P_{\mathcal{C}}(m),$$

with $P_{\mathcal{C}}(m)$ defined as in (4.10). The intersection of an arbitrary number of convex sets, $m \in \bigcap_{i=1}^{p} \mathcal{C}_i$, is also convex. We assume that all constraints are chosen consistently, such that the intersection of all selected constraint sets is non-empty:

$$\bigcap_{i=1}^{p} \mathcal{C}_i \ne \emptyset. \qquad (4.13)$$

This means we define constraints such that there is at least one element in the intersection. This assumption is not restrictive in practice because apparently contradicting constraint sets often have a non-empty intersection. For example, $\ell_1$-norm-based total-variation constraints and smoothness-promoting constraints have at least one model in their intersection: a homogeneous model has a total variation equal to 0 and maximal smoothness.

We use $m[i]$ to indicate entries of the vector $m$. Subscripts like $y_i$ refer to one of the sub-vectors that are part of $\tilde{y} = (y_1^\top\, y_2^\top, \ldots, y_p^\top)^\top$. The Euclidean inner product of two vectors is denoted as $a^\top b$, and $\|a\|_2^2 = a^\top a$.

4.3 PARSDMM: Exploiting similarity between constraint sets

As we briefly mentioned in the introduction, currently available algorithms for computing projections onto the intersection of closed and convex sets do not take similarity between sets into account. They also treat projections onto each set as a black box, which means they require another iterative algorithm (and stopping conditions) to compute projections that have no closed-form solution. In our Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM), we avoid nesting multiple algorithms and explicitly exploit similarities between the $i = 1, 2, \ldots, p$ linear operators $A_i \in \mathbb{R}^{M_i \times N}$. We accomplish this by writing each constraint set $\mathcal{V}_i$ in problem (4.1) as the indicator function of a 'simple' set ($\iota_{\mathcal{C}_i}$) and a possibly non-orthogonal linear operator: $x \in \mathcal{V}_i \Leftrightarrow A_i x \in \mathcal{C}_i$. We formulate the projection of $m \in \mathbb{R}^N$ onto the intersection of $p$ sets as

$$\min_x \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p} \iota_{\mathcal{C}_i}(A_i x). \qquad (4.14)$$

PARSDMM is designed to solve inverse problems that call for multiple pieces of prior knowledge in the form of constraints. Each piece of prior knowledge corresponds to a single set, and we focus on intersections of two up to about 16 sets, which we found adequate to regularize inverse problems. To avoid technical issues with non-convexity, we assume for now that all sets are closed and convex.

We use ADMM as a starting point. ADMM is known to solve intersection projection (and feasibility) problems [Boyd et al., 2011, Pakazad et al., 2015, Bauschke and Koch, 2015, Jia et al., 2017, Tibshirani, 2017, Kundu et al., 2017]. However, it remains a black-box algorithm and struggles with projections that do not have closed-form solutions. For completeness, and to highlight the differences with the algorithm we propose below, we present in Appendix C a black-box algorithm for the projection onto the intersection of sets based on ADMM.
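To illustrate the reformulation x ∈ V_i ⇔ A_i x ∈ C_i, the sketch below represents each constraint set as a pair of a linear operator and a projector onto a simple set, reusing `proj_l1ball`, `Dz2D`, and the grid from the earlier sketches; these pairs and parameter values are illustrative, not the package's actual data structures.

```julia
# Each constraint set V_i is a pair (A_i, P_{C_i}): a linear operator and a
# projector onto a 'simple' set with a closed-form projection.
N  = nz * nx
A1 = Dz2D                                # transform for a TV-type set
P1 = y -> proj_l1ball(y, 1000.0)         # C_1: l1-norm ball, ||y||₁ <= σ
A2 = sparse(I, N, N)                     # identity transform
P2 = y -> clamp.(y, 1425.0, 5000.0)      # C_2: bound constraints

constraint_pairs = [(A1, P1), (A2, P2)]
# x ∈ V_1 ∩ V_2  ⇔  A1*x ∈ C_1 and A2*x ∈ C_2. PARSDMM only ever needs the
# products A_i*x and A_i'*y, plus the simple projections P_{C_i}.
```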
4.3 PARSDMM: Exploiting similarity between constraint sets

As we briefly mentioned in the introduction, currently available algorithms for computing projections onto the intersection of closed and convex sets do not take similarity between sets into account. They also treat projections onto each set as a black box, which means they require another iterative algorithm (and stopping conditions) to compute projections that have no closed-form solution. In our Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM), we avoid nesting multiple algorithms and explicitly exploit similarities between the i = 1, 2, …, p linear operators A_i ∈ R^{M_i×N}. We accomplish this by writing each constraint set V_i in problem (4.1) as the indicator function of a 'simple' set (ι_{C_i}) and a possibly non-orthogonal linear operator: x ∈ V_i ⇔ A_i x ∈ C_i. We formulate the projection of m ∈ R^N onto the intersection of p sets as

\min_x \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p} \iota_{C_i}(A_i x).  (4.14)

PARSDMM is designed to solve inverse problems that call for multiple pieces of prior knowledge in the form of constraints. Each piece of prior knowledge corresponds to a single set, and we focus on intersections of two up to about 16 sets, which we found adequate to regularize inverse problems. To avoid technical issues with non-convexity, we, for now, assume all sets to be closed and convex.

We use ADMM as a starting point. ADMM is known to solve intersection projection (and feasibility) problems [Boyd et al., 2011, Pakazad et al., 2015, Bauschke and Koch, 2015, Jia et al., 2017, Tibshirani, 2017, Kundu et al., 2017]. However, it remains a black-box algorithm and struggles with projections that do not have closed-form solutions. For completeness, and to highlight the differences with the algorithm we propose below, we present in Appendix C a black-box algorithm for the projection onto the intersection of sets based on ADMM.

The augmented Lagrangian

To start the derivation of PARSDMM, we introduce separate vectors y_i ∈ R^{M_i} for each of the i = 1, …, p constraint sets of problem (4.14), and we add linear equality constraints as follows:

\min_{x, \{y_i\}} \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p} \iota_{C_i}(y_i) \quad \text{s.t.} \quad A_i x = y_i.  (4.15)

The augmented Lagrangian [e.g., Nocedal and Wright, 2000, Chapter 17] of problem (4.15) is the basis for ADMM (see (4.19) below). To ensure that the x-minimization remains quadratic (see the derivation below), we make this minimization problem independent of the distance term ½‖x − m‖_2^2. This choice has the additional benefit of allowing for other functions that measure the distance from m. We remove the direct coupling of the distance term by introducing the additional variables and constraints y_{p+1} = A_{p+1}x = I_N x. For this purpose, we define ½‖x − m‖_2^2 = f(y_{p+1}) and create the function

\tilde{f}(\tilde{y}) = f(y_{p+1}) + \sum_{i=1}^{p} \iota_{C_i}(y_i),  (4.16)

where we use the ·̃ symbol to indicate concatenated matrices and vectors, as well as functions that are the sum of multiple functions, to simplify notation. The concatenated matrices and vectors read

\tilde{A} = \begin{pmatrix} A_1 \\ \vdots \\ A_{p+1} \end{pmatrix}, \quad \tilde{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_{p+1} \end{pmatrix}, \quad \tilde{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_{p+1} \end{pmatrix}, \quad \text{with } A_{p+1} = I_N.  (4.17)

The vectors v_i ∈ R^{M_i} are the Lagrangian multipliers that occur in the augmented Lagrangian for the projection problem, after one more reformulation step. We always have A_{p+1}x = I_N x = y_{p+1} for the Euclidean projection that uses the squared `2-distance ½‖x − m‖_2^2. With these new definitions, problem (4.15) becomes

\min_{x, \tilde{y}} \tilde{f}(\tilde{y}) \quad \text{s.t.} \quad \tilde{A}x = \tilde{y}.  (4.18)

This formulation has the same form as the problems that regular ADMM solves, that is, min_{x,y} f(x) + g(y) s.t. Ax + By = c. It follows that we can guarantee convergence under the same conditions as for ADMM. According to [Boyd et al., 2011, Eckstein and Yao, 2015], ADMM converges when f(x) : R^{N_1} → R ∪ {+∞} and g(y) : R^{N_2} → R ∪ {+∞} are proper and convex. The linear equality constraints involve matrices A ∈ R^{M×N_1} and B ∈ R^{M×N_2} and vectors x ∈ R^{N_1}, y ∈ R^{N_2}, and c ∈ R^M.

To arrive at the main iterations of PARSDMM, we now derive an algorithm for the projection problem stated in (4.18), based on the augmented Lagrangian

L_{\rho_1,\dots,\rho_{p+1}}(x, y_1, \dots, y_{p+1}, v_1, \dots, v_{p+1}) = \sum_{i=1}^{p+1} \Big[ f_i(y_i) + v_i^\top (y_i - A_i x) + \frac{\rho_i}{2}\|y_i - A_i x\|_2^2 \Big].  (4.19)

As we can see, this expression has a separable structure with respect to the Lagrangian multipliers v_i and the auxiliary vectors y_i. Following the ADMM variants for multiple functions, as formulated by [Song et al., 2016, Kitic et al., 2016, Xu et al., 2017c], we use a different penalty parameter ρ_i > 0 for each index i. In this way, we make sure all linear equality constraints A_i x = y_i are satisfied sufficiently when running a limited number of iterations. Because the different matrices A_i may have widely varying scalings and sizes, a fixed penalty for all i could cause slow convergence of x towards one of the constraint sets. Additionally, to further accelerate the algorithm, we also introduce a different relaxation parameter (γ_i) for each index i. After we derive the main steps of our proposed algorithm, we describe the automatic selection of the scalar parameters.

The iterations

With the above definitions, iteration counter k, and inclusion of relaxation parameters, which we assume to be limited to the interval γ_i ∈ [1, 2) [see Xu et al., 2017b], the iterations can be written as

x^{k+1} = \arg\min_x \sum_{i=1}^{p+1} \frac{\rho_i^k}{2} \Big\| y_i^k - A_i x + \frac{v_i^k}{\rho_i^k} \Big\|_2^2
\bar{x}_i^{k+1} = \gamma_i^k A_i x^{k+1} + (1 - \gamma_i^k) y_i^k
y_i^{k+1} = \arg\min_{y_i} \Big[ f_i(y_i) + \frac{\rho_i^k}{2} \Big\| y_i - \bar{x}_i^{k+1} + \frac{v_i^k}{\rho_i^k} \Big\|_2^2 \Big]
v_i^{k+1} = v_i^k + \rho_i^k (y_i^{k+1} - \bar{x}_i^{k+1}).

To arrive at our final algorithm, we rewrite these iterations in the more explicit form

x^{k+1} = \Big[ \sum_{i=1}^{p} \rho_i^k A_i^\top A_i + \rho_{p+1}^k I_N \Big]^{-1} \sum_{i=1}^{p+1} \Big[ A_i^\top (\rho_i^k y_i^k + v_i^k) \Big]
\bar{x}_i^{k+1} = \gamma_i^k A_i x^{k+1} + (1 - \gamma_i^k) y_i^k
y_i^{k+1} = \text{prox}_{f_i, \rho_i^k} \Big( \bar{x}_i^{k+1} - \frac{v_i^k}{\rho_i^k} \Big)
v_i^{k+1} = v_i^k + \rho_i^k (y_i^{k+1} - \bar{x}_i^{k+1}).  (4.20)

In this expression, we used the fact that A_{p+1} is always the identity matrix of size N for projection problems. Without over/under-relaxation [the x̄_i^{k+1} computation, Eckstein and Bertsekas, 1992, Iutzeler and Hendrickx, 2017, Xu et al., 2017b], these iterations are known as SALSA [Afonso et al., 2011] or the simultaneous direction method of multipliers [SDMM, Combettes and Pesquet, 2011, Kitic et al., 2016].
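The following Julia sketch spells out the iterations (4.20) for convex sets under simplifying assumptions: fixed ρ_i and γ_i, a direct solve in place of the warm-started inexact CG discussed below, and user-supplied pairs of sparse operators A[i] and closed-form projectors P[i], where A[p+1] = I_N and P[p+1] is the proximal map of the squared distance from m (its closed form appears in the next subsection). All names are ours; this is an illustration, not the package interface.

using LinearAlgebra, SparseArrays

function sdmm_project(m, A, P; rho=ones(length(A)), gamma=ones(length(A)), maxit=200)
    n = length(A)                               # n = p + 1 operators
    y = [A[i] * m for i in 1:n]
    v = [zeros(size(A[i], 1)) for i in 1:n]
    # normal-equations system matrix of the x-minimization, cf. (4.21)
    C = sum(rho[i] * (A[i]' * A[i]) for i in 1:n)
    x = copy(m)
    for k in 1:maxit
        rhs = sum(A[i]' * (rho[i] * y[i] + v[i]) for i in 1:n)
        x = C \ rhs                             # PARSDMM instead uses warm-started CG
        for i in 1:n                            # PARSDMM computes this loop in parallel
            s    = A[i] * x
            xbar = gamma[i] * s + (1 - gamma[i]) * y[i]
            y[i] = P[i](xbar - v[i] / rho[i])   # projection = prox of the indicator
            v[i] = v[i] + rho[i] * (y[i] - xbar)
        end
    end
    return x
end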
The derivation in this section shows that ADMM/SDMM solves the projection onto an intersection of multiple closed and convex sets. However, the basic iterations (4.20) are not yet a practical and fast algorithm: there are scalar parameters that need to be selected, there are no stopping conditions, and there are no specializations to the constraints typically found in the imaging sciences. Therefore, we add to the iterations (4.20) automatic scalar parameter selection, inexact linear-system solves, stopping conditions, and a multilevel acceleration specialized to computing projections onto intersections of many sets.

Computing the proximal maps

The proximal maps in the iterations (4.20) become projections onto simple sets (e.g., bounds/`1 and `2 norm-ball/cardinality/rank), which permit closed-form solutions that do not depend on the ρ_i. When f_{p+1}(w) = ½‖w − m‖_2^2 (the squared `2-distance of w to the reference vector m), the proximal map is also available in closed form:

\text{prox}_{f_{p+1}, \rho_{p+1}}(w) = \arg\min_z \frac{1}{2}\|z - m\|_2^2 + \frac{\rho_{p+1}}{2}\|z - w\|_2^2 = \frac{m + \rho_{p+1} w}{1 + \rho_{p+1}}.

We thus avoid sub-problems for projections that would require other convex optimization algorithms for their solution.

Solving the linear system and automatic parameter selection

We can also see from (4.20) that the computation of x^{k+1} involves the solution of a single problem where all linear operators are summed into one system of normal equations. The system matrix equals

C \equiv \sum_{i=1}^{p+1} \rho_i A_i^\top A_i = \sum_{i=1}^{p} \rho_i A_i^\top A_i + \rho_{p+1} I_N  (4.21)

and is by construction always positive-definite, because ρ_i > 0 for all i. The minimization over x is therefore uniquely defined. As suggested by Xu et al. [2017a], we adapt the ρ_i's every two iterations using the scheme we discuss below.

While we could use direct matrix factorizations of C, we would need to refactorize every time we update any of the ρ_i's. This would make computing x^{k+1} too costly. Instead, we rely on warm-started iterative solvers with x^k used as the initial guess for x^{k+1}. There exist several alternatives, including LSQR [Paige and Saunders, 1982], to solve the above linear system (the x^{k+1} computation in (4.20)) iteratively. We choose the conjugate-gradient (CG) method on the normal equations for the following reasons:

1. Contrary to LSQR, transforms that satisfy A_i^⊤A_i = αI_N are free for CG, because we explicitly form the sparse system matrix C, which already includes the identity matrix.

2. By limiting the relative difference between the ρ_i and ρ_{p+1}, where the latter corresponds to the identity matrix in (4.21), we ensure C is sufficiently well conditioned, so squaring the condition number does not become a problem.

3. For many transforms, the matrices A_i^⊤A_i are sparse and have at least partially overlapping sparsity patterns (discrete derivative matrices for one or more directions, orthogonal transforms). Multiplication with Σ_{i=1}^{p+1} ρ_i A_i^⊤A_i is therefore not much more expensive than multiplication with a single A_i^⊤A_i. However, LSQR requires matrix-vector products with all A_i and A_i^⊤ at every iteration.

4. Full reassembly of C at iteration k is not required. Every time we update any of the ρ_i's, we update C by subtracting and adding the block corresponding to the updated ρ_i.
If the index that changed is denoted by i = u, the system matrix for the next x^{k+1} computation becomes

C^{k+1} = \sum_{i=1}^{p+1} \rho_i^{k+1} A_i^\top A_i = \sum_{i=1}^{p+1} \rho_i^k A_i^\top A_i - \rho_u^k A_u^\top A_u + \rho_u^{k+1} A_u^\top A_u = C^k + A_u^\top A_u (\rho_u^{k+1} - \rho_u^k).  (4.22)

For each ρ_i update, forming the new system matrix involves a single addition of two sparse matrices (assuming all A_i^⊤A_i's are pre-computed).

To further save computation time, we solve the minimization with respect to x inexactly. We select the stopping criterion for CG adaptively in terms of the relative residual of the normal equations, that is, we stop CG once the relative residual drops below

0.1\, \frac{\Big\| \Big[ \sum_{i=1}^{p} \rho_i^k A_i^\top A_i + \rho_{p+1}^k I_N \Big] x - \sum_{i=1}^{p+1} \Big[ A_i^\top (\rho_i^k y_i^k + v_i^k) \Big] \Big\|_2}{\Big\| \sum_{i=1}^{p+1} \Big[ A_i^\top (\rho_i^k y_i^k + v_i^k) \Big] \Big\|_2}.  (4.23)

Empirically, we found that a reduction of the relative residual by a factor of ten represents a robust choice that also results in time savings for solving problem (4.18), compared to a fixed and accurate stopping criterion for the x-minimization step. The stopping criterion for CG is relatively inexact during the first few iterations of (4.20) and requests more accurate solutions later on, such that the conditions on inexact sub-problem solutions from [Eckstein and Bertsekas, 1992] will be satisfied eventually.

Just like standard ADMM, we may also require a large number of iterations of (4.20) when the penalty parameters ρ_i are fixed for all i [e.g., Nishihara et al., 2015, Xu et al., 2017a]. It is better to update ρ_i^k and γ_i^k every couple of iterations to ensure we reach a good solution in a relatively small number of iterations. For this purpose, we use Xu et al. [2017a]'s automatic selection of ρ_i^k and γ_i^k for ADMM. Numerical experiments by Xu et al. [2016] show that these updates also perform well on various non-convex problems. The updates themselves are based on a Barzilai-Borwein spectral step size [Barzilai and Borwein, 1988] for Douglas-Rachford (DR) splitting applied to the dual of min_{x,y} f(x) + g(y) s.t. Ax + By = c, and derive from the equivalence between ADMM and DR on the dual [Eckstein and Bertsekas, 1992, Esser, 2009].

Exploiting parallelism

Given the grid size of 3D PDE-based parameter estimation problems, performance is essential. For this reason, we seek a parallel implementation that exploits the multi-threading offered by modern programming languages such as Julia [Bezanson et al., 2017]. Since the computational time for the x-minimization using the conjugate-gradient algorithm is dominated by the matrix-vector products (MVPs) with C, we concentrate our efforts there by using compressed diagonal storage (CDS); see, e.g., [Saad, 1989, Sern et al., 1990, Kotakemori et al., 2008]. This format stores the non-zero bands of the matrix as a dense matrix, and we compute MVPs directly in this storage scheme. These MVPs are faster than those in the more general compressed sparse column (CSC) format. CDS has the additional benefit that it can efficiently handle matrices generated by spatially varying (blurring, derivative) kernels. We can use CDS if all matrices A_i^⊤A_i have a banded sparsity pattern. Using Julia's multi-threading, we compute the MVPs with C in parallel. In cases where the A_i^⊤A_i's do not have a banded structure, we revert to computations in the standard CSC format.
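To sketch why CDS helps, the following hypothetical Julia routine computes a banded MVP directly in diagonal storage; `diags` holds one non-zero band per column, `offsets` holds the band offsets, and the convention is diags[i, j] = C[i, i + offsets[j]] (all names ours). The inner loop is unit-stride over dense columns, which vectorizes and threads well.

# y = C*x for a banded N-by-N matrix stored band-by-band (compressed diagonal storage).
function cds_mvp!(y, diags::Matrix{Float64}, offsets::Vector{Int}, x)
    N = length(x)
    fill!(y, 0.0)
    for (j, off) in enumerate(offsets)          # one dense column per band
        Threads.@threads for i in max(1, 1 - off):min(N, N - off)
            @inbounds y[i] += diags[i, j] * x[i + off]
        end
    end
    return y
end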
Aside from the matrix-vector products during the inner iterations, most calculation time in (4.20) is spent on x̄_i^{k+1}, y_i^{k+1}, v_i^{k+1}, ρ_i^{k+1}, and γ_i^{k+1}. To reduce these costs, we compute these quantities in parallel. This is relatively straightforward to do because each problem is independent, so the operations for the p constraints can be carried out by different Julia workers, where each worker either uses Julia threads, multi-threaded BLAS [OpenBLAS, Wang et al., 2013], or multi-threaded Fourier transforms [FFTW library, Frigo and Johnson, 2005].

Stopping conditions

So far, we focused on reducing the time per iteration of (4.20). However, the total computational time depends on the total number of iterations and therefore on the stopping conditions. For our problems, a good stopping criterion guarantees solutions that are close to all constraint sets and at a minimal distance from the point we want to project. When working with a single constraint set, stopping criteria based on a combination of the primal residual r^pri = ‖ỹ − Ãx^k‖ and the dual residual r^dual = ‖ρ̃Ã^⊤(ỹ^k − ỹ^{k−1})‖ are adequate, as long as both become sufficiently small [e.g., Boyd et al., 2011, Kitic et al., 2016, Xu et al., 2017a]. However, the situation is more complicated when we work with multiple constraint sets. In that case, ỹ and Ã contain a variety of vectors and linear operators that correspond to the different constraint sets. Since these operators are scaled differently and have different dimensions, it becomes more difficult to determine the relationship between the size of the residuals and the accuracy of the solution. In other words, it becomes challenging to decide at what primal and dual residual to stop such that we are close to all constraint sets.

Instead of considering residuals, it may be more intuitive to look at feasibilities by dropping the quadratic part of the projection problem (4.15). This means that we only insist that the final solution is an element of every set V_i when considering our stopping criterion. This holds if x is in the intersection of the constraint sets, but it requires projections onto each V_i to verify, a situation we want to avoid in PARSDMM. Instead, we rely on the transform-domain set feasibility error

r_i^{\text{feas}} = \frac{\|A_i x - P_{C_i}(A_i x)\|}{\|A_i x\|}, \quad i = 1, \dots, p,  (4.24)

to which we have access at a relatively low cost, since we already computed A_i x in the iterations (4.20). Our first stopping criterion thus corresponds to a normalized version of the objective of convex multiple-set split-feasibility problems [Censor et al., 2005]. We added the normalization in (4.24) to account for the different scalings and sizes of the linear operators A_i.

The projections onto the constraint sets, P_{C_i}(·), are themselves relatively cheap to compute, since they only include projections onto sets such as norm-balls, bounds, and cardinality sets. By testing for transform-domain feasibility only every few iterations (typically 5 or 10), we further reduce the computational cost of our stopping condition.

Satisfying the constraints for i = 1, …, p alone does not indicate whether x^k is close to the projection onto the intersection of the p different constraint sets or whether it is just a feasible point, possibly 'deep' inside the intersection. If x^k is indeed the result of the projection of m, then ‖x^k − x^{k−1}‖ approaches a stationary point, assuming that x^k converges to the projection. We make this property explicit by considering the maximum relative change of x^k over the s previous iterations, j ∈ S ≡ {1, 2, …, s}. The relative evolution of x at the kth iteration thus becomes

r^{\text{evol}} = \frac{\max_{j \in S} \{ \|x^k - x^{k-j}\| \}}{\|x^k\|}.  (4.25)

By considering the history (we use s = 5 in our numerical examples), our stopping criterion becomes more robust to oscillations in ‖x^k − x^{k−1}‖ as a function of k. We therefore propose to stop PARSDMM if

r^{\text{evol}} < \varepsilon^{\text{evol}} \quad \text{and} \quad r_i^{\text{feas}} < \varepsilon^{\text{feas}} \; \forall i.  (4.26)

During our numerical experiments, we select ε^evol = 10^{-2} and ε^feas = 10^{-3}, which balance sufficiently accurate solutions and short solution times.
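A compact Julia sketch of the combined test (4.24)-(4.26), assuming the products s[i] = A_i x are already available from the iterations and P[i] are the simple-set projectors (names ours):

using LinearAlgebra

# x_hist holds the s = 5 previous iterates; returns true if PARSDMM may stop.
function stop_parsdmm(x, x_hist, s, P; eps_evol=1e-2, eps_feas=1e-3)
    r_evol = maximum(norm(x - xj) for xj in x_hist) / norm(x)              # (4.25)
    r_feas = [norm(s[i] - P[i](s[i])) / norm(s[i]) for i in eachindex(s)]  # (4.24)
    return r_evol < eps_evol && all(r_feas .< eps_feas)                    # (4.26)
end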
These are still two constants to be chosen by the user, but we argue that r_i^feas may relate better to our intuition on feasibility, because it behaves like a distance to each set separately. The evolution term ‖x^k − x^{k−1}‖ is found in many optimization algorithms and is especially informative for physical parameter estimation problems, where practitioners often have a good intuition for the magnitude of ‖x^k − x^{k−1}‖ to which the physical forward model f(x) is sensitive.

The PARSDMM algorithm

We summarize the discussion from the previous sections in Algorithm 3; the automatic parameter adaptation that it calls is given in Algorithm 4.

Algorithm 3 Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM) to compute the projection onto an intersection, including automatic selection of the penalty parameters and relaxation.

inputs:
  m  //point to project
  A_1, A_2, …, A_p, A_{p+1} = I_N  //linear operators
  prox_{f_i,ρ_i}(w) = P_{C_i}(w) for i = 1, 2, …, p  //norm/bound/cardinality/... projectors
  prox_{f_{p+1},ρ_{p+1}}(w) = (m + ρ_{p+1}w)/(1 + ρ_{p+1})  //prox for the squared distance from m
  select ρ_i^0, γ_i^0, update-frequency
  optional: initial guesses for x, y_i, and v_i
initialize:
  B_i = A_i^⊤A_i  //pre-compute for all i
  C = Σ_{i=1}^{p+1} ρ_i B_i  //pre-compute
  k = 1
WHILE not converged
  x^{k+1} = C^{-1} Σ_{i=1}^{p+1} [A_i^⊤(ρ_i^k y_i^k + v_i^k)]  //CG, stop when (4.23) holds
  FOR i = 1, 2, …, p+1  //compute in parallel
    s_i^{k+1} = A_i x^{k+1}
    x̄_i^{k+1} = γ_i^k s_i^{k+1} + (1 − γ_i^k) y_i^k
    y_i^{k+1} = prox_{f_i,ρ_i}(x̄_i^{k+1} − v_i^k/ρ_i^k)
    v_i^{k+1} = v_i^k + ρ_i^k (y_i^{k+1} − x̄_i^{k+1})
    stop if conditions (4.26) hold
    If mod(k, update-frequency) = 1
      {ρ_i^{k+1}, γ_i^{k+1}} = adapt-rho-gamma(v_i^k, v_i^{k+1}, y_i^{k+1}, s_i^{k+1}, ρ_i^k)
    End if
  END
  FOR i = 1, 2, …, p+1  //update C if necessary
    If ρ_i^{k+1} ≠ ρ_i^k
      C ← C + B_i(ρ_i^{k+1} − ρ_i^k)
    End if
  END
  k ← k + 1
END
output: x
Algorithm 4 Adapt ρ and γ according to [Xu et al., 2017b], with some modifications to save computational work. The constant ε_corr is in the range [0.1, 0.4], as suggested by [Xu et al., 2017b]. Quantities from the previous call to adapt-rho-gamma carry the index k0. The actual implementation computes and re-uses some of the inner products and norms.

Algorithm adapt-rho-gamma
input: v_i^k, v_i^{k+1}, y_i^{k+1}, s_i^{k+1}, ρ_i^k
  ε_corr = 0.3
  v̂^{k+1} = v_i^k + ρ_i^k (y_i^k − s_i^{k+1})
  Δv̂ = v̂_i^{k+1} − v̂^{k0}
  Δv = v_i^{k+1} − v^{k0}
  Δĥ = s_i^{k+1} − s^{k0}
  Δĝ = −(y_i^{k+1} − y^{k0})
  α_corr = Δĥ^⊤Δv̂ / (‖Δĥ‖‖Δv̂‖)
  β_corr = Δĝ^⊤Δv / (‖Δĝ‖‖Δv‖)
  If α_corr > ε_corr
    α̂_MG = Δĥ^⊤Δv̂ / Δĥ^⊤Δĥ,  α̂_SD = Δv̂^⊤Δv̂ / Δĥ^⊤Δv̂
    α̂ = α̂_MG if 2α̂_MG > α̂_SD, else α̂ = α̂_SD − 0.5 α̂_MG
  End
  If β_corr > ε_corr
    β̂_MG = Δĝ^⊤Δv / Δĝ^⊤Δĝ,  β̂_SD = Δv^⊤Δv / Δĝ^⊤Δv
    β̂ = β̂_MG if 2β̂_MG > β̂_SD, else β̂ = β̂_SD − 0.5 β̂_MG
  End
  {ρ^{k+1}, γ^{k+1}} =
    {√(α̂β̂), 1 + 2√(α̂β̂)/(α̂ + β̂)}  if α_corr > ε_corr & β_corr > ε_corr
    {α̂, 1.9}  if α_corr > ε_corr & β_corr ≤ ε_corr
    {β̂, 1.1}  if α_corr ≤ ε_corr & β_corr > ε_corr
    {ρ^k, 1.5}  if α_corr ≤ ε_corr & β_corr ≤ ε_corr
  set and save for the next call to adapt-rho-gamma:
    v̂^{k0} ← v̂_i^{k+1}, v^{k0} ← v_i^{k+1}, s^{k0} ← s_i^{k+1}, y^{k0} ← y_i^{k+1}
output: ρ_i^{k+1}, γ_i^{k+1}

4.3.1 Multilevel PARSDMM

Inverse problems with data-misfit objectives that include PDE forward models typically need a fine grid for stable physical simulations. At the same time, we often use constraints to estimate 'simple' models, that is, models that are smooth, have a low rank, or are sparse in some transform-domain, and that may not need many grid points for an accurate representation of the image/model. This suggests that we can reduce the total computational time of PARSDMM (Algorithm 3) by using a multilevel continuation strategy. The multilevel idea presented in this section applies to the projection onto the intersection of constraint sets only, and not to the grids for solving PDEs. Our approach proceeds as follows: we start at a coarse grid and continue towards finer grids. While inspired by multigrid methods for solving linear systems, the proposed multilevel algorithm does not cycle between coarse and fine grids. Because we use the solution at the coarse grid as the initial guess for the solution on the finer grid, the convergence guarantees are the same as for the single-level version of our algorithm. As long as the computationally cheap coarse-grid solutions are 'good' initial guesses for the finer grids, this multilevel approach, which is similar to multilevel ADMM by Macdonald and Ruthotto [2018], can lead to substantial reductions in computational cost, as we will demonstrate in the numerical example of the next section.

To arrive at a workable multilevel implementation of Algorithm 3, we need to concern ourselves with the initialization of the ADMM-type iterations and with initial guesses for x and y_i, v_i for all i ∈ {1, …, p, p+1}. After initialization of the coarsest grid with all-zero vectors, we move to a finer grid by interpolating x and all y_i, v_i. Since the solution estimate x ∈ R^N always refers to an image or a tensor, we are free to reshape it and interpolate it to a finer grid. The situation for the vectors v_i and y_i is a bit more complicated, as their dimensions depend on the corresponding A_i. To handle these, we do a relatively simple interpolation.

Example. When A_i is a discrete derivative matrix, the vectors v_i and y_i live on a grid that we know at every level of the multilevel scheme. If we have A_i = D_z ⊗ I_x, where D_z is the first-order finite-difference matrix as in (4.8), we know that A_i ∈ R^{((nz−1)nx)×(nz nx)} and therefore v_i ∈ R^{(nz−1)nx} and y_i ∈ R^{(nz−1)nx}. We can thus reshape the associated vectors v_i and y_i as an image (in 2D) of size (nz − 1) × nx and interpolate it to the finer grid of the next level, working from coarse to fine. In 3D, we follow the same approach. We also need a coarse version of m at each level: m_l for l = nlevels, nlevels − 1, …, 1. We simply obtain the coarse models by applying an anti-alias filter and subsampling the original m. In principle, any subsampling and interpolation technique may be used in this multilevel framework. Our numerical experiments interpolate to finer grids using the simple nearest-neighbor method. Numerical experiments with other types of interpolation did not show a reduction of the number of PARSDMM iterations at the finest grid.

We decide the number of levels (nlevels) and the coarsening factor ahead of time. Together with the original grid, these determine the grid at all levels, so we can set up the linear operators and proximal mappings at each level. This set-up phase is a one-time cost, since its result is reused every time we project a model m onto the intersection of constraint sets. The additional computational cost of the multilevel scheme is the interpolation of x and all y_i, v_i to a finer grid, but this happens only once per level and not at every ML-PARSDMM (Algorithm 5) iteration.
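For example (names ours), nearest-neighbor coarse-to-fine transfer for a factor-two coarsening reduces to index repetition after reshaping the vector to its grid; for the y_i and v_i associated with A_i = D_z ⊗ I_x, that grid is (nz − 1) × nx:

# Nearest-neighbor upsampling of a vectorized nz-by-nx field by a factor of two.
function nn_upsample(xc::AbstractVector, nz::Int, nx::Int)
    Xc = reshape(xc, nz, nx)
    Xf = Xc[repeat(1:nz, inner=2), repeat(1:nx, inner=2)]   # (2nz)-by-(2nx)
    return vec(Xf)
end

# e.g., transfer the auxiliary variable of the vertical-derivative block:
# y_fine = nn_upsample(y_coarse, nz - 1, nx)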
So the computational overhead we incur from the interpolations is small compared to the speedup of Algorithm 5.

Algorithm 5 Multilevel PARSDMM to compute the projection onto an intersection using a multilevel strategy.

inputs:
  nlevels  //number of levels
  l = {nlevels, nlevels − 1, …, 1}
  grid_l  //grid information at each level l
  m_l  //model to project at every level l
  A_{1,l}, A_{2,l}, …, A_{p+1,l}  //linear operators at every level
  prox_{f_{i,l},ρ_i}(w) = P_{C_i}(w) for i = 1, 2, …, p  //norm/bound/cardinality/... projectors at each level
  prox_{f_{p+1,l},ρ_{p+1}}(w) = (m_l + ρ_{p+1}w)/(1 + ρ_{p+1})  //proximal map for the squared distance from m at each level
//start at the coarsest grid
FOR l = nlevels, nlevels − 1, …, 1
  (x_l, {y_{i,l}}, {v_{i,l}}) = PARSDMM(m_l, {A_{i,l}}, {prox_{f_{i,l},ρ_i}}, x_l, {y_{i,l}}, {v_{i,l}})  //solve on current grid
  x_l → x_{l−1}  //interpolate to the finer grid
  FOR i = 1, 2, …, p+1
    y_{i,l} → y_{i,l−1}  //interpolate to the finer grid
    v_{i,l} → v_{i,l−1}  //interpolate to the finer grid
  END
END
output: x at the original grid (level 1)

4.4 Software and numerical examples

The software corresponding to this paper is available at https://github.com/slimgroup. The main design principles of our code implementing the PARSDMM algorithm include (i) performance: it needs to scale to imposing multiple constraints on 3D models of up to at least 300^3 grid points; (ii) specialization to the specific and fixed problem structure (4.14); and (iii) flexibility to work with multiple linear operators and projectors. Because of these design choices, the user only needs to provide the model to project, m, and pairs of linear operators and projectors onto simple sets: {(A_1, P_{C_1}), (A_2, P_{C_2}), …, (A_p, P_{C_p})}. The software adds the identity matrix and the proximal map for the squared distance from m. These are all the computational components required to solve intersection projection problems as formulated in (4.16).

To reap the benefits of modern programming language design, including just-in-time compilation, multiple dispatch, and mixing distributed and multi-threaded computations, we wrote our software package in Julia 0.6. Our code uses parametric typing, which means that the same scripts can run in Float32 (single) and Float64 (double) precision. As expected, most components of our software run faster in Float32, with reduced memory consumption. The timings in the following examples use Float32.

We provide scripts that set up the linear operators and projectors for regular grids in 2D and 3D. It is not necessary to use these scripts, as the solver is agnostic to the specific construction of the projectors or linear operators. Table 4.1 displays the constraints we currently support. For example, when the user requests the script to set up minimum and maximum bounds on the discrete gradient in the z-direction of the model, the script returns the discrete derivative matrix A = I_x ⊗ D_z and a function P_bounds(·) that projects the input onto the bounds. The software currently supports the identity matrix, matrices representing the discrete gradient, and operators that we apply matrix-free: the discrete cosine/Fourier/wavelet/curvelet [Ying et al., 2005] transforms.

For the special case of orthogonal linear operators, we leave the linear operator inside the set definition, because we know the projection onto V in closed form. For example, if V = {x | ‖Ax‖_1 ≤ σ} with discrete Fourier transform (DFT) matrix A ∈ C^{N×N}, the projection is known in closed form as P_V(x) = A^* P_{‖·‖_1 ≤ σ}(Ax), where ^* denotes the complex-conjugate transpose and P_{‖·‖_1 ≤ σ} is the projection onto the `1-ball.
We do this to keep all other computations in PARSDMM (Algorithm 3) real, because complex-valued vectors require more storage and would slow down most computations.

Table 4.1: Overview of constraint sets that the software currently supports. A new constraint requires the projector onto the set (without linear operator) and a linear operator or an equivalent matrix-vector product together with its adjoint. Vector entries are indexed as m[i].

description                       | set
bounds                            | {m | l[i] ≤ m[i] ≤ u[i]}
transform-domain bounds           | {m | l[i] ≤ (Am)[i] ≤ u[i]}
transform-domain `1               | {m | ‖Am‖_1 ≤ σ}
transform-domain `2               | {m | ‖Am‖_2 ≤ σ}
transform-domain annulus          | {m | σ_l ≤ ‖Am‖_2 ≤ σ_u}
transform-domain nuclear norm     | {m | Σ_{j=1}^k λ[j] ≤ σ}, where Am = vec(Σ_{j=1}^k λ[j] u_j v_j^⊤) is the SVD
transform-domain cardinality      | {m | card(Am) ≤ k}, k is a positive integer
transform-domain rank             | {m | Am = vec(Σ_{j=1}^r λ[j] u_j v_j^⊤)}, r < min(nz, nx)
subspace constraints              | {m | m = Ac, c ∈ C^M}

As an example of our code, we show how to project a 2D model m onto the intersection of bound constraints and the set of models that have monotonically increasing parameter values in the z-direction.

using SetIntersectionProjection

# the following optional lines of code set up linear operators and projectors

# grid information ( (dz,dx), (nz,nx) )
comp_grid = compgrid( (25.0, 6.0), (341, 400) )

# initialize constraint information
constraint = Vector{SetIntersectionProjection.set_definitions}()

# set up bound constraints
m_min        = 1500.0          # minimum velocity
m_max        = 4500.0          # maximum velocity
set_type     = "bounds"        # bound constraint set
TD_OP        = "identity"      # identity in the set definition
app_mode     = ("matrix", "")  # bounds applied to the model as a matrix
custom_TD_OP = ([], false)     # no custom linear operators
push!(constraint, set_definitions(set_type, TD_OP, m_min, m_max, app_mode, custom_TD_OP))

# bounds on parameters in a transform-domain (vertical slope constraint)
m_min        = 0.0
m_max        = 1e6
set_type     = "bounds"
TD_OP        = "D_z"           # discrete derivative in the z-direction
app_mode     = ("matrix", "")
custom_TD_OP = ([], false)
push!(constraint, set_definitions(set_type, TD_OP, m_min, m_max, app_mode, custom_TD_OP))

options = PARSDMM_options()    # get default options

# get projectors onto simple sets, linear operators, set information
(P_sub, TD_OP, set_Prop) = setup_constraints(constraint, comp_grid, Float32)

# precompute and distribute quantities once, reuse later
(TD_OP, B) = PARSDMM_precompute_distribute(TD_OP, set_Prop, comp_grid, options)

# project onto the intersection
(x, log_PARSDMM) = PARSDMM(m, B, TD_OP, set_Prop, P_sub, comp_grid, options)

Our software also allows for the simultaneous use of constraints that apply to the 2D/3D model and constraints that apply to each column or row separately, except for sets based on the singular value decomposition. The linear operator remains the same if we define constraints for all rows, all columns, or both. The difference is that the projection onto a simple set is now applied to each row/column independently, in parallel via a multi-threaded loop.

4.4.1 Parallel Dykstra versus PARSDMM

One of our main goals was to create an algorithm that computes projections onto an intersection of sets with fewer manual tuning parameters and stopping conditions, and that is also faster than black-box type projection algorithms, such as parallel Dykstra's algorithm (see Appendix C). To see how the proposed PARSDMM algorithm compares to parallel Dykstra's algorithm, we need to set up a fair experimental setting that includes the sub-problem solver in parallel Dykstra's algorithm.
Fortunately, if we use Adaptive Relaxed ADMM (ARADMM) [Xu et al., 2017b] for the projection sub-problems of parallel Dykstra's algorithm, both PARSDMM (Algorithm 3) and parallel Dykstra-ARADMM have the same computational components. ARADMM also uses the same update scheme for the augmented-Lagrangian penalty and relaxation parameters as we use in PARSDMM. This similarity allows for a comparison of the convergence as a function of the basic computational components. We manually tuned the ARADMM stopping conditions to achieve the best overall performance for parallel Dykstra's algorithm.

The numerical experiment is the projection of a 2D geological model (341 × 400 pixels) onto the intersection of three constraint sets that are of interest to the seismic imaging examples by [Esser et al., 2016, Yong et al., 2018] and in Chapter 3:

1. {m | σ_1 ≤ m[i] ≤ σ_2}: bound constraints
2. {m | ‖Am‖_1 ≤ σ} with A = [(I_x ⊗ D_z)^⊤ (D_x ⊗ I_z)^⊤]^⊤: anisotropic total-variation constraints
3. {m | 0 ≤ ((I_x ⊗ D_z)m)[i] ≤ ∞}: vertical monotonicity constraints

For these sets, the primary computational components are (i) matrix-vector products in the conjugate-gradient algorithm. The system matrix has the same sparsity pattern as A^⊤A, because the sparsity patterns of the linear operators in sets one and three overlap with the pattern of A^⊤A. Parallel Dykstra uses matrix-vector products with A^⊤A, (I_x ⊗ D_z)^⊤(I_x ⊗ D_z), and I in parallel. (ii) Projections onto the box constraint set and the `1-ball. Both parallel Dykstra's algorithm and PARSDMM compute these in parallel. (iii) Parallel communication that sends a vector from one to all parallel processes (x^{k+1} in Algorithm 3), and one map-reduce parallel sum that gathers the sum of vectors on all workers (the right-hand side of the x^{k+1} computation in Algorithm 3). The communication is the same for PARSDMM and parallel Dykstra's algorithm, so we ignore it in the experiments below.

Before we discuss the numerical results, we give some details on how we count the computational operations mentioned above:

• Matrix-vector products in CG: At each PARSDMM iteration, we solve a single linear system with the conjugate-gradient method. Parallel Dykstra's algorithm simultaneously computes three projections by running three instances of ARADMM in parallel. The projections onto sets two and three solve a linear system at every ARADMM iteration. For each parallel Dykstra iteration, we count the total number of sequential CG iterations, which is determined by the maximum number of CG iterations for either set two or set three.

• `1-ball projections: PARSDMM projects onto the `1-ball once per iteration. Parallel Dykstra projects (number of parallel Dykstra iterations) × (number of ARADMM iterations for set two) times onto the `1-ball. Because `1-ball projections are computationally more intensive (we use the algorithm from Duchi et al. [2008]; a sketch follows below) than projections onto the box (an element-wise comparison), and also less suitable for multi-threaded parallelization, we focus on the `1-ball projections.
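For reference, a compact Julia implementation of the sort-based `1-ball projection of Duchi et al. [2008] (the function name is ours):

using LinearAlgebra

# Euclidean projection of v onto {x | ‖x‖₁ ≤ σ}.
function project_l1ball(v::AbstractVector, σ::Real)
    norm(v, 1) <= σ && return copy(v)                     # already feasible
    u = sort(abs.(v); rev=true)                           # sorted magnitudes
    css = cumsum(u)
    ρ = findlast(j -> u[j] > (css[j] - σ) / j, eachindex(u))
    θ = (css[ρ] - σ) / ρ                                  # soft-threshold level
    return sign.(v) .* max.(abs.(v) .- θ, 0.0)
end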
The results in Figure 4.1 show that PARSDMM requires far fewer CG iterations and `1-ball projections to achieve the same relative transform-domain set feasibility error, as defined in equation (4.24). In contrast to the curves corresponding to parallel Dykstra's algorithm, we see that PARSDMM converges in an oscillatory fashion, which is caused by the changing relaxation and augmented-Lagrangian penalty parameters.

Figure 4.1: Relative transform-domain set feasibility (equation 4.24) as a function of the number of conjugate-gradient iterations and projections onto the `1-ball. This figure also shows the relative change per iteration in the solution x.

Because non-convex sets are an important application for us, we compare the performance for a non-convex intersection as well:

1. {m | σ_1 ≤ m[i] ≤ σ_2}: bound constraints
2. {m | (I_x ⊗ D_z)m = vec(Σ_{j=1}^r λ[j] u_j v_j^*)}, where r < min(nz, nx), the λ[j] are the singular values, and u_j, v_j are the singular vectors: rank constraints on the vertical gradient of the image

We count the computational operations in the same way as in the previous example, but this time the computationally most costly projection is the projection onto the set of matrices of limited rank via the singular value decomposition. The results in Figure 4.2 show that the convergence of parallel Dykstra's algorithm almost stalls: the solution estimate gets closer to satisfying the bound constraints, but there is hardly any progress towards the rank constraint set. PARSDMM does not seem to suffer from non-convexity in this particular example.

We used the single-level version of PARSDMM so that we can compare the computational cost with parallel Dykstra. The PARSDMM results in this section are therefore pessimistic in general, as the multilevel version can offer additional speedups, which we show next.

Figure 4.2: Relative transform-domain set feasibility (equation 4.24) as a function of the number of conjugate-gradient iterations and projections onto the set of matrices of limited rank via the SVD. This figure also shows the relative change per iteration in the solution x.

4.4.2 Timings for 2D and 3D projections

The proposed PARSDMM algorithm (Algorithm 3) is suitable for small 2D models (≈ 50^2 pixels) all the way up to large 3D models (at least 300^3). To get an idea of solution times versus model size, as well as of how beneficial the parallelism and multilevel continuation are, we show timings for projections of geological models onto two different intersections for the four modes of operation: PARSDMM, parallel PARSDMM, multilevel PARSDMM, and multilevel parallel PARSDMM. As we mentioned, the multilevel version has a small additional overhead compared to single-level PARSDMM because of one interpolation procedure per level. Parallel PARSDMM has communication overhead compared to serial PARSDMM. However, serial PARSDMM still uses multi-threading for the projections, the matrix-vector product in the conjugate-gradient method, and BLAS operations, but the y_i and v_i computations in Algorithm 3 remain sequential over i = 1, 2, …, p, p+1, contrary to parallel PARSDMM. We carry out our computations on a dedicated cluster node with 2 CPUs per node and 10 cores per CPU (Intel Ivy Bridge 2.8 GHz E5-2680v2) and 128 GB of memory per node.

The following sets are used in Chapter 3 to regularize a geophysical inverse problem and form the intersection for our first test case:

1. {m | σ_1 ≤ m[i] ≤ σ_2}: bound constraints
2. {m | −σ_3 ≤ ((D_x ⊗ I_z)m)[i] ≤ σ_3}: lateral smoothness constraints. There are two of these constraints in the 3D case: one each for the x and y directions.
3. {m | 0 ≤ ((I_x ⊗ D_z)m)[i] ≤ ∞}: vertical monotonicity constraints

The results in Figure 4.3 show that the multilevel strategy is much faster than the single-level version of PARSDMM. The multilevel overhead costs are thus small compared to the speedup. The figure also shows that, as expected, the parallel versions require some communication time, so the problems need to be large enough for the parallel version of PARSDMM to offer speedups compared to its serial counterpart.

Figure 4.3: Timings for a 2D and a 3D example where we project a geological model onto the intersection of bounds, lateral smoothness, and vertical monotonicity constraints.

The previous example uses four constraint sets that each use a different linear operator, but all of them are a type of bound constraint. The y_i computation (projection onto a simple set in closed form) in PARSDMM (Algorithm 3) is therefore fast for all sets. As a result, parallel PARSDMM should lead to a speedup compared to serial computation of all y_i, as we verify in Figure 4.3. We now show an example where one of the sets uses a much more time-consuming y_i computation than the other set, which leads to the expectation that parallel PARSDMM offers only minor speedups compared to serial PARSDMM. The second intersection onto which we project consists of:

1. {m | σ_1 ≤ m[i] ≤ σ_2}: bound constraints
2. {m | ‖(I_x ⊗ I_y ⊗ D_z)m‖_1 ≤ σ_3}, with σ_3 set to 50% of the value for the true model m^*: σ_3 = 0.5‖(I_x ⊗ I_y ⊗ D_z)m^*‖_1: directional anisotropic total-variation

Figure 4.4: Timings for a 3D example where we project a geological model onto the intersection of bound constraints and an `1-norm constraint on the vertical derivative of the image. Parallel computation of all y_i and v_i does not help in this case, because the `1-norm projection is much more time consuming than the projection onto the bound constraints. The time savings from the other computations in parallel are then canceled out by the additional communication time.

Figures 4.3 and 4.4 show that parallel computation of the y_i and v_i vectors in PARSDMM is not always beneficial; it depends on the number of constraint sets, the model size, and the time it takes to project onto each set.

4.4.3 Geophysical parameter estimation with constraints

Seismic full-waveform inversion (FWI) estimates rock properties (acoustic velocity in this example) from seismic signals (pressure) measured by hydrophones. FWI is a partial-differential-equation (PDE) constrained optimization problem where, after eliminating the PDE constraint, the simulated data, d_predicted(m) ∈ C^M, are connected nonlinearly to the unknown model parameters, m ∈ R^N. We assume that we know the source and receiver locations, as well as the source function. A classic example of an objective for FWI is the nonlinear least-squares misfit f(m) = ½‖d_obs − d_predicted(m)‖_2^2, which we use for this numerical experiment.

FWI is a problem hampered by local minima. Empirical evidence in [Esser et al., 2016, Yong et al., 2018] and in Chapters 2 and 3 suggests that we can mitigate issues with parasitic local minima by insisting that all model iterates be elements of the intersection of multiple constraint sets. This means that we add regularization to the objective f(m) : R^N → R in the form of multiple constraints, that is, we have

\min_m f(m) \quad \text{s.t.} \quad m \in V = \bigcap_{i=1}^{p} V_i.  (4.27)

While many choices exist to solve this constrained optimization problem, we use the spectral projected gradient (SPG) algorithm with a non-monotone line search [Birgin et al., 1999].
SPG uses information from the current and previous gradients of f(m) to approximate the action of the Hessian of f(m^k) with the scalar α: the Barzilai-Borwein step length. At iteration k, SPG updates the model iterate as follows:

m^{k+1} = (1 - \gamma) m^k + \gamma P_V\big(m^k - \alpha \nabla_m f(m^k)\big),  (4.28)

where the non-monotone line search determines γ ∈ (0, 1]. For our numerical experiment, this line search requires a function value lower than the maximum function value of the previous five iterations. We see that the model iterate m^k is, because of the projection onto V, feasible at every iteration. Moreover, m^k remains feasible for the line-search steps that estimate γ if we assume the initial point to be feasible and use convex sets only. In this case, the model iterates m^k and the trial points form a line segment. Because both endpoints are in a convex set, the m^{k+1} remain feasible. As a result, we need only a single projection onto the intersection of the different constraints (P_V) for each SPG iteration. We use PARSDMM (Algorithm 3) and multilevel PARSDMM (Algorithm 5) to compute this projection. The total number of SPG iterations plus line-search steps is limited to the relatively small number of ten, because these require the solution of multiple PDEs, which is computationally intensive, especially in 3D.
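A minimal Julia sketch of the update (4.28), assuming a user-supplied objective f, gradient g, and a projector P_V such as PARSDMM. For brevity, we replace the non-monotone line search of Birgin et al. [1999] by plain backtracking on γ, so this is an illustration rather than the solver used in the experiments; all names are ours.

using LinearAlgebra

function spg_step(m, m_prev, g, g_prev, P_V, f)
    s, y = m - m_prev, g - g_prev
    α = dot(s, s) / dot(s, y)              # Barzilai-Borwein step length
    p = P_V(m - α * g)                     # single projection per iteration
    γ = 1.0
    while f((1 - γ) * m + γ * p) > f(m) && γ > 1e-3
        γ /= 2                             # backtrack on γ ∈ (0, 1]
    end
    return (1 - γ) * m + γ * p             # the update (4.28)
end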
The experimental setting is as follows. The Helmholtz equation models the wave propagation in an acoustic model. The data acquisition system is a vertical-seismic-profiling experiment with sources at the surface and receivers in a well; see Figure 4.5. All boundaries are perfectly-matched layers (PML) that absorb outgoing waves as if the model were spatially unbounded. The challenges that we address by constraining the model parameters are: one-sided 'source illumination', which often leads to spurious artifacts in the source-receiver direction; a limited frequency range (3 − 10 Hertz); and the non-convexity of the data-misfit f(m). We use the software by Da Silva and Herrmann [2017] to simulate seismic data and to compute f(m) and ∇_m f(m).

This example illustrates that (a) adding multiple constraints results in better parameter estimation than one or two constraint sets for this example; (b) non-convex constraints connect more directly to certain types of prior knowledge about the model than convex sets do; (c) we can solve problems with non-convex sets reliably enough that the results almost satisfy all constraints; (d) multilevel PARSDMM for computing projections onto non-convex intersections performs better empirically than the single-level scheme.

The prior knowledge consists of: (a) minimum and maximum velocities (2350 − 2650 m/s); (b) the anomaly is rectangular, but we do not know its size, aspect ratio, or location.

Before we add multiple non-convex constraints, let us look at what happens with simple bound and total-variation constraints. Figure 4.5 shows the true model, the initial guess, and the estimated models for various combinations of constraints. The data acquisition geometry causes the model estimate with bound constraints to be an elongated diagonal anomaly that is incorrect in terms of size, shape, orientation, and parameter values. Anisotropic total-variation (TV) seems like a good candidate to promote 'blocky' model structures, but it may be difficult to select the total-variation constraint, i.e., the size of the TV-ball. The result in Figure 4.5(d) shows that even in the unusual case that we know and use a TV constraint equal to the TV of the true model, we obtain a model estimate with only minor improvements compared to the estimation with bounds only. While many of the oscillations outside of the rectangular anomaly are damped, the shape of the anomaly itself is still far from the truth.

As we will demonstrate, the inclusion of multiple non-convex cardinality and rank constraints helps the parameter estimation in this example. From the prior information that the anomaly is rectangular and aligned with the domain boundaries, we deduce that the rank of the model is equal to two. We also know that the cardinality of the discrete gradient of each row and each column is less than or equal to two. If we assume that the anomaly is not larger than half the total domain extent in each direction, we know that the cardinality of the discrete derivative of the model (in matrix format) is not larger than the number of grid points in each direction. To summarize, the following constraint sets follow from the prior information:

1. {x | card((D_z ⊗ I_x)x) ≤ nx}
2. {x | card((I_z ⊗ D_x)x) ≤ nz}
3. {x | rank(x) ≤ 3}
4. {x | 2350 ≤ x[i] ≤ 2650 ∀i}
5. {x | card(D_x X[i, :]) ≤ 2 for i ∈ {1, 2, …, nz}}, where X[i, :] is a row of the 2D model
6. {x | card(D_z X[:, j]) ≤ 2 for j ∈ {1, 2, …, nx}}, where X[:, j] is a column of the 2D model

We use slightly overestimated rank and matrix cardinality constraints compared to the true model to mimic the more realistic situation that not all prior knowledge is correct. The results in Figure 4.5 use single-level PARSDMM to compute the projections onto the intersection of constraints, and they show that an intersection of non-convex constraints and bounds can lead to improved model estimates. Figure 4.5(e) is the result of working with constraints [1, 2, 4], Figure 4.5(f) uses constraints [1, 2, 4, 5, 6], and Figure 4.5(g) uses all constraints [1, 2, 3, 4, 5, 6]. The result with rank constraints and both matrix- and row/column-based cardinality constraints on the discrete gradient of the model is the most accurate in terms of the recovered anomaly shape. All results in Figure 4.5 that work with non-convex sets are at least as accurate as the result obtained with the true TV in terms of anomaly shape. Another important observation is that all non-convex results estimate a lower-than-background velocity anomaly, although not as low as the true anomaly. In contrast, the models obtained using convex sets show incorrect higher-than-background velocity artifacts in the vicinity of the true anomaly location.

Figure 4.6 is the same as Figure 4.5, except that we use multilevel PARSDMM (Algorithm 5) with three levels and a coarsening of a factor two per level. Comparing single-level with multilevel computation of the projection, we see that the multilevel version of PARSDMM performs better in general. In Figures 4.5(e) and 4.5(f), we see that the result of single-level PARSDMM inside SPG does not exactly satisfy constraint sets 5 and 6, because the cardinality of the derivative of the model in the x and z directions is not always less than or equal to two for each row and column. The results from multilevel PARSDMM inside SPG, Figures 4.6(a) and 4.6(b), do satisfy the constraints on the cardinality of the derivative of the image per row and column. As a result, the models are closer to the rectangular shape of the true model.
This is only one example with a few different constraint combinations, so we cannot draw general conclusions about the performance of single-level versus multilevel schemes, but the empirical findings are encouraging and in line with the observations by Macdonald and Ruthotto [2018].

Figure 4.5: True, initial, and estimated models with various constraint combinations for the full-waveform inversion example. Crosses and circles represent sources and receivers, respectively. All projections inside the spectral projected gradient algorithm are computed using single-level PARSDMM.

Figure 4.6: Estimated models with various constraint combinations for the full-waveform inversion example. Crosses and circles represent sources and receivers, respectively. All projections inside the spectral projected gradient algorithm are computed using coarse-to-fine multilevel PARSDMM with three levels and a coarsening of a factor two per level.

4.4.4 Learning a parametrized intersection from a few training examples

In the introduction, we discussed how to formulate inverse problems as a projection or feasibility problem (4.4). With the following two examples, we show that our algorithm (4.15) is a good candidate for solving inverse problems formulated this way, because we mitigate the rapidly increasing computation times for problems with many sets by taking the similarity between the linear operators in the set definitions into account. Of course, we can only use multiple constraint sets if we have multiple pieces of prior information. Combettes and Pesquet [2004] present a simple solution and note that for 15 out of 20 investigated data sets, 99% of the images have a total-variation within 20% of the average total-variation of the data set. The average total-variation then serves as a robust constraint that typically leads to good results. Here we follow the same reasoning, but we work with many constraint sets that we learn from a few example images. To summarize, our learning and solution strategy is as follows:

1. Observe the constraint parameters of various constraints in various transform-domains for all training examples (independently in parallel for each example and each constraint; a sketch follows below).
2. Add a data-fit constraint to the intersection.
3. The solution of the inverse problem is the projection of an initial guess m onto the learned intersection of sets,

\min_{x, \{y_i\}} \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p-1} \iota_{C_i}(y_i) + \iota_{C_p^{\text{data}}}(y_p) \quad \text{s.t.} \quad \begin{cases} A_i x = y_i \\ F x = y_p \end{cases},  (4.29)

where F is a linear forward modeling operator, and we solve this problem with Algorithm 3.

Before we proceed to the examples, it is worth mentioning the main advantages and limitations of this strategy. Because all set definitions are independent of all other sets, there are no penalty/weight parameters, and we avoid hand-tuning the constraint definitions. Unlike neural networks for imaging inverse problems, which often need large numbers of training examples, we can observe 'good' constraints from just one or a few example images. Methods that do not require training, such as basis-pursuit type formulations [e.g., Lustig et al., 2007, Candès and Recht, 2009, van den Berg and Friedlander, 2009, Becker et al., 2011, Aravkin et al., 2014], often minimize the `1-norm or nuclear norm of transform-domain coefficients (total-variation, wavelet) of an image subject to a data-fit constraint. However, without learning, these methods require hand-picking a suitable transform for each class of images.
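To make step 1 concrete, a hypothetical Julia sketch of observing two of the set parameters (bounds and anisotropic total-variation) over a set of training images. The aggregation rule shown here, the loosest value over the training set, is our own illustration, as are the names; one could also use an average, as in Combettes and Pesquet [2004].

using LinearAlgebra

# images: a vector of 2D arrays; Dz2D, Dx2D: discrete gradient operators.
function observe_parameters(images, Dz2D, Dx2D)
    l  = minimum(minimum.(images))                            # loosest lower bound
    u  = maximum(maximum.(images))                            # loosest upper bound
    tv = [norm([Dz2D * vec(img); Dx2D * vec(img)], 1) for img in images]
    return (l = l, u = u, σ_tv = maximum(tv))                 # loosest TV ball
end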
We will work with many transform-domain operators simultaneously, so that at least some of the constraint/linear-operator combinations will describe uncorrupted images with small norms/bounds/cardinality, but not noisy/blurred/masked images. Note that we are not learning any dictionaries, but work with pre-defined transforms such as the Fourier basis, wavelets, and linear operators based on discrete gradients. A limitation of the constraint-learning strategy that we use here is that it does not generalize very well to other classes of images and data sets.

For both of the examples, we observe the following constraint parameters from the exemplar images:

1. {m | σ_1 ≤ m[i] ≤ σ_2} (upper and lower bounds)
2. {m | Σ_{j=1}^k λ[j] ≤ σ_3}, where m = vec(Σ_{j=1}^k λ[j] u_j v_j^*) is the SVD of the image (nuclear norm)
3. {m | Σ_{j=1}^k λ[j] ≤ σ_4}, where (I_x ⊗ D_z)m = vec(Σ_{j=1}^k λ[j] u_j v_j^*) is the SVD of the vertical derivative of the image (nuclear norm of the discrete gradients of the image, total-nuclear-variation). We use the same for the x-direction.
4. {m | ‖Am‖_1 ≤ σ_5} with A = ((I_x ⊗ D_z)^⊤ (D_x ⊗ I_z)^⊤)^⊤ (anisotropic total-variation)
5. {m | σ_6 ≤ ‖m‖_2 ≤ σ_7} (annulus)
6. {m | σ_8 ≤ ‖Am‖_2 ≤ σ_9} with A = ((I_x ⊗ D_z)^⊤ (D_x ⊗ I_z)^⊤)^⊤ (annulus of the discrete gradients of the training images)
7. {m | ‖Am‖_1 ≤ σ_10} with A = the discrete Fourier transform (`1-norm of DFT coefficients)
8. {m | −σ_11 ≤ ((D_x ⊗ I_z)m)[i] ≤ σ_12} (slope constraints in the x and z directions, i.e., bounds on the discrete gradients of the image)
9. {m | l[i] ≤ (Am)[i] ≤ u[i]}, with A = the discrete cosine transform (point-wise bound constraints on DCT coefficients)

These are nine types of convex and non-convex constraints on the model properties (11 sets are passed to PARSDMM, because sets three and eight are applied to the two dimensions separately). For data-fitting, we add a point-wise constraint, {x | l ≤ (Fx − d_obs) ≤ u}, with a linear forward model F ∈ R^{M×N}.

Joint deblurring-denoising-inpainting

The goal of the first example is to recover a [0 − 255] grayscale image from 20% observed pixels of a blurred image (25 pixels known motion blur), where each observed data point also contains zero-mean random noise in the interval [−10, 10]. The forward operator F is thus a subsampled banded matrix (a restriction of an averaging matrix). As an additional challenge, we do not assume exact knowledge of the noise level and work with the overestimate [−15, 15]. The data set contains a series of images from 'Planet Labs PlanetScope Ecuador' with a resolution of three meters, available at openaerialmap.org. There are 35 patches of 1100 × 1100 pixels for training, some of which are displayed in Figure 4.7.

Figure 4.7: A sample of 8 out of 35 training images.

We compare the results of the proposed PARSDMM algorithm with the 11 learned constraints against a basis-pursuit denoise (BPDN) formulation. Basis-pursuit denoise recovers a vector of wavelet coefficients, c, by solving min_c ‖c‖_1 s.t. ‖FW^*c − d_obs‖_2 ≤ σ (BPDN-wavelet) with the SPGL1 toolbox [van den Berg and Friedlander, 2009]. The matrix W represents the wavelet transform: Daubechies wavelets as implemented by the SPOT linear operator toolbox (http://www.cs.ubc.ca/labs/scl/spot/index.html) and computed with the Rice Wavelet Toolbox (RWT, github.com/ricedsp/rwt).

In Figure 4.8, we see that an overestimation of σ in the BPDN formulation results in oversimplified images: the `2-ball constraint is too large, which leads to a coefficient vector c whose `1-norm is smaller than that of the true image.
The values for l and u in the data-fit constraint {x | l ≤ (Fx − d_obs) ≤ u} are then also too large. However, the results from the projection onto the intersection of multiple constraints suffer much less from overestimated noise levels, because there are many other constraints that control the model properties. The results in Figure 4.8 show that the learned set-intersection approach achieves a higher PSNR for all evaluation images compared to the BPDN formulation.

Figure 4.8: Reconstruction results from 80% missing pixels of an image with motion blur (25 pixels) and zero-mean random noise in the interval [−10, 10]. Results computed as the projection onto an intersection of 12 learned constraint sets with PARSDMM are visually better than the BPDN-wavelet results.

Image desaturation

To illustrate the versatility of the strategy, algorithm, and constraint sets from the previous example, we now solve an image desaturation problem for a different data set. The only two things we need to change are the constraint-set parameters, which we observe from new training images (Figure 4.9), and the linear forward operator F.

Figure 4.9: A sample of 8 out of 16 training images.

The data set contains image patches (1500 × 1250 pixels) from the 'Desa Sangaji Kota Ternate' image with a resolution of 11 centimeters, available at openaerialmap.org. The corrupted observed images are saturated grayscale images, generated by clipping the pixel values from 0 − 60 to 60 and from 125 − 255 to 125, so there is saturation of both the dark and the bright pixels. If we have no other information about the pixels at the clipped values, the desaturation problem implies the point-wise bound constraints [e.g., Mansour et al., 2010]

\begin{cases} 0 \le x[i] \le 60 & \text{if } d^{\text{obs}}[i] = 60, \\ x[i] = d^{\text{obs}}[i] & \text{if } 60 < d^{\text{obs}}[i] < 125, \\ 125 \le x[i] \le 255 & \text{if } d^{\text{obs}}[i] = 125. \end{cases}  (4.30)

The forward operator is thus the identity matrix. We solve problem (4.29) with these point-wise data-fit constraints and the model-property constraints listed in the previous example.

Figure 4.10 shows the results, together with the true and observed data, for four evaluation images. Large saturated patches are not desaturated accurately everywhere, because they contain no non-saturated observed pixels that serve as 'anchor' points.

Figure 4.10: Reconstruction results for recovery from saturated images, computed as the projection onto the intersection of 12 constraint sets.

Both the desaturation and the joint deblurring-denoising-inpainting example show that PARSDMM with multiple convex and non-convex sets converges to good results, while only a few training examples were sufficient to estimate the constraint-set parameters. Because of the problem formulation, the algorithms, and the simple learning strategy, there were no parameters to hand-pick.

4.5 Discussion and future research directions

We developed algorithms to compute projections onto intersections of multiple sets that help us set up and solve constrained inverse problems. Our design choices, together with the constrained formulation, minimize the number of parameters that we need to hand-pick for the problem formulation, the algorithms, and the regularization. Our software package SetIntersectionProjection should help inverse-problem practitioners test various combinations of constraints for faster evaluation of their strategies for solving inverse problems. Besides practicality, we want our work to apply not just to toy problems, but also to models on larger 3D grids. We achieved this via automatic adjustment of scalar algorithm parameters, a parallel implementation, and multilevel acceleration.
There are some limitations, but also opportunities to increase computational performance, which we now discuss.

Regarding the scope of the SetIntersectionProjection software package, it is important to emphasize that satisfying a constraint for our applications in imaging inverse problems is different from solving general (non-convex) optimization problems. When we refer to 'reliably' solving a non-convex problem, we are satisfied with an algorithm that usually approximates the solution well. For example, if we seek to image a model m that has k discontinuities, we add the constraint {m | card(Dm) ≤ k}, where D is a derivative operator. A satisfying solution for our applications has k large vector entries, whereas all others are small. We do not need to find a vector that has a cardinality of exactly k, because the estimated model is the same for practical purposes when the results are assessed qualitatively/visually, or when the expected resolution is much lower than the fine details we could potentially improve. Moreover, the forward operator of the inverse problem is often not sensitive to small changes in the model, and we do not benefit from spending much more computational time trying to find a more accurate solution of the non-convex problem. Besides the multilevel projection and the automatic adjustment of the augmented-Lagrangian parameters that we already use, Diamond et al. [2018] present several other heuristics that can improve the solution of non-convex problems in the context of ADMM-based algorithms. Future work could test whether these heuristics are computationally feasible for our often large-scale problems and whether they cooperate with our other heuristics.

The proposed algorithms are currently set up for general sparse, sparse and banded, and orthogonal matrices such as the discrete Fourier transform. A general non-orthogonal dense matrix A_i would slow down the solution of the linear system with Σ_{i=1}^{p+1} A_i^⊤A_i and is therefore not supported. However, if the dense matrix is flat, A_i ∈ R^{M×N} with M ≪ N, such as a learned transform (dictionary), we can use its transpose in a subspace constraint. This means that the model parameters m ∈ R^N are an element of the set {m | m = A_i^⊤ c, c ∈ R^M} with coefficient vector c. The projection onto this set is known in closed form, and we do not move the dense linear operator out of the set and into the normal equations of the x-minimization step of (4.20), because A_i^⊤A_i would become a large and dense matrix.

Besides the limitations and scope of this work, we highlight two ways in which we can reduce the computation times of Algorithm 3 and its multilevel version. First, we recognize that our algorithms use ADMM as their foundation, which is a synchronous algorithm. This means that the parallel computation of the projections (the y-update) is as slow as the most time-consuming projection. Without fundamentally changing the algorithms to asynchronous or randomized projection methods, we can take a purely software-based approach. Because we compute the projections in parallel, where each projection uses several threads, we are free to reallocate threads from the fastest projection to the slowest and reduce the total computational time.

A second computational component that may be improved is the inexact linear-system solve with the conjugate-gradient (CG) method. We do not use a preconditioner at the moment.
Besides the limitations and scope of this work, we highlight two ways in which we can reduce computation times for Algorithm 3 and its multilevel version. First, we recognize that our algorithms use ADMM as their foundation, which is a synchronous algorithm. This means that the parallel computation of the projections (the y-update) is as slow as the most time-consuming projection. Without fundamentally changing the algorithms to asynchronous or randomized projection methods, we can take a purely software-based approach. Because we compute the projections in parallel, where each projection uses several threads, we are free to reallocate threads from the fastest projection to the slowest and reduce the total computational time.

A second computational component that may be improved is the inexact linear-system solve with the conjugate-gradient (CG) method. We do not use a preconditioner at the moment. Preliminary tests with a simple diagonal (Jacobi) preconditioner or a multigrid V-cycle did reduce the number of CG iterations, but not the running time of CG in general. There are a few challenges we face when we design a preconditioner: (i) users may select a variety of linear operators; (ii) the system matrix is the weighted sum of multiple linear systems in normal-equation form, where the weights may change every two PARSDMM iterations; (iii) the number of CG iterations varies per PARSDMM iteration and is often less than ten, which makes it hard for preconditioners to reduce the time consumption if they require some computational overhead or setup cost.

Finally, we mention how our software package can work cooperatively with recent developments in plug-and-play regularizers [Venkatakrishnan et al., 2013]. The general idea is to use image processing techniques such as non-local means and BM3D, or pre-trained neural networks [Zhang et al., 2017, Bigdeli and Zwicker, 2017, Fan et al., 2017, Chang et al., 2017, Aggarwal et al., 2018, Buzzard et al., 2018], as a map g(x) : R^N → R^N that behaves like a proximal operator or projector. Despite the fact that these plug-and-play algorithms do not generally share non-expansiveness properties with projectors [Chan et al., 2017], they are successfully employed in optimization algorithms based on operator splitting. In our case, we use a neural network as the projection operator with the identity matrix as the associated linear operator. In this way, we can combine data constraints and other prior information with a network. A potential challenge for the plug-and-play concept in constrained optimization is the difficulty of verifying that the intersection of constraints is effectively non-empty, i.e., can g(x) map to points in the intersection of the other constraint sets? Some preliminary tests showed encouraging results, and we will explore this line of research further.
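Schematically, the plug-and-play idea amounts to slotting an arbitrary map in where the algorithm expects a projector. In the Julia sketch below (an assumption-laden illustration, not the package API), soft-thresholding, which is itself a genuine proximal map, stands in for the denoiser g(x):

    using LinearAlgebra

    # A soft-thresholding stand-in for the denoiser g(x); in practice, a pre-trained network.
    denoiser(x; t=0.1) = sign.(x) .* max.(abs.(x) .- t, 0.0)
    # The "constraint" is then the pair (identity operator, denoiser-as-projector):
    pnp_pair = (I, x -> denoiser(x))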
4.6 Conclusions

We developed novel algorithms and the corresponding 'SetIntersectionProjection' software for the computation of projections onto the intersection of multiple constraint sets. These intersection projections are an important tool for the regularization of inverse problems. They may be used as the projection part of projected gradient/(quasi-)Newton algorithms. Projections onto an intersection also solve set-based formulations of linear image processing problems, possibly combined with simple learning techniques to extract set definitions from example images. Currently available algorithms for the projection onto intersections of sets are efficient if we know the projections onto each set in closed form. However, many sets of interest include linear operators that require other algorithms to solve sub-problems. The presented methods and software are designed to work with multiple constraint sets based on small 2D, as well as larger 3D, models. We enhance computational performance by specializing the software for projection problems, exploiting different levels of parallelism on multi-core computing platforms, automatic selection of scalar (acceleration) parameters, and a coarse-to-fine grid multilevel implementation. The software is practical, also for non-expert users, because we do not need manual step-size selection or related operator-norm computations, and the algorithm inputs are pairs of linear operators and projectors, which the software also generates. Another practical feature is the support for simultaneous set definitions based on the entire image/tensor and on each slice/row/column. Because we focus on multiple constraints, there is less need to choose the 'best' constraint with the 'best' linear operator/transform for a given inverse problem. More constraints are not much more difficult to deal with than one or two constraints, also in terms of computational cost per iteration. We demonstrated the versatility of the presented algorithms and software using examples from partial-differential-equation based parameter estimation and image processing. These examples also show that the algorithms perform well on problems that include non-convex sets.

Chapter 5

Minkowski sets for the regularization of inverse problems

5.1 Introduction

The inspiration for this work is twofold. First, we want to build on the success of regularization with intersections of constraint sets and projection methods; see [Gragnaniello et al., 2012, Pham et al., 2014a,b, Smithyman et al., 2015, Esser et al., 2016, Yong et al., 2018] and the examples presented in Chapters 2, 3, and 4. These works regularize a parameter estimation problem min_m f(m), for f : R^N → R, by constraining the model m to an intersection of p convex and possibly non-convex sets, ⋂_{i=1}^p C_i. The corresponding optimization problem reads min_m f(m) s.t. m ∈ ⋂_{i=1}^p C_i, with m ∈ R^N the vector of model parameters. In the references mentioned above, this type of problem is successfully solved with projection-based algorithms. However, prior knowledge represented as a single set or as an intersection of different sets may not capture all we know, for instance when the model contains oscillatory as well as blocky features. Because these are fundamentally different properties, working with one or multiple constraint sets alone may not be able to express the simplicity of the entire model.

Motivated by ideas from morphological component analysis (MCA, Osher et al. [2003]; Schaeffer and Osher [2013]; Starck et al. [2005]; Ono et al. [2014]) and robust or sparse principal component analysis (RPCA, Candès et al. [2011]; Gao et al. [2011a]; Gao et al. [2011b]), we consider an additive model structure. Cartoon-texture decomposition (CTD) and MCA exploit this structure by decomposing m into two or more components, e.g., into a blocky cartoon-like component u ∈ R^N and an oscillatory component v ∈ R^N. For this purpose, a penalty method is often used, stating the decomposition problem as

    min_{u,v} ‖m − u − v‖ + (α/2)‖Au‖ + (β/2)‖Bv‖.    (5.1)

While this method has been successful, it requires careful choices for the penalty parameters α > 0 and β > 0. These parameters determine how 'much' of m ends up in each component, but also depend on the noise level. In addition, the value of these parameters relates to the choices for the linear operators A ∈ C^{M_1×N} and B ∈ C^{M_2×N}. When working with multiple constraints, we have seen that avoiding penalty parameters is more practical. Decomposition problems suggest the same, as the number of regularizers involved is likely to be even larger.

To handle situations where the model contains two or more morphological components, we explore the use of Minkowski sets for regularizing inverse problems. For this purpose, we require that the vector of model parameters, m, is an element of the Minkowski set V, or vector sum of two sets C_1 and C_2, which is defined as

    V ≡ C_1 + C_2 = {m = u + v | u ∈ C_1, v ∈ C_2}.    (5.2)

A vector m is an element of V if it is the sum of vectors u ∈ C_1 and v ∈ C_2. Each set describes particular model properties for its component. These include total-variation, sparsity in a transform domain (Fourier, wavelets, curvelets, shearlets), or matrix rank.
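For intuition, a small example that can be checked by hand (an illustration, not from the text): the Minkowski sum of two sets of bound constraints is again a set of bound constraints,

\[
\{u \mid a_1 \le u[i] \le b_1\} + \{v \mid a_2 \le v[i] \le b_2\}
= \{m \mid a_1 + a_2 \le m[i] \le b_1 + b_2\},
\]

so a bound-constrained background plus a bound-constrained anomaly again yields simple bounds on the total model. Sums of more expressive sets generally have no such closed form, which is one reason to work with projections rather than forming V explicitly.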
For practical reasons, we assume that all sets, sums of sets, and intersections of sets are non-empty, which implies that our optimization problems have at least one solution. Moreover, we will use the property that the sum of p sets C_i is convex if all C_i are convex [Hiriart-Urruty and Lemaréchal, 2012, Page 24]. We apply our set-based regularization with the Euclidean projection operator:

    P_V(m) ∈ arg min_x (1/2)‖x − m‖_2^2  s.t.  x ∈ V.    (5.3)

This projection allows us to use Minkowski constraint sets in algorithms such as (spectral) projected gradient (SPG, Birgin et al. [1999]), projected quasi-Newton [PQN, Schmidt et al., 2009], and projected Newton algorithms [Bertsekas, 1982, Schmidt et al., 2012], as well as in proximal-map based algorithms if we include the Minkowski constraint as an indicator function. We define this indicator function for a set V as

    ι_V(m) = 0     if m ∈ V,
             +∞    if m ∉ V,    (5.4)

and the proximal map of a function g(m) : R^N → R ∪ {+∞} as prox_{γ,g}(m) = arg min_x g(x) + (γ/2)‖x − m‖_2^2, with γ > 0. The proximal map of the indicator function of a set is the projection onto that set: prox_{γ,ι_V}(m) = P_V(m).

While the above framework is powerful, it lacks certain critical features needed for solving problems that involve physical parameters. For instance, there is, in general, no guarantee that the sum of two or more components lies within lower and upper bounds or satisfies other crucial constraints. It is also not straightforward to include multiple pieces of prior information for each component.

5.1.1 Related work

The decomposition strategies introduced above, morphological component analysis or cartoon-texture decomposition (MCA, Osher et al. [2003]; Schaeffer and Osher [2013]; Starck et al. [2005]; Ono et al. [2014]) and robust or sparse principal component analysis (RPCA, Candès et al. [2011]; Gao et al. [2011a]; Gao et al. [2011b]), share the additive model construction with multiscale decompositions in image processing [e.g., Meyer, 2001, Tadmor et al., 2004]. While each of the sets that appear in a Minkowski sum can describe a particular scale, this is not our primary aim or motivation. We use the summation structure to build more complex models out of simpler ones, more aligned with cartoon-texture decomposition and robust principal component analysis.

Projections onto Minkowski sets also appear in computational geometry, collision detection, and computer-aided design [e.g., Dobkin et al., 1993, Varadhan and Manocha, 2006, Lee et al., 2016], but the problems and applications are different. In our case, sets describe model properties and prior knowledge in R^N. In computational geometry, sets are often the vertices of polyhedral objects in R^2 or R^3 and do not come with closed-form expressions for projectors or the Minkowski sum. We do not need to form the Minkowski set explicitly, and we show that projections onto the set are sufficient to regularize inverse problems.

5.1.2 Contributions and outline

We propose a constrained regularization approach suitable for inverse problems, with an emphasis on physical parameter estimation. For our applications, this implies that we need to work with multiple constraints for each component while offering assurances that the sum of the components also adheres to certain constraints. For this purpose, we introduce generalized Minkowski sets and a formulation void of penalty parameters.
As [Gragnaniello et al., 2012, Pham et al., 2014b,a, Smithyman et al., 2015, Esser et al., 2016, Yong et al., 2018] and earlier chapters in this thesis use projection-based optimization methods, we introduce projections onto these generalized sets, followed by a discussion of important algorithmic details and the formulation of inverse problems based on these sets.

Because we are working with constraints, we do not have to worry about selecting trade-off parameters. With the projections, we can also ensure that the model parameters at every iteration of the inversion are within a generalized Minkowski set. As before, we are in a position to relax the constraints gradually. This idea proved to be a successful tactic for solving non-convex geophysical inverse problems (see [Smithyman et al., 2015, Esser et al., 2016, Yong et al., 2018] and previous chapters).

For the software implementation, we extend the open-source Julia software package 'SetIntersectionProjection' presented in Chapter 4. The software is suitable for small 2D models, as well as for larger 3D geological models or videos, as we show in the numerical examples section using seismic parameter estimation and video processing examples. These examples also demonstrate that the proposed problem formulation, algorithm, and software allow us to define constraints based on the entire 2D/3D model, but also simultaneously on slices/rows/columns/fibers of that model. This feature enables us to include certain prior knowledge more directly in the inverse problem.

5.2 Generalized Minkowski set

It is challenging to select a single constraint set or an intersection of multiple sets to describe models and images that contain distinct morphological components u and v. While the Minkowski set allows us to define different sets for the different components, problems may arise when working with physical parameter estimation applications.

For instance, there is usually prior knowledge about the physically realistic values in m ∈ R^N. Moreover, in the previous chapters we showed successful applications of multiple constraints on the model parameters, and we want to combine that concept with constraints on the components.

The second extension of the basic concept of a Minkowski set is that we allow the constraint set on each component to be an intersection of multiple sets. In this way, we can include multiple pieces of prior information about each component.

We denote the proposed generalized Minkowski constraint set for the regularization of inverse problems as

    M ≡ {m = u + v | u ∈ ⋂_{i=1}^p D_i, v ∈ ⋂_{j=1}^q E_j, m ∈ ⋂_{k=1}^r F_k},    (5.5)

where the model estimate m ∈ R^N is an element of the intersection of r sets F_k and is also the sum of two components u ∈ R^N and v ∈ R^N. The vector u is an element of the intersection of p sets D_i; v is an element of the intersection of q sets E_j. It is conceptually straightforward to extend set definition (5.5) to a sum of three or more components, but we work with two components for the remainder of this chapter for notational convenience. In the discussion section, we highlight some potential computational challenges that come with generalized Minkowski sets of more than two components.

The convexity of M follows from the properties of the sets D_i, E_j, and F_k. From definition (5.5), we see that ⋂_{i=1}^p D_i, ⋂_{j=1}^q E_j, and ⋂_{k=1}^r F_k are closed and convex if D_i, E_j, and F_k are closed and convex for all i, j, and k. It follows that M is a convex set, because it is the intersection of a convex intersection with the Minkowski sum ⋂_{i=1}^p D_i + ⋂_{j=1}^q E_j, which is also convex. To summarize in words: m is an element of the intersection of two convex sets; one is the convex Minkowski sum, the other is a convex intersection. The set M is therefore also convex. Note that convexity and closedness of ⋂_{i=1}^p D_i and ⋂_{j=1}^q E_j do not imply that their sum is closed.
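The convexity of the Minkowski sum itself can be verified in one line: for m_1 = u_1 + v_1 and m_2 = u_2 + v_2 in the sum, and γ ∈ [0, 1],

\[
\gamma m_1 + (1-\gamma) m_2
= \big(\gamma u_1 + (1-\gamma) u_2\big) + \big(\gamma v_1 + (1-\gamma) v_2\big),
\]

where the first term in parentheses remains in ⋂_{i=1}^p D_i and the second in ⋂_{j=1}^q E_j by convexity of the intersections, so the convex combination again lies in their sum.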
In the following section, we propose an algorithm to compute the projection onto the generalized Minkowski set.

5.3 Projection onto the generalized Minkowski set

In the following section, we show how to use the generalized Minkowski set (Equation (5.5)) to regularize inverse problems with computationally cheap or expensive forward operators. First, we need to develop an algorithm to compute the projection onto M, which we denote by P_M(m). Using P_M(m), we can formulate inverse problems as a projection, or use the projection operator inside projected gradient/Newton-type algorithms. Each constraint set definition may include a linear operator (the transform-domain operator) in its definition. We make the linear operators explicit, because the projection operator corresponding to, for example, {x | ‖x‖_2 ≤ σ} is available in closed form and easy to compute, but the projection onto {x | ‖Ax‖_2 ≤ σ} is not when AA^T ≠ αI for α > 0 [Combettes and Pesquet, 2011; Parikh and Boyd, 2014; Beck, 2017, Chapters 6 & 7; Diamond et al., 2018]. Let us introduce the linear operators A_i ∈ R^{M_i×N}, B_j ∈ R^{M_j×N}, and C_k ∈ R^{M_k×N}. With indicator functions and exposed linear operators, we formulate the projection of m ∈ R^N onto set (5.5) as

    P_M(m) = arg min_{u,v,w} (1/2)‖w − m‖_2^2 + Σ_{i=1}^p ι_{D_i}(A_i u)
             + Σ_{j=1}^q ι_{E_j}(B_j v) + Σ_{k=1}^r ι_{F_k}(C_k w) + ι_{w=u+v}(w, u, v),    (5.6)

where ι_{w=u+v}(w, u, v) is the indicator function of the equality constraint w = u + v that occurs in the definition of M. The sets D_i, E_j, and F_k have the same role as in the previous section. The above problem is the minimization of a sum of functions acting on different as well as shared variables. We recast it in a standard form such that we can solve it using algorithms based on the alternating direction method of multipliers (ADMM, e.g., Boyd et al. [2011]; Eckstein and Yao [2015]). Rewriting in a standard form allows us to benefit from recently proposed schemes for selecting algorithm parameters that decrease the number of iterations and lead to more robust algorithms in case we use non-convex sets [Xu et al., 2017b; Xu et al., 2016]. As a first step, we introduce the vector x ∈ R^{2N} that stacks two of the three optimization variables in (5.6):

    x ≡ [u; v].    (5.7)
0Ap 00 B10...0 BqC1 C1......Cr CrIN IN(uv)=y1...ypyp+1...yp+qyp+q+1...yp+q+ryp+q+r+1.Now define the new functionf˜(y˜,m) ≡s∑i=1fi(yi,m) ≡ 12‖ys−m‖22 +p∑i=1ιDi(yi)+q∑j=1ιEj (yj)+r∑k=1ιFk(yk),(5.10)such that we obtain the projection problem in the standard formPM (m) = arg minx,y˜f˜(y˜,m) s.t. A˜x = y˜. (5.11)If x and y˜ are a solution to this problem, the equality constraints enforceu + v = yp+q+r+1 and we recover the projection of m as yp+q+r+1 or as(IN IN)x. Now that Problem (5.11) is in a form that we can solve with theADMM algorithm, we proceed by writing down the augmented Lagrangianfor Problem (5.11) [Nocedal and Wright, 2000, Chapter 17] asLρ1,...,ρs(x, y1, . . . , ys, v1, . . . , vs) =s∑i=1[fi(yi)+vTi (yi−A˜ix)+ρi2‖yi−A˜ix‖22],where ρi > 0 are the augmented Lagrangian penalty parameters and vi ∈RMi are the vectors of Lagrangian multipliers. We denote a block-row of thematrix A˜ as A˜i. The relaxed ADMM iterations with relaxation parameters133γi ∈ (0, 2] and iteration counter l are given byxl+1 = arg minxs∑i=1ρli2‖yli − A˜ix+vliρli‖22 =[ s∑i=1(ρliA˜Ti A˜i)]−1 s∑i=1(A˜Ti (ρliyli + vli))x¯l+1i = γliA˜ixl+1i + (1− γli)yliyl+1i = arg minyi[fi(yi) +ρli2‖yli − x¯l+1i +vliρli‖22]= proxfi,ρi(x¯l+1i −vliρli)vl+1i = vli + ρli(yl+1i − x¯l+1i ).These iterations are equivalent to the Simultaneous Direction Method ofMultipliers (SDMM, Combettes and Pesquet [2011]; Kitic et al. [2016]) andthe SALSA algorithm [Afonso et al., 2011], except that we have an addi-tional relaxation step. In fact, the iterations are identical to the algorithmpresented in the previous chapter to compute the projection onto an intersec-tion of sets, but here we solve a different problem and have different matrixstructures. We briefly mention the main properties of each sub-problem.xl+1 computation. This step is the solution of a large, sparse, square,symmetric, and positive-definite linear system. The system matrix has thefollowing block structure:Q ≡s∑i=1(ρiA˜Ti A˜i) =(∑pi=1 ρiATi Ai +∑rk=1 ρkCTk Ck + ρsIN∑rk=1 ρkCTk Ck + ρsIN∑rk=1 ρkCTk Ck + ρsIN∑qj=1 ρjBTj Bj +∑rk=1 ρkCTk Ck + ρsIN).(5.12)This matrix is symmetric and positive-definite if A˜ has full column rank. Weassume this is true in the remainder because many A˜i have full column rank,such as discrete-derivative based matrices and transform matrices includingthe DFT and various wavelets. We compute xl+1 with the conjugate gradi-ent (CG) method, warm started by xl as initial guess. We choose CG insteadof an iterative method for least-squares problems such as LSQR [Paige andSaunders, 1982], because solvers for least-squares work with A˜ and A˜T sep-arately and need to compute a matrix-vector product (MVP) with each A˜i134and A˜Ti at every iteration. This becomes computationally expensive if thereare many linear operators, as is the case for our problem. CG uses a singleMVP with Q per iteration. The cost of this MVP does not increase if we addorthogonal matrices to A˜. If the matrices in A˜ have (partially) overlappingsparsity patterns, the cost also does not increase (much). We pre-computeall A˜Ti A˜i for fast updating of Q when one or more of the ρi change (seebelow).yl+1i computation. For every index i, we can compute proxfi,ρi(x¯l+1i −vliρli) independently in parallel. For indices i ∈ {1, 2, . . . , s− 1}, the proximalmaps are projections onto sets D, E or F . 
These iterations are equivalent to the Simultaneous Direction Method of Multipliers (SDMM, Combettes and Pesquet [2011]; Kitic et al. [2016]) and the SALSA algorithm [Afonso et al., 2011], except that we have an additional relaxation step. In fact, the iterations are identical to those of the algorithm presented in the previous chapter to compute the projection onto an intersection of sets, but here we solve a different problem and have different matrix structures. We briefly mention the main properties of each sub-problem.

x^{l+1} computation. This step is the solution of a large, sparse, square, symmetric, and positive-definite linear system. The system matrix has the following block structure:

    Q ≡ Σ_{i=1}^s ρ_i Ã_i^T Ã_i =
    [ Σ_{i=1}^p ρ_i A_i^T A_i + Σ_{k=1}^r ρ_k C_k^T C_k + ρ_s I_N        Σ_{k=1}^r ρ_k C_k^T C_k + ρ_s I_N
      Σ_{k=1}^r ρ_k C_k^T C_k + ρ_s I_N        Σ_{j=1}^q ρ_j B_j^T B_j + Σ_{k=1}^r ρ_k C_k^T C_k + ρ_s I_N ].

This matrix is symmetric and positive definite if Ã has full column rank. We assume this is true in the remainder, because many Ã_i have full column rank, such as discrete-derivative based matrices and transform matrices including the DFT and various wavelets. We compute x^{l+1} with the conjugate-gradient (CG) method, warm started with x^l as the initial guess. We choose CG instead of an iterative method for least-squares problems such as LSQR [Paige and Saunders, 1982], because solvers for least-squares problems work with Ã and Ã^T separately and need to compute a matrix-vector product (MVP) with each Ã_i and Ã_i^T at every iteration. This becomes computationally expensive if there are many linear operators, as is the case for our problem. CG uses a single MVP with Q per iteration. The cost of this MVP does not increase if we add orthogonal matrices to Ã. If the matrices in Ã have (partially) overlapping sparsity patterns, the cost also does not increase (much). We pre-compute all Ã_i^T Ã_i for fast updating of Q when one or more of the ρ_i change (see below).

y_i^{l+1} computation. For every index i, we can compute prox_{f_i,ρ_i}(x̄_i^{l+1} − v_i^l/ρ_i^l) independently and in parallel. For indices i ∈ {1, 2, …, s − 1}, the proximal maps are projections onto the sets D, E, or F. These projections do not include linear operators, and we know their solutions in closed form (e.g., for the ℓ1-norm, ℓ2-norm, rank, cardinality, and bounds).

ρ_i^{l+1}, γ_i^{l+1} updates. We use the updating scheme for ρ and γ from adaptive-relaxed ADMM, introduced by Xu et al. [2017b]. Numerical results show that this updating scheme accelerates the convergence of ADMM [Xu et al., 2017a,b,c] and is also robust when solving some non-convex problems [Xu et al., 2016]. We use a different relaxation and penalty parameter for each function f_i(y_i), as do Song et al. [2016]; Xu et al. [2017c], which allows ρ_i and γ_i to adapt to the various linear operators of different dimensions that correspond to each constraint set.

Parallelism and communication. The only serial part of the iterations (5.12) is the x^{l+1} computation. We use multi-threaded MVPs in the compressed diagonal format if Q has a banded structure. The other parts of the iterations (5.12), i.e., y_i^{l+1}, v_i^{l+1}, ρ_i^{l+1}, and γ_i^{l+1}, are all independent, so we can compute them in parallel for each index i. There are two operations in (5.12) that require communication between the workers that carry out the computations in parallel. We need to send x^{l+1} to every worker that computes a y_i^{l+1}, v_i^{l+1}, ρ_i^{l+1}, and γ_i^{l+1}. The second and last piece of communication is the map-reduce parallel sum that forms the right-hand side for the next iteration when we compute x^{l+1} via Σ_{i=1}^s Ã_i^T (ρ_i^l y_i^l + v_i^l).
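As a reference for the closed-form solutions mentioned in the y_i^{l+1} update, minimal Julia sketches of three of these projections (bounds, the ℓ2-ball, and the non-convex cardinality set; the package provides its own implementations):

    using LinearAlgebra

    proj_bounds(y, l, u) = clamp.(y, l, u)             # {y | l <= y[i] <= u}
    function proj_l2ball(y, sigma)                     # {y | ||y||_2 <= sigma}
        n = norm(y)
        return n <= sigma ? copy(y) : (sigma / n) * y
    end
    function proj_card(y, k)                           # {y | card(y) <= k}, non-convex
        z = zero(y)
        idx = partialsortperm(abs.(y), 1:k, rev=true)  # the k largest magnitudes survive
        z[idx] = y[idx]
        return z
    end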
In practice, we will use the proposed algorithm to solve problems that often involve non-convex sets. Therefore, we do not provide guarantees that algorithms like ADMM behave as expected, because their convergence proofs typically require closed, convex, and proper functions; see, e.g., Boyd et al. [2011]; Eckstein and Yao [2015]. This is not a point of great concern to us, because the main motivations for basing our algorithms on ADMM are rapid empirical convergence, the ability to deal with many constraint sets efficiently, and strong empirical performance in the case of non-convex sets that violate the standard assumptions for the convergence of ADMM.

5.4 Formulation of inverse problems with generalized Minkowski constraints

So far, we proposed a generalization of the Minkowski set (M, Equation (5.5)) and developed an algorithm to compute projections onto this set. The next step to solve inverse problems where the generalized Minkowski set describes the prior knowledge is to combine the set M with a data-fitting procedure. We discuss two formulations of such an inverse problem. One is primarily suitable when the data-misfit function is computationally expensive to evaluate, by which we mean that evaluating f(m) and ∇_m f(m) is more time-consuming than projecting onto the generalized Minkowski set M. The second formulation is for inverse problems where the forward operator is both linear and computationally inexpensive to evaluate. We discuss the two approaches in more detail below.

5.4.1 Inverse problems with computationally expensive data-misfit evaluations

We consider a non-linear and possibly non-convex data-misfit function f(m) : R^N → R that depends on the model parameters m ∈ R^N. Our assumption for this inverse problem formulation is that the computational budget allows for many fewer data-misfit evaluations than the number of iterations required to project onto the generalized Minkowski set, as defined in (5.12). We can deal with this imbalance by attempting to make as much progress towards minimizing f(m) as possible, while always satisfying the constraints. The minimization of the data misfit, subject to satisfying the generalized Minkowski constraint, is then formulated as

    min_m f(m)  s.t.  m ∈ M.    (5.13)

If we solve this problem with algorithms that use a projection onto M at every iteration, the model parameters m satisfy the constraints at every iteration, a property desired by several works in non-convex geophysical parameter estimation; see [Smithyman et al., 2015, Esser et al., 2016, Yong et al., 2018] and the geophysical examples presented in the previous chapters. These works obtain better model reconstructions from non-convex problems by carefully changing the constraints during the data-fitting procedure. The first two numerical experiments in this work use the spectral projected gradient algorithm (SPG, Birgin et al. [1999]; Birgin et al. [2003]). SPG iterates

    m^{l+1} = (1 − γ) m^l + γ P_M(m^l − α ∇_m f(m^l)),    (5.14)

where P_M is the Euclidean projection onto M. The Barzilai-Borwein [Barzilai and Borwein, 1988] step length α > 0 is a scalar approximation of the Hessian that is informed by previous model estimates and gradients of f(m). A non-monotone line search estimates the scalar γ ∈ (0, 1] and prevents f(m) from increasing for too many iterations in sequence. The line search backtracks between two points in a convex set if M is convex and the initial m^0 is feasible, so every line-search iterate is feasible by construction. SPG thus requires a single projection onto M per iteration if all constraint sets are convex.
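A stripped-down Julia sketch of iteration (5.14), with fixed α and γ (the actual experiments use the Barzilai-Borwein step and the non-monotone line search described above):

    # P_M projects onto the generalized Minkowski set; grad_f returns the misfit gradient.
    function spg(grad_f::Function, P_M::Function, m0::AbstractVector;
                 alpha=1e-3, gamma=1.0, maxit=15)
        m = copy(m0)
        for l in 1:maxit
            m = (1 - gamma) * m + gamma * P_M(m - alpha * grad_f(m))   # update (5.14)
        end
        return m
    end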
5.4.2 Linear inverse problems with computationally cheap forward operators

Contrary to the previous section, we now assume a linear relation between the model parameters m ∈ R^N and the observed data, d_obs ∈ R^M. The second assumption for the problem formulation in this section is that the evaluation of the linear forward operator is not much more time-consuming than the other computational components in the iterations (5.12). Examples of such operators G ∈ R^{M×N} are masks, identity matrices, and blurring kernels. We may then put data fitting and regularization on the same footing and formulate an inverse problem with constraints as a feasibility or projection problem. Both of these formulations add a data-fit constraint to the constraints that describe model properties [Youla and Webb, 1982, Trussell and Civanlar, 1984, Combettes, 1993, 1996]. The numerical examples in this work use the point-wise data-fit constraint G_data ≡ {m | l[i] ≤ (Gm − d_obs)[i] ≤ u[i]}, with lower and upper bounds on the misfit. We use the notation l[i] for entry i of the lower-bound vector l. The data-fit constraint can be any set onto which we know how to project. An example of a global data-misfit constraint is the norm-based set G_data ≡ {m | σ_l ≤ ‖Gm − d_obs‖ ≤ σ_u} with scalar bounds σ_l < σ_u. This set is non-convex if σ_l > 0, e.g., the annulus constraint in the case of the ℓ2-norm. Such a set has a 'hole' in its interior that explicitly avoids fitting the data noise in the ℓ2-norm sense.

We denote our formulation of a linear inverse problem with a data-fit constraint and a generalized Minkowski set constraint (Equation (5.5)) on the model estimate as

    min_{x,u,v} (1/2)‖x − m‖_2^2  s.t.  x = u + v,
                                        u ∈ ⋂_{i=1}^p D_i, v ∈ ⋂_{j=1}^q E_j, x ∈ ⋂_{k=1}^r F_k,
                                        x ∈ G_data.    (5.15)

The solution is the projection of an initial guess, m, onto the intersection of a data-fit constraint and a generalized Minkowski constraint on the model parameters. As before, there are constraints on the model x, as well as on the components u and v. Problem (5.15) has the same form as the projection problem from the previous section, and we can solve it with the same algorithm. In the current case, we have one additional constraint on the sum of the components.

5.5 Numerical examples

5.5.1 Seismic full-waveform inversion 1

We start with a numerical example originally presented in Chapter 4. We repeat the experiment and show how a Minkowski set describes the provided prior knowledge naturally and results in a better model estimate compared to a single constraint set or an intersection of multiple sets. The problem is to estimate the acoustic velocity m ∈ R^N of the model in Figure 5.1 from observed seismic data modeled by the Helmholtz equation. This problem, known as full-waveform inversion (FWI, Tarantola [1986]; Pratt et al. [1998]; Virieux and Operto [2009]), is often formulated as the minimization of a differentiable but non-convex data misfit

    f(m) = (1/2)‖d_predicted(m) − d_observed‖_2^2,    (5.16)

where the partial-differential-equation constraints are already eliminated and are part of d_predicted(m); see, e.g., Haber et al. [2000]. The observed data, d_observed, consist of the discrete frequencies {3.0, 6.0, 9.0} Hertz.

Figure 5.1 shows the true model, the initial guess for m, and the source and receiver geometry. We assume prior information about the bounds on the parameter values, and that the anomaly has a rectangular shape with a lower velocity than the background.

The results in Figure 5.1 using bounds, or bounds and the true anisotropic total-variation (TV) as a constraint, do not lead to a satisfying model estimate. The result with TV is marginally better compared to bound constraints only. The diagonally shaped model estimates are mostly due to the source and receiver positioning, known as vertical seismic profiling (VSP) in geophysics. To obtain a better model, we used a variety of intersections, including non-convex sets, in Chapter 4.

Here we show that the generalized Minkowski set M (Equation (5.5)) can also provide an improved model estimate, but using convex sets only. If we have the prior knowledge that the anomaly we need to find has a lower velocity than the background medium, we can easily and naturally include this information as a Minkowski set. The following four sets summarize our prior knowledge:

1. F_1 = {x | 2350 ≤ x[i] ≤ 2550} : bounds on the sum
2. F_2 = {x | ‖((D_z ⊗ I_x)^T (I_z ⊗ D_x)^T)^T x‖_1 ≤ σ} : anisotropic total-variation on the sum
3. D_1 = {u | −150 ≤ u[i] ≤ 0} : bounds on the anomaly
4. E_1 = {v | v[i] = 2500} : fixed background value

The generalized Minkowski set combines the four sets above as (F_1 ⋂ F_2) ⋂ (D_1 + E_1). In words: we fix the background velocity, require any anomaly to be negative, and require the total model estimate to satisfy the bound constraints and have a low anisotropic total-variation. To minimize the data misfit subject to the generalized Minkowski constraint,

    min_m (1/2)‖d_predicted(m) − d_observed‖_2^2  s.t.  m ∈ (F_1 ⋂ F_2) ⋂ (D_1 + E_1),    (5.17)

we use the same algorithm as in the original example in Chapter 4, which is the spectral projected gradient (SPG, Birgin et al. [1999]) algorithm with 15 iterations and a non-monotone line search with a memory of five function values. The result that uses the generalized Minkowski constraint (Figure 5.1) is much better compared to bounds and the correct total-variation, because the constraints on the sign of the anomaly prevent incorrect high-velocity artifacts.

While there are other ways to fix a background model and invert for an anomaly, this example illustrates that our proposed regularization approach incorporates information on the sign of an anomaly conveniently, and the constraints remain convex.
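As a sketch, three of the four sets above have element-wise closed-form projectors (F_2, the total-variation ball, contains a linear operator and is instead handled via the splitting of Section 5.3):

    proj_F1(x) = clamp.(x, 2350.0, 2550.0)   # bounds on the sum of the components
    proj_D1(u) = clamp.(u, -150.0, 0.0)      # the anomaly is non-positive
    proj_E1(v) = fill(2500.0, length(v))     # background fixed at 2500 m/s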
It is straightforward to change and add constraints on each component, also in the more realistic situation where the background is not known and should not be fixed, as we show in the following example.

Figure 5.1: The true model for the data generation for the full-waveform inversion 1 example, the initial guess for parameter estimation, and the model estimates with various constraints. Crosses and circles indicate receivers and sources, respectively.

5.5.2 Seismic full-waveform inversion 2

This time, the challenge is to estimate a model (Figure 5.2 a) that has both a background and an anomaly component that are very different from the initial guess (Figure 5.2 b). This means we can no longer fix one of the two components of the generalized Minkowski sum.

The experimental setting is a bit different from the previous example. The sources are in one borehole, and the receivers are in another borehole at the other side of the model (cross-well full-waveform inversion). Except for a single high-contrast anomaly, the velocity increases monotonically with depth, both gradually and discontinuously. The prior knowledge we assume consists of: i) upper and lower bounds on the velocity and also on the anomaly; ii) the model is relatively simple, in the sense that we assume it has a rank of at most five; iii) the background parameters increase monotonically with depth; iv) the background varies smoothly in the lateral direction; v) the size of the anomaly is not larger than one fifth of the height of the model and not larger than one third of the width of the model. We do not assume prior information on the total-variation of the model, but for comparison we show the result when we use the true total-variation as a constraint. The following sets formalize the aforementioned prior knowledge:

1. F_1 = {x | 2350 ≤ x[i] ≤ 2850}
2. F_2 = {x | ‖((I_x ⊗ D_z)^T (D_x ⊗ I_z)^T)^T x‖_1 ≤ σ}
3. F_3 = {x | rank(x) ≤ 5}
4. D_1 = {u | 2350 ≤ u[i] ≤ 2850}
5. D_2 = {u | 0 ≤ (D_z ⊗ I_x)u ≤ ∞}
6. D_3 = {u | −0.1 [m/s]/m ≤ (I_z ⊗ D_x)u ≤ 0.1 [m/s]/m}
7. E_1 = {v | 300 ≤ v[i] ≤ 350}
8. E_2 = {v | card(v) ≤ (n_z/5 × n_x/3)}

Figure 5.2: The true and initial models corresponding to the full-waveform inversion 2 example. The figure shows parameter estimation results with various intersections of sets, as well as the result using a generalized Minkowski constraint set. Only the result obtained with the generalized Minkowski set does not show an incorrect low-velocity anomaly.

As before, the sets F_k act on the sum of the components, the D_i describe component one (the background), and the E_j constrain the other component (the anomaly). Figure 5.2 c shows the model m found by SPG applied to the problem min_m f(m) s.t. m ∈ F_1. We see oscillatory features in the result with bound constraints only, but the main issue is the appearance of a low-velocity artifact located just below the true anomaly. Figure 5.2 d shows that, even if we know the correct total-variation, the result is less oscillatory than with just bound constraints, but it still shows an erroneous low-velocity anomaly. When we also include the rank constraint, i.e., we use the set F_1 ⋂ F_2 ⋂ F_3, the result does not improve (Figure 5.2 e). The generalized Minkowski set (⋂_{k=1}^3 F_k) ⋂ (⋂_{i=1}^3 D_i + ⋂_{j=1}^2 E_j) does not yield a result with the large incorrect low-velocity artifact just below the correct high-velocity anomaly (Figure 5.2 g), even though we did not include information on the sign of the anomaly as we did in the previous example. There are still two smaller horizontal and vertical artifacts. Overall, the Minkowski-set-based constraint results in the best background and anomaly estimates.
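As a sketch of how the slope constraints D_2 and D_3 above could be encoded as (linear operator, projector) pairs (assuming an n_z-by-n_x grid with depth running fastest in vec(u) and forward differences; the exact Kronecker ordering depends on the vectorization convention):

    using SparseArrays, LinearAlgebra

    nz, nx = 150, 100                                            # hypothetical grid size
    D1d(n) = sparse(I, n, n)[2:n, :] - sparse(I, n, n)[1:n-1, :] # 1D forward difference
    Dz = kron(sparse(I, nx, nx), D1d(nz))                        # vertical derivative of vec(u)
    Dx = kron(D1d(nx), sparse(I, nz, nz))                        # lateral derivative of vec(u)
    proj_D2(y) = max.(y, 0.0)                                    # 0 <= (Dz u): monotone with depth
    proj_D3(y) = clamp.(y, -0.1, 0.1)                            # |(Dx u)| <= 0.1 [m/s]/m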
This example shows that the generalized Minkowski set allows for the inclusion of prior knowledge on the two (or more) different components, as well as on their sum. The results show that this leads to improved model estimates if prior knowledge is available on both the components and the sum. Information that we may have about a background or anomaly is often difficult or impossible to include in an inverse problem as a single constraint set or an intersection of multiple sets, but easy to include in the summation structure of the generalized Minkowski set. In many practical problems, we do have some information about an anomaly. When looking for air- or water-filled voids and tunnels in engineering or archeological geophysics, we know that the acoustic wave propagation velocity is usually lower than the background, and we also have at least a rough idea about the size of the anomaly. In seismic hydrocarbon exploration, there are high-contrast salt structures in the subsurface, almost always with a higher acoustic velocity than the surrounding geology.

5.5.3 Video processing

Background-anomaly separation is a common problem in video processing. A particular example is security camera video, T ∈ R^{n_x×n_y×n_t}, where x and y are the two spatial coordinates and t is time. The separation problem is often used to illustrate robust principal component analysis (RPCA), and related convex and non-convex formulations of sparse + low-rank decomposition algorithms [e.g., Candès et al., 2011, Netrapalli et al., 2014, Kang et al., 2015, Driggs et al., 2017].

In this example, we show that the generalized Minkowski set formulation of an inverse problem, proposed in Equation (5.15), is also suitable for background-anomaly separation in image and video processing, and we illustrate the advantages of working with a constrained formulation, as opposed to the more common penalty formulation. To include multiple pieces of prior knowledge, we choose to work with the video in tensor format and use the flexibility of our regularization framework to impose constraints on the tensor, as well as on individual slices and fibers. This is different from RPCA approaches that matricize or flatten the video tensor to a matrix of size n_x n_y × n_t, such that each column of the matrix is a vectorized time-slice [Candès et al., 2011, Netrapalli et al., 2014, Kang et al., 2015], and it also differs from tensor-based RPCA methods that work with a tensor only [Zhang et al., 2014, Wang and Navasca, 2015]. Contrary to many sparse + low-rank decomposition algorithms, our set-based framework is not tied to any specific constraint, and we can mix various constraints for the two components and obtain multiple background-anomaly separation algorithms.

Beyond the basic decomposition problem, the escalator video comes with some additional challenges. There is a dynamic background component (the escalator steps), and there are reflections of people in the glass that are weak anomalies and duplicates of persons. The video contains noise, and part of the background pixel intensity changes significantly (55 on a 0-255 grayscale) over time. We subtract the mean of each frame as a pre-processing step to mitigate the change in intensity. Below we describe simple methods to derive prior knowledge for the video, as well as for the background and anomaly components.

Constraint sets for the background. We use the last 20 time frames to derive constraints for the background, because these frames do not contain people.
From these frames, we use the minimum and maximum value of each pixel over time as the bounds for the background component, denoted as set D_1. The second constraint is the subspace spanned by the last 20 frames. We require that each time frame of the background be a linear combination of the training frames, organized as a matrix S ∈ R^{n_x n_y×20}, where each column is a vectorized video frame of T. We denote this constraint as D_2 = {u | u = Sc, c ∈ R^{20}}, with coefficient vector c, which we obtain during the projection operation: P_{D_2}(u) = S(S^T S)^{-1} S^T u. After computing the singular value decomposition S = UΣV^T, the projection simplifies to P_{D_2}(u) = U U^T u.

Constraint sets for the sum of the components. We constrain the sum of the background and anomaly components to the interval of grayscale values [0, 255], minus the mean of each time frame, denoted as set F_1.

Constraint sets for the anomaly. We also have bound constraints, set E_1, on the anomaly component, which we define as the bounds on the sum minus the bounds on the background. To enhance the quality of the anomaly component, we add various types of sparsity constraints. If we had some example video available, as we have for the background component, we could observe properties of the anomaly, i.e., how many pixels are typically anomalies (people). As the escalator video is only 200 time frames long, we instead use some rough estimates of the anomaly properties to define three non-convex constraint sets. We choose to apply constraints to each time slice separately, because this makes it easier to convert basic observations or intuition into a set. The first type of sparsity constraint is the set E_2 = {T | card(T_{Ω_i}) ≤ (n_x/4 × n_y/4) ∀i ∈ {1, 2, …, n_t}}, where T_{Ω_i} is a time slice of the video tensor. This constraint limits the number of anomaly pixels in each frame to 1/16 of the total number of pixels in each time slice. The second and third constraint sets are limits on the vertical and horizontal derivatives of each time-frame image separately. If we assume the prior knowledge that there are no more than ten persons in the video at any time, we can use E_3 = {T | card((I_x ⊗ D_y) vec(T_{Ω_i})) ≤ 480, i ∈ {1, 2, …, n_t}}, based on the rough estimate of 10 persons × 12 pixels wide × 4 boundaries (the four vertical boundaries are background - head - upper body - legs - background). Similarly, for the horizontal direction, we define E_4 = {T | card((D_x ⊗ I_y) vec(T_{Ω_i})) ≤ 440, i ∈ {1, 2, …, n_t}}, based on the estimate of 10 persons × 22 pixels tall × 2 boundaries (the horizontal boundaries are background - person - background).

Putting it all together, we project the video onto the generalized Minkowski set as defined in (5.6), i.e., we solve

    min_x (1/2)‖x − vec(T)‖_2^2  s.t.  x ∈ F_1 ⋂ ( ⋂_{i=1}^2 D_i + ⋂_{j=1}^4 E_j )    (5.18)

using the iterations derived in Equation (5.12). Our formulation implies that the projection of a vector is always the sum of the two components, but this does not mean that x is equal to vec(T) at the solution, because we did not include a constraint on x that says we need to fit the data accurately. We did not use a data-fit constraint because it is not evident how tightly we want to fit the data or how much noise there is. By computing the projection of the original video, we still include a sense of proximity to the observed data.
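To make the training-based set definitions above concrete, a minimal Julia sketch (a random array stands in for the escalator video; the package assembles these pieces differently):

    using LinearAlgebra

    T = rand(Float32, 64, 64, 200)          # stand-in for the escalator video tensor
    train = T[:, :, end-19:end]             # the last 20 frames contain no people
    lb = vec(minimum(train, dims=3))        # per-pixel lower bounds -> set D_1
    ub = vec(maximum(train, dims=3))        # per-pixel upper bounds -> set D_1
    S = reshape(train, :, 20)               # training frames as columns
    U = svd(S).U                            # orthonormal basis for the frame subspace
    proj_D1(u) = clamp.(u, lb, ub)          # projection onto the bounds
    proj_D2(u) = U * (U' * u)               # P_D2(u) = U*U'*u, the subspace projection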
This exampleillustrates that the constrained approach is suitable to observeand apply constraint properties obtained from a few frames ofbackground only video.147shown in Figure (5.3), is visually better than the six methods comparedby Driggs et al. [2017]. The compared results often show blurring of theescalator steps in the estimated background, especially when a person is onthe escalator. Several results also show visible escalator structure in theanomaly estimation. Our simple approach does not suffer from these twoproblems. We do not need to estimate any penalty or trade-off parame-ters, but rely on constraint sets whose parameters we can observe directlyor estimate from a few training frames. We were able to conveniently mixconstraints on slices and fibers of the tensor by working with the constrainedformulation.5.6 DiscussionSo far, we described the concepts and algorithms for the case of a Minkowskisum with only two components. Our approach can handle more than twocomponents, but the linear systems in Equation (5.12) will become larger. Abetter solver than plain conjugate-gradients can mitigate increased solutionstimes due to larger linear systems, possibly by taking the block structureinto account.A Minkowski sum of more than two components can also make it lessintuitive what type of solutions are in the Minkowski sum of sets. We canregain some intuition about the generalized Minkowski set by looking atsampled elements from the set. Samples are simply obtained by projectingvectors (possible solutions, reference models, random vectors, . . . ) onto thetarget set.All numerical examples were set up to illustrate how we use the general-ized Minkowski set for the regularization of inverse problems, given multiplepieces of prior knowledge on the two components of a model, as well as priorinformation on the sum of the components. Application evaluation of theproposed regularization approach to more realistic examples, as in chapter2 and 3, is left for future work.1485.7 ConclusionsInverse problems for physical parameter estimation and image and videoprocessing often encounter model parameters with complex structure, sothat it is difficult to describe the expected model parameters with a singleset or intersection of multiple sets. In these situations, it may be easierto work with an additive model description where the model is the sum ofmorphologically distinct components.We presented a regularization framework for inverse problems with theMinkowski set at its core. The additive structure of the Minkowski setallows us to enforce prior knowledge in the form of separate constraints oneach of the model components. In that sense, our work differs from currentapproaches that rely on additive penalties for each component. As a result,we no longer need to introduce problematic trade-off parameters.Unfortunately, the Minkowski set by itself is not versatile enough forphysical parameter estimation because we also need to enforce bound con-straints and other prior knowledge on the sum of the two components toensure physical feasibility. Moreover, we would like to use more than oneconstraint per component to incorporate all prior knowledge that we mayhave available.To deal with this situation, we proposed a generalization of the Minkowskiset by defining it as the intersection of a Minkowski set with another con-straint set on the sum of the components. 
With this construction, we can enforce multiple constraints on the model parameters, as well as multiple constraints on each component.

To solve inverse problems with these constraints, we discussed how to project onto generalized Minkowski sets based on the alternating direction method of multipliers. The projection enables projected-gradient based methods to minimize nonlinear functions subject to constraints. We also showed that, for linear inverse problems, the linear forward operator fits into the projection computation directly as a data-fitting constraint. This makes the inversion faster if the application of the forward operator does not take much time.

Numerical examples show how the generalized Minkowski set helps to solve non-convex seismic parameter estimation problems and a background-anomaly separation task in video processing, given prior knowledge on the model parameters, as well as on the components.

Chapter 6

Discussion

In each of the chapters, I discussed and developed formulations and computational methods for inverse problems from a constrained optimization point of view. So far, I did not discuss the statistical and Bayesian [e.g., Tarantola, 2005] interpretation much.

The aim of this thesis is not to take a side in the frequentist versus Bayesian debate [Scales and Snieder, 1997, Stark, 2015]. Nor did I study the relations and connections between the two approaches to inverse problems. Constraints became the core of this work because we found them easy to use and beneficial when working with seismic field data [Smithyman et al., 2015]. Rather than focus on the differences between constraints and prior distributions [see Backus, 1988, Scales and Snieder, 1997, Stark, 2015], I would like to relate this thesis to the Bayesian approach by discussing and developing similarities that may help bridge the conceptual gap between the two approaches.

First, constraints can also incorporate statistical prior knowledge. For instance, we can constrain the model parameters to have the same histogram or correlations as an example image, possibly in a transform domain [e.g., Portilla and Simoncelli, 2000, Peyré, 2009, Fadili and Peyré, 2011, Mei et al., 2015]. This is common in the field of texture synthesis, and I used similar ideas in chapter 4 to obtain prior information from examples and include it in image processing inverse problems.

The second conceptual similarity between Bayesian and constrained approaches is the notion of prior and posterior distributions, where both the prior and the data observations influence the latter. The constrained formulation gives rise to a similar structure. Based on a problem formulation where we have a data-fit constraint and multiple constraints that describe prior knowledge (V_i), we may consider the prior information set, ⋂_{i=1}^p V_i, somewhat analogous to a prior probability distribution. Samples from ⋂_{i=1}^p V_i are easy to obtain by solving feasibility or projection problems, starting at random points or at examples of solutions of similar inverse problems. Each sample, s_j^prior, is an element of the intersection of the constraint sets that describe model properties: s_j^prior ∈ ⋂_{i=1}^p V_i. If the intersection is convex, we can construct more samples as convex combinations: γ_1 s_1^prior + γ_2 s_2^prior + ... with γ_1 + γ_2 + ... = 1 and γ_1 ≥ 0, γ_2 ≥ 0, ....

Analogous to the posterior distribution, we can look at the intersection of the constraint sets that describe model properties with the data-fit constraint set,

    ⋂_{i=1}^p V_i ⋂ V_data.    (6.1)

The resulting set contains all models that satisfy the prior knowledge as well as the observed data. This collection of models can provide us with a sense of the uncertainty/spread in the model estimate.
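A small Julia sketch of this sampling procedure, with a trivial box projection standing in for the projection onto ⋂ V_i (which would in practice be computed with the algorithms of chapter 4):

    project_prior(m) = clamp.(m, 0.0, 1.0)    # placeholder for the projection onto the prior set
    s1 = project_prior(rand(10_000))          # prior sample from a random starting point
    s2 = project_prior(rand(10_000))
    g = 0.3
    s3 = g * s1 + (1 - g) * s2                # convex combination: another prior sample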
To provide some intuition about the statements in this section, we visualize them using classical image inpainting. This is essentially a higher-dimensional version of the geometrical example in the introduction. Note that the solutions of the image and video processing examples in chapters 4 and 5 were also points in a set defined as in (6.1).

The true image and the observed data are shown in Figure 6.1. The true image is a simple texture; the observed data are vertical bands, which spread out more and more from left to right, so the number of missing pixels increases from left to right.

Figure 6.1: The true image (left) and the observed data (right), which consist of vertical bands of the true image, increasingly sparsely sampled from left to right.

We will use the three constraint sets to describe prior knowledge that were already used in chapter 3 to describe typical acoustic velocity models in sedimentary geological settings. This means we enforce: 1) bound constraints; 2) lateral smoothness; 3) with depth (going from top to bottom), parameter values can increase arbitrarily fast, but can only decrease slowly.

To generate samples from the intersection of prior knowledge, we project models filled with random numbers onto the intersection. Figure 6.2 shows three samples.

Figure 6.2: Three samples from the prior information set, which is the intersection of bounds, lateral smoothness, and parameter values that are limited to decrease slowly in the downward direction. The samples are the result of projecting random images onto the intersection.

Figure 6.3 displays three samples from the intersection of the sets that contain data information and prior knowledge. The data constraints are bounds that match the observed data. The samples from this intersection are the projections of the samples from the prior information set shown in Figure 6.2. In Figure 6.3, we also show the difference between the true image and the samples from the intersection of prior and data information. Because of the bound constraints at the observations, there is almost no error at the data locations, which leads to the striped pattern in both the samples and the error.
Besides random samples, we can alsomanually construct prior samples, project models expected to be similar tothe true model, and take convex combinations of prior samples to generateadditional insight quickly.1540 100 200 3000100200300400Sample from prior & data info0 100 200 3000100200300400Error 0 100 200 3000100200300400Sample from prior & data info0 100 200 3000100200300400Error 0 100 200 3000100200300400Sample from prior & data info0 100 200 3000100200300400ErrorFigure 6.3: Samples from the intersection of sets that describe priorknowledge and data observations. The bottom row shows thedifference between the sample from the top row and the truemodel from Figure (6.1).0 100 200 300050100150200250300350400Maximum values0 100 200 300050100150200250300350400Minimum values0 100 200 300050100150200250300350400Max - MinFigure 6.4: Pointwise maximum and minimum values, as well as thedifference of the three samples from Figure (6.3).155Chapter 7ConclusionsInverse problems in the imaging sciences range from linear inverse problemssuch as cleaning and reconstructing images, to partial-differential-equationbased geophysical parameter estimation where the data relates nonlinearlyto the model parameters. Solving a problem in either of these two cate-gories requires prior information (regularization) on the model parametersto achieve state-of-art results. Regularization combats issues introduced bydata deficiencies and inherent nonuniqueness of the solutions of an inverseproblem. While we apply regularization based on what we expect from thefinal estimate, in challenging non-convex inverse problems like full-waveforminversion we also greatly benefit from applying (possibly different) regular-ization to the intermediate results at every iteration. Such a procedure canprevent the model estimates from becoming physically/geologically unreal-istic, which halts progress towards the correct model parameters in lateriterations.In this thesis, I proposed contributions to various aspects of solving in-verse problems. This includes problem formulations, how to work with mul-tiple pieces of prior knowledge, algorithms, software design, practical high-performance implementation, applications in image and video processing,and specific solutions strategies for seismic full-waveform inversion.In the following, I summarize the conclusions per topic:Inverse problem formulations. In chapter 2 and 3, I motivate why156I prefer to work with multiple constraints instead of multiple penalty func-tions for the regularization of non-convex seismic full-waveform inversion.In the first four chapters, I regularize using an intersection of convex andnon-convex sets. I present and discuss a few main arguments in favor ofconstraints: i) no need to select multiple scalar penalty parameters becauseeach constraint is imposed independently of the others; ii) some constraintsare set directly in terms of physical quantities; iii) I show that the solu-tion of seismic full-waveform inversion behaves predictably as a function ofconstraint ‘size’, but less predictable when we vary trade-off parameters inpenalty formulations; (iv) constraints in combination with projections offerguarantees that the model parameters remain in the constraint set at ev-ery iteration. For non-convex problems, this can help avoid local minimawhen the constraints are relaxed gradually. 
I demonstrate various successfulapplications with different combinations of constraints using this strategy.My primary contribution to advocating intersections of sets is the specificapplication to full-waveform inversion in combination with controlling theproperties of intermediate model estimates. In chapter 4, I also show thatsimple machine learning can provide us with many (≥ 10) pieces of priorinformation, that serve as constraints using an intersection of sets, somethingthat would be more complicated if not impossible in case of multiple penaltyfunctions where we need to balance the influence of ten or more penalties.In chapter 5, I introduced a new problem formulation that merges andextends previous work on intersections of constraints sets and additive modelstructures such as cartoon-texture decomposition, morphological compo-nent analysis, robust principal component analysis, and multi-scale imagedescriptions. Additive descriptions of model parameters add two or morecomponents to generate an image. This separation makes it easier to includeprior information when it is difficult to describe all model parameters us-ing a single property, i.e., when the model contains morphologically distinctcomponents. The constrained version of an additive model that I proposeis based on the Minkowski set, or vector sum of sets.I showed that this set by itself is of limited use for the regularizationof inverse problems, because we want, and need, constraints on the sum of157the components as well. Moreover, motivated by the examples in chapter2 and 3, I also want to include multiple pieces of prior knowledge on eachcomponent. I proposed to generalize the Minkowski set by allowing each ofthe two components to be an intersection themselves, and also enforce anintersection of constraints on the sum of the components.In summary, the model is an element of an intersection of a sum of inter-sections and another intersection. The extensions to a Minkowski set thatI introduced, make it easier to include more pieces of prior knowledge. Nu-merical examples in video segmentation and seismic full-waveform inversionillustrate this benefit.Algorithms. My contributions to the algorithmic side of regularizationvia intersections of sets split up between chapters 2 & 3 on the one handand chapters 4 & 5 on the other hand. In chapter 2 and 3, I combine threeexisting algorithms to create an easy to use and versatile workflow thatadds multiple constraint sets to an inverse problem. The target problemsfor the algorithms in chapters 2 and 3 are partial-differential-equation basedparameter estimation, particularly seismic waveform inversion. The philos-ophy of this framework is to split the complicated problem, minimization ofa non-convex data-fit function subject to multiple constraints, into simplerand simpler computational pieces until we can solve each part easily and inclosed form. Starting from the top, I use the spectral projected gradient al-gorithm (SPG) to create separate data-fitting and feasibility problems. Weensure feasibility at every SPG iteration by projecting onto the intersectionof constraint sets using Dykstra’s algorithm. 
Whenever one of Dykstra's sub-problems, a projection onto a single set, is not known in closed form, I use the alternating direction method of multipliers (ADMM) to compute it (see Appendix A). This framework is the first effort to add an arbitrary intersection constraint to seismic full-waveform inversion.

In chapter 4, I merge the functionality of Dykstra's algorithm and ADMM to compute projections onto an intersection of sets. Nesting ADMM inside Dykstra's algorithm does not exploit possible similarity between sets and requires stopping conditions such that both algorithms operate together efficiently. Therefore, I developed a new algorithm specifically for computing projections onto the intersection of multiple constraint sets. Whereas Dykstra's algorithm treats every projection onto a single set as a black box, I focus on the efficient treatment of sets that include non-orthogonal linear operators in their definitions. Numerical examples show that this approach is much faster because I formulated the problem such that it takes similarity between the linear operators into account. The proposed algorithm achieves good empirical performance on problems with non-convex sets by using multilevel coarse-to-fine grid continuation and an automatic selection scheme for the augmented-Lagrangian penalty parameters that occur in ADMM-based solvers. The algorithms apply to problems defined on small 2D and large 3D grids (≈ 400³) by virtue of solving all sub-problems in parallel, automatic selection methods for augmented-Lagrangian and relaxation parameters, multilevel acceleration, solving sparse and banded linear systems with multi-threaded matrix-vector products in the compressed diagonal storage format, and careful implementation of the proposed algorithms in Julia. Examples show that the proposed algorithm enables the development of regularization strategies that use many different constraint sets, for both small and large-scale inverse problems.

In chapter 5 there are no new algorithms, but I show that the algorithms from chapter 4 apply to more than just intersections of sets. I reformulate the projection onto the extended Minkowski set such that it can use the algorithms from chapter 4. The primary computational difference between projections onto intersections of sets and onto the extended Minkowski set is that certain linear operators become larger block-structured linear operators in the sum-of-sets scenario.

Software and implementation. All algorithms presented in chapter 4 for the computation of projections onto intersections of sets are part of a software package that I developed: SetIntersectionProjection. This functionality serves as the projection step inside a projected-gradient-based algorithm, or it can directly solve an inverse problem stated as a projection or feasibility problem. There are a few reasons why I implemented everything in Julia. First, all code uses parametric typing, so everything works in single and double precision without any code modifications or copies of the code with minor changes. A second argument in favor of Julia is the convenient implementation of coarse- and fine-grained parallelism. I used coarse-scale parallelism to compute the sub-problems of the algorithms from chapter 4 in parallel. Each of these sub-problems is then also solved in parallel, using either Julia threads or standard multithreaded libraries for linear algebra and Fourier-transform based operations.
Another simple trick that speeds up the computations, at the cost of some increased peak memory usage, is keeping all vector-valued quantities in memory and overwriting them in-place, thereby avoiding time-consuming memory re-allocation. The numerical examples showed that the combination of problem formulation, algorithms, and implementation makes the software package suitable for quickly testing various combinations of constraints for a range of small and large-scale inverse problems.

Besides the algorithms, I also included scripts that set up linear operators and projectors onto simple sets. These two building blocks are the input for the software to compute the projection onto the intersection of sets. The modular software design still allows users to work with their custom linear operators and projectors, as the algorithms themselves do not depend on a specific projector or operator construction.

Applications. The most prominently featured application in this thesis is seismic full-waveform inversion (FWI), where we estimate acoustic velocity from observed seismic waves. Most exploration experiments have sources and receivers on only one side of the computational domain. While data noise and missing observations have a moderate impact on the recovered velocity models, the main challenge for FWI is the combination of an inaccurate initial guess and unavailable low-frequency data. These factors often cause the estimated model parameters to be geologically unrealistic. In chapters 2 and 3, I developed strategies to mitigate this problem by using constraints on the model parameters. The core of the approach is to start solving the inverse problem using low-frequency data and 'tight' constraints, continuing to higher-frequency data and 'looser' constraints. While some incarnations of this concept have been around for a long time, such as working from smooth to less-smooth models by reducing the penalty parameters for Tikhonov regularization, I extended these ideas in the following ways: i) I use a constrained formulation with three different types of constraints. ii) Constraint sets do not depend on penalty parameters, and after projection onto the intersection, the model parameters satisfy all constraints exactly. This approach provides accurate control of the model properties. iii) I show that the tight-to-loose constraint strategy for FWI works with total-variation constraints, as well as with slope constraints that induce monotonicity or smoothness. iv) Using a number of numerical examples, I show that the described strategy is a useful tool for FWI in general. Specifically, I demonstrate that a relaxation of multiple constraints works for two different formulations of FWI, for both sedimentary geology and models with high-contrast salt structures, and also when we do not know the source function and the observed synthetic data is modeled using more complex physics on finer grids.

Limitations, ongoing developments, and future research directions. Chapters 2, 3, 4, and 5 generally follow the future research directions proposed in the preceding chapters. Chapter 3 introduces new algorithms that are faster than the ones in chapter 2, and illustrates the presented workflows on a more realistic example. In chapter 4, the main limitations of chapter 3 were tackled: avoiding nested algorithms and computing projections onto intersections of sets on large 3D grids, which requires a much faster implementation compared to the one in chapter 3.
In chapter 4, I also extend the applications to the image processing tasks of denoising, deblurring, inpainting, and desaturation. The discussion and conclusions sections in chapter 4 describe ways to increase computational performance. In chapter 5, I do not continue the research on computational performance, but address a more important limitation of the intersections-of-sets concept that underpins chapters 2, 3, and 4. This limitation arises when the geophysical models or images have a complex structure that is not easily described by standard sets (e.g., total-variation, low-rank, smooth) or their intersections. The Minkowski set and the proposed generalization therefore offer additional freedom to describe complex models and to use more detailed prior knowledge.

One main limitation remains that has not been discussed so far: what to do when there is no good prior information available to define constraint sets? In chapters 2 and 3, I introduced heuristics to select the maximum total-variation and smoothness, but they are still heuristics. The image and video processing examples in chapters 4 and 5 rely on examples to derive useful constraints. However, these training examples need to be relatively similar to the evaluation images. The challenge is to construct more quantitative ways to select constraint parameters for full-waveform inversion, and to relax the similarity requirements for training data in image and video processing. In what follows, I outline a proposal to combine the strengths of the methods and algorithms in this thesis with recent developments in neural network research. The goal is to find additional ways to obtain information about 'good' constraints for PDE-based inverse problems and image/video processing.

In the past few years, many regularization techniques based on neural networks have been proposed. These include networks that a) map a corrupted image to a large scalar and a good image to a small scalar, thereby acting as a non-linear penalty function; b) directly map a corrupted image to the reconstruction, or map observed data to the model parameters. This type of end-to-end training is less flexible, in the sense that the network effectively includes the forward map and the regularization; new forward maps or different regularization require additional training of the network; c) act as the proximal map or projection operator as part of algorithms like proximal gradient. This approach combines neural networks with custom forward modeling operators and is also known as plug-and-play regularization; d) map low-accuracy solutions of inverse problems into higher-quality ones by removing artifacts in the image or estimated model parameters.

The different ways of using neural networks to solve inverse problems show state-of-the-art results. Each of the approaches comes with some limitations. First and foremost, networks typically require a large amount of training data and labels. Below, I propose an alternative way to incorporate a neural network, which hopefully requires a relatively small amount of data, and that is easy and fast to train. At the same time, there is still the flexibility to change the (nonlinear) forward modeling operator.

I propose to use a neural network that maps a corrupted image to a scalar that describes a property of the clean image. These properties include ℓ1, ℓ2, or nuclear norms of the image, possibly in a transform domain. I can then use the scalar properties to define constraint sets for the regularization of an inverse problem.
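As a toy illustration of this two-step idea, and not an implementation from the thesis, the sketch below replaces the neural network with a simple ridge regression that predicts the (anisotropic) total variation of a clean image from a few statistics of its corrupted version. All feature choices, noise levels, and the synthetic training set are hypothetical; the predicted scalar would play the role of σ in a constraint set of the form {m | ‖Am‖1 ≤ σ}.

```julia
using LinearAlgebra, Statistics, Random

# Anisotropic total variation of a 2D image: sum of absolute differences.
tv(m) = sum(abs, diff(m, dims=1)) + sum(abs, diff(m, dims=2))

# Hypothetical hand-crafted features of a corrupted image; a trained
# (shallow) network would replace this map in the actual proposal.
features(m) = [1.0, mean(abs, diff(m, dims=1)), mean(abs, diff(m, dims=2)), std(m)]

Random.seed!(1)
n, ntrain = 32, 200
clean(k)  = [sin(0.2*i + 0.1*k) * cos(0.15*j) for i in 1:n, j in 1:n]

# Training pairs: features of the noisy image -> TV of the clean image.
F = zeros(ntrain, 4); y = zeros(ntrain)
for k in 1:ntrain
    m_clean = clean(k)
    m_noisy = m_clean .+ 0.1 .* randn(n, n)
    F[k, :] = features(m_noisy)
    y[k]    = tv(m_clean)
end

# Ridge regression, a stand-in for the proposed scalar-prediction network.
λ = 1e-3
w = (F' * F + λ * I) \ (F' * y)

# At solve time: predict the constraint parameter σ from corrupted input.
m_test = clean(999) .+ 0.1 .* randn(n, n)
σ = dot(features(m_test), w)    # estimate of the clean image's TV
```

Because each constraint set is defined independently, one such predictor can be trained per property (total variation, nuclear norm, and so on) without retraining the others, which is exactly the modularity argument made above.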
The two-step approach requires networks that map an image to a scalar rather than to an image, so perhaps the network can be shallower and narrower and need fewer training examples compared to the four types of neural-network regularization mentioned above.

Learning image properties and the constrained formulation for an inverse problem is a good combination because each constraint set is defined independently of all other constraint sets. Therefore, we can train one network per image property, independently of all other networks. Another advantage is that the constraints do not depend on the inverse problem or data-misfit function. Trained networks can, therefore, define constraints for most inverse problems.

Besides image processing, I also aim for the more ambitious goal of estimating model-parameter properties from data obtained by physical experiments: for example, learning a direct map from observed seismic data to a scalar property of the velocity model in which the waves propagated. This type of problem has a nonlinear forward modeling operator that maps the model parameters to data. Our goal of training networks on a relatively small number of examples is especially important for geophysical inverse problems, where examples are scarce and selecting regularization is difficult. The two-step approach may alleviate some of these difficulties.

Initial training and testing of networks that predict image and data properties from corrupted inputs or data showed promising results. However, the added value of this idea still needs to be proven. The next questions are: 1) what is the trade-off between the number of training examples and the reconstruction error, compared to other approaches to using networks for inverse problems? 2) what types of network designs are suitable for learning to predict model properties? 3) are the currently available synthetic geophysical models realistic and diverse enough?

Bibliography

M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo. An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems. IEEE Transactions on Image Processing, 20(3):681–695, March 2011. ISSN 1057-7149. doi:10.1109/TIP.2010.2076294. → pages 82, 89, 134

H. K. Aggarwal, M. P. Mani, and M. Jacob. Model based image reconstruction using deep learned priors (MoDL). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 671–674, April 2018. doi:10.1109/ISBI.2018.8363663. → page 123

V. Akcelik, G. Biros, and O. Ghattas. Parallel multiscale Gauss-Newton-Krylov methods for inverse wave propagation. In Supercomputing, ACM/IEEE 2002 Conference, pages 41–41, Nov 2002. doi:10.1109/SC.2002.10002. → page 40

M. S. C. Almeida and M. Figueiredo. Deconvolving images with unknown boundaries using the alternating direction method of multipliers. IEEE Transactions on Image Processing, 22(8):3074–3086, Aug 2013. ISSN 1057-7149. doi:10.1109/TIP.2013.2258354. → page 83

A. Y. Anagaw. Full waveform inversion using simultaneous encoded sources based on first- and second-order optimization methods. PhD thesis, University of Alberta, 2014. → page 40

A. Y. Anagaw and M. D. Sacchi. Full waveform inversion with total variation regularization. In Recovery-CSPG CSEG CWLS Convention, 2011. → page 21

A. Y. Anagaw and M. D. Sacchi. Edge-preserving smoothing for simultaneous-source FWI model updates in high-contrast velocity models. GEOPHYSICS, 0(ja):1–18, 2017. doi:10.1190/geo2017-0563.1. URL https://doi.org/10.1190/geo2017-0563.1. → page 42

N. Antonello, L. Stella, P. Patrinos, and T.
van Waterschoot. Proximal Gradient Algorithms: Applications in Signal Processing. ArXiv e-prints, Mar. 2018. → page 81

F. J. Aragón Artacho and R. Campoy. A new projection method for finding the closest point in the intersection of convex sets. Computational Optimization and Applications, 69(1):99–132, Jan 2018. ISSN 1573-2894. doi:10.1007/s10589-017-9942-5. URL https://doi.org/10.1007/s10589-017-9942-5. → page 80

A. Aravkin, R. Kumar, H. Mansour, B. Recht, and F. J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. doi:10.1137/130919210. URL https://doi.org/10.1137/130919210. → pages 78, 115

A. Y. Aravkin and T. van Leeuwen. Estimating nuisance parameters in inverse problems. Inverse Problems, 28(11):115016, 2012. ISSN 0266-5611. → pages 1, 62

A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy. Level-set methods for convex optimization. arXiv preprint arXiv:1602.01506, 2016. → page 78

A. Asnaashari, R. Brossier, S. Garambois, F. Audebert, P. Thore, and J. Virieux. Time-lapse seismic imaging using regularized full-waveform inversion with a prior model: which strategy? Geophysical Prospecting, 63(1):78–98. doi:10.1111/1365-2478.12176. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1365-2478.12176. → page 3

A. Asnaashari, R. Brossier, S. Garambois, F. Audebert, P. Thore, and J. Virieux. Regularized seismic full waveform inversion with prior model information. GEOPHYSICS, 78(2):R25–R36, 2013. doi:10.1190/geo2012-0104.1. URL http://dx.doi.org/10.1190/geo2012-0104.1. → pages 3, 40

G. E. Backus. Comparing hard and soft prior bounds in geophysical inverse problems. Geophysical Journal International, 94(2):249, 1988. doi:10.1111/j.1365-246X.1988.tb05899.x. URL http://dx.doi.org/10.1111/j.1365-246X.1988.tb05899.x. → pages 36, 151

J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988. doi:10.1093/imanum/8.1.141. URL http://imajna.oxfordjournals.org/content/8/1/141.abstract. → pages 56, 92, 137

A. Baumstein. POCS-based geophysical constraints in multi-parameter full wavefield inversion. EAGE, 06 2013. → pages 7, 36, 38

H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011. ISBN 1441994661, 9781441994660. → pages 38, 46, 48

H. H. Bauschke and V. R. Koch. Projection methods: Swiss army knives for solving feasibility and best approximation problems with halfspaces. Contemporary Mathematics, 636:1–40, 2015. → pages 47, 73, 80, 86, 196, 199

A. Beck. Introduction to Nonlinear Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2014. doi:10.1137/1.9781611973655. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973655. → page 53

A. Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015. doi:10.1137/13094829X. URL http://dx.doi.org/10.1137/13094829X. → page 44

A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi:10.1137/1.9781611974997. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611974997. → pages 85, 131

S. Becker, L. Horesh, A. Aravkin, E. van den Berg, and S. Zhuk. General optimization framework for robust and regularized 3D FWI.
In 77th EAGE Conference and Exhibition 2015, 2015. → pages 6, 37, 40

S. R. Becker, E. J. Candès, and M. C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165, Jul 2011. ISSN 1867-2957. doi:10.1007/s12532-011-0029-5. URL https://doi.org/10.1007/s12532-011-0029-5. → page 115

L. Bello and M. Raydan. Convex constrained optimization for the seismic reflection tomography problem. Journal of Applied Geophysics, 62(2):158–166, 2007. ISSN 0926-9851. doi:http://dx.doi.org/10.1016/j.jappgeo.2006.10.004. URL http://www.sciencedirect.com/science/article/pii/S0926985106001467. → pages 7, 36, 37, 57

D. P. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM Journal on Control and Optimization, 20(2):221–246, 1982. doi:10.1137/0320018. URL https://doi.org/10.1137/0320018. → pages 76, 127

D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015. → pages 44, 54

J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi:10.1137/141000671. URL https://doi.org/10.1137/141000671. → pages 83, 92

S. A. Bigdeli and M. Zwicker. Image restoration using autoencoding priors. arXiv preprint arXiv:1703.09964, 2017. → page 123

F. Billette and S. Brandsberg-Dahl. The 2004 BP velocity benchmark. 67th EAGE Conference & Exhibition, (June):13–16, 2005. URL http://www.earthdoc.org/publication/publicationdetails/?publication=1404. → page 27

E. G. Birgin and M. Raydan. Robust stopping criteria for Dykstra's algorithm. SIAM Journal on Scientific Computing, 26(4):1405–1414, 2005. doi:10.1137/03060062X. URL http://dx.doi.org/10.1137/03060062X. → page 53

E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM J. on Optimization, 10(4):1196–1211, Aug. 1999. ISSN 1052-6234. doi:10.1137/S1052623497330963. URL http://dx.doi.org/10.1137/S1052623497330963. → pages 41, 55, 56, 57, 76, 109, 127, 137, 140

E. G. Birgin, J. M. Martínez, and M. Raydan. Inexact spectral projected gradient methods on convex sets. IMA Journal of Numerical Analysis, 23(4):539, 2003. doi:10.1093/imanum/23.4.539. URL http://dx.doi.org/10.1093/imanum/23.4.539. → pages 55, 58, 137

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787. → pages 44, 46

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011. ISSN 1935-8237. doi:10.1561/2200000016. URL http://dx.doi.org/10.1561/2200000016. → pages 25, 44, 47, 86, 87, 93, 131, 136, 192, 194, 202

J. P. Boyle and R. L. Dykstra. A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces, pages 28–47. Springer New York, New York, NY, 1986. ISBN 978-1-4613-9940-7. doi:10.1007/978-1-4613-9940-7_3. URL http://dx.doi.org/10.1007/978-1-4613-9940-7_3. → pages 24, 38, 47, 48, 80, 199

A. J. Brenders and R. G. Pratt. Full waveform tomography for lithospheric imaging: results from a blind test in a realistic crustal model. Geophysical Journal International, 168(1):133–151, 2007. doi:10.1111/j.1365-246X.2006.03156.x. URL http://gji.oxfordjournals.org/content/168/1/133.abstract. → page 41

C. Bunks. Multiscale seismic waveform inversion. Geophysics, 60(5):1457, Sept. 1995. ISSN 1070485X. doi:10.1190/1.1443880. URL http://link.aip.org/link/?GPY/60/1457/1&Agg=doi. → page 62

J.
Burke. Basic convergence theory. Technical report, University of Washington, 1990. → page 43

G. Buzzard, S. Chan, S. Sreehari, and C. Bouman. Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium. SIAM Journal on Imaging Sciences, 11(3):2001–2020, 2018. doi:10.1137/17M1122451. URL https://doi.org/10.1137/17M1122451. → page 123

E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, Apr 2009. ISSN 1615-3383. doi:10.1007/s10208-009-9045-5. URL https://doi.org/10.1007/s10208-009-9045-5. → page 115

E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011. ISSN 0004-5411. doi:10.1145/1970392.1970395. URL http://doi.acm.org/10.1145/1970392.1970395. → pages 126, 127, 144

Y. Censor. Computational acceleration of projection algorithms for the linear best approximation problem. Linear Algebra and its Applications, 416(1):111–123, 2006. ISSN 0024-3795. doi:http://dx.doi.org/10.1016/j.laa.2005.10.006. URL http://www.sciencedirect.com/science/article/pii/S0024379505004891. → pages 73, 80

Y. Censor, T. Elfving, N. Kopf, and T. Bortfeld. The multiple-sets split feasibility problem and its applications for inverse problems. Inverse Problems, 21(6):2071, 2005. → page 94

S. H. Chan, X. Wang, and O. A. Elgendy. Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, March 2017. ISSN 2333-9403. doi:10.1109/TCI.2016.2629286. → page 123

J. H. R. Chang, C. Li, B. Póczos, and B. V. K. V. Kumar. One network to solve them all: solving linear inverse problems using deep projection models. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5889–5898, Oct 2017. doi:10.1109/ICCV.2017.627. → page 123

S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001. doi:10.1137/S003614450037906X. URL https://doi.org/10.1137/S003614450037906X. → page 78

P. Combettes. The convex feasibility problem in image recovery. Volume 95 of Advances in Imaging and Electron Physics, pages 155–270. Elsevier, 1996. doi:https://doi.org/10.1016/S1076-5670(08)70157-5. URL http://www.sciencedirect.com/science/article/pii/S1076567008701575. → pages 77, 138

P. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, volume 49 of Springer Optimization and Its Applications, pages 185–212. Springer New York, 2011. ISBN 978-1-4419-9568-1. doi:10.1007/978-1-4419-9569-8_10. URL http://dx.doi.org/10.1007/978-1-4419-9569-8_10. → pages 73, 82, 85, 89, 131, 134

P. L. Combettes. The foundations of set theoretic estimation. Proceedings of the IEEE, 81(2):182–208, Feb 1993. ISSN 0018-9219. doi:10.1109/5.214546. → pages 77, 138

P. L. Combettes and J. C. Pesquet. Image restoration subject to a total variation constraint. IEEE Transactions on Image Processing, 13(9):1213–1222, Sept 2004. ISSN 1057-7149. doi:10.1109/TIP.2004.832922. → pages 77, 79, 114

S. C. Constable, R. L. Parker, and C. G. Constable. Occam's inversion: A practical algorithm for generating smooth models from electromagnetic sounding data. GEOPHYSICS, 52(3):289–300, 1987. doi:10.1190/1.1442303. URL http://dx.doi.org/10.1190/1.1442303. → pages 6, 78

C. Da Silva and F. J. Herrmann.
A Unified 2D/3D Large Scale Software Environment for Nonlinear Inverse Problems. ArXiv e-prints, Mar. 2017. → pages 62, 78, 110

W. Dai, X. Wang, and G. T. Schuster. Least-squares migration of multisource data with a deblurring filter. GEOPHYSICS, 76(5):R135–R146, 2011. doi:10.1190/geo2010-0159.1. URL https://doi.org/10.1190/geo2010-0159.1. → page 3

Y. Dai and L. Liao. R-linear convergence of the Barzilai and Borwein gradient method. IMA Journal of Numerical Analysis, 22(1):1, 2002. doi:10.1093/imanum/22.1.1. URL http://dx.doi.org/10.1093/imanum/22.1.1. → page 56

J. Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2010. → pages 46, 48

S. Diamond, R. Takapoui, and S. Boyd. A general system for heuristic minimization of convex functions over non-convex sets. Optimization Methods and Software, 33(1):165–193, 2018. doi:10.1080/10556788.2017.1304548. URL https://doi.org/10.1080/10556788.2017.1304548. → pages 122, 131

D. P. Dobkin, J. Hershberger, D. G. Kirkpatrick, and S. Suri. Computing the intersection-depth of polyhedra. Algorithmica, 9:518–533, 1993. → page 128

A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In Control Conference (ECC), 2013 European, pages 3071–3076. IEEE, 2013. → page 81

D. Driggs, S. Becker, and A. Aravkin. Adapting Regularized Low Rank Models for Parallel Architectures. ArXiv e-prints, Feb. 2017. → pages 144, 148

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 272–279. Omnipress, 2008. → page 104

R. L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983. doi:10.1080/01621459.1983.10477029. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1983.10477029. → pages 38, 47, 80, 199

J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, Apr 1992. ISSN 1436-4646. doi:10.1007/BF01581204. URL https://doi.org/10.1007/BF01581204. → pages 89, 92

J. Eckstein and W. Yao. Understanding the convergence of the alternating direction method of multipliers: Theoretical and computational perspectives. Pac. J. Optim. To appear, 2015. → pages 87, 131, 136

R. G. Ellis and D. W. Oldenburg. Applied geophysical inversion. Geophysical Journal International, 116(1):5, 1994. doi:10.1111/j.1365-246X.1994.tb02122.x. URL http://dx.doi.org/10.1111/j.1365-246X.1994.tb02122.x. → page 6

I. Epanomeritakis, V. Akcelik, O. Ghattas, and J. Bielak. A Newton-CG method for large-scale three-dimensional elastic full-waveform seismic inversion. Inverse Problems, 24(3):034015, 2008. URL http://stacks.iop.org/0266-5611/24/i=3/a=034015. → page 21

R. Escalante and M. Raydan. Alternating Projection Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2011. ISBN 1611971934, 9781611971934. → pages 38, 48

E. Esser. Applications of Lagrangian-based alternating direction methods and connections to split Bregman. 2009. → page 92

E. Esser, L. Guasch, T. van Leeuwen, A. Y. Aravkin, and F. J. Herrmann. Automatic salt delineation: Wavefield Reconstruction Inversion with convex constraints, chapter 257, pages 1337–1343. 2015a. doi:10.1190/segam2015-5877995.1. URL http://library.seg.org/doi/abs/10.1190/segam2015-5877995.1. → pages 7, 36, 44

E. Esser, L. Guasch, T. van Leeuwen, A. Y.
Aravkin, and F. J. Herrmann. Total variation regularization strategies in full waveform inversion for improving robustness to noise, limited data and poor initializations. Technical Report TR-EOAS-2015-5, 06 2015b. URL https://www.slim.eos.ubc.ca/Publications/Public/TechReport/2015/esser2015tvwri/esser2015tvwri.html. → page 30

E. Esser, L. Guasch, F. J. Herrmann, and M. Warner. Constrained waveform inversion for automatic salt flooding. The Leading Edge, 35(3):235–239, Mar 2016a. ISSN 1070-485X. doi:10.1190/tle35030235.1. URL http://library.seg.org/doi/10.1190/tle35030235.1. → pages 22, 25

E. Esser, L. Guasch, F. J. Herrmann, and M. Warner. Constrained waveform inversion for automatic salt flooding. The Leading Edge, 35(3):235–239, 2016b. doi:10.1190/tle35030235.1. URL http://dx.doi.org/10.1190/tle35030235.1. → pages 7, 36, 44, 45, 63, 65, 79, 197

E. Esser, L. Guasch, T. van Leeuwen, A. Y. Aravkin, and F. J. Herrmann. Total-variation regularization strategies in full-waveform inversion. ArXiv e-prints, Aug. 2016. → pages 7, 22, 25, 79, 103, 109, 125, 128, 129, 137

E. Esser, L. Guasch, T. van Leeuwen, A. Y. Aravkin, and F. J. Herrmann. Total variation regularization strategies in full-waveform inversion. SIAM Journal on Imaging Sciences, 11(1):376–406, 2018. doi:10.1137/17M111328X. URL https://doi.org/10.1137/17M111328X. → pages 36, 38, 44, 63, 65

J. M. Fadili and G. Peyré. Total variation projection with first order schemes. IEEE Transactions on Image Processing, 20(3):657–669, March 2011. ISSN 1057-7149. doi:10.1109/TIP.2010.2072512. → page 151

K. Fan, Q. Wei, L. Carin, and K. A. Heller. An inner-loop free solution to inverse problems using deep neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2370–2380. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6831-an-inner-loop-free-solution-to-inverse-problems-using-deep-neural-networks.pdf. → page 123

C. G. Farquharson and D. W. Oldenburg. Non-linear inversion using general measures of data misfit and model structure. Geophysical Journal International, 134(1):213, 1998. doi:10.1046/j.1365-246x.1998.00555.x. URL http://dx.doi.org/10.1046/j.1365-246x.1998.00555.x. → pages 6, 40

C. G. Farquharson and D. W. Oldenburg. A comparison of automatic techniques for estimating the regularization parameter in non-linear inverse problems. Geophysical Journal International, 156(3):411–425, 2004. doi:10.1111/j.1365-246X.2004.02190.x. URL http://gji.oxfordjournals.org/content/156/3/411.abstract. → pages 35, 40

P. Farrell, D. Ham, S. Funke, and M. Rognes. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing, 35(4):C369–C393, 2013. doi:10.1137/120873558. URL https://doi.org/10.1137/120873558. → page 78

M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, Feb 2005. ISSN 0018-9219. doi:10.1109/JPROC.2004.840301. → page 93

L. A. Gallardo and M. A. Meju. Joint two-dimensional cross-gradient imaging of magnetotelluric and seismic traveltime data for structural and lithological classification. Geophysical Journal International, 169(3):1261–1272, 2007. doi:10.1111/j.1365-246X.2007.03366.x. URL http://dx.doi.org/10.1111/j.1365-246X.2007.03366.x. → page 3

W. Gander. Least squares with a quadratic constraint. Numerische Mathematik, 36(3):291–307, Sep 1980. ISSN 0945-3245. doi:10.1007/BF01396656. URL https://doi.org/10.1007/BF01396656. → page 78

H.
Gao, J.-F. Cai, Z. Shen, and H. Zhao. Robust principal component analysis-based four-dimensional computed tomography. Physics in Medicine and Biology, 56(11):3181, 2011a. URL http://stacks.iop.org/0031-9155/56/i=11/a=002. → pages 126, 127

H. Gao, H. Yu, S. Osher, and G. Wang. Multi-energy CT based on a prior rank, intensity and sparsity model (PRISM). Inverse Problems, 27(11):115012, 2011b. URL http://stacks.iop.org/0266-5611/27/i=11/a=115012. → pages 126, 127

G. H. Golub and U. von Matt. Quadratically constrained least squares and quadratic problems. Numerische Mathematik, 59(1):561–580, Dec 1991. ISSN 0945-3245. doi:10.1007/BF01385796. URL https://doi.org/10.1007/BF01385796. → page 78

D. Gragnaniello, C. Chaux, J. C. Pesquet, and L. Duval. A convex variational approach for multiple removal in seismic data. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 215–219, Aug 2012. → pages 125, 128

S. A. Greenhalgh, Z. Bing, and A. Green. Solutions, algorithms and inter-relations for local minimization search geophysical inversion. Journal of Geophysics and Engineering, 3(2):101, 2006. URL http://stacks.iop.org/1742-2140/3/i=2/a=001. → page 6

L. Grippo and M. Sciandrone. Nonmonotone globalization techniques for the Barzilai-Borwein gradient method. Computational Optimization and Applications, 23(2):143–169, Nov 2002. ISSN 1573-2894. doi:10.1023/A:1020587701058. URL http://dx.doi.org/10.1023/A:1020587701058. → page 57

A. Guitton and E. Díaz. Attenuating crosstalk noise with simultaneous source full waveform inversion. Geophysical Prospecting, 60(4):759–768, 2012. ISSN 1365-2478. doi:10.1111/j.1365-2478.2011.01023.x. URL http://dx.doi.org/10.1111/j.1365-2478.2011.01023.x. → page 42

A. Guitton, G. Ayeni, and E. Díaz. Constrained full-waveform inversion by model reparameterization. GEOPHYSICS, 77(2):R117–R127, 2012. doi:10.1190/geo2011-0196.1. URL http://dx.doi.org/10.1190/geo2011-0196.1. → page 42

E. Haber. Computational methods in geophysical electromagnetics. SIAM, 2014. → page 76

E. Haber and M. Holtzman Gazit. Model fusion and joint inversion. Surveys in Geophysics, 34(5):675–695, Sep 2013. ISSN 1573-0956. doi:10.1007/s10712-013-9232-4. URL https://doi.org/10.1007/s10712-013-9232-4. → page 3

E. Haber, U. M. Ascher, and D. Oldenburg. On optimization techniques for solving nonlinear inverse problems. Inverse Problems, 16(5):1263–1280, Oct. 2000. ISSN 0266-5611. doi:10.1088/0266-5611/16/5/309. URL http://stacks.iop.org/0266-5611/16/i=5/a=309?key=crossref.98f435f9ee66231b63da02b10f82a60b. → pages 76, 139

B. S. He, H. Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000. ISSN 1573-2878. doi:10.1023/A:1004603514434. URL http://dx.doi.org/10.1023/A:1004603514434. → page 193

F. Heide, S. Diamond, M. Nießner, J. Ragan-Kelley, W. Heidrich, and G. Wetzstein. Proximal: Efficient image optimization using proximal algorithms. ACM Trans. Graph., 35(4):84:1–84:15, July 2016. ISSN 0730-0301. doi:10.1145/2897824.2925875. URL http://doi.acm.org/10.1145/2897824.2925875. → page 81

F. Herrmann, X. Li, A. Y. Aravkin, and T. Van Leeuwen. A modified, sparsity-promoting, Gauss-Newton algorithm for seismic waveform inversion. In SPIE Optical Engineering + Applications, pages 81380V–81380V. International Society for Optics and Photonics, 2011. → page 43

F. J. Herrmann and X. Li. Efficient least-squares imaging with sparsity promotion and compressive sensing. Geophysical Prospecting, 60(4):696–712.
doi:10.1111/j.1365-2478.2011.01041.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2478.2011.01041.x. → page 3

J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012. → page 127

W. Hu, A. Abubakar, and T. M. Habashy. Joint electromagnetic and seismic inversion using structural constraints. GEOPHYSICS, 74(6):R99–R109, 2009. doi:10.1190/1.3246586. URL https://doi.org/10.1190/1.3246586. → page 3

F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 0(0):1–23, 2017. doi:10.1080/10556788.2017.1396601. URL https://doi.org/10.1080/10556788.2017.1396601. → page 89

V. K. Ivanov, V. V. Vasin, and V. P. Tanana. Theory of linear ill-posed problems and its applications, volume 36. Walter de Gruyter, 2013. → page 78

M. Jervis, M. K. Sen, and P. L. Stoffa. Prestack migration velocity estimation using nonlinear methods. GEOPHYSICS, 61(1):138–150, 1996. doi:10.1190/1.1443934. URL http://dx.doi.org/10.1190/1.1443934. → page 42

Z. Jia, X. Cai, and D. Han. Comparison of several fast algorithms for projection onto an ellipsoid. Journal of Computational and Applied Mathematics, 319:320–337, 2017. ISSN 0377-0427. doi:https://doi.org/10.1016/j.cam.2017.01.008. URL http://www.sciencedirect.com/science/article/pii/S0377042717300122. → page 86

Z. Kang, C. Peng, and Q. Cheng. Robust PCA via nonconvex rank approximation. In 2015 IEEE International Conference on Data Mining, pages 211–220, Nov 2015. doi:10.1109/ICDM.2015.15. → page 144

M. Karaoulis, A. Revil, D. D. Werkema, B. J. Minsley, W. F. Woodruff, and A. Kemna. Time-lapse three-dimensional inversion of complex conductivity data using an active time constrained (ATC) approach. Geophysical Journal International, 187(1):237–251, 2011. doi:10.1111/j.1365-246X.2011.05156.x. URL http://dx.doi.org/10.1111/j.1365-246X.2011.05156.x. → page 3

B. L. N. Kennett and P. R. Williamson. Subspace methods for large-scale nonlinear inversion, pages 139–154. Springer Netherlands, Dordrecht, 1988. ISBN 978-94-009-2857-2. doi:10.1007/978-94-009-2857-2_7. URL http://dx.doi.org/10.1007/978-94-009-2857-2_7. → page 42

S. Kitic, L. Albera, N. Bertin, and R. Gribonval. Physics-driven inverse problems made tractable with cosparse regularization. IEEE Transactions on Signal Processing, 64(2):335–348, Jan 2016. ISSN 1053-587X. doi:10.1109/TSP.2015.2480045. → pages 82, 88, 89, 93, 134

R. Kleinman and P. den Berg. A modified gradient method for two-dimensional problems in tomography. Journal of Computational and Applied Mathematics, 42(1):17–35, 1992. ISSN 0377-0427. doi:http://dx.doi.org/10.1016/0377-0427(92)90160-Y. URL http://www.sciencedirect.com/science/article/pii/037704279290160Y. → page 42

H. Kotakemori, H. Hasegawa, T. Kajiyama, A. Nukada, R. Suda, and A. Nishida. Performance evaluation of parallel sparse matrix-vector products on SGI Altix3700. Lecture Notes in Computer Science, 4315:153–166, 2008. → page 92

J. R. Krebs, J. E. Anderson, D. Hinkley, R. Neelamani, S. Lee, A. Baumstein, and M.-D. Lacasse. Fast full-wavefield seismic inversion using encoded sources. GEOPHYSICS, 74(6):WCC177–WCC188, 2009. doi:10.1190/1.3230502. URL http://dx.doi.org/10.1190/1.3230502. → page 3

N. Kreimer, A. Stanton, and M. D. Sacchi. Tensor completion based on nuclear norm minimization for 5D seismic data reconstruction. GEOPHYSICS, 78(6):V273–V284, 2013. doi:10.1190/geo2013-0022.1. URL https://doi.org/10.1190/geo2013-0022.1. → page 5

N. Kukreja, M. Louboutin, F. Vieira, F.
Luporini, M. Lange, and G. Gorman. Devito: Automated fast finite difference computation. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 11–19, Nov 2016. doi:10.1109/WOLFHPC.2016.06. → page 78

R. Kumar, C. D. Silva, O. Akalin, A. Y. Aravkin, H. Mansour, B. Recht, and F. J. Herrmann. Efficient matrix completion for seismic data reconstruction. GEOPHYSICS, 80(5):V97–V114, 2015. doi:10.1190/geo2014-0369.1. URL https://doi.org/10.1190/geo2014-0369.1. → page 5

A. Kundu, F. Bach, and C. Bhattacharyya. Convex optimization over intersection of simple sets: improved convergence rate guarantees via an exact penalty approach. ArXiv e-prints, Oct. 2017. → page 86

J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014. doi:10.1137/130921428. URL http://dx.doi.org/10.1137/130921428. → page 73

Y. Lee, E. Behar, J.-M. Lien, and Y. J. Kim. Continuous penetration depth computation for rigid models using dynamic Minkowski sums. Computer-Aided Design, 78:14–25, 2016. ISSN 0010-4485. doi:https://doi.org/10.1016/j.cad.2016.05.012. URL http://www.sciencedirect.com/science/article/pii/S001044851630032X. SPM 2016. → page 128

P. G. Lelièvre and D. W. Oldenburg. A comprehensive study of including structural orientation information in geophysical inversions. Geophysical Journal International, 178(2):623, 2009. doi:10.1111/j.1365-246X.2009.04188.x. URL http://dx.doi.org/10.1111/j.1365-246X.2009.04188.x. → pages 7, 36, 37, 44, 196

M. Li, O. Semerci, and A. Abubakar. A contrast source inversion method in the wavelet domain. Inverse Problems, 29(2):025015, 2013. URL http://stacks.iop.org/0266-5611/29/i=2/a=025015. → page 42

X. Li, A. Aravkin, T. van Leeuwen, and F. Herrmann. Fast randomized full-waveform inversion with compressive sensing. Geophysics, 77(3):A13, 2012a. ISSN 00168033. doi:10.1190/geo2011-0410.1. → page 3

X. Li, A. Y. Aravkin, T. van Leeuwen, and F. J. Herrmann. Fast randomized full-waveform inversion with compressive sensing. GEOPHYSICS, 77(3):A13–A17, 2012b. doi:10.1190/geo2011-0410.1. URL http://dx.doi.org/10.1190/geo2011-0410.1. → page 43

X. Li, E. Esser, and F. J. Herrmann. Modified Gauss-Newton full-waveform inversion explained: why sparsity-promoting updates do matter. Geophysics, 81(3):R125–R138, 05 2016. doi:10.1190/geo2015-0266.1. URL https://www.slim.eos.ubc.ca/Publications/Public/Journals/Geophysics/2016/li2015GEOPmgn/li2015GEOPmgn.pdf. → page 43

Y. E. Li and L. Demanet. Full-waveform inversion with extrapolated low-frequency data. GEOPHYSICS, 81(6):R339–R348, 2016. doi:10.1190/geo2016-0038.1. URL https://doi.org/10.1190/geo2016-0038.1. → page 4

Y. Lin and L. Huang. Acoustic- and elastic-waveform inversion using a modified total-variation regularization scheme. Geophysical Journal International, 200(1):489–502, 2015. doi:10.1093/gji/ggu393. URL http://gji.oxfordjournals.org/content/200/1/489.abstract. → pages 6, 40, 41, 70

L. R. Lines, A. K. Schultz, and S. Treitel. Cooperative inversion of geophysical data. GEOPHYSICS, 53(1):8–20, 1988. doi:10.1190/1.1442403. URL https://doi.org/10.1190/1.1442403. → page 3

W. López and M. Raydan. An acceleration scheme for Dykstra's algorithm. Computational Optimization and Applications, 63(1):29–44, Jan 2016. ISSN 1573-2894. doi:10.1007/s10589-015-9768-y. URL https://doi.org/10.1007/s10589-015-9768-y. → page 80

M. Louboutin, P. Witte, M. Lange, N. Kukreja, F. Luporini, G. Gorman, and F. J. Herrmann.
Full-waveform inversion, part 1: Forward modeling. The Leading Edge, 36(12):1033–1036, 2017. doi:10.1190/tle36121033.1. URL https://doi.org/10.1190/tle36121033.1. → page 61

M. Louboutin, P. Witte, M. Lange, N. Kukreja, F. Luporini, G. Gorman, and F. J. Herrmann. Full-waveform inversion, part 2: Adjoint modeling. The Leading Edge, 37(1):69–72, 2018. doi:10.1190/tle37010069.1. URL https://doi.org/10.1190/tle37010069.1. → page 78

M. Lustig, D. Donoho, and J. M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007. ISSN 1522-2594. doi:10.1002/mrm.21391. URL http://dx.doi.org/10.1002/mrm.21391. → page 115

J. Macdonald and L. Ruthotto. Improved susceptibility artifact correction of echo-planar MRI using the alternating direction method of multipliers. Journal of Mathematical Imaging and Vision, 60(2):268–282, Feb 2018. ISSN 1573-7683. doi:10.1007/s10851-017-0757-x. URL https://doi.org/10.1007/s10851-017-0757-x. → pages 95, 112

S. Mallat and Z. Zhang. Adaptive time-frequency decomposition with matching pursuits. In [1992] Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pages 7–10, Oct 1992. doi:10.1109/TFTSA.1992.274245. → page 78

H. Mansour, R. Saab, P. Nasiopoulos, and R. Ward. Color image desaturation using sparse reconstruction. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 778–781, March 2010. doi:10.1109/ICASSP.2010.5494984. → page 119

X. Mei, W. Dong, B.-G. Hu, and S. Lyu. UniHIST: A unified framework for image restoration with marginal histogram constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. → page 151

L. Métivier and R. Brossier. The SEISCOPE optimization toolbox: A large-scale nonlinear optimization library based on reverse communication. GEOPHYSICS, 81(2):F1–F15, 2016. doi:10.1190/geo2015-0031.1. URL https://doi.org/10.1190/geo2015-0031.1. → page 37

Y. Meyer. Oscillating patterns in image processing and nonlinear evolution equations: the fifteenth Dean Jacqueline B. Lewis memorial lectures, volume 22. American Mathematical Soc., 2001. → page 128

J. Mueller and S. Siltanen. Linear and Nonlinear Inverse Problems with Practical Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2012. doi:10.1137/1.9781611972344. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611972344. → page 35

P. Netrapalli, N. U N, S. Sanghavi, A. Anandkumar, and P. Jain. Non-convex robust PCA. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1107–1115. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5430-non-convex-robust-pca.pdf. → page 144

R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan. A general analysis of the convergence of ADMM. In Int. Conf. Mach. Learn., volume 37, pages 343–352, 2015. → pages 92, 193

J. Nocedal and S. J. Wright. Numerical optimization. Springer, 2000. → pages 41, 56, 87, 133, 192

D. O'Connor and L. Vandenberghe. Total variation image deblurring with space-varying kernel. Computational Optimization and Applications, 67(3):521–541, Jul 2017. ISSN 1573-2894. doi:10.1007/s10589-017-9901-1. URL https://doi.org/10.1007/s10589-017-9901-1. → page 83

B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, Jun 2016. ISSN 1573-2878.
doi:10.1007/s10957-016-0892-3. URL https://doi.org/10.1007/s10957-016-0892-3. → page 81

F. Oghenekohwo, R. Kumar, E. Esser, and F. J. Herrmann. Using common information in compressive time-lapse full-waveform inversion. In 77th EAGE Conference and Exhibition 2015, 2015. → page 3

D. W. Oldenburg, P. R. McGillivray, and R. G. Ellis. Generalized subspace methods for large-scale inverse problems. Geophysical Journal International, 114(1):12, 1993. doi:10.1111/j.1365-246X.1993.tb01462.x. URL http://dx.doi.org/10.1111/j.1365-246X.1993.tb01462.x. → page 42

S. Ono, T. Miyata, and I. Yamada. Cartoon-texture image decomposition using blockwise low-rank texture characterization. IEEE Transactions on Image Processing, 23(3):1128–1142, March 2014. ISSN 1057-7149. doi:10.1109/TIP.2014.2299067. → pages 126, 127

S. Osher, A. Solé, and L. Vese. Image decomposition and restoration using total variation minimization and the H^{-1} norm. Multiscale Modeling and Simulation, 1(3):349–370, 2003. doi:10.1137/S1540345902416247. URL http://dx.doi.org/10.1137/S1540345902416247. → pages 126, 127

C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8(1):43–71, Mar. 1982. ISSN 0098-3500. doi:10.1145/355984.355989. URL http://doi.acm.org/10.1145/355984.355989. → pages 90, 134, 194

S. K. Pakazad, M. S. Andersen, and A. Hansson. Distributed solutions for loosely coupled feasibility problems using proximal splitting methods. Optimization Methods and Software, 30(1):128–161, 2015. doi:10.1080/10556788.2014.902056. URL https://doi.org/10.1080/10556788.2014.902056. → page 86

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014. ISSN 2167-3888. doi:10.1561/2400000003. URL http://dx.doi.org/10.1561/2400000003. → pages 37, 44, 85, 131

B. Peters and F. J. Herrmann. Constraints versus penalties for edge-preserving full-waveform inversion. The Leading Edge, 36(1):94–100, 2017. doi:10.1190/tle36010094.1. URL http://dx.doi.org/10.1190/tle36010094.1.

B. Peters, F. Herrmann, and T. V. Leeuwen. Parallel reformulation of the sequential adjoint-state method, pages 1411–1415. 2016. doi:10.1190/segam2016-13966771.1. URL https://library.seg.org/doi/abs/10.1190/segam2016-13966771.1. → page 3

B. Peters, B. R. Smithyman, and F. J. Herrmann. Projection methods and applications for seismic nonlinear inverse problems with multiple constraints. GEOPHYSICS, 0(ja):1–100, 2018. doi:10.1190/geo2018-0192.1. URL https://doi.org/10.1190/geo2018-0192.1.

J. Petersson and O. Sigmund. Slope constrained topology optimization. International Journal for Numerical Methods in Engineering, 41(8):1417–1434, 1998. ISSN 1097-0207. doi:10.1002/(SICI)1097-0207(19980430)41:8<1417::AID-NME344>3.0.CO;2-N. URL http://dx.doi.org/10.1002/(SICI)1097-0207(19980430)41:8<1417::AID-NME344>3.0.CO;2-N. → page 196

G. Peyré. Sparse modeling of textures. Journal of Mathematical Imaging and Vision, 34(1):17–31, May 2009. ISSN 1573-7683. doi:10.1007/s10851-008-0120-3. URL https://doi.org/10.1007/s10851-008-0120-3. → page 151

M. Q. Pham, C. Chaux, L. Duval, and J. C. Pesquet. A constrained-based optimization approach for seismic data recovery problems. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2377–2381, May 2014a. doi:10.1109/ICASSP.2014.6854025. → pages 125, 128

M. Q. Pham, L. Duval, C. Chaux, and J. C. Pesquet. A primal-dual proximal algorithm for sparse template-based adaptive filtering: Application to seismic multiple removal.
IEEE Transactions on Signal Processing, 62(16):4256–4269, Aug 2014b. ISSN 1053-587X. doi:10.1109/TSP.2014.2331614. → pages 125, 128

R.-E. Plessix. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495–503, 2006. ISSN 1365-246X. doi:10.1111/j.1365-246X.2006.02978.x. URL http://dx.doi.org/10.1111/j.1365-246X.2006.02978.x. → page 27

J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, Oct 2000. ISSN 1573-1405. doi:10.1023/A:1026553619983. URL https://doi.org/10.1023/A:1026553619983. → page 151

G. Pratt, C. Shin, and G. Hicks. Gauss-Newton and full Newton methods in frequency-space seismic waveform inversion. Geophysical Journal International, 133(2):341–362, May 1998. ISSN 0956540X. doi:10.1046/j.1365-246X.1998.00498.x. URL http://doi.wiley.com/10.1046/j.1365-246X.1998.00498.x. → pages 1, 76, 139

R. G. Pratt. Seismic waveform inversion in the frequency domain, part 1: Theory and verification in a physical scale model. GEOPHYSICS, 64(3):888–901, 1999. doi:10.1190/1.1444597. URL https://doi.org/10.1190/1.1444597. → pages 1, 62

L. Qiu, N. Chemingui, Z. Zou, and A. Valenciano. Full-waveform inversion with steerable variation regularization. SEG Technical Program Expanded Abstracts 2016, pages 1174–1178, 2016. doi:10.1190/segam2016-13872436.1. URL http://library.seg.org/doi/abs/10.1190/segam2016-13872436.1. → pages 6, 41

M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321–326, 1993. doi:10.1093/imanum/13.3.321. URL http://imajna.oxfordjournals.org/content/13/3/321.abstract. → page 56

L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259–268, 1992. URL http://www.sciencedirect.com/science/article/pii/016727899290242F. → page 19

L. Ruthotto, E. Treister, and E. Haber. jInv – a flexible Julia package for PDE parameter estimation. SIAM Journal on Scientific Computing, 39(5):S702–S722, 2017. doi:10.1137/16M1081063. URL https://doi.org/10.1137/16M1081063. → page 78

E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43, 2016. → page 193

Y. Saad. Krylov subspace methods on supercomputers. SIAM Journal on Scientific and Statistical Computing, 10(6):1200–1232, 1989. doi:10.1137/0910073. URL https://doi.org/10.1137/0910073. → page 92

J. A. Scales and R. Snieder. To Bayes or not to Bayes? GEOPHYSICS, 62(4):1045–1046, 1997. doi:10.1190/1.6241045.1. URL http://dx.doi.org/10.1190/1.6241045.1. → pages 36, 151

H. Schaeffer and S. Osher. A low patch-rank interpretation of texture. SIAM Journal on Imaging Sciences, 6(1):226–262, 2013. doi:10.1137/110854989. URL http://dx.doi.org/10.1137/110854989. → pages 126, 127

M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 709–716, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/schmidt10a.html. → page 58

M. Schmidt, E. Van Den Berg, M. P. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Proc. of Conf.
on Artificial Intelligence and Statistics, 2009. → pages 73, 76, 127

M. Schmidt, D. Kim, and S. Sra. Projected Newton-type Methods in Machine Learning, volume 35, chapter 11, pages 305–327. MIT Press, 04 2012. → pages 73, 76, 127

M. Sen and I. Roy. Computation of differential seismograms and iteration adaptive regularization in prestack waveform inversion. GEOPHYSICS, 68(6):2026–2039, 2003. doi:10.1190/1.1635056. URL http://dx.doi.org/10.1190/1.1635056. → page 35

F. J. Serón, F. J. Sanz, M. Kindelán, and J. I. Badal. Finite-element method for elastic wave propagation. Communications in Applied Numerical Methods, 6(5):359–368, 1990. ISSN 1555-2047. doi:10.1002/cnm.1630060505. URL http://dx.doi.org/10.1002/cnm.1630060505. → page 92

P. Shen and W. W. Symes. Automatic velocity analysis via shot profile migration. GEOPHYSICS, 73(5):VE49–VE59, 2008. doi:10.1190/1.2972021. URL http://dx.doi.org/10.1190/1.2972021. → page 42

P. Shen, W. W. Symes, and C. C. Stolk. Differential semblance velocity analysis by wave-equation migration. SEG Technical Program Expanded Abstracts 2003, pages 2132–2135, 2005. doi:10.1190/1.1817759. URL http://library.seg.org/doi/abs/10.1190/1.1817759. → page 42

C. D. Silva and F. J. Herrmann. Optimization on the hierarchical Tucker manifold – applications to tensor completion. Linear Algebra and its Applications, 481:131–173, 2015. ISSN 0024-3795. doi:https://doi.org/10.1016/j.laa.2015.04.015. URL http://www.sciencedirect.com/science/article/pii/S0024379515002530. → page 5

B. Smithyman, B. Peters, and F. Herrmann. Constrained waveform inversion of colocated VSP and surface seismic data. In 77th EAGE Conference and Exhibition 2015, 2015. → pages 7, 25, 36, 38, 41, 44, 79, 125, 128, 129, 137, 151

C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 753–759. AAAI Press, 2016. URL http://dl.acm.org/citation.cfm?id=3015812.3015924. → pages 88, 135

J. L. Starck, M. Elad, and D. L. Donoho. Image decomposition via the combination of sparse representations and a variational approach. IEEE Transactions on Image Processing, 14(10):1570–1582, Oct 2005. ISSN 1057-7149. doi:10.1109/TIP.2005.852206. → pages 126, 127

P. B. Stark. Constraints versus priors. SIAM/ASA Journal on Uncertainty Quantification, 3(1):586–598, 2015. doi:10.1137/130920721. URL http://dx.doi.org/10.1137/130920721. → pages 36, 151

E. Tadmor, S. Nezzar, and L. Vese. A multiscale image representation using hierarchical (BV, L2) decompositions. Multiscale Modeling & Simulation, 2(4):554–579, 2004. doi:10.1137/030600448. URL https://doi.org/10.1137/030600448. → page 128

A. Tarantola. A strategy for nonlinear elastic inversion of seismic reflection data. GEOPHYSICS, 51(10):1893–1903, 1986. doi:10.1190/1.1442046. URL http://library.seg.org/doi/abs/10.1190/1.1442046. → pages 1, 76, 139

A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, 2005. doi:10.1137/1.9780898717921. URL https://epubs.siam.org/doi/abs/10.1137/1.9780898717921. → page 151

R. J. Tibshirani. Dykstra's algorithm, ADMM, and coordinate descent: Connections, insights, and extensions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 517–528. Curran Associates, Inc., 2017. → pages 78, 86, 199

D. Trad. Five dimensional seismic data interpolation, pages 978–982. 2008. doi:10.1190/1.3063801.
URL https://library.seg.org/doi/abs/10.1190/1.3063801. → page 5

H. Trussell and M. Civanlar. The feasible solution in signal restoration. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):201–212, April 1984. ISSN 0096-3518. doi:10.1109/TASSP.1984.1164297. → pages 77, 138

M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in Julia. In 2014 First Workshop for High Performance Technical Computing in Dynamic Languages, pages 18–28, Nov 2014. doi:10.1109/HPTCDL.2014.5. → page 80

E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2009. doi:10.1137/080714488. URL https://doi.org/10.1137/080714488. → pages 78, 115, 117

T. van Leeuwen and F. J. Herrmann. Mitigating local minima in full-waveform inversion by expanding the search space. Geophysical Journal International, 195:661–667, 10 2013. doi:10.1093/gji/ggt258. → page 27

T. van Leeuwen, A. Y. Aravkin, and F. J. Herrmann. Seismic waveform inversion by stochastic optimization. International Journal of Geophysics, 2011, 2011. → page 3

G. Varadhan and D. Manocha. Accurate Minkowski sum approximation of polyhedral models. Graphical Models, 68(4):343–355, 2006. ISSN 1524-0703. doi:https://doi.org/10.1016/j.gmod.2005.11.003. URL http://www.sciencedirect.com/science/article/pii/S1524070306000191. PG2004. → page 128

V. V. Vasin. Relationship of several variational methods for the approximate solution of ill-posed problems. Mathematical notes of the Academy of Sciences of the USSR, 7(3):161–165, Mar 1970. ISSN 1573-8876. doi:10.1007/BF01093105. URL https://doi.org/10.1007/BF01093105. → page 78

S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pages 945–948, Dec 2013. doi:10.1109/GlobalSIP.2013.6737048. → page 123

J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1–WCC26, 2009. URL http://dx.doi.org/10.1190/1.3238367. → pages 1, 76, 139

C. Vogel. Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, 2002a. doi:10.1137/1.9780898717570. URL http://epubs.siam.org/doi/abs/10.1137/1.9780898717570. → page 35

C. Vogel. Computational Methods for Inverse Problems. SIAM, 2002b. → page 21

Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs. In 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–12, Nov 2013. doi:10.1145/2503210.2503219. → page 93

X. Wang and C. Navasca. Adaptive low rank approximation for tensors. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015. → page 144

P. Witte, M. Louboutin, K. Lensink, M. Lange, N. Kukreja, F. Luporini, G. Gorman, and F. J. Herrmann. Full-waveform inversion, part 3: Optimization. The Leading Edge, 37(2):142–145, 2018. doi:10.1190/tle37020142.1. URL https://doi.org/10.1190/tle37020142.1. → page 78

M. Wytock, P.-W. Wang, and J. Zico Kolter. Convex programming with fast proximal and linear operators. ArXiv e-prints, Nov. 2015. → page 81

S. Xiang and H. Zhang. Efficient edge-guided full waveform inversion by Canny edge detection and bilateral filtering algorithms. Geophysical Journal International, 2016. doi:10.1093/gji/ggw314. URL http://gji.oxfordjournals.org/content/early/2016/08/22/gji.ggw314.abstract. → page 21

Z. Xu, S. De, M.
Figueiredo, C. Studer, and T. Goldstein. An empiricalstudy of admm for nonconvex problems. In NIPS workshop onnonconvex optimization, 2016. → pages 92, 131, 135, 193Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with SpectralPenalty Parameter Selection. In A. Singh and J. Zhu, editors,Proceedings of the 20th International Conference on ArtificialIntelligence and Statistics, volume 54 of Proceedings of MachineLearning Research, pages 718–727, Fort Lauderdale, FL, USA, 20–22Apr 2017a. PMLR. URL http://proceedings.mlr.press/v54/xu17a.html. →pages 82, 90, 92, 93, 135, 193Z. Xu, M. A. T. Figueiredo, X. Yuan, C. Studer, and T. Goldstein.Adaptive relaxed admm: Convergence theory and practicalimplementation. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), July 2017b. → pages82, 88, 89, 97, 103, 131, 135Z. Xu, G. Taylor, H. Li, M. A. T. Figueiredo, X. Yuan, and T. Goldstein.Adaptive consensus ADMM for distributed optimization. In D. Precupand Y. W. Teh, editors, Proceedings of the 34th International Conferenceon Machine Learning, volume 70 of Proceedings of Machine LearningResearch, pages 3841–3850, International Convention Centre, Sydney,Australia, 06–11 Aug 2017c. PMLR. URLhttp://proceedings.mlr.press/v70/xu17c.html. → pages 88, 135Z. Xue and H. Zhu. Full waveform inversion with sparsity constraint inseislet domain. SEG Technical Program Expanded Abstracts 2015, pages1382–1387, 2015. doi:10.1190/segam2015-5932019.1. URLhttp://library.seg.org/doi/abs/10.1190/segam2015-5932019.1. → pages 6, 41188Z. Xue, Y. Chen, S. Fomel, and J. Sun. Seismic imaging of incompletedata and simultaneous-source data using least-squares reverse timemigration with shaping regularization. GEOPHYSICS, 81(1):S11–S20,2016. doi:10.1190/geo2014-0524.1. URLhttps://doi.org/10.1190/geo2014-0524.1. → page 3L. Ying, L. Demanet, and E. Candes. 3d discrete curvelet transform. InWavelets XI, volume 5914, page 591413. International Society for Opticsand Photonics, 2005. → page 100P. Yong, W. Liao, J. Huang, and Z. Li. Total variation regularization forseismic waveform inversion using an adaptive primal dual hybridgradient method. Inverse Problems, 34(4):045006, 2018. URLhttp://stacks.iop.org/0266-5611/34/i=4/a=045006. → pages103, 109, 125, 128, 129, 137D. C. Youla and H. Webb. Image restoration by the method of convexprojections: Part 1-theory. IEEE Transactions on Medical Imaging, 1(2):81–94, Oct 1982. ISSN 0278-0062. doi:10.1109/TMI.1982.4307555. →pages 77, 138N. Zeev, O. Savasta, and D. Cores. Non-monotone spectral projectedgradient method applied to full waveform inversion. GeophysicalProspecting, 54(5):525–534, 2006. ISSN 1365-2478.doi:10.1111/j.1365-2478.2006.00554.x. URLhttp://dx.doi.org/10.1111/j.1365-2478.2006.00554.x. → pages 7, 36, 37, 57K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser priorfor image restoration. In 2017 IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 2808–2817, July 2017.doi:10.1109/CVPR.2017.300. → page 123Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer. Novel methods formultilinear data completion and de-noising based on tensor-svd. In TheIEEE Conference on Computer Vision and Pattern Recognition(CVPR), June 2014. → page 144M. S. Zhdanov. Geophysical inverse theory and regularization problems,volume 36. Elsevier, 2002. → page 35L. Zhu, E. Liu, and J. H. McClellan. Sparse-promoting full-waveforminversion based on online orthonormal dictionary learning.189GEOPHYSICS, 82(2):R87–R107, 2017. 
Appendix A

Alternating Direction Method of Multipliers (ADMM) for the projection problem

We show how to use the Alternating Direction Method of Multipliers (ADMM) to solve projection problems. Iterative optimization algorithms are necessary when no closed-form solution is available. The basic idea is to split a 'complicated' problem into several 'simple' pieces. Consider a function that is the sum of two terms, one of which contains a transform-domain operator: $\min_x h(x) + g(Ax)$. We proceed by renaming one of the variables, $Ax \to z$, and adding the constraint $Ax = z$. The new problem is $\min_{x,z} h(x) + g(z) \ \text{s.t.} \ Ax = z$. Both problems have the same solution, but algorithms for the new formulation are typically simpler. This formulation leads to an algorithm that can solve all projection problems discussed in this thesis: different projections only need different inputs and require no algorithmic changes.

As an example, consider the projection problem for $\ell_1$ constraints in a transform domain (e.g., total variation, sparsity in the curvelet domain). The corresponding set is $\mathcal{C} \equiv \{m \mid \|Am\|_1 \leq \sigma\}$ and the associated projection problem is
\[
P_{\mathcal{C}}(m) = \arg\min_x \frac{1}{2}\|x - m\|_2^2 \quad \text{s.t.} \quad \|Ax\|_1 \leq \sigma. \tag{A.1}
\]
ADMM solves problems with the structure $\min_{x,z} h(x) + g(z) \ \text{s.t.} \ Ax + Bz = c$. The projection problem has the same form. To see this, we use the indicator function of a set $\mathcal{C}$,
\[
\iota_{\mathcal{C}}(m) =
\begin{cases}
0 & \text{if } m \in \mathcal{C},\\
+\infty & \text{if } m \notin \mathcal{C}.
\end{cases} \tag{A.2}
\]
The indicator function $\iota_{\ell_1}(Am)$ corresponds to the set $\mathcal{C}$ introduced above. We use the indicator function and variable splitting to rewrite the projection problem as
\[
\begin{aligned}
P_{\mathcal{C}}(m) &= \arg\min_x \frac{1}{2}\|x - m\|_2^2 \quad \text{s.t.} \quad \|Ax\|_1 \leq \sigma\\
&= \arg\min_x \frac{1}{2}\|x - m\|_2^2 + \iota_{\ell_1}(Ax)\\
&= \arg\min_{x,z} \frac{1}{2}\|x - m\|_2^2 + \iota_{\ell_1}(z) \quad \text{s.t.} \quad Ax = z.
\end{aligned} \tag{A.3}
\]
We have $c = 0$ and $B = -I$ for all projection problems in this thesis. The problem in the last line is the sum of two functions acting on different variables, with additional equality constraints; this is exactly what ADMM solves. The following derivation is mainly based on Boyd et al. [2011]. Identify $h(x) = \frac{1}{2}\|x - m\|_2^2$ and $g(z) = \iota_{\mathcal{C}}(z)$. ADMM uses the augmented Lagrangian [Nocedal and Wright, 2000, chapter 17] to include the equality constraints $Ax - z = 0$ as
\[
L_\rho(x, z, v) = h(x) + g(z) + v^*(Ax - z) + \frac{\rho}{2}\|Ax - z\|_2^2. \tag{A.4}
\]
The scalar $\rho$ is a positive penalty parameter and $v$ is the vector of Lagrangian multipliers. The derivation of the ADMM algorithm itself is non-trivial; see, e.g., Ryu and Boyd [2016]. Each ADMM iteration $k$ has three main steps:
\[
\begin{aligned}
x^{k+1} &= \arg\min_x L_\rho(x, z^k, v^k)\\
z^{k+1} &= \arg\min_z L_\rho(x^{k+1}, z, v^k)\\
v^{k+1} &= v^k + \rho(Ax^{k+1} - z^{k+1}).
\end{aligned}
\]
ADMM converges to the solution as long as $\rho$ is positive and eventually settles at a stable value. The choice of $\rho$ does influence the number of iterations required [Nishihara et al., 2015, Xu et al., 2017a, 2016] and the performance on non-convex problems [Xu et al., 2016]. We use an adaptive strategy that adjusts $\rho$ at every iteration; see He et al. [2000]. The derivation proceeds in scaled form with $u = v/\rho$. Reorganizing the equations leads to
\[
\begin{aligned}
x^{k+1} &= \arg\min_x \Big( h(x) + \frac{\rho}{2}\|Ax - z^k + u^k\|_2^2 \Big)\\
z^{k+1} &= \arg\min_z \Big( g(z) + \frac{\rho}{2}\|Ax^{k+1} - z + u^k\|_2^2 \Big)\\
u^{k+1} &= u^k + Ax^{k+1} - z^{k+1}.
\end{aligned}
\]
Inserting the expressions for $h(x)$ and $g(z)$ gives the more explicitly defined iterations
\[
\begin{aligned}
x^{k+1} &= \arg\min_x \Big( \frac{1}{2}\|x - m\|_2^2 + \frac{\rho}{2}\|Ax - z^k + u^k\|_2^2 \Big)\\
z^{k+1} &= \arg\min_z \Big( \iota_{\mathcal{C}}(z) + \frac{\rho}{2}\|Ax^{k+1} - z + u^k\|_2^2 \Big)\\
u^{k+1} &= u^k + Ax^{k+1} - z^{k+1}.
\end{aligned}
\]
If we replace the minimization steps with their respective closed-form solutions, we obtain the pseudo-algorithm
\[
\begin{aligned}
x^{k+1} &= (\rho A^*A + I)^{-1}\big(\rho A^*(z^k - u^k) + m\big)\\
z^{k+1} &= P_{\mathcal{C}}(Ax^{k+1} + u^k)\\
u^{k+1} &= u^k + Ax^{k+1} - z^{k+1}.
\end{aligned}
\]
This shows that the second minimization step of the ADMM algorithm is itself a projection, but a much simpler one. For the transform-domain $\ell_1$ constraint, $z^{k+1} = P_{\mathcal{C}}(Ax^{k+1} + u^k) = \arg\min_z \frac{1}{2}\|z - w\|_2^2 \ \text{s.t.} \ \|z\|_1 \leq \sigma$ with $w = Ax^{k+1} + u^k$; unlike the original projection problem (equation A.1), the transform-domain operator no longer multiplies the optimization variable.
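The pseudo-algorithm above translates almost line by line into code. The following is a minimal NumPy/SciPy sketch under stated assumptions, not the thesis software: the penalty $\rho$ is fixed (the adaptive update of Algorithm 6 below is omitted), the $x$-subproblem is solved with conjugate gradients on the normal equations, and $P_{\mathcal{C}}$ is a standard sort-based projection onto the $\ell_1$ ball. The names `admm_project` and `project_l1_ball` are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def project_l1_ball(v, sigma):
    """Euclidean projection of v onto {z : ||z||_1 <= sigma} (sort-based)."""
    if np.abs(v).sum() <= sigma:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - sigma)[0][-1]
    tau = (css[k] - sigma) / (k + 1.0)    # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_project(m, A, proj_C, rho=1.0, maxit=500, tol=1e-6):
    """Sketch of the pseudo-algorithm: project m onto {x : Ax in C}."""
    x = m.astype(float).copy()
    z = np.zeros(A.shape[0])
    u = np.zeros(A.shape[0])
    # Normal-equations operator rho*A^T A + I, applied matrix-free.
    N = LinearOperator((m.size, m.size),
                       matvec=lambda w: rho * (A.T @ (A @ w)) + w,
                       dtype=float)
    for _ in range(maxit):
        rhs = rho * (A.T @ (z - u)) + m
        x, _ = cg(N, rhs, x0=x)       # warm start with the previous x
        Ax = A @ x
        z = proj_C(Ax + u)            # simple projection, no operator inside
        u = u + Ax - z
        # Stop on the primal residual only; a full implementation would
        # also monitor the dual residual, as Algorithm 6 does.
        if np.linalg.norm(Ax - z) <= tol * max(np.linalg.norm(Ax), 1.0):
            break
    return x
```

For $\mathcal{C} = \{m \mid \|Am\|_1 \leq \sigma\}$ one would call `admm_project(m, A, lambda w: project_l1_ball(w, sigma))`, with `A` a NumPy array or SciPy sparse matrix.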
The $x$-minimization step is equivalent to the least-squares problem
\[
x^{k+1} = \arg\min_x \left\| \begin{pmatrix} \sqrt{\rho}\,A \\ I \end{pmatrix} x - \begin{pmatrix} \sqrt{\rho}\,(z^k - u^k) \\ m \end{pmatrix} \right\|_2^2. \tag{A.5}
\]
We can solve this problem with direct methods (QR factorization) or iterative methods (LSQR [Paige and Saunders, 1982] on the least-squares problem, or conjugate gradients on the normal equations). Because we adjust the penalty parameter $\rho$ at every ADMM cycle, we recommend iterative algorithms in this situation: they avoid recomputing a QR factorization at every ADMM iteration, and they allow the current estimate of $x$ as the initial guess. Moreover, $z$ and $u$ change less and less as the ADMM iterations progress, so the previous $x$ becomes a better and better initial guess; the number of LSQR iterations therefore typically decreases as the number of ADMM iterations increases. Algorithm 6 shows the ADMM algorithm to compute projections, including automatic adaptive penalty-parameter adjustment. For the numerical experiments in this thesis, we use $\mu = 10$ and $A_u = 2$, as suggested by Boyd et al. [2011].

If we have a different constraint set but the same transform-domain operator, we only change the projector that we pass to ADMM. If the constraint set is the same but the transform-domain operator is different, we provide a different $A$ to ADMM. The various transform-domain $\ell_1$, cardinality, and bound constraints therefore all use the same ADMM algorithm to compute their projections, just with (partially) different inputs; the sketch after Algorithm 6 illustrates this reuse.

Algorithm 6 ADMM to compute the projection, including automatic (heuristic) penalty-parameter adjustment.

input: model to project $m$; transform-domain operator $A$; norm/bound/cardinality projector $P_{\mathcal{C}}$
initialize: $x^0 = m$, $z^0 = 0$, $u^0 = 0$, $k = 1$; select $A_u > 1$, $\mu > 1$, $\rho > 0$
WHILE not converged
    $x^{k+1} = (\rho A^*A + I)^{-1}(\rho A^*(z^k - u^k) + m)$
    $z^{k+1} = P_{\mathcal{C}}(Ax^{k+1} + u^k)$
    $u^{k+1} = u^k + Ax^{k+1} - z^{k+1}$
    $r = Ax^{k+1} - z^{k+1}$
    $s = \rho A^*(z^{k+1} - z^k)$
    IF $\|r\| > \mu\|s\|$    // increase penalty
        $\rho = \rho A_u$
        $u = u / A_u$
    ELSE IF $\|s\| > \mu\|r\|$    // decrease penalty
        $\rho = \rho / A_u$
        $u = u A_u$
    ELSE
        // do nothing
    END
END
output: $x$
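To make this input-swapping concrete, here is a hypothetical snippet reusing the illustrative `admm_project` and `project_l1_ball` sketches from above; the operator, model, and bound values are invented:

```python
import numpy as np
import scipy.sparse as sp

n = 100
m = 2000.0 + 500.0 * np.random.randn(n)   # made-up model vector

# A 1D first-difference matrix as the transform-domain operator A.
A = sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))

# Same A, different set: only the projector passed to ADMM changes.
x_tv = admm_project(m, A, lambda w: project_l1_ball(w, sigma=1e3))  # {x : ||Ax||_1 <= sigma}
x_sl = admm_project(m, A, lambda w: np.clip(w, -50.0, 50.0))        # {x : -50 <= (Ax)_j <= 50}

# Same type of set, different operator: only A changes
# (the identity gives plain bound constraints on x itself).
x_bd = admm_project(m, sp.eye(n).tocsr(), lambda w: np.clip(w, 1500.0, 4500.0))
```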
Appendix B

Transform-domain bounds / slope constraints

Our main interest in transform-domain bound constraints originates from the special case of slope constraints; see, e.g., Petersson and Sigmund [1998] and Bauschke and Koch [2015] for examples from computational design. Lelièvre and Oldenburg [2009] propose a transform-domain bound constraint in a geophysical context, but use interior-point algorithms for its implementation. In our context, slope means the model-parameter variation per distance unit along a predefined path through the model. For example, the slopes of the 2D model parameters in the vertical direction ($z$-direction) form the constraint set
\[
\mathcal{C} \equiv \{ m \mid b_j^l \leq \big((D_z \otimes I_x)\, m\big)_j \leq b_j^u \}, \tag{B.1}
\]
with Kronecker product $\otimes$, identity matrix $I_x$ with dimension equal to the number of grid points in the $x$-direction, and $D_z$ a 1D finite-difference matrix acting in the $z$-direction; $b_j^l$ is element $j$ of the lower-bound vector. An appealing property of this constraint is its pointwise physical meaning. If the model parameters are acoustic velocities in meters per second and the grid is also in units of meters, the constraint defines the maximum velocity increment/decrement per meter in a given direction. This type of direct physical meaning is not available for $\ell_1$, rank, or nuclear-norm constraints; those constraints assign a single scalar value to a property of the entire model.

There are different modes of operation of the slope constraint:

Approximate monotonicity. The acoustic velocity generally increases with depth inside the Earth, i.e., the parameter values increase (approximately) monotonically with depth (positivity of the vertical discrete gradient). The set $\mathcal{C} \equiv \{ m \mid -\varepsilon \leq ((D_z \otimes I_x) m)_j \leq +\infty \}$ describes this situation, where $\varepsilon > 0$ is a small number; exact monotonicity corresponds to $\varepsilon = 0$. In words: we allow the model-parameter values to increase arbitrarily fast with increasing depth, but enforce at most a slow decrease of parameter values in the depth direction.

Smoothness. We obtain a type of smoothness by setting both bounds to small numbers: $\mathcal{C} \equiv \{ m \mid -\varepsilon_1 \leq ((D_z \otimes I_x) m)_j \leq +\varepsilon_2 \}$, where $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$ are small numbers. This type of smoothness results in a different projection problem than smoothness obtained from norm- or subspace-based constraints. Another difference is that the slope constraint is inherently locally defined.

The slope constraint may be defined along any path, using any discrete derivative matrix; higher-order derivatives lead to bounds on different properties. Approximate monotonicity of parameter values can also be obtained with other constraints: Esser et al. [2016b] use the norm-based hinge-loss constraint. However, we prefer to work with linear inequalities because norm-based constraints are not defined pointwise and do not have the direct physical interpretation described above. Figure B.1 shows what happens when we project a velocity model onto the different slope-constraint sets; a code sketch connecting these sets to the ADMM inputs follows the figure caption.

[Figure B.1: three velocity panels on a grid of 0 to 5000 m laterally and 0 to 2000 m in depth, with velocities of roughly 1500 to 4500 m/s.] Figure B.1: The effect of different slope constraints when we project a velocity model (a). Panel (b) shows the effect of allowing arbitrary velocity increase with depth, but only slow velocity decrease with depth. Lateral smoothness (c) is obtained by bounding the upper and lower limits on the velocity change per distance interval in the lateral direction.
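As a hedged illustration of how the slope-constraint pieces could be assembled for the ADMM sketch of Appendix A, consider the following snippet; the grid sizes, spacing, bound values, and the vectorization convention are assumptions for illustration only:

```python
import numpy as np
import scipy.sparse as sp

nz, nx, dz = 100, 250, 10.0   # hypothetical grid and vertical spacing in meters

# 1D forward difference in z, scaled to parameter change per meter.
Dz = sp.diags([-np.ones(nz - 1), np.ones(nz - 1)], [0, 1],
              shape=(nz - 1, nz)) / dz

# (Dz kron Ix); assumes the 2D model is vectorized with the x-index fastest.
A = sp.kron(Dz, sp.eye(nx)).tocsr()

# Approximate monotonicity: arbitrary velocity increase with depth is allowed,
# but at most eps m/s of decrease per meter of depth.
eps = 0.5
proj_mono = lambda w: np.clip(w, -eps, np.inf)

# Smoothness variant: small bounds on both sides of the slope.
proj_smooth = lambda w: np.clip(w, -0.2, 0.2)

# Either pair (A, projector) plugs into the ADMM sketch from Appendix A:
# m_mono = admm_project(m_vec, A, proj_mono)
```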
Appendix C

Black-box alternating projection methods

We briefly show that the proposed PARSDMM algorithm (Algorithm 3) is different from, but closely related to, black-box alternating-projection algorithms for the projection onto an intersection of sets. We base this appendix on the alternating direction method of multipliers (ADMM). The ADMM algorithm is closely related to Dykstra's algorithm [Dykstra, 1983, Boyle and Dykstra, 1986] for projection problems, as described by Bauschke and Koch [2015] and Tibshirani [2017], including the conditions that lead to equivalency.

The parallel Dykstra algorithm (Algorithm 7) projects the vector $m \in \mathbb{R}^N$ onto an intersection of $p$ sets using projections onto each set separately, with projectors $P_{V_1}, P_{V_2}, \dots, P_{V_p}$. If the definitions of the sets $V_i$ include non-orthogonal linear operators, these projections are often non-trivial and their computation requires another iterative algorithm.

To show the similarity to, and difference from, PARSDMM, we proceed with a derivation similar to that of Algorithm 3, but arranged so that the final algorithm is black-box, i.e., it uses projections onto the sets $V_i$ and the linear operators are 'hidden'.

Algorithm 7 Parallel Dykstra's algorithm to compute $\arg\min_x \frac{1}{2}\|x - m\|_2^2 \ \text{s.t.} \ x \in \bigcap_{i=1}^p V_i$.

Algorithm Parallel-DYKSTRA($m, P_{V_1}, P_{V_2}, \dots, P_{V_p}$)
input: model to project $m$; projectors onto the sets, $P_{V_1}, P_{V_2}, \dots, P_{V_p}$
// initialize
0a. $x^0 = m$, $k = 1$
0b. $v_i^0 = x^0$ for $i = 1, 2, \dots, p$
0c. select weights $\rho_i$ such that $\sum_{i=1}^p \rho_i = 1$
WHILE stopping conditions not satisfied
    FOR $i = 1, 2, \dots, p$
        1. $y_i^{k+1} = P_{V_i}(v_i^k)$
    END
    2. $x^{k+1} = \sum_{i=1}^p \rho_i y_i^{k+1}$
    FOR $i = 1, 2, \dots, p$
        3. $v_i^{k+1} = x^{k+1} + v_i^k - y_i^{k+1}$
    END
    4. $k \leftarrow k + 1$
END
output: $x$

First we rewrite the projection of $m$ onto the intersection of the sets $V_i$,
\[
\min_x \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p-1} \iota_{V_i}(x), \tag{C.1}
\]
as
\[
\min_x \frac{1}{2}\|x - m\|_2^2 + \sum_{i=1}^{p-1} \iota_{C_i}(A_i x), \tag{C.2}
\]
where we exposed the linear operators $A_i$ by rewriting the indicator functions, $\iota_{V_i}(x) \to \iota_{C_i}(A_i x)$. Now we introduce additional variables and equality constraints to set up a parallel algorithm:
\[
\min_{x, \{y_i\}} \frac{1}{2}\|y_p - m\|_2^2 + \sum_{i=1}^{p-1} \iota_{C_i}(A_i y_i) \quad \text{s.t.} \quad x = y_i \ \forall i. \tag{C.3}
\]
This problem is suitable for solving with ADMM if we recast it as
\[
\min_{x, \tilde{y}} \tilde{f}(\tilde{y}) \quad \text{s.t.} \quad \tilde{D}x = \tilde{y}, \tag{C.4}
\]
with
\[
\tilde{f}(\tilde{y}) \equiv \frac{1}{2}\|y_p - m\|_2^2 + \sum_{i=1}^{p-1} \iota_{C_i}(A_i y_i) \tag{C.5}
\]
and
\[
\tilde{D} \equiv \begin{pmatrix} I \\ \vdots \\ I \end{pmatrix}, \qquad \tilde{y} \equiv \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}. \tag{C.6}
\]
The linear equality constraints enforce that all $y_i$ are copies of $x$ at the solution of problem (C.3). The difference with PARSDMM is that we leave the $A_i$ inside the indicator functions instead of moving them to the linear equality constraints. The corresponding augmented Lagrangian with penalty parameters $\rho_i > 0$ is
\[
L_{\rho_1,\dots,\rho_p}(x, y_1, \dots, y_p, v_1, \dots, v_p) = \sum_{i=1}^p \Big[ \tilde{f}_i(A_i y_i) + v_i^\top (y_i - x) + \frac{\rho_i}{2}\|y_i - x\|_2^2 \Big], \tag{C.7}
\]
where $\tilde{f}_i = \iota_{C_i}$ for $i = 1, \dots, p-1$ and $\tilde{f}_p(A_p y_p) = \frac{1}{2}\|y_p - m\|_2^2$ with $A_p = I$. The ADMM iterations with relaxation parameters $\gamma_i^k$ are then given by
\[
\begin{aligned}
x^{k+1} &= \arg\min_x \sum_{i=1}^p \frac{\rho_i^k}{2}\Big\|y_i^k - x + \frac{v_i^k}{\rho_i^k}\Big\|_2^2 = \frac{\sum_{i=1}^p \big[\rho_i^k y_i^k + v_i^k\big]}{\sum_{i=1}^p \rho_i^k}\\
\bar{x}_i^{k+1} &= \gamma_i^k x^{k+1} + (1 - \gamma_i^k) y_i^k\\
y_i^{k+1} &= \arg\min_{y_i} \Big[ \tilde{f}_i(A_i y_i) + \frac{\rho_i^k}{2}\Big\|y_i - \bar{x}_i^{k+1} + \frac{v_i^k}{\rho_i^k}\Big\|_2^2 \Big] = \operatorname{prox}_{\tilde{f}_i \circ A_i,\, \rho_i^k}\Big(\bar{x}_i^{k+1} - \frac{v_i^k}{\rho_i^k}\Big)\\
v_i^{k+1} &= v_i^k + \rho_i^k (y_i^{k+1} - \bar{x}_i^{k+1}).
\end{aligned} \tag{C.8}
\]
The difference with Algorithm 3 is that the linear operators $A_i$ move from the $x^{k+1}$ computation to the $y_i^{k+1}$ computations. The $x^{k+1}$ computation is now a simple averaging step instead of a linear-system solve, while the $y_i^{k+1}$ steps change from evaluating proximal maps (almost always available in closed form) into evaluating proximal maps that involve linear operators (usually not available in closed form). The proximal maps $\operatorname{prox}_{\tilde{f}_i \circ A_i,\, \rho_i^k}$ for $i = 1, \dots, p-1$ are projections onto the $V_i$; for $i = p$ the map is the proximal map of $\frac{1}{2}\|y_p - m\|_2^2$. We need another iterative algorithm to compute the $y_i^{k+1}$, at relatively high computational cost.
The algorithm as a whole also becomes more complicated, because we need additional stopping criteria for the inner algorithm that computes the $y_i$ updates. The iterations in (C.8) are similar to parallel Dykstra (Algorithm 7) and are, in essence, ADMM applied to a standard consensus-form optimization problem [Boyd et al., 2011, problem 7.1].
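For completeness, a minimal NumPy sketch of Algorithm 7, with equal weights $\rho_i = 1/p$ and each projector supplied as a callable; the example sets in the trailing comments are hypothetical:

```python
import numpy as np

def parallel_dykstra(m, projectors, maxit=200, tol=1e-8):
    """Sketch of Algorithm 7: project m onto an intersection of p sets."""
    p = len(projectors)
    rho = np.full(p, 1.0 / p)            # weights rho_i summing to one
    x = m.astype(float).copy()
    v = [x.copy() for _ in range(p)]     # one auxiliary vector per set
    for _ in range(maxit):
        y = [P(vi) for P, vi in zip(projectors, v)]    # step 1 (parallel)
        x_new = sum(r * yi for r, yi in zip(rho, y))   # step 2: weighted average
        v = [x_new + vi - yi for vi, yi in zip(v, y)]  # step 3: Dykstra correction
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0):
            x = x_new
            break
        x = x_new
    return x

# Hypothetical usage: intersection of box constraints and a Euclidean ball.
# proj_box  = lambda w: np.clip(w, 0.0, 1.0)
# proj_ball = lambda w: w * min(1.0, 2.0 / max(np.linalg.norm(w), 1e-30))
# x = parallel_dykstra(m, [proj_box, proj_ball])
```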
